My Personal Intro to F1 Race Statistics

One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting normally distributes; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics-related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)

So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)

Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-paste fashion, which is how I approach most of my coding!

Bare naked learning, in other words.

So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)

That’s the setting then… Now: #f1stats. What’s that all about?

Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…

If you know of any other relevant-looking papers or articles, please post a link in the comments.

Who is the Best Formula 1 Driver? An Econometric Analysis

I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…

PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!

Author: Tony Hirst

I'm a lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

2 thoughts on “My Personal Intro to F1 Race Statistics”

  1. One of the problems with using final placings (i.e., ranks) as outcomes is that these are (a) definitely not normally distributed, and (b) not independent of one another: if Button finishes second, no one else can. You might instead want to look at models for survival data. Survival data is often used for time to mortality, modelled as a function of various risk factors. Here, the outcome is time to completion, higher “mortality” is a good thing, and risk factors are actually factors associated with racing quicker.

    The benefit of using this kind of analysis is that survival analyses are quite good at dealing with censored observations — i.e., times where we don’t know how quick a driver would have finished a race, because they had a DNF.

    Googling “survival analysis tutorial R” throws up a number of useful-looking tutorials. Predicting from survival models is a bit ropier, but should be possible out of the box.
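As a rough illustration of the censoring idea in this comment, here is a minimal, hand-rolled Kaplan-Meier sketch. It is written in plain Python rather than R so that it needs no packages, and it is not taken from any of the tutorials mentioned. The function name `kaplan_meier` and the race times are made up for illustration; a DNF is treated as censored at the time the driver retired, which is an assumption about how the data would be recorded.

```python
from collections import Counter


def kaplan_meier(times, finished):
    """Kaplan-Meier estimate of S(t), the share of drivers still "racing" at t.

    times    -- elapsed time in seconds: finishing time, or time of retirement
    finished -- True if the driver completed the race (event observed),
                False for a DNF (censored observation)
    Returns a list of (time, S(time)) pairs, one per distinct finishing time.
    """
    finishes = Counter(t for t, done in zip(times, finished) if done)
    leaving = Counter(times)   # finishers and DNFs both leave the risk set
    at_risk = len(times)       # drivers still on track just before each time
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = finishes.get(t, 0)
        if d:
            # Only observed finishes change the estimate...
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # ...but censored drivers still shrink the risk set afterwards.
        at_risk -= leaving[t]
    return curve


# Hypothetical data: six drivers, two of whom retired early (DNF).
race_times = [5400, 5410, 3000, 5425, 1200, 5440]
completed = [True, True, False, True, False, True]

for t, s in kaplan_meier(race_times, completed):
    print(t, round(s, 3))
```

The two DNFs never produce a step in the curve, but they do reduce the number of drivers counted as at risk, so the first finisher accounts for a quarter of the remaining field rather than a sixth. In practice a survival-analysis package would be the more sensible route; this is only meant to show what "censored" means for the arithmetic.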

    1. Hi Chris – thanks for that suggestion. “Matters of independence” was another of the bugbears I should have added to my opening line! As well as rankings, I have lap time data (for a couple of seasons at least, courtesy of the Ergast API), and sector times for practice and qualifying going back several years (scraped from the Formula One website; I’ve also got pit times, of a sort, from that site, and fastest laps).

      One thing I have been mulling over is things like podium finish, top 10/middle 7/bottom 7 in qualifying, clean side/dirty side on grid (for end of first lap position change stats) etc.
