One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting is normally distributed; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics-related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)

So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)

Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-paste fashion, which is how I approach most of my coding!

Bare naked learning, in other words.

So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)

That’s the setting then… Now: #f1stats. What’s that all about?

Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…

- First up, I’ve been reading Nate Silver’s The Signal and the Noise, which mentions the aging stats and aging models for baseball players. I found a paper on The Age Productivity Gradient: Evidence from a sample of F1 Drivers, which hasn’t got too many scary equations in, so I may try to replicate that and then bring the models up to date (the paper is dated 2009). It would have been so nice if the authors had published code equivalents in R that I could have played with directly, but I haven’t been able to find it if they did. I also found a paper on Estimated Age Effects in Baseball, again with equations but no code, but it may provide additional clues. From a quick skim, I think there may be some mileage in trying to get my head round different ways of comparing *rankings*.
- A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing is perhaps an easier thing to try to copy for starters, though;
- The article The wisdom of ignorant crowds: predicting sport outcomes by mere recognition explores a simple tournament winner predicting strategy based on how recognisable the names of competitors are. (I guess social media metrics might be a proxy for recognition? Hmm.. could test that I suppose with reference to this paper?) One thing that caught my eye was a couple of simple schemes for benchmarking different prediction models, which might be something I could pull on if I end up exploring prediction models?
- NASCAR results have featured in several papers (I think there’s also a NASCAR dataset available in R?) so I’ll probably try dipping in to them at some point to see if I can do similar things with F1 data. For example, an analysis of NASCAR Winston Cup Race Results for 1975-2003; a couple of papers on hierarchical modelling of auto-racing results; and some Bayesian inference stuff that I guess is really beyond me for now and that I really really could do with pre-built R libraries for;
- an MSc thesis I’ve referred to before on Prediction of Formula One Race Results Using Driver Characteristics has some handy ideas that I might be able to draw on if I have a look at laptime data;
- One of the things I’ve been pondering is ways of ranking drivers based on fast lap times (eg during qualifying, vs. during the race). Although not about motor sport, or any sort of racing, A New Method for Ranking Total Driving Performance on the PGA Tour has a metric I may be able to bastardise in a Formula One context. The same periodical also has an article on Do Reliable Predictors Exist for the Outcomes of NASCAR Races?, the techniques of which might be applicable to F1? Predicting The Outcome Of NASCAR Races: The Role Of Driver Experience looks to be in a similar vein too…
- A paper on Outcome Uncertainty in NASCAR looks at how attendance and TV audience figures are influenced by race expectations, which might be something that could also be explored in context of UK F1 TV audience figures. That said, the notion of “outcome uncertainty” itself, and related measures, might also be worth exploring in their own right?

If you know of any other relevant looking papers or articles, please post a link in the comments.

[MORE LINKS…

– Who is the Best Formula 1 Driver? An Econometric Analysis

]

I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…

PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!

One of the problems with using final placings (i.e., ranks) as outcomes is that these are (a) definitely not normally distributed, and (b) not independent of one another: if Button finishes second, no one else can. You might instead want to look at models for survival data. Survival analysis is often used to model time to mortality as a function of various risk factors. Here, the outcome is time to completion, higher “mortality” is a good thing, and risk factors are actually factors associated with racing quicker.

The benefit of using this kind of analysis is that survival analyses are quite good at dealing with censored observations, i.e., cases where we don’t know how quickly a driver would have finished a race, because they had a DNF.

Googling “survival analysis tutorial R” throws up a number of useful-looking tutorials. Predicting from survival models is a bit ropier, but should be possible out of the box.
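To make the censoring idea a bit more concrete, here is a minimal sketch of a Kaplan-Meier estimate where a finish counts as the "event" and a DNF is a censored observation. It is written in plain Python only so that it runs without any packages; in R, the `survival` package's `Surv()` and `survfit()` would be the usual route. The function name `kaplan_meier` and the race times are made up for illustration, and I'm assuming a DNF is recorded at the time the driver retired.

```python
from collections import Counter


def kaplan_meier(times, finished):
    """Kaplan-Meier curve for 'time to finishing the race'.

    times    -- race time in seconds, or time of retirement for a DNF
    finished -- True if the driver completed the race, False if DNF (censored)

    Returns a list of (t, S(t)) pairs, one per distinct finish time, where
    S(t) is the estimated probability of NOT yet having finished by time t.
    """
    finishes = Counter(t for t, f in zip(times, finished) if f)
    leaving = Counter(times)      # finishers and retirements both leave the risk set
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = finishes.get(t, 0)
        if d:                     # only actual finishes move the curve
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]     # censored drivers drop out without a step
    return curve


# Hypothetical data: five drivers, one of whom retired after 3000 s.
times = [5400, 5405, 3000, 5410, 5420]
finished = [True, True, False, True, True]

for t, s in kaplan_meier(times, finished):
    print(t, round(s, 2))
```

The retired driver counts as "at risk" up to 3000 s and then simply leaves the risk set, so the first step down is 1 - 1/4 rather than 1 - 1/5. The DNF is neither thrown away nor treated as a (very fast or very slow) finish.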

Hi Chris – thanks for that suggestion. “Matters of independence” was another of the bugbears I should have added to my opening line! As well as rankings, I have lap time data (for a couple of seasons at least, courtesy of the Ergast API), and sector times for practice and qualifying going back several years (scraped from the Formula One website; I’ve also got pit times, of a sort, from that site, and fastest laps).

One thing I have been mulling over is things like podium finish, top 10/middle 7/bottom 7 in qualifying, clean side/dirty side on grid (for end of first lap position change stats) etc.
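As a rough sketch of how those groupings might be coded up (Python here just to keep it dependency-free; the same thing is a few lines in R): the function names and the results data are hypothetical, the 10/7/7 split assumes a 24 car grid, and treating odd-numbered grid slots as the "clean" side is an assumption that will not hold at every circuit.

```python
from collections import defaultdict
from statistics import mean


def grid_bucket(grid_pos):
    """Top 10 / middle 7 / bottom 7 split, assuming a 24 car grid."""
    if grid_pos <= 10:
        return "top 10"
    if grid_pos <= 17:
        return "middle 7"
    return "bottom 7"


def grid_side(grid_pos, pole_side_is_clean=True):
    """Assumption: odd grid slots line up on the same side as pole."""
    on_pole_side = grid_pos % 2 == 1
    return "clean" if on_pole_side == pole_side_is_clean else "dirty"


def mean_gain_by_side(results):
    """results: iterable of (grid position, position at end of lap 1).

    Positive values mean places gained over the first lap.
    """
    gains = defaultdict(list)
    for grid_pos, lap1_pos in results:
        gains[grid_side(grid_pos)].append(grid_pos - lap1_pos)
    return {side: mean(g) for side, g in gains.items()}


# Made-up (grid, end of lap 1) positions for a handful of drivers.
results = [(1, 1), (2, 3), (3, 2), (4, 6), (11, 10), (18, 18)]
print(mean_gain_by_side(results))
print([grid_bucket(g) for g, _ in results])
```

With real data the `pole_side_is_clean` flag would need setting per circuit, and the bucket boundaries would need adjusting for seasons with a different number of cars.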