My Personal Intro to F1 Race Statistics

One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting normally distributes; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)

So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)

Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-paste fashion, which is how I approach most of my coding!

Bare naked learning, in other words.

So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)

That’s the setting then… Now: #f1stats. What’s that all about?

Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…

If you know of any other relevant looking papers or articles, please post a link in the comments.

Who is the Best Formula 1 Driver? An Econometric Analysis

I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…

PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!

The Basics of Betting as a Way of Keeping Score…

Another preparatory step before I start learning about stats in the context of Formula One… There are a couple of things I’m hoping to achieve when I actually start the journey: 1) finding ways of using stats to help to pull out patterns and events that are interesting from a storytelling or news perspective; 2) seeing if I can come up with any models that help forecast or predict race winners or performances over a race weekend.

There are a couple of problems I can foresee (?!) when it comes to the predictions: firstly, unlike horseracing, there aren’t that many F1 races each year to test the predictions against. Secondly, how do I even get a baseline start on the probabilities that driver X or team Y might end up on the podium?

It seems to me as if betting odds provide one publicly available “best guess” at the likelihood of any driver winning a race (a range of other bets are possible, of course, that give best guess predictions for other situations…) Having had a sheltered life, the world of betting is completely alien to me, so here’s what I think I’ve learned so far…

Odds are related to the anticipated likelihood of a particular event occurring and represent the winnings you get back (plus your stake) if a particular event happens. So 2/1 (2 to 1) fractional odds say: if the event happens, you’ll get 2 back for every 1 you placed, plus your stake back. If I bet 1 unit at 2/1 and win, I get 3 back: my original 1 plus 2 more. If I bet 3, I get 9 back: my original 3 plus 2 for every 1 I placed. Since I placed 3 1s, I get back 3 x 2 = 6 in winnings. Plus my original 3, which gives me 9 back on 3 staked, a profit of 6.

Odds are related (loosely) to the likelihood of an event happening. 2/1 odds represent a likelihood (probability) that an event will happen (1/3) = 0.333… of the time (to cut down on confusion between fractional odds and fractional probabilities, I’ll try to remember to put the fractional probabilities in brackets; so 1/2 is fractional odds of 2 to 1 on, and (1/2) is a probability of one half). To see how this works, consider an evens bet, fractional odds of 1/1, such as someone might make for tossing a coin. The probability of getting heads on a single toss is (1/2); the probability of getting tails is also (1/2). If I’m giving an absolutely fair book based on these likelihoods, I’d offer you even odds that you get a head, for example, on a single toss. After all, it’s fifty/fifty (a fifty per cent chance either way) whether heads or tails will land face up. If there are three equally possible outcomes, (1/3) each, then I’d offer 2/1. After all, it’s twice as likely that something other than the single outcome you called would come up. If there are four possible outcomes, I’d offer 3/1, because it’s likely (if we played repeatedly) that three times out of four, you’d be wrong. So three times out of four you’d lose and I’d take your stake. And on the fourth go, when you get it right, I give you your stake back for that round plus three for winning, so overall we’d be back where we started.

Decimal odds are a way of describing the return you get on a unit stake. So for a 2/1 bet, the decimal odds are 3. For a 4/1 bet they’d be 5. For an N/1 bet they’d be 1+N. For a 1/2 (two to one on) bet they’d be 1.5, for a 1/10 bet they’d be 1.1. So for a 1/M bet, 1+1/M. Generally, for an N/M bet, decimal odds are 1+N/M.

Decimal odds give an easy way in to calculating the likelihood of an event. Decimal odds of 3 (that is, fractional odds of 2/1) describe an event that will happen (1/3) of the time in a fair game. That is, (1/(decimal odds)) of the time. For fractional odds of N/M, you expect the event to happen with probability (1/(1+N/M)).
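As a quick check on the conversions above, here is a minimal R sketch. The function names frac_to_decimal and implied_prob are made up for illustration, and the implied probabilities assume a fair book with no bookmaker margin.

```r
#Convert fractional odds N/M to decimal odds (the return on a unit stake)
frac_to_decimal=function(n,m=1) 1+n/m

#Probability implied by decimal odds, assuming a fair book
implied_prob=function(decimal) 1/decimal

print(frac_to_decimal(2,1))   #fractional odds of 2/1
print(frac_to_decimal(1,2))   #fractional odds of 1/2, ie 2 to 1 on
print(implied_prob(frac_to_decimal(3,1)))  #probability implied by 3/1
```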

In a completely fair book (?my phrase), the odds should correspond to probabilities that sum to 1 across all possible events. Bookmakers rig the odds in their favour though, so the summed probabilities on a book will add up to more than 1 – this represents the bookmaker’s margin. If you’re betting on the toss of a coin with a bookie, they may offer you 99/100 for heads, evens for tails. If you play 400 games and bet 200 heads and 200 tails, winning 100 of each, you’ll overall stake 400, win 100 (plus 100 back) on tails along with 99 (plus 100 original stake) on heads. That is, you’ll have staked 400 and got back 399. The bookie will be 1 up overall. The summed probabilities add up to more than 1, since (1/2) + (1/(1+99/100)) = (0.5 + ~0.5025) > 1.
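The coin toss book can be checked with a few lines of R. This is only a sketch of the arithmetic in the paragraph above, with 200 unit bets on each outcome and 100 winners of each; the variable names are mine.

```r
#Decimal prices offered on the coin toss: 99/100 for heads, evens for tails
prices=c(heads=1+99/100, tails=1+1/1)

#Summed implied probabilities; anything over 1 is the bookie's margin
book=sum(1/prices)
margin=book-1

#200 unit bets on each outcome, 100 winners of each
staked=400
returned=100*prices[["heads"]]+100*prices[["tails"]]
print(c(book=book, margin=margin, staked=staked, returned=returned))
```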

One off bets are no basis for a strategy. You need to bet regularly. One way of winning is to follow a value betting strategy where you place bets on outcomes that you predict are more likely than the odds you’re offered. This is counter to how the bookie works. If a bookie offers you fractional odds of 3/1 (expectation that the event will happen (1/4) of the time), and you have evidence that suggests it will happen (1/3) of the time (decimal odds of 3, fractional odds 2/1) then it’s worth your while repeatedly accepting the bet. After all, if you play 12 rounds, you’ll wager 12, and win on 12/3=4 occasions, getting 4 back (3 + your stake) each time, to give you a net return of 4 x 4 – 12 = 16 – 12 = +4. If the event had happened at the bookie’s predicted likelihood of 1/4 of the time, you would have got back ( 12/4 ) * 4 – 12 = +0 overall.
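The 12 round example can also be written as an expected value calculation. A minimal sketch, assuming unit stakes and that the stated probabilities hold exactly in the long run; ev_per_unit is a made-up helper name.

```r
#Expected net return per unit stake: win (price-1) with probability p, lose the stake otherwise
ev_per_unit=function(p,price) p*(price-1)-(1-p)

price=4    #decimal odds on fractional odds of 3/1
rounds=12

ev_mine=rounds*ev_per_unit(1/3,price)    #if my estimate of (1/3) is right
ev_bookie=rounds*ev_per_unit(1/4,price)  #if the bookie's (1/4) is right
print(c(mine=ev_mine, bookie=ev_bookie))
```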

I’ve tried to do an R script to explore this:

#My vocabulary may be a bit confused herein
#Corrections welcome in the comments from both statisticians and gamblers;-)

#The offered odds
price=4 #3/1 -> 3+1 That is, the decimal odds on fractional odds of 3/1
odds=1/price #The probability implied by the offered price

#The odds I've predicted
myodds=1/3 #2/1 -> 1/(2+1)

#The number of repeated trials in the game
trials=10000 #illustrative value (assumed)

#The amount staked
bet=1 #illustrative value (assumed)

#The experiment that we'll run trials number of times
expt=function(trials,odds,myodds,bet){
  #trial sets a uniform random number in range 0..1
  df=data.frame(trial=runif(trials))
  #The win condition happens at my predicted odds, ie if trial value is less than my odds
  #So if my odds are (1/4) = 0.25, a trial value in range 0..0.25 counts as a win
  # (df$trial<myodds) is TRUE if trial < myodds, which is cast by as.integer() to value 1
  # If (df$trial<myodds) is FALSE, as.integer() returns 0
  df$win=as.integer(df$trial<myodds)
  #The winnings are calculated at the offered odds and are net of the stake
  #The df$win/odds = 1/odds = price (the decimal odds) on a win, else 0
  #The actual win is the product of the stake (bet) and the decimal odds
  #The winnings are the return net of the initial amount staked
  #Where there is no win, the winnings are a loss of the value of the bet
  df$winnings=bet*df$win/odds-bet
  df
}

df=expt(trials,odds,myodds,bet)

#The overall net winnings
sum(df$winnings)

#If myodds > odds, then I'm likely to end up winning on a value betting strategy

#A way of running the experiment several times
#There are probably better R protocols for doing this?
runs=10 #illustrative value (assumed)
for (i in 1:runs){
  print(sum(expt(trials,odds,myodds,bet)$winnings))
}

#It would be nice to do some statistical graphics demonstrations of the different distributions of possible outcomes for different regimes. For example:
## different combinations of odds and myodds
## different numbers of trials
## different bet sizes
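As a first step towards the demonstrations wished for above, here is a rough sketch that compares the spread of outcomes for short and long series of bets. It draws the number of wins with rbinom() rather than looping over individual trials; the seed, the run counts and the helper name net_winnings are arbitrary choices for illustration.

```r
set.seed(42)  #arbitrary seed so the sketch is repeatable

price=4      #decimal odds offered (3/1)
myodds=1/3   #my predicted probability of a win
bet=1        #unit stake

#Net winnings from one run of `trials` bets: each win returns bet*price, every bet costs bet
net_winnings=function(trials) bet*(rbinom(1,trials,myodds)*price-trials)

#Repeat the whole game many times for short and long series of bets
short_runs=replicate(1000,net_winnings(10))
long_runs=replicate(1000,net_winnings(1000))

#Share of runs that end in a loss, and the spread of outcomes over the long series
print(mean(short_runs<0))
print(mean(long_runs<0))
print(quantile(long_runs,c(0.05,0.5,0.95)))
#hist(long_runs) would show the shape of the distribution
```

Even with the edge in my favour, a short series of bets can easily end in a loss, which is one reason one-off bets are no basis for a strategy.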

There are apparently also “efficient” ways of working out what stake to place (the “staking strategy”). The value strategy gives you the edge to win, long term; the staking strategy is how you maximise profits. See for example Horse Racing Staking and Betting: Horse racing basics part 2 or more mathematical treatments, such as The Kelly Criterion or Statistical Methodology for Profitable Sports Gambling. See also the notion of “betting rules”, eg A statistical development of fixed odds betting rules in soccer.
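For a flavour of what a staking strategy looks like, the Kelly criterion has a simple closed form for a single bet with two outcomes: stake the fraction f = p - (1-p)/b of your bankroll, where p is your estimated win probability and b is the fractional odds (net winnings per unit staked). The sketch below is illustrative only: the function name kelly_fraction is mine, and it takes the probability estimate at face value.

```r
#Kelly fraction of bankroll to stake on a bet at fractional odds b/1
#with estimated win probability p
kelly_fraction=function(p,b){
  f=p-(1-p)/b
  max(f,0)  #no edge means no bet
}

print(kelly_fraction(1/3,3))  #my estimate (1/3) against offered odds of 3/1
print(kelly_fraction(1/4,3))  #the bookie's own estimate: no edge
```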

There is possibly some mileage to be had in getting to grips with R modeling using staking strategy models as an opening exercise, along with statistical graphical demonstrations of the same, but that is perhaps a little off topic for now…

To recap then, what I think I’ve learned is that we can test predictions against the benchmark of offered odds. The offered odds in themselves give us a ballpark estimate of what the (expert) bookmakers, as influenced by the betting/prediction market, expect the outcome of an event to be. Note that the odds are rigged to give summed probabilities over a range of events happening to be greater than 1, to build in a profit margin (does it have a proper name?) for the bookmaker. If we have a prediction model that appears to offer better odds on an event than the odds that are actually offered, and we believe in our prediction, we can run a value betting strategy on that basis and hopefully come out, over the long term, with a profit. The size of the profit is in part an indicator of how much more accurate our model is as a predictive model than the expert knowledge and prediction market basis that is used to set the bookie’s odds.

PS Re: the bookie’s profit, seems that this is called the overround or vigorish. The paper Forecasting sports tournaments by ratings of (prob)abilities: A comparison for the EURO 2008 makes clear the relationship between the bookies’ cut and the odds:


One thing that immediately springs to mind is to look at what sort of overround applies to different bookmakers around different sorts of F1 bets, and whether this is related to the apparent forecast accuracy of the odds offered, at least in ranking terms? (See the comments for a couple of links to papers on forecast accuracy of sports betting odds.)

PPS FWIW, as and when I come across R libraries to access bookmaker APIs, I’ll add them here:
Betfair R package – access Betfair API; another package (CRAN): Betfairly

F1Stats – A Prequel to Getting Started With Rank Correlations

I finally, finally made time to get started on my statistics learning journey with a look at some of the results in the paper A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing.

Note that these notes represent a description of the things I learned trying to draw on ideas contained within the paper and apply it to data I had available. There may be errors… if you spot any, please let me know via the comments.

The paper uses race classification data from the 2009 season, comparing F1 and NASCAR championships and claiming to explore the extent to which positions in practice and qualifying relate to race classification. I won’t be looking at the NASCAR data, but I did try to replicate the F1 stats which I’ll start to describe later on in this post. I’ll also try to build up a series of interactive apps around the analyses, maybe along with some more traditional print format style reports.

(There are so many things I want to learn about, from the stats themselves, to possible workflows for analysis and reporting, to interactive analysis tools that I’m not sure what order any of it will pan out into, or even the extent to which I should try to write separate posts about the separate learnings…)

As a source of data, I used my f1com megascraper that grabs classification data (along with sector times and fastest laps) since 2006. (The ergast API doesn’t have the practice results, though it does have race and qualifying results going back a long way, so it could be used to do a partial analysis over many more seasons). I downloaded the whole Scraperwiki SQLite database which I could then load into R and play with offline at my leisure.

The first result of note in the paper is a chart that claims to demonstrate the Spearman rank correlation between practice and race results, qualification and race results, and championship points and race results, for each race in the season. The caption to the corresponding NASCAR graphs explains the shaded region: “In general, the values in the grey area are not statistically significant and the values in the white area are statistically significant.” A few practical uses we might put the chart to come to mind (possibly!): is qualifying position or p3 position a good indicator of race classification (that is, is the order folk finish at the end of p3 a good indicator of the rank order in which they’ll finish the race)? If the different rank orders are not correlated (people finish the race in a different order to the grid position), does this say anything about how exciting the race might have been? Does the “statistical significance” of the correlation value add anything meaningful?

F1 2009 correlations

So what is this chart trying to show and what, if anything, might it tell us of interest about the race weekend?

First up, the statistic that’s being plotted is Spearman’s rank correlation coefficient. There are four things to note there:

  1. it’s a coefficient: a coefficient is a single number that tends to come in two flavours (often both at the same time). In a mathematical equation, a coefficient is typically a constant number that is used as multiplier of a variable. So for example, in the equation t = 2 x, the x variable has the coefficient 2. Note that the coefficient may also be a parameter, as for example in the equation y= a.x (where the . means ‘multiply’, and we naturally read x as an independent variable that is used to determine the value of y having been multiplied by the value of a). However, a coefficient may also be a particular number that characterises a particular relationship between two things. In this case, it characterises the degree of correlation between two things…
  2. The refrain “correlation is not causation” is one heard widely around the web that mocks the fact that just because two things may be correlated – that is, when one thing changes, another changes in a similar way – it doesn’t necessarily mean that the way one thing changed caused the other to change in a similar way as a result. (Of course, it might mean that…;-). (If you want to find purely incidental correlations between things, have a look at Google correlate, that identifies different search terms whose search trends over time are similar. You can even draw your own trend over time to find terms that have trended in a similar way.)

    Correlation, then, describes the extent to which two things tend to change over time in a similar way: when one goes up, the other goes up; when one goes down, the other goes down. (If they behave in opposite ways – if one goes up steeply the other goes down steeply; if one goes up gently, the other goes down gently – then they are negatively or anti-correlated).
    Correlation measures require that you have paired data. You know you have paired data if you can plot your data as a two dimensional scatterplot and label each point with the name of the person or thing that was measured to get the two co-ordinate values. So for example, on a plot of qualifying position versus race classification, I can label each point with the name of the driver. The qualification position and race classification is paired data around the set of drivers.

  3. A rank correlation coefficient is used to describe the extent to which two rankings are correlated. Ranked orders do away with the extent to which different elements in a series differ from each other and just concentrate on their rank order. That is, we don’t care how much change there is in each of the data series, we’re just interested in rank positions within each series. In the case of an F1 race, the distribution of laptimes during qualifying may see the first few cars separated by a few thousandths of a second, but the time between the best laptimes of consecutively placed cars at the back of the grid might be of the order of tenths of a second. (Distributions take into account, for example, the variety and range of values in a dataset.) However, in a rank ordered chart, all we are interested in is the integer position: first, second, third, …, nineteenth, twentieth. There is no information about the magnitude of the actual time difference between the laptimes, that is, how close to or far apart from each other the laptimes of consecutively placed cars were; we just know the rank order of fastest to slowest cars. The distribution of the rank values is not really very interesting, or subject to change, at all.
    One thing that I learned that’s possibly handy to know when decoding the jargon: rank based stats are also often referred to as non-parametric statistics because no assumptions are made about how the numbers are distributed (presumably, there are no parameters of note relating to how the values are distributed, such as the mean and standard deviation of a “normal” distribution). If we think about the evolution of laptimes in a race, then most of them will be within a few tenths of the fastest lap each lap, with perhaps two bunches of slower lap times (in-lap and out-lap around a pitstop). The distribution of these lap times may be interesting (for example, the distribution of laptimes on a lap when everyone is racing will be different to the distribution of lap times on a lap when several cars are pitting). On the other hand, for each lap, the distribution of the rank order of laptimes during that lap will always be the same (first fastest, second fastest, third fastest, etc.). That is not to say, however, that the rank order of the drivers’ lap times does not change lap on lap, which of course it might do (Webber might be tenth fastest on lap 1, fastest on lap 2, eighth fastest on lap three, and so on).
    Of course, this being stats, “non-parametric” probably means lots of other things as well, but for now my rule of thumb will be: the distribution doesn’t matter (that is, the statistic does not make any assumptions about the distribution of the data in order for the statistic to work; which is to say, that’s one thing you don’t have to check in order to use the statistic (erm, I think…?!)).
  4. The statistic chosen was Spearman’s rank correlation coefficient. Three differently calculated correlation coefficients appear to be widely used (and also appear as possible methods in the R cor() function that calculates correlations between lists of numbers): i) Pearson’s product moment correlation coefficient (how well does a straight line through an x-y scatterplot of the data describe the relationship between the x and y values, and what’s the sign of its gradient); ii) Spearman’s rank correlation coefficient (also known as Spearman’s rho or rs); [this interactive is rather nice and shows how Pearson and Spearman correlations can differ]; iii) Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series). Other flavours of correlation coefficient are also available (for example, Lin’s concordance correlation coefficient, which I think is used to test for how close to a 45 degree line the x-y association between paired data points is, as demonstrated in this example of identifying a signature collapse in a political party’s electoral vote when the Pearson coefficient suggested it had held up…).
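To make the three coefficients a little more concrete, here is a small sketch using R’s cor() function on made-up grid and finishing positions for ten drivers (the numbers are invented, not taken from any race). Because both series are already untied ranks, the Pearson value computed on them comes out the same as Spearman’s rho, and rho can also be checked by hand.

```r
#Hypothetical grid positions and finishing positions for ten drivers (made up)
grid=1:10
race=c(2,1,3,5,4,8,6,7,10,9)

rho=cor(grid,race,method="spearman")
tau=cor(grid,race,method="kendall")
r=cor(grid,race,method="pearson")

#Spearman's rho by hand for untied ranks: 1 - 6*sum(d^2)/(n*(n^2-1))
n=length(grid)
rho_by_hand=1-6*sum((grid-race)^2)/(n*(n^2-1))
print(c(spearman=rho,kendall=tau,pearson=r,by_hand=rho_by_hand))
```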

The statistical significance test is based around the “null hypothesis” that the two sets of results are not correlated; the result is significant if they are more correlated than you might expect if both are randomly ordered. This made me a little twitchy: wouldn’t it be equally valid to argue that F1 is a procession and we would expect the race position and grid position to be perfectly correlated, for example, and then define our test for significance on the extent to which they are not?

This comparison of Pearson’s product moment and Spearman’s rank correlation coefficients helped me get a slightly clearer view of the notion of “test” and how both these coefficients act as tests for particular sorts of relationship. The Pearson product moment coefficient has a high value if a strong linear relationship holds across the data pairs. The Spearman rank correlation is weaker, in that it is simply looking to see whether or not the relationship is monotonic (that is, things all go up together, or they all go down together, but the extent to which they do so need not define a linear relationship, which is an assumption of the Pearson test). In addition, when defining the statistical significance of the test, this is dependent on particular assumptions about the distribution of the data values, at least in the case of the Pearson test. The statistical significance relates to how likely the correlation value was assuming a normal distribution in the values within the paired data series (that is, each series is assumed to represent a normally distributed set of values).
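The monotonic versus linear distinction is easy to see with a toy example. In this sketch (invented data, nothing to do with F1) y always increases with x but not along a straight line, so the two coefficients disagree.

```r
x=1:10
y=exp(x)  #monotonic in x, but far from a straight line

pearson=cor(x,y,method="pearson")
spearman=cor(x,y,method="spearman")
print(c(pearson=pearson,spearman=spearman))
```

Spearman’s rho is exactly 1 here because the rank orders agree perfectly, while the Pearson value is noticeably lower.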

If I understand this right, it means we separate the two things out: on the one hand, we have the statistic (the correlation coefficient); on the other, we have the significance test, which tells you how likely that result is given a set of assumptions about how the data is distributed. A question that then comes to mind is this: is the definition of the statistic dependent on a particular distribution of the data in order for the statistic to have something interesting to say, or is it just the significance that relies on that distribution? To twist this slightly, if we can’t do a significance test, is the statistic then essentially meaningless (because we don’t know whether those values are likely to occur whatever the relationships (or even no relationship) between the data sets)? Hmm.. maybe a statistic is actually a measurement in the context of its significance given some sort of assumption about how likely it is to occur by chance?

As far as Spearman’s rank correlation coefficient goes, I was a little bit confused by the greyed “not significant” boundary on the diagram shown above. The claim is that any correlation value in that grey area could be accounted for in many cases by random sorting. Take two sets of driver numbers, sort them both randomly, and much of the time the correlation value will fall in that region. (I suspect this is not only arbitrary but misleading? If you have random orderings, is the likelihood that the correlation is in the range -0.1 to 0.1 the same as the likelihood that it will be in the range 0.2 to 0.4? Is the probability distribution of correlations “uniform” across the +/- 1 range?) Also, my hazy vague memory is that the population size affects the confidence interval (see also Explorations in statistics: confidence intervals) – isn’t this the principle on which funnel plots are built? The caption to the figure suggests that the population size (the “degrees of freedom”) was different for different races (different numbers of drivers). So why isn’t the shaded interval differently sized for those races?
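Those questions can be explored by simulation. The sketch below shuffles a finishing order at random many times and looks at the resulting Spearman values; the seed, the number of repetitions and the helper name null_rho are arbitrary choices of mine, and the field sizes of 10 and 20 are just for comparison.

```r
set.seed(1)  #arbitrary seed so the sketch is repeatable

#Spearman's rho between a fixed order 1..n and a random shuffle, repeated many times
null_rho=function(n,reps=5000){
  replicate(reps,cor(1:n,sample(n),method="spearman"))
}

rho20=null_rho(20)
rho10=null_rho(10)

#Is a value near zero as likely as one between 0.2 and 0.4?
print(mean(abs(rho20)<0.1))
print(mean(rho20>0.2 & rho20<0.4))

#The middle 95% of chance values for each field size
print(quantile(rho20,c(0.025,0.975)))
print(quantile(rho10,c(0.025,0.975)))
```

Under random ordering the values bunch around zero rather than spreading uniformly across the +/- 1 range, and the band of values attributable to chance is wider for the smaller field.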

Something else confused me about the interval values used to denote the significance of the Spearman rho values – where do they come from? A little digging suggested that they come from a table (i.e. someone worked them out numerically, presumably by generating and looking at the distribution of different random rank orderings, rather than algorithmically – I couldn’t find a formula to calculate them? I did find this on Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations by D Bonett (Psychometrika Vol. 65, No. 1, 23-28, March 2000) though). A little more digging suggested Significance Testing of the Spearman Rank Correlation Coefficient by J Zar (Journal of the American Statistical Association, Vol. 67, No. 339 (Sep., 1972), pp. 578-580) as a key work on this, with some later qualification in Testing the Significance of Kendall’s τ and Spearman’s rs by M. Nijsse (Psychological Bulletin, 1988, Vol. 103, No. 2, 235-237).
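One commonly quoted large-sample approximation treats t = rho*sqrt((n-2)/(1-rho^2)) as a t statistic on n-2 degrees of freedom, which can be rearranged to give an approximate critical value of rho for a given number of drivers. The sketch below uses that approximation (so values may differ a little from published small-sample tables, and I am not claiming it is how the paper’s boundary was drawn); rho_crit is a made-up name and the rankings are invented. R’s cor.test() will also report a p-value for a particular pair of rankings.

```r
#Approximate two-sided critical value of Spearman's rho for n paired ranks,
#from the t approximation t = rho*sqrt((n-2)/(1-rho^2)) on n-2 degrees of freedom
rho_crit=function(n,alpha=0.05){
  tc=qt(1-alpha/2,df=n-2)
  tc/sqrt(n-2+tc^2)
}
print(sapply(c(10,15,20,24),rho_crit))

#cor.test() reports the coefficient and a p-value for a particular pair of rankings
grid=1:10
race=c(2,1,3,5,4,8,6,7,10,9)  #made-up finishing order
res=cor.test(grid,race,method="spearman")
print(res$estimate)
print(res$p.value)
```

The critical value shrinks as the number of drivers grows, which is why a single fixed shaded band looks odd if the number of classified drivers varies from race to race.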

Hmmm.. this post was supposed to be about running some of the stats used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing on some more recent data. But I’m well over a couple of thousand words into this post and still haven’t started that bit… So maybe I’ll finish now, and hold the actual number crunching over to the next post…

PS I find myself: happier that I (think I) understand a little bit more about the rationale of significance tests; just as sceptical as ever I was about the actual practice;-)

Getting Started with F1 Betting Data

As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.

Courtesy of @flutterF1, I managed to get a peek at some betting data from one of the race weekends last year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.

(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself for personal research purposes? eg Betfair have an API, and there’s at least one R library to access it: betfairly.)

The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)
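One possible shape for such a loader is sketched below. It is a minimal version, assuming a default CRAN mirror is configured for install.packages(); the function name safe_load is made up, and there is no error handling for packages that fail to install.

```r
#Load each package, installing it first if it is not already available
safe_load=function(pkgs){
  for (p in pkgs){
    if (!requireNamespace(p,quietly=TRUE)) install.packages(p)
    library(p,character.only=TRUE)
  }
  invisible(pkgs)
}

#Harmless demonstration with packages that ship with R
safe_load(c("stats","utils"))
#For this post it would be something like: safe_load(c("gdata","ggplot2","directlabels"))
```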

The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.

Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets in to an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.

Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:

The US Betfair Odds Race Winner.xlsx is a raw data collection with 5 columns….
1) The timestamp (an annoying format but there is a reason for this albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) the total amount of money traded on that driver so far.
5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.

To reduce the amount of data I only record it when the price traded changes or if the amount changes.

Looking through the datafile, there appear to be some gotchas, so these need cleansing out:

datafile gotchas

Here’s my initial loader script:

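#The gdata library provides read.xls()
library(gdata)
#Skip the junk top line and add in our own column names
xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1,col.names=c('dt','driver','odds','amount','racing'))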

#Cleansing pass

'data.frame':	10732 obs. of  5 variables:
 $ dt    : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ...
 $ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ...
 $ odds  : num  3.9 7 17 16.5 24 140 120 180 270 550 ...
 $ amount: num  1340 557 120 118 195 ...
 $ racing: int  0 0 0 0 0 0 0 0 0 0 ...

#Generate a proper datetime field from the dt column
#This is a hacked way of doing it. How do I do it properly?
xl$dtt=as.POSIXlt(gsub("T", " ", xl$dt))

#If we rerun str(), we get the following extra line in the results:
# $ dtt   : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...
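On the “how do I do it properly?” question, one option is to pass an explicit format string to as.POSIXct(). The sketch below is hedged: the first string is copied from the factor levels in the str() output above, the second (with a T separator) is my guess at the raw form implied by the gsub() call, and the GMT timezone is an assumption.

```r
#Example strings in the two styles hinted at above
us_style="11/16/2012 12:24:52 AM"   #as seen in the str() output
iso_style="2012-11-11T11:00:08"     #assumed raw form, given the gsub("T"," ",...) call

#%I is the 12 hour clock and %p the AM/PM marker (%p depends on the locale)
t1=as.POSIXct(us_style,format="%m/%d/%Y %I:%M:%S %p",tz="GMT")
t2=as.POSIXct(iso_style,format="%Y-%m-%dT%H:%M:%S",tz="GMT")
print(t1)
print(t2)
```

Using POSIXct rather than POSIXlt also tends to sit more comfortably in a data frame column, since it stores each time as a single number.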

Here’s what the raw data, as loaded, looks like to the eye:
Betfair spreadsheet

Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:

#We're going to make use of the ggplot2 graphics library
library(ggplot2)

#Let's get a quick feel for bets around each driver

#Let's look in a little more detail around a particular driver within a particular time window
g=ggplot(subset(xl,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00')))

Here are the charts (obviously lacking in caption, tidy labels and so on).

Firstly, the odds by driver:

odds by driver

Secondly, zooming in on a particular driver in a particular time window:


That all seems to work okay, so how about the other spreadsheet?

#There are several sheets to choose from, named as follows:
#Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3

#Load in data from a particular specified sheet
race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race')

#The datetime column appears to be in Excel datetime format, so cast it into something meaningful
race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1"))
#Note that I am not checking for gotcha rows, though maybe I should...?

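#Use the directlabels package to help tidy up the display a little
library(directlabels)

#Let's just check we've got something loaded - prune the display to rule out the longshots
#Assumption: plot the Median price for each runner, treating prices of 30 or more as longshots
g=ggplot(subset(race.odds,Median<30),aes(x=tTime,y=Median,group=Runner,col=Runner,label=Runner))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
#Label each line directly, in place of the legend
direct.label(g)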
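The magic numbers in the datetime cast deserve a quick sanity check. As I understand it, Excel (in its 1900 date system) counts days from 30 December 1899, and 25569 is the number of days from there to 1 January 1970; multiplying by 86400 then turns days into seconds. The serial number below is made up for illustration.

```r
#An Excel serial datetime: whole days since 1899-12-30, plus the fraction of a day
serial = 41224.5

#Route 1: shift to days since 1970-01-01, then scale days to seconds
viaEpoch = as.POSIXct((serial - 25569) * 86400, origin="1970-01-01", tz="GMT")

#Route 2: let as.Date() count the days from Excel's day zero (date part only)
viaDate = as.Date(floor(serial), origin="1899-12-30")

format(viaEpoch, "%Y-%m-%d %H:%M:%S")
format(viaDate)
```

Both routes should land on the same calendar day, which is a cheap cross-check that the offset is right.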

Here’s a view over the drivers’ odds to win, with the longshots pruned out:

example race odds by driver

With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:

#Let's see if we can also plot the odds by bookie
colnames(race.odds)
#[1] "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred"     
# [8] "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin"        
#[15] "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red"     
#[22] "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max"         
#[29] "Range" "tTime"   

#We can remove items from this list using something like this:
tmp=colnames(race.odds)
tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time']
#Then we can create a subset of cols
#(race.odds.data and race.odds.data.m are placeholder names for the intermediate data frames)
race.odds.data=subset(race.odds,select=tmp)

#Melt the data
library(reshape2)
race.odds.data.m=melt(race.odds.data,id.vars=c('tTime','Runner'))
head(race.odds.data.m,3)
#                tTime                 Runner variable value
#1 2012-11-11 19:07:01 Sebastian Vettel (Red)   Bet365  2.37
#2 2012-11-11 19:07:01   Lewis Hamilton (McL)   Bet365  3.25
#3 2012-11-11 19:07:01  Fernando Alonso (Fer)   Bet365  6.00

#Now we can plot how the different bookies compare
g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
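If the melting step feels a bit like magic, here is a base R sketch of the same wide-to-long reshaping on a toy table, along with setdiff() as a tidier way of pruning the column list. The Bet365 prices are the ones shown above; the SkyBet and Range values are invented, and the variable/value column names simply mimic what melt() produces.

```r
#Toy wide-format odds table; the SkyBet and Range values are invented for illustration
wide = data.frame(
  tTime  = as.POSIXct(rep("2012-11-11 19:07:01", 2), tz="GMT"),
  Runner = c("Sebastian Vettel (Red)", "Lewis Hamilton (McL)"),
  Bet365 = c(2.37, 3.25),
  SkyBet = c(2.40, 3.20),
  Range  = c(0.03, 0.05),
  stringsAsFactors = FALSE
)

#setdiff() drops the unwanted column names in one go
keep = setdiff(colnames(wide), c("Range", "Median", "Median..", "Min", "Max", "Time"))

#Everything that is not an id column is a bookie column
ids = c("tTime", "Runner")
bookies = setdiff(keep, ids)

#Stack the bookie columns: one row per (time, runner, bookie)
long = data.frame(
  tTime    = rep(wide$tTime, times=length(bookies)),
  Runner   = rep(wide$Runner, times=length(bookies)),
  variable = rep(bookies, each=nrow(wide)),
  value    = unlist(wide[bookies], use.names=FALSE),
  stringsAsFactors = FALSE
)
long
```

The long format is what makes the group=variable, col=variable trick in the plot possible: each bookie becomes a value in a column rather than a column in its own right.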

bookies odds

Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…

F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification

[If this isn’t your thing, the posts in this thread will be moving to a new blog soon, rather than taking over…]

Following the roundabout tour of F1Stats – A Prequel to Getting Started With Rank Correlations, here’s a walk through of my attempt to replicate the first part of A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing. Specifically, a look at the correlation between various rankings achieved over a race weekend and the final race position for that weekend.

The intention behind doing the replication is two-fold: firstly, published papers presumably do “proper stats”, so by pinching the methodology, I can hopefully pick up some tricks about how you’re supposed to do this stats lark by apprenticing myself, via the web, to someone who has done something similar; secondly, it provides an answer scheme, of a sort: I can check my answers against the answers in the back of the book (well, the published paper). I’m also hoping that by working through the exercise, I can start to frame my own questions and start testing my own assumptions.

So… let’s finally get started. The data I’ll be using for now is pulled from my scraper of the FormulaOne website (personal research purposes, blah, blah, blah;-). If you’re following along, from the Download button grab the whole SQLite database and save it into whatever directory you’re going to be working in in R…

…because this is an R-based replication… (I use RStudio for my R tinkerings; if you need to change directory when you’re in R, setwd('~/path/to/myfiles')).

To load the data in, and have a quick peek at what’s loaded, I use the following recipe:

require(RSQLite)
f1 = dbConnect(SQLite(), dbname="f1com_megascraper.sqlite")

#There are a couple of ways we can display the tables that are loaded
#Directly:
dbListTables(f1)
#Or via a query:
dbGetQuery(f1, 'SELECT name FROM sqlite_master WHERE type="table"')

The last two commands each display a list of the names of the database tables that can be loaded in as separate R dataframes.

To load the data in from a particular table, we run a SELECT query on the database. Let’s start with grabbing the same data that the paper analysed – the results from 2009:

#Grab the race results from the raceResults table for 2009
results.race=dbGetQuery(f1, 'SELECT pos as racepos, race, grid, driverNum FROM raceResults where year==2009')

#We can load data in from other tables and merge the results:
results.quali=dbGetQuery(f1, 'SELECT pos as qpos, driverNum, race FROM qualiResults where year==2009')
results.combined.clunky = merge(results.race, results.quali, by = c('race', 'driverNum'), all=T)

#A more efficient approach is to run a JOINed query to pull the data in in one go
results.combined=dbGetQuery(f1,
    'SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==2009 and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;' )

#Note: I haven't used SQL that much so there may be a more efficient way of writing this?
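On the SQL question: one alternative is to spell the joins out with explicit JOIN ... ON clauses and short table aliases. I would expect SQLite to treat the two forms in much the same way, so this is more about readability than speed. The sketch below runs against a throwaway in-memory database holding cut-down copies of the three tables (just the first two Australia rows from the preview), so the tables here are stand-ins, not the real scraper database.

```r
library(DBI)

#Throwaway in-memory database with cut-down toy versions of the three tables
con = dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "raceResults", data.frame(year=2009, race="AUSTRALIA", driverNum=c(22, 23), pos=c("1", "2"), stringsAsFactors=FALSE))
dbWriteTable(con, "qualiResults", data.frame(year=2009, race="AUSTRALIA", driverNum=c(22, 23), pos=c("1", "2"), stringsAsFactors=FALSE))
dbWriteTable(con, "p3Results", data.frame(year=2009, race="AUSTRALIA", driverNum=c(22, 23), pos=c("3", "6"), stringsAsFactors=FALSE))

#The join conditions live next to the table they bring in; WHERE just picks the year
sql = 'SELECT r.year AS year, r.race AS race, r.driverNum AS driverNum,
              q.pos AS qpos, p.pos AS p3pos, r.pos AS racepos
       FROM raceResults r
       JOIN qualiResults q ON q.year = r.year AND q.race = r.race AND q.driverNum = r.driverNum
       JOIN p3Results p ON p.year = r.year AND p.race = r.race AND p.driverNum = r.driverNum
       WHERE r.year = 2009
       ORDER BY r.driverNum'
joined = dbGetQuery(con, sql)
dbDisconnect(con)
joined
```

Note that a plain JOIN only keeps drivers who appear in all three tables, which is also how the comma-separated version behaves.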

#We can preview the response
head(results.combined)
#  year qpos p3pos racepos      race grid driverNum raceNum
#1 2009    1     3       1 AUSTRALIA    1        22       1
#2 2009    2     6       2 AUSTRALIA    2        23       1
#3 2009   20     2       3 AUSTRALIA   20         9       1
#4 2009   19     8       4 AUSTRALIA   19        10       1
#5 2009   10    17       5 AUSTRALIA   10         7       1
#6 2009    5     1       6 AUSTRALIA    5        16       1

#We can look at the format of each column
str(results.combined)
#'data.frame':	338 obs. of  8 variables:
# $ year     : chr  "2009" "2009" "2009" "2009" ...
# $ qpos     : chr  "1" "2" "20" "19" ...
# $ p3pos    : chr  "3" "6" "2" "8" ...
# $ racepos  : chr  "1" "2" "3" "4" ...
# $ grid     : chr  "1" "2" "20" "19" ...
# $ driverNum: chr  "22" "23" "9" "10" ...
# $ raceNum  : int  1 1 1 1 1 1 1 1 1 1 ...

#We can inspect the different values taken by each field
levels(factor(results.combined$racepos))
# [1] "1"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"  
#[20] "DSQ" "Ret"

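#We need to do a little tidying... Let's make integer versions of the positions
#NA values will be introduced where there are no integers
results.combined$racepos.int=as.integer(as.character(results.combined$racepos))
results.combined$grid.int=as.integer(as.character(results.combined$grid))
results.combined$qpos.int=as.integer(as.character(results.combined$qpos))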

#The race classification does not give a numerical position for each driver
##(eg in cases of DSQ or Ret) although the tables are ordered
#Let's use the row order for each race to give an actual numerical position to each driver for the race
#(If we wanted to do this for practice and quali too, we would have to load the tables
#in separately, number each row, then merge them.)
require(plyr)
results.combined=ddply(results.combined, .(race), mutate, racepos.raw=1:length(race))

#Now we have a data frame that looks like this:
head(results.combined)
#  year qpos p3pos racepos      race grid driverNum raceNum racepos.int grid.int qpos.int racepos.raw
#1 2009    2    11       1 ABU DHABI    2        15      17           1        2        2           1
#2 2009    3    15       2 ABU DHABI    3        14      17           2        3        3           2
#3 2009    5     1       3 ABU DHABI    5        22      17           3        5        5           3
#4 2009    4     3       4 ABU DHABI    4        23      17           4        4        4           4
#5 2009    8     5       5 ABU DHABI    8         6      17           5        8        8           5
#6 2009   12    13       6 ABU DHABI   12        10      17           6       12       12           6

#If necessary we can force the order of the races factor levels away from alphabetical into race order
results.combined$race=reorder(results.combined$race, results.combined$raceNum)
levels(results.combined$race)
# [1] "AUSTRALIA"     "MALAYSIA"      "CHINA"         "BAHRAIN"       "SPAIN"         "MONACO"        "TURKEY"       
# [8] "GREAT BRITAIN" "GERMANY"       "HUNGARY"       "EUROPE"        "BELGIUM"       "ITALY"         "SINGAPORE"    
#[15] "JAPAN"         "BRAZIL"        "ABU DHABI" 
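To see what the tidying steps actually do, here is a base R sketch on a toy classification. The rows are invented, and I've used ave() in place of ddply() purely so the example runs without plyr; the idea (coerce what you can to integers, then number the rows within each race) is the same.

```r
#Toy classification for two races, already in finishing order (invented rows)
toy = data.frame(
  race    = c("BRAZIL", "BRAZIL", "BRAZIL", "JAPAN", "JAPAN"),
  racepos = c("1", "2", "Ret", "1", "DSQ"),
  stringsAsFactors = FALSE
)

#Non-numeric entries such as Ret and DSQ become NA
#(suppressWarnings just hides the "NAs introduced by coercion" warning)
toy$racepos.int = suppressWarnings(as.integer(toy$racepos))

#Number the rows within each race, preserving the table order
toy$racepos.raw = ave(seq_along(toy$race), toy$race, FUN=seq_along)
toy
```

The racepos.raw column gives every driver a rank, retirements included, which is what the rank correlations later on need.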

We’re now in a position whereby we can start to look at the data, and do some analysis of it. The paper shows a couple of example scatterplots charting the “qualifying” position against the final race position. We can go one better in a single line using the ggplot2 package:

require(ggplot2)
g=ggplot(results.combined) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race)
g

f1_2009 q vs racepos

At first glance, this may all look very confusing, but just take your time, actually look at it, and see if you can spot any patterns or emergent structure in the way the points are distributed. Does it look as if the points on the chart might have a story to tell? (If a picture saves a thousand words, and it takes five minutes to read a thousand words, the picture might actually take a couple of minutes to read to get across the same information…;-)

First thing to notice: there are lots of panels – one for each race. The panels are ordered left to right, then down a row, left to right, down a row, etc, in the order they took place during the season. So you should be able to quickly spot the first race of the season (top left), the last race (at the end of the last/bottom row), and races in mid-season (middle of them all!)

Now focus on a single chart and think about what the positioning of the points actually means. If the qualifying position was the same as the race position, what would you expect to see? That is, if the car that qualified first finished first; the car that qualified second finished second; the car that qualified tenth finished tenth; and so on. As the axes are incrementally ordered, I’d expect to see a straight line, going up at 45 degrees if the same space was given on each axis (that is, if the charts were square plots).

We can run a quick experiment to see how that looks:

#Plot increasing x and increasing y in step with each other
ggplot() + geom_point(aes(x=1:20, y=1:20))


Looking over the race charts, some of the charts have the points plotted roughly along a straight line – Turkey and Abu Dhabi, for example. If the cars started the race roughly according to qualifying position, this would suggest the races may have been rather processional? On the other hand, Brazil is all over the place – qualifying position appears to bear very little relationship to where the cars actually finished the race.

Even for such a simple chart type, it’s amazing how much structure we might be able to pull out of these charts. If every one was a straight line, we might imagine a very boring season, full of processions. If every chart was all over the place, with little apparent association between qualifying and race positions, we might begin to wonder whether there was any “form” to speak of at all. (Such events may on the surface provide exciting races, at least at the start of the season, but if every driver has an equal, but random, chance of winning, it makes it hard to root for anyone in particular, makes it hard to root for an underdog versus a favourite, and so on.)

Again, we can run a quick experiment to see what things would look like if each car qualified at random and was placed at random at the end of the race:

expt2 = NULL
#We're going to run 20 experiments (z)
for (i in 1:20){
    #Each experiment generates a random start order (x) and random finish order (y)
    expt=data.frame(x=sample(1:20, 20), y=sample(1:20, 20), z=i)
    expt2=rbind(expt2, expt)
}
#Plot the result, by experiment
ggplot(expt2) + geom_point(aes(x=x, y=y)) + facet_wrap(~z)

Here’s the result:

There’s not a lot of order in there, is there? However, it is worth noting that on occasion it may look as if there is some sort of ordering (as in trial 4, for example?), purely by chance.
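To put a rough number on "purely by chance", we can skip the plotting and just calculate a rank correlation for lots of random start/finish orderings. This is a sketch under my own assumptions: 20 cars, 1000 trials, an arbitrary seed, and a cutoff of 0.4 that I picked out of the air as "looks like it might be something".

```r
set.seed(42)  #arbitrary seed so the run is repeatable

#Spearman correlation between two independent random orderings of n cars
randomRho = function(n=20) cor(sample(1:n, n), sample(1:n, n), method="spearman")

#Repeat the experiment many times
rhos = replicate(1000, randomRho())

#How spread out are the chance correlations?
summary(rhos)

#What fraction of purely random trials clear the (arbitrary) 0.4 mark, in either direction?
mean(abs(rhos) > 0.4)
```

The spread of these chance values gives a feel for how big a correlation needs to be on a 20 car grid before it is worth getting excited about.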

For what it’s worth, we can also tidy up the race plots a little to make them a little bit more presentable:

g = g + ggtitle( 'F1 2009' )
g = g + xlab('Qualifying Position') + ylab('Final Race Classification')

Here’s the chart with added titles and axis labels:

We can also grab a plot for a single race. The paper showed scatterplots for Brazil and Japan.


Here’s the code:

g1 = ggplot(subset(results.combined, race == 'BRAZIL')) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race)
g1 = g1 + ggtitle('F1 2009 Brazil') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw() 

g2 = ggplot(subset(results.combined, race=='JAPAN' ) )+geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race)
g2 = g2 + ggtitle('F1 2009 Japan') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw() 

require( gridExtra )
grid.arrange( g1, g2, ncol=2 )

Note that I have: i) filtered the data to get just the data required for each race; ii) added a little bit more styling with the theme_bw() call; iii) used the gridExtra package to generate the side-by-side plot.

Here’s what the data looked like in the original paper:


NOT the same… Hmmm… How about if we plot the grid position vs. the final position? (Set x=grid.int and tweak the xlab()):


So – the “qualifying” position in the published paper is actually the grid position? Did I not read the paper closely enough, or did they make a mistake? (So much for academic peer review stopping errors getting through. Like, erm, the error in the visualisation of the confidence interval that also got through? ;-)

Generating similar views for other years should be trivial – simply tweak the year selector in the SELECT query and then repeat as above.

In the limit, I guess everything could be wrapped up in a single function to plot either the facetted view, or, if a list of explicitly named races is passed in to the function, a set of more specific race charts. But that is left as an exercise for the reader…;-)
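For what it's worth, here is one way the exercise might start out. It is only a sketch: plotGridVsRace is a made-up name, the function assumes a data frame with race, grid.int and racepos.raw columns, and the toy data below stands in for the real results.

```r
library(ggplot2)

#Hypothetical helper: facetted view by default, or just the named races if any are given
#Assumes columns race, grid.int and racepos.raw
plotGridVsRace = function(df, races=NULL, title="Grid vs race classification"){
  if (!is.null(races)) df = df[df$race %in% races, ]
  ggplot(df) + geom_point(aes(x=grid.int, y=racepos.raw)) + facet_wrap(~race) +
    ggtitle(title) + xlab("Grid Position") + ylab("Final Race Classification") + theme_bw()
}

#Toy data standing in for the real results (invented values)
toy = data.frame(
  race = rep(c("BRAZIL", "JAPAN"), each=3),
  grid.int = c(1, 2, 3, 1, 2, 3),
  racepos.raw = c(3, 1, 2, 1, 2, 3)
)

#All races, facetted
pAll = plotGridVsRace(toy)
#Just the named race
p = plotGridVsRace(toy, races="JAPAN")
```

Because the function returns the ggplot object rather than printing it, the result can still be tweaked afterwards or passed to grid.arrange().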

Okay – enough for now… am I ever going to get even as far as blogging the correlation analysis, let alone the regressions?! Hopefully in the next post in this series!

PS I couldn’t resist… here’s the summary set of charts for the 2012 season, in respect of grid positions vs. final race classifications:


Was it as you remember it?!! Were Australia and Malaysia all over the place? Was there a big crash amongst the front runners in Belgium?!

PPS Rather than swamp this blog with too many posts on this topic, I intend to start a new uncourse blog and post them there in future… details to follow…

PPPS if you liked this post, you might like this Ergast API Interactive F1 Results data app

F1Stats – Correlations Between Qualifying, Grid and Race Classification

Following directly on from F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification, and continuing in my attempt to replicate some of the methodology and results used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing, here’s a quick look at the correlation scores between the final practice, qualifying and grid positions and the final race classification.

I’ve already done a brief review of what correlation means (sort of) in F1Stats – A Prequel to Getting Started With Rank Correlations, so I’m just going to dive straight in with some R code that shows how I set about trying to find the correlations between the different classifications:

Here’s the answer from the back of the book (well, the paper) that we’re aiming for…


Here’s what I got:

> corrs.df[order(corrs.df$V1),]
              V1   p3pos.int    qpos.int     grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.30075188  0.01503759  0.087218045           1 7.143421e-01 9.518408e-01 0.197072158
13      MALAYSIA  0.42706767  0.57293233  0.630075188           1 3.584362e-03 9.410805e-03 0.061725312
6          CHINA -0.26015038  0.57443609  0.514285714           1 2.183596e-02 9.193214e-03 0.266812583
3        BAHRAIN  0.13082707  0.73233083  0.739849624           1 2.900250e-04 3.601434e-04 0.581232598
16         SPAIN  0.25112782  0.80451128  0.804511278           1 2.179221e-05 2.179221e-05 0.284231482
14        MONACO  0.51578947  0.48120301  0.476691729           1 3.513870e-02 3.326706e-02 0.021403708
17        TURKEY  0.52330827  0.73082707  0.730827068           1 3.756531e-04 3.756531e-04 0.019344720
9  GREAT BRITAIN  0.65413534  0.83007519  0.830075188           1 8.921842e-07 8.921842e-07 0.002260234
8        GERMANY  0.32030075  0.46917293  0.452631579           1 4.657539e-02 3.844275e-02 0.168419054
10       HUNGARY  0.49649123  0.37017544  0.370175439           1 1.194050e-01 1.194050e-01 0.032293715
7         EUROPE  0.28120301  0.72030075  0.720300752           1 4.997719e-04 4.997719e-04 0.228898214
4        BELGIUM  0.06766917  0.62105263  0.621052632           1 4.222076e-03 4.222076e-03 0.777083014
11         ITALY  0.52932331  0.52481203  0.524812030           1 1.895282e-02 1.895282e-02 0.017815489
15     SINGAPORE  0.50526316  0.58796992  0.715789474           1 5.621214e-04 7.414170e-03 0.024579520
12         JAPAN  0.34912281  0.74561404  0.849122807           1 0.000000e+00 3.739715e-04 0.143204045
5         BRAZIL -0.51578947 -0.02105263 -0.007518797           1 9.771776e-01 9.316030e-01 0.021403708
1      ABU DHABI  0.42556391  0.66466165  0.628571429           1 3.684738e-03 1.824565e-03 0.062722332

The paper mistakenly reports the grid values as the qualifying positions, so if we look down the column that I use to contain the correlation values between the grid and final classifications, we see they broadly match the values quoted in the paper. I also calculated the p-values and they seem to be a little bit off, but of the right order.

And here’s the R-code I used to get those results… The first chunk is just the loader, a refinement of the code I have used previously:


require(RSQLite)
require(plyr)
#Data downloaded from my f1com scraper on scraperwiki
f1 = dbConnect(SQLite(), dbname="f1com_megascraper.sqlite")

#getRacesData is a placeholder name for the loader function
getRacesData=function(year='2009'){
  #Data query
  results.combined=dbGetQuery(f1,
                              paste('SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==',year,' and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;',sep=''))
  #Data tidying
  for (i in c('racepos','grid','qpos','p3pos','driverNum'))
    results.combined[[paste(i,'.int',sep='')]]=as.integer( as.character(results.combined[[i]]))
  #As before: number the rows within each race, then put the races into race order
  results.combined=ddply(results.combined, .(race), mutate, racepos.raw=1:length(race))
  results.combined$race=reorder(results.combined$race, results.combined$raceNum)
  results.combined
}

results.combined=getRacesData(2009)


Here’s the actual correlation calculation – I use the cor function:

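#The cor() function returns data that looks like:
#            p3pos.int   qpos.int   grid.int racepos.raw
#p3pos.int   1.0000000 0.31578947 0.28270677  0.30075188
#qpos.int    0.3157895 1.00000000 0.97744361  0.01503759
#grid.int    0.2827068 0.97744361 1.00000000  0.08721805
#racepos.raw 0.3007519 0.01503759 0.08721805  1.00000000
#Row/col 4 relates to the correlation with the race classification, so for now just return that

#corr.rank is a placeholder name; cmethod is 'spearman' or 'kendall'
#Assumption: rows with a missing position are dropped before calling cor()
corr.rank=function(results.combined, cmethod='spearman'){
  cols=c('p3pos.int','qpos.int','grid.int','racepos.raw')
  corrs=NULL
  #Run through the races
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined, race==i, select=cols)
    #print( results.classified)
    cp=cor(results.classified, method=cmethod, use="complete.obs")
    pval.grid=cor.test(results.classified$racepos.raw,results.classified$grid.int,method=cmethod,alternative = "two.sided")$p.value
    pval.qpos=cor.test(results.classified$racepos.raw,results.classified$qpos.int,method=cmethod,alternative = "two.sided")$p.value
    pval.p3pos=cor.test(results.classified$racepos.raw,results.classified$p3pos.int,method=cmethod,alternative = "two.sided")$p.value
    corrs=rbind(corrs, data.frame(V1=i, t(cp[4,]), pval.grid, pval.qpos, pval.p3pos))
  }
  #Keep the races in race order rather than alphabetical order
  corrs$V1=factor(corrs$V1, levels=levels(factor(results.combined$race)))
  corrs
}
corrs.df=corr.rank(results.combined)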
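As a check on my own understanding of what cor() is doing when method is 'spearman', here is a small worked example that calculates Spearman's rho by hand and compares it with the built-in functions. The six-car grid and finishing orders are invented, and the shortcut formula used here only holds when there are no tied ranks.

```r
#Toy grid and finishing orders for six cars (invented; no ties)
grid   = c(1, 2, 3, 4, 5, 6)
finish = c(2, 1, 3, 6, 4, 5)

#Spearman by hand: 1 - 6*sum(d^2) / (n*(n^2-1)), where d is the difference in ranks
d = rank(grid) - rank(finish)
n = length(grid)
rho.byhand = 1 - 6 * sum(d^2) / (n * (n^2 - 1))

#The built-in equivalents
rho.cor = cor(grid, finish, method="spearman")
ct = cor.test(grid, finish, method="spearman", alternative="two.sided")

#All three estimates should agree
c(byhand=rho.byhand, cor=rho.cor, cor.test=unname(ct$estimate))

#The p-value is what the pval.* columns hold for each race
ct$p.value
```

With only six cars the p-value is based on very little data, which is worth bearing in mind when reading the per-race p-values above: each one comes from a single grid of around twenty cars.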




It’s then trivial to plot the result:

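require(ggplot2)
xRot=function(g,s=5,lab=NULL) g+theme(axis.text.x=element_text(angle=-90,size=s))+xlab(lab)

#Assumption: plot the grid/race classification correlation for each race
g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int))
g=xRot(g,6)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g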


Recalling that there are different types of rank correlation function, specifically “Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series)”, I wondered whether it would make sense to look at correlations under this measure to see whether there were any obvious looking differences compared to Spearman’s rho, that might prompt us to look at the actual grid/race classifications to see which score appears to be more meaningful.
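The concordance idea is easier to see with a small worked example, so here is a sketch that counts concordant and discordant pairs by hand for an invented six-car grid and finishing order, then compares the result with cor(). The simple formula below assumes there are no ties.

```r
#Toy grid and finishing orders for six cars (invented; no ties)
grid   = c(1, 2, 3, 4, 5, 6)
finish = c(2, 1, 3, 6, 4, 5)

#Every pair of cars (i, j) with i < j, one pair per column
pairs = combn(length(grid), 2)

#A pair is concordant if the two cars are in the same order in both series
signs = sign(grid[pairs[1,]] - grid[pairs[2,]]) * sign(finish[pairs[1,]] - finish[pairs[2,]])
concordant = sum(signs > 0)
discordant = sum(signs < 0)

#Kendall's tau (no ties): (concordant - discordant) / number of pairs
tau.byhand = (concordant - discordant) / ncol(pairs)
tau.cor = cor(grid, finish, method="kendall")

c(concordant=concordant, discordant=discordant, byhand=tau.byhand, cor=tau.cor)

#For comparison, Spearman's rho on the same orderings
cor(grid, finish, method="spearman")
```

Each discordant pair corresponds to one car finishing ahead of another that started in front of it, so tau has quite a direct reading in terms of places swapped.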

The easiest way to spot the difference is probably graphically:


g=g+geom_point(data=corrs.df2, aes(x=V1,y=grid.int),col='blue')
g=g+ggtitle('F1 2009 Correlation: grid and final classification')

              V1   p3pos.int    qpos.int    grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.17894737 -0.01052632  0.04210526           1 8.226829e-01 9.744669e-01 0.288378196
13      MALAYSIA  0.26315789  0.41052632  0.46315789           1 3.782665e-03 1.110136e-02 0.112604127
6          CHINA -0.20000000  0.41052632  0.35789474           1 2.832863e-02 1.110136e-02 0.233266557
3        BAHRAIN  0.07368421  0.51578947  0.52631579           1 8.408301e-04 1.099522e-03 0.677108239
16         SPAIN  0.17894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.288378196
14        MONACO  0.38947368  0.35789474  0.35789474           1 2.832863e-02 2.832863e-02 0.016406081
17        TURKEY  0.37894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.019784403
9  GREAT BRITAIN  0.46315789  0.63157895  0.63157895           1 3.622261e-05 3.622261e-05 0.003782665
8        GERMANY  0.23157895  0.31578947  0.30526316           1 6.380788e-02 5.475355e-02 0.164976406
10       HUNGARY  0.36842105  0.36842105  0.36842105           1 2.860214e-02 2.860214e-02 0.028602137
7         EUROPE  0.21052632  0.62105263  0.62105263           1 5.176962e-05 5.176962e-05 0.208628398
4        BELGIUM  0.02105263  0.46315789  0.46315789           1 3.782665e-03 3.782665e-03 0.923502331
11         ITALY  0.35789474  0.36842105  0.36842105           1 2.373450e-02 2.373450e-02 0.028328627
15     SINGAPORE  0.35789474  0.45263158  0.55789474           1 3.589956e-04 4.748310e-03 0.028328627
12         JAPAN  0.26315789  0.57894737  0.69590643           1 6.491222e-06 3.109641e-04 0.124796908
5         BRAZIL -0.37894737 -0.05263158 -0.04210526           1 8.226829e-01 7.732195e-01 0.019784403
1      ABU DHABI  0.34736842  0.61052632  0.55789474           1 3.589956e-04 7.321900e-05 0.033643947


Hmm.. Kendall gives lower values for all races except Hungary – maybe put that on the “must look at Hungary compared to the other races” pile…;-)

One thing that did occur to me was that I have access to race data from other years, so it shouldn’t be too hard to see how the correlations play out over the years at different circuits (do grid/race correlations tend to be higher at some circuits, for example?).

  for (year in years) {




So Spain and Turkey look like they tend to the processional? Let’s see if a boxplot bears that out:


How predictable have the years been, year on year?




And as a boxplot:


From a betting point of view, (eg Getting Started with F1 Betting Data and The Basics of Betting as a Way of Keeping Score…) it possibly also makes sense to look at the correlation between the P3 times and the qualifying classification to see if there is a testable edge in the data when it comes to betting on quali?

I think I need to tweak my code slightly to make it easy to pull out correlations between specific columns, but that’ll have to wait for another day…
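As a note to self, the tweak might look something like the following sketch: a helper that takes the two column names as arguments and returns one correlation (and p-value) per race. corrByGroup is a made-up name, the toy data is invented, and I've stuck to base R so it runs on its own.

```r
#Hypothetical helper: rank correlation between two named columns, one row per group
#df: data frame; x, y: column names; by: grouping column; cmethod: 'spearman' or 'kendall'
corrByGroup = function(df, x, y, by="race", cmethod="spearman"){
  do.call(rbind, lapply(split(df, df[[by]]), function(d){
    #suppressWarnings hides the "cannot compute exact p-value with ties" warning
    ct = suppressWarnings(cor.test(d[[x]], d[[y]], method=cmethod))
    data.frame(group=d[[by]][1], corr=unname(ct$estimate), pval=ct$p.value, stringsAsFactors=FALSE)
  }))
}

#Toy data (invented): P3 order against qualifying order for two races
toy = data.frame(
  race = rep(c("BRAZIL", "JAPAN"), each=5),
  p3pos.int = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  qpos.int  = c(5, 4, 3, 2, 1, 1, 2, 3, 5, 4)
)
res = corrByGroup(toy, "p3pos.int", "qpos.int")
res
```

Passing the column names in as strings means the same function could be pointed at P3 vs quali, quali vs grid, or grid vs race without any further surgery.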