As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.
Courtesy of @flutterF1, I managed to get a peek of some betting data from one of the race weekends last year year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.
(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself it for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)
The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)
The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.
Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets in to an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.
Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:
The US Betfair Odds Race Winner.xslx is a raw data collection with 5 columns….
1) The timestap (an annoying format but there is a reason for this albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) the total amount of money traded on that driver so far.
5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.
To reduce the amount of data I only record it when the price traded changes or if the amount changes.
Looking through the datafile, they appear to be some gotchas, so these need cleansing out:
Here’s my initial loader script:
library(gdata) xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1) colnames(xl)=c('dt','driver','odds','amount','racing') #Cleansing pass bf.odds=subset(xl,racing!='') str(bf.odds) 'data.frame': 10732 obs. of 5 variables: $ dt : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ... $ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ... $ odds : num 3.9 7 17 16.5 24 140 120 180 270 550 ... $ amount: num 1340 557 120 118 195 ... $ racing: int 0 0 0 0 0 0 0 0 0 0 ... #Generate a proper datetime field from the dt column #This is a hacked way of doing it. How do I do it properly? bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt)) #If we rerun str(), we get the following extra line in the results: # $ dtt : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...
Here’s what the raw data, as loaded, looks like to the eye:
Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:
#We're going to make use of the ggplot2 graphics library library(ggplot2) #Let's get a quick feel for bets around each driver g=ggplot(xl)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y") g=g+theme(axis.text.x=element_text(angle=-90)) g #Let's look in a little more detail around a particular driver within a particular time window g=ggplot(subset(xl,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y") g=g+theme(axis.text.x=element_text(angle=-90)) g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00'))) g
Here are the charts (obviously lacking in caption, tidy labels and so on).
Firstly, the odds by driver:
Secondly, zooming in on a particular driver in a particular time window:
That all seems to work okay, so how about the other spreadsheet?
#There are several sheets to choose from, named as follows: #Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3 #Load in data from a particular specified sheet race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race') #The datetime column appears to be in Excel datetime format, so cast it into something meaningful race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1")) #Note that I am not I checking for gotcha rows, though maybe I should...? #Use the directlabels package to help tidy up the display a little library(directlabels) #Let's just check we've got something loaded - prune the display to rule out the longshots g=ggplot(subset(race.odds,Bet365<30),aes(x=tTime,y=Bet365,group=Runner,col=Runner,label=Runner)) g=g+geom_line()+theme_bw()+theme(legend.position = "none") g=g+geom_dl(method=list('top.bumpup',cex=0.6)) g=g+scale_x_datetime(expand=c(0.15,0)) g
Here’s a view over the drivers’ odds to win, with the longshots pruned out:
With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:
#Let's see if we can also plot the odds by bookie colnames(race.odds) # "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred" #  "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin" # "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red" # "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max" # "Range" "tTime" #We can remove items from this list using something like this: tmp=colnames(race.odds) #tmp=tmp[tmp!='Range'] tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time'] #Then we can create a subset of cols race.odds.data=subset(race.odds,select=tmp) #Melt the data library(reshape) race.odds.data.m=melt(race.odds.data,id=c('tTime','Runner')) #head( race.odds.data.m) # tTime Runner variable value #1 2012-11-11 19:07:01 Sebastian Vettel (Red) Bet365 2.37 #2 2012-11-11 19:07:01 Lewis Hamilton (McL) Bet365 3.25 #3 2012-11-11 19:07:01 Fernando Alonso (Fer) Bet365 6.00 #... #Now we can plot how the different bookies compare g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable)) g=g+geom_line()+theme_bw()+theme(legend.position = "none") g=g+geom_dl(method=list('top.bumpup',cex=0.6)) g=g+scale_x_datetime(expand=c(0.15,0)) g
Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…