## Archive for the ‘**f1stats**’ Category

## Tools in Tandem – SQL and ggplot. But is it Really R?

Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 datbase and into an R dataframe, and then ggplot2 to render visualise it.

So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:

#Set up a connection to a local copy of the ergast database library(DBI) ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite') #Run a query q='SELECT code, grid, year, COUNT(l.lap) AS Laps FROM (SELECT grid, raceId, driverId from results) rg, lapTimes l, races r, drivers d WHERE rg.raceId=l.raceId AND d.driverId=l.driverId AND rg.driverId=l.driverId AND l.position=1 AND r.raceId=l.raceId GROUP BY grid, driverRef, year ORDER BY year' driverlapsledfromgridposition=dbGetQuery(ergastdb,q)

In this case, the data is table that shows for each year a count of laps led by each driver given their grid position in corresponding races (null values are not reported). The data grabbed from the database is based into a dataframe in a relatively tidy format, from which we can easily generate a visualisation of it.

The chart I have opted for is a text plot faceted by year:

The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to mimimise overlap. The horizontal grid is actually a logarithmic scale, which “stretches out” the positions at the from of the grid (grid positions 1 and 2) compared to positions lower down the grid – where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lay along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The *x-axis* is further expanded to mitigate against labels being obfuscated or overflowing off the left hand side of the plotting area. The clean black and white theme finished off the chart.

g = ggplot(driverlapsledfromgridposition) g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed') g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps), angle=45)) g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size=FALSE) g + scale_x_log10(expand=c(0,0.3)) + theme_bw()

There are still a few problems with this graphic, however. The order of labels on the y-axis is in alphabetical order, and would perhaps be more informative if ordered to show championship rankings, for example.

However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as a medium within which this activity has taken place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot library.

So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an *R* based analysis?

*For more details, see the “Laps Completed and Laps Led” draft chapter and the Wrangling F1 Data With R book.*

## Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of *churn*, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). *When going in circles is going backward: Outcome uncertainty in NASCAR.* Journal of Sports Economics, 12(3), 253-283.

In a competitive league, *churn* is defined as:

where is the churn in team standings for year , is the absolute value of the -th team’s change in finishing position going from season to season , and is the number of teams.

The *adjusted churn* is defined as an indicator with the range 0..1 by dividing the churn, , by the maximum churn, . The value of the maximum churn depends on whether there is an even or odd number of competitors:

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, is the position of driver at the end of race , is the starting position of driver at the beginning of that race (that is, race ) and is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn function of the form:

is.even = function(x) x %% 2 == 0 churnmax=function(N) if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N)) churn=function(d) sum(d)/length(d) adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

- comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
- churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
- churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI) ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite') q=paste('SELECT round, name, driverRef, code, grid, position, positionText, positionOrder FROM results rs JOIN drivers d JOIN races r ON rs.driverId=d.driverId AND rs.raceId=r.raceId WHERE r.year=2013',sep='') results=dbGetQuery(ergastdb,q) library(plyr) results['delta'] = abs(results['grid']-results['positionOrder']) churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise, churn = churn(delta), adjchurn = adjchurn(delta) )

For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

## Identifying Position Change Groupings in Rank Ordered Lists

The title says it all, doesn’t it?!

Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.

The question I had in mind was – how can I identify positions that are being contested during a particular window of time, where by contested I mean that the particular position was held by more than one person in a particular window of time?

Let’s zoom in to look at a couple of particular steps.

We see distinct groups of individuals who swap positions with each other between those two consecutive steps, so how can we automatically detect the positions that these drivers are fighting over?

A solution given to a Stack Overflow question on how to get disjoint sets from a list in R gives what I thought was a really nice solution: treat it as a graph, and then grab the connected components.

Here’s my working of it. Start by getting a list of results that show a particular driver held different positions in the window selected – each row in the original dataframe identifies the position held by a particular driver at the end of a particular lap:

library(DBI) ergastdb =dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite') #Get a race identifier for a specific race raceId=dbGetQuery(ergastdb, 'SELECT raceId FROM races WHERE year="2012" AND round="1"') q=paste('SELECT * FROM lapTimes WHERE raceId=',raceId[[1]]) lapTimes=dbGetQuery(ergastdb,q) lapTimes$position=as.integer(lapTimes$position) library(plyr) #Sort by lap first just in case lapTimes=arrange(lapTimes,driverId,lap) #Create a couple of new columns #pre is previous lap position held by a driver given their current lap #ch is position change between the current and previous lap tlx=ddply(lapTimes,.(driverId),transform,pre=(c(0,position[-length(position)])),ch=diff(c(0,position))) #Find rows where there is a change between a given lap and its previous lap #In particular, focus on lap 17 llx=tlx[tlx['ch']!=0 & tlx['lap']==17,c("position","pre")] llx

This filters the complete set of data to just those rows where there is a difference between a driver’s current position and previous position (the first column in the result just shows row numbers and can be ignored).

## position pre ## 17 2 1 ## 191 17 18 ## 390 9 10 ## 448 1 2 ## 506 6 4 ## 719 10 9 ## 834 4 5 ## 892 18 19 ## 950 5 6 ## 1008 19 17

We can now create a graph in which nodes represent positions (*position* or *pre* values) and edges connect a current and previous position.

#install.packages("igraph") #http://stackoverflow.com/a/25130575/454773 library(igraph) posGraph = graph.data.frame(llx) } plot(posGraph)

The resulting graph is split into several components:

We can then identify the connected components:

posBattles=split(V(posGraph)$name, clusters(posGraph)$membership) #Find the position change battles for (i in 1:length(posBattles)) print(posBattles[[i]])

This gives the following clusters, and their corresponding members:

## [1] "2" "1" ## [1] "17" "18" "19" ## [1] "9" "10" ## [1] "6" "4" "5"

To generalise this approach, I think we need to do a couple of things:

- allow a wider window within which to identify battles (so look over groups of three or more consecutive laps);
- simplify the way we detect position changes for a particular driver; for example, if we take the set of positions held by a driver within the desired window, if the cardinality of the set (that is, its size) is greater than one, then we have had at least one position change for that driver within that window. Each set of size > 1 of unique positions held by different drivers can be used to generate a set of distinct, unordered pairs that connect the positions (I think it only matters that they are connected, not that a driver specifically went from position
*x*to position*y*going from one lap to the next?). If we generate the graph from the set of distinct unordered pairs taken across all drivers, we should then be able to identify the contested/driver change position clusters.

Hmm… I need to try that out… And when I do, if and when it works(?!), I’ll add a complete demonstration of it – and how we might make use of it – to the *Wrangling F1 Data With R* book.

## F1 Doing the Data Visualisation Competition Thing With Tata?

Sort of via @jottevanger, it seems that Tata Communications announces the first challenge in the F1® Connectivity Innovation Prize to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post Tata launches F1® Connectivity Innovation Prize dated “10 Jun 2014″? What’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Innovation Connectivity Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

…

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:– Position

– Car number

– Driver’s name

– Fastest lap time

– Gap to the leader’s fastest lap time

– Sector 1 time for the current lap

– Sector 2 time for the current lap

– Sector 3 time for the current lap

– Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…

A technical manual is also provided for helping makes sense of the data files.

Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

... <transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction> <transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction> <transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction> <transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction> <transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction> <transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction> ...

Then qualifying:

... <transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction> <transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction> <transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction> <transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction> <transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction> <transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction> ...

Then the race:

... <transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction> <transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction> <transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction> <transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction> <transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction> <transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction> <transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction> <transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction> ...

We can parse the datafiles using python using an approach something like the following:

from lxml import etree pl=[] for xml in open(xml_doc, 'r'): pl.append(etree.fromstring(xml)) pl[100].attrib #{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'} pl[100][0].attrib #{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}

A few things are worth mentioning about this format… Firstly, the *identifier* is an identifier of the message type, rather then the message: each transaction message appears instead to be uniquely identified by the *messagecount*. The transactions each update the value of a single cell in the display screen, setting its *value* and *colour*. The cell is identified by its *row* and *column* co-ordinates. The *timestamp* also appears to group messages.

Secondly, within a session, several screen views are possible – essentially associated with data labelled with a particular *identifier*. This means the data feed is essentially powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.

It’s not obvious how we can make a row based transaction update, even, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above, for *timestamp=”14:57:11.681″* and *row=”3″*. If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the *pandas* DataFrame *.align()* method.

[I think I’m thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now… Should have looked more closely the first time… Doh!]

Note: I’m not sure that the timestamps are *necessarily* unique across rows, though I suspect that they are likely to be so, which means it would be safer to align, or merge, on the basis of the timestamp *and* the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth that all changes to a row are likely to occur in, we could perhaps round times for the basis of the join?

Even with a bundling, we still don’t a have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point of time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:

#Hacky load and parse of each row in the datafile pl=[] for xml in open('data/F1 Practice.txt', 'r'): pl.append(etree.fromstring(xml)) #Dataframe for current state timing screen df_practice_pos=pd.DataFrame(columns=[ "timestamp", "time", "classpos", "classpos_colour", "racingNumber","racingNumber_colour", "name","name_colour", ],index=range(50)) #Column mappings practiceMap={ '1':'classpos', '2':'racingNumber', '3':'name', '4':'laptime', '5':'gap', '6':'sector1', '7':'sector2', '8':'sector3', '9':'laps', '21':'sector1_best', '22':'sector2_best', '23':'sector3_best' } def parse_practice(p,df_practice_pos): if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib: if p[0].attrib['column'] not in ['10','21','22','23']: colname=practiceMap[p[0].attrib['column']] row=int(p[0].attrib['row'])-1 df_practice_pos.ix[row]['timestamp']=p.attrib['timestamp'] tt=p.attrib['timestamp'].replace('.',':').split(':') df_practice_pos.ix[row]['time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000) df_practice_pos.ix[row][colname]=p[0].attrib['value'] df_practice_pos.ix[row][colname+'_colour']=p[0].attrib['colour'] return df_practice_pos for p in pl[:2850]: df_practice_pos=parse_practice(p,df_practice_pos) df_practice_pos

Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:

A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the *racedemo.ipynb* notebook). If we create an ID based on the timestamp and row, we can repeatedly *$set* document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the *racedemo.ipynb* notebook has the first fumblings of an attempt to work this out.

I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)

PS The R book has stalled so badly I’ve pushed what I’ve got so far to wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… WHen I get a couple of clear hours, I’ll try to push the stuff that’s there out onto leanpub as a preview…

## F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification

*[If this isn’t your thing, the posts in this thread will be moving to a new blog soon, rather than taking over OUseful.info…]*

Following the roundabout tour of F1Stats – A Prequel to Getting Started With Rank Correlations, here’s a walk through of my attempt to replicate the first part of A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing. Specifically, a look at the correlation between various rankings achieved over a race weekend and the final race position for that weekend.

The intention behind doing the replication is two-fold: firstly, published papers presumably do “proper stats”, so by pinching the methodology, I can hopefully pick up some tricks about how you’re supposed to do this stats lark by apprenticing myself, via the web, to someone who has done something similar; secondly, it provides an answer scheme, of a sort: I can check my answers with the answers in the ~~back of the book~~ published paper. I’m also hoping that by working through the exercise, I can start to frame my own questions and start testing my own assumptions.

So… let’s finally get started. The data I’ll be using for now is pulled from my scraper of the FormulaOne website (personal research purposes, blah, blah, blah;-). If you’re following along, from the Download button grab the whole SQLite database and save it into whatever directory you’re going to be working in in R…

…because this is an R-based replication… (I use RStudio for my R tinkerings; if you need to change directory when you’re in R, `setwd('~/path/to/myfiles')`).

To load the data in, and have a quick peek at what’s loaded, I use the following recipe:

f1 = dbConnect(drv ="SQLite", dbname="f1com_megascraper.sqlite") #There are a couple of ways we can display the tables that are loaded #Directly: dbListTables(f1) #Or via a query: dbGetQuery(f1, 'SELECT name FROM sqlite_master WHERE type="table"')

The last two commands each display a list of the names of the database tables that can be loaded in as separate R dataframes.

To load the data in from a particular table, we run a SELECT query on the database. Let’s start with grabbing the same data that the paper analysed – the results from 2009:

#Grab the race results from the raceResults table for 2009 results.race=dbGetQuery(f1, 'SELECT pos as racepos, race, grid, driverNum FROM raceResults where year==2009') #We can load data in from other tables and merge the results: results.quali=dbGetQuery(f2, 'SELECT pos as qpos, driverNum, race FROM qualiResults where year==2009') results.combined.clunky = merge(results.race, results.quali, by = c('race', 'driverNum'), all=T) #A more efficient approach is to run a JOINed query to pull in the data in in one go results.combined=dbGetQuery(f1, 'SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==2009 and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;' ) #Note: I haven't used SQL that much so there may be a more efficient way of writing this? #We can preview the response head(results.combined) # year qpos p3pos racepos race grid driverNum raceNum #1 2009 1 3 1 AUSTRALIA 1 22 1 #2 2009 2 6 2 AUSTRALIA 2 23 1 #3 2009 20 2 3 AUSTRALIA 20 9 1 #4 2009 19 8 4 AUSTRALIA 19 10 1 #5 2009 10 17 5 AUSTRALIA 10 7 1 #6 2009 5 1 6 AUSTRALIA 5 16 1 #We can look at the format of each column str(results.combined) #'data.frame': 338 obs. of 8 variables: # $ year : chr "2009" "2009" "2009" "2009" ... # $ qpos : chr "1" "2" "20" "19" ... # $ p3pos : chr "3" "6" "2" "8" ... # $ racepos : chr "1" "2" "3" "4" ... # $ race : chr "AUSTRALIA" "AUSTRALIA" "AUSTRALIA" "AUSTRALIA" ... # $ grid : chr "1" "2" "20" "19" ... # $ driverNum: chr "22" "23" "9" "10" ... # $ raceNum : int 1 1 1 1 1 1 1 1 1 1 ... #We can inspect the different values taken by each field levels(factor(results.combined$racepos)) # [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "3" "4" "5" "6" "7" "8" "9" #[20] "DSQ" "Ret" #We need to do a little tidying... Let's make integer versions of the positions #NA values will be introduced where there are no integers results.combined$racepos.int=as.integer(as.character(results.combined$racepos)) results.combined$grid.int=as.integer(as.character(results.combined$grid)) results.combined$qpos.int=as.integer(as.character(results.combined$qpos)) #The race classification does not give a numerical position for each driver ##(eg in cases of DSQ or Ret) although the tables are ordered #Let's use the row order for each race to give an actual numerical position to each driver for the race #(If we wanted to do this for practice and quali too, we would have to load the tables #in separately, number each row, then merge them.) require(reshape) results.combined=ddply(results.combined, .(race), mutate, racepos.raw=1:length(race)) #Now we have a data frame that looks like this: head(results.combined) # year qpos p3pos racepos race grid driverNum raceNum racepos.int grid.int qpos.int racepos.raw #1 2009 2 11 1 ABU DHABI 2 15 17 1 2 2 1 #2 2009 3 15 2 ABU DHABI 3 14 17 2 3 3 2 #3 2009 5 1 3 ABU DHABI 5 22 17 3 5 5 3 #4 2009 4 3 4 ABU DHABI 4 23 17 4 4 4 4 #5 2009 8 5 5 ABU DHABI 8 6 17 5 8 8 5 #6 2009 12 13 6 ABU DHABI 12 10 17 6 12 12 6 #If necessary we can force the order of the races factor levels away from alphabetical into race order results.combined$race=reorder(results.combined$race, results.combined$raceNum) levels(results.combined$race) # [1] "AUSTRALIA" "MALAYSIA" "CHINA" "BAHRAIN" "SPAIN" "MONACO" "TURKEY" # [8] "GREAT BRITAIN" "GERMANY" "HUNGARY" "EUROPE" "BELGIUM" "ITALY" "SINGAPORE" #[15] "JAPAN" "BRAZIL" "ABU DHABI"

We’re now in a position whereby we can start to look at the data, and do some analysis of it. The paper shows a couple of example scatterplots charting the “qualifying” position against the final race position. We can go one better in single line using the ggplot2 package:

require(ggplot2) g=ggplot(results.combined) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race) g

At first glance, this may all look very confusing, but just take your time, actually *look* at it, see it you can spot any patterns or emergent structure in the way the points are distributed. Does it look as if the points on the chart might have a story to tell? (If a picture saves a thousand words, and it takes five minutes to read a thousand words, the picture might actually take a couple of minutes to read get across the same information…;-)

First thing to notice: there are lots of panels – one for each race. The panels are ordered left to right, then down a row, left to right, down a row, etc, in the order they took place during the season. So you should be able to quickly spot the first race of the season (top left), the last race (at the end of the last/bottom row), and races in mid-season (middle of them all!)

Now focus on a single chart and thing about that the positioning of the points actually means. If the qualifying position was the same as the race position, what would you expect to see? That is, if the car that qualified first finished first; the car that qualified second finished second; the car that qualified tenth finished tenth; and so on. As the axes are incrementally ordered, I’d expect to see a straight line, going up at 45 degrees if the same space was given on each axis (that is, if the charts were square plots).

We can run a quick experiment to see how that looks:

#Plot increasing x and increasing y in step with each other ggplot() + geom_point(aes(x=1:20, y=1:20))

Looking over the race charts, some of the charts have the points plotted roughly along a straight line – Turkey and Abu Dhabi, for example. If the cars started the race roughly according to qualifying position, this would suggest the races may have been rather processional? On the other hand, Brazil is all over the place – qualifying position appears to bear very little relationship to where the cars actually finished the race.

Even for such a simple chart type, it’s amazing how much structure we might be able to pull out of these charts. If every one was a straight line, we might imagine a very boring season, full of processions. If every chart was all over the place, with little apparent association between qualifying and race positions, we might begin to wonder whether there was any “form” to speak of at all. (Such events may on the surface provide exciting races, at least at the start of the season, but if every driver has an equal, but random, chance of winning, it makes it hard to root for anyone in particular, makes it hard to root for an underdog versus a favourite, and so on.)

Again, we can run a quick experiment to see what things would look like if each car qualified at random and was placed at random at the end of the race:

expt2 = NULL #We're going to run 20 experiments (z) for (i in 1:20){ #Each experiment generates a random start order (x) and random finish order (y) expt=data.frame(x=sample(1:20, 20), y=sample(1:20, 20), z=i) expt2=rbind(expt2, expt) } #Plot the result, by experiment ggplot(expt2) + geom_point(aes(x=x, y=y)) + facet_wrap(~z)

There’s not a lot of order in there, is there? However, it is worth noting that on occasion it may look as if there is some sort of ordering (as in trial 4, for example?), purely by chance.

For what it’s worth, we can also tidy up the race plots a little to make them a little bit more presentable:

g = g + ggtitle( 'F1 2009' ) g = g + xlab('Qualifying Position') + ylab('Final Race Classification') g

Here’s the chart with added titles and axis labels:

We can also grab a plot for a single race. The paper showed scatterplots for Brazil and Japan.

Here’s the code:

g1 = ggplot(subset(results.combined, race == 'BRAZIL')) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race) g1 = g1 + ggtitle('F1 2009 Brazil') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw() g2 = ggplot(subset(results.combined, race=='JAPAN' ) )+geom_point(aes(x =qpos.int, y=racepos.raw)) + facet_wrap(~race) g2 = g2 + ggtitle('F1 2009 Japan') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw() require( gridExtra ) grid.arrange( g1, g2, ncol=2 )

Note that I have: i) filtered the data to get just the data required for each race; ii) added a little bit more styling with the `theme_bw()` call; iii) used the `gridExtra` package to generate the side-by-side plot.

Here’s what the data looked like in the original paper:

NOT the same… Hmmm… How about if we plot the *grid.int* position vs. the final position? (Set `x = grid.int` and tweak the `xlab()`):

So – the “qualifying” position in the published paper is actually the *grid* position? Did I not read the paper closely enough, or did they make a mistake? (So much for academic peer review stopping errors getting through. Like, erm, the error in the visualiastion of the confidence interval that also got through? ;-)

Generating similar views for other years *should* be trivial – simply tweak the year selector in the SELECT query and then repeat as above.

In the limit, I guess everything could be wrapped up in a single function to plot either the facetted view, or, if a list of explicitly named races are passed in to the function, as a set of more specific race charts. But that is left as an exercise for the reader…;-)

Okay – enough for now… am I *ever* going to get even as far as blogging the correlation analysis., let alone the regressions?! Hopefully in the next post in this series!

PS I couldn’t resist… here’s the summary set of charts for the 2012 season, in respect of grid positions vs. final race classifications:

Was it as you remember it?!! Were Australia and Malaysia all over the place? Was there a big crash amongst the front runners in Belgium?!

PPS Rather than swap this blog with too many posts on this topic, I intend to start a new uncourse blog and post them there in future… details to follow…

PPPS if you liked this post, you might like this Ergast API Interactive F1 Results data app

## Getting Started with F1 Betting Data

As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.

Courtesy of @flutterF1, I managed to get a peek of some betting data from one of the race weekends last year year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.

(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself it for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)

The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)

The data @flutterF1 showed me came in two spreadsheets. The first (filename convention `RACE Betfair Odds Race Winner.xlsx`) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention `RACE Bookie Odds.xlsx`) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.

Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets in to an R environment might be useful to them. The `gdata` library provides hooks for working with Excel documents, so I opted for that.

Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:

The US Betfair Odds Race Winner.xslx is a raw data collection with 5 columns….

1) The timestap (an annoying format but there is a reason for this albeit a pain to work with).

2) The driver.

3) The last price money was traded at.

4) the total amount of money traded on that driver so far.

5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.To reduce the amount of data I only record it when the price traded changes or if the amount changes.

Looking through the datafile, they appear to be some gotchas, so these need cleansing out:

Here’s my initial loader script:

library(gdata) xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1) colnames(xl)=c('dt','driver','odds','amount','racing') #Cleansing pass bf.odds=subset(xl,racing!='') str(bf.odds) 'data.frame': 10732 obs. of 5 variables: $ dt : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ... $ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ... $ odds : num 3.9 7 17 16.5 24 140 120 180 270 550 ... $ amount: num 1340 557 120 118 195 ... $ racing: int 0 0 0 0 0 0 0 0 0 0 ... #Generate a proper datetime field from the dt column #This is a hacked way of doing it. How do I do it properly? bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt)) #If we rerun str(), we get the following extra line in the results: # $ dtt : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...

Here’s what the raw data, as loaded, looks like to the eye:

Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:

#We're going to make use of the ggplot2 graphics library library(ggplot2) #Let's get a quick feel for bets around each driver g=ggplot(xl)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y") g=g+theme(axis.text.x=element_text(angle=-90)) g #Let's look in a little more detail around a particular driver within a particular time window g=ggplot(subset(xl,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y") g=g+theme(axis.text.x=element_text(angle=-90)) g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00'))) g

Here are the charts (obviously lacking in caption, tidy labels and so on).

Firstly, the odds by driver:

Secondly, zooming in on a particular driver in a particular time window:

That all seems to work okay, so how about the other spreadsheet?

#There are several sheets to choose from, named as follows: #Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3 #Load in data from a particular specified sheet race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race') #The datetime column appears to be in Excel datetime format, so cast it into something meaningful race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1")) #Note that I am not I checking for gotcha rows, though maybe I should...? #Use the directlabels package to help tidy up the display a little library(directlabels) #Let's just check we've got something loaded - prune the display to rule out the longshots g=ggplot(subset(race.odds,Bet365<30),aes(x=tTime,y=Bet365,group=Runner,col=Runner,label=Runner)) g=g+geom_line()+theme_bw()+theme(legend.position = "none") g=g+geom_dl(method=list('top.bumpup',cex=0.6)) g=g+scale_x_datetime(expand=c(0.15,0)) g

Here’s a view over the drivers’ odds to win, with the longshots pruned out:

With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:

#Let's see if we can also plot the odds by bookie colnames(race.odds) #[1] "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred" # [8] "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin" #[15] "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red" #[22] "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max" #[29] "Range" "tTime" #We can remove items from this list using something like this: tmp=colnames(race.odds) #tmp=tmp[tmp!='Range'] tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time'] #Then we can create a subset of cols race.odds.data=subset(race.odds,select=tmp) #Melt the data library(reshape) race.odds.data.m=melt(race.odds.data,id=c('tTime','Runner')) #head( race.odds.data.m) # tTime Runner variable value #1 2012-11-11 19:07:01 Sebastian Vettel (Red) Bet365 2.37 #2 2012-11-11 19:07:01 Lewis Hamilton (McL) Bet365 3.25 #3 2012-11-11 19:07:01 Fernando Alonso (Fer) Bet365 6.00 #... #Now we can plot how the different bookies compare g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable)) g=g+geom_line()+theme_bw()+theme(legend.position = "none") g=g+geom_dl(method=list('top.bumpup',cex=0.6)) g=g+scale_x_datetime(expand=c(0.15,0)) g

Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…

## F1Stats – A Prequel to Getting Started With Rank Correlations

I finally, finally made time to get started on my statistics learning journey with a look at some of the results in the paper A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing.

*Note that these notes represent a description of the things I learned trying to draw on ideas contained within the paper and apply it to data I had available. There may be errors… if you spot any, please let me know via the comments.*

The paper uses race classification data from the 2009 season, comparing F1 and NASCAR championships and claiming to explore the extent to which positions in practice and qualifying relate to race classification. I won’t be looking at the NASCAR data, but I did try to replicate the F1 stats which I’ll start to describe later on in this post. I’ll also try to build up a series of interactive apps around the analyses, maybe along with some more traditional print format style reports.

(There are so many things I want to learn about, from the stats themselves, to possible workflows for analysis and reporting, to interactive analysis tools that I’m not sure what order any of it will pan out into, or even the extent to which I should try to write separate posts about the separate learnings…)

As a source of data, I used my f1com megascraper that grabs classification data (along with sector times and fastest laps) since 2006. (The ergast API doesn’t have the practice results, though it does have race and qualifying results going back a long way, so it could be used to do a partial analysis over many more seasons). I downloaded the whole Scraperwiki SQLite database which I could then load into R and play with offline at my leisure.

The first result of note in the paper is a chart that claims to demonstrate the Spearman rank correlation between practise and race results, qualification and race results, and championship points and race results, for each race in the season. The caption to the corresponding NASCAR graphs explains the shaded region: “In general, the values in the grey area are not statistically significant and the values in the white area are statistically significant.” A few practical uses we might put the chart to come to mind (possibly!): is qualifying position or p3 position a good indicator of race classification (that is, is the order folk finish at the end of p3 a good indicator of the rank order in which they’ll finish the race?)?; if the different rank orders are not correlated, (people finish the race in a different order to the gird position), does this say anything about how exciting the race might have been? Does the “statistical significance” of the correlation value add anything meaningful?

So what is this chart trying to show and what, if anything, might it tell us of interest about the race weekend?

First up, the statistic that’s being plotted is *Spearman’s rank correlation coefficient*. There are four things to note there:

- it’s a
**coefficient**: a coefficent is a single number that tends to come in two flavours (often both at the same time). In a mathematical equation, a coefficient is typically a constant number that is used as multiplier of a variable. So for example, in the equation`t = 2 x`, the`x`variable has the coefficient`2`. Note that the coefficient may also be a parameter, as for example in the equation`y= a.x`(where the . means ‘multiply’, and we naturally read`x`as a dependent variable that is used to determine the value of`y`having been multiplied by the value of`a`). However, a coefficient may also be a particular number that characterises a particular relationship between two things. In this case, it characterises the degree of*correlation*between two things… - The refrain “correlation is not causation” is one heard widely around the web that mocks the fact that just because two things may be correlated – that is, when one thing changes, another changes in a similar way – it doesn’t necessarily mean that the way one thing changed
*caused*the other to change in a similar way as a result. (Of course, it*might*mean that…;-). (If you want to find purely incidental correlations between things, have a look at Google correlate, that identifies different search terms whose search trends over time are similar. You can even draw your own trend over time to find terms that have trended in a similar way.)**Correlation**, then, describes the extent to which two things tend to change over time in a similar way: when one goes up, the other goes up; when one goes down, the other goes down. (If they behave in opposite ways – if one goes up steeply the other goes down steeply; if one goes up gently, the other goes down gently – then they are*negatively*or*anti*-correlated).

Correlation measures require that you have*paired*data. You know you have paired data if you can plot your data as a two dimensional scatterplot and label each point with the name of the person or thing that was measured to get the two co-ordinate values. So for example, on a plot of qualifying position versus race classification, I can label each point with the name of the driver. The qualification position and race classification is paired data around the set of drivers. - A
is used to describe the extent to which two**rank**correlation coefficient*rankings*are correlated. Ranked orders do away with the extent to which different elements in a series differ from each other and just concentrate on their rank order. That is, we don’t care how much change there is in each of the data series, we’re just interested in rank positions within each series. In the case of an F1 race, the*distribution*of laptimes during qualifying may see the first few cars separated by a few thousandths of a second, but the time between the best laptimes of consecutively placed cars at the back of the grid might be of the order of tenths of a second. (*Distributions*take into account, for example, the variety and range of values in a dataset.) However, in a*rank ordered*chart, all we are interested in is the integer position: first, second, third, …, nineteenth, twentieth. There is no information about the magnitude of the actual time difference between the laptimes, that is, how far close to or far apart from each other the laptimes of consecutively placed cars were, we just know the rank order of fastest to slowest cars. The*distribution*of the rank values is not really very interesting, or subject to change, at all.

One thing that I learned that’s possibly handy to know when decoding the jargon: rank based stats are also often referred to as*non-parametric statistics*because no assumptions are made about how the numbers are*distributed*(presumably, there are no*parameters*of note relating to how the values are distributed, such as the mean and standard deviation of a “normal” distribution). If we think about the evolution of laptimes in a race, then most of them will be within a few tenths of the fastest lap each lap, with perhaps two bunches of slower lap times (in-lap and out-lap around a pitstop). The distribution of these lap times may be interesting (for example, the distribution of laptimes on a lap when everyone is racing will be different to the distribution of lap times on a lap when several cars are pitting). On the other hand, for each lap, the distribution of the rank order of laptimes during that lap will always be the same (first fastest, second fastest, third fastest, etc.). That is not to say, however, that the rank order of the drivers’ lap times does not change lap on lap, which of course it might do (Webber might be tenth fastest on lap 1, fastest on lap 2, eight fastest on lap three, and so on).

Of course, this being stats, “non-parametric” probably means lots of other things as well, but for now my rule of thumb will be:*the distribution doesn’t matter*(that is, the statistic does not make any assumptions about the distribution of the data in order for the statistic to work; which is to say, that’s one thing you don’t have to check in order to use the statistic (erm, I think…?!) - The statistic chosen was
. Three differently calculated correlation coefficients appear to be widely used, (and also appear as possible methods in the R**Spearman’s**rank correlation coefficient`corr()`function that calculates correlations between lists of numbers): i) Pearson’s product moment correlation coefficient (how well does a straight line through an x-y scatterplot of the data describe the relationship between the x and y values, and what’s the sign of its gradient); ii) Spearman’s rank correlation coefficient (also known as*Spearman’s rho*or r_{s}); [this interactive is rather nice and shows how Pearson and Spearman correlations can differ]; iii) Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on*concordance*, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series.). Other flavours of correlation coefficient are also available (for example,*Lin’s concordance correlation coefficient*, as demonstrated in this example of identifying a signature collapse in a political party’s electoral vote when the Pearson coefficient suggested it had held up, which I think is used to test for how close to a 45 degree line the x-y association between paired data points is…).

The statistical significance test is based around the “null hypothesis” that the two sets of results are not correlated; the result is significant if they are more correlated than you might expect if both are randomly ordered. this made me a little twitchy: wouldn’t it be equally valid to argue that F1 is a procession and we would expect the race position and grid position to be perfectly correlated, for example, and then define our test for significance on the extent to which they are not?

This comparison of Pearson’s product moment and Spearman’s rank correlation coefficients helped me get a slightly clearer view of the notion of “test” and how both these coefficients act as tests for particular sorts of relationship. The Pearson product moment coefficent has a high value if a strong linear relationship holds across the data pairs. The Spearman rank correlation is weaker, in that it is simply looking to see whether or not the relationship is monotonic (that is, things all go up together, or they all go down together, but the extent to which they do so need not define a linear relationship, which is an assumption of the Pearson test.). In addition, when defining the statistical significance of the test, this is dependent on particular assumptions about the distribution of the data values, at least in the case of the Pearson test. The statistical significance relates to how likely the correlation value was assuming a normal distribution in the values within the paired data series (that is, each series is assumed to represent a normally distributed set of values).

If I understand this right, it means we separate the two things out: on the one hand, we have the statistic (the correlation coefficient); on the other, we have the significance test, which tells you how likely that result us *given a set of assumptions about how the data is distributed*. A question that then comes to mind is this: is the definition of the statistic dependent on a particular distribution of the data in order for the statistic to have something interesting to say, or is it just the significance that relies on that distribution. To twist this slightly, if we can’t do a significance test, is the statistic then essentially meaningless (because we don’t know whether those values are likely to occur whatever the relationships (or even no relationship) between the data sets?). Hmm.. maybe a statistic is actually a measurement in the context of its significance given some sort of assumption about how likely it is to occur by chance?

As far as Spearman’s rank correlation coefficient goes, I was little bit confused by the greyed “not significant” boundary on the diagram shown above. The claim is that any correlation value in that grey area could be accounted for in many cases by random sorting. Take two sets of driver numbers, sort them both randomly, and much of the time the correlation value will fall in that region. (I suspect this is not only arbitrary but misleading? If you have random orderings, is the likelihood that the correlation is in the range -0.1 to 0.1 the same as the likelihood that it will be in the range 0.2 to 0.4? Is the probability distribution of correlations “uniform” across the +/- 1 range?) Also, my hazy vague memory is that the population size affects the confidence interval (see also Explorations in statistics: confidence intervals) – isn’t this the principle on which funnel plots are built? The caption to the figure suggests that the population size (the “degrees of freedom”) was different for different races (different numbers of drivers). *So why isn’t the shaded interval differently sized for those races?*

Something else confused me about the interval values used to denote the significance of the Spearman rho values – where do they come from? A little digging suggested that they come *from a table* (i.e. someone worked them out numerically, presumably by generating looking at the distribution of different random rank orderings, rather than algorithmically – I couldn’t find a formula to calculate them? I did find this on Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations by D Bonett (Psychometrika Vol. 65, No. 1, 23-28, March 2000) though). A little more digging suggested *Significance Testing of the Spearman Rank Correlation Coefficient* by J Zar (Journal of the American Statistical Association, Vol. 67, No. 339 (Sep., 1972), pp. 578- 580) as a key work on this, with some later qualification in *Testing the Significance of Kendall’s τ and Spearman’s r _{s}* by M. Nijsse (Psychological Bulletin, 1988, Vol. 103, No. 2,235-237).

Hmmm.. this post was supposed to be about running the some of the stats used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing on some more recent data. But I’m well over a couple of thousand words into this post and still not started that bit… So maybe I’ll finish now, and hold the actual number crunching over to the next post…

PS I find myself: happier that I (think I) understand a little bit more about the rationale of significance tests; just as sceptical as ever I was about the actual practice;-)