Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching

One of the reasons I dive into motorsport results and timing data every so often is that it gives me a quite limited set of data to play with. In turn, this means I have to get creative when it comes to reshaping the data to see what visuals I can pull out of it, as well as when generating derived datasets to see what other story forms and insights might be hidden in there.

One of the things I hope to do with the WRC data is push a bit more on automatically generating text-based race reports from the data. Part of the trick here is spotting patterns that can be mapped onto textual tropes, common sorts of phrase or sentence that you are likely to see in the more vanilla forms of sports reporting. (“X led the race from the start”, “Despite a poor start to the stage, Y went on to win it, N seconds ahead of Z in second place” and so on.)

So how can we spot the patterns? One way is to write a SQL query that detects a particular pattern in the data and uses that to flag a possible event (for example, Detecting Undercuts in F1 Races Using R). Another might be to cast the data as a graph and then detect features using graph based algorithms (eg Identifying Position Change Groupings in Rank Ordered Lists).

During the middle of last night, I woke up wondering whether or not it would be possible to cast simple feature components as symbols and then use a regular expression pattern matcher to identify a particular sort of pattern from a symbolic string. So here’s a quick proof of concept…

From the WRC Monte Carlo 2017 rally, stage 3, some split times and rank positions at each split.

wrc_results_scraper10

Here’s a visual representation of the same (the number labels are rank position at each split; the y-axis is the delta to the fastest time recorded over that split, the “sector time”, if you will, derived from the original results data).

wrc_results_scraper_and_mentions8

For each driver, you may be able to spot several shapes. For example, Ogier is way behind at the first split, but then gains over the rest of the stage, Meeke and Breen lose time at the second split, Hanninen loses it on the final part of the stage, and so on. Can we code for these different patterns, and then detect them?

wrc_results_scraper11

So that seems to work okay… Now all I need to do is come up with some suitable symbolic encodings and pattern matching strings…
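To make the idea concrete, here’s a minimal sketch (in Python, with made-up gap values and an illustrative symbol alphabet – not the encoding used above) of how rebased split gaps might be turned into symbol strings and matched against trope patterns:

```python
import re

def encode(gaps):
    """Encode successive rebased split gaps as symbols: 'g' where the
    driver gains on the benchmark between splits, 'l' where they lose
    time, '=' where the gap is unchanged."""
    out = []
    for prev, cur in zip(gaps, gaps[1:]):
        out.append('g' if cur < prev else 'l' if cur > prev else '=')
    return ''.join(out)

# Made-up gaps (seconds behind the stage-best) at each split
gaps = {
    'OGI': [10.2, 7.1, 4.0, 0.0],  # way behind early, gains over the rest
    'MEE': [0.0, 3.5, 3.5, 3.4],   # loses time at a split, then steadies
}

# Regular expressions map symbol strings onto textual tropes
patterns = {
    'gained throughout the stage': r'g+',
    'lost time early, then held or gained': r'l+[g=]+',
}

for driver, series in gaps.items():
    s = encode(series)
    for trope, pat in patterns.items():
        if re.fullmatch(pat, s):
            print(driver, '->', trope)
```

A richer alphabet (for example, quantising the size of each gain or loss) would support correspondingly richer patterns.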

Hmmm… Vague memories… I wonder if there are any symbolic dynamics algorithms or finite state machine grammar parsers I could make use of?

Tata F1 Connectivity Innovation Prize, 2015 – Mood Board Notes

It’s probably the wrong way round to do this – I’ve already done an original quick sketch, after all – but I thought I’d put together a mood board collection of images and design ideas relating to the 2015 Tata F1 Connectivity Innovation Prize to see what else is current in the world of motorsport telemetry display design just in case I do get round to entering the competition and need to bulk up the entry a bit with additional qualification…

First up, some imagery from last year’s challenge brief – an F1 timing screen; note the black background and the use of a particular colour palette:

In the following images, click through on the image to see the original link.

How about some context – what sort of setting might the displays be used in?

20140320-img1rsr_HotSpot full

Aston-Martin-Racing-Bahrain-25-580x435image1.img.640.medium

From flatvision’s case study of the Lotus F1 pit wall, basic requirements include:

  • sunlight readable displays to be integrated within a mobile pit wall;
  • a display bright enough to be viewed in all light conditions.

The solution included ‘9 x 24” Transflective sunlight readable monitors, featuring a 1920×1200 resolution’. So that gives some idea of the real estate available per screen.

So how about some example displays…

The following seems to come from a telemetry dash product:

telemetry_main_screen

There’s a lot of text on that display, and what also looks like timing screen info about other cars. The rev counter uses bar segments that increase in size (something I’ve seen on a lot of other car dashboards). The numbers are big and bold, with units identifying what each value relates to.

The following chart (the engineer’s view from something called The Pitwall Project) provides an indication of tyre status in the left hand column, with steering and pedal indicators (i.e. driver actions) in the right hand column.

engineerinaction

Here’s a view (from an unknown source) that also displays separate tyre data:

index

 

Another take on displaying the wheel layout and a partial gauge view in the right hand column:

GAUGE track

Or how about relating tyre indicator values even more closely to the host vehicle?

file-1544-005609

This Race Technology Monitor screen uses a segmented bar for the majority of numerical displays. These display types give a quantised view, compared with a continuously varying numerical indicator. The display also shows historical traces, presumably of the corresponding quantity?

telem1

The following dashes show a dial rich view compared to a more numerical display:

download (1) download

The following sample dash seems to be a recreation for a simulation game? Note the spectrum coloured bar that has a full range outline, and the use of colour in the block colour background indicators. Note also the combined bar and label views (the label in the mid-point of the bar, which means it may have to straddle two differently coloured backgrounds).

RealTimeTelemetry2.3maxresdefault

The following Sim Racing Data Analysis display uses markers on the bars to identify notable values – max and min, perhaps? Or max and optimal?

data-analysis

 

It also seems like there are a few mobile apps out there doing dashboard style displays – this one looks quite clean to me and demonstrates a range of colour and border styles:

unnamed unnamed (2)screen480x480  unnamed (1)

 

Here’s another app – and a dial heavy display style:

unnamed (3)

Finally, some views on how to capture the history of the time series. The first one is health monitoring data – as you’d expect from health-monitoring related displays, it’s really clean looking:

FIA-seeking-real-time-human-telemetry-for-F1

 

I’m guessing the time-based trace goes left to right, but for our purposes, streaming the history from right to left, with the numerical indicator essentially bleeding into the line chart display, could work?

This view shows a crude way of putting several scales onto one line chart area:

telemetory

 

This curiosity is from one of my earlier experiments – “driver DNA”. For each of the four bands, lap count is on the vertical axis, distance round lap on the horizontal axis. The colour is the indicator value. The advantage of this is that you see patterns lap on lap, but the resolution of the most current value in a realtime trace might be hard to see?

china_race_data_gears

 

Some time ago, The Still design agency posted a concept pitwall for McLaren Mercedes. The images are still lurking in the Google Image search cache, and include example widgets:

widgets

and an example tyre health monitor display:

tyre-health

To my eye, those numbers are too far apart (the display is too wide and likely occluded by the animated line charts), and the semantics of the streaming are unclear (if the stream flows from the number, new numbers will come in at the left for the left hand tyres and from the right for the right hand ones?).

And finally, an example of a typical post hoc race data capture analysis display/replay.

maxresdefault (1)

Where do you start to look?!

PS in terms of implementation, a widget display seems sensible. Something like freeboard looks like it could provide a handy prototyping tool, or something like the RShiny dashboard backed up by JSON streaming support from jsonlite and HTML widgets wrapped by htmlwidgets.

Keeping Track of an Evolving “Top N” Cutoff Threshold Value

In a previous post (Charts are for Reading), I noted how it was difficult to keep track of which times in an F1 qualifying session had made the cutoff time as a qualifying session evolved. The problem can be stated as follows: in the first session, with 20 drivers competing, the 15 drivers with the best ranked laptimes will make it into the next session. Each driver can complete zero or more timed laps, with drivers completing laps in any order.

Finding the 15 drivers who will make the cutoff is therefore not simply a matter of ranking the best 15 laptimes at any point, because the same 5 drivers, say, may each record 3 fast laptimes, thus taking up the 15 slots that record the 15 fastest laptimes.

If we define a discrete time series with steps corresponding to each recorded laptime (from any driver), then at each time step we can find the best 15 drivers by finding each driver’s best laptime to date and ranking by those times. Conceptually, we need something like a lap chart which uses a ‘timed lap count’ rather than race lap index to keep track of the top 15 cars at any point.

example_fia_lapchart

At each index step, we can then find the laptime of the 15th ranked car to find the current session cutoff time.

In a dataframe that records laptimes in a session by driver code for each driver, along with a column that contains the current purple laptime, we can arrange the laptimes by cumulative session laptime (so the order of rows follows the order in which laptimes are recorded) and then iterate through those rows one at a time. At each step, we can summarise the best laptime recorded so far in the session for each driver.

library(plyr)  #for arrange() and ddply()

#cutoffvals gives the cutoff count for each session, eg cutoffvals=c(Q1=15,Q2=10)
df=arrange(df,cuml)
dfc=data.frame()
for (r in 1:nrow(df)) {
  #summarise best laptime recorded so far in the session for each driver
  dfcc=ddply(df[1:r,],.(qsession,code),summarise,dbest=min(stime))
  #Keep track of which session we are in
  session=df[r,]$qsession
  #Rank the best laptimes for each driver to date in the current session
  #(Really should filter by session at the start of this loop?)
  dfcc=arrange(dfcc[dfcc['qsession']==session,],dbest)
  #The different sessions have different cutoffs: Q1, top 15; Q2, top 10
  n=cutoffvals[df[r,]$qsession]
  #if we have at least as many driver best times recorded as the cutoff number
  if (nrow(dfcc) >=n){
    #Grab a record of the current cut-off time
    #along with info about each recorded laptime
    dfc=rbind(dfc,data.frame(df[r,]['qsession'],df[r,]['code'],df[r,]['cuml'],dfcc[n,]['dbest']) )
  }
}
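Stripped of the dataframe machinery, the same book-keeping amounts to a simple online algorithm: keep each driver’s best time so far and, after each lap, read off the nth-ranked best. A Python sketch, with made-up laptimes and a toy top-2 cutoff:

```python
def cutoff_evolution(laps, n=15):
    """laps: (driver, laptime) tuples in the order the laps are recorded.
    After each lap, yield the nth-ranked 'driver best time so far'
    (the current cutoff), or None until n drivers have set a time."""
    best = {}
    for driver, t in laps:
        # Update the driver's best time in the session so far
        if t < best.get(driver, float('inf')):
            best[driver] = t
        ranked = sorted(best.values())
        yield ranked[n - 1] if len(ranked) >= n else None

# Toy session with a top-2 cutoff and made-up laptimes
laps = [('HAM', 92.1), ('ROS', 93.0), ('VET', 92.5), ('ROS', 92.3)]
print(list(cutoff_evolution(laps, n=2)))  # [None, 93.0, 92.5, 92.3]
```

Note how ROS’s second lap improves the cutoff even though it does not change the order at the front.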

We can then plot the evolution of the cut-off time as the sessions proceed. The chart in its current form is still a bit hard to parse, but it’s a start…

qualicutoff

In the above sketch, the lines connect the current purple time and the current cut-off time in each session (aside from the horizontal line, which represents the cut-off time at the end of the session). This gives a false impression of the evolution of the cutoff time – really, the line should be a stepped line that traces the current cut-off time horizontally until it is improved, at which point it should step vertically down. (In actual fact, the line does track horizontally as laptimes are recorded that do not change the cutoff time, as indicated by the horizontal tracks in the Q1 panel as the grey times (laptimes slower than a driver’s best time in the session so far) are completed.)

The driver labels are coloured according to: purple – current best in session time; green – driver best in session time to date (that wasn’t also purple); red – driver’s best time in session that was outside the final cut-off time. This colouring conflates two approaches to representing information – the purple/green colours represent online algorithmic processing (if we constructed the chart in real time from laptime data as laps we completed, that’s how we’d colour the points), whereas the red colouring represents the results of offline algorithmic processing (the colour can only be calculated at the end of the session when we know the final session cutoff time). I think these mixed semantics contribute to making the chart difficult to read…

In terms of what sort of stories we might be able to pull from the chart, we see that in Q2, Hulkenberg and Sainz were only fractions of a second apart, and Perez looked like he had squeezed in to the final top 10 slot until Sainz pushed him out. To make it easier to see which times contribute to the top 10 times, we could use font weight (eg bold font) to highlight each drivers session best laptimes.

To make the chart easier to read, I think each improvement to the cutoff time should be represented by a faint horizontal line, with a slightly darker line tracing the evolution of the cutoff time as a stepped line. This would allow us to see which times were within the cutoff time at any point.

I also wonder whether it might be interesting to generate a table a bit like the race lap chart, using session timed lap count rather than race lap count, perhaps with additional colour fields to show which car recorded the time that increased the lap count index, and perhaps also where in the order that time fell if it didn’t change the order in the session so far. We could also generate online and offline differences between each laptime in the session and the current cutoff time (online algorithm) as well as the final overall session cutoff time (offline algorithm).

[As and when I get this chart sorted, it will appear in an update to the Wrangling F1 Data With R lean book.]

Charts are for Reading…

If charts are pictures, and every picture not only tells a story, but also saves a thousand words in doing so, how then are we to actually read them?

Take the following example, a quick #f1datajunkie sketch showing how the Bahrain 2015 qualifying session progressed. The chart is split into three, one for each part of qualifying (which we might refer to as fractional sessions), which already starts to set the scene for the story. The horizontal x-axis is the time in seconds into qualifying at which each laptime is recorded, indexed against the first laptime recorded in qualifying overall. The vertical y-axis records laptimes in seconds, limited to 107% of the fastest laptime recorded in a particular session. The green colour denotes a driver’s fastest laptime recorded in each fractional session, purple the overall fastest laptime recorded so far in a fractional session (purple trumps green). So again, the chart is starting to paint a picture.

bah_2105_q_session_times

An example of the sort of analysis that can be provided for a qualifying session can be found in a post by Justin Hynes, Lewis Hamilton seals his first Bahrain pole but Vettel poses the menace to Mercedes’ hopes, that appeared on the James Allen on F1 blog. In this post, I’ll try to match elements of that analysis with things we can directly see in the chart above…

[Hamilton] finish[ed] 0.411s clear of Ferrari’s Sebastian Vettel and more than half a second in front of his Mercedes team-mate Nico Rosberg

We don’t get the time gap exactly from the chart, but looking to the rightmost panel (Q3), finding the lowest vertical marks for HAM, VET and ROS, and imagining a horizontal line across to the y-axis, we get a feeling for the relative gaps.

Q1 got underway in slightly calmer conditions than blustery FP3 and Raikkonen was the first to take to the track, with Bottas joining the fray soon after. The Williams driver quickly took P1 but was then eclipsed by Rosberg, who set a time of 1:35.657 on the medium tyres.

Q1 is the leftmost panel, in which we see RAI setting the first representative laptime at least (within the 107% limit of the session best overall), followed by BOT and then ROS improving on the early purple times.

The Mercedes man was soon joined in the top five by soft-tyre runners Nico Hulkenberg and Felipe Nasr.

HUL and NAS appear around the 300 cuml (cumulative laptime) mark. We note that PER is there in the mix too, but is not mentioned explicitly in the report.

In the closing stages of the session those in the danger zone were Max Verstappen, Pastor Maldonado and Will Stevens and Roberto Merhi.

On the right hand side of the chart, we see laps at the end of the session from MAL and VES (and, way off the pace, STE). One problem with the chart as styled above (showing cumulative best times in the session) is that it is hard to see what a driver’s best session time overall actually is. (We could address this by perhaps displaying a driver’s session best time using a bold font.) The chart is also very cluttered around the cutoff time, which makes it hard to see clearly who got through and who didn’t. And we don’t really know where the danger zone is because we have no clear indication of what the best 15 drivers’ times are – and hence, where the evolving cut-off time is…

Verstappen found the required pace and scraped into Q2 with a time of 1:35.611. Maldonado, however, failed to make it through, his best lap of 1:35.677 only being good enough for P16.

Verstappen’s leap to safety also pushed out Daniil Kvyat, with the Russian putting in a disappointing final lap that netted him P17 behind the Lotus driver. Hulkenberg was the last man through to Q2, the Force India driver’s 1:35.653 seeing him safely through with just two hundredths of a second in hand over Maldonado…

With an evolution of the cutoff time, and a zoom around the final cutoff time, we should be able to see what went on rather more clearly.

At the top of the order, Hamilton was quickest, finishing a tenth in front of Bottas. Rosberg was third, though he finished the session close on half a second down on his team-mate.

Felipe Massa was fourth for Williams, ahead of Raikkonen, Red Bull’s Daniel Ricciardo and Sebastian Vettel, who completed just three laps in the opening session. All drivers set their best times on the soft tyre.

This information can be quite clearly seen on the chart – aside from the tyre data which is not made available by the FIA.

The following description of Q2 provides quite a straightforward reading of the second panel of the chart.

In the second session, Rosberg initially set the pace but Hamilton quickly worked his way back to the top of the order, his first run netting a time of 1:32.669. Rosberg was also again eclipsed by Massa who set a time three tenths of a second quicker than Rosberg’s.

The last to set an opening time were the Ferraris of Raikkonen and Vettel, though both rapidly staked a claim on a Q3 berth with the Finn in P2 and the German in P4.

Most of the front runners opted to rely on their first run to see them through and in the closing stages those in the drop zone were Hulkenberg, Force India team-mate Sergio Perez, Nasr, Sauber team-mate Ericsson and McLaren’s Fernando Alonso.

However, the chart does not clearly show how ROS’ early purple time was challenged by BOT, or how MAS’ early pace was challenged mid-way through the session by VET and RAI.

Hulkenberg was the man to make the big move, claiming ninth place in Q2 with a time of 1:34.613. Behind him Toro Rosso’s Carlos Sainz scraped through in P10, six hundredths of a second clear of 11th-placed Sergio Perez. The Mexican was followed by Nasr and Ericsson. Alonso claimed P14, while 15th place went to the unfortunate Verstappen, who early in the session had reported that he was down on power.

Again, this reading of the chart would be aided by an evolving cut-off time line.

Looking now to the third panel…

The first runs in Q3 saw Hamilton in charge again, with the champion setting a time of 1:33.552 on used softs to take P1 three tenths of a second ahead of Red Bull’s Ricciardo, who prior to Hamilton’s lap had claimed the fastest S3 time of the session using new soft tyres.

Rosberg, also on used softs, was third, four thousandths of a second down on the Australian’s time. Hulkenberg, with just one new set of softs at his disposal, opted to sit out the first run.

The chart clearly shows the early and late session runs, and is reflected in the analysis:

In the final runs, Vettel was the first of the likely front-row men across the line and with purple times in S1 and S2, the German set a provisional pole time of 1:32.982. It was a superb lap but Hamilton was already running faster, stealing the S1 purple time from the German.

Ahead of the champion on track, Rosberg had similarly taken the best S2 time but he could not find more pace and when he crossed the line he slotted into third, four hundredths [??] of a second behind Vettel.

So what does Justin Hynes’ qualifying session commentary tell us about how we might be able to read the charted summary of the session? And how can we improve the chart to help draw out some of the stories? A couple of things jump out for me – firstly, the evolving purple and green times can be confusing, and are perhaps better placed (for a summary reading of the session) by best in session purple/green times; secondly, the evolution of the cut-off times would help to work out where drivers were placed at different stages of qualifying and what they still had to do – or whether a best-time-so-far recorded by a driver earlier in the session was bumped by the cutoff evolution. Note that the purple time evolution is identified implicitly by the lower envelope of the laptimes in each session.

Tools in Tandem – SQL and ggplot. But is it Really R?

Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it.

So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:

#Set up a connection to a local copy of the ergast database
library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Run a query
q='SELECT code, grid, year, COUNT(l.lap) AS Laps 
    FROM (SELECT grid, raceId, driverId from results) rg,
        lapTimes l, races r, drivers d 
    WHERE rg.raceId=l.raceId AND d.driverId=l.driverId
          AND rg.driverId=l.driverId AND l.position=1 AND r.raceId=l.raceId 
    GROUP BY grid, driverRef, year 
    ORDER BY year'

driverlapsledfromgridposition=dbGetQuery(ergastdb,q)

In this case, the data is a table that shows, for each year, a count of laps led by each driver given their grid position in the corresponding races (null values are not reported). The data grabbed from the database is passed into a dataframe in a relatively tidy format, from which we can easily generate a visualisation.

lapsled_demo

The chart I have opted for is a text plot faceted by year:

lapsLed-driverlapsledbygrid-1

The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to minimise overlap. The horizontal grid is actually a logarithmic scale, which “stretches out” the positions at the front of the grid (grid positions 1 and 2) compared to positions lower down the grid – where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lay along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The x-axis is further expanded to mitigate against labels being obfuscated or overflowing off the left hand side of the plotting area. The clean black and white theme finishes off the chart.

g = ggplot(driverlapsledfromgridposition)
g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed')
g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps), angle=45))
g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size=FALSE)
g + scale_x_log10(expand=c(0,0.3)) + theme_bw()

There are still a few problems with this graphic, however. The order of labels on the y-axis is in alphabetical order, and would perhaps be more informative if ordered to show championship rankings, for example.

However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as a medium within which this activity has taken place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot library.

So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an R based analysis?

For more details, see the “Laps Completed and Laps Led” draft chapter and the Wrangling F1 Data With R book.

Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

C_t =  \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}

where C_t is the churn in team standings for year t, \left|f_{i,t} - f_{i,t-1}\right| is the absolute value of the i-th team’s change in finishing position going from season t-1 to season t, and N is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, C_t, by the maximum churn, C_max. The value of the maximum churn depends on whether there is an even or odd number of competitors:

C_{max} = N/2 \text{, for even N}

C_{max} = (N^2 - 1) / 2N \text{, for odd N}

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, f_{i,t} is the position of driver i at the end of race t, f_{i,t-1} is the starting position of driver i at the beginning of that race (that is, race t) and N is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an “adjusted churn” in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn functions of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
  if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

  • comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
  • churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
  • churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid, 
                position, positionText, positionOrder
          FROM results rs JOIN drivers d JOIN races r
          ON rs.driverId=d.driverId AND rs.raceId=r.raceId
          WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
            churn = churn(delta),
            adjchurn = adjchurn(delta)
            )
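As a quick sanity check on those churn definitions, here is a Python transcription of the same functions with a small worked example (five drivers with made-up position changes |delta| = 1, 1, 0, 2, 0):

```python
def churn_max(n):
    """Maximum churn: N/2 for even N, (N^2 - 1)/(2N) for odd N."""
    return n / 2 if n % 2 == 0 else (n * n - 1) / (2 * n)

def churn(deltas):
    """Mean absolute change in finishing position."""
    return sum(deltas) / len(deltas)

def adj_churn(deltas):
    """Churn normalised into the range 0..1 by the maximum churn."""
    return churn(deltas) / churn_max(len(deltas))

# Worked example: five drivers, position changes |delta| = 1, 1, 0, 2, 0
deltas = [1, 1, 0, 2, 0]
print(churn(deltas))                # 0.8
print(churn_max(len(deltas)))       # 2.4  (odd N: (25 - 1) / 10)
print(round(adj_churn(deltas), 3))  # 0.333
```

Note that churn(deltas) assumes the position changes are already absolute values, as in the R version above.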

For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

Identifying Position Change Groupings in Rank Ordered Lists

The title says it all, doesn’t it?!

Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.

poschanges

The question I had in mind was – how can I identify positions that are being contested during a particular window of time, where by contested I mean that the particular position was held by more than one person in a particular window of time?

Let’s zoom in to look at a couple of particular steps.

poschangeGroup

We see distinct groups of individuals who swap positions with each other between those two consecutive steps, so how can we automatically detect the positions that these drivers are fighting over?

A solution given to a Stack Overflow question on how to get disjoint sets from a list in R gives what I thought was a really nice solution: treat it as a graph, and then grab the connected components.

Here’s my working of it. Start by getting a list of results that show a particular driver held different positions in the window selected – each row in the original dataframe identifies the position held by a particular driver at the end of a particular lap:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Get a race identifier for a specific race
raceId=dbGetQuery(ergastdb,
                  'SELECT raceId FROM races WHERE year="2012" AND round="1"')

q=paste('SELECT * FROM lapTimes WHERE raceId=',raceId[[1]])

lapTimes=dbGetQuery(ergastdb,q)
lapTimes$position=as.integer(lapTimes$position)

library(plyr)

#Sort by lap first just in case
lapTimes=arrange(lapTimes,driverId,lap)

#Create a couple of new columns
#pre is previous lap position held by a driver given their current lap
#ch is position change between the current and previous lap
tlx=ddply(lapTimes,.(driverId),transform,pre=(c(0,position[-length(position)])),ch=diff(c(0,position)))

#Find rows where there is a change between a given lap and its previous lap
#In particular, focus on lap 17
llx=tlx[tlx['ch']!=0 & tlx['lap']==17,c("position","pre")]

llx

This filters the complete set of data to just those rows where there is a difference between a driver’s current position and previous position (the first column in the result just shows row numbers and can be ignored).

##      position pre
## 17          2   1
## 191        17  18
## 390         9  10
## 448         1   2
## 506         6   4
## 719        10   9
## 834         4   5
## 892        18  19
## 950         5   6
## 1008       19  17

We can now create a graph in which nodes represent positions (position or pre values) and edges connect a current and previous position.

#install.packages("igraph")
#http://stackoverflow.com/a/25130575/454773
library(igraph)

posGraph = graph.data.frame(llx)

plot(posGraph)

The resulting graph is split into several components:

posgraph

We can then identify the connected components:

posBattles=split(V(posGraph)$name, clusters(posGraph)$membership)
#Find the position change battles
for (i in 1:length(posBattles)) print(posBattles[[i]])

This gives the following clusters, and their corresponding members:

## [1] "2" "1"
## [1] "17" "18" "19"
## [1] "9"  "10"
## [1] "6" "4" "5"

To generalise this approach, I think we need to do a couple of things:

  • allow a wider window within which to identify battles (so look over groups of three or more consecutive laps);
  • simplify the way we detect position changes for a particular driver; for example, if we take the set of positions held by a driver within the desired window, if the cardinality of the set (that is, its size) is greater than one, then we have had at least one position change for that driver within that window. Each set of size > 1 of unique positions held by different drivers can be used to generate a set of distinct, unordered pairs that connect the positions (I think it only matters that they are connected, not that a driver specifically went from position x to position y going from one lap to the next?). If we generate the graph from the set of distinct unordered pairs taken across all drivers, we should then be able to identify the contested/driver change position clusters.
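That generalisation might be sketched as follows (a Python sketch over hypothetical position data, using a small union-find in place of igraph): take each driver’s set of positions over the window, emit unordered pairs linking those positions, and read off the connected components.

```python
from itertools import combinations

def battle_groups(window_positions):
    """window_positions: dict of driver -> positions held in the window.
    Returns the sets of positions linked by drivers holding more than one."""
    parent = {}

    def find(x):
        # Union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for positions in window_positions.values():
        # Distinct unordered pairs over the unique positions a driver held;
        # drivers with a single position contribute nothing
        for a, b in combinations(set(positions), 2):
            union(a, b)

    groups = {}
    for pos in parent:
        groups.setdefault(find(pos), set()).add(pos)
    return sorted(groups.values(), key=min)

# Hypothetical positions held by each driver over a three-lap window
window = {'A': [1, 2, 1], 'B': [2, 1, 2], 'C': [3, 3, 3],
          'D': [4, 5, 5], 'E': [5, 4, 4]}
print(battle_groups(window))  # [{1, 2}, {4, 5}]
```

Position 3, held by only one driver throughout, never appears in the output, which matches the cardinality-greater-than-one test described above.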

Hmm… I need to try that out… And when I do, if and when it works(?!), I’ll add a complete demonstration of it – and how we might make use of it – to the Wrangling F1 Data With R book.

F1 Doing the Data Visualisation Competition Thing With Tata?

Sort of via @jottevanger, it seems that Tata Communications has announced the first challenge in the F1® Connectivity Innovation Prize: to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post Tata launches F1® Connectivity Innovation Prize dated “10 Jun 2014”? What’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Connectivity Innovation Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:

– Position
– Car number
– Driver’s name
– Fastest lap time
– Gap to the leader’s fastest lap time
– Sector 1 time for the current lap
– Sector 2 time for the current lap
– Sector 3 time for the current lap
– Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…

f1-innovation-prize_s3_amazonaws_com_challenge_packs_The_F1_Connectivity_Innovation_Prize_–_Challenge_1_Brief_pdf

A technical manual is also provided to help make sense of the data files.

Basic_Timing_Data_Protocol_Overview_pdf__page_1_of_15_

Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

...
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>
...

Then qualifying:

...
<transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction>
<transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction>
<transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction>
<transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction>
<transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction>
<transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction>
...

Then the race:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

We can parse the datafiles in Python using an approach something like the following:

from lxml import etree

pl=[]
#Each line of the datafile contains a single <transaction> element
for xml in open(xml_doc, 'r'): #xml_doc is the path to one of the datafiles
    pl.append(etree.fromstring(xml))

pl[100].attrib
#{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'}

pl[100][0].attrib
#{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}

A few things are worth mentioning about this format… Firstly, the identifier identifies the message type, rather than the message: each transaction message appears instead to be uniquely identified by the messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.

Secondly, within a session, several screen views are possible, each associated with data labelled with a particular identifier. This means the data feed is essentially powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.
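To make that concrete, here's a minimal sketch of how the cell-based events build up a screen state; the plain dicts stand in for the attributes of parsed transaction elements (a hypothetical simplification of the real feed):

```python
# Replay cell-based events into per-screen state, keyed by the message
# identifier (which screen/page) and then by (row, column) cell co-ordinates.
def replay(transactions):
    screens = {}
    for t in transactions:
        cell = (int(t['row']), int(t['column']))
        screens.setdefault(t['identifier'], {})[cell] = (t['value'], t['colour'])
    return screens

feed = [
    {'identifier': '101', 'row': '15', 'column': '2', 'value': '14', 'colour': 'RED'},
    {'identifier': '101', 'row': '15', 'column': '3', 'value': 'F. ALONSO', 'colour': 'WHITE'},
    {'identifier': '101', 'row': '15', 'column': '2', 'value': '15', 'colour': 'WHITE'},
]
state = replay(feed)
print(state['101'][(15, 2)])  # latest update to the cell wins: ('15', 'WHITE')
```

The state of a row at any moment is whatever the most recent updates to each of its cells happen to have left behind, which is exactly why there is no single row-level record to be found in the feed.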

It’s not obvious how we can construct a row based transaction update, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above, for timestamp=”14:57:11.681″ and row=”3″. If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the *pandas* DataFrame .align() method.

[I think I’m thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now… Should have looked more closely the first time… Doh!]
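A speculative sketch of that cursor idea (it assumes, as suggested above, that updates to a row arrive in increasing column order, which I haven't verified across the whole feed): keep the last column seen for each row, and treat a column number that doesn't increase as the start of a new update cycle for that row.

```python
# Track a per-row cursor over (timestamp, row, column) updates and flag
# the start of each new update cycle for a row.
def row_cycles(updates):
    cursor = {}   # row -> last column updated
    cycles = []   # (row, timestamp) marking the start of each new cycle
    for ts, row, col in updates:
        if col <= cursor.get(row, 0):
            cycles.append((row, ts))
        cursor[row] = col
    return cycles

updates = [
    ('14:57:11.681', 3, 2), ('14:57:11.681', 3, 3), ('14:57:11.681', 3, 4),
    ('14:58:40.113', 3, 2),  # column drops back to 2: new cycle for row 3
]
print(row_cycles(updates))  # [(3, '14:58:40.113')]
```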

Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect that they are likely to be, which means it would be safer to align, or merge, on the basis of the timestamp and the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth within which all changes to a row are likely to occur, we could perhaps round times as the basis for the join?

Even with a bundling, we still don’t have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point in time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of the session, and then stop updating it at the time we were interested in:

#Hacky load and parse of each row in the datafile
import datetime
import pandas as pd
from lxml import etree

pl=[]
for xml in open('data/F1 Practice.txt', 'r'):
    pl.append(etree.fromstring(xml))

#Dataframe for current state timing screen
df_practice_pos=pd.DataFrame(columns=[
    "timestamp", "time",
    "classpos",  "classpos_colour",
    "racingNumber","racingNumber_colour",
    "name","name_colour",
],index=range(50))

#Column mappings
practiceMap={
    '1':'classpos',
    '2':'racingNumber',
    '3':'name',
    '4':'laptime',
    '5':'gap',
    '6':'sector1',
    '7':'sector2',
    '8':'sector3',
    '9':'laps',
    '21':'sector1_best',
    '22':'sector2_best',
    '23':'sector3_best'
}

def parse_practice(p,df_practice_pos):
    if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib:
        if p[0].attrib['column'] not in ['10','21','22','23']:
            colname=practiceMap[p[0].attrib['column']]
            row=int(p[0].attrib['row'])-1
            df_practice_pos.loc[row,'timestamp']=p.attrib['timestamp']
            tt=p.attrib['timestamp'].replace('.',':').split(':')
            df_practice_pos.loc[row,'time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000)
            df_practice_pos.loc[row,colname]=p[0].attrib['value']
            df_practice_pos.loc[row,colname+'_colour']=p[0].attrib['colour']
    return df_practice_pos

for p in pl[:2850]:
    df_practice_pos=parse_practice(p,df_practice_pos)
df_practice_pos

(See the notebook.)

Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:

racedemo

A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.
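That driver-number-vs-classification extraction can be sketched in a few lines; as before, the dicts below are hypothetical stand-ins for the parsed transaction attributes, and the assumption (per the column mapping above) is that column 2 of the identifier 101 screen carries the driver number:

```python
# Collect (timestamp, row) pairs every time a given driver number is
# written into the driver number column (column 2) of the race screen.
def driver_positions(feed, driver_number):
    return [(t['timestamp'], int(t['row']))
            for t in feed
            if t['identifier'] == '101'
            and t['column'] == '2'
            and t['value'] == driver_number]

feed = [
    {'identifier': '101', 'timestamp': '14:57:11.681', 'row': '3', 'column': '2', 'value': '77'},
    {'identifier': '101', 'timestamp': '14:57:11.681', 'row': '3', 'column': '3', 'value': 'V. BOTTAS'},
    {'identifier': '101', 'timestamp': '15:03:09.101', 'row': '2', 'column': '2', 'value': '77'},
]
print(driver_positions(feed, '77'))  # [('14:57:11.681', 3), ('15:03:09.101', 2)]
```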

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out.
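The $set-against-a-(timestamp, row) key idea can be simulated in plain Python without a MongoDB instance, including the "within half a second" grouping, by rounding the timestamp to the nearest half second when building the document key (the half-second timewidth is an assumption about the feed, not something I've measured):

```python
# Simulate MongoDB-style $set upserts: build a document key from a rounded
# timestamp plus the row number, and merge fields into the matching
# document as feed elements arrive.
def bucket_key(timestamp, row):
    h, m, s_ms = timestamp.split(':')
    s = round(float(s_ms) * 2) / 2  # round seconds to the nearest 0.5s
    return (h, m, '%04.1f' % s, row)

def group_updates(feed):
    docs = {}
    for ts, row, field, value in feed:
        docs.setdefault(bucket_key(ts, row), {})[field] = value  # $set-like merge
    return docs

# Two feed elements a few milliseconds apart, both for row 3
feed = [
    ('14:57:11.681', 3, 'driverNum', '77'),
    ('14:57:11.686', 3, 'sector1_best', '35.4'),
]
docs = group_updates(feed)
print(docs)  # a single document with both fields merged under one key
```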

I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)

PS The R book has stalled so badly I’ve pushed what I’ve got so far to wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… When I get a couple of clear hours, I’ll try to push the stuff that’s there out onto leanpub as a preview…

F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification

[If this isn’t your thing, the posts in this thread will be moving to a new blog soon, rather than taking over OUseful.info…]

Following the roundabout tour of F1Stats – A Prequel to Getting Started With Rank Correlations, here’s a walk through of my attempt to replicate the first part of A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing. Specifically, a look at the correlation between various rankings achieved over a race weekend and the final race position for that weekend.

The intention behind doing the replication is two-fold: firstly, published papers presumably do “proper stats”, so by pinching the methodology, I can hopefully pick up some tricks about how you’re supposed to do this stats lark by apprenticing myself, via the web, to someone who has done something similar; secondly, it provides an answer scheme, of a sort: I can check my answers against the answers in the back of the book, that is, the published paper. I’m also hoping that by working through the exercise, I can start to frame my own questions and start testing my own assumptions.

So… let’s finally get started. The data I’ll be using for now is pulled from my scraper of the FormulaOne website (personal research purposes, blah, blah, blah;-). If you’re following along, from the Download button grab the whole SQLite database and save it into whatever directory you’re going to be working in, in R…

…because this is an R-based replication… (I use RStudio for my R tinkerings; if you need to change directory when you’re in R, setwd('~/path/to/myfiles')).

To load the data in, and have a quick peek at what’s loaded, I use the following recipe:

library(RSQLite)
f1 = dbConnect(drv ="SQLite", dbname="f1com_megascraper.sqlite")

#There are a couple of ways we can display the tables that are loaded
#Directly:
dbListTables(f1)
#Or via a query:
dbGetQuery(f1, 'SELECT name FROM sqlite_master WHERE type="table"')

The last two commands each display a list of the names of the database tables that can be loaded in as separate R dataframes.

To load the data in from a particular table, we run a SELECT query on the database. Let’s start with grabbing the same data that the paper analysed – the results from 2009:

#Grab the race results from the raceResults table for 2009
results.race=dbGetQuery(f1, 'SELECT pos as racepos, race, grid, driverNum FROM raceResults where year==2009')

#We can load data in from other tables and merge the results:
results.quali=dbGetQuery(f1, 'SELECT pos as qpos, driverNum, race FROM qualiResults where year==2009')
results.combined.clunky = merge(results.race, results.quali, by = c('race', 'driverNum'), all=T)

#A more efficient approach is to run a JOINed query to pull in the data in one go
results.combined=dbGetQuery(f1,
    'SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==2009 and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;' )

#Note: I haven't used SQL that much so there may be a more efficient way of writing this?

#We can preview the response
head(results.combined)
#  year qpos p3pos racepos      race grid driverNum raceNum
#1 2009    1     3       1 AUSTRALIA    1        22       1
#2 2009    2     6       2 AUSTRALIA    2        23       1
#3 2009   20     2       3 AUSTRALIA   20         9       1
#4 2009   19     8       4 AUSTRALIA   19        10       1
#5 2009   10    17       5 AUSTRALIA   10         7       1
#6 2009    5     1       6 AUSTRALIA    5        16       1

#We can look at the format of each column
str(results.combined)
#'data.frame':	338 obs. of  8 variables:
# $ year     : chr  "2009" "2009" "2009" "2009" ...
# $ qpos     : chr  "1" "2" "20" "19" ...
# $ p3pos    : chr  "3" "6" "2" "8" ...
# $ racepos  : chr  "1" "2" "3" "4" ...
# $ race     : chr  "AUSTRALIA" "AUSTRALIA" "AUSTRALIA" "AUSTRALIA" ...
# $ grid     : chr  "1" "2" "20" "19" ...
# $ driverNum: chr  "22" "23" "9" "10" ...
# $ raceNum  : int  1 1 1 1 1 1 1 1 1 1 ...

#We can inspect the different values taken by each field
levels(factor(results.combined$racepos))
# [1] "1"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"  
#[20] "DSQ" "Ret"

#We need to do a little tidying... Let's make integer versions of the positions
#NA values will be introduced where there are no integers
results.combined$racepos.int=as.integer(as.character(results.combined$racepos))
results.combined$grid.int=as.integer(as.character(results.combined$grid))
results.combined$qpos.int=as.integer(as.character(results.combined$qpos))

#The race classification does not give a numerical position for each driver
##(eg in cases of DSQ or Ret) although the tables are ordered
#Let's use the row order for each race to give an actual numerical position to each driver for the race
#(If we wanted to do this for practice and quali too, we would have to load the tables
#in separately, number each row, then merge them.)
require(plyr)
results.combined=ddply(results.combined, .(race), mutate, racepos.raw=1:length(race))

#Now we have a data frame that looks like this:
head(results.combined)
#  year qpos p3pos racepos      race grid driverNum raceNum racepos.int grid.int qpos.int racepos.raw
#1 2009    2    11       1 ABU DHABI    2        15      17           1        2        2           1
#2 2009    3    15       2 ABU DHABI    3        14      17           2        3        3           2
#3 2009    5     1       3 ABU DHABI    5        22      17           3        5        5           3
#4 2009    4     3       4 ABU DHABI    4        23      17           4        4        4           4
#5 2009    8     5       5 ABU DHABI    8         6      17           5        8        8           5
#6 2009   12    13       6 ABU DHABI   12        10      17           6       12       12           6

#If necessary we can force the order of the races factor levels away from alphabetical into race order
results.combined$race=reorder(results.combined$race, results.combined$raceNum)
levels(results.combined$race)
# [1] "AUSTRALIA"     "MALAYSIA"      "CHINA"         "BAHRAIN"       "SPAIN"         "MONACO"        "TURKEY"       
# [8] "GREAT BRITAIN" "GERMANY"       "HUNGARY"       "EUROPE"        "BELGIUM"       "ITALY"         "SINGAPORE"    
#[15] "JAPAN"         "BRAZIL"        "ABU DHABI" 

We’re now in a position whereby we can start to look at the data, and do some analysis of it. The paper shows a couple of example scatterplots charting the “qualifying” position against the final race position. We can go one better in a single line using the ggplot2 package:

require(ggplot2)
g=ggplot(results.combined) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race)
g

f1_2009 q vs racepos

At first glance, this may all look very confusing, but just take your time, actually look at it, and see if you can spot any patterns or emergent structure in the way the points are distributed. Does it look as if the points on the chart might have a story to tell? (If a picture saves a thousand words, and it takes five minutes to read a thousand words, the picture might actually take a couple of minutes to read to get across the same information…;-)

First thing to notice: there are lots of panels – one for each race. The panels are ordered left to right, then down a row, left to right, down a row, etc, in the order they took place during the season. So you should be able to quickly spot the first race of the season (top left), the last race (at the end of the last/bottom row), and races in mid-season (middle of them all!)

Now focus on a single chart and think about what the positioning of the points actually means. If the qualifying position was the same as the race position, what would you expect to see? That is, if the car that qualified first finished first; the car that qualified second finished second; the car that qualified tenth finished tenth; and so on. As the axes are incrementally ordered, I’d expect to see a straight line, going up at 45 degrees if the same space was given to each axis (that is, if the charts were square plots).

We can run a quick experiment to see how that looks:

#Plot increasing x and increasing y in step with each other
ggplot() + geom_point(aes(x=1:20, y=1:20))

rankorder

Looking over the race charts, some of the charts have the points plotted roughly along a straight line – Turkey and Abu Dhabi, for example. If the cars started the race roughly according to qualifying position, this would suggest the races may have been rather processional? On the other hand, Brazil is all over the place – qualifying position appears to bear very little relationship to where the cars actually finished the race.

Even for such a simple chart type, it’s amazing how much structure we might be able to pull out of these charts. If every one was a straight line, we might imagine a very boring season, full of processions. If every chart was all over the place, with little apparent association between qualifying and race positions, we might begin to wonder whether there was any “form” to speak of at all. (Such events may on the surface provide exciting races, at least at the start of the season, but if every driver has an equal, but random, chance of winning, it makes it hard to root for anyone in particular, makes it hard to root for an underdog versus a favourite, and so on.)

Again, we can run a quick experiment to see what things would look like if each car qualified at random and was placed at random at the end of the race:

expt2 = NULL
#We're going to run 20 experiments (z)
for (i in 1:20){
    #Each experiment generates a random start order (x) and random finish order (y)
    expt=data.frame(x=sample(1:20, 20), y=sample(1:20, 20), z=i)
    expt2=rbind(expt2, expt)
}
#Plot the result, by experiment
ggplot(expt2) + geom_point(aes(x=x, y=y)) + facet_wrap(~z)

Here’s the result:
random20x20

There’s not a lot of order in there, is there? However, it is worth noting that on occasion it may look as if there is some sort of ordering (as in trial 4, for example?), purely by chance.

For what it’s worth, we can also tidy up the race plots a little to make them a little bit more presentable:

g = g + ggtitle( 'F1 2009' )
g = g + xlab('Qualifying Position') + ylab('Final Race Classification')
g

Here’s the chart with added titles and axis labels:
Rplot

We can also grab a plot for a single race. The paper showed scatterplots for Brazil and Japan.

brazilVsJapan

Here’s the code:

g1 = ggplot(subset(results.combined, race == 'BRAZIL')) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race)
g1 = g1 + ggtitle('F1 2009 Brazil') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw() 

g2 = ggplot(subset(results.combined, race=='JAPAN' ) )+geom_point(aes(x =qpos.int, y=racepos.raw)) + facet_wrap(~race)
g2 = g2 + ggtitle('F1 2009 Japan') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw() 

require( gridExtra )
grid.arrange( g1, g2, ncol=2 )

Note that I have: i) filtered the data to get just the data required for each race; ii) added a little bit more styling with the theme_bw() call; iii) used the gridExtra package to generate the side-by-side plot.

Here’s what the data looked like in the original paper:

f1vnascar2009paperbraziljapanscatter

NOT the same… Hmmm… How about if we plot the grid.int position vs. the final position? (Set x = grid.int and tweak the xlab()):

japonaBrazil2009gridvracepos

So – the “qualifying” position in the published paper is actually the grid position? Did I not read the paper closely enough, or did they make a mistake? (So much for academic peer review stopping errors getting through. Like, erm, the error in the visualisation of the confidence interval that also got through? ;-)

Generating similar views for other years should be trivial – simply tweak the year selector in the SELECT query and then repeat as above.

In the limit, I guess everything could be wrapped up in a single function to plot either the facetted view, or, if a list of explicitly named races are passed in to the function, as a set of more specific race charts. But that is left as an exercise for the reader…;-)

Okay – enough for now… am I ever going to get even as far as blogging the correlation analysis, let alone the regressions?! Hopefully in the next post in this series!

PS I couldn’t resist… here’s the summary set of charts for the 2012 season, in respect of grid positions vs. final race classifications:

f12012gridvrace

Was it as you remember it?!! Were Australia and Malaysia all over the place? Was there a big crash amongst the front runners in Belgium?!

PPS Rather than swap this blog with too many posts on this topic, I intend to start a new uncourse blog and post them there in future… details to follow…

PPPS if you liked this post, you might like this Ergast API Interactive F1 Results data app

Getting Started with F1 Betting Data

As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.

Courtesy of @flutterF1, I managed to get a peek at some betting data from one of the race weekends last year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.

(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)

The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)

The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.

Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets into an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.

Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:

The US Betfair Odds Race Winner.xlsx is a raw data collection with 5 columns….
1) The timestamp (an annoying format but there is a reason for this albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) the total amount of money traded on that driver so far.
5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.

To reduce the amount of data I only record it when the price traded changes or if the amount changes.
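The change-only recording described above can be sketched quite simply (in Python here rather than R, with made-up samples): keep a sample only when the traded price or the amount differs from the last record for that driver.

```python
# Keep a sample only if the (price, amount) pair for that driver has
# changed since the last recorded sample -- the compression John describes.
def compress(samples):
    last = {}
    kept = []
    for ts, driver, price, amount in samples:
        if last.get(driver) != (price, amount):
            kept.append((ts, driver, price, amount))
            last[driver] = (price, amount)
    return kept

samples = [
    ('11:00:08', 'HAM', 3.25, 1340),
    ('11:00:09', 'HAM', 3.25, 1340),  # no change: dropped
    ('11:00:10', 'HAM', 3.30, 1350),
]
print(compress(samples))  # first and third samples only
```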

Looking through the datafile, there appear to be some gotchas, so these need cleansing out:

datafile gotchas

Here’s my initial loader script:

library(gdata)
xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1)
colnames(xl)=c('dt','driver','odds','amount','racing')

#Cleansing pass
bf.odds=subset(xl,racing!='')

str(bf.odds)
'data.frame':	10732 obs. of  5 variables:
 $ dt    : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ...
 $ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ...
 $ odds  : num  3.9 7 17 16.5 24 140 120 180 270 550 ...
 $ amount: num  1340 557 120 118 195 ...
 $ racing: int  0 0 0 0 0 0 0 0 0 0 ...

#Generate a proper datetime field from the dt column
#This is a hacked way of doing it. How do I do it properly?
bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt))

#If we rerun str(), we get the following extra line in the results:
# $ dtt   : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...

Here’s what the raw data, as loaded, looks like to the eye:
Betfair spreadsheet

Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:

#We're going to make use of the ggplot2 graphics library
library(ggplot2)

#Let's get a quick feel for bets around each driver
g=ggplot(bf.odds)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+theme(axis.text.x=element_text(angle=-90))
g

#Let's look in a little more detail around a particular driver within a particular time window
g=ggplot(subset(bf.odds,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+theme(axis.text.x=element_text(angle=-90))
g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00')))
g

Here are the charts (obviously lacking in caption, tidy labels and so on).

Firstly, the odds by driver:

odds by driver

Secondly, zooming in on a particular driver in a particular time window:

timewindow

That all seems to work okay, so how about the other spreadsheet?

#There are several sheets to choose from, named as follows:
#Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3

#Load in data from a particular specified sheet
race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race')

#The datetime column appears to be in Excel datetime format, so cast it into something meaningful
race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1"))
#Note that I am not checking for gotcha rows, though maybe I should...?

#Use the directlabels package to help tidy up the display a little
library(directlabels)

#Let's just check we've got something loaded - prune the display to rule out the longshots
g=ggplot(subset(race.odds,Bet365<30),aes(x=tTime,y=Bet365,group=Runner,col=Runner,label=Runner))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
g=g+geom_dl(method=list('top.bumpup',cex=0.6))
g=g+scale_x_datetime(expand=c(0.15,0))
g
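As an aside, the Excel serial datetime conversion used for race.odds$tTime above can be sanity checked with a quick Python sketch: in Excel's 1900 date system, serial day 25569 corresponds to 1970-01-01, so (serial − 25569) × 86400 gives seconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

# Same arithmetic as the R as.POSIXct() line above: shift the Excel serial
# day number to the Unix epoch, then scale days to seconds.
def excel_serial_to_datetime(serial):
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=(serial - 25569) * 86400)

print(excel_serial_to_datetime(41224.5))  # 2012-11-11 12:00:00+00:00
```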

Here’s a view over the drivers’ odds to win, with the longshots pruned out:

example race odds by driver

With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:

#Let's see if we can also plot the odds by bookie
colnames(race.odds)
#[1] "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred"     
# [8] "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin"        
#[15] "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red"     
#[22] "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max"         
#[29] "Range" "tTime"   

#We can remove items from this list using something like this:
tmp=colnames(race.odds)
#tmp=tmp[tmp!='Range']
tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time']
#Then we can create a subset of cols
race.odds.data=subset(race.odds,select=tmp)

#Melt the data
library(reshape)
race.odds.data.m=melt(race.odds.data,id=c('tTime','Runner'))

#head( race.odds.data.m)
#                tTime                 Runner variable value
#1 2012-11-11 19:07:01 Sebastian Vettel (Red)   Bet365  2.37
#2 2012-11-11 19:07:01   Lewis Hamilton (McL)   Bet365  3.25
#3 2012-11-11 19:07:01  Fernando Alonso (Fer)   Bet365  6.00
#...

#Now we can plot how the different bookies compare
g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
g=g+geom_dl(method=list('top.bumpup',cex=0.6))
g=g+scale_x_datetime(expand=c(0.15,0))
g

bookies odds

Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…