Tagged: f1datajunkie

Mixing Numbers and Symbols in Time Series Charts

One of the things I’ve been trying to explore with my #f1datajunkie projects is ways of representing information that work at a glance while also repaying deeper reading. I’ve also been looking at various ways of using text labels rather than markers to provide additional information around particular data points.

For example, in a race battlemap, with lap number on the horizontal x-axis and gap time on the vertical y-axis, I use a text label to indicate which driver is ahead (or behind) a particular target driver.


In the revised version of this chart type shown in F1 Malaysia, 2015 – Rosberg’s View of the Race, an additional numerical label along the x-axis indicates the race position of the target driver at the end of each lap.

What these charts are intended to do is help the eye see particular structural shapes within the data – for example whether a particular driver is being attacked from behind in the example of a battlemap, or whether they are catching the car ahead (perhaps with intervening cars in the way – although more needs to be done on the chart with respect to this for examples where there are several intervening cars; currently, only a single intervening car immediately ahead on track is shown.)

Two closer readings of the chart are then possible. Firstly, by looking at the y-value we can see the actual time a car is ahead (and here the dashed guide line at +/- 1s helps indicate in a glanceable way the DRS activation line; I’m also pondering how to show an indication of pit loss time to indicate what effect a pit stop might have on the current situation). Secondly, we can read off the labels of the drivers involved in a battle to get a more detailed picture of the race situation.

The latest type of chart I’ve been looking at is the session utilisation map, which in its simplest form looks something like the following:


The charts show how each driver made use of a practice session or qualifying – drivers are listed on the vertical y-axis and the time into the session each lap was recorded at is identified along the horizontal x-axis.

This chart makes it easy to see how many stints, and of what length, were completed by each driver and at what point in the session. Other information might be inferred – for example, significant gaps in which no cars are recording times may indicate poor weather conditions or red flags. However, no information is provided about the times recorded for each lap.

We can, however, use colour to identify “purple” laps (fastest lap time recorded so far in the session) and “green” laps (a driver’s fastest laptime so far in the session that isn’t a purple time), as well as laps on which a driver pitted:
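
As a rough illustration of how those purple and green flags might be derived (a minimal sketch on invented data, not the code used to produce the charts), the running session best and each driver's running personal best can be found with cummin(). The column names name, cuml and stime follow the ones that appear in the chart code; the colour labels are my own assumption.

```r
# Toy data: name = driver, cuml = time into session (s), stime = laptime (s)
# The values are invented for illustration
laps <- data.frame(
  name  = c("A", "B", "A", "B", "A", "B"),
  cuml  = c(100, 110, 195, 204, 292, 300),
  stime = c(95.2, 96.0, 94.8, 95.5, 95.0, 94.1)
)

# Laps must be in session order for the running minima to make sense
laps <- laps[order(laps$cuml), ]

# Purple: the laptime matches the fastest time recorded so far in the session
laps$purple <- laps$stime == cummin(laps$stime)

# Green: the laptime matches the driver's own fastest so far, but is not purple
personal_best <- laps$stime == ave(laps$stime, laps$name, FUN = cummin)
laps$green <- personal_best & !laps$purple

# A single colour key (labels are an assumption, not the original coding)
laps$colourx <- ifelse(laps$purple, "purple", ifelse(laps$green, "green", "black"))
```

Note that under this definition a driver's very first timed lap always counts as at least a green lap.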


But still, no meaningful lap times.

One thing to note about laptimes is that they come in various flavours, such as outlaps, when a driver starts the lap from the pitlane; inlaps, or laps on which a driver comes into the pits at the end of the lap; and flying laps when a driver is properly going for it. There are also those laps on which a driver may be trying out various new lines, slowing down to give themselves space for a flying lap, and so on.

Assuming that inlaps and outlaps are not the best indicators of pace, we can use a blend of symbols and text labels on the chart to identify inlaps and outlaps, as well as showing laptimes for “racing” laps, also using colour to highlight purple and green laps:


The chart is produced using ggplot, and a layered approach in which chart elements are added to the chart in separate layers.

#Load the charting library
library(ggplot2)

#The base chart with the dataset used to create the original chart
#In this case, the dataset included here is redundant
g = ggplot(f12015test)

#Layer showing in-laps (laps on which a driver pitted) and out-laps
#Use a subset of the dataset to place markers for outlaps and inlaps
g = g + geom_point(data=f12015test[f12015test['outlap'] | f12015test['pit'],],aes(x=cuml, y=name, color=factor(colourx)), pch=1)

#Further annotation to explicitly identify pit laps (in-laps)
g = g + geom_point(data=f12015test[f12015test['pit']==TRUE,],aes(x=cuml, y=name),pch='.')

#Layer showing full laps with rounded laptimes and green/purple lap highlights
#In this case, use the laptime value as a text label, rather than a symbol marker
g = g + geom_text(data=f12015test[!f12015test['outlap'] & !f12015test['pit'],],aes(x=cuml, y=name, label=round(stime,1), color=factor(colourx)), size=2, angle=45)

#Force the colour scale to be one we want
g = g + scale_colour_manual(values=c('darkgrey','darkgreen','purple'))

This version of the chart is glanceable when it comes to identifying session utilisation (number, duration and timing of stints) and when purple and green laptimes were recorded, while also repaying closer reading of the actual laptimes recorded during each stint.

To reduce clutter on the chart, laptimes are rounded to 1 decimal place (tenths of a second) rather than using the full lap time, which is recorded down to thousandths of a second.

Session utilisation charts are described more fully in a recently released chapter of the Wrangling F1 Data With R Leanpub book. Buying a copy of the book gains you access to future updates of the book. A draft version of the chapter can be found here.

Segmenting F1 Qualifying Session Laptimes

I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into, using some sort of clustering algorithm… or other means…:


For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as the following session utilisation chart (including green and purple laptimes) shows:


The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.
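
A minimal sketch of that basetime calculation, assuming each lap carries a time-of-day string in HH:MM:SS form (the raw timing sheets may well need more wrangling than this, and the timestamps here are invented):

```r
# Invented time-of-day timestamps for three laps
tod <- c("17:00:42", "17:02:21", "17:03:58")

# Convert HH:MM:SS strings to seconds since midnight
to_seconds <- function(x) {
  sapply(strsplit(x, ":"), function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

secs <- to_seconds(tod)

# Time into session, measured from the first timestamp recorded
cuml <- secs - min(secs)
```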

If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:


We can see a big rain delay gap, and also a tighter gap between the first and second sessions.

If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:

# Attempt to identify qualifying session using K-Means Cluster Analysis around 3 means
clusters <- kmeans(f12015test['cuml'], 3)

f12015test = data.frame(f12015test, clusters$cluster)

ggplot(f12015test)+geom_text(aes(x=cuml, y=stime,
label=code, colour=factor(clusters.cluster)) ,angle=45,size=3)


In particular, some of the Q1 laptimes are being grouped with Q2 laptimes.

However, we know that there is at least a two minute gap between sessions. (Regulations suggest 7 minutes, though if this is the time between the lights going red and then green again, we might need to knock a couple of minutes off the gap to account for drivers who start their last lap just before the lights go red on a session.) So if we assume that the only gaps of two minutes or more between recorded laptimes during the whole of qualifying fall between the qualifying sessions, we can generate a flag on those gaps, and then generate session numbers by taking a cumulative count of those flags.

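#Find the gap between consecutive laptimes, ordered by time into session
f12015test = f12015test[order(f12015test$cuml),]
f12015test$gap = c(0, diff(f12015test$cuml))
#Look for a two minute gap
f12015test$gapflag = (f12015test$gap>=120)
#Count the gap flags to number the sessions
f12015test$qsession = 1 + cumsum(f12015test$gapflag)

ggplot(f12015test)+ geom_text(aes(x=cuml, y=stime, label=code), angle=45,size=3)+
  facet_wrap(~qsession, scales="free")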
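
To see how counting on the gap flags turns into session numbers, here is a toy run with invented times into session (in seconds) that contain two long gaps:

```r
# Invented times into session at which laptimes were recorded
cuml <- c(10, 100, 190, 280, 700, 790, 880, 1400, 1490)

# Gap to the previously recorded laptime (the first lap has no predecessor)
gap <- c(0, diff(cuml))

# Flag gaps of two minutes or more
gapflag <- gap >= 120

# Each flagged gap starts a new session, so a cumulative count numbers them
qsession <- 1 + cumsum(gapflag)
```

Here the flags fall on the fifth and eighth laps, so the laps are numbered 1, 1, 1, 1, 2, 2, 2, 3, 3.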


(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)

This chart clearly shows how the first qualifying session saw cars running fairly evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each other’s way on flying laps…)

One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:


#Use these values in ylim()...
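
One way such limits might be derived is sketched below. The multipliers are arbitrary assumptions of mine (the upper one is a nod to the 107% rule), not values taken from the original recipe, and the laptimes are invented.

```r
# Invented laptimes in seconds, mixing slow laps with flying laps
stime <- c(130.4, 101.2, 99.8, 125.0, 99.1, 140.3)

# The purple time at the end of the session is simply the fastest lap
purple <- min(stime)

# Assumed window: just under the purple time up to 107% of it
ylimits <- c(purple * 0.99, purple * 1.07)

# Laps that would fall inside the zoomed view
flying <- stime[stime >= ylimits[1] & stime <= ylimits[2]]
```

The ylimits vector could then be passed to ylim() when building the chart.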

However, where the laptimes differ significantly across sessions as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.

Another crib we might use is to identify PIT lap and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.
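
A minimal sketch of that filter on invented data, assuming a logical pit column marks laps that end in the pits, and treating each driver's first lap of the session as an out-lap too (my assumption):

```r
# Invented lap records: pit is TRUE on laps that end with a pit stop
laps <- data.frame(
  name = c("A", "A", "A", "A", "B", "B", "B"),
  lap  = c(1, 2, 3, 4, 1, 2, 3),
  pit  = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)
laps <- laps[order(laps$name, laps$lap), ]

# An out-lap is a driver's first lap, or any lap following one of their pit laps
laps$outlap <- ave(laps$pit, laps$name, FUN = function(p) c(TRUE, head(p, -1)))

# Keep only the laps that are neither pit laps nor out-laps
racing <- laps[!laps$pit & !laps$outlap, ]
```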

Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy into the book, you get all future updates to it for no additional cost, even in the case of the minimum book price increasing over time.

Rediscovering Formula One Race Battlemaps

A couple of days ago, I posted a recipe on the F1DataJunkie blog that described how to calculate track position from laptime data.

Using that information, as well as additional derived columns such as the identity of, and time to, the cars immediately ahead of and behind a particular selected driver, both in terms of track position and race position, I revisited a chart type I first started exploring several years ago – race battle charts.

The main idea behind the battlemaps is that they can help us search for stories amidst the runners.

dirattr=function(attr,dir='ahead') paste(attr,dir,sep='')

#We shall find it convenient later on to split out the initial data selection

  if (dir=='ahead') diff_X='diff' else diff_X='chasediff'
  if (dir=='ahead') drs=1000 else drs=-1000
  #Plot the offlap cars that aren't directly being raced
  #Plot the cars being raced directly
  g+guides(col=guide_legend(title='Intervening car'))


In this first sketch, from the 2012 Australian Grand Prix, I show the battlemap for Mark Webber:


We see how at the start of the race Webber kept pace with Alonso, albeit around about a second behind, at the same time as he drew away from Massa. In the last third of the race, he was closely battling with Hamilton whilst drawing away from Alonso. Coloured labels are used to highlight cars on a different lap (either ahead (aqua) or behind (orange)) that are in a track position between the selected driver and the car one place ahead or behind in terms of race position (the black labels). The y-axis is the time delta in milliseconds between the selected car and cars ahead (y > 0) or behind (y < 0). A dashed line at the +/- one second mark identifies cars within DRS range.
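
To make the y-axis values concrete, the following sketch derives the gaps to the cars one race position ahead and behind from accumulated race times. It is a simplified illustration on invented data rather than the book's recipe: the column names diff and chasediff are borrowed from the code fragment above, while acctime and pos are hypothetical names of my own.

```r
# Invented accumulated race times in milliseconds at the end of each lap
laps <- data.frame(
  lap     = rep(1:2, each = 3),
  code    = rep(c("WEB", "ALO", "MAS"), times = 2),
  acctime = c(95500, 94600, 96900, 190200, 189500, 193000)
)

# Within each lap, order cars by accumulated time to get race positions
laps <- laps[order(laps$lap, laps$acctime), ]
laps$pos <- ave(laps$acctime, laps$lap, FUN = rank)

# Gap to the car one position ahead (positive) and behind (negative)
laps$diff      <- ave(laps$acctime, laps$lap, FUN = function(t) c(NA, diff(t)))
laps$chasediff <- ave(laps$acctime, laps$lap, FUN = function(t) c(-diff(t), NA))

# Battlemap data for one driver, flagging laps within the one second DRS range
web <- laps[laps$code == "WEB", ]
web$in_drs <- web$diff <= 1000
```

In this toy data the selected driver sits 900ms and then 700ms behind the car ahead, while pulling away from the car behind.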

As well as charting the battles in the vicinity of a particular driver, we can also chart the battle in the context of a particular race position. We can reuse the chart elements and simply need to redefine the filtered dataset we are charting.

For example, if we filter the dataset to just get the data for the car in third position at the end of each lap, we can then generate a battle map of this data.





For more details, see the original version of the battlemap chapter. For updates to the chapter, I recommend that you invest in a copy of the Wrangling F1 Data With R book if you haven’t already done so:-)

Connecting RStudio and MySQL Docker Containers – an example using the ergast db

Building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio:

  • Download and unzip the f1db.sql.gz file to f1db.sql
  • install these docker-mysql-scripts
  • run boot2docker
  • from the boot2docker shell, start up a MySQL server (ergastdb) with password f1: dmysql-server ergastdb f1 (by default, this exposes port 3306)
  • create a new empty database (f1db): dmysql-create-database ergastdb f1db
  • add the ergast data to it: dmysql-import-database ergastdb /path/to/ergastdb/f1db.sql --database f1db
  • fire up a copy of RStudio, in this case using my psychemedia/wranglingf1data container, linking it to the MySQL database which has the alias db: docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data
  • run boot2docker ip to find where RStudio is running (IPADDRESS) and in your browser go to: http://IPADDRESS:8788, logging in with username rstudio and password rstudio
  • in RStudio, import the RMySQL library: library(RMySQL)
  • in RStudio, connect to the database: con=dbConnect(MySQL(),user='root',password='f1',host='db',port=3306,dbname='f1db')
  • in RStudio, run a test query: dbGetQuery(con,'SHOW TABLES')


I guess what I need to do now is pull the various bits into another script to make it a one-liner, perhaps with a few switches? For example, to create the database if it doesn’t exist, to download the ergast database file automatically, to populate the database for the first time, or update it with a more recent copy of the database, to fire up both containers and make sure they are appropriately linked etc. This would dramatically simplify things for use in the context of the Wrangling F1 Data With R book, for example. (If you beat me to it, please post details in the comments below.)

PS Hmm…. seems I get a UTF-8 encoding issue:


Not sure if this is with the database, or the RMySQL connector? Anyone got any ideas of a fix?

Ah ha – sort of via SO:

Running dbGetQuery(con,'SET NAMES utf8;') before querying seems to do the trick…
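
Pulling the connection step and the encoding fix together might look something like the following sketch. The ergast_connect helper is a hypothetical name of my own, and its defaults simply mirror the values used in the steps above; it needs the RMySQL package and the linked containers to actually run.

```r
# Hypothetical helper: connect to the linked MySQL container and set the
# connection character set before any queries are run
ergast_connect <- function(user = "root", password = "f1", host = "db",
                           port = 3306, dbname = "f1db") {
  con <- DBI::dbConnect(RMySQL::MySQL(), user = user, password = password,
                        host = host, port = port, dbname = dbname)
  # Apply the UTF-8 fix straight away
  DBI::dbGetQuery(con, "SET NAMES utf8;")
  con
}

# Usage, once both containers are up and linked:
# con <- ergast_connect()
# DBI::dbGetQuery(con, "SHOW TABLES")
# DBI::dbDisconnect(con)
```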

Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

C_t =  \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}

where C_t is the churn in team standings for year t, \left|f_{i,t} - f_{i,t-1}\right| is the absolute value of the i-th team’s change in finishing position going from season t-1 to season t, and N is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, C_t, by the maximum churn, C_max. The value of the maximum churn depends on whether there is an even or odd number of competitors:

C_{max} = N/2 \text{, for even N}

C_{max} = (N^2 - 1) / 2N \text{, for odd N}

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, f_{i,t} is the position of driver i at the end of race t, f_{i,t-1} is the starting position of driver i at the beginning of that race (that is, race t) and N is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.
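
As a quick sanity check on the maximum churn values (a worked example of my own, not one taken from either paper), consider a complete reversal of the standings, which is one rearrangement that attains the maximum:

```latex
% N = 4: standings (1,2,3,4) become (4,3,2,1)
C_t = \frac{|4-1| + |3-2| + |2-3| + |1-4|}{4} = \frac{8}{4} = 2 = \frac{N}{2}

% N = 3: standings (1,2,3) become (3,2,1)
C_t = \frac{|3-1| + |2-2| + |1-3|}{3} = \frac{4}{3} = \frac{3^2 - 1}{2 \cdot 3} = \frac{N^2 - 1}{2N}
```

In both cases the adjusted churn, C_t / C_max, comes out at 1.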

Using these models, I created churn functions of the form:

is.even = function(x) x %% 2 == 0
#Maximum churn for N competitors
churnmax = function(N) {
  if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))
}

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used them to explore churn in a variety of contexts:

  • comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
  • churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
  • churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
library(plyr)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid, 
                position, positionText, positionOrder
          FROM results rs JOIN drivers d JOIN races r
          ON rs.driverId=d.driverId AND rs.raceId=r.raceId
          WHERE r.year=2013',sep='')
#Run the query to get the results for each race
results = dbGetQuery(ergastdb, q)

#Positions gained or lost by each driver in each race
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
            churn = churn(delta),
            adjchurn = adjchurn(delta)
            )
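
For a feel of the numbers without needing the database, here is a self-contained toy version that uses base R's aggregate() in place of ddply(). The grid and finishing positions are invented, and the function definitions are repeated so that the sketch runs on its own.

```r
# Churn functions, repeated here so the sketch is self-contained
is.even  <- function(x) x %% 2 == 0
churnmax <- function(N) if (is.even(N)) N / 2 else (N^2 - 1) / (2 * N)
churn    <- function(d) sum(d) / length(d)
adjchurn <- function(d) churn(d) / churnmax(length(d))

# Invented results: round 1 swaps neighbouring cars, round 2 is a procession
results <- data.frame(
  round         = c(1, 1, 1, 1, 2, 2, 2),
  grid          = c(1, 2, 3, 4, 1, 2, 3),
  positionOrder = c(2, 1, 4, 3, 1, 2, 3)
)
results$delta <- abs(results$grid - results$positionOrder)

# Adjusted churn per race: the delta column now holds the adjusted churn
churn.df <- aggregate(delta ~ round, data = results, FUN = adjchurn)
```

Round 1 has every car moving one place out of a maximum average of two, giving an adjusted churn of 0.5; round 2 has no position changes at all, giving 0.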

For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

Information Density and Custom Chart Designs

I’ve been doodling today with some charts for the Wrangling F1 Data With R living book, trying to see how much information I can pack into a single chart.

The initial impetus came simply from thinking about a count of laps led in a particular race by each driver; this morphed into charting the number of laps in each position for each driver, and then onto a more comprehensive race summary chart (see More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API for an earlier graphical attempt at producing a race summary chart).


The chart shows:

  • grid position: identified using an empty grey square;
  • race position after the first lap: identified using an empty grey circle;
  • race position on each driver’s last lap: y-value (position) of corresponding pink circle;
  • points cutoff line: a faint grey dotted line to show which positions are inside – or out of – the points;
  • number of laps spent in each position by each driver: size of pink circle;
  • total laps completed by driver: greyed annotation at the bottom of the chart;
  • whether a driver was classified or not: the total lap count is displayed using a bold font for classified drivers, and in italics for unclassified drivers;
  • finishing status of each driver: classification statuses other than *Finished* are also recorded at the bottom of the chart.

The chart also shows drivers who started the race but did not complete the first lap.

What the chart doesn’t show is at what stage of the race each driver was in each position, and for how long. But I have an idea for another chart that could help there, as well as being able to reuse elements used in the chart shown here.

FWIW, the following fragment of R code shows the ggplot function used to create the chart. The data came from the ergast API, though it did require a bit of wrangling to get it into a shape that I could use to power the chart.

#Reorder the drivers according to a final ranked position
#Highlight the points cutoff
#Highlight the position each driver was in on their final lap
#Highlight the grid position of each driver
#Highlight the position of each driver at the end of the first lap
#Provide a count of how many laps each driver held each position for
#Number of laps completed by driver
g=g+geom_text(aes(x=driverRef,y=-1,label=lap,fontface=ifelse(is.na(classification), 'italic' , 'bold')),size=3,colour='grey')
#Record the status of each driver
g=g+geom_text(aes(x=driverRef,y=-2,label=ifelse(status!='Finished', status,'')),size=2,angle=30,colour='grey')
#Styling - tidy the chart by removing the transparency legend
g+theme_bw()+xRotn()+xlab(NULL)+ylab("Race Position")+guides(alpha=FALSE)
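
The count of laps held in each position is the piece of wrangling that drives the circle sizes. A minimal sketch of how it might be produced from lap-by-lap positions is shown below, using invented data; driverRef, lap and position follow ergast naming, while posCounts, laps and lapsLed are names of my own.

```r
# Invented lap-by-lap positions for two drivers over three laps
lapdata <- data.frame(
  driverRef = c("hamilton", "rosberg", "hamilton", "rosberg", "hamilton", "rosberg"),
  lap       = c(1, 1, 2, 2, 3, 3),
  position  = c(2, 1, 1, 2, 1, 2)
)

# Count how many laps each driver held each position for
posCounts <- as.data.frame(
  table(driverRef = lapdata$driverRef, position = lapdata$position),
  responseName = "laps", stringsAsFactors = FALSE
)
posCounts$position <- as.integer(posCounts$position)
posCounts <- posCounts[posCounts$laps > 0, ]

# Laps led falls out as the count for position 1
lapsLed <- posCounts[posCounts$position == 1, ]
```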

The fully worked code can be found in a forthcoming update to the Wrangling F1 Data With R living book.

F1 Championship Race, 2014 – Winning Combinations…

As we come up to the final two races of the 2014 Formula One season, the double points mechanism for the final race means that two drivers are still in with a shot at the Drivers’ Championship: Lewis Hamilton and Nico Rosberg.

As James Allen describes in Hamilton closes in on world title: maths favour him but Abu Dhabi threat remains:

Hamilton needs 51 points in the remaining races to be champion if Rosberg wins both races. Hamilton can afford to finish second in Brazil and at the double points finale in Abu Dhabi and still be champion. Mathematically he could also finish third in Brazil and second in the finale and take it on win countback, as Rosberg would have just six wins to Hamilton’s ten.
If Hamilton leads Rosberg home again in a 1-2 in Brazil, then he will go to Abu Dhabi needing to finish fifth or higher to be champion (echoes of Brazil 2008!!). If Rosberg does not finish in Brazil and Hamilton wins the race, then Rosberg would need to win Abu Dhabi with Hamilton not finishing; no other scenario would give Rosberg the title.

A couple of years ago, I developed an interactive R/shiny app for exploring finishing combinations of two drivers in the last two races of a season to see what situations led to what result: Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship.


I’ve updated the app (taking into account the matter of double points in the final race) so you can check out James Allen’s calculations with it (assuming I got my sums right too!). I tried to pop up an interactive version to Shinyapps, but the Shinyapps publication mechanism seems to be broken (for me at least) at the moment…:-(
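
As a rough cross-check of the quoted arithmetic that doesn't need the app, the sketch below enumerates Hamilton's finishing positions for the case where Rosberg wins both races. The 24 point lead is inferred from the quote (Rosberg gains 75 points by winning both races and Hamilton needs 51 to stay ahead), ties are awarded to Hamilton on win countback as the quote describes, and position 11 stands in for any non-scoring finish.

```r
# Points for positions 1 to 10, with position 11 standing for no points
scores <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0)

# Hamilton's lead going into the last two races (inferred from the quote)
lead <- 24

# Rosberg wins both races, so Hamilton can finish second at best
scenarios <- expand.grid(hamBrazil = 2:11, hamAbuDhabi = 2:11)

# Rosberg's gain: a normal win plus a double points win
rosbergGain <- scores[1] + 2 * scores[1]

# Hamilton's final margin, with double points in Abu Dhabi
scenarios$margin <- lead + scores[scenarios$hamBrazil] +
  2 * scores[scenarios$hamAbuDhabi] - rosbergGain

# A tie goes to Hamilton on win countback
scenarios$hamiltonChampion <- scenarios$margin >= 0
```

Second place in both races leaves Hamilton three points clear, and third then second leaves the pair level on points, in line with the quoted figures.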

In the meantime, if you have RStudio installed, you can run the application yourself. The code is available and can be run from RStudio with: runGist("81380ff09ebe1cd67005")

When I get a chance, I’ll weave elements of this recipe into the Wrangling F1 Data With R book.

PS I’ve also started using the F1dataJunkie blog again as a place to post drafts and snippets of elements I’m working on for that book…