Tagged: f1datajunkie

ergastR – R Wrapper for ergast F1 Results Data API

By the by, I’ve posted a first attempt at an R package – ergastR to wrap the ergast developer API, which is where I get chunks of data from for my f1datajunkie tinkerings.

You can find it on Github: psychemedia/ergastR.

The function names are the ones used in the Wrangling F1 Data With R book.

The R package needs a bit of tidying up and also needs work on the following: cacheing, so that we don’t keep hitting the ergast API unnecessarily; paged results handling (I fudge this a bit at the moment by explicitly setting a large results limit); and dual handling of ergast API versus downloaed ergast database requests (if a database connection string is passed, use that rather than make a call to the ergast API). But it’s a start… Feel free to raise issues via the repo:-)

In related news, Will Vaughan tipped me off to a Python package he’s started putting together to wrap the ergast API: ergast-python. He’s also making a start on some Wrangling F1 Data Jupyter notebooks that make use of the Python wrapper: wranglingf1data.

Running the Numbers – How Can Hamilton Still Take the 2016 F1 Drivers’ Championship?

Way back in 2012, I posted a simple R script for trying to work out the finishing combinations in the last two races of that year’s F1 season for Fernando Alonso and Sebastien Vettel to explore the circumstances under which Alonso could take the championship (Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix); I also put together a simple shiny version of the script to make it bit more app like (Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship), which I also updated for the 2014 season (F1 Championship Race, 2014 – Winning Combinations…).

And now we come to 2016, and once again, with two races to go, there are two drivers in with a chance of winning overall… But what race finishing combinations could see Hamilton make a last stand and reclaim his title? The  F1 Drivers’ Championship Scenarios, 2016 shiny app will show you…


You can find the code in a gist here:

Visualising F1 Stint Strategies

With the new F1 season upon us, I’ve started tinkering with bits of code from the Wrangling F1 Data With R book and looking at the data in some new ways.

For example, I started wondering whether we might be able to learn something interesting about the race strategies by looking at laptimes on a stint by stint basis.

To begin with, we need some data – I’m going to grab it directly from the ergast API using some functions that are bundled in with the Leanpub book…

#ergast functions described in: https://leanpub.com/wranglingf1datawithr/
#Get laptime data from the ergast API
#Get pits data from the ergast API

#merge pit data into the laptime data

#generate an inlap flag (inlap is the lap assigned the pit time)

#generate an outlap flag (outlap is the first lap of the race or laps starting from the pits

#use the pitstop flag to number stints; note: a drive through penalty increments the stint count
l3=arrange(l3,driverId, -lap)

#number the laps in each stint
l3=arrange(l3,driverId, lap)
l3=arrange(l3,driverId, lap)

The laptimes associated with the in- and out- lap associated with a pit stop add noise to the full lap times completed within each stint, so lets flag those laps so we can then filter them out:

#Discount the inlap and outlap
l4=l3[!l3['outlap'] & !l3['inlap'],]

We can now look at the data… I’m going to facet by driver, and also group the laptimes associated with each stint. Then we can plot just the raw laptimes, and also a simple linear model based on the full lap times within each stint:

#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime, col=factor(stint)))+facet_wrap(~driverId)

#Chart the raw laptimes within each stint

#Plot a simple linear model for each stint
g+ geom_smooth(method = "lm", formula = y ~ x)

So for example, here are the raw laptimes, excluding inlap and outlap, by stint for each driver in the recent 2016 Bahrain Formual One Grand Prix:


And here’s the simple linear model:


These charts highlight several things:

  • trivially, the number and length of the stints completed by each driver;
  • degradation effects in terms of the gradient of the slope of each stint trace;
  • fuel effects- the y-axis offset for each stint is the sum of the fuel effect (as the race progresses the cars get lighter and laptime goes down more or less linearly) and a basic tyre effect (the “base” time we might expect from a tyre). Based on the total number of laps completed in stints prior to a particular stint, we can calculate a fuel effect offset for the laptimes in each stint which should serve to normalise the y-axis laptimes and make more evident the base tyre laptime.
  • looking at the charts as a whole, we get a feel for strategy – what sort of tyre/stint strategy do the cars start race with, for example; are the cars going long on tyres without much degradation, or pushing for various length stints on tyres that lose significant time each lap? And so on… (What can you read into/from the charts? Let me know in the comments below;-)

If we assume a 0.083s per lap fuel weight penalty effect, we can replot the chart to account for this:

#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime+0.083*lap, col=factor(stint)))
g+facet_wrap(~driverId) +geom_line()

Here’s what we get:


And here’s what the fuel corrected models look like:


UPDATE: the above fuel calculation goes the wrong way – oops! It should be:

.e = environment()

What we really need to do now is annotate the charts with additional tyre selection information for each stint.

We can also do a few more sums. For example, generate a simple average laptime per stint, excluding inlap and outlap times:

#Calculate some stint summary data
l5=ddply(l4,.(driverId,stint), summarise,

which gives results of the form:

          driverId stint   stintav stintsum stinlen
1          rosberg     1  98.04445 1078.489      11
2          rosberg     2  97.55133 1463.270      15
3          rosberg     3  96.15543  673.088       7
4          rosberg     4  96.32494 1637.524      17
5            massa     1  99.13600  495.680       5
6            massa     2 100.48300 2009.660      20
7            massa     3  98.77862 2568.244      26

It would possibly be useful to also compare inlap and outlaps somehow, as well as factoring in the pitstop time. I’m pondering a couple a possibilities for the latter :

  • amortise the pitstop time over the laps leading up to a pitstop by adding a pitsop lap penalty to each lap in that stint calculated as the pitstop time of the stint length of the laps in the stint leading up to the pitstop; this essentially penalises the stint that leads up to the pitstop as a consequence of forcing the pitstop;
  • amortise the pitstop time over the laps immediately following a pitstop by adding a pitsop lap penalty to each lap in that stint calculated as the pitstop time of the stint length of the laps in the stint following the pitstop; this essentially penalises the stint that immediately follows the pitstop, and discounts some of the benefit from the pitstop.

I haven’t run the numbers yet though, so I’m not sure how these different approaches will feel…

Detecting Undercuts in F1 Races Using R

One of the things that’s been on my to do list for some time has been the identification of tactical or strategic events within a race that might be detected automatically. One such event is an undercut described by F1 journalist James Allen in the following terms (The secret of undercut and offset):

An undercut is where Driver A leads Driver B, but Driver B turns into the pits before Driver A and changes to new tyres. As Driver A is ahead, he’s unaware that this move is coming until it’s too late to react and he has passed the pit lane entry.
On fresh tyres, Driver B then drives a very fast “Out” lap from the pits. Driver A will react to the stop and pit on the next lap, but his “In” lap time will have been set on old tyres, so will be slower. As he emerges from the pit lane after his stop, Driver B is often narrowly ahead of him into the first corner.

In logical terms, we might characterise this as follows:

    • two drivers, d1 and d2: d1 !=d2;
    • d1 pits on lap X, and drives an outlap on lap X+1;
    • d1’s position on their pitlap (lap X) is greater than d2’s position on the same lap X;
    • d2 pits on lap X+1, with an outlap on lap X+2;
    • d2’s position on their outlap (lap X+2) is greater than d1’s position on the same lap X+2.

We can generalise this formulation, and try to make it more robust, by comparing positions on the lap prior to d1’s stop (lap A) with the positions on d2’s outlap (lap B):

        • two drivers, d1 and d2: d1 !=d2;
        • d1 pits on lap A+1;
        • d1’s position on their “prelap” (lap A), the lap prior to their pitlap (lap A+1), is greater than d2’s position on lap A; this condition tries to ensure that d1 is behind d2 as they enter the pit stop phase but it misses the effect on any first lap stops (unless we add a lap 0 containing the grid positions);
        • d1’s outlap is on lap A+2;
        • d2 pits on lap B-1 within the inclusive range [lap A+2, lap A+1+N]: N>=1, (that is, within N laps of D1’s stop) with an outlap on lap B; the parameter, N, allows us to test for changes of position within a pit stop window, rather than requiring that d2 pits on the lap immediately following d1’s stop;
        • d2’s position on their outlap (lap B, in the inclusive range [lap A+3, lap A+2+N]) is greater than d1’s position on the same lap B.

One way of implementing these constraints is to write a declarative style query that specifies the conditions we want the solution to meet, rather than writing a procedural programme to find such an answer. Using the sqldf package, we can use a SQL query to achieve just this result.

One way of writing the query is to create two situations, a and b, where situation a corresponds to a lap on which d1 stops, and situation b corresponds to the driver d2’s stop. We then capture the data for each driver in each situation, to give four data states: d1a, d1b, d2a, d2b. These states are then subjected to the conditions specified above (using N=5).

#First get laptime data from the ergast API

#Now find pit times

#merge pitdata with lapsdata
lapTimesp=merge(lapTimes, p, by = c('lap','driverId'), all.x=T)

#flag pit laps
lapTimesp$ps = ifelse(is.na(lapTimesp$milliseconds), F, T)

#Ensure laps for each driver are sorted
lapTimesp=arrange(lapTimesp, driverId, lap)

#do an offset on the laps that are pitstops for each driver
#to set outlap flags for each driver
lapTimesp=ddply(lapTimesp, .(driverId), transform, outlap=c(FALSE, head(ps,-1)))

#identify lap before pit lap by reverse sorting
lapTimesp=arrange(lapTimesp, driverId, -lap)
#So we can do an offset going the other way
lapTimesp=ddply(lapTimesp, .(driverId), transform, prelap=c(FALSE, head(ps,-1)))

#tidy up

#Now we can run the SQL query
ss=sqldf('SELECT d1a.driverId AS d1, d2a.driverId AS d2, \
            d1a.lap AS A, d1a.position AS d1posA, d1b.position AS d1posB, \
            d2b.lap AS B, d2a.position AS d2posA, d2b.position AS d2posB \
          FROM lapTimesp d1a, lapTimesp d1b, lapTimesp d2a, lapTimesp d2b \
          WHERE d1a.driverId=d1b.driverId AND d2a.driverId=d2b.driverId \
            AND d1a.driverId!=d2a.driverId \
            AND d1a.prelap AND d1a.lap=d2a.lap AND d2b.outlap AND d2b.lap=d1b.lap \
            AND (d1a.lap+3<=d1b.lap AND d1b.lap<=d1a.lap+2+5) \
            AND d1a.position>d2a.position AND d1b.position < d2b.position')

For the 2015 British Grand Prix, here’s what we get:

          d1         d2  A d1posA d2posA  B d1posB d2posB
1  ricciardo      sainz 10     11     10 13     12     13
2     vettel      kvyat 13      8      7 19      8     10
3     vettel hulkenberg 13      8      6 20      7     10
4      kvyat hulkenberg 17      6      5 20      9     10
5   hamilton      massa 18      3      1 21      2      3
6   hamilton     bottas 18      3      2 22      1      3
7     alonso   ericsson 36     11     10 42     10     11
8     alonso   ericsson 36     11     10 43     10     11
9     vettel     bottas 42      5      4 45      3      5
10    vettel      massa 42      5      3 45      3      4
11     merhi    stevens 43     13     12 46     12     13

With a five lap window we have evidence that supports successful undercuts in several cases, including VET taking KVY and HUL with his early stop at lap 13+1 (KVY pitting on lap 19-1 and HUL on lap 20-1), and MAS and BOT both being taken first by HAM’s stop at lap 18+1 and then by VET’s stop at lap 42+1.

To make things easier to read, we may instead define d1a.lap+1 AS d1Pitlap and d2b.lap-1 AS d2Pitlap.

The query doesn’t guarantee that the pit stop was responsible for change in order, but it does at least gives us some prompts as to where we might look.

Spotting Potential Battles in F1 Races

Over the last couple of races, I’ve started trying to review a variety of battlemaps for various drivers in each race. Prompted by an email request for more info around the battlemaps, I generated a new sketch charting the on track gaps between each driver and the lap leader for each lap of the race (How the F1 Canadian Grand Prix Race Evolved on Track).

Colour is used to identify cars on lead lap compared to lapped drivers. For lapped drivers, a count of how many laps they are behind the leader is displayed. I additionally overplot with a highlight for specified driver, as well as adding in a mark that shows the on track position of the leader of the next lap, along with their driver code.


Battles can be identified through the close proximity of two or more drivers within a lap, across several laps. The ‘next-lap-leader’ time at the far right shows how close the leader on the next lead lap is to the backmarker (on track) on the current lead lap.

By highlighting two particular drivers, we could compare how their races evolved, perhaps highlighting different strategies used within a race that eventually bring the drivers into a close competitive battle in the last few laps of a race.

The unchanging leader-on-track-delta-of-0 line is perhaps missing an informational opportunity? For example, should we set the leader’s time to be the delta compared to the lap time for the leader laps from the previous lead lap? Or a delta compared to the fastest laptime on the previous lead lap? And if we do start messing about with an offset to the leader’s lap time, we presumably need to apply the same offset to the laptime of everyone else on the lap so we can still see the comparative on-track gaps to leader?

On the to-do list are various strategies for automatically identifying potential battles based on a variety of in-lap and across-lap heuristics.

Here’s the code:

#Grab some data
lapTimes =lapsData.df(2015,7)

#Process the laptimes

#Find the accumulated race time at the start of each leader's lap

#Find the on-track gap to leader

#Construct a dataframe that contains the difference between the 
#leader accumulated laptime on current lap and next lap
#i.e. how far behind current lap leader is next-lap leader?
#Generate a de facto lap count
#Grab the code of the lap leader on the next lap
ll['c']=lapTimes[lapTimes['position']==1 & lapTimes['lap']>1,'code']

#Plot the on-track gap to leader versus leader lap
g = ggplot(lapTimes) 
g = g + geom_point(aes(x=trackdiff,y=leadlap,col=(lap==leadlap)), pch=1)
g = g + geom_point(data=lapTimes[lapTimes['driverId']=='vettel',],
                  aes(x=trackdiff,y=leadlap), pch='+')
g = g + geom_text(data=lapTimes[lapTimes['lapsbehind']>0,],
                  aes(x=trackdiff,y=leadlap, label=lapsbehind),size=3)
g = g + geom_point(data=ll,aes(x=t, y=n), pch='x')
g = g + geom_text(data=ll,aes(x=t+3, y=n,label=c), size=2)
g = g + geom_vline(aes(xintercept=17), linetype=3)

This chart will be included in a future update to the Wrangling F1 Data With R book. I hope to do a sprint on that book to tidy it up and get it into a reasonably edited state in the next few weeks. At that point, the text will probably be frozen, a print-on-demand version generated, and if it ends up on Amazon, the minimum price being hiked considerably.

Charts are for Reading…

If charts are pictures, and every picture not only tells a story, but also saves a thousand words in doing so, how then are we to actually read them?

Take the following example, a quick #f1datajunkie sketch show how the Bahrain 2015 qualifying session progressed. The chart is split into three, one for each part of qualifying (which we might refer to as fractional sessions), which already starts to set the scene for the story. The horizontal x-axis is the time in seconds into qualifying at which each laptime is recorded, indexed against the first laptime recorded in qualifying overall. The vertical y-axis records laptimes in in seconds, limited to 107% of the fastest laptime recorded in a particular session. The green colour denotes a driver’s fastest laptime recorded in each fractional session, purple the overall fasted laptime recorded so far in a fractional session (purple trumps green). So again, the chart is starting to paint a picture.


An example of the sort of analysis that can be provided for a qualifying session can be found in a post by Justin Hynes, Lewis Hamilton seals his first Bahrain pole but Vettel poses the menace to Mercedes’ hopes, that appeared on he James Allen on F1 blog. In this post, I’ll try to match elements of that analysis with things we can directly see in the chart above…

[Hamilton] finish[ed] 0.411s clear of Ferrari’s Sebastian Vettel and more than half a second in front of his Mercedes team-mate Nico Rosberg

We don’t get the time gap exactly from the chart, but looking to the rightmost panel (Q3), finding the lowest vertical marks for HAM, VET and ROS, and imagining a horizontal line across to the y-axis, we get a feeling for the relative gaps.

Q1 got underway in slightly calmer conditions than blustery FP3 and Raikkonen was the first to take to the track, with Bottas joining the fray soon after. The Williams driver quickly took P1 but was then eclipsed by Rosberg, who set a time of 1: 35.657 on the medium tyres.

Q1 is the leftmost panel, in which we see RAI setting the first representative laptime at least (within the 107% limit of the session best overall), followed by BOT and then ROS improving on the early purple times.

The Mercedes man was soon joined in the top five by soft-tyre runners Nico Hulkenberg and Felipe Nasr.

HUL and NAS appear around the 300 cuml (cumulative laptime) mark. We note that PER is there in the mix too, but is not mentioned explicitly in the report.

In the closing stages of the session those in the danger zone were Max Verstappen, Pastor Maldonado and Will Stevens and Roberto Merhi.

On the right hand side of the chart, we see laps at the end of the session from MAL and VES (and way off the pace, STE). One problem with the chart as style above (showing cumulative best times in the session, makes it hard to see which a driver’s best session time overall actually is. (We could address this by perhaps displaying a driver’s session best time using a bold font.) The chart is also very cluttered around the cutoff time which makes it hard to see clearly who got through and who didn’t. And we don’t really know where the danger zone is because we have no clear indication of what the best 15 drivers’ times are – and hence, where the evolving cut-off time is…

Verstappen found the required pace and scraped into Q2 with a time of 1:35.611. Maldonado, however, failed to make it through, his best lap of 1:35.677 only being good enough for P16.

Verstappen’s leap to safety also pushed out Daniil Kvyat, with the Russian putting in a disappointing final lap that netted him P17 behind the Lotus driver. Hulkenberg was the last man through to Q2, the Force India driver’s 1:35.653 seeing him safely through with just two hundredths of a second in hand over Maldonado…

With an evolution of the cutoff time, and a zoom around the final cutoff time, we should be able to see what went on rather more clearly.

At the top of the order, Hamilton was quickest, finishing a tenth in front of Bottas. Rosberg was third, though he finished the session close on half a second down on his team-mate.

Felipe Massa was fourth for Williams, ahead of Raikkonen, Red Bull’s Daniel Ricciardo and Sebastian Vettel, who completed just three laps in the opening session. All drivers set their best times on the soft tyre.

This information can be quite clearly seen on the chart – aside from the tyre data which is not made available by the FIA.

The follow description of Q2 provides quite a straightforward reading of the second panel of the chart.

In the second session, Rosberg initially set the pace but Hamilton quickly worked his way back to the top of the order, his first run netting a time of 1:32.669. Rosberg was also again eclipsed by Massa who set a time three tenths of a second quicker than Rosberg’s.

The last to set an opening time were the Ferraris of Raikkonen and Vettel, though both rapidly staked a claim on a Q3 berth with the Finn in P2 and the German in P4.

Most of the front runners opted to rely on their first run to see them through and in the closing stages those in the drop zone were Hulkenberg, Force India team-mate Sergio Perez, Nasr, Sauber team-mate Ericsson and McLaren’s Fernando Alonso.

However, the chart does not clearly show how ROS’ early purple time was challenged by BOT, or how MAS early pace time was challenged mid-way through the session by VET and RAI.

Hulkenberg was the man to make the big move, claiming ninth place in Q2 with a time of 1:34.613. Behind him Toro Rosso’s Carlos Sainz scraped through in P10, six hundredths of a second clear of 11th-placed Sergio Perez. The Mexican was followed by Nasr and Ericsson. Alonso claimed P14, while 15th place went to the unfortunate Verstappen, who early in the session had reported that he was down on power.

Again, this reading of the chart would be aided by an evolving cut-off time line.

Looking now to the third panel…

The first runs in Q3 saw Hamilton in charge again, with the champion setting a time of 1:33.552 on used softs to take P1 three tenths of a second ahead of Red Bull’s Ricciardo, who prior to Hamilton’s lap had claimed the fastest S3 time of the session using new soft tyres.

Rosberg, also on used softs, was third, four thousandths of a second down on the Australian’s time. Hulkenberg, with just one new set of softs at his disposal, opted to sit out the first run.

The chart clearly shows the early and late session runs, and is reflected in the analysis:

In the final runs, Vettel was the first of the likely front-row men across the line and with purple times in S1 and S2, the German set a provisional pole time of 1:32.982. It was a superb lap but Hamilton was already running faster, stealing the S1 purple time from the German.

Ahead of the champion on track, Rosberg had similarly taken the best S2 time but he could not find more pace and when he crossed the line he slotted into third, four hundredths [??] of a second behind Vettel.

So what does Justin Hynes’ qualifying session commentary tell us about how we might be able to read the charted summary of the session? And how can we improve the chart to help draw out some of the stories? A couple of things jump out for me – firstly, the evolving purple and green times can be confusing, and are perhaps better placed (for a summary reading of the session) by best in session purple/green times; secondly, the evolution of the cut-off times would help to work out where drivers were placed at different stages of qualifying and what they still had to do – or whether a best-time-so-far recorded by a driver earlier in the session was bumped by the cutoff evolution. Note that the purple time evolution is identified implicitly by the lower envelope of the laptimes in each session.

Scraping Web Pages With R

One of the things I tend to avoid doing in R, partly because there are better tools elsewhere, is screenscraping. With the release of the new rvest package, I thought I’d have a go at what amounts to one of the simplest webscraping activites – grabbing HTML tables out of webpages.

The tables I had in my sights (when I can actually find them…) are the tables that appear on the newly designed FIA website that describe a range of timing results for F1 qualifying and races [quali example, race example].

Inspecting an example target web page, whilst a menu allows you to select several different results tables, a quick look at the underlying HTML source code reveals that all the tables relevant to the session (that is, a particular race, or complete qualifying session) are described within a single page.

So how can we grab those tables down from a target page? The following recipe seems to do the trick:


#URL of the HTML webpage we want to scrape

  #Grab the page
  #Parse HTML
  cc=html_nodes(hh, xpath = &quot;//table&quot;)[[num]] %&gt;% html_table(fill=TRUE)
  #TO DO - extract table name
  #Set the column names
  colnames(cc) = cc[1, ]
  #Drop all NA column
  cc=Filter(function(x)!all(is.na(x)), cc[-1,])
  #Fill blanks with NA
  cc=apply(cc, 2, function(x) gsub(&quot;^$|^ $&quot;, NA, x))
  #would the dataframe cast handle the NA?

## Qualifying:

The fiaTableGrabber() grabs a particular table from a page with a particular URL (I really should grab the page separately and then extract whatever table from the fetched page, or at least cache the page (unless there is a cacheing option built-in?)

Depending on the table grabbed, we may then need to tidy it up. I hacked together a few sketch functions that tidy up (and remap) column names, convert “natural times” in minutes and seconds to seconds equivalent, and in the case of the race pits data, separate out two tables that get merged into one.

  for (q in c('Q1','Q2','Q3')){
  xx=dplyr:::rename(xx, Q1_laps=LAPS)
  xx=dplyr:::rename(xx, Q2_laps=LAPS.1)
  xx=dplyr:::rename(xx, Q3_laps=LAPS.2)

#2Q, 3R 
  for (s in c('s1','s2','s3')) {

#3Q, 4R

# 4Q, 5R

# 2R
  xx['time']=apply(xx['LAP TIME'],1,timeInS)

# 6R
  r=which(xx['NO']=='RACE - PIT STOP - DETAIL')
  xx['tot_time']=apply(xx['TOTAL TIME'],1,timeInS)
  Filter(function(x)!all(is.na(x)), xx[1:r-1,])

  colnames(xx)=c('NO','DRIVER','LAP','TIME','STOP','NAT DURATION','TOTAL TIME')
  xx['tot_time']=apply(xx['TOTAL TIME'],1,timeInS)
  xx['duration']=apply(xx['NAT DURATION'],1,timeInS)
  r=which(xx['NO']=='RACE - PIT STOP - DETAIL')
  #Remove blank row - http://stackoverflow.com/a/6437778/454773
  xx[rowSums(is.na(xx)) != ncol(xx),]

So for example:


I’m still not convinced that R is the most natural, efficient, elegant or expressive language for scraping with, though…

PS In passing, I note the release of the readxl Excel reading library (no external-to-R dependencies, compatible with various flavours of Excel spreadsheet).

PPS Looking at the above screenshot, it strikes me that if we look at the time of day of and the duration, we can tell if there is a track position (at least) change in the pits… So for example, ROS goes in at 15:11:11 with a 33.689 stop and RIC goes in at 15:11:13 with a 26.714. So ROS enters the pits ahead of RIC and leaves after him? The following lap chart from f1fanatic perhaps reinforces this view?