f1datajunkie – OUseful.Info, the blog…

ergastR – R Wrapper for ergast F1 Results Data API

By the by, I’ve posted a first attempt at an R package – ergastR to wrap the ergast developer API, which is where I get chunks of data from for my f1datajunkie tinkerings.

You can find it on Github: psychemedia/ergastR.

The function names are the ones used in the Wrangling F1 Data With R book.

The R package needs a bit of tidying up and also needs work on the following: cacheing, so that we don’t keep hitting the ergast API unnecessarily; paged results handling (I fudge this a bit at the moment by explicitly setting a large results limit); and dual handling of ergast API versus downloaed ergast database requests (if a database connection string is passed, use that rather than make a call to the ergast API). But it’s a start… Feel free to raise issues via the repo:-)

In related news, Will Vaughan tipped me off to a Python package he’s started putting together to wrap the ergast API: ergast-python. He’s also making a start on some Wrangling F1 Data Jupyter notebooks that make use of the Python wrapper: wranglingf1data.

Running the Numbers – How Can Hamilton Still Take the 2016 F1 Drivers’ Championship?

Way back in 2012, I posted a simple R script for trying to work out the finishing combinations in the last two races of that year’s F1 season for Fernando Alonso and Sebastien Vettel to explore the circumstances under which Alonso could take the championship (Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix); I also put together a simple shiny version of the script to make it bit more app like (Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship), which I also updated for the 2014 season (F1 Championship Race, 2014 – Winning Combinations…).

And now we come to 2016, and once again, with two races to go, there are two drivers in with a chance of winning overall… But what race finishing combinations could see Hamilton make a last stand and reclaim his title? The F1 Drivers’ Championship Scenarios, 2016 shiny app will show you…

You can find the code in a gist here:

	library(shiny)
	library(ggplot2)
	library(reshape)

	# Define server logic required to generate and plot a random distribution
	shinyServer(function(input, output) {
	points=data.frame(pos=1:11,val=c(25,18,15,12,10,8,6,4,2,1,0))
	points[[1,2]]
	a=330
	v=349

	pospoints=function(a,v,pdiff,points){
	pp=matrix(ncol = nrow(points), nrow = nrow(points))
	for (i in 1:nrow(points)){
	for (j in 1:nrow(points))
	pp[[i,j]]=v-a+pdiff[[i,j]]
	}
	pp
	}

	pdiff=matrix(ncol = nrow(points), nrow = nrow(points))
	for (i in 1:nrow(points)){
	for (j in 1:nrow(points))
	pdiff[[i,j]]=points[[i,2]]-points[[j,2]]
	}

	ppx=pospoints(a,v,pdiff,points)

	winmdiff=function(vadiff,pdiff,points){
	win=matrix(ncol = nrow(points), nrow = nrow(points))
	for (i in 1:nrow(points)){
	for (j in 1:nrow(points))
	if (i==j) win[[i,j]]=''
	else if ((vadiff+pdiff[[i,j]])>=0) win[[i,j]]='ROS'
	else win[[i,j]]='HAM'
	}
	win
	}

	# Function that generates a plot of the distribution. The function
	# is wrapped in a call to reactivePlot to indicate that:
	#
	# 1) It is "reactive" and therefore should be automatically
	# re-executed when inputs change
	# 2) Its output type is a plot
	#
	output$distPlot <- renderPlot( {
	wmd=winmdiff(ppx[[input$ros,input$ham]],pdiff,points)
	wmdm=melt(wmd)
	g=ggplot(wmdm)+geom_text(aes(X1,X2,label=value,col=value))
	g=g+xlab('ROS position in Abu Dhabi')+ ylab('HAM position in Abu Dhabi')
	g=g+labs(title="Championship outcomes in Abu Dhabi")
	g=g+ theme(legend.position="none")
	g=g+scale_x_continuous(breaks=seq(1, 11, 1))+scale_y_continuous(breaks=seq(1, 11, 1))
	g=g+coord_flip()
	print(g)
	})
	})

view raw server.R hosted with ❤ by GitHub

	library(shiny)

	shinyUI(pageWithSidebar(

	# Application title
	headerPanel("F1 Driver Championship Scenarios, 2016"),

	# Sidebar with a slider input for number of observations
	sidebarPanel(
	sliderInput("ham",
	"HAM race pos in Brazilian Grand Prix:",
	min = 1,
	max = 11,
	value = 1),
	sliderInput("ros",
	"ROS race pos in Brazilian Grand Prix:",
	min = 1,
	max = 11,
	value = 2),
	hr(),
	em("See also:"),br(),
	a(href="http://f1datajunkie.com","f1datajunkie.com"),
	br(),
	a(href="https://leanpub.com/wranglingf1datawithr","Wrangling F1 Data With R")
	),

	# Show a plot of the generated distribution
	mainPanel(
	plotOutput("distPlot"),
	h3("How to use the Championship predictor"),
	p("With two more races to go in the 2016 F1 season, what do Hamilton and Rosberg each need to do to win the Drivers' Championship?" ),
	p("Using the sliders, select various finishing positions for the drivers in the next race, the Brazilian Grand Prix." ),
	p("The output will then update to show the who will be champion if for a all possible points scoring finishing positions at the last race in Abu Dhabi. For a particular finishing combination, the champion will be the named driver." ),
	h4("How to Read the Chart"),
	p("If you expect Hamilton to win in Brazil, and Rosberg to come second, set the sliders accordingly. The display changes to show that if HAM finished first in Abu Dhabi, if ROS finishes second or third, ROS will take the championship. If HAM finishes second in Abu Dhabi, HAM will win overall if ROS places 8th or lower. If ROS places 9th in Abu DHan=bi, HAM wins overall if he makes it onto the race podium. If ROS finishes on the race podium in the final race, he takes the Drivers' Championship wherever HAM finishes." ),
	p("If neither of the drivers take points in Brazil, HAM can still win overall if he wins in Abu Dhabi and ROC somes 8th or lower." ),
	hr(),
	em("For more F1 stats'n'data wrangling, see "), a(href="http://f1datajunkie.com","f1datajunkie.com"), em("or the Leanpub book"),
	a(href="https://leanpub.com/wranglingf1datawithr","Wrangling F1 Data With R"),em(".")
	)
	))

view raw ui.R hosted with ❤ by GitHub

Visualising F1 Stint Strategies

With the new F1 season upon us, I’ve started tinkering with bits of code from the Wrangling F1 Data With R book and looking at the data in some new ways.

For example, I started wondering whether we might be able to learn something interesting about the race strategies by looking at laptimes on a stint by stint basis.

To begin with, we need some data – I’m going to grab it directly from the ergast API using some functions that are bundled in with the Leanpub book…

#ergast functions described in: https://leanpub.com/wranglingf1datawithr/
#Get laptime data from the ergast API
l2=lapsData.df(2016,2)
#Get pits data from the ergast API
p2=pitsData.df(2016,2)

#merge pit data into the laptime data
l3=merge(l2,p2[,c('driverId','lap','rawduration')],by=c('driverId','lap'),all=T)

#generate an inlap flag (inlap is the lap assigned the pit time)
l3['inlap']=!is.na(l3['rawduration'])

#generate an outlap flag (outlap is the first lap of the race or laps starting from the pits
l3=ddply(l3,.(driverId),transform,outlap=c(T,!is.na(head(rawduration,-1))))

#use the pitstop flag to number stints; note: a drive through penalty increments the stint count
l3=arrange(l3,driverId, -lap)
l3=ddply(l3,.(driverId),transform,stint=1+sum(inlap)-cumsum(inlap))

#number the laps in each stint
l3=arrange(l3,driverId, lap)
l3=ddply(l3,.(driverId,stint),transform,lapInStint=1:length(stint))
l3=arrange(l3,driverId, lap)

The laptimes associated with the in- and out- lap associated with a pit stop add noise to the full lap times completed within each stint, so lets flag those laps so we can then filter them out:

#Discount the inlap and outlap
l4=l3[!l3['outlap'] & !l3['inlap'],]

We can now look at the data… I’m going to facet by driver, and also group the laptimes associated with each stint. Then we can plot just the raw laptimes, and also a simple linear model based on the full lap times within each stint:

#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime, col=factor(stint)))+facet_wrap(~driverId)

#Chart the raw laptimes within each stint
g+geom_line()

#Plot a simple linear model for each stint
g+ geom_smooth(method = "lm", formula = y ~ x)

So for example, here are the raw laptimes, excluding inlap and outlap, by stint for each driver in the recent 2016 Bahrain Formual One Grand Prix:

bah_2-16_racestint_raw

And here’s the simple linear model:

bah_2-16_racestint_lm

These charts highlight several things:

trivially, the number and length of the stints completed by each driver;
degradation effects in terms of the gradient of the slope of each stint trace;
fuel effects- the y-axis offset for each stint is the sum of the fuel effect (as the race progresses the cars get lighter and laptime goes down more or less linearly) and a basic tyre effect (the “base” time we might expect from a tyre). Based on the total number of laps completed in stints prior to a particular stint, we can calculate a fuel effect offset for the laptimes in each stint which should serve to normalise the y-axis laptimes and make more evident the base tyre laptime.
looking at the charts as a whole, we get a feel for strategy – what sort of tyre/stint strategy do the cars start race with, for example; are the cars going long on tyres without much degradation, or pushing for various length stints on tyres that lose significant time each lap? And so on… (What can you read into/from the charts? Let me know in the comments below;-)

If we assume a 0.083s per lap fuel weight penalty effect, we can replot the chart to account for this:

#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime+0.083*lap, col=factor(stint)))
g+facet_wrap(~driverId) +geom_line()

Here’s what we get:

f1_2016_bah_stint_raw_fc

And here’s what the fuel corrected models look like:

f1_2016_bah_stint_lm_fc

UPDATE: the above fuel calculation goes the wrong way – oops! It should be:

MAXLAPS=max(l4['lap'])
FUEL_PENALTY =0.083
.e = environment()
g=ggplot(l4,aes(x=lapInStint,y=rawtime-(MAXLAPS-lap)*FUEL_PENALTY,col=factor(stint)),environment=.e)

What we really need to do now is annotate the charts with additional tyre selection information for each stint.

We can also do a few more sums. For example, generate a simple average laptime per stint, excluding inlap and outlap times:

#Calculate some stint summary data
l5=ddply(l4,.(driverId,stint), summarise,
                               stintav=sum(rawtime)/length(rawtime),
                               stintsum=sum(rawtime),
                               stinlen=length(rawtime))

which gives results of the form:

          driverId stint   stintav stintsum stinlen
1          rosberg     1  98.04445 1078.489      11
2          rosberg     2  97.55133 1463.270      15
3          rosberg     3  96.15543  673.088       7
4          rosberg     4  96.32494 1637.524      17
5            massa     1  99.13600  495.680       5
6            massa     2 100.48300 2009.660      20
7            massa     3  98.77862 2568.244      26

It would possibly be useful to also compare inlap and outlaps somehow, as well as factoring in the pitstop time. I’m pondering a couple a possibilities for the latter :

amortise the pitstop time over the laps leading up to a pitstop by adding a pitsop lap penalty to each lap in that stint calculated as the pitstop time of the stint length of the laps in the stint leading up to the pitstop; this essentially penalises the stint that leads up to the pitstop as a consequence of forcing the pitstop;
amortise the pitstop time over the laps immediately following a pitstop by adding a pitsop lap penalty to each lap in that stint calculated as the pitstop time of the stint length of the laps in the stint following the pitstop; this essentially penalises the stint that immediately follows the pitstop, and discounts some of the benefit from the pitstop.

I haven’t run the numbers yet though, so I’m not sure how these different approaches will feel…

Detecting Undercuts in F1 Races Using R

One of the things that’s been on my to do list for some time has been the identification of tactical or strategic events within a race that might be detected automatically. One such event is an undercut described by F1 journalist James Allen in the following terms (The secret of undercut and offset):

An undercut is where Driver A leads Driver B, but Driver B turns into the pits before Driver A and changes to new tyres. As Driver A is ahead, he’s unaware that this move is coming until it’s too late to react and he has passed the pit lane entry.
On fresh tyres, Driver B then drives a very fast “Out” lap from the pits. Driver A will react to the stop and pit on the next lap, but his “In” lap time will have been set on old tyres, so will be slower. As he emerges from the pit lane after his stop, Driver B is often narrowly ahead of him into the first corner.

In logical terms, we might characterise this as follows:

two drivers, d1 and d2: d1 !=d2;
d1 pits on lap X, and drives an outlap on lap X+1;
d1’s position on their pitlap (lap X) is greater than d2’s position on the same lap X;
d2 pits on lap X+1, with an outlap on lap X+2;
d2’s position on their outlap (lap X+2) is greater than d1’s position on the same lap X+2.

We can generalise this formulation, and try to make it more robust, by comparing positions on the lap prior to d1’s stop (lap A) with the positions on d2’s outlap (lap B):

two drivers, d1 and d2: d1 !=d2;
d1 pits on lap A+1;
d1’s position on their “prelap” (lap A), the lap prior to their pitlap (lap A+1), is greater than d2’s position on lap A; this condition tries to ensure that d1 is behind d2 as they enter the pit stop phase but it misses the effect on any first lap stops (unless we add a lap 0 containing the grid positions);
d1’s outlap is on lap A+2;
d2 pits on lap B-1 within the inclusive range [lap A+2, lap A+1+N]: N>=1, (that is, within N laps of D1’s stop) with an outlap on lap B; the parameter, N, allows us to test for changes of position within a pit stop window, rather than requiring that d2 pits on the lap immediately following d1’s stop;
d2’s position on their outlap (lap B, in the inclusive range [lap A+3, lap A+2+N]) is greater than d1’s position on the same lap B.

One way of implementing these constraints is to write a declarative style query that specifies the conditions we want the solution to meet, rather than writing a procedural programme to find such an answer. Using the sqldf package, we can use a SQL query to achieve just this result.

One way of writing the query is to create two situations, a and b, where situation a corresponds to a lap on which d1 stops, and situation b corresponds to the driver d2’s stop. We then capture the data for each driver in each situation, to give four data states: d1a, d1b, d2a, d2b. These states are then subjected to the conditions specified above (using N=5).

#First get laptime data from the ergast API
lapTimes=lapsData.df(2015,9)

#Now find pit times
p=pitsData.df(2015,9)

#merge pitdata with lapsdata
lapTimesp=merge(lapTimes, p, by = c('lap','driverId'), all.x=T)

#flag pit laps
lapTimesp$ps = ifelse(is.na(lapTimesp$milliseconds), F, T)

#Ensure laps for each driver are sorted
library(plyr)
lapTimesp=arrange(lapTimesp, driverId, lap)

#do an offset on the laps that are pitstops for each driver
#to set outlap flags for each driver
lapTimesp=ddply(lapTimesp, .(driverId), transform, outlap=c(FALSE, head(ps,-1)))

#identify lap before pit lap by reverse sorting
lapTimesp=arrange(lapTimesp, driverId, -lap)
#So we can do an offset going the other way
lapTimesp=ddply(lapTimesp, .(driverId), transform, prelap=c(FALSE, head(ps,-1)))

#tidy up
lapTimesp=arrange(lapTimesp,acctime)

#Now we can run the SQL query
library(sqldf)
ss=sqldf('SELECT d1a.driverId AS d1, d2a.driverId AS d2, \
            d1a.lap AS A, d1a.position AS d1posA, d1b.position AS d1posB, \
            d2b.lap AS B, d2a.position AS d2posA, d2b.position AS d2posB \
          FROM lapTimesp d1a, lapTimesp d1b, lapTimesp d2a, lapTimesp d2b \
          WHERE d1a.driverId=d1b.driverId AND d2a.driverId=d2b.driverId \
            AND d1a.driverId!=d2a.driverId \
            AND d1a.prelap AND d1a.lap=d2a.lap AND d2b.outlap AND d2b.lap=d1b.lap \
            AND (d1a.lap+3<=d1b.lap AND d1b.lap<=d1a.lap+2+5) \
            AND d1a.position>d2a.position AND d1b.position < d2b.position')

For the 2015 British Grand Prix, here’s what we get:

          d1         d2  A d1posA d2posA  B d1posB d2posB
1  ricciardo      sainz 10     11     10 13     12     13
2     vettel      kvyat 13      8      7 19      8     10
3     vettel hulkenberg 13      8      6 20      7     10
4      kvyat hulkenberg 17      6      5 20      9     10
5   hamilton      massa 18      3      1 21      2      3
6   hamilton     bottas 18      3      2 22      1      3
7     alonso   ericsson 36     11     10 42     10     11
8     alonso   ericsson 36     11     10 43     10     11
9     vettel     bottas 42      5      4 45      3      5
10    vettel      massa 42      5      3 45      3      4
11     merhi    stevens 43     13     12 46     12     13

With a five lap window we have evidence that supports successful undercuts in several cases, including VET taking KVY and HUL with his early stop at lap 13+1 (KVY pitting on lap 19-1 and HUL on lap 20-1), and MAS and BOT both being taken first by HAM’s stop at lap 18+1 and then by VET’s stop at lap 42+1.

To make things easier to read, we may instead define d1a.lap+1 AS d1Pitlap and d2b.lap-1 AS d2Pitlap.

The query doesn’t guarantee that the pit stop was responsible for change in order, but it does at least gives us some prompts as to where we might look.

Spotting Potential Battles in F1 Races

Over the last couple of races, I’ve started trying to review a variety of battlemaps for various drivers in each race. Prompted by an email request for more info around the battlemaps, I generated a new sketch charting the on track gaps between each driver and the lap leader for each lap of the race (How the F1 Canadian Grand Prix Race Evolved on Track).

Colour is used to identify cars on lead lap compared to lapped drivers. For lapped drivers, a count of how many laps they are behind the leader is displayed. I additionally overplot with a highlight for specified driver, as well as adding in a mark that shows the on track position of the leader of the next lap, along with their driver code.

Battles can be identified through the close proximity of two or more drivers within a lap, across several laps. The ‘next-lap-leader’ time at the far right shows how close the leader on the next lead lap is to the backmarker (on track) on the current lead lap.

By highlighting two particular drivers, we could compare how their races evolved, perhaps highlighting different strategies used within a race that eventually bring the drivers into a close competitive battle in the last few laps of a race.

The unchanging leader-on-track-delta-of-0 line is perhaps missing an informational opportunity? For example, should we set the leader’s time to be the delta compared to the lap time for the leader laps from the previous lead lap? Or a delta compared to the fastest laptime on the previous lead lap? And if we do start messing about with an offset to the leader’s lap time, we presumably need to apply the same offset to the laptime of everyone else on the lap so we can still see the comparative on-track gaps to leader?

On the to-do list are various strategies for automatically identifying potential battles based on a variety of in-lap and across-lap heuristics.

Here’s the code:

#Grab some data
lapTimes =lapsData.df(2015,7)

#Process the laptimes
lapTimes=battlemap_encoder(lapTimes)

#Find the accumulated race time at the start of each leader's lap
lapTimes=ddply(lapTimes,.(leadlap),transform,lstart=min(acctime))

#Find the on-track gap to leader
lapTimes['trackdiff']=lapTimes['acctime']-lapTimes['lstart']

#Construct a dataframe that contains the difference between the 
#leader accumulated laptime on current lap and next lap
#i.e. how far behind current lap leader is next-lap leader?
ll=data.frame(t=diff(lapTimes[lapTimes['position']==1,'acctime']))
#Generate a de facto lap count
ll['n']=1:nrow(ll)
#Grab the code of the lap leader on the next lap
ll['c']=lapTimes[lapTimes['position']==1 & lapTimes['lap']>1,'code']

#Plot the on-track gap to leader versus leader lap
g = ggplot(lapTimes) 
g = g + geom_point(aes(x=trackdiff,y=leadlap,col=(lap==leadlap)), pch=1)
g = g + geom_point(data=lapTimes[lapTimes['driverId']=='vettel',],
                  aes(x=trackdiff,y=leadlap), pch='+')
g = g + geom_text(data=lapTimes[lapTimes['lapsbehind']>0,],
                  aes(x=trackdiff,y=leadlap, label=lapsbehind),size=3)
g = g + geom_point(data=ll,aes(x=t, y=n), pch='x')
g = g + geom_text(data=ll,aes(x=t+3, y=n,label=c), size=2)
g = g + geom_vline(aes(xintercept=17), linetype=3)
g

This chart will be included in a future update to the Wrangling F1 Data With R book. I hope to do a sprint on that book to tidy it up and get it into a reasonably edited state in the next few weeks. At that point, the text will probably be frozen, a print-on-demand version generated, and if it ends up on Amazon, the minimum price being hiked considerably.

Charts are for Reading…

If charts are pictures, and every picture not only tells a story, but also saves a thousand words in doing so, how then are we to actually read them?

Take the following example, a quick #f1datajunkie sketch show how the Bahrain 2015 qualifying session progressed. The chart is split into three, one for each part of qualifying (which we might refer to as fractional sessions), which already starts to set the scene for the story. The horizontal x-axis is the time in seconds into qualifying at which each laptime is recorded, indexed against the first laptime recorded in qualifying overall. The vertical y-axis records laptimes in in seconds, limited to 107% of the fastest laptime recorded in a particular session. The green colour denotes a driver’s fastest laptime recorded in each fractional session, purple the overall fasted laptime recorded so far in a fractional session (purple trumps green). So again, the chart is starting to paint a picture.

An example of the sort of analysis that can be provided for a qualifying session can be found in a post by Justin Hynes, Lewis Hamilton seals his first Bahrain pole but Vettel poses the menace to Mercedes’ hopes, that appeared on he James Allen on F1 blog. In this post, I’ll try to match elements of that analysis with things we can directly see in the chart above…

[Hamilton] finish[ed] 0.411s clear of Ferrari’s Sebastian Vettel and more than half a second in front of his Mercedes team-mate Nico Rosberg

We don’t get the time gap exactly from the chart, but looking to the rightmost panel (Q3), finding the lowest vertical marks for HAM, VET and ROS, and imagining a horizontal line across to the y-axis, we get a feeling for the relative gaps.

Q1 got underway in slightly calmer conditions than blustery FP3 and Raikkonen was the first to take to the track, with Bottas joining the fray soon after. The Williams driver quickly took P1 but was then eclipsed by Rosberg, who set a time of 1: 35.657 on the medium tyres.

Q1 is the leftmost panel, in which we see RAI setting the first representative laptime at least (within the 107% limit of the session best overall), followed by BOT and then ROS improving on the early purple times.

The Mercedes man was soon joined in the top five by soft-tyre runners Nico Hulkenberg and Felipe Nasr.

HUL and NAS appear around the 300 cuml (cumulative laptime) mark. We note that PER is there in the mix too, but is not mentioned explicitly in the report.

In the closing stages of the session those in the danger zone were Max Verstappen, Pastor Maldonado and Will Stevens and Roberto Merhi.

On the right hand side of the chart, we see laps at the end of the session from MAL and VES (and way off the pace, STE). One problem with the chart as style above (showing cumulative best times in the session, makes it hard to see which a driver’s best session time overall actually is. (We could address this by perhaps displaying a driver’s session best time using a bold font.) The chart is also very cluttered around the cutoff time which makes it hard to see clearly who got through and who didn’t. And we don’t really know where the danger zone is because we have no clear indication of what the best 15 drivers’ times are – and hence, where the evolving cut-off time is…

Verstappen found the required pace and scraped into Q2 with a time of 1:35.611. Maldonado, however, failed to make it through, his best lap of 1:35.677 only being good enough for P16.

Verstappen’s leap to safety also pushed out Daniil Kvyat, with the Russian putting in a disappointing final lap that netted him P17 behind the Lotus driver. Hulkenberg was the last man through to Q2, the Force India driver’s 1:35.653 seeing him safely through with just two hundredths of a second in hand over Maldonado…

With an evolution of the cutoff time, and a zoom around the final cutoff time, we should be able to see what went on rather more clearly.

At the top of the order, Hamilton was quickest, finishing a tenth in front of Bottas. Rosberg was third, though he finished the session close on half a second down on his team-mate.

Felipe Massa was fourth for Williams, ahead of Raikkonen, Red Bull’s Daniel Ricciardo and Sebastian Vettel, who completed just three laps in the opening session. All drivers set their best times on the soft tyre.

This information can be quite clearly seen on the chart – aside from the tyre data which is not made available by the FIA.

The follow description of Q2 provides quite a straightforward reading of the second panel of the chart.

In the second session, Rosberg initially set the pace but Hamilton quickly worked his way back to the top of the order, his first run netting a time of 1:32.669. Rosberg was also again eclipsed by Massa who set a time three tenths of a second quicker than Rosberg’s.

The last to set an opening time were the Ferraris of Raikkonen and Vettel, though both rapidly staked a claim on a Q3 berth with the Finn in P2 and the German in P4.

Most of the front runners opted to rely on their first run to see them through and in the closing stages those in the drop zone were Hulkenberg, Force India team-mate Sergio Perez, Nasr, Sauber team-mate Ericsson and McLaren’s Fernando Alonso.

However, the chart does not clearly show how ROS’ early purple time was challenged by BOT, or how MAS early pace time was challenged mid-way through the session by VET and RAI.

Hulkenberg was the man to make the big move, claiming ninth place in Q2 with a time of 1:34.613. Behind him Toro Rosso’s Carlos Sainz scraped through in P10, six hundredths of a second clear of 11th-placed Sergio Perez. The Mexican was followed by Nasr and Ericsson. Alonso claimed P14, while 15th place went to the unfortunate Verstappen, who early in the session had reported that he was down on power.

Again, this reading of the chart would be aided by an evolving cut-off time line.

Looking now to the third panel…

The first runs in Q3 saw Hamilton in charge again, with the champion setting a time of 1:33.552 on used softs to take P1 three tenths of a second ahead of Red Bull’s Ricciardo, who prior to Hamilton’s lap had claimed the fastest S3 time of the session using new soft tyres.

Rosberg, also on used softs, was third, four thousandths of a second down on the Australian’s time. Hulkenberg, with just one new set of softs at his disposal, opted to sit out the first run.

The chart clearly shows the early and late session runs, and is reflected in the analysis:

In the final runs, Vettel was the first of the likely front-row men across the line and with purple times in S1 and S2, the German set a provisional pole time of 1:32.982. It was a superb lap but Hamilton was already running faster, stealing the S1 purple time from the German.

Ahead of the champion on track, Rosberg had similarly taken the best S2 time but he could not find more pace and when he crossed the line he slotted into third, four hundredths [??] of a second behind Vettel.

So what does Justin Hynes’ qualifying session commentary tell us about how we might be able to read the charted summary of the session? And how can we improve the chart to help draw out some of the stories? A couple of things jump out for me – firstly, the evolving purple and green times can be confusing, and are perhaps better placed (for a summary reading of the session) by best in session purple/green times; secondly, the evolution of the cut-off times would help to work out where drivers were placed at different stages of qualifying and what they still had to do – or whether a best-time-so-far recorded by a driver earlier in the session was bumped by the cutoff evolution. Note that the purple time evolution is identified implicitly by the lower envelope of the laptimes in each session.

Scraping Web Pages With R

One of the things I tend to avoid doing in R, partly because there are better tools elsewhere, is screenscraping. With the release of the new rvest package, I thought I’d have a go at what amounts to one of the simplest webscraping activites – grabbing HTML tables out of webpages.

The tables I had in my sights (when I can actually find them…) are the tables that appear on the newly designed FIA website that describe a range of timing results for F1 qualifying and races [quali example, race example].

Inspecting an example target web page, whilst a menu allows you to select several different results tables, a quick look at the underlying HTML source code reveals that all the tables relevant to the session (that is, a particular race, or complete qualifying session) are described within a single page.

So how can we grab those tables down from a target page? The following recipe seems to do the trick:

#install.packages(&quot;rvest&quot;)
library(rvest)

#URL of the HTML webpage we want to scrape
url=&quot;http://www.fia.com/events/formula-1-world-championship/season-2015/qualifying-classification&quot;

fiaTableGrabber=function(url,num){
  #Grab the page
  hh=html(url)
  #Parse HTML
  cc=html_nodes(hh, xpath = &quot;//table&quot;)[[num]] %&gt;% html_table(fill=TRUE)
  #TO DO - extract table name
  
  #Set the column names
  colnames(cc) = cc[1, ]
  #Drop all NA column
  cc=Filter(function(x)!all(is.na(x)), cc[-1,])
  #Fill blanks with NA
  cc=apply(cc, 2, function(x) gsub(&quot;^$|^ $&quot;, NA, x))
  #would the dataframe cast handle the NA?
  as.data.frame(cc)
}

#Usage:
#NUM:
## Qualifying:
### 1 CLASSIFICATION 
### 2 BEST SECTOR TIMES
### 3 SPEED TRAP 
### 4 MAXIMUM SPEEDS
##Race:
### 1 CLASSIFICATION
### 2 FASTEST LAPS
### 3 BEST SECTOR TIMES
### 4 SPEED TRAP
### 5 MAXIMUM SPEEDS
### 6 PIT STOPS
xx=fiaTableGrabber(url,NUM)

The fiaTableGrabber() grabs a particular table from a page with a particular URL (I really should grab the page separately and then extract whatever table from the fetched page, or at least cache the page (unless there is a cacheing option built-in?)

Depending on the table grabbed, we may then need to tidy it up. I hacked together a few sketch functions that tidy up (and remap) column names, convert “natural times” in minutes and seconds to seconds equivalent, and in the case of the race pits data, separate out two tables that get merged into one.

#1Q
fiaQualiClassTidy=function(xx){
  for (q in c('Q1','Q2','Q3')){
    cn=paste(q,'time',sep='')
    xx[cn]=apply(xx[q],1,timeInS)
  }
  
  xx=dplyr:::rename(xx, Q1_laps=LAPS)
  xx=dplyr:::rename(xx, Q2_laps=LAPS.1)
  xx=dplyr:::rename(xx, Q3_laps=LAPS.2)
  xx
}

#2Q, 3R 
fiaSectorTidy=function(xx){
  colnames(xx)=c('pos',
                's1_driver','s1_nattime',
                's2_driver','s2_nattime',
                's3_driver','s3_nattime')
  for (s in c('s1','s2','s3')) {
    sn=paste(s,'_time',sep='')
    sm=paste(s,'_nattime',sep='')
    xx[sn]=apply(xx[sm],1,timeInS)
  }
  
  xx[-1,]
}

#3Q, 4R
fiaTrapTidy=function(xx){
  xx
}

# 4Q, 5R
fiaSpeedTidy=function(xx){
  colnames(xx)=c('pos',
                'inter1_driver','inter1_speed',
                'inter2_driver','inter2_speed',
                'inter3_driver','inter3_speed')
  
  xx[-1,]
}

# 2R
fiaRaceFastlapTidy=function(xx){
  xx['time']=apply(xx['LAP TIME'],1,timeInS)
  xx
}

# 6R
fiaPitsSummary=function(xx){
  r=which(xx['NO']=='RACE - PIT STOP - DETAIL')
  xx['tot_time']=apply(xx['TOTAL TIME'],1,timeInS)
  Filter(function(x)!all(is.na(x)), xx[1:r-1,])
}

#6R
fiaPitsDetail=function(xx){
  colnames(xx)=c('NO','DRIVER','LAP','TIME','STOP','NAT DURATION','TOTAL TIME')
  xx['tot_time']=apply(xx['TOTAL TIME'],1,timeInS)
  xx['duration']=apply(xx['NAT DURATION'],1,timeInS)
  r=which(xx['NO']=='RACE - PIT STOP - DETAIL')
  xx=xx[r+2:nrow(xx),]
  #Remove blank row - http://stackoverflow.com/a/6437778/454773
  xx[rowSums(is.na(xx)) != ncol(xx),]
}

So for example:

I’m still not convinced that R is the most natural, efficient, elegant or expressive language for scraping with, though…

PS In passing, I note the release of the readxl Excel reading library (no external-to-R dependencies, compatible with various flavours of Excel spreadsheet).

PPS Looking at the above screenshot, it strikes me that if we look at the time of day of and the duration, we can tell if there is a track position (at least) change in the pits… So for example, ROS goes in at 15:11:11 with a 33.689 stop and RIC goes in at 15:11:13 with a 26.714. So ROS enters the pits ahead of RIC and leaves after him? The following lap chart from f1fanatic perhaps reinforces this view?

Mixing Numbers and Symbols in Time Series Charts

One of the things I’ve been trying to explore with my #f1datajunkie projects are ways of representing information that work both in a glanceable way as well as repaying deeper reading. I’ve also been looking at various ways of using text labels rather than markers to provide additional information around particular data points.

For example, in a race battlemap, with lap number on the horizontal x-axis and gap time on the vertical y-axis, I use a text label to indicate which driver is ahead (or behind) a particular target driver.

In the revised version of this chart type shown in F1 Malaysia, 2015 – Rosberg’s View of the Race, and additional numerical label along the x-axis indicatesd the race position of the target driver at the end of each lap.

What these charts are intended to do is help the eye see particular structural shapes within the data – for example whether a particular driver is being attacked from behind in the example of a battlemap, or whether they are catching the car ahead (perhaps with intervening cars in the way – although more needs to be done on the chart with respect to this for examples where there are several intervening cars; currently, only a single intervening car immediately ahead on track is shown.)

Two closer readings of the chart are then possible. Firstly, by looking at the y-value we can see the actual time a car is ahead (and here the dashed guide line at +/1 1s helps indicate in a glanceable way the DRS activation line; I’m also pondering how to show an indication of pit loss time to indicate what effect a pit stop might have on the current situation). Secondly, we can read off the labels of the drivers involved i a battle to get a more detailed picture of the race situation.

The latest type of chart I’ve been looking at are session utilisation maps, which in their simplest form look something like the following:

The charts show how each driver made use of a practice session or qualifying – drivers are listed on the vertical y-axis and the time into the session each lap was recorded at is identified along the horizontal x-axis.

This chart makes it easy to see how many stints, and of what length, were completed by each driver and at what point in the session. Other information might be inferred – for example, significant gaps in which no cars are recording times may indicate poor weather conditions or red flags. However, no information is provided about the times recorded for each lap.

We can, however, use colour to identify “purple” laps (fastest lap time recorded so far in the session) and “green” laps (a driver’s fastest laptime so far in the session that isn’t a purple time), as well as laps on which a driver pitted:

But still, no meaningful lap times.

One thing to note about laptimes is that they come in various flavours, such as outlaps, when a driver starts the lap from the pitlane; inlaps, or laps on which a driver comes into the pits at the end of the lap; and flying laps when a driver is properly going for it. There are also those laps on which a driver may be trying out various new lines, slowing down to give themselves space for a flying lap, and so on.

Assuming that inlaps and outlaps are not the best indicators of pace, we can use a blend of symbols and text labels on the chart to identify inlaps and outlaps, as well as showing laptimes for “racing” laps, also using colour to highlight purple and green laps:

The chart is produced using ggplot, and a layered approach in which chart elements are added to the chart in separate layers.

#The base chart with the dataset used to create the original chart
#In this case, the dataset included here is redundant
g = ggplot(f12015test)

#Layer showing in-laps (laps on which a driver pitted) and out-laps
#Use a subset of the dataset to place markers for outlaps and inlaps
g = g + geom_point(data=f12015test[f12015test['outlap'] | f12015test['pit'],],aes(x=cuml, y=name, color=factor(colourx)), pch=1)

#Further annotation to explicitly identify pit laps (in-laps)
g = g + geom_point(data=f12015test[f12015test['pit']==TRUE,],aes(x=cuml, y=name),pch='.')

#Layer showing full laps with rounded laptimes and green/purple lap highlights
#In this case, use the laptime value as a text label, rather than a symbol marker
g = g + geom_text(data=f12015test[!f12015test['outlap'] &amp;amp; !f12015test['pit'],],aes(x=cuml, y=name, label=round(stime,1), color=factor(colourx)), size=2, angle=45)

#Force the colour scale to be one we want
g = g + scale_colour_manual(values=c('darkgrey','darkgreen','purple'))

This version of the chart has the advantage of being glanceable when it comes to identifying session utilisation (number, duration and timing of stints) as well as when purple and green laptimes were recorded, as well as repaying closer reading when it comes to inspecting the actual laptimes recorded during each stint.

To reduce clutter on the chart, laptimes are round to 1 decimal place (tenths of a second) rather than using the full lap time which is recorded down to thousandths of a second.

Session utilisation charts are described more fully in a ~~forthcoming~~ recently released chapter of the Wrangling F1 Data With R Leanpub book. Buying a copy of the book gains you access to future updates of the book. A draft version of the chapter can be found here.

Segmenting F1 Qualifying Session Laptimes

I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into, using some sort of clustering algorithm… or other means…:

For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as the following session utilisation chart (including green and purple laptimes) shows:

The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.

If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:

We can see a big rain delay gap, and also a tighter gap between the first and second sessions.

If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:

# Attempt to identify qualifying session using K-Means Cluster Analysis around 3 means
clusters &amp;amp;lt;- kmeans(f12015test['cuml'], 3)

f12015test = data.frame(f12015test, clusters$cluster)

ggplot(f12015test)+geom_text(aes(x=cuml, y=stime,
label=code, colour=factor(clusters.cluster)) ,angle=45,size=3)

In particular, so of the Q1 laptimes are being grouped with Q2 laptimes.

However, we know that there is at least a 2 minute gap between sessions (regulations suggest 7 minutes, though if this is the time between lights going red then green again, we might need to knock a couple of minutes off the gap to account to for drivers who start their last lap just before the lights go red on a session) so if we assume that the only times there will be a two minute gap between recorded laptimes during the whole of qualifying session will be in the periods between the qualifying sessions, we can can generate a flag on those gaps, and then generate session number counts by counting on those flags.

#Look for a two minute gap
f12015test=arrange(f12015test,cuml)
f12015test['gap']=c(0,diff(f12015test[,'cuml']))
f12015test['gapflag']= (f12015test['gap']&amp;amp;gt;=120)
f12015test['qsession']=1+cumsum(f12015test[,'gapflag'])

ggplot(f12015test)+ geom_text(aes(x=cuml, y=stime, label=code), angle=45,size=3
+facet_wrap(~qsession, scale=&amp;amp;quot;free&amp;amp;quot;)

(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)

This chart clearly shows how the first qualifying session saw cars trialling evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each others’ way on flying laps…)

One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:

minl=min(f12015test$purple)*0.95
maxl=min(f12015test$purple)*1.3

#Use these values in ylim()...

However, where the laptimes differ significantly across sessions as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.

Another crib we might use is to identify PIT lap and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.

Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy into the book, you get all future updates to it for no additional cost, even in the case of the minimum book price increasing over time.

Rediscovering Formula One Race Battlemaps

A couple of days ago, I posted a recipe on the F1DataJunkie blog that described how to calculate track position from laptime data.

Using that information, as well as additional derived columns such as the identity of, and time to, the cars immediately ahead of and behind a particular selected driver, both in terms of track position and race position, I revisited a chart type I first started exploring several years ago – race battle charts.

The main idea behind the battlemaps is that they can help us search for stories amidst the runners.

dirattr=function(attr,dir='ahead') paste(attr,dir,sep='')

#We shall find it convenenient later on to split out the initial data selection
battlemap_df_driverCode=function(driverCode){
  lapTimes[lapTimes['code']==driverCode,]
}

battlemap_core_chart=function(df,g,dir='ahead'){
  car_X=dirattr('car_',dir)
  code_X=dirattr('code_',dir)
  factor_X=paste('factor(position_',dir,'<position)',sep='')
  code_race_X=dirattr('code_race',dir)
  if (dir=='ahead') diff_X='diff' else diff_X='chasediff'
  
  if (dir=='ahead') drs=1000 else drs=-1000
  g=g+geom_hline(aes_string(yintercept=drs),linetype=5,col='grey')
  
  #Plot the offlap cars that aren't directly being raced
  g=g+geom_text(data=df[df[dirattr('code_',dir)]!=df[dirattr('code_race',dir)],],
                aes_string(x='lap',
                  y=car_X,
                  label=code_X,
                  col=factor_X),
              angle=45,size=2)
  #Plot the cars being raced directly
  g=g+geom_text(data=df,
                aes_string(x='lap',
                  y=diff_X,
                  label=code_race_X),
              angle=45,size=2)
  g=g+scale_color_discrete(labels=c('Behind','Ahead'))
  g+guides(col=guide_legend(title='Intervening car'))
  
}

battle_WEB=battlemap_df_driverCode('WEB')
g=battlemap_core_chart(battle_WEB,ggplot(),'ahead')
battlemap_core_chart(battle_WEB,g,dir='behind')

In this first sketch, from the 2012 Australian Grand Prix, I show the battlemap for Mark Webber:

We see how at the start of the race Webber kept pace with Alonso, albeit around about a second behind, at the same time as he drew away from Massa. In the last third of the race, he was closely battling with Hamilton whilst drawing away from Alonso. Coloured labels are used to highlight cars on a different lap (either ahead (aqua) or behind (orange)) that are in a track position between the selected driver and the car one place ahead or behind in terms of race position (the black labels). The y-axis is the time delta in milliseconds between the selected car and cars ahead (y > 0) or behind (y < 0). A dashed line at the +/- one second mark identifies cars within DRS range.

As well as charting the battles in the vicinity of a particular driver, we can also chart the battle in the context of a particular race position. We can reuse the chart elements and simply need to redefine the filtered dataset we are charting.

For example, if we filter the dataset to just get the data for the car in third position at the end of each lap, we can then generate a battle map of this data.

battlemap_df_position=function(position){
  lapTimes[lapTimes['position']==position,]
}

battleForThird=battlemap_df_position(3)

g=battlemap_core_chart(battleForThird,ggplot(),dir='behind')+xlab(NULL)+theme_bw()
g=battlemap_core_chart(battleForThird,g,'ahead')+guides(col=FALSE)
g

For more details, see the original version of the battlemap chapter. For updates to the chapter, I recommend that you invest in a copy Wrangling F1 Data With R book if you haven’t already done so:-)