I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into, using some sort of clustering algorithm, or some other means.
For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as the following session utilisation chart (including green and purple laptimes) shows:
The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.
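As a rough sketch of how an accumulated time like this might be derived: the timestamps below are made up, and `tod_to_seconds` is a hypothetical helper rather than anything from the scraper itself.

```r
# Made-up time-of-day stamps (HH:MM:SS), standing in for timing sheet values
tod <- c("17:00:12", "17:01:55", "17:03:41", "17:20:05")

# Hypothetical helper: convert an HH:MM:SS string to seconds since midnight
tod_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

secs <- tod_to_seconds(tod)

# Basetime is the earliest timestamp recorded for the session
cuml <- secs - min(secs)
```

Here `cuml` plays the same role as the `cuml` column used in the charts: seconds into the session, measured from the first recorded timestamp.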
If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:
We can see a big rain delay gap, and also a tighter gap between the first and second sessions.
If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:
# Attempt to identify qualifying session using K-Means Cluster Analysis around 3 means
clusters <- kmeans(f12015test['cuml'], 3)
f12015test = data.frame(f12015test, clusters$cluster)
ggplot(f12015test)+
  geom_text(aes(x=cuml, y=stime, label=code, colour=factor(clusters.cluster)),
            angle=45, size=3)
In particular, some of the Q1 laptimes are being grouped with Q2 laptimes.
However, we know that there is at least a two minute gap between sessions. (The regulations suggest 7 minutes, though if this is the time between the lights going red and then green again, we might need to knock a couple of minutes off the gap to account for drivers who start their last lap just before the lights go red on a session.) So if we assume that the only gaps of two minutes or more between recorded laptimes during the whole of qualifying fall in the periods between the qualifying sessions, we can generate a flag on those gaps, and then generate session numbers by counting those flags.
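To see how the flag-and-count trick behaves, here is a toy run on a few made-up accumulated times (not real timing sheet data), assuming they are already sorted in time order:

```r
# Made-up accumulated times in seconds, sorted ascending
cuml <- c(0, 95, 190, 300, 520, 615, 700, 1400, 1490)

gap <- c(0, diff(cuml))          # time since the previous recorded lap
gapflag <- gap >= 120            # TRUE where a new session is assumed to start
qsession <- 1 + cumsum(gapflag)  # running count of flags gives a session number

print(data.frame(cuml, gap, gapflag, qsession))
```

Each `TRUE` in `gapflag` bumps the running count by one, so every laptime after a long gap inherits the next session number.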
#Look for a two minute gap
f12015test=arrange(f12015test,cuml)
f12015test['gap']=c(0,diff(f12015test[,'cuml']))
f12015test['gapflag']= (f12015test['gap']>=120)
f12015test['qsession']=1+cumsum(f12015test[,'gapflag'])
ggplot(f12015test)+
  geom_text(aes(x=cuml, y=stime, label=code), angle=45, size=3)+
  facet_wrap(~qsession, scales="free")
(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)
This chart clearly shows how the first qualifying session saw cars trialling evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each other’s way on flying laps…).
One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:
minl=min(f12015test$purple)*0.95
maxl=min(f12015test$purple)*1.3
#Use these values in ylim()...
However, where the laptimes differ significantly across sessions as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.
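One way this per-session filtering might look, as a sketch on made-up data: here the fastest laptime in each session stands in for the purple time, and the 30% cutoff is an arbitrary assumption borrowed from the limits above.

```r
# Toy data: made-up laptimes (seconds) with a session number for each lap
laps <- data.frame(
  qsession = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  stime    = c(101.2, 99.8, 135.0, 102.5, 100.9, 140.3, 118.4, 117.0, 160.2)
)

# Best laptime in each session, repeated on every row of that session
laps$best <- ave(laps$stime, laps$qsession, FUN = min)

# Keep only laps within 30% of that session's own best time
flying <- laps[laps$stime <= laps$best * 1.3, ]
```

Because the cutoff is relative to each session's best time, a wet Q3 gets its own window rather than being judged against dry Q1 times. The filtered data could then go through the same faceted chart with free scales.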
Another crib we might use is to identify PIT laps and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.
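A minimal sketch of that filter, assuming a hypothetical logical `pit` column that marks laps ending in the pits (the data here is made up, and the column names other than `code` are my own):

```r
# Toy data: two drivers, with `pit` marking laps that end with a PIT event
laps <- data.frame(
  code = c("HAM", "HAM", "HAM", "HAM", "HAM", "VET", "VET", "VET", "VET"),
  lap  = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
  pit  = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Make sure laps are in order within each driver
laps <- laps[order(laps$code, laps$lap), ]

# An out-lap is the lap straight after a PIT lap by the same driver:
# shift the pit flag down by one lap within each driver
laps$outlap <- as.logical(ave(as.integer(laps$pit), laps$code,
                              FUN = function(p) c(0L, head(p, -1))))

# Drop PIT laps and out-laps from the laptime trace
flying <- laps[!laps$pit & !laps$outlap, ]
```

Note that this doesn't flag a driver's very first lap out of the garage, which is also effectively an out-lap, so that might need handling separately.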
Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy into the book, you get all future updates to it for no additional cost, even in the case of the minimum book price increasing over time.