Segmenting F1 Qualifying Session Laptimes

I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into, using some sort of clustering algorithm… or other means…:

qualifying_lap_times_0_pdf__page_1_of_4_

For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as the following session utilisation chart (including green and purple laptimes) shows:

practiceutil-purplegreen_utilisation-1

The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.
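As a rough sketch of how that time-into-session value might be derived (presumably what the `cuml` column used later holds), the snippet below converts time-of-day strings to seconds and subtracts the earliest one. The `tod` vector and the `to_secs()` helper are made up for illustration; the real timing sheet data may be laid out differently.

```r
# Hypothetical time-of-day stamps as they might appear on a timing sheet
tod <- c("17:00:12", "17:01:55", "17:03:40", "18:05:02")

# Illustrative helper: convert "HH:MM:SS" strings to seconds since midnight
to_secs <- function(hms) {
  vapply(
    strsplit(hms, ":", fixed = TRUE),
    function(p) sum(as.numeric(p) * c(3600, 60, 1)),
    numeric(1)
  )
}

secs <- to_secs(tod)

# Time into session, measured from the first recorded timestamp
cuml <- secs - min(secs)
cuml
# 0 103 208 3890
```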

If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:

simpleSessionTimes

We can see a big rain delay gap, and also a tighter gap between the first and second sessions.

If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:

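# Attempt to identify qualifying session using K-Means Cluster Analysis around 3 means
library(ggplot2)

# Cluster on cumulative time into session only (a one-dimensional clustering)
clusters <- kmeans(f12015test['cuml'], 3)

# Append the cluster assignments; the new column is named clusters.cluster
f12015test = data.frame(f12015test, clusters$cluster)

ggplot(f12015test)+geom_text(aes(x=cuml, y=stime,
label=code, colour=factor(clusters.cluster)) ,angle=45,size=3)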

qsession-kmeans

In particular, some of the Q1 laptimes are being grouped with Q2 laptimes.
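One practical wrinkle when comparing k-means output with the real sessions is that the cluster numbers returned by `kmeans()` are arbitrary, so cluster 1 need not be the earliest group. The sketch below, on made-up cumulative times rather than the Malaysian data, relabels the clusters in time order by ranking the cluster centres. The `nstart = 25` setting is just an illustrative choice to reduce sensitivity to the random starting centres; it won't necessarily fix the mis-grouping seen above.

```r
set.seed(42)

# Synthetic cumulative times (seconds) in three well-separated bursts
cuml <- c(runif(30, 0, 1080), runif(20, 1500, 2400), runif(10, 4500, 5200))

# One-dimensional k-means with several random restarts
km <- kmeans(cuml, centers = 3, nstart = 25)

# Rank the cluster centres by time, then map each lap's cluster to that rank
centre_rank <- rank(km$centers[, 1])
session_guess <- unname(centre_rank[km$cluster])

# Cross-tabulate the relabelled clusters against the known synthetic groups
table(session_guess, rep(1:3, times = c(30, 20, 10)))
```

Because the relabelled clusters run in time order, any lap assigned to the wrong session shows up as an off-diagonal count in the table.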

However, we know that there is a gap of at least two minutes between sessions. (The regulations suggest 7 minutes, though if that is the time between the lights going red and then green again, we might need to knock a couple of minutes off to account for drivers who start their last lap just before the lights go red on a session.) So if we assume that the only gaps of two minutes or more between recorded laptimes across the whole of qualifying fall between the qualifying sessions, we can generate a flag on those gaps, and then generate session numbers by taking a cumulative sum over those flags.

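#Look for a two minute gap
library(dplyr)  # for arrange()

#Sort by time into session, then find the time since the previous recorded lap
f12015test=arrange(f12015test,cuml)
f12015test['gap']=c(0,diff(f12015test[,'cuml']))
#Flag gaps of 120 seconds or more, and count the flags to number the sessions
f12015test['gapflag']= (f12015test['gap']>=120)
f12015test['qsession']=1+cumsum(f12015test[,'gapflag'])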

ggplot(f12015test)+ geom_text(aes(x=cuml, y=stime, label=code), angle=45,size=3) +
  facet_wrap(~qsession, scales="free")

qsession_facets
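The gap-flagging recipe above can also be wrapped up as a small base R function, which makes it easy to try on toy data or with a different threshold. This is only a sketch: `assign_sessions()` and its `min_gap` argument are names invented here, and the input is assumed to be cumulative time into the session in seconds.

```r
# Hypothetical helper: number the sessions by counting large gaps in time
assign_sessions <- function(cuml, min_gap = 120) {
  ord <- order(cuml)                       # work in time order
  gaps <- c(0, diff(cuml[ord]))            # time since previous recorded lap
  session_sorted <- 1 + cumsum(gaps >= min_gap)
  session <- numeric(length(cuml))
  session[ord] <- session_sorted           # put results back in input order
  session
}

# Toy cumulative times (seconds), deliberately unsorted
cuml <- c(95, 190, 20, 1300, 1400, 1210, 3000, 3095)
assign_sessions(cuml)
# 1 1 1 2 2 2 3 3
```

Returning the session numbers in the original row order means the result can be assigned straight to a new column without sorting the data frame first.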

(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)

This chart clearly shows how the first qualifying session saw cars lapping evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each other’s way on flying laps…)

One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:

minl=min(f12015test$purple)*0.95
maxl=min(f12015test$purple)*1.3

#Use these values in ylim()...

However, where the laptimes differ significantly across sessions, as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.
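One way of doing that per-session filtering might look like the sketch below. It uses made-up laptimes, takes the best `stime` in each `qsession` as a stand-in for the purple time, and reuses the 1.3 multiplier from above as the cut-off; all of those are assumptions rather than a tested recipe.

```r
# Toy laptimes (seconds): a dry first session and a much slower wet one
laps <- data.frame(
  qsession = c(1, 1, 1, 2, 2, 2),
  stime    = c(100.2, 101.5, 140.0, 120.4, 122.1, 170.3)
)

# Best laptime within each session, repeated on every row of that session
best <- ave(laps$stime, laps$qsession, FUN = min)

# Keep laps within 130% of their own session's best time
laps$flying <- laps$stime <= best * 1.3
flying_laps <- laps[laps$flying, ]
flying_laps
```

Plotting `flying_laps` with free facet scales would then give each session its own sensible y-axis range.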

Another crib we might use is to identify PIT laps and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.
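A minimal sketch of that filter is shown below, assuming each row carries a driver `code`, a lap number and a logical `pit` flag marking laps that end in the pits. The `lap` and `pit` column names are hypothetical and may not match the scraped data.

```r
# Toy data: one row per driver lap, with pit marking laps that end in the pits
laps <- data.frame(
  code = c("HAM", "HAM", "HAM", "HAM", "VET", "VET", "VET"),
  lap  = c(1, 2, 3, 4, 1, 2, 3),
  pit  = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Make sure laps are in order within each driver
laps <- laps[order(laps$code, laps$lap), ]

# An out-lap is a lap whose previous lap (for the same driver) was a pit lap
laps$outlap <- as.logical(
  ave(laps$pit, laps$code, FUN = function(p) c(FALSE, head(p, -1)))
)

# Drop pit laps and out-laps to leave candidate flying laps
flying_laps <- laps[!laps$pit & !laps$outlap, ]
flying_laps
```

A driver's first lap out of the garage in each session is also an out-lap in practice, so a fuller version would probably need to flag that too.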

Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy into the book, you get all future updates to it for no additional cost, even in the case of the minimum book price increasing over time.

3 comments

  1. deepanalytics

    What is the goal? What are you able to characterize, or better yet, predict? Given that the lcd goal of qualifying is to be faster than the slowest 4 to 6 cars, can you discover the variables that a driver and/or team would have to improve to get on the starting grid? If you can’t, what is the use?

    • Tony Hirst

      @deepanalytics The churlish answer would be to say this is just a casual recreational data exercise where the point is to just practice various ways of manipulating datasets. Rather than doing Sudoku or Killer.

      It’s also worth bearing in mind that this is just one step of many – how to take a list of lap times for each driver and automatically associate a qualifying session number with each one. Nothing more than that.

      If you want a more practical rationale, I could say that the primary goal is to find ways of detecting and summarising features relating to the story of the qualifying session that could help a race reporter/journalist with no or minimal data skills start thinking about how various depictions and manipulations of data might help them when writing a session summary. A more elaborate secondary goal is to detect features that can be summarised as text/human readable sentences that could be put into an automatically generated text summary of the session.

      In other words, it’s exploring ways of helping journalists think about ways of using the data to help them find and tell stories better.

      I guess this simple example also shows how a trivial application of K-means clustering to a 1-d dataset that I know has three distinct clusters doesn’t segment them cleanly. Which is another nice cautionary example of how stats and data analysis tools are as likely to trip you up as to help you if you blindly trust that applying a particular technique will just work without further checking. Add it as a weak complement to Anscombe’s quartet, Simpson’s paradox, etc etc.

      To start trying to predict things, you need more laptimes, and perhaps speed and sector times. Long run laptimes can often be found in FP2, and sector and speed times from qualifying can be found in other FIA media releases. Two sorts of modeling are possible, depending on use. Betting folk often look to history (eg https://blog.ouseful.info/2013/02/09/f1stats-correlations-between-qualifying-grid-and-race-classisification/ ), techie F1 data geeks perhaps try to do more elaborate modeling such as https://f1metrics.wordpress.com/2014/10/03/building-a-race-simulator/ (this is on my to do list).

      • Theodore Omtzigt

        I would love to see, and think about, these predictive models, but the first hypothesis that I would test is to see if qualifying times have any correlation to race day results, before I would invest in characterizing models for qualifying. Last weekend’s Malaysian GP was a great example of the power of doing risk analysis first to see where the big impact factors are: I would want to see a replay of the prediction that the Hamilton team did on putting him on Primes for the last 18 laps. Vettel has demonstrated great tyre management in the past and that combination, I would posit, led to disastrous results for team Hamilton. The lap times that both Hamilton and Vettel recorded during qualifying and during the race would not have been enough to predict the outcome. The pitstop that the Vettel team didn’t make during the first caution gave the Ferrari team a lot of margin.

        Trying to build a predictive model that guides strategy should be a good goal. Given the fact that the design space is relatively large, doing this as an optimization formulation would be the right structure for the teams themselves so that they can ask for ‘what-if’s in real-time during the race. This would be a minuscule investment in time and money to optimize their enormous engineering and marketing budgets, so I have to believe that Mercedes, Renault, and Ferrari have all this stuff already worked out and fully optimized. A couple of million dollars to invest in that model would have a very attractive ROI. You could even entertain the thought that they put the strategy diffs right in some visual alert on the driver’s steering wheel, or maybe that would be considered coaching and thus not allowed.