
Keeping Track of an Evolving “Top N” Cutoff Threshold Value

In a previous post (Charts are for Reading), I noted how it was difficult to keep track of which times in an F1 qualifying session had made the cutoff time as a qualifying session evolved. The problem can be stated as follows: in the first session, with 20 drivers competing, the 15 drivers with the best ranked laptimes will make it into the next session. Each driver can complete zero or more timed laps, with drivers completing laps in any order.

Finding the 15 drivers who will make the cutoff is therefore not simply a matter of ranking the best 15 laptimes at any point, because the same 5 drivers, say, may each record 3 fast laptimes, thus taking up the 15 slots that record the 15 fastest laptimes.

If we define a discrete time series with steps corresponding to each recorded laptime (from any driver), then at each time step we can find the best 15 drivers by finding each driver’s best laptime to date and ranking by those times. Conceptually, we need something like a lap chart which uses a ‘timed lap count’ rather than race lap index to keep track of the top 15 cars at any point.


At each index step, we can then find the laptime of the 15th ranked car to find the current session cutoff laptime.
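By way of illustration, here's a minimal sketch of the idea in base R (the function and column names are my own): given the laptimes recorded so far, find each driver's best time to date and take the n'th best of those.

#laps: one row per recorded laptime, with driver code and laptime stime
cutoffTime = function(laps, n=15){
  best = aggregate(stime ~ code, data=laps, FUN=min)
  sort(best$stime)[n]
}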

In a dataframe that records laptimes in a session by driver code for each driver, along with a column that contains the current purple laptime, we can arrange the laptimes by cumulative session laptime (so the order of rows follows the order in which laptimes are recorded) and then iterate through those rows one at a time. At each step, we can summarise the best laptime recorded so far in the session for each driver.

library(dplyr)

dfc = data.frame()
for (r in 1:nrow(df)) {
  #Summarise best laptime recorded so far in the session for each driver,
  #keeping track of which session we are in, then rank the drivers' best
  #times to date in the current session
  #(the summarise step is my own reconstruction of code clipped from the post)
  dfcc = df[1:r,] %>% filter(qsession == df[r,]$qsession) %>%
    group_by(code) %>% summarise(dbest = min(stime)) %>% arrange(dbest)
  #The different sessions have different cutoffs: Q1, top 15; Q2, top 10
  n = ifelse(df[r,]$qsession == 'Q1', 15, 10)
  #If we have at least as many driver best times recorded as the cutoff number
  if (nrow(dfcc) >= n){
    #Grab a record of the current cut-off time
    #along with info about each recorded laptime
    dfc = rbind(dfc, data.frame(df[r,]['qsession'], df[r,]['code'],
                                df[r,]['cuml'], dfcc[n,]['dbest']))
  }
}

We can then plot the evolution of the cut-off time as the sessions proceed. The chart in its current form is still a bit hard to parse, but it’s a start…


In the above sketch, the lines connect the current purple time and the current cut-off time in each session (aside from the horizontal line, which represents the cut-off time at the end of the session). This gives a false impression of the evolution of the cutoff time – really, the line should be a stepped line that traces the current cut-off time horizontally until it is improved, at which point it should step vertically down. (In actual fact, the line does track horizontally as laptimes are recorded that do not change the cutoff time, as indicated by the horizontal tracks in the Q1 panel as the grey times (laptimes slower than a driver’s best time in the session so far) are completed.)

The driver labels are coloured according to: purple – current best in session time; green – driver best in session time to date (that wasn’t also purple); red – driver’s best time in session that was outside the final cut-off time. This colouring conflates two approaches to representing information – the purple/green colours represent online algorithmic processing (if we constructed the chart in real time from laptime data as laps were completed, that’s how we’d colour the points), whereas the red colouring represents the results of offline algorithmic processing (the colour can only be calculated at the end of the session, when we know the final session cutoff time). I think these mixed semantics contribute to making the chart difficult to read…

In terms of what sort of stories we might be able to pull from the chart, we see that in Q2, Hulkenberg and Sainz were only fractions of a second apart, and Perez looked like he had squeezed into the final top 10 slot until Sainz pushed him out. To make it easier to see which times contribute to the top 10 times, we could use font weight (eg bold font) to highlight each driver’s session-best laptimes, along the lines of the sketch below.
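For example, a hedged sketch, assuming a base chart g and a dbest column holding each driver’s best time in session so far:

#Embolden a driver's label when the laptime sets their session best
g + geom_text(aes(x=cuml, y=stime, label=code,
                  fontface=ifelse(stime == dbest, 'bold', 'plain')), size=3)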

To make the chart easier to read, I think each improvement to the cutoff time should be represented by a faint horizontal line, with a slightly darker line tracing the evolution of the cutoff time as a stepped line, along the lines of the sketch below. This would allow us to see which times were within the cutoff time at any point.
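Something like the following layers, applied over the dfc cutoff records generated above, might do it:

#Trace the evolving cutoff time as a stepped line,
#with a faint horizontal rule marking each improved cutoff value
g + geom_step(data=dfc, aes(x=cuml, y=dbest), direction='hv', colour='grey40') +
  geom_hline(data=dfc, aes(yintercept=dbest), alpha=0.1)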

I also wonder whether it might be interesting to generate a table a bit like the race lap chart, using session timed lap count rather than race lap count, perhaps with additional colour fields to show which car recorded the time that increased the lap count index, and perhaps also where in the order that time fell if it didn’t change the order in the session so far. We could also generate online and offline differences between each laptime in the session and the current cutoff time (online algorithm) as well as the final overall session cutoff time (offline algorithm).

[As and when I get this chart sorted, it will appear in an update to the Wrangling F1 Data With R lean book.]

Scraping Web Pages With R

One of the things I tend to avoid doing in R, partly because there are better tools elsewhere, is screenscraping. With the release of the new rvest package, I thought I’d have a go at what amounts to one of the simplest webscraping activities – grabbing HTML tables out of webpages.

The tables I had in my sights (when I can actually find them…) are the tables that appear on the newly designed FIA website that describe a range of timing results for F1 qualifying and races [quali example, race example].

Inspecting an example target web page, I noticed that whilst a menu allows you to select several different results tables, the underlying HTML source code reveals that all the tables relevant to the session (that is, a particular race, or complete qualifying session) are described within a single page.

So how can we grab those tables down from a target page? The following recipe seems to do the trick:


library(rvest)

#Grab a particular table (the num'th in the page) from the HTML webpage
#at a particular URL - the wrapper function is as referenced below
fiaTableGrabber = function(url, num){
  #Grab the page and parse the HTML
  hh = read_html(url)
  #Pull out the num'th table and cast it to a dataframe
  cc = html_nodes(hh, xpath = "//table")[[num]] %>% html_table(fill=TRUE)
  #TO DO - extract table name
  #Set the column names from the first row
  colnames(cc) = cc[1, ]
  #Drop the header row, and drop any all-NA column
  cc = Filter(function(x) !all(is.na(x)), cc[-1,])
  #Fill blanks with NA
  cc = apply(cc, 2, function(x) gsub("^$|^ $", NA, x))
  #would the dataframe cast handle the NA?
  data.frame(cc)
}


The fiaTableGrabber() function grabs a particular table from a page with a particular URL. (I really should grab the page separately and then extract whatever table I want from the fetched page, or at least cache the page – unless there is a caching option built in?)

Depending on the table grabbed, we may then need to tidy it up. I hacked together a few sketch functions that tidy up (and remap) column names, convert “natural times” in minutes and seconds to seconds equivalent, and in the case of the race pits data, separate out two tables that get merged into one.
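The natural time converter, timeInS(), needs to turn a time string such as “1:33.695” into a seconds equivalent – a minimal reconstruction might look like this:

#Convert a "natural time" string (e.g. "1:33.695", or a plain "33.695")
#to a numeric seconds value
timeInS = function(tStr){
  if (is.na(tStr) || tStr == '') return(NA)
  parts = as.numeric(unlist(strsplit(as.character(tStr), ':')))
  #Weight the components: last is seconds, next-to-last minutes, and so on
  sum(parts * 60^((length(parts)-1):0))
}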

## Qualifying:
#Sketches of the tidying functions - the function wrappers and some details
#are my own reconstruction of code that got mangled in transit
fiaQualiClassTidy = function(xx){
  #Convert the natural session times to seconds
  for (q in c('Q1','Q2','Q3')){
    xx[paste(q, 'time', sep='_')] = apply(xx[q], 1, timeInS)
  }
  #Remap the lap count column names
  xx = dplyr::rename(xx, Q1_laps=LAPS)
  xx = dplyr::rename(xx, Q2_laps=LAPS.1)
  xx = dplyr::rename(xx, Q3_laps=LAPS.2)
  xx
}

#2Q, 3R - sector times: a similar tidier loops over the sector columns,
#  for (s in c('s1','s2','s3')) { ... }
#3Q, 4R and 4Q, 5R - the speed trap and best lap tables get similar remappings

#2R - race lap analysis: convert the natural laptimes
fiaRaceLapsTidy = function(xx){
  xx['time'] = apply(xx['LAP TIME'], 1, timeInS)
  xx
}

#6R - the race pits data is really two tables merged into one, split
#around an embedded 'RACE - PIT STOP - DETAIL' header row
fiaPitsSummary = function(xx){
  r = which(xx['NO'] == 'RACE - PIT STOP - DETAIL')
  xx['tot_time'] = apply(xx['TOTAL TIME'], 1, timeInS)
  #Keep the summary rows above the embedded header, dropping all-NA columns
  Filter(function(x) !all(is.na(x)), xx[1:(r-1),])
}

fiaPitsDetail = function(xx){
  colnames(xx) = c('NO','DRIVER','LAP','TIME','STOP','NAT DURATION','TOTAL TIME')
  xx['tot_time'] = apply(xx['TOTAL TIME'], 1, timeInS)
  xx['duration'] = apply(xx['NAT DURATION'], 1, timeInS)
  #Keep the detail rows below the embedded header
  r = which(xx['NO'] == 'RACE - PIT STOP - DETAIL')
  xx = xx[(r+1):nrow(xx),]
  #Remove blank rows - rows in which every cell is NA
  xx[rowSums(is.na(xx)) != ncol(xx),]
}

So for example:
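(By way of a hedged example of how the pieces fit together – the URL is illustrative:)

#Grab and tidy the qualifying classification table
#(here assumed to be the first table in the page)
url = 'http://www.fia.com/events/.../session-classifications'
qualiClass = fiaTableGrabber(url, 1)
fiaQualiClassTidy(qualiClass)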


I’m still not convinced that R is the most natural, efficient, elegant or expressive language for scraping with, though…

PS In passing, I note the release of the readxl Excel reading library (no external-to-R dependencies, compatible with various flavours of Excel spreadsheet).

PPS Looking at the above screenshot, it strikes me that if we look at the time of day of each stop and its duration, we can tell if there is a track position (at least) change in the pits… So for example, ROS goes in at 15:11:11 with a 33.689 stop and RIC goes in at 15:11:13 with a 26.714. So ROS enters the pits ahead of RIC and leaves after him? The following lap chart from f1fanatic perhaps reinforces this view?
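(A quick back-of-the-envelope check, treating a stop’s time-of-day plus its duration as a rough proxy for when the car gets going again:)

#Approximate pit exit time as time-of-day of the stop plus its duration
ros_exit = (15*3600 + 11*60 + 11) + 33.689  #ROS
ric_exit = (15*3600 + 11*60 + 13) + 26.714  #RIC
ros_exit - ric_exit  #~4.975: ROS gets going again about 5s after RIC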


Wrangling Complex Spreadsheet Column Headers

[This isn’t an R post, per se, but I’m syndicating it via RBloggers because I’m interested – how do you work with hierarchical column indices in R? Do you try to reshape the data to something tidier on the way in? Can you autodetect elements to help with any reshaping?]

Not a little p****d off by the Conservative election pledge to extend the right-to-buy to housing association tenants (my response: so extend the right to private tenants too?) I thought I’d have a dig around to see what data might be available to see what I could learn about the housing situation on the Isle of Wight, using a method that could also be used in other constituencies. That is, what datasets are provided at a national level, broken down to local level. (To start with, I wanted to see what I could learn ex- of visiting the DCLG OpenDataCommunities site.)

One source of data seems to be the Local authority housing statistics data returns for 2013 to 2014, a multi-sheet spreadsheet reporting at a local authority level on:

– Dwelling Stock
– Local Authority Housing Disposals
– Allocations
– Lettings, Nominations and Mobility Schemes
– Vacants
– Condition of Dwelling Stock
– Stock Management
– Local authority Rents and Rent Arrears
– Affordable Housing Supply


Something I’ve been exploring lately is “external spreadsheet data source” wrappers for the pandas Python library that wrap frequently released spreadsheets with a simple (?!) interface that lets you pull the data from the spreadsheet into a pandas dataframe.

For example, I got started on the LA housing stats sheet as follows – first a look at the sheets, then a routine to grab sheet names out of the Menu sheet:

import pandas as pd
import re

#Menu sheet parse to identify sheets A-I
#(the filename is how I saved the downloaded spreadsheet - yours may differ)
dfx = pd.ExcelFile('Local_Authority_Housing_Statistics_dataset_2013-14.xlsx')
sd = re.compile(r'Section (\w) - (.*)$')
sheetDetails = {}
for row in dfx.parse('Menu')[[1]].values:
    if str(row[0]).startswith('Section'):
        sheetDetails[sd.match(str(row[0])).group(1)] = sd.match(str(row[0])).group(2)
sheetDetails
#{u'A': u'Dwelling Stock',
# u'B': u'Local Authority Housing Disposals',
# u'C': u'Allocations',
# u'D': u'Lettings, Nominations and Mobility Schemes',
# u'E': u'Vacants',
# u'F': u'Condition of Dwelling Stock',
# u'G': u'Stock Management',
# u'H': u'Local authority Rents and Rent Arrears',
# u'I': u'Affordable Housing Supply'}

All the data sheets have similar columns on the left-hand side, which we can use as a crib to identify the simple, single-row coded header.

def dfgrabber(dfx, sheet):
    #First pass - parse without headers to identify the coded header row
    df = dfx.parse(sheet, header=None)
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0]
    #Second pass - reparse, using the coded header row for the column names
    df = dfx.parse(sheet, header=row)
    df = df[df['DCLG code'].notnull()].reset_index(drop=True)
    return df


That gives something like the following:


This is completely usable if we know what the column codes refer to. What is handy is that a single coded header row is available for the columns, although the metadata that neatly describes what the codes mean is not so tidily presented:


Trying to generate pandas hierarchical index from this data is a bit messy…

One approach I’ve explored is trying to create a lookup table from the coded column names back into the hierarchical column names.

For example, if we can detect the column multi-index rows, we can fill down (where a label spans several rows, the label sits in the topmost cell) and then fill across (for multicolumn labels, the label is in the leftmost cell) to populate the spanned cells of the index grid with the value that spans them.

#row is autodetected, as in dfgrabber(), and gives the index of the simple
#coded header row; df is the headerless parse of the sheet
df = dfx.parse(sheet, header=None)
#Get the header rows above it - and drop blank rows
xx = df[0:row].dropna(how='all')

#Fill down
xx.fillna(method='ffill', axis=0, inplace=True)
#Fill across
xx = xx.fillna(method='ffill', axis=1)

#Append the coded header row
xx = pd.concat([xx, df[row:row+1]])

#Now make use of pandas' ability to read in a multi-index CSV
#(the read-back with a multi-level header is my reconstruction)
xx.to_csv('multi_index.csv', header=False, index=False)
mxx = pd.read_csv('multi_index.csv', header=list(range(len(xx)-1)))


Note that the pandas column multi-index can span several columns, but not “vertical” levels.

Get rid of the columns that don’t feature in the multi-index:

for c in mxx.columns.get_level_values(0).tolist():
    if c.startswith('Unnamed'):
        mxx = mxx.drop(c, level=0, axis=1)

Now start to work on the lookup…

#Get a dict from the multi-index: keys are the hierarchical column tuples,
#values are the column contents - the coded header row we appended above
#now sits as the first data row of mxx (reconstructed step)
dd = mxx.to_dict(orient='list')


We can then use this as a basis for generating a lookup table for the column codes.

keyx = {}
for r in dd:
    keyx[str(dd[r][0]).split(' ')[0]] = r


We could also generate more elaborate dicts to provide ways of identifying particular codes.

Note that the key building required a little bit of tidying, arising from footnote numbers that appear in some of the coded column headings:

excel header footnote

This tidying should also be applied to the code column generation step above…

I’m thinking there really should be an easier way?

PS and then, of course, there are the additional gotchas… like UTF-8 pound signs that break ASCII encodings…


PPS Handy… informal guidance on Releasing data or statistics in spreadsheets, which acts as a counterpoint to GSS’ Releasing statistics in spreadsheets: Good practice guidance.

A Couple of Handy ggplot Tricks – Using Environmental Variables and Saving Charts

A couple of handy tricks when working with ggplot that had escaped my radar until today.

First up, I had a problem in a function I was using to generate a ggplot2 chart, in which I wanted to accept a couple of optional arguments into the function and then make use of them in a ggplot aes() element.

Normally, ggplot will try to dereference a variable as a column in the current ggplot data context, rather than as a variable in its own right. So what can we do? Hack around with aes_string()? In actual fact, the following trick, identified via Stack Overflow, did exactly what I needed – make a particular environment context available in ggplot directly:

#The spacer and cutoff defaults, the aes() details and the driverName
#label are my own reconstruction - the original arguments got clipped
core_qualifying_rank_slopegraph = function(qualiResults, qm,
                                           spacer=0.25, cutoff=c(15,10)){
  #Grab the environment the function is called in
  .e = environment()
  #If we pass this context into ggplot, we can access both spacer and cutoff
  g = ggplot(qualiResults, aes(x=session, y=laptime), environment = .e)
  g = g + geom_text(data=qm[qm['session']=='q1time',],
                    aes(x=1-spacer, y=qspos, label=driverName,
                        colour=(qspos > cutoff[1])),
                    size=3)
  g = g + geom_text(data=qm[qm['session']=='q2time',],
                    aes(x=2-spacer, y=qspos, label=driverName,
                        colour=(qspos > cutoff[2])),
                    size=3)
  g
}

By the by, the complete version of the fragment above generates a chart like the following, heavily influenced by Tufte-style slopegraphs, which shows progression through the F1 qualifying session in China this weekend:


Note that I use a discrete rank rather than continuous laptime scale for the y-axis, which would be more in keeping with the original slope graph idea. (The post on F1 China 2015 – How They Qualified explores another chart type where continuous laptime scales are used, and a two column layout reminiscent of an F1 grid as a trick to try to minimise overlap of drivername labels, along with a 1-dimensional layout that shows all the qualifying session classification laptimes.)

The second useful trick I learned today was a recipe for saving chart objects. (With sales of the Wrangling F1 Data With R book (which provides the context for my learning these tricks) all but stalled, I need a new dream to live for that gives me hope of making enough from F1 related posts to cover the costs of seeing a race for real one weekend. So I’ve started wondering whether I could sell or license particular charts one day (if I can produce them quickly enough), either as PDFs or, perhaps, as actual chart objects, rather than having to give away all the code and wrangled data, for example…)

So in that respect, the ability to save ggplot chart objects and then share them in a way that others can use them (if they have a workflow that can accommodate R/grid grobs) could be quite attractive… and this particular trick seems to do the job…

g = ggplot(..etc..) ...
#Get the grob...
g_out = ggplotGrob(g)
#Save the grob (saveRDS/readRDS is my reconstruction of the missing
#save/load steps; the filename is illustrative)
saveRDS(g_out, file='ggrobtest')

#Look - nothing up my sleeves
rm(g); rm(g_out)
#> g
#Error: object 'g' not found
#> g_out
#Error: object 'g_out' not found

#The g_out grob is reinstated and can be plotted as follows:
library(grid)
g_out = readRDS('ggrobtest')
grid.draw(g_out)

Mixing Numbers and Symbols in Time Series Charts

One of the things I’ve been trying to explore with my #f1datajunkie projects are ways of representing information that work both in a glanceable way as well as repaying deeper reading. I’ve also been looking at various ways of using text labels rather than markers to provide additional information around particular data points.

For example, in a race battlemap, with lap number on the horizontal x-axis and gap time on the vertical y-axis, I use a text label to indicate which driver is ahead (or behind) a particular target driver.


In the revised version of this chart type shown in F1 Malaysia, 2015 – Rosberg’s View of the Race, an additional numerical label along the x-axis indicates the race position of the target driver at the end of each lap.

What these charts are intended to do is help the eye see particular structural shapes within the data – for example whether a particular driver is being attacked from behind in the example of a battlemap, or whether they are catching the car ahead (perhaps with intervening cars in the way – although more needs to be done on the chart with respect to this for examples where there are several intervening cars; currently, only a single intervening car immediately ahead on track is shown.)

Two closer readings of the chart are then possible. Firstly, by looking at the y-value we can see the actual time a car is ahead (and here the dashed guide line at +/- 1s helps indicate in a glanceable way the DRS activation line; I’m also pondering how to show an indication of pit loss time to indicate what effect a pit stop might have on the current situation). Secondly, we can read off the labels of the drivers involved in a battle to get a more detailed picture of the race situation.
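(The guide line is easily added as a layer – a minimal sketch, assuming a battlemap base chart g with the gap time on the y-axis:)

#Dashed guide lines at +/- 1s marking the DRS activation window
g = g + geom_hline(yintercept=1, linetype='dashed', colour='grey')
g = g + geom_hline(yintercept=-1, linetype='dashed', colour='grey')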

The latest type of chart I’ve been looking at are session utilisation maps, which in their simplest form look something like the following:


The charts show how each driver made use of a practice session or qualifying – drivers are listed on the vertical y-axis and the time into the session each lap was recorded at is identified along the horizontal x-axis.

This chart makes it easy to see how many stints, and of what length, were completed by each driver and at what point in the session. Other information might be inferred – for example, significant gaps in which no cars are recording times may indicate poor weather conditions or red flags. However, no information is provided about the times recorded for each lap.

We can, however, use colour to identify “purple” laps (fastest lap time recorded so far in the session) and “green” laps (a driver’s fastest laptime so far in the session that isn’t a purple time), as well as laps on which a driver pitted:
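(As an aside, a minimal sketch of how those flags might be generated in session order, assuming laptimes stime arranged by cumulative session time cuml:)

library(dplyr)
#Purple: fastest laptime recorded so far in the session
#Green: a driver's fastest laptime so far in the session that isn't purple
f12015test = f12015test %>% arrange(cuml) %>%
  mutate(purple = (stime == cummin(stime))) %>%
  group_by(name) %>%
  mutate(dbest = cummin(stime)) %>%
  ungroup() %>%
  mutate(green = (!purple & stime == dbest))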


But still, no meaningful lap times.

One thing to note about laptimes is that they come in various flavours, such as outlaps, when a driver starts the lap from the pitlane; inlaps, or laps on which a driver comes into the pits at the end of the lap; and flying laps when a driver is properly going for it. There are also those laps on which a driver may be trying out various new lines, slowing down to give themselves space for a flying lap, and so on.

Assuming that inlaps and outlaps are not the best indicators of pace, we can use a blend of symbols and text labels on the chart to identify inlaps and outlaps, as well as showing laptimes for “racing” laps, also using colour to highlight purple and green laps:


The chart is produced using ggplot, and a layered approach in which chart elements are added to the chart in separate layers.

#The base chart with the dataset used to create the original chart
#In this case, the dataset included here is redundant
g = ggplot(f12015test)

#Layer showing in-laps (laps on which a driver pitted) and out-laps
#Use a subset of the dataset to place markers for outlaps and inlaps
g = g + geom_point(data=f12015test[f12015test['outlap'] | f12015test['pit'],],aes(x=cuml, y=name, color=factor(colourx)), pch=1)

#Further annotation to explicitly identify pit laps (in-laps)
g = g + geom_point(data=f12015test[f12015test['pit']==TRUE,],aes(x=cuml, y=name),pch='.')

#Layer showing full laps with rounded laptimes and green/purple lap highlights
#In this case, use the laptime value as a text label, rather than a symbol marker
g = g + geom_text(data=f12015test[!f12015test['outlap'] & !f12015test['pit'],],aes(x=cuml, y=name, label=round(stime,1), color=factor(colourx)), size=2, angle=45)

#Force the colour scale to be one we want
g = g + scale_colour_manual(values=c('darkgrey','darkgreen','purple'))

This version of the chart has the advantage of being glanceable when it comes to identifying session utilisation (number, duration and timing of stints) and when purple and green laptimes were recorded, while also repaying closer reading when it comes to inspecting the actual laptimes recorded during each stint.

To reduce clutter on the chart, laptimes are rounded to 1 decimal place (tenths of a second) rather than using the full lap time, which is recorded down to thousandths of a second.

Session utilisation charts are described more fully in a recently released chapter of the Wrangling F1 Data With R Leanpub book. Buying a copy of the book gains you access to future updates of the book. A draft version of the chapter can be found here.

Iteratively Populating Templated Sentences With Inline R in knitr/Rmd

As part of the Wrangling F1 Data With R project, I want to be able to generate sentences iteratively from a templated base.

The following recipe works for sentences included in an external file:
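A minimal sketch of the sort of recipe I mean, assuming the templated sentence lives in an external stintText.Rmd file (the filename is illustrative):

```{r results='asis', echo=FALSE}
#Loop over the drivers, knitting the external template once for each;
#inline `r name` references in the template pick up the loop variable
for (name in levels(stints$name)){
  cat(knit_child('stintText.Rmd', quiet=TRUE))
}
```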


What I’d really like to be able to do is put the Rmd template into a chunk something like this…:

```{rmd stintPara, eval=FALSE, echo=FALSE}
`r name` completed `r sum(abs(stints[stints['name']==name,]['l']))` laps over `r nrow(stints[stints['name']==name,])` stints, with a longest run of `r max(abs(stints[stints['name']==name,]['l']))` laps.
```

and then do something like:

```{r results='asis'}
for (name in levels(stints$name)){
  #...something that knits the stintPara chunk template for this driver...
}
```

Is anything like that possible?

PS via the knitr Google Group, h/t Jeff Newmiller for pointing me to the knit_child() text argument…
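For the record, a sketch of the knit_child(text=...) approach, with the template passed in as a character string rather than read from a file:

```{r results='asis', echo=FALSE}
src = "`r name` completed `r nrow(stints[stints['name']==name,])` stints."
for (name in levels(stints$name)){
  cat(knitr::knit_child(text=src, quiet=TRUE))
}
```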


Segmenting F1 Qualifying Session Laptimes

I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into, using some sort of clustering algorithm… or other means…:


For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as the following session utilisation chart (including green and purple laptimes) shows:


The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.

If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:


We can see a big rain delay gap, and also a tighter gap between the first and second sessions.

If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:

# Attempt to identify qualifying session using K-Means Cluster Analysis around 3 means
clusters <- kmeans(f12015test['cuml'], 3)

f12015test = data.frame(f12015test, clusters$cluster)

ggplot(f12015test) +
  geom_text(aes(x=cuml, y=stime, label=code, colour=factor(clusters.cluster)),
            angle=45, size=3)


In particular, some of the Q1 laptimes are being grouped with Q2 laptimes.

However, we know that there is at least a 2 minute gap between sessions (regulations suggest 7 minutes, though if this is the time between the lights going red and then green again, we might need to knock a couple of minutes off the gap to account for drivers who start their last lap just before the lights go red on a session). So if we assume that the only places a two minute gap between recorded laptimes can occur during the whole of qualifying are the periods between the qualifying sessions, we can generate a flag on those gaps, and then generate session number counts by counting on those flags.

#Look for a two minute gap between recorded laptimes
f12015test['gapflag'] = (f12015test['gap'] >= 120)
#Count on the gap flags to generate a session number for each laptime
#(the cumsum step is my reconstruction of the missing counting line)
f12015test['qsession'] = 1 + cumsum(f12015test[['gapflag']])

ggplot(f12015test) +
  geom_text(aes(x=cuml, y=stime, label=code), angle=45, size=3) +
  facet_wrap(~qsession, scale="free")


(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)

This chart clearly shows how the first qualifying session saw cars trialling evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each others’ way on flying laps…)

One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:


#Grab the purple (session best so far) laptimes to set the bounds - a sketch,
#cribbing the purple flag from the session utilisation code above
ypurple = f12015test[f12015test['purple']==TRUE, 'stime']
#e.g. from the fastest time up to 5% over it
ylims = c(min(ypurple), 1.05*min(ypurple))
#Use these values in ylim()...

However, where the laptimes differ significantly across sessions as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.

Another crib we might use is to identify PIT laps and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.
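A sketch of that crib, assuming a logical pit flag in the laptimes dataframe:

library(dplyr)
#Flag each lap that immediately follows a pit lap as an out-lap
f12015test = f12015test %>% arrange(name, cuml) %>%
  group_by(name) %>%
  mutate(outlap = lag(pit, default=FALSE)) %>%
  ungroup()
#Keep just the "flying" laps
flying = f12015test[!f12015test$pit & !f12015test$outlap,]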

Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy into the book, you get all future updates to it for no additional cost, even in the case of the minimum book price increasing over time.