Tagged: f1dj

ergastR – R Wrapper for ergast F1 Results Data API

By the by, I’ve posted a first attempt at an R package – ergastR to wrap the ergast developer API, which is where I get chunks of data from for my f1datajunkie tinkerings.

You can find it on Github: psychemedia/ergastR.

The function names are the ones used in the Wrangling F1 Data With R book.

The R package needs a bit of tidying up and also needs work on the following: caching, so that we don’t keep hitting the ergast API unnecessarily; paged results handling (I fudge this a bit at the moment by explicitly setting a large results limit); and dual handling of ergast API versus downloaded ergast database requests (if a database connection string is passed, use that rather than make a call to the ergast API). But it’s a start… Feel free to raise issues via the repo:-)
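By way of illustration, here is a minimal Python sketch of the caching and paged results ideas. It is not code from ergastR: `fetch_all`, `get_page` and `fake_get_page` are hypothetical names, and the sketch assumes each page of results reports the total row count alongside its rows. A stand-in function plays the part of the remote API so the example runs without a network connection.

```python
from functools import lru_cache

def fetch_all(get_page, limit=100):
    """Collect rows from a paged API by stepping the offset until `total` is reached.

    `get_page(limit, offset)` is a hypothetical callable returning a dict with
    keys 'total' (int) and 'rows' (list); in practice it would wrap an HTTP request.
    """
    rows, offset = [], 0
    while True:
        page = get_page(limit, offset)
        rows.extend(page['rows'])
        offset += limit
        # Stop when every row has been seen, or when a page comes back empty
        if offset >= page['total'] or not page['rows']:
            return rows

# Stand-in for the remote service: 250 fake rows served in pages
FAKE_ROWS = list(range(250))
calls = []  # records each "network" request actually made

@lru_cache(maxsize=None)  # cache by (limit, offset) so repeat requests are not re-fetched
def fake_get_page(limit, offset):
    calls.append((limit, offset))
    return {'total': len(FAKE_ROWS), 'rows': FAKE_ROWS[offset:offset + limit]}

first = fetch_all(fake_get_page)
second = fetch_all(fake_get_page)  # served entirely from the cache
```

The in-memory cache only lasts for the session; caching to disk would be the obvious next step if the aim is to avoid hitting the API between sessions.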

In related news, Will Vaughan tipped me off to a Python package he’s started putting together to wrap the ergast API: ergast-python. He’s also making a start on some Wrangling F1 Data Jupyter notebooks that make use of the Python wrapper: wranglingf1data.

From Points to (Messy) Lines

A week or so ago, I came up with a new chart type – race concordance charts – for looking at a motor circuit race from the on-track perspective of a particular driver. Here are a couple of examples from the 2017 F1 Grand Prix:

The gap is the time to the car on track ahead (negative gap, to the left) or behind (to the right). The colour indicates whether the car is on the same lap (light blue), on the lap behind (orange to red), or a lap ahead (dark blue).

In the dots, we can “see” lines relating to the relative progress of particular cars. But what if we actually plot the progress of each of those other cars as a line? The colours represent different cars.

 

bot_conc_line HUL_conc_line

Here’s another view of the track from Hulkenberg’s perspective with a wider window, which by comparison with the previous chart suggests I need to do a better job of handling cars that do not drop off the track but do fall out of the display window… (At the moment, I only grab data for cars in the specified concordance window):

HUL-conc_line2

Note that we need to do a little bit of tidying up of the data so that we don’t connect lines for cars that flow off the left hand edge, for example, and then return several laps later from the right hand edge:

#Get the data for the cars, as before
inscope=sqldf(paste0('SELECT l1.code as code,l1.acctime-l2.acctime as acctimedelta,
                       l2.lap-l1.lap as lapdelta, l2.lap as focuslap
                       FROM lapTimes as l1 join lapTimes as l2
                       WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]),')
                       AND l2.code="',code,'";'))

  #If consecutive rows for same driver are on more than one focuslap apart, break the line
  inscope=ddply(inscope,.(code),transform,g=cumsum(c(0,diff(focuslap)>1)))
  #Continuous line segments have the same driver code and "group" number

  g = ggplot(inscope)

  #The interaction splits up the groups based on code and the contiguous focuslap group number
  #We also need to ensure we plot acctimedelta relative to increasing focuslap
  g=g+geom_line(aes(x=focuslap, y=acctimedelta, col=code,group=interaction(code, g)))
  #...which means we then need to flip the axes
  g=g+coord_flip()
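The line-breaking trick (start a new group whenever a car's focuslap jumps by more than one lap) carries over to pandas. The following is a minimal sketch on made-up data, not the code used for the charts; the driver codes and lap numbers are invented.

```python
import pandas as pd

# Toy data: car 'AAA' is in the window on focus laps 1-3, drops out, then returns on laps 7-8
inscope = pd.DataFrame({
    'code': ['AAA'] * 5 + ['BBB'] * 3,
    'focuslap': [1, 2, 3, 7, 8, 1, 2, 3],
})

# Within each car, start a new segment whenever focuslap jumps by more than one lap
inscope = inscope.sort_values(['code', 'focuslap'])
inscope['g'] = (inscope.groupby('code')['focuslap']
                .transform(lambda s: s.diff().gt(1).cumsum()))
```

Rows that share both `code` and `g` then form one continuous line segment, which is the role played by `interaction(code, g)` in the ggplot call.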

There may still be some artefacts in the line plotting based on lapping… I can’t quite think this through at the moment:-(

So here’s my reading:

  • near horizontal lines that go slightly up and to the right, and where a lot of places in the window are lost in a single lap, are a result of a pit stop by the car that lost the places; if we have access to pit information, we could perhaps dot these lines?
  • the “waist” in the chart for HUL shows cars coming together for a safety car, and then HUL losing pace to some cars whilst making advances on others;
  • lines with a constant gradient show a consistent gain or loss of time, per lap, over several laps;
  • a near vertical line shows a car keeping pace, and neither making nor losing time compared to the focus car.

Figure Aesthetics or Overlays?

Tinkering with a new chart type over the weekend, I spotted something rather odd in my F1 track history charts – what look to be outliers in the form of cars that hadn’t been lapped on that lap appearing behind the lap leader of the next lap, on track.

If you count the number of cars on that leadlap, it’s also greater than the number of cars in the race on that lap.

How could that be? Cars being unlapped, perhaps, and so “appearing twice” on a particular leadlap – that is, recording two laptimes between consecutive passes of the start/finish line by the race leader?

My fix for this was to add an “unlap” attribute that detects whether a car has recorded more than one laptime on the same leadlap:

#Overplot unlaps
lapTimes=ddply(lapTimes,.(leadlap,code),transform,unlap= seq_along(leadlap))

This groups by leadlap and car, and counts 1 for each occurrence. So if the unlap count is greater than 1, a car has completed more than 1 lap in a given leadlap.
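For comparison, the same running count can be sketched in pandas with `cumcount()`. The data here is invented: HUL records two laptimes on leadlap 10, as a car unlapping itself would.

```python
import pandas as pd

lapTimes = pd.DataFrame({
    'leadlap': [10, 10, 10, 11],
    'code':    ['VET', 'HUL', 'HUL', 'VET'],
})

# Number each car's appearances within a leadlap: 1 for the first, 2 for the second...
lapTimes['unlap'] = lapTimes.groupby(['leadlap', 'code']).cumcount() + 1

# Rows where a car has crossed the line for a second time on the same leadlap
unlapped = lapTimes[lapTimes['unlap'] > 1]
```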

My first thought was to add this as an overprint on the original chart:

#Overprint unlaps
g = g + geom_point(data = lapTimes[lapTimes['unlap']>1,],
                   aes(x = trackdiff, y = leadlap, col=(leadlap-lap)), pch = 0)

This renders as follows:

Whilst it works, as an approach it is inelegant, and had me up in the night pondering the use of overlays rather than aesthetics.

Because we can also view the fact that the car was on its second pass of the start/finish line for a given lead lap as a key property of the car and depict that directly via an aesthetic mapping of that property onto the symbol type:

  g = g + geom_point(aes( x = trackdiff, y = leadlap,
                          col = (lap == leadlap),
                          pch= (unlap==1) ))+scale_shape_identity()

This renders just a single mark on the chart, depicting the diff to the leader as well as the unlapping characteristic, rather than the two marks used previously: one for the diff, and a second, overprinting mark to depict the unlapping nature of that mark.

So now I’m wondering – when would it make sense to use multiple marks by overprinting?

Here’s one example where I think it does make sense: where I pass an argument into the chart plotter to highlight a particular driver by infilling a marker with a symbol to identify that driver.

#Drivers of interest passed in using construction: code=list(c("STR","+"),c("RAI","*"))
#is.na() on a list returns one value per element, so collapse it to a single TRUE/FALSE
if (!all(is.na(code))){
  for (t in code) {
    g = g + geom_point(data = lapTimes[lapTimes['code'] == t[1], ],
                       aes(x = trackdiff, y = leadlap),
                       pch = t[2])
  }
}

In this case, the + symbol is not a property of the car, it is an additional information attribute that I want to add to that car, but not the other cars. That is, it is a property of my interest, not a property of the car itself.

Race Track Concordance Charts

Since getting started with generating templated R reports a few weeks ago, I’ve started spending the odd few minutes every race weekend looking at ways of automating the generation of F1 qualifying and race reports.

In yesterday’s race, some of the commentary focussed on whether MAS had given BOT an “assist” in blocking VET, which got me thinking about better ways of visualising whether drivers are stuck in traffic or not.

The track position chart makes a start at this, but it can be hard to focus on a particular driver (identified using a particular character to infill the circle marker for that driver). The race leader’s track position ahead is identified from the lap offset race leader marker at the right hand side of the chart.

One way to help keep track of things from the perspective of a particular driver, rather than the race leader, is to rebase the origin of the x-axis relative to that driver.

In my track chart code, I use a dataframe that has a trackdiff column that gives a time offset on track to race leader for each lead lap.

track_encoder=function(lapTimes){
  #Find the accumulated race time at the start of each leader's lap
  lapTimes = ddply(lapTimes, .(leadlap), transform, lstart = min(acctime))

  #Find the on-track gap to leader
  lapTimes['trackdiff'] = lapTimes['acctime'] - lapTimes['lstart']
  lapTimes
}

Rebasing for a particular driver simply means resetting the origin with respect to that time, using the trackdiff time for one driver as an offset for the others, to create a new trackdiff2 for use on the x-axis.

#I'm sure there must be a more idiomatic way of doing this?
rebase=lapTimes[lapTimes['code']==code,c('leadlap','trackdiff')]
rebase=rename(rebase,c('trackdiff'='trackrebase'))
lapTimes=merge(lapTimes,rebase,by='leadlap')
lapTimes['trackdiff2']=lapTimes['trackdiff']-lapTimes['trackrebase']
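As a point of comparison, here is a minimal pandas sketch of the same two steps (gap to the leader, then rebasing to a focus driver). The laptimes are made up, and the lookup assumes the focus driver has exactly one row per leadlap, which will not hold if that driver unlaps themselves.

```python
import pandas as pd

lapTimes = pd.DataFrame({
    'leadlap': [1, 1, 1, 2, 2, 2],
    'code':    ['BOT', 'VET', 'MAS', 'BOT', 'VET', 'MAS'],
    'acctime': [90.0, 91.5, 95.0, 180.0, 182.0, 187.5],
})

# Gap to the leader: accumulated time minus the earliest accumulated time on each leadlap
lapTimes['trackdiff'] = (lapTimes['acctime']
                         - lapTimes.groupby('leadlap')['acctime'].transform('min'))

# Rebase: look up the focus driver's trackdiff per leadlap and subtract it from every row
focus = 'MAS'
offsets = lapTimes.loc[lapTimes['code'] == focus].set_index('leadlap')['trackdiff']
lapTimes['trackdiff2'] = lapTimes['trackdiff'] - lapTimes['leadlap'].map(offsets)
```

Using `map` against a per-leadlap lookup avoids the rename-and-merge step, at the cost of needing that lookup to have unique leadlap values.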

Here’s how it looks for MAS:

But not so useful for BOT, who led much of the race:

This got me thinking about text concordances. In the NLTK text analysis package, the text concordance function allows you to display a search term centred in the context in which it is found:

concordance

The concordance view finds the location of each token and then displays the search term surrounded by tokens in neighbouring locations, within a particular window size.

I spent a chunk of time wondering how to do this sensibly in R, struggling to identify what it was I actually wanted to do: for a particular driver, find the neighbouring cars in terms of accumulated laptime on each lap. After failing to see the light for more than an hour or so, I thought of it in terms of an SQL query, and the answer fell straight out – for the specified driver on a particular leadlap, get their accumulated laptime and the rows with accumulated laptimes in a window around it.

inscope=sqldf(paste0('SELECT l1.code as code,l1.acctime-l2.acctime as acctimedelta,
l2.lap-l1.lap as lapdelta, l2.lap as focuslap
FROM lapTimes as l1 join lapTimes as l2
WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]),')
AND l2.code="',code,'";'))
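The windowed self-join can also be sketched in pandas. This is an illustrative equivalent rather than the code behind the charts: the `concordance` function name and the toy laptimes are invented, and `how='cross'` needs pandas 1.2 or later.

```python
import pandas as pd

def concordance(lapTimes, code, limits=(-20, 20)):
    """Rows for every car within a time window around the focus driver, per focus lap."""
    focus = lapTimes.loc[lapTimes['code'] == code, ['lap', 'acctime']]
    # Pair every laptime row with every lap of the focus driver
    pairs = lapTimes.merge(focus, how='cross', suffixes=('', '_focus'))
    pairs['acctimedelta'] = pairs['acctime'] - pairs['acctime_focus']
    pairs['lapdelta'] = pairs['lap_focus'] - pairs['lap']
    pairs = pairs.rename(columns={'lap_focus': 'focuslap'})
    # Keep only the cars inside the window around the focus driver
    inwindow = ((pairs['acctimedelta'] > -abs(limits[0])) &
                (pairs['acctimedelta'] < abs(limits[1])))
    cols = ['code', 'acctimedelta', 'lapdelta', 'focuslap']
    return pairs.loc[inwindow, cols].reset_index(drop=True)

lapTimes = pd.DataFrame({
    'code':    ['HUL', 'VET', 'MAS', 'HUL', 'VET', 'MAS'],
    'lap':     [1, 1, 1, 2, 2, 2],
    'acctime': [100.0, 95.0, 130.0, 200.0, 190.0, 228.0],
})
inscope = concordance(lapTimes, 'HUL', limits=(-20, 20))
```

The cross join grows as the number of rows times the number of laps by the focus driver, which is fine for a single race but worth bearing in mind for anything larger.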

Plotting against the accumulated laptime delta on the x-axis gives a chart like this:

If we add in horizontal rules to show laps where the specified driver pitted and vertical bars to show pit windows, we get a much richer picture of the race from the point of view of the driver.

Here’s how it looks from the perspective of BOT, who led most of the race:

Different symbols inside the markers can be used to track different drivers (in the above charts, BOT and VET are highlighted). The colours identify whether cars are on the same lap as the specified driver (light blue), on laps ahead (shades of blue, then green, as per “blue flag”), or on increasing laps behind (orange to red, i.e. backmarkers from the perspective of the specified driver). If a marker is light blue, that car is on the same lap and you’re racing…

All in all, I’m pretty chuffed (for now!) with how that chart came together.

And a new recipe to add to the Wrangling F1 Data With R book, I guess..

PS in response to [misunderstanding…] a comment from @sidepodcast, we also have control over the concordance window size, and the plotsize:

concordresize

Generating hi-res versions in other file formats is also possible.

Just got to wrap it all up in a templated report now…

PPS On the track position charts, I just noticed that where cars are lapped, they fall off the radar… so I’ve added them in behind the leader to keep the car count correct for each leadlap…

trackposrebaselapped

 

PS See also: A New Chart Type – Race Concordance Charts, which also includes examples of “line chart” renderings of the concordance charts so you can explicitly see the progress of each individually highlighted driver on track.

A First Attempt at Wrangling WRC (World Rally Championship) Data With pandas and matplotlib

Last year was a quiet year on the Wrangling F1 Data With R front, with a not even aborted start at doing a python/pandas equivalent project. With the new owners of F1 in place, things may change for the better in terms of engaging with fans and supporters, and I may revisit that idea properly, but in the meantime, I thought I’d start tinkering with a wider range of motorsport data.

The start to the BTCC season is still a few months away, but the WRC started over the weekend, and with review highlights and live coverage of one stage per rally on Red Bull TV, I thought I may give that data a go…

Results and timing info can be found on the WRC web pages (I couldn’t offhand find a source of official FIA timing sheets) so here’s a first quick sketch using stage results from the first rally of the year – Monte Carlo.

world_rally_championship_-_results_monte_carlo_-_wrc_com

To start with, we need to grab the data. I’m using the pandas library, which has a handy .read_html() method that can scrape tables (crudely) from an HTML page given its URL.

import pandas as pd

def getStageResultsBase(year,rallyid,stages):
    ''' Get stage results and overall results at end of stage '''
    
    # Accept one stage number or a list of stage numbers
    stages=[stages] if not isinstance(stages,list) else stages
    
    #There are actually two tables on the stage results page
    df_stage=pd.DataFrame()
    df_overallpart=pd.DataFrame()
    
    #Grab data for each stage
    for stage in stages:
        url='http://www.wrc.com/live-ticker/daten/{year}/{rallyid}/stage.{rallyid}.{stage}.all.html'.format(year=year, rallyid=rallyid, stage=stage)
        #scrape the data
        results=pd.read_html(url,encoding='utf-8')
        results[0].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        results[1].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        
        #Simple cleaning - cast the data types as required
        for i in [0,1]:
            results[i].fillna(0,inplace=True)
            results[i]['pos']=results[i]['pos'].astype(float).astype(int)
            for j in ['carNo','driverName','time','diffPrev','diffFirst']:
                results[i][j]=results[i][j].astype(str)
        
        #Add a stage identifier
        results[0]['stage']=stage
        results[1]['stage']=stage
        
        #Add the scraped stage data to combined stage results data frames
        df_stage=pd.concat([df_stage,results[0]])
        df_overallpart=pd.concat([df_overallpart,results[1]])

    return df_stage.reset_index(drop=True), df_overallpart.reset_index(drop=True)

The data we pull back looks like the following.

wrc_results_scraper1

Note that deltas (the time differences) are given as offset times in the form of a string. As the pandas library was in part originally developed for working with financial time series data, it has a lot of support for time handling. This includes the notion of a time delta:

pd.to_timedelta("1:2:3.0")
#Timedelta('0 days 01:02:03')

We can use this datatype to represent time differences from the results data:

#If we have hh:mm:ss format we can easily cast a timedelta
def regularTimeString(strtime):

    #Go defensive, just in case we're passed eg 0 as an int
    strtime=str(strtime)
    strtime=strtime.strip('+')

    modifier=''
    if strtime.startswith('-'):
        modifier='-'
        strtime=strtime.strip('-')

    timeComponents=strtime.split(':')
    ss=timeComponents[-1]
    mm=timeComponents[-2] if len(timeComponents)>1 else 0
    hh=timeComponents[-3] if len(timeComponents)>2 else 0
    timestr='{}{}:{}:{}'.format(modifier,hh,mm,ss)
    return pd.to_timedelta(timestr)

We can use the time handler to cast the time differences from the scraped data as timedelta typed data:

def getStageResults(year,rallyid,stages):
    df_stage, df_overallpart = getStageResultsBase(year,rallyid,stages)
    for col in ['time','diffPrev','diffFirst']:
        df_stage['td_'+col]=df_stage.apply(lambda x: regularTimeString(x[col]),axis=1)
        df_overallpart['td_'+col]=df_overallpart.apply(lambda x: regularTimeString(x[col]),axis=1)
    return df_stage, df_overallpart 
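An alternative to the row-wise `apply` is to convert each time string to seconds and then cast the whole column in one go. This is a sketch, not a drop-in replacement: `to_seconds` is a hypothetical helper, and the sample strings are invented rather than scraped.

```python
import pandas as pd

def to_seconds(strtime):
    """Convert '[+-][hh:][mm:]ss.s' style strings to a signed number of seconds."""
    strtime = str(strtime).strip().lstrip('+')
    sign = -1 if strtime.startswith('-') else 1
    parts = [float(p) for p in strtime.lstrip('-').split(':')]
    seconds = 0.0
    # Fold hours, minutes and seconds together, left to right
    for part in parts:
        seconds = seconds * 60 + part
    return sign * seconds

times = pd.Series(['1:02:03.0', '+1:02.5', '-3.4', '0'])
seconds = times.map(to_seconds)
td = pd.to_timedelta(seconds, unit='s')
```

Keeping the intermediate seconds column around can also be handy for plotting, where a plain number is often easier to work with than a timedelta.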

wrc_results_scraper2

The WRC results cover all entrants to the rally, but not all the cars are classed as fully blown WRC cars (class RC1). We can limit the data to just the RC1 cars and generate a plot showing the position of each driver at the end of each stage:

%matplotlib inline
import matplotlib.pyplot as plt

rc1=df_overall[df_overall['groupClass']=='RC1'].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='pos',ax=ax,legend=None);

wrc_results_scraper3

The position is actually the position of the driver across all entry classes, not just RC1. This means if a driver has a bad day, they could be placed well down the all-class field; but that’s not of too much interest if all we’re interested in is in-class ranking.

So what about if we rerank the drivers within the RC1 class? And perhaps improve the chart a little by adding a name label to identify each driver at their starting position?

rc1['rank']=rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='rank',ax=ax,legend=None)

#Add some name labels at the start
#(each d is a row, so index it directly; .ix is no longer available in recent pandas)
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['rank'], d['driverName'])

wrc_results_scraper4

This chart is a bit cleaner, but now we lose information around the lower placed in-class drivers, in particular that information about their overall position when other classes are taken into account too…

The FIA recover this information in their stage chart displays, which report on the evolution of the race for the top 10 cars overall (irrespective of class). Excursions in interim stages outside the top 10 are shown “below the line”, and annotated with the car’s overall classification on the corresponding stage.

stage_chart___federation_internationale_de_l_automobile

We can use this idea by assigning a “re-rank” to each car if they are positioned outside the size of the class.

#Reranking...
RC1SIZE=12 #number of cars entered in the RC1 class for this rally
rc1['xrank']= (rc1['pos']>RC1SIZE)
rc1['xrank']=rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank']=rc1.apply(lambda row: row['pos'] if row['pos']<=RC1SIZE else row['xrank'] +RC1SIZE, axis=1)
fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None)

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

wrc_results_scraper5

The chart now shows the evolution of the race for the RC1 cars, retaining the spaced ordering of the top 12 positions that would be filled by WRC1/RC1 cars if they were all placed above cars from other classes and then bunching those placed outside the group size. (Only 11 names are shown because one of the entries retired right at the start of the rally.)

So for example, in this case we see how Neuville, Hanninen and Serderidis are classed outside Lefebvre, who was actually classed 9th overall.
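To make the re-ranking rule concrete, here is a small worked sketch on invented data with a class size of 3 (the `CLASS_SIZE` name, drivers and positions are all hypothetical). Unlike the code above, it sorts by overall position first, so that the running count of "outside" cars does not depend on the row order of the scraped data.

```python
import pandas as pd

CLASS_SIZE = 3  # hypothetical class size for this toy example

df = pd.DataFrame({
    'stage':      [1, 1, 1, 2, 2, 2],
    'driverName': ['A', 'B', 'C', 'A', 'B', 'C'],
    'pos':        [1, 2, 3, 1, 14, 9],   # overall position across all classes
})

# Sort so that the running count follows overall position within each stage
df = df.sort_values(['stage', 'pos']).reset_index(drop=True)
outside = df['pos'] > CLASS_SIZE

# Count, within each stage, how many in-class cars so far sit outside the top CLASS_SIZE
overflow = outside.astype(int).groupby(df['stage']).cumsum()

# Keep pos for cars inside the class size; bunch the rest just below the line
df['xrank'] = df['pos'].where(~outside, CLASS_SIZE + overflow)
```

On stage 2, driver C (9th overall) is placed at 4 and driver B (14th overall) at 5, directly below the in-class positions.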

Further drawing on the FIA stage chart, we can label the excursions outside the top 12, and also invert the y-axis.

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None);

for i,d in rc1[rc1['xrank']>RC1SIZE].iterrows():
    ax.text(d['stage']-0.1, d['xrank'], d['pos'],
            bbox=dict(boxstyle='round,pad=0.3',color='pink')) #facecolor='none',edgecolor='black',
#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])
#Flip the y-axis
plt.gca().invert_yaxis()

Lefebvre’s excursions outside the top 12 are now recorded and plain to see.

wrc_results_scraper6

We now have a chart that blends rank ordering with additional information showing where cars are outpaced by cars from other classes, in a space efficient manner.

PS as with Wrangling F1 Data With R, I may well turn this into a Leanpub book, this time exploring the workflow to markdown (and/or maybe reveal.js slides!) from Jupyter notebooks, rather than from RStudio/Rmd.

First Thoughts on Detecting Motorsport Safety Car Periods from Laptimes

Prompted by Markku Hänninen, I thought I’d have a quick look at estimating motorsport safety car laps from a set of laptime data. For the uninitiated, if there is a dangerous hazard on track, the race-cars are kept out while the hazard is cleared, but led around by a safety car that limits the pace. No overtaking is allowed for race position, but under certain regulations, lapped cars may unlap themselves. Cars may also pit under the safety car.

Timing sheets typically don’t identify safety car periods, so the question arises: how can we detect them?

One condition that is likely to follow is that the average pace of the laps under safety car conditions will be considerably slower than under racing conditions. A quick way of estimating the race pace is to find the fastest laptime across the whole of the race (or in an online algorithm, the fastest laptime to date).

With a lapTimes dataframe containing columns lap, rawtime (that is, raw laptime in seconds) and position (that is, the race position of the driver recording a particular laptime on a particular lap), we can easily find the fastest lap:

minl=min(lapTimes['rawtime'])

We can find the mean laptime per lap using ddply() to group around each lap:

ddply(lapTimes[c('lap', 'rawtime', 'position')], .(lap), summarise,
      mean_laptime=mean(rawtime) )

We can also generate a variety of other measures. For example, within the grouped ddply operation, if we divide the mean laptime per lap (mean(rawtime)) by the fastest overall laptime we get a normalised mean laptime based on the fastest lap in the race.

ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
      norm_laptime=mean(rawtime)/minl )

We might also normalise the leader’s laptime for each lap, on the basis that the leader will be the car most likely to be driving at the safety car’s pace. (The summarising is essentially redundant here because we only have one row per group.)

ddply(lapTimes[lapTimes['position']==1, c('lap','rawtime')], .(lap), summarise,
      norm_leaders_laptime=mean(rawtime)/minl )

Using the normalised times, we can identify slow laps. For example, slow laps based on mean laptime. In this case, I am using a heuristic that says the laptime is a slow laptime if the normalised time is more than 1.3 times that of the fastest lap:

ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
      slow_lap_meanBasis= (mean(rawtime)/minl) > 1.3 )

If we assume that the first lap does not start under the safety car, we can then make a crude guess that a slow lap not on the first lap is a safety car lap.
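The same heuristic can be sketched in pandas. The laptimes below are invented, with lap 3 run at a much slower pace, and the 1.3 threshold is the rule of thumb used above rather than a tuned value.

```python
import pandas as pd

lapTimes = pd.DataFrame({
    'lap':     [1, 1, 2, 2, 3, 3, 4, 4],
    'rawtime': [95.0, 97.0, 90.0, 92.0, 130.0, 134.0, 91.0, 93.0],
})

THRESHOLD = 1.3  # slow if the mean laptime is more than 1.3x the fastest lap of the race

# Fastest laptime across the whole race
minl = lapTimes['rawtime'].min()

# Mean laptime per lap, normalised against the fastest lap
per_lap = (lapTimes.groupby('lap')['rawtime'].mean()
           .div(minl).rename('norm_laptime').reset_index())
per_lap['slow_lap'] = per_lap['norm_laptime'] > THRESHOLD

# Crude guess: a slow lap that is not the first lap is a safety car lap
per_lap['sc_guess'] = per_lap['slow_lap'] & (per_lap['lap'] > 1)
```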

However, this does not take into account things like sudden downpours or other changes to the weather or track conditions. In such a case it may be likely that the majority of the field pits, so we might want to have a detector that flags whether a certain number of cars have pitted on a lap, possibly normalised against the current size of the field.

laps=lapsData.df(2016,4)
pits=pitsData.df(2016,4)[c('lap','driverId')]
pits['pitstop']=T
lapTimes=merge(laps, pits, by=c('lap','driverId'), all.x=T)
lapTimes['pitstop']=!is.na(lapTimes['pitstop'])

#Count of stops per lap
ddply( lapTimes, .(lap), summarise, ps=sum(pitstop==TRUE) )

#Proportion of cars stopping per lap
ddply( lapTimes, .(lap), summarise, ps=sum(pitstop==TRUE)/length(pitstop) )
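A pandas sketch of the pit stop detector follows, again on invented data. It assumes a driver stops at most once on any given lap, and the 0.75 cut-off used to flag a "mass stop" is an arbitrary illustrative value.

```python
import pandas as pd

laps = pd.DataFrame({
    'lap':      [1, 1, 1, 1, 2, 2, 2, 2],
    'driverId': ['a', 'b', 'c', 'd'] * 2,
})
pits = pd.DataFrame({'lap': [2, 2, 2], 'driverId': ['a', 'b', 'c']})

# Left merge with an indicator column marks which lap rows have a matching pit stop
lapTimes = laps.merge(pits, on=['lap', 'driverId'], how='left', indicator=True)
lapTimes['pitstop'] = lapTimes['_merge'] == 'both'

# Count of stops, and proportion of the running field stopping, per lap
summary = (lapTimes.groupby('lap')['pitstop']
           .agg(stops='sum', proportion='mean').reset_index())

# Flag laps where most of the field came in (cut-off is illustrative only)
summary['mass_stop'] = summary['proportion'] >= 0.75
```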

That said, under safety car conditions, many cars do also take the opportunity to pit. However, under sudden changes of weather condition, we might expect nearly all the cars to come in, even if it means doubling up. (So another detector for weather might be two cars in the same team, close to each other in terms of gap, pitting on the same lap, with the result that one will be queued behind the other.)

As and when I get a chance, I’ll try to add some sort of ‘safety car’ estimator to the Wrangling F1 Data With R book.

When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor

Although not necessarily the best way of publishing data, data tables in PDF documents can often be extracted quite easily, particularly if the tables are regular and the cell contents reasonably spaced.

For example, official timing sheets for F1 races are published by the FIA as event and timing information in a set of PDF documents containing tabulated timing data:

R_-_Best_Sector_Times_pdf__1_page_

In the past, I’ve written a variety of hand crafted scrapers to extract data from the timing sheets, but the regular way in which the data is presented in the documents means that they are quite amenable to scraping using a PDF table extractor such as Tabula. Tabula exists as both a server application, accessed via a web browser, and as a service using the tabula extractor Java application.

I don’t recall how I came across it, but the tabulizer R package provides a wrapper for tabula extractor (bundled within the package), that lets you access the service via its command line calls. (One dependency you do need to take care of is to have Java installed; adding Java into an RStudio docker container would be one way of taking care of this.)

Running the default extractor command on the above PDF pulls out the data of the inner table:

extract_tables('Best Sector Times.pdf')

fia_pdf_sector_extract

Where the data is spread across multiple pages, you get a data frame per page.

R_-_Lap_Analysis_pdf__page_3_of_8_

Note that the headings for the distinct tables are omitted. Tabula’s “table guesser” identifies the body of the table, but not the spanning column headers.

The default settings are such that tabula will try to scrape data from every page in the document.

fia_pdf_scrape2

Individual pages, or sets of pages, can be selected using the pages parameter. For example:

  • extract_tables('Lap Analysis.pdf',pages=1)
  • extract_tables('Lap Analysis.pdf',pages=c(2,3))

Areas for scraping can also be specified using the area parameter:

extract_tables('Lap Analysis.pdf', pages=8, guess=F, area=list(c(178, 10, 230, 500)))

The area parameter originally appeared to take co-ordinates in the form top, left, width, height, but it is now fixed to take co-ordinates in the same form as those produced by the tabula app debug output: top, left, bottom, right.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.

Select_Tables___Tabula_concole

The tabula console output gives co-ordinates in the form: top, left, bottom, right. When the tabulizer area parameter wanted top, left, width, height, you needed to do some sums to convert these numbers; with the fixed form, they can be passed in as they are.
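For reference, the "sums" involved in moving between the two co-ordinate forms are simple enough to sketch. The function names here are made up for illustration, and the example values reuse the area from the extract_tables call above.

```python
def to_tlbr(top, left, width, height):
    """Convert (top, left, width, height) to (top, left, bottom, right)."""
    return (top, left, top + height, left + width)

def to_tlwh(top, left, bottom, right):
    """Convert (top, left, bottom, right) to (top, left, width, height)."""
    return (top, left, right - left, bottom - top)

# The area used above, read as top, left, bottom, right
area = (178, 10, 230, 500)
as_width_height = to_tlwh(*area)
round_trip = to_tlbr(*as_width_height)
```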

fia_pdf_head_scrape

Using a combination of “guess” to find the dominant table, and specified areas, we can extract the data we need from the PDF and combine it to provide a structured and clearly labeled dataframe.

On my to do list: add this data source recipe to the Wrangling F1 Data With R book…