A First Attempt at Wrangling WRC (World Rally Championship) Data With pandas and matplotlib

Last year was a quite year on the Wrangling F1 Data With R front, with a not even aborted start at doing a python/pandas equivalent project. With the new owners of F1 in place, things may change for the better in terms of engaging with fans and supporters, and I may revisit that idea properly, but in the meantime, I thought I started tinkering with a wider range of motorsport data.

The start to the BTCC season is still a few months away, but the WRC started over the weekend, and with review highlights and live coverage of one stage per rally on Red Bull TV, I thought I may give that data a go…

Results and timing info can be found on the WRC web pages (I couldn’t offhand find a source of official FIA timing sheets) so here’s a first quick sketch using stage results from the first rally of the year – Monte Carlo.

To start with, we need to grab the data. I’m using the pandas library, which has a handy .read_html() method that can scrape tables (crudely) from an HTML page given its URL.

import pandas as pd

def getStageResultsBase(year,rallyid,stages):
    ''' Get stage results and overall results at end of stage '''
    
    # Accept one stage number or a list of stage numbers
    stages=[stages] if not isinstance(stages,list) else stages
    
    #There are actually two tables on the stage results page
    df_stage=pd.DataFrame()
    df_overallpart=pd.DataFrame()
    
    #Grab data for each stage
    for stage in stages:
        url='http://www.wrc.com/live-ticker/daten/{year}/{rallyid}/stage.{rallyid}.{stage}.all.html'.format(year=year, rallyid=rallyid, stage=stage)
        #scrape the data
        results=pd.read_html(url,encoding='utf-8')
        results[0].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        results[1].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        
        #Simple cleaning - cast the data types as required
        for i in [0,1]:
            results[i].fillna(0,inplace=True)
            results[i]['pos']=results[i]['pos'].astype(float).astype(int)
            for j in ['carNo','driverName','time','diffPrev','diffFirst']:
                results[i][j]=results[i][j].astype(str)
        
        #Add a stage identifier
        results[0]['stage']=stage
        results[1]['stage']=stage
        
        #Add the scraped stage data to combined stage results data frames
        df_stage=pd.concat([df_stage,results[0]])
        df_overallpart=pd.concat([df_overallpart,results[1]])

    return df_stage.reset_index(drop=True), df_overallpart.reset_index(drop=True)

The data we pull back looks like the following.

wrc_results_scraper1

Note that deltas (the time differences) are given as offset times in the form of a string. As the pandas library was in part originally developed for working with financial time series data, it has a lot of support for time handling. This includes the notion of a time delta:

pd.to_timedelta("1:2:3.0") #Timedelta('0 days 01:02:03')

We can use this datatype to represent time differences from the results data:

#If we have hh:mm:ss format we can easily cast a timedelta
def regularTimeString(strtime):

    #Go defensive, just in case we're passed eg 0 as an int
    strtime=str(strtime)
    strtime=strtime.strip('+')

    modifier=''
    if strtime.startswith('-'):
        modifier='-'
        strtime=strtime.strip('-')

    timeComponents=strtime.split(':')
    ss=timeComponents[-1]
    mm=timeComponents[-2] if len(timeComponents)>1 else 0
    hh=timeComponents[-3] if len(timeComponents)>2 else 0
    timestr='{}{}:{}:{}'.format(modifier,hh,mm,ss)
    return pd.to_timedelta(timestr)

We can use the time handler to cast the time differences from the scraped data as timedelta typed data:

def getStageResults(year,rallyid,stages):
    df_stage, df_overallpart = getStageResultsBase(year,rallyid,stages)
    for col in ['time','diffPrev','diffFirst']:
        df_stage['td_'+col]=df_stage.apply(lambda x: regularTimeString(x[col]),axis=1)
        df_overallpart['td_'+col]=df_overallpart.apply(lambda x: regularTimeString(x[col]),axis=1)
    return df_stage, df_overallpart

wrc_results_scraper2

The WRC results cover all entrants to the rally, but not all the cars are classed as fully blown WRC cars (class RC1). We can limit the data to just the RC1 cars and generate a plot showing the position of each driver at the end of each stage:

%matplotlib inline
import matplotlib.pyplot as plt

rc1=df_overall[df_overall['groupClass']=='RC1'].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='pos',ax=ax,legend=None);

wrc_results_scraper3

The position is actually the position of the driver across all entry classes, not just RC1. This means if a driver has a bad day, they could be placed well down the all-class field; but that’s not of too much interest if all we’re interested in is in-class ranking.,

So what about if we rerank the drivers within the RC1 class? And perhaps improve the chart a little by adding a name label to identify each driver at their starting position?

rc1['rank']=rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='rank',ax=ax,legend=None)

#Add some name labels at the start
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, i+1, d.ix(i)['driverName'])

wrc_results_scraper4

This chart is a bit cleaner, but now we lose information around the lower placed in-class drivers, in particular that information about there overall position when other classes are taken into account too…

The way the FIA recover this information in their stage chart displays that reports on the evolution of the race for the top 10 cars overall (irrespective of class) that shows excursions in interim stages outside the top 10 “below the line”, annotating them further with their overall classification on the corresponding stage.

We can use this idea by assigning a “re-rank” to each car if they are positioned outside the size of the class.

#Reranking...
rc1['xrank']= (rc1['pos']>RC1SIZE)
rc1['xrank']=rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank']=rc1.apply(lambda row: row['pos'] if row['pos']<=RC1SIZE else row['xrank'] +RC1SIZE, axis=1)
fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None)

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d.ix(i)['xrank'], d.ix(i)['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d.ix(i)['xrank'], d.ix(i)['driverName'])

wrc_results_scraper5 The chart now shows the evolution of the race for the RC1 cars, retaining the spaced ordering of the top 12 positions that would be filled by WRC1/RC1 cars if they were all placed above cars from other classes and then bunching those placed outside the group size. (Only 11 names are shown because one the entries retired right at the start of the rally.)

So for example, in this case we see how Neuvill, Hanninen and Serderidis are classed outside Lefebvre, who was actually classed 9th overall.

Further drawing on the FIA stage chart, we can label the excursions outside the top 12, and also invert the y-axis.

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None);

for i,d in rc1[rc1['xrank']>RC1SIZE].iterrows(): ax.text(d.ix(i)['stage']-0.1, d.ix(i)['xrank'], d.ix(i)['pos'], bbox=dict( boxstyle='round,pad=0.3',color='pink')) #facecolor='none',edgecolor='black',
#Name labels
for i,d in rc1[rc1['stage']==1].iterrows(): ax.text(-0.5, d.ix(i)['xrank'], d.ix(i)['driverName']) for i,d in rc1[rc1['stage']==17].iterrows(): ax.text(17.3, d.ix(i)['xrank'], d.ix(i)['driverName'])
#Flip the y-axis plt.gca().invert_yaxis()

Lefebvre’s excursions outside the top 12 are now recorded and plain to see.

wrc_results_scraper6

We now have a chart that blends rank ordering with additional information showing where cars are outpaced by cars from other classes, in a space efficient manner.

PS as with Wrangling F1 Data With R, I may well turn this into a Leanpub book, this time exploring the workflow to markdown (and/or maybe reveal.js slides!) from Jupyter notebooks, rather than from RStudio/Rmd.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering... View all posts by Tony Hirst