
Notes – Detecting Safety Car Laps from F1 Laptime Data, Part 1

Looking at some track/lap maps for F1 races, there are several visual cues for when a safety car has been deployed.

For example, here’s a chart from the 2018 Chinese Grand Prix showing when a safety car arrived. The bunching of times at the left shows that a full safety car is on track, bunching up the cars. The time-distance by which race leader BOT is behind the car at the back of the field in terms of track position increases, reflecting the slower than race pace laptimes.

Here’s how the race from Australia 2018 looked. In this case, there was a virtual safety car to begin with – the field doesn’t close up, but the time-distance between the race leader and the track position backmarker increases to reflect the reduced pace:

Singapore, 2017, was a wet race with lots of safety car action. To begin with, the race started under the safety car, which came out again on lap 11. The steady pattern of race trails shows there was no safety car around laps 29/30, although there were pitstops, but there was a safety car around lap 40. I’m not sure if a VSC went out first, or whether the teams simply took the opportunity to pit.

 

So how can we detect full and virtual safety cars? Certainly there seems to be graphical evidence of safety cars – both full and virtual – in the above charts. But can we also identify and detect some numerical signatures, and if so, what form might they take? That will be in the next post…

Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate

In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story generation tool to generate a variety of texts from Python pandas dataframes.

For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example).

Whilst there is a basic port of tracery to R, I want to make use of the various extensions I’ve been doodling with to pytracery, so it seemed like a good opportunity to start exploring the R reticulate package.

It was a bit of a faff trying to get things to work the first time, so here are some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.

Python Environment

My first attempt was to use python blocks in an Rmd document:

```{python}
import sys
print(sys.executable)
```

but R insisted on using the base Python path on my Mac, which was not the path I wanted to use… The fix turned out to be setting the engine path:

```{python, engine.path ='/Users/f1dj/anaconda3/bin/python' }
import sys
print(sys.executable)
```

This could also be done via a setting: opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')

One of the problems with this approach is that a Python environment is created for each chunk – so you can’t easily carry state over from one Python chunk to another.

So I had a look at a workaround using reticulate instead.

Calling pytracery from R using reticulate

The solution I think I’m going for is to put the Python code into a file, source that into R, then pass an R dataframe as an argument to a called Python function and get a response back into R as an R dataframe.

For example, here’s a simple python test file:

import tracery
from tracery.modifiers import base_english
import pandas as pd

def pandas_row_mapper(row, rules, root,  modifiers=base_english):
    ''' Function to parse single row of dataframe '''
    row=row.to_dict()
    rules=rules.copy()

    #Merge the row values into the rules as simple expansions
    for k in row:
        rules[k] = str(row[k])

    #Build the grammar once, after all the row values have been merged in
    grammar = tracery.Grammar(rules)
    if modifiers is not None:
        if isinstance(modifiers,list):
            for modifier in modifiers:
                grammar.add_modifiers(modifier)
        else:
            grammar.add_modifiers(modifiers)

    return grammar.flatten(root)

def pandas_tracery(df, rules, root, modifiers=base_english):
  return df.apply(lambda row: pandas_row_mapper(row, rules, root, modifiers), axis=1)

def pdt_inspect(df):
  return(df)

def pdt_test1(df):
  return type(df)

def pdt_demo(df):
  return pandas_tracery(df, _demo_rules, "#origin#",  modifiers=base_english)

#Create example rule to apply to each row of dataframe
_demo_rules = {'origin': "#code# was placed #position#!",
         'position': "#pos.uppercase#"}
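For anyone without pytracery to hand, the core idea of `pandas_row_mapper` – merge a data row into a rules dictionary, then expand a template from a root symbol – can be sketched in plain Python. This is a minimal stand-in for tracery's `#tag#` expansion (the `flatten` and `row_to_text` helpers are hypothetical names, and modifiers such as `.uppercase` are omitted):

```python
import re

def flatten(rules, text):
    """Minimal stand-in for tracery's Grammar.flatten: repeatedly
    expand #key# references against a rules dict."""
    pattern = re.compile(r'#([^#]+)#')
    while True:
        m = pattern.search(text)
        if m is None:
            return text
        text = text[:m.start()] + str(rules[m.group(1)]) + text[m.end():]

def row_to_text(row, rules, root):
    """Merge a data row (a dict) into the rules, then flatten from
    the root symbol - mirroring pandas_row_mapper, minus modifiers."""
    merged = dict(rules)
    merged.update({k: str(v) for k, v in row.items()})
    return flatten(merged, merged[root])

rows = [{'code': 'Jo', 'pos': 'first'}, {'code': 'Sam', 'pos': 'second'}]
rules = {'origin': '#code# was placed #pos#!'}
for row in rows:
    print(row_to_text(row, rules, 'origin'))
```

Applying `row_to_text` over each row is all that `pandas_tracery` does via `df.apply(..., axis=1)`.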

We can access a python environment using reticulate:

library(reticulate)

#Show conda environments
conda_list("auto")

#Use a particular, named conda environment
use_condaenv(condaenv='anaconda3', required=T)

#Check the availability of a particular module in the environment
py_module_available("tracery")

Now we can load in the Python file – and the functions it defines – and then call one of the loaded Python functions.

Note that I seemed to have to force the casting of the R dataframe to a python/pandas dataframe using r_to_py(), although I’d expected the type mapping to be handled automatically? (Perhaps there is a setting somewhere?)

```{r}
source_python("pd_tracery_demo.py")
df1=data.frame(code=c('Jo','Sam'), pos=c('first','Second'))
df1$result = pdt_demo(r_to_py(df1, convert=T))
df1
```

#Displays:
Jo	first	Jo was placed FIRST!
Sam	Second	Sam was placed SECOND!

(Note: I also spotted a gotcha – things don’t work so well if you define an R column name called name… )

So now I can start looking at converting sports reporting tropes like these:

into tracery story models I can call using my pandas/pytracery hacks:-)

PS here’s a quick demo of inlining Python code:

library(reticulate)
rvar=1

#Go into python shell - this persists
repl_python()
#Access R variables with r.
pyvar=r.rvar+1

#Return to R shell
exit

#Access Python variable with py$
py$pyvar

Sketch – Data Trivia

A bit more tinkering with F1 data from the ergast db, this time trying to generate trivia / facts around races.

The facts are identified using SQL queries:

#starts for a team
q=paste0('SELECT d.code, COUNT(code) AS startsforteam, c.name AS name FROM drivers d JOIN races r JOIN results rs JOIN constructors c WHERE c.constructorId=rs.constructorId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND d.code IN (',driversThisYear_str,') ',upto,' GROUP BY d.code, c.name HAVING (startsforteam+1) % 50 = 0')
startsTeammod50=dbGetQuery(ergastdb, q)

#looking for poles to date modulo 5 
q=paste0('SELECT d.code, COUNT(code) AS poles FROM drivers d JOIN qualifying q JOIN races r WHERE r.raceId=q.raceId AND d.code IN (',driversThisYear_str,') AND d.driverId=q.driverId AND q.position=1',upto,' GROUP BY code HAVING poles>1 AND (poles+1) % 5 = 0')
lookingpolesmod5=dbGetQuery(ergastdb, q)
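The modulo trick in those HAVING clauses – `(startsforteam+1) % 50 = 0`, `(poles+1) % 5 = 0` – just asks whether the *next* start or pole would be a round-number milestone. As a quick sketch (the `milestone` helper is a hypothetical name):

```python
def milestone(count, step):
    """True when the next event would be a round-number milestone,
    e.g. a 50th start or a 5th pole - the (n+1) % step == 0 trick
    used in the HAVING clauses."""
    return (count + 1) % step == 0

# A driver with 99 starts to date is looking for their 100th
print(milestone(99, 50))  # True
print(milestone(50, 50))  # False - the 50th milestone already happened
```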

Some of the queries also embed query fragments, which I intend to develop further…

upto=paste0(' AND (year<',year,' OR (year=',year,' AND round<',round,')) ')

I'm using knitr to generate Github flavoured markdown (gfm) from my Rmd docs – here’s part of the header:

---
output:
  md_document:
    variant: gfm
---

The following recipe then takes results from the trivia queries and spiels the output:

if (nrow(startsTeammod50)>0) {
  for (row in 1:nrow(startsTeammod50)) {
    text = '- `r startsTeammod50[row, "code"]` is looking for their `r toOrdinal(startsTeammod50[row, "startsforteam"]+1)` start for `r startsTeammod50[row, "name"]`'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

if (nrow(lookingpolesmod5)>0) {
  for (row in 1:nrow(lookingpolesmod5)) {
    text = '- `r lookingpolesmod5[row, "code"]` is looking for their `r toOrdinal(lookingpolesmod5[row, "poles"]+1)` ever pole position'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

We then get outputs of the form:

  • BOT is looking for their 100th race start
  • HAM is looking for their 100th start for Mercedes

See more example outputs here: Bahrain F1 2018 – Race Trivia.

This is another recipe I need to work up a little further and add to Wrangling F1 Data With R.

Tinkering with Competitive Supertimes

I’m back on the R thang with F1 data from ergast.com, and started having a look at how drivers and teams compare at a circuit.

One metric I came across for comparing teams over a season is the supertime, typically calculated for each manufacturer as the average of the fastest single lap recorded by the team at each race weekend, expressed as a percentage of the fastest single lap overall.

It struck me that we can also derive a reduced competitive supertime by basing the calculation on the best laptime recorded across the qualifying and race sessions, omitting laptimes recorded in the practice sessions.

We can draw on the notion of supertimes to derive two simple measures for comparing team performances based on laptime:

  • evolution of manufacturer competitive supertime for a circuit over the years;
  • evolution of manufacturer competitive supertime for each circuit over the course of a season.

We can also produce driver performance metrics based on the competitive supertime of each driver.
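The supertime calculation itself is simple enough to sketch in Python before wading into the R/SQL below. The laptimes here are made up for illustration, and `supertimes` is a hypothetical helper name:

```python
def supertimes(best_laps):
    """Supertime: each team's best lap per event expressed as a
    percentage of the overall fastest lap at that event, averaged
    over events. best_laps maps team -> list of best laps (seconds),
    one entry per event, same event order for every team."""
    events = list(zip(*best_laps.values()))
    fastest = [min(ev) for ev in events]
    return {team: sum(100 * t / f for t, f in zip(laps, fastest)) / len(laps)
            for team, laps in best_laps.items()}

# Two teams over two (fictional) race weekends
best = {
    'Mercedes': [90.0, 88.0],
    'Ferrari':  [90.9, 88.4],
}
print(supertimes(best))
```

The `ddply` normalisation step further down (`ntime = time/min(time)`) is the per-event "percentage of fastest" part of this, computed per year.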

So here are some notes on doing just that… I’m pulling the data from a MySQL database built from a datadump published via ergast.com.

q=paste0('SELECT circuitRef, c.name AS circuit, r.name as race, location, country FROM races r JOIN circuits c WHERE r.circuitId=c.circuitId AND year=',year,' AND round=',round)
cct = as.list(dbGetQuery(ergastdb, q))

cct
$circuitRef
[1] "bahrain"

$circuit
[1] "Bahrain International Circuit"

$race
[1] "Bahrain Grand Prix"

$location
[1] "Sakhir"

$country
[1] "Bahrain"

Calculate some competitive supertimes for manufacturers:

library(reshape2)
library(plyr)

q=paste0('SELECT q1, q2, q3, fastestLapTime, cn.name, d.code, year FROM races r JOIN circuits c JOIN results rs JOIN constructors cn JOIN qualifying q JOIN drivers d WHERE r.circuitId=c.circuitId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND cn.constructorId=rs.constructorId AND q.raceId=r.raceId AND q.driverId=rs.driverId AND cn.name IN (',teamsThisYear_str,') AND circuitref="',cct$circuitRef,'" ORDER BY year')
st = dbGetQuery(ergastdb, q)


st = melt(st,
          id.vars = c("name","code", "year"),
          measure.vars = c("q1", "q2", "q3", "fastestLapTime"))

st['time'] = as.numeric(apply(st['value'], 1, timeInS))

# Normalise the time
st = ddply(st, .(year), transform, ntime = time/min(time, na.rm=TRUE))

#Find the best time for each manufacturer per race weekend (quali+race)
stt = ddply(st,
            .(name,year),
            summarise, 
            stime = min(ntime, na.rm=TRUE))

stt$label = as.factor(stt$name)
stt

Here’s a chart theme I’ve used elsewhere:

library(ggplot2)
library(directlabels)
library(ggrepel)
library(ggthemes)

champChartTheme=function(g){
  g = g + theme_minimal(base_family="Arial Narrow")
  #g = g + theme(panel.grid.major.y=element_blank())
  g = g + theme(panel.grid.minor=element_blank())
  g = g + theme(axis.line.y=element_line(color="#2b2b2b", size=0.15))
  g = g + theme(axis.text.y=element_text(margin=margin(r=0, l=0)))
  g = g + theme(plot.margin=unit(rep(30, 4), "pt"))
  g = g + theme(plot.title=element_text(face="bold"))
  g = g + theme(plot.subtitle=element_text(margin=margin(b=10)))
  g = g + theme(plot.caption=element_text(size=8, margin=margin(t=10)))
  g
}

Create a supertime chart plotting function:

compSupertime = function(df, cct, labelSize=0.7, smooth=FALSE){
  g=ggplot(data=df,aes(x=year,y=stime))
  if (smooth){g=g+stat_smooth(aes(colour=label),method = "lm", formula= y ~ x + I(x^2), se=FALSE, size=0.7)}
  else { g=g+geom_line(aes(colour=label))}
  g=g+ guides(colour=FALSE)+xlim(min(df$year),max(df$year)+1)
  #cex is label size
  g=g+geom_dl(aes(label = label, colour=label),method = list(cex = labelSize, dl.trans(x = x + .3), "last.bumpup"))
  g=g+labs(x=NULL,y='Competitive Supertime (% of best)',
           title=paste0('F1 ', cct$race, ' - ','Competitive Supertimes',', ', min(df$year), ' to ', max(df$year)),
           subtitle = paste0(cct$circuit,', ',cct$location,', ',cct$country),
           caption="Data from Ergast database, ergast.com/mrd")
  champChartTheme(g)
}

Let’s have a look…

compSupertime(stt,cct,smooth=F)

 


Alternatively:

compSupertime(stt,cct,smooth=T)

The best fit / model line obviously leaves something to be desired, eg in the case of Renault. But it’s a start.

I’ve also started working on a workflow to autopublish stuff to Gitbooks via github. For example, here’s the site in progress around the Bahrain F1 2018 Grand Prix. Here’s an earlier example (subject to updates/reflowing!) around the Australia 2018 F1 Grand Prix.

PS As I get back in to this, I’ll probably start updating the Wrangling F1 Data With R with recipes again… Also on the to do list is a second edition based on the tidyverse way of doing things…

Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API

A couple of days ago, I noticed that Chris Newell had shared the code for his ergast Formula 1 motor racing results database that I use heavily for my F1DataJunkie doodles. The code is a PHP application that pulls data from the ergast database, which is regularly shared as a MySQL database dump. So I raised an issue asking if a docker containerised version of the application was available, and Chris replied that there wasn’t, but if I wanted to look at creating one…?

…which I did. After a few false starts, I came up with the solution Chris has since pulled into his ergast-f1-api repo.

The pattern is quite a handy one, and I think reusable – I’ll give it a go for spinning up my own API to look up ONS Geography codes when I get a chance – so what’s involved?

The dockerised application is built from two components launched using docker-compose, a MySQL container and an Apache server configured with PHP5:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

web:
  build: ./lamp
  #image: nimmis/apache-php5
  ports:
    - '8000:80'
  volumes:
    - ./webroot:/var/www/html
    - ./php/api:/php/api
    - ./logs:/var/log/apache2
  links:
    - ergastdb:ergastdb

The server runs the application, makes requests from the linked database container, and returns the result.

As part of my f1datajunkie tinkering, I’d put together a Dockerfile for populating a MySQL database with Chris’ database dump some time ago, so I could reuse that directly.

Which meant all I had to do was get the application up and running… Chris’ original instructions around the API server application were to place the application files into “the root directory” and also add in an Apache .htaccess URL rewrite, which he provided.

Simple… or maybe not..?! Not being an Apache user, it took me a bit of time to actually get things up and running, so here are some of the gotchas that caught me out:

  • where’s the “root directory”?
  • where should the .htaccess file go?

Running a couple of simple tests identified the root directory as the root for the files served by the webserver, and a quick search revealed the .htaccess file should go in the same location.

But the redirects didn’t work…

As a test, I wrote a simple rewrite rule that should have redirected any url to the same test file – but no joy…

A bit more testing suggested I needed to enable the mod_rewrite plugin, which I did in the appropriate Dockerfile. But still no joy: using patches like the .htaccess file in a server directory also requires allowing “Overrides” to the base Apache settings, which I did (via a crib) by rewriting the Apache config file, again via the Dockerfile that builds the server container image.

RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/ s/AllowOverride None/AllowOverride All/' /etc/apache2/apache2.conf

RUN a2enmod rewrite

Then the database connection didn’t work… But that was easily fixed: I’d forgotten to change the permissions in the appropriate application file.

Now running the docker-compose file with the command docker-compose up --build -d and it fires up two linked containers. The API is then available via a browser on http://localhost:8000/api/f1 using calls of the form http://localhost:8000/api/f1/2015.json.

So now I have a minimal working example of a PHP application powered by a MySQL database driving a simple web API. Which I can crib from for my own simple APIs…

As to how it could be improved, there are a couple of obvious things:

  • the containers are intended for personal, local use: the database is accessed as root and should be properly secured;
  • I haven’t got a pattern for updating the database when a new database image is released. The current workaround would be to destroy the containers, and the database volume, then rebuild the images and the containers.

PS to use the local ergast API server from R with my ergastR package, I needed to tweak the package to allow me to specify the path the API server. I’ll post details of how I did that later…

ergastR – R Wrapper for ergast F1 Results Data API

By the by, I’ve posted a first attempt at an R package – ergastR to wrap the ergast developer API, which is where I get chunks of data from for my f1datajunkie tinkerings.

You can find it on Github: psychemedia/ergastR.

The function names are the ones used in the Wrangling F1 Data With R book.

The R package needs a bit of tidying up and also needs work on the following: caching, so that we don’t keep hitting the ergast API unnecessarily; paged results handling (I fudge this a bit at the moment by explicitly setting a large results limit); and dual handling of ergast API versus downloaded ergast database requests (if a database connection string is passed, use that rather than make a call to the ergast API). But it’s a start… Feel free to raise issues via the repo:-)

In related news, Will Vaughan tipped me off to a Python package he’s started putting together to wrap the ergast API: ergast-python. He’s also making a start on some Wrangling F1 Data Jupyter notebooks that make use of the Python wrapper: wranglingf1data.

From Points to (Messy) Lines

A week or so ago, I came up with a new chart type – race concordance charts – for looking at a motor circuit race from the on-track perspective of a particular driver. Here are a couple of examples from the 2017 F1 Grand Prix:

The gap is the time to the car on track ahead (negative gap, to the left) or behind (to the right). The colour indicates whether the car is on the same lap (light blue),  on the lap behind (orange to red), or a lap ahead (dark blue).

In the dots, we can “see” lines relating to the relative progress of particular cars. But what if we actually plot the progress of each of those other cars as a line? The colours represent different cars.

 

bot_conc_line / HUL_conc_line

Here’s another view of the track from Hulkenberg’s perspective with a wider window, which by comparison with the previous chart suggests I need to handle better the cars that do not drop off the track but do fall out of the display window… (At the moment, I only grab data for cars in the specified concordance window):

HUL-conc_line2

Note that we need to do a little bit of tidying up of the data so that we don’t connect lines for cars that flow off the left hand edge, for example, and then return several laps later from the right hand edge:

#Get the data for the cars, as before
inscope=sqldf(paste0('SELECT l1.code as code,l1.acctime-l2.acctime as acctimedelta,
                       l2.lap-l1.lap as lapdelta, l2.lap as focuslap
                       FROM lapTimes as l1 join lapTimes as l2
                       WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]),')
                       AND l2.code="',code,'";'))

  #If consecutive rows for same driver are on more than one focuslap apart, break the line
  inscope=ddply(inscope,.(code),transform,g=cumsum(c(0,diff(focuslap)>1)))
  #Continuous line segments have the same driver code and "group" number

  g = ggplot(inscope)

  #The interaction splits up the groups based on code and the contiguous focuslap group number
  #We also need to ensure we plot acctimedelta relative to increasing focuslap
  g=g+geom_line(aes(x=focuslap, y=acctimedelta, col=code,group=interaction(code, g)))
  #...which means we then need to flip the axes
  g=g+coord_flip()
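The line-breaking step – `cumsum(c(0, diff(focuslap) > 1))` – is worth unpacking: whenever consecutive rows for the same driver jump by more than one focuslap, the running group number increments, so each contiguous run of laps gets its own line segment. A quick sketch of the same idea (`segment_groups` is a hypothetical helper name):

```python
def segment_groups(focuslaps):
    """Assign a group number that increments whenever consecutive
    focuslap values jump by more than one lap - the same idea as the
    cumsum(c(0, diff(focuslap) > 1)) step, so each contiguous run of
    laps can be drawn as its own line segment."""
    if not focuslaps:
        return []
    groups, g = [], 0
    prev = focuslaps[0]
    for cur in focuslaps:
        if cur - prev > 1:
            g += 1
        groups.append(g)
        prev = cur
    return groups

# Laps 1-3 then 7-8: two separate line segments
print(segment_groups([1, 2, 3, 7, 8]))  # [0, 0, 0, 1, 1]
```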

There may still be some artefacts in the line plotting based on lapping… I can’t quite think this through at the moment:-(

So here’s my reading:

  • near horizontal lines that go slightly up and to the right, and where a lot of places in the window are lost in a single lap, are the result of a pit stop by the car that lost the places; if we have access to pit information, we could perhaps dot these lines?
  • the “waist” in the chart for HUL shows cars coming together for a safety car, and then HUL losing pace to some cars whilst making advances on others;
  • lines with a constant gradient show a consistent gain or loss of time, per lap, over several laps;
  • a near vertical line shows a car keeping pace, and neither making nor losing time compared to the focus car.

Figure Aesthetics or Overlays?

Tinkering with a new chart type over the weekend, I spotted something rather odd in my F1 track history charts – what look to be outliers in the form of cars that hadn’t been lapped on that lap appearing behind the lap leader of the next lap, on track.

If you count the number of cars on that leadlap, it’s also greater than the number of cars in the race on that lap.

How could that be? Cars being unlapped, perhaps, and so “appearing twice” on a particular leadlap – that is, recording two laptimes between consecutive passes of the start/finish line by the race leader?

My fix for this was to add an “unlap” attribute that detects whether a car has recorded more than one laptime within a given lead lap:

#Overplot unlaps
lapTimes=ddply(lapTimes,.(leadlap,code),transform,unlap= seq_along(leadlap))

This groups by leadlap and car, and counts 1 for each occurrence. So if the unlap count is greater than 1, a car has completed more than one lap in a given leadlap.
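The `ddply`/`seq_along` counting step can be sketched in Python with a running counter per (leadlap, code) pair (`add_unlap` is a hypothetical helper name):

```python
from collections import defaultdict

def add_unlap(rows):
    """Number each (leadlap, code) occurrence, so a second pass of the
    start/finish line within one lead lap gets unlap == 2 - a sketch
    of the ddply(..., unlap = seq_along(leadlap)) step."""
    seen = defaultdict(int)
    for r in rows:
        key = (r['leadlap'], r['code'])
        seen[key] += 1
        r['unlap'] = seen[key]
    return rows

# VAN records two laptimes on lead lap 10 (an "unlap"), HAM just one
rows = [{'leadlap': 10, 'code': 'VAN'},
        {'leadlap': 10, 'code': 'VAN'},
        {'leadlap': 10, 'code': 'HAM'}]
print([r['unlap'] for r in add_unlap(rows)])  # [1, 2, 1]
```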

My first thought was to add this as an overprint on the original chart:

#Overprint unlaps
g = g + geom_point(data = lapTimes[lapTimes['unlap']>1,],
                   aes(x = trackdiff, y = leadlap, col=(leadlap-lap)), pch = 0)

This renders as follows:

Whilst it works, as an approach it is inelegant, and had me up in the night pondering the use of overlays rather than aesthetics.

We can instead view the fact that the car was on its second pass of the start/finish line for a given lead lap as a key property of the car, and depict that directly via an aesthetic mapping of that property onto the symbol type:

  g = g + geom_point(aes( x = trackdiff, y = leadlap,
                          col = (lap == leadlap),
                          pch= (unlap==1) ))+scale_shape_identity()

This renders just a single mark on the chart, depicting the diff to the leader as well as the unlapping characteristic, rather than the two marks used previously: one for the diff, and a second, overprinted mark to depict the unlapping nature of that point.

So now I’m wondering – when would it make sense to use multiple marks by overprinting?

Here’s one example where I think it does make sense: where I pass an argument into the chart plotter to highlight a particular driver by infilling a marker with a symbol to identify that driver.

#Drivers of interest passed in using construction: code=list(c("STR","+"),c("RAI","*"))
if (!is.na(code)){
  for (t in code) {
    g = g + geom_point(data = lapTimes[lapTimes['code'] == t[1], ],
                       aes(x = trackdiff, y = leadlap),
                       pch = t[2])
  }
}

In this case, the + symbol is not a property of the car, it is an additional information attribute that I want to add to that car, but not the other cars. That is, it is a property of my interest, not a property of the car itself.

Race Track Concordance Charts

Since getting started with generating templated R reports a few weeks ago, I’ve started spending the odd few minutes every race weekend around looking at ways of automating the generation of F1 qualifying and race reports.

In yesterday’s race, some of the commentary focussed on whether MAS had given BOT an “assist” in blocking VET, which got me thinking about better ways of visualising whether drivers are stuck in traffic or not.

The track position chart makes a start at this, but it can be hard to focus on a particular driver (identified using a particular character to infill the circle marker for that driver). The race leader’s track position ahead is identified from the lap offset race leader marker at the right hand side of the chart.

One way to help keep track of things from the perspective of a particular driver, rather than the race leader, is to rebase the origin of the x-axis relative to that driver.

In my track chart code, I use a dataframe that has a trackdiff column that gives a time offset on track to race leader for each lead lap.

track_encoder=function(lapTimes){
  #Find the accumulated race time at the start of each leader's lap
  lapTimes = ddply(lapTimes, .(leadlap), transform, lstart = min(acctime))

  #Find the on-track gap to leader
  lapTimes['trackdiff'] = lapTimes['acctime'] - lapTimes['lstart']
  lapTimes
}

Rebasing for a particular driver simply means resetting the origin with respect to that time, using the trackdiff time for one driver as an offset for the others, to create a new trackdiff2 for use on the x-axis.

#I'm sure there must be a more idiomatic way of doing this?
rebase=lapTimes[lapTimes['code']==code,c('leadlap','trackdiff')]
rebase=rename(rebase,c('trackdiff'='trackrebase'))
lapTimes=merge(lapTimes,rebase,by='leadlap')
lapTimes['trackdiff2']=lapTimes['trackdiff']-lapTimes['trackrebase']
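The rebasing step is just a per-lead-lap subtraction: take the chosen driver's trackdiff as the offset for everyone on that lead lap. A plain-Python sketch of the merge-and-subtract above (`rebase_trackdiff` is a hypothetical helper name, and it assumes the chosen driver appears on every lead lap in the data):

```python
def rebase_trackdiff(rows, code):
    """Rebase each lead lap's on-track gaps so the chosen driver sits
    at zero: subtract that driver's trackdiff from every car's
    trackdiff on the same lead lap (mirrors the merge/subtract step)."""
    offset = {r['leadlap']: r['trackdiff'] for r in rows if r['code'] == code}
    for r in rows:
        r['trackdiff2'] = r['trackdiff'] - offset[r['leadlap']]
    return rows

# Rebase a toy lead lap from MAS's perspective: BOT now shows as ahead
rows = [{'code': 'BOT', 'leadlap': 1, 'trackdiff': 0.0},
        {'code': 'MAS', 'leadlap': 1, 'trackdiff': 2.5}]
print([r['trackdiff2'] for r in rebase_trackdiff(rows, 'MAS')])  # [-2.5, 0.0]
```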

Here’s how it looks for MAS:

But not so useful for BOT, who led much of the race:

This got me thinking about text concordances. In the NLTK text analysis package, the text concordance function allows you to display a search term centred in the context in which it is found:

concordance

The concordance view finds the location of each token and then displays the search term surrounded by tokens in neighbouring locations, within a particular window size.

I spent a chunk of time wondering how to do this sensibly in R, struggling to identify what it was I actually wanted to do: for a particular driver, find the neighbouring cars in terms of accumulated laptime on each lap. After failing to see the light for more than an hour or so, I thought of it in terms of an SQL query, and the answer fell straight out – for the specified driver on a particular lead lap, get their accumulated laptime and the rows with accumulated laptimes in a window around it.

inscope=sqldf(paste0('SELECT l1.code as code,l1.acctime-l2.acctime as acctimedelta,
l2.lap-l1.lap as lapdelta, l2.lap as focuslap
FROM lapTimes as l1 join lapTimes as l2
WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]),')
AND l2.code="',code,'";'))
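The self-join logic is easy to see in plain Python too: for each of the focus driver's accumulated laptimes, sweep over all rows and keep those falling inside the window. This is a sketch of the sqldf query (`concordance` is a hypothetical helper name; `laptimes` is assumed to be a list of dicts with `code`, `lap` and `acctime` keys):

```python
def concordance(laptimes, code, window):
    """For each accumulated laptime of the focus driver, find the cars
    whose accumulated race time falls inside a (before, after) window
    around it - a plain-Python rendering of the SQL self-join."""
    focus = [r for r in laptimes if r['code'] == code]
    out = []
    for f in focus:
        for r in laptimes:
            delta = r['acctime'] - f['acctime']
            if -abs(window[0]) < delta < abs(window[1]):
                out.append({'code': r['code'],
                            'acctimedelta': delta,
                            'lapdelta': f['lap'] - r['lap'],
                            'focuslap': f['lap']})
    return out
```

The SQL version does the same double scan declaratively, which is why the query "fell straight out" once framed that way.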

Plotting against the accumulated laptime delta on the x-axis gives a chart like this:

If we add in horizontal rules to show laps where the specified driver pitted, and vertical bars to show pit windows, we get a much richer picture of the race from the point of view of that driver.

Here’s how it looks from the perspective of BOT, who led most of the race:

Different symbols inside the markers can be used to track different drivers (in the above charts, BOT and VET are highlighted). The colours identify whether or not a car is on the same lap as the specified driver: shades of blue then green for cars on laps ahead (as per “blue flag”), and orange to red for cars on increasing laps behind (i.e. backmarkers from the perspective of the specified driver). If a marker is light blue, that car is on the same lap and you’re racing…

All in all, I’m pretty chuffed (for now!) with how that chart came together.

And a new recipe to add to the Wrangling F1 Data With R book, I guess..

PS in response to [misunderstanding…] a comment from @sidepodcast, we also have control over the concordance window size, and the plotsize:

concordresize

Generating hi-res versions in other file formats is also possible.

Just got to wrap it all up in a templated report now…

PPS On the track position charts, I just noticed that where cars are lapped, they fall off the radar… so I’ve added them in behind the leader to keep the car count correct for each leadlap…

trackposrebaselapped

 

PS See also: A New Chart Type – Race Concordance Charts, which also includes examples of “line chart” renderings of the concordance charts so you can explicitly see the progress of each individually highlighted driver on track.

A First Attempt at Wrangling WRC (World Rally Championship) Data With pandas and matplotlib

Last year was a quiet year on the Wrangling F1 Data With R front, with not even an aborted start at doing a python/pandas equivalent project. With the new owners of F1 in place, things may change for the better in terms of engaging with fans and supporters, and I may revisit that idea properly, but in the meantime, I thought I’d start tinkering with a wider range of motorsport data.

The start of the BTCC season is still a few months away, but the WRC started over the weekend, and with review highlights and live coverage of one stage per rally on Red Bull TV, I thought I might give that data a go…

Results and timing info can be found on the WRC web pages (I couldn’t offhand find a source of official FIA timing sheets) so here’s a first quick sketch using stage results from the first rally of the year – Monte Carlo.

world_rally_championship_-_results_monte_carlo_-_wrc_com

To start with, we need to grab the data. I’m using the pandas library, which has a handy .read_html() method that can scrape tables (crudely) from an HTML page given its URL.

import pandas as pd

def getStageResultsBase(year,rallyid,stages):
    ''' Get stage results and overall results at end of stage '''
    
    # Accept one stage number or a list of stage numbers
    stages=[stages] if not isinstance(stages,list) else stages
    
    #There are actually two tables on the stage results page
    df_stage=pd.DataFrame()
    df_overallpart=pd.DataFrame()
    
    #Grab data for each stage
    for stage in stages:
        url='http://www.wrc.com/live-ticker/daten/{year}/{rallyid}/stage.{rallyid}.{stage}.all.html'.format(year=year, rallyid=rallyid, stage=stage)
        #scrape the data
        results=pd.read_html(url,encoding='utf-8')
        results[0].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        results[1].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        
        #Simple cleaning - cast the data types as required
        for i in [0,1]:
            results[i].fillna(0,inplace=True)
            results[i]['pos']=results[i]['pos'].astype(float).astype(int)
            for j in ['carNo','driverName','time','diffPrev','diffFirst']:
                results[i][j]=results[i][j].astype(str)
        
        #Add a stage identifier
        results[0]['stage']=stage
        results[1]['stage']=stage
        
        #Add the scraped stage data to combined stage results data frames
        df_stage=pd.concat([df_stage,results[0]])
        df_overallpart=pd.concat([df_overallpart,results[1]])

    return df_stage.reset_index(drop=True), df_overallpart.reset_index(drop=True)

The data we pull back looks like the following.

wrc_results_scraper1

Note that deltas (the time differences) are given as offset times in the form of a string. As the pandas library was in part originally developed for working with financial time series data, it has a lot of support for time handling. This includes the notion of a time delta:

pd.to_timedelta("1:2:3.0")
#Timedelta('0 days 01:02:03')

We can use this datatype to represent time differences from the results data:

#If we have hh:mm:ss format we can easily cast a timedelta
def regularTimeString(strtime):

    #Go defensive, just in case we're passed eg 0 as an int
    strtime=str(strtime)
    strtime=strtime.strip('+')

    modifier=''
    if strtime.startswith('-'):
        modifier='-'
        strtime=strtime.strip('-')

    timeComponents=strtime.split(':')
    ss=timeComponents[-1]
    mm=timeComponents[-2] if len(timeComponents)>1 else 0
    hh=timeComponents[-3] if len(timeComponents)>2 else 0
    timestr='{}{}:{}:{}'.format(modifier,hh,mm,ss)
    return pd.to_timedelta(timestr)
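The same left-padding idea can be checked in a self-contained way on a few toy delta strings (the example formats are assumptions about what the results tables contain):

```python
import pandas as pd

# Raw deltas arrive as e.g. "+12.3", "1:02.3" or "-1:02:03.0"; left-pad the
# missing hh/mm components so pandas can parse them as hh:mm:ss timedeltas
for raw, expected in [("+12.3", 12.3), ("1:02.3", 62.3), ("-1:02:03.0", -3723.0)]:
    s = raw.lstrip('+')
    sign = -1 if s.startswith('-') else 1
    parts = s.lstrip('-').split(':')
    hh, mm, ss = ([0, 0] + parts)[-3:]
    td = sign * pd.to_timedelta('{}:{}:{}'.format(hh, mm, ss))
    assert abs(td.total_seconds() - expected) < 1e-9
```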

We can use the time handler to cast the time differences from the scraped data as timedelta typed data:

def getStageResults(year,rallyid,stages):
    df_stage, df_overallpart = getStageResultsBase(year,rallyid,stages)
    for col in ['time','diffPrev','diffFirst']:
        df_stage['td_'+col]=df_stage.apply(lambda x: regularTimeString(x[col]),axis=1)
        df_overallpart['td_'+col]=df_overallpart.apply(lambda x: regularTimeString(x[col]),axis=1)
    return df_stage, df_overallpart 
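One payoff of the timedelta cast is that time arithmetic then just works. For example, stage times can be accumulated into an overall elapsed time (toy data, with column names assumed to match the scraper output):

```python
import pandas as pd

# Toy stage times as proper timedeltas
df = pd.DataFrame({'stage': [1, 2, 3],
                   'td_time': pd.to_timedelta(['0:03:12.4', '0:02:58.1', '0:04:01.0'])})

# Accumulate stage times into a running overall time
df['elapsed'] = df['td_time'].cumsum()

assert df['elapsed'].iloc[-1] == pd.to_timedelta('0:10:11.5')
```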

[Figure wrc_results_scraper2: results data frame with timedelta columns added]

The WRC results cover all entrants to the rally, but not all the cars are classed as full-blown WRC cars (class RC1). We can limit the data to just the RC1 cars and generate a plot showing the position of each driver at the end of each stage:

%matplotlib inline
import matplotlib.pyplot as plt

rc1=df_overall[df_overall['groupClass']=='RC1'].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='pos',ax=ax,legend=None);

[Figure wrc_results_scraper3: overall position of each RC1 driver at the end of each stage]

The position is actually the position of the driver across all entry classes, not just RC1. This means that if a driver has a bad day, they could be placed well down the all-class field; but that's not of much interest if all we care about is the in-class ranking.

So what if we rerank the drivers within the RC1 class? And perhaps improve the chart a little by adding a name label identifying each driver at their starting position?

rc1['rank']=rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='rank',ax=ax,legend=None)

#Add some name labels at the start
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['rank'], d['driverName'])

[Figure wrc_results_scraper4: in-class rank by stage, with driver name labels at the start]
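The in-class rank and start-of-line labelling can be sketched end to end on toy data (driver names and positions are made up; the Agg backend is used so no display is needed):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, no display required
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame: two drivers over two stages, with all-class positions
rc1 = pd.DataFrame({'stage': [1, 1, 2, 2],
                    'pos': [2, 5, 7, 3],
                    'driverName': ['OGIER', 'NEUVILLE', 'OGIER', 'NEUVILLE']})

# Rank within class, per stage
rc1['rank'] = rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots()
rc1.groupby('driverName').plot(x='stage', y='rank', ax=ax, legend=None)

# One text label per driver, placed at their stage-1 rank
for _, d in rc1[rc1['stage'] == 1].iterrows():
    ax.text(-0.5, d['rank'], d['driverName'])

assert len(ax.texts) == 2
```

Note that `iterrows()` yields `(index, row)` pairs, so the row's values are accessed directly as `d['rank']` and `d['driverName']`.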

This chart is a bit cleaner, but now we lose information about the lower-placed in-class drivers, in particular their overall position when the other classes are taken into account too…

The FIA recover this information in their stage chart displays, which report on the evolution of the race for the top 10 cars overall (irrespective of class): excursions outside the top 10 in interim stages are shown “below the line”, annotated with the car's overall classification on the corresponding stage.

[Figure: FIA stage chart (stage_chart___federation_internationale_de_l_automobile)]

We can borrow this idea by assigning a “re-rank” to each car placed outside the top RC1SIZE overall positions, where RC1SIZE is the number of cars entered in the RC1 class.

#Reranking...
RC1SIZE=12 #Number of RC1 class entries
rc1['xrank']= (rc1['pos']>RC1SIZE)
rc1['xrank']=rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank']=rc1.apply(lambda row: row['pos'] if row['pos']<=RC1SIZE else row['xrank']+RC1SIZE, axis=1)
fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None)

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

[Figure wrc_results_scraper5: re-ranked RC1 chart]

The chart now shows the evolution of the race for the RC1 cars, retaining the spaced ordering of the top 12 positions that would be filled by WRC1/RC1 cars if they were all placed above cars from other classes, and then bunching those placed outside the group size. (Only 11 names are shown because one of the entries retired right at the start of the rally.)

So for example, in this case we see how Neuville, Hanninen and Serderidis are classed outside Lefebvre, who was actually classed 9th overall.
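The re-rank recipe can be checked in isolation on toy data (an RC1SIZE of 12 as in the chart above; the positions are made up):

```python
import pandas as pd

RC1SIZE = 12  # number of RC1 class entries

# Toy frame: four in-class cars on one stage, two of them placed
# outside the top RC1SIZE positions overall
rc1 = pd.DataFrame({'stage': [1, 1, 1, 1],
                    'pos': [1, 3, 15, 20]})

# Flag cars outside the top RC1SIZE, then number those flags cumulatively
# within each stage so the "below the line" cars get consecutive slots
rc1['xrank'] = (rc1['pos'] > RC1SIZE)
rc1['xrank'] = rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank'] = rc1.apply(lambda row: row['pos'] if row['pos'] <= RC1SIZE
                         else row['xrank'] + RC1SIZE, axis=1)

# Cars in the top 12 keep their position; the two outside it are
# bunched into slots 13 and 14
assert list(rc1['xrank']) == [1, 3, 13, 14]
```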

Further drawing on the FIA stage chart, we can label the excursions outside the top 12, and also invert the y-axis.

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None);

#Label the excursions outside the top 12 with the overall position
for i,d in rc1[rc1['xrank']>RC1SIZE].iterrows():
    ax.text(d['stage']-0.1, d['xrank'], d['pos'],
            bbox=dict(boxstyle='round,pad=0.3', color='pink')) #facecolor='none',edgecolor='black',

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

#Flip the y-axis
plt.gca().invert_yaxis()

Lefebvre’s excursions outside the top 12 are now recorded and plain to see.

[Figure wrc_results_scraper6: final chart with below-the-line labels and inverted y-axis]

We now have a chart that blends rank ordering with additional information showing where cars are outpaced by cars from other classes, in a space efficient manner.

PS as with Wrangling F1 Data With R, I may well turn this into a Leanpub book, this time exploring the workflow to markdown (and/or maybe reveal.js slides!) from Jupyter notebooks, rather than from RStudio/Rmd.