Tagged: f1dj

Notes – Detecting Safety Car Laps from F1 Laptime Data, Part 1

Looking at some track/lap maps for F1 races, there are several visual cues for when a safety car has been deployed.

For example, here’s a chart from the 2018 Chinese Grand Prix showing when a safety car arrived. The bunching of times at the left show that a full safety car is in track, bunching up the cars. The time-distance that race leader BOT is behind the car at the back in terms of track position increases to show the slower than race pace lap time.

Here’s how the race from Australia 2018 looked. In this case, there was a virtual safety car to begin with – the field doesn’t close up but the time-distance between the race leader and the track position backmarker increases to reflect the reduced laptime:

Singapore, 2017, was a wet race with lots of safety car action. To begin with, the race started under the safety car, and came out again on lap 11. The steady pattern of race trails show there was no safety car around lap 29/30, although there were pitstops, but there was a safety car around lap 40. I’m not sure if a VSC went out first, or whether the teams took the opportunity to pit.


So how can we detect full and virtual safety cars? Certainly there seems to be graphical evidence of safety cars – both full and virtual – in the above charts. But can we also identify and detect some numerical signatures, and if so, what form might they take? That will be in the next post…

Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate

In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story generation tool to generate variety of texts from Python pandas data frames.

For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example).

Whilst there is a basic port of tracery to R, I want to make use of the various extensions I’ve been doodling with to pytracery, so it seemed like a good opportunity to start exploring the R reticulate package.

It was a bit of a faff trying to get things to work the first time, so here on some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.

Python Environment

My first attempt was to use python blocks in an Rmd document:

import sys

but R insisted on using the base Python path on my Mac that was not the path I wanted to use… The fix turned out to be setting the engine…

```{python, engine.path ='/Users/f1dj/anaconda3/bin/python' }
import sys

This could also be done via a setting: opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')

One of the problems with this approach is that a Python environment is created for each chunk – so you can’t easily carry state over from one Python chunk to another.

So I had a look at a workaround using reticulate instead.

Calling pytracery from R using reticulate

The solution I think I’m going for is to put Python code into a file, call that into R, then pass an R dataframe as an argument to a called Python function and gett a response back into R as an R dataframe.

For example, here’s a simple python test file:

import tracery
from tracery.modifiers import base_english
import pandas as pd

def pandas_row_mapper(row, rules, root,  modifiers=base_english):
    ''' Function to parse single row of dataframe '''

    for k in row:
        rules[k] = str(row[k])
        grammar = tracery.Grammar(rules)
        if modifiers is not None:
            if isinstance(modifiers,list):
                for modifier in modifiers:

    return grammar.flatten(root)

def pandas_tracery(df, rules, root, modifiers=base_english):
  return df.apply(lambda row: pandas_row_mapper(row, rules, root, modifiers), axis=1)

def pdt_inspect(df):

def pdt_test1(df):
  return type(df)

def pdt_demo(df):
  return pandas_tracery(df, _demo_rules, "#origin#",  modifiers=base_english)

#Create example rule to apply to each row of dataframe
_demo_rules = {'origin': "#code# was placed #position#!",
         'position': "#pos.uppercase#"}

We can access a python environment using reticulate:


#Show conda environments

#Use a particular, name conda environment
use_condaenv(condaenv='anaconda3', required=T)

#Check the availability of a particular module in the environment

Now we can load in the python file – and the functions it defines – and then call one of the  loaded Python functions.

Note that I seemed to have to force the casting of the R dataframe to a python/pandas dataframe using r_to_py(), although I’d expected the type mapping to be handled automatically? (Perhaps there is a setting somewhere?)

df1=data.frame(code=c('Jo','Sam'), pos=c('first','Second'))
df1$result = pdt_demo(r_to_py(df1, convert=T))

Jo	first	Jo was placed FIRST!
Sam	Second	Sam was placed SECOND!

(Note: I also spotted a gotcha – things don’t work so well if you define an R column name called name… )

So now I can start looking at converting sports reporting tropes like these:

into tracery story models I can call using my pandas/pytracery hacks:-)

PS here’s a quick demo of inlining Python code:


#Go into python shell - this persists
#Access R variables with r.

#Return to R shell

#Access Python variable with py$

Sketch – Data Trivia

A bit more tinkering with F1 data from the ergast db, this time trying to generating trivia / facts around races.

The facts are identified using SQL queries:

#starts for a team
q=paste0('SELECT d.code, COUNT(code) AS startsforteam, c.name AS name FROM drivers d JOIN races r JOIN results rs JOIN constructors c WHERE c.constructorId=rs.constructorId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND d.code IN (',driversThisYear_str,') ',upto,' GROUP BY d.code, c.name HAVING (startsforteam+1) % 50 = 0')
startsTeammod50=dbGetQuery(ergastdb, q)

#looking for poles to date modulo 5 
q=paste0('SELECT d.code, COUNT(code) AS poles FROM drivers d JOIN qualifying q JOIN races r WHERE r.raceId=q.raceId AND d.code IN (',driversThisYear_str,') AND d.driverId=q.driverId AND q.position=1',upto,' GROUP BY code HAVING poles>1 AND (poles+1) % 5 = 0')
lookingpolesmod5=dbGetQuery(ergastdb, q)

Some of the queries also embed query fragments, which I intend to develop further…

upto=paste0(' AND (year<',year,' OR (year=',year,' AND round<',round,')) ')

I'm using knitr to generate Github flavoured markdown (gfm) from my Rmd docs – here’s part of the header:

    variant: gfm

The following recipe then takes results from the trivia queries and spiels the output:

if (nrow(startsTeammod50)>0) {
  for (row in 1:nrow(startsTeammod50)) {
    text = '- `r startsTeammod50[row, "code"]` is looking for their `r toOrdinal(startsTeammod50[row, "startsforteam"]+1)` start for `r startsTeammod50[row, "name"]`'

if (nrow(lookingpolesmod5)>0) {
  for (row in 1:nrow(lookingpolesmod5)) {
    text = '- `r lookingpolesmod5[row, "code"]` is looking for their `r toOrdinal(lookingpolesmod5[row, "poles"]+1)` ever pole position'

We then get outputs of the form:

  • BOT is looking for their 100th race start
  • HAM is looking for their 100th start for Mercedes

See more example outputs here: Bahrain F1 2018 – Race Trivia.

This is another recipe I need to work up a little further and add to Wrangling F1 Data With R.

Tinkering with Competitive Supertimes

I’m back on the R thang with F1 data from ergast.com, and started having a look at how drivers and teams compare at a circuit.

One metric I came across for comparing teams over a season is the supertime, typically calculated for each manufacturer as the average of their fastest single lap recorded by the team at each race weekend expressed as a percentage of the fastest single lap overall.

It struck me that we can also derive a reduced competitive supertime by basing the calculation on best laptime recorded across the qualifying and race sessions, omitting laptimes recorded in the practice sessions.

We can draw on the notion of supertimes to derive two simple measures for comparing team performances based on laptime:

  • evolution of manufacturer competitive supertime for a circuit over the years;
  • evolution of manufacturer competitive supertime for each circuit over the course of a season.

We can also produce driver performance metrics based on the competitive supertime of each driver.

So here are some notes on doing just that… I’m pulling the data from a MySQL database built from a datadump published via ergast.com.

q=paste0('SELECT circuitRef, c.name AS circuit, r.name as race, location, country FROM races r JOIN circuits c WHERE r.circuitId=c.circuitId AND year=',year,' AND round=',round)
cct = as.list(dbGetQuery(ergastdb, q))

[1] "bahrain"

[1] "Bahrain International Circuit"

[1] "Bahrain Grand Prix"

[1] "Sakhir"

[1] "Bahrain"

Calculate some competitive supertimes for manufacturers:


q=paste0('SELECT q1, q2, q3, fastestLapTime, cn.name, d.code, year FROM races r JOIN circuits c JOIN results rs JOIN constructors cn JOIN qualifying q JOIN drivers d WHERE r.circuitId=c.circuitId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND cn.constructorId=rs.constructorId AND q.raceId=r.raceId AND q.driverId=rs.driverId AND cn.name IN (',teamsThisYear_str,') AND circuitref="',cct$circuitRef,'" ORDER BY year')
st = dbGetQuery(ergastdb, q)

st = melt(st,
          id.vars = c("name","code", "year"),
          measure.vars = c("q1", "q2", "q3", "fastestLapTime"))

st['time'] = as.numeric(apply(st['value'], 1, timeInS))

# Normalise the time
st = ddply(st, .(year), transform, ntime = time/min(time, na.rm=TRUE))

#Find the best time for each manufacturer per race weekend (quali+race)
stt = ddply(st,
            stime = min(ntime, na.rm=TRUE))

stt$label = as.factor(stt$name)

Here’s a chart theme I’ve used elsewhere:


  g = g + theme_minimal(base_family="Arial Narrow")
  #g = g + theme(panel.grid.major.y=element_blank())
  g = g + theme(panel.grid.minor=element_blank())
  g = g + theme(axis.line.y=element_line(color="#2b2b2b", size=0.15))
  g = g + theme(axis.text.y=element_text(margin=margin(r=0, l=0)))
  g = g + theme(plot.margin=unit(rep(30, 4), "pt"))
  g = g + theme(plot.title=element_text(face="bold"))
  g = g + theme(plot.subtitle=element_text(margin=margin(b=10)))
  g = g + theme(plot.caption=element_text(size=8, margin=margin(t=10)))

Create a supertime chart plotting function:

compSupertime = function(df, cct, labelSize=0.7, smooth=FALSE){
  if (smooth){g=g+stat_smooth(aes(colour=label),method = "lm", formula= y ~ x + I(x^2), se=FALSE, size=0.7)}
  else { g=g+geom_line(aes(colour=label))}
  g=g+ guides(colour=FALSE)+xlim(min(df$year),max(df$year)+1)
  #cex is label size
  g=g+geom_dl(aes(label = label, colour=label),method = list(cex = labelSize, dl.trans(x = x + .3), "last.bumpup"))
  g=g+labs(x=NULL,y='Competitive Supertime (% of best)',
           title=paste0('F1 ', cct$race, ' - ','Competitive Supertimes',', ', min(stt$year), ' to ', max(df$year)),
           subtitle = paste0(cct$circuit,', ',cct$location,', ',cct$country),
           caption="Data from Ergast database, ergast.com/mrd")

Let’s have a look…





The best fit / model line obviously leaves something to be desired, eg in the case of Renault. But it’s a start.

I’ve also started working on a workflow to autopublish stuff to Gitbooks via github. For example, here’s the site in progress around the Bahrain F1 2018 Grand Prix. Here’s an earlier example (subject to updates/reflowing!) around the Australia 2018 F1 Grand Prix.

PS As I get back in to this, I’ll probably start updating the Wrangling F1 Data With R with recipes again… Also on the to do list is a second edition based on the tidyverse way of doing things…

Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API

A couple of days ago, I noticed that Chris Newell had shared the code for his ergast Formula 1 motor racing results database that I use heavily for my F1DtataJunkie doodles. The code is a PHP application that pulls data from the ergast database, which is regularly shared as a MySQL database dump. So I raised an issue asking if a docker containerised version of the application was available, and Chris replied that there wasn’t, but if I wanted to look at creating one…?

…which I did. After a few false starts, I came up with the solution Chris has since pulled into his ergast-f1-api repo.

The pattern is quite a handy one, and I think reusable – I’ll give it a go for spinning up my own API to look up ONS Geography codes when I get a chance – so what’s involved?

The dockerised application is built from two components launched using docker-compose, a MySQL container and an Apache server configured with PHP5:

  container_name: ergastdb
  build: ergastdb/
    MYSQL_DATABASE: ergastdb
    - "3306"

  build: ./lamp
  #image: nimmis/apache-php5
    - '8000:80'
    - ./webroot:/var/www/html
    - ./php/api:/php/api
    - ./logs:/var/log/apache2
    - ergastdb:ergastdb

The server runs the application, makes requests from the linked database container, and returns the result.

As part of my f1datajunkie tinkering, I’d put together a Dockerfle for populating a MySQL database with Chris’ database dump some time ago, so I could reuse that directly.

Which meant all I had to do was get the application up and running… Chris’ original instructions around the API server application were to place the application files into “the root directory” and also add in an Apache .htaccess URL rewrite, which he provided.

Simple… or maybe not..?! Not being an Apache user, it took me a bit of time to actually get things up and running, so here are some of the gotchas that caught me out:

  • where’s the “root directory”?
  • where should the .htaccess file?

Running a couple of simple tests identified the root directory as the root for the files served by the webserver, and a quick search revealed the .htaccess file should go in the same location.

But the redirects didn’t work…

As a test, I wrote a simple rewrite rule that should have redirected any url to the same test file – but no joy…

A bit more testing suggested I needed to enable the mod_rewrite plugin, which I did in the appropriate Dockerfile. But still no joy. Because to use patches like the .htaccess file in a server directory also requires allowing “Overrides” to the base Apache settings – which I did (via a crib) by rewriting the Apache config file, again via the Dockerfile that builds the server container image.

RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/ s/AllowOverride None/AllowOverride All/' /etc/apache2/apache2.conf

RUN a2enmod rewrite

Then the database connection didn’t work… But that was easily fixed: I’d forgotten to change the permission is in the appropriate application file.

Now running the docker-compose file with the command docker-compose up --build -d and it fires up two linked containers. The API is then available via a browser on http://localhost:8000/api/f1 using calls of the form http://localhost:8000/api/f1/2015.json.

So now I have a minimal working example of a PHP application powered by a MySQL database driving a simple web API. Which I can crib from for my own simple APIs…

As to how it could be improved, there are a couple of obvious things:

  • the containers are intended for personal, local use: the database is accessed as root and should be properly secured;
  • I haven’t got a pattern for updating the database when a new database image is released. The current workaround would be to destroy the containers, and the database volume, then rebuild the images and the containers.

PS to use the local ergast API server from R with my ergastR package, I needed to tweak the package to allow me to specify the path the API server. I’ll post details of how I did that later…

ergastR – R Wrapper for ergast F1 Results Data API

By the by, I’ve posted a first attempt at an R package – ergastR to wrap the ergast developer API, which is where I get chunks of data from for my f1datajunkie tinkerings.

You can find it on Github: psychemedia/ergastR.

The function names are the ones used in the Wrangling F1 Data With R book.

The R package needs a bit of tidying up and also needs work on the following: cacheing, so that we don’t keep hitting the ergast API unnecessarily; paged results handling (I fudge this a bit at the moment by explicitly setting a large results limit); and dual handling of ergast API versus downloaed ergast database requests (if a database connection string is passed, use that rather than make a call to the ergast API). But it’s a start… Feel free to raise issues via the repo:-)

In related news, Will Vaughan tipped me off to a Python package he’s started putting together to wrap the ergast API: ergast-python. He’s also making a start on some Wrangling F1 Data Jupyter notebooks that make use of the Python wrapper: wranglingf1data.

From Points to (Messy) Lines

A week or so ago, I came up with a new chart type – race concordance charts – for looking at a motor circuit race from the on-track perspective of a particular driver. Here are a couple of examples from the 2017 F1 Grand Prix:

The gap is the time to the car on track ahead (negative gap, to the left) or behind (to the right). The colour indicates whether the car is on the same lap (light blue),  on the lap behind (orange to red), or a lap ahead (dark blue).

In the dots, we can “see” lines relating to the relative progress of particular cars. But what if we actually plot the progress of each of those other cars as a line? The colours represent different cars.



Here’s another view of the track from Hulkenberg’s perspective with a wider window, whoch by comparison with the previous chart suggests I need to handle better cars that do not drop off the track but do fall out of the display window… (At the moment, I only grab data for cars in the specified concordance window):


Note that we need to do a little bit of tidying up of the data so that we don’t connect lines for cars that flow off the left hand edge, for example, and then return several laps later from the right hand edge:

#Get the data for the cars, as before
inscope=sqldf(paste0('SELECT l1.code as code,l1.acctime-l2.acctime as acctimedelta,
                       l2.lap-l1.lap as lapdelta, l2.lap as focuslap
                       FROM lapTimes as l1 join lapTimes as l2
                       WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]),')
                       AND l2.code="',code,'";'))

  #If consecutive rows for same driver are on more than one focuslap apart, break the line
  #Continuous line segments have the same driver code and "group" number

  g = ggplot(inscope)

  #The interaction splits up the groups based on code and the contiguous focuslap group number
  #We also need to ensure we plot acctimedelta relative to increasing focuslap
  g=g+geom_line(aes(x=focuslap, y=acctimedelta, col=code,group=interaction(code, g)))
  #...which means we then need to flip the axes

There may still be some artefacts in the line plotting based on lapping… I can’t quite think this through at the moment:-(

So here’s my reading:

  • near horizontal lines that go slightly up and to the right, and where a lot of places in the window are lost in a single lap are a result of pit stop by the car that lost the places; if we have access to pit information, we could perhaps dot these lines?
  • the “waist” in the chart for HUL shows cars coming together for a safety car, and then HUL losing pace to some cars whilst making advances on others;
  • lines with a constant gradient show a  consistent gain or loss of time, per lap, over several laps;
  • a near vertical line shows a car keeping pace, and neither making nor losing time compared to the focus car.