Tinkering with Competitive Supertimes

I’m back on the R thang with F1 data from ergast.com, and started having a look at how drivers and teams compare at a circuit.

One metric I came across for comparing teams over a season is the supertime, typically calculated for each manufacturer as the average of their fastest single lap recorded by the team at each race weekend expressed as a percentage of the fastest single lap overall.

It struck me that we can also derive a reduced competitive supertime by basing the calculation on best laptime recorded across the qualifying and race sessions, omitting laptimes recorded in the practice sessions.

We can draw on the notion of supertimes to derive two simple measures for comparing team performances based on laptime:

  • evolution of manufacturer competitive supertime for a circuit over the years;
  • evolution of manufacturer competitive supertime for each circuit over the course of a season.

We can also produce driver performance metrics based on the competitive supertime of each driver.

So here are some notes on doing just that… I’m pulling the data from a MySQL database built from a datadump published via ergast.com.

q=paste0('SELECT circuitRef, c.name AS circuit, r.name as race, location, country FROM races r JOIN circuits c WHERE r.circuitId=c.circuitId AND year=',year,' AND round=',round)
cct = as.list(dbGetQuery(ergastdb, q))

cct
$circuitRef
[1] "bahrain"

$circuit
[1] "Bahrain International Circuit"

$race
[1] "Bahrain Grand Prix"

$location
[1] "Sakhir"

$country
[1] "Bahrain"

Calculate some competitive supertimes for manufacturers:

library(reshape2)
library(plyr)

q=paste0('SELECT q1, q2, q3, fastestLapTime, cn.name, d.code, year FROM races r JOIN circuits c JOIN results rs JOIN constructors cn JOIN qualifying q JOIN drivers d WHERE r.circuitId=c.circuitId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND cn.constructorId=rs.constructorId AND q.raceId=r.raceId AND q.driverId=rs.driverId AND cn.name IN (',teamsThisYear_str,') AND circuitref="',cct$circuitRef,'" ORDER BY year')
st = dbGetQuery(ergastdb, q)


st = melt(st,
          id.vars = c("name","code", "year"),
          measure.vars = c("q1", "q2", "q3", "fastestLapTime"))

st['time'] = as.numeric(apply(st['value'], 1, timeInS))

# Normalise the time
st = ddply(st, .(year), transform, ntime = time/min(time, na.rm=TRUE))

#Find the best time for each manufacturer per race weekend (quali+race)
stt = ddply(st,
            .(name,year),
            summarise, 
            stime = min(ntime, na.rm=TRUE))

stt$label = as.factor(stt$name)
stt

Here’s a chart theme I’ve used elsewhere:

library(ggplot2)
library(directlabels)
library(ggrepel)
library(ggthemes)

champChartTheme=function(g){
  g = g + theme_minimal(base_family="Arial Narrow")
  #g = g + theme(panel.grid.major.y=element_blank())
  g = g + theme(panel.grid.minor=element_blank())
  g = g + theme(axis.line.y=element_line(color="#2b2b2b", size=0.15))
  g = g + theme(axis.text.y=element_text(margin=margin(r=0, l=0)))
  g = g + theme(plot.margin=unit(rep(30, 4), "pt"))
  g = g + theme(plot.title=element_text(face="bold"))
  g = g + theme(plot.subtitle=element_text(margin=margin(b=10)))
  g = g + theme(plot.caption=element_text(size=8, margin=margin(t=10)))
  g
}

Create a supertime chart plotting function:

compSupertime = function(df, cct, labelSize=0.7, smooth=FALSE){
  g=ggplot(data=df,aes(x=year,y=stime))
  if (smooth){g=g+stat_smooth(aes(colour=label),method = "lm", formula= y ~ x + I(x^2), se=FALSE, size=0.7)}
  else { g=g+geom_line(aes(colour=label))}
  g=g+ guides(colour=FALSE)+xlim(min(df$year),max(df$year)+1)
  #cex is label size
  g=g+geom_dl(aes(label = label, colour=label),method = list(cex = labelSize, dl.trans(x = x + .3), "last.bumpup"))
  g=g+labs(x=NULL,y='Competitive Supertime (% of best)',
           title=paste0('F1 ', cct$race, ' - ','Competitive Supertimes',', ', min(stt$year), ' to ', max(df$year)),
           subtitle = paste0(cct$circuit,', ',cct$location,', ',cct$country),
           caption="Data from Ergast database, ergast.com/mrd")
  champChartTheme(g)
}

Let’s have a look…

compSupertime(stt,cct,smooth=F)

 


Alternatively:

compSupertime(stt,cct,smooth=T)

The best fit / model line obviously leaves something to be desired, eg in the case of Renault. But it’s a start.

I’ve also started working on a workflow to autopublish stuff to Gitbooks via github. For example, here’s the site in progress around the Bahrain F1 2018 Grand Prix. Here’s an earlier example (subject to updates/reflowing!) around the Australia 2018 F1 Grand Prix.

PS As I get back in to this, I’ll probably start updating the Wrangling F1 Data With R with recipes again… Also on the to do list is a second edition based on the tidyverse way of doing things…

Importing Functions From DevTesting Jupyter Notebooks

One of the ways I use Jupyter notebooks is as sketchbooks in which some code cells are used to develop useful functions and other are used as “in-passing” develop’n’test cells that include code fragments on the way to becoming useful as part of a larger function.

Once a function has been developed, it can be a pain getting it into a form where I can use it in other notebooks. One way is to copy the function code into a separate python file that can be imported into another notebook, but if the function code needs updating, this means changing it in the python file and the documenting notebook, which can lead to differences arising between the two versions of the function.

Recipes such as Importing Jupyter Notebooks as Modules provide a means for importing the contents of a notebook as a module, but they do so by executing all code cells.

So how can we get round this, loading – and executing – just the “exportable” cells, such as the ones containing “finished” functions, and ignoring the cruft?

I was thinking it might be handy to define some code cell metadata (‘exportable’:boolean, perhaps), that I could set on a code cell to say whether that cell was exportable as a notebook-module function or just littering a notebook as a bit of development testing.

The notebook-as-module recipe would then test to see whether a notebook cell was not just a code cell, but an exportable code cell, before running it. The metadata could also hook into a custom template that could export the notebook as python with the code cells set to exportable:False commented out.

But this is overly complicating and hides the difference between exportable and extraneous code cells in the metadata field. Because as Johannes Feist pointed out to me in the Jupyter Google group, we can actually use a feature of the import recipe machinery to mask out the content of certain code cells. As Johannes suggested:

what I have been doing for this case is the “standard” python approach, i.e., simply guard the part that shouldn’t run upon import with if __name__=='__main__': statements. When you execute a notebook interactively, __name__ is defined as '__main__', so the code will run, but when you import it with the hooks you mention, __name__is set to the module name, and the code behind the if doesn’t run.

Johannes also comments that “Of course it makes the notebook look a bit more ugly, but it works well, allows to develop modules as notebooks with included tests, and has the advantage of being immediately visible/obvious (as opposed to metadata).”

In my own workflow, I often make use of the ability to display as code cell output whatever value is returned from the last item in a code cell. Guarding code with the if statement prevents the output of the last code item in the guarded block from being displayed. However, passing a variable to the display() function as the last line of the guarded block displays the output as before.

Charts_-_Split_Sector_Delta

So now I have a handy workflow for writing sketch notebooks containing useful functions + cruft from which I can just load in the useful functions into another notebook. Thanks, Johannes :-)

PS see also this hint from Doug Blank about building up a class across several notebook code cells:

Cell 1:

class MyClass():
def method1(self):
print("method1")

Cell 2:

class MyClass(MyClass):
def method2(self):
print("method2")

Cell 3:

instance = MyClass()
instance.method1()
instance.method2()

 

See also: https://github.com/ipython/ipynb

Scraping Texts

One of the things I learned early on about scraping web pages (often referred to as “screen scraping”) is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:

  • display a database table as an HTML table in a web page;
  • display each row of a database as a templated HTML page.

The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or trying to reverse engineer the HTML template that converts data to HTML into something that can extract the data from the HTML back as a row in a corresponding data table.

In the latter case, the scrape may proceed in a couple of ways. For example:

  • by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;
  • parsing the recognisable literal text displayed on the web page and trying to extract data items based on that (i.e. ignore the HTML structural eelements and go straight for the extracted text). For an example of this sort of parsing, see the r1chardj0n3s/parse Python package as applied to text pulled from a page using something like the kennethreitz/requests-html package.

When scraping from PDFs, it is often necessary to make use of positional information (the co-ordinates that identify where on the page a particular piece of text can be found) as well as literal text / pattern matching to try to identify different structured items on the page.

In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts.

At the basic level, we may be able to do this by recognising structural patterns with the text. For example:

Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop

We can then do some simple pattern matching to extract the identified elements.

Within the text, there may also be things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe “things”, that is, entities, as well as identifying what sorts of “thing”, or entity, they are.

One powerful Python natural language processing package, spacy, has an entity recognition capability that lets us identify entities within a text in couple of ways. The spacy package includes models for a variety of languages that can identify thinks like people’s names (PEOPLE), company names (ORG), MONEY and DATE strings.

However, we can also extend spacy by developing our own models, or building on top of spacy‘s pre-existing models.

In the first case, we can build an enumerated model that explicitly identifies terms we want to match against a particular entity type. For example, we might have a list of MP names that we want to use to tag a text to identify wherever an MP is mentioned.

In the second case, we may want to build a more general sort of model. Again, spacy can help us here. One way of matching text items is to look at the “shape” of tokens (words) in a text. For example, we might extract the shape of the word “Hello” as “Xxxxx” to identify upper and lower case alphabetic characters. We might use the “d” symbol to denote a numerical character. A common UK postcode form may then be identified from its shape, such as XXd dXX or Xd dXX.

Another way of matching elements is to look at “parts of speech” (POS) tags and the patterns they make. If you remember your grammar, things like nouns, proper nouns or adjectives, or conjunctions and prepositions.

Looking at a sentence in terms of its POS tags provides another level of structure across which we might look for patterns.
The following shows how even a crude model can start to identify useful features in a text, albeit with some false matches:

For examples of scraping texts, see this notebook: psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes

PS related, in policy  / ethical best practice terms: ONS web-scraping policy

Some More Rally Result Chart Sketches

Some more sketches, developing / updating one of the charts I first played with last year (the stage chart and tinkering with something new.

First the stage chart – I’ve started pondering a couple of things with this chart to try to get the information density up a bit.

At a first attempt at updating the chart, I’ve started to look at adding additional marginal layers. In the example above:

  • vertical dashed lines separate out the different legs. As soon as I get the data to hand, I think it could make sense to use something like a solid line to show service, maybe a double solid line to show *parc fermé*; I’m not sure about additionally separating the days? (They’re perhaps implied by *parc fermé*? I need to check that…)
  • I added stage names *above* the chart  – this has the benefit of identifying stages that are repeated;
  • stage distances are added *below* the chart. I’ve also been wondering about adding the transit distances in *between* the stages;
  • driver labels – and positions – are shown to the left and the right.

As a second attempt, I started zooming in to just the stages associated with a particular leg. This encouraged me to start adding more detailed layers. These can be applied to the whole chart, but it may start to get a bit cluttered.

Here’s an example of a chart that shows three stages that make up a notional leg:

You’ll notice several additions to the chart:

  • the labels to the left identify the driver associated with each line. The number is currently the overall position of the driver at the end of the first stage in the leg, but I’m not sure if it should be the position at the end of the previous stage so it carries more information. The time is the gap to the overall leading driver at the end of the first stage;
  • the labels to the right show the overall positions and gap to overall leader at the end of the leg. The position label is in bold font if the driver position has improved over the leg (a switch lets you select whether this is a class rank improvement or an overall position improvement). Thinking about it, I could use italics for class improvement and bold for overall improvement to carry both pieces of information in the same label. The position is actually redundant (you can count…) so maybe it’d make more sense to give a position delta from the start of the leg (that is, the position at the end of the stage prior to the first stage shown in the current leg). The time delta is given in bold if it is better than at the start of the leg.
  • the red dots depict that the gap to the overall leader had *increased* for a driver by the end of the stage compared to the end of the previous stage. So a red dot means the car is further behind the leader at the end of the stage than they were at the end of the previous stage; this indicator could be rebased to show deltas between a target (“hero”) car and the other cars on the stage. The green dot shows that the time to the leader did not increase;
  • the grey labels at the top are a running count of the number of “wins in a row” a driver has had. There are options to choose other running counts (eg stage wins so far), and flags available for colouring things like “took lead”, “retained lead”, “lost lead”.

As well as the stage chart, I started wondering about an “ultimate stage report” for each stage, showing the delta between each driver and the best time achieved in a sector (that is, the time spent between two splits).

Here’s what I came up with at a first attempt. Time delta is on the bottom. The lower level grey bar indicates the time a driver lost relative to the “ultimate” stage. (The bar maxes out at the upper limit of the chart to indicate “more than” – I maybe need to indicate this visually eg with a dashed / broken line at the end of a maxed out bar.)

Within each driver area is a series of lollipop style charts. These indicate the gap between a driver and the best time achieved on the sector (first sector at the top of the group, last at the bottom). The driver label indicates the driver who achieved the best sector time. This chart could be rebased to show other gaps, but I need to think about that… The labels are coloured to indicate sector, and transparent to cope with some of the overlapping issues.

It’s also possible to plot this chart using a log scale:

This makes it easier to see the small gaps, as well as giving a far range on the delta. However, the log scale is harder to read for folk not familiar with them. It might be handy to put in a vertical dashed line for each power of 10 time (so a dashed line at 1s and 10s; the limit is 100s). It might also make sense to add a label to the right of the total delta bar to show what the actual delta time is.

So… tinkering… I was hoping to start to pull all the chart types I’ve been playing with together in a Leanpub book, but Leanpub is not free to play anymore unless you have generated over $10k of royalties (which I haven’t…). I’ve started looking at gitbook, but that’s new to me so I need to spend some time getting a feel for how to use it and to come up with a workflow /toolchain around it.

More Tinkering With PyTracery

Having started to look at using pytracery for data2text transforms with pandas dataframes, I’ve been pondering how to do computational things with it, such as counts, conditional tests, and loops.

Here’s where I’m at – defining counters, a branch / conditional statement, and a conditional loop:

def inc(text, *params):
    return str(int(text)+1)

def dec(text, *params):
    return str(int(text)-1)

def branch(text, *params):
    if text==params[0].strip(): 
        ans = params[1].strip()
    else:
        ans = params[2].strip()
    return '[answer:#{}#]'.format(ans)

def whileNot0(text, *params):
    if int(text) > 0:
        return '[i:{i}]'.format(i=str(int(text)-1))
    return '[do:#{}#]'.format(params[0].strip())

pytracery_logic = {
    'inc': inc,
    'dec': dec,
    'branch':branch,
    'whileNot0': whileNot0
}

Count example:

rules = {'origin': 'Count demo - #countdemo#',
         'countdemo':'Count at [cnt:#initval#]#cnt#.\n#addone#, #addanother#, #takeone#',
         'addone': 'Add one gives [cnt:#cnt.inc#]#cnt#',
         'addanother': 'add another one gives [cnt:#cnt.inc#]#cnt#',
         'takeone':'take one gives [cnt:#cnt.dec#]#cnt.int_to_words#',
}

grammar = tracery.Grammar(rules)
grammar.add_modifiers(pytracery_logic)
grammar.add_modifiers(inflect_english)

print(grammar.flatten('[initval:1]#origin#'))
> Count demo - Count at 1.
> Add one gives 2, add another one gives 3, take one gives two

Logical test:

rules={'origin':'Logical test:[branch:#condition#]#branch# Testing #value# gives: #answer#',
       'condition': '#value.branch(VeryTrue, answer1, answer2)#',
    'answer1':'This is true...',
    'answer2':'This is false...',
    'answer3':'nope',
    'answer':'answer3'
}

grammar = tracery.Grammar(rules)
grammar.add_modifiers(pytracery_logic)

print(grammar.flatten("[value:VeryTrue]#origin#"))
> Logical test: Testing VeryTrue gives: This is true...

print(grammar.flatten("[value:notTrue]#origin#"))
> Logical test: Testing notTrue gives: This is false...

Loop test:

rules = {'origin': 'Loop test: \n#loop#',
         'loop':'[while:#i.whileNot0(end)#]#while##do#',
         'do':'#action##loop#',
         'action':'- count is at #i#\n',
         'end':'All done'
        }

grammar = tracery.Grammar(rules)
grammar.add_modifiers(pytracery_logic)
print(grammar.flatten("[i:6]#origin#"))
> Loop test: 
> - count is at 5
> - count is at 4
> - count is at 3
> - count is at 2
> - count is at 1
> - count is at 0
> All done

I’m a bit out of the way of declarative thinking (which pytracery relies on) atm, so there may be more native /idiomatic ways of doing this sort of stuff in pytracery.

I’m also a bit rusty on assembler / machine code programming, so there may be better ways of phrasing basic computational operations using pytracery using those sorts of operators.

Next step is to see whether I can start to use these reasoning steps in association with data pulled from a pandas dataframe.

Rally Stage Sector Charts, Overlaid On Stage Progress Charts

One of the last rally charts I sketched last year was a dodged bar chart showing “sector” times within a stage for each driver, either as absolute times or rebased relative to a particular target driver. This represented an alternative to the “driver subseries” line charts e.g. as shown here.

Re: the naming of driver subseries charts – this is intended to be reminiscent of seasonal subseries charts.

The original slit time data on the WRC site looks like this:

Taking the raw sector (split) times, we can rebase the times relative to a particular driver. In this case, I have rebased relative to OGI so the sector times are as shown in the table above. The colour basis is the opposite to the basis used in the chart because I am trying to highlight to the target driver where they lost time, rather than where the others gained time. It may be that the chart makes more sense to professionals if I change the colour basis in the chart below, to use green to show that the driver made up that amount of time on the target driver).

The dodged bar charts are ordered with the first split time at the top of the set for each driver. The overall stage time is the lower bar in each driver group.

Here’s how it looks using the other colour basis:

Hmm… maybe that is better…

Note that the above charts also differs from the WRC split times results table in the ordering. The results table orders results in order of start, whereas the above charts rank relative to stage position.

To generate the “sector” times, we can find the difference between each split for a particular driver, and between the final (overall) stage time and the final split time. As before, we can then rebase these times relative to a specific target driver.

This chart shows how the target driver compared to each of the other drivers on each of the “sectors” between each split. So we see OGI dropped quite a lot of time relative to other drivers on the fourth “sector” of the stage, between splits 3 and 4. He made up time against MEE on the first and second parts of the stage.

As well as just showing the split times, we can find the total time delta between the target driver and each of the other drivers on the stage as the sum of the sector times. We can use a lower graphic layer to underplot this total beneath the dodged bars for each driver.

The grey bars show the total time gained / lost by the target driver on the stage relative to the other drivers.

In a simular way, we can also overplot the dodged bars on top of a stage progress chart, recolouring slightly.

This increases the information density of the stage progress chart even further, and provides additional “delta” signals in keeping with the deltascope inspiration / basis for that chart type.

Again, this sort of fits with the warped hydraulic model: the dodged bars can be stacked to give the length of the lighter coloured bar underneath them

(I’m not sure I’ve ordered the drivers correctly in the above chart – it was generated from two discordantly arranged datasets. The final chart will be generated from a single, consistent dataframe.)

PS it strikes me that the dodged bars need to be transparent to show a solid bar underneath that doesn’t extend as far as the dodged bars. This might need some careful colour / transparency balancing.