Category: Rstats

First Thoughts on Detecting Motorsport Safety Car Periods from Laptimes

Prompted by Markku Hänninen, I thought I’d have a quick look at estimating motorsport safety car laps from a set of laptime data. For the uninitiated, if there is a dangerous hazard on track, the race-cars are kept out while the hazard is cleared, but led around by a safety car that limits the pace. No overtaking is allowed for race position, but under certain regulations, lapped cars may unlap themselves. Cars may also pit under the safety car.

Timing sheets typically don’t identify safety car periods, so the question arises: how can we detect them?

One condition that is likely to follow is that the average pace of the laps under safety car conditions will be considerably slower than under racing conditions. A quick way of estimating the race pace is to find the fastest laptime across the whole of the race (or in an online algorithm, the fastest laptime to date).

With a lapTimes dataframe containing columns lap, rawtime (that is, raw laptime in seconds) and position (that is, the race position of the driver recording a particular laptime on a particular lap), we can easily find the fastest lap:

minl=min(lapTimes['rawtime'])

We can find the mean laptime per lap using ddply() to group around each lap:

ddply(lapTimes[c('lap', 'rawtime', 'position')], .(lap), summarise,
      mean_laptime=mean(rawtime) )

We can also generate a variety of other measures. For example, within the grouped ddply operation, if we divide the mean laptime per lap (mean(rawtime)) by the fastest overall laptime we get a normalised mean laptime based on the fastest lap in the race.

ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
      norm_laptime=mean(rawtime)/minl )

We might also normalise the leader’s laptime for each lap, on the basis that the leader will be the car most likely to be driving at the safety car’s pace. (The summarising is essentially redundant here because we only have one row per group.

ddply(lapTimes[lapTimes['position']==1, c('lap','rawtime')], .(lap), summarise,
      norm_leaders_laptime=mean(rawtime)/minl )

Using the normalised times, we can identify slow laps. For example, slow laps based on mean laptime. In this case, I am using a heuristic that says the laptime is a slow laptime if the normlised time is more than 1.3 times that of the fastest lap:

ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
      slow_lap_meanBasis= (mean(rawtime)/minl) > 1.3 )

If we assume that the first lap does not start under the safety car, we can then make a crude guess that a slow lap not on the first lap is a safety car lap.

However, this does not take into account things like sudden downpours or other changes to the weather or track conditions. In such a case it may be likely that the majority of the field pits, so we might want to have a detector that flags whether a certain number of cars have pitted on a lap, possibly normalised against the current size of the field.

laps=lapsData.df(2016,4)
pits=pitsData.df(2016,4)[c('lap','driverId')]
pits['pitstop']=T
lapTimes=merge(laps, pits, by=c('lap','driverId'), all.x=T)
lapTimes['pitstop']=!is.na(lapTimes['pitstop'])

#Count of stops per lap
ddply( lapTimes, .(lap), summarise, ps=sum(pitstop==TRUE) )

#Proportion of cars stopping per lap
ddply( lapTimes, .(lap), summarise, ps=sum(pitstop==TRUE)/length(pitstop) )

That said, under safety car conditions, many cars do also take the opportunity to pit. However, under sudden changes of weather condition, we might expect nearly all the cars to come in, even if it means doubling up. (So another detector for weather might be two cars in the same team, close to each other in terms of gap, pitting on the same lap, with the result that one will be queued behind the other.)

As and when I get a chance, I’ll try to add some sort of ‘safety car’ estimator to the Wrangling F1 Data With R book.

When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor

Although not necessarily the best way of publishing data, data tables in PDF documents can often be extracted quite easily, particularly if the tables are regular and the cell contents reasonably space.

For example, official timing sheets for F1 races are published by the FIA as event and timing information in a set of PDF documents containing tabulated timing data:

R_-_Best_Sector_Times_pdf__1_page_

In the past, I’ve written a variety of hand crafted scrapers to extract data from the timing sheets, but the regular way in which the data is presented in the documents means that they are quite amenable to scraping using a PDF table extractor such as Tabula. Tabula exists as both a server application, accessed via a web browser, or as a service using the tabula extractor Java application.

I don’t recall how I came across it, but the tabulizer R package provides a wrapper for tabula extractor (bundled within the package), that lets you access the service via it’s command line calls. (One dependency you do need to take care of is to have Java installed; adding Java into an RStudio docker container would be one way of taking care of this.)

Running the default extractor command on the above PDF pulls out the data of the inner table:

extract_tables('Best Sector Times.pdf')

fia_pdf_sector_extract

Where the data is spread across multiple pages, you get a data frame per page.

R_-_Lap_Analysis_pdf__page_3_of_8_

Note that the headings for the distinct tables are omitted. Tabula’s “table guesser” identifies the body of the table, but not the spanning column headers.

The default settings are such that tabula will try to scrape data from every page in the document.

fia_pdf_scrape2

Individual pages, or sets of pages, can be selected using the pages parameter. For example:

  • extract_tables('Lap Analysis.pdf',pages=1
  • extract_tables('Lap Analysis.pdf',pages=c(2,3))

Specified areas for scraping can also be specified using the area parameter:

extract_tables('Lap Analysis.pdf', pages=8, guess=F, area=list(c(178, 10, 230, 500)))

The area parameter appears to take co-ordinates in the form: top, left, width, height is now fixed to take co-ordinates in the same form as those produced by tabula app debug: top, left, bottom, right.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.

Select_Tables___Tabula_concole

The tabula console output gives co-ordinates in the form: top, left, bottom, right so you need to do some sums to convert these numbers to the arguments that the tabulizer area parameter wants.

fia_pdf_head_scrape

Using a combination of “guess” to find the dominant table, and specified areas, we can extract the data we need from the PDF and combine it to provide a structured and clearly labeled dataframe.

On my to do list: add this data source recipe to the Wrangling F1 Data With R book…

First Thoughts on Automatically Generating Accessible Text Descriptions of ggplot Charts in R

In a course team accessibility briefing last week, Richard Walker briefly mentioned a tool for automatically generating text descriptions of Statistics Canada charts to support accessibility. On further probing, the tool, created by Leo Ferres, turned out to be called iGraph-Lite:

… an extensible system that generates natural language descriptions of statistical graphs, particularly those created for Statistics Canada online publication, “The Daily”. iGraph-Lite reads a graph from a set of graphing programs, builds intermediate representations of it in several formats (XML, CouchDB and OWL) and outputs a description using NLG algorithms over a “string template” mechanism.

The tool is a C# application compiled as a Windows application, with the code available (unchanged now for several years) on Github.

The iGraph-Lite testing page gives a wide range of example outputs:

iGraph_Testing_Page

Increasingly, online charts are labeled with tooltip data that allows a mouseover or tab action to pop-up or play out text or audio descriptions of particular elements of the chart. A variety of “good chart design” principles designed to support clearer, more informative graphics (for example, sorting categorical variable bars in a bar chart by value rather than using an otherwise arbitrary alphabetic ordering) not only improves the quality of the graphic, it also makes a tabbed through audio description of the chart more useful. For more tips on writing clear chart and table descriptions, see Almost Everything You Wanted to Know About Making Tables and Figures.

The descriptions are quite formulaic, and to a certain extent represent a literal reading of the chart, along with some elements of interpretation and feature detection/description.

Here are a couple of examples – first, for a line chart:

This is a line graph. The title of the chart is “New motor vehicle sales surge in January”. There are in total 36 categories in the horizontal axis. The vertical axis starts at 110.0 and ends at 160.0, with ticks every 5.0 points. There are 2 series in this graph. The vertical axis is Note: The last few points could be subject to revisions when more data are added. This is indicated by the dashed line.. The units of the horizontal axis are months by year, ranging from February, 2005 to January, 2007. The title of series 1 is “Seasonally adjusted” and it is a line series. The minimum value is 126.047 occuring in September, 2005. The maximum value is 153.231 occuring in January, 2007. The title of series 2 is “Trend” and it is a line series. The minimum value is 133.88 occuring in February, 2005. The maximum value is 146.989 occuring in January, 2007.

And then for a bar chart:

This is a vertical bar graph, so categories are on the horizontal axis and values on the vertical axis. The title of the chart is “Growth in operating revenue slowed in 2005 and 2006”. There are in total 6 categories in the horizontal axis. The vertical axis starts at 0.0 and ends at 15.0, with ticks every 5.0 points. There is only one series in this graph. The vertical axis is % annual change. The units of the horizontal axis are years, ranging from 2001 to 2006. The title of series 1 is “Wholesale” and it is a series of bars. The minimum value is 2.1 occuring in 2001. The maximum value is 9.1 occuring in 2004.

One of the things I’ve pondered before was the question of textualising charts in R using a Grammar of Graphics approach such as that implemented by ggplot2. As well as literal textualisation of grammatical components, a small amount of feature detection or analysis could also be used to pull out some meaningful points. (It also occurs to me that the same feature detection elements could also then be used to drive a graphical highlighting layer back in the visual plane.)

Prompted by my quick look at the iGraph documentation, I idly wondered how easy it would be to pick apart an R ggplot2 chart object and generate a textual description of various parts of it.

A quick search around suggested two promising sources of structured data associated with the chart objects directly: firstly, we can interrogate the object return from calling ggplot() and it’s associated functions directly; and ggplot_build(g), which allows us to interrogate a construction generated from that object.

Here’s an example from the quickest of plays around this idea:

library(ggplot2)
g=ggplot(economics_long, aes(date, value01, colour = variable))
g = g + geom_line() + ggtitle('dummy title')

#The label values may not be the actual axis limits
txt=paste('The chart titled"', g$labels$title,'";',
          'with x-axis', g$labels$x,'labeled from',
          ggplot_build(g)$panel$ranges[[1]]$x.labels[1], 'to',
          tail(ggplot_build(g)$panel$ranges[[1]]$x.labels, n=1),
          'and y-axis', g$labels$y,' labeled from',
          ggplot_build(g)$panel$ranges[[1]]$y.labels[1], 'to',
          tail(ggplot_build(g)$panel$ranges[[1]]$y.labels,n=1), sep=' ')
if (<span class="pl-s"><span class="pl-pds">'</span>factor<span class="pl-pds">'</span></span> <span class="pl-k">%in%</span> class(<span class="pl-smi">x</span><span class="pl-k">$</span><span class="pl-smi">data</span>[[<span class="pl-smi">x</span><span class="pl-k">$</span><span class="pl-smi">labels</span><span class="pl-k">$</span><span class="pl-smi">colour</span>]])){
  txt=paste(txt, '\nColour is used to represent', g$labels$colour)

  if ( class(g$data[[g$labels$colour]]) =='factor') {
    txt=paste(txt,', a factor with levels: ',
              paste(levels(g$data[[g$labels$colour]]), collapse=', '), '.', sep='')
  }
}

txt

#The chart titled "dummy title" with x-axis date labeled from 1970 to 2010 and y-axis value01 labeled from 0.00 to 1.00
#Colour is used to represent variable, a factor with levels: pce, pop, psavert, uempmed, unemploy.&amp;amp;amp;quot;

So for a five minute hack, I’d rate that approach as plausible, perhaps?!

I guess an alternative approach might also be to add an additional textualise=True property to the ggplot() call that forces each ggplot component to return a text description of its actions, and/or features that might be visually obvious from the action of the component as well as, or instead of, the graphical elements. Hmm… maybe I should use the same syntax as ggplot2 and try to piece together something along the lines of a ggdescribe library? Rather than returning graphical bits, each function would return appropriate text elements to the final overall description?!

I’m not sure if we could call the latter approach a “transliteration” of a chart from a graphical to a textual rendering of its structure and some salient parts of a basic (summary) interpretation of it, but it feels to me that this has some merit as a strategy for thinking about data2text conversions in general. Related to this, I note that Narrative Science have teamed up with Microsoft Power BI to produce a “Narratives for Power BI” plugin. I imagine this will start trickling down to Excel at some point, though I’ve already commented on how simple text summarisers can be added into to Excel using simple hand-crafted formulas (Using Spreadsheets That Generate Textual Summaries of Data – HSCIC).

PS does anyone know if Python/matplotib chart elements also be picked apart in a similar way to ggplot objects? [Seems as if they can: example via Ben Russert.]

PPS In passing, here’s something that does look like it could be handy for simple natural language processing tasks in Python: polyglot, “a natural language pipeline that supports massive multilingual applications”; includes language detection, tokenisation, named entity extraction, part of speech tagging, transliteration, polarity (i.e. crude sentiment).

PPPS the above function is now included in Jonathan Godfrey’s BrailleR package (about).

PPPPS for examples of best practice chart description, see the UK Association for Accessible Formats (UKAAF) guidance on Accessible Images.

 

Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers

In Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose I posted a recipe demonstrating how to link a Jupyter notebook container with a neo4j container to provide a quick way to get up an running with neo4j from a Python environment.

It struck me that it should be just as easy to launch an R environment, so here’s a docker-compose.yml file that will do just that:

neo4j:
  image: kbastani/docker-neo4j:latest
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

rstudio:
  image: rocker/rstudio
  ports:
    - "8787:8787"
  links:
    - neo4j:neo4j
  volumes:
    - ./rstudio:/home/rstudio

jupyterIR:
  image: jupyter/r-notebook
  ports:
    - "8889:8888"
  links:
    - neo4j:neo4j
  volumes:
    - ./notebooks:/home/jovyan/work

If you’re using Kitematic (available via the Docker Toolbox), launch the docker command line interface (Docker CLI), cd into the directory containing the docker-compose.yml file, and run the docker-compose up -d command. This will download the necessary images and fire up the linked containers: one running neo4j, one running RStudio, and one running a Jupyter notebook with an R kernel.

You should then be able to find the URLs/links for RStudio and the notebooks in Kitematic:

Screenshot_12_04_2016_08_59

Once again, Nicole White has some quickstart examples for using R with neo4j, this time using the Rneo4j R package. One thing I noticed with the Jupyter R kernel was that I needed to specify the CRAN mirror when installing the package: install.packages('RNeo4j', repos="http://cran.rstudio.com/")

To connect to the neo4j database, use the domain mapping specified in the Docker Compose file: graph = startGraph("http://neo4j:7474/db/data/")

Here’s an example in RStudio running from the container:

RStudio-neo4j

And the Jupyter notebook:

neo4j_R

Notebooks and RStudio project files are shared into subdirectories of the current directory (from which the docker compose command was run) on host.

Visualising F1 Stint Strategies

With the new F1 season upon us, I’ve started tinkering with bits of code from the Wrangling F1 Data With R book and looking at the data in some new ways.

For example, I started wondering whether we might be able to learn something interesting about the race strategies by looking at laptimes on a stint by stint basis.

To begin with, we need some data – I’m going to grab it directly from the ergast API using some functions that are bundled in with the Leanpub book…

#ergast functions described in: https://leanpub.com/wranglingf1datawithr/
#Get laptime data from the ergast API
l2=lapsData.df(2016,2)
#Get pits data from the ergast API
p2=pitsData.df(2016,2)

#merge pit data into the laptime data
l3=merge(l2,p2[,c('driverId','lap','rawduration')],by=c('driverId','lap'),all=T)

#generate an inlap flag (inlap is the lap assigned the pit time)
l3['inlap']=!is.na(l3['rawduration'])

#generate an outlap flag (outlap is the first lap of the race or laps starting from the pits
l3=ddply(l3,.(driverId),transform,outlap=c(T,!is.na(head(rawduration,-1))))

#use the pitstop flag to number stints; note: a drive through penalty increments the stint count
l3=arrange(l3,driverId, -lap)
l3=ddply(l3,.(driverId),transform,stint=1+sum(inlap)-cumsum(inlap))

#number the laps in each stint
l3=arrange(l3,driverId, lap)
l3=ddply(l3,.(driverId,stint),transform,lapInStint=1:length(stint))
l3=arrange(l3,driverId, lap)

The laptimes associated with the in- and out- lap associated with a pit stop add noise to the full lap times completed within each stint, so lets flag those laps so we can then filter them out:

#Discount the inlap and outlap
l4=l3[!l3['outlap'] & !l3['inlap'],]

We can now look at the data… I’m going to facet by driver, and also group the laptimes associated with each stint. Then we can plot just the raw laptimes, and also a simple linear model based on the full lap times within each stint:

#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime, col=factor(stint)))+facet_wrap(~driverId)

#Chart the raw laptimes within each stint
g+geom_line()

#Plot a simple linear model for each stint
g+ geom_smooth(method = "lm", formula = y ~ x)

So for example, here are the raw laptimes, excluding inlap and outlap, by stint for each driver in the recent 2016 Bahrain Formual One Grand Prix:

bah_2-16_racestint_raw

And here’s the simple linear model:

bah_2-16_racestint_lm

These charts highlight several things:

  • trivially, the number and length of the stints completed by each driver;
  • degradation effects in terms of the gradient of the slope of each stint trace;
  • fuel effects- the y-axis offset for each stint is the sum of the fuel effect (as the race progresses the cars get lighter and laptime goes down more or less linearly) and a basic tyre effect (the “base” time we might expect from a tyre). Based on the total number of laps completed in stints prior to a particular stint, we can calculate a fuel effect offset for the laptimes in each stint which should serve to normalise the y-axis laptimes and make more evident the base tyre laptime.
  • looking at the charts as a whole, we get a feel for strategy – what sort of tyre/stint strategy do the cars start race with, for example; are the cars going long on tyres without much degradation, or pushing for various length stints on tyres that lose significant time each lap? And so on… (What can you read into/from the charts? Let me know in the comments below;-)

If we assume a 0.083s per lap fuel weight penalty effect, we can replot the chart to account for this:

#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime+0.083*lap, col=factor(stint)))
g+facet_wrap(~driverId) +geom_line()

Here’s what we get:

f1_2016_bah_stint_raw_fc

And here’s what the fuel corrected models look like:

f1_2016_bah_stint_lm_fc

UPDATE: the above fuel calculation goes the wrong way – oops! It should be:

MAXLAPS=max(l4['lap'])
FUEL_PENALTY =0.083
.e = environment()
g=ggplot(l4,aes(x=lapInStint,y=rawtime-(MAXLAPS-lap)*FUEL_PENALTY,col=factor(stint)),environment=.e)

What we really need to do now is annotate the charts with additional tyre selection information for each stint.

We can also do a few more sums. For example, generate a simple average laptime per stint, excluding inlap and outlap times:

#Calculate some stint summary data
l5=ddply(l4,.(driverId,stint), summarise,
                               stintav=sum(rawtime)/length(rawtime),
                               stintsum=sum(rawtime),
                               stinlen=length(rawtime))

which gives results of the form:

          driverId stint   stintav stintsum stinlen
1          rosberg     1  98.04445 1078.489      11
2          rosberg     2  97.55133 1463.270      15
3          rosberg     3  96.15543  673.088       7
4          rosberg     4  96.32494 1637.524      17
5            massa     1  99.13600  495.680       5
6            massa     2 100.48300 2009.660      20
7            massa     3  98.77862 2568.244      26

It would possibly be useful to also compare inlap and outlaps somehow, as well as factoring in the pitstop time. I’m pondering a couple a possibilities for the latter :

  • amortise the pitstop time over the laps leading up to a pitstop by adding a pitsop lap penalty to each lap in that stint calculated as the pitstop time of the stint length of the laps in the stint leading up to the pitstop; this essentially penalises the stint that leads up to the pitstop as a consequence of forcing the pitstop;
  • amortise the pitstop time over the laps immediately following a pitstop by adding a pitsop lap penalty to each lap in that stint calculated as the pitstop time of the stint length of the laps in the stint following the pitstop; this essentially penalises the stint that immediately follows the pitstop, and discounts some of the benefit from the pitstop.

I haven’t run the numbers yet though, so I’m not sure how these different approaches will feel…

Another Route to Jupyter Notebooks – Azure Machine Learning

In much the same way that the IBM DataScientist Workbench seeks to provide some level of integration between analysis tools such as Jupyter notebooks and data access and storage, Azure Machine Learning studio also provides a suite of tools for accessing and working with data in one location. Microsoft’s offering is new to me, but it crossed my radar with the announcement that they have added native R kernel support, as well as Python 2 and 3, to their Jupyter notebooks: Jupyter Notebooks with R in Azure ML Studio.

Guest workspaces are available for free (I’m not sure if this is once only, or whether you can keep going back?) but there is also a free workspace if you have a (free) Microsoft account.

Microsoft_Azure_Machine_Learning_Studio

Once inside, you are provides with a range of tools – the one I’m interested in to begin with is the notebook (although the piepwork/dataflow experiments environment also looks interesting):

Notebooks_-_Microsoft_Azure_Machine_Learning_Studio2

Select a kernel:

Notebooks_-_Microsoft_Azure_Machine_Learning_Studio

give your new notebook a name, and it will launch into a new browser tab:

test1

You can also arrange notebooks within separate project folders. For example, create a project:

Projects_-_Microsoft_Azure_Machine_Learning_Studio

and then add notebooks to it:

Projects_-_Microsoft_Azure_Machine_Learning_Studio2
When creating a new notebook, you may have noted an option to View More in Gallery. The gallery includes examples of a range of project components, including example notebooks:

gallery_cortanaintelligence_com_browse_orderby_freshness_desc_skip_0_categories__5B_Notebook__5D

Thinking about things like the MyBinder app, which lets you launch a notebook in a container from a Github account, it would be nice to see additional buttons being made available to let folk run notebooks in Azure Machine Learning, or the Data Scientist Workbench.

It’s also worth noting how user tools – such as notebooks – seem to be being provided for free with a limited amount of compute and storage resource behind them as a way of recruiting users into platforms where they might then start to pay for more compute power.

From a course delivery perspective, I’m often unclear as to whether we can tell students to sign up for such services as part of a course or whether that breaks the service terms?  (Some providers, such as Wakari, make it clear that “[f]or classes, projects, and long-term use, we strongly encourage a paid plan or Wakari Enterprise. Special academic pricing is available.”) It seems unfair that we should require students to sign up for accounts on a “free” service in their own name as part of our offering for a couple of reasons at least: first, we have no control over what happens on the service; second, it seems that it’s a commercial transaction that should be formalised in some way, even if only to agree that we can (will?) send our students to that particular service exclusively. Another possibility is that we say students should make their own service available, whether by installing software themselves or finding an online provider for themselves.

On the other hand, trying to get online services provided at course scale in a timely fashion within an HEI seems to be all but impossible, at least until such a time as the indie edtech providers such as Reclaim Hosting start to move even more into end-user app provision either at the individual level, or affordable class level (with an SLA)…

See also: Seven Ways of Running IPython / Jupyter Notebooks.

New Version of “Wrangling F1 Data With R” Just Released…

So I finally got round to pushing a revised (and typo corrected!) version of Wrangling F1 Data With R: A Data Junkie’s Guide, that also includes a handful of new section and chapters, including descriptions of how to detect undercuts, the new style race history chart that shows the on-track position of each driver for each lap of a race relative to the lap leader, and a range of qualifying session analysis charts that show the evolution of session cut off times and drivers’ personal best times.

Code is described for every data manipulation and chart that appears in the book, along with directions for how to get hold of (most of) the data required to generate the charts. (Future updates will cover some of the scraping techniques required to get of of the rest of it!)

As well as the simple book, there’s also a version bundled with the R code libraries that are loaded in as a short-cut in many of the chapters.

The book is published on Leanpub, which means you can access several different electronic versions of the book, and once you’ve bought a copy, you get access to any future updates for no further cost…

There is a charge on the book, with a set minimum price, but you also have the freedom to pay more! Any monies received for this book go to cover costs (I’ve started trying to pay for the webservices I use, rather than just keep using their free plan). If the monthly receipts bump up a little, I’ll try to get some services that generate some of the charts interactively hosted somewhere…