F1 Timing Screen as a Spreadsheet?

One of the comment themes I’ve noticed around the first Challenge in the Tata F1 Connectivity Innovation Prize, a challenge to rethink what’s possible around the timing screen given only the data in the real time timing feed, is that the non-programmers don’t get to play. I don’t think that’s true – the challenge seems to be open to ideas as well as practical demonstrations – but it got me thinking about what technical ways in there might be for non-programmers who wouldn’t know where to start when it came to working with the timing stream messages.

The answer is surely the timing screen itself… One of the issues I still haven’t fully resolved is a proven way of getting useful information events from the timing feed – it updates the timing screen on a cell by cell basis, so we have to finesse the way we associate new laptimes or sector times with a particular driver, bearing in mind cells update one at a time, in a potentially arbitrary order, and with potentially different timestamps.

[Image: example timing screen, from The F1 Connectivity Innovation Prize – Challenge 1 Brief (PDF)]

So how about if we work with a “live information model” by creating a copy of an example timing screen in a spreadsheet? If we know how, we might be able to parse the real data stream to directly update the appropriate cells, but that’s largely by the by. At least we have something we can work with to start playing with the timing screen in terms of a literal reimagining of it. So what can we do if we put the data from an example timing screen into a spreadsheet?

If we create a new worksheet, we can reference the cells in the “original” timing sheet and pull values over. The timing feed updates cells on a cell by cell basis, but spreadsheets are really good at rippling through changes from one or more cells which are themselves referenced by one or more others.

The first thing we might do is just transform the shape of the timing screen. For example, we can take the cells in a column relating to sector 1 times and put them into a row.

The second thing we might do is start to think about some sums. For example, we might find the difference between each of those sector times and (for practice and qualifying sessions at least) the best sector time recorded in that session.

The third thing we might do is to use a calculated value as the basis for a custom cell format that colours the cell according to the delta from the best session time.
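For anyone who would rather prototype the same three steps outside a spreadsheet, here is a minimal Python sketch. The sector times are made-up illustrative values, and the colour names and thresholds are my own assumptions rather than anything defined by the timing feed.

```python
# Made-up sector 1 times: a spreadsheet "column" is modelled as a list of one-cell rows
sector1_column = [[31.9], [31.6], [32.4], [31.7]]

# Step 1: reshape the column into a single row
sector1_row = [cells[0] for cells in sector1_column]

# Step 2: delta between each sector time and the best (lowest) time in the session
best = min(sector1_row)
deltas = [round(t - best, 3) for t in sector1_row]

# Step 3: choose a cell colour from the delta (thresholds are arbitrary assumptions)
def delta_colour(delta):
    if delta == 0:
        return "PURPLE"  # the session best itself
    if delta <= 0.2:
        return "GREEN"   # close to the session best
    return "YELLOW"      # further off the pace

colours = [delta_colour(d) for d in deltas]
```

In a spreadsheet, step 2 would be a formula referencing the original sheet and step 3 a conditional formatting rule keyed on the calculated delta.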

Simple, but a start.

I’ve not really tried to push this idea very far – I’m not much of a spreadsheet jockey – but I’d be interested to know how folk who are might be able to push this idea…

If you need example data, there’s some on the F1 site – f1.com – results for Spanish Grand Prix, 2014 and more on Ergast Developer API.

PS FWIW, my entry to the competition is here: #f1datajunkie challenge 1 entry. It’s perhaps a little off-brief, but I’ve been meaning to do this sort of summary for some time, and this was a good starting point. If I get a chance, I’ll have a go at getting the parsers to work properly properly!

Lazyweb Request – Node-RED & F1 Timing Data

A lazyweb request, because I’m rushing for a boat, going to be away from reliable network connections for getting on for a week, and would like to be able to play from a running start when I get back next week…

In context of the Tata/F1 timing data competition, I’d like to be able to have a play with the data in Node-RED. A feed-based, flow/pipes like environment, Node-RED’s been on my “should play with” list for some time, and this provides a good opportunity.

The data as provided looks something like:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

as a text file. (In the wild, it would be a real time data feed over http or https.)

What I’d like as a crib to work from is a Node-RED demo that has:

1) a file reader that reads the data in from the data file and plays it in as a stream in “real time” according to the timestamps, given a dummy start time;

2) an example of handling state – eg keeping track of drivernumber. (The row is effectively race position. Looking at column 2 (driverNumber), we can see what position a driver is in. Keep track of (row,driverNumber) pairs and if a driver changes position, flag it along with what the previous position was);

3) an example of appending the result to a flat file – for example, building up a list of statements “Driver number x has moved from position M to position N” over time.
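This isn’t a Node-RED flow, but as a rough statement of the logic I’m after, here is a Python sketch of the three steps. The transactions are made-up values in the shape of the feed, the output filename is hypothetical, and I’m assuming that identifier 101 with column 2 carries the driver number, as it appears to in the sample data.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Made-up transactions in the shape of the feed (not real race data)
feed = [
    '<transaction identifier="101" messagecount="1" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>',
    '<transaction identifier="101" messagecount="2" timestamp="14:57:12.181"><data column="2" row="4" colour="WHITE" value="14"/></transaction>',
    '<transaction identifier="101" messagecount="3" timestamp="14:57:13.431"><data column="2" row="4" colour="WHITE" value="77"/></transaction>',
]

def seconds(ts):
    """Convert an HH:MM:SS.mmm timestamp into seconds since midnight."""
    t = datetime.strptime(ts, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def replay_delays(lines):
    """Step 1: yield (delay, element), where delay is the gap in seconds since the previous message."""
    previous = None
    for line in lines:
        el = ET.fromstring(line)
        now = seconds(el.attrib["timestamp"])
        yield (0.0 if previous is None else now - previous), el
        previous = now

def position_changes(lines):
    """Step 2: track the last row seen for each driver number and report any changes."""
    position = {}  # driverNumber -> last seen row
    statements = []
    for delay, el in replay_delays(lines):
        # A time.sleep(delay) here would play the feed back in "real time"
        cell = el[0]
        if el.attrib["identifier"] != "101" or cell.attrib.get("column") != "2":
            continue
        driver, row = cell.attrib["value"], int(cell.attrib["row"])
        old = position.get(driver)
        if old is not None and old != row:
            statements.append(
                f"Driver number {driver} has moved from position {old} to position {row}")
        position[driver] = row
    return statements

statements = position_changes(feed)

# Step 3: append to a flat file (hypothetical filename), left commented out here
# with open("position_changes.txt", "a") as f:
#     f.write("\n".join(statements) + "\n")
```

The state is just a dictionary keyed on driver number; in Node-RED I’d guess the equivalent would live in a function node’s context, but I haven’t tried that yet.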

Shouldn’t be that hard, right? And it would provide a good starting point for other people to be able to have a play without hassling over how to do the input/output bits?

F1 Doing the Data Visualisation Competition Thing With Tata?

Sort of via @jottevanger, it seems that Tata Communications announces the first challenge in the F1® Connectivity Innovation Prize to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post Tata launches F1® Connectivity Innovation Prize dated “10 Jun 2014”? What’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Innovation Connectivity Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:

– Position
– Car number
– Driver’s name
– Fastest lap time
– Gap to the leader’s fastest lap time
– Sector 1 time for the current lap
– Sector 2 time for the current lap
– Sector 3 time for the current lap
– Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…

[Image: example timing screen, from The F1 Connectivity Innovation Prize – Challenge 1 Brief (PDF)]

A technical manual is also provided to help make sense of the data files.

[Image: Basic Timing Data Protocol Overview (PDF), page 1 of 15]

Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

...
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>
...

Then qualifying:

...
<transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction>
<transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction>
<transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction>
<transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction>
<transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction>
<transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction>
...

Then the race:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

We can parse the datafiles using python using an approach something like the following:

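from lxml import etree

#Path to one of the sample data files
xml_doc='data/F1 Practice.txt'

#Each line of the file is a separate XML transaction element
pl=[]
for xml in open(xml_doc, 'r'):
    pl.append(etree.fromstring(xml))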

pl[100].attrib
#{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'}

pl[100][0].attrib
#{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}

A few things are worth mentioning about this format… Firstly, the identifier is an identifier of the message type, rather than the message: each transaction message appears instead to be uniquely identified by the messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.
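To make that structure concrete, here is a small sketch that flattens one transaction into a single record. It uses the standard library’s xml.etree.ElementTree rather than lxml, purely so it runs without extra installs, and the CellUpdate name is my own invention.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# One record per transaction: message level attributes plus the single cell it updates
CellUpdate = namedtuple(
    "CellUpdate", "identifier messagecount timestamp row column colour value")

def parse_transaction(line):
    """Flatten a <transaction><data/></transaction> line into a CellUpdate."""
    tx = ET.fromstring(line)
    cell = tx[0]  # the single <data> child
    return CellUpdate(
        identifier=tx.attrib["identifier"],         # message type / screen view
        messagecount=int(tx.attrib["messagecount"]),
        timestamp=tx.attrib["timestamp"],
        row=int(cell.attrib["row"]),
        column=int(cell.attrib["column"]),
        colour=cell.attrib["colour"],
        value=cell.attrib["value"],
    )

# A line taken from the race sample above
line = '<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>'
update = parse_transaction(line)
```

Note that this assumes every data element carries row, column, colour and value attributes, which won’t hold for things like session state messages.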

Secondly, within a session, several screen views are possible – essentially associated with data labelled with a particular identifier. This means the data feed is essentially powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.

It’s not obvious how we can even make a row based transaction update, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above, for timestamp="14:57:11.681" and row="3". If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the pandas DataFrame .align() method.
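As a simpler stand-in for the dataframe alignment idea, the grouping itself can be sketched with a plain dictionary keyed on (timestamp, row). This is not the pandas approach, just a quick way of seeing which cells travel together, using lines from the race sample above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Lines taken from the race sample above
race_lines = [
    '<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>',
    '<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>',
    '<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>',
    '<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>',
    '<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>',
    '<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>',
]

# (timestamp, row) -> {column: value}
bundles = defaultdict(dict)
for line in race_lines:
    tx = ET.fromstring(line)
    cell = tx[0]
    key = (tx.attrib["timestamp"], int(cell.attrib["row"]))
    bundles[key][int(cell.attrib["column"])] = cell.attrib["value"]
```

Five of the six updates land in one bundle, but the column 9 update, stamped 5 milliseconds later, ends up in a bundle of its own, which is exactly the near-miss problem with exact timestamp matching.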

[I think I’m thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now… Should have looked more closely the first time… Doh!]
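A first stab at the cursor idea might look something like the following. It assumes, without my having checked it against the full data files, that updates to a row always arrive in increasing column order, so a column number that fails to increase marks the start of a new set of changes. The first six updates are the row 3 values from the race sample; the seventh is made up to show a bundle being closed.

```python
def bundle_by_cursor(updates):
    """Group (row, column, value) updates into per-row bundles.

    A bundle for a row is closed when the next update to that row
    does not move the column cursor forwards.
    """
    cursor = {}    # row -> last column updated
    current = {}   # row -> bundle currently being built
    finished = []  # completed (row, {column: value}) bundles
    for row, column, value in updates:
        if row in cursor and column <= cursor[row]:
            finished.append((row, current.pop(row)))
        current.setdefault(row, {})[column] = value
        cursor[row] = column
    # Flush whatever is still open at the end of the stream
    finished.extend(sorted(current.items()))
    return finished

updates = [
    (3, 2, "77"), (3, 3, "V. BOTTAS"), (3, 4, "17.7"),
    (3, 5, "14.6"), (3, 6, "1:33.201"), (3, 9, "35.4"),
    (3, 4, "17.9"),  # made-up later update: column goes backwards, so a new bundle starts
]
finished = bundle_by_cursor(updates)
```

Unlike grouping on exact timestamps, this keeps the column 9 update with the rest of its row even though its timestamp differs by a few milliseconds.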

Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect that they are likely to be so, which means it would be safer to align, or merge, on the basis of the timestamp and the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth that all changes to a row are likely to occur in, we could perhaps round times for the basis of the join?

Even with a bundling, we still don’t have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point of time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:

import datetime
import pandas as pd
from lxml import etree

#Hacky load and parse of each row in the datafile
pl=[]
for xml in open('data/F1 Practice.txt', 'r'):
    pl.append(etree.fromstring(xml))

#Dataframe for current state timing screen
df_practice_pos=pd.DataFrame(columns=[
    "timestamp", "time",
    "classpos",  "classpos_colour",
    "racingNumber","racingNumber_colour",
    "name","name_colour",
],index=range(50))

#Column mappings
practiceMap={
    '1':'classpos',
    '2':'racingNumber',
    '3':'name',
    '4':'laptime',
    '5':'gap',
    '6':'sector1',
    '7':'sector2',
    '8':'sector3',
    '9':'laps',
    '21':'sector1_best',
    '22':'sector2_best',
    '23':'sector3_best'
}

def parse_practice(p,df_practice_pos):
    if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib:
        if p[0].attrib['column'] not in ['10','21','22','23']:
            colname=practiceMap[p[0].attrib['column']]
            row=int(p[0].attrib['row'])-1
            #Use .loc[row,column]: .ix has been removed from pandas, and chained assignment may not update the dataframe
            df_practice_pos.loc[row,'timestamp']=p.attrib['timestamp']
            tt=p.attrib['timestamp'].replace('.',':').split(':')
            df_practice_pos.loc[row,'time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000)
            df_practice_pos.loc[row,colname]=p[0].attrib['value']
            df_practice_pos.loc[row,colname+'_colour']=p[0].attrib['colour']
    return df_practice_pos

for p in pl[:2850]:
    df_practice_pos=parse_practice(p,df_practice_pos)
df_practice_pos

(See the notebook.)

Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:
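The data gathering step for that sort of chart can be sketched as follows. Again I’m using the standard library XML parser, and assuming that identifier 101 with column 2 is the driver number cell; the two lines are taken from the race sample above.

```python
import xml.etree.ElementTree as ET

def driver_positions(lines, driver_number):
    """Return (timestamp, row) pairs for each sighting of driver_number in column 2."""
    points = []
    for line in lines:
        tx = ET.fromstring(line)
        cell = tx[0]
        if (tx.attrib["identifier"] == "101"
                and cell.attrib.get("column") == "2"
                and cell.attrib.get("value") == driver_number):
            points.append((tx.attrib["timestamp"], int(cell.attrib["row"])))
    return points

# The first line is a clock update (identifier 103), which should be ignored
sample = [
    '<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>',
    '<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>',
]
points = driver_positions(sample, "77")
```

Run over a whole race file, the resulting pairs give the x (time) and y (classification position) values for a line chart.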

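[Chart: race classification position against time for a particular driver number, from the racedemo notebook]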

A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out.
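The half second grouping can be sketched without MongoDB at all. The following is not the notebook code, just a plain Python illustration of the idea: an update joins the open group for its row if it falls within the window of that group’s first timestamp, otherwise it starts a new group. The window measurement from the first message is my own assumption. The values are taken from the race sample above.

```python
from datetime import datetime

def to_seconds(ts):
    """Convert an HH:MM:SS.mmm timestamp into seconds since midnight."""
    t = datetime.strptime(ts, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def group_by_window(updates, window=0.5):
    """Group (timestamp, row, column, value) updates, given in feed order, by row and time window."""
    groups = []      # each group: {"row", "timestamp", "start", "cells"}
    open_group = {}  # row -> index of the group currently open for that row
    for ts, row, column, value in updates:
        t = to_seconds(ts)
        idx = open_group.get(row)
        if idx is None or t - groups[idx]["start"] > window:
            groups.append({"row": row, "timestamp": ts, "start": t, "cells": {}})
            idx = open_group[row] = len(groups) - 1
        groups[idx]["cells"][column] = value
    return groups

updates = [
    ("14:57:10.878", 1, 23, "31.6"),
    ("14:57:11.681", 3, 2, "77"),
    ("14:57:11.681", 3, 3, "V. BOTTAS"),
    ("14:57:11.681", 3, 4, "17.7"),
    ("14:57:11.681", 3, 5, "14.6"),
    ("14:57:11.681", 3, 6, "1:33.201"),
    ("14:57:11.686", 3, 9, "35.4"),
]
groups = group_by_window(updates)
```

Setting cells into the group dictionary plays the same role as repeatedly applying $set against a document keyed on timestamp and row, except that the key is now the window rather than the exact timestamp.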

I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)

PS The R book has stalled so badly I’ve pushed what I’ve got so far to wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… When I get a couple of clear hours, I’ll try to push the stuff that’s there out onto leanpub as a preview…

Reshaping Your Data – Pivot Tables in Google Spreadsheets

One of the slides I whizzed by in my presentation at the OU Statistics conference on “Visualisation and Presentation in Statistics” (tweet-commentary notes/links from other presentations here and here) relates to what we might describe as the shape that data takes.

An easy way of thinking about the shape of a dataset is to consider a simple 2D data table with columns describing the properties of an object and rows typically corresponding to individual things. Often a regular structure, each cell in the table may take on a valid value. Occasionally, some cells may be blank, in which case we can start to think of the shape of the data getting a little bit ragged.

If you are working with a data table, then on occasion you may want to swap the rows for columns (certain data operations require data to be organised in a particular way). By swapping the rows and columns, we change the shape of the data (for example, going from a table with N rows and M columns to one with M rows and N columns). So that’s one way of reshaping your data.
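As a tiny worked example of that row/column swap, here is a sketch in Python using zip. The table is a made-up fragment with a header row and two data rows.

```python
# A table with N=3 rows and M=2 columns (header row included)
table = [
    ["lap", "laptime"],
    [1, 106.951],
    [2, 99.264],
]

# zip(*table) pairs up the first cells of every row, then the second cells, and so on
transposed = [list(column) for column in zip(*table)]

n_rows, n_cols = len(table), len(table[0])
t_rows, t_cols = len(transposed), len(transposed[0])
```

The transposed table has 2 rows and 3 columns, with each original column now reading across as a row.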

Many visualisation tools require data to be in a particular shape in order for the data to be visualised appropriately. Take a look, for example, at the template pages on Number Picture, a new site hosting templated visualisations built using Processing that allow you to cut, paste and visualise data – if it is appropriately shaped – at a click.

But where do pivot tables come in? One way is to think of them as a tool for reshaping your data by providing summary reports of your original data set.

Here’s how the Goog describes them:

What pivot tables allow you to do is generate reports based on contents of a table using the values contained within one or more columns to define the columns and rows of a summary table. That is, you can re-present (or re-shape) a table as a new table that summarises data contained in the original table in terms of a rearrangement of the cell values of the original table.

Here’s a quick example. I have a data set that identifies the laptimes of drivers in an F1 race (yes, I know… again!;-) by stint, where different stints are groups of consecutive laps separated by pit stops.

If you look down the stint column you can see how its value groups together blocks of rows. But how do I easily show how much time each driver (column C) spent on each stint? The time the driver spent on each stint is the sum of laptimes by driver within a stint, so for each driver I need to find out the laps associated with each stint, and then sum them. Pivot tables let me do that. Here’s how:

So how does this work? The columns in the new table are defined according to the unique values found in the stint column of the original table. The rows in the new table are defined according to the unique values found in the car column of the original table. The cell values in the new table for a given row and column are defined as the sum of lapTime data values from rows in the original table where the car and stint values in the row correspond to the row and column values corresponding to each cell in the new table. Got that?

If you’re familiar with databases, you might think of the column and row settings in the new table defining “keys” into rows on the original table. The different car/stint pairs identify different blocks of rows that are processed per block to create the contents of the pivot table.

As well as summing the values from one column based on the values contained in two other columns, the pivot table can do other operations, such as counting the number of rows in the original table containing each “key” value. So for example, if we want to count the number of laps a car was out for by stint, we can do that simply by changing our pivot table Values setting.
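If it helps to see the mechanics spelled out, here is a rough Python sketch of what the pivot table is doing. The lap data is made up, the pivot function is my own illustration rather than anything Google Spreadsheets exposes, and the result is keyed on (car, stint) pairs rather than laid out as a grid.

```python
from collections import defaultdict

# Made-up laps: one row per car lap, as in the original table
laps = [
    {"car": 1, "stint": 1, "lapTime": 100.0},
    {"car": 1, "stint": 1, "lapTime": 98.5},
    {"car": 1, "stint": 2, "lapTime": 97.0},
    {"car": 2, "stint": 1, "lapTime": 101.0},
]

def pivot(rows, row_key, col_key, value_key, aggregate):
    """Collect values into blocks keyed by (row value, column value), then aggregate each block."""
    blocks = defaultdict(list)
    for r in rows:
        blocks[(r[row_key], r[col_key])].append(r[value_key])
    return {key: aggregate(values) for key, values in blocks.items()}

# Values setting: sum of lapTime, giving time spent on each stint
stint_time = pivot(laps, "car", "stint", "lapTime", sum)

# Values setting: count, giving number of laps in each stint
stint_laps = pivot(laps, "car", "stint", "lapTime", len)
```

Swapping sum for len is the code equivalent of changing the Values setting from a sum to a count.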

Pivot tables can take a bit of time to get your head round… I find playing with them helps… A key thing to remember is: if you want to express a dataset in terms of the unique values contained within a column, the pivot table can help you do that. In the above example, I was essentially generating the row and column values for a new table based on categorical data (driver/car number and stint number). Another example might be sales data where the same categories of item appear across multiple rows and you want to generate reports based on category.

Plotting Tabular (CSV) Data and Algebraic Expressions On the Same Graph Using Gnuplot

A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is authentic, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modeling issues); b) if I get anything wrong whilst I’m playing, it doesn’t matter… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging 9k fees on courses with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC”, (all you need to know…) on visual analysis as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, remove a value y = sin(x) from the tabular data) would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty, that is, the amount of time that the weight of the fuel in the car slows the car down by… For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory \User\tony\f1\data, I would launch my terminal, enter cd \User\tony\f1\data or the equivalent to move to that directory, and then start gnuplot there using the command gnuplot:

gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","

For some reason, my version of Gnuplot (on a Mac) wouldn’t display any graphs till I set the output to use x11… The set xrange [1:58] command sets the range of the x-axis (there are 58 laps in the race, hence those settings). The xlabel and ylabel settings are hopefully self-explanatory (they define axis labels). The set datafile separator "," command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns. (I assume that if you pass in a line like “this, that”, the other then the quoted “this, that” string is detected as a single column/cell value, and not as two columns with cell values “this” and “that”? I forget…)

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
...
25,55,94.979,94.574,1.97
25,56,95.083,94.759,2.155
20,1,103.531,98.914,6.959
20,2,97.370,92.834,0.879
...

That is, there is one row per driver lap. Each driver’s data is on consecutive lines, in increasing lap number, so driver 25 is on lines 1 to 56, driver 20’s data starts on line 57 and so on…

To plot from a data file, we use the command plot 'turlapTimeFuel.csv' (that is, plot 'filename'). To pull data from column 3, we add the subcommand using 3 (x then counts incrementally, so we get a plot of column 3 against increasing row number):
gnuplot> plot 'turlapTimeFuel.csv' using 3

To plot just a range of rows (e.g. rows 0 to 57; the header row is ignored) against row number, we can combine the every subcommand with a subcommand of the form using y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form using x:y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3

(The first column is column 1; I think the first row is row 0…)

So for example, we can plot Laptime against driver using plot 'turlapTimeFuel.csv' using 1:3, or Fuel Adjusted Laptime against driver using plot 'turlapTimeFuel.csv' using 1:4. The command plot 'turlapTimeFuel.csv' using 2:3 gives us a plot of each driver’s laptime against lap number.

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. every::12::23 for rows 12 to 23), but plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot are on). Instead we can filter out the data according to driver number (the column 1 values):
gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines

How do we read this? The command is actually of the form using x:y, but we do a bit of work to choose a valid value of x. ($1==4 ? $2:1/0):3 says “if the value of column 1 ($1) equals 4, then (?) select column 2, otherwise/else (the first “:”), forget it” (1/0 is one divided by zero, a number intensely disliked by gnuplot that says in this context, do nothing more with this row…). If the value of column 1 does equal 4, then we create a valid statement using 2:3, otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.
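If the ternary filter still looks a bit cryptic, the same row selection can be written out longhand in Python. This is only an analogy to what gnuplot is doing, not how gnuplot implements it; the rows are taken from the data file listing above, and I filter on driver 20 because driver 4 does not appear in that excerpt.

```python
import csv
import io

# An excerpt of the data file shown above
csv_text = """Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
20,1,103.531,98.914,6.959
20,2,97.370,92.834,0.879
"""

def driver_xy(text, driver, x_col=2, y_col=3):
    """Return (x, y) points for one driver; columns are numbered from 1, as in gnuplot."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    points = []
    for row in reader:
        if int(row[0]) == driver:  # the $1==driver test
            points.append((int(row[x_col - 1]), float(row[y_col - 1])))
        # any other row is simply skipped, which is the job 1/0 does in gnuplot
    return points

points = driver_xy(csv_text, 20)
```

The equivalent gnuplot fragment would be using ($1==20 ? $2:1/0):3.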

Rather than plot points, the with lines command will join consecutive points using a line, to produce a line chart:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines

We can add an informative label using the title subcommand:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"

We can also plot two drivers’ times on the same chart using different lines:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "

We can also plot functions. In the following case, I plot the time penalty applied to a car for each lap on the basis of how much more fuel it is carrying at the start of the race compared to the end:
gnuplot> plot 90+(58-x)*2.7*0.03

We can now overlay the drivers’ times and the fuel penalty on the same chart:
gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03

It’s also possible to do sums on the data. If you read $4 as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots cells in the laptime column 3 modified by the fuel penalty. (I also plot the pre-calculated fuel and fastest lap adjusted laptime data from column 5 as a comparison. The only difference in values is the offset…)
gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines


Doh! I should have changed the y-axis label…:-(

That is, plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines says: for rows applying to driver 1 (that is, rows where the value in column 1 ($1) equals (==) the driver number (1), hence $1==1), use the value in column 2 ($2) as the x-value, and for the y-value use the value in column 3 ($3) minus (88+(58-$2)*2.7*0.03). Note that the (58-$2) fragment subtracts the lap number (as contained in the column 2 ($2) cell) from the lap count to work out how many more laps worth of fuel the car is carrying in the current lap than at the end of the race.
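To check the arithmetic, here is the same calculation as a small Python sketch, applied to the two driver 25 rows shown in the data file listing. The 2.7 and 0.03 factors are copied from the gnuplot expression; I haven’t said above what they stand for, so the parameter names here are deliberately neutral, and the function names are my own.

```python
RACE_LAPS = 58

def fuel_penalty(lap, laps=RACE_LAPS, factor_a=2.7, factor_b=0.03):
    """Time penalty for the extra fuel carried on this lap compared to the end of the race.

    factor_a and factor_b are the 2.7 and 0.03 from the gnuplot expression.
    """
    return (laps - lap) * factor_a * factor_b

def plotted_value(laptime, lap, offset=88):
    """The y-value gnuplot computes for $3-(88+(58-$2)*2.7*0.03)."""
    return laptime - (offset + fuel_penalty(lap))

# (lap, laptime, fuel adjusted laptime) for driver 25, from the listing above
rows = [(1, 106.951, 102.334), (2, 99.264, 94.728)]

# Laptime minus the penalty alone should match the pre-calculated column 4
recalculated = [round(laptime - fuel_penalty(lap), 3) for lap, laptime, _ in rows]
```

For these two rows the recalculated values agree with the Fuel Adjusted Laptime column, so the expression plotted by gnuplot is just that column shifted down by the constant offset of 88.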

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History

F1 Data Junkie, the Blog…

To try to bring a bit of focus back to this blog, I’ve started a new blog – F1 Data Junkie: http://f1datajunkie.blogspot.com (aka http://bit.ly/F1DataJunkie) – that will act as the home for my “procedural” F1 Data postings. I’ll still post the occasional thing here – for example, reviewing the howto behind some of the novel visualisations I’m working on (such as practice/qualification session utilisation charts, and race battle maps), but charts relating to particular races, will, in the main, go onto the new blog….

I’m hoping by the end of the season to have an automated route for generating different sorts of visual reviews of practice, qualification and race sessions based on both official timing data, and (hopefully) the McLaren telemetry data. (If anyone has managed to scrape and decode the Mercedes F1 live telemetry data and is willing to share it with me, that would be much appreciated:-)

I also hope to use the spirit of F1 to innovate like crazy on the visualisations as and when I get the chance; I think that there’s a lot of mileage still to come in innovative sports graphics/data visualisations*, not only for the stats geek fan, but also for sports journalists looking to uncover stories from the data that they may have missed during an event. And with a backlog of data going back years for many sports, there’s also the opportunity to revisit previous events and reinterpret them… Over this weekend, I’ve been hacking around a few old scripts to automate the production of various data formatters, as well as working on a couple of my very own visualisation types:-) So if you want to see what I’ve been up to, you should probably pop over to F1 Data Junkie, the blog… ;-)

*A lot of innovation is happening in live sports graphics for TV overlays, such as the Piero system developed by the BBC, or the HawkEye ball tracking system (the company behind it has just been bought by Sony, so I wonder if we’ll see the tech migrate into TVs, start to play a role in capturing data that crosses over into gaming (e.g. Play Along With the Real World), or feed commercial data augmentations from Sony to viewers via widgets on Sony TVs…)

There’ve also been recent innovations in sports graphics in the press and online. For example, seeing this interactive football chalkboard on the Guardian website, which lets you pull up, in a graphical way, stats reviews of recent and historical games, or this Daily Telegraph interactive that provides a Hawk-Eye analysis of the Ashes (is there an equivalent for the Masters golf anywhere, I wonder, or Wimbledon tennis? See also Cricket visualisation tool), I wonder why there aren’t any interactive graphical viewers over recent and historical F1 data…. (or maybe there are? If you know of any – or know of any interesting visualisations around motorsport in general and F1 in particular, please let me know in the comments…:-)

First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between Anscombe’s quartet, but the differences are obvious to the eye:

Anscombe's quartet /via Wikipedia

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)

I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then entering return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))
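If you are wondering what the box plot is actually calculating once the slow laps have been filtered out, the following Python sketch (not R, and not what R-Studio runs internally) does the same two steps by hand: drop laps at or above a cutoff, then work out a five number summary (minimum, lower hinge, median, upper hinge, maximum) for each driver. The hinge rule used here is Tukey's, which I believe is what R's boxplot draws the box from, but treat that as an assumption; the driver codes and lap times are made up.

```python
from collections import defaultdict
from statistics import median


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max) for a list."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # each half includes the median when n is odd
    return (
        data[0],
        median(data[:half]),
        median(data),
        median(data[n - half:]),
        data[-1],
    )


def summarise_by_driver(laps, max_time):
    """Group (driver, lap_time) pairs by driver, keeping laps below max_time."""
    grouped = defaultdict(list)
    for driver, lap_time in laps:
        if lap_time < max_time:  # same test as subset(..., Lap.Time < 110)
            grouped[driver].append(lap_time)
    return {driver: five_number_summary(times) for driver, times in grouped.items()}


# Hypothetical lap times in seconds
laps = [("HAM", 100), ("HAM", 102), ("HAM", 125), ("HAM", 104),
        ("VET", 101), ("VET", 103), ("VET", 130)]
print(summarise_by_driver(laps, max_time=110))
```

The whiskers and outlier points on a box plot are then derived from these five numbers, which is why stripping out a handful of very slow laps changes the picture so much.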

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)

Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
manipulate(
plot(cars, xlim=c(0,x.max)),
x.max=slider(15,25))

Hmm… can we set the filter value dynamically I wonder?

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
maxval=slider(100,140))

Seems like it…?:-) We can also combine interactive controls:

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval), outline=outline),
maxval=slider(100,140),
outline=checkbox(FALSE, "Show outliers"))

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plot visualisations done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group’s one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you “see” statistical patterns; most people know very little statistics; that shouldn’t stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome… Err….?!]

Thoughts on a Couple of Possible Lap Charting Apps

I seem to have posted a lot of F1 related items recently (there seem to have been a lot of Bank Holidays and weekends lately – and F1 diversions feed into those; normal service will be resumed shortly….) and here’s another one, in part inspired by Joe Saward’s post on Lap Charts (and as discussed in the most recent Sidepodcast Aside With Joe), but also harking back to something I thought about at the BTCC/Brands Hatch race last year, and in mind because I hope to get to Thruxton tomorrow…

The problem? Capturing lap chart information like this:

Joe Saward, lap chart, Shanghai
Image used, without permission, from: Joe Saward, A great race in Shanghai

I’ve been experimenting with various ways of displaying this data, such as “augmenting” traditional lap charts with additional colour and size dimensions:

(In the above case, node size is proportional to time to car in front (or denotes a pit stop); colour is related to time to car behind (red is hot – car behind is close), or choice of tyres in a pit stop. Laps count across the screen, colours are ascending race position. A bright red dot with a large dot above it shows two cars close together. A trace for Car 18 on the grid (Webber) is shown throughout the race.)

Now I know there is probably a way of grabbing this data, for F1 at least, from something like the BBC live timing feed, or from live timing feeds for other races from TSL/Timing Solutions Limited or MST Systems, but I’m thinking more generally for cases where live timing isn’t available…

So here are a couple of possible ideas for apps to support the collection of lap chart data: a tablet (e.g. iPad) app, and a mobile (e.g. iPhone or Android phone) app.

First up, the tablet app:

The idea here is that you can click on the car number as the car goes past and build up a live view of the lap chart. The last car clicked is highlighted and can be annotated. It may also be worth having a setting so that after a car has been selected it is greyed out for 5 seconds less than the fastest expected lap time. (Except maybe for pit option, where possibly capture car going into and out of the pit?)
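The "greyed out" setting is really just a per-car lockout timer. Here is a rough Python sketch of that logic, ignoring the user interface (and the pit lane exception) entirely; the class name, method names and the 90 second fastest lap are all invented for illustration.

```python
class LapChartRecorder:
    """Record car passes, ignoring clicks on a car that is still greyed out."""

    def __init__(self, fastest_lap, margin=5.0):
        # A car stays greyed out for `margin` seconds less than the fastest
        # expected lap time, as suggested in the text.
        self.lockout = fastest_lap - margin
        self.last_click = {}  # car number -> time of last accepted click
        self.passes = []      # accepted (timestamp, car) pairs, in order

    def click(self, car, timestamp):
        """Return True if the click was accepted, False if the car is locked."""
        last = self.last_click.get(car)
        if last is not None and timestamp - last < self.lockout:
            return False
        self.last_click[car] = timestamp
        self.passes.append((timestamp, car))
        return True


recorder = LapChartRecorder(fastest_lap=90.0)
print(recorder.click(3, 0.0))   # first pass of car 3
print(recorder.click(3, 10.0))  # accidental second tap, still greyed out
print(recorder.click(3, 86.0))  # next lap, lockout has expired
```

The accepted passes, in order, are exactly the lap chart: the nth time a car appears in the list is its nth lap.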

Here’s a sketch for a mobile app:

As before, you click on the car number in the left hand column as the car goes past. To simplify matters, the car numbers in the left hand column are in an ordered list (by track position? Initial state is grid position). After a few seconds, the car clicked on disappears from the top of the list and is added to the bottom of the list, the idea being that the top of the list shows cars you expect to come past next. As with the tablet app, the last car clicked is highlighted and can be annotated using tags from the right hand column.
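The rotating list is easy to sketch. In the Python fragment below (function name and car numbers are hypothetical, and the "after a few seconds" delay is left out) a clicked car is simply moved from wherever it is to the bottom of the expected order.

```python
def record_pass(expected_order, car):
    """Return a new expected order with `car` moved to the bottom of the list."""
    if car not in expected_order:
        raise ValueError(f"car {car} is not in the list")
    remaining = [c for c in expected_order if c != car]
    return remaining + [car]


# Initial state is grid order (made-up car numbers)
order = [1, 3, 2, 4]
order = record_pass(order, 1)  # leader comes past
order = record_pass(order, 2)  # car 2 has overtaken car 3
print(order)
```

Because a car can be clicked from anywhere in the list, overtakes take care of themselves: the list always ends up in the order the cars last came past.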

On a final note, if the positions are being added in real time, the app can also collect rough timing information. That means we can then also start to produce crude gap charts that show the time/distance between cars. Something like this, maybe?

In this case, the chart shows gaps between cars, per lap (increasing lap number up the screen, car positions left to right). Gaps indicate a pitted or lapped car. From the lap chart data, and crude timing, we could automatically generate this sort of view.
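As a sketch of how the crude gap data might be derived, the Python below takes a list of timestamped clicks, counts laps per car, and works out the gap to the previous car across the line on the same lap. The function name and sample timings are invented, and a lapped car will simply show up with a very large gap on its own lap count, which is a simplification.

```python
from collections import defaultdict


def gaps_by_lap(passes):
    """Turn (timestamp, car) clicks into per-lap gaps to the car ahead.

    Returns {lap_number: [(car, gap_in_seconds), ...]} with the first car
    on each lap given a gap of 0.0.
    """
    laps_done = defaultdict(int)
    crossings = defaultdict(list)
    for timestamp, car in sorted(passes):
        laps_done[car] += 1  # the nth click on a car is its nth lap
        crossings[laps_done[car]].append((timestamp, car))

    result = {}
    for lap, lap_crossings in crossings.items():
        row = []
        previous = None
        for timestamp, car in lap_crossings:
            gap = 0.0 if previous is None else timestamp - previous
            row.append((car, gap))
            previous = timestamp
        result[lap] = row
    return result


# Hypothetical clicks: (seconds since start, car number)
clicks = [(0.0, 1), (1.5, 2), (4.0, 3), (90.0, 1), (92.0, 2), (95.0, 3)]
print(gaps_by_lap(clicks))
```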

PS I probably won’t get round to making either of these apps, (at least, not in the immediate future…) but if anyone would like to take them on, I’d be happy to test them and chip in ideas:-)

Visualising Sports Championship Data Using Treemaps – F1 Driver & Team Standings

I *love* treemaps. If you’re not familiar with them, they provide a very powerful way of visualising categorically organised hierarchical data that bottoms out with a quantitative, numerical dimension in a single view.

For example, consider the total population of students on the degrees offered across UK HE by HESA subject code. As well as the subject level, we might also categorise the data according to the number of students in each year of study (first year, second year, third year).

If we were to tabulate this data, we might have columns: institution, HESA subject code, no. of first year students, no. of second year students, no. of third year students. We could also restructure the table so that the data was presented in the form: institution, HESA subject code, year of study, number of students. And then we could visualise it in a treemap… (which I may do one day… but not now; if you beat me to it, please post a link in the comments;-)
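The restructuring step, from one column per year of study to one row per year of study, can be sketched in a few lines of Python. The institution names, subject codes and student numbers below are invented placeholders, not real HESA data.

```python
def wide_to_long(rows):
    """Convert (institution, subject, first, second, third) rows into
    (institution, subject, year_of_study, number_of_students) rows."""
    year_labels = ["first year", "second year", "third year"]
    long_rows = []
    for institution, subject, *counts in rows:
        for year, count in zip(year_labels, counts):
            long_rows.append((institution, subject, year, count))
    return long_rows


# Placeholder data only
wide = [
    ("Uni A", "F3", 120, 100, 90),
    ("Uni B", "F3", 60, 55, 50),
]
for row in wide_to_long(wide):
    print(row)
```

The long form is the one treemap tools tend to want: every column except the last is a level in the hierarchy, and the last column is the number that sets the size of the box.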

Instead, what I will show is how to visualise data from a sports championship, in particular the start of the Formula One 2011 season. This championship has the same entrants in each race, each a member of one of a fixed number of teams. Points are awarded for each race (that is, each round of the championship) and totalled across rounds to give the current standing. As well as the driver championship (based on points won by individual drivers), there is the team championship (where the points contribution from drivers within a team is totalled).

Here’s what the results from the third round (China) look like:

Driver Team Points
Lewis Hamilton McLaren-Mercedes 25
Sebastian Vettel RBR-Renault 18
Mark Webber RBR-Renault 15
Jenson Button McLaren-Mercedes 12
Nico Rosberg Mercedes 10
Felipe Massa Ferrari 8
Fernando Alonso Ferrari 6
Michael Schumacher Mercedes 4
Vitaly Petrov Renault 2
Kamui Kobayashi Sauber-Ferrari 1
Paul di Resta Force India-Mercedes 0
Nick Heidfeld Renault 0
Rubens Barrichello Williams-Cosworth 0
Sebastien Buemi STR-Ferrari 0
Adrian Sutil Force India-Mercedes 0
Heikki Kovalainen Lotus-Renault 0
Sergio Perez Sauber-Ferrari 0
Pastor Maldonado Williams-Cosworth 0
Jarno Trulli Lotus-Renault 0
Jerome d’Ambrosio Virgin-Cosworth 0
Timo Glock Virgin-Cosworth 0
Vitantonio Liuzzi HRT-Cosworth 0
Narain Karthikeyan HRT-Cosworth 0
Jaime Alguersuari STR-Ferrari 0

F1 2011 Results – China, © 2011 Formula One World Championship Ltd

We can represent data from across all the races using a table of the form:

Driver Team Points Race
Lewis Hamilton McLaren-Mercedes 25 China
Sebastian Vettel RBR-Renault 18 China
Felipe Massa Ferrari 10 Malaysia
Fernando Alonso Ferrari 8 Malaysia
Kamui Kobayashi Sauber-Ferrari 6 Malaysia
Michael Schumacher Mercedes 0 Australia
Pastor Maldonado Williams-Cosworth 0 Australia
Narain Karthikeyan HRT-Cosworth 0 Australia
Vitantonio Liuzzi HRT-Cosworth 0 Australia

Sample of F1 2011 Results, © 2011 Formula One World Championship Ltd

I’ve put a copy of the data to date at Many Eyes, IBM’s online interactive data visualisation site: F1 2011 Championship Points

Here’s what it looks like when we view it in a treemap visualisation:

The size of the boxes is proportional to the (summed) values within the hierarchical categories. In the above case, the large blocks are the total points awarded to each driver across teams and races. (The team field might be useful if a driver were to change team during the season.)

I’m not certain, but I think the Many Eyes treemap algorithm populates the map using a sorted list of summed numerical values taken through the hierarchical path from left to right, top to bottom. Which means top left is the category with the largest summed points. If this is the case, in the above example we can directly see that Webber is in fourth place overall in the championship. We can also look within each blocked area for more detail: for example, we can see Hamilton didn’t score as many points in Malaysia as he did in the other two races.
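I don't know how Many Eyes does it internally, but the sums behind the boxes can be sketched in Python: for each row, add the points to every level along its path through the hierarchy, then sort the top level by total. The function names are mine, and the four rows are the top four finishers from the China results table above.

```python
from collections import defaultdict


def path_totals(records, levels, value_key="Points"):
    """Sum value_key for every prefix of the hierarchy path given by levels."""
    totals = defaultdict(int)
    for record in records:
        path = tuple(record[level] for level in levels)
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += record[value_key]
    return dict(totals)


def ranked_top_level(totals):
    """Return the top-level categories, largest summed value first."""
    top = [(path[0], total) for path, total in totals.items() if len(path) == 1]
    return sorted(top, key=lambda item: item[1], reverse=True)


records = [
    {"Driver": "Lewis Hamilton", "Team": "McLaren-Mercedes", "Points": 25, "Race": "China"},
    {"Driver": "Sebastian Vettel", "Team": "RBR-Renault", "Points": 18, "Race": "China"},
    {"Driver": "Mark Webber", "Team": "RBR-Renault", "Points": 15, "Race": "China"},
    {"Driver": "Jenson Button", "Team": "McLaren-Mercedes", "Points": 12, "Race": "China"},
]

print(ranked_top_level(path_totals(records, ["Driver", "Team", "Race"])))
print(ranked_top_level(path_totals(records, ["Team", "Driver", "Race"])))
```

Putting "Team" first in the list of levels is all it takes to switch from a driver view to a team view of the same rows.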

One of the nice features about the Many Eyes treemap is that it allows you to reorder the levels of the hierarchy that is being displayed. So for example, with a simple reordering of the labels we can get a view over the team championship too:

The Many Eyes treemap can be embedded in a web page (it’s a Java applet), although I’m not sure what, if any, licensing restrictions apply (I do know that the Guardian datastore blog embeds Many Eyes widgets on that site, though). Other treemap widgets are available (for example, Protovis and JIT both offer javascript enabled treemap displays).

What might be interesting would be to feed Protovis or the JIT with data dynamically from a Google Spreadsheet, for example, so that a single page could be used to display the treemap with the data being maintained in a spreadsheet.

Hmm, I wonder – does Google spreadsheets have a treemap gadget? Ooh – it does: treemap-gviz. It looks as if a bit of wrangling may be required around the data, but if the display works out then just popping the points data into a Google spreadsheet and creating the gadget should give an embeddable treemap display with no code required:-) (It will probably be necessary to format the data hierarchy by hand, though, requiring differently laid out data tables to act as source for individual and team based reports.)

So – how long before we see some “live” treemap displays for F1 results on the F1 blogs then? Or championship tables from other sports? Or is the treemap too confusing as a display for the uninitiated? (I personally don’t think so.. but then, I love macroscopic views over datasets:-)

PS see also More Olympics Medal Table Visualisations which includes a demonstration of a treemap visualisation over Olympic medal standings.

A First Attempt at Looking at F1 Timing Data in Google Motion Charts (aka “Gapminder”)

Having managed to get F1 timing data through my cobbled together F1 timing data Scraperwiki, it becomes much easier to try out different visualisation approaches that can be used to review the stories that sometimes get hidden in the heat of the race (that data journalism trick of using visualisation as an analytic tool for story discovery, for example).

Whilst I was on holiday, reading a chapter in Beautiful Visualization on Gapminder/Trendalyzer/Google Motion Charts (it seems the animations may be effective when narrated, as when Hans Rosling performs with them, but for the uninitiated, they can simply be confusing…), it struck me that I should be able to view some of the timing data in the motion chart…

So here’s a first attempt (going against the previously identified “works best with narration” bit of best practice;-) – F1 timing data (China 2011) in Google Motion Charts, the video:


Visualising the China 2011 F1 Grand Prix in Google Motion Charts

If you want to play with the chart itself, you can find it here: F1 timing data (China 2011) Google Motion Chart.

The (useful) dimensions are:

  • lap – the lap number;
  • pos – the car/racing number of each driver;
  • trackPos – the position in the race (the racing position);
  • currTrackPos – the position on the track (so if a lapped car is between the leader and second place car, their respective currTrackPos values are 1, 2, 3);
  • pitHistory – the number of pit stops to date

The timeToLead, timeToFront and timeToBack measures give the time (in seconds) between each car and the leader, the time to the car in the racing position ahead, and the time to the car in racing position behind (these last two datasets are incomplete at the moment… I still need to calculate these missing datapoints…). The elapsedTime is the elapsed racetime for each car at the end of each measured lap.
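For what it's worth, here is one way the three gap measures could be derived from the elapsed times at the end of a given lap. This is a Python sketch, not the code behind the chart. It assumes every car in the input has completed the same lap (so lapped cars would need handling separately), and the car numbers and times are invented.

```python
def gap_measures(elapsed_by_car):
    """Compute timeToLead, timeToFront and timeToBack for one lap.

    elapsed_by_car maps car number -> elapsed race time (seconds) at the end
    of the lap. Cars with no car ahead or behind get None for that measure.
    """
    ordered = sorted(elapsed_by_car.items(), key=lambda item: item[1])
    leader_time = ordered[0][1]
    measures = {}
    for i, (car, elapsed) in enumerate(ordered):
        to_front = elapsed - ordered[i - 1][1] if i > 0 else None
        to_back = ordered[i + 1][1] - elapsed if i < len(ordered) - 1 else None
        measures[car] = {
            "timeToLead": elapsed - leader_time,
            "timeToFront": to_front,
            "timeToBack": to_back,
        }
    return measures


# Hypothetical elapsed times at the end of one lap
print(gap_measures({3: 100.0, 1: 101.5, 2: 104.0}))
```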

The time starts at 1900 because of a quirk in Google Motion Charts – they only work properly for times measured in years, months and days (or years and quarters) for 1900 onwards. (You can use years less than 1900 but at 1899 bad things might happen!) This means that I can’t simply use the elapsed time as the timebase. So until such a time as the chart supports date:time or :time as well as date: stamps, my fix is simply to use an integer timecount (the elapsed time in seconds) + 1900.
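The fix amounts to a one line mapping each way. Here it is as a small Python sketch; the function names are mine, and truncating to whole seconds is an assumption about how the integer timecount is produced.

```python
BASE_YEAR = 1900  # earliest "year" the motion chart is said to handle reliably


def to_chart_year(elapsed_seconds):
    """Map an elapsed race time in seconds onto a pseudo-year for the chart."""
    return BASE_YEAR + int(elapsed_seconds)  # truncates to whole seconds


def from_chart_year(chart_year):
    """Recover the whole-second elapsed time from a pseudo-year."""
    return chart_year - BASE_YEAR


print(to_chart_year(95.7))
print(from_chart_year(1995))
```

One consequence of truncating is that two cars crossing the line within the same second end up with the same "year", so very close gaps are flattened on the chart's time slider.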