## Posts Tagged ‘f1data’

## Reshaping Your Data – Pivot Tables in Google Spreadsheets

One of the slides I whizzed by in my presentation at the OU Statistics conference on “Visualisation and Presentation in Statistics” (tweet-commentary notes/links from other presentations here and here) relates to what we might describe as the *shape* that data takes.

An easy way of thinking about the *shape* of a dataset is to consider a simple 2D data table with columns describing the properties of an object and rows typically corresponding to individual things. Often a regular structure, each cell in the table may take on a valid value. Occasionally, some cells may be blank, in which case we can start to think of the shape of the data getting a little bit ragged.

If you are working with a data table, then on occasion you may want to swap the rows for columns (certain data operations require data to be organised in a particular way). By swapping the rows and columns, we change the *shape* of the data (for example, going from a table with N rows and M columns to one with M rows and N columns). So that’s one way of reshaping your data.

Many visualisation tools require data to be in a particular *shape* in order for the data to be visualised appropriately. Take, for example, the template pages on Number Picture, a new site hosting templated visualisations built using Processing that allow you to cut, paste and visualise data – *if it is appropriately shaped* – at a click.

But where do pivot tables come in? *One way is to think of them as a tool for reshaping your data by providing summary reports of your original data set.*

Here’s how the Goog describes them:

What pivot tables allow you to do is generate reports based on contents of a table using the *values* contained within one or more columns to define the columns and rows of a summary table. That is, you can re-present (or re-shape) a table as a new table that summarises data contained in the original table in terms of a rearrangement of the cell values of the original table.

Here’s a quick example. I have a data set that identifies the laptimes of drivers in an F1 race (yes, I know… again!;-) by *stint*, where different stints are groups of consecutive laps separated by pit stops.

If you look down the *stint* column you can see how its value groups together blocks of rows. But how do I easily show how much time each driver (column C) spent on each stint? The time the driver spent on each stint is the sum of laptimes by driver within a stint, so for each driver I need to find out the laps associated with each stint, and then sum them. Pivot tables let me do that. Here’s how:

So how does this work? The *columns* in the new table are defined according to the unique values found in the *stint* column of the original table. The *rows* in the new table are defined according to the unique values found in the *car* column of the original table. The cell value in the new table for a given row and column is the sum of the lapTime values from those rows in the original table whose *car* and *stint* values match that row and column. Got that?

If you’re familiar with databases, you might think of the column and row settings in the new table defining “keys” into rows on the original table. The different car/stint pairs identify different blocks of rows that are processed *per block* to create the contents of the pivot table.

As well as summing the values from one column based on the values contained in two other columns, the pivot table can do other operations, such as counting the number of rows in the original table containing each “key” value. So for example, if we want to count the number of laps a car was out for by stint, we can do that simply by changing our pivot table *Values* setting.

Pivot tables can take a bit of time to get your head round… I find playing with them helps… A key thing to remember is: if you want to express a dataset in terms of the unique values contained within a column, the pivot table can help you do that. In the above example, I was essentially generating the row and column values for a new table based on categorical data (driver/car number and stint number). Another example might be sales data where the same categories of item appear across multiple rows and you want to generate reports based on category.
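If you’d rather script this sort of reshaping, the same summary can be sketched using the `pivot_table` function in Python’s pandas library (the car numbers and lap times below are invented for illustration):

```python
import pandas as pd

# Hypothetical laptime data: one row per (car, stint, lap)
laps = pd.DataFrame({
    "car":     [1, 1, 1, 2, 2, 2],
    "stint":   [1, 1, 2, 1, 2, 2],
    "lapTime": [95.0, 96.0, 94.0, 97.0, 95.5, 96.5],
})

# Total stint time per car: cars as rows, stints as columns, summed lap times
stint_time = laps.pivot_table(index="car", columns="stint",
                              values="lapTime", aggfunc="sum")

# Laps per stint: same reshaping, but counting rows instead of summing them
stint_laps = laps.pivot_table(index="car", columns="stint",
                              values="lapTime", aggfunc="count")
```

Swapping `aggfunc="sum"` for `aggfunc="count"` mirrors changing the *Values* setting in the spreadsheet pivot table.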

## Plotting Tabular (CSV) Data and Algebraic Expressions On the Same Graph Using Gnuplot

A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is *authentic*, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modeling issues); b) if I get anything wrong whilst I’m playing, *it doesn’t matter*… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging 9k fees on courses with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC”, (all you need to know…) on *visual analysis* as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, removing a value *y = sin(x)* from the tabular data) would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty – the amount of time that the weight of the fuel in the car slows the car down by. For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory */Users/tony/f1/data*, I would launch my terminal, enter `cd /Users/tony/f1/data` or the equivalent to move to that directory, and then start gnuplot there using the command `gnuplot`:

```
gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","
```

For some reason, my version of Gnuplot (on a Mac) wouldn’t display any graphs till I set the output to use x11… The `set xrange [1:58]` command sets the range of the x-axis (there are 58 laps in the race, hence those settings). The `xlabel` and `ylabel` settings are hopefully self-explanatory (they define the axis labels). The `set datafile separator ","` command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns (I assume if you pass in something like *this, “this, that”, the other*, the *“this, that”* string is detected as a single column/cell value, and *not* as two columns with cell values *“this* and *that”*? I forget…)

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

```
Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
...
25,55,94.979,94.574,1.97
25,56,95.083,94.759,2.155
20,1,103.531,98.914,6.959
20,2,97.370,92.834,0.879
...
```

That is, there is one row per driver lap. Each driver’s data is on a consecutive line, in increasing lap number, so driver 25 is on lines 1 to 56, driver 20’s data starts on line 57, and so on…

To plot from a data file, we use the command `plot 'turlapTimeFuel.csv'` (that is, *plot ‘filename’*). To pull data from a particular column – for example, column 3 – we use the subcommand `using 3` (the x value counts incrementally, so we get a plot of column 3 against increasing row number):

`gnuplot> plot 'turlapTimeFuel.csv' using 3`

To plot just a range of rows (e.g. rows 0 to 57; the header row is ignored) against row number, we can use the `every` subcommand, of the form *every::firstrow::lastrow*:

`gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3`

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form *using x:y*:

`gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3`

(The first column is column 1; I think the first row is row 0…)

So for example, we can plot Laptime against driver using *plot ‘turlapTimeFuel.csv’ using 1:3*, or Fuel Adjusted Laptime against driver using *plot ‘turlapTimeFuel.csv’ using 1:4*. The command *plot ‘turlapTimeFuel.csv’ using 2:3* gives us a plot of each driver’s laptime against lap number.

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. *every::12::23* for rows 12 to 23), but plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot is on). Instead, we can filter the data according to driver number (the column 1 values):

`gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines`

How do we read this? The command is actually of the form *using x:y*, but we do a bit of work to choose a valid value of x. *($1==4 ? $2:1/0)* says: “if the value of column 1 (*$1*) equals 4, then (?) select column 2, otherwise/else (the “:”), forget it” (1/0 is one divided by zero, an undefined value that tells gnuplot, in this context, to do nothing more with this row). If the value of column 1 does equal 4, then we create a valid statement *using 2:3*; otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.

Rather than plot points, the *with lines* command will join consecutive points using a line, to produce a line chart:

`plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines`

We can add an informative label using the *title* subcommand:

`plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"`

We can also plot two drivers’ times on the same chart using different lines:

`plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)", 'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "`

We can also plot functions. In the following case, I plot the time penalty applied to a car for each lap on the basis of how much more fuel it is carrying at the start of the race compared to the end:

`gnuplot> plot 90+(58-x)*2.7*0.03`
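As a sanity check on that expression, here’s the same fuel penalty model written as a small Python function (the 90s base time, 2.7kg of fuel burnt per lap and 0.03s/kg penalty are the values assumed in the gnuplot command above):

```python
# Fuel weight time penalty for a given lap, as plotted above:
# a base laptime of 90s, plus 0.03s for every kg of fuel still on board,
# assuming the car burns 2.7kg of fuel per lap over a 58 lap race.
def fuel_penalty(lap, base=90.0, race_laps=58, kg_per_lap=2.7, s_per_kg=0.03):
    return base + (race_laps - lap) * kg_per_lap * s_per_kg

# On the final lap there is no surplus fuel, so the penalty is just the base time
print(fuel_penalty(58))  # 90.0
```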

We can now overlay the drivers’ times and the fuel penalty on the same chart:

```
gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03
```

It’s also possible to do sums on the data. If you read *$4* as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots cells in the laptime column (column 3) modified by the fuel penalty. (I also plot the pre-calculated fuel adjusted laptime data from column 5 as a comparison. The only difference in values is the offset…)

```
gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines
```

*Doh! I should have changed the y-axis label…:-(*

That is, *plot ‘turlapTimeFuel.csv’ using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines* says: for rows applying to driver 1 (rows where the value in column 1 (*$1*) equals (==) the driver number, *$1==1*), use the value in column 2 (*$2*) as the x-value, and for the y-value use the value in column 3 (*$3*) minus *88+(58-$2)*2.7*0.03*. Note that the *(58-$2)* fragment subtracts the lap number (the value in the column 2 (*$2*) cell) from the lap count to work out how many more laps’ worth of fuel the car is carrying in the current lap than at the end of the race.

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History

## F1 Data Junkie, the Blog…

To try to bring a bit of focus back to *this* blog, I’ve started a new blog – F1 Data Junkie: *http://f1datajunkie.blogspot.com* (aka *http://bit.ly/F1DataJunkie*) – that will act as the home for my “procedural” F1 data postings. I’ll still post the occasional thing here – for example, reviewing the howto behind some of the novel visualisations I’m working on (such as *practice/qualification session utilisation charts* and *race battle maps*) – but charts relating to particular races will, in the main, go onto the new blog…

I’m hoping by the end of the season to have an automated route to generating different sorts of visual reviews of practice, qualification and race sessions based on both official timing data and (hopefully) the McLaren telemetry data. (If anyone has managed to scrape and decode the Mercedes F1 live telemetry data and is willing to share it with me, that would be much appreciated:-)

I also hope to use the spirit of F1 to innovate like crazy on the visualisations as and when I get the chance; I think that there’s a lot of mileage still to come in innovative sports graphics/data visualisations*, not only for the stats geek fan, but also for sports journalists looking to uncover stories from the data that they may have missed during an event. And with a backlog of data going back years for many sports, there’s also the opportunity to revisit previous events and reinterpret them… Over this weekend, I’ve been hacking around a few old scripts to try to automate the production of various data formatters, as well as working on a couple of my very own visualisation types:-) So if you want to see what I’ve been up to, you should probably pop over to F1 Data Junkie, the blog… ;-)

*A lot of innovation is happening in live sports graphics for TV overlays, such as the Piero system developed by the BBC, or the HawkEye ball tracking system (the company behind it has just been bought by Sony, so I wonder if we’ll see the tech migrate *into* TVs, start to play a role in capturing data that crosses over into gaming (e.g. Play Along With the Real World), or feed commercial data augmentations from Sony to viewers via widgets on Sony TVs…)

There’ve also been recent innovations in sports graphics in the press and online. For example, seeing this interactive football chalkboard on the Guardian website, which lets you pull up, in a graphical way, stats reviews of recent and historical games, or this Daily Telegraph interactive that provides a Hawk-Eye analysis of the Ashes (is there an equivalent for The Masters golf anywhere, I wonder, or Wimbledon tennis? See also Cricket visualisation tool), I wonder why there aren’t any interactive graphical viewers over recent and historical F1 data… (or maybe there are? If you know of any – or know of any interesting visualisations around motorsport in general and F1 in particular – please let me know in the comments…:-)

## First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between Anscombe’s quartet, but the differences are obvious to the eye:

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here’s the command I used:

`boxplot(Lap.Time ~ Driver, data=lapTimeFuel)`

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

`lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)`

I could then plot the distribution in the reduced *lapTime2* dataset by changing the original boxplot command to use `data=lapTime2`. (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then hitting return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

`boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))`
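For comparison, the filter-then-summarise step that the boxplot relies on can be sketched in Python (the lap times below are invented; a boxplot is essentially drawing a five-number summary of each driver’s times):

```python
from statistics import median

# Invented lap times for one driver, including a slow opening lap and a pit lap
lap_times = [108.2, 96.0, 95.5, 96.5, 112.7, 96.0, 97.0]

# Strip out slow laps, as with R's subset(lapTimeFuel, Lap.Time < 110.1)
clean = [t for t in lap_times if t < 110.1]

print(len(clean))     # 6
print(median(clean))  # 96.25
```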

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

`boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)`

Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:

```
manipulate(
  plot(cars, xlim=c(0,x.max)),
  x.max=slider(15,25))
```

Hmm… can we set the filter value dynamically I wonder?

```
manipulate(
  boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
  maxval=slider(100,140))
```

Seems like it…?:-) We can also combine interactive controls:

`manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval),outline=outline),maxval=slider(100,140),outline = checkbox(FALSE, "Show outliers"))`

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plot visualisations done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group's one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you "see" statistical patterns; most people know very little statistics; that shouldn't stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome... Err....?!]

## Thoughts on a Couple of Possible Lap Charting Apps

I seem to have posted a lot of F1 related items recently (there seem to have been a lot of Bank Holidays and weekends lately – and F1 diversions feed into those; normal service will be resumed shortly…) and here’s another one, in part inspired by Joe Saward’s post on Lap Charts (and as discussed in the most recent Sidepodcast Aside With Joe), but also harking back to something I thought about at the BTCC/Brands Hatch race last year, and in mind because I hope to get to Thruxton tomorrow…

The problem? Capturing lap chart information like this:

*Image used, without permission, from: Joe Saward, A great race in Shanghai*

I’ve been experimenting with various ways of displaying this data, such as “augmenting” traditional lap charts with additional colour and size dimensions:

(In the above case, node size is proportional to time to car in front (or denotes a pit stop); colour is related to time to car behind (red is hot – car behind is close), or choice of tyres in a pit stop. Laps count across the screen, colours are ascending race position. A bright red dot with a large dot above it shows two cars close together. A trace for Car 18 on the grid (Webber) is shown throughout the race.)

Now I know there is probably a way of grabbing this data, for F1 at least, from something like the BBC live timing feed, or from live timing feeds for other races from TSL/Timing Solutions Limited or MST Systems, but I’m thinking more generally for cases where live timing isn’t available…

So here are a couple of possible ideas for apps to support the collection of lap chart data: a tablet (e.g. iPad) app, and a mobile (e.g. iPhone or Android phone) app.

First up, the tablet app:

The idea here is that you can click on the car number as the car goes past and build up a live view of the lap chart. The last car clicked is highlighted and can be annotated. It may also be worth having a setting so that after a car has been selected it is greyed out for 5 seconds less than the fastest expected lap time. (Except maybe for pit option, where possibly capture car going into and out of the pit?)

Here’s a sketch for a mobile app:

As before, you click on the car number in the left hand column as the car goes past. To simplify matters, the car numbers in the left hand column are in an ordered list (by track position? The initial state is grid position). After a few seconds, the car clicked on disappears from the top of the list and is added to the bottom of the list, the idea being that the top of the list shows the cars you expect to come past next. As with the tablet app, the last car clicked is highlighted and can be annotated using tags from the right hand column.

On a final note, if the positions are being added in real time, the app can also collect rough timing information. That means we can then also start to produce crude gap charts that show the time/distance between cars. Something like this, maybe?

In this case, the chart shows gaps between cars, per lap (increasing lap number up the screen, car positions left to right). Gaps indicate a pitted or lapped car. From the lap chart data, and crude timing, we could automatically generate this sort of view.

PS I probably won’t get round to making either of these apps, (at least, not in the immediate future…) but if anyone would like to take them on, I’d be happy to test them and chip in ideas:-)

## Visualising Sports Championship Data Using Treemaps – F1 Driver & Team Standings

I *love* treemaps. If you’re not familiar with them, they provide a very powerful way of visualising categorically organised hierarchical data that bottoms out with a quantitative, numerical dimension in a single view.

For example, consider the total population of students on the degrees offered across UK HE by HESA subject code. As well as the subject level, we might also categorise the data according to the number of students in each year of study (first year, second year, third year).

If we were to tabulate this data, we might have columns: *institution, HESA subject code, no. of first year students, no. of second year students, no. of third year students*. We could also restructure the table so that the data was presented in the form: *institution, HESA subject code, year of study, number of students*. And then we could visualise it in a treemap… (which I may do one day… but not now; if you beat me to it, please post a link in the comments;-)

Instead, what I will show is how to visualise data from a sports championship, in particular the start of the Formula One 2011 season. This championship has the same entrants in each race, each a member of one of a fixed number of teams. Points are awarded for each race (that is, each round of the championship) and totalled across rounds to give the current standing. As well as the driver championship (based on points won by individual drivers), there is the team championship (where the points contribution from drivers within a team is totalled).
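The points totalling is just a group-and-sum over the race-by-race results table; here’s a minimal sketch using Python’s pandas library (with an invented two-race fragment of the data):

```python
import pandas as pd

# An invented fragment of the race-by-race points table
results = pd.DataFrame({
    "Driver": ["Hamilton", "Vettel", "Hamilton", "Vettel"],
    "Team":   ["McLaren-Mercedes", "RBR-Renault",
               "McLaren-Mercedes", "RBR-Renault"],
    "Points": [25, 18, 18, 25],
    "Race":   ["China", "China", "Malaysia", "Malaysia"],
})

# Driver championship: total points per driver across races
drivers = results.groupby("Driver")["Points"].sum()

# Team championship: total points per team across races
teams = results.groupby("Team")["Points"].sum()
```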

Here’s what the results from the third round (China) look like:

Driver | Team | Points |
---|---|---|
Lewis Hamilton | McLaren-Mercedes | 25 |
Sebastian Vettel | RBR-Renault | 18 |
Mark Webber | RBR-Renault | 15 |
Jenson Button | McLaren-Mercedes | 12 |
Nico Rosberg | Mercedes | 10 |
Felipe Massa | Ferrari | 8 |
Fernando Alonso | Ferrari | 6 |
Michael Schumacher | Mercedes | 4 |
Vitaly Petrov | Renault | 2 |
Kamui Kobayashi | Sauber-Ferrari | 1 |
Paul di Resta | Force India-Mercedes | 0 |
Nick Heidfeld | Renault | 0 |
Rubens Barrichello | Williams-Cosworth | 0 |
Sebastien Buemi | STR-Ferrari | 0 |
Adrian Sutil | Force India-Mercedes | 0 |
Heikki Kovalainen | Lotus-Renault | 0 |
Sergio Perez | Sauber-Ferrari | 0 |
Pastor Maldonado | Williams-Cosworth | 0 |
Jarno Trulli | Lotus-Renault | 0 |
Jerome d’Ambrosio | Virgin-Cosworth | 0 |
Timo Glock | Virgin-Cosworth | 0 |
Vitantonio Liuzzi | HRT-Cosworth | 0 |
Narain Karthikeyan | HRT-Cosworth | 0 |
Jaime Alguersuari | STR-Ferrari | 0 |

*F1 2011 Results – China, © 2011 Formula One World Championship Ltd*

We can represent data from across all the races using a table of the form:

Driver | Team | Points | Race |
---|---|---|---|
Lewis Hamilton | McLaren-Mercedes | 25 | China |
Sebastian Vettel | RBR-Renault | 18 | China |
… | … | … | … |
Felipe Massa | Ferrari | 10 | Malaysia |
Fernando Alonso | Ferrari | 8 | Malaysia |
Kamui Kobayashi | Sauber-Ferrari | 6 | Malaysia |
… | … | … | … |
Michael Schumacher | Mercedes | 0 | Australia |
Pastor Maldonado | Williams-Cosworth | 0 | Australia |
Narain Karthikeyan | HRT-Cosworth | 0 | Australia |
Vitantonio Liuzzi | HRT-Cosworth | 0 | Australia |

*Sample of F1 2011 Results 2011, © 2011 Formula One World Championship Ltd*

I’ve put a copy of the data to date at Many Eyes, IBM’s online interactive data visualisation site: F1 2011 Championship Points

Here’s what it looks like when we view it in a treemap visualisation:

The size of the boxes is proportional to the (summed) values within the hierarchical categories. In the above case, the large blocks are the total points awarded to each driver across teams and races. (The team field might be useful if a driver were to change team during the season.)

I’m not certain, but I think the Many Eyes treemap algorithm populates the map using a sorted list of summed numerical values taken through the hierarchical path from left to right, top to bottom. Which means top left is the category with the largest summed points. If this is the case, in the above example we can directly see that Webber is in fourth place overall in the championship. We can also look within each blocked area for more detail: for example, we can see Hamilton didn’t score as many points in Malaysia as he did in the other two races.

One of the nice features about the Many Eyes treemap is that it allows you to reorder the levels of the hierarchy that is being displayed. So for example, with a simple reordering of the labels we can get a view over the team championship too:

The Many Eyes treemap can be embedded in a web page (it’s a Java applet), although I’m not sure what, if any, licensing restrictions apply (I do know that the Guardian datastore blog embeds Many Eyes widgets on that site, though). Other treemap widgets are available (for example, Protovis and JIT both offer javascript enabled treemap displays).

What might be interesting would be to feed Protovis or the JIT with data dynamically form a Google Spreadsheet, for example, so that a single page could be used to display the treemap with the data being maintained in a spreadsheet.

Hmm, I wonder – does Google Spreadsheets have a treemap gadget? Ooh – it does: treemap-gviz. It looks as if a bit of wrangling may be required around the data, but if the display works out then just popping the points data into a Google spreadsheet and creating the gadget should give an embeddable treemap display with no code required:-) (It will probably be necessary to format the data hierarchy by hand, though, requiring differently laid out data tables to act as source for individual and team based reports.)

So – how long before we see some “live” treemap displays for F1 results on the F1 blogs then? Or championship tables from other sports? Or is the treemap too confusing as a display for the uninitiated? (I personally don’t think so.. but then, I love macroscopic views over datasets:-)

PS see also More Olympics Medal Table Visualisations which includes a demonstration of a treemap visualisation over Olympic medal standings.

## A First Attempt at Looking at F1 Timing Data in Google Motion Charts (aka “Gapminder”)

Having managed to get F1 timing data through my cobbled-together F1 timing data Scraperwiki, it becomes much easier to try out different visualisation approaches that can be used to review the stories that sometimes get hidden in the heat of the race (that data journalism trick of using visualisation as an analytic tool for story discovery, for example).

Whilst I was on holiday, reading a chapter in Beautiful Visualization on Gapminder/Trendalyser/Google Motion Charts (it seems the animations may be effective when narrated, as when Hans Rosling performs with them, but for the uninitiated, they can simply be confusing…), it struck me that I should be able to view some of the timing data in the motion chart…

So here’s a first attempt (going against the previously identified “works best with narration” bit of best practice;-) – F1 timing data (China 2011) in Google Motion Charts, the video:

*Visualising the China 2011 F1 Grand Prix in Google Motion Charts*

If you want to play with the chart itself, you can find it here: F1 timing data (China 2011) Google Motion Chart.

The (useful) dimensions are:

– *lap* – the lap number;
– *pos* – the car/racing number of each driver;
– *trackPos* – the position in the race (the racing position);
– *currTrackPos* – the position on the track (so if a lapped car is between the leader and second place car, their respective *currTrackPos* values are 1, 2, 3);
– *pitHistory* – the number of pit stops to date.

The *timeToLead*, *timeToFront* and *timeToBack* measures give the time (in seconds) between each car and the leader, the time to the car in the racing position ahead, and the time to the car in racing position behind (these last two datasets are incomplete at the moment… I still need to calculate these missing datapoints…). The *elapsedTime* is the elapsed racetime for each car at the end of each measured lap.
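For what it’s worth, the missing values can be dead-reckoned from the elapsed times, given the cars sorted by racing position at the end of each lap. A minimal sketch (the function name and data layout are my own invention):

```python
def lap_gaps(elapsed_by_pos):
    """elapsed_by_pos: list of (driver, elapsedTime) tuples, sorted by
    racing position at the end of a given lap. Returns, per driver, the
    gap to the leader, to the car ahead, and to the car behind."""
    leader_t = elapsed_by_pos[0][1]
    gaps = {}
    for i, (driver, t) in enumerate(elapsed_by_pos):
        gaps[driver] = {
            "timeToLead": round(t - leader_t, 3),
            # leader has no car ahead; last car has no car behind
            "timeToFront": round(t - elapsed_by_pos[i - 1][1], 3) if i > 0 else 0.0,
            "timeToBack": round(elapsed_by_pos[i + 1][1] - t, 3)
                          if i < len(elapsed_by_pos) - 1 else 0.0,
        }
    return gaps
```

Running this over each lap’s elapsed times in turn would fill in the incomplete columns, modulo the dead-reckoning caveats mentioned elsewhere in these posts.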

The time starts at 1900 because of a quirk in Google Motion Charts: they only work properly for times measured in years, months and days (or years and quarters) from 1900 onwards. (You can use years less than 1900, but bad things may happen!) So until such time as the chart supports date:time or :time stamps as well as date: stamps, my fix is simply to use an integer timecount (the elapsed time in seconds) + 1900 as the timebase.
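The workaround amounts to nothing more than this (the function name is mine; it’s a hack, as noted above):

```python
def motion_chart_year(elapsed_seconds):
    """Shift an elapsed race time (in seconds) into the 'years from 1900
    onwards' timebase that Google Motion Charts will accept, by treating
    each elapsed second as one 'year' added to 1900."""
    return 1900 + int(elapsed_seconds)
```

So a car crossing the line 72.5 seconds into the race gets stamped as "year" 1972, and the chart animates happily over the race.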

## Visualising China 2011 F1 – Timing Charts

Just a quick post (that I could actually have published 20 mins or so ago), showing a couple of graphics generated from my scrape of the 2011 China Formula One Grand Prix timing data (via FIA press releases).

First up, the race to the podium:

Data © 2011 Formula One World Championship Ltd, 6 Princes Gate, London, SW7 1QJ, England

The full lap chart, with pit stops:

Data © 2011 Formula One World Championship Ltd, 6 Princes Gate, London, SW7 1QJ, England

Both the above graphics were generated using data scraped from press releases published on the FIA media centre website. You can find the data in the GDF format I used to generate the images using Gephi here (howto).

PS @bencc has also been on the case, visualising telemetry data from Vodafone McLaren Mercedes. For example, Hamilton’s tour and Button’s tour.

PPS which reminds me – here’s an example of how to use Gephi to visualise telemetry data captured from the McLaren website: Visualising Vodafone Mclaren F1 Telemetry Data in Gephi

## Visualising F1 Timing Sheet Data

Putting together a couple of tricks from recent posts (Visualising Vodafone Mclaren F1 Telemetry Data in Gephi and PDF Data Liberation: Formula One Press Release Timing Sheets), I thought I’d have a little play with the timing sheet data in Gephi…

The representations I have used to date are graph based, with each node corresponding to a particular lap performance by a particular driver, and edges connecting consecutive laps.

**If you want to play along, you’ll need to download Gephi and this data file: F1 timing, Malaysia 2011 (NB it’s not thoroughly checked… glitches may have got through in the scraping process):-(**

The nodes carry the following data, as specified using the GDF format:

– *name VARCHAR*: the ID of each node, given as *driverNumber_lapNumber* (e.g. `12_43`);
– *label VARCHAR*: the name of the driver (e.g. `S. VETTEL`);
– *driverID INT*: the driver number (e.g. `7`);
– *driverNum VARCHAR*: an ID for the driver of the lap (e.g. `driver_12`);
– *team VARCHAR*: the team name (e.g. `Vodafone McLaren Mercedes`);
– *lap INT*: the lap number (e.g. `41`);
– *pos INT*: the position at the end of the lap (e.g. `5`);
– *pitHistory INT*: the number of pitstops to date (e.g. `2`);
– *pitStopThisLap DOUBLE*: the duration of any pitstop this lap, else 0 (e.g. `12.321`);
– *laptime DOUBLE*: the laptime, in seconds (e.g. `72.125`);
– *lapdelta DOUBLE*: the difference between the current laptime and the previous laptime (e.g. `1.327`);
– *elapsedTime DOUBLE*: the summed laptime to date (e.g. `1839.021`);
– *elapsedTimeHun DOUBLE*: the elapsed time divided by a hundred.
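For reference, here’s the rough shape of a GDF file carrying those attributes, with edges linking consecutive laps (the data values are illustrative, not taken from the actual file):

```
nodedef>name VARCHAR,label VARCHAR,driverID INT,driverNum VARCHAR,team VARCHAR,lap INT,pos INT,pitHistory INT,pitStopThisLap DOUBLE,laptime DOUBLE,lapdelta DOUBLE,elapsedTime DOUBLE,elapsedTimeHun DOUBLE
1_41,'S. VETTEL',1,driver_1,'Red Bull Racing',41,1,2,0,72.125,1.327,1839.021,18.39021
edgedef>node1 VARCHAR,node2 VARCHAR
1_40,1_41
```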

Using the geolayout with an equirectangular (presumably this means Cartesian?) layout, we can generate a range of charts simply by selecting suitable co-ordinate dimensions. For example, if we select the laptime as the y (“latitude”) co-ordinate and x (“longitude”) as the lap, filtering out the nodes with a null laptime value, we can generate a graph of the form:

We can then tweak this a little – e.g. colour the nodes by driver (using a Partition-based colouring), colour the edges to match their nodes, resize the nodes to show the number of pit stops to date, and then filter to compare just a couple of drivers:

This sort of lap time comparison is all very well, but it doesn’t necessarily tell us relative track positions. If we size the nodes non-linearly according to position, with a larger size for the numerically “smaller” position (so first place is sized larger than second), we can see whether the relative positions change (in this case, they don’t…).
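One simple way to get that non-linear sizing is a reciprocal mapping, along these lines (purely illustrative; Gephi’s own ranking transformers offer similar options):

```python
def node_size(position, max_size=20.0):
    """Map a race position to a node size so that 1st place renders
    largest: size falls off as 1/position."""
    return max_size / position
```

So first place gets the full size, second place half of it, and so on down the field.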

Another sort of chart we might generate will be familiar to many race fans, with a tweak – simply plot position against lap, colour according to driver, and then size the nodes according to lap time:

Again, filtering is trivial:

If we plot the elapsed time against lap, we get a view of separations (deltas between cars are available in the media centre reports, but I haven’t used this data yet…):

In this example, lap time flows up the graph, elapsed time increases left to right. Nodes are coloured by driver, and sized according to position. If a driver has a higher lap count and lower total elapsed time than a driver on the previous lap, then it has lapped that car… Within a lap, we also see the separation of the various cars. (This difference *should* be the same as the deltas that are available via FIA press releases.)

If we zoom into a lap, we can better see the separation between cars. (Using the data I have, I’m hoping I haven’t introduced any systematic errors arising from essentially dead reckoning the deltas between cars…)

Also note that where lines between two laps cross, we have a change of position between laps.

[ADDED] Here’s another view, plotting elapsed time against itself to see where folk are on the track-as-laptime:

Okay, that’s enough from me for now.. Here’s something far more beautiful from @bencc/Ben Charlton that was built on top of the McLaren data…

First up, a 3D rendering of the lap data:

And then a rather nice lap-by-lap visualisation:

So come on F1 teams – give us some higher resolution data to play with and let’s see what we can *really* do… ;-)

PS I see that Joe Saward is a keen user of Lap charts…. That reminds me of an idea for an app I meant to do for race days that makes grabbing position data as cars complete a lap as simple as clicking…;-) Hmmm….

PPS for another take of visualising the timing data/timing stats, see Keith Collantine/F1Fanatic’s Malaysia summary post.

## Visualising Vodafone Mclaren F1 Telemetry Data in Gephi

Last year, I popped up an occasional series of posts visualising captures of the telemetry data that was being streamed by the Vodafone McLaren F1 team (F1 Data Junkie).

I’m not sure what I’m going to do with the data this year, but being a lazy sort, it struck me that I should be able to visualise the data using Gephi (using in particular the *geo* layout, which lets you specify which node attributes should be used as x and y co-ordinates when placing the nodes).

Taking a race worth of data, and visualising each node as follows (size as throttle value, colour as brake) we get something like this:

(Note that the resolution of the data is 1Hz, which explains the gaps…)

It’s possible to filter the data to show only a lap’s worth:

We could also filter out the data to only show points where the throttle value is above a certain value, or the lateral acceleration (“G-force”) and so on… or a combination of things (points where throttle and brake are applied, for example). I’ll maybe post examples of these using data from this year’s races…. err..?;-)

For now though, here’s a little video tour of Gephi in action on the data:

What I’d like to be able to do is animate this so I could look at each lap in turn (or maybe even animate an onion skin of the “current” point and a couple of previous ones), but that’s a bit beyond me… (for now….?!;-) If *you* know how, maybe we should talk?!:-)

[Thanks to McLaren F1 for streaming this data. Data was captured from the McLaren F1 website in 2010. I believe the speed, throttle and brake data were sponsored by Vodafone.]

PS If McLaren would like to give me some slightly higher resolution data, maybe from an old car on a test circuit, I’ll see what I can do with it… Similarly, any other motor racing teams in any other formula who have data they’d like to share, I’m happy to have a play… I’m hoping to go to a few of the BTCC races this year, so I’d particularly like to hear from anyone from any of those teams, or teams in the supporting races:-) If a Ginetta Junior team is up for it, we might even be able to get an education/outreach thing going into school maths, science, design and engineering clubs…;-)