F1 Data Junkie – Getting Started

Data is a wonderful thing, isn’t it…?! Take the following image, for example:

Hamilton on the brake (red 0%, blue 100%) round Bahrain

It depicts the racetrack used for the Bahrain Formula One Grand Prix last week, and it was generated from two laps’ worth of data collected at one second intervals from Lewis Hamilton’s car.

Long time readers will probably know that over the last couple of years I’ve tried to track down bits of F1 motor racing related data, but to no real avail. Last year, I had the chance to look round the Red Bull factory a couple of times (it’s located just a few minutes’ walk away from the OU campus in Milton Keynes) and chat to a couple of the folk there about possible outreach activities around data, among other things (I’m still hopeful that something may come of that…).

Related to that possibility, we commissioned a rather nice gadget for displaying time series data against a map, in anticipation of getting car data we could visualise.

Anyway, ever impatient, when I saw that the McLaren F1 website included a race-time dashboard, a Dev8D payoff in the form of a quick Twitter conversation with @bencc resulted in him grabbing the last 30 or so laps’ worth of that data… :-)

I’ve had a quick play with it to see what sorts of thing might be possible (such as the map above), and there’s a whole bundle of stories that I think the data can turn up. I intend to explore as many of these stories as I can over the next few months, hopefully aided, abetted and corrected by a colleague from my department who is far better at physics than me… because the data contains a whole raft of physics related stories (and remember, folks: physics is fun).

If this a) sounds like a turn-off, but b) you claim to be interested in things like data journalism, please try to stick with the posts in this series (who knows – it may even turn into an uncourse;-) On the other hand, if you’re an F1 junkie (i.e. you follow @sidepodcast and/or listen to the Sidepodcast podcast;-) I’ll tag the posts f1data, so you can visit the tag page or grab the feed if you want to, and not have to expose yourself to the rest of the ramblings that appear on this blog…

There’ll be no magic involved, though the results may turn out to be magical if you’re into geeky F1 stuff…;-) But to try and widen the appeal, I’ll explore what stories the data holds about what’s happening to the car and driver, reflect a bit on what we can learn about extracting stories from data, and look to unpick the additional knowledge we might need to bring to the data in order to extract the most meaning from it; you can then see whether these lessons apply to data – and stories – that you are interested in!

PS the F1 Fanatic blog is also doing a series on visualising and making sense of F1 related data, as this post on Bahrain Grand Prix FP2 analysis demonstrates. (See my own attempts at doing similar things with timing related data from last year: Visualising Lap Time Data – Australian Grand Prix, 2009). [UPDATE: As pointed out in the comments – and how could I have forgotten this?! (doh!) – there is also the F1 Numbers blog.]

Depending on how my F1 data related posts go, I may even try to hook up with the F1 Fanatic blog and/or the Sidepodcast folk to see if we can work together on ways of presenting this data stuff in a way that your everyday F1 fan might appreciate; just like T151 Digital Worlds helps your everyday computer game player appreciate just what’s involved in the design, development, business and culture of computer gaming:-) <- that’s a shameless plug if Christine or Mr C are reading…:-)

F1 Data Junkie – Looking at What’s There

…and by looking, I mean looking at what’s there in terms of raw data structure, format and content, rather than looking at what stories a visual data analysis reveals, I’m afraid… (that’ll come later:-) But if you do need something visual to inspire you, here’s a tease of what the data from a single lap of Hamilton’s race day tour of the Bahrain 2010 F1 Grand Prix looks like:

Hamilton - single lap data from bahrain

In his book Visualizing Data: Exploring and Explaining Data with the Processing Environment, Ben Fry describes a seven stage process for understanding data:

Acquire
Obtain the data, whether from a file on disk or a source over a network.
Parse
Provide some structure for the data’s meaning, and order it into categories.
Filter
Remove all but the data of interest.
Mine
Apply methods from statistics or data mining as a way to discern patterns or place the data in a mathematical context.
Represent
Choose a basic visual model, such as a bar graph, list or tree.
Refine
Improve the basic representation to make it clearer and more visually engaging.
Interact
Add methods for manipulating the data or controlling what features are visible.

I’m not sure I’d agree with the elements above defining a rigid linear process:


(Produced using Graphviz; dot file)

I prefer to think of the way I work as something like this:


(Produced using Graphviz; dot file)

but what is clear is that we need to understand what data is available to us.

As I mentioned in F1 Data Junkie – Getting Started, the data I’m going to be playing with (at least at first) is data grabbed from the McLaren F1 Live Dashboard (as developed by Work Club), so let’s have a look at it… (I’ll come back to how the data was acquired in a later post.)

The data – which I think is updated on the server at a rate of once per second (maybe someone from Work Club could confirm that?) – is published as JSON (Javascript Object Notation) data wrapped in a callback function (so it uses what is referred to as the JSON-P (“JSON with Padding”) convention). What this means is that the dashboard web page can call the server and get some data back in such a way that, as soon as the data is loaded into the page, a Javascript function can run using the data that has just been downloaded.

Here’s what the data as grabbed from the server looks like:

Dashboard.jsonCallback('{
 "drivers":{
  "HAM":{
   "code":"HAM",
   "telemetry" {
    "timestamp":"15:40:58.383",
    "nEngine":13920,
    "NGear":2,
    "rThrottlePedal":68,
    "pBrakeF":2,
    "gLat":1,
    "gLong":1,
    "sLap":3116,
    "vCar":96.9,
    "NGPSLatitude":26.03185,
    "NGPSLongitude":50.51304
   },
   "additional":{
    "lap":"17",
    "position":"5",
    "is_racing":"1"
   }
  },
  "BUT":{"code":"BUT","telemetry":{"timestamp":"15:40:58.790","nEngine":13106,"NGear":3,"rThrottlePedal":49,"pBrakeF":2,"gLat":2,"gLong":0,"sLap":2681,"vCar":140.2,"NGPSLatitude":26.0333,"NGPSLongitude":50.51635},"additional":{"lap":"17","position":"8","is_racing":"1"}}},
  "commentary":[
   {
    "name":"COM",
    "initials":"CM",
    "text":"Lewis sets the fastest lap with a 2\'00:447 that time around.",
    "timestamp":"2010-03-14 12:40:14"
   }
  ]
 }
');

So what data is there? Well, there are telemetry data fields for each driver:

  • timestamp – the time of day;
  • nEngine – the number of revs the engine is doing;
  • NGear – the gear the car is in (over the range 1 to 7);
  • rThrottlePedal – the amount of throttle depression(?) (as a percentage);
  • pBrakeF – the amount of brake depression(?) (as a percentage)
  • gLat – the lateral “g-force” (that is, the side-to-side g-force that you feel in a car when going round a corner too quickly);
  • gLong – the longitudinal “g-force” (that is, the forwards and backwards force you feel in a car that accelerates or decelerates quickly when going in a straight line);
  • sLap – the distance round the lap (in metres); this resets to zero on each tour, presumably at the start/finish line;
  • vCar – the speed of the car (km/h);
  • NGPSLatitude – the GPS identified latitude of the car;
  • NGPSLongitude – the GPS identified longitude of the car.

On some samples, there is also commentary information, but I’m going to largely ignore that…
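Just to make the structure concrete, here’s how those fields fall out of the parsed JSON in Python. (The packet below is a cut down version of the sample shown above, with the callback padding already stripped off – it’s only a sketch to show the shape of the thing.)

import json

# A cut-down packet (JSON-P padding already removed) based on the sample above
payload = '''{"drivers": {"HAM": {
 "telemetry": {"timestamp": "15:40:58.383", "NGear": 2, "vCar": 96.9,
  "sLap": 3116, "NGPSLatitude": 26.03185, "NGPSLongitude": 50.51304},
 "additional": {"lap": "17", "position": "5", "is_racing": "1"}}}}'''

sample = json.loads(payload)
for code, driver in sample["drivers"].items():
    t = driver["telemetry"]
    print(code, t["timestamp"], "-", t["vCar"], "km/h in gear", t["NGear"],
          "at", t["sLap"], "m round the lap")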

Parsing

The data I got hold of was a bundle of files containing data in the JSONP format like that shown above, with one file containing one package of data created once a second.

In order to parse the data, I needed to decide what format I wanted it in for processing. The format I chose was CSV – comma-separated value data – that looks like this:

The first column is the original filename; the other columns correspond to data downloaded from the McLaren site.

In order to generate the CSV data, I wrote a Python script that would:
– strip off the padding around the JSON data;
– parse the JSON using a standard Python JSON parsing library;
– add a line (placed before the parsing step) to strip out escape characters that weren’t handled correctly by the parser;
– use a CSV library to write out the data in the CSV format.

(See an example Python script here.)
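The linked script is the one I actually used; purely as a sketch of the general shape of the thing (the file locations, output name and column order here are invented for the example), it goes something like this:

import csv, glob, json

FIELDS = ["filename", "timestamp", "nEngine", "NGear", "rThrottlePedal", "pBrakeF",
          "gLat", "gLong", "sLap", "vCar", "NGPSLatitude", "NGPSLongitude"]

def parse_packet(fn):
    # Strip the Dashboard.jsonCallback('...'); padding and awkward escapes, then parse
    raw = open(fn).read()
    payload = raw[raw.find("('") + 2 : raw.rfind("')")]
    payload = payload.replace("\\'", "'")
    return json.loads(payload)

with open("ham_bahrain.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for fn in sorted(glob.glob("data/*.js")):      # one grabbed packet per file, in time order
        t = parse_packet(fn)["drivers"]["HAM"]["telemetry"]
        writer.writerow(dict(t, filename=fn))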

I then refined my parsing script so that it would generate one CSV file per lap. To do this, the script had to:
– detect when the lap distance in one sample was less than the distance in the previous sample (i.e. the lap distance measure had been reset to zero between the two samples), using a construction of the form if oldDist > d['sLap']:, where oldDist is set to d['sLap'] once the corresponding data record has been written to the CSV file;
– if a new lap has been started, close the old CSV file, create a new one, write the column header information into the top of the file, and then start adding the data to that file (there’s a code sketch of this loop below).
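In sketch form (reusing parse_packet and FIELDS from the previous fragment, with made up file names again), the lap-splitting loop looks something like this:

lap, oldDist, out, writer = 1, -1, None, None
for fn in sorted(glob.glob("data/*.js")):
    t = parse_packet(fn)["drivers"]["HAM"]["telemetry"]
    if oldDist > t["sLap"]:                        # sLap has reset, so a new lap has started
        out.close()
        lap, out, writer = lap + 1, None, None
    if writer is None:                             # open a fresh per-lap file as required
        out = open("ham_lap_%02d.csv" % lap, "w", newline="")
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
    writer.writerow(dict(t, filename=fn))
    oldDist = t["sLap"]
if out:
    out.close()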

Having got the data into a CSV format, I could then load it into an environment where I could start to think about visualising it – a spreadsheet, for example, or Processing (which is what I used to create the single lap view shown at the start of this post).

But to see how to do that on the one hand, and what stories we can find in the data on the other, we’ll need to move on to another post…

[Reflection on this post: to get a large number of folk interested, I really need to do less of the geeky techie stuff, and more of the “what does the data say” cool viz stuff… but if I don’t log the bootstrap techie stuff now before I overcomplicate it(?!) my record of the simplest file handling and parsing code will get lost…;-) In the book version, a large part of this post would be a technical appendix… But the “what data fields are available” bit would be in Chapter 2 (after a fluffy Chapter 1!).]

PS some of the technical details behind the McLaren site have started appearing on the personal blog of one of the developers – e.g. Building McLaren.com – Part 3: Reading Telemetry. In that post it was pointed out that I haven’t been adding copyright notices about the data to these posts – which I’ll happily do once I know who to acknowledge, and how… In the meantime, it appears that “the speed, throttle and brake are sponsored by Vodafone” and “McLaren are providing this data for you to view”, so I should link to them: thanks McLaren :-)

F1 Data Junkie – What Does This Data Point Refer To Again?

The countdown is on to my first post unpicking some of the telemetry data grabbed from the McLaren F1 site during the Bahrain Grand Prix, and then maybe this weekend’s race, but first, here’s another tease…

One of the problems I’ve found from a data-based (groan…) storytelling perspective is relating what the data’s telling us to what we know the car is doing from where it is on the track. As I/we refine our data analysis skills we’ll be able just to look at the data and work out what the likely features of the track are at the point the data was collected; but as novice data engineers, we need all the cribs we can get. Which is why I had a little play with my Processing code and built an interactive data explorer that looks something like this:

The idea is that I can easily select a data trace, or a location on the track, and get a snapshot of the data collected at that point in the context of the other data points. That is, this data navigator allows me to expose the data collected in a single sample, in the context of the position of the car on the track, and given the state of the other data values at the same point in time, as well as immediately before and immediately after.

I’ll post a version of this data explorer somewhere when I post the first data analysis post proper, but for now, you’ll just have to make do with the video…;-)

PS As to where the data came from, that story is described here: F1 Data Junkie – Looking at What’s There

More Ways of Looking At the McLaren F1 Telemetry Data

Okay – I know I said that the next post in this series would start looking at the stories the McLaren F1 telemetry data is telling us, but I’m away this weekend so I thought I’d schedule another eye-candy post…

So here you go – a couple of ways of looking at the data in Google Earth, by popping the data into a Google spreadsheet, grabbing the CSV data out and pushing it through a Yahoo Pipe, which helpfully generates a KML file for us. Firstly, a simple tour with speed labels on the markers:

Hamiltonian tour of Bahrain

Then it struck me – if I do a couple of minor tweaks to the KML, I can produce some coloured markers. So for example, here we have a view where the marker colour represents the gear Hamilton’s car is in during a race-day single lap of the 2010 Bahrain Grand Prix circuit:

Bahrain gear change map
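(For anyone who’d rather skip the spreadsheet/Pipes step, the same sort of coloured-marker KML can be knocked up directly in Python. This is just a sketch – the gear-to-colour mapping and the per-lap CSV filename are made up for the example:)

import csv

# KML colours are aabbggrr hex strings; these are arbitrary picks for gears 1-7
gear_colours = {1: "ff0000ff", 2: "ff0080ff", 3: "ff00ffff", 4: "ff00ff00",
                5: "ffffff00", 6: "ffff8000", 7: "ffff0000"}

placemarks = []
for row in csv.DictReader(open("ham_lap_17.csv")):
    colour = gear_colours.get(int(row["NGear"]), "ffffffff")
    placemarks.append(
        "<Placemark><Style><IconStyle><color>%s</color>"
        "<Icon><href>http://maps.google.com/mapfiles/kml/shapes/placemark_circle.png</href></Icon>"
        "</IconStyle></Style><Point><coordinates>%s,%s,0</coordinates></Point></Placemark>"
        % (colour, row["NGPSLongitude"], row["NGPSLatitude"]))

with open("gears.kml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
            + "".join(placemarks) + "</Document></kml>")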

The tilted view of the first image is far more appealing, don’t you think?

(What I really want to do is a heat map, but I think that’ll take a couple of hours the first time round, which I just don’t have at the moment…)

F1 Data Junkie – Visualising the Zone

Another scheduled eye candy tease of a post… this time, visualising the braking zone ( > 5% brake force) over several tours of the Bahrain circuit:

These markers really need colouring into bins (e.g. 5-10%, 10-20%, 20-40%, 40-60%, 60-80%, >80%) or treating via a heat map to show the brake going on/coming off, if we can assume that Hamilton is doing pretty much the same thing each lap… (What we’re essentially trying to do is create a fine degree of resolution in space by taking samples from repeated visits to the same stretch of track over multiple laps.)
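In code, that sort of binning might be no more than something like this (the bin edges just follow the list above):

def brake_bin(pBrakeF):
    # Map a 0-100% brake value to a coarse bin label; None means "not really braking"
    for upper, label in [(5, None), (10, "5-10%"), (20, "10-20%"),
                         (40, "20-40%"), (60, "40-60%"), (80, "60-80%")]:
        if pBrakeF <= upper:
            return label
    return ">80%"

Each bin could then be given its own marker colour or KML style, much like the gear map in the previous post.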

By way of comparison, here’s where Hamilton is full on with the throttle (throttle at 100%):

Gulp… remember when the Hamster tried to do that?

Sheesh… so when you hear him talking about the braking forces, here’s what it looks like in the 4G (longitudinal, very heavy braking) and above bin:

Quick Viz – Australian Grand Prix

I seem to have no free time to do anything this week, or next, or the week after, so this is just another placeholder – a couple of quick views over some of Hamilton’s telemetry from the Australian Grand Prix.

First, a quick Google Earth view to check that the geodata looks okay – the number labels on the pins show the gear the car is in:

Next, a quick look over the telemetry video (hopefully I’ll have managed to animate this properly for the next race…)

And finally, a Google map shows the locations where the pBrakeF (brake pedal force?) is greater than 10%.

Oh to have some time to play with this more fully…;-)

PS for alternative views over the data, check out my other F1 telemetry data visualisations.

F1 Data Junkie – Driver DNA

Although I missed the live race for the second time in a row, and didn’t get a chance to play with the data as quickly as I would have liked to, I did spend some of my time away wondering how to plot all the telemetry data for a driver captured during a race in a single graphic.

The single lap view, like this one from one of Button’s laps at the 2010 Malaysian Grand Prix:

is all very well, but if we overlay traces from each lap onto the distance-labelled x-axis, the charts just become too messy to read.

So how about this instead? On the x-axis, we have the distance travelled round the track per lap. The drivers are pretty consistent in the lines they take, so the overall distance is pretty consistent. If we have a 4km track and a chart that’s 400 pixels wide, each pixel corresponds to 10m of track distance. For the y-axis, we use the lap number. And to plot the actual value of a telemetry measurement, let’s use colour. Put these together, and we come up with some driver DNA charts – the ones below are from Hamilton:

Hamilton's Malaysian DNA

So how do you read these? Each strip is a different measure. The colour intensity increases with increasing value up to the maximum recorded value. Within each strip, time flows down the strip.

The top, blue strip shows the gear (1 to 7); the green strip shows the throttle pedal depression (0-100%), and the red strip shows the brake (0-100%). The light blue strip is a composite of the previous three strips. The whiter the pixel, the closer it is to 100% throttle in 7th gear with no braking.

The bottom two traces show the longitudinal and lateral g-force respectively. For the longitudinal trace, red shows braking – being forced into the steering wheel; green shows acceleration – being forced back into your seat. You’ll see the greatest g-force under braking occurs when the brakes are slapped full on… (the red bits in the third and fifth traces line up). For the lateral g-force, the red shows the driver being flung to the left (i.e. a right hand corner), the green shows them being pushed out to the right.
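For anyone who fancies reproducing the idea, here’s a rough matplotlib sketch of a single strip. The per-lap CSV filenames, the 10m bin width and the circuit length are all just assumptions along the lines described above (the charts in this post were actually drawn in Processing):

import csv, glob
import numpy as np
import matplotlib.pyplot as plt

BIN = 10                     # metres of track per pixel column
TRACK = 5540                 # assumed circuit length in metres; set per circuit

laps = sorted(glob.glob("ham_lap_*.csv"))
dna = np.zeros((len(laps), TRACK // BIN))
for y, fn in enumerate(laps):
    for row in csv.DictReader(open(fn)):
        x = min(int(float(row["sLap"]) // BIN), dna.shape[1] - 1)
        dna[y, x] = float(row["rThrottlePedal"])   # one telemetry measure per strip
# NB with samples only once a second, most bins stay empty; widen BIN or
# interpolate between samples to get a solid-looking strip

plt.imshow(dna, aspect="auto", cmap="Greens", interpolation="nearest")
plt.xlabel("distance round lap (x%sm)" % BIN)
plt.ylabel("lap")
plt.title("Throttle 'DNA' (sketch)")
plt.show()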

I’m slowly pulling enough tools together to be able to start telling some stories… so stay tuned ;-)

F1 Data Junkie – McLaren Driver Comparison Snapshots from the 2010 Chinese Grand Prix

Another Formula One Grand Prix race weekend, another chance to tinker with some F1 visualisations, this time from China. Not much new this week – it’s been more a case of me making a start on tidying up my scripts, but I have started trying to think about driver comparisons.

So for example, on the driver DNA charts, we see a difference in gear change behaviour between Button and Hamilton about 40% of the way round the track:

We can see this a little more clearly on a geographical projection of the gear change data (the tracks are offset from each other to aid visualisation):

We can also see some differences in rThrottlePedal behaviour:

Okay, that’s enough for just now – back to coding up some KML views for Google Earth ;-)


Data is grabbed from the McLaren F1 Live Dashboard during the race and is Copyright (©) McLaren Marketing Ltd 2010 (I think? Or possibly Vodafone McLaren Mercedes F1 2010(?)). I believe that speed, throttle and brake data are sponsored by Vodafone.


As ever, thanks to @bencc for grabbing the data. For more posts in this series, see: OUseful f1data.

F1 Data Junkie – Lap Elevation Data

Fans of cycling will be more than familiar with the idea of a profile map that details the elevation above sea level along a particular race stage.

Cycling map (Tour de France stage) profile/elevation map

(See also More Thoughts on Data Driven Storytelling for several more ideas on representing geospatial time series augmented by other sensor data.)

Although F1 circuit maps are provided by the FIA for each Formula One Grand Prix, they don’t give elevation data. Nor did a quick web search turn up any elevation maps.

In several previous posts on the topic of visualising F1 telemetry data I’ve plotted various map views, so I wondered whether I could generate elevation maps too… and it appears I can, using the Google elevation API:

F1 cct elevation data

The x-axis is distance round the lap, so if data from multiple laps is captured, we can start to get a more complete set of altitude data round the circuit.

Simply pass the API one or more sets of latitude/longitude co-ordinates and it returns the corresponding elevation data. So for example, the above data (captured from the Vodafone McLaren Live website) shows a lap by Jenson Button of the Malaysia circuit earlier this year.
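The call itself is pretty trivial – something along the following lines works (the per-lap CSV filename is made up, the URL parameters are as per the Google Elevation API docs at the time of writing, and these days you may also need an API key):

import csv, json, urllib.parse, urllib.request

rows = list(csv.DictReader(open("but_sepang_lap.csv")))
locs = "|".join("%s,%s" % (r["NGPSLatitude"], r["NGPSLongitude"]) for r in rows)
url = ("http://maps.googleapis.com/maps/api/elevation/json?sensor=false&locations="
       + urllib.parse.quote(locs, safe=",|"))
data = json.loads(urllib.request.urlopen(url).read())
if data["status"] == "OK":
    for r, result in zip(rows, data["results"]):
        print(r["sLap"], round(result["elevation"], 1))
# NB the API limits how many locations you can send per request, so a full
# race's worth of samples would need chunking into batches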

So here’s what I’m thinking: how about a set of telemetric circuit guides derived from the telemetry data?

Here’s what we’d be competing with – the FIA circuit maps:

FIA Malaysia circuit map

To do this, I think it would make sense to use data captured across the whole of a race, putting it into metre-wide sLap bins (or maybe moving average bins 3 metres wide?) and then recording for each bin (there’s a code sketch after this list):

– the average latitude and longitude, to try to generate some idea of racing line;
– the most common (mode) gear setting;
– the most common “g-force” values, to show directional forces on the car (or maybe we could derive an angle of travel based on current and next, or current and previous, locations?);
– the average (mean) speed (maybe with outliers removed?);
– some average of brake and throttle values?
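As a sketch (per-lap CSV filenames invented again, and with the simplest possible choice of averages), the binning might look something like this:

import csv, glob
from collections import defaultdict
from statistics import mean, mode

bins = defaultdict(list)
for fn in glob.glob("ham_lap_*.csv"):              # per-lap files, one race's worth
    for row in csv.DictReader(open(fn)):
        bins[int(float(row["sLap"]))].append(row)  # 1 metre wide sLap bins

guide = {}
for s in sorted(bins):
    rows = bins[s]
    guide[s] = {
        "lat":  mean(float(r["NGPSLatitude"]) for r in rows),
        "lon":  mean(float(r["NGPSLongitude"]) for r in rows),
        "gear": mode(int(r["NGear"]) for r in rows),   # may need a tie-break rule
        "vCar": mean(float(r["vCar"]) for r in rows),
    }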

As to how to display the map – the use of Google elevation data requires that a Google map is also displayed, so using something like this map/scatterplot mashup technique might be appropriate?

Having good (average) resolution data for lat/long around the circuit as a whole also means we should be able to generate a reasonable KML tour to view a lap animation in Google Earth (e.g. reusing this Google Earth path/tour simulator (Silverstone data, F1 car kmz model)).

The only thing is – I can’t get myself motivated to hack the code required to do this today:-(

PS hmmm… seems like racing line data is available on the Racecar Engineering website (e.g. Formula 1 2010: Round 5 Barcelona tech data). There’s also this interesting article on GPS data – it’s just a shame that the data we get on a per lap basis from the McLaren site is at too poor a resolution (one sample per second) to be able to do anything really interesting with it…

PDF Data Liberation: Formula One Press Release Timing Sheets

If you want F1 summary timing data from practice sessions, qualifying and the race itself, you might imagine that the FIA Media Centre is the place to go:

Hmm… PDFs…

Some of the documents provide all the results on a single page in a relatively straightforward fashion:

Others are split into tables over multiple pages:

Following the race, the official classification was available as a scrapable PDF in preliminary form, but the final result – with handwritten signature – looked to be a PDF of a photocopy, and as such defies scraping without an OCR pass first… which I didn’t try…

I did consider setting up separate scrapers for each timing document, and saving the data into a corresponding Scraperwiki database, but a quick look at the license conditions made me a little wary…

No part of these results/data may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording, broadcasting or otherwise without prior permission of the copyright holder except for reproduction in local/national/international daily press and regular printed publications on sale to the public within 90 days of the event to which the results/data relate and provided that the copyright symbol appears together with the address shown below …

Instead, I took the scrapers just far enough that I (that is, me ;-) could see how I would be able to get hold of the data without too much additional effort, but I didn’t complete the job… there’s partly an ulterior motive for this too… if anyone really wants the data, then you’ll probably have to do a bit of delving into the mechanics of Scraperwiki;-)

(The other reason for my not spending more time on this at the moment is that I was looking for a couple of simple exercises to get started with grabbing data from PDFs, and the FIA docs seemed quite an easy way in… Writing the scrapers is also a bit like doing Sudoku, or Killer, which is one of my weekend pastimes…;-)

The scraper I set up is here: F1 Timing Scraperwiki

To use the scrapers, you need to open up the Scraperwiki editor, and do a little bit of configuration:

(Note that the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)

When you’ve configured the scraper, run it…

The results of the scrape should now be displayed…

Scraperwiki does allow scraped data to be deposited into a database, and then accessed via an API, or other scrapers, or uploaded to Google Spreadsheets. However, my code stops at the point of getting the data into a Python list. (If you want a copy of the code, I posted it as a gist: F1 timings – press release scraper; you can also access it via Scraperwiki, of course).
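For what it’s worth, the general shape of the approach is along the following lines – this isn’t the actual Scraperwiki code, the filename is made up, and the regular expression is only indicative (the real timing sheets need rather more care):

import re, subprocess

# Dump the PDF as layout-preserving text (needs the pdftotext command line tool)
txt = subprocess.run(["pdftotext", "-layout", "race_classification.pdf", "-"],
                     capture_output=True, text=True).stdout

rows = []
for line in txt.splitlines():
    # Indicative pattern: position, car number, driver initial + surname, then the rest
    m = re.match(r"\s*(\d+)\s+(\d+)\s+([A-Z]\.\s*[A-Z][A-Za-z]+)\s+(.+)", line)
    if m:
        rows.append(m.groups())
print(rows[:5])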

Note that so far I’ve only tried the docs from a single race, so the scrapers may break on the releases published for future (or previous) races… Such is life when working with scrapers… I’ll try to work on robustness as the races go by. (I also need to work on the session/qualifying times and race analysis scrapers… they currently report unstructured data and also display an occasional glitch that I need to handle via a post-scrape cleanser.)

If you want to use the scraper code as a starting point for building a data grabber that publishes the timing information as data somewhere, that’s what it’s there for (please let me know in the comments;-)

PS by the by, Mercedes GP publish an XML file of the latest F1 Championship Standings. They also appear to be publishing racetrack information in XML form using URLs of the form http://assets.mercedes-gp.com/—9—swf/assets/xml/race_23_en.xml. Presumably the next race will be 24?

If you know of any other “data” sources or machine readable, structured/semantic data relating to F1, please let me know via a comment below:-)