…and by looking, I mean looking at what’s there as raw data – its structure, format and content – rather than looking at what stories a visual analysis of the data reveals, I’m afraid… (that’ll come later:-) But if you do need something visual to inspire you, here’s a teaser of what the data from a single lap of Hamilton’s race day tour of the Bahrain 2010 F1 Grand Prix looks like:
In his book Visualizing Data: Exploring and Explaining Data with the Processing Environment, Ben Fry describes a seven stage process for understanding data:
Obtain the data, whether from a file on disk or a source over a network.
Provide some structure for the data’s meaning, and order it into categories.
Remove all but the data of interest.
Apply methods from statistics or data mining as a way to discern patterns or place the data in a mathematical context.
Choose a basic visual model, such as a bar graph, list or tree.
Improve the basic representation to make it clearer and more visually engaging.
Add methods for manipulating the data or controlling what features are visible.
I’m not sure I’d agree that the elements above define a rigid, linear process:
(Produced using Graphviz; dot file)
I prefer to think of the way I work as something like this:
(Produced using Graphviz; dot file)
but what is clear is that we need to understand what data is available to us.
As I mentioned in F1 Data Junkie – Getting Started, the data I’m going to be playing with (at least at first) is data grabbed from the McLaren F1 Live Dashboard (as developed by Work Club), so let’s have a look at it… (I’ll come back to how the data was acquired in a later post.)
Here’s what the data as grabbed from the server looks like:
"text":"Lewis sets the fastest lap with a 2\'00:447 that time around.",
So what data is there? Well, there are telemetry data fields for each driver:
- timestamp – the time of day;
- nEngine – the number of revs the engine is doing;
- NGear – the gear the car is in (over the range 1 to 7);
- rThrottlePedal – the amount of throttle depression(?) (as a percentage);
- pBrakeF – the amount of brake depression(?) (as a percentage);
- gLat – the lateral “g-force” (that is, the side-to-side g-force that you feel in a car when going round a corner too quickly);
- gLong – the longitudinal “g-force” (that is, the forwards and backwards force you feel in a car that accelerates or decelerates quickly when going in a straight line);
- sLap – the distance round the lap (in metres); this resets to zero on each tour, presumably at the start/finish line;
- vCar – the speed of the car (km/h);
- NGPSLatitude – the GPS identified latitude of the car;
- NGPSLongitude – the GPS identified longitude of the car.
On some samples, there is also commentary information, but I’m going to largely ignore that.
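To make the shape of a single telemetry sample concrete, here’s a sketch of one package as a Python dictionary, using the field names listed above. The values are made up for illustration – they’re not real race data:

```python
# A hypothetical single telemetry sample, using the field names
# described above; values are illustrative, not real race data.
sample = {
    "timestamp": "14:03:27",    # the time of day
    "nEngine": 17500,           # engine revs
    "NGear": 6,                 # gear the car is in (1 to 7)
    "rThrottlePedal": 82.0,     # throttle depression (%)
    "pBrakeF": 0.0,             # brake depression (%)
    "gLat": 1.8,                # lateral ("cornering") g-force
    "gLong": -0.4,              # longitudinal (accel/braking) g-force
    "sLap": 2345.0,             # distance round the lap (metres)
    "vCar": 287.0,              # speed of the car (km/h)
    "NGPSLatitude": 26.0325,    # GPS latitude
    "NGPSLongitude": 50.5106,   # GPS longitude
}
```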
The data I got hold of was a bundle of files containing data in the JSONP format like that shown above, with one file containing one package of data created once a second.
In order to parse the data, I needed to decide what format I wanted it in for processing. The format I chose was CSV – comma-separated value data – which looks like this:
The first column is the original filename; the other columns correspond to the data downloaded from the McLaren site.
In order to generate the CSV data, I wrote a Python script that would:
– strip off the padding around the JSON data;
– strip out escape characters that weren’t handled correctly by the parser (a step I added in before the parsing step once I’d spotted the problem);
– parse the JSON using a standard Python JSON parsing library;
– use a CSV library to write out the data in the CSV format.
(See an example Python script here.)
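The steps above might be sketched along the following lines. This is an illustration rather than my original script: the function names are made up, and it assumes the JSONP padding is a simple callback wrapper of the form callbackName({…});:

```python
# Sketch of the JSONP-to-CSV conversion steps; function names and the
# callback-wrapper assumption are illustrative, not the original script.
import csv
import json
import re

# The telemetry fields listed earlier in the post
FIELDS = ["timestamp", "nEngine", "NGear", "rThrottlePedal", "pBrakeF",
          "gLat", "gLong", "sLap", "vCar", "NGPSLatitude", "NGPSLongitude"]

def strip_padding(jsonp):
    # Remove the JSONP callback wrapper, leaving bare JSON
    return re.sub(r"^[^(]*\(|\);?\s*$", "", jsonp)

def clean_escapes(s):
    # Strip escape sequences the parser choked on (e.g. \' in commentary)
    return s.replace("\\'", "'")

def jsonp_to_row(jsonp, filename):
    # Parse one JSONP package into a CSV row, filename first
    data = json.loads(clean_escapes(strip_padding(jsonp)))
    return [filename] + [data.get(f) for f in FIELDS]

def write_csv(rows, path):
    # Write the rows out with a header line naming each column
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["filename"] + FIELDS)
        w.writerows(rows)
```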
I then refined my parsing script so that it would generate one CSV file per lap. To do this, the script had to:
– detect when the lap distance in one sample is less than the distance in the previous sample (i.e. the lap distance measure has been reset to zero between the two samples), using a test of the form if oldDist > d['sLap']:, where oldDist is set to d['sLap'] after each data record has been written to the CSV file;
– if a new lap has been started, close the old CSV file, create a new one, write the column header information into the top of the file, and then start adding the data to that file.
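The lap-detection logic can be sketched as follows. For simplicity this version groups samples into per-lap lists rather than writing each lap out to its own CSV file, but the test on the lap distance is the same one described above:

```python
# Sketch of the lap-splitting logic: when the lap distance (sLap) in a
# sample is less than in the previous sample, the counter has reset to
# zero, so a new lap has started. Grouping into lists here stands in
# for closing one CSV file and opening the next.
def split_into_laps(samples):
    laps, current = [], []
    oldDist = None
    for d in samples:
        # Lap distance has reset => a new lap has started
        if oldDist is not None and oldDist > d["sLap"]:
            laps.append(current)
            current = []
        current.append(d)
        oldDist = d["sLap"]
    if current:
        laps.append(current)
    return laps
```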
Having got the data into a CSV format, I could then load it in to an environment where I could start to think about visualising it. A spreadsheet for example, or Processing, (which is what I used to create the single lap view shown at the start of this post).
But to see how to do that on the one hand, and what stories we can find in the data on the other, we’ll need to move on to another post…
[Reflection on this post: to get a large number of folk interested, I really need to do less of the geeky techie stuff, and more of the “what does the data say” cool viz stuff… but if I don’t log the bootstrap techie stuff now before I overcomplicate it(?!) my record of the simplest file handling and parsing code will get lost…;-) In the book version, a large part of this post would be a technical appendix… But the “what data fields are available” bit would be in Chapter 2 (after a fluffy Chapter 1!).]
PS some of the technical details behind the McLaren site have started appearing on the personal blog of one of the developers – e.g. Building McLaren.com – Part 3: Reading Telemetry. In that post it was pointed out that I haven’t been adding copyright notices about the data to these posts – which I’ll happily do once I know who to acknowledge, and how… In the meantime, it appears that “the speed, throttle and brake are sponsored by Vodafone” and “McLaren are providing this data for you to view”, so I should link to them: thanks McLaren :-)