Visualising F1 Telemetry Data and Plotting Latitude and Longitude with ggplot Map Projections in R

Why don’t X-Y plots of latitude and longitude data look “right” compared to traditional map views?

For example, here’s an X-Y scatterplot of some of Jenson Button’s McLaren telemetry data from the 2010 Australian Formula One Grand Prix:

The image was generated from a data file hosted on Google Spreadsheets, using the following R script and the ggplot2 library:

require(ggplot2)
require(RCurl)

#gsqAPI is a helper function that loads data in from a Google Spreadsheet that has been shared as public.
gsqAPI = function(key, query, gid=0){
  #Build the Google Visualisation API query URL, asking for CSV output
  url = paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv',
               '&tq=', curlEscape(query), '&key=', key, '&gid=', gid )
  return( read.csv( url ) )
}

#Provide the spreadsheet key
#Data was originally grabbed from the McLaren F1 Live Dashboard during the race and is Copyright © McLaren Marketing Ltd 2010 (I think? Or possibly Vodafone McLaren Mercedes F1 2010(?)). I believe that speed, throttle and brake data were sponsored by Vodafone.
key='0AmbQbL4Lrd61dER5Qnl3bHo4MkVNRlZ1OVdicnZnTHc'
#We can write SQL like queries over the spreadsheet, as described in https://blog.ouseful.info/2009/05/18/using-google-spreadsheets-as-a-databace-with-the-google-visualisation-api-query-language/
q='select *'

#Run the query on the database
df=gsqAPI(key,q)

#Sanity check - preview the imported data
head(df)

#Example circuit map - sort of - showing the gLat (latitudinal 'g-force') values around the circuit (point size is absolute value of gLat; colour has two values, one for + and one for - (swing to left and swing to right)).
g=ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude,col=sign(gLat),size=abs(gLat)))
print(g)

What’s lacking is a projection from the everyday Cartesian coordinate system to something like a Mercator based projection. Fortunately, the Grammar of Graphics model that underpins ggplot allows us to write the necessary co-ordinate system transformation into our chart generating command:

ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude,col=sign(gLat),size=abs(gLat))) + coord_map(project="mercator")

Here’s the result:

(Note: I haven’t totally got my head round what the different co-ordinate transforms do, or how they relate to any sort of ‘reality’! But they’re another thing I’m now aware of…;-)

As to what the chart shows? It’s a plot of how the latitudinal (left-right) ‘g-force’ acts on Button as he tours the circuit. The points are coloured according to whether the force acts from the left or the right (+ or -) and sized according to the magnitude of the force (away from normal?). So it points out left and right hand corners and how tight they are, essentially;-)

To show just how easy it is to write simple graphics and even statistical charts using ggplot, here are a few more examples:

#Example "driver DNA" trace, showing low gear  throttle usage (distance round track on x-axis, lap number on y axis, node size is inversely proportional to gear number (low gear, large point size), colour relativ to throttlepedal depression
g=ggplot(df) + geom_point(aes(x=sLap,y=Lap,col=rThrottlePedal,size=-NGear)) + scale_colour_gradient(low='red',high='green')
print(g)

I started calling things like the above chart “Driver DNA” charts – the x-axis represents distance round the track, the y-axis is lap number. In this case, nodes are sized in inverse proportion to the gear (so low gear, large point size) and coloured by throttle pedal pressure. You’ll notice how consistent Button is lap on lap. The idea behind the colouring/sizing for this chart was that it would provide a glimpse into behaviour around low gear turns.

#Example of gear value around the track
g=ggplot(df) + geom_line(aes(x=sLap,y=NGear))
print(g)

This chart soooo reminds me of simple op art…:-) I’m not sure how useful it is for showing gear selection according to distance round the track, but I just love the line of it:-)

#We can also show a trace for a single lap, such as speed coloured by gear
g=ggplot(subset(df,Lap==22)) + geom_line(aes(x=sLap,y=vCar,colour=NGear))
print(g)

ggplot can, of course, do line charts. In the above example, I make an initial exploration into how line segment colour can be used to highlight the gear selection that allows the car to reach the speed it does (y-axis) as it goes round the circuit (x is distance round the lap).

#We can also do statistical graphics - like a boxplot showing the distribution of speed values by gear
g = ggplot(df) + geom_boxplot(aes(factor(NGear),vCar))
print(g)

ggplot isn’t just about literal graphing from data elements directly to marks on a canvas. It can also do stats as part of the mapping; in this case, we generate a boxplot that summarises the range of speeds achieved for different gear values.

#Footwork - brake and throttle pedal depression based on gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),rThrottlePedal),colour='darkgreen') + geom_jitter(aes(factor(NGear),pBrakeF),colour='darkred')
print(g)

Some more dots… in the above case, I try to explore Button’s footwork in a little more detail, seeing how he applies brake and throttle pressure according to gear selection (throttle depression is green, brake is red).

#Forces on the driver
#gLong by brake and gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),gLong,col=pBrakeF)) + scale_colour_gradient(low='red',high='green')
print(g)

In the diagrams immediately above and below, I try to show what sorts of longitudinal forces are typically experienced by Button according to gear selection, the idea being that we may get to see whether gears are used for linear acceleration or deceleration.

#gLong by throttle and gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),gLong,col=rThrottlePedal)) + scale_colour_gradient(low='red',high='green')
print(g)

#gLong boxplot
ggplot(df) + geom_boxplot(aes(factor(NGear),gLong))+ geom_jitter(aes(factor(NGear),gLong),size=1)

Here, I use a boxplot to try to see whether the longitudinal g-force is typically experienced under acceleration or braking by gear. Note that the points are scattered according to random jitter about their actual, integer values.

Finally, here’s a look at how engine RPM and the car speed relate to gear selection. Would you be able to work out how to write this diagram?

Here’s how I did it…

#How do engine revs and speed relate to gear selection?
ggplot(df)+geom_point(aes(x=nEngine,y=vCar,col=factor(NGear)))

Hopefully what this quick tour of ggplot has illustrated is how easy it can be to generate a wide range of charts from the same data set. May all your charts be written, and then generated directly from their source data;-)

PS I did try to generate these images via CloudStat, but for some reason the images didn’t generate properly [WORKS NOW – THANKS @cloudstatorg] (it seemed to work fine when I tried just a couple?). Here’s the link anyway: Cloudstat: F1 telemetry demo. The plots were generated in RStudio and saved using ggsave().

As to how this fits in with other things? Regular readers may remember the occasional rant I’ve had about the importance of providing the queries that map from data sets onto summary data tables so that the means by which the summaries were generated from open raw data are made transparent. In a similar way, data cleansing tools such as Google Refine or Stanford Data Wrangler allow you to log the transformations applied to a messy raw data set in order to get it into a state where you can actually work with it. The same is true of images. It’s far too easy to generate complex graphics from even more complex datasets, and then forget how the image was actually created, maybe what it even represents. By writing the diagram, essentially generating a query that maps from the data onto the visual representation provided by the generated diagram or chart, we preserve the audit trail from data to the chart output.

In the same way we might imagine the word equation:

DATA + QUERY = REPORT

we might also imagine:

DATA + GRAPHER = CHART

(You get the idea? The words are probably not the right words, but the sentiment is there…)

In an academic research setting, where it’s common to find lists of figures presented separately in books or theses, it would also make sense to include a ‘figure appendix’ which gives, for example, the ggplot commands required to generate each statistical graphic presented from the actual data source.

Sigh…if only my first name was Damien and I could persuade my minions to do the Lichtenstein thing and paint-by-numbers over some large projections of some of the more aesthetically interesting charts;-)

Sketching Sponsor Partners Running UK Clinical Trials

Using data from the clinicaltrials.gov registry (search for UK clinical trials), I grabbed all records relating to trials that have at least in part run in the UK as an XML file download, then mapped links between project lead sponsors and their collaborators. Here’s a quick sketch of the result:

ukClinicalTrialPartners (PDF)

The XML data schema defines the corresponding fields as follows:

<!-- === Sponsors ==================================================== -->

  <xs:complexType name="sponsors_struct">
    <xs:sequence>
      <xs:element name="lead_sponsor" type="sponsor_struct"/>
      <xs:element name="collaborator" type="sponsor_struct" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

Here’s the Python script I used to extract the data and generate the graph representation of it (requires networkx), which I then exported as a GEXF file that could be loaded into Gephi and used to generate the sketch shown above.

import os
from lxml import etree
import networkx as nx
import networkx.readwrite.gexf as gf
from xml.etree.cElementTree import tostring


def flatten(el):
    #Recursively flatten an XML element, its children and their tails into a single string
    if el is not None:
        result = [ (el.text or "") ]
        for sel in el:
            result.append(flatten(sel))
            result.append(sel.tail or "")
        return "".join(result)
    return ''

def graphOut(DG):
    #Serialise the graph as GEXF so it can be loaded directly into Gephi
    writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
    writer.add_graph(DG)
    #print tostring(writer.xml)
    f = open('workfile.gexf', 'w')
    f.write(tostring(writer.xml))
    f.close()

def sponsorGrapher(DG,xmlRoot,sponsorList):
    sponsors_xml=xmlRoot.find('.//sponsors')
    #Identify the lead sponsor for the trial and make sure it has a node
    lead=flatten(sponsors_xml.find('./lead_sponsor/agency'))
    if lead !='':
        if lead not in sponsorList:
            sponsorList.append(lead)
            DG.add_node(sponsorList.index(lead),label=lead,name=lead)

    #Add a node for each collaborator, and an edge from the lead sponsor to it
    for collab in sponsors_xml.findall('./collaborator/agency'):
        collabname=flatten(collab)
        if collabname !='':
            if collabname not in sponsorList:
                sponsorList.append(collabname)
                DG.add_node( sponsorList.index( collabname ), label=collabname, name=collabname )
            #Only add an edge if a lead sponsor was actually identified
            if lead !='':
                DG.add_edge( sponsorList.index(lead), sponsorList.index(collabname) )
    return DG, sponsorList

def parsePage(path,fn,sponsorGraph,sponsorList):
	fnp='/'.join([path,fn])
	xmldata=etree.parse(fnp)
	xmlRoot = xmldata.getroot()
	sponsorGraph,sponsorList = sponsorGrapher(sponsorGraph,xmlRoot,sponsorList)
	return sponsorGraph,sponsorList

XML_DATA_DIR='./ukClinicalTrialsData'
listing = os.listdir(XML_DATA_DIR)

sponsorDG=nx.DiGraph()
sponsorList=[]
for page in listing:
	if os.path.splitext( page )[1] =='.xml':
		sponsorDG, sponsorList = parsePage(XML_DATA_DIR,page, sponsorDG, sponsorList)

graphOut(sponsorDG)

Once the file is loaded into Gephi, you can hover over nodes to see which organisations partnered which other organisations, etc.

One thing the graph doesn’t show directly is links between co-collaborators – edges simply go from the lead partner to each collaborator. It would also be possible to generate a graph that represents pairwise links between every sponsor of a particular trial.
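By way of a sketch of that variant – in R this time, to match the charting code elsewhere in these posts, and assuming a hypothetical data frame trialSponsors with one row per (trial, sponsor) pairing in columns trial and sponsor:

#Build an undirected graph with an edge between every pair of sponsors
#that appear on the same trial
library(igraph)
sponsorPairs = do.call(rbind, lapply(split(trialSponsors$sponsor, trialSponsors$trial),
    function(s){ s=unique(s); if (length(s)>1) t(combn(s,2)) else NULL }))
g = graph.data.frame(as.data.frame(sponsorPairs), directed=FALSE)
#Save in a format Gephi can open
write.graph(g, 'sponsorPairs.graphml', format='graphml')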

The XML data download also includes information about the locations of trials (sometimes at the city level, sometimes giving postcode level data). So the next thing I may look at is a map to see where sponsors tend to run trials in the UK, and maybe even see whether different sponsors tend to favour different trial sites…

Further down the line, I wonder whether any linkage can be made across to things like GP practice level prescribing behaviour, or grant awards from the MRC?

PS these may be handy too – World Health Organisation Clinical Trials Registry portal, EU Clinical Trials Register

PPS looks like we can generate a link to the clinicaltrials.gov download file easily enough. Original URL is:
http://clinicaltrials.gov/ct2/results?cntry1=EU%3AGB&show_flds=Y&show_down=Y#down
Download URL is:
http://clinicaltrials.gov/ct2/results/download?down_stds=all&down_typ=study&down_flds=shown&down_fmt=plain&cntry1=EU%3AGB&show_flds=Y&show_down=Y
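(Grabbing and unpacking that download in R, for example, only takes a couple of lines of base R – the file and directory names here are just placeholders:)

#Fetch the zipped set of per-study XML records and unpack them
download.file('http://clinicaltrials.gov/ct2/results/download?down_stds=all&down_typ=study&down_flds=shown&down_fmt=plain&cntry1=EU%3AGB&show_flds=Y&show_down=Y', destfile='ukTrials.zip')
unzip('ukTrials.zip', exdir='ukClinicalTrialsData')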
So now I wonder: can Scraperwiki accept a zip file, unzip it, then parse all the resulting files? Answers, with code snippets, via the comments, please:-) DONE – example here: Scraperwiki: clinicaltrials.gov test

Sketching Substantial Council Spending Flows to Serco Using OpenlyLocal Aggregated Spending Data

An article in today’s Guardian (Serco investigated over claims of ‘unsafe’ out-of-hours GP service) about services provided by Serco to various NHS Trusts got me thinking about how much local councils spend with Serco companies. OpenlyLocal provides a patchy(?) aggregating service over local council spending data (I don’t think there’s an equivalent aggregator for NHS organisations’ spending, or police authority spending?) so I thought I’d have a quick peek at how the money flows from councils to Serco.

If we search the OpenlyLocal Spending Dashboard, we can get a summary of spend with various Serco companies from local councils whose spending data has been logged by the site:

Using the local spend on corporates scraper I used to produce Inter-Council Payments Network Graph, I grabbed details of payments to companies returned by a search on OpenlyLocal for suppliers containing the keyword serco, and then generated a directed graph with edges defined: a) from council nodes to company nodes; b) from company nodes to canonical company nodes. (Where possible, OpenlyLocal tries to reconcile companies identified for payment by councils with canonical company identifiers so that we can start to get a feeling for how different councils make payments to the same companies.)

I then exported the graph as a json node/edge list so that it could be displayed by Mike Bostock’s d3.js Sankey diagram library:

(Note that I’ve filtered the edges to only show ones above a certain payment amount (£10k).)

As a presentation graphic, it’s really tatty, doesn’t include amount labels (though they could be added) and so on. But as a sketch, it provides an easy to digest view over the data as a starting point for a deeper conversation with the data. We might also be able to use the diagram as a starting point for a data quality improvement process, by identifying the companies that we really should try to reconcile.

Here are the flows associated with spend to G4S identified companies:

I also had a quick peek at which councils were spending £3,500 and up (in total) with the OU…

Digging into OpenlyLocal spending data a little more deeply, it seems we can get a breakdown of how total payments from council to supplier are made up, such as by spending department.

Which suggests to me that we could introduce another “column” in the Sankey diagram that joins councils with payees via spending department (I suspect the Category column would result in data that’s a bit too fine grained).
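Here’s a minimal R sketch of how those extra links might be generated, assuming a hypothetical data frame spend with columns council, department, payee and amount – we build two sets of links (council to department, department to payee) and stack them to give the Sankey edge list:

#Aggregate payments council -> department and department -> payee
links1 = aggregate(amount ~ council + department, data=spend, sum)
names(links1) = c('source','target','value')
links2 = aggregate(amount ~ department + payee, data=spend, sum)
names(links2) = c('source','target','value')
#Stack the two link sets to give the three "column" Sankey edge list
sankeyLinks = rbind(links1, links2)

(In practice the department nodes would probably need namespacing by council, so that identically named departments in different councils aren’t merged into a single node.)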

See also: University Funding – A Wider View

Interest Differencing: Folk Commonly Followed by Tweeting MPs of Different Parties

Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.

Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:
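For what it’s worth, the heavy lifting is done by just a couple of function calls. Here’s a minimal sketch, assuming a matrix friendCounts (a name of my own devising) with one row per followed Twitter account (rownames set to the screen names) and one column per party, each cell counting how many of that party’s MPs follow the account:

#Commonly followed accounts across all parties, and differentially followed ones
library(wordcloud)
commonality.cloud(friendCounts, max.words=100, random.order=FALSE)
comparison.cloud(friendCounts, max.words=100, random.order=FALSE)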

There’s still a fair bit to do to make the methodology robust – for example, being able to cope with comparing folk commonly followed by different sets of users where the sizes of the sets differ to a significant extent (there is a large difference between the number of tweeting Conservative and LibDem MPs, for example). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe it hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or a particular perspective, about the way different groups differentially follow folk on Twitter.

A couple of other quirks I’ve noticed about the comparison.cloud as currently defined. Firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear); there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud. For example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman – and @davegorman appeared as a small label in the friends part of the comparison.cloud (notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do…). What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed there because it would be too large).

That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). I guess one way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets. Suggestions welcome as to what they might be…:-)

PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…

Pragmatic Visualisation – GDS Transaction Data as a Treemap

A week or two ago, the Government Data Service started publishing a summary document containing website transaction stats from across central government departments (GDS: Data Driven Delivery). The transactional services explorer uses a bubble chart to show the relative number of transactions occurring within each department:

The sizes of the bubbles are related to the volume of transactions (although I’m not sure what the exact relationship is?). They’re also positioned on a spiral, so as you work clockwise round the diagram starting from the largest bubble, the next bubble in the series is smaller (the “Other” catchall bubble is the exception, sitting as it does on the end of the tail irrespective of its relative size). This spatial positioning helps communicate relative sizes when the actual diameter of two bubbles next to each other is hard to differentiate between.

Clicking on a link takes you down into a view of the transactions occurring within that department:

Out of idle curiosity, I wondered what a treemap view of the data might reveal. The order of magnitude differences in the number of transactions across departments meant that the resulting graphic was dominated by departments with large numbers of transactions, so I did what you do in such cases and instead set the size of the leaf nodes in the tree to the log10 of the number of transactions in a particular category, rather than the actual number of transactions. Each node higher up the tree was then simply the sum of the values in the lower levels.
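The treemap shown below was actually generated with a Google chart component, but just to make the log10 trick concrete, here’s a minimal sketch using the R treemap package that features in a later post; the data frame and column names (gds, department, service, transactions) are hypothetical:

#Tame the order of magnitude differences by sizing leaf nodes on log10 counts
require(treemap)
gds$logTrans = log10(gds$transactions)
tmPlot(gds, index=c('department','service'), vSize='logTrans')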

The result is a treemap that I decided shows “interestingness”, which I defined for the purposes of this graphic as being some function of the number and variety of transactions within a department. Here’s a nested view of it, generated using a Google chart visualisation API treemap component:

The data I grabbed had a couple of usable structural levels that we can make use of in the chart. Here’s a view going down to the first level:

…and then the second:

Whilst the block sizes aren’t really a very good indicator of the number of transactions, it turns out that the default colouring does indicate relative proportions in the transaction count reasonably well: deep red corresponds to a low number of transactions, dark green a large number.

As a management tool, I guess the colours could also be used to display percentage change in transaction count within an area month on month (red for a decrease, green for an increase), though a slightly different size transformation function might be sensible in order to draw out the differences in relative transaction volumes a little more?
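Sticking with the R treemap package as a sketching tool, that management view might be wired up as follows – the pctChange column is entirely hypothetical, and I believe type="value" defaults to a diverging red-green palette, which suits the decrease/increase encoding:

#Colour blocks by (hypothetical) month on month percentage change in
#transaction count - red for a decrease, green for an increase
tmPlot(gds, index=c('department','service'),
       vSize='logTrans', vColor='pctChange', type='value')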

I’m not sure how well this works as a visualisation that would appeal to hardcore visualisation puritans, but as a graphical macroscopic device, I think it does give some sort of overview of the range and volume of transactions across departments that could be used as an opening gambit for a conversation with this data?

London Olympics 2012 Medal Tables At A Glance?

Looking at the various medal standings for medals awarded during any Olympics games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get a sense of how the rankings might change if other medal colours were taken into account in some way.

Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing Olympic Games Data.)

Although not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.

Whilst there is still a lot more thinking I’d like to do around the use of treemaps for visualising Olympics medal data, here are a handful of quick sketches constructed using Google visualisation chart treemap components, and data scraped from NBC.

The data I have scraped is represented using rows of the form:

Country, Event, Gold, Silver, Bronze

where Event is at the level of “Swimming”, “Cycling” etc rather than at finer levels of detail (it’s really hard finding data at even this level in an easily grabbable way?)

I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:

  • MedalType, Country, Event
  • MedalType, Event, Country
  • Event, MedalType, Country
  • Event, Country, MedalType
  • Country, MedalType, Event
  • Country, Event, MedalType

Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.

First up, ordered by Medal, Country, Event:

This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:

The colouring is a bit off – the Google component is not as directly scriptable as a d3.js treemap, for example – but with a bit of experimentation it may be possible to find a colour scheme that better indicates the number of medals allocated in each case.

The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (i.e. we’re happy to take a point of view based on the award of any medal type):

The Country-Event-Medal view gives us a view of the relative strength in depth of each country in each event:

and the Country-Medal-Event view allows us to then tunnel in on the gold winning events:

I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments:-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3js seems like it’d be a good bet, although I need to do a quick review of R based treemap libraries too to see if there is anything there that may be appropriate.

It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as exploring, as quick doodles with the Google chart component, simpler treemaps that don’t reveal lower level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)

Data allowing, it would also be interesting to be able to get more detailed data visualised (for example, down to the level of actual events – 100m and Long Jump, say, rather than Track and Field – as well as the names of individual medalists).

PS for another Olympics related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances

PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship

Creating Olympic Medal Treemap Visualisations Using OTS R Libraries

In London Olympics 2012 Medal Tables At A Glance? I posted some treemap visualisations of the Olympics medal tables generated using a Google Visualisation Chart treemap component. I thought it might be worth posting a quick R generated example too, using the off-the-shelf/straight out of CRAN treemap component. (If you want to play along, download the data as CSV from here.)

The original data looks like this:

but ideally we want it to look like this:

I posted a quick recipe showing how to do this sort of reshaping in Google Refine, but in R it’s even easier – just melt the Gold, Silver and Bronze columns into a pair of columns…

Here’s the full code to do the reshaping and generate a simple treemap:

#load in the data from a file
odata = read.csv("~/Downloads/nbc_olympic_medalscrape.csv")

#Reshape the data
require(reshape)
odatar=melt(odata,id=c('cc','ccevent','Event'))

#And generate the treemap in the simplest possible way
require(treemap)
tmPlot(odatar, 
       index=c("cc", "Event","variable"), 
       vSize="value", vColor='value',
       type="value")

And here’s the treemap, with country blocks ordered in this case by total medal haul:

(To view the countries ordered according to number of Golds, a quick fix would be to order the hierarchy with the medal type shown at the highest level of the tree: index=c("variable","cc", "Event").)

Generating variant views (I described six variants in the original post) is easy enough – just tweak the order of the elements of the index setting, as sketched below. (I should have named the melt-created columns something more sensible than the defaults, shouldn’t I? Note that the "value" passed to the vSize and vColor settings refers to the melt-created column holding the medal counts; the type="value" setting is literal – it says to use the numerical value, rather than referring to a column name.)
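For example, here’s one way of churning out all six variants in one go:

#Generate each of the six hierarchical orderings in turn
#(cc=Country, Event, variable=MedalType in the melted data)
for (ix in list( c('variable','cc','Event'), c('variable','Event','cc'),
                 c('Event','variable','cc'), c('Event','cc','variable'),
                 c('cc','variable','Event'), c('cc','Event','variable') )){
  tmPlot(odatar, index=ix, vSize='value', vColor='value', type='value')
}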

Out of the can – simples enough… So what might we be able to do with a little bit more treatment? Examples via the comments, please ;-)

Olympics Swimming Lap Charts from the New York Times

Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the widely used graphical devices used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:

Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:

(Great examples of how to read a race history chart can be found on the IntelligentF1 blog. For the general case, see The IntelligentF1 model.)

Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.

I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:

One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.

A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):

(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)

All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
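To make that concrete, here’s a minimal R sketch deriving end-of-lap positions and race history deltas from raw laptimes, assuming a hypothetical data frame laps with columns driver, lap and laptime (in seconds):

#Elapsed race time for each competitor at the end of each lap
laps = laps[order(laps$driver, laps$lap),]
laps$elapsed = ave(laps$laptime, laps$driver, FUN=cumsum)
#Lap chart: position at the end of each lap is the rank on elapsed time
laps$position = ave(laps$elapsed, laps$lap, FUN=rank)
#Race history chart: winner's average laptime * lap number, minus elapsed time
winner = laps$driver[laps$lap==max(laps$lap) & laps$position==1]
winnerAvg = mean(laps$laptime[laps$driver==winner])
laps$raceHistory = winnerAvg*laps$lap - laps$elapsed
require(ggplot2)
ggplot(laps) + geom_line(aes(x=lap, y=raceHistory, group=driver, colour=driver))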

A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:

Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.

[Here’s an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they’d done similar views in previous years using a Flash component, but the current iteration uses d3.js]

The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.

Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:

(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)

Of course, without the data, it can be difficult trying to generate these charts…

…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)

PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.

“Visualising” High Frequency Trading With Sound (Sonification)

Over the summer, an episode of one of my favourite audio/radio programmes, the OU co-produced Radio 4 programme More or Less, included a package on high frequency trading. To illustrate how fast high frequency trading works, the programme used a beautiful bit of sonification (the audio equivalent of a graphical data visualisation). You can listen to it on iPlayer here: How fast is high-frequency trading?

Just in case it’s blocked outside the UK, here’s a version I cropped from the downloaded podcast myself [MP3]:

Tim Harford, presenter of the programme, also wrote about high frequency trading here: High-frequency trading and the $440m mistake. Interestingly, the article also includes the audio package… Here’s a link to the original programme on iPlayer: How to lose money, fast

A couple of weeks prior to the More or Less programme (coincidence? Or inspiration?) a blog post about a data sketch done by NYT’s incredibly creative Amanda Cox referred to a similar audio technique to illustrate(?!) close finishes in sprint races: Why Amanda Cox should be in charge of audio. The post also referred back to a New York Times piece from February 2012 capturing just how closely some of the 2010 Winter Olympics race finished: Fractions of a Second: An Olympic Musical.

So now I’m wondering – have you ever seen, erm, heard a presentation that has used audio, rather than graphics, to illustrate a data story?

See also: Robot wars: How high frequency trading changed global markets.

its the Gramma an punctuashun wot its’ about, Rgiht?

This is another of those confluence style posts, where a handful of things I’ve read in quick succession seem to phase lock in my mind:

(brought to mind in part via @downes a week or so ago: How to Synch 32 Metronomes)

The first was a post by Alan Levine on Making Text Work, which describes a simple technique for making text overlays on photographs create a more coherent image than just slapping text on:

One of the techniques I use on presentation slides from time to time is a solid filled banner or stripe containing text, sometimes opaque, sometimes transparent. I wouldn’t claim to be much of an artist but it makes the slides slightly more interesting than a header with an image in the middle of the slide, surrounded by whitespace…

(Which reminds me: maybe I should look through Presentation Zen Design and Presentation Zen: Simple Ideas on Presentation Design and Delivery again…)

Reading Alan’s post, it occurred to me that once you get the idea of using a solid or semi-transparent filled background to a text label, you tend to remember it and add it to your toolbox of presentation ideas. (Of course, you might also forget and later rediscover this sort of trick… My own slides tend to follow particular design ideas I’ve recently picked up on and decided I want to explore, albeit often in a crude and not very well polished way…) In the slide above, several tricks are evident: the solid filled text label, the positioning of it, the backgrounded blog post that actually serves as a reference for the slide (you can do a web search for the post title to learn more about the topic), the foregrounded image, rotated slightly, and so on.

The thing that struck me about Alan’s post was that it reminded me of a time before I was really aware of using a solid filled label to foreground a piece of text, which in turn caused me to reflect on other things I now take for granted as ideas that I can draw on and combine with other ideas.

In the same way we learn to spell, and learn to use punctuation, and start to pick up on the grammar that structures a language, so we can use those rules to construct ever more complex sentences. Once we know the rules, it becomes a choice as to whether or not we employ them.

Here’s an example of how we might come to acquire a new design idea, drawn from a brief conversation with @mediaczar a couple of nights ago when Mat asked if anyone knew the name of this sort of chart combination:

I didn’t know what to call the chart, but thought it should be easy enough to try to wrangle one together using ggplot in R, guessing that a geom_errorbarh() might work; Mat came back suggesting geom_crossbar().

Here’s a minimal code fragment I used to explore it:

#plot a horizontal bar across a bar chart
library(ggplot2)
#Toy data: two bars, 'x' and 'y'
a=data.frame(x=10,y=43)
b=stack(a) #gives columns 'values' and 'ind'
#Positions of the horizontal marker bars for each column
markers=data.frame(d=c('x','y'),f=c(3,22))
g=ggplot(b)+geom_bar(aes(x=ind,y=values),stat='identity')
#Overlay a thin crossbar (f+/-1) across each bar
g=g+geom_crossbar(data = markers, aes(x=d,y=f,ymax=f+1,ymin=f-1), colour = NA, fill = "red", width = 0.9, alpha = 0.5)
print(g)

Here’s an example of how I used it – in an as yet unlabelled sketch showing, for a particular F1 driver, their grid position for each race (red bar) and the number of places they gained (or lost) during the first lap:

So now I know how to achieve that effect…

Now for one or two more things… Just after reading Alan’s post, I read a post by James Allen on possible race strategies for the Japanese Grand Prix:

The first thing that struck me was that even if you vaguely understand how a race chart works, the following statement may not be readily obvious to you from the top chart (my emphasis):

Three stops is actually faster [taking new softs on lap 12], as the [upper] graph … shows, but it requires the driver to pass the two stoppers in the final stint. If there is a safety car, it will hand an advantage to the two stoppers.

So, can you see why the three stopper (the green line) “requires the driver to pass the two stoppers in the final stint”? Let’s step back for a moment – can you see which bits of the graph represent an overtake?

This is actually quite a complex graph to read – the axes are non-obvious, and not that easy to describe, though you soon pick up a feeling for how the chart works (honest!). Getting a sensible interpretation working for the surprising feature – the sharp vertical drops – is one way of getting into this chart, as well as looking at how the lines are positioned at the extreme x values (that is, at the end of the first and last laps).

The second thing that occurred to me was that we could actually remove the fragment of the line that shows the pitstop and instead show a separate line segment for each stint for each driver, and hence remove the line crossings that do not represent actual on-track overtakes. I’ve used this technique before, for example to show the separate stints on a chart of laptimes for a particular driver over the course of a race:
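In ggplot terms, the trick is simply to map the stint number to the group aesthetic, so that each stint is drawn as its own disconnected line segment. A minimal sketch, assuming a data frame driverLaps with columns lap, laptime and a stint counter that increments at each pitstop (all hypothetical names):

#One line per stint - the grouping breaks the line at each pitstop
require(ggplot2)
ggplot(driverLaps) + geom_line(aes(x=lap, y=laptime, group=stint))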

And as to where I got that trick? I think it was a bastardisation of a cycle plot, which can be used to show monthly, weekly or seasonal trends over a series of years:

…but it could equally have been a trick highlighted by Stephen Few, of disconnecting a timeseries line at the crossing point of one month to the next…

Whatever the case, one of the ideas I always have in mind is whether it may be possible to introduce white space in the form of a break in a line in order to separate out different groups of data in a very simple way.