Data Structure + Narrative Chart = StoryLine?

A couple of years ago, prompted by a query from Michael Smethurst/@fantasticlife (then of the BBC, now of UK Parliament), I put together a post that described several ways for visually exploring the structure of a story or narrative – Narrative Charts Tell the Tale… (see also: From Storymaps to Notebooks).

One of the chart types described was the XKCD inspired narrative chart:


which led to demos (produced by Michael, drawing on a third party library – comic book narrative charts?) such as this one:


The data is supplied in two data files: an XML file that identifies the characters, and a JSON file that contains a list of scenes, with each scene comprising a set of characters associated with the scene.

More recently, the chart style was taken up by ABC News in an attempt to untangle a complicated story around a political scandal:


The code for that demo is available here – Github/abcnews/d3-layout-narrative (also check out the interesting way in which they annotated the source) – and is described in the post Automating XKCD-Style Narrative Charts.

The code library defines the layout engine, with the data for the graphic contained in a separate JSON file that contains a list of characters and a list of scenes:

icacDataCallback({"characters":{"name":"characters", "elements":[{"id":"EO1", "name":"Eddie Obeid", "bio":"A former member of the New South Wales Parliament, Mr Obeid was a Labor powerbroker who ICAC has previously found used his influence within the party to corruptly further coal mining interests for himself and his family.", "affiliation":"ALP", "investigated":"yes", "imageurl":"", "imagecredit":"AAP: Dean Lewins","rowNumber":1},...]},
"scenes":{"name":"scenes","elements":[{"id":"", "title":"", "plot":"Nick Di Girolamo becomes aware of a deal Australian Water Holdings (AWH) has with Sydney Water to provide and manage water and sewerage pipes in Sydney’s north-west that allows AWH to charge all its costs to Sydney Water. By 2007 Mr Di Girolamo is CEO of AWH and a majority owner of the company. ICAC has heard Mr Di Girolamo embarked on a plan to use the contract with Sydney Water to try transform the organisation into a major infrastructure company. ", "characters":"ND1", "date":"2006", "core-participants":"yes", "eightbyfive":"", "ofarrell":"", "rowNumber":1},...]}})

The list order of scenes appears to define the order in which they appear in the chart.
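
By way of a minimal sketch (in Python, with made-up data), here is how the characters-plus-scenes structure might be assembled and read back as one "line" per character. The ids, names and titles are invented, and the assumption that a scene's characters field holds comma-separated ids is mine: the samples above only show a single id.

```python
import json

# Hypothetical, made-up data that mimics the shape of the JSON above:
# characters and scenes each sit in a {"name": ..., "elements": [...]} wrapper.
characters = [
    {"id": "A1", "name": "Character A", "affiliation": "Team X"},
    {"id": "B1", "name": "Character B", "affiliation": "Team Y"},
    {"id": "C1", "name": "Character C", "affiliation": "Team X"},
]

scenes = [
    {"title": "Scene one", "characters": "A1"},
    {"title": "Scene two", "characters": "A1,B1"},
    {"title": "Scene three", "characters": "B1,C1"},
]


def scene_cast(scene):
    """Split a scene's character field into ids (assumes comma-separated ids)."""
    return [c.strip() for c in scene["characters"].split(",") if c.strip()]


def appearances(characters, scenes):
    """Map each character id to the list-order indexes of the scenes it is in."""
    result = {c["id"]: [] for c in characters}
    for index, scene in enumerate(scenes):
        for char_id in scene_cast(scene):
            result.setdefault(char_id, []).append(index)
    return result


def to_payload(characters, scenes):
    """Wrap the two lists in the same outer structure as the sample files."""
    return {
        "characters": {"name": "characters", "elements": characters},
        "scenes": {"name": "scenes", "elements": scenes},
    }


lines = appearances(characters, scenes)
payload_text = json.dumps(to_payload(characters, scenes))
print(lines)
```

Because the scene index comes straight from the list position, reordering the scenes list is all it takes to reorder the chart.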

See also: Y. Tanahashi and K. Ma, “Design Considerations for Optimizing Storyline Visualizations,” in IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2679-2688, Dec. 2012, doi: 10.1109/TVCG.2012.212. [h/t @hydrosquall]

A more recent chart captures the storylines of all the Star Wars movies – the different coloured threads are perhaps a useful device for highlighting players in a political story, or distinguishing teams or players in a sports based storyline?


Again, the structure of the data is based around characters and scenes, with additional metadata elements.

starwarsDataCallback({"characters":{"name":"characters", "elements":[{"id":"R2D", "name":"R2-D2", "bio":"A resourceful astromech droid, R2-D2 served Padmé Amidala, Anakin Skywalker and Luke Skywalker in turn, showing great bravery in rescuing his masters and their friends from many perils. A skilled starship mechanic and fighter pilot's assistant, he formed an unlikely but enduring friendship with the fussy protocol droid C-3PO.", "affiliation":"Light", "initialgroup":"0", "core":"*", "remove":""},...]},
"scenes":{"name":"scenes", "elements":[{"title":"Opening Logos", "plot":"", "episode":"Episode I", "characters":"", "dvaffiliation":"light"},...]}});

The rendering of the charts – from which we can read the story and get an idea of the flow of events – is simply a visual realisation of the way the data is structured and ordered.

Which has got me thinking: could this be a handy way of viewing events detected from something like the F1 timing data? For example, pit stops and accidents are a given in the timing sheets, it’s easy enough to detect when the lead changes, I’ve started exploring things like undercut detection, and so on. The actors are known (the drivers) and events can be sequenced by lap number, or race elapsed time. (Qualifying also presents an opportunity for telling the story using a narrative chart.)
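
As a rough sketch of the idea (Python, invented data), here is one way lap-by-lap running order might be turned into scene records. The driver codes, the scene fields and the comma-separated characters convention are all assumptions for illustration, not a format the charting libraries are known to require.

```python
# Hypothetical lap-by-lap data: for each lap, the driver codes in race order.
# The codes and positions are invented for illustration.
lap_order = [
    ["AAA", "BBB", "CCC"],  # end of lap 1
    ["AAA", "BBB", "CCC"],  # end of lap 2
    ["BBB", "AAA", "CCC"],  # end of lap 3: lead change
    ["BBB", "CCC", "AAA"],  # end of lap 4
]


def lead_change_scenes(lap_order):
    """Emit one scene per change of leader, in lap order."""
    scenes = []
    for lap in range(1, len(lap_order)):
        old_leader = lap_order[lap - 1][0]
        new_leader = lap_order[lap][0]
        if new_leader != old_leader:
            scenes.append({
                "title": "Lead change",
                "lap": lap + 1,  # laps are numbered from 1
                "characters": ",".join([new_leader, old_leader]),
            })
    return scenes


def position_swap_scenes(lap_order):
    """Emit a scene for each pair of drivers whose relative order flips."""
    scenes = []
    for lap in range(1, len(lap_order)):
        before = {d: i for i, d in enumerate(lap_order[lap - 1])}
        drivers = lap_order[lap]
        for i, gainer in enumerate(drivers):
            for loser in drivers[i + 1:]:
                # gainer is now ahead of loser; were they behind a lap ago?
                if before[gainer] > before[loser]:
                    scenes.append({
                        "title": "Position change",
                        "lap": lap + 1,
                        "characters": f"{gainer},{loser}",
                    })
    return scenes


lead_scenes = lead_change_scenes(lap_order)
swap_scenes = position_swap_scenes(lap_order)
print(lead_scenes)
print(swap_scenes)
```

A position flip is not necessarily an on-track overtake (it could just as easily come from a pit stop), so a fuller version would want to cross-check against pit stop data before labelling the scene.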

There are two ways to approach this: first, I could just try to create some data files. Second, I wonder if I could text mine some race reports, treat each paragraph as a possible event, extract driver names (and perhaps even event keywords?) from each paragraph, and then render the race report down as a narrative chart data file? And then start to iterate, improving the race report parser on the one hand, and building story trope generators (and detectors) into the timing sheet analysis on the other, in order to generate storylines automatically?

That is, can we use the narrative chart data format as an intermediary representation for picking apart and analysing human generated race reports, and as a target for automated storypoint identification routines?
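
A first, very naive pass at the report mining route might look something like the following Python sketch. The report text, driver names and keyword list are all invented, and simple whole-word matching stands in for anything resembling proper entity extraction.

```python
import re

# Hypothetical driver lookup and race report, invented for illustration.
drivers = {"AAA": "Archer", "BBB": "Baker", "CCC": "Carter"}

report = """Archer led away from pole while Baker lost two places at the start.

Carter pitted early, which let him undercut Baker after the first stops.

A late safety car bunched up the field but the order did not change."""

keywords = ["pitted", "undercut", "safety car", "start"]


def report_to_scenes(report, drivers, keywords):
    """Treat each paragraph as a candidate scene and pull out who is in it."""
    scenes = []
    paragraphs = [p.strip() for p in report.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        # Whole-word match on each driver name.
        cast = [
            code for code, name in drivers.items()
            if re.search(r"\b" + re.escape(name) + r"\b", paragraph)
        ]
        # Crude substring match for event keywords.
        tags = [k for k in keywords if k in paragraph.lower()]
        if cast:  # paragraphs that name no driver are skipped in this sketch
            scenes.append({
                "plot": paragraph,
                "characters": ",".join(cast),
                "keywords": tags,
            })
    return scenes


scenes = report_to_scenes(report, drivers, keywords)
for scene in scenes:
    print(scene["characters"], scene["keywords"])
```

Skipping paragraphs that name no driver is a design choice made here for simplicity; events such as a safety car period arguably involve everyone, and might deserve a scene of their own.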

See also: Notes on Robot Churnalism, Part I – Robot Writers.

A Couple of Interesting Interactive Data Storytelling Devices

A couple of interesting devices for trying to engage folk in a data mediated story. First up, a chart that is reminiscent in feel of Hans Rosling’s ignorance test, in which (if you aren’t familiar with it) audiences are asked a range of data-style questions and then demonstrate their ignorance by answering from their preconceived ideas, which are invariably wrong (with the result that audiences perform worse than random – or as Hans Rosling often puts it, worse than a chimpanzee). By the by, Rosling recently gave a talk at the World Bank, which included a rendition of the ignorance test. Rosling’s dressing down of the audience – who make stats based policy and help spend billions in the areas covered by Rosling’s questions, yet still demonstrated their complete lack of grasp about what the numbers say – is worth watching alone…

Anyway – the chart comes from the New York Times, in a post entitled You Draw It: How Family Income Predicts Children’s College Chances. A question is posed and the reader is invited to draw the shape of the curve they think describes the relationship between family income and college chances:


Once you’ve drawn your line and submitted it, you’re told how close your answer is to the actual result:

Another display demonstrates the general understanding calculated from across all the submissions.


Textual explanations also describe the actual relationship, putting it into context and trying to explain the basis of the relationship. As ever, a lovely piece of work, and once again with Amanda Cox in the credits…

The second example comes from Bloomberg, and riffs on the idea of immersive stories to produce a chart that gets updated as you scroll through (I saw this described as “scrollytelling” by @arnicas):


The piece is about global warming and shows the effect of various causal factors on temperature change, at first separately, and then in additive composite form to show how they explain the observed increase. It’s a nice device for advocacy, methinks…

It also reminds me that I never got round to trying the Knight Lab Storymap.js with a zoomified/tiled chart image as the basis for a storymap (or should that be, storychart? For other storymappers, see Seven Ways to Create a Storymap). I just paid out the $19 or so for a copy of zoomify to convert large images to tilesets to work with that app, though I guess I really should have tried to hack a solution out with something like Imagemagick (I think that can export tiles?) or Inkscape (which would let me convert a vector image to tiles, I think?). Anyway, I just need a big image to try out now, which I guess I could generate from some F1 data using ggplot?

Narrative Charts Tell the Tale…

A couple of days ago, I got a message from @fantasticlife asking if I’d done any tinkerings around what turned out to be “narrative charts”. I kept misapprehending what he was after (something to do with continuity?!;-), so here’s a summary of various graphical devices for looking at narrative texts that we passed back and forth, along with some we didn’t…

A Sankey diagram typically uses variable thickness lines to show flow between different elements in a system. (For this reason it’s often used to show energy flows through a system, though it can also be used to good effect to show money flows.) The chart Michael linked to comes from xkcd:

xkcd narrative chart

In this chart, we have time along the horizontal x-axis. The y-axis is ambiguous (some sort of nominal ordering?) and the line thickness appears to represent army size.

To a certain extent, this diagram is reminiscent of Minard’s famous chart…

(See also What Makes a Minard? for some contemporary Minard diagrams. Is code available, I wonder?)

However, in the case of Minard’s chart (which I personally don’t like at all!), the x and y co-ordinates represent map co-ordinates – the thick lines aren’t thick lines in a line chart (which a glanced “up and to the right” view might make you assume), they’re flow lines across a map.

I got distracted for a while by the Sankey aspect, and dug around my own bits of code. For example, Generating Sankey Diagrams from rCharts, an rCharts wrapper for the d3.js Sankey diagram. Michael was particularly interested in being able to group lines vertically (though I wasn’t sure what the y-axis would actually correspond to: some loose function of “location”, maybe as a categorical variable? Time was definitely to be on the horizontal x-axis); a posting on Stack Overflow (d3 sankey charts – manually position node along x axis) seemed likely to be able to help with that.

I then started going off on one…

Would a variant of nltk style lexical dispersion plots help, using characters rather than word categories? That would show when a character was in scene, but not much else?

lexical dispersion
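
To make the character-based variant concrete, here is a small Python sketch that records the word offsets at which each character is named and draws a crude text version of the plot. The text and names are invented, and this is plain Python rather than nltk's own dispersion plot.

```python
import re

# Invented snippet of text; the names are placeholders.
text = "Anna met Ben. Ben left. Later Anna found Cara, and Cara told Ben."
characters = ["Anna", "Ben", "Cara"]


def dispersion(text, characters):
    """Return, per character, the word offsets at which their name appears."""
    words = re.findall(r"[A-Za-z']+", text)
    offsets = {name: [] for name in characters}
    for position, word in enumerate(words):
        if word in offsets:
            offsets[word].append(position)
    return offsets, len(words)


def ascii_plot(offsets, length):
    """Draw one row per character with a | wherever they are mentioned."""
    rows = []
    for name, positions in offsets.items():
        marks = ["|" if i in positions else "." for i in range(length)]
        rows.append(f"{name:>5} " + "".join(marks))
    return "\n".join(rows)


offsets, n_words = dispersion(text, characters)
print(ascii_plot(offsets, n_words))
```

As suspected, all this shows is when a character is mentioned; swapping word offsets for scene numbers would give the "in scene" view.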

How about sentence drawing, in which we show “turns” taken by different speakers?

sentence drawing

This shows something, but again, not relevant…

Nor are Kurt Vonnegut’s shapes-of-stories diagrams that plot some sort of emotional state on y and time on x:

Hmmm… Michael wanted to be able to look at scenes on x and presumably some function of location on y. Hmm… why? And how might we actually order those axes? Scenes occur in order in a film or play, but scene is a ranked, ordinal value. That said, scenes also have duration in terms of screentime, which may or may not be the same as the “interval” that the scene portrays in terms of the world it represents (this must have a name? eg a 20 second screen time scene shows a plane flying and this represents x hours in the story). The scene may also have a ‘calendar time’ associated with it in the story – so where you have a flashback scene this corresponds to a previous calendar time in the represented world. Did Michael want any of these dimensions capturing?

Related to shapes of stories, here’s how someone analysed several thousand plots: Examining the arc of 100,000 stories: a tidy analysis.

And then there’s location… how should these be represented? Locations are a distance apart and, perhaps more importantly from a continuity point of view, a travel time apart; as well as maybe a timezone difference apart. Did that need capturing in any way? Ordering axes for this could be quite hard if we wanted close things in space (distance? travel time?) to be close together on a single axis (A is 10 minutes from B and C, B is ten minutes from C: how do you show that intransitive relation on a single dimension?) [Maybe relevant? Storygraph: Extracting patterns from spatio-temporal data, A Shrestha et al., Advances in Visual Computing.] Hmm… If we can capture distance between locations, and some sensible notion of time relating to scenes, could we maybe use line thickness to show that a person has lots of time to move between one (time, location) and another, as compared to scenetime? Do filmwriters have tools to support this? Do the police…?! Is the Mythology Engine relevant?

How about thinking about it as a graph? I’ve used Gephi before as a foil for getting me to think about ordered series as connected events in a graph – for example, Visualising F1 Timing Sheet Data. If we encode scene number as the x-coordinate and location number as the y-coordinate, with each graph line being the connected series of scenes a particular individual is in, then we can simply use a line chart to connect “individual lines” to different scene and location numbers. We’d also have a couple of extra dimensions to play with – node size and node colour, at each location. We’d also have the opportunity to play with edge (that is, line) colour and edge thickness?
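
Sketching that encoding out in Python (with invented scenes), each character ends up with a polyline of (scene number, location number) points, and consecutive points give the graph edges. Numbering locations by order of first appearance is an arbitrary choice made just for this example; the ordering problem described above is not solved here.

```python
# Invented scenes: each has a location and a cast; list order is scene order.
scenes = [
    {"location": "Castle", "cast": ["A", "B"]},
    {"location": "Forest", "cast": ["B", "C"]},
    {"location": "Castle", "cast": ["A"]},
    {"location": "Harbour", "cast": ["A", "B", "C"]},
]


def character_lines(scenes):
    """Return the location numbering plus one (x, y) polyline per character."""
    location_index = {}
    lines = {}
    for x, scene in enumerate(scenes):
        # Number locations in order of first appearance (an arbitrary choice).
        y = location_index.setdefault(scene["location"], len(location_index))
        for who in scene["cast"]:
            lines.setdefault(who, []).append((x, y))
    return location_index, lines


def edges(line):
    """Consecutive point pairs: the graph edges for one character."""
    return list(zip(line, line[1:]))


location_index, lines = character_lines(scenes)
for who, line in lines.items():
    print(who, edges(line))
```

Node size, node colour, edge colour and edge thickness could then be hung off whatever extra attributes each scene or character carries.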

Maybe I need to try to do some demos? But no time for that right now…

How about trying to find some? Here are some discovered via @jamesjefferies:

Here’s a view of connected (by travel between) locations in Game of Thrones:

game of thrones connected places

There’s also an animation of events in Game of Thrones, but I can’t quite figure out how to read it?!

game of thrones events

Let’s go back to the sort of thing Michael was after – narrative charts…

@imhelenj found a related if cluttered interactive describing the evolution of web tech:

web history narrative chart

Then Michael shared a link to Comic Book Narrative Charts, a project for automatically generating xkcd style narrative charts:

xkcd narrative chart d3js

Hovering over these charts, I noticed they were interactive d3.js charts. A quick View Source and the code for generating the chart dynamically from a characters file and a narrative file appeared to be there. Which I think is what Michael wanted all along…!

(By the by, the post also describes how the developers started thinking about fixing the vertical y-coordinate values. Here’s another example of someone thinking aloud around producing a narrative chart for the Holy Week story.)

Ho hum, an interesting set of detours nonetheless – and it got me thinking about the time-space complexity of a scene based tale that could keep me confused for weeks! :-)

PS this is quite interesting – visualising a process, via Tactical Tech Drawing By Numbers project:

visualise process

PPS some more bits: @r4isstatic points to Some visualisations of stories and narratives, another summary post similar to this one. Also via Paul Rissen, and picking up on whether the police have any interesting actor/event/time/location diagramming techniques, Vispol – An Interactive Scenario Visualization.

Elsewhere, I find Storyline Visualizations, which includes a paper (Design Considerations for Optimizing Storyline Visualizations, Y Tanahashi, and K-L Ma, IEEE Trans on Visualisation and Computer Graphics, 18(12) 2012, pp2679-2688) and some python code.

PPPS Some more… A collection by Stewart McKie of techniques for visualising screenplays: Screenplay Visualization: Concepts and Practice. The posts I wrote on the Digital Worlds game design uncourse blog about narrative structure. Sort of via Scott Wilson, some crime analysis software from Xanalys (Link Explorer – White Paper) which includes descriptions of an event chart, a transaction chart and an activity timeline:

xanalys event chart

xanalys transaction chart

xanalys activity timeline

Via the comments, this rather lovely animated discourse map:

trinker.github animated discourse map

The Chart Equivalent of Comic Sans..?

Whilst looking at the apparently conflicting results from a couple of recent polls by YouGov on press regulation (reviewed in a piece by me over on OpenLearn: Two can play at that game: When polls collide in support of a package on the OU/BBC co-produced Radio 4 programme, More Or Less), my eye was also drawn to the different ways in which the survey results were presented graphically.

The polls were commissioned by The Sun newspaper on the one hand, and the Media Standards Trust/Hacked Off on the other. If you look at the poll data (The Sun/YouGov [PDF] and Media Standards Trust/YouGov [PDF] respectively), you’ll see that it’s reported in a standard format. (I couldn’t find actual data releases, but the survey reports look as if they are generated in a templated way, so getting the core of a generic scraper together for them shouldn’t be too difficult…) But how was that represented to readers (text based headlines and commentary aside)?

Here are a couple of grabs from the Sun’s story (State-run watchdog ‘will gag free press’):

Pie-charts in 3D, with a tilt… gorgeous… erm, not… And the colour choice for the bar chart inner-column text is a bit low on contrast compared to the background, isn’t it?

It looks a bit like the writer took a photo of the print edition of the story on their phone, uploaded it and popped it into the story, doesn’t it?

I guess credit should be given for keeping the risk responses separate in the second image, when they could have just gone for the headline figures as pulled out in the YouGov report:

So what I’m wondering now is the extent to which a chart’s “theme” or style reflects the authority or formal weight we might ascribe to it, in much the same way as different fonts carry different associations? Anyone remember the slating that CERN got for using Comic Sans in their Higgs-Boson discovery announcement (eg here, here or here)?

Things could hardly have been more critical if they had used CrappyGraphs or an XKCD style chart generator (as for example described in Style your R charts like the Economist, Tableau … or XKCD ; or alternatively, XKCD-style for matplotlib).

XKCD - Science It works [XKCD]

Oh, hang on a minute, it almost looks like they did!

Anyway – back to the polls. The Media Standards Trust reported on their poll using charts that had a more formal look about them:

The chart annotations are also rather clearer to read.

So what, if anything, do we learn from this? That maybe you need to think about chart style, in the same way you might consider your font selection. From the R charts like the Economist, Tableau … or XKCD post, we also see that some of the different applications we might use to generate charts have their own very distinctive, and recognisable, style (as do many Javascript charting libraries). A question therefore arises about the extent to which you should try to come up with your own distinctive (but still clear) style that fits the tone of your communication, as well as its context and in sympathy with any necessary branding or house styling.

PS with respect to the Sun’s copyright/syndication notice, and my use of the images above:

I haven’t approached the copyright holders seeking permission to reproduce the charts here, but I would argue that this piece is just working up to being research into the way numerical data is reported, as well as hinting at criticism and review. So there…

PPS As far as bad charts go, misrepresentations and underhand attempts at persuasion through graphic style are also possible, as SimplyStatistics describes: “The statisticians at Fox News use classic and novel graphical techniques to lead with data”. See also: OpenLearn – Cheating with Charts.

its the Gramma an punctuashun wot its’ about, Rgiht?

This is another of those confluence style posts, where a handful of things I’ve read in quick succession seem to phase lock in my mind:

(brought to mind in part via @downes a week or so ago: How to Synch 32 Metronomes)

The first was a post by Alan Levine on Making Text Work, which describes a simple technique for making text overlays on photographs create a more coherent image than just slapping text on:

One of the techniques I use on presentation slides from time to time is a solid filled banner or stripe containing text, sometimes opaque, sometimes transparent. I wouldn’t claim to be much of an artist but it makes the slides slightly more interesting than a header with an image in the middle of the slide, surrounded by whitespace…

(Which reminds me: maybe I should look through Presentation Zen Design and Presentation Zen: Simple Ideas on Presentation Design and Delivery again…)

Reading Alan’s post, it occurred to me that once you get the idea of using a solid or semi-transparent solid filled background to a text label, you tend to remember it and add it to your toolbox of presentation ideas (of course, you might also forget and later rediscover this sort of trick… My own slides tend to crudely follow particular design ideas I’ve recently picked up on, albeit in a crude and often not very well polished way, that I’ve decided I want to try to explore…) In the slide above, several tricks are evident: the solid filled text label, the positioning of it, the backgrounded blog post that actually serves as a reference for the slide (you can do a web search for the post title to learn more about the topic), the foregrounded image, rotated slightly, and so on.

The thing that struck me about Alan’s post was that it reminded me of a time before I was really aware of using a solid filled label to foreground a piece of text, which in turn caused me to reflect on other things I now take for granted as ideas that I can draw on and combine with other ideas.

In the same way we learn to spell, and learn to use punctuation, and start to pick up on the grammar that structures a language, so we can use those rules to construct ever more complex sentences. Once we know the rules, it becomes a choice as to whether or not we employ them.

Here’s an example of how we might come to acquire a new design idea, drawn from a brief conversation with @mediaczar a couple of nights ago when Mat asked if anyone knew the name of this sort of chart combination:

I didn’t know what to call the chart, but thought it should be easy enough to try to wrangle one together using ggplot in R, guessing that a geom_errorbarh() might work; Mat came back suggesting geom_crossbar().

Here’s a minimal code fragment I used to explore it:

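#plot a horizontal bar across a bar chart
#(assumes ggplot2 is loaded, g is an existing ggplot bar chart, and the data frame c has columns d and f)
g = g + geom_crossbar(data = c, aes(x = d, y = f, ymax = f + 1, ymin = f - 1), colour = NA, fill = "red", width = 0.9, alpha = 0.5)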

Here’s an example of how I used it – in an as yet unlabelled sketch showing, for a particular F1 driver, their grid position for each race (red bar) and the number of places they gained (or lost) during the first lap:

So now I know how to achieve that effect…

Now for two more things… Just after reading Alan’s post, I read a post by James Allen on possible race strategies for the Japanese Grand Prix:

The first thing that struck me was that even if you vaguely understand how a race chart works, the following statement may not be readily obvious to you from the top chart (my emphasis):

Three stops is actually faster [taking new softs on lap 12], as the [upper] graph … shows, but it requires the driver to pass the two stoppers in the final stint. If there is a safety car, it will hand an advantage to the two stoppers.

So, can you see why the three stopper (the green line) “requires the driver to pass the two stoppers in the final stint”? Let’s step back for a moment – can you see which bit of the graphs represent an overtake?

This is actually quite a complex graph to read – the axes are non-obvious, and not that easy to describe, though you soon pick up a feeling for how the chart works (honest!). Getting a sensible interpretation working for the surprising feature – the sharp vertical drops – is one way of getting into this chart, as well as looking at how the lines are positioned at the extreme x values (that is, at the end of the first and last laps).

The second thing that occurred to me was that we could actually remove the fragment of the line that shows the pitstop and instead show a separate line segment for each stint for each driver, and hence remove the line crossings that do not represent required overtakes. I’ve used this technique before, for example to show the separate stints on a chart of laptimes for a particular driver over the course of a race:

And as to where I got that trick? I think it was a bastardisation of a cycle plot, which can be used to show monthly, weekly or seasonal trends over a series of years:

…but it could equally have been a Stephen Few highlighted trick of disconnecting a timeseries line at the crossing point of one month to the next…

Whatever the case, one of the ideas I always have in mind is whether it may be possible to introduce white space in the form of a break in a line in order to separate out different groups of data in a very simple way.
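
As a minimal illustration of the line-break trick (Python, invented lap times), the series for one driver can be split into separate stints at the pit laps, with each stint then drawn as its own line segment. Dropping the pit lap itself from the plotted data is an assumption made for this sketch; you might equally keep it as the last point of the stint.

```python
# Invented lap times (seconds) for one driver.
lap_times = [95.2, 94.8, 94.9, 112.3, 93.9, 93.7, 93.8, 111.9, 93.1, 93.0]
pit_laps = {4, 8}  # laps (numbered from 1) on which the driver pitted


def split_into_stints(lap_times, pit_laps):
    """Group (lap, time) points into stints, dropping the slow pit laps."""
    stints, current = [], []
    for lap, time in enumerate(lap_times, start=1):
        if lap in pit_laps:
            if current:
                stints.append(current)
            current = []  # this is the break in the line
        else:
            current.append((lap, time))
    if current:
        stints.append(current)
    return stints


stints = split_into_stints(lap_times, pit_laps)
for number, stint in enumerate(stints, start=1):
    print(f"stint {number}: laps {stint[0][0]} to {stint[-1][0]}")
```

Each inner list can be handed to a plotting library as a separate series, which is what produces the white space between stints.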

“Visualising” High Frequency Trading With Sound (Sonification)

Over the summer, an episode of one of my favourite audio/radio programmes, the OU co-produced Radio 4 programme More or Less included a package on high frequency trading. To illustrate how fast high frequency trading works, the programme used a beautiful bit of sonification (the audio equivalent of a graphical data visualisation). You can listen to it on iPlayer here: How fast is high-frequency trading?

Just in case it’s blocked outside the UK, here’s a version I cropped from the downloaded podcast myself [MP3]:

Tim Harford, presenter of the programme, also wrote about high frequency trading here: High-frequency trading and the $440m mistake. Interestingly, the article also includes the audio package… Here’s a link to the original programme on iPlayer: How to lose money, fast

A couple of weeks prior to the More or Less programme (coincidence? Or inspiration?) a blog post about a data sketch done by NYT’s incredibly creative Amanda Cox referred to a similar audio technique to illustrate(?!) close finishes in sprint races: Why Amanda Cox should be in charge of audio. The post also referred back to a New York Times piece from February 2012 capturing just how closely some of the 2010 Winter Olympics races finished: Fractions of a Second: An Olympic Musical.
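
For anyone wanting to try the close-finish effect for themselves, here is a rough Python sketch using only the standard library. The finishing times are invented, and the approach (a short beep placed at each finisher's gap to the winner) is my guess at the general idea, not a description of how either the radio package or the NYT piece was actually produced.

```python
import io
import math
import struct
import wave

# Invented finishing times in seconds for a sprint-style race.
finish_times = [9.63, 9.75, 9.79, 9.80, 9.88]


def beep_track(finish_times, rate=8000, beep_len=0.03, freq=880.0, time_scale=1.0):
    """Return 16-bit mono samples with a short beep at each gap to the winner."""
    fastest = min(finish_times)
    starts = [round((t - fastest) * time_scale * rate) for t in finish_times]
    n_beep = round(beep_len * rate)
    samples = [0] * (max(starts) + n_beep)
    for start in starts:
        for i in range(n_beep):
            # Overlapping beeps simply overwrite each other in this sketch.
            samples[start + i] = int(12000 * math.sin(2 * math.pi * freq * i / rate))
    return samples


def to_wav_bytes(samples, rate=8000):
    """Pack the samples into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


samples = beep_track(finish_times)
wav_bytes = to_wav_bytes(samples)
print(len(samples), "samples,", len(wav_bytes), "bytes")
```

Writing wav_bytes out to a file gives something playable; raising time_scale stretches the gaps if the beeps run together too much to tell apart.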

So now I’m wondering – have you ever seen, erm, heard a presentation that has used audio, rather than graphics, to illustrate a data story?

See also: Robot wars: How high frequency trading changed global markets.

Olympics Swimming Lap Charts from the New York Times

Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the graphical devices widely used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:

Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:

(Great examples of how to read race history charts can be found on the IntelligentF1 blog. For the general case, see The IntelligentF1 model.)
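
Here is a small Python sketch of one reading of the race history calculation, using invented lap times: each car's elapsed time at the end of each lap is compared with an imaginary reference car that laps at the winner's average laptime throughout. That interpretation is an assumption on my part, so treat the numbers as illustrative only.

```python
# Invented lap times in seconds; driver codes are placeholders.
lap_times = {
    "AAA": [100.0, 98.0, 99.0, 97.0],
    "BBB": [101.0, 99.0, 98.0, 98.0],
}


def cumulative(times):
    """Running total of lap times: elapsed race time at the end of each lap."""
    total, out = 0.0, []
    for t in times:
        total += t
        out.append(total)
    return out


def race_history(lap_times):
    """Per driver, the gap to a reference car lapping at the winner's average."""
    elapsed = {d: cumulative(t) for d, t in lap_times.items()}
    # The winner is taken to be the driver with the lowest total race time.
    winner = min(elapsed, key=lambda d: elapsed[d][-1])
    reference_lap = elapsed[winner][-1] / len(elapsed[winner])
    return {
        d: [reference_lap * (lap + 1) - e for lap, e in enumerate(times)]
        for d, times in elapsed.items()
    }


history = race_history(lap_times)
for driver, gaps in history.items():
    print(driver, gaps)
```

With this sign convention a line that is rising means the car is lapping faster than the reference, and the winner's line finishes on zero.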

Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.

I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:

One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.

A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):

(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)

All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
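
To show what "calculated from lap time data" might mean in practice, here is a Python sketch (invented data) that derives a lap chart from lap times and then the race summary values from the lap chart. Grid positions are supplied separately, since they can't be recovered from lap times, and the sketch ignores retirements and lapped cars.

```python
# Invented lap times (seconds) and grid order; driver codes are placeholders.
lap_times = {
    "AAA": [100.0, 98.0, 99.0],
    "BBB": [99.0, 98.5, 100.5],
    "CCC": [101.0, 96.0, 98.0],
}
grid = ["AAA", "BBB", "CCC"]


def lap_chart(lap_times):
    """Running order at the end of each lap, ranked by elapsed race time."""
    n_laps = min(len(t) for t in lap_times.values())
    elapsed = {d: 0.0 for d in lap_times}
    chart = []
    for lap in range(n_laps):
        for d in lap_times:
            elapsed[d] += lap_times[d][lap]
        chart.append(sorted(elapsed, key=elapsed.get))
    return chart


def race_summary(grid, chart):
    """Grid, end of lap 1, final, and best/worst positions for each driver."""
    summary = {}
    for d in grid:
        positions = [order.index(d) + 1 for order in chart]
        summary[d] = {
            "grid": grid.index(d) + 1,
            "lap1": positions[0],
            "final": positions[-1],
            "best": min(positions),
            "worst": max(positions),
        }
    return summary


chart = lap_chart(lap_times)
summary = race_summary(grid, chart)
print(chart)
print(summary["CCC"])
```

The second function only needs the running order at the end of each lap, which is the sense in which some of the charts can be produced from position data alone.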

A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:

Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.

[Here’s an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they’d done similar views in previous years using a Flash component, but the current iteration uses d3.js]

The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.

Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:

(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)

Of course, without the data, it can be difficult trying to generate these charts…

…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)

PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.

Creating Olympic Medal Treemap Visualisations Using OTS R Libraries

In London Olympics 2012 Medal Tables At A Glance? I posted some treemap visualisations of the Olympics medal tables generated using a Google Visualisation Chart treemap component. I thought it might be worth posting a quick R generated example too, using the off-the-shelf/straight out of CRAN treemap component. (If you want to play along, download the data as CSV from here.)

The original data looks like this:

but ideally we want it to look like this:

I posted a quick recipe showing how to do this sort of reshaping in Google Refine, but in R it’s even easier – just melt the Gold, Silver and Bronze columns into a pair of columns…

Here’s the full code to do the reshaping and generate a simple treemap:

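#load in the data from a file
odata = read.csv("~/Downloads/nbc_olympic_medalscrape.csv")

#Reshape the data: melt the Gold, Silver and Bronze columns into variable/value pairs
#(assumes cc and Event are the id columns, as used in the index setting below)
library(reshape2)
xx = melt(odata, id.vars = c("cc", "Event"), measure.vars = c("Gold", "Silver", "Bronze"))

#And generate the treemap in the simplest possible way
library(treemap)
treemap(xx,
       index=c("cc", "Event","variable"),
       vSize="value", vColor='value',
       type="value")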

And here’s the treemap, with country blocks ordered in this case by total medal haul:

(To view the countries ordered according to number of Golds, a quick fix would be to order the hierarchy with the medal type shown at the highest level of the tree: index=c("variable","cc", "Event").)

Generating variant views (I described six variants in the original post) is easy enough – just tweak the order of the elements of the index setting. (I should have named the melt created columns something more sensible than the defaults, shouldn’t I?) Note that the vSize and vColor setting of value (sic) refers to the name of the melt created column that holds the medal counts; the variable column is the one that identifies the medal type. The type setting of value says use the numerical value (i.e. it’s literal – it doesn’t refer to a column name…).

Out of the can – simples enough… So what might we be able to do with a little bit more treatment? Examples via the comments, please ;-)

London Olympics 2012 Medal Tables At A Glance?

Looking at the various medal standings for medals awarded during any Olympic Games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get a sense of how the rankings might change if other medal colours were taken into account in some way.
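As a quick sketch of how a ranking can shift once other medal colours are taken into account, here is a toy comparison in Python. The tallies are made up (they are not real results), and the 3-2-1 weighting is just one assumption about how the colours might be combined.

```python
# Hypothetical medal tallies as (gold, silver, bronze); made up for illustration.
tallies = {"AAA": (3, 0, 0), "BBB": (2, 5, 4), "CCC": (2, 1, 0)}

def gold_first(tallies):
    """Traditional ordering: golds first, then silvers, then bronzes."""
    return sorted(tallies, key=lambda country: tallies[country], reverse=True)

def weighted(tallies, weights=(3, 2, 1)):
    """Ordering by a weighted points score (the weights are an assumption)."""
    def score(country):
        return sum(w * n for w, n in zip(weights, tallies[country]))
    return sorted(tallies, key=score, reverse=True)

traditional_order = gold_first(tallies)
weighted_order = weighted(tallies)
```

With these numbers the country with the most golds heads the traditional ordering, but the country with the deepest haul of silvers and bronzes heads the weighted one.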

Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing Olympic Games Data.)

Although not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.

Whilst there is still a lot more thinking I’d like to do around the use of treemaps for visualising Olympics medal data, here are a handful of quick sketches constructed using Google visualisation chart treemap components, and data scraped from NBC.

The data I have scraped is represented using rows of the form:

Country, Event, Gold, Silver, Bronze

where Event is at the level of “Swimming”, “Cycling” etc. rather than at finer levels of detail (it’s really hard to find data at even this level of detail in an easily grabbable form).
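To use the medal type as a level in a hierarchy, rows of this form need unpivoting so that each row carries a single medal type and a count. A minimal Python sketch of that reshaping follows; the sample rows are invented for illustration, and the MedalType and Count column names are my own labels rather than anything in the scraped data.

```python
# Hypothetical sample rows in the Country, Event, Gold, Silver, Bronze form
# (the counts are made up for illustration, not real results).
rows = [
    {"Country": "AAA", "Event": "Swimming", "Gold": 2, "Silver": 1, "Bronze": 0},
    {"Country": "BBB", "Event": "Cycling", "Gold": 0, "Silver": 3, "Bronze": 1},
]

MEDALS = ("Gold", "Silver", "Bronze")

def melt_medals(rows):
    """Unpivot the three medal columns into MedalType/Count pairs."""
    long_rows = []
    for row in rows:
        for medal in MEDALS:
            long_rows.append({
                "Country": row["Country"],
                "Event": row["Event"],
                "MedalType": medal,   # assumed column name
                "Count": row[medal],  # assumed column name
            })
    return long_rows

long_rows = melt_medals(rows)
```

Each original row becomes three rows, one per medal colour, which is the shape a three level Country, Event, MedalType hierarchy needs.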

I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:

  • MedalType, Country, Event
  • MedalType, Event, Country
  • Event, MedalType, Country
  • Event, Country, MedalType
  • Country, MedalType, Event
  • Country, Event, MedalType

Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.
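The six orderings are just the permutations of the three levels, so they can be generated rather than typed out. The Python sketch below does that, and also rolls a few long-format records up into a nested structure for any chosen ordering. The records are made up, and the Count field name is an assumption of mine.

```python
from itertools import permutations

LEVELS = ("MedalType", "Country", "Event")

def orderings(levels=LEVELS):
    """All the ways of arranging the hierarchy levels (six for three levels)."""
    return list(permutations(levels))

def nest(records, order):
    """Roll records up into nested dicts following `order`.

    The innermost dict maps the last level's values to summed counts.
    """
    tree = {}
    for rec in records:
        node = tree
        for level in order[:-1]:
            node = node.setdefault(rec[level], {})
        leaf = rec[order[-1]]
        node[leaf] = node.get(leaf, 0) + rec["Count"]
    return tree

# Hypothetical long-format records (made-up counts, not real results).
records = [
    {"Country": "AAA", "Event": "Swimming", "MedalType": "Gold", "Count": 2},
    {"Country": "AAA", "Event": "Cycling", "MedalType": "Gold", "Count": 1},
    {"Country": "BBB", "Event": "Swimming", "MedalType": "Silver", "Count": 3},
]

views = {order: nest(records, order) for order in orderings()}
```

Looking up views[("Event", "Country", "MedalType")] then gives the structure behind an Event-Country-Medal treemap, and similarly for the other five orderings.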

First up, ordered by Medal, Country, Event:

This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:

The colouring is a bit off – the Google component is not as directly scriptable as a d3js treemap, for example – but with a bit of experimentation it may be possible to find a colour scheme that better indicates the number of medals allocated in each case.

The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (i.e. we’re happy to take a point of view based on the award of any medal type):

The Country-Event-Medal view gives us a view of the relative strength in depth of each country in each event:

and the Country-Medal-Event view allows us to then tunnel in on the gold winning events:

I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments:-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3js seems like it’d be a good bet, although I need to do a quick review of R based treemap libraries too to see if there is anything there that may be appropriate.

It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as using the Google chart component to doodle some simpler treemaps that don’t reveal the lower level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)

Data allowing, it would also be interesting to be able to get more detailed data visualised (for example, down to the level of actual events – 100m and Long Jump, for example, rather than Track and Field – as well as the names of individual medalists).

PS for another Olympics related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances

PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship

Pragmatic Visualisation – GDS Transaction Data as a Treemap

A week or two ago, the Government Digital Service started publishing a summary document containing website transaction stats from across central government departments (GDS: Data Driven Delivery). The transactional services explorer uses a bubble chart to show the relative number of transactions occurring within each department:

The sizes of the bubbles are related to the volume of transactions (although I’m not sure what the exact relationship is?). They’re also positioned on a spiral, so as you work clockwise round the diagram starting from the largest bubble, the next bubble in the series is smaller (the “Other” catchall bubble is the exception, sitting as it does on the end of the tail irrespective of its relative size). This spatial positioning helps communicate relative sizes when the actual diameter of two bubbles next to each other is hard to differentiate between.

Clicking on a link takes you down into a view of the transactions occurring within that department:

Out of idle curiosity, I wondered what a treemap view of the data might reveal. The order of magnitude differences in the number of transactions across departments meant that the resulting graphic was dominated by departments with large numbers of transactions, so I did what you do in such cases and instead set the size of the leaf nodes in the tree to be the log10 of the number of transactions in a particular category, rather than the actual number of transactions. Each node higher up the tree was then simply the sum of values in the lower levels.
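The sizing rule can be sketched in a few lines of Python. The department and service names and the transaction counts below are all made up for illustration, and the sketch assumes every count is at least 1 (log10 is undefined for zero).

```python
import math

# Hypothetical department -> service -> transaction counts (made-up numbers).
transactions = {
    "Dept A": {"Service 1": 1_000_000, "Service 2": 100},
    "Dept B": {"Service 3": 10_000, "Service 4": 1_000, "Service 5": 10},
}

def leaf_size(count):
    """Leaf size is the log10 of the transaction count (assumes count >= 1)."""
    return math.log10(count)

def department_sizes(data):
    """Each department's size is the sum of its leaf sizes."""
    return {
        dept: sum(leaf_size(n) for n in services.values())
        for dept, services in data.items()
    }

sizes = department_sizes(transactions)
raw_totals = {dept: sum(services.values()) for dept, services in transactions.items()}
```

In this toy case the two departments end up with about the same block size, even though one handles roughly a hundred times as many transactions as the other: the department with fewer transactions makes up the difference through having a greater variety of services. That is the sense in which the summed log values reward both the number and the variety of transactions.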

The result is a treemap that I decided shows “interestingness”, which I defined for the purposes of this graphic as being some function of the number and variety of transactions within a department. Here’s a nested view of it, generated using a Google chart visualisation API treemap component:

The data I grabbed had a couple of usable structural levels that we can make use of in the chart. Here’s going down to the first level:

…and then the second:

Whilst the block sizes aren’t really a very good indicator of the number of transactions, it turns out that the default colouring does indicate relative proportions in the transaction count reasonably well: deep red corresponds to a low number of transactions, dark green a large number.

As a management tool, I guess the colours could also be used to display percentage change in transaction count within an area month on month (red for a decrease, green for an increase), though a slightly different size transformation function might be sensible in order to draw out the differences in relative transaction volumes a little more?
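A minimal Python sketch of that colouring idea might look like the following. The function names, the grey used for "no change" and the example numbers are all assumptions of mine, not part of the GDS data or the Google chart API.

```python
def pct_change(previous, current):
    """Percentage change from one month to the next (previous must be non-zero)."""
    return 100.0 * (current - previous) / previous

def change_colour(previous, current):
    """Red for a decrease, green for an increase, grey (assumed) for no change."""
    change = pct_change(previous, current)
    if change > 0:
        return "green"
    if change < 0:
        return "red"
    return "grey"

# Hypothetical month on month counts for one service.
last_month, this_month = 200, 150
colour = change_colour(last_month, this_month)
```

In practice a continuous colour scale keyed on the size of the change would probably say more than a two colour split, but the sign test is enough to show the mapping.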

I’m not sure how well this works as a visualisation that would appeal to hardcore visualisation puritans, but as a graphical macroscopic device, I think it does give some sort of overview of the range and volume of transactions across departments that could be used as an opening gambit for a conversation with this data?