A couple of interesting devices for trying to engage folk in a data mediated story. First up, a chart that is reminiscent in feel to Hans Rosling’s ignorance test, in which (if you aren’t familiar with it) audiences are asked a range of data-style questions and then demonstrate their ignorance based on their preconceived ideas about what they think the answer to the question is – and which is invariably the wrong answer (with the result that audiences perform worse than random – or as Hans Rosling often puts is, worse than a chimpanzee; by the by, Rosling recently gave a talk at the World Bank, which included a rendition of the ignorance test. Rosling’s dressing down of the audience – who make stats based policy and help spend billions in the areas covered by Rosling’s questions, yet still demonstrated their complete lack of grasp about what the numbers say – is worth watching alone…).
Anyway – the chart comes from the New York Times, in a post entitled You Draw It: How Family Income Predicts Children’s College Chances. A question is posed and the reader is invited to draw the shape of the curve they think describes the relationship between family income and college chances:
Another display demonstrates the general understanding calculated from across all the submissions.
Textual explanations also describe the actual relationship, putting it into context and trying to explain the basis of the relationship. As ever, a lovely piece of work, and once again with Amanda Cox in the credits…
The second example comes from Bloomberg, and riffs on the idea of immersive stories to produce a chart that gets updated as you scroll through (I saw this described as “scrollytelling” by @arnicas):
The piece is about global warming and shows the effect of various causal factors on temperature change, at first separately, and then in additive composite form to show how they explain the observed increase. It’s a nice device for advocacy, methinks…
It also reminds me that I never got round to trying to the Knight Lab Storymap.js with a zoomified/tiled chart image as the basis for a storymap (or should that be, storychart? For other storymappers, see Seven Ways to Create a Storymap). I just paid out the $19 or so for a copy of zoomify to convert large images to tilesets to work with that app, though I guess I really should have tried to hack a solution out with something like Imagemagick (I think that can export tiles?) or Inkscape (which would let me convert a vector image to tiles, I think?). Anyway, I just need a big image to try out now, which I guess I could generate from some F1 data using ggplot?
A couple of days ago, I got a message from @fantasticlfe asking if I’d done any tinkerings around what turned out to be “narrative charts”. I kept misapprehending what he was after (something to do with continuity?!;-), so here’s a summary of various graphical devices for looking at narrative texts that we passed back and forth, along with some we didn’t..
— Michael Smethurst (@fantasticlife) April 3, 2014
A Sankey diagram typically uses variable thickness lines to show flow between different elements in a system. (For this reason it’s often used to show energy flows throuygh a system, though it can also be used to good effect to show money flows.) The chart Michael linked to comes from xkcd:
In this chart, we have time along the horizontal x-axis. The y-axis is ambiguous (some sort of nominal ordering?) and the line thickness appears to represent army size.
To a certain extent, this diagram is reminiscent of Minard’s famous chart…
(See also What Makes a Minard? for some contemporary Minard diagrams. Is code available, I wonder?)
However, in the case of Minard’s chart (which I personally don’t like at all!), the x-y and co-ordinates represent map co-ordinates – the thick lines aren’t thick lines in a line chart (which a glanced “up and to the right” view might make you assume), they’re flow lines across a map.
I got distracted for a while by the Sankey aspect, and dug around my own bits of code. For example, Generating Sankey Diagrams from rCharts, an rCharts wrapper for the d3.js Sankey diagram. Michael was particularly interested in being able to group lines vertically (though I wasn’t sure what the y-axis would actually correspond to: some loose function of “location”, maybe as a categorical variable? Time was definitely to be on the horizontal x-axis); a posting on Stack Overflow (d3 sankey charts – manually position node along x axis) seemed likely to be able to help with that.
I then started going off on one…
Would a variant of nltk style lexical dispersion plots help, using characters rather than word categories? That would show when a character was in scene, but not much else?
How about sentence drawing, in which we show “turns” taken by different speakers?
This shows something, but again, not relevant…
Nor are Kurt Vonnegut’s shapes-of-stories diagrams that plot some sort of emotional state on y and time on x:
Hmmm… Michael wanted to be able to look at scenes on x and presumably some function of location on y. Hmm… why? And how might we actually order those axes? Scenes occur in order in a film or play, but scene is a ranked, ordinal value. That said, scenes also have duration in terms of screentime, which may or may not be the same as the “interval” that the scene portrays in terms of the world it represents (this must have a name? eg a 20 second screen time scene shows a plane flying and this represents x hours in the story). The scene may also have a ‘calendar time’ associated with it in the story – so where you have a flashback scene this corresponds to a previous calendar time in the represented world. Did Michael want any of these dimensions capturing?
And then there’s location… how should these be represented? Locations are a distance apart and, perhaps more importantly from a continuity point of view, a travel time apart; as well as maybe a timezone difference apart. Did that need capturing in any way? Ordering axes for this could be quite hard if we wanted close things in space (distance? travel time?) to be close together on a single axis (A is 10 minutes from B and C, B is ten minutes from C: how do you show that intransitive relation on a single dimension? [Maybe relevant? Storygraph: Extracting patterns from spatio-temporal data, A Shrestha et al., Advances in Visual Computing.] Hmm… If we can capture distance between locations, and some sensible notion of time relating to scenes, could we maybe use line thickness to show that a person has lots of time to move between one (time, location) and another, as compared to scenetime? Do filmwriters have tools to support this? Do the police…?! Is the Mythology Engine relevant?
How about thinking about it as a graph? I’ve used Gephi before as a foil for getting me to think about ordered series as connected events in a graph – for example, Visualising F1 Timing Sheet Data. If we encode scene number as the x-coordinate and location number as the y-coordinate, with each graph line being the connected series of scenes a particular individual is in, then we can simply use a line chart to connect “individual lines” to different scene and location numbers. We’d also have a couple of extra dimensions to play with – node size and node colour, at each location. We’d also have the opportunity to play with edge (that is, line) colour and edge thickness?
Maybe I need to try to do some demos? But no time for that right now…
How about trying to find some? Here are some discovered via @jamesjefferies:
Here’s a view of connected (by travel between) locations in Game of Thrones:
There’s also an animation of event in Game of Thrones, but I can’t quite figure out how to read it?!
Let’s go back to the sort of thing Michael was after – narrative charts..
@imhelenj found a related if cluttered interactive describing the evolution of web tech:
Then Michael shared a link to Comic Book Narrative Charts, a project for automatically generating xkcd style narrative charts:
Hovering over these charts, I noticed they were interactive d3.js charts. A quick View Source and the code for generating the chart dynamically from a characters file and a narrative file appeared to be there. Which I think is what Michael wanted all along…!
(By the by, the post also describes how the developers started thinking about fixing the vertical y-coordinate values. Here’s another example of someone thinking aloud around producing a narrative chart for the Holy Week story.)
Ho hum, an interesting set of detours nonetheless – and it got me thinking about the time-space complexity of a scene based tale that could keep be confused for weeks! :-)
PS this is quite interesting – visualising a process, via Tactical Tech Drawing By Numbers project:
PPS some more bits: @r4isstatic points to Some visualisations of stories and narratives, another summary post similar to this one. Also via Paul Rissen, and picking up on whether the police have any interesting actor/event/time/location diagramming techniques, Vispol – An Interactive Scenario Visualization.
Elsewhere, I find Storyline Visualizations, which includes a paper (Design Considerations for Optimizing Storyline Visualizations, Y Tanahashi, and K-L Ma, IEEE Trans on Visualisation and Computer Graphics, 18(12) 2012, pp2679-2688 and some python code.
PPPS Some more… A collection by Stewart McKie of techniques for visualising screenplays: Screenplay Visualization: Concepts and Practice. The posts I wrote on the Digital Worlds game design uncourse blog about narrative structure. Sort of via Scott Wilson, some crime analysis software from xanalys.com (Link Explorer – White Paper) which includes descriptions of an event chart, a transaction chart and an activity timeline:
Via the comments, this rather lovely animated discourse map:
Whilst looking at the apparently conflicting results from a couple of recent polls by YouGov on press regulation (reviewed in a piece by me over on OpenLearn: Two can play at that game: When polls collide in support of a package on the OU/BBC co-produced Radio 4 programme, More Or Less), my eye was also drawn to the different ways in which the survey results were presented graphically.
The polls were commissioned by The Sun newspaper on the one hand, and the Media Standards Trust/Hacked Off on the other. If you look at the poll data (The Sun/YouGov [PDF] and Media Standards Trust/YouGov [PDF] respectively), you’ll see that it’s reported in a standard format. (I couldn’t find actual data releases, but the survey reports look as if they are generated in a templated way, so getting the core of a generic scraper together for them shouldn’t be too difficult…) But how was that represented to readers (text based headlines and commentary aside?
Here are a couple of grabs from the Sun’s story (State-run watchdog ‘will gag free press’):
Pie-charts in 3D, with a tilt… gorgeous… erm, not… And the colour choice for the bar chart inner-column text is a bit low on contrast compared to the background, isn’t it?
It looks a bit like the writer took a photo of the print edition of the story on their phone, uploaded it and popped it into the story, doesn’t it?
I guess credit should be given for keeping the risk responses separate in the second image, when they could have just gone for the headline figures as pulled out in the YouGov report:
So what I’m wondering now is the extent to which a chart’s “theme” or style reflects the authority or formal weight we might ascribe to it, in much the same way as different fonts carry different associations? Anyone remember the slating that CERN got for using Comic Sans in their Higgs-Boson discovery announcement (eg here, here or here)?
Things could hardly have been more critical if they had used CrappyGraphs or an XKCD style chart generator (as for example described in Style your R charts like the Economist, Tableau … or XKCD ; or alternatively, XKCD-style for matplotlib).
Oh, hang on a minute, it almost looks like they did!
Anyway – back to the polls. The Media Standards Trust reported on their poll using charts that had a more formal look about them:
The chart annotations are also rather clearer to read.
PS with respect to the Sun’s copyright/syndication notice, and my use of the images above:
I haven’t approached the copyright holders seeking permission to reproduce the charts here, but I would argue that this piece is just working up to being research into the way numerical data is reported, as well as hinting at criticism and review. So there…
PPS As far as bad charts go, they may also be, misrepresentations and underhand attempts at persuasion, graphic style, are also possible, as SimplyStatistics describes: “The statisticians at Fox News use classic and novel graphical techniques to lead with data” [ The statisticians at Fox News use classic and novel graphical techniques to lead with data ] See also: OpenLearn – Cheating with Charts.
This is another of those confluence style posts, where a handful of things I’ve read in quick succession seem to phase lock in my mind:
(brought to mind in part via @downes a week or so ago: How to Synch 32 Metronomes)
The first was a post by Alan Levine on Making Text Work, which describes a simple technique for making text overlays on photographs create a more coherent image than just slapping text on:
One of the techniques I use on presentation slides from time to time is a solid filled banner or stripe containing text, some opaque, sometimes transparent. I wouldn’t claim to be much of an artist but it makes the slides slightly more interesting than a header with an image in the middle of the slide, surrounded by whitespace…
(Which reminds me: maybe I should look through Presentation Zen Design and Presentation Zen: Simple Ideas on Presentation Design and Delivery again…)
Reading Alan’s post, it occurred to me that once you get the idea of using a solid or semi-transparent solid filled background to a text label, you tend to remember it and add it to your toolbox of presentation ideas (of course, you might also forget and later rediscover this sort of trick… My own slides tend to crudely follow particular design ideas I’ve recently picked up on, albeit in a crude and often not very well polished way, that I’ve decided I want to try to explore…) In the slide above, several tricks are evident: the solid filled text label, the positioning of it, the backgrounded blog post that actually serves as a reference for the slide (you can do a web search for the post title to learn more about the topic), the foregrounded image, rotated slightly, and so on.
The thing that struck me about Alan’s post was that it reminded me of a time before I was really aware of using a solid filled label to foreground a piece of text, which in turn caused me to reflect on other things I now take for granted as ideas that I can draw on and combine with other ideas.
In the same way we learn to spell, and learn to use punctuation, and start to pick up on the grammer that structures a language, so we can use those rules to construct ever more complex sentences. Once we know the rules, it becomes a choice as to whether or not we employ them.
Here’s an example of how we might come to acquire a new design idea, drawn from a brief conversation with @mediaczar a couple of nights ago when Mat asked if anyone knew the name od this sort of chart combination:
I didn’t know what to call the chart, but thought it should be easy enough to try to wrangle one together using ggplot in R, guessing that a geom_errorbarh() might work; Mat came back suggesting geom_crossbar().
Here’s a minimal code fragment I used to explore it:
#plot a horizontal bar across a bar chart a=data.frame(x=10,y=43) b=stack(a) c=data.frame(d=c('x','y'),f=c(3,22)) library(ggplot2) g=ggplot(b)+geom_bar(aes(x=ind,y=values),stat='identity') g=g+geom_crossbar(data = c, aes(x=d,y=f,ymax=f+1,ymin=f-1), colour = NA, fill = "red", width = 0.9, alpha = 0.5) print(g)
Here’s an example of how I used it – in an as yet unlabelled sketch showing for a particular F1 driver, their gird position for each race (red bar) the the number of places they gained (or lost) during the first lap:
So now I know how to achieve that effect…
one two more things… Just after reading Alan’s post, I read a post by James Allen on possible race strategies for the Japanese Grand Prix:
The first thing that struck me was that even if you vaguely understand how a race chart works, the following statement may not be readily obvious to you from the top chart (my emphasis):
Three stops is actually faster [taking new softs on lap 12], as the [upper] graph … shows, but it requires the driver to pass the two stoppers in the final stint. If there is a safety car, it will hand an advantage to the two stoppers.
So, can you see why the three stopper (the green line) “requires the driver to pass the two stoppers in the final stint”? Let’s step back for a moment – can you see which bit of the graphs represent an overtake?
This is actually quite a complex graph to read – the axes are non-obvious, and not that easy to describe, though you soon pick up a feeling for how the chart works (honest!). Getting a sensible interpretation working for the surprising feature – the sharp vertical drops – is one way of getting into this chart, as well as looking at how the lines are postioned at the extreme x values (that is, at the end of the first and last laps).
The second thing that occurred to me was that we could actually remove the fragment of the line that shows the pitstop and instead show a separate line segment for each stint for each driver, and hence the line crossings that do not represent required overtakes. I’ve used this technique before, for example to show the separate stints on a chart of laptimes for a particular driver over the course of a race:
And as to where I got that trick? I think it was a bastardisation of a cycle plot, which can be used to show monthly, weekly or seasonal trends over a series of years:
…but it could equally have been a Stephen Few highlighted trick of disconnecting a timeseries line at the crossing point of one month to the next…
Whatever the case, one of the ideas I always have in mind is whether it may be possible to introduce white space in the form of a break in a line in order to separate out different groups of data in a very simple way.
Over the summer, an episode of one of my favourite audio/radio programmes, the OU co-produced Radio 4 programme More or Less included a package on high frequency trading. To illustrate how fast high frequency trading works, the programme used a beautiful bit of sonification (the audio equivalent of a graphical data visualisation). You can listen to it on iPlayer here: How fast is high-frequency trading?
Just in case it’s blocked outside the UK, here’s a version I cropped from the downloaded podcast myself [MP3]:
Tim Harford, presenter of the programme, also wrote about high frequency trading here: High-frequency trading and the $440m mistake. Interestingly, the article also includes the audio package… Here’s a link to the original programme on iPlayer: How to lose money, fast
A couple of weeks prior to the More or Less programme (coincidence? Or inspiration?) a blog post about a data sketch done by NYT’s incredibly creative Amanda Cox referred to a similar audio technique to illustrate(?!) close finishes in sprint races: Why Amanda Cox should be in charge of audio. The post also referred back to a New York Times piece from February 2012 capturing just how closely some of the 2010 Winter Olympics race finished: Fractions of a Second: An Olympic Musical.
So now I’m wondering – have you ever
seen, erm, heard a presentation that has used audio, rather than graphics, to illustrate a data story?
Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the widely used graphical devices used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:
Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:
Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.
I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:
One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.
A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):
(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)
All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times, (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:
Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.
[Here’s an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they’d done similar views in previous years using a Flash component, but the current iteration uses d3.js]
The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.
Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:
(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)
Of course, without the data, it can be difficult trying to generate these charts…
…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)
PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.
In London Olympics 2012 Medal Tables At A Glance? I posted some treemap visualisations of the Olympics medal tables generated using a Google Visualisation Chart treemap component. I thought it might be worth posting a quick R generated example too, using the off-the-shelf/straight out of CRAN treemap component. (If you want to play along, download the data as CSV from here.)
The original data looks like this:
but ideally we want it to look like this:
I posted a quick recipe showing how to do this sort of reshaping in Google Refine, but in R it’s even easier – just melt the Gold, Silver and Bronze columns into a pair of columns…
Here’s the full code to do the reshaping and generate a simple treemap:
#load in the data from a file odata = read.csv("~/Downloads/nbc_olympic_medalscrape.csv") #Reshape the data require(reshape) odatar=melt(odata,id=c('cc','ccevent','Event')) #And generate the treemap in the simplest possible way require(treemap) tmPlot(odatar, index=c("cc", "Event","variable"), vSize="value", vColor='value', type="value")
And here’s the treemap, with country blocks ordered in this case by total medal haul:
(To view the countries ordered according to number of Golds, a quick fix would be to order hierarchy with the medal type shown at the highest level of the tree: index=c("variable","cc", "Event").)
Generating variant views (I described six variants in the original post) is easy enough – just tweak the order of the elements of the index setting. (I should have named the melt created columns something more sensible than the default, shouldn’t I? Note that the vSize and vColor value value (sic) refers to the column name that identifies the medalType column. The type value says use the numerical value…. (i.e. it’s literal – it doesn’t refer to a column name…)
Out of the can – simples enough… So what might we be able to do with a little bit more treatment? Examples via the comments, please ;-)