One of the things that I’ve been pondering lately is how I increasingly read the news in a “View Source”* frame of mind, wanting to look behind news stories as reported to read the actual survey report, press release, or Hansard report they take their lead from (more on this in a future post…) – see for example Two can play at that game: When polls collide for a peek behind the polls that drove a couple of conflicting recent news stories. Once you start reading news stories in the context of the press releases that drove them, you can often start to see how little journalistic value-add there is to a large proportion of particular sorts of news stories. When FutureLearn was announced, for example, most of the early stories were just a restatement of the press release.
[*View Source refers to the ability, in most desktop-based web browsers, to view the HTML source code that is used to generate a rendered HTML web page. That is, you can see how a particular visual or design effect in a web page was achieved by looking at the code that describes how it was done.]
I’m still a little hazy about what the distinguishing features of “data journalism” actually are (for example, Sketched Thoughts On Data Journalism Technologies and Practice), but for the sake of this post let’s just assume that doing something with an actual data file is a necessary part of the process when producing a data driven journalistic story. Note that this might just be limited to re-presenting a supplied data set in a graphical form, or it might involve a rather more detailed analysis that requires, in part, the combination of several different original data sets.
So what might make for a useful “press release” or report publication as far as a data journalist goes? One example might be raw data drops published as part of a predefined public data publication scheme by a public body. But again, for the purposes of this post, I’m more interested in examples of data that is released in a form that is packaged in a way that reduces the work the data journalist needs to do and yet still allows them to argue that what they’re doing is data journalism, as defined above (i.e. it involves doing something with a dataset…).
Here are three examples that I’ve seen “in the wild” lately, without doing any real sort of analysis or categorisation of the sorts of thing they contain, the way in which they publish the data, or the sorts of commentary they provide around it. That can come later, if anyone thinks there is mileage in trying to look at data releases in this way…
The press release for the UCAS End of Cycle report 2012 includes headline statistical figures, a link to a PDF report, a link to PNG files of the figures used in the report (so that they can be embedded in articles about the report, presumably) and a link to the datasets used to create the figures used in the report.
Each figure has its own datafile in CSV format:
Each datafile also contains editorial metadata, such as chart title and figure number:
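As a sketch of what consuming one of these figure-level CSV files might look like – the layout, column names and values below are invented for illustration, so the real UCAS files would need checking for their actual structure – the editorial metadata rows can be peeled off before the data table itself is parsed:

```python
import io
import pandas as pd

# A mock of a figure-level CSV release: the first rows carry editorial
# metadata (figure number, chart title), then a blank row, then the data
# table. This layout is an assumption, not the actual UCAS file format.
raw = """Figure,Figure 1
Title,Acceptance rates by age group
,
Age group,2011,2012
18,0.71,0.73
19,0.65,0.66
"""

lines = raw.splitlines()

# Pull the metadata rows out into a dict before handing the rest to pandas
meta = dict(line.split(",", 1) for line in lines[:2])

# Parse the remaining lines (header row plus data) as an ordinary CSV table
data = pd.read_csv(io.StringIO("\n".join(lines[3:])))

print(meta["Title"])   # the chart title, recovered from the metadata rows
print(data.shape)      # (2, 3): two age groups, a label column and two years
```

The same effect could be had with `pd.read_csv(..., skiprows=3)` directly against the file; splitting the lines by hand just makes it easier to keep hold of the metadata rows as well.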
The released data thus allows the data journalist (or the art department of a large news organisation…) to publish their own stylised view of the charts (or embed their own biases in the way they display the data…) and do a very limited amount of analysis on that data. The approach is still slightly short of true reproducibility, or replicability, though – it might take a little bit of effort for us to replicate the figure as depicted from the raw dataset, for example in the setting of range limits for numerical axes. (For an old example of what a replicable report might look like, see How Might Data Journalists Show Their Working?. Note that tools and workflows have moved on since that post was written – I really need to do an update. If you’re interested in examples of what’s currently possible, search for knitr…)
In this sort of release, where data is available separately for each published figure, it may be possible for the data journalist to combine data from different chart-related datasets (if they are compatible) into a new dataset. For example, if two separate charts displayed the performance of the same organisations on two different measures, we might be able to generate a single dataset that lets us plot a “dodged” bar chart showing the performance of each of those organisations against the two measures on the same chart; where two charts compare the behaviour of the same organisations at two different times, we may be able to combine the data to produce a slopegraph. And so on…
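A minimal sketch of that combination step, using invented organisation names and measures rather than any real released data: two per-chart datasets keyed on the same organisations are joined into one table, which can then be reshaped into the long form most plotting libraries want for a “dodged” (grouped) bar chart.

```python
import pandas as pd

# Hypothetical per-chart releases: the same organisations, two measures.
chart_a = pd.DataFrame({"org": ["A", "B", "C"],
                        "applications": [100, 80, 60]})
chart_b = pd.DataFrame({"org": ["A", "B", "C"],
                        "acceptances": [70, 50, 45]})

# Join on the shared organisation column to get one row per organisation
combined = chart_a.merge(chart_b, on="org")

# Melt to long form: one row per (organisation, measure) pair, which is
# the shape a grouped bar chart (or a slopegraph) typically plots from
long_form = combined.melt(id_vars="org", var_name="measure",
                          value_name="value")

print(long_form)
```

If the two charts instead showed the same measure at two different times, the same merge would give the before/after columns a slopegraph needs.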
The ONS – the Office for National Statistics – had a hard time in December 2012 from the House of Commons Public Administration Committee over its website as part of an inquiry on Communicating and publishing statistics (see also the session the day before). I know I struggle with the ONS website from time to time, but it’s maybe worth treating it as a minimum viable product, and starting to iterate…?
So for example, the ONS publishes lots of statistical bulletins using what appears to be a templated format. For example, if we look at the Labour Market Statistics, December 2012, we see a human readable summary of the headline items in the release along with links to specific data files containing the data associated with each chart and a download area for data associated with the release:
If we look at the Excel data file associated with a “difference over time” chart, we notice that the data used to derive the difference is also included:
In this case, we could generate a slope graph directly from the datafile associated with the chart, even though not all that information was displayed in the original chart.
(This might then be a good rule of thumb for testing the quality of “change” data supplied as part of a data containing press release – are the original figures that are differenced to create the difference values also released?)
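That rule of thumb is easy to turn into a mechanical check. Here’s a sketch against made-up figures in the shape of a “difference over time” release that does include the original values alongside the published changes:

```python
import pandas as pd

# Hypothetical release: original figures for two dates plus the published
# "change" column. The column names and numbers are invented for illustration.
df = pd.DataFrame({
    "region":   ["North", "South"],
    "sep_2012": [7.9, 6.1],
    "dec_2012": [7.7, 6.3],
    "change":   [-0.2, 0.2],
})

# Sanity check: do the published changes match the differenced originals?
# (Round to the published precision to sidestep floating point noise.)
recomputed = (df["dec_2012"] - df["sep_2012"]).round(1)
assert (recomputed == df["change"]).all(), "published changes don't reconcile"
```

A release that only ships the `change` column fails this test by construction – there’s nothing to reconcile against – which is rather the point of the rule of thumb.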
It can all start turning into a bit of a rathole, or rabbit warren, from here on in… For example, here are the datasets related to the statistical bulletin:
Here’s a page for the Labour Market statistics dataset, and so on…
That said, the original statistical bulletin does provide specific data downloads that are closely tied to each chart contained within the bulletin.
The third example is the Chief Medical Officer’s 2012 annual report, a graphically rich report published in November 2012. (It’s really worth a look…) The announcement page mentions that “All of the underlying data used to create the images in this report will be made available at data.gov.uk.” (The link points to the top level of the data.gov.uk site). A second link invites you to Read the CMO’s report, leading to a page that breaks out the report in the form of links to chapter level PDFs. However, that page also describes how “When planning this report, the Chief Medical Officer decided to make available all of the data used to create images in the report“, which in turn leads to a page that contains links to a set of Dropbox pages that allow you to download data on a chapter by chapter basis from the first volume of the report in an Excel format.
Whilst the filenames are cryptic, and the figures in the report not well identified, the data is available, which is a Good Thing. (The page also notes: “The files produced this year cannot be made available in csv format. This option will become available once the Chief Medical Officer’s report is refreshed.” I’m not sure if that means CSV versions of the data will be produced for this report, or will be produced for future versions of the report, in the sense of the CMO’s Annual Report for 2013, etc?)
Once again, though, there may still be work to be done recreating a particular chart from a particular dataset (not least because some of the charts are really quite beautiful!;-) Whilst it may seem a little churlish to complain about a lack of detail about how to generate a particular chart from a particular dataset, I would just mention that one reason the web developed its graphical richness so quickly was that by “Viewing Source” developers could pinch the good design ideas they saw on other websites and implement (and further develop) them simply by cutting and pasting code from one page into another.
What each of the three examples described shows is an opening up of the data immediately behind a chart (and in at least one example, from the ONS, the data from which the displayed difference values were calculated is also made available) – a good example of a basic form of data transparency. The reader does not have to take a ruler to a chart to work out what value a particular point represents (which can be particularly hard on log-log or log-lin scale charts!); they can look it up in the original data table used to generate the chart. Taking these as examples of support for a “View Source” style of behaviour, what other forms of “View Source” supporting behaviour should we be trying to encourage?
PS Suppose we now assume that the PR world is well versed with the idea that there are data journalists (or chart-producing graphics editors) out there, and that it does produce data-bearing press releases for them. How might the PR folk try to influence the stories the data journalists tell by virtue of the data they release to them, and the way in which they release it?
PPS By the by, I noticed today that there is a British Standard Guide to presentation of tables and graphs [ BS 7581:1992 ] (as well as several other documents providing guidance on different forms of “statistical interpretation”). But being a British Standard, you have to pay to see it… unless you have a subscription, of course; which is one of the perks you get as a member of an academic library with just such a subscription. H/T to “Invisible librarian” (in the sense of Joining the Flow – Invisible Library Tech Support) Richard Nurse (@richardn2009) for prefetching me a link to the OU’s subscription on British Standards Online in response to a tweet I made about it :-)