Posts Tagged ‘dataJournalism’
For the last few years, I’ve been skulking round the edges of the whole “data journalism” thing, pondering it, dabbling with related tools, technologies and ideas, but never really trying to find out what the actual practice might be. After a couple of twitter chats and a phone conversation with Mark Woodward (Johnston Press), one of the participants at the BBC College of Journalism data training day held earlier this year, I spent a couple of days last week in the Harrogate Advertiser newsroom, pitching questions to investigations reporter and resident data journalist Ruby Kitchen, and listening in on the development of an investigations feature into food inspection ratings in and around the Harrogate area.
Here’s a quick debrief-to-self of some of the things that came to mind…
There’s not a lot of time available and there’s still “traditional” work to be done
One of Ruby’s takes on the story was to name low ranking locations, and contact each one that was going to be named to give them a right of reply. Contacting a couple of dozen locations takes time and diplomacy (and even then seemed to provoke a variety of responses!), as does then writing those responses into the story in a fair and consistent way.
Even simple facts can take the lead in a small story
…for example, x% of schools attained the level 5 rating, something that can then also be contextualised and qualified by comparing it to other categories of establishment, or to national, regional or neighbouring locale averages. As a data junkie, it can be easy to count things by group, perhaps overlooking the journalistic take that many of these counts could be used as the basis of a quick filler story, or of a space-filling, info-snack, glanceable breakout box in a larger story.
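Those grouped counts are simple enough to sketch in code. The following is a minimal, illustrative Python sketch of the “x% of schools attained level 5” style of filler fact – the records and figures are made up for illustration, not taken from the actual food hygiene dataset:

```python
# Hypothetical sample of food hygiene inspection records as
# (establishment type, rating) pairs - invented data, for illustration.
records = [
    ("school", 5), ("school", 5), ("school", 4), ("school", 5),
    ("restaurant", 3), ("restaurant", 5), ("takeaway", 1),
    ("takeaway", 4), ("restaurant", 4), ("school", 5),
]

def pct_with_rating(records, category, rating):
    """Percentage of establishments in a category with a given rating."""
    in_cat = [r for c, r in records if c == category]
    if not in_cat:
        return 0.0
    return 100.0 * sum(1 for r in in_cat if r == rating) / len(in_cat)

print(pct_with_rating(records, "school", 5))  # 4 of 5 schools -> 80.0
```

The same one-liner, pointed at different categories, yields a whole family of breakout-box facts from one dataset.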
Is the story tellable?
Looking at data, you can find all sorts of things that are perhaps interesting in their subtlety or detail, but if you can’t communicate a headline or what’s interesting in a few words, it maybe doesn’t fit… (Which is not to say that data reporting needs to be dumbed down or simplistic.) Related to this is the “so what?” question. I guess for news, if you wouldn’t share it in the pub or over dinner having read it – that is, if you wouldn’t remark on it – you’d have to ask: is it really that interesting? (Hmm… is “Liking” the same as remarking on something? I get the feeling it’s less engaged…)
There’s a huge difference between the tinkering I do and production warez
I have all manner of pseudo-workflows that allow me to generate quick sketches in an exploratory data analysis sort of way, but things that work for the individual “researcher” are not the same as things that can work in a production environment. For example, I knocked up a quick interactive map using the folium library in an IPython notebook, but there are several problems with this:
- to regenerate the map requires someone having an IPython notebook environment set up and appropriate libraries installed
- there isn’t much time available… so you need to think about what to offer. For example:
- the map I produced was a simple one – just markers and popups. At the time, I hadn’t worked out how to colour the markers or add new icons to them (and I still don’t have a route for putting numbers into the markers…), so the look is quite simple (and clunky)
- there is no faceted navigation – so you can’t for example search for particular sorts of establishment or locations with a particular rating.
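For what it’s worth, a marker-and-popup map of the sort described can also be sketched without folium at all, by templating Leaflet directly into an HTML page. The following is an illustrative sketch only – the marker data, the CDN links and the page layout are all my assumptions, not the actual newsroom output:

```python
import json

# Invented marker data for illustration (not the real inspection data).
markers = [
    {"lat": 53.99, "lon": -1.54, "popup": "Example Cafe (rating 2)"},
    {"lat": 54.00, "lon": -1.53, "popup": "Example School (rating 5)"},
]

PAGE = """<!DOCTYPE html>
<html><head>
<link rel="stylesheet" href="https://unpkg.com/leaflet/dist/leaflet.css"/>
<script src="https://unpkg.com/leaflet/dist/leaflet.js"></script>
</head><body>
<div id="map" style="height:600px"></div>
<script>
var map = L.map('map').setView([{lat}, {lon}], 13);
L.tileLayer('https://tile.openstreetmap.org/{{z}}/{{x}}/{{y}}.png').addTo(map);
{marker_js}
</script>
</body></html>"""

def render_map(markers):
    # One L.marker(...) call per location, popup text JSON-escaped.
    marker_js = "\n".join(
        "L.marker([{lat}, {lon}]).addTo(map).bindPopup({popup});".format(
            lat=m["lat"], lon=m["lon"], popup=json.dumps(m["popup"]))
        for m in markers)
    # Centre the initial view on the mean marker position.
    lat = sum(m["lat"] for m in markers) / len(markers)
    lon = sum(m["lon"] for m in markers) / len(markers)
    return PAGE.format(lat=lat, lon=lon, marker_js=marker_js)

html = render_map(markers)
# open("map.html", "w").write(html) would save a shareable standalone page.
```

A standalone HTML file like this sidesteps the “someone needs an IPython environment set up” problem, though the colour-coded markers and faceted navigation mentioned above would still need extra work.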
Given more time, it would have been possible to consider richer, faceted navigation, for example, but for a one off, what’s reasonable? If a publisher starts to use more and more maps, one possible workflow may be to iterate on previous precedents. (To an extent, I tried to do this with things I’ve posted on the OU OpenLearn site over the years. For example, the first step was to get a marker-displaying map embedded, which required certain technical things being put in place the first time but could then be reused for future maps. Next up was a map with user submitted marker locations – this represented an extension of the original solution, but again resulted in a new precedent that could be reused and in turn extended or iterated on again.)
This suggests an interesting development process in which ever richer components can perhaps be developed iteratively over an extended period of time or set of either related or independent stories, as the components are used in more and more stories. Where a media group has different independent publications, other ways of iterating are possible…
The whole tech angle also suggests that a great stumbling block to folk getting (interactive) data elements up on a story page is not just the discovery of the data, the processing and cleaning of it, and the generation of the initial sketch to check it could be something that adds to the telling of a story (each of which may require a particular set of skills), but also the whole raft of production related issues that then result, which could require a whole raft of other technical skills (skills I know that I don’t really have, for example, even given my quirky skillset…). And if the corporate IT folk take ownership of the publication element, there is then a cascade back upstream of constraints relating to how the data needs to be treated so it can fit in with the IT production system workflow.
I tend to use ggplot a lot in R for exploring datasets graphically, rather than for producing presentation graphics to support the telling of a particular story. Add to that, I’m still not totally up to speed on charting in the Python context, and the result is that I didn’t really (think to) explore how graphical, chart based representations might be used to support the story. One thing that charts can do – like photographs – is take up arbitrary amounts of space, which can be a Good Thing (if you need to fill the space) or a Bad Thing (if space is at a premium, or page (print or web) layout is a constraint, perhaps due to page templating allowances, for example).
Some things I didn’t consider but that now come to mind are:
- how are charts practically handed over? (As Excel charts? as image files?)
- does a sub-editor or web-developer then process the charts somehow?
- for print, are there limitations on use of colour, line thickness, font-size and style?
Print vs Web
I didn’t really consider this, but in terms of workflow and output, are different styles of presentation required for:
- data tables
If you want data tables, there are various libraries or tools for styling tables, but again the question of workflow, and the actual form in which items are handed over for print or web publication, needs to be considered.
Being right/being wrong
Every cell in a data table is a “fact”. If your code is wrong and one column, or row, or cell is wrong, that can cause trouble. When you’re tinkering in private, that doesn’t matter so much – every cell can be used as the basis for another question that can be used to test, probe or check that fact further. If you publish that cell, and it’s wrong, you’ve made a false claim… Academics are cautious and don’t usually like to commit to anything without qualifying it further (sic;-). I trust most data, metadata and my own stats skills little enough that I see stats’n’data as a source that needs corroborating, which means showing it to someone else with my conclusions and a question along the lines of “it seems to me that this data suggests that – would you agree?”. This perhaps contrasts with relaying a fact (eg a particular food hygiene score) and taking it as-is as a trusted fact, given that it was published by a trusted authoritative source, obtained directly from that source, and not processed locally – but then asking the manager of that establishment for a comment about how that score came about, or what action they have taken as a result of getting it.
I’m also thinking it’d be interesting to compare the similarities and differences between journalists and academics in terms of their relative fears of being wrong…!
One of things I kept pondering – and have been pondering for months – is the extent to which templated analyses can be used to create local “press release” story packs around national datasets that can be customised for local or regional use. That’s a far more substantial topic for another day, but it was put into relief last week by my reading of Nick Carr’s The Glass Cage which got me thinking about the consequences of “robot” written stories… (More about that in a forthcoming post.)
Lots of skills issues, lots of process and workflow issues, lots of story discovery, story creation, story telling and story checking issues, lots of production constraints, lots of time constraints. Fascinating. Got me really excited again about the challenges of, and opportunities for, putting data to work in a news context…:-)
Thanks to all at the Harrogate Advertiser, in particular Ruby Kitchen for putting up with my questions and distractions, and Mark Woodward for setting it all up.
As well as serendipity, I believe in confluence…
A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.
It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.
In my local rag last week, (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:
So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)
Something else I spotted, via RBloggers, a post by Alex Singleton of the University of Liverpool: an Open Atlas around the 2011 Census for England and Wales, who has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.
So what’s confluential about those?
The IWCP article localises the Fairer Gambling data in several ways:
– the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
– a consideration of what the spend equates to on a per capita basis (the report might also have used the population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on these machines?
– a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.
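The per capita point above is a one-line calculation, but the choice of denominator matters. A quick illustrative sketch – all of the numbers here are invented, not the actual Isle of Wight or Fairer Gambling figures:

```python
# Invented illustrative figures - not the real Isle of Wight data.
total_spend = 5_000_000   # mooted annual spend on the machines, GBP
population = 140_000      # resident population, all ages
adults = 110_000          # resident population aged 18+

# The headline "per head" figure versus the arguably fairer "per adult"
# figure - same spend, noticeably different averages.
per_head = total_spend / population
per_adult = total_spend / adults

print(round(per_head, 2), round(per_adult, 2))
```

And neither denominator accounts for the seasonal holidaymaker swing mentioned above – any per capita figure quoted for a tourist area carries that caveat.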
The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing that with a quick tweak, separate reports could be generated for the different areas within a single Local Authority.)
Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produce localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.
A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something and then allow the local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. There is also an increasing number of workflows available around R (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, [Wordpress] Bloggin’ from R).
Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.
The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gambling data, maybe, and pick up on the Guardian idea of also adding in additional columns… again, something where the work can be done centrally, looking for meaningful datasets and combining them with the original data set.)
Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals; on my to do list is packaging up the recently announced UCAS 2013 entries data).
The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or what to read from them, whatever stories they might tell; rather, it is the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story – one that has local meaning but works at national scale.
Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort of activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.
PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…
One of the things that I’ve been pondering lately is how I increasingly read the news in a “View Source”* frame of mind, wanting to look behind news stories as reported to read the actual survey report, press release, or Hansard report they take their lead from (more on this in a future post…) – see for example Two can play at that game: When polls collide for a peek behind the polls that drove a couple of conflicting recent news stories. Once you start reading news stories in the context of the press releases that drove them, you can often start to see how little journalistic value add there is to a large proportion of particular sorts of news stories. When FutureLearn was announced, for example, most of the early stories were just a restatement of the press release.
[*View Source refers to the ability, in most desktop based web browsers, to view the HTML source code that is used to generate a rendered HTML web page. That is, you can look to see how a particular visual or design effect in web page was achieved by looking at the code that describes how it was done.]
I’m still a little hazy about what the distinguishing features of “data journalism” actually are (for example, Sketched Thoughts On Data Journalism Technologies and Practice), but for the sake of this post let’s just assume that doing something with an actual data file is a necessary part of the process when producing a data driven journalistic story. Note that this might just be limited to re-presenting a supplied data set in a graphical form, or it might involve a rather more detailed analysis that requires, in part, the combination of several different original data sets.
So what might make for a useful “press release” or report publication as far as a data journalist is concerned? One example might be raw data drops published as part of a predefined public data publication scheme by a public body. But again, for the purposes of this post, I’m more interested in examples of data that is released in a form that is packaged in a way that reduces the work the data journalist needs to do, and yet still allows them to argue that what they’re doing is data journalism, as defined above (i.e. it involves doing something with a dataset…).
Here are three examples that I’ve seen “in the wild” lately, without doing any real sort of analysis or categorisation of the sorts of thing they contain, the way in which they publish the data, or the sorts of commentary they provide around it. That can come later, if anyone thinks there is mileage in trying to look at data releases in this way…
The press release for the UCAS End of Cycle report 2012 includes headline statistical figures, a link to a PDF report, a link to PNG files of the figures used in the report (so that they can be embedded in articles about the report, presumably) and a link to the datasets used to create the figures used in the report.
Each figure has its own datafile in CSV format:
Each datafile also contains editorial metadata, such as chart title and figure number:
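Mixing editorial metadata and data in one CSV means a little parsing before the data can be reused. I haven’t reproduced the exact UCAS layout here – the sketch below assumes a guessed layout of “key,value” metadata lines, a blank line, then the data table proper, with invented figures:

```python
import csv

# Guessed-at layout of a per-figure CSV with editorial metadata rows;
# the field names and numbers are invented for illustration.
raw = """Title,Figure 1: Acceptances by year
Figure,1

Year,Acceptances
2010,487000
2011,492000
2012,464000
"""

def read_figure_csv(text):
    """Split leading metadata lines from the data table that follows."""
    meta, rows = {}, []
    lines = text.splitlines()
    i = 0
    # Metadata rows run until the first blank line.
    while i < len(lines) and lines[i].strip():
        key, _, value = lines[i].partition(",")
        meta[key] = value
        i += 1
    # Everything after the blank line is an ordinary CSV table.
    for row in csv.DictReader(lines[i + 1:]):
        rows.append({"Year": int(row["Year"]),
                     "Acceptances": int(row["Acceptances"])})
    return meta, rows

meta, rows = read_figure_csv(raw)
print(meta["Title"], len(rows))
```

Once split out like this, the metadata can caption a re-styled chart while the rows feed whatever charting tool the newsroom actually uses.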
The released data thus allows the data journalist (or the art department of a large news organisation…) to publish their own stylised view of the charts (or embed their own biases in the way they display the data…) and do a very limited amount of analysis on that data. The approach is still slightly short of true reproducibility, or replicability, though – it might take a little bit of effort for us to replicate the figure as depicted from the raw dataset, for example in the setting of range limits for numerical axes. (For an old example of what a replicable report might look like, see How Might Data Journalists Show Their Working?. Note that tools and workflows have moved on since that post was written – I really need to do an update. If you’re interested in examples of what’s currently possible, search for knitr…)
In this sort of release, where data is available separately for each published figure, it may be possible for the data journalist to combine data from different chart-related datasets (if they are compatible) into a new dataset. For example, if two separate charts displayed the performance of the same organisations on two different measures, we might be able to generate a single dataset that lets us plot a “dodged” bar chart showing the performance of each of those organisations against the two measures on the same chart; where two charts compare the behaviour of the same organisations at two different times, we may be able to combine the data to produce a slopegraph. And so on…
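The “combine two chart datasets” step above amounts to a join on the shared organisation names. A minimal illustrative sketch, with invented organisations and values:

```python
# Two per-organisation chart datasets of the kind described above:
# the same organisations measured at two times. Data is invented.
chart_2011 = {"Trust A": 62, "Trust B": 48, "Trust C": 71}
chart_2012 = {"Trust A": 58, "Trust B": 55, "Trust C": 70}

def merge_measures(first, second, labels=("before", "after")):
    """Join two {organisation: value} dicts on their common keys,
    yielding rows suitable for a slopegraph or dodged bar chart."""
    common = sorted(first.keys() & second.keys())
    return [{"org": k, labels[0]: first[k], labels[1]: second[k]}
            for k in common]

combined = merge_measures(chart_2011, chart_2012, ("y2011", "y2012"))
print(combined[0])
```

Taking the intersection of keys also quietly flags the compatibility question raised above: organisations appearing in only one chart simply drop out of the combined view.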
The ONS – the Office for National Statistics – had a hard time in December 2012 from the House of Commons Public Administration Committee over its website as part of an inquiry on Communicating and publishing statistics (see also the session the day before). I know I struggle with the ONS website from time to time, but it’s maybe worth considering as a minimum viable product, and to start iterating…?
So for example, the ONS publishes lots of statistical bulletins using what appears to be a templated format. For example, if we look at the Labour Market Statistics, December 2012, we see a human readable summary of the headline items in the release along with links to specific data files containing the data associated with each chart and a download area for data associated with the release:
If we look at the Excel data file associated with a “difference over time” chart, we notice that the data used to derive the difference is also included:
In this case, we could generate a slope graph directly from the datafile associated with the chart, even though not all that information was displayed in the original chart.
(This might then be a good rule of thumb for testing the quality of “change” data supplied as part of a data containing press release – are the original figures that are differenced to create the difference values also released?)
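That rule of thumb is mechanically checkable once the release includes the underlying values. A sketch, with invented labour-market-style figures standing in for the real release:

```python
# Invented "change" rows of the kind found in a statistical bulletin:
# each quotes previous and latest values alongside the derived change.
rows = [
    {"measure": "Employment",   "prev": 29_500, "latest": 29_540, "change": 40},
    {"measure": "Unemployment", "prev": 2_510,  "latest": 2_490,  "change": -20},
]

def differences_check_out(rows):
    """True if every quoted change can be reproduced from its
    released before/after values."""
    return all(r["latest"] - r["prev"] == r["change"] for r in rows)

print(differences_check_out(rows))  # True
```

A release that quotes only the change values, without the before/after figures, can’t be audited this way – which is the point of the rule of thumb.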
It can all start getting a bit rathole, rabbit warren from here on in… For example, here are the datasets associated with the statistical bulletin:
Here’s a page for the Labour Market statistics dataset, and so on…
That said, the original statistical bulletin does provide specific data downloads that are closely tied to each chart contained within the bulletin.
The third example is the Chief Medical Officer’s 2012 annual report, a graphically rich report published in November 2012. (It’s really worth a look…) The announcement page mentions that “All of the underlying data used to create the images in this report will be made available at data.gov.uk.” (The link points to the top level of the data.gov.uk site). A second link invites you to Read the CMO’s report, leading to a page that breaks out the report in the form of links to chapter level PDFs. However, that page also describes how “When planning this report, the Chief Medical Officer decided to make available all of the data used to create images in the report“, which in turn leads to a page that contains links to a set of Dropbox pages that allow you to download data on a chapter by chapter basis from the first volume of the report in an Excel format.
Whilst the filenames are cryptic, and the figures in the report not well identified, the data is available, which is a Good Thing. (The page also notes: “The files produced this year cannot be made available in csv format. This option will become available once the Chief Medical Officer’s report is refreshed.” I’m not sure if that means CSV versions of the data will be produced for this report, or will be produced for future versions of the report, in the sense of the CMO’s Annual Report for 2013, etc?)
Once again, though, there may still be work to be done recreating a particular chart from a particular dataset (not least because some of the charts are really quite beautiful!;-). Whilst it may seem a little churlish to complain about a lack of detail about how to generate a particular chart from a particular dataset, I would just mention that one reason the web developed its graphical richness so quickly was that by “Viewing Source” developers could pinch the good design ideas they saw on other websites and implement (and further develop) them simply by cutting and pasting code from one page into another.
What each of the three examples described shows is an opening up of the data immediately behind a chart (and in at least one example from the ONS, making available the data from which the data displayed in a difference chart was calculated) – good examples of a basic form of data transparency. The reader does not have to take a ruler to a chart to work out the value of a particular point (which can be particularly hard on log-log or log-lin scale charts!); they can look it up in the original data table used to generate the chart. Taking them as examples of support for a “View Source” style of behaviour, what other forms of “View Source” supporting behaviour should we be trying to encourage?
PS Now assume that the PR world is well versed with the idea that there are data journalists (or chart producing graphics editors) out there, and that it does produce data bearing press releases for them. How might the PR folk try to influence the stories the data journalists tell by virtue of the data they release to them, and the way in which they release it?
PPS By the by, I noticed today that there is a British Standard Guide to presentation of tables and graphs [ BS 7581:1992 ] (as well as several other documents providing guidance on different forms of “statistical interpretation”). But being a British Standard, you have to pay to see it… unless you have a subscription, of course; which is one of the perks you get as a member of an academic library with just such a subscription. H/T to “Invisible librarian” (in the sense of Joining the Flow – Invisible Library Tech Support) Richard Nurse (@richardn2009) for prefetching me a link to the OU’s subscription on British Standards Online in response to a tweet I made about it:-)