Sketched Thoughts On Data Journalism Technologies and Practice

Over the last year or two, I’ve given a handful of talks to postgrad and undergrad students broadly on the topic of “technology for data driven journalism”. The presentations are typically uncompromising, which is to say I assume a lot. There are many risks in taking such an approach, of course, as waves of confusion spread out across the room… But it is, in part, a deliberate strategy intended to shock people into an awareness of some of the things that are possible with tools that are freely available for use in the desktop and browser based sheds of today’s digital tinkerers… Having delivered one such presentation yesterday, at UCA, Farnham, here are some reflections on the whole topic of “#ddj”. Needless to say, they do not necessarily reflect even my opinions, let alone those of anybody else;-)

The data-driven journalism thing is being made up as we go along. There is a fine tradition of computer-assisted journalism, database journalism, and so on, but the notion of “data-driven journalism” appears to have rather more popular appeal. Before attempting a definition, what are some of the things we associate with ddj that might explain the recent upsurge of interest around it?

  • access to data: this must surely be a part of it. In one version of the story we might tell, the arrival of Google Maps, and Paul Rademacher’s reverse engineering of an API to it for his April 2005 “Housing Maps” mashup, opened people’s eyes to the possibility of map-based mashups; a short while later, in May 2005, Adrian Holovaty’s Chicago Crime Map showed how the same mashup idea could be used for “live”, automated and geographically contextualised reporting of crime data. Mashups were all about appropriating web technologies and web content, building new “stuff” from pre-existing “stuff” that was already out there. And as an idea, mashups became all the rage way back then, offering as they did the potential for appropriating, combining and re-presenting elements of different web applications and publications without the need for (further) programming.
    In March 2006, a year or so after the first demonstration of the Housing Maps mashup, and in part as a response to the difficulty in getting hold of the latitude and longitude data for UK locations that was required to build Google Maps mashups around British locations, the Guardian Technology supplement (remember that? It had Kakuro puzzles and everything?!;-) launched the “Free Our Data” campaign (history). This campaign called for the free release of data collected at public expense, such as the data that gave the latitude and longitude for UK postcodes.
    The early promise of, and popular interest in, “mashups” waxed and then waned; but there was a new tide rising in the information system that is the web: access to data. The mashups had shown the way forward in terms of some of the things you could do if you could wire different applications together, but despite the promise of no programming it was still too techie, too geeky, too damned hard and fiddly for most people; and despite what the geeks said, it was still programming, and there often still was coding involved. So the focus changed. Awareness grew about the sorts of “mashup” that were possible, so now you could ask a developer to build you “something like that”, as you pointed to an appropriate example. The stumbling block now was access to the data to power an app that looked like that, but did the same thing for this.
    For some reason, the notion of “open” public data hit a policy nerve, and in the UK, as elsewhere, started to receive cross-party support. (A brief history of open public data in a UK context is illustrated in the first part of Open Standards and Open Data.) The data started to flow, or at least, started to become both published (through mandated transparency initiatives, such as the release of public accounting data) and requestable (for example, via an extension to FOI by the Protection of Freedoms Act 2012).
    We’ve now got access, in principle and in practice, to increasing amounts of data; we’ve seen some of the ways in which it can be displayed; and, to a certain extent, we’ve started to explore some of the ways in which we can use it as a source for news stories. So the time is right, in data terms, for data-driven journalism, right?
  • access to visualisation technologies: it wasn’t very long ago that it was still really hard to display data on screen using anything other than canned chart types – pie charts, line charts, bar charts (that is, the charts you were introduced to in primary school. How many chart types have you learned to read, or create, since then?). Spreadsheets offer a range of grab-and-display chart generating wizards, of course, but they’re not ideal when working with large datasets, and they’re typically geared for generating charts for reports, rather than being used analytically. The visual analysis mantra – overview first, zoom and filter, then details-on-demand – (coined in Ben Shneiderman’s 1996 paper “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations”) arguably requires fast computers and big screens to achieve the level of responsiveness required for interactive use, and we have those now…
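
By way of illustration, here’s a minimal, hypothetical sketch of that mantra at work using pandas and matplotlib; the file name and the directorate/supplier/amount columns are invented for the purposes of the example, and the point is the shape of the workflow rather than the particular charts.

```python
# A minimal, hypothetical sketch of "overview first, zoom and filter, then
# details-on-demand" using pandas and matplotlib. The file name and the
# directorate/supplier/amount columns are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("spend.csv")  # e.g. a local council spending release

# Overview first: how does total spend break down across directorates?
df.groupby("directorate")["amount"].sum().sort_values().plot(kind="barh")
plt.title("Total spend by directorate")
plt.tight_layout()
plt.show()

# Zoom and filter: home in on one directorate and the larger payments only
subset = df[(df["directorate"] == "Children's Services") & (df["amount"] > 10000)]

# Details on demand: the individual transactions behind the pattern
print(subset.sort_values("amount", ascending=False)[["supplier", "amount"]].head(20))
```

In-browser toolkits (d3.js being the obvious example) push the same pattern further, by making the zooming and filtering interactive rather than something you re-run by hand.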

There are, however, still some considerable barriers to access:

  • access to clean data: you might think I’m repeating myself here, but access to data and access to clean data are two separate considerations. A lot of the data that’s out there and published is still not directly usable (you can’t just load it into a spreadsheet and work on it directly); things that are supposed to match often don’t (we might know that Open Uni, OU and Open University refer to the same thing, but why should a spreadsheet?); number columns often contain things that aren’t numbers (such as commas or other punctuation); dates are provided in a wide variety of formats that we can recognise as such, but a computer can’t – at least, not unless we give it a bit of help; data gets misplaced across columns; character encodings used by different applications and operating systems don’t play nicely; typos proliferate; and so on. So whose job is it to clean the data before it can be inspected or analysed? (A minimal sketch of what this sort of cleaning can look like in practice appears after this list.)
  • access to skills and workflows: engineering practice tends to have a separation between the notions of “engineer” and “technician”. To over-generalise and trivialise matters somewhat: engineers have academic training, and typically come at problems from a theory-dominated direction; technicians (or technical engineers) have the practical skills that can be used to enact the solutions produced by the engineers. (Of course, technicians can often suggest additional, or alternative, solutions, in part reflecting a better, or more immediate, knowledge of the practical considerations involved in taking one course of action compared to another.) At the moment, the demarcation of roles (and the skills required at each step of the way) in a workflow based around data discovery, preparation, analysis and reporting is still confused.
  • What questions should we ask? If you think of data as a source, with a story to tell: how do you set about finding that source? Why do you even think you want to talk to that source? What sorts of questions should you ask that source, and what sorts of answer might you reasonably expect it to provide you with? How can you tell if that source is misleading you, lying to you, hiding something from you, or is just plain wrong? To what extent do you, or should you, trust a data source? Remember, every cell in a spreadsheet is a fact. If you have a spreadsheet containing a million data cells, that’s a lot of fact checking to do…
  • low or misplaced expectations: we don’t necessarily expect journalism students to know how to drive a spreadsheet, let alone run or apply complex statistics, or even have a great grasp on “the application of number”; but should they? I’m not totally convinced we need to get them up to speed with yesterday’s tools and techniques… As a tool builder/tool user, I keep looking for tools, and ways of using tools, that may be thought of as emerging “professional” tools for people who work with data on a day-to-day basis but wouldn’t class themselves as data scientists, or data researchers; tools for technicians, maybe. When presenting tools to students, I try to show the tools that are likely to be found on a technician’s workbench. As such, they may look a little bit more technical than tools developed for home use (compare a socket set from a trade supplier with a £3.50 tool-roll bargain offer from your local garage), but that’s because they’re quality tools that are fit for purpose. And as such, it may take a bit of care, training and effort to learn how to use them. But I thought the point was to expose students to “industry-strength” ideas and applications? And in an area where tools are developing quite quickly, students are exactly the sort of people we need to start engaging with those tools: 1) at the level of raising awareness about what these tools can do; 2) as a vector for knowledge and technology transfer, getting these tools (or at least, ideas about what they can do) out into industry; 3) for students so inclined, recruiting them to the further development of the tools, recruiting power users to help drive requirements for future iterations, and so on. If the journalism students are going to be the “engineers” to the data wrangler technicians, it’ll be good for them to know the sorts of things they can reasonably ask their technicians to help them to do… Which is to say, the journalists need exposing to the data wrangling factory floor.
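
As a rough illustration of the cleaning chores listed above, here’s a minimal, hypothetical first-pass sketch in Python using pandas; the file name and the amount/payment_date/supplier columns are invented, and the steps shown are the sort of thing a data technician might try first, not a definitive recipe.

```python
# A minimal, hypothetical first-pass clean of a "published but not directly
# usable" dataset using pandas. The file name and the amount/payment_date/
# supplier columns are invented for illustration.
import pandas as pd

df = pd.read_csv("spending_release.csv", encoding="latin-1")  # encodings often bite

# Number columns that aren't numbers: strip currency symbols and commas first
df["amount"] = (
    df["amount"]
    .astype(str)
    .str.replace(r"[£,]", "", regex=True)
    .pipe(pd.to_numeric, errors="coerce")
)

# Dates in a variety of formats: parse what we can, flag what we can't
df["payment_date"] = pd.to_datetime(df["payment_date"], dayfirst=True, errors="coerce")

# Things that should match but don't: normalise the obvious variants by hand
aliases = {"Open Uni": "The Open University", "OU": "The Open University"}
df["supplier"] = df["supplier"].str.strip().replace(aliases)

# A first sanity check on the cleaned result: how much didn't survive the parse?
print(df[["amount", "payment_date"]].isna().sum())
```

Point-and-click tools such as OpenRefine wrap similar operations (value clustering, type coercion, and so on) in a graphical interface, which may well be the more natural home for this sort of work for many people.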

Although a lot of the #ddj posts on this OUseful.info blog relate to tools, the subtext is all about recognising data as a medium, the form particular datasets take, and the way in which different tools can be used to work with these forms. In part this leads to a consideration of the sorts of questions that can be asked of a data source based on identifying the natural representations that may be contained within it (albeit in hidden form). For example, a list of MPs hints at a list of constituencies, which have locations, and therefore may benefit from representation in a geographical, map-based form; a collection of emails might hint at a timeline-based reconstruction, or a network analysis showing who corresponded with whom (and in what order), maybe? (A minimal sketch of that last idea follows.)
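
To give a flavour of the email example, here’s a minimal, hypothetical sketch of turning a mailbox into a “who corresponded with whom” network using Python’s standard mailbox and email modules together with the networkx library; the correspondence.mbox file name is invented, and real email data would need rather more careful handling (Cc/Bcc headers, aliases, mailing lists and so on).

```python
# A minimal, hypothetical sketch: reconstructing a "who corresponded with whom"
# network from a mailbox. The file name correspondence.mbox is invented; real
# email data would need more careful handling (Cc/Bcc headers, aliases, etc.).
import mailbox
from email.utils import getaddresses, parseaddr

import networkx as nx

G = nx.DiGraph()

for msg in mailbox.mbox("correspondence.mbox"):
    _, sender = parseaddr(msg.get("From", ""))
    recipients = [addr for _, addr in getaddresses([msg.get("To", "")])]
    for recipient in recipients:
        if not sender or not recipient:
            continue
        # weight each edge by the number of messages sent in that direction
        if G.has_edge(sender, recipient):
            G[sender][recipient]["weight"] += 1
        else:
            G.add_edge(sender, recipient, weight=1)

# A first question to ask the source: who sits at the centre of the correspondence?
print(sorted(G.degree(weight="weight"), key=lambda item: item[1], reverse=True)[:10])
```

The timeline reading of the same source is little more than a sort on the Date: header; the point is that the structure of the data suggests the representations, and the representations suggest the questions.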

And finally, something that I think is still lacking in the formulation of data journalism as a practice is an articulation of the process of discovering the stories from data: I like the notion of “conversations with data” and this is something I’ll try to develop over forthcoming blog posts.

PS see also @dkernohan’s The campaigning academic?. At the risk of spoiling the punchline (you should nevertheless go and read the whole thing), David writes: “There is a space – in the gap between academia and journalism, somewhere in the vicinity of the digital humanities movement – for what I would call the “campaigning academic”, someone who is supported (in a similar way to traditional research funding) to investigate issues of interest and to report back in a variety of accessible media. Maybe this “reporting back” could build up into equivalence to an academic reward, maybe not.

These would be cross-disciplinary scholars, not tied to a particular critical perspective or methodology. And they would likely be highly networked, linking in both to the interested and the involved in any particular area – at times becoming both. They might have a high media profile and an accessible style (Ben Goldacre comes to mind). Or they might be an anonymous but fascinating blogger (whoever it is that does the wonderful Public Policy and The Past). Or anything in between.

But they would campaign, they would investigate, they would expose and they would analyse. Bringing together academic and old-school journalistic standards of integrity and verifiability.”

Mixed up in my head – and I think in David’s – is the question of “public accounting”, as well as sensemaking around current events and trends, and the extent to which it’s the role of “the media” or “the academy” to perform such a function. I think there’s much to be said for reimagining how we inform and educate in a network-centric, web-based world, and it’s yet another of the things on my list to ponder further… See also: From Academic Privilege to Consultations as Peer Review.

