Posts Tagged ‘datajourn’
For the last few years, I’ve been skulking round the edges of the whole “data journalism” thing, pondering it, dabbling with related tools, technologies and ideas, but never really trying to find out what the actual practice might be. After a couple of twitter chats and a phone conversation with Mark Woodward (Johnston Press), one of the participants at the BBC College of Journalism data training day held earlier this year, I spent a couple of days last week in the Harrogate Advertiser newsroom, pitching questions to investigations reporter and resident data journalist Ruby Kitchen, and listening in on the development of an investigations feature into food inspection ratings in and around the Harrogate area.
Here’s a quick debrief-to-self of some of the things that came to mind…
There’s not a lot of time available and there’s still “traditional” work to be done
One of Ruby’s takes on the story was to name low ranking locations, and contact each one that was going to be named to give them a right to response. Contacting a couple of dozen locations takes time and diplomacy (which even then seemed to provoke a variety of responses!), as does then writing those responses into the story in a fair and consistent way.
Even simple facts can take the lead in a small story
…for example, x% of schools attained the level 5 rating, something that can then also be contextualised and qualified by comparing it to other categories of establishment or national, regional or neighbouring locale averages. As a data junkie, it can be easy to count things by group, perhaps overlooking a journalistic take that many of these counts could be used as the basis of a quick filler story or space-filling, info-snack glanceable breakout box in a larger story.
Is the story tellable?
Looking at data, you can find all sorts of things that are perhaps interesting in their subtlety or detail, but if you can’t communicate a headline or what’s interesting in a few words, it maybe doesn’t fit… (Which is not to say that data reporting needs to be dumbed down or simplistic…) Related to this is the “so what?” question..? (I guess for news, if you wouldn’t share it in the pub or over dinner have read it – that is, if you wouldn’t remark on it – you’d have to ask: is it really that interesting? (Hmm… is “Liking” the same as remarking on something? I get the feeling it’s less engaged…)
There’s a huge difference between the tinkering I do and production warez
I have all manner of pseudo-workflows that allow me to generate quick sketches in an exploratory data analysis sort of way, but things that work for the individual “researcher” are not the same as things can work in a production environment. For example, I knocked up a quick interactive map using the folium library in an IPython notebook, but there are several problems with this:
- to regenerate the map requires someone having an IPython notebook environment set up and appropriate libraries installed
- there isn’t much time available… so you need to think about what to offer. For example:
- the map I produced was a simple one – just markers and popups. At the time, I hadn’t worked out how to colour the markers or add new icons to them (and I still don’t have a route for putting numbers into the markers…), so the look is quite simple (and clunky)
- there is no faceted navigation – so you can’t for example search for particular sorts of establishment or locations with a particular rating.
Given more time, it would have been possible to consider richer, faceted navigation, for example, but for a one off, what’s reasonable? If a publisher starts to use more and more maps, one possible workflow may to be iterate on previous precedents. (To an extent, I tried to do this with things I’ve posted on the OU OpenLearn site over the years. For example, first step was to get a marker displaying map embedded, which required certain technical things being put in place the first time but could then be reused for future maps. Next up was a map with user submitted marker locations – this represented an extension of the original solution, but again resulted in a new precedent that could be reused and in turn extended or iterated on again.)
This suggests an interesting development process in which ever richer components can perhaps be developed iteratively over an extended period of time or set of either related or independent stories, as the components are used in more and more stories. Where a media group has different independent publications, other ways of iterating are possible…
The whole tech angle also suggests that a great stumbling block to folk getting (interactive) data elements up on a story page is not just the discovery of the data, the processing and cleaning of it, and the generation of the initial sketch to check it could be something that could add to the telling of a story, (each of which may require a particular set of skills), but also the whole raft of production related issues that then result (which could require a whole raft of other technical skills (which are, for example, skills I know that I don’t really have, even given my quirky skillset…). And if the corporate IT folk take ownership of he publication element, there is then a cascade back upstream of constraints relating to how the data needs to be treated so it can fit in with the IT production system workflow.
Whilst I tend to use ggplot a lot in R for exploring datasets graphically, rather than producing presentation graphics to support the telling of a particular story. Add to that, I’m still not totally up to speed on charting in the python context, and the result is that I didn’t really (think to) explore how graphical, chart based representations might be used to support the story. One thing that charts can do – like photographs – is take up arbitrary amounts of space, which can be a Good Thing (if you need to fill the space) or a Bad Thing (is space is at a premium, or page (print or web) layout is a constraint, perhaps due to page templating allowances, for example.
Some things I didn’t consider but that now come to mind now are:
- how are charts practically handed over? (As Excel charts? as image files?)
- does a sub-editor or web-developer then process the charts somehow?
- for print, are there limitations on use of colour, line thickness, font-size and style?
Print vs Web
I didn’t really consider this, but in terms of workflow and output, are different styles of presentation required for:
- data tables
If you want data tables, there are various libraries or tools for styling charts, but again the question of workflow and the actual form in which items are handed over for print or web publication needs to be considered.
Being right/being wrong
Every cell in a data table is a “fact”. If your code is wrong and and one column, or row, or cell is wrong, that can cause trouble. When you’re tinkering in private, that doesn’t matter so much – every cell can be used as the basis for another question that can be used to test, probe or check that fact further. If you publish that cell, and it’s wrong, you’ve made a false claim… Academics are cautious and don’t usually like to commit to anything without qualifying it further (sic;-). I trust most data, metadata and my own stats skills little enough that I see stats’n’data as a source that needs corroborating, which means showing it to someone else with my conclusions and a question along the lines of “it seems to me that this data suggests that – would you agree?”. This perhaps contrasts with relaying a fact (eg a particular food hygiene score) and taking it as-is as a trusted fact, given it was published from a trusted authoritative source, obtained directly from that source, and not processed locally, but then asking the manager of that establishment for a comment about how that score came about or what action they have taken as a result of getting it.)
I’m also thinking it’d be interesting to compare the similarities and differences between journalists and academics in terms of their relative fears of being wrong…!
One of things I kept pondering – and have been pondering for months – is the extent to which templated analyses can be used to create local “press release” story packs around national datasets that can be customised for local or regional use. That’s a far more substantial topic for another day, but it was put into relief last week by my reading of Nick Carr’s The Glass Cage which got me thinking about the consequences of “robot” written stories… (More about that in a forthcoming post.)
Lots of skills issues, lots of process and workflow issues, lots of story discovery, story creation, story telling and story checking issues, lots of production constraints, lots of time constraints. Fascinating. Got me really excited again about the challenges of, and opportunities for, putting data to work in a news context…:-)
Thanks to all at the Harrogate Advertiser, in particular Ruby Kitchen for putting up with my questions and distractions, and Mark Woodward for setting it all up.
Reading our local weekly press this evening (the Isle of Wight County Press), I noticed a page 5 headline declaring “Alarm over death rates at St Mary’s”, St Mary’s being the local general hospital. It seems a Department of Health report on hospital mortality rates came out earlier this week, and the Island’s hospital, it seems, has not performed so well…
Seeing the headline – and reading the report – I couldn’t help but think of Ben Goldacre’s Bad Science column in the Observer last week (DIY statistical analysis: experience the thrill of touching real data ), which commented on the potential for misleading reporting around bowel cancer death rates; among other things, the column described a statistical graphic known as a funnel plot which could be used to support the interpretation of death rate statistics and communicate the extent to which a particular death rate, for a given head of population, was “significantly unlikely” in statistical terms given the distribution of death rates across different population sizes.
Given the interest there appears to be around data journalism at the moment (amongst the digerati at least), I thought there might be a reasonable chance of finding some data inspired commentary around the hospital mortality figures. So what sort of report was produced by the Guardian (Call for inquiries at 36 NHS hospital trusts with high death rates) or the Telegraph (36 hospital trusts have higher than expected death rates), both of which have pioneering data journalists working for them, come up with? Little more than the official press release: New hospital mortality indicator to improve measurement of patient safety.
The reports were both formulaic, picking on leading with the worst performing hospital (which admittedly was not mentioned in the press release) and including some bog standard quotes from the responsible Minister lifted straight out of the press release (and presumably written by someone working for the Ministry…) Neither the Guardian nor the Telegraph story contained a link to the original data, which was linked to from the press release as part of the Notes to editors rider.
If we do a general, recency filtered, search for hospital death rates on either Google web search:
or Google news search:
we see a wealth of stories from various local press outlets. This was a story with national reach and local colour, and local data set against a national backdrop to back it up. Rather than drawing on the Ministerial press released quotes, a quick scan of the local news reports suggests that at least the local journalists made some effort compared to the nationals’ churnalism, and got quotes from local NHS spokespeople to comment on the local figures. Most of the local reports I checked did not give a link to the original report, or dig too deeply into the data. However, This is Tamworth, (which had a Tamworth Herald byline in the Google News results), did publish the URL to the full report in its article Shock report reveals hospital has highest death rate in country, although not actually as a link… Just by the by, I also noticed the headline was flagged with a “Trusted Source” badge:
Is that Tamworth Herald as the trusted source, or the Department of Health?!
Given that just a few days earlier, Ben Goldacre had provided an interesting way of looking at death rate data, it would have been nice to think that maybe it could have influenced someone out there to try something similar with the hospital mortality data. Indeed, if you check the original report, you can find a document describing How to interpret SHMI bandings and funnel plots (although, admittedly, not that clearly perhaps?). And along with the explanation, some example funnel plots.
However, the plots as provided are not that useful. They aren’t available as image files in a social or rich media press release format, nor are statistical analysis scripts that would allow the plots to be generated from the supplied data in too like R; that is to say, the executable working wasn’t shown…
So here’s what I’m thinking: firstly, we need data press officers as well as data journalists. Their job would be to put together the tools that support the data churnalist in taking the raw data and producing statistical charts and interpretation from it. Just like the ministerial quote can be reused by the journalist, so the data press pack can be used to hep the journalist get some graphs out there to help them illustrate the story. (The finishing of the graph would be up to the journalist, but the mechanics of the generation of the base plot would be provided as part of the data press pack.)
Secondly, there may be an opportunity for an enterprising individual to take the data sets and produced localised statistical graphics from the source data. In the absence of a data press officer, the enterprising individual could even fulfil this role. (To a certain extent, that’s what the Guardian Datastore does.)
(Okay, I know: the local press will have allocated only a certain amount of space to the story, and the editor would likely see any mention of stats or funnel plots as scaring folk off, but we have to start changing attitudes, expectations, willingness and ability to engage with this sort of stuff somehow. Most people have very little education in reading any charts other than pie charts, bar charts, and line charts, and even then are easily misled. We have start working on this, we have to start looking at ways of introducing more powerful plots and charts and helping people get a folk understanding of them. And funnel plots may be one of the things we should be starting to push?)
Now back to the hospital data. In How Might Data Journalists Show Their Working? Sweave, I posted a script that included the working for generating a funnel plot from an appropriate online CSV data source. Could this script be used to generate a funnel plot from the hospital data?
I had a quick play, and managed to get a scatterplot distribution that looks like the one on the funnel plot explanation guide by setting the number value to the SHMI Indicator data (csv) EXPECTED column and the p to the VALUE column. However, because the p value isn’t a probability in the range 0..1, the p.se calculation fails:
p.se <- sqrt((p*(1-p)) / (number))
Anyway, here’s the script for generating the straightforward scatter plot (I had to read the data in from a local file because there was some issue with the security certificate trying to read the data in from the online URL using the RCurl library and hospitaldata = data.frame( read.csv( textConnection( getURL( DATA_URL ) ) ) ):
hospitaldata = read.csv("~/Downloads/SHMI_10_10_2011.csv")
number = hospitaldata$EXPECTED
p = hospitaldata$VALUE
df = data.frame(p, number, Area=hospitaldata$PROVIDER.NAME)
ggplot(aes(x = number, y = p), data = df) + geom_point(shape = 1)
There’s presumably a simple fix to the original script that will take the range of the VALUE column into account and allow us to plot the funnel distribution lines appropriately? If anyone can suggest the fix, please let me know in a comment…;-)
The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context of some very powerful data handling tools, as Simon “Guardian Datastore” Rogers appearance at Google I/O suggests:
(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)
As I start to doodle ideas for an open online course on something along the lines of “visually, data” to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre: Survey on Data Journalism to try to find out what might actually be useful to journalists…;-)
[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher
I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…
PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data
PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visulisation is the thing, then here’s a few things to think about…
Ben Fry interviewed at Where 2.0 2011
To try to bring a bit of focus back to this blog, I’ve started a new blog – F1 Data Junkie: http://f1datajunkie.blogspot.com (aka http://bit.ly/F1DataJunkie) – that will act as the home for my “procedural” F1 Data postings. I’ll still post the occasional thing here – for example, reviewing the howto behind some of the novel visualisations I’m working on (such as practice/qualification session utilisation charts, and race battle maps), but charts relating to particular races, will, in the main, go onto the new blog….
I’m hoping by the end of the season to have an automated route of generating different sorts of visual reviews of practice, qualification and race sessions based on both official timing data, and (hopefully) the McLaren telemetry data. (If anyone has managed to scrape and decode the Mecedes F1 live telemetry data and is willing to share it with me, that would be much appreciated:-)
I also hope to use the spirit of F1 to innovate like crazy on the visualisations as and when I get the chance; I think that there’s a lot of mileage still to come in innovative sports graphics/data visualisations*, not only for the stats geek fan, but also for sports journalists looking to uncover stories from the data that they may have missed during an event. And with a backlog of data going back years for many sports, there’s also the opportunity to revisit previous events and reinterpret them… Over this weekend, I’ve been hacking around a few old scripts to to to automate the production of various data formatters, as well as working on a couple of my very own visualisation types:-) So if you want to see what I’ve been up to, you should probably pop over to F1 Data Junkie, the blog… ;-)
*A lot of innovation is happening in live sports graphics for TV overlays, such as the Piero system developed by the BBC, or the HawkEye ball tracking system (the company behind it has just been bought by Sony, so I wonder if we’ll see the tech migrate into TVs, start to play a role in capturing data that crosses over in gaming (e.g. Play ALong With the Real World), or feed commercial data augmentations from Sony to viewers via widgets on Sony TVs…
There’ve also been recent innovations in sports graphics in the press and online. For example, seeing this interactive football chalkboard on the Guardian website, that lets you pull up, in a graphical way, stats reviews of recent and historical games, or this Daily Telegraph interactive that provides a Hawk-Eye analysis of the Ashes (is there an equivalent for The Master golf anywhere, I wonder, or Wimbledon tennis? See also Cricket visualisation tool), I wonder why there aren’t any interactive graphical viewers over recent and historical F1 data…. (or maybe there are? If you know of any – or know of any interesting visualisations around motorsport in general and F1 in particular, please let me know in the comments…:-)
Having managed to get F1 timing data data through my cobbled together F1 timing data Scraperwiki, it becomes much easier to try out different visualisation approaches that can be used to review the stories that sometimes get hidden in the heat of the race (that data journalism trick of using visualisation as an analytic tool for story discovery, for example).
Whilst I was on holiday, reading a chapter in Beautiful Visualization on Gapminder/Trendalyser/Google Motion Charts (it seems the animations may be effective when narrated, as when Hans Rosling performs with them, but for the uninitiated, they can simply be confusing…), it struck me that I should be able to view some of the timing data in the motion chart…
So here’s a first attempt (going against the previously identified “works best with narration” bit of best practice;-) – F1 timing data (China 2011) in Google Motion Charts, the video:
Visualising the China 2011 F1 Grand Prix in Google Motion Charts
If you want to play with the chart itself, you can find it here: F1 timing data (China 2011) Google Motion Chart.
The (useful) dimensions are:
- lap – the lap number;
- pos – the car/racing number of each driver;
- trackPos – the position in the race (the racing position);
- currTrackPos – the position on the track (so if a lapped car is between the leader and second place car, their respective currtrackpos are 1, 2, 3);
- pitHistory – the number of pit stops to date
The timeToLead, timeToFront and timeToBack measures give the time (in seconds) between each car and the leader, the time to the car in the racing position ahead, and the time to the car in racing position behind (these last two datasets are incomplete at the moment… I still need to calculate this missing datapoints…). The elapsedTime is the elapsed racetime for each car at the end of each measured lap.
The time starts at 1900 because of a quirk in Google Motion Charts – they only work properly for times measured in years, months and days (or years and quarters) for 1900 onwards. (You can use years less than 1900 but at 1899 bad things might happen!) This means that I can simply use the elapsed time as the timebase. So until such a time as the chart supports date:time or :time as well as date: stamps, my fix is simply to use an integer timecount (the elapsed time in seconds) + 1900.
Jod ads come and go, so I thought I’d capture the main elements of this one from the BBC:
Data Journalist – Role Purpose and Aims
You will be required to humanize statistics; to make sense of potentially complicated data and present it in a user friendly format.
You will be asked to focus on a range of data-rich subjects relating to long-term projects or high impact daily new stories, in line with Global News editorial priorities. These could include the following: reports on development, global poverty, Afghanistan casualties, internet connectivity around the world, or global recession figures.
Key Knowledge and Experience
You will be a self-starter, brimming with story ideas who is comfortable with statistics and has the expertise to delve beneath the headline figures and explain the fuller picture.
You will have significant journalistic experience gained ideally from working in an international news environment.
The successful candidate should have experience (or at least awareness) of visualising data and visualisation tools.
You should be excited about developing the way that data is interpreted and presented on the web, from heavy number crunching, to dynamic mapping and interactive graphics. You must have demonstrated knowledge of statistics, statistical analysis, with a good understanding of the range and breadth of data sources in the UK and internationally, broad experience with data sources, data mining and have good visual and statistical skills.
You must have a Computer-assisted reporting background or similar, including a good knowledge of the relevant software (including Excel and mapping software).
Experience of producing and developing data driven web content a senior level within time and budget constraints.
A thorough understanding of the BBC World Service’s aims and the part this initiative plays in meeting them.
Excellent communication and interpersonal skills with ability to present information concisely to a broad audience including journalists and commissioning editors. You should be able to demonstrate the ability to influence, negotiate with and persuade others.
Central to the role is an ability to analyse complicated information and present it to our readers in a way that is visually engaging and easy to understand, using a range of web-based technologies, for which you should have familiarity with database interfaces and web presentation layers, as well as database concepting, content entry and management.
You will be expected to have your own original ideas on how to best apply data driven journalism, either to complement stories when appropriate or to identify potential original stories while interpreting data, researching and investigating them, crunching the data yourself and working with designers and developers on creating content that will engage our audience, and provide them with useful, personalised information.
You will work in a multimedia way, when appropriate, liaising with online but also radio and TV and specialist output producers as required, from a range of language services. You will help lead the development of computer-assisted reporting skills in the wider news specials team.
To identify a range of significant statistics and data -driven stories that can be developed and result in finished graphics that can be used across BBC News websites.
To take a lead role in devising compelling ways of telling data-driven stories on the web, working with specials team designers, developers and journalists as required. Also liaising with radio and TV and specialist output producers across World Service as required, providing a joined-up multi-platform proposition for the audience.
To work with senior stakeholders and programme teams and be an internal expert who can interpret and concisely explain the significance of data to others, and related good practice.
Support the College of Journalism – to help devise training sessions in order to spread the knowledge and best practices of data driven journalism
To help inform the future development by FM&T of tools which enable data-driven stories to be told more quickly and effectively on the web.
To keep abreast of developments in data driven journalism, and pursue collaboration with other teams working on the same area, both within the BBC and also with external organisations.
Willingness to work across a range of online production skills in a flexible manner to BBC standards and values.
· Using their own initiative the successful candidate will be required to build relationships with major sources of content (e.g. BBC networks, programmes, external interest groups) and promote opportunities for cross-media production
Makes the right editorial and policy decisions based upon a clear understanding of the BBC’s distinctive news agenda, the requirements of news and current affairs coverage.
Planning & Organising
Is able to think ahead in order to establish an effective and appropriate course of action for self and others. Prioritises and plans activities taking into account all the relevant issues and factors such as deadlines and resources requirements.
Able to simplify complex problems, process projects into component parts, explore and evaluate them systematically.
Is able to transform creative ideas/impulses into practical reality. Can look at existing situations and problems in novel ways and come up with creative solutions.
Can maintain personal effectiveness by managing own emotions in the face of pressure, set backs or when dealing with provocative situations. Can demonstrate an approach to work that is characterised by commitment, motivation and energy.
The ability to get one’s message understood clearly by adopting a range of styles, tools and techniques appropriate to the audience and the nature of the information.
Influencing and Persuading
Ability to present sound and well reasoned arguments to convince others. Can draw from a range of strategies to persuade people in a way that results in agreement or behaviour change.
Managing Relationships and Team Working
Able to build and maintain effective working relationships with a range of people. Highly effective team player.
Here’s a quick run through of some of the basics (at least, a run through of the first few things I tried to do with the tool…)
First off, we need some data. Orange likes TSV (tab separated values) rather than CSV, so I grabbed some TSV from one of the Guardian Datastore spreadheets on Google Docs (use “Save as Text” to get the tab separated value format…)
Orange is a canvas based visual programming environment, in which functional blocks are added the the canvas and certain parameters set within the block. Here’s how we get some data into Orange from a TSV file:
The File icon is giving me a warning (no dependent variable) but I’m not sure why…? I’m sure Orange has managed to detect labels and quantities correctly from other files I’ve tried?
Anyway… we can inspect the data by looking at it in a data table widget – just wire one in:
The table is sortable by column, and the Report button can be used to save a version of the table. Looking t the data table, we see it has identified columns with missing entries. We can clean these from out data set using the Preprocessing widget:
If we now wire the output of the Processing widget into the Scatterplot widget, we can generate a variety of scatterplots:
If you want to save a copy of the chart, it’s easy enough to do so. (I can’t get colour palettes to work on my Mac, so I’m stuck with greyscale displays. Also, the blob sizing doesn’t seem very responsive…)
The Report tool allows us to create a report from various bits of the dataflow, including adding information from several widgets to either separate report pages or the same report page.
Saving a Report saves all the report pages to a navigable set of HTML pages that resemble the Orange Report viewer.
Here are a couple of other things we can do with the data, this time using a data set that isn’t throwing the “dependent variable missing” error, in particular the distribution of comments in a small Friendfeed network…
So for example, here’s how the number of comments made by members of the network is distributed:
Alternatively, we may look at the distribution in a more “statistical” way:
(Remember, we can generate these reports interactively, and then add them to a growing report.)
The survey plot gives us a macroscopic birds eye view over the whole of the data set:
Okay, that’s enough for starters – hopefully you get the idea: wire stuff together and generate visual reports… So why not go and download Orange now?!;-)
There are a whole range of clustering tools, too, which look like they could be interesting…
And I think the platform is extensible, which means there’s a way of adding your own widgets (written in Python, maybe..?)