In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:
Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while later [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] established that local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], p114-15)”].
As well as human sources, news gatherers may also look to data sources at either a local level, such as local council transparency (that is, spending data), or national data sources with a local scope as part of a regular beat. For example, the NHS publish accident and emergency statistics as the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.
To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
In this context of this post, I will be considering the role that metadata about court cases that is contained within court lists and court registers might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which the metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, as well to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view over the activity of the courts than the impression one might get about their behaviour simply from the balance of coverage provided by the media.
Reading our local weekly press this evening (the Isle of Wight County Press), I noticed a page 5 headline declaring “Alarm over death rates at St Mary’s”, St Mary’s being the local general hospital. It seems a Department of Health report on hospital mortality rates came out earlier this week, and the Island’s hospital, it seems, has not performed so well…
Seeing the headline – and reading the report – I couldn’t help but think of Ben Goldacre’s Bad Science column in the Observer last week (DIY statistical analysis: experience the thrill of touching real data ), which commented on the potential for misleading reporting around bowel cancer death rates; among other things, the column described a statistical graphic known as a funnel plot which could be used to support the interpretation of death rate statistics and communicate the extent to which a particular death rate, for a given head of population, was “significantly unlikely” in statistical terms given the distribution of death rates across different population sizes.
Given the interest there appears to be around data journalism at the moment (amongst the digerati at least), I thought there might be a reasonable chance of finding some data inspired commentary around the hospital mortality figures. So what sort of report was produced by the Guardian (Call for inquiries at 36 NHS hospital trusts with high death rates) or the Telegraph (36 hospital trusts have higher than expected death rates), both of which have pioneering data journalists working for them, come up with? Little more than the official press release: New hospital mortality indicator to improve measurement of patient safety.
The reports were both formulaic, picking on leading with the worst performing hospital (which admittedly was not mentioned in the press release) and including some bog standard quotes from the responsible Minister lifted straight out of the press release (and presumably written by someone working for the Ministry…) Neither the Guardian nor the Telegraph story contained a link to the original data, which was linked to from the press release as part of the Notes to editors rider.
If we do a general, recency filtered, search for hospital death rates on either Google web search:
or Google news search:
we see a wealth of stories from various local press outlets. This was a story with national reach and local colour, and local data set against a national backdrop to back it up. Rather than drawing on the Ministerial press released quotes, a quick scan of the local news reports suggests that at least the local journalists made some effort compared to the nationals’ churnalism, and got quotes from local NHS spokespeople to comment on the local figures. Most of the local reports I checked did not give a link to the original report, or dig too deeply into the data. However, This is Tamworth, (which had a Tamworth Herald byline in the Google News results), did publish the URL to the full report in its article Shock report reveals hospital has highest death rate in country, although not actually as a link… Just by the by, I also noticed the headline was flagged with a “Trusted Source” badge:
Is that Tamworth Herald as the trusted source, or the Department of Health?!
Given that just a few days earlier, Ben Goldacre had provided an interesting way of looking at death rate data, it would have been nice to think that maybe it could have influenced someone out there to try something similar with the hospital mortality data. Indeed, if you check the original report, you can find a document describing How to interpret SHMI bandings and funnel plots (although, admittedly, not that clearly perhaps?). And along with the explanation, some example funnel plots.
However, the plots as provided are not that useful. They aren’t available as image files in a social or rich media press release format, nor are statistical analysis scripts that would allow the plots to be generated from the supplied data in too like R; that is to say, the executable working wasn’t shown…
So here’s what I’m thinking: firstly, we need data press officers as well as data journalists. Their job would be to put together the tools that support the data churnalist in taking the raw data and producing statistical charts and interpretation from it. Just like the ministerial quote can be reused by the journalist, so the data press pack can be used to hep the journalist get some graphs out there to help them illustrate the story. (The finishing of the graph would be up to the journalist, but the mechanics of the generation of the base plot would be provided as part of the data press pack.)
Secondly, there may be an opportunity for an enterprising individual to take the data sets and produced localised statistical graphics from the source data. In the absence of a data press officer, the enterprising individual could even fulfil this role. (To a certain extent, that’s what the Guardian Datastore does.)
(Okay, I know: the local press will have allocated only a certain amount of space to the story, and the editor would likely see any mention of stats or funnel plots as scaring folk off, but we have to start changing attitudes, expectations, willingness and ability to engage with this sort of stuff somehow. Most people have very little education in reading any charts other than pie charts, bar charts, and line charts, and even then are easily misled. We have start working on this, we have to start looking at ways of introducing more powerful plots and charts and helping people get a folk understanding of them. And funnel plots may be one of the things we should be starting to push?)
Now back to the hospital data. In How Might Data Journalists Show Their Working? Sweave, I posted a script that included the working for generating a funnel plot from an appropriate online CSV data source. Could this script be used to generate a funnel plot from the hospital data?
I had a quick play, and managed to get a scatterplot distribution that looks like the one on the funnel plot explanation guide by setting the number value to the SHMI Indicator data (csv) EXPECTED column and the p to the VALUE column. However, because the p value isn’t a probability in the range 0..1, the p.se calculation fails:
p.se <- sqrt((p*(1-p)) / (number))
Anyway, here’s the script for generating the straightforward scatter plot (I had to read the data in from a local file because there was some issue with the security certificate trying to read the data in from the online URL using the RCurl library and hospitaldata = data.frame( read.csv( textConnection( getURL( DATA_URL ) ) ) ):
hospitaldata = read.csv("~/Downloads/SHMI_10_10_2011.csv")
number = hospitaldata$EXPECTED
p = hospitaldata$VALUE
df = data.frame(p, number, Area=hospitaldata$PROVIDER.NAME)
ggplot(aes(x = number, y = p), data = df) + geom_point(shape = 1)
There’s presumably a simple fix to the original script that will take the range of the VALUE column into account and allow us to plot the funnel distribution lines appropriately? If anyone can suggest the fix, please let me know in a comment…;-)
What about monetization? Well, first of all, there are already many private entities who make a nice living processing public data. Why not the newsmedia? Take the education market: Why not having editorial products, designed by professional journalists, capitalizing on powerful label such as Le Monde, VG or The Guardian to address this audience with well designed products, in print or online? Think about students, how they could use this new knowledge with their laptops or iPhones. This market is up for grabs. And medias are well positioned to enter it. (Or someone else will.
My gut feeling is that with the news media trying to redefine itself for a future where revenues aren’t guaranteed by ad-sales in daily or weekend papers, and the higher education market (in the UK at least) potentially looking set for a fall in the short term as graduate openings disappear and institutions look for ways to increase student fees, there is an opportunity for a new sort of service provider, perhaps not dissimilar to a professional institution, to take up the slack and provide quality comment, analysis to the media (think: paidContent for the quality papers’ analysis sections, powered by academe redefined (i.e. Academe 2.0;-)); and FE/HE level lifelong
training “learning” to whosoever needs it.
When the OU was founded 40 years ago, it opened up access to Higher Education in the UK for those who couldn’t otherwise access it, and opened to doorway for many to membership of a professional institution. One of the driving reasons for the institutions was to keep their members up-to-date with innovations in their profession. However, those institutions have suffered terribly in recent years, (declining numbers of members – you can probably guess the rest…) so maybe it’s time for a rethink…
Indeed, maybe it’s time for something that combines elements of Higher Education, professional institutions and “products” like Guardian Professional (such as their research service, and maybe even the events* part?) wrapped up with some form of verification service that blends elements of professional, academic and maybe even vendor certification?
And maybe data analysis and commentary is one way in to that?
* I keep wondering why it is that Guardian, TED and O’Reilly conferences (as well as a wide variety of unconferences) are of far more interest to me than academic ones? It can’t just be that they tend to publish their audio streams online?;-)
PS see also Guerrilla Education: Teaching and Learning at the Speed of News and its associated comments.