
Transparency in Parliament… And in Data Journalism?

Over the weekend, I picked up a copy of Parliament Ltd, a two hundred and fifty page rant (or should that be diatribe?!) against various MPs and Lords and their registered (and unregistered) interests. [Disclosure: I’ve picked up a few days paid work for the Parliamentary Digital Service this year.]

The book draws on data scraped from the Parliament website (presumably), as well as Companies House (via a collaboration – or business arrangement? I wasn’t quite sure..?! – with DueDil). As anyone who’s tried looking at registers of interests on the Parliament website will know, they’re not published in the friendliest of formats, and the data is not made available as a machine readable downloadable dataset.

Sources of “Interests” Data From Parliament

By the by, the registers maintained on the Parliament website include:

  • the Register of Members’ Financial Interests (for MPs);
  • the Register of Lords’ Interests;
  • the registers of interests of members’ staff in each House.

There’s also the register of all-party groups, which includes statements of benefits received by groups from third parties (links to old scrapers here, possibly?).

Another place we might look for associations between MPs/Lords and companies, or other organisations, is in Hansard. For example, Evan Odell recently published a dataset on Hansard Speeches and Sentiment that “provides information on each speech of ten words or longer, made in the House of Commons between 1980 and 2016, with information on the speaking MP, their party, gender and age at the time of the speech”. The R code is supplied, so we could presumably use that as a basis for running the transcripts through a named entity extractor to try to pull out the names of companies or organisations mentioned by each speaker (perhaps as well as something that looks out for declarations of interest mentioned whilst speaking?). It might also be interesting to try to match sentiment with organisation mentions?!
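By way of a sketch of the named entity extraction step, here’s how organisation names might be pulled from a speech using the spaCy Python library (the speech text is invented for the example; you’d loop this over the Hansard transcripts):

```python
# A minimal sketch, assuming spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical snippet of a speech
speech = ("I refer the House to my entry in the Register, and to consultancy "
          "work undertaken for Acme Holdings Ltd.")

doc = nlp(speech)

# Keep just the spans the model tags as organisations
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(orgs)  # e.g. ['Acme Holdings Ltd']
```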

Where companies are mentioned in a debate, and the debate leads to a division (that is, a vote), we can then use sources such as The Public Whip to download information scraped from the Parliament website about who voted how on which division, and perhaps look for MPs voting against their party line but in favour of a particular interest.
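As a sketch of the sort of recipe that might help there, assume we’ve pulled a division’s votes into a simple table with mp, party and vote columns (the data and column names here are hypothetical – the real Public Whip downloads will differ):

```python
import pandas as pd

# Hypothetical division data
votes = pd.DataFrame([
    ("A. Member", "Party X", "aye"),
    ("B. Member", "Party X", "aye"),
    ("C. Member", "Party X", "no"),
    ("D. Member", "Party Y", "no"),
    ("E. Member", "Party Y", "no"),
], columns=["mp", "party", "vote"])

# Take the majority vote within each party as the "party line"...
party_line = votes.groupby("party")["vote"].agg(lambda v: v.mode()[0])

# ...and flag MPs who voted against it
votes["party_line"] = votes["party"].map(party_line)
rebels = votes[votes["vote"] != votes["party_line"]]
print(rebels[["mp", "party", "vote"]])  # C. Member rebelled against Party X
```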

(If you know other sources of scraper code, or APIs offered over scraped versions of any of the above registers, please let me know via the comments and I’ll add them in. Also any registers I’ve missed…)

Other Sources of Data Relating to Members’ Parliamentary and Government Activities

By the by, the APPG post also led me to another old post on scraping Ministers’ meetings. For an idea of the sorts of thing currently disclosed (at a departmental level?), see e.g. Cabinet Office: ministers’ transparency publications. There are possibly other forms of declaration on other Government Department websites?

In relation to lobbying firms, there is the Office of the Registrar of Consultant Lobbyists.

Also outside Parliament, the Electoral Commission provide information about donations and loans to individuals (including MPs) and candidate spending and donations at elections.

Other Sources of Information About Members’ External Interests

Companies House can also be used to look up whether a particular named individual is or has been listed as a company officer (such as a director), or is a person of significant control (PSC, sometimes referred to as a “beneficial owner”) of a particular company. Whilst the PSC register is currently available as a bulk data download, the director information isn’t (at least, not without making a personal request). It can be accessed in a piecemeal fashion via the Companies House API though. Current and recently disqualified directors can be found via The Insolvency Service or the Companies House API. The Insolvency Service also publish information about Individual Insolvency (that is, bankruptcies).
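For example, here’s roughly what a name based lookup against the Companies House API might look like (you need to register with Companies House for an API key, which is passed as the username in an HTTP Basic Auth request; treat the endpoint and response fields as things to check against the current API docs):

```python
import requests

API_KEY = "YOUR_COMPANIES_HOUSE_API_KEY"  # obtained by registering with Companies House

# Search the officer register for a particular name
url = "https://api.company-information.service.gov.uk/search/officers"
resp = requests.get(url, params={"q": "John Smith"}, auth=(API_KEY, ""))
resp.raise_for_status()

# Each hit gives an officer name and enough address detail to help disambiguate
for officer in resp.json().get("items", []):
    print(officer.get("title"), "-", officer.get("address_snippet"))
```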

Where individuals are associated with an organisation and are registered as a data controller, they should also be listed as an entry on the ICO Data Protection Register.

Evan’s Github account also hosts a fork of a repo published by the NCVO for import[ing] data from the Charity Commission data extract – data that presumably lists trustees, and that can again be used as the basis for finding associations between individuals and organisations.

At a local level, local councils hold a variety of public registers, detailing for example the names of individuals licensed to sell alcohol, or to act as operators of betting, sex or animal breeding establishments. The CQC publish data listing the names of individuals in charge of operating care homes. NHS England list names of GPs working at particular practices. And so on…

More generally, the Advisory Committee on Business Appointments (Acoba) has details of Appointments taken up by former ministers. (Acoba also report on Appointments taken up by former Crown servants.)

So What?

So that’s all so much data, and as Martin Williams points out in his book, it can take a lot of effort to pull the data into some sort of shape where you can use it. And with data sourced from various places, there may be issues associated with sharing the data on once you have processed it.

To a certain extent, you might argue that Parliament is blocking “transparency” around members’ interests – and possible conflicts of interest – by publishing the data in a way that makes it difficult to process as data without a fair amount of prep work. But I’m not sure how true that is. Journalists are, in part, competitive beasts, wanting to be the first to a story. If a dataset is well presented and comes with analysis scripts that identify story features and story points, essentially generating a press release around a dataset without much effort involved, there’s nothing there to find (nothing “hidden” in the data waiting for the intrepid journalist to reveal it). But when the data is messy and takes some effort to clean up, the chances that anyone else will just stumble across the story point by chance are reduced. And when the data is “secret” but still publicly accessible, all the better. For example, it’s no surprise that a common request of Alaveteli (the platform underpinning FOI request site WhatDoTheyKnow) was from journalists wanting to be able to hide, or at least embargo, their requests, and the (data) responses provided to them (h/t Chris Adams for that observation and link).

Another question that arises around journalists who clean datasets and then analyse them, but who don’t then share their working (the data cleaning and analysis scripts), is the extent to which they are themselves complicit in acting against transparency. Why should we believe the journalists’ accusations or explanations without seeing what they are actually based on? (Maybe in cleaning the dataset, they threw away explicit declarations of interest because they were too messy to process, which then skewed the conclusions drawn from the data analysis?) By sharing analyses, you also provide others with the opportunity to spot errors in your working, or maybe even improve it (scary for some; but consider the alternative: you produce an analysis script that contains an error, and maybe reuse it, generating claims that are false and that cannot be supported by the data. Publishing those is not in your interest.) There also seems to be an implicit assumption that competitors are trying to steal your stories rather than find their own. They probably think and say the same about you. But who has the time to spend all of it cribbing over other people’s shoulders? (Other than me, of course;-))

On the other hand, there may be some commercial or competitive intelligence advantage in having a cleaned dataset that you can work with efficiently that is not available to other journalists or that you believe may hide further stories. (A similar argument to the latter is often made by academic researchers who do not want to share their research data, lest someone else makes a discovery from it that eluded them.) But then, with a first mover advantage, you should be able to work with your data and link it to other data sets faster than your competitors. And if they are sharing data back too, then you may be able to benefit from their cleaned data and analysis scripts. Everyone gets to improve their game.

Another possible form of “competitive” advantage that comes from not publishing cleaned datasets or scripts is that it doesn’t tip the hand of the journalist and reveal investigative “trade secrets” to the subject or subjects of an investigation. For by revealing how a story was identified from a dataset, you give subjects the chance to change their behaviour so as not to divulge information into the dataset in the same revealing way in the future.

One final consideration: when it comes to news stories, to what extent do part-time tinkerers and civic tech hackers such as myself spoil a possible story by doing a halfway hack on a dataset, bringing small scale attention to it, and as a consequence disabling or polluting it as a source of journalistic novelty/story-worthiness? Does anyone have examples of where a possible open data story was not pursued by the press because a local data geek blogger got there first?

How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay insofar as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?

If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code and a means to import the data used in the analysis, as well as the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded in data, third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can do. (See here for an example.)

But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.

In the above linked example, the Rmd script links to a local copy of a dataset I’d downloaded onto my local computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things (a minimal sketch combining both follows the list):

  • added a local copy of the data to the repository and checked that the script linked to it correctly via a relative path;
  • and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.
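Here’s one way that combination might look in Python (the URL and paths are hypothetical, and the licence caveat matters – see below):

```python
import os
import pandas as pd

# Hypothetical original download link for the datafile...
DATA_URL = "https://example.gov.uk/path/to/data.csv"
# ...and a local copy, linked relative to the repository root
LOCAL_PATH = "data/data.csv"

def load_data(url=DATA_URL, path=LOCAL_PATH):
    """Load the local copy if it exists, else fetch from the source URL and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df = pd.read_csv(url)  # pandas will read a CSV straight from a URL
    df.to_csv(path, index=False)  # cache a local copy - licence permitting!
    return df

df = load_data()
```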

Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow sharing on, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.

As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.

For example, where a dataset is available from a particular URL but authentication is required to access it. (Writing a reusable downloader for such datasets often needs a few more tweaks! A stop-gap is to provide the URL in the reproducible report document and explicitly instruct the reader to download the dataset locally using their own credentials, then load it in from the local copy.)

Or as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add where data is commercial, and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.
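In code terms, the stop-gap for all these cases might look something like the following – fail gracefully with instructions rather than trying to automate access to, or republish, the data (the paths and message are illustrative):

```python
import os
import pandas as pd

SOURCE_URL = "https://example.gov.uk/secure/data.csv"  # login / clearance required
LOCAL_PATH = "data/secure_data.csv"  # where the reader should put their own copy

# Don't ship the data; tell the reader how to obtain it themselves
if not os.path.exists(LOCAL_PATH):
    raise FileNotFoundError(
        f"Using your own credentials, download {SOURCE_URL} "
        f"to {LOCAL_PATH}, then rerun this script."
    )

df = pd.read_csv(LOCAL_PATH)
```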

Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using free, but license key restricted (and non-shareable, or at least attributable), services such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that will allow a third party to tag the original dataset themselves and then analyse it.
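A sketch of what such a recipe might look like – the tagging function is a deliberate stub standing in for whichever licensed service the third party has a key for:

```python
import pandas as pd

def tag_text(text, api_key):
    """Stand-in for a call to a license key restricted entity extraction
    service (OpenCalais, Microsoft Cognitive Services, etc.) made with
    the reader's own key; should return a list of organisation names."""
    raise NotImplementedError("Plug in your own tagging service here")

def tag_dataset(df, text_col, api_key):
    # Derive the tagged dataset from the original, recipe-shareable one
    df = df.copy()
    df["orgs"] = df[text_col].apply(lambda t: tag_text(t, api_key))
    return df

# Usage (hypothetical file and column names):
#   tagged = tag_dataset(pd.read_csv("speeches.csv"), "speech_text", "MY_KEY")
```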

Sharing the Data Load

A few weeks ago, I popped together a post listing a few Data Journalism Units on Github. These repos (that is, repositories) are being used to share code (for particular interactives, for example), data, and analysis scripts. They’re starting to hint at ways in which support for public reproducible local data journalism might start to emerge from developing (standardised) data repositories and reproducible workflows built around them.

Here are a handful of other signals that I think support this trend that I’ve come across in the last few weeks (if they haven’t appeared in your own feeds, a great shortcut to many of them is via @digidickinson’s weekly Media Mill Gazette):

Organisations:

Applications:

Data:

And here’s another one, from today – the Associated Press putting together a pilot with data publishing platform data.world “to help newsrooms find local stories within large datasets” (Localizing data, quantifying stories, and showing your work at The Associated Press). I’m not sure what the pilot will involve, but the rationale sounds interesting:

Transparency is important. It’s a standard we hold the government to, and it’s a standard we should hold the press to. The more journalists can show their work, whether it’s a copy of a crucial document or the data underlying an analysis, the more reason their audience has to accept their findings (or take issue with them in an informed way). When we share our data and methodology with our members, those journalists give us close scrutiny, which is good for everyone. And when we can release the data more broadly and invite our readers to check our work, we create a more secure grounding for the relationship with the reader.

:-) S’what we need… Show your working…

Data Journalism Units on Github

Working as I do with an open notebook (this blog, my github repos, pinboard and twitter), I value works shared by other people too. Often, this can lead to iterative development, as one person sees an opportunity to use someone else’s work for a slightly different purpose, or spots a way to improve upon it.

A nice example of this that I witnessed in more or less realtime a few years ago was when data journalists from the Guardian and the Telegraph – two competing news outlets – bounced off each other’s work to produce complementary visualisations demonstrating electoral boundary changes (Data Journalists Engaging in Co-Innovation…). (By the by, boundary changes are up for review again in 2018 – the consultation is still open.)

Another example comes from when I started looking for cribs around building virtual machines to support OU course delivery. Specifically, the Infinite Interns idea for distinct (and disposable) virtual machines that could be used to support data journalism projects (about).

Today, I variously chanced across a couple of Github repositories containing data, analyses, toolkits and application code from a couple of different news outlets. Github was originally developed as a social coding environment where developers could share and collaborate on software projects. But over the last few years, it’s also started to be used to share data and (text) documents, as well as reproducible data analyses – and not just by techies.

A couple of different factors have contributed to this, I think, factors that relate as much to how Github can be used to preview and publish documents as to how it acts as a version control and issue tracking system.


Admittedly, using git and Github can be really complicated and scary, but you can also use it as a place to pop documents and preview them or publish them, as described above. And getting files in is easy too – just upload them via the web interface.

Anyway, that’s all by the by… The point of this post was to try to pull together a small collection of links to some of the data journalism units I’ve spotted sharing stuff on Github, and see to what extent they practice “reproducible data journalism”. (There’s also a Github list – Github showcase – open journalism.) So for example:

  • one of the first news units I spotted sharing research in a reproducible way was BuzzFeedNews and their tennis betting analysis. A quick skim of several of the repos suggests they use a similar format – a boilerplate README with a link to the story, the data used in the analysis, and a Jupyter notebook containing python/pandas code to do the analysis. They also publish a handy directory to their repos, categorised as Data and Analyses, Standalone Datasets, Libraries and Tools, and Guides. I’m liking this a lot…
  • fivethirtyeight: publish the data and code behind many of their articles in their data repo. There are also a few other data related repos at the top level, eg guns-data. Hmm… Data but no reproducible analyses?
  • SRF Data – srfdata (data-driven journalism unit of Swiss Radio and TV): several repos containing Rmd scripts (in English) analysing election related data. More of this, please…
  • FT Interactive News – ft-interactive: separate code repos for different toolkits (such as their nightingale-charts chart tools) and applications; a lot of the applications seem to appear in subscriber only stories – but you can try to download the code and run it yourself… Good for sharing code, though the paywall stops the sharing of executed examples;
  • New York Times – NYTimes: plenty of developer focussed repos, although the gunsales repo creates an R package that works with a preloaded dataset and routines to visualise the data, and the ingredient phrase tagger is a natural language parser trained to tag food recipe components. (Makes me wonder what other sorts of trained taggers might be useful…) One for the devs…
  • Washington Post – washingtonpost: more devops repos, though they have also dropped a database of shootings (as a CSV file) as one of the repos (data-police-shootings). I’d hoped for more…
  • NYT Newsroom Developers: another developer focussed collection of repos, though rather than focussing on just front end tools there are also scrapers and API helpers. (It might actually be worth going through all the various news/media repos to build a metalist/collection of API wrappers, scrapers etc. i.e. tools for sourcing data). I’d half expected to see more here, too…?
  • Wall Street Journal Graphics Team – WSJ: not much here, but picking up on the previous point there is this example of an AP ballot API wrapper; Sparse…
  • The Times / Sunday Times – times: various repos, some of them link shares; the data one collects links to a few datasets and related stories. Also a bit sparse…
  • The Economist – economist-data-team: another unloved account – some old repos for interactive HTML applications; Another one for the devs, maybe…
  • BBC England Data Unit – BBC-Data-Unit: a collection of repositories, one per news project. Recent examples include: Dog Fights and Schools Chemical Alerts. Commits seem to come from a certain @paulbradshaw… Repos seem to include a data file and a chart image. How to run the analysis/create the chart from the data is not shared… Could do better…

From that quick round up, a couple of impressions. Firstly, BuzzFeedNews seem to be doing some good stuff; the directory listing they use that breaks down different sorts of repos seems sensible, and could provide the basis for a more scholarly round up than the one presented here. Secondly, we could probably create some sort of matrix view over the various repos from different providers, that would allow us, for example, to see all the chart toolkits, or all the scrapers, or all the API wrappers, or all the election related stuff.

If you know of any more I should add to the list, please let me know via the comments below, ideally with a one or two line summary as per the above to give a flavour of what’s there…

I’m also mindful that a lot of people working for the various groups may also be publishing to personal repositories. If you want to suggest names for a round up of those, again, please do so via the comments.

PS I should really be recording the licenses that folk are releasing stuff under too…

“Local stories of national interest” – New Johnston Press (Data Journalism) Investigations Unit

Complementing the approach of Trinity Mirror, who launched a cross-group data journalism unit back in 2013, Johnston Press has pulled together a (virtual?) Investigations Unit made up of several investigative and data skilled reporters from across the Johnston Press regional titles (press release).

The unit’s first campaign is focussed on sentences awarded for causing death by dangerous driving. The campaign allows the unit to report on national datasets as such, as well as to develop local stories based on examples taken from the national dataset, bubbling up local stories to wider national interest as campaign hooks. From the press release announcing the launch of the unit, it seems as if this campaigning style of national/local investigative reporting will underpin the unit’s activities.

“As well as carrying out investigations, and telling powerful human interest stories, the unit has a campaigning and lobbying role at its heart” – Johnston Press press release.

The use of campaigns means the same theme can be kept alive and repeatedly reported on as an ongoing series over an extended period of time: tracked nationally but reported in a local context on the one hand, promoting local campaigns and then reporting them widely on the other.

The national/local model is one that I’ve long thought makes sense, though I’ve not really considered it in terms of the local to national twist. Instead, I’ve been framing it as an opportunity to address centrally common pain points that may be experienced trying to produce a story from data at a local level, as discussed in these thoughts on a locally targeted, nationally scoped datawire.

[Image: national dataset, local story]

One advantage of this approach is scale: graphics communicating national level statistics can be produced centrally and reused across local titles, perhaps with local customisation; local stories can be used to provide relevance to generic “national context” inserts reused across titles; and story templates can be customised to generate local reports from the same national dataset.

Another advantage with looking at national datasets is that they can help flag the newsworthiness of a local story given its national context (for example, national rankings generate story points for the top M, bottom N rankings).
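For example, here’s a crude sketch of a templated story generator run against a national rankings table (the areas, figures and template wording are all invented):

```python
import pandas as pd

# Hypothetical national dataset: one row per local authority area
national = pd.DataFrame(
    [("Northtown", 42.1), ("Southville", 17.3), ("Eastborough", 28.9)],
    columns=["area", "rate"],
)
national["rank"] = national["rate"].rank(ascending=False).astype(int)

def ordinal(n):
    # 1 -> "1st", 2 -> "2nd", 11 -> "11th", etc.
    if n % 100 in (11, 12, 13):
        return f"{n}th"
    return f"{n}" + {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

TEMPLATE = ("{area} has the {rank} highest rate in the country "
            "({rate} per 1,000 population), new national figures show.")

# One locally customised story lead per area, from a single national dataset
for _, row in national.iterrows():
    print(TEMPLATE.format(area=row.area, rank=ordinal(row["rank"]), rate=row.rate))
```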

I haven’t spent much time thinking about the campaign aspect, but on quick reflection I think that campaigns can act as nice wrappers for a wider range of templated activities and outputs.

For example, I’ve written a couple of times about the notion of story templates, noting how these have been rolled out in previous years by at least the Johnston Press and Trinity Mirror (Local News Templates – A Business Opportunity for Data Journalists?).

And eighteen months or so ago, I was fortunate enough to spend a couple of days seeing how Ruby Kitchen, then of the Harrogate Advertiser, now of the Yorkshire Post / Yorkshire Evening Post and the Johnston Press Investigations Unit, worked on a Food Standards Agency story (Data Journalism in Practice). One of the takeaways for me from that was what was involved in actually making use of leads thrown up from a data trawl and then chasing down people for comment. The work involved in putting together an investigation at a single local level may need to be repeated for other locales, but the process can be reused – the investigatory process can be templated.

On the way back home from Harrogate, I’d started fantasising about putting together a training pack based on the Food Standards Agency food hygiene ratings data (h/t Andy Dickinson for tangentially reminding me of this a couple of days ago :-), with a dual objective in mind: firstly, to produce a training pack demonstrating various aspects of how to practically work with national datasets at a local level; secondly, to template a data journalism investigation that could be worked through by local or hyperlocal journalists, or journalism students, to produce a feature on local food hygiene ratings. (It’s still sitting on the to do pile… Maybe I should have tried Kickstarter!)

(Note that it’s not just news organisations that can scale templated systems, or reuse locally developed solutions for national benefit. For example, see the post Putting Public Open Data to Work…? for several examples of online services developed by local councils and used to publish local data that can also be scaled across other council areas.)

Whilst newspaper groups such as Trinity Mirror or Johnston Press have the scale, in terms of the number of local outlets, to merit a co-ordinated centre that takes the pain of working with national datasets once and then scales out the benefits across the regional and local titles, independent hyperlocals are often more resource bound when it comes to pursuing investigations (though The Bristol Cable among others repeatedly shows how hyperlocal led investigations are possible).

Whilst I keep not starting to properly scope a hyperlocal datawire service, Will Perrin’s Local News Engine seems to have gained some traction in its development recently (Early proof of concept for Local News Engine [code]). This service “is testing the theory that story leads can be found in local data where a newsworthy person or place is engaged in a newsworthy activity”, searching local datasources (license applications, planning applications) for notable names (see for example What data are we using in Local News Engine? and Who, what and where is newsworthy for Local News Engine?). The approach taken – named entity extraction cross-referenced with the names of local notables – complements an alternative approach that I favour for the datawire, which would flag local stories from national datasets based on things like top N, bottom M rankings, outliers, notable trends or dramatic changes in statistics for a local area, based on a comparison with previous data releases, other locales and national averages.
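By way of a sketch of that alternative, here’s the sort of rule based flagging I have in mind, comparing the current release of a national indicator with the previous one (the areas, figures and thresholds are all invented):

```python
import pandas as pd

# Hypothetical national indicator, current and previous releases, by area
current = pd.Series({"Northtown": 55, "Southville": 21, "Eastborough": 30})
previous = pd.Series({"Northtown": 32, "Southville": 20, "Eastborough": 29})

df = pd.DataFrame({"current": current, "previous": previous})
df["change_pct"] = 100 * (df.current - df.previous) / df.previous
df["rank"] = df.current.rank(ascending=False).astype(int)

TOP_N = 1        # flag areas at the top of the national ranking...
CHANGE_PCT = 25  # ...or with a dramatic change since the last release

leads = df[(df["rank"] <= TOP_N) | (df.change_pct.abs() >= CHANGE_PCT)]
print(leads)  # candidate local story leads to push out over the datawire
```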

PS you can tell this is a personal blog post, not a piece of journalism – I didn’t reach out to anyone from the Johnston Press, or Trinity Mirror, or get in touch with Will Perrin to check facts or ask for comment. It’s all just my personal comment, bias, interpretation and opinion….

PPS See also Archant’s Investigations Unit (2015 announcement) – h/t Andy Dickinson.

Charting Terrorism Related Arrest Flows Through The Criminal Justice System

One of my daily read feeds is a list of the day’s government statistical releases. Today, I spotted a release on the Operation of police powers under the Terrorism Act 2000, quarterly update to September 2015, which included an annex on Arrests and outcomes, year ending September 2015:

[Image: table of arrests and outcomes through the criminal justice system, year ending September 2015]

I tweeted a link to the doc, and Michael/@fantasticlife replied with a comment that it might look interesting as a Sankey diagram…

Hmm…

So here’s a quick sketch generated using SankeyMATIC:

[Image: first draft Sankey diagram generated using SankeyMATIC]

I took the liberty of adding an extra “InSystem” step into the chart to account for the feedback loop of the bailed arrests.

Here’s the data I used (in SankeyMATIC’s Source [amount] Target format; the .t and .n suffixes just keep the terrorism related and non-terrorism related Prosecuted nodes distinct):

```
Arrested [192] InSystem
Arrested [115] Released without charge
Arrested [8] Alternative action
InSystem [124] Charged
InSystem [68] Released on bail
Charged [111] Terrorism Related
Charged [13] Non-terrorism related
Terrorism Related [36] Prosecuted.t
Terrorism Related [1] Not proceeded against
Terrorism Related [74] Awaiting prosecution
Non-terrorism related [6] Prosecuted.n
Non-terrorism related [2] Not proceeded against
Non-terrorism related [5] Awaiting prosecution
Prosecuted.t [33] Convicted (terrorism related)
Prosecuted.t [2] Convicted (non-terrorism related)
Prosecuted.t [1] Acquitted
Prosecuted.n [5] Convicted (non-terrorism related)
Prosecuted.n [1] Acquitted
```

Looking at the diagram, I find the placement of the labels quite confusing and I’m not really sure what relates to what. (The numbers, for example…) It would also be neater if we could capture flows still “in the system”, for example by stopping the Released on bail element at the same depth as the Charged elements, and also keeping the Awaiting prosecution element short of the right hand side. (Perhaps the bail and awaiting elements could be added into a “limbo” field?)

So – nice idea; but as soon as you look at it, you see that even a quick, trivial sketch immediately identifies all sorts of other issues that you need to take into account to make the diagram informatively glanceable…

Hmmm…

Thinks.. SankeyMATIC is a d3.js app… it would be nice if I could drag the elements in the generator to make the diagram a bit clearer… maybe I can?
[Image: redrafted Sankey diagram with repositioned elements]

Only that’s wrong too… because the InSystem label applies to the boundary to the left, and the Bail label to the right… So we need to tweak it a bit more…

[Image: further tweaked Sankey diagram with relabelled boundaries]

In fact, you may notice that the labels seem to be applied left and right justified according to different rules? Hmmm… Not so simple again…

How about if I take out the interstitial value I added?

[Image: Sankey diagram with the interstitial InSystem step removed]

That’s perhaps a bit clearer? And it all goes some way to showing how constructing a graphic is generally an iterative process, scaffolding the evolution of the diagram as you go, as you learn to see it/read it from different perspectives and tweak it to try to clarify particular communicative messages. (Which in this case, for me, was to try to tease out how far through the process various flows had got, as well as clearly identify final outcomes…)

Other things we could do to try to improve the graphic are experiment a bit more with the colour schemes. But that’s left as an exercise for the reader…;-)