Transparency in Parliament… And in Data Journalism?

Over the weekend, I picked up a copy of Parliament Ltd, a two hundred and fifty page rant (or should that be diatribe?!) against various MPs and Lords and their registered (and unregistered) interests. [Disclosure: I’ve picked up a few days paid work for the Parliamentary Digital Service this year.]

The book draws on data scraped from the Parliament website (presumably), as well as Companies House (via a collaboration – or business arrangement? I wasn’t quite sure..?! – with DueDil). As anyone who’s tried looking at registers of interests on the Parliament website, you’ll know they’re not published in the friendliest of formats, and the data is not made available as a machine readable downloadable dataset.

Sources of “Interests” Data From Parliament

By the by, the registers maintained on the Parliament website include:

There’s also the register of all-party groups, which includes statements of benefits received by groups from third parties (links to old scrapers here, possibly?

Another place we might look for associations between MPs/Lords and companies, or other organisations, is in Hansard. For example, Evan Odell recently published a dataset on Hansard Speeches and Sentiment that “provides information on each speech of ten words or longer, made in the House of Commons between 1980 and 2016, with information on the speaking MP, their party, gender and age at the time of the speech”. The R code is supplied, so we could presumably use that as a basis for running the transcripts through a named entity extractor to try to pull out the names of companies or organisation mentioned by each speaker (perhaps as well as something that looks out for declarations of interest mentioned whilst speaking?). It might also be interesting to try to match sentiment with organisation mentions?!

Where companies are mentioned in a debate, and the debate leads to a division (that is, a vote), we can then use sources such as The Public Whip to download information scraped from the Parliament website about who voted how on which division, and perhaps look for MPs voting against their party line but in favour of a particular interest.

(If you know other sources of scraper code, or APIs offered over scraped versions of any of the above registers, please let me know via the comments and I’ll add them in. Also any registers I’ve missed…)

Others Sources of Data Relating to Members’ Parliamentary and Government Activities

By the by, the APPG post also led me to another old post on scraping Ministers’ meetings. For an idea of the sorts of thing currently disclosed (at a departmental level?), see e.g. Cabinet Office: ministers’ transparency publications). There are possibly other forms of declaration on other Government Department websites?

In relation to lobbying firms, there is the Office of the Registrar of Consultant Lobbyists.

Also outside Parliament, the Electoral Commission provide information about donations and loans to individuals (including MPs) and candidate spending and donations at elections.

Other Sources of Information About Members’ External Interests

Companies House can also be used to look up whether a particular named individual is or has been listed as a company officer (such as a director), or is a person of significant control (PSC, sometimes referred to as a “beneficial owner”) of a particular company. Whilst the PSC register is currently available as a bulk data download, the director information isn’t (at least, not without making a personal request). It can be accessed in a piecemeal fashion via the Companies House API though. Current and recently disqualified directors can be found via The Insolvency Service or the Companies House API. The Insolvency Service also publish information about Individual Insolvency (that is, bankruptcies).

Where individuals are associated with an organisation and are registered as a data controller, they should also be listed as an entry on the ICO Data Protection Register.

Evan’s Github account also hosts a fork of a repo published by the NCVO for import[ing] data from the Charity Commission data extract, data that presumably lists trustees, and again that can be used as the basis for finding associations between individuals and organisations.

At a local level, local councils hold a variety of public registers, detailing for example the names of individuals licensed to sell alcohol, or to act as operators of betting, sex or animal breeding establishments. The CQC publish data listing the names of individuals in charge of operating care homes. NHS England list names of GPs working at particular practices. And so on…

More generally, the Advisory Committee on Business Appointments (Acoba) has details of Appointments taken up by former ministers. (Acoba also report on Appointments taken up by former Crown servants.)

So What?

So that’s all so much data, and as Martin Williams points out in his book, it can take a lot of effort to pull the data into some sort of shape where you can use it. And with data sourced from various places, there may be issues associated with sharing the data on once you have processed it.

To a certain extent, you might argue that Parliament is blocking “transparency” around members’ interests – and possible conflicts of interest – by publishing the data in a way that makes it difficult to process it as data without having to do a fair amount of work prepping the data. But I’m not sure how true that is. Journalists are, in part, competitive beasts, wanting to be the first to a story. If a data is well presented and comes with analysis scripts that identify story features and story points, essentially generating a press release around a dataset without much effort involved, there’s nothing there to find (nothing “hidden” in the data waiting for the intrepid journalist to reveal it). But when the data is messy and takes some effort to clean up, then the chances that anyone else will just stumble across the story point by chance are reduced. And when the data is “secret” but still publicly accessible, all the better. For example, it’s no surprise that a common request of Alvateli (the platform underpinning FOI request site WhatDoTheyKnow) was from journalists wanting to be able to hide, or at least embargo, their requests, and (data) responses provided to them (h/t Chris Adams for that observation and link).

Another question that arises around journalists who do clean datasets and then analyse them but who don’t then share their working, (the data cleaning and analysis scripts), is the extent to which they are themselves complicit in acting against transparency. Why should we believe the journalists’ accusations or explanations without seeing what they are actually based on? (Maybe in cleaning the dataset, they threw away explicit declarations of interest because they were too messy to process which then skewed the conclusions drawn from the data analysis?) By sharing analyses, you also provide others with the opportunity to spot errors in your working, or maybe even improve them (scary for some; but consider the alternative: you produce an analysis script that contains an error, and maybe reuse it, generating claims that are false and that cannot be supported by the data. Publishing those is not in your interest.) There also seems to be the implicit assumption that competitors are trying to steal your stories rather than find your own. They probably think and say the same about you. But who has the time to spend it all trying to crib over other people’s shoulders? (Other than me of course;-))

On the other hand, there may be some commercial or competitive intelligence advantage in having a cleaned dataset that you can work with efficiently that is not available to other journalists or that you believe may hide further stories. (A similar argument to the latter is often made by academic researchers who do not want to share their research data, lest someone else makes a discovery from it that eluded them.) But then, with a first mover advantage, you should be able to work with your data and link it to other data sets faster than your competitors. And if they are sharing data back too, then you may be able to benefit from their cleaned data and analysis scripts. Everyone gets to improve their game.

Another possible form of “competitive” advantage that comes from not publishing cleaned datasets or scripts is that is doesn’t tip the hand of the journalist and reveal investigative “trade secrets” to the subject or subjects of an investigation. For by revealing how a story was identified from a dataset, subjects may change their behaviour so as not to divulge information into the dataset in the same revealing way in the future.

One final considerations: when it comes to news stories, what is the extent to which part-time tinkerers and civic tech hackers such as myself spoil a possible story by doing a halfway hack on a dataset, bringing small scale attention to it, and as a consequence disabling or polluting it as a source of journalistic novelty/story-worthiness? Does anyone have examples of where a possible open data story was not pursued by the press because a local data geek blogger got there first?