Tracking down Data Files Associated With Parliamentary Business

One of the ways of finding data-related files scattered around an organisation's website is to run a web search using a search limit that specifies a data-y filetype, such as xlsx for an Excel spreadsheet (csv and xls are also good candidates). For example, on the Parliament website, we could run a query along the lines of filetype:xlsx site:parliament.uk and then opt to display the omitted results:

Taken together, these files form an ad hoc datastore (e.g. as per this demo on using FOI responses on WhatDoTheyKnow as an “as if” open datastore).

Looking at the URLs, we see that data-containing files are strewn about the online Parliamentary estate (that is, the website;-)…

Freedom of Information Related Datasets

Parliament seems to be quite open in the way it handles its FOI responses, publishing disclosure logs and releasing datafile attachments rooted on https://www.parliament.uk/documents/foi/:

Written Questions

Responses to Written Questions often come with datafile attachments.

These files are posted to the subdomain http://qna.files.parliament.uk/qna-attachments.

Given the numeric key for a particular question, we can run a query on the Written Answers API to find details about the attachment:

Looking at the actual URL, something like http://qna.files.parliament.uk/qna-attachments/454264/original/28152%20-%20table.xlsx, it looks as if some guesswork is required to generate the URL from the data contained in the API response? (For example, how might original attachments be distinguished from other attachments (such as “revised” ones, maybe)?)
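Something like the following minimal sketch shows the kind of lookup I have in mind. Note that the lda.data.parliament.uk endpoint path and the shape of the attachment metadata are assumptions on my part rather than documented fact, so treat it as a guess at the API rather than a recipe:

#Speculative sketch: look up attachment details for an answered question.
#The endpoint path and response structure below are assumptions - check
#explore.data.parliament.uk for the actual API details.
import requests

QID = 454264  # hypothetical numeric key for a question/answer

url = 'http://lda.data.parliament.uk/answeredquestions/{qid}.json'.format(qid=QID)
r = requests.get(url)
r.raise_for_status()

#Guess at the JSON shape: LDA-style APIs typically wrap a single item in
#result -> primaryTopic, with attachments under an "attachment" property
item = r.json().get('result', {}).get('primaryTopic', {})
for attachment in item.get('attachment', []):
    print(attachment)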

Written Statements

Written statements often come with one or more data file attachments.

The data files also appear on the http://qna.files.parliament.uk/ subdomain, although it looks like they’re on a different path to the answered question attachments (http://qna.files.parliament.uk/ws-attachments compared to http://qna.files.parliament.uk/qna-attachments). The data files on this subdomain don’t appear to be indexed and searchable on Google? I don’t see a Written Statements API on http://explore.data.parliament.uk/ either?

Deposited Papers

Deposited papers often include supporting documents, including spreadsheets.

Files are located under http://data.parliament.uk/DepositedPapers/Files/:

At the current time there is no API search over deposited papers.

Committee Papers

A range of documents may be associated with Committees, including reports, responses to reports, and correspondence, as well as evidence submissions. These appear to mainly be PDF documents. Written evidence documents are rooted on http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/ and can be found from committee written evidence web (HTML) pages rooted on the same path (example).

A web search for site:parliament.uk inurl:committee (filetype:xls OR filetype:csv OR filetype:xlsx) doesn’t turn up any results.

Parliamentary Research Briefings

Research briefings are published by Commons and Lords Libraries, and may include additional documents.

Briefings may be published along with supporting documents, including spreadsheets:

The files are published under the following subdomain and path:  http://researchbriefings.files.parliament.uk/.

The file attachments URLs can be found via the Research Briefings API.

This response is a cut down result – the full resource description, including links to supplementary items, can be found by keying on the numeric identifier from the _about URI by which the “naturally” identified resource (e.g. SN06643) is described.

Summary

Data files can be found variously around the Parliamentary website, including down the following paths:

(I don’t think the API supports querying resources that specifically include attachments in general, or attachments of a particular filetype?)

What would be nice would be support for discovering some of these resources. A quick way in to this would be the ability to limit search query responses to webpages that link to a data file, on the grounds that the linking web page probably contains some of the keywords that you’re likely to be searching for data around?
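In the absence of that, a crude workaround is to scrape the links out of a candidate web page and filter them on file extension; here’s a minimal sketch along those lines (the page URL is arbitrary and the extension list just covers the data-y filetypes mentioned above):

#Crude discovery sketch: list links on a page that point at data-y filetypes
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

DATA_EXTENSIONS = ('.xlsx', '.xls', '.csv')

def data_links(page_url):
    ''' Return links on page_url whose path ends with a data file extension '''
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    links = [urljoin(page_url, a['href']) for a in soup.find_all('a', href=True)]
    return [link for link in links if link.lower().split('?')[0].endswith(DATA_EXTENSIONS)]

#Example - print any spreadsheet/CSV links found on an arbitrary page
for link in data_links('https://www.parliament.uk/'):
    print(link)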

Data Cleaning – Finding Near Matches in Names

In the post What Nationality Did You Say You Were, Again? I showed how we could use the fuzzyset python library to try to reconcile user supplied nationalities entered via a free text entry form to “preferred” nationalities listed in the FCO Register of Country Names.

Here’s another quick example of how to use fuzzyset to help clean a list of names, possibly repeated, that may include near misses or partial matches.

import pandas as pd
names=['S. Smith', 'John Smith','James Brown','John Brown','T. Smith','John Brown']
df=pd.DataFrame({'name':names})

# Set the thresh value (0..1) to tweak match strength
thresh=0.8

import fuzzyset
names=df['name'].tolist()

cleaner = fuzzyset.FuzzySet()
collisions=[]
for name in names:
    maybe=cleaner.get(name)
    # If there is a possible match, get a list of tuples back: (score, matchstring)
    # The following line finds the highest match score
    m=max(maybe,key=lambda item:item[0]) if maybe is not None else (-1,'')
    # If there is no match, or the max match score is below the threshold value,
    if not maybe or m[0] < thresh:
        # assume that it's not a match and add name to list of "clean" names…
        cleaner.add(name)
    elif m[0] >= thresh:
        # But if there is a possible match, print a warning
        txt='assuming {} is a match with {} ({}) so not adding'.format(name,m[1],m[0])
        print(txt)
        # and add the name to a collisions list
        collisions.append((name,m))

#Now generate a simple report
print('------\n\n# Cleaning Report:\n\n## Match Set:\n{}\n\n------\n\n## Collisions:\n{}'.format(cleaner.exact_set, collisions))

The report looks something like this:

Sometimes, you may want to be alerted to exact matches; for example, if you are expecting the values to all be unique.

However, at other times, you may be happy to ignore duplicates, in which case you might consider dropping them from the names list. One way to do this is to convert the names list to a set, and back again, names=list(set(names)), although this changes the list order.

Alternatively, from the pandas dataframe column, just take unique values: names=df['name'].unique().tolist().

You may also want to know how many times duplicates (exact matches) occur. In such a case, we can list items that appear at least twice in the names list using the pandas dataframe value_counts() method:

#Get a count of the number of times each value occurs in a column, along with the value
vals=df['name'].value_counts()
#Select items where the value count is greater than one
vals[vals > 1]
#John Brown    2

PS Another way of detecting, and correcting, near matches is to use an application such as OpenRefine, in particular its clustering tool:

Talking to Developers and Civic Hackers on Their Own Terms…

Looking at the (new to me) Lords Amendments  website yesterday, I wondered whether the search was being fed by an API call, or whether an API is available elsewhere to the underlying data. (An API is available – find it via explore.data.parliament.uk.)

There are a couple of ways of doing this. One way is to “View Source” (in Chrome, View -> Developer -> View Source), because as everybody *should* know, you can inspect the code running in your browser; another is to use the Developer tools (from the same browser menu) to look at the browser network activity and see what URLs are called when a new selection is made on the web page (the data has to come from somewhere, right? And again, you can look at this if you want to.)

Anyway, it struck me that most folk don’t tend to use these tools, but those who do are probably interested in something you’re doing – either how the page was constructed to give a particular effect, or where the data is coming from. If you’re building screenscrapers, you’d typically look to the source too.

So if you’re trying to engage with developers, why not leave them messages where they’re likely to look? For example, if you want to promote an API, or perhaps if you’re recruiting. Which reminded me that the Guardian used to have an open developer recruitment ad running in their webpage source. Indeed, they still do:

So if your page is API powered somewhere along the line, and you want to promote the API, why not pop a message at the top of the page source?

Or, as I learned from James Bridle (he of The New Aesthetic; you do follow that photoblog, right?), one of the most thought provoking artists around at the moment (I hesitate to say “digital artist” because that’s still an artist, right… erm…. (data) journalism… erm…  hypocrite…), why not use the console too?

James even provides a script to help…. welcome.js.

PS for a recent example of James’ work, which also invokes the idea of magic-related computing metaphors (cf. here, for example), see this recent interview: Meet the Artist Using Ritual Magic to Trap Self-Driving Cars.

PPS This has got me wondering whether we could actually deliver a “just below the surface” uncourse or training through HTML source, console messages and Javascript comments. Documented code with a view to teaching how to get the most out of an API, or how to do webdesign. The medium as the educational message. See also: Search Engine Powered Courses…

Tinkering With Parliament Data APIs: Commons Written Questions And Parliamentary Written Answers

So…. inspired by @philbgorman, I had a quick play last night with Parliament Written Questions data, putting together a recipe (output) for plotting a Sankey diagram showing the flow of questions from Members of the House of Commons by Party to various Answering Bodies for a particular parliamentary session.

The response that comes back from the Written Questions API includes a question uin (unique identification number?). If you faff around with date settings on the Parliamentary Questions web page you can search for a question by this ID:

Here’s an example of the response from a bulk download of questions (by 2015/16 session) from the Commons Written Questions API, deep filtered by the uin:

If you tweak the _about URI, which I think refers to details about the question, you get the following sort of response, built around a numeric identifier (447753 in this case):

There’s no statement of the actual answer text in that response, although there is a reference to an answer resource, again keyed by the same numeric key:

The numeric key from the _about identifier is also used with both the Commons Written Questions API and the Parliamentary Questions Answered API.

For example, questions:

And answers:

The uin values can’t be used with either of these APIs, though?

PS I know, I know, the idea is that we just follow resource links (but they’re broken, right? the leading lda. is missing from the http identifiers), but sometimes it’s just as easy to take a unique fragment of the URI (like the numeric key) and then just drop it into the appropriate context when you want it. In this case, contexts are

IMHO, anyway… ;-)
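For what it’s worth, here’s a minimal sketch of that key-into-context approach. The endpoint paths (commonswrittenquestions and answeredquestions under lda.data.parliament.uk) are my guesses at the contexts rather than confirmed URL patterns, so check them against explore.data.parliament.uk:

#Sketch: drop the numeric key from the _about URI into different API contexts.
#The endpoint path templates below are assumptions, not documented patterns.
import requests

KEY = 447753  # numeric identifier pulled from the _about URI

contexts = {
    'question': 'http://lda.data.parliament.uk/commonswrittenquestions/{key}.json',
    'answer': 'http://lda.data.parliament.uk/answeredquestions/{key}.json',
}

for label, template in contexts.items():
    url = template.format(key=KEY)
    r = requests.get(url)
    print(label, url, r.status_code)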

PPS for a full list of APIs, see explore.data.parliament.uk

Experimenting With Sankey Diagrams in R and Python

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R, or python/pandas, so it was only natural (?!) that I thought I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyportfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitted HTML output is here.

The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankey widget. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)
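As a flavour of the pandas side of things, here’s a minimal sketch of the sort of step used to roll region level flows up to country level and push them into a Sankey widget. The column names and flow values are invented for the example, and the SankeyWidget(links=...) call assumes the ipysankeywidget package API:

#Sketch: aggregate region-to-region flows to country level and chart them
import pandas as pd
from ipysankeywidget import SankeyWidget

#Dummy flow data - origin/destination use GSS-style area codes
flows = pd.DataFrame([
    {'origin': 'E12000001', 'destination': 'W92000004', 'flow': 1200},
    {'origin': 'E12000007', 'destination': 'S92000003', 'flow': 3400},
    {'origin': 'S92000003', 'destination': 'E12000001', 'flow': 2100},
])

#Use the leading character of each code as a crude country indicator
flows['source'] = flows['origin'].str[0]
flows['target'] = flows['destination'].str[0]

#Sum the region level flows up to country level
country_flows = flows.groupby(['source', 'target'], as_index=False)['flow'].sum()
country_flows = country_flows.rename(columns={'flow': 'value'})

#Render as a Sankey diagram in a Jupyter notebook
SankeyWidget(links=country_flows.to_dict(orient='records'))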

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs that can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…

Grouping Numbers that are Nearly the Same – Casual Clustering

There were a couple of reasons for tinkering with WRC rally data this year, over and above the obvious one of wanting to find a way to engage with motorsport at a data level. Specifically, I wanted a context for thinking a bit more about ways of generating (commentary) text from timing data, as well as a “safe” environment in which I could look for ways of identifying features (or storypoints) in the data that might provide a basis for making interesting text comments.

One way in to finding features is to look at visual representations of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the next post in this series, but for now, let’s consider the following question: how can we group numbers that are nearly the same? For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just nearly, the same time?

Via StackOverflow, I found the following handy fragment:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

It struck me that a tweak to the code could limit the range of any grouping relative to a maximum distance between the first and the last number in any particular grouping – maybe I don’t want a group to have a range of more than 0.41, for example (that is, strictly more than a dodgy floating point 0.4…):

def cluster2(data, maxgap, maxrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0]) <=maxrange
        if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
[[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]

A downside of this is we might argue we have mistakenly omitted a number that is very close to the last number in the previous group, when we should rightfully have included it, because it’s not really very far away from a number that is close to the group range threshold value…

In which case, we might pull back into the group numbers that are really close to the current last member of the group, irrespective of whether we’re past the originally specified group range:

def cluster3(data, maxgap, maxrange=None, maxminrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0])<=maxrange
        inmaxminrange = False if maxminrange is None else abs(x-groups[-1][-1])<=maxminrange
        if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

With these simple fragments, I can now find groups of times that are reasonably close to each other.

I can also look for times that are close to other times:

trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
trythis[0] if len(trythis) else ''
[2.1, 2.4, 2.5, 2.52]

PS I think the following vectorised pandas fragments assign group numbers to rows based on the near matches of numerics in a specified column:

def numclustergroup(x,col,maxgap):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    return x

def numclustergroup2(x,col,maxgap,maxrange):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    x['cdiff']=x.groupby('cluster')[col].diff()
    x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)).cumsum()
    return x.drop('cdiff', axis=1)

def numclustergroup3(x,col,maxgap,maxrange,maxminrange):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    x['cdiff']=x.groupby('cluster')[col].diff()
    x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)) & (x[col].diff()>maxminrange) ).cumsum()
    return x.drop('cdiff', axis=1)

#Test
import pandas as pd

uu=pd.DataFrame({'x':list(range(0,8)),'y':[1.3,2.1,7.4,3.9,4.6,2.5,2.4,2.52]})
numclustergroup(uu,'y',0.35)
numclustergroup2(uu,'y',0.35,0.41)
numclustergroup3(uu,'y',0.35,0.41,0.25)

The basic idea is to generate logical tests that evaluate as True whenever you want to increase the group number.
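To see why, note that the test is False (0) while successive sorted values stay within the gap, and flips to True (1) exactly where a new group should start, so the running total bumps the group number at each break:

import pandas as pd

y = pd.Series([2.1, 2.4, 2.5, 2.52, 3.9, 4.6, 7.4])

#True wherever the gap to the previous (sorted) value is at least maxgap...
breaks = y.diff() >= 0.35
#...and the running count of those breaks numbers the groups: 0,0,0,0,1,2,3
groups = breaks.cumsum()

print(pd.DataFrame({'y': y, 'new_group': breaks, 'group': groups}))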

Transparency in Parliament… And in Data Journalism?

Over the weekend, I picked up a copy of Parliament Ltd, a two hundred and fifty page rant (or should that be diatribe?!) against various MPs and Lords and their registered (and unregistered) interests. [Disclosure: I’ve picked up a few days paid work for the Parliamentary Digital Service this year.]

The book draws on data scraped from the Parliament website (presumably), as well as Companies House (via a collaboration – or business arrangement? I wasn’t quite sure..?! – with DueDil). As anyone who’s tried looking at registers of interests on the Parliament website will know, they’re not published in the friendliest of formats, and the data is not made available as a machine readable downloadable dataset.

Sources of “Interests” Data From Parliament

By the by, the registers maintained on the Parliament website include:

There’s also the register of all-party groups, which includes statements of benefits received by groups from third parties (links to old scrapers here, possibly?).

Another place we might look for associations between MPs/Lords and companies, or other organisations, is in Hansard. For example, Evan Odell recently published a dataset on Hansard Speeches and Sentiment that “provides information on each speech of ten words or longer, made in the House of Commons between 1980 and 2016, with information on the speaking MP, their party, gender and age at the time of the speech”. The R code is supplied, so we could presumably use that as a basis for running the transcripts through a named entity extractor to try to pull out the names of companies or organisation mentioned by each speaker (perhaps as well as something that looks out for declarations of interest mentioned whilst speaking?). It might also be interesting to try to match sentiment with organisation mentions?!
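By way of a sketch of what I mean by a named entity extractor, spaCy’s ORG entity tags would be one (crude) way in; the speech text here is just an invented placeholder rather than a real Hansard extract:

#Sketch: pull organisation-like named entities out of a speech transcript
#Requires spaCy and an English model, e.g. python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')

#Invented placeholder text standing in for a Hansard speech
speech = ('I recently met representatives of Acme Widgets Ltd and the '
          'Road Haulage Association to discuss the proposed regulations.')

doc = nlp(speech)
orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
print(orgs)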

Where companies are mentioned in a debate, and the debate leads to a division (that is, a vote), we can then use sources such as The Public Whip to download information scraped from the Parliament website about who voted how on which division, and perhaps look for MPs voting against their party line but in favour of a particular interest.

(If you know other sources of scraper code, or APIs offered over scraped versions of any of the above registers, please let me know via the comments and I’ll add them in. Also any registers I’ve missed…)

Other Sources of Data Relating to Members’ Parliamentary and Government Activities

By the by, the APPG post also led me to another old post on scraping Ministers’ meetings. For an idea of the sorts of thing currently disclosed (at a departmental level?), see e.g. Cabinet Office: ministers’ transparency publications. There are possibly other forms of declaration on other Government Department websites?

In relation to lobbying firms, there is the Office of the Registrar of Consultant Lobbyists.

Also outside Parliament, the Electoral Commission provide information about donations and loans to individuals (including MPs) and candidate spending and donations at elections.

Other Sources of Information About Members’ External Interests

Companies House can also be used to look up whether a particular named individual is or has been listed as a company officer (such as a director), or is a person of significant control (PSC, sometimes referred to as a “beneficial owner”) of a particular company. Whilst the PSC register is currently available as a bulk data download, the director information isn’t (at least, not without making a personal request). It can be accessed in a piecemeal fashion via the Companies House API though. Current and recently disqualified directors can be found via The Insolvency Service or the Companies House API. The Insolvency Service also publish information about Individual Insolvency (that is, bankruptcies).
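By way of illustration, here’s a minimal sketch of that piecemeal route via the Companies House API; you need to register for an API key, which is passed as the username in HTTP basic auth, and I’m assuming the /search/officers endpoint here, so check the Companies House API docs for the current details:

#Sketch: search Companies House for officer records matching a person's name.
#Assumes the /search/officers endpoint; the API key (free registration) is
#supplied as the basic auth username with an empty password.
import requests

API_KEY = 'YOUR_COMPANIES_HOUSE_API_KEY'  # placeholder

def search_officers(name):
    ''' Return raw officer search results for a person's name '''
    r = requests.get('https://api.companieshouse.gov.uk/search/officers',
                     params={'q': name},
                     auth=(API_KEY, ''))
    r.raise_for_status()
    return r.json().get('items', [])

#Example: the name is just a placeholder
for officer in search_officers('John Smith'):
    print(officer.get('title'), '-', officer.get('address_snippet'))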

Where individuals are associated with an organisation and are registered as a data controller, they should also be listed as an entry on the ICO Data Protection Register.

Evan’s Github account also hosts a fork of a repo published by the NCVO for import[ing] data from the Charity Commission data extract, data that presumably lists trustees, and again that can be used as the basis for finding associations between individuals and organisations.

At a local level, local councils hold a variety of public registers, detailing for example the names of individuals licensed to sell alcohol, or to act as operators of betting, sex or animal breeding establishments. The CQC publish data listing the names of individuals in charge of operating care homes. NHS England list names of GPs working at particular practices. And so on…

More generally, the Advisory Committee on Business Appointments (Acoba) has details of Appointments taken up by former ministers. (Acoba also report on Appointments taken up by former Crown servants.)

So What?

So that’s all so much data, and as Martin Williams points out in his book, it can take a lot of effort to pull the data into some sort of shape where you can use it. And with data sourced from various places, there may be issues associated with sharing the data onwards once you have processed it.

To a certain extent, you might argue that Parliament is blocking “transparency” around members’ interests – and possible conflicts of interest – by publishing the data in a way that makes it difficult to process as data without having to do a fair amount of work prepping it. But I’m not sure how true that is. Journalists are, in part, competitive beasts, wanting to be the first to a story. If a dataset is well presented and comes with analysis scripts that identify story features and story points, essentially generating a press release around a dataset without much effort involved, there’s nothing there to find (nothing “hidden” in the data waiting for the intrepid journalist to reveal it). But when the data is messy and takes some effort to clean up, then the chances that anyone else will just stumble across the story point by chance are reduced. And when the data is “secret” but still publicly accessible, all the better. For example, it’s no surprise that a common request of Alaveteli (the platform underpinning FOI request site WhatDoTheyKnow) was from journalists wanting to be able to hide, or at least embargo, their requests, and (data) responses provided to them (h/t Chris Adams for that observation and link).

Another question that arises around journalists who do clean datasets and then analyse them, but who don’t then share their working (the data cleaning and analysis scripts), is the extent to which they are themselves complicit in acting against transparency. Why should we believe the journalists’ accusations or explanations without seeing what they are actually based on? (Maybe in cleaning the dataset, they threw away explicit declarations of interest because they were too messy to process, which then skewed the conclusions drawn from the data analysis?) By sharing analyses, you also provide others with the opportunity to spot errors in your working, or maybe even improve on it (scary for some; but consider the alternative: you produce an analysis script that contains an error, and maybe reuse it, generating claims that are false and that cannot be supported by the data. Publishing those is not in your interest.) There also seems to be an implicit assumption that competitors are trying to steal your stories rather than find their own. They probably think and say the same about you. But who has the time to spend it all trying to crib over other people’s shoulders? (Other than me, of course;-))

On the other hand, there may be some commercial or competitive intelligence advantage in having a cleaned dataset that you can work with efficiently that is not available to other journalists or that you believe may hide further stories. (A similar argument to the latter is often made by academic researchers who do not want to share their research data, lest someone else makes a discovery from it that eluded them.) But then, with a first mover advantage, you should be able to work with your data and link it to other data sets faster than your competitors. And if they are sharing data back too, then you may be able to benefit from their cleaned data and analysis scripts. Everyone gets to improve their game.

Another possible form of “competitive” advantage that comes from not publishing cleaned datasets or scripts is that it doesn’t tip the hand of the journalist and reveal investigative “trade secrets” to the subject or subjects of an investigation. For by revealing how a story was identified from a dataset, subjects may change their behaviour so as not to divulge information into the dataset in the same revealing way in the future.

One final consideration: when it comes to news stories, to what extent do part-time tinkerers and civic tech hackers such as myself spoil a possible story by doing a halfway hack on a dataset, bringing small scale attention to it, and as a consequence disabling or polluting it as a source of journalistic novelty/story-worthiness? Does anyone have examples of where a possible open data story was not pursued by the press because a local data geek blogger got there first?