Practical Data Scraping – UK Government Transparency Data (Minister’s Meetings)

Earlier this week, I came across the Number 10 website’s transparency data area, which among other things has a section on who Ministers are meeting.

Needless to say, the Who’s Lobbying website has started collating this data and making it searchable, but I thought I’d have a look at the original data to see what it would take to aggregate the data myself using Scraperwiki.

The Number 10 transparency site provides a directory to Ministers’ meetings by government department on a single web page:

Number 10 transparency - ministers meetings

The links in the Ministers’ meetings, Ministers’ hospitality, Ministers’ gifts and Ministers’ overseas travel columns all point directly to CSV files. From inspecting a couple of the Ministers’ meetings CSV files, it looks as if they may be being published in a standardised way, using common column headings presented in the same order:

Ministers' meetings transparency data - csv format

Except that: some of the CSV files appeared to have a blank row between the header and the data rows, and at least one table had a blank row immediately after the data rows, followed some notes in cells that did not map onto the semantics of corresponding column headers. Inspecting the data, we also see that once a minister is identified, there is a blank in the first (Minister) column, so we must presumably assume that the following rows relate to meetings that minister had. WHen the data moves on to another minister, that Minister’s name/position is identified in the first column, once again then followed by blank “same as above” cells.

To get the data into scraperwiki means we need to do two things: extract meeting data from a CSV document and get it into a form whereby we can put it into the scraperwiki database; scrape the number 10 Minisiters’ meetings webpage to get a list of the URLs that point to the CSV files for each department. (It might also be worth scraping the name of the department, and adding that as additional metadata to each record pulled out from the CSV docs.)

Here’s the Scraperwiki code I used to scrape the data. I tried to comment it, so it’s worth reading through even if you don’t speak Python, because I’m not going to provide any more description here…;-)

import urllib
import csv
import md5
import scraperwiki


url = "http://download.cabinetoffice.gov.uk/transparency/co-ministers-meetings.csv"
# I have started just looking at data from one source.
# I am assuming, (dangerously), that the column headings are:
#   a) the same, and 
#   b) in the same order
# for different departments

data = csv.DictReader(urllib.urlopen(url))

# Fudge to cope with possibility of blank row between header and first data row
started=False

# Inspection of the data file suggests that when we start considering a Minister's appointments,
#   we leave the Minister cell blank to mean "same as above".
# If we want to put the Minister's name into each row, we need to watch for that. 
minister=''

for d in data:
    if not started and d['Minister']=='':
        # Skip blank lines between header and data rows
        continue
    elif d['Minister']!='':
        # A new Minister is identified, so this becomes the current Minister of interest
        if not started:
            started=True
        minister=d['Minister']
    elif d['Date']=='' and d['Purpose of meeting']=='' and d['Name of External Organisation']=='':
        # Inspection of the original data file suggests that there may be notes at the end of the CSV file...
        # One convention appears to be that notes are separated from data rows by at least one blank row
        # If we detect a blank row within the dataset, then we assume we're at data's end
        # Of course, if there are legitimate blank rows within the later, we won't scrape any of the following data
        # We probably shouldn't discount the notes, but how would we handle them?!
        break
    print minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']
    id='::'.join([minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']])
    # The md5 function creates a unique ID for the meeting
    id=md5.new(id).hexdigest()
    # Some of the original files contain some Latin-1 characters (such as right single quote, rather than apostrophe)
    #   that make things fall over unless we handle them...
    purpose=d['Purpose of meeting'].decode('latin1').encode('utf-8')
    record={'id':id,'Minister':minister,'date':d['Date'],'purpose':purpose,'lobbiest':d['Name of External Organisation'].decode('latin1').encode('utf-8')}
    # Note that in some cases there may be multiple lobbiests, separated by a comma, in the same record.
    # It might make sense to generate a meeting MD5 id using the original record data, but actually store
    #   a separate record for each lobbiest in the meeting (i.e. have lobbiests and lobbiest columns) by separating on ','
    # That said, there are also records where a comma separates part of the title or affiliation of an individual lobbiest.
    # A robust convention for separating different lobbiests in the same meeting (e.g. ';' rather than ',') would help

    scraperwiki.datastore.save(["id"], record) 

for d in data:
    #use up the generator, close the file, allow garbage collection?
    continue

Here’s a preview of what the scraped data looks like:

Ministers' meetings datascrape - scraperwiki

Here’s the scraper itself, on Scraperwiki: UK Government Transparency Data – Minister’s Meetings Scratchpad

Assuming that the other CSV files are all structured the same way as the one I tested the above scraper on, we should be able to scrape meeting data from other departmental spreadsheets using the same script. (Note that I did try to be defensive in the handling of arbitrary blank lines between the first header row and the data.)

One problem arises in the context of meetings with more than one person. Ideally, I think there should be a separate row for each person attending, so for example, the Roundtable on June, 2010 between Parliamentary Secretary (Minister for Civil Society), Nick Hurd MP and National Voices, MENCAP,National Council of Voluntary Organisations, St Christopher’s Hospice, Diabetes UK, Place 2 Be, Terrence Higgins Trust, British Heart Foundation, Princess Royal Trust for Carers, Clic Sargent might be mapped to separate data rows for each organisation present. If we take this approach, it might also make sense to ensure that each row carries with it a meeting ID, so that we can group all the rows relating to a particular meeting (one for each group in the meeting) on meeting ID.

However, there is an issue in identifying multiple attendee meetings. In the above example, we can simply separate the groups by splitting the attendees lists at each comma; but using this approach would then mean that the meeting with Secretary General, Organisation of the Islamic Conference, Ekmelledin Ihsanoglu would be mapped onto three rows for that meeting: one with Secretary General as an attendee, one with Organisation of the Islamic Conference as an attendee, and finally one with Ekmelledin Ihsanoglu identified as an attendee…

What this suggests to me is that it would be really handy (in data terms), if a convention was used in the attendees column that separated representation from different organisations with a semi-colon, “;”. We can then worry about how to identify numerous individuals from the same organisation (e.g. J Smith, P Brown, Widget Lobbying group), or how to pull out roles from organisations (Chief Lobbiest, Evil Empire Allegiance), names and roles from organisations (J Smith, Chief Lobbiest, UN Owen, Head Wrangler, Evil Empire Allegiance) and so on…

And I know, I know… the Linked Data folk would be able to model that easily.. but I’m talking about quick and dirty typographical conventions that can be easily used in simple CSV docs that more folk are comfortable with than are comfortable with complex, explicitly structured data…;-)

PS I’ll describe how to scrape the CSV urls from the Number 10 web page, and then loop through all of this to generate a comprehensive “Ministers’ meetings” database in a later post…

PPS a really informative post on the WHo’s Lobbying blog goes into further detrail about some of the “pragmatic reuse” problems associated with the “Ministers’ meetings” data released to date: Is this transparency? No consistent format for 500 more UK ministerial meetings.