OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘transparency’

Using Open Data to Hold Companies to Account?


Some rambling but possibly associated thoughts… I suggest you put Alice’s Restaurant on…

For some time now, I’ve had an uncomfortable feeling about the asymmetries that exist in the open data world as well as total confusion about the notion of transparency.

Part of the nub of the problem (for me) lies with the asymmetric disclosure requirements of public and private services. Public bodies have disclosure requirements (eg Local Government Transparency Code), private companies don’t. Public bodies disclose metrics and spend data, data that can be used in public contract tendering processes by private bodies against public ones tendering for the same service. The private body uses this information – and prices in a discount associated with not having to carry the cost of public reporting – into the bid. The next time the contract is tendered, the public body won’t have access to the (previously publicly disclosed) information that the private body originally had when making its bid. Possibly. I don’t know how tendering works. But from the outside, that’s how it appears to me. (Maybe there needs to be more transparency about the process?)

Open data is possibly a Big Thing. Who knows? Maybe it isn’t. Certainly the big consulting firms are calling it as something worth squillionty billionty of pounds. I’m not sure how they cost it. Maybe I need to dig through the references and footnotes in their reports (Cap Gemini’s Open Data Economy: Unlocking Economic Value by Opening Government and Public Data, Deloitte’s Open growth: Stimulating demand for open data in the UK or McKinsey’s Open data: Unlocking innovation and performance with liquid information). I don’t know how much those companies have received in fees for producing those reports, or how much they have received in consultancy fees associated with public open data initiatives – somehow, that spend data doesn’t seem to have been curated in a convenient way, or as a #opendatadata bundle? – but I have to assume they’re not doing it to fleece the public bodies and tee up benefits for their other private corporate clients.

Reminds me – I need to read Owen Boswarva’s Who supports the privatisation of Land Registry? and ODUG benefits case for open data release of an authoritative GP dataset again… And remind myself of who sits on the Open Data User Group (ODUG), and other UK gov departmental transparency boards…

And read the FTC’s report Data Brokers: A Call For Transparency and Accountability

Just by the by, one thing I’ve noticed about a lot of opendata releases is that, along with many other sorts of data, they are most useful when aggregated over time or space, and/or combined with other data sets. Looking at the month on month reports of local spending data from my local council is all very well, but it gets more interesting when viewed over several months or years. Looking at the month on month reports of local spending data from my local council is all very well, but it gets more interesting when looking at spend across councils, as for example in the case of looking at spend to particular companies.

Aggregating public data is one of the business models that helps create some of the GDP figure that contributes to the claimed, anticipated squillionty billionty pounds of financial benefit that will arise from open data – companies like opencorporates aggregating company data, or Spend Network aggregating UK public spending data who hope to start making money selling products off the back of public open data they have curated. Yes – I know a lot of work goes in to cleaning and normalising that data, and that exploiting the data collection as a whole is what their business models are about – and why they don’t offer downloads of their complete datasets, though maybe licenses require they do make links to, or downloads of, the original (“partial”) datasets available?

But you know where I think the real value of those companies lies? In being bought out. By Experian, or Acxiom (if there’s even a hint of personally identifiable data through reverse engineering in the mix), or whoever… A weak, cheap, cop out business model. Just like this: Farmers up in arms over potential misuse of data. (In case you missed it, Climate Corporation was one of the OpenData500 that aggregated shed loads of open data – according to Andrew Stott’s Open Data for Economic Growth report for the World Bank, Climate Corp “uses 60 years of detailed crop yield data, weather observations from one million locations in the United States and 14 terabytes of soil quality data – all free from the US Government – to provide applications that help farmers improve their profits by making better informed operating and financing decisions”. It was also recently acquired by Monsanto – Monsanto – for just under a billion US $. That’s part of the squillionty billionties I guess. Good ol’ open data. Monsanto.)

Sort of related to this – that is, companies buying others to asset strip them for their data – you know all that data of yours locked up in Facebook and Google? Remember MySpace? Remember NNTP? According to the Sophos blog, Just Because You Don’t Give Your Personal Data to Google Doesn’t Mean They Can’t Acquire It. Or that someone else might buy it.

And as another aside – Google – remember Google? They don’t really “read” your email, at least, people don’t, they just let algorithms process it so the algorithms can privately just use that data to send you ads, but no-one will ever know what the content of the email was to trigger you getting that ad (‘cos the cookie tracking, cookie matching services can’t unpick ad bids, ad displays, click thrus, surely, can they?!), well – maybe there are side effects: Google tips off cops after spotting child abuse images in email (for some reason, after initially being able to read that article, my browser can’t load it atm. Server fatigue?). Of course, if Google reads your ads for blind business purposes and ad serving is part of that blind process you accept it. But how does the law enforcement ‘because we can even though you didn’t warrant us to?’ angle work? Does the Post Office look inside the envelope? Is surveillance actually part of Google’s business model?

If you want to up the paranoia stakes, this (from Ray Corrigan, in particular: “Without going through the process of matching each government assurance with contradictory evidence, something I suspect would be of little interest, I would like to draw your attention to one important misunderstanding. It seems increasingly to be the belief amongst MPs that blanket data collection and retention is acceptable in law and that the only concern should be the subsequent access to that data. Assertions to this effect are simply wrong.”) + that. Because one day, one day, they may just find your name on an envelope of some sort under a tonne of garbage. Or an algorithm might… Kid.

But that’s not what this post is about – what this post is about is… Way back when, so very long ago, not so very long ago, there was a license called GPL. GPL. And GPL was a tainting license. findlaw describes the consequences of reusing GPL licensed code as follows: Kid, ‘if a user of GPL code decides to distribute or publish a work that “in whole or in part contains or is derived from the [open source] or any part thereof,” it must make the source code available and license the work as a whole at no charge to third parties under the terms of the GPL (thereby allowing further modification and redistribution).

‘In other words, this can be a trap for the unwary: a company can unwittingly lose valuable rights to its proprietary code.’

Now, friends, GPL scared people so much that another license called LGPL was created, and LGPL allowed you to use LGPL licensed code without fear of tainting your own code with the requirement to open up your own code as GPL would require of it. ‘Cos licenses can be used against you.

And when it comes to open data licenses, they seem to be like LGPL. You can take open public data and aggregate it, and combine it, and mix it and mash it and do what you like with it and that’s fine… And then someone can come along buy that good work you’ve done and do what they want with it. Even Monsanto. Even Experian. And that’s good and right, right? Wrong. The ODUG. Remember the ODUG? The ODUG is the Open Data User Group that lobbies government for what datasets to open up next. And who’s on the ODUG? Who’s there, sitting there, on the ODUG bench, right there, right next to you?

Kid… you wanna be the all-open, all-liberal open data advocate? You wanna see open data used for innovation and exploitation and transparency and all the Good Things (big G, big T) that open data might be used for? Or you wanna sit down on the ODUG bench? With Deloitte, and Experian, and so on…

And if you think that using a tainting open data license so anyone who uses that data has to share it likewise, aggregated, congregated, conjugated, disaggregated, mixed, matched, joined, summarised or just otherwise and anyways processed, is a Good Thing…? Then kid… they’ll all move away from you on the bench there…

Because when they come to buy you, they won’t want your data to be tainted in any way that means they’ll have to give up the commercial advantage they’ll have from buying up your work on that open data…

But this post? That’s not what this post is about. This post is about holding companies to account. Open data used to hold companies to account. There’s a story to be told that’s not been told about Dr Foster, and open NHS data and fear-mongering and the privatisation of the NHS and that’s one thing…

But another thing is how government might use data to help us protect ourselves. Because government can’t protect us. Government can’t make companies pay taxes and behave responsibly and not rip off consumers. Government needs our help to do that. But can government help us do that too? Protect and Survive.

There’s a thing that DECC – the Department of Energy and Climate Change – do, and that’s publish statistics about domestic energy price statistics and industrial energy price statistics and road fuel and other petroleum product price statistics, and they’re all meaningless. Because they bear little resemblance to spot prices paid when consumers pay their domestic energy bills and road fuel and other petroleum product bills.

To find out what those prices are you have to buy the data from someone like Experian, from something like Experian’s Catalist fuel price data – daily site retail fuel prices – data product. You may be able to calculate the DECC statistics from that data (or you may not) but you certainly can’t go the other way, from the DECC statistics to anything like the Experian data.

But can you go into your local library and ask to look at a copy of the Experian data? A copy of the data that may or may not be used to generate the DECC road fuel and other petroleum product price statistics (how do they generate those statistics anyway? What raw data do they use to generate those statistics?)

Can you imagine ant-eye-ant-eye-consumer data sets being published by your local council or your county council or your national government that can be used to help you hold companies to account and help you tell them that you know they’re ripping you off and your council off and your government off and that together, you’re not going to stand for it?

Can you imagine your local council publishing the forecourt fuel prices for one petrol station, just one petrol station, in your local council area every day? And how about if they do it for two petrol stations, two petrol stations, each day? And if they do it for three forecourts, three, can you imagine if they do it for three petrol stations…? And can you, can you imagine prices for 50 petrol stations a day being published by your local council, your council helping you inform yourself about how you’re being manipulated, can you imagine…? (It may not be so hard – food hygiene ratings are published for food retail environments across England, Northern Ireland and Wales…)

So let’s hear it for open data, and how open data can be used to hold corporates to account, and how public bodies can use open data to help you make better decisions (which is a good neoliberal position to take, and one where the other folk on the bench tell you that that’s what you want and that markets work, though they also fall short of telling you that the models say that markets work with full information but you don’t have the information, and even if you did, you wouldn’t understand it, because you don’t really know how to make a good decision, but at the end of the day you don’t want a decision, you just want a good service fairly delivered, but they don’t tell you that it’s all right to just want that…)

And let’s hear it for public bodies making data available whether it’s open or not, making it available by paying for it if they have to and making it available via library services so that we can start using it to start holding companies to account and start helping our public services, and ourselves, protect ourselves from the attacks being mounted on us by companies, and their national government supporters, who take on debt, and who allow them to take on debt, to make dividend payouts but not capital investment and subsidise the temporary driving down of prices (which is NOT a capital investment) through debt subsidised loss leading designed to crush competition in a last man standing contest that will allow monopolistic last man standing price hikes at the end of it…

And just remember, if there’s anything you want, you know where you can get it… At Alice’s… or the library… only they’re shutting them down, aren’t they…? So that leaves what..? Google?

Written by Tony Hirst

August 3, 2014 at 12:10 am

Local Council Spending Data – Time Series Charts

In What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? I started wondering about the extent to which local spending transparency data might play a role in supporting consultation around new budgets.

As a first pass, I’ve popped a quick application up at http://glimmer.rstudio.com/psychemedia/iwspend2013_14/ [if that's broken, try this one] (shiny code here) that demonstrates various ways of looking at open spending data from the Isle of Wight council. You can pass form items in via the URL (except to set the Directorate – oops!), and also search using regular expressions, but at the moment you still need to hit the Search button to actually run the search. NOTE – there’s a little bug – you need to hit the Search button to get it to show data; note – selecting All directorates and no filter terms can be a bit slow to display anything…

Examples:

- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?expensesType=(oil)|(gas)|(electricity)

- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?serviceArea=mainland

- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?supplierName=capita
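URLs like the examples above can also be put together programmatically. The following is only a minimal sketch in Python 3: the parameter names (expensesType, serviceArea, supplierName) are taken from the example URLs, but the helper names, the sample expense labels and the case-insensitive matching are assumptions for illustration, not a description of how the Shiny app itself handles the filter.

```python
import re
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "http://glimmer.rstudio.com/psychemedia/iwspend2013_14/"

def build_search_url(**params):
    """Return the app URL with the form items as a percent-encoded query string."""
    return BASE_URL + "?" + urlencode(params)

def matches(pattern, value):
    """Case-insensitive regex search (an assumed stand-in for the app's filter)."""
    return re.search(pattern, value, flags=re.IGNORECASE) is not None

url = build_search_url(expensesType="(oil)|(gas)|(electricity)")
print(url)

# Decoding the query string recovers the original regular expression
pattern = parse_qs(urlsplit(url).query)["expensesType"][0]

# Hypothetical expense type labels, invented for this example
labels = ["Electricity", "Gas", "Stationery", "Heating Oil"]
print([label for label in labels if matches(pattern, label)])
```

Note that urlencode percent-encodes the brackets and pipe characters, so the generated URL looks noisier than the hand-typed examples, although it decodes to the same pattern.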

I’ve started exploring various views over the data, but these need thinking through properly (in particular with respect to finding out views that may actually be useful!)

IW spend music expenses type

Hmm… did the budget change directorate?!

IW spend - music service area

IW spend music suppliers

Some more views over the suppliers tab – I started experimenting with some tabular views in the suppliers tab too…

IW spend music suppliers table 1

IW spend music suppliers table 2

This is all very “shiny” of course, but is it useful? From these early glimpses over the data, can you think of any ways that a look at the spending data might help support budget consultations? What views over the data, in particular, might support such an aim, and what sort of stories might we be able to tell around this sort of data?

Written by Tony Hirst

November 6, 2013 at 11:27 am

Posted in Policy, Rstats


Public Sector Transparency – Do We Need Open Receipts Data as Well as Open Spending Data?

Some time ago, in the post Using Aggregated Local Council Spending Data for Reverse Spending (Payments to) Lookups, I described a way of looking at local council spending data based on how much different councils spent with each other.

This technique generalises within and across sectors, so for example we could look at how hospitals spend money with each other, or how police authorities spend money with each other. In this way, we can get a picture of how public bodies buy – and sell – services off each other. The mappings don’t have to relate to spend, either – we could equally well use this sort of model to see how hospitals transfer patients to one another, or how mental health or social care services offer out-of-area cover to each other, or how councils and housing trusts manage transfers between each other.

The insight that lets us produce this sort of view is that we have entities of a particular sort (hospitals, for example, or local councils), entering into transactions with other entities of the same sort. If these sorts of entity all operate under the same transparency rules, a requirement to publish outgoing (spend) transactions, for example, then we can recreate incoming (receipt) transactions from each entity of the same sort. For example, if local councils are required to publish details of spend over £x, then we can also learn how much councils received from other local councils by means of transactions over £x.
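The inversion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the earlier post: the council names, the supplier and the amounts are all invented, and the reporting threshold is ignored (in practice only transactions over £x would appear in the published data at all).

```python
from collections import defaultdict

# Hypothetical published spend records: (payer, payee, amount in GBP)
spend = [
    ("Council A", "Council B", 12000.0),
    ("Council A", "Acme Ltd", 40000.0),
    ("Council B", "Council A", 7500.0),
    ("Council C", "Council B", 3000.0),
]

def receipts_between_peers(records, peers):
    """Invert spend records into receipts, keeping only peer-to-peer payments."""
    receipts = defaultdict(lambda: defaultdict(float))
    for payer, payee, amount in records:
        if payer in peers and payee in peers:
            receipts[payee][payer] += amount
    # Convert to plain dicts so that missing keys are not created on lookup
    return {payee: dict(payers) for payee, payers in receipts.items()}

councils = {"Council A", "Council B", "Council C"}
receipts = receipts_between_peers(spend, councils)
print(receipts["Council B"])
```

The payment to the (made up) private supplier drops out of the result because the supplier is not in the set of peers; only spend between entities of the same sort is turned into receipts.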

As the UK Government at least seems hell bent on getting markets established in the delivery of public services, markets that can include private companies, then we are faced with a possible asymmetry in transparency information.

UK Gov Policy: Making local councils more transparent and accountable to local people

The public should be able to hold local councils to account about the services they provide. To do this, people need information about what decisions local councils are taking, and how local councils are spending public money.

And from the NHS:

NHS – Transparency of Spend

As part of the government’s commitment to greater transparency, there is a requirement to publish online each NHS organisation’s expenditure over £25,000. In accordance with the requirement NHS Direct publish this on the basis of payments made in each calendar month.

For example, if hospital A buys significant services off hospital B, and must report that spend under transparency legislation, we can build up a picture not only relating to A’s spend, but also B’s sale of services, because A’s data relating to spend with B is openly available; which means B’s receipts from A are also available. (In this example, if items can be itemised as less than £25k per item, then this form of reporting under transparency guidelines is not required.)

If hospital A now buys services off company C, then we can look up spend from hospital A to get a picture of how much public money is flowing out to the private sector and into company C. That is, we can get an idea of company C’s receipts from openly published hospital spending data. (Of course, games could be played with itemisation – 10 treatments at £3k a treatment would result in a ‘must declare’ spend of £30k on the course of treatment, but an undeclarable £3k per treatment if billing is organised that way.)
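The itemisation point can be made concrete with a little arithmetic. The sketch below assumes the £25,000 threshold quoted above is applied to each invoiced item separately, which is an assumption about how the rule is operated rather than something the guidance quoted here spells out; the function and variable names are invented for the example.

```python
THRESHOLD = 25000  # GBP, from the NHS requirement quoted above

def declarable(items, threshold=THRESHOLD):
    """Return the invoiced items that exceed the reporting threshold.

    Assumes the threshold is tested per invoiced item, not per supplier or per month.
    """
    return [amount for amount in items if amount > threshold]

course_as_one_item = [10 * 3000]    # one invoice for the whole course of treatment
course_per_treatment = [3000] * 10  # ten separate invoices, one per treatment

print(declarable(course_as_one_item))    # the single 30000 item is over the threshold
print(declarable(course_per_treatment))  # no single item is over the threshold
print(sum(course_as_one_item) == sum(course_per_treatment))  # same total spend
```

The total spend is identical in both cases; only the way it is billed changes what has to be reported.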

But what if company C buys services off hospital B (maybe even subcontracting services it was contracted to deliver by hospital A)? If the spend data of company C is not subject to transparency requirements, and the receipt data from the hospital is not publicly available, we lose sight of how money is being spent within and across the public service.

Whilst private companies may balk at being required to publish details of their own spending data, we might still be able to recreate a picture of their spend with public services by requiring public bodies to also publish receipts data, along with the current requirement to publish spend data?
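As a rough sketch of what a receipts requirement would buy us, the following Python compares the flows an outsider could reconstruct with and without published receipts. The hospitals, company and amounts are hypothetical, reporting thresholds are ignored, and the rule that only public bodies publish anything is the assumption being illustrated.

```python
# Hypothetical money flows: (payer, payee, amount in GBP)
flows = [
    ("Hospital A", "Company C", 30000),
    ("Company C", "Hospital B", 26000),
    ("Hospital A", "Hospital B", 50000),
]
PUBLIC_BODIES = {"Hospital A", "Hospital B"}

def visible_flows(flows, publish_receipts):
    """Flows an outsider can reconstruct from what public bodies publish."""
    seen = []
    for payer, payee, amount in flows:
        from_spend = payer in PUBLIC_BODIES  # payer publishes its spend
        from_receipts = publish_receipts and payee in PUBLIC_BODIES
        if from_spend or from_receipts:
            seen.append((payer, payee, amount))
    return seen

spend_only = visible_flows(flows, publish_receipts=False)
with_receipts = visible_flows(flows, publish_receipts=True)

# Flows that only become visible once receipts are published
newly_visible = [f for f in with_receipts if f not in spend_only]
print(newly_visible)
```

With spend data alone, the payment from the private company to hospital B is invisible; once hospital B publishes its receipts, that flow can be recovered without the company having to publish anything itself.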

Written by Tony Hirst

April 3, 2013 at 11:30 am

Posted in opengov, Policy


Wherefore Art Thou, Research Sector Transparency / Research Transparency Sector Board?

On June 28th, 2012, the open data policy white paper Unleashing the Potential was published by the Cabinet Office. In the section on “Opening Up Access to Research”, one particular paragraph runs as follows:

2.66 To further develop government policy on access to research, we are also establishing a Research Transparency Sector Board, chaired by the Minister for Universities and Science, which will consider ways in which transparency in the area of research can be a driver for innovation. Recognising that research data is different to other PSI [Public Sector Information, presumably? - ed.], the Board will consider how to implement transparency measures relating to research in a manner which protects the integrity of the research and associated intellectual property, while ensuring access to research for those SME entrepreneurs vital for driving growth. This will help to realise the full benefits for society as a whole. The Research Transparency Sector Board will consist of government departments, funding agencies and representatives from universities and other stakeholders, and among the first of its tasks will be to consider how to act on the recommendations of the Royal Society report.

The announcement of the board (referred to as the Research Sector Transparency Board – which makes more sense…) was welcomed by the Royal Society in a guest blog post on the data.gov.uk website dated 27th June 2012 (the day before the embargo lifted? I’m not sure when the blog post actually became public): An intelligently open enterprise.

The minutes of a Regular meeting of the ICO Higher Education sector panel on FOI and DP (24.09.2012) dated 16/10/12 notes the following:

Research data caused much concern. VA reminded delegates that she does need input from Research Councils and BIS in this area, as stated in the draft DD [HE definition document]. Definitions of “publicly funded” and “key outputs” may need clarification. It was noted that the Engineering and Physical Sciences Research Panel had to produce this type of data to an agreed timetable by 2015. It was also mentioned that the Open Data White Paper announced the formation of a new Research Sector Transparency Board and it was suggested that HEI research data could be linked to that format – it is not yet ready for use but might be worth noting in the new DD that this is a future aim.

Correspondence from House of Lords European Union Select Committee includes a letter from David Willetts MP dated 25 October 2012 that refers to his anticipated chairing of the Board:

On the question of Open Access (OA), I was pleased to note your expressed support for Open Data (OD) for which the UK is again identified as a good example. We have made excellent progress through the Finch Report on expanded access to research publications and the Government’s response to it. OD is at a relatively early stage. Some initiatives are already in train under Government’s Transparency Agenda, as detailed in the Cabinet Office White Paper, Open Data: Unleashing the Potential. This includes establishment of the Research Sector Transparency Board, which I shall be chairing. The Board will want to examine the complex issues around increasing the sharing of research data. The Research Councils’ published Open Access policy makes appropriate reference to research data, and the recent Royal Society report has informed the discussion, but work is needed on deciding further measures and implementing these appropriately, with the right terms and conditions and timing for disclosure.

We cannot be complacent and we will want to consider how best to monitor the take-up of Gold OA both here in the UK and overseas. The HEFCE-funded Joint Infrastructure Systems Committee (JISC), OAIG, and the Research Innovation Network (RIN) are already active in monitoring OA trends generally. HEFCE also envisages a possible role for JISC in monitoring the effectiveness – and effects – of Government OA policy. I expect that the Research Sector Transparency Board will also take an interest in OA policy implementation.

The 2012 BIS Annual Innovation Report from November 2012 referred to the announcement of the Board, making me wonder how many other Annual Reports celebrate the announcement of vapourware entities?

10.3 Open data and transparency
We have continued to work to harness the potential and collaborative opportunities offered by wider use of open data.

In June 2012 the Government announced in its Open Data White Paper that we would set up a Research Sector Transparency Board. The Board will consider how transparency in research can be a driver for innovation and discovery while furthering the UK’s recognised excellence in science. It will advise Government transparency issues relating to the national research effort, and improved access for small and medium businesses to the research base. Amongst its first tasks will be to consider and address the recommendations of the Royal Society report, Science as an Open Enterprise, into the sharing and disclosing of research data.

We also established the Administrative Data Taskforce, in December 2011. It will publish proposals for new mechanisms and collaborative agreements to enable and promote the wider use of administrative data for research and policy purposes, before the end of the year.

(I’m not sure I’d picked up on the Administrative Data Taskforce before? It reported in December 2012: The UK Administrative Data Research Network: Improving Access for Research and Policy. This report looks like it could be worth reading – a quick skim reveals several sections on legal and ethical issues related to linking administrative data to other datasets.)

A Hansard reported Written Answer to the House of Lords from 12 Dec 2012 (Column WA241) from The Parliamentary Under-Secretary of State, Department for Business, Innovation and Skills (Lord Marland) on questions referring to open access to research data records:

Any further opening up of access to data, in the context of the wider open data agenda, would be the subject of future discussions with the research councils and other parties including the Data Strategy Board and representative university bodies. These policy issues would also be considered as appropriate by the Research Sector Transparency Board which is chaired by David Willetts. There are no proposals to change the research councils’ policy on access to data at this time.

The Russell Group response to the House of Lords Science and Technology Committee’s inquiry on open access publishing, dated 24 January 2013, makes the following reference to the board:

1.3 The Russell Group has been monitoring the development of open access (OA) policy for some time. We followed the ‘Finch Review’ and Royal Society work on science as an open enterprise with interest and the Russell Group is now represented on the Research Sector Transparency Board which will be covering OA, open data and other issues over the coming year. We have recently had a number of meetings with Research Councils UK (RCUK) to discuss implementation of OA policy.

This suggests that membership of the board has been decided upon, at least partially?

A HEFCE letter on Open access and submissions to the REF post-2014 dated 25/2/13 refers to the board in the following terms:

25. With the Research Councils and the Research Transparency Sector Board, we are giving consideration to the issues involved in increasing access to research data. We are committed to working in dialogue with the sector to develop fair and balanced mechanisms to achieve this aim.

Again, this suggests that the Board has been convened.

So I wonder:

  • What is the actual name of the board – Research Transparency Sector Board or Research Sector Transparency Board ;-)? (Other sectors have Transparency Boards….)
  • What is the membership of the board and has it convened yet?
  • What are the terms of reference for the board?
  • If it has convened, where are the minutes?

By the by, I note the emergence of the Research Councils UK – Gateway to Research, which provides a single point of access to “[k]ey data from the seven UK Research Councils in one location.”

RCUK - Gateway to Research

This site appears to collate information about research grants, grantees, and publications by grant, across the Research Councils (I’m not sure if an #opendata dump is available though, which would mean I don’t need to scrape across all the sites using Scraperwiki any more?!;-)

PS it seems a tweet about the first meeting appeared whilst I was writing this post:

No linkage that I can see yet, though?

Written by Tony Hirst

February 26, 2013 at 6:47 pm

Posted in Anything you want, Policy


Practical Data Scraping – UK Government Transparency Data (Minister’s Meetings)

Earlier this week, I came across the Number 10 website’s transparency data area, which among other things has a section on who Ministers are meeting.

Needless to say, the Who’s Lobbying website has started collating this data and making it searchable, but I thought I’d have a look at the original data to see what it would take to aggregate the data myself using Scraperwiki.

The Number 10 transparency site provides a directory to Ministers’ meetings by government department on a single web page:

Number 10 transparency - ministers meetings

The links in the Ministers’ meetings, Ministers’ hospitality, Ministers’ gifts and Ministers’ overseas travel columns all point directly to CSV files. From inspecting a couple of the Ministers’ meetings CSV files, it looks as if they may be being published in a standardised way, using common column headings presented in the same order:

Ministers' meetings transparency data - csv format

Except that: some of the CSV files appeared to have a blank row between the header and the data rows, and at least one table had a blank row immediately after the data rows, followed by some notes in cells that did not map onto the semantics of the corresponding column headers. Inspecting the data, we also see that once a minister is identified, there is a blank in the first (Minister) column, so we must presumably assume that the following rows relate to meetings that minister had. When the data moves on to another minister, that Minister’s name/position is identified in the first column, once again followed by blank “same as above” cells.

To get the data into scraperwiki means we need to do two things: extract meeting data from a CSV document and get it into a form whereby we can put it into the scraperwiki database; scrape the number 10 Ministers’ meetings webpage to get a list of the URLs that point to the CSV files for each department. (It might also be worth scraping the name of the department, and adding that as additional metadata to each record pulled out from the CSV docs.)
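For the second of those two steps, pulling the CSV links out of the directory page, a minimal sketch using only the Python 3 standard library might look like the following. It is not the Scraperwiki code used for this post (which is Python 2): the CsvLinkParser class, the sample markup, the example.org base URL and the second department are all invented for illustration, and the real page layout may well need different handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CsvLinkParser(HTMLParser):
    """Collect absolute URLs of links whose href ends in .csv."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".csv"):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))

# Invented sample markup; the real directory page is not reproduced here
SAMPLE_HTML = """
<table>
  <tr><td>Cabinet Office</td>
      <td><a href="http://download.cabinetoffice.gov.uk/transparency/co-ministers-meetings.csv">Meetings</a></td></tr>
  <tr><td>Example Department</td>
      <td><a href="/files/example-meetings.CSV">Meetings</a></td>
      <td><a href="/about.html">About</a></td></tr>
</table>
"""

parser = CsvLinkParser("http://example.org/transparency/")  # hypothetical page URL
parser.feed(SAMPLE_HTML)
for link in parser.links:
    print(link)
```

Capturing the department name alongside each link would need the parser to track the table cell it is in, which is left out here to keep the sketch short.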

Here’s the Scraperwiki code I used to scrape the data. I tried to comment it, so it’s worth reading through even if you don’t speak Python, because I’m not going to provide any more description here…;-)

import urllib
import csv
import md5
import scraperwiki


url = "http://download.cabinetoffice.gov.uk/transparency/co-ministers-meetings.csv"
# I have started just looking at data from one source.
# I am assuming, (dangerously), that the column headings are:
#   a) the same, and 
#   b) in the same order
# for different departments

data = csv.DictReader(urllib.urlopen(url))

# Fudge to cope with possibility of blank row between header and first data row
started=False

# Inspection of the data file suggests that when we start considering a Minister's appointments,
#   we leave the Minister cell blank to mean "same as above".
# If we want to put the Minister's name into each row, we need to watch for that. 
minister=''

for d in data:
    if not started and d['Minister']=='':
        # Skip blank lines between header and data rows
        continue
    elif d['Minister']!='':
        # A new Minister is identified, so this becomes the current Minister of interest
        if not started:
            started=True
        minister=d['Minister']
    elif d['Date']=='' and d['Purpose of meeting']=='' and d['Name of External Organisation']=='':
        # Inspection of the original data file suggests that there may be notes at the end of the CSV file...
        # One convention appears to be that notes are separated from data rows by at least one blank row
        # If we detect a blank row within the dataset, then we assume we're at data's end
        # Of course, if there are legitimate blank rows within the data, we won't scrape any of the following data
        # We probably shouldn't discount the notes, but how would we handle them?!
        break
    print minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']
    id='::'.join([minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']])
    # The md5 function creates a unique ID for the meeting
    id=md5.new(id).hexdigest()
    # Some of the original files contain some Latin-1 characters (such as right single quote, rather than apostrophe)
    #   that make things fall over unless we handle them...
    purpose=d['Purpose of meeting'].decode('latin1').encode('utf-8')
    record={'id':id,'Minister':minister,'date':d['Date'],'purpose':purpose,'lobbiest':d['Name of External Organisation'].decode('latin1').encode('utf-8')}
    # Note that in some cases there may be multiple lobbiests, separated by a comma, in the same record.
    # It might make sense to generate a meeting MD5 id using the original record data, but actually store
    #   a separate record for each lobbiest in the meeting (i.e. have lobbiests and lobbiest columns) by separating on ','
    # That said, there are also records where a comma separates part of the title or affiliation of an individual lobbiest.
    # A robust convention for separating different lobbiests in the same meeting (e.g. ';' rather than ',') would help

    scraperwiki.datastore.save(["id"], record) 

for d in data:
    #use up the generator, close the file, allow garbage collection?
    continue

Here’s a preview of what the scraped data looks like:

Ministers' meetings datascrape - scraperwiki

Here’s the scraper itself, on Scraperwiki: UK Government Transparency Data – Minister’s Meetings Scratchpad

Assuming that the other CSV files are all structured the same way as the one I tested the above scraper on, we should be able to scrape meeting data from other departmental spreadsheets using the same script. (Note that I did try to be defensive in the handling of arbitrary blank lines between the first header row and the data.)
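To make that reuse a little more concrete, here is a minimal sketch of the same row-handling logic wrapped up as a function that could be pointed at any file with the same headings. It is written in Python 3 rather than the Python 2 of the Scraperwiki script, it leaves out the database step, and the function name parse_meetings, the output key names and the sample rows are all hypothetical.

```python
import csv
import io


def parse_meetings(lines):
    """Yield one dict per meeting, filling blank Minister cells from the row above."""
    minister = ""
    started = False
    for row in csv.DictReader(lines):
        # Treat missing cells as empty strings and trim stray whitespace
        name = (row.get("Minister") or "").strip()
        date = (row.get("Date") or "").strip()
        purpose = (row.get("Purpose of meeting") or "").strip()
        org = (row.get("Name of External Organisation") or "").strip()

        if name:
            # A new Minister is named, and applies to the rows that follow
            minister = name
            started = True
        elif not started:
            # Blank row(s) between the header and the first data row
            continue
        elif not (date or purpose or org):
            # First blank row after the data: assume notes follow, so stop
            break

        yield {"Minister": minister, "date": date,
               "purpose": purpose, "organisation": org}


# Made-up sample in the assumed column layout, for illustration only
SAMPLE = """Minister,Date,Purpose of meeting,Name of External Organisation
,,,
Minister A,June 2010,Introductory meeting,Example Org
,July 2010,Roundtable,"Org One, Org Two"
Minister B,July 2010,Briefing,Another Org
,,,
Note: hypothetical footnote text,,,
"""

meetings = list(parse_meetings(io.StringIO(SAMPLE)))
```

The same caveat applies as in the original script: a legitimate blank row in the middle of the data would end the scrape early.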

One problem arises in the context of meetings with more than one person. Ideally, I think there should be a separate row for each person attending, so for example, the Roundtable on June, 2010 between Parliamentary Secretary (Minister for Civil Society), Nick Hurd MP and National Voices, MENCAP,National Council of Voluntary Organisations, St Christopher’s Hospice, Diabetes UK, Place 2 Be, Terrence Higgins Trust, British Heart Foundation, Princess Royal Trust for Carers, Clic Sargent might be mapped to separate data rows for each organisation present. If we take this approach, it might also make sense to ensure that each row carries with it a meeting ID, so that we can group all the rows relating to a particular meeting (one for each group in the meeting) on meeting ID.
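A minimal sketch of that mapping, in Python 3, might look something like the following. The record layout and the function names meeting_id and explode_attendees are assumptions made for illustration; the ID is an MD5 hash of the original, unsplit record, so every derived row carries the same value.

```python
import hashlib


def meeting_id(minister, date, purpose, attendees):
    """Hash the original record so every derived row shares one meeting ID."""
    key = "::".join([minister, date, purpose, attendees])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def explode_attendees(record, sep=","):
    """Return one row per attendee, each carrying the same meeting_id."""
    mid = meeting_id(record["Minister"], record["date"],
                     record["purpose"], record["attendees"])
    names = [name.strip() for name in record["attendees"].split(sep)]
    return [
        {"meeting_id": mid, "Minister": record["Minister"],
         "date": record["date"], "purpose": record["purpose"],
         "attendee": name}
        for name in names if name
    ]


# The roundtable example from the text, as a single record
roundtable = {
    "Minister": "Parliamentary Secretary (Minister for Civil Society), Nick Hurd MP",
    "date": "June 2010",
    "purpose": "Roundtable",
    "attendees": ("National Voices, MENCAP,National Council of Voluntary Organisations, "
                  "St Christopher's Hospice, Diabetes UK, Place 2 Be, Terrence Higgins Trust, "
                  "British Heart Foundation, Princess Royal Trust for Carers, Clic Sargent"),
}

rows = explode_attendees(roundtable)
```

Grouping the resulting rows on meeting_id would then recover the full attendee list for the meeting.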

However, there is an issue in identifying multiple attendee meetings. In the above example, we can simply separate the groups by splitting the attendees lists at each comma; but using this approach would then mean that the meeting with Secretary General, Organisation of the Islamic Conference, Ekmelledin Ihsanoglu would be mapped onto three rows for that meeting: one with Secretary General as an attendee, one with Organisation of the Islamic Conference as an attendee, and finally one with Ekmelledin Ihsanoglu identified as an attendee…

What this suggests to me is that it would be really handy (in data terms) if a convention were used in the attendees column that separated representation from different organisations with a semi-colon, “;”. We can then worry about how to identify numerous individuals from the same organisation (e.g. J Smith, P Brown, Widget Lobbying group), or how to pull out roles from organisations (Chief Lobbyist, Evil Empire Allegiance), names and roles from organisations (J Smith, Chief Lobbyist, UN Owen, Head Wrangler, Evil Empire Allegiance) and so on…

And I know, I know… the Linked Data folk would be able to model that easily… but I’m talking about quick and dirty typographical conventions that can be easily used in simple CSV docs that more folk are comfortable with than are comfortable with complex, explicitly structured data…;-)
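To show what such a convention would buy us, here is a minimal Python 3 sketch comparing a naive comma split with a semi-colon split. The function name split_organisations is hypothetical, and the second cell value is made up: the published data does not, as far as I can tell, use semi-colons in this way.

```python
def split_organisations(attendees, sep=";"):
    """Split an attendees cell into separate parties on the given separator."""
    return [part.strip() for part in attendees.split(sep) if part.strip()]


# One attendee, described as role, organisation and name, separated by commas
single = "Secretary General, Organisation of the Islamic Conference, Ekmelledin Ihsanoglu"

# Splitting on commas wrongly breaks one attendee into three
comma_parts = split_organisations(single, sep=",")

# Hypothetical cell following the proposed convention: ';' between organisations
cell = single + "; National Voices; MENCAP"
semicolon_parts = split_organisations(cell)
```

With the semi-colon convention, commas are left free to separate roles, names and affiliations within a single party, which can then be dealt with (or not) as a separate problem.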

PS I’ll describe how to scrape the CSV urls from the Number 10 web page, and then loop through all of this to generate a comprehensive “Ministers’ meetings” database in a later post…

PPS a really informative post on the Who’s Lobbying blog goes into further detail about some of the “pragmatic reuse” problems associated with the “Ministers’ meetings” data released to date: Is this transparency? No consistent format for 500 more UK ministerial meetings.

Written by Tony Hirst

November 12, 2010 at 1:31 pm

Posted in Data
