Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers

Where can I find a list of hospitals in the UK along with their location data? Or historical weather data for the UK? Or how do I find the county from a postcode, or a book title from its ISBN? And is there any way you can give me RDF Linked Data in a format I can actually use?!

With increasing amounts of data available, it can still be hard to:

– find the data you want;
– query a datasource to return just the data you want;
– get the data from a datasource in a particular format;
– convert data from one format to another (Excel to RDF, for example, or CSV to JSON);
– get data into a representation that means it can be easily visualised using a pre-existing tool.

In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of. Or maybe you know the data is in a Linked Data store on data.gov.uk, but you just can’t figure out how to get it out?

This is where GetTheData.org comes in. Get The Data arose out of a conversation between myself and Rufus Pollock at the end of last year, which resulted in Rufus setting up the site now known as getTheData.org.

getTheData.org

The idea behind the site is to field questions and answers relating to the practicalities of working with public open data: from discovering data sets, to combining data from different sources in appropriate ways, getting data into formats you can happily work with, or that will play nicely with visualisation or analysis tools you already have, and so on.

At the moment, the site is in its startup/bootstrapping phase, although there is already some handy information up there. What we need now are your questions and answers…

So, if you publish data via some sort of API or queryable interface, why not consider posting self-answered questions using examples from your FAQ?

If you’re running a hackday, why not use GetTheData.org to post questions arising while scoping the hacks, tweet a link to each question to your event backchannel, and give remote participants a chance to contribute back, at the same time adding to the online legacy of your event?

If you’re looking for data as part of a research project, but can’t find it or can’t get it in an appropriate form that lets you link it to another data set, post a question to GetTheData.

If you want to do some graphical analysis on a data set, but don’t know what tool to use, or how to get the data in the right format for a particular tool, that’d be a good question to ask too.

Which is to say: if you want to GetTheData, but can’t do so for whatever reason, just ask… GetTheData.org

Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which puts a query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and used that feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datastore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.
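
If you want to construct that kind of URL by hand, the pattern is roughly as follows (the spreadsheet key and the query itself are placeholders, and this assumes the Google Visualisation API query endpoint Google Spreadsheets exposed at the time):

http://spreadsheets.google.com/tq?tqx=out:csv&tq=URL_ENCODED_QUERY&key=SPREADSHEET_KEY

Swapping out:csv for out:html should give an HTML table view of the same query result.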

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why it’s interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)

First Dabblings With Scraperwiki – All Party Groups

Over the last few months there’s been something of a roadshow making its way around the country, giving journalists, et al. hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is a shame:-(

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
– grab the list of links to the All Party Groups I was interested in – the subject based ones, rather than the country groups;
– for each group, grab its individual record page and extract the list of 20 qualifying members;
– add records to the scraperwiki datastore of the form (uniqueID, memberName, groupName).

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
– the links to the country groups and the subject groups look the same:
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>
</p>

<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>
</p>

– there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="http://www.w3.org/1999/xhtml">Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
– starting my scrape proper after the Section 2 header:

import scraperwiki
from BeautifulSoup import BeautifulSoup

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to Subject Groups, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    url="http://www.publications.parliament.uk/pa/cm/cmallparty/register/"+page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    #get into the table
    start=soup.find(text='Main Opposition Party')
    # get to the table
    table=start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table=table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table=table.parent.parent.parent
    print table
    
    lines=table.findAll('p')
    members=[]

    for line in lines:
        if not line.get('style'):
            m=line.text.encode('utf-8')
            m=m.strip()
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m=m.replace('-','–')
            m=m.split('–')
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])
            
    if len(members)>20:
        members=members[:20]
    
    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record) 
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:

Scraped data

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):
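
If you want to reproduce that step without the spreadsheet detour, here’s a minimal sketch (assuming the scraper’s CSV output has been saved locally as apg.csv with columns id, mp and apg; the filenames are just placeholders) of how a two column edge list for Gephi might be pulled out:

import csv

# Assumes a local copy of the scraper's CSV output, with columns: id, mp, apg
reader = csv.DictReader(open('apg.csv'))
writer = csv.writer(open('apg_edges.csv', 'w'))
writer.writerow(['Source', 'Target'])  # Gephi-friendly edge list headings
for row in reader:
    # Drop the unique id column; treat member names and group names as the two node sets
    writer.writerow([row['mp'], row['apg']])

Importing the resulting apg_edges.csv into Gephi as an edge list should then give the bipartite member–group graph.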

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)

We can also use an ego filter, depth 2, to see which people an individual is connected to by virtue of common group membership – so for example, if the scraper worked correctly (and I haven’t checked that it did!), here are John Stevenson’s APG connections (node size in this image relates to the number of common groups between members and John Stevenson):

John Stevenson - APG connections

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).
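
(For reference, the formula just takes the CSV URL as a quoted string – something along the lines of =importData("CSV_URL_OF_THE_SCRAPER_OUTPUT") – where the URL is a placeholder for whatever CSV export link Scraperwiki provides.)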

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn’t appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with… (but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaaaaaaay too much happening to keep up with (and try out) at the mo….

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append each newly scheduled run’s data to the original database, but sometimes you may want to overwrite the database instead? (This may be possible via the programme code, using something like a – possibly imaginary – scraperwiki.datastore.empty() call to empty the database before running the rest of the script?) Adding support for YQL queries, by adding e.g. Python-YQL to the supported libraries, might also be handy?

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are co-located in space, rather than by topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of Twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of Twitter user names…

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.
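
For example, typing something along the lines of the following into the Twitter search box (the place name and radius are purely illustrative) should pull back recent tweets posted near that location:

near:"milton keynes" within:10mi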

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I’ve tweaked the pipe to make it a little more robust…]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result cacheing.. in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract the hashtags and then a Unique block filtering on hashtags to count the occurrences, here’s a Python recipe:

import re, simplejson, urllib

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      return details['latitude'],details['longitude']
  else:
    return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
     if not i['text'].startswith('RT @'):
      u=i['from_user'].strip()
      if u in tweeters:
        tweeters[u]['count']+=1
      else:
        tweeters[u]={}
        tweeters[u]['count']=1
      ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
      for tag in ttags:
        if tag not in tags:
          tags[tag]=1
        else:
          tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,term)
'''

What this code does is:
– use Yahoo placemaker to geocode the address provided;
– search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
– identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
– identify the unique tags, as well as counting the number of times they appeared in the search results (a quick way of ranking them is sketched below).
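
Having run the search, pulling out the most popular hashtags is then just a matter of sorting the tags dictionary by count; a minimal sketch:

# tags is the dictionary returned by twSearchNear(); rank the hashtags by frequency
topTags = sorted(tags.items(), key=lambda kv: kv[1], reverse=True)
for tag, count in topTags[:10]:
    print tag, count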

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U… Hmm… people there who shouldn’t be?!

And I’m left wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or is there some other error…? Or is there maybe a UKOLN event on today, I wonder?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there aren’t many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old – a rough sketch of such a filter is given below; (it would be so nice if the Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
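
On the first point, such a filter might look something like this sketch; it assumes each search result carries a created_at timestamp in the RFC 2822 style format the search API was returning at the time:

import time, email.utils

def recentOnly(results, maxAgeSecs=3600):
    # Keep only tweets posted within the last maxAgeSecs seconds (default: one hour)
    fresh = []
    for tweet in results:
        posted = email.utils.parsedate_tz(tweet.get('created_at', ''))
        if posted and time.time() - email.utils.mktime_tz(posted) <= maxAgeSecs:
            fresh.append(tweet)
    return fresh

The results list for each page of search results could then be passed through recentOnly() before the counting step in twSearchNear().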

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter reweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…

PS if you want to see how folk tweeting around a location are socially connected (i.e. whether they follow each other), check out A Bit of NewsJam MoJo – SocialGeo Twitter Map.

A First – Not Very Successful – Look at Using Ordnance Survey OpenLayers…

What’s the easiest way of creating a thematic map, that shows regions coloured according to some sort of measure?

Yesterday, I saw a tweet go by from @datastore about Carbon emissions in every local authority in the UK, detailing those emissions for a list of local authorities (whatever they are… I’ll come on to that in a moment…)

Carbon emissions data table

The dataset seemed like a good opportunity to try out the Ordnance Survey’s OpenLayers API, which I’d noticed allows you to make use of OS boundary data and maps in order to create thematic maps for UK data:

OS thematic map demo

So – what’s involved? The first thing was to try and get codes for the authority areas. The ONS make various codes available (download here) and the OpenSpace website also makes available a list of boundary codes that it can render (download here), so I had a poke through the various code files and realised that the Guardian emissions data seemed to identify regions that were coded in different ways. So I stalled there and looked at another part of the jigsaw…

…specifically, OpenLayers. I tried the demo – Creating thematic boundaries – got it to work for the sample data, then tried to put in some other administrative codes to see if I could display boundaries for other area types… hmmm…. No joy:-( A bit of digging identified this bit of code:

boundaryLayer = new OpenSpace.Layer.Boundary("Boundaries", {
    strategies: [new OpenSpace.Strategy.BBOX()],
    area_code: ["EUR"],
    styleMap: styleMap });

which appears to identify the type of area codes/boundary layer required, in this case “EUR”. So two questions came to mind:

1) does this mean we can’t plot layers that have mixed region types? For example, the emissions data seemed to list names from different authority/administrative area types?
2) what layer types are available?

A bit of digging on the OpenLayers site turned up something relevant on the Technical FAQ page:

OS OpenSpace boundary DESCRIPTION, (AREA_CODE) and feature count (number of boundary areas of this type)

County, (CTY) 27
County Electoral Division, (CED) 1739
District, (DIS) 201
District Ward, (DIW) 4585
European Region, (EUR) 11
Greater London Authority, (GLA) 1
Greater London Authority Assembly Constituency, (LAC) 14
London Borough, (LBO) 33
London Borough Ward, (LBW) 649
Metropolitan District, (MTD) 36
Metropolitan District Ward, (MTW) 815
Scottish Parliament Electoral Region, (SPE) 8
Scottish Parliament Constituency, (SPC) 73
Unitary Authority, (UTA) 110
Unitary Authority Electoral Division, (UTE) 1334
Unitary Authority Ward, (UTW) 1464
Welsh Assembly Electoral Region, (WAE) 5
Welsh Assembly Constituency, (WAC) 40
Westminster Constituency, (WMC) 632

so presumably all those code types can be used as area_code arguments in place of “EUR”?

Back to one of the other pieces of the jigsaw: the OpenLayers API is called using official area codes, but the emissions data just provides the names of areas. So somehow I need to map from the area names to an area code. This requires: a) some sort of lookup table to map from name to code; b) a way of doing that.
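
One way of doing that outside a spreadsheet would be to load the boundary codes file into a simple dictionary keyed on area name; a minimal sketch (the filename and column headings are guesses standing in for whatever the actual OS/ONS downloads use):

import csv

# Build a name -> (code, area type) lookup from the downloaded boundary codes file
# (filename and column names below are placeholders for whatever the real file uses)
lookup = {}
for row in csv.DictReader(open('os_boundary_codes.csv')):
    lookup[row['NAME'].strip().lower()] = (row['CODE'], row['AREA_TYPE'])

def codeForArea(name):
    # Return the (code, area type) pair for an area name, or None if there's no exact match
    return lookup.get(name.strip().lower())

Exact matching like this will of course fail wherever the two sources spell an area name even slightly differently, which is pretty much the reconciliation problem described below.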

Normally, I’d be tempted to use a Google Fusion table to try to join the emissions table with the list of boundary area names/codes supported by OpenSpace, but then I recalled a post by Paul Bradshaw on using the Google spreadsheets VLOOKUP formula (to create a thematic map, as it happens: Playing with heat-mapping UK data on OpenHeatMap), so thought I’d give that a go… no joy:-( For some reason, the VLOOKUP just kept giving rubbish. Maybe it was happy with really crappy best matches, even if I tried to force exact matches. It almost felt like the formula was working on a differently ordered column to the one it should have been; I have no idea. So I gave up trying to make sense of it (something to return to another day maybe; I was in the wrong mood for trying to make sense of it, and now I am just downright suspicious of the VLOOKUP function!)…

…and instead thought I’d give the openheatmap application Paul had mentioned a go… After a few false starts (I thought I’d be able to just throw a spreadsheet at it and then specify the data columns I wanted to bind to the visualisation (c.f. Semantic reports), but it turns out you have to use particular column names: value for the data value, and one of the specified locator labels), I managed to upload some of the data as uk_council data (quite a lot of it was thrown away) and get some sort of map out:

openheatmap demo

You’ll notice there are a few blank areas where council names couldn’t be identified.

So what do we learn? Firstly, the first time you try out a new recipe, it rarely, if ever, “just works”. When you know what you’re doing, and “all you have to do is…”, all is a little word. When you don’t know what you’re doing, all is a realm of infinite possibilities of things to try that may or may not work…

We also learn that I’m not really that much closer to getting my thematic map out… but I do have a clearer list of things I need to learn more about. Firstly, a few hello world examples using the various different OpenLayer layers. Secondly, a better understanding of the differences between the various authority types, and what sorts of mapping there might be between them. Thirdly, I need to find a more reliable way of reconciling data from two tables and in particular looking up area codes from area names (in two ways: code and area type from area name; code from area name and area type). VLOOKUP didn’t work for me this time, so I need to find out if that was my problem, or an “issue”.

Something else that comes to mind is this: the datablog asks: “Can you do something with this data? Please post your visualisations and mash-ups on our Flickr group”. If the data had included authority codes, I would have been more likely to persist in trying to get them mapped using OpenLayers. But my lack of understanding about how to get from names to codes meant I stumbled at this hurdle. There was too much friction in going from area name to OpenLayer boundary code. (I have no idea, for example, whether the area names relate to one administrative class, or several).

Although I don’t think the following is the case, I do think it is possible to imagine a scenario where the Guardian do have a table that includes the administrative codes as well as names for this data, or an environment/application/tool for rapidly and reliably generating such a table, and that they know this makes the data more valuable because it means they can easily map it, but others can’t. The lack of codes means that work needs to be done in order to create a compelling map from the data that may attract web traffic. If it was that easy to create the map, a “competitor” might make the map and get the traffic for no real effort. The idea I’m fumbling around here is that there is a spectrum of stuff around a data set that makes it more or less easy to create visualisations. In the current example, we have area name, area code, map. Given an area code, it’s presumably (?) easy enough to map using e.g. OpenLayers because the codes are unambiguous. Given an area name, if we can reliably look up the area code, it’s presumably easy to generate the map from the name via the code. Now, if we want to give the appearance of publishing the data, but make it hard for people to use, we can make it hard for them to map from names to codes, either by messing around with the names, or using a mix of names that map on to area codes of different types. So we can taint the data to make it hard for folk to use easily whilst still being seen to publish the data.

Now I’m not saying the Guardian do this, but a couple of things follow: firstly, obfuscating or tainting data can help you prevent casual use of it by others whilst at the same time ostensibly “opening it up” (it can also help you track the data; e.g. mapping agencies that put false artefacts in their maps to help reveal plagiarism); secondly, if you are casual with the way you publish data, you can make it hard for people to make effective use of that data. For a long time, I used to hassle folk into publishing RSS feeds. Some of them did… or at least thought they did. For as soon as I tried to use their feeds, they turned out to be broken. No-one had ever tried to consume them. Same with data. If you publish your data, try to do something with it. So for example, the emissions data is illustrated with a Many Eyes visualisation of it; it works as data in at least that sense. From the place names, it would be easy enough to vaguely place a marker on a map showing a data value roughly in the area of each council. But for identifying exact administrative areas – the data is lacking.

It might seem as if I’m angling against the current advice to councils and government departments to just “get their data out there” even if it is a bit scrappy, but I’m not… What I am saying (I think) is that folk should just try to get their data out, but also:

– have a go at trying to use it for something themselves, or at least just demo a way of using it. This can have a payoff in at least three ways I can think of: a) it may help you spot a problem with the way you published the data that you can easily fix, or at least post a caveat about; b) it helps you develop your own data handling skills; c) you might find that you can encourage reuse of the data you have just published in your own institution…

– be open to folk coming to you with suggestions for ways in which you might be able to make the data more valuable/easier to use for them for little effort on your own part, and that in turn may help you publish future data releases in an ever more useful way.

Can you see where this is going? Towards Linked Data… ;-)

PS just by the by, a related post (that just happens to mention OUseful.info:-) on the Telegraph blogs about Open data ‘rights’ require responsibility from the Government led me to a quick chat with Telegraph data hack @coneee and the realisation that the Telegraph too are starting to explore the release of data via Google spreadsheets. So for example, a post on Councils spending millions on website redesigns as job cuts loom also links to the source data here: Data: Council spending on websites.

A Quick Play with Google Static Maps: Dallas Crime

A couple of days ago I got an email from Jennifer Okamato of the Dallas News, who had picked up on one of my mashup posts describing how to scrape tabular data from a web page and get it onto an embeddable map (Data Scraping Wikipedia with Google Spreadsheets). She’d been looking at live crime incident data from the Dallas Police, and had managed to replicate my recipe in order to get the data into a map embedded on the Dallas News website:

Active Dallas Police calls

But there was a problem: the data being displayed on the map wasn’t being updated reliably. I’ve always known there were cacheing delays inherent in the approach I’d described, which involves Google Spreadsheets, a Yahoo Pipe, Google Maps, as well as local browsers, all calling on each other and all potentially cacheing the data, but I never really worried about them. But for this example, where the data was changing on a minute by minute basis, the delays were making the map display feel too out of date to be useful. What’s needed is a more real time solution.

I haven’t had a chance to work on a realtime chain yet, but I have started dabbling around the edges. The first thing was to get the data from the Dallas Police website.

Dallas police - live incidents

(You’ll notice the data includes elements relating to the time of incident, a brief description of it, its location as an address, the unit handling the call and their current status, and so on.)

A tweet resulted in a solution from @alexbilbie that uses a call to YQL (which may introduce a cacheing delay?) to scrape the table and generate a JSON feed for it, and a PHP handler script to display the data (code).

I tried the code on the OU server that ouseful.open.ac.uk works on, but as it runs PHP4, rather than the PHP5 Alex coded for, it fell over on the JSON parsing stakes. A quick Google turned up a fix in the form of a PEAR library for handling JSON, and a stub of code to invoke it in the absence of native JSON handling routines:

//JSON.php library from http://abeautifulsite.net/blog/2008/05/using-json-encode-and-json-decode-in-php4/
include("JSON.php");

// Future-friendly json_encode
if( !function_exists('json_encode') ) {
    function json_encode($data) {
        $json = new Services_JSON();
        return( $json->encode($data) );
    }
}
 
// Future-friendly json_decode
if( !function_exists('json_decode') ) {
    function json_decode($data) {
        $json = new Services_JSON();
        return( $json->decode($data) );
    }
}

I then started to explore ways of getting the data onto a Google Map…(I keep meaning to switch to OpenStreetMap, and I keep meaning to start using the Mapstraction library as a proxy that could in principle cope with OpenStreetMap, Google Maps, or various other mapping solutions, but I was feeling lazy, as ever, and defaulted to the Goog…). Two approaches came to mind:

– use the Google static maps API to just get the markers onto a static map. This has the advantage of being able to take a list of addresses in the image URL, which can then be automatically geocoded; but it has the disadvantage of requiring a separate key area detailing the incidents associated with each marker:

Dallas crime static map demo

– use the interactive Google web maps API to create a map and add markers to it. In order to place the markers, we need to call the Google geocoding API once for each address. Unfortunately, in a quick test, I couldn’t get the version 3 geolocation API to work, so I left this for another day (and maybe a reversion to the version 2 geolocation API, which requires a user key and which I think I’ve used successfully before… err, maybe?!;-).

So – the static maps route it is… how does it work then? I tried a couple of routes: firstly, generating the page via a PHP script; secondly, on the client side, using a version of the JSON feed from Alex’s scraper code.

I’ll post the code at the end, but for now will concentrate on how the static image file is generated. As with the Google Charts API, it’s all in the URL.

For example, here’s a static map showing a marker on Walton Hall, Milton Keynes:

OU static map

Here’s the URL:

http://maps.google.com/maps/api/staticmap?
center=Milton%20Keynes
&zoom=12&size=512x512&maptype=roadmap
&markers=color:blue|label:X|Walton%20Hall,%20Milton%20Keynes
&sensor=false

You’ll notice I haven’t had to provide latitude/longitude data – the static map API is handling the geocoding automatically from the address (though if you do have lat/long data, you can pass that instead). The URL can also carry more addresses/more markers – simply add another &markers= argument for each address. (I’m not sure what the limit is? It may be bound by the length of the URL?)
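
Just to make that pattern concrete, here’s a minimal sketch in Python of how such a URL might be built up from a list of addresses (the example addresses are made up for illustration):

import urllib

def staticMapURL(addresses, center='Dallas,Texas', zoom=10, size='512x512'):
    # Build a Google Static Maps URL with one geocoded, labelled marker per address
    url = 'http://maps.google.com/maps/api/staticmap?center=' + urllib.quote_plus(center)
    url += '&zoom=' + str(zoom) + '&size=' + size + '&maptype=roadmap'
    for i, addr in enumerate(addresses):
        label = chr(65 + i)  # label the markers A, B, C, ...
        url += '&markers=color:blue|label:' + label + '|' + urllib.quote_plus(addr)
    return url + '&sensor=false'

# e.g. print staticMapURL(['Walton Hall, Milton Keynes', 'Milton Keynes Central Station'])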

So – remember the original motivation for all this? Finding a way of getting recent crime incident data onto a map on the Dallas News website? Jennifer managed to get the original Google map onto the Dallas News page, so it seems that if she has the URL for a web page containing (just) the map, she can get it embedded in an iframe on the Dallas News website. But I think it’s unlikely that she’d be able to get Javascript embedded in the parent Dallas News page, and probably unlikely that she could get PHP scripts hosted on the site. The interactive map is obviously the preferred option, but a static map may be okay in the short term.

Looking at the crude map above, I think it could be nice to be able to use different markers (either different icons, or different colours – maybe both?) to identify the type of offence, its priority and its status. Using the static maps approach – with legend – it would be possible to colour code different incidents too, or colour or resize them if several units were in attendance? One thing I don’t do is cluster duplicate entries (where maybe more than one unit is attending?)

It would be nice if the service was a live one, with the map refreshing every couple of minutes or so, for example by pulling a refreshed JSON feed into the page, updating the map with new markers, and letting old markers fade over time. This would place a load on the original screenscraping script, so it’d be worth revisiting that and maybe implementing some sort of cache so that it plays nicely with the Dallas Police website (e.g. An Introduction to Compassionate Screen Scraping could well be my starter for 10). If the service was running as a production one, API rate limiting might be an issue too, particularly if the map was capable of being updated (I’m not sure what rate limiting applies to the static maps API, the Google maps API, or the Google geolocation API?). In the short term (less coding) it might make sense to try to offload this to the client (i.e. let the browser call Google to geocode the markers), but a more efficient solution might be for a script on the server to geocode each location and then pass the lat/long data as part of the JSON feed.
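
For that last, server-side geocoding route, something along the lines of the following sketch might do the job (it assumes the v3 geocoding web service, which didn’t require a key at the time; a production version would also want to cache results and respect rate limits):

import simplejson, urllib

def geocodeAddress(addr):
    # Look the address up via the Google (v3) geocoding web service and return (lat, lng)
    url = 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=' + urllib.quote_plus(addr)
    data = simplejson.load(urllib.urlopen(url))
    if data['status'] == 'OK':
        loc = data['results'][0]['geometry']['location']
        return loc['lat'], loc['lng']
    return None, None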

Jennifer also mentioned getting a map together for live fire department data, which could also provide another overlay (and might be useful for identifying major fire incidents?). In that case, it might be necessary to dither markers, so e.g. police and fire department markers didn’t sit on top of and mask each other. (Not sure how to do this in static maps, where we’re geocoding by address? Would maybe have to handle things logically, and use a different marker type for events attended by just police units, just fire units, or both types?) If we’re going for real time, it might also be interesting to overlay recent geotagged tweets from Twitter?

Anything else I’m missing? What would YOU do next?

PS if you want to see the code, here it is:

Firstly, the PHP solution [code]:

<html>
<head><title>Static Map Demo</title>
</head><body>

<?php

error_reporting(-1);
ini_set('display_errors', 'on');

include("json.php");

// Future-friendly json_encode
if( !function_exists('json_encode') ) {
    function json_encode($data) {
        $json = new Services_JSON();
        return( $json->encode($data) );
    }
}
 
// Future-friendly json_decode
if( !function_exists('json_decode') ) {
    function json_decode($data) {
        $json = new Services_JSON();
        return( $json->decode($data) );
    }
}

$response = file_get_contents("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.dallaspolice.net%2Fmediaaccess%2FDefault.aspx%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2F*%5B%40id%3D%22grdData_ctl01%22%5D%2Ftbody'&format=json");

$json = json_decode($response);

$reports = array();

if(isset($json->query->results))
{
    $str= "<img src='http://maps.google.com/maps/api/staticmap?center=Dallas,Texas";
    $str.="&zoom=10&size=512x512&maptype=roadmap";
    
    $ul="<ul>";

	$results = $json->query->results;
	
	$i = 0;
	
	foreach($results->tbody->tr as $tr)
	{
	
		$reports[$i]['incident_num'] = $tr->td[1]->p;
		$reports[$i]['division'] = $tr->td[2]->p;
		$reports[$i]['nature_of_call'] = $tr->td[3]->p;
		$reports[$i]['priority'] = $tr->td[4]->p;
		$reports[$i]['date_time'] = $tr->td[5]->p;
		$reports[$i]['unit_num'] = $tr->td[6]->p;
		$reports[$i]['block'] = $tr->td[7]->p;
		$reports[$i]['location'] = $tr->td[8]->p;
		$reports[$i]['beat'] = $tr->td[9]->p;
		$reports[$i]['reporting_area'] = $tr->td[10]->p;
		$reports[$i]['status'] = $tr->td[11]->p;
		
	    $addr=$reports[$i]['block']." ".$reports[$i]['location'];
	    $label=chr(65+$i);
	    $str.="&markers=color:blue|label:".$label."|".urlencode($addr);
	    $str.=urlencode(",Dallas,Texas");
	
	    $ul.="<li>".$label." - ";
	    $ul.=$reports[$i]['date_time'].": ".$reports[$i]['nature_of_call'];
	    $ul.", incident #".$reports[$i]['incident_num'];
	    $ul.=", unit ".$reports[$i]['unit_num']." ".$reports[$i]['status']; 
	    $ul.=" (priority ".$reports[$i]['priority'].") - ".$reports[$i]['block']." ".$reports[$i]['location'];
	    $ul.="</li>";
	
		$i++;
		
	}	
	
	$str.="&sensor=false";
    $str.="'/>";
    echo $str;
    
    $ul.="</ul>";
    echo $ul;
}
?>
</body></html>

And here are a couple of JSON solutions. One that works using vanilla JSON [code], and as such needs to be respectful of browser security policies that say the JSON feed needs to be served from the same domain as the web page that’s consuming it:

<html>
<head><title>Static Map Demo - client side</title>

<script src="http://code.jquery.com/jquery-1.4.2.min.js"></script>

<script type="text/javascript">

function getData(){
    var str; var msg;
    str= "http://maps.google.com/maps/api/staticmap?center=Dallas,Texas";
    str+="&zoom=10&size=512x512&maptype=roadmap";
    
    $.getJSON('dallas2.php', function(data) {
      $.each(data, function(i,item){
        addr=item.block+" "+item.location;
	    label=String.fromCharCode(65+i);
        str+="&markers=color:blue|label:"+label+"|"+encodeURIComponent(addr);
	    str+=encodeURIComponent(",Dallas,Texas");
	    
	    msg=label+" - ";
        msg+=item.date_time+": "+item.nature_of_call;
	    msg+=", incident #"+item.incident_num;
	    msg+=", unit "+item.unit_num+" "+item.status; 
	    msg+=" (priority "+item.priority+") - "+item.block+" "+item.location;
        $("<li>").html(msg).appendTo("#details");

      })
      str+="&sensor=false";
      $("<img/>").attr("src", str).appendTo("#map");

    });

}
</script>
</head><body onload="getData()">

<div id="map"></div>
<ul id="details"></ul>
</body></html>

And a second approach that uses JSONP [code], so the web page and the data feed can live on separate servers. What this really means is that you can grab the html page, put it on your own server (or desktop), hack around with the HTML/Javascript, and it should still work…

<html>
<head><title>Static Map Demo - client side</title>


<script src="http://code.jquery.com/jquery-1.4.2.min.js"></script>

<script type="text/javascript">

function dallasdata(json){
    var str; var msg;
    str= "http://maps.google.com/maps/api/staticmap?center=Dallas,Texas";
    str+="&zoom=10&size=512x512&maptype=roadmap";

      $.each(json, function(i,item){
        addr=item.block+" "+item.location;
	    label=String.fromCharCode(65+i);
        str+="&markers=color:blue|label:"+label+"|"+encodeURIComponent(addr);
	    str+=encodeURIComponent(",Dallas,Texas");
	    
	    msg=label+" - ";
        msg+=item.date_time+": "+item.nature_of_call;
	    msg+=", incident #"+item.incident_num;
	    msg+=", unit "+item.unit_num+" "+item.status; 
	    msg+=" (priority "+item.priority+") - "+item.block+" "+item.location;
        $("<li>").html(msg).appendTo("#details");

      })
      str+="&sensor=false";
      $("<img/>").attr("src", str).appendTo("#map");

}

function cross_domain_JSON_call(url){
 url="http://ouseful.open.ac.uk/maps/dallas2.php?c=true";
 $.getJSON(
   url,
   function(data) { dallasdata(data); }
 )
}

$(document).ready(function() { cross_domain_JSON_call(); });

</script>
</head><body>

<div id="map"></div>
<ul id="details"></ul>
</body></html>

When Open Public Data Isn’t…?

This year was always going to be an exciting year for open data. The launch of data.gov.uk towards the end of last year, along with commitments from both sides of the political divide before the election that are continuing to be acted upon now, means data is starting to be opened up – scruffily at first, but that’s okay – and commercial enterprises are maybe starting to get interested too…

…which was always the plan…

…but how is it starting to play out?

The story so far…

A couple of weeks ago, the first meeting of the Public Data Transparency Board was convened, which discussed – and opened up for further discussion – a set of draft public data principles. (Papers relating to the meeting can be found here.)

In a letter to the responsible Minister prior to the meeting (commentable extracts here), Professor Nigel Shadbolt suggested that:

4. … The economic analysis, and the views we regularly hear from the business community themselves, are unequivocal: data must be released for free re-use so that the private sector can add new value and develop innovative new business services from government information. …

8. Transparency principles need to be extended to those who operate public services on a franchised, regulated or subsidised basis. If the state is controlling a service to the public or is franchising or regulating its delivery the data about that activity should be treated as public data and made available. …

11. We need to support the development of licences and supporting policies to ensure that data released by all public bodies can be freely re-used and is interoperable with the internationally recognised Creative Commons model. …

12. A key Government objective is to realise significant economic benefits by enabling businesses and non-profit organisations to build innovative applications and websites using public data. …

The business imperative is further reinforced by the second of three reasons given by the Open Government Data tracking project in Why Open Government Data?:

Releasing social and commercial value. In a digital age, data is a key resource for social and commercial activities. Everything from finding your local post office to building a search engine requires access to data much of which is created or held by government. By opening up data, government can help drive the creation of innovative business and services that deliver social and commercial value.

So how has business been getting involved? As several local councils start to pick up a request, contained in a letter from the Prime Minister published at the end of May, that they open up their financial data, Chris Taggart/@countculture, developer of OpenlyLocal, posted a piece on The open spending data that isn’t… this is not good, in which he described how apparently privileged access to financial data from several councils was being used to drive Spikes Cavell’s SpotlightOnSpend website (for a related open equivalent, see Adrian Short’s Armchair Auditor). Downstream use of the data was hampered by a “personal use only” licence, and a CAPTCHA that requires a human in the loop in order to access the data. The Public Sector Transparency Board promptly responded to Chris’ post (Work on Local Spending Data), quoting the principle that:

“Public data will be released under the same open licence which enables free reuse, including commercial reuse – all data should be under the same easy to understand licence. Data released under the Freedom of Information Act or the new Right to Data should be automatically released under that licence.”

and further commenting: “We have already reminded those involved of this principle and the existing availability of the ‘data.gov.uk’ licence which meets its criteria, and we understand that urgent measures are already taking place to rectify the problems identified by Chris.”

Spikes Cavell chief executive Luke Spikes responded via an interview with Information Age (SpotlightOnSpend reacts to open criticism):

[SpotlightOnSpend] is first and foremost a spend analysis software and consultancy supplier, and that it publishes data through SpotlightOnSpend as a free, optional and supplementary service for its local government customers. The hope is that this might help the company to win business, he explains, but it is not a money-spinner in itself.

“The contribution we’re making to transparency is less about what the purists would like to see, it’s simply putting the information out there in a form that is useful for the audience for which it is intended [i.e. citizens and small businesses]” he said. “But there are a few things we haven’t done right, and we’ll fix that.”

Following the criticism, Cavell says that SpotlightOnSpend will make the data available for download in its raw form. “That’s what we thought was the most sensible solution to overcoming this obstacle,” he told me.

Adrian Short, developer of the open Armchair Auditor, then picked up the baton in a comment to the Information Age article:

There is room for Spikes Cavell to develop their applications and I doubt that anyone has any objection to them offering their services to councils commercially just like thousands of other businesses. But they do not have a monopoly of ideas, talent and resources to build great applications with public spending data. Nor does anyone else.

The concerns that @CountCulture raised were not that Spikes Cavell were trading with councils or trying to attract their business but that they are doing so in a way that precludes anyone else developing applications with this data. By legally and technically locking the data into the Spotlight on Spend platform, everyone else is excluded.

It’s understandable that most councils have no understanding of the culture, legalities or technicalities of open data. This is new territory for nearly all of them. Those councils that have put their data straight onto Spotlight on Spend, bypassing the part where it is made genuinely open — cannot be criticised for not complying with what to them must be a very unusual requirement. But that’s why @CountCulture and I and others want to be very clear that the end result of this process is having effective scrutiny of council finances through multiple websites and applications, not just Spotlight on Spend or any other single website or application. The way we get there is with open data.

And Chris Taggart’s response? (Update on the local spending data scandal… the empire strikes back):

Lest we forget, Spikes Cavell is not an agent for change here, not part of those pushing to open public data, but in fact has a business model which seems to be predicated on data being closed, and the maintenance of the existing procurement model which has served us so badly.

(For recommendations on how councils might publish financial data in an open way, see: Publishing itemised local authority expenditure – advice for comment (reiterated here: Open Government Data: Finances). The Office for National Statistics occasionally releases summary statistics (e.g. as republished in OpenlyLocal: Local spending data in OpenlyLocal, and some thoughts on standards) but at only a coarse resolution. As to how much it might cost to do that, some are claiming Cost of publishing ‘£500 and above’ council expenditure prohibitive.)

From my own perspective, I would also add that should consultants like Spikes Cavell create derived data works from open public data, there should be some transparency in declaring how the derived work was created (see for example: So Where Do the Numbers in Government Reports Come From? and Data is not Binary).

Another example of how once-open data is becoming “closed” behind a paywall comes from Paul Geraghty (“Closed Data Now” SOCITM does a “Times”):

If my memory serves me well the e-Gov Register (eGR) hosted by Brent has been showing every IT supplier sortable by product type, supplier, local government type and even on maps for about 6 or 7 years (some links below if you hurry up).

I am aware that there are problems with this data, in my own past employer I know that some of the data is out of date.

But it is there, it is useful and informative and it is OPEN to all, even SMEs like me researching on niche markets in local government.

The latest move by SOCITM (and presumably with the knowledge of the LGA and the IDeA) means all that data is going to go behind the SOCITM paywall.

And the response from Socitm, via a comment from Vicky Sargent:

First of all, I’d like to clear up some points of fact. No local authority or other public sector service provider that provides data to the Applications Register will have to pay for their subscription and for them, access to the data will be free, regardless of whether they subscribe to Socitm Insight (as 95% of local authorities do). Anyone who is employed in an organisation that is an Applications Register subscriber – f-o-c or paid, will be able to access the data.
Then there is who pays. Clearly an information service like this that adds value, has to cover the costs of development and delivery. Unlike government departments, LGA, IDeA and local councils, Socitm is not directly funded by the taxpayer, and needs to fund the services it delivers from money raised from fees, subscriptions, events and other services.
The business model we use for the Applications Register is that public bodies that contribute should not pay to use the service, but those that do not contribute pay in cash. Private sector bodies can only pay in cash.

Your article also suggests that Socitm’s support for the move towards open data is hypocritical, set against our business model for the Applications Register. I think this misunderstands the thinking behind ‘open data’, which is to get raw data out of government systems for transparency purposes, also so that it can be re-used. Socitm has been a long-term strong supporter of this.
The open data agenda explicitly acknowledges that ‘re-use’ includes adding value and selling on. If councils were to routinely publish the sort of data we will collect for the Applications Register, there would still be work to be done aggregating and manipulating and re-publishing the information to make it useful, and that is what we do, recovering our costs in the way described.

Adrian Short (can you see how it’s the same few players engaging in these debates?!;-) develops the “keep it free” argument in a further comment:

Your argument presupposes your conclusion, which is that Socitm is the best organisation to be managing/publishing the applications register. Because, as you correctly say, you don’t receive any direct funding from the taxpayer, you have to find other ways of paying for that work. Inevitably this means charging non-contributing users.

What you’re missing is that millions of pounds of public money is spent every year supporting businesses, helping to create new markets and generally oiling the parts of the economy that don’t easily oil themselves. That’s what BIS and the economic development departments of local authorities do. The public interest and private benefit aren’t easily distinguishable unless you contrive that private benefit for a small group to the exclusion of others. But as Paul rightly points out, the potential market for this information is enormous — essentially every business and individual that works for, supplies or wants to work for the public sector, from the individual IT worker to the massive global consultancies, manufacturers and software firms.

Currently it’s a small number of incumbent suppliers that benefit from this relatively inefficient market. Other businesses lose. Public sector buyers lose. The taxpayer loses.

Keeping this information free for everyone to use and enabling it to be used in future when combined with the enormous amount of data that will be released soon will be likely to produce economic benefits to the public through market efficiencies that outstrip its cost by several orders of magnitude. If Socitm can’t publish this data in the most useful, non-discriminatory way then it’s not the best organisation for the job. I can see no reason in principle or practice why it shouldn’t be fully funded by the taxpayer and free at the point of use for everyone. To do otherwise would be an extremely false economy.

(Note that “free vs. open” debates have also been played out in the open source software arena. Maybe it’s worth revisiting them…?)

The previously quoted comment from Vicky Sargent also contains what might be described as an example case study:

This brings me to Better Connected, the annual survey of council websites carried out by Socitm. You say:
Just about every council in the UK has little option but to pay SOCITM hundreds of pounds annually to join their club to find out the exact details of how their website is being ranked. The data for Better connected only exists because Socitm has devised a methodology for evaluating websites, pays for a team of reviewers collect the data each year, and then analyses and publishes the results. No one has to subscribe, they choose to do so because the information is valuable to them.
Information about how we do the evaluation and ranking is freely available on our website, in our press releases and in our free-to-join Website Usage and Improvement community. The 2010 headline results for all councils are published on socitm.net as open data under a creative commons licence and are linked from data.gov.uk.
If the Better connected report has become a ‘must read’, that is because the investment Socitm has made in the product has led to it being a more cost-effective investment for councils than alternative sources of advice on improving their website. Many users have told us Better connected (cover price £415 for non-subscribers or free as part of the Socitm Insight subscription that starts at £670 pa for a small district council) is worth many days’ consultancy, even when that consultancy is purchased from lower cost SME providers.

As these examples show, the licence under which data is originally released can have significant consequences for its downstream use and commercialisation. The open source software community has known this for years, of course, which is why organisations like GNU have two different licences – the GPL, which keeps software open by tainting other software that includes GPL libraries, and the LGPL, which allows libraries to be used in closed/proprietary code. There is a good argument that combining data from different open sources in a particular way may create valuable results, but it should also be recognised that work may be expended doing this and a financial return may need to be generated (so maybe companies shouldn’t have to open up their aggregated datasets?). Just how we balance commercial exploitation with ongoing openness and access to raw public data remains to be seen.

(The academic research area – which also has its own open data movement (e.g. the Panton Principles) – also suggests a different sort of tension arising from the “potential value” of a data set or aggregated data set. For example, research groups analysing data in one particular way may be loath to release it to others because they want to analyse it in another, value-creating way at a later date.)

Getting the licensing right is particularly important if councils become obliged to use third party services to publish their data. The grand vision of the Public Sector Transparency Board, for example, is set out in this paragraph from Shadbolt’s letter to Maude:

13. We must promote and support the development and application of open, linked data standards for public data, including the development of appropriate skills in the public services. …

But as a recent report, again from Chris Taggart, on Publishing Local Open Data – Important Lessons from the Open Election Data project suggests, there are certain challenges associated with web-related development in local authorities, and in particular a significant lack of experience and expertise in dealing with Linked Data (which is not surprising – it is a relatively new, and so far arcane, technology). Here are the first four lessons, for example:

– There is a lack of ‘corporate’ awareness/understanding of open data issues, and this will inhibit take up of open, linked data publishing unless it is addressed
– There is a lack of even basic web skills at some councils
– Many councils lack web publishing resources, never mind the resources to implement open, linked data publishing
– The understanding of even the basics of linked data and the steps to publishing public data in this way is very, very limited

What this suggests to me is that, in the short term at least, the capability for publishing Linked Data is likely to reside in specialist third party companies, possibly only a handful of them. As Paul Geraghty discovers from the eGovernment Register in If #localgovweb supplier says “RDF WTF?” Sack em #opendata #spending:

[I]t seems to me that of 450 or so local government organisations, 357 are listed as having a “Financials” supplier **.

There are only 18 suppliers listed, and of those there are 6 Big Ones.

Between them the 6 Big Ones supply “Financials” to 326 Councils.

Don’t you think that the first one of those 6 Big Ones who natively supports LOD [Linked Open Data] as an export option (or agrees to within, say, 8 months) really ought to be favoured when bidding for new business?

Let’s go further: let’s say that it should be mandated that all new contracts with “Financials” suppliers include an LOD clause.

Perhaps Mr Pickles could dispatch someone to have a chat with one or two of these suppliers, or have someone check that future contracts for Financial products being sold to Local Government all contain the necessary wording to make this happen?

So instead of trying to train and cajole 450 councils to FTP assorted CSV files into localdata.gov.uk (FFS) all the way through to grokking RDF, namespaces, LOD et al – why does the government not get on and make a strategy to bully and coerce 6 suppliers instead – and potentially get 326 councils teed up to produce useful LOD a bit sharpish?

Another technology option is for councils to publish their own linked data to a commercially hosted datastore. At the moment, the two companies I know of that offer “datastore” services for publishing Linked Data at scale are Talis and the Stationery Office (in partnership with Garlik). It is, of course, open knowledge that one Professor Nigel Shadbolt is a director of Garlik Limited.

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report publishes fragments of the released dataset as information, in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO and HEO, for example, which may or may not relate to costs (to see all the headings, see the CSV doc on Google spreadsheets).
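
If you want to eyeball all the column headings at once, here’s a quick sketch, assuming you have downloaded the released CSV (the filename below is just a placeholder) and have pandas installed:

import pandas as pd

# Placeholder filename - point this at the released COI CSV document.
coi = pd.read_csv("coi-website-report-2009-10.csv")

# List every column heading so the cost-related (and other) columns can be picked out.
print(list(coi.columns))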

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over the original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went into making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made, or “discovered”, by the media: transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“according to the Guardian/Times/Telegraph/FT”, and so on). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how, by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
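
As an aside, here’s a minimal sketch (in Python) of how you might run this sort of query yourself, outside the explorer, by calling the Google Visualization API query endpoint directly and asking for the results back as CSV. The spreadsheet key below is just a placeholder, and the exact endpoint URL pattern is an assumption on my part:

import csv
import io
import urllib.parse
import urllib.request

# Placeholder: swap in the key of the COI websites data spreadsheet.
SPREADSHEET_KEY = "COI_SPREADSHEET_KEY"

# The same query as above: the identifying columns plus the summed cost columns.
query = "select A,B,EL+EN+EP+ER+ET"

# Ask the Visualization API query endpoint for the results as CSV
# (endpoint form assumed; the spreadsheet needs to be publicly viewable).
url = ("https://docs.google.com/spreadsheets/d/" + SPREADSHEET_KEY
       + "/gviz/tq?tqx=out:csv&tq=" + urllib.parse.quote(query))

with urllib.request.urlopen(url) as response:
    for row in csv.reader(io.TextIOWrapper(response, encoding="utf-8")):
        print(row)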

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of the data-derived information we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (a journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports, to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer – see also this for more hints), please feel free to share the links in the comments below :-)

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July–December 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. This means that if you try to view the spreadsheet using just the spreadsheet key, you won’t be allowed to see it (you also need to be logged in to a Google account to view the data, both as a spreadsheet and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it in, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet and do more with it… which may be seen either as providing a value-add service over the data, or as unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, I just pasted the spreadsheet key and authorisation token into the cut-down Guardian datastore explorer script I used in Using CSV Docs As a Database to generate an explorer for the expenses data.

So, for example, we can run a report to group expenses by category and MP:

MP expenses explorer

Or how about claims over 5000 pounds (also viewing the information as an HTML table, for example)?
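
As a rough sketch, that sort of filtered view corresponds to a simple where clause in the Visualization API query language. Assuming (as in the aggregate example below) that column E holds the claim amount, and using a placeholder spreadsheet key, something like this should do the trick:

import urllib.parse

# Placeholder - the real expenses spreadsheet key/token aren't reproduced here.
SPREADSHEET_KEY = "EXPENSES_SPREADSHEET_KEY"

# Claims over 5000 pounds, largest first (assuming column E is the claim amount).
query = "select * where E > 5000 order by E desc"

# URL-encoded and ready to pass as the tq parameter of a Visualization API query URL.
print("https://docs.google.com/spreadsheets/d/" + SPREADSHEET_KEY
      + "/gviz/tq?tqx=out:csv&tq=" + urllib.parse.quote(query))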

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – selecting A, sum(E) where E>0, grouping by A, ordering by sum(E) ascending, and viewing the result as a column chart:

Datastore exploration

We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with the lowest claims to date (the datastore blog post explains why the data is incomplete and should be treated warily).

Limiting results in datastore explorer

Changing the asc order to desc in the above query gives a possibly more interesting result: the MPs who have the largest claims to date (presumably because they have got round to filing their claims!;-)

Datastore exploring
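
Again, purely as a sketch – with a placeholder spreadsheet key, and the column letters as described above – here’s roughly how those aggregate queries might be written out and packed into Visualization API query URLs:

import urllib.parse

# Placeholder key for the expenses spreadsheet.
SPREADSHEET_KEY = "EXPENSES_SPREADSHEET_KEY"

# Total claims per MP (column A), ten lowest totals first...
query_lowest = "select A, sum(E) where E > 0 group by A order by sum(E) asc limit 10"

# ...and the same report with the largest totals first.
query_highest = "select A, sum(E) where E > 0 group by A order by sum(E) desc limit 10"

for q in (query_lowest, query_highest):
    print("https://docs.google.com/spreadsheets/d/" + SPREADSHEET_KEY
          + "/gviz/tq?tqx=out:csv&tq=" + urllib.parse.quote(q))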

Okay – enough for now; the reason I’m posting this is in part to ask the question: is this an unfair use of the Guardian datastore data? Does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Should they/could they try to assert some sort of database collection right over the collection/curation and re-presentation of this otherwise publicly available data, one that would (nominally!) prevent me from using it? Does the publication of the data using the shared link with the authorisation token imply some sort of licence under which that data is made available? E.g. by accepting the link by clicking on it, because it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap licence conditions to the data, conditions I accept when I accept the shared data by clicking through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data.

Liberating Data from the Guardian… Has it Really Come to This?

When the data is the story, should a news organisation make it available? When the Telegraph started trawling through MPs’ expenses data it had bought from a source, industry commentators started asking questions around whether it was the Telegraph’s duty to release that data (e.g. Has Telegraph failed by keeping expenses process and data to itself?).

Today, the Guardian released its University guide 2011: University league table, as a table:

Guardian university tables, sort of

Yes, this is data, sort of (though the javascript applied to the table means that it’s hard to just select and copy the data from the page – unless you turn javascript off, of course):

Data grab

but the Guardian aren’t republishing it as data in their datastore, as they did with these league tables…:

Guardian datastore

…which was actually a republication of data from the THES… ;-)

I’ve been wondering for some time when this sort of apparent duplicity was going to occur… The Guardian datastore has been doing a great job of making data available (as evidenced by its award from the Royal Statistical Society last week, which noted: “there was commendable openness with data, providing it in easily accessible ways”), but when the data is “commercially valuable” to the Guardian, presumably in terms of being able to attract eyeballs to Guardian Education web pages, there seems to be some delay in getting the data onto the datastore… (at least, it isn’t there yet/wasn’t published contemporaneously with the original story…)

I have to admit I’m a bit wary about writing this post – I don’t want to throw any spanners in the works as far as harming the work being done by the Datastore team – but I can’t not…

So what do we learn from this about the economics of data in a news environment?

– data has creation costs;
– there may be a return to be had from maintaining limited, privileged or exclusive access to the data as data OR as information, where information is interpreted, contextualised or visualised data, or is valuable in the short term (as, for example, in the case of financial news). By withholding access to data, publishers maintain the ability to generate views or analyses of the data that they can create stories, or attractive website content, around. (Just by the by, I noticed that an interactive Many Eyes widget was embedded in a Guardian Datablog post today:-)
– if you’ve incurred the creation cost, maybe you have a right to a limited period of exclusivity with respect to profiting from that content. This is what intellectual property rights try to guarantee, at least until the Mickey Mouse lawyers get upset about losing their exclusive right to profit from the content.

I think (I think) what the Guardian is doing is not so different to what the Telegraph did. A cost was incurred, and now there is a (hopefully limited) period in which they are attempting to generate some sort of return. But there’s a problem, I think, with the way it looks, especially given the way the Guardian has been championing open data access. Maybe the data should have been posted to the datablog, but with access permissions denied until a stated date, so that at least people could see the data was going to be made available.

What this has also thrown up, for me at least, is the question of what sort of “contract” the datablog might have, implied or otherwise, with third parties who develop visualisations based on data in the Guardian Datastore, particularly if those visualisations are embeddable and capable of generating traffic (i.e. eyeballs = ad impressions = income…).

It also gets me wondering: does there need to be a separate datastore? Or is the ideal case one where the stories themselves link out to datasets directly? (I suppose that would make it hard to locate the data? On second thoughts, the directory datastore approach is much better…)

Related: Time for data.ac.uk? Or a local data.open.ac.uk?

PS I toyed with the idea of republishing all the data from the Guardian Education pages in a spreadsheet somewhere, and then taking my chances with the lawyers in the court of public opinion, but instead, here’s a howto:

Scraping data from the Grauniad

So just create a Google spreadsheet (you don’t even need an account: just go to docs.google.com/demo), double-click on cell A1 and enter:

=ImportHtml("http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table","table",1)

and then you’ll be presented with the data, in a handy spreadsheet form, from:
http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table

For the subject pages – e.g. Agriculture, Forestry and Food, paste in something like:

=ImportHtml("http://www.guardian.co.uk/education/table/2010/jun/04/university-guide-agriculture-forestry-and-food","table",1)

You can probably see the pattern… ;-)

(You might want to select all the previously filled cells and clear them first so you don’t get the data sets mixed up. If you’ve got your own spreadsheet, you could always create a new sheet for each table. It is also possible to automate the scraping of all the tables using Google Apps script: Screenscraping With Google Spreadsheets App Script and the =importHTML Formula gives an example of how…)
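
If you’d rather pull the tables straight into Python rather than into a Google spreadsheet, something along these lines should also work – a sketch, assuming the league table is still the first HTML table on the page (as the =ImportHtml formula above also assumes), and that pandas plus an HTML parser such as lxml are installed:

import pandas as pd

# Grab every HTML table on the league table page; as with the =ImportHtml
# formula above, the table we want is assumed to be the first one on the page.
url = "http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table"
tables = pd.read_html(url)

league_table = tables[0]
print(league_table.head())

# The subject pages follow the same URL pattern, e.g.
# http://www.guardian.co.uk/education/table/2010/jun/04/university-guide-agriculture-forestry-and-food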

An alternative route to the data is via YQL:

Scraping HTML table data in YQL
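
The YQL query itself is along the following lines – the XPath expression here is a generic stab at picking out the table rows rather than the exact one used in the screenshot above – and it can be passed to the public YQL endpoint something like this:

import urllib.parse

# A generic YQL query against the html table; the XPath is a guess rather than
# the exact expression used in the screenshot above.
yql = ('select * from html where '
       'url="http://www.guardian.co.uk/education/table/2010/jun/04/university-league-table" '
       'and xpath="//table//tr"')

# The public YQL endpoint takes the query as the q parameter.
print("http://query.yahooapis.com/v1/public/yql?format=json&q=" + urllib.parse.quote(yql))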

Enjoy…;-) And if you do grab the data and produce some interesting visualisations, feel free to post a link back here… ;-) To give you some ideas, here are a few examples of education data-related visualisations I’ve played around with previously.

PPS It’ll be interesting to see if this post gets picked up by the Datablog, or popped into the Guardian Technology newsbucket… ;-) Heh heh…