To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info

In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)

From the Scraperwiki database, we need to interrogate the API:

scraperwiki api

To do this, we’ll use a database query language – SQL.

What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column, and just return those rows. The SQL command is simply this:

select * from support where support in (select candinit from support)

Note that “support” refers to two things here – these are columns:

select * from support where support in (select candinit from support)

and these are the table the columns are being pulled from:

select * from support where support in (select candinit from support)

Here’s the result of Runing the query:

sql select on scraperwiki

We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).

candidates mutual table

There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:

Tidier intra-candidate support map


Boundary Files for Electoral Wards Covered by a Particular Geography

A week or so ago, I went looking for boundary lines for UK electoral wards, with half a mind towards trying to do something with them related to this week’s local council elections. One source I came across was the UK 2011 Census, (2011 Census geography products for England and Wales) which publishes the appropriate data from across England and Wales in large single shapefiles. For the amateur cartographer (which I very much am!) wanting to work at a local level, this presents something of a challenge: not only do I have to find a way of downloading and opening the large dataset, I also need to find a way of extracting from it the data for my local area, which is what I actually want. (Getting the local data from the national dataset can be blocking, in other words.)

Another source of data is MySociety’s MapIt service, which provides data about various geographies, including electoral wards, covered by other geographical areas:

mapit areas covered

as well as boundary files for each geography in a variety of formats:

Mapit Geometry

Unfortunately, the MapIt API doesn’t (yet?!) support the ability to generate a single file that contains boundary data for all the electoral wards in a single council or local authority area. So here’s a quick hack, posted as a view on Scraperwiki, that generates a single KML file for the electoral wards contained by a council area (including the wider boundary of that council area) – KML Merge Test. (There are probably better ways of doing this?! I would if I should try to make sense of the MapIt code to see if I can work out how to submit a proper patch…)

import scraperwiki,simplejson,urllib2
from lxml import etree
from copy import deepcopy

#--via @mhawksey
# query string crib
import cgi, os

key='65791' #Use the Isle of Wight as a default
#typs are:
#CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get: key=get['key']
    if 'typ' in get: typ=get['typ'] 

#Get a stub KML file for the local council level
xmlraw = urllib2.urlopen(url).read()

#Get the list of electoral wards covered by that council area

#Get the KML for each ward, extract the Placemark data, and add it to our comprehensive KML tree
for ward in wards:
    xmlraw = scraperwiki.scrape(url)
    p= xml2.xpath('//geo:Placemark',namespaces={'geo':''})
    xml.append( deepcopy(p[0] ))

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")
print etree.tostring(xml)

The key value is the ID of the council area within which you want to find the electoral wards. So for example the Isle of Wight parliamentary constituency page - – gives us the ID 65791, which we pass as an argument to the Scraperwiki view. The subdivision we want to grab data for is given by the typ parameter:

CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

So for example, here’s a link to an aggregate KML file for Unitary authority electoral divisions (UTE) on the Isle of Wight –; and here’s one for Unitary authority wards (UTW) in Milton Keynes:

If you save the resulting file as a .kml file (for example, as kml_merge_test_mk.kml) you can then load it into something like Google Fusion tables to view it:

GOogle fusion table map

Note that the MapIt API is rate limited (I think), so be gentle ;-)

By Me, on the Scraperwiki Blog: Glue Logic and Flowable Data

Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:

As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.

Read the whole thing here: Glue Logic and Flowable Data.

PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on

Organisations Providing Benefits to All-Party Parliamentary Groups, Part 1

Via a tweet from the author, I came across Rob Fenwick’s post on APPGs – the next Westminster scandal? (APPG = All Party Parliamentary Groups):

APPGs are entitled to a Secretariat. Set aside any images you have of a sensibly dressed person of a certain age mildly taking dictation, the provision of an APPG Secretariat is one of the main routes used by public affairs agencies, charities, and businesses to cosey up to MPs and Peers. These “secretaries” often came up with the idea of setting up the group in the first place, to advance the interests of a client or cause.

The post describes some of the organisations that provide secretariat services to APPGs, and in a couple of cases also takes the next step: “Take the APPG on the Aluminium Industry, the secretarial services of which are provided by Aluminium Federation Ltd which is “a not-for-profit organisation.” That sounds suitably reassuring – if the organisation is not-for-profit what chance can there be of big business buying favoured access? It’s only when you look at the Federation’s website, and examine each of its nine sub-associations in turn, that it becomes clear that this not-for-profit organisation is a membership umbrella for private business. This is above board, within the rules, published, and transparent. Transparent if you’re prepared to invest the time to look, of course.”

It’s worth reading the post in full… Go on…

… Done that? Go on.

Right. Here’s the list of registered All Party Groups. Here’s an example of what’s recorded:

APG form

Conveniently, there’s a Scraperwiki scraper (David Jones / All Party Groups) grabbing this information, so I though I’d have a play with it.

Looking at the benefits, there is a fair bit of convention in the way benefits are described. For example, we see recurring things of the form:

  • 5000 from CAFOD, 6000 from Christian Aid – that is, [AMOUNT] from [ORGANISATION]
  • Age UK (a charity) acts as the groups secretariat. – that is, [ORGANISATION] {(OPTIONAL_TYPE)} acts as the groups secretariat.

We could parse these things out directly (on the to do list!) but as a short cut, I thought I’d try a couple of content analysis/entity extraction services to see if they could pull out the names of companies and charities from the benefits list. You can find the scraper I used to enhance David Jones’ APPG scraper here: APG Enhancer.

Here are a couple of sample reports from my scraper:

This gives a first pass attempt at extracting organisation (company and charity) names from the APPG register, and in passing provides a partial directory for looking up companies by APG (partial because the entity extractors aren’t perfect and don’t manage to identify every company, charity, association or other recognised group.

A more thorough way to look up particular companies is to do a site’n’path limited web search: eg
aviva site:

How might we go further, though? One way would be to look up companies on OpenCorporates, and pull down a list of directors:

opencorposrates - look up directors

And then we can start to walk through the database of directors, looking for other companies that appear to have the same director:

opencorporates - director lookup

(Note: we need to watch out for false positives, whereby one director has the same name as another person who is also a company director. There may be false negatives too, where we don’t find a directorship held by a specific person because a slightly different variation of their name was used on a registration document.)

We can also look up charities on OpenCharities to find charity trustees:

OpenCharities charity lookup

If we’re graph walking, we might then look up the trustees on OpenCorporates to see whether or not the trustees are directors of any companies with possible interests in the area, and as a result identify companies who may be trying to influence Parliamentarians through APPGs that benefit from the direct support of a charity, via that charity.

In this way, we can start to build out a wider direct interest graph around a Parliamentary group. I’m not sure how useful or even how meaningful any of this actually is, but it’s increasingly possible, and once the scripted patterns are there, increasingly easy to deploy in other contexts (for example, wherever there is a list of company names, charity names, or names of people who may be directors. I guess a trustee search on OpenCharities may also be available at some point? From a graph linking point of view, I also wonder if any charities share registered addresses with companies, etc…)

PS by the by, here’s a guest post I just wrote on the OpenCorporates blog: Data Sketching With the OpenCorporates API.

When Machine Readable Data Still Causes “Issues” – Wrangling Dates…

With changes to the FOI Act brought about the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I’ll give asn example of a dataset that is, arguably, released in a machine readable way – as an Excel spreadsheet, but that still requires quite a bit of work to become useful as data; because presumably the intent behind the aforementioned amendement to the FOI is to make data releases useful and useable as data? As a secondary result, through trying to make the data useful as data, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to… Which makes those data columns misleading at best, useless at worst…And as to the February data in a release allegedly relating to a weekly release from November…? Sigh…

[Note – I’m not meaning to be critical in the sense of “this data is too broken to be useful so don’t publish it”. My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more “accessible” open data releases published in the future. ]

So what, and where is, the data…? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the NHS Winter pressures daily situation reports 2012 – 13. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own…

The data look like it’ll be released in a set of weekly Excel spreadsheets, with a separate sheet for each data report.

All well and good… almost…

If we load the data into something like Scraperwiki, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal “9-Nov-2012”), they are represented as date types (in this case, the number of days since a baseline starting date). A quick check on StackOverflow turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:

import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)

The next thing we notice is that some of the date column headings actually specify: 1) date ranges, 2) in a variety of styles across the different sheets. For example:

  • 16 – 18/11/2012
  • 16 Nov 12 to 18-NOV-2012
  • 16 to 18-Nov-12

In addition, we see that some of the sheets split the data into what we might term further “subtables” as you should notice if you compare the following sheet with the previous one shown above:

Notwithstanding that the “shape” of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I’ll describe in another post), we are faced with a problem here that if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented date formats in the same way, ideally as a date structure that Python or the Scraperwiki SQLlite database can recognise as such.

[For a library that can automatically reshape this sort of hierarchical tabular data arrangement in R, see Automatic Conversion of Tables to LongForm Dataframes]

The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto “fromDates” and “toDates”. ***BEWARE – I DON’T KNOW IF THIS IS CORRECT THING TO DO**** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value then I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:


In this case, cellValue is the value of a header cell that is represented as an Excel time element, book is the workbook, as parsed using the xlrd library:

import xlrd
xlbin = scraperwiki.scrape(spreadsheetURL)
book = xlrd.open_workbook(file_contents=xlbin)

and book.datemode is a library call that looks up how dates are being represented in the spreadsheet. If the conversion fails, we default to setting dateString to the original value:

The next step was to look at the date range cells, and cast any “literal” date strings into a recognised date format. (I’ve just realised I should have optimised the way this is called in the Scraperwiki code – I am doing so many unnecessary lookups at the moment!) In the following snippet, I look to see if we can split the date into a cell range functions,

import time
from time import mktime
from datetime import datetime

def dateNormalise(d):
    #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
    #The idea is to try to identify the date pattern used, and parse the string accordingly
    for trials in ["%d %b %y",'%d-%b-%y','%d-%b-%Y','%d/%m/%Y','%d/%m/%y']:
            dtf =datetime.datetime.fromtimestamp(mktime(time.strptime(d, trials)))
        except: dtf=d
    if type(dtf) is datetime.datetime:
    return dtf

def patchDate(f,t):
    #Grab the month and year elements from the todate, and add in the from day of month number
    fromdate='-'.join( [ str(tt[0]),str(tt[1]),str(f) ])
    return fromdate

def dateRangeParse(daterange):
    #In this first part, we simply try to identify from and to portions
    dd=daterange.split(' to ')
    if len(dd)<2:
        #That is, split on 'to' doesn't work
        dd2=daterange.split(' - ')
        if len(dd2)<2:
            #Doesn't split on '-' either; set from and todates to the string, just in case.
    #By inspection, the todate looks like it's always a complete date, so try to parse it as such 
    #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
    if len(fromdate)<3:
    return (fromdate,todate)


One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.

XKCD - ISO 8601

The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.

When it comes to using the date, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven’t checked), compared to daily totals. Indeed, as the PS below shows, I’m now starting to doubt what the number in the date range column represents? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period? [This was written under assumption it was summed daily values over period, which PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate “as-if daily” returns simply by dividing ranged date totals by the number of days in the range. A more “truthful” approach may be to plot summed counts over time (date on the x-axis, sume of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the “toDate” date as its x/date value?

When I get a chance, I’ll do a couple more posts around this dataset:
– one looking at datashaping in general, along with an example of how I shaped the data in this particular case
– one looking at different queries we can run over the shaped data.

PS Another problem… on the NHS site, we see that there appear to be weekly spreadsheet releases and an aggregated release:

Because I didn’t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped weekly and aggrgeated sheets. I’m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but looking at a query over the scraped data I spotted this:

If we look at the weekly sheet we see this:

That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?

Looking at a different sheet – Adult Critical Care – we get variation at the other end of the range:

If we look into the aggregated sheet, we get:

Which is to say – the weekly report displayed a single data as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?

And here’s another query:

February data??? I thought we were looking at November data?


PPS If you’re looking for learning outcomes from this post, here are a few: three ways in which we need to wrangle sense out of dates:

  1. representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is most useful sort of representation, even if we end up casting dates into string form);
  2. parsing date ranges into pairs of date represented elements (from and to dates);
  3. where a dataset/spreadsheet contains heterogenous single date and date range columns, how do we interpret the numbers that appear in the date range column?
  4. shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the data set, that might be otherwise hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets…)


PPPS Another week, another not-quite-right feature:

another date mixup

PPPPS An update on what the numbers actually mean,from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: “On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops). Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).”

PPPPPS Another week, ans this time three new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)

Trademark Galleries on Scraperwiki, via OpenCorporates

What trademarks – familiar to us all – are registered with which companies? And how often are trademarked brands in larger outlets actually ‘exclusive offerings’, aka “own range” products in weak disguise? In Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine, I demonstrated a recipe for using as a way in to previewing at least some of the trademarks registered to a specified (UK registered?) company. Here’s a recipe using Scraperwiki to do a similar thing…

But first, do you recognise any of these trademarks…

… as belonging to Tesco?

The recipe goes something like this… For a given company name keyword, look up that company on OpenCorporates and get a list of company identifiers back:

import scraperwiki,simplejson,urllib


#note - the opencorporates api also offers a search:  companies/search
for entity in entities['result']:

print ocids

For each of those companies, we’ll need to look up the company details on OpenCorporates. To benefit from an increased API limit, it makes sense to use an OpenCorporates API key to do this. The idea of API keys is typically that they are assigned to specific users, which is to say, you’re supposed to keep them secret. But how do we do this on Scraperwiki, an otherwise open environment? Here’s trick I found on the Scraperwiki blog that allows you to keep things like API keys secret… Make the scraper a protected one, and then hide the keys in scraper description on the scraper’s homepage using the following convention:


where XXXXXX is your API key value (don’t use quotes around the value – use the actual value).

Here’s how we pick up the key:

import os, cgi
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))

To get the trademark data, we need to pull the company data for each company ID of interest, look to see if there are any trademark records associated with it in the OpenCorporates database, and if so, pull those records:

def getOCcompanyData(ocid):
    return ocdata

def getOCtmData(ocid,octmid):
    print 'tm data',octmdata
    for category in octmdata['goods_and_services_classifications']:
        if 'en' in octmdata['goods_and_services_classifications'][category]:
    tmdata['categories']=" :: ".join(categories)
    #if an image tradmarked, we can work out its URL on the WIPO site...
    if tmdata['imgtype']=='JPG' or tmdata['imgtype']=='GIF':
        tmdata['imgurl']='' + tmdata['regnum'][0:2] +'/' + tmdata['regnum'][2:4] + '/' + tmdata['regnum'] + '.'+ tmdata['imgtype'].lower()
    else: tmdata['imgurl']='' 
    print tmdata['regnum'], table_name='trademarks', data=tmdata)

    return octmdata

def grabOCTrademarks(ocid,ocdata):
    for tm in ocdata['data']:
        if tm['datum']['data_type']=='WipoTrademark':

for ocid in ocids:
    print 'company data',getOCcompanyData(ocid)

You can find the actual scraper here: Scraperwiki scraper: OpenCorporates Trademark demo

The code for the view (shown above) can be found here: Scraperwiki View: OpenCorporates Trademark demo (code) and here’s the actual view

Note that I think the OpenCorporates databases only has a very fragmentary coverage over trademark records for each company. Even keeping up with daily updates relating to trademarks and trademarked images on WIPO looks like it could be quite a major undertaking, which is maybe why companies such as Thomson Reuters recently saw an opportunity to Create Most Comprehensive Collection of Searchable Trademark Data in the World.

That’s not to say I don’t see value in this micro-attempts at making sense of the world around us such as this one, though. Trademark brands are pervasive, but it’s often not obvious who actually owns them, and may even mask a lack of competition with an appearance of competition (for example, when a company owns two different trademark brands that a consumer thinks of as being competitive offerings by different providers, when in fact they are both owned by the same company; or when you mistake a company offshoot for a third party franchise, as I realised when I saw that BP owned the Wild Bean Cafe trademark…)

The FOI Route to Real (Fake) Open Data via WhatDoTheyKnow

In FOI Signals on Useful Open Data?, I pondered whether we could make use of information about FOI to help identify what sorts of data folk might actually be interested in by virtue of making Freedom of Information (FOI) requests for that that data.

I couldn’t help but start to try working various elements of that idea through, so here’s a simple baby step to begin with – a scraper on Scraperwiki (Scaperwiki scraper: WhatDOTheyKnow requests) that searches for FOI requests made through WhatDoTheyKnow that got one or more Excel/xls spreasheets back as an attachment.

I’ve also popped up a Scraperwiki view that allows you to view data returning searches made to local councils or universities

Clicking through on an FOI request link takes you to the response that contains the data file, which can be downloaded directly or previewed on Zoho:

It strikes me that if I crawled the response pages, I could build my own index of data files, catalogued according to FOI request titles, in effect generating a “fake” or opendata catalogue as powered by FOI requests…? (What would be really handy in the local council requests would be if the responses were tagged with with appropriate LGSL code or IPSV terms (indexing on the way out) as a form of useful public metadata that can help put the FOI released data to work…?)

Insofar as the requests may or may not be useful as signaling particular topic areas as good candidates as “standard” open data releases, I still need to do some text analysis on the request titles. In the meantime, you can enter a keyword/key phrase in the Request text box in order to filter the table results to only show requests whose title contains the keyword/phrase. (The Council drop down list allows you to filter the table so that it only shows requests for a particular university/council.)

PS via a post on HelpMeInvestigate, I came across this list of FOI responses to requests made to the NHS Prescription Pricing Division. From a quick skim, some of the responses have “data” file attachments, though in the form of PDFs rather than spreadsheets/CSV. However, it would be possible to scrape the pages to at least identify ones that do have attachments (which is a clue they may contain data sets?)

So now I’m wondering – what other bodies produce full lists of FOI requests they have received, along with the responses to them?

PPS See also this search query on FOI Release publications.

Aggregated Local Government Verticals Based on LocalGov Service IDs

(Punchy title, eh?!) If you’re a researcher interested in local government initiatives or service provision across the UK on a particular theme, such as air quality, or you’re looking to start pulling together an aggregator of local council consultation exercises, where would you start?

Really – where would you start? (Please post a comment saying how you’d make a start on this before reading the rest of this post… then we can compare notes;-)

My first thought would be to use a web search engine and search for the topic term using a search limit, maybe along with intitle:council, or at least council. This would generate a list of pages on (hopefully) local gov websites relating to the topic or service I was interested in. That approach is a bit hit or miss though, so next up I’d probably go to DirectGov, or the new site, to see if they had a single page on the corresponding resource area that linked to appropriate pages on the various local council websites. (The site takes a different approach to the old DirectGov site, I think, trying to find a single page for a particular council given your location rather than providing a link for each council to a corresponding service page?) If I was still stuck, OpenlyLocal, the site set up several years ago by Chris Taggart/@countculture to provide a single point of reference for looking up common adminsitrivia details relating to local councils, would be the next thing that came to mind. For a data related query, I would probably have a trawl around, the centralised (but far form complete) UK index of open public datasets.

How much more convenient it would be if there was a “vertical” search or resource site relating to just the topic or service you were interested in, that aggregated relevant content from across the UK’s local council websites in a single place.

(Erm… or maybe it wouldn’t?!)

Anyway, here are a few notes for how we might go about constructing just such a thing out of two key ingredients. The first ingredient is the rather wonderful Local directgov services list:

This dataset is held on the Local Directgov platform which provides the deep links into Local council websites for a number of services in Directgov. The Local Authority Service details holds the local council URLS for over 240 services where the customer can directly transfer to the appropriate service page on any council in England.

The date on the dataset post is 16/09/2011, although I’m not sure if the data file itself is more current (which is one of the issues with, you could argue…). Presumably, runs off a current version of the index? (Share…. ;-) Each item in the local directgov services list carries with it a service identifier code that describes the local government service or provision associated with the corresponding web page. That it, each URL has associated with it a piece of metadata identifying a service or provision type.

Which leads to the second ingredient: the esd standards Local Government Service List. This list maps service codes onto a short key phrase description of the corresponding service. So for example, Council – consultation and community engagement is has service identifier 366, and Pollution control – air quality is 413. (See the standards page for the actual code/vocabulary list in a variety of formats…)

As a starter for ten, I’ve pulled the Directgov local gov URL listing and local gov service list into scraperwiki (Local Gov Web Pages). Using the corresponding scraper API, we can easily run a query looking up service codes relating to pollution, for example:

select * from `serviceDesc` where ToName like '%pollution%'

From this, we can pick up what service code we need to use to look up pages related to that service (413 in the case of air pollution):

select * from `localgovpages` where LGSL=413

We can also get a link to an HTML table (or JSON representation, etc) of the data via a hackable URI:*%20from%20%60localgovpages%60%20where%20LGSL%20%3D413

(Hackable in the sense we can easily change the service code to generate the table for the service with that code.)

So that’s the starter for 10. The next step that comes to my mind is to generate a dynamic Google custom search engine configuration file that defines a search engine that will search over just those URLs (or maybe those URLs plus the pages they link to). This would then provide the ability to generate custom search engines on the fly that searched over particular service pages from across localgov in a single, dynamically generated vertical.

A second thought is to grab those page, index them myself, crawl them/scrape them to find the pages they link to, and index those pages also (using something like tf-idf within each local council site to identify and remove common template elements from the index). (Hmmm… that could be an interesting complement to scraperwiki… SolrWiki, a site for compiling lists of links, indexing them, crawling them to depth N, and then configuring search ranking algorithms over the top of them… Hmmm… It’s a slightly different approach to generating custom search engines as a subset of a monolithic index, which is how the Google CSE and (previously) the Yahoo BOSS engines worked… Not scaleable, of course, but probably okay for small index engines and low thousands of search engines?)

Scraperwiki Glue Between the Ergast Experimental Motor Racing API and Google Chart Tools

Another experiment in using a Scraperwiki view – transforming the JSON output from the Ergast experimental motor racing API into the format used by the Google Chart Tools and then rendering it.

Whilst it would be possible to use a single Scraperwiki view to grab the Ergast data, transform it to the data representation used by the Google Chart tools and then render the data using those chart components within the same Scraperwiki view, I have instead separated the transformation and the rendering into two steps: a view (Ergast API to Datasource Response Format) that takes the Ergast API data and then republishes it as a JSON data feed using the Chart Tools Datasource Protocol (V0.6) response format, and a view that accepts the datasource response and uses it within an end-user view, such as an interactive dashboard (Ergast Data Viewer demo).

As a proof of concept, the data representation transformer only covers race laps data, although it is possible to request data for a race in a supported year using a URI of the form:

The demo view is similarly keyed by year and race, as for example demonstrated here where we request data for the first race (race=1) in 2012 (year=2012):

The demo provides a paged tabular view over lap time data that can also be filtered by lap number and/or driver:

(And yes, I know the ‘time in seconds’ column is wrong… there was a typo in the original script when I generated that view – fixed now…)

The following code fragment shows how straightforward it is construct the dashboard:

    google.load('visualization', '1.1', {packages:['controls']});


    function drawTable() {

      var json_data = new google.visualization.DataTable(data, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true,page:'enable',pageSize:24,'pagingButtonsConfiguration':'auto' }});

     var formatter = new google.visualization.NumberFormat( {fractionDigits: 3} );

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Lap',
        'matchType': 'any'

  var categoryPicker = new google.visualization.ControlWrapper({
    'controlType': 'CategoryFilter',
    'containerId': 'control2',
    'options': {
      'filterColumnLabel': 'Driver ID',
      'ui': {
      'labelStacking': 'vertical',
        'allowTyping': false,
        'allowMultiple': false
  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind([stringFilter,categoryPicker], [json_table]).draw(json_data);

So what are the next steps?

In terms of user facing views, I’ll try to have a play generating a range of different visualisation types/dashboard setups using different chart controls and chart types. As far as the data transformation goes, it would be worth generalising the approach so we can data from any of the Ergast API data table into the Google Chart Tools datasource format. Of course, if this data representation turns out to be useful, then maybe the Ergast API will support it directly? (Helper libraries for transforming data to the format are available for Java and Python: Implementing the Chart Tools Datasource Protocol.)

PS from the Google Chart tools documentation, I notice a new experiment feature for the datatable – column typing in the form of data roles.

Initial Sketch of Registered Addresses of Tesco Companies

Following on from Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch, where I graphed relations between Tesco registered companies based on co-directorships, I also used OpenCorporates to grab the registered addresses for the companies returned from the OpenCorporates reconciliation API based on a search using the term tesco.

This initial sketch uses two node types – companies and registered addresses (here’s the Scraperwiki view used to generate the graph file):

We can see how several of the addresses relate to the same location, although they are not identical in string matching terms – a bit of text processing may be able to fix that though…

Not surprisingly, the Cayman Islands features as well as the Cheshunt address…

Having got addresses, we could do a bit of geocoding and pop the results onto a map…here’s an example using Google Fusion Tables.