OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘scraperwiki’

First Dabblings With Scraperwiki – All Party Groups

Over the last few months there’s been something of a roadshow making its way around the country, giving journalists, et al., hands-on experience of using Scraperwiki (I haven’t been able to make any of the events, which is a shame :-( ).

So what is Scraperwiki exactly? Essentially, it’s a tool for grabbing data from often unstructured webpages, and putting it into a simple (data) table.

And how does it work? Each wiki page is host to a screenscraper – programme code that can load in web pages, drag information out of them, and pop that information into a simple database. The scraper can be scheduled to run every so often (once a day, once a week, and so on) which means that it can collect data on your behalf over an extended period of time.
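(To give a flavour, a minimal scraper is only a few lines long. The URL and record below are placeholders of my own, but the scraperwiki.scrape and scraperwiki.datastore.save calls are the ones you’ll see used in the scrapers that follow:)

import scraperwiki
from BeautifulSoup import BeautifulSoup

# Load a page (placeholder URL), parse it...
html = scraperwiki.scrape('http://example.com/somepage.htm')
soup = BeautifulSoup(html)
title = soup.find('title').text

# ...and save a single keyed record into the Scraperwiki datastore
scraperwiki.datastore.save(["id"], {"id": 'example:1', "pagetitle": title})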

Scrapers can be written in a variety of programming languages – Python, Ruby and PHP are supported – and tutorials show how to scrape data from PDF and Excel documents, as well as HTML web pages. But for my first dabblings, I kept it simple: using Python to scrape web pages.

The task I set myself was to grab details of the membership of UK Parliamentary All Party Groups (APGs) to see which parliamentarians were members of which groups. The data is currently held on two sorts of web pages. Firstly, a list of APGs:

All party groups - directory

Secondly, pages for each group, which are published according to a common template:

APG - individual record

The recipe I needed goes as follows:
– grab the list of links to the All Party Groups I was interested in – the subject-based ones rather than the country groups;
– for each group, grab its individual record page and extract the list of 20 qualifying members;
– add records to the Scraperwiki datastore of the form (uniqueID, memberName, groupName).

So how did I get on? (You can see the scraper here: ouseful test – APGs). Let’s first have a look at the directory page – this is the bit where it starts to get interesting:

View source: list of APGs

If you look carefully, you will notice two things:
– the links to the country groups and the subject groups look the same:
<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="zimbabwe.htm">Zimbabwe</a>
</p>

<p xmlns="http://www.w3.org/1999/xhtml" class="contentsLink">
<a href="accident-prevention.htm">Accident Prevention</a>
</p>

- there is a header element that separates the list of country groups from the subject groups:
<h2 xmlns="http://www.w3.org/1999/xhtml">Section 2: Subject Groups</h2>

Since scraping largely relies on pattern matching, I took the strategy of:
– starting my scrape proper after the Section 2 header:

import scraperwiki
from BeautifulSoup import BeautifulSoup

def fullscrape():
    # We're going to scrape the APG directory page to get the URLs to the subject group pages
    starting_url = 'http://www.publications.parliament.uk/pa/cm/cmallparty/register/contents.htm'
    html = scraperwiki.scrape(starting_url)

    soup = BeautifulSoup(html)
    # We're interested in links relating to Subject Groups, not the country groups that precede them
    start=soup.find(text='Section 2: Subject Groups')
    # The links we want are in p tags
    links = start.findAllNext('p',"contentsLink")

    for link in links:
        # The urls we want are in the href attribute of the a tag, the group name is in the a tag text
        #print link.a.text,link.a['href']
        apgPageScrape(link.a.text, link.a['href'])

So that function gets a list of the page URLs for each of the subject groups. The subject group pages themselves are templated, so one scraper should work for all of them.

This is the bit of the page we want to scrape:

APG - qualifying members

The 20 qualifying members’ names are actually contained in a single table row:

APG - qualifying members table

def apgPageScrape(apg,page):
    print "Trying",apg
    url="http://www.publications.parliament.uk/pa/cm/cmallparty/register/"+page
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    #get into the table
    start=soup.find(text='Main Opposition Party')
    # get to the table
    table=start.parent.parent.parent.parent
    # The elements in the number column are irrelevant
    table=table.find(text='10')
    # Hackery...:-( There must be a better way...!
    table=table.parent.parent.parent
    print table
    
    lines=table.findAll('p')
    members=[]

    for line in lines:
        if not line.get('style'):
            m=line.text.encode('utf-8')
            m=m.strip()
            #strip out the party identifiers which have been hacked into the table (coalitions, huh?!;-)
            m=m.replace('-','–')
            m=m.split('–')
            # I was getting unicode errors on apostrophe like things; Stack Overflow suggested this...
            try:
                unicode(m[0], "ascii")
            except UnicodeError:
                m[0] = unicode(m[0], "utf-8")
            else:
                # value was valid ASCII data
                pass
            # The split test is another hack: it dumps the party identifiers in the last column
            if m[0]!='' and len(m[0].split())>1:
                print '...'+m[0]+'++++'
                members.append(m[0])
            
    if len(members)>20:
        members=members[:20]
    
    for m in members:
        #print m
        record= { "id":apg+":"+m, "mp":m,"apg":apg}
        scraperwiki.datastore.save(["id"], record) 
    print "....done",apg

So… hacky and horrible… and I don’t capture the parties which I probably should… But it sort of works (though I don’t manage to handle the <br /> tag that conjoins a couple of members in the screenshot above) and is enough to be going on with… Here’s what the data looks like:
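(As an aside, one way those conjoined names might be handled is to pull out the separate text nodes either side of the <br /> tag, rather than taking line.text wholesale. Here’s a rough sketch – the helper name is my own and it hasn’t been tested against the live page:)

def split_member_names(p_tag):
    # findAll(text=True) returns each text node separately, so names either
    # side of a <br /> tag come back as separate strings rather than one blob
    return [t.strip() for t in p_tag.findAll(text=True) if t.strip()]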

Scraped data

That’s the first step then – scraping the data… But so what?

My first thought was to grab the CSV output of the data, drop the first column (the unique key) via a spreadsheet, then treat the members’ names and group names as nodes in a network graph, visualised using Gephi (node size reflects the number of groups an individual is a qualifying member of):

APG memberships

(Not the most informative thing, but there we go… At least we can see who can be guaranteed to help get a group up and running;-)
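(For what it’s worth, the CSV-to-graph step can also be done programmatically rather than via a spreadsheet. Here’s a rough sketch using the networkx library – which isn’t part of the Scraperwiki workflow above, just one way of generating a GEXF file Gephi can open; the CSV URL is a placeholder for the scraper’s CSV export, and I’m assuming the column headings match the datastore field names:)

import csv, urllib
import networkx as nx  # assumption: networkx available locally, not in Scraperwiki

g = nx.Graph()
# Placeholder URL - substitute the CSV export URL of the APG scraper's datastore
data = csv.DictReader(urllib.urlopen('http://example.com/apg-members.csv'))
for row in data:
    # One node per MP, one per group; an edge means a qualifying membership
    g.add_node(row['mp'], type='member')
    g.add_node(row['apg'], type='group')
    g.add_edge(row['mp'], row['apg'])

# GEXF is one of the formats Gephi imports directly
nx.write_gexf(g, 'apg.gexf')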

We can also use an ego filter, depth 2, to see which people an individual is connected to by virtue of common group membership. So, for example (assuming the scraper worked correctly, which I haven’t checked!), here are John Stevenson’s APG connections (node size in this image relates to the number of groups a member has in common with John Stevenson):

John Stevenson - APG connections

So what else can we do? I tried to export the data from scraperwiki to Google Docs, but something broke… Instead, I grabbed the URL of the CSV output and used that with an =importData formula in a Google Spreadsheet to get the data into that environment. Once there it becomes a database, as I’ve described before (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula and Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I published the spreadsheet and tried to view it in my Guardian Datastore explorer, and whilst the column headings didn’t appear to display properly, I could still run queries:

APG membership

Looking through the documentation, I also notice that Scraperwiki supports Python Google Chart, so there’s a local route to producing charts from the data. There are also some geo-related functions which I probably should have a play with… (but before I do that, I need to have a tinker with the Ordnance Survey Linked Data). Ho hum… there is waaaay too much happening to keep up with (and try out) at the mo…

PS Here are some immediate thoughts on “nice to haves”… The current ability to run the scraper according to a schedule seems to append newly collected data to the original database, but sometimes you may want to overwrite the database? (This may be possible via the programme code, using something like a hypothetical scraperwiki.datastore.empty() call to empty the database before running the rest of the script?) Adding support for YQL queries by adding e.g. Python-YQL to the supported libraries might also be handy?

Written by Tony Hirst

October 29, 2010 at 12:24 pm

Accessing Linked Data in Scraperwiki via YQL

A comment from @frabcus earlier today alerted me to the fact that the Scraperwiki team had taken me up on my suggestion that they make the Python YQL library available in the Scraperwiki environment, so I thought I ought to come up with an example of using it…

YQL provides a general purpose standard query interface “to the web”, interfacing with all manner of native APIs and providing a common way of querying with them, and receiving responses from them. YQL is extensible too – If there isn’t a wrapper for your favourite API, you can write one yourself and submit it to the community. (For a good overview of the rationale for, and philosophy behind YQL, see Christian Heilmann’s the Why of YQL.)

Browsing through the various community tables, I found one for handling SPARQL queries. The YQL wrapper expects a SPARQL query and an endpoint URL, and will return the results in the YQL standard form. (Here’s an example SPARQL query in the YQL developer console using the data.gov.uk education datastore.)

The YQL query format is:
select * from sparql.search where query="YOUR_SPARQL_QUERY" and service="SPARQL_ENDPOINT_URL"
and can be called in Python YQL in the following way (Python YQL usage):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

For a couple of weeks now, I’ve been looking for an opportunity to try to do something – anything – with the newly released Ordnance Survey Linked Data (read @gothwin’s introduction to it for more details: /location /location /location – exploring Ordnance Survey Linked Data – Part 2).

One of the things the OS Linked Data looks exceedingly good for is acting as glue, mapping between different representations for geographical and organisational areas; the data can also return regions that neighbour on a region, which could make for some interesting “next door to each other” ward, district or county level comparisons.

One of the most obvious ways in to the data is via a postcode. The following Linked Data query to the ordnance survey SPARQL endpoint (http://api.talis.com/stores/ordnance-survey/services/sparql) returns the OS district ID, ward and district name that a postcode exists in:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?wardname ?districtname where { <http://data.ordnancesurvey.co.uk/id/postcodeunit/MK76AA>
postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}

Here it is running in the YQL developer console:

OS Postcode query in YQL developer console

Just by the by, we can create a query alias for that query if we want, by changing the postcode (MK76AA in the example) to @postcode. This gives us a URL argument/variable called postcode whose value gets substituted into the query whenever we call it:

http://query.yahooapis.com/v1/public/yql/psychemedia/ospostcodelookupdemo1?postcode=INSERTPOSTCODEHERE&env=http://datatables.org/alltables.env

[Note we manually need to add the environment variable &env=http://datatables.org/alltables.env to the URL created by the query alias generator/wizard.]

YQL query alias for sparql query
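(To call that query alias from a script rather than the browser, something along the following lines should do – the alias URL is the one shown above, format=json is YQL’s standard way of asking for a JSON response, and I’m assuming the usual query/results wrapping of YQL output:)

import simplejson
import urllib

# The query alias URL from above, with a postcode filled in and JSON output requested
alias_url = 'http://query.yahooapis.com/v1/public/yql/psychemedia/ospostcodelookupdemo1'
params = urllib.urlencode({'postcode': 'MK76AA',
                           'env': 'http://datatables.org/alltables.env',
                           'format': 'json'})
result = simplejson.load(urllib.urlopen(alias_url + '?' + params))
# YQL wraps results as {"query": {..., "results": {...}}}
print result['query']['results']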

So… that’s SPARQL in YQL – but how can we use it in Scraperwiki? The newly added YQL wrapper makes it easy… Here’s an example, based on the above:

import yql
import scraperwiki

os_endpoint='http://api.talis.com/stores/ordnance-survey/services/sparql'

os_query='''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?wardname ?districtname where {
<http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}
'''
postcode="MK7 6AA"

os_query=os_query.replace('MAGIC_POSTCODE',postcode.replace(' ',''))

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

result=run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print postcode,'is in the',row['result']['wardname']['value'],'ward of',row['result']['districtname']['value']
    record={ "id":postcode, "ward":row['result']['wardname']['value'],"district":row['result']['districtname']['value']}
    scraperwiki.datastore.save(["id"], record) 

I use the MAGIC_POSTCODE substitution to give me the freedom to create a procedure that will take in a postcode argument and add it into the query. Note that I am probably breaking all sorts of Linked Data rules by constructing the URL that uniquely identifies (reifies?) the postcode in the Ordnance Survey URL namespace (that is, I construct something like <http://data.ordnancesurvey.co.uk/id/postcodeunit/MK76AA>, which contravenes the “URIs are opaque” rule that some folk advocate, but I’m a pragmatist;-)
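(Wrapped up as the postcode-in, ward/district-out procedure I have in mind, it might look something like this – a sketch that assumes run_sparql_query and os_endpoint as defined above, and that os_query is still the unsubstituted template containing MAGIC_POSTCODE:)

def os_postcode_lookup(postcode):
    # Substitute the (space-stripped) postcode into the query template
    query = os_query.replace('MAGIC_POSTCODE', postcode.replace(' ', ''))
    results = []
    for row in run_sparql_query(query, os_endpoint).rows:
        results.append({'postcode': postcode,
                        'ward': row['result']['wardname']['value'],
                        'district': row['result']['districtname']['value']})
    return results

# e.g.
# for r in os_postcode_lookup('MK7 6AA'):
#     scraperwiki.datastore.save(["id"], {"id": r['postcode'], "ward": r['ward'], "district": r['district']})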

Anyway, here’s a Scraperwiki example that scrapes a postcode from a web page, and looks up some of its details via the OS: simple Ordnance Survey Linked Data postcode lookup

The next thing I wanted to do was use two different Linked Data services. Here’s the setting. Suppose I know a postcode, and I want to lookup all the secondary schools in the council area that postcode exists in. How do I do that?

The data.gov.uk education datastore lets you look up schools in a council area given the council ID. Simon Hume gives some example queries to the education datastore here: Using SPARQL & the data.gov.uk school data. The following is a typical example:

prefix sch-ont: <http://education.data.gov.uk/def/school/>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/00MG> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <http://education.data.gov.uk/def/school/PhaseOfEducation_Secondary> .
}

Here, the secondary schools are being identified according to the district area they are in (00MG in this case).

But all I have is the postcode… Can Linked Data help me get from MK7 6AA to 00MG (or, more specifically, from <http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> to <http://statistics.data.gov.uk/id/local-authority-district/00MG>)?

Here’s what the OS knows about a postcode:

What the OS knows about a postcode

If we click on the District link, we can see what the OS knows about a district:

Local authority area code lookup in OS Linked Data

The Census Code corresponds to the local council id code used in the Education datastore (thanks to John Goodwin for pointing that out…). The identifier doesn’t provide a Linked Data URI, but we can construct one out of the code value:
<http://statistics.data.gov.uk/id/local-authority-district/MAGIC_DISTRICTCODE>

(Note that the statistics.data.gov.uk lookup on the district code does include a sameas URL link back to the OS identifier.)

Here’s how we can get hold of the district code – it’s the admingeo:hasCensusCode value you’re looking for:

os_query='''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX admingeo: <http://data.ordnancesurvey.co.uk/ontology/admingeo/>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?nsdistrict ?wardname ?districtname where {
<http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname .
?district admingeo:hasCensusCode ?nsdistrict.
}
'''

postcode='MK7 6AA'
os_endpoint='http://api.talis.com/stores/ordnance-survey/services/sparql'
os_query=os_query.replace('MAGIC_POSTCODE',postcode.replace(' ',''))

result=run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print row['result']['nsdistrict']['value']
    districtcode=row['result']['nsdistrict']['value']
    print postcode,'is in the',row['result']['wardname']['value'],'ward of',row['result']['districtname']['value']
    record={ "id":postcode, "ward":row['result']['wardname']['value'],"district":row['result']['districtname']['value']} 

So what does that mean? Well, we managed to look up the district code from a postcode using the Ordnance Survey API, which means we can insert that code into a lookup on the education datastore to find schools in that council area:

import yql
import scraperwiki

def run_sparql_query(query, endpoint):
    '''
    # The following string replacement construction may be handy
    query = 'select * from flickr.photos.search where text=@text limit 3';
    y.execute(query, {"text": "panda"})
    '''
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

edu_endpoint='http://services.data.gov.uk/education/sparql'    

edu_query='''
prefix sch-ont:  <http://education.data.gov.uk/def/school/>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/MAGIC_DISTRICTCODE> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <http://education.data.gov.uk/def/school/PhaseOfEducation_Secondary>.
}
'''
districtcode='00MG'
edu_query=edu_query.replace('MAGIC_DISTRICTCODE',districtcode)
result=run_sparql_query(edu_query, edu_endpoint)
for row in result.rows:
    for school in row['result']:
        print school['name']['value'],school['reference']['value'],school['date']['value']
        record={ "id":school['reference']['value'],"name":school['name']['value'],"openingDate":school['date']['value']}
        scraperwiki.datastore.save(["id"], record)

Here’s a Scraperwiki example showing the two separate Linked Data calls chained together (click on the “Edit” tab to see the code).

Linked Data in Scraperwiki

Okay – so that was easy enough (?!;-). We’ve seen:
– how Scraperwiki supports calls to YQL;
– how to make SPARQL/Linked Data queries from Scraperwiki using YQL;
– how to get data from one Linked Data query and use it in another.
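(Pulling those pieces together, the chained lookup can be wrapped as a single routine – a sketch that assumes the run_sparql_query function, the two endpoints, and the unsubstituted os_query (the version that pulls back admingeo:hasCensusCode) and edu_query templates defined above:)

def schools_for_postcode(postcode):
    # Step 1: postcode -> census district code, via the Ordnance Survey endpoint
    q = os_query.replace('MAGIC_POSTCODE', postcode.replace(' ', ''))
    districtcode = None
    for row in run_sparql_query(q, os_endpoint).rows:
        districtcode = row['result']['nsdistrict']['value']
    if districtcode is None:
        return []
    # Step 2: district code -> secondary schools, via the education datastore
    q = edu_query.replace('MAGIC_DISTRICTCODE', districtcode)
    schools = []
    for row in run_sparql_query(q, edu_endpoint).rows:
        for school in row['result']:
            schools.append(school['name']['value'])
    return schools

# print schools_for_postcode('MK7 6AA')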

A big problem, though, is how you know whether there is a linked data path from a data element in one Linked Data store (e.g. from a postcode lookup in the Ordnance Survey data) through to another datastore (e.g. district area codes in the education datastore), when you are a mere mortal and not a Linked Data guru?! Answers on the back of a postcard, please, or via the comments below ;-)

PS Whilst doing a little digging around, I came across some geo-referencing guidance on the National Statistics website that suggests postcode areas might change over time (they also publish current and previous postcode info). So what do we assume about the status (currency, validity) of the Ordnance Survey postcode data?

PPS Just by the by, this may be useful to folk looking for Linked Data context around local councils: @pezholio’s First steps to councils publishing their own linked data

Written by Tony Hirst

November 2, 2010 at 10:18 am

Accessing Government Education Data in Scraperwiki via the Edubase/Education Datastore API

There’s lots of education data out there, but do we always need to scrape it from LEA websites? How can we easily access the data that’s in the central government datastore, and bring it into an environment we’re happy working from?

Although lots of school data has been made available for some time now, its publication as Linked Data means there’s a considerable barrier to entry in terms of functional access to, and use of, the data. (For an example of how to interrogate the Edubase Linked Data API in Scraperwiki, see Accessing Linked Data in Scraperwiki via YQL.) As an attempt to mask some of the horrors of SPARQL from mortal users, @jenit et al. worked on a simpler webservice that means you can access the data.gov.uk education datastore via a “friendly”, human readable URL, such as the following:

  • http://services.data.gov.uk/education/api/school/constituency-name/Horsham: list of schools within the constituency of Horsham
  • /education/api/school/local-authority-name/{la}: schools by local authority name
  • /education/api/school/district/{district} : schools by district ID (I think? Not sure exactly what IDs these might be?)
  • /education/api/school/area/{minLat},{minLong};{maxLat},{maxLong}: schools within a particular geographical area, as described by a latitude/longitude bounding box.

There’s a wealth of data that can be returned about a school, so various views over the data are also supported using a URL variable (for example, ?_view=provision or &_view=performance):

short – shows very basic information
medium – shows a few more fundamental details about the schools, such as its address
provision – describes the kind and number of children that they take
location – describes where the school is
performance – gives information related to their performance
admin – gives administrative information
all – gives you everything that’s known about each school

If you know any particular data attributes you want to further filter the results on, they can be specified literally. For example, the following (far from complete) list of attributes gives some idea of what’s possible, this time passed via explicit URL args:

  • ?nurseryProvision=true
  • &gender.label=Girls
  • ofstedSpecialMeasures=true
  • for searching number ranges, the min- and max- prefixes may be applied to certain parameters. For example: &max-statutoryHighAge=10 searches for schools where statutoryHighAge<=10
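(In URL terms, then, a query is just the base path plus an optional _view and any filters passed as query string arguments. The following is purely illustrative, using the constituency path, view names and filters listed above:)

import urllib

# Illustrative only: base path, view and filters are taken from the examples above
base = 'http://services.data.gov.uk/education/api/school/constituency-name/Horsham.json'
params = [('_view', 'performance'),        # one of the views listed above
          ('nurseryProvision', 'true'),    # a literal attribute filter
          ('max-statutoryHighAge', '10')]  # a range filter
url = base + '?' + urllib.urlencode(params)
print url
# The JSON this returns can be loaded with simplejson, as in the demo script below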

Jeni did a great write up of the API at A Developers’ Guide to the Linked Data APIs – Jeni Tennison (which I cribbed from heavily in the above;-). You can find a full overview of the education API documentation here: Linked Data API Configuration APIs: Edubase API

So… how can we use this in Scraperwiki? Here’s a demo:

import simplejson
import urllib
import scraperwiki

#------- USER SETTINGS ------
# Original API documentation at: http://services.data.gov.uk/education/api/api-config#schools
# Original blog post by @jenit describing the API used: http://data.gov.uk/blog/guest-post-developers-guide-linked-data-apis-jeni-tennison
# Original blog post describing this Scraperwiki page: http://blog.ouseful.info/2010/11/03/accessing-government-education-data-in-scraperwiki-via-the-edubaseeducation-datastore-api/

# The main query
eduPath='school/constituency-name/Horsham'

# Filters, as a list:
eduFilters=['min-statutoryHighAge=7','max-statutoryHighAge=10']

# _views - not considered yet...

# key and label data is displayed in the console for each result, and added to the Scraperwiki database
# keys are the top level attributes we want to display. For a result item, display each item[key]
keys=['establishmentNumber','label']

# labels are used to display labels of top level items, e.g. item[label]['label']
labels=['typeOfEstablishment','phaseOfEducation']
# Note, if you have item[path][wherever][label], or deeper down a path, we don't handle that (yet?!)

# The school ID will always be added to the Scraperwiki database (it's the database ID for a record).
# If latitude/longitude data is available, it will also be added to the database.


# Note that the script doesn't yet handle multiple pages of results either...

#-------------------------- 
  
# This function displays the results, and also adds results to the Scraperwiki database.
# We always look for school ID (this is the table ID) and latlng for mapping, if that data exists
def printDetails(item,keys=['establishmentNumber','label'],labels=[]):
    txt=[]
    record={}
    for key in keys:
        if key in item:
            txt.append(str(item[key]))
            record[key]=item[key]
        else:
            record[key]=''
    if 'establishmentNumber' not in keys:
        record['establishmentNumber']=item['establishmentNumber']
    for attribute in labels:
        if attribute in item:
            txt.append(item[attribute]['label'])
            record[attribute]=item[attribute]['label']
        else:
            record[attribute]=''
    if 'lat' in item:
        latlng=(item['lat'],item['long'])
        scraperwiki.datastore.save(["establishmentNumber"], record,latlng=latlng)
    else:
        scraperwiki.datastore.save(["establishmentNumber"], record)
        pass
    print ', '.join(txt)    
    
    
# This is where we construct the Edubase Linked Data API URL, and then call it, returning JSON
# Need to find a way of handling results spread over several results pages
data=simplejson.load(urllib.urlopen('http://services.data.gov.uk/education/api/'+eduPath+'.json'+'?'+'&'.join(eduFilters)))['result']
items=data["items"]

for item in items:
    printDetails(item,keys,labels)
    print item

You can find the code running on Scraperwiki here: ouseful scraperwiki – playing with Education datastore API

Here’s an example of what gets put in the Scraperwiki database:

Example scraperwiki datatable - education datastore API

Hopefully what this demo does is show how you can start exploring the Education datastore in Scraperwiki without having to do too much. More explanation/guidance, or at least further examples, would be required in order to demonstrate:
– the construction of valid “eduPath” statements, if possible showing how they can reuse identifier codes from other sources;
– the use of different _views, and maybe handlers for those views that add all the data to the Scraperwiki database automagically;
– how to inspect returned results so you can identify what keys and labels can be used from a result when you want to construct your own Scraperwiki database records;
– handlers for data down the result item path (i.e. more than handlers just for item[key] and item[label]['label'], but also item[here][there], item[here][there][everywhere]['label'] etc.)
– results are only pulled back from the first page; we need to find some way of handling result sets spread over multiple pages, maybe capping the total number of results retrieved. (Maybe the tweepy Cursor code could be reused for this???)
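(On that last point, my understanding – treat it as an assumption to check against the API docs – is that the Linked Data API pages results via _page and _pageSize URL arguments, so a crude loop might look something like this:)

import simplejson
import urllib

def fetch_all_items(eduPath, eduFilters, maxPages=10):
    # Loop over result pages until the API returns an empty page or we hit maxPages
    items = []
    for page in range(maxPages):
        url = ('http://services.data.gov.uk/education/api/' + eduPath + '.json?'
               + '&'.join(eduFilters + ['_page=' + str(page)]))
        result = simplejson.load(urllib.urlopen(url))['result']
        if not result.get('items'):
            break
        items.extend(result['items'])
    return items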

Written by Tony Hirst

November 3, 2010 at 1:14 pm

Posted in Data

Tagged with , ,

Practical Data Scraping – UK Government Transparency Data (Minister’s Meetings)

Earlier this week, I came across the Number 10 website’s transparency data area, which among other things has a section on who Ministers are meeting.

Needless to say, the Who’s Lobbying website has started collating this data and making it searchable, but I thought I’d have a look at the original data to see what it would take to aggregate the data myself using Scraperwiki.

The Number 10 transparency site provides a directory to Ministers’ meetings by government department on a single web page:

Number 10 transparency - ministers meetings

The links in the Ministers’ meetings, Ministers’ hospitality, Ministers’ gifts and Ministers’ overseas travel columns all point directly to CSV files. From inspecting a couple of the Ministers’ meetings CSV files, it looks as if they may be being published in a standardised way, using common column headings presented in the same order:

Ministers' meetings transparency data - csv format

Except that: some of the CSV files appeared to have a blank row between the header and the data rows, and at least one table had a blank row immediately after the data rows, followed by some notes in cells that did not map onto the semantics of the corresponding column headers. Inspecting the data, we also see that once a minister is identified, the first (Minister) column is left blank on the rows that follow, so we must presumably assume that those rows relate to meetings that minister had. When the data moves on to another minister, that Minister’s name/position is identified in the first column, once again followed by blank “same as above” cells.

To get the data into Scraperwiki we need to do two things: extract meeting data from a CSV document and get it into a form whereby we can put it into the Scraperwiki database; and scrape the Number 10 Ministers’ meetings webpage to get a list of the URLs that point to the CSV files for each department. (It might also be worth scraping the name of the department, and adding that as additional metadata to each record pulled out from the CSV docs.)
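(The second of those steps isn’t shown in the scraper below, but a sketch of it might look like the following – the directory URL is a placeholder, and I’m assuming the CSV links can be picked out simply by their .csv file extension:)

import scraperwiki
from BeautifulSoup import BeautifulSoup

# Placeholder URL for the Number 10 transparency directory page
directory_url = 'http://example.number10.gov.uk/transparency/'
soup = BeautifulSoup(scraperwiki.scrape(directory_url))

csv_urls = []
for link in soup.findAll('a', href=True):
    # Assumption: the meetings data files can be spotted by their .csv extension
    if link['href'].endswith('.csv'):
        csv_urls.append((link.text, link['href']))

for name, url in csv_urls:
    print name, url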

Here’s the Scraperwiki code I used to scrape the data. I tried to comment it, so it’s worth reading through even if you don’t speak Python, because I’m not going to provide any more description here…;-)

import urllib
import csv
import md5
import scraperwiki


url = "http://download.cabinetoffice.gov.uk/transparency/co-ministers-meetings.csv"
# I have started just looking at data from one source.
# I am assuming, (dangerously), that the column headings are:
#   a) the same, and 
#   b) in the same order
# for different departments

data = csv.DictReader(urllib.urlopen(url))

# Fudge to cope with possibility of blank row between header and first data row
started=False

# Inspection of the data file suggests that when we start considering a Minister's appointments,
#   we leave the Minister cell blank to mean "same as above".
# If we want to put the Minister's name into each row, we need to watch for that. 
minister=''

for d in data:
    if not started and d['Minister']=='':
        # Skip blank lines between header and data rows
        continue
    elif d['Minister']!='':
        # A new Minister is identified, so this becomes the current Minister of interest
        if not started:
            started=True
        minister=d['Minister']
    elif d['Date']=='' and d['Purpose of meeting']=='' and d['Name of External Organisation']=='':
        # Inspection of the original data file suggests that there may be notes at the end of the CSV file...
        # One convention appears to be that notes are separated from data rows by at least one blank row
        # If we detect a blank row within the dataset, then we assume we're at data's end
        # Of course, if there are legitimate blank rows within the data itself, we won't scrape any of the rows that follow them
        # We probably shouldn't discount the notes, but how would we handle them?!
        break
    print minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']
    id='::'.join([minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']])
    # The md5 function creates a unique ID for the meeting
    id=md5.new(id).hexdigest()
    # Some of the original files contain some Latin-1 characters (such as right single quote, rather than apostrophe)
    #   that make things fall over unless we handle them...
    purpose=d['Purpose of meeting'].decode('latin1').encode('utf-8')
    record={'id':id,'Minister':minister,'date':d['Date'],'purpose':purpose,'lobbiest':d['Name of External Organisation'].decode('latin1').encode('utf-8')}
    # Note that in some cases there may be multiple lobbiests, separated by a comma, in the same record.
    # It might make sense to generate a meeting MD5 id using the original record data, but actually store
    #   a separate record for each lobbiest in the meeting (i.e. have lobbiests and lobbiest columns) by separating on ','
    # That said, there are also records where a comma separates part of the title or affiliation of an individual lobbiest.
    # A robust convention for separating different lobbiests in the same meeting (e.g. ';' rather than ',') would help

    scraperwiki.datastore.save(["id"], record) 

for d in data:
    #use up the generator, close the file, allow garbage collection?
    continue

Here’s a preview of what the scraped data looks like:

Ministers' meetings datascrape - scraperwiki

Here’s the scraper itself, on Scraperwiki: UK Government Transparency Data – Minister’s Meetings Scratchpad

Assuming that the other CSV files are all structured the same way as the one I tested the above scraper on, we should be able to scrape meeting data from other departmental spreadsheets using the same script. (Note that I did try to be defensive in the handling of arbitrary blank lines between the first header row and the data.)

One problem arises in the context of meetings with more than one person. Ideally, I think there should be a separate row for each person attending, so for example, the Roundtable in June 2010 between the Parliamentary Secretary (Minister for Civil Society), Nick Hurd MP, and National Voices, MENCAP, National Council of Voluntary Organisations, St Christopher’s Hospice, Diabetes UK, Place 2 Be, Terrence Higgins Trust, British Heart Foundation, Princess Royal Trust for Carers and Clic Sargent might be mapped to separate data rows for each organisation present. If we take this approach, it might also make sense to ensure that each row carries with it a meeting ID, so that we can group all the rows relating to a particular meeting (one for each group in the meeting) on meeting ID.

However, there is an issue in identifying multiple attendee meetings. In the above example, we can simply separate the groups by splitting the attendees lists at each comma; but using this approach would then mean that the meeting with Secretary General, Organisation of the Islamic Conference, Ekmelledin Ihsanoglu would be mapped onto three rows for that meeting: one with Secretary General as an attendee, one with Organisation of the Islamic Conference as an attendee, and finally one with Ekmelledin Ihsanoglu identified as an attendee…

What this suggests to me is that it would be really handy (in data terms), if a convention was used in the attendees column that separated representation from different organisations with a semi-colon, “;”. We can then worry about how to identify numerous individuals from the same organisation (e.g. J Smith, P Brown, Widget Lobbying group), or how to pull out roles from organisations (Chief Lobbiest, Evil Empire Allegiance), names and roles from organisations (J Smith, Chief Lobbiest, UN Owen, Head Wrangler, Evil Empire Allegiance) and so on…
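(By way of illustration, a splitting rule that respects such a convention where it exists, and falls back to commas otherwise, might look something like this – just a sketch, with all the comma/title caveats above still applying:)

def split_attendees(attendees):
    # Prefer an explicit ';' separator if the publisher has used one...
    if ';' in attendees:
        return [a.strip() for a in attendees.split(';') if a.strip()]
    # ...otherwise fall back to commas, accepting that roles or titles containing
    # commas (the "Secretary General, Organisation of the Islamic Conference" case)
    # will be wrongly split
    return [a.strip() for a in attendees.split(',') if a.strip()]

# Illustrative usage: one datastore row per attendee, sharing the meeting's md5 id
# for attendee in split_attendees(d['Name of External Organisation']):
#     ...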

And I know, I know… the Linked Data folk would be able to model that easily.. but I’m talking about quick and dirty typographical conventions that can be easily used in simple CSV docs that more folk are comfortable with than are comfortable with complex, explicitly structured data…;-)

PS I’ll describe how to scrape the CSV urls from the Number 10 web page, and then loop through all of this to generate a comprehensive “Ministers’ meetings” database in a later post…

PPS A really informative post on the Who’s Lobbying blog goes into further detail about some of the “pragmatic reuse” problems associated with the “Ministers’ meetings” data released to date: Is this transparency? No consistent format for 500 more UK ministerial meetings.

Written by Tony Hirst

November 12, 2010 at 1:31 pm

Posted in Data

Tagged with , , ,

Screenscraping the OU Set Books Webpage – Sometimes it’s Worth Asking…

A couple of days ago, I thought I’d complement the OU’s course related Linked Data with some data relating to the set books students need to buy on some of our courses. (Some courses provide the books as part of the course materials; others require you to buy them yourselves.)

OU set books

The books required for each course (if required), are listed in a separate HTML table, one table per course. Here’s what the HTML look(ed) like:

OU set books - view src

If you inspect the HTML, you’ll see that the course code and the name of the course are contained in an element outside the table that contains the book details for the course. If you’ve ever managed a children’s party, where cards and presents are easily separated from each other, you’ll maybe get a sense of what this means when trying to screenscrape the booklist for each course… Because screenscraping can be a bit like looking inside a present. That is, it’s easy enough to grab hold of all the table elements as separate bundles (one per course) and then look inside them separately, but it can be a real pain picking up those tables-as-presents along with a set of separate envelopes and trying to make sure you keep track of which envelope goes with which present. Which is a bit like what would happen above…

What would make life easier is for each table to carry with it some sort of information about the course the table is associated with. So I sent a tweet to someone I thought might be able to help, and the tweet had repercussions:

From: ******
Sent: 16 November 2010 20:10
To: *******
Subject: set books

Tony H asks

do you know who owns template of http://bit.ly/c6qht7 ? would be scraper friendly if table summary attribute had course code and title...

-----------------

From: *******
Date: 17 November 2010 10:35:05 GMT
To: ******
Subject: RE: set books

No problem, I’ve altered the template and the summary will include code and title from tomorrow……

:-)

So here’s what the page looks like now – you’ll see the summary attribute of each table contains the course code and description.

OU set books -new html

Which makes scraping the data much easier. Here’s my Scraperwiki script (OU Set book scraper):

###############################################################################
# Basic scraper
###############################################################################
import scraperwiki
from BeautifulSoup import BeautifulSoup

unknown=0
# retrieve a page
starting_url = 'http://www3.open.ac.uk/about/setbooks/'
html = scraperwiki.scrape(starting_url)
print html
soup = BeautifulSoup(html)

# The books for each course are listed in a separate table
# use BeautifulSoup to get all <table> tags
count=0
tables = soup.findAll('table') 
for table in tables:
    for attr,val in table.attrs:
        # The course code and course title are contained in the table summary attribute
        if attr=='summary':
            print val
            ccode=val.split(' ')[0]
            ctitle=val.replace(ccode,'').strip()
    firstrow = True
    
    # Work through each row in the table - one row per book - ignoring the header row
    for row in table.findAll('tr'):
        blankLine=False
        for attr,val in row.attrs:
            if attr=='class' and val=='white':
                blankLine=True
        if blankLine:
            break
        if not firstrow:
            cells=row.findAll('td')
            print cells
            author=cells[0].text.replace('&nbsp;','')
            author=author.replace('&amp;','and')
            title=cells[1].text
            isbn=cells[2].text
            publisher=cells[3].text
            rrp=cells[4].text.replace('£','')
            print ccode, ctitle,author,title,isbn,publisher,rrp
            if isbn=='&amp;nbsp;':
                key=ccode+'_unkown:'+str(unknown)
                unknown+=1
            else:
                key=ccode+'_'+isbn
            count +=1
            record = {'id':key, 'Course Code':ccode, 'Course title':ctitle,'Author':author,'Title':title,'ISBN':isbn,'Publisher':publisher,'RRP':rrp }
    
            # save records to the datastore
            scraperwiki.datastore.save(['id'], record)
        else:
            firstrow=False

print count

[Thanks to @ostephens for pointing out my code was broken and was only scraping the first line of each table... oops :-( Note to self: always check, and run just one more test...]

And here’s the result (data as CSV):

OU set books - scraped

And the moral? Sometimes it’s worth asking – just on the offchance – if a page owner can make a pragmatic little change to the page that can make all the difference when it comes to a scrape being easy to achieve, or rather more involved…

PS it’d be nice to see this added to the course linked data on data.open.ac.uk?

PPS Maybe I should have asked the Lucero team about the Linked Data… @mdaquin tweeted: http://bit.ly/aCXDCx last part of second URI is ISBN of book (will add more info, and more “course material” soon) :-)

OU Linked Data - course books

There are more books listed here than I scraped from the set book list though, so I wonder what http://data.open.ac.uk/saou/ontology#hasBook means? Are these all published books associated with a course, irrespective of whether a student has to buy them themselves (as on the set book list), or whether they are supplied as part of the course materials? Or maybe this list includes more courses than on the set book page for some reason? [UPDATE: the scraper was broken, and was only grabbing the first row of each table into the database...bah:-( Apols... The results are closer now - I scrape 356 compared to 367 reported from the LD query, but only 340-odd in the database, so maybe I still have a bug or two in the scraper:-( Ah - some records have no ISBN, and I was using ISBN as part of the unique ID for each record...Fixed that, but still not getting counts to tally though:-( ] In any case, I think the distinction between supplied and provided books is an important one: for example, if I want to find out the cost of a course, then it would be useful to be able to price in the cost of any books I have to buy myself? That said, being able to find all books associated with courses is also handy?

Written by Tony Hirst

November 18, 2010 at 3:38 pm

Posted in Data, OU2.0

Tagged with ,

PDF Data Liberation: Formula One Press Release Timing Sheets

If you want F1 summary timing data from practice sessions, qualifying and the race itself, you might imagine that the FIA Media Centre is the place to go:

Hmm… PDFs…

Some of the documents provide all the results on a single page in a relatively straightforward fashion:

Others are split into tables over multiple pages:

Following the race, the official classification was available as a scrapable PDF in preliminary form, but the final result – with handwritten signature – looked to be a PDF of a photocopy, and as such defies scraping without an OCR pass first… which I didn’t try…

I did consider setting up separate scrapers for each timing document, and saving the data into a corresponding Scraperwiki database, but a quick look at the license conditions made me a little wary…

No part of these results/data may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording, broadcasting or otherwise without prior permission of the copyright holder except for reproduction in local/national/international daily press and regular printed publications on sale to the public within 90 days of the event to which the results/data relate and provided that the copyright symbol appears together with the address shown below …

Instead, I took the scrapers just far enough that I (that is, me ;-) could see how I would be able to get hold of the data without too much additional effort, but I didn’t complete the job… There’s partly an ulterior motive for this too… if anyone really wants the data, then you’ll probably have to do a bit of delving into the mechanics of Scraperwiki ;-)

(The other reason for my not spending more time on this at the moment is that I was looking for a couple of simple exercises to get started with grabbing data from PDFs, and the FIA docs seemed quite an easy way in… Writing the scrapers is also a bit like doing Sudoku, or Killer, which is one of my weekend pastimes… ;-)

The scraper I set up is here: F1 Timing Scraperwiki

To use the scrapers, you need to open up the Scraperwiki editor, and do a little bit of configuration:

(Note that the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)

When you’ve configured the scraper, run it…

The results of the scrape should now be displayed…

Scraperwiki does allow scraped data to be deposited into a database, and then accessed via an API, or other scrapers, or uploaded to Google Spreadsheets. However, my code stops at the point of getting the data into a Python list. (If you want a copy of the code, I posted it as a gist: F1 timings – press release scraper; you can also access it via Scraperwiki, of course).

Note that so far I’ve only tried the docs from a single race, so the scrapers may break on the releases published for future (or previous) races… Such is life when working with scrapers… I’ll try to work on robustness as the races go by. (I also need to work on the session/qualifying times and race analysis scrapers… they currently report unstructured data and also display an occasional glitch that I need to handle via a post-scrape cleanser.)

If you want to use the scraper code as a starting point for building a data grabber that publishes the timing information as data somewhere, that’s what it’s there for (please let me know in the comments;-)

PS by the by, Mercedes GP publish an XML file of the latest F1 Championship Standings. They also appear to be publishing racetrack information in XML form using URLs of the form http://assets.mercedes-gp.com/—9—swf/assets/xml/race_23_en.xml. Presumably the next race will be 24?

If you know of any other “data” sources or machine readable, structured/semantic data relating to F1, please let me know via a comment below:-)

Written by Tony Hirst

April 10, 2011 at 9:52 pm

Posted in Data, Tinkering

Tagged with , ,

A Quick Lookup Service for UK University Bursary & Scholarship Pages

Here’s a quick recipe for grabbing a set of links from an alphabetised set of lookup pages and then providing a way of looking them up… The use case is to lookup URLs of pages on the websites of colleges and universities offering financial support for students as part of the UK National Scholarship Programme, as described on the DirectGov website:

National Scholarship programme

Index pages have URLs of the form:
http://www.direct.gov.uk/en/EducationAndLearning/UniversityAndHigherEducation/StudentFinance/StudentfinanceA-Z/index.htm?indexChar=D

<div class="subContent">
			<h3>A</h3>
						<div class="subContent">
						<h4></h4>
					<ul class="subLinks">
						<li><a href="http://www.anglia.ac.uk/nsp"   target="_blank">Anglia Ruskin University<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
					</ul>
				</div>
				<div class="subContent">
						<h4></h4>
					<ul class="subLinks">
						<li><a href="http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx"   target="_blank">Arts University College at Bournemouth<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
					</ul>
				</div>
				<div class="subContent">
						<h4></h4>
					<ul class="subLinks">
						<li><a href="http://www1.aston.ac.uk/study/undergraduate/student-finance/tuition-fees/2012-entry/ "   target="_blank">Aston University Birmingham<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
					</ul>
				</div>
		
	</div>

I’ve popped a quick scraper onto Scraperwiki (University Bursaries / Scholarship / Bursary Pages) that trawls the index pages A-Z, grabs the names of the institutions and the URLs they link to, and pops them into a database.

import scraperwiki
import string,lxml.html

# A function I usually bring in with lxml that strips tags and just give you text contained in an XML substree
## via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):           
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)
#As it happens, we're not actually going to use this function in this scraper, so we could remove it from the code...

# We want to poll through page URLs indexed by an uppercase alphachar
allTheLetters = string.uppercase

for letter in allTheLetters:
    #Generate the URL
    url="http://www.direct.gov.uk/en/EducationAndLearning/UniversityAndHigherEducation/StudentFinance/StudentfinanceA-Z/index.htm?indexChar="+letter
    print letter
    #Grab the HTML page from the URL and generate an XML object from it
    page=lxml.html.fromstring(scraperwiki.scrape(url))
    #There are probably more efficient ways of doing this scrape...
    for element in page.findall('.//div'):
        if element.find('h3')!=None and element.find('h3').text==letter:
            for uni in element.findall('.//li/a'):
                print uni.text,uni.get('href')
                scraperwiki.sqlite.save(unique_keys=["href"], data={"href":uni.get('href'), "uni":uni.text})

Running this gives a database containing the names of the institutions that signed up to the National Scholarship Programme, along with links to the information they publish about scholarships and bursaries available in that context.

The Scraperwiki API allows you to run queries on this database and get the results back as JSON, HTML, CSV or RSS: University Bursaries API. So for example, we can search for bursary pages on Liverpool colleges and universities websites:

Scraperwiki API
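(Programmatically, the same sort of query can be run from a script by calling the datastore API endpoint directly. The sketch below is based on my recollection of the sqlite endpoint, so treat the URL pattern, the default swdata table name and the scraper short name as assumptions to check against the API documentation:)

import simplejson
import urllib

# Assumed endpoint format and parameters - check against the Scraperwiki API docs
api = 'http://api.scraperwiki.com/api/1.0/datastore/sqlite'
params = urllib.urlencode({
    'format': 'json',
    'name': 'university_bursaries',  # assumed short name of the scraper
    'query': "select uni, href from swdata where uni like '%Liverpool%'"
})
for row in simplejson.load(urllib.urlopen(api + '?' + params)):
    print row['uni'], row['href']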

We can also generate a view over the data on Scraperwiki… (this script shows how to interrogate the Scraperwiki database from within a webpage.)

Finally, if we take the URLs of the bursary pages and pop them into a Google custom search engine, we can now search over just those pages… UK HE Financial Support (National Scholarship Programme) Search Engine. (Note that this is a bit ropey at the moment.) If you own the CSE, it’s easy enough to grab embed codes that allow you to pop search and results controls for the CSE into your own webpage.

(On the to do list is generate a view over the data that defines a Google Custom Search Engine Annotations file that can be used to describe the sites/pages searched over by the CSE.)

Written by Tony Hirst

November 13, 2011 at 12:18 pm

Posted in Tinkering

Tagged with ,
