OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘scraperwiki

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:

The recipe is as follows:

- grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
- grab the filings for each of those companies
- trawl through the filings looking for director appointments or terminations
- store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson,urllib

import networkx as nx

#Keep the API key [private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))

#note - the opencorporates api also offers a search:  companies/search

def getOCcompanyData(ocid):
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            director=desc.replace('DIRECTOR APPOINTED ','')
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')

for entity in entities['result']:

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)


for row in data:
    if row['fdirector'] not in directors:
    if row['ocname'] not in companies:

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")


print tostring(writer.xml)

Saving the output of the view as a gexf file means it can be loaded directly in to Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note the parts of the whole graph are disconnected.

In the fragment below, we see Tesco Property Nominees are only losley linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is arbitrary?

Written by Tony Hirst

April 12, 2012 at 3:56 pm

Appropriating Scraperwiki Views for Data Format Transformation – Google Spreadsheet CSV to GEXF

Unpicking the architecture of Scraperwiki, we see that it offers several hosting options:

- hosting for executable scraper scripts;
- a hosted database for storing scraped data;
- hosting for executable view pages that allow views of data to be published.

Although the intention of Views hosting is presumably to provide a view over data scraped using Scraperwiki itself, we can also use it simply as a hosted scripted environment for transforming data from one representation to another.

So for example, in the post Using Google Spreadsheets to extract Google+ friend/follower information for analysis in NodeXL, Gephi, Martin Hawksey describes a way of scraping Google+ friends data into a Google spreadsheet, along with a recipe for getting the data out of Google Spreadsheets as a CSV formatted file and then into a form in which it can be loaded into NodeXL, a free social network analysis add-on package for Excel. In so doing, he also makes a request to the lazy web for how the data might be loaded into other graph analysis packages:

I’m not familiar with getting the data into tools like Gephi (other than via a NodeXL export) so if someone wants to post or link to something to do this for other tools leave a comment or get in touch. Instead here are some basic instructions for getting the data into NodeXL…

As I’ve been playing with Scraperwiki views lately, including using them to generate GEXF file representations of network data (Visualising Networks in Gephi via a Scraperwiki Exported GEXF File), I thought I’d try to knock up a View that grabs data from Martin’s spreadsheet and puts it into GEXF form. You can find the result in the Google+ Net Spreadsheet Grapher (note that it sometimes seems to time out.. oops…).

For completeness, here’s the code [as updated by @mhawksey to include a paramter grab from the view URL... I was wondering how to do that?!;-)]:

#gexf from google spreadsheet, picking up on http://mashe.hawksey.info/2011/12/google-plus-network-info-to-nodexl

import scraperwiki
import csv,urllib
import networkx as nx


#--via @mhawksey
# query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/?
#my defensive tweaks
import cgi, os
if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get:




#utilities https://scraperwiki.com/scrapers/utility_library/
def vstr(s):
    if s:
            return unicode(s)
        except UnicodeDecodeError:
            return str(s)
        return u''
# remove any non ascii characters
def ascii(s): return "".join(i for i in s if ord(i)<128)

#need to get rid of first row before the proper header row...

nReader = csv.DictReader(nodesfile)
for nrow in nReader:
    #print nrow
    if nrow['ImageUrl']!=None:
    else: imgurl=''

eReader = csv.DictReader(edgesfile)
for erow in eReader:
    #print erow
    DG.add_edge(erow['Vertex 1'],erow['Vertex 2'])

import networkx.readwrite.gexf as gf


scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

So the take home is this. As well as providing a home for scrapers, is it appropriate to use Scraperwiki as a home for (examples of) data representation conversion scripts and (re)export formatters?

PS here’s another example of using a Scraperwiki View as a standalone item – Example Twitter hashtag user friendship network, which shows how to search Twitter for recent users of a hashtag, then generate a graph file representing the friend connections between the taggers.

Written by Tony Hirst

April 12, 2012 at 9:00 am

Posted in Tinkering

Tagged with

Exporting Yahoo Pipe Definitions, Compiling Them to Python, and Running Them in Scraperwiki

So you’ve got a whole bunch of Yahoo Pipes running some critical information feeds, but you’re fearful that Yahoo Pipes is going to disappear: what are you going to do? Or maybe you want to use Yahoo Pipes to harvest and process a data feed once a day and pop the results into an incremental data store, but you don’t run your own database. This post describe how the Pipe2Py Yahoo Pipes to Python code compiler running inside the data harvesting tool Scraperwiki may provide one way of solving your problem.

Over the years, I’ve created dozens and dozens of Yahoo Pipes, as well as advocating their use as a rapid prototyping environment for feed based processing, particularly amongst the #mashedlibrary/digital librarianship community. There are several sorts of value associated variously with actual Yahoo Pipes designs, including: the algorithmic design, that demonstrates a particular way of sourcing, filtering, processing, mixing and transforming one or more data series; and the operational value, for example in terms of the value associated with running the pipe and publishing, syndicating or otherwise making direct use of the output of a particular pipe.

Whilst I have tried to document elements of some of the pipework I have developed (check the pipework category on this blog, for example), many of the blog posts I have written around Yahoo Pipes have complemented them in a documentation sense, rather than providing a necessary and sufficient explanation from which a pipe can be specifically recreated. (That is, to make full sense of the posts, you often had to have access to the “source” of the pipe as well…)

To try to mitigate against the loss of Yahoo Pipes as an essential complement to many OUseful.info posts, I have from time to time explored the idea of a Yahoo Pipes Documentation Project (countering the risk of algorithmic loss), as well as the ability to export and run equivalent or “compiled” versions of Yahoo Pipes on an arbitrary server (protecting against operational loss). The ability to generate routines with an equivalent behaviour to any given Yahoo Pipe also made sense in the face of perceived concerns “from IT” about the stability of the Yahoo Pipes platform (from time to time, it has been very shaky!) as well as it’s long term availability. Whilst my attitude was typically along the lines of “if you hack something together in Yahoo Pipes that does at least something of what you want, at least you can make use of it in the short term”, I was also mindful of the fact that when applications become the basis of any service they may not be looked at again if the service appears to be working and as such other things may come to depend or otherwise rely on them. As far as I am aware, the Pipe2Py project, developed by Greg Gaughan, has for some time been the best bet when it comes to generating standalone programmes that are functionally equivalent to a wide variety of Yahoo Pipes.

As Yahoo again suffers from a round of redundancies, I thought it about time that I reconsider my own preservation strategy with respect to the possible loss of Yahoo Pipes…

Some time ago, I persuaded @frabcus to make pipe2py library available on Scraperwiki, but to my shame never did anything with it. So today, I thought I’d better address that. Building on the script I linked to from Just in Case – Saving Your Yahoo Pipes…, I put together a simple Scraperwiki script that grabs the JSON descriptions of my public/published pipes and pops them into a Scraperwiki database (Scraperwiki: pipe2py test):

import scraperwiki,urllib,json,simplejson

def getPipesJSON(id,name):
    url = ("""http://query.yahooapis.com/v1/public/yql"""
               + id + 
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    scraperwiki.sqlite.save(unique_keys=['id'], table_name='pipes', data={'id':id,'pjson':pjson,'title':name})
    if not pipe_def['query']['results']:
        print "Pipe not found"
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

def getPipesPage(uid,pageNum):
    print 'getting',uid,pageNum
    return feed

def userPipesExport(uid):

    while (scrapeit):
        feeds= getPipesPage(uid,page)
        print feeds
        if feeds['value']['hits']==0:
            for pipe in feeds['value']['items']:

#Yahoo pipes user ID


To export your own public pipe definitions, clone the scraperwiki, replace my Yahoo pipes user id (uid) with your own, and run the scraper…

Having preserved the JSON descriptions within a Scraperwiki database, the next step was to demonstrate the operationalisation of a preserved pipe. The example view at pipe2py – test view [code] demonstrates how to look up the JSONic description of a Yahoo Pipe, as preserved in a Scraperwiki database table, compile it, execute it, and print out the result of running the pipe.

import scraperwiki,json

from pipe2py import compile, Context


def getpjsonFromDB(id):
    scraperwiki.sqlite.attach( 'pipe2py_test' )
    q = '* FROM "pipes" WHERE "id"="'+id+'"'
    data = scraperwiki.sqlite.select(q)
    #print data
    pipe_def = json.loads(data[0]['pjson'])
    if not pipe_def['query']['results']:
        print "Pipe not found"
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson


p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="'+i['link']+'">'+i['title']+'</a><br/>',i['summary_detail']['value']+'<br/><br/>'

The examplePipeOutput() function in the pipes preservation Scraperwiki scraper (rather than the view) provides another example of how to compile and execute a pipe, this time by directly loading in its description from Yahoo Pipes, given it’s ID.

To preview the output of one of your own pipes by grabbing the pipe description from Yahoo Pipes, compiling it locally and then running the local compiled version, here’s an example (pipe2py – pipe execution preview):

#Example of how to grab a pipe definition from Yahoo pipes, compile and execute it, and preview its (locally obtained) output

import scraperwiki,json,urllib

from pipe2py import compile, Context


def getPipesJSON(id):
    url = ("""http://query.yahooapis.com/v1/public/yql"""
               + id + 
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    if not pipe_def['query']['results']:
        print "Pipe not found"
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson


p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="'+i['link']+'">'+i['title']+'</a><br/>',i['summary_detail']['value']+'<br/><br/>'

To try it with a pipe of your own (no actual scraper required…), clone the view and replace the pipe ID with a (published) pipe ID of your own…

(If you want to publish an RSS feed from a view, see for example the httpresponseheader cribs in Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API .)

Note that this is all very much a work in progress, both at the code level and the recipe level, so if you have any ideas about how to take it forward, or spot any bugs in the compilation of any pipes you have preserved, please let me know via the comments, or, in the case of pipe2py, by filing an issue on github (maybe even posting a bugfix?!;-) and talking nicely to Greg:-) (I fear that my Python skills aren’t up to patching pipe2py!) Also note that I’m not sure what the Scraperwiki policy is with respect to updating third party libraries, so if you do make amy contributions to the pipe2py project, @frabcus may need a heads-up regarding updating the library on Scraperwiki ;-)

PS note that the pipe2py library may still be incomplete (i.e. not all of the Yahoo Pipes blocks may not be implemented as yet). In addition, I suspect that there are some workarounds required in order to run pipes that contain other, embedded custom pipes. (The embedded pipes need compiling first.) I haven’t yet: a) tried, b) worked out how to handle these in the Scraperwiki context. (If you figure it out before I do, please post a howto in the comments;-)

Also note that at the current time the exporter will only export published pipes associated with a specific user ID. To get the full list of pipes for a user (i.e. including unpublished pipes), I think you need to be authenticated as that user? Any workarounds you can come up with for this would be much appreciated ;-)

PPS One of the things that Yahoo Pipes doesn’t offer is the ability to preserve the output of a pipe. By hosting the executable version of a pipe on Scraperwiki, it is easy enough to create a scheduled scraper than loads in the JSON definition of a pipe, for example by a query onto a database table that contains pipe descriptions based on ID, compiles it into the currently running process, calls the pipe and then pops the results into another Scraperwiki database table.

Written by Tony Hirst

April 10, 2012 at 7:49 pm

Posted in Tinkering

Tagged with ,

Scraperwiki Powered OpenLearn Searches – Learning Outcomes and Glossary Items

A quick follow up to Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API demonstrating how to reuse that pattern (a little more tinkering is required to fully generalise it, but that’ll probably have to wait until after the Easter wifi-free family tour… I also need to do a demo of a pure HTML/JS version of the approach).

In particular, a search over OpenLearn learning outcomes:

and a search over OpenLearn glossary items:

Both are powered by tables from my OpenLearn XML Processor scraperwiki.

Written by Tony Hirst

April 5, 2012 at 12:02 pm

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});


    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?

    var formatter = new google.visualization.PatternFormat('<a href="http://www.bbc.co.uk/programmes/{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply formatter and set the formatted value of the first column.

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'

  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);


The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI) in the displayed table? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for sometime whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

ocrecURL='http://opencorporates.com/reconcile?query='+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
    print ocrecURL,[recData]
    if len(recData['result'])>0:
        if recData['result'][0]['score']>=0.7:

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)

As @stuartbrown suggeted in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn LEarning Outcomes Search

Written by Tony Hirst

April 5, 2012 at 8:55 am

Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API

In Visualising Networks in Gephi via a Scraperwiki Exported GEXF File I gave an example of how we can publish arbitrary serialised output file formats from Scraperwiki using the GEXF XML file format as a specific example. Of more general use, however, may be the ability to export Scraperwiki data using the Google visualisation API DataTable format. Muddling around the Google site last night, I noticed the Google Data Source Python Library that makes it easy to generate appropriately formatted JSON data that can be consumed by the (client side) Google visualisation library. (This library provides support for generating line charts, bar charts, sortable tables, etc, as well as interactive dashboards.) A tweet to @frabcus questioning whether the gviz_api Python library was available as a third party library on Scraperwiki resulted in him installing it (thanks, Francis:-), so this post is by way of thanks…

Anyway, here are a couple of examples of how to use the library. The first is a self-contained example (using code pinched from here) that transforms the data into the Google format and then drops it into an HTML page template that can consume the data, in this case displaying it as a sortable table (GViz API on scraperwiki – self-contained sortable table view [code]):

Of possibly more use in the general case is a JSONP exporter (example JSON output (code)):

Here’s the code for the JSON feed example:

import scraperwiki
import gviz_api

#Example of:
## how to use the Google gviz Python library to cast Scraperwiki data into the Gviz format and export it as JSON

#Based on the code example at:

scraperwiki.sqlite.attach( 'openlearn-units' )
q = 'parentCourseCode,name,topic,unitcode FROM "swdata" LIMIT 20'
data = scraperwiki.sqlite.select(q)

description = {"parentCourseCode": ("string", "Parent Course"),"name": ("string", "Unit name"),"unitcode": ("string", "Unit Code"),"topic":("string","Topic")}

data_table = gviz_api.DataTable(description)

json = data_table.ToJSon(columns_order=("unitcode","name", "topic","parentCourseCode" ),order_by="unitcode")

scraperwiki.utils.httpresponseheader("Content-Type", "application/json")
print 'ousefulHack('+json+')'

I hardcoded the wraparound function name (ousefulHack), which then got me wondering: is there a safe/trusted/approved way of grabbing arguments out of the URL in Scraperwiki so this could be set via a calling URL?

Anyway, what this shows (hopefully) is an easy way of getting data from Scraperwiki into the Google visualisation API data format and then consuming either via a Scraperwiki view using an HTML page template, or publishing it as a Google visualisation API JSONP feed that can be consumed by an arbitrary web page and used direclty to drive Google visualisation API chart widgets.

PS as well as noting that the gviz python library “can be used to create a google.visualization.DataTable usable by visualizations built on the Google Visualization API” (gviz_api.py sourcecode), it seems that we can also use it to generate a range of output formats: Google viz API JSON (.ToJSon), as a simple JSON Response (. ToJSonResponse), as Javascript (“JS Code”) (.ToJSCode), as CSV (.ToCsv), as TSV (.ToTsvExcel) or as an HTML table (.ToHtml). A ToResponse method (ToResponse(self, columns_order=None, order_by=(), tqx=””)) can also be used to select the output response type based on the tqx parameter value (out:json, out:csv, out:html, out:tsv-excel).

PPS looking at eg https://spreadsheets.google.com/tq?key=rYQm6lTXPH8dHA6XGhJVFsA&pub=1 which can be pulled into a javascript google.visualization.Query(), it seems we get the following returned:
google.visualization.Query.setResponse({"version":"0.6","status":"ok","sig":"1664774139","table":{ "cols":[ ... ], "rows":[ ... ] }})
I think google.visualization.Query.setResponse can be a user defined callback function name; maybe worth trying to implement this one day?

Written by Tony Hirst

April 3, 2012 at 11:28 am

Posted in onlinejournalismblog, Tinkering

Tagged with

Visualising Networks in Gephi via a Scraperwiki Exported GEXF File

How do you visualise data scraped from the web using Scraperwiki as a network using a graph visualisation tool such as Gephi? One way is to import the a two-dimensional data table (i.e. a CSV file) exported from Scraperwiki into Gephi using the Data Explorer, but at times this can be a little fiddly and may require you to mess around with column names to make sure they’re the names Gephi expects. Another way is to get the data into a graph based representation using an appropriate file format such as GEXF or GraphML that can be loaded directly (and unambiguously) into Gephi or other network analysis and visualisation tools.

A quick bit of backstory first…

A couple of related key features for me of a “data management system” (eg the joint post from Francis Irving and Rufus Pollock on From CMS to DMS: C is for Content, D is for Data) are the ability to put data into shapes that play nicely with predefined analysis and visualisation routines, and the ability to export data in a variety of formats or representations that allow that data to be be readily imported into, or used by, other applications, tools, or software libraries. Which is to say, I’m into glue

So here’s some glue – a recipe for generating a GEXF formatted file that can be loaded directly into Gephi and used to visualise networks like this one of how OpenLearn units are connected by course code and top level subject area:

The inspiration for this demo comes from a couple of things: firstly, noticing that networkx is one of the third party supported libraries on ScraperWiki (as of last night, I think the igraph library is also available; thanks @frabcus ;-); secondly, having broken ground for myself on how to get Scraperwiki views to emit data feeds rather than HTML pages (eg OpenLearn Glossary Items as a JSON feed).

As a rather contrived demo, let’s look at the data from this scrape of OpenLearn units, as visualised above:

The data is available from the openlearn-units scraper in the table swdata. The columns of interest are name, parentCourseCode, topic and unitcode. What I’m going to do is generate a graph file that represents which unitcodes are associated with which parentCourseCodes, and which topics are associated with each parentCourseCode. We can then visualise a network that shows parentCourseCodes by topic, along with the child (unitcode) course units generated from each Open University parent course (parentCourseCode).

From previous dabblings with the networkx library, I knew it’d be easy enough to generate a graph representation from the data in the Scraperwiki data table. Essentially, two steps are required: 1) create and label nodes, as required; 2) tie nodes together with edges. (If a node hasn’t been defined when you use it to create an edge, netwrokx will create it for you.)

I decided to create and label some of the nodes in advance: unit nodes would carry their name and unitcode; parent course nodes would just carry their parentCourseCode; and topic nodes would carry an newly created ID and the topic name itself. (The topic name is a string of characters and would make for a messy ID for the node!)

To keep gephi happy, I’m going to explicitly add a label attribute to some of the nodes that will be used, by default, to label nodes in Gephi views of the network. (Here are some hints on generating graphs in networkx.)

Here’s how I built the graph:

import scraperwiki
import urllib
import networkx as nx

scraperwiki.sqlite.attach( 'openlearn-units' )
q = '* FROM "swdata"'
data = scraperwiki.sqlite.select(q)


for row in data:
    if topic not in topics:

Having generated a representation of the data as a graph using networkx, we now need to export the data. networkx supports a variety of export formats, including GEXF. Looking at the documentation for the GEXF exporter, we see that it offers methods for exporting the GEXF representation to a file. But for scraperwiki, we want to just print out a representation of the file, not actually save the printed representation of the graph to a file. So how do we get hold of an XML representation of the GEXF formatted data so we can print it out? A peek into the source code for the GEXF exporter (other exporter file sources here) suggests that the functions we need can be found in the networkx.readwrite.gexf file: a constructor (GEXFWriter), and a method for loading in the graph (.add_graph()). An XML representation of the file can then be obtained and printed out using the ElementTree tostring function.

Here’s the code I hacked out as a result of that little investigation:

import networkx.readwrite.gexf as gf


scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

Note the use of the scraperwiki.utils.httpresponseheader to set the MIMEtype of the view. If we don’t do this, scraperwiki will by default publish an HTML page view, along with a Scraperwiki logo embedded in the page.

Here’s the full code for the view.

And here’s the GEXF view:

Save this file with a .gexf suffix and you can then open the file directly into Gephi.

Hopefully, what this post shows is how you can generate your own, potentially complex, output file formats within Scraperwiki that can then be imported directly into other tools.

PS see also Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API, which shows how to generate a Google Visualisation API JSON from Scraperwiki, allowing for the quick and easy generation of charts and tables using Google Visualisation API components.

Written by Tony Hirst

April 3, 2012 at 9:39 am

Deconstructing OpenLearn Units – Glossary Items, Learning Outcomes and Image Search

It turns out that part of the grief I encountered here in trying to access OpenLearn XML content was easily resolved (check the comments: mechanise did the trick…), though I’ve still to try to sort out a workaround for accessing OpenLearn images (a problem described here)), but at least now I have another stepping stone: a database of some deconstructed OpenLearn content.

Using Scraperwiki to pull down and parse the OpenLearn XML files, I’ve created some database tables that contain the following elements scraped from across the OpenLearn units by this OpenLearn XML Processor:

  • glossary items;
  • learning objectives;
  • figure captions and descriptions.

You can download CSV data files corresponding to the tables, or the whole SQLite database. (Note that there is also an “errors” table that identifies units that threw an error when I tried to grab, or parse, the OpenLearn XML.)

Unfortunately, I haven’t had a chance yet to pop up a view over the data (I tried, briefly, but today was another of those days where something that’s probably very simple and obvious prevented me from getting the code I wanted to write working; if anyone has an example Scraperwiki view that chucks data into a sortable HTML table or a Simile Exhibit searchable table, please post a link below; or even better, add a view to the scraper:-)

So in the meantime, if ypu want to have a play, you need to make use of the Scraperwiki API wizard.

Here are some example queries:

  • a search for figure descriptions containing the word “communication” – select * from `figures` where desc like ‘%communication%’: try it
  • a search over learning outcomes that include the phrase how to followed at some point by the word dataselect * from `learningoutcomes` where lo like ‘%how to%data%’: try it
  • a search of glossary items for glossary terms that contain the word “period” or a definition that contains the word “ancient” – select * from `glossary` where definition like ‘%ancient%’ or term like ‘%period%’: try it
  • find figures with empty captions – select * from `figures` where caption==”: try it

I’ll try to add some more examples when I get a chance, as well as knocking up a more friendly search interface. Unless you want to try…?!;-)

Written by Tony Hirst

March 15, 2012 at 10:59 am

A Quick Lookup Service for UK University Bursary & Scholarship Pages

Here’s a quick recipe for grabbing a set of links from an alphabetised set of lookup pages and then providing a way of looking them up… The use case is to lookup URLs of pages on the websites of colleges and universities offering financial support for students as part of the UK National Scholarship Programme, as described on the DirectGov website:

National Scholarship programme

Index pages have URLs of the form:

<div class="subContent">
						<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www.anglia.ac.uk/nsp"   target="_blank">Anglia Ruskin University<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
				<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx"   target="_blank">Arts University College at Bournemouth<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
				<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www1.aston.ac.uk/study/undergraduate/student-finance/tuition-fees/2012-entry/ "   target="_blank">Aston University Birmingham<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>

I’ve popped a quick scraper onto Scraperwiki (University Bursaries / Scholarship / Bursary Pages) that trawls the the index pages A-Z, grabs the names of the institutions and the URLs they link to and pops them into a database.

import scraperwiki
import string,lxml.html

# A function I usually bring in with lxml that strips tags and just give you text contained in an XML substree
## via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):           
    result = [ (el.text or "") ]
    for sel in el:
        result.append(sel.tail or "")
    return "".join(result)
#As it happens, we're not actually going to use this function in this scraper, so we could remove it from the code...

# We want to poll through page URLs indexed by an uppercase alphachar
allTheLetters = string.uppercase

for letter in allTheLetters:
    #Generate the URL
    print letter
    #Grab the HTML page from the URL and generate an XML object from it
    #There are probably more efficient ways of doing this scrape...
    for element in page.findall('.//div'):
        if element.find('h3')!=None and element.find('h3').text==letter:
            for uni in element.findall('.//li/a'):
                print uni.text,uni.get('href')
                scraperwiki.sqlite.save(unique_keys=["href"], data={"href":uni.get('href'), "uni":uni.text})

Running this gives a database containing the names of the institutions that signed up to the National Scholarship Programmea and the information that have about scholarships and bursaries availabale in that context.

The Scraperwiki API allows you to run queries on this database and get the results back as JSON, HTML, CSV or RSS: University Bursaries API. So for example, we can search for bursary pages on Liverpool colleges and universities websites:

Scraperwiki API

We can also generate a view over the data on Scraperwiki… (this script shows how to interrogate the Scraperwiki database from within a webpage.

Finally, if we take the URLs from the bursary pages and pop them into a Google custom search engine, we can now search over just those pages… UK HE Financial Support (National Scholarship Programme) Search Engine. (Note that this is a bit ropey at them moment.) If you own the CSE, it’s easy enough to grab embed codes that allow you to pop search and results controls for the CSE into your own webpage.

(On the to do list is generate a view over the data that defines a Google Custom Search Engine Annotations file that can be used to describe the sites/pages searched over by the CSE.)

Written by Tony Hirst

November 13, 2011 at 12:18 pm

Posted in Tinkering

Tagged with ,

PDF Data Liberation: Formula One Press Release Timing Sheets

If you want F1 summary timing data from practice sessions, qualifying and the race itself, you might imagine that the the FIA Media Centre is the place to go:

Hmm… PDFs…

Some of the documents provide all the results on a single page in a relatively straightforward fashion:

Others are split into tables over multiple pages:

Following the race, the official classification was available as a scrapable PDF in preliminary for, but the final result – with handwritten signature – looked to be a PDF of a photocopy, and as such defies scraping without an OCR pass first… which I didn’t try…

I did consider setting up separate scrapers for each timing document, and saving the data into a corresponding Scraperwiki database, but a quick look at the license conditions made me a little wary…

No part of these results/data may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording, broadcasting or otherwise without prior permission of the copyright holder except for reproduction in local/national/international daily press and regular printed publications on sale to the public within 90 days of the event to which the results/data relate and provided that the copyright symbol appears together with the address shown below …

Instead, I took the scrapers just so far such that I (that is, me ;-) could see how I would be able to get hold of the data without too much additional effort, but I didn’t complete the job… there’s partly an ulterior motive for this too… if anyone really wants the data, then you’ll probably have to do a bit of delving into the mechanics of Scraperwiki;-)

(The other reason for not my spending more time on this at the moment is that I was looking for a couple of simple exercises to get started with grabbing data from PDFs, and the FIA docs seemed quite an easy way in… Writing the scrapers is also bit like doing Sudoku, or Killer, which is one of my weekend pastimes…;-)

The scraper I set up is here: F1 Timing Scraperwiki

To use the scrapers, you need to open up the Scraperwiki editor, and do a little bit of configuration:

(Note the the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)

When you’ve configured the scraper, run it…

The results of the scrape should now be displayed…

Scraperwiki does allow scraped data to be deposited into a database, and then accessed via an API, or other scrapers, or uploaded to Google Spreadsheets. However, my code stops at the point of getting the data into a Python list. (If you want a copy of the code, I posted it as a gist: F1 timings – press release scraper; you can also access it via Scraperwiki, of course).

Note that so far I’ve only tried the docs from a single race, so the scrapers may break on the releases published for future (or previous) races… Such is life when working with scrapers… I’ll try to work on robustness as the races go by. (I also need to work on the session/qualifying times and race analysis scrapers… they currently report unstructured data and also display an occasional glitch that I need to handle via a post-scrape cleanser.

If you want to use the scraper code as a starting point for building a data grabber that publishes the timing information as data somewhere, that’s what it’s there for (please let me know in the comments;-)

PS by the by, Mercedes GP publish an XML file of the latest F1 Championship Standings. They also appear to be publishing racetrack information in XML form using URLs of the form http://assets.mercedes-gp.com/—9—swf/assets/xml/race_23_en.xml. Presumably the next race will be 24?

If you know of any other “data” sources or machine readable, structured/semantic data relating to F1, please let me know via a comment below:-)

Written by Tony Hirst

April 10, 2011 at 9:52 pm

Posted in Data, Tinkering

Tagged with , ,


Get every new post delivered to your Inbox.

Join 725 other followers