Tagged: scraperwiki

Aggregated Local Government Verticals Based on LocalGov Service IDs

(Punchy title, eh?!) If you’re a researcher interested in local government initiatives or service provision across the UK on a particular theme, such as air quality, or you’re looking to start pulling together an aggregator of local council consultation exercises, where would you start?

Really – where would you start? (Please post a comment saying how you’d make a start on this before reading the rest of this post… then we can compare notes;-)

My first thought would be to use a web search engine and search for the topic term using a site:gov.uk search limit, maybe along with intitle:council, or at least council. This would generate a list of pages on (hopefully) local gov websites relating to the topic or service I was interested in. That approach is a bit hit or miss though, so next up I’d probably go to DirectGov, or the new gov.uk site, to see if they had a single page on the corresponding resource area that linked to appropriate pages on the various local council websites. (The gov.uk site takes a different approach to the old DirectGov site, I think, trying to find a single page for a particular council given your location rather than providing a link for each council to a corresponding service page?) If I was still stuck, OpenlyLocal, the site set up several years ago by Chris Taggart/@countculture to provide a single point of reference for looking up common adminsitrivia details relating to local councils, would be the next thing that came to mind. For a data related query, I would probably have a trawl around data.gov.uk, the centralised (but far form complete) UK index of open public datasets.

How much more convenient it would be if there was a “vertical” search or resource site relating to just the topic or service you were interested in, that aggregated relevant content from across the UK’s local council websites in a single place.

(Erm… or maybe it wouldn’t?!)

Anyway, here are a few notes for how we might go about constructing just such a thing out of two key ingredients. The first ingredient is the rather wonderful Local directgov services list:

This dataset is held on the Local Directgov platform which provides the deep links into Local council websites for a number of services in Directgov. The Local Authority Service details holds the local council URLS for over 240 services where the customer can directly transfer to the appropriate service page on any council in England.

The date on the dataset post is 16/09/2011, although I’m not sure if the data file itself is more current (which is one of the issues with data.gov.uk, you could argue…). Presumably, gov.uk runs off a current version of the index? (Share…. ;-) Each item in the local directgov services list carries with it a service identifier code that describes the local government service or provision associated with the corresponding web page. That it, each URL has associated with it a piece of metadata identifying a service or provision type.

Which leads to the second ingredient: the esd standards Local Government Service List. This list maps service codes onto a short key phrase description of the corresponding service. So for example, Council – consultation and community engagement is has service identifier 366, and Pollution control – air quality is 413. (See the standards page for the actual code/vocabulary list in a variety of formats…)

As a starter for ten, I’ve pulled the Directgov local gov URL listing and local gov service list into scraperwiki (Local Gov Web Pages). Using the corresponding scraper API, we can easily run a query looking up service codes relating to pollution, for example:

select * from `serviceDesc` where ToName like '%pollution%'

From this, we can pick up what service code we need to use to look up pages related to that service (413 in the case of air pollution):

select * from `localgovpages` where LGSL=413

We can also get a link to an HTML table (or JSON representation, etc) of the data via a hackable URI:


(Hackable in the sense we can easily change the service code to generate the table for the service with that code.)

So that’s the starter for 10. The next step that comes to my mind is to generate a dynamic Google custom search engine configuration file that defines a search engine that will search over just those URLs (or maybe those URLs plus the pages they link to). This would then provide the ability to generate custom search engines on the fly that searched over particular service pages from across localgov in a single, dynamically generated vertical.

A second thought is to grab those page, index them myself, crawl them/scrape them to find the pages they link to, and index those pages also (using something like tf-idf within each local council site to identify and remove common template elements from the index). (Hmmm… that could be an interesting complement to scraperwiki… SolrWiki, a site for compiling lists of links, indexing them, crawling them to depth N, and then configuring search ranking algorithms over the top of them… Hmmm… It’s a slightly different approach to generating custom search engines as a subset of a monolithic index, which is how the Google CSE and (previously) the Yahoo BOSS engines worked… Not scaleable, of course, but probably okay for small index engines and low thousands of search engines?)

Scraperwiki Glue Between the Ergast Experimental Motor Racing API and Google Chart Tools

Another experiment in using a Scraperwiki view – transforming the JSON output from the Ergast experimental motor racing API into the format used by the Google Chart Tools and then rendering it.

Whilst it would be possible to use a single Scraperwiki view to grab the Ergast data, transform it to the data representation used by the Google Chart tools and then render the data using those chart components within the same Scraperwiki view, I have instead separated the transformation and the rendering into two steps: a view (Ergast API to Datasource Response Format) that takes the Ergast API data and then republishes it as a JSON data feed using the Chart Tools Datasource Protocol (V0.6) response format, and a view that accepts the datasource response and uses it within an end-user view, such as an interactive dashboard (Ergast Data Viewer demo).

As a proof of concept, the data representation transformer only covers race laps data, although it is possible to request data for a race in a supported year using a URI of the form:


The demo view is similarly keyed by year and race, as for example demonstrated here where we request data for the first race (race=1) in 2012 (year=2012):


The demo provides a paged tabular view over lap time data that can also be filtered by lap number and/or driver:

(And yes, I know the ‘time in seconds’ column is wrong… there was a typo in the original script when I generated that view – fixed now…)

The following code fragment shows how straightforward it is construct the dashboard:

    google.load('visualization', '1.1', {packages:['controls']});


    function drawTable() {

      var json_data = new google.visualization.DataTable(data, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true,page:'enable',pageSize:24,'pagingButtonsConfiguration':'auto' }});

     var formatter = new google.visualization.NumberFormat( {fractionDigits: 3} );

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Lap',
        'matchType': 'any'

  var categoryPicker = new google.visualization.ControlWrapper({
    'controlType': 'CategoryFilter',
    'containerId': 'control2',
    'options': {
      'filterColumnLabel': 'Driver ID',
      'ui': {
      'labelStacking': 'vertical',
        'allowTyping': false,
        'allowMultiple': false
  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind([stringFilter,categoryPicker], [json_table]).draw(json_data);

So what are the next steps?

In terms of user facing views, I’ll try to have a play generating a range of different visualisation types/dashboard setups using different chart controls and chart types. As far as the data transformation goes, it would be worth generalising the approach so we can data from any of the Ergast API data table into the Google Chart Tools datasource format. Of course, if this data representation turns out to be useful, then maybe the Ergast API will support it directly? (Helper libraries for transforming data to the format are available for Java and Python: Implementing the Chart Tools Datasource Protocol.)

PS from the Google Chart tools documentation, I notice a new experiment feature for the datatable – column typing in the form of data roles.

Initial Sketch of Registered Addresses of Tesco Companies

Following on from Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch, where I graphed relations between Tesco registered companies based on co-directorships, I also used OpenCorporates to grab the registered addresses for the companies returned from the OpenCorporates reconciliation API based on a search using the term tesco.

This initial sketch uses two node types – companies and registered addresses (here’s the Scraperwiki view used to generate the graph file):

We can see how several of the addresses relate to the same location, although they are not identical in string matching terms – a bit of text processing may be able to fix that though…

Not surprisingly, the Cayman Islands features as well as the Cheshunt address…

Having got addresses, we could do a bit of geocoding and pop the results onto a map…here’s an example using Google Fusion Tables.

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:

The recipe is as follows:

– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson,urllib

import networkx as nx

#Keep the API key [private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))

#note - the opencorporates api also offers a search:  companies/search

def getOCcompanyData(ocid):
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            director=desc.replace('DIRECTOR APPOINTED ','')
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')

for entity in entities['result']:

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)


for row in data:
    if row['fdirector'] not in directors:
    if row['ocname'] not in companies:

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")


print tostring(writer.xml)

Saving the output of the view as a gexf file means it can be loaded directly in to Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note the parts of the whole graph are disconnected.

In the fragment below, we see Tesco Property Nominees are only losley linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is arbitrary?

Appropriating Scraperwiki Views for Data Format Transformation – Google Spreadsheet CSV to GEXF

Unpicking the architecture of Scraperwiki, we see that it offers several hosting options:

– hosting for executable scraper scripts;
– a hosted database for storing scraped data;
– hosting for executable view pages that allow views of data to be published.

Although the intention of Views hosting is presumably to provide a view over data scraped using Scraperwiki itself, we can also use it simply as a hosted scripted environment for transforming data from one representation to another.

So for example, in the post Using Google Spreadsheets to extract Google+ friend/follower information for analysis in NodeXL, Gephi, Martin Hawksey describes a way of scraping Google+ friends data into a Google spreadsheet, along with a recipe for getting the data out of Google Spreadsheets as a CSV formatted file and then into a form in which it can be loaded into NodeXL, a free social network analysis add-on package for Excel. In so doing, he also makes a request to the lazy web for how the data might be loaded into other graph analysis packages:

I’m not familiar with getting the data into tools like Gephi (other than via a NodeXL export) so if someone wants to post or link to something to do this for other tools leave a comment or get in touch. Instead here are some basic instructions for getting the data into NodeXL…

As I’ve been playing with Scraperwiki views lately, including using them to generate GEXF file representations of network data (Visualising Networks in Gephi via a Scraperwiki Exported GEXF File), I thought I’d try to knock up a View that grabs data from Martin’s spreadsheet and puts it into GEXF form. You can find the result in the Google+ Net Spreadsheet Grapher (note that it sometimes seems to time out.. oops…).

For completeness, here’s the code [as updated by @mhawksey to include a paramter grab from the view URL… I was wondering how to do that?!;-)]:

#gexf from google spreadsheet, picking up on http://mashe.hawksey.info/2011/12/google-plus-network-info-to-nodexl

import scraperwiki
import csv,urllib
import networkx as nx


#--via @mhawksey
# query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/?
#my defensive tweaks
import cgi, os
if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get:




#utilities https://scraperwiki.com/scrapers/utility_library/
def vstr(s):
    if s:
            return unicode(s)
        except UnicodeDecodeError:
            return str(s)
        return u''
# remove any non ascii characters
def ascii(s): return "".join(i for i in s if ord(i)<128)

#need to get rid of first row before the proper header row...

nReader = csv.DictReader(nodesfile)
for nrow in nReader:
    #print nrow
    if nrow['ImageUrl']!=None:
    else: imgurl=''

eReader = csv.DictReader(edgesfile)
for erow in eReader:
    #print erow
    DG.add_edge(erow['Vertex 1'],erow['Vertex 2'])

import networkx.readwrite.gexf as gf


scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

So the take home is this. As well as providing a home for scrapers, is it appropriate to use Scraperwiki as a home for (examples of) data representation conversion scripts and (re)export formatters?

PS here’s another example of using a Scraperwiki View as a standalone item – Example Twitter hashtag user friendship network, which shows how to search Twitter for recent users of a hashtag, then generate a graph file representing the friend connections between the taggers.

Exporting Yahoo Pipe Definitions, Compiling Them to Python, and Running Them in Scraperwiki

[Update Jun 2105 – it seem Yahoo Pipes is finally shutting down – see my obituary for it here: Reflections on the Closure of Yahoo Pipes.]

So you’ve got a whole bunch of Yahoo Pipes running some critical information feeds, but you’re fearful that Yahoo Pipes is going to disappear: what are you going to do? Or maybe you want to use Yahoo Pipes to harvest and process a data feed once a day and pop the results into an incremental data store, but you don’t run your own database. This post describe how the Pipe2Py Yahoo Pipes to Python code compiler running inside the data harvesting tool Scraperwiki may provide one way of solving your problem.

Over the years, I’ve created dozens and dozens of Yahoo Pipes, as well as advocating their use as a rapid prototyping environment for feed based processing, particularly amongst the #mashedlibrary/digital librarianship community. There are several sorts of value associated variously with actual Yahoo Pipes designs, including: the algorithmic design, that demonstrates a particular way of sourcing, filtering, processing, mixing and transforming one or more data series; and the operational value, for example in terms of the value associated with running the pipe and publishing, syndicating or otherwise making direct use of the output of a particular pipe.

Whilst I have tried to document elements of some of the pipework I have developed (check the pipework category on this blog, for example), many of the blog posts I have written around Yahoo Pipes have complemented them in a documentation sense, rather than providing a necessary and sufficient explanation from which a pipe can be specifically recreated. (That is, to make full sense of the posts, you often had to have access to the “source” of the pipe as well…)

To try to mitigate against the loss of Yahoo Pipes as an essential complement to many OUseful.info posts, I have from time to time explored the idea of a Yahoo Pipes Documentation Project (countering the risk of algorithmic loss), as well as the ability to export and run equivalent or “compiled” versions of Yahoo Pipes on an arbitrary server (protecting against operational loss). The ability to generate routines with an equivalent behaviour to any given Yahoo Pipe also made sense in the face of perceived concerns “from IT” about the stability of the Yahoo Pipes platform (from time to time, it has been very shaky!) as well as it’s long term availability. Whilst my attitude was typically along the lines of “if you hack something together in Yahoo Pipes that does at least something of what you want, at least you can make use of it in the short term”, I was also mindful of the fact that when applications become the basis of any service they may not be looked at again if the service appears to be working and as such other things may come to depend or otherwise rely on them. As far as I am aware, the Pipe2Py project, developed by Greg Gaughan, has for some time been the best bet when it comes to generating standalone programmes that are functionally equivalent to a wide variety of Yahoo Pipes.

As Yahoo again suffers from a round of redundancies, I thought it about time that I reconsider my own preservation strategy with respect to the possible loss of Yahoo Pipes…

Some time ago, I persuaded @frabcus to make pipe2py library available on Scraperwiki, but to my shame never did anything with it. So today, I thought I’d better address that. Building on the script I linked to from Just in Case – Saving Your Yahoo Pipes…, I put together a simple Scraperwiki script that grabs the JSON descriptions of my public/published pipes and pops them into a Scraperwiki database (Scraperwiki: pipe2py test):

import scraperwiki,urllib,json,simplejson

def getPipesJSON(id,name):
    url = ("""http://query.yahooapis.com/v1/public/yql"""
               + id + 
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    scraperwiki.sqlite.save(unique_keys=['id'], table_name='pipes', data={'id':id,'pjson':pjson,'title':name})
    if not pipe_def['query']['results']:
        print "Pipe not found"
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

def getPipesPage(uid,pageNum):
    print 'getting',uid,pageNum
    return feed

def userPipesExport(uid):

    while (scrapeit):
        feeds= getPipesPage(uid,page)
        print feeds
        if feeds['value']['hits']==0:
            for pipe in feeds['value']['items']:

#Yahoo pipes user ID


To export your own public pipe definitions, clone the scraperwiki, replace my Yahoo pipes user id (uid) with your own, and run the scraper…

Having preserved the JSON descriptions within a Scraperwiki database, the next step was to demonstrate the operationalisation of a preserved pipe. The example view at pipe2py – test view [code] demonstrates how to look up the JSONic description of a Yahoo Pipe, as preserved in a Scraperwiki database table, compile it, execute it, and print out the result of running the pipe.

import scraperwiki,json

from pipe2py import compile, Context


def getpjsonFromDB(id):
    scraperwiki.sqlite.attach( 'pipe2py_test' )
    q = '* FROM "pipes" WHERE "id"="'+id+'"'
    data = scraperwiki.sqlite.select(q)
    #print data
    pipe_def = json.loads(data[0]['pjson'])
    if not pipe_def['query']['results']:
        print "Pipe not found"
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson


p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="'+i['link']+'">'+i['title']+'</a><br/>',i['summary_detail']['value']+'<br/><br/>'

The examplePipeOutput() function in the pipes preservation Scraperwiki scraper (rather than the view) provides another example of how to compile and execute a pipe, this time by directly loading in its description from Yahoo Pipes, given it’s ID.

To preview the output of one of your own pipes by grabbing the pipe description from Yahoo Pipes, compiling it locally and then running the local compiled version, here’s an example (pipe2py – pipe execution preview):

#Example of how to grab a pipe definition from Yahoo pipes, compile and execute it, and preview its (locally obtained) output

import scraperwiki,json,urllib

from pipe2py import compile, Context


def getPipesJSON(id):
    url = ("""http://query.yahooapis.com/v1/public/yql"""
               + id + 
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    if not pipe_def['query']['results']:
        print "Pipe not found"
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson


p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="'+i['link']+'">'+i['title']+'</a><br/>',i['summary_detail']['value']+'<br/><br/>'

To try it with a pipe of your own (no actual scraper required…), clone the view and replace the pipe ID with a (published) pipe ID of your own…

(If you want to publish an RSS feed from a view, see for example the httpresponseheader cribs in Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API .)

Note that this is all very much a work in progress, both at the code level and the recipe level, so if you have any ideas about how to take it forward, or spot any bugs in the compilation of any pipes you have preserved, please let me know via the comments, or, in the case of pipe2py, by filing an issue on github (maybe even posting a bugfix?!;-) and talking nicely to Greg:-) (I fear that my Python skills aren’t up to patching pipe2py!) Also note that I’m not sure what the Scraperwiki policy is with respect to updating third party libraries, so if you do make amy contributions to the pipe2py project, @frabcus may need a heads-up regarding updating the library on Scraperwiki ;-)

PS note that the pipe2py library may still be incomplete (i.e. not all of the Yahoo Pipes blocks may not be implemented as yet). In addition, I suspect that there are some workarounds required in order to run pipes that contain other, embedded custom pipes. (The embedded pipes need compiling first.) I haven’t yet: a) tried, b) worked out how to handle these in the Scraperwiki context. (If you figure it out before I do, please post a howto in the comments;-)

Also note that at the current time the exporter will only export published pipes associated with a specific user ID. To get the full list of pipes for a user (i.e. including unpublished pipes), I think you need to be authenticated as that user? Any workarounds you can come up with for this would be much appreciated ;-)

PPS One of the things that Yahoo Pipes doesn’t offer is the ability to preserve the output of a pipe. By hosting the executable version of a pipe on Scraperwiki, it is easy enough to create a scheduled scraper than loads in the JSON definition of a pipe, for example by a query onto a database table that contains pipe descriptions based on ID, compiles it into the currently running process, calls the pipe and then pops the results into another Scraperwiki database table.

Scraperwiki Powered OpenLearn Searches – Learning Outcomes and Glossary Items

A quick follow up to Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API demonstrating how to reuse that pattern (a little more tinkering is required to fully generalise it, but that’ll probably have to wait until after the Easter wifi-free family tour… I also need to do a demo of a pure HTML/JS version of the approach).

In particular, a search over OpenLearn learning outcomes:

and a search over OpenLearn glossary items:

Both are powered by tables from my OpenLearn XML Processor scraperwiki.