Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs

Last week, I posted a quick demo of how to visualise data stored in a Google spreadsheet in Many Eyes Wikified (HEFCE Grant Funding, in Pictures).

The data I used was the latest batch of HEFCE teaching funding data, but Joss soon tweeted to say he’d got Research funding data up on Google spreadsheets, and could I do something with that? You can see the results here: Visualising UK HEI Research Funding data on many Eyes Wikified (Joss has also had a go: RAE: UK research funding results visualised).

Anyway, today the Guardian announced a new content API (more on that later – authorised developer keys are still like gold dust), as well as the Guardian data store (strapline: “facts you can use”) and the associated Data Store Blog.

Interestingly, the data is being stored on Google docs, in part because Google spreadsheets offer an API and a wide variety of export formats.

As regular readers will know, one of the export formats from Google spreadsheets is CSV – Comma Separated Variable data – which just so happens to be liked by services such as Dabble DB and Many Eyes. I’ll try to come up with a demo of how to mash-up several different data sets in Dabble DB over the next few days, but as I’ve a spare half-hour now, I thought I’d post a qiuck demo of how to visualise some of the Guardian data store spreadsheet data in Many Eyes Wikified.

So to start, let’s look at the the RAE2008 results data – University research department rankings (you can find the actual data here:

If you speak URL, you’ll know that you can get the CSV version of the data by adding &output=csv to the URL, like this:

Inspection of the CSV output suggests there’s some crap at the top we don’t want – i.e. not actual column headings – as well as the the end of the file:

(Note this “crap” is actually important metadata – it describes the data and its provenance – but it’s not the actual data we want to visualise).

Grabbing the actualt data, without the metadata, can be achieve by grabbing a particular range of cells using the &range= URL argument. Inspection of the table suggests that meaningful data can be found in the columnar range of A to H; guesswork and a bit of binary search identifies the actual range of cell data as A2:H2365 – so we can export JUST the data, as CSV, using the URL

If you create a new page on Many Eyes Wikified, this data can be imported into a wiki page there as follows:

We can now use this data page as the basis of a set of Many Eyes visualisations. Noting that the “relative URL address” of the data page is ousefulTestboard/GuardianUKRAERankings2008 (the full URL of the wikified data page is, create a new page and put a visualisation placeholder or two in it:

Saving that page – and clicking through on the visualisation placeholder links – means you can now create your visualisation (Many Eyes seems to try to guess what visualisation you want if you use an appropriate visulisation name?):

Select the settings you want for you visualisation, and hit save:

A visualisation page will be created automatically, and a smaller, embedded version of the visualisation will appear in the wiki page:

If you visit the visualisation page – for example this Treemap visualisation, you should find it is fully interactive – which means you can explore the data for yourself, as I’ll show in a later post…

See more examples here: RAE 2008 Treemap; RAE 2008 Bubble Diagram; RAE 2008 bar chart.

Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes

In Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs I showed how to pull data from Google spreadsheets (uploaded there by the Guardian as part of their Data Store initiative and visualise it using Many Eyes Wikified.

Unfortunately, one of the downsides of using Many Eyes Wikified is that you can’t filter the imported data or select subsets of rows from it (nor can you choose to just ignore particular columns in the visualisation editor – which would be a really handy thing to be able to do, and probably quite easy to implement?)

So for example, when looking at the RAE 2008 Data, it might be quite useful to be able to just visualise the data for a single institution, such as the Open University. But how can we do this?

One way would be to find a way of searching the data in the spreadsheet, and then only extracting the rows that contained the search term – such as “Open University” – in the institution name column. But I don’t know if that’s possible in Google Spreadsheets, (though it is possible in Zoho DB, which supports all manner of queries in SQL like dialects; but that’s for another day…;-).

An alternative way is to grab the whole of the spreadsheet and then just filter out all the rows that don’t contain the filter term in the desired column… which is an idea that came into my mind last night on my way home, and which appears to work quite well, as this Filtering UK RAE 2008 CSV Data from Google Docs pipe shows:

If your pipe imported a CSV file, as this one does (in fact, if the items being pushed out of the pipe have a y:row attribute set giving a row/count number for each item) then you can get a CSV file out of the pipe too:

Which in this case means we can filter through just the RAE 2008 data for a particular institution, grab the CSV URL for that data, and pull it into Many Eyes wikified in the same way as described before:

Here it is:

And once we have a wiki data page for it, we can visualise it – I’ve popped some examples up here: OU RAE 2008 performance.

For example, how about a bubble diagram view showing our 4* performance across the various units of assessment:

(The number is the percentage of submissions in that unit of assessment achieving the specified grade.)

Or how about this one – an interactive bar chart showing the percentages of 3* and 4* submissions in each unit of assessment:

If you look back at the pipe interface, you’ll see you can use the same pipe to pull out the data for any of the HEIs:

The pipe output CSV URI also makes this clear:

In the next post on this topic, I’ll show you how to create a rather more interesting RSS feed from this sort of pipe, including some Google chart URIs created dynamically within the pipe…

But for now, it’s time to go foraging for dinner…

Creating Google Charts From CSV Data Inside a Yahoo Pipe

One of the problems with processing CSV data using a Yahoo pipe – as described in Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes – is that the RSS output of the pipe is not very well formed. It’s fine for outputting as CSV, or even JSON, but it’s not so hot as RSS.

So in this post, I’ll show you how to get a proper RSS feed out, with a meaningful title and sensible description for each item, and as an added bonus, a chart image generated from the CSV data using the Google Chart API.

So to start, let’s grab some CSV (I’m using the Guardian Data Store RAE 2008 data on Google Spreadhseets again), name the columns ourselves, and generate a sensible title for each feed item:

For the item description data, I’m going to create a simple listing of the data values:

(Don’t worry about where the chartURI came from just for the moment…)

What this loop element does is create a string for each item in the feed, and then output is as the description element for the corresponding item. The data elements within each item are displayed in a faux table within the description. (Note that I could have constructed a proper HTML data table had I wanted to…)

So here’s description (without the chartURI image element shown):

That is:

So that’s part 1 – constructing title and description elements for each feed item, where a feed item is derived from a CSV row.

Now let’s see how to create a chart image from the data using the Google Chart API.

The Google Charts API provides a RESTful API to a chart generating service – which is to say: you can give it a URI containing some data and specifying a chart type, and it will give you an image back in return, with your data graphed appropriately (see also: RESTful Image Generation – When Text Just Won’t Do).

So for example, here’s a pie chart:

And here’s its URI:×300&chd=t:10,10,45,35,0&chl=1*|2*|3*|4*|unclassifed&chtt=RAE2008%20(Open%20University%20)%20-%20Art%20and%20Design

So using our EYES, we can see different things in that URI:

  • the domain that serves the image:
  • the chart type: cht=p
  • the chart size/dimensions: &chs=450×300
  • the chart data: &chd=t:10,10,45,35,0
  • the chart labels: &chl=1*|2*|3*|4*|unclassifed
  • and the chart title:

And now we can construct URIs of that form within the pipe:

Now I do a couple o fbits of tidying up – the chartURI string just needs making into a nice URI – replace the spaces with their ASCII code representation, for example:

As we currently have no link element defined for the feed items, lets use the chart URI:


So there we have it – and RSS feed from the pipe containing the “tabulated” data from each CSV row, and a pie chart to display it:

From the pipe RSS feed URI:
you’ll notice the search term for the particular HEI you’re interested in, which means you should be easily able to hack it to give the appropriate feed for your own HEI:-)

“JOIN”ing Data from the Guardian Data Store Student Satisfaction Data

I really should make this the last post on the Guardian data store for a while, because I’ve got a stack of other things I really ought to be doing instead, but it struck me that the following demo might open up some peoples eyes as to what’s possible when you have several data sets that play nicely…

It shows you how to take data from two different spreadsheets and link it together to create a third data set that contains elements from the two original ones. (If you just want to cut to the end, here’s a (visualised) reason why it might not be such a happy idea to go to Southampton Solent if you want to study Architecture or Planning: Student Happiness on Planning and Architecture courses. Now ask yourself the question: how would I (err, that is, you) have produced that sort of chart?)

This whole idea is in and around the area of Linked Data, but doing it the hard way…

If you don’t know what Linked Data is, watch this:

So let’s have a look at the Guardian University Tables/ Satisfaction Data that has been uploaded to Google Spreadsheets:

You’ll notice there are lots of sheets in there, covering the different subject areas – like “Architecture”, for instance:

Importantly, the format of the names of the institutions is consistent across spreadsheets.

So I was wondering – if I was a student wanting to study either planning or architecture, how could i find out which institutions had a good satisfaction rating, or low student to staff ratio, across both those subjects? (Maybe I’m not sure which subject I want to do, so if I choose the wrong one and try to switch course, I know I’m not going to switch into a duff department.)

That is, I’d like to be able to see a single table containing the data from both the overall results table as well as the Architecture and Planning tables.

Now if the data was in a database, and if I spoke SQL, this would be quite easy to do (hint condition: look up sql JOIN). But I only have my browser to hand, so what to do…?

Dabble DB provides one answer… How? Here’s how…

Start off by creating a new application:

I’m going to seed it with a table containing the names of HEIs as listed in column B of the overall data table by importing just that column, as CSV data, from the Google spreadsheet:

Pull the data in:

So now we have the data:

Okay – let’s import a couple more tables, say the data for Planning and Architecture areas.

First, Planning – here’s the CSV:

Click on “More…” and you’ll be offered the chance to Add a New Category.

Take that opportunity:-)

You hopefully get the idea…

Exactly as before…

Now do the same for the Architecture data:

So now i have three tables – known as categories – in Dabble DB.

Let’s link them… that is, let’s make the data from one category available to another.

Firstly, I’ll link the Architecture data to the table that just lists the HEIs – click on the Name of Institution column to pop-up a menu and select Configure:

We’re going to Link the column to Entries in another one:

in particular, we’re going to tell Dabble DB that the Names of Institutions in the Architecture table are the same things as the institions in the HEI category/table:

If you look at the HEIs category, you’ll see the Architecture column has been brought in:

We can now do the same for Planning (remember, pop up the Name of Institution menu and Configure it to Link Entries).

The next step is to pull in some data from the two linked categories. How about we compare the Teaching Satisfaction scores for these two subjects?

Click on the column header for one of the linked categories – say the planning one, select Add Derived Field and then the field you want to pull in:

The data gets pulled in…

(Oops – this is all a bit sparse; maybe I should have used a different filed, such as Average Entry Tariff? Never mind, let’s push on…)

Add the corresponding derived field for the Architecture courses:

If you click on the “unsaved View” label, you can save this data table:

To tidy up the table, let’s hide the duplicated Name columns and resave:

To give something like this:

A nice feature of Dabble DB is that is makes it easy to export data from any given view:

So if we grab the CSV URI:

We can take it to, I dunno, Many Eyes Wikified?

Here it comes…:

Many eyes wikified data

Create a placeholder for a visulisation (the data page is ousefulTestboard/StudentHappinessPlanningArchitecture):

Or just type the text yourself:

Click through to create the viz:

We’ll have the scatter plot:

The empty cells in the data columns may cause Many Eyes Wikified to think the data is Text – it’s not, it’s Number:

Now customise the view… I could just have every spot the same, but Architecture is my first preference, so let’s just highlight the places where students are happiest doing that subject (click through to play with the visualisation):

UCL seems to do best:

So here’s a recap – we’ve essentially JOINed data from two separate spreadsheets from Google Spreadsheets to create a new data table in Dabble DB, then visualised it in Many Eyes.

Can you see now why privacy hacks don’t like the idea of linked data in government, or across companies?

So here’s you’re weekend homework – create a data set that identifies entry requirements across the various engineering subject areas, and try to find a way of visualising it ;-)

Tinkering with the Guardian Content API – SerendipiTwitterNews

Okay, so here I am again, again, trying yet again to get y’all to see just what I was on about with serendipitwitterous

So think of it this way – you’re on the Tube (route planning courtesy of a hack you knocked up using the TfL API), peering over the shoulder of someone reading the Guardian or the Observer, and an interesting headline catches your eye… You start to read it…. Serendipity has worked her charms again…

Got that? That’s what this new edition pipe/feed hack type thing lets you do – discover interesting news stories from the Guardian (or Observer) courtesy of a little pipe that does term extraction on your latest tweets and then runs them as searches on the OpenPlatform API, with the “/technology” filter set, to turn up news stories from the last 10 days or so that may be loosely related to something you tweeted about…

So here’s where we start – grab your most recent tweets, via a twitter RSS feed:

Run each tweet through the Yahoo term extraction service, and filter out null entries (i.e. ones that donlt contain any text) and any duplicates:

As a result, you might get something like this:

We’re now going to run those extracted terms through the Guardian Open Platform API. Firstly, we need to construct the RESTful URL that will call the service:

You’ll maybe notice a couple of filters in there? One limits stories to articles that have been tagged as Technology stories; the other limits the search to reasonably fresh stories (that is, stories with a datestamp from the last 10 days).

We now use these URIs to call the web service and get some XML back; when testing the service, I noticed some errors that appeared to be caused by the API limiting the number of calls per second to the service (it is still early beta days of testing, after all), so I’m limiting the number of calls I make in rapid succession to just a few queries:

Finally, we map the relevant parts of the XML response to RSS item elements:

And there we have it – a Yahoo pipe that does crude term extraction on your most recent tweets and uses them as query terms for recent tech stories in a search on the Guardian OpenPlatform Content API:

And I call it Guardian Tech SerendipiTwitterNews, okay? ;-)

Guardian Game Reviews – with Video Trailers

In what I intend to be the last post in the series for a while – maybe – here’s a quick hack around the Guardian Open-Platform Content API that shows how to “annotate” a particular set of articles (specifically, recent video game reviews) with video trailers for the corresponding game pulled in from Youtube, and then render the whole caboodle in a Grazr widget.

So let’s begin…

Take one Guardian API query: we construct the URI and call the webservice:

Here’s a typical result:

As with the football results map pipe, the linkj-text looks like a good candidate for search query text. So let’s tweak it for use as a search on somewhere like Youtube:

I want to be able to let the user choose trailer or review videos, which is why I cleared the search string of the word “review” first, before adding the user preference back into the string.

Now run a Youtube GData/API search using an old pipe block I found laying around (Youtube video search pipe) and grab the top result:

Now I happen to know that if you give a Grazr widget a feed with enclosure.url attributes that point to a flash file, it will embed the flash resource for you:

So now we can take the RSS output of the pipe and pop it into a Guardian game review with video previews Grazr widget:

(The widget layout is customised as described in part in Setting the 3 Pane divider positions.)

If you want to grab the widget and embed it your own webpages, it’s easy enough to do so (although not on hosted WordPress blogs). Simply click on “Share” and select “Customize” from the widget menu bar:

Then you can customise the widget and grab an embed code, or a link to a full screen view:

Good, eh? ;-)

PS Grazr (which is actually an OPML viewer as well as an RSS viewer) embeds other stuff too. For example, here it as as a slideshare viewer; and here it is showing how to automatically generate embedded players for Youtube videos and MP3 audio files using delicious Feed Enhancer – Auto-Enclosures pipe. (If you’re into a bit of hackery, it’ll carry Scribd iPaper too: Embed Scribd iPaper in Grazr Demo. I’m guessing you should be able to get it to embed arbitrary flash games as well?)

My Guardian OpenPlatform API’n’Data Hacks’n’Mashups Roundup

Over the last week or two, I’ve put together a handful of hacks demonstrating how to do various mashup related things using the Guardian open data on Google spreadsheets, and the open platform content API. So here’s a quick summary – think of it as a single post mashup uncourse… ;-)

Just as way of background, here’s how the Guardian Open Platform was announced (I didn’t get an invite ;-)

So here’s what I’ve been up to:

  • Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs: using a Guardian data in a particular Google docs spreadsheet, generate a URL that emits CSV formatted data from a specified region of cells in the spreadsheet and consume the live CSV data feed in Many Eyes wikified to visualise the data.
    Nuggets: how to generate the URL for the CSV output from a Google spreadsheet over a range of specified cells, and then consume that data in Many Eyes Wikified. Use the Many Eyes wikified data as the basis for several Many Eyes wikified visualisations.
  • Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes: take a set of data regarding UK HE research funding from a Google spreadsheet via CSV into a Yahoo pipe, and then extract only those data rows that correspond to a particular result. Grab the CSV output from the pipe and pass it to Many Eyes wikified to visualise the research funding data for a single university.
    Nuggets: how to use Yahoo pipes to filter CSV data to only pass through items from rows where the data in a particular column matches a given term.
  • Creating Google Charts From CSV Data Inside a Yahoo Pipe: grab the CSV output from a Google spreadsheet into a Yahoo pipe, construct description elements for each feed item using data from several different columns in each CSV row, and then generate the URI for a call to the Google charts API using data from each row. By associated the chart URI with the appriopriate media group RSS attribute, a variety of widgets should be able to “play” the charts that are automatically generated from the spreadsheet data as a slideshow.
    Nuggets: how to create Google Charts within a Yahoo pipe (using the Google Charts API) based on data pulled into the pipe from a CSV file generated from a Google spreadsheet.
  • “JOIN”ing Data from the Guardian Data Store Student Satisfaction Data: pull data in from two different sheets on the same Google spreadsheet into Dabble DB, and generate another data table containing mashed together elements from the two spreadsheets. Display the result table via Many Eyes wikified.
    Nuggets: how to use Dabble DB to “join” data from two separate tables that share a similar key column.
  • Tinkering with the Guardian Content API – SerendipiTwitterNews: given a Twitter User id, pull the most recent tweets fro that user into a Yahoo pipe and run them though a term extractor; where terms are successfully extracted, use them as a query on the Guardian content API (either all articles, or just tech articles). The aim? Generate a news feed of items serendipitously related to your most recent tweets.
    Nuggets: using the Yahoo term extractor on a Twitter feed to generate search queries over the content API; using API tags to narrow down the search to a particular topic area.
  • Last Week’s Football Reports from the Guardian Content Store API (with a little dash of SPARQL): use a Yahoo pipe to pull football match reports “since last Saturday” from the content API, extract the name of the stadium each match was played in, and use as as the search term in a SPARQL query over a DBpedia page listing the locations of each English football stadium. Retrieve the lat/lon geo co-ordinates for each stadium from the SPARQL query, and associate them with the match reports. Plot the resulting feed of geo-annotated match reports on a Google map.
    Nuggets: running a SPARQL query over DBPedia from a Yahoo pipe.
  • Guardian Game Reviews – with Video Trailers: query the Guardian content API for game reviews from a Yahoo pipe, extract the title of each game and use it in another pipe that searches Youtube for a trailer (or review) of that game; create the URL of the actual Youtube movie file, and attach is as an enclosure URL to the output feed; view the result in a portable Grazr widget, which automatically embeds the Youtube videos in the widget.
    Nuggets: using an embedded pipe block to search over Youtube; creating a feed that will auto render an embedded Youtube player when viewed in a Grazr widget.

So there you have it… my Guardian OpenPlatform hacks to date…

Visualising MPs’ Expenses Using Scatter Plots, Charts and Maps

A couple of days ago, the Guardian’s @datastore announced that a spreadsheet of UK MPs’ (Members of Parliament) expenses had been posted to the Guardian OpenPlatform datastore on Google Spreadsheets.

Just because, I though it would be nice to visualise the spreadsheet using some Many Eyes Wikified charts, so I had a look at the data, and sighed a little: in many of the spreadsheet cells was a pound sign, and Many Eyes doesn’t like those – it just wants numbers… So I went in to Yahoo pipes to create a pipe to tidy up the CSV output of the spreadsheet so I could pipe it into Many Eyes Wikified… and drew a blank: I couldn’t get the pipe to work (no CSV – just HTML (it turns out I was using the wrong URL pattern from the spreadsheet – doh!)). So I exported the CSV, reg-exped it in a text editor, adn uploaded it to create a new spreadsheet. (Which reminds me: note to self – create a tidy-upper pipe fed from the datastore and refactor the wikified data page to feed from the pipe…)

[Many Eyes Wikified is no longer available as a service – to replicate the following visulisations, you need to upload the data to Many Eyes (the none wikified version…). I think this is the spreadsheet I was pulling in to the Wikified service…]

So anyway, here are some interactive ways of visualising MPs’ expenses data using Many Eyes wikified

Firstly, a bar char – select which expenses category you’d like to chart and then view the ranked distribution by sorting by values. If you mouse over any of the bars, you’ll see which MP made that claim:

Second up, a block histogram view. This chart is good for looking at the natural distribution of different claim categories. The search box makes it easy to search for your MP by name:

Again, mousing over any of the blocks identifies the name of the MP making that claim.

Thirdly, a scatter plot. This display lets you compare an MP’s claims across two categories, and potentially bring in a thrid category using the dot size:

As with the other visulisations, mouse over any point to see which MP it belongs to.

By the by, along the way I did a couple of other Yahoo pipes – one to extract expenses by MP name, (which simply pulls in CSV from the spreadsheet, then filters on an MP’s surname), the other MPs’ expenses by postcode. The latter pipe actually embeds the foemer, and works by looking up the name of the MP by postcode, using the TheyWorkForYou API; this name is then passed in to an embedded ‘expenses by name’ pipe.

Anyway, back to the viz biz: Charles Arthur generously picked up on my tweets announcing the visualisations with a blog post on the Guardian data blog (Visualising MP expenses) in a post that included the tease:

But what we need now is a dataset which shows constituency distances
from Westminster, so that we can compare that against travel. And perhaps someone else can work out the travelling MPs’ carbon footprints based on whether they went by air or rail or car

No fair… Okay – so where to get the location data for each MP. Well, the TheyWorkForYou API came to my rescue again. One call (to getConstituencies) pulled out details of each constituency, which included the lat/long geo-coordinated of the ‘centre’ of each constituency (along with the co-ordinates of the bounding box round each constituency… maybe I’ll use those another time ;-) A second call (to getMPs) pulled out all the MPs, and their constituencies. Loading both sets of data into different sheets on Dabble DB, meant I could then link them together by constituency name (for more on linking data in Dabble DB, see Mash/Combining Data from Three Separate Sources Using Dabble DB and Using Dabble DB in an Online Mashup Context).

Adding the MP data into Dabble DB after a further bit of cleaning – removing things like Mr, Miss, and Sir from the firstnames etc – and linking by MP name meant that I could now generate a single data view that exposed MPs by name, constituency, and expense claims, along with the geolocation of the midpoint of their constituency.

After grabbing the CSV feed out of this Dabble DB view into a pipe, and tidying up the data a little once again (eg removing commas in the formatted numbers), it was an easy matter to pull the JSON output from the pipe into a map, and plot different coloured markers depending what ‘band’ the MPs’ total expenses fell into. Here’s a snapshot of that first map:

All well and good – what’s nice about this view is that it’s quite easy to see which MPs appear to be claiming disproportionately more than other MPs with constituencies in a similar area. (There may be good reason for this, like, err… whatever. This tool is just a starting point for sensemaking round the data, right?!;-). If you click on one of the markers you can pop up a little info window, too (rather sparse in this first demo):

In that first map, the only expenses data I was exposing, and mapping, was the total travel expenses claimed. So over a coffee this afternoon, I created a richer view, and tweaked the map code to let me inspect a couple of other data sets. You can find the map here: MPs’ travel expenses map.

So for example, we can look at mileage claims:

Or the total expenses claimed for living away from the primary home:

One thing these quick to put together maps show is how powerful map based displays can be used to get a feel for local differences where there is a difference. (There may well be a good reason for this, of course; including errors in the data set being used…)

It’s also interesting to use the map based displays in conjunction with other chart based visualisations, such as the MPs’ expenses visualisations on Many Eyes Wikified, to explore the data in a far more natural way than trying to make sense of a spreadsheet containing the MPs’ expenses data.

Enjoy :-)

PS the code is all as is; if it’s broken and the visualisations are consequently wrong/misleading, then I apologise in advance… ;-)

PPS See also: My Guardian OpenPlatform API’n’Data Hacks’n’Mashups Roundup, which describes 6 different recipes for playing with Guardian openplatform resources. And if you’re into F1, see Visualising Lap Time Data – Australian Grand Prix, 2009 ! ;-)

PPPS see also MPs’ Expenses by Constituency, Sort Of…, where I plot a couple of really colourful proportional symbol maps based on total travel expenses…

Querying a Google Spreadsheet of MPs’ Expenses Data: So Who Claimed for “biscuits”?

Yesterday, the Guardian published a spreadsheet to their Data Store containing all the MPs’ expenses revelations to date in a spreadsheet form (“MPs’ expenses in the news: all the revelations, as a spreadsheet“)*.

So it struck me that I should be able to find a way of easily searching that data to find just those MPs who had, for example, been claiming for biscuits

[If you don’t want to read how it’s done, cust straight to the MPs’ Expenses Search to find who’s been claiming for biscuits]

* I actually found the link to the story just now from a known item search on Google, site limited to the Guardian domain and restricted in time to the last 24 hours using on of the new Google search options (I remembered seeing the story on the Guardian website somewhere yesterday). Note to self: this is a really handy trick for searching over recent content on a particular site:-)

(To tidy those search results even more, and remove the RSS feed results, just add -inurl:feedarticle to the search terms… i.e. exclude results that have feedarticle in the URL.)

Anyway, the question was, how to search the data in the spreadsheet. Now I had a half memory from HTML Tables and the Data Web of Google releasing a query language that would allow you to query data in a “data table” object embedded in a web page – the Google QUery Language – which it turns out can be used to interrogate anything defined as a Google visualisation API data source

…and it just so happens that Google spreadsheets are so defined: Using a Google Spreadsheet as a Data Source.

So this means that I should be able to use the Google visualisation API query language to run a query on a Google Spreadsheet; like the MPs’ expenses data spreadsheet; like asking it for who’s claimed for biscuits…

So here’s what I want to do:

1) create a data table that pulls data in from a Google spreadsheet;
2) actually, that’s not strictly true – I want to run a query on the spreadsheet that pulls in some of the data from the spreadsheet (in particular, just the rows that satisfy the query);
3) I want to display the results in a table using the visualisastion API libraries (so then I don’t have to write any code to display the data myself; and more than that, I don’t even need to understand how the data has been returned from the spreadsheet).

Okay – so the ambitious next step is to try to write a test query on the spreadsheet by trying to make sense of the Google documentation, which is never as helpful as it might be.

No joy, so in the end, I copied and pasted some example code from the closest working example to what I wanted from Google’s interactive AJAX APIs Playground – an example of just getting data into a web page from a spreadsheet using the Google visualisation API libraries:

Okay – so what this example does is run a query on a spreadsheet and plot the data as a map. Just seeing the code isn’t much help though – what libraries do I need to load to run it? So I exported the whole example into a standalone worked example, did a View Source, and copied the code wholesale.

Good, I now have a canned example that pulls in data from a spreadsheet. Next step – I want to display a data table, not a map.

Again, the API Playground comes in handy – check out the table example and see what bits of the code need changing:

Change the demo code so it displays the data from the example spreadsheet as a table rather than a map, and check it works. It does. Good… So now change the spreadsheet key and see if it works to display the expenses data. It does. Good again.

Okay, now I can start to write some test queries. The AJAX API playground provides a crib again, this time in the form of the Using the Query Language example:

(Hmmm… maybe I should have just worked from this example from the start? Ah well, never mind, note to self: teach the changes required from just this example next time…)

Now it’s fun time… writing the query, the query language documentation suggests only equivalence style relations are possible, but I want to use a conditions along the lines of “select * where M LIKE ‘%biscuits%’ – that is, give me [select] all the columns in a row [*] where [where] column M [M] contains [LIKE] the word ‘biscuits’ [‘%biscuits%’].

Typing a suitably encoded a test query URL (there’s a tool to encode the query string on the query language documentation page) into the browser location bar didn’t work :-( BUT, it turned up an informative error message that described some phrases the query language does support, or at least, that are expected by the spreadsheet:

So let’s try contains rather than LIKE… which works…

Okay, so now the long and the short of it is, I know how to write queries.

So for example, here’s searching the name column (so you can search for your MP by name):
var query=’select * where A contains “‘+q+'”‘ (e.g. search for Huhne)

Here’s searching the constitutency column (so you can search or your MP by constituency):
var query=’select * where B contains “‘+q+'”‘ (e.g. Edinburgh)

And here’s searching several columns for a particular item:
var query=’select * where (M contains “‘+q+'” OR O contains “‘+q+'” OR Q contains “‘+q+'” OR S contains “‘+q+'” OR U contains “‘+q+'” OR V contains “‘+q+'”)’

Add it all together, and what have you got? A way of searching to see who’s been claiming for biscuits:

Note that searches are case sensitive…(anyone know if there’s a way round this?)

So there you have it: an MP’s expenses search engine via Google Spreadsheets :-)

First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground (aka the Guardian Datastore Explorer)

So I had another bit of a tinker, and came up with some code that’s breaking all over the place, but I think there’s enough of a vision there to have something to say, so I’ll say it…

How’s about a generic query’n’viz tool for the Guardian datastore? My first (and maybe last) attempt at a back of an envelope, sometimes it works, sometimes it doesn’t, bare bones rapid prototype of just such a thing can be found here.

In my original post on Making it a Little Easier to Use Google Spreadsheets as a Database (Hopefully!), I sketched out a simple form for helping create calls to a Google spreadsheet using the Google visualisation API query language. I then extended this to try to make the query building a little more transparent: Last Night’s Update to the “Google Spreadsheets as a Database” Demo. Today’s step is to see how we can make it easier to pull in spreadsheets from the datastore collection as a whole.

So referring to the image below, if you select a spreadsheet from the drop down list and click preview, you should get a preview of the column headings from that spreadsheet:

(The new link is to an original Guardian blog post announcing or describing the data.)

The list items are pulled in from a tag on my delicious account, which actually bookmarks the original data blog posts. The URI for the spreadsheet is added to the end of the bookmark description, and keyed with a –:

ISSUE 1: Sometimes the spreadsheet doesn’t load… I don’t know if this is down to something I’m (not) doing or not (if you’ve seen this sort of error and know a cause/fix, please post a comment below).

I’ve found if you just keep canceling the alert and clicking “Preview” the file loads in the end…

Scroll down on the page, and you can now start to build a query:

(See Last Night’s Update to the “Google Spreadsheets as a Database” Demo for more on this.)

Another new feature is the ability to preview results using various chart types, rather than just use a data table:

(Oh yes – the “bookmark” link should also allow you to share the current view with other people. At least, it shares the spreadsheet ID and the query, but not the view type…)

I haven’t implemented chart labeling, or the ability to set what values are used for what bit of the chart, so chart compomnent default rules apply. By juggling the queries (including changing the order of columns that appear in the various text boxes), you can sometimes get a reasonable chart out.

Of course, you can always just grab the CSV URL and then visualise the data in something like Many Eyes Wikified.

The chart components I used are all taken from the Google Visualisation API, so they play nicely with the Google data source representation that holds the data values.

So, that’s where it’s at, and that’s probably me done with it now… (I think I can see what’s possible so my fun is done…) And if you haven’t got an inkling of what it is I think I can see, it’s this:

A front end to the Guardian data store that lets readers:
– select a data set from the datastore (and maybe get a chance to view the original story from the datablog; I guess this could be pulled in from the Guardian OpenPlatform API?)
– write queries on that dataset to generate particular views of the data;
– generate CSV and HTML preview view URLs for a particular query, so the data can be shared with other people (turning different views on subsets of the data into social objects);
– generate quick visualisation previews of different views of the data.

Nice to haves? Maybe links to stories that also mention the data, again via the OpenPlatform API? A collection of different bookmarks/views that use the same spreadsheet, so readers can share their own views of the data (the sort of social thing that Many Eyes Wikified offers). An opportunity to accept comments on a dataset? etc etc

All told, I reckon it’s taken less than 20 hours of solo effort (except for a bit of 3rd party bug spotting ;-), plus time to write the blog posts, to get this far (but the state of the code shows that: it’s truly scrappy). A fair amount of that time was spent learning how to do stuff and looking at exemplar code on Google AJAX APIs Code Playground. Of course, there are bugs’n’issues all over the place, but as people bring them to my attention, I tend to see if there’s a quick fix…

PS (I think) I’ve just noticed a Google data source wrapper for Pachube (Google Visualization API for Pachube history), which means that as well as pulling in Guardian datastore content from Google spreadsheets (as well as other publishers’ content on Google spreadsheets), this ‘interface’ could also be applied to Pachube data. (If you know of anyone else who exposes the Google visualisation/data source API, please post a link below.)

PPS search key: I also call this the Guardian Datastore Explorer