Posts Tagged ‘OpenPlatform’
One of the things I try to impress on folk whenever I do a talk about web stats is how showing charts of mean (that is, simple averages) numbers of visitors or time on site is all but useless, because the actual distribution of values is not likely to be normal, and so the simple averages reported are, too all intents and purposes, not good for anything.
So when putting together a change treemap to summarise the changes in funding of quangos form government departments (Visualising Where the Money Goes: Westminster Quangos, Part 1), what I did originally was to visualise the change in funding for each quango within each department, rather than just displaying the following simpler, overall funding by government department treemap that I posted about previously:
The reason being? The overall change treemap might show increases or decreases in expenditure for a department as a whole, but it doesn’t reveal whether the funded quangos were all treated equally, or whether lots of small quangos received a cut whilst one big one received a huge rise in funding, for example:
So how did I create this treemap? The simplest way is just to create a query on the original spreadsheet that pulls in 4 columns – department, quango, and two expnditure columns (one for 2006 and one for 2007). A query built around this kernel, in fact:
select C,D,R,S order by C,R desc
(To see how such a query is put together, see Part 1 of this exercise.)
To generate a query that only displays quangos that had a non-zeor expenditure in both 2006 and 2007, just add:
where (R>0 AND S>0) after the select statement and before order by.
Just as an aside, note that it’s possible to to pull the output of this query into another spreadsheet. This allows you to add derived columns to the spreadsheet, for example, busing relative formaula that act on quantities automativcally inported into other columns in the spreadsheet. (One thing I intend to explore with the data store explorer is a 1-click way of creating a new spreadsheet that pulls in a query created using the explorer. See also: Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…))
The CSV feed can then be pulled into a Many Eyes Wikified and visulaised in a variety of ways that reveal both the overall expenditure of quangos funded from within each department, as well as the relative funding over 2006/2007 of each quango:
So for example, for the two biggest quangos by expenditure in Defence, one spent more over 2006/7, and the other spent less…
Using the same data page , we can create other visualisations within the Many Eyes Wikified environment that allow us to explore the data in a little more detail. So for example, here’s a bit more detail about the funding of quangos in the Department of Health. Parallel bands show that quangos spent equivalent amounts in 2006 and 2007, lines that diverge going from left to right show an increase in expenditure, and lines that converge going from left to right depict decreasing expenditure.
A scatter chart readily shows large changes in expnditure over the two years:
And some charts are just shiny ;-)
Compared with just trawling through the spreadsheet data, I think there is a lot to be said for using visualisations to identify out of the norm data points (e.g. using the scatterplot), or unpacking totals (as in the case of departmentally funded quango expenditure) in a quick and easy way as a starting point to exploring the data in a more systematic way, for example in order to ask journalistic questions (whatever they might be).
My tweetwatching has been a little sporadic of late, so I haven’t really been keeping up with the data sets that keep being posted to the Guardian Datablog, but today I had a quick skim through some of the recent uploads and had my eye caught by a post about funding of UK government quangos (Every quango in Britain [spreadsheet]).
What I’ll show in this post is how to construct a query on one of the quangos data sheets that can then be visualised as a change treemap, showing at a single glance how funding over all the quangos (arranged by sponsor department) has changed over the period 2006-2007.
The quangos spreadsheet is split into several different areas and at face value is quite hard to make sense of (what’s the difference in status (and provenance of the data) between the NHS and Health quangos, for example?
But I take nothing if not a pragmatic view about all this data stuff, and whilst there may be, err, issues with doing “proper” data journalism with this spreadsheet, I think we can still get value from just seeing what sorts of technology enhanced questions we might ask of this data, such as it is (as wll as identifying any problems with the data as it is presented), and as a result maybe identifying various issues with how to present and engage with this data were it to be republished again.
As ever, my doodles don’t properly acknowledge the provenance or source of the data, nor do I try to make any sense out of the data or look for any specific ‘meaning’ within it – I’m still at the stage of sharpening my technology pencil and seeing what sorts of marks it can make – but this is something I know I don’t do, and will be something I start to look at somewhen, honest!
So let’s make a start. To provide a bit of context, the questions I set out with when doodling through this post were:
1) is the data clean enough to run summarising queries on the data (that is, queries that summed totals for different departments)?
2) is the data clean enough to not break Many Eyes Wikified if i pass it to that visualisation tool via a CSV feed?
And a matter arising:
3) how do I write a query that specifies column headings (the headings in the spreadsheet leave something to be desired, at times….)?
The spreadsheet I chose to play with was the Westminster sheet, which you can find from here: UK Quangos [Guardian Datastore] (you’ll need to select the appropriate tab).
Just by looking at the data in the spreadsheet we notice a couple of things, things that suggest certain queries we might run on the data. (I probably need to unpack that phrase at some point (“run certain queries”) but the essence of it is this: if we treat the spreadsheet as a database, so sort of reprts can we generate from it; typically, in a database environment, reports are generated by running queries using some sort of database query language, which in the case of Google spreadsheets is the SQL like Google Visualisation API Query Language.)
So, the first thing I noticed are the two columns on the left – Government departments and presumably the quangos sponsored by those departments. And what these suggested to me were that I should be able to generate reports that summarise expenditure over all quangos in each department. (Whether or not this is interesting, I don’t know; but it’s something that we should be able to do easily enough, and it may spark off other questions in our mind).
The second thing I noticed was that lots of the data straddled two years (2006 and 2007)
And finally, gross expenditure seemed like a meaningful quantity and maybe least open to contention, so I decided to pick on that as the quantity to sharpen my pencil with:
To start playing with spreadsheets, I bookmarked it so that I could play with it in my Data Store explorer (note that I needed to specify the sheet number, where the first sheet is sheet 0, the second is sheet 1, and so on; so the Westminster sheet (the third sheet form the left in the spreadsheet) is sheet 2):
When we preview the column headings, (which the API assumes are in the top row, I think?), we get – not a lot…
If we scroll down in the data store explorer, we get at least the spreadsheet column labels:
Anyway, let’s try to run a query that summarises the overall gross expenditure for 2006 (column R) and 2007 (column S) for each department (column C):
The query is encoded as:
select C,sum(R),sum(S) group by C order by sum(R) desc
So we select three columns, and for each column, group the rows according to department (column C), display the summed value over those grouped rows for columns R and S, and presenting them in descending (desc) order by column sum(R):
If we click on the HTML preview link, we can view the table in its own page:
(A link to the CSV version is also generated.)
The data explorer doesn’t support forms for all the queries we can write yet, so the next step requires hacking the HTML table URI directly to add labels to the columns:
select C,sum(R),sum(S) group by C order by sum(R) desc label C ‘Dept’, sum(R) ’2006 Gross Expenditure’, sum(S) ’2007 Expenditure’
If you’re hacking the URI in a recent browser address/location bar, you don’t need to encode things like spaces as %20, or single quotes as %27, because the browser will take care of it for you:
If you then copy this URI and paste it back into the location bar, the encoded version will be generated for you, e.g. so that you can use it as a link in a web page:
So we have our basic summary report, and can now use the CSV output of the spreadsheet so that it can be transported elsewhere. There are two things we need to do now.
The first is to just change the output format of the data from an HTML table to CSV. Take this part of the URI:
and change it to this:
If you preview the CSV, you’ll notice there’s a problem with it though:
There are rogue commas everywhere, appearing within the ‘numerical’ data, and this file is supposed to use commas to separate out different elements. To get proper numbers out, we need to set their format, which means adding something to the end of the URI:
format sum(R) ‘#’,sum(S) ‘#’
(Not that you do need to encode the #s by hand, as %23)
That is, present sum(R) and sum(S) in a proper numerical format:
So there we have summary tables showing the expenditure for each government department. Many Eyes Wikified isn’t letting m import that data directly via CSV at the moment, but it’s easy enough to download the CSV and copy the data into a Many Eyes Wikified data page:
(Casting an eye over the CSV data, there are also a couple of glitches in it that upset the grouping – eg “Home” and “Home ” (trailing space) are treated differently.)
We can now use this page to create some visualisations – so put some placeholders in place in a visualisation page:
And then generate some visualisations…
One of the potentially most informative summary views for this data is a change treemap, that shows the relative gross expenditure for each department, along with whether it has increased or decreased between 2006 and 2007:
Blue is decrease from 2006 to 2007, orange is an increase.
The next step is now to create a change treemap for each quango within a department, but that’s a post for another day… [UPDATE: see it here - Visualising Where the Money Goes: Westminster Quangos, Part 2]
Last week, Steph Gray announced yet another innovative way of trying to engage people in public consultations in his blog post Your starter for ten. The piece describes a scheme in which a series of pub quiz style “killer facts” are pulled out of a current consultation document on consumer rights and credit and then represented in a quiz format along with why you should care/what the consultation is seeking to do to address the issues raised by each quiz question. (You can find the quiz from a link on here: Government action to secure a better deal for consumers.)
This idea, of microchunking particular elements of a consultation and then trying to use these microchunks to draw people into commenting on a consultation document, is one that Joss Winn and I have casually explored in the context of WriteToReply. In that case, we discussed whether or not we should pull out intriguing facts or potentially contentious questions that we could then tweet, along with a link to the appropriate part of the consultation document, in order to entice people into commenting, either directly on the WriteToReply site, or by remote commenting (that is, posting a blog comment or tweet that links back to a particular paragraph on the WriteToReply site site that we can then track via a Trackback).
(As part of this, we imagined creating a list of ‘nuggets’ pulled from consultation docs as we imported them into WordPress; it strikes me now that if we did have such a list, we could set up a twitter account for each consultation that could be run on a ‘daily feeds’ like basis – whenever anyone subscribes, they start to receive tweets @’d to them, according to a personal schedule starting at the moment they follow the consultation, as well as more general broadcast tweets?)
So for example, here are a couple of tweets that we sent out yesterday in support of a new consultation doc on WriteToReply about funding local and regional news (Sustainable independent and impartial news):
One thing to note here is that rathr than linking to the actual paragraph that contains the question, which is what we’d normally do, these tweets link to paragraphs that preview, and provide the context for, the questions. So if you follow the link, you are lead into the body of the consultation document, and if you read on you then come to the question included in the tweet. That is, the tweet provides the question that sets the contest, the link leads through to the part of the consultation that provides the context for the question, and then to the question as it appears in the consultation doc.
Also on Twitter, Joss and I fell into a conversation with Steph and Richard Stirling about the different audiences for consultation docs and what the appropriate means of publication are for those different audiences. So for example, Steph suggested “Consultations have multiple audiences. Suspect downloadable PDFs actually not bad for policy folk. But for public?” [ref], which was backed up by Richard: “I agree with @lesteph’s point. As a policy person I often want to read the whole doc – not sections. PDF works.” [ref].
However, if the aim is to reach outside the policy wonks and the committed lobbiests/interest group members, then I suspect we need smaller ‘headline’ chunks, or atomic parts of the consultation document, to pull people in to the consultation. (Also, we may learn something form the journalists here, and the way they construct stories to lead people in, or at least, give them some of the facts – that is, facts they can misquote in the pub later! – up front.)
There are dangers with the headline approach, of course, as the ‘simplistic’ tweeted questions shown above suggest… At the simplest reading, they just solicit a trivial yes/no answer, rather than an informed comment. But bear this in mind too – those questions were taken from the consultation document itself.
A further thing that’s interesting to note is how the consultation document is actually constructed. The ‘argument’, such as it, and the issues that the consultation wishes to be taken into account, are used to preface the actual questions (see the sections on Potential Sources of Top-Up Funding or Protecting the BBC’s Funding for a couple of examples).
That is, some issues a presented, and then the question is asked. But how likely is this to work as an engagement strategy? A cold start conversational strategy would probably be more likely to start with a question, followed by a discussion (or argument) and an agreement to disagree.
So on the WriteToReply “plural news” consultation dashboard, we have started to explore how we can hook people into the consultation, first through a re-presentation of the consultation questions as simple polls:
and also by using the questions to lead in to some of the discussion that actually appears before the questions in the original consultation document:
We’ve also started looking at pulling related news stories in to the dashboard, in the first instance from the Guardian using the Guardian OpenPlatform API, to try and embed the consultation in a wider context:
There is an issue of circularity here, of course – the news reports presented to date stem largely from responses to the original consultation call, so rather than setting the consultation in context, you could argue they are just responses to it.
Cf. also the approach taken particularly on BBC sites where full articles on a government documents are often backed up with a link to the original document:
But we have to start somewhere, and we are, after all, making this stuff up as we go along. If nothing else, we are exploring how to re-balance the presentation of the consultation doc and associated news stories compared to the mode of presentation used by the BBC et al.
And finally (and slightly off topic!), note that we’re also using WordPress feed to pull in both the content of the report and the comments from the WriteToReply republication of the original consultation document:
However, whilst we can pull the content of the report into the dashboard via an RSS feed, the paragraph level links and links and comment links are not passed though the RSS:
(I suspect this is because the linking is managed by the CommentPress theme? Joss – maybe we need to look at adding paragraph and “comment here” links to the RSS content too?)
When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question as to whether this initiative would influence the attitude of the Office of National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).
In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian i.e. [Google's Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”
I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways, and publishing it via Google Spreadsheets certainly lowers quite a few barriers to entry from a technical perspective which can make life easier for statisticians and the visualisers, and reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy” as well as providing access to data where it is difficult to access: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.
But if you don’t take care of the data you’re publishing, the even though there are friendly APIs to the data it doesn’t necessarily follow that the data will be useful.
As Steph Gray says in Cui bono? The problem with opening up data:
Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.
So with all that in mind, is the Guardian DataStore adding value to the data in the data store in an accessibility sense by reducing the need for data mungers to have to process the data so that it can be used in a plug’n'play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?
As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:
Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skill related to the black arts of the data mungers, I think?!;-)
The URI for the metapage that hosts a link to the RAE/research data blog post is:
and links to the teaching related posts is:
Going back up the common path to http://www.guardian.co.uk/news/datablog+education/ we get…. a 404 :-(
Hmmm… So how come the datablog+education page doesn’t link down to the HE collection pages, as wll as the schools data blog pages (e.g. these are both valid:
- http://www.guardian.co.uk/news/datablog+education/school-tables and
and might naturally be expected to be linked to from:
Looking back to the HE teaching related datasets, we see they are both listed on the http://www.guardian.co.uk/news/datablog+education/higher-education page. So might we then expect them to be ‘compatible’ datasets in some sense?
That is, do the HE data sets share common values, for instance in the way the HEIs are named?
If we generate a couple of queries on to the university satisfaction tables and the dropout tables (maybe trying to look for correlations between drop out rate and student satisfaction) by pulling the results from different queries on those tables in to a data grid within a Google spreadsheet (cf. the approach taken in Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)), what do we gt?
Here’s a search for “Leeds”, for example:
One table contains items:
- Leeds Trinity & All Saints
- Leeds Met
and the other contains:
- Leeds College of Music
- The University of Leeds
- Leeds Metropolitan University
- Leeds Trinity and All Saints
So already, even with quite a young datastore, we have an issue with data quality. In Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman identifies “seven common data quality issues) which include the related problems of too much data (i.e. multiple copies of the same data in different places – that is, redundancy) and data inconsistency across sources (not a problem the datastore suffers from – yet?) and poor data definition (p41 -preview available on Google books?).
This latter issue, poor data definition, is evident in the naming of the HEI institutions above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB), for example) because the HEIs don’t have common names.
So what does Redmond have to say about this (p.55)?
- Find and fix errors
- Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…).
(See also Redmond’s “Hierarchy of Data and Information Needs”, (p. 58), which identifies the need for consistency across sources.)
Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)
So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?
PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.
PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:
Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.
Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)
So it seems that the Numeracy in the Newsroom blog liked the ‘roll your Google Squared’ toy that I rediscovered in Is Google Squared Just a Neatly Packaged and Generalised =googlelookup Array?.
But how about if you wanted to roll your own square by pulling in data from data sources you had defined yourself, such as spreadsheets in the Guardian datastore?
Well, it occurred to me that it should be easy enough to generate queries onto those spreadsheets (e.g. using the link generators described in First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground et al.) and roll them into a square.
Okay – so I know I’ve lost you but I only have 10 mins to crank this post out, so I’ll walk you through a demo (ish).
Create a new Google spreadsheet and pull in some data from the Conservative Shadow Cabinet expenses spreadsheet, such as their total expenses declared in that spreadsheet to date:
Pop that in cell A1 and it gives something like:
So now suppose I want to pull in data from the Guardian spreadsheets. I can use the name as a key and construct a URI that will query the Guardian spreadsheets by MP name. An easy way to gt an example query is to use the Guardian Datastore Explorer:
So having got a pattern for a URI:
we can look at it and see how it might be constructed in the spreadsheet:
That is, we construct the URI around the name contained in the cell a couple of columns to the left, (data that was itself pulled in from a Tory party spreadsheet).
(Note that I’m generating an HTML preview, but you should be able to grab CSV too: just set tqx=out:csv).
Drag the cell down to generate URIs for all the Shadow Cabinet:
We can now import data into the spreadsheet using that constructed URI:
Remember, I generated an HTML preview table URI. (If you generated a CSV URI, use =importdata(C2). Note I was using CSV originally, but then it seemed to stop working, so I switched to HTML).
Here’s what you get when you copy that (relative) formula for each MP:
Okay, so there’s an issue here – we’re only getting very other MP because the table headers are being pulled in (the same happens with CSV – I couldn’t find a way to get the importdata formula to suppress the header?). A clunky workaround you be to create another set of columns that pull in the other alternate rows (i.e. start pulling in data from row 3 rather than row 2).
You’ve also got the header crap everywhere, but I’m guessing we could probably write a query language call that would pull out the data from this spreadsheet and ignore the rows containing the column headings.
Anyway, so what?
So – I can create a data square in a Google Spreadsheet that pulls in and essentially combines data from multiple spreadsheet data source that I know and trust, as long as there is a common key query term (name in this case).
Prompted by a couple of comments from @Josswinn to be more transparent (?!;-), here’s a glimpse into how I set about learning how to do something.
It potentially won’t make a lot of sense, but it’s typical of the sort of process I go through when I hack something together…
So, there’s usually an initiator:
Then there’s a look to see if there’s anything to play with that is on my current list of things I thing I’d like to, (or need to;-), know more about:
Hmm, ok… Google spreadsheets. I’ve just learned about how to write queries against Google spreadsheets using the visualisation API query language, so can I push that another step forward…? What don’t I know how to do that could be useful, and that I could try to demo in an app using this resource?
How about this: a web page that lets me pull the result out for a searched for by name university.
Hmm, what would that mean? A demo around a single subject, maybe, or should the user be able to specify the subject? The subject areas are listed, but how do get those in to a form? Copy the list of tab names from the spreadsheet? Hmm… Or how about entering the name of a single university and displaying the results for that HEI in each of the categories. That would also require me to find out how many sheets there were in the spreadsheet, and then interrogate each one…
Okay, so it’d be nice to be able to search for the results of a given university in a given subject area, or maybe even compare the results of two universities in a given subject area?
So to do that, do I need to learn how to do something new? If not, there’s no point, it’s just makework.
Well, I don’t know how to grab a list of worksheet names from a Google spreadsheet, so that’d be something I could learn… So how to do that?
Well, the query language only seems to work within a sheet, but there is a Google spreadsheets API I think? Let’s have a look: Google Spreadsheets APIs and Tools: Reference Guide. F**k it, why haven’t I looked at this before…?!
[Go there now - go on... have a look...]
Blah, blah, blah – ah: Spreadsheets query parameters reference. Quick scan… hmm, nothing obvious there about getting a list of worksheets. How about further up the page…?
Ah: “The worksheets metafeed lists all the worksheets within the spreadsheet identified by the specified key, along with the URIs for list and cells feeds for each worksheet”
I have no idea what the “list and cells feeds” means, but I’m not interested in that; “lists all the worksheets within the spreadsheet identified by the specified key” is what I want. Okay, so where’s a URL pattern I can crib?
Scan down a bit looking for keywords “visibility” and “projection” (I’m guessing key is the spreadsheet key…). Okay, visibility public and projection basic just to check it works…
Okay, that works… No obvious way of getting the gid of the worksheet number though, unless maybe I count the items and number each one…? The order of worksheets in the feed looks to be the sheet order, so I just need to count them 0,1,2 etc from the top of the list to gd the worksheet gid. Ah, there could be an opportunity here to try out the YQL Execute in a pipe trick? After all, the demo for that was an indexer for feed items, and because the API is chucking out RSS I need to use something like a pipe anyway to get a JSON version I can pull into my web page.
Hmmm, what else is there on the docs page? “alt, start-index, max-results Supported in all feed types. ” I wonder? Does alt stand for alternative formats maybe? Let’s try adding ?&alt=json to the URL – it may work, or it may relate to something completely other…. [Success] heh heh :-) Okay, so that means I don’t need the pipe?
What else – anything that could be useful in the future? Hmm, seems like the Spreadsheets API actually supports queries too? So e.g. I can run a query to see if there is a sheet that contains “geo” maybe?
Okay – lots of other query stuff there; remember that for another day…
So: to recap, the above process took maybe 10-15 mins and went from:
- initiator: see a tweet;
- follow-up: look at the resource;
- question: is there something I could do with that data that I don’t know how to do?
- question refined: how about I pull out a list of the worksheets from the spreadsheet, and use that in e.g. a drop down box so students can choose a subject area from a list, then search for one, or compare two, HEIs in that subject area. I don’t know how to get the list, and I’m not sure about the best way of comparing two items, so I’ll probably learn something useful.
- solution finding: check out the Google spreadsheets API documentation; (If that had failed, I’d have done a blogsearch along the lines of ‘feed list worksheets google spreadsheet’
- plan: err, okay, the plan is a form that pulls a list of worksheets from the HEI spreadsheet via JSON, indexes each one to give me the worksheet gid number (this is a possibly flakey step? Could I index the spreadsheet by name?) then builds a query on that worksheet using an input from one or more text boxes containing the name of HEIs (or maybe a single text box with comma separated HEI names?)
Normally I’d have then spent up to an hour messing around with this (it is (working) lunchtime i.e. playtime after all), but instead I spent forty five mins writing this blog post… which means there is no demo…:-(
First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground (aka the Guardian Datastore Explorer)
So I had another bit of a tinker, and came up with some code that’s breaking all over the place, but I think there’s enough of a vision there to have something to say, so I’ll say it…
How’s about a generic query’n'viz tool for the Guardian datastore? My first (and maybe last) attempt at a back of an envelope, sometimes it works, sometimes it doesn’t, bare bones rapid prototype of just such a thing can be found here.
In my original post on Making it a Little Easier to Use Google Spreadsheets as a Database (Hopefully!), I sketched out a simple form for helping create calls to a Google spreadsheet using the Google visualisation API query language. I then extended this to try to make the query building a little more transparent: Last Night’s Update to the “Google Spreadsheets as a Database” Demo. Today’s step is to see how we can make it easier to pull in spreadsheets from the datastore collection as a whole.
So referring to the image below, if you select a spreadsheet from the drop down list and click preview, you should get a preview of the column headings from that spreadsheet:
(The new link is to an original Guardian blog post announcing or describing the data.)
The list items are pulled in from a tag on my delicious account, which actually bookmarks the original data blog posts. The URI for the spreadsheet is added to the end of the bookmark description, and keyed with a –:
ISSUE 1: Sometimes the spreadsheet doesn’t load… I don’t know if this is down to something I’m (not) doing or not (if you’ve seen this sort of error and know a cause/fix, please post a comment below).
I’ve found if you just keep canceling the alert and clicking “Preview” the file loads in the end…
Scroll down on the page, and you can now start to build a query:
(See Last Night’s Update to the “Google Spreadsheets as a Database” Demo for more on this.)
Another new feature is the ability to preview results using various chart types, rather than just use a data table:
(Oh yes – the “bookmark” link should also allow you to share the current view with other people. At least, it shares the spreadsheet ID and the query, but not the view type…)
I haven’t implemented chart labeling, or the ability to set what values are used for what bit of the chart, so chart compomnent default rules apply. By juggling the queries (including changing the order of columns that appear in the various text boxes), you can sometimes get a reasonable chart out.
Of course, you can always just grab the CSV URL and then visualise the data in something like Many Eyes Wikified.
The chart components I used are all taken from the Google Visualisation API, so they play nicely with the Google data source representation that holds the data values.
So, that’s where it’s at, and that’s probably me done with it now… (I think I can see what’s possible so my fun is done…) And if you haven’t got an inkling of what it is I think I can see, it’s this:
A front end to the Guardian data store that lets readers:
- select a data set from the datastore (and maybe get a chance to view the original story from the datablog; I guess this could be pulled in from the Guardian OpenPlatform API?)
- write queries on that dataset to generate particular views of the data;
- generate CSV and HTML preview view URLs for a particular query, so the data can be shared with other people (turning different views on subsets of the data into social objects);
- generate quick visualisation previews of different views of the data.
Nice to haves? Maybe links to stories that also mention the data, again via the OpenPlatform API? A collection of different bookmarks/views that use the same spreadsheet, so readers can share their own views of the data (the sort of social thing that Many Eyes Wikified offers). An opportunity to accept comments on a dataset? etc etc
All told, I reckon it’s taken less than 20 hours of solo effort (except for a bit of 3rd party bug spotting ;-), plus time to write the blog posts, to get this far (but the state of the code shows that: it’s truly scrappy). A fair amount of that time was spent learning how to do stuff and looking at exemplar code on Google AJAX APIs Code Playground. Of course, there are bugs’n'issues all over the place, but as people bring them to my attention, I tend to see if there’s a quick fix…
PS (I think) I’ve just noticed a Google data source wrapper for Pachube (Google Visualization API for Pachube history), which means that as well as pulling in Guardian datastore content from Google spreadsheets (as well as other publishers’ content on Google spreadsheets), this ‘interface’ could also be applied to Pachube data. (If you know of anyone else who exposes the Google visualisation/data source API, please post a link below.)
PPS search key: I also call this the Guardian Datastore Explorer…
Yesterday, the Guardian published a spreadsheet to their Data Store containing all the MPs’ expenses revelations to date in a spreadsheet form (“MPs’ expenses in the news: all the revelations, as a spreadsheet“)*.
So it struck me that I should be able to find a way of easily searching that data to find just those MPs who had, for example, been claiming for biscuits…
[If you don't want to read how it's done, cust straight to the MPs' Expenses Search to find who's been claiming for biscuits]
* I actually found the link to the story just now from a known item search on Google, site limited to the Guardian domain and restricted in time to the last 24 hours using on of the new Google search options (I remembered seeing the story on the Guardian website somewhere yesterday). Note to self: this is a really handy trick for searching over recent content on a particular site:-)
(To tidy those search results even more, and remove the RSS feed results, just add -inurl:feedarticle to the search terms… i.e. exclude results that have feedarticle in the URL.)
Anyway, the question was, how to search the data in the spreadsheet. Now I had a half memory from HTML Tables and the Data Web of Google releasing a query language that would allow you to query data in a “data table” object embedded in a web page – the Google QUery Language – which it turns out can be used to interrogate anything defined as a Google visualisation API data source…
…and it just so happens that Google spreadsheets are so defined: Using a Google Spreadsheet as a Data Source.
So this means that I should be able to use the Google visualisation API query language to run a query on a Google Spreadsheet; like the MPs’ expenses data spreadsheet; like asking it for who’s claimed for biscuits…
So here’s what I want to do:
1) create a data table that pulls data in from a Google spreadsheet;
2) actually, that’s not strictly true – I want to run a query on the spreadsheet that pulls in some of the data from the spreadsheet (in particular, just the rows that satisfy the query);
3) I want to display the results in a table using the visualisastion API libraries (so then I don’t have to write any code to display the data myself; and more than that, I don’t even need to understand how the data has been returned from the spreadsheet).
Okay – so the ambitious next step is to try to write a test query on the spreadsheet by trying to make sense of the Google documentation, which is never as helpful as it might be.
No joy, so in the end, I copied and pasted some example code from the closest working example to what I wanted from Google’s interactive AJAX APIs Playground – an example of just getting data into a web page from a spreadsheet using the Google visualisation API libraries:
Okay – so what this example does is run a query on a spreadsheet and plot the data as a map. Just seeing the code isn’t much help though – what libraries do I need to load to run it? So I exported the whole example into a standalone worked example, did a View Source, and copied the code wholesale.
Good, I now have a canned example that pulls in data from a spreadsheet. Next step – I want to display a data table, not a map.
Again, the API Playground comes in handy – check out the table example and see what bits of the code need changing:
Change the demo code so it displays the data from the example spreadsheet as a table rather than a map, and check it works. It does. Good… So now change the spreadsheet key and see if it works to display the expenses data. It does. Good again.
Okay, now I can start to write some test queries. The AJAX API playground provides a crib again, this time in the form of the Using the Query Language example:
(Hmmm… maybe I should have just worked from this example from the start? Ah well, never mind, note to self: teach the changes required from just this example next time…)
Now it’s fun time… writing the query, the query language documentation suggests only equivalence style relations are possible, but I want to use a conditions along the lines of “select * where M LIKE ‘%biscuits%’ – that is, give me [select] all the columns in a row [*] where [where] column M [M] contains [LIKE] the word ‘biscuits’ ['%biscuits%'].
Typing a suitably encoded a test query URL (there’s a tool to encode the query string on the query language documentation page) into the browser location bar didn’t work :-( BUT, it turned up an informative error message that described some phrases the query language does support, or at least, that are expected by the spreadsheet:
So let’s try contains rather than LIKE… which works…
Okay, so now the long and the short of it is, I know how to write queries.
So for example, here’s searching the name column (so you can search for your MP by name):
var query=’select * where A contains “‘+q+’”‘ (e.g. search for Huhne)
Here’s searching the constitutency column (so you can search or your MP by constituency):
var query=’select * where B contains “‘+q+’”‘ (e.g. Edinburgh)
And here’s searching several columns for a particular item:
var query=’select * where (M contains “‘+q+’” OR O contains “‘+q+’” OR Q contains “‘+q+’” OR S contains “‘+q+’” OR U contains “‘+q+’” OR V contains “‘+q+’”)’
Add it all together, and what have you got? A way of searching to see who’s been claiming for biscuits:
Note that searches are case sensitive…(anyone know if there’s a way round this?)
So there you have it: an MP’s expenses search engine via Google Spreadsheets :-)
A couple of days ago, the Guardian’s @datastore announced that a spreadsheet of UK MPs’ (Members of Parliament) expenses had been posted to the Guardian OpenPlatform datastore on Google Spreadsheets.
Just because, I though it would be nice to visualise the spreadsheet using some Many Eyes Wikified charts, so I had a look at the data, and sighed a little: in many of the spreadsheet cells was a pound sign, and Many Eyes doesn’t like those – it just wants numbers… So I went in to Yahoo pipes to create a pipe to tidy up the CSV output of the spreadsheet so I could pipe it into Many Eyes Wikified… and drew a blank: I couldn’t get the pipe to work (no CSV – just HTML (it turns out I was using the wrong URL pattern from the spreadsheet – doh!)). So I exported the CSV, reg-exped it in a text editor, adn uploaded it to create a new spreadsheet. (Which reminds me: note to self – create a tidy-upper pipe fed from the datastore and refactor the wikified data page to feed from the pipe…)
[Many Eyes Wikified is no longer available as a service - to replicate the following visulisations, you need to upload the data to Many Eyes (the none wikified version...). I think this is the spreadsheet I was pulling in to the Wikified service...]
So anyway, here are some interactive ways of visualising MPs’ expenses data using Many Eyes wikified…
Firstly, a bar char – select which expenses category you’d like to chart and then view the ranked distribution by sorting by values. If you mouse over any of the bars, you’ll see which MP made that claim:
Second up, a block histogram view. This chart is good for looking at the natural distribution of different claim categories. The search box makes it easy to search for your MP by name:
Again, mousing over any of the blocks identifies the name of the MP making that claim.
Thirdly, a scatter plot. This display lets you compare an MP’s claims across two categories, and potentially bring in a thrid category using the dot size:
As with the other visulisations, mouse over any point to see which MP it belongs to.
By the by, along the way I did a couple of other Yahoo pipes – one to extract expenses by MP name, (which simply pulls in CSV from the spreadsheet, then filters on an MP’s surname), the other MPs’ expenses by postcode. The latter pipe actually embeds the foemer, and works by looking up the name of the MP by postcode, using the TheyWorkForYou API; this name is then passed in to an embedded ‘expenses by name’ pipe.
Anyway, back to the viz biz: Charles Arthur generously picked up on my tweets announcing the visualisations with a blog post on the Guardian data blog (Visualising MP expenses) in a post that included the tease:
But what we need now is a dataset which shows constituency distances
from Westminster, so that we can compare that against travel. And perhaps someone else can work out the travelling MPs’ carbon footprints based on whether they went by air or rail or car
No fair… Okay – so where to get the location data for each MP. Well, the TheyWorkForYou API came to my rescue again. One call (to getConstituencies) pulled out details of each constituency, which included the lat/long geo-coordinated of the ‘centre’ of each constituency (along with the co-ordinates of the bounding box round each constituency… maybe I’ll use those another time ;-) A second call (to getMPs) pulled out all the MPs, and their constituencies. Loading both sets of data into different sheets on Dabble DB, meant I could then link them together by constituency name (for more on linking data in Dabble DB, see Mash/Combining Data from Three Separate Sources Using Dabble DB and Using Dabble DB in an Online Mashup Context).
Adding the MP data into Dabble DB after a further bit of cleaning – removing things like Mr, Miss, and Sir from the firstnames etc – and linking by MP name meant that I could now generate a single data view that exposed MPs by name, constituency, and expense claims, along with the geolocation of the midpoint of their constituency.
After grabbing the CSV feed out of this Dabble DB view into a pipe, and tidying up the data a little once again (eg removing commas in the formatted numbers), it was an easy matter to pull the JSON output from the pipe into a map, and plot different coloured markers depending what ‘band’ the MPs’ total expenses fell into. Here’s a snapshot of that first map:
All well and good – what’s nice about this view is that it’s quite easy to see which MPs appear to be claiming disproportionately more than other MPs with constituencies in a similar area. (There may be good reason for this, like, err… whatever. This tool is just a starting point for sensemaking round the data, right?!;-). If you click on one of the markers you can pop up a little info window, too (rather sparse in this first demo):
In that first map, the only expenses data I was exposing, and mapping, was the total travel expenses claimed. So over a coffee this afternoon, I created a richer view, and tweaked the map code to let me inspect a couple of other data sets. You can find the map here: MPs’ travel expenses map.
So for example, we can look at mileage claims:
Or the total expenses claimed for living away from the primary home:
One thing these quick to put together maps show is how powerful map based displays can be used to get a feel for local differences where there is a difference. (There may well be a good reason for this, of course; including errors in the data set being used…)
It’s also interesting to use the map based displays in conjunction with other chart based visualisations, such as the MPs’ expenses visualisations on Many Eyes Wikified, to explore the data in a far more natural way than trying to make sense of a spreadsheet containing the MPs’ expenses data.
PS the code is all as is; if it’s broken and the visualisations are consequently wrong/misleading, then I apologise in advance… ;-)
PPS See also: My Guardian OpenPlatform API’n’Data Hacks’n’Mashups Roundup, which describes 6 different recipes for playing with Guardian openplatform resources. And if you’re into F1, see Visualising Lap Time Data – Australian Grand Prix, 2009 ! ;-)
PPPS see also MPs’ Expenses by Constituency, Sort Of…, where I plot a couple of really colourful proportional symbol maps based on total travel expenses…
Over the last week or two, I’ve put together a handful of hacks demonstrating how to do various mashup related things using the Guardian open data on Google spreadsheets, and the open platform content API. So here’s a quick summary – think of it as a single post mashup uncourse… ;-)
Just as way of background, here’s how the Guardian Open Platform was announced (I didn’t get an invite ;-)
So here’s what I’ve been up to:
- Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs: using a Guardian data in a particular Google docs spreadsheet, generate a URL that emits CSV formatted data from a specified region of cells in the spreadsheet and consume the live CSV data feed in Many Eyes wikified to visualise the data.
Nuggets: how to generate the URL for the CSV output from a Google spreadsheet over a range of specified cells, and then consume that data in Many Eyes Wikified. Use the Many Eyes wikified data as the basis for several Many Eyes wikified visualisations.
- Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes: take a set of data regarding UK HE research funding from a Google spreadsheet via CSV into a Yahoo pipe, and then extract only those data rows that correspond to a particular result. Grab the CSV output from the pipe and pass it to Many Eyes wikified to visualise the research funding data for a single university.
Nuggets: how to use Yahoo pipes to filter CSV data to only pass through items from rows where the data in a particular column matches a given term.
- Creating Google Charts From CSV Data Inside a Yahoo Pipe: grab the CSV output from a Google spreadsheet into a Yahoo pipe, construct description elements for each feed item using data from several different columns in each CSV row, and then generate the URI for a call to the Google charts API using data from each row. By associated the chart URI with the appriopriate media group RSS attribute, a variety of widgets should be able to “play” the charts that are automatically generated from the spreadsheet data as a slideshow.
Nuggets: how to create Google Charts within a Yahoo pipe (using the Google Charts API) based on data pulled into the pipe from a CSV file generated from a Google spreadsheet.
- “JOIN”ing Data from the Guardian Data Store Student Satisfaction Data: pull data in from two different sheets on the same Google spreadsheet into Dabble DB, and generate another data table containing mashed together elements from the two spreadsheets. Display the result table via Many Eyes wikified.
Nuggets: how to use Dabble DB to “join” data from two separate tables that share a similar key column.
- Tinkering with the Guardian Content API – SerendipiTwitterNews: given a Twitter User id, pull the most recent tweets fro that user into a Yahoo pipe and run them though a term extractor; where terms are successfully extracted, use them as a query on the Guardian content API (either all articles, or just tech articles). The aim? Generate a news feed of items serendipitously related to your most recent tweets.
Nuggets: using the Yahoo term extractor on a Twitter feed to generate search queries over the content API; using API tags to narrow down the search to a particular topic area.
- Last Week’s Football Reports from the Guardian Content Store API (with a little dash of SPARQL): use a Yahoo pipe to pull football match reports “since last Saturday” from the content API, extract the name of the stadium each match was played in, and use as as the search term in a SPARQL query over a DBpedia page listing the locations of each English football stadium. Retrieve the lat/lon geo co-ordinates for each stadium from the SPARQL query, and associate them with the match reports. Plot the resulting feed of geo-annotated match reports on a Google map.
Nuggets: running a SPARQL query over DBPedia from a Yahoo pipe.
- Guardian Game Reviews – with Video Trailers: query the Guardian content API for game reviews from a Yahoo pipe, extract the title of each game and use it in another pipe that searches Youtube for a trailer (or review) of that game; create the URL of the actual Youtube movie file, and attach is as an enclosure URL to the output feed; view the result in a portable Grazr widget, which automatically embeds the Youtube videos in the widget.
Nuggets: using an embedded pipe block to search over Youtube; creating a feed that will auto render an embedded Youtube player when viewed in a Grazr widget.
So there you have it… my Guardian OpenPlatform hacks to date…