A couple of days ago, the Guardian’s @datastore announced that a spreadsheet of UK MPs’ (Members of Parliament) expenses had been posted to the Guardian OpenPlatform datastore on Google Spreadsheets.
Just because, I thought it would be nice to visualise the spreadsheet using some Many Eyes Wikified charts, so I had a look at the data, and sighed a little: in many of the spreadsheet cells was a pound sign, and Many Eyes doesn’t like those – it just wants numbers… So I went in to Yahoo Pipes to create a pipe to tidy up the CSV output of the spreadsheet so I could pipe it into Many Eyes Wikified… and drew a blank: I couldn’t get the pipe to work (no CSV – just HTML; it turns out I was using the wrong URL pattern from the spreadsheet – doh!). So I exported the CSV, reg-exped it in a text editor, and uploaded it to create a new spreadsheet. (Which reminds me: note to self – create a tidy-upper pipe fed from the datastore and refactor the wikified data page to feed from the pipe…)
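For what it's worth, the text-editor tidy-up could also be scripted. Here's a minimal sketch of the idea in Python, assuming the only clutter is a leading pound sign and thousands separators; the sample rows and column names are made up for illustration and are not taken from the Guardian spreadsheet.

```python
import csv
import io
import re

# A money-like cell: optional pound sign, digits with optional commas,
# optional decimal part. Anything else (names, parties) is left alone.
MONEY = re.compile(r"^\s*£?\s*(-?\d[\d,]*(?:\.\d+)?)\s*$")


def clean_cell(cell):
    """Return the bare number for money-like cells; other cells unchanged."""
    match = MONEY.match(cell)
    if match is None:
        return cell
    return match.group(1).replace(",", "")


def clean_csv(text):
    """Apply clean_cell to every cell of a CSV document."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([clean_cell(cell) for cell in row])
    return out.getvalue()


# Hypothetical sample data, not real claims.
sample = 'Surname,Total\nSmith,"£23,083"\nJones,£950\n'
print(clean_csv(sample))
```

The cleaned text can then be uploaded as a new spreadsheet, just as with the hand-edited version.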
[Many Eyes Wikified is no longer available as a service – to replicate the following visualisations, you need to upload the data to Many Eyes (the non-wikified version…). I think this is the spreadsheet I was pulling in to the Wikified service…]
So anyway, here are some interactive ways of visualising MPs’ expenses data using Many Eyes Wikified…
Firstly, a bar chart – select which expenses category you’d like to chart and then view the ranked distribution by sorting by values. If you mouse over any of the bars, you’ll see which MP made that claim:
Second up, a block histogram view. This chart is good for looking at the natural distribution of different claim categories. The search box makes it easy to search for your MP by name:
Again, mousing over any of the blocks identifies the name of the MP making that claim.
Thirdly, a scatter plot. This display lets you compare an MP’s claims across two categories, and potentially bring in a third category using the dot size:
As with the other visualisations, mouse over any point to see which MP it belongs to.
By the by, along the way I did a couple of other Yahoo pipes – one to extract expenses by MP name (which simply pulls in CSV from the spreadsheet, then filters on an MP’s surname), the other MPs’ expenses by postcode. The latter pipe actually embeds the former, and works by looking up the name of the MP by postcode, using the TheyWorkForYou API; this name is then passed in to an embedded ‘expenses by name’ pipe.
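The way the second pipe wraps the first can be sketched in a few lines of Python. This is only an illustration of the nesting, not the pipes themselves: the "Surname" column name is an assumption, and the postcode lookup is passed in as a function because the exact TheyWorkForYou call isn't spelled out here.

```python
import csv
import io


def expenses_by_surname(csv_text, surname):
    """First 'pipe': keep the rows whose surname matches (case-insensitive)."""
    wanted = surname.strip().lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    # "Surname" is a hypothetical column heading.
    return [row for row in reader if row["Surname"].strip().lower() == wanted]


def expenses_by_postcode(csv_text, postcode, lookup_surname):
    """Second 'pipe': resolve a postcode to an MP's surname, then reuse
    the first pipe. lookup_surname stands in for the API lookup."""
    return expenses_by_surname(csv_text, lookup_surname(postcode))


# Hypothetical data and a fake lookup, purely for demonstration.
data = "Surname,Total\nSmith,23083\nJones,950\n"
fake_lookup = lambda postcode: "Jones"
print(expenses_by_postcode(data, "AB1 2CD", fake_lookup))
```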
Anyway, back to the viz biz: Charles Arthur generously picked up on my tweets announcing the visualisations in a post on the Guardian data blog (Visualising MP expenses) that included the tease:
But what we need now is a dataset which shows constituency distances from Westminster, so that we can compare that against travel. And perhaps someone else can work out the travelling MPs’ carbon footprints based on whether they went by air or rail or car
No fair… Okay – so where to get the location data for each MP? Well, the TheyWorkForYou API came to my rescue again. One call (to getConstituencies) pulled out details of each constituency, which included the lat/long geo-coordinates of the ‘centre’ of each constituency (along with the co-ordinates of the bounding box round each constituency… maybe I’ll use those another time ;-) A second call (to getMPs) pulled out all the MPs, and their constituencies. Loading both sets of data into different sheets on Dabble DB meant I could then link them together by constituency name (for more on linking data in Dabble DB, see Mash/Combining Data from Three Separate Sources Using Dabble DB and Using Dabble DB in an Online Mashup Context).
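Given a lat/long for the 'centre' of each constituency, the "distance from Westminster" that the tease asks for is just a great-circle calculation. Here's a rough sketch using the haversine formula; the Westminster coordinates are my own approximate figure, and a straight-line distance is of course only a proxy for how far anyone actually travels.

```python
import math

# Approximate lat/long of the Palace of Westminster (an assumption,
# good enough for a rough comparison).
WESTMINSTER = (51.4995, -0.1248)
EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_from_westminster(lat, lon):
    """Distance in km from a constituency 'centre' to Westminster."""
    return haversine_km(WESTMINSTER, (lat, lon))


# Central Edinburgh, used here only as a sanity check.
print(round(distance_from_westminster(55.9533, -3.1883)))
```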
Adding the MP data into Dabble DB after a further bit of cleaning – removing things like Mr, Miss, and Sir from the firstnames etc – and linking by MP name meant that I could now generate a single data view that exposed MPs by name, constituency, and expense claims, along with the geolocation of the midpoint of their constituency.
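The clean-and-link step is the fiddly bit, so here's a sketch of it in Python rather than Dabble DB. The list of titles and the field names (first_name, last_name, constituency and so on) are assumptions for the sake of the example; real data will have more edge cases than this handles.

```python
import re

# One or more leading titles, each followed by whitespace.
# The set of titles is a guess, not an exhaustive list.
TITLES = re.compile(r"^(?:(?:Rt Hon|Mr|Mrs|Ms|Miss|Sir|Dr)\.?\s+)+", re.IGNORECASE)


def strip_titles(first_name):
    """Remove leading titles such as 'Mr' or 'Rt Hon Sir' from a first name."""
    return TITLES.sub("", first_name.strip())


def name_key(first_name, last_name):
    """Normalised (first, last) pair used to link the two data sets."""
    return (strip_titles(first_name).lower(), last_name.strip().lower())


def join_by_name(expenses, mps):
    """Link expenses rows to MP records that share a cleaned-up name."""
    mps_by_name = {name_key(m["first_name"], m["last_name"]): m for m in mps}
    joined = []
    for row in expenses:
        mp = mps_by_name.get(name_key(row["first_name"], row["last_name"]))
        if mp is not None:
            # Fields from the MP record win where both sides share a key.
            joined.append({**row, **mp})
    return joined


# Hypothetical records, for demonstration only.
expenses = [{"first_name": "Rt Hon Sir John", "last_name": "Smith", "total": "100"}]
mps = [{"first_name": "John", "last_name": "Smith",
        "constituency": "Anytown", "lat": 51.0, "lon": -1.0}]
print(join_by_name(expenses, mps))
```

Rows that fail to match are silently dropped in this sketch; in practice it would be worth listing them, since they are usually the ones with odd name formatting.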
After grabbing the CSV feed out of this Dabble DB view into a pipe, and tidying up the data a little once again (eg removing commas in the formatted numbers), it was an easy matter to pull the JSON output from the pipe into a map, and plot different coloured markers depending what ‘band’ the MPs’ total expenses fell into. Here’s a snapshot of that first map:
All well and good – what’s nice about this view is that it’s quite easy to see which MPs appear to be claiming disproportionately more than other MPs with constituencies in a similar area. (There may be good reason for this, like, err… whatever. This tool is just a starting point for sensemaking round the data, right?!;-). If you click on one of the markers you can pop up a little info window, too (rather sparse in this first demo):
In that first map, the only expenses data I was exposing, and mapping, was the total travel expenses claimed. So over a coffee this afternoon, I created a richer view, and tweaked the map code to let me inspect a couple of other data sets. You can find the map here: MPs’ travel expenses map.
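The coloured 'band' markers on the first map boil down to two small steps: turn a formatted number into a real one, then pick a colour by threshold. A sketch of that logic follows; the thresholds and colours are invented for illustration and are not the ones used on the map.

```python
def parse_amount(text):
    """Turn a formatted amount such as '12,345' or '£12,345' into a float."""
    return float(text.replace(",", "").replace("£", "").strip())


# (upper bound, colour) pairs, checked in order. Hypothetical values.
BANDS = [(5000, "green"), (10000, "yellow"), (20000, "orange")]


def marker_colour(total):
    """Pick a marker colour according to the band the total falls into."""
    for upper, colour in BANDS:
        if total < upper:
            return colour
    return "red"  # everything at or above the last threshold


print(marker_colour(parse_amount("12,345")))
```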
So for example, we can look at mileage claims:
Or the total expenses claimed for living away from the primary home:
One thing these quick to put together maps show is how powerful map based displays can be for getting a feel for local differences, where such differences exist. (There may well be a good reason for any difference, of course; including errors in the data set being used…)
It’s also interesting to use the map based displays in conjunction with other chart based visualisations, such as the MPs’ expenses visualisations on Many Eyes Wikified, to explore the data in a far more natural way than trying to make sense of a spreadsheet containing the MPs’ expenses data.
PS the code is all as is; if it’s broken and the visualisations are consequently wrong/misleading, then I apologise in advance… ;-)
PPS See also: My Guardian OpenPlatform API’n’Data Hacks’n’Mashups Roundup, which describes 6 different recipes for playing with Guardian openplatform resources. And if you’re into F1, see Visualising Lap Time Data – Australian Grand Prix, 2009 ! ;-)
PPPS see also MPs’ Expenses by Constituency, Sort Of…, where I plot a couple of really colourful proportional symbol maps based on total travel expenses…
Wouldn’t it be handy if we could treat all the public spreadsheets uploaded to Google docs as queryable tables in a database? Well, it appears that you can do so, at least at an individual spreadsheet level: Introducing the Google Visualization API.
Over the weekend, I started exploring the Google Visualisation API Query Language, which is reminiscent of SQL (if that means anything to you!). This language provides a way of interrogating a data source such as a public online Google spreadsheet and pulling back the results of the query as JSON, CSV, or an HTML table.
Got that? I’ll say it again: the Google Visualisation API Query Language lets you use a Google spreadsheet like a database (in certain respects, at least).
Queries are defined on a spreadsheet in the following way:
Although defined, by default, to return JSON data from a query, wrapped in a pre-defined (and fixed?) callback function (google.visualization.Query.setResponse()), it is also possible to display the results of a query as an HTML table (which is “useful”, as the documentation says, “for debugging”). The trick here is to add another argument to the URL: tqx=out:html, so for example a query would now be defined along the lines of:
Using the Guardian datastore’s MPs expenses spreadsheet 2007-8 as an example, we can write quite a wide variety of queries, which I’ll show below in their ‘HTML preview’ form.
(In a ‘real’ situation, you are more likely to retrieve the data as JSON and then process it as an object. Or, as I will also demonstrate, take the results of the query as CSV output (tqx=out:csv rather than tqx=out:html) and pull it directly into a service such as Many Eyes Wikified.)
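If you fetch the default JSON response outside a browser, the callback wrapper has to be peeled off before the payload can be parsed. Here's a minimal sketch, assuming the payload inside the wrapper is strict JSON, which may not hold for every response (date values in particular may be written in a JavaScript-ish form). The sample body is hand-written for illustration, not a captured response.

```python
import json

CALLBACK = "google.visualization.Query.setResponse"


def unwrap_response(body, callback=CALLBACK):
    """Strip the callback wrapper and parse what is inside it as JSON."""
    start = body.find(callback + "(")
    if start == -1:
        raise ValueError("callback wrapper not found")
    open_paren = start + len(callback)
    close_paren = body.rfind(")")
    if close_paren <= open_paren:
        raise ValueError("unbalanced callback wrapper")
    return json.loads(body[open_paren + 1:close_paren])


# Hand-written stand-in for a response body.
sample_body = 'google.visualization.Query.setResponse({"status": "ok", "table": {"rows": []}});'
print(unwrap_response(sample_body)["status"])
```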
The generic URL is of the form: http://spreadsheets.google.com/tq?tqx=out:html&tq=QUERY&key=phNtm3LmDZEObQ2itmSqHIA.
In the examples, I will just show the unencoded select statement, but the link will be the complete, well-formed link.
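For anyone who would rather not encode the links by hand, here's a small Python sketch that builds the well-formed URL from an unencoded select statement. The base URL and key are the ones quoted above; the function name is my own, and I'm assuming the endpoint behaves as described in this post.

```python
from urllib.parse import urlencode

BASE = "http://spreadsheets.google.com/tq"
KEY = "phNtm3LmDZEObQ2itmSqHIA"  # the MPs' expenses spreadsheet key used in this post


def query_url(select, key=KEY, out="html"):
    """Build a complete, URL-encoded query link.

    out is 'html' for the preview table or 'csv' for CSV output.
    """
    params = {"tqx": "out:" + out, "tq": select, "key": key}
    return BASE + "?" + urlencode(params)


print(query_url("select B,C,I where I=23083"))
print(query_url("select B,C,I where I=23083", out="csv"))
```

The encoding step matters because the select statements contain spaces, commas and comparison operators that are not safe to paste into a URL as they stand.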
So here we go:
- show everything – fetch the whole table: select * (in a lot of computer languages, ‘*’ often refers to ‘everything and anything’);
- just show some particular columns, but again for everyone: fetch just columns B (surname), C (first name) and I (total additional costs allowance): select B,C,I
- only show the names of people who have claimed the maximum additional costs allowance (£23,083): fetch just columns B, C and I where the value in column I is 23083: select B,C,I where I=23083 (column I is the additional costs allowance column);
- How many people did claim the maximum additional costs allowance? Select the people who claimed the maximum amount (23083) and count them: select count(I) where I=23083
- So which people did not claim the maximum additional costs allowance? Display the people who did not claim total additional allowances of 23083: select B,C,I where I!=23083 (using <> for ‘not equals’ also works); NB here’s a more refined take on that query: select B,C,I where (I!=23083 and I>=0) order by I
- search for the name, party (column D) and constituency (column E) of people whose first name contains Joan or is recorded as John (rather than “Mr John”, or “Rt Hon John”): select B,C,D,E where (C contains 'Joan' or C matches 'John')
- only show the people who have claimed less than £100,000 in total allowances : select * where F<100000
- what is the total amount of expenses claimed? Fetch the summed total of entries in column I (i.e. the total expenses claimed by everyone): select sum(I)
- So how many MPs are there? Count the number of rows in an arbitrary column: select count(I)
- Find the average amount claimed by the MPs: select sum(I)/count(I)
- Find out how much has been claimed by each party (column D): select D,sum(I) where I>=0 group by D (Setting I>=0 just ensures there is something in the column)
- For each party, find out how much (on average) each party member claims: select D,sum(I)/count(I) where I>=0 group by D
To create your own queries, just hack around the URIs.
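If the group by queries look a bit opaque, it may help to see roughly what they compute. The sketch below mimics select D,sum(I) where I>=0 group by D and the per-party average in plain Python, over a handful of made-up rows; it's an illustration of the logic, not of how the query engine works internally.

```python
from collections import defaultdict

# Made-up rows: D is the party column, I the additional costs allowance.
rows = [
    {"D": "Party A", "I": 23083},
    {"D": "Party A", "I": 10000},
    {"D": "Party B", "I": 5000},
    {"D": "Party B", "I": None},  # empty cell, dropped by the where clause
]


def has_value(row):
    """Rough equivalent of 'where I>=0': keep rows with a usable number."""
    return row["I"] is not None and row["I"] >= 0


def sum_by_party(rows):
    """Rough equivalent of: select D,sum(I) where I>=0 group by D"""
    totals = defaultdict(float)
    for row in filter(has_value, rows):
        totals[row["D"]] += row["I"]
    return dict(totals)


def mean_by_party(rows):
    """Rough equivalent of: select D,sum(I)/count(I) where I>=0 group by D"""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in filter(has_value, rows):
        totals[row["D"]] += row["I"]
        counts[row["D"]] += 1
    return {party: totals[party] / counts[party] for party in totals}


print(sum_by_party(rows))
print(mean_by_party(rows))
```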
Many Eyes Wikified is no more…
One other trick is to grab a CSV output, rather than an HTML output, and pull it into Many Eyes Wikified, and then visualise it within that environment – so we grab the data (in this case, using select D,sum(I) where I>=0 group by D, i.e. the total amount of additional costs allowance claims by party):
to give this:
and then visualise it in an appropriate way:
So to recap this final case, then, we are running a query on the original spreadsheet that calculates the total additional costs allowance claims per party, and emits the results as CSV. These results are imported into Many Eyes Wikified, and displayed therein.
Now I’m pretty sure that Many Eyes Wikified will continue (how often?) to synch data from a potentially changing data source, which means we should be able to use a similar approach to plot a running total of claims from the Shadow Cabinet Expenses spreadsheet…
…but, at the time of writing at least, it seems as if the publication/privacy settings on that spreadsheet are set such that access via the query language is denied…:-(
Anyway – that was a quick intro to the Google Visualisation API Query Language – so go play… ;-)
PS so what other spreadsheets might make for some interesting queries?
PPS @adrianshort has made a valuable point about how easy it is for a publisher to change the order of rows in a spreadsheet, and hence make a nonsense of your query. (Also, I think the approach I’m taking sort of assumes a simple, regular spreadsheet where row 1 is for headers, then the data, and ideally no other text e.g. in cells below the table describing the data in the table.) So always check… ;-)
PPPS If the first row in the table defines column headings, and there are intervening lines (maybe spaces) before the data starts, putting offset N (where N is a number) will skip that many rows before displaying the data.
Something else I noticed about the order by setting: this can be of the form order by COL asc (to sort in ascending order, which is the default) or order by COL desc (to sort in descending order).
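Since the clauses have to appear in a particular order (as far as I can tell from the documentation: select, where, group by, order by, limit, offset), a little helper that assembles them can save some trial and error. A sketch, with a function name and parameters of my own devising:

```python
def build_query(select="*", where=None, group_by=None,
                order_by=None, descending=False, limit=None, offset=None):
    """Assemble a query string, keeping the clauses in a fixed order."""
    parts = ["select " + select]
    if where:
        parts.append("where " + where)
    if group_by:
        parts.append("group by " + group_by)
    if order_by:
        parts.append("order by " + order_by + (" desc" if descending else " asc"))
    if limit is not None:
        parts.append("limit %d" % limit)
    if offset is not None:
        parts.append("offset %d" % offset)
    return " ".join(parts)


print(build_query("B,C,I", where="I>=0", order_by="I", descending=True, offset=2))
```

The resulting string still needs URL-encoding before it goes into the tq argument.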
It was nice to see a couple of people picking up on my post about using Google Spreadsheets as a database (e.g. Using Google Spreadsheets as a database (no, it really is very interesting, honest) over at the Online Journalism blog), but it struck me that the URL hacking involved might still deter some people.
(Btw, the only list of keywords I’ve found to date for the query language is on the official documentation pages, and even then it isn’t complete…)
So – I spent an hour or so last night putting together a first attempt at a form based interface for writing the queries and constructing the URLs.
The form starts with a slot for the key of the spreadsheet you want to query – clicking on the preview button should display the column headings:
This preview can be used to help you select the columns you want to interrogate or return in your query, counting left-to-right: A, B, C and so on.
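Counting columns by eye gets tedious on wide spreadsheets, so here's a small sketch that maps a list of previewed column headings onto their letters. The headings in the example are placeholders rather than the real ones.

```python
def column_letter(index):
    """Convert a 0-based column position to a spreadsheet letter (A, B, ... Z, AA)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index + 1
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def label_columns(headings):
    """Pair each heading with its column letter, counting left to right."""
    return {column_letter(i): heading for i, heading in enumerate(headings)}


# Placeholder headings, for illustration only.
print(label_columns(["Id", "Surname", "First name", "Party"]))
```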
Next up are some hints on example queries:
and then there is the query form itself:
I’ve made a start on separating out the different bits of query, but there’s clearly lots more that could be done. For example, an optional “order by” slot could be provided (with a post-qualifying asc or desc selection), or the select entry box could be turned into a multiple selection listbox displaying the column headers, (but I only gave myself an hour, right?;-) [Note to self: lots of handy functions here – Google Visualization API Reference]
Anyway, once you make the query, links to the URIs of the HTML preview and CSV versions of the query are automatically generated, and the HTML table of results is displayed:
The CSV URI can then be used to import the data into a Many Eyes Wikified data page, for example.
Anyway, hopefully this makes it a little easier to get people started with these queries. A good place to start looking for spreadsheets is on the Guardian DataBlog.
So I had another bit of a tinker, and came up with some code that’s breaking all over the place, but I think there’s enough of a vision there to have something to say, so I’ll say it…
How’s about a generic query’n’viz tool for the Guardian datastore? My first (and maybe last) attempt at a back of an envelope, sometimes it works, sometimes it doesn’t, bare bones rapid prototype of just such a thing can be found here.
In my original post on Making it a Little Easier to Use Google Spreadsheets as a Database (Hopefully!), I sketched out a simple form for helping create calls to a Google spreadsheet using the Google visualisation API query language. I then extended this to try to make the query building a little more transparent: Last Night’s Update to the “Google Spreadsheets as a Database” Demo. Today’s step is to see how we can make it easier to pull in spreadsheets from the datastore collection as a whole.
So referring to the image below, if you select a spreadsheet from the drop down list and click preview, you should get a preview of the column headings from that spreadsheet:
(The new link is to an original Guardian blog post announcing or describing the data.)
The list items are pulled in from a tag on my delicious account, which actually bookmarks the original data blog posts. The URI for the spreadsheet is added to the end of the bookmark description, and keyed with a –:
ISSUE 1: Sometimes the spreadsheet doesn’t load… I don’t know if this is down to something I’m (not) doing or not (if you’ve seen this sort of error and know a cause/fix, please post a comment below).
I’ve found if you just keep canceling the alert and clicking “Preview” the file loads in the end…
Scroll down on the page, and you can now start to build a query:
(See Last Night’s Update to the “Google Spreadsheets as a Database” Demo for more on this.)
Another new feature is the ability to preview results using various chart types, rather than just use a data table:
(Oh yes – the “bookmark” link should also allow you to share the current view with other people. At least, it shares the spreadsheet ID and the query, but not the view type…)
I haven’t implemented chart labeling, or the ability to set what values are used for what bit of the chart, so chart component default rules apply. By juggling the queries (including changing the order of columns that appear in the various text boxes), you can sometimes get a reasonable chart out.
Of course, you can always just grab the CSV URL and then visualise the data in something like Many Eyes Wikified.
The chart components I used are all taken from the Google Visualisation API, so they play nicely with the Google data source representation that holds the data values.
So, that’s where it’s at, and that’s probably me done with it now… (I think I can see what’s possible so my fun is done…) And if you haven’t got an inkling of what it is I think I can see, it’s this:
A front end to the Guardian data store that lets readers:
– select a data set from the datastore (and maybe get a chance to view the original story from the datablog; I guess this could be pulled in from the Guardian OpenPlatform API?)
– write queries on that dataset to generate particular views of the data;
– generate CSV and HTML preview view URLs for a particular query, so the data can be shared with other people (turning different views on subsets of the data into social objects);
– generate quick visualisation previews of different views of the data.
Nice to haves? Maybe links to stories that also mention the data, again via the OpenPlatform API? A collection of different bookmarks/views that use the same spreadsheet, so readers can share their own views of the data (the sort of social thing that Many Eyes Wikified offers). An opportunity to accept comments on a dataset? etc etc
All told, I reckon it’s taken less than 20 hours of solo effort (except for a bit of 3rd party bug spotting ;-), plus time to write the blog posts, to get this far (but the state of the code shows that: it’s truly scrappy). A fair amount of that time was spent learning how to do stuff and looking at exemplar code on Google AJAX APIs Code Playground. Of course, there are bugs’n’issues all over the place, but as people bring them to my attention, I tend to see if there’s a quick fix…
PS (I think) I’ve just noticed a Google data source wrapper for Pachube (Google Visualization API for Pachube history), which means that as well as pulling in Guardian datastore content from Google spreadsheets (as well as other publishers’ content on Google spreadsheets), this ‘interface’ could also be applied to Pachube data. (If you know of anyone else who exposes the Google visualisation/data source API, please post a link below.)
PPS search key: I also call this the Guardian Datastore Explorer…
Whilst listening to Radio 4’s Today programme this morning, I was pleasantly surprised to hear an interview with Hans Rosling about making stats’n’data relevant to Joe Public (you can find the interview, along with a video overview of the Gapminder software, here: Can statistics be beautiful?).
The last few weeks have seen the US Government getting into the data transparency business with the launch of data.gov whose purpose is “to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. The site offers access to a wide range of US Government datasets in a range of formats – XML, CSV, KML etc. (The site also gives links to widgets and other (parent?) sites that expose data.)
Providing URIs directly to CSV files, for example, means that it is trivial to pull the data into online spreadsheets/databases, such as Google spreadsheets, or Dabble DB, or visualisation tools such as Many Eyes Wikified; and for smaller files, Yahoo Pipes provides a way of converting CSV or XML files to JSON that can be easily pulled in to a web page.
Realising that there may be some business in public data, Microsoft, Amazon and Google have all been sniffing around this area too: for Microsoft, it’s the Open Government Data Initiative (OGDI), for Amazon, it’s big data via Public Datasets on AWS, and for Google… well, Google. They bought Rosling’s Trendalyzer, of course, and recently made a brief announcement about Public Data on Google, as well as Google Squared, which is still yet to be seen in public. With the publication of a Java support library for the Google Visualisation API open wire protocol/query language, you can see them trying to get their hooks into other people’s data. (The thing is, the query language is just so darned useful;-) Wolfram Alpha recently opened up their computational search over a wide range of curated data sets, and Yahoo? They’re trying to encourage people to make glue, I think, with YQL, YQL Execute and YQL Open Data Tables.
In the UK, we have the National Statistics website (I’m not even going to link to it, it’s that horrible..) as well as a scattered collection of resources as listed on the Rewired State: APIs wiki page; and, of course, the first steps of a news media curated datastore from the Guardian.
But maybe things are set to change? In a post on the Cabinet Office Digital Engagement blog, Information and how to make it useful, Richard (Stirling?) picks up on Recommendation 14 of the POIT (Power of Information Taskforce) Review Final Report, which states:
The government should ensure that public information data sets are easy to find and use. The government should create a place or places online where public information can be stored and maintained (a ‘repository‘) or its location and characteristics listed (an online catalogue). Prototypes should be running in 2009.
and proposes starting a conversation about “a UK version of data.gov”:
What characteristics would be most useful to you – feeds (ATOM or RSS) or bulk download by e.g. FTP, etc?
Should this be an index or a repository?
Should this serve particular types of data e.g. XML, JSON or RDF?
What examples should we be looking at (beyond data.gov e.g. http://ideas.welcomebackstage.com/data)?
Does this need it’s own domain, or should it sit on an existing supersite (e.g. http://direct.gov.uk)?
I posted my starter for 10 thoughts as a comment to that post (currently either spamtrapped, or laughed out of court), but there’s already some interesting discussion started there, as well as a thoughtful response on Steph Gray’s Helpful Technology blog (Cui bono? The problem with opening up data) which picks up on “some more fundamental problems than whether we publish the data in JSON or RSS” such as:
– Which data?
– Who decides whether to publish?
– Who benefits?
– Who pays?
– For how long?
My own stance is from a purely playful, and maybe even a little pragmatic, position: so what?
Someone like me… so a population of one, then… ;-)
So what do I know? I know how to cut and paste URLs in to things, and I know how to copy other peoples’ code and spot what bits I need to change so that it does “stuff with my stuff”.
I know that I can import CSV and Excel spreadsheets that are hosted online from their URL into Google spreadsheets, and from a URL as CSV into something like Dabble DB (which also lets me essentially merge data from two sources into a new data table). Yahoo Pipes also consumes CSV. I know that I can get CSV out of a Google spreadsheet or Dabble DB (or from a Yahoo pipe if CSV went in). I know that I can plot KML or geoRSS files on a Google map simply by pasting the URL into a Google map search box. I know I can get simple XML into a Google spreadsheet, and more general XML into a Yahoo Pipe. I know that YQL will also let me interrogate XML files and emit the results as XML or JSON. Pipes is good at emitting JSON too. (JSON is handy because you can pull it into a web page without requiring any help from a script running on a server.) I’ve recently discovered that the Google Visualisation API query language and open wire protocol lets me run queries on datastores that support it, such as Google spreadsheets and Pachube. I know that Many Eyes Wikified will ingest CSV and then allow me to easily create a set of interactive visualisations.
So what would I want from a UK version of data.gov, and why?
– CSV, XML and JSON output, with KML/GeoRSS where appropriate, keyed by a simple URI term;
– a sensible (i.e. a readable, hackable) URI pattern for extracting data: good examples are the BBC Programmes website and Google spreadsheets (e.g. where you can specify cell ranges);
– data available from a URI via an HTTP GET (not POST; GETable resources are easily pulled into other services, POST requested ones aren’t; don’t even think about SOAP;-);
– if possible, being able to query data or extract subsets of it: YQL and the Google Viz API query language show a possible way forward here. Supporting the Google open-wire protocol, or defining YQL open data tables for data sets brings the data into an environment where it can be interrogated or subsetted. (Pulling cell ranges from spreadsheets is only useful where the cells you want are contiguous.)
Although it pains me to suggest hooking into yet more of the Googleverse, a UK version of data.gov could do worse than support the Google visualization API open-wire protocol. Why? Well, for example, with only an hour or two’s coding, I was able to pull together a site that added a front end on to the Guardian datastore files on Google spreadsheets: First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground (Okay, okay, I know – it shows that I only spent a couple of hours on it… but it was enough to demonstrate a sort of working rapid prototype…;-)
As to whether the data is useful, or who’s going to use it, or why they’re going to use it, I don’t know: but I suspect that if it isn’t easy to use, then people won’t. If one of the aims of data.gov style approaches is to engage people in conversations with data, we need to make it easy for them. Essentially, we want people to engage in – not quite ‘enterprise mashups’, more civic mashups. I’m not sure who these people are likely to be – activists, policy wonks, journalists, concerned citizens, academics, students – but they’re probably not qualified statisticians with a blackbelt in R or SPSS.
So for example, even the Guardian datastore data is quite hard to play with for most people (it’s just a set of spreadsheets, right? So what can I actually do with them?). In contrast, the New York Times Visualization Lab folks have started looking at making it easier for readers to interrogate the data in a visual way with Many Eyes Wikified, which is one reason I started trying to think about what a query’n’visualisation API to the Guardian datastore might look like…
PS just in case the Linked Data folks feel left out, I still think RDF and semweb geekery is way too confusing for mortals. Things like SPARCool are starting to help, but IMHO it’s still way too syntactically quirky for a quick hit… SQL and SQL like languages are hard enough, especially when you bear in mind that most people don’t know (or care) that advanced search exists on web search engines, let alone what it does or how to use it.
PPS see also National Research Council Canada: Gateway to Scientific Data (via Lorcan Dempsey).
So it seems that the Numeracy in the Newsroom blog liked the ‘roll your Google Squared’ toy that I rediscovered in Is Google Squared Just a Neatly Packaged and Generalised =googlelookup Array?.
But how about if you wanted to roll your own square by pulling in data from data sources you had defined yourself, such as spreadsheets in the Guardian datastore?
Well, it occurred to me that it should be easy enough to generate queries onto those spreadsheets (e.g. using the link generators described in First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground et al.) and roll them into a square.
Okay – so I know I’ve lost you but I only have 10 mins to crank this post out, so I’ll walk you through a demo (ish).
Create a new Google spreadsheet and pull in some data from the Conservative Shadow Cabinet expenses spreadsheet, such as their total expenses declared in that spreadsheet to date:
Pop that in cell A1 and it gives something like:
So now suppose I want to pull in data from the Guardian spreadsheets. I can use the name as a key and construct a URI that will query the Guardian spreadsheets by MP name. An easy way to get an example query is to use the Guardian Datastore Explorer:
So having got a pattern for a URI:
we can look at it and see how it might be constructed in the spreadsheet:
That is, we construct the URI around the name contained in the cell a couple of columns to the left, (data that was itself pulled in from a Tory party spreadsheet).
(Note that I’m generating an HTML preview, but you should be able to grab CSV too: just set tqx=out:csv).
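The same URI-per-name trick can be sketched outside the spreadsheet too. The Python below is not the spreadsheet formula I used, just the same idea written out: I'm assuming the surname lives in column B of the MPs' expenses spreadsheet (as in the earlier query examples), and the names passed in are placeholders.

```python
from urllib.parse import urlencode

BASE = "http://spreadsheets.google.com/tq"
KEY = "phNtm3LmDZEObQ2itmSqHIA"  # MPs' expenses spreadsheet key from the earlier post


def uri_for_surname(surname, key=KEY, out="html"):
    """Build a query URI that selects the rows whose column B contains surname."""
    # Use double quotes around names that themselves contain an apostrophe.
    quote = '"' if "'" in surname else "'"
    query = "select * where B contains " + quote + surname + quote
    return BASE + "?" + urlencode({"tqx": "out:" + out, "tq": query, "key": key})


def uris_for_names(surnames, out="html"):
    """One URI per name, like dragging the formula down a column."""
    return [uri_for_surname(name, out=out) for name in surnames]


# Placeholder surnames, for illustration only.
for uri in uris_for_names(["Smith", "O'Brien"]):
    print(uri)
```

A surname match is a blunt instrument: two MPs with the same surname will both come back, so a real version would want to key on something more specific.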
Drag the cell down to generate URIs for all the Shadow Cabinet:
We can now import data into the spreadsheet using that constructed URI:
Remember, I generated an HTML preview table URI. (If you generated a CSV URI, use =importdata(C2). Note I was using CSV originally, but then it seemed to stop working, so I switched to HTML).
Here’s what you get when you copy that (relative) formula for each MP:
Okay, so there’s an issue here – we’re only getting every other MP because the table headers are being pulled in (the same happens with CSV – I couldn’t find a way to get the importdata formula to suppress the header?). A clunky workaround would be to create another set of columns that pull in the other alternate rows (i.e. start pulling in data from row 3 rather than row 2).
You’ve also got the header crap everywhere, but I’m guessing we could probably write a query language call that would pull out the data from this spreadsheet and ignore the rows containing the column headings.
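Outside the spreadsheet, the header clutter is easy enough to deal with. Here's a sketch that stitches several CSV results together and keeps the heading row only once; it assumes every result starts with the same heading row, and the sample data is invented.

```python
import csv
import io


def combine_results(csv_texts):
    """Merge several CSV results, keeping the shared header row once.

    Returns (header, rows). Assumes each result begins with a header row.
    """
    header = None
    combined = []
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue
        if header is None:
            header = rows[0]
        # Skip the first row of each result, plus any stray header repeats.
        combined.extend(row for row in rows[1:] if row != header)
    return header, combined


# Invented per-MP results, for illustration only.
first = "Surname,Total\nSmith,23083\n"
second = "Surname,Total\nJones,950\n"
print(combine_results([first, second]))
```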
Anyway, so what?
So – I can create a data square in a Google Spreadsheet that pulls in and essentially combines data from multiple spreadsheet data sources that I know and trust, as long as there is a common key query term (name in this case).
When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question as to whether this initiative would influence the attitude of the Office for National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).
In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian [i.e. Google’s Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”.
I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways. Publishing it via Google Spreadsheets certainly lowers quite a few barriers to entry from a technical perspective, which can make life easier for the statisticians and the visualisers. It can also reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy”, as well as providing access to data where it is difficult to access: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.
But if you don’t take care of the data you’re publishing, then even though there are friendly APIs to the data it doesn’t necessarily follow that the data will be useful.
As Steph Gray says in Cui bono? The problem with opening up data:
Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.
So with all that in mind, is the Guardian DataStore adding value to the data in the data store in an accessibility sense by reducing the need for data mungers to have to process the data so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?
As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:
Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skills related to the black arts of the data mungers, I think?!;-)
The URI for the metapage that hosts a link to the RAE/research data blog post is:
and links to the teaching related posts is:
Going back up the common path to http://www.guardian.co.uk/news/datablog+education/ we get…. a 404 :-(
Hmmm… So how come the datablog+education page doesn’t link down to the HE collection pages, as well as the schools data blog pages (e.g. these are both valid:
– http://www.guardian.co.uk/news/datablog+education/school-tables and
and might naturally be expected to be linked to from:
Looking back to the HE teaching related datasets, we see they are both listed on the http://www.guardian.co.uk/news/datablog+education/higher-education page. So might we then expect them to be ‘compatible’ datasets in some sense?
That is, do the HE data sets share common values, for instance in the way the HEIs are named?
If we generate a couple of queries on to the university satisfaction tables and the dropout tables (maybe trying to look for correlations between drop out rate and student satisfaction) by pulling the results from different queries on those tables in to a data grid within a Google spreadsheet (cf. the approach taken in Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)), what do we get?
Here’s a search for “Leeds”, for example:
One table contains items:
– Leeds Trinity & All Saints
– Leeds Met
and the other contains:
– Leeds College of Music
– The University of Leeds
– Leeds Metropolitan University
– Leeds Trinity and All Saints
So already, even with quite a young datastore, we have an issue with data quality. In Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman identifies “seven common data quality issues”, which include the related problems of too much data (i.e. multiple copies of the same data in different places – that is, redundancy) and data inconsistency across sources (not a problem the datastore suffers from – yet?), and poor data definition (p. 41 – preview available on Google Books?).
This latter issue, poor data definition, is evident in the naming of the HEIs above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using, for example, the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB) because the HEIs don’t have common names.
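To make the mismatch concrete, here is a minimal sketch using the names from the “Leeds” search above. It compares the two lists first on exact names, and then after a crude clean-up. The normalise function is a hypothetical helper of my own, not anything the datastore or DabbleDB provides, and it only papers over trivial differences such as “&” versus “and”.

```python
import re

# Names as they appear in the two "Leeds" search results quoted above.
table_one = ["Leeds Trinity & All Saints", "Leeds Met"]
table_two = [
    "Leeds College of Music",
    "The University of Leeds",
    "Leeds Metropolitan University",
    "Leeds Trinity and All Saints",
]


def normalise(name):
    """Hypothetical clean-up: lower-case, '&' -> 'and', strip punctuation and a leading 'the'."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    name = re.sub(r"^\s*the\s+", "", name)
    return " ".join(name.split())


# Joining on the raw names finds nothing in common.
exact = set(table_one) & set(table_two)
print(exact)  # set()

# After normalising, only the '&' vs 'and' variant lines up.
normalised = {normalise(n) for n in table_one} & {normalise(n) for n in table_two}
print(normalised)  # {'leeds trinity and all saints'}
```

Even with the clean-up, “Leeds Met” and “Leeds Metropolitan University” still fail to match, which is the sort of thing string tidying alone won’t fix.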
So what does Redman have to say about this (p. 55)?
– Find and fix errors
– Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…].
(See also Redman’s “Hierarchy of Data and Information Needs” (p. 58), which identifies the need for consistency across sources.)
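The “another unique identifier” idea could be sketched as a small alias table that maps each name variant on to a shared key, so rows from different tables can be matched on the key rather than the name. The identifiers here (hei-leeds-met and so on) are invented purely for illustration, as are the ALIASES table and the helper functions; a real scheme would want an agreed, published set of identifiers.

```python
# Hypothetical alias table: every known name variant maps to one made-up key.
ALIASES = {
    "Leeds Met": "hei-leeds-met",
    "Leeds Metropolitan University": "hei-leeds-met",
    "Leeds Trinity & All Saints": "hei-leeds-trinity",
    "Leeds Trinity and All Saints": "hei-leeds-trinity",
}


def key_for(name):
    """Return the shared key for a name variant, or None if it isn't listed."""
    return ALIASES.get(name)


def match_rows(names_a, names_b):
    """Pair up names from two tables that resolve to the same key."""
    by_key = {key_for(n): n for n in names_b if key_for(n)}
    return [(n, by_key[key_for(n)]) for n in names_a if key_for(n) in by_key]


pairs = match_rows(
    ["Leeds Trinity & All Saints", "Leeds Met"],
    [
        "Leeds College of Music",
        "The University of Leeds",
        "Leeds Metropolitan University",
        "Leeds Trinity and All Saints",
    ],
)
for left, right in pairs:
    print(left, "<->", right)
```

Names with no entry in the alias table (Leeds College of Music, for instance) simply drop out of the match, so the table itself would need tending as new variants turn up.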
Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)
So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?
PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.
PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:
Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.