My tweetwatching has been a little sporadic of late, so I haven’t really been keeping up with the data sets that keep being posted to the Guardian Datablog, but today I had a quick skim through some of the recent uploads and had my eye caught by a post about funding of UK government quangos (Every quango in Britain [spreadsheet]).
What I’ll show in this post is how to construct a query on one of the quangos data sheets that can then be visualised as a change treemap, showing at a single glance how funding over all the quangos (arranged by sponsor department) has changed over the period 2006-2007.
The quangos spreadsheet is split into several different areas and at face value is quite hard to make sense of (what’s the difference in status (and provenance of the data) between the NHS and Health quangos, for example?
But I take nothing if not a pragmatic view about all this data stuff, and whilst there may be, err, issues with doing “proper” data journalism with this spreadsheet, I think we can still get value from just seeing what sorts of technology enhanced questions we might ask of this data, such as it is (as wll as identifying any problems with the data as it is presented), and as a result maybe identifying various issues with how to present and engage with this data were it to be republished again.
As ever, my doodles don’t properly acknowledge the provenance or source of the data, nor do I try to make any sense out of the data or look for any specific ‘meaning’ within it – I’m still at the stage of sharpening my technology pencil and seeing what sorts of marks it can make – but this is something I know I don’t do, and will be something I start to look at somewhen, honest!
So let’s make a start. To provide a bit of context, the questions I set out with when doodling through this post were:
1) is the data clean enough to run summarising queries on the data (that is, queries that summed totals for different departments)?
2) is the data clean enough to not break Many Eyes Wikified if i pass it to that visualisation tool via a CSV feed?
And a matter arising:
3) how do I write a query that specifies column headings (the headings in the spreadsheet leave something to be desired, at times….)?
The spreadsheet I chose to play with was the Westminster sheet, which you can find from here: UK Quangos [Guardian Datastore] (you’ll need to select the appropriate tab).
Just by looking at the data in the spreadsheet we notice a couple of things, things that suggest certain queries we might run on the data. (I probably need to unpack that phrase at some point (“run certain queries”) but the essence of it is this: if we treat the spreadsheet as a database, so sort of reprts can we generate from it; typically, in a database environment, reports are generated by running queries using some sort of database query language, which in the case of Google spreadsheets is the SQL like Google Visualisation API Query Language.)
So, the first thing I noticed are the two columns on the left – Government departments and presumably the quangos sponsored by those departments. And what these suggested to me were that I should be able to generate reports that summarise expenditure over all quangos in each department. (Whether or not this is interesting, I don’t know; but it’s something that we should be able to do easily enough, and it may spark off other questions in our mind).
The second thing I noticed was that lots of the data straddled two years (2006 and 2007)
And finally, gross expenditure seemed like a meaningful quantity and maybe least open to contention, so I decided to pick on that as the quantity to sharpen my pencil with:
To start playing with spreadsheets, I bookmarked it so that I could play with it in my Data Store explorer (note that I needed to specify the sheet number, where the first sheet is sheet 0, the second is sheet 1, and so on; so the Westminster sheet (the third sheet form the left in the spreadsheet) is sheet 2):
When we preview the column headings, (which the API assumes are in the top row, I think?), we get – not a lot…
If we scroll down in the data store explorer, we get at least the spreadsheet column labels:
Anyway, let’s try to run a query that summarises the overall gross expenditure for 2006 (column R) and 2007 (column S) for each department (column C):
The query is encoded as:
select C,sum(R),sum(S) group by C order by sum(R) desc
So we select three columns, and for each column, group the rows according to department (column C), display the summed value over those grouped rows for columns R and S, and presenting them in descending (desc) order by column sum(R):
If we click on the HTML preview link, we can view the table in its own page:
select C,sum(R),sum(S) group by C order by sum(R) desc
(A link to the CSV version is also generated.)
The data explorer doesn’t support forms for all the queries we can write yet, so the next step requires hacking the HTML table URI directly to add labels to the columns:
select C,sum(R),sum(S) group by C order by sum(R) desc label C ‘Dept’, sum(R) ‘2006 Gross Expenditure’, sum(S) ‘2007 Expenditure’
If you’re hacking the URI in a recent browser address/location bar, you don’t need to encode things like spaces as %20, or single quotes as %27, because the browser will take care of it for you:
If you then copy this URI and paste it back into the location bar, the encoded version will be generated for you, e.g. so that you can use it as a link in a web page:
So we have our basic summary report, and can now use the CSV output of the spreadsheet so that it can be transported elsewhere. There are two things we need to do now.
The first is to just change the output format of the data from an HTML table to CSV. Take this part of the URI:
and change it to this:
If you preview the CSV, you’ll notice there’s a problem with it though:
There are rogue commas everywhere, appearing within the ‘numerical’ data, and this file is supposed to use commas to separate out different elements. To get proper numbers out, we need to set their format, which means adding something to the end of the URI:
format sum(R) ‘#’,sum(S) ‘#’
(Not that you do need to encode the #s by hand, as %23)
That is, present sum(R) and sum(S) in a proper numerical format:
So there we have summary tables showing the expenditure for each government department. Many Eyes Wikified isn’t letting m import that data directly via CSV at the moment, but it’s easy enough to download the CSV and copy the data into a Many Eyes Wikified data page:
(Casting an eye over the CSV data, there are also a couple of glitches in it that upset the grouping – eg “Home” and “Home ” (trailing space) are treated differently.)
We can now use this page to create some visualisations – so put some placeholders in place in a visualisation page:
And then generate some visualisations…
One of the potentially most informative summary views for this data is a change treemap, that shows the relative gross expenditure for each department, along with whether it has increased or decreased between 2006 and 2007:
Blue is decrease from 2006 to 2007, orange is an increase.
The next step is now to create a change treemap for each quango within a department, but that’s a post for another day… [UPDATE: see it here – Visualising Where the Money Goes: Westminster Quangos, Part 2]