OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Visualising Where the Money Goes: Westminster Quangos, Part 1

My tweetwatching has been a little sporadic of late, so I haven’t really been keeping up with the data sets that keep being posted to the Guardian Datablog, but today I had a quick skim through some of the recent uploads and had my eye caught by a post about funding of UK government quangos (Every quango in Britain [spreadsheet]).

What I’ll show in this post is how to construct a query on one of the quangos data sheets that can then be visualised as a change treemap, showing at a single glance how funding over all the quangos (arranged by sponsor department) has changed over the period 2006-2007.

The quangos spreadsheet is split into several different areas and at face value is quite hard to make sense of (what’s the difference in status (and provenance of the data) between the NHS and Health quangos, for example?).

But I am nothing if not pragmatic about all this data stuff, and whilst there may be, err, issues with doing “proper” data journalism with this spreadsheet, I think we can still get value from just seeing what sorts of technology enhanced questions we might ask of this data, such as it is (as well as identifying any problems with the data as it is presented), and as a result maybe identify various issues with how to present and engage with this data were it to be republished.

As ever, my doodles don’t properly acknowledge the provenance or source of the data, nor do I try to make any sense out of the data or look for any specific ‘meaning’ within it – I’m still at the stage of sharpening my technology pencil and seeing what sorts of marks it can make – but this is something I know I don’t do, and will be something I start to look at somewhen, honest!

So let’s make a start. To provide a bit of context, the questions I set out with when doodling through this post were:

1) is the data clean enough to run summarising queries on (that is, queries that sum totals for different departments)?
2) is the data clean enough not to break Many Eyes Wikified if I pass it to that visualisation tool via a CSV feed?

And a matter arising:

3) how do I write a query that specifies column headings (the headings in the spreadsheet leave something to be desired, at times….)?

The spreadsheet I chose to play with was the Westminster sheet, which you can find from here: UK Quangos [Guardian Datastore] (you’ll need to select the appropriate tab).

Just by looking at the data in the spreadsheet we notice a couple of things, things that suggest certain queries we might run on the data. (I probably need to unpack that phrase at some point (“run certain queries”), but the essence of it is this: if we treat the spreadsheet as a database, what sort of reports can we generate from it? Typically, in a database environment, reports are generated by running queries using some sort of database query language, which in the case of Google spreadsheets is the SQL-like Google Visualisation API Query Language.)
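As a rough sketch of the mechanics (the spreadsheet key and sheet number here are placeholders), a query is passed to a spreadsheet as the tq parameter of a Visualisation API URI along these lines:

http://spreadsheets.google.com/tq?tqx=out:html&key=SPREADSHEET_KEY&gid=SHEET_NUMBER&tq=ENCODED_QUERY

with tqx=out:html or tqx=out:csv selecting the output format; more on this below.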

So, the first thing I noticed was the two columns on the left – government departments and, presumably, the quangos sponsored by those departments. And what these suggested to me was that I should be able to generate reports that summarise expenditure over all the quangos in each department. (Whether or not this is interesting, I don’t know; but it’s something we should be able to do easily enough, and it may spark off other questions in our minds.)

The second thing I noticed was that lots of the data straddled two years (2006 and 2007).

And finally, gross expenditure seemed like a meaningful quantity and maybe least open to contention, so I decided to pick on that as the quantity to sharpen my pencil with:

To start playing with the data, I bookmarked the spreadsheet so that I could explore it in my Data Store explorer (note that I needed to specify the sheet number, where the first sheet is sheet 0, the second is sheet 1, and so on; so the Westminster sheet (the third sheet from the left in the spreadsheet) is sheet 2):

When we preview the column headings (which the API assumes are in the top row, I think?), we get – not a lot…

If we scroll down in the data store explorer, we get at least the spreadsheet column labels:

Anyway, let’s try to run a query that summarises the overall gross expenditure for 2006 (column R) and 2007 (column S) for each department (column C):

The query is encoded as:
select%20C%2Csum(R)%2Csum(S)%20group%20by%20C%20order%20by%20sum(R)%20desc

That is:
select C,sum(R),sum(S) group by C order by sum(R) desc

So we select three columns, group the rows according to department (column C), display the summed values over those grouped rows for columns R and S, and present the results in descending (desc) order of sum(R):

If we click on the HTML preview link, we can view the table in its own page:

select C,sum(R),sum(S) group by C order by sum(R) desc

(A link to the CSV version is also generated.)
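If you’d rather pull the CSV version of that report programmatically, something along these lines should do it (a minimal Python sketch; the spreadsheet key is a placeholder, and it assumes the /tq endpoint used above is still answering):

import csv
import io
import urllib.parse
import urllib.request

KEY = "SPREADSHEET_KEY"  # placeholder for the quangos spreadsheet key
QUERY = "select C,sum(R),sum(S) group by C order by sum(R) desc"

# Ask for CSV output (tqx=out:csv) from the Westminster sheet (gid=2)
url = ("http://spreadsheets.google.com/tq?tqx=out:csv&key=" + KEY +
       "&gid=2&tq=" + urllib.parse.quote(QUERY))

# Print each (department, summed 2006 expenditure, summed 2007 expenditure) row
with urllib.request.urlopen(url) as response:
    for row in csv.reader(io.TextIOWrapper(response, encoding="utf-8")):
        print(row)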

The data explorer doesn’t support forms for all the queries we can write yet, so the next step requires hacking the HTML table URI directly to add labels to the columns:

select C,sum(R),sum(S) group by C order by sum(R) desc label C 'Dept', sum(R) '2006 Gross Expenditure', sum(S) '2007 Expenditure'

If you’re hacking the URI in a recent browser address/location bar, you don’t need to encode things like spaces as %20, or single quotes as %27, because the browser will take care of it for you:

If you then copy this URI and paste it back into the location bar, the encoded version will be generated for you, e.g. so that you can use it as a link in a web page:
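If you don’t want to rely on the browser to do the encoding for you, Python’s urllib.parse.quote produces the same sort of percent-encoded string (a minimal sketch, using the labelled query above):

from urllib.parse import quote

labelled_query = ("select C,sum(R),sum(S) group by C order by sum(R) desc "
                  "label C 'Dept', sum(R) '2006 Gross Expenditure', "
                  "sum(S) '2007 Expenditure'")

# Spaces come out as %20, single quotes as %27, commas as %2C
print(quote(labelled_query))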

So we have our basic summary report, and can now use the CSV output of the spreadsheet so that it can be transported elsewhere. There are two things we need to do now.

The first is to just change the output format of the data from an HTML table to CSV. Take this part of the URI:

http://spreadsheets.google.com/tq?tqx=out:html&tq=select C

and change it to this:

http://spreadsheets.google.com/tq?tqx=out:csv&tq=select C

If you preview the CSV, you’ll notice there’s a problem with it though:

There are rogue commas everywhere, appearing within the ‘numerical’ data, and this file is supposed to use commas to separate out different elements. To get proper numbers out, we need to set their format, which means adding something to the end of the URI:

format sum(R) '#',sum(S) '#'

(Note that you do need to encode the #s by hand, as %23.)

That is, present sum(R) and sum(S) in a proper numerical format:
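Putting the pieces together, the full (unencoded) query now looks something like this:

select C,sum(R),sum(S) group by C order by sum(R) desc label C 'Dept', sum(R) '2006 Gross Expenditure', sum(S) '2007 Expenditure' format sum(R) '#',sum(S) '#'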

So there we have summary tables showing the expenditure for each government department. Many Eyes Wikified isn’t letting me import that data directly via CSV at the moment, but it’s easy enough to download the CSV and copy the data into a Many Eyes Wikified data page:


(Casting an eye over the CSV data, there are also a couple of glitches in it that upset the grouping – e.g. “Home” and “Home ” (trailing space) are treated differently.)
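If you do end up downloading the CSV anyway, one quick and dirty fix is to strip stray whitespace before copying the data across – a minimal Python sketch, assuming the report has been saved locally as quangos.csv (a made-up filename):

import csv

# Strip leading/trailing whitespace from every cell so that "Home" and
# "Home " (trailing space) end up in the same group
with open("quangos.csv", newline="") as infile:
    cleaned = [[cell.strip() for cell in row] for row in csv.reader(infile)]

with open("quangos_clean.csv", "w", newline="") as outfile:
    csv.writer(outfile).writerows(cleaned)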

We can now use this page to create some visualisations – so first we add some placeholders to a visualisation page:

And then generate some visualisations:

One of the potentially most informative summary views of this data is a change treemap, which shows the relative gross expenditure for each department, along with whether it has increased or decreased between 2006 and 2007:

Blue indicates a decrease from 2006 to 2007, orange an increase.

The next step is now to create a change treemap for each quango within a department, but that’s a post for another day… [UPDATE: see it here - Visualising Where the Money Goes: Westminster Quangos, Part 2]

Written by Tony Hirst

July 11, 2009 at 10:32 am

4 Responses


  1. First, let me say that I think you’ve been doing an excellent job demystifying a lot of these processes and I’m amazed how quickly you are able to put together the sort of illustrative material that makes a post like this much more comprehensible.

    You encouraged me to explore further, but I admit I’ve got to a point where it seems easier to fall back on familiar tools (pulling the data into a local spreadsheet or database before putting it back in manyeyes wiki). I’m interested to know if you think I’m using the pencil incorrectly, or trying to do something that really needs a paintbrush – if that’s not stretching the metaphor too far.

    I’m interested in the underlying data here, and it seemed to me that the example you showed wasn’t telling the story that a concerned public might want to know. Quango expenditure is not necessarily government expenditure – some quangos received little or no government funding. So I decided to see how easy it would be to visualise the government funding as opposed to the expenditure. While I was about it I thought I would fix the quirk that your example had whereby all the figures were a factor of 1000 smaller than they should be.

    So it looks simple to start with. There are two columns for funding per year, rather than one. Column L is direct govt funding in 2006, column N other govt funding. M and O cover 2007. So, the select clause changes from

    select C,sum(R),sum(S)

    to

    select C,(sum(L)+sum(N))*1000,(sum(M)+sum(O))*1000

    where I’ve also added the 1000 correction factor. Knowing SQL, I added some ‘AS’ qualifiers to those sums:

    select C,(sum(L)+sum(N))*1000 as fund2006,…

    only to discover that Google’s query language doesn’t support that. But then I saw your use of ‘label’. This is where it gets hard. ‘label’ requires you to quote the exact form that was used in the select clause:

    select C,(sum(L)+sum(N))*1000,(sum(M)+sum(O))*1000 group by C label C 'dept',(sum(L)+sum(N))*1000 'Funds 2006',…

    which is a bit messy. And would need to be repeated for the ‘format’ clause. And doesn’t even work, because Google’s API rewrites that formula as something like ‘product(sum direct gov’t funding £000 1000.0)’ (it’s much worse than that).

    And at that point I lost heart. Easier to grab the original spreadsheet, manipulate and clean the data locally, and then load the result into manyeyes – not that I’ve done it yet. But I can’t help feeling I’ve missed something. This looks so powerful, I ought to be able to do what I want. Shouldn’t I ?

    Kevin Ashley

    July 11, 2009 at 8:28 pm

  2. Hi Kevin
    Thanks for the comments :-)

    “I’m interested in the underlying data here, and it seemed to me that the example you showed wasn’t telling the story that a concerned public might want to know.”

    As I was commenting to @jtownend the other day, I’m not actually (at the moment) that interested in the semantics of the data at all – it’s just numbers. What motivates me (at the moment) is trying to find quick and easy ways of taking stuff out of spreadsheets and shoving it into things that generate pictures of the data, whether those visualisations are appropriate or not… Future steps for me are looking at ways of exploiting relationships between related data columns in the same (or different) spreadsheets and then maybe actually trying to tell interesting stories. But at the moment, I’m just interested in the pipework and wiring ;-)

    “But then I saw your use of ‘label’. This is where it gets hard”

    I have a workaround, which will be in part 2 of this post ;-)
    The workaround is to export the CSV of the table into another spreadsheet using the =importData (?) formula, and then process the data there. On my to do list is to look at the Google spreadsheets API to see whether I can take my data explorer rapid prototype and 1-click spawn a new spreadsheet based on the output of a query from a datastore spreadsheet, then feed this derived spreadsheet back into the data store ;-)
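    Something like this, I think, with the key and the encoded query filled in:

    =importData("http://spreadsheets.google.com/tq?tqx=out:csv&key=SPREADSHEET_KEY&gid=2&tq=ENCODED_QUERY")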

    I’ll try to write the recipe up in the next day or two

    tony

    Tony Hirst

    July 11, 2009 at 10:49 pm

  3. “As I was commenting to @jtownend the other day, I’m not actually (at the moment) that interested in the semantics of the data at all – it’s just numbers. What motivates me (at the moment) is trying to find quick and easy ways of taking stuff out of spreadsheets and shoving it into things that generate pictures of the data, whether those visualisations are appropriate or not…”

    That’s understood. Sorry if it didn’t come across that way. I’m approaching it from that direction as well, except I have this obsessive need to rationalise what I’m doing by appealing to the underlying data…

    But looking at it more abstractly, your demo showed what was possible given columns A, B and C where A was something we wanted to group on, and B and C were numerical measures of the same quantity at different times. And I then thought, “what if we’ve also got columns D and E, and I’m interested in comparing a functional combination of B and D with C and E” (in this case,
    (B+D)*1000 compared with (C+E)*1000.)

    Doing this with the data in my own spreadsheet, or with a database and SQL, is trivial. But doing it entirely on the web, as you are demonstrating, is far more interesting but also frustrating.

    Using an intermediate spreadsheet to do the calculations certainly looks like an answer. I suspect that Yahoo Pipes could also be used to automate some of the data cleaning problems you encountered, such as variations in trailing spaces or capitalisation affecting GROUP BY stuff. Wish I had time to explore that right now.

    I’ll be interested to see your follow up post!

    Kevin Ashley

    July 12, 2009 at 12:46 pm

  4. [...] a change treemap to summarise the changes in funding of quangos from government departments (Visualising Where the Money Goes: Westminster Quangos, Part 1), what I did originally was to visualise the change in funding for each quango within each [...]


Comments are closed.
