What Are JISC’s Funding Priorities?

I’ve just got back home from a rather wonderful week away at the JISC Developer Happiness Days (dev8D), getting a life (of a sort?!;-) so now it’s time to get back to the blog…

My head’s still full of things newly learned from the last few days, so while I digest it, here’s a quick taster of something I hope to dabble a little more with over the next week for the developer decathlon, along with the idea (which reminds me of my to do list…oops…)

A glimpse of shiny things to do with JISC project data (scraped from Ross’s Simal site… [updated simal url] (see also: Prod).

Firstly, a Many Eyes tag cloud showing staffing on projects by theme:

Secondly, a Many Eyes pie chart showing the relative number of projects by theme:

As ever, the data may not be that reliable/complete, because I believe it’s a best effort scrape of the JISC website. Now if only they made their data available in a nice way???;-)

Following a session in the “Dragon’s Den”, where I was told by Rachel Bruce that these charts might be used for good as well as, err, heckling, I guess, where Mark van Harmalen suggested that I should probably pay lip service to who potential users might be, and where Jim Downing suggested that I could do something similar for research council projects, I also started having a play with data pulled from the EPSRC website.

So for example, here’s a treemap showing current EPSRC Chemistry programme area grants >2M UKP by subprogramme area:

And if you were wondering who got the cash in the Chemistry area, here’s a bubble chart showing projects held by named PIs, along with their relative value:

If you try out the interactive visualisation on Many Eyes, you can hover over each person bubble to see what projects they hold and how much they’re worth:

PS thanks to Dave Flanders and all at JISC for putting the dev8D event on and managing to keep everything running so smoothly over the week:-) Happiness 11/10…

HEFCE Grant Funding, in Pictures

Somehow earlier today I managed to pop open a tab in my browser pointing to the HEFCE funding allocation spreadsheets for 2009/2010 (maybe from twitter? It’s been one of those days where losing track has been the norm!): HEFCE Core funding/operation (Allocation of funds, Recurrent grants for 2009-10).

So I thought, like you do, how much nicer it would have been if they’d published the data in a visualisation environment… So here’s the HEI data, republished in some Many Eyes Wikified pages:

And here are some sample interactive visualisations you can use to explore the data (click through to get to the actual interactive demo):

There’s a full list of demo thumbnails available on this Wikified page: HEFCE Viz Test.

Feel free to create your own pages/discussion around the charts (it is a wiki, after all).

In order to pull the data in to your own wiki page, use the following “data include” commands in your wiki page (one for each visualisation; the visualisation page name mustn’t contain any spaces (I think??)):

You’ll notice I was a little careless in naming the three data pages, which consequently have inconsistent URIs.

Enjoy! … and don’t forget, you can create your own wiki pages using the data, and add text/discussion into them too.

Does Funding Equal Happiness in Higher Education?

[The service used to create the visualisations mentioned in this post has been closed down. To search over recent (2013 intake) Guardian HE data, see this: Guardian Student Rankings]

With the announcement of the amount of funding awarded to UK Higher Education institutions from HEFCE, the government funding body, several people posted me links to the data wondering what I might do with it. I saw this as a good opportunity to do something I’ve been meaning to do for ages, specifically have another look at how to provide a view of a range of HE related datasets around particular institutions. So for example, if you ever wondered whether or not there is a relationship between the drop out rate from a university and a surveyed average teaching score, you should be able to look it up:

Since its launch, one of the more actively maintained areas of the Guardian datastore has been the education area. A quick skim over HE related data turns up at least the following:

In a follow-on post, I’ll show how to pull this data together, but for now, let’s look at some of the possibilities of pulling data in from these separate sheets around an HEI. As a proof of concept, I grabbed the following data and popped it into Many Eyes Wikified:

(I need to add provenance info to the wiki, but didn’t in this instance because I don’t want the data to be referenced or trusted… I still need to check everything is working properly, so I guess I should have used dummy HEI names… hmm…)

The data is pulled from four separate sheets and aggregated around HEI name. The “Percentage no longer in HE” comes from the datastore Dropout spreadsheet, the “Total staff earning ..” etc column is from the Pay spreadsheet, the “Average Teaching” to “Student to Staff Ratio” columns come from the 2009-10 university tables spreadsheet, and the “Teaching funding” to “Funding change” columns from the most recent (2010-11) funding spreadsheet.
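By way of illustration, here’s a minimal Python sketch of that “aggregate around HEI name” step – the institution names and column headings below are made up for the example, not the real datastore contents:

```python
import csv
import io

# Hypothetical extracts from two of the datastore sheets, keyed by HEI name.
dropout_csv = "Institution,Percentage no longer in HE\nPoppleton,9.5\nRummidge,4.2\n"
funding_csv = "Institution,Teaching funding\nPoppleton,25.1\nRummidge,60.3\n"

def rows_by_hei(raw):
    """Index a sheet's rows by institution name."""
    return {r["Institution"]: r for r in csv.DictReader(io.StringIO(raw))}

# Aggregate the columns from each sheet around the shared HEI name.
merged = {}
for sheet in (rows_by_hei(dropout_csv), rows_by_hei(funding_csv)):
    for name, row in sheet.items():
        merged.setdefault(name, {}).update(row)

print(merged["Poppleton"])
```

The catch, of course, is that this only works if every sheet spells the institution name the same way – of which more in a later post.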

I’ve posted a couple of interactive visualisations on to Many Eyes Wikified so you can have a play with the numbers (but don’t trust them or quote them unless you fact check them first…;-)

The first is a Scatter Chart, which gives us three dimensions to play with – x, y, and dot size.

So for example, in the chart shown above, we see that lower teaching scores seem to correlate with higher drop out rates. In the chart below, we see how teaching scores relate to the expenditure per student and the student staff ratio (and how expenditure per student and student staff ratio relate to each other):

Is satisfaction rewarded with funding, or is funding used to improve matters?

The other chart type I produced is a bar chart. These are less interesting, but heavily used…

I assume that university strategy and planning units worry over this sort of combined data all the time (but I’m not sure how they: obtain it, combine it, represent it, report it, use it? Maybe if an HE planner is reading they could post a comment or two to describe what data they use, how they use it and what they use it for…?;-) Anyway, it’s getting close to a stage now where the rest of us can play along too…;-)

Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs

Last week, I posted a quick demo of how to visualise data stored in a Google spreadsheet in Many Eyes Wikified (HEFCE Grant Funding, in Pictures).

The data I used was the latest batch of HEFCE teaching funding data, but Joss soon tweeted to say he’d got Research funding data up on Google spreadsheets, and could I do something with that? You can see the results here: Visualising UK HEI Research Funding data on Many Eyes Wikified (Joss has also had a go: RAE: UK research funding results visualised).

Anyway, today the Guardian announced a new content API (more on that later – authorised developer keys are still like gold dust), as well as the Guardian data store (strapline: “facts you can use”) and the associated Data Store Blog.

Interestingly, the data is being stored on Google docs, in part because Google spreadsheets offer an API and a wide variety of export formats.

As regular readers will know, one of the export formats from Google spreadsheets is CSV – Comma-Separated Values data – which just so happens to be liked by services such as Dabble DB and Many Eyes. I’ll try to come up with a demo of how to mash-up several different data sets in Dabble DB over the next few days, but as I’ve a spare half-hour now, I thought I’d post a quick demo of how to visualise some of the Guardian data store spreadsheet data in Many Eyes Wikified.

So to start, let’s look at the RAE2008 results data – University research department rankings (you can find the actual data here:

If you speak URL, you’ll know that you can get the CSV version of the data by adding &output=csv to the URL, like this:

Inspection of the CSV output suggests there’s some crap at the top we don’t want – i.e. not actual column headings – as well as at the end of the file:

(Note this “crap” is actually important metadata – it describes the data and its provenance – but it’s not the actual data we want to visualise).

Grabbing the actual data, without the metadata, can be achieved by grabbing a particular range of cells using the &range= URL argument. Inspection of the table suggests that meaningful data can be found in the columnar range of A to H; guesswork and a bit of binary search identifies the actual range of cell data as A2:H2365 – so we can export JUST the data, as CSV, using the URL
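If you’d rather let code build that sort of export URL for you, here’s a quick Python sketch (the spreadsheet key is a placeholder – substitute the key from the sheet’s own URL):

```python
from urllib.parse import urlencode

# Placeholder key - substitute the key from the Guardian spreadsheet's own URL.
key = "SPREADSHEET_KEY"

# output=csv asks for CSV rather than an HTML page;
# range= trims off the metadata rows above and below the data.
params = {"key": key, "output": "csv", "range": "A2:H2365"}
csv_url = "https://spreadsheets.google.com/pub?" + urlencode(params)
print(csv_url)
```

urlencode takes care of escaping the colon in the cell range for you, so the URL is safe to paste into Many Eyes or fetch with urllib.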

If you create a new page on Many Eyes Wikified, this data can be imported into a wiki page there as follows:

We can now use this data page as the basis of a set of Many Eyes visualisations. Noting that the “relative URL address” of the data page is ousefulTestboard/GuardianUKRAERankings2008, create a new page and put a visualisation placeholder or two in it:

Saving that page – and clicking through on the visualisation placeholder links – means you can now create your visualisation (Many Eyes seems to try to guess what visualisation you want if you use an appropriate visualisation name?):

Select the settings you want for your visualisation, and hit save:

A visualisation page will be created automatically, and a smaller, embedded version of the visualisation will appear in the wiki page:

If you visit the visualisation page – for example this Treemap visualisation, you should find it is fully interactive – which means you can explore the data for yourself, as I’ll show in a later post…

See more examples here: RAE 2008 Treemap; RAE 2008 Bubble Diagram; RAE 2008 bar chart.

My Guardian OpenPlatform API’n’Data Hacks’n’Mashups Roundup

Over the last week or two, I’ve put together a handful of hacks demonstrating how to do various mashup related things using the Guardian open data on Google spreadsheets, and the open platform content API. So here’s a quick summary – think of it as a single post mashup uncourse… ;-)

Just by way of background, here’s how the Guardian Open Platform was announced (I didn’t get an invite ;-)

So here’s what I’ve been up to:

  • Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs: using a Guardian data in a particular Google docs spreadsheet, generate a URL that emits CSV formatted data from a specified region of cells in the spreadsheet and consume the live CSV data feed in Many Eyes wikified to visualise the data.
    Nuggets: how to generate the URL for the CSV output from a Google spreadsheet over a range of specified cells, and then consume that data in Many Eyes Wikified. Use the Many Eyes wikified data as the basis for several Many Eyes wikified visualisations.
  • Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes: take a set of data regarding UK HE research funding from a Google spreadsheet via CSV into a Yahoo pipe, and then extract only those data rows that correspond to a particular result. Grab the CSV output from the pipe and pass it to Many Eyes wikified to visualise the research funding data for a single university.
    Nuggets: how to use Yahoo pipes to filter CSV data to only pass through items from rows where the data in a particular column matches a given term.
  • Creating Google Charts From CSV Data Inside a Yahoo Pipe: grab the CSV output from a Google spreadsheet into a Yahoo pipe, construct description elements for each feed item using data from several different columns in each CSV row, and then generate the URI for a call to the Google charts API using data from each row. By associating the chart URI with the appropriate media RSS group attribute, a variety of widgets should be able to “play” the charts that are automatically generated from the spreadsheet data as a slideshow.
    Nuggets: how to create Google Charts within a Yahoo pipe (using the Google Charts API) based on data pulled into the pipe from a CSV file generated from a Google spreadsheet.
  • “JOIN”ing Data from the Guardian Data Store Student Satisfaction Data: pull data in from two different sheets on the same Google spreadsheet into Dabble DB, and generate another data table containing mashed together elements from the two spreadsheets. Display the result table via Many Eyes wikified.
    Nuggets: how to use Dabble DB to “join” data from two separate tables that share a similar key column.
  • Tinkering with the Guardian Content API – SerendipiTwitterNews: given a Twitter User id, pull the most recent tweets from that user into a Yahoo pipe and run them through a term extractor; where terms are successfully extracted, use them as a query on the Guardian content API (either all articles, or just tech articles). The aim? Generate a news feed of items serendipitously related to your most recent tweets.
    Nuggets: using the Yahoo term extractor on a Twitter feed to generate search queries over the content API; using API tags to narrow down the search to a particular topic area.
  • Last Week’s Football Reports from the Guardian Content Store API (with a little dash of SPARQL): use a Yahoo pipe to pull football match reports “since last Saturday” from the content API, extract the name of the stadium each match was played in, and use it as the search term in a SPARQL query over a DBpedia page listing the locations of each English football stadium. Retrieve the lat/lon geo co-ordinates for each stadium from the SPARQL query, and associate them with the match reports. Plot the resulting feed of geo-annotated match reports on a Google map.
    Nuggets: running a SPARQL query over DBPedia from a Yahoo pipe.
  • Guardian Game Reviews – with Video Trailers: query the Guardian content API for game reviews from a Yahoo pipe, extract the title of each game and use it in another pipe that searches Youtube for a trailer (or review) of that game; create the URL of the actual Youtube movie file, and attach it as an enclosure URL to the output feed; view the result in a portable Grazr widget, which automatically embeds the Youtube videos in the widget.
    Nuggets: using an embedded pipe block to search over Youtube; creating a feed that will auto render an embedded Youtube player when viewed in a Grazr widget.
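As a flavour of the Google Charts trick in the list above, here’s a rough Python sketch of building an image Charts API pie chart URL from a single row of data (the labels and values are made up for the example):

```python
from urllib.parse import quote

def chart_url(labels, values, size="400x200"):
    """Build a Google image Charts API pie-chart URL from one row of data."""
    return ("http://chart.apis.google.com/chart?cht=p"   # cht=p : pie chart
            + "&chs=" + size                             # chart size in pixels
            + "&chd=t:" + ",".join(str(v) for v in values)  # text-encoded data
            + "&chl=" + "|".join(quote(l) for l in labels)) # slice labels

url = chart_url(["Teaching", "Research"], [25.1, 60.3])
print(url)
```

In the pipe, a URL like this gets attached to each feed item; any widget that understands media RSS can then render the chart image.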

So there you have it… my Guardian OpenPlatform hacks to date…

A Fair Use, or Not…? Visualising the THES UK Higher Education Pay Survey Data

Last week, Alan tweeted a challenge of sorts about me doing something to the academics’ pay data referred to in the THES article Pay packets of excellence. The data (Vice Chancellors’ pay in UK HEIs, and academics’ pay across UK HEIs) was published via two separate PDF documents, and “compiled and audited by Grant Thornton on behalf of Times Higher Education”.

The THES provided some analysis and interpretation of the data, and the survey was picked up by other media (e.g. here’s the Guardian’s take: Vice-chancellors’ salaries on a par with prime minister; the Telegraph said: Anger as university bosses claim £200,000 salaries; the Times: Campus fury at vice-chancellors’ windfalls; the Press Association: University chiefs pocket wage rise; and so on).

So partly to give Martin something concrete to think about in the context of Should universities break copyright law? and Universities as copyright warriors, is my republishing of the data contained in the two PDF documents on the THES website alongside the Pay packets of excellence article as a spreadsheet on Google spreadsheets a fair thing to do? (I didn’t notice any explicit license terms?)

(The data is here: UK HE VCs’ and Academics’ pay.)

Now why would I want to publish the data? Well, as it stands, the data as published in the PDF documents is all very well, but… what can you do with it? How useful is it to the reader? And what did the THES intend by publishing the data in the PDFs?

That readers could check the claims made in the article is one possibility; that other media channels could draw their own conclusions from the results and then cite the THES is another (“link bait”;-). But is there any implication or not that readers could take the data as data and manipulate, visualise it, and so on? If there is, is there any implication or expectation that journalists et al. might take the data into a private spreadsheet, maybe, manipulate it, understand it, and then publish their interpretation? Might there be a reasonable expectation that someone would republish the data as data so that people without the skills to take the data out of the PDF and put it into a spreadsheet could benefit from it being represented in that way?

As well as publishing the data via a Google spreadsheet, I also republished via two Many Eyes Wikified data pages: UK HE Vice Chancellors’ Salaries: Many Eyes wikified data page and UK HE Academic Salaries: Many Eyes wikified data page. So was this a fair thing to do, in any reasonable sense of the word?

And then of course, I did a few visualisations: UK HE Vice Chancellors’ Salaries: Many Eyes wikified visualisations page and UK HE Academic Salaries: Many Eyes wikified visualisations page.

So by making the data available, it means I can create visual interpretations of the data. Is this a fair thing to do with the data or not? If the data was published with the intention that other people publish their interpretations of it, does a visual interpretation count? And if so, what’s a fair way of creating that “data as data”? By publishing the data used to generate the visualisation in the spreadsheet, people can check the actual data that is feeding the visualisation, and then check that it’s the same as the THES data.

Finally, each Many Eyes visualisation is itself interactive. That is, a user can change the dimensions plotted in the chart and try to interpret (or make sense of) the data themselves in a visual way.

So is that a fair thing to do with data? Using it to underwrite the behaviour of a set of visualisations that a user can interact with and change themselves?

So here’s where we’re at: the THES published the data in a “closed” format – a PDF document. One of the features of the PDF is that the presentation of the document is locked down – it should always look the same.

By republishing the data as data in a public Google document, then other people can change how that data is presented. They can also use that data as the basis of a visualisation. Is there any difference between an IT literate journalist putting the data into a private spreadsheet and then publishing a visualisation of that data, and someone republishing the data so that anyone can visualise it?

Now let’s consider the Many Eyes visualisations. Suppose it is a fair use of the data to somehow use it to create a visualisation, and then publish that visualisation as a static image. Presumably someone will have checked that the graphic is itself fair, and is not misrepresenting the data. That is, the data has not been misused or misapplied – it has been used in a responsible way and visualised appropriately.

But now suppose that Jo Public can start to play with the visualisation (because it is presented in an interactive way) and maybe configure the chart so that a nonsensical or misleading visualisation is produced, with the result that the person comes away from the data claiming something that is not true (for example, because they have misunderstood that the chart they have created does not show what they maybe intended it to show, or what they think it shows?). That person might now claim (incorrectly) that the THES data shows something that it does not – and they have a graphic to “prove” it.

This is where the educator thing comes in to play – I would probably want to argue that by republishing the data both as data and via interactive visualisations, I am providing an opportunity for people to engage with and interpret the data that the THES published.

If the THES published the data because they wanted people to be able to write about their own analysis of the data, I have just made that easier to do. I have “amplified” the intent of the THES. However, if the THES only published the data to back up the claims they made in their article, then what I have done may not be fair?

So, what do you think?

Visualising Where the Money Goes: Westminster Quangos, Part 1

My tweetwatching has been a little sporadic of late, so I haven’t really been keeping up with the data sets that keep being posted to the Guardian Datablog, but today I had a quick skim through some of the recent uploads and had my eye caught by a post about funding of UK government quangos (Every quango in Britain [spreadsheet]).

What I’ll show in this post is how to construct a query on one of the quangos data sheets that can then be visualised as a change treemap, showing at a single glance how funding over all the quangos (arranged by sponsor department) has changed over the period 2006-2007.

The quangos spreadsheet is split into several different areas and at face value is quite hard to make sense of (what’s the difference in status (and provenance of the data) between the NHS and Health quangos, for example?)

But I take nothing if not a pragmatic view about all this data stuff, and whilst there may be, err, issues with doing “proper” data journalism with this spreadsheet, I think we can still get value from just seeing what sorts of technology enhanced questions we might ask of this data, such as it is (as well as identifying any problems with the data as it is presented), and as a result maybe identifying various issues with how to present and engage with this data were it to be republished again.

As ever, my doodles don’t properly acknowledge the provenance or source of the data, nor do I try to make any sense out of the data or look for any specific ‘meaning’ within it – I’m still at the stage of sharpening my technology pencil and seeing what sorts of marks it can make – but this is something I know I don’t do, and will be something I start to look at somewhen, honest!

So let’s make a start. To provide a bit of context, the questions I set out with when doodling through this post were:

1) is the data clean enough to run summarising queries on (that is, queries that sum totals for different departments)?
2) is the data clean enough to not break Many Eyes Wikified if I pass it to that visualisation tool via a CSV feed?

And a matter arising:

3) how do I write a query that specifies column headings (the headings in the spreadsheet leave something to be desired, at times….)?

The spreadsheet I chose to play with was the Westminster sheet, which you can find from here: UK Quangos [Guardian Datastore] (you’ll need to select the appropriate tab).

Just by looking at the data in the spreadsheet we notice a couple of things, things that suggest certain queries we might run on the data. (I probably need to unpack that phrase at some point (“run certain queries”) but the essence of it is this: if we treat the spreadsheet as a database, what sorts of reports can we generate from it? Typically, in a database environment, reports are generated by running queries using some sort of database query language, which in the case of Google spreadsheets is the SQL-like Google Visualisation API Query Language.)

So, the first thing I noticed are the two columns on the left – Government departments and presumably the quangos sponsored by those departments. And what these suggested to me was that I should be able to generate reports that summarise expenditure over all quangos in each department. (Whether or not this is interesting, I don’t know; but it’s something that we should be able to do easily enough, and it may spark off other questions in our mind.)

The second thing I noticed was that lots of the data straddled two years (2006 and 2007).

And finally, gross expenditure seemed like a meaningful quantity and maybe least open to contention, so I decided to pick on that as the quantity to sharpen my pencil with:

To start playing with the spreadsheet, I bookmarked it so that I could play with it in my Data Store explorer (note that I needed to specify the sheet number, where the first sheet is sheet 0, the second is sheet 1, and so on; so the Westminster sheet (the third sheet from the left in the spreadsheet) is sheet 2):

When we preview the column headings (which the API assumes are in the top row, I think?), we get – not a lot…

If we scroll down in the data store explorer, we get at least the spreadsheet column labels:

Anyway, let’s try to run a query that summarises the overall gross expenditure for 2006 (column R) and 2007 (column S) for each department (column C):

The query is encoded as:

That is:
select C,sum(R),sum(S) group by C order by sum(R) desc

So we select three columns; group the rows according to department (column C); display the summed value over those grouped rows for columns R and S; and present the results in descending (desc) order of sum(R):

If we click on the HTML preview link, we can view the table in its own page:

select C,sum(R),sum(S) group by C order by sum(R) desc

(A link to the CSV version is also generated.)

The data explorer doesn’t support forms for all the queries we can write yet, so the next step requires hacking the HTML table URI directly to add labels to the columns:

select C,sum(R),sum(S) group by C order by sum(R) desc label C ‘Dept’, sum(R) ‘2006 Gross Expenditure’, sum(S) ‘2007 Expenditure’

If you’re hacking the URI in a recent browser address/location bar, you don’t need to encode things like spaces as %20, or single quotes as %27, because the browser will take care of it for you:

If you then copy this URI and paste it back into the location bar, the encoded version will be generated for you, e.g. so that you can use it as a link in a web page:

So we have our basic summary report, and can now use the CSV output of the spreadsheet so that it can be transported elsewhere. There are two things we need to do now.

The first is to just change the output format of the data from an HTML table to CSV: take the tqx=out:html part of the URI and change it to tqx=out:csv.

If you preview the CSV, you’ll notice there’s a problem with it though:

There are rogue commas everywhere, appearing within the ‘numerical’ data, and this file is supposed to use commas to separate out different elements. To get proper numbers out, we need to set their format, which means adding something to the end of the URI:

format sum(R) ‘#’,sum(S) ‘#’

(Note that you do need to encode the #s by hand, as %23.)

That is, present sum(R) and sum(S) in a proper numerical format:
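For what it’s worth, you can let Python do the escaping for you. Here’s a quick sketch that encodes the full query built up in this post (if I remember right, the Viz API expects the encoded query in the tq URL argument):

```python
from urllib.parse import quote

# The full query built up in this post: summary, labels and numeric formats.
query = ("select C,sum(R),sum(S) group by C order by sum(R) desc "
         "label C 'Dept', sum(R) '2006 Gross Expenditure', "
         "sum(S) '2007 Expenditure' "
         "format sum(R) '#',sum(S) '#'")

# safe="" means everything gets escaped: spaces to %20, quotes to %27, #s to %23.
encoded = quote(query, safe="")
print(encoded)
```

This saves you copy-and-pasting the URI back into the location bar just to get the escaped version.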

So there we have summary tables showing the expenditure for each government department. Many Eyes Wikified isn’t letting me import that data directly via CSV at the moment, but it’s easy enough to download the CSV and copy the data into a Many Eyes Wikified data page:

That is:

(Casting an eye over the CSV data, there are also a couple of glitches in it that upset the grouping – e.g. “Home” and “Home ” (trailing space) are treated differently.)
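That sort of glitch is easy to guard against if you’re cleaning the data yourself before grouping – a quick Python sketch, on made-up numbers rather than the real quango data:

```python
from collections import defaultdict

# Rows as (department, expenditure) pairs; note the stray trailing space.
rows = [("Home", 100.0), ("Home ", 25.0), ("Health", 40.0)]

totals = defaultdict(float)
for dept, spend in rows:
    # Strip stray whitespace so "Home" and "Home " group together.
    totals[dept.strip()] += spend

print(dict(totals))
```

Without the strip(), “Home” and “Home ” would appear as two separate departments in the summary.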

We can now use this page to create some visualisations – so put some placeholders in place in a visualisation page:

And then generate some visualisations:

One of the potentially most informative summary views for this data is a change treemap, that shows the relative gross expenditure for each department, along with whether it has increased or decreased between 2006 and 2007:

Blue is decrease from 2006 to 2007, orange is an increase.

The next step is now to create a change treemap for each quango within a department, but that’s a post for another day… [UPDATE: see it here – Visualising Where the Money Goes: Westminster Quangos, Part 2]

Visualising Where the Money Goes: Westminster Quangos, Part 2

One of the things I try to impress on folk whenever I do a talk about web stats is how showing charts of mean (that is, simple average) numbers of visitors or time on site is all but useless, because the actual distribution of values is not likely to be normal, and so the simple averages reported are, to all intents and purposes, not good for anything.

So when putting together a change treemap to summarise the changes in funding of quangos from government departments (Visualising Where the Money Goes: Westminster Quangos, Part 1), what I did originally was to visualise the change in funding for each quango within each department, rather than just displaying the following simpler, overall funding by government department treemap that I posted about previously:

The reason being? The overall change treemap might show increases or decreases in expenditure for a department as a whole, but it doesn’t reveal whether the funded quangos were all treated equally, or whether lots of small quangos received a cut whilst one big one received a huge rise in funding, for example:

So how did I create this treemap? The simplest way is just to create a query on the original spreadsheet that pulls in 4 columns – department, quango, and two expenditure columns (one for 2006 and one for 2007). A query built around this kernel, in fact:

select C,D,R,S order by C,R desc

(To see how such a query is put together, see Part 1 of this exercise.)

To generate a query that only displays quangos that had a non-zero expenditure in both 2006 and 2007, just add:

where (R>0 AND S>0) after the select statement and before order by.
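Here’s a little Python sketch of what that filter is doing, along with the year-on-year change a change treemap needs – the figures are made up, not the actual quango data:

```python
import csv
import io

# A hypothetical snippet of the query's CSV output: dept, quango, 2006, 2007.
raw = ("Dept,Quango,2006,2007\n"
       "Defence,Quango A,120,90\n"
       "Defence,Quango B,0,15\n"
       "Health,Quango C,30,45\n")

# Mirror of the "where (R>0 AND S>0)" clause...
kept = [r for r in csv.DictReader(io.StringIO(raw))
        if float(r["2006"]) > 0 and float(r["2007"]) > 0]

# ...plus the year-on-year change column a change treemap colours by.
for r in kept:
    r["change"] = float(r["2007"]) - float(r["2006"])

print([(r["Quango"], r["change"]) for r in kept])
```

Quango B drops out because it had zero expenditure in 2006, and the sign of the change column is what drives the blue/orange colouring in the treemap.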

Just as an aside, note that it’s possible to pull the output of this query into another spreadsheet. This allows you to add derived columns to the spreadsheet, for example, using relative formulae that act on quantities automatically imported into other columns in the spreadsheet. (One thing I intend to explore with the data store explorer is a 1-click way of creating a new spreadsheet that pulls in a query created using the explorer. See also: Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…))

The CSV feed can then be pulled into Many Eyes Wikified and visualised in a variety of ways that reveal both the overall expenditure of quangos funded from within each department, as well as the relative funding over 2006/2007 of each quango:

So for example, for the two biggest quangos by expenditure in Defence, one spent more over 2006/7, and the other spent less…

Using the same data page, we can create other visualisations within the Many Eyes Wikified environment that allow us to explore the data in a little more detail. So for example, here’s a bit more detail about the funding of quangos in the Department of Health. Parallel bands show that quangos spent equivalent amounts in 2006 and 2007, lines that diverge going from left to right show an increase in expenditure, and lines that converge going from left to right depict decreasing expenditure.

A scatter chart readily shows large changes in expenditure over the two years:

And some charts are just shiny ;-)

Compared with just trawling through the spreadsheet data, I think there is a lot to be said for using visualisations to identify out of the norm data points (e.g. using the scatterplot), or unpacking totals (as in the case of departmentally funded quango expenditure) in a quick and easy way as a starting point to exploring the data in a more systematic way, for example in order to ask journalistic questions (whatever they might be).

A Rosetta Stone for Guardian Datastore UK Higher Education Data

In Does Funding Equal Happiness in Higher Education?, I described a couple of interactive visualisations that are built up around a dataset that pools data from several of the Guardian datastore Higher Education datasets. In this post, I’ll show how that aggregated dataset was put together, and review some of the problems that the approach I took has.

The first thing to note is that the data I wanted to combine existed in several different spreadsheets, which we might think of as several different databases. Notwithstanding some of the “issues” I have with some of the more puritanical elements of the Linked Data world view, publishing data from different sources that is ostensibly about the same things (i.e. Higher Education institutions) in the seemingly arbitrary way that it has been published on the Guardian Datastore riles me even more… (I’ve written about this before (e.g. The Guardian OpenPlatform DataStore – Just a Toy, or a Trusted Resource?) so I’m not causing any new offence by saying this;-)

So what’s the problem? In short, this sort of thing (the column contents are taken from several of the HE related spreadsheets):

Given the name of a university from one spreadsheet, it’s all but impossible to match the data to a corresponding university in one of the other spreadsheets.

Ideally, each spreadsheet should use a common identifier for a particular institution, such as a UCAS institution code; but that hasn’t happened, which makes relating one data set to another – such as comparing drop out rates to student satisfaction scores – difficult.

One possible way around this is to use a “Rosetta Stone” spreadsheet (e.g. The Guardian rosetta: the Datablog reference guide to nearly everything) which contains synonyms for the same entity as used across several spreadsheets. (I kept meaning to demo how this could be used, but never got round to it, so I’ll give one possible take on it in a quick tutorial below…) The translations for the HEIs have been barely attempted in the current version of the Guardian Datastore Rosetta sheet, though, so I spent a couple of hours last night addressing that and creating my own Rosetta Stone spreadsheet for the HE data.

So how does it work? The spreadsheet essentially defines a set of sameAs relations within a row using the principle: one row, one object. The columns correspond to separate datasheets within the Guardian datastore. Each cell corresponds to the identifier used within a particular datasheet (that is, within a particular spreadsheet that we are using as a database) to describe a particular thing.
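In code terms, each Rosetta row is a small bundle of sameAs assertions. A toy Python sketch of the idea (the sheet names and synonyms below are invented for illustration):

```python
# One row per institution; each key names the datasheet that uses
# that synonym. All names here are invented for illustration.
rosetta = [
    {"ucas": "C60", "funding": "City University",
     "satisfaction": "The City University"},
    {"ucas": "O33", "funding": "Open University",
     "satisfaction": "The Open University"},
]

def translate(name, from_sheet, to_sheet):
    """Look up the synonym used for `name` in another datasheet."""
    for row in rosetta:
        if row.get(from_sheet) == name:
            return row.get(to_sheet)
    return None

print(translate("The City University", "satisfaction", "ucas"))  # → C60
```

The “one row, one object” principle is what makes the lookup work: any cell in a row can act as the key for finding any other cell in the same row.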

I was going to say that this contrasts to the Linked Data principle of “one thing, one identifier”, but that principle is not explicitly one of the four Linked Data rules, is it…?

So how can we use the Rosetta Table? One way is to use a =QUERY() formula, building on the ideas explored in Using Google Spreadsheets Like a Database – The QUERY Formula.

Recall that the QUERY formula has the form: =QUERY(RANGE,DATAQUERY). Now here comes the important bit:

if we know the name of (that is, the identifier for) an HEI in one particular datastore spreadsheet, we can look up the identifier used for the same institution in another datastore spreadsheet, using the Rosetta sheet as a stepping stone.

For example, if we have the Rosetta data in a sheet called “Mapping”, where column B contains the UCAS codes and column F contains the name of a university for a datasheet that we are currently interested in (that is, one from which we have the name of the institution), we can use a query of the following form to grab the UCAS code (the formula will also return the name of the institution we are looking up the code for):

=query('Mapping'!B:F,"select B,F where F contains 'The City University'")

So for example:


If we want to look up the name of an institution in a spreadsheet whose identifiers are listed in column C of the Rosetta sheet, using the name of an institution as described using an identifier taken from a spreadsheet corresponding to Rosetta column F, we can use a formula of the form:

=query('Mapping'!C:F,"select C,F where F contains 'The City University'")

If we are pulling in the name of the institution we want to look up a UCAS code or synonym for from another cell (say, B1), we can use a formula of the form:

=query('Mapping'!C:F,CONCATENATE("select C,F where F contains '",B1,"'"))

(Note that in this case, C and F are columns C and F in the “Mapping” sheet, and B1 refers to column B in the current sheet.)
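It’s worth noting that the `contains` operator in these queries is a substring match, which this rough Python equivalent makes explicit (the column contents are invented for illustration):

```python
# Rough equivalent of:
#   =query('Mapping'!C:F, "select C,F where F contains '<name>'")
# Columns C and F are modelled as parallel lists; values are invented.
col_c = ["city", "open"]
col_f = ["The City University", "The Open University"]

def query_contains(needle):
    # 'contains' is a substring match, so short needles can over-match.
    return [(c, f) for c, f in zip(col_c, col_f) if needle in f]

print(query_contains("The City University"))
```

A consequence of the substring semantics is that a short search term (e.g. just “University”) would match every row here, so the fuller the institution name you search with, the safer the lookup.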

So for example:

which gives:

Once we have looked up the identifier for an institution in one datastore sheet that corresponds to an institution mentioned in another datastore sheet, we can use that identifier to look up data from that sheet. In other words, we can create a spreadsheet whose rows contain data for a particular institution pulled from separate datastore spreadsheets. The method is a little clunky, as I’ll show below, but it works. (I’ll try to post a more efficient way in a few days.)

The recipe is as follows:

– populate a sheet with the names of universities as identified in one particular sheet. You might do this by using an =ImportRange() formula, like this one:
=ImportRange("tr8_2VPY0bfJQgf29KRz9sg","ALL UNIVERSITIES!A1:B132")
that pulls in data from the 2010 funding spreadsheet.

– look up the synonym for each institution as used in a different spreadsheet (e.g. the student satisfaction spreadsheet) using the Rosetta table loaded into a separate sheet (I called mine “Mapping”); e.g. =query('Mapping'!C:F,CONCATENATE("select C,E where E contains '",A2,"'"))

If we drag that cell formula down the column, we get the other synonyms too:

– now we can run a QUERY to pull in the student satisfaction data for the corresponding institution into the appropriate row, to give us rows that contain funding AND student satisfaction data. There is just one issue though. Whilst the spreadsheet documentation suggests that the RANGE for a QUERY() should be okay as a range of cells imported from another spreadsheet using an =importRange() formula, it doesn’t actually appear to work… Instead, we only seem to be able to run a QUERY over a range of cells contained in a sheet within the current spreadsheet. Which means we need to copy the student satisfaction data into another local sheet and then call on that sheet when we run our query, such as:
=Query('Satisfaction'!B:F,concatenate("select B,C,D,E,F where B matches '",C2,"'"),"")

Drag the cell down, and we get the satisfaction data, though we need to complete the column headings ourselves using the above formula:

(If you put:
=Query('Satisfaction'!B:F,concatenate("select B,C,D,E,F where B matches '",C2,"'"))
into cell E1, this will pull the headings into row 1 and the appropriate data into row 2. Why? Because we removed the “no headers” ("") argument from the end of the query.)
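Pulling the recipe together, the whole lookup chain amounts to a two-step join, which this Python sketch walks through (all the names and figures are invented for illustration):

```python
# Step 1: names and data as used in the funding sheet
# (the =ImportRange() step).
funding = {"Open University": 5.2, "City University": -1.3}

# Step 2: the Rosetta mapping from funding-sheet names to
# satisfaction-sheet names (hand-crafted, as in the post).
mapping = {"Open University": "The Open University",
           "City University": "The City University"}

# Step 3: the satisfaction sheet, keyed by its own names
# (the local copy that =QUERY() runs over).
satisfaction = {"The Open University": 4.3, "The City University": 3.9}

# The join: one row per institution, with columns from both sheets.
combined = {name: (funding[name], satisfaction[mapping[name]])
            for name in funding}
print(combined["Open University"])  # → (5.2, 4.3)
```

Seen this way, the Rosetta sheet is just the middle key of a two-hop dictionary lookup; everything else is plumbing to make a spreadsheet do it.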

Okay – that’s more than enough of that, for now. Hopefully you should have a reasonable idea of: a) how to use a Rosetta sheet to look up the name of an HEI in the appropriate format for a particular datastore spreadsheet given its name as taken from another datastore spreadsheet; and b) how to use that name to look up data in a local copy of a datastore spreadsheet.

Before I sign off, though, it’s worth reviewing some of the problems with this approach.

Firstly, there’s the matter of compiling the Rosetta Stone spreadsheet itself. I hand-crafted this spreadsheet for several reasons: firstly, to let my fingers get a feel for the sorts of process that I really should have tried to automate; secondly, to get a feel for just what sorts of differences there were in the way the same institutions were represented across different spreadsheets; and thirdly, to see whether those differences were regular in any way, because if they are, we might be able to use heuristics to guess, with a reasonable degree of success, the mapping from one identifier onto another. (A couple of pointers about how possibly to approach this are described by @kitwallace here: A data mashup to explore Brian Kelly’s Linked Data challenge.)

Another class of problems relate to knowing what the data in each row, column or cell is about. So for example:

– if you pull data in from one spreadsheet into another one without bringing in the column heading, you can lose track of what the data you have pulled in is;
– identifying which column to pull in from another spreadsheet is difficult; if you pull columns in by column letter (A, B, C) and for any reason the column ordering changes in the spreadsheet you’re pulling data from, you lose the desired linkage; ideally, what you want to do is pull in columnar data by column heading, i.e. some descriptor that is used to identify a column in a meaningful way rather than an arbitrary way like column position;
– the link between the contents of a cell and what it refers to is only a weak one. If, by convention, we always use column 1 to hold the identifier of the thing being talked about, and the row 1 column headings as the identifiers that describe the properties of the thing, then the co-ordinates of a cell can be used to identify the particular property of the particular thing being talked about. But if the table is not situated regularly within a spreadsheet (e.g. it starts at cell D7), things get a little bit more arbitrary (unless we have another convention, such as having padding cells around the row/column headings containing what amounts to punctuation to syntactically identify them as such).
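The “pull columns in by heading, not letter” point is easy to illustrate: name-based access survives a column reorder where positional access does not. A small sketch using Python’s csv module (the data is invented):

```python
import csv
import io

# The same data published with its columns in two different orders.
v1 = "Institution,Funding\nOpen University,5.2\n"
v2 = "Funding,Institution\n5.2,Open University\n"

rows = [next(csv.DictReader(io.StringIO(d))) for d in (v1, v2)]

# Access by heading is stable under reordering...
by_heading = [r["Funding"] for r in rows]
# ...but "column B" style positional access is not:
positional = [list(r.values())[1] for r in rows]

print(by_heading)   # → ['5.2', '5.2']
print(positional)   # → ['5.2', 'Open University']
```

The positional lookup silently returns the wrong column in the second case, which is exactly the “lost linkage” failure mode described above.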

(There’s a whole range of other problems about whether we can sensibly compare data from one spreadsheet with data from another… e.g. comparing funding in 2010 with drop out rates from 2001.)

If we unpick this, we see we really want two sorts of identifier: a set of unique identifiers for the HEIs (e.g. UCAS number) that are used consistently across different spreadsheets; and a set of unique identifiers for the properties of the HEIs (e.g. Average_NSS_Student_Satisfaction_Score, or 2010_HEFCE_funding_change) that can be used to uniquely identify a set of properties in one sheet so that they can be referenced explicitly from another.

Any Linked Data folk reading this will probably, at this point, be yelling “We told you so”, but as a pragmatist I think we have to find a way to make data work in the real’n’messy world…;-)

Time for Or a local

Over a swift half at the end of the rather wonderful Liver’n’Mash, Mike Nolan chatted through some of his thoughts around his presentation on that I hadn’t been able to get to see.

From looking back over Mike’s presentation slides, he seems to be advocating “let’s do what we can, as soon as possible”, particularly with respect to sharing things like information about courses and events, as well as other data from university information systems. That emphasis seems to me to be on syndicating information already available on university websites in a more data-like way (RSS news feeds, for example, or calendar feeds); this is similar to the approach taken by the new DirectGov API, I think?

(By the by, I came across this related presentation earlier today, something I’d prepared for the SocialLearn project, as was, a couple of years ago: Portable Course Data.)

As part of the same conversation, Brian Kelly suggested that just as the open data lobby had been calling for government to open up its data, government might well respond by calling on public sector organisations to open up their data. This has already started to happen, for example with a letter from Downing Street last week calling on local councils to get ready to open up some of their financial and organisational chart data.

I think Brian is right in suggesting that Higher Education should brace itself to expect similar treatment… (A lot of this data is already out there, it has to be said. For example, here’s a spreadsheet detailing VCs’ pay.)

So what is my take on how to get started with, or a more local version, such as

To my mind, the quickest start is to just republish data that is already available in data form. So for example:

– student satisfaction data is available from the Direct Gov Unistats service (OU data [XLS]; general download list);
– funding data about current grants is provided on research council sites. The EPSRC, for example, provide a way of accessing spreadsheets for funding received by various OU departments: OU Awards from the EPSRC (see more generally the full list of funded organisations); if you know similar ways of getting similar data from other research councils, or funders such as JISC, please post a link in the comments to this post:-)
– financial data, where already published; the OU’s public financial statements can be found on the Freedom of Information minisite, for example (OU FOI: financial statements);
– organisational data, where already published. Again the OU seems to be ahead of the game on this one via the FOI site: OU FOI: organisational structure; (the FOI site also includes pay grade details, so you’ll be able to see just how overpaid I really am, despite all my wittering;-)
– RAE (Research Assessment Exercise) data: one possible source of this information is the Guardian DataStore (Guardian datastore: RAE data, original data from [XLS]).

(From that quick list, the OU seems to be doing really well via the OU FOI website. Are other HEIs as far on as this, I wonder, or does having Open in the university name create raised expectations around the OU on matters such as this?!)

The Guardian has also republished quite a range of additional HE related data in its datastore, some of which I’ve even played with before… e.g. Does Funding Equal Happiness in Higher Education? (though there have been one or two, err, niggles with the data… in previous spreadsheets;-) or for a fuller list: OUseful visualisations around education data.

Another possible source of data in a raw form is the education datastore (an example can be found via here), which makes me wonder about the extent to which a website might just be an HE/FE view over that wider datastore? (Related: @kitwallace on University data.) And then maybe, hence: would data.* be a view over for a particular institution? Or * a view over a view over the full education datastore?

As to how best to publish the data? That’ll probably take another post, though a really quick win could be achieved by just grabbing the appropriate data from a Guardian datastore spreadsheet on Google docs, putting it into another Google doc, and then just embedding it in a page…;-)

PS In his post, Mike mentioned an old hack of mine that searched for autodiscoverable RSS feeds on * websites. I’d also done one that puts up screenshots of 404 pages… Maybe I need one that looks for the existence of data.* subdomains?!

PPS Finally, it’s probably worth just paying heed to notions of Good and Bad Transparency. The line I’m suggesting above is one of convenient discovery as much as anything else, pulling (links to) all the data sets related to an institution into an area of the institution’s own website. Cf. the similar approach taken by, which is to act primarily as a directory layer, as well as hosting national level datastores for particular datasets.