[Note that this post has undergone a title change since it was originally published…]
In yesterday’s Observer newspaper there was a feature article on grade inflation in Higher Education (‘Dumbing down’ row over value of degrees), along with a front page tease about how “the Conservatives pledged last night to open universities to tougher scrutiny by publishing data next month that details what happens to students’ job prospects and salaries when they leave individual institutions, exposing for the first time how college choice affects lifetime earnings.”
Which got me wondering… is this info hidden away in any of the data that has already been officially published at least once, and then republished on the Guardian datablog?
So I had a quick look at the full university tables, and used my Guardian datastore explorer to pull off the ranking of each university along with the career prospects and value add score so that you could compare them on a scatterplot, look at the best and the worse, and so on.
I also constructed a simple query that lets you look at the data for a particular HEI – here’s the results for Bath for example:
(The URI is hackable – just find the word Bath in it and change it to the name of the institution you want the data for.)
Scanning over the datablog, I also noticed there was data available for drop out rates… Hmmm… maybe now’s a good time to have a play with Google Fusion Tables?
Long suffering readers of this blog will know already how I’ve explored ‘JOINing’ data from different spreadsheet tables in DabbleDB to create data tables that merge data from different tables sharing a similar column (e.g. Using Dabble DB in an Online Mashup Context or Mash/Combining Data from Three Separate Sources Using Dabble DB); Google Fusion Tables does something similar…
Start off by importing a couple of spreadsheets into the Fusion Tables app: so for example, I exported the HE Tables 2010 from the Guardian datastore as an Excel file to my desktop and then uploaded the an XLS file into Fusion Tables (‘Create New Table’). An alternative route would have been to grab a copy of the spreadsheet into my own Google spreadsheets account, and then pull it into Fusion Tables from there by associatiting my Google spreadsheets account with my Fusion Tables account (this is don using OAuth, I think…).
I also grabbed the CSV output of the HEI dropout dataset, cleaned it so that the names of the universities used the same naming convention that is used in the HE Tables 2010 dataset (a bit of value add activity that the Datablog folks really ought to do… e.g. Do publications need style guides for data?;-) and then uploaded that to Google Fusion Tables too.
Now comes the magic – take one data set:
and merge it with another:
The radio buttons let you select the common column, and merge (JOIN) the rows from ach table that share the same value in those columns.
The merge can be used to augment the current table, or create a new one; you can also select which columns you want to use in the new (or extended) table:
(Note that this is open to all sorts of careless abuse – e.g. I am conflating the 2010 HE Tables data with dropout data from 2007/8; but then, I’m showing you how the app works, not best practise in how to use it ;-)
Hmmm – looks like I didn’t clean or import the data properly… That’s one to watch for next time round… I didn’t check the data with the original source either, so can’t guarantee those numbers are right for those HEIs;-)
Just by the by, note how each row can be commented upon; in the uploaded tables, you can also add comments to each cell – which could be interesting for WriteToReply purposes?
So there we have a way of combining data sets. It’s possible to keep merging data sets, so one thing I might do is conflate all the separate sheets in the HE Tables 2010 dataset (which give student satisfaction ratings for different subject areas) into a single table that allows analysis across all of them at the same time :-) (I know journalists are encouraged to nurture ‘contacts'; maybe they should nurture their own combined data sets too????;-)
But that’s not all… One of the things I realised in my data explorer was that it’s useful to be able to query a dataset, and visualise all or part of it too; and Fusion Tables offers exactly this sort of support.
SO for example, you can filter the data based on different criteria, like who’s got a high dropout rate and poor career prospects:
(Note I’ve blocked out the names of the institutions because I haven’t checked to make sure I haven’t done anything silly with the data and don’t want to be at the centre of a story;-)
There are also options that let you visualise the data (either over the whole table or a filtered part of it):
So for example, who’s doing well…
No peeking though ;-)
This is exactly the sort of thing I envisioned a properly working datastore explorer doing… and whilst there are still more than a few niggles with the Fusion Tables app as it currently stands, there’s no immediate need for me to bother extending my own data store explorer, because I can probably get away with using Fusion Tables… (particularly if they open up support for the visualisation API query language…)
Hopefully you’ll see that there is benefit in being able to JOIN different sets, as well as the opportunity to look for patterns and anomalies across different data sets. So for example, if the Conservatives publish on-salary information about graduates, it can be added to the above table allow conflated queries about salary, career prospects and drop-out rates.
Go play… :-)
PS I was going to publish some datasets via Fusion Tables as public docs, but then – it’s the holiday season, and it won’t take that long for you to just go and do-it-yourself ;-) If you try to recreate the above here’s a tip about transforming the dropout data so the HEI names are the same as the ones used in the HE 2010 tables – use regular expressions to remove any blank lines, “The University of “, “University of ” and ” University”…