“JOIN”ing Data from the Guardian Data Store Student Satisfaction Data
I really should make this the last post on the Guardian data store for a while, because I’ve got a stack of other things I really ought to be doing instead, but it struck me that the following demo might open up some peoples eyes as to what’s possible when you have several data sets that play nicely…
It shows you how to take data from two different spreadsheets and link it together to create a third data set that contains elements from the two original ones. (If you just want to cut to the end, here’s a (visualised) reason why it might not be such a happy idea to go to Southampton Solent if you want to study Architecture or Planning: Student Happiness on Planning and Architecture courses. Now ask yourself the question: how would I (err, that is, you) have produced that sort of chart?)
This whole idea is in and around the area of Linked Data, but doing it the hard way…
If you don’t know what Linked Data is, watch this:
So let’s have a look at the Guardian University Tables/ Satisfaction Data that has been uploaded to Google Spreadsheets:
You’ll notice there are lots of sheets in there, covering the different subject areas – like “Architecture”, for instance:
Importantly, the format of the names of the institutions is consistent across spreadsheets.
So I was wondering – if I was a student wanting to study either planning or architecture, how could i find out which institutions had a good satisfaction rating, or low student to staff ratio, across both those subjects? (Maybe I’m not sure which subject I want to do, so if I choose the wrong one and try to switch course, I know I’m not going to switch into a duff department.)
That is, I’d like to be able to see a single table containing the data from both the overall results table as well as the Architecture and Planning tables.
Now if the data was in a database, and if I spoke SQL, this would be quite easy to do (hint condition: look up sql JOIN). But I only have my browser to hand, so what to do…?
Dabble DB provides one answer… How? Here’s how…
Start off by creating a new application:
I’m going to seed it with a table containing the names of HEIs as listed in column B of the overall data table by importing just that column, as CSV data, from the Google spreadsheet:
Pull the data in:
So now we have the data:
Okay – let’s import a couple more tables, say the data for Planning and Architecture areas.
First, Planning – here’s the CSV:
Click on “More…” and you’ll be offered the chance to Add a New Category.
Take that opportunity:-)
You hopefully get the idea…
Exactly as before…
Now do the same for the Architecture data:
So now i have three tables – known as categories – in Dabble DB.
Let’s link them… that is, let’s make the data from one category available to another.
Firstly, I’ll link the Architecture data to the table that just lists the HEIs – click on the Name of Institution column to pop-up a menu and select Configure:
We’re going to Link the column to Entries in another one:
in particular, we’re going to tell Dabble DB that the Names of Institutions in the Architecture table are the same things as the institions in the HEI category/table:
If you look at the HEIs category, you’ll see the Architecture column has been brought in:
We can now do the same for Planning (remember, pop up the Name of Institution menu and Configure it to Link Entries).
The next step is to pull in some data from the two linked categories. How about we compare the Teaching Satisfaction scores for these two subjects?
Click on the column header for one of the linked categories – say the planning one, select Add Derived Field and then the field you want to pull in:
The data gets pulled in…
(Oops – this is all a bit sparse; maybe I should have used a different filed, such as Average Entry Tariff? Never mind, let’s push on…)
Add the corresponding derived field for the Architecture courses:
If you click on the “unsaved View” label, you can save this data table:
To tidy up the table, let’s hide the duplicated Name columns and resave:
To give something like this:
A nice feature of Dabble DB is that is makes it easy to export data from any given view:
So if we grab the CSV URI:
We can take it to, I dunno, Many Eyes Wikified?
Create a placeholder for a visulisation (the data page is ousefulTestboard/StudentHappinessPlanningArchitecture):
Or just type the text yourself:
Click through to create the viz:
We’ll have the scatter plot:
The empty cells in the data columns may cause Many Eyes Wikified to think the data is Text – it’s not, it’s Number:
Now customise the view… I could just have every spot the same, but Architecture is my first preference, so let’s just highlight the places where students are happiest doing that subject (click through to play with the visualisation):
UCL seems to do best:
So here’s a recap – we’ve essentially JOINed data from two separate spreadsheets from Google Spreadsheets to create a new data table in Dabble DB, then visualised it in Many Eyes.
Can you see now why privacy hacks don’t like the idea of linked data in government, or across companies?
So here’s you’re weekend homework – create a data set that identifies entry requirements across the various engineering subject areas, and try to find a way of visualising it ;-)