In Does Funding Equal Happiness in Higher Education?, I described a couple of interactive visualisations that are built up around a dataset that pools data from several of the Guardian datastore Higher Education datasets. In this post, I’ll show how that aggregated dataset was put together, and review some of the problems that the approach I took has.
The first thing to note is that the data I wanted to combine existed in several different spreadsheets, which we might think of as several different databases. Notwithstanding some of the “issues” I have with some of the more puritanical elements of the Linked Data world view, publishing data from different sources that is ostensibly about the same things (i.e. Higher Education institutions) in the seemingly arbitrary way that it has been published on the Guardian Datastore ires me even more… (I’ve written about this before (e.g. The Guardian OpenPlatform DataStore – Just a Toy, or a Trusted Resource?) so I’m not causing any new offence by saying this;-)
So what’s the problem? In short, this sort of thing (the column contents are taken from several of the HE related spreadsheets):
Given the name of a university from one spreadsheet, it’s all but impossible to match the data to a corresponding university in one of the other spreadsheets.
Ideally, each spreadsheet should use a common identifier for a particular institution, such as a UCAS institution code; but that hasn’t happened, which makes relating one data set to another – such as comparing drop out rates to student satisfaction scores – difficult.
One possible way around this is to use a “Rosetta Stone” spreadsheet (e.g. The Guardian rosetta: the Datablog reference guide to nearly everything) which contains synonyms for the same entity as used across several spreadsheets. (I kept meaning to demo how this could be used, but never got round to it, so I’ll give one possible take on it in a quick tutorial below…) The translations for the HEIs has been barely attempted in the current version of the Guardian Datastore Rosetta sheet, though, so I spent a couple of hours last night addressing that and creating my own Rosetta Stone spreadsheet for the HE data.
So how does it work? The spreadsheet essentially defines a set of sameAs relations within a row using the principle: one row, one object. The columns correspond to separate datasheets within the Guardian datastore. Each cell corresponds to the identifier used within a particular datasheet (that is, within a particular spreadsheet that we are using as a database) to describe a particular thing.
I was going to say that this contrasts to the Linked Data principle of “one thing, one identifier”, but that principle is not explicitly one of the four Linked Data rules, is it…?
So how can we use the Rosetta Table? One way is to use a =QUERY() formula, building on the ideas explored in Using Google Spreadsheets Like a Database – The QUERY Formula.
Recall that the QUERY formula has the form: =QUERY(RANGE,DATAQUERY). Now here comes the important bit:
if we know the name of (that is, the identifer for) an HEI in one particular datastore spreadsheet, we can look up the identifier used for the same institution in another datastore spreadsheet using the Rosetta sheet as a stepping stone.
For example, if we have the Rosetta data in a sheet called “Mapping”, column B contains the UCAS codes and column F contains the name of a university for a datasheet that we are currently interested in, (that is, one from which we have the name of the institution) we can use a query of the following form to grab the UCAS code (the formula will also return the name of the institution we are looking up the code for):
=query(‘Mapping’!B:F,”select B,F where F contains ‘The City University'”)
So for example:
If we want to look up the name of an institution in a spreadsheet whose identifiers are listed in column C of the Rosetta sheet, using the name of an institution as described using an identifier taken from a spreadsheet corresonding to Rosetta column F, we can use a formula of the form:
=query(‘Mapping’!C:F,”select C,F where F contains ‘The City University'”)
If we are pulling in the name of the institution we want to look up a UCAS code or synonym for from another cell (say, B1), we can use a formula of the form:
=query(‘Mapping’!C:F,CONCATENATE(“select C,F where F contains ‘”,B1,”‘”))
(Note that in this case, C and F are columns C and F in the “Mapping” sheet, and B1 refers to column B in the current sheet.)
So for example:
Once we have looked up the identifier for an instituion in one datastore sheet that corresponds to an institution mentioned in another datastore sheet, we can use that identifier to lookup data from that sheet. In other words, we can create a spreadsheet whose rows contain data for a particular institution pulled from separate datastore spreadsheets. The method is a little clunky, as I’ll show below, but it works. (I’ll try to post a more efficent way in a few days.)
The recipe is as follows:
– populate a sheet with the names of universities as identified in one particular sheet. You might do this by using an =ImportRange() formula, like this one:
that pulls in data from the 2010 funding spreadsheet.
– look up the synonym for each institution as used in a different spreadsheet (e.g. the student satisfaction spreadsheet) using the Rosetta table loaded into a separate sheet (I called mine “Mapping”); e.g. =query(‘Mapping’!C:F,CONCATENATE(“select C,E where E contains ‘”,A2,”‘”))
If we drag that cell formula down the column, we get the other synonyms too:
– now we can run a QUERY to pull in the student satisfaction data for the corresponding institution into the appropriate row, to give us rows that contain funding AND student satisfaction data. There is just one issue though. Whilst the spreadsheet documentation suggests that the RANGE for a QUERY() should be okay as range of cells imported from another spreadsheet using an =importRange() formula, it doesn’t actually appear to work… Instead, we only seem to be able to run a QUERY over a range of cells contained in a sheet within the current spreadsheet. Which means we need to copy the student satisfaction data into another local sheet and then call on that sheet when we run our query. such as:
=Query(‘Satisfaction’!B:F,concatenate(“select B,C,D,E,F where B matches ‘”,C2,”‘”),””)
Drag the cell down, and we get the satisfaction data, though we need to complete the column headings ourself using the above formula:
(If you put:
=Query(‘Satisfaction’!B:F,concatenate(“select B,C,D,E,F where B matches ‘”,C2))
into cell E1 this will pull the headings into row 1 and the appropriate data into row 2. Why? Because we removed the end of the no headers (“”) argument at the end of the query.)
Okay – that’s more than enough of that, for now. Hopefully you should have a reasonable idea of: a) how to use a Rosetta sheet to look up the name of an HEI in the appropriate format for a particular datastore spreadsheet given it’s name as taken from another datastore spreadsheet; and b) how to use that name to lookup data in a local copy of a datastore spreadsheet.
Before I sign off, though, it’w worth reviewing some of the problems with this approach.
Firstly, there’s the matter of compiling the Rosetta Stone spreadsheet itself. I hand-crafted this spreadsheet for several reasons: firstly, to let my fingers get a feel for the sorts of process that I really should have tried to automate; secondly, to get a feel for just what sorts of differences there were in the way the same institutions were represented across different spreadsheets; and thirdly, to see whether those difference were regular in any way, because if they are, we might be able to use heuristics to guess with a reasonable degree of success the mapping from one identifier on to another. (A couple of pointers about how possibly to approach this are described by @kitwallace here: A data mashup to explore Brian Kelly’s Linked Data challenge.)
Another class of problems relate to knowing what the data in each row, column or cell is about. So for example:
– if you pull data in from one spreadsheet into another one in without bringing in the column heading, you can lose track of what the data you have pulled in is;
– identifying what column to pull in from another spreadsheet is difficult; if you pull columns in by column number (A, B, C) if for any reason the column ordering changes in the spreadsheet you’re pulling data from, you lose the desired linkage; ideally, what you want to do is pull in columnar data by at least column heading, i.e. some descriptor that is used to identify a column in a meaningful way rather than an arbitrary way like column number;
– the link between the contents of a cell and what it refers to is only a weak one. If, by convention, we always use column 1 to hold the identifier of the thing being talked about, and the row 1 column headings as the identifiers that describe the properties of the the thing, then the co-ordinates of a cell can be used to identify the particular property of the particular thing being talked about. But if the table is not situated regularly within a spreadsheet (e.g. it starts at cell D7), things get a little bit more arbitrary (unless we have another convention, such as having padding cells around the row/column headings containing what amounts to punctuation to syntactically identify them as such).
(There’s a whole range of other problems about whether we can sensibly compare data from one spreadsheet with data from another… e.g. comparing funding in 2010 with drop out rates from 2001.)
If we unpick this, we see we really want two sorts of identifier: a set of unique identifiers for the HEIs (e.g. UCAS number) that are used consistently across different spreadsheets; and a set of unique identifiers for the properties of the HEIs (e.g. Average_NSS_Student_Satisfaction_Score, or 2010_HEFCE_funding_change) that can be used to uniquely identify a set of properties in one sheet so that they can be referenced explicitly from another.
Any Linked Data folk reading this will probably, at this point, be yelling “We told you so”, but as a pragmatist I think we have to find a way to make data work in the real’n’messy world…;-)