Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data

So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack was a Filtering Guardian University Data Every Which Way You Can…, where I had a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, with gid numbers 0 and 1 respectively). Sheets 2..47 are formatted as follows:

Guardian data table uni 2014

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

gsqAPI = function(key,query,gid=0){
  tmp=getURL( paste( sep="",'', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )

The query is a query written using the SQL-like Google visualisation API query language.

The following routine will download non-empty rows from a specific sheet, and rename the subject specific rank column to a generically titled column (“Subject Rank”). An additional column is added to the table that denotes what subject the table is associated with.

  tmp=gsqAPI(key,"select * where B!=''", i)

We can now download the first subject specific sheet using the following call:


This loads in the data as follows:

seeding guardian ranking data table

We can then add data to this dataframe from all the other sheets, to give us a single data from containing ranking data for all the subjects listed.

for (i in 3:47){

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be viewed as numeric types as instead being treated as characters.

Data format - need to make numbers

We can fix this with a one liner, that forces the type on the columns we select (the fourth to the eleventh column):

gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]

Cross-subject rankings

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and satisfaction levels, whilst also paying heed to the value add score?

oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

oxfordbrookes start asking

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking te data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

Filtering Guardian University Data Every Which Way You Can…

In a post from way back when – Does Funding Equal Happiness in Higher Education? – that I still get daily traffic to, though the IBM Many Eyes Wikified hack described in it no longer works, I aggregated and reused a selection of data sets collected by the Guardian datastore relating to HE.

Whilst the range of datasets used in that hack don’t seem to have been re-collected more recently, the Guardian DataStore does still publish and annual set of aggregated data (from Unistats?) for courses by subject area across UK HE (University guide 2013 – data tables).

The DataStore data is published using Google spreadsheets, which as regular readers will know also double up as a database. The Google Visualisation API that’s supported by Google Spreadsheets also makes it easy to pull data from the spreadsheets into an interactive dashboard view.

As an example, I’ve popped a quick demo up as a Scraperwiki View showing how to pull data from a selected sheet within the Guardian University data spreadsheet and filter it using a range of controls. I’ve also added a tabular view, and a handful of scatterplots, to show off the filtered data.

To play with the view, visit here: Guardian Student Rankings.

If you want to hack around with the view, it’s wikified here: wikified source code.

PS I’ve also pulled all the subject data tables into a single Scraperwiki database: Guardian HE Data 2013

Data Journalists Engaging in Co-Innovation…

You may or may not have noticed that the Boundary Commission released their take on proposed parliamentary constituency boundaries today.

They could have released the data – as data – in the form of shape files that can be rendered at the click of a button in things like Google Maps… but they didn’t… [The one thing the Boundary Commission quango forgot to produce: a map] (There are issues with publishing the actual shapefiles, of course. For one thing, the boundaries may yet change – and if the original shapefiles are left hanging around, people may start to draw on these now incorrect sources of data once the boundaries are fixed. But that’s a minor issue…)

Instead, you have to download a series of hefty PDFs, one per region, to get a flavour of the boundary changes. Drawing a direct comparison with the current boundaries is not possible.

The make-up of the actual constituencies appears to based on their member wards, data which is provided in a series of spreadsheets, one per region, each containing several sheets describing the ward makeup of each new constituency for the counties in the corresponding region.

It didn’t take long for the data junkies to get on the case though. From my perspective, the first map I saw was on the Guardian Datastore, reusing work by University of Sheffield academic Alasdair Rae, apparently created using Google Fusion Tables (though I haven’t see a recipe published anywhere? Or a link to the KML file that I saw Guardian Datablog editor Simon Rogers/@smfrogers tweet about?)

[I knew I should have grabbed a screen shot of the original map…:-(]

It appears that Conrad Quilty-Harper (@coneee) over at the Telegraph then got on the case, and came up with a comparative map drawing on Rae’s work as published on the Datablog, showing the current boundaries compared to the proposed changes, and which ties the maps together so the zoom level and focus are matched across the maps (MPs’ constituencies: boundary changes mapped):

Telegraph side by side map comparison

Interestingly, I was alerted to this map by Simon tweeting that he liked the Telegraph map so much, they’d reused the idea (and maybe even the code?) on the Guardian site. Here’s a snapshot of the conversation between these two data journalists over the course of the day (reverse chronological order):

Datajournalists in co-operative bootstrapping mode

Here’s the handshake…

Collaborative co-evolution

I absolutely love this… and what’s more, it happened over the course of four or five hours, with a couple of technology/knowledge transfers along the way, as well as evolution in the way both news agencies communicated the information compared to the way the Boundary Commission released it. (If I was evil, I’d try to FOI the Boundary Commission to see how much time, effort and expense went into their communication effort around the proposed changes, and would then try to guesstimate how much the Guardian and Telegraph teams put into it as a comparison…)

At the time of writing (15.30), the BBC have no data driven take on this story…

And out of interest, I also wondered whether Sheffield U had a take…

Sheffiled u media site

Maybe not…

PS By the by, the website relaunched today. I’m honoured to be on the editorial board, along with @paulbradshaw @nicolaskb @mirkolorenz @smfrogers and @stiles, and looking forward to seeing how we can start to drive interest, engagement and skills development in, as well as analysis and (re)use of, and commentary on, public open data through the data journalism route…

PPS if you’re into data journalism, you may also be interested in, a question and answer site in the model of Stack Overflow, with an emphasis on Q&A around how to find, access, and make use of open and public datasets.

Using Google Spreadsheets as a Database Source for R

I couldn’t contain myself (other more pressing things to do, but…), so I just took a quick time out and a coffee to put together a quick and dirty R function that will let me run queries over Google spreadsheet data sources and essentially treat them as database tables (e.g. Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

Here’s the original function I used:

gsqAPI = function(key,query,gid=0){ return( read.csv( paste( sep="",'', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid) ) ) }

However, with a move to https, this function kept breaking. The one I currently use is:

gsqAPI = function(key,query,gid=0){ 
  tmp=getURL( paste( sep="",'', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ) ) )

It requires the spreadsheet key value and a query; you can optionally provide a sheet number within the spreadsheet if the sheet you want to query is not the first one.

We can call the function as follows:

gsqAPI('tPfI0kerLllVLcQw7-P1FcQ','select * limit 3')

In that example, and by default, we run the query against the first sheet in the spreadsheet.

Alternatively, we can make a call like this, and run a query against sheet 3, for example:
tmpData=gsqAPI('0AmbQbL4Lrd61dDBfNEFqX1BGVDk0Mm1MNXFRUnBLNXc','select A,C where <= 10',3)

My first R function

The real question is, of course, could it be useful.. (or even OUseful?!)?

Here’s another example: a way of querying the Guardian Datastore list of spreadsheets:

gsqAPI('0AonYZs4MzlZbdFdJWGRKYnhvWlB4S25OVmZhN0Y3WHc','select * where A contains "crime" and B contains "href" order by C desc limit 10')

What that call does is run a query against the Guardian Datastore spreadsheet that lists all the other Guardian Datastore spreadsheets, and pulls out references to spreadsheets relating to “crime”.

The returned data is a bit messy and requires parsing to be properly useful.. but I haven’t started looking at string manipulation in R yet…(So my question is: given a dataframe with a column containing things like <a href=””>Some Page</a>, how would I extract columns containing or Some Page fields?)

[UPDATE: as well as indexing a sheet by sheet number, you can index it by sheet name, but you’ll probably need to tweak the function to look end with '&gid=', curlEscape(gid) so that things like spaces in the sheet name get handled properly I’m not sure about this now.. calling sheet by name works when accessing the “normal” Google spreadsheets application, but I’m not sure it does for the chart query language call??? ]

[If you haven’t yet discovered R, it’s an environment that was developed for doing stats… I use the RStudio environment to play with it. The more I use it (and I’ve only just started exploring what it can do), the more I think it provides a very powerful environment for working with data in quite a tangible way, not least for reshaping it and visualising it, let alone doing stats with in. (In fact, don’t use the stats bit if you don’t want to; it provides more than enough data mechanic tools to be going on with;-)]

PS By the by, I’m syndicating my Rstats tagged posts through the R-Bloggers site. If you’re at all interested in seeing what’s possible with R, I recommend you subscribe to R-Bloggers, or at least have a quick skim through some of the posts on there…

PPS The RSpatialTips post Accessing Google Spreadsheets from R has a couple of really handy tips for tidying up data pulled in from Google Spreadsheets; assuming the spreadsheetdata has been loaded into ssdata: a) tidy up column names using colnames(ssdata) <- c("my.Col.Name1","my.Col.Name2",...,"my.Col.NameN"); b) If a column returns numbers as non-numeric data (eg as a string "1,000") in cols 3 to 5, convert it to a numeric using something like: for (i in 3:5) ssdata[,i] <- as.numeric(gsub(",","",ssdata[,i])) [The last column can be identifed as ncol(ssdata) You can do a more aggessive conversion to numbers (assuming no decimal points) using gsub("[^0-9]","",ssdata[,i])]

PPPS via Revolutions blog, how to read the https file into R (unchecked):

myCsv = getURL(httpsCSVurl)

A First Quick Viz of UK University Fees

Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.

The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)

My first thought was to go for a proportional symbol map. (Does anyone know of a javascript library that can generate proportional symbol overlays on a Google Map or similar, even better if it can trivially pull in data from a Google spreadsheet via the Google visualisation? I have an old hack (supermarket catchment areas), but there must be something nicer to use by now, surely? [UPDATE: ah – forgot this: Polymaps])

In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such 4,500-6,000) in the Fees column and replacing them with minumum fee values. Sorting the fees column as a numerical type with errors at the top made the columns that needed tweaking easy to find:

The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?) so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs); rather than use the value.parseJson().results[0].formatted_address construct, I generated a couple of columns from the JSON results column using value.parseJson().results[0].geometry.location.lng and value.parseJson().results[0]

Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:

Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?

What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…

The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for you course fee…;-)

Click thru’ to have a play with the chart yourself;-)

PS I can;t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE – apparently, HESA codes are in the spreadsheet…. ;-0]

PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?

PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages that we could find and offer up as a sacrifice to a Google Custom search engine ;-)

A Few More Thoughts on

As we come up to a week in on, there’s already an interesting collection of questions – and answers – starting to appear on the site, along with a fledgling community (thanks for chipping in, folks:-), so how can we maintain – and hopefully grow – interest in the site?

A couple of things strike me as the most likely things to make the site attractive to folk:

– the ability to find an appropriate – and useful – answer to your question without having to ask it, for example because someone has already asked the same, or a similar, question;
– timely responses to questions once asked (which leads to a sense of community, as well as utility).

I think it’s also worth bearing in mind the context that GetTheData sits in. Many of the questions result in answers that point to data resources that are listed in other directories. (The links may go to either the data home page or its directory page on a data directory site.)

Data Recommendations
One thing I think is worth exploring is the extent to which GetTheData can both receive and offer recommendations to other websites. Within a couple of days of releasing the site, Rufus had added a recommendation widget that could recommend datasets hosted on CKAN that seem to be related to a particular question. - related datasets on CKAN

What this means is that even before you get a reply, a recommendation might be made to you of a dataset that meets your requirements.

(As with many other Q&A sites, GetTheData also tries to suggest related questions to you when you enter you question, to prompt you to consider whether or not your question has already been asked – and answered.)

I think the recommendation context is something we might be able to explore further, both in terms of linking to recommendations of related data on other websites, but also in the sense of reverse links from GetTheData to those sites.

For example:

– would it be possible to have a recommendation widget on GetTheData that links to related datasets from the Guardian datastore, or National Statistics?
– are there other data directory sites that can take one or more search terms and return a list of related datasets?
– could a getTheData widget be located on CKAN data package pages to alert package owners/maintainers that a question possibly related to the dataset had been posted on GetTheData? This might encourage the data package maintainer to answer the question on the getTheData site with a link back to the CKAN data package page.

As well as recommendations, would it be useful for GetTheData to syndicate new questions asked on the site? For example, I wonder if the Guardian Datastore blog would be willing to add the new questions feed to the other datablogs they syndicate?;-) (Disclosure: data tagged posts from get syndicated in that way.)

Although I don’t have any good examples of this to hand from GetTheData, it strikes me that we might start to see questions that relate to obtaining data which is actually a view over a particular data set. This view might be best obtained via a particular query onto a particular data set. such as a specific SPARQL query on a Linked Data set, or a particular Google query language request to the visualisation API against a particular Google spreadsheet.

If we do start to see such queries, then it would be useful to aggregate these around the datastores they relate to, though I’m not sure how we could best do this at the moment other than by tagging?

News announcements
There are a wide variety of sites publishing data independently, and a fractured networked of data directories and data catalogues. Would it make sense for GetTheData to aggregate news announcements relating to the release of new data sets, and somehow use these to provide additional recommendations around data sets?

Hackdays and Data Fridays
As suggested in Bootstrapping for All Your Public Open Data Questions and Answers:

If you’re running a hackday, why not use to post questions arising in the scoping the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

Alternatively, how about “Data Fridays”, on the first Friday in the month, where folk agree to check GetTheData two or three times that day and engage in something of a distributed data related Question and Answer sprint, helping answer unanswered questions, and maybe pitching in a few new ones?

Aggregated Search
It would be easy enough to put together a Google custom search engine that searches over the domains of data aggregation sites, and possibly also offer filetype search limits?

So What Next?
Err, that’s it for now…;-) Unless you fancy seeing if there’s a question you can help out on right now at

Where Linked Data Would Be Useful – Creating More MPs’ Maps from the Guardian Politics API

So given the news from the Commons today, I was wondering where the current crop of MPs came from in terms of birthplace, school, and university… Would a map based view turn up from across the UK, or something a little more clumpy…?

Searching for sources of biographical data, two sources came to mind – Wikipedia infoboxes (and hence DBpedia) and the Guardian Politics API. In this post, I’ll describe a minimal – and not very reliable – recipe for plotting a map of UK MPs’ alma mater based on data grabbed from the Guardian Politics API, identifying a couple of ways in which the data could be made so much more useful, and indicating why the Linked Data approach is a Good Thing…

Just because, I’ll give a Yahoo Pipes recipe…

The first step is a handler pipe for grabbing an MP’s details from the Guardian Politics API from their Guardian ID:

MP details via Guardian Politics API

The next step is to get a list of current MPs and annotate the list with MP details using the helper pipe:

Lookup MP details

Not every MP has an alma mater listed, so we filter out the ones where there is no university information. We then use the university data as the input to a rough and ready geocoder, which does its best to identify a location and then geocode it. The Yahoo pipes trick of putting geo-data into the y:location attribute means that the pipe will automagically generate KML and map based previews of the output of the pipe.

Lookup location by university

Finally, we tidy up the feed a little:

Tidy the pipe...

Here’s the output of the resulting pipe:

MPs by university - badly coded...

Clicking on the various markers, we see that there is a lot of miscoding going on. Also, some MPs have several universities listed, which may also contribute to the confusion. (A rough and ready way of handling that would be to split the university field on a semi-colon, and just use the first listed university in the location lookup.)

So what would make things easier? The Guardian Politics API is getting the data out there, but can it be improved in any way in order to make it a little (or a lot?!) more useful in a machine automated context such as this?

I think so…

Here’s one possible approach: a few weeks ago, the JISC Monitoring unit published a lookup service for looking up UK HEIs using a variety of identifier schemes and a crude name based lookup, and returning synonymous identifers, canonical URLs and lat/long data: As identified in the announcement post, this information complements rather more formally some of that already collated in the Guardian’s Education Datastore Rosetta Stone spreadsheet…

(I’m not sure if Leigh Dodds looked at how the JISCMU data could be used as part of a Google Refine reconciliation API service? I seem to remember a brief flurry of tweets on a related topic at the time…;-)

So, what would be really useful would be for the Guardian Politics API to use a weak Linked Data approach and provide a list of HEI identifiers using a formal identification scheme such as UCAS or HESA codes so that we knew which institutions they were actually referring to; (though this wouldn’t cope with overseas universities… Hmm… is there an international identifier scheme for universities?)

We could then hop over to the JISCMU service and pull down the lat/long information, before popping it on a map.

Looking deeper into the Guardian Politics API, we also see a field for listing the MPs’ schools… which in turn could be enhanced by including identifiers used in the education datastore.

So – Linked Data: can you see how it works yet? And do you get the feeling that network effects could kick in to place really quickly as data is enhanced with linking elements such as well defined identifiers using know identification schemes?

PS Chris Gutteridge has also picked up the challenge of this post, contributing a list of DBPedia URIs for current MPs to the cause. I wouldn’t be surprised if he turns up a whole load more data actually cracks the problem way before I do!

Ah – seems like Chris has been on the case, and produced, (with caveats: “Note that the data is patchy. It only shows MPs with a geocoded birthplace/university listed on dbpedia”) a map [updated] of 313 MPs’ birthplaces:

MPs birthplaces

as well as a map [updated] of 176 MPs’ universities (though I don’t have a valid link for this… yet…;-) Ah – here it is:

MPs alma mater map

UPDATE: here’s the recipe – Studying the MPs

PPS I really need to add Chris’ geo-tagged RDF to KML converter service (described here to my toolkit…