OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data

So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack, Filtering Guardian University Data Every Which Way You Can…, was a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, with gid numbers 0 and 1 respectively). Sheets 2..47 are formatted as follows:

[Image: Guardian 2014 university guide data table layout]

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

library(RCurl)

#Query a single sheet of a Google spreadsheet via the visualisation API,
#returning the result as a dataframe
gsqAPI = function(key,query,gid=0){
  #Build the API URL (escaping the query), request a CSV response, and
  #skip SSL certificate verification to work around https hassles
  tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  #Parse the CSV response into a dataframe
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )
}

The query is a query written using the SQL-like Google visualisation API query language.
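As a minimal, self-contained illustration of what the function actually requests, here’s a sketch of the URL it builds, using base R’s URLencode in place of RCurl’s curlEscape, and a placeholder key (SPREADSHEET_KEY is hypothetical, not the real key):

```r
# Sketch of the request URL gsqAPI constructs; URLencode stands in for
# RCurl's curlEscape, and the key is a placeholder
key   <- "SPREADSHEET_KEY"
query <- "select * where B!=''"
url <- paste0(
  "https://spreadsheets.google.com/tq?",
  "tqx=out:csv",                            # ask for CSV output
  "&tq=", URLencode(query, reserved = TRUE), # percent-encode the query
  "&key=", key,
  "&gid=", 0
)
url
```

The tqx=out:csv parameter is what lets us parse the response with a plain read.csv rather than unpicking the API’s default JSON wrapper.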

The following routine will download non-empty rows from a specific sheet, and rename the subject specific rank column to a generically titled column (“Subject Rank”). An additional column is added to the table that denotes what subject the table is associated with.

handler=function(key,i){
  #Grab the non-empty rows of sheet i
  tmp=gsqAPI(key,"select * where B!=''", i)
  #The first column holds the subject specific rank (e.g. "Law.Rank");
  #extract the subject name from it, then rename the column generically
  subject=sub(".Rank",'',colnames(tmp)[1])
  colnames(tmp)[1]="Subject.Rank"
  #Record which subject the rows came from
  tmp$subject=subject
  tmp
}

We can now download the first subject specific sheet using the following call:

key='0Aq73qj3QslDedHFNdUsxOVZQZ1dmbXNrYlZGUWgwdHc'
gdata=handler(key,2)

This loads in the data as follows:

[Image: preview of the first sheet’s ranking data loaded into the dataframe]

We can then add data to this dataframe from all the other sheets, to give us a single data frame containing ranking data for all the subjects listed.

for (i in 3:47){
  gdata=rbind(gdata,handler(key,i))
}

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)
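A more idiomatic alternative is to build a list of per-sheet dataframes with lapply and bind them all at once with do.call(rbind, …). The sketch below demonstrates the pattern with a stand-in handler (a hypothetical stub returning toy data) so it runs without touching the network; with the real handler you’d pass sheets 2:47:

```r
# Idiomatic sheet-combining pattern: lapply over the sheet gids, then
# rbind all the resulting dataframes in one go. A stub stands in for
# the real network-bound handler so the pattern can be run offline.
stub.handler <- function(key, i) {
  data.frame(Subject.Rank = 1:2,
             Name.of.Institution = c("Uni A", "Uni B"),
             subject = paste0("Subject", i),
             stringsAsFactors = FALSE)
}
sheets <- 2:5   # the real data uses sheets 2:47
gdata <- do.call(rbind, lapply(sheets, function(i) stub.handler(key = "K", i)))
nrow(gdata)   # 8 rows: two per sheet
```

Growing a dataframe inside a for loop with repeated rbind calls copies the whole table on every iteration; the lapply/do.call form does a single bind at the end, which also scales rather better.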

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be numeric are instead being treated as characters.

[Image: column data types – numeric columns loaded as characters]

We can fix this with a one-liner that forces the type on the columns we select (the fourth to the eleventh column):

gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors
gdata$Name.of.Institution=as.factor(gdata$Name.of.Institution)
gdata$subject=as.factor(gdata$subject)

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]

[Image: cross-subject rankings for Oxford Brookes]

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and satisfaction levels, whilst also paying heed to the value add score?

library(ggplot2)
oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

[Image: Oxford Brookes courses plotted by teaching and assessment satisfaction, sized by value added score]

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking the data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

Written by Tony Hirst

June 20, 2013 at 3:35 pm

Posted in Rstats


4 Responses


  1. […] Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide …, I showed how to grab some data from several dozen commonly formatted sheets in a Google […]

  2. […] the story to date. The first post, which tells how to grab the data into R, can be found in Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide …; the second, how to start building a simple interactive viewer for the data using the R shiny […]

  3. Tony – here’s a google visualization of the same data

    http://ramblings.mcpher.com/Home/excelquirks/google-visualization/universityranking

    Bruce

    • Thanks Bruce… Next on my to do list is grabbing data from previous years into one big db to allow year on year comparison…

      Tony Hirst

      June 28, 2013 at 8:57 am


