A Briefest of Looks at the REF 2014 Results Data – Reflection on a Data Exercise

At a course/module team meeting for the new OU data course […REDACTED…], which will involve students exploring data sets that we’ve given them, as well as ones that they will hopefully find for themselves, it was mentioned that we really should get an idea of how long the exercises we’ve written, are writing, and have yet to write will take students to do.

In that context, I noticed that the UK Higher Education Research Excellence Framework (REF) 2014 results were out today, so I gave myself an hour to explore the data, see what’s there, and get an idea of some of the more obvious stories we might try to pull out.

Here’s as far as I got: an hour-long conversation from a standing start with the REF 2014 data.

Although I did have a couple of very minor interruptions, I didn’t get as far as I’d expected/hoped.

So here are a few reflections, as well as some comments on the constraints I put myself under:

  • the students will be working in a headless virtual machine we have provided them with, and we don’t require them to have access to a spreadsheet application; OpenRefine runs on the VM, so that could be used to get a preview of the spreadsheet (though I’m not sure how well OpenRefine copes with a spreadsheet that contains multiple sheets); given all that, I thought I’d try to explore the data purely within an IPython notebook, without (as @andysc put it) eyeballing it in a spreadsheet first;
  • I didn’t really read any of the REF docs, so I wasn’t really sure how the data would be reported. I’m not sure how much time it would have taken to read up on how the data is reported, or what sort of explanatory notes and/or external metadata are provided.
  • I had no real idea what questions to ask or reports to generate. “League tables” was an obvious one, but calculated how? You need to know what numbers are available in the data set, and how they may (or may not) relate to each other, to start down that track. I guess I could have looked at the distributions down a column, grouped them in different ways, and then started to look for outliers, at least as visually revealed.
  • I didn’t do any charts at all. I had it half in mind to do some dodged bar charts, eg to show how the different profiles were scored within each unit of assessment for a given institution, but ran out of time before I tried that. (I couldn’t remember offhand what shape the data needs to be in to plot that sort of chart, and wasted a minute or two, gardener’s foot on fork, staring into the distance and pondering what we could do if I cast (unmelted) the separate profile data into different columns for each return, but then decided it’d use up too much of my hour trying to remember/look up how to do that, let alone then trying to make up things to do with the data once it was in that form. There’s a rough sketch of the reshaping I had in mind at the end of this list.)
  • The exploration/conversation demonstrated grouping, sorting and filtering, though I didn’t limit the range of columns displayed. I did use a few cribs, both from the pandas online documentation and from other notebooks we have drafted for student use (eg on how to generate sorted group/aggregate reports on a dataframe); a sketch of that sort of report appears after this list.
  • our assessment will probably mark folk down for not doing graphical stuff… so I’d have lost marks for not putting in even a quick chart, such as a bar chart counting the number of institutions submitting to each unit of assessment (also sketched below);
  • I didn’t generate any derived data – again, this is something we’d maybe mark students down on. An example I saw just now in the OU internal report on the REF results is GPA – grade point average. I’m not sure exactly how it’s calculated, but while in the data I was wondering whether I should explore some function of the star ratings (eg 4 x (num at 4*) + 3 x (num at 3*) … etc – see the sketch after this list) or some function of the number of FTEs and the star rating results.
  • Mid-way through my hour, I noticed that Chris Gutteridge had posted the data as Linked Data; Linked Data and SPARQL querying is another part of the course, so maybe I should spend an hour seeing what I can do with that endpoint from a standing start? (Hmm.. I wonder – does the Gateway to Research also have a SPARQL endpoint?)
  • The course is, in part, about database management systems, but I didn’t put the data into either PostgreSQL or MongoDB, the two systems we introduce students to, or discuss the rationale for which database might have been a useful store for the data, or the extent to which normalisation was required (eg taking the data to third normal form or thereabouts, and perhaps actually demonstrating that). (In the course, we’ll probably also show students how to generate RDF triples they can run their own SPARQL queries against.) Nor did I throw the dataframe into SQLite using pandasql, which would perhaps have made it easier (and quicker?) to write some of the queries using SQL rather than the pandas syntax – there’s a pandasql sketch after this list.
  • I didn’t link in to any other datasets, which again is something we’d like students to be able to do. At the more elaborate end, that might have meant pulling in data from something like Gateway to Research; a quicker hack might have been to annotate the data with administrative information, which I guess could be pulled from one of the datasets on data.ac.uk.
  • I didn’t do any data joining or merging; again, I expect we’ll want students to be able to demonstrate this sort of thing in an appropriate way, eg as a result of merging data in from another source (a toy merge example appears after this list).
  • writing filler text (setting the context, explaining what you’re going to do, commenting on results, etc) in the notebook takes time… (That is, the hour is not just taken up by writing the code/queries; there is also time spent, but not seen, in coming up with questions to ask, converting them to queries, and then reading, checking and mentally interpreting the results.)
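
For what it’s worth, here are a few rough sketches of the sorts of things mentioned above that I didn’t get round to, all run against toy dataframes with made-up column names rather than the actual REF results file (which may well label things differently). First, the reshaping I had half in mind for the dodged bar charts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for the REF results sheet; the column names and
# values are assumptions and may not match the published file
df = pd.DataFrame({
    'Institution name': ['Univ A'] * 4 + ['Univ B'] * 4,
    'Unit of assessment name': ['Computer Science'] * 8,
    'Profile': ['Outputs', 'Impact', 'Environment', 'Overall'] * 2,
    '4*': [30, 40, 25, 32, 20, 50, 35, 28],
})

# Cast ("unmelt") the long-form profile rows into one column per profile -
# the shape needed for a dodged/grouped bar chart of the 4* scores
wide = df.pivot_table(index='Institution name', columns='Profile', values='4*')

# One cluster of bars per institution, one bar per profile
wide.plot(kind='bar')
plt.ylabel('% of submission rated 4*')
plt.show()
```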
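
Second, the sort of grouped/sorted report the conversation was built around, plus the “quick chart” of institution counts per unit of assessment (again, the column names are guesses):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the results data; column names are guesses, not the real ones
df = pd.DataFrame({
    'Institution name': ['Univ A', 'Univ A', 'Univ B', 'Univ B', 'Univ C'],
    'Unit of assessment name': ['Computer Science', 'History',
                                'Computer Science', 'History', 'History'],
    'Profile': ['Overall'] * 5,
    '4*': [32, 18, 28, 22, 40],
})

# Filter to the Overall profile, then rank institutions within each
# unit of assessment by their 4* score
overall = df[df['Profile'] == 'Overall']
ranked = (overall
          .sort_values(['Unit of assessment name', '4*'], ascending=[True, False])
          .groupby('Unit of assessment name')
          .head(10))
print(ranked)

# The "quick chart": how many institutions submitted to each unit of assessment
counts = overall.groupby('Unit of assessment name')['Institution name'].nunique()
counts.plot(kind='bar')
plt.ylabel('Number of institutions')
plt.show()
```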
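
Third, the GPA-style derived column, on the assumption (which I haven’t checked) that the star-rating columns report the percentage of each submission at each level:

```python
import pandas as pd

# Toy overall-profile rows; the star-rating columns are assumed to hold the
# percentage of each submission at that level
df = pd.DataFrame({
    'Institution name': ['Univ A', 'Univ B'],
    '4*': [30, 20],
    '3*': [45, 50],
    '2*': [20, 25],
    '1*': [5, 5],
    'unclassified': [0, 0],
})

# Grade point average: weight each star level by its value; dividing by 100
# turns the percentages into a score out of 4
df['GPA'] = (4 * df['4*'] + 3 * df['3*'] + 2 * df['2*'] + 1 * df['1*']) / 100

print(df.sort_values('GPA', ascending=False))
```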
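
Fourth, the pandasql route, which pushes the dataframe through an in-memory SQLite database so the query can be written in SQL:

```python
import pandas as pd
from pandasql import sqldf

# Toy dataframe with SQL-friendly column names (no '*' characters)
ref = pd.DataFrame({
    'institution': ['Univ A', 'Univ B', 'Univ C'],
    'uoa': ['Computer Science', 'Computer Science', 'History'],
    'four_star': [32, 28, 40],
})

q = """
SELECT uoa, institution, four_star
FROM ref
WHERE uoa = 'Computer Science'
ORDER BY four_star DESC;
"""

# sqldf() loads the dataframes referenced in the query from the supplied
# namespace into an in-memory SQLite database and runs the query against them
print(sqldf(q, locals()))
```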
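
And finally, a toy merge with a completely made-up administrative dataset, standing in for whatever can actually be pulled from data.ac.uk or Gateway to Research:

```python
import pandas as pd

# Toy REF-style results
results = pd.DataFrame({
    'Institution name': ['Univ A', 'Univ B'],
    '4*': [32, 28],
})

# Entirely made-up administrative annotations, standing in for a real
# external dataset keyed on institution name
admin = pd.DataFrame({
    'Institution name': ['Univ A', 'Univ B'],
    'Region': ['South East', 'Scotland'],
})

# A left join on institution name keeps every results row and annotates it
merged = pd.merge(results, admin, on='Institution name', how='left')
print(merged)
```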

One thing I suggested to the course team was that we all spend an hour on the data and see what we come up with. Another thing that comes to mind is what I might now be able to achieve in a second hour, and then a third. (This post has taken maybe half an hour?)

Another approach might have been to hand off notebooks to each other, the second person building on the first’s notebook etc. (We’d need to think how that would work for time: would the second person’s hour start before – or after – reading the first person’s notebook?) This would in some way model providing the student with an overview of a dataset and then getting them to explore it further, giving us an estimate of timings based on how well we can build on work started by someone else, but still getting us to work under a time limit.

Hmmm.. does this also raise the possibility of some group exercises? Eg one person normalises the data and gets it into PostgreSQL, someone else gets some additional linkable data into the mix, someone starts generating summary textual reports and derived data elements, someone generates charts/graphical reports, someone explores the Linked Data approach?

PS One other thing I didn’t look at, but that is a good candidate for all sorts of activity, would be to try to make comparisons with previous years. That requires finding and ordering previous results, comparing rankings, and deciding whether the rankings actually refer to similar things (and the extent to which we can compare them at all). There are also data protection issues: could we identify folk likely to have been included in a return just from the results data, annotated perhaps with data from Gateway to Research or institutional repositories?