(Superfluous?) Jupyter / pandas Notebook for Wrangling TEF Excel Spreadsheets

A couple of days ago, new Wonkhe employee Dave Kernohan (congrats!:-) got in touch asking if I’d be interested in helping wrangle the as then yet to be released TEF (Teaching Excellence Framework) data into a form they could work with. The suspicion was that if the data was be released in a piecemeal fashion – one Excel spreadsheet per institution – rather than as a single data set, it would be a bit of a faff to analyse…

I found an example of the sort of spreadsheet it looked might be used to publish the data in, and started hacking a notebook to parse the sheets and pull the data into a set of monolithic files.

As it was, the data was published in a variety of forms, rendering the notebook superfluous. But I’ll post it anyway as a curiosity here.

FWIW, the final, published spreadsheets differed slightly in layout and in the naming of the internal sheets, both of which broke the original notebook. The revised notebook is no less brittle – cell ranges are typically hard coded, rather than being auto detected.

The spreadsheets contain many compound tables – for example, in the following case we have full-time and part time student data in the same table. (I parse these out by hard coded cell ranges – I really should autodetect the row number of the row the headcount headers appear on and use those values, noting the number of rows in each subtable is the same…)

Also note the the use of single and multi-level columns headings in adjacent columns.

A single sheet may also contain multiple tables. For example, like this:

or like this:

As mentioned, the sheets also contained adminstrative data cruft, as revealed when opening the full sheet into a pandas dataframe:

tef-excel_4

Anyway – superfluous, as the data exists in a single long form CSV file anyway on the TEF data download site. But maybe useful as a crib for the future. Here’s a reminder of the link to the notebook.

PS another tool for wrangling spreadsheets that I really need to get my head round is databaker, which was originally developed for working with the spreadsheet monstrosities that ONS produce…

PPS the OU actively opted out of TEF, citing “justifiable reasons”… It will be interesting to see the NSS results this year… (and also see how NSS and TEF scores correlate).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...