More Notes on School Examinations Data
In an earlier post on awarding body market share in UK school examinations, I described an OfQual dataset that listed the number of certificates awarded by certificate name and qualification level by the various awarding bodies. We can use that sort of data to see market share by certificates awarded, but the dataset does not give us any insight into the grades awarded by the different bodies, which might allow us to ask a range of other questions: for example, do any exams appear to be “easier” or “harder” than others, simply based on percentages awarded at each grade by different bodies or within different subject areas (that is the distribution of grades; note that statistical assumptions may be used to tweak grade boundaries, so we need to be careful here about what questions we even think we might be able to ask…).
Just a quick aside here: WhatDoTheyKnow suggests that the JCQ is not FOIable, although OfQual is, and routinely uses “publicly available information” from the JCQ in the formulation of its own reports; similar data is also published by the Awarding bodies, although again in the form of informally structured tabular data within PDF files. In response to an FOI request I made to OfQual, the following statement appears:
Because this information is already accesible (sic) to you from JCQ and Awarding Organisations it is exempt from disclosure under Section 21 of the Act because the information is already in the public domain.
So, to clarify:
- OfQual is an FOIable body that has published some data as information; I don’t know whether it holds the information as data
- The data I requested is available as information in the form of PDF documents from two classes of non-FOIable body: the JCQ, a charity; the Awarding Bodies, commercial companies.
Under FOI 2000 s. 11 (as amended by Protection of Freedoms Act 2012), if you request all or part of a dataset in an electronic form “the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.” I don’t know if there are any cases out there arguing the toss about how to interpret this (PDF doc, CSV or SQL dump good, etc? If you know of any, please add a comment…) but I’d argue that the data is not available in that form. So a question that naturally follows is: does this affect the reading of Section 21 of the Act “Information which is reasonably accessible to the applicant … is exempt information.”? (BTW, this looks handy – FOI Wiki, though I’m not sure it’s being maintained?) Similarly, if a public body publishes a dataset in the form of a PDF document and not as data, can I FOI a request for that information as data notwithstanding that the information is available in a different, less accessible form? Or will they throw s. 21 back at me? [Via a tweet, @paulbradshaw suggests that a request for machine readable/data version of info released as PDF will typically be satisfied.]
Now where was I..?!
Oh yes… the JCQ data as PDF… well it just so happens that the data is available as data from the Guardian Datastore: GCSE results 2012: exam breakdown by subject, gender and area [data] (I’m not sure if they scraped the A’level results too?). However, the breakdown does not go as far as distributions by award board, and the linkage between subject areas and the certificate titles used in the OfQual dataset is not obvious (there may be mappings in the data documentation/explanatory notes; I haven’t looked).
Another aside: could the FOIable body point to a scraped dataset published by a third party as evidence that the information is available in a reusable form, even if the reusable format was not published directly by the original FOIable body? That is, if a council publishes data as PDF, and someone scrapes it using Scraperwiki, making it available “as data”, could the council point to the Scraperwiki database as evidence of “[i]nformation which is reasonably accessible to the applicant”? How would they know the data was valid? How about the concrete case here of the JCQ PDF data being scraped by the Guardian Datastore folk and republished as a Google Spreadsheet? And here’s another thought: if I were known to be a demon PDF hacker, would that affect the interpretation of “reasonably accessible”?
If we really wanted to look at distributions of grades by certificate and Awarding Body, we’d probably need to go to the horse’s mouth. So for example, EdExcel grade statistics, AQA results statistics, OCR results stats, CEA Statistics. But again, this data is only available in PDF form, and the companies that publish it aren’t FOIable. (If you’re running – or know of – scrapers grabbing this data, please let me know via the comments). Note that if this “source” data were available, we should be able to check it against the original OfQual data (at least, we should be able to check totals by award board and certificate).
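As a minimal sketch of that sort of check, assuming the awarding body PDFs had already been scraped into rows of (awarding body, certificate, grade, count): sum the grade counts and compare them with the published totals. The body names, certificate names, numbers and the function names totals_from_grades and reconcile are all invented for illustration; none of them come from the real datasets.

```python
from collections import defaultdict

# Hypothetical rows, as might be scraped from an awarding body's PDF:
# (awarding_body, certificate, grade, count). All values are invented.
grade_rows = [
    ("Body A", "Certificate X", "A", 120),
    ("Body A", "Certificate X", "B", 200),
    ("Body A", "Certificate X", "C", 180),
    ("Body B", "Certificate X", "A", 90),
    ("Body B", "Certificate X", "B", 110),
]

# Hypothetical totals in the style of the OfQual dataset (also invented).
ofqual_totals = {
    ("Body A", "Certificate X"): 500,
    ("Body B", "Certificate X"): 210,
}


def totals_from_grades(rows):
    """Sum grade counts to get certificates awarded per (body, certificate)."""
    totals = defaultdict(int)
    for body, cert, _grade, count in rows:
        totals[(body, cert)] += count
    return dict(totals)


def reconcile(scraped, published):
    """Return {key: (scraped_total, published_total)} for every mismatch.

    A key missing from one side shows up with None on that side.
    """
    mismatches = {}
    for key in sorted(set(scraped) | set(published)):
        s, p = scraped.get(key), published.get(key)
        if s != p:
            mismatches[key] = (s, p)
    return mismatches


mismatches = reconcile(totals_from_grades(grade_rows), ofqual_totals)
for key, (s, p) in mismatches.items():
    print(key, "scraped:", s, "published:", p)
```

A mismatch would not say which source is wrong, only that the two need reconciling (different reporting periods, late awards, scraping errors and so on).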
Of course, I could possibly go straight to the OfQual annual market report [PDF] to see market segment breakdowns; but I think that was where I started (the pie charts immediately started to put me off!) – and it’s not really the datajunkie way, is it, seeing reports containing tables and charts and not being able to recreate them?;-)
So what DDJ lessons do we learn from all this? One thing may be that as data goes along a publishing chain, it tends to get summarised, which then limits the sorts of questions you can ask of it. By unpicking the chain, and getting access to ever finer grained data, we get ourselves into a position whereby we should in principle be able to regenerate the summary reports from the next level down; but we may also be faced with trying to reconcile the data or fit it into the categories that are referred to in the original reports. For transparency as reproducibility, what we need is for reports that publish summary data to also publish two other things: 1) the full set of data that was summarised; 2) the formulae used to generate the summaries from that full data set. Of course, it may be that there are multiple summary steps in the chain (report A generates summaries of dataset B, which itself summarises or represents a particular view over a dataset C). In the current example, OfQual publishes data about certificates awarded by each Awarding Body but no grades; JCQ has grade data across awards but no awarding body data (though in some cases we may be able to recreate that – eg where only a single awarding body offers certificates in a particular area); the awarding bodies publish the finest grained data of all – grade distributions by certificate (and rather obviously, this data is at the level of a particular awarding body).
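To make the "full data plus formulae" idea concrete, here is a rough sketch of how both coarser views might be regenerated from the finest grained level. It assumes grade counts per awarding body and subject are available as simple rows; the data, the names and the two functions certificates_by_body and grade_percentages are hypothetical, not anything the bodies actually publish in this form.

```python
from collections import defaultdict

# Invented finest-grained data: (awarding_body, subject, grade, count).
finest = [
    ("Body A", "Subject X", "A", 30),
    ("Body A", "Subject X", "B", 70),
    ("Body B", "Subject X", "A", 50),
    ("Body B", "Subject X", "B", 50),
]


def certificates_by_body(rows):
    """OfQual-style view: certificates awarded per body, grades dropped."""
    totals = defaultdict(int)
    for body, _subject, _grade, count in rows:
        totals[body] += count
    return dict(totals)


def grade_percentages(rows):
    """JCQ-style view: % at each grade per subject, awarding body dropped."""
    counts = defaultdict(int)
    subject_totals = defaultdict(int)
    for _body, subject, grade, count in rows:
        counts[(subject, grade)] += count
        subject_totals[subject] += count
    return {
        (subject, grade): 100 * n / subject_totals[subject]
        for (subject, grade), n in counts.items()
    }


print(certificates_by_body(finest))
print(grade_percentages(finest))
```

Each function is, in effect, one of the "formulae": published alongside the full data, it would let anyone regenerate the summary, whereas going the other way (recovering the rows from either summary) is not generally possible.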