One of the many things I’d like to spend my time doing is tinkering with data journalism doodles relating to local news stories. For example, via our local hyperlocal blog, I saw this post announced today: Isle of Wight has highest percentage of secondary school absentee rates in country. The post included a link to a Department for Education page (Pupil absence in schools in England, including pupil characteristics) containing links to the statistical release and the associated data sets:
Here’s what we get in the zipped datafile:
The school level dataset had the following column headings:
Year, country_code, country, GOR, GOR_code, LA, new_LA_code, LA_Name, URN, Estab, LAEstab, School_name, School_type, Academy_Flag, Academy_open_date, enrol_sum, SessionsPossible_sum, OverallAbsence_sum, AuthorisedAbsence_sum, UnauthorisedAbsence_sum, overall_absence_percent, auth_absence_percent, unauth_absence_percent, PA_15_sum, possible_sessions_pa_15_sum, overall_abs_pa_15_sum, auth_abs_pa_15_sum, unauth_abs_pa_15_sum, overall_absence_percent_PA_15, auth_absence_percent_PA_15, unauth_absence_percent_PA_15, sess_auth_illness, sess_auth_appointments, sess_auth_religious, sess_auth_study, sess_auth_traveller, sess_auth_holiday, sess_auth_ext_holiday, sess_auth_excluded, sess_auth_other, sess_auth_totalreasons, sess_auth_unclass, sess_unauth_holiday, sess_unauth_late, sess_unauth_other, sess_unauth_noyet, sess_unauth_totalreasons, sess_unauth_unclass, sess_overall_totalreasons
We can guess at what some of these refer to, but what, for example, do the “PA 15” columns refer to? In this case, what we really should do is look up the actual definitions, which are described in the metadata description document; a document that just happens to be a Microsoft Word 2007 formatted document…
…a document that doesn’t play nicely either with the copy of Word I have on my Mac:
…or the converter that the Google docs uploader uses:
In cases such as this, particularly where there are mathematical equations that often have very specific layout requirements, it can be “safer” to use a document format such as PDF that more reliably captures the appearance of the original page. (If we were really keen on reproducibility, we might also suggest that the equations were made available in an executable form, such as programme code or even as a spreadsheet (I’m not sure “Microsoft equations” are executable?).)
I gave myself a couple of hours to have a quick look through some of the data, but as it is I’ve spent an hour or so looking for ways of reading the metadata description document along with writing up my frustration around not being able to do so… Which is time spent not making sense of the data, or, indeed, its metadata…
PS in passing, I note the publication of the parliamentary Public Accounts Committee 37th report, Whole of Government Accounts 2010-11 again picks up on the way in which government data releases often fall short in terms of their usability (for example, this week MPs call for greater use of Whole of Government Accounts; see also last August Government must do better on transparency, say MPs).
PPS Here’s the solution I used in the end – Skydrive, Microsoft’s online storage/doc viewing play:
As it turns out, the equations could easily have been written using simple text strings…
PPPS as to the “15” columns, the metadata files describes them along the following lines:
PA_15_sum Number of enrolments classed as persistent absentees (threshold of 15 per cent)
possible_session_pa_15_sum Sessions possible for persistent absentees (threshold of 15 per cent)
Which means what exactly?!