When Machine Readable Data Still Causes “Issues” – Wrangling Dates…
With changes to the FOI Act brought about by the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I’ll give an example of a dataset that is, arguably, released in a machine readable way (as an Excel spreadsheet) but that still requires quite a bit of work to become useful as data; presumably the intent behind the aforementioned amendment to the FOI Act is to make data releases useful and useable as data? As a secondary result, through trying to make the data useful as data, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to… Which makes those data columns misleading at best, useless at worst… And as to the February data in a release allegedly relating to a weekly release from November…? Sigh…
[Note - I'm not meaning to be critical in the sense of "this data is too broken to be useful so don't publish it". My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more "accessible" open data releases published in the future. ]
So what, and where is, the data…? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the NHS Winter pressures daily situation reports 2012 – 13. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own…
The data looks like it’ll be released in a set of weekly Excel spreadsheets, with a separate sheet for each data report.
All well and good… almost…
If we load the data into something like Scraperwiki, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal “9-Nov-2012”), they are represented as date types (in this case, the number of days since a baseline starting date). A quick check on StackOverflow turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:
    #http://stackoverflow.com/a/1112664/454773
    import datetime

    def minimalist_xldate_as_datetime(xldate, datemode):
        # datemode: 0 for 1900-based, 1 for 1904-based
        return (
            datetime.datetime(1899, 12, 30)
            + datetime.timedelta(days=xldate + 1462 * datemode)
        )
The next thing we notice is that some of the date column headings actually specify: 1) date ranges, 2) in a variety of styles across the different sheets. For example:
- 16 – 18/11/2012
- 16 Nov 12 to 18-NOV-2012
- 16 to 18-Nov-12
In addition, we see that some of the sheets split the data into what we might term further “subtables” as you should notice if you compare the following sheet with the previous one shown above:
Notwithstanding that the “shape” of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I’ll describe in another post), we are faced with a problem here: if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented date formats into a single form, ideally a date structure that Python or the Scraperwiki SQLite database can recognise as such.
[For a library that can automatically reshape this sort of hierarchical tabular data arrangement in R, see Automatic Conversion of Tables to LongForm Dataframes]
The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto “fromDates” and “toDates”. ***BEWARE – I DON’T KNOW IF THIS IS THE CORRECT THING TO DO*** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value that I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:
In this case, cellValue is the value of a header cell that is represented as an Excel time element, book is the workbook, as parsed using the xlrd library:
    import scraperwiki
    import xlrd

    # spreadsheetURL is the address of one of the weekly Excel files
    xlbin = scraperwiki.scrape(spreadsheetURL)
    book = xlrd.open_workbook(file_contents=xlbin)
and book.datemode is a workbook attribute that records how dates are being represented in the spreadsheet (0 for 1900-based, 1 for 1904-based). If the conversion fails, we default to setting dateString to the original value:
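A minimal sketch of the kind of construction described above, using only the standard library so that it runs on its own. The function name excelHeaderToDateString is hypothetical, and in the scraper the datemode argument would be book.datemode rather than a literal:

```python
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
    )

def excelHeaderToDateString(cellValue, datemode):
    # Hypothetical helper: returns YYYY-MM-DD for an Excel serial date,
    # or the original cell value if it can't be treated as one
    try:
        asDate = minimalist_xldate_as_datetime(cellValue, datemode)
        return asDate.strftime("%Y-%m-%d")
    except (TypeError, ValueError, OverflowError):
        return cellValue

# A numeric header cell is converted...
dateString = excelHeaderToDateString(41222.0, 0)
# ...whereas a literal date range string is passed through untouched
rangeString = excelHeaderToDateString("16 to 18-Nov-12", 0)
print(dateString, rangeString)
```

Catching only the exceptions that a non-numeric or out-of-range cell would raise keeps genuinely unexpected errors visible, which seems preferable to a bare except in a scraper that runs unattended.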
The next step was to look at the date range cells, and cast any “literal” date strings into a recognised date format. (I’ve just realised I should have optimised the way this is called in the Scraperwiki code – I am doing so many unnecessary lookups at the moment!) In the following snippet, I look to see whether we can split the cell value into “from” and “to” portions of a date range, and then normalise each part:
    import time
    from time import mktime
    import datetime

    def dateNormalise(d):
        #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
        #The idea is to try to identify the date pattern used, and parse the string accordingly
        for trials in ["%d %b %y", '%d-%b-%y', '%d-%b-%Y', '%d/%m/%Y', '%d/%m/%y']:
            try:
                dtf = datetime.datetime.fromtimestamp(mktime(time.strptime(d, trials)))
                break
            except (ValueError, TypeError):
                #No match for this pattern, so keep the original value and try the next one
                dtf = d
        if type(dtf) is datetime.datetime:
            dtf = dtf.strftime("%Y-%m-%d")
        return dtf

    def patchDate(f, t):
        #Grab the year and month elements from the todate (YYYY-MM-DD), and add in the from day of month number
        tt = t.split('-')
        #Zero-pad the day so that the result is also in YYYY-MM-DD form
        fromdate = '-'.join([str(tt[0]), str(tt[1]), str(f).zfill(2)])
        return fromdate

    def dateRangeParse(daterange):
        #In this first part, we simply try to identify from and to portions
        dd = daterange.split(' to ')
        if len(dd) < 2:
            #That is, split on 'to' doesn't work
            dd2 = daterange.split(' - ')
            if len(dd2) < 2:
                #Doesn't split on '-' either; set from and todates to the string, just in case.
                fromdate = daterange
                todate = daterange
            else:
                fromdate = dd2[0]
                todate = dd2[1]
        else:
            fromdate = dd[0]
            todate = dd[1]
        #By inspection, the todate looks like it's always a complete date, so try to parse it as such
        todate = dateNormalise(todate)
        #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
        fromdate = dateNormalise(fromdate)
        if len(fromdate) < 3:
            fromdate = patchDate(fromdate, todate)
        return (fromdate, todate)

    #USAGE:
    (fromdate, todate) = dateRangeParse(dateString)
One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.
The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.
When it comes to using the data, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven’t checked), compared to daily totals. Indeed, as the PS below shows, I’m now starting to doubt what the number in the date range column represents. Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?
[This was written under the assumption that it was summed daily values over the period, which the PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate “as-if daily” returns simply by dividing ranged date totals by the number of days in the range. A more “truthful” approach may be to plot summed counts over time (date on the x-axis, sum of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the “toDate” date as its x/date value?
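Both ideas can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the numbers in the usage lines are made up, and either calculation only makes sense for columns whose ranged values really are sums over the range:

```python
from datetime import date

def daysInRange(fromDate, toDate):
    # Inclusive count, so 16 Nov to 18 Nov covers 3 days
    return (toDate - fromDate).days + 1

def asIfDaily(total, fromDate, toDate):
    # Spread a ranged total evenly over the days in the range
    return total / daysInRange(fromDate, toDate)

def cumulativeSeries(rows):
    # rows: (fromDate, toDate, value) tuples; each increment is
    # plotted against its toDate, as suggested above
    running = 0
    series = []
    for fromDate, toDate, value in sorted(rows, key=lambda r: r[1]):
        running += value
        series.append((toDate, running))
    return series

# Made-up values, purely to show the shape of the output
rows = [
    (date(2012, 11, 16), date(2012, 11, 18), 6),
    (date(2012, 11, 15), date(2012, 11, 15), 2),
    (date(2012, 11, 19), date(2012, 11, 19), 1),
]
print(asIfDaily(6, date(2012, 11, 16), date(2012, 11, 18)))
print(cumulativeSeries(rows))
```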
When I get a chance, I’ll do a couple more posts around this dataset:
– one looking at datashaping in general, along with an example of how I shaped the data in this particular case
– one looking at different queries we can run over the shaped data.
PS Another problem… on the NHS site, we see that there appear to be weekly spreadsheet releases and an aggregated release:
Because I didn’t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped weekly and aggregated sheets. I’m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but looking at a query over the scraped data I spotted this:
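A minimal sketch of that sort of de-duplication key, assuming (hypothetically) that a record is identified by sheet name, organisation code, measure name and toDate. The field names, the example values and the choice of SHA-1 are all illustrative rather than the ones used in the scraper:

```python
import hashlib

def rowKey(sheetName, orgCode, measure, toDate):
    # Hypothetical fields; join with a separator that shouldn't
    # appear in the values, then hash the result
    raw = "|".join([sheetName, orgCode, measure, toDate])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# The same record scraped from a weekly and an aggregated sheet
# gives the same key, so the second copy overwrites the first
weeklyKey = rowKey("A&E closures", "RXX", "closures", "2012-11-18")
aggregatedKey = rowKey("A&E closures", "RXX", "closures", "2012-11-18")
otherDayKey = rowKey("A&E closures", "RXX", "closures", "2012-11-15")
print(weeklyKey == aggregatedKey, weeklyKey == otherDayKey)
```

Note that a key built this way cannot tell a single-date column from a date-range column that ends on the same day, which is exactly the sort of collision that can hide the inconsistency described next.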
If we look at the weekly sheet we see this:
That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?
Looking at a different sheet – Adult Critical Care – we get variation at the other end of the range:
If we look into the aggregated sheet, we get:
Which is to say – the weekly report displayed a single date as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent. Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?
And here’s another query:
February data??? I thought we were looking at November data?
PPS If you’re looking for learning outcomes from this post, here are a few: three ways in which we need to wrangle sense out of dates:
- representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is the most useful sort of representation, even if we end up casting dates into string form);
- parsing date ranges into pairs of date represented elements (from and to dates);
- where a dataset/spreadsheet contains heterogeneous single date and date range columns, how do we interpret the numbers that appear in the date range column?
- shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the data set, that might be otherwise hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets…)
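As an illustration of that last point, here is a minimal sketch of a sanity-check query using an in-memory SQLite database. The table name, column names and rows are all made up for the example; the idea is simply to ask for any rows whose dates fall outside the window a release is supposed to cover:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sitrep (sheet TEXT, fromDate TEXT, toDate TEXT, value INTEGER)"
)
# Made-up rows for illustration only
rows = [
    ("A&E closures", "2012-11-15", "2012-11-15", 0),
    ("A&E closures", "2012-11-16", "2012-11-18", 1),
    ("A&E closures", "2013-02-15", "2013-02-15", 0),
]
conn.executemany("INSERT INTO sitrep VALUES (?, ?, ?, ?)", rows)

def outOfWindow(conn, start, end):
    # Dates stored as YYYY-MM-DD strings sort chronologically,
    # so a plain string comparison is enough here
    q = ("SELECT sheet, fromDate, toDate FROM sitrep "
         "WHERE fromDate < ? OR toDate > ? ORDER BY toDate")
    return conn.execute(q, (start, end)).fetchall()

suspect = outOfWindow(conn, "2012-11-01", "2012-11-30")
print(suspect)
```

This is one practical payoff of normalising everything to YYYY-MM-DD: the database can compare dates without any date-specific functions.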
PPPS Another week, another not-quite-right feature:
PPPPS An update on what the numbers actually mean, from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: “On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops). Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).”
PPPPPS Another week, and this time three new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)
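One way of coping with these compact forms is a regular expression that pulls out the two day numbers, the month and the year. This is a sketch rather than the code in the scraper: the name compactRangeParse is hypothetical, it assumes both days fall in the same month, and it relies on English month abbreviations being recognised by strptime:

```python
import re
from datetime import datetime

# Two day numbers joined by a hyphen, then month and year
# separated by either a hyphen or a space
COMPACT_RANGE = re.compile(
    r"^(\d{1,2})-(\d{1,2})[- ]([A-Za-z]{3})[- ](\d{4}|\d{2})$"
)

def compactRangeParse(daterange):
    m = COMPACT_RANGE.match(daterange.strip())
    if m is None:
        return None
    fromDay, toDay, month, year = m.groups()
    fmt = "%d %b " + ("%Y" if len(year) == 4 else "%y")
    try:
        fromdate = datetime.strptime(" ".join([fromDay, month, year]), fmt)
        todate = datetime.strptime(" ".join([toDay, month, year]), fmt)
    except ValueError:
        # Looked like a range, but isn't a real date (eg 30-31-Feb-12)
        return None
    return (fromdate.strftime("%Y-%m-%d"), todate.strftime("%Y-%m-%d"))

for heading in ["14-16-Dec-12", "14-16-Dec 12", "14-16 Dec 12"]:
    print(heading, compactRangeParse(heading))
```

Returning None when nothing matches leaves room to fall back to the other parsing routes for headings in any of the earlier styles.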