<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; When Machine Readable Data Still Causes &#8220;Issues&#8221; &#8211; Wrangling Dates&#8230;</title>
	<atom:link href="http://blog.ouseful.info/2012/11/27/when-machine-readable-data-still-causes-issues-wrangling-dates/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Thu, 23 May 2013 22:38:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; When Machine Readable Data Still Causes &#8220;Issues&#8221; &#8211; Wrangling Dates&#8230;</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>When Machine Readable Data Still Causes &#8220;Issues&#8221; &#8211; Wrangling Dates&#8230;</title>
		<link>http://blog.ouseful.info/2012/11/27/when-machine-readable-data-still-causes-issues-wrangling-dates/</link>
		<comments>http://blog.ouseful.info/2012/11/27/when-machine-readable-data-still-causes-issues-wrangling-dates/#comments</comments>
		<pubDate>Tue, 27 Nov 2012 17:55:00 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infoskills]]></category>
		<category><![CDATA[ddj]]></category>
		<category><![CDATA[excel]]></category>
		<category><![CDATA[scraperwiki]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=9096</guid>
		<description><![CDATA[With changes to the FOI Act brought about by the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I&#8217;ll give an example of a dataset that is, arguably, released in a machine readable way &#8211; as an Excel spreadsheet, but that still requires [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>With changes to the FOI Act brought about by the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I&#8217;ll give an example of a dataset that <em>is</em>, arguably, released in a machine readable way &#8211; as an Excel spreadsheet, but that still requires quite a bit of work to become useful <em>as data</em>; because presumably the intent behind the aforementioned amendment to the FOI is to make data releases useful and useable <em>as data</em>? As a secondary result, through trying to make the data useful <em>as data</em>, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to&#8230; Which makes those data columns misleading at best, useless at worst&#8230; And as to the February data in a release allegedly relating to a weekly release from November&#8230;? Sigh&#8230;</p>
<p>[Note - I'm not meaning to be critical in the sense of "this data is too broken to be useful so don't publish it". My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more "accessible" open data releases published in the future. ]</p>
<p>So what, and where is, the data&#8230;? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the <a href="http://transparency.dh.gov.uk/2012/10/26/winter-pressures-daily-situation-reports-2012-13/">NHS Winter pressures daily situation reports 2012 – 13</a>. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own&#8230;</p>
<p>The data looks like it&#8217;ll be released in a set of weekly Excel spreadsheets, with a separate sheet for each data report.</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/nhs-sitrep.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-sitrep.png?w=700&#038;h=473" alt="" title="nhs sitrep" width="700" height="473" class="alignnone size-full wp-image-9097" /></a></p>
<p>All well and good&#8230; almost&#8230;</p>
<p>If we <a href="https://scraperwiki.com/scrapers/nhs_sit_reps/">load the data into something like Scraperwiki</a>, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal &#8220;9-Nov-2012&#8221;), they are represented as date types (in this case, the number of days since a baseline starting date). A <a href="http://stackoverflow.com/a/1112664/454773">quick check on StackOverflow</a> turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:</p>
<pre class="brush: python; title: ; notranslate">#http://stackoverflow.com/a/1112664/454773
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )</pre>
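<p>As a quick sanity check on that recipe, here is a minimal sketch that runs it over a serial number. The value 41222 is my own worked example rather than something read from the NHS sheets: it is the number of days a 1900-based workbook stores for 9-Nov-2012. (If you are already using <em>xlrd</em>, its <tt>xldate_as_tuple</tt> function does a similar job.)</p>

```python
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # Same recipe as above; datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )

# 41222 is the serial number a 1900-based workbook stores for 9-Nov-2012
converted = minimalist_xldate_as_datetime(41222, 0)
print(converted.date().isoformat())

# The fractional part of a serial number carries the time of day (0.5 is midday)
print(minimalist_xldate_as_datetime(41222.5, 0))

# A 1904-based workbook stores the same day as a number 1462 smaller
print(minimalist_xldate_as_datetime(41222 - 1462, 1).date().isoformat())
```

<p>All three calls should land on the same day, which is a handy way of checking that the <tt>datemode</tt> offset is being applied in the right direction.</p>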
<p>The next thing we notice is that some of the date column headings actually specify: 1) date ranges, 2) in a variety of styles across the different sheets. For example:</p>
<ul>
<li>16 &#8211; 18/11/2012</li>
<li>16 Nov 12 to 18-NOV-2012</li>
<li>16 to 18-Nov-12</li>
</ul>
<p>In addition, we see that some of the sheets split the data into what we might term further &#8220;subtables&#8221; as you should notice if you compare the following sheet with the previous one shown above:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/nhs-sit-rep-further-breakdown.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-sit-rep-further-breakdown.png?w=700&#038;h=119" alt="" title="nhs sit rep further breakdown" width="700" height="119" class="alignnone size-full wp-image-9098" /></a></p>
<p>Notwithstanding that the &#8220;shape&#8221; of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I&#8217;ll describe in another post), we are faced with a problem here that if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented date formats in the same way, ideally as a date structure that Python or the Scraperwiki SQLite database can recognise as such.</p>
<p>The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto &#8220;fromDates&#8221; and &#8220;toDates&#8221;. ***BEWARE &#8211; I DON&#8217;T KNOW IF THIS IS THE CORRECT THING TO DO*** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value that I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:</p>
<p><tt>dateString=minimalist_xldate_as_datetime(cellValue,book.datemode).date().strftime("%Y-%m-%d")</tt></p>
<p>In this case, <tt>cellValue</tt> is the value of a header cell that is represented as an Excel time element, <tt>book</tt> is the workbook, as parsed using the <em>xlrd</em> library:</p>
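<pre class="brush: python; title: ; notranslate">import scraperwiki
import xlrd

#spreadsheetURL is the address of one of the Excel files linked to from the NHS page
xlbin = scraperwiki.scrape(spreadsheetURL)
book = xlrd.open_workbook(file_contents=xlbin)</pre>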
<p>and <tt>book.datemode</tt> is an attribute of the workbook that records how dates are being represented in the spreadsheet (0 for 1900-based dates, 1 for 1904-based ones). If the conversion fails, we default to setting dateString to the original value:<br />
<tt>dateString=cellValue</tt></p>
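<p>Put together, the convert-or-fall-back step might look something like the following. This is a minimal sketch rather than the code from the scraper itself: the helper name <tt>header_to_date_string</tt> is made up for illustration, and I am assuming that text headings reach the function as ordinary Python strings.</p>

```python
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )

def header_to_date_string(cell_value, datemode=0):
    # Hypothetical helper: numeric cells are treated as Excel dates,
    # anything else is passed through untouched for later parsing
    try:
        return minimalist_xldate_as_datetime(cell_value, datemode).date().strftime("%Y-%m-%d")
    except (TypeError, OverflowError):
        # A text heading such as "16 to 18-Nov-12" cannot be added to a number of days
        return cell_value

print(header_to_date_string(41222.0))
print(header_to_date_string("16 to 18-Nov-12"))
```

<p>Catching only the errors we expect, rather than everything, means that a genuinely unexpected problem still shows up rather than being silently swallowed.</p>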
<p>The next step was to look at the date range cells, and cast any &#8220;literal&#8221; date strings into a recognised date format. (I&#8217;ve just realised I should have optimised the way this is called in the Scraperwiki code &#8211; I am doing so many unnecessary lookups at the moment!) In the following snippet, I look to see if we can split the column heading into the from and to parts of a date range, and then normalise each part:</p>
<pre class="brush: python; title: ; notranslate">import datetime

def dateNormalise(d):
    #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
    #The idea is to try to identify the date pattern used, and parse the string accordingly
    dtf=d
    for trials in [&quot;%d %b %y&quot;,'%d-%b-%y','%d-%b-%Y','%d/%m/%Y','%d/%m/%y']:
        try:
            dtf=datetime.datetime.strptime(d, trials)
            break
        except (ValueError, TypeError):
            #No match for this pattern; try the next one
            pass
    if isinstance(dtf, datetime.datetime):
        dtf=dtf.strftime(&quot;%Y-%m-%d&quot;)
    #If nothing matched, the original value is returned unchanged
    return dtf

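def patchDate(f,t):
    #Grab the year and month elements from the todate (YYYY-MM-DD), and add in the from day of month number
    tt=t.split('-')
    #Zero pad the day so that '6' becomes '06' and the result is a valid YYYY-MM-DD string
    fromdate='-'.join( [ str(tt[0]),str(tt[1]),str(f).strip().zfill(2) ])
    return fromdate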

def dateRangeParse(daterange):
    #In this first part, we simply try to identify from and to portions
    dd=daterange.split(' to ')
    if len(dd)&lt;2:
        #That is, split on 'to' doesn't work
        dd2=daterange.split(' - ')
        if len(dd2)&lt;2:
            #Doesn't split on '-' either; set from and todates to the string, just in case.
            fromdate=daterange
            todate=daterange
        else:
            fromdate=dd2[0]
            todate=dd2[1]
    else:
        fromdate=dd[0]
        todate=dd[1]
    #By inspection, the todate looks like it's always a complete date, so try to parse it as such 
    todate=dateNormalise(todate)
    #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
    fromdate=dateNormalise(fromdate)
    if len(fromdate)&lt;3:
        fromdate=patchDate(fromdate,todate)
    return (fromdate,todate)

#USAGE:
(fromdate,todate)=dateRangeParse(dateString)</pre>
<p>One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.</p>
<p><a href="http://xkcd.com/1179/"><img src="http://imgs.xkcd.com/comics/iso_8601.png" alt='XKCD - ISO 8601' /></a></p>
<p>The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.</p>
<p>When it comes to using the data, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven&#8217;t checked), compared to daily totals. Indeed, as the PS below shows, I&#8217;m now starting to doubt what the number in the date range column represents? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period? <s>[This was written under assumption it was summed daily values over period, which PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate &#8220;as-if daily&#8221; returns simply by dividing ranged date totals by the number of days in the range. A more &#8220;truthful&#8221; approach may be to plot summed counts over time (date on the x-axis, sum of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the &#8220;toDate&#8221; date as its x/date value?</s></p>
<p>When I get a chance, I&#8217;ll do a couple more posts around this dataset:<br />
- one looking at datashaping in general, along with an example of how I shaped the data in this particular case<br />
- one looking at different queries we can run over the shaped data.</p>
<p>PS Another problem&#8230; <a href="http://transparency.dh.gov.uk/2012/10/26/winter-pressures-daily-situation-reports-2012-13/">on the NHS site</a>, we see that there appear to be weekly spreadsheet releases and an aggregated release:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/nhs-releases.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-releases.png?w=700" alt="" title="NHS releases"   class="alignnone size-full wp-image-9108" /></a></p>
<p>Because I didn&#8217;t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped weekly and aggregated sheets. I&#8217;m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but <a href="https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=htmltable&amp;name=nhs_sit_reps&amp;query=select%20*%20from%20%60fulltable%60where%20Code%3D'R1F'%20%20order%20by%20toDateStr%20desc">looking at a query over the scraped data</a> I spotted this:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/nhs-data-quality-issue2.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-data-quality-issue2.png?w=700&#038;h=358" alt="" title="nhs data quality issue2?" width="700" height="358" class="alignnone size-full wp-image-9109" /></a></p>
<p>If we look at the <a href="https://www.wp.dh.gov.uk/transparency/files/2012/10/DailySR-Web-file-WE-18-11-12.xls">weekly sheet</a> we see this:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/nhs-sitrep-data-quality-issue.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-sitrep-data-quality-issue.png?w=700&#038;h=268" alt="" title="NHS sitrep data quality issue?" width="700" height="268" class="alignnone size-full wp-image-9110" /></a></p>
<p>That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?</p>
<p>Looking at a different sheet &#8211; Adult Critical Care &#8211; we get variation at the other end of the range:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/adult-critical-care.png"><img src="http://ouseful.files.wordpress.com/2012/11/adult-critical-care.png?w=700&#038;h=447" alt="" title="adult Critical care..." width="700" height="447" class="alignnone size-full wp-image-9125" /></a></p>
<p>If we look into the <a href="https://www.wp.dh.gov.uk/transparency/files/2012/10/Daily-SR-Timeseries-18-11-12.xls">aggregated sheet</a>, we get:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/nhs-daterange.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-daterange.png?w=700" alt="" title="nhs daterange"   class="alignnone size-full wp-image-9111" /></a></p>
<p>Which is to say &#8211; the weekly report displayed a single date as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?</p>
<p>And <a href="https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=htmltable&amp;name=nhs_sit_reps&amp;query=select%20distinct%20Code,toDateStr,tableName,facetB,value%20from%20%60fulltable%60where%20Code%3D'R1F'%20%20order%20by%20toDateStr,value%20desc">here&#8217;s another query</a>:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/nhs-february.png"><img src="http://ouseful.files.wordpress.com/2012/11/nhs-february.png?w=700" alt="" title="nhs february???"   class="alignnone size-full wp-image-9120" /></a></p>
<p>February data??? I thought we were looking at November data?</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/oops.png"><img src="http://ouseful.files.wordpress.com/2012/11/oops.png?w=700&#038;h=191" alt="" title="oops..." width="700" height="191" class="alignnone size-full wp-image-9121" /></a></p>
<p>Hmmm&#8230;</p>
<p>PPS If you&#8217;re looking for learning outcomes from this post, here are a few: three ways in which we need to wrangle sense out of dates, plus a more general point about checking the data:</p>
<ol>
<li>representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is the most useful sort of representation, even if we end up casting dates into string form);</li>
<li>parsing date ranges into pairs of date represented elements (from and to dates);</li>
<li>where a dataset/spreadsheet contains heterogeneous single date and date range columns, how do we interpret the numbers that appear in the date range column?</li>
<li>shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the data set, that might be otherwise hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets&#8230;)</li>
</ol>
<p>Hmmm&#8230;.</p>
<p>PPPS Another week, another not-quite-right feature:</p>
<p><img src="http://ouseful.files.wordpress.com/2012/11/another-date-mixup.png?w=700" alt="another date mixup"   class="alignnone size-full wp-image-9283" /></p>
<p>PPPPS An update on what the numbers actually mean, from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: &#8220;On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops).  Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).&#8221;</p>
<p>PPPPPS Another week, and this time <em>three</em> new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)</p>
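<p>One way of coping with these compact forms might be a regular expression that picks out the two day numbers, the month and the year in one go. What follows is only a sketch, not what the scraper actually does: the pattern and the function name <tt>compact_range_parse</tt> are my own assumptions, it expects English three letter month abbreviations, and it assumes the range never crosses a month boundary.</p>

```python
import datetime
import re

# Hypothetical pattern: "D1-D2", then a month abbreviation and a year,
# each separated by a hyphen or by spaces
COMPACT_RANGE = re.compile(
    r"^\s*(\d{1,2})\s*-\s*(\d{1,2})[\s-]+([A-Za-z]{3})[\s-]+(\d{4}|\d{2})\s*$"
)

def compact_range_parse(heading):
    """Return (fromdate, todate) as YYYY-MM-DD strings, or None if there is no match."""
    m = COMPACT_RANGE.match(heading)
    if m is None:
        return None
    from_day, to_day, month, year = m.groups()
    year_format = "%Y" if len(year) == 4 else "%y"
    todate = datetime.datetime.strptime(
        " ".join([to_day, month, year]), "%d %b " + year_format
    ).date()
    # Assumes the from day falls in the same month and year as the to date
    fromdate = todate.replace(day=int(from_day))
    return (fromdate.isoformat(), todate.isoformat())

for heading in ["14-16-Dec-12", "14-16-Dec 12", "14-16 Dec 12"]:
    print(heading, compact_range_parse(heading))
```

<p>Headings that do not fit the pattern come back as <tt>None</tt>, so they can still be handed on to the split-on-&#8220;to&#8221; approach used earlier.</p>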
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2012/11/27/when-machine-readable-data-still-causes-issues-wrangling-dates/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-sitrep.png" medium="image">
			<media:title type="html">nhs sitrep</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-sit-rep-further-breakdown.png" medium="image">
			<media:title type="html">nhs sit rep further breakdown</media:title>
		</media:content>

		<media:content url="http://imgs.xkcd.com/comics/iso_8601.png" medium="image">
			<media:title type="html">XKCD - ISO 8601</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-releases.png" medium="image">
			<media:title type="html">NHS releases</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-data-quality-issue2.png" medium="image">
			<media:title type="html">nhs data quality issue2?</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-sitrep-data-quality-issue.png" medium="image">
			<media:title type="html">NHS sitrep data quality issue?</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/adult-critical-care.png" medium="image">
			<media:title type="html">adult Critical care...</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-daterange.png" medium="image">
			<media:title type="html">nhs daterange</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/nhs-february.png" medium="image">
			<media:title type="html">nhs february???</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/oops.png" medium="image">
			<media:title type="html">oops...</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/another-date-mixup.png" medium="image">
			<media:title type="html">another date mixup</media:title>
		</media:content>
	</item>
	</channel>
</rss>
