I spent a chunk of last week at Curriculum Development Hackathon for a Data Carpentry workshop on Reproducible Research using Jupyter Notebooks (I’d like to thank the organisers for the travel support). One of the planned curriculum areas looked at data project organisation, another on automation. Poking around on an NHS data publication webpage for a particular statistical work area just now suggests an example of how the two inter-relate… and how creating inconsistent URLs or filenames makes automatically downloading similar files a bit of a faff when it could be so easy…
To begin with, the A&E Attendances and Emergency Admissions statistical work area has URL:
The crumb trail in the on-page navigation has the form:
Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions
which we might note jars somewhat with the slug ae-waiting-times-and-activity, and perhaps reflects some sort of historical legacy in how the data was treated previously…
Monthly data is collected on separate financial year related pages linked from that page:
The breadcrumb for these pages has the form:
Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions -> A&E Attendances and Emergency Admissions 20MM-NN
A few of observations about those financial year related page URLs. Firstly, the path is rooted on the parent page (a Good Thing), but the slug looks mangled together from what looks like a more reasonable parent path (statistical-work-areasae-waiting-times-and-activity; this looks as if it’s been collapsed from statistical-work-areas/ae-waiting-times-and-activity).
The next part of the URL specifies the path to the A & E Attendances and Emergency Admissions page for a particular year, with an appropriate slug for the name – ae-attendances-and-emergency-admissions- but differently formed elements for the years: 2016-17 compared to 2015-16-monthly-3.
(Note that the 2015-16 monthly listing is incomplete and starts in June 2015.)
If we look at URLs for some of the monthly 2016-17 Excel data file downloads, we see inconsistency in the filenames:
(Note that CSV data seems only to be available for the latest (November 2016) data set. I don’t know if this means that the CSV data link only appears for the current month, or data in the CSV format only started to be published in November 2016.)
For the previous year we get:
and so on.
Inspection of these URLs suggests:
- the data is being uploaded to and published from a WordPress site (wp-content/uploads);
- the path to the data directory for the annual collection is minted according to the month in which the first dataset of the year is uploaded (data takes a month or two to be uploaded, so presumably the April 2016 data was posted in June, 2016 (2016/06); the 2015 data started late – the first month (June 2015) presumably being uploaded in August of that year (2015/08);
- the month slug for the data file starts off fine, being of the form MONTH-YEAR-AE-by-provider-, but then breaks things by having some sort of code value that perhaps uniquely identifies the version of the file;
- the month slug may be further broken by the addition of a revision element (eg -Revised-11082016).
If the URLs all had a common pattern, it would be easy enough to automate their generation from a slug pattern and month/year combination, and then automatically download them. (I haven’t yet explored inside each spreadsheet to see what inconsistency errors/horrors make it non-trivial to try to combine the monthly data into a single historical data set…)
As it is, to automate the download of the files requires scraping the web pages for the links, or manually retrieving them. (At least the link text on the pages seems to be reasonably consistent!)