Category: Anything you want

Opportunities for Doing the Civic Thing With Open and Public Data

I’ve been following the data thing for a few years now, and it’s been interesting to see how data-related roles have evolved over that time.

For my own part, I’m really excited to have got the chance to work with the Parliamentary Digital Service [PDS] [blog] for a few days this coming year. Over the next few weeks, I hope to start nosing around Parliament and the Parliamentary libraries, getting a feel for The Life of Data there, as well as getting in touch with users of Parliamentary data more widely (if you are one, or aspire to be one, say hello in the comments to see if we can start to generate more opportunities for coffee…:-)

I’m also keen to see what the Bureau of Investigative Journalism’s Local Data Lab, headed up by Megan Lucero, starts to get up to. There’s still a chance to apply for a starting role there if you’re “a journalist who uses computational methods to find stories, an investigative or local journalist who regularly uses data, a tech or computer science person who is interested in local journalism or a civic tech person keen to get involved”, and the gig looks like it could be a fun one:

  • We will take on datasets that have yet to be broken down to a local level, investigate and reveal stories not yet told and bring this to local journalists.
  • We will be mobile, agile and innovative. The team will travel around the country to hear the ideas and challenges of regional reporters. We will listen, respond and learn so to provide evidence-based solutions.
  • We will participate in all parts of the process. Every member will contribute to story ideas, data wrangling, problem solving, building and storytelling.
  • We will participate in open journalism. We will publish public interest stories that throw light on local and national issues. We will open our data and code and document our shortcomings and learnings. We will push for greater transparency. We will foster collaboration between reporters and put power into regional journalism.

I’m really hoping they start a fellowship model too, so I can find some way of getting involved and maybe try to scale some of the data wrangling I will be doing around Isle of Wight data this year to wider use. (I wonder if they’d be interested in a slackbot/datawire experiment or two?!) After all, having split the data out for one local area, it’s often trivial to change the area code and do the same for another:
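Something like this, say (a minimal sketch with made-up file and column names, rather than the actual code I’m using; the first area code is the standard ONS GSS code for the Isle of Wight, the second is a placeholder):

import pandas as pd

# Hypothetical example: a national dataset that includes a local area code column
df = pd.read_csv('national_dataset.csv')  # made-up filename

def data_for_area(df, area_code):
    # Filter the national data down to a single local authority area
    return df[df['area_code'] == area_code]

iw = data_for_area(df, 'E06000046')        # Isle of Wight
other = data_for_area(df, 'ANOTHER_CODE')  # same line of code, different area code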

(It’ll also be interesting to see how the Local Data Lab might complement things like the BBC Local Reporting Scheme, or feed leads into the C4CJ-led “representative network for community journalism”.)

Data journalism job ads are still appearing, too. A recent call for a Senior Broadcast Journalist (Data) at BBC Look North suggests the applicant should be able:

  • To generate ideas for data-driven stories and for how they might be developed and visualized.
  • To explore those ideas using statistical tools – and present them to wider stakeholders from a non-statistical background.
  • To report on and analyse data in a way that contributes to telling compelling stories on an array of news platforms.
  • To collaborate with reporters, editors, designers and developers to bring those stories to publication.
  • To use statistical tools to identify significant data trends.

The ad suggests that required skills include good knowledge of Microsoft Excel, a strong grasp of how to clean, parse and query data as well as database management*, [and] demonstrable experience of visualising data and using visualisation tools such as SPSS, SAS, Tableau, Refine and Fusion Tables.

* I’m intrigued as to what this might mean. At an entry level, I like to think this means getting data into something like SQLite and then running SQL queries over it? It’s also worth remembering that Google Sheets exposes an SQL-like interface that you can query (example, about).
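(A minimal sketch of the sort of thing I have in mind – loading a CSV into SQLite via pandas and then running SQL over it; the filename and column names are made up:)

import sqlite3
import pandas as pd

# Load a spending CSV (hypothetical filename and columns) into a SQLite database...
df = pd.read_csv('spending_2016_10.csv')
conn = sqlite3.connect('spending.db')
df.to_sql('spending', conn, if_exists='replace', index=False)

# ...and then run SQL queries over it
pd.read_sql_query('''
    SELECT supplier, SUM(amount) AS total
    FROM spending
    GROUP BY supplier
    ORDER BY total DESC
    LIMIT 10
''', conn)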

When I started pottering around in the muddy shores of “data journalism” as it became a thing, Google Sheets, Fusion Tables and Open (then Google) Refine were the tools I tried to promote because I saw them as a relatively easy way in to working with data. But particularly with the advent of accessible working environments like RStudio and Jupyter notebooks, I have moved very much more towards the code side. This is perceived as a much harder sell – it requires learning to code – but it’s amazing what you can do with a single line of code, and in many cases someone has already written that line, so all you have to do is copy it; environments like Jupyter notebooks also provide a nicer (simpler) environment for trying out code than scary IDEs (even the acronym is impenetrable;-). As a consequence of spending more time in code, it’s also made me think far more about reproducible and transparent research (indeed, “reproducible data journalism”), as well as the idea of literate programming, where code, text and, particularly in research workflows, code outputs, together form a (linear) narrative that make it easier to see and understand what’s going on…
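(By way of a hedged example of the “single line of code” point: one line of pandas will pull every HTML table on a web page into a list of dataframes – the URL here is just a placeholder:)

import pandas as pd

# One line to grab every HTML table on a page as a list of dataframes
tables = pd.read_html('https://example.com/a-page-with-tables.html')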

As well as the data journalism front, I’ve also kept half an eye on how academic libraries have been engaging with data issues, particularly from an “IT skills” development perspective. Generally, they haven’t, although libraries are often tasked with supporting research data management projects, as this job ad posted recently by the University of Michigan (via @lorcanD) for a data workflows specialist shows:

The Research Data Workflows Specialist will advance the library’s mission to create and sustain data services for the campus that support the mission of the University of Michigan researchers through Research Data Services (RDS), a new and growing initiative that will build the next generation of data curation systems. A key focus of this position will be to understand and connect with the various disciplinary workflows on campus in order to inform the development of our technical infrastructure and data services.

I suspect this is very much associated with research data management. It seems to me that there’s still a hole when it comes to helping people put together their own reproducible research toolchains and technology stacks (as well as working out what sort of toolchain/stack is actually required…).

Finally, I note that NotWestminster is back later next month in Huddersfield (I managed to grab a ticket last night). I have no idea what to expect from the event, but it may generate a few ideas for what I can usefully do with Island data this year…

PS just spotted another job opportunity in a related opendata area: Data Labs and Learning Manager, 360 Giving.

Accounting for the 80% – A Quick Pareto Principle Filter for pandas

Having decided (again) to try to do something with local government transparency spending data this year, I thought I’d take on the task of generating some simple reports that just identify the significant spending areas within a particular directorate or service area.

It’s easy enough to render dozens of charts that show some bars bigger than others, and from this suggest the significant spending areas, but this still requires folk to spend time reading those charts and runs the risk that they don’t “read” from the chart what you wanted them to… (This is one way in which titles and captions can help…)

[Figure: dirspend – charts of spend by directorate]

So how about putting the Pareto Principle, or 80/20 rule, to work, where 80% of some effect or other (such as spending) is accounted for by 20% of contributory effects (such as spending in a particular area, or to a particular set of suppliers)?

In other words, is one way in to the spending data to use it simply to see what accounts for 80%, or thereabouts, of monthly spend?

Here’s a quick function that tries to do something like that, and that can be applied to a pandas Series:

import pandas as pd

def paretoXY(s, x, y, threshold=0):
    ''' Return items until together they account for X% of the whole,
        along with any remaining items that individually account for at least Y%.
        The threshold percentage allows you to hedge your bets and check items just past the threshold. '''
    #Generate percentages, sort, and find the accumulated total
    #Exclude any negative payments that can make the cumulative percentage go > 100% before we get to them
    s_pos = s[s>0]
    df = pd.DataFrame( s_pos.sort_values(ascending=False) )
    #Percentages are calculated against the positive-only total, so they sum to 100%
    df['pc'] = 100*s_pos/s_pos.sum()
    df['cum_pc'] = df['pc'].cumsum()
    #Return ordered items that either together account for at least X% of the total,
    # and/or individually account for at least Y% of the total
    #The threshold provides a fudge factor on both components...
    return df[ (df['cum_pc']-df['pc'] <= x + x*threshold/100) | (df['pc'] >= y - y*threshold/100) ]
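By way of a usage sketch (the filename and column names here are made up – they will depend on how the spending data is actually published):

import pandas as pd

spend = pd.read_csv('iw_spending_2016_10.csv')  # hypothetical filename

# Total spend by expense type within a directorate, as a pandas Series...
s = spend[spend['Directorate'] == 'Adult Services'].groupby('Expenses Type')['Amount'].sum()

# ...then keep the items that together account for ~80% of the total,
# or that individually account for at least 5% of it
paretoXY(s, 80, 5)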

[Figure: iw_transparency_spending_-_adult_services]

The resulting report simply describes the components that either make up 80% (or whatever) of the total in each area, or that represent a significant contribution (howsoever defined) in their own right to the corresponding total.

In the above case, the report describes the significant expense types in capital or revenue streams for each directorate for a given month’s spending data.

The resulting dataframe can also be converted to a crude text report summarising percentage contributions to specific areas easily enough:

[Figure: iw_transparency_spending_-_adult_services2 – text report version]
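A minimal sketch of that sort of text generation, building on the usage example above (and assuming the Series passed in has a name, such as 'Amount'):

def pareto_report(s, x=80, y=5):
    # Generate crude sentences describing the items returned by paretoXY()
    df = paretoXY(s, x, y)
    lines = []
    for item, row in df.iterrows():
        lines.append('{} accounts for {:.1f}% of the total ({:,.2f}).'.format(item, row['pc'], row[s.name]))
    return '\n'.join(lines)

print(pareto_report(s))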

Weekly Subseries Charts – Plotting NHS A&E Admissions

A post on Richard “Joy of Tax” Murphy’s blog a few days ago caught my eye – Data shows January is often the quietest time of the year for A & E departments – with a time series chart showing weekly admission numbers to A&E from a time when the numbers were produced weekly (they’re now produced monthly).

In a couple of follow up posts, Sean Danaher did a bit more analysis to reinforce the claim, again generating time series charts over the whole reporting period.

For me, this just cries out for a seasonal subseries plot. These are typically plotted over months or quarters and show for each month (or quarter) the year on year change of an indicator value. Rendering weekly subseries plots is a bit more cluttered – 52 weekly subcharts rather than 12 monthly ones – but still doable.

I haven’t generated subseries plots from pandas before, but the handy statsmodels Python library has a charting package that looks like it does the trick. The documentation is a bit sparse (I looked at the source…), but given a pandas dataframe and a suitable period based time series index, the chart falls out quite simply…

Here’s the chart and then the code… the data comes from NHS England, A&E Attendances and Emergency Admissions 2015-16 (2015.06.28 A&E Timeseries).

DO NOT TRUST THE FOLLOWING CHART

[Figure: aeseasonalsubseries – weekly seasonal subseries plot of A&E attendances]

(Yes, yes I know; needs labels etc etc; but it’s a crappy graph, and if folk want to use it they need to generate a properly checked and labelled version themselves, right?!;-)

import pandas as pd
# !pip3 install statsmodels
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsaplots
import matplotlib.pyplot as plt

!wget -P data/ https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/04/2015.06.28-AE-TimeseriesBaG87.xls

dfw=pd.read_excel('data/2015.06.28-AE-TimeseriesBaG87.xls',skiprows=14,header=None,na_values='-').dropna(how='all').dropna(axis=1,how='all')
#Faff around with column headers, empty rows etc
#With header=None the row/column labels are just integers, so label-based .loc works here (.ix is long deprecated)
dfw.loc[0,2]='Reporting'
dfw.loc[1,0]='Code'
dfw= dfw.fillna(axis=1,method='ffill').T.set_index([0,1]).T.dropna(how='all').dropna(axis=1,how='all')

dfw=dfw[dfw[('Reporting','Period')].str.startswith('W/E')]

#pandas has super magic "period" datetypes... so we can cast a week ending date to a week period
dfw['Reporting','_period']=pd.to_datetime(dfw['Reporting','Period'].str.replace('W/E ',''), format='%d/%m/%Y').dt.to_period('W') 

#Check the start/end date of the weekly period
#dfw['Reporting','_period'].dt.asfreq('D','s')
#dfw['Reporting','_period'].dt.asfreq('D','e')

#Timeseries traditionally have the datey-timey thing as the index
dfw=dfw.set_index([('Reporting', '_period')])
dfw.index.names = ['_period']

#Generate a matplotlib figure/axis pair to give us easier access to the chart chrome
fig, ax = plt.subplots()

#statsmodels has quarterly and monthly subseries plot helper functions
#but underneath, they use a generic seasonal plot
#If we groupby the week number, we can plot the seasonal subseries on a week number basis
tsaplots.seasonal_plot(dfw['A&E attendances']['Total Attendances'].groupby(dfw.index.week),
                       list(range(1,53)),ax=ax)

#Tweak the display
fig.set_size_inches(18.5, 10.5)
ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=90);

As to how you read the chart – each line shows the trend over years for a particular week’s figures. The week number is along the x-axis. This chart type is really handy for letting you see a couple of things: year on year trend within a particular week; repeatable periodic trends over the course of the year.

A glance at the chart suggests weeks 24-28 (months 6/7 – so June/July) are the busy times in A&E?

PS the subseries plot uses pandas timeseries periods; see eg Wrangling Time Periods (such as Financial Year Quarters) In Pandas.

PPS Looking at the chart, it seems odd that the numbers always go up in a group. Looking at the code:

def seasonal_plot(grouped_x, xticklabels, ylabel=None, ax=None):
    """
    Consider using one of month_plot or quarter_plot unless you need
    irregular plotting.

    Parameters
    ----------
    grouped_x : iterable of DataFrames
        Should be a GroupBy object (or similar pair of group_names and groups
        as DataFrames) with a DatetimeIndex or PeriodIndex
    """
    fig, ax = utils.create_mpl_ax(ax)
    start = 0
    ticks = []
    for season, df in grouped_x:
        df = df.copy() # or sort balks for series. may be better way
        df.sort()
        nobs = len(df)
        x_plot = np.arange(start, start + nobs)
        ticks.append(x_plot.mean())
        ax.plot(x_plot, df.values, 'k')
        ax.hlines(df.values.mean(), x_plot[0], x_plot[-1], colors='k')
        start += nobs

    ax.set_xticks(ticks)
    ax.set_xticklabels(xticklabels)
    ax.set_ylabel(ylabel)
    ax.margins(.1, .05)
    return fig

there’s a df.sort() in there – which I think should be removed, assuming that the data presented is pre-sorted in the group?

A Unit of Comparison for Local Council Budget Consultations, Based on Transparency Spending Data – ASCdays?

A few years ago, via the BBC Radio 4 & World Service programme More or Less (incidentally co-produced by the OU), I came across the notion of micromorts (Contextualising the Chance of Something Happening – Micromorts), a one in a million chance of death that can be used as a unit of risk to compare various likelihoods of dying. Associated with this measure is David Spiegelhalter’s microlife, “30 minutes of your life expectancy”. The point behind the microlife measure is that it provides a way of comparing life threatening risks based on how much life is likely to be lost, on average, when exposed to such risks.

Life expectancy for a man aged 22 in the UK is currently about 79 years, which is an extra 57 years, or 20,800 days, or 500,000 hours, or 1 million half hours. So, a young man of 22 typically has 1,000,000 half-hours (57 years) ahead of him, the same as a 26 year-old woman. We define a microlife as the result of a chronic risk that reduces life, on average, by just one of the million half hours that they have left.

The idea of micromorts came to mind last night as I was reflecting on a public budget consultation held by the Isle of Wight Council yesterday (a day that also saw the Council’s Leader and Deputy Leader resign their positions). The Council needs to improve its budget position by £20 million over the next 3 years, starting with £7.5m in the next financial year. This can come through increased funding, or cuts. By far the biggest chunk of expenditure by the council, as with all councils, is on adult social care (ASC) [community care statistics /via @jonpoole].

As with every year for the past however many years, I’ve had a vague resolution to do something with local council spending data, and never got very far. Early dabblings with the data this year (which I intend to continue…) reinforce the notion that ASC is expensive. Here’s a quick summary of the spending data items for October, 2016:

The spend for each of the directorates was as follows:

  • Adult Services:
    • total spend: £7,746,875.55 (48.33% of total monthly spend)
    • capital: £395,900.06 (5.11% of directorate monthly spend)
    • revenue: £7,350,975.49 (94.89% of directorate monthly spend)
  • Chief Executive:
    • total spend: £501,021.32 (3.13% of total monthly spend)
    • capital: £492,507.54 (98.30% of directorate monthly spend)
    • revenue: £8,513.78 (1.70% of directorate monthly spend)
  • Childrens Services:
    • total spend: £2,044,524.26 (12.76% of total monthly spend)
    • capital: £243,675.08 (11.92% of directorate monthly spend)
    • revenue: £1,800,849.18 (88.08% of directorate monthly spend)
  • Place:
    • total spend: £4,924,117.40 (30.72% of total monthly spend)
    • capital: £974,024.13 (19.78% of directorate monthly spend)
    • revenue: £3,950,093.27 (80.22% of directorate monthly spend)
  • Public Health:
    • total spend: £434,654.13 (2.71% of total monthly spend)
    • revenue: £434,654.13 (100.00% of directorate monthly spend)
  • Regeneration:
    • total spend: £57.65 (0.00% of total monthly spend)
    • revenue: £57.65 (100.00% of directorate monthly spend)
  • Resources:
    • total spend: £377,172.20 (2.35% of total monthly spend)
    • capital: £20,367.87 (5.40% of directorate monthly spend)
    • revenue: £356,804.33 (94.60% of directorate monthly spend)

Cancelling out Adult Services revenue spend for a month would match the £7.5 million required to make up next year’s funds. That’s unlikely to happen, but it does perhaps hint at a possible unit of comparison when trying to make budget decisions, or at least, support budget consultations.

From my naive perspective, adult social care needs to support a certain number of people, a number that evolves (probably?) in line with demographics. One of the ways people exit care is by dying, though the service is set up to minimise harm and help prolong life. Folk may also be transiently cared for (that is, they enter the care system and then leave it). By looking at the amount spent on adult social care, we can come up with an average cost (mean, median?) per person per day of adult social care – ASCdays. We can reduce the total cost by reducing the amount of time folk spend in the system, either by shortening transient stays or postponing entry into the system.
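As a back-of-the-envelope sketch of what that unit might look like (the monthly revenue figure comes from the October data above; the number of people supported is a completely made-up placeholder, not a real figure):

# Monthly Adult Services revenue spend, from the October 2016 transparency data above
asc_monthly_revenue = 7350975.49

# Placeholder only - the real headcount would need to come from adult social care activity data
people_in_asc = 3000
days_in_month = 31

# Average cost per person per day of adult social care - one "ASCday"
ascday = asc_monthly_revenue / (people_in_asc * days_in_month)

def as_ascdays(amount):
    # Express an arbitrary spending item in equivalent ASCdays
    return amount / ascday

print('1 ASCday ~ {:.2f} GBP'.format(ascday))
print('A 10,000 GBP item ~ {:.0f} ASCdays'.format(as_ascdays(10000)))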

So what I’ve started wondering is this: as one way of trying to make sense of transparency spending data, is there any use in casting it into equivalent units of ASCdays? If we use ASCday equivalent units, can we take a weak systems view and try to get a feel for whether a cut to a particular service (or improvement of another) can help us get a handle on the ASC expenditure – or whether it might cause problems down the line?

For example, suppose a week’s respite care costs the same as two weeks’ worth of ASCdays. If that week’s respite care keeps someone out of the adult care service for a month, we’re quids in. If cutting respite care saves 100 ASCdays of funding, but is likely to bring just one person into the care system 3 months early, we might start to doubt whether it will actually lead to any saving at all. (Longer-tail savings complicate matters, given that councils need to balance a budget within a financial year. Spending money this year to save next year requires access to reserves – and confidence in your bet…)

For trying to make budget decisions, or helping engage citizens in budgetary consultations, costing things in ASCday equivalents, and then trying to come up with some probabilities about the likelihood that a particular cut or expense will result in a certain number of people entering or leaving ASC sooner or later, may help you get a feel for the consequences of a particular action.

As to whether prior probabilities exist around whether cutting this service, or supporting that one, is likely to impact on the adult care system – maybe data for that is out there, too?

EU Legal Affairs – Proposals on Civil Law Rules on Robotics

The EU Legal Affairs Committee voted last week to call for legislation that is likely to include a definition of “smart autonomous robots”, regulation of the same, and an ethical code of conduct for designers, producers and users.

Note that the full proposal to be put forward is not currently available – the draft is here (COMMITTEES-JURI-PR-2017-01-12-1095387EN), with amendments.

One thing I noticed in the original document was that it seemed limited in its definition of “smart autonomous robots” as physically instantiated things, and ignored more general “smart systems” – AI software systems, for example. An amendment in PE592.405 addressed this, broadening the scope to cover “AI” more generally: smart autonomous robots and their subcategories by taking into consideration the following characteristics of a smart robot and an autonomous system, that comprised a physical support or is connected to a software programme without being embedded in a physical support.

When the topic was originally announced, a big thing was made in the news about the calls for robots to be classed as “electronic persons”. A policy study – European Civil Law Rules In Robotics – that fed into the deliberations attempted to debunk this:

3.1. Incongruity of establishing robots as liable legal persons
The motion for a resolution proposes creating a new category of individual, specifically for robots: electronic persons. Paragraph 31(f) calls upon the European Commission to explore the legal consequences of “creating a specific legal status for robots, so that at least the most sophisticated autonomous robots could be established as having the status of electronic persons with specific rights and obligations, including that of making good any damage they may cause [to third parties], and applying electronic personality to cases where robots make smart autonomous decisions or otherwise interact with third parties”.

When considering civil law in robotics, we should disregard the idea of autonomous robots having a legal personality, for the idea is as unhelpful as it is inappropriate.

Traditionally, when assigning an entity legal personality, we seek to assimilate it to humankind. This is the case with animal rights, with advocates arguing that animals should be assigned a legal personality since some are conscious beings, capable of suffering, etc., and so of feelings which separate them from things. Yet the motion for a resolution does not tie the acceptance of the robot’s legal personality to any potential consciousness. Legal personality is therefore not linked to any regard for the robot’s inner being or feelings, avoiding the questionable assumption that the robot is a conscious being. Assigning robots such personality would, then, meet a simple operational objective arising from the need to make robots liable for their actions.

… the motion for a resolution would appear more inclined to fully erase the human presence. In viewing as an electronic person any “robots [which] make smart autonomous decisions or otherwise interact with third parties” (end of paragraph 31(f)), the motion seems to suggest that the robot itself would be liable and become a legal actor. This analysis finds support in paragraph S, which states that “the more autonomous robots are, the less they can be considered simple tools in the hands of other actors […] [and this] calls for new rules which focus on how a machine can be held — partly or entirely — responsible for its acts or omissions”. Once a robot is no longer controlled by another actor, it becomes the actor itself. Yet how can a mere machine, a carcass devoid of consciousness, feelings, thoughts or its own will, become an autonomous legal actor? How can we even conceive this reality as foreseeable within 10 to 15 years, i.e. within the time frame set in paragraph 25 of the motion for a resolution? From a scientific, legal and even ethical perspective, it is impossible today — and probably will remain so for a long time to come — for a robot to take part in legal life without a human being pulling its strings.

What is more, considering that the main purpose of assigning a robot legal personality would be to make it a liable actor in the event of damage, we should note that other systems would be far more effective at compensating victims; for example, an insurance scheme for autonomous robots, perhaps combined with a compensation fund (paragraphs 31(a) to (e)).

We also have to bear in mind that this status would unavoidably trigger unwanted legal consequences. Paragraph T of the motion states that creating a legal personality would mean that robots’ rights and duties had to be respected. How can we contemplate conferring rights and duties on a mere machine? How could a robot have duties, since this idea is closely linked with human morals? Which rights would we bestow upon a robot: the right to life (i.e. the right to non-destruction), the right to dignity, the right to equality with humankind, the right to retire, the right to receive remuneration (an option explicitly explored in paragraph 31(b) of the motion), etc.? …

In reality, advocates of the legal personality option have a fanciful vision of the robot, inspired by science-fiction novels and cinema. They view the robot — particularly if it is classified as smart and is humanoid — as a genuine thinking artificial creation, humanity’s alter ego. We believe it would be inappropriate and out-of-place not only to recognise the existence of an electronic person but to even create any such legal personality. Doing so risks not only assigning rights and obligations to what is just a tool, but also tearing down the boundaries between man and machine, blurring the lines between the living and the inert, the human and the inhuman. Moreover, creating a new type of person – an electronic person – sends a strong signal which could not only reignite the fear of artificial beings but also call into question Europe’s humanist foundations. Assigning person status to a nonliving, non-conscious entity would therefore be an error since, in the end, humankind would likely be demoted to the rank of a machine. Robots should serve humanity and should have no other role, except in the realms of science-fiction.

Yes but, no but… replace “robot” with “company” in the above. Companies have a legal basis, are defined as legal entities, albeit requiring (in the UK at least) at least one human officer to be responsible for them. To what extent could smart systems be similarly treated, or incorporated?

Or consider this – employees, who have human agency, have responsibilities, and liabilities incurred by them may or may not become liabilities for their employer. To the extent that smart systems have “machine agency”, to what extent might they also be classed as employees, eg for insurance or liability purposes? (Related: Fragments – Should Algorithms, Deep Learning AI Models and/or Robots be Treated as Employees?.) To be an “employee”, do you also have to be a legal entity?

PS there’s also a clause in the amendments that expresses neo-Luddite tendencies with which I am not totally unsympathetic:

“whereas robotics and AI that can perform similar tasks to those performed by humans should be used mainly to support and boost the abilities of man, as opposed to trying to replace the human element completely”

I haven’t read all the amendments yet – there may be more nuggets in there…

Convention Based URLs Support Automation

I spent a chunk of last week at a Curriculum Development Hackathon for a Data Carpentry workshop on Reproducible Research using Jupyter Notebooks (I’d like to thank the organisers for the travel support). One of the planned curriculum areas looked at data project organisation, another at automation. Poking around on an NHS data publication webpage for a particular statistical work area just now suggests an example of how the two inter-relate… and how creating inconsistent URLs or filenames makes automatically downloading similar files a bit of a faff when it could be so easy…

To begin with, the A&E Attendances and Emergency Admissions statistical work area has URL:

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/

The crumb trail in the on-page navigation has the form:

Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions

which we might note jars somewhat with the slug ae-waiting-times-and-activity, and perhaps reflects some sort of historical legacy in how the data was treated previously…

Monthly data is collected on separate financial year related pages linked from that page:

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2016-17/

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2015-16-monthly-3/

The breadcrumb for these pages has the form:

Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions -> A&E Attendances and Emergency Admissions 20MM-NN

A few observations about those financial year related page URLs. Firstly, the path is rooted on the parent page (a Good Thing), but the slug looks as if it has been mangled together from a more reasonable parent path (statistical-work-areasae-waiting-times-and-activity appears to have been collapsed from statistical-work-areas/ae-waiting-times-and-activity).

The next part of the URL specifies the path to the A & E Attendances and Emergency Admissions page for a particular year, with an appropriate slug for the name – ae-attendances-and-emergency-admissions- but differently formed elements for the years: 2016-17 compared to 2015-16-monthly-3.

(Note that the 2015-16 monthly listing is incomplete and starts in June 2015.)

If we look at URLs for some of the monthly 2016-17 Excel data file downloads, we see inconsistency in the filenames:

https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/November-2016-AE-by-provider-W0Hp0.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/October-2016-AE-by-provider-Nxpai.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/September-2016-AE-by-provider-BtD4b.xls

(Note that CSV data seems only to be available for the latest (November 2016) data set. I don’t know if this means that the CSV data link only appears for the current month, or data in the CSV format only started to be published in November 2016.)

For the previous year we get:

https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/08/March-2016-AE-by-provider-9g0dQ-Revised-11082016.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/08/February-2016-AE-by-provider-1gWNy-Revised-11082016.xls

and so on.

Inspection of these URLs suggests:

  1. the data is being uploaded to and published from a WordPress site (wp-content/uploads);
  2. the path to the data directory for the annual collection is minted according to the month in which the first dataset of the year is uploaded (data takes a month or two to be uploaded, so presumably the April 2016 data was posted in June 2016 (2016/06); the 2015 data started late – the first month (June 2015) presumably being uploaded in August of that year (2015/08));
  3. the month slug for the data file starts off fine, being of the form MONTH-YEAR-AE-by-provider-, but then breaks things by having some sort of code value that perhaps uniquely identifies the version of the file;
  4. the month slug may be further broken by the addition of a revision element (eg -Revised-11082016).

If the URLs all had a common pattern, it would be easy enough to automate their generation from a slug pattern and month/year combination, and then automatically download them. (I haven’t yet explored inside each spreadsheet to see what inconsistency errors/horrors make it non-trivial to try to combine the monthly data into a single historical data set…)

As it is, to automate the download of the files requires scraping the web pages for the links, or manually retrieving them. (At least the link text on the pages seems to be reasonably consistent!)
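A minimal scraping sketch, for example (assuming requests and BeautifulSoup, relying on the reasonably consistent “AE-by-provider” filename fragment rather than guessable URLs, and assuming the monthly links on the page are absolute):

import os
import requests
from bs4 import BeautifulSoup

url = 'https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2016-17/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Pick out links to the monthly "A&E by provider" spreadsheets by pattern matching on the href
links = [a['href'] for a in soup.select('a[href]') if 'AE-by-provider' in a['href']]

os.makedirs('data', exist_ok=True)
for link in links:
    filename = link.split('/')[-1]
    with open(os.path.join('data', filename), 'wb') as f:
        f.write(requests.get(link).content)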

An Alternative Way of Motivating the Use of Functions?

At the end of the first of the Curriculum Development Hackathon on Reproducible Research using Jupyter Notebooks held at BIDS in Berkeley, yesterday, discussion turned on whether we should include a short how-to on the use of interactive IPython widgets to support exploratory data analysis. This would provide workshop participants with an example of how to rapidly prototype a simple exploratory data analysis application such as an interactive chart, enabling them to explore a range of parameter values associated with the data being plotted in a convenient way.

In summarising how the ipywidgets interact() function works, Fernando Perez made a comment that made me wonder whether we could use the idea of creating simple interactive chart explorers as a way of motivating the use of functions.

More specifically, interact() takes a function name and the set of parameters passed into that function and creates a set of appropriate widgets for setting the parameters associated with the function. Changing the widget setting runs the function with the currently selected values of the parameters. If the function returns a chart object, then the function essentially defines an interactive chart explorer application.

So one reason for creating a function is that you may be able to automatically convert it into an interactive application using interact().
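A minimal sketch of the pattern, run in a Jupyter notebook (the data here is just a made-up stand-in; the point is that the function’s parameters become the widgets):

import pandas as pd
from ipywidgets import interact

# A toy dataframe standing in for something more interesting
df = pd.DataFrame({'week': range(1, 53),
                   'attendances': range(1000, 1520, 10)})

def plot_weeks(max_week=52):
    # The function does the work and draws a chart...
    df[df['week'] <= max_week].plot(x='week', y='attendances')

# ...and interact() turns its parameters into widgets - here, an integer slider for max_week
interact(plot_weeks, max_week=(1, 52))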

Here’s a quick first sketch notebook that tries to set up a motivating example: An Alternative Way of Motivating Functions?

PS to embed an image of a rendered widget in the notebook, select the Save notebook with snapshots option from the Widgets menu:

[Figure: simplewidgetdemo]

See also: Simple Interactive View Controls for pandas DataFrames Using IPython Widgets in Jupyter Notebooks