Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes

Via a GSS (Government Statistical Service) blog post yesterday – Why Do We Need Another Register? – announcing the Statistical Geography Register, which contains “the codes for each type of statistical geography within the UK”, I came across mention of the ONS Register of Geographic Codes.

This register, maintained by the Office for National Statistics, is released as a zip file containing an Excel spreadsheet. Separate worksheets in the spreadsheet list codes for various geographies in England and Wales (but not Scotland; that data is available elsewhere).

To make a rather more reproducible component for accessing the data, I hacked together a simple notebook to pull the data out of the spreadsheet and pop it into a simple SQLite3 database as a set of separate tables, one per sheet.
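The separate sheets are read into a dict of dataframes, xl, keyed by sheet name; a minimal sketch of that loading step (the workbook filename here is just a placeholder for whatever the unzipped spreadsheet is called):

import pandas as pd

#Load every worksheet into a dict of dataframes keyed by sheet name
xl = pd.read_excel('Register_of_Geographic_Codes.xlsx', sheet_name=None)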

One thing we need to do is reconcile items in the metadata sheet with the sheet names, which means joining a couple of the metadata columns together with an underscore separator:

xl['RGC']["codeAbbrv"] = xl['RGC']["Entity code"].map(str) + '_' + xl['RGC']["Entity abbreviation"].map(str)

The bulk of the script is quite simple (see the full notebook here):

import sqlite3
import pandas as pd

con = sqlite3.connect("onsgeocodes.sqlite")

cols=['GEOGCD','GEOGNM','GEOGNMW','OPER_DATE','TERM_DATE','STATUS']

#Create an empty "big" table to hold the codes from every sheet
bigcodes=pd.DataFrame(columns=['sheet']+cols)
bigcodes.to_sql(con=con, name='codelist', index=False, if_exists='replace')

sheets= list(xl.keys())
sheets.remove('For_Scotland')
for sheet in sheets[2:]:
    #One table per sheet...
    xl[sheet].to_sql(con=con, name=sheet, index=False, if_exists='replace')
    xl[sheet]['sheet']=sheet
    #Reorder the columns and append each sheet's codes to the codelist table
    xl[sheet][['sheet']+cols].to_sql(con=con, name='codelist', index=False, if_exists='append')

You may also notice that it creates a “big” table (codelist) that contains all the codes – which means we can look up the provenance of a particular code:

q='''
SELECT sheet, GEOGCD, GEOGNM, GEOGNMW, codelist.STATUS, "Entity name"
FROM codelist JOIN metadata WHERE "GEOGCD"="{code}" AND codeAbbrv=sheet
'''.format(code='W40000004')
pd.read_sql_query(q, con)
sheet GEOGCD GEOGNM GEOGNMW STATUS Entity name
0 W40_CMLAD W40000004 Denbighshire Sir Ddinbych live Census Merged Local Authority Districts

We can also look to see what (current) geographies might be associated with a particular name:

q='''
SELECT DISTINCT "Entity name", sheet FROM codelist JOIN metadata
WHERE "GEOGNM" LIKE "%{name}%" AND codeAbbrv=sheet AND codelist.STATUS="live"
'''.format(name='Isle of Wight')
pd.read_sql_query(q, con)
Entity name sheet
0 Super Output Areas, Lower Layer E01_LSOA
1 Super Output Areas, Middle Layer E02_MSOA
2 Unitary Authorities E06_UA
3 Westminster Parliamentary Constituencies E14_WPC
4 Community Safety Partnerships E22_CSP
5 Registration Districts E28_REGD
6 Registration Sub-district E29_REGSD
7 Travel to Work Areas E30_TTWA
8 Fire and Rescue Authorities E31_FRA
9 Built-up Areas E34_BUA
10 Clinical Commissioning Groups E38_CCG
11 Census Merged Local Authority Districts E41_CMLAD
12 Local Resilience Forums E48_LRF
13 Sustainability and Transformation Partnerships E54_STP

What I’m wondering now is – can I crib from the way the ergast API is put together to create a simple API that takes a code, or a name, and returns geography register information related to it?

The same approach could also be applied to the registers I pull down from NHS Digital (here) – which makes me think I should generate a big codelist table for those codes too…

PS this in part reminds me of a conversation years ago with Richard et al from @cottagelabs who were mooting, at the time, a service that would take an arbitrary code and try to pattern match the coding scheme it was part of and then look it up in that coding scheme.

PPS hmm, also thinks: maybe names associated with coding schemes could be added to a simple document tagger.

Reporting in a Repeatable, Parameterised, Transparent Way

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset. You can see a rendered example of the notebook here (download).

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.
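For a flavour of what that looks like under the hood, here’s a rough sketch – not the actual notebook code – of a folium choropleth wrapped up as a function that an ipywidgets selector can call; the filenames, column names and geoJSON property key are all assumptions, and it assumes a reasonably recent folium:

import folium
import pandas as pd
from ipywidgets import interact
from IPython.display import display

#Hypothetical extract: one row per (practice, LSOA) with a patient count column
df = pd.read_csv('gp_practice_lsoa_counts.csv')

def practice_map(practice_code):
    '''Render a choropleth of list sizes by LSOA for a single practice.'''
    d = df[df['PRACTICE_CODE'] == practice_code]
    m = folium.Map(location=[50.67, -1.3], zoom_start=11)  #roughly centred on the Isle of Wight
    folium.Choropleth(geo_data='iw_lsoa.geojson',           #LSOA boundary geoJSON file
                      data=d,
                      columns=['LSOA_CODE', 'PATIENTS'],
                      key_on='feature.properties.LSOA11CD',
                      fill_color='PuBu').add_to(m)
    display(m)

#"Widgetising" the notebook: let the user pick a practice and render the matching map
interact(practice_map, practice_code=sorted(df['PRACTICE_CODE'].unique()))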

Whilst I find the Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio and in particular the knitr tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.

[Screenshot: the parliamentary data R demos project open in RStudio]

For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive leaflet map, or a static PDF document.

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a separate PDF report for each GP practice on the Isle of Wight.

The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation).

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:

  • to look for sensible ways of comparing the workflows associated with each,
  • to identify the ramp-up skills required, and the blockers (including cultural blockers (also administrative / organisational blockers, h/t @dasbarrett)) associated with getting started with new tools such as Jupyter or RStudio, and
  • to explore the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, and what sorts of issue would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets, (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

[Diagram: Excel data to Powerpoint slide workflow]

#render via: http://blockdiag.com/en/blockdiag/demo.html
{
  X1[label='macro']
  X2[label='macro']

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations reporting on the 20% of items that account for 80% of the spend.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

[Table: sample of recent Isle of Wight transparency spending data]

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.

[Diagram: data to pandas dataframe to generated text to Jupyter notebook (slide mode) to reveal.js presentation]

{
  X1[label="text"]
  X2[label="Jupyter notebook\n(slide mode)"]
  X3[label="reveal.js\npresentation"]

  Y1[label="text"]
  Y2[label="text"]
  Y3[label="text"]

  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];
}
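The notebook-to-slideshow step turns on each cell’s slideshow metadata; here’s a minimal sketch of how generated text might be written into slide-tagged cells programmatically using nbformat (the cell contents and filename are placeholders):

import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell

nb = new_notebook()
for title, body in [('Spend by directorate', 'Generated summary text goes here...'),
                    ('Adult Services', 'Generated breakdown text goes here...')]:
    cell = new_markdown_cell('## {}\n\n{}'.format(title, body))
    #The slideshow metadata element is what notebook slide mode and nbconvert read
    cell.metadata['slideshow'] = {'slide_type': 'slide'}
    nb.cells.append(cell)

nbformat.write(nb, 'autoslides.ipynb')
#Then convert to a reveal.js presentation from the command line:
#  jupyter nbconvert autoslides.ipynb --to slides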

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time, the slide should scroll down to follow them.

The structure of the presentation is broadly as follows:

[Diagram: outline structure of the presentation, sketched with the blockdiag interactive shell]

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.

[Slide: summary of spend by directorate]

The next slide reviews the capital versus expenditure revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.
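For what it’s worth, that selection heuristic is easy enough to sketch in pandas; the column names below are assumptions, and the thresholds are parameters rather than the exact ones used:

import pandas as pd

def pareto_items(df, item_col='Expenses Type', amount_col='Amount', share=0.8, minshare=0.05):
    #Order items by total spend, largest first
    totals = df.groupby(item_col)[amount_col].sum().sort_values(ascending=False)
    cumshare = totals.cumsum() / totals.sum()
    #Keep the items needed to reach the cumulative share threshold, plus any other
    #item that individually accounts for more than the minimum share
    keep = (cumshare.shift().fillna(0) < share) | (totals / totals.sum() >= minshare)
    return totals[keep]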

[Slide: capital versus revenue spend for a directorate, broken down by expenses type]

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

[Slide: breakdown of major suppliers for a spending area]

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to prioritise the information displayed, so that it accounts for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

A Unit of Comparison for Local Council Budget Consultations, Based on Transparency Spending Data – ASCdays?

A few years ago, via the BBC Radio 4 & World Service programme More or Less (incidentally co-produced by the OU), I came across the notion of micromorts (Contextualising the Chance of Something Happening – Micromorts), a one in a million chance of death that can be used as a unit of risk to compare various likelihoods of dying. Associated with this measure is David Spiegelhalter’s microlife, “30 minutes of your life expectancy”. The point behind the microlife measure is that it provides a way of comparing life threatening risks based on how much life is likely to be lost, on average, when exposed to such risks.

Life expectancy for a man aged 22 in the UK is currently about 79 years, which is an extra 57 years, or 20,800 days, or 500,000 hours, or 1 million half hours. So, a young man of 22 typically has 1,000,000 half-hours (57 years) ahead of him, the same as a 26 year-old woman. We define a microlife as the result of a chronic risk that reduces life, on average, by just one of the million half hours that they have left.

The idea of micromorts came to mind last night as I was reflecting on a public budget consultation held by the Isle of Wight Council yesterday (a day that also saw the Council’s Leader and Deputy Leader resign their positions). The Council needs to find £20 million over the next 3 years, starting with £7.5m in the next financial year. This can come through increased funding, or cuts. By far the biggest chunk of expenditure by the council, as with all councils, is on adult social care (ASC) [community care statistics /via @jonpoole].

As with every year for the past however many years, I’ve had a vague resolution to do something with local council spending data, and never got very far. Early dabblings with the data that I’ve so far played with this year (and intend to continue…) reinforce the notion that ASC is expensive. Here’s a quick summary of the spending data items for October, 2016:

The spend for each of the directorates was as follows:

  • Adult Services:
    • total spend: £7,746,875.55 (48.33% of total monthly spend)
    • capital: £395,900.06 (5.11% of directorate monthly spend)
    • revenue: £7,350,975.49 (94.89% of directorate monthly spend)
  • Chief Executive:
    • total spend: £501,021.32 (3.13% of total monthly spend)
    • capital: £492,507.54 (98.30% of directorate monthly spend)
    • revenue: £8,513.78 (1.70% of directorate monthly spend)
  • Childrens Services:
    • total spend: £2,044,524.26 (12.76% of total monthly spend)
    • capital: £243,675.08 (11.92% of directorate monthly spend)
    • revenue: £1,800,849.18 (88.08% of directorate monthly spend)
  • Place:
    • total spend: £4,924,117.40 (30.72% of total monthly spend)
    • capital: £974,024.13 (19.78% of directorate monthly spend)
    • revenue: £3,950,093.27 (80.22% of directorate monthly spend)
  • Public Health:
    • total spend: £434,654.13 (2.71% of total monthly spend)
    • revenue: £434,654.13 (100.00% of directorate monthly spend)
  • Regeneration:
    • total spend: £57.65 (0.00% of total monthly spend)
    • revenue: £57.65 (100.00% of directorate monthly spend)
  • Resources:
    • total spend: £377,172.20 (2.35% of total monthly spend)
    • capital: £20,367.87 (5.40% of directorate monthly spend)
    • revenue: £356,804.33 (94.60% of directorate monthly spend)

Cancelling out Adult Services revenue spend for a month would match the £7.5 million required to make up next year’s funds. That’s unlikely to happen, but it does perhaps hint at a possible unit of comparison when trying to make budget decisions, or at least, support budget consultations.

From my naive perspective, adult social care needs to support a certain number of people, a number that evolves (probably?) in line with demographics. One of the ways people exit care is by dying, though the service is set up to minimise harm and help prolong life. Folk may also be transiently cared for (that is, they enter the care system and then leave it). By looking at the amount spent on adult social care, we can come up with an average cost (mean, median?) per person per day of adult social care – ASCdays. We can reduce the total cost by reducing the amount of time folk spend in the system, either by shortening transient stays or postponing entry into the system.
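By way of illustration only – the client count below is a made-up placeholder, not a real figure – the back-of-the-envelope arithmetic for an ASCday unit might look something like this:

#Illustrative arithmetic only: the number of clients is a made-up placeholder
monthly_asc_revenue = 7350975.49   #Adult Services revenue spend, October 2016 (from the figures above)
clients = 3000                     #hypothetical number of people receiving adult social care
days_in_month = 31

ascday_cost = monthly_asc_revenue / (clients * days_in_month)
print('Cost per ASCday: £{:.2f}'.format(ascday_cost))

#Express some other budget line in ASCday equivalents, e.g. a hypothetical £50,000 cut
print('£50,000 is roughly {:.0f} ASCdays'.format(50000 / ascday_cost))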

So what I’ve started wondering is this: as one way of trying to make sense of transparency spending data, is there any use in casting it into equivalent units of ASCdays? If we use ASCday equivalent units, can we take a weak systems view and try to get a feel for whether a cut to a particular service (or improvement of another) can help us get a handle on the ASC expenditure – or whether it might cause problems down the line?

For example, suppose a week’s respite care costs the same as two weeks’ worth of ASCdays. If that week’s respite care keeps someone out of the adult care service for a month, we’re quids in. If cutting respite care saves 100 ASCdays of funding, but is likely to bring just one person into the care system 3 months early, we might start to doubt whether it will actually lead to any saving at all. (Savings with a longer tail complicate matters, given that councils need to balance a budget within a financial year. Spending money this year to save next year requires access to reserves – and confidence in your bet…)

For trying to make budget decisions, or helping engage citizens in budgetary consultations, costing things in ASCday equivalents, and then trying to come up with some probabilities about the likelihood that a particular cut or expense will result in a certain number of people entering or leaving ASC sooner or later, may help you get a feel for the consequences of a particular action.

As to whether prior probabilities exist around whether cutting this service, or supporting that one, is likely to impact on the adult care system – maybe the data for that is out there too?

Convention Based URLs Support Automation

I spent a chunk of last week at a Curriculum Development Hackathon for a Data Carpentry workshop on Reproducible Research using Jupyter Notebooks (I’d like to thank the organisers for the travel support). One of the planned curriculum areas looked at data project organisation, another at automation. Poking around on an NHS data publication webpage for a particular statistical work area just now suggests an example of how the two inter-relate… and how creating inconsistent URLs or filenames makes automatically downloading similar files a bit of a faff, when it could be so easy…

To begin with, the A&E Attendances and Emergency Admissions statistical work area has URL:

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/

The crumb trail in the on-page navigation has the form:

Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions

which we might note jars somewhat with the slug ae-waiting-times-and-activity, and perhaps reflects some sort of historical legacy in how the data was treated previously…

Monthly data is collected on separate financial year related pages linked from that page:

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2016-17/

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2015-16-monthly-3/

The breadcrumb for these pages has the form:

Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions -> A&E Attendances and Emergency Admissions 20MM-NN

A few observations about those financial year related page URLs. Firstly, the path is rooted on the parent page (a Good Thing), but the slug appears to have been mangled together from a more reasonable parent path (statistical-work-areasae-waiting-times-and-activity looks as if it’s been collapsed from statistical-work-areas/ae-waiting-times-and-activity).

The next part of the URL specifies the path to the A & E Attendances and Emergency Admissions page for a particular year, with an appropriate slug for the name – ae-attendances-and-emergency-admissions- but differently formed elements for the years: 2016-17 compared to 2015-16-monthly-3.

(Note that the 2015-16 monthly listing is incomplete and starts in June 2015.)

If we look at URLs for some of the monthly 2016-17 Excel data file downloads, we see inconsistency in the filenames:

https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/November-2016-AE-by-provider-W0Hp0.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/October-2016-AE-by-provider-Nxpai.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/September-2016-AE-by-provider-BtD4b.xls

(Note that CSV data seems only to be available for the latest (November 2016) data set. I don’t know if this means that the CSV data link only appears for the current month, or data in the CSV format only started to be published in November 2016.)

For the previous year we get:

https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/08/March-2016-AE-by-provider-9g0dQ-Revised-11082016.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/08/February-2016-AE-by-provider-1gWNy-Revised-11082016.xls

and so on.

Inspection of these URLs suggests:

  1. the data is being uploaded to and published from a WordPress site (wp-content/uploads);
  2. the path to the data directory for the annual collection is minted according to the month in which the first dataset of the year is uploaded (data takes a month or two to be uploaded, so presumably the April 2016 data was posted in June 2016 (2016/06); the 2015 data started late, the first month (June 2015) presumably being uploaded in August of that year (2015/08));
  3. the month slug for the data file starts off fine, being of the form MONTH-YEAR-AE-by-provider-, but then breaks things by having some sort of code value that perhaps uniquely identifies the version of the file;
  4. the month slug may be further broken by the addition of a revision element (eg -Revised-11082016).

If the URLs all had a common pattern, it would be easy enough to automate their generation from a slug pattern and month/year combination, and then automatically download them. (I haven’t yet explored inside each spreadsheet to see what inconsistency errors/horrors make it non-trivial to try to combine the monthly data into a single historical data set…)
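To illustrate the point, if the filenames did follow a single pattern, the download step could be as simple as something like this (the pattern below is hypothetical – the real filenames carry unpredictable code suffixes, which is exactly the problem):

import requests

BASE = 'https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2'
#A hypothetical, consistent filename pattern - the real files don't follow one
PATTERN = '{base}/{year}/{pubmonth:02d}/{month}-{year}-AE-by-provider.xls'

def download_month(month, year, pubmonth=6):
    url = PATTERN.format(base=BASE, year=year, pubmonth=pubmonth, month=month)
    r = requests.get(url)
    r.raise_for_status()
    with open('{}-{}-AE-by-provider.xls'.format(month, year), 'wb') as f:
        f.write(r.content)

#e.g. download_month('November', 2016)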

As it is, to automate the download of the files requires scraping the web pages for the links, or manually retrieving them. (At least the link text on the pages seems to be reasonably consistent!)
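A sketch of that scraping route, using requests and BeautifulSoup and relying on the reasonably consistent link text (the “by provider” match is an assumption about how the links are labelled):

import requests
from bs4 import BeautifulSoup

year_page = ('https://www.england.nhs.uk/statistics/statistical-work-areas/'
             'ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-'
             'and-activityae-attendances-and-emergency-admissions-2016-17/')

soup = BeautifulSoup(requests.get(year_page).content, 'html.parser')
#Grab links to Excel files whose link text mentions the provider level data
xls_links = [a['href'] for a in soup.find_all('a', href=True)
             if a['href'].endswith('.xls') and 'by provider' in a.get_text().lower()]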

Datadive Reproducibility – Time for a DataBox?

Whilst at the Global Witness “Beneficial Ownership” datadive a couple of weeks ago, one of the things I was pondering – how to make the weekend’s discoveries reproducible on the one hand, and useful as a set of still working legacy tooling on the other – blended into another: how to provide an on-ramp for folk attending the event who were not familiar with the data or the way in which it was provided.

Event facilitators DataKind worked in advance with Global Witness to produce an orientation exercise based around a sample dataset. Several other prepped datasets were also made available via USB memory sticks distributed as required to the three different working groups.

The orientation exercise was framed as a series of questions applied to a core dataset, a denormalised flat 250MB or so CSV file containing just over a million or so rows, with headers. (I think Excel could cope with this – not sure if that was by design or happy accident.)

For data wranglers expert at working with raw datafiles and their own computers, this doesn’t present much of a problem. My gut reaction was to open the datafile into a pandas dataframe in a Jupyter notebook and twiddle with it there; but as pandas holds dataframes in memory, this may not be the best approach, particularly if you have multiple large dataframes open at the same time. As previously mentioned, I think the data also fit into Excel okay.

Another approach after previewing the data, even if just by looking at it on the command line with a head command, was to load the data into a database and look at it from there.

This immediately begs several questions, of course – if I have a database set up on my machine and import the data without thinking about it, how can someone else recreate that? If I don’t have a database on my machine (so I need to install one and get it running) and/or I don’t then know how to get data into the database, I’m no better off. (It may well be that there are great analysts who know how to work with data stored in databases but don’t know how to do the data engineering stuff of getting the database up and running and populated with data in the first place.)

My preferred solution for this at the moment is to see whether Docker containers can help. And in this case, I think they can. I’d already had a couple of quick plays looking at getting the Companies House significant ownership data into various databases (Mongo, neo4j) and used a recipe that linked a database container with a Jupyter notebook server that I could write my analysis scripts in (linking RStudio rather than Jupyter notebooks is just as straightforward).

Using those patterns, it was easy enough to create a similar recipe to link a Postgres database container to a Jupyter notebook server. The next step – loading the data in. Now it just so happens that in the days before the datadive, I’d been putting together some revised notebooks for an OU course on data management and analysis that dealt with quick ways of loading data into a Postgres database, so I wondered whether those notes provided enough scaffolding to help me load the sample core data into a database: a) even if I was new to working with databases, and b) in a reproducible way. The short answer was “yes”. Putting the two steps together, the results can be found here: Getting started – Database Loader Notebook.
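To give a flavour of that loading step, here’s a minimal sketch of the sort of thing involved – the connection string, database name, filename and table name are all assumptions about how the linked containers are set up:

import pandas as pd
from sqlalchemy import create_engine

#Assumed connection details for the linked Postgres container (requires psycopg2)
engine = create_engine('postgresql://postgres:postgres@postgres:5432/datadive')

#Read the ~250MB CSV in chunks so the whole file never has to sit in memory at once
for chunk in pd.read_csv('core_sample.csv', chunksize=100000):
    chunk.to_sql('ownership', engine, if_exists='append', index=False)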

With the data in a reproducibly shareable and “live” queryable form, I put together a notebook that worked through the orientation exercises. Along the way, I found a new-to-me HTML5/d3js package for displaying small  interactive network diagrams, visjs2jupyter. My attempt at the orientation exercises can be found here: Orientation Activities.

Whilst I am all in favour of expert data wranglers using their own recipes, tools and methods for working with the data – that’s part of the point of these expert datadives – I think there may also be mileage in providing a base install where the data is in some sort of immediately queryable form, such as in a minimal, even if not properly normalised, database. This means that datasets too large to be manipulated in memory or loaded into Excel can be worked with immediately. It also means that orientation materials can be produced that pose interesting questions that can be used to get a quick overview of the data, or tutorial materials produced that show how to work with off-the-shelf powertool combinations (Jupyter notebooks / Python/pandas / PostgreSQL, for example, or RStudio / R / PostgreSQL).

Providing a base set up to start from also acts as an invitation to extend that environment in a reproducible way over the course of the datadive. (When working on your own computer with your own tooling, it can be way too easy to forget what packages (apt-get, pip and so on) you have pre-installed that will cause breaking changes to any outcome code you share with others who do not have the same environment. Creating a fresh environment for the datadive, and documenting what you add to it, can help with that, but testing in a linked container, but otherwise isolated, context really helps you keep track of what you needed to add to make things work!)

If you also keep track of what you needed to do handle undeclared file encodings, weird separator characters, or password protected zip files from the provided files, it means that others should be able to work with the files in a reliable way…

(Just a note on that point for datadive organisers – metadata about file encodings, unusual zip formats, weird separator encodings etc is a useful thing to share, rather than have to painfully discover….)

Using tools like Docker is one way of improving the shareability of immediately queryable data, but is there an even quick way? One thing I want to explore on my to do list is the idea of a “databox”, a Raspberry Pi image that when booted runs a database server and Jupyter notebook (or RStudio) environment. The database can be pre-seeded with data for the datadive, so all that should be required is for an individual to plug the Raspberry Pi into their computer with an ethernet cable, and run from there. (This won’t work for really large datasets – the Raspberry Pi lacks grunt – but it’s enough to get you started.)

Note that these approaches scale out to other domains, such as data journalism projects (each project on its own Raspberry PI SD card or docker-compose setup…)

Revisiting Diabetes Prescribing Data

Last year, I had a quick dabble with creating Data Driven Press Releases From HSCIC Data based around a diabetes prescribing data release. Noticing that the figures for 2015/16 had been released over the summer break, I revisited last year’s script to see if it works with this year’s data – which it does, save for a few tweaks to the import of the spreadsheets.

So this year’s report for the Isle of Wight is as follows, constructed automatically from the data by looking across several sheets in two spreadsheets:

Figures recently published by the HSCIC for the NHS Isle Of Wight CCG show that for the reporting period 2015/16, the total Net Ingredient Costs (NIC) for prescribed diabetes drugs was £2,579,592.41, representing 9.95% of overall Net Ingredient Costs. The NHS Isle Of Wight CCG prescribed 137,741 diabetes drugs items, representing 4.28% of all prescribed items. The average net ingredient cost (NIC) was £18.73 per item. This compares to 4.14% of items (10.31% of total NIC) in the Wessex (Q70) region and 4.61% of items (10.57% of total NIC) in England.

Of the total diabetes drugs prescribed, Insulins accounted for 21,884 items at a total NIC of £1,071,880.50 (£48.98 per item (on average), 0.68% of overall prescriptions, 4.13% of total NIC) and Antidiabetic Drugs accounted for 94,347 items at a total NIC of £890,424.84 (£9.44 per item (on average), 2.93% of overall prescriptions, 3.43% of total NIC). Diagnostic and monitoring devices accounted for 20,485 items at a total NIC of £605,971.30 (£29.58 per item (on average), 0.64% of overall prescriptions, 2.34% of total NIC).

For the NHS ISLE OF WIGHT CCG, the NIC in 2015/16 per patient on the QOF diabetes register in 2014/15 was £330.42. The QOF prevalence of diabetes, aged 17+, for the NHS ISLE OF WIGHT CCG in 2014/15 was 6.61%. This compares to a prevalence rate of 6.37% in Wessex and 5.83% across England.

Creating reports for other CCGs is simply a matter of changing the CCG code. On the to do list is to pull data from last year as well as this year into a simple database, and then write some more sentence templates that compare the year on year performance. (I also really need to have a think about a more sensible way of generating sentences!)

After an interesting chat last night with Gary Warner from Island based Pinnacle Health Partnership, a social enterprise providing backend services for community pharmacies, I thought I’d have a poke around what services are out there based on the NHS open prescribing data, as a short cut to populating my own databases with the original data (each month’s dataset comes in at around 1GB).

OpenPrescribing.net seems really useful in this respect, at least as a quick way in to summaries of the data – the API allows you to pull down data by CCG, as well as GP practices within a CCG, and break out prescriptions by item using BNF codes. A handy look up service also helps find items by BNF section, such as Drugs used in diabetes (BNF 6.1).
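The calls themselves are just simple HTTP requests along the following lines – treat the endpoint path, parameter names and the CCG code as assumptions to be checked against the openprescribing.net API docs:

import requests
import pandas as pd

API = 'https://openprescribing.net/api/1.0'

#Spending on a BNF section (6.1, Drugs used in diabetes) by practice within a CCG
#('10L' is assumed here to be the Isle of Wight CCG code)
r = requests.get('{}/spending_by_practice/'.format(API),
                 params={'code': '6.1', 'org': '10L', 'format': 'json'})
df = pd.DataFrame(r.json())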

I’ve posted a quick sketch notebook as a gist, also embedded below.  But here’s a quick glimpse at some of the first reports I had a look at generating. For example, we can look at the spend associated with particular BNF section codes broken down by GP practices in a particular CCG area:

[Output: spend for a BNF section broken down by GP practice in the CCG]

and then group it by period (I made a crude guess at the financial year, but I’m not sure what the dates in openprescribing actually relate to…), so the aggregates are indicative only. The aggregate value over the CCG for item counts seems to be broadly in line with the NHS Diabetes Prescribing report, which I took as weak confirmation that it’s sort of working!
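The crude financial year guess is just a date calculation; continuing with the dataframe from the previous sketch (the date and value column names are assumptions about the API response), the grouping looks something like:

#Months from April onwards count towards the financial year starting in that calendar year
df['date'] = pd.to_datetime(df['date'])
df['fin_year'] = df['date'].apply(
    lambda d: '{}/{}'.format(d.year, d.year + 1) if d.month >= 4 else '{}/{}'.format(d.year - 1, d.year))
df.groupby('fin_year')[['items', 'actual_cost']].sum()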

[Output: spend aggregated by (guessed) financial year]

We can also use the API to look up items by BNF section:

[Output: items looked up by BNF section]

If we loop round the items in a BNF subsection, we can generate reports about the prescribing of particular items within that subsection across practices, merging in the item names to make them easier to identify:

[Output: prescribing of items within a BNF subsection across practices, with item names merged in]

The pandas charting tools aren’t brilliant – after the Yhat refresh of ggplot (for python), I think I need to revisit that library when tinkering in the python context – but we can do crude sketches quite easily.

[Chart: crude sketch plot of the prescribing data]

Anyway, playtime over. It was interesting to give openprescribing.net a go, and given a chance I’ll try to play with it a bit more to explore some more quick reports to add to the diabetes notebook.


Bands Incorporated

A few weeks ago, as I was doodling with some Companies House director network mapping code and simple Companies House chatbot ideas, I tweeted an example of Iron Maiden’s company structure based on co-director relationships. Depending on how the original search is seeded, the maps may also include elements of band members’ own personal holdings/interests. The following map, for example, is seeded just from the Iron Maiden LLP company number:

[Network map: Iron Maiden co-director company network]

If you know anything about the band, you’ll know Bruce Dickinson’s aircraft interests make complete sense…

That graph is actually a bipartite graph – nodes are either directors or companies. We can easily generate a projection of the graph that replaces directors that link companies by edges that represent “common director” links between companies:

[Network map: projection of the Iron Maiden graph onto companies]

(The edges are actually weighted, so the greater the edge weight, the more directors there are in common between the linked companies.)
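In networkx terms, that projection is a one-liner; here’s a small self-contained sketch with made-up director and company names:

import networkx as nx
from networkx.algorithms import bipartite

#Toy bipartite graph: director nodes link to company nodes (names are made up)
B = nx.Graph()
B.add_edges_from([('dir:A', 'co:Holdings'), ('dir:A', 'co:Touring'),
                  ('dir:B', 'co:Holdings'), ('dir:B', 'co:Touring'),
                  ('dir:C', 'co:Touring')])
companies = [n for n in B if n.startswith('co:')]

#Project onto company nodes; edge weights count the directors shared by each pair
coG = bipartite.weighted_projected_graph(B, companies)
print(coG.edges(data=True))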

In today’s Guardian, I notice they’re running a story about Radiohead’s company structure, with a parallel online piece, Radiohead’s corporate empire: inside the band’s dollars and cents which shows how to get a story out of such a map, as well as how to re-present the original raw map to provide to a bit more spatial semantic structure to it:

[Screenshot: Guardian piece “Radiohead’s corporate empire: inside the band’s dollars and cents”]

(The story also digs into the financial reports from some of the companies.)

By way of comparison, here’s my raw map of Radiohead’s current company structure, generated from Companies House data seeded on the company number for Radiohead Trademark:

[Network map: Radiohead co-director company network, seeded on Radiohead Trademark]

It’s easy enough to grab the data for other bands. So how about someone like The Who? If we look in the immediate vicinity of The Who Group, we see core interests:

[Network map: companies in the immediate vicinity of The Who Group]

But if we look for linkage to the next level of co-director links, we start to see other corporate groups that hold at least one shared interest with the band members:

[Network map: The Who network extended to the next level of co-director links]

So what other bands incorporated in the UK might be worth mapping?

Want to Get Started With Open Data? Looking for an Introductory Programming Course?

Want to learn to code but never got round to it? The next presentation of OUr FutureLearn course Learn to Code for Data Analysis will teach you how to write your own programme code, a line at a time, to analyse real open data datasets. The next presentation starts on 6 June 2016, runs for 4 weeks, and takes about 5 hrs per week.

I’ve often thought that there are several obstacles to getting started with programming. Firstly, there’s the rationale or context: why bother/what could I possibly use programming for? Secondly, there are the practical difficulties: to write and execute programmes, you need to get a programming environment set up. Thirdly, there’s the so what: “okay, so I can programme now, but how do I use this in the real world?”

Many introductory programming courses reuse educational methods and motivational techniques or contexts developed to teach children (and often very young children) the basics of computer programming to set the scene: programming a “turtle” that can drive around the screen, for example, or garishly coloured visual programming environments that let you plug logical blocks together as if they were computational Lego. Great fun, and one way of demonstrating some of the programming principles common to all programming languages, but they don’t necessarily set you up for seeing how such techniques might be directly relevant to an IT problem or issue you face in your daily life. And it can be hard to see how you might use such environments or techniques at work to help you perform real tasks… (Because programmes can actually be good at that – automating the repetitive and working through large amounts of stuff on your behalf.) At the other extreme are professional programming environments, like geekily bloated versions of Microsoft Word or Excel, with confusing preference setups and menus and settings all over the place. And designed by hardcore programmers for hardcore programmers.

So the approach we’ve taken in the OU FutureLearn course Learn to Code for Data Analysis is slightly different to that.

The course uses a notebook style programming environment that blends text, programme code, and the outputs of running that code (such as charts and tables) in a single, editable web page accessed via your web browser.

[Screenshot: a Learn to Code notebook running in SageMathCloud]

To motivate your learning, we use real world, openly licensed data sets from organisations such as the World Bank and the United Nations – data you can download and access for yourself – that you can analyse and chart using your own programme code. A line at a time. Because each line does its own thing, each line is useful, and you can see what each line does to your dataset directly.

So that’s the rationale: learn to code so you can work with data (and that includes datasets much larger than you can load into Excel…)

The practicalities of setting up the notebook environment still have to be negotiated, of course. But we try to help you there too. If you want to download and install the programming environment on your computer, you can do, in the form of the freely available Anaconda Scientific Computing Python Distribution. Or you can access an online version of the notebook based programming environment via SageMathCloud and do all your programming online, through your browser.

So that’s the practical issues hopefully sorted.

But what about the “so what”? Well, the language you’ll be learning is Python, a widely used programming language that makes it ridiculously easy to do powerful things.

Python cartoon – via https://xkcd.com/353/

But not that easy, perhaps..?!

The environment you’ll be using – Jupyter notebooks – is also a “real world” technology, originally developed as an open source platform for scientific computing but increasingly being used by journalists (data journalism, anyone?) and educators. It’s also attracted the attention of business, with companies such as IBM supporting the development of a range of interactive dashboard tools and backend service hooks that allow programmes written using the notebooks to be deployed as standalone online interactive dashboards.

The course won’t take you quite that far, but it will get you started, and safe in the knowledge that whatever you learn, as well as the environment you’re learning in, can be used directly to support your own data analysis activities at work, or at home as a civically minded open data armchair analyst.

So what are you waiting for? Sign up now and I’ll see you in the comments:-)

Trawling the Companies House API to Generate Co-Director Networks

Somewhen ago (it’s always somewhen ago; most of the world never seems to catch up with what’s already happened! :-( ) I started dabbling with the OpenCorporates API to generate co-director corporate maps that showed companies linked by multiple directors. It must have been a bad idea, because no-one could see any point in it, not even interestingness… (Which suggests to me that boards made up of directors are similarly meaningless? In which case, how are companies supposed to hold themselves to account?)

I tend to disagree. If I hadn’t been looking at connected companies around food processing firms, I would never have learned that one way meat processors cope with animal fat waste is to feed it into the biodiesel raw material supply chain.

Anyway, if we ever get to see a beneficial ownership register, a similar approach should work to generate maps showing how companies sharing beneficial owners are linked. (The same approach also drives my emergent social positioning Twitter maps and the Wikipedia semantic maps I posted about again recently.)

As a possible precursor to that, I thought I’d try to reimplement the code (in part to see if a better approach came to mind) using data grabbed directly from Companies House via their API. I’d already started dabbling with the API (Chat Sketches with the Companies House API) so it didn’t take much more to get a grapher going…

But first, I realise in that earlier post I’d missed the function for actually calling the API – so here it is:

import urllib2, base64, json
from urllib2 import HTTPError
from urllib import urlencode
from time import sleep

def url_nice_req(url,t=300):
    try:
        return urllib2.urlopen(url)
    except HTTPError, e:
        #Play nice with the rate limited API: back off and retry
        if e.code == 429:
            print("Overloaded API, resting for a bit...")
            sleep(t)
            return url_nice_req(url)

#Inspired by http://stackoverflow.com/a/2955687/454773
def ch_request(CH_API_TOKEN,url,args=None):
    if args is not None:
        url='{}?{}'.format(url,urlencode(args))
    request = urllib2.Request(url)
    # You need the replace to handle encodestring adding a trailing newline 
    # (https://docs.python.org/2/library/base64.html#base64.encodestring)
    base64string = base64.encodestring('%s:' % (CH_API_TOKEN)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)   
    result = url_nice_req(request)

    return json.loads(result.read())

CH_API_TOKEN='YOUR_API_TOKEN_FROM_COMPANIES_HOUSE'
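The harvester below also relies on the ch_getCompanyOfficers() and ch_getAppointments() helpers from that earlier post; roughly speaking, they wrap ch_request() around the company officers and officer appointments endpoints, along the lines of the following sketch (pagination and the typ/role filtering are glossed over here):

CH_API_ROOT = 'https://api.companieshouse.gov.uk'

def ch_getCompanyOfficers(conum, typ='current', role='director'):
    #Officers registered against a company
    return ch_request(CH_API_TOKEN,
                      '{}/company/{}/officers'.format(CH_API_ROOT, conum),
                      args={'items_per_page': 100})

def ch_getAppointments(oid, typ='current', role='director'):
    #Companies a given officer is appointed to
    return ch_request(CH_API_TOKEN,
                      '{}/officers/{}/appointments'.format(CH_API_ROOT, oid),
                      args={'items_per_page': 100})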

In the original implementation, I stored the incremental search results in a dict; in the reimplementation, I thought I’d make use of a small SQLite database.

import sqlite3

memDB=":memory:"
tmpDB='example.db'

#Close any previously opened connection (e.g. if rerunning this cell)
if 'db' in locals() and db is not None:
    db.close()

db = sqlite3.connect(tmpDB)
c = db.cursor()

for drop in ['directorslite','companieslite','codirs','coredirs','singlecos']:
    c.execute('''drop table if exists {}'''.format(drop))
              
c.execute('''create table directorslite
         (dirnum text primary key,
          dirdob integer,
          dirname text)''')

c.execute('''create table companieslite
         (conum text primary key,
          costatus text,
          coname text)''')

c.execute('''create table codirs
         (conum text,
          dirnum text,
          typ text,
          status text)''')

c.execute('''create table coredirs
         (dirnum text)''')

c.execute('''create table singlecos
         (conum text,
          coname text)''')

cosdone=[]
cosparsed=[]
dirsdone=[]
dirsparsed=[]
codirsdone=[]

The code itself runs in two passes. The first pass builds up a seed set of directors from a single company or set of companies using a simple harvester:

def updateOnCo(seed,typ='current',role='director'):
    print('harvesting {}'.format(seed))
    
    #apiNice()
    o=ch_getCompanyOfficers(seed,typ=typ,role=role)['items']
    x=[{'dirnum':p['links']['officer']['appointments'].strip('/').split('/')[1],
          'dirdob':p['date_of_birth']['year'] if 'date_of_birth' in p else None,
          'dirname':p['name']} for p in o]
    z=[]
    for y in x:
        if y['dirnum'] not in dirsdone:
            z.append(y)
            dirsdone.append(y['dirnum'])
        if isinstance(z, dict): z=[z]
    print('Adding {} directors'.format(len(z)))
    c.executemany('INSERT INTO directorslite (dirnum, dirdob,dirname)'
                     'VALUES (:dirnum,:dirdob,:dirname)', z)
    for oo in [i for i in o if i['links']['officer']['appointments'].strip('/').split('/')[1] not in dirsparsed]:
        oid=oo['links']['officer']['appointments'].strip('/').split('/')[1]
        print('New director: {}'.format(oid))
        #apiNice()
        ooo=ch_getAppointments(oid,typ=typ,role=role)
        #apiNice()
        #Play nice with the api
        sleep(0.5)
        #add company details
        x=[{'conum':p['appointed_to']['company_number'],
          'costatus':p['appointed_to']['company_status'] if 'company_status' in p['appointed_to'] else '',
          'coname':p['appointed_to']['company_name'] if 'company_name' in p['appointed_to'] else ''} for p in ooo['items']]
        z=[]
        for y in x:
            if y['conum'] not in cosdone:
                z.append(y)
                cosdone.append(y['conum'])
        if isinstance(z, dict): z=[z]
        print('Adding {} companies'.format(len(z)))
        c.executemany('INSERT INTO companieslite (conum, costatus,coname)'
                     'VALUES (:conum,:costatus,:coname)', z)
        for i in x:cosdone.append(i['conum'])
        #add company director links
        dirnum=ooo['links']['self'].strip('/').split('/')[1]
        x=[{'conum':p['appointed_to']['company_number'],'dirnum':dirnum,
            'typ':'current','status':'director'} for p in ooo['items']]
        c.executemany('INSERT INTO codirs (conum, dirnum,typ,status)'
                     'VALUES (:conum,:dirnum,:typ,:status)', x)
        print('Adding {} company-directorships'.format(len(x)))
        dirsparsed.append(oid)
    cosparsed.append(seed)

The set of seed companies may be companies associated with one or more specified seed directors, for example:

def dirCoSeeds(dirseeds,typ='all',role='all'):
    ''' Find companies associated with dirseeds '''
    coseeds=[]
    for d in dirseeds:
        for c in ch_getAppointments(d,typ=typ,role=role)['items']:
            coseeds.append(c['appointed_to']['company_number'])
    return coseeds

dirseeds=[]
for d in ch_searchOfficers('Bernard Ecclestone',n=10,exact='forename')['items']:
    dirseeds.append(d['links']['self'])
    
coseeds=dirCoSeeds(dirseeds,typ='current',role='director')

Then I call a first pass of the co-directed companies search with the set of company seeds:

typ='current'
#Need to handle director or LLP Designated Member
role='all'
for seed in coseeds:
    updateOnCo(seed,typ=typ,role=role)
c.executemany('INSERT INTO coredirs (dirnum) VALUES (?)', [[d] for d in dirsparsed])

seeder_roles=['Finance Director']
#for dirs in seeded_cos, if dir_role is in seeder_roles then do a second seeding based on their companies
#TO DO

depth=0

Then we go for a crawl for as many steps as required… The approach I’ve taken here is to search through the current database to find the companies heuristically defined as codirected, and then feed these back into the harvester.

seeder=True
oneDirSeed=True
#typ='current'
#role='director'
maxdepth=3
#relaxed=0
while depth<maxdepth:
    print('---------------\nFilling out level - {}...'.format(depth))
    if seeder and depth==0:
        #Another policy would be dive on all companies associated w/ dirs of seed
        #In which case set the above test to depth==0
        tofetch=[u[0] for u in c.execute(''' SELECT DISTINCT conum from codirs''')]
    else:
        duals=c.execute('''SELECT cd1.conum as c1,cd2.conum as c2, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2
                        ON cd1.dirnum = cd2.dirnum
                        WHERE cd1.conum < cd2.conum GROUP BY c1,c2 HAVING COUNT(*)>1
                        ''')
        tofetch=[x for t in duals for x in t[:2]]
        #The above has some issues. eg only 1 director is required, and secretary IDs are unique to company
        #Maybe need to change logic so if two directors OR company just has one director?
        #if relaxed>0:
        #    print('Being relaxed {} at depth {}...'.format(relaxed,depth))
        #    duals=c.execute('''SELECT cd.conum as c1,cl.coname as cn, count(*) FROM codirs as cd JOIN companieslite as cl 
        #                 WHERE cd.conum= cl.conum GROUP BY c1,cn HAVING COUNT(*)=1
        #                ''')
        #    tofetch=tofetch+[x[0] for x in duals]
        #    relaxed=relaxed-1
    if depth==0 and oneDirSeed:
        #add in companies with a single director first time round
        sco=[]
        for u in c.execute('''SELECT DISTINCT cd.conum, cl.coname FROM codirs cd  JOIN companieslite cl ON
                                cd.conum=cl.conum'''):
            #apiNice()
            o=ch_getCompanyOfficers(u[0],typ=typ,role=role)
            if len(o['items'])==1 or u[0]in coseeds:
                sco.append({'conum':u[0],'coname':u[1]})
                tofetch.append(u[0])
        c.executemany('INSERT INTO singlecos (conum,coname) VALUES (:conum,:coname)', sco)
    #TO DO: Another stategy might to to try to find the Finance Director or other names role and seed from them?
    
    #Get undone companies
    print('To fetch: ',[u for u in tofetch if u not in cosparsed])
    for u in [x for x in tofetch if x not in cosparsed]:
            updateOnCo(u,typ=typ,role=role)
            cosparsed.append(u)
            #play nice
            #apiNice()
    depth=depth+1
    #Parse companies

To visualise the data, I opted for Gephi, which meant having to export the data. I started off with a simple CSV edgelist exporter:

data=c.execute('''SELECT cl1.coname as Source,cl2.coname as Target, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN companieslite as cl1 JOIN companieslite as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum < cd2.conum GROUP BY Source, Target HAVING COUNT(*) > 1''')
import csv
with open('output1.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target'])
    writer.writerows(data)
    
data= c.execute('''SELECT cl1.coname as c1,cl2.coname as c2 FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN singlecos as cl1 JOIN singlecos as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum < cd2.conum''')
with open('output1.csv', 'ab') as f:
    writer = csv.writer(f)
    writer.writerows(data)

but soon changed that to a proper graph file export, based on a graph built around the codirected companies using the networkx package:

import networkx as nx

G=nx.Graph()

data=c.execute('''SELECT cl.conum as cid, cl.coname as cn, dl.dirnum as did, dl.dirname as dn
FROM codirs AS cd JOIN companieslite as cl JOIN directorslite as dl ON cd.dirnum = dl.dirnum and cd.conum=cl.conum ''')
for d in data:
    G.add_node(d[0], Label=d[1])
    G.add_node(d[2], Label=d[3])
    G.add_edge(d[0],d[2])
nx.write_gexf(G, "test.gexf")

I then load the graph file into Gephi to visualise the data.

Here’s an example of the sort of thing we can get out for a search seeded on companies associated with the Bernie Ecclestone who directs at least one F1 related company:

[Screenshot: the co-director network visualised in Gephi]

On the to do list is to automate this a little bit more by adding some network statistics, and possibly a first pass layout, in the networkx step.
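As a starter for ten, here’s a sketch of that step (networkx 2.x attribute syntax), annotating the graph built above with a couple of simple statistics before re-running the export:

deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)
for n in G.nodes():
    #Store the statistics as node attributes so they come through in the GEXF export
    G.nodes[n]['degree_centrality'] = deg[n]
    G.nodes[n]['betweenness'] = btw[n]
nx.write_gexf(G, "test.gexf")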

In terms of time required to collect the data, the Companies House API (https://developer.companieshouse.gov.uk/api/docs/index/gettingStarted/rateLimiting.html) is rate limited to allow 600 requests within a five minute period. Many company networks can be mapped within the 600 call limit, but even for larger networks, the trawl doesn’t take too long, even if two or three rest periods are required.