One of the things I’ve noticed coming out of GDS, the Government Digital Service, over the last few weeks is that reproducible research and report generation seems to be an active area of interest. In conversations with folk around the House of Commons Library, it seems as if there may be an appetite for starting to explore a similar approach there. So what might be involved?
As the GDS post on Reproducible Analytical Pipelines suggests, a typical workflow for producing a report containing statistical results, charts and tables often looks something like the following:
(Actually, I’m not sure what distinct roles the statistical software and the spreadsheet are supposed to play in that diagram?)
In this case, data is obtained from a data store; wrangled, analysed, tabulated and charted; annotated and commented upon; and then published. The workflow is serialised across several applications (which in itself is not necessarily a bad thing) but may be laborious to reproduce, for example if the data is updated and a new version of the briefing is required.
A few weeks ago, representatives from the DfE and DCMS, along with GDS, appear to have held a seminar on how this sort of workflow has been implemented for statistics production in their departments. (I’m guessing some of the contributors were participants in the Data Science Accelerator programme…?)
A similar sort of workflow exists for producing library research briefings, along with a cut down variant of the workflow for answering Members’ questions (the output format may be an email rather than a PDF or Word document). In the latter case, reproducing the workflow may be required if a member needs an updated response to a previously asked question, or another member asks a similar question about a different constituency.
In this case, automation of analyses or the production of document assets (tables, charts, etc) may support reuse of methods used to answer a question for one particular constituency across all constituencies, or methods used to answer a question at one particular time to re-answer the same question at a later time with updated information.
A functionally equivalent process to the GDS workflow can be implemented using a reproducible research toolchain such as the one described using the R statistical programming language and knitr publishing package by Michael Sachs in his Tools for Reproducible Research: Introduction to knitr presentation:
In this case, the data ingest, analysis, tabulation and charting is done in the context of the annotation and commentary as part of a single source document – the Rmd report. Various automated publication routes then handle the rendering and publication of the final document.
In the following example, a prepackaged dataset is used as the basis for a simple scatterplot, created using one line of code.
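As a minimal Python analogue of that one-liner (a sketch only: pandas’ plotting stands in for the R example, and the data values here are illustrative stand-ins for a prepackaged dataset such as R’s mtcars):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# A small inline stand-in for a prepackaged dataset
# (the R example would use something bundled with the language).
df = pd.DataFrame({"wt": [2.6, 2.9, 3.2, 3.4, 1.8],
                   "mpg": [21.0, 20.1, 19.2, 18.7, 33.9]})

# The scatterplot itself really is a single line of code:
ax = df.plot.scatter(x="wt", y="mpg")
```

The returned axes object can then be styled, annotated or saved by later steps in the script.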
One of the possible arguments against the automated production of reports containing graphics is that the graphics won’t conform to the house style or convention. However, graphics packages such as matplotlib in Python, or ggplot in R, allow you both to “write charts” and to create chart objects to which style can then be applied.
In the above example, the styling applied to the chart object can be updated by adding a simple predefined clause to the definition of the chart object. (Themes can be applied in Python using Seaborn styles, or in R using ggplot ggthemes. GDS have already produced a Government “govstyle” theme for R, so it should equally be possible to produce green and red themes for the House of Commons and House of Lords libraries respectively.)
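As a sketch of the same separation-of-concerns idea in Python: the chart-building code stays untouched, and a predefined style is applied around it. (The `"ggplot"` style used here is one of matplotlib’s built-in style sheets; an organisation could ship its own style file as a house theme instead.)

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def make_chart():
    """Build and return a chart object; styling is applied separately."""
    fig, ax = plt.subplots()
    ax.scatter([1, 2, 3, 4], [10, 7, 12, 9])
    ax.set_title("Example chart")
    return fig

# Applying a "house style" is a single wrapping clause around chart creation;
# swapping themes means changing one string, not the chart code.
with plt.style.context("ggplot"):
    styled_fig = make_chart()
```

Creating an unstyled chart for a different publication is just a call to `make_chart()` outside the style context.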
(Related to this, I’ve also previously dabbled with quick experiments to automatically generate accessible text descriptions from scripted chart objects.)
An output document, in this case HTML (but it could be PDF, or a Microsoft Word document), can then be generated from the source document. If desired, the display of the code used to generate the chart objects can be suppressed in the output document. Generating an updated version of the chart just requires an update to the dataset and regenerating the output document, in the desired format, from the source document.
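In an Rmd source document, for example, code display is controlled by a chunk option: `echo=FALSE` runs the chunk (so the chart appears in the output) but hides the code that produced it. A minimal sketch, using `cars`, one of R’s prepackaged datasets:

````
```{r speed-chart, echo=FALSE}
plot(cars$speed, cars$dist)
```
````

Regenerating the report with the code visible again is a one-word change to the chunk header.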
Reproducible code scripts can also be used to produce a particular chart type in a self-documenting way that may be used as a training example or as the basis for another diagram. For example, this Migrant Flow notebook documents the creation of a Sankey diagram as well as providing information about how to export it in different file formats.
As well as scripting statistical analyses and chart generation so that they can be reproduced, code can often be reused as part of an interactive application. In the R ecosystem, the shiny package supports the creation of customised interactive applications around a particular dataset.
For example, a dataset reporting different broadband statistics at LSOA level for a particular local authority can be used as the basis of a graphical reporting tool that displays a selected statistic using a choropleth map.
The creation of such an application can be used to demonstrate how reusable components can be developed one useful step at a time to provide a range of tools around a particular dataset. For example:
- for a particular LA, generate a map for a particular statistic; this requires loading in a shapefile and a datafile that use the same identifier scheme to identify regions. Data associated with a particular region can then be used to colour the corresponding area of the map. This might be developed in response to a query from a particular person for a particular area, and used to generate a map or tabular data returned in an email, for example.
- generalise the mapping function so that it can use data associated with a selected statistic within the datafile to produce maps/tables for other members based on their constituency code.
- create an interactive application that uses column headings in the datafile corresponding to different statistical measures as the options in a drop down list selection UI component; the selected item from the drop down list can then be used to trigger the generation of the map for that statistic.
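The generalisation steps above can be sketched in Python (with hypothetical column names and data values) as a small helper that turns a chosen statistic column into the region→value mapping a choropleth needs, with the drop-down options derived directly from the datafile’s column headings:

```python
import pandas as pd

def stat_by_region(df, stat_col, code_col="lsoa_code"):
    """Return a region-code -> value mapping for a chosen statistic,
    ready to be joined against a shapefile to colour a choropleth."""
    if stat_col not in df.columns:
        raise ValueError(f"Unknown statistic: {stat_col}")
    return df.set_index(code_col)[stat_col].to_dict()

# Illustrative broadband-style data at LSOA level:
df = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01000002"],
    "median_speed_mbps": [34.2, 11.8],
    "pct_superfast": [82.0, 41.5],
})

# The interactive app's drop-down list is just the statistic columns:
stats = [c for c in df.columns if c != "lsoa_code"]
values = stat_by_region(df, stats[0])
```

The same function answers a one-off email query (call it once) and powers the interactive app (call it from the drop-down’s change handler).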
The above example relates to a tool that plots a variety of statistics contained within a single data file for a particular LA. (You can find the code here.) This code can itself act as a building block for further work. For example:
- extend the code to generate maps for other LAs by specifying an appropriate LA code.
- write a script to iterate through all LA codes and produce a report for each. (For a [related example](https://blog.ouseful.info/2017/02/23/reporting-in-a-repeatable-parameterised-transparent-way/), [these documents](https://psychemedia.github.io/parlihacks/iwgplsoadocs/) were created from a script that mapped the patient catchment area within a particular LA for a particular GP practice code, that was itself called multiple times from a second script that looked up the GP practice codes within a particular LA.)
- use a different datafile (using similar region codes) to display different sorts of data.
- add a button to the interactive application to generate and download a PDF or PNG version of the map;
- add a selection list to the interactive application to allow the user to select a particular LA as well as a particular statistic.
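The “script calling a parameterised report” pattern from the list above might be sketched like this (everything here is illustrative: the function name stands in for something like a parameterised `rmarkdown::render` call, and the LA codes and output paths are made up):

```python
# Hypothetical sketch: render one report per area from a single template.
def render_report(la_code, out_dir="reports"):
    # A real version would re-run the analysis for this LA and write
    # e.g. an HTML or PDF document; here we just compute the output name.
    return f"{out_dir}/briefing_{la_code}.html"

la_codes = ["E06000046", "E08000025", "E09000001"]
outputs = [render_report(code) for code in la_codes]
```

Updating every briefing when the data changes then means re-running one loop, not redoing each document by hand.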
Each additional step results in something more, or differently, useful, and provides another code fragment that could be reused as a building block or “tweakable example” elsewhere.
In the Jupyter notebook ecosystem, ipywidgets provide a complement to the R/Shiny approach by allowing the use of interactive widgets inline in a notebook, or as part of an interactive dashboard.
By scripting – that is, automating – different parts of the enquiry answering process, we can start to develop a range of components that can be used to scale the answering of queries of the same form to other areas or periods of time without having to redo the same work each time.
Making code available also supports checking and transparency of method. For example, Chris Hanretty’s blog post asking Is the left over-represented within academia? is backed up by code that allows the reader to check his working, although the rerunnability of the script falls short by not explicitly specifying how to obtain and load in the source data. The media are also starting to make use of reproducible scripts to support some of their stories. For example, Buzzfeed News regularly publish scripts, such as this one on how Nobel Prizewinners Show Why Immigration Is So Important For American Science, as background support for their stories. (See also: Data Journalism Units on Github.) By publishing reproducible research scripts, third parties can not only check working and assumptions, but may also extend or otherwise build on the same research. They can also generate chart assets, for example, according to the first party analysis and then theme exactly that chart in their own house style.
Reusability and “scripting support” can also be promoted through the use and development of software packages developed to make accessing and analysing particular datasets easier. For example, Oli Hawkins’ Python MNIS API wrapper, or Evan Odell’s Hansard Speeches and Sentiment dataset and R hansard Parliamentary data API wrapper, provide tools that make it easier to access or ingest data into Python or R environments. The Python [pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/) package offers support for accessing data from a growing number of sources including the World Bank, OECD and Eurostat, and exposing it as tabular pandas dataframes.
Identifying other often-used data sources as candidates for “wrapping” in a similar way, such that data access can be automated in a repeatable way, is one way of improving local workflow while also contributing back to the wider data-using community.
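The wrapper pattern itself might look something like this minimal Python sketch (the function, dataset and column names are all hypothetical, and the injectable `_fetch` stub stands in for the real HTTP call a published wrapper would make):

```python
import pandas as pd

def get_broadband_stats(la_code, _fetch=None):
    """Fetch LSOA-level broadband stats for a local authority as a tidy
    dataframe, hiding the fiddly access details behind one function.
    `_fetch` is injectable for testing; a real version would call the API."""
    raw = _fetch(la_code) if _fetch else {}  # placeholder for an HTTP call
    return pd.DataFrame(raw)

# A stubbed fetcher stands in for the remote source:
stub = lambda la: {"lsoa_code": ["E01000001"], "median_speed_mbps": [34.2]}
df = get_broadband_stats("E06000046", _fetch=stub)
```

The payoff is that every analysis script starts from the same one-line, repeatable ingest step rather than ad hoc download-and-clean code.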
Accessing such data sources using scripts enhances an analysis by documenting the provenance of the data in such a way that a third party can also access it. (If the data source does not support versioning but may include updates, keeping an archival copy of the data used in a particular analysis is also recommended…)
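A minimal sketch of the archival-copy habit, assuming CSV text and a local archive directory (the file and directory names are illustrative):

```python
from datetime import date
from pathlib import Path

def archive_copy(data_text, name, archive_dir="data_archive"):
    """Keep a dated snapshot of the data used in an analysis, so the
    analysis can later be re-run against exactly the same data."""
    Path(archive_dir).mkdir(exist_ok=True)
    out = Path(archive_dir) / f"{name}_{date.today().isoformat()}.csv"
    out.write_text(data_text)
    return out

path = archive_copy("code,value\nE01000001,34.2\n", "broadband_lsoa")
```

A published report can then cite both the live source and the dated snapshot it was actually built from.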