Coming down from a festival high, I spent the day yesterday jamming with code and trying to get a feel for Apache Drill. As I’ve posted before, Apache Drill is really handy for querying large CSV files.
The test data I’ve been using is Evan Odell’s 3GB Hansard dataset, downloaded as a CSV file but recast in the parquet format to speed up queries (see the previous post for details). I had another look at the dataset yesterday, and popped together some simple SQL warm-up exercises in a notebook (here).
Something I found myself doing was flitting between running SQL queries over the data using Apache Drill to return a pandas dataframe, and wrangling the pandas dataframes directly, following the path of least resistance to doing the thing I wanted to do at each step. (Which is to say, if I couldn’t figure out the SQL, I’d try moving into pandas; and if the pandas route was too fiddly, rethinking the SQL query! That said, I also noticed I had got a bit rusty with SQL…) Another pattern of behaviour I found myself falling into was using Apache Drill to run summarising queries over the large original dataset, and then working with these smaller, summary datasets as in-memory pandas dataframes. This could be a handy strategy, I think.
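The summarise-in-Drill, wrangle-in-pandas pattern can be sketched roughly as follows. The Drill table path and column names here are invented stand-ins, not the actual Hansard schema, and the dataframe below simply plays the role of the summary result a Drill client would return.

```python
import pandas as pd

# Hypothetical Drill summarising query -- the dfs path and column names
# are assumptions for illustration, not the real Hansard dataset schema.
summary_sql = """
SELECT speaker, COUNT(*) AS n_speeches
FROM dfs.tmp.`hansard.parquet`
GROUP BY speaker
"""

# In a live setup this frame would come back from the Drill query (e.g. via
# a Drill client library); here a small stand-in frame plays that role.
summary = pd.DataFrame({"speaker": ["A", "B", "C"],
                        "n_speeches": [120, 45, 300]})

# The smaller, summarised dataset is then wrangled in memory with pandas,
# following the path of least resistance for each step.
top = summary.sort_values("n_speeches", ascending=False).head(2)
```

The heavy lifting (the full-table GROUP BY) stays in Drill; only the small summary frame needs to fit in memory.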
As well as being able to query large flat CSV files, Apache Drill also allows you to run queries over JSON files, as well as over directories full of similarly structured JSON or CSV files. Lots of APIs export data as JSON, so being able to save the responses of multiple calls on a similar topic as uniquely named (the names themselves don’t matter…) flat JSON files in the same folder, and then run a SQL query over all of them simply by pointing to the host directory, is really appealing.
I explored these features in a notebook exploring UK Parliament Written Answers data. In particular, the notebook shows:
- requesting multiple pages on the same topic from the UK Parliament data API (not an Apache Drill feature!);
- querying deep into a single large JSON file;
- running JOIN queries over data contained in a JSON file and a CSV file;
- running JOIN queries over data contained in a JSON file and data contained in multiple, similarly structured, data files in the same directory.
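The JOIN idea can be sketched as below. The Drill-style SQL at the top uses made-up file paths and column names (they are not from the notebook); the pandas merge underneath expresses the same join over in-memory stand-in frames, as a point of comparison.

```python
import pandas as pd

# Hedged sketch of a Drill-style JOIN between a JSON file and a directory
# of similarly structured files -- paths and columns are invented here.
join_sql = """
SELECT j.memberId, j.question, c.partyName
FROM dfs.tmp.`answers.json` AS j
JOIN dfs.tmp.`members/` AS c
ON j.memberId = c.memberId
"""

# The same join, expressed over in-memory pandas stand-ins:
answers = pd.DataFrame({"memberId": [1, 2], "question": ["Q1", "Q2"]})
members = pd.DataFrame({"memberId": [1, 2], "partyName": ["X", "Y"]})
joined = answers.merge(members, on="memberId")
```

With Drill, the join runs over the files in place; with pandas, both sides first have to be loaded into memory.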
All I need to do now is come up with a simple “up and running” recipe for working with Apache Drill. I’m thinking: Docker Compose and some linked containers: Jupyter notebook, RStudio, Apache Drill, and maybe a shared data volume between them?
A stub post for collecting together:
- code libraries that wrap APIs and datasets related to UK Parliament data;
- downloadable datasets (raw and annotated);
- API endpoints;
If you know of any others, please let me know via the comments…
Official Parliament APIs
- UK Parliament data API: data from UK Parliament; the catalogue is most easily explored via http://explore.data.parliament.uk/.
- UK Parliament Members Data Platform (MNIS)
Secondary Source APIs
- mnis: a small Python library that makes it easy to download data on UK Members of Parliament from the Members Names Information Service (MNIS)
- They Work For You API: no longer maintained?
- hansard: R package to automatically fetch data from the UK Parliament API; but not Hansard!
- twfy: R wrapper for TheyWorkForYou’s API (about)
- twfyR: another R wrapper for TheyWorkForYou’s API
Data Sourcing Applications
- Popbuilder: combined population estimates by selected LSOA (about, code)
- Constituency Boundaries: tool for exploring current and proposed constituency boundaries and downloading name, code, and Parliamentary electorate data (about)
- Parlparse: scripts from They Work For You for scraping the Parliament website
- Members’ interests scraper (node.js?; no README/howto)
- APPG scraper (old/archived)
- Hansard Speeches and Sentiment: a public dataset of speeches in Hansard, with information on the speaking MP, their party, gender and age at the time of the speech. (Large files (~3GB) – example of how to query datafile using Apache Drill.)
Data Handling Utilities
- R code for an internal DfT tool for Transport Appraisal and Strategic Modelling (TASM) use: converts outputs from the National Trip End Model (NTEM) for use in the National Transport Model (NTM);
One of the ways of finding data-related files scattered around an organisation’s website is to run a web search using a search limit that specifies a data-y filetype, such as xlsx for an Excel spreadsheet (xls files are also good candidates). For example, on the Parliament website, we could run a query along the lines of filetype:xlsx site:parliament.uk and then opt to display the omitted results:
Taken together, these files form an ad hoc datastore (e.g. as per this demo on using FOI response on WhatDoTheyKnow as an “as if” open datastore).
Looking at the URLs, we see that data-containing files are strewn about the online Parliamentary estate (that is, the website;-)…
Freedom of Information Related Datasets
Responses to Written Questions often come with datafile attachments.
These files are posted to the subdomain http://qna.files.parliament.uk/qna-attachments.
Looking at an actual URL, something like http://qna.files.parliament.uk/qna-attachments/454264/original/28152%20-%20table.xlsx, it looks as if some guesswork is required to generate the URL from the data contained in the API response? (For example, how might original attachments be distinguished from other attachments, such as “revised” ones, maybe?)
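As a speculative sketch of that guesswork, the example URL above suggests a pattern of attachment id, an original path segment, and a URL-encoded filename. Whether the id and filename actually come back in the API response in this form is an open question; the helper below is a hypothetical reconstruction, not a documented API.

```python
from urllib.parse import quote

# Speculative reconstruction of a qna-attachments URL. The "original" path
# segment is copied from the single example URL seen so far; treat the whole
# pattern as a guess rather than a documented scheme.
def qna_attachment_url(attachment_id, filename):
    return ("http://qna.files.parliament.uk/qna-attachments/"
            f"{attachment_id}/original/{quote(filename)}")
```

Applied to the example above, `qna_attachment_url(454264, "28152 - table.xlsx")` reproduces the observed URL, with the space characters percent-encoded.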
The data files also appear on the http://qna.files.parliament.uk/ subdomain, although it looks like they’re on a different path to the answered question attachments (http://qna.files.parliament.uk/ws-attachments compared to http://qna.files.parliament.uk/qna-attachments). This subdomain doesn’t appear to have its data files indexed and searchable on Google? I don’t see a Written Statements API on http://explore.data.parliament.uk/ either?
Deposited papers often include supporting documents, including spreadsheets.
Files are located under http://data.parliament.uk/DepositedPapers/Files/:
At the current time there is no API search over deposited papers.
A range of documents may be associated with Committees, including reports, responses to reports, and correspondence, as well as evidence submissions. These appear to mainly be PDF documents. Written evidence documents are rooted on http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/ and can be found from committee written evidence web (HTML) pages rooted on the same path (example).
A web search for site:parliament.uk inurl:committee (filetype:xls OR filetype:csv OR filetype:xlsx) doesn’t turn up any results.
Parliamentary Research Briefings
Research briefings are published by Commons and Lords Libraries, and may include additional documents.
Briefings may be published along with supporting documents, including spreadsheets:
The files are published under the following subdomain and path: http://researchbriefings.files.parliament.uk/.
The file attachments URLs can be found via the Research Briefings API.
This response is a cut-down result – the full resource description, including links to supplementary items, can be found by keying on the numeric identifier from the _about URI by which the “naturally” identified resource (e.g. SN06643) is described.
Data files can be found variously around the Parliamentary website, including down the following paths:
- http://qna.files.parliament.uk/qna-attachments (appear in Written Answers API results);
- http://researchbriefings.files.parliament.uk/ (appear in Research Briefings API results).
(I don’t think the API supports querying resources that specifically include attachments in general, or attachments of a particular filetype?)
What would be nice would be support for discovering some of these resources. A quick way in to this would be the ability to limit search query responses to webpages that link to a data file, on the grounds that the linking web page probably contains some of the keywords that you’re likely to be searching for data around?
So…. inspired by @philbgorman, I had a quick play last night with Parliament Written Questions data, putting together a recipe (output) for plotting a Sankey diagram showing the flow of questions from Members of the House of Commons by Party to various Answering Bodies for a particular parliamentary session.
The response that comes back from the Written Questions API includes a question uin (unique identification number?). If you faff around with date settings on the Parliamentary Questions web page you can search for a question by this ID:
Here’s an example of the response from a bulk download of questions (by 2015/16 session) from the Commons Written Questions API, deep filtered by the
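A bulk download like this typically means paging through the API. The sketch below shows how the page URLs might be built; the endpoint name and the _page/_pageSize parameters are my assumptions about the lda.data.parliament.uk linked-data API, so check them against the live service.

```python
# Hedged sketch of paging through the Commons Written Questions API.
# The endpoint name and _page/_pageSize query parameters are assumptions
# about the lda.data.parliament.uk API, not confirmed from documentation.
def page_urls(n_pages, page_size=500):
    base = "http://lda.data.parliament.uk/commonswrittenquestions.json"
    return [f"{base}?_pageSize={page_size}&_page={p}"
            for p in range(n_pages)]
```

Each URL would then be fetched in turn (e.g. with requests) and the JSON responses saved as separate files, ready for a Drill query over the whole directory.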
If you tweak the _about URI, which I think refers to details about the question, you get the following sort of response, built around a numeric identifier (447753 in this case):
There’s no statement of the actual answer text in that response, although there is a reference to an answer resource, again keyed by the same numeric key:
The numeric key from the _about identifier is also used with both the Commons Written Questions API and the Parliamentary Questions Answered API.
For example, questions:
The uin values can’t be used with either of these APIs, though?
PS I know, I know, the idea is that we just follow resource links (but they’re broken, right? the leading lda. is missing from the http identifiers), but sometimes it’s just as easy to take a unique fragment of the URI (like the numeric key) and then just drop it into the appropriate context (in this case, the two APIs mentioned above) when you want it. IMHO, anyway… ;-)
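That “drop the key into the appropriate context” move can be sketched as a pair of URL templates. The endpoint names below are my assumptions about the lda.data.parliament.uk API paths for the two APIs mentioned above, and may not match the live service exactly.

```python
# Hedged sketch of dropping the numeric key into the two API contexts.
# The endpoint path names are assumptions about lda.data.parliament.uk.
def question_urls(numeric_key):
    base = "http://lda.data.parliament.uk"
    return {
        "written_question": f"{base}/commonswrittenquestions/{numeric_key}.json",
        "answered_question": f"{base}/answeredquestions/{numeric_key}.json",
    }
```

For the example above, `question_urls(447753)` yields one URL per context, both keyed on the same numeric identifier.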
PPS for a full list of APIs, see explore.data.parliament.uk
A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.
In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart Tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyportfolio, and a more recent HTMLWidgets demo (sankeyD3).
The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.
Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.
One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:
If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.
If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.
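On the pandas side, the country-level aggregation can be sketched as below. Region/country codes are mapped to a country via their leading letter (E for England, W for Wales, S for Scotland, N for Northern Ireland), then the flows are grouped and summed; the flow values used here are made-up stand-ins for the migration matrix.

```python
import pandas as pd

# Stand-in flow data: ONS-style area codes whose leading letter
# identifies the country; the values are invented for illustration.
flows = pd.DataFrame({
    "source": ["E12000001", "E12000002", "W92000004", "S92000003"],
    "target": ["S92000003", "W92000004", "E12000007", "N92000002"],
    "value":  [100, 50, 70, 30],
})

countries = {"E": "England", "W": "Wales", "S": "Scotland", "N": "N. Ireland"}
flows["src_country"] = flows["source"].str[0].map(countries)
flows["tgt_country"] = flows["target"].str[0].map(countries)

# Group the region-to-region flows up to country-to-country totals.
country_flows = (flows.groupby(["src_country", "tgt_country"], as_index=False)
                      ["value"].sum())
```

The resulting country-level frame is what gets handed to the Sankey plotting library, whichever wrapper is in use.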
As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.
In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankeywidget package. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)
The original (default) diagram looks like this:
and the final version, after a bit of data wrangling, looks like this:
Once again, all the code is provided in the notebook.
One of the nice things about all these packages is that they produce outputs than can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…
A couple of weeks ago I posted a demo of how to automate the production of a templated report (catchment for GP practices by LSOA on the Isle of Wight) using knitr (Reporting in a Repeatable, Parameterised, Transparent Way).
Today, I noticed another report, with data, from the House of Commons Library on Superfast Broadband Coverage in the UK. This reports at the ward level rather than the LSOA level the GP report was based on, so I wondered how easy it would be to reuse the GP/LSOA code for a broadband/ward map…
After fighting with the Excel data file (metadata rows before the header and at the end of the table, cruft rows between the header and the data table proper) and the R library I was using to read the file (it turned the data into a tibble, with spacey column names I couldn’t get to work with ggplot, rather than a dataframe – I ended up saving to CSV then loading back in again…), not many changes were required to the code at all… What I really should have done was abstract the code into an R file (and maybe some importable Rmd chunks) and try to get the script down to as few lines of bespoke code as possible for handling the new dataset – maybe next time…
I also had a quick play at generating a shiny app from the code (again, cut and pasting rather than abstracting into a separate file and importing… I guess at least now I have three files to look at when trying to abstract the code and to test against…!)
So this has got me thinking – what are the commonly produced “types” of report or report section, and what bits of common/reusable code would make it easy to generate new automation scripts, at least at a first pass, for a new dataset?
Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.
During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.
Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.
As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.
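The filtering-by-practice-code and aggregation steps can be sketched in pandas as follows. The practice codes, CCG labels, column names and patient counts here are invented stand-ins for the NHS Digital datasets, not the real data.

```python
import pandas as pd

# Stand-in for the admin dataset mapping practices to CCGs
# (codes and column names are invented for illustration).
practices = pd.DataFrame({"practice": ["P1", "P2", "P3"],
                          "ccg": ["IOW", "IOW", "OTHER"]})

# Stand-in for the GP list / LSOA counts dataset.
gp_lsoa = pd.DataFrame({"practice": ["P1", "P1", "P2", "P3"],
                        "lsoa": ["E010A", "E010B", "E010A", "E010C"],
                        "patients": [200, 150, 80, 60]})

# Filter the counts down to practices in the CCG of interest...
iow_codes = practices.loc[practices["ccg"] == "IOW", "practice"]
iow_counts = gp_lsoa[gp_lsoa["practice"].isin(iow_codes)]

# ...then sum per LSOA, ready for matching against LSOA boundary
# codes in a geoJSON shapefile to drive a choropleth.
per_lsoa = iow_counts.groupby("lsoa", as_index=False)["patients"].sum()
```

The per-LSOA frame is the piece that gets joined to the shapefile’s LSOA codes, whichever mapping library then draws the choropleth.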
The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset. You can see a rendered example of the notebook here (download).
One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.
Whilst I find the Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio, and in particular the knitr/RMarkdown toolchain.
So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.
It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.
The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation.)
Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:
- to look for sensible ways of comparing the workflow associated with each,
- to identify the ramp-up skills required, and the blockers (including cultural blockers; also administrative/organisational blockers, h/t @dasbarrett) associated with getting started with new tools such as Jupyter or RStudio, and
- to consider the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.
It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, what sorts of issues would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.