
HexJSON HTMLWidget for R, Part 2

In my previous post – HexJSON HTMLWidget for R, Part 1 – I described a first attempt at an HTMLwidget for displaying hexJSON maps using d3-hexJSON.

I had another play today and added a few extra features, including the ability to:

  • add a grid (as demonstrated in the original d3-hexJSON examples),
  • modify the default colour of data and grid hexes,
  • set the data hex colour via a col attribute defined on a hexJSON hex, and
  • set the data hex label via a label attribute defined on a hexJSON hex.

We can now also pass in the path to a hexJSON file, rather than just the hexJSON object:
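For example, something like this (a minimal sketch – the jsonpath argument name is my guess at the API rather than a confirmed parameter name):

library(hexjsonwidget)

# Render a hexmap directly from a hexJSON file path;
# the jsonpath argument name is an assumption, not the confirmed API
hexjsonwidget(jsonpath = './example.hexjson')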

Here’s the example hexJSON file:
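Something along these lines (an illustrative mock-up rather than the actual file – the layout value follows the hexJSON spec, and col and label are the custom hex attributes described above, but the hex keys and values are made up):

{
  "layout": "odd-r",
  "hexes": {
    "Q0R0": { "q": 0, "r": 0, "col": "#bb3377", "label": "A" },
    "Q1R0": { "q": 1, "r": 0, "label": "B" },
    "Q1R1": { "q": 1, "r": 1, "col": "#33bb77", "label": "C" }
  }
}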

And here’s an example of the default grid colour and a custom text colour:
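In call terms, that looks something like the following (again a sketch – grid, col_gridfill and col_textfill are illustrative argument names, not the package’s confirmed API):

library(jsonlite)
library(hexjsonwidget)

jj <- fromJSON('./example.hexjson')
# Show the background grid and override the default grid and text colours;
# the argument names here are guesses for illustration only
hexjsonwidget(jj, grid = TRUE, col_gridfill = '#f0f0f0', col_textfill = 'purple')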

I’ve also tried to split the code changes into separate commits for each feature update: code checkins. For example, here’s where I added the original colour handling.

I’ve also had a go at putting some docs in place, generated using roxygen2 called from inside the widget code folder with devtools::document(). (The widget itself gets rebuilt by running the command devtools::install().)
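That workflow amounts to little more than:

# From inside the package folder: regenerate the roxygen2 docs...
devtools::document()
# ...and then rebuild/reinstall the widget package
devtools::install()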

Next up – some routines to annotate a base hexJSON file with data to colour and label the hexes. (I’m also wondering if I should support the ability to specify arbitrary hexJSON hex attribute names for label text (label) and hex colour (col), or whether to keep those names as a fixed requirement?) See what I came up with here: HexJSON HTMLWidget for R, Part 3.

HexJSON HTMLWidget for R, Part 1

In advance of the recent UK general election, ODI Leeds published an interactive hexmap of constituencies to provide a navigation surface over various datasets relating to Westminster constituencies:

As well as the interactive front end, ODI Leeds published a simple JSON format for sharing the hex data – hexjson – that allows you to specify an identifier for each hex, some data values associated with it, and relative row (r) and column (q) co-ordinates:
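A minimal hexJSON document takes a form something like this (an illustrative sketch – the identifiers and the n data attribute are made up, but the layout, hexes, q and r elements follow the spec as described):

{
  "layout": "odd-r",
  "hexes": {
    "E14000530": { "q": 0, "r": 0, "n": "Constituency A" },
    "E14000531": { "q": 1, "r": 0, "n": "Constituency B" }
  }
}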

It’s not hard to imagine default, or base, hexJSON documents – including standard identifier codes and appropriate co-ordinates for Westminster constituencies, wards, local authorities, and so on – being developed and published around such a standard.

So that’s one thing the standard affords – a common way of representing lots of different datasets.

Tooling can then be developed to inject particular data-values into an appropriate hexJSON file. For example, a hexJSON representation of UK HEIs could add a data attribute identifying whether an HEI received a Gold, Silver or Bronze TEF rating. That’s a second thing the availability of a standard supports.
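By way of a sketch of the sort of tooling I mean (assuming a base hexJSON file keyed by institution identifiers; the file name, hex IDs and tef attribute name are all hypothetical):

library(jsonlite)

# Load a base hexJSON file and a (made-up) dataframe of TEF ratings
hexjson <- fromJSON('./heis.hexjson')
tef_df <- data.frame(id = c('H0001', 'H0002'),
                     rating = c('Gold', 'Silver'),
                     stringsAsFactors = FALSE)

# Inject each rating into the corresponding hex as a new data attribute
for (i in seq_len(nrow(tef_df))) {
  hexjson$hexes[[ tef_df$id[i] ]]$tef <- tef_df$rating[i]
}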

By building a UI that reads data in from a hexJSON file, ODI Leeds have developed an application that can presumably render other people’s hexJSON files, again, another benefit of a standard representation.

But the availability of the standard also means other people can build other visualisation tools around the standard. Which is exactly what Oli Hawkins did with his d3-hexjson Javascript library, “a D3 module for generating [SVG] hexmaps from data in HexJSON format” as announced here. So that’s another thing the standard allows.

You can see an example here, created by Henry Lau:

You maybe start to get a feel for how this works… Data in a standard form, standard library that renders the data. For example, Giuseppe Sollazzo (aka @puntofisso) had a play looking at voter swing:

So… one of the things I was wondering was how easy it would be for folk in the House of Commons Library, for example, to make use of the d3-hexjson maps without having to do the Javascript or HTML thing.

Step in HTMLwidgets, a (standardised) format for publishing interactive HTML widgets from Rmarkdown (Rmd). The idea is that you should be able to say something like:

hexjsonwidget(hexjson)

and embed a rendering of a d3-hexjson map in HTML output from a knitted Rmd document.

(Creating the hexjson as a JSON file from a base (hexJSON) file with custom data values added to it is the next step, and the next thing on my to do list.)

So following the HTMLwidgets tutorial, and copying Henry Lau’s example (which maybe drew on Oli’s README?) I came up with a minimal take on a hexJSON HTMLwidget.

# Install the in-development widget package from GitHub
library(devtools)
install_github("psychemedia/htmlwidget-hexjson")

It’s little more than a wrapping of the demo template, and I’ve only tested it with a single example hexJSON file, but it does generate d3.js hexmaps:


# Read the example hexJSON file into an R object and render it as a hexmap
library(jsonlite)
library(hexjsonwidget)

jj <- fromJSON('./example.hexjson')
hexjsonwidget(jj)

It also needs documenting. And support for creating data-populated base hexJSON files. But it’s a start. And another thing the hexJSON standard has unintentionally provided support for.

But it does let you create HTML documents with embedded hexmaps if you have the hexJSON file handy:

By the by, it’s also worth noting that we can publish an image snapshot of the SVG hexjson map in a knitr rendering of the document to a PDF or Microsoft Word output document format:

At first I thought this wasn’t possible, and via @timelyportfolio found a workaround to generate an image from the SVG:

library(exportwidget)
library(webshot)
library(htmltools)
library(magrittr)

# Render the widget to a standalone HTML page (no viewer),
# then grab an image snapshot of it after a short delay
html_print(tagList(
  hexjsonwidget(jj),
  export_widget()
), viewer = NULL) %>%
  webshot(delay = 3)

But then noticed that the PDF rendering was suddenly working – it seems that if you have the webshot and htmltools packages installed, then the PDF and Word rendering of the HTMLwidget SVG as an image works automagically. (I’m not sure I’ve seen that documented – the related HTMLwidget Github issue still looks to be open?)

See also: HexJSON HTMLWidget for R, Part 2, in which support for custom hex colour and labeling is added, and HexJSON HTMLWidget for R, Part 3, where I add in the ability to merge an R dataframe into a hexjson object, and create a hexjsonwidget directly from a dataframe.

Tinkering With Apache Drill – JOINed Queries Across JSON and CSV files

Coming down from a festival high, I spent the day yesterday jamming with code and trying to get a feel for Apache Drill. As I’ve posted before, Apache Drill is really handy for querying large CSV files.

The test data I’ve been using is Evan Odell’s 3GB Hansard dataset, downloaded as a CSV file but recast in the parquet format to speed up queries (see the previous post for details). I had another look at the dataset yesterday, and popped together some simple SQL warm up exercises in a notebook (here).
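For reference, the CSV-to-parquet recast can be done from inside Drill itself with a CREATE TABLE AS (CTAS) query. Here’s a minimal sketch run from R via Bob Rudis’ sergeant package, rather than the Python client I was actually using – the file path, table name and writable dfs.tmp workspace are all illustrative assumptions:

library(sergeant)

# Connect to a locally running Apache Drill instance
dc <- drill_connection('localhost')

# Recast the large CSV file as parquet to speed up subsequent queries;
# the path is made up, dfs.tmp is assumed writable, and the .csvh extension
# assumes Drill's CSV-with-header format is configured as per its defaults
drill_query(dc, "CREATE TABLE dfs.tmp.`hansard` AS
                 SELECT * FROM dfs.`/data/hansard.csvh`")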

Something I found myself doing was flitting between running SQL queries over the data using Apache Drill to return a pandas dataframe, and wrangling the pandas dataframes directly, following the path of least resistance to doing the thing I wanted to do at each step. (Which is to say, if I couldn’t figure out the SQL, I’d try moving into pandas; and if the pandas route was too fiddly, I’d rethink the SQL query! That said, I also noticed I had got a bit rusty with SQL…) Another pattern of behaviour I found myself falling into was using Apache Drill to run summarising queries over the large original dataset, and then working with these smaller, summary datasets as in-memory pandas dataframes. This could be a handy strategy, I think.

As well as being able to query large flat CSV files, Apache Drill also allows you to run queries over JSON files, as well as over directories full of similarly structured JSON or CSV files. Lots of APIs export data as JSON, so being able to save the response of multiple calls on a similar topic as uniquely named (and the name doesn’t matter…) flat JSON files in the same folder, and then run a SQL query over all of them simply by pointing to the host directory, is really appealing.

I explored these features in a notebook exploring UK Parliament Written Answers data. In particular, the notebook shows:

  • requesting multiple pages on the same topic from the UK Parliament data API (not an Apache Drill feature!);
  • querying deep into a single large JSON file;
  • running JOIN queries over data contained in a JSON file and a CSV file;
  • running JOIN queries over data contained in a JSON file and data contained in multiple, similarly structured, data files in the same directory (a sketch of this sort of query appears below).
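By way of illustration, the JOIN pattern looks something like this – a hedged sketch from R using the sergeant package (the notebook itself is Python/pandas), with made-up file paths and column names:

library(sergeant)

# Connect to a locally running Apache Drill instance
dc <- drill_connection('localhost')

# JOIN a single JSON file against a directory of similarly structured
# CSV files; all paths and column names are invented for illustration,
# and the CSV files are assumed to carry header rows Drill can read
df <- drill_query(dc, "
  SELECT j.memberId, j.party, c.answerCount
  FROM dfs.`/data/members.json` AS j
  JOIN dfs.`/data/answers` AS c
    ON j.memberId = c.memberId
")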

All I need to do now is come up with a simple “up and running” recipe for working with Apache Drill. I’m thinking: Docker Compose and some linked containers – Jupyter notebook, RStudio, Apache Drill, and maybe a shared data volume between them?

Parliamentary Data Sources & API Wrappers

A stub post for collecting together:

  • code libraries that wrap APIs and datasets related to UK Parliament data;
  • downloadable datasets (raw and annotated);
  • API endpoints;
  • etc.

If you know of any others, please let me know via the comments…

APIs

Official Parliament APIs

Secondary Source APIs

Python

R

  • hansard: R package to automatically fetch data from the UK Parliament API; but not Hansard!
  • twfy: R wrapper for TheyWorkForYou’s API (about)
  • twfyR: another R wrapper for TheyWorkForYou’s API

Data Sourcing Applications

Scrapers

Datasets

  • Hansard Speeches and Sentiment: a public dataset of speeches in Hansard, with information on the speaking MP, their party, gender and age at the time of the speech. (Large files (~3GB) – example of how to query the datafile using Apache Drill.)

Data Handling Utilities

Miscellaneous

Tracking down Data Files Associated With Parliamentary Business

One of the ways of finding data related files scattered around an organisation’s website is to run a web search using a search limit that specifies a data-y filetype, such as xlsx for an Excel spreadsheet (csv and xls are also good candidates). For example, on the Parliament website, we could run a query along the lines of filetype:xlsx site:parliament.uk and then opt to display the omitted results:

Taken together, these files form an ad hoc datastore (e.g. as per this demo on using FOI responses on WhatDoTheyKnow as an “as if” open datastore).

Looking at the URLs, we see that data-containing files are strewn about the online Parliamentary estate (that is, the website;-)…

Freedom of Information Related Datasets

Parliament seems to be quite open in the way it handles its FOI responses, publishing disclosure logs and releasing datafile attachments rooted on https://www.parliament.uk/documents/foi/:

Written Questions

Responses to Written Questions often come with datafile attachments.

These files are posted to the subdomain http://qna.files.parliament.uk/qna-attachments.

Given the numeric key for a particular question, we can run a query on the Written Answers API to find details about the attachment:
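Perhaps something along the following lines (a guessed sketch – the endpoint name and the path to the attachment details in the response are assumptions based on the lda.data.parliament.uk conventions, not checked):

library(jsonlite)

# Look up the question record by its numeric key; the endpoint name is a guess
resp <- fromJSON('http://lda.data.parliament.uk/answeredquestions/454264.json')
# The attachment details should then sit somewhere under the result item,
# e.g. resp$result$primaryTopic$attachment - again, a guessed path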

Looking at the actual URL, something like http://qna.files.parliament.uk/qna-attachments/454264/original/28152%20-%20table.xlsx, it looks as if some guesswork is required to generate the URL from the data contained in the API response? (For example, how might original attachments be distinguished from other attachments, such as “revised” ones, maybe?)

Written Statements

Written statements often come with one or more data file attachments.

The data files also appear on the http://qna.files.parliament.uk/ subdomain although it looks like they’re on a different path to the answered question attachments (http://qna.files.parliament.uk/ws-attachments compared to http://qna.files.parliament.uk/qna-attachments). This subdomain doesn’t appear to have the data files indexed and searchable on Google? I don’t see a Written Statements API on http://explore.data.parliament.uk/ either?

Deposited Papers

Deposited papers often include supporting documents, including spreadsheets.

Files are located under http://data.parliament.uk/DepositedPapers/Files/:

At the current time there is no API search over deposited papers.

Committee Papers

A range of documents may be associated with Committees, including reports, responses to reports, and correspondence, as well as evidence submissions. These appear to mainly be PDF documents. Written evidence documents are rooted on http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/ and can be found from committee written evidence web (HTML) pages rooted on the same path (example).

A web search for site:parliament.uk inurl:committee (filetype:xls OR filetype:csv OR filetype:xlsx) doesn’t turn up any results.

Parliamentary Research Briefings

Research briefings are published by Commons and Lords Libraries, and may include additional documents.

Briefings may be published along with supporting documents, including spreadsheets:

The files are published under the following subdomain and path: http://researchbriefings.files.parliament.uk/.

The file attachments URLs can be found via the Research Briefings API.

This response is a cut-down result – the full resource description, including links to supplementary items, can be found by keying on the numeric identifier from the _about URI by which the “naturally” identified resource (e.g. SN06643) is described.

Summary

Data files can be found variously around the Parliamentary website, including down the following paths:

  • https://www.parliament.uk/documents/foi/
  • http://qna.files.parliament.uk/qna-attachments
  • http://qna.files.parliament.uk/ws-attachments
  • http://data.parliament.uk/DepositedPapers/Files/
  • http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/
  • http://researchbriefings.files.parliament.uk/

(I don’t think the API supports querying resources that specifically include attachments in general, or attachments of a particular filetype?)

What would be nice would be support for discovering some of these resources. A quick way in to this would be the ability to limit search query responses to webpages that link to a data file, on the grounds that the linking web page probably contains some of the keywords that you’re likely to be searching for data around?

Tinkering With Parliament Data APIs: Commons Written Questions And Parliamentary Written Answers

So… inspired by @philbgorman, I had a quick play last night with Parliament Written Questions data, putting together a recipe (output) for plotting a Sankey diagram showing the flow of questions from Members of the House of Commons, by Party, to various Answering Bodies for a particular parliamentary session.

The response that comes back from the Written Questions API includes a question uin (unique identification number?). If you faff around with date settings on the Parliamentary Questions web page you can search for a question by this ID:

Here’s an example of the response from a bulk download of questions (by 2015/16 session) from the Commons Written Questions API, deep filtered by the uin:

If you tweak the _about URI, which I think refers to details about the question, you get the following sort of response, built around a numeric identifier (447753 in this case):

There’s no statement of the actual answer text in that response, although there is a reference to an answer resource, again keyed by the same numeric key:

The numeric key from the _about identifier is also used with both the Commons Written Questions API and the Parliamentary Questions Answered API.

For example, questions:

And answers:

The uin values can’t be used with either of these APIs, though?

PS I know, I know, the idea is that we just follow resource links (but they’re broken, right? The leading lda. is missing from the http identifiers), but sometimes it’s just as easy to take a unique fragment of the URI (like the numeric key) and then just drop it into the appropriate context when you want it. In this case, the contexts are the Commons Written Questions API and the Parliamentary Questions Answered API.

IMHO, any way… ;-)
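For what it’s worth, here’s a sketch of that copy-the-numeric-key pattern (the endpoint names are my guesses at the lda.data.parliament.uk routes for the two APIs, so treat them as assumptions):

library(jsonlite)

key <- 447753

# Drop the same numeric key into each API context; endpoint names are guesses
question <- fromJSON(paste0('http://lda.data.parliament.uk/commonswrittenquestions/', key, '.json'))
answer <- fromJSON(paste0('http://lda.data.parliament.uk/answeredquestions/', key, '.json'))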

PPS for a full list of APIs, see explore.data.parliament.uk

Experimenting With Sankey Diagrams in R and Python

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R, or python/pandas, so it was only natural (?!) that I thought I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyportfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitted HTML output is here.

The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:
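The gist of the googleVis incantation is something like the following (a simplified sketch rather than the actual Rmd code – edges stands in for a dataframe of from/to/weight flows derived from the migration matrix, with dummy values here):

library(googleVis)

# A dummy stand-in for the migration flows dataframe
edges <- data.frame(from = c('E: North East', 'W: Wales'),
                    to = c('S: Scotland', 'E: London'),
                    weight = c(100, 200),
                    stringsAsFactors = FALSE)

# Colour each edge according to its source node
sk <- gvisSankey(edges, from = 'from', to = 'to', weight = 'weight',
                 options = list(sankey = "{link: {colorMode: 'source'}}"))
plot(sk)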

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a diagram similar to the original inspiration.
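In code terms, the grouping step is along these lines (again a sketch, reusing the dummy edges dataframe from above and assuming each region identifier is prefixed by a single country-code character):

library(dplyr)

edges <- data.frame(from = c('E: North East', 'W: Wales'),
                    to = c('S: Scotland', 'E: London'),
                    weight = c(100, 200),
                    stringsAsFactors = FALSE)

# Map each region to its country via the leading country-code character,
# then sum the flows to get country level totals
country_flows <- edges %>%
  mutate(from_country = substr(from, 1, 1),
         to_country = substr(to, 1, 1)) %>%
  group_by(from_country, to_country) %>%
  summarise(weight = sum(weight))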

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankey widget. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs that can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…