Reproducible Notebooks Help You Track Down Errors…

A couple of days ago on the Spreadsheet Journalism blog, I came across a post on UK Immigration Raids:.

The post described a spreadsheet process for reshaping a a couple of wide format sheets in a spreadsheet into a single long format dataset to make the data easier to work with.

One of my New Year resolutions was to try to look out for opportunities to rework and compare spreadsheet vs. notebook data treatments, so here’s a Python pandas reworking of the example linked to above in a Jupyter notebook: UK Immigration Raids.ipynb.

You can run the notebook “live” using Binder/Binderhub:

Look in the notebooks folder for the UK Immigration Raids.ipynb notebook.

A few notes on the process:

  • there was no link to the original source data in the original post, although there was a link to a local copy of it;
  • the original post had a few useful cribs that I could use to cross check my working with, such as the number of incidents in Bristol;
  • the mention of dirty data made me put in a step to find mismatched data;
  • my original notebook working contained an error – which I left in the notebook to show how it might propagate through and how we might then try to figure out what the error was having detected it.

As an example, it could probably do with another iteration to demonstrate a more robust process with additional checks that transformations have been correctly applied to data along the way.

Anyway, that’s another of my New Year resolutions half implemented: *share your working, share your mistakes(.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: