Reproducible Notebooks Help You Track Down Errors…

A couple of days ago on the Spreadsheet Journalism blog, I came across a post on UK Immigration Raids:.

The post described a spreadsheet process for reshaping a a couple of wide format sheets in a spreadsheet into a single long format dataset to make the data easier to work with.

One of my New Year resolutions was to try to look out for opportunities to rework and compare spreadsheet vs. notebook data treatments, so here’s a Python pandas reworking of the example linked to above in a Jupyter notebook: UK Immigration Raids.ipynb.

You can run the notebook “live” using Binder/Binderhub:

Look in the notebooks folder for the UK Immigration Raids.ipynb notebook.

A few notes on the process:

  • there was no link to the original source data in the original post, although there was a link to a local copy of it;
  • the original post had a few useful cribs that I could use to cross check my working with, such as the number of incidents in Bristol;
  • the mention of dirty data made me put in a step to find mismatched data;
  • my original notebook working contained an error – which I left in the notebook to show how it might propagate through and how we might then try to figure out what the error was having detected it.

As an example, it could probably do with another iteration to demonstrate a more robust process with additional checks that transformations have been correctly applied to data along the way.

Anyway, that’s another of my New Year resolutions half implemented: *share your working, share your mistakes(.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s