Rescuing Python Module Code From Cluttered Jupyter Notebooks

One of the ways I use Jupyter notebooks is to write stream-of-consciousness code.

Whilst the code doesn’t include formal tests (I never got into the swing of test-driven development, partly because I’m forever changing my mind about what I need a particular function to do!) the notebooks do contain a lot of implicit testing as I tried to build up a function (see Programming in Jupyter Notebooks, via the Heavy Metal Umlaut for an example of some of the ways I use notebooks to iterate the development of code either across several cells or within a single cell).

The resulting notebooks tend to be messy in a way that makes it hard to reuse code contained in them easily. In particular, there are lots of parameter setting cells and code fragment cells where I test specific things out, and then there are cells containing functions that pull together separate pieces to perform a particular task.

So for example, in the fragment below, there is a cell where I’m trying something out, a cell where I put that thing into a function, and a cell where I test the function:

My notebooks also tend to include lots of markdown cells where I try to explain what I want to achieve, or things I still need to do. Some of these usefully document completed functions, others are more note form that relate to the development of an idea or act as notes-to-self.

As the notebooks get more cluttered, it gets harder to use them to perform a particular task. I can’t load the notebook into another notebook as a module because as well as loading the functions in, all the directly run code cells will be loaded in and then executed.

Jupytext comes partly to the rescue here. As described in Exploring Jupytext – Creating Simple Python Modules Via a Notebook UI, we can add active-ipynb tags to a cell that instruct Jupytext where code cells should be executable:

In the case of the active-ipynb tag, if we generate a Python file from a notebook using Jupytext, the active-ipynb tagged code cells will be commented out. But that can still make for a quite a messy Python file.

Via Marc Wouts comes this alternative solution for using an nbconvert template to strip out code cells that are tagged active-ipynb; I’ve also tweaked the template to omit cell count numbers and only include markdown cells that are tagged docs.

echo """{%- extends 'python.tpl' -%}

{% block in_prompt %}
{% endblock in_prompt %}

{% block markdowncell scoped %}
{%- if \"docs\" in cell.metadata.tags -%}
{{ super() }}
{%- else -%}
{%- endif -%}
{% endblock markdowncell %}

{% block input_group -%}
{%- if \"active-ipynb\" in cell.metadata.tags  -%}
{%- else -%}
{{ super() }}
{%- endif -%}
{% endblock input_group %}""" > clean_py_file.tpl

Running nbconvert using this template over a notebook:

jupyter nbconvert "My Messy Notebook.ipynb" --to script --template clean_py_file.tpl

generates a My Messy file that includes code from cells not tagged as active-ipynb, along with commented out markdown from docs tagged markdown cells, that provides a much cleaner python module file.

With this workflow, I can revisit my old messy notebooks, tag the cells appropriately, and recover useful module code from them.

If I only ever generate (and never edit by hand) the module/Python files, then I can maintain the code from the messy notebook as long as I remember to generate the Python file from the notebook via the clean_py_file.tpl template. Ideally, this would be done via a Jupyter content manager hook so that whenever the notebook was saved, as per Jupytext paired files, the clean Python / module file would be automatically generated from it.

Just by the by, we can load in Python files that contain spaces in the filename as modules into another Python file or notebook using the formulation:

tsl = __import__('TSL Timing Screen Screenshot and Data Grabber')

and then call functions via tsl.myFunction() in the normal way. If running in a Jupyter notebook setting (which may be a notebook UI loaded from a .py file under Jupytext) where the notebook includes the magics:

%load_ext autoreload
%autoreload 2

then whenever a function from a loaded module file is called, the module (and any changes to it since it was last loaded) are reloaded.

PS thinks… it’d be quite handy to have a simple script that would autotag all notebook cells as active-ipynb; or perhaps just have another template that ignores all code cells not tagged with something like active-py or module-py. That would probably make tag gardening in old notebooks a bit easier…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...