Rescuing Python Module Code From Cluttered Jupyter Notebooks

One of the ways I use Jupyter notebooks is to write stream-of-consciousness code.

Whilst the code doesn’t include formal tests (I never got into the swing of test-driven development, partly because I’m forever changing my mind about what I need a particular function to do!) the notebooks do contain a lot of implicit testing as I tried to build up a function (see Programming in Jupyter Notebooks, via the Heavy Metal Umlaut for an example of some of the ways I use notebooks to iterate the development of code either across several cells or within a single cell).

The resulting notebooks tend to be messy in a way that makes it hard to reuse code contained in them easily. In particular, there are lots of parameter setting cells and code fragment cells where I test specific things out, and then there are cells containing functions that pull together separate pieces to perform a particular task.

So for example, in the fragment below, there is a cell where I’m trying something out, a cell where I put that thing into a function, and a cell where I test the function:

My notebooks also tend to include lots of markdown cells where I try to explain what I want to achieve, or things I still need to do. Some of these usefully document completed functions, others are more note form that relate to the development of an idea or act as notes-to-self.

As the notebooks get more cluttered, it gets harder to use them to perform a particular task. I can’t load the notebook into another notebook as a module because as well as loading the functions in, all the directly run code cells will be loaded in and then executed.

Jupytext comes partly to the rescue here. As described in Exploring Jupytext – Creating Simple Python Modules Via a Notebook UI, we can add active-ipynb tags to a cell that instruct Jupytext where code cells should be executable:

In the case of the active-ipynb tag, if we generate a Python file from a notebook using Jupytext, the active-ipynb tagged code cells will be commented out. But that can still make for a quite a messy Python file.

Via Marc Wouts comes this alternative solution for using an nbconvert template to strip out code cells that are tagged active-ipynb; I’ve also tweaked the template to omit cell count numbers and only include markdown cells that are tagged docs.


echo """{%- extends 'python.tpl' -%}

{% block in_prompt %}
{% endblock in_prompt %}

{% block markdowncell scoped %}
{%- if \"docs\" in cell.metadata.tags -%}
{{ super() }}
{%- else -%}
{%- endif -%}
{% endblock markdowncell %}

{% block input_group -%}
{%- if \"active-ipynb\" in cell.metadata.tags  -%}
{%- else -%}
{{ super() }}
{%- endif -%}
{% endblock input_group %}""" > clean_py_file.tpl

Running nbconvert using this template over a notebook:

jupyter nbconvert "My Messy Notebook.ipynb" --to script --template clean_py_file.tpl

generates a My Messy Notebook.py file that includes code from cells not tagged as active-ipynb, along with commented out markdown from docs tagged markdown cells, that provides a much cleaner python module file.

With this workflow, I can revisit my old messy notebooks, tag the cells appropriately, and recover useful module code from them.

If I only ever generate (and never edit by hand) the module/Python files, then I can maintain the code from the messy notebook as long as I remember to generate the Python file from the notebook via the clean_py_file.tpl template. Ideally, this would be done via a Jupyter content manager hook so that whenever the notebook was saved, as per Jupytext paired files, the clean Python / module file would be automatically generated from it.

Just by the by, we can load in Python files that contain spaces in the filename as modules into another Python file or notebook using the formulation:

tsl = __import__('TSL Timing Screen Screenshot and Data Grabber')

and then call functions via tsl.myFunction() in the normal way. If running in a Jupyter notebook setting (which may be a notebook UI loaded from a .py file under Jupytext) where the notebook includes the magics:

%load_ext autoreload
%autoreload 2

then whenever a function from a loaded module file is called, the module (and any changes to it since it was last loaded) are reloaded.

PS thinks… it’d be quite handy to have a simple script that would autotag all notebook cells as active-ipynb; or perhaps just have another template that ignores all code cells not tagged with something like active-py or module-py. That would probably make tag gardening in old notebooks a bit easier…

Exploring Jupytext – Creating Simple Python Modules Via a Notebook UI

Although I spend a lot of my coding time in Jupyter notebooks, there are several practical problems associated with working in that environment.

One problem is that under version control, it can be hard to tell what’s changed. On the one hand, the notebook .ipynb format, which saves as a serialised JSON object, is hard to read cleanly:

The .ipynb format also records changes to cell execution state, including cell execution count numbers and changes to cell outputs (which may take the form of large encoded strings when a cell output is an image, or chart, for example:

Another issue arises when trying to write modules in a notebook that can be loaded into other notebooks.

One workaround for this is to use the notebook loading hack described in the official docs: Importing notebooks. This requires loading in a notebook loading module that then allows you to import other modules. Once the notebook loader module is installed, you can run things like:

  • import mycode as mc to load mycode.ipynb
  • `moc = __import__(“My Other Code”)` to load code in from `My Other Code.ipynb`

If you want to include code that can run in the notebook, but that is not executed when the notebook is loaded as a module, you can guard items in the notebook:

In this case, the if __name__=='__main__': guard will run the code in the code cell when run in the notebook UI, but will not run it when the notebook is loaded as a module.

Guarding code can get very messy very quickly, so is there an easier way?

And is there an easier way of using notebooks more generally as an environment for creating code+documentation files that better meet the needs of a variety of users? For example, I note this quote from Daniele Procida recently shared by Simon Willison:

Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances—so software usually needs them all.

This suggests a range of different documentation styles for different purposes, although I wonder if that is strictly necessary?

When I am hacking code together, I find that I start out by writing things a line at a time, checking the output for each line, then grouping lines in a single cell and checking the output, then wrapping things in a function (for example of this in practice, see Programming in Jupyter Notebooks, via the Heavy Metal Umlaut). I also try to write markdown notes that set up what I intend to do (and why) in the following code cells. This means my development notebooks tell a story (of a sort) of the development of the functions that hopefully do what I actually want them to by the end of the notebook.

If truth be told, the notebooks often end up as an unholy mess, particularly if they are full of guard statements that try to separate out development and testing code from useful code blocks that I might want to import elsewhere.

Although I’ve been watching it for months, I’ve only started exploring how to use Jupytext in practice quite recently, and already it’s starting to change how I use notebooks.

If you install jupytext, you will find that if you click on a link to a markdown (.md)) or Python (.py), or a whole range of other text document types (.py, .R, .r, .Rmd, .jl, .cpp, .ss, .clj, .scm, .sh, .q, .m, .pro, .js, .ts, .scala), you will open the file in a notebook environment.

You can also open the file as a .py file, from the notebook listing menu by selecting the notebook:

and then using the Edit button to open it:

at which point you are presented with the “normal” text file editor:

One thing to note about the notebook editor view over the notebook is that you can also include markdown cells, as you might in any other notebook, and run code cells to preview their output inline within the notebook view.

However, whilst the markdown code will be saved into the Python file (as commented out code), the code outputs will not be saved into the Python file.

If you do want to be able to save notebook views with any associated code output, you can configure Jupytext to “pair” .py and .ipynb files (and other combinations, such as .py, .ipynb and .md files) such that when you save an open .py or .ipynb file from the notebook editing environment, a “paired” .ipynb or .py version of the file is also saved at the same time.

This means I could click to open my .py file in the notebook UI, run it, then when I save it, a “simple” .py file containing just code and commented out markdown is saved along with a notebook .ipynb file that also contains the code cell outputs.

You can configure Jupytext so that the pairing only works in particular directories. I’ve started trying to explore various settings in the branches of this repo: ouseful-template-repos/jupytext-md. You can also convert files on the command line; for example, <span class="s1">jupytext --to py Required\ Pace.ipynb will convert a notebook file to a python file.

The ability to edit Python / .py files, or code containing markdown / .md files in a notebook UI, is really handy, but there’s more…

Remember the guards?

If I tag a code cell using the notebook UI (from the notebook View menu, select Cell Toolbar and then Tags, you can tag a cell with a tag of the form active-ipynb:

See the Jupytext docs: importing Jupyter notebooks as modules for more…

The tags are saved as metadata in all document types. For example, in an .md version of the notebook, the metadata is passed in an attribute-value pair when defining the language type of a code block:

In a .py version of the notebook, however, the tagged code cell is not rendered as a code cell, it is commented out:

What this means is that I can tag cells in the notebook editor to include them — or not — as executable code in particular document types.

For example, if I pair .ipynb and .py files, whenever I edit either an .ipynb or .py file in the notebook UI, it also gets saved as the paired document type. Within the notebook UI, I can execute all the code cells, but through using tagged cells, I can define some cells as executable in one saved document type (.ipynb for example) but not in another (a .py file, perhaps).

What that in turn means is that when I am hacking around with the document in the notebook UI I can create documents that include all manner of scraggy developmental test code, but only save certain cells as executable code into the associated .py module file.

The module workflow is now:

  • install Jupytext;
  • edit Python files in a notebook environment;
  • run all cells when running in the notebook UI;
  • mark development code as active-ipynb, which is to say, it is *not active* in a .py file;
  • load the .py file in as a module into other modules or notebooks but leaving out the commented out the development code; if I use `%load_ext autoreload` and `%autoreload 2` magic in the document that’s loading the modules, it will [automatically reload them](https://stackoverflow.com/a/5399339/454773) when I call functions imported from them if I’ve made changes to the associated module file;
  • optionally pair the .py file with an .ipynb file, in which case the .ipynb file will be saved: a) with *all* cells run; b) include cell outputs.

Referring back to Daniele Procida’s insights about documentation, this ability to have code in a single document (for example, a .py file) that is executable in one environment (the notebook editing / development environment, for example) but not another (when loaded as a .py module) means we can start to write richer source code files.

I also wonder if this provides us with a way of bundling test code as part of the code development narrative? (I don’t use tests so don’t really know how the workflow goes…)

More general is the insight that we can use Jupytext to automatically generate distinct versions of a document from a single source document. The generated documents:

  • can include code outputs;
  • can *exclude* code outputs;
  • can have tagged code commented out in some document formats and not others.

I’m not sure if we can also use it in combination with other notebook extensions to hide particular cells, for example, when viewing documents in the notebook editor or generating export document formats from an executed notebook form of it. A good example to try out might be the hide_code extension, which provides a range of toolbar options that can be used to customise the display of a document in a the notebook editor or HTML / PDF documents generated from it.

It could also be useful to have a very simple extension that lets you click a toolbar button to set an active- state tag and style or highlight that cell in the notebook UI to mark it out as having limited execution status. A simple fork of, or extension to, the freeze extension would probably do that. (I note that Jupytext responds to the “frozen” freeze setting but that presumably locks out executing the cell in the notebook UI too?)

PS a few weeks ago, Jupytext creator Marc Wouts posted this handy recipe for *rewriting* notebook commits made to a git branch against markdown formatted documents rather than the original ipynb change commits: git filter-branch --tree-filter 'jupytext --to md */*.ipynb && rm -f */*.ipynb' HEAD This means that if you have a legacy project with commits made to notebook files, you can rewrite it as a series of changes made to markdown or Python document versions of the notebooks…