Although I spend a lot of my coding time in Jupyter notebooks, there are several practical problems associated with working in that environment.
One problem is that under version control, it can be hard to tell what’s changed. On the one hand, the notebook .ipynb format, which saves as a serialised JSON object, is hard to read cleanly:
The .ipynb format also records changes to cell execution state, including cell execution count numbers and changes to cell outputs (which may take the form of large encoded strings when a cell output is an image or chart, for example).
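To make the version control problem concrete, here is a minimal sketch that builds two copies of a one-cell notebook as plain dictionaries and diffs their JSON. The notebook contents, the `make_notebook` and `serialise` helper names and the output values are all made up for illustration; only the general nbformat 4 layout is taken from real notebooks.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal one-cell notebook as a plain dict (nbformat 4 layout)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "name": "stdout",
                        "output_type": "stream",
                        "text": [output_text],
                    }
                ],
                "source": ["print(total)"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }


def serialise(notebook):
    """Serialise the notebook roughly the way it is stored on disk."""
    return json.dumps(notebook, indent=1, sort_keys=True).splitlines()


# The same cell, run twice: the source code is identical in both versions.
before = serialise(make_notebook(execution_count=3, output_text="41\n"))
after = serialise(make_notebook(execution_count=7, output_text="42\n"))

# Keep only the added and removed lines from a unified diff.
changed = [
    line
    for line in difflib.unified_diff(before, after, lineterm="", n=0)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(changed))
```

Every line in the diff comes from the execution count or the stored output; the code in the cell has not changed at all, which is exactly the sort of noise that makes notebook diffs hard to review.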
Another issue arises when trying to write modules in a notebook that can be loaded into other notebooks.
One workaround for this is to use the notebook loading hack described in the official docs: Importing notebooks. This requires loading in a notebook loader module that then allows you to import other notebooks as modules. Once the notebook loader module is installed, you can run things like:
- `import mycode as mc` to load code in from `mycode.ipynb`
- `moc = __import__("My Other Code")` to load code in from `My Other Code.ipynb`
If you want to include code that can run in the notebook, but that is not executed when the notebook is loaded as a module, you can guard items in the notebook with an `if __name__=='__main__':` guard. The guard will run the code in the code cell when run in the notebook UI, but will not run it when the notebook is loaded as a module.
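As a minimal sketch of a guarded cell, the function below is the part I would want to import elsewhere, and the guarded block is the development check. The `required_pace` function is a made-up example rather than code from any of my notebooks.

```python
def required_pace(distance_km, target_minutes):
    """Return the pace, in minutes per km, needed to hit a target time."""
    return target_minutes / distance_km


if __name__ == "__main__":
    # Development check: this runs in the notebook UI (where __name__ is
    # "__main__") or when the file is run as a script, but it is skipped
    # when the code is imported as a module.
    print(required_pace(10, 50))
```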
Guarding code can get very messy very quickly, so is there an easier way?
And is there an easier way of using notebooks more generally as an environment for creating code+documentation files that better meet the needs of a variety of users? For example, I note this quote from Daniele Procida recently shared by Simon Willison:
Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances—so software usually needs them all.
This suggests a range of different documentation styles for different purposes, although I wonder if that is strictly necessary?
When I am hacking code together, I find that I start out by writing things a line at a time, checking the output for each line, then grouping lines in a single cell and checking the output, then wrapping things in a function (for an example of this in practice, see Programming in Jupyter Notebooks, via the Heavy Metal Umlaut). I also try to write markdown notes that set up what I intend to do (and why) in the following code cells. This means my development notebooks tell a story (of a sort) of the development of the functions that hopefully do what I actually want them to by the end of the notebook.
If truth be told, the notebooks often end up as an unholy mess, particularly if they are full of guard statements that try to separate out development and testing code from useful code blocks that I might want to import elsewhere.
Although I’ve been watching it for months, I’ve only started exploring how to use Jupytext in practice quite recently, and already it’s starting to change how I use notebooks.
If you install `jupytext`, you will find that if you click on a link to a markdown (`.md`) or Python (`.py`) file, or a whole range of other text document types (`.scala`, for example), you will open the file in a notebook environment.
You can also open the file as a `.py` file from the notebook listing menu by selecting the notebook and then using the Edit button to open it, at which point you are presented with the “normal” text file editor.
One thing to note about opening the file in the notebook editor view is that you can also include markdown cells, as you might in any other notebook, and run code cells to preview their output inline within the notebook view.
However, whilst the markdown code will be saved into the Python file (as commented out code), the code outputs will not be saved into the Python file.
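For illustration, a Python file saved this way looks roughly like the sketch below. The content is a made-up example, and I have left out the commented metadata header that Jupytext can also write at the top of the file.

```python
# # Required pace
#
# Work out the pace needed to cover a distance in a target time.
# (This block is a markdown cell, saved as comments.)

def required_pace(distance_km, target_minutes):
    return target_minutes / distance_km

# A quick check. In the notebook view the result is displayed under the
# cell, but that output is not written back to this file.

required_pace(10, 50)
```

The file is still plain Python, so it can be run or imported as normal.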
If you do want to be able to save notebook views with any associated code output, you can configure Jupytext to “pair” `.py` and `.ipynb` files (and other combinations, such as `.md` files) such that when you save an open `.ipynb` file from the notebook editing environment, a “paired” `.py` version of the file is also saved at the same time.
This means I could click to open my `.py` file in the notebook UI, run it, then when I save it, a “simple” `.py` file containing just code and commented out markdown is saved along with a notebook `.ipynb` file that also contains the code cell outputs.
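As far as I understand it, one place a pairing can be recorded is in the notebook's own metadata, under a `jupytext` key with a `formats` entry. The sketch below sets that entry on a notebook represented as a plain dictionary; the `pair_with_py` helper is my own invention for illustration and is not part of Jupytext.

```python
import json


def pair_with_py(notebook, formats="ipynb,py"):
    """Record a pairing in the notebook-level metadata (hypothetical helper).

    Assumption: Jupytext reads the pairing from metadata["jupytext"]["formats"].
    """
    jupytext_metadata = notebook.setdefault("metadata", {}).setdefault("jupytext", {})
    jupytext_metadata["formats"] = formats
    return notebook


# A minimal, empty notebook represented as a plain dict.
notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}
paired = pair_with_py(notebook)
print(json.dumps(paired["metadata"], indent=1))
```

In practice I would let Jupytext manage this metadata itself rather than editing it by hand; the point is simply that the pairing travels with the document.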
You can configure Jupytext so that the pairing only works in particular directories. I’ve started trying to explore various settings in the branches of this repo: ouseful-template-repos/jupytext-md. You can also convert files on the command line; for example, `jupytext --to py Required\ Pace.ipynb` will convert a notebook file to a Python file.
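If you would rather drive the same conversion from Python, a minimal sketch is to call the command line tool via `subprocess`. The `jupytext_convert_command` helper is a made-up name; passing the arguments as a list means the space in the filename does not need escaping.

```python
import shutil
import subprocess
from pathlib import Path


def jupytext_convert_command(notebook_path, target="py"):
    """Build the argument list for `jupytext --to <target> <notebook>`."""
    return ["jupytext", "--to", target, str(notebook_path)]


notebook_path = Path("Required Pace.ipynb")
command = jupytext_convert_command(notebook_path)

# Only run the conversion if the jupytext CLI is installed and the file exists.
if shutil.which("jupytext") and notebook_path.exists():
    subprocess.run(command, check=True)
```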
The ability to edit Python / `.py` files, or code containing markdown / `.md` files in a notebook UI, is really handy, but there’s more…
Remember the guards?
I can tag a code cell using the notebook UI: from the notebook View menu, select Cell Toolbar and then Tags, and you can then tag a cell with a tag of the form `active-ipynb`.
See the Jupytext docs: importing Jupyter notebooks as modules for more…
The tags are saved as metadata in all document types. For example, in an `.md` version of the notebook, the metadata is passed in an attribute-value pair when defining the language type of a code block. In a `.py` version of the notebook, however, the tagged code cell is not rendered as a code cell; it is commented out.
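The effect on the `.py` file can be modelled with a toy renderer. This is only a sketch of the behaviour described above, not Jupytext's implementation: the cell dictionaries and the `render_as_py` and `comment_out` names are made up, and the real tool also writes cell markers that carry the metadata, which I have left out.

```python
def comment_out(text):
    """Prefix every line with '# ' (or a bare '#' for blank lines)."""
    return "\n".join("# " + line if line else "#" for line in text.splitlines())


def render_as_py(cells, target="py"):
    """Toy rendering of notebook cells as a Python script (illustrative only)."""
    blocks = []
    for cell in cells:
        tags = cell.get("tags", [])
        # A cell tagged active-<something> is treated as inactive in any
        # format it is not explicitly tagged for.
        has_active_tag = any(tag.startswith("active-") for tag in tags)
        inactive = has_active_tag and ("active-" + target) not in tags
        if cell["type"] == "markdown" or inactive:
            blocks.append(comment_out(cell["source"]))
        else:
            blocks.append(cell["source"])
    return "\n\n".join(blocks)


cells = [
    {"type": "markdown", "source": "## Helper function"},
    {"type": "code", "source": "def double(x):\n    return 2 * x"},
    {"type": "code", "source": "double(21)", "tags": ["active-ipynb"]},
]

script = render_as_py(cells)
print(script)
```

The function definition survives as executable code, while the tagged test call is written out as a comment, so importing the resulting file defines `double` without running the check.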
What this means is that I can tag cells in the notebook editor to include them — or not — as executable code in particular document types.
For example, if I pair `.ipynb` and `.py` files, whenever I edit either an `.ipynb` or `.py` file in the notebook UI, it also gets saved as the paired document type. Within the notebook UI, I can execute all the code cells, but through using tagged cells, I can define some cells as executable in one saved document type (`.ipynb`, for example) but not in another (a `.py` file, perhaps).
What that in turn means is that when I am hacking around with the document in the notebook UI I can create documents that include all manner of scraggy developmental test code, but only save certain cells as executable code into the associated `.py` module file.
The module workflow is now:
- install Jupytext;
- edit Python files in a notebook environment;
- run all cells when running in the notebook UI;
- mark development code as `active-ipynb`, which is to say, it is *not active* in a `.py` file;
- load the `.py` file in as a module into other modules or notebooks, leaving out the commented out development code; if I use `%load_ext autoreload` and `%autoreload 2` magic in the document that’s loading the modules, it will [automatically reload them](https://stackoverflow.com/a/5399339/454773) when I call functions imported from them if I’ve made changes to the associated module file;
- optionally pair the `.py` file with an `.ipynb` file, in which case the `.ipynb` file will be saved: a) with *all* cells run; b) including cell outputs.
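The autoreload magic only works inside IPython, so to show the reloading step in plain Python the sketch below swaps in an explicit `importlib.reload` call from the standard library instead. It writes a throwaway module to a temporary directory, imports it, edits the file, and reloads it; the `mycode_demo` module name and its `greet` function are made up for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid writing .pyc files so the reload always recompiles from source.
sys.dont_write_bytecode = True

# Create a throwaway module file (hypothetical name) in a temporary directory.
workdir = Path(tempfile.mkdtemp())
module_path = workdir / "mycode_demo.py"
module_path.write_text("def greet():\n    return 'first version'\n")

# Make the directory importable and load the module.
sys.path.insert(0, str(workdir))
import mycode_demo

first = mycode_demo.greet()

# Edit the module file, as I might from the notebook UI, then reload it.
module_path.write_text("def greet():\n    return 'second version'\n")
importlib.reload(mycode_demo)

second = mycode_demo.greet()
print(first, "->", second)
```

With `%autoreload 2` the reload step happens automatically before code is run, which is what makes it convenient to edit the module in one notebook tab and call it from another.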
Referring back to Daniele Procida’s insights about documentation, this ability to have code in a single document (for example, a `.py` file) that is executable in one environment (the notebook editing / development environment, for example) but not another (when loaded as a `.py` module) means we can start to write richer source code files.
I also wonder if this provides us with a way of bundling test code as part of the code development narrative? (I don’t use tests so don’t really know how the workflow goes…)
More general is the insight that we can use Jupytext to automatically generate distinct versions of a document from a single source document. The generated documents:
- can include code outputs;
- can *exclude* code outputs;
- can have tagged code commented out in some document formats and not others.
I’m not sure if we can also use it in combination with other notebook extensions to hide particular cells, for example, when viewing documents in the notebook editor or generating export document formats from an executed notebook form of it. A good example to try out might be the `hide_code` extension, which provides a range of toolbar options that can be used to customise the display of a document in the notebook editor or HTML / PDF documents generated from it.
It could also be useful to have a very simple extension that lets you click a toolbar button to set an `active-` state tag and style or highlight that cell in the notebook UI to mark it out as having limited execution status. A simple fork of, or extension to, the `freeze` extension would probably do that. (I note that Jupytext responds to the “frozen” `freeze` setting but that presumably locks out executing the cell in the notebook UI too?)
PS a few weeks ago, Jupytext creator Marc Wouts posted this handy recipe for *rewriting* notebook commits made to a git branch against markdown formatted documents rather than the original ipynb change commits:
`git filter-branch --tree-filter 'jupytext --to md */*.ipynb && rm -f */*.ipynb' HEAD`

This means that if you have a legacy project with commits made to notebook files, you can rewrite it as a series of changes made to markdown or Python document versions of the notebooks…