Editing Jupyter Notebooks Mechanically

In passing, I note that because of REDACTED last year, we had to edit a load of teaching materials to account for inconsistent ways of connecting to provided database servers across local and hosted environments (I’ve argued for years that provided environments should always be consistent wherever they are accessed from, but go figure). That means that for 21J, where we will have the same container running on local and hosted environments, we need to revert all those changes. (I wonder, if the changes had been made in a particular GitHub branch and PR’d from that branch, whether we could just reopen that branch, revert the changes, and submit another PR? I really do need to get better at git…)

Example of a cell to be removed.

The changes impact 30 or so notebooks, and would typically involve an editor making the changes. But we don’t have editors near our notebooks in the data management course, so it’s down to the module team to undo the changes.

Skimming over the changes we need to make, it looks to me like the advice we need to update was boilerplate text, which means it should be the same across notebooks (there were two sets of changes required: connection to a mongo database, and connection to a postgres database).

This in turn suggests automation should be possible. So here’s a first attempt:

import nbformat
from pathlib import Path

def fix_cells(cell_type, str_start, path='.',
              replace_match=None, replace_with=None,
              convert_to=None, overwrite=True,
              version=nbformat.NO_CONVERT,
              ignore_files = None,
              verbose=False):
    """Remove cells of a particular type starting with a particular string.
       Optionally replace cell contents.
       Optionally convert cell type.
    """

    # Cell types
    cell_types = ['markdown', 'code', 'raw']
    if cell_type and cell_type not in cell_types:
        raise ValueError('Error: cell_type not recognised')
        
    if convert_to and convert_to not in cell_types:
        raise ValueError('Error: convert_to cell type not recognised')

    # Iterate path
    nb_dir = Path(path)
    for p in nb_dir.rglob("*"): #nb_dir.iterdir():
        if ignore_files and p.name in ignore_files:
            continue
        if '.ipynb_checkpoints' in p.parts:
            continue
        
        if p.is_file() and p.suffix == '.ipynb':
            updated = False
            if verbose:
                print(f"Checking {p}")

            # Read notebook
            # Notebooks are UTF-8 JSON; be explicit so the platform default is not used
            with p.open('r', encoding='utf-8') as f:
                # parse notebook
                #nb = nbformat.read(f, as_version=nbformat.NO_CONVERT)
                #nb = nbformat.convert(nb, version)
                #opinionated
                try:
                    nb = nbformat.read(f, as_version=version)
                except Exception as e:
                    # Avoid a bare except, which would also swallow KeyboardInterrupt
                    print(f"Failed to open: {p} ({e})")
                    continue
                deletion_list = []
                for i, cell in enumerate(nb['cells']):
                    if cell["cell_type"]==cell_type and nb['cells'][i]["source"].startswith(str_start):
                        if replace_with is None and not convert_to:
                            deletion_list.append(i)
                        elif replace_with is not None:
                            if replace_match:
                                nb['cells'][i]["source"] = nb['cells'][i]["source"].replace(replace_match, replace_with)
                                updated = True
                            else:
                                nb['cells'][i]["source"] = replace_with
                                updated = True
                        if convert_to:
                            if convert_to=='code':
                                new_cell = nbformat.v4.new_code_cell(nb['cells'][i]["source"])
                                nb['cells'][i] = new_cell
                            elif convert_to=='markdown':
                                new_cell = nbformat.v4.new_markdown_cell(nb['cells'][i]["source"])
                                nb['cells'][i] = new_cell
                            elif convert_to=='raw':
                                new_cell = nbformat.v4.new_raw_cell(nb['cells'][i]["source"])
                                nb['cells'][i] = new_cell           
                            else:
                                pass
                            updated = True

                # Delete unrequired cells
                if deletion_list:
                    updated = True
                nb['cells']  = [c for i, c in enumerate(nb['cells']) if i not in deletion_list]

                if updated:
                    # Validate - exception if we fail
                    #nbformat.validate(nb)

                    # Create output filename
                    out_path =  p if overwrite else p.with_name(f'{p.stem}__patched{p.suffix}') 

                    # Save notebook
                    print(f"Updating: {p}")
                    # Use a context manager so the output file handle is closed
                    with out_path.open('w', encoding='utf-8') as out_f:
                        nbformat.write(nb, out_f, nbformat.NO_CONVERT)

Usage for deleting cells takes the form:

# delete cells
ignore_files=['21J DB repair.ipynb']

str_start = '# If you are using the remote environment, change this cell'
fix_cells('raw', str_start, ignore_files=ignore_files)

And for updating the contents of cells and/or changing their cell type:

str_start = "# If you are using a locally hosted environment, change this cell"
replace_match ="""# If you are using a locally hosted environment, change this cell
# type to "code", and execute it

"""
replace_with = ''
convert_to = 'code'
fix_cells('raw', str_start, convert_to='code',
          replace_match=replace_match, replace_with=replace_with, ignore_files=ignore_files)

Better pattern matching around identifying cells, and perhaps navigating directory paths, is obviously required.
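As a rough idea of what better matching might look like, here is a minimal sketch that swaps the `startswith()` test for a regular expression. The helper name `find_matching_cells` and the toy cells are my own illustration, not part of `fix_cells`; the sketch also assumes each cell's `source` is a single string, which is how `nbformat` presents cells once a notebook has been read in.

```python
import re

def find_matching_cells(cells, cell_type, pattern):
    """Return the indices of cells of `cell_type` whose source matches
    the regular expression `pattern` at the start of the cell.

    `cells` can be any list of dict-like cells (for example nb['cells']).
    """
    rx = re.compile(pattern)
    return [i for i, cell in enumerate(cells)
            if cell["cell_type"] == cell_type and rx.match(cell["source"])]

# Toy cells standing in for nb['cells'] (illustrative only)
cells = [
    {"cell_type": "raw",
     "source": "# If you are using the remote environment, change this cell"},
    {"cell_type": "raw",
     "source": "#  if you are using a locally hosted environment, change this cell"},
    {"cell_type": "code",
     "source": "print('hello')"},
]

# Case-insensitive, tolerant of extra whitespace after the '#',
# and covering both variants of the boilerplate in one pattern
pattern = r"(?i)#\s*if you are using (the remote|a locally hosted) environment"
matches = find_matching_cells(cells, "raw", pattern)
print(matches)
```

A single pattern like this would catch small variations in the boilerplate (capitalisation, stray spaces) that an exact `startswith()` string would miss.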

In passing, I note that if we had tagged the boilerplate cells or added other metadata to them identifying them as relating to db setup, we could have processed the notebooks based on tags. For future notebooks, I think we should start to consider adding identifying tags to distinct boilerplate items so that we can, if necessary, more easily modify / update them.
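A tag-based approach might look something like the following sketch. It relies on the Jupyter convention that cell tags are stored as a list under `metadata.tags`; the `db-setup` tag name and the helper functions are hypothetical, made up here for illustration, and the toy cells stand in for `nb['cells']`.

```python
def has_tag(cell, tag):
    """True if the cell carries `tag` in its metadata tags list."""
    return tag in cell.get("metadata", {}).get("tags", [])

def drop_tagged_cells(cells, tag):
    """Return a new cell list with every cell carrying `tag` removed."""
    return [cell for cell in cells if not has_tag(cell, tag)]

# Toy cells; "db-setup" is a hypothetical tag name,
# not one used in the actual course notebooks
cells = [
    {"cell_type": "markdown", "metadata": {},
     "source": "# Week 1"},
    {"cell_type": "raw", "metadata": {"tags": ["db-setup"]},
     "source": "# If you are using the remote environment, change this cell"},
    {"cell_type": "code", "metadata": {"tags": ["keep-me"]},
     "source": "print('hello')"},
]

remaining = drop_tagged_cells(cells, "db-setup")
print([c["cell_type"] for c in remaining])
```

With tags in place, the matching no longer depends on the exact wording of the boilerplate, so the text could be reworded freely without breaking any later automated edits.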

I also note that if I suggest this to the “Jupyter Notebook Production Working Group” (which I wasn’t invited to join, obvs), I’m guessing they’d say this is “too technical” and recommend the manual approach of opening and editing each notebook by hand…! And I doubt they’d be able to comment on any potential git revert strategy;-)

PS see also @choldgraf’s nbclean package which includes a tool to “replace text in cells with new text of your choosing”.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
