Fragment -Visualising Jupyter Notebook Structure

Over the weekend, I spent some time dabbling with generating various metrics over Jupyter notebooks (more about that in a later post…). One of the things I started looking at were tools for visualising notebook structure.

In the first instance, I wanted a simple tool to show the relative size of notebooks, as well as the size and placement of markdown and code cells within them.

The following is an example of a view over a simple notebook; the blue denotes a markdown cell, the pink a code cell, and the grey separates the cells. (The colour of the separator is controllable, as well as its size, which can be 0.)

When visualising multiple notebooks, we can also display the path to the notebook:

The code can be be found in this repo this gist.

The size of the cells in the diagram are determined as follows:

  • for markdown cells, the number of “screen lines” taken up by the markdown when presented on a screen with a specified screen character width;
        import textwrap
        LINE_WIDTH = 160
    
        def _count_screen_lines(txt, width=LINE_WIDTH):
            """Count the number of screen lines that an overflowing text line takes up."""
            ll = txt.split('\n')
            _ll = []
            for l in ll:
                #Model screen flow: split a line if it is more than `width` characters long
                _ll=_ll+textwrap.wrap(l, width)
            n_screen_lines = len(_ll)
            return n_screen_lines
    

  • for code cells, the number of lines of code; (long lines are counted over multiple lines as per markdown lines)

In parsing a notebook, we consider each cell in turn capturing its cell type and screen line length, returing a cell_map as a list of (cell_size, cell_type) tuples:

   import os
   import nbformat
   VIS_COLOUR_MAP  = {'markdown':'cornflowerblue','code':'pink'}

   def _nb_vis_parse_nb(fn):
        """Parse a notebook and generate the nb_vis cell map for it."""

        cell_map = []

        _fn, fn_ext = os.path.splitext(fn)
        if not fn_ext=='.ipynb' or not os.path.isfile(fn):
            return cell_map

        with open(fn,'r') as f:
            nb = nbformat.reads(f.read(), as_version=4)

        for cell in nb.cells:
            cell_map.append((_count_screen_lines(cell['source']), VIS_COLOUR_MAP[cell['cell_type']]))

        return cell_map

The following function handle single files or directory paths and generates a cell map for each notebook as required:

    def _dir_walker(path, exclude = 'default'):
        """Profile all the notebooks in a specific directory and in any child directories."""

        if exclude == 'default':
            exclude_paths = ['.ipynb_checkpoints', '.git', '.ipynb', '__MACOSX']
        else:
            #If we set exclude, we need to pass it as a list
            exclude_paths = exclude
        nb_multidir_cell_map = {}
        for _path, dirs, files in os.walk(path):
            #Start walking...
            #If we're in a directory that is not excluded...
            if not set(exclude_paths).intersection(set(_path.split('/'))):
                #Profile that directory...
                for _f in files:
                    fn = os.path.join(_path, _f)
                    cell_map = _nb_vis_parse_nb(fn)
                    if cell_map:
                        nb_multidir_cell_map[fn] = cell_map

        return nb_multidir_cell_map

The following function is used to grab the notebook file(s) and generate the visualisation:

def nb_vis_parse_nb(path, img_file='', linewidth = 5, w=20, **kwargs):
    """Parse one or more notebooks on a path."""
     
    if os.path.isdir(path):
        cell_map = _dir_walker(path)
    else:
        cell_map = _nb_vis_parse_nb(path)
        
    nb_vis(cell_map, img_file, linewidth, w, **kwargs)

So how is the visualisation generated?

A plotter function generates the plot from acell_map:

    import matplotlib.pyplot as plt

    def plotter(cell_map, x, y, label='', header_gap = 0.2):
        """Plot visualisation of gross cell structure for a single notebook."""

        #Plot notebook path
        plt.text(y, x, label)
        x = x + header_gap

        for _cell_map in cell_map:

            #Add a coloured bar between cells
            if y > 0:
                if gap_colour:
                    plt.plot([y,y+gap],[x,x], gap_colour, linewidth=linewidth)

                y = y + gap
            
            _y = y + _cell_map[0] + 1 #Make tiny cells slightly bigger
            plt.plot([y,_y],[x,x], _cell_map[1], linewidth=linewidth)

            y = _y

The gap can be automatically calculated relative to the longest notebook we’re trying to visualise which sets the visualisation limits:

    import math

    def get_gap(cell_map):
        """Automatically set the gap value based on overall length"""
        
        def get_overall_length(cell_map):
            """Get overall line length of a notebook."""
            overall_len = 0
            gap = 0
            for i ,(l,t) in enumerate(cell_map):
                #i is number of cells if that's useful too?
                overall_len = overall_len + l
            return overall_len

        max_overall_len = 0
        
        #If we are generating a plot for multiple notebooks, get the largest overall length
        if isinstance(cell_map,dict):
            for k in cell_map:
                _overall_len = get_overall_length(cell_map[k])
                max_overall_len = _overall_len if _overall_len > max_overall_len else max_overall_len
        else:
            max_overall_len = get_overall_length(cell_map)

        #Set the gap at 0.5% of the overall length
        return math.ceil(max_overall_len * 0.01)

The nb_vis() function takes the cell_map, either as a single cell map for a single notebook, or as a dict of cell maps for multiple notebooks, keyed by the notebook path:

def nb_vis(cell_map, img_file='', linewidth = 5, w=20, gap=None, gap_boost=1, gap_colour='lightgrey'):
    """Visualise notebook gross cell structure."""

    x=0
    y=0
    
    #If we have a single cell_map for a single notebook
    if isinstance(cell_map,list):
        gap = gap if gap is not None else get_gap(cell_map) * gap_boost
        fig, ax = plt.subplots(figsize=(w, 1))
        plotter(cell_map, x, y)
    #If we are plotting cell_maps for multiple notebooks
    elif isinstance(cell_map,dict):
        gap = gap if gap is not None else get_gap(cell_map) * gap_boost
        fig, ax = plt.subplots(figsize=(w,len(cell_map)))
        for k in cell_map:
            plotter(cell_map[k], x, y, k)
            x = x + 1
    else:
        print('wtf')
    ax.axis('off')
    plt.gca().invert_yaxis()
    
    if img_file:
        plt.savefig(img_file)

The function will render the plot in a Jupyter notebook, or can be called to save the visualisation to a file.

This was just done as a quick proof of concept, so comments welcome.

On the to do list is to create a simple CLI (command line interface) for it, as well as explore additional customisation support (eg allow the color types to be specified). I also need to account for other cell types. An optional legend explaining the colour map would also make sense.

On the longer to do list is a visualiser that supports within cell visualisation. For example, headers, paragraphs and code blocks in markdown cells; comment lines, empty lines, code lines, magic lines / blocks, shell command lines in code cells.

In OU notebooks, being able to identify areas associated with activities would also be useful.

Supporting the level of detail required in the visualisation may be be tricky, particulary in long notebooks. A vertical, multi-column format is probably best showing eg an approximate “screen’s worth” of content in a column then the next “scroll” down displayed in the next column along.

Something else I can imagine is a simple service that would let you pass a link to an online notebook and get a visulisation back, or a link to a Github repo that would give you a visualisation back of all the notebooks in the repo. This would let you embed a link to the visualisation, for example, in the repo README. On the server side, I guess this means something that could clone a repo, generate the visualisation and return the image. To keep the workload down, the service would presumably keep a hash of the repo and the notebooks within the repo, and if any of those had changed, regenerate the image, else re-use a cached one. (It might also make sense to cache images at a notebook level to save having to reparse all the notebooks in a repo where only a single notebook has changed, and then merge those into a single output image?)

PS this has also go me thinking about simple visualisers over XML materials too… I do have an OU-XML to ipynb route (as well as OU-XML2md2html, for example), but a lot of the meaningful structure from the OU-XML would get lost on a trivial treatment (eg activity specifications, mutlimedia use, etc). I wonder if it’d make more sense to create an XSLT to generate a summary XML document and then visualise from that? Or create Jupytext md with lots of tags (eg tagging markdown cells as activities etc) that could be easily parsed out in a report? Hmmm… now that may make a lot more sense…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: