## Confusing Chart? Seaborn Jointplot

I’ve just been doodling with some data and the seaborn graphic library and managed to confuse myself a couple of times when quickly glancing at some “jointplots” that add marginal histograms to a scatter plot.

The code is easy enough:

```
sns.jointplot(x='Indoors Sub-domain Rank (where 1 is most deprived)',
              y='Outdoors Sub-domain Rank (where 1 is most deprived)',
              data=lsoaiw)
```

which gives charts of the form:

but how do you intuitively see the histograms to compare them?

My at-a-glance (tired, past midnight…) reaction keeps “seeing” the top comparison, representing a 90 degree counterclockwise rotation about the top left corner of the y-axis chart. But if you think about it for more than a glance, that obviously puts the large-y values at low-x, and low y-values at large-x.

The correct way is the bottom one; to make the comparison you need to flip the y-axis chart, folding its bottom right corner towards the top-left corner of the x-axis chart. This is a rotation and a reflection. Which is hard.

But if we do the flip to try to help (can we do this using seaborn???):

my eye now feels as if it wants to help out, keeping all the near corners of the various charts (top right of the y-axis chart, bottom left of the x-axis one, and top left of the main panel) close together, and flipping the bottom left corner of the y-axis chart as if up to the top right corner of the x-axis chart. (i.e. in this configuration it wants to do the flip?)

The distribution of bars in the marginal charts may also be complicating matters, encouraging the eye to match large with large (which in this case is wrong…).

Hmm….

## Fragment - Visualising Jupyter Notebook Structure

Over the weekend, I spent some time dabbling with generating various metrics over Jupyter notebooks (more about that in a later post…). Among the things I started looking at were tools for visualising notebook structure.

In the first instance, I wanted a simple tool to show the relative size of notebooks, as well as the size and placement of markdown and code cells within them.

The following is an example of a view over a simple notebook; the blue denotes a markdown cell, the pink a code cell, and the grey separates the cells. (The colour of the separator is controllable, as well as its size, which can be 0.)

When visualising multiple notebooks, we can also display the path to the notebook:

The code can be found in this gist.

The size of each cell in the diagram is determined as follows:

• for markdown cells, the number of “screen lines” taken up by the markdown when presented on a screen with a specified screen character width;
```
import textwrap
LINE_WIDTH = 160

def _count_screen_lines(txt, width=LINE_WIDTH):
    """Count the number of screen lines that an overflowing text line takes up."""
    ll = txt.split('\n')
    _ll = []
    for l in ll:
        #Model screen flow: split a line if it is more than `width` characters long
        #(textwrap.wrap() returns [] for an empty line, so blank lines are not counted)
        _ll=_ll+textwrap.wrap(l, width)
    n_screen_lines = len(_ll)
    return n_screen_lines
```

• for code cells, the number of lines of code; (long lines are counted over multiple lines as per markdown lines)
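As a quick check on the counting rule, here is a minimal, self-contained sketch. The function name `count_screen_lines` and the sample text are illustrative rather than taken from the gist. One thing it makes visible: `textwrap.wrap()` returns an empty list for an empty string, so with this approach blank lines contribute nothing to the count.

```python
import textwrap

def count_screen_lines(txt, width=160):
    """Count wrapped screen lines; a sketch of the rule described above."""
    n = 0
    for line in txt.split('\n'):
        # textwrap.wrap() returns [] for an empty line, so blank lines add 0
        n += len(textwrap.wrap(line, width))
    return n

# One short line, one blank line and one overlong (200 character) line
sample = "A short line\n\n" + "x" * 200
print(count_screen_lines(sample))  # 1 + 0 + 2 = 3
```

If blank lines should take up space in the diagram, one option would be to count each empty line as a single screen line instead.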

In parsing a notebook, we consider each cell in turn, capturing its cell type and screen line length, and returning a `cell_map` as a list of `(cell_size, cell_type)` tuples, where the cell type is recorded as its display colour:

```
import os
import nbformat
VIS_COLOUR_MAP  = {'markdown':'cornflowerblue','code':'pink'}

def _nb_vis_parse_nb(fn):
    """Parse a notebook and generate the nb_vis cell map for it."""

    cell_map = []

    _fn, fn_ext = os.path.splitext(fn)
    if not fn_ext=='.ipynb' or not os.path.isfile(fn):
        return cell_map

    with open(fn,'r') as f:
        #Read the notebook, normalising it to nbformat version 4
        nb = nbformat.read(f, as_version=4)

    for cell in nb.cells:
        cell_map.append((_count_screen_lines(cell['source']), VIS_COLOUR_MAP[cell['cell_type']]))

    return cell_map
```
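For anyone who wants to try the idea without installing `nbformat`, the following is a rough sketch that reads the notebook file with the standard `json` module instead. It assumes a version 4 notebook, where `source` may be stored on disk as a list of strings. The function name `parse_notebook_json` and the fallback colour for unrecognised cell types are my own assumptions, not part of the gist.

```python
import json
import os
import tempfile
import textwrap

COLOURS = {'markdown': 'cornflowerblue', 'code': 'pink'}

def parse_notebook_json(fn, width=160, default_colour='lightgrey'):
    """Build a (size, colour) cell map from a v4 .ipynb file using only json."""
    with open(fn, 'r', encoding='utf-8') as f:
        nb = json.load(f)
    cell_map = []
    for cell in nb.get('cells', []):
        src = cell.get('source', '')
        # On disk, `source` may be a list of strings rather than one string
        if isinstance(src, list):
            src = ''.join(src)
        size = sum(len(textwrap.wrap(line, width)) for line in src.split('\n'))
        # Unknown cell types (e.g. raw) fall back to an assumed default colour
        cell_map.append((size, COLOURS.get(cell['cell_type'], default_colour)))
    return cell_map

# Write a tiny two-cell notebook to a temporary file and parse it
demo = {"nbformat": 4, "nbformat_minor": 5, "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {},
             "source": ["# Title\n", "Some text"]},
            {"cell_type": "code", "metadata": {}, "execution_count": None,
             "outputs": [], "source": ["print('hello')"]},
        ]}
with tempfile.TemporaryDirectory() as tmp:
    demo_fn = os.path.join(tmp, 'demo.ipynb')
    with open(demo_fn, 'w', encoding='utf-8') as f:
        json.dump(demo, f)
    demo_map = parse_notebook_json(demo_fn)
print(demo_map)  # [(2, 'cornflowerblue'), (1, 'pink')]
```

Using `nbformat` is still the safer route in general, since it also handles older notebook versions.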

The following function walks a directory path and generates a cell map for each notebook it finds:

```
def _dir_walker(path, exclude = 'default'):
    """Profile all the notebooks in a specific directory and in any child directories."""

    if exclude == 'default':
        exclude_paths = ['.ipynb_checkpoints', '.git', '.ipynb', '__MACOSX']
    else:
        #If we set exclude, we need to pass it as a list
        exclude_paths = exclude
    nb_multidir_cell_map = {}
    for _path, dirs, files in os.walk(path):
        #Start walking...
        #If we're in a directory that is not excluded...
        if not set(exclude_paths).intersection(set(_path.split('/'))):
            #Profile that directory...
            for _f in files:
                fn = os.path.join(_path, _f)
                cell_map = _nb_vis_parse_nb(fn)
                if cell_map:
                    nb_multidir_cell_map[fn] = cell_map

    return nb_multidir_cell_map
```
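A possible variation on the walker, sketched here under my own naming (`find_notebooks` is hypothetical): rather than testing each visited path against the exclusion list, `os.walk()` lets you prune the `dirs` list in place so that excluded directories are never descended into at all. This sketch only lists notebook paths and leaves the parsing to a separate step.

```python
import os
import tempfile

def find_notebooks(path, exclude=('.ipynb_checkpoints', '.git', '__MACOSX')):
    """Yield .ipynb paths under `path`, skipping excluded directories."""
    for root, dirs, files in os.walk(path):
        # Pruning `dirs` in place stops os.walk descending into those directories
        dirs[:] = [d for d in dirs if d not in exclude]
        for name in sorted(files):
            if name.endswith('.ipynb'):
                yield os.path.join(root, name)

# Build a throwaway directory tree containing a notebook and a checkpoint copy
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'course', '.ipynb_checkpoints'))
    for rel in ('course/a.ipynb', 'course/.ipynb_checkpoints/a-checkpoint.ipynb'):
        open(os.path.join(tmp, *rel.split('/')), 'w').close()
    found = [os.path.relpath(p, tmp) for p in find_notebooks(tmp)]
print(found)  # only the notebook outside .ipynb_checkpoints is listed
```

Pruning also avoids splitting paths on `'/'`, which may matter on platforms that use a different path separator.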

The following function is used to grab the notebook file(s) and generate the visualisation:

```
def nb_vis_parse_nb(path, img_file='', linewidth = 5, w=20, **kwargs):
    """Parse one or more notebooks on a path."""

    if os.path.isdir(path):
        cell_map = _dir_walker(path)
    else:
        cell_map = _nb_vis_parse_nb(path)

    nb_vis(cell_map, img_file, linewidth, w, **kwargs)
```

So how is the visualisation generated?

A plotter function generates the plot from a `cell_map`:

```
import matplotlib.pyplot as plt

def plotter(cell_map, x, y, label='', header_gap = 0.2,
            gap=0, gap_colour='lightgrey', linewidth=5):
    """Plot visualisation of gross cell structure for a single notebook."""
    #gap, gap_colour and linewidth are explicit arguments so the function
    #does not rely on names defined elsewhere; header_gap is currently unused

    #Plot notebook path
    plt.text(y, x, label)

    for _cell_map in cell_map:

        #Add a coloured bar between cells
        if y > 0:
            if gap_colour:
                plt.plot([y,y+gap],[x,x], color=gap_colour, linewidth=linewidth)

            y = y + gap

        _y = y + _cell_map[0] + 1 #Make tiny cells slightly bigger
        plt.plot([y,_y],[x,x], color=_cell_map[1], linewidth=linewidth)

        y = _y
```
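To make the position arithmetic easier to follow without needing a plot, here is a small sketch that computes where each bar would start and end. The helper name `layout_segments` and its `pad` argument are hypothetical; it assumes the gap is only added before the second and subsequent cells, which is how I read the plotter.

```python
def layout_segments(cell_map, gap=1, pad=1):
    """Return (start, end, colour) for each cell, mirroring the plotter arithmetic."""
    segments = []
    pos = 0
    for size, colour in cell_map:
        if pos > 0:
            pos += gap          # leave room for the separator between cells
        end = pos + size + pad  # the +1 padding keeps empty cells visible
        segments.append((pos, end, colour))
        pos = end
    return segments

demo_segments = layout_segments(
    [(2, 'cornflowerblue'), (0, 'pink'), (5, 'pink')], gap=1)
print(demo_segments)
# [(0, 3, 'cornflowerblue'), (4, 5, 'pink'), (6, 12, 'pink')]
```

Note how the empty code cell still gets a bar one unit long, thanks to the padding.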

The `gap` can be automatically calculated relative to the longest notebook we’re trying to visualise, which sets the visualisation limits:

```
import math

def get_gap(cell_map):
    """Automatically set the gap value based on overall length"""

    def get_overall_length(cell_map):
        """Get overall line length of a notebook."""
        overall_len = 0
        for i ,(l,t) in enumerate(cell_map):
            #i is number of cells if that's useful too?
            overall_len = overall_len + l
        return overall_len

    max_overall_len = 0

    #If we are generating a plot for multiple notebooks, get the largest overall length
    if isinstance(cell_map,dict):
        for k in cell_map:
            _overall_len = get_overall_length(cell_map[k])
            max_overall_len = _overall_len if _overall_len > max_overall_len else max_overall_len
    else:
        max_overall_len = get_overall_length(cell_map)

    #Set the gap at 1% of the overall length, rounded up
    return math.ceil(max_overall_len * 0.01)
```
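As a worked example of the gap calculation, the following sketch uses made-up cell maps for two notebooks. The helper name `gap_for` and the file names are purely illustrative; the 1% fraction matches the multiplier used in the code.

```python
import math

def gap_for(cell_maps, fraction=0.01):
    """Gap size as a fraction of the longest notebook's total line count."""
    longest = max(sum(size for size, _ in cm) for cm in cell_maps.values())
    return math.ceil(longest * fraction)

demo_maps = {
    'short.ipynb': [(40, 'cornflowerblue'), (80, 'pink')],    # 120 lines in total
    'long.ipynb': [(100, 'cornflowerblue'), (150, 'pink')],   # 250 lines in total
}
print(gap_for(demo_maps))  # ceil(250 * 0.01) = 3
```

The longest notebook drives the gap, so separators are the same size across every row of a multi-notebook plot.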

The `nb_vis()` function takes the `cell_map`, either as a single cell map for a single notebook, or as a dict of cell maps for multiple notebooks, keyed by the notebook path:

```
def nb_vis(cell_map, img_file='', linewidth = 5, w=20, gap=None, gap_boost=1, gap_colour='lightgrey'):
    """Visualise notebook gross cell structure."""

    x=0
    y=0

    #If we have a single cell_map for a single notebook
    if isinstance(cell_map,list):
        gap = gap if gap is not None else get_gap(cell_map) * gap_boost
        fig, ax = plt.subplots(figsize=(w, 1))
        plotter(cell_map, x, y, gap=gap, gap_colour=gap_colour, linewidth=linewidth)
    #If we are plotting cell_maps for multiple notebooks
    elif isinstance(cell_map,dict):
        gap = gap if gap is not None else get_gap(cell_map) * gap_boost
        fig, ax = plt.subplots(figsize=(w,len(cell_map)))
        for k in cell_map:
            plotter(cell_map[k], x, y, k, gap=gap, gap_colour=gap_colour, linewidth=linewidth)
            x = x + 1
    else:
        print('wtf')
        #No axes were created, so there is nothing to tidy up or save
        return
    ax.axis('off')
    plt.gca().invert_yaxis()

    if img_file:
        plt.savefig(img_file)
```

The function will render the plot in a Jupyter notebook, or can be called to save the visualisation to a file.

This was just done as a quick proof of concept, so comments welcome.

On the to do list is to create a simple CLI (command line interface) for it, as well as explore additional customisation support (eg allow the colours used for each cell type to be specified). I also need to account for other cell types. An optional legend explaining the colour map would also make sense.

On the longer to do list is a visualiser that supports within cell visualisation. For example, headers, paragraphs and code blocks in markdown cells; comment lines, empty lines, code lines, magic lines / blocks, shell command lines in code cells.
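As a first stab at the code cell side of that, here is a rough sketch of a line classifier. The function name `classify_code_line` and the category labels are my own invention, and the rules are crude prefix checks (for example, they would misclassify a continuation line that happens to start with `%`), so treat it as a starting point only.

```python
def classify_code_line(line):
    """Assign a rough category to one line of a code cell."""
    stripped = line.strip()
    if not stripped:
        return 'empty'
    if stripped.startswith('%'):
        return 'magic'      # covers both %line and %%cell magics
    if stripped.startswith('!'):
        return 'shell'
    if stripped.startswith('#'):
        return 'comment'
    return 'code'

cell_source = "%matplotlib inline\n# Load the data\n\n!ls data\nimport pandas as pd"
line_types = [classify_code_line(l) for l in cell_source.split('\n')]
print(line_types)  # ['magic', 'comment', 'empty', 'shell', 'code']
```

Each category could then be mapped to its own colour in the same way that cell types are now.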

In OU notebooks, being able to identify areas associated with activities would also be useful.

Supporting the level of detail required in the visualisation may be tricky, particularly in long notebooks. A vertical, multi-column format is probably best, showing eg an approximate “screen’s worth” of content in a column, then the next “scroll” down displayed in the next column along.

Something else I can imagine is a simple service that would let you pass a link to an online notebook and get a visualisation back, or a link to a GitHub repo that would give you a visualisation back of all the notebooks in the repo. This would let you embed a link to the visualisation, for example, in the repo README. On the server side, I guess this means something that could clone a repo, generate the visualisation and return the image. To keep the workload down, the service would presumably keep a hash of the repo and the notebooks within the repo, and if any of those had changed, regenerate the image, else re-use a cached one. (It might also make sense to cache images at a notebook level to save having to reparse all the notebooks in a repo where only a single notebook has changed, and then merge those into a single output image?)
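The notebook-level caching check might look something like the following sketch. Everything here is an assumption on my part: the function names `file_digest` and `needs_regenerating` are hypothetical, the cache is just an in-memory dict, and a real service would need to persist it somewhere.

```python
import hashlib
import os
import tempfile

def file_digest(fn):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(fn, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_regenerating(fn, cache):
    """True if `fn` is new or changed since its digest was stored in `cache`."""
    digest = file_digest(fn)
    changed = cache.get(fn) != digest
    cache[fn] = digest  # remember the latest digest for next time
    return changed

with tempfile.TemporaryDirectory() as tmp:
    nb_fn = os.path.join(tmp, 'demo.ipynb')
    cache = {}
    with open(nb_fn, 'w') as f:
        f.write('{"cells": []}')
    first = needs_regenerating(nb_fn, cache)    # not seen before
    second = needs_regenerating(nb_fn, cache)   # unchanged since last check
    with open(nb_fn, 'w') as f:
        f.write('{"cells": [], "metadata": {}}')
    third = needs_regenerating(nb_fn, cache)    # contents have changed
print(first, second, third)  # True False True
```

Only the notebooks flagged as changed would then need reparsing before the per-notebook images are merged.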

PS this has also got me thinking about simple visualisers over XML materials too… I do have an OU-XML to ipynb route (as well as OU-XML2md2html, for example), but a lot of the meaningful structure from the OU-XML would get lost on a trivial treatment (eg activity specifications, multimedia use, etc). I wonder if it’d make more sense to create an XSLT to generate a summary XML document and then visualise from that? Or create Jupytext md with lots of tags (eg tagging markdown cells as activities etc) that could be easily parsed out in a report? Hmmm… now that may make a lot more sense…