Generating Printable MS Word Versions of Merged Jupyter Notebooks

One of the issues we know students have with the Jupyter notebooks that we provide as part of the course is that there is no straightforward way of printing them all out for offscreen reading / annotation. (As well as code, there is a certain amount of practical and code-related explanatory material in the notebooks.)

One of the things I started to doodle with last year was a simple script to merge several notebooks and then render the result as a Microsoft Word doc. This has a dependency on pandoc, though not LaTeX, and requires that the conversion takes place via HTML: the ipynb file is converted to HTML using nbconvert, then from HTML to docx using pandoc. If there are image files transcluded into the notebook, this also means that the pandoc conversion process needs to be executed in the same directory as the notebook so that the relative image paths are correctly recognised. (When running nbconvert with the html_embed output, pandoc fell over.)
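As a rough illustration of that two-step route, the sketch below builds the nbconvert and pandoc commands as argument lists and runs them with the notebook's own directory as the working directory, so that relative image paths resolve. The helper names are made up for illustration, and it assumes that `jupyter` and `pandoc` are both available on the path.

```python
import os
import subprocess


def docx_commands(notebook_path):
    """Return (workdir, commands) for an ipynb -> HTML -> docx conversion.

    Hypothetical helper: the commands use bare filenames because they are
    meant to be run from inside the notebook's own directory.
    """
    workdir, filename = os.path.split(os.path.abspath(notebook_path))
    stem = os.path.splitext(filename)[0]
    commands = [
        # Step 1: notebook -> HTML
        ["jupyter", "nbconvert", "--to", "html", filename],
        # Step 2: HTML -> docx, as a standalone document
        ["pandoc", "-s", stem + ".html", "-o", stem + ".docx"],
    ]
    return workdir, commands


def convert_to_docx(notebook_path):
    """Run both steps with the notebook's directory as the working directory."""
    workdir, commands = docx_commands(notebook_path)
    for cmd in commands:
        subprocess.check_call(cmd, cwd=workdir)
    # Path of the docx file produced by the second step
    return os.path.join(workdir, commands[-1][-1])
```

Passing the commands as lists rather than as a single shell string avoids having to quote filenames that contain spaces.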

Having to run pandoc in a local, image-path-respecting directory is a pain because it means I can’t run it over a merged notebook file composed of notebooks from multiple directories. Which means that I have to generate a separate docx file for the notebooks in each separate directory. Whilst I could move these into the same directory to make accessing them all a bit easier, it still means students have to print out multiple documents. I did try using a Python package to merge the Word docs, but it borked on the images.
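One possible way round this, untested against the course notebooks, might be to rewrite relative image references as absolute paths before merging, so that the merged notebook no longer depends on where pandoc is run. The sketch below assumes images are referenced using Markdown `![alt](path)` syntax or HTML `<img src="...">` tags, and that pandoc will happily resolve absolute file paths; `absolutise_image_paths` is a hypothetical helper name.

```python
import os
import re

# Markdown image: ![alt](path)   HTML image: <img ... src="path" ...>
MD_IMAGE = re.compile(r'(!\[[^\]]*\]\()([^)\s]+)')
HTML_IMAGE = re.compile(r'(<img\b[^>]*?\bsrc=["\'])([^"\']+)', re.IGNORECASE)
# Anything starting with a scheme, e.g. https:, data:, attachment:
HAS_SCHEME = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*:')


def absolutise_image_paths(markdown_text, base_dir):
    """Rewrite relative image paths in markdown_text relative to base_dir."""
    def fix(match):
        prefix, path = match.group(1), match.group(2)
        # Leave URLs, attachments and already-absolute paths alone
        if HAS_SCHEME.match(path) or os.path.isabs(path):
            return match.group(0)
        return prefix + os.path.abspath(os.path.join(base_dir, path))

    text = MD_IMAGE.sub(fix, markdown_text)
    return HTML_IMAGE.sub(fix, text)
```

Applied to the source of each markdown cell as a notebook is read in, with `base_dir` set to that notebook's directory, this would in principle let notebooks from several directories share one merged file. Images generated as cell outputs are stored in the notebook itself and would not need this treatment.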

There are Python packages that can merge PDF documents in a more reliable way, but I am having issues with getting a sensible PDF workflow together. First, getting pandoc to render documents to PDF seems to require the texlive-xetex package, which adds considerable weight to the VM (and I don’t know the dependency voodoo required to get a minimum viable LaTeX distribution in place). Second, my test notebooks included a pymarkdown inline element that embedded a pandas dataframe in a markdown cell, and this seemed to break the pandoc PDF conversion at that point.
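If the inline `{{ ... }}` expressions used by the python-markdown extension are what trips up the PDF route, one option might be to neutralise them before conversion. The sketch below is a guess at a workaround rather than a tested fix: it replaces each `{{ expression }}` in a markdown cell with the expression wrapped in backticks, so that it is rendered as literal code instead of being evaluated. `neutralise_inline_code` is a hypothetical name, and the notebook is treated as a plain dict, as loaded from the JSON file.

```python
import copy
import re

# Inline expression as used by the python-markdown extension: {{ expr }}
INLINE_EXPR = re.compile(r'\{\{(.+?)\}\}')


def neutralise_inline_code(notebook):
    """Return a copy of the notebook with {{ expr }} shown as `expr`."""
    nb = copy.deepcopy(notebook)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # In the on-disk JSON, source may be stored as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        cell["source"] = INLINE_EXPR.sub(
            lambda m: "`" + m.group(1).strip() + "`", source)
    return nb
```

The cost is that the printed document shows the expression rather than its value, so a dataframe embedded that way would be lost from the output.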

One thing I haven’t done yet is look at customising the output so that we can brand the exported documents; for that, I need to look into custom export templates.
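For the docx route, one place to start might be pandoc's `--reference-doc` option, which takes the styles for the output (fonts, heading styles, margins) from an existing Word document. A minimal sketch of how the pandoc command could be built, where `branding.docx` is a hypothetical file name and the helper is made up for illustration:

```python
def pandoc_docx_command(html_file, docx_file, reference_doc=None):
    """Build a pandoc HTML -> docx command, optionally with a style reference."""
    cmd = ["pandoc", "-s", html_file, "-o", docx_file]
    if reference_doc:
        # Styles are copied from this document; its content is ignored
        cmd.append("--reference-doc=" + reference_doc)
    return cmd


# Example: a branded export using a (hypothetical) house style document
command = pandoc_docx_command("merged.html", "_merged_notebooks.docx",
                              reference_doc="branding.docx")
```

The resulting list can be handed to `subprocess.check_call` with `cwd` set to the notebook directory. As far as I know, pandoc versions before 2.0 spelled the option `--reference-docx`, so the flag may need changing on an older install.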

My initial sketch code for the ‘export merged notebooks in a directory as docx’ routine is available via this gist. One thing I need to do is wrap it in a simple CLI command. Comments / suggestions for improvement, or links to better alternatives, more than welcome!

#https://stackoverflow.com/a/3207973/454773
from nbformat.v4 import new_notebook, new_markdown_cell
import nbformat
import io
import os
import subprocess
import random
import string
#from PyPDF2 import PdfFileMerger, PdfFileReader
def merged_notebooks_in_dir(dirpath, filenames):
    ''' Merge all notebooks in a directory into a single notebook '''
    # Skip checkpoint directories and anything that is not a notebook
    fns = ['{}/{}'.format(dirpath, fn) for fn in filenames
           if '.ipynb_checkpoints' not in dirpath and fn.endswith('.ipynb')]
    if not fns:
        return None

    merged = new_notebook()
    # Identify directory containing merged notebooks
    cell = '\n\n\n\n# {}\n\n\n\n'.format(dirpath)
    merged.cells.append(new_markdown_cell(cell))

    for fn in fns:
        with io.open(fn, 'r', encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
        # Identify filename of notebook
        cell = '\n\n\n\n# {}\n\n\n\n'.format(fn)
        merged.cells.append(new_markdown_cell(cell))
        merged.cells.extend(nb.cells)

    if not hasattr(merged.metadata, 'name'):
        merged.metadata.name = ''
    merged.metadata.name += "_merged"
    # Return the merged notebook serialised as a JSON string
    return nbformat.writes(merged)

def merged_notebooks_down_path(path, typ='docx', execute=False):
    ''' Walk a path, creating an output file in each directory that merges all notebooks in the directory '''
    # Build the optional nbconvert execution flags once, outside the loop
    exe = ' --ExecutePreprocessor.timeout=600 --ExecutePreprocessor.allow_errors=True --execute' if execute else ''
    for (dirpath, dirnames, filenames) in os.walk(path):
        if '.ipynb_checkpoints' in dirpath:
            continue
        # Should we run the execute processor here on each notebook separately,
        # ensuring that images are embedded, and then merge the executed notebook files?
        merged_nb = merged_notebooks_in_dir(dirpath, filenames)
        if not merged_nb:
            continue
        # Write the merged notebook to a temporary, randomly named file
        fn = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
        with open('{}/{}.ipynbx'.format(dirpath, fn), 'w') as f:
            f.write(merged_nb)
        # Execute the merged notebook in its directory so that images are correctly handled
        # Using html_embed seems to cause pandoc to fall over?
        # The pdf conversion requires installation of texlive-xetex and inkscape
        # This adds significant weight to the VM: maybe we need an MT/production VM and a student build?
        # Inline code execution generated using python-markdown extension seems to break PDF generation
        # at the first instance of inline code? Need to add a preprocessor?
        # We could maybe process the notebook inline rather than via the commandline
        # In such a case, the following may be a useful reference:
        # https://github.com/ipython-contrib/jupyter_contrib_nbextensions/blob/master/docs/source/exporting.rst
        if typ == 'pdf':
            # Note: the PDF keeps the temporary random filename
            cmd = 'jupyter nbconvert --to pdf {exe} "{fn}".ipynbx'.format(exe=exe, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
        elif typ in ['docx']:
            # Two steps: ipynb -> HTML via nbconvert, then HTML -> docx via pandoc
            cmd = 'jupyter nbconvert --to html {exe} "{fn}".ipynbx'.format(exe=exe, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            cmd = 'pandoc -s "{fn_out}".html -o _merged_notebooks.{typ}'.format(fn_out=fn, typ=typ)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            os.remove("{}/{}.html".format(dirpath, fn))
        # Tidy up the temporary merged notebook
        os.remove("{}/{}.ipynbx".format(dirpath, fn))
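
As for the simple CLI wrapper mentioned above, one way it might look is sketched below using argparse. The option names (`--to`, `--execute`) are illustrative choices rather than anything from the gist, and the conversion function is passed in so that the sketch stands alone; in practice it would be `merged_notebooks_down_path`.

```python
import argparse


def build_parser():
    """Command line options mirroring the arguments of the conversion function."""
    parser = argparse.ArgumentParser(
        description="Merge the notebooks in each directory under PATH and export them.")
    parser.add_argument("path", help="root directory to walk")
    parser.add_argument("--to", dest="typ", choices=["docx", "pdf"], default="docx",
                        help="output format (default: docx)")
    parser.add_argument("--execute", action="store_true",
                        help="execute the merged notebook before converting")
    return parser


def main(convert, argv=None):
    """Parse argv (defaults to sys.argv[1:]) and hand the options to convert.

    convert is assumed to accept (path, typ=..., execute=...), which matches
    merged_notebooks_down_path in the listing above.
    """
    args = build_parser().parse_args(argv)
    convert(args.path, typ=args.typ, execute=args.execute)
```

Saved alongside the listing above, a final `if __name__ == "__main__": main(merged_notebooks_down_path)` would let the script be run as something like `python merge_notebooks.py notebooks/ --to docx`, where `merge_notebooks.py` is a hypothetical file name.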

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...