One of the issues we know students have with the Jupyter notebooks that we provide as part of the course is that there is no straightforward way of printing them all out for offscreen reading / annotation. (As well as code, there is a certain amount of practical and code related explanatory material in the notebooks.)
One of the things I started to doodle with last year was a simple script to merge several notebooks and then render the result as a Microsoft Word doc. This has a dependency on pandoc, though not LaTeX, and requires that the conversion takes place via HTML: the ipynb file is converted to HTML using nbconvert, then from HTML to docx using pandoc. If there are image files transcluded into the notebook, this also means that the pandoc conversion process needs to be executed in the same directory as the notebook so that the image paths are correctly recognised. (When running nbconvert with the html_embed output, pandoc fell over.)
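As a rough illustration of the two-step route described above, here is a minimal sketch that builds the nbconvert and pandoc commands and runs them with the notebook's own directory as the working directory, so that relative image paths resolve. The function names, the `runner` parameter and the default output filename are assumptions made for this example; they are not part of the original script.

```python
import subprocess


def build_docx_commands(stem, out_name="_merged_notebooks.docx"):
    """Return the two command lists for the ipynb -> HTML -> docx route.

    `stem` is the notebook filename without its .ipynb extension
    (a hypothetical convention for this sketch).
    """
    to_html = ["jupyter", "nbconvert", "--to", "html", stem + ".ipynb"]
    to_docx = ["pandoc", "-s", stem + ".html", "-o", out_name]
    return [to_html, to_docx]


def run_in_notebook_dir(commands, notebook_dir, runner=subprocess.check_call):
    """Run each command with the notebook's directory as the working directory."""
    for cmd in commands:
        runner(cmd, cwd=notebook_dir)
```

Passing the commands as lists rather than as a single shell string avoids having to quote filenames; the `runner` argument is only there so the sketch can be exercised without nbconvert or pandoc installed.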
Having to run pandoc in a local, image path respecting directory is a pain because it means I can’t run it over a merged notebook file composed of notebooks from multiple directories, which in turn means that I have to generate a separate docx file for the notebooks in each separate directory. Whilst I could move these into the same directory to make accessing them all a bit easier, it still means students have to print out multiple documents. I did try using a Python package to merge the Word docs, but it borked on the images.
There are Python packages that can merge PDF documents in a more reliable way, but I am having issues with getting a sensible PDF workflow together. Firstly, for pandoc to render documents to PDF seems to require the texlive-xetex package, which adds considerable weight to the VM (and I don’t know the dependency voodoo required to get a minimum viable LaTeX distribution in place). Secondly, my test notebooks included a pymarkdown inline element that embedded a pandas dataframe in a markdown cell, and this seemed to break the pandoc PDF conversion at that point.
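Given that the PDF route depends on a LaTeX install being present, one option is to check for it before attempting a PDF export and fall back to docx otherwise. The following is only a sketch: the function name and the `which` parameter are made up for this example, and it assumes that the PDF conversion calls an executable named xelatex (which, as far as I know, is what texlive-xetex provides).

```python
import shutil


def choose_output_type(preferred="pdf", fallback="docx", which=shutil.which):
    """Return `preferred`, unless it is 'pdf' and no xelatex executable is found.

    `which` defaults to shutil.which and is injectable for testing
    (an assumption of this sketch, not part of the original script).
    """
    if preferred == "pdf" and which("xelatex") is None:
        return fallback
    return preferred
```

The result could then be passed as the output type to the export routine, so that a lightweight student VM without LaTeX still produces the Word output.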
One thing I haven’t done yet is look at customising the output templates so that we can brand the exported documents. For this, I need to look at custom templates.
My initial sketch code for the ‘export merged notebooks in a directory as docx’ routine is available via this gist. One thing I need to do is wrap it in a simple CLI command. Comments / suggestions for improvement, or links to better alternatives, more than welcome!
#https://stackoverflow.com/a/3207973/454773
from nbformat.v4 import new_notebook, new_markdown_cell
import nbformat
import io
import os
import subprocess
import random
import string
#from PyPDF2 import PdfFileMerger, PdfFileReader

def merged_notebooks_in_dir(dirpath, filenames):
    ''' Merge all notebooks in a directory into a single notebook '''
    fns = ['{}/{}'.format(dirpath, fn) for fn in filenames if '.ipynb_checkpoints' not in dirpath and fn.endswith('.ipynb')]
    if fns:
        merged = new_notebook()
        #Identify directory containing merged notebooks
        cell = '\n\n---\n\n# {}\n\n---\n\n'.format(dirpath)
        merged.cells.append(new_markdown_cell(cell))
    else:
        return
    for fn in fns:
        #print(fn)
        notebook_name = fn.split('/')[-1]
        with io.open(fn, 'r', encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
        #Identify filename of notebook
        cell = '\n\n---\n\n# {}\n\n---\n\n'.format(fn)
        merged.cells.append(new_markdown_cell(cell))
        merged.cells.extend(nb.cells)
    if not hasattr(merged.metadata, 'name'):
        merged.metadata.name = ''
    merged.metadata.name += "_merged"
    return nbformat.writes(merged)

def merged_notebooks_down_path(path, typ='docx', execute=False):
    ''' Walk a path, creating an output file in each directory that merges all notebooks in the directory '''
    for (dirpath, dirnames, filenames) in os.walk(path):
        if '.ipynb_checkpoints' in dirpath: continue
        #Should we run the execute processor here on each notebook separately,
        # ensuring that images are embedded, and then merge the executed notebook files?
        merged_nb = merged_notebooks_in_dir(dirpath, filenames)
        if not merged_nb: continue
        #Write the merged notebook to a randomly named temporary file
        fn = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
        with open('{}/{}.ipynbx'.format(dirpath, fn), 'w') as f:
            f.write(merged_nb)
        # Execute the merged notebook in its directory so that images are correctly handled
        # Using html_embed seems to cause pandoc to fall over?
        # The pdf conversion requires installation of texlive-xetex and inkscape
        # This adds significant weight to the VM: maybe we need an MT/production VM and a student build?
        # Inline code execution generated using python-markdown extension seems to break PDF generation
        # at the first instance of inline code? Need to add a preprocessor?
        # We could maybe process the notebook inline rather than via the commandline
        # In such a case, the following may be a useful reference:
        #https://github.com/ipython-contrib/jupyter_contrib_nbextensions/blob/master/docs/source/exporting.rst
        exe = ' --ExecutePreprocessor.timeout=600 --ExecutePreprocessor.allow_errors=True --execute' if execute else ''
        if typ == 'pdf':
            cmd = 'jupyter nbconvert --to pdf {exe} "{fn}".ipynbx'.format(exe=exe, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
        elif typ in ['docx']:
            cmd = 'jupyter nbconvert --to html {exe} "{fn}".ipynbx'.format(exe=exe, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            cmd = 'pandoc -s "{fn_out}".html -o _merged_notebooks.{typ}'.format(fn_out=fn, typ=typ)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            #The intermediate HTML file only exists on the docx route
            os.remove("{}/{}.html".format(dirpath, fn))
        os.remove("{}/{}.ipynbx".format(dirpath, fn))
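
On the point about wrapping the routine in a simple CLI command, one possible shape for it is sketched below using argparse from the standard library. The program name, the option names and the `runner` parameter are all assumptions made for illustration; `runner` stands in for a function with the same signature as merged_notebooks_down_path so that the sketch runs on its own.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical 'nb-merge-export' command."""
    parser = argparse.ArgumentParser(
        prog="nb-merge-export",
        description="Merge the notebooks in each directory under a path and export them.",
    )
    parser.add_argument("path", help="root directory to walk")
    parser.add_argument("--type", dest="typ", choices=["docx", "pdf"],
                        default="docx", help="output format (default: docx)")
    parser.add_argument("--execute", action="store_true",
                        help="execute the merged notebook before converting it")
    return parser


def main(argv=None, runner=None):
    """Parse arguments and hand them to `runner(path, typ=..., execute=...)`.

    With no runner supplied, just report what would be done.
    """
    args = build_parser().parse_args(argv)
    if runner is None:
        print("Would export {} as {} (execute={})".format(args.path, args.typ, args.execute))
    else:
        runner(args.path, typ=args.typ, execute=args.execute)
    return args
```

In the same module as the code above, calling `main(runner=merged_notebooks_down_path)` would then read the arguments from the command line and run the export.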