Generating “Print Pack” Material (PDF, epub) From Jupyter Notebooks Using pandoc

For one of the courses I work on, we produce “print pack” materials (PDF, epub), on request, for students who require print, or alternative format, copies of the Jupyter notebooks used in the module. This does and doesn’t make a lot of sense. On the one hand, having a print copy of notebooks may be useful for creating physical annotations. On the other, the notebooks are designed as interactive, and often generative, materials: interactive in the sense students are expect to run code, as well modify, create and execute their own code; generative in the sense that outputs are generated by code execution and the instructional content may explicitly refer to things that have been so generated.

In producing the print / alternative format material, we generally render the content from un-run notebooks, which is to say the the print material notebooks do not include code outputs. Part of the reason for this is that we want the students to do the work: if we hnded out completed worksheets, there’d be no need to work through and complete the worksheets, right?

Furthermore, whilst for some notebooks, it may be possible to meaningfully run all cells and then provide executed/run cell notebooks as print materials, in other modules this may not make sense. In our TM129 robotics block, an interactive simulator widget is generated and controlled from the run notebook, and it doesn’t make sense to naively share a notebook with all cells run. Instead, we would have to share screenshots of the simulator widget following each notebook activity, and here may be several such activities in each notebook. (It might be instructive to try to automate the creation of such screenshots, eg using the JupyterLab galata test framework.)

Anyway, I started looking at how to automate the generation of print packs. The following code [(https://gist.github.com/psychemedia/6288db9ef97dc17b2fbd909a7516f12b)%5D is hard wired for a directory structure where the content is in ./content directory, with subdirectories for each week starting with a two digit week number (01., 02. etc) and notebooks in each week directory numbered according to week/notebook number (eg 01.1, 01.2, …, 02.1, 02.2, … etc.).

# # `print_publication.py`
#
# Script for generating print items (weekly PDF, weekly epub).
# # Install requirements
#
# – Python
# – pandoc
# – python packages:
# – ipython
# – nbconvert
# – nbformat
# – pymupdf
from pathlib import Path
#import nbconvert
import nbformat
from nbconvert import HTMLExporter
#import pypandoc
import os
import secrets
import shutil
import subprocess
import fitz #pip install pymupdf
html_exporter = HTMLExporter(template_name = 'classic')
pwd = Path.cwd()
print(f'Starting in: {pwd}')
# +
nb_wd = "content" # Path to weekly content folders
pdf_output_dir = "print_pack" # Path to output dir
# Create print pack output dir if required
Path(pdf_output_dir).mkdir(parents=True, exist_ok=True)
# –
# Iterate through weekly content dirs
# We assume the dir starts with a week number
for p in Path(nb_wd).glob("[0-9]*"):
print(f'- processing: {p}')
if not p.is_dir():
continue
# Get the week number
weeknum = p.name.split(". ")[0]
# Settings for pandoc
pdoc_args = ['-s', '-V geometry:margin=1in',
'–toc',
#f'–resource-path="{p.resolve()}"', # Doesn't work?
'–metadata', f'title="TM129 Robotics — Week {weeknum}"']
#cd to week directory
os.chdir(p)
# Create a tmp directory for html files
# Rather than use tempfile, create our own lest we want to persist it
_tmp_dir = Path(secrets.token_hex(5))
_tmp_dir.mkdir(parents=True, exist_ok=True)
# Find notebooks for the current week
for _nb in Path.cwd().glob("*.ipynb"):
nb = nbformat.read(_nb, as_version=4)
# Generate HTML version of document
(body, resources) = html_exporter.from_notebook_node(nb)
with open(_tmp_dir / _nb.name.replace(".ipynb", ".html"), "w") as f:
f.write(body)
# Now convert the HTML files to PDF
# We need to run pandoc in the correct directory so that
# relatively linked image files are correctly picked up.
# Specify output PDF path
pdf_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf")
epub_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.epub")
# It seems pypandoc is not sorting the files in ToC etc?
#pypandoc.convert_file(f"{temp_dir}/*html",
# to='pdf',
# #format='html',
# extra_args=pdoc_args,
# outputfile= str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf"))
# Hacky – requires IPython
# #! pandoc -s -o {pdf_out} -V geometry:margin=1in –toc –metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html
# #! pandoc -s -o {epub_out} –metadata title="TM129 Robotics — Week {weeknum}" –metadata author="The Open University, 2022" {_tmp_dir}/*html
subprocess.call(f'pandoc –quiet -s -o {pdf_out} -V geometry:margin=1in –toc –metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html', shell = True)
subprocess.call(f'pandoc –quiet -s -o {epub_out} –metadata title="TM129 Robotics — Week {weeknum}" –metadata author="The Open University, 2022" {_tmp_dir}/*html', shell = True)
# Tidy up tmp dir
shutil.rmtree(_tmp_dir)
#Just in case we need to know relatively where we are…
#Path.cwd().relative_to(pwd)
# Go back to the home dir
os.chdir(pwd)
os.chdir(pwd)
# ## Add OU Logo to First Page of PDF
#
# Add an OU logo to the first page of the PDF documents
# +
logo_file = ".print_assets/OU-logo-83×65.png"
img = open(logo_file, "rb").read()
# define the position (upper-left corner)
logo_container = fitz.Rect(60,40,143,105)
for f in Path(pdf_output_dir).glob("*.pdf"):
print(f'- branding: {f}')
with fitz.open(f) as pdf:
pdf_first_page = pdf[0]
pdf_first_page.insert_image(logo_container, stream=img)
pdf_out = f.name.replace(".pdf", "_logo.pdf")
txt_origin = fitz.Point(350, 770)
text = "Copyright © The Open University, 2022"
for page in pdf:
page.insert_text(txt_origin, text)
pdf.save(Path(pdf_output_dir) / pdf_out)
#Remove the unbranded PDF
os.remove(f)
view raw print_pack.py hosted with ❤ by GitHub

The script uses pandoc to generate the PDF and epub documents, one per weekly directory. The PDF generator also includes a table of contents, automatically generated from headings by pandoc. A second pass using fitz/pymupdf then adds a logo and copyright notice to each PDF.

PDF with post-processed addition of a custom logo
PDF with post-process addition of a copyright footer

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: