Docker Crossbuilds: Platform Sensitive Dockerfile Commands

Whilst it’s true to a certain extent that Docker containers “run anywhere”, that “anywhere” comes with a proviso: Docker images are built for particular operating system architectures. A wide variety of machines run the amd64 processor instruction set, but other devices, such as the new Mac M1 processor based machines, are arm64 devices; and Raspberry Pis can run either arm64 or arm/v7 (32 bit).

A single Dockerfile can be used to build images for each architecture using build commands of the form:

docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 .

In some cases, you may need to modify the Dockerfile to perform slightly different actions based on the architecture type. Several arguments are available in the Dockerfile (brought into scope by declaring them using ARG) that allow you to modify behaviour based on the architecture:

  • TARGETPLATFORM – platform of the build result (for example, linux/amd64, linux/arm64, linux/arm/v7, windows/amd64)
  • TARGETOS – OS component of TARGETPLATFORM (for example, linux)
  • TARGETARCH – architecture component of TARGETPLATFORM (for example, amd64, arm64)
  • TARGETVARIANT – variant component of TARGETPLATFORM (for example, v7)

In a Dockerfile, we might then have something like:

ARG TARGETPLATFORM

...

# RUN command for specific target platforms
RUN if [ "$TARGETPLATFORM" = "linux/amd64" ] || ["$TARGETPLATFORM" = "linux/arm64"] ; \
    then MPLBACKEND=Agg python3 -c "import matplotlib.pyplot" &>/dev/null ; fi 

For more discussion and examples, see WIP: Docker --platform translation example for TARGETPLATFORM.

Draft: Glossary of Jupyter Production Workflow Terms

Jupyter: an open source community project focused on the development of the Jupyter ecosystem (tools and architectures for the deployment of arbitrary executable code environments and reproducible "computational essay" documents). Coined from the original three programming languages supported by the IPython notebook architecture, which was subsumed into the Jupyter project as Jupyter Notebooks: Julia, Python and R.

Jupyter Notebooks: variously: a browser based interactive Jupyter notebook; a textual document format (.ipynb); and (less frequently) the single user Jupyter notebook server. In the first, most commonly used, sense, the Jupyter notebook is a browser based application within which users can edit, render and save markdown (text rendered as HTML), edit code in a wide variety of languages (including but not limited to Python, Javascript, R, Java, C++, SQL), execute the code on a code server, and then return and display the response/code outputs in the interactive notebook. The Jupyter notebook document format is a text (JSON) document format that can embed the markdown text, code and code outputs. The cell based structure of the notebook format supports the use of metadata "tags" to annotate cells, which can then be used to provide extension supported styling of individual cells (for example, colouring "activity" tagged cells with a blue background to distinguish them from the rest of the content) or to modify cell behaviour in other ways.

JupyterHub: JupyterHub is a multi-user server providing authentication, access to persistent user storage, and a multi-user experience. Logged in users can be presented with a range of available environments associated with their user account. The JupyterHub server is responsible for launching individual notebook servers on demand and providing tools for users to manage their environment as well as tools for administrators to manage all users registered on the hub. JupyterHub can launch environments using remote cloud-hosted servers in an elastic (on-demand and responsive) way.

Jupyter server: a Jupyter server or Jupyter notebook server is a server that connects a Jupyter managed computational environment to a Jupyter client (for example, the Jupyter notebook or JupyterLab user interface, or the VS Code IDE).

Jupyter kernel: a Jupyter kernel is a code execution environment managed by Jupyter protocols that can execute code requests from a Jupyter notebook environment or IDE and return a code output to the notebook. Jupyter kernels are available for a wide variety of programming languages.

Integrated Development Environment / IDE: a software application providing code editing and debugging tools. IDEs such as Microsoft’s VS Code also provide support for editing and previewing markdown content (as well as generated content; see for example VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation) and showing differences between file versions (see for example Sensible Diff-ing of Jupyter Notebook ipynb Documents Using VS Code).

BinderHub: BinderHub is an on-demand server capable of building and launching temporary / ephemeral environments constructed from configuration files and content contained in an online repository (eg Github or a DOI accessed repository). By default, BinderHub will build a Jupyter notebook environment with preinstalled packages defined as requirements in a specified Github repository, populated with the notebooks contained in the repository.

MyBinder: MyBinder is a freely available community service that launches temporary/ephemeral interactive environments from public repositories using donated cloud server resources.

ipywidgets: the ipywidgets Python package provides a set of interactive HTML widgets that can synchronise settings across interactive Javascript applications rendered in a web browser with the state of Python programmes running inside a Jupyter computational environment. ipywidgets also provides a toolkit for easily generating end user application interfaces / widgets inside a Jupyter notebook that can interact with Python programme code defined in the same notebook.
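
For example, a minimal sketch of the sort of interaction ipywidgets supports (the function and slider range here are arbitrary):

import ipywidgets as widgets

# interact() generates a slider for the numeric keyword argument
# and re-runs the function whenever the slider value changes
def square(n=5):
    return n * n

widgets.interact(square, n=(0, 10))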

Core package: for the purposes of this document, a core package is one that is managed under the official jupyter namespace under the Jupyter project governance process.

Contributed package: for the purposes of this document, a contributed package is one that is maintained outside of the official Jupyter project namespace and governance process by independent contributors but complements or extends the core Jupyter packages. Many "official" (which is to say core) packages started life as contributed packages.

Jupytext: Jupytext is a contributed package that supports the conversion of Jupyter notebook .ipynb files to/from other text representations (structured markdown files, Python or Rmd (R markdown) code files). A server extension allows markdown and code documents opened from within a Jupyter environment to be edited within the Jupyter environment. Jupytext also synchronises multiple formats of the same notebook, such as an .ipynb notebook document with populated code output cells and a simple markdown document that represents just the markdown and code input cells.
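
A minimal sketch of the Jupytext Python API (there is also a command line tool); the filenames here are arbitrary:

import jupytext

# Read an .ipynb notebook and write out a markdown representation of it
nb = jupytext.read('notebook.ipynb')
jupytext.write(nb, 'notebook.md', fmt='md')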

JupyterLite: JupyterLite is a contributed package that removes the need for a separately hosted Jupyter server. Instead, a simple web server can deploy a JupyterLite distribution which provides a JupyterLab or RetroLab user environment that can execute code against a computational environment that runs purely in the web page/web browser using a WASM compiled Jupyter kernel. With JupyterLite, the user can run a Jupyter environment without the need to install any software other than a web browser and without the need to have a web connection once the environment is loaded in the browser.

Github: Github is an online collaborative development environment owned and operated by Microsoft. Online code repositories provide version controlled file archives that can be accessed individually or by multiple team members. As well as providing a git managed repository with all that involves (the ability to inspect different versions of checked in files, the ability to manage various code branches, management tools for accepting pull requests), Github also provides a wide range of project management and coordination tools: project boards, issue management, discussion forums, code commit comments, wikis, automation.

git: git is a version control system that tracks changes to files over separate "commits" (i.e. saved versions of a file). Originally designed as a command line tool, several graphical UI applications (for example, Github Desktop and Sourcetree) and IDE integrations make it easier to manage git repositories locally, as well as synchronising local code repositories with online ones. Many IDEs also integrate git support natively (VS Code, RStudio) as well as providing extended support through additional extensions (for example, the VS Code GitLens extension). Notably, the VS Code environment provides a rich differencing display for Jupyter notebooks.

ThebeLab: ThebeLab is a contributed package that provides a set of Javascript functions that support remote code execution from an HTML web page. Using ThebeLab, code contained in HTML code cells can be edited and executed against a remote Jupyter kernel that is either hosted by a Jupyter notebook server or launched responsively via MyBinder or another BinderHub server.

Jupyter Book: Jupyter Book is a contributed tool for generating an interactive HTML style textbook from a set of markdown documents or Jupyter notebooks using the Sphinx document processing toolchain. Documents can also be rendered into other formats such as e-book formats or PDF. Notebooks can be executed to include code outputs or rendered without code execution. Notebook cell tags can be used to hide (or remove) unwanted code cell inputs or outputs, as well as to style particular cells. Inline interactive code execution is also possible using ThebeLab, although in-browser code execution using JupyterLite is not supported. Interactive notebooks can also be launched from Jupyter Books using MyBinder or opened directly in a linked Jupyter notebook server environment. Jupyter Book builds on several community contributed tools managed as part of the Executable Books project for rendering rich and comprehensively styled content from source markdown and notebook documents. Jupyter Book represents the closest thing to an official rich publication route from notebook content.

Sphinx: Sphinx is a publishing toolchain originally created to support the generation of Python code documentation. Sphinx can render documents in a wide variety of formats including HTML, ebooks, LaTeX and PDF. A wide range of plugins and extensions exist to support formatting and structuring of documentation, including the generation of tables of contents, managing references, handling code syntax highlighting and providing code copying tools.

nbsphinx: nbsphinx is a contributed Sphinx extension for parsing and executing Jupyter notebook .ipynb files. nbsphinx thus represents a simple publishing extension to Sphinx for rendering Jupyter notebooks, compared to Jupyter Book, which provides a complete framework for publishing rich interactive content as part of a Jupyter workflow.

Docker: Docker is a containerisation (lightweight virtualisation) technology used to deploy virtualised environments on a user’s own computer or via a remote server. A JupyterHub server can be used to manage the deployment of Docker environments running individual Jupyter user environments on remote, scaleable servers.

Docker image / Docker container image: a Docker environment is downloaded as an image file. An actual instance of a Docker environment is generated from a Docker image. Public Docker images are hosted in a Docker registry such as DockerHub, from where they can be downloaded by a Docker client.

Docker container: a Docker container is an instantiated version of a Docker image. A Docker container can be used to deploy a Jupyter notebook server and the Jupyter environments exposed by the server. Just like a "real" computer, Docker containers can also be hibernated / resumed or restarted. A pristine version of the environment can be created by destroying a container and then creating a brand new one from the original Docker container image.

DockerHub: DockerHub is a hosted Docker image registry that hosts public Docker images that can be downloaded and used by Docker applications running locally or on a cloud server. Github also publishes a Docker container registry. In addition, organisations and individuals can self-host a registry. Private image registries are also possible, allowing only authenticated users or clients to search for and download particular images.

Python: Python is a general purpose programming language that is widely used in OU modules. A Python environment can be distributed via the Anaconda scientific Python distribution or inside a Docker container.

Anaconda: Anaconda is a scientific Python distribution that bundles the basic Python environment with a wide range of preinstalled scientific Python packages. In many instances, the Anaconda distribution will include all the packages required in order to perform a set of required scientific computing tasks. Anaconda can be installed directly onto the user’s desktop or used inside a Docker container to provide a Python environment inside such a virtualised environment. The appropriateness of using Anaconda as a distribution environment in a distance education context is contested.

IPython: IPython (interactive Python) provides an interactive "REPL" (read, evaluate, print, loop) environment supporting interactive code execution and code output display. In a Python based Jupyter environment, it is actually IPython that supports the interactive code execution.

R: R is a programming language designed to support statistical analysis and the creation of high quality, data driven scientific charts and graphs. R is used in several OU modules.

Javascript: Javascript is a widely used general purpose programming language. Javascript is also available inside a web browser. Standalone interactive web pages or web applications are typically built from Javascript code that runs inside the web page/web browser. Such applications can often continue to work even in the absence of a network connection.

WASM: WASM (or WebAssembly) is a virtualised programming environment that can run inside a web browser. The JupyterLite package uses WASM to provide an in-browser computational environment for Jupyter environments that allows notebooks to execute Python code cells purely within the browser.

Markdown: Markdown is a simple text markup language that allows you to use simple conventions to indicate style (for example, wrapping a word in asterisks to indicate emphasis, or using a dash at the start of a line to indicate a list item or bullet point). Markdown is typically converted to HTML and then rendered in a browser as a styled document. Many Markdown editors, including Jupyter notebooks and IDEs such as VS Code, provide live, styled previews of raw markdown content within the application.
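
A minimal sketch of the conversion step, assuming the Python markdown package as the converter:

import markdown

# Convert a markdown string to an HTML fragment
html = markdown.markdown('Some *emphasised* text in a **styled** document.')
print(html)  # <p>Some <em>emphasised</em> text in a <strong>styled</strong> document.</p>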

HTML: HTML (hypertext markup language) is a markup language used to mark up text documents with simple structure and style. Web browsers typically render HTML documents as styled web pages. The actual styling (colour selection, font selection) is typically managed using CSS (cascading style sheets), which can change the look and feel of the page without having to change the underlying HTML. (When a theme is changed on a web page, for example, dark mode, a different set of CSS settings is used to render the page whilst the HTML remains unchanged.)

CSS: CSS (cascading style sheets) control the particular visual styles used to render HTML content. Changing the CSS changes the visual rendering of a particular HTML webpage without having to change the underlying structural HTML.

nbgrader: nbgrader is a core Jupyter package providing a range of tools for managing the creation, release, collection, and automated and manual marking of Jupyter notebooks.

Version Control: version control is a technique for tracking changes in one or more documents over time. Changes to individual documents may be uniquely tracked with different document versions (for example, imagine looking at "tracked changes" between two versions of the same document), and collections of versioned documents can themselves be versioned and tracked (for example, a set of documents that make up the documents released to students in a particular presentation of a particular module). In a distributed version control system such as git, mechanisms exist that allow multiple authors or editors to work on their own copies of the same documents at the same time, and then alert each other to the changes they have made to the documents and allow them to merge in changes made by other authors/editors. If two people have changed the same piece of content in different ways at the same time, a so-called merge conflict will be generated that identifies the clash and allows a decision to be made as to which change is accepted.

Merge conflict: a merge conflict arises in a collaborative, distributed version control system when conflicting changes are made to the same part of a particular file by different people, or when one person works on or makes changes to a file that another has independently deleted. Resolving the merge conflict means deciding which set of updates you actually want to accept into the modified document.
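
In git, for example, the conflicting region is marked up in the file itself with conflict markers, along the lines of:

<<<<<<< HEAD
the version of the line on your current branch
=======
the version of the line being merged in
>>>>>>> other-branch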

Github Issue: a Github Issue is a single issue comment thread used to discuss a particular issue such as a specific bug, error or feature request. Issues can be tagged with particular users and/or topics. Github Issues are associated with a particular code repository. "Open issues" are ones that are still to be addressed; once resolved, they are then "closed" providing an archived history of matters arising and how they were addressed. When files are committed to the repository, the commit message may be used to associate the commit (i.e. the changes made to particular files) with a particular issue, and even automatically close the issue if the commit resolves that issue.

Github Discussion: a Github Discussion is a threaded forum associated with a particular repository that allows for more open ended discussions than might be appropriate in an issue.

Github/git commit: a git or Github commit represents a check-in of a particular set of changes to one or more documents. Each commit has a unique reference value allowing you to review just the changes made as part of that commit compared to either the previous version of those documents, or another version of those documents. Making commits at a low level of granularity means that very particular changes can be tracked and if necessary rolled back. A commit message allows a brief summary of the changes made in the commit to be associated with it; this is useful for review purposes and in distributed multi-user settings to communicate what changes have been made (a longer description message may also be attached to each commit). Identifying an appropriate level of granularity for commits is one of the challenges in establishing a good workflow, not least because of the overhead associated with adding a commit message to each commit.

Github/git pull request (PR): a git or Github pull request (PR) represents a request that a set of committed changes is accepted from one branch into another branch of a git repository. Automated checks and tests can be run whenever a PR is made; if they do not pass, the person making the PR is alerted to the fact and invited to address the issue. Merging commits from a PR may be blocked until all tests pass. PRs may also be blocked until the PR has received a review by one or more named individuals.

Automation: automation is the use of automatically triggered events or manually issued commands to run scripted tasks. Automation can be used to run a spell-checker over a set of files whenever they are updated, automatically check and style code syntax, or automatically execute and test code. Automation can also be used to automate the building of Docker images, or the rendering and publishing of interactive textbooks. Automation could be used to automate the production of material distributions and releases and then publish them to a desired location (such as the location pointed to by a VLE download link).

Autonomation: autonomation (not commonly used in a computing context) is a term taken from lean manufacturing that refers to "automation with a human touch". In the case of a Jupyter production system, this might include the running of automated tests (such as spell checkers) that prevent documents being committed to a repository if they contain a spelling mistake. The main idea is that errors should not propagate but should be fixed immediately at source. The automation identifies the issue and prevents it being propagated forward; a human fixes the issue and then reruns the automated tests. If they pass, the work is then automatically passed forwards.

Github Action: a Github Action forms part of an automation framework for Github. Github Actions can be triggered to run checks and tests in response to particular events such as code commits, PRs or releases, as well as to manual triggers. Github Actions can also be used to render source documents to create distributions as well as publishing distributions to particular locations (for example, creating a Docker image and pushing it to DockerHub, generating a Jupyter Book interactive textbook and publishing it via Github Pages, etc.). A wide range of off-the-shelf Github Actions are available.

git commit hook: a git commit hook is a trigger for automation scripts that are run whenever a git commit is issued. The script runs against the committed files and may augment the commit (for example, automatically checking and correcting code style / layout and adding the style corrections as part of the commit process, or using Jupytext to automatically create a paired markdown document for a committed .ipynb notebook document, or vice versa).
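
A minimal sketch of such a hook, written as an executable .git/hooks/pre-commit script (the codespell spell checker is just an example, and is assumed to be installed):

#!/usr/bin/env python3
import subprocess
import sys

# Run a spell checker over the repository; a non-zero exit aborts the commit
result = subprocess.run(['codespell', '.'])
if result.returncode != 0:
    print('Spelling errors found; commit aborted.')
    sys.exit(1)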

pre-commit: pre-commit is a general purpose contributed framework for creating git precommit scripts. A wide range of off-the-shelf pre-commit actions are defined for performing particular tasks.

Rendering: rendering a file refers to the generation of a styled "output" version of a document from a source format. For example, a markdown document may be rendered as a styled HTML document.

Generative document: a generative document is a document that includes executable source code. The source code provides a complete set of instructions for generating media assets as the source document is rendered into a distribution document.

Generative rendering: a generative document is rendered as a styled document containing media assets that are created by executing some form of source code within the source document as part of the rendering process.

Generated asset: a generated asset is a media asset that has been generated from a source code representation as part of the rendering process. Updates to the media asset (for example, text labels or positioning in a diagram) are made by making changes to the source code and then re-rendering it, not by editing the asset directly.

Distribution: a distribution represents a complete set of version controlled files that could be distributed to an end user. In a content creation context, a distribution might take the form of a complete set of notebooks, a complete set of HTML files, or a set of rendered PDF documents. A distribution might be used as a formal handover in a regimented linear workflow process or as the basis of a set of files released to students. A uniquely identifying hash value can be used to identify each distribution and track exactly which version of each individual file is included in a particular distribution.

Release: a release is a version controlled distribution that can be distributed to end users such as a particular cohort of students on a particular module. A release will be given an explicit version number that should include the module code, year and month of presentation as well as lesser version (edition) numbers that help track releases integrating minor updates etc.

Source / Source files: the source files are the set of files from which a distribution is rendered. The source files might include structural metadata, comments that are stripped from the source document and do not appear in the rendered document, and even code that is executed to produce generated assets that form part of the distribution, even if the source code itself does not.

Fragment: Tools of Production – ggalt and encircling scatterplot points in R and Python

In passing, I note ggalt, an R package containing some handy off-the-shelf geoms for use with ggplot2.

Using geom_encircle() you can trivially encircle a set of points, which could be really handy when demonstrating / highlighting groupings of points in a scatterplot:

See the end of this post for a recipe for creating a similar effect in Python.

You can also encircle and fill by group:

A lollipop chart. The geom_lollipop() geom provides a clean alternative to the bar chart (although with a possible loss of resolution around the actual value being indicated):

A dumbbell chart provides a really handy way of comparing differences between pairs of values. Enter, the geom_dumbbell():

The geom_dumbbell() will also do dodging of duplicate treatment values, which could be really useful:

The geom_xspline() geom provides a good range of controls for generating splines drawn relative to control points: “for each control point, the line may pass through (interpolate) the control point or it may only approach (approximate) the control point”.

The geom_encircle() idea is really handy for annotating charts. I don’t think there’s a native Python seaborn method for this, but there is a hack to do it (via this StackOverflow answer) using the scipy.spatial.ConvexHull() function:

# Via: https://stackoverflow.com/a/44577682

import matplotlib.pyplot as plt
import numpy as np; np.random.seed(1)
from scipy.spatial import ConvexHull

x1, y1 = np.random.normal(loc=5, scale=2, size=(2,15))
x2, y2 = np.random.normal(loc=8, scale=2.5, size=(2,13))

plt.scatter(x1, y1)
plt.scatter(x2, y2)

def encircle(x,y, ax=None, **kw):
    if not ax: ax=plt.gca()
    p = np.c_[x,y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices,:], **kw)
    ax.add_patch(poly)

encircle(x1, y1, ec="k", fc="gold", alpha=0.2)
encircle(x2, y2, ec="orange", fc="none")

plt.show()

It would be handy to add a buffer / margin region so that the line encircles the points rather than passing through the outermost (hull) points. From this handy post on Drawing Boundaries in Python, one way of doing this is to cast the points defining the convex hull to a shapely shape (eg using boundary = shapely.geometry.MultiLineString(edge_points)) and then buffer it using a shapely shape buffer (boundary.buffer(1)). Alternatively, if the points are cast as shapely points using MultiPoint, then shapely also has a convex_hull attribute that returns an object that can be buffered directly.
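
A minimal sketch of the second approach:

from shapely.geometry import MultiPoint
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
x, y = np.random.normal(loc=5, scale=2, size=(2, 15))

# Buffer the convex hull of the points to add a margin around them
buffered_hull = MultiPoint(list(zip(x, y))).convex_hull.buffer(1)
xs, ys = buffered_hull.exterior.xy

plt.scatter(x, y)
plt.plot(xs, ys)
plt.show()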

Sensible Diff-ing of Jupyter Notebook ipynb Documents Using VS Code

One of the major ARRGHHHs with working with Jupyter notebooks in a git / Github context is that if you are in the habit of checking in notebook ipynb files, particularly notebook ipynb files with code cells run, the differencing experience sucks…

And if the metadata changes you can get loads of diffs to skim through…

There are tools available for viewing differences between rendered Jupyter notebooks (for example, the nbdime Jupyter extension or the reviewnb Github application) but to my knowledge these remain underexplored in an internal context, and despite their availability for many years, I don’t see them talked about that much (maybe everyone uses them silently?).

Anyway, recent updates to the VS Code editor provide a huge leap forward, with the off-the-shelf inclusion of a sensible diff viewer in the Jupyter extension (VS Code – Custom notebook diffing).

Git diff of Jupyter notebooks in VS Code

The differencing gives you diff views at the cell input, metadata and output levels, as required. Where code cell outputs are images, you have the images presented side by side.

With the addition of the VS Code GitLens extension (via this SO answer), you can trivially compare files across different branches of a repo via the Search and Compare / Compare References… option:

Just choose the branch you’re interested in, and the branch you want to compare to:

And the comparison is launched and the side by side view rendered in the main panel:

You can also connect to a remote or containerised kernel from the command palette:

by specifying the URL (and any necessary auth token) for the Jupyter server you want to connect to:
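
The URL typically takes a form along the lines of http://localhost:8888/?token=<token>, with the token printed to the terminal when the Jupyter server starts up.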

A little bit of me suspects this might not actually work when trying to connect to an institutional server hidden behind institutional auth because, obvs, IT security policies are all designed to prevent folk accessing internal compute…!;-)

That said, it is possible to run VS Code in a browser via a Jupyter server proxy using code-server, so that represents another possible solution for institutionally hosted environments (run VS Code and Jupyter notebooks on the same server and diff using VS Code, or even a pre-installed and pre-enabled nbdime).

Anyway, the DIFFing is really exciting… Now let’s see if I actually start using it!

So the Olympics Gets to Block BBC Broadcasts?

Noting that a canned message is currently being played out 24/7 on BBC local radio channels accessed via internet radio (on our Pure radio at least) that the service is not currently available…

Seems like it’s a rights issue…

Dear customer,

We would like to inform you that there is some temporary disruption to local BBC Radio stations broadcast on internet radio, due to the Olympic games. This disruption is due to rights issues and is unfortunately outside of our control.

The Tokyo 2020 Olympics are set to finish on Sunday 8 August. After this date, BBC radio stations will stream as normal for both international listeners, and UK listeners on affected devices.

You can read more about this over on the BBC website by clicking here. Thank you for your understanding.

Note on Pure radio website

And the BBC announcement?

Why am I currently unable to listen to local radio stations?

During the Tokyo 2020 Olympics, local BBC radio stations will not be available online to listeners located outside the UK. National stations will be available, however some programmes, or segments of programmes, might be unavailable during this time. This is due to rights reasons.

This may also impact some UK listeners if the device accessing the stream uses Shoutcast. Shoutcast has one stream covering both the UK and overseas, so these will be unavailable at this time.

If your device is affected, you’ll hear a looping message advising that the stream isn’t available at the moment. If you’re within the UK and unable to listen, please check the following FAQ for info on how to listen on a different device in the meantime: How do I listen live?

The Tokyo 2020 Olympics started on Friday 23 July and finish on Sunday 8 August. After this date, BBC radio stations will stream as normal for both international listeners, and UK listeners on affected devices.

BBC Sounds | Update *** Friday 23 July ***, https://www.bbc.co.uk/sounds/help/issues/bbc-sounds/local-radio-olympics-2020

Visualising regex / Regular Expressions Using a Syntax / Railroad Diagram

Ish via Simon Willison, in a blog post describing some recent updates to sqlite-utils that bring sqlite-utils/datasette a step closer to doing what OpenRefine does (Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool), I note debuggex.com, a tool for visualising regular expressions as syntax diagrams (albeit not open source, so no local running…).
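
By way of example, here is the sort of pattern such a tool renders as a railroad diagram (an illustrative, ISO-ish date matcher):

import re

# Year, then month 01-12, then day 01-31
iso_date = re.compile(r'\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])')
print(bool(iso_date.fullmatch('2021-08-06')))  # True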

Editing Jupyter Notebooks Mechanically

In passing, I note that because of REDACTED last year, we had to edit a load of teaching materials to account for inconsistent ways of connecting to provided database servers across local and hosted environments (I’ve argued for years that provided environments should always be consistent wherever they are accessed from, but go figure), which means that for 21J, where we will have the same container running on local and hosted environments, we need to revert all those changes. (I wonder, if the changes had been made in a particular Github branch and PR’d from that branch, whether we could just reopen that branch, revert the changes, and submit another PR? I really do need to get better at git…)

Example of a cell to be removed.

The changes impact 30 or so notebooks, and might typically involve an editor making the changes. But we don’t let editors near our notebooks in the data management course, so it’s down to the module team to undo the changes.

Skimming over the changes we need to make, it looks to me like the advice we need to update was boilerplate text, which means it should be the same across notebooks (there were two sets of changes required: connection to a mongo database, and connection to a postgres database).

This in turn suggests automation should be possible. So here’s a first attempt:

import nbformat
from pathlib import Path

def fix_cells(cell_type, str_start, path='.',
              replace_match=None, replace_with=None,
              convert_to=None, overwrite=True,
              version=nbformat.NO_CONVERT,
              ignore_files = None,
              verbose=False):
    """Remove cells of a particular type starting with a particular string.
       Optionally replace cell contents.
       Optionally convert cell type.
    """

    # Cell types
    cell_types = ['markdown', 'code', 'raw']
    if cell_type and cell_type not in cell_types:
        raise ValueError('Error: cell_type not recognised')
        
    if convert_to and convert_to not in cell_types:
        raise ValueError('Error: convert_to cell type not recognised')

    # Iterate path
    nb_dir = Path(path)
    for p in nb_dir.rglob("*"): #nb_dir.iterdir():
        if ignore_files and p.name in ignore_files:
            continue
        if '.ipynb_checkpoints' in p.parts:
            continue
        
        if p.is_file() and p.suffix == '.ipynb':
            updated = False
            if verbose:
                print(f"Checking {p}")

            # Read notebook
            with p.open('r') as f:
                # parse notebook
                #nb = nbformat.read(f, as_version=nbformat.NO_CONVERT)
                #nb = nbformat.convert(nb, version)
                #opinionated
                try:
                    nb = nbformat.read(f, as_version=version)
                except Exception:
                    print(f"Failed to open: {p}")
                    continue
                deletion_list = []
                for i, cell in enumerate(nb['cells']):
                    if cell["cell_type"]==cell_type and nb['cells'][i]["source"].startswith(str_start):
                        if replace_with is None and not convert_to:
                            deletion_list.append(i)
                        elif replace_with is not None:
                            if replace_match:
                                nb['cells'][i]["source"] = nb['cells'][i]["source"].replace(replace_match, replace_with)
                                updated = True
                            else:
                                nb['cells'][i]["source"] = replace_with
                                updated = True
                        if convert_to:
                            if convert_to=='code':
                                new_cell = nbformat.v4.new_code_cell(nb['cells'][i]["source"])
                                nb['cells'][i] = new_cell
                            elif convert_to=='markdown':
                                new_cell = nbformat.v4.new_markdown_cell(nb['cells'][i]["source"])
                                nb['cells'][i] = new_cell
                            elif convert_to=='raw':
                                new_cell = nbformat.v4.new_raw_cell(nb['cells'][i]["source"])
                                nb['cells'][i] = new_cell           
                            else:
                                pass
                            updated = True

                # Delete unrequired cells
                if deletion_list:
                    updated = True
                nb['cells'] = [c for i, c in enumerate(nb['cells']) if i not in deletion_list]

                if updated:
                    # Validate - exception if we fail
                    #nbformat.validate(nb)

                    # Create output filename
                    out_path = p if overwrite else p.with_name(f'{p.stem}__patched{p.suffix}')

                    # Save notebook
                    print(f"Updating: {p}")
                    with out_path.open('w') as f_out:
                        nbformat.write(nb, f_out, nbformat.NO_CONVERT)

Usage for deleting cells takes the form:

# delete cells
ignore_files=['21J DB repair.ipynb']

str_start = '# If you are using the remote environment, change this cell'
fix_cells('raw', str_start, ignore_files=ignore_files)

And for updating the contents of cells and/or changing their cell type:

str_start = "# If you are using a locally hosted environment, change this cell"
replace_match ="""# If you are using a locally hosted environment, change this cell
# type to "code", and execute it

"""
replace_with = ''
convert_to = 'code'
fix_cells('raw', str_start, convert_to='code',
          replace_match=replace_match, replace_with=replace_with, ignore_files=ignore_files)

Better pattern matching around identifying cells, and perhaps navigating directory paths, is obviously required.

In passing, I note that if we had tagged the boilerplate cells or added other metadata to them identifying them as relating to db setup, we could have processed the notebooks based on tags. For future notebooks, I think we should start to consider adding identifying tags to distinct boilerplate items so that we can, if necessary, more easily modify / update them.
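
For example, a minimal sketch of tag based cell selection (the db-setup tag name is hypothetical):

import nbformat

nb = nbformat.read('notebook.ipynb', as_version=4)

# Select cells carrying a particular identifying tag,
# rather than matching on a leading boilerplate string
db_cells = [cell for cell in nb.cells
            if 'db-setup' in cell.get('metadata', {}).get('tags', [])]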

I also note that if I suggest this to the “Jupyter Notebook Production Working Group” (which I wasn’t invited to join, obvs), I’m guessing they’d say this is “too technical” and recommend the manual approach of opening and editing each notebook by hand…! And I doubt they’d be able to comment on any potential git revert strategy;-)

PS see also @choldgraf’s nbclean package which includes a tool to “replace text in cells with new text of your choosing”.

Fragment: “Try it and See” Interactive Learning Activities in Jupyter Notebooks

Getting on for 6 years ago now, when we were in the initial round of production for the data management course and I was hacking away at notebook customisations to support the delivery of notebooks as teaching and learning materials, I put together an extension that came to be known as empinken that provided toolbar buttons for colouring certain activity cells. (The extension disappeared for a couple of years following a notebook update that broke things, but it came back a couple of years ago, and I keep tinkering with it on and off to change how it works and what it does. So it’s now tag based; and it also has a green (success) button, added following a request from a module team in production.)

Empinken toolbar buttons (warning, activity, student contribution, solution)

The colours and buttons were also made configurable in another recent release (control panel available via the nbextensions configurator):

The cell colouring / style was reminiscent of activity styling in our online VLE materials, so it made sense to try to carry that idea over to the notebooks. The VLE materials also made use of a reveal, allowing the student to read the question, do the work, then look at the answer. In our early notebooks, I advocated the use of a reveal to provide inline answers, but several other authors preferred to put activity answers into another notebook, believing that that level of friction was required to stop students always just cribbing the answer before putting the learning effort and/or practice in. I thought they were wrong then and I still think they’re wrong now.

The reveal we explored originally was provided by a third party extension that provided a button to click to display the answer. However, we soon moved to another off-the-shelf extension that provided collapsible headers, and that’s still the method we use now. (Another module team has recently expressed a preference for a “click here to show the answer” button, so I intend to revisit that approach when I get a chance.)

Last year, whilst creating new notebooks on introductory robotics, I started using a secondary colour as a call to action: in the yellow activity cells, students are expected to do things. This might be writing some code:

It might mean adding additional cells:

Or it might be an invitation to write free text:

Here’s another club sandwich style activity that attempts to prompt some sort of reflective behaviour at the end of a notebook:

One of the problems with this activity design is that it’s quite heavy. So whilst I’ve been updating notebooks for October/November use (I also have an August 20th deadline for other notebooks that don’t have a student use date till well into 2022, so that’s really not gonna happen…) I’ve started sketching a new lite activity pattern that I’m calling “try it and see”:

The learning thing should perhaps be blue (activity) but we also have a convention around the bootstrap styled content of green being informational/advisory (fact), yellow being “you might like to try this” and pink being “DANGER”. And this feels facty to me…

One issue with this pattern is that there is the risk that students will try things wrong and save the notebooks with their brokenness. Which makes me wonder about having a ‘resettable’ read-only copy of the code in metadata that could be used to reset the cell from a toolbar button click. But ideally we’d then need some device on the cell to show that it is resettable.
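
A minimal sketch of how the pristine copy might be stashed (the metadata key is hypothetical, and the reset button itself would need extension support):

import nbformat

nb = nbformat.read('notebook.ipynb', as_version=4)
for cell in nb.cells:
    if cell.cell_type == 'code':
        # A toolbar extension could restore cell.source from this stashed copy
        cell.metadata.setdefault('pristine_source', cell.source)
nbformat.write(nb, 'notebook.ipynb')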

Alternatively, we could put the example code in the green learning thing set-up box. Or, and I think this is a much neater solution, we stash the working example as a comment that duplicates the executable example we want the student to change (the comment also essentially acts as a crib for what to change…):

Ideally, students/learners would feel confident enough to take ownership of the notebooks and change them / play with them however they want to. (The empinken buttons are on the toolbar of notebooks in the environments we provide to students, so if they do create their own cells they can colour them, if they want to show they’re theirs, by selecting the cell and clicking the appropriate toolbar button.) This is also how tutors provide feedback on marked notebooks: they tend to add feedback comments to a new cell and then click the pink empinken button to colour it and bring it to the attention of the student when they get the script back (the extension was named empinken by the tutors…).

Animating GraphViz dot files: Parsing Parliamentary Processes, Sort Of…

[Re: the original title of, and my description of, the graph visualisations in this post, Michael suggests I don’t really have a clue what they represent. He’s not wrong… ]

A couple of reminders of things past in my feeds yesterday: first up, a reminder of the Workbench reproducible data scraping and cleaning (and built in tutorial…) application in a really useful review by Jon Udell: A beautiful power tool to scrape, clean, and combine data. It’s moved on a fair bit in terms of complexity since I first noted it following a prompt from Owen Stephens several years ago (Linear Cell Based Data Cleaning and Analysis Workflows With Workbench). At the time, as I remember it, it let you use pandas code and SQLite databases as part of the workflow, which also led me to tweet at some point about how it would be interesting to see a pyolite / WASM powered in-browser version, but the latest code repo seems to show all manner of complicated Kubernetes stuff required for deployment (and no simple docker-compose route that I can see), so maybe it’s too complex for a simple WASM backend now (though reverting to an earlier release might still work for a POC?).

Something I’m noticing more is that code projects are making Kubernetes deployment recipes available. This is great for scaling, but a barrier to folk who just want to try things out or use them at small scale. Maybe in a year or two, folk will commonly be running K8s on the desktop, but I still find it a faff and a hassle and another blocker to sharing things with others. It’d be so much easier to just be able to run docker-compose up.

Secondly, via a tweet from Michael/@fantasticlife, a query about using mapio/GraphvizAnim, “a tool to create simple animated graph visualizations [from GraphViz graphs] aimed mainly at teaching purposes”. It seems like I’d spotted this before, but offhand I don’t recall using it for anything, so now seemed like a good opportunity to give it a spin…

For some time, Michael, Anya et al have been attempting to teach parliamentary procedure to machines, a frankly bewildering project into Gormenghastian rituals of the UK Parliament that I’ve been following on and off via the best weeknotes series ever (if you ever want to tune into a never ending tale of everyday parliamentary geeks, I heartily recommend it. It’s funny too…).

It seems that all manner of things can, and do, happen as part of everyday parliamentary processes and the models are… see for yourself.

Anyway, Michael had a thing — https://api.parliament.uk/procedures/work-packages/9/parse/log.dot [parse code here]— so I grabbed a coffee (how I do miss interesting folk asking interesting “any idea how to…?” off-the-cuff things… Tinkering’s when I’m at my happiest!) and had a play. Here it is: https://github.com/ouseful-demos/GraphvizAnim

Here’s what it does, based on a quick hack around the original GraphVizAnim demo [MyBinder demo]:

Create an animation object, add nodes and edges to it with an animation step between each, and you can create a simple slider powered walk through the construction of the graph. In its current form, it’s perhaps not overly useful, but it provides a starting point for something practical to talk around to work out what might be useful.
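
Based on the GraphvizAnim demo, the basic pattern looks something like this (the node labels here are illustrative):

from gvanim import Animation

ga = Animation()
ga.add_node('bill')
ga.next_step()  # record an animation frame
ga.add_node('first reading')
ga.add_edge('bill', 'first reading')
ga.next_step()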

As an aside, this is exactly another of those things that I think folk should be able to do when we say everybody should learn to code… Maybe 15 mins worth of effort from a cold start but an example to work from?

Another 2 mins just now from cribbing another demo (I spotted an animated gif in the repo and wondered what had generated it…), and we can create an animated gif…

# Create an image frame for each step...
from gvanim import Animation, render, gif

graphs = ga.graphs()
files = render( graphs, 'process', 'png' )

# Then create an animated gif from the frames
gif( files, 'process', 50 )

# And display the image
from IPython.display import Image
Image('process.gif')

Here’s the result (again, not brilliant, but something to work from):

Animated gif of a Parliamentary process parse tree generated using GraphvizAnim

Now I’m wondering… is there a narrated form of one of these steps anywhere (or even just a written narration of a couple of the steps?) that we could use as a crib for generating a “dot2text” templater that would take a dot file and generate a human readable version of it (to the extent that you can: a) narrate the steps in a graph with closed loops in a sensible (readable) way; b) make sense of parliamentary jargon anyway….?)