Notes in Advance of a Meeting About The Possibility of Getting an Institutional Jupyter Server Up and Running

Notes on Jupyter Deployments in an OU Context.

Notes made in advance of an internal workshop to discuss supporting “Jupyter notebooks” in the OU.

My intention is to split the content over several documents.

Needless to say, I think “notebooks” are both: a) not really the point; b) offer way more potential for doing all sorts of things than folk might think.

pandoc -o output.docx -f markdown -t docx filename.md

The content is split into three main sections:

  • Current Jupyter Deployments in the OU
  • Architectural Models
  • Use Cases for Jupyter Services

Current Jupyter Deployments in the OU

Jupyter notebooks are already used in several OU courses. The following summarises what I’ve managed to learn, or imagine may be the case. Folk involved with the respective modules may well disagree.

TM351

TM351 is a 30 point, third level module on data management and analysis. Approximately 50% of study is spent working on practical activities delivered via Jupyter notebooks.
Essential requirements:

  • notebook server that allows students to complete notebook based activities in a Python environment;
  • ability to create and run new Python backed notebooks;
  • pre-configured Python computational environment with preinstalled packages (Python and Linux package dependencies);
  • access to PostgreSQL server with permissions to add and delete users, roles, databases, tables; read and write to tables from notebooks;
  • access to MongoDB server with read permissions on a seeded database; permissions to create / read / write / delete databases and collections from notebooks;
  • access to OpenRefine application;
  • access to public internet addresses in order to download data files from arbitrary URLs;
  • ability to persist notebooks;
  • ability to access all service GUIs through a web browser;
  • ability to work on a student’s own computer in an offline mode;
  • ability to work cross-platform (Windows, Mac, Linux).
    Desirable Requirements:

  • ability to backup and restore databases;

  • ability to take away the complete computing environment so it can be run ex- of the OU;
  • ability to access the environment on a remote host (eg an OU hosted solution); (this solution would in turn create requirements based on scaleability, affordability, peak load, resource (processor, memory, storage, bandwidth), uptime etc.)
  • ability to access the environment from a terminal / command line;
  • side effect free on user desktop (i.e. the environment should not clash with any services already running on the student’s computer; the environment should not require changes to the student’s computer; any applications installed should be capable of being uninstalled cleanly).

Optional Requirements:

  • headless operation (access to desktop applications inside the provided environment is not required);
    Solution (16B, 16J-19J):

A VirtualBox virtual machine (VM) managed using vagrant provides a self-contained, preconfigured environment running all required applications. User files are mounted into the VM from the user desktop and saved out of the VM back onto the desktop.

Proposed Solution (20J+):

A Docker container (rather than VirtualBox VM) built using repo2docker, runnable via Docker Desktop / ContainDS on a student computer or on a remote host either as a standalone service or via JupyterHub.

Requirements not met:

  • side effects (VirtualBox and vagrant must both be installed; on Windows, virtualisation may need enabling, HyperV needs disabling. A fix to the latter would be to ship a VM built natively for HyperV);
  • hosted solution: an OU hosted solution is not available. DIY solutions for students to self-host on Azure, AWS, Digital Ocean are provided.

TM112

TM112 is a 30 point, first level module that in part provides an introduction to Python programming. Whilst the Python environment used for most of the activities is a simple, user-installed Python environment without embellishment, an optional “notebook experience” activity is also provided to deliver enrichment activities.
Essential requirements:

  • ability to run provided Jupyter notebooks within a single study session (no requirement to persist changed notebooks);
  • install-free / hosted solution;
  • scaleable (capable of coping with peak demand: 1k concurrent users)
  • affordable ( Use case example: a Docker container running a Jupyter server provides authenticated access to a GUI based Java application via a web browser. The Java application runs on the virtualised desktop and is proxied by the notebook server using the jupyter-desktop-server.
    > Use case example: a Jupyter server provides browser based access to a Windows desktop application running under Wine on an XFCE desktop. The application is launched from a Jupyter notebook UI and proxied via the jupyter-desktop-server extension.

JupyterHub

JupyterHub is a multi-user service that can provide authenticated access to personalised computational environments for multiple users. Each user may also be provided with their own persistent account managed by the server.

A range of authentication schemes are supported, including OAuth, LTI, login from Github etc. (TM112 uses LTI to allow students to access a single sign on authenticating JupyterHub server that launches temporary notebook servers from a Moodle VLE web page. Contact: Rod Norfor for technical details.)

The JupyterHub server provides access to a range of computational environments through environment spawners. For example, the DockerSpawner will launch a Docker container in response to a user login that runs a personalised computational environment in a Docker container.

JupyterHub can be configured to provide users with a range of alternative posssible environments, so a student could log in and be presented with options to launch different environments relating to different modules, for example.

JupyterHub can scale a service offering with increasing numbers of users using a well supported and well proven Kubernetes deployment model. (The TM112 Jupyterhub sever uses Kubernetes on Microsoft Azure to service the required number of students in a scaleable way.)

Whilst JupyterHub nominally expects to manage launched environments via a Jupyter notebook server running in the environment, a recent community contribution (jhsingle-native-proxy) allows arbitrary containerised web applications to be launched and managed from a JupyterHub server.

Architecturally, JupyterHub can launch individual Jupyter notebook servers, notebook servers then launch notebooks, and notebooks launch kernels.

Jupyter Enterprise Gateway Server

The Jupyter Enterprise Gateway Server is a middleware service, originally developed by IBM, that provides the ability to launch kernels on behalf of remote notebooks in a scaleable way (eg scaling for large numbers of users; allowing kernels to run with different amounts of computational resource (CPUs, GPUs, memory etc)).

One possible architectural model would be for a JupyterHub server to provide multi-user access to the Jupyter environment, and JupyterHub to launch kernels via the Jupyter Enterprise Gateway Server.

Alternatively, a student running their own personal Jupyter notebook server at home on a computer with limited computational resource could use an institutional Jupyter Enterprise Gateway Server to launch a kernel on a well provisioned server, eg one with a large amount of memory and a GPU.

(At the moment, I don’t think the same personal Jupyter server can launch kernels locally as well as via a Jupyter Enterprise Gateway server; I think the provisioner is one or the other.)

BinderHub

BinderHub is variant of JupyterHub that allows an unauthenticated user to launch a containerised Jupyter notebook server, on-demand, built according to a linked to specification on a remote repository.

The most common way of using BinderHub is to create a Github repository containing environment definition files as well as user files (eg Jupyter notebooks) and then use MyBinder (a free and open federated Binderhub service) to build a Docker image based on the contents of the repository. Once built, or if a cached version already exists, MyBinder spawns a Docker container from the image and serves it to the user.

Currently, BinderHub provides only a “temporary” service – the container is built, deployed, served to the user, and then destroyed at the end of the session. However, one of the Binder Federation nodes do have an experimental persistent Binderhub deployment that provides authenticated user access and a persistent user file areas. Users can launch Binder containers from their account, save files to their account, and share their own files into their launched Binder containers.

Binderhub / MyBinder is also used as an ad hoc provider of computational environments for a variety of online “interactive textbooks” and online courses. For example, Jupyter Book and “the spacy course”, as well as the LibreTexts interactive book platform. Published as HTML websites, the contents of code cells embedded (and editable) within the HTML page can be executed against a remotely launched MyBinder kernel, with the result of the computation returned to the page and displayed within it.

Several Javascript packages (thebelab.js, juniper.js) exist to “enable” the code cells in an HTML page and manage the MyBinder connection.

nbgallery

nbgallery is not an official Jupyter project but it does provide a range of interesting features that are worth exploring if we want to be open-minded about what sort of user environment we want to use to provide people to access to Jupyer notebooks.

The nbgallery application (TH review and video review) was developed by the US Department of Defense and the NSA to provide multi-user access to a wide range notebooks. The gallery provides search tools over a wide collection of notebooks and allows users to rate and review notebooks. Users can launch notebooks in a connected Jupyter environment. A healthcheck facility checks that code cells execute as expected (and if not may flag maintenance or student difficulty issues).

Exploring the use of nbgallery either as a social application, visited by all students, or as a personal application, used by a student to access notebooks in a personal environment, may turn up a way of providing access to notebooks in a way that is useful not just for linear courses, but also resource based / problem based learning courses.

Institutional Vs Local Provision

Whilst it is possible to consider Jupyter mediated environments in either the local user context (eg TM351 students using their own VMs) or the institutional context (for example, TM112 students launnching temporary notebook servers), I think there is most to be gained from considering them as two different ways of exposing students to the same computational environments.

For example, consider the following three situations:

1) TM351 students run a Virtualbox virtual machine containing multiple “personal” servers: a Jupyter notebook server, a Postgres database server; the student “owns” all services and all services are integrated.

2) TM351 students accessing a virtalised Jupyter environment for notebook access, and logging in to a shared Postgres database server. The student does not own their computational environment provided by the notebook server, though they may able to export their files; nor do they own their database server: they are one of many users accessing the same service, though they may be able to export a dump of their database contents.

3) a TM351 Docker environment is defined in a public repository such as innovationOUtside/tm351vm-binder. The definition is public and can be shared (owned, edited) by anyone. The repository can be launched using a Binderhub instance and used to provide temporary access to an integrated, personal TM351 environment running a personal Jupyter notebook server as well as a Posgtgres server. Continuous integration tools build a Docker container image from the repo and push it to Docker Hub (ousefuldemos/tm351-binderised. An institutional JupyterHub server allows students to seemlessly login from the VLE and launch the TM351 environment pulled from Docker Hub either directly or via an institutional Jupyter Enterprise Gateway Server. A student with a powerful computer at home installs Docker and launches their own local container instance of the TM351 environment pulled from DockerHub. Perhaps more conveniently, they use the ContainDS desktop application to launch the container locally, again either pulling the prebuilt image from Docker Hub, or building a version themselves either directly from the original repository or from a local clone of it. (ContainDS greatly simplifies the practicalites of running Dockerised notebook servers on the desktop, providing a useful graphical user interface for mangaing containers, managing server authentication tokens, mounting files from the desktop into the container, etc.)

In the third case, the same environment definition is used to:

  • build and deploy temporary environments on MyBinder;
  • build a public image deposited on Docker Hub;

The public image on Docker Hub is then used to:

  • deploy environments from an institutional JupyterHub service;
  • deploy local environments on the students’ own desktop.

In each case, students gain personal access to a commonly defined environment and have “ownership” of all the services running inside the environment. Students are free to take away the environment and use it in other contexts, or access it solely via hosted solutions. (There is an issue of synchronising user files and environment updates across mutliple services if a student works in that way, eg sometimes at home on their computer, sometimes from their desk on an OU remote host, etc.)

Use Cases for Jupyter Services

Jupyter environments can be used to support a range of activities across the institution, most notably:

  1. Delivering interactive teaching materials to students (module delivery);
  2. Authoring teaching materials (module production);
  3. Supporting computational academic research (research);
  4. Disseminating academic research (reproducible research publications) (research publishing);
  5. Supporting institutional data analysis and reporting (business analytics)

A sensible question to ask is: “what benefits or differences do a ‘Jupyter solution’ bring to each of these activities. So let’s quickly review them:

Delivering Interactive Teaching Materials to Students

We have been using Jupyter notebook from since before they were Jupyter notebooks to deliver 4 hours of teaching per week to students on TM351. Over that time, the pedagogy has delivered but is still largely unexplored.

The notebooks we wrote then are not the notebooks we would write now. The notebooks we might write now are not the notebooks we could write if we spent some time exploring them properly as a medium for both teaching and learning, eg considering how they might be used to support formative assessment, summatice assessment (touched on on TM351), automated testing / grading, personal note taking and portfolio development.

Computation supports interactivity in two ways:

  • it supports the execution of provided code and as such can be used to create what are effectively end user applications;
  • it allows student to create and execute their own code, for whatever purpose.

The Jupyter environment can support both use cases.

Authoring Teaching Materials

Irrespective of whether notebooks are used to deliver teaching to students, they can be used to develop teaching materials, interactive and otherwise, in a direct authoring way.

For example, document conversion tools allow authored notebooks to be rendered in a variety of formats: as .docx word processor documents, as .pdf files, as simple .md markdown/text files, as HTML pages.

The notebook user interface is web based and, via notebook extensions, supports WYSIWYG editing as well as direct editing of markdown and HTML text. Mathematical and chemical equations written in LaTeX are rendered natively by the notebook.

A wide range of display methods allow rich media assets to be embedded in the text by simple wrapping of a file reference (a local file reference or a web URL) that points to the object: images (Image(URL)), videos (Video(LOCAL_FILE) and audio clips (Audio(URL)) are all readily embedded in the document, for example, as can more complex objects such as interactive maps.

Code, often little more than a single simple line, can also be used to generate media outputs, from rich interactive embedded javascript applications to simple charts and tables.

More ambitious authors may choose to create their own asset generating code, or even go so far as to create de facto end user applications within the notebook context.

Using code to generate charts and tables has benefits for module maintenance, because charts are generated from source datasets or equations. If they need to be updated, a simple change to the code or data file is all that’s required for the rendered asset to be updated.

The executed notebook can then be exported as final document (as interactive HTML or ePub, as a flat PDF or docx etc) from the original notebook. (Of course, not all renderings will be as rich as the original interactive notebook form or HTML converted form.)

I was hoping to make more progress on openlearn-publish-test to demonstrate how we could use Jupytext to support direct authoring and rich / interactive updating of converted OpenLearn OU-XML content in a notebook UI, but I’ve run out of time and only got as far as how to get the OpenLearn content into a notebook enabled environment.

Supporting Computational Academic Reasearch

Lots of academics write ad hoc code in their research; lots of academics make notes around their ad hoc code; lots of academics use code for exploration; notebooks are powerful environment that lets you do each of those in the context of all the others.

Jupyter notebooks are arguably not the best environment for developing research software packages, although the provision of computational environments may support that activity. However, workflows are emerging that do better support traditional software engineering / code development practices. For example, tools such as Jupytext provide support for working with simple text document formats (.py and .R files, for example) directly within the Jupyter notebook environment.

Disseminating Academic Research

An increasing number of academic journals require researchers to deposit reproducible code scripts with their submissions.

Several journals are exploring the use of Jupyer notebooks as a first-class document format for submitting papers, as well as developing review and comment tools around them.

Supporting Institutional Analysis and Reporting

Lots of financial companies use notebooks as an analysis environment. The Ministry of Justice moved to a Jupyter fronted platform (MoJ Analytical Platform) for their analysts.

A recent OU job ad for Head of Data Analytics identified skills in things like Python and R. Such folk might reasonably expect to use Jupyter notebooks for their analysis and reporting. One of the things not covered in this review are the rich interactive dashboarding tools that Jupyter ecosystem supports (eg Voilà).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...