Notes in Advance of a Meeting About The Possibility of Getting an Institutional Jupyter Server Up and Running

Notes on Jupyter Deployments in an OU Context.

Notes made in advance of an internal workshop to discuss supporting “Jupyter notebooks” in the OU.

My intention is to split the content over several documents.

Needless to say, I think “notebooks”: a) are not really the point; b) offer way more potential for doing all sorts of things than folk might think.

pandoc -o output.docx -f markdown -t docx filename.md

The content is split into three main sections:

  • Current Jupyter Deployments in the OU
  • Architectural Models
  • Use Cases for Jupyter Services

Current Jupyter Deployments in the OU

Jupyter notebooks are already used in several OU courses. The following summarises what I’ve managed to learn, or imagine may be the case. Folk involved with the respective modules may well disagree.

TM351

TM351 is a 30 point, third level module on data management and analysis. Approximately 50% of study is spent working on practical activities delivered via Jupyter notebooks.
Essential requirements:

  • notebook server that allows students to complete notebook based activities in a Python environment;
  • ability to create and run new Python backed notebooks;
  • pre-configured Python computational environment with preinstalled packages (Python and Linux package dependencies);
  • access to PostgreSQL server with permissions to add and delete users, roles, databases, tables; read and write to tables from notebooks;
  • access to MongoDB server with read permissions on a seeded database; permissions to create / read / write / delete databases and collections from notebooks;
  • access to OpenRefine application;
  • access to public internet addresses in order to download data files from arbitrary URLs;
  • ability to persist notebooks;
  • ability to access all service GUIs through a web browser;
  • ability to work on a student’s own computer in an offline mode;
  • ability to work cross-platform (Windows, Mac, Linux).

Desirable Requirements:

  • ability to backup and restore databases;

  • ability to take away the complete computing environment so it can be run outside the OU;
  • ability to access the environment on a remote host (eg an OU hosted solution); (this solution would in turn create requirements based on scaleability, affordability, peak load, resource (processor, memory, storage, bandwidth), uptime etc.)
  • ability to access the environment from a terminal / command line;
  • side effect free on user desktop (i.e. the environment should not clash with any services already running on the student’s computer; the environment should not require changes to the student’s computer; any applications installed should be capable of being uninstalled cleanly).

Optional Requirements:

  • headless operation (access to desktop applications inside the provided environment is not required);

Solution (16B, 16J-19J):

A VirtualBox virtual machine (VM) managed using vagrant provides a self-contained, preconfigured environment running all required applications. User files are mounted into the VM from the user desktop and saved out of the VM back onto the desktop.

Proposed Solution (20J+):

A Docker container (rather than VirtualBox VM) built using repo2docker, runnable via Docker Desktop / ContainDS on a student computer or on a remote host either as a standalone service or via JupyterHub.

Requirements not met:

  • side effects (VirtualBox and vagrant must both be installed; on Windows, virtualisation may need enabling and Hyper-V needs disabling. A fix to the latter would be to ship a VM built natively for Hyper-V);
  • hosted solution: an OU hosted solution is not available. DIY solutions for students to self-host on Azure, AWS, Digital Ocean are provided.

TM112

TM112 is a 30 point, first level module that in part provides an introduction to Python programming. Whilst the Python environment used for most of the activities is a simple, user-installed Python environment without embellishment, an optional “notebook experience” activity is also provided to deliver enrichment activities.
Essential requirements:

  • ability to run provided Jupyter notebooks within a single study session (no requirement to persist changed notebooks);
  • install-free / hosted solution;
  • scaleable (capable of coping with peak demand: 1k concurrent users)
  • affordable (<£X per student presentation, <£Y per student per activity);
  • available (24hr, instant / on-demand access over activity period);

Desirable Requirements:

  • enforced spend limit for autoscaled delivery;
  • authenticated access (without additional sign-on requirements) from inside Moodle VLE.

Solution (18J, 19B, 19J):

JupyterHub+Kubernetes running prebuilt Docker container on Microsoft Azure. Disposable single user notebook environments launched on demand from a preconfigured, LTI authenticating link on a module webpage in Moodle VLE. Docker container also runs via MyBinder or on local machine with Docker installed.

S818

S818 is a masters level module on space science. Students are required to use the Python programming language to complete a small number of programming activities in supplied notebooks.

TMA01 requires students to use simple notebook computations, and TMA02 requires students to complete calculations that might sensibly be computed in a notebook context.

Essential requirements:

  • ability to run and edit provided notebooks in a Python / pandas environment;
  • ability to create and run new notebooks;
  • ability to persist created / edited notebooks.

Desirable Requirements:

  • None?

Solution (??, 18B, 19B, 20B):

Students are referred to the OpenLearn Learn to Code for Data Analysis course, which recommends installing an Anaconda scientific Python environment locally. This environment includes a local notebook server; the scipy stack, including pandas, is available as part of the default Anaconda environment.

Planned Deployments

Jupyter notebooks are planned for use in several modules currently in production, including:

M269 (new edition) Algorithms

Essential requirements:
– ability to upload data files and notebooks and run them, as on the current TM351 server
– have NetworkX installed

Desirable requirements:
– direct link from VLE, not a separate login
– ability to create an image before the start of the module that already contains all notebooks and TMAs
– direct submission, marking and return of TMAs via nbgrader
– JupyterLab
– remember open notebooks and files from last session

Intended solution: online hosted notebook environment; Anaconda for local use.

TM358 Machine Learning

Essential requirements:

  • ability to run GPU powered kernels;
  • pre-configured Python computational environment with preinstalled packages (Python and Linux package dependencies);
  • access to public internet addresses in order to download data files from arbitrary URLs;
  • ability to persist notebooks;
  • ability to access all service GUIs through a web browser;
  • ability to work cross-platform (Windows, Mac, Linux).
  • access to data storage for large datasets (tens of MB up to perhaps a few GB) (mostly read-only, but also need to save and load trained models)

Desirable requirements:

M348 Linear Models

Essential requirements:

  • ability to run and edit provided notebooks;
  • ability to execute code in notebook code cells in a preconfigured R environment;

Architectural Models

When delivering computational environments to students, particularly ones that expose services to students, we might characterise several architectural models:

  • student / user-focused / standalone environments (1 to N):
      • a single base environment (1) is downloaded by N students; this provides all tools and services required by the student;
      • example: TM351VM: contains Jupyter server, PostgreSQL server, OpenRefine server and MongoDB server; requires VirtualBox to run the VM, and vagrant to manage its deployment.

  • institutional / centralised environments (S to N):
      • centralised multi-user services;
      • S denotes multiple central services provided into the student environment (for example, a shared database, a shared JupyterHub).

The 1-N approach means that students can take away their computing environment and work with it offline. It also means that we can’t track activity inside the student environment unless we enable logging and log data collection inside the environment along with some sort of data log return mechanism.

The S-N approach means that students require online access and cannot take away their computing environment. It also means that we can log any transaction that goes through a server.

Note that a localised, temporary, site deployment model may be possible in the S-N approach: for example, from a standalone physical server at a day school, or on a local network in a prison (though prison IT might forbid such an architecture; it would be interesting to know what policies govern how we make software available to students in prisons).

A Note On installing software on student computers

In the first case, we should note that the OU supports:

  • platform independence (Windows, Mac, Linux);
  • low minimum specification machine (old operating system, minimal memory, basic CPU, no GPU);

When deploying environments to student machines, we should aim to isolate or encapsulate the provided environment from the student’s own environment so that it does not interfere with any applications or services they are already running, and in a way that it can be easily and comprehensively removed from their computer at the end of their studies.

Jupyter Architectural Components

The Jupyter project oversees several components that can be used as part of an integrated notebook hosting service:

  • single user Jupyter server (aka Jupyter notebook server or simple Jupyter server) [PRODUCTION STABLE]: serves notebook and JupyterLab UIs via a browser to a single user, with password or token authentication if required;
  • multi-user JupyterHub server [PRODUCTION STABLE]: provides authenticated access for multiple users to single-user Jupyter servers. Plugins exist to support a wide range of authentication types. Persistent user accounts supported. Single user environments can be created using various “spawner” types, for example: Docker, Kubernetes.
  • Jupyter Enterprise Gateway [PRODUCTION STABLE]: single user Jupyter servers connect a user facing notebook or JupyterLab UI with a backend Jupyter kernel that contains the runtime object environment within which code in a particular notebook is executed. The Jupyter Enterprise Gateway launches kernels at the request of a single user server using Kubernetes; the single user server then manages communications between the UI and the Jupyter Enterprise Gateway managed kernel.
  • BinderHub: launch “temporary” single person notebook servers based on environment definitions contained in a public repository (Github, Zenodo DOI indicated repositories, etc).

Arbitrary web applications (that is, applications that present an HTML over HTTP user interface) can also be accessed in a Jupyter environment in two ways: first, proxied via a single user Jupyter notebook server; second, via a recent community contribution (jhsingle-native-proxy), by being wrapped with a proxy service that can communicate with a JupyterHub or BinderHub server in a similar way to a single user Jupyter notebook server but without the need to run a notebook server.

Single User Jupyter Notebook Server

  • Jupyter notebook server: a standalone server that can be run locally and that is capable of:
      • providing password or token enabled access to the server via the web UI;
      • serving a Jupyter notebook or JupyterLab HTML UI over http on an arbitrary port;
      • launching, from the UI, a separate computational environment (a Jupyter kernel) for each notebook; the kernel is responsible for executing on demand, and in a REPL way, code contained in notebook code cells.

Architecturally, a notebook server presents the user with a notebook management interface for launching individual notebooks; and the notebook interface then provides a way of launching and managing a code executing kernel associated with the notebook.

The notebook server can be used to support computation in several ways:

  • via an interactive browser based notebook UI;
  • as a headless kernel provider to provide a computational environment that can be used to execute code:
      • from within a code editor, such as PyCharm, VSCode, Atom etc.;
      • displayed in an arbitrary HTML page;
  • as a proxy server providing access to other, arbitrary web applications via a single notebook server URL path (i.e. down a single path on a single port).

Let’s consider each of those in more detail in turn.

Using a Notebook Server to Serve Interactive Notebooks

This is the limit of what most people think of when they think of Jupyter notebooks: as a read/write/execute/display interactive notebook environment, accessed via a web browser.

Notebooks can be used in various ways, including but not limited to:

  • all explanatory text and code provided; all code is run in one go and used to deploy interactive widgets and displays in the page to support UI driven interactive activities (the code can optionally be hidden from view);
  • all code provided and users run one cell at a time; instructional text guides their activity and they see the result of executing each step of code one code cell at a time;
  • all code provided, but users encouraged to edit, change and execute code repeatedly to explore a particular code idea;
  • some code provided; users are led through an activity but have to supply some code themselves;
  • structured slate: text used to develop ideas and set up practical activities but students provide all the code;
  • blank slate: users create and run all their own text and code.
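
As a rough illustration of the “some code provided” pattern, the sketch below builds a partially completed activity notebook as a plain JSON document. It assumes the nbformat 4 layout (minor version 4, which does not require cell ids); the cell contents and the activity.ipynb filename are made up for the example.

```python
import json


def markdown_cell(text):
    """Return a notebook markdown cell (nbformat 4 layout)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return an unexecuted notebook code cell (nbformat 4 layout)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_activity_notebook():
    """Build a notebook with one provided code cell and one left to the student."""
    cells = [
        markdown_cell("## Activity\nRun the cell below, then complete the next one."),
        code_cell("total = sum(range(10))\ntotal"),
        markdown_cell("Now write code to calculate the mean of the same numbers."),
        code_cell("# YOUR CODE HERE"),
    ]
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {
                "name": "python3",
                "display_name": "Python 3",
                "language": "python",
            }
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = build_activity_notebook()
notebook_json = json.dumps(notebook, indent=1)
# To save it for opening in Jupyter (hypothetical filename):
# with open("activity.ipynb", "w") as f:
#     f.write(notebook_json)
```

The other patterns in the list differ only in how many of the code cells arrive filled in.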

Using a Notebook Server as a Provider of Computational Environments for Interactive Code Activities in Arbitrary Web Pages

Javascript packages such as thebelab.js allow code areas in arbitrary HTML documents to be executed against a known Jupyter server endpoint. This allows instructional HTML text to include activities where:

  • students execute provided code and see the results returned to, and embedded in, the page at the point they executed the code;
  • students edit code in the HTML page, execute that code against the notebook served headless Jupyter kernel, and see the results returned to and embedded in the page.
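
Packages like thebelab.js do their work in Javascript in the browser, but the first step, asking a Jupyter server to start a kernel, is just an authenticated REST call. A minimal Python sketch of that request is shown below; the server address and token are hypothetical, and the request is built but not sent so that the code runs without a server.

```python
import json
import urllib.request


def kernel_start_request(server_url, token, kernel_name="python3"):
    """Build (but do not send) the REST request that asks a Jupyter server to start a kernel."""
    url = server_url.rstrip("/") + "/api/kernels"
    body = json.dumps({"name": kernel_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical server address and token, for illustration only.
request = kernel_start_request("http://localhost:8888/", "my-secret-token")
# Sending it requires a running server:
# with urllib.request.urlopen(request) as response:
#     kernel = json.load(response)  # includes the new kernel's "id"
```

Code is then executed by exchanging messages with the kernel over a websocket connection, which is the part the Javascript packages manage on the page’s behalf.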

Using a Notebook Server as a Proxy to Other Web Applications

The Jupyter single user server can be extended with a server extension (jupyter-server-proxy) that will allow it to proxy other HTML/http UIs.

Use case example: a Jupyter single user server with the jupyter-server-proxy enabled can be used to proxy an RStudio or OpenRefine application via the Jupyter user interface. A notebook server on example.com/nbserver can trivially serve applications against example.com/nbserver/proxy/PORTNUMBER or aliased as eg example.com/nbserver/rstudio. This means that a Jupyter notebook server can be used to provide an authentication layer in front of an arbitrary web application served from the same local network.
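
The URL scheme described above can be sketched as a small helper. This is only an illustration of how the paths are put together: the function name is my own, port 3333 is assumed to be where OpenRefine is listening, and the rstudio alias is assumed to have been configured on the server.

```python
def proxied_url(server_base_url, target):
    """Return the URL at which a notebook server would expose a proxied application.

    `target` is either a local port number or a named alias.
    """
    base = server_base_url.rstrip("/")
    if isinstance(target, int):
        # Port based route: <base>/proxy/<port>/
        return f"{base}/proxy/{target}/"
    # Named route configured on the server, e.g. <base>/rstudio/
    return f"{base}/{target}/"


openrefine_url = proxied_url("https://example.com/nbserver", 3333)
rstudio_url = proxied_url("https://example.com/nbserver", "rstudio")
```

Because both routes sit under the notebook server’s own path, whatever authentication protects the notebook server also protects the proxied application.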

The jupyter-desktop-server server extension extends the jupyter-server-proxy extension to allow desktop environments (such as XFCE) to be proxied via the Jupyter single user server.

Use case example: a Docker container running a Jupyter server provides authenticated access to a GUI based Java application via a web browser. The Java application runs on the virtualised desktop and is proxied by the notebook server using the jupyter-desktop-server.
Use case example: a Jupyter server provides browser based access to a Windows desktop application running under Wine on an XFCE desktop. The application is launched from a Jupyter notebook UI and proxied via the jupyter-desktop-server extension.

JupyterHub

JupyterHub is a multi-user service that can provide authenticated access to personalised computational environments for multiple users. Each user may also be provided with their own persistent account managed by the server.

A range of authentication schemes are supported, including OAuth, LTI, login from Github etc. (TM112 uses LTI to allow students to access a single sign on authenticating JupyterHub server that launches temporary notebook servers from a Moodle VLE web page. Contact: Rod Norfor for technical details.)

The JupyterHub server provides access to a range of computational environments through environment spawners. For example, the DockerSpawner will, in response to a user login, launch a Docker container that runs a personalised computational environment.

JupyterHub can be configured to provide users with a range of alternative possible environments, so a student could log in and be presented with options to launch different environments relating to different modules, for example.
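
On a Kubernetes backed deployment, one way of offering such a choice is KubeSpawner’s profile_list setting, which takes a list of dictionaries, one per launch option. The sketch below builds such a list from a mapping of modules to images; the image names are placeholders rather than real OU images, and the helper function is hypothetical.

```python
# Hypothetical module environments; the image names are placeholders.
MODULE_IMAGES = {
    "TM351": "example-registry/tm351-env:latest",
    "TM112": "example-registry/tm112-env:latest",
}


def build_profile_list(module_images, default_module):
    """Build a KubeSpawner style profile list, one launch option per module."""
    profiles = []
    for module, image in sorted(module_images.items()):
        profiles.append(
            {
                "display_name": f"{module} environment",
                "slug": module.lower(),
                "default": module == default_module,
                # Settings applied to the spawner when this option is chosen.
                "kubespawner_override": {"image": image},
            }
        )
    return profiles


profile_list = build_profile_list(MODULE_IMAGES, default_module="TM112")
# In a jupyterhub_config.py file this would be assigned as:
# c.KubeSpawner.profile_list = profile_list
```

Each entry appears as an option on the page shown to the student after login.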

JupyterHub can scale a service offering with increasing numbers of users using a well supported and well proven Kubernetes deployment model. (The TM112 JupyterHub server uses Kubernetes on Microsoft Azure to service the required number of students in a scaleable way.)

Whilst JupyterHub nominally expects to manage launched environments via a Jupyter notebook server running in the environment, a recent community contribution (jhsingle-native-proxy) allows arbitrary containerised web applications to be launched and managed from a JupyterHub server.

Architecturally, JupyterHub can launch individual Jupyter notebook servers, notebook servers then launch notebooks, and notebooks launch kernels.

Jupyter Enterprise Gateway Server

The Jupyter Enterprise Gateway Server is a middleware service, originally developed by IBM, that provides the ability to launch kernels on behalf of remote notebooks in a scaleable way (eg scaling for large numbers of users; allowing kernels to run with different amounts of computational resource (CPUs, GPUs, memory etc)).

One possible architectural model would be for a JupyterHub server to provide multi-user access to the Jupyter environment, and JupyterHub to launch kernels via the Jupyter Enterprise Gateway Server.

Alternatively, a student running their own personal Jupyter notebook server at home on a computer with limited computational resource could use an institutional Jupyter Enterprise Gateway Server to launch a kernel on a well provisioned server, eg one with a large amount of memory and a GPU.

(At the moment, I don’t think the same personal Jupyter server can launch kernels locally as well as via a Jupyter Enterprise Gateway server; I think the provisioner is one or the other.)
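
As a sketch of the home computer scenario, the notebook server can be pointed at a gateway when it is started, using the --gateway-url option. The helper below just assembles the command; the gateway address is hypothetical, and it assumes a notebook server version that includes the gateway client.

```python
def notebook_via_gateway_command(gateway_url, port=8888):
    """Return the command for starting a local notebook server that uses remote kernels."""
    return [
        "jupyter",
        "notebook",
        f"--port={port}",
        f"--gateway-url={gateway_url}",
    ]


# Hypothetical institutional gateway address.
command = notebook_via_gateway_command("https://gateway.example.ac.uk")
# To actually launch the server:
# import subprocess
# subprocess.run(command, check=True)
```

The student still sees a local notebook UI; only the kernels run on the remote machine.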

BinderHub

BinderHub is a variant of JupyterHub that allows an unauthenticated user to launch a containerised Jupyter notebook server, on-demand, built according to a specification held in a linked to remote repository.

The most common way of using BinderHub is to create a Github repository containing environment definition files as well as user files (eg Jupyter notebooks) and then use MyBinder (a free and open federated BinderHub service) to build a Docker image based on the contents of the repository. Once built, or if a cached version already exists, MyBinder spawns a Docker container from the image and serves it to the user.
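
Launching a repository on MyBinder comes down to visiting a URL built from the repository details. A minimal sketch of that URL construction follows; the helper name is my own, the repository is one mentioned elsewhere in these notes, and HEAD is used as the ref so that no particular branch name is assumed.

```python
from urllib.parse import quote, urlencode


def mybinder_github_url(org, repo, ref="HEAD", urlpath=None):
    """Return a MyBinder launch URL for a GitHub repository."""
    url = f"https://mybinder.org/v2/gh/{quote(org)}/{quote(repo)}/{quote(ref, safe='')}"
    if urlpath:
        # Optionally open a particular UI or application path after launch.
        url += "?" + urlencode({"urlpath": urlpath})
    return url


launch_url = mybinder_github_url("innovationOUtside", "tm351vm-binder")
lab_url = mybinder_github_url("innovationOUtside", "tm351vm-binder", urlpath="lab")
```

A link like this can be dropped into any web page, which is what makes MyBinder convenient for sharing runnable examples.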

Currently, BinderHub provides only a “temporary” service – the container is built, deployed, served to the user, and then destroyed at the end of the session. However, one of the Binder Federation nodes does have an experimental persistent BinderHub deployment that provides authenticated user access and a persistent user file area. Users can launch Binder containers from their account, save files to their account, and share their own files into their launched Binder containers.

BinderHub / MyBinder is also used as an ad hoc provider of computational environments for a variety of online “interactive textbooks” and online courses. For example, Jupyter Book and “the spacy course”, as well as the LibreTexts interactive book platform. Published as HTML websites, the contents of code cells embedded (and editable) within the HTML page can be executed against a remotely launched MyBinder kernel, with the result of the computation returned to the page and displayed within it.

Several Javascript packages (thebelab.js, juniper.js) exist to “enable” the code cells in an HTML page and manage the MyBinder connection.

nbgallery

nbgallery is not an official Jupyter project but it does provide a range of interesting features that are worth exploring if we want to be open-minded about what sort of user environment we use to provide people with access to Jupyter notebooks.

The nbgallery application (TH review and video review) was developed by the US Department of Defense and the NSA to provide multi-user access to a wide range of notebooks. The gallery provides search tools over a wide collection of notebooks and allows users to rate and review notebooks. Users can launch notebooks in a connected Jupyter environment. A healthcheck facility checks that code cells execute as expected (and if not may flag maintenance or student difficulty issues).

Exploring the use of nbgallery either as a social application, visited by all students, or as a personal application, used by a student to access notebooks in a personal environment, may turn up a way of providing access to notebooks in a way that is useful not just for linear courses, but also resource based / problem based learning courses.

Institutional Vs Local Provision

Whilst it is possible to consider Jupyter mediated environments in either the local user context (eg TM351 students using their own VMs) or the institutional context (for example, TM112 students launching temporary notebook servers), I think there is most to be gained from considering them as two different ways of exposing students to the same computational environments.

For example, consider the following three situations:

1) TM351 students run a VirtualBox virtual machine containing multiple “personal” servers: a Jupyter notebook server, a Postgres database server; the student “owns” all services and all services are integrated.

2) TM351 students access a virtualised Jupyter environment for notebook access, and log in to a shared Postgres database server. The student does not own the computational environment provided by the notebook server, though they may be able to export their files; nor do they own their database server: they are one of many users accessing the same service, though they may be able to export a dump of their database contents.

3) a TM351 Docker environment is defined in a public repository such as innovationOUtside/tm351vm-binder. The definition is public and can be shared (owned, edited) by anyone. The repository can be launched using a BinderHub instance and used to provide temporary access to an integrated, personal TM351 environment running a personal Jupyter notebook server as well as a Postgres server. Continuous integration tools build a Docker container image from the repo and push it to Docker Hub (ousefuldemos/tm351-binderised). An institutional JupyterHub server allows students to seamlessly log in from the VLE and launch the TM351 environment pulled from Docker Hub either directly or via an institutional Jupyter Enterprise Gateway Server. A student with a powerful computer at home installs Docker and launches their own local container instance of the TM351 environment pulled from Docker Hub. Perhaps more conveniently, they use the ContainDS desktop application to launch the container locally, again either pulling the prebuilt image from Docker Hub, or building a version themselves either directly from the original repository or from a local clone of it. (ContainDS greatly simplifies the practicalities of running Dockerised notebook servers on the desktop, providing a useful graphical user interface for managing containers, managing server authentication tokens, mounting files from the desktop into the container, etc.)

In the third case, the same environment definition is used to:

  • build and deploy temporary environments on MyBinder;
  • build a public image deposited on Docker Hub;

The public image on Docker Hub is then used to:

  • deploy environments from an institutional JupyterHub service;
  • deploy local environments on the students’ own desktop.
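
For the local route, launching the prebuilt image comes down to a single docker run command. The sketch below assembles one; it assumes the image serves a notebook server on port 8888 inside the container, and the mount point inside the container and the desktop folder are guesses that would need checking against the actual image and machine.

```python
def docker_run_command(image, host_dir, host_port=8888,
                       container_port=8888,
                       container_dir="/home/jovyan/notebooks"):
    """Return a docker command that runs the image with a port and a folder shared."""
    return [
        "docker", "run", "--rm",
        # Expose the notebook server on the student's machine.
        "-p", f"{host_port}:{container_port}",
        # Share a desktop folder so that notebooks persist outside the container.
        "-v", f"{host_dir}:{container_dir}",
        image,
    ]


command = docker_run_command("ousefuldemos/tm351-binderised", "/Users/student/TM351")
print(" ".join(command))
```

This is essentially the bookkeeping that a tool like ContainDS hides behind its graphical interface.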

In each case, students gain personal access to a commonly defined environment and have “ownership” of all the services running inside the environment. Students are free to take away the environment and use it in other contexts, or access it solely via hosted solutions. (There is an issue of synchronising user files and environment updates across multiple services if a student works in that way, eg sometimes at home on their computer, sometimes from their desk on an OU remote host, etc.)

Use Cases for Jupyter Services

Jupyter environments can be used to support a range of activities across the institution, most notably:

  1. Delivering interactive teaching materials to students (module delivery);
  2. Authoring teaching materials (module production);
  3. Supporting computational academic research (research);
  4. Disseminating academic research (reproducible research publications) (research publishing);
  5. Supporting institutional data analysis and reporting (business analytics)

A sensible question to ask is: “what benefits or differences does a ‘Jupyter solution’ bring to each of these activities?” So let’s quickly review them:

Delivering Interactive Teaching Materials to Students

We have been using Jupyter notebooks since before they were Jupyter notebooks to deliver 4 hours of teaching per week to students on TM351. Over that time, the pedagogy has delivered but is still largely unexplored.

The notebooks we wrote then are not the notebooks we would write now. The notebooks we might write now are not the notebooks we could write if we spent some time exploring them properly as a medium for both teaching and learning, eg considering how they might be used to support formative assessment, summative assessment (touched on in TM351), automated testing / grading, personal note taking and portfolio development.

Computation supports interactivity in two ways:

  • it supports the execution of provided code and as such can be used to create what are effectively end user applications;
  • it allows students to create and execute their own code, for whatever purpose.

The Jupyter environment can support both use cases.

Authoring Teaching Materials

Irrespective of whether notebooks are used to deliver teaching to students, they can be used to develop teaching materials, interactive and otherwise, in a direct authoring way.

For example, document conversion tools allow authored notebooks to be rendered in a variety of formats: as .docx word processor documents, as .pdf files, as simple .md markdown/text files, as HTML pages.
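
In practice the conversion is done with nbconvert (for example, jupyter nbconvert --to markdown notebook.ipynb). To show why such conversions are straightforward, here is a deliberately naive stand-in that turns a notebook document into markdown text using nothing but the notebook’s JSON structure. It assumes the nbformat 4 layout and ignores cell outputs; the sample notebook is made up.

```python
FENCE = "`" * 3  # markdown code fence marker


def notebook_to_markdown(notebook):
    """Render a notebook dict (nbformat 4 layout) as markdown text, ignoring outputs."""
    parts = []
    for cell in notebook["cells"]:
        source = cell["source"]
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        if cell["cell_type"] == "markdown":
            parts.append(source)
        elif cell["cell_type"] == "code":
            parts.append(f"{FENCE}python\n{source}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


sample = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["x = 1\n", "x + 1"],
        },
    ],
}

markdown_text = notebook_to_markdown(sample)
```

nbconvert does considerably more than this, such as including outputs and applying templates, but the principle is the same.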

The notebook user interface is web based and, via notebook extensions, supports WYSIWYG editing as well as direct editing of markdown and HTML text. Mathematical and chemical equations written in LaTeX are rendered natively by the notebook.

A wide range of display methods allow rich media assets to be embedded in the text by simple wrapping of a file reference (a local file reference or a web URL) that points to the object: images (Image(URL)), videos (Video(LOCAL_FILE)) and audio clips (Audio(URL)) are all readily embedded in the document, for example, as can more complex objects such as interactive maps.

Code, often little more than a single simple line, can also be used to generate media outputs, from rich interactive embedded javascript applications to simple charts and tables.

More ambitious authors may choose to create their own asset generating code, or even go so far as to create de facto end user applications within the notebook context.

Using code to generate charts and tables has benefits for module maintenance, because charts are generated from source datasets or equations. If they need to be updated, a simple change to the code or data file is all that’s required for the rendered asset to be updated.
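
A minimal sketch of the idea, using a table rather than a chart so that it needs nothing beyond the Python standard library: the table is generated from source data, and a notebook displays it as HTML because the object provides a _repr_html_ method. The DataTable class is hypothetical; the data rows just restate the module facts given earlier.

```python
import csv
import html
import io


class DataTable:
    """A table built from CSV text that a notebook can render as HTML."""

    def __init__(self, csv_text):
        rows = list(csv.reader(io.StringIO(csv_text.strip())))
        self.header, self.rows = rows[0], rows[1:]

    def _repr_html_(self):
        # IPython calls this method, if present, to display an object as HTML.
        head = "".join(f"<th>{html.escape(h)}</th>" for h in self.header)
        body = "".join(
            "<tr>" + "".join(f"<td>{html.escape(v)}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Source data; editing it and rerunning the cell regenerates the table.
SOURCE_DATA = """
module,level,points
TM351,3,30
TM112,1,30
"""

table = DataTable(SOURCE_DATA)
table_html = table._repr_html_()
```

Leaving table as the last line of a notebook code cell is enough to display the rendered table.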

The executed notebook can then be exported as a final document (as interactive HTML or ePub, as a flat PDF or docx etc) from the original notebook. (Of course, not all renderings will be as rich as the original interactive notebook form or HTML converted form.)

I was hoping to make more progress on openlearn-publish-test to demonstrate how we could use Jupytext to support direct authoring and rich / interactive updating of converted OpenLearn OU-XML content in a notebook UI, but I’ve run out of time and only got as far as how to get the OpenLearn content into a notebook enabled environment.

Supporting Computational Academic Research

Lots of academics write ad hoc code in their research; lots of academics make notes around their ad hoc code; lots of academics use code for exploration. Notebooks are a powerful environment that lets you do each of those in the context of all the others.

Jupyter notebooks are arguably not the best environment for developing research software packages, although the provision of computational environments may support that activity. However, workflows are emerging that do better support traditional software engineering / code development practices. For example, tools such as Jupytext provide support for working with simple text document formats (.py and .R files, for example) directly within the Jupyter notebook environment.
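
As an illustration, a notebook saved by Jupytext in its “percent” format is an ordinary Python script in which cell boundaries are marked by comments. The example below is a made-up notebook in that format; it runs as a plain script and opens as a notebook when Jupytext is installed.

```python
# %% [markdown]
# # Mean of a small sample
# In the percent format, markdown cells are stored as comments.

# %%
import statistics

sample = [2, 4, 4, 5, 7]

# %%
sample_mean = statistics.mean(sample)
sample_mean
```

Because the file is plain text, it can be diffed, reviewed and version controlled like any other source file.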

Disseminating Academic Research

An increasing number of academic journals require researchers to deposit reproducible code scripts with their submissions.

Several journals are exploring the use of Jupyter notebooks as a first-class document format for submitting papers, as well as developing review and comment tools around them.

Supporting Institutional Analysis and Reporting

Lots of financial companies use notebooks as an analysis environment. The Ministry of Justice moved to a Jupyter fronted platform (MoJ Analytical Platform) for their analysts.

A recent OU job ad for Head of Data Analytics identified skills in things like Python and R. Such folk might reasonably expect to use Jupyter notebooks for their analysis and reporting. One thing not covered in this review is the range of rich interactive dashboarding tools that the Jupyter ecosystem supports (eg Voilà).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
