Notes in Advance of a Meeting About The Possibility of Getting an Institutional Jupyter Server Up and Running

Notes on Jupyter Deployments in an OU Context.

Notes made in advance of an internal workshop to discuss supporting “Jupyter notebooks” in the OU.

My intention is to split the content over several documents.

Needless to say, I think “notebooks” are both: a) not really the point; b) offer way more potential for doing all sorts of things than folk might think.

pandoc -o output.docx -f markdown -t docx filename.md

The content is split into three main sections:

  • Current Jupyter Deployments in the OU
  • Architectural Models
  • Use Cases for Jupyter Services

Current Jupyter Deployments in the OU

Jupyter notebooks are already used in several OU courses. The following summarises what I’ve managed to learn, or imagine may be the case. Folk involved with the respective modules may well disagree.

TM351

TM351 is a 30 point, third level module on data management and analysis. Approximately 50% of study is spent working on practical activities delivered via Jupyter notebooks.
Essential requirements:

  • notebook server that allows students to complete notebook based activities in a Python environment;
  • ability to create and run new Python backed notebooks;
  • pre-configured Python computational environment with preinstalled packages (Python and Linux package dependencies);
  • access to PostgreSQL server with permissions to add and delete users, roles, databases, tables; read and write to tables from notebooks;
  • access to MongoDB server with read permissions on a seeded database; permissions to create / read / write / delete databases and collections from notebooks;
  • access to OpenRefine application;
  • access to public internet addresses in order to download data files from arbitrary URLs;
  • ability to persist notebooks;
  • ability to access all service GUIs through a web browser;
  • ability to work on a student’s own computer in an offline mode;
  • ability to work cross-platform (Windows, Mac, Linux).
    Desirable Requirements:

  • ability to backup and restore databases;

  • ability to take away the complete computing environment so it can be run ex- of the OU;
  • ability to access the environment on a remote host (eg an OU hosted solution); (this solution would in turn create requirements based on scaleability, affordability, peak load, resource (processor, memory, storage, bandwidth), uptime etc.)
  • ability to access the environment from a terminal / command line;
  • side effect free on user desktop (i.e. the environment should not clash with any services already running on the student’s computer; the environment should not require changes to the student’s computer; any applications installed should be capable of being uninstalled cleanly).

Optional Requirements:

  • headless operation (access to desktop applications inside the provided environment is not required);
    Solution (16B, 16J-19J):

A VirtualBox virtual machine (VM) managed using vagrant provides a self-contained, preconfigured environment running all required applications. User files are mounted into the VM from the user desktop and saved out of the VM back onto the desktop.

Proposed Solution (20J+):

A Docker container (rather than VirtualBox VM) built using repo2docker, runnable via Docker Desktop / ContainDS on a student computer or on a remote host either as a standalone service or via JupyterHub.

Requirements not met:

  • side effects (VirtualBox and vagrant must both be installed; on Windows, virtualisation may need enabling, HyperV needs disabling. A fix to the latter would be to ship a VM built natively for HyperV);
  • hosted solution: an OU hosted solution is not available. DIY solutions for students to self-host on Azure, AWS, Digital Ocean are provided.

TM112

TM112 is a 30 point, first level module that in part provides an introduction to Python programming. Whilst the Python environment used for most of the activities is a simple, user-installed Python environment without embellishment, an optional “notebook experience” activity is also provided to deliver enrichment activities.
Essential requirements:

  • ability to run provided Jupyter notebooks within a single study session (no requirement to persist changed notebooks);
  • install-free / hosted solution;
  • scaleable (capable of coping with peak demand: 1k concurrent users)
  • affordable (<£X per student presentation, <£Y per student per activity);
  • available (24hr, instant / on-demand access over activity period);

Desirable Requirements:

  • enforced spend limit for autoscaled delivery;
  • authenticated access (without additional sign-on requirements) from inside Moodle VLE.
    Solution (18J, 19B, 19J):

JupyterHub+Kubernetes running prebuilt Docker container on Microsoft Azure. Disposable single user notebook environments launched on demand from a preconfigured, LTI authenticating link on a module webpage in Moodle VLE. Docker container also runs via MyBinder or on local machine with Docker installed.

S818

S818 is a masters level module on space science. Students are required to use the Python programming language to complete a small numbet of programming activities in supplied notebooks.

TMA01 and require students to use simple notebook computations and TMA02 requires students to complete calculations that might sensibly be computed in a notebook context.

Essential requirements:

  • ability to run and edit provided notebooks in a Python / pandas environment;
  • ability to create and run new notebooks;
  • ability to persist created / edited notebooks.

Desirable Requirements:

  • None?

Solution (??, 18B, 19B, 20B):

Students are referred to the OpenLearn Learn to Code for Data Analysis course which recommends installing and Anancoda scientific Python environment locally. This environment includes a local notebook server; the scipy stack, including pandas, is available as part of the default Anaconda environment.

Planned Deployments

Jupyter notebooks are planned for use in several modules currently in production, including:

M269 (new edition) Algorithms

Essential requirements:
– uploading data files and notebooks and run them, like current TM351 server
– have NetworkX installed

Desirable requirements:
– direct link from VLE, not a separate login
– ability to create image before start of module with already all notebooks and TMAs
– direct submission, marking and return of TMAs via nbgrader
– jupyterlab
– remember open notebooks and files from last session

_Intended solution:__ online hosted notebook environment; Anaconda for local use.

TM358 Machine Learning

Essential requirements:

  • ability to run GPU powered kernels;
  • pre-configured Python computational environment with preinstalled packages (Python and Linux package dependencies);
  • access to public internet addresses in order to download data files from arbitrary URLs;
  • ability to persist notebooks;
  • ability to access all service GUIs through a web browser;
  • ability to work cross-platform (Windows, Mac, Linux).
  • access to data storage for large datasets (10s Mb up to perhaps a few Gb) (mostly read-only, but also need to save and load trained models)

Desirable requirements:

M348 Linear Models

Essential requirements:

  • ability to run and edit provided notebooks;
  • ability to execute code in notebook code cells in a preconfigured R environment;

Architectural Models

When delivering computational environments to students, particular ones that expose services to students, we might characterise several architectural models:

  • student / user-focused / standalone environments (1 to N):
  • a single base environment (1) is downloaded by N students; the provides all tools and services required by the student.
  • example: TM351VM: contains Jupyter server, PostgreSQL server; OpenRefine server; MongoDB server; requires VirtualBox to run the VM, and vagrant to manage its deployment.

  • institutional / centralised environments (S to N):

  • centralised multi-user services;
  • S denotes multiple central services provided into the student environment (for example, a shared database, a shared JupyterHub);

The 1-N approach means that students can take away their computing environment and work with it offline. It also means that we can’t track activity inside the student environment unless we enable logging and log data collection inside the environment along with some sort of data log return mechanism.

The S-N approach means that students require online access and cannot take away their computing environment. It also means that we can log any transaction that goes through a server.

Note that a localised, temporary, site deployment model may be possible in the S-N approach. For example, from a standalone physical server at a day school. On a local network in a prison (though prison IT might forbid such an architecture. It would be interesting to know what policies govern how we make software available to students in prisons).

A Note On installing software on student computers

In the first case, we should note that the OU supports:

  • platform indepence (Windows, Mac, Linux);
  • low minimum specification machine (old operating system, minimal memory, basic CPU, no GPU);

When deploying environments to student machines, we should aim to isolate or encapsulate the provided environment from the student’s own environment so that it does not interfere with any applications or services they are already running, and in a way that it can be easily and comprehensively removed from their computer at the end of their studies.

Jupyter Architectural Components

The Jupyter project oversees several components that can be used as part of an integrated notebook hosting service:

  • single user Jupyter server (aka Jupyter notebook server or simple Jupyter server) [PRODUCTION STABLE]: serves notebook and JupyterLab UIs via a browser to a single user, with password or token authentication if required;
  • multi-user JupyterHub server [PRODUCTION STABLE]: provides authenticated access for multiple users to single-user Jupyter servers. Plugins exist to support a wide range of authentication types. Persistent user accounts supported. Single user environments can be created using various “spawner” types, for example: Docker, Kubernetes.
  • Jupyter Enterprise Gateway [PRODUCTION STABLE]: single user Jupyter servers connect a user facing noteook or JupyterLab UI with a backend Jupyter kernel that contains the runtime object environment within which code in a particular notebook is executed. The Jupyter Enterprise Gateway launches kernels at the request of a single user server using Kubernetes; the single user server then manages communications between the UI and the Jupyter Enterprise Gateway managed kernel.
  • Binderhub: launch “temporary” single person notebook servers based on environment definitions contained in a public repository (Github, Zenodo DOI indicated repositories, etc).

Arbitrary web applications (that is, applications the present an HTML over HTTP user interface) can also be access in Jupyter environment in two ways: first, proxied via a single user Jupyter notebook server; second, via a recent community contribution (jhsingle-native-proxy), through being wrapped with a proxy services that can communicate with a JupyterHub or BinderHub server in a similar way to a single user Jupyter notebook server but without the need to run a notebook server.

Single User Jupyter Notebook Server

  • Jupyter notebook server: a standalone server that can be run locally and that is capable of:
  • providing password or token enabled access to the server via the web UI;
  • serving a Jupyter notebook or JupyterLab HTML UI over http on an arbitrary port;
  • from the UI, each separate notebook can launch a single computational enviroment (a Jupyter kernel) that is responsible for executing on demand, and in a REPL way, code contained in notebook code cells.

Architecturally, a notebook server presents the user with a notebook management interface for launching individual notebooks; and the notebook interface then provides a way of launching and managing a code executing kernel associated with the notebook.

The notebook server can be used to support computation in several ways:

  • via an interactive browser based notebook UI;
  • as a headless kernel provider to provide a computational environment that can be used to execute code:
  • from within a code editor, such as a PyCharm, VSCode, Atom etc.
  • displayed in an arbitrary HTML page;
  • as a proxy server providing access to other, arbitrary web applications via a single notebook server URL path (i.e. down a single path on a single port).

Let’s consider each of those in more detail in turn.

Using a Notebook Server to Serve Interactive Notebooks

This is the limit of what most people think of when they think of Jupyter notebooks: as a read/write/execute/display interactive notebook environment, accessed via a web browser.

Notebooks can be used in various ways, including but not limited to:

  • all explanatory text and code provided; all code is run in one go and used to deploy interactive widgets and displays in the page to support UI driven interactive activities (the code can optionally be hidden from view);
  • all code provided and users run one cell at a time; instructional text guides their activity and the see the result of executing each step of code a code cell block at a time;
  • all code provided, but users encouraged to edit, change and execute code repeatedly to explore a particular code idea;
  • some code provided; users lead through an activity but have to supply some code themselves;
  • structured slate: text used to develop ideas and set up practical activities but students provide all the code;
  • blank slate: users create and run all their own text and code.
Using a Notebook Server as a Provider of Computational Environments for Interactive Code Activities in Arbitrary Web Pages

Javascript packages such as thebelab.js allow code areas in arbitrary HTML documents to be executed against a known Jupyter server endpoint. This allows instructional HTML text to include activities where:

  • students execute provided code and see the results returned to, and embedded in, the page at the point the executed the code;
  • students to edit code in the HTML page, execute that code against the notebook served headless Jupyter kernel, and see the results returned to and embedded in the page.
Using a Notebook Server as a Proxy to Other Web Applications

The Jupyter single user server can be extended with a server extension (jupyter-server-proxy) that will allow it to proxy other HTML/http UIs.

Use case example: a Jupyter single user server with the jupyter-server-proxy enabled can be used to proxy an RStudio or OpenRefine application via the Jupyter user interface. A notebook server on example.com/nbserver can trivially serve applications against example.com/nbserver/proxy/PORTNUMBER or aliased as eg example.com/nbserver/rstudio. This means that a Jupyter notebook server can be used to provide an authentication layer in front of an arbitrary web application served from the same local network.

The jupyter-desktop-server server extension extends the jupyter-server-proxy extension to allow desktop environments (such as XFCE) to be proxied via the Jupyter single user server.

Use case example: a Docker container running a Jupyter server provides authenticated access to a GUI based Java application via a web browser. The Java application runs on the virtualised desktop and is proxied by the notebook server using the jupyter-desktop-server.
Use case example: a Jupyter server provides browser based access to a Windows desktop application running under Wine on an XFCE desktop. The application is launched from a Jupyter notebook UI and proxied via the jupyter-desktop-server extension.

JupyterHub

JupyterHub is a multi-user service that can provide authenticated access to personalised computational environments for multiple users. Each user may also be provided with their own persistent account managed by the server.

A range of authentication schemes are supported, including OAuth, LTI, login from Github etc. (TM112 uses LTI to allow students to access a single sign on authenticating JupyterHub server that launches temporary notebook servers from a Moodle VLE web page. Contact: Rod Norfor for technical details.)

The JupyterHub server provides access to a range of computational environments through environment spawners. For example, the DockerSpawner will launch a Docker container in response to a user login that runs a personalised computational environment in a Docker container.

JupyterHub can be configured to provide users with a range of alternative posssible environments, so a student could log in and be presented with options to launch different environments relating to different modules, for example.

JupyterHub can scale a service offering with increasing numbers of users using a well supported and well proven Kubernetes deployment model. (The TM112 Jupyterhub sever uses Kubernetes on Microsoft Azure to service the required number of students in a scaleable way.)

Whilst JupyterHub nominally expects to manage launched environments via a Jupyter notebook server running in the environment, a recent community contribution (jhsingle-native-proxy) allows arbitrary containerised web applications to be launched and managed from a JupyterHub server.

Architecturally, JupyterHub can launch individual Jupyter notebook servers, notebook servers then launch notebooks, and notebooks launch kernels.

Jupyter Enterprise Gateway Server

The Jupyter Enterprise Gateway Server is a middleware service, originally developed by IBM, that provides the ability to launch kernels on behalf of remote notebooks in a scaleable way (eg scaling for large numbers of users; allowing kernels to run with different amounts of computational resource (CPUs, GPUs, memory etc)).

One possible architectural model would be for a JupyterHub server to provide multi-user access to the Jupyter environment, and JupyterHub to launch kernels via the Jupyter Enterprise Gateway Server.

Alternatively, a student running their own personal Jupyter notebook server at home on a computer with limited computational resource could use an institutional Jupyter Enterprise Gateway Server to launch a kernel on a well provisioned server, eg one with a large amount of memory and a GPU.

(At the moment, I don’t think the same personal Jupyter server can launch kernels locally as well as via a Jupyter Enterprise Gateway server; I think the provisioner is one or the other.)

BinderHub

BinderHub is variant of JupyterHub that allows an unauthenticated user to launch a containerised Jupyter notebook server, on-demand, built according to a linked to specification on a remote repository.

The most common way of using BinderHub is to create a Github repository containing environment definition files as well as user files (eg Jupyter notebooks) and then use MyBinder (a free and open federated Binderhub service) to build a Docker image based on the contents of the repository. Once built, or if a cached version already exists, MyBinder spawns a Docker container from the image and serves it to the user.

Currently, BinderHub provides only a “temporary” service – the container is built, deployed, served to the user, and then destroyed at the end of the session. However, one of the Binder Federation nodes do have an experimental persistent Binderhub deployment that provides authenticated user access and a persistent user file areas. Users can launch Binder containers from their account, save files to their account, and share their own files into their launched Binder containers.

Binderhub / MyBinder is also used as an ad hoc provider of computational environments for a variety of online “interactive textbooks” and online courses. For example, Jupyter Book and “the spacy course”, as well as the LibreTexts interactive book platform. Published as HTML websites, the contents of code cells embedded (and editable) within the HTML page can be executed against a remotely launched MyBinder kernel, with the result of the computation returned to the page and displayed within it.

Several Javascript packages (thebelab.js, juniper.js) exist to “enable” the code cells in an HTML page and manage the MyBinder connection.

nbgallery

nbgallery is not an official Jupyter project but it does provide a range of interesting features that are worth exploring if we want to be open-minded about what sort of user environment we want to use to provide people to access to Jupyer notebooks.

The nbgallery application (TH review and video review) was developed by the US Department of Defense and the NSA to provide multi-user access to a wide range notebooks. The gallery provides search tools over a wide collection of notebooks and allows users to rate and review notebooks. Users can launch notebooks in a connected Jupyter environment. A healthcheck facility checks that code cells execute as expected (and if not may flag maintenance or student difficulty issues).

Exploring the use of nbgallery either as a social application, visited by all students, or as a personal application, used by a student to access notebooks in a personal environment, may turn up a way of providing access to notebooks in a way that is useful not just for linear courses, but also resource based / problem based learning courses.

Institutional Vs Local Provision

Whilst it is possible to consider Jupyter mediated environments in either the local user context (eg TM351 students using their own VMs) or the institutional context (for example, TM112 students launnching temporary notebook servers), I think there is most to be gained from considering them as two different ways of exposing students to the same computational environments.

For example, consider the following three situations:

1) TM351 students run a Virtualbox virtual machine containing multiple “personal” servers: a Jupyter notebook server, a Postgres database server; the student “owns” all services and all services are integrated.

2) TM351 students accessing a virtalised Jupyter environment for notebook access, and logging in to a shared Postgres database server. The student does not own their computational environment provided by the notebook server, though they may able to export their files; nor do they own their database server: they are one of many users accessing the same service, though they may be able to export a dump of their database contents.

3) a TM351 Docker environment is defined in a public repository such as innovationOUtside/tm351vm-binder. The definition is public and can be shared (owned, edited) by anyone. The repository can be launched using a Binderhub instance and used to provide temporary access to an integrated, personal TM351 environment running a personal Jupyter notebook server as well as a Posgtgres server. Continuous integration tools build a Docker container image from the repo and push it to Docker Hub (ousefuldemos/tm351-binderised. An institutional JupyterHub server allows students to seemlessly login from the VLE and launch the TM351 environment pulled from Docker Hub either directly or via an institutional Jupyter Enterprise Gateway Server. A student with a powerful computer at home installs Docker and launches their own local container instance of the TM351 environment pulled from DockerHub. Perhaps more conveniently, they use the ContainDS desktop application to launch the container locally, again either pulling the prebuilt image from Docker Hub, or building a version themselves either directly from the original repository or from a local clone of it. (ContainDS greatly simplifies the practicalites of running Dockerised notebook servers on the desktop, providing a useful graphical user interface for mangaing containers, managing server authentication tokens, mounting files from the desktop into the container, etc.)

In the third case, the same environment definition is used to:

  • build and deploy temporary environments on MyBinder;
  • build a public image deposited on Docker Hub;

The public image on Docker Hub is then used to:

  • deploy environments from an institutional JupyterHub service;
  • deploy local environments on the students’ own desktop.

In each case, students gain personal access to a commonly defined environment and have “ownership” of all the services running inside the environment. Students are free to take away the environment and use it in other contexts, or access it solely via hosted solutions. (There is an issue of synchronising user files and environment updates across mutliple services if a student works in that way, eg sometimes at home on their computer, sometimes from their desk on an OU remote host, etc.)

Use Cases for Jupyter Services

Jupyter environments can be used to support a range of activities across the institution, most notably:

  1. Delivering interactive teaching materials to students (module delivery);
  2. Authoring teaching materials (module production);
  3. Supporting computational academic research (research);
  4. Disseminating academic research (reproducible research publications) (research publishing);
  5. Supporting institutional data analysis and reporting (business analytics)

A sensible question to ask is: “what benefits or differences do a ‘Jupyter solution’ bring to each of these activities. So let’s quickly review them:

Delivering Interactive Teaching Materials to Students

We have been using Jupyter notebook from since before they were Jupyter notebooks to deliver 4 hours of teaching per week to students on TM351. Over that time, the pedagogy has delivered but is still largely unexplored.

The notebooks we wrote then are not the notebooks we would write now. The notebooks we might write now are not the notebooks we could write if we spent some time exploring them properly as a medium for both teaching and learning, eg considering how they might be used to support formative assessment, summatice assessment (touched on on TM351), automated testing / grading, personal note taking and portfolio development.

Computation supports interactivity in two ways:

  • it supports the execution of provided code and as such can be used to create what are effectively end user applications;
  • it allows student to create and execute their own code, for whatever purpose.

The Jupyter environment can support both use cases.

Authoring Teaching Materials

Irrespective of whether notebooks are used to deliver teaching to students, they can be used to develop teaching materials, interactive and otherwise, in a direct authoring way.

For example, document conversion tools allow authored notebooks to be rendered in a variety of formats: as .docx word processor documents, as .pdf files, as simple .md markdown/text files, as HTML pages.

The notebook user interface is web based and, via notebook extensions, supports WYSIWYG editing as well as direct editing of markdown and HTML text. Mathematical and chemical equations written in LaTeX are rendered natively by the notebook.

A wide range of display methods allow rich media assets to be embedded in the text by simple wrapping of a file reference (a local file reference or a web URL) that points to the object: images (Image(URL)), videos (Video(LOCAL_FILE) and audio clips (Audio(URL)) are all readily embedded in the document, for example, as can more complex objects such as interactive maps.

Code, often little more than a single simple line, can also be used to generate media outputs, from rich interactive embedded javascript applications to simple charts and tables.

More ambitious authors may choose to create their own asset generating code, or even go so far as to create de facto end user applications within the notebook context.

Using code to generate charts and tables has benefits for module maintenance, because charts are generated from source datasets or equations. If they need to be updated, a simple change to the code or data file is all that’s required for the rendered asset to be updated.

The executed notebook can then be exported as final document (as interactive HTML or ePub, as a flat PDF or docx etc) from the original notebook. (Of course, not all renderings will be as rich as the original interactive notebook form or HTML converted form.)

I was hoping to make more progress on openlearn-publish-test to demonstrate how we could use Jupytext to support direct authoring and rich / interactive updating of converted OpenLearn OU-XML content in a notebook UI, but I’ve run out of time and only got as far as how to get the OpenLearn content into a notebook enabled environment.

Supporting Computational Academic Reasearch

Lots of academics write ad hoc code in their research; lots of academics make notes around their ad hoc code; lots of academics use code for exploration; notebooks are powerful environment that lets you do each of those in the context of all the others.

Jupyter notebooks are arguably not the best environment for developing research software packages, although the provision of computational environments may support that activity. However, workflows are emerging that do better support traditional software engineering / code development practices. For example, tools such as Jupytext provide support for working with simple text document formats (.py and .R files, for example) directly within the Jupyter notebook environment.

Disseminating Academic Research

An increasing number of academic journals require researchers to deposit reproducible code scripts with their submissions.

Several journals are exploring the use of Jupyer notebooks as a first-class document format for submitting papers, as well as developing review and comment tools around them.

Supporting Institutional Analysis and Reporting

Lots of financial companies use notebooks as an analysis environment. The Ministry of Justice moved to a Jupyter fronted platform (MoJ Analytical Platform) for their analysts.

A recent OU job ad for Head of Data Analytics identified skills in things like Python and R. Such folk might reasonably expect to use Jupyter notebooks for their analysis and reporting. One of the things not covered in this review are the rich interactive dashboarding tools that Jupyter ecosystem supports (eg Voilà).

Appropriating OpenLearn Content and Republishing Edited Versions Of It Via a “Simple” Automated Text Blogging Workflow

I had intended on using my (unpaid) strike days to catch up with some books and harp practice, and maybe even the garden, and keep away from the keyboard; or failing that, to have a push on my rally data tinkering and get another LeanPub book started to try to reboot the £50 a quarter or so my previous publication (Wrangling F1 Data With R) generated, which keeps things like recurring Dropbox and Flickr etc etc charges covered (no-one has ever bought me a KoFi, as far as I can tell…).

And I was determined not to do any of the mounting workload associated with the day job, no matter how much fun some it is likely to be (like getting Ev3devSim working as an ipywidget in Jupyter notebooks).

Whilst I did manage to stick to the determined not to path, I never even really started down the intended one, instead spending hours and hours in front of keyboard trying to hack something together around my OpenLearn publishing workflow.

So here’s what I’ve come up with…

An OpenLearn Unit Text Publishing Thing

Firstly, it’s a thing that lets you grab the “source” content of an OpenLearn unit (at least, some of it; I still haven’t got round to grabbing things like video files or audio files, or scraping PDFs etc.) and churn it into a simple text format, markdown, which looks like this:

Headers are prefixed with a #, you can emphasise things by wrapping it in * characters, eg *italics* -> italics, or **double them up** for strong emphasis. Embedding links — [link text](link/path/file.html) — and images — ![Alt text](path/to/image.file) — is also pretty easy when you get the hang of it.

So how do you get started? First, you need a Github account (sign up here; just get one: you’re not going to have to do any hard Github stuff, you’re just making use of their free hosting). Get one, and sign in.

Second, visit my demo repo — psychemedia/openlearn-publish-test — (the URL will change at some point, but I’ll archive the original and link the new address from it…) and grab a copy of your own repo from mine by clicking the big green Use this template button:

You’ll be presented with a form:

Give your repo a name (no spaces). Optionally add a description. Keep the repo public. And click the big green Create repository from template button.

Things will churn for a moment or two:

And then you’ll have your own repo, containing a copy of the files in mine:

Behind the scenes, there is work going on…

Click the Actions tab on your copy of the repo to see what…

At first, it may look like nothing… but wait a moment or two and refresh the page:

A couple of actions will start running to initialise, and customise, your repo for you.

When the actions are done, you’ll be informed… (you shouldn’t have to refresh the page, the status indicators should update when things are done…):

If you go back to your repo homepage, you’ll see it’s been updated with a new README that’s slightly different to the original copy from my repo, and that has been personalised to yours:

So… now you can grab some OpenLearn content into your repo.

Click on the SET_UP.md file link in your repo:

You will be presented with a list of units on OpenLearn.

Find one you like the look of and click the Grab Unit into this repo link:

This will open a new issue for you in the Issues tab of your repo, and prepopulate it with a title that will tell a Github Action you want to grab some OpenLearn content, and an issue body that tells the action where the unit can be found.

Click the Submit new issue button to get things started.

Back in the Actions tab, you can see the helper elves have started doing their thing again…

If you click on a running Action, you can check its progress in more detail:

Click through on the actual job name to see what’s happening inside:

You can expand a step by clicking the arrow to see what each step is doing or has already done…

If you read through the steps, you’ll see several things are done: for example, we grab some OUXML (the OpenLearn content), convert to markdown, build some HTML files (these are what gets published), and deploy them, then build a LaTeX version of the material (which is used to generate a PDF), and an ePub ebook. (The LaTeX step takes some time; I should perhaps simplify things so that only the HTML build is done by default.)

When the Actions are green circled / green ticked and done (which may take a few minutes…), or at least, when the Deploy HTML to gh-pages step has run, go back to the repo home page, where you should see a new commit has been made to your repo:

If you click into the content folder you’ll see one or more session folders:

If you click into a session folder, you see some markdown files:

If you click on one of those, you’ll see some scraped and converted OpenLearn content:

So the content has been grabbed from OpenLearn and saved to your repo.

But that’s not all.

If you scroll down on your README page (I really should make this link more prominent in the README…) you’ll see a link to a github.io site published from your repo:

Click it…

If you see a “404”, page not found, don’t panic

On the repo home page, select the Settings tab:

and scroll down to the Github Pages area:

Change the Source from gh-pages branch to master branch:

And then, select the master branch:

And set the Source back to the gh-pages branch:

When you see something like this, you knows all good to go:

Note that cacheing of a previous build of the site may last for up to 10 minues, so grab yourself a cup of tea, or perhaps look through the markdown files in the content directory, or even go back to the Actions tab and, if the actions have completed.

If the Actions have completed, select the OpenLearnXML2 (or a completed nbsphinx publisher action if you have committed your own changes to the markdown files) and you should see and the availability of an Artifacts download.

Down load and unzip the artifacts file. If the build process has been able to build a PDF file and/or an ePub file from the content, it will be found in the unzipped downloaded directory.

Right… time to try your github.io site link again:

An OpenLearn Editing Thing

This will have to be in a part two to this post… I’ve run out of time for now and need to get back to the day job…

If you are itching to get started, this may work, if I’ve got my autopublishing things fixed…

In the content folder (on the default master branch of the repo),  find the markdown file you want to edit, and click on the pencil icon to open the editor:

my-oer_Part_00_02_md_at_master_·_psychemedia_my-oer

Edit the file / make the changes you want, and commit it (you may want to set a meaningful commit message title summarising the chnages, and perhaps even a longer description about the motivation for the changes, but both are optional…):

Editing_my-oer_Part_00_02_md_at_master_·_psychemedia_my-oer

Click the big green Commit changes button to commit the changes. If you look in the Actions tab, you should see that an nbsphinx publisher action has started that should publish your changes to your site.

Actions_·_psychemedia_my-oer6

Note that even when the publishing action has generated and pushed updated site pages  to where they need to be, the site may take a few minutes to update because of page cacheing on the Github site.

The Future

One of the spinoffs of this for me was the realisation that I could use Github Actions to run arbitrary code in response to particular events, such as.. commits or issue postings. The current machinery uses a Sphinx / nbSphinx publishing route, but I’ve also started exploring a recipe for Jekyll based Jupyter Book publishing. (Next on the to do list will be an Executable Book project / MyST workflow](https://ebp.jupyterbook.org/en/latest/); I also need to split out the workflows into actions of their own, but I haven’t figured out how to do that for myself yet.) It strikes me that I could bundle all these in the same repo with some way of flagging which build process I want to use. This would allow the user to then republish their material using the publishing tool, and its various peculiarities, customisationa and affordances, of their choice.

Immediate to dos, that may not happen because I’m the only user, I know it’s possible, and I’m not that interested, are to: make the Github Pages / pubished site link more prominent in the README; get movie and audio downloads and embeds working. Also a way of handling PDFs linked from the OpenLearn materials, and perhaps extracting text from those, even, to support republishing…

On the publish side, it would be useful to be able to publis to HTML only by default, with some optional way of invoking the PDF and ePub builds. The ePub build also needs things like title and author setting. The PDF build sometimes breaks, eg due to the inability to detect a bounding box size round a gif image. I maybe need to use another PDF generator, eg some hints here.

I also need to refactor the code, two ways: firstly, a simplification, that uses the bare minimum of packages and just churns the markdown direct from XML in one simple step. Secondly, fixing the current workflow, which stages the XML in a SQLite database, so that the database can properly handly content from multiple unitis and I can reliably churn the md from the database for any single unit. At the moment, I think things pretty much assume there’s content from just a single unit in the database. Putting the md into the db might be useful too… Then I could imagine a datasette powered publishing route too…

As a recent tweet from Martin Hawksey reveals, he’s been blogging about how we can turn Google’s App Script to our own purposes as a hosted code runner, and I think Github now provides a similar opportunity for anyone who wants to appropriate it to that end…

OpenLearn OER (Re)Publishing the Text Way

In response to a provocation, I built a thing that will let you grab an OpenLearn unit, convert it to a simple text format, and publish it on your own website.

[For the next step in this journey, see: Appropriating OpenLearn Content and Republishing Edited Versions Of It Via a “Simple” Automated Text Blogging Workflow.]

It doesn’t require much:

  • if you haven’t got one already, create a Github account (just don’t “ooh, Github, that’s really hard, so I won’t be able to do it…”; just f***ing get an account);
  • visit my repo and read down the page to see what to do…

And what to do essentially boils down to:

As for changing the content – it’s not that hard once you’ve done it a few times and just go with the flow of writing what feels natural… “Easy” to edit text files are in the content directory and you can edit them via the Github website.

Merging Several Binder Configurations

As more and more repositories start to incorporate MyBinder / repo2docker build specifications, more and more building blocks start to appear for how to get particular things running in MyBinder. For example, I have several ouseful-template-repos with various building blocks for getting different databases running in MyBinder, and occasionally require an environment that also loads in a Jupyter-server-proxied application, such as OpenRefine. Other times, I might want to pull in the config for a partculalry ast install, or merge configs someone else has developed to run different sets of notbooks in the same Binderised repo.

But: a problem arises if you want to combine multiple Binder specifications from various repos into a single Binder setup in a single repo – how do you do it?

One way might be for repo2docker to itereate through multiple build steps, one for each Binder specification. There may be clashes, of course, such as conflicting package versions from different specifications, but it would then fall to the user to try to resolve the issue. Which is fine, if Binder is making a best attempt rather than guaranteeing to work.

Assuming that such a facility does not exist, it would require updates to repo2docker, so that’s not something we can easily hack around with ourselves. So how about something where we try to combine the contents of multiple binder/ setup directories ourselves. This is something we can start to do easily enough ourselves, and as a personal tool doesn’t necessarily have to work “properly” and “for everything”: for starters, it only has to work with what we want it to work with. And if it only works so far, getting 80% of the way to a working combined configuration that’s fine too.

So what would we need to do?

Simple list files like apt.txt and requirements.txt could be simply concatenated together, leaving it up to pip to do whatever it does with any clashes in pinned package numbers, for example (though we may want to report possible clashes, perhaps via a comment in the file, to help the user debug things).

In a shell script, something like the following would concatenate files in directories binder_1, binder_2, etc.:

for i in $(ls -d binder_*)
do
   echo >> binder/apt.txt
   echo "# $i" >> binder/apt.txt
   cat "$i/requirements.txt" >> binder/apt.txt
done

In Python, something like:

import os

with open('binder/requirements.txt', 'w') as outfile:
    for d in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)]:
        # Should test: if 'requirements.txt' in os.listdir(d)
        with open(os.path.join(d, 'requirements.txt')) as infile:
            outfile.write(f'\n#{d}\n')
            outfile.write(infile.read())

Merging environment.yml files is a little trickier — the structure within the file is hierarchical — but a package like hiyapyco can help us with that:

import hiyapyco
import fnmatch

_envs = [os.path.join(d, e) for e in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)] if fnmatch.fnmatch(e, '*.y*ml')]

merged = hiyapyco.load(_envs,
                       method=hiyapyco.METHOD_MERGE,
                       interpolate=True)

with open('binder/environment.yml', 'w') as f:
    f.write(hiyapyco.dump(merged))

There is an issue with environments where we have both environment.yml and requirements.txt files because the environments.yml trumps requirements.txt: the former will run but the latter won’t. A workaround I have used in the past for installing from both is to call install from the requirements.txt file by using a directive in the postBuild file to handle the requirements.txt installation.

I’ve also had to use a related trick to install a really dependent Python package explicitly via postBuild and then install from a renamed requirements.txt also via postBuild: the pip installer installs packages in whatever order it wants, and doesn’t necessarily follow any order “specified” in the requirements.txt file. This means that on certain occasions, a build can fail becuase one Python package is relying on another which is specified in the requirements.txt file but hasn’t been installed yet.

Another approach might be to grab any requirements from a (merged) requirements.txt file into an environment.yml file. For example, we can create a “dummy” _environment.yml file that will install elements from our requirements file, and then merge that into an existing environments.yml file. (We’d probbaly guard this with a check that both environment.y*ml and requirements.txt are in binder/):

_yaml = '''dependencies:
  - pip
  - pip:
'''

# if 'requirements.txt' in os.listdir() and 'environment.yml' in os.listdir():

with open('binder/requirements.txt') as f:
    for item in f.readlines():
        if item and not item.startswith('#'):
            _yaml = f'{_yaml}    - {item.strip()}\n'

with open('binder/_environment.yml', 'w') as f:
    f.write(_yaml)

merged = hiyapyco.load('binder/environment.yml', 'binder/_environment.yml',
                       method=hiyapyco.METHOD_MERGE,
                       interpolate=True)

with open('binder/environment.yml', 'w') as f:
    f.write(hiyapyco.dump(merged))

# Maybe also now delete requirements.txt?

For postBuild elements, different postBuild files may well operate in different shells (for example, we may have one that executes bash code, another that contains Python code). Perhaps the simplest way of “merging” this is to just copy over the separate postBuild files and generate a new one that calls each of them in turn.

import shutil

postBuild = ''

for d in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)]:
    if 'postBuild' in os.listdir(d) and os.path.isfile(os.path.join(d, 'postBuild')):
        _from = os.path.join(d, 'postBuild')
        _to = os.path.join('binder', f'postBuild_{d}')
        shutil.copyfile(_from, _to)
        postBuild = f'{postBuild}\n./{_to}\n'

with open('binder/postBuild', 'w') as outfile:
    outfile.write(postBuild)

I’m guessing we could do the same for start?

If you want to have a play, the beginnings of a test file can be found here (for some reason, WordPress craps all over it and deletes half of it if I try to embed it in the sourcecode block etc. (I really should move to a blogging platform that does what I need…)

PoC: Using Git Commit Messages As a CLI

Following an idle wonder last week on Using Git Commit Messages as a Command Line?, I had a play and came up with a demo of sorts: ouseful-testing/action-steps.

The idea is that by creating a Github Action that performs actions based, in part at least, on the contents of a Github commit message, we can start to use commit messages as as a CLI to invoke particular Github Action mediated activities.

My first couple of proofs of concept were:

  • a simple script that replaces one file (the README) with the contents of another. At the moment, both files need to be in the same branch (ideally, the replacement files would be pulled in from another branch but I couldn’t figure out how to do that offhand). If you just make a “dummy” commit to any old file with the commit message Update Readme the README will be updated with the contents of one file. If you use the commit message Reset Readme it will be replaced with the contents of another. My thinking in part, here, is that you could “commit” progress messages as you work through a thing and the README keeps getting updated with the next thing you have to do as you commit to say you’ve done the previous thing.
name: Updates
on: push

jobs:
  update_readme:
    if: (github.event.commits[0].message == 'Update Readme')
    runs-on: ubuntu-latest
    steps:

    - name: Copy Repository Contents
      uses: actions/checkout@v2
        
    - name: commit changes
      run: |
        git config --global user.email "${GH_EMAIL}"
        git config --global user.name "${GH_USERNAME}"
        # git checkout -B fastpages-automated-setup
        mv README2.md README.md
        git add README.md
        git commit -m'Update README'
        git push
      env: 
        GH_EMAIL: ${{ github.event.commits[0].author.email }}
        GH_USERNAME: ${{ github.event.commits[0].author.username }}
        
  reset_readme:
    if: (github.event.commits[0].message == 'Reset Readme')
    runs-on: ubuntu-latest
    steps:

    - name: Copy Repository Contents
      uses: actions/checkout@v2
        
    - name: commit changes
      run: |
        git config --global user.email "${GH_EMAIL}"
        git config --global user.name "${GH_USERNAME}"
        # git checkout -B fastpages-automated-setup
        mv README1.md README.md
        git add README.md
        git commit -m'Reset README'
        git push
      env: 
        GH_EMAIL: ${{ github.event.commits[0].author.email }}
        GH_USERNAME: ${{ github.event.commits[0].author.username }}
  • a simple script that lets you upload one or more zip files as part of a push; if the commit message starts with Unzip, in response to the commit, unzip the committed zip files, and then delete the .zip archive files you had just committed, pushing the unzipped files in their place.
name: Unzip
on:
  push:
    paths:
    - '**.zip'

jobs:
  unzip-files:
    if: startsWith(github.event.commits[0].message, 'Unzip')
    runs-on: ubuntu-latest
    steps:

    - name: Copy Repository Contents
      uses: actions/checkout@v2
      with:
        fetch-depth: 2

    - name: handle zip
      run: |
        git config --global user.email "${GH_EMAIL}"
        git config --global user.name "${GH_USERNAME}"
        for f in $(git diff HEAD^..HEAD --no-commit-id --name-only | grep -E '.zip$')
          do
              echo $f
              fn=`unzip $f | grep -m1 'creating:' | cut -d' ' -f5-`
              echo $fn
              git rm $f
              git add $fn
              git commit -m"unzip $f"
          done
        git push
      env: 
        GH_EMAIL: ${{ github.event.commits[0].author.email }}
        GH_USERNAME: ${{ github.event.commits[0].author.username }}

I did also wonder about whether it would be possible to implement something like Adventure, played by issuing instructions through Git commit messages and maybe updating the readme with the game response to each step… Stepping through the hstory of READMEs would be your game transcript…

Are fastpages Really an EASY Way to Publish a Blog From Jupyter Notebooks?

I tried to submit this to the fast.ai discourse forum, having been invited to do so, but after handing over credentials to get an account so I could log in, then having to go to my email client to click the confirmation code, then not being able to create a new topic (new user policy, maybe?) then having my post quarantined and my account largely suspended, I thought I’d post the text of the post here (glad I took a copy…)

(I appreciate that my behaviour / attitude around this may be seen as both childish and churlish,  but I was originally riled by the “easy” hype around fastpages (because I don’t think it necessarily is easy for anyone other than a particularly select population…) and since then, things have just gone downhill in terms of ease of use / communication!;-) “Just do X” has (just) so much baggage associated with it…

Original post and replies thread

Picking up on a Twitter thread, some comments around the “fastpages supports really easy Jupyter blogging” effusiveness on Twitter.

(Note this isn’t meant to be hostile, it’s meant to be usefully critical ;-)

For any seasoned Github user and developer who’s also been responsible for maintaining documentation sites using Jekyll, fastpages “just” requires folk to use Github and Jekyll style publishing to publish a blog site from notebook files and markdown docs.

For anyone familiar with Github, git, and Jekyll publishing, the fastpages automation simplifies some of the faff required in getting that stuff working. (Other approaches, such as Jupyter Book, ipypublish and nbsphinx offer related publishing routes but less hype. A proper comparison of all the approaches might be useful…)

So if you’re familiar with Github and Jekyll, the benefits are quite possibly both clear and enticing. But if you aren’t a Github user or a Jekyll user, things are pretty much as opaque as every they were.

The fastpages mechanic of generating a PR on the first commit generated when cloning the template repo is really neat, and an idea I’ll likely steal. But for a novice, without mental model of how Github works, this doesn’t in and of itself make things that much easier. The naive user is faced with a complex UI, using complex jargon, and probably doesn’t know where to go looking for the PR, how to handle it, what it means when they do handle it, etc etc.

The file listing on the master home page you’re faced with when cloning the repo is also intimidating. There are a lot of files, there’s lots of directory names starting with scary underscores, lots of `.whatever` hidden files. That’s fine if you’re creating a workflow that’s “easy” for folk who are happy with all this stuff, but if the claim is that this is an “easy route into blogging with Jupyter” in general, it isn’t.

One of the attractive features of the Jupyter notebook UI and infrastructure is that someone with little technical knowledge on the command line can quickly start using magics and high level commands, a line at a time, to get stuff done. Just because someone can plot a chart a from a pandas data frame populate[d] from a loaded in CSV file doesn’t necessarily mean they know how to set up the Jupyterhub server they’re actually a user of, nor even how to install pandas into the environment they’re using. As a *user*, why should they? The same goes for their familiarity, or otherwise, with Github and Jekyll. (By the by, it’s probably best to leave the “but they ought to…” arguments aside…)

I’m all for folk developing skills, but onboarding is really hard. And oftentimes, when trying to persuade people to adopt new tech in conservative institutions, you only get infrequent opportunities to entice them in. If you claim something is easy, that you “just” this and that, then watch their face as confusion and terror reigns, and you’ve lost your conversion opportunity. They won’t try again.

To make things *really* easy means taking things much slower. Cloning the repo and showing a clean page with a very simple set of instructions, and all the scary stuff hidden in branches, provides an opportunity for generating an easy way in. The initial readme could provide a set of very clear instructions about setting up tokens etc, along with why they’re necessary (eg Stephen Downes had a go at simplifying them [here, part 1](https://halfanhour.blogspot.com/2020/02/how-to-use-fastpages.html) and [here, part 2](https://halfanhour.blogspot.com/2020/02/how-to-use-fastpages-2.html)).

Things would also be simple if the all[simpler if all] the Jekyll scaffolding were hidden away somewhere, and the user could just slowly introduce things into the top level directory, the homepage for their blog source files, with all the scaffolding hidden away and built on via branches.

This level of simplicity may or may not be desirable for a (semi-)professional, if ad hoc, tool, but if the desire is to find a way to make it easier for novices (to Github, to Jekyll) to publish in what is still quite a low level way, I think more scaffolding is required. (A limiting case of easy is probably to just click a button on your Jupyter notebook and have the file posted somewhere, from where it magically then appears on a public URL.)

Inspired by the initial commit handling Github Action, I started some baby steps explorations of a way of making “performative” Github commit actions ([action-steps](https://github.com/ouseful-testing/action-steps)) that might (or might not!) make things simpler for a novice user (they also run the risk of them developing bad mental models, but I’m just exploring ideas).

For example, you might encourage someone via the readme to create a new file from the Github web UI with a particular filename or particular commit message, and then handle that in a particular way, perhaps updating the README with the next step; this might include some description of how you could then compare the original readme with the updated one. (I did start wondering whether I could code Adventure to be played via commit messages! Has that been done before I wonder?)

You might have additional commit messages that introduce new files into the top level repo, a file at a time. (Where to put simple documentation describing commit performative commands would be another issue!)

I appreciate this is probably *not* how Github is traditionally used, where a principle of least surprise about what appears in the repo compared to the files you actually commit is a sound one (that said, a lot of workflows do make use of commit hooks that do change files…) But I would argue that using Github for the primary purpose of making use of its Github Pages publishing mechanism is not using Github in a traditional version control application way either. Version control is NOT the aim. So what I’m thinking of here is where the user can instruct Git to add in very particular new files at particular times in response to particular commands issued via a particular commit message for a particular reason: to allow them to incrementally develop the complexity of their environment from within the environment as they grow familiar with it. Along the way, the mechanism could coach an introduce the user to features of Github that may be useful in a blogging context, such as the ability to “track changes” and maintain different versions of a content as you draft it etc. This would then introduce them to version control as a side effect of them developing particular blogging workflow practices in an environment that can coach them as they use it.

This may all just be nonsense, of course!

For some definition of “just”…

Fragment: Hard to Use OpenLearn OU-XML to Markdown Tool, If You Fancy Trying It…

Over the years, I’ve dabbled on and off with OU-XML, the XML document format that OU and OpenLearn texts are mastered in. Over the last year I’ve been exploring convertng OU-XML to the simple markdown text format (eg here).

There are a several advantages to using markdown: firstly, it’s a simple text format; secondly, you can open and edit markdown docs in a Jupyter notebook UI via Jupytext; thirdly, there are well proven (though still fiddly…) workflows for publising websites from markdown source docs (eg on of my experiments here).

As to why editing markdown docs in a notebook UI is useful: for one, you can edit — and preview — Latex, which means you can write maths equations and chemical formulae in a simple text way; for another, you can add code into your document that can embed interactives: for example, my folium magic lets you embed maps with markers or shaperfiles in to the document with a single, relatively straightforward, one-liner; or code to generate charts from data; or create simple interactive applications using ipywidgets. And so on. In short, the notebook is a medium that affords you lots of possibilities for incorporating generated, as well as interactive, content.

Following a proviocation by Marco Kalz / @mkalz yesterday, I cobbled together various bits of code into this repo — innovationOUtside/open-ouxml-tools — which doubles as the src for an installable Python package’n’CLI, that lets you:

  • download and grab the OU-XML for an OpenLearn unit, along with all its image assets, into a SQLite database;
  • generate a set of markdown files from the SQLite database.

With the single test unit I tried it on, it seems to work okay in MyBinder (just click on the button on the repo homepage, than click on the README.md file when the notebook UI loads).

To get the files out, the nbarchive extension is preinstalled into the Binderised environment so you should be able to zip and export the all the generated files.

They could then be uploaded into a clone of something like ouseful-template-repos/oer-md-publish for autopublishing. (That example uses CircleCI as per this). I’ll try to figure out a Github Action way of doing something similar over the next few days, perhaps in a repo that will also grab a specified OpenLEarn unit for you (eg by using a Git commit performative CLI call, for example…?!;-)

Note that I’m still not claiming that this is easy, but I think the pieces are there if anyone wants to work through it and try it out. If folk do play with it, I’m more likely to try to make it a bit easier. But I know that because it isn’t easy, most folk won’t try it. (S’like a built in defense mechanism for me; matched time. If no-one else bothers, I don’t have to either… So if you want this thing to become real, you have to invest time into it now, too…)

PS I’m working on a new way of introducing recipes like this, as TINEWY (tin yui) ones: There Is No Easy Way Yet.