Tagged: jupyter

Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy

Over the last couple of weeks I’ve been circling, but failing to make much actual progress on using, OpenStack as a platform for making self-serve OU hosted VMs available to students. (I’m increasingly starting to thing this is not sensible, but I’m struggling to find someone I can chat to about it…).


One of the issues with the OU Faculty OpenStack setup is the way the security model locks everything down. Not only is no API access available, there is also a limit on IP address allocation and open ports are limited to port 80 (and maybe port 22? Or maybe not.)

For the TM351 VM – which is what we’re looking to put onto OU OpenStack – we have been exposing services on at least two http ports, one for the Jupyter notebooks and one for OpenRefine. (The latest build also has a simple VM webserver, and I’m experimenting with a notebook search engine. Optionally, we have also allowed students to open up ports to the PostgreSQL and MongoDB services.)

If I do find a sensible way to get the VM running on OpenStack, finding a way to shove all the http services through port 80 looks like a necessary requirement. Previously, I’d noticed that @betatim’s openrefineder demo made use of a proxy to expose the OpenRefine service via the Jupyter notebook port, and looking at it again today I noticed that the nbopenrefineproxy package it was using is available as a Jupyterhub project package: jupyterhub/nbserverproxy.

In the current TM351 VM set-up, we have the following:

  • Jupyter notebook on guest port 8888, host port 35180
  • OpenRefine on guest port 3334, host port 35181

However, if I install and enable nbserverproxy, and restart the Jupyter notebook server, I can now find OpenRefine proxied as http://localhost:35180/proxy/3334/ as well as on http://localhost:35181.

One gotcha to note is that the OpenRefine page doesn’t render properly from that URL without the trailing slash because the OpenRefine HTML includes relative links to assets:

<link type="text/css" rel="stylesheet" href="externals/select2/select2.css" />
 <link type="text/css" rel="stylesheet" href="externals/tablesorter/theme.blue.css" />

which resolve as e.g. http://localhost:35180/proxy/externals/select2/select2.css (404).

However, with the trailing slash, the links do resolve correctly (e.g. as http://localhost:35180/proxy/3334/externals/select2/select2.css) when the trailing slash is added.

Handy… and the way to go if we do get this running on OpenStack.

PS if you know of a baby steps tutorial that shows how I can build a custom VM image on a Mac that I can upload to OpenStack, please let me know via the comments. Or otherwise get in touch if you can talk me through the various approaches.

First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.

From the same repo root, you can get:

And from previously, here’s a workaround for displaying R/HTMLwidgets in a Jupyter notebook.

OpenRefine is also available from a simple URL – https://mybinder.org/v2/gh/betatim/openrefineder/master?urlpath=openrefine – courtesy of betatim/openrefineder:

Perhaps it’s time for me to try to get my head round what the Jupyter notebook proxy handlers are doing…

PS see also Scripted Forms for a simple markdown script way of building interactive Jupyter widget powered UIs.

More Thoughts On Jupyter Notebook Search

Following on from initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it’s located in a document.

The concordancer means we can offer a results listing more in accordance with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I’d need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display the dumping the contents of a complete cell into the results listing.

The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming that we could apply is not best suited to indexing code.

Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as %load_ext, aren’t recognised; instead, they’re split into separate tokens: % and load_ext.

A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.

Jigsaw Pieces – Linux Service Indicators, Jupyter Kernel Monitoring and Environment Management

Something I’ve been pondering for some time is how to set up some simple Linux service monitoring so that I can display an in indicator light in a web page to show whether a Linux service is running or not.

For example, in the TM351 VM, it could be handy to display some indicator lights in a Jupyter notebook status bar showing whether the database services we connect to from the notebooks are running correctly,

So here are some pieces that may contribute to that:

My thinking is:

  • use monit to monitor a process; if the process is down, write to a service status file in my www server directory, eg service_servicename_status.txt. If a service is running the contents of this file are 1, otherwise 0;
  • use the JQuery fragment to poll the status file every few seconds;
  • if the status file returns 0, display a red indicator, otherwise green.

Here are some other monitoring / environment managing fragments I’m pondering:

  • something like ps_mem, a Python utility *to accurately report the in core memory usage for a program*. I’m wondering if I could use that to track how much memory each Jupyter notebook python kernel is taking up (or maybe monit can do that?) There’s an old extnesion that looks like ti shows reports: nbtop. Or perhaps use psutil (via this issue, which seems to offer a solution?);
  • a minimal example of setting up notebook homepage tab for a hello world webpage; Writing a notebook server extension looks like it has the ingredients, and nb_conda provides a fuller working example. Actually, that extension looks useful for *Jupyter-as-a-learning-environment* because it lets you select different conda environments, which could be handy for running different activities.

Any other examples out there of Jupyter monitoring / environment management?

Interactive Authoring Environments for Reproducible Media: Stencila

One of the problems associated with keeping up with tech is that a lot of things that “make sense” are not the result of the introduction or availability of a new tool or application in and of itself, but in the way that it might make a new combination of tools possible that support a complete end to end workflow or that can be used to reengineer (a large part of) an existing workflow.

In the OU, it’s probably fair to say that the document workflow associated with creating course materials has its issues. I’m still keen to explore how a Jupyter notebook or Rmd workflow would work, particularly if the authored documents included recipes for embedded media objects such as diagrams, items retrieved from a third party API, or rendered from a source representation or recipe.

One “obvious” problem is that the Jupyter notebook or RStudio Rmd editor is “too hard” to work with (that is, it’s not Word).

A few days ago I saw a tweet mentioning the use of Stencila with Binderhub. Stencila? Apparently, *”[a]n open source office suite for reproducible research”. From the blurb:

[T]oday’s tools for reproducible research can be intimidating – especially if you’re not a coder. Stencila make reproducible research more accessible with the intuitive word processor and spreadsheet interfaces that you and your colleagues are already used to.

That sounds appropriate… It’s available as a desktop app, but courtesy of minrk/jupyter-dar (I think?), it runs on binderhub and can be accessed via a browser too:


You can try it here.

As with Jupyter notebooks, you can edit and run code cells, as well as authoring text. But the UI is smoother than in Jupyter notebooks.

(This is one of the things I don’t understand about colleagues’ attitude towards emerging tech projects: they look at today’s UX and think that’s it, because that’s how it is inside an organisation – you take what you’re given and it stays the same for decades. In a living project, stuff tends to get better if it’s being used and there are issues with it…)

The Jupyter-Dar strapline pitches “Jupyter + DAR compatibility exploration for running Stencila on binder”. Hmm. DAR? That’s also new to me:

Dar stands for (Reproducible) Document Archive and specifies a virtual file format that holds multiple digital documents, complete with images and other assets. A Dar consists of a manifest file (manifest.xml) that describes the contents.

Dar is being designed for storing reproducible research publications, but the underlying concepts are suitable for any kind of digital publications that can be bundled together with their assets.

Repo: [substance/dar](https://github.com/substance/dar)

Sounds interesting. And which reminds me: how’s OpenCreate coming along, I wonder? (My permissions appear to have been revoked again; or the URL has changed.)

PS seems like there’s more activity in the “pure web” notebook application world. Hot on the heels of Mike Bostock’s Observable notebooks (rationale) comes iodide, “[a] frictionless portable notebook-style interface for literate scientific computing in the browser” (examples).

I don’t know if these things just require you to use Javascript, or whether they can also embed things like Brython.

I’m not sure I fully get the js/browser notebooks yet? I like the richer extensibility of things like Jupyter in terms of arbitrary language/kernel availability, though I suppose the web notebooks might be able to hook into other kernels using similar mechanics to those used by things like Thebelab?

I guess one advantage is that you can do stuff on a Chromebook, and without a network connection if you cache all the required JS packages locally? Although with new ChromeOS offering support for Linux – and hence, Docker containers – natively, Chromebooks could get a whole lot more exciting over the next few months. From what I can tell, corsvm looks like a ChromeOS native equivalent to something like Virtualbox (with an equivalent of Guest Additions?). It’ll be interesting how well things like audio works? Reports suggest that graphical UIs will work, presumably using some sort of native X11 support rather than noVNC, so now could be a good time to start looking out for souped up Pixelbook…

Keeping Up With OpenRefine – Database Connections

It’s been a few months since I last checked out updates to OpenRefine, but reading a (completed) phase 1 project plan associated with some funding the OpenRefine Foundation received from Google News Labs it looks like database support is on the cards.

Database Table import/export – COMPLETED

Historically, OpenRefine has been limited compared to other data tools in that it does not have a way to connect to a database table. This is especially useful at export time, when there is a need to save a cleaned CSV for example into a database table. Importing from a database is useful also. It can help to join clean data in a database table against messy data in OpenRefine, in order to clean and prepare it for use. Database Drivers exist for many databases such as Oracle, MySQL, Postgres, and even many schema-less databases such as MongoDB. Most database drivers use JDBC which makes it easier for us to develop against, and others typically use a custom Java driver that sometimes is non-trivial to integrate with. Since OpenRefine is built with Java this should be relatively straightforward to utilize existing JDBC drivers for our import/export operations and for support of MongoDB there is a Java driver available.

Looking through the repo, it looks like there are a couples of related PRs:

I’m not sure about the export to a db?

The tests suggest drivers are in place for PostgreSQL, MySQL and MariaDB:

public class DatabaseTestConfig extends DBExtensionTests {

private DatabaseConfiguration mysqlDbConfig;
private DatabaseConfiguration pgsqlDbConfig;
private DatabaseConfiguration mariadbDbConfig;

It also looks like an upgrade to the internal data representation may be being considered: Research Apache Arrow to improve in-memory data model. FWIW, I think Apache Arrow really is one to watch.

Via the OpenRefine Google Group, I also noticed a couple of references to future planned activity / roadmap items:

Phase 2

Front / Backend separation

Scope: completely separating the backend so that an full API can be exposed for all OpenRefine operations and commands. Once the decoupling done, we can move to a modern front end framework and
Deliverable: Functional and documented API covering all the commands available in OpenRefine 3 front end.

Phase 3
R Lang support
Work with community to bring support for R lang via an extension.
There is significant use of statistics within News Organizations where the goal of minimizing the back and forth between R tooling and OpenRefine would be explored and assessed by the community.

rrefine is around and needs investigation – https://github.com/vpnagraj/rrefine

Hmmm… rrefine?

rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby. rrefine extends this functionality to R users.

Okay – that makes me think of the OpenRefine Python Client Library?

But how about that Edit cells &gt; Transform &gt; Language support for R #1226` issue? “This is a feature-request to add R support in Edit cells > Transform > Language.”

That fits in with an earlier thought I had along the lines of “what if OpenRefine was a Jupyter client?” In an imagining frame of mind, this seems to me to offer a couple of potential benefits:

  • if the Transform &gt; Language utility supports hooks into a Jupyter kernel and exposes an executable code cell onto that (state persisting) kernel, and the data can be transferred efficiently using serialisations like feather or deeper hooks into Apache Arrow representations that might be supported in R or Python pandas, then any language with a Jupyter kernel could be used for transformations?
  • if OpenRefine was exposed as a panel in Jupyterlab, which it presumably could be simply by embedding the HTML UI in an IFrame, then it have a role as part of the look and feel of a single working environment, even if it was only loading and saving CSV files into the environment workspace.

But then let’s imagine something a bit more extreme (I’m not sure if / how this might fit into the Jupyterlab architecture, indeed whether it’s possible or just imagine magic, I’m just riffing…): if the data being manipulated within OpenRefine could be synched with a representation of the data being manipulated elsewhere in the Jupyterlab environment, then we could be viewing a dataset in one panel (Jupyterlab has crazy efficient support for viewing large datafiles), manipulating it in an OpenRefine panel, and running analysis scripts over it in a third. The reticulate package suddenly comes to mind here as an example of accessing data objects from one environment in another.

It also strikes me that use cases of the data represented in OpenRefine reflecting updates to the data from the analysis environment are less likely. The analysis should be operating on data after it has been cleaned, rather than passing it to OpenRefine?

PS by the by, if you want to run OpenRefine using the Jupyter ecosystem Binderhub machinery, here’s a proof of concept from @betatim: openrefineder.

Generating Printable MS Word Versions of Merged Jupyter Notebooks

One of the issues we know students have with the Jupyter notebooks that we provide as part of the course is that there is no straightforward way of printing them them all out for offscreen reading / annotation. (As well as code, there is a certain amount of practical and code related explanatory material in the notebooks.)

One of the things I started to doodle with last year was a simple script to merge several notebooks than then render the result as a Microsoft Word doc. This has a dependency on pandoc, though not LaTeX and requires that the conversion takes place via HTML: ipynb is converted to HTML using nbconvert , then from HTML to docx. If there are image files transcluded into the notebook, this also means that the pandoc conversion process needs to be executed in the same directory as the notebook so that the image paths are correctly recognised. (When running nbconvert with the html_embed output, pandoc fell over.)

Having to run pandoc in a local, image path respecting directory is a pain because it means I can’t run it over a merged notebook file composed of notebooks from multiple directories. Which means that I have to generate a separate docx file for the notebooks in each separate directory. Whilst I could more this into the same directory to make accessing them all a bit easier, it still means students have to print out multiple documents. I did try using a python package to merge the Word docs, but it borked on the images.

There are Python packages that can merge PDF documents in a more reliable way, but I am having issues with getting a sensible PDF workflow together. In the first case, for pandoc to render documents to  PDF seems to require the texlive-xetex package, which adds considerable weight to the VM (and I don’t know the dependency voodoo required to get a minimum viable LaTeX distribution in place). In the second, my test notebooks included a pymarkdown inline element that embedded a pandas dataframe in a markdown cell and this seemed to break the pandoc PDF conversion at that point.

One thing I haven’t done yet is look at customising the output templates so that we can brand the exported documents. For this, I need to look at custom templates.

My initial sketch code for the ‘export merged notebooks in a directory as docx’ routine is available via this gist. One thing I need to do is wrap it in a simple CLI command. Comments / suggestions for improvement, or links to better alternatives, more than welcome!

Initial Sketch – Searching Jupyter Notebooks Using lunr

Coming round as it is to that time of year for updating, testing and freezing/”gold mastering” the TM351 VM that we distribute to students for the October presentation of our Data Analysis and Management course,  I’ve been thinking about how we can make the VM more useful for students, and whether the things we’re looking at might also be useful in an Institute of Coding context (I’m on a workpackage looking at infrastructure to support coding education: please get in touch if you’re up for a conversation around such matters:-)

One of the things I’ve been pondering is how to search across notebooks – a lot of the TM351 teaching material is in notebooks and there’s no obvious way of searching over them. (There’s also no obvious way of printing them all out in one go, or saving them to a merged document – I’ll post more about that in separate post…)

In my sketches for the new VM, I’ve added a simple python webserver that exposes a homepage that links to the various services running inside the VM. (Ideally, there’d also be indicator lights showing whether the associated Linux service is running or no: anyone know of a simple package to help with that?)

This made me think that it might be useful to provide simple search tool over the notebooks in the (shared) directory that the VM shares with the host.

One way of doing this might be to put the notebook content into a simple sqlite database and serve it using datasette, or query it via a Scripted Form style UI. SQLite has a full text search extension (FTS3-5) and some support for fuzzy matching (eg spellfix1), although I’m note sure how well it would fare as a code search engine.

But I also came across a lightweight Javascript search engine called lunr“[a] bit like Solr, but much smaller and not as bright” – and an example of How [Matthew Daly] Added Search to [His] Site With Lunr.js so I thought I’d give that a go…

At the moment, I’m only testing against a couple of notebooks. The search results are at the markdown cell level, so if a cell contains a lot of text, the whole cell will be displayed, which may not be optimal. I’m rendering the cell markdown as HTML in the browser using the Showdown Javascript package although this could be disabled to show just the raw markdown. My guess is that any relatively linked images embedded in the markdown will show as broken.

The search terms are supposed to be highlighted using mark.js, but while I had it working in a preliminary sketch, it seems to be borked now and I’m not sure where I’m setting it up incorrectly or using it wrong.

It strikes me that if a markdown cell in the results contains a lot of text, it might be worth trying to identify where in the text the query terms appear and then prune the result text around them.

I’m making no attempt to search code cells, though I did think about trying to extract lines of comment text using a crib along the lines of if LINE.strip().startswith('#').

I’m generating the lunr index using lunr.py and saving it along with a store of the cell content in a JSON file that’s loaded into the search page. Whilst I’m testing the search paged served from a simple Python httpserver, it struck me that it could also be served along a /view path in the Jupyter notebook context. When I first tried this, using JSON data loaded in to the search page using JQuery as a JSON object, I got a CORS error. Rather than waste too much time trying to solve that (I wasted a little!) I worked around it instead and loaded my lunr.json search index and store in to the page as JSONP instead.

One thing I need to do is provide an easy to use tool to generate the search index and lookup store from a set of notebooks. (In the TM351 VM context, this would be in the context of the mounted /shared notebooks folder that the notebook server runs at the top of.)

There still needs to be some clear thinking about what to link to – my initial thought is to link to the notebook running in the VM. If anchors are in the original markdown cell text it should be be possible to deeplink to those. It might also be possible to link to an HTML render of the notebook. This could be done via nbconvert (although I am not currently running this as a service in the VM) or perhaps as an in-browser rendering of the .ipynb JSON using something like Notebook.js / nbpreview. (FWIW, I also note react-jupyter).

But if nothing else, this is a thing that can be used and poked around to find out where it’s most painful in use and how it can be improved. A couple of things that immediately come to mind in terms of Jupyter integration, for example:

  • Jupyter notebook classic UI could come with a ‘Search notebooks’ tab and maybe a search indexer running in the background as and when notebooks in scope are saved);
  • JupterLab could be extended with a lun based notebook search plugin.

Code for my initial pencil sketch of a lunr Jupyter notebook markdown cell search tool can be found in this gist.

PS via Grant Nestor on the Jupyter Google group:

grep –include=’*.ipynb’ –exclude-dir=’.ipynb_checkpoints’ -rliw . -e ‘search query’

This will search your Jupyter server root recursively for files that contain the whole word (case-insensitive) “search query” and only return the file names of matches.

More info: https://stackoverflow.com/questions/16956810/how-do-i-find-all-files-containing-specific-text-on-linux

Importing Functions From DevTesting Jupyter Notebooks

One of the ways I use Jupyter notebooks is as sketchbooks in which some code cells are used to develop useful functions and other are used as “in-passing” develop’n’test cells that include code fragments on the way to becoming useful as part of a larger function.

Once a function has been developed, it can be a pain getting it into a form where I can use it in other notebooks. One way is to copy the function code into a separate python file that can be imported into another notebook, but if the function code needs updating, this means changing it in the python file and the documenting notebook, which can lead to differences arising between the two versions of the function.

Recipes such as Importing Jupyter Notebooks as Modules provide a means for importing the contents of a notebook as a module, but they do so by executing all code cells.

So how can we get round this, loading – and executing – just the “exportable” cells, such as the ones containing “finished” functions, and ignoring the cruft?

I was thinking it might be handy to define some code cell metadata (‘exportable’:boolean, perhaps), that I could set on a code cell to say whether that cell was exportable as a notebook-module function or just littering a notebook as a bit of development testing.

The notebook-as-module recipe would then test to see whether a notebook cell was not just a code cell, but an exportable code cell, before running it. The metadata could also hook into a custom template that could export the notebook as python with the code cells set to exportable:False commented out.

But this is overly complicating and hides the difference between exportable and extraneous code cells in the metadata field. Because as Johannes Feist pointed out to me in the Jupyter Google group, we can actually use a feature of the import recipe machinery to mask out the content of certain code cells. As Johannes suggested:

what I have been doing for this case is the “standard” python approach, i.e., simply guard the part that shouldn’t run upon import with if __name__=='__main__': statements. When you execute a notebook interactively, __name__ is defined as '__main__', so the code will run, but when you import it with the hooks you mention, __name__is set to the module name, and the code behind the if doesn’t run.

Johannes also comments that “Of course it makes the notebook look a bit more ugly, but it works well, allows to develop modules as notebooks with included tests, and has the advantage of being immediately visible/obvious (as opposed to metadata).”

In my own workflow, I often make use of the ability to display as code cell output whatever value is returned from the last item in a code cell. Guarding code with the if statement prevents the output of the last code item in the guarded block from being displayed. However, passing a variable to the display() function as the last line of the guarded block displays the output as before.


So now I have a handy workflow for writing sketch notebooks containing useful functions + cruft from which I can just load in the useful functions into another notebook. Thanks, Johannes :-)

PS see also this hint from Doug Blank about building up a class across several notebook code cells:

Cell 1:

class MyClass():
def method1(self):

Cell 2:

class MyClass(MyClass):
def method2(self):

Cell 3:

instance = MyClass()

(WordPress really struggles with: a) markdown; b) code; c) markdown and code.)

See also: https://github.com/ipython/ipynb

PPS this also looks related but I haven’t tried it yet: https://github.com/deathbeds/importnb

Potential Issues With Institutionally Mediated Reproducible Research Environments

One of the advantages, for me, of the Jupyter Binderhub enviornment is that it provides with a large amount of freedom to create my own computational environment in the context of a potentially managed institutional service.

At the moment, I’m lobbying for an OU hosted version of Binderhub, probably hosted via Azure Kubernetes, for internal use in the first instance. (It would be nice if we could also be part of an open and federated MyBinder provisioning service, but I’m not in control of any budgets.) But in the meantime, I’m using the open MyBinder service (and very appreciative of it, too).

To test the binder builds locally, I use repo2docker, which is also used as part of the Binderhub build process.

What this all means is that I should be able to write – and test – notebooks locally, and know that I’ll be able to run them “institutionally” (eg on Binderhub).

However, one thing I noticed today was that notebooks in a binder container that was running okay, and that still builds and runs okay locally, have broken when run through Binderhub.

I think the error is a permissions error in creating temporary directories or writing temporary image files in either the xelatex commandline command used to generate a PDF from the LaTeX script, or the ImageMagick convert command used produce an image from the PDF which are both used as part of some IPython magic that renders LaTeX tikz diagram generating scripts. It certainly affects a couple of my magics. (It might be an issue with the way the magics are defined too. But whatever the case, it works for me locally but not “institutionally”.)

Broken notebook: https://mybinder.org/v2/gh/psychemedia/showntell/maths?filepath=Mechanics.ipynb
magic code: https://github.com/psychemedia/showntell/tree/maths/magics/tikz_magic
Error is something to do with the ImageMagick convert command not converting the .pdf to an image. At least one of issues seems to be that ghostscript is lost somewhere?

So here’s the issue. Whilst the notebooks were running fine in a container generated from an image that was itself created presumably before a Binderhub update, rebuilding the image (potentially without making any changes to the source Github repository) can lead to notebooks that were running fine to break.

Which is to say, there may be a dependency in the way a repository defines an environment on some of the packages installed by the repo2docker build process. (I don’t know if we can fully isolate out these dependencies by using a Dockerfile to define the environment rather than apt.txt and requirements.txt?)

This raises a couple of questions for me about dependencies:

  • what sort of dependency issues might there be in components or settings introduced by the jupyter2repo process, and how might we mitigate against these?
  • are there other aspects of the Binderhub process that can produce breaking changes that impact on notebooks running in a repository that specifies a computational environment run via Binderhub?

Institutionally, it also means that environments run via an institutionally supported Binderhub environment could break downstream environments (that is, ones run via Binderhub) through updates to the Binderhub environment.

This is a really good time for this to happen to me, I think, because it gives me more things to think about when considering the case for providing a Binderhub service institutionally.

On the other hand, it means I can’t update any of the other repos that use the tikz or asymptote magic until I find the fix because otherwise they will break too…

Should users of the institutional service, for example, be invited to define test areas in their Binder repositories (for example, using nbval) that the institution can use as test cases when making updates to the institutional service? If errors are detected through the running of these tests by the institutional service provider against their users’ tests, then the institutional service provider could explore whether the issue can be addressed by their update strategy, or alert the Binderhub user there may be breaking changes and how to explore what they are or mitigate against them. (That is, perhaps it falls to the institutional provider to centrally explore the likely common repercussions of a particular update and identify fixes to address them?)

For example, there might be dependencies on particular package version numbers. In this case, the user might then either want to update their own code, or add in a build requirement that regresses the package to the desired version. (Institutional providers might have something to say about that if the upgrade was for valid security reasons, though running things in isolation in containers should reduce that risk?) Lists of affected packages could also be circulated to other users using the same packages, along with mitigation strategies for coping with updates to the institutionally provided service.

There are also updating issues associated with a workflow strategy I am exploring around Binderhub which relates to using “base containers” to seed Binderhub builds (Note On My Emerging Workflow for Working With Binderhub). For example, if a build uses a “latest” tagged base image, any updates to that base image may break things built on top of it. In this case, mitigating against update risk to the base container is achieved by building from a specifically tagged version of the container. However, if an update to the Binderhub environment can break notebooks running on top of a particularly labelled base container, the fix for the notebooks may reside in making a fix to the environment in the base container (for example, which specifically acts to enforce a package version). This suggests that the base container might need doubly tagging – one tag paying heed to the downstream end users (“buildForExptXYZ”) – and the other that captures the upstream Binderhub environment (“BinderhubBuildABC”).

I’m also wondering know about where responsibility arises for maintaining the integrity of the user computing environment (that is, the local computational environment within which code in notebooks should continue to operate once the user has defined their environment). Which is to say, if there are changes to the wider environment that somehow break that local user environment, who should help fix it? If the changes are likely to impact widely, it makes sense to try to fix it once and then share the change, rather than expecting every user suffering from the break to have to find the fix independently?

Also, I’m wondering about classes of error that might arise. For example, ones that can be fixed purely by changing the environmental definition (baking package versions into config files, for example, which is probably best practice anyway) and ones that require changes to code in notebooks?

PS Hmm.. noting… are whitelists and blacklists also specifiable in Binderhub config? eg https://github.com/jupyterhub/mybinder.org-deploy/pull/239/files

        c.GitHubRepoProvider.banned_specs = [