Tagged: jupyter

Potential Issues With Institutionally Mediated Reproducible Research Environments

One of the advantages, for me, of the Jupyter Binderhub environment is that it provides me with a large amount of freedom to create my own computational environment in the context of a potentially managed institutional service.

At the moment, I’m lobbying for an OU hosted version of Binderhub, probably hosted via Azure Kubernetes, for internal use in the first instance. (It would be nice if we could also be part of an open and federated MyBinder provisioning service, but I’m not in control of any budgets.) But in the meantime, I’m using the open MyBinder service (and very appreciative of it, too).

To test the binder builds locally, I use repo2docker, which is also used as part of the Binderhub build process.

What this all means is that I should be able to write – and test – notebooks locally, and know that I’ll be able to run them “institutionally” (eg on Binderhub).

However, one thing I noticed today was that notebooks that were running okay in a Binder container, and that still build and run okay locally, have broken when run through Binderhub.

I think the error is a permissions error when creating temporary directories or writing temporary image files, either in the xelatex commandline command used to generate a PDF from the LaTeX script, or in the ImageMagick convert command used to produce an image from the PDF. Both commands are used as part of some IPython magic that renders LaTeX tikz diagram generating scripts, and the problem certainly affects a couple of my magics. (It might be an issue with the way the magics are defined, too. But whatever the case, it works for me locally but not “institutionally”.)
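
For reference, the pipeline the magic relies on looks roughly like the following sketch (this is not the actual magic code; file names and options are illustrative): render the TikZ source to a PDF with xelatex, then rasterise the PDF with ImageMagick’s convert, which in turn calls out to ghostscript for the PDF handling.

import os
import subprocess
import tempfile

tikz_src = r"""\documentclass{standalone}
\usepackage{tikz}
\begin{document}
\begin{tikzpicture}\draw (0,0) -- (1,1);\end{tikzpicture}
\end{document}"""

# the working directory needs to be writable by the notebook user
workdir = tempfile.mkdtemp()
tex_path = os.path.join(workdir, "diagram.tex")
with open(tex_path, "w") as f:
    f.write(tikz_src)

# LaTeX source -> PDF
subprocess.run(["xelatex", "-output-directory", workdir, tex_path], check=True)
# PDF -> PNG (convert relies on ghostscript to read the PDF)
subprocess.run(["convert", "-density", "150",
                os.path.join(workdir, "diagram.pdf"),
                os.path.join(workdir, "diagram.png")], check=True)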

Broken notebook: https://mybinder.org/v2/gh/psychemedia/showntell/maths?filepath=Mechanics.ipynb
Magic code: https://github.com/psychemedia/showntell/tree/maths/magics/tikz_magic
Error is something to do with the ImageMagick convert command not converting the .pdf to an image. At least one of the issues seems to be that ghostscript is getting lost somewhere?

So here’s the issue. Whilst the notebooks were running fine in a container generated from an image that was itself presumably created before a Binderhub update, rebuilding the image (potentially without making any changes to the source Github repository) can cause notebooks that were previously running fine to break.

Which is to say, the environment a repository defines may depend on some of the packages installed by the repo2docker build process itself. (I don’t know whether we can fully isolate these dependencies by using a Dockerfile to define the environment rather than apt.txt and requirements.txt?)

This raises a couple of questions for me about dependencies:

  • what sort of dependency issues might there be in components or settings introduced by the repo2docker process, and how might we mitigate against them?
  • are there other aspects of the Binderhub process that can produce breaking changes impacting notebooks run from a repository that specifies a computational environment via Binderhub?

Institutionally, it also means that updates to an institutionally supported Binderhub service could break the downstream environments (that is, the user-defined environments run via that Binderhub) that depend on it.

This is a really good time for this to happen to me, I think, because it gives me more things to think about when considering the case for providing a Binderhub service institutionally.

On the other hand, it means I can’t update any of the other repos that use the tikz or asymptote magic until I find the fix because otherwise they will break too…

Should users of the institutional service, for example, be invited to define test areas in their Binder repositories (for example, using nbval) that the institution can use as test cases when making updates to the institutional service? If the institutional service provider detects errors when running their users’ tests, they could explore whether the issue can be addressed by their update strategy, or alert the Binderhub user that there may be breaking changes, along with guidance on how to explore what they are or how to mitigate against them. (That is, perhaps it falls to the institutional provider to centrally explore the likely common repercussions of a particular update and identify fixes to address them?)
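
As a rough sketch of what such an institution-side check might look like (assuming nbval is installed, and assuming a hypothetical convention whereby each repository nominates its test notebooks in a tests/ directory):

import pytest

# run the nominated notebooks via the nbval pytest plugin; a non-zero exit code
# means at least one notebook no longer reproduces its stored outputs
exit_code = pytest.main(["--nbval", "tests/"])
if exit_code != 0:
    print("Notebook outputs changed - possible breaking change in the updated environment")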

For example, there might be dependencies on particular package version numbers. In this case, the user might then either want to update their own code, or add a build requirement that pins the package back to the desired version. (Institutional providers might have something to say about that if the upgrade was made for valid security reasons, though running things in isolation in containers should reduce that risk?) Lists of affected packages could also be circulated to other users of the same packages, along with mitigation strategies for coping with updates to the institutionally provided service.

There are also updating issues associated with a workflow strategy I am exploring around Binderhub, which relates to using “base containers” to seed Binderhub builds (Note On My Emerging Workflow for Working With Binderhub). For example, if a build uses a “latest” tagged base image, any updates to that base image may break things built on top of it. In this case, the update risk can be mitigated by building from a specifically tagged version of the container. However, if an update to the Binderhub environment can break notebooks running on top of a particular tagged base container, the fix for the notebooks may reside in making a fix to the environment in the base container (for example, one that specifically enforces a package version). This suggests that the base container might need dual tagging – one tag paying heed to the downstream end users (“buildForExptXYZ”), and the other capturing the upstream Binderhub environment (“BinderhubBuildABC”).

I’m also wondering now about where responsibility lies for maintaining the integrity of the user computing environment (that is, the local computational environment within which code in notebooks should continue to operate once the user has defined their environment). Which is to say, if there are changes to the wider environment that somehow break that local user environment, who should help fix it? If the changes are likely to have a wide impact, it makes sense to try to fix the problem once and then share the change, rather than expecting every user suffering from the break to find the fix independently?

Also, I’m wondering about the classes of error that might arise. For example, ones that can be fixed purely by changing the environment definition (baking package versions into config files, for example, which is probably best practice anyway), and ones that require changes to the code in notebooks?

PS Hmm.. noting… are whitelists and blacklists also specifiable in Binderhub config? eg https://github.com/jupyterhub/mybinder.org-deploy/pull/239/files

binderhub:
  extraConfig:
    bans: |
      c.GitHubRepoProvider.banned_specs = [
        '^GITHUBUSER/REPO.*'
      ]

Fragments – Multi-Tasking In Jupyter Notebooks and Postgres in Binderhub

A couple of things that I thought might be handy:

  • multi-tasking in Jupyter notebook cells: nbmultitask; running actions in a notebook cell usually blocks that cell until they complete; this extension – and the associated examples.ipynb notebook – shows various ways of running non-blocking threads from notebook cells (a minimal threading sketch appears after this list).
  • a Binderhub compliant Dockerfile for running Postgres alongside a Jupyter notebook; I posted a fragment here that demos a connection if you run the repo via Binderhub. It requires the user to start and stop the postgres server, and it also escalates the jovyan user privileges via sudo, which could be handy for TM351 purposes. Now pretty much all I need to see a demo of is running OpenRefine too. I think the RStudio Jupyter proxy route is being generalised to make that possible… and, via @choldgraf, it seems as if there is an issue relating to this: https://github.com/jupyterhub/binder/issues/40 .
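
The threading sketch mentioned above – this uses plain threading rather than the nbmultitask API, just to illustrate the non-blocking pattern:

import threading
import time

progress = {"count": 0}

def worker():
    # stand-in for a long-running job
    for _ in range(10):
        time.sleep(1)
        progress["count"] += 1

# the cell returns immediately; check progress["count"] from a later cell
threading.Thread(target=worker, daemon=True).start()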

Fragment – Wizards From Jupyter RISE Slideshows

Note to self, as much as anything…

At the moment I’m tinkering with a couple of OU hacks that require:

  • a login prompt to log in to OU auth
  • a prompt that requests what service you require
  • a screen that shows a dialogue relating to the desired service, as well as a response from that service.

I’m building these up in Jupyter notebooks, and it struck me that I could create a simple, multi-step wizard to mediate the interaction using the Jupyter RISE slideshow extension.

For example, the first slide is the login, the second slide the service prompt, the third screen the service dialogue, maybe with child slides relating to that?

(Hmm, thinks – would be interesting if RISE supported even more non-linear actions over and above its 1.5D nature? For example, branched logic, choosing which of N slides to go to next?)

Anyway, just noting that as an idea: RISE live notebook slideshows as multi-step wizards.

OERs in Practice: Repurposing Jupyter Notebooks as Presentations

Over coffee following a maps SIG meeting last week, fellow Jupyter notebooks enthusiast Phil Wheeler wondered about the extent to which tutors / Associate Lecturers might be able to take course materials delivered as notebooks and deliver them in tutorials as slideshows.

The RISE notebook extension allows notebooks to be presented using the reveal.js HTML presentation framework. The slides essentially provide an alternative client to the Jupyter notebook and can be autolaunched using Binder. (I’m not sure if the autostart slideshow can also be configured to Run All cells before it starts?)

To see how this might work, I marked up one of the notebooks in my showntell/maths demo, setting some of the cells to appear as slides in a presentation based on the notebook.

I also used the Hide Input Jupyter extension to hide code cell inputs so that the code used to generate an output image or interactive could be hidden from the actual presentation.

Get into the slideshow editor mode from the notebook View menu: select Cell Toolbar and then Slideshow. Reset the notebook display using View > Cell Toolbar > None.

To run the presentation with code cell outputs pre-rendered, you first need to run all the cells: from the notebook Cell menu, select Run All to execute them. You can now enter the slideshow using the Enter/Exit RISE Slideshow toolbar button (it looks like a bar chart). Exit the presentation using the cross in the top left of the slideshow display.

[Live demo on Binderhub]

PS building on the idea of mapping notebook cells into a reveal.js tagged display, I wonder if we could do something similar using a scrollytelling framework such as scrollama, scrollstory or idyll?

Note On My Emerging Workflow for Working With Binderhub

Yesterday saw the public reboot of Binder / MyBinder (which I first wrote about a couple of years ago here), as reported in The Jupyter project blog post Binder 2.0, a Tech Guide and this practical guide: Introducing Binder 2.0 — share your interactive research environment.

For anyone not familiar with Binder / MyBinder, it’s a service that will launch a fully running Jupyter notebook server and computing environment based on the contents of a Github repository (config files as well as notebooks). What this means is that if you put your Jupyter notebooks into a Github repository, along with one or two simple files that list any Linux or Python packages you need to install in order to run the code in the notebooks (or R packages, and perhaps Rmd files, if you also install an R kernel/RStudio), you can get browser access to that running environment at just the click of a link. And the generosity of whoever is paying for the servers the notebook server runs on.

The system has been rebuilt to use Jupyterhub, with a renaming as far as the codebase goes to Binderhub. There are also several utility tools associated with the project, including the really handy repo2docker that builds a Docker image from the contents of a local folder or Github repository.

One of the things that particularly interested me in the announcement blog posts was the following aspirational remark:

We would love to see others deploy their own BinderHub servers, either for their own communities, or as part of a federated public service of BinderHubs.

I’d love to see the OU get behind this, either directly or under the banner of OpenLearn, as part of an effort to help make Jupyter powered interactive open educational materials available without the need to install any software.

(I tried to pitch it to FutureLearn to help support the OU/FutureLearn Learn to Code for Data Analysis MOOC when we were writing that course, but they weren’t interested…)

One disadvantage is that Binderhub is a stateless service, which means you need to download any notebooks you’re working on and then upload them again yourself if you stop an interactive session: the environment you were working in is personal to you, but it’s also destroyed whenever you close the session (or after a particular amount of time?). So other solutions are required for persisting state (i.e. having a personal file storage area). Jupyterhub is one way to do that (and one of the things we’re starting to explore in the OU at the moment).

Through playing with Binderhub over the last couple of weeks, as part of an attempt to put together some demos showing how Jupyter notebooks can support the creation of educational content containing rich media (images, interactives) generated from specifications contained within the notebook document (think: writing diagrams), I’ve come to the following workflow:

  • create a Github repository to host different builds (example). In my case, these are for different topic areas; but they could be different research projects, courses, data journalism investigations, etc.
  • put each build in a branch (example);
  • work up the build instructions for the environment either using Github/Binder or locally; I was having to use Github/Binder because I was working on a slow network connection that made building my evolving image difficult. But it meant that every time I made a change to the build, it used up Binder resources to do so.
  • if the build is a big one, it can take time to complete. I think that Binder will rebuild the Docker image each time you update the repo, so even if you only update notebook files, *I think* the package installation steps are also run, even though those files *haven’t* changed? To simplify this process, we can instead create a Docker image from our build files and push that to Dockerhub (example).
  • We can then create a new build process for our working repository that pulls the pre-built image (containing all the required packages) and adds in the working notebooks (example).
  • We can also share a minimum viable repository that can be forked to allow other people to use the same environment (example).

One advantage of this route is that it separates “sys admin” concerns – building and installing the required packages – from “working” concerns relating to developing the contents of the notebooks. (I think the working repository that uses the Dockerfile build can also draw on the postbuild file to add in any additional or missing packages, which can then be added to the container build as part of a maintenance step.)

PS picking up on a recent related Downes presentation – Applications, Algorithms and Data: Open Educational Resources and the Next Generation of Virtual Learning – and a response from @jimgroom that I really need to comment back on – Containing the Future of OER – this phrase comes to mind: “syndicated runtime” eg if you syndicate the HTML version of a notebook via an RSS feed with a link back to the Binder runnable version of it…

Some Recent Noticings From the Jupyter Ecosystem

Over the last couple of weeks, I’ve got back into the speaking thing, firstly at an OU TEL show’n’tell event, then at a Parliamentary Digital Service show’n’tell.

In each case, the presentation was based around some of the things you can do with notebooks, one of which was using the RISE extension to run a notebook as an interactive slideshow: cells map on to slides or slide elements, and code cells can be executed live within the presentation, with any generated cell outputs being displayed in the slide.

RISE has just been updated to include an autostart mode that can be demo’ed if you run the RISE example on Binderhub.

Which brings me to Binderhub. Originally known as MyBinder, Binderhub takes the MyBinder idea of building a Docker image based on the build specification and content files contained in a public Github repository, and launching a Docker container from that image. Binderhub has recently moved into the Jupyter ecosystem, with the result that there are several handy spin-off command line components; for example, jupyter-repo2docker lets you build, and optionally push and/or launch, a local image from a Github repository or a local repository.

To follow on from my OU show’n’tell, I started putting together a set of branches on a single repository (psychemedia/showntell) that will eventually(?!) contain working demos of how to use Jupyter notebooks as part of a “generative document” workflow in particular topic areas. For example, for authoring texts containing rich media assets in a maths subject area, or music. (The environment I used for the show’n’tell was my own build (checks to make sure I turned that cloud machine off so I’m not still paying for it!), and I haven’t got working Binderhub environments for all the subject demos yet. If anyone would like to contribute to setting up the builds, or adding to subject specific demos, please get in touch…)

I also prepped for the PDS event by putting together a Binderhub build file in my psychemedia/parlihacks repo so (most of) the demo code would work on Binderhub. I think the only thing that doesn’t work at the moment is the Shiny app demo? The build also includes an RStudio environment, launched from the Jupyter notebook’s New menu. (For an example, see the binder-examples/dockerfile-rstudio demo.)

So – long and short of that – you can create multiple demo environments in a single Github repo using a different branch for each demo, and then launch them separately using Binderhub.

What else…?

Oh yes, a new extension gives you a Shiny-like workflow for creating simple apps from a Jupyter notebook: appmode. This seems to complement the Jupyter dashboards approach by providing an “app view” of a notebook that displays the content of markdown cells and code cell outputs, but hides the code cell contents. So if you’ve been looking for a Jupyter notebook equivalent to R/Shiny app development, this may get you some of the way there… (One of the nice things about the app view is that you can easily “View Source” – and modify that source…)

Possibly related to the appmode way of doing things, one thing I showed in the PDS show’n’tell was how notebooks can be used to define simple API services using the jupyter/kernel_gateway (example). These seem to run okay – locally at least – inside Binderhub, although I didn’t try calling a Jupyter API service from outside the container. (Maybe they can be made publicly available via the jupyterhub/nbserverproxy? Why’s this relevant to appmode? My thinking is that, architecturally, you could separate out concerns, having one or more notebooks running an API that is consumed from the appmode notebook?)
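
For anyone who hasn’t seen the kernel gateway’s notebook-http mode, the pattern, as I understand it, is to annotate a code cell with a comment naming the HTTP verb and path; whatever the cell prints becomes the response body. A minimal sketch of such a cell (the endpoint name and the payload handling are made up):

# GET /mean
import json

# REQUEST is injected by the kernel gateway at call time; the fallback here lets
# the cell also run standalone inside the notebook
req = json.loads(globals().get("REQUEST", "{}"))
values = [1, 2, 3]  # stand-in data; a real service might pull values from req["args"]
print(json.dumps({"mean": sum(values) / len(values)}))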

Another recent announcement came from Google in the form of Colaboratory, a “research project created to help disseminate machine learning education and research”. The environment is “a Jupyter notebook environment that requires no setup to use”, although it does require registration to run notebook cells, and there appears to be a waiting list. The most interesting thing, perhaps, is the ability to collaboratively work on notebooks shared with other people across Google Drive. I think this is separate from the jupyterlab-google-drive initiative, which is looking to offer a similar sort of shared working, again through Google Drive?

By the by, it’s probably also worth noting that other big providers make notebooks available, such as Microsoft (notebooks.azure.com) and IBM (eg datascientistworkbench.com, cognitiveclass.ai; digging around, cognitiveclass.ai seems to be a rebranding of bigdatauniversity.com).

There are other hosted notebook servers relevant to education too: CoCalc (previously SageMathCloud) offers a free way in, as does gryd.us if you have a .edu email address. pythonanywhere.com/ offers notebooks to anyone on a paid plan.

It also seems like there are services starting to appear that offer free notebooks as well as compute power for research/scientific computing on a model similar to CoCalc (free tier in, then buy credits for additional services). For example, Kogence.

For sharing notebooks, I also just spotted Anaconda Cloud, which looks like it could be an interesting place to browse every so often…

Interesting times…

First Attempt At Using IPywidgets in Jupyter Notebooks to Display V-REP Robot Simulator Telemetry

Having got a thing together that lets me use some magic to load a V-REP robot simulator scene, connect to it and control a robot contained inside it, I also started to wonder about how we could build instrumentation on the Jupyter notebook client side.

The V-REP simulator itself has graph objects that can record and display logged data within the simulator:

But we can also capture data from the simulator as part of the Python control loop running via a notebook.

(I’m not sure if streaming data from the simulator is possible, or how to go about either setting that up in the simulator connection or rendering it in the notebook?)

So here’s my quick starter for ten: getting a simple data display running in a notebook using IPython widgets.

So here’s a simple text display to give a real time (ish) view of a couple of sensor values:
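
Something along these lines, using a couple of text widgets whose values get overwritten from inside the control loop (the sensor names are made up):

import ipywidgets as widgets
from IPython.display import display

left_sensor = widgets.Text(description="Left:", value="0")
right_sensor = widgets.Text(description="Right:", value="0")
display(left_sensor, right_sensor)

# then, from inside the robot control loop, update the widget values:
# left_sensor.value = str(left_reading)
# right_sensor.value = str(right_reading)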

As the robot runs, the widget values update in real-time(-ish).

I couldn’t figure out offhand how to generate a live-updating chart, and couldn’t quickly see how to return data from inside the magic cell as part of the magic function. (In fact, I’m not convinced I understand at all the scoping in there!)

But it seems that if we set a global variable inside the magic cell, we can get the data out and plot it once the simulation has stopped:
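
The pattern looks something like the following (variable names and the stand-in data are illustrative; in practice the list is appended to from inside the control loop run by the magic):

import matplotlib.pyplot as plt

# set as a global inside the magic cell and appended to from the control loop,
# e.g. telemetry.append((t, left_reading, right_reading))
telemetry = [(0, 0.1, 0.1), (1, 0.3, 0.2), (2, 0.5, 0.4)]  # stand-in data

ts, left, right = zip(*telemetry)
plt.plot(ts, left, label="left sensor")
plt.plot(ts, right, label="right sensor")
plt.xlabel("time")
plt.legend()
plt.show()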

Example notebook here.

If anyone can show me how to create and update a live chart, that would be fantastic:-)