Merging Several Binder Configurations

As more and more repositories start to incorporate MyBinder / repo2docker build specifications, more and more building blocks start to appear showing how to get particular things running in MyBinder. For example, I have several ouseful-template-repos with various building blocks for getting different databases running in MyBinder, and occasionally require an environment that also loads in a Jupyter-server-proxied application, such as OpenRefine. Other times, I might want to pull in the config for a particularly tricky install, or merge configs someone else has developed to run different sets of notebooks in the same Binderised repo.

But: a problem arises if you want to combine multiple Binder specifications from various repos into a single Binder setup in a single repo – how do you do it?

One way might be for repo2docker to iterate through multiple build steps, one for each Binder specification. There may be clashes, of course, such as conflicting package versions from different specifications, but it would then fall to the user to try to resolve the issue. Which is fine, if Binder is making a best-effort attempt rather than guaranteeing to work.

Assuming that such a facility does not exist, providing one would require updates to repo2docker, so that’s not something we can easily hack around with ourselves. So how about something where we try to combine the contents of multiple binder/ setup directories ourselves? This is something we can start to do easily enough, and as a personal tool it doesn’t necessarily have to work “properly” or “for everything”: for starters, it only has to work with what we want it to work with. And if it only gets us 80% of the way to a working combined configuration, that’s fine too.

So what would we need to do?

Simple list files like apt.txt and requirements.txt could simply be concatenated together, leaving it up to apt or pip to do whatever it does with any clashes in pinned package versions (though we may want to report possible clashes, perhaps via a comment in the file, to help the user debug things).

In a shell script, something like the following would concatenate the apt.txt files from directories binder_1, binder_2, etc.:

for i in $(ls -d binder_*)
do
   if [ -f "$i/apt.txt" ]; then
      echo >> binder/apt.txt
      echo "# $i" >> binder/apt.txt
      cat "$i/apt.txt" >> binder/apt.txt
   fi
done

In Python, something like:

import os

with open('binder/requirements.txt', 'w') as outfile:
    for d in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)]:
        # Only pull in requirements from directories that actually provide them
        if 'requirements.txt' not in os.listdir(d):
            continue
        with open(os.path.join(d, 'requirements.txt')) as infile:
            outfile.write(f'\n#{d}\n')
            outfile.write(infile.read())

Merging environment.yml files is a little trickier — the structure within the file is hierarchical — but a package like hiyapyco can help us with that:

import os
import fnmatch
import hiyapyco

# Find candidate YAML environment files inside each binder_* directory
_envs = []
for d in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)]:
    _envs += [os.path.join(d, e) for e in os.listdir(d) if fnmatch.fnmatch(e, '*.y*ml')]

merged = hiyapyco.load(_envs,
                       method=hiyapyco.METHOD_MERGE,
                       interpolate=True)

with open('binder/environment.yml', 'w') as f:
    f.write(hiyapyco.dump(merged))

There is an issue with environments where we have both environment.yml and requirements.txt files, because the environment.yml trumps the requirements.txt: the former will be run but the latter won’t. A workaround I have used in the past is to install from the requirements.txt file via a directive in the postBuild file.
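In the context of the merging scripts here, that could just mean appending an explicit pip install line to the generated postBuild file. A minimal sketch (assuming the merged requirements end up in binder/requirements.txt and that postBuild runs from the repo root, as it does under repo2docker):

with open('binder/postBuild', 'a') as f:
    # Force a pip install of the merged requirements file, since the presence
    # of an environment.yml file means requirements.txt is otherwise ignored
    f.write('\npip install -r binder/requirements.txt\n')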

I’ve also had to use a related trick to install a heavily depended-upon Python package explicitly via postBuild and then install from a renamed requirements.txt, also via postBuild: the pip installer installs packages in whatever order it wants, and doesn’t necessarily follow any order “specified” in the requirements.txt file. This means that on certain occasions, a build can fail because one Python package relies on another that is specified in the requirements.txt file but hasn’t been installed yet.

Another approach might be to grab any requirements from a (merged) requirements.txt file into an environment.yml file. For example, we can create a “dummy” _environment.yml file that will install elements from our requirements file, and then merge that into an existing environment.yml file. (We’d probably guard this with a check that both environment.y*ml and requirements.txt are in binder/):

_yaml = '''dependencies:
  - pip
  - pip:
'''

# if 'requirements.txt' in os.listdir('binder') and 'environment.yml' in os.listdir('binder'):

with open('binder/requirements.txt') as f:
    for item in f.readlines():
        # Skip blank lines and comments
        if item.strip() and not item.startswith('#'):
            _yaml = f'{_yaml}    - {item.strip()}\n'

with open('binder/_environment.yml', 'w') as f:
    f.write(_yaml)

merged = hiyapyco.load('binder/environment.yml', 'binder/_environment.yml',
                       method=hiyapyco.METHOD_MERGE,
                       interpolate=True)

with open('binder/environment.yml', 'w') as f:
    f.write(hiyapyco.dump(merged))

# Maybe also now delete requirements.txt?

For postBuild elements, different postBuild files may well operate in different shells (for example, we may have one that executes bash code, another that contains Python code). Perhaps the simplest way of “merging” this is to just copy over the separate postBuild files and generate a new one that calls each of them in turn.

import shutil

postBuild = '#!/bin/bash\n'

for d in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)]:
    if os.path.isfile(os.path.join(d, 'postBuild')):
        _from = os.path.join(d, 'postBuild')
        _to = os.path.join('binder', f'postBuild_{d}')
        # shutil.copy (rather than copyfile) preserves the executable bit
        shutil.copy(_from, _to)
        postBuild = f'{postBuild}\n./{_to}\n'

with open('binder/postBuild', 'w') as outfile:
    outfile.write(postBuild)

I’m guessing we could do the same for start?
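Something like the following sketch would probably do it (untested; it assumes the same binder_* layout as above, and that the individual start scripts are happy to be called with no arguments):

import os
import shutil

start = '#!/bin/bash\n'

for d in [d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)]:
    if os.path.isfile(os.path.join(d, 'start')):
        _from = os.path.join(d, 'start')
        _to = os.path.join('binder', f'start_{d}')
        shutil.copy(_from, _to)
        start = f'{start}\n./{_to}\n'

# A start script is expected to hand over to the notebook server command
start = f'{start}\nexec "$@"\n'

with open('binder/start', 'w') as outfile:
    outfile.write(start)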

If you want to have a play, the beginnings of a test file can be found here. (For some reason, WordPress craps all over it and deletes half of it if I try to embed it in a sourcecode block; I really should move to a blogging platform that does what I need…)

Installing Applications via postBuild in MyBinder and repo2docker

A note on downloading and installing things into a Binderised repo, or a container built using repo2docker.

If you save files into $HOME as part of the container build process, and then try to use the image outside of MyBinder, you will find that if storage volumes or local directories are mounted onto $HOME, your saved files are clobbered.

The MyBinder / repo2docker build is pretty limiting in terms of the permissions the default jovyan user has over the file system. $HOME is one place you can write to, but if you need somewhere outside that path, then $CONDA_DIR (which defaults to /srv/conda) is handy…

For example, I just tweaked my neo4j binder repo to install a downloaded neo4j server into that path.

Accessing MyBinder Kernels Remotely from IPython Magic and from VS Code

One of the issues facing us as a distance learning organisation is how to support the computing needs of distance learning students, on a cross-platform basis, and over a wide range of computer specifications.

The approach we have taken for our TM351 Data Management and Analysis course is to ship students a Virtualbox virtual machine. This mostly works. But in some cases it doesn’t. So in the absence of an institutional online hosted notebook, I started wondering about whether we could freeload on MyBinder as a way of helping students run the course software.

I’ve started working on an image here though it’s still divergent from the shipped VM (I need to sort out things like database seeding, and maybe fix some of the package versions…), but that leaves open the question of how students would then access the environment.

One solution would be to let students work on MyBinder directly, but this raises the question of how to get the course notebooks into the Binder environment (the notebooks are in a repo, but it’s a private repo) and out again at the end of a session. One solution might be to use a Jupyter github extension, but this would require students setting up a Github repository, installing and configuring the extension, remembering to sync (unless auto-save-and-commit is available, or could be added to the extension), and so on…

An alternative solution would be to find a way of treating MyBinder like an Enterprise Gateway server, launching a kernel via MyBinder from a local notebook server extension. But I don’t know how to do that.

Some fragments I have had laying around for a bit were the first fumblings towards a Python MyBinder client API, based on the Sage Cell client for running a chunk of code on a remote server… So I wondered whether I could do another pass over that code to create some IPython magic that lets you create a MyBinder environment from a repo and then execute code against it from a magicked code cell. Proof of concept code for that is here: innovationOUtside/ipython_binder_magic.

One problem is that the connection seems to time out quite quickly. The code is really hacky and could probably be rebuilt from functions in the Jupyter client package, but making sense of that code is beyond my limited cut-and-paste abilities. But: it does offer a minimal working demo of what such a thing could be like. At a push, a student could install a minimal Jupyter server on their machine, install the magic, and then write notebooks using magic to run the code against a Binder kernel, albeit one that keeps dying. Whilst this would be inconvenient, it’s not a complete catastrophe because the notebook would be being saved to the student’s local machine.

Another alternative struck me today when I saw that Yuvi Panda had posted to the official Jupyter blog a recipe on how to connect to a remote JupyterHub from Visual Studio Code. The mechanics are quite simple — I posted a demo here about how to connect from VS Code to a remote Jupyter server running on Digital Ocean, and the same approach works for connecting to our VM notebook server, if you tweak the VM notebook server’s access permissions — but it requires you to have a token. Yuvi’s post says how to find that from a remote JupyterHub server, but can we find the token for a MyBinder server?

If you open your browser’s developer tools and watch the network traffic as you launch a MyBinder server, then you can indeed see the URL used to launch the environment, along with the necessary token:

But that’s a bit of a faff if we want students to launch a Binder environment, watch the network traffic, grab the token and then use that to create a connection to the Binder environment from VS Code.

Searching the contents of pages from a running Binder environment, it seems that the token is hidden in the page:

And it’s not that hard to find… it’s in the link from the Jupyter log. The URL needs a tiny bit of editing (cut the /tree path element) but then the URL is good to go as the kernel connection URL in VS Code:

Then you can start working on your notebook in VS Code (open a new notebook from the settings menu), executing the code against the MyBinder environment.

You can also see the notebooks listed in the remote MyBinder environment.

So that’s another way… and now it’s got me thinking… how hard would it be to write a VS Code extension to launch a MyBinder container and then connect to it?

PS By the by, I notice that the developer tools in Firefox have become increasingly useful with the Firefox 71 release, in the form of a websocket inspector.

This lets you inspect traffic sent across a websocket connection. For example, if we force a page reload on a running Jupyter notebook, we can see a websocket connection:

We can then click on that connection and monitor the messages being passed over it…

I thought this might help me debug / improve my Binder magic, but it hasn’t. The notebook looks like it sends an empty ping as a heartbeat (as per the docs), but if I try to send an empty message from the magic it closes the connection? Instead, I send a message to the heartbeat channel…

PS sort of related, binderbot, “A simple CLI to interact with binder, eg run local notebooks on a remote binder”.

Tinkering With Neo4j and Cypher

I am so bored of tech at the moment — I just wish I could pluck up the courage to go into the garden and start working on it again (it was, after all, one of the reasons for buying the house we’ve been in for several years now, and months go by without me setting foot into it; for the third year in a row the apples and pears have gone to rot, except for the ones the neighbours go scrumping for…) Instead, I sit all day, every day, in front of a screen, hacking at a keyboard… and I f*****g hate it…

Anyway… here’s some of the stuff that I’ve been playing with yesterday and today, in part prompted by a tweet doing the rounds again:

#Software #Analytics with #Jupyter notebooks using a prefilled #Neo4j database running on #MyBinder by @softvisresearch
Created with building blocks from @feststelltaste and @psychemedia
#knowledgegraph #softwaredevelopment
https://github.com/softvis-research/BeLL

Impact.

Yeah:-)

Anyway… It prompted me to revisit my binder-neo4j repo, which demos how to launch a neo4j database in a MyBinder container, to provide some more baby-steps ways in to actually getting started running queries.

So yesterday I added in a third party cypher kernel to the build, HelgeCPH/cypher_kernel, that lets you write cypher queries in code cells; and today I hacked together some simple magic — innovationOUtside/cypher_magic — that lets you write cypher queries in block magic cells in a “normal” (python kernel) notebook. This magic really should be extended a bit more, eg to allow connections to arbitrary neo4j databases, and perhaps crib from the cypher_kernel to include graph conversions to a networkx graph object format as well as graphical visualisations.
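For anyone wondering, the core of such a magic is only a few lines. A minimal sketch (not the actual cypher_magic code), assuming py2neo is installed and a local neo4j server is listening on the default bolt port with made-up credentials:

from py2neo import Graph
from IPython.core.magic import Magics, magics_class, cell_magic

@magics_class
class CypherMagic(Magics):
    """Run the contents of a %%cypher cell against a neo4j database."""

    def __init__(self, shell, uri='bolt://localhost:7687', auth=('neo4j', 'neo4j')):
        super().__init__(shell)
        self.graph = Graph(uri, auth=auth)

    @cell_magic
    def cypher(self, line, cell):
        # Return the query results as a list of dicts; could also push into pandas
        return self.graph.run(cell).data()

def load_ipython_extension(ipython):
    ipython.register_magics(CypherMagic(ipython))

Usage is then just a %%cypher header at the top of a code cell, followed by the query.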

The cypher-kernel uses visjs, as does an earlier cypher magic that appears to have rotted (ipython-cypher). But if we can get the graph objects into a networkx format, then we could also use netwulf to make pretty diagrams…

The tweet-linked repo also looks interesting (although I don’t speak German at all, so, erm…); there may be things I can also pull out of there to add to my binder-neo4j repo, although I may need to rethink that: the binder-neo4j repo had started out as a minimal template repo for just getting started with neo4j in MyBinder/repo2docker. But it’s started creeping… Maybe I should pare it back again, install the magic from its own repo, and put the demos in a more disposable place.

Fragment – Jupyter Kernels / MyBinder as a Remote Code Execution Sandbox for Moodle

Although I don’t know for sure, I suspect that administrators of computing infrastructure in educational establishments are wary of requests from academics for compute services that allow students to run arbitrary code.

One of the main reasons why an educator would want to support this is that setting up an environment can be hard: if you want a student to focus on writing code that makes use of particular packages, you probably don’t want them engaging in arcane sys admin practices and spending all their time trying to install those packages in the first place.

For the IT department, the thought of running arbitrary code that could be produced either by novices or deliberately malicious users is likely to raise several well-founded concerns: how do we stop users using the code environment to attack the server or network the code is running on; how do we stop folk from running code on our servers that could be used to attack external sites; and how do we control the resource requirements (storage, compute, network) when mistakes happen and folk try to repeatedly download the internet to our server.

One way of making hosted compute available to students is to execute code within isolated sandboxed environments that you can park in a safe area of the network and monitor closely.

In our Moodle VLE, the Moodle CodeRunner environment is used to allow students to run small fragments of code within just such an environment when completing interactive quiz questions. (I provide a quick review of the Moodle CodeRunner plugin in the post A Quick First Look At Moodle CodeRunner.)

Presumably, someone somewhere has done a security audit and decided that the sandboxed code execution environment is a safe one and signed off on its use.

Another approach, described in this fragment on Jupyter Notebooks and Moodle, the SageCell filter for Moodle, allows you to run code against an external (stateless) SageCell server:

<?php
/**
 * SageCell filter for Moodle 3.4+
 *
 *  This filter will replace any Sage code in [sage]...[/sage]
 *  with a Ajax code from http://sagecell.sagemath.org
 *
 * @package    filter_sagecell
 * @copyright  2015-2018 Eugene Modlo, Sergey Semerikov
 * @license    http://www.gnu.org/copyleft/gpl.html GNU GPL v3 or later
 */

defined('MOODLE_INTERNAL') || die();

/**
 * Automatic SageCell embedding filter class.
 *
 * @package    filter_sagecell
 * @copyright  2015-2016 Eugene Modlo, Sergey Semerikov
 * @license    http://www.gnu.org/copyleft/gpl.html GNU GPL v3 or later
 */
class filter_sagecell extends moodle_text_filter {

    /**
     * Check text for Sage code in [sage]...[/sage].
     *
     * @param string $text
     * @param array $options
     * @return string
     */
    public function filter($text, array $options = array()) {

        if (!is_string($text) or empty($text)) {
            // Non string data can not be filtered anyway.
            return $text;
        }

        if (strpos($text, '[sage]') === false) {
            // Performance shortcut - if there is no [sage] tag, nothing can match.
            return $text;
        }

        $newtext = $text; // Fullclone is slow and not needed here.

        $search = '/\[sage](.+?)\[\/sage]/is';
        $newtext = preg_replace_callback($search, 'filter_sagecell_callback', $newtext);

        if (is_null($newtext) or $newtext === $text) {
            // Error or not filtered.
            return $text;
        }

        return $newtext;
    }

}

/**
 * Replace Sage code with embedded SageCell, if possible.
 *
 * @param array $sagecode
 * @return string
 */
function filter_sagecell_callback($sagecode) {

    // SageCell code from [sage]...[/sage].
    $output = $sagecode[1];

    // Normalise HTML line breaks and paragraph tags to newlines, and whitespace
    // entities to plain spaces. (The HTML tag arguments to several of the original
    // str_ireplace() calls were stripped out by the blog platform; see the plugin
    // source for the details.)
    $output = str_ireplace("&nbsp;", "\x20", $output);
    $output = str_ireplace("\xc2\xa0", "\x20", $output);
    $output = clean_text($output);
    $output = str_ireplace("&lt;", "", $output);

    $id = uniqid("");

    // Wrap the code as an embedded, auto-evaluating SageCell: a script element calling
    //   sagecell.makeSagecell({inputLocation: "#<id>",
    //                          evalButtonText: "Evaluate",
    //                          autoeval: true,
    //                          hide: ["evalButton", "editor", "messages", "permalink", "language"]});
    // followed by a div with the matching id wrapping $output.
    // (The literal markup strings were also mangled in the blog rendering.)

    return $output;
}

This looks to me like the SageCell Moodle filter essentially rewrites a [sage]...[/sage] delimited code block within a Moodle environment as a Javascript backed SageCell form and then lets users run the code embedded in the form against the remote server. This sort of thing could presumably be used to support interactive, executable code activities within a Moodle hosted web page, for example.
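Just to show the shape of that transformation, here’s a rough Python analogue of the rewrite step (a sketch, not the actual filter; the markup SageCell expects is simplified, and loading of the embedded_sagecell.js library is omitted):

import re
import uuid

SAGE_BLOCK = re.compile(r'\[sage](.+?)\[/sage]', re.I | re.S)

def embed_sagecells(html):
    """Rewrite [sage]...[/sage] blocks as embedded, auto-evaluating SageCells."""
    def replace(match):
        cell_id = f'sagecell_{uuid.uuid4().hex}'
        code = match.group(1)
        return (f'<div id="{cell_id}"><script type="text/x-sage">{code}</script></div>'
                f'<script>sagecell.makeSagecell({{inputLocation: "#{cell_id}", '
                f'evalButtonText: "Evaluate", autoeval: true}});</script>')
    return SAGE_BLOCK.sub(replace, html)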

As I remarked previously, it’s not hard to imagine doing something similar to provide a [mybinder repository="..."]...[/mybinder] filter that could use a Javascript library such as ThebeLab or Juniper to provide a similar style of interaction backed by a MyBinder launched repository, though minor tweaks may be required around those packages to handle stateless rather than stateful transactions if repeated calls are made to the server.

Going back to the CodeRunner plugin (as described here):

[i]nternally CodeRunner is designed to support multiple sandboxes, implemented as subclasses of the abstract class qtype_coderunner_sandbox – see sandbox.php. Sandboxes are essentially plugins to CodeRunner. Several different ones have been used over the years but the only current ones are the jobe sandbox (file jobesandbox.php) and the ideone sandbox. The latter interfaces to the Sphere On-line judge server but is now more-or-less defunct. Both of those sandboxes run as services. CodeRunner can support multiple sandboxes at the same time and questions can be configured to select a particular sandbox (if desired). By default the first available sandbox that supports the language required by the question is used.

So could we use a MyBinder launched Jupyter server to provide sandboxed code execution?

One advantage of this would be that we could define a Jupyter environment that students could use on their own machines, or that we could host via a hosted notebook server, and that same environment could be used for CodeRunner style assessment.

Another advantage would be that if we want to run student created arbitrary code for teaching activities as well as CodeRunner based assessment activities, we’d only need to sign off on one sandboxed code execution environment rather than several.

So what’s required?

It’s years since I had used PHP, but I thought I’d have a go at creating a simple Python client that would let me:

  • start a MyBinder server against a specified Github repo;
  • start a kernel;
  • run a small code sample in the kernel and get a code execution response back.

Cribbing heavily from juniper.js and this rather handy sagecell-client.py, I came up with a hacky recipe that works as a minimal proof of concept here: mybinder_py_client-ipynb.
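For reference, a stripped-down sketch of the sort of thing involved (not the actual client code; the endpoint and payload details reflect my understanding of the MyBinder build API and the Jupyter REST API, so treat them as assumptions to check):

import json
import requests

def launch_binder(user, repo, ref='master'):
    """Request a MyBinder launch for a Github repo and wait for the ready event."""
    build_url = f'https://mybinder.org/build/gh/{user}/{repo}/{ref}'
    with requests.get(build_url, stream=True) as r:
        for line in r.iter_lines():
            if line.startswith(b'data:'):
                event = json.loads(line.decode('utf-8')[len('data:'):])
                if event.get('phase') == 'ready':
                    # url is the Jupyter server base URL; token authenticates us
                    return event['url'], event['token']

def start_kernel(url, token):
    """Start a kernel on the launched Jupyter server via its REST API."""
    headers = {'Authorization': f'token {token}'}
    return requests.post(f"{url.rstrip('/')}/api/kernels", headers=headers).json()['id']

# Usage (hypothetical repo):
#   url, token = launch_binder('ouseful-testing', 'binder-graphviz')
#   kernel_id = start_kernel(url, token)
# Actually running code against the kernel then means opening a websocket to
# {url}api/kernels/{kernel_id}/channels?token={token} and speaking the Jupyter
# messaging protocol over it, which is where things get hacky.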

I think this is stateful, in that we execute several code blocks one after the other and exploit state in previous calls to the same kernel. It would probably also make sense to have a call that forces a new kernel for each code execution call, as well as providing a recipe for killing a kernel.

The next step in trying to use this approach for CodeRunner sandbox would presumably be to try to create a simple PHP based MyBinder client; then the next step would be to use that in a CodeRunner sandbox subclass.

But that’s out of scope for me atm…

Please let me know in the comments if you have a go at this… or know of any other Moodle / Jupyter integrations…

Binder Base Boxes, Several Ways…

A couple of weeks ago, Chris Holdgraf published a handy tip on the Jupyter Discourse site about how to embed custom github content in a Binder link with nbgitpuller.

One of the problems with (features of…) MyBinder is that if you make a change to a repo, even if it’s just a change to the README, it will spawn a rebuild of the Docker image built from the repo the next time the repo is launched onto MyBinder.

With the recent announcement of the Binder Federation, whereby there are multiple clusters (currently two…) onto which MyBinder launch requests are mapped, if each cluster maintains its own Docker image hub, this could mean that with N clusters available, your next N launches may all require a rebuild if each launch request is mapped to a different cluster.

So how does nbgitpuller help? If you install nbgitpuller into a Binderised repository, you can launch a container on MyBinder with a git-pull? argument. This will grab the contents of a specified repository into a notebook server environment before presenting you with the notebook homepage.

What this means is that we can construct a MyBinder URL that will:

  • launch a container built from one repo; and
  • populate it with files pulled from another.

The advantage of this is that you can create one repo with a complex set of build requirements and build a MyBinder image from that once and once only. If you also maintain a second repository with notebook files, or a package definition, with frequent changes, but run it in a Binderised container launched from the “fixed” build repo, you won’t need to rebuild the container each time: just launch from the pre-built one and then synch the changed content in from the other repo.

To pull the contents of a repo http://github.com/USER/REPO into a MyBinder container built from a particular binder-base-boxes branch, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO

To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26branch=BRANCH

Note the escaping of the & conjunction (as %26) between the repo and branch arguments, which keeps it inside the scope of the git-pull?repo phrase.

To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH and launch into a particular notebook, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26branch=BRANCH%26subPath=FILENAME.ipynb
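Getting the escaping right by hand is fiddly, so here’s a quick sketch of a link builder (just a convenience; the parameter names follow the examples above):

def nbgitpuller_link(base_repo, base_branch, content_repo, branch=None, subpath=None):
    """Build a MyBinder launch URL that git-pulls content into a base box."""
    pull = f'git-pull?repo={content_repo}'
    if branch:
        pull += f'%26branch={branch}'
    if subpath:
        pull += f'%26subPath={subpath}'
    return f'https://mybinder.org/v2/gh/{base_repo}/{base_branch}/?urlpath={pull}'

# For example (placeholder values as above):
# nbgitpuller_link('ouseful-demos/binder-base-boxes', 'BASEBOXBRANCH',
#                  'https://github.com/USER/REPO', branch='BRANCH', subpath='FILENAME.ipynb')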

You can see several examples in the various branches of https://github.com/ouseful-demos/binder-base-boxes.

See Feeding a MyBinder Container Built From One Github Repository With the Contents of Another for an earlier review of this approach (which I have to admit, I’d forgotten I’d posted when I started this post!).

On my to do list is to try to add a tab to the nbgitpuller/link generator to simplify the process of link creation. But in addition to a helper tool, is there a convention we might adopt to make it clearer when we are using this sort of split build/content repo approach?

Github conventionally uses the gh-pages branch as a “reserved” branch for constructing Github Pages docs related to a particular repo. Could we take a similar approach for defining a “Binder build” branch?

The binder/ directory in a repo can be used to partition Binder build requirements in a repo, but there are a couple of problems associated with this:

  • a maintainer may not want to have the binder/ directory cluttering their package repo;
  • any updates to the repo will force a rebuild of the Binder image next time the repo is run on a particular Binder node. (With Binder federation, if there are N hosts in the federation, after updating a repo, is it possible that my next N attempts to run the repo on MyBinder may require a rebuild if I am directed to a different host each time?)

If by convention something like a binder-build branch was used to contain the build requirements for a repo, then the process for calling a build (by default) could be simplified.

Eg rather than having something like:

https://mybinder.org/v2/gh/colinleach/binder-box/master/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

we would have something like:

https://mybinder.org/v2/gh/colinleach/astro-Jupyter/binder-build/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

which could simplify to something that defaults to a build from binder-build branch (the “build” branch) and nbgitpull from master (the “content” branch):

https://mybinder.org/v2/gh/colinleach/astro-Jupyter?binder-build=True

Complications could be added to support changing the build branch, the nbgitpull branch, the commit/ID of a particular build, etc?

It might overly complicate things further, but I could also imagine:

  • automatically injecting nbgitpuller into the Binder image and enabling it;
  • providing some sort of directive support so that if the content directory has a setup.py file the package from that content directory is installed.

Binder Buildpacks

As well as defining dynamically constructed Binder base boxes built from one repo and used to provide an environment within which to run the contents of another, there is a second sense in which we might define Binder base boxes and that is to consider the base environment on which repo2docker constructs a Binder image.

In the nbgitpuller approach, I am treating the Binder base box (sense 1) as the environment that the git pulled content runs in. In the buildpack approach, the Binder base box (sense 2) is the image that repo2docker uses to bootstrap the Binder image build process. Binder base box sense 1 = Binder base box sense 2 + Binder repo build process. Maybe it’d make more sense to swap those senses, so sense 2 builds on sense 1?!

This approach is discussed in the repo2docker issue #487 Make it possible to configure the base image with an example implementation in pangeo-stacks/pull/27. The implementation allows users to create a Dockerfile in which they specify a required base Docker image upon which the normal apt.txt, environment.yml, requirements.txt and postBuild steps can be applied.

The Dockerfile FROM statement takes the form:

FROM yuvipanda/pangeo-base-notebook-onbuild:2019.04.15-4

and then other build files (requirements.txt etc) are declared as normal.

The -onbuild component marks out the base image as one that should be built on (I think). I’m not sure how the date component applies (or whether it is required or optional). I’m not sure if the base box itself also needs some custom configuration? I think an example of the code used to build it is in the base-notebook directory of this repo: https://github.com/yuvipanda/pangeo-stacks .

Summary

Installing nbgitpuller into a Binderised repo allows us to pull the contents of a second Github repository into the first. This means we can build a complex environment from one repository once and pull regularly updated content from another repo into it without needing a rebuild step. Using the -onbuild approach, Binderhub can use repo2docker to build a Binder image from a user defined base image and then apply the normal build steps to it. This means that optimised base boxes can be defined on which additional customisations can be layered. This can also make development of Binder boxes more efficient by starting rebuilds further up the image layer stack, building on top of prebuilt boxes rather than having to build images from scratch.

MyBinder Launches From Any git Repository: Github, Gists, GitLab, Bitbucket etc

By default, MyBinder looks to repositories on Github for its builds, but it can also build from Github gists, GitLab.com repositories, and, well, any git repository with a networked endpoint, it seems:

What prompted me to this was looking for a way to launch a MyBinder container from Bitbucket. (For the archaeologists, there are various issues and PRs, such as here and here, as well as this recent forum post, How to use bitbucket repositories on mybinder.org, that trace some of the history…)

So what’s the trick?

For now, you need to get hold of the URL to a particular Bitbucket repo commit. For example, to try running this repo you need to go to the Commits page and grab the URL for the most recent master commit (or whichever one you want), which will contain the commit hash:

For example, something like https://bitbucket.org/ueacomputervision/image-labelling-tool/commits/f3ddb33e4839f8a0fe73c168993b405adc13daf0 gives the commit hash f3ddb33e4839f8a0fe73c168993b405adc13daf0.

For the repo base URL https://bitbucket.org/ueacomputervision/image-labelling-tool, the MyBinder launch link then takes on the form:

https://mybinder.org/v2/git/https%3A%2F%2Fbitbucket.org%2Fueacomputervision%2Fimage-labelling-tool.git/f3ddb33e4839f8a0fe73c168993b405adc13daf0

which is to say:

https://mybinder.org/v2/git/ESCAPED_REPO_URL.git/COMMIT_HASH
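In Python terms, the construction is just a percent-encode of the repo URL (a convenience sketch, using the Bitbucket example above):

from urllib.parse import quote

def binder_git_launch_url(repo_url, commit_hash):
    """Build a MyBinder launch URL for an arbitrary git repository at a given commit."""
    # The repo URL must be fully percent-encoded, including the : and / characters
    escaped = quote(f'{repo_url}.git', safe='')
    return f'https://mybinder.org/v2/git/{escaped}/{commit_hash}'

# binder_git_launch_url('https://bitbucket.org/ueacomputervision/image-labelling-tool',
#                       'f3ddb33e4839f8a0fe73c168993b405adc13daf0')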

But it does look like things may get easier in the near future…

Feeding a MyBinder Container Built From One Github Repository With the Contents of Another

Long time readers should be more than well aware by now of MyBinder, the Jupyter project service that will build a Docker image from the contents of a git repository and then launch a container based on that image so you can work with a live, running, albeit temporary, instance of it.

But that’s not all it can do…

Via Chris Holdgraf on the Jupyter discourse community site (Tip: embed custom github content in a Binder link with nbgitpuller), comes a magical trick whereby you can launch a MyBinder instance built from one repository and populate it with files from another.

Why’s this useful? Well, if you’ve had a play with your own repos using MyBinder, you’ll know that each time you make a change to a repository, MyBinder will want to rebuild the Docker image next time you try to launch the repo there.

So if your repo defines a complex build that takes some time to install all of its dependencies, you have to wait for that build even if all you did was correct a typo in the markdown of a notebook file.

So here’s the trick…

nbgitpuller is a Jupyter server extension that supports the “one-way synchronization of a remote git repository to a local git repository”.

There are other approaches to git syncing too. See the next edition of Tracking Jupyter to find out what they are…

Originally developed as a tool to help distribute notebooks to students, it can be called via a Jupyter server URL. For example, if you have nbgitpuller installed in a local Jupyter server running on the default port 8888, a URL of the following form will pull the contents of the specified repo into the base directory the notebook server points to:

localhost:8888/git-pull?repo=https://github.com/USER/NOTEBOOK_REPO

One of the neat things about Binderhub / MyBinder is that you can pass a git-pull? argument through as part of a MyBinder launch URL, so if the repo you want to build from installs and enables nbgitpuller, you can then pull notebooks into the launched container from a second, nbgitpulled repository.

For example, yesterday I came across the Python show_ast package, and its incorporated IPython magic, which will render the abstract syntax tree of a Python command:

Such a thing may be useful in an introductory programming course (TBH, I’m never really sure what people try to teach in introductory programming courses, what the useful mental models are, how best to help folk learn them, and how to figure out how to teach them…).

As with most Python based repos, particularly ones that contain Jupyter notebooks (that is, .ipynb files [….thinks… ooh… I has a plan to try s/thing else too….]) I generally try to “run” them via MyBinder. In this case, the repo didn’t work because there is a dependency on the Linux graphviz apt package and the Python graphviz package.

At this point, I’d generally fork the repo, create a binderise branch containing the dependencies, then try that out on MyBinder, sometimes adding an issue and/or making a pull request to the original repository suggesting they Binderise it…

…but nbgitpuller provides a different opportunity. Suppose I create a base container that contains the Graphviz Linux application and the graphviz Python package. Something like this: ouseful-testing/binder-graphviz.

Then I can create a MyBinder session from that repo and pull in the show_ast package from its repo and run the notebook directly:

https://mybinder.org/v2/gh/ouseful-testing/binder-graphviz/master/?urlpath=git-pull?repo=https://github.com/hchasestevens/show_ast

Fortuitously, things work seamlessly in this case because the example notebook lives in a directory where we can import show_ast without the need to install it (otherwise we’d have needed to run pip install . at the top level of the repo). In general, where notebooks are kept in a notebooks or docs directory, for example, the path to import the package would break. (Hmmm… I need to think about protocols for handling that… It’s better practice to put the notebooks somewhere separate, but that means we need to install the package or change the import path to it, which is one more step for folk to stumble over…)
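(For the record, the import-path workaround is only a couple of lines at the top of a notebook, assuming the notebooks sit one level below the package:)

import sys, os

# Let a notebook in notebooks/ import a package that lives in the repo root
sys.path.insert(0, os.path.abspath('..'))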

Thinking about my old show’n’tell repo, the branches of which scruffily define various Binder environments suited to particular topic areas (environments for working on chemistry notebooks, for example, or astronomy notebooks, or classical language or music notebooks) and also contain demo notebooks, I could instead just define a set of base Binder environment containers, slow to build but built infrequently, and then lighter weight notebook repos containing just demo notebooks for a particular topic area. These could then be quickly and easily updated, and run on MyBinder having been nbgitpulled by a base container, without having to rebuild the base container each time I update a notebook in a notebook repo.

A couple of other things to note here. First, nbgitpuller has its own helper for creating nbgitpuller URLs, the nbgitpuller link generator:

It’s not hard to imagine a similar UI, or another tab to that UI, that can build a MyBinder link from a “standard” base container selected from a dropdown menu (or an optional link to a git repo) and then a provided git repo link for the target content repo.

Second, this has got me thinking about how we (don’t) handle notebook distribution very well in the OU.

For our TM351 internal course, we control the students’ computing environment via a VM we provide them with, so we could install nbgitpuller in it, but the notebooks are stored in a private Github repo and we don’t want to give students any keys to it at all. (For some reason, I seem to be the only person who doesn’t have a problem with the notebooks being in a public repo!;-)

For our public notebook utilising courses on FutureLearn or OpenLearn, the notebooks are in a public repo, but we don’t have control of the learners’ computing environments (which is to say, we can’t preinstall nbgitpuller and can’t guarantee that learners will have the permissions or network access to install it themselves).

It’s almost as if various pieces keep appearing, but the jigsaw never quite seems to fit together…

Running a PostgreSQL Server in a MyBinder Container

The original MyBinder service used to run an optional PostgreSQL DBMS alongside the Jupyter notebook service inside a Binder container (my original review).

But if you want to run a Postgres database in the same MyBinder environment nowadays, you need to add it in yourself.

Here are some recipes with different pros and cons. As @manics comments here, “[m]ost distributions package postgres to be run as a system service, so the user permissions are locked down.”, which means that you can’t run Postgres as an arbitrary user. The best approach is probably the last one, which uses an Anaconda packaged version of Postgres that has a more liberal attitude…

Recipe the First – Hacking Permissions

I picked up this approach from dchud/datamanagement-notebook/ based around Docker. It gets around the problem that the Postgres Linux package requires a particular user (postgres) or an alternative user with root permissions to start and stop the server.

Use a Dockerfile to install Postgres and create a simple database test user, as well as escalating the default notebook user, jovyan, to sudoers (along with the password redspot). The jovyan user can then start / stop the Postgres server via an appropriate entrypoint script.

USER root

RUN chown -R postgres:postgres /var/run/postgresql
RUN echo "jovyan ALL=(ALL)   ALL" >> /etc/sudoers
RUN echo "jovyan:redspot" | chpasswd

COPY ./entrypoint.sh /
RUN chmod +x /entrypoint.sh

USER $NB_USER
ENTRYPOINT ["/entrypoint.sh"]

The entrypoint.sh script will start the Postgres server and then continue with any other start-up actions required to start the Jupyter notebook server installed by repo2docker / MyBinder by default:

#!/bin/bash
set -e

echo redspot | sudo -S service postgresql start

exec "$@"

Try it on MyBinder from here.

A major issue with this approach is that you may not want jovyan, or another user, to have root privileges.

Recipe The Second – Hacking Fewer Permissions

The second example comes from @manics/@crucifixkiss and is based on manics/omero-server-jupyter.

In this approach, which also uses a Dockerfile, we again escalate the privileges of the jovyan user, although this time in a more controlled way:

USER root

#The trick in this Dockerfile is to change the ownership of /run/postgresql
RUN  apt-get update && \
    apt-get install -qq -y \
        postgresql postgresql-client && apt-get clean && \
    chown jovyan /run/postgresql/

COPY ./entrypoint.sh  /
RUN chmod +x /entrypoint.sh

In this case, the entrypoint.sh script doesn’t require any tampering with sudo:

#!/bin/bash
set -e

PGDATA=${PGDATA:-/home/jovyan/srv/pgsql}

if [ ! -d "$PGDATA" ]; then
  /usr/lib/postgresql/10/bin/initdb -D "$PGDATA" --auth-host=md5 --encoding=UTF8
fi
/usr/lib/postgresql/10/bin/pg_ctl -D "$PGDATA" status || /usr/lib/postgresql/10/bin/pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

psql postgres -c "CREATE USER testuser PASSWORD 'testpass'"
createdb -O testuser testdb

exec "$@"

You can try it on MyBinder from here.

Recipe the Third – An Alternative Distribution

The third approach is again via @manics and uses an Anaconda packaged version of Postgres, installing the postgresql package via an environment.yml file.

A postBuild step initialises everything and pulls in a script to set up a dummy user and database.

#!/bin/bash
set -eux

#Make sure that everything is initialised properly
PGDATA=${PGDATA:-/home/jovyan/srv/pgsql}
if [ ! -d "$PGDATA" ]; then
  initdb -D "$PGDATA" --auth-host=md5 --encoding=UTF8
fi

#Start the database during the build process
# so that we can seed it with users, a dummy seeded db, etc
pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

#Call a script to create a dummy user and seeded dummy db
#Make sure that the script is executable...
chmod +x $HOME/init_db.sh
$HOME/init_db.sh

For example, here’s a simple init_db.sh script:

#!/bin/bash
set -eux

THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

#Demo PostgreSQL Database initialisation
psql postgres -c "CREATE USER testuser PASSWORD 'testpass'"

#The -O flag below sets the user: createdb -O DBUSER DBNAME
createdb -O testuser testdb

psql -d testdb -U testuser -f $THISDIR/seed_db.sql

which in turn pulls in a simple .sql file to seed the dummy database:

-- Demo PostgreSQL Database initialisation

DROP TABLE IF EXISTS quickdemo CASCADE;
CREATE TABLE quickdemo(id INT, name VARCHAR(20), value INT);
INSERT INTO quickdemo VALUES(1,'This',12);
INSERT INTO quickdemo VALUES(2,'That',345);

Picking up on the recipe described in an earlier post (AutoStarting A Headless OpenRefine Server in MyBinder Using Repo2Docker and a start Config File), the database is autostarted using a start file:

#!/bin/bash
set -eux
PGDATA=${PGDATA:-/home/jovyan/srv/pgsql}
pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

exec "$@"

In a Jupyter notebook, we can connect to the database in several ways.

For example, we can connect directly using the psycopg2 package:

import psycopg2

conn = psycopg2.connect("dbname='postgres'")
cur = conn.cursor()
cur.execute("SELECT datname from pg_database")

cur.fetchall()

Alternatively, we can connect using something like ipython-sql magic, with a passwordless connection string that attaches us as the default (jovyan) user using the default connection details (default port etc.): postgresql:///postgres

Or we can go to the other extreme, and use a connection string that connects us using the test user credentials, explicit host/port details, and a specified database: postgresql://testuser:testpass@localhost:5432/testdb
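For example, something along the following lines (assuming the ipython-sql package is installed in the environment):

%load_ext sql

# Passwordless connection as the default user
%sql postgresql:///postgres

# Or connect with the test user credentials to the seeded database
%sql postgresql://testuser:testpass@localhost:5432/testdb

%sql SELECT * FROM quickdemo;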

You can try it on MyBinder from here.

Simple Demo of Green Screen Principle in a Jupyter Notebook Using MyBinder

One of my favourite bits of edtech, in the form of open educational technology infrastructure, at the moment is mybinder (code), which allows you to fire up a semi-customised Docker container and run Jupyter notebooks based on the contents of a github repository. This makes it trivial to share interactive, Jupyter notebook demos, as long as you’re happy to make your notebooks public and pop them into github.

As an example, here’s a simple notebook I knocked up yesterday to demonstrate how we could create a composited image from a foreground image captured against a green screen, and a background image we wanted to place behind our foregrounded character.

The recipe was based on one I found in a Bryn Mawr College demo (Bryn Mawr is one of the places I look to for interesting ways of using Jupyter notebooks in an educational context.)

The demo works by looking at each pixel in turn in the foreground (greenscreened) image and checking its RGB colour value. If it looks to be green, use the corresponding pixel from the background image in the composited image; if it’s not green, use the colour values of the pixel in the foreground image.

The trick comes in setting appropriate threshold values to detect the green coloured background. Using Jupyter notebooks and ipywidgets, it’s easy enough to create a demo that lets you try out different “green detection” settings using sliders to select RGB colour ranges. And using mybinder, it’s trivial to share a copy of the working notebook – fire up a container and look for the Green screen.ipynb notebook: demo notebooks on mybinder.
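At its simplest, the per-pixel logic looks something like this (a sketch using numpy and PIL rather than the original notebook’s code, with made-up threshold values and hypothetical filenames):

import numpy as np
from PIL import Image

def composite(foreground_path, background_path, r_max=100, g_min=150, b_max=100):
    """Replace 'green enough' pixels in the foreground with the background image."""
    fg = np.array(Image.open(foreground_path).convert('RGB'))
    bg = np.array(Image.open(background_path).convert('RGB').resize((fg.shape[1], fg.shape[0])))

    r, g, b = fg[..., 0], fg[..., 1], fg[..., 2]
    # A pixel counts as green screen if it is strongly green and weakly red/blue
    mask = (g > g_min) & (r < r_max) & (b < b_max)

    out = fg.copy()
    out[mask] = bg[mask]
    return Image.fromarray(out)

# composite('greenscreen.png', 'background.png')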


(You can find the actual notebook code on github here.)

I was going to say that one of the things I don’t think you can do at the moment is share a link to an actual notebook, but in that respect I’d be wrong… The reason I thought that was that to launch a mybinder instance, eg from the psychemedia/ou-tm11n github repo, you’d use a URL of the form http://mybinder.org/repo/psychemedia/ou-tm11n; this then launches a container instance at a dynamically created location – eg http://SOME_IP_ADDRESS/user/SOME_CONTAINER_ID – with a URL and container ID that you don’t know in advance.

The notebook contents of the repo are copied into a notebooks folder in the container when the container image is built from the repo, and accessed down that path on the container URL, such as http://SOME_IP_ADDRESS/user/SOME_CONTAINER_ID/notebooks/Green%20screen%20-%20tm112.ipynb.

However, on checking, it seems that any path added to the mybinder call is passed along and appended to the URL of the dynamically created container.

Which means you can add the path to a notebook in the repo to the notebooks/ path when you call mybinder – http://mybinder.org/repo/psychemedia/ou-tm11n/notebooks/Green%20screen%20-%20tm112.ipynb – and the path will be passed through to the launched container.

In other words, you can share a link to a live notebook running on dynamically created container – such as this one – by calling mybinder with the local path to the notebook.

You can also go back up to the Jupyter notebook homepage from a notebook page by going up a level in the URL to the notebooks folder, eg http://mybinder.org/repo/psychemedia/ou-tm11n/notebooks/ .

I like mybinder a bit more each day:-)