Fragment – Jupyter Kernels / MyBinder as a Remote Code Execution Sandbox for Moodle

Although I don’t know for sure, I suspect that administrators of computing infrastructure in educational establishments are wary of requests from academics for compute services that allow students to run arbitrary code.

One of the main reasons why an educator might want to support this is that setting up an environment can be hard: if you want a student to focus on writing code that makes use of particular packages, you probably don’t want them engaging in arcane sysadmin practices and spending all their time trying to install those packages in the first place.

For the IT department, the thought of running arbitrary code produced either by novices or by deliberately malicious users is likely to raise several well-founded concerns: how do we stop users using the code environment to attack the server or network the code is running on; how do we stop folk from running code on our servers that could be used to attack external sites; and how do we control the resource requirements (storage, compute, network) when mistakes happen and folk repeatedly try to download the internet to our server?

One way of making hosted compute available to students is to execute code within isolated sandboxed environments that you can park in a safe area of the network and monitor closely.

In our Moodle VLE, the Moodle CodeRunner environment is used to allow students to run small fragments of code within just such an environment when completing interactive quiz questions. (I provide a quick review of the Moodle CodeRunner plugin in the post A Quick First Look At Moodle CodeRunner.)

Presumably, someone somewhere has done a security audit and decided that the sandboxed code execution environment is a safe one and signed off on its use.

Another approach, described in this fragment on Jupyter notebooks and Moodle, is the SageCell filter for Moodle, which allows you to run code against an external (stateless) SageCell server:

<?php
/**
 * SageCell filter for Moodle 3.4+
 *
 *  This filter will replace any Sage code in [sage]...[/sage]
 *  with Ajax code from http://sagecell.sagemath.org
 *
 * @package    filter_sagecell
 * @copyright  2015-2018 Eugene Modlo, Sergey Semerikov
 * @license    http://www.gnu.org/copyleft/gpl.html GNU GPL v3 or later
 */

defined('MOODLE_INTERNAL') || die();

/**
 * Automatic SageCell embedding filter class.
 *
 * @package    filter_sagecell
 * @copyright  2015-2016 Eugene Modlo, Sergey Semerikov
 * @license    http://www.gnu.org/copyleft/gpl.html GNU GPL v3 or later
 */
class filter_sagecell extends moodle_text_filter {

    /**
     * Check text for Sage code in [sage]...[/sage].
     *
     * @param string $text
     * @param array $options
     * @return string
     */
    public function filter($text, array $options = array()) {

        if (!is_string($text) or empty($text)) {
            // Non string data can not be filtered anyway.
            return $text;
        }

        if (strpos($text, '[sage]') === false) {
            // Performance shortcut - if there is no [sage] tag, nothing can match.
            return $text;
        }

        $newtext = $text; // Fullclone is slow and not needed here.

        $search = '/\[sage](.+?)\[\/sage]/is';
        $newtext = preg_replace_callback($search, 'filter_sagecell_callback', $newtext);

        if (is_null($newtext) or $newtext === $text) {
            // Error or not filtered.
            return $text;
        }

        return $newtext;
    }

}

/**
 * Replace Sage code with embedded SageCell, if possible.
 *
 * @param array $sagecode
 * @return string
 */
function filter_sagecell_callback($sagecode) {

    // SageCell code from [sage]...[/sage].
    $output = $sagecode[1];
    $output = str_ireplace("<p>", "\n", $output);
    $output = str_ireplace("</p>", "\n", $output);
    $output = str_ireplace("<br>", "\n", $output);
    $output = str_ireplace("<br/>", "\n", $output);
    $output = str_ireplace("<br />", "\n", $output);
    $output = str_ireplace("&nbsp;", "\x20", $output);
    $output = str_ireplace("\xc2\xa0", "\x20", $output);
    $output = clean_text($output);
    $output = str_ireplace("&lt;", "<", $output);

    $id = uniqid("");

    $output = "<script src=\"https://sagecell.sagemath.org/static/embedded_sagecell.js\"></script>" .
    "<script>" .
        "sagecell.makeSagecell({inputLocation: \"#" . $id . "\"," .
        "evalButtonText: \"Evaluate\"," .
        "autoeval: true," .
        "hide: [\"evalButton\", \"editor\", \"messages\", \"permalink\", \"language\"] }" .
    ");" .
    "</script>" .
    "<div id=\"" . $id . "\">" . $output . "</div>";

    return $output;
}

So it looks to me like the SageCell Moodle filter essentially rewrites a [sage]...[/sage] delimited code block within a Moodle environment as a Javascript backed SageCell form, and then lets users run the code embedded in the form against the remote server. This sort of thing could presumably be used to support interactive, executable code activities within a Moodle hosted web page, for example.

As I remarked previously, it’s not hard to imagine doing something similar to provide a [mybinder repository="..."]...[/mybinder] filter that could use a Javascript library such as ThebeLab or Juniper to provide a similar style of interaction backed by a MyBinder launched repository, though minor tweaks may be required around those packages to handle stateless rather than stateful transactions if repeated calls are made to the server.

Going back to the CodeRunner plugin (as described here):

[i]nternally CodeRunner is designed to support multiple sandboxes, implemented as subclasses of the abstract class qtype_coderunner_sandbox – see sandbox.php. Sandboxes are essentially plugins to CodeRunner. Several different ones have been used over the years but the only current ones are the jobe sandbox (file jobesandbox.php) and the ideone sandbox. The latter interfaces to the Sphere On-line judge server but is now more-or-less defunct. Both of those sandboxes run as services. CodeRunner can support multiple sandboxes at the same time and questions can be configured to select a particular sandbox (if desired). By default the first available sandbox that supports the language required by the question is used.

So could we use a MyBinder launched Jupyter server to provide sandboxed code execution?

One advantage of this would be that we could define a Jupyter environment that students could use on their own machines, or that we could make available via a hosted notebook server, and that same environment could be used for CodeRunner style assessment.

Another advantage would be that if we want to run student created arbitrary code for teaching activities as well as CodeRunner based assessment activities, we’d only need to sign off on one sandboxed code execution environment rather than several.

So what’s required?

It’s years since I last used PHP, but I thought I’d have a go at creating a simple Python client that would let me:

  • start a MyBinder server against a specified Github repo;
  • start a kernel;
  • run a small code sample in the kernel and get a code execution response back.

Cribbing heavily from juniper.js and this rather handy sagecell-client.py, I came up with a hacky recipe that works as a minimal proof of concept here: mybinder_py_client-ipynb.
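The gist of the recipe, cribbed from the juniper.js flow, is: request a MyBinder build and wait for the launched server to become ready, start a kernel via the Jupyter REST API, then exchange Jupyter protocol messages over the kernel’s websocket channel. A minimal sketch (not the actual notebook code: the helper names are my own, it assumes the requests and websocket-client packages, and error handling, retries and build log display are all omitted) looks something like this:

import json
import uuid

import requests
import websocket  # pip install websocket-client


def launch_binder(user, repo, ref="master"):
    """Ask MyBinder to launch a repo; block until the server is ready."""
    build_url = f"https://mybinder.org/build/gh/{user}/{repo}/{ref}"
    with requests.get(build_url, stream=True) as resp:
        # The build endpoint replies with a server-sent event stream.
        for line in resp.iter_lines():
            if not line.startswith(b"data:"):
                continue
            event = json.loads(line[len(b"data:"):])
            if event.get("phase") == "ready":
                # url has a trailing slash; token authenticates API calls.
                return event["url"], event["token"]
    raise RuntimeError("Binder build never reached the ready phase")


def start_kernel(server_url, token):
    """Start a new kernel via the Jupyter REST API; return its id."""
    resp = requests.post(f"{server_url}api/kernels",
                         headers={"Authorization": f"token {token}"})
    resp.raise_for_status()
    return resp.json()["id"]


def run_code(server_url, token, kernel_id, code):
    """Send an execute_request; collect output until the kernel goes idle."""
    ws_url = server_url.replace("http", "ws", 1)
    ws = websocket.create_connection(
        f"{ws_url}api/kernels/{kernel_id}/channels?token={token}")
    msg_id = uuid.uuid4().hex
    ws.send(json.dumps({
        "header": {"msg_id": msg_id, "msg_type": "execute_request",
                   "username": "", "session": uuid.uuid4().hex,
                   "version": "5.2"},
        "parent_header": {}, "metadata": {}, "channel": "shell",
        "content": {"code": code, "silent": False},
    }))
    outputs = []
    while True:
        reply = json.loads(ws.recv())
        msg_type = reply["header"]["msg_type"]
        if msg_type in ("stream", "execute_result", "error"):
            outputs.append(reply["content"])
        # The kernel signals it is idle again once our request is done.
        if (msg_type == "status"
                and reply["content"]["execution_state"] == "idle"
                and reply["parent_header"].get("msg_id") == msg_id):
            break
    ws.close()
    return outputs


# Usage: launch a (placeholder) repo, start a kernel, run some code in it.
server_url, token = launch_binder("USER", "REPO")
kernel_id = start_kernel(server_url, token)
print(run_code(server_url, token, kernel_id, "print('hello from Binder')"))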

I think this is stateful, in that we can execute several code blocks one after the other and exploit state set up by previous calls to the same kernel. It would probably also make sense to have a call that forces a new kernel for each code execution call, as well as providing a recipe for killing a kernel.
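By way of illustration, reusing the helper names from the sketch above, a stateless call might look something like this (the Jupyter REST API shuts a kernel down in response to a DELETE request):

def run_stateless(server_url, token, code):
    """Run code in a fresh kernel, then dispose of the kernel again."""
    kernel_id = start_kernel(server_url, token)
    try:
        return run_code(server_url, token, kernel_id, code)
    finally:
        # DELETE /api/kernels/{id} kills the kernel, discarding its state.
        requests.delete(f"{server_url}api/kernels/{kernel_id}",
                        headers={"Authorization": f"token {token}"})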

The next step in trying to use this approach for a CodeRunner sandbox would presumably be to create a simple PHP based MyBinder client; the step after that would be to use it in a CodeRunner sandbox subclass.

But that’s out of scope for me atm…

Please let me know in the comments if you have a go at this… or know of any other Moodle / Jupyter integrations…

Binder Base Boxes, Several Ways…

A couple of weeks ago, Chris Holdgraf published a handy tip on the Jupyter Discourse site about how to embed custom github content in a Binder link with nbgitpuller.

One of the problems with (features of…) MyBinder is that if you make a change to a repo, even if it’s just a change to the README, it will spawn a rebuild of the Docker image built from the repo the next time the repo is launched onto MyBinder.

With the recent announcement of the Binder Federation, whereby MyBinder launch requests are mapped onto one of multiple clusters (currently two…), each of which maintains its own Docker image hub, this could mean that with N clusters available, your next N launches may all require a rebuild if each launch request is mapped to a different cluster.

So how does nbgitpuller help? If you install nbgitpuller into a Binderised repository, you can launch a container on MyBinder with a git-pull? argument. This will grab the contents of a specified repository into a notebook server environment before presenting you with the notebook homepage.

What this means is that we can construct a MyBinder URL that will:

  • launch a container built from one repo; and
  • populate it with files pulled from another.

The advantage of this is that you can create one repo with a complex set of build requirements and build a MyBinder image from that once and once only. If you also maintain a second repository with notebook files, or a package definition, with frequent changes, but run it in a Binderised container launched from the “fixed” build repo, you won’t need to rebuild the container each time: just launch from the pre-built one and then synch the changed content in from the other repo.

To pull the contents of a repo http://github.com/USER/REPO into a MyBinder container built from a particular binder-base-boxes branch, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO

To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26amp%3Bbranch=BRANCH

Note the escaping on the & conjunction between the repo and branch arguments that keeps it inside the scope of the git-pull?repo phrase.

To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH and launch into a particular notebook, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26amp%3Bbranch=BRANCH%26amp%3BsubPath=FILENAME.ipynb
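Wrapping those three patterns up, a throwaway Python helper might look something like this (the function name is my own, it only handles GitHub-hosted build repos, and it follows exactly the escaping convention used above):

def binder_gitpull_url(build_repo, build_ref, content_repo,
                       branch=None, subpath=None):
    """Build a MyBinder launch URL that nbgitpulls content from a second repo.

    build_repo is a GitHub USER/REPO slug for the (pre-built) environment
    repo; content_repo is the full URL of the repo to pull content from.
    """
    pull = f"git-pull?repo={content_repo}"
    # Extra git-pull arguments need their & escaped so they stay inside the
    # scope of the git-pull? phrase rather than the outer MyBinder URL.
    if branch:
        pull += f"%26amp%3Bbranch={branch}"
    if subpath:
        pull += f"%26amp%3BsubPath={subpath}"
    return f"https://mybinder.org/v2/gh/{build_repo}/{build_ref}/?urlpath={pull}"

For example, binder_gitpull_url("ouseful-demos/binder-base-boxes", "BASEBOXBRANCH", "https://github.com/USER/REPO", branch="BRANCH") regenerates the second URL pattern above.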

You can see several examples in the various branches of https://github.com/ouseful-demos/binder-base-boxes.

See Feeding a MyBinder Container Built From One Github Repository With the Contents of Another for an earlier review of this approach (which I have to admit, I’d forgotten I’d posted when I started this post!).

On my to do list is to try to add a tab to the nbgitpuller link generator to simplify the process of link creation. But in addition to a helper tool, is there a convention we might adopt to make it clearer when we are using this sort of split build/content repo approach?

Github conventionally uses the gh-pages branch as a “reserved” branch for constructing Github Pages docs related to a particular repo. Could we take a similar approach for defining a “Binder build” branch?

The binder/ directory can be used to partition off the Binder build requirements in a repo, but there are a couple of problems associated with this:

  • a maintainer may not want to have the binder/ directory cluttering their package repo;
  • any updates to the repo will force a rebuild of the Binder image next time the repo is run on a particular Binder node. (With Binder federation, if there are N hosts in the federation, after updating a repo, is it possible that my next N attempts to run the repo on MyBinder may require a rebuild if I am directed to a different host each time?)

If by convention something like a binder-build branch was used to contain the build requirements for a repo, then the process for calling a build (by default) could be simplified.

E.g. rather than having something like:

https://mybinder.org/v2/gh/colinleach/binder-box/master/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

we would have something like:

https://mybinder.org/v2/gh/colinleach/astro-Jupyter/binder-build/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

which could simplify to something that defaults to a build from binder-build branch (the “build” branch) and nbgitpull from master (the “content” branch):

https://mybinder.org/v2/gh/colinleach/astro-Jupyter?binder-build=True

Complications could be added to support changing the build branch, the nbgitpull branch, the commit/ID of a particular build, etc?

It might overly complicate things further, but I could also imagine:

  • automatically injecting nbgitpuller into the Binder image and enabling it;
  • providing some sort of directive support so that if the content directory has a setup.py file the package from that content directory is installed.

Binder Buildpacks

As well as defining dynamically constructed Binder base boxes built from one repo and used to provide an environment within which to run the contents of another, there is a second sense in which we might define Binder base boxes, and that is to consider the base environment on which repo2docker constructs a Binder image.

In the nbgitpuller approach, I am treating the Binder base box (sense 1) as the environment that the git-pulled content runs in. In the buildpack approach, the Binder base box (sense 2) is the image that repo2docker uses to bootstrap the Binder image build process. Binder base box sense 1 = Binder base box sense 2 + Binder repo build process. Maybe it’d make more sense to swap those senses, so sense 2 builds on sense 1?!

This approach is discussed in the repo2docker issue #487 Make it possible to configure the base image with an example implementation in pangeo-stacks/pull/27. The implementation allows users to create a Dockerfile in which they specify a required base Docker image upon which the normal apt.txt, environment.yml, requirements.txt and postBuild steps can be applied.

The Dockerfile FROM statement takes the form:

FROM yuvipanda/pangeo-base-notebook-onbuild:2019.04.15-4

and then other build files (requirements.txt etc) are declared as normal.

The -onbuild component marks out the base image as one that should be built on (I think). I’m not sure how the date component applies (or whether it is required or optional). I’m also not sure whether the base box itself needs some custom configuration? I think an example of the code used to build it is in the base-notebook directory of this repo: https://github.com/yuvipanda/pangeo-stacks.

Summary

Installing nbgitpuller into a Binderised repo allows us to pull the contents of a second Github repository into the first. This means we can build a complex environment from one repository once, and pull regularly updated content from another repo into it without needing a rebuild step.

Using the -onbuild approach, Binderhub can use repo2docker to build a Binder image from a user defined base image and then apply the normal build steps to it. This means that optimised base boxes can be defined on which additional customisations can be layered. This can also make the development of Binder boxes more efficient, by starting rebuilds further up the image layer stack: builds start from a prebuilt box rather than having to build images from scratch.

MyBinder Launches From Any git Repository: Github, Gists, GitLab, Bitbucket etc

By default, MyBinder looks to repositories on Github for its builds, but it can also build from Github gists, GitLab.com repositories, and, well, any git repository with a networked endpoint, it seems.

What prompted this was looking for a way to launch a MyBinder container from Bitbucket. (For the archaeologists, there are various issues and PRs (such as here and here), as well as a recent forum post — How to use bitbucket repositories on mybinder.org — that trace some of the history…)

So what’s the trick?

For now, you need to get hold of the URL to a particular Bitbucket repo commit. For example, to try running this repo you need to go to the Commits page and grab the URL for the most recent master commit (or whichever one you want), which will contain the commit hash.

So, something like https://bitbucket.org/ueacomputervision/image-labelling-tool/commits/f3ddb33e4839f8a0fe73c168993b405adc13daf0 gives the commit hash f3ddb33e4839f8a0fe73c168993b405adc13daf0.

For the repo base URL https://bitbucket.org/ueacomputervision/image-labelling-tool, the MyBinder launch link then takes on the form:

https://mybinder.org/v2/git/https%3A%2F%2Fbitbucket.org%2Fueacomputervision%2Fimage-labelling-tool.git/f3ddb33e4839f8a0fe73c168993b405adc13daf0

which is to say:

https://mybinder.org/v2/git/ESCAPED_REPO_URL.git/COMMIT_HASH
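If you want to generate the escaped form programmatically, Python’s urllib does the work; a quick sketch, using the example repo above:

from urllib.parse import quote

# Percent-encode the full repo URL, scheme and all (safe="" stops quote()
# leaving the / characters unescaped), then append .git and the commit hash.
repo = "https://bitbucket.org/ueacomputervision/image-labelling-tool"
commit = "f3ddb33e4839f8a0fe73c168993b405adc13daf0"

launch_url = f"https://mybinder.org/v2/git/{quote(repo, safe='')}.git/{commit}"
print(launch_url)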

But it does look like things may get easier in the near future…

Feeding a MyBinder Container Built From One Github Repository With the Contents of Another

Long time readers should be more than well aware by now of MyBinder, the Jupyter project service that will build a Docker image from the contents of a git repository and then launch a container based on that image so you can work with a live, running, albeit temporary, instance of it.

But that’s not all it can do…

Via Chris Holdgraf on the Jupyter discourse community site (Tip: embed custom github content in a Binder link with nbgitpuller), comes a magical trick whereby you can launch a MyBinder instance built from one repository and populate it with files from another.

Why’s this useful? Well, if you’ve had a play with your own repos using MyBinder, you’ll know that each time you make a change to a repository, MyBinder will want to rebuild the Docker image next time you try to launch the repo there.

So if your repo defines a complex build that takes some time to install all of its dependencies, you have to wait for that build even if all you did was correct a typo in the markdown of a notebook file.

So here’s the trick…

nbgitpuller is a Jupyter server extension that supports the “one-way synchronization of a remote git repository to a local git repository”.

There are other approaches to git syncing too. See the next edition of Tracking Jupyter to find out what they are…

Originally developed as a tool to help distribute notebooks to students, it can be called via a Jupyter server URL. For example, if you have nbgitpuller installed in a local Jupyter server running on the default port 8888, a URL of the following form will pull data from the specified repo into the base directory the notebook server points to:

localhost:8888/git-pull?repo=https://github.com/USER/NOTEBOOK_REPO

One of the neat things about Binderhub / MyBinder is that you can pass a git-pull? argument through as part of a MyBinder launch URL, so if the repo you want to build from installs and enables nbgitpuller, you can then pull notebooks into the launched container from a second, nbgitpulled repository.

For example, yesterday I came across the Python show_ast package, and its incorporated IPython magic, which will render the abstract syntax tree of a Python command.

Such a thing may be useful in an introductory programming course (TBH, I’m never really sure what people try to teach in introductory programming courses, what the useful mental models are, how best to help folk learn them, and how to figure out how to teach them…).

As with most Python based repos, particularly ones that contain Jupyter notebooks (that is, .ipynb files [….thinks… ooh… I has a plan to try s/thing else too….]), I generally try to “run” them via MyBinder. In this case, the repo didn’t work because there is a dependency on the Linux graphviz apt package and the Python graphviz package.

At this point, I’d generally fork the repo, create a binderise branch containing the dependencies, then try that out on MyBinder, sometimes adding an issue and/or making a pull request to the original repository suggesting they Binderise it…

…but nbgitpuller provides a different opportunity. Suppose I create a base container that contains the Graphviz Linux application and the graphviz Python package. Something like this: ouseful-testing/binder-graphviz.

Then I can create a MyBinder session from that repo and pull in the show_ast package from its repo and run the notebook directly:

https://mybinder.org/v2/gh/ouseful-testing/binder-graphviz/master/?urlpath=git-pull?repo=https://github.com/hchasestevens/show_ast

Fortuitously, things work seamlessly in this case because the example notebook lives in a directory from which we can import show_ast without needing to install it (otherwise we’d have needed to run pip install . at the top level of the repo). In general, where notebooks are kept in a notebooks or docs directory, for example, the path to import the package would break. (Hmmm… I need to think about protocols for handling that… It’s better practice to put the notebooks somewhere of their own, but that means we need to install the package or change the import path to it, which is one more step for folk to stumble over…)

Thinking about my old show’n’tell repo, the branches of which scruffily define various Binder environments suited to particular topic areas (environments for working on chemistry notebooks, for example, or astronomy notebooks, or classical language or music notebooks) and also contain demo notebooks, I could instead just define a set of base Binder environment containers, slow to build but built infrequently, and then lighter weight notebook repos containing just demo notebooks for a particular topic area. These could then be quickly and easily updated, and run on MyBinder having been nbgitpulled by a base container, without having to rebuild the base container each time I update a notebook in a notebook repo.

A couple of other things to note here. First, nbgitpuller has its own helper for creating nbgitpuller URLs, the nbgitpuller link generator.

It’s not hard to imagine a similar UI, or another tab to that UI, that can build a MyBinder link from a “standard” base container selected from a dropdown menu (or an optional link to a git repo) and then a provided git repo link for the target content repo.

Second, this has got me thinking about how we (don’t) handle notebook distribution very well in the OU.

For our TM351 internal course, we control the students’ computing environment via a VM we provide them with, so we could install nbgitpuller in it, but the notebooks are stored in a private Github repo and we don’t want to give students any keys to it at all. (For some reason, I seem to be the only person who doesn’t have a problem with the notebooks being in a public repo!;-)

For our public notebook utilising courses on FutureLearn or OpenLearn, the notebooks are in a public repo, but we don’t have control of the learners’ computing environments (which is to say, we can’t preinstall nbgitpuller and can’t guarantee that learners will have the permissions or network access to install it themselves).

It’s almost as if various pieces keep appearing, but the jigsaw never quite seems to fit together…