A couple of weeks ago, Chris Holdgraf published a handy tip on the Jupyter Discourse site about how to embed custom github content in a Binder link with nbgitpuller.
One of the problems with (features of…) MyBinder is that if you make a change to a repo, even if it’s just a change to the README, it will spawn a rebuild of the Docker image built from the repo the next time the repo is launched onto MyBinder.
With the recent announcement of the Binder Federation, whereby there are multiple clusters (currently two…) onto which MyBinder launch requests are mapped, if each cluster maintains its own Docker image hub, this could mean that with N clusters available, your next N launches may all require a rebuild if each launch request is mapped to a different cluster.
So how does nbgitpuller help? If you install nbgitpuller into a Binderised repository, you can launch a container on MyBinder with a git-pull? argument. This will grab the contents of a specified repository into a notebook server environment before presenting you with the notebook homepage.
What this means is that we can construct a MyBinder URL that will:
- launch a container built from one repo; and
- populate it with files pulled from another.
The advantage of this is that you can create one repo with a complex set of build requirements and build a MyBinder image from that once and once only. If you also maintain a second repository with notebook files, or a package definition, with frequent changes, but run it in a Binderised container launched from the “fixed” build repo, you won’t need to rebuild the container each time: just launch from the pre-built one and then synch the changed content in from the other repo.
To pull the contents of a repo http://github.com/USER/REPO into a MyBinder container built from a particular binder-base-boxes branch, use a MyBinder URL of the form:
https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO
To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH, use a MyBinder URL of the form:
https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26branch=BRANCH
Note the escaping on the & conjunction between the repo and branch arguments that keeps it inside the scope of the git-pull?repo phrase.
To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH and launch into a particular notebook, use a MyBinder URL of the form:
https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26branch=BRANCH%26subPath=FILENAME.ipynb
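As a rough sketch, the URL patterns above could be generated with a small helper along the following lines. The function name and its parameters are hypothetical, made up for illustration, and the code assumes GitHub-hosted repos. It follows the URL form used in this post, escaping only the & separators as %26. A more defensive version might percent-encode the whole urlpath value.

```python
def binder_gitpull_url(build_repo, build_ref, content_repo,
                       branch=None, subpath=None,
                       binder="https://mybinder.org"):
    """Build a MyBinder URL that launches build_repo and pulls in content_repo.

    Hypothetical helper: build_repo and content_repo are "USER/REPO" strings,
    build_ref is the branch of the build repo to launch from.
    """
    # The nbgitpuller request that the launched notebook server will receive
    inner = f"git-pull?repo=https://github.com/{content_repo}"
    if branch:
        inner += f"&branch={branch}"
    if subpath:
        inner += f"&subPath={subpath}"
    # Escape & so the extra arguments stay inside the urlpath value
    # rather than being read as separate MyBinder arguments
    inner = inner.replace("&", "%26")
    return f"{binder}/v2/gh/{build_repo}/{build_ref}/?urlpath={inner}"


print(binder_gitpull_url("ouseful-demos/binder-base-boxes", "BASEBOXBRANCH",
                         "USER/REPO", branch="BRANCH",
                         subpath="FILENAME.ipynb"))
```

Leaving out branch and subpath gives the simplest form of the link, with nbgitpuller falling back on its defaults.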
You can see several examples in the various branches of https://github.com/ouseful-demos/binder-base-boxes.
See Feeding a MyBinder Container Built From One Github Repository With the Contents of Another for an earlier review of this approach (which I have to admit, I’d forgotten I’d posted when I started this post!).
On my to do list is to try to add a tab to the nbgitpuller/link generator to simplify the process of link creation. But in addition to a helper tool, is there a convention we might adopt to make it clearer when we are using this sort of split build/content repo approach?
Github conventionally uses the gh-pages branch as a “reserved” branch for constructing Github Pages docs related to a particular repo. Could we take a similar approach for defining a “Binder build” branch?
The binder/ directory in a repo can be used to partition Binder build requirements in a repo, but there are a couple of problems associated with this:
- a maintainer may not want to have the binder/ directory cluttering their package repo;
- any updates to the repo will force a rebuild of the Binder image next time the repo is run on a particular Binder node. (With Binder federation, if there are N hosts in the federation, after updating a repo, is it possible that my next N attempts to run the repo on MyBinder may require a rebuild if I am directed to a different host each time?)
If by convention something like a binder-build branch was used to contain the build requirements for a repo, then the process for calling a build (by default) could be simplified.
Eg rather than having something like:
https://mybinder.org/v2/gh/colinleach/binder-box/master/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter
we would have something like:
https://mybinder.org/v2/gh/colinleach/astro-Jupyter/binder-build/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter
which could simplify to something that defaults to a build from binder-build branch (the “build” branch) and nbgitpull from master (the “content” branch):
https://mybinder.org/v2/gh/colinleach/astro-Jupyter?binder-build=True
Complications could be added to support changing the build branch, the nbgitpull branch, the commit/ID of a particular build, etc?
It might overly complicate things further, but I could also imagine:
- automatically injecting nbgitpuller into the Binder image and enabling it;
- providing some sort of directive support so that if the content directory has a setup.py file the package from that content directory is installed.
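To make the proposed shorthand concrete, here is a minimal sketch of how a ?binder-build=True link might be expanded into the full nbgitpuller form shown earlier. This is purely illustrative: MyBinder does not support such an argument, and the function name and the default build branch name are assumptions taken from the suggested convention. The content branch is left unspecified, so nbgitpuller would fall back on its default.

```python
from urllib.parse import urlsplit, parse_qs


def expand_binder_build(short_url, build_branch="binder-build"):
    """Expand a hypothetical ?binder-build=True link into a full git-pull link."""
    parts = urlsplit(short_url)
    flags = parse_qs(parts.query)
    if flags.get("binder-build", [""])[0].lower() != "true":
        raise ValueError("not a binder-build style link")
    # Expect a path of the form /v2/gh/USER/REPO
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 4 or segments[:2] != ["v2", "gh"]:
        raise ValueError("expected a URL of the form .../v2/gh/USER/REPO")
    user, repo = segments[2], segments[3]
    # Build from the reserved build branch, then pull content from the same repo
    return (f"{parts.scheme}://{parts.netloc}/v2/gh/{user}/{repo}/{build_branch}/"
            f"?urlpath=git-pull?repo=https://github.com/{user}/{repo}")


print(expand_binder_build(
    "https://mybinder.org/v2/gh/colinleach/astro-Jupyter?binder-build=True"))
```

Supporting a different build branch, a content branch, or a pinned commit would just mean adding further optional arguments to the shorthand and threading them through in the same way.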
Binder Buildpacks
As well as defining dynamically constructed Binder base boxes built from one repo and used to provide an environment within which to run the contents of another, there is a second sense in which we might define Binder base boxes and that is to consider the base environment on which repo2docker constructs a Binder image.
In the nbgitpuller approach, I am treating the Binder base box (sense 1) as the environment that the git pulled content runs in. In the buildpack approach, the Binder base box (sense 2) is the image that repo2docker uses to bootstrap the Binder image build process. Binder base box sense 1 = Binder base box sense 2 + Binder repo build process. Maybe it’d make more sense to swap those senses, so sense 2 builds on sense 1?!
This approach is discussed in the repo2docker issue #487, Make it possible to configure the base image, with an example implementation in pangeo-stacks/pull/27. The implementation allows users to create a Dockerfile in which they specify a required base Docker image upon which the normal apt.txt, environment.yml, requirements.txt and postBuild steps can be applied.
The Dockerfile FROM statement takes the form:
FROM yuvipanda/pangeo-base-notebook-onbuild:2019.04.15-4
and then other build files (requirements.txt etc) are declared as normal.
The -onbuild component marks out the base image as one that should be built on (I think). I’m not sure how the date component applies (or whether it is required or optional). I’m not sure if the base box itself also needs some custom configuration? I think an example of the code used to build it is in the base-notebook directory of this repo: https://github.com/yuvipanda/pangeo-stacks .
Summary
Installing nbgitpuller into a Binderised repo allows us to pull the contents of a second Github repository into the first. This means we can build a complex environment from one repository once and pull regularly updated content from another repo into it without needing a rebuild step. Using the -onbuild approach, Binderhub can use repo2docker to build a Binder image from a user defined base image and then apply normal build steps to it. This means that optimised base boxes can be defined on which additional customisations can be layered. This can also make development of Binder boxes more efficient by starting rebuilds further up the image layer stack, building on top of prebuilt boxes rather than having to build images from scratch.
Re: url patterns for this see http://edu.oggm.org/en/latest/user_content.html and this thread: https://discourse.jupyter.org/t/tip-embed-custom-github-content-in-a-binder-link-with-nbgitpuller/922/24?u=psychemedia