A couple of weeks ago, Chris Holdgraf published a handy tip on the Jupyter Discourse site about how to embed custom github content in a Binder link with nbgitpuller.
One of the problems with (features of…) MyBinder is that if you make a change to a repo, even if it’s just a change to the README, it will spawn a rebuild of the Docker image built from the repo the next time the repo is launched onto MyBinder.
With the recent announcement of the Binder Federation, whereby there are multiple clusters (currently two…) onto which MyBinder launch requests are mapped, if each cluster maintains its own Docker image hub, this could mean that with N clusters available, your next N launches may all require a rebuild if each launch request is mapped to a different cluster.
So how does nbgitpuller help? If you install nbgitpuller into a Binderised repository, you can launch a container on MyBinder with a git-pull? argument. This will grab the contents of a specified repository into a notebook server environment before presenting you with the notebook homepage.
What this means is that we can construct a MyBinder URL that will:
- launch a container built from one repo; and
- populate it with files pulled from another.
The advantage of this is that you can create one repo with a complex set of build requirements and build a MyBinder image from that once and once only. If you also maintain a second repository with notebook files, or a package definition, with frequent changes, but run it in a Binderised container launched from the “fixed” build repo, you won’t need to rebuild the container each time: just launch from the pre-built one and then sync the changed content in from the other repo.
To pull the contents of a repo http://github.com/USER/REPO into a MyBinder container built from a particular binder-base-boxes branch, use a MyBinder URL of the form:
To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH, use a MyBinder URL of the form:
Note the escaping on the & conjunction that introduces the branch argument: this keeps the branch argument inside the scope of the git-pull? argument.
To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH and launch into a particular notebook, use a MyBinder URL of the form:
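The three URL patterns described above can also be assembled programmatically. What follows is a minimal sketch rather than a definitive recipe: the function name is made up for illustration, and it assumes the usual MyBinder `/v2/gh/ORG/REPO/REF?urlpath=...` launch pattern together with nbgitpuller's `repo`, `branch` and `urlpath` query parameters as I understand them from the nbgitpuller docs. The one thing it is really meant to show is the `&` to `%26` escaping that keeps the extra arguments inside the git-pull? argument.

```python
from urllib.parse import quote

BINDER_HOST = "https://mybinder.org"


def binder_gitpull_url(build_repo, build_ref, content_repo,
                       content_branch=None, open_path=None):
    """Build a MyBinder launch URL that pulls content_repo into a
    container built from build_repo at build_ref.

    Hypothetical helper: the parameter names passed to git-pull
    (repo, branch, urlpath) are assumptions based on nbgitpuller's
    documented link format.
    """
    # The nbgitpuller request, written as an ordinary query string.
    inner = f"git-pull?repo=https://github.com/{content_repo}"
    if content_branch:
        inner += f"&branch={content_branch}"
    if open_path:
        inner += f"&urlpath={open_path}"

    # Escape "&" as "%26" so that MyBinder treats the whole git-pull
    # request as the value of its own urlpath argument.
    escaped = quote(inner, safe=":/?=")
    return f"{BINDER_HOST}/v2/gh/{build_repo}/{build_ref}?urlpath={escaped}"


if __name__ == "__main__":
    base = ("ouseful-demos/binder-base-boxes", "BRANCH")
    print(binder_gitpull_url(*base, "USER/REPO"))
    print(binder_gitpull_url(*base, "USER/REPO", content_branch="BRANCH"))
    print(binder_gitpull_url(*base, "USER/REPO", content_branch="BRANCH",
                             open_path="tree/REPO/NOTEBOOK.ipynb"))
```

The `open_path` value is passed through as given, so whatever path convention the target notebook server expects (I have used a `tree/REPO/...` style placeholder here) is left to the caller.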
You can see several examples in the various branches of https://github.com/ouseful-demos/binder-base-boxes.
See Feeding a MyBinder Container Built From One Github Repository With the Contents of Another for an earlier review of this approach (which I have to admit, I’d forgotten I’d posted when I started this post!).
On my to-do list is to try to add a tab to the nbgitpuller/link generator to simplify the process of link creation. But in addition to a helper tool, is there a convention we might adopt to make it clearer when we are using this sort of split build/content repo approach?
Github conventionally uses the gh-pages branch as a “reserved” branch for constructing Github Pages docs related to a particular repo. Could we take a similar approach for defining a “Binder build” branch?
The binder/ directory in a repo can be used to partition Binder build requirements in a repo, but there are a couple of problems associated with this:
- a maintainer may not want to have the binder/ directory cluttering their package repo;
- any updates to the repo will force a rebuild of the Binder image next time the repo is run on a particular Binder node. (With Binder federation, if there are N hosts in the federation, after updating a repo, is it possible that my next N attempts to run the repo on MyBinder may require a rebuild if I am directed to a different host each time?)
If by convention something like a binder-build branch was used to contain the build requirements for a repo, then the process for calling a build (by default) could be simplified.
Eg rather than having something like:
we would have something like:
which could simplify to something that defaults to a build from binder-build branch (the “build” branch) and nbgitpull from master (the “content” branch):
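To make the idea of convention-based defaults a little more concrete, here is a rough sketch of what a link helper might do if it only knew the repo name. Everything here is hypothetical: the function name is mine, the binder-build default is the convention being floated in this post rather than anything MyBinder currently supports, and the URL pattern again assumes the standard MyBinder launch form with an nbgitpuller git-pull request tucked inside urlpath.

```python
from urllib.parse import quote


def conventional_binder_url(repo, build_branch="binder-build",
                            content_branch="master"):
    """Expand USER/REPO into a full MyBinder + nbgitpuller link using
    the proposed defaults: build from the binder-build branch and pull
    content from master.

    Hypothetical sketch of a convention, not an existing MyBinder feature.
    """
    inner = f"git-pull?repo=https://github.com/{repo}&branch={content_branch}"
    # Escape "&" so the branch argument stays inside the urlpath value.
    escaped = quote(inner, safe=":/?=")
    return f"https://mybinder.org/v2/gh/{repo}/{build_branch}?urlpath={escaped}"


if __name__ == "__main__":
    # Defaults: build branch "binder-build", content branch "master".
    print(conventional_binder_url("USER/REPO"))
    # Overriding the content branch is one of the "complications".
    print(conventional_binder_url("USER/REPO", content_branch="dev"))
```

Keeping the build and content branches as keyword arguments with defaults means the simple case needs only the repo name, while the overrides stay available.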
Complications could be added to support changing the build branch, the nbgitpull branch, the commit/ID of a particular build, etc?
It might overly complicate things further, but I could also imagine:
- automatically injecting nbgitpuller into the Binder image and enabling it;
- providing some sort of directive support so that if the content directory has a setup.py file the package from that content directory is installed.
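As a sketch of what the second of those ideas might look like, the check itself is simple enough: look for a setup.py in the pulled content directory and, if there is one, hand the directory to pip. The function name and the idea of running this as a post-pull step are assumptions of mine; nothing in nbgitpuller or Binder does this today as far as I know.

```python
import sys
from pathlib import Path


def install_command_for(content_dir):
    """Return the pip command that would install content_dir as a
    package, or None if the directory has no setup.py.

    Hypothetical helper illustrating the proposed directive; it only
    builds the command and does not run it.
    """
    content_dir = Path(content_dir)
    if (content_dir / "setup.py").is_file():
        # Use the current interpreter's pip so the package lands in
        # the same environment the notebook server is running in.
        return [sys.executable, "-m", "pip", "install", str(content_dir)]
    return None


if __name__ == "__main__":
    cmd = install_command_for(".")
    print(cmd if cmd else "No setup.py found; nothing to install.")
```

Returning the command rather than running it keeps the sketch side-effect free; a real hook could pass the result to `subprocess.run(cmd, check=True)`.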
As well as defining dynamically constructed Binder base boxes built from one repo and used to provide an environment within which to run the contents of another, there is a second sense in which we might define Binder base boxes and that is to consider the base environment on which repo2docker constructs a Binder image.
In the nbgitpuller approach, I am treating the Binder base box (sense 1) as the environment that the git pulled content runs in. In the buildpack approach, the Binder base box (sense 2) is the image that repo2docker uses to bootstrap the Binder image build process. Binder base box sense 1 = Binder base box sense 2 + Binder repo build process. Maybe it’d make more sense to swap those senses, so sense 2 builds on sense 1?!
This approach is discussed in the repo2docker issue #487 Make it possible to configure the base image with an example implementation in pangeo-stacks/pull/27. The implementation allows users to create a Dockerfile in which they specify a required base Docker image upon which the normal postBuild steps can be applied.
The Dockerfile's FROM statement names the base image, whose name carries an -onbuild component and a date component, and then other build files (requirements.txt etc.) are declared as normal.
The -onbuild component marks out the base image as one that should be built on (I think). I’m not sure how the date component applies (or whether it is required or optional). I’m not sure if the base box itself also needs some custom configuration? I think an example of the code used to build it is in the base-notebook directory of this repo: https://github.com/yuvipanda/pangeo-stacks .
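For what it's worth, the shape of that FROM line can be sketched as a tiny generator. This is only an illustration under assumptions: the function name is invented, the image name and date below are placeholders rather than real published tags, and the IMAGE-onbuild:DATE layout is simply my reading of the -onbuild and date components mentioned above.

```python
def onbuild_from_line(base_image, date_tag=None):
    """Return a Dockerfile FROM line for an -onbuild base image.

    Hypothetical sketch: assumes the base image is published under
    "<base_image>-onbuild", optionally tagged with a date.
    """
    image = f"{base_image}-onbuild"
    if date_tag:
        # Whether the date tag is required is unclear, so it is optional here.
        image += f":{date_tag}"
    return f"FROM {image}\n"


if __name__ == "__main__":
    # Placeholder names only; substitute a real image and tag.
    print(onbuild_from_line("EXAMPLE/base-notebook", "YYYY.MM.DD"), end="")
```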
Installing nbgitpuller into a Binderised repo allows us to pull the contents of a second Github repository into the first. This means we can build a complex environment from one repository once and pull regularly updated content from another repo into it without needing a rebuild step. Using the -onbuild approach, BinderHub can use repo2docker to build a Binder image from a user defined base image and then apply normal build steps to it. This means that optimised base boxes can be defined on which additional customisations can be layered. This can also make development of Binder boxes more efficient by starting rebuilds further up the image layer stack, building on top of prebuilt boxes rather than having to build images from scratch.