Exploring Jupytext – Creating Simple Python Modules Via a Notebook UI

Although I spend a lot of my coding time in Jupyter notebooks, there are several practical problems associated with working in that environment.

One problem is that under version control, it can be hard to tell what’s changed. For one thing, the notebook .ipynb format, which saves as a serialised JSON object, is hard to read cleanly:

The .ipynb format also records changes to cell execution state, including cell execution count numbers and changes to cell outputs (which may take the form of large encoded strings when a cell output is an image, or chart, for example):

Another issue arises when trying to write modules in a notebook that can be loaded into other notebooks.

One workaround for this is to use the notebook loading hack described in the official docs: Importing notebooks. This requires installing a notebook loader module that then allows you to import other notebooks as modules. Once the notebook loader module is installed, you can run things like:

  • `import mycode as mc` to load code in from `mycode.ipynb`
  • `moc = __import__("My Other Code")` to load code in from `My Other Code.ipynb`

If you want to include code that can run in the notebook, but that is not executed when the notebook is loaded as a module, you can guard items in the notebook:
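
For example, a guarded cell might look something like this (a minimal sketch; the function is made up):

```python
def double(x):
    """A function we want to be able to import from this notebook."""
    return 2 * x

# Guarded cell: this block runs when the notebook is executed in the
# notebook UI, but not when the notebook is imported as a module
if __name__ == '__main__':
    print(double(2))  # scrappy development / testing code
```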

In this case, the `if __name__=='__main__':` guard will run the code in the code cell when the notebook is run in the notebook UI, but will not run it when the notebook is loaded as a module.

Guarding code can get very messy very quickly, so is there an easier way?

And is there an easier way of using notebooks more generally as an environment for creating code+documentation files that better meet the needs of a variety of users? For example, I note this quote from Daniele Procida recently shared by Simon Willison:

Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances—so software usually needs them all.

This suggests a range of different documentation styles for different purposes, although I wonder if that is strictly necessary?

When I am hacking code together, I find that I start out by writing things a line at a time, checking the output for each line, then grouping lines in a single cell and checking the output, then wrapping things in a function (for an example of this in practice, see Programming in Jupyter Notebooks, via the Heavy Metal Umlaut). I also try to write markdown notes that set up what I intend to do (and why) in the following code cells. This means my development notebooks tell a story (of a sort) of the development of the functions that hopefully do what I actually want them to by the end of the notebook.

If truth be told, the notebooks often end up as an unholy mess, particularly if they are full of guard statements that try to separate out development and testing code from useful code blocks that I might want to import elsewhere.

Although I’ve been watching it for months, I’ve only started exploring how to use Jupytext in practice quite recently, and already it’s starting to change how I use notebooks.

If you install jupytext, you will find that if you click on a link to a markdown (.md) or Python (.py) file, or to one of a whole range of other text document types (.R, .r, .Rmd, .jl, .cpp, .ss, .clj, .scm, .sh, .q, .m, .pro, .js, .ts, .scala), the file will open in a notebook environment.

You can also open the file as a .py file from the notebook listing, by selecting the file:

and then using the Edit button to open it:

at which point you are presented with the “normal” text file editor:

One thing to note about the notebook editor view over the text file is that you can also include markdown cells, as you might in any other notebook, and run code cells to preview their output inline within the notebook view.

However, whilst the markdown content will be saved into the Python file (as commented-out text), the code cell outputs will not be saved into the Python file.

If you do want to be able to save notebook views with any associated code output, you can configure Jupytext to “pair” .py and .ipynb files (and other combinations, such as .py, .ipynb and .md files) such that when you save an open .py or .ipynb file from the notebook editing environment, a “paired” .ipynb or .py version of the file is also saved at the same time.
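
As a sketch of one way to set this up (assuming the Jupytext Python API and its per-notebook `formats` metadata key; the filename is made up):

```python
import jupytext

# Read the notebook and declare a pairing in its metadata; once set,
# saving from the notebook UI writes both the .ipynb and .py versions
nb = jupytext.read("my_module.ipynb")
nb.metadata.setdefault("jupytext", {})["formats"] = "ipynb,py"
jupytext.write(nb, "my_module.ipynb")
```

The same pairing can also be declared from the command line, with something like `jupytext --set-formats ipynb,py my_module.ipynb`.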

This means I could click to open my .py file in the notebook UI, run it, then when I save it, a “simple” .py file containing just code and commented out markdown is saved along with a notebook .ipynb file that also contains the code cell outputs.

You can configure Jupytext so that the pairing only works in particular directories. I’ve started trying to explore various settings in the branches of this repo: ouseful-template-repos/jupytext-md. You can also convert files on the command line; for example, `jupytext --to py Required\ Pace.ipynb` will convert a notebook file to a Python file.
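
The conversion can also be scripted (a sketch using the jupytext Python API, which infers the output format from the file extension):

```python
import jupytext

# Parse the notebook file, then write it back out as a Python script
# (markdown cells become comments; code cell outputs are dropped)
nb = jupytext.read("Required Pace.ipynb")
jupytext.write(nb, "Required Pace.py")
```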

The ability to edit Python / .py files, or code containing markdown / .md files in a notebook UI, is really handy, but there’s more…

Remember the guards?

I can tag a code cell using the notebook UI (from the notebook View menu, select Cell Toolbar and then Tags) with a tag of the form active-ipynb:

See the Jupytext docs: importing Jupyter notebooks as modules for more…

The tags are saved as metadata in all document types. For example, in an .md version of the notebook, the metadata is passed in an attribute-value pair when defining the language type of a code block:

In a .py version of the notebook, however, the tagged code cell is not rendered as a code cell, it is commented out:
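
Something like this, roughly (a sketch of the light .py format; the exact marker syntax varies with the Jupytext version and script format used):

```python
# + tags=["active-ipynb"]
# print("this cell runs in the notebook view, but is commented out here")
```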

What this means is that I can tag cells in the notebook editor to include them — or not — as executable code in particular document types.

For example, if I pair .ipynb and .py files, whenever I edit either an .ipynb or .py file in the notebook UI, it also gets saved as the paired document type. Within the notebook UI, I can execute all the code cells, but through using tagged cells, I can define some cells as executable in one saved document type (.ipynb for example) but not in another (a .py file, perhaps).

What that in turn means is that when I am hacking around with the document in the notebook UI I can create documents that include all manner of scraggy developmental test code, but only save certain cells as executable code into the associated .py module file.

The module workflow is now:

  • install Jupytext;
  • edit Python files in a notebook environment;
  • run all cells when running in the notebook UI;
  • mark development code as active-ipynb, which is to say, it is *not active* in a .py file;
  • load the .py file in as a module into other modules or notebooks, leaving out the commented-out development code; if I use `%load_ext autoreload` and `%autoreload 2` magic in the document that’s loading the modules, it will [automatically reload them](https://stackoverflow.com/a/5399339/454773) when I call functions imported from them if I’ve made changes to the associated module file (see the sketch after this list);
  • optionally pair the .py file with an .ipynb file, in which case the .ipynb file will be saved: a) with *all* cells run; b) including cell outputs.
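
For the autoreload step, the consuming notebook might contain something like this (a sketch; `mymodule` and `some_function` are made-up names for the paired .py file and a function defined in it):

```python
# In the notebook that imports the module:
%load_ext autoreload
%autoreload 2

import mymodule

# Calls now pick up edits saved to mymodule.py, without a kernel restart
mymodule.some_function()
```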

Referring back to Daniele Procida’s insights about documentation, this ability to have code in a single document (for example, a .py file) that is executable in one environment (the notebook editing / development environment, for example) but not another (when loaded as a .py module) means we can start to write richer source code files.

I also wonder if this provides us with a way of bundling test code as part of the code development narrative? (I don’t use tests so don’t really know how the workflow goes…)

More general is the insight that we can use Jupytext to automatically generate distinct versions of a document from a single source document. The generated documents:

  • can include code outputs;
  • can *exclude* code outputs;
  • can have tagged code commented out in some document formats and not others.

I’m not sure if we can also use it in combination with other notebook extensions to hide particular cells, for example, when viewing documents in the notebook editor or generating export document formats from an executed notebook form of it. A good example to try out might be the hide_code extension, which provides a range of toolbar options that can be used to customise the display of a document in the notebook editor or in HTML / PDF documents generated from it.

It could also be useful to have a very simple extension that lets you click a toolbar button to set an active- state tag and style or highlight that cell in the notebook UI to mark it out as having limited execution status. A simple fork of, or extension to, the freeze extension would probably do that. (I note that Jupytext responds to the “frozen” freeze setting but that presumably locks out executing the cell in the notebook UI too?)

PS a few weeks ago, Jupytext creator Marc Wouts posted this handy recipe for *rewriting* the commit history of a git branch so that changes are recorded against markdown formatted documents rather than the original .ipynb files: `git filter-branch --tree-filter 'jupytext --to md */*.ipynb && rm -f */*.ipynb' HEAD`. This means that if you have a legacy project with commits made to notebook files, you can rewrite it as a series of changes made to markdown or Python document versions of the notebooks…

What Do You Mean You Write Code EVERY DAY?

Every so often, I ask folk in the department when they last wrote any code; often, I get blank stares back. Write code? Why would they want to do that? Code is for the teaching of, and for big software engineering projects; not for using every day, surely?

I disagree.

I see code as a tool for making tools, often disposable ones.

Here’s an example…

I’m writing a blog post, and I want to list the file types recognised by Jupytext. I can’t find a list of the filetypes it recognises as a simple string that I can copy and paste into the post, but I do find this:

Copying out those suffixes is a pain, so I just copy that text string, which in this case happens to play nicely with Python (because it is Python), sprinkle a bit of code:
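
Something like this, for example (a sketch, with the pasted `_SCRIPT_EXTENSIONS` dict truncated for brevity):

```python
# The pasted _SCRIPT_EXTENSIONS dict from the jupytext source (truncated)
_SCRIPT_EXTENSIONS = {
    ".py": {"language": "python", "comment": "#"},
    ".R": {"language": "R", "comment": "#"},
    ".jl": {"language": "julia", "comment": "#"},
}

# Pull the suffixes out as a single copy-and-pasteable string
print(", ".join(_SCRIPT_EXTENSIONS.keys()))
```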

and here’s the list of filetypes supported by Jupytext: .py, .R, .r, .jl, .cpp, .ss, .clj, .scm, .sh, .q, .m, .pro, .js, .ts, .scala.

Note that it doesn’t have to be nice code, and there may be multiple ways of solving the problem. (In the example, I use a hybrid “me + the computer” approach, where I get the code to do one thing, copy the output, paste that into the next cell and then hack code around that, as well as a “just the computer” approach. The first is perhaps more available to a novice, the second to someone who knows about `.join()`.)

So what?

I tend to use code without thinking anything special of it; it’s just a tool that’s to hand to fashion other tools from, and I think that colours my attitude towards the way in which we teach it.

First and foremost, if you come out of a coding course not thinking that you now have a skill you can use quite casually to help get stuff done, you’ve been mis-sold…

This blog post took much longer to write than it took me to copy the _SCRIPT_EXTENSIONS text and write the code to extract the list of suffixes… And it didn’t take long to write the post at all…

See also: Fragment – Programming Privilege.

Simple Self-Test and Feedback in Jupyter Notebooks — Ordo

Some time ago I came across ordo, “a lightweight feedback tool for Jupyter”; here’s a quick initial review of what we can do with it (Binderised demo)…

Installing and enabling the extension gives you a couple of toolbar buttons:

The tick is “Feedback Mode” for running cells and evaluating the output, the pencil is “Edit Mode” for creating/editing feedback messages.

The README.ipynb demo notebook has some feedback cells already set up. For example, the first cell tests a simple sum. In “Feedback mode”, if you get an incorrect answer, you are alerted to the fact with an error message, which can either be the default message or a custom one assigned to that cell.

Clicking the eye reveals the answer; when you get the answer right, you are rewarded with confirmatory feedback, again either as a default message or as a custom message defined for that cell.

In the edit mode, you can click in a code cell and raise some value setting controls for the cell:

If you click the Make Solution button, the current cell output is set as the desired output.

Alternatively, you can explicitly set the desired solution, as well as custom success/failure messages on each cell:

Making a custom solution allows you to specify different sorts of output cell types… I think using Make Solution is probably easier!

Note that if you do opt to explicitly define a solution, any previous solution will not be displayed.

However, you can see the desired output in the corresponding cell metadata field:

The same is true if you add custom success or failure messages:

As before, the cell metadata does reveal what the current value is if the default feedback message has been changed.

For example, if we assign the following success feedback message to a cell:

the cell metadata is updated with the non-default value:
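
By way of illustration only (the key names here are my guesses at ordo’s metadata convention, not checked against the source), the metadata might then look something like:

```python
# Hypothetical ordo cell metadata after setting a custom success message
cell_metadata = {
    "ordo_solution": {"text/plain": "4"},     # the expected cell output
    "ordo_success": "Yes! That's the one.",   # the non-default feedback message
}
```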

Ordo looks like a really handy tool for baking explicit answers for cell based tests into notebook metadata. As such, it could be good as a quick way of implementing formative feedback into teaching notebooks, as long as the users have the ordo extension installed and enabled in the notebook server they are accessing the notebooks from.

If you can’t explicitly declare the exact answer you’re expecting as the cell output, it isn’t much use though…

PS A couple of other comments…

An ordo annotated notebook degrades gracefully, in the sense that if the extension is not installed, no bad things happen; you just don’t get the test run or the feedback displayed.

The ability to close the alert messages is neat. Peeking at the code, it looks like we can just add:

<button class="close" type="button" data-dismiss="alert">×</button>

into the alert div, and that gives us a button we can use to collapse the alert box.


The alert-dismissible class attribute does not seem to be required.

This dismissible behaviour could be used when using alert boxes in eg nbgrader test generated feedback, because it could be used to provide messages for markers that they could collapse… hmm… would that definitely delete it from the feedback document? I suppose if we added the alert-dismissible class attribute, we could also filter such divs out in an nbgrader feedback generator processor?

Fragment: Reversionable Open Educational Resources

I seem to be running out of hours in the day, and the blog is suffering as a result (I need to reprioritise…). So this is just another fragment…

One of the attractions for me of creating OERs that incorporate computational objects is that you can often express them in different ways. A musical score as a computational object can be displayed graphically, or converted to an audio file. A chart object can be rendered as an interactive HTML chart, or embedded as a publication quality flat image in a PDF file. Your report can be published as HTML on the web, distributed as a PDF document, or converted to Word for the folk who like that sort of thing.

That is, the object can be reversioned… (I’m not sure that’s the right word? Reformatted? Exported?)

Here’s another example: assessment material. Via R-Bloggers, I come across R-exams, an R package for creating assessment activities as computational objects, of which it is claimed you can create:

PDFs for classical written exams (with automatic evaluation), import formats for learning management systems (like Moodle, Blackboard, OLAT, or Ilias), live voting (via ARSnova), and the possibility to create custom output (in PDF, HTML, Docx, …).

Exercise types include multiple-choice or single-choice questions, numeric or text answers, or combinations of these. Formatting can be done either in Markdown or LaTeX with the possibility to generate dynamic content using R, e.g., random numbers, graphics, data sets, or shuffled text blocks.

There’s a recent tutorial here which I think I should probably have a quick play with, if I can find an hour just so I get a proper feel for it, and an earlier review presentation here.

Time was when I would have played with this package before the blog post, then included some of my own tinkerings in the post. Not doing that in a blog post, and just passing off the PR blurb, feels wrong to me…  The blog is recording what I’ve learned through doing / using, not just read about… Maybe I need to go down to 3 days a week, not 4, to get blogging time back, though I’m not sure I can afford that, given the blog is a purely selfish pleasure, albeit one that acts as my “professional” life log.

PS here’s something else sort of related: a pitch from BBC R&D for Object-Based Media (originally via @charlesarthur):

Object-based media allows the content of programmes to change according to the requirements of each individual audience member.

The ‘objects’ refer to the different assets that are used to make a piece of content. These could be large objects: the audio and video used for a scene in a drama – or small objects, like an individual frame of video, a caption, or a signer.

By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a programme can change to reflect the context of an individual viewer.

The “object based media” project has been around for some time, with demos going back several years. But one thing I did spot that was new to me (I try to follow BBC R&D…) was this BBC Taster site (“Taster is where you can Try, Rate and Share new ideas from the BBC and its partners”) and this BBC Pilots listing.

Fragment: On Online Courses….

Rehashing something I posted to an internal forum because I haven’t posted here for what feels like aaagggeeessss….

I think our mode of delivery — narrative based courses presented primarily as written texts, interspersed with other forms of media, as well as dialog in the form self-test/supported open learning/tutor at your side SAQs and exercises — is an engaging and powerful one.

I personally think that a medium that supports embedded rich media and interactive activities provides many opportunities for us as educators to engage learners with more than just a static written text (although such activities may or may not actually make a positive impact on learning; they may even affect it negatively, either directly, because the activities are not supportive of the learning, or indirectly, because gearing the surrounding material towards the activity creates an opportunity cost against using that material for other purposes).

Michel mentions platforms like Codio and (the new to me) Stepik, which in many ways are just an evolution of platforms that allow you to easily create and publish quite traditional e-learning (remember that?!) quizzes. It’s not too hard to roll your own course, either: https://course.spacy.io/ was a single person’s DIY effort, but it’s also produced a framework from which you can create your own course.

Something I’m noticing more and more are pages that embed read/write interactions as well as presenting the document as a whole via a personal read/write web style interface. You could argue these are just an iteration on a personal wiki, but I think calling them cell-based, notebook-style interfaces is more apposite. (OpenCreate was a cell-based authoring environment, at least in the iteration I saw.)

The spacy course provides one example of inlining free text editable areas in a course. (The course also mixes linear text with slideshow expositions, which I think works nicely. It would be even more powerful if there were an area beneath the slide show where you could enter, and save, your own notes / commentary.)

Applications like Observable provide you with in-browser documents editable at the cell level (click on the vertical ellipsis in the sidebar of a cell to make it editable). These interfaces also support code-based activities that run fully within the browser. (The spacy course executes code against a remote code environment; Observable allows js code editing that is executed within the browser; Iodide is a similar, but more generic, framework from Mozilla; here’s a demo document; you can click the Explore button in the top right corner to edit, and preview, the source code. Another part of the same project, Pyodide, brings a fully blown scientific Python stack into the browser using WebAssembly. Epiphany.pub is a very new (also solo) project demoing inline editing of docs that make use of Pyodide; click on a cell tool icon (in the toolbar at the right side of each cell) to edit the cell.)

Something that I think these new read/write interfaces offer is the opportunity for students to take ownership of these documents and make marks on them, much as they might write comments or underlines on a print study guide. (Yes, I know about OU Annotate, but it’s not the nicest of experiences…)

Annotated documents can then be saved to personal file spaces. In the case of epiphany.pub, it wasn’t working yet when I tried yesterday (which is not to say that it might not have already been fixed today), but the model seems to be that you log in with something like Github and it saves the file there… This means that the site publisher: a) doesn’t really have to worry about managing user accounts, perhaps other than in a very simple, account secret token keeping, way; b) doesn’t “own” your document, it just hosts the editor; c) doesn’t have to pay for any storage for files edited using the editor. This pattern seems to be becoming more and more prevalent: you log in to a service with credentials and permissions that allow it to store and retrieve stuff using the service whose credentials you logged in with. It’s becoming popular because, as a service provider, I can create a web app with minimal resource: not much more than hosting for a single page web app.

Just by the by, making annotations on top of documents is also becoming easier, eg the RISE slideshow in Jupyter notebooks, which lets you specify certain cells in a notebook to use as part of a presentation, also supports the ability to draw over a slide (https://www.youtube.com/watch?v=Gx2TnIdt0hw&t=28m30s). Things like Jupyter Graffiti also go a bit further in terms of allowing you to create not-really screencasts that are actually replays over a live notebook, with support for free annotation over the top (the replay element means the user can step into the “screencast”, take over control of the notebook themselves, and go off in a different direction from the screenplay). At the moment Jupyter Graffiti is intended for instructors to make tutorials over the top of notebooks, but I wonder how elements of it might be tweaked or co-opted as an annotation tool for students…

One thing to note about the above is that the tech is starting to get there, but the understanding of how to use it, let alone a culture of using it, is still some way away.

Things like Noteable are fine, but they provide a base experience. An argument we keep having in TM351 is the extent to which we provide students with a vanilla notebook experience, or a rich environment built on Jupyter with loads of extensions and loads of exploitation of those extensions in the way we write the notebooks. Another way might be to find extensions that students can use to enrich their experience of our vanilla notebooks, but that requires skilling up the students so that they can make most effective use of the medium on their own terms.

Zero to Notebook With notebook.js + ThebeLab, Inspired By nbgrader

Whilst at the nbgrader workshop in Edinburgh a couple of weeks ago, a couple of things jumped out at me. Firstly, Jess Hamrick’s design requirement that students should be able to complete nbgrader assignments without needing to install the nbgrader package or any of its extensions. This is completely reasonable — we want to make it as easy as possible for students to complete an assignment, which means minimising the chances of anything going wrong. Secondly, a comment that the ability to lock cells in an assignment notebook is available as an undocumented feature in nbgrader.

The ability to lock cells is an interesting one when it comes to using notebooks for assessment (it also has interesting pedagogical implications when delivering teaching materials in a notebook form). One thing that might make sense in some situations would be to prevent students, by default, from editing any cells not associated with a submission. (That is, the only editable cells should be cells that are “markable” code or free text cells.) This would provide more of a form driven “exam script” style presentation. Furthermore, if we are autograding a notebook, we want to make sure that students don’t mess up the autograding; neither do we want students to put their answers in cells we can’t detect as requiring grading.

I was aware of a freeze Jupyter notebook extension that allows cells to take on various states. Code cells, for example, can be read-only (executable, but the code cannot be changed) or frozen (the code can be neither altered nor executed), and markdown cells can be read-only (viewable as markdown but non-editable) or frozen (neither viewable as markdown nor editable). But could this also be achieved without an extension, as per the design requirement?

Interestingly, it seems that recent notebook releases do support various cell presentation controls that can be invoked using particular metadata elements.

In particular, the deletable and editable metadata elements seem to be recognised by the Jupyter notebook server (deletable example, editable example); they take effect when set to false.
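
For example (a sketch of the relevant cell metadata fields):

```python
# Cell metadata: with these set, the notebook UI will refuse to delete
# or edit the cell, with no extensions required
cell_metadata = {
    "deletable": False,
    "editable": False,
}
```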

From the nbformat docs, the deletable metadata element will prevent the deletion of the cell if set to False, but the editable element doesn’t appear to be documented?

To see the effect of setting these metadata elements, you can enable the Edit Metadata controls from the notebook View - Cell Toolbar menu:

edit the required field:

and then try to edit or delete the cell:

Within nbgrader itself, the nbgrader lockcells preprocessor sets `cell.metadata['deletable'] = False` and in some cases also sets `cell.metadata['editable'] = False`. [I didn’t spot in the nbgrader docs a clear description of what’s set to what and how?]

Using native notebook controls, then, we should be able to make certain cells undeletable and uneditable without the need to install any extensions. Students can still add new cells into the notebook but there will be no nbgrader metadata associated with the cell; as such, these cells would presumably be ignored by the autograder and not highlighted for manual grading. (There is a recently added nbgrader task style that seems to allow submissions over multiple cells but I haven’t had a chance to explore this yet.) It’s maybe also worth noting that if all cells in a notebook are tagged with some metadata as part of the assignment creation step, even if just a cell ID, then nbgrader would be able to detect any cells newly added to the notebook by a student.
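
As a sketch of how the locking might be scripted with nbformat (the `answer` tag used here to identify student-editable cells is made up):

```python
import nbformat

nb = nbformat.read("assignment.ipynb", as_version=4)

# Lock every cell except the ones students are expected to answer in
for cell in nb.cells:
    if "answer" not in cell.metadata.get("tags", []):
        cell.metadata["deletable"] = False
        cell.metadata["editable"] = False

nbformat.write(nb, "assignment.ipynb")
```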

Within an assignment notebook, certain code and markdown cells can be designated as Read Only. Originally, the semantics of this were such that the source of those cells was recorded into a database when the assignment was created; during autograding, nbgrader checked whether the source of the student’s version of the cell had changed and, if it had, the changed content would be replaced by the version in the database to undo any student changes. With the notebook’s native, metadata driven support for non-editable cells, such cells are now read-only unless the student changes the metadata field.

Making all assessment notebook cells undeletable and making all cells other than cells where we require students to provide an answer uneditable gives us a certain amount of control over the notebook document. But can we take this further? Providing locked down environments where only “assessable” cells are editable made me start to wonder about whether it would be possible to put an nbgrader assignment into a Jupyter Book like environment, with ThebeLab enabled code cells (and free text answer / markdown cells?) that students could use to run and test their code and then somehow export their answers.

Rather than letting students edit an assignment notebook freely, we would provide them with a truly fixed environment that could execute only clearly identified (user editable) code cells against a known/predefined environment. Ideally, we’d then also provide a means to save the code the student had created in the code cells.

Note: in other automated code assessment tools, the ability to save code may not be strictly necessary. For example, the Moodle CodeRunner question type provides automated assessment “inline”: students write code in a web form, execute it within a remote computational environment, and then receive an automated grading against predefined tests. nbgrader doesn’t operate in this way, but it might be interesting if it could…

As a first attempt at providing a “fixed guidance, editable answer” view, I had a play with nbpreview, a single page standalone web app that lets you upload a notebook and then preview it using notebook.js, a Javascript notebook previewer. In particular, I made a minor tweak to notebook.js to simplify the way in which code cells were rendered, and a tweak to nbpreview that provided support for executing the code against a MyBinder container launched using ThebeLab. (See also this related issue thread.)

The resulting demo, which you can find here, allows you to upload a notebook .ipynb file and preview it in a normal way.

Clicking the Activate button will make the code cells editable and runnable, via ThebeLab, against a specified MyBinder launched environment.

The Download button is a proof of concept that will export the contents of all the code cells in a raw form. My intention for this, in the nbgrader context, is to be able to download a JSON object file that associates the contents of each cell with its nbgrader cell ID. The notebook.js package really needs some further revision so that the nbgrader cell ID metadata is passed as an ID attribute of each code block into the previewed web page; further modification to identify “assessable” markdown cell content is also required, along with a tweak to the ThebeLab.js package so that assessable and identifiable free text / markdown cells are converted into editable text area form elements when the Activate button is clicked. User modified free text cells should also be exportable / downloadable.

As to how downloaded cell content might be returned to nbgrader, I can think of a several possibilities.

One way would be to create a notebook template document that can be repopulated with content from the answer download.

A similar effect could also be achieved by simply rewriting the contents of each assessable cell in the assessment notebook with the contents of the associated (similarly ID’d) element in the downloaded JSON file. Submission could then proceed using the current nbgrader submission approach.
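
A sketch of that repopulation step (assuming the hypothetical answers file described above, keyed by nbgrader cell ID):

```python
import json
import nbformat

# Hypothetical downloaded answers file, e.g. {"q1": "def f(x):\n    ..."}
with open("answers.json") as f:
    answers = json.load(f)

nb = nbformat.read("assignment-template.ipynb", as_version=4)

# Rewrite each assessable cell with the matching submitted content
for cell in nb.cells:
    grade_id = cell.metadata.get("nbgrader", {}).get("grade_id")
    if grade_id in answers:
        cell.source = answers[grade_id]

nbformat.write(nb, "assignment-submission.ipynb")
```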

Another approach might be to modify the autograder so that answer cells from student submissions (either pulled from the downloaded JSON answers file, or extracted from notebook answer cells) are parsed into an “answers” table in the database; autograding could then be run against templated code cells and content pulled from the database’s “answers” table.

An advantage of this approach is that an “assessment form” UI such as the one above might then allow answers to be uploaded directly into the database rather than downloaded as an answer object file.

The database mediated approach might then also support a CodeRunner style mode of operation in which assessments containing only automatically graded elements can be autograded directly in response to a student hitting a “Grade me now” button on the assessment page web app.

Binder Base Boxes, Several Ways…

A couple of weeks ago, Chris Holdgraf published a handy tip on the Jupyter Discourse site about how to embed custom github content in a Binder link with nbgitpuller.

One of the problems with (features of…) MyBinder is that if you make a change to a repo, even if it’s just a change to the README, it will spawn a rebuild of the Docker image built from the repo the next time the repo is launched onto MyBinder.

With the recent announcement of the Binder Federation, whereby there are multiple clusters (currently two…) onto which MyBinder launch requests are mapped, if each cluster maintains its own Docker image hub, this could mean that with N clusters available, your next N launches may all require a rebuild if each launch request is mapped to a different cluster.

So how does nbgitpuller help? If you install nbgitpuller into a Binderised repository, you can launch a container on MyBinder with a git-pull? argument. This will grab the contents of a specified repository into a notebook server environment before presenting you with the notebook homepage.

What this means is that we can construct a MyBinder URL that will:

  • launch a container built from one repo; and
  • populate it with files pulled from another.

The advantage of this is that you can create one repo with a complex set of build requirements and build a MyBinder image from that once and once only. If you also maintain a second repository with notebook files, or a package definition, with frequent changes, but run it in a Binderised container launched from the “fixed” build repo, you won’t need to rebuild the container each time: just launch from the pre-built one and then synch the changed content in from the other repo.

To pull the contents of a repo http://github.com/USER/REPO into a MyBinder container built from a particular binder-base-boxes branch, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO

To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26branch=BRANCH

Note the %26 escaping on the & conjunction between the repo and branch arguments, which keeps the branch argument inside the scope of the git-pull?repo phrase.

To pull the contents from a particular branch of a repo http://github.com/USER/REPO/tree/BRANCH and launch into a particular notebook, use a MyBinder URL of the form:

https://mybinder.org/v2/gh/ouseful-demos/binder-base-boxes/BASEBOXBRANCH/?urlpath=git-pull?repo=https://github.com/USER/REPO%26branch=BRANCH%26subPath=FILENAME.ipynb
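
For convenience, here’s a sketch of how such links might be put together programmatically (the function is my own, not part of nbgitpuller):

```python
def nbgitpuller_url(build_repo, build_branch, content_repo,
                    content_branch=None, subpath=None):
    """Build a MyBinder URL that launches build_repo ('USER/REPO') and
    git-pulls content_repo (a full URL) into it. The %26 escapes keep
    the extra arguments inside the scope of the git-pull query."""
    args = f"repo={content_repo}"
    if content_branch:
        args += f"%26branch={content_branch}"
    if subpath:
        args += f"%26subPath={subpath}"
    return (f"https://mybinder.org/v2/gh/{build_repo}/{build_branch}"
            f"/?urlpath=git-pull?{args}")

# For example:
print(nbgitpuller_url("ouseful-demos/binder-base-boxes", "BASEBOXBRANCH",
                      "https://github.com/USER/REPO",
                      content_branch="BRANCH", subpath="FILENAME.ipynb"))
```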

You can see several examples in the various branches of https://github.com/ouseful-demos/binder-base-boxes.

See Feeding a MyBinder Container Built From One Github Repository With the Contents of Another for an earlier review of this approach (which I have to admit, I’d forgotten I’d posted when I started this post!).

On my to do list is to try to add a tab to the nbgitpuller/link generator to simplify the process of link creation. But in addition to a helper tool, is there a convention we might adopt to make it clearer when we are using this sort of split build/content repo approach?

Github conventionally uses the gh-pages branch as a “reserved” branch for constructing Github Pages docs related to a particular repo. Could we take a similar approach for defining a “Binder build” branch?

The binder/ directory in a repo can be used to partition Binder build requirements in a repo, but there are a couple of problems associated with this:

  • a maintainer may not want to have the binder/ directory cluttering their package repo;
  • any updates to the repo will force a rebuild of the Binder image next time the repo is run on a particular Binder node. (With Binder federation, if there are N hosts in the federation, after updating a repo, is it possible that my next N attempts to run the repo on MyBinder may require a rebuild if I am directed to a different host each time?)

If by convention something like a binder-build branch was used to contain the build requirements for a repo, then the process for calling a build (by default) could be simplified.

Eg rather than having something like:

https://mybinder.org/v2/gh/colinleach/binder-box/master/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

we would have something like:

https://mybinder.org/v2/gh/colinleach/astro-Jupyter/binder-build/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

which could simplify to something that defaults to a build from binder-build branch (the “build” branch) and nbgitpull from master (the “content” branch):

https://mybinder.org/v2/gh/colinleach/astro-Jupyter?binder-build=True

Complications could be added to support changing the build branch, the nbgitpull branch, the commit/ID of a particular build, etc?

It might overly complicate things further, but I could also imagine:

  • automatically injecting nbgitpuller into the Binder image and enabling it;
  • providing some sort of directive support so that if the content directory has a setup.py file the package from that content directory is installed.

Binder Buildpacks

As well as defining dynamically constructed Binder base boxes built from one repo and used to provide an environment within which to run the contents of another, there is a second sense in which we might define Binder base boxes and that is to consider the base environment on which repo2docker constructs a Binder image.

In the nbgitpuller approach, I am treating the Binder base box (sense 1) as the environment that the git-pulled content runs in. In the buildpack approach, the Binder base box (sense 2) is the image that repo2docker uses to bootstrap the Binder image build process. Binder base box sense 1 = Binder base box sense 2 + Binder repo build process. Maybe it’d make more sense to swap those senses, so sense 2 builds on sense 1?!

This approach is discussed in the repo2docker issue #487 Make it possible to configure the base image with an example implementation in pangeo-stacks/pull/27. The implementation allows users to create a Dockerfile in which they specify a required base Docker image upon which the normal apt.txt, environment.yml, requirements.txt and postBuild steps can be applied.

The Dockerfile FROM statement takes the form:

FROM yuvipanda/pangeo-base-notebook-onbuild:2019.04.15-4

and then other build files (requirements.txt etc) are declared as normal.

The -onbuild component marks out the base image as one that should be built on (I think). I’m not sure how the date component applies (or whether it is required or optional). I’m not sure if the base box itself also needs some custom configuration? I think an example of the code used to build it is in the base-notebook directory of this repo: https://github.com/yuvipanda/pangeo-stacks

Summary

Installing nbgitpuller into a Binderised repo allows us to pull the contents of a second Github repository into the first. This means we can build a complex environment from one repository once, and pull regularly updated content from another repo into it without needing a rebuild step. Using the -onbuild approach, Binderhub can use repo2docker to build a Binder image from a user defined base image and then apply the normal build steps to it. This means that optimised base boxes can be defined, on which additional customisations can be layered. This can also make development of Binder boxes more efficient, by starting rebuilds further up the image layer stack, building on top of prebuilt boxes rather than having to build images from scratch.