Merging Several Binder Configurations

As more and more repositories start to incorporate MyBinder / repo2docker build specifications, more and more building blocks start to appear for how to get particular things running in MyBinder. For example, I have several ouseful-template-repos with various building blocks for getting different databases running in MyBinder, and I occasionally require an environment that also loads in a Jupyter-server-proxied application, such as OpenRefine. Other times, I might want to pull in the config for a particularly fast install, or merge configs someone else has developed to run different sets of notebooks in the same Binderised repo.

But: a problem arises if you want to combine multiple Binder specifications from various repos into a single Binder setup in a single repo – how do you do it?

One way might be for repo2docker to iterate through multiple build steps, one for each Binder specification. There may be clashes, of course, such as conflicting package versions from different specifications, but it would then fall to the user to try to resolve the issue. Which is fine, if Binder is making a best attempt rather than guaranteeing to work.

Assuming that such a facility does not exist, it would require updates to repo2docker, so that’s not something we can easily hack around with ourselves. So how about something where we try to combine the contents of multiple binder/ setup directories ourselves? This is something we can start to do easily enough, and as a personal tool it doesn’t necessarily have to work “properly” and “for everything”: for starters, it only has to work with what we want it to work with. And if it only gets 80% of the way to a working combined configuration, that’s fine too.

So what would we need to do?

Simple list files like apt.txt and requirements.txt could be simply concatenated together, leaving it up to apt or pip to do whatever they do with any clashes, for example in pinned package versions (though we may want to report possible clashes, perhaps via a comment in the file, to help the user debug things).

In a shell script, something like the following would concatenate files in directories binder_1, binder_2, etc.:

mkdir -p binder
for i in binder_*/
do
   i="${i%/}"
   # Skip directories without an apt.txt
   [ -f "$i/apt.txt" ] || continue
   echo >> binder/apt.txt
   echo "# $i" >> binder/apt.txt
   cat "$i/apt.txt" >> binder/apt.txt
done

In Python, something like:

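import os

os.makedirs('binder', exist_ok=True)

with open('binder/requirements.txt', 'w') as outfile:
    for d in sorted(d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)):
        infile_path = os.path.join(d, 'requirements.txt')
        # Skip directories that have no requirements.txt
        if not os.path.isfile(infile_path):
            continue
        with open(infile_path) as infile:
            # Label each block with the directory it came from
            outfile.write(f'\n# {d}\n')
            outfile.write(infile.read())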
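To report possible clashes, something like the following sketch might do. It only recognises simple name==version pins, and the function names (find_pin_clashes, clash_comments) and the "POSSIBLE CLASH" comment format are made up for illustration rather than taken from any existing tool:

```python
import re
from collections import defaultdict


def find_pin_clashes(requirement_sets):
    """Return {package: {version: [sources]}} for packages pinned to more than one version.

    `requirement_sets` maps a source label (e.g. a directory name) to the
    text of its requirements.txt. Only simple `name==version` pins are
    recognised; anything else is ignored.
    """
    pin = re.compile(r'^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)')
    seen = defaultdict(lambda: defaultdict(list))
    for source, text in requirement_sets.items():
        for line in text.splitlines():
            match = pin.match(line)
            if match:
                name, version = match.group(1).lower(), match.group(2)
                seen[name][version].append(source)
    # Keep only the packages that appear with two or more distinct pins
    return {name: dict(versions) for name, versions in seen.items() if len(versions) > 1}


def clash_comments(clashes):
    """Format the clashes as comment lines that could be written into the merged file."""
    lines = []
    for name in sorted(clashes):
        detail = '; '.join(f"{version} in {', '.join(sources)}"
                           for version, sources in sorted(clashes[name].items()))
        lines.append(f'# POSSIBLE CLASH: {name} pinned as {detail}')
    return lines
```

The comment lines could then be written at the top of the merged binder/requirements.txt, leaving the actual resolution to the user.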

Merging environment.yml files is a little trickier — the structure within the file is hierarchical — but a package like hiyapyco can help us with that:

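import os
import fnmatch

import hiyapyco

# Collect the environment files from each binder_* directory
_dirs = sorted(d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d))
_envs = [os.path.join(d, e) for d in _dirs for e in os.listdir(d) if fnmatch.fnmatch(e, 'environment.y*ml')]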

merged = hiyapyco.load(_envs,
                       method=hiyapyco.METHOD_MERGE,
                       interpolate=True)

with open('binder/environment.yml', 'w') as f:
    f.write(hiyapyco.dump(merged))
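To make the hierarchy problem a bit more concrete, here is a hand-rolled sketch of merging just the dependencies section, working on environment files that have already been parsed into Python dicts. This is not what hiyapyco does internally; the function name merge_env_dependencies is hypothetical, and other keys such as channels are ignored:

```python
def merge_env_dependencies(*envs):
    """Merge the `dependencies` lists of several parsed environment.yml dicts.

    Plain string entries are kept in first-seen order without duplicates;
    nested `{'pip': [...]}` entries are collapsed into a single pip list.
    """
    conda_deps, pip_deps = [], []
    for env in envs:
        for dep in env.get('dependencies') or []:
            if isinstance(dep, dict):
                # Nested section, e.g. {'pip': ['pandas', ...]}
                for item in dep.get('pip') or []:
                    if item not in pip_deps:
                        pip_deps.append(item)
            elif dep not in conda_deps:
                conda_deps.append(dep)
    merged = list(conda_deps)
    if pip_deps:
        if 'pip' not in merged:
            # Make sure pip itself is listed before the pip section
            merged.append('pip')
        merged.append({'pip': pip_deps})
    return {'dependencies': merged}
```

Note that this only removes exact duplicates: python=3.7 and python=3.8 would both survive, so clashes are still left for the user (or conda) to sort out.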

There is an issue with environments where we have both environment.yml and requirements.txt files because environment.yml trumps requirements.txt: the former will be used but the latter won’t. A workaround I have used in the past for installing from both is to put a pip install command in the postBuild file to handle the requirements.txt installation.

I’ve also had to use a related trick to explicitly install, via postBuild, a Python package that other packages depend on, and then install from a renamed requirements.txt, also via postBuild: the pip installer installs packages in whatever order it wants, and doesn’t necessarily follow any order “specified” in the requirements.txt file. This means that on certain occasions, a build can fail because one Python package is relying on another which is specified in the requirements.txt file but hasn’t been installed yet.
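As a rough sketch of that trick, the following generates the text of a postBuild file that installs the awkward packages first and then installs everything else from the renamed requirements file. The function name, the renamed file name (binder/requirements_post.txt) and the assumption that postBuild is run from the top of the repo are all mine, for illustration only:

```python
def ordered_install_postbuild(first_packages,
                              requirements_file='binder/requirements_post.txt'):
    """Return the text of a bash postBuild script that installs packages in a fixed order.

    `first_packages` are installed one at a time, in the order given, before
    the remaining packages are installed from `requirements_file` (a
    hypothetical name for the renamed requirements.txt).
    """
    lines = ['#!/bin/bash', 'set -e', '']
    for package in first_packages:
        lines.append(f'pip install --no-cache-dir {package}')
    lines.append(f'pip install --no-cache-dir -r {requirements_file}')
    return '\n'.join(lines) + '\n'
```

For example, ordered_install_postbuild(['numpy']) gives a script that runs pip install for numpy on its own and only then installs from the renamed file; the result would need writing out to binder/postBuild.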

Another approach might be to pull any requirements from a (merged) requirements.txt file into an environment.yml file. For example, we can create a “dummy” _environment.yml file that will install elements from our requirements file, and then merge that into an existing environment.yml file. (We’d probably guard this with a check that both environment.y*ml and requirements.txt are in binder/.)

_yaml = '''dependencies:
  - pip
  - pip:
'''

# if 'requirements.txt' in os.listdir('binder') and 'environment.yml' in os.listdir('binder'):

with open('binder/requirements.txt') as f:
    for item in f:
        item = item.strip()
        # Skip blank lines and comments
        if item and not item.startswith('#'):
            _yaml = f'{_yaml}    - {item}\n'

with open('binder/_environment.yml', 'w') as f:
    f.write(_yaml)

merged = hiyapyco.load('binder/environment.yml', 'binder/_environment.yml',
                       method=hiyapyco.METHOD_MERGE,
                       interpolate=True)

with open('binder/environment.yml', 'w') as f:
    f.write(hiyapyco.dump(merged))

# Maybe also now delete requirements.txt?

For postBuild elements, different postBuild files may well operate in different shells (for example, we may have one that executes bash code, another that contains Python code). Perhaps the simplest way of “merging” this is to just copy over the separate postBuild files and generate a new one that calls each of them in turn.

import os
import shutil

# Assumes the generated postBuild should be a bash script
postBuild = '#!/bin/bash\n'

for d in sorted(d for d in os.listdir() if d.startswith('binder_') and os.path.isdir(d)):
    _from = os.path.join(d, 'postBuild')
    if os.path.isfile(_from):
        _to = os.path.join('binder', f'postBuild_{d}')
        shutil.copyfile(_from, _to)
        # copyfile does not preserve permissions, so make the copy executable
        os.chmod(_to, 0o755)
        postBuild = f'{postBuild}\n./{_to}\n'

with open('binder/postBuild', 'w') as outfile:
    outfile.write(postBuild)

I’m guessing we could do the same for start?
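One wrinkle with start is that a repo2docker start script is expected to finish with exec "$@", so calling several of them in turn probably won’t work as it does for postBuild. A possible sketch, assuming every start file is a bash script whose exec "$@" sits on a line of its own, is to stitch the bodies together and add a single exec "$@" at the end (merge_start_scripts is a made-up name):

```python
def merge_start_scripts(scripts):
    """Combine several bash start scripts into one.

    `scripts` maps a label (e.g. a directory name) to the text of its start
    file. Shebang lines and bare `exec "$@"` lines are dropped from each
    script, and a single `exec "$@"` is appended at the end.
    """
    body = ['#!/bin/bash']
    for label, text in scripts.items():
        body.append(f'\n# --- from {label} ---')
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith('#!') or stripped == 'exec "$@"':
                continue
            body.append(line)
    # Hand over to whatever command the container was asked to run
    body.append('\nexec "$@"')
    return '\n'.join(body) + '\n'
```

This won’t cope with start files written in other languages, or ones that wrap the exec in something more elaborate, but it may get some of the way there.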

If you want to have a play, the beginnings of a test file can be found here. (For some reason, WordPress craps all over it and deletes half of it if I try to embed it in the sourcecode block etc. I really should move to a blogging platform that does what I need…)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
