TJ Fragment: Sharing Desktop Apps Via Jupyter-Server-Proxy Et Al.

It’s been some time since I last had a play with remote desktops, so here’s a placeholder / round up of a couple of related Jupyter server proxy extensions that seem to fit the bill.

For those at the back who aren’t keeping up, jupyter-server-proxy applications are incredibly useful: they extend the Jupyter server to proxy other services running in the same environment. So if you have a Jupyter server running on example.url/nbserver/, and another application that publishes a web UI in the same environment, you can publish that application, using jupyter-server-proxy, via example.url/myapplication. As an example, for our TM351 Data Management and Analysis course, we proxy OpenRefine using jupyter-server-proxy (example [still missing docs]).
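By way of illustration, here’s a minimal sketch of the sort of jupyter-server-proxy configuration involved (in a jupyter_server_config.py file, say); the command and port shown are illustrative rather than our exact TM351 settings:

```python
# sketch of a jupyter-server-proxy server definition; the `refine` command,
# port and title are illustrative, not the exact TM351 configuration
servers = {
    "openrefine": {
        # {port} is substituted by jupyter-server-proxy at launch time
        "command": ["refine", "-p", "{port}"],
        "port": 3333,
        "timeout": 120,
        "launcher_entry": {"title": "OpenRefine"},
    }
}

# in jupyter_server_config.py this would be registered as:
# c.ServerProxy.servers = servers
```

The proxied service then appears under the Jupyter server’s own URL, e.g. example.url/openrefine.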

Applications that are published using a jupyter-server-proxy wrapper are typically applications that publish an HTML UI. So what do you do if the application you want to share is a desktop application? One way is to share the desktop via a browser (HTML) interface. Two popular ways of doing this are:

  • novnc: an “open source VNC client – it is both a VNC client JavaScript library as well as an application built on top of that library”;
  • xpra: “an open-source multi-platform persistent remote display server and client for forwarding applications and desktop screens”.

Both of these applications allow you to share (Linux) desktop applications via a web browser, and both of them are available as jupyter-server-proxy extensions (subject to the correct operating system packages also being installed).

As far as novnc goes, jupyterhub/jupyter-remote-desktop-proxy will “run a Linux desktop on the Jupyter single-user server, and proxy it to your browser using VNC via Jupyter”. A TightVNC server is bundled with the application as a fallback if no other VNC server is available. One popular application wrapped by several people using jupyter-remote-desktop-proxy is QGIS; for example, giswqs/jupyter-qgis. I used it to demonstrate how we could make a legacy Windows desktop application available via a browser by running it under Wine on a Linux desktop and then sharing it via jupyter-remote-desktop-proxy.

For xpra, the not very active (but maybe it’s stable enough?!) FZJ-JSC/jupyter-xprahtml5-proxy seems to allow you to “integrate Xpra in your Jupyter environment for a fast, feature-rich and easy to use remote desktop in the browser”. However, no MyBinder demo is provided and I haven’t had a chance yet to give this a go. (I have tried XPRA in other contexts, though, such as here: Running Legacy Windows Desktop Applications Under Wine Directly in the Browser Via XPRA Containers.)

Another way of sharing desktops is to use the Microsoft Remote Desktop Protocol (aka RDP). Again, I’ve used that in various demos (eg This is What I Keep Trying to Say…) but not via a jupyter-server-proxy. I’m not sure if there is a jupyter-server-proxy example out there for publishing a proxied RDP port?

Just in passing, I also note this recipe for a Docker compose configuration that uses a bespoke container to act as a desktop sharing bridge: Viewing Dockerised Desktops via an X11 Bridge, novnc and RDP, Sort of…. I’m not sure how that might fit into a Jupyter set up? Could a Jupyter server container be composed with a bridge container, and then proxy the bridge services?

Finally, another way to share stuff is to use WebRTC. The maartenbreddels/ipywebrtc extension can “expose the WebRTC and MediaStream API in a Jupyter notebook/JupyterLab environment”, allowing you to create a MediaStream out of an ipywidget, a video/image/audio file, or a webcam, and use it as the basis for a movie, image snapshot or audio recording. I keep thinking this might be really useful for recording screencast scenes or other teaching related assets, but I haven’t fully grokked the full use of it. (Something like Jupyter Graffiti also falls into this class, which can be used to record a “tour” or walkthrough of a notebook that can also be interrupted by live interaction or the user going off-piste. The jupyterlab-contrib/jupyterlab-tour extension also provides an example of a traditional UI tour for JupyterLab, although I’m not sure how easy it is to script/create your own tours. Such a thing might be useful for guiding a user around a custom JupyterLab workspace layout, for example. [To my mind, workspaces are the most useful and least talked about feature of the JupyterLab UI….] More generally, shepherd.js looks interesting as a generic Javascript package for supporting website tours.) What I’m not sure about is the extent to which I could share, or proxy access to, a WebRTC MediaStream that could be accessed live by a remote user.

Another way of sharing the content of a live notebook is to use the new realtime collaboration features in JupyterLab (see the official announcement/background post: How we made Jupyter Notebooks collaborative with Yjs). (A handy spin-off of this is that it now provides a hacky workaround way of opening two notebooks on different monitors.) If you prefer more literal screensharing, there’s also yuvipanda/jupyter-videochat which provides a server extension for proxying a Jitsi (WebRTC) powered video chat, which can also support screen sharing.

Fragment: Software Decay From Inside and Out

Over the last couple of weeks, I’ve been dabbling with a new version of the software environment we use for our TM351 Data Management and Analysis course, bundling everything into a single monolithic docker container (rather than a more elegant docker compose solution, because we haven’t yet figured out how to mount multiple personal volumes from a JupyterHub/k8s config).

Hmm… in a docker compose set up, where I mount a persistent volume onto container A at $SHAREDPATH, can I mount from a path $SHAREDPATH/OTHER in that container into another, docker compose linked container?

At the final hurdle, having fought with various attempts to build a docker container stack that works, I hit an issue when trying to open a new notebook:


The same notebook works fine in JupyterLab, so there is something wrong, somewhere, with launching notebooks in the classic Jupyter notebook UI.

Which made me a bit twitchy. Because the classic notebook is the one we use for teaching in several courses, and we use a wide variety of off-the-shelf extensions, as well as a range of custom developed extensions to customise our notebook authoring and presentation environment (examples). And these customisations are not available in JupyterLab UIs. For a related discussion, see this very opinionated post.

For folk who follow these things, and for folk who have a stake in the classic notebook UI, the question of long term support for the classic UI should be a consideration and a concern. Support for the classic notebook UI is not the focus of the core Jupyter project developers’ effort.

And here’s another weak signal of a possible fork in the road:

The classic notebook user community, of which I consider myself a part (a community that includes education, and could well extend into publishing more generally as tools like Jupyter Book mature even further), needs to be mindful that someone needs to look after this codebase. And it would be a tragedy if that someone turned out to be someone who forked the codebase for their own (commercial) publishing platform. An Elsevier, for example, or a Blackboard.

Anyway, back to my 500 server error.

Here’s how the error starts to be logged by the Jupyter server:

And here’s where the problem might be:

In a third party package that provides an “export to docx (Microsoft Word)” feature implemented as a Jupyter notebook custom bundler.

Removing that package seemed to fix things, but it got me wondering about whether I should treat this as a weak signal of software rot in Jupyter notebook. I tweeted to the same effect — with a slight twinge of uncertainty about whether folk might think I was dissing the Jupyter community again! — but then started pondering about what that might actually mean.

Off the top of my head, it seems that one way of slicing the problem is to consider rot that comes from two different directions:

  • inside: which is to say, as packages that the notebook depends on update, do things start to break inside the notebook server and its environment. Pinning package versions may help, or making sure you always run the notebook server in its own, very tightly controlled Python environment and always serve kernels from a separate environment. But if you do need to install other things in the same environment as the notebook server, and there is conflict in the dependencies of those things and the notebook server’s dependencies, things might break;
  • outside: which is to say, things that a user or administrator might introduce into the notebook environment to extend it. As in the example of the extension I installed that in its current version appears to cause the 500 server error noted above.
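To make the “inside” case a little more concrete, here’s a toy sketch (package names and pins are hypothetical) of the sort of pin check you might run to spot drift between the versions you tested a notebook server environment against and the versions actually installed:

```python
def check_pins(pins, installed):
    """Return {package: (installed_version, pinned_version)} for any mismatches.

    `pins` holds the versions the environment was tested against; `installed`
    is what's actually present (in practice you might gather this via
    importlib.metadata.version for each package).
    """
    return {
        pkg: (installed.get(pkg), want)
        for pkg, want in pins.items()
        if installed.get(pkg) != want
    }

# hypothetical pins and environment state
pins = {"notebook": "6.4.0", "nbconvert": "6.1.0"}
installed = {"notebook": "6.4.0", "nbconvert": "6.0.7"}
print(check_pins(pins, installed))  # flags the nbconvert drift
```

Running a check like this as part of an image build won’t stop dependency conflicts, but it does turn silent drift into a visible signal.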

Note that in the case of the outside introduced breakage, the error for the user appears to be that something inside the notebook server is broken: the user draws the system boundary around the notebook server and its extensions, whilst the developer (core notebook server dev, or extension developer) sees the world a bit differently:

There are folk who make an academic career out of such concerns of course, who probably have a far more considered take on how software decays and how software rot manifests itself, so here are a few starters for 10 that I’ve added to my reading pile (no idea how good they are: this was just a first quick grab):

  • Le, Duc Minh, et al. “Relating architectural decay and sustainability of software systems.” 2016 13th Working IEEE/IFIP Conference on Software Architecture (WICSA). IEEE, 2016.
  • Izurieta, Clemente, and James M. Bieman. “How software designs decay: A pilot study of pattern evolution.” First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007). IEEE, 2007.
  • Izurieta, C., Vetrò, A., Zazworka, N., Cai, Y., Seaman, C., & Shull, F. (2012, June). Organizing the technical debt landscape. In 2012 Third International Workshop on Managing Technical Debt (MTD) (pp. 23-26). IEEE.
  • Hassaine, S., Guéhéneuc, Y. G., Hamel, S., & Antoniol, G. (2012, March). Advise: Architectural decay in software evolution. In 2012 16th European Conference on Software Maintenance and Reengineering (pp. 267-276). IEEE.
  • Hochstein, Lorin, and Mikael Lindvall. “Combating architectural degeneration: a survey.” Information and Software Technology 47.10 (2005): 643-656.

jupyterlite — “serverless” Jupyter In the Browser Using Pyodide and WASM

Several years ago, a Mozilla project announced pyodide, a full Python stack, compiled to WebAssembly / WASM, running in the browser. Earlier this year, pyodide was spun out into its own community governed project (Pyodide Spin Out and 0.17 Release) which means it will now stand or fall based on its usefulness to the community. I’m hopeful this is a positive step, and it’ll be interesting to see how active the project becomes over the next few months.

Since then, a full scipy stack appeared, runnable via pyodide, along with the odd false start (most notably, jyve) at getting a Jupyter server running in the browser. Originally, pyodide had supported its own notebook client (indeed, had been created for it) but that project — iodide — soon languished.

As I haven’t really been Tracking Jupyter since summer last year, there are probably more than a few projects ticking along whose earliest signs I missed, and that have only now come to my attention through occasional mentions on social media that have passed my way.

One of these is jupyterlite (docs), “a JupyterLab distribution that runs entirely in the browser built from the ground-up using JupyterLab components and extensions”. It’s not classic notebook, but it does suggest there’s a running jupyter server available as a WASM component…

So why is this interesting?

To run a Jupyter notebook requires three things:

  • a client in the browser;
  • a Jupyter server to serve the client and connect it to a kernel process;
  • a computing environment to execute code in code cells (the kernel process).

If you access a hosted Jupyter environment, someone else manages the Jupyter server and computing environment for you. If you run notebooks locally, you need at least a Jupyter server, and then you can either connect to a remote kernel or run one locally.

To run a multi-user hosted server, you need to run the Jupyter server, and potentially also manage authentication, persistent storage for users to save their notebooks, and the compute backend to serve the kernel processes. This means you need infrastructure of the hard kind (servers, storage, bandwidth), and you become a provider of infrastructure of the soft kind (Jupyter notebooks as a service).

With Jupyter running in the browser, using something like Jupyterlite, all you need is a web server. Which you’re probably already running. The notebook server now runs in the browser; the kernel now runs in the browser; and the client (JupyterLab) continues to run in the browser, just as it ever did.

In JupyterLite, storage is provided by the local browser storage, which means you need to work with a single browser. (With many browsers, such as Chrome, now offering browser synchronisation, I wonder if the local storage is synced too? If so, then you can work from any browser you can “log in” to in order to enable synchronisation services.)

To my mind, this is a huge win. You don’t need to host any compute or storage services to make interactive computing available to your users/students/learners: you just need a webserver. And you don’t even need to run your own: the jupyterlite demo runs using Github pages.

jupyterlite demo running on Github Pages

For open education, this means you can make a computing environment available, in the browser, using just a webserver, without the overhead, or security concerns, of running a compute backend capable of running arbitrary, user submitted code.

So that’s one thing.

For learners running things locally, they just need a simple web server (I think serving is required: clicking on an HTML document to open it directly from the local filesystem may hit issues in browsers that expect content to be served with a particular MIME-type).
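Python’s built-in web server is probably enough for this. As a minimal sketch (assuming the JupyterLite site has been built into the current directory), this also makes sure .wasm files get served with the MIME type browsers expect for WebAssembly:

```python
import http.server

class LiteHandler(http.server.SimpleHTTPRequestHandler):
    # browsers want WebAssembly content served as application/wasm
    extensions_map = {
        **http.server.SimpleHTTPRequestHandler.extensions_map,
        ".wasm": "application/wasm",
    }

# to serve the current directory on http://localhost:8000, run:
#     http.server.test(HandlerClass=LiteHandler, port=8000)
```

(For many setups `python3 -m http.server` on its own will do, since recent Python versions know the wasm MIME type; the explicit mapping is belt and braces.)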

A simple web server is the sort of thing that can be easily packaged and distributed, but it still presents something of an overhead in terms of downloading, installing and then running the service.

Perhaps simpler would be distributing the application as a cross-platform electron app? As far as I know, jupyterlite isn’t (yet?!) packaged that way, but there is at least one demo out there of pyodide bundled inside an electron app: inureyes/pyodide-console (presentation). So it’s not hard to imagine bundling and distributing jupyterlite the same way, although the practicalities may prove fiddly. (Or they may not…)

“So what?”, you may say. “If we’re giving students access to anaconda anyway, what benefit does this bring?” Leaving aside the huge questions I have about using things like Anaconda (not least their lack of generality compared to distributing environments using docker containers, for example), and notwithstanding the ability to provide computing environments purely within the browser as noted earlier, the availability of a Jupyter server and Jupyter kernel running in the browser makes other things possible. Or at least, it allows us to entertain the idea of other applications with a view to seeing if they are realisable.

Hmm… maybe chrome can provide the webserver itself?

So what might those things be? Off the top of my head, and without any serious thought at all, several things come immediately to mind.

Firstly, the possibility of the cross-platform electron distribution (which is essentially an application container wrapping a chrome browser and a simple web server).

Secondly, and nosing around a little, there are already VS Code extensions that seem to be riffing on using jupyterlite too; so if you have access to VS Code in the browser, you could perhaps also install a pyodide Jupyter kernel and run notebooks using that in VS Code in the browser. (I’m not sure if you can host VS Code using just a simple web server, or whether it needs a nodejs server app?)

Thirdly, it’s not hard to imagine a route towards making interactive books available, served via just a web browser. For example, a Jupyter Book UI where, rather than having to hook up to a remote Jupyter server to run (editable) code cells from the page using thebelab, you could just run the cells against the WASM-run kernel in the browser. (What would be required to make a thebelab-like javascript package that would allow a Jupyter Book to connect to a jupyterlite server running from the same browser tab?) It would then be possible to publish a fully interactive textbook using just a simple web server and no other dependencies. The only piece missing from that jigsaw would be a Jupyter Book extension to allow you to save edited code cells into browser storage; and maybe also some means of adding / editing additional HTML cells (then at a later date adding support for markdown, perhaps).

The availability of a thebelab-like package to connect to an “in page” Jupyter environment also means we could support on-demand executable code from any code-bearing HTML page, such as content pages with code examples in a VLE web page, and without the need for backend server support.

Finally, institutionally, jupyterlite makes it possible to publish a simple Jupyter environment directly from the VLE as a “simple” html page, with no compute backend/traditional Jupyter hosting requirement on the backend. The compute/storage requirement must be provided by the end user in the form of a recent browser and a computer that can cope with running the WASM environment inside it.

Related: Fragment – Jupyter Book Electron App.

On the WatchList: VisualPython

A fragmentary note to put a watch on Visual Python, a classic Jupyter notebook extension (note that.. a classic notebook extension) to support visual Python programming:

It’s a bit flaky at the moment — the above screenshot shows multiple previews of the selected function code, and the function preview doesn’t properly render things like the function arguments (nor could I get the function to appear in the list of user defined functions), but it’s early days yet.

At first, a blocker to me in terms of suggesting folk internally have a look at it right now included the apparent inability to define a variable by visual means (all I wanted to do was set a=1), or to clear the notebook cells when I wanted to reflow the visual program into the notebook code cell area.

But then I twigged that rather than trying to create a complete program using the visual tools, a better way of using visualpython might be as a helper for generating code fragments for particular use cases.

In the example below, I created the dataframe manually and then used the editor to create a simple plot command that could be inserted into a notebook code cell. The editor picked up on the dataframe I had defined and used that to prepopulate selection lists in the editor.

If the environment becomes the plaything of devs looking to put complex features into it, seeing it as a rich power tool for themselves (and, contrary to their beliefs, an increasingly hostile environment for novices, as more “powerful” features are added and more visual clutter scares the hell out of users with things that are irrelevant to them), then the basic usability required of a teaching and learning environment will be lost. That risk is greatest if users see it as a tool for creating complete programs visually.

For the developers, it’s all too easy to see how the environment could become as much a toy for adding yet more support for yet more packages that can be demonstrated in the environment but never used (because the power users actually prefer using autocomplete in a “proper IDE”), rather than being simplified for use by novices with very, very, very simple programming demands (just think of the two, three, four line code examples that fill the pages of introductory programming text books).

If folk do want a visual editor for data related programming, wouldn’t they use something like Orange, enso, or the new JupyterLab based orchest?

orchest: JupyterLab visual pipeline programming environment

But if you see the Visual Python editor as a tool at the side that essentially operationalises documentation lookup in a way that helps you create opinionated code fragments, where the opinion is essentially a by-product of the code templates that are used to generate code from particular visual UI selections, then I think it could be useful as a support tool for creating code snippets, not as an authoring tool for writing a more complete program or computational analysis.

So what will I be watching for? User uptake (proxied by mentions I see of it), some simple documentation, and perhaps a two minute preview video tour (I’m not willing to spend my time on this right now because I think it needs a bit more time in the oven…). The usability should improve as novices get confused and raise issues with how to perform the most basic of tasks and as the noosphere finds a way to conceptualise the sort of usage patterns and workflows that VisualPython supports best.

My initial reaction was a bit negative: it’s too visually complex already for novices, and some really basic usability issues and operations are either missing or broken, if you see it as an editor for creating complete programs. But if you view it as a code generating documentation support tool that lets you hack together a particular code fragment with visual cues that you might otherwise pick up from documentation, documentation code examples or simple tutorials, then I think it could be useful.

Hmmm… Another thing to try to get my head round in the context of generative workflow tools…

Show Your Working, Check Your Working, Check the Units

One of the common refrains in maths, physics and engineering education is to “show your working” and “check your working”. In physics and engineering, “check the units” is also commonly heard.

If you have a calculation to do, show the algebraic steps, then substitute in the numbers as part of the working, showing partial results along the way.

As I slowly start to sketch out examples of how we can use one-piece generative document workflows both to create educational materials and, by sharing the tools of production with learners in the guise of a mechanical tutor, to support self-checking, worked equations and checked working seem to provide good examples of how to demonstrate this sort of practice.

The handcalcs python package provides a simple but effective way to write simple mathematical expressions and then automate the production of a simple worked example.

handcalcs worked example

In physics and engineering settings, dimensional analysis can provide a powerful shortcut to checking that a derived equation produces a thing of the correct dimension: was it V=IR, or V=I/R? A quick dimensional analysis, if you know your SI units, can help check.
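As a toy sketch of the idea (this isn’t any particular units package’s API, just exponent bookkeeping over SI base units), the V=IR vs V=I/R check looks like this:

```python
# represent each quantity's dimensions as SI base-unit exponents,
# e.g. a volt is kg·m²·s⁻³·A⁻¹ and an ohm is kg·m²·s⁻³·A⁻²
VOLT = {"kg": 1, "m": 2, "s": -3, "A": -1}
AMP = {"A": 1}
OHM = {"kg": 1, "m": 2, "s": -3, "A": -2}

def mul(a, b):
    """Multiply two quantities dimensionally: add exponents, dropping zeros."""
    out = {}
    for dim in set(a) | set(b):
        exp = a.get(dim, 0) + b.get(dim, 0)
        if exp:
            out[dim] = exp
    return out

def div(a, b):
    """Divide dimensionally: multiply by the inverted exponents."""
    return mul(a, {dim: -exp for dim, exp in b.items()})

print(mul(AMP, OHM) == VOLT)  # V = I·R passes the dimensional check: True
print(div(AMP, OHM) == VOLT)  # V = I/R fails it: False
```

A proper units package does the same bookkeeping, but attached to actual numerical values.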

There are several packages out there that provide units of measurement that can be used to type numerical values with particular units. The forallpeople package is one such, and also happens to play nicely with handcalcs.

Another handy benefit of a good units of measurement package, for production as well as student self-checking, is the mechanical support for expressing the units in appropriate form, given the magnitude of an expressed quantity:

Demonstration of forallpeople units of measurement

Current SubjectMatterNotebooks example:

Fragment: Factory Edjucashun

A not thought out at all fragment that came to mind reading the first few pages of Taiichi Ohno’s Toyota Production System book last night.

First up, if we have open book assessment, then for efficient students motivated solely by accreditation, everything in the course material that does not directly help get marks in the assessment is waste. Which made me wonder: to get better coverage of the course material, so huge chunks of it aren’t waste in a particular presentation, should we have different assessments for different students that cover the whole of the curriculum across the student body? As to why we don’t do this, I suspect “quality” (standardisation) is the answer: by giving everyone the same assessment, we can get a deviation of marks to find who the good and bad outliers are. And we also get to fiddle the distributions to fix questions, or markers, that didn’t seem to work so well by manipulating the stats…

Secondly, if teaching universities are factories working on the raw material that is a student, what’s the output? A standard product from each course, where each student can produce the same function, albeit with a range of tolerances? Or a material transformation, where the same processing or transformation steps have been applied to materials of varying quality (different students with different interests, skills, resources, ability, etc.)?

Cross Platform Docker Builds… Or not…

One of the things I’ve started trying really hard to do for our Docker builds is to try to create cross-platform images that run on arm64 and arm32 devices as well as amd64. This means they should run on new Mac M1s as well as Raspberry Pis running 32 and 64 bit o/s.

I thought I had a stack working most of the way today with an official Python base container (Debian based), but the TM351 Data Management and Analysis build fell at the last hurdle on arm64 because arm64/Debian is not supported by Mongo (Mongo is generally the thing that f**ks with me every time I try to update the image). The arm32 build had actually fallen a bit before that, because the build I wanted to do needs a very old version of proj4.

Switching to an ubuntu base container also caused issues. Some of the packages are at different versions between ubuntu and Debian (Postgres 12 vs 11, libblis3-serial vs libblis2-serial), and on RPi Python fails with a “pyinit_main: can’t initialize time” error that seems to require running the container in privileged mode as a fix.

It’s way too late, and I’m way too tired (3am is just a minute away) to try fixing the build stack again, so I figure: make these notes of what I think are the main diffs between Debian and ubuntu, and go back to the ubuntu build. If students want to run arm64 at home, I can give them a docker-compose setup (which was my preference anyway: the monolithic container is primarily to suit the single container needs of a hosted Jupyter solution) with Mongo running in its own container.

Hmm… this suggests that I should make a docker compose script the primary instruction route for student setup… then all we need to do is ship them an appropriate docker-compose file and they don’t need to know how many containers are running…

The downside, of course, is that the docker-compose route requires a file download, whereas the docker run route can just be typed. (I’m still waiting on Docker Dashboard to let you pull an image… And I haven’t had a chance to try to simplify portainer to make that the default home user UI.)

So, four past 3… a.m. eternal…

And just gone 10 past eternal, another gotcha: in ubuntu, where I need to manually install the is-it-really-still-that-recent Python 3.9, I also need to manually link the python command:

RUN curl -o && \
    ln -s /usr/bin/python3.9 /usr/bin/python

Placeholder — Rebooted Jupyter Notebook Advocacy Project: One-Piece Generative Documents

A few years ago, I started cobbling together notebooks across a range of subject areas as part of an informal show’n’tell project. The idea was to try to create a set of resources to demonstrate some of the ways in which Jupyter notebooks could be used to mediate the production and delivery of computationally supported teaching and learning across a wide range of topics.

Over the last week or so, I started updating those materials. One of the motivating reasons was as a way of finally getting round to trying out the Jupyter Book publishing system, which is starting to mature nicely (as I always expected it would…!), and comparing it with the bookdown publishing workflow, which I’ve also revisited recently in the guise of putting together various rally data reports.

Another reason was that I’ve started pondering lean manufacturing philosophies which has helped crystallise out some of my thinking around generative documents which is now taking the form of “one piece generative document” production workflows.

The materials I’ve refreshed to date, which just represent the start of a content publishing journey I’ll be pursuing via half-hour hacks and coffee break additions for the foreseeable future, thus far cover a range of topics, with many more to come:

  • astronomy;
  • chemistry;
  • electronics;
  • classical Latin;
  • diagram generation;
  • interactive mapping.

The docs can be found in Jupyter book published form here and in the source repo here.

Running Arbitrary Startup Scripts in Docker Containers

From October 2021, we’re hopefully going to be able to start offering students on several modules access to virtualised computing environments launched from a JupyterHub server.

Architecturally, the computing environments provided to students are ephemeral, created on demand for a particular student study session, and destroyed at the end of it.

So students don’t lose their work, each student will be allocated a generous block of persistent file storage which will be shared into each computing environment when it is requested.

One of the issues we face is how to “seed” various environments. This might include sharing of Jupyter notebooks containing teaching materials, but it might also include sharing pre-seeded database content.

One architectural model we looked at was using docker compose to support the launching of a set of interconnected services, each running in its own container and with its own persistent storage volume. So for example, a student environment might contain a Jupyter notebook server in one container connected to a Postgres database server in another container, each sharing data into its own persistent storage volume.

Another possibility was to launch a single container running multiple services (for example, a Jupyter notebook server and a postgres database server) and mount a separate volume for each user against each service (for example, a notebook storage volume, a database storage volume).

However, my understanding of how JupyterHub on Kubernetes works (which we need for scalability) is that only a single user storage volume can be mounted against a launched environment. Which means we need to persist everything (potentially for several courses running different environments) in a single per-user storage volume. (If my understanding is incorrect, please let me know what the fix is via the comments, or otherwise.)

For our TM351 Data Management and Analysis module, we need to ship a couple of prepopulated databases as well as a Jupyter server proxied OpenRefine server; students then add notebooks distributed by other means. For the TM129 Robotics block, the notebook distribution is baked into the container.

In the first case, we need to be able to copy the original seeded database files into persistent storage, which the students will then be able to update as required. In the second case, we need to be able to copy or move the distributed files into the shared persistent storage volume so any changes to them aren’t lost when the ephemeral computing environment is destroyed.

The solution I’ve come up with is to support the running of arbitrary scripts when a container is started. These scripts can then do things like copy stashed files into the shared persistent storage volume. It’s trivial to make first run / run once functions that set a flag in the persistent storage volume that can be tested for: if the flag isn’t there, run a particular function and set the flag; if the flag is there, skip the function. Or vice versa.
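A minimal sketch of that run once pattern; the flag file name is an assumption, and the flag directory defaults to a temp dir here so the sketch is runnable anywhere (in a real container it would sit inside the mounted persistent volume):

```shell
#!/bin/bash
# Run once guard: only run first_run_setup if the flag file is absent.
# FLAG_DIR stands in for the mounted persistent storage volume.
FLAG_DIR="${FLAG_DIR:-$(mktemp -d)}"
FLAG="$FLAG_DIR/.first_run_complete"

first_run_setup() {
    # e.g. copy stashed files into the persistent volume
    echo "running first run setup"
}

if [ ! -f "$FLAG" ]; then
    first_run_setup
    touch "$FLAG"   # set the flag so setup is skipped on subsequent starts
fi
```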

But of course, the solution isn’t really mine… It’s a wholesale crib of the approach used in repo2docker.

Looking at the repo2docker build files, I notice the lines:

# Add start script
{% if start_script is not none -%}
RUN chmod +x "{{ start_script }}"
ENV R2D_ENTRYPOINT "{{ start_script }}"
{% endif -%}

# Add entrypoint
COPY /python3-login /usr/local/bin/python3-login
COPY /repo2docker-entrypoint /usr/local/bin/repo2docker-entrypoint
ENTRYPOINT ["/usr/local/bin/repo2docker-entrypoint"]
# Specify the default command to run
CMD ["jupyter", "notebook", "--ip", ""]

An answer on Stack Overflow shows how ENTRYPOINT and CMD work together in a Dockerfile (which was new to me):
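In short, when both are given in exec (JSON array) form, Docker concatenates them: the CMD array is appended to the ENTRYPOINT array as its arguments, so the entrypoint script can do its setup and then `exec "$@"` to hand control over to the command. A runnable sketch of that handover, outside Docker, using a throwaway script in place of the real entrypoint:

```shell
#!/bin/bash
# Write a minimal entrypoint script that mimics the repo2docker-entrypoint
# pattern: do some setup, then exec the command it was given as arguments.
cat > /tmp/demo-entrypoint <<'EOF'
#!/bin/bash
echo "entrypoint: setup runs first"
exec "$@"
EOF
chmod +x /tmp/demo-entrypoint

# Docker runs ENTRYPOINT + CMD as a single argv; equivalent to:
/tmp/demo-entrypoint echo "cmd: then the command runs"
```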

So… if we pinch the repo2docker-entrypoint script, we can trivially add our own start scripts.

I also note that the official Postgres and Mongodb repos allow users to pop config scripts into a /docker-entrypoint-initdb.d/ directory that can be used to seed a database on first run of the container using routines in their own entrypoint files (for example, Postgres entrypoint, Mongo entrypoint). This raises the interesting possibility that we might be able to reuse those entrypoint scripts as is, or with only minor modification, to help seed the databases.
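The pattern those entrypoints use is simple enough to sketch. This is a simplified, runnable approximation: the real postgres entrypoint also handles `.sql.gz` files and only seeds when the data directory is empty, and `psql` is stubbed here so the sketch runs anywhere:

```shell
#!/bin/bash
# Stand-in for the /docker-entrypoint-initdb.d directory in the official images
INITDIR="$(mktemp -d)"
echo "CREATE TABLE demo (id int);" > "$INITDIR/01_init.sql"

# Stub psql so the sketch is runnable without a database
psql() { echo "would run: psql $*"; }

# First-run seeding loop, as in the official postgres entrypoint (simplified):
# run each .sql file, source each .sh file, in lexical order
for f in "$INITDIR"/*; do
    case "$f" in
        *.sql) psql -f "$f" ;;
        *.sh)  . "$f" ;;
    esac
done
```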

There’s another issue here: should we create the seeded database files as part of the image build and then copy over the database files and reset the path to those files during container start / first run; or should we seed the database from the raw init-db files and raw data on first run? What are the pros and cons in each case?

Here’s an example of the Dockerfile I use to install and seed PostgreSQL and MongoDB databases, as well as a jupyter-server-proxied OpenRefine server:


# Get database seeding files
COPY ./init_db ./

########## Setup Postgres ##########
# Install the latest version of PostgreSQL.
RUN apt-get update && apt-get install -y postgresql && apt-get clean
# PGDATA is picked up by initdb and pg_ctl
ENV PGDATA=/var/db/data/postgres
RUN mkdir -p /var/db/data && chown -R postgres:postgres /var/db/data

# Set up credentials
#ENV POSTGRES_DB=my_database_name
USER postgres
RUN if [ ! -d "$PGDATA" ]; then initdb -D "$PGDATA" --auth-host=md5 --encoding=UTF8 ; fi && \
    pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

# Check if the server is ready: pg_isready

# Seed postgres database
USER postgres
RUN service postgresql restart && \
    psql postgres -f ./init_db_seed/postgres/init_db.sql
# Put an equivalent of the above in a config file: init_db.sql
#psql -U postgres postgres -f init_db.sql
#psql test < seed_db.sql
# If we don't stop the server cleanly, can bad things happen on shutdown?
#pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" stop
#service postgresql stop

USER root
# Give the jovyan user some permissions over the postgres db
RUN usermod -a -G postgres jovyan

########## Setup Mongo ##########

RUN wget -qO - | sudo apt-key add -
RUN echo "deb buster/mongodb-org/4.4 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.4.list
RUN apt-get update && apt-get install -y mongodb-org

# Set up paths
ARG MONGO_DB_PATH=/var/db/data/mongo
RUN mkdir -p ${MONGO_DB_PATH}

# Unpack and seed the MongoDB
RUN mkdir -p ./tmpdatafiles && \
    tar xvjf ./init_db_seed/mongo/small_accidents.tar.bz2 -C ./tmpdatafiles  && \
    mongod --fork --logpath /var/log/mongosetup --dbpath ${MONGO_DB_PATH} && \
    mongorestore --drop --db accidents ./tmpdatafiles/small_accidents && \
    rm -rf ./tmpdatafiles && rm -rf ./init_db
#    mongod --shutdown --dbpath ${MONGO_DB_PATH} 

########## Setup OpenRefine ##########
RUN apt-get update && apt-get install -y openjdk-11-jre
ARG OPENREFINE_PATH=/var/openrefine
# The version needs passing in as a build arg, e.g. --build-arg OPENREFINE_VERSION=3.4.1
ARG OPENREFINE_VERSION
RUN wget -q -O openrefine-${OPENREFINE_VERSION}.tar.gz${OPENREFINE_VERSION}/openrefine-linux-${OPENREFINE_VERSION}.tar.gz \
        && tar xzf openrefine-${OPENREFINE_VERSION}.tar.gz \
        && mv openrefine-${OPENREFINE_VERSION} $OPENREFINE_PATH \
        && rm openrefine-${OPENREFINE_VERSION}.tar.gz
RUN pip install --no-cache git+

########## Setup start procedure ##########

USER root

# Copy over start scripts and handle startup procedure
COPY start /var/startup/start
RUN chmod +x /var/startup/start
ENV R2D_ENTRYPOINT /var/startup/start
COPY repo2docker-entrypoint /usr/local/bin/repo2docker-entrypoint
COPY python3-login /usr/local/bin/python3-login
RUN chmod +x /usr/local/bin/repo2docker-entrypoint
RUN chmod +x /usr/local/bin/python3-login
ENTRYPOINT ["/usr/local/bin/repo2docker-entrypoint", "tini", "-g", "--"]
CMD [""]

What the image does is seed the databases into known locations.

What I need to do next is fettle the start file to copy (or move) the database storage files into a location inside the mounted storage volume and then reset the database directory path environment variables before starting the database services, which are currently started in the copied over start file:


#!/bin/bash

service postgresql restart

mongod --fork --logpath /dev/stdout --dbpath ${MONGO_DB_PATH}

# Test dir
#if [ -d "$DIR" ]; then

# Test file
#if [ -f "$FILE" ]; then

if [ -d "/var/stash/content" ]; then
    mkdir -p /home/jovyan/content
    cp -r /var/stash/content/* /home/jovyan/content
fi

exec "$@"
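That next step might look something like the following in the start file. All paths here are illustrative assumptions, defaulting to temp dirs so the sketch is runnable; in the real container they would point at the baked-in database files and the mounted volume:

```shell
#!/bin/bash
# Baked-in database files (image) and per-user persistent volume (mount);
# both are stand-in temp dirs in this sketch.
STASH="${STASH:-$(mktemp -d)}"
PERSISTENT="${PERSISTENT:-$(mktemp -d)}"
touch "$STASH/demo.db"               # pretend seeded database file

FLAG="$PERSISTENT/.db_seeded"
if [ ! -f "$FLAG" ]; then
    # First run: copy the seeded files into the persistent volume
    mkdir -p "$PERSISTENT/mongo"
    cp -r "$STASH/." "$PERSISTENT/mongo/"
    touch "$FLAG"
fi

# Repoint the service at the persistent copy before starting it
export MONGO_DB_PATH="$PERSISTENT/mongo"
```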

Caught in the Act — When Recorded Times Aren’t

SS7 on Rally Portugal turned out to be a nightmare for Thierry Neuville, who buckled a wheel, and Elfyn Evans, who ran into Neuville’s dust cloud after the final split.

Evans had been on something of a charge, with a stage win on the cards. By the final split, he was still matching first on the road Seb Ogier’s time on a stage that seemed to buck the trend of the previous stages, where sweeping had been an expensive affair.

But then: thick dust hanging over the road that reduced visibility to zero. Even with pace notes, it was obvious there was trouble ahead; and pace notes don’t flag extra cautions to signal the presence of a limping Hyundai i20 looming out of the murk in the middle of a single track road on a slight left.

The timing screen told the sorry tale, which I reimagined on my RallyDataJunkie page for the stage:

Looking at time differences to get from one split point to the next, Evans had been up at the start of the stage, though he had perhaps started slowing:

If we look at his pace (the time taken to drive 1km), which takes into account the distance travelled between split points, we see it was good, matching Ogier’s over the first half of the stage, though he was perhaps slowing in the third quarter:

Looking at the ultimate transit times recorded between split points, we see Evans led over the first two splits, but dropped time to split 3.

Was that just a blip, or would Evans have picked up the pace at the end? Ogier often finishes strong, but could Evans have taken the stage? We’ll never know…

But anyone looking simply at the times on the official timing screen half an hour or so after the end of the stage might also be misled, unless they understand the vagaries of rally timing…

Here’s what the timing screen looks like now:

And here’s what my take on it is:

Spot anything different compared to my original table?

Evans was (rightly) given a recalculated time, equivalent to Ogier’s.

No other drivers were affected, so the other times stand. But if I reflow my data tables, the story is lost. And if I update the pace tables to use the recalculated time, and other folk use those tables, they’re not right, at least in terms of the story they tell of Evans’ SS7.

Who knows what would have happened in that final stretch?!

The next time I run my table data, the original story will be lost. My data structures can’t cope with revised times… so a remnant of the data story will just have to suffice here…