Running Arbitrary Startup Scripts in Docker Containers

From October 2021, we’re hopefully going to be able to start offering students on several modules access to virtualised computing enviornments launched from a JupyterHub server.

Architecturally, the computing environments provided to students are ephemeral, created on demand for a particular student study session, and destroyed at the end of it.

So students don’t lose their work, each student will be allocated a generous block persistent file storage which will be shared into each computer environment when it is requested.

One of the issues we face is how to “seed” various environments. This might include sharing of Jupyter notebooks containing teaching materials, but it might also include sharing pre-seeded database content.

One architectural model we looked at was using docker compose to support the launching of a set of interconnected services, each running in its own container and with its own persistent storage volume. So for example, a student environment might contain a Jupyter notebook server in one container connected to a Postgres database server in another container, each sharing data into its own persistent storage volume.

Another possibility was to launch a single container running multiple services (for example, a Jupyter notebook server and a postgres database server) and mount a separate volume for each user against each service (for example, a notebook storage volume, a database storage volume).

However, my understanding of how JupyterHub on Kubernetes works (which we need for scaleability) is that only a single user storage volume can be mounted against a launched environment. Which means we need to persist everything (potentially for several courses running different environments) in a single per-user storage volume. (If my understanding is incorrect, please let me know what the fix is via the comments, or otherwise.)

For our TM351 Data Management and Analysis module, we need to ship a couple of prepopulated databases as well as a Jupyter server proxied Open Refine server; students then add notebooks distributed by other means. For TM129 Robotics block, the notebook distribution is baked into the container.

In the first case, we need to be able to copy the original seeded database files into persistent storage, which the students will then be able to update as required. In the second case, we need to be able to copy or move the distributed files into the shared persistent storage volume so any changes to them aren’t lost when the ephemeral computing environment is destroyed.

The solution I’ve come up with, inspired by the MyBinder start feature, is to support the running of arbitrary scripts when a container is started. These scripts can then do things like copy stashed files into the shared persistent storage volume. It’s trivial to make first run / run once functions that set a flag in the persistent storage volume that can be tested for: if the flag isn’t there, run a particular function. If it isn’t, don’t run the function. Or vice versa.

But of course, the solution isn’t really mine… It’s a wholesale crib of the approach used in repo2docker.

Looking at the repo2docker build files, I notice the lines:

# Add start script
{% if start_script is not none -%}
RUN chmod +x "{{ start_script }}"
ENV R2D_ENTRYPOINT "{{ start_script }}"
{% endif -%}

# Add entrypoint
ENV PYTHONUNBUFFERED=1
COPY /python3-login /usr/local/bin/python3-login
COPY /repo2docker-entrypoint /usr/local/bin/repo2docker-entrypoint
ENTRYPOINT ["/usr/local/bin/repo2docker-entrypoint"]
# Specify the default command to run
CMD ["jupyter", "notebook", "--ip", "0.0.0.0"]

An answer on Stack Overflow shows how ENTRYPOINT and CMD work together in a Dockerfile (which was new to me):

So… if we pinch the repo2docker-entrypoint script, we can trivially add our own start scripts

I also note that the official Postgres and Mongodb repos allow users to pop config scripts into a /docker-entrypoint-initdb.d/ that can be used to seed a database on first run of the container uisng routines in their own entrypoint files (for example, Postgres entrypoint, Mongo entrypoint). This raises the interesting possiblity that we might be able to reuse those entrypoint scripts as is or with only minor modification to help seed the databases.

There’s another issue here: should we create the seeded database files as part of the image build and then copy over the database files and reset the path to those files duitng container start / first run; or should we seed the database from the raw init-db files and raw data on first run? What are the pros and cons in each case?

Here’s an example of the Dockerfile I use to install and seed PostgreSQL and MongoDB databases, as well as a a jupyter-server-proxied Open Refine server:

#Dockerfile

# Get database seeding files
COPY ./init_db ./

########## Setup Postgres ##########
# Install the latest version of PostgreSQL.
RUN apt update && apt-get install -y postgresql && apt-get clean
RUN PG_DB_DIR=/var/db/data/postgres && mkdir -p $PG_DB_DIR

# Set up credentials
ENV POSTGRES_USER=postgres
ENV POSTGRES_PASSWORD=postgres
#ENV POSTGRES_DB=my_database_name
ARG PGDATA=${PGDATA}:-${PG_DB_DIR} && \
    if [ ! -d "$PGDATA" ]; then initdb -D "$PGDATA" --auth-host=md5 --encoding=UTF8 ; fi && \
    pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

#  Check is server is readey: pg_isready

# Seed postgres database
USER postgres
RUN service postgresql restart && psql postgres -f ./init_db_seed/postgres/init_db.sql && \
   ./init_db_seed/postgres/init_db.sh  
    #Put an equivalent of the above in a config file: init_db.sql
    #psql -U postgres postgres -f init_db.sql
    #psql test < seed_db.sql
    #pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" stop
# if we don't stop it, can bad things happen on shutdown?
 #&& service postgresql stop

USER root
# Give the jovyan user some permissions over the postgres db
RUN usermod -a -G postgres jovyan

########## Setup Mongo ##########

RUN wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | sudo apt-key add -
RUN echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/4.4 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.4.list
RUN apt-get update && apt-get install -y mongodb-org

# Set up paths
ARG MONGO_DB_PATH=/var/db/data/mongo
ENV MONGO_DB_PATH=${MONGO_DB_PATH}
RUN mkdir -p ${MONGO_DB_PATH}

# Unpack and seed the MongoDB
RUN mkdir -p ./tmpdatafiles && \
    tar xvjf ./init_db_seed/mongo/small_accidents.tar.bz2 -C ./tmpdatafiles  && \
    mongod --fork --logpath /var/log/mongosetup --dbpath ${MONGO_DB_PATH} && \
    mongorestore --drop --db accidents ./tmpdatafiles/small_accidents && \
    rm -rf ./tmpdatafiles && rm -rf ./init_db
#    mongod --shutdown --dbpath ${MONGO_DB_PATH} 



########## Setup OpenRefine ##########
RUN apt-get update && apt-get install -y openjdk-11-jre
ARG OPENREFINE_VERSION=3.4.1
ARG OPENREFINE_PATH=/var/openrefine
ENV PATH="${OPENREFINE_PATH}:${PATH}"
RUN wget -q -O openrefine-${OPENREFINE_VERSION}.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/${OPENREFINE_VERSION}/openrefine-linux-${OPENREFINE_VERSION}.tar.gz \
        && tar xzf openrefine-${OPENREFINE_VERSION}.tar.gz \
        && mv openrefine-${OPENREFINE_VERSION} $OPENREFINE_PATH \
        && rm openrefine-${OPENREFINE_VERSION}.tar.gz
RUN pip install --no-cache git+https://github.com/innovationOUtside/nb_serverproxy_openrefine.git


########## Setup start procedure ##########

USER $NB_USER
USER root

# Copy over start scripts and handle startup procedure
COPY start /var/startup/start
RUN chmod +x /var/startup/start
ENV R2D_ENTRYPOINT /var/startup/start
COPY repo2docker-entrypoint /usr/local/bin/repo2docker-entrypoint
COPY python3-login /usr/local/bin/python3-login
RUN chmod +x /usr/local/bin/repo2docker-entrypoint
RUN chmod +x /usr/local/bin/python3-login
ENTRYPOINT ["/usr/local/bin/repo2docker-entrypoint", "tini", "-g", "--"]
CMD ["start-notebook.sh"]

What the image does is seed the datbases into known locations.

What I need to do next is fettle the start file to copy (or move) the database storage files into a location inside the mounted storage volume and then reset the database directoy path environment variables before starting the database services, which are currently started in the copied over start file:

#!/bin/bash

service postgresql restart

mongod --fork --logpath /dev/stdout --dbpath ${MONGO_DB_PATH}

# Test dir
#if [ -d "$DIR" ]; then
#fi

# Test file
#if [ -f "$FILE" ]; then
#fi

if [ -d "/var/stash/content" ]; then
    mkdir -p /home/jovyan/content
    cp –r /var/stash/content/* /home/jovyan/content
fi

exec "$@" 

PS Things are never, of course, that simple. We may also need to elevate permissions to run scripts. See for example With Permission: Running Arbitrary Startup Services In Docker Containers.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

One thought on “Running Arbitrary Startup Scripts in Docker Containers”

  1. It’s encouraging there’s so much “room” for optimizing it just the way you need/want it for the different courses that plan on using it. Seems like you can get it configured, refine it, and then just run on automatic for the rest of the semester.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.