Fragment – Jupyter For Edu

With more and more core components, as well as user contributions, being added to the Jupyter framework, I’m starting to lose track of what’s possible. One of the things that might be useful for the OU, and Institute of Coding, context is to explore various architectural patterns that can be constructed in a Jupyter mediated environment that are particularly useful for education.

In advance of getting a Github repo / wiki together to start that, here are a few fragments from my feeds, several of which have appeared in just the last couple of days:

Jupyter Enterprise Gateway Now a Top Level Jupyter Project

Via the Jupyter blog, I see the Jupyter Enterprise Gateway is now a top-level Jupyter project.

The Jupyter Enterprise Gateway “enables Jupyter Notebook to launch remote kernels in a distributed cluster”, which provides a handy separation between a notebook server (or Jupyterhub multi-user notebook server) and the kernel that a notebook runs against. For example, Jupyter Enterprise Gateway can be used to create kernels in a scalable way using Kubernetes, or (I’m guessing…?) to do things like launch remote kernels running on a GPU cluster. From the docs it looks like Jupyter Enterprise Gateway should work in a Jupyterhub context, although I can’t offhand find a simple howto / recipe for how to do that. (Presumably, Jupyterhub creates and launches user specific notebook server containers, and these then create and connect to arbitrary kernel running back-ends via the Jupyter Enterprise Gateway? Here’s a related issue I found.)

Running Notebook Cells One at a Time in a Terminal

The ever productive Doug Blank has a recipe for stepping through notebook cells in a terminal [code: nbplayer]. The player launches an IPython terminal that displays the first cell in the notebook and lets you step through the cells (executing or skipping each one) one at a time. You can also run your own commands in between stepping through the notebook cells.

I can imagine using this to create a fixed set of steps for an activity that I want a student to work through, whilst giving them “free time” to explore the state of the current execution environment, for example, or try out particular “given” functions with different parameters. This approach also provides a workaround for using notebook authored exercises in the terminal environment, which I know some colleagues favour over the notebook environment.

On my to do list is to recast some of the activities from the new TM112 course to see how they feel using this execution model, and then compare that to the original activity and to the same notebook run in a notebook environment.

Adding Multiple Student Users to a Jupyterhub Environment

Also via Doug Blank, a recipe for adding multiple users to a Jupyterhub environment using a form that allows you to simply add a list of user names: a more flexible way of adding accounts to Jupyterhub. User account details and random passwords are created automatically and then emailed to students.

To allow users to change passwords, e.g. on first run, I think the NotebookApp.allow_password_change=True notebook server parameter (Jupyter notebook – Config file and command line options) should do the trick?
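
For example, a minimal sketch of that setting as it would appear in a jupyter_notebook_config.py file (untested in our context; I think the option appeared around notebook 5.5):

##jupyter_notebook_config.py
#Allow logged in users to change their password from the notebook web UI
c.NotebookApp.allow_password_change = True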

The repo also shows a way of bundling nbviewer to allow users to “publish” HTML versions of their notebooks.

Doug also points to yuvipanda/jupyterhub-firstuseauthenticator, a first use authenticator for Jupyterhub that allows new users to create an account and then set a password on it. This could be really handy for workshops, where you want to allow users to self-serve an environment that persists over a couple of workshop sessions, for example. (One thing we still need to do in the OU is get a Jupyterhub server up and running with persistent user storage; for TM112, we ran a temporary notebook server, which meant students couldn’t save and return to notebooks on the server – they’d have to download notebooks and then re-upload them into a new session if they wanted to return to working on a notebook they had modified. That said, the activity was designed as a “disposable” activity…)
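
For reference, enabling the authenticator should just be a jupyterhub_config.py setting along the following lines (a sketch based on the package README; I haven’t run it myself, and the dbm_path location is just for illustration):

##jupyterhub_config.py
#Create an account, and set its password, the first time a new username logs in
c.JupyterHub.authenticator_class = 'firstuseauthenticator.FirstUseAuthenticator'
#Optionally, locate the password store somewhere persistent
#c.FirstUseAuthenticator.dbm_path = '/srv/jupyterhub/passwords.dbm'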

Zip All Notebooks

This handy extension — nbzip — provides a button to zip and download a Jupyter notebook server folder. If you’re working on a temporary notebook server, this provides an easy way of grabbing all the notebooks in one go. What might be even nicer would be to select a sub-folder, or selected set of files, using checkbox selectors? I’m not sure if there’s a complementary tool that will let you upload a zipped archive and unpack it in one go?
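
In the meantime, one workaround is to upload the zip file via the notebook homepage and unpack it from a code cell using the Python standard library; for example (assuming the archive was uploaded as notebooks.zip):

#Unpack an uploaded zip archive from inside a notebook code cell
import zipfile

with zipfile.ZipFile('notebooks.zip') as zf:
    zf.extractall('.')  #Unpack into the current (notebook server) directory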

Fragment – Running Multiple Services, such as Jupyter Notebooks and a Postgres Database, in a Single Docker Container

Over the last couple of days, I’ve been fettling the build scripts for the TM351 VM, which typically uses vagrant to build a VirtualBox VM from a set of shell scripts, so they can be used to build a single Docker container that runs all the TM351 services, specifically Jupyter notebooks, OpenRefine, PostgreSQL and MongoDB.

Docker containers are typically constructed to run a single service, with compositions of containers wired together using Docker Compose to create applications that deliver, or rely on, more than one running service. For example, in a previous post (Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API) I showed how to set up a couple of containers to work together, one running a MySQL database server, the other an http service that provided an API to the database.

So how to run multiple services in the same container? Docs on the Docker website suggest using supervisord to run multiple services in a single container, so here’s a fragment on how I’ve done that from my TM351 build.

To begin with, I’ve built the container up as a tiered set of containers, in a similar way to how the stack of opinionated Jupyter notebook Docker containers is constructed:

#Define a stub to identify the images in this image stack
IMAGESTUB=psychemedia/tm361testm

# minimal
## Define a minimal container, eg a basic Linux container
## using whatever flavour of Linux we prefer
docker build --rm -t ${IMAGESTUB}-minimal-test ./minimal

# base
## The base container installs core packages
## The intention is to define a common build environment
## populated with packages likely to be common to many courses
docker build --rm --build-arg BASE=${IMAGESTUB}-minimal-test -t ${IMAGESTUB}-base-test ./base

#...

One of the things I’ve done to try to generalise the build steps is allow the name of a base container to be used to bootstrap a new one by passing the name of the base image in via an optional variable (in the above case, --build-arg BASE=${IMAGESTUB}-minimal-test). Each Dockerfile in a build step directory uses the following construction to work out which image to use as the FROM basis:

#Set ARG values using --build-arg ARG_NAME=VALUE on the docker build command line
#Each ARG value can also have a default value
ARG BASE=psychemedia/ou-tm351-base-test
FROM ${BASE}

Using the same approach, I have used separate build tiers for the following components:

  • jupyter base: minimal Jupyter notebook install;
  • jupyter custom: add some customisation onto a pre-existing Jupyter notebook install;
  • openrefine: add the OpenRefine application; (note, we could just use BASE=ubuntu to create this as a simple, standalone OpenRefine container);
  • postgres: create a seeded PostgreSQL database; note, this could be split into two: a base postgres tier and then a customisation that adds users, creates and seeds databases etc;
  • mongodb: add in a seeded mongo database; again, the seeding could be added as an extra tier on a minimal database tier;
  • topup: a tier to add in anything I’ve missed without having to go back to rebuild from an earlier step…

The intention behind splitting out these tiers is that we might want to have a battle hardened OU postgres tier, for example, that could be shared between different courses. Alternatively, we might want to have tiers offering customisations for specific presentations of a course, whilst reusing several other fixed tiers intended to last out the life of the course.

By the by, it can be quite handy to poke inside an image once you’ve created it to check that everything is in the right place:

#Explore inside an image by entering it with a shell command
docker run -it --entrypoint=/bin/bash psychemedia/ou-tm351-jupyter-base-test -i

Once the services are in place, I add a final layer to the container that ensures supervisord is available and set up with an appropriate supervisord.conf configuration file:

##Dockerfile
#Final tier Dockerfile
ARG BASE=psychemedia/testpieces
FROM ${BASE}

USER root
RUN apt-get update && apt-get install -y supervisor

RUN mkdir -p /openrefine_projects  && chown oustudent:100 /openrefine_projects
VOLUME /openrefine_projects

RUN mkdir -p /notebooks  && chown oustudent:100 /notebooks
VOLUME /notebooks

RUN mkdir -p /var/log/supervisor
COPY monolithic_container_supervisord.conf /etc/supervisor/conf.d/supervisord.conf

EXPOSE 3334
EXPOSE 8888

CMD ["/usr/bin/supervisord"]

The supervisord.conf file is defined as follows:

##supervisord.conf
##We can check running processes under supervisord with: supervisorctl

[supervisord]
nodaemon=true
logfile=/dev/stdout
loglevel=trace
logfile_maxbytes=0
#The HOME envt needs setting to the correct USER
#otherwise jupyter throws: [Errno 13] Permission denied: '/root/.local'
#https://github.com/jupyter/notebook/issues/1719
environment=HOME=/home/oustudent

[program:jupyternotebook]
#Note the auth is a bit ropey on this atm!
command=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --allow-root --NotebookApp.password= --NotebookApp.token=
#The directory we want to start in
#(replaces jupyter notebook parameter: --notebook-dir=/notebooks)
directory=/notebooks
autostart=true
autorestart=true
startsecs=5
user=oustudent
stdout_logfile=NONE
stderr_logfile=NONE

[program:postgresql]
command=/usr/lib/postgresql/9.5/bin/postgres -D /var/lib/postgresql/9.5/main -c config_file=/etc/postgresql/9.5/main/postgresql.conf
user=postgres
autostart=true
autorestart=true
startsecs=5

[program:mongodb]
command=/usr/bin/mongod --dbpath=/var/lib/mongodb --port=27351
user=mongodb
autostart=true
autorestart=true
startsecs=5

[program:openrefine]
command=/opt/openrefine-3.0-beta/refine -p 3334 -i 0.0.0.0 -d /vagrant/openrefine_projects
user=oustudent
autostart=true
autorestart=true
startsecs=5
stdout_logfile=NONE
stderr_logfile=NONE

One thing I need to do better is to find a way to stage the construction of the supervisord.conf file, bearing in mind that multiple tiers may relate to the same service; for example, I have a jupyter-base tier to create a minimal Jupyter notebook server and then a jupyter-base-custom tier that adds in specific customisations, such as branding and course related notebook extensions.

When the final container is built, the supervisord command is run and the multiple services started.

One other thing to note: we’re hoping to run TM351 environments on an internal OpenStack cluster. The current cluster only allows students to expose a single port, and port 80 at that, from the VM (IP addresses are in scant supply, and network security lockdowns are in place all over the place). The current VM exposes at least two http services: Jupyter notebooks and OpenRefine, so we need a proxy in place if we are to expose them both via a single port. Helpfully, the nbserverproxy Jupyter extension (as described in Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy) allows us to do just that. One thing to note, though – I had to enable it via the same user that launches the notebook server in the supervisord.conf settings:

##Dockerfile fragment

RUN $PIP install nbserverproxy

USER oustudent
RUN jupyter serverextension enable --py nbserverproxy
USER root

To run the container, I can call something like:

docker run -p 8899:8888 -d psychemedia/tm351dockermonotest

and then to access the additional services, I can browse to e.g. localhost:8899/proxy/3334/ to see the OpenRefine application.

PS in case you’re wondering why I syndicated this through RBloggers too, the same recipe will work if you’re using Jupyter notebooks with an R kernel, rather than the default IPython one.

Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy

Over the last couple of weeks I’ve been circling, but failing to make much actual progress on using, OpenStack as a platform for making self-serve OU hosted VMs available to students. (I’m increasingly starting to think this is not sensible, but I’m struggling to find someone I can chat to about it… OpenStack is too enterprise, like a heavy “Java” thing where I need a just works “Python” thing…).

Anyway.

One of the issues with the OU Faculty OpenStack setup is the way the security model locks everything down. Not only is no API access available, there is also a limit on IP address allocation and open ports are limited to port 80 (and maybe port 22? Or maybe not.)

For the TM351 VM – which is what we’re looking to put onto OU OpenStack – we have been exposing services on at least two http ports, one for the Jupyter notebooks and one for OpenRefine. (The latest build also has a simple VM webserver, and I’m experimenting with a notebook search engine. Optionally, we have also allowed students to open up ports to the PostgreSQL and MongoDB services.)

If I do find a sensible way to get the VM running on OpenStack, finding a way to shove all the http services through port 80 looks like a necessary requirement. Previously, I’d noticed that @betatim’s openrefineder demo made use of a proxy to expose the OpenRefine service via the Jupyter notebook port, and looking at it again today I noticed that the nbopenrefineproxy package it was using is available as a Jupyterhub project package: jupyterhub/nbserverproxy.

In the current TM351 VM set-up, we have the following:

  • Jupyter notebook on guest port 8888, host port 35180
  • OpenRefine on guest port 3334, host port 35181

However, if I install and enable nbserverproxy, and restart the Jupyter notebook server, I can now find OpenRefine proxied as http://localhost:35180/proxy/3334/ as well as on http://localhost:35181.

One gotcha to note is that the OpenRefine page doesn’t render properly from that URL without the trailing slash because the OpenRefine HTML includes relative links to assets:

...
<link type="text/css" rel="stylesheet" href="externals/select2/select2.css" />
 <link type="text/css" rel="stylesheet" href="externals/tablesorter/theme.blue.css" />
...

which resolve as e.g. http://localhost:35180/proxy/externals/select2/select2.css (404).

However, with the trailing slash in place, the links do resolve correctly (e.g. as http://localhost:35180/proxy/3334/externals/select2/select2.css).

Handy… and the way to go if we do get this running on OpenStack.

PS if you know of a baby steps tutorial that shows how I can build a custom VM image on a Mac that I can upload to OpenStack, please let me know via the comments. Or otherwise get in touch if you can talk me through the various approaches.

Web Terminals and X11 Remote Desktop Access for Personal Virtual Applications

One of the components I’ve found useful when putting together virtual machines / Docker containers is a simple web based terminal that gives me terminal access into the machine.

This is a living post where I can keep notes on the various solutions I’ve tried…

Jupyter Notebooks

If you run a Jupyter notebook server, or JupyterLab, in a container, you get a web based terminal for free, albeit running under whatever user/privileges the Jupyter user environment is using.

Gritty

Browser based terminal; requires a node install, which is a faff, but it seems to work quite well…  Here’s a Dockerfile fragment:

#Install node.js
RUN apt-get update && apt-get install -y build-essential
RUN curl -sL https://deb.nodesource.com/setup_8.x | bash && \
apt-get install -y nodejs && apt-get clean
RUN npm i npm@latest -g

#Install web terminal - gritty
#Provides access into container via in-browser terminal
RUN npm i gritty -g --unsafe-perm

#Use the terminal as the default running service for the container
#When the container is running, we should be able to access browser terminal into it
EXPOSE 1337
CMD gritty

The --unsafe-perm thing for me was required to get it to install cleanly…

You’d probably want to run this alongside another process. See Docker docs – Run multiple services in a container for now. I need to work on my own reusable recipe to support this general sort of activity. (The supervisord route is probably the best way…? Example (untested); untested simple example – use /usr/bin/supervisord as the CMD; untested tutorial.)

Here’s another example in a simple Vagrantfile:

Vagrant.configure("2") do |config|

#------------------------- PROVIDER: VIRTUALBOX (BUILD) ------------------------------

    config.vm.provider :virtualbox do |virtualbox|

        #Stick with the default key
        config.ssh.insert_key=false

        #For local testing:
        config.vm.box = "ubuntu/xenial64"

        config.vm.hostname = "gritty-test"

        virtualbox.name = "gritty_test"
        virtualbox.memory = 1024
        virtualbox.cpus = 1
        # virtualbox.gui = true

        #Don't bother updating guest additions
        config.vbguest.auto_update = false

        #---- START PORT FORWARDING ----
        #Registered ports: https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers

        #Gritty
        config.vm.network :forwarded_port, guest: 1337, host: 8898, auto_correct: true

        #---- END PORT FORWARDING ----

     end
#------------------------- END PROVIDER: VIRTUALBOX (BUILD) ------------------------------

  config.vm.provision :shell, :inline => <<-SH

    if [ ! -f /vagrant/.firstrun_setup.done ]; then
        apt-get update && apt-get install -y build-essential
        curl -sL https://deb.nodesource.com/setup_8.x | bash && apt-get install -y nodejs && apt-get clean
        npm i npm@latest -g

        #Install web terminal - gritty
        #Provides access into container via in-browser terminal
        npm i gritty -g --unsafe-perm

        cp /vagrant/gritty.service /lib/systemd/system/gritty.service

        # Enable autostart
        systemctl enable gritty.service

        # Refresh service config
        systemctl daemon-reload

        #(Re)start service
        systemctl restart gritty.service

        touch /vagrant/.firstrun_setup.done
    fi
   SH

   #config.vm.provision "shell", inline: "gritty --port 1337 &", run: "always"

end

And here's an example gritty.service file:


[Unit]
Description=OU Gritty Terminal Demo

#When to bring the service up
#via https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
#Wait for a network stack to appear
After=network.target
#If we actually need the network to have a routable IP address:
#After=network-online.target

[Service]
#User=oustudent
Environment=GRITTYPORT=1337

#ExecStartPre=mkdir -p /vagrant/openrefine_projects

ExecStart=/usr/bin/gritty --port ${GRITTYPORT}
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Alternatives: xterm.js seems to provide the basis of several terminal servers such as ttyd; here’s an example with authentication from Duke: mccahill/xterm-js-docker; and a minimal example server: simon-engledew/dockerweb. Here’s an example of a web ssh terminal, with a more complete config file example here.

novnc

Another route to a web based terminal via the browser is to cheat – and allow access to the VM desktop through the browser. This can be done using a virtualised desktop; for example, novnc (e.g. https://github.com/psharkey/docker/tree/master/novnc), along with other containers to compose with it.

The dit4c/dockerfile-dit4c-container-x11 container is a nice example of a set of Docker container layers to support X11 exposure via a browser using novnc.

The new Docker Desktop looks like it will have some GUI tools to help assemble applications, so it’d be great if that included robust, templated containers to handle both X11, and terminal access, through a browser.

See also: Apache Guacamole; example – Guacamole and Wine – Running Small Legacy Windows Apps Via a Browser or Accessing GUI Apps (Audacity) Via a Browser from a Container Using Guacamole.

PS here’s a set-up for running under supervisord in a container.

Docker – multiple services in same container recipe

This uses a docker-compose file (run using docker-compose up) that uses a local Dockerfile, which calls on a build.sh script to create a container image that runs a couple of services as specified in the supervisord.conf file. Note the requirement to install the supervisor package.

First, the supervisord.conf file; this is responsible for starting the applications (and replaces any services, which means my build files need to separate out services and copy them over / start them up if we are not doing the docker build).

##supervisord.conf
[supervisord]
nodaemon=true

[program:gritty]
command = /usr/bin/gritty --port 80

[program:gritty2]
command = /usr/bin/gritty --port 8899

The build.sh script is responsible for installing the necessary packages into the container.

##build.sh
apt-get update && apt-get install -y build-essential curl
curl -sL https://deb.nodesource.com/setup_8.x | bash && apt-get install -y nodejs && apt-get clean
npm i npm@latest -g

#Install web terminal - gritty
#Provides access into container via in-browser terminal
npm i gritty -g --unsafe-perm

The Dockerfile orchestrates the build of the container.

##Dockerfile
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y supervisor

ADD build.sh /root/build.sh
RUN chmod +x /root/build.sh && /root/build.sh

RUN apt-get clean

COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf

CMD ["/usr/bin/supervisord"]

The docker-compose.yaml orchestrates the deployment of the container, including port mappings.

##docker-compose.yaml
version: '3'
services:
  multidemo:
    build: .
    ports:
     - "8087:80"
     - "8097:8899"

Running the container should publish terminals running as separate services on different ports (host ports 8087 and 8097).

Seven Ways of Making Use of SQLite

SQLite is a really handy file based database engine. When you put data into a database, it can make it easier to search; it also provides a great tool for looking for stories or story leads hidden in the data. So here are seven ways of getting started with SQLite.

Querying SQLite Databases Using Rich User Interfaces

Whilst SQLite comes with its own command line client, many people will find the easiest way of getting started with querying SQLite databases is to use an application, either on the desktop or accessed via a browser based UI.

(I think this project has stalled; not sure if it still works?) franchise is a browser based UI that you can access via the cloud or run locally (code). If you have a SQLite database file (make sure the suffix is .sql) you can upload it and explore it using the franchise application. (If you have a CSV or Excel data file, you can upload that too and it will add it to its own temporary SQLite database). Here’s a review: Asking Questions of CSV Data, Using SQL In the Browser, With Franchise.

In the browser, you can try the official sql.js demo GUI or SQLite Viewer.

If you prefer something on the desktop, the cross-platform sqlitebrowser [repo] might suit your needs.

Another great way of making use of SQLite is to bring it alive using datasette. A single command allows you to publish an interactive, browser based user interface to one or more databases, either on your own computer or via an online host such as Zeit Now, or Heroku. For example, I’ve popped up three databases I scraped together on Heroku and pointed my own URL at them (unfortunately, I tend to run out of free Heroku credits in the last week of the month, at which point the site goes down!)

datasette allows you to query the databases through a customisable webpage and supports a range of plugins. For example, the datasette-cluster-map will detect latitude and longitude columns in a datatable and present the results using an interactive map. I gave it a spin with this map of UK Food Hygiene Ratings. You can find several other examples of datasettes published in the wild on the datasette wiki.

Finding Data: SQLite Databases in the Wild

Whilst tools such as datasette are great for quickly getting started with querying a SQLite database, one obvious question that arises is: what SQLite database?

Once you start poking around, however, you can start to find examples of SQLite databases working quietly behind the scenes on your own computer. (Searching your computer for files with a .sqlite suffix is one way of finding them!)

As a specific example, the Safari, Chrome and Firefox web browsers all keep track of your browser history using a SQLite database on your computer (this gist – dropmeaword/browser_history.md – tells you where you can find the files). You can then simply query them using datasette. On a Mac, I can simply run:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

and I can then start to query my browser history using the datasette browser based UI. Here’s an ‘inspect your browser history’ tutorial to get you started…

Ingesting Data

Applications such as franchise allow you to upload a CSV or Excel datafile and automatically import it into a SQLite database so that it can be queried using SQLite. The datasette ecosystem also includes an application for uploading CSV files and ingesting them into a SQLite database: datasette publish.

Behind the scenes of that application is a Python command line utility called csvs-to-sqlite. A simple command lets you convert a CSV file to a SQLite database:

csvs-to-sqlite myfile.csv mydatabase.db

Whilst csvs-to-sqlite focusses on the conversion of CSV files into a SQLite database, the more general sqlitebiter command line utility can convert CSV, Excel, HTML tables (eg from a URL), JSON, Jupyter Notebooks, Markdown, TSV and Google-Sheets to a SQLite database file using a similar command format.

Using SQLite from the Commandline

Natively, SQLite comes with its own command line shell that allows you to connect to and query a SQLite database from the command line.

As well as command line tools for converting data contained in various file formats into a SQLite database, several command line tools embed that conversion within a command line interface that allows you to convert a data file to an in-memory SQLite database and query it directly.

For example, using the csvsql command from csvkit:

csvsql --query "select * from iris as i join irismeta as m on (i.species = m.species)" examples/iris.csv examples/irismeta.csv

Or textql, which works in a similar way.

Or the simply named q:

q -H "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"

Querying SQLite Databases from Programming Languages Such as Python and R

If you are developing your own data-centric reproducible research pipelines, it’s likely that you will be using a programming language such as R or the Python pandas library.

There are two main ways of using SQLite in this context. The first is to connect to the database from the programming language and then query it from within the language. For example, in R, you might use the RSQLite package. In Python, you can connect to a SQLite database using the base sqlite3 package.
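
For example, a minimal Python sketch using the standard library sqlite3 package (the mydatabase.db file and iris table are just for illustration):

import sqlite3

#Connect to a file based SQLite database (the file is created if it doesn't exist)
conn = sqlite3.connect('mydatabase.db')

#Run a query and iterate over the result rows
for row in conn.execute('SELECT species, COUNT(*) FROM iris GROUP BY species'):
    print(row)

conn.close()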

The other approach is to use SQLite as an in-memory database that provides a SQL query interface to a dataframe. In R, this can be achieved using the sqldf package:

library(sqldf)
sqldf("select * from iris limit 5")

In Python/pandas, you can use the pandasql package:

from pandasql import sqldf, load_births
pysqldf = lambda q: sqldf(q, globals())
births = load_births()
print(pysqldf("SELECT * FROM births LIMIT 10;").head())

In many respects, sqldf and pandasql behave like programming language native versions of command-line utilities such as csvsql, textql and q, although rather than importing a data file into a SQLite database so that it can be queried, they import the contents of the referenced dataframe.

pandas also provides native support for adding dataframes as tables to a connected SQLite database, as well as reading the results of queries on the database back into a dataframe. Once you start getting into the swing of putting data into a database, and then running JOINed queries over multiple tables, you’ll start to wonder why you spent so much time getting hassled by VLOOKUP. As an example, here’s a way of Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes.
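
A minimal sketch of that round trip in pandas (the database, table and column names here are made up for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect('geocodes.db')

#Add a dataframe to the database as a table
codes = pd.DataFrame({'code': ['E92000001', 'W92000004'],
                      'name': ['England', 'Wales']})
codes.to_sql('geo_codes', conn, if_exists='replace', index=False)

#Read the results of a query on the database back into a dataframe
df = pd.read_sql_query('SELECT * FROM geo_codes WHERE code LIKE "E%"', conn)

conn.close()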

Using SQLite in Your Own Applications

If you are comfortable with using programming code to manipulate your data, then you may want to explore ways of using SQLite to create your own data driven applications.

One way to start is to use SQLite completely within the browser. Accessing desktop applications from a webpage is typically a no-no because of browser security restrictions, but SQLite is quite a light application, so it can – and has been – compiled to Javascript so that it can be imported as a Javascript library and run from within a webpage: sql.js. The aforementioned sql.js GUI and SQLite Viewer both use sql.js under the hood…

As well as running SQLite in a browser, SQLite can also be used to power an API. One of the advantages of running a datasette service is that it also exposes a datasette API. This means you can publish your datasette to a web host and other computers can then query it programmatically.
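
For example, datasette publishes JSON versions of its pages, so results can be pulled back into Python with something like the following (the host, database and table names are hypothetical):

import requests

#JSON endpoint for a (hypothetical) published datasette database
url = 'https://example-datasette.herokuapp.com/mydb.json'

#Pass an arbitrary (read only) SQL query in as a parameter and get JSON back
params = {'sql': 'SELECT COUNT(*) AS n FROM mytable'}
print(requests.get(url, params=params).json())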

If you are working in a Python Jupyter environment, it’s simple enough to use the Jupyter kernel gateway to create your own APIs. Here’s an example of building a service to allow the lookup of ONS codes from a simple SQLite database: Building a JSON API Using Jupyter Notebooks in Under 5 Minutes.
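
The trick there is that the kernel gateway’s notebook-http mode turns annotated code cells into API endpoints; a cell along the following lines would serve a simple lookup (a rough sketch, assuming a sqlite3 connection named conn, and a geo_codes table, have been set up in an earlier cell):

# GET /codes
#The kernel gateway passes request details in as a JSON string named REQUEST
import json

req = json.loads(REQUEST)  #query/path parameters, if any, are available via req

rows = conn.execute('SELECT code, name FROM geo_codes LIMIT 10').fetchall()

#Whatever the cell prints is returned as the HTTP response body
print(json.dumps(rows))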

Another way of using SQLite databases in a Jupyter environment is to use Scripted Forms. For example, here’s one of my own recipes for Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter that shows how to create a simple interactive form for querying a SQLite database containing descriptions of images used in OpenLearn courses.

SQLite Database Admin Tools

As well as providing a simple explorer and query interface, the sqlitebrowser tool also supports a range of other SQLite database administration functions such as the ability to create, define, modify and delete tables and indexes, or edit, add and delete individual records.

The browser based sqlite-web application provides a similar range of utilities via a browser based, rather than desktop client, UI.

Summary

SQLite is a lightweight, in-memory and file based database that allows you to run SQL queries over a wide range of tabular datasets. If you work with data, knowing how to write even simple SQL queries can add a powerful tool to your toolbox. SQLite, and the associated tools created around it, is almost certainly one of the easiest ways in to using this most versatile, portable, and personal data power tool.

PS Here are some more ways of using SQLite.

Jigsaw Pieces – Linux Service Indicators, Jupyter Kernel Monitoring and Environment Management

Something I’ve been pondering for some time is how to set up some simple Linux service monitoring so that I can display an indicator light in a web page to show whether a Linux service is running or not.

For example, in the TM351 VM, it could be handy to display some indicator lights in a Jupyter notebook status bar showing whether the database services we connect to from the notebooks are running correctly.

So here are some pieces that may contribute to that:

My thinking is:

  • use monit to monitor a process; if the process is down, write to a service status file in my www server directory, eg service_servicename_status.txt. If a service is running the contents of this file are 1, otherwise 0;
  • use the JQuery fragment to poll the status file every few seconds;
  • if the status file returns 0, display a red indicator, otherwise green.

Here are some other monitoring / environment managing fragments I’m pondering:

  • something like ps_mem, a Python utility *to accurately report the in core memory usage for a program*. I’m wondering if I could use that to track how much memory each Jupyter notebook python kernel is taking up (or maybe monit can do that?) There’s an old extension that looks like it shows reports: nbtop. Or perhaps use psutil (via this issue, which seems to offer a solution? see the sketch after this list);
  • a minimal example of setting up notebook homepage tab for a hello world webpage; Writing a notebook server extension looks like it has the ingredients, and nb_conda provides a fuller working example. Actually, that extension looks useful for *Jupyter-as-a-learning-environment* because it lets you select different conda environments, which could be handy for running different activities.
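
By way of a first doodle on the kernel monitoring front, something like the following psutil fragment should report the memory footprint of any running ipykernel processes (untested against a real multi-user server):

#Report resident memory usage of running Jupyter kernel processes
import psutil

for p in psutil.process_iter():
    try:
        if 'ipykernel' in ' '.join(p.cmdline()):
            print(p.pid, round(p.memory_info().rss / 1e6, 1), 'MB')
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        #Processes may disappear, or be unreadable, while we iterate
        pass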

Any other examples out there of Jupyter monitoring / environment management?

Interactive Authoring Environments for Reproducible Media: Stencila

One of the problems associated with keeping up with tech is that a lot of things that “make sense” are not the result of the introduction or availability of a new tool or application in and of itself, but of the way that it might make a new combination of tools possible that supports a complete end to end workflow or that can be used to reengineer (a large part of) an existing workflow.

In the OU, it’s probably fair to say that the document workflow associated with creating course materials has its issues. I’m still keen to explore how a Jupyter notebook or Rmd workflow would work, particularly if the authored documents included recipes for embedded media objects such as diagrams, items retrieved from a third party API, or rendered from a source representation or recipe.

One “obvious” problem is that the Jupyter notebook or RStudio Rmd editor is “too hard” to work with (that is, it’s not Word).

A few days ago I saw a tweet mentioning the use of Stencila with Binderhub. Stencila? Apparently, “[a]n open source office suite for reproducible research”. From the blurb:

[T]oday’s tools for reproducible research can be intimidating – especially if you’re not a coder. Stencila make reproducible research more accessible with the intuitive word processor and spreadsheet interfaces that you and your colleagues are already used to.

That sounds appropriate… It’s available as a desktop app, but courtesy of minrk/jupyter-dar (I think?), it runs on binderhub and can be accessed via a browser too:

 

You can try it here.

As with Jupyter notebooks, you can edit and run code cells, as well as authoring text. But the UI is smoother than in Jupyter notebooks.

(This is one of the things I don’t understand about colleagues’ attitude towards emerging tech projects: they look at today’s UX and think that’s it, because that’s how it is inside an organisation – you take what you’re given and it stays the same for decades. In a living project, stuff tends to get better if it’s being used and there are issues with it…)

The Jupyter-Dar strapline pitches “Jupyter + DAR compatibility exploration for running Stencila on binder”. Hmm. DAR? That’s also new to me:

Dar stands for (Reproducible) Document Archive and specifies a virtual file format that holds multiple digital documents, complete with images and other assets. A Dar consists of a manifest file (manifest.xml) that describes the contents.

Dar is being designed for storing reproducible research publications, but the underlying concepts are suitable for any kind of digital publications that can be bundled together with their assets.

Repo: [substance/dar](https://github.com/substance/dar)

Sounds interesting. And which reminds me: how’s OpenCreate coming along, I wonder? (My permissions appear to have been revoked again; or the URL has changed.)

PS seems like there’s more activity in the “pure web” notebook application world. Hot on the heels of Mike Bostock’s Observable notebooks (rationale) comes iodide, “[a] frictionless portable notebook-style interface for literate scientific computing in the browser” (examples).

I don’t know if these things just require you to use Javascript, or whether they can also embed things like Brython.

I’m not sure I fully get the js/browser notebooks yet? I like the richer extensibility of things like Jupyter in terms of arbitrary language/kernel availability, though I suppose the web notebooks might be able to hook into other kernels using similar mechanics to those used by things like Thebelab?

I guess one advantage is that you can do stuff on a Chromebook, and without a network connection if you cache all the required JS packages locally? Although with new ChromeOS offering support for Linux – and hence, Docker containers – natively, Chromebooks could get a whole lot more exciting over the next few months. From what I can tell, crosvm looks like a ChromeOS native equivalent to something like Virtualbox (with an equivalent of Guest Additions?). It’ll be interesting to see how well things like audio work. Reports suggest that graphical UIs will work, presumably using some sort of native X11 support rather than noVNC, so now could be a good time to start looking out for a souped up Pixelbook…

PS March 2019 – Stencila desktop appears to have stalled for some time. As it’s built on the Texture wordprocessor / editor, it may end up as a plugin for that…

PPS June 2021 – have things rebooted again for Stencila? https://elifesciences.org/labs/a04d2b80/announcing-the-next-phase-of-executable-research-articles

Generative Assessment Creation

It’s coming round to that time of year where we have to create the assessment material for courses with an October start date. In many cases, we reuse question forms from previous presentations but change the specific details. If a question is suitably defined, then large parts of this process could be automated.

In the OU, automated question / answer option randomisation is used to provide iCMAs (interactive computer marked assessments) via the student VLE using OpenMark. As well as purely text based questions, questions can include tables or images as part of the question.

One way of supporting such question types is to manually create a set of answer options, perhaps with linked media assets, and then allow randomisation of them.

Another way is to define the question in a generative way so that the correct and incorrect answers are automatically generated. (This seems to be one of those use cases for why ‘everyone should learn to code’;-)

Pinching screenshots from an (old?) OpenMark tutorial, we can see how a dynamically generated question might be defined. For example, create a set of variables:

and then generate a templated question, and student feedback generator, around them:

Packages also exist for creating generative questions/answers more generally. For example, the R exams package allows you to define question/answer templates in Rmd and then generate questions and solutions in a variety of output document formats.


You can also write templates that include the creation of graphical assets such as charts:

 

Via my feeds over the weekend, I noticed that this package now also supports the creation of more general diagrams from a TikZ diagram template. For example, logic diagrams:

Or automata diagrams:

(You can see more exam templates here: www.r-exams.org/templates.)

As I’m still on a “we can do everything in Jupyter” kick, one of the things I’ve explored is various IPython/notebook magics that support diagram creation. At the moment, these are just generic magics that allow you to write TikZ diagrams, for example, that make use of various TikZ packages:

On the to do list is to create some example magics that template different question types.

I’m not sure if OpenCreate is following a similar model? (I seem to have lost access permissions again…)

FWIW, I’ve also started looking at my show’n’tell notebooks again, trying to get them working in Azure notebooks. (OU staff should be able to log in to notebooks.azure.com using OUCU@open.ac.uk credentials.) For the moment, I’m depositing them at https://notebooks.azure.com/OUsefulInfo/libraries/gettingstarted, although some tidying may happen at some point. There are also several more basic demo notebooks I need to put together (e.g. on creating charts and using interactive widgets, digital humanities demos, R demos and (if they work!) polyglot R and Python notebook demos, etc.). To use the notebooks interactively, log in and clone the library into your own user space.

Generating Printable MS Word Versions of Merged Jupyter Notebooks

One of the issues we know students have with the Jupyter notebooks that we provide as part of the course is that there is no straightforward way of printing them all out for offscreen reading / annotation. (As well as code, there is a certain amount of practical and code related explanatory material in the notebooks.)

One of the things I started to doodle with last year was a simple script to merge several notebooks and then render the result as a Microsoft Word doc. This has a dependency on pandoc, though not LaTeX, and requires that the conversion takes place via HTML: ipynb is converted to HTML using nbconvert, then from HTML to docx. If there are image files transcluded into the notebook, this also means that the pandoc conversion process needs to be executed in the same directory as the notebook so that the image paths are correctly recognised. (When running nbconvert with the html_embed output, pandoc fell over.)

Having to run pandoc in a local, image path respecting directory is a pain because it means I can’t run it over a merged notebook file composed of notebooks from multiple directories. Which means that I have to generate a separate docx file for the notebooks in each separate directory. Whilst I could move these into the same directory to make accessing them all a bit easier, it still means students have to print out multiple documents. I did try using a python package to merge the Word docs, but it borked on the images.

There are Python packages that can merge PDF documents in a more reliable way, but I am having issues with getting a sensible PDF workflow together. In the first case, for pandoc to render documents to PDF seems to require the texlive-xetex package, which adds considerable weight to the VM (and I don’t know the dependency voodoo required to get a minimum viable LaTeX distribution in place). In the second, my test notebooks included a pymarkdown inline element that embedded a pandas dataframe in a markdown cell and this seemed to break the pandoc PDF conversion at that point.
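
(For what it’s worth, the merge step itself looks straightforward enough with PyPDF2, if the per-directory PDFs could be generated reliably in the first place; the filenames below are placeholders:)

#Merge a set of PDF documents into a single file using PyPDF2
from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
for fn in ['part1.pdf', 'part2.pdf']:
    merger.append(fn)

merger.write('merged_notebooks.pdf')
merger.close()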

One thing I haven’t done yet is look at customising the output templates so that we can brand the exported documents. For this, I need to look at custom templates.

My initial sketch code for the ‘export merged notebooks in a directory as docx’ routine is available via this gist. One thing I need to do is wrap it in a simple CLI command. Comments / suggestions for improvement, or links to better alternatives, more than welcome!


#https://stackoverflow.com/a/3207973/454773
from nbformat.v4 import new_notebook, new_markdown_cell
import nbformat

import io
import os
import subprocess
import random
import string

#from PyPDF2 import PdfFileMerger, PdfFileReader

def merged_notebooks_in_dir(dirpath, filenames):
    ''' Merge all notebooks in a directory into a single notebook '''
    fns = ['{}/{}'.format(dirpath, fn) for fn in filenames
           if '.ipynb_checkpoints' not in dirpath and fn.endswith('.ipynb')]
    if fns:
        merged = new_notebook()
        #Identify directory containing merged notebooks
        cell = '\n\n\n\n# {}\n\n\n\n'.format(dirpath)
        merged.cells.append(new_markdown_cell(cell))
    else:
        return

    for fn in fns:
        #print(fn)
        with io.open(fn, 'r', encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
        #Identify filename of notebook
        cell = '\n\n\n\n# {}\n\n\n\n'.format(fn)
        merged.cells.append(new_markdown_cell(cell))
        merged.cells.extend(nb.cells)

    if not hasattr(merged.metadata, 'name'):
        merged.metadata.name = ''
    merged.metadata.name += "_merged"

    return nbformat.writes(merged)

def merged_notebooks_down_path(path, typ='docx', execute=False):
    ''' Walk a path, creating an output file in each directory that merges all notebooks in the directory '''
    for (dirpath, dirnames, filenames) in os.walk(path):
        if '.ipynb_checkpoints' in dirpath:
            continue

        #Should we run the execute processor here on each notebook separately,
        # ensuring that images are embedded, and then merge the executed notebook files?
        merged_nb = merged_notebooks_in_dir(dirpath, filenames)
        if not merged_nb:
            continue

        #Write the merged notebook out under a random temporary filename
        fn = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
        with open('{}/{}.ipynbx'.format(dirpath, fn), 'w') as f:
            f.write(merged_nb)

        # Execute the merged notebook in its directory so that images are correctly handled
        # Using html_embed seems to cause pandoc to fall over?
        # The pdf conversion requires installation of texlive-xetex and inkscape
        # This adds significant weight to the VM: maybe we need an MT/production VM and a student build?
        # Inline code execution generated using python-markdown extension seems to break PDF generation
        # at the first instance of inline code? Need to add a preprocessor?
        # We could maybe process the notebook inline rather than via the command line
        # In such a case, the following may be a useful reference:
        #https://github.com/ipython-contrib/jupyter_contrib_nbextensions/blob/master/docs/source/exporting.rst
        execute = ' --ExecutePreprocessor.timeout=600 --ExecutePreprocessor.allow_errors=True --execute' if execute else ''

        if typ == 'pdf':
            cmd = 'jupyter nbconvert --to pdf {exe} "{fn}".ipynbx'.format(exe=execute, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
        elif typ in ['docx']:
            #Render to HTML first, then convert the HTML to docx using pandoc
            cmd = 'jupyter nbconvert --to html {exe} "{fn}".ipynbx'.format(exe=execute, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            cmd = 'pandoc -s "{fn_out}".html -o _merged_notebooks.{typ}'.format(fn_out=fn, typ=typ)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            os.remove("{}/{}.html".format(dirpath, fn))

        os.remove("{}/{}.ipynbx".format(dirpath, fn))
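
As a starting point for the CLI wrapper mentioned above, a simple argparse shim over the routine would probably be enough (an untested sketch; the merge_notebooks module name is hypothetical):

##merge_notebooks_cli.py
import argparse

#Assumes the gist code above is importable as merge_notebooks
from merge_notebooks import merged_notebooks_down_path

parser = argparse.ArgumentParser(description='Merge and export the notebooks in each directory under a path')
parser.add_argument('path', help='directory to walk for notebooks')
parser.add_argument('--format', default='docx', choices=['docx', 'pdf'])
parser.add_argument('--execute', action='store_true', help='execute notebooks before export')
args = parser.parse_args()

merged_notebooks_down_path(args.path, typ=args.format, execute=args.execute)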

Seeding Shared Folders With Files Distributed via a VM

For the first few presentations of our Data Management and Analysis course, the course VM has been distributed to students via a USB mailing. This year, I’m trying to move to a model whereby the primary distribution is via a download from VagrantCloud (students manage the VM using Vagrant), though we’re also hoping to be able to offer access to an OU OpenStack hosted VM to any students who really need it.

For students on Microsoft Windows computers, an installer installs Virtualbox and vagrant from installers distributed via the USB memory stick. This in part derives from the policy of fixing versions of as much as we can so that it can be tested in advance. The installer also creates a working directory for the course that will be shared by the VM, and copies required files, again from the memory stick, into the shared folder. On Macs and Linux, students have to do this setup themselves.

One of the things I have consciously started trying to do is move the responsibility for satisficing some of the installation requirements into the Vagrantfile. (I’m also starting to think they should be pushed even deeper into the VM itself.)

For example, as some of the VM services expect particular directories to exist in the shared directory, we have a couple of defensive measures in place:

  • the Vagrantfile creates any required, yet missing, subdirectories in the shared directory;
            #Make sure that any required directories are created
            config.vm.provision :shell, :inline => <<-SH
                mkdir -p /vagrant/notebooks
                mkdir -p /vagrant/openrefine_projects
                mkdir -p /vagrant/logs
                mkdir -p /vagrant/data
                mkdir -p /vagrant/utilities
                mkdir -p /vagrant/backups
                mkdir -p /vagrant/backups/postgres-backup/
                mkdir -p /vagrant/backups/mongo-backup/	
            SH
    

  • start up scripts for services that require particular directories check they exist before they are started and create them if they are missing. For example, in the service file, go defensive with something like ExecStartPre=mkdir -p /vagrant/notebooks.

The teaching material associated with (the contents of) the VM is distributed using a set of notebooks downloaded from the VLE. Part of the reason for this is that it delays the point at which the course notebooks must be frozen: the USB is mastered late July/early August for a mailing in September and course start in October.

As well as the course notebooks, there are a couple of informal installation test notebooks. These could be frozen along with the VM and distributed inside it, but the question then arises of how to get them from inside the VM into the shared folder. So this year I am trying out a simple pattern that bakes test files into the VM and then uses the Vagrantfile to copy the files into the shared directory on its first run with a particular shared folder:

config.vm.provision :shell, :inline => <<-SH
    if [ ! -f /vagrant/.firstrun_nbcopy.done ]; then
        # Trust notebooks in immediate child directories of notebook directory
        files=(`find /opt/notebooks/* -maxdepth 2 -name "*.ipynb"`)
        if [ ${#files[@]} -gt 0 ]; then
            jupyter trust /opt/notebooks/*.ipynb;
            jupyter trust /opt/notebooks/*/*.ipynb;
        fi
        #Copy notebooks into shared directory
        cp -r /opt/notebooks/. /vagrant/notebooks
        touch /vagrant/.firstrun_nbcopy.done
    fi
   SH

This pattern allows files shipped inside the VM to be copied into the shared folder once it is mounted into the VM from the host. The files will then persist inside the shared directory, along with a hidden flag file to say the files have been copied. I’m not sure about the benefits of auto-running something inside the VM to manage this copying? Or whether to check that a more recent copy of the files to be copied doesn’t already exist in the shared folder before copying on the first run in the folder?