Category: IoC

Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy

Over the last couple of weeks I’ve been circling around OpenStack as a platform for making self-serve OU hosted VMs available to students, but failing to make much actual progress on using it. (I’m increasingly starting to think this is not sensible, but I’m struggling to find someone I can chat to about it…).

Anyway.

One of the issues with the OU Faculty OpenStack setup is the way the security model locks everything down. Not only is no API access available, there is also a limit on IP address allocation and open ports are limited to port 80 (and maybe port 22? Or maybe not.)

For the TM351 VM – which is what we’re looking to put onto OU OpenStack – we have been exposing services on at least two http ports, one for the Jupyter notebooks and one for OpenRefine. (The latest build also has a simple VM webserver, and I’m experimenting with a notebook search engine. Optionally, we have also allowed students to open up ports to the PostgreSQL and MongoDB services.)

If I do find a sensible way to get the VM running on OpenStack, finding a way to shove all the http services through port 80 looks like a necessary requirement. Previously, I’d noticed that @betatim’s openrefineder demo made use of a proxy to expose the OpenRefine service via the Jupyter notebook port, and looking at it again today I noticed that the nbopenrefineproxy package it was using is available as a Jupyterhub project package: jupyterhub/nbserverproxy.

In the current TM351 VM set-up, we have the following:

  • Jupyter notebook on guest port 8888, host port 35180
  • OpenRefine on guest port 3334, host port 35181

However, if I install and enable nbserverproxy, and restart the Jupyter notebook server, I can now find OpenRefine proxied as http://localhost:35180/proxy/3334/ as well as on http://localhost:35181.
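For reference, installing and enabling the extension should just be a couple of commands (this is my recollection of the project README, so check there for the latest):

pip install nbserverproxy
jupyter serverextension enable --py nbserverproxy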

One gotcha to note is that the OpenRefine page doesn’t render properly from that URL without the trailing slash because the OpenRefine HTML includes relative links to assets:

...
<link type="text/css" rel="stylesheet" href="externals/select2/select2.css" />
 <link type="text/css" rel="stylesheet" href="externals/tablesorter/theme.blue.css" />
...

which resolve as e.g. http://localhost:35180/proxy/externals/select2/select2.css (404).

With the trailing slash, however, the links do resolve correctly (e.g. as http://localhost:35180/proxy/3334/externals/select2/select2.css).

Handy… and the way to go if we do get this running on OpenStack.

PS if you know of a baby steps tutorial that shows how I can build a custom VM image on a Mac that I can upload to OpenStack, please let me know via the comments. Or otherwise get in touch if you can talk me through the various approaches.

Web Terminals and X11 Remote Desktop Access for Personal Virtual Applications

One of the components I’ve found useful when putting together virtual machines / Docker containers is a simple web based terminal that gives me terminal access into the machine.

This is a living post where I can keep notes on the various solutions I’ve tried…

Jupyter Notebooks

If you run a Jupyter notebook server, or JupyterLab, in a container, you get a web based terminal for free, albeit running under whatever user/privileges the Jupyter user environment is using.

Gritty

Browser based terminal; requires a node install, which is a faff, but it seems to work quite well…  Here’s a Dockerfile fragment:

#Install node.js
RUN apt-get update && apt-get install -y build-essential curl
RUN curl -sL https://deb.nodesource.com/setup_8.x | bash && \
apt-get install -y nodejs && apt-get clean
RUN npm i npm@latest -g

#Install web terminal - gritty
#Provides access into container via in-browser terminal
RUN npm i gritty -g --unsafe-perm

#Use the terminal as the default running service for the container
#When the container is running, we should be able to access browser terminal into it
EXPOSE 1337
CMD gritty

The --unsafe-perm flag was required for me to get it to install cleanly…

You’d probably want to run this alongside another process. See Docker docs – Run multiple services in a container for now. I need to work on my own reusable recipe to support this general sort of activity. (The supervisord route is probably the best way..? Example (untested), an untested simple example – use /usr/bin/supervisord as the CMD – and an untested tutorial.)

Here’s another example in a simple Vagrantfile:

Vagrant.configure("2") do |config|

#------------------------- PROVIDER: VIRTUALBOX (BUILD) ------------------------------

    config.vm.provider :virtualbox do |virtualbox|

        #Stick with the default key
        config.ssh.insert_key=false

        #For local testing:
        config.vm.box = "ubuntu/xenial64"

        config.vm.hostname = "gritty-test"

        virtualbox.name = "gritty_test"
        virtualbox.memory = 1024
        virtualbox.cpus = 1
        # virtualbox.gui = true

        #Don't bother updating guest additions
        config.vbguest.auto_update = false

        #---- START PORT FORWARDING ----
        #Registered ports: https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers

        #Gritty
        config.vm.network :forwarded_port, guest: 1337, host: 8898, auto_correct: true

        #---- END PORT FORWARDING ----

     end
#------------------------- END PROVIDER: VIRTUALBOX (BUILD) ------------------------------

  config.vm.provision :shell, :inline => <<-SH

    if [ ! -f /vagrant/.firstrun_setup.done ]; then
        apt-get update && apt-get install -y build-essential curl
        curl -sL https://deb.nodesource.com/setup_8.x | bash && apt-get install -y nodejs && apt-get clean
        npm i npm@latest -g

        #Install web terminal - gritty
        #Provides access into container via in-browser terminal
        npm i gritty -g --unsafe-perm

        cp /vagrant/gritty.service /lib/systemd/system/gritty.service

    	# Enable autostart
	systemctl enable gritty.service

	# Refresh service config
	systemctl daemon-reload

	#(Re)start service
	systemctl restart gritty.service

        touch /vagrant/.firstrun_setup.done
    fi
   SH

   #config.vm.provision "shell", inline: "gritty --port 1337 &", run: "always"

end

And here's an example gritty.service file:


[Unit]
Description=OU Gritty Terminal Demo

#When to bring the service up
#via https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
#Wait for a network stack to appear
After=network.target
#If we actually need the network to have a routable IP address:
#After=network-online.target

[Service]
#User=oustudent
Environment=GRITTYPORT=1337

#ExecStartPre=mkdir -p /vagrant/openrefine_projects

ExecStart=/usr/bin/gritty --port ${GRITTYPORT}
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Alternatives: xterm.js seems to provide the basis of several terminal servers such as ttyd; here’s an example with authentication from Duke: mccahill/xterm-js-docker; and a minimal example server: simon-engledew/dockerweb. Here’s an example of a web ssh terminal, with a more complete config file example here.

novnc

Another route to a web based terminal via the browser is to cheat – and allow access to the VM desktop through the browser. This can be done using a virtualised desktop; for example, novnc (e.g. https://github.com/psharkey/docker/tree/master/novnc), with other containers composed alongside it.

The dit4c/dockerfile-dit4c-container-x11 container is a nice example of a set of Docker container layers to support X11 exposure via a browser using novnc.

The new Docker Desktop looks like it will have some GUI tools to help assemble applications, so it’d be great if that included robust, templated containers to handle both X11 and terminal access through a browser.

See also: Apache Guacamole; example – Guacamole and Wine – Running Small Legacy Windows Apps Via a Browser or Accessing GUI Apps (Audacity) Via a Browser from a Container Using Guacamole.

PS here’s a set-up for running under supervisord in a container.

Docker – multiple services in same container recipe

This uses a docker-compose file (run using docker-compose up) that uses a local Dockerfile, which calls on a build.sh script to create a container image that runs a couple of services as specified in the supervisord.conf file. Note the requirement to install the supervisor package.

First, the supervisord.conf script; this is responsible for starting the running applications (and replaces any services, which means my build files need to separate out services and copy them over / start them up if we are not doing the Docker build).

##supervisord.conf
[supervisord]
nodaemon=true

[program:gritty]
command = /usr/bin/gritty --port 80

[program:gritty2]
command = /usr/bin/gritty --port 8899

The build.sh script is responsible for installing the necessary packages into the container.

##build.sh
apt-get update && apt-get install -y build-essential curl
curl -sL https://deb.nodesource.com/setup_8.x | bash && apt-get install -y nodejs && apt-get clean
npm i npm@latest -g

#Install web terminal - gritty
#Provides access into container via in-browser terminal
npm i gritty -g --unsafe-perm

The Dockerfile orchestrates the build of the container.

##Dockerfile
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y supervisor

ADD build.sh /root/build.sh
RUN chmod +x /root/build.sh && /root/build.sh

RUN apt-get clean

COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf

CMD ["/usr/bin/supervisord"]

The docker-compose.yaml orchestrates the deployment of the container, including port mappings.

##docker-compose.yaml
version: '3'
services:
  multidemo:
    build: .
    ports:
     - "8087:80"
     - "8097:8899"

Running the container should publish terminals running as separate services on different ports (host ports 8087 and 8097).
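To try it out, from the directory containing those four files, something like the following should build and start the container, after which the terminals should be available on http://localhost:8087 and http://localhost:8097:

docker-compose up --build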

Seven Ways of Making Use of SQLite

SQLite is a really handy file based database engine. When you put data into a database, it can make it easier to search; it also provides a great tool for looking for stories or story leads hidden in the data. So here are seven ways of getting started with SQLite.

Querying SQLite Databases Using Rich User Interfaces

Whilst SQLite comes with its own command line client, many people will find the easiest way of getting started with querying SQLite databases is to use an application, either on the desktop or accessed via a browser based UI.

franchise is a browser based UI that you can access via the cloud or run locally (https://github.com/hvf/franchise). If you have a SQLite database file (make sure the suffix is .sql) you can upload it and explore it using the franchise application. (If you have a CSV or Excel data file, you can upload that too and it will add it to its own temporary SQLite database.) Here’s a review: Asking Questions of CSV Data, Using SQL In the Browser, With Franchise.

If you prefer something on the desktop, the cross-platform sqlitebrowser might suit your needs.

Another great way of making use of SQLite is to bring it alive using datasette. A single command allows you to publish an interactive, browser based user interface to one or more databases, either on your own computer or via an online host such as Zeit Now, or Heroku. For example, I’ve popped up three databases I scraped together on Heroku and pointed my own URL at them (unfortunately, I tend to run out of free Heroku credits in the last week of the month, at which point the site goes down!)
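For reference, the basic commands are along these lines (the database filename is a placeholder):

#Run a local datasette server over one or more SQLite database files
datasette mydatabase.db

#Publish the database to an online host
datasette publish heroku mydatabase.db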

datasette allows you to query the databases through a customisable webpage and supports a range of plugins. For example, the datasette-cluster-map will detect latitude and longitude columns in a datatable and present the results using an interactive map. I gave it a spin with this map of UK Food Hygiene Ratings. You can find several other examples of datasettes published in the wild on the datasette wiki.

Finding Data: SQLite Databases in the Wild

Whilst tools such as datasette are great for quickly getting started with querying a SQLite database, one obvious question that arises is: what SQLite database?

Once you start poking around, however, you can start to find examples of SQLite databases working quietly behind the scenes on your own computer. (Searching your computer for files with a .sqlite suffix is one way of finding them!)

As a specific example, the Safari, Chrome and Firefox web browsers all keep track of your browser history using a SQLite database on your computer (this gist – dropmeaword/browser_history.md – tells you where you can find the files). You can then simply query them using datasette. On a Mac, I can simply run:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

and I can then start to query my browser history using the datasette browser based UI. Here’s an ‘inspect your browser history’ tutorial to get you started…

Ingesting Data

Applications such as franchise allow you to upload a CSV or Excel datafile and automatically import it into a SQLite database so that it can be queried using SQLite. The datasette ecosystem also includes an application for uploading CSV files and ingesting them into a SQLite database: datasette publish.

Behind the scenes of that application is a Python command line utility called csvs-to-sqlite. A simple command lets you convert a CSV file to a SQLite database:

csvs-to-sqlite myfile.csv mydatabase.db

Whilst csvs-to-sqlite focusses on the conversion of CSV files into a SQLite database, the more general sqlitebiter command line utility can convert CSV, Excel, HTML tables (eg from a URL), JSON, Jupyter Notebooks, Markdown, TSV and Google-Sheets to a SQLite database file using a similar command format.
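For example, I think the sqlitebiter equivalent of the above is something like the following (filenames are placeholders):

sqlitebiter -o mydatabase.db file myfile.csv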

Using SQLite from the Commandline

Natively, SQLite comes with its own command line shell that allows you to connect to and query a SQLite database from the command line.

As well as command line tools for converting data contained in various file formats into a SQLite database, several command line tools embed that conversion within a command line interface that allows you convert a data file to an in-memory SQLite database and query it directly.

For example, using the csvsql command from csvkit:

csvsql --query "select * from iris as i join irismeta as m on (i.species = m.species)" examples/iris.csv examples/irismeta.csv

Or textql:
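For example, a minimal invocation over a CSV file with a header row looks something like this (I’m going from memory on the flag names, so check the textql docs):

textql -header -sql "select species, count(*) from iris group by species" examples/iris.csv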

Or the simply named q:

q -H "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"

Querying SQLite Databases from Programming Languages Such as Python and R

If you are developing your own data-centric reproducible research pipelines, it’s likely that you will be using a programming language such as R or the Python pandas library.

There are two main ways for using SQLite in this context. The first is to connect to the database from the programming language and then query it from within the language. For example, in R, you might use the RSQLite package. In Python, you can connect to a SQLite database using the base sqlite3 package.
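For example, a minimal Python fragment along the following lines (database and table names are placeholders) runs a query against a SQLite database file:

import sqlite3

#Connect to (or create) a SQLite database file
conn = sqlite3.connect('mydatabase.db')

#Run a query and fetch the results back as a list of tuples
rows = conn.execute('SELECT * FROM mytable LIMIT 5').fetchall()
print(rows)

conn.close()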

The other approach is to use SQLite as an in-memory database that provides a SQL query interface to a dataframe. In R, this can be achieved using the sqldf package:

library(sqldf)
sqldf("select * from iris limit 5")

In Python/pandas, you can use the pandasql package:

from pandasql import sqldf, load_births
pysqldf = lambda q: sqldf(q, globals())
births = load_births()
print(pysqldf("SELECT * FROM births LIMIT 10;").head())

In many respects, sqldf and pandasql behave like programming language native versions of command-line utilities such as csvsql, textql and q, although rather than importing a data file into a SQLite database so that it can be queried, they import the contents of the referenced dataframe.

pandas also provides native support for adding dataframes as tables to a connected SQLite database, as well as reading the results of queries run against the database back into a dataframe. Once you start getting into the swing of putting data into a database, and then running JOINed queries over multiple tables, you’ll start to wonder why you spent so much time getting hassled by VLOOKUP. As an example, here’s a way of Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes.
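The pattern looks something like the following (the file, table and data are purely illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect('lookup.db')

#Add a dataframe to the database as a table
df = pd.DataFrame({'code': ['E92000001'], 'name': ['England']})
df.to_sql('geocodes', conn, if_exists='replace', index=False)

#Read the results of a query run against the database back into a dataframe
print(pd.read_sql_query('SELECT * FROM geocodes', conn))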

Using SQLite in Your Own Applications

If you are comfortable with using programming code to manipulate your data, then you may want to explore ways of using SQLite to create your own data driven applications.

One way to start is to use SQLite completely within the browser. Accessing desktop applications from a webpage is typically a no-no because of browser security restrictions, but SQLite is quite a light application, so it can – and has been – compiled to Javascript so that it can be imported as a Javascript library and run from within a webpage: sql.js. You can see an example of how it can be used to provide a simple, SQLite powered data explorer running solely within a browser here: official demo or SQLite Viewer.

As well as running SQLite in a browser, SQLite can also be used to power an API. One of the advantages of running a datasette service is that it also exposes a datasette API. This means you can publish your datasette to a web host and other computers can then query it programmatically.
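For example, assuming arbitrary SQL queries are enabled on the instance (which I believe is the default), something like the following should pull query results back as JSON (the hostname, database and table names are placeholders):

curl "https://mydatasette.example.com/mydatabase.json?sql=select+*+from+mytable+limit+5"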

If you are working in a Python Jupyter environment, it’s simple enough to use the Jupyter kernel gateway to create your own APIs. Here’s an example of building a service to allow the lookup of ONS codes from a simple SQLite database: Building a JSON API Using Jupyter Notebooks in Under 5 Minutes.

Another way of using SQLite databases in a Jupyter environment is to use Scripted Forms to provide a simple interactive UI over them. For example, here’s one of my own recipes for Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter that shows how to create a simple interactive form for querying a SQLite database containing descriptions of images used in OpenLearn courses.

SQLite Database Admin Tools

As well as providing a simple explorer and query interface, the sqlitebrowser tool also supports a range of other SQLite database administration functions such as the ability to create, define, modify and delete tables and indexes, or edit, add and delete individual records.

The browser based sqlite-web application provides a similar range of utilities via a browser based, rather than desktop client, UI.

Summary

SQLite is a lightweight, in-memory and file based database that allows you to run SQL queries over a wide range of tabular datasets. If you work with data, knowing how to write even simple SQL queries can add a powerful tool to your toolbox. SQLite, and the associated tools created around it, is almost certainly one of the easiest ways in to using this most versatile, portable, and personal data power tool.

PS Here are some more ways of using SQLite:

Jigsaw Pieces – Linux Service Indicators, Jupyter Kernel Monitoring and Environment Management

Something I’ve been pondering for some time is how to set up some simple Linux service monitoring so that I can display an indicator light in a web page to show whether a Linux service is running or not.

For example, in the TM351 VM, it could be handy to display some indicator lights in a Jupyter notebook status bar showing whether the database services we connect to from the notebooks are running correctly.

So here are some pieces that may contribute to that:

My thinking is:

  • use monit to monitor a process; if the process is down, write to a service status file in my www server directory, e.g. service_servicename_status.txt. If a service is running, the contents of this file are 1, otherwise 0;
  • use the JQuery fragment to poll the status file every few seconds;
  • if the status file returns 0, display a red indicator, otherwise green (a minimal sketch of the status file writing piece appears after this list).
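As a first pass on the status file piece, and using systemctl directly rather than monit, a sketch along the following lines (service name and web root are placeholders) could be run by monit, or from a cron job, on each check cycle:

#!/bin/bash
#Write 1 to the service status file if the named service is running, 0 otherwise
SERVICE=postgresql
STATUSFILE=/var/www/html/service_${SERVICE}_status.txt

if systemctl is-active --quiet "$SERVICE"; then
    echo 1 > "$STATUSFILE"
else
    echo 0 > "$STATUSFILE"
fi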

Here are some other monitoring / environment managing fragments I’m pondering:

  • something like ps_mem, a Python utility *to accurately report the in core memory usage for a program*. I’m wondering if I could use that to track how much memory each Jupyter notebook python kernel is taking up (or maybe monit can do that?) There’s an old extension that looks like it shows reports: nbtop. Or perhaps use psutil (via this issue, which seems to offer a solution?);
  • a minimal example of setting up a notebook homepage tab for a hello world webpage; Writing a notebook server extension looks like it has the ingredients, and nb_conda provides a fuller working example. Actually, that extension looks useful for *Jupyter-as-a-learning-environment* because it lets you select different conda environments, which could be handy for running different activities.

Any other examples out there of Jupyter monitoring / environment management?

Interactive Authoring Environments for Reproducible Media: Stencila

One of the problems associated with keeping up with tech is that a lot of things that “make sense” are not the result of the introduction or availability of a new tool or application in and of itself, but in the way that it might make a new combination of tools possible that support a complete end to end workflow or that can be used to reengineer (a large part of) an existing workflow.

In the OU, it’s probably fair to say that the document workflow associated with creating course materials has its issues. I’m still keen to explore how a Jupyter notebook or Rmd workflow would work, particularly if the authored documents included recipes for embedded media objects such as diagrams, items retrieved from a third party API, or rendered from a source representation or recipe.

One “obvious” problem is that the Jupyter notebook or RStudio Rmd editor is “too hard” to work with (that is, it’s not Word).

A few days ago I saw a tweet mentioning the use of Stencila with Binderhub. Stencila? Apparently, “[a]n open source office suite for reproducible research”. From the blurb:

[T]oday’s tools for reproducible research can be intimidating – especially if you’re not a coder. Stencila make reproducible research more accessible with the intuitive word processor and spreadsheet interfaces that you and your colleagues are already used to.

That sounds appropriate… It’s available as a desktop app, but courtesy of minrk/jupyter-dar (I think?), it runs on binderhub and can be accessed via a browser too:

 

You can try it here.

As with Jupyter notebooks, you can edit and run code cells, as well as authoring text. But the UI is smoother than in Jupyter notebooks.

(This is one of the things I don’t understand about colleagues’ attitude towards emerging tech projects: they look at today’s UX and think that’s it, because that’s how it is inside an organisation – you take what you’re given and it stays the same for decades. In a living project, stuff tends to get better if it’s being used and there are issues with it…)

The Jupyter-Dar strapline pitches “Jupyter + DAR compatibility exploration for running Stencila on binder”. Hmm. DAR? That’s also new to me:

Dar stands for (Reproducible) Document Archive and specifies a virtual file format that holds multiple digital documents, complete with images and other assets. A Dar consists of a manifest file (manifest.xml) that describes the contents.

Dar is being designed for storing reproducible research publications, but the underlying concepts are suitable for any kind of digital publications that can be bundled together with their assets.

Repo: [substance/dar](https://github.com/substance/dar)

Sounds interesting. And which reminds me: how’s OpenCreate coming along, I wonder? (My permissions appear to have been revoked again; or the URL has changed.)

PS seems like there’s more activity in the “pure web” notebook application world. Hot on the heels of Mike Bostock’s Observable notebooks (rationale) comes iodide, “[a] frictionless portable notebook-style interface for literate scientific computing in the browser” (examples).

I don’t know if these things just require you to use Javascript, or whether they can also embed things like Brython.

I’m not sure I fully get the js/browser notebooks yet? I like the richer extensibility of things like Jupyter in terms of arbitrary language/kernel availability, though I suppose the web notebooks might be able to hook into other kernels using similar mechanics to those used by things like Thebelab?

I guess one advantage is that you can do stuff on a Chromebook, and without a network connection if you cache all the required JS packages locally? Although with new ChromeOS offering support for Linux – and hence, Docker containers – natively, Chromebooks could get a whole lot more exciting over the next few months. From what I can tell, crosvm looks like a ChromeOS native equivalent to something like Virtualbox (with an equivalent of Guest Additions?). It’ll be interesting to see how well things like audio work. Reports suggest that graphical UIs will work, presumably using some sort of native X11 support rather than noVNC, so now could be a good time to start looking out for a souped up Pixelbook…

Generative Assessment Creation

It’s coming round to that time of year where we have to create the assessment material for courses with an October start date. In many cases, we reuse question forms from previous presentations but change the specific details. If a question is suitably defined, then large parts of this process could be automated.

In the OU, automated question / answer option randomisation is used to provide iCMAs (interactive computer marked assessments) via the student VLE using OpenMark. As well as purely text based questions, questions can include tables or images as part of the question.

One way of supporting such question types is to manually create a set of answer options, perhaps with linked media assets, and then allow randomisation of them.

Another way is to define the question in a generative way so that the correct and incorrect answers are automatically generated. (This seems to be one of those use cases for why ‘everyone should learn to code’;-)

Pinching screenshots from an (old?) OpenMark tutorial, we can see how a dynamically generated question might be defined. For example, create a set of variables:

and then generate a templated question, and student feedback generator, around them:

Packages also exist for creating generative questions/answers more generally. For example, the R exams package allows you to define question/answer templates in Rmd and then generate questions and solutions in a variety of output document formats.
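From the package docs, generating randomised instances from an Rmd exercise template looks to be as simple as something like this (the template filename is a placeholder):

library(exams)

#Generate three randomised HTML renderings of a question/answer template
exams2html("myquestion.Rmd", n = 3)

#The same template can also be rendered to other formats, e.g.
#exams2pdf("myquestion.Rmd", n = 3)
#exams2moodle("myquestion.Rmd", n = 3)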


You can also write templates that include the creation of graphical assets such as charts:

 

Via my feeds over the weekend, I noticed that this package now also supports the creation of more general diagrams created from a TikZ diagram template. For example, logic diagrams:

Or automata diagrams:

(You can see more exam templates here: www.r-exams.org/templates.)

As I’m still on a “we can do everything in Jupyter” kick, one of the things I’ve explored is various IPython/notebook magics that support diagram creation. At the moment, these are just generic magics that allow you to write TikZ diagrams, for example, that make use of various TikZ packages:
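For example, using something like the ipython-tikzmagic extension (the exact package and magic names depend on which TikZ magic you install, so treat this as illustrative):

%load_ext tikzmagic

and then, in a separate code cell:

%%tikz
\draw[thick,->] (0,0) -- (2,1) node[right] {output};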

One item on the to do list is to create some example magics that template different question types.

I’m not sure if OpenCreate is following a similar model? (I seem to have lost access permissions again…)

FWIW, I’ve also started looking at my show’n’tell notebooks again, trying to get them working in Azure notebooks. (OU staff should be able to log in to notebooks.azure.com using OUCU@open.ac.uk credentials.) For the moment, I’m depositing them at https://notebooks.azure.com/OUsefulInfo/libraries/gettingstarted, although some tidying may happen at some point. There are also several more basic demo notebooks I need to put together (e.g. on creating charts and using interactive widgets, digital humanities demos, R demos and (if they work!) polyglot R and python notebook demos, etc.). To use the notebooks interactively, log in and clone the library into your own user space.

Generating Printable MS Word Versions of Merged Jupyter Notebooks

One of the issues we know students have with the Jupyter notebooks that we provide as part of the course is that there is no straightforward way of printing them all out for offscreen reading / annotation. (As well as code, there is a certain amount of practical and code related explanatory material in the notebooks.)

One of the things I started to doodle with last year was a simple script to merge several notebooks and then render the result as a Microsoft Word doc. This has a dependency on pandoc, though not LaTeX, and requires that the conversion takes place via HTML: the ipynb is converted to HTML using nbconvert, then from HTML to docx. If there are image files transcluded into the notebook, this also means that the pandoc conversion process needs to be executed in the same directory as the notebook so that the image paths are correctly recognised. (When running nbconvert with the html_embed output, pandoc fell over.)
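For reference, the core of that conversion, run from inside the notebook directory, is essentially a two step affair along these lines (filenames are placeholders):

#Convert the (merged) notebook to HTML
jupyter nbconvert --to html merged.ipynb

#Convert the HTML to a Word document using pandoc
pandoc -s merged.html -o merged.docx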

Having to run pandoc in a local, image path respecting directory is a pain because it means I can’t run it over a merged notebook file composed of notebooks from multiple directories. Which means that I have to generate a separate docx file for the notebooks in each separate directory. Whilst I could move these into the same directory to make accessing them all a bit easier, it still means students have to print out multiple documents. I did try using a python package to merge the Word docs, but it borked on the images.

There are Python packages that can merge PDF documents in a more reliable way, but I am having issues with getting a sensible PDF workflow together. In the first case, for pandoc to render documents to  PDF seems to require the texlive-xetex package, which adds considerable weight to the VM (and I don’t know the dependency voodoo required to get a minimum viable LaTeX distribution in place). In the second, my test notebooks included a pymarkdown inline element that embedded a pandas dataframe in a markdown cell and this seemed to break the pandoc PDF conversion at that point.

One thing I haven’t done yet is look at customising the output templates so that we can brand the exported documents. For this, I need to look at custom templates.

My initial sketch code for the ‘export merged notebooks in a directory as docx’ routine is available via this gist. One thing I need to do is wrap it in a simple CLI command. Comments / suggestions for improvement, or links to better alternatives, more than welcome!

Seeding Shared Folders With Files Distributed via a VM

For the first few presentations of our Data Management and Analysis course, the course VM has been distributed to students via a USB mailing. This year, I’m trying to move to a model whereby the primary distribution is via a download from VagrantCloud (students manage the VM using Vagrant), though we’re also hoping to be able to offer access to an OU OpenStack hosted VM to any students who really need it.

For students on Microsoft Windows computers, an installer installs Virtualbox and vagrant from installers distributed via the USB memory stick. This in part derives from the policy of fixing versions of as much as we can so that it can be tested in advance. The installer also creates a working directory for the course that will be shared by the VM, and copies required files, again from the memory stick, into the shared folder. On Macs and Linux, students have to do this setup themselves.

One of the things I have consciously started trying to do is move the responsibility for satisficing some of the installation requirements into the Vagrantfile. (I’m also starting to think they should be pushed even deeper into the VM itself.)

For example, as some of the VM services expect particular directories to exist in the shared directory, we have a couple of defensive measures in place:

  • the Vagrantfile creates any required, yet missing, subdirectories in the shared directory;
            #Make sure that any required directories are created
            config.vm.provision :shell, :inline => <<-SH
                mkdir -p /vagrant/notebooks
                mkdir -p /vagrant/openrefine_projects
                mkdir -p /vagrant/logs
                mkdir -p /vagrant/data
                mkdir -p /vagrant/utilities
                mkdir -p /vagrant/backups
                mkdir -p /vagrant/backups/postgres-backup/
                mkdir -p /vagrant/backups/mongo-backup/	
            SH
    

  • start up scripts for services that require particular directories check that they exist before the service is started, and create them if they are missing. For example, in the service file, go defensive with something like ExecStartPre=/bin/mkdir -p /vagrant/notebooks (systemd expects an absolute path to the command here).

The teaching material associated with the (contents of the) VM is distributed using a set of notebooks downloaded from the VLE. Part of the reason for this is that it delays the point at which the course notebooks must be frozen: the USB is mastered late July/early August for a mailing in September and course start in October.

As well as the course notebooks, there are a couple of informal installation test notebooks. These can be frozen along with the VM and distributed inside it, but the question then arises of how to get them into the shared folder. So this year I am trying out a simple pattern that bakes test files into the VM and then uses the Vagrantfile to copy the files into the shared directory on its first run with a particular shared folder:

config.vm.provision :shell, :inline => <<-SH
    if [ ! -f /vagrant/.firstrun_nbcopy.done ]; then
        # Trust notebooks in immediate child directories of notebook directory
        files=(`find /opt/notebooks/* -maxdepth 2 -name "*.ipynb"`)
        if [ ${#files[@]} -gt 0 ]; then
            jupyter trust /opt/notebooks/*.ipynb;
            jupyter trust /opt/notebooks/*/*.ipynb;
        fi
        #Copy notebooks into shared directory
        cp -r /opt/notebooks/. /vagrant/notebooks
        touch /vagrant/.firstrun_nbcopy.done
    fi
   SH

This pattern allows files shipped inside the VM to be copied into the shared folder once it is mounted into the VM from host. The files will then persist inside the shared directory, along with a hidden flag file to say the files have been copied. I’m not sure about the benefits of auto-running something inside the VM to manage this copying? Or whether to check that a more recent copy of the files to be copied doesn’t already exist in the shared folder before copying on the first run in the folder?

Initial Sketch – Searching Jupyter Notebooks Using lunr

Coming round as it is to that time of year for updating, testing and freezing/”gold mastering” the TM351 VM that we distribute to students for the October presentation of our Data Analysis and Management course,  I’ve been thinking about how we can make the VM more useful for students, and whether the things we’re looking at might also be useful in an Institute of Coding context (I’m on a workpackage looking at infrastructure to support coding education: please get in touch if you’re up for a conversation around such matters:-)

One of the things I’ve been pondering is how to search across notebooks – a lot of the TM351 teaching material is in notebooks and there’s no obvious way of searching over them. (There’s also no obvious way of printing them all out in one go, or saving them to a merged document – I’ll post more about that in a separate post…)

In my sketches for the new VM, I’ve added a simple python webserver that exposes a homepage that links to the various services running inside the VM. (Ideally, there’d also be indicator lights showing whether the associated Linux service is running or not: anyone know of a simple package to help with that?)

This made me think that it might be useful to provide simple search tool over the notebooks in the (shared) directory that the VM shares with the host.

One way of doing this might be to put the notebook content into a simple sqlite database and serve it using datasette, or query it via a Scripted Form style UI. SQLite has a full text search extension (FTS3-5) and some support for fuzzy matching (eg spellfix1), although I’m not sure how well it would fare as a code search engine.

But I also came across a lightweight Javascript search engine called lunr – “[a] bit like Solr, but much smaller and not as bright” – and an example of How [Matthew Daly] Added Search to [His] Site With Lunr.js, so I thought I’d give that a go…

At the moment, I’m only testing against a couple of notebooks. The search results are at the markdown cell level, so if a cell contains a lot of text, the whole cell will be displayed, which may not be optimal. I’m rendering the cell markdown as HTML in the browser using the Showdown Javascript package although this could be disabled to show just the raw markdown. My guess is that any relatively linked images embedded in the markdown will show as broken.

The search terms are supposed to be highlighted using mark.js, but while I had it working in a preliminary sketch, it seems to be borked now and I’m not sure where I’m setting it up incorrectly or using it wrong.

It strikes me that if a markdown cell in the results contains a lot of text, it might be worth trying to identify where in the text the query terms appear and then prune the result text around them.

I’m making no attempt to search code cells, though I did think about trying to extract lines of comment text using a crib along the lines of if LINE.strip().startswith('#').

I’m generating the lunr index using lunr.py and saving it along with a store of the cell content in a JSON file that’s loaded into the search page. Whilst I’m testing the search page served from a simple Python httpserver, it struck me that it could also be served along a /view path in the Jupyter notebook context. When I first tried this, using JSON data loaded in to the search page using JQuery as a JSON object, I got a CORS error. Rather than waste too much time trying to solve that (I wasted a little!) I worked around it and loaded my lunr.json search index and store in to the page as JSONP instead.

One thing I need to do is provide an easy to use tool to generate the search index and lookup store from a set of notebooks. (In the TM351 VM context, this would be in the context of the mounted /shared notebooks folder that the notebook server runs at the top of.)
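A rough sketch of what that tool might look like (untested as written here; the paths and field names are just my own choices):

import glob, json
import nbformat
from lunr import lunr

docs = []
for path in glob.glob('/vagrant/notebooks/**/*.ipynb', recursive=True):
    nb = nbformat.read(path, as_version=4)
    for i, cell in enumerate(nb.cells):
        if cell.cell_type == 'markdown':
            docs.append({'id': '{}#{}'.format(path, i),
                         'path': path,
                         'text': cell.source})

#Build the lunr index over the markdown cell text
idx = lunr(ref='id', fields=('text',), documents=docs)

#Save the serialised index along with a simple content store
with open('lunr.json', 'w') as f:
    json.dump({'index': idx.serialize(), 'store': docs}, f)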

There still needs to be some clear thinking about what to link to – my initial thought is to link to the notebook running in the VM. If anchors are in the original markdown cell text it should be possible to deeplink to those. It might also be possible to link to an HTML render of the notebook. This could be done via nbconvert (although I am not currently running this as a service in the VM) or perhaps as an in-browser rendering of the .ipynb JSON using something like Notebook.js / nbpreview. (FWIW, I also note react-jupyter).

But if nothing else, this is a thing that can be used and poked around to find out where it’s most painful in use and how it can be improved. A couple of things that immediately come to mind in terms of Jupyter integration, for example:

  • the Jupyter notebook classic UI could come with a ‘Search notebooks’ tab, and maybe a search indexer running in the background (as and when notebooks in scope are saved);
  • JupyterLab could be extended with a lunr based notebook search plugin.

Code for my initial pencil sketch of a lunr Jupyter notebook markdown cell search tool can be found in this gist.

PS via Grant Nestor on the Jupyter Google group:

grep --include='*.ipynb' --exclude-dir='.ipynb_checkpoints' -rliw . -e 'search query'

This will search your Jupyter server root recursively for files that contain the whole word (case-insensitive) “search query” and only return the file names of matches.

More info: https://stackoverflow.com/questions/16956810/how-do-i-find-all-files-containing-specific-text-on-linux

[Fragment] Jupyter Notebooks and Moodle

Via a comment, Ger Tielemans alerted me to the SageCell filter for Moodle that lets you embed an executable code cell powered by Sage Cell Server, which was also new to me, in a Moodle content page.

From what I can tell, Sage Cell Server provides a stateless way of executing SageMath, R or Python code.

This contrasts with approaches such as Thebelab (e.g. as described in Using ThebeLab to Run Python Code Embedded in HTML Via A Jupyter Kernel), which will connect all the cells in a web page to the same Jupyter kernel, and as such allow state to be shared between them.

It also strikes me that ScriptedForms provides another possible way of accessing Jupyter/python executable code from a frame embed.

So now I’m wondering – might the SageCell filter for Moodle provide a basis for a Thebelab filter for Moodle?

The comment also prompted me to look around for other Moodle’n’Jupyter integrations.

We have an auth hack for running temporary notebooks via Jupyterhub in which students who are logged in to the Moodle VLE are presented with a URL containing a secret that can be used to log them in to a notebook launching Jupyterhub server running on Azure/Kubernetes. This uses some sort of Moodle plugin for adding the secret token, I think? (Note to self: find out how this works and what the Moodle plugin is! Rod….????!)

I was reminded of another weak integration we have with OU learning environments and Jupyter notebooks in a reply posting to the Moodle forums by Mark Johnson on Moodle integration with Jupyter notebooks. OpenDesign Studio, which allows students to upload and comment on media objects, has a plugin that allows students to upload and view Jupyter notebooks as a media asset. As I replied to the thread:

The OpenStudio plugin requires users to upload a zip file with an `.nbk` suffix that contains the notebook `.ipynb` file and an HTML export view of it (the html view can be a bit ropey, e.g. if the notebook has run output cells that originally contained IFrames). A crude notebook extension provides a toolbar button to create the `.nbk` package that uses the `nbconvert` `html_embed` filter to generate the html.

The original proposal was for an OpenStudio plugin architecture that would allow .ipynb to be uploaded and then rendered back to users of OpenStudio using nbviewer. That was blocked, presumably on grounds of complexity, available project time/resource and maybe also security…

One other code related thing we have running in Moodle is Moodle CodeRunner. I think I’d originally hassled for this to be installed in a demo server when Tim Hunt had originally mentioned it to me, though part of me wishes I hadn’t! The plugin adds simple code testing to Moodle quizzes, so students can enter code which is then executed and checked against a test.

As to why I don’t like it: I think that because we have Moodle CodeRunner installed, it will act as a blocker to exploring any ways in which we might look at code runners that run using Jupyter protocols against Jupyter kernels. I don’t know if CodeRunner can be tweaked to run against such kernels, or otherwise patched to become a Jupyter client?

One of the things that concerns me about CodeRunner is what happens if we want to assess students on code that requires packages (either python packages, or operating system packages they call on) that are preinstalled as part of the computing environment we give students?

Other issues arise if we want to test students more widely, for example if we wanted to test students’ ability to use python to query, or even update, a linked database such as the one we install in the TM351 VM?

What would be nice is if we could containerise a student computing environment and then use that as the back end to an automated tester, e.g. running something like Thebelab or, perhaps better, a variant of something like the stateless Sage Cell Server.

One thing that is still on my to do list is to explore nbgrader as a tool for annotating assessment notebooks and providing a means for ALs to manually mark and comment on them. (I note that as well as supporting automated tests with nbgrader, Jess Hamrick also started to look at tools to support the automated marking of matplotlib charts: jhamrick/plotchecker.) I guess another way would be to use automated testing tools like Selenium to render charts and then test them, but this would presumably only work for exact image / hash matches?

Unfortunately, I haven’t been able to get nbgrader to work with my standalone notebook server – the formgrader isn’t found. I need to return to this when I get a chance (unless someone has working babystep instructions for getting nbgrader running with a standalone Jupyter notebook server?)

PS as far as integrations with other learning environments go, I note the Open edX Jupyter Notebook Viewer that lets you “fetch and display part of, or an entire Jupyter Notebook in an XBlock”.