Dockerising Open Data Databases – First Fumblings

A new year, and I’m trying to reclaim some joy, fun and tech mojo by getting back into the swing of appropriating stuff and seeing what we can do with it. My current universal screwdriver is Docker, so I spent an hour or two yesterday afternoon and a large chunk of today poking around various repos looking for interesting things to make use of.

The goal I set myself was to find a piece of the data wrangling jigsaw puzzle that would let me easily load a previously exported SQL database into MySQL running in a container. The idea is that we should be able to write a single command that points to a directory containing some exported SQL stuff, and then load that data straight into a container in which you can start querying it. (There’s possibly (probably?) a well known way of doing this for the folk who know how to do it, but what do I know….?!;-)

The solution I settled on was a set of Python scripts published by Luis Elizondo / luiselizondo that wrap some docker commands to produce a command line tool for firing up and populating MySQL instances, as well as connecting to them: docker-mysql-scripts.

The data I thought I’d play with is the 2013 UK MOT results database, partly because I found a database config file for it (that I think may originally have been put together by Ian Hopkinson). My own archive copy of the relevant scripts (including the SQL file to create the necessary MOT database tables and boot load the MOT data) can be found in the gist linked at the top of the listing below. Download the scripts and you should be able to run them (perhaps after setting permissions to make them executable) as follows:

#Scripts: https://gist.github.com/psychemedia/86980a9e86a6de87a23b
#/host/path/to/ should contain:
## mot_boot.sql #Define the database tables and populate them
## test_result_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_result_2013.txt.gz 
## test_item_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_item_2013.txt.gz 
## mdr_test_lookup_tables/ #download and unzip: http://data.dft.gov.uk/anonymised-mot-test/mdr_test_lookup_tables.zip

#Create a container 'mot2013' with MYSQL password: mot
dmysql-server mot2013 mot

#Create a database: motdata
dmysql-create-database mot2013 motdata

#Populate the database using the mot_boot.sql script
dmysql-import-database mot2013 /host/path/to/mot_boot.sql --database motdata
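For what it’s worth, getting the files listed in the comments above into place might be scripted along the following lines. This is a rough sketch rather than anything from the original scripts: fetch_mot_data is a made-up helper name, it assumes curl, gunzip and unzip are installed and that the data.dft.gov.uk URLs still resolve, and it guesses that the lookup tables zip unpacks as loose files rather than with its own top-level folder.

```shell
# Sketch only: fetch_mot_data is a hypothetical helper, not part of docker-mysql-scripts.
# The URLs are the ones given in the comments in the listing above.
BASE_URL="http://data.dft.gov.uk/anonymised-mot-test"

fetch_mot_data() {
    data_dir="$1"
    mkdir -p "$data_dir/mdr_test_lookup_tables" || return 1

    for name in test_result_2013 test_item_2013; do
        # -f: fail on HTTP errors, -L: follow redirects
        curl -fL -o "$data_dir/$name.txt.gz" "$BASE_URL/12-03/$name.txt.gz" || return 1
        # Leaves $name.txt next to mot_boot.sql
        gunzip -f "$data_dir/$name.txt.gz" || return 1
    done

    curl -fL -o "$data_dir/lookup.zip" "$BASE_URL/mdr_test_lookup_tables.zip" || return 1
    # Assumption: the archive holds loose files; drop the -d target if it
    # already contains an mdr_test_lookup_tables/ folder of its own.
    unzip -o -d "$data_dir/mdr_test_lookup_tables" "$data_dir/lookup.zip"
}
```

Calling fetch_mot_data /host/path/to, with mot_boot.sql copied into the same directory, should leave things ready for the dmysql-import-database step.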

I can then log in to the database with the command dmysql mot2013, connect to the appropriate database from the MySQL client with the SQL command USE motdata; and then run queries on the 2013 MOT data.
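If you would rather not go through the interactive client, something like the following might work for firing off one-off queries. It is a sketch under a couple of assumptions: mysql_in_container is a hypothetical helper name, the container is built from the official mysql image (so the mysql client is available inside it), and you connect as root with the password given to dmysql-server.

```shell
# Sketch only: mysql_in_container is a hypothetical helper.
# Usage: mysql_in_container CONTAINER ROOT_PASSWORD DATABASE "SQL"
mysql_in_container() {
    container="$1"; password="$2"; database="$3"; sql="$4"
    # docker exec runs the mysql client that lives inside the container,
    # so nothing needs to be installed on the host.
    docker exec -i "$container" mysql -uroot -p"$password" "$database" -e "$sql"
}
```

For example, mysql_in_container mot2013 mot motdata "SHOW TABLES;" should list whatever tables mot_boot.sql created.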

The scripts also allow the database contents to be managed via a separate data container:

#Create a new container called: test with an external data container
dmysql-server --with-volume test test
#This creates a second container - TEST_DATA - that manages the database files
#The name is given by: upper($CONTAINERNAME)+'_DATA'

#Create a dummy db
dmysql-create-database test testdb

#We can connect to the database as before
dmysql test
mysql> exit

#Delete this container, removing any attached volumes with the -v flag
docker rm -v -f test

#Create a new container connected to the original TEST_DATA container
docker run --volumes-from TEST_DATA --name test4 -e MYSQL_ROOT_PASSWORD="test" -d mysql

#Connect to this new container
dmysql test4
mysql> SHOW DATABASES;
#We should see the testdb database there...

#NOTE - I think that only one database server container can be connected to the TEST_DATA container at a time

So far, so confusing… Here’s a brief description of the current state of my understanding/confusion:

What really threw me is that the database container (either the original mot2013 container or the test database with the external data container) doesn’t appear to store the data inside the container itself. (So, for example, the TEST_DATA container does not contain the database.) Instead, the data appears to be held in a “hidden” volume that is mounted from outside the container. I came a cropper with this originally by deleting containers using commands of the form docker rm [CONTAINER_NAME] and then finding that the docker VM was running out of memory. This deletes the container, but leaves the volume that was associated with the deleted container hanging around. To remove those volumes automatically, containers should be removed with commands of the form docker rm -v [CONTAINER_NAME]. What makes things difficult to tidy up is that the orphaned volumes can’t be seen using the normal docker ps or docker ps -a commands; instead you need to install docker-volumes to identify them and delete them. (There’s a possible fix that shows how to store the data inside the container, rather than in an externally mounted volume – I think?! – linked to from this post, but I haven’t tried it because I was trying to use root Dockerfile images.)
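Docker releases from 1.9 onwards also ship a docker volume subcommand, which may do the tidying up without an extra tool. The helper names below are made up for illustration, and the sketch assumes a Docker version recent enough to have docker volume and the .Mounts field in docker inspect output.

```shell
# Sketch only: hypothetical helpers wrapping standard docker commands.

# Show which volumes are mounted into a container (as JSON).
container_mounts() {
    docker inspect -f '{{ json .Mounts }}' "$1"
}

# List the names of volumes that no container refers to any more.
dangling_volumes() {
    docker volume ls -q -f dangling=true
}

# Remove every dangling volume. Destructive: check the list first.
remove_dangling_volumes() {
    dangling_volumes | while read -r vol; do
        docker volume rm "$vol"
    done
}
```

Running container_mounts mot2013 should show where the database files actually live on the docker host, and dangling_volumes should list anything left behind by a docker rm without the -v flag.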

The use of the external data volumes also means we can’t easily bundle up a data container using docker commit and then hope to create new containers from it. Such a model would allow you to spend an hour or two loading a large-ish data set into a database, then push a data container containing that db to dockerhub; other users could then pull down that image, create a container from it and immediately attach it to a MySQL container without having to go through the pain of building the database. This would provide a nifty way of sharing ready-to-query datasets such as the MOT database: you could just pull a data image, mount it as the data volume/container, and get started with running queries.

On the other hand, it is possible to mount a volume inside a container by running the container with a -v flag and specifying the mount point (docker volumes). Luis Elizondo’s scripts allow you to set up these data volume containers by running dmysql-server with the --with-volume flag as shown in the code snippet above, but at the moment I can’t see how to use the scripts to connect to a pre-existing data container. (See issue here.)
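Outside the scripts, the plain docker run command used earlier for test4 can be wrapped up to do the job. A minimal sketch, in which mysql_server_from_data is a hypothetical name and the official mysql image is assumed:

```shell
# Sketch only: mysql_server_from_data is a hypothetical helper.
# Usage: mysql_server_from_data NEW_CONTAINER DATA_CONTAINER ROOT_PASSWORD
mysql_server_from_data() {
    name="$1"; data_container="$2"; password="$3"
    # --volumes-from reuses the volumes of the existing data container
    docker run --volumes-from "$data_container" --name "$name" \
        -e MYSQL_ROOT_PASSWORD="$password" -d mysql
}
```

So mysql_server_from_data test4 TEST_DATA test reproduces the earlier command. As far as I can tell, the official image only applies MYSQL_ROOT_PASSWORD when it initialises an empty data directory, so with a pre-existing data container the password to log in with is the one that was set originally.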

So given that we can mount volumes inside a linked data container, does this mean we have a route to portable data containers that could be shared via a docker datahub? It seems not… because, as it was, I wasted quite a bit more time learning that the volumes of a docker data container can’t be saved as part of an image! (To make a data container portable, I think we’d need to commit it as an image, share the image, then create a container from that image? Or do I misunderstand this aspect of the docker workflow too…?!)
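One lower-tech route that might be worth a try is the tar-based backup and restore pattern for data volumes: archive the contents of the volume to the host, ship the archive around, and unpack it into a fresh data container at the other end. I haven’t tried this with the MOT data, so treat the following as a sketch. The function names are made up, it assumes the busybox image and the official mysql image’s data directory of /var/lib/mysql, and the MySQL server container should presumably be stopped first so that the files are in a consistent state.

```shell
# Sketch only: hypothetical helpers, untested against the MOT database.

# Archive the MySQL data directory held by a data container into ./ARCHIVE
backup_data_container() {
    data_container="$1"; archive="$2"
    docker run --rm --volumes-from "$data_container" -v "$(pwd)":/backup busybox \
        tar cf "/backup/$archive" -C /var/lib/mysql .
}

# Create a fresh data container and unpack ./ARCHIVE into its volume.
restore_data_container() {
    data_container="$1"; archive="$2"
    docker create -v /var/lib/mysql --name "$data_container" busybox /bin/true &&
    docker run --rm --volumes-from "$data_container" -v "$(pwd)":/backup busybox \
        tar xf "/backup/$archive" -C /var/lib/mysql
}
```

Something like backup_data_container TEST_DATA testdb.tar on one machine, followed by restore_data_container TEST_DATA testdb.tar on another, would then let a new MySQL container be started with --volumes-from TEST_DATA. It isn’t the pull-an-image-and-go workflow, but it does at least move the data.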

That said, there does look to be a workaround in the form of flocker. However, I’m fed up with this whole thing for a bit now… What I hoped would be a quick demo of: get data in docker MySQL container; package container and put image on dockerhub; pull down image, create container and start using data immediately turned into a saga of realising quite how much I don’t understand docker, what it does and how it does it.

I hate computers…. :-(