Frictionless Data Analysis – Trying to Clarify My Thoughts

Prompted by a conversation with Rufus Pollock over lunch today, in part about data containerisation and the notion of “frictionless” data that can be easily discovered and is packaged along with metadata that helps you to import it into other tools or applications (such as a database), I’ve been confusing myself about what it might be like to have a frictionless data analysis working environment, where I could do something like write fda --datapackage --db postgres --client rstudio ipynb and that would then:

  • generate a fig script (eg as per something like Using Docker to Build Linked Container Course VMs);
  • download the data package from the specified URL, unbundle it, create an SQL file to create an appropriate init file for the database specified, fire up the database and use the generated SQL file to configure the database by creating any necessary tables and loading the data in;
  • fire up any specified client applications (IPython notebook and RStudio server in this example) and ideally seed them with SQL magic or database connection statements, for example, that automatically define an appropriate data connection to the database that’s just been configured;
  • launch browser tabs that contain the clients;
  • it might also be handy to be able to mount local directories against directory paths in the client applications, so I could have my R scripts in one directory of my own desktop, IPython notebooks in another, and then have visibility of those analysis scripts from the actual client applications.

The idea is that from a single command I can pull down a datafile, ingest it into a database, fire up one or more clients that are connected to that database, and start working with the data immediately. It’s not so different to double clicking on a file on your desktop and launching it into an application to start working on it, right?!

Can’t be that hard to wire up, surely?!;-) But would it be useful?

PS See also a further riff on this idea: Data Analysis Packages…?

Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course related VMs in a scaleable way for a distance education context (trying to be OUseful here…) I’ve more recently started pondering whether it makes more sense to create virtual machines from linked data containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs. There is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch it off and switch it on again opportunities: a machine can be shutdown and restarted/reprovisioned, or if necessary can be deleted and reinstalled though this latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

    command: echo created
    image: busybox
        - "~/Google Drive/notebooks:/notebooks"
    command: echo created
    image: busybox
        - "~/Google Drive/data:/data"

    command: echo created
    image: busybox
        - /var/lib/postgresql/data

    image: postgres
        - "5433:5432"
        - devpostgresdata

        - PASSWORD
    image: calvingiles/data-science-environment
        - devpostgres:postgres
        - "443:8888"
        - notebooks
        - data

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm

(The script also creates Google Drive folders into which copies of the notebooks will be located and shared between the VM containers and the host.)

The notebooks can then be accessed via browser, (you need to log in with the specified password – MyPass from the example above); the location of the notebooks is https//IP.ADDRESS:443 (note the https, which may require you saying “yes, really load the page” to Google Chrome – though it is possible to configure the server to use just http) where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternativley, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to “their” personal database server(s). [UPDATE: doh! Container linking passes name information into a container as an environmental variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!

Anyway – this is a candidate way for constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg trad-uni computer lab with VMs running in student desktops (if that’s allowed!); distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES',port='5432',user='postgres', password="PGPass")

Check other environments with:

import os