OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Connecting RStudio and MySQL Docker Containers – an example using the ergast db


Building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio:

  • Download and unzip the f1db.sql.gz file to f1db.sql
  • install these docker-mysql-scripts
  • run boot2docker
  • from the boot2docker shell, start up a MySQL server (ergastdb) with password f1: dmysql-server ergastdb f1. By default, this exposes port 3306
  • create a new empty database (f1db): dmysql-create-database ergastdb f1db
  • add the ergast data to it: dmysql-import-database ergastdb /path/to/ergastdb/f1db.sql --database f1db
  • fire up a copy of RStudio, in this case using my psychemedia/wranglingf1data container, linking it to the MySQL database which has the alias db: docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data
  • run boot2docker ip to find where RStudio is running (IPADDRESS) and in your browser go to: http://IPADDRESS:8788, logging in with username rstudio and password rstudio
  • in RStudio, import the RMySQL library: library(RMySQL)
  • in RStudio, connect to the database: con=dbConnect(MySQL(),user='root',password='f1',host='db',port=3306,dbname='f1db')
  • in RStudio, run a test query: dbGetQuery(con,'SHOW TABLES')

[Screenshot: rstudio-mysql]

I guess what I need to do now is pull the various bits into another script to make it a one-liner, perhaps with a few switches? For example, to create the database if it doesn’t exist, to download the ergast database file automatically, to populate the database for the first time, or update it with a more recent copy of the database, to fire up both containers and make sure they are appropriately linked etc. This would dramatically simplify things for use in the context of the Wrangling F1 Data With R book, for example. (If you beat me to it, please post details in the comments below.)
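As a first stab, a wrapper along these lines might do the trick (very much a sketch: the script name and the path argument are made up, and it assumes the dmysql-* scripts, boot2docker and the psychemedia/wranglingf1data image are already in place):

#!/bin/bash
# ergast-up.sh - hypothetical one-shot launcher for the ergast db + RStudio pairing
# Usage: ./ergast-up.sh /path/to/f1db.sql

SQLFILE=$1

# Fire up a MySQL server container (root password: f1), create the db and load the data
dmysql-server ergastdb f1
dmysql-create-database ergastdb f1db
dmysql-import-database ergastdb $SQLFILE --database f1db

# Fire up the linked RStudio container
docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data

# Report where to point the browser
echo "RStudio should be available at http://$(boot2docker ip 2>/dev/null):8788 (login: rstudio/rstudio)"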

PS Hmm…. seems I get a UTF-8 encoding issue:

[Screenshot: UTF-8 encoding issue in RStudio query results]

Not sure if this is with the database, or the RMySQL connector? Anyone got any ideas of a fix?

Ah ha – a fix of sorts, via SO:

Running dbGetQuery(con,'SET NAMES utf8;') before querying seems to do the trick…
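For the record, the full connection-and-query sequence I’m now using in RStudio looks something like this (same container set-up and credentials as above):

library(RMySQL)

# Connect to the linked MySQL container (alias 'db')
con = dbConnect(MySQL(), user='root', password='f1', host='db', port=3306, dbname='f1db')

# Force the connection to use UTF-8 before running any queries
dbGetQuery(con, 'SET NAMES utf8;')

# Then query as normal
dbGetQuery(con, 'SHOW TABLES')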

Written by Tony Hirst

January 17, 2015 at 7:32 pm

Posted in Rstats


OpenRefine Docker Containers


I had a go at building a couple of docker containers for OpenRefine – psychemedia/docker-openrefine, built from the latest release, and psychemedia/openrefine, built from the latest code on github.

To get OpenRefine up and running in a virtual machine, you should:

  • install boot2docker
  • run boot2docker
  • Either: to run with a project directory solely within the container, in the boot2docker terminal, enter the command docker run --name openrefine -d -p 3334:3333 psychemedia/openrefine
  • Or: to run with a project directory mounted from a shared folder on the host, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /path/to/yourSharedDirectory:/mnt/refine --name openrefine psychemedia/openrefine
  • Or: to run with a project directory in a linked data volume, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /mnt/refine --name openrefine psychemedia/openrefine

(To use the latest release rather than a recent build use psychemedia/docker-openrefine rather than psychemedia/openrefine.)

The port number OpenRefine will be exposed on is given by the first number in the -p NNNN:3333 flag; to access OpenRefine via port 3334, use -p 3334:3333, and so on.

OpenRefine will then be available in your browser at the URL http://IPADDRESS:NNNN. The required value of IPADDRESS can be found using the command boot2docker ip.

The returned IP address (eg 192.168.59.103) is the one to use, for example: http://192.168.59.103:3334.
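Putting those pieces together, a typical session using the shared-folder variant might look something like the following (the host directory path is just an example):

# In the boot2docker terminal...

# Run OpenRefine, mounting a host folder as the project directory
docker run -d -p 3334:3333 -v /path/to/yourSharedDirectory:/mnt/refine --name openrefine psychemedia/openrefine

# Find the IP address to point the browser at, eg http://192.168.59.103:3334
boot2docker ip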

Written by Tony Hirst

January 17, 2015 at 1:53 pm

Posted in School_Of_Data, Tinkering


Split View Web Page Bookmarklet


What feels like forever ago, I described a method, Split Screen Screenshots, that allows you to put a copy of the same web page into two frames, so that you can grab a screenshot that includes the page header in the top frame, for example, as well as something from waaaaaay down the page in the lower frame.

I didn’t realise that the recipe I described required help from an external service until I tried to reuse the method to grab the following screenshot…

[Screenshot: birminghampayment]

Anyway, cribbing from this Chrome Dual View standalone bookmark, here’s an updated version of my split screen bookmarklet:

javascript:url=location.href;f='<frameset rows=\'30%,*\'>\n<frame src=\''+url+'\'/><frame src=\''+url+'\'/></frameset>';with(document){write(f);void(close())}

To make use of it, drag the following link onto your bookmarks toolbar, or save the link as a bookmark: Hmm… seems the craptastic WordPress.com doesn’t let me post bookmarklet links? So you’ll have to: bookmark this page, edit the bookmark, and paste the above code snippet in as the URL. Then when you want to split the view of a webpage, just click the bookmarklet.
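For anyone wondering what the one-liner actually does, here’s the same logic unpacked into more readable JavaScript (functionally equivalent to the bookmarklet above, just without the with block):

// Grab the URL of the current page
var url = location.href;

// Build a frameset that loads the same page twice: a 30% high top frame, the rest below
var f = "<frameset rows='30%,*'>" +
        "<frame src='" + url + "'/>" +
        "<frame src='" + url + "'/>" +
        "</frameset>";

// Replace the current document with the frameset
document.write(f);
document.close();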

Written by Tony Hirst

January 15, 2015 at 1:27 pm

Posted in Infoskills


Using Docker to Build Linked Container Course VMs


Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course-related VMs in a scalable way for a distance education context (trying to be OUseful here…), I’ve more recently started pondering whether it makes more sense to create virtual machines from linked docker containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs. There is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch-it-off-and-switch-it-on-again opportunities: a machine can be shut down and restarted/reprovisioned, or if necessary deleted and reinstalled (though the latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

notebooks:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/notebooks:/notebooks"
data:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/data:/data"

devpostgresdata:
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data

devpostgres:
    environment:
        - POSTGRES_PASSWORD
    image: postgres
    ports:
        - "5433:5432"
    volumes_from:
        - devpostgresdata

notebook:
    environment:
        - PASSWORD
    image: calvingiles/data-science-environment
    links:
        - devpostgres:postgres
    ports:
        - "443:8888"
    volumes_from:
        - notebooks
        - data


(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L https://gist.githubusercontent.com/calvingiles/b6123c301954fe68e29a/raw/data-science-environment-fig.yml > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm
PASSWORD=MyPass POSTGRES_PASSWORD=PGPass fig up -d

(The script also creates Google Drive folders in which copies of the notebooks will be stored and shared between the VM containers and the host.)

The notebooks can then be accessed via a browser (you need to log in with the specified password – MyPass from the example above); the location of the notebooks is https://IP.ADDRESS:443 (note the https, which may require you saying “yes, really load the page” to Google Chrome – though it is possible to configure the server to use just http), where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Via http://blog.michaelhamrah.com/2014/06/accessing-the-docker-host-server-within-a-container/
#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternatively, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to “their” personal database server(s)? [UPDATE: doh! Container linking passes name information into a container as an environment variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!

Anyway – this is a candidate way of constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg a trad-uni computer lab with VMs running on student desktops (if that’s allowed!); a distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc)?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES',port='5432',user='postgres', password="PGPass")

Check the other environment variables with:

import os
os.environ
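Following up the linking point above, the environment variables that docker injects into a linked container should be enough to locate the database without any netstat guesswork – something along these lines (a sketch, assuming the devpostgres:postgres link set up by the fig script; the variable names follow docker’s ALIAS_PORT_* convention):

import os
import psycopg2

# With a link alias of 'postgres', docker exposes the linked container's
# address and port as POSTGRES_PORT_5432_TCP_ADDR / POSTGRES_PORT_5432_TCP_PORT
host = os.environ.get('POSTGRES_PORT_5432_TCP_ADDR')
port = os.environ.get('POSTGRES_PORT_5432_TCP_PORT', '5432')

con = psycopg2.connect(host=host, port=port, user='postgres', password='PGPass')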

Written by Tony Hirst

January 14, 2015 at 5:03 pm

Posted in OU2.0, Tinkering


The Camera Only Lies


For some reason, when I first saw this:

[Animated GIF: Google Word Lens on iPhone]

it reminded me of Hockney. The most recent thing I’ve read, or heard, about Hockney was a recent radio interview (BBC: At Home With David Hockney), in the opening few minutes of which we hear Hockney talking about photographs: how the chemical period of photography, and its portrayal of perspective, is over, and how a new age of digital photography allows us to reimagine this sort of image making.

Hockney also famously claimed that lenses have played an important part in the craft of image making (Observer: What the eye didn’t see…). Whether or not people have used lenses to create particular images, the things seen through lenses and the idea of how lenses may help us see the world differently may in turn influence the way we see other things, whether in the mind’s eye, in where we focus attention, or in how we construct or compose an image of our own making.

As the above animated image shows, when a lens is combined with a logical transformation (that may be influenced by a socio-cultural and/or language context acting as a filter) of the machine interpretation of a visual scene represented as bits, it’s not exactly clear what we do see…!

As Hockney supposedly once said, “[t]he photograph isn’t good enough. It’s not real enough”.

Written by Tony Hirst

January 14, 2015 at 10:54 am

Posted in Anything you want

Dockerising Open Data Databases – First Fumblings


A new year, and I’m trying to reclaim some joy, fun and tech mojo by getting back into the swing of appropriating stuff and seeing what we can do with it. My current universal screwdriver is Docker, so I spent an hour or two yesterday afternoon and a large chunk of today poking around various repos looking for interesting things to make use of.

The goal I set myself was to find a piece of the datawrangling jigsaw puzzle that would let me easily load a previously exported SQL database into MySQL running in a container, the idea being that we should be able to write a single command that points to a directory containing some exported SQL stuff and then load that data straight into a container in which you can start querying it. (There’s possibly (probably?) a well known way of doing this for the folk who know how to do it, but what do I know….?!;-)

The solution I settled on was a set of python scripts published by Luis Elizondo / luiselizondo [about] that wrap some docker commands to produce a commandline tool for firing up and populating MySQL instances, as well as connecting to them: docker-mysql-scripts.
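I haven’t picked through the scripts in any detail, but judging from the raw docker command used later in this post, dmysql-server presumably wraps something not a million miles away from the following (a guess at the general shape, not the actual script contents):

# Roughly what 'dmysql-server mot2013 mot' amounts to (sketch only)
docker run --name mot2013 -e MYSQL_ROOT_PASSWORD=mot -d mysql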

The data I thought I’d play with is the 2013 UK MOT results database, partly because I found a database config file (that I think may originally have been put together by Ian Hopkinson). My own archive copy of the relevant scripts (including the sql file to create the necessary MOT database tables and load the MOT data into them) can be found here. Download the scripts and you should be able to run them (perhaps after setting permissions to make them executable) as follows:

#Scripts: https://gist.github.com/psychemedia/86980a9e86a6de87a23b
#Create a container mot2013 with MySQL pwd: mot
#/host/path/to/ should contain:
## mot_boot.sql #Define the database tables and populate them
## test_result_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_result_2013.txt.gz 
## test_item_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_item_2013.txt.gz 
## mdr_test_lookup_tables/ #download and unzip: http://data.dft.gov.uk/anonymised-mot-test/mdr_test_lookup_tables.zip

#Create a container 'mot2013' with MYSQL password: mot
dmysql-server mot2013 mot

#Create a database: motdata
dmysql-create-database mot2013 motdata

#Populate the database using the mot_boot.sql script
dmysql-import-database mot2013 /host/path/to/mot_boot.sql --database motdata

I can then log in to the database with the command dmysql mot2013, connect to the appropriate database from the MySQL client with the SQL command USE motdata; and then run queries on the 2013 MOT data.
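As a quick sanity check that the import worked, something along the following lines should do – though note that the test_result table name is my assumption based on the name of the source data file, so check what SHOW TABLES actually reports:

mysql> SHOW TABLES;
mysql> SELECT COUNT(*) FROM test_result;
mysql> SELECT * FROM test_result LIMIT 5;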

The scripts also allow the database contents to be managed via a separate data container:

#Create a new container called: test with an external data container
dmysql-server --with-volume test test
#This creates a second container - TEST_DATA - that manages the database files
#The name is given by: upper($CONTAINERNAME)+'_DATA'

#Create a dummy db
dmysql-create-database test testdb

#We can connect to the database as before
dmysql test
mysql> exit

#Delete this container, removing any attached volumes with the -v flag
docker rm -v -f test

#Create a new container connected to the original TEST_DATA container
docker run --volumes-from TEST_DATA --name test4 -e MYSQL_ROOT_PASSWORD="test" -d mysql

#Connect to this new container
dmysql test4
mysql> SHOW DATABASES;
#We should see the testdb database there...

#NOTE - I think that only one database server container can be connected to the TEST_DATA container at a time

So far, so confusing… Here’s a brief description of the current state of my understanding/confusion:

What really threw me is that the database containers (either the original mot2013 container or the test database with the external data container) don’t appear to store the data inside the container itself. (So for example, the TEST_DATA container does not contain the database.) Instead, the data appears to be contained in a “hidden” volume that is mounted outside the container. I came a cropper with this originally by deleting containers using commands of the form docker rm [CONTAINER_NAME] and then finding that the docker VM was running out of memory. This deletes the container, but leaves a mounted volume (associated with the deleted container name) hanging around. What makes things difficult to tidy up is that the mounted volumes can’t be seen using the normal docker ps or docker ps -a commands; instead you need to install docker-volumes to identify them and delete them. (There’s a possible fix that shows how to store the data inside the container, rather than in an externally mounted volume – I think?! – linked to from this post, but I haven’t tried it because I was trying to use root Dockerfile images.)

The use of the external data volumes also means we can’t easily bundle up a data container using docker commit and then hope to create new containers from it (such a model would allow you to spend an hour or two loading a large-ish data set into a database, then push a data container containing that db to dockerhub; other users could then pull down that image, create a container from it and immediately attach it to a MySQL container without having to go through the pain of building the database; this would provide a nifty way of sharing ready-to-query datasets such as the MOT database. You could just pull a data image, mount it as the data volume/container, and get started with running queries).
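To make that concrete, the sort of workflow I was hoping for – which falls over precisely because docker commit doesn’t capture the contents of the externally mounted volume – would have looked something like this (the image name is made up for the sake of the example):

# On the machine where the database was built: snapshot and share the data container...
docker commit MOT2013_DATA psychemedia/mot2013data
docker push psychemedia/mot2013data

# ...then on someone else's machine: pull it and attach a MySQL server container to it
docker pull psychemedia/mot2013data
docker run --name motdata psychemedia/mot2013data echo created
docker run --volumes-from motdata --name motdb -e MYSQL_ROOT_PASSWORD=mot -d mysql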

On the other hand, it is possible to mount a volume inside a container by running the container with a -v flag and specifying the mount point (docker volumes). Luis Elizondo’s scripts allow you to set up these data volume containers by running dmysql-server with the --with-volume flag as shown in the code snippet above, but at the moment I can’t see how to connect to a pre-existing data container. (See issue here.)

So given that we can mount volumes inside a linked-to data container, does this mean we have a route to portable data containers that could be shared via a docker datahub? It seems not… because, as it was, I wasted quite a bit more time learning that docker data container volumes can’t be saved as images! (To make a data container portable, I think we’d need to commit it as an image, share the image, then create a container from that image? Or do I misunderstand this aspect of the docker workflow too…?!)

That said, there does look to be a workaround in the form of flocker. However, I’m fed up with this whole thing for a bit now… What I hoped would be a quick demo of “get data into a docker MySQL container; package the container and put the image on dockerhub; pull down the image, create a container and start using the data immediately” turned into a saga of realising quite how much I don’t understand docker, what it does and how it does it.

I hate computers…. :-(

Written by Tony Hirst

January 13, 2015 at 6:39 pm

Posted in Tinkering


Calculating Churn in Seasonal Leagues


One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

C_t =  \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}

where C_t is the churn in team standings for year t, \left|f_{i,t} - f_{i,t-1}\right| is the absolute value of the i-th team’s change in finishing position going from season t-1 to season t, and N is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, C_t, by the maximum churn, C_max. The value of the maximum churn depends on whether there is an even or odd number of competitors:

C_{max} = N/2 \text{, for even N}

C_{max} = (N^2 - 1) / 2N \text{, for odd N}

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, f_{i,t} is the position of driver i at the end of race t, f_{i,t-1} is the starting position of driver i at the beginning of that race (that is, race t) and N is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an adjusted churn in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn functions of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
  if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))
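As a quick sanity check of those functions, consider a four car race in which the absolute changes between grid and finishing positions are 3, 1, 0 and 2: the churn is 6/4 = 1.5, the maximum churn for an even N of 4 is 2, and so the adjusted churn is 0.75:

d = c(3, 1, 0, 2)
churn(d)             # 1.5
churnmax(length(d))  # 2
adjchurn(d)          # 0.75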

I then used these functions to explore churn in a variety of contexts:

  • comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
  • churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
  • churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid, 
                position, positionText, positionOrder
          FROM results rs JOIN drivers d JOIN races r
          ON rs.driverId=d.driverId AND rs.raceId=r.raceId
          WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
            churn = churn(delta),
            adjchurn = adjchurn(delta)
            )

For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

Written by Tony Hirst

January 10, 2015 at 12:06 am

Posted in f1stats, Rstats

