OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Dockerising Open Data Databases – First Fumblings


A new year, and I’m trying to reclaim some joy, fun and tech mojo by getting back into the swing of appropriating stuff and seeing what we can do with it. My current universal screwdriver is Docker, so I spent an hour or two yesterday afternoon, and a large chunk of today, poking around various repos looking for interesting things to make use of.

The goal I set myself was to find a piece of the data wrangling jigsaw puzzle that would let me easily load a previously exported SQL database into MySQL running in a container; the idea being that we should be able to write a single command that points to a directory containing some exported SQL stuff, and then load that data straight into a container in which you can start querying it. (There’s possibly (probably?) a well known way of doing this for the folk who know how to do it, but what do I know….?!;-)

The solution I settled on was a set of Python scripts published by Luis Elizondo / luiselizondo [about] that wrap some docker commands to produce a command-line tool for firing up and populating MySQL instances, as well as connecting to them: docker-mysql-scripts.

The data I thought I’d play with is the 2013 UK MOT results database, partly because I found a database config file (that I think may originally have been put together by Ian Hopkinson). My own archive copy of the relevant scripts can be found here.

#Scripts: https://gist.github.com/psychemedia/86980a9e86a6de87a23b
#Create a container mot2013 with MySQL pwd: mot
#/host/path/to/ should contain:
## mot_boot.sql #Define the database tables and populate them
## test_result_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_result_2013.txt.gz 
## test_item_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_item_2013.txt.gz 
## mdr_test_lookup_tables/ #download and unzip: http://data.dft.gov.uk/anonymised-mot-test/mdr_test_lookup_tables.zip

#Create a container 'mot2013' with MYSQL password: mot
dmysql-server mot2013 mot

#Create a database: motdata
dmysql-create-database mot2013 motdata

#Populate the database using the mot_boot.sql script
dmysql-import-database mot2013 /host/path/to/mot_boot.sql --database motdata

I can then log in to the database with the command dmysql mot2013, connect to the appropriate database from the MySQL client with the SQL command USE motdata; and then run queries on the 2013 MOT data.
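
For example, a first poke around might go something like this (the table and column names in the SELECT are guesses on my part – check them against the output of SHOW TABLES and whatever mot_boot.sql actually defines):

dmysql mot2013

mysql> USE motdata;
mysql> SHOW TABLES;
#Assuming a results table called test_result, with a test_result outcome column...
mysql> SELECT test_result, COUNT(*) FROM test_result GROUP BY test_result;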

The scripts also allow the database contents to be managed via a separate data container:

#Create a new container called: test with an external data container
dmysql-server --with-volume test test
#This creates a second container - TEST_DATA - that manages the database files
#The name is given by: upper($CONTAINERNAME)+'_DATA'

#Create a dummy db
dmysql-create-database test testdb

#We can connect to the database as before
dmysql test
mysql> exit

#Delete this container, removing any attached volumes with the -v flag
docker rm -v -f test

#Create a new container connected to the original TEST_DATA container
docker run --volumes-from TEST_DATA --name test4 -e MYSQL_ROOT_PASSWORD="test" -d mysql

#Connect to this new container
dmysql test4
mysql> SHOW DATABASES;
#We should see the testdb database there...

#NOTE - I think that only one database server container can be connected to the TEST_DATA container at a time

So far, so confusing… Here’s a brief description of the current state of my understanding/confusion:

What really threw me is that the database container (either the original mot2013 container or the test database with the external data container) doesn’t appear to store the data inside the container itself. (So for example, the TEST_DATA container does not contain the database.) Instead, the data appears to be contained in a “hidden” volume that is mounted outside the container. I came a cropper with this originally by deleting containers using commands of the form docker rm [CONTAINER_NAME] and then finding that the docker VM was running out of memory. This deletes the container, but leaves a mounted volume (that is associated with the deleted container name) hanging around. To remove those volumes automatically, containers should be removed with commands of the form docker rm -v [CONTAINER_NAME]. What makes things difficult to tidy up is that the mounted volumes can’t be seen using the normal docker ps or docker ps -a commands; instead you need to install docker-volumes to identify them and delete them. (There’s a possible fix that shows how to store the data inside the container, rather than in an externally mounted volume – I think?! – linked to from this post, but I haven’t tried it because I was trying to use root Dockerfile images.)

The use of external data volumes also means we can’t easily bundle up a data container using docker commit and then hope to create new containers from it. (Such a model would allow you to spend an hour or two loading a large-ish data set into a database, then push an image of the data container containing that db to dockerhub; other users could then pull down that image, create a container from it and immediately attach it to a MySQL container without having to go through the pain of building the database themselves. This would provide a nifty way of sharing ready-to-query datasets such as the MOT database: you could just pull a data image, mount it as the data volume/container, and get started with running queries.)

On the other hand, it is possible to mount a volume inside a container by running the container with a -v flag and specifying the mount point (docker volumes). Luis Elizondo’s scripts allow you to set up these data volume containers by running dmysql-server with the --with-volume flag as shown in the code snippet above, but at the moment I can’t see how to connect to a pre-existing data container? (See issue here.)

So given that we can mount volumes inside a linked-to data container, does this mean we have a route to portable data containers that could be shared via a docker datahub? It seems not… because, as it was, I wasted quite a bit more time learning that docker data container volumes can’t be saved as images! (To make a data container portable, I think we’d need to commit it as an image, share the image, then create a container from that image? Or do I misunderstand this aspect of the docker workflow too…?!)
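
For what it’s worth, the sort of workflow I’d hoped for looks something like the following sketch (the image names are made up); the reason it doesn’t work is that docker commit only captures a container’s filesystem, not the contents of any volumes attached to it, so the volume-held database files never make it into the committed image:

#What I'd hoped might work - it doesn't, because the volume contents aren't committed
docker commit TEST_DATA psychemedia/testdata
docker push psychemedia/testdata

#...then, on some other machine...
docker pull psychemedia/testdata
docker run --name TEST_DATA2 psychemedia/testdata true
docker run --volumes-from TEST_DATA2 --name test5 -e MYSQL_ROOT_PASSWORD="test" -d mysql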

That said, there does look to be a workaround in the form of flocker. However, I’m fed up with this whole thing for a bit now… What I hoped would be a quick demo of: get data into a docker MySQL container; package the container and put the image on dockerhub; pull down the image, create a container and start using the data immediately, turned into a saga of realising quite how much I don’t understand docker, what it does and how it does it.

I hate computers…. :-(

Written by Tony Hirst

January 13, 2015 at 6:39 pm

Posted in Tinkering


Calculating Churn in Seasonal Leagues


One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

C_t =  \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}

where C_t is the churn in team standings for year t, \left|f_{i,t} - f_{i,t-1}\right| is the absolute value of the i-th team’s change in finishing position going from season t-1 to season t, and N is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, C_t, by the maximum churn, C_max. The value of the maximum churn depends on whether there is an even or odd number of competitors:

C_{max} = N/2 \text{, for even N}

C_{max} = (N^2 - 1) / 2N \text{, for odd N}

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, f_{i,t} is the position of driver i at the end of race t, f_{i,t-1} is the starting position of driver i at the beginning of that race (that is, race t) and N is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created a churn function of the form:

is.even = function(x) x %% 2 == 0

#Maximum churn value, which depends on whether the number of competitors is even or odd
churnmax=function(N)
  if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

#Churn is the mean absolute change in standings; adjusted churn normalises it
#by the maximum possible churn for that number of competitors
churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))
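
As a quick sanity check (my own toy example, not from the book), a race in which the finishing order exactly reverses the grid order should give the maximum possible churn:

#Grid-to-finish position changes for a five car race in which
#the finishing order exactly reverses the grid order
d = abs(c(1,2,3,4,5) - c(5,4,3,2,1))
churn(d)     # 2.4
adjchurn(d)  # 1, the maximum possible adjusted churn for N=5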

I then used these functions to explore churn in a variety of contexts:

  • comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
  • churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
  • churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid, 
                position, positionText, positionOrder
          FROM results rs JOIN drivers d JOIN races r
          ON rs.driverId=d.driverId AND rs.raceId=r.raceId
          WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
            churn = churn(delta),
            adjchurn = adjchurn(delta)
            )
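
A quick way to eyeball the resulting churn.df (my own throwaway sketch, not the book’s code) is to plot the adjusted churn for each round:

library(ggplot2)

#Adjusted churn (grid position vs race classification) by round, 2013 season
ggplot(churn.df, aes(x=round, y=adjchurn)) + geom_point() + geom_line()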

For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

Written by Tony Hirst

January 10, 2015 at 12:06 am

Posted in f1stats, Rstats


Validating Local Spending Data


In passing, I noticed on the Local Government Association (LGA) website a validator for checking the format of documents used to publish local council spending data, as well as various other data releases (contracts, planning applications, toilet locations, land holdings etc): LGA OpenData Schema Validator.

[Image: LGA_opendata_Schema_validator]

I wonder how many councils are publishing new releases that actually validate, and how many have “back-published” historical data releases using a format that validates?! When officers publish data files, I wonder how many of them even try to download and open the data files they have just published (to check the links work, the documents open as advertised, and also appear to contain what’s expected), let alone run either the uploaded or downloaded files through the validator (it often makes sense to do both: check the file validates before you publish it, then download it and check the downloaded version, just in case the publishing process has somehow mangled the file…)

Guidance for the spending data releases can be found here: Local government open data schemas: Spend

Documentation regarding the release of procurement and spending information (v. 1.1 dated 14/12/2014) can be found here: Local transparency guidance – publishing spending and procurement information.

I’ve still no real idea how to make interesting use of this data, or how DCLG expect folk to make use of it?!;-)

Written by Tony Hirst

January 7, 2015 at 2:42 pm

Posted in Open Data

Pondering Local Spending Data, Again…


Last night I saw a mention of a budget review consultation being held by the Milton Keynes Council. I’ve idly wondered before about whether spending data could be used to inform these consultations, for example by roleplaying what the effects of a cut to a particular spending area might be at a transactional level. (For what it’s worth, I’ve bundled together the Milton Keynes spending data into a single (but uncleaned) CSV file here and posted the first couple of lines of a data conversation with it here. One of the things I realised is that I still don’t know how to visualise data by financial year, so I guess I need to spend some time looking at pandas timeseries support).
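
(A note to self, and very much an untested sketch: assuming the combined CSV has a payment date column and an amount column – the column names and filename below are pure guesses – pandas’ anchored annual periods look as if they should handle the financial year grouping:)

import pandas as pd

#Load the combined spending data; column names are guesses at this point...
df = pd.read_csv('mk_spending_all.csv', parse_dates=['Date'])

#UK financial years run April to March, so use annual periods anchored on March
df['fy'] = df['Date'].dt.to_period('A-MAR')
print(df.groupby('fy')['Amount'].sum())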

Another transparency/spending data story that caught my eye over the break was news of how Keighley Town Council had been chastised for its behaviour around various transparency issues (see for example the Audit Commission Report in the public interest on Keighley Town Council). Among other things, it seems that the council had “entered into a number of transactions with family members of Councillors and employees” (which makes me think that an earlier experiment I dabbled with that tried to reconcile councillor names with: a) directors of companies in general; b) directors of companies that trade with a council may be a useful tool to work up a bit further). They had also been lax in ensuring “appropriate arrangements were in place to deal with non-routine transactions such as the recovery of overpayments made to consultants”. I’ve noted before that where a council publishes all its spending data, not just amounts over £500, including negative payments, there may be interesting things to learn (eg Negative Payments in Local Spending Data).

It seems that the Audit Commission report was conducted in response to a request from a local campaigner (Keighley investigation: How a grandmother blew whistle on town council [Yorkshire Post, 20/12/14]). As you do, I wondered whether the spending data might have thrown up any useful signals about the affairs the auditors – and local campaigners – took issue with. The Keighley Town Council website doesn’t make it obvious where the spending data can be found – the path you need to follow is Committees, then Finance and Audit, then Schedule of payments over £500 – and even then I can’t seem to find any data for the current financial year.

The data itself is published using an old Microsoft Office .doc format:

[Image: keighley_spending]

The extent of the data that is published is not brilliant… In terms of usefulness, this is pretty low quality stuff…

[Image: keighley_spendData]

Getting the data, such as it is, into a canonical form is complicated by the crappy document format, though it’s not hard to imagine how such a thing came to be generated (council clerk sat using an old Pentium-powered desktop and Windows 95, etc etc ;-). Thanks to a tip-off from Alex Dutton, unoconv can convert the docs into a more usable format (apt-get update ; apt-get install -y libreoffice ; apt-get install -y unoconv); so for example, unoconv -f html 2014_04.doc converts the specified .doc file to an HTML document. (I also had a look at getting convertit, an HTTP-serverised version of unoconv, working in a docker container, but it wouldn’t build properly for me? Hopefully a tested version will appear on dockerhub at some point…:-)
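
(Converting a whole year’s worth of schedules is then just a one-liner – assuming the files all follow the same naming pattern as the example above:)

#Batch convert all the 2014 payment schedules in the current directory to HTML
for f in 2014_*.doc; do unoconv -f html "$f"; done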

This data still requires scraping of course… but I’m bored already…
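
(If the boredom ever passes, one possible – completely untested – route might be to point pandas at the converted HTML files and see whether it can pull out the payment tables, assuming unoconv actually renders them as HTML tables:)

import pandas as pd

#read_html returns a list of dataframes, one per HTML table it finds in the page
with open('2014_04.html') as f:
    tables = pd.read_html(f.read())

#Eyeball the first candidate table, if there is one...
print(tables[0].head())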

PS I’m wondering if it would be useful to skim through some of the Audit Commission’s public interest reports to fish for ideas about interesting things to look for in the spending data?

Written by Tony Hirst

January 6, 2015 at 12:58 pm

Posted in opengov, Open Data

Book Extras – Data Files, Code Files and a Dockerised Application


Idling through the LeanPub documentation last night, I noticed that they support the ability to sell digital extras, such as bundled code files or data files. Along with the base book sold at one price, additional extras can be bundled into packages alongside the original book and sold at another (higher) price. As with the book sales, two price points are supported: a minimum price and a recommended price.

It was easy enough to create a bundle of sample code and data files to support the Wrangling F1 Data With R book and add them as an extras package bundled with the book for an extra dollar or so.

[Image: leanpub_extras]

This approach makes it slightly easier to distribute file bundles to support a book, but it still requires a reader to do some work in configuring their own development environment.

In an attempt to make this slightly easier, I also started exploring ways of packaging and distributing a preconfigured virtual machine that contains the tools – as well as code and data examples – that are required in order to try out the data wrangling approaches described in the book. (I’m starting to see more and more technical books supported by virtual machines, and can imagine this approach becoming a standard way of augmenting practical texts.) In particular, I needed a virtual machine that could run RStudio and that would be preloaded with libraries that would support working with SQLite data files and generating ggplot2 charts.

The route I opted for was to try out a dockerised application. The rocker/hadleyverse Docker image bundles a wide variety of useful R packages into a container along with RStudio and a base R installation. Building on top of this image, I created a small Dockerfile that additionally loaded in the example code and data files from the book extras package – psychemedia/wranglingf1data.

# Wrangling F1 Data Dockerfile
#
# https://github.com/psychemedia/wranglingf1data-docker
#

# Pull RStudio base image with handy packages...
FROM rocker/hadleyverse

#Create a directory to create a new project from
RUN mkdir -p /home/rstudio/wranglingf1data
RUN chmod a+rw /home/rstudio/wranglingf1data

#Populate the project-directory-to-be with ergast API supporting code and data
ADD ergastR-core.R /home/rstudio/wranglingf1data/ergastR-core.R
ADD ergastdb13.sqlite /home/rstudio/wranglingf1data/ergastdb13.sqlite

#Populate the project-directory-to-be with an additional data source
ADD scraperwiki.sqlite /home/rstudio/wranglingf1data/scraperwiki.sqlite

Building and running this Dockerfile (for example, using boot2docker) produces a containerised application, preconfigured to support the book and available via a web browser. Instructions for downloading and running the container can be found in the psychemedia/wranglingf1data-docker repository.
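
As a rough illustration (a sketch based on the rocker image defaults rather than the repository’s definitive instructions – the port and login details may differ), launching it looks something like this:

#Pull and run the container, exposing RStudio on port 8787
docker run -d -p 8787:8787 --name f1data psychemedia/wranglingf1data
#Find the IP address of the boot2docker VM...
boot2docker ip
#...then point a browser at http://BOOT2DOCKER.IP.ADDRESS:8787 to log in to RStudio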

I also added instructions for using the Dockerised application to the book extras package as part of its README file.

Written by Tony Hirst

January 5, 2015 at 3:40 pm

Posted in OU2.0, Rstats

Custom Gridlines and Line Guides in R/ggplot Charts


In the last quarter of last year, I started paying more attention to the use of custom grid lines and line guides in charts I’ve been developing for the Wrangling F1 Data With R book.

The use of line guides was in part inspired by canopy views from within the cockpit of one of the planes that make up the Red Arrows aerobatic display team.

[Image: SCA-07-247-RAW-UNC-]

A little bit of digging suggested that the lines on the cockpit are actually an explosive cord used to shatter the cockpit if a pilot needs to eject from the aircraft – but I can’t believe that the pilots don’t also use the lines as a crib for helping position themselves with respect to the other aircraft in the team? (I emailed the Red Arrows press office to ask about the extent to which the cockpit lines are used in this way but got what amounted to a NULL response.)

Whatever the case, it seems eminently sensible to me that we make use of line guides to help us read charts more effectively, where it makes sense to, or to use guides as stepping stones to more refined data views.

The following example shows how we can generate a two-dimensional grid based on the discrete points allocations possible for drivers placed in the top 10 positions of a Formula One race.

The grid lines show allowable points values, and are constructed as follows:

NEWPOINTS =c(25,18,15,12,10,8,6,4,2,1)

#The newpoints dataframe has two columns
#The first column indicates points available, in order
#The second column is the maximum number of points the lower placed driver could score
newpoints=data.frame(x=c(NEWPOINTS,0),y=c(NEWPOINTS[-1],0,0))

baseplot_singleWay=function(g){
  g=g+xlab('Highest placed driver points')+ylab('Lower placed driver points')
  g=g+scale_y_continuous(breaks = newpoints$x,minor_breaks=NULL) 
  g=g+scale_x_continuous(breaks = newpoints$x,minor_breaks=NULL)
  g=g+coord_fixed()
  g=g+guides(size=FALSE)+theme_bw()
  g
}

g=baseplot_singleWay(ggplot())
g

The final chart (of which this is a “one-sided” example) is used to display a count of the number of races in which at least one of the two drivers in a particular team scores points.

A fixed co-ordinate scheme is adopted to ensure that points are separated consistently on the x and y axes. A dotted red line shows the maximum number of points the lower placed driver in a team could score, depending on the number of points scored by their higher placed team mate.

#Add in the maximum points line guide

g+geom_line(data=newpoints,aes(x=x,y=y),col='red',linetype='dotted')

[Image: pointsPerformance-basechartAnnotated-1]

A two-sided version of the chart is also possible in which the x-axis represents the points scored in a particular race by one named driver and the y-axis represents the points scored by another named driver in the same race. The two-sided chart has two guidelines, representing the maximum points that can be scored by the other driver in the event of one particular driver being higher placed in a race.
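
(One way of generating the pair of guides – a sketch rather than the book’s actual code – is simply to add the guide line twice, swapping the x and y aesthetics the second time round:)

#Add both maximum points line guides to a two-sided base chart
g2 = baseplot_singleWay(ggplot())
g2 + geom_line(data=newpoints, aes(x=x, y=y), col='red', linetype='dotted') +
     geom_line(data=newpoints, aes(x=y, y=x), col='red', linetype='dotted')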

[Image: pointsPerformance-pointsPerformanceChart-2way-1]

A full description can be found in the Points Performance Charts chapter of the Wrangling F1 Data With R book.

Written by Tony Hirst

January 2, 2015 at 9:09 pm

Posted in Rstats

Fragments – Wikipedia to Markdown


I’ve been sketching some ideas, pondering the ethics of doing an F1 review style book blending (openly licensed) content from Wikipedia race reports with some of my own f1datajunkie charts, and also wondering about the extent to which I could automatically generate Wikipedia style race report sentences from the data; I think the sentence generation, in general, should be quite easy – the harder part would be identifying the “interesting” sentences (that is, the ones that make it into the report, rather than the totality of ones that could be generated).

So far, my sketches have been based around just grabbing the content from Wikipedia and transforming it to markdown, the markup language used in the Leanpub workflow.

In Python 3.x at least, I came across some encoding issues, and couldn’t seem to identify Wikipedia page sections. For what it’s worth, a minimal scribble looks something like this:

!pip3 install wikipedia
import wikipedia

#Search for page titles on Wikipedia
wikipedia.search('2014 Australian grand prix')

#Load a page
f1=wikipedia.page('2014 Australian Grand Prix')

#Preview page content
f1.content

#Preview a section's content by section header
f1.section('Qualifying')
##For some reason, f1.sections shows an empty list for me?


#pandoc supports Wikimedia to markdown conversion
!apt-get -y install pandoc
!pip3 install pypandoc
import pypandoc

#To work round encoding issues, write the content to a file and then convert it...
f = open('delme1.txt', 'w', encoding='utf8')
f.write(f1.content)
f.close()

md=pypandoc.convert('delme1.txt', 'md', format='mediawiki')

If the Formula One race report pages follow similar templates and use similar headings, then it should be straightforward enough to pull down sections of the reports and interleave them with charts and tables. (As well as the issue of parsing out section headers to fill the sections list, the tables on the page don’t appear to be grabbed into the .content field – assuming the API wrapper manages to grab that content down at all? However, I can easily recreate those from things like the ergast API.)
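
As a crude sketch of the sort of interleaving I have in mind (assuming the section grabbing can be made to work, which as noted it currently doesn’t for me – the chart/table slots are just placeholders):

#Rough sketch: pull named sections from a race report page and interleave
#them with placeholder slots for my own charts and tables
def race_chapter(pagetitle, sections=['Qualifying', 'Race']):
    page = wikipedia.page(pagetitle)
    chunks = ['# ' + pagetitle]
    for s in sections:
        txt = page.section(s)  #May be None if the section can't be found
        if txt:
            chunks.append('## ' + s + '\n\n' + txt)
        chunks.append('<!-- TODO: insert corresponding f1datajunkie chart/table here -->')
    return '\n\n'.join(chunks)

#md = race_chapter('2014 Australian Grand Prix')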

Looking at the construction of sentences in the race reports, many of them are formulaic. However, as noted above, generating sentences is one thing, but generating interesting sentences is another. For that, I think we need to identify sets of rules that mark data features out as interesting or not before generating sentences from them.

Written by Tony Hirst

December 30, 2014 at 11:46 pm

Posted in Anything you want
