Pondering Local Spending Data, Again…

Last night I saw a mention of a budget review consultation being held by Milton Keynes Council. I’ve idly wondered before about whether spending data could be used to inform these consultations, for example by roleplaying, at a transactional level, what the effects of a cut to a particular spending area might be. (For what it’s worth, I’ve bundled together the Milton Keynes spending data into a single (but uncleaned) CSV file here and posted the first couple of lines of a data conversation with it here. One of the things I realised is that I still don’t know how to visualise data by financial year, so I guess I need to spend some time looking at pandas timeseries support.)
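
As an aside, the financial year bucketing itself is straightforward enough, whichever plotting library ends up being used. A minimal sketch in R, assuming hypothetical Date and Amount columns and a dd/mm/yyyy date format in the combined CSV file:

library(ggplot2)

# The file name, column names (Date, Amount) and date format are all
# assumptions about the combined CSV file, and would need checking
spend = read.csv("mk_spending_all.csv", stringsAsFactors = FALSE)
spend$Date = as.Date(spend$Date, format = "%d/%m/%Y")

# Transactions dated January to March belong to the financial year
# that started the previous April
yr = as.integer(format(spend$Date, "%Y"))
fyStart = ifelse(as.integer(format(spend$Date, "%m")) >= 4, yr, yr - 1)
spend$fy = sprintf("%d/%02d", fyStart, (fyStart + 1) %% 100)

# Total spend by financial year
totals = aggregate(Amount ~ fy, data = spend, sum)
ggplot(totals) + geom_bar(aes(x = fy, y = Amount), stat = "identity")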

Another transparency/spending data story that caught my eye over the break was news of how Keighley Town Council had been chastised for its behaviour around various transparency issues (see for example the Audit Commission Report in the public interest on Keighley Town Council). Among other things, it seems that the council had “entered into a number of transactions with family members of Councillors and employees”, which makes me think that an earlier experiment I dabbled with – trying to reconcile councillor names with a) directors of companies in general, and b) directors of companies that trade with a council – may be a useful tool to work up a bit further. They had also been lax in ensuring that “appropriate arrangements were in place to deal with non-routine transactions such as the recovery of overpayments made to consultants”. I’ve noted before that where a council publishes all its spending data, not just amounts over £500, including negative payments, there may be interesting things to learn (eg Negative Payments in Local Spending Data).
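
The heart of that sort of reconciliation experiment is just fuzzy name matching. A crude sketch in R, with made-up names standing in for the real lists (real data would need a lot of normalisation of initials, titles and name order first):

# Crude fuzzy matching of councillor names against company director names
# using base R's agrep; both vectors are made-up placeholders
councillors = c("J Smith", "A N Other")
directors = c("John Smith", "Jane Doe", "A. N. Other")

matches = lapply(councillors, function(n) {
  agrep(n, directors, max.distance = 0.2, ignore.case = TRUE, value = TRUE)
})
names(matches) = councillors
matches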

It seems that the Audit Commission report was conducted in response to a request from a local campaigner (Keighley investigation: How a grandmother blew whistle on town council [Yorkshire Post, 20/12/14]). As you do, I wondered whether the spending data might have sent up any useful signals about the affairs the auditors – and local campaigners – took issue with. The Keighley Town Council website doesn’t make it obvious where the spending data can be found – the path you need to follow is Committees, then Finance and Audit, then Schedule of payments over £500 – and even then I can’t seem to find any data for the current financial year.

The data itself is published using an old Microsoft Office .doc format:

[Image: keighley_spending]

The extent of the data that is published is not brilliant either – in terms of usefulness, this is pretty low quality stuff…

[Image: keighley_spendData]

Getting the data, such as it is, into a canonical form is complicated by the crappy document format, though it’s not hard to imagine how such a thing came to be generated (council clerk sat using an old Pentium-powered desktop running Windows 95, etc etc ;-). Thanks to a tip-off from Alex Dutton, unoconv can convert the docs into a more usable format (apt-get update ; apt-get install -y libreoffice ; apt-get install -y unoconv); for example, unoconv -f html 2014_04.doc converts the specified .doc file to an HTML document. (I also had a look at getting convertit, an HTTP serverised version of unoconv, working in a Docker container, but it wouldn’t build properly for me. Hopefully a tested version will appear on Docker Hub at some point… :-)
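
In the meantime, looping the conversion over a folder of downloaded files is easy enough – for example, shelling out from R (assuming unoconv is installed and on the path):

# Convert every downloaded .doc schedule in the working directory to HTML
# by shelling out to unoconv
for (f in list.files(pattern = "\\.doc$")) {
  system(paste("unoconv -f html", shQuote(f)))
}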

This data still requires scraping of course… but I’m bored already…
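
For anyone less bored, a starting point might be to pull whatever tables survive the conversion out of the HTML files – something along these lines, though the resulting columns would undoubtedly need cleaning by hand:

# Extract any tables from the converted HTML files as a list of dataframes;
# how cleanly this works depends on how unoconv rendered the original .doc files
library(XML)

htmlFiles = list.files(pattern = "\\.html$")
tables = unlist(lapply(htmlFiles, function(f) {
  readHTMLTable(f, stringsAsFactors = FALSE)
}), recursive = FALSE)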

PS I’m wondering if it would be useful to skim through some of the Audit Commission’s public interest reports to fish for ideas about interesting things to look for in the spending data?

Book Extras – Data Files, Code Files and a Dockerised Application

Idling through the LeanPub documentation last night, I noticed that they support the ability to sell digital extras, such as bundled code files or data files. Along with the base book sold at one price, additional extras can be bundled into packages alongside the original book and sold at another (higher) price. As with the book itself, two price points are supported: a minimum price and a recommended price.

It was easy enough to create a bundle of sample code and data files to support the Wrangling F1 Data With R book and add them as an extras package bundled with the book for an extra dollar or so.

[Image: leanpub_extras]

This approach makes it slightly easier to distribute file bundles to support a book, but it still requires a reader to do some work in configuring their own development environment.

In an attempt to make this slightly easier, I also started exploring ways of packaging and distributing a preconfigured virtual machine that contains the tools – as well as code and data examples – that are required in order to try out the data wrangling approaches described in the book. (I’m starting to see more and more technical books supported by virtual machines, and can imagine this approach becoming a standard way of augmenting practical texts.) In particular, I needed a virtual machine that could run RStudio and that would be preloaded with libraries that would support working with SQLite data files and generating ggplot2 charts.

The route I opted for was to try out a dockerised application. The rocker/hadleyverse Docker image bundles a wide variety of useful R packages into a container along with RStudio and a base R installation. Building on top of this image, I created a small Dockerfile that additionally loaded in the example code and data files from the book extras package – psychemedia/wranglingf1data.

# Wrangling F1 Data Dockerfile
#
# https://github.com/psychemedia/wranglingf1data-docker
#

# Pull RStudio base image with handy packages...
FROM rocker/hadleyverse

# Create a directory from which a new project can be created
RUN mkdir -p /home/rstudio/wranglingf1data
RUN chmod a+rw /home/rstudio/wranglingf1data

# Populate the project-directory-to-be with ergast API supporting code and data
ADD ergastR-core.R /home/rstudio/wranglingf1data/ergastR-core.R
ADD ergastdb13.sqlite /home/rstudio/wranglingf1data/ergastdb13.sqlite

# Populate the project-directory-to-be with an additional data source
ADD scraperwiki.sqlite /home/rstudio/wranglingf1data/scraperwiki.sqlite

Building an image from this Dockerfile and running it (for example, using boot2docker) produces a containerised application, preconfigured to support the book and available via a web browser. Instructions for downloading and running the container can be found in the psychemedia/wranglingf1data-docker repository.
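
As a quick smoke test once the container is up and RStudio is open in the browser, something like the following should confirm that the bundled files and SQLite support are in place:

# Check the bundled example files are where the Dockerfile put them
# and that the SQLite support works
library(DBI)
library(RSQLite)

list.files("~/wranglingf1data")

con = dbConnect(SQLite(), path.expand("~/wranglingf1data/ergastdb13.sqlite"))
dbListTables(con)
dbDisconnect(con)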

I also added instructions for using the Dockerised application to the book extras package as part of its README file.

Custom Gridlines and Line Guides in R/ggplot Charts

In the last quarter of last year, I started paying more attention to the use of custom grid lines and line guides in charts I’ve been developing for the Wrangling F1 Data With R book.

The use of line guides was in part inspired by canopy views from within the cockpit of one of the planes that make up the Red Arrows aerobatic display team.

[Image: SCA-07-247-RAW-UNC-]

A little bit of digging suggested that the lines on the canopy are actually an explosive cord used to shatter it if a pilot needs to eject from the aircraft – but I can’t believe that the pilots don’t also use the lines as a crib to help position themselves with respect to the other aircraft in the team? (I emailed the Red Arrows press office to ask about the extent to which the cockpit lines are used in this way, but got what amounted to a NULL response.)

Whatever the case, it seems eminently sensible to me that we make use of line guides to help us read charts more effectively, where it makes sense to, or to use guides as stepping stones to more refined data views.

The following example shows how we can generate a two-dimensional grid based on the discrete points allocations available for the top 10 finishing positions of a Formula One race.

The grid lines show allowable points values, and are constructed as follows:

library(ggplot2)

NEWPOINTS = c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)

# The newpoints dataframe has two columns:
# the first column gives the points available, in descending order;
# the second column gives the maximum number of points the lower placed driver could score
newpoints = data.frame(x = c(NEWPOINTS, 0), y = c(NEWPOINTS[-1], 0, 0))

baseplot_singleWay = function(g) {
  g = g + xlab('Highest placed driver points') + ylab('Lower placed driver points')
  g = g + scale_y_continuous(breaks = newpoints$x, minor_breaks = NULL)
  g = g + scale_x_continuous(breaks = newpoints$x, minor_breaks = NULL)
  g = g + coord_fixed()
  g = g + guides(size = FALSE) + theme_bw()
  g
}

g = baseplot_singleWay(ggplot())
g

The final chart (of which this is a “one-sided” example) is used to display a count of the number of races in which at least one of the two drivers in a particular team scored points. The horizontal x-axis represents the number of points scored by the highest placed driver of the pair in a given race, and the y-axis the number of points scored by their lower placed team mate.

A fixed co-ordinate scheme is adopted to ensure that points are separated consistently on the x- and y-axes. A dotted red line shows the maximum number of points the lower placed driver in a team could score, depending on the number of points scored by their higher placed team mate.

# Add in the maximum points line guide
g + geom_line(data = newpoints, aes(x = x, y = y), col = 'red', linetype = 'dotted')

[Image: pointsPerformance-basechartAnnotated-1]

A two-sided version of the chart is also possible, in which the x-axis represents the points scored in a particular race by one named driver and the y-axis represents the points scored in the same race by another named driver. The two-sided chart has two guide lines, each representing the maximum number of points the other driver could score in the event of one particular driver being the higher placed of the two in a race.
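
The full construction is described in the book, but the essence of the two-sided version is just the base grid with the guide line drawn both ways round – a minimal sketch, with placeholder axis labels standing in for the two driver names:

# Two-sided variant: the same points grid on both axes, with the maximum
# points guide drawn once for each driver being the higher placed of the pair
baseplot_twoWay = function(g) {
  g = g + xlab('First driver points') + ylab('Second driver points')
  g = g + scale_x_continuous(breaks = newpoints$x, minor_breaks = NULL)
  g = g + scale_y_continuous(breaks = newpoints$x, minor_breaks = NULL)
  g = g + coord_fixed()
  g = g + guides(size = FALSE) + theme_bw()
  # Guide for when the first driver is the higher placed of the two...
  g = g + geom_line(data = newpoints, aes(x = x, y = y), col = 'red', linetype = 'dotted')
  # ...and its mirror image for when the second driver is higher placed
  g = g + geom_line(data = newpoints, aes(x = y, y = x), col = 'red', linetype = 'dotted')
  g
}

baseplot_twoWay(ggplot())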

[Image: pointsPerformance-pointsPerformanceChart-2way-1]

A full description can be found in the Points Performance Charts chapter of the Wrangling F1 Data With R book.