OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Code as Magic, and the Vernacular of Data Wrangling Verbs

It’s been some time now since I drafted most of my early unit contributions to the TM351 Data management and analysis course. Part of the point (for me) in drafting that material was to find out what sorts of thing we actually wanted to say and help identify the sorts of abstractions we wanted to then build a narrative around. Another part of this (for me) means exploring new ways of putting powerful “academic” ideas and concepts into meaningful, contexts; finding new ways to describe them; finding ways of using them in conjunction with other ideas; or finding new ways of using – or appropriating them – in general (which in turn may lead to new ways of thinking about them). These contexts are often process based, demonstrating how we can apply the ideas or put them to use (make them useful…) or use the ideas to support problem identification, problem decomposition and problem solving. At heart, I’m more of a creative technologist than a scientist or an engineer. (I aspire to being an artist…;-)

Someone who I think has a great take on conceptualising the data wrangling process – in part arising from his prolific tool building approach in the R language – is Hadley Wickham. His recent work for RStudio is built around an approach to working with data that he’s captured as follows (e.g. “dplyr” tutorial at useR 2014 , Pipelines for Data Analysis):

Following an often painful and laborious process of getting data into a state where you can actually start to work with it), you can then enter into an iterative process of transforming the data into various shapes and representations (often in the sense of re-presentations) that you can easily visualise or build models from. (In practice, you may have to keep redoing elements of the tidy step and then re-feed the increasingly cleaned data back into the sensemaking loop.)

Hadley’s take on this is that the visualisation phase can spring surprises on you but doesn’t scale very well, whilst the modeling phase scales but doesn’t surprise you.

To support the different phases of activity, Hadley has been instrumental in developing several software libraries for the R programming language that are particular suited to the different steps. (For the modeling, there are hundreds of community developed and often very specialised R libraries for doing all manner of weird and wonderful statistics…)

In many respects, I’ve generally found the way Hadley has presented his software libraries to be deeply pragmatic – the tools he’s developed are useful and in many senses naturalistic; they help you do the things you need to do in a way that makes practical sense. The steps they encourage you to take are natural ones, and useful ones. They are the sorts of tools that implement the sorts of ideas that come to mind when you’re faced with a problem and you think: this is the sort of thing I need (to be able) to do. (I can’t comment on how well implemented they are; I suspect: pretty well…)

Just as the data wrangling process diagram helps frame the sorts of things you’re likely to do into steps that make sense in a “folk computational” way (in the sense of folk computing or folk IT (also here), a computational correlate to notions of folk physics, for example), Hadley also has a handy diagram for helping us think about the process of solving problems computationally in a more general, problem solving sense:

A cognitive think it step, identifying a problem, and starting to think about what sort of answer you want from it, as well as how you might start to approach it; a describe it step, where you describe precisely what it is you want to do (the sort of step where you might start scribbling pseudo-code, for example); and the computational do it step where the computational grunt work is encoded in a way that allows it to actually get done by machine.

I’ve been pondering my own stance towards computing lately, particularly from my own context of someone who sees computery stuff from a more technology, tool building and tool using context, (that is, using computery things to help you do useful stuff), rather than framing it as a purer computer science or even “trad computing” take on operationalised logic, where the practical why is often ignored.

Figuring out what the hell it is you want to do (imagining, the what for a particular why), figuring out how to do it (precisely; the programming step; the how); hacking that idea into a form that lets a machine actually do it for you (the coding step; the step where you express the idea in a weird incantation where every syllable has to be the right syllable; and from which the magic happens).

One of the nice things about Hadley’s approach to supporting practical spell casting (?!) is that transformation or operational steps his libraries implement are often based around naturalistic verbs. They sort of do what they say on the tin. For example, in the dplyr toolkit, there are the following verbs:

These sort of map onto elements (often similarly named) familiar to anyone who has used SQL, but in a friendlier way. (They don’t SHOUT AT YOU for a start.) It almost feels as if they have been designed as articulations of the ideas that come to mind when you are trying to describe (precisely) what it is you actually want to do to a dataset when working on a particular problem.

In a similar way, the ggvis library (the interactive chart reinvention of Hadley’s ggplot2 library) builds on the idea of Leland Wilkinson’s “The Grammar of Graphics” and provides a way of summoning charts from data in an incremental way, as well as a functionally and grammatically coherent way. The words the libraries use encourage you to articulate the steps you think you need to take to solve a problem – and then, as if by magic, they take those steps for you.

If programming is the meditative state you need to get into to cast a computery-thing spell, and coding is the language of magic, things like dplyr help us cast spells in the vernacular.

Written by Tony Hirst

February 11, 2015 at 3:10 pm

Rediscovering Formula One Race Battlemaps

A couple of days ago, I posted a recipe on the F1DataJunkie blog that described how to calculate track position from laptime data.

Using that information, as well as additional derived columns such as the identity of, and time to, the cars immediately ahead of and behind a particular selected driver, both in terms of track position and race position, I revisited a chart type I first started exploring several years ago – race battle charts.

The main idea behind the battlemaps is that they can help us search for stories amidst the runners.

dirattr=function(attr,dir='ahead') paste(attr,dir,sep='')

#We shall find it convenenient later on to split out the initial data selection
battlemap_df_driverCode=function(driverCode){
lapTimes[lapTimes['code']==driverCode,]
}

car_X=dirattr('car_',dir)
code_X=dirattr('code_',dir)
factor_X=paste('factor(position_',dir,'<position)',sep='')
code_race_X=dirattr('code_race',dir)

g=g+geom_hline(aes_string(yintercept=drs),linetype=5,col='grey')

#Plot the offlap cars that aren't directly being raced
g=g+geom_text(data=df[df[dirattr('code_',dir)]!=df[dirattr('code_race',dir)],],
aes_string(x='lap',
y=car_X,
label=code_X,
col=factor_X),
angle=45,size=2)
#Plot the cars being raced directly
g=g+geom_text(data=df,
aes_string(x='lap',
y=diff_X,
label=code_race_X),
angle=45,size=2)
g+guides(col=guide_legend(title='Intervening car'))

}

battle_WEB=battlemap_df_driverCode('WEB')
battlemap_core_chart(battle_WEB,g,dir='behind')


In this first sketch, from the 2012 Australian Grand Prix, I show the battlemap for Mark Webber:

We see how at the start of the race Webber kept pace with Alonso, albeit around about a second behind, at the same time as he drew away from Massa. In the last third of the race, he was closely battling with Hamilton whilst drawing away from Alonso. Coloured labels are used to highlight cars on a different lap (either ahead (aqua) or behind (orange)) that are in a track position between the selected driver and the car one place ahead or behind in terms of race position (the black labels). The y-axis is the time delta in milliseconds between the selected car and cars ahead (y > 0) or behind (y < 0). A dashed line at the +/- one second mark identifies cars within DRS range.

As well as charting the battles in the vicinity of a particular driver, we can also chart the battle in the context of a particular race position. We can reuse the chart elements and simply need to redefine the filtered dataset we are charting.

For example, if we filter the dataset to just get the data for the car in third position at the end of each lap, we can then generate a battle map of this data.

battlemap_df_position=function(position){
lapTimes[lapTimes['position']==position,]
}

battleForThird=battlemap_df_position(3)

g=battlemap_core_chart(battleForThird,ggplot(),dir='behind')+xlab(NULL)+theme_bw()
g

For more details, see the original version of the battlemap chapter. For updates to the chapter, I recommend that you invest in a copy Wrangling F1 Data With R book if you haven’t already done so:-)

Written by Tony Hirst

January 31, 2015 at 8:52 pm

Posted in Rstats

Tagged with ,

Connecting RStudio and MySQL Docker Containers – an example using the ergast db

building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio:

• install these docker-mysql-scripts
• run boot2docker
• from the boot2docker shell, start up a MySQL server (ergastdb) with password f1: dmysql-server ergastdb f1 By default, this exposes port 3306
• create an new empty database (f1db): dmysql-create-database ergastdb f1db
• add the ergast data to it: dmysql-import-database ergastdb /path/to/ergastdb/f1db.sql --database f1db
• fire up a copy of RStudio, in this case using my psychemedia/wranglingf1data container, linking it to the MySQL database which has the alias db: docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data
• in RStudio, import the RMySQL library: library(RMySQL)
• in RStudio, connect to the database: con=dbConnect(MySQL(),user='root',password='f1',host='db',port=3306,dbname='f1db')
• in RStudio, run a test query: dbQuery(con,'SHOW TABLES');

I guess what I need to do now is pull the various bits into another script to make it a one-liner, perhaps with a few switches? For example, to create the database if it doesn’t exist, to download the ergast database file automatically, to populate the database for the first time, or update it with a more recent copy of the database, to fire up both containers and make sure they are appropriately linked etc. This would dramatically simplify things for use in the context of the Wrangling F1 Data With R book, for example. (If you beat me to it, please post details in the comments below.)

PS Hmm…. seems I get a UTF-8 encoding issue:

Not sure if this is with the database, or the RMySQL connector? Anyone got any ideas of a fix?

Ah ha – sort of via SO:

Running dbGetQuery(con,'SET NAMES utf8;') before querying seems to do the trick…

Written by Tony Hirst

January 17, 2015 at 7:32 pm

Posted in Rstats

Tagged with ,

Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

$C_t = \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}$

where $C_t$ is the churn in team standings for year $t$, $\left|f_{i,t} - f_{i,t-1}\right|$ is the absolute value of the $i$-th team’s change in finishing position going from season $t-1$ to season $t$, and $N$ is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, $C_t$, by the maximum churn, $C_max$. The value of the maximum churn depends on whether there is an even or odd number of competitors:

$C_{max} = N/2 \text{, for even N}$

$C_{max} = (N^2 - 1) / 2N \text{, for odd N}$

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, $f_{i,t}$ is the position of driver $i$ at the end of race $t$, $f_{i,t-1}$ is the starting position of driver $i$ at the beginning of that race (that is, race $t$) and $N$ is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn function of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

• comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
• churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
• churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid,
position, positionText, positionOrder
FROM results rs JOIN drivers d JOIN races r
ON rs.driverId=d.driverId AND rs.raceId=r.raceId
WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
churn = churn(delta),
)


For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

Written by Tony Hirst

January 10, 2015 at 12:06 am

Posted in f1stats, Rstats

Tagged with

Book Extras – Data Files, Code Files and a Dockerised Application

Idling through the LeanPub documentation last night, I noticed that they support the ability to sell digital extras, such as bundled code files or datafiles. Along with the base book sold at one price, additional extras can be bundled into packages alongside the original book and sold at another (higher) price. As with the book sales, two price points are supported: the minimum price and a recommended price.

It was easy enough to create a bundle of sample code and data files to support the Wrangling F1 Data With R book and add them as an extras package bundled with the book for an extra dollar or so.

This approach makes it slightly easier to distribute file bundles to support a book, but it still requires a reader to do some work in configuring their own development environment.

In an attempt to make this slightly easier, I also started exploring ways of packaging and distributing a preconfigured virtual machine that contains the tools – as well as code and data examples – that are required in order to try out the data wrangling approaches described in the book. (I’m starting to see more and more technical books supported by virtual machines, and can imagine this approach becoming a standard way of augmenting practical texts.) In particular, I needed a virtual machine that could run RStudio and that would be preloaded with libraries that would support working with SQLite data files and generating ggplot2 charts.

The route I opted for was to try out a dockerised application. The rocker/hadleyverse Docker image bundles a wide variety of useful R packages into a container along with RStudio and a base R installation. Building on top of this image, I created a small Dockerfile that additionally loaded in the example code and data files from the book extras package – psychemedia/wranglingf1data.

# Wrangling F1 Data Dockerfile
#
# https://github.com/psychemedia/wranglingf1data-docker
#

# Pull RStudio base image with handy packages...

#Create a directory to create a new project from
RUN mkdir -p /home/rstudio/wranglingf1data
RUN chmod a+rw /home/rstudio/wranglingf1data

#Populate the project-directory-to-be with ergast API supporting code and data

#Populate the project-directory-to-be with an additional data source


Running this Dockerfile (for example, using boot2docker) downloads and builds a containerised application preconfigured to support the book and available via a web browser. Instructions for downloading, and running the container can be found here: psychemedia/wranglingf1data-docker repository.

I also added instructions for using the Dockerised application to the book extras package as part of its README file.

Written by Tony Hirst

January 5, 2015 at 3:40 pm

Posted in OU2.0, Rstats

Custom Gridlines and Line Guides in R/ggplot Charts

In the last quarter of last year, I started paying more attention to the use of custom grid lines and line guides in charts I’ve been developing for the Wrangling F1 Data With R book.

The use of line guides was in part inspired by canopy views from within the cockpit of one of the planes that makes up the Red Arrows aerobatic display team.

A little bit of digging suggested that the lines on the cockpit are actually an explosive cord used to shatter the cockpit if a pilot needs to eject from the aircraft – but I can’t believe that the pilots don’t also use the lines as a crib for helping position themselves with respect to the other aircraft in the team? (I emailed the Red Arrows press office to ask about the extent to which the cockpit lines are used in this way but got what amounted to a NULL response.)

Whatever the case, it seems eminently sensible to me that we make use of line guides to help us read charts more effectively, where it makes sense to, or to use guides as stepping stones to more refined data views.

The following example shows how we can generate a 2 dimensional grid based on the discrete points allocations possible for being placed in the top 10 positions of a Formula One race.

The grid lines show allowable points values, and are constructed as follows:

NEWPOINTS =c(25,18,15,12,10,8,6,4,2,1)

#The newpoints dataframe has two columns
#The first column indicates points available, in order
#The second column is the maximum number of points the lower placed driver could score
newpoints=data.frame(x=c(NEWPOINTS,0),y=c(NEWPOINTS[-1],0,0))

baseplot_singleWay=function(g){
g=g+xlab('Highest placed driver points')+ylab('Lower placed driver points')
g=g+scale_y_continuous(breaks = newpoints$x,minor_breaks=NULL) g=g+scale_x_continuous(breaks = newpoints$x,minor_breaks=NULL)
g=g+coord_fixed()
g=g+guides(size=FALSE)+theme_bw()
g
}

g=baseplot_singleWay(ggplot())
g


The final chart (of which this is a “one-sided” example) is used to display a count of the number of races in which at least one of the two drivers in a particular team scores points in a race. The horizontal x-axis represents the number of points scored by the highest placed driver in the race, and the y-axis the number of points scored by their lower placed team mate.

A fixed co-ordinate scheme is adopted to ensure that points are separated consistently on the x and y axes. A dotted red line shows the maximum number of points the lower placed driver in a team could scored depending on the number of points scored by their higher placed team mate.

#Add in the maximum points line guide

g+geom_line(data=newpoints,aes(x=x,y=y),col='red',linetype='dotted')


A two-sided version of the chart is also possible in which the x-axis represents the points scored in a particular race by one name driver and the y-axis represents the points scored by another named driver in the same race. The two-sided chart has two guidelines, representing the maximum points that can be scored by the other driver in the event of one particular driver being higher placed in a race.

A full description can be found in the Points Performance Charts chapter of the Wrangling F1 Data With R book.

Written by Tony Hirst

January 2, 2015 at 9:09 pm

Posted in Rstats

Sketching Scatterplots to Demonstrate Different Correlations

Looking just now for an openly licensed graphic showing a set of scatterplots that demonstrate different correlations between X and Y values, I couldn’t find one.

[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]

So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):

library(MASS)

corrdata=function(samples=200,r=0){
data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
X = data[, 1]  # standard normal (mu=0, sd=1)
Y = data[, 2]  # standard normal (mu=0, sd=1)
data.frame(x=X,y=Y)
}

df=data.frame()
for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){
tmp=corrdata(200,i)
tmp['corr']=i
df=rbind(df,tmp)
}

library(ggplot2)

g=ggplot(df,aes(x=x,y=y))+geom_point(size=1)
g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')

And here’s an example of the result:

It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:

So what sort of OER does that make this post?!;-)

PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?

UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)

//based on http://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184
#The NORmal-To-Anything (NORTA) algorithm
library(MASS)
library(ggplot2)

#NORTA - h/t rseiter
corrdata2=function(samples, r){
mu <- rep(0,4)
Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r)
rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma)
#unifvars <- pnorm(rawvars)
unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions
print(cor(unifvars))
unifvars
}

df2=data.frame()
for (i in c(1,0.9,0.6,0.4,0)){
tmp=data.frame(corrdata2(200,i)[,1:2])
tmp['corr']=i
df2=rbind(df2,tmp)
}
g=ggplot(df2,aes(x=X1,y=X2))+geom_point(size=1)+facet_wrap(~corr)
g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()

Here’s what it looks like with 1000 points:

Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…

The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…

If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this: