Mediated/Augmented Reality (Un)Course Notes, Part I

Pokemon Go seems to have hit the news this week – though I’m sure for anyone off social media last week and back to it next week, the whole thing will have completely passed them by – demonstrating that augmented reality apps really haven’t moved on much at all over the last five years or so.

But notwithstanding that, I’ve been trying to make sense of a whole range of mediated reality technologies for myself as prep for a very short unit on technologies and techniques on that topic.

Here’s what I’ve done to date, over on the Digital Worlds uncourse blog. This stuff isn’t official OU course material, it’s just my own personal learning diary of related stuff (technical term!;-)

More to come over the next couple of weeks or so. If you want to comment, and perhaps influence the direction of my meanderings, please feel free to do that here or on the relevant post.

An evolving feed of the posts is available in chronological order and in reverse chronological order.

Getting Started with F1 Betting Data

As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.

Courtesy of @flutterF1, I managed to get a peek at some betting data from one of the race weekends last year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.

(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)

The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)
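On that note, here’s one minimal sketch of such a “safe loader” (my own helper, not a standard R function; the package names are just the ones used in this post):

```r
#A minimal "safe loader": install any packages that are missing, then load them all
load_packages = function(pkgs) {
  for (pkg in pkgs) {
    if (!require(pkg, character.only = TRUE)) {
      install.packages(pkg)
      library(pkg, character.only = TRUE)
    }
  }
}

#For example, the packages used in this post:
load_packages(c("gdata", "ggplot2", "reshape2", "directlabels"))
```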

The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.

Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets into an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.

Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:

The US Betfair Odds Race Winner.xlsx is a raw data collection with 5 columns….
1) The timestamp (an annoying format, but there is a reason for this, albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) the total amount of money traded on that driver so far.
5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.

To reduce the amount of data I only record it when the price traded changes or if the amount changes.

Looking through the datafile, there appear to be some gotchas, so these need cleansing out:

datafile gotchas

Here’s my initial loader script:

#The gdata library provides read.xls() for loading Excel spreadsheets
library(gdata)

#Skip the junk top line on load, and add in our own column names
bf.odds=read.xls('US Betfair Odds Race Winner.xlsx', skip=1,
                 col.names=c('dt','driver','odds','amount','racing'))

#Cleansing pass - remove the gotcha rows identified above

str(bf.odds)
#'data.frame':	10732 obs. of  5 variables:
# $ dt    : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ...
# $ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ...
# $ odds  : num  3.9 7 17 16.5 24 140 120 180 270 550 ...
# $ amount: num  1340 557 120 118 195 ...
# $ racing: int  0 0 0 0 0 0 0 0 0 0 ...

#Generate a proper datetime field from the dt column
#This is a hacked way of doing it. How do I do it properly?
bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt))

#If we rerun str(), we get the following extra line in the results:
# $ dtt   : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...
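In answer to the “how do I do it properly?” question above, a tidier approach might be to parse the timestamp format directly (a sketch, assuming the US-style “11/16/2012 12:24:52 AM” format shown in the str() output):

```r
#Parse the m/d/Y 12-hour-clock timestamps directly into a datetime type
bf.odds$dtt = as.POSIXct(bf.odds$dt, format = "%m/%d/%Y %I:%M:%S %p")
```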

Here’s what the raw data, as loaded, looks like to the eye:
Betfair spreadsheet

Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:

#We're going to make use of the ggplot2 graphics library
library(ggplot2)

#Let's get a quick feel for bets around each driver
#(sketch - a faceted scatterplot of the timestamped odds, one panel per driver)
g=ggplot(bf.odds)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")

#Let's look in a little more detail around a particular driver within a particular time window
g=ggplot(subset(bf.odds,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00')))

Here are the charts (obviously lacking in caption, tidy labels and so on).

Firstly, the odds by driver:

odds by driver

Secondly, zooming in on a particular driver in a particular time window:


That all seems to work okay, so how about the other spreadsheet?

#There are several sheets to choose from, named as follows:
#Race, Pole, Podium, Points, SC, Fastest Lap, Top 6, Hattrick, Highest Scoring, FP1, ReachQ3, FirstLapLeader, FP2, FP3

#Load in data from a particular specified sheet
race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race')

#The datetime column appears to be in Excel datetime format, so cast it into something meaningful
race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1"))
#Note that I am not checking for gotcha rows, though maybe I should...?

#Use the directlabels package to help tidy up the display a little
library(directlabels)

#Let's just check we've got something loaded - prune the display to rule out the longshots
#(sketch - plotting one of the summary columns; the original construction was along these lines)
g=ggplot(subset(race.odds,Median<30),aes(x=tTime,y=Median,group=Runner,col=Runner,label=Runner))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
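As a sanity check on the magic number in that datetime conversion: 25569 is the number of days between Excel’s day zero in the usual Windows 1900 date system (effectively 1899-12-30) and the Unix epoch, and 86400 is the number of seconds in a day:

```r
#Days between Excel's day zero and the Unix epoch
as.numeric(as.Date("1970-01-01") - as.Date("1899-12-30"))
#[1] 25569
```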

Here’s a view over the drivers’ odds to win, with the longshots pruned out:

example race odds by driver

With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:

#Let's see if we can also plot the odds by bookie
tmp=names(race.odds)
#[1] "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred"     
# [8] "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin"        
#[15] "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red"     
#[22] "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max"         
#[29] "Range" "tTime"   

#We can remove items from this list using something like this:
tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time']
#Then we can create a subset of just those columns
race.odds.bookies=subset(race.odds,select=tmp)

#Melt the data into long form - one row per time/driver/bookie combination
library(reshape2)
race.odds.m=melt(race.odds.bookies,id.vars=c('tTime','Runner'))

#                tTime                 Runner variable value
#1 2012-11-11 19:07:01 Sebastian Vettel (Red)   Bet365  2.37
#2 2012-11-11 19:07:01   Lewis Hamilton (McL)   Bet365  3.25
#3 2012-11-11 19:07:01  Fernando Alonso (Fer)   Bet365  6.00

#Now we can plot how the different bookies compare
g=ggplot(subset(race.odds.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
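The directlabels package mentioned earlier can then replace the suppressed legend with labels placed directly on the chart, something like:

```r
#Label each bookie's line at its last point rather than using a legend
library(directlabels)
direct.label(g, method = "last.points")
```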

bookies odds

Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…

F1Stats – A Prequel to Getting Started With Rank Correlations

I finally, finally made time to get started on my statistics learning journey with a look at some of the results in the paper A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing.

Note that these notes represent a description of the things I learned trying to draw on ideas contained within the paper and apply it to data I had available. There may be errors… if you spot any, please let me know via the comments.

The paper uses race classification data from the 2009 season, comparing F1 and NASCAR championships and claiming to explore the extent to which positions in practice and qualifying relate to race classification. I won’t be looking at the NASCAR data, but I did try to replicate the F1 stats which I’ll start to describe later on in this post. I’ll also try to build up a series of interactive apps around the analyses, maybe along with some more traditional print format style reports.

(There are so many things I want to learn about, from the stats themselves, to possible workflows for analysis and reporting, to interactive analysis tools that I’m not sure what order any of it will pan out into, or even the extent to which I should try to write separate posts about the separate learnings…)

As a source of data, I used my f1com megascraper that grabs classification data (along with sector times and fastest laps) since 2006. (The ergast API doesn’t have the practice results, though it does have race and qualifying results going back a long way, so it could be used to do a partial analysis over many more seasons). I downloaded the whole Scraperwiki SQLite database which I could then load into R and play with offline at my leisure.
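For what it’s worth, pulling a downloaded SQLite file into R can be done with the RSQLite package (the filename and table name here are placeholders; use whatever the Scraperwiki download actually contains):

```r
#Connect to the downloaded SQLite database and pull a table into a data frame
library(RSQLite)
con = dbConnect(SQLite(), "f1com_megascraper.sqlite")
dbListTables(con)  #see what tables the scraper collected
results = dbGetQuery(con, "SELECT * FROM raceResults")
```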

The first result of note in the paper is a chart that claims to demonstrate the Spearman rank correlation between practice and race results, qualification and race results, and championship points and race results, for each race in the season. The caption to the corresponding NASCAR graphs explains the shaded region: “In general, the values in the grey area are not statistically significant and the values in the white area are statistically significant.” A few practical uses we might put the chart to come to mind (possibly!): is qualifying position or p3 position a good indicator of race classification (that is, is the order folk finish at the end of p3 a good indicator of the rank order in which they’ll finish the race)?; if the different rank orders are not correlated (people finish the race in a different order to their grid position), does this say anything about how exciting the race might have been?; and does the “statistical significance” of the correlation value add anything meaningful?

F1 2009 correlations

So what is this chart trying to show and what, if anything, might it tell us of interest about the race weekend?

First up, the statistic that’s being plotted is Spearman’s rank correlation coefficient. There are four things to note there:

  1. it’s a coefficient: a coefficient is a single number that tends to come in two flavours (often both at the same time). In a mathematical equation, a coefficient is typically a constant number that is used as a multiplier of a variable. So for example, in the equation t = 2 x, the x variable has the coefficient 2. Note that the coefficient may also be a parameter, as for example in the equation y = a.x (where the . means ‘multiply’, and we naturally read x as an independent variable that is used to determine the value of y, having been multiplied by the value of a). However, a coefficient may also be a particular number that characterises a particular relationship between two things. In this case, it characterises the degree of correlation between two things…
  2. The refrain “correlation is not causation” is one heard widely around the web that mocks the fact that just because two things may be correlated – that is, when one thing changes, another changes in a similar way – it doesn’t necessarily mean that the way one thing changed caused the other to change in a similar way as a result. (Of course, it might mean that…;-). (If you want to find purely incidental correlations between things, have a look at Google correlate, that identifies different search terms whose search trends over time are similar. You can even draw your own trend over time to find terms that have trended in a similar way.)

    Correlation, then, describes the extent to which two things tend to change over time in a similar way: when one goes up, the other goes up; when one goes down, the other goes down. (If they behave in opposite ways – if one goes up steeply the other goes down steeply; if one goes up gently, the other goes down gently – then they are negatively or anti-correlated).
    Correlation measures require that you have paired data. You know you have paired data if you can plot your data as a two dimensional scatterplot and label each point with the name of the person or thing that was measured to get the two co-ordinate values. So for example, on a plot of qualifying position versus race classification, I can label each point with the name of the driver. The qualification position and race classification is paired data around the set of drivers.

  3. A rank correlation coefficient is used to describe the extent to which two rankings are correlated. Ranked orders do away with the extent to which different elements in a series differ from each other and just concentrate on their rank order. That is, we don’t care how much change there is in each of the data series, we’re just interested in rank positions within each series. In the case of an F1 race, the distribution of laptimes during qualifying may see the first few cars separated by a few thousandths of a second, but the time between the best laptimes of consecutively placed cars at the back of the grid might be of the order of tenths of a second. (Distributions take into account, for example, the variety and range of values in a dataset.) However, in a rank ordered chart, all we are interested in is the integer position: first, second, third, …, nineteenth, twentieth. There is no information about the magnitude of the actual time difference between the laptimes, that is, how close to or far apart from each other the laptimes of consecutively placed cars were; we just know the rank order of fastest to slowest cars. The distribution of the rank values is not really very interesting, or subject to change, at all.
    One thing that I learned that’s possibly handy to know when decoding the jargon: rank based stats are also often referred to as non-parametric statistics because no assumptions are made about how the numbers are distributed (presumably, there are no parameters of note relating to how the values are distributed, such as the mean and standard deviation of a “normal” distribution). If we think about the evolution of laptimes in a race, then most of them will be within a few tenths of the fastest lap each lap, with perhaps two bunches of slower lap times (in-lap and out-lap around a pitstop). The distribution of these lap times may be interesting (for example, the distribution of laptimes on a lap when everyone is racing will be different to the distribution of lap times on a lap when several cars are pitting). On the other hand, for each lap, the distribution of the rank order of laptimes during that lap will always be the same (first fastest, second fastest, third fastest, etc.). That is not to say, however, that the rank order of the drivers’ lap times does not change lap on lap, which of course it might do (Webber might be tenth fastest on lap 1, fastest on lap 2, eight fastest on lap three, and so on).
    Of course, this being stats, “non-parametric” probably means lots of other things as well, but for now my rule of thumb will be: the distribution doesn’t matter (that is, the statistic does not make any assumptions about the distribution of the data in order for the statistic to work; which is to say, that’s one thing you don’t have to check in order to use the statistic) (erm, I think…?!)
  4. The statistic chosen was Spearman’s rank correlation coefficient. Three differently calculated correlation coefficients appear to be widely used (and also appear as possible methods in the R cor() function that calculates correlations between lists of numbers): i) Pearson’s product moment correlation coefficient (how well does a straight line through an x-y scatterplot of the data describe the relationship between the x and y values, and what’s the sign of its gradient); ii) Spearman’s rank correlation coefficient (also known as Spearman’s rho or rs); [this interactive is rather nice and shows how Pearson and Spearman correlations can differ]; iii) Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series). Other flavours of correlation coefficient are also available (for example, Lin’s concordance correlation coefficient, as demonstrated in this example of identifying a signature collapse in a political party’s electoral vote when the Pearson coefficient suggested it had held up, which I think is used to test for how close to a 45 degree line the x-y association between paired data points is…).
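By way of a quick illustration of the difference between the three methods (using made-up positions, not real race data), R’s cor() function will happily calculate all of them:

```r
#Made-up example: grid positions vs final classifications for ten drivers
grid = 1:10
race = c(1, 3, 2, 4, 6, 5, 7, 10, 8, 9)

cor(grid, race, method = "pearson")
cor(grid, race, method = "spearman")
cor(grid, race, method = "kendall")

#cor.test() additionally runs the corresponding significance test
cor.test(grid, race, method = "spearman")
```

All three coefficients come out strongly positive for this near-processional made-up ordering, but note the values differ because each is measuring a slightly different notion of association.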

The statistical significance test is based around the “null hypothesis” that the two sets of results are not correlated; the result is significant if they are more correlated than you might expect if both are randomly ordered. This made me a little twitchy: wouldn’t it be equally valid to argue that F1 is a procession and we would expect the race position and grid position to be perfectly correlated, for example, and then define our test for significance on the extent to which they are not?

This comparison of Pearson’s product moment and Spearman’s rank correlation coefficients helped me get a slightly clearer view of the notion of “test” and how both these coefficients act as tests for particular sorts of relationship. The Pearson product moment coefficient has a high value if a strong linear relationship holds across the data pairs. The Spearman rank correlation is weaker, in that it is simply looking to see whether or not the relationship is monotonic (that is, things all go up together, or they all go down together, but the extent to which they do so need not define a linear relationship, which is an assumption of the Pearson test). In addition, when defining the statistical significance of the test, this is dependent on particular assumptions about the distribution of the data values, at least in the case of the Pearson test. The statistical significance relates to how likely the correlation value was assuming a normal distribution in the values within the paired data series (that is, each series is assumed to represent a normally distributed set of values).

If I understand this right, it means we separate the two things out: on the one hand, we have the statistic (the correlation coefficient); on the other, we have the significance test, which tells you how likely that result is given a set of assumptions about how the data is distributed. A question that then comes to mind is this: is the definition of the statistic dependent on a particular distribution of the data in order for the statistic to have something interesting to say, or is it just the significance that relies on that distribution? To twist this slightly, if we can’t do a significance test, is the statistic then essentially meaningless (because we don’t know whether those values are likely to occur whatever the relationships (or even no relationship) between the data sets)? Hmm.. maybe a statistic is actually a measurement in the context of its significance given some sort of assumption about how likely it is to occur by chance?

As far as Spearman’s rank correlation coefficient goes, I was a little bit confused by the greyed “not significant” boundary on the diagram shown above. The claim is that any correlation value in that grey area could be accounted for in many cases by random sorting. Take two sets of driver numbers, sort them both randomly, and much of the time the correlation value will fall in that region. (I suspect this is not only arbitrary but misleading? If you have random orderings, is the likelihood that the correlation is in the range -0.1 to 0.1 the same as the likelihood that it will be in the range 0.2 to 0.4? Is the probability distribution of correlations “uniform” across the +/- 1 range?) Also, my hazy memory is that the population size affects the confidence interval (see also Explorations in statistics: confidence intervals) – isn’t this the principle on which funnel plots are built? The caption to the figure suggests that the population size (the “degrees of freedom”) was different for different races (different numbers of drivers). So why isn’t the shaded interval differently sized for those races?
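One way of poking at that question is to simulate it: shuffle two sets of twenty “driver positions” at random a few thousand times and look at the spread of Spearman correlations that turn up purely by chance (my own back-of-the-envelope sketch):

```r
#Distribution of Spearman correlations between two randomly shuffled rankings
set.seed(1)
rhos = replicate(10000, cor(sample(20), sample(20), method = "spearman"))

#Middle 95% of the chance-only correlations - roughly the "not significant" band
quantile(rhos, c(0.025, 0.975))
hist(rhos)  #bell-shaped around zero, not uniform across the +/- 1 range
```

So no, the distribution of chance correlations isn’t uniform: values near zero are much more likely than values near the edges of the grey band, which answers one of my questions above but not the one about differently sized intervals for differently sized fields.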

Something else confused me about the interval values used to denote the significance of the Spearman rho values – where do they come from? A little digging suggested that they come from a table (i.e. someone worked them out numerically, presumably by generating random rank orderings and looking at the resulting distribution, rather than algorithmically – I couldn’t find a formula to calculate them? I did find this on Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations by D Bonett (Psychometrika Vol. 65, No. 1, 23-28, March 2000) though). A little more digging suggested Significance Testing of the Spearman Rank Correlation Coefficient by J Zar (Journal of the American Statistical Association, Vol. 67, No. 339 (Sep., 1972), pp. 578-580) as a key work on this, with some later qualification in Testing the Significance of Kendall’s τ and Spearman’s rs by M. Nijsse (Psychological Bulletin, 1988, Vol. 103, No. 2, 235-237).
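That said, there is a commonly quoted large-sample approximation (though I’m not sure how well it behaves for grid-sized fields): treat t = rs*sqrt((n-2)/(1-rs^2)) as t-distributed with n-2 degrees of freedom. In R, with purely illustrative numbers:

```r
#Illustrative values: 24 drivers, observed Spearman rho of 0.45
n = 24
rho = 0.45

#Large-sample t-approximation for the significance of Spearman's rho
t.stat = rho * sqrt((n - 2) / (1 - rho^2))
2 * pt(-abs(t.stat), df = n - 2)  #two-sided p-value
```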

Hmmm.. this post was supposed to be about running some of the stats used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing on some more recent data. But I’m well over a couple of thousand words into this post and still not started that bit… So maybe I’ll finish now, and hold the actual number crunching over to the next post…

PS I find myself: happier that I (think I) understand a little bit more about the rationale of significance tests; just as sceptical as ever I was about the actual practice;-)

My Personal Intro to F1 Race Statistics

One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting normally distributes; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)

So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)

Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-paste fashion, which is how I approach most of my coding!

Bare naked learning, in other words.

So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)

That’s the setting then… Now: #f1stats. What’s that all about?

Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…

If you know of any other relevant looking papers or articles, please post a link in the comments.

Who is the Best Formula 1 Driver? An Econometric Analysis

I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…

PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!

Infoskills 2.012 – Practical Exercises in Social Media Network Analysis #change11

As ever, it seems the longer I have to prepare something, the less likely I am to do it. I was supposed to be running a #change11 MOOC session this week – Infoskills 2.012 How to do a lot with a little – but having had it in the diary for six months or so, I have, of course, done nothing to prepare for it… (I didn’t come up with the 2.012 – not sure who did?)

Anyway….over the weekend, I gave a presentation (Social Media Visualisation Hacks) that, typically, bewildered the audience with a blizzard of things that are possible when it comes to looking at social networks but that are still alien to most:

As ever, the presentation is not complete (i.e. the slides really need to be complemented by a commentary), but that’s something I hope to start working on improving – maybe starting this week…

The deck is a review – of sorts – of some of the various ways we can look at social networks and the activity that takes place within them. The slides are prompts, keys, search phrase suggestions that provide a starting point for finding out more. Many of the slides contain screenshots – and if you peer closely enough, you can often see the URL. For posts on my blog, searching with the word ouseful followed by key terms from the post title will often turn up the result on major search engines. Many of the slides identify a “hack” that is described in pseudo-tutorial form on this blog, or on Martin Hawksey’s MASHe blog.

I put together a delicious stack of links relating to the presentation here: #drg12 – Visualising Social Networks (Tutorial Posts)

For a tutorial stack that focusses more on Yahoo Pipes (though who knows how much longer that will be with us, given the perilous nature of Yahoo at the moment), see: Twitter Pipes

My #change11 week was supposed to be about new info skills, with a practical emphasis. A couple of other presentations relating to how we might appropriate (a-pro-pre-eight) tools and applications can be found here: Appropriate IT – My ILI2011 Presentation and Just Do IT Yourself… MY UKSG Presentation.

If all you do in Google is 2.3 keyword searches, this deck – Making the Most of Google – (though possibly a little dated by now), may give you some new ideas.

For a more formal take on infoskills for the new age, (though I need to write a critique of this from my own left-field position), see the Cambridge University Library/Arcadia Project “New Curriculum for Information Literacy (ANCIL)” project via the Arcadia project.

If you want to do some formal reading in the visualisation space, check out 7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don’t Know.

Via @cogdog/Alan Levine, I am reminded of Jon Udell’s Seven ways to think like the web. You do think like that, right?!;-)

Mulling over @downes’ half hour post on Education as Platform: The MOOC Experience and what we can do to make it better, I see the MOOC framework as providing an element of co-ordination, pacing and a legacy resource package. For my week, I was expected(?!) to put together some readings and exercises and maybe a webinar. But I haven’t prepared anything (I tried giving a talk at Dev8D earlier this year completely unprepared (though I did have an old presentation I could have used) and it felt to me like a car crash/a disaster, so I know I do need to prep things (even though it may not seem like it if you’ve heard me speak before!;-))

But maybe that’s okay, for one week of this MOOC? The blog, where you’re maybe reading this, is an ongoing presentation, with a post typically every day or so. When I learn something related to the general themes of this blog, I post it here, often as a partial tutorial (partial in the sense that you often have to work through the tutorial for the words to make sense – they complement the things you should be seeing on screen – if you look – as you work through the tutorial; in a sense, the tutorial posts are often second screen complements to and drivers of an activity on another screen).

I’ve personally tried registering with MOOCs a couple of times, but never completed any of the activities. Some of the MOOC related readings or activities pass my way through blogs I follow, or tweeted links that pique my interest, and sometimes I try them out. I guess I’m creating my own unstructured uncourse* daily anyway… (*uncourses complement MOOCs, sort of… They’re courses created live in partial response to feedback, but also reflecting the “teacher”‘s learning journey through a topic. Here’s an example that led to a formal OU course: Digital Worlds uncourse blog experiment. The philosophy is based on the “teacher” modelling – and documenting – a learning journey. Uncourses fully expect the “teacher” not being totally knowledgeable about the subject area, but being happy to demonstrate how they go about making sense of a topic that may well be new to them).

So… this is my #change11 offering. It’s not part of the “formal” course, (how weird does that sound?!) unless it is… As the MOOC is now in week 29, if its principles have been taken on-board, you should be starting to figure out your own distributed co-ordination mechanisms by now. Because what else will you do when the course is over? Or will it be a course that never ends, yet ceases to have a central co-ordination point?

PS if you want to chat, this blog is open to comments; you can also find me on Twitter as @psychemedia

PPS seems like I’ve had at least one critical response (via trackbacks) towards my lackadaisical “contribution” towards my “teaching” week on the #Change11 MOOC. True. Sorry. But not. I should have kept it simple, posted my motto – identify a problem, then hack a solution to it, every day – and left it at that… It’s how I learn about this stuff… (and any teaching I receive tends to be indirect – by virtue of stuff other folk have published that I’ve discovered through web search (aka search queries – questions – that I’ve had to formulate to help me answer the problem I have identified/created…).

More Thoughts on a Content Strategy for Data – Many Eyes and Google Fusion Tables

It’s one thing publishing data just to comply with a formal requirement to make it public, quite another if you’re publishing it because you want folk to do something with it.

But if you decide you’re publishing data because you want folk to do something with it, what does that mean exactly?

[Not quite related but relevant: Pete Sefton (@ptsefton) on Did you say you “own” this data? You keep using that word. I do not think it means what you think it means.]

One answer might be that you want them to be able to refer to the data for their own purposes, simply by cutting and pasting a results datatable out of one of your spreadsheets so they can paste it into one of theirs and refer to it as “evidence”:

Reference summary data

Another might be that you want folk to be able to draw on your data as part of their own decision making process. And so on. (For a couple of other use cases, see First Thoughts On A Content Strategy for Data.)

A desire that appears to have gained some traction over the last couple of years is to publish data so that folk can produce visualisations based on it. This is generally seen as a Good Thing, although I’m not sure I know exactly why…? Perhaps it’s because visualisations are shiny objects and folk can sometimes be persuaded to share (links to) shiny objects across their social networks; this in turn may help raise wider awareness about the existence of your data, and potentially bring it to the attention of somebody who can actually make some use of it, or extract some value from it, possibly in combination with one or more other datasets that you may or may not be aware of.

Something that I’ve become increasingly aware of over the last couple of years is that people respond to graphics and data visualisations in very different ways. The default assumption seems to be that a graphic should expose some truth in a very obvious way without any external input. (I’m including things like axis labels, legends, and captions in the definition of a graphic.) That is, the graphic should be a self-contained, atomic object, meaningful in its own right. I think this view is borne out of the assumption that graphics are used by an author to communicate something they already know to their audience. The graphic is chosen because it does “self-evidently” make some point that makes the author’s case. Let’s call these “presentation graphics”. Presentation graphics are shiny objects, designed to communicate something in particular, to a particular audience, in (if at all possible) a self-contained way.

Another way of using visualisations is as part of a visual analysis process. In this case, visual representations of the data are generated by the analyst as part of a conversation they are having with the data. One aim of this conversation (or maybe we should call it an interrogation?!) may be to get the data to reveal something about its structure, or meaningful patterns contained within it. Visual analysis is therefore less to do with the immediate requirement of producing meaningful presentation graphics, and more to do with getting the data to tell its story. Scripted speeches contain soundbites – presentation graphics. Conversations can ramble all over the place and are often so deeply situated in a particular context they are meaningless to onlookers – as visualisations produced during a visual analysis activity may be. (Alternatively, the visual analyst spends their time trying to learn how to ride a bike. Chris Hoy and Victoria Pendleton show how it’s done with the presentation graphics…)

It could be that I’m setting up something of a false dichotomy between extrema here, because sometimes a simple, “directly generated” chart may be effective as both a simple 1-step visual analysis view, and as a presentation graphic. But I’m trying to think through my fingers and type my way round to what I actually believe about all this stuff, and arguing to limits is one lazy way of doing this! The distinction is also not just mine… For example, Iliinsky and Steele’s Designing Data Visualizations identifies the following:

Explanatory visualization: Data visualizations that are used to transmit information or a point of view from the designer to the reader. Explanatory visualizations typically have a specific “story” or information that they are intended to transmit.
Exploratory visualization: Data visualizations that are used by the designer for self-informative purposes to discover patterns, trends, or sub-problems in a dataset. Exploratory visualizations typically don’t have an already-known story.

They also define data visualizations as “[v]isualizations that are algorithmically generated and can be easily regenerated with different data, are usually data-rich, and are often aesthetically shallow.” Leaving aside the aesthetics, the notion that data visualisations can be “algorithmically generated” is important here.

A related insight I picked up from the New York Times’ Amanda Cox is the use of statistical charts in visual analysis as sketches that help us engage with data en route to understanding some of the stories it contains, stories that may then be told by whatever means are appropriate (which may or may not include graphical representations or visualisations).

So when it comes to publishing data in the hope that folk will do something visual with it, does that mean we want to provide them with data that can be directly used to convey some known truth in an appealing way, or do we want to provide them with data in such a way that they can engage with it in a (visual) analytic way and then communicate their insight through a really beautiful presentation graphic? (Note that it may often be the case that something discovered through a visual analysis step may actually best be communicated through a simple set of ranked, tabulated data presented as text…) Maybe this explains why folk are publishing the data in the hope that it will be “visualised”? They are conflating visual analysis with presentation graphics, and hoping that novel visualisation (visual analysis) techniques will: 1) provide new insights into (new sense made from) the data, that: 2) also work as shiny, shareable and insightful presentation graphics? Hmmm…

Publishing Data to Support Visual Engagement

So we have our data set, but how can we publish it in a way that supports generative visual engagement with it (generative in the sense that we want the user to have at least some role in creating their own visual representations of the data)?

The easiest route to engagement is to publish an interactive visualisation on top of your data set so that the only way folk can engage with the data is through the interactive visualisation interface. So for example, interactive visualisations published by the BBC or New York Times. These typically support the generation of novel views over the data by allowing the user to construct queries over the data through interactive form elements (drop down lists, radio buttons, sliders, checkboxes, etc.); these queries are then executed to filter or analyse the data and provide a view over it that can be visually displayed in a predetermined way. The publisher may also choose to provide alternative ways of visualising the data (for example, scatter plot or bar chart) based on preidentified ways of mapping from the data to various graphical dimensions within particular chart types. In the case of the interactive visualisation hosted on the publisher’s website, the user is thus typically isolated from the actual data.

An alternative approach is to publish the data in an environment that supports the creation of visualisations at what we might term the row and column level. This is where ideas relating to a content strategy for data start to come in to play. An example of this is IBM’s Many Eyes data visualisation tool. Long neglected by IBM, the Many Eyes website provides an environment for: 1) uploading tabular datasets; 2) generating preconfigured interactive visualisations on top of the datasets; 3) providing embeddable versions of visualisations; 4) supporting discussions around visualisations. Note that a login is required to upload data and generate new visualisations.

As an example of what’s possible, I uploaded a copy of the DCMS CASE data relating to Capital Investment to Many Eyes (DCMS CASE data – capital investment (modified)):

CASE Data on Many Eyes

Once the data is uploaded, the user has the option of generating one or more interactive visualisations of the data from a wide range of visualisation types. For example, here’s a matrix chart view (click through to see the interactive version; note: Java required).

Many Eyes example

And here’s a bubble chart:

Many Eyes Bubblechart

In an ideal world, getting the data into Many Eyes should just(?) have been a case of copying and pasting data from the original spreadsheet. (Note that this requires access to an application that can open the spreadsheet, either on the desktop or online.) In the case of the DCMS CASE data, this required opening the spreadsheet, finding the correct sheet, then identifying the cell range containing the data we want to visualise:

DCMS CASE data - raw

Things are never just that simple, of course… Whilst it is possible to define columns as “text” or “number” in Many Eyes, the date field was recognised by the Many Eyes visualisation tools as a “number”, which may lead to some visualisations incorrectly aggregating data from the same region across several date ranges. In order to force the Many Eyes visualisations to recognise the date column as “text”, I had to edit the original data file (before uploading it to Many Eyes) by prepending the date ranges with an alphabetic character (so for example I replaced instances of 2004/05 with Y2004/05).
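A fix like this is easy to script if you have a lot of files to repair. Here’s a minimal sketch (the column name, values and prefix are purely illustrative) that prepends a character to every value in a named column so that type-guessing tools treat it as text rather than as a number:

```python
import csv
import io

def force_text_column(csv_text, col, prefix="Y"):
    """Return CSV text with `prefix` prepended to every value in column `col`,
    so type-guessing tools (e.g. Many Eyes) read the column as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row[col] = prefix + row[col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys(), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

fixed = force_text_column("region,date,spend\nLondon,2004/05,100\n", "date")
# the date cell now reads "Y2004/05"
```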

Recommendation In terms of a “content strategy for data”, then, we need to identify possible target domains where we wish to syndicate or republish our data and then either: 1) publish the data to that domain, possibly under our own branding, or “official” account on that domain – this approach also allows the publisher to add provenance metadata, a link to the homepage for the data or its original source, and so on; or, 2) publish the data on our site in such a way that we know it will work on the target domain (which means testing it/trying it out…). If you expect users to upload your data to services like Many Eyes themselves, it would make sense to provide easily cut and pastable example text of the sort you might expect to see appear in the metadata fields of the data page on the target site and encourage users to make use of that text.

Recommendation A lot of the CASE data spreadsheets contain multiple cell ranges corresponding to different tables within a particular sheet. Many cut and paste tools support data that can be cut and pasted from appropriately highlighted cell ranges. However, other tools require data in comma separated (rather than tab separated) format, which means the user must copy and paste the data into another sheet and then save it as CSV. Although a very simple format, there is a lot to be said for publishing very simple CSV files containing your data. Provenance and explanatory data often gets separated from data contained in CSV files, but you can always add a manifest text file to the collection of CSV data files to explain the contents of each one.
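The manifest idea can be sketched very simply – the filenames and descriptions below are invented, but the pattern is just one line of plain text per published CSV file:

```python
import csv
import io

def render_csv(rows):
    """Serialise a list of rows to simple CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

def build_manifest(tables):
    """One line per data file, pairing the filename with a plain-English
    description of what that file contains."""
    return "\n".join("%s: %s" % (fname, desc) for fname, desc, _ in tables) + "\n"

tables = [
    ("capital_investment.csv",
     "DCMS CASE capital investment by local authority and year",
     [["la_code", "year", "spend"], ["E09000001", "Y2004/05", "100"]]),
]
manifest = build_manifest(tables)  # write this out as manifest.txt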

Whilst services such as Many Eyes do their best in trying to identify numeric versus categorical data columns, unless the user is familiar with the sorts of data a particular visualisation type requires and how it should be presented, it can sometimes be hard to understand why Many Eyes has automatically identified particular values for use in drop down list boxes, and at times hard to interpret what is actually being displayed. (This is a good reason to limit the use of Many Eyes to a visual analysis role, and use it to discover things that look interesting/odd and then go off and dig in the data a little more to see if there really is anything interesting there…)

In some cases, it may be possible to reshape the data and get it in to a form that Many Eyes can work with. (Remember the Iliinsky and Steele definition of a data visualisation as something “algorithmically generated”? If the data is presented in the right way, then Many Eyes can do something with it. But if it’s presented in the wrong way, no joy…) As an example, if we look at the CASE Capital Investment data, we see it has columns for Region, Local Authority, Date, as well as columns relating to the different investment types. Presented this way, we can easily group data across different years within an LA or Region. Alternatively, we might have selected Region, Local Authority, and Asset type columns, with separate columns for each date range. This different combination of rows and columns may provide a different basis for the sorts of visualisations we can generate within Many Eyes and the different summary views we can present over the data.

Recommendation The shape in which the data is published may have an effect on the range of visualisations that can be directly generated from the data, without reshaping by the user. It may be appropriate to publish the data in a variety of shapes, or provide tools for reshaping data for use with particular target services. Tools such as the Stanford Data Wrangler are making it easier for people to clean and reshape messy data sets, but that is out of scope for this review. In addition, it is worth considering the data type or physical form in which data is published. For example, in columns relating to financial amounts, prepending each data element in a cell with a £ may break cut and paste visualisation tools such as Many Eyes, which will recognise the element as a character string. Some tools are capable of recognising datetime formats, so in some cases it may be appropriate to publish date/datetime data in a standardised way. Many tools choke on punctuation characters from Windows character sets, and despite best efforts, rogue characters and undeclared or incorrect character encodings often find their way in to datasets and prevent them from working correctly in third party applications. Some tools will automatically strip out leading and trailing whitespace characters, others will treat them as actual characters. Where string matching operations are applied (for example, grouping data elements) a word with a trailing space and a word without a trailing space may be treated as defining different groups. (Which is to say, try to strip leading and trailing whitespace in your data. Experts know to check for this, novices don’t.)
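A minimal cleaning pass along these lines might look like the following sketch – the particular rules (currency prefixes, “-” as a missing-value marker, thousands separators) are illustrative rather than exhaustive:

```python
def clean_cell(value):
    """Normalise a single data cell: strip leading/trailing whitespace,
    drop currency symbols and thousands separators, and map placeholder
    markers such as "-" to a genuinely empty cell."""
    v = value.strip()
    if v in ("-", ""):       # placeholder for a missing value
        return ""
    v = v.lstrip("£$")       # currency prefixes make numbers read as strings
    v = v.replace(",", "")   # "1,234" would otherwise be treated as text
    return v

cleaned = clean_cell(" £1,234 ")  # → "1234"
```

Applying something like this to every cell before publishing (or before upload) removes a whole class of silent string-versus-number mismatches.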

If the expectation is that users will make use of a service such as Many Eyes, it may be worth providing an FAQ area that describes what shape the different visualisations expect the data to be in, with examples from your own data sets. Services such as Number Picture, which provide a framework for visualising data by means of visualisation templates that accept data in a specified shape and form, provide helpful user prompts that explain what the (algorithmic) visualisation expects in terms of the shape and form of input data:

Number picture - describes the shape and form the data needs to be in

Custom Filter Visualisations – Google Fusion Tables

Google Fusion Tables are like spreadsheets on steroids. They combine features of traditional spreadsheets with database like query support and access to popular chart types. Google Fusion Tables can be populated by importing data from Google Spreadsheets or uploading data from CSV files (example Fusion Table).

Google Fusion Table - data import

Google Fusion Tables can also be used to generate new tables based on the fusion of two (or, by chaining, more than two) tables that share a common column. So for example, given three CSV data files containing different data sets (for example, file A has LA codes, and Regions, file B has LA codes and arts spend by LA, and file C has LA codes and sports engagement data) we can merge the files on the common columns to give a “fused” data set (for example, a single table containing four columns: LA codes, Regions, arts spend, sports engagement data). Note that the data may need to be appropriately shaped before it can be fused in a meaningful way with other data sets.
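The merge step itself is conceptually simple. Here’s a sketch in Python of the same sort of key-based fusion – the LA codes and column names are invented, and note that whether unmatched rows are kept or dropped depends on the merge options, so the inner join shown here is just one choice:

```python
def fuse(tables, key):
    """Inner-join a sequence of lists-of-dicts on a shared key column,
    keeping only key values that appear in every table."""
    keys = set(row[key] for row in tables[0])
    for table in tables[1:]:
        keys &= set(row[key] for row in table)
    merged = {k: {} for k in keys}
    for table in tables:
        for row in table:
            if row[key] in merged:
                merged[row[key]].update(row)
    return list(merged.values())

# file A: LA codes and regions; file B: LA codes and arts spend;
# file C: LA codes and sports engagement (all values invented)
a = [{"la": "E1", "region": "London"}, {"la": "E2", "region": "North West"}]
b = [{"la": "E1", "arts_spend": 100}]
c = [{"la": "E1", "sports": 0.4}, {"la": "E2", "sports": 0.5}]
fused = fuse([a, b, c], "la")  # only E1 appears in all three files
```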

As with many sites that support data upload/import, it’s typically down to the user to add appropriate metadata to the data file. This metadata is important for a variety of reasons: firstly, it provides context around a dataset; secondly, it may aid in discovery of the data set if the data and its metadata is publicly indexed; thirdly, it may support tracking, which can be useful if the original publisher needs to demonstrate how widely a dataset has been (re)used.

Google spreadsheets provenance metadata

If there are too many steps involved in getting the data from the download site into the target environment (for example, if it needs downloading, a cell range copying, saving into another data file, cleaning, then uploading) the distance from the original data source to the file that is uploaded may result in the user not adding much metadata at all. As before, if it is anticipated that a service such as Google Fusion Tables is a likely locus for (re)use of a dataset, the publisher should consider publishing the data directly through the service, with high quality metadata in place, or provide obvious cues and cribs to users about the metadata they might wish to add to their data uploads.

A nice feature of Google Fusion Tables is the way it provides support for dynamic and compound queries over a data set. So for example, we can filter rows:

Google fusion table query filters

Or generate summary/aggregate views:

Generating aggregate views

A range of standard visualisation types are available:

Google Fusion tables visualisation options

Charts can be used to generate views over filtered data:

Google Fusion Tables Filters and charts

Or filtered and aggregated data:

Google Fusion Tables Filtered and aggregated views

Note that these charts may not be of publishable quality/useful as presentation graphics, but they may be useful as part of a visual analysis of the data. To this extent, the lack of detailed legends and titles/captions for the chart does not necessarily present a problem – the visual analyst should be aware of what the data they are viewing actually represents (and they can always check the filter and aggregate settings if they are unsure, as well as dropping in to the tabular data view to check actual numerical values if anything appears to be “odd”). However, the lack of explanatory labelling is likely to be an issue if the intention is to produce a presentation graphic, in which case the user will need to grab a copy of the image and maybe postprocess it elsewhere.

Note that Google Fusion Tables is capable of geo-coding certain sorts of location related data such as placenames or postcodes and rendering associated markers on a map. It is also possible to generate thematic maps based on arbitrary geographical shapefiles (eg Thematic Maps with Google Fusion Tables [PDF]).

Helping Data Flow – Treat It as a Database

Services such as Google Spreadsheets provide online spreadsheets that support traditional spreadsheet operations, including chart generation (using standard chart types familiar to spreadsheet users) and support for interactive graphical widgets (including more exotic chart types, such as tree maps), powered by spreadsheet data, that can be embedded in third party webpages. Simple aggregate reshaping of data is provided in the form of support for Pivot Tables. (Note however that Google Spreadsheet functionality is sometimes a little bug ridden…) Google Spreadsheets also provides a powerful query API (the Google Visualisation API) that allows the spreadsheet to be treated as a database. For an example in another government domain, see Government Spending Data Explorer; see also Guardian Datastore MPs’ Expenses Spreadsheet as a Database.

Publishing data in this way has the following benefits: 1) treating the data as a spreadsheet allows query based views to be generated over it; 2) these views can be visualised directly in the page (this includes dynamic visualisations, as for example described in Google Chart Tools – dynamic controls, and gallery); 3) queries can be used to generate CSV based views over the data that can be (re)used in third party applications.
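By way of illustration, here’s roughly how such a CSV-returning query URL can be constructed. Treat the endpoint pattern as an assumption – it reflects the form of the Visualisation API URLs in use at the time of writing, and the spreadsheet key and column letters below are placeholders:

```python
import urllib.parse

def gviz_csv_url(key, query, gid=0):
    """Build a Google Visualization API query URL asking for a CSV-formatted
    view over a published Google Spreadsheet."""
    params = urllib.parse.urlencode({
        "tqx": "out:csv",   # request CSV rather than the default JSON wrapper
        "tq": query,        # a SQL-like query over the sheet's columns
        "key": key,         # the spreadsheet's key
        "gid": gid,         # which sheet within the spreadsheet
    })
    return "https://spreadsheets.google.com/tq?" + params

url = gviz_csv_url("SPREADSHEET_KEY", "select A, B where C > 100")
```

Fetching that URL returns a CSV view over just the queried columns and rows, which is what makes it possible to treat the published spreadsheet as a simple read-only database.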

Geographical Data

Sometimes it makes sense to visualise data in a geographical way. One service that provides a quick way of generating choropleth/thematic maps from simple two or three column data keyed by UK administrative geography labels or identifiers is OpenHeatmap. Data can be uploaded from a simple CSV file or imported from a Google spreadsheet using predetermined column names: a column identifying geographical areas according to one of a fixed number of geographies, a numeric value column for colouring each geographical area, and an optional date column for animation purposes (so a map can be viewed in an animated way over consecutive time periods):


Once generated, links to an online version of the map are available.

The code for OpenHeatmap is available as open source software, so without too much effort it should be possible to modify the code in order to host a local instance of the software and tie it in to a set of predetermined Google spreadsheets, local CSV files, or data views generated from queries over a predetermined datasource, so that only the publisher’s data can be visualised using the particular instance of OpenHeatmap.

Other services for publishing and visualising geo-related data are available (eg Geocommons) and could play a role as a possible outlet in a content strategy for data with a strong geographical bias.

Power Tools – R

A further class of tools that can be used to generate visual representations of arbitrary datasets are the fully programmatic tools, such as the R statistical programming language. Developed for academic use, R is currently increasing in popularity on the coat tails of “big data” and the growing interest in analysis of activity data (“paradata”) that is produced as a side-effect of our digital activities. R is capable of importing data in a wide variety of formats from local files as well as via a URL from an online source. The R data model supports a range of powerful transformations that allow data to be shaped as required. Merging data that shares common columns (in whole or part) from separate sources is also supported.

In order to reduce overheads in getting data into a useful shape within the R environment, it may make sense to publish datafile “wrappers” that act as a simple API to data contained within one or more published spreadsheets or datafiles. By providing an object type and, where appropriate, access methods for the data, the data publisher can provide a solid framework on top of which third parties can build their own analysis and statistical charts. R is supported by a wide range of third party extension libraries for generating a wide range of statistical charts and graphics, including maps. (Of particular note are ggplot2 for generating graphics according to the Grammar of Graphics model, and googleVis, which provides a range of functions that support the rapid generation of Google Charts). Many of the charts can be generated from a single R command if the data is in the correct shape and format.

As well as running as a local, desktop application, R can also be run as a hosted webservice; for example, the RStudio cross-platform desktop application can also be accessed as a hosted online service, and could presumably be used to provide a robust, online hosted analysis environment tied in to a set of locked down data sources. It is also possible to use R to power online hosted statistical charting services, such as the hosted ggplot service discussed below.

Uploading data to ggplot2

Some cleaning of the data may be required before uploading to the ggplot service. For example, empty cells marked as such by a “-” should be replaced by genuinely empty cells; numeric values containing a “,” may be misinterpreted as character strings (factor levels) rather than numbers (in which case the data needs cleaning by removing commas). Again, if it is known that a service such as ggplot2 is likely to be a target for data reuse, publishing the data in a format that is known to work “just by loading the data in” to R with default import settings will reduce friction/overheads and keep the barriers to reusing the data within that environment to a minimum.

Observation Most of the time, most people don’t get past default settings on any piece of software. If someone tries to load your data into an application, odds on they will use default, factory settings. If you know that users are likely to want to use your data in a particular package, make at least a version of your data available in a format that will load into that package under the default settings in a meaningful way.

Finally, a couple of wild card possibilities.

Firstly, Wolfram Alpha. Wolfram Alpha provides access to a “computational search engine” that accepts natural language queries about facts or data and attempts to provide reasoned responses to those queries, including graphics. Wolfram Alpha is based around a wide range of curated data sets, so an engagement strategy with them may, in some certain circumstances, be appropriate (for example, working with them in the publication of data sets and then directing users to Wolfram Alpha in return). Wolfram Alpha also offers a “Pro” service (Wolfram Alpha Pro) that allows users to visualise and query their own data.

Secondly, the Google Refine Reconciliation API. Google Refine is a cross-platform tool for cleaning datasets, with the ability to reconcile the content of data columns with canonical identifiers published elsewhere. For example, it is possible to reconcile the names of local authorities with canonical Linked Data identifiers via the Kasabi platform (UK Administrative Geography codes and identifiers).

Google refine reconciliation

By anchoring cell values to canonical identifiers, it becomes possible to aggregate data from different sources around those known, uniquely identified items in a definite and non-arbitrary way. By publishing: a) a reconciliation service (eg for LEP codes); and b) data that relates to identifiers returned by the reconciliation service (for example, sports data by LEP), the data publisher provides a focus for third parties who want to reconcile their own data against the published identifiers, as well as a source of data values that can be used to annotate records referencing those identifiers. (So for example, if you make it easy for me to get Local Authority codes based on local authority names from your reconciliation service, and also publish data linked to those identifiers (sports engagement data, say), then if I reconcile my data against your codes, I will also be provided with the opportunity to annotate my data with your data – so I can annotate my local LEP spend data with your LEP sports engagement data [probably a bad example… need something more convincing?!].) Although uptake of the reconciliation API (and the associated possibility of providing annotation services) is still a minority interest, there are some signs of interest in it (for example, Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data; note that data published on the Kasabi platform also exposes a Google Refine reconciliation service endpoint). In my opinion, there are potentially significant benefits to be had by publishing reconciliation service endpoints with associated annotation services if a culture of use grows up around this protocol.
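The wire format of a Refine-style reconciliation call is straightforward: the client sends a queries parameter containing a JSON map of query records (q0, q1, …), and the service responds with candidate identifiers and match scores for each one. A minimal sketch of building that parameter – the place names are just examples:

```python
import json

def reconciliation_queries(names, limit=3):
    """Build the `queries` parameter for a Refine-style reconciliation
    service: a JSON object mapping query keys (q0, q1, ...) to query
    records, each asking for up to `limit` candidate matches."""
    return json.dumps({
        "q%d" % i: {"query": name, "limit": limit}
        for i, name in enumerate(names)
    })

payload = reconciliation_queries(["Isle of Wight", "Milton Keynes"])
```

This payload would be sent to the service endpoint, and each query key in the response carries a ranked list of candidate identifiers.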

Not covered: as part of this review, I have not covered applications such as Microsoft Excel or Tableau Desktop (the latter being a Windows only data visualisation environment that is growing in popularity). Instead, I have tried to focus on applications that are freely available either via the web or on a cross-platform basis. There is also a new kid on the block – – but it’s still early days for this tool…

Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list, or the list of likes someone has made; replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Google Refine - import Facebook friends list

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

Google Refine - import Facebook friends

If you click the highlighted selection, you should see the data that will be used to create your project:

Google Refine - click to view the data

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

Google Refine - rename columns

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

Google Refine - new column from URL

The Likes URL has the form which we’ll tinker with as follows:

Google Refine - crafting URLs for new column creation

The throttle control tells Refine how long to wait between calls. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what limit the Facebook API is happy with; if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…
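If you wanted to do the same harvesting outside Refine, the throttling is easy to reproduce in a short script. This sketch is mine rather than anything Refine provides; make_url would build the per-friend Likes URL described above, and fetch is injectable so the loop can be tested without hitting the network:

```python
import time
import urllib.request

def harvest(ids, make_url, delay=0.5, fetch=None):
    """Fetch one URL per identifier, pausing `delay` seconds between calls
    so as not to hammer the API (the same job Refine's throttle does)."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    results = {}
    for i in ids:
        results[i] = fetch(make_url(i))
        time.sleep(delay)
    return results
```

With delay=0.5 this paces requests at the same 500ms interval I used in Refine.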

Having imported the data, you should find a new column:

Google Refine - new data imported

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not…). I found this caused Refine to hang though, so instead I exported the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

import simplejson

# fn is the path to the file exported from Refine via the default
# Templating… exporter (the filename here is whatever you saved it as)
fn = 'facebookLikesExport.json'

data = simplejson.load(open(fn, 'r'))
uid = 0  # a (new) unique identifier for each friend
for d in data['rows']:
	uid += 1
	# 'interests' is the column name containing the Likes data
	for i in d['interests']['data']:
		print str(uid), i['name'], i['category']

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Sketching common likes amongst my facebook friends

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

[See also: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and how to visualise Google+ networks]

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
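To make that pattern concrete, here’s a few lines of Python sketching the row-duplication idea (the function and column names are made up for illustration – this isn’t anything Refine actually supports):

```python
import json

def explode_column(rows, source_col, new_col):
    """For each row, duplicate it once per item in the JSON list stored in
    source_col, putting the current item into new_col on each copy."""
    out = []
    for row in rows:
        for item in json.loads(row[source_col]):
            new_row = dict(row)      # 1) duplicate the original data row
            new_row[new_col] = item  # 2+3) add the new column, populated with this item
            out.append(new_row)
    return out

rows = [{"friend": "A", "likes": '["cats", "chess"]'}]
for r in explode_column(rows, "likes", "like"):
    print(r["friend"], r["like"])
# → A cats
#   A chess
```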

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior, by Kosinski et al. – suggests it’s possible to profile people based on their Likes. It would be interesting to compare how robust that profiling is with profiles based on the common Likes of a person’s followers, or the common Likes of folk in the same Facebook groups as an individual?

A Tool Chain for Plotting Twitter Archive Retweet Graphs – Py, R, Gephi

Another set of stepping stones that provide a clunky route to a solution that @mhawksey has been working on a far more elegant expression of (eg Free the tweets! Export TwapperKeeper archives using Google Spreadsheet and Twitter: How to archive event hashtags and create an interactive visualization of the conversation)…

The recipe is as follows:

– download a Twapperkeeper archive to a CSV file using a Python script as described in Python Script for Exporting (Large) Twapperkeeper Archives By User; the CSV file should contain a single column with one row per archive entry; each row includes the sender, the tweet, the tweet ID and a timestamp; **REMEMBER – TWAPPERKEEPER ARCHIVES WILL BE DISABLED ON JAN 6TH, 2012**

– in an R environment (I use RStudio), reuse code from Rescuing Twapperkeeper Archives Before They Vanish and Cornelius Puschmann’s post Generating graphs of retweets and @-messages on Twitter using R and Gephi:

#Load the libraries we need: stringr for string handling, igraph for graph construction
library(stringr)
library(igraph)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

#fp is the path to the CSV file downloaded from the Twapperkeeper archive
df = read.csv(fp, header=F)
df$from = sapply(df$V1, function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
df$id = sapply(df$V1, function(tweet) str_extract(tweet,"[[:digit:]/s]*$"))
df$txt = sapply(df$V1, function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:]/s]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
df$to = sapply(df$txt, function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
df$rt = sapply(df$txt, function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))

#For example, build data frames of sender/recipient and sender/retweetee pairs:
ats.df <- data.frame(df$from, df$to)
rts.df <- data.frame(df$from, df$rt)

#...and turn them into directed igraph graphs:
ats.g <- graph.data.frame(ats.df, directed=T)
rts.g <- graph.data.frame(rts.df, directed=T)

write.graph(ats.g, file="ats.graphml", format="graphml")
write.graph(rts.g, file="rts.graphml", format="graphml")

– Cornelius’ code uses the igraph library to construct a graph and export graphml files that describe graphs of at behaviour (tweets in the archive sent from one user to another) and RT behaviour (tweets from one person retweeting another using the RT @name convention).

– visualise the graphml files in Gephi. Note a couple of things: empty nodes aren’t handled properly in my version of the code, so the graph includes a dummy node that the senders of tweets which aren’t @-messages or RTs all point to; when you visualise the graph, this node will be obvious, so just delete it ;-)

– the Gephi visualisation by default uses the Label attribute for labeling nodes – we need to change this:

Gephi - setting node label choice

You should now be able to view graphs that illustrate RT or @ behaviour as captured in a Twapperkeeper archive in Gephi.

ILI2011 RT behaviour

Just by the by, we can also generate stats’n graphs of the contents of the archive. For example, via Getting Started With Twitter Analysis in R, we can generate a bar plot to show who was retweeted most:

#one possible approach (not the original crib): tabulate the rt column and plot the ten most retweeted users
barplot(head(sort(table(df$rt), decreasing=TRUE), 10), las=2)
We can also do some counting to find out who was RT’d the most, for example:

#count the occurrences of each name in the rt column
rt.counts = table(df$rt)
#sort the results in descending order and display the top 5 results
head(sort(rt.counts, decreasing=TRUE), 5)
#There are probably better ways of doing that! If so, let me know via comments

Next on the to do list is:
– automate the production of archive reports
– work in the time component so we can view behaviour over time in Gephi… (here’s a starting point maybe, again from Cornelius Puschmann’s blog: Dynamic Twitter graphs with R and Gephi (clip and code))

As things stand though, I may not be able to get round to either of those for a while…

Accessing and Visualising Sentencing Data for Local Courts

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]

In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…

…but first, a couple of observations:

– the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
– the URL of the sentencing data does not contain a time reference, even though the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?

The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.

The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…

Unix Command Line Tools

I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).

In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:

head recordlevel.csv

Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:

grep -i wight recordlevel.csv > recordsContainingWight.csv

(The -i flag reads: “ignoring case”; grep is a command that identifies rows containing the search term (wight, in this case). The > recordsContainingWight.csv part says “send the result to the file recordsContainingWight.csv”.)
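By way of a self-contained toy example (using a made-up two-row CSV standing in for the real sentencing data):

```shell
# create a tiny stand-in for recordlevel.csv
printf 'court,offence\nNewport (Isle of Wight),Theft\nLondon,Fraud\n' > sample.csv

# pull out rows mentioning "wight", ignoring case, into a new file
grep -i wight sample.csv > wightRows.csv

# count how many rows matched (-c counts matching lines)
grep -ic wight sample.csv
# → 1

cat wightRows.csv
# → Newport (Isle of Wight),Theft
```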

Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.

Isle of Wight sentencing data

Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):

Visualising data in google fusion tables

We can also introduce filters to allow us to explore subsets of the data. For example, here are the offences committed by females aged 35+:

Data exploration in Google Fusion tables

Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm – unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers, maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against Force and, within a Force area, Court with Court.


If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the recordlevel.csv dataset.

Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (I apply the regular expression to the contents of the court column):

recordlevel <- read.csv("~/data/recordlevel.csv")
iw <- subset(recordlevel, grepl("wight", court, ignore.case=TRUE))

We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:

#tabulate the number of sentences by age group, then plot the counts
age <- table(iw$AGE)
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")

R - bar plot

We can also start to look at combinations of factors. For example, how do offence types vary with age?

ageOffence=table(iw$AGE, iw$Offence_type)
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R barplot - offences on IW

If we remove the beside=T argument, we can produce a stacked bar chart:

barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R - stacked bar chart

If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:

#You may need to install ggplot2 as a library if it isn't already installed
library(ggplot2)
#NB in current versions of ggplot2, opts()/theme_text() are replaced by theme()/element_text()
ggplot(iw, aes(factor(Offence_type)))+ geom_bar() + opts(axis.text.x=theme_text(angle=-90))+xlab('Offence Type')

GGPlot2 in R

Alternatively, we can break down offence types by age:

ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)

ggplot facet barplot

We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:

ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)

ggplot with stacked factor

One thing I’m not sure how to do is pull the data apart in a ggplot context so that we can display percentage breakdowns – comparing, for example, the percentage breakdown by offence type of sentences handed down to males vs. females. If you do know how to do that, please post a comment below ;-)

PS Here’s an easy way of getting started with ggplot – use the online hosted version with this data set: wightCrimRecords.csv; download the file to your computer, then upload it as shown below:

PPS I got a little way towards identifying percentage breakdowns using a crib from here. The command I used generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this directly, but we can create separate ones for each sex, and then plot the percentages from those. What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame – if you know how, please add a comment below… (I also posted a question on Cross Validated, the stats bit of Stack Exchange…)
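For what it’s worth, here’s a minimal sketch of one route to percentage breakdowns using prop.table – note this is my reconstruction, not the original crib, and the toy iw data frame below just stands in for the sentencing subset built earlier:

```r
# Toy stand-in for the iw data frame (same column names as the sentencing data)
iw <- data.frame(
  Offence_type = c("Theft", "Theft", "Drugs", "Drugs", "Theft", "Drugs"),
  sex = c("Male", "Female", "Male", "Male", "Male", "Female")
)

# Percentage breakdown of offence type within each sex:
# margin=2 normalises each column (each sex) so it sums to 100
pct <- prop.table(table(iw$Offence_type, iw$sex), margin=2) * 100

# Flatten the two-dimensional table into a single long-form data frame
pct.df <- as.data.frame(pct)
names(pct.df) <- c("Offence_type", "sex", "percent")
pct.df
```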

The Visual Difference – R and Anscombe’s Quartet

I spent a chunk of today trying to get my thoughts in order for a keynote presentation at next week’s The Difference that Makes a Difference conference. The theme of my talk will be on how visualisations can be used to discover structure and pattern in data, and as in many of my other recent talks I found the idea of Anscombe’s quartet once again providing a quick way in to the idea that sometimes the visual dimension can reveal a story that simple numerical analysis appears to deny.

For those of you who haven’t come across Anscombe’s quartet yet, it’s a set of four simple 2 dimensional data sets (each 11 rows long) that have similar statistical properties, but different stories to tell…

Quite by chance, I also happened upon a short exercise based on using R to calculate some statistical properties of the quartet (More useless statistics), so I thought I’d try and flail around in my unprincipled hack-it-and-see approach to learning R to see if I could do something similar with rather simpler primitives than described in that blog post.

(If you’re new to R and want to play along, I recommend RStudio…)

Here’s the original data set – you can see it in R simply by typing anscombe:

   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

We can construct a simple data frame containing just the values of x1 and y1 with a construction of the form: data.frame(x=c(anscombe$x1),y=c(anscombe$y1)) (where we identify the columns explicitly by column name) or alternatively data.frame(x=anscombe[[1]],y=anscombe[[5]]) (where we refer to them by column index number).

x y
1 10 8.04
2 8 6.95
3 13 7.58
4 9 8.81
5 11 8.33
6 14 9.96
7 6 7.24
8 4 4.26
9 12 10.84
10 7 4.82
11 5 5.68

A tidier way of writing this is as follows:

In order to call on, or refer to, the data frame, we assign it to a variable: g1data=with(anscombe,data.frame(xVal=c(x1),yVal=c(y1)))

We can then inspect the mean and sd values: mean(g1data$xVal), or sd(g1data$yVal)

> mean(g1data$xVal)
[1] 9
> sd(g1data$xVal)
[1] 3.316625

To plot the data, we can simply issue a plot command: plot(g1data)

R- getting started with anscombe's quartet

It would be possible to create similar datasets for each of the separate groups of data, but R has all sorts of tricks for working with data (apparently…?!;-) There are probably much better ways of getting hold of the statistics in a more direct way, but here’s the approach I took. Firstly, we need to reshape the data a little. I cribbed the “useless stats” approach for most of this. The aim is to produce a data set with 44 rows, and 3 columns: x, y and a column that identifies the group of results (let’s call them xVal, yVal and mygroup, to match the code below). The mygroup values will range from 1 to 4, identifying each of the four datasets in turn (so the first 11 rows will be for x1, y1 and will have mygroup value 1; then the values for x2, y2 and mygroup equal to 2, and so on). That is, we want a dataset that starts:

1 10 8.04 1
2 8 6.95 1

and ends:

43 8 7.91 4
44 8 6.89 4

To begin with, we need some helper routines:

– how many rows are there in the data set? nrow(anscombe)
– how do we create a set of values to label the rows by group number (i.e. how do we generate a set of 11 1’s, then 11 2’s, 11 3’s and 11 4’s)? Here’s how: gl(4, nrow(anscombe)) [typing ?gl in R should bring up the appropriate help page;-) What we do is construct a list of 4 values, with each value repeating nrow(anscombe) times]
– to add a mygroup column (containing just the value 1) to a dataframe built from the x1 and y1 columns, we simply insert an additional column definition: data.frame(xVal=c(anscombe$x1), yVal=c(anscombe$y1), mygroup=gl(1,nrow(anscombe)))
– to generate a data frame containing three columns and the data for group 1, followed by group 2, we would use a construction of the form: data.frame(xVal=c(anscombe$x1,anscombe$x2), yVal=c(anscombe$y1,anscombe$y2), mygroup=gl(2,nrow(anscombe))). That is, populate the xVal column with rows from x1 in the anscombe dataset, then the rows from column x2; populate yVal with values from y1 then y2; and populate myGroup with 11 1’s followed by 11 2’s.
– a more compact way of constructing the data frame is to specify that we want to concatenate (c()) values from several columns from the same dataset: with(anscombe,data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), mygroup=gl(4,nrow(anscombe))))
– to be able to reference this dataset, we need to assign it to a variable: mydata=with(anscombe,data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), mygroup=gl(4,nrow(anscombe))))

This final command will give us a data frame with the 3 columns, as required, containing values from group 1, then group 2, then groups 3 and 4, all labelled appropriately.
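Running head on the result confirms the shape, with the first values coming straight from the x1/y1 columns of the anscombe table above:

```r
# Rebuild the combined long-form dataset and peek at the first few rows
mydata <- with(anscombe,
  data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), mygroup=gl(4,nrow(anscombe))))
head(mydata, 3)
#   xVal yVal mygroup
# 1   10 8.04       1
# 2    8 6.95       1
# 3   13 7.58       1
```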

To find the means for each column by group, we can use the aggregate command: aggregate(.~mygroup,data=mydata,mean)

(I think you read that as follows: “aggregate the current data (.) grouped by (~) mygroup, using the mydata dataset, reporting the groupwise application of the function mean”.)

To find the SD values: aggregate(.~mygroup,data=mydata,sd)

R-playing with anscombe

Cribbing an approach I discovered from the hosted version of the ggplot R graphics library, here’s a way of plotting the data for each of the four groups from within the single aggregate dataset. (If you are new to R, you will need to download and install the ggplot2 package; in RStudio, from the Packages menu, select Install Packages and enter ggplot2 to download and install the package. To load the package into your current R session, tick the box next to the installed package name or enter the command library("ggplot2").)

The single command to plot xy scatterplots for each of the four groups in the combined 3 column dataset is as follows:

ggplot(mydata, aes(x=xVal, y=yVal)) + geom_point() + facet_wrap(~mygroup)
And here’s the result (remember, the statistical properties were the same…)

R - anscombe's quartet

To recap the R commands:

mydata=with(anscombe,data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), mygroup=gl(4,nrow(anscombe))))
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point() + facet_wrap(~mygroup)

PS this looks exciting from an educational and opendata perspective, though I haven’t had a chance to play with it: OpenCPU – a server where you can upload and run R functions. (The other hosted R solutions I was aware of – R-Node, and an online R-Server – don’t seem to be working any more? For completeness, here’s the link to the hosted ggplot IDE referred to in the post. And finally – if you need to crunch big files, CloudNumbers may be appropriate (disclaimer: I haven’t tried it).)

PPS And here’s something else for the data junkies – an easy way of getting data into R from How to access 100M time series in R in under 60 seconds.