Pokemon Go seems to have hit the news this week – though I’m sure for anyone off social media last week and back to it next week, the whole thing will have completely passed them by – demonstrating that augmented reality apps really haven’t moved on much at all over the last five years or so.
But notwithstanding that, I’ve been trying to make sense of a whole range of mediated reality technologies for myself as prep for a very short unit on technologies and techniques on that topic.
Here’s what I’ve done to date, over on the Digital Worlds uncourse blog. This stuff isn’t official OU course material, it’s just my own personal learning diary of related stuff (technical term!;-)
- Blurred Edges – Dual Reality: intro post, setting the scene and introducing the worlds of mediated and augmented reality;
- Introducing Augmented Reality Apparatus – From Victorian Stage Effects to Head-Up Displays: some background history of a related effect – Pepper’s Ghost – and a look at modern day head-up displays that make use of it;
- The Art of Sound – Algorithmic Foley Artists? An aside reviewing the work of foley artists, and asking whether machines could ever replace them…
- Taxonomies for Describing Mixed and Alternate Reality Systems: some vocab to help us talk about the components that make up a mediated reality system;
- Augmenting Reality With Digital Overlays: examples of how augmented reality can be used to overlay the physical world with digital artefacts;
- Real or Virtual Objects? A bit more vocab – what’s real and what’s virtual?
- “Magic Lenses” and See-Through Displays: examples of “magic lens” style augmented reality;
- Noise Cancellation – An Example of Mediated Audio Reality? Augmented/mediated reality isn’t just about what you can see…
More to come over the next couple of weeks or so. If you want to comment, and perhaps influence the direction of my meanderings, please feel free to do that here or on the relevant post.
As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.
Courtesy of @flutterF1, I managed to get a peek of some betting data from one of the race weekends last year year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.
(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself it for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)
The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)
The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.
Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets in to an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.
Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:
The US Betfair Odds Race Winner.xslx is a raw data collection with 5 columns….
1) The timestap (an annoying format but there is a reason for this albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) the total amount of money traded on that driver so far.
5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.
To reduce the amount of data I only record it when the price traded changes or if the amount changes.
Looking through the datafile, they appear to be some gotchas, so these need cleansing out:
Here’s my initial loader script:
library(gdata) xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1) colnames(xl)=c('dt','driver','odds','amount','racing') #Cleansing pass bf.odds=subset(xl,racing!='') str(bf.odds) 'data.frame': 10732 obs. of 5 variables: $ dt : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ... $ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ... $ odds : num 3.9 7 17 16.5 24 140 120 180 270 550 ... $ amount: num 1340 557 120 118 195 ... $ racing: int 0 0 0 0 0 0 0 0 0 0 ... #Generate a proper datetime field from the dt column #This is a hacked way of doing it. How do I do it properly? bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt)) #If we rerun str(), we get the following extra line in the results: # $ dtt : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...
Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:
#We're going to make use of the ggplot2 graphics library library(ggplot2) #Let's get a quick feel for bets around each driver g=ggplot(xl)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y") g=g+theme(axis.text.x=element_text(angle=-90)) g #Let's look in a little more detail around a particular driver within a particular time window g=ggplot(subset(xl,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y") g=g+theme(axis.text.x=element_text(angle=-90)) g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00'))) g
Here are the charts (obviously lacking in caption, tidy labels and so on).
Firstly, the odds by driver:
Secondly, zooming in on a particular driver in a particular time window:
That all seems to work okay, so how about the other spreadsheet?
#There are several sheets to choose from, named as follows: #Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3 #Load in data from a particular specified sheet race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race') #The datetime column appears to be in Excel datetime format, so cast it into something meaningful race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1")) #Note that I am not I checking for gotcha rows, though maybe I should...? #Use the directlabels package to help tidy up the display a little library(directlabels) #Let's just check we've got something loaded - prune the display to rule out the longshots g=ggplot(subset(race.odds,Bet365<30),aes(x=tTime,y=Bet365,group=Runner,col=Runner,label=Runner)) g=g+geom_line()+theme_bw()+theme(legend.position = "none") g=g+geom_dl(method=list('top.bumpup',cex=0.6)) g=g+scale_x_datetime(expand=c(0.15,0)) g
Here’s a view over the drivers’ odds to win, with the longshots pruned out:
With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:
#Let's see if we can also plot the odds by bookie colnames(race.odds) # "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred" #  "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin" # "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red" # "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max" # "Range" "tTime" #We can remove items from this list using something like this: tmp=colnames(race.odds) #tmp=tmp[tmp!='Range'] tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time'] #Then we can create a subset of cols race.odds.data=subset(race.odds,select=tmp) #Melt the data library(reshape) race.odds.data.m=melt(race.odds.data,id=c('tTime','Runner')) #head( race.odds.data.m) # tTime Runner variable value #1 2012-11-11 19:07:01 Sebastian Vettel (Red) Bet365 2.37 #2 2012-11-11 19:07:01 Lewis Hamilton (McL) Bet365 3.25 #3 2012-11-11 19:07:01 Fernando Alonso (Fer) Bet365 6.00 #... #Now we can plot how the different bookies compare g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable)) g=g+geom_line()+theme_bw()+theme(legend.position = "none") g=g+geom_dl(method=list('top.bumpup',cex=0.6)) g=g+scale_x_datetime(expand=c(0.15,0)) g
Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…
I finally, finally made time to get started on my statistics learning journey with a look at some of the results in the paper A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing.
Note that these notes represent a description of the things I learned trying to draw on ideas contained within the paper and apply it to data I had available. There may be errors… if you spot any, please let me know via the comments.
The paper uses race classification data from the 2009 season, comparing F1 and NASCAR championships and claiming to explore the extent to which positions in practice and qualifying relate to race classification. I won’t be looking at the NASCAR data, but I did try to replicate the F1 stats which I’ll start to describe later on in this post. I’ll also try to build up a series of interactive apps around the analyses, maybe along with some more traditional print format style reports.
(There are so many things I want to learn about, from the stats themselves, to possible workflows for analysis and reporting, to interactive analysis tools that I’m not sure what order any of it will pan out into, or even the extent to which I should try to write separate posts about the separate learnings…)
As a source of data, I used my f1com megascraper that grabs classification data (along with sector times and fastest laps) since 2006. (The ergast API doesn’t have the practice results, though it does have race and qualifying results going back a long way, so it could be used to do a partial analysis over many more seasons). I downloaded the whole Scraperwiki SQLite database which I could then load into R and play with offline at my leisure.
The first result of note in the paper is a chart that claims to demonstrate the Spearman rank correlation between practise and race results, qualification and race results, and championship points and race results, for each race in the season. The caption to the corresponding NASCAR graphs explains the shaded region: “In general, the values in the grey area are not statistically significant and the values in the white area are statistically significant.” A few practical uses we might put the chart to come to mind (possibly!): is qualifying position or p3 position a good indicator of race classification (that is, is the order folk finish at the end of p3 a good indicator of the rank order in which they’ll finish the race?)?; if the different rank orders are not correlated, (people finish the race in a different order to the gird position), does this say anything about how exciting the race might have been? Does the “statistical significance” of the correlation value add anything meaningful?
So what is this chart trying to show and what, if anything, might it tell us of interest about the race weekend?
First up, the statistic that’s being plotted is Spearman’s rank correlation coefficient. There are four things to note there:
- it’s a coefficient: a coefficent is a single number that tends to come in two flavours (often both at the same time). In a mathematical equation, a coefficient is typically a constant number that is used as multiplier of a variable. So for example, in the equation t = 2 x, the x variable has the coefficient 2. Note that the coefficient may also be a parameter, as for example in the equation y= a.x (where the . means ‘multiply’, and we naturally read x as a dependent variable that is used to determine the value of y having been multiplied by the value of a). However, a coefficient may also be a particular number that characterises a particular relationship between two things. In this case, it characterises the degree of correlation between two things…
- The refrain “correlation is not causation” is one heard widely around the web that mocks the fact that just because two things may be correlated – that is, when one thing changes, another changes in a similar way – it doesn’t necessarily mean that the way one thing changed caused the other to change in a similar way as a result. (Of course, it might mean that…;-). (If you want to find purely incidental correlations between things, have a look at Google correlate, that identifies different search terms whose search trends over time are similar. You can even draw your own trend over time to find terms that have trended in a similar way.)
Correlation, then, describes the extent to which two things tend to change over time in a similar way: when one goes up, the other goes up; when one goes down, the other goes down. (If they behave in opposite ways – if one goes up steeply the other goes down steeply; if one goes up gently, the other goes down gently – then they are negatively or anti-correlated).
Correlation measures require that you have paired data. You know you have paired data if you can plot your data as a two dimensional scatterplot and label each point with the name of the person or thing that was measured to get the two co-ordinate values. So for example, on a plot of qualifying position versus race classification, I can label each point with the name of the driver. The qualification position and race classification is paired data around the set of drivers.
- A rank correlation coefficient is used to describe the extent to which two rankings are correlated. Ranked orders do away with the extent to which different elements in a series differ from each other and just concentrate on their rank order. That is, we don’t care how much change there is in each of the data series, we’re just interested in rank positions within each series. In the case of an F1 race, the distribution of laptimes during qualifying may see the first few cars separated by a few thousandths of a second, but the time between the best laptimes of consecutively placed cars at the back of the grid might be of the order of tenths of a second. (Distributions take into account, for example, the variety and range of values in a dataset.) However, in a rank ordered chart, all we are interested in is the integer position: first, second, third, …, nineteenth, twentieth. There is no information about the magnitude of the actual time difference between the laptimes, that is, how far close to or far apart from each other the laptimes of consecutively placed cars were, we just know the rank order of fastest to slowest cars. The distribution of the rank values is not really very interesting, or subject to change, at all.
One thing that I learned that’s possibly handy to know when decoding the jargon: rank based stats are also often referred to as non-parametric statistics because no assumptions are made about how the numbers are distributed (presumably, there are no parameters of note relating to how the values are distributed, such as the mean and standard deviation of a “normal” distribution). If we think about the evolution of laptimes in a race, then most of them will be within a few tenths of the fastest lap each lap, with perhaps two bunches of slower lap times (in-lap and out-lap around a pitstop). The distribution of these lap times may be interesting (for example, the distribution of laptimes on a lap when everyone is racing will be different to the distribution of lap times on a lap when several cars are pitting). On the other hand, for each lap, the distribution of the rank order of laptimes during that lap will always be the same (first fastest, second fastest, third fastest, etc.). That is not to say, however, that the rank order of the drivers’ lap times does not change lap on lap, which of course it might do (Webber might be tenth fastest on lap 1, fastest on lap 2, eight fastest on lap three, and so on).
Of course, this being stats, “non-parametric” probably means lots of other things as well, but for now my rule of thumb will be: the distribution doesn’t matter (that is, the statistic does not make any assumptions about the distribution of the data in order for the statistic to work; which is to say, that’s one thing you don’t have to check in order to use the statistic (erm, I think…?!)
- The statistic chosen was Spearman’s rank correlation coefficient. Three differently calculated correlation coefficients appear to be widely used, (and also appear as possible methods in the R corr() function that calculates correlations between lists of numbers): i) Pearson’s product moment correlation coefficient (how well does a straight line through an x-y scatterplot of the data describe the relationship between the x and y values, and what’s the sign of its gradient); ii) Spearman’s rank correlation coefficient (also known as Spearman’s rho or rs); [this interactive is rather nice and shows how Pearson and Spearman correlations can differ]; iii) Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series.). Other flavours of correlation coefficient are also available (for example, Lin’s concordance correlation coefficient, as demonstrated in this example of identifying a signature collapse in a political party’s electoral vote when the Pearson coefficient suggested it had held up, which I think is used to test for how close to a 45 degree line the x-y association between paired data points is…).
The statistical significance test is based around the “null hypothesis” that the two sets of results are not correlated; the result is significant if they are more correlated than you might expect if both are randomly ordered. this made me a little twitchy: wouldn’t it be equally valid to argue that F1 is a procession and we would expect the race position and grid position to be perfectly correlated, for example, and then define our test for significance on the extent to which they are not?
This comparison of Pearson’s product moment and Spearman’s rank correlation coefficients helped me get a slightly clearer view of the notion of “test” and how both these coefficients act as tests for particular sorts of relationship. The Pearson product moment coefficent has a high value if a strong linear relationship holds across the data pairs. The Spearman rank correlation is weaker, in that it is simply looking to see whether or not the relationship is monotonic (that is, things all go up together, or they all go down together, but the extent to which they do so need not define a linear relationship, which is an assumption of the Pearson test.). In addition, when defining the statistical significance of the test, this is dependent on particular assumptions about the distribution of the data values, at least in the case of the Pearson test. The statistical significance relates to how likely the correlation value was assuming a normal distribution in the values within the paired data series (that is, each series is assumed to represent a normally distributed set of values).
If I understand this right, it means we separate the two things out: on the one hand, we have the statistic (the correlation coefficient); on the other, we have the significance test, which tells you how likely that result us given a set of assumptions about how the data is distributed. A question that then comes to mind is this: is the definition of the statistic dependent on a particular distribution of the data in order for the statistic to have something interesting to say, or is it just the significance that relies on that distribution. To twist this slightly, if we can’t do a significance test, is the statistic then essentially meaningless (because we don’t know whether those values are likely to occur whatever the relationships (or even no relationship) between the data sets?). Hmm.. maybe a statistic is actually a measurement in the context of its significance given some sort of assumption about how likely it is to occur by chance?
As far as Spearman’s rank correlation coefficient goes, I was little bit confused by the greyed “not significant” boundary on the diagram shown above. The claim is that any correlation value in that grey area could be accounted for in many cases by random sorting. Take two sets of driver numbers, sort them both randomly, and much of the time the correlation value will fall in that region. (I suspect this is not only arbitrary but misleading? If you have random orderings, is the likelihood that the correlation is in the range -0.1 to 0.1 the same as the likelihood that it will be in the range 0.2 to 0.4? Is the probability distribution of correlations “uniform” across the +/- 1 range?) Also, my hazy vague memory is that the population size affects the confidence interval (see also Explorations in statistics: confidence intervals) – isn’t this the principle on which funnel plots are built? The caption to the figure suggests that the population size (the “degrees of freedom”) was different for different races (different numbers of drivers). So why isn’t the shaded interval differently sized for those races?
Something else confused me about the interval values used to denote the significance of the Spearman rho values – where do they come from? A little digging suggested that they come from a table (i.e. someone worked them out numerically, presumably by generating looking at the distribution of different random rank orderings, rather than algorithmically – I couldn’t find a formula to calculate them? I did find this on Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations by D Bonett (Psychometrika Vol. 65, No. 1, 23-28, March 2000) though). A little more digging suggested Significance Testing of the Spearman Rank Correlation Coefficient by J Zar (Journal of the American Statistical Association, Vol. 67, No. 339 (Sep., 1972), pp. 578- 580) as a key work on this, with some later qualification in Testing the Significance of Kendall’s τ and Spearman’s rs by M. Nijsse (Psychological Bulletin, 1988, Vol. 103, No. 2,235-237).
Hmmm.. this post was supposed to be about running the some of the stats used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing on some more recent data. But I’m well over a couple of thousand words into this post and still not started that bit… So maybe I’ll finish now, and hold the actual number crunching over to the next post…
PS I find myself: happier that I (think I) understand a little bit more about the rationale of significance tests; just as sceptical as ever I was about the actual practice;-)
One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting normally distributes; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)
So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)
Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-past fashion, which is how I approach most of my coding!
Bare naked learning, in other words.
So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)
That’s the setting then… Now: #f1stats. What’s that all about?
Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…
- First up, I’ve been reading Nate Silver’s The Signal and the Noise, which mentions the aging stats and aging models for baseball players. I found a paper on The Age Productivity Gradient: Evidence from a sample of F1 Drivers, which hasn’t got too many scary equations in, so I may try to replicate that and then bring the models up to date (the paper is dated 2009). It would have been so nice if the authors had published code equivalents in R that I could have played with directly, but I haven’t been able to find it if they did. I also found a paper on Estimated Age Effects in Baseball, again with equations but no code, but it may provide additional clues. From a quick skim, I think there may be some mileage in trying to get my head round different ways of comparing rankings.
- A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing is perhaps an easier thing to try to copy for starters, though>
- The article The wisdom of ignorant crowds: predicting sport outcomes by mere recognition explores a simple tournament winner predicting strategy based on how recognisable the names of competitors are. (I guess social media metrics might be a proxy for recognition? Hmm.. could test that I suppose with reference to this paper?) One thing that caught my eye were a couple of simple schemes for benchmarking different prediction models, which might be something I could pull on if I end up exploring prediction models?
- NASCAR results have featured in several papers (I think there’s also a NASCAR dataset available in R?) so I’ll probably try dipping in to them at some point to see if I can do similar things with F1 data. For example, an analysis of NASCAR Winston Cup Race Results for 1975-2003; a couple of papers on hierarchical modelling of auto-racing results; and some Bayesian inference stuff that I guess is really beyond me for now and that I really really could do with a pre-built R libraries for;
- an MSc thesis I’ve referred to before on Prediction of Formula One Race Results Using Driver Characteristics has some handy ideas that I might be able to draw on if I have a look at laptime data;
- One of the the things I’ve been pondering is ways of ranking drivers based on fast lap times (eg during qualifying, vs. during the race). Although not about motor sport, or any sort of racing, A New Method for Ranking Total Driving Performance on the PGA Tour has a metric I may be able to bastardise in a Formula One context. The same periodical also has an article on Do Reliable Predictors Exist for the Outcomes of NASCAR Races?, the techniques of which might be applicable to F1? Predicting The Outcome Of NASCAR Races: The Role Of Driver Experience looks to be in a similar vein too…
- A paper on Outcome Uncertainty in NASCAR looks at how attendance and TV audience figures are influenced by race expectations, which might be something that could also be explored in context of UK F1 TV audience figures. That said, the notion of “outcome uncertainty” itself, and related measures, might also be worth exploring in their own right?
If you know of any other relevant looking papers or articles, please post a link in the comments.
– Who is the Best Formula 1 Driver? An Econometric Analysis
I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…
PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!
As ever, it seems the longer I have to prepare something, the less likely I am to do it. I was supposed to be running a #change11 MOOC session this week – Infoskills 2.012 How to do a lot with a little – but having had it in the diary for a 6 months or so, I have, of course, done nothing to prepare for it… (I didn’t come up with the 2.012 – not sure who did?)
Anyway….over the weekend, I gave a presentation (Social Media Visualisation Hacks) that, typically, bewildered the audience with a blizzard of things that are possible when it comes to looking at social networks but that are still alien to most:
As ever, the presentation is not complete (i.e. the slides really need to be complemented by a commentary), but that’s something I hope to start working on improving – maybe starting this week…
The deck is a review – of sorts – of some of the various ways we can look at social networks and the activity that takes place within them. The slides are prompts, keys, search phrase suggestions that provide a starting point for finding out more. Many of the slides contain screenshots – and if you peer closely enough, you can often see the URL. For posts on my blog, searching with the word ouseful followed by key terms from the post title will often turn up the result on major search engines. Many of the slides identify a “hack” that is described in pseudo-tutorial form on the this blog, or on Martin Hawksey’s MASHe blog.
I put together a delicious stack of links relating to the presentation here: #drg12 – Visualising Social Networks (Tutorial Posts)
For a tutorial stack that focusses more on Yahoo Pipes (though who knows how long that is still for this world given the perilous nature of Yahoo at the moment), see: Twitter Pipes
My #change11 week was supposed to be about new info skills, with a practical emphasis. A couple of other presentations relating to how we might appropriate (a-pro-pre-eight) tools and applications can be found here: Appropriate IT – My ILI2011 Presentation and Just Do IT Yourself… MY UKSG Presentation.
If all you do in Google is 2.3 keyword searches, this deck – Making the Most of Google – (though possibly a little dated by now), may give you some new ideas.
For a more formal take on infoskills for the new age, (though I need to write a critique of this from my own left-field position), see the Cambridge University Library/Arcadia Project “New Curriculum for Information Literacy (ANCIL)” project via the Arcadia project.
If you want to do some formal reading in the visualisation space, check out 7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don’t Know.
Mulling over @downes’ half hour post on Education as Platform: The MOOC Experience and what we can do to make it better, I see the MOOC framework as providing an element of co-ordination, pacing and a legacy resource package. For my week, I was expected(?!) to put together some readings and exercises and maybe a webinar. But I haven’t prepared anything (I tried giving a talk at Dev8D earlier this year completely unprepared (though I did have an old presentation I could have used) and it felt to me like a car crash/a disaster, so I know I do need to prep things (even though it may not seem like it if you’ve heard me speak before!;-))
But maybe that’s okay, for one week of this MOOC? The OUSeful.info blog, where you’re maybe reading this, is an ongoing presentation, with a post typically every day or so. WHen I learn something related to the general themes of this blog, I post it here, often as a partial tutorial (partial in the sense that you often have to work through the tutorial for the words to make sense – they complement the things you should be seeing on screen – if you look – as you work through the tutorial; in a sense, the tutorial posts are often second screen complements to and drivers of an activity on another screen).
I’ve personally tried registering with MOOCs a couple of times, but never completed any of the activities. Some of the MOOC related readings or activities pass my way through blogs I follow, or tweeted links that pique my interest, and sometimes I try them out. I guess I’m creating my own unstructured uncourse* daily anyway… (*uncourses complement MOOCs, sort of… They’re courses created live in partial response to feedback, but also reflecting the “teacher”‘s learning journey through a topic. Here’s an example that led to a formal OU course: Digital Worlds uncourse blog experiment. The philosophy is based on the “teacher” modelling – and documenting – a learning journey. Uncourses fully expect the “teacher” not being totally knowledgeable about the subject area, but being happy to demonstrate how they go about making sense of a topic that may well be new to them).
So… this is my #change11 offering. It’s not part of the “formal” course, (how weird does that sound?!) unless it is… As the MOOC is now in week 29, if its principles have been taken on-board, you should be starting to figure out your own distributed co-ordination mechanisms by now. Because what else will you do when the course is over? Or will it be a course that never ends, yet ceases to have a central co-ordination point?
PS if you want to chat, this blog is open to comments; you can also find me on Twitter as @psychemedia
PPS seems like I’ve had at least one critical response (via trackbacks) towards my lacksadaisical “contribution” towards my “teaching” week on the #Change11 MOOC. True. Sorry. But not. I should have kept it simple, posted my motto – identify a problem, then hack a solution to it, every day – and left it at that… It’s how I learn about this stuff… (and any teaching I receive tends to be indirect – by virtue stuff other folk have published that I’ve discovered through web search, (aka search queries – questins – that I’ve had to formulate to help me answer the problem I have identified/created…).
It’s one thing publishing data just to comply with a formal requirement to make it public, quite another if you’re publishing it because you want folk to doing something with it.
But if you decide you’re publishing data because you want folk to do something with it, what does that mean exactly?
[Not quite related but relevant: Pete Sefton (@ptsefton) on Did you say you “own” this data? You keep using that word. I do not think it means what you think it means.]
One answer might be that you want them to be able to refer to the data for their own purposes, simply by cutting an pasting a results datatable out of one of your spreadsheets so they can paste it into one of theirs and refer to it as “evidence”:
Another might be that you want folk to be able to draw on your data as part of their own decision making process. And so on. (For a couple of other use cases, see First Thoughts On A Content Strategy for Data.)
A desire that appears to have have gained some traction over the last couple of years is to publish data so that folk can produce visualisations based on it. This is generally seen as a Good Thing, although I’m not sure I know exactly why…? Perhaps it’s because visualisations are shiny objects and folk can sometimes be persuaded to share (links to) shiny objects across their social networks; this in turn may help raise wider awareness about the existence of your data, and potentially bring it to the attention of somebody who can actually makes some use of it, or extract some value from it, possibly in combination with one or more other datasets that you may or may not be aware of.
Something that I’ve become increasingly aware of over the last couple of years is that people respond to graphics and data visualisations in very different ways. The default assumption seems to be that a graphic should expose some truth in very obvious way without any external input. (I’m including things like axis labels, legends, and captions in the definition of a graphic.) That is, the graphic should be a self-contained, atomic object, meaningful in its own right. I think this view is borne out of the assumption that graphics are used to communicate something that is known by the author who used it to their audience. The graphic is chosen because it does “self-evidently” make some point that makes the author’s case. Let’s call these “presentation graphics”. Presentation graphics are shiny objects, designed to communicate something in particular, to a particular audience, in (if at all possible) a self-contained way.
Another way of using visualisations is as part of a visual analysis process. In this case, visual representations of the data are generated by the analyst as part of a conversation they are having with the data. One aim of this conversation (or maybe we should call it an interrogation?!) may be to get the data to reveal something about its structure, or meaningful patterns contained within it. Visual analysis is therefore less to do with the immediate requirement of producing meaningful presentation graphics, and more to do with getting the data to tell its story. Scripted speeches contain soundbites – presentation graphics. Conversations can ramble all over the place and are often so deeply situated in a particular context they are meaningless to onlookers – as visualisations produced during a visual analysis activity may be. (Alternatively, the visual analyst spends their time trying to learn how to ride a bike. Chris Hoy and Victoria Pendleton show how it’s done with the presentation graphics…)
It could be that I’m setting up something of a false dichotomy between extrema here, because sometimes a simple, “directly generated” chart may be effective as both a simple 1-step visual analysis view, and as a presentation graphic. But I’m trying to think through my fingers and type my way round to what I actually believe about all this stuff, and arguing to limits is one lazy way of doing this! The distinction is also not just mine… For example, Iliinsky and Steele’s Designing Data Visualizations identifies the following:
Explanatory visualization: Data visualizations that are used to transmit information or a point of view from the designer to the reader. Explanatory visualizations typically have a specific “story” or information that they are intended to transmit.
Exploratory visualization: Data visualizations that are used by the designer for self-informative purposes to discover patterns, trends, or sub-problems in a dataset. Exploratory visualizations typically don’t have an already-known story.
They also define a data visualizations as “[v]isualizations that are algorithmically generated and can be easily regenerated with different data, are usually data-rich, and are often aesthetically shallow.” Leaving aside the aesthetics, the notion that data visualisations can be “algorithmically generated” is important here.
A related insight I picked up from the New York Times’ Amanda Cox is the use of statistical charts of visual analysis as sketches that help us engage with data en route to understanding some of the stories it contains, stories that may then be told by whatever means are appropriate (which may or may not include graphical representations or visualisations).
So when it comes to publishing data in the hope that folk will do something visual with it, does that mean we want to provide them with the data that can be directly used to convey some known truth in an appealing way, or do we want to provide them with data in such a way that they can engage with it in a (visual) analytic way and the communicate their insight through a really beautiful presentation graphic? (Note that it may often be the case that something discovered through a visual analysis step may actually best be communicated through a simple set of ranked, tabulated data presented as text…) Maybe this explains why folk are publishing the data in the hope that it will be “visualised”? They are conflating visual analysis with presentation graphics, and hoping that novel visualisation (visual analysis) techniques will: 1) provide new insights (new sense made from) the data, that: 2) also work as shiny, shareable and insightful presentation graphics? Hmmm…
Publishing Data to Support Visual Engagement
So we have our data set, but how can we publish it in a way that supports generative visual engagement with it (generative in the sense that we want the user to have at least some role in creating their own visual representations of the data)?
The easiest route to engagement is to publish an interactive visualisation on top of your data set so that the only way folk can engage with the data is through the interactive visualisation interface. So for example, interactive visualisations published by the BBC or New York Times. These typically support the generation of novel views over the data by allowing the user to construct queries over the data through interactive form elements (drop down lists, radio buttons, sliders, checkboxes, etc.); these queries are then executed to filter or analyse the data and provide a view over it that can be visually displayed in a predetermined way. The publisher may also choose to provide alternative ways of visualising the data (for example, scatter plot or bar chart) based on preidentified ways of mapping from the data to various graphical dimensions within particular chart types. In the case of the interactive visualisation hosted on the publisher’s website, the user is thus typically isolated from the actual data.
An alternative approach is to publish the data in an environment that supports the creation of visualisations at what we might term the row and column level. This is where ideas relating to a content strategy for data start to come in to play. An example of this is IBM’s Many Eyes data visualisation tool. Long neglected by IBM, the Many Eyes website provides an environment for: 1) uploading tabular datasets; 2) generating preconfigured interactive visualisations on top of the datasets; 3) providing embeddable versions of visualisations; 4) supporting discussions around visualisations. Note that a login is required to upload data and generate new visualisations.
Once the data is uploaded, the user has the option of generating one or more interactive visualisations of the data from a wide range of visualisation types. For example, here’s a matrix chart view (click through to see the interactive version; note: Java required).
In an ideal world, getting the data into Many Eyes should just(?) have been a case of copy and pasting data from the original spreadsheet. (Note that this requires access to a application that can open the spreadsheet, either on the desktop or online.) In the case of the DCMS CASE data, this required opening the spreadsheet, finding the correct sheet, then identifying the cell range containing the data we want to visualise:
Things are never just that simple, of course… Whilst it is possible to define columns as “text” or “number” in Many Eyes, the date field was recognised by the Many Eyes visualisation tools as a “number” which may lead to some visualisations incorrectly aggregating data from the same region across several date ranges. In order to force the Many Eyes visualisations to recongnise the date column as “text”, I had to edit the original data file (before uploading it to Many Eyes) by prepending the date ranges with an alphabetic character (so for example I replaced instances of 2004/05 with Y2004/05).
Recommendation In terms of a “content strategy for data”, then, we need to identify possible target domains where we wish to syndicate or republish our data and then either: 1) publish the data to that domain, possibly under our own branding, or “official” account on that domain – this approach also allows the publisher to add provenance metadata, a link to the homepage for the data or its original source, and so on; or, 2) publish the data on our site in such a way that we know it will work on the target domain (which means testing it/trying it out…). If you expect users to upload your data to services like Many Eyes themselves, it would make sense to provide easily cut and pastable example text of the sort you might expect to see appear in the metadata fields of the data page on the target site and encourage users to make use of that text.
Recommendation A lot of the CASE data spreadsheets contain multiple cell ranges corresponding to different tables within a particular sheet. Many cut and paste tools support data that can be cut and pasted from appropriately highlighted cell ranges. However, other tools require data in comma separated (rather than tab separated) format which mean the user must copy and paste the data into another sheet and then save it as CSV. Although a very simple format, there is a lot to be said for publishing very simple CSV files containing your data. Provenance and explanatory data often gets separated from data contained in CSV files, but you can always add a manifest text file to the collection of CSV data files to explain the contents of each one.
Whilst services such as Many Eyes do their best in trying to identify numeric versus categorical data columns, unless the user is familiar with the sorts of data a particular visualisation type requires and how it should be presented, it can sometimes be hard to understand why Many Eyes has automatically identified particular values for use in drop down list boxes, and at times hard to interpret what is actually being displayed. (This is a good reason to limit the use of Many Eyes to a visual analysis role, and use it to discover things that look interesting/odd and then go off and dig in the data a little more to se if there really is anything interesting there…)
In some cases, it may be possible to reshape the data and get it in to a form that Many Eyes can work with. (Remember the Iliinsky and Steele definition of a data visualisation as something “algorithmically generated”? If the data is presented in the right way, then Many Eyes can do something with it. But if it’s presented in the wrong way, not joy…) As an example, if we look at the CASE Capital Investment data, we see it has columns for Region, Local Authority, Date, as well as columns relating to the different investment types. Presented this way, we can easily group data across different years within an LA or Region. Alternatively, we might have selected Region, Local Authority, and Asset type columns, with separate columns for each date range. This different combination of rows and columns may provides a different basis for the sorts of visualisations we can generate within Many Eyes and the different summary views we can present over the data.
Recommendation The shape in which the data it published may have an effect on the range of visualisations that can be directly generated from the data, without reshaping by the user. It may be appropriate to publish the data in a variety of shapes, or provide tools for reshaping data for use with particular target services. Tools such as the Stanford Data Wrangler are making it easier for people to clean and reshape messy data sets, but that is out of scope for this review. In addition, it is worth consider the data type or physical form in which data is published. For example, in columns relating to finanacial amounts, prepending each data element in a cell with a £ my break cut and paste visualisation tools such as Many Eyes, which will recognise the element as a character string. Some tools are capable of recognising datetime formats, so in some cases it may be appropriate to publish date/datetime in a standardised way. Many tools choke on punctuation characters from Windows character sets, and despite best efforts, rogue characters and undeclared or incorrect character encodings often find their way in to datasets which present them working correctly in third party applications. Some tools will automatically strip out leading and trailing whitespace characters, others will treat them as actual characters. Where string matching operations are applied (for example, grouping data elements) a word with a trailing space and a word without a trailing space may be treated as defining different groups. (Which is to say, try to strip leading and trailing whitespace in your data. Experts know to check for this, novices don’t).
If the expectation is that users will make use of a service such as Many Eyes, it may be worth providing an FAQ area that describes what shape the different visualisation expect the data to be in, with examples from your own data sets. Services such as Number Picture, which provide a framework for visualising data by means of visualisation templates that accept data in a specified shape and form, provided helpful user prompts that explain what the (algorithmic) visualisation expects in terms of the shape and form of input data:
Custom Filter Visualisations – Google Fusion Tables
Google Fusion Tables are like spreadsheets on steroids. They combine features of traditional spreadsheets with database like query support and access to popular chart types. Google Fusion Tables can be populated by importing data from Google Spreadsheets or uploading data from CSV files (example Fusion Table).
Google Fusion Tables can also be used to generate new tables based on the fusion of two (or, by chaining, more than two) tables that share a common column. So for example, given three CSV data files containing different data sets (for example, file A has LA codes, and Regions, file B has LA codes and arts spend by LA, and file C has LA codes and sports engagement data) we can merge the files on the common columns to give a “fused” data set (for example, a single table containing four columns: LA codes, Regions, arts spend, sports engagement data). Note that the data may need to be appropriately shaped before it can be fused in a meaningful way with other data sets.
As with many sites that support data upload/import, it’s typically down the to the user to add appropriate metadata to to the data file. This metadata is important for a variety of reasons: firstly, it provides context around a dataset; secondly, it may aid in discovery of the data set if the data and its metadata is publicly indexed; thirdly, it may support tracking, which can be useful if the original publisher needs to demonsstrate how widely a dataset has been (re)used.
If there are too many steps involved in getting the data from the download site into the target environment (for example, if it needs downloading, a cell range copying, saving into another data file, cleaning, then uploading) the distance from the original data source to the file that is uploaded may result in the user not adding much metadata at all. As before, if it is anticipated that a service such as Google Fusion Tables is a likely locus for (re)use of a dataset, the publisher should consider publishing the data directly through the service, with high quality metadata in place, or provide obvious cues and cribs to users about the metadata they might wish to add to their data uploads.
A nice feature of Google Fusion Tables is the way it provides support for dynamic and compound queries over a data set. So for example, we can filter rows:
Or generate summary/aggregate views:
A range of standard visualisation types are available:
Charts can be used to generate views over filtered data:
Or filtered and aggregated data:
Note that these charts may not be of publishable quality/useful as presentation graphics, but they may be useful as part of a visual analysis of the data. To this extent, the lack of detailed legends and titles/captions for the chart does not necessarily present a problem – the visual analyst should be aware of what the data they are viewing actually represents (and they can always check the filter and aggregate settings if they are unsure, as well as dropping in to the tabular data view to check actual numerical values if anything appears to be “odd”. However, the lack of explanatory labeling is likely to be an issue if the intention is to produce a presentation graphic, in which case the user will need to grab a copy of the image and maybe postprocess it elsewhere.
Note that Google Fusion Tables is capable of geo-coding certain sorts of location related data such as placenames or postcodes and rendering associated markers on a map. It is also possible to generate thematic maps based on arbitrary geographical shapefiles (eg Thematic Maps with Google Fusion Tables [PDF]).
Helping Data Flow – Treat It as a Database
Services such as Google Spreadsheets provide online spreadsheets that support traditional spreadsheet operations that include chart generation (using standard chart types familiar to spreadsheet users) and support for interactive graphical widgets (including more exotic chart types, such as tree maps), powered by spreadsheet data, that can be embedded in third party webpages. Simple aggregate reshaping of data is provided in the from of support for Pivot Tables. (Note however that Google Spreadsheet functionality is sometimes a little bug ridden…) Google spreadsheets also provides a powerful query API (the Google Visulisation API), that allows the spreadsheet to be treated as a database. For an example in another government domain, see Government Spending Data Explorer; see also Guardian Datastore MPs’ Expenses Spreadsheet as a Database ).
Publishing data in this way has the following benefits: 1) treating the data as a spreadsheet allows query based views to be generated over it; 2) this views can be visualised directly in the page (this includes dynamic visulisations, as for example described in Google Chart Tools – dynamic controls, and gallery); 3) queries can be used to generated CSV based views over the data that can be (re)used in third party applications.
Sometimes it makes sense to visualise data in a geographical way. One service that provides a quick way of generating choropleth/thematic maps from simple two or three column data keyed by UK administrative geography labels or identifiers is OpenHeatmap. Data can be uploaded from a simple CSV file or imported from a Google spreadsheet using predetermined column names (a column identifying geographical areas according to one of fixed number of geographies, a number value column for colouring the geographical area, and an optional date column for animation purposes (so a map can be viewed in an animated way over consecutive time periods):
Once generated, links to an online version of the map are available.
The code for OpenHeatmap is available as open source software so without too much effort it should be possible to modify the code in order to host a local instance of the software and tie it in a set of predetermined Google spreadsheets, local CSV files, or data views generated from queries over a predetermined datasource so that only the publisher’s data can be visualised using the particular instance of OpenHeatmap.
Other services for publishing and visualising geo-related data are available (eg Geocommons) and could play a role as a possible outlet in a content strategy for data with a strong geographical bias.
Power Tools – R
A further class of tools that can be used to generate visual representations or arbitrary datasets are the fully programmatic tools, such as the R statistical programming language. Developed for academic use, R is currently increasing in popularity on the coat tails of “big data” and the growing interest in analysis of activity data (“paradata”) that is produced as a side-effect of our digital activities. R is capable of importing data in a wide variety of formats from local files as well as via a URL from an online source. The R data model supports a range of powerful transformations that allow data to be shaped as required. Merging data that shares common columns (in whole or part) from separate sources is also supported.
In order to reduce overheads in getting data into a useful shape within the R environment, it may make sense to publish datafile “wrappers” that act as a simple API to data contained with one or published spreadsheets or datafiles. By providing an object type and, where appropriate, access methods for the the data, the data publisher can provide a solid framework on top of which third parties can build their own analysis and statistical charts. R is supported by a wide range of third party extension libraries for generating a wide range of statistical charts and graphics, including maps. (Of particular note are ggplot2 for generating graphics according to the Grammar of Graphics model, and googleVis, which provides a range of functions that support the rapid generation of Google Charts). Many of the charts can be generated from a single R command if the data is in the correct shape and format.
As well as running as a local, desktop application, R can also be run as a hosted webservice (for example, cloudstat.org; the RStudio cross-platform desktop application can also be accessed as a hosted online service, and could presumably be used to provide a robust, online hosted analysis environment tied in to a set of locked down data sources). It is also possible to use R to power online hosted statistical charting services; see for example .
Some cleaning of the data may be required before uploading to the ggplot service. For example, empty cells marked as such by a “-” should be replaced by empty cells; numeric values containing a “,”, may be misinterpreted as character strings (factor levels) rather than numbers (in which case the data needs cleaning by removing commas). Again, if it is known that a service such as ggplot2 is likely to be a target for data reuse, publishing the data in a format that is known to work “just by loading the data in” to R with default import settings will reduce friction/overheads and keep the barriers to reusing the data within that environment to a minimum.
Observation Most of the time, most people don’t get past default settings on any piece of software. If someone tries to load your data into an application, odds on they will use default, factory settings. If you know that users are likely to want to use your data in a particular package, make at least a version of your data available in a format that will load into that package under the default settings in a meaningful way.
Finally, a couple of wild card possibilities.
Firstly, Wolfram Alpha. Wolfram Alpha provides access to a “computational search engine” that accepts natural language queries about facts or data and attempts to provide reasoned responses to those queries, including graphics. Wolfram Alpha is based around a wide range of curated data sets, so an engagement strategy with them may, in some certain circumstances, be appropriate (for example, working with them in the publication of data sets and then directing users to Wolfram Alpha in return). Wolfram Alpha also offers a “Pro” service (Wolfram Alpha Pro) that allows users to visualise and query their own data.
Secondly, the Google Refine Reconciliation API. Google Refine is a cross-platform for cleaning datasets, with the ability to reconcile the content of data columns with canonical identifiers published elsewhere. For example, it is possible to reconcile the names of local authorities with canonical Linked Data identifiers via the Kasabi platform (UK Adminstrative Geography codes and identifiers).
By anchoring cell values to canonical identifiers, it becomes possible to aggregate data from different sources around those known, uniquely identified items in a definite and non-arbitrary way. By publishing: a) a reconciliation service (eg for LEP codes); and b) data that relates to identifiers returned by the reconciliation service (for example, sports data by LEP), the data publisher provides a focus for third parties who want to reconcile their own data against the published identifiers, as well as a source of data values that can be used to annotate records referencing those identifiers. (So for example, if you make it easy for me to get Local Authority codes based on local authority names from your reconciliation service, and also publish data linked to those identifiers (sports engagement data, say), if I reconcile my data against your codes, I will also be provided with the opportunity to annotate my data with your data (so I can annotate my local LEP spend data with your LER sports engagement data; [probably a bad example… need something more convincing?!])… Although uptake of the reconciliation API (and the associated possibility of providing annotation services) is still a minority interest, there are some signs of interest in it (for example, Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data; note that data published on the Kasabi platform also exposes a Google Refine reconciliation service endpoint.) In my opinion, there are potentially significant benefits to be had by publishing reconciliation service endpoints with associated annotation services if a culture of use grows up around this protocol.
Not covered: as part of this review, I have not covered applications such as Microsoft Excel or Tableau Desktop (the latter being a Windows only data visualisation environment that is growing in popularity). Instead, I have tried to focus on applications that are freely available either via the web or on a cross-platform basis. There is also a new kid on the block – datawrapper.de – but it’s still early days for this tool…
What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…
After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)
Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list (https://graph.facebook.com/me/friends?access_token=A_LONG_JUMBLE_OF_LETTERS) or the list of likes someone has made (https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS); replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.
(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)
Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:
Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:
If you click the highlighted selection, you should see the data that will be used to create your project:
You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:
We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:
The Likes URL has the form https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS which we’ll tinker with as follows:
The throttle control tells Refine how often to make each call. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what limit the Facebook API is happy with (if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…)?
Having imported the data, you should find a new column:
At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exprted the data using the default Templating… export format, which produces some sort of JSON output…
I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:
import simplejson,csv writer=csv.writer(open('fbliketest.csv','wb+'),quoting=csv.QUOTE_ALL) fn='my-fb-friends-likes.txt' data = simplejson.load(open(fn,'r')) id=0 for d in data['rows']: id=id+1 #'interests' is the column name containing the Likes data interests=simplejson.loads(d['interests']) for i in interests['data']: print str(id),i['name'],i['category'] writer.writerow([str(id),i['name'].encode('ascii','ignore')])
[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]
I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:
Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.
PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)
PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.
PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et. al suggests it’s possible to profile people based on their Likes. It would be interesting to compare how robust that profiling is, compared to profiles based on the common Likes of a person’s followers, or the common likes of folk in the same Facebook groups as an individual?