Posts Tagged ‘f1datajunkie’
F1Stats – Correlations Between Qualifying, Grid and Race Classification
Following directly on from F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification, and continuing in my attempt to replicate some of the methodology and results used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing, here’s a quick look at the correlation scores between the final practice, qualifying and grid positions and the final race classification.
I’ve already done a brief review of what correlation means (sort of) in F1Stats – A Prequel to Getting Started With Rank Correlations, so I’m just going to dive straight in with some R code that shows how I set about trying to find the correlations between the different classifications:
Here’s the answer from the back of the paper that we’re aiming for…
Here’s what I got:
> corrs.df[order(corrs.df$V1),]
V1 p3pos.int qpos.int grid.int racepos.raw pval.grid pval.qpos pval.p3pos
2 AUSTRALIA 0.30075188 0.01503759 0.087218045 1 7.143421e-01 9.518408e-01 0.197072158
13 MALAYSIA 0.42706767 0.57293233 0.630075188 1 3.584362e-03 9.410805e-03 0.061725312
6 CHINA -0.26015038 0.57443609 0.514285714 1 2.183596e-02 9.193214e-03 0.266812583
3 BAHRAIN 0.13082707 0.73233083 0.739849624 1 2.900250e-04 3.601434e-04 0.581232598
16 SPAIN 0.25112782 0.80451128 0.804511278 1 2.179221e-05 2.179221e-05 0.284231482
14 MONACO 0.51578947 0.48120301 0.476691729 1 3.513870e-02 3.326706e-02 0.021403708
17 TURKEY 0.52330827 0.73082707 0.730827068 1 3.756531e-04 3.756531e-04 0.019344720
9 GREAT BRITAIN 0.65413534 0.83007519 0.830075188 1 8.921842e-07 8.921842e-07 0.002260234
8 GERMANY 0.32030075 0.46917293 0.452631579 1 4.657539e-02 3.844275e-02 0.168419054
10 HUNGARY 0.49649123 0.37017544 0.370175439 1 1.194050e-01 1.194050e-01 0.032293715
7 EUROPE 0.28120301 0.72030075 0.720300752 1 4.997719e-04 4.997719e-04 0.228898214
4 BELGIUM 0.06766917 0.62105263 0.621052632 1 4.222076e-03 4.222076e-03 0.777083014
11 ITALY 0.52932331 0.52481203 0.524812030 1 1.895282e-02 1.895282e-02 0.017815489
15 SINGAPORE 0.50526316 0.58796992 0.715789474 1 5.621214e-04 7.414170e-03 0.024579520
12 JAPAN 0.34912281 0.74561404 0.849122807 1 0.000000e+00 3.739715e-04 0.143204045
5 BRAZIL -0.51578947 -0.02105263 -0.007518797 1 9.771776e-01 9.316030e-01 0.021403708
1 ABU DHABI 0.42556391 0.66466165 0.628571429 1 3.684738e-03 1.824565e-03 0.062722332
The paper mistakenly reports the grid values as the qualifying positions, so if we look down the grid.int column that I use to contain the correlation values between the grid and final classifications, we see they broadly match the values quoted in the paper. I also calculated the p-values and they seem to be a little bit off, but of the right order.
And here’s the R-code I used to get those results… The first chunk is just the loader, a refinement of the code I have used previously:
require(RSQLite)
require(plyr) #ddply() and mutate() come from plyr
require(reshape)
#Data downloaded from my f1com scraper on scraperwiki
#Recent versions of RSQLite expect a driver object, SQLite(), rather than the string "SQLite"
f1 = dbConnect(SQLite(), dbname="f1com_megascraper.sqlite")
getRacesData.full=function(year='2012'){
  #Data query
  results.combined=dbGetQuery(f1,
    paste('SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==',year,' and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;',sep=''))
  #Data tidying
  #racepos.raw numbers the rows within each race in the order they are returned
  results.combined=ddply(results.combined,.(race),mutate,racepos.raw=1:length(race))
  for (i in c('racepos','grid','qpos','p3pos','driverNum'))
    results.combined[[paste(i,'.int',sep='')]]=as.integer( as.character(results.combined[[i]]))
  results.combined$race=reorder(results.combined$race,results.combined$raceNum)
  results.combined
}
results.combined=getRacesData.full(2009)
Here’s the actual correlation calculation – I use the cor function:
#The cor() function returns data that looks like:
# p3pos.int qpos.int grid.int racepos.raw
#p3pos.int 1.0000000 0.31578947 0.28270677 0.30075188
#qpos.int 0.3157895 1.00000000 0.97744361 0.01503759
#grid.int 0.2827068 0.97744361 1.00000000 0.08721805
#racepos.raw 0.3007519 0.01503759 0.08721805 1.00000000
#Row/col 4 relates to the correlation with the race classification, so for now just return that
corr.rank.race=function(results.combined,cmethod='spearman'){
  ##Correlations
  corrs=NULL
  #Run through the races
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined,
                                 race==i,
                                 select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
    #print(i)
    #print( results.classified)
    cp=cor(results.classified,method=cmethod,use="complete.obs")
    #print(cp[4,])
    corrs=rbind(corrs,c(i,cp[4,]))
  }
  corrs.df=as.data.frame(corrs)
  signif=data.frame()
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined,
                                 race==i,
                                 select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
    #p.value
    pval.grid=cor.test(results.classified$racepos.raw,results.classified$grid.int,method=cmethod,alternative = "two.sided")$p.value
    pval.qpos=cor.test(results.classified$racepos.raw,results.classified$qpos.int,method=cmethod,alternative = "two.sided")$p.value
    pval.p3pos=cor.test(results.classified$racepos.raw,results.classified$p3pos.int,method=cmethod,alternative = "two.sided")$p.value
    signif=rbind(signif,data.frame(race=i,pval.grid=pval.grid,pval.qpos=pval.qpos,pval.p3pos=pval.p3pos))
  }
  corrs.df$qpos.int=as.numeric(as.character(corrs.df$qpos.int))
  corrs.df$grid.int=as.numeric(as.character(corrs.df$grid.int))
  corrs.df$p3pos.int=as.numeric(as.character(corrs.df$p3pos.int))
  corrs.df=merge(corrs.df,signif,by.y='race',by.x='V1')
  corrs.df$V1=factor(corrs.df$V1,levels=levels(results.combined$race))
  corrs.df
}
corrs.df=corr.rank.race(results.combined)
It’s then trivial to plot the result:
require(ggplot2)
xRot=function(g,s=5,lab=NULL) g+theme(axis.text.x=element_text(angle=-90,size=s))+xlab(lab)
g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int))
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g
Recalling that there are different types of rank correlation function, specifically “Kendall’s τ (that is, Kendall’s Tau); this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series”, I wondered whether it would make sense to look at correlations under this measure to see whether there were any obvious looking differences compared to Spearman’s rho that might prompt us to look at the actual grid/race classifications to see which score appears to be more meaningful.
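To make the concordance idea concrete, here is a minimal sketch that counts concordant and discordant pairs by hand and checks the result against R's built-in cor(). The grid and finish vectors are made-up toy values, not data from the paper or the scraper, and the hand calculation assumes there are no tied ranks.

```r
# Made-up grid and finishing positions for five drivers (illustrative only)
grid   <- c(1, 2, 3, 4, 5)
finish <- c(2, 1, 3, 5, 4)

# Every pair of drivers, one pair per column
pairs <- combn(length(grid), 2)

# +1 where a pair is ordered the same way in both series (concordant),
# -1 where the order is reversed (discordant); assumes no ties
signs <- sign(grid[pairs[1, ]] - grid[pairs[2, ]]) *
         sign(finish[pairs[1, ]] - finish[pairs[2, ]])

# Kendall's tau: (concordant - discordant) / number of pairs
tau.byhand <- sum(signs) / ncol(pairs)

# The same measure from cor(), plus Spearman's rho for comparison
tau.cor <- cor(grid, finish, method = "kendall")
rho.cor <- cor(grid, finish, method = "spearman")
```

In this toy case two of the ten pairs swap order, so tau is lower than rho for the same data, which is the same direction of difference that shows up in the race-by-race comparison.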
The easiest way to spot the difference is probably graphically:
corrs.df2=corr.rank.race(results.combined,'kendall')
corrs.df2[order(corrs.df2$V1),]
g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int),col='red',size=4)
g=g+geom_point(data=corrs.df2, aes(x=V1,y=grid.int),col='blue')
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g
corrs.df2[order(corrs.df2$V1),]
V1 p3pos.int qpos.int grid.int racepos.raw pval.grid pval.qpos pval.p3pos
2 AUSTRALIA 0.17894737 -0.01052632 0.04210526 1 8.226829e-01 9.744669e-01 0.288378196
13 MALAYSIA 0.26315789 0.41052632 0.46315789 1 3.782665e-03 1.110136e-02 0.112604127
6 CHINA -0.20000000 0.41052632 0.35789474 1 2.832863e-02 1.110136e-02 0.233266557
3 BAHRAIN 0.07368421 0.51578947 0.52631579 1 8.408301e-04 1.099522e-03 0.677108239
16 SPAIN 0.17894737 0.64210526 0.64210526 1 2.506940e-05 2.506940e-05 0.288378196
14 MONACO 0.38947368 0.35789474 0.35789474 1 2.832863e-02 2.832863e-02 0.016406081
17 TURKEY 0.37894737 0.64210526 0.64210526 1 2.506940e-05 2.506940e-05 0.019784403
9 GREAT BRITAIN 0.46315789 0.63157895 0.63157895 1 3.622261e-05 3.622261e-05 0.003782665
8 GERMANY 0.23157895 0.31578947 0.30526316 1 6.380788e-02 5.475355e-02 0.164976406
10 HUNGARY 0.36842105 0.36842105 0.36842105 1 2.860214e-02 2.860214e-02 0.028602137
7 EUROPE 0.21052632 0.62105263 0.62105263 1 5.176962e-05 5.176962e-05 0.208628398
4 BELGIUM 0.02105263 0.46315789 0.46315789 1 3.782665e-03 3.782665e-03 0.923502331
11 ITALY 0.35789474 0.36842105 0.36842105 1 2.373450e-02 2.373450e-02 0.028328627
15 SINGAPORE 0.35789474 0.45263158 0.55789474 1 3.589956e-04 4.748310e-03 0.028328627
12 JAPAN 0.26315789 0.57894737 0.69590643 1 6.491222e-06 3.109641e-04 0.124796908
5 BRAZIL -0.37894737 -0.05263158 -0.04210526 1 8.226829e-01 7.732195e-01 0.019784403
1 ABU DHABI 0.34736842 0.61052632 0.55789474 1 3.589956e-04 7.321900e-05 0.033643947
Hmm.. Kendall gives lower grid values for all races, though for Hungary the two scores are almost identical (0.370 vs 0.368) – maybe put that on the “must look at Hungary compared to the other races” pile…;-)
One thing that did occur to me was that I have access to race data from other years, so it shouldn’t be too hard to see how the correlations play out over the years at different circuits (do grid/race correlations tend to be higher at some circuits, for example?).
testYears=function(years=2009:2012){
  bd=NULL
  for (year in years) {
    d=getRacesData.full(year)
    corrs.df=corr.rank.race(d)
    bd=rbind(bd,cbind(year,corrs.df))
  }
  bd
}
a=testYears(2006:2012)
ggplot(a)+geom_point(aes(x=year,y=grid.int))+facet_wrap(~V1)+ylim(0,1)
g=ggplot(a)+geom_boxplot(aes(x=V1,y=grid.int))
g=xRot(g)
g
So Spain and Turkey look like they tend to the processional? Let’s see if a boxplot bears that out:
How predictable have the years been, year on year?
g=ggplot(a)+geom_point(aes(x=V1,y=grid.int))+facet_wrap(~year)+ylim(0,1)
g=xRot(g)
g
ggplot(a)+geom_boxplot(aes(x=factor(year),y=grid.int))
And as a boxplot:
From a betting point of view, (eg Getting Started with F1 Betting Data and The Basics of Betting as a Way of Keeping Score…) it possibly also makes sense to look at the correlation between the P3 times and the qualifying classification to see if there is a testable edge in the data when it comes to betting on quali?
I think I need to tweak my code slightly to make it easy to pull out correlations between specific columns, but that’ll have to wait for another day…
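One possible shape for that tweak is sketched below. corr.cols() is a hypothetical helper, not part of the code above, that takes the names of any two columns and returns one rank correlation per race. The toy data frame simply stands in for results.combined and its values are made up.

```r
# Rank correlation between two named columns, calculated race by race.
# corr.cols() is a hypothetical helper, not part of the original code.
corr.cols <- function(df, xcol, ycol, cmethod = "spearman") {
  races <- unique(as.character(df$race))
  vals <- sapply(races, function(r) {
    d <- df[df$race == r, ]
    cor(d[[xcol]], d[[ycol]], method = cmethod, use = "complete.obs")
  })
  data.frame(race = races, corr = unname(vals), stringsAsFactors = FALSE)
}

# Toy data standing in for results.combined (values are made up)
toy <- data.frame(
  race      = rep(c("A", "B"), each = 4),
  p3pos.int = c(1, 2, 3, 4,  1, 2, 3, 4),
  qpos.int  = c(1, 2, 3, 4,  4, 3, 2, 1)
)

# P3 classification against qualifying classification, per race
toy.corrs <- corr.cols(toy, "p3pos.int", "qpos.int")
```

With the real data the call would presumably look like corr.cols(results.combined, 'p3pos.int', 'qpos.int'), assuming those columns are present as in the loader.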
Getting Started with F1 Betting Data
As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.
Courtesy of @flutterF1, I managed to get a peek of some betting data from one of the race weekends last year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.
(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)
The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)
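A simple version of such a safe loader might look like the sketch below. safe.load() is a hypothetical name, and the approach (check with requireNamespace(), install if missing, then attach) is only one of several ways to do it. The installer argument is an assumption added so the install step can be swapped out.

```r
# safe.load() is a hypothetical helper: install a package only if it is
# missing, then attach it with library().
safe.load <- function(pkgs, installer = utils::install.packages) {
  for (p in pkgs) {
    # requireNamespace() returns FALSE rather than failing if p is absent
    if (!requireNamespace(p, quietly = TRUE)) {
      installer(p)
    }
    library(p, character.only = TRUE)
  }
  invisible(pkgs)
}

# Packages that ship with R are always present, so nothing is installed here
loaded <- safe.load(c("stats", "utils"))
```

For the packages used in these posts the call would be something like safe.load(c('gdata','ggplot2','directlabels','reshape')), which will try to download anything that is not already installed.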
The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.
Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets in to an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.
Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:
The US Betfair Odds Race Winner.xlsx is a raw data collection with 5 columns….
1) The timestamp (an annoying format but there is a reason for this albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) The total amount of money traded on that driver so far.
5) If the race is ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.
To reduce the amount of data I only record it when the price traded changes or if the amount changes.
Looking through the datafile, there appear to be some gotchas, so these need cleansing out.
Here’s my initial loader script:
library(gdata)
xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1)
colnames(xl)=c('dt','driver','odds','amount','racing')
#Cleansing pass
bf.odds=subset(xl,racing!='')
str(bf.odds)
'data.frame': 10732 obs. of 5 variables:
$ dt : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ...
$ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ...
$ odds : num 3.9 7 17 16.5 24 140 120 180 270 550 ...
$ amount: num 1340 557 120 118 195 ...
$ racing: int 0 0 0 0 0 0 0 0 0 0 ...
#Generate a proper datetime field from the dt column
#This is a hacked way of doing it. How do I do it properly?
bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt))
#If we rerun str(), we get the following extra line in the results:
# $ dtt : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...
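On the “how do I do it properly?” question, one option is to hand as.POSIXct() an explicit format string rather than editing the text first. The sketch below uses example strings only. The ISO-style form is inferred from the gsub("T", " ", ...) call, the US-style form from the first factor level shown by str(), and the GMT timezone is an assumption.

```r
# Example strings only; the ISO form is inferred from the gsub() call above
raw.iso <- c("2012-11-11T11:00:08", "2012-11-18T19:00:00")
dtt.iso <- as.POSIXct(raw.iso, format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")

# US-style 12 hour strings, like the first factor level reported by str().
# %p (AM/PM) is locale dependent, so this assumes an English or C locale.
raw.us <- "11/16/2012 12:24:52 AM"
dtt.us <- as.POSIXct(raw.us, format = "%m/%d/%Y %I:%M:%S %p", tz = "GMT")
```

Strings that do not match the format come back as NA, which is also a handy way of spotting gotcha rows.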
Here’s what the raw data, as loaded, looks like to the eye:

Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:
#We're going to make use of the ggplot2 graphics library
library(ggplot2)
#Let's get a quick feel for bets around each driver
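#Use bf.odds rather than xl: it is the cleansed data frame, and the one that has the dtt column
g=ggplot(bf.odds)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")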
g=g+theme(axis.text.x=element_text(angle=-90))
g
#Let's look in a little more detail around a particular driver within a particular time window
g=ggplot(subset(bf.odds,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+theme(axis.text.x=element_text(angle=-90))
g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00')))
g
Here are the charts (obviously lacking in caption, tidy labels and so on).
Firstly, the odds by driver:
Secondly, zooming in on a particular driver in a particular time window:
That all seems to work okay, so how about the other spreadsheet?
#There are several sheets to choose from, named as follows:
#Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3
#Load in data from a particular specified sheet
race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race')
#The datetime column appears to be in Excel datetime format, so cast it into something meaningful
race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1"))
#Note that I am not checking for gotcha rows, though maybe I should...?
#Use the directlabels package to help tidy up the display a little
library(directlabels)
#Let's just check we've got something loaded - prune the display to rule out the longshots
g=ggplot(subset(race.odds,Bet365<30),aes(x=tTime,y=Bet365,group=Runner,col=Runner,label=Runner))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
g=g+geom_dl(method=list('top.bumpup',cex=0.6))
g=g+scale_x_datetime(expand=c(0.15,0))
g
Here’s a view over the drivers’ odds to win, with the longshots pruned out:
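As an aside on the Excel datetime cast used in the loader above: the 25569 is the number of days between Excel's day zero (1899-12-30 in the 1900 date system) and the Unix epoch, and 86400 is the number of seconds in a day. A small worked example, using a made-up serial value rather than one read from the spreadsheet:

```r
# Excel (1900 date system) counts days from 1899-12-30,
# so 1970-01-01 falls on day 25569
excel.offset <- as.numeric(as.Date("1970-01-01") - as.Date("1899-12-30"))

# A made-up serial value: the .5 is half a day, i.e. noon
serial <- 41224.5

# Days since the Unix epoch, converted to seconds
secs <- (serial - excel.offset) * 86400
when <- as.POSIXct(secs, origin = "1970-01-01", tz = "GMT")
```

Here when works out as noon on 11 November 2012, which is in the same range as the timestamps in the melted data further down.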
With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:
#Let's see if we can also plot the odds by bookie
colnames(race.odds)
#[1] "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred"
# [8] "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin"
#[15] "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red"
#[22] "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max"
#[29] "Range" "tTime"
#We can remove items from this list using something like this:
tmp=colnames(race.odds)
#tmp=tmp[tmp!='Range']
tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time']
#Then we can create a subset of cols
race.odds.data=subset(race.odds,select=tmp)
#Melt the data
library(reshape)
race.odds.data.m=melt(race.odds.data,id=c('tTime','Runner'))
#head( race.odds.data.m)
# tTime Runner variable value
#1 2012-11-11 19:07:01 Sebastian Vettel (Red) Bet365 2.37
#2 2012-11-11 19:07:01 Lewis Hamilton (McL) Bet365 3.25
#3 2012-11-11 19:07:01 Fernando Alonso (Fer) Bet365 6.00
#...
#Now we can plot how the different bookies compare
g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
g=g+geom_dl(method=list('top.bumpup',cex=0.6))
g=g+scale_x_datetime(expand=c(0.15,0))
g
Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…
My Personal Intro to F1 Race Statistics
One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting normally distributes; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)
So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)
Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-paste fashion, which is how I approach most of my coding!
Bare naked learning, in other words.
So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)
That’s the setting then… Now: #f1stats. What’s that all about?
Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…
- First up, I’ve been reading Nate Silver’s The Signal and the Noise, which mentions the aging stats and aging models for baseball players. I found a paper on The Age Productivity Gradient: Evidence from a sample of F1 Drivers, which hasn’t got too many scary equations in, so I may try to replicate that and then bring the models up to date (the paper is dated 2009). It would have been so nice if the authors had published code equivalents in R that I could have played with directly, but I haven’t been able to find it if they did. I also found a paper on Estimated Age Effects in Baseball, again with equations but no code, but it may provide additional clues. From a quick skim, I think there may be some mileage in trying to get my head round different ways of comparing rankings.
- A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing is perhaps an easier thing to try to copy for starters, though.
- The article The wisdom of ignorant crowds: predicting sport outcomes by mere recognition explores a simple tournament winner predicting strategy based on how recognisable the names of competitors are. (I guess social media metrics might be a proxy for recognition? Hmm.. could test that I suppose with reference to this paper?) One thing that caught my eye were a couple of simple schemes for benchmarking different prediction models, which might be something I could pull on if I end up exploring prediction models?
- NASCAR results have featured in several papers (I think there’s also a NASCAR dataset available in R?) so I’ll probably try dipping in to them at some point to see if I can do similar things with F1 data. For example, an analysis of NASCAR Winston Cup Race Results for 1975-2003; a couple of papers on hierarchical modelling of auto-racing results; and some Bayesian inference stuff that I guess is really beyond me for now and that I really really could do with some pre-built R libraries for;
- an MSc thesis I’ve referred to before on Prediction of Formula One Race Results Using Driver Characteristics has some handy ideas that I might be able to draw on if I have a look at laptime data;
- One of the things I’ve been pondering is ways of ranking drivers based on fast lap times (eg during qualifying, vs. during the race). Although not about motor sport, or any sort of racing, A New Method for Ranking Total Driving Performance on the PGA Tour has a metric I may be able to bastardise in a Formula One context. The same periodical also has an article on Do Reliable Predictors Exist for the Outcomes of NASCAR Races?, the techniques of which might be applicable to F1? Predicting The Outcome Of NASCAR Races: The Role Of Driver Experience looks to be in a similar vein too…
- A paper on Outcome Uncertainty in NASCAR looks at how attendance and TV audience figures are influenced by race expectations, which might be something that could also be explored in context of UK F1 TV audience figures. That said, the notion of “outcome uncertainty” itself, and related measures, might also be worth exploring in their own right?
If you know of any other relevant looking papers or articles, please post a link in the comments.
[MORE LINKS...
- Who is the Best Formula 1 Driver? An Econometric Analysis
]
I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…
PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!
Emergent Social Interest Mapping – Red Bull Racing Facebook Group
With the possibility that my effectively unlimited Twitter API key will die at some point in the Spring with the Twitter API upgrade, I’m starting to look around for alternative sources of interest signal (aka getting ready to say “bye, bye, Twitter interest mapping”). And Facebook groups look like they may offer one possibility…
Some time ago, I did a demo of how to map the common Facebook Likes of my Facebook friends (Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine). In part inspired by a conversation today about profiling the interests of members of particular Facebook groups, I thought I’d have a quick peek at the Facebook API to see if it’s possible to grab the membership list of arbitrary, open Facebook groups, and then pull down the list of Likes made by the members of the group.
As with my other social positioning/social interest mapping experiments, the idea behind this approach is broadly this: users express interest through some sort of public action, such as following a particular Twitter account that can be associated with a particular interest. In this case, the signal I’m associating with an expression of interest is a Facebook Like. To locate something in interest space, we need to be able to detect a set of users associated with that thing, identify each of their interests, and then find interests they have in common. These shared interests (ideally over and above a “background level of shared interest”, aka the Stephen Fry effect, from Twitter, where a large number of people in any set of people appear to follow Stephen Fry, oblivious of other more pertinent shared interests that are peculiar to that set of people) are then assumed to be representative of the interests associated with the thing. In this case, the thing is a Facebook group, the users associated with the thing are the group members, and the interests associated with the thing are the things commonly liked by members of the group.
Simples.
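As a rough sketch of the “over and above background” idea, one could compare the share of group members who like something with the share in some background sample. Everything here is made up for illustration: the two data frames, the like.share() helper and the use of a simple ratio (“lift”) are all assumptions, and no such background sample has actually been collected.

```r
# Made-up (user, like) pairs for members of a group ...
group.likes <- data.frame(
  user = c("u1", "u1", "u2", "u2", "u3", "u3", "u4"),
  like = c("F1", "Stephen Fry", "F1", "Stephen Fry", "F1", "Top Gear", "Stephen Fry")
)
# ... and for a hypothetical background sample of unrelated users
background.likes <- data.frame(
  user = c("b1", "b2", "b3", "b4", "b4"),
  like = c("Stephen Fry", "Stephen Fry", "Stephen Fry", "F1", "Top Gear")
)

# Share of distinct users in a sample who like each thing
like.share <- function(df) {
  df <- unique(df)
  table(df$like) / length(unique(df$user))
}

g.share <- like.share(group.likes)
b.share <- like.share(background.likes)

# Lift: how much more common a like is in the group than in the background
common <- intersect(names(g.share), names(b.share))
lift <- as.numeric(g.share[common]) / as.numeric(b.share[common])
names(lift) <- common
lift <- sort(lift, decreasing = TRUE)
```

In the toy data Stephen Fry is liked by three quarters of both samples, so his lift is 1, whereas F1 is three times more common in the group than in the background.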
So for example, here is the social interest positioning of the Red Bull Racing group on Facebook, based on a sample of 3000 members of the group. Note that a significant number of these members returned no likes, either because they haven’t liked anything, or because their personal privacy settings are such that they do not publicly share their likes.

As we might expect, the members of this group also appear to have an interest in other Formula One related topics, from F1 in general, to various F1 teams and drivers, and to motorsport and motoring in general (top half of the map). We also find music preferences (the cluster to the left of the map) and TV programmes (centre bottom of the map) that are of common interest, though I have no idea yet whether these are background radiation interests (that is, the Facebook equivalent of the Stephen Fry effect on Twitter) or are peculiar to this group. I’m not sure whether the cluster of beverage related preferences at the bottom right corner of the map is notable either?
This information is visualised using Gephi, using data grabbed via the following Python script:
#This is a really simple script (written for Python 2):
##Grab the list of members of a Facebook group (no paging as yet...)
###For each member, try to grab their Likes
import urllib,simplejson,csv,argparse
#Grab a copy of a current token from an example Facebook API call.
#Something a bit like this:
#AAAAAAITEghMBAOMYrWLBTYpf9ciZBLXaw56uOt2huS7C4cCiOiegEZBeiZB1N4ZCqHgQZDZD
parser = argparse.ArgumentParser(description='Generate social positioning map around a Facebook group')
parser.add_argument('-gid',default='2311573955',help='Facebook group ID')
#gid='2311573955'
parser.add_argument('-FBTOKEN',help='Facebook API token')
#Read the command line arguments into the names used below
args=parser.parse_args()
gid=args.gid
FBTOKEN=args.FBTOKEN
#Quick test - output file is simple 2 column CSV that we can render in Gephi
fn='fbgroupliketest_'+str(gid)+'.csv'
writer=csv.writer(open(fn,'wb+'),quoting=csv.QUOTE_ALL)
uids=[]
def getGroupMembers(gid):
    gurl='https://graph.facebook.com/'+str(gid)+'/members?limit=5000&access_token='+FBTOKEN
    data=simplejson.load(urllib.urlopen(gurl))
    if "error" in data:
        print "Something seems to be going wrong - check OAUTH key?"
        print data['error']['message'],data['error']['code'],data['error']['type']
        exit(-1)
    else:
        return data
#Grab the likes for a particular Facebook user by Facebook User ID
def getLikes(uid,gid):
    #Should probably implement at least a simple cache here
    lurl="https://graph.facebook.com/"+str(uid)+"/likes?access_token="+FBTOKEN
    ldata=simplejson.load(urllib.urlopen(lurl))
    print ldata
    if len(ldata['data'])>0:
        for i in ldata['data']:
            if 'name' in i:
                writer.writerow([str(uid),i['name'].encode('ascii','ignore')])
                #We could colour nodes based on category, etc, though would require richer output format.
                #In the past, I have used the networkx library to construct "native" graph based representations of interest networks.
                if 'category' in i:
                    print str(uid),i['name'],i['category']
#For each user in the group membership list, get their likes
def parseGroupMembers(groupData,gid):
    #x is just a fudge used in progress reporting
    #(initialised once, before the loop, so that the count actually increases)
    x=0
    for user in groupData['data']:
        uid=user['id']
        writer.writerow([str(uid),str(gid)])
        #Prevent duplicate fetches
        if uid not in uids:
            getLikes(user['id'],gid)
            uids.append(uid)
            #Really crude progress reporting
            print x
            x=x+1
    #need to handle paging?
    #parse next page URL and recall this function
groupdata=getGroupMembers(gid)
parseGroupMembers(groupdata,gid)
Note that I have no idea whether or not this is in breach of Facebook API terms and conditions, nor have I reflected on the ethical implications of running this sort of analysis, over and above remarking that it’s the same general approach I apply to mapping social interests on Twitter.
As to where next with this? It brings into focus again the question of identifying common interests pertinent to this particular group, compared to background popular interest that might be expressed by any random set of people. But having got a new set of data to play with, it will perhaps make it easier to test the generalisability of any model or technique I do come up with for filtering out, or normalising against, background interest.
Other directions this could go? Using a single group to bootstrap a walk around the interest space? For example, in the above case, trying to identify groups associated with Sebastian Vettel, or F1, and then repeating the process? It might also make sense to look at the categories of the notable shared interests; (from a quick browse, these include, for example, things like Movie, Product/service, Public figure, Games/toys, Sports Company, Athlete, Interest, Sport; is there a full vocabulary available, I wonder? How might we use this information?)
More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API
I had a bit of a play with Shiny over the weekend, using the Ergast Motor Racing Data API and the magical Shiny library for R, that makes building interactive, browser based applications around R a breeze.
As this is just a quick heads-up/review post, I’ll largely limit myself to a few screenshots. When I get a chance, I’ll try to do a bit more of a write-up, though this may actually just take the form of more elaborate documentation of the app, both within the code and in the form of explanatory text in the app itself.
If you want to try out the app, you can find an instance here: F1 2012 Laptime Explorer. The code is also available.
Here’s the initial view – the first race of the season is selected as a default and its data loaded in. The driver list is for all drivers represented during the season.

The driver selectors allow us to display traces for just the selected drivers.
The Race History chart is a classic results chart. It shows the difference between the race time to date for each driver, by lap, compared to the average lap time for the winner times the lap number. (As such, this is an offline statistic – it can only be calculated once the winner’s overall average laptime is known.)

Variants of the classic Race History chart are possible, for example, using different base line times, but I haven’t implemented any of them – or the necessary UI controls. Yet…
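The Race History calculation described above can be sketched in a few lines. This is a Python illustration rather than the app's own R code; the function name `race_history`, the sign convention (positive means ahead of the winner's average pace) and the lap times are all assumptions made for the example.

```python
from itertools import accumulate

def race_history(laptimes, winner):
    """Return {driver: [gap after lap 1, gap after lap 2, ...]} in seconds.

    laptimes maps a driver code to that driver's lap times in lap order.
    The baseline is the winner's average lap time multiplied by the lap
    number, so the winner's final value is zero by construction.
    """
    winner_laps = laptimes[winner]
    baseline_lap = sum(winner_laps) / len(winner_laps)
    history = {}
    for driver, laps in laptimes.items():
        elapsed = accumulate(laps)  # race time to date after each lap
        history[driver] = [baseline_lap * lap - t
                           for lap, t in enumerate(elapsed, start=1)]
    return history

# Made-up lap times (seconds) for two hypothetical drivers
sample = {"AAA": [100.0, 98.0, 99.0], "BBB": [101.0, 99.0, 100.0]}
history = race_history(sample, winner="AAA")
```

Swapping `baseline_lap` for some other reference time (for example the fastest lap of the race) would give one of the variant charts mentioned above.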
The Lap Chart is another classic:

Annotations for this chart are also supported, describing all drivers whose final status was not “Finished”.

The Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with the fastest overall recorded laptime.

The Personal Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with their personal fastest laptime.

The Personal Deltas Chart shows the difference between one laptime and the next for each driver.
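The three laptime views just described differ only in what each lap is compared against. Here is a rough Python sketch, assuming "compared with" means a simple difference in seconds (the app may well use a ratio instead); the function names and lap times are invented for the example.

```python
def lap_evolution(laptimes):
    """Each lap time minus the fastest lap set by anyone in the race."""
    overall_best = min(min(laps) for laps in laptimes.values())
    return {d: [t - overall_best for t in laps] for d, laps in laptimes.items()}

def personal_lap_evolution(laptimes):
    """Each lap time minus that driver's own fastest lap."""
    return {d: [t - min(laps) for t in laps] for d, laps in laptimes.items()}

def personal_deltas(laptimes):
    """Change from one lap to the next (later lap minus earlier lap)."""
    return {d: [b - a for a, b in zip(laps, laps[1:])]
            for d, laps in laptimes.items()}

# Made-up lap times (seconds) for two hypothetical drivers
sample = {"AAA": [100.0, 98.0, 99.0], "BBB": [101.0, 99.0, 100.0]}
```

With this sign convention a negative personal delta means the driver went quicker than on the previous lap.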

The Race Summary Chart is a chart of my own design that tries to capture notable features relating to race position – the grid position (blue circle), final classification (red circle), position at the end of the first lap (the + or horizontal bar). The violin plot shows the distribution of how many laps the driver spent in each race position. Where the chart is wide, the driver spent a large number of laps in that position.

The x-axis ordering pulls out different features about how the race progressed. I need to add in a control that lets the user select different orderings.
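The numbers behind the Race Summary Chart, and the selectable x-axis ordering, could be sketched like this in Python. It assumes a per-driver list of positions at the end of each lap; the names `summarise` and `x_order` and the data are hypothetical, and "final" here is simply the position on the last recorded lap, which may differ from the official classification.

```python
from collections import Counter

def summarise(lap_positions, grid):
    """Per-driver summary: grid slot, first-lap position, final position, lap counts."""
    summary = {}
    for driver, positions in lap_positions.items():
        summary[driver] = {
            "grid": grid[driver],
            "lap1": positions[0],
            "final": positions[-1],
            # How many laps were spent in each position (the violin's width)
            "laps_at": Counter(positions),
        }
    return summary

def x_order(summary, by="grid"):
    """Driver codes sorted by one of the summary fields ('grid', 'lap1' or 'final')."""
    return sorted(summary, key=lambda d: summary[d][by])

# Made-up race: AAA starts on pole, loses the lead on lap 1, then retakes it
lap_positions = {"AAA": [2, 1, 1, 1], "BBB": [1, 2, 2, 2]}
grid = {"AAA": 1, "BBB": 2}
summary = summarise(lap_positions, grid)
```

Calling `x_order(summary, by="lap1")` rather than `by="grid"` is the sort of switch a UI control for the ordering would make.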
Finally, the Fast Lap text scatterplot shows the fastest laptime for each driver and the lap at which they recorded it.

So – that’s a quick review of the app. All in all it took maybe 3 hours getting my head round the data parsing, 2-3 hours figuring what I wanted to do and learning how to do it in Shiny, and a couple of hours doing it/starting to document/annotate it. Next time, it’ll be much quicker…
Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship
In Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix I posted a quick hack to calculate the finishing positions that would determine the F1 2012 Drivers’ Championship in today’s United States Grand Prix, leaving a tease dangling around the possibility of working out what combinations would lead to a VET or ALO victory if the championship isn’t decided today. So in the hour before the race started, I began to doodle a quick’n'dirty interactive app that would let me keep track of what the championship scenarios would be for the Brazil race given the lap by lap placement of VET and ALO during the US Grand Prix. Given the prep I’d done in the aforementioned post, this meant figuring out how to code up a similar algorithm in R, and then working out how to make it interactive…
But before I show you how I did it, here’s the scenario for Brazil given how the US race finished:
So how was this quick hack app done…?
Trying out the new Shiny interactive stats app builder from the RStudio folk has been on my to do list for some time. It didn’t take long to realise that an interactive race scenario builder would provide an ideal context for trying it out. There are essentially two steps (with a minor third in the middle) to the model behind the app:
- work out the points difference between VET and ALO for all their possible points combinations in the US Grand Prix;
- calculate the points difference going into the Brazilian Grand Prix;
- calculate the possible outcomes depending on placements in the Brazilian Grand Prix (essentially, an application of the algorithm I did in the original post).
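Before the R version, the three steps above can be sketched in plain Python. This is only an illustration: the points table and the ten point gap before the US race (255 vs 245) come from the post, while the function names are made up, and, unlike the server.R code (which blanks the whole diagonal), the case where both drivers finish 11th or lower is treated as a possible outcome.

```python
POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0]  # places 1-10, then 11th or lower

def points_for(position):
    """Points for a finishing position (1-based); 11th or lower scores nothing."""
    return POINTS[min(position, len(POINTS)) - 1]

def gap_after_race(gap_before, vet_pos, alo_pos):
    """VET's lead over ALO once both have been classified in a race."""
    return gap_before + points_for(vet_pos) - points_for(alo_pos)

def brazil_outcomes(gap_before_brazil):
    """Champion for every (VET position, ALO position) pair in Brazil.

    A tie on points is given to VET, mirroring the >= 0 test in server.R.
    Two drivers cannot share a points-scoring position, so those cells are blank.
    """
    outcomes = {}
    for vet_pos in range(1, len(POINTS) + 1):
        for alo_pos in range(1, len(POINTS) + 1):
            if vet_pos == alo_pos and vet_pos <= 10:
                outcomes[(vet_pos, alo_pos)] = ""
            elif gap_after_race(gap_before_brazil, vet_pos, alo_pos) >= 0:
                outcomes[(vet_pos, alo_pos)] = "VET"
            else:
                outcomes[(vet_pos, alo_pos)] = "ALO"
    return outcomes

# Steps 1 and 2: VET leads by 10 before the US race; suppose VET is 2nd, ALO 3rd
gap = gap_after_race(255 - 245, vet_pos=2, alo_pos=3)
# Step 3: who takes the title for each pair of finishes in Brazil
outcomes = brazil_outcomes(gap)
```

The two sliders in the app play the role of the `vet_pos` and `alo_pos` arguments in the first call.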
The Shiny app requires two bits of code – a UI in file ui.R, in which I define two sliders that allow me to set the actual (or anticipated, or possible;-) race classifications in the US for Vettel and Alonso:
library(shiny)
shinyUI(pageWithSidebar(
# Application title
headerPanel("F1 Driver Championship Scenarios"),
# Sidebar with slider inputs for the US Grand Prix race positions
sidebarPanel(
sliderInput("alo",
"ALO race pos in United States Grand Prix:",
min = 1,
max = 11,
value = 1),
sliderInput("vet",
"VET race pos in United States Grand Prix:",
min = 1,
max = 11,
value = 2)
),
# Show a plot of the generated model
mainPanel(
plotOutput("distPlot")
)
))
And some logic, in file server.R (original had errors; hopefully now bugfixed…) – the original “Paths to the Championship” unpicks elements of the algorithm in a little more detail, but basically I figure out the points difference between VET and ALO based on the points difference at the start of the race and the additional points difference arising from the posited finishing positions for the US race, and then generate a matrix that works out the difference in points awarded for each possible combination of finishes in Brazil:
library(shiny)
library(ggplot2)
library(reshape)
# Define server logic required to generate and plot the championship scenarios
shinyServer(function(input, output) {
points=data.frame(pos=1:11,val=c(25,18,15,12,10,8,6,4,2,1,0))
points[[1,2]]
a=245
v=255
pospoints=function(a,v,pdiff,points){
pp=matrix(ncol = nrow(points), nrow = nrow(points))
for (i in 1:nrow(points)){
for (j in 1:nrow(points))
pp[[i,j]]=v-a+pdiff[[i,j]]
}
pp
}
pdiff=matrix(ncol = nrow(points), nrow = nrow(points))
for (i in 1:nrow(points)){
for (j in 1:nrow(points))
pdiff[[i,j]]=points[[i,2]]-points[[j,2]]
}
ppx=pospoints(a,v,pdiff,points)
winmdiff=function(vadiff,pdiff,points){
win=matrix(ncol = nrow(points), nrow = nrow(points))
for (i in 1:nrow(points)){
for (j in 1:nrow(points))
if (i==j) win[[i,j]]=''
else if ((vadiff+pdiff[[i,j]])>=0) win[[i,j]]='VET'
else win[[i,j]]='ALO'
}
win
}
# Function that generates the plot of championship outcomes. The function
# is wrapped in a call to reactivePlot to indicate that:
#
# 1) It is "reactive" and therefore should be automatically
# re-executed when inputs change
# 2) Its output type is a plot
#
output$distPlot <- reactivePlot(function() {
wmd=winmdiff(ppx[[input$vet,input$alo]],pdiff,points)
wmdm=melt(wmd)
g=ggplot(wmdm)+geom_text(aes(X1,X2,label=value,col=value))
g=g+xlab('VET position in Brazil')+ ylab('ALO position in Brazil')
g=g+labs(title="Championship outcomes in Brazil")
g=g+ theme(legend.position="none")
g=g+scale_x_continuous(breaks=seq(1, 11, 1))+scale_y_continuous(breaks=seq(1, 11, 1))
print(g)
})
})
To run the app, if your server and ui files are in some directory shinychamp, then something like the following should get the Shiny app running:
library(shiny)
runApp("~/path/to/my/shinychamp")
You can find the code on github here: F1 Championship 2012 – scenarios if the race gets to Brazil…
Unfortunately, until a hosted service is available, you’ll have to run it yourself if you want to try it out…
Disclaimer: I’ve been rushing to get this posted before the start of the race… If you spot errors, please shout!
Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix
If you haven’t already seen it, one of the breakthrough visualisations of the US elections was the New York Times Paths to the Election scenario builder. With the F1 drivers’ championship in the balance this weekend, I wondered what the chances were of VET claiming the title in today’s race. The only other contender is ALO, who is currently ten points behind.
A quick Python script shows the outcome depending on the relative classification of ALO and VET at the end of today’s race. (If the drivers are 25 points apart, and ALO then wins in Brazil with VET out of the points, I think VET will win on countback based on having won more races.)
#The current points standings
vetPoints=255
aloPoints=245
#The points awarded for each place in the top 10; 0 points otherwise
points=[25,18,15,12,10,8,6,4,2,1,0]
#Print a header row (there's probably a more elegant way of doing this...;-)
for x in ['VET\ALO',1,2,3,4,5,6,7,8,9,10,'11+']: print str(x)+'\t',
print ''
#I'm going to construct a grid, VET's position down the rows, ALO across the columns
for i in range(len(points)):
    #Build up each row - start with VET's classification
    row=[str(i+1)]
    #Now for the columns - that is, ALO's classification
    for j in range(len(points)):
        #Work out the points if VET is placed i+1 and ALO at j+1 (i and j start at 0)
        #Find the difference between the points scores
        #If the difference is >= 25 (the biggest points diff ALO could achieve in Brazil), VET wins
        if ((vetPoints+points[i])-(aloPoints+points[j])>=25):
            row.append("VET")
        else: row.append("?")
    #Print the row a slightly tidier way...
    print '\t'.join(row)
(Now I wonder – how would I write that script in R?)
And the result?
VET\ALO 1    2    3    4    5    6    7    8    9    10   11+
1       ?    ?    ?    ?    VET  VET  VET  VET  VET  VET  VET
2       ?    ?    ?    ?    ?    ?    ?    ?    VET  VET  VET
3       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    VET
4       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
5       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
6       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
7       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
8       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
9       ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
10      ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
11      ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
Which is to say, VET wins if:
- VET wins the race and ALO is placed 5th or lower;
- VET is second in the race and ALO is placed 9th or lower;
- VET is third in the race and ALO is out of the points (11th or lower)
We can also look at the points differences (define a row2 as row, then use row2.append(str((vetPoints+points[i])-(aloPoints+points[j])))):
VET\ALO    1    2    3    4    5    6    7    8    9   10  11+
1         10   17   20   23   25   27   29   31   33   34   35
2          3   10   13   16   18   20   22   24   26   27   28
3          0    7   10   13   15   17   19   21   23   24   25
4         -3    4    7   10   12   14   16   18   20   21   22
5         -5    2    5    8   10   12   14   16   18   19   20
6         -7    0    3    6    8   10   12   14   16   17   18
7         -9   -2    1    4    6    8   10   12   14   15   16
8        -11   -4   -1    2    4    6    8   10   12   13   14
9        -13   -6   -3    0    2    4    6    8   10   11   12
10       -14   -7   -4   -1    1    3    5    7    9   10   11
11       -15   -8   -5   -2    0    2    4    6    8    9   10
We could then do a similar exercise for the Brazil race, and essentially get all the information we need to do a scenario builder like the New York Times election scenario builder… Which I would try to do, but I’ve had enough screen time for the weekend already…:-(
PS FWIW, here’s a quick table showing the awarded points difference between two drivers depending on their relative classification in a race:
A\B        1    2    3    4    5    6    7    8    9   10  11+
1          X    7   10   13   15   17   19   21   23   24   25
2         -7    X    3    6    8   10   12   14   16   17   18
3        -10   -3    X    3    5    7    9   11   13   14   15
4        -13   -6   -3    X    2    4    6    8   10   11   12
5        -15   -8   -5   -2    X    2    4    6    8    9   10
6        -17  -10   -7   -4   -2    X    2    4    6    7    8
7        -19  -12   -9   -6   -4   -2    X    2    4    5    6
8        -21  -14  -11   -8   -6   -4   -2    X    2    3    4
9        -23  -16  -13  -10   -8   -6   -4   -2    X    1    2
10       -24  -17  -14  -11   -9   -7   -5   -3   -1    X    1
11       -25  -18  -15  -12  -10   -8   -6   -4   -2   -1    X
Here’s how to use this chart in association with the previous. Looking at the previous chart, if VET finishes second and ALO third, the points difference is 13 in favour of VET. Looking at the chart immediately above, if we let VET = A and ALO = B, then the columns correspond to ALO’s placement, and the rows to VET. VET (A) needs to lose 14 or more points to lose the championship (that is, we’re looking for values of -14 or less). In particular, ALO (B, columns) needs to finish 1st with VET (A) 5th or worse, 2nd with A 8th or worse, or 3rd with VET 10th or worse.
And the script:
print '\t'.join(['A\B','1','2','3','4','5','6','7','8','9','10','11+'])
for i in range(len(points)):
    row=[str(i+1)]
    for j in range(len(points)):
        if i!=j:row.append(str(points[i]-points[j]))
        else: row.append('X')
    #Print each completed row of the table
    print '\t'.join(row)
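The table lookup described above (reading the A\B chart against VET's lead going into the last race) can also be done in code. Here is a minimal Python 3 sketch, assuming VET's lead is not negative and that a tie on points goes to VET on countback, as suggested earlier; the function name `alo_title_routes` is invented for the example.

```python
POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0]  # places 1-10, then 11+

def alo_title_routes(vet_lead):
    """For each ALO finish, the best VET finish that still hands ALO the title.

    vet_lead is VET's points lead before the race (assumed >= 0). ALO must
    out-score VET by more than that lead, because a tie is assumed to go to
    VET. Returns {ALO position: VET position}, meaning "VET that low or
    lower"; position 11 stands for "11th or lower". ALO finishes with no
    route to the title are left out.
    """
    routes = {}
    for alo_pos, alo_pts in enumerate(POINTS, start=1):
        for vet_pos, vet_pts in enumerate(POINTS, start=1):
            if vet_pts - alo_pts < -vet_lead:
                routes[alo_pos] = vet_pos
                break
    return routes

# VET second and ALO third in the US leaves VET 13 points ahead
routes = alo_title_routes(13)
# routes == {1: 5, 2: 8, 3: 10}
```

That matches the reading of the chart given above: ALO 1st with VET 5th or worse, 2nd with VET 8th or worse, or 3rd with VET 10th or worse.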
And now for the rest of the weekend…
The Race to the F1 2012 Drivers’ Championship – Initial Sketches
In part inspired by the chart described in The electoral map sans the map, I thought I’d start mulling over a quick sketch showing the race to the 2012 Formula One Drivers’ Championship.
The chart needs to show tension somehow, so in this first really quick and simple rough sketch, you really do have to put yourself in the graph and start reading it from left to right:
The data is pulled in from the Ergast API as JSON data, which is then parsed and visualised using R:
require(RJSONIO)
require(ggplot2)
#initialise a data frame
champ <- data.frame(round=numeric(),
driverID=character(),
position=numeric(), points=numeric(),wins=numeric(),
stringsAsFactors=FALSE)
#This is a fudge at the moment - should be able to use a different API call to
#get the list of races to date, rather than hardcoding latest round number
for (j in 1:18){
resultsURL=paste("http://ergast.com/api/f1/2012/",j,"/driverStandings",".json",sep='')
print(resultsURL)
results.data.json=fromJSON(resultsURL,simplify=FALSE)
rd=results.data.json$MRData$StandingsTable$StandingsLists[[1]]$DriverStandings
for (i in 1:length(rd)){
champ=rbind(champ,data.frame(round=j, driverID=rd[[i]]$Driver$driverId,
position=as.numeric(as.character(rd[[i]]$position)),
points=as.numeric(as.character(rd[[i]]$points)),
wins=as.numeric(as.character(rd[[i]]$wins)) ))
}
}
champ
#Horrible fudge - should really find a better way of filtering?
test2=subset(champ,( driverID=='vettel' | driverID=='alonso' | driverID=='raikkonen'|driverID=='webber' | driverID=='hamilton'|driverID=='button' ))
#Really rough sketch, in part inspired by http://junkcharts.typepad.com/junk_charts/2012/11/the-electoral-map-sans-the-map.html
ggplot(test2)+geom_line(aes(x=round,y=points,group=driverID,col=driverID))+labs(title="F1 2012 - Race to the Championship")
#I wonder if it would be worth annotating the chart with labels explaining any DNF reasons at parts where points stall?
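For comparison, here is a rough Python sketch of the same flattening step, together with one answer to the "better way of filtering" question: test membership of a set (in R, `driverID %in% c(...)` would do the same job). The JSON layout follows the field names used in the R code above; the sample response is hand-written with placeholder driver IDs and illustrative numbers, and no network call is made.

```python
def standings_rows(round_number, standings_json):
    """Flatten one round's driver standings into a list of row dicts."""
    lists = standings_json["MRData"]["StandingsTable"]["StandingsLists"]
    rows = []
    for entry in lists[0]["DriverStandings"]:
        rows.append({
            "round": round_number,
            "driverID": entry["Driver"]["driverId"],
            # Values are converted explicitly, as the R code does with as.numeric()
            "position": int(entry["position"]),
            "points": float(entry["points"]),
            "wins": int(entry["wins"]),
        })
    return rows

def keep_drivers(rows, wanted):
    """Filter rows by membership of a set instead of chained == tests."""
    return [row for row in rows if row["driverID"] in wanted]

# Hand-written stand-in for one API response; the numbers are illustrative only
sample = {"MRData": {"StandingsTable": {"StandingsLists": [{"DriverStandings": [
    {"position": "1", "points": "25", "wins": "1", "Driver": {"driverId": "aaa"}},
    {"position": "2", "points": "18", "wins": "0", "Driver": {"driverId": "bbb"}},
    {"position": "3", "points": "15", "wins": "0", "Driver": {"driverId": "ccc"}},
]}]}}}

rows = keep_drivers(standings_rows(1, sample), {"aaa", "ccc"})
```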
So, that’s the quickest and dirtiest chart I could think of – where to take this next? One way would be to start making the chart look cleaner; another possibility would be to start looking at adding labels, highlights, and maybe pushing all but ALO and VET into the background? (GDS do some nice work in this vein, eg Updating the GOV.UK Performance Dashboard; this StoryTellingWithData post on stacked bar charts also has some great ideas about how to make simple, clean and effective use of text and highlighting…).
Let’s try cleaning it up a little, and then highlight the championship contenders?
test3=subset(test2,( driverID=='vettel' | driverID=='alonso' ))
test4=subset(test2,( driverID=='raikkonen'|driverID=='webber' | driverID=='hamilton'|driverID=='button' ))
ggplot(test4) + geom_line(aes(x=round,y=points,group=driverID),col='lightgrey') + geom_line(data=test3,aes(x=round,y=points,group=driverID,col=driverID)) + labs(title="F1 2012 - Race to the Championship")
Hmm… I’m not sure about those colours? Maybe use Blue for VET and Red for ALO?
I really hacked the path to this – there must be a cleaner way?!
ggplot(test4)+geom_line(aes(x=round,y=points,group=driverID),col='lightgrey') + geom_line(data=subset(test3,driverID=='vettel'),aes(x=round,y=points),col='blue') + geom_line(data=subset(test3,driverID=='alonso'),aes(x=round,y=points),col='red') + labs(title="F1 2012 - Race to the Championship")
Other chart types are possible too, I suppose? Such as something in the style of a lap chart?
ggplot(test2)+geom_line(aes(x=round,y=position,group=driverID,col=driverID))+labs(title="F1 2012 - Race to the Championship")
Hmmm… Just like the first sketch, this one is cluttered and confusing too… How about if we clean it as above to highlight just the contenders?
ggplot(test4) + geom_line(aes(x=round,y=position,group=driverID),col='lightgrey') + geom_line(data=test3,aes(x=round,y=position,group=driverID,col=driverID)) + labs(title="F1 2012 - Race to the Championship")
A little cleaner, maybe? And with the colour tweak:
ggplot(test4) + geom_line(aes(x=round,y=position,group=driverID),col='lightgrey') + geom_line(data=subset(test3,driverID=='vettel'),aes(x=round,y=position),col='blue') + geom_line(data=subset(test3,driverID=='alonso'),aes(x=round,y=position),col='red') + labs(title="F1 2012 - Race to the Championship")
Something that really jumps out at me in this chart are the gridlines – they really need fixing? But what would be best to show?
Hmm, before we do that, how about an animation? (Does WordPress.com allow animated gifs?)
Here’s the code (it requires the animation package):
library(animation)
race.ani= function(...) {
for (i in 1:18) {
g=ggplot(subset(test4, round<=i)) + geom_line(aes(x=round,y=position,group=driverID),col='lightgrey')+geom_line(data=subset(test3,driverID=='vettel' & round<=i),aes(x=round,y=position),col='blue')+geom_line(data=subset(test3,driverID=='alonso' & round <=i),aes(x=round,y=position),col='red')+labs(title="F1 2012 - Race to the Championship")+xlim(1,18)
print(g)
}
}
saveMovie(race.ani(), interval = 0.4, outdir = getwd())
And for the other chart:
Hmmm…
How’s about another sort of view – the points difference between VET and ALO?
alo=subset(test3,driverID=='alonso')
vet=subset(test3,driverID=='vettel')
colnames(vet)=c("round","driverID","vposition","vpoints","vwins")
colnames(alo)=c("round","driverID","aposition","apoints","awins")
cf= merge(alo,vet,by=c('round'))
ggplot(cf) + geom_bar( aes(x=round,y=vpoints-apoints,fill=(vpoints-apoints)>0), stat='identity') + labs(title="F1 2012 Championship - VET vs ALO")
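The merge-and-subtract step behind that bar chart can be sketched in Python as well. This assumes standings rows shaped like the `champ` data frame (round, driverID, points); the function name and the sample values are made up and are not real standings.

```python
def points_gap_by_round(rows, lead="vettel", chaser="alonso"):
    """Return {round: lead driver's points minus chaser's points}."""
    by_round = {}
    for row in rows:
        by_round.setdefault(row["round"], {})[row["driverID"]] = row["points"]
    # Keep only rounds where both drivers appear, like merge()'s inner join
    return {rnd: pts[lead] - pts[chaser]
            for rnd, pts in sorted(by_round.items())
            if lead in pts and chaser in pts}

# Illustrative values only
rows = [
    {"round": 1, "driverID": "vettel", "points": 10},
    {"round": 1, "driverID": "alonso", "points": 25},
    {"round": 2, "driverID": "vettel", "points": 35},
    {"round": 2, "driverID": "alonso", "points": 33},
    {"round": 3, "driverID": "vettel", "points": 60},
]
gaps = points_gap_by_round(rows)
```

The sign of each value plays the same role as the `(vpoints-apoints)>0` fill in the ggplot call: positive when VET is ahead, negative when ALO is.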
F1 2012 Mid-Season Review
Rather belatedly, I got around to posting a series of posts summarising the Formula One season to date:
- F1 2012 Mid-Season Review – Grid/Classification Analysis: for example, how do the drivers’ grid and final classifications compare?
- F1 2012 Mid-Season Review – Pit Stops: for example, how does pit stop performance across the teams compare?

- F1 2012 Mid-Season Review – Qualifying Analysis: for example, how do normalised qualifying lap times compare across the teams over the season so far?

It’s also worth comparing the charts to the F1 2011 Season Review charts.
The R-code used to generate the graphics can be found here: F1 2012 Mid-Season Review – R Markdown.
Comments/suggestions/code improvements and extensions etc all welcome…
F1 Championship Points as a d3.js Powered Sankey Diagram
d3.js crossed my path a couple of times yesterday: firstly, in the form of an enquiry about whether I’d be interested in writing a book on d3.js (I’m not sure I’m qualified: as I responded, I’m more of a script kiddie who sees things I can reuse, rather than have any understanding at all about how d3.js does what it does…); secondly, via a link to d3.js creator Mike Bostock’s new demo of Sankey diagrams built using d3.js:
Hmm… Sankey diagrams are good for visualising flow, so to get to grips myself with seeing if I could plug-and-play with the component, I needed an appropriate data set. F1 related data is usually my first thought as far as testbed data goes (no confidences to break, the STEM/innovation outreach/tech transfer context, etc etc) so what things flow in F1? What quantities are conserved whilst being passed between different classes of entity? How about points… points are awarded on a per race basis to drivers who are members of teams. It’s also a championship sport, run over several races. The individual Driver Championship is a competition between drivers to accumulate the most points over the course of the season, and the Constructor Championship is a battle between teams. Which suggests to me that a Sankey plot of points from races to drivers and then constructors might work?
So what do we need to do? First up, look at the source code for the demo using View Source. Here’s the relevant bit:
Data is being pulled in from a relatively addressed file, energy.json. Let’s see what it looks like:
Okay – a node list and an edge list. From previous experience, I know that there is a d3.js JSON exporter built into the Python networkx library, so maybe we can generate the data file from a network representation of the data in networkx?
Here we are: node_link_data(G) “[r]eturn data in node-link format that is suitable for JSON serialization and use in Javascript documents.”
Next step – getting the data. I’ve already done a demo of visualising F1 championship points sourced from the Ergast motor racing API as a treemap (but not blogged it? Hmmm…. must fix that) that draws on a JSON data feed constructed from data extracted from the Ergast API so I can clone that code and use it as the basis for constructing a directed graph that represents points allocations: race nodes are linked to driver nodes with edges weighted by points scored in that race, and driver nodes are connected to teams by edges weighted according to the total number of points the driver has earned so far. (Hmm, that gives me an idea for a better way of coding the weight for that edge…)
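The graph described above can also be put together without networkx. Here is a standard-library Python sketch that builds a node list and a link list directly, assuming the Sankey component wants links that refer to nodes by their index in the node list (the layout used by the demo's energy.json); the function name, races, drivers and teams are all made up.

```python
import json

def sankey_data(race_results, driver_team):
    """Build a {"nodes": [...], "links": [...]} dict from per-race points.

    race_results: {race name: {driver: points scored in that race}}
    driver_team:  {driver: team name}
    """
    nodes, index, links = [], {}, []

    def node_id(name):
        # Register each node once and hand back its position in the node list
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"name": name})
        return index[name]

    totals = {}
    for race, scores in race_results.items():
        for driver, pts in scores.items():
            if pts > 0:  # zero-width flows add nothing to the diagram
                links.append({"source": node_id(race),
                              "target": node_id(driver), "value": pts})
                totals[driver] = totals.get(driver, 0) + pts
    # Driver-to-team edges carry the driver's running points total
    for driver, total in totals.items():
        links.append({"source": node_id(driver),
                      "target": node_id(driver_team[driver]), "value": total})
    return {"nodes": nodes, "links": links}

# Hypothetical races, drivers and teams
sample_results = {"Race 1": {"AAA": 25, "BBB": 18}, "Race 2": {"AAA": 18, "BBB": 25}}
sample_teams = {"AAA": "Team X", "BBB": "Team X"}
data = sankey_data(sample_results, sample_teams)
sankey_json = json.dumps(data, indent=1)
```

Because each driver's outgoing edge equals the sum of the incoming race edges, points are conserved through the diagram, which is what makes a Sankey layout a reasonable fit.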
I don’t have time to blog the how to of the code right now – train and boat to catch – but will do so later. If you want to look at the code, it’s here: Ergast Championship nodelist. And here’s the result – F1 Championship 2012 Points as a Sankey Diagram:
See what I mean about being a cut and paste script kiddie?!;-)






























