Archive for the ‘Rstats’ Category
F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification
[If this isn't your thing, the posts in this thread will be moving to a new blog soon, rather than taking over OUseful.info...]
Following the roundabout tour of F1Stats – A Prequel to Getting Started With Rank Correlations, here’s a walk through of my attempt to replicate the first part of A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing. Specifically, a look at the correlation between various rankings achieved over a race weekend and the final race position for that weekend.
The intention behind doing the replication is two-fold: firstly, published papers presumably do “proper stats”, so by pinching the methodology, I can hopefully pick up some tricks about how you’re supposed to do this stats lark by apprenticing myself, via the web, to someone who has done something similar; secondly, it provides an answer scheme, of a sort: I can check my answers with the answers in the back of the book published paper. I’m also hoping that by working through the exercise, I can start to frame my own questions and start testing my own assumptions.
So… let’s finally get started. The data I’ll be using for now is pulled from my scraper of the FormulaOne website (personal research purposes, blah, blah, blah;-). If you’re following along, from the Download button grab the whole SQLite database and save it into whatever directory you’re going to be working in in R…
…because this is an R-based replication… (I use RStudio for my R tinkerings; if you need to change directory when you’re in R, setwd('~/path/to/myfiles')).
To load the data in, and have a quick peek at what’s loaded, I use the following recipe:
f1 = dbConnect(drv ="SQLite", dbname="f1com_megascraper.sqlite") #There are a couple of ways we can display the tables that are loaded #Directly: dbListTables(f1) #Or via a query: dbGetQuery(f1, 'SELECT name FROM sqlite_master WHERE type="table"')
The last two commands each display a list of the names of the database tables that can be loaded in as separate R dataframes.
To load the data in from a particular table, we run a SELECT query on the database. Let’s start with grabbing the same data that the paper analysed – the results from 2009:
#Grab the race results from the raceResults table for 2009
results.race=dbGetQuery(f1, 'SELECT pos as racepos, race, grid, driverNum FROM raceResults where year==2009')
#We can load data in from other tables and merge the results:
results.quali=dbGetQuery(f2, 'SELECT pos as qpos, driverNum, race FROM qualiResults where year==2009')
results.combined.clunky = merge(results.race, results.quali, by = c('race', 'driverNum'), all=T)
#A more efficient approach is to run a JOINed query to pull in the data in in one go
results.combined=dbGetQuery(f1,
'SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==2009 and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;' )
#Note: I haven't used SQL that much so there may be a more efficient way of writing this?
#We can preview the response
head(results.combined)
# year qpos p3pos racepos race grid driverNum raceNum
#1 2009 1 3 1 AUSTRALIA 1 22 1
#2 2009 2 6 2 AUSTRALIA 2 23 1
#3 2009 20 2 3 AUSTRALIA 20 9 1
#4 2009 19 8 4 AUSTRALIA 19 10 1
#5 2009 10 17 5 AUSTRALIA 10 7 1
#6 2009 5 1 6 AUSTRALIA 5 16 1
#We can look at the format of each column
str(results.combined)
#'data.frame': 338 obs. of 8 variables:
# $ year : chr "2009" "2009" "2009" "2009" ...
# $ qpos : chr "1" "2" "20" "19" ...
# $ p3pos : chr "3" "6" "2" "8" ...
# $ racepos : chr "1" "2" "3" "4" ...
# $ race : chr "AUSTRALIA" "AUSTRALIA" "AUSTRALIA" "AUSTRALIA" ...
# $ grid : chr "1" "2" "20" "19" ...
# $ driverNum: chr "22" "23" "9" "10" ...
# $ raceNum : int 1 1 1 1 1 1 1 1 1 1 ...
#We can inspect the different values taken by each field
levels(factor(results.combined$racepos))
# [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "3" "4" "5" "6" "7" "8" "9"
#[20] "DSQ" "Ret"
#We need to do a little tidying... Let's make integer versions of the positions
#NA values will be introduced where there are no integers
results.combined$racepos.int=as.integer(as.character(results.combined$racepos))
results.combined$grid.int=as.integer(as.character(results.combined$grid))
results.combined$qpos.int=as.integer(as.character(results.combined$qpos))
#The race classification does not give a numerical position for each driver
##(eg in cases of DSQ or Ret) although the tables are ordered
#Let's use the row order for each race to give an actual numerical position to each driver for the race
#(If we wanted to do this for practice and quali too, we would have to load the tables
#in separately, number each row, then merge them.)
require(reshape)
results.combined=ddply(results.combined, .(race), mutate, racepos.raw=1:length(race))
#Now we have a data frame that looks like this:
head(results.combined)
# year qpos p3pos racepos race grid driverNum raceNum racepos.int grid.int qpos.int racepos.raw
#1 2009 2 11 1 ABU DHABI 2 15 17 1 2 2 1
#2 2009 3 15 2 ABU DHABI 3 14 17 2 3 3 2
#3 2009 5 1 3 ABU DHABI 5 22 17 3 5 5 3
#4 2009 4 3 4 ABU DHABI 4 23 17 4 4 4 4
#5 2009 8 5 5 ABU DHABI 8 6 17 5 8 8 5
#6 2009 12 13 6 ABU DHABI 12 10 17 6 12 12 6
#If necessary we can force the order of the races factor levels away from alphabetical into race order
results.combined$race=reorder(results.combined$race, results.combined$raceNum)
levels(results.combined$race)
# [1] "AUSTRALIA" "MALAYSIA" "CHINA" "BAHRAIN" "SPAIN" "MONACO" "TURKEY"
# [8] "GREAT BRITAIN" "GERMANY" "HUNGARY" "EUROPE" "BELGIUM" "ITALY" "SINGAPORE"
#[15] "JAPAN" "BRAZIL" "ABU DHABI"
We’re now in a position whereby we can start to look at the data, and do some analysis of it. The paper shows a couple of example scatterplots charting the “qualifying” position against the final race position. We can go one better in single line using the ggplot2 package:
require(ggplot2) g=ggplot(results.combined) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race) g
At first glance, this may all look very confusing, but just take your time, actually look at it, see it you can spot any patterns or emergent structure in the way the points are distributed. Does it look as if the points on the chart might have a story to tell? (If a picture saves a thousand words, and it takes five minutes to read a thousand words, the picture might actually take a couple of minutes to read get across the same information…;-)
First thing to notice: there are lots of panels – one for each race. The panels are ordered left to right, then down a row, left to right, down a row, etc, in the order they took place during the season. So you should be able to quickly spot the first race of the season (top left), the last race (at the end of the last/bottom row), and races in mid-season (middle of them all!)
Now focus on a single chart and thing about that the positioning of the points actually means. If the qualifying position was the same as the race position, what would you expect to see? That is, if the car that qualified first finished first; the car that qualified second finished second; the car that qualified tenth finished tenth; and so on. As the axes are incrementally ordered, I’d expect to see a straight line, going up at 45 degrees if the same space was given on each axis (that is, if the charts were square plots).
We can run a quick experiment to see how that looks:
#Plot increasing x and increasing y in step with each other ggplot() + geom_point(aes(x=1:20, y=1:20))
Looking over the race charts, some of the charts have the points plotted roughly along a straight line – Turkey and Abu Dhabi, for example. If the cars started the race roughly according to qualifying position, this would suggest the races may have been rather processional? On the other hand, Brazil is all over the place – qualifying position appears to bear very little relationship to where the cars actually finished the race.
Even for such a simple chart type, it’s amazing how much structure we might be able to pull out of these charts. If every one was a straight line, we might imagine a very boring season, full of processions. If every chart was all over the place, with little apparent association between qualifying and race positions, we might begin to wonder whether there was any “form” to speak of at all. (Such events may on the surface provide exciting races, at least at the start of the season, but if every driver has an equal, but random, chance of winning, it makes it hard to root for anyone in particular, makes it hard to root for an underdog versus a favourite, and so on.)
Again, we can run a quick experiment to see what things would look like if each car qualified at random and was placed at random at the end of the race:
expt2 = NULL
#We're going to run 20 experiments (z)
for (i in 1:20){
#Each experiment generates a random start order (x) and random finish order (y)
expt=data.frame(x=sample(1:20, 20), y=sample(1:20, 20), z=i)
expt2=rbind(expt2, expt)
}
#Plot the result, by experiment
ggplot(expt2) + geom_point(aes(x=x, y=y)) + facet_wrap(~z)
There’s not a lot of order in there, is there? However, it is worth noting that on occasion it may look as if there is some sort of ordering (as in trial 4, for example?), purely by chance.
For what it’s worth, we can also tidy up the race plots a little to make them a little bit more presentable:
g = g + ggtitle( 'F1 2009' )
g = g + xlab('Qualifying Position') + ylab('Final Race Classification')
g
Here’s the chart with added titles and axis labels:

We can also grab a plot for a single race. The paper showed scatterplots for Brazil and Japan.
Here’s the code:
g1 = ggplot(subset(results.combined, race == 'BRAZIL')) + geom_point(aes(x=qpos.int, y=racepos.raw)) + facet_wrap(~race)
g1 = g1 + ggtitle('F1 2009 Brazil') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw()
g2 = ggplot(subset(results.combined, race=='JAPAN' ) )+geom_point(aes(x =qpos.int, y=racepos.raw)) + facet_wrap(~race)
g2 = g2 + ggtitle('F1 2009 Japan') + xlab('Qualifying Position') + ylab('Final Race Classification') + theme_bw()
require( gridExtra )
grid.arrange( g1, g2, ncol=2 )
Note that I have: i) filtered the data to get just the data required for each race; ii) added a little bit more styling with the theme_bw() call; iii) used the gridExtra package to generate the side-by-side plot.
Here’s what the data looked like in the original paper:
NOT the same… Hmmm… How about if we plot the grid.int position vs. the final position? (Set x = grid.int and tweak the xlab()):
So – the “qualifying” position in the published paper is actually the grid position? Did I not read the paper closely enough, or did they make a mistake? (So much for academic peer review stopping errors getting through. Like, erm, the error in the visualiastion of the confidence interval that also got through? ;-)
Generating similar views for other years should be trivial – simply tweak the year selector in the SELECT query and then repeat as above.
In the limit, I guess everything could be wrapped up in a single function to plot either the facetted view, or, if a list of explicitly named races are passed in to the function, as a set of more specific race charts. But that is left as an exercise for the reader…;-)
Okay – enough for now… am I ever going to get even as far as blogging the correlation analysis., let alone the regressions?! Hopefully in the next post in this series!
PS I couldn’t resist… here’s the summary set of charts for the 2012 season, in respect of grid positions vs. final race classifications:
Was it as you remember it?!! Were Australia and Malaysia all over the place? Was there a big crash amongst the front runners in Belgium?!
PPS Rather than swap this blog with too many posts on this topic, I intend to start a new uncourse blog and post them there in future… details to follow…
PPPS if you liked this post, you might like this Ergast API Interactive F1 Results data app
Getting Started with F1 Betting Data
As part of my “learn about Formula One Stats” journey, one of the things I wanted to explore was how F1 betting odds change over the course of a race weekend, along with how well they predict race weekend outcomes.
Courtesy of @flutterF1, I managed to get a peek of some betting data from one of the race weekends last year year. In this preliminary post, I’ll describe some of the ways I started to explore the data initially, before going on to look at some of the things it might be able to tell us in more detail in a future post.
(I’m guessing that it’s possible to buy historical data(?), as well as collecting it yourself it for personal research purposes? eg Betfair have an api, and there’s at least one R library to access it: betfairly.)
The application I’ll be using to explore the data is RStudio, the cross-platform integrated development environment for the R programming language. Note that I will be making use of some R packages that are not part of the base install, so you will need to load them yourself. (I really need to find a robust safe loader that installs any required packages first if they have not already been installed.)
The data @flutterF1 showed me came in two spreadsheets. The first (filename convention RACE Betfair Odds Race Winner.xlsx) appears to contain a list of frequently sampled timestamped odds from Betfair, presumably, for each driver recorded over the course of the weekend. The second (filename convention RACE Bookie Odds.xlsx) has multiple sheets that contain less frequently collected odds from different online bookmakers for each driver on a variety of bets – race winner, pole position, top 6 finisher, podium, fastest lap, first lap leader, winner of each practice session, and so on.
Both the spreadsheets were supplied as Excel spreadsheets. I guess that many folk who collect betting data store it as spreadsheets, so this recipe for loading spreadsheets in to an R environment might be useful to them. The gdata library provides hooks for working with Excel documents, so I opted for that.
Let’s look at the Betfair prices spreadsheet first. The top line is junk, so we’ll skip it on load, and add in our own column names, based on John’s description of the data collected in this file:
The US Betfair Odds Race Winner.xslx is a raw data collection with 5 columns….
1) The timestap (an annoying format but there is a reason for this albeit a pain to work with).
2) The driver.
3) The last price money was traded at.
4) the total amount of money traded on that driver so far.
5) If the race is in ‘In-Play’. True means the race has started – however this goes from the warm up lap, not the actual start.To reduce the amount of data I only record it when the price traded changes or if the amount changes.
Looking through the datafile, they appear to be some gotchas, so these need cleansing out:
Here’s my initial loader script:
library(gdata)
xl=read.xls('US Betfair Odds Race Winner.xlsx',skip = 1)
colnames(xl)=c('dt','driver','odds','amount','racing')
#Cleansing pass
bf.odds=subset(xl,racing!='')
str(bf.odds)
'data.frame': 10732 obs. of 5 variables:
$ dt : Factor w/ 2707 levels "11/16/2012 12:24:52 AM",..: 15 15 15 15 15 15 15 15 15 15 ...
$ driver: Factor w/ 34 levels " Recoding Began",..: 19 11 20 16 18 29 26 10 31 17 ...
$ odds : num 3.9 7 17 16.5 24 140 120 180 270 550 ...
$ amount: num 1340 557 120 118 195 ...
$ racing: int 0 0 0 0 0 0 0 0 0 0 ...
#Generate a proper datetime field from the dt column
#This is a hacked way of doing it. How do I do it properly?
bf.odds$dtt=as.POSIXlt(gsub("T", " ", bf.odds$dt))
#If we rerun str(), we get the following extra line in the results:
# $ dtt : POSIXlt, format: "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" "2012-11-11 11:00:08" ...
Here’s what the raw data, as loaded, looks like to the eye:

Having loaded the data, cleansed it, and cast a proper datetime column, it’s easy enough to generate a few plots:
#We're going to make use of the ggplot2 graphics library
library(ggplot2)
#Let's get a quick feel for bets around each driver
g=ggplot(xl)+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+theme(axis.text.x=element_text(angle=-90))
g
#Let's look in a little more detail around a particular driver within a particular time window
g=ggplot(subset(xl,driver=="Lewis Hamilton"))+geom_point(aes(x=dtt,y=odds))+facet_wrap(~driver,scales="free_y")
g=g+theme(axis.text.x=element_text(angle=-90))
g=g+ scale_x_datetime(limits=c(as.POSIXct('2012/11/18 18:00:00'), as.POSIXct('2012/11/18 22:00:00')))
g
Here are the charts (obviously lacking in caption, tidy labels and so on).
Firstly, the odds by driver:
Secondly, zooming in on a particular driver in a particular time window:
That all seems to work okay, so how about the other spreadsheet?
#There are several sheets to choose from, named as follows:
#Race,Pole,Podium,Points,SC,Fastest Lap, Top 6, Hattrick,Highest Scoring,FP1, ReachQ3,FirstLapLeader, FP2, FP3
#Load in data from a particular specified sheet
race.odds=read.xls('USA Bookie Odds.xlsx',sheet='Race')
#The datetime column appears to be in Excel datetime format, so cast it into something meaningful
race.odds$tTime=as.POSIXct((race.odds$Time-25569)*86400, tz="GMT",origin=as.Date("1970-1-1"))
#Note that I am not I checking for gotcha rows, though maybe I should...?
#Use the directlabels package to help tidy up the display a little
library(directlabels)
#Let's just check we've got something loaded - prune the display to rule out the longshots
g=ggplot(subset(race.odds,Bet365<30),aes(x=tTime,y=Bet365,group=Runner,col=Runner,label=Runner))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
g=g+geom_dl(method=list('top.bumpup',cex=0.6))
g=g+scale_x_datetime(expand=c(0.15,0))
g
Here’s a view over the drivers’ odds to win, with the longshots pruned out:
With a little bit of fiddling, we can also look to see how the odds for a particular driver compare for different bookies:
#Let's see if we can also plot the odds by bookie
colnames(race.odds)
#[1] "Time" "Runner" "Bet365" "SkyBet" "Totesport" "Boylesport" "Betfred"
# [8] "SportingBet" "BetVictor" "BlueSQ" "Paddy.Power" "Stan.James" "X888Sport" "Bwin"
#[15] "Ladbrokes" "X188Bet" "Coral" "William.Hill" "You.Win" "Pinnacle" "X32.Red"
#[22] "Betfair" "WBX" "Betdaq" "Median" "Median.." "Min" "Max"
#[29] "Range" "tTime"
#We can remove items from this list using something like this:
tmp=colnames(race.odds)
#tmp=tmp[tmp!='Range']
tmp=tmp[tmp!='Range' & tmp!='Median' & tmp!='Median..' & tmp!='Min' & tmp!= 'Max' & tmp!= 'Time']
#Then we can create a subset of cols
race.odds.data=subset(race.odds,select=tmp)
#Melt the data
library(reshape)
race.odds.data.m=melt(race.odds.data,id=c('tTime','Runner'))
#head( race.odds.data.m)
# tTime Runner variable value
#1 2012-11-11 19:07:01 Sebastian Vettel (Red) Bet365 2.37
#2 2012-11-11 19:07:01 Lewis Hamilton (McL) Bet365 3.25
#3 2012-11-11 19:07:01 Fernando Alonso (Fer) Bet365 6.00
#...
#Now we can plot how the different bookies compare
g=ggplot(subset(race.odds.data.m,value<30 & Runner=='Sebastian Vettel (Red)'),aes(x=tTime,y=value,group=variable,col=variable,label=variable))
g=g+geom_line()+theme_bw()+theme(legend.position = "none")
g=g+geom_dl(method=list('top.bumpup',cex=0.6))
g=g+scale_x_datetime(expand=c(0.15,0))
g
Okay, so that all seems to work… Now I can start pondering what sensible questions to ask…
The Basics of Betting as a Way of Keeping Score…
Another preparatory step before I start learning about stats in the context of Formula One… There are a couple of things I’m hoping to achieve when I actually start the journey: 1) finding ways of using stats to help to pull out patterns and events that are interesting from a storytelling or news perspective; 2) seeing if I can come up with any models that help forecast or predict race winners or performances over a race weekend.
There are a couple of problems I can foresee (?!) when it comes to the predictions: firstly, unlike horseracing, there aren’t that many F1 races each year to test the predictions against. Secondly, how do I even get a baseline start on the probabilities that driver X or team Y might end up on the podium?
It seems to me as if betting odds provide one publicly available “best guess” at the likelihood of any driver winning a race (a range of other bets are possible, of course, that give best guess predictions for other situations…) Having had a sheltered life, the world of betting is completely alien to me, so here’s what I think I’ve learned so far…
Odds are related to the anticipated likelihood of a particular event occurring and represent the winnings you get back (plus your stake) if a particular event happens. So 2/1 (2 to 1) fractional odds say: if the event happens, you’ll get 2 back for every 1 you placed, plus your stake back. If I bet 1 unit at 2/1 and win, I get 3 back: my original 1 plus 2 more. If I bet 3, I get 9 back: my original 3 plus 2 for every 1 I placed. Since I placed 3 1s, I get back 3 x 2 = 6 in winnings. Plus my original 3, which gives me 9 back on 3 staked, a profit of 6.
Odds are related (loosely) to the likelihood of an event happening. 2/1 odds represent a likelihood (probability) that an event will happen (1/3) = 0.333… of the time (to cur down on confusion between fractional odds and fractional probabilities, I’ll try to remember to put the fractional probabilities in brackets; so 1/2 is fractional odds of 2 to 1 on, and (1/2) is a probability of one half). To see how this works, consider an evens bet, fractional odds of 1/1, such as someone might make for tossing a coin. The probability of getting heads on a single toss is (1/2); the probability of getting tails is also (1/2). If I’m giving an absolutely fair book based on these likelihoods, I’d offer you even odds that you get a head, for example, on a single toss. After all, it’s (fifty/fifty) (fifty per cent chance either way) of whether a heads or tails will land face up. If there are three equally possible outcomes, (1/3) each, then I’d offer 2/1. After all, it’s twice as likely that something other than the single outcome you called would come up. If there are four possible outcomes, I’d offer 3/1, because it’s likely (if we played repeatedly) that three times out of four, you’d be wrong. So every three times out of four you’d lose and I’d take your stake. And on the fourth go, when you get it right, I give you your stake back for that round plus three for winning, so over all we’d be back where we started.
Decimal odds are a way of describing the return you get on a unit stake. So for a 2/1 bet, the decimal odds are three. For a 4/1 bet they’d be 5. For an N/1 bet they’d be 1+N. For an 1/2 (two to one on?) bet they’d be 1.5, for a 1/10 bet they’d be 1.1. So for a 1/M bet, 1+1/M. Generally, for an N/M bet, decimal odds are 1+N/M.
Decimal odds give an easy way in to calculating the likelihood of an event. Decimal odds of 3, (that is, fractional odds 2/1), describe an event that will happen (1/3) of the time in a fair game. That is (1/(decimal odds)) of the time. For fractional odds of N/M, you expect the event to happen with probability (1/(1+N/M))
In a completely fair book (?my phrase), the sum of the odds should lead to the summed probability of all possible events happening of 1. Bookmakers right the odds in their favour though, so the summed probabilities on a book will add up to more than 1 – this represents the bookmaker’s margin. If you’re betting on the toss of a coin with a bookie, they may offer you 99/100 for heads, evens for tails. If you play 400 games and bet 300 heads and 200 tails, winning 100 of each, you’ll overall stake 400, win 100 (plus 100 back) on tails along with 99 (plus 100 original stake) on heads. That is, you’ll have staked 400 and got back 399. The bookie will be 1 up overall. The summed probabilities add up to more than 1, since (1/2) + (1/(1+99/100)) = (0.5 + ~0.5025) > 1.
One off bets are no basis for a strategy. You need to bet regularly. One way of winning is to follow a value betting strategy where you place bets on outcomes that you predict are more likely than the odds you’re offered. This is counter to how the bookie works. If a bookie offers you fractional odds of 3/1 (expectation that the event will happen (1/4) of the time), and you have evidence that suggests it will happen (1/3) of the time (decimal odds of 3, fractional odds 2/1) then it’s worth your while repeatedly accepting the bet. After all, if you play 12 rounds, you’ll wager 12, and win on 12/3=4 occasions, getting 4 back (3 + your stake) each time, to give you a net return of 4 x 4 – 12 = 16 – 12 = +4. If the event had happened at the bookie’s predicted likelihood of 1/4 of the time, you would have got back ( 12/4 ) * 4 – 12 = +0 overall.
I’ve tried to do an R script to explore this:
#My vocabulary may be a bit confused herein
#Corrections welcome in the comments from both statisticians and gamblers;-)
#The offered odds
price=4 #3/1 -> 3+1 That is, the decimal odds on fractional odds of 3/1
odds=1/price
#The odds I've predicted
myodds=1/3 #2/1 -> 1/(2+1)
#The number of repeated trials in the game
trials=10000
#The amount staked
bet=1
#The experiment that we'll run trials number of times
expt=function(trials,odds,myodds,bet){
#trial sets a uniform random number in ranger 0..1
df=data.frame(trial=runif(1:trials))
#The win condition happens at my predicted odds, ie if trial value is less than my odds
#So if my odds are (1/4) = 0.25, a trial value in range 0..0.25 counts as a win
# (df$trial<myodds) is TRUE if trial < myodds, which is cast by as.integer() to value 1
# If (df$trial<myodds) is FALSE, as.integer() returns 0
df$win=as.integer(df$trial<myodds)
df$bet=bet
#The winnings are calculated at the offered odds and are net of the stake
#The df$win/odds = 1/odds = price (the decimal odds) on a win, else 0
#The actual win is the product of the stake (bet) and the decimal odds
#The winnings are the return net of the initial amount staked
#Where there is no win, the winnings are a loss of the value of the bet
df$winnings=df$bet*df$win/odds-df$bet
df
}
df.e=expt(trials,odds,myodds,bet)
#The overall net winnings
sum(df.e$winnings)
#If myodds > odds, then I'm likely to end up winning on a value betting strategy
#A way of running the experiment several times
#There are probably better R protocols for doing this?
runs=10
df.r=data.frame(s=numeric(),v=numeric())
for (i in 1:runs){
e=expt(trials,odds,myodds,bet)
df.r=rbind(df.r,data.frame(s=sum(e$winnings),v=sd(e$winnings)))
}
#It would be nice to do some statistical graphics demonstrations of the different distributions of possible outcomes for different regimes. For example:
## different combinations of odds and myodds
## different numbers of trials
## different bet sizes
There are apparently also “efficient” ways of working out what stake to place (the “staking strategy”). The value strategy gives you the edge to win, long term, the staking strategy is how you maximise profits. See for example Horse Racing Staking and Betting: Horse racing basics part 2 or more mathematical treatments, such as The Kelly Criterion or Statistical Methodology for Profitable Sports Gambling. See also the notion of “betting rules”, eg A statistical development of fixed odds betting rules in soccer.
There is possibly some mileage to be had in getting to grips with R modeling using staking strategy models as an opening exercise, along with statistical graphical demonstrations of the same, but that is perhaps a little off topic for now…
To recap then, what I think I’ve learned is that we can test predictions against the benchmark of offered odds. The offered odds in themselves give us a ballpark estimate of what the (expert) bookmakers, as influenced by the betting/prediction market, expect the outcome of an event to be. Note that the odds are rigged to give summed probabilities over a range of events happening to be greater than 1, to build in a profit margin (does it have a proper name?) for the bookmaker. If we have a prediction model that appears to offer better odds on an event than the odds that are actually offered, and we believe in our prediction, we can run a value betting strategy on that basis and hopefully come out, over the long term, with a profit. The size of the profit is in part an indicator of how much more accurate our model is as a predictive model than the expert knowledge and prediction market basis that is used to set the bookie’s odds.
PS Re: the bookie’s profit, seems that this is called the overround or vigorish. The paper Forecasting sports tournaments by ratings of (prob)abilities: A comparison for the EURO 2008 makes clear the relationship between the bookies’ cut and the odds:
One thing that immediately springs to mind is to look at what sort of overround applies to different bookmakers around different sorts of F1 bets, and whether this is related to the apparent forecast accuracy of the odds offered, at least in ranking terms? (See the comments for a couple of links to papers on forecast accuracy of sports betting odds.)
PPS FWIW, as and when I come across R libraries to access bookmaker APIs, I’ll add them here:
- Betfair R package – access Betair API; another package (CRAN): Betfairly
My Personal Intro to F1 Race Statistics
One of the many things I keep avoiding is statistics. I’ve never really been convinced about the 5% significance level thing; as far as I can tell, hardly anything that’s interesting normally distributes; all the counting that’s involved just confuses me; and I never really got to grips with confidently combining probabilities. I find a lot of statistics related language impenetrable too, with an obscure vocabulary and some very peculiar usage. (Regular readers of this blog know that’s true here, as well ;-)
So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)
Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-past fashion, which is how I approach most of my coding!
Bare naked learning, in other words.
So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)
That’s the setting then… Now: #f1stats. What’s that all about?
Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…
- First up, I’ve been reading Nate Silver’s The Signal and the Noise, which mentions the aging stats and aging models for baseball players. I found a paper on The Age Productivity Gradient: Evidence from a sample of F1 Drivers, which hasn’t got too many scary equations in, so I may try to replicate that and then bring the models up to date (the paper is dated 2009). It would have been so nice if the authors had published code equivalents in R that I could have played with directly, but I haven’t been able to find it if they did. I also found a paper on Estimated Age Effects in Baseball, again with equations but no code, but it may provide additional clues. From a quick skim, I think there may be some mileage in trying to get my head round different ways of comparing rankings.
- A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing is perhaps an easier thing to try to copy for starters, though>
- The article The wisdom of ignorant crowds: predicting sport outcomes by mere recognition explores a simple tournament winner predicting strategy based on how recognisable the names of competitors are. (I guess social media metrics might be a proxy for recognition? Hmm.. could test that I suppose with reference to this paper?) One thing that caught my eye were a couple of simple schemes for benchmarking different prediction models, which might be something I could pull on if I end up exploring prediction models?
- NASCAR results have featured in several papers (I think there’s also a NASCAR dataset available in R?) so I’ll probably try dipping in to them at some point to see if I can do similar things with F1 data. For example, an analysis of NASCAR Winston Cup Race Results for 1975-2003; a couple of papers on hierarchical modelling of auto-racing results; and some Bayesian inference stuff that I guess is really beyond me for now and that I really really could do with a pre-built R libraries for;
- an MSc thesis I’ve referred to before on Prediction of Formula One Race Results Using Driver Characteristics has some handy ideas that I might be able to draw on if I have a look at laptime data;
- One of the the things I’ve been pondering is ways of ranking drivers based on fast lap times (eg during qualifying, vs. during the race). Although not about motor sport, or any sort of racing, A New Method for Ranking Total Driving Performance on the PGA Tour has a metric I may be able to bastardise in a Formula One context. The same periodical also has an article on Do Reliable Predictors Exist for the Outcomes of NASCAR Races?, the techniques of which might be applicable to F1? Predicting The Outcome Of NASCAR Races: The Role Of Driver Experience looks to be in a similar vein too…
- A paper on Outcome Uncertainty in NASCAR looks at how attendance and TV audience figures are influenced by race expectations, which might be something that could also be explored in context of UK F1 TV audience figures. That said, the notion of “outcome uncertainty” itself, and related measures, might also be worth exploring in their own right?
If you know of any other relevant looking papers or articles, please post a link in the comments.
[MORE LINKS...
- Who is the Best Formula 1 Driver? An Econometric Analysis
]
I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…
PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!
WordPress Stats in R
A trackback from Martin Hawksey’s recent post on Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet), which demonstrates how to pull WordPress stats into a Google Spreadsheet and generates charts and reports therein, reminded me of the WordPress stats API.
So here’s a quick function for pulling WordPress reports into R.
#Wordpress Stats
##---------------
#Wordpress Stats API docs (from http://stats.wordpress.com/csv.php)
#You can get a copy of your API key (required) from Akismet:
#Login with you WordPress account: http://akismet.com/account/
#Resend API key: https://akismet.com/resend/
#Required parameters: api_key, blog_id or blog_uri.
#Optional parameters: table, post_id, end, days, limit, summarize.
#Parameters:
#api_key String A secret unique to your WordPress.com user account.
#blog_id Integer The number that identifies your blog. Find it in other stats URLs.
#blog_uri String The full URL to the root directory of your blog. Including the full path.
#table String One of views, postviews, referrers, referrers_grouped, searchterms, clicks, videoplays.
#post_id Integer For use with postviews table.
#end String The last day of the desired time frame. Format is 'Y-m-d' (e.g. 2007-05-01) and default is UTC date.
#days Integer The length of the desired time frame. Default is 30. "-1" means unlimited.
#period String For use with views table and the 'days' parameter. The desired time period grouping. 'week' or 'month'
#Use 'days' as the number of results to return (e.g. '&period=week&days=12' to return 12 weeks)
#limit Integer The maximum number of records to return. Default is 100. "-1" means unlimited. If days is -1, limit is capped at 500.
#summarize Flag If present, summarizes all matching records.
#format String The format the data is returned in, 'csv', 'xml' or 'json'. Default is 'csv'.
##---------------------------------------------
#NOTE: some of the report calls I tried didn't seem to work properly?
#Need to build up a list of tested calls to the API that actually do what you think they should?
##-----
wordpress.getstats.demo=function(apikey, blogurl, table='postviews', end=Sys.Date(), days='12', period='week', limit='', summarise=''){
#default parameters gets back last 12 weeks of postviews aggregated by week
url=paste('http://stats.wordpress.com/csv.php?',
'api_key=',apikey,
'&blog_uri=',blogurl,
'&table=',table,
'&end=',end,
'&days=',days,
'&period=',period,
'&limit=',limit,
'&',summarise, #set this to 'summarise=T' if required
sep=''
)
#Martin's post notes that JSON appears to work better than CSV
#May be worth doing a JSON parsing version?
read.csv(url)
}
APIKEY='YOUR-API_KEY_HERE'
#Use the URL of a WordPress blog associated with the same account as the API key
BLOGURL='http://ouseful.wordpress.com'
#Examples
wp.pageviews.last12weeks=wordpress.getstats.demo(APIKEY,BLOGURL)
wp.views.last12weeks.byweek=wordpress.getstats.demo(APIKEY,BLOGURL,'views')
wp.views.last30days.byday=wordpress.getstats.demo(APIKEY,BLOGURL,'views',days=30,period='')
wp.clicks.wpdefault=wordpress.getstats.demo(APIKEY,BLOGURL,'clicks',days='',period='')
wp.clicks.lastday=wordpress.getstats.demo(APIKEY,BLOGURL,'clicks',days='1',period='')
wp.referrers.lastday=wordpress.getstats.demo(APIKEY,BLOGURL,'referrers',days='1',period='')
require(stringr)
getDomain=function(url) str_match(url, "^http[s]?://([^/]*)/.*?")[, 2]
#We can pull out the domains clicks were sent to or referrals came from
wp.clicks.lastday$domain=getDomain(wp.clicks.lastday$click)
wp.referrers.lastday$domain=getDomain(wp.referrers.lastday$referrer)
require(ggplot2)
#Scruffy bar chart - is there a way of doing this sorted chart using geom_bar? How would we reorder x?
c=as.data.frame(table(wp.clicks.yesterday$domain))
ggplot(c)+geom_bar(aes(x=reorder(Var1,Freq),y=Freq),stat='identity')+theme( axis.text.x=element_text(angle=-90))
c=as.data.frame(table(wp.referrers.lastday$domain))
ggplot(c)+geom_bar(aes(x=reorder(Var1,Freq),y=Freq),stat='identity')+theme( axis.text.x=element_text(angle=-90))
(Code as a gist.)
I guess there’s scope for coming up with a set of child functions that pull back specific report types? Also, if we pull in the blog XML archive and extract external links from each page, we could maybe start to analyse we pages are sending traffic where? (Of course, you can use Google Analytics to do this more efficiently, for hosted WordPress blogs don’t support Google Analytics (for no very good reason that I can tell…?)
PS for more WordPress tinkerings, see eg How OUseful.Info Posts Link to Each Other…,which links to a Python script for extracting data from WordPress blog export files that show how blogs posts in a particular WordPress blog link to each other.
NHS Winter Situation Reports: Shiny Viewer v2
Having got my NHS Winter sitrep data scraper into shape (I think!), and dabbled with a quick Shiny demo using the R/Shiny library, I thought I’d tidy it up a little over the weekend and long the way learn a few new presentation tricks.
To quickly recap the data availability, the NHS publish a weekly spreadsheet (with daily reports for Monday to Friday – weekend data is rolled over to the Monday) as an Excel workbook. The workbook contains several sheets, corresponding to different data collections. A weekly scheduled scraper on Scraperwiki grabs each spreadsheet and pulls the data into a rolling database: NHS Sitreps scraper/aggregator. This provides us with a more convenient longitudinal dataset if we want to look at sitrep measures for a period longer than a single week.
So here’s where I’ve got to now – NHS sitrep demo:
The panel on the left controls user actions. The PCT (should be relabelled as “Trust”) drop down list is populated based on the selection of a Strategic Health Authority. The Report types follow the separate sheets in the Winter sitrep spreadsheet (though some of them include several reported measures, which is handled in the graphical display). The Download button allows you to download, as CSV data, the data for the selected report. By default, it downloads data at the SHA level (that is, data for each Trust in the selected SHA), although checkbox control allows you to limit the downloaded results to just data for the selected Trust:

Using just these controls, then, the user can select and download Winter sitrep data (to date), as a CSV file, for any selected Trust, or for all the Trusts in a given SHA. Here’s how the downloader was put together using Shiny:
So how does the Download work? Quite straightforwardly, as it turns out:
#This function marhsals the data for download
downloadData <- reactive(function() {
ds=results.data()
if (input$pctdownonly==TRUE)
ds=subset(ds,tid==input$rep & Code==input$tbl,select=c('Name','fromDateStr','toDateStr','tableName','facetB','value'))
ds
})
output$downloadData <- downloadHandler(
#Add a little bit of logic to name the download file appropriately
filename = function() { if (input$pctdownonly==FALSE) paste(input$sha,'_',input$rep, '.csv', sep='') else paste(input$tbl,'_',input$rep, '.csv', sep='') },
content = function(file) { write.csv(downloadData(), file, row.names=FALSE) }
)
Graphical reports are split into two panels: at the top, views over the report data for each Trust in the selected SHA; at the bottom, more focussed views over the currently selected Trust.
Working through the charts, the SHA level stacked bar char is intended to show summed metrics at the SHA level:

My thinking here was that it may be useful to look at bed availability across an SHA, for example. The learning I had to do for this view was in the layout of the legend:
#g is a ggplot object g=g+theme( legend.position = 'bottom' ) g=g+scale_fill_discrete( guide = guide_legend(title = NULL,ncol=2) )
The facetted, multiplot view also uses independent y-axis scales for each plot (sometimes this makes sense, sometimes it doesn’t. Maybe I need to some logic to control when to use this and when not to?)
#The 'scales' parameter allows independent y-axis limits for each facet plot g=g+facet_wrap( ~tableName+facetB, scales = "free_y" )
The line chart shows the ame data in a more connected way:

To highlight the data trace for the currently selected Trust, I overplot that line with dots that show the value of each data point for that Trust. I’m not sure whether these should be coloured? Again, the y-axis scales are free.
The SHA Boxplot shows the distribution of values for each Trust in the SHA. I overplot the box for the selected Trust using a different colour.

(I guess a “semantic depth of field“/blur approach might also be used to focus attention on the plot for the currently selected Trust?)
My original attempt at this graphic was distorted by very long text labels, that were also misaligned. To get round this, I generated a new label attribute that included line breaks:
#Wordwrapper via:
##http://stackoverflow.com/questions/2351744/insert-line-breaks-in-long-string-word-wrap
#Limit the length of each line to 15 chars
limiter=function(x) gsub('(.{1,15})(\\s|$)', '\\1\n', x)
d$sName=sapply(d$Name,limiter)
#We can then print axis tick labels using d$sName
We can offset the positioning of the label when it is printed:
#Tweak the positioning using vjust, rotate it and also modify label size g=g+theme( axis.text.x=element_text(angle=-90,vjust = 0.5,size=5) )
The Trust Barchart and Linechart are quite straightforward. The Trust Daily Boxplot is a little more involved. The intention of the Daily plot is to try to identify whether or not there are distributional differences according to the day of the week. (Note that some of the data reports relate to summed values over the weekend, so these charts are likely to have comparatively high values on the weekend reporting Monday figure!)

I ‘borrowed’ a script for identifying days of the week… (I need to tweak the way these are ordered – the original author had a very particular application in mind.)
library('zoo')
library('plyr')
#http://margintale.blogspot.co.uk/2012/04/ggplot2-time-series-heatmaps.html
tmp$year<-as.numeric(as.POSIXlt(tmp$fdate)$year+1900)
# the month too
tmp$month<-as.numeric(as.POSIXlt(tmp$fdate)$mon+1)
# but turn months into ordered facors to control the appearance/ordering in the presentation
tmp$monthf<-factor(tmp$month,levels=as.character(1:12), labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
# the day of week is again easily found
tmp$weekday = as.POSIXlt(tmp$fdate)$wday
# again turn into factors to control appearance/abbreviation and ordering
# I use the reverse function rev here to order the week top down in the graph
# you can cut it out to reverse week order
tmp$weekdayf<-factor(tmp$weekday,levels=rev(0:6),labels=rev(c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")),ordered=TRUE)
# the monthweek part is a bit trickier
# first a factor which cuts the data into month chunks
tmp$yearmonth<-as.yearmon(tmp$fdate)
tmp$yearmonthf<-factor(tmp$yearmonth)
# then find the "week of year" for each day
tmp$week <- as.numeric(format(tmp$fdate,"%W"))
# and now for each monthblock we normalize the week to start at 1
tmp<-ddply(tmp,.(yearmonthf),transform,monthweek=1+week-min(week))
The weekdayf value could then be used as the basis for plotting the results by day of week.
To add a little more information to the chart, I overplot the boxplot with actual data point, adding a small amount of jitter added to the x-component (the y-value is true).
g=g+geom_point(aes(x=weekdayf,y=val),position = position_jitter(w = 0.3, h = 0))
I guess it would be more meaningful if the data points were actually ordered by week/year. (Indeed, what I originally intended to do was a seasonal subseries style plot at the day level, to see whether there were any trends within a day of week over time, as well as pull out differences at the level of day of week.)
Finally, the Trust datatable shows the actual data values for the selected report and Trust:

(Remember, this data, or data for this report for each trusts in the selected SHA, can also be downloaded directly as a CSV file.)
The thing I had to learn here was how to disable the printing of the dataframe row names in the SHiny context:
output$view = reactiveTable(function() {
#...get the data and return it for printing
}, include.rownames=FALSE)
As a learning exercise, this app got me thinking about solving several presentational problems, as well as trying to consider what reports might be informative or pattern revealing (for example, the Daily boxplots).
The biggest problem, of course, is coming up with views that are meaningful and useful to end-users, the sorts of questions they may want to ask of the data, and the sorts of things they may want to pull from it. I have no idea who the users, if any, of the Winter sit rep data as published on the NHS website might be, or how they make use of the data, either in mechanistic terms – what do they actually do with the spreadsheets – or at the informational level – what stories they look for in the data/pull out out of it, and what they then use that information for.
This tension is manifest around a lot of public data releases, I think – hacks’n'hackers look for why shiny(?!) things they can do with the data, though often out of any sort of context other than demonstrating technical prowess or quick technical hacks. Users of the data may possibly struggle with doing anything other than opening the spreadsheet in Excel and then copying and pasting it into other spreadsheets, although they might know exactly what they want to get out of the data as presented to them. Some users may be frustrated at a technical level in the sense of knowing what they’d like to be able to get from the data (for example, getting monthly timeseries from weekly timeseries spreadsheets) but may not be able to do it easily for lack of technical skills. Some users may not know what can be readily achieved with the way data is organised, aggregated and mixed with other datasets, and what this data manipulation then affords in its potential for revealing stories, trends, structures and patterns in the data, and here we have a problem with even knowing what value might become unlockable (“Oh, I didn’t know you coud do that with it…”). This is one reason why hackdays – such as the NHS Hackday and various govcamps – can be so very useful (I’m also reminded of MashLib/Mashed Library events where library folk and techie shambrarians come together to learn from each other). What I think I’d like to see more of, though, is people with real/authentic questions that might be asked of data, or real answers they’d like to be able to find from data, starting to air them as puzzles that the data junkies, technicians, wranglers and mechanics amongst us can start to play with from a technical side.
PS this could be handy… downloading PDF docs from Shiny.
PPS Radio 4′s Today programme today had a package on NHS release of surgeon success data. In an early interview with someone from the NHS, the interviewee made the point that the release of the data was there for quality/improvement purposes and to help identify opportunities for supporting best practice (eg along the lines of previous releases of heart surgery performance data. The 8am, after 8 interview, and 8.30am news bulletins all pushed the faux and misleading line of how this data could be used for “parent choice”, (I complained bitterly – twice- via Twitter;-) though the raising standards line was taken in the 9am bulletin. There’s real confusion, I think, about how all this data stuff might, could and should be used (I’m thoroughly confused by it myself), and I’m not sure the press are always very helpful in communicating it…
Mapping Primary Care Trust (PCT) Data, Part 1
The launch or official opening or whatever it was of the Open Data Institute this week provided another chance to grab a snapshot of notable folk in the community, as for example demonstrated by people commonly followed by users of the #ODIlaunch hashtag on Twitter. The PR campaign also resulted in the appearance of some open data related use cases, such as a report in the Economist about an analysis by MastodonC and Prescribing Analytics mapping prescription charges (R code available), with a view to highlighting where prescriptions for branded, as opposed to the recommended generic, drugs are being issued at wasteful expense to the NHS. (See Exploring GP Practice Level Prescribing Data for some of my entry level doodlings with prescription data.)
Quite by chance, I’ve been looking at some other health data recently, (Quick Shiny Demo – Exploring NHS Winter Sit Rep Data), which has been a real bundle of laughs. Looking at a range of health related datasets, data seems to be published at a variety of aggregation levels – individual practices and hospitals, Primary Care Trusts (PCTs), Strategic Health Authorities (SHAs) and the new Clinical Commissioning Groups (CCGs). Some of these map on to geographical regions, that can then be coloured according to a particular measure value associated with that area.
I’ve previously experimented with rendering shapefiles and choropleth maps (Amateur Mapmaking: Getting Started With Shapefiles) so I know R provides one possible environment for generating these maps, so I thought I’d try to pull together a recipe or two for supporting the creation of thematic maps based on health related geographical regions.
A quick trawl for PCT shapefiles turned up nothing useful. @jenit suggested @mastodonc, and @paulbradshaw pointed me to a dataset on Google Fusion Tables, discovered through the Fusion Tables search engine, that included PCT geometry data. So no shapefiles, but there is exportable KML data from Fusion Tables.
At this point I should have followed Paul Bradshaw’s advice, and just uploaded my own data (I was going to start out with mapping per capita uptake of dental services by PCT) to Fusion Tables, merging with the other data set, and generating my thematic maps that way.
But that wasn’t quite the point, which was actually an exercise in pulling together an R based recipe for generating these maps…
Anyway, I’ve made a start, and here’s the code I have to date:
##Example KML: https://dl.dropbox.com/u/1156404/nhs_pct.kml
##Example data: https://dl.dropbox.com/u/1156404/nhs_dent_stat_pct.csv
install.packages("rgdal")
library(rgdal)
library(ggplot2)
#The KML data downloaded from Google Fusion Tables
fn='nhs_pct.kml'
#Look up the list of layers
ogrListLayers(fn)
#The KML file was originally grabbed from Google Fusion Tables
#There's only one layer...but we still need to identify it
kml=readOGR(fn,layer='Fusiontables folder')
#This seems to work for plotting boundaries:
plot(kml)
#And this:
kk=fortify(kml)
ggplot(kk, aes(x=long, y=lat,group=group))+ geom_polygon()
#Add some data into the mix
#I had to grab a specific sheet from the original spreadsheet and then tidy the data little...
nhs <- read.csv("nhs_dent_stat_pct.csv")
kml@data=merge(kml@data,nhs,by.x='Name',by.y='PCT.ONS.CODE')
#I think I can plot against this data using plot()?
plot(kml,col=gray(kml@data$A.30.Sep.2012/100))
#But is that actually doing what I think it's doing?!
#And if so, how can experiment using other colour palettes?
#But the real question is: HOW DO I DO COLOUR PLOTS USING gggplot?
ggplot(kk, aes(x=long, y=lat,group=group)) #+ ????
Here’s what an example of the raw plot looks like:

And the greyscale plot, using one of the dental services uptake columns:

Here’s the base ggplot() view:

However, I don’t know how to actually now plot the data into the different areas? (Oh – might this help? CRAN Task View: Analysis of Spatial Data.)
If you know how to do the colouring, or ggplotting, please leave a comment, or alternatively, chip in an answer to a related question I posted on StackOverflow: Plotting Thematic Maps from KML Data Using ggplot2
Thanks:-)
PS The recent Chief Medical Officer’s Report makes widespread use of a whole range of graphical devices and charts, including cartograms:

Is there R support for cartograms yet, I wonder?! (Hmmm… maybe?)
PPS on the public facing national statistics front, I spotted this job ad yesterday – Head of Rich Content Development, ONS:
The postholder is responsible for inspiring and leading development of innovative rich content outputs for the ONS website and other channels, which anticipate and meet user needs and expectations, including those of the Citizen User. The role holder has an important part to play in helping ONS to realise its vision “for official statistics to achieve greater impact on key decisions affecting the UK and to encourage broader use across the country”.
Key Responsibilities:
1.Inspires, builds, leads and develops a multi-disciplinary team of designers, developers, data analysts and communications experts to produce innovative new outputs for the ONS website and other channels.
2. Keeps abreast of emerging trends and identifies new opportunities for the use of rich web content with ONS outputs.
3. Identifies new opportunities, proposes new directions and developments and gains buy in and commitment to these from Senior Executives and colleagues in other ONS business areas.
4. Works closely with business areas to identify, assess and commission new rich-content projects.
5. Provides, vision, guidance and editorial approval for new projects based on a continual understanding of user needs and expectations.
6. Develops and manages an ongoing portfolio of innovative content, maximising impact and value for money.
7. Builds effective partnerships with media to increase outreach and engagement with ONS content.
8. Establishes best practice in creation of rich content for the web and other channels, and works to improve practice and capability throughout ONS.
Interesting…
More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API
I had a bit of a play with Shiny over the weekend, using the Ergast Motor Racing Data API and the magical Shiny library for R, that makes building interactive, browser based applications around R a breeze.
As this is just a quick heads-up/review post, I’ll largely limit myself to a few screenshots. When I get a chance, I’ll try to do a bit more of a write-up, though this may actually just take the form of more elaborate documentation of the app, both within the code and in the form of explanatory text in the app itself.
If you want to try ou the app, you can find an instance here: F1 2012 Laptime Explorer. The code is also available.
Here’s the initial view – the frist race of the season is selected as a default and data loaded in. The driver list is for all drivers represented during the season.

THe driver selectors allow us to just display traces for selected drivers.
The Race History chart is a classic results chart. It show the difference between the race time to date for each driver, by lap, compared to the average lap time for the winner times the lap number. (As such, this is an offline statistic – it is calculated when the winner’s overall average laptime is known).

Variants of the classic Race History chart are possible, for example, using different base line times, but I haven’t implemented any of them – or the necessary UI controls. Yet…
The Lap Chart is another classic:

Annotations for this chart are also supported, describing all drivers who final status was not “Finished”.

The Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with the fastest overall recorded laptime.

The Personal Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with their personal fastest laptime.

The Personal Deltas Chart shows the difference between one laptime and the next for each driver.

The Race Summary Chart is a chart of my own design that tries to capture notable features relating to race position – the grid position (blue circle), final classification (red circle), position at the end of the first lap (the + or horizontal bar). The violin plot shows the distribution of how many laps the driver spent in each race position. Where the chart is wide, the driver spent a large number of laps in that position.

The x-axis ordering pulls out different features about how the race progressed. I need to add in a control that lets the user select different orderings.
Finally, the Fast Lap text scatterplot shows the fastest laptime for each driver and the lap at which they recorded it.

So – that’s a quick review of the app. All in all it took maybe 3 hours getting my head round the data parsing, 2-3 hours figuring what I wanted to do and learning how to do it in Shiny, and a couple of hours doing it/starting to document/annotate it. Next time, it’ll be much quicker…
Quick Shiny Demo – Exploring NHS Winter Sit Rep Data
Having spent a chink of the weekend and a piece of yesterday trying to pull NHS Winter sitrep data into some sort of shape in Scraperwiki, (described, in part, here: When Machine Readable Data Still Causes “Issues” – Wrangling Dates…), I couldn’t but help myself last night and had a quick go at using RStudio’s Shiny tooling to put together a quick, minimal explorer for it:
For proof of concept, I just pulled in data relating to the Isle of Wight NHS Trust, but it should be possible to build a more generic explorer: Isle of Wight NHS Sit Rep Explorer Demo.
Three files are used to crate the app – a script to define the user interface (ui.R), a script to define the server that responds to UI actions and displays the charts (server.R), and a supporting file that creates variables and functions that are globally available to bother the server and UI scripts (global.R).
##wightsitrep2/global.R
#Loading in CSV directly from https seems to cause problems but this workaround seems okay
floader=function(fn){
temporaryFile <- tempfile()
download.file(fn,destfile=temporaryFile, method="curl")
read.csv(temporaryFile)
}
#This is the data source - a scraperwiki API call
#It would make sense to abstract this further, eg allowing the creation of the URL based around a passed in a select statement
u="https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=csv&name=nhs_sit_reps&query=select%20SHA%2CName%2C%20fromDateStr%2CtoDateStr%2C%20tableName%2CfacetB%2Cvalue%20from%20fulltable%20%20where%20Name%20like%20'%25WIGH%25'"
#Load the data and do a bit typecasting, just in case...
d=floader(u)
d$fdate=as.Date(d$fromDateStr)
d$tdate=as.Date(d$toDateStr)
d$val=as.integer(d$value)
##wightsitrep2/ui.R
library(shiny)
tList=levels(d$tableName)
names(tList) = tList
# Define UI for application that plots random distributions
shinyUI(pageWithSidebar(
# Application title
headerPanel("IW NHS Trust Sit Rep Explorer"),
sidebarPanel(
#Just a single selector here - which table do you want to view?
selectInput("tbl", "Report:",tList),
div("This demo provides a crude graphical view over data extracted from",
a(href='http://transparency.dh.gov.uk/2012/10/26/winter-pressures-daily-situation-reports-2012-13/',
"NHS Winter pressures daily situation reports"),
"relating to the Isle of Wight NHS Trust."),
div("The data is pulled in from a scraped version of the data stored on Scraperwiki",
a(href="https://scraperwiki.com/scrapers/nhs_sit_reps/","NHS Sit Reps"),".")
),
#The main panel is where the "results" charts are plotted
mainPanel(
plotOutput("testPlot"),
tableOutput("view")
)
))
##wightsitrep2/server.R
library(shiny)
library(ggplot2)
# Define server logic
shinyServer(function(input, output) {
#Do a simple barchart of data in the selected table.
#Where there are "subtables", display these using the faceted view
output$testPlot = reactivePlot(function() {
g=ggplot(subset(d,fdate>as.Date('2012-11-01') & tableName==input$tbl))
g=g+geom_bar(aes(x=fdate,y=val),stat='identity')+facet_wrap(~tableName+facetB)
g=g+theme(axis.text.x=element_text(angle=-90),legend.position="none")+labs(title="Isle of Wight NHS Trust")
#g=g+scale_y_discrete(breaks=0:10)
print(g)
})
#It would probable make sense to reshape the data presented in this table
#For example, define columns based on facetB values, so we have one row per date range
#I also need to sort the table by date
output$view = reactiveTable(function() {
head(subset(d,tableName==input$tbl,select=c('Name','fromDateStr','toDateStr','tableName','facetB','value')),n=100)
})
})
I get the feeling that it shouldn’t be too hard to create quite complex Shiny apps relatively quickly, pulling on things like Scraperwiki as a remote data source. One thing I haven’t tried is to use googleVis components, which would support in the first instance at least a sortable table view… Hmmm…
Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship
In Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix I posted a quick hack to calculate the finishing positions that would determine the F1 2012 Drivers’ Championship in today’s United States Grand Prix, leaving a tease dangling around the possibility of working out what combinations would lead to a VET or ALO victory if the championship isn’t decided today. So in the hour before the race started, I began to doodle a quick’n'dirty interactive app that would let me keep track of what the championship scenarios would be for the Brazil race given the lap by lap placement of VET and ALO during the US Grand Prix. Given the prep I’d done in the aforementioned post, this meant figuring out how to code up a similar algorithm in R, and then working out how to make it interactive…
But before I show you how I did it, here’s the scenario for Brazil given how the US race finished:
So how was this quick hack app done…?
Trying out the new Shiny interactive stats app builder from the RStudio folk has been on my to do list for some time. It didn’t take long to realise that an interactive race scenario builder would provide an ideal context for trying it out. There are essentially two (with a minor middle third) steps to a Shiny model:
- work out the points difference between VET and ALO for all their possible points combinations in the US Grand Prix;
- calculate the points difference going into the Brazilian Grand Prix;
- calculate the possible outcomes depending on placements in the Brazilian Grand Prix (essentially, an application of the algorithm I did in the original post).
The Shiny app requires two bits of code – a UI in file ui.R, in which I define two sliders that allow me to set the actual (or anticpated, or possible;-) race classifications in the US for Vettel and Alonso:
library(shiny)
shinyUI(pageWithSidebar(
# Application title
headerPanel("F1 Driver Championship Scenarios"),
# Sidebar with a slider input for number of observations
sidebarPanel(
sliderInput("alo",
"ALO race pos in United States Grand Prix:",
min = 1,
max = 11,
value = 1),
sliderInput("vet",
"VET race pos in United States Grand Prix:",
min = 1,
max = 11,
value = 2)
),
# Show a plot of the generated model
mainPanel(
plotOutput("distPlot")
)
))
And some logic, in file server.R (original had errors; hopefully now bugfixed…) – the original “Paths to the Championship” unpicks elements of the algorithm in a little more detail, but basically I figure out the points difference between VET and ALO based on the points difference at the start of the race and the additional points difference arising from the posited finishing positions for the US race, and then generate a matrix that works out the difference in points awarded for each possible combination of finishes in Brazil:
library(shiny)
library(ggplot2)
library(reshape)
# Define server logic required to generate and plot a random distribution
shinyServer(function(input, output) {
points=data.frame(pos=1:11,val=c(25,18,15,12,10,8,6,4,2,1,0))
points[[1,2]]
a=245
v=255
pospoints=function(a,v,pdiff,points){
pp=matrix(ncol = nrow(points), nrow = nrow(points))
for (i in 1:nrow(points)){
for (j in 1:nrow(points))
pp[[i,j]]=v-a+pdiff[[i,j]]
}
pp
}
pdiff=matrix(ncol = nrow(points), nrow = nrow(points))
for (i in 1:nrow(points)){
for (j in 1:nrow(points))
pdiff[[i,j]]=points[[i,2]]-points[[j,2]]
}
ppx=pospoints(a,v,pdiff,points)
winmdiff=function(vadiff,pdiff,points){
win=matrix(ncol = nrow(points), nrow = nrow(points))
for (i in 1:nrow(points)){
for (j in 1:nrow(points))
if (i==j) win[[i,j]]=''
else if ((vadiff+pdiff[[i,j]])>=0) win[[i,j]]='VET'
else win[[i,j]]='ALO'
}
win
}
# Function that generates a plot of the distribution. The function
# is wrapped in a call to reactivePlot to indicate that:
#
# 1) It is "reactive" and therefore should be automatically
# re-executed when inputs change
# 2) Its output type is a plot
#
output$distPlot <- reactivePlot(function() {
wmd=winmdiff(ppx[[input$vet,input$alo]],pdiff,points)
wmdm=melt(wmd)
g=ggplot(wmdm)+geom_text(aes(X1,X2,label=value,col=value))
g=g+xlab('VET position in Brazil')+ ylab('ALO position in Brazil')
g=g+labs(title="Championship outcomes in Brazil")
g=g+ theme(legend.position="none")
g=g+scale_x_continuous(breaks=seq(1, 11, 1))+scale_y_continuous(breaks=seq(1, 11, 1))
print(g)
})
})
To run the app, if your server and ui files are in some directory shinychamp, then something like the following should et the Shiny app running:
library(shiny)
runApp("~/path/to/my/shinychamp")
You can find the code on github here: F1 Championship 2012 – scenarios if the race gets to Brazil…
Unfortunately, until a hosted service is available, you’ll have to run it yourself if you want to try it out…
Disclaimer: I’ve been rushing to get this posted before the start of the race… If you spot errors, please shout!




















