OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Rstats’ Category

Custom Gridlines and Line Guides in R/ggplot Charts


In the last quarter of last year, I started paying more attention to the use of custom grid lines and line guides in charts I’ve been developing for the Wrangling F1 Data With R book.

The use of line guides was in part inspired by canopy views from within the cockpit of one of the planes that make up the Red Arrows aerobatic display team.

[Image: canopy view from a Red Arrows cockpit]

A little bit of digging suggested that the lines on the cockpit are actually an explosive cord used to shatter the cockpit if a pilot needs to eject from the aircraft – but I can’t believe that the pilots don’t also use the lines as a crib for helping position themselves with respect to the other aircraft in the team? (I emailed the Red Arrows press office to ask about the extent to which the cockpit lines are used in this way but got what amounted to a NULL response.)

Whatever the case, it seems eminently sensible to me that we make use of line guides to help us read charts more effectively, where it makes sense to, or to use guides as stepping stones to more refined data views.

The following example shows how we can generate a 2 dimensional grid based on the discrete points allocations possible for being placed in the top 10 positions of a Formula One race.

The grid lines show allowable points values, and are constructed as follows:

library(ggplot2)

#Points awarded for finishing positions 1 to 10
NEWPOINTS = c(25,18,15,12,10,8,6,4,2,1)

#The newpoints dataframe has two columns
#The first column indicates points available, in order
#The second column is the maximum number of points the lower placed driver could score
newpoints=data.frame(x=c(NEWPOINTS,0),y=c(NEWPOINTS[-1],0,0))

baseplot_singleWay=function(g){
  g=g+xlab('Highest placed driver points')+ylab('Lower placed driver points')
  g=g+scale_y_continuous(breaks = newpoints$x,minor_breaks=NULL) 
  g=g+scale_x_continuous(breaks = newpoints$x,minor_breaks=NULL)
  g=g+coord_fixed()
  g=g+guides(size=FALSE)+theme_bw()
  g
}

g=baseplot_singleWay(ggplot())
g

The final chart (of which this is a “one-sided” example) is used to display a count of the number of races in which at least one of the two drivers in a particular team scores points in a race. The horizontal x-axis represents the number of points scored by the highest placed driver in the race, and the y-axis the number of points scored by their lower placed team mate.

A fixed co-ordinate scheme is adopted to ensure that points are separated consistently on the x and y axes. A dotted red line shows the maximum number of points the lower placed driver in a team could score, depending on the number of points scored by their higher placed team mate.

#Add in the maximum points line guide

g+geom_line(data=newpoints,aes(x=x,y=y),col='red',linetype='dotted')

[Image: pointsPerformance-basechartAnnotated-1]
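To see why the dotted guide steps down the way it does, it can help to print the newpoints pairs directly. The following base R sketch rebuilds the same data frame and checks that each y value is never larger than its x value. The object name guide is an assumption introduced here purely for illustration.

```r
NEWPOINTS <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)

# Same construction as newpoints above: x is the points scored by the higher
# placed driver, y is the most the lower placed team mate could then score
guide <- data.frame(x = c(NEWPOINTS, 0), y = c(NEWPOINTS[-1], 0, 0))
print(guide)

# The lower placed driver can never outscore their team mate
stopifnot(all(guide$y <= guide$x))
```

Each row pairs a points value with the next one down the scale, which is why the guide sits just below the diagonal of the grid.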

A two-sided version of the chart is also possible in which the x-axis represents the points scored in a particular race by one named driver and the y-axis represents the points scored by another named driver in the same race. The two-sided chart has two guidelines, representing the maximum points that can be scored by the other driver in the event of one particular driver being higher placed in a race.

[Image: pointsPerformance-pointsPerformanceChart-2way-1]
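One way the second guideline might be produced is by reflecting the first one across the diagonal. The sketch below assumes that is how the two-sided chart works (the post does not say so explicitly, and the book chapter has the full description). The names newpoints_mirror and guides2way are hypothetical.

```r
NEWPOINTS <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)
newpoints <- data.frame(x = c(NEWPOINTS, 0), y = c(NEWPOINTS[-1], 0, 0))

# Assumption: swapping the columns gives the guide for the case where the
# driver on the y-axis is the higher placed of the two
newpoints_mirror <- data.frame(x = newpoints$y, y = newpoints$x)

# Stack both guides, labelled so that each can be drawn as its own line
guides2way <- rbind(
  cbind(newpoints, guide = "x-axis driver ahead"),
  cbind(newpoints_mirror, guide = "y-axis driver ahead")
)
print(guides2way)

# With ggplot2 loaded and a base plot g, the two guides could then be drawn
# in row order with something like:
# g + geom_path(data = guides2way, aes(x = x, y = y, group = guide),
#               col = 'red', linetype = 'dotted')
```

geom_path is suggested rather than geom_line because the mirrored guide repeats an x value of 0, and geom_path connects points in the order they appear in the data.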

A full description can be found in the Points Performance Charts chapter of the Wrangling F1 Data With R book.

Written by Tony Hirst

January 2, 2015 at 9:09 pm

Posted in Rstats

Sketching Scatterplots to Demonstrate Different Correlations


Looking just now for an openly licensed graphic showing a set of scatterplots that demonstrate different correlations between X and Y values, I couldn’t find one.

[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]

So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):

library(MASS)

corrdata=function(samples=200,r=0){
  data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
  X = data[, 1]  # standard normal (mu=0, sd=1)
  Y = data[, 2]  # standard normal (mu=0, sd=1)
  data.frame(x=X,y=Y)
}

df=data.frame()
for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){
  tmp=corrdata(200,i)
  tmp['corr']=i
  df=rbind(df,tmp)
}

library(ggplot2)

g=ggplot(df,aes(x=x,y=y))+geom_point(size=1)
g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')

And here’s an example of the result:

[Image: scatterCorr]

It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:

[Image: scatterCorrSquare]
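The empirical=TRUE argument in the corrdata function is what makes each panel hit its target correlation in the sample itself, rather than only on average. A quick check, sketched here with a trimmed copy of corrdata, compares the requested and sample correlations for a few values of r.

```r
library(MASS)

# Same generator as corrdata above
corrdata <- function(samples = 200, r = 0) {
  data <- mvrnorm(n = samples, mu = c(0, 0),
                  Sigma = matrix(c(1, r, r, 1), nrow = 2), empirical = TRUE)
  data.frame(x = data[, 1], y = data[, 2])
}

for (r in c(0.8, 0, -0.5)) {
  d <- corrdata(200, r)
  # With empirical=TRUE the sample correlation matches r up to rounding error
  cat("requested:", r, " sample:", round(cor(d$x, d$y), 6), "\n")
}
```

Dropping empirical=TRUE would give samples whose correlation only scatters around r, which is fine for many purposes but less tidy for a teaching graphic.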

So what sort of OER does that make this post?!;-)

PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?

UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)

#based on http://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184
#The NORmal-To-Anything (NORTA) algorithm
library(MASS)
library(ggplot2)

#NORTA - h/t rseiter
corrdata2=function(samples, r){
  mu <- rep(0,4)
  Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r)
  rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma)
  #unifvars <- pnorm(rawvars)
  unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions
  print(cor(unifvars))
  unifvars
}

df2=data.frame()
for (i in c(1,0.9,0.6,0.4,0)){
  tmp=data.frame(corrdata2(200,i)[,1:2])
  tmp['corr']=i
  df2=rbind(df2,tmp)
}
g=ggplot(df2,aes(x=X1,y=X2))+geom_point(size=1)+facet_wrap(~corr)
g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()

Here’s what it looks like with 1000 points:

[Image: unifromScatterCorr]

Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…

[Image: unifscattercorrsmall]

The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…
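A possible explanation for the error, offered as a sketch rather than a diagnosis of the original run: corrdata2 builds a 4 by 4 matrix with the same correlation r between every pair of variables, and four variables cannot all be strongly negatively correlated with one another. A matrix of size n with 1 on the diagonal and r elsewhere has eigenvalues 1+(n-1)r and 1-r, so with n=4 it stops being a valid covariance matrix once r falls below -1/3. The helper names equicorr and min_eigen are assumptions introduced for this check.

```r
# Equicorrelation matrix as built in corrdata2: 1 on the diagonal, r elsewhere
equicorr <- function(n, r) matrix(r, nrow = n, ncol = n) + diag(n) * (1 - r)

# Smallest eigenvalue; a negative value means the matrix is not a valid
# covariance matrix
min_eigen <- function(n, r) min(eigen(equicorr(n, r), symmetric = TRUE)$values)

for (r in c(-0.2, -0.5)) {
  cat("r =", r, " n = 4:", round(min_eigen(4, r), 3),
      " n = 2:", round(min_eigen(2, r), 3), "\n")
}
```

On this reading, mildly negative values such as -0.2 should still be accepted by the four variable version, and since only the first two columns are plotted, a 2 by 2 Sigma (as in the earlier corrdata function) would allow any r between -1 and 1.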

If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this:

[Image: corrnorm1000]

Hmmm. Not sure about that…?
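For what it's worth, pnorm() maps standard normal values onto the interval (0, 1), and qunif() with its default min and max returns its input unchanged, so the qunif slot in corrdata2 is really a placeholder for some other quantile function. The sketch below swaps in qexp to get exponentially distributed values. It builds the correlated normals by hand in base R rather than with mvrnorm, which is a substitution made here to keep the example free of package dependencies.

```r
set.seed(1)
n <- 1000
r <- 0.6

# Correlated standard normals built by hand (a stand-in for mvrnorm)
z1 <- rnorm(n)
z2 <- r * z1 + sqrt(1 - r^2) * rnorm(n)

# NORTA step: normal -> uniform via pnorm, then uniform -> target via a
# quantile function (qexp here, where corrdata2 uses qunif)
u1 <- pnorm(z1)
u2 <- pnorm(z2)
x1 <- qexp(u1, rate = 1)
x2 <- qexp(u2, rate = 1)

# qunif with default arguments leaves the uniform values as they are
stopifnot(isTRUE(all.equal(qunif(u1), u1)))

# Rank (Spearman) correlation survives the monotone transforms; the Pearson
# correlation generally shifts a little
cat("spearman before:", cor(z1, z2, method = "spearman"), "\n")
cat("spearman after: ", cor(x1, x2, method = "spearman"), "\n")
cat("pearson after:  ", cor(x1, x2), "\n")
```

The design point is that the transforms preserve the ordering of the values, so the rank correlation carries over even though the marginal distribution changes.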

Written by Tony Hirst

December 17, 2014 at 1:24 pm

Posted in Anything you want, Rstats


Identifying Position Change Groupings in Rank Ordered Lists


The title says it all, doesn’t it?!

Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.

[Image: poschanges]

The question I had in mind was – how can I identify positions that are being contested during a particular window of time, where by contested I mean that the particular position was held by more than one person in a particular window of time?

Let’s zoom in to look at a couple of particular steps.

[Image: poschangeGroup]

We see distinct groups of individuals who swap positions with each other between those two consecutive steps, so how can we automatically detect the positions that these drivers are fighting over?

A solution given to a Stack Overflow question on how to get disjoint sets from a list in R gives what I thought was a really nice solution: treat it as a graph, and then grab the connected components.

Here’s my working of it. Start by getting a list of results that show a particular driver held different positions in the window selected – each row in the original dataframe identifies the position held by a particular driver at the end of a particular lap:

library(DBI)
ergastdb =dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Get a race identifier for a specific race
raceId=dbGetQuery(ergastdb,
                  'SELECT raceId FROM races WHERE year="2012" AND round="1"')

q=paste('SELECT * FROM lapTimes WHERE raceId=',raceId[[1]])

lapTimes=dbGetQuery(ergastdb,q)
lapTimes$position=as.integer(lapTimes$position)

library(plyr)

#Sort by driver, then by lap, just in case
lapTimes=arrange(lapTimes,driverId,lap)

#Create a couple of new columns
#pre is previous lap position held by a driver given their current lap
#ch is position change between the current and previous lap
tlx=ddply(lapTimes,.(driverId),transform,pre=(c(0,position[-length(position)])),ch=diff(c(0,position)))

#Find rows where there is a change between a given lap and its previous lap
#In particular, focus on lap 17
llx=tlx[tlx$ch!=0 & tlx$lap==17,c("position","pre")]

llx

This filters the complete set of data to just those rows where there is a difference between a driver’s current position and previous position (the first column in the result just shows row numbers and can be ignored).

##      position pre
## 17          2   1
## 191        17  18
## 390         9  10
## 448         1   2
## 506         6   4
## 719        10   9
## 834         4   5
## 892        18  19
## 950         5   6
## 1008       19  17
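To make the pre and ch columns concrete, here is the same pair of expressions applied to a single made-up driver. The position values are hypothetical and the sketch uses base R only, leaving out the ddply grouping step.

```r
# Hypothetical lap-by-lap positions for a single driver
position <- c(3, 3, 2, 4)

# Previous-lap position, padded with 0 for the first lap
pre <- c(0, position[-length(position)])
# Change in position relative to the previous lap
ch <- diff(c(0, position))

print(data.frame(lap = seq_along(position), position, pre, ch))
```

Because of the zero padding, ch on the first lap is just the driver's lap one position, so lap 1 would always register as a change. That does not matter when filtering on lap 17 as above, but it is worth bearing in mind for windows that include the first lap.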

We can now create a graph in which nodes represent positions (position or pre values) and edges connect a current and previous position.

#install.packages("igraph")
#http://stackoverflow.com/a/25130575/454773
library(igraph)

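#Each row of llx becomes an edge between a current and a previous position
#(newer igraph releases call this function graph_from_data_frame())
posGraph = graph.data.frame(llx)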

plot(posGraph)

The resulting graph is split into several components:

[Image: posgraph]

We can then identify the connected components:

posBattles=split(V(posGraph)$name, clusters(posGraph)$membership)
#Find the position change battles
for (i in 1:length(posBattles)) print(posBattles[[i]])

This gives the following clusters, and their corresponding members:

## [1] "2" "1"
## [1] "17" "18" "19"
## [1] "9"  "10"
## [1] "6" "4" "5"

To generalise this approach, I think we need to do a couple of things:

  • allow a wider window within which to identify battles (so look over groups of three or more consecutive laps);
  • simplify the way we detect position changes for a particular driver; for example, if we take the set of positions held by a driver within the desired window, if the cardinality of the set (that is, its size) is greater than one, then we have had at least one position change for that driver within that window. Each set of size > 1 of unique positions held by different drivers can be used to generate a set of distinct, unordered pairs that connect the positions (I think it only matters that they are connected, not that a driver specifically went from position x to position y going from one lap to the next?). If we generate the graph from the set of distinct unordered pairs taken across all drivers, we should then be able to identify the contested/driver change position clusters.
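The two steps listed above might be sketched along the following lines. This is an untested-on-real-data illustration, not the eventual book implementation: the toy laps data frame is made up, and contested_positions is a hypothetical helper name.

```r
library(igraph)

# Hypothetical toy data: positions of four drivers over laps 1 to 3
laps <- data.frame(
  driverId = rep(c("A", "B", "C", "D"), each = 3),
  lap      = rep(1:3, times = 4),
  position = c(1, 2, 2,  2, 1, 3,  3, 3, 1,  4, 4, 4)
)

contested_positions <- function(laps, from, to) {
  win <- laps[laps$lap >= from & laps$lap <= to, ]
  # Set of distinct positions held by each driver within the window
  held <- lapply(split(win$position, win$driverId), function(p) sort(unique(p)))
  # Keep only drivers who held more than one position
  held <- held[lengths(held) > 1]
  if (length(held) == 0) return(list())
  # Distinct unordered pairs of positions, pooled across drivers
  pairs <- unique(do.call(rbind, unname(lapply(held, function(p) t(combn(p, 2))))))
  g <- graph_from_data_frame(as.data.frame(pairs), directed = FALSE)
  # Each connected component is one group of contested positions
  unname(split(V(g)$name, components(g)$membership))
}

print(contested_positions(laps, 1, 3))
print(contested_positions(laps, 1, 2))
```

In the toy data, driver D holds fourth place throughout and so never appears in a group, while the window from lap 1 to lap 3 merges positions 1, 2 and 3 into a single battle.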

Hmm… I need to try that out… And when I do, if and when it works(?!), I’ll add a complete demonstration of it – and how we might make use of it – to the Wrangling F1 Data With R book.

Written by Tony Hirst

December 9, 2014 at 10:44 am

Posted in f1stats, Rstats

Information Density and Custom Chart Designs

I’ve been doodling today with some charts for the Wrangling F1 Data With R living book, trying to see how much information I can pack into a single chart.

The initial impetus came simply from thinking about a count of laps led in a particular race by each driver; this morphed into charting the number of laps in each position for each driver, and then onto a more comprehensive race summary chart (see More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API for an earlier graphical attempt at producing a race summary chart).

[Image: lapPosChart]

The chart shows:

  • grid position: identified using an empty grey square;
  • race position after the first lap: identified using an empty grey circle;
  • race position on each driver’s last lap: y-value (position) of corresponding pink circle;
  • points cutoff line: a faint grey dotted line to show which positions are inside – or out of – the points;
  • number of laps completed by each driver: size of pink circle;
  • total laps completed by driver: greyed annotation at the bottom of the chart;
  • whether a driver was classified or not: the total lap count is displayed using a bold font for classified drivers, and in italics for unclassified drivers;
  • finishing status of each driver: classification statuses other than *Finished* are also recorded at the bottom of the chart.

The chart also shows drivers who started the race but did not complete the first lap.

What the chart doesn’t show is what stage of the race the driver was in each position, and how long for. But I have an idea for another chart that could help there, as well as being able to reuse elements used in the chart shown here.

FWIW, the following fragment of R code shows the ggplot function used to create the chart. The data came from the ergast API, though it did require a bit of wrangling to get it into a shape that I could use to power the chart.

#Reorder the drivers according to a final ranked position
g=ggplot(finalPos,aes(x=reorder(driverRef,finalPos)))
#Highlight the points cutoff
g=g+geom_hline(yintercept=10.5,colour='lightgrey',linetype='dotted')
#Highlight the position each driver was in on their final lap
g=g+geom_point(aes(y=position,size=lap),colour='red',alpha=0.15)
#Highlight the grid position of each driver
g=g+geom_point(aes(y=grid),shape=0,size=7,alpha=0.2)
#Highlight the position of each driver at the end of the first lap
g=g+geom_point(aes(y=lap1pos),shape=1,size=7,alpha=0.2)
#Provide a count of how many laps each driver held each position for
g=g+geom_text(data=posCounts,
              aes(x=driverRef,y=position,label=poscount,alpha=alpha(poscount)),
              size=4)
#Number of laps completed by driver
g=g+geom_text(aes(x=driverRef,y=-1,label=lap,fontface=ifelse(is.na(classification), 'italic' , 'bold')),size=3,colour='grey')
#Record the status of each driver
g=g+geom_text(aes(x=driverRef,y=-2,label=ifelse(status!='Finished', status,'')),size=2,angle=30,colour='grey')
#Styling - tidy the chart by removing the transparency legend
g+theme_bw()+xRotn()+xlab(NULL)+ylab("Race Position")+guides(alpha=FALSE)

The fully worked code can be found in a forthcoming update to the Wrangling F1 Data With R living book.
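As a flavour of the wrangling involved, a posCounts style table (laps spent in each position by each driver) could be derived from lap-by-lap data roughly as follows. The lap data here is hypothetical, and only the column names driverRef, position and poscount are taken from the plotting code above.

```r
# Hypothetical lap data: one row per driver per lap
laps <- data.frame(
  driverRef = rep(c("driverA", "driverB"), each = 5),
  lap       = rep(1:5, times = 2),
  position  = c(1, 1, 2, 1, 1,
                2, 2, 1, 2, 2)
)

# Count the laps each driver spent in each position
posCounts <- aggregate(lap ~ driverRef + position, data = laps, FUN = length)
names(posCounts)[names(posCounts) == "lap"] <- "poscount"
print(posCounts)

# Laps led is then just the position 1 slice of the same table
lapsLed <- posCounts[posCounts$position == 1, c("driverRef", "poscount")]
print(lapsLed)
```

This also shows how the original idea of counting laps led falls out as a special case of the laps-per-position count.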

Written by Tony Hirst

November 21, 2014 at 6:21 pm

Posted in Rstats


F1 Championship Race, 2014 – Winning Combinations…

As we come up to the final two races of the 2014 Formula One season, the double points mechanism for the final race means that two drivers are still in with a shot at the Drivers’ Championship: Lewis Hamilton and Nico Rosberg.

As James Allen describes in Hamilton closes in on world title: maths favour him but Abu Dhabi threat remains:

Hamilton needs 51 points in the remaining races to be champion if Rosberg wins both races. Hamilton can afford to finish second in Brazil and at the double points finale in Abu Dhabi and still be champion. Mathematically he could also finish third in Brazil and second in the finale and take it on win countback, as Rosberg would have just six wins to Hamilton’s ten.
If Hamilton leads Rosberg home again in a 1-2 in Brazil, then he will go to Abu Dhabi needing to finish fifth or higher to be champion (echoes of Brazil 2008!!). If Rosberg does not finish in Brazil and Hamilton wins the race, then Rosberg would need to win Abu Dhabi with Hamilton not finishing; no other scenario would give Rosberg the title.

A couple of years ago, I developed an interactive R/shiny app for exploring finishing combinations of two drivers in the last two races of a season to see what situations led to what result: Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship.

[Image: f12014champshiny]

I’ve updated the app (taking into account the matter of double points in the final race) so you can check out James Allen’s calculations with it (assuming I got my sums right too!). I tried to pop up an interactive version to Shinyapps, but the Shinyapps publication mechanism seems to be broken (for me at least) at the moment…:-(

In the meantime, if you have RStudio installed, you can run the application yourself. The code is available and can be run from RStudio, with the shiny package loaded, using: runGist("81380ff09ebe1cd67005")
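The arithmetic behind the quoted scenarios can also be checked by hand in a few lines of R. This is a minimal sketch and not the code from the gist: the 24 point lead is an assumption inferred from the quoted figures (Rosberg can gain at most 75 points and Hamilton needs 51), and the margin function is a hypothetical helper.

```r
# Points for finishing positions 1 to 10; index 11 stands for no points
race_points <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0)

# Assumption: Hamilton's lead going into Brazil, inferred from the quote
# (Rosberg can gain at most 25 + 50 = 75 and Hamilton needs 51 to match)
lead <- 75 - 51

# Hamilton's final margin over Rosberg given both drivers' finishing
# positions in Brazil and in the double points finale in Abu Dhabi
margin <- function(ham_brazil, ham_abudhabi, ros_brazil = 1, ros_abudhabi = 1) {
  ham <- race_points[ham_brazil] + 2 * race_points[ham_abudhabi]
  ros <- race_points[ros_brazil] + 2 * race_points[ros_abudhabi]
  lead + ham - ros
}

margin(2, 2)   # Hamilton second in both races behind Rosberg
margin(3, 2)   # third in Brazil, second in Abu Dhabi
# Hamilton wins Brazil from Rosberg, then finishes fifth or sixth behind a
# Rosberg win in Abu Dhabi
margin(1, 5, ros_brazil = 2)
margin(1, 6, ros_brazil = 2)
```

A positive margin means Hamilton takes the title on points, a negative one means Rosberg does, and a margin of zero is the case that goes to win countback.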

When I get a chance, I’ll weave elements of this recipe into the Wrangling F1 Data With R book.

PS I’ve also started using the F1dataJunkie blog again as a place to post drafts and snippets of elements I’m working on for that book…

Written by Tony Hirst

November 8, 2014 at 2:07 pm

Posted in Rstats


Wrangling F1 Data With R – F1DataJunkie Book

Earlier this year I started trying to pull some of my #f1datajunkie R-related ramblings together in book form. The project stalled, but to try to reboot it I’ve started publishing it as a living book over on Leanpub. Several of the chapters are incomplete, with TO DO items sketched in; others are still unpublished. The beauty of the Leanpub model is that if you buy a copy, you continue to get access to all future updated versions of the book. (And my idea is that by getting the book out there as it is, I’ll feel as if there’s more (social) pressure on actually trying to keep up with it…)

I’ll be posting more details about how the Leanpub process works (for me at least) in the next week or two, but for now, here’s a link to the book: Wrangling F1 Data With R: A Data Junkie’s Guide.

Here’s the table of contents so far:

  • Foreword
    • A Note on the Data Sources
  • Introduction
    • Preamble
    • What are we trying to do with the data?
    • Choosing the tools
    • The Data Sources
    • Getting the Data into RStudio
    • Example F1 Stats Sites
    • How to Use This Book
    • The Rest of This Book…
  • An Introduction to RStudio and R dataframes
    • Getting Started with RStudio
    • Getting Started with R
    • Summary
  • Getting the data from the Ergast Motor Racing Database API
    • Accessing Data from the ergast API
    • Summary
  • Getting the data from the Ergast Motor Racing Database Download
    • Accessing SQLite from R
    • Asking Questions of the ergast Data
    • Summary
    • Exercises and TO DO
  • Data Scraped from the F1 Website
    • Problems with the Formula One Data
    • How to use the FormulaOne.com alongside the ergast data
  • Reviewing the Practice Sessions
    • The Weekend Starts Here
    • Practice Session Data from the FIA
    • Sector Times
    • FIA Media Centre Timing Sheets
  • A Quick Look at Qualifying
    • Qualifying Session Position Summary Chart
    • Another Look at the Session Tables
    • Ultimate Lap Positions
  • Lapcharts
    • Annotated Lapcharts
  • Race History Charts
    • The Simple Laptime Chart
    • Accumulated Laptimes
    • Gap to Leader Charts
    • The Lapalyzer Session Gap
    • Eventually: The Race History Chart
  • Pit Stop Analysis
    • Pit Stop Data
    • Total pit time per race
    • Pit Stops Over Time
    • Estimating pit loss time
    • Tyre Change Data
  • Career Trajectory
    • The Effect of Age on Performance
    • Statistical Models of Career Trajectories
    • The Age-Productivity Gradient
    • Summary
  • Streakiness
    • Spotting Runs
    • Generating Streak Reports
    • Streak Maps
    • Team Streaks
    • Time to N’th Win
    • TO DO
    • Summary
  • Conclusion
  • Appendix One – Scraping formula1.com Timing Data
  • Appendix Two – FIA Timing Sheets
    • Downloading the FIA timing sheets for a particular race
  • Appendix – Converting the ergast Database to SQLite

If you think you deserve a free copy, let me know… ;-)

Written by Tony Hirst

October 31, 2014 at 12:04 am

Posted in Rstats


Running “Native” Data Wrangling Applications in the Browser – IPython Notebooks (and R?) in Chrome

Using browser based data analysis toolkits such as pandas in IPython notebooks, or R in RStudio, means you need to have access to python or R and the corresponding application server either on your own computer, or running on a remote server that you have access to.

When running occasional training sessions or workshops, this can cause several headaches: either a remote service needs to be set up that is capable of supporting the expected number of participants, security may need putting in place, accounts configured (or account management tools supported), network connections need guaranteeing so that participants can access the server, and so on; or participants need to install software on their own computers: ideally this would be done in advance of a training session, otherwise training time is spent installing, configuring and debugging software installs; some computers may have security policies that prevent users installing software, or require an IT person with admin privileges to install the software, and so on.

That’s why the coLaboratory Chrome extension looks like an interesting innovation – it runs an IPython notebook fork, with pandas and matplotlib as a Chrome Native Client application. I posted a quick walkthrough of the extension over on the School of Data blog: Working With Data in the Browser Using python – coLaboratory.

Via a Twitter exchange with @nativeclient, it seems that there’s also the possibility that R could run as a dependency free Chrome extension. Native Client seems to like things written in C/C++, which underpins R, although I think R also has some fortran dependencies. (One of the coLaboratory talks mentioned the to do list item of getting scipy (I think?) running in the coLaboratory extension, the major challenge there (or whatever the package was) being the fortran src; so there may be synergies in working the fortran components there?)

Within a couple of hours of the twitter exchange starting, Brad Nelson/@flagxor posted a first attempt at an R port to the Native Client. I don’t pretend to understand what’s involved in moving from this to an extension with some sort of useable UI, even if only a command line, but it represents an interesting possibility: of being able to run R in the browser (or at least, in Chrome). Package availability would be limited of course to packages compiled to run using PNaCl.

For training events, there is still the requirement that users install a Chrome browser on their computer and then install the extension into that. However, I think it is possible to run Chrome as a portable app – that is, from a flash drive such as a USB memory stick: Google Chrome Portable (Windows).

I’m not sure how fast it would be able to run, but it suggests there may be a way of carrying a portable, dependency free pandas environment around that you can run on a Windows computer from a USB key?! And maybe R too…?

Written by Tony Hirst

August 22, 2014 at 9:42 am
