Category: Rstats

Running RStudio on Digital Ocean, AWS etc Using Tutum and Docker Containers

Via RBloggers, I noticed a tutorial today on setting up RStudio Server using Amazon Web Services (AWS).

In the post Getting Started With Personal App Containers in the Cloud I described how I linked my tutum account to a Digital Ocean hosting account and then launched a Digital Ocean server. (How to link tutum to Amazon AWS is described here: tutum support: Link your Amazon Web Services account.)

Having launched a server (also described in Getting Started With Personal App Containers in the Cloud), we can now create a new service that will fire up an RStudio container.

First up, we need to locate a likely container – the official one is the rocker/rstudio image:


Having selected the image, we need to do a little bit of essential configuration (we could do more, like giving the service a new name):


Specifically, we need to publish the port so that it’s publicly viewable – then we can Create and Deploy the service:


After a minute or two, the service should be up and running:


We can now find the endpoint, and click through to it (note: we need to change the URL from a tcp:// address to an http:// one). (Am I doing something wrong in the setup that stops the http URL being minted as the service endpoint?)


URL tweaked, you should now be able to see an RStudio login screen. The default user is rstudio, and the default password is rstudio too:


And there we have it:-)


So we don’t continue paying for the server, I generally stop the container and then terminate it to destroy it…


And then terminate the node…


So, assuming the Amazon sign-up process is painless, it shouldn’t be much harder than that?

By the by, it’s possible to link containers to other containers; here’s an example (on the desktop, using boot2docker) that links an RStudio container to a MySQL database: Connecting RStudio and MySQL Docker Containers – an example using the ergast db. When I get a chance, I’ll have a go at doing that via tutum too…

Spotting Potential Battles in F1 Races

Over the last couple of races, I’ve started trying to review a variety of battlemaps for various drivers in each race. Prompted by an email request for more info around the battlemaps, I generated a new sketch charting the on track gaps between each driver and the lap leader for each lap of the race (How the F1 Canadian Grand Prix Race Evolved on Track).

Colour is used to distinguish cars on the lead lap from lapped drivers. For lapped drivers, a count of how many laps they are behind the leader is displayed. I additionally overplot with a highlight for a specified driver, as well as adding in a mark that shows the on-track position of the leader of the next lap, along with their driver code.


Battles can be identified through the close proximity of two or more drivers within a lap, across several laps. The ‘next-lap-leader’ time at the far right shows how close the leader on the next lead lap is to the backmarker (on track) on the current lead lap.

By highlighting two particular drivers, we could compare how their races evolved, perhaps highlighting different strategies used within a race that eventually bring the drivers into a close competitive battle in the last few laps of a race.

The unchanging leader-on-track-delta-of-0 line is perhaps missing an informational opportunity? For example, should we set the leader’s time to be the delta compared to the lap time for the leader laps from the previous lead lap? Or a delta compared to the fastest laptime on the previous lead lap? And if we do start messing about with an offset to the leader’s lap time, we presumably need to apply the same offset to the laptime of everyone else on the lap so we can still see the comparative on-track gaps to leader?

On the to-do list are various strategies for automatically identifying potential battles based on a variety of in-lap and across-lap heuristics.

Here’s the code:

library(ggplot2)
library(plyr)

#Grab some data - lapsData.df() comes from the ergast API wrapper
#used for the Wrangling F1 Data With R book
lapTimes = lapsData.df(2015, 7)

#Process the laptimes (sketch: assumes a rawtime laptime-in-seconds column)
#Accumulate each driver's laptimes to give their race time at the end of each lap
lapTimes = ddply(lapTimes, .(driverId), transform, acctime = cumsum(rawtime))
#Identify the lead lap each laptime was recorded on - that is, the number of
#leader crossings at or before that accumulated time
lapTimes = arrange(lapTimes, acctime)
lapTimes$leadlap = cumsum(lapTimes$position == 1)
#Count how many laps each car is behind the lap leader
lapTimes$lapsbehind = lapTimes$leadlap - lapTimes$lap

#Find the accumulated race time at the start of each leader's lap
lapTimes = ddply(lapTimes, .(leadlap), transform, lstart = min(acctime))

#Find the on-track gap to leader
lapTimes['trackdiff'] = lapTimes['acctime'] - lapTimes['lstart']

#Construct a dataframe that contains the difference between the 
#leader accumulated laptime on current lap and next lap
#i.e. how far behind current lap leader is next-lap leader?
ll = data.frame(t = diff(lapTimes[lapTimes['position'] == 1, 'acctime']))
#Generate a de facto lap count
ll['n'] = 1:nrow(ll)
#Grab the code of the lap leader on the next lap
ll['c'] = lapTimes[lapTimes['position'] == 1 & lapTimes['lap'] > 1, 'code']

#Plot the on-track gap to leader versus leader lap
g = ggplot(lapTimes) 
g = g + geom_point(aes(x=trackdiff,y=leadlap,col=(lap==leadlap)), pch=1)
g = g + geom_point(data=lapTimes[lapTimes['driverId']=='vettel',],
                  aes(x=trackdiff,y=leadlap), pch='+')
g = g + geom_text(data=lapTimes[lapTimes['lapsbehind']>0,],
                  aes(x=trackdiff,y=leadlap, label=lapsbehind),size=3)
g = g + geom_point(data=ll,aes(x=t, y=n), pch='x')
g = g + geom_text(data=ll,aes(x=t+3, y=n,label=c), size=2)
g = g + geom_vline(aes(xintercept=17), linetype=3)
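
As a crude first pass at those battle-spotting heuristics, here’s a sketch (using the columns constructed above; the one second threshold is an arbitrary guess at something workable) that counts how many laps each driver spends within a second of the car immediately ahead on track:

#On each lead lap, order the cars by on-track position
lapTimes = arrange(lapTimes, leadlap, trackdiff)
#Find each car's gap to the car immediately ahead on track
lapTimes = ddply(lapTimes, .(leadlap), transform,
                 gapahead = c(NA, diff(trackdiff)))
#Flag laps where a car is running within 1s of the car ahead
lapTimes$inbattle = !$gapahead) & lapTimes$gapahead < 1
#Count such laps per driver as a crude battle indicator
#(spotting runs of *consecutive* laps would be the next refinement)
ddply(lapTimes[lapTimes$inbattle,], .(driverId), summarise,
      battlelaps = length(leadlap))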

This chart will be included in a future update to the Wrangling F1 Data With R book. I hope to do a sprint on that book to tidy it up and get it into a reasonably edited state in the next few weeks. At that point, the text will probably be frozen, a print-on-demand version generated, and, if it ends up on Amazon, the minimum price hiked considerably.

Notebooks, knitr and the Language-Markdown View Source Option…

One of the foundational principles of the web, though I suspect ever fewer people know it, is that you can “View Source” on a web page to see what bits of HTML, Javascript and CSS are used to create it.

In the WordPress editor I’m currently writing in, I’m using a Text view that lets me write vanilla HTML; but there is also a WYSIWYG (what you see is what you get) view that shows how the interpreted HTML text will look when it is rendered in the browser as a web page.


Reflecting on IPython Markdown Opportunities in IPython Notebooks and Rstudio, it struck me that the Rmd (Rmarkdown) view used in RStudio, the HTML preview of “executed” Rmd documents generated from Rmd by knitr, and the interactive Jupyter (IPython, as was) notebook view can be seen as standing in this sort of relation to each other:


From that, it’s not too hard to imagine RStudio offering the following sort of RStudio/IPython notebook hybrid interface – with an Rmd “text” view, and with a notebook “visual” view (eg via an R notebook kernel):


And from both, we can generate the static HTML preview view.

In terms of underlying machinery, I guess we could have something like this:


I’m looking forward to it:-)

IPython Markdown Opportunities in IPython Notebooks and Rstudio

One of the reasons I started working on the Wrangling F1 Data With R book was to see what the Rmd (RMarkdown) workflow was like. Rmd allows you to combine markdown and R code in the same document, as well as executing the code blocks and then displaying the results of that code execution inline in the output document.


As well as rendering to HTML, we can generate markdown (md is actually produced as the interim step to HTML creation), PDF output documents, etc etc.

One thing I’d love to be able to do in the RStudio/RMarkdown environment is include – and execute – Python code. (Does a web search to see what Python support there is in R…) Ah, it seems knitr does it already… (how did I miss that?!)


ADDED: Unfortunately, it seems as if Python state is not persisted between separate python chunks – instead, each chunk is run as a one-off inline python command. However, it seems as if there could be a way round this, which is to use a persistent IPython session; and the knitron package looks like just the thing for supporting that.
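
For example (a minimal Rmd sketch), the lack of shared state means something like the following fails at the second chunk:

```{python}
a = 1
```

```{python}
print(a)
#NameError: name 'a' is not defined - each chunk runs in a fresh python process
```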

So that means in RStudio, I could use knitr and Rmd to write a version of Wrangling F1 Data With RPython

Of course, it would be nicer if I could write such a book in an everyday python environment – such as in an IPython notebook – that could also execute R code (just to be fair;-)

I know that we can already use cell magic to run R in an IPython notebook:
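
A minimal sketch of that, using the rpy2 IPython extension – load the extension in one code cell:

%load_ext rpy2.ipython

and then prefix any cell you want executed as R code with the %%R block magic:

%%R
x = c(10.4, 5.6, 3.1)
mean(x)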


…so that’s that part of the equation.

And the notebooks do already allow us to mix markdown cells and code blocks/output. The default notebook presentation style is to show the code cells with their In []: and Out []: block numbering, but it presumably only takes a small style extension or customisation to suppress that? And another small extension to add the ability to hide a code cell and just display the output?

So what is it that (to my mind at least) makes RStudio a nicer writing environment? One reason is the ability to write the Rmarkdown simply as Rmarkdown in a simple text editor environment. Another is the ability to inline R code and display its output in-place.

Taking that second point first – the ability to do better inlining in IPython notebooks – this looks like just what the python-markdown extension does:
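
As I understand it from the extension docs (so treat this as a sketch), you embed a Python expression between double braces in a markdown cell, and the rendered cell shows its current value:

The mean laptime was {{ df['laptime'].mean() }} seconds.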


But how about the ability to write some sort of pythonMarkdown and then open in a notebook? Something like ipymd, perhaps…?


What this seems to do is allow you to open an IPython-markdown document as an IPython notebook (in other words, it replaces the ipynb JSON document with an ipymd markdown document…). To support the document creation aspects better, we just need an exporter that removes the code block numbering and trivially allows code cells to be marked as hidden.

Now I wonder… what would it take to be able to open an Rmd document as an IPython notebook? Presumably just the ability to detect the code language, and then import the necessary magics to handle its execution? It’d be nice if it could cope with inline code, e.g. using the python-markdown magic too?

Exciting times could be ahead:-)

Keeping Track of an Evolving “Top N” Cutoff Threshold Value

In a previous post (Charts are for Reading), I noted how it was difficult to keep track of which times in an F1 qualifying session had made the cutoff time as a qualifying session evolved. The problem can be stated as follows: in the first session, with 20 drivers competing, the 15 drivers with the best ranked laptime will make it into the next session. Each driver can complete zero or more timed laps, with drivers completing laps in any order.

Finding the 15 drivers who will make the cutoff is therefore not simply a matter of ranking the best 15 laptimes at any point, because the same 5 drivers, say, may each record 3 fast laptimes, thus taking up the 15 slots that record the 15 fastest laptimes.

If we define a discrete time series with steps corresponding to each recorded laptime (from any driver), then at each time step we can find the best 15 drivers by finding each driver’s best laptime to date and ranking by those times. Conceptually, we need something like a lap chart which uses a ‘timed lap count’ rather than race lap index to keep track of the top 15 cars at any point.


At each index step, we can then find the laptime of the 15th ranked car to find the current session cutoff time.
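
In R, the ranking step at each index might be sketched as follows, using plyr (the laptimesSoFar dataframe and its code and stime column names are assumptions for the purposes of the sketch):

#Best laptime recorded to date in the session by each driver (sketch)
dfcc = ddply(laptimesSoFar, .(code), summarise, dbest = min(stime))
#Rank the drivers by their best laptime so far
dfcc = arrange(dfcc, dbest)
#The current cutoff time is the laptime of the 15th ranked driver (in Q1)
cutoff = dfcc[15, 'dbest']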

In a dataframe that records laptimes in a session by driver code for each driver, along with a column that contains the current purple laptime, we can arrange the laptimes by cumulative session laptime (so the order of rows follows the order in which laptimes are recorded) and then iterate through those rows one at a time. At each step, we can summarise the best laptime recorded so far in the session for each driver.

#Sketch: df holds one row per recorded laptime, ordered by cumulative session
#time (cuml), with code, stime and qsession columns (names assumed; plyr loaded)
dfc = data.frame()
for (r in 1:nrow(df)) {
  #summarise best laptime recorded so far in the session for each driver
  dfcc = ddply(df[1:r,], .(code), summarise, dbest = min(stime))
  #Keep track of which session we are in
  qsession = df[r, 'qsession']
  #Rank the best laptimes for each driver to date in the current session
  #(Really should filter by session at the start of this loop?)
  dfcc = arrange(dfcc, dbest)
  #The different sessions have different cutoffs: Q1, top 15; Q2, top 10
  n = ifelse(qsession == 1, 15, 10)  #sketch: assumes qsession coded 1/2/3
  #if we have at least as many driver best times recorded as the cutoff number
  if (nrow(dfcc) >= n){
    #Grab a record of the current cut-off time
    #along with info about each recorded laptime
    dfc = rbind(dfc, data.frame(df[r,]['qsession'], df[r,]['code'],
                                df[r,]['cuml'], dfcc[n,]['dbest']))
  }
}

We can then plot the evolution of the cut-off time as the sessions proceed. The chart in its current form is still a bit hard to parse, but it’s a start…


In the above sketch, the lines connect the current purple time and the current cut-off time in each session (aside from the horizontal line which represents the cut-off time at the end of the session). This gives a false impression of the evolution of the cutoff time – really, the line should be a stepped line that traces the current cut-off time horizontally until it is improved, at which point it should step vertically down. (In actual fact, the line does track horizontally as laptimes are recorded that do not change the cut-off time, as indicated by the horizontal tracks in the Q1 panel as the grey times (laptimes slower than the driver’s best time in the session so far) are completed.)

The driver labels are coloured according to: purple – current best in session time; green – driver best in session time to date (that wasn’t also purple); red – driver’s best time in session that was outside the final cut-off time. This colouring conflates two approaches to representing information – the purple/green colours represent online algorithmic processing (if we constructed the chart in real time from laptime data as laps were completed, that’s how we’d colour the points), whereas the red colouring represents the results of offline algorithmic processing (the colour can only be calculated at the end of the session, when we know the final session cutoff time). I think these mixed semantics contribute to making the chart difficult to read…

In terms of what sort of stories we might be able to pull from the chart, we see that in Q2, Hulkenberg and Sainz were only fractions of a second apart, and Perez looked like he had squeezed into the final top 10 slot until Sainz pushed him out. To make it easier to see which times contribute to the top 10 times, we could use font weight (eg bold font) to highlight each driver’s session-best laptimes.

To make the chart easier to read, I think each improvement to the cutoff time should be represented by a faint horizontal line, with a slightly darker line tracing the evolution of the cutoff time as a stepped line. This would allow us to see which times were within the cutoff time at any point.
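
In ggplot2 terms, the stepped trace is easy enough to sketch – something like the following, assuming a hypothetical cutoffImprovements dataframe holding the cumulative session time (cuml) and the new cutoff value (cutoff) at each improvement:

#Sketch: overplot a stepped trace of the evolving cutoff time
g = g + geom_step(data = cutoffImprovements,
                  aes(x = cuml, y = cutoff),
                  direction = 'hv', colour = 'grey40')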

I also wonder whether it might be interesting to generate a table a bit like the race lap chart, using session timed lap count rather than race lap count, perhaps with additional colour fields to show which car recorded the time that increased the lap count index, and perhaps also where in the order that time fell if it didn’t change the order in the session so far. We could also generate online and offline differences between each laptime in the session and the current cutoff time (online algorithm) as well as the final overall session cutoff time (offline algorithm).

[As and when I get this chart sorted, it will appear in an update to the Wrangling F1 Data With R lean book.]

Scraping Web Pages With R

One of the things I tend to avoid doing in R, partly because there are better tools elsewhere, is screenscraping. With the release of the new rvest package, I thought I’d have a go at what amounts to one of the simplest webscraping activities – grabbing HTML tables out of webpages.

The tables I had in my sights (when I can actually find them…) are the tables that appear on the newly designed FIA website that describe a range of timing results for F1 qualifying and races [quali example, race example].

Inspecting an example target web page, whilst a menu allows you to select several different results tables, a quick look at the underlying HTML source code reveals that all the tables relevant to the session (that is, a particular race, or a complete qualifying session) are described within a single page.

So how can we grab those tables down from a target page? The following recipe seems to do the trick:


library(rvest)

#Grab table number num from the HTML webpage we want to scrape
#(html() is the page reader in the rvest release this was written against;
#later versions use read_html())
fiaTableGrabber = function(url, num){
  #Grab the page
  hh = html(url)
  #Parse HTML
  cc = html_nodes(hh, xpath = "//table")[[num]] %>% html_table(fill=TRUE)
  #TO DO - extract table name
  #Set the column names
  colnames(cc) = cc[1, ]
  #Drop all NA column
  cc = Filter(function(x) !all(, cc[-1,])
  #Fill blanks with NA
  cc = apply(cc, 2, function(x) gsub("^$|^ $", NA, x))
  #would the dataframe cast handle the NA?
}

## Qualifying:

The fiaTableGrabber() grabs a particular table from a page with a particular URL. (I really should grab the page separately and then extract whatever table I want from the fetched page, or at least cache the page (unless there is a caching option built in?).)
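
So, for example, something along these lines (the URL is a placeholder – substitute the address of the FIA timing page you’re interested in, along with the index of the required table):

#Placeholder URL - use an actual FIA results page address
url = 'URL_OF_FIA_RESULTS_PAGE'
#Grab, say, the second table in the page
qx = fiaTableGrabber(url, 2)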

Depending on the table grabbed, we may then need to tidy it up. I hacked together a few sketch functions that tidy up (and remap) column names, convert “natural times” in minutes and seconds to seconds equivalent, and in the case of the race pits data, separate out two tables that get merged into one.
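
The natural time converter, for example, might be sketched as follows (timeInS() is the name assumed by the fragments below):

#Convert a natural time such as 1:32.345 (or 32.345) to seconds
timeInS = function(tStr){
  tStr = as.character(tStr)
  if (grepl(':', tStr)){
    parts = as.numeric(unlist(strsplit(tStr, ':')))
    60 * parts[1] + parts[2]
  } else as.numeric(tStr)
}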

  #Convert the natural laptimes for each qualifying session to seconds
  #(sketch: loop body assumed)
  for (q in c('Q1','Q2','Q3')){
    xx[q] = apply(xx[q], 1, timeInS)
  }
  #Remap the autogenerated lap count column names
  xx=dplyr:::rename(xx, Q1_laps=LAPS)
  xx=dplyr:::rename(xx, Q2_laps=LAPS.1)
  xx=dplyr:::rename(xx, Q3_laps=LAPS.2)

#2Q, 3R
  for (s in c('s1','s2','s3')) {
    #Convert the natural sector times to seconds (sketch: loop body assumed)
    xx[s] = apply(xx[s], 1, timeInS)
  }

#3Q, 4R

# 4Q, 5R

# 2R
  xx['time']=apply(xx['LAP TIME'],1,timeInS)

# 6R
  r=which(xx['NO']=='RACE - PIT STOP - DETAIL')
  xx['tot_time']=apply(xx['TOTAL TIME'],1,timeInS)
  Filter(function(x)!all(, xx[1:(r-1),])

  colnames(xx)=c('NO','DRIVER','LAP','TIME','STOP','NAT DURATION','TOTAL TIME')
  xx['tot_time']=apply(xx['TOTAL TIME'],1,timeInS)
  xx['duration']=apply(xx['NAT DURATION'],1,timeInS)
  r=which(xx['NO']=='RACE - PIT STOP - DETAIL')
  #Remove blank rows
  xx[rowSums( != ncol(xx),]

So for example:


I’m still not convinced that R is the most natural, efficient, elegant or expressive language for scraping with, though…

PS In passing, I note the release of the readxl Excel reading library (no external-to-R dependencies, compatible with various flavours of Excel spreadsheet).

PPS Looking at the above screenshot, it strikes me that if we look at the time of day of the stop and the duration, we can tell if there is a track position (at least) change in the pits… So for example, ROS goes in at 15:11:11 with a 33.689s stop and RIC goes in at 15:11:13 with a 26.714s one. So ROS enters the pits ahead of RIC (15:11:11 vs 15:11:13) but leaves after him (roughly 15:11:45 vs 15:11:40)? The following lap chart from f1fanatic perhaps reinforces this view?


Wrangling Complex Spreadsheet Column Headers

[This isn’t an R post, per se, but I’m syndicating it via RBloggers because I’m interested – how do you work with hierarchical column indices in R? Do you try to reshape the data to something tidier on the way in? Can you autodetect elements to help with any reshaping?]

Not a little p****d off by the Conservative election pledge to extend the right-to-buy to housing association tenants (my response: so extend the right to private tenants too?), I thought I’d have a dig around to see what data might be available to see what I could learn about the housing situation on the Isle of Wight, using a method that could also be used in other constituencies. That is, what datasets are provided at a national level, broken down to local level? (To start with, I wanted to see what I could learn without visiting the DCLG OpenDataCommunities site.)

One source of data seems to be the Local authority housing statistics data returns for 2013 to 2014, a multi-sheet spreadsheet reporting at a local authority level on:

– Dwelling Stock
– Local Authority Housing Disposals
– Allocations
– Lettings, Nominations and Mobility Schemes
– Vacants
– Condition of Dwelling Stock
– Stock Management
– Local authority Rents and Rent Arrears
– Affordable Housing Supply


Something I’ve been exploring lately is “external spreadsheet data source” wrappers for the pandas Python library that wrap frequently released spreadsheets with a simple (?!) interface that lets you pull the data from the spreadsheet into a pandas dataframe.

For example, I got started on the LA housing stats sheet as follows – first a look at the sheets, then a routine to grab sheet names out of the Menu sheet:

import pandas as pd
import re

#Load the spreadsheet (the filename is a placeholder for the downloaded file)
dfx = pd.ExcelFile('local_authority_housing_statistics_2013-14.xlsx')

#Menu sheet parse to identify sheets A-I
sd = re.compile(r'Section (\w) - (.*)$')
sheetDetails = {}
for row in dfx.parse('Menu')[[1]].values:
    if str(row[0]).startswith('Section'):
        m = sd.match(str(row[0]))
        sheetDetails[] =

sheetDetails
#{u'A': u'Dwelling Stock',
# u'B': u'Local Authority Housing Disposals',
# u'C': u'Allocations',
# u'D': u'Lettings, Nominations and Mobility Schemes',
# u'E': u'Vacants',
# u'F': u'Condition of Dwelling Stock',
# u'G': u'Stock Management',
# u'H': u'Local authority Rents and Rent Arrears',
# u'I': u'Affordable Housing Supply'}

All the data sheets have similar columns on the left-hand side, which we can use as a crib to identify the simple, single row of coded column headers.

def dfgrabber(dfx, sheet):
    #First pass - identify row for headers
    df = dfx.parse(sheet, header=None)
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0]  #will be an array
    #Second pass - generate dataframe, using the detected row as the header
    df = dfx.parse(sheet, header=row).dropna(how='all').dropna(how='all', axis=1)
    df = df[df['DCLG code'].notnull()].reset_index(drop=True)
    return df

#e.g. grab a sheet (assuming the sheets are named by their section letter)
#dfA = dfgrabber(dfx, 'A')


That gives something like the following:


This is completely usable if we know what the column codes refer to. What is handy is that a single header row is available for the columns, although the metadata that neatly describes the codes is not so tidily presented:


Trying to generate a pandas hierarchical index from this data is a bit messy…

One approach I’ve explored is trying to create a lookup table from the coded column names back into the hierarchical column names.

For example, if we can detect the column multi-index rows, we can fill down on the first row (for multicolumn labels, the label is in the leftmost cell), then fill across to fill the spanned cells of the index grid with the value that spans them.

#row is autodetected (as in dfgrabber) and contains the row for the simple header
#Get the header columns - and drop blank rows
#(sketch: assumes the hierarchical labels sit in the rows above the coded row)
xx = dfx.parse(sheet, header=None)[0:row].dropna(how='all')

#Fill down
xx.fillna(method='ffill', axis=0, inplace=True)
#Fill across
xx = xx.fillna(method='ffill', axis=1)

#append the coded header row
xx = xx.append(dfx.parse(sheet, header=None)[row:row+1])

#Now make use of pandas' ability to read in a multi-index CSV
xx.to_csv('multi_index.csv', header=False, index=False)
mxx = pd.read_csv('multi_index.csv', header=[0,1,2])  #three header rows assumed


Note that the pandas column multi-index can span several columns, but not “vertical” levels.

Get rid of the columns that don’t feature in the multi-index:

for c in mxx.columns.get_level_values(0).tolist():
    if c.startswith('Unnamed'):
        mxx = mxx.drop(c, level=0, axis=1)

Now start to work on the lookup…

#Get a dict from the multi-index
#(sketch: re-read the CSV using just the two hierarchical label rows as the
#header, leaving the coded headings as the first data row; dd then maps each
#column tuple onto its values, with the code string at dd[r][0])
mxx2 = pd.read_csv('multi_index.csv', header=[0,1])
dd = mxx2.to_dict()

We can then use this as a basis for generating a lookup table for the column codes.

keyx = {}
for r in dd:
    keyx[dd[r][0].split(' ')[0]] = r


We could also generate more elaborate dicts to provide ways of identifying particular codes.
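
For example (a sketch building on the keyx lookup above), we might map each code onto a readable label path through the hierarchy:

#Sketch: human readable label path for each code, e.g. for chart labelling
codeLabels = {code: ' - '.join(str(l) for l in keyx[code]) for code in keyx}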

Note that the key building required a little bit of tidying, arising from footnote numbers that appear in some of the coded column headings:

excel header footnote

This tidying should also be applied to the code column generation step above…

I’m thinking there really should be an easier way?

PS and then, of course, there are the additional gotchas… like UTF-8 pound signs that break ASCII encodings…


PPS Handy… informal guidance on Releasing data or statistics in spreadsheets, which acts as a counterpoint to GSS’ Releasing statistics in spreadsheets: Good practice guidance.