## Connecting RStudio and MySQL Docker Containers – an example using the ergast db

building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio:

• install these docker-mysql-scripts
• run boot2docker
• from the boot2docker shell, start up a MySQL server (ergastdb) with password f1: dmysql-server ergastdb f1 By default, this exposes port 3306
• create an new empty database (f1db): dmysql-create-database ergastdb f1db
• add the ergast data to it: dmysql-import-database ergastdb /path/to/ergastdb/f1db.sql --database f1db
• fire up a copy of RStudio, in this case using my psychemedia/wranglingf1data container, linking it to the MySQL database which has the alias db: docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data
• in RStudio, import the RMySQL library: library(RMySQL)
• in RStudio, connect to the database: con=dbConnect(MySQL(),user='root',password='f1',host='db',port=3306,dbname='f1db')
• in RStudio, run a test query: dbQuery(con,'SHOW TABLES');

I guess what I need to do now is pull the various bits into another script to make it a one-liner, perhaps with a few switches? For example, to create the database if it doesn’t exist, to download the ergast database file automatically, to populate the database for the first time, or update it with a more recent copy of the database, to fire up both containers and make sure they are appropriately linked etc. This would dramatically simplify things for use in the context of the Wrangling F1 Data With R book, for example. (If you beat me to it, please post details in the comments below.)

PS Hmm…. seems I get a UTF-8 encoding issue:

Not sure if this is with the database, or the RMySQL connector? Anyone got any ideas of a fix?

Ah ha – sort of via SO:

Running dbGetQuery(con,'SET NAMES utf8;') before querying seems to do the trick…

## Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

$C_t = \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}$

where $C_t$ is the churn in team standings for year $t$, $\left|f_{i,t} - f_{i,t-1}\right|$ is the absolute value of the $i$-th team’s change in finishing position going from season $t-1$ to season $t$, and $N$ is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, $C_t$, by the maximum churn, $C_max$. The value of the maximum churn depends on whether there is an even or odd number of competitors:

$C_{max} = N/2 \text{, for even N}$

$C_{max} = (N^2 - 1) / 2N \text{, for odd N}$

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, $f_{i,t}$ is the position of driver $i$ at the end of race $t$, $f_{i,t-1}$ is the starting position of driver $i$ at the beginning of that race (that is, race $t$) and $N$ is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn function of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

• comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
• churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
• churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid,
position, positionText, positionOrder
FROM results rs JOIN drivers d JOIN races r
ON rs.driverId=d.driverId AND rs.raceId=r.raceId
WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
churn = churn(delta),
)


For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

## Information Density and Custom Chart Designs

I’ve been doodling today with a some charts for the Wrangling F1 Data With R living book, trying to see how much information I can start trying to pack into a single chart.

The initial impetus came simply from thinking about a count of laps led in a particular race by each drive; this morphed into charting the number of laps in each position for each driver, and then onto a more comprehensive race summary chart (see More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API for an earlier graphical attempt at producing a race summary chart).

The chart shows:

grid position: identified using an empty grey square;
race position after the first lap: identified using an empty grey circle;
race position on each driver’s last lap: y-value (position) of corresponding pink circle;
points cutoff line: a faint grey dotted line to show which positions are inside – or out of – the points;
number of laps completed by each driver: size of pink circle;
total laps completed by driver: greyed annotation at the bottom of the chart;
whether a driver was classified or not: the total lap count is displayed using a bold font for classified drivers, and in italics for unclassified drivers;
finishing status of each driver: classification statuses other than *Finished* are also recorded at the bottom of the chart.

The chart also shows drivers who started the race but did not complete the first lap.

What the chart doesn’t show is what stage of the race the driver was in each position, and how long for. But I have an idea for another chart that could help there, as well as being able to reuse elements used in the chart shown here.

FWIW, the following fragment of R code shows the ggplot function used to create the chart. The data came from the ergast API, though it did require a bit of wrangling to get it into a shape that I could use to power the chart.

#Reorder the drivers according to a final ranked position
g=ggplot(finalPos,aes(x=reorder(driverRef,finalPos)))
#Highlight the points cutoff
g=g+geom_hline(yintercept=10.5,colour='lightgrey',linetype='dotted')
#Highlight the position each driver was in on their final lap
g=g+geom_point(aes(y=position,size=lap),colour='red',alpha=0.15)
#Highlight the grid position of each driver
g=g+geom_point(aes(y=grid),shape=0,size=7,alpha=0.2)
#Highlight the position of each driver at the end of the first lap
g=g+geom_point(aes(y=lap1pos),shape=1,size=7,alpha=0.2)
#Provide a count of how many laps each driver held each position for
g=g+geom_text(data=posCounts,
aes(x=driverRef,y=position,label=poscount,alpha=alpha(poscount)),
size=4)
#Number of laps completed by driver
g=g+geom_text(aes(x=driverRef,y=-1,label=lap,fontface=ifelse(is.na(classification), 'italic' , 'bold')),size=3,colour='grey')
#Record the status of each driver
g=g+geom_text(aes(x=driverRef,y=-2,label=ifelse(status!='Finished', status,'')),size=2,angle=30,colour='grey')
#Styling - tidy the chart by removing the transparency legend
g+theme_bw()+xRotn()+xlab(NULL)+ylab(&quot;Race Position&quot;)+guides(alpha=FALSE)


The fully worked code can be found in forthcoming update to the Wrangling F1 Data With R living book.

## F1 Championship Race, 2014 – Winning Combinations…

As we come up to the final two races of the 2014 Formula One season, the double points mechanism for the final race means that two drivers are still in with a shot at the Drivers’ Championship: Lewis Hamilton and Nico Rosberg.

As James Allen describes in Hamilton closes in on world title: maths favour him but Abu Dhabi threat remains:

Hamilton needs 51 points in the remaining races to be champion if Rosberg wins both races. Hamilton can afford to finish second in Brazil and at the double points finale in Abu Dhabi and still be champion. Mathematically he could also finish third in Brazil and second in the finale and take it on win countback, as Rosberg would have just six wins to Hamilton’s ten.
If Hamilton leads Rosberg home again in a 1-2 in Brazil, then he will go to Abu Dhabi needing to finish fifth or higher to be champion (echoes of Brazil 2008!!). If Rosberg does not finish in Brazil and Hamilton wins the race, then Rosberg would need to win Abu Dhabi with Hamilton not finishing; no other scenario would give Rosberg the title.

A couple of years ago, I developed an interactive R/shiny app for exploring finishing combinations of two drivers in the last two races of a season to see what situations led to what result: Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship.

I’ve updated the app (taking into account the matter of double points in the final race) so you can check out James Allen’s calculations with it (assuming I got my sums right too!). I tried to pop up an interactive version to Shinyapps, but the Shinyapps publication mechanism seems to be broken (for me at least) at the moment…:-(

In the meantime, if you have RStudio installed, you can run the application yourself. The code is avaliable and can be run from RStudio with: runGist("81380ff09ebe1cd67005")

When I get a chance, I’ll weave elements of this recipe into the Wrangling F1 Data With R book.

PS I’ve also started using the F1dataJunkie blog again as a place to post drafts and snippets of elements I’m working on for that book…

## Wrangling F1 Data With R – F1DataJunkie Book

Earlier this year I started trying to pull together some of my #f1datajunkie R-related ramblings together in a book form. The project stalled, but to try to reboot it I’ve started publishing it as a living book over on Leanpub. Several of the chapters are incomplete – with TO DO items sketched in, others are still unpublished. The beauty of the Leanpub model is that if you buy a copy, you continue to get access to all future updated versions of the book. (And my idea is that by getting the book out there as it is, I’ll feel as if there’s more (social) pressure on actually trying to keep up with it…)

I’ll be posting more details about how the Leanpub process works (for me at least) in the next week or two, but for now, here’s a link to the book: Wrangling F1 Data With R: A Data Junkie’s Guide.

• Foreword
• A Note on the Data Sources
• Introduction
• Preamble
• What are we trying to do with the data?
• Choosing the tools
• The Data Sources
• Getting the Data into RStudio
• Example F1 Stats Sites
• How to Use This Book
• The Rest of This Book…
• An Introduction to RStudio and R dataframes
• Getting Started with RStudio
• Getting Started with R
• Summary
• Getting the data from the Ergast Motor Racing Database API
• Accessing Data from the ergast API
• Summary
• Accessing SQLite from R
• Asking Questions of the ergast Data
• Summary
• Exercises and TO DO
• Data Scraped from the F1 Website
• Problems with the Formula One Data
• How to use the FormulaOne.com alongside the ergast data
• Reviewing the Practice Sessions
• The Weekend Starts Here
• Practice Session Data from the FIA
• Sector Times
• FIA Media Centre Timing Sheets
• A Quick Look at Qualifying
• Qualifying Session Position Summary Chart
• Another Look at the Session Tables
• Ultimate Lap Positions
• Lapcharts
• Annotated Lapcharts
• Race History Charts
• The Simple Laptime Chart
• Accumulated Laptimes
• The Lapalyzer Session Gap
• Eventually: The Race History Chart
• Pit Stop Analysis
• Pit Stop Data
• Total pit time per race
• Pit Stops Over Time
• Estimating pit loss time
• Tyre Change Data
• Career Trajectory
• The Effect of Age on Performance
• Statistical Models of Career Trajectories
• Summary
• Streakiness
• Spotting Runs
• Generating Streak Reports
• Streak Maps
• Team Streaks
• Time to N’th Win
• TO DO
• Summary
• Conclusion
• Appendix One – Scraping formula1.com Timing Data
• Appendix Two – FIA Timing Sheets
• Appendix – Converting the ergast Database to SQLite

If you think you deserve a free copy, let me know… ;-)

## F1 Timing Screen as a Spreadsheet?

One of the comment themes I’ve noticed around the first Challenge in the Tata F1 Connectivity Innovation Prize, a challenge to rethink what’s possible around the timing screen given only the data in the real time timing feed, is that the non-programmers don’t get to play. I don’t think that’s true – the challenge seems to be open to ideas as well as practical demonstrations, but it got me thinking about what technical ways in might be to non-programmers who wouldn’t know where to start when it came to working with the timing stream messages.

The answer is surely the timing screen itself… One of the issues I still haven’t fully resolved is a proven way of getting useful information events from the timing feed – it updates the timing screen on a cell by cell basis, so we have to finesse the way we associate new laptimes or sector times with a particular driver, bearing in mind cells update one at a time, in a potentially arbitrary order, and with potentially different timestamps.

So how about if we work with a “live information model” by creating a copy of an example timing screen in a spreadsheet. If we know how, we might be able to parse the real data stream to directly update the appropriate cells, but that’s largely by the by. At least we have something we can work work to start playing with the timing screen in terms of a literal reimagining of it. So what can we do if we put the data from an example timing screen into a spreadsheet?

If we create a new worksheet, we can reference the cells in the “original” timing sheet and pull values over. The timing feed updates cells on a cell by cell basis, but spreadsheets are really good at rippling through changes from one or more cells which are themselves reference by one or more others.

The first thing we might do is just transform the shape of the timing screen. For example, we can take the cells in a column relating to sector 1 times and put them into a row.

The second thing we might do is start to think about some sums. For example, we might find the difference between each of those sector times and (for practice and qualifying sessions at least) the best sector time recorded in that session.

The third thing we might do is to use a calculated value as the basis for a custom cell format that colours the cell according to the delta from the best session time.

Simple, but a start.

I’ve not really tried to push this idea very far – I’m not much of a spreadsheet jockey – but I’d be interested to know how folk who are might be able to push this idea…

If you need example data, there’s some on the F1 site – f1.com – results for Spanish Grand Prix, 2014 and more on Ergast Developer API.

PS FWIW, my entry to the competition is here: #f1datajunkie challenge 1 entry. It’s perhaps a little off-brief, but I’ve been meaning to do this sort of summary for some time, and this was a good starting point. If I get a chance, I’ll have a go a getting the parsers to work properly properly!

## Lazyweb Request – Node-RED & F1 Timing Data

A lazyweb request, because I’m rushing for a boat, going to be away from reliable network connections for getting on for a week, and would like to be able to play from a running start when I get back next week…

In context of the Tata/F1 timing data competition, I’d like to be able to have a play with the data in Node-RED. A feed-based, flow/pipes like environment, Node-RED’s been on my “should play with” list for some time, and this provides a good opportunity.

The data as provided looks something like:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

as a text file. (In the wild, it would be a real time data feed over http or https.)

What I’d like as a crib to work from is a Node-RED demo that has:

1) a file reader that reads the data in from the data file and plays it in as a stream in “real time” according to the timestamps, given a dummy start time;

2) an example of handling state – eg keeping track of drivernumber. (The row is effectively race position, Looking at column 2 (driverNumber), we can see what position a driver is in. Keep track of (row,driverNumber) pairs and if a driver changes position, flag it along with what the previous position was);

3) an example of appending the result to a flat file – for example, building up a list of statements “Driver number x has moved from position M to position N” over time.

Shouldn’t be that hard, right? And it would provide a good starting point for other people to be able to have a play without hassling over how to do the input/output bits?

## F1 Doing the Data Visualisation Competition Thing With Tata?

Sort of via @jottevanger, it seems that Tata Communications announces the first challenge in the F1® Connectivity Innovation Prize to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post Tata launches F1® Connectivity Innovation Prize dated “10 Jun 2014”? What’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Innovation Connectivity Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:

– Position
– Car number
– Driver’s name
– Fastest lap time
– Gap to the leader’s fastest lap time
– Sector 1 time for the current lap
– Sector 2 time for the current lap
– Sector 3 time for the current lap
– Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…

A technical manual is also provided for helping makes sense of the data files.

Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

...
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>
...

Then qualifying:

...
<transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction>
<transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction>
<transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction>
<transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction>
<transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction>
<transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction>
...

Then the race:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

We can parse the datafiles using python using an approach something like the following:

from lxml import etree

pl=[]
for xml in open(xml_doc, 'r'):
pl.append(etree.fromstring(xml))

pl[100].attrib
#{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'}

pl[100][0].attrib
#{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}


A few things are worth mentioning about this format… Firstly, the identifier is an identifier of the message type, rather then the message: each transaction message appears instead to be uniquely identified by the messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.

Secondly, within a session, several screen views are possible – essentially associated with data labelled with a particular identifier. This means the data feed is essentially powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.

It’s not obvious how we can make a row based transaction update, even, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above, for timestamp=”14:57:11.681″ and row=”3″. If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the *pandas* DataFrame .align() method.

[I think I’m thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now… Should have looked more closely the first time… Doh!]

Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect that they are likely to be so, which means it would be safer to align, or merge, on the basis of the timestamp and the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth that all changes to a row are likely to occur in, we could perhaps round times for the basis of the join?

Even with a bundling, we still don’t a have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point of time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:

#Hacky load and parse of each row in the datafile
pl=[]
for xml in open('data/F1 Practice.txt', 'r'):
pl.append(etree.fromstring(xml))

#Dataframe for current state timing screen
df_practice_pos=pd.DataFrame(columns=[
"timestamp", "time",
"classpos",  "classpos_colour",
"racingNumber","racingNumber_colour",
"name","name_colour",
],index=range(50))

#Column mappings
practiceMap={
'1':'classpos',
'2':'racingNumber',
'3':'name',
'4':'laptime',
'5':'gap',
'6':'sector1',
'7':'sector2',
'8':'sector3',
'9':'laps',
'21':'sector1_best',
'22':'sector2_best',
'23':'sector3_best'
}

def parse_practice(p,df_practice_pos):
if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib:
if p[0].attrib['column'] not in ['10','21','22','23']:
colname=practiceMap[p[0].attrib['column']]
row=int(p[0].attrib['row'])-1
df_practice_pos.ix[row]['timestamp']=p.attrib['timestamp']
tt=p.attrib['timestamp'].replace('.',':').split(':')
df_practice_pos.ix[row]['time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000)
df_practice_pos.ix[row][colname]=p[0].attrib['value']
df_practice_pos.ix[row][colname+'_colour']=p[0].attrib['colour']
return df_practice_pos

for p in pl[:2850]:
df_practice_pos=parse_practice(p,df_practice_pos)
df_practice_pos


Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:

A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out. I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-) PS The R book has stalled so badly I’ve pushed what I’ve got so far to wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… WHen I get a couple of clear hours, I’ll try to push the stuff that’s there out onto leanpub as a preview… ## Creating Data from Text – Regular Expressions in OpenRefine Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode). Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package. More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.) The VivaF1 site also publishes a visualisation summarising penalty outcomes incurred by each driver: The recreational data puzzle I set myself was this: how can we get the data contained in the descriptive sentences about the penalties into a data table that could be used to ask questions about different rule infractions, and the penalty outcomes applied, and allow for the ready generation of visualisations around that data? The tool I opted to use was OpenRefine; and the predominant technique for getting the data out of the sentences and in to data columns? Regular expressions. (If regular expressions are new to you, think: search and replace on steroids. There’s a great tutorial that introduces the basics here: Everday text patterns.) What follows is a worked example that shows how to get the “data” from the VivaF1 site into a form that looks more like this: Not every row is a tidy as it could be, but there is often a trade off in tidying data between trying to automate every step, and automating steps that clean the majority of the data, leaving some rows to tidy by hand… So where to start? The first step is getting the “data” into OpenRefine. To do this we can just select the text on the VivaF1 penatlies-by-race page, copy it an paste it in the Clipboard import area of a new project in OpenRefine: We can then import the data as line items, ignoring blank lines: The first step I’m going to take tidying up the data is to generate a column that contains the race name: The expression if(value.contains('Prix'),value,'') finds the rows that have the title of the race (they all include “Grand Prix” in their name) and creates a new column containing matches. (The expression reads as follows: if the original cell value contains ‘Prix’ , copy the cell value into the corresponding cell in the new column, else copy across an empty string/nothing that is, ”) We can then Fill Down on the race column to associate each row with particular race. We can also create another new column containing the description of each penalty notice with a quick tweak of the original expression: if(value.contains('Prix'),'',value). (If we match “Prix”, copy an empty string, else copy the penalty notice.) One of the things that we notice is there are some notices that “Overflow” on to multiple lines: We can filter using a regular expression that finds Penalty Notice lines that start (^) with a character that does not match a – ([^-]): Looking at the row number, we see serval of the rows are xsecutive – we can edit thesse cells to move all the text into a single cell: Cut and paste as required… Looking down the row number column (left hand most column) we see that rows 19, 57 and 174 are now the overflow lines. Remove the filter an in the whole listing, scroll to the appropriate part of the data table and cut the data out of the overflow cell and paste it into the line above. By chance, I also notice that using “Prix” to grab just race names was overly optimistic! Here’s how we could have checked – used the facet as text option on the race column… Phew – that was the only rogue! In line 56, cut the rogue text from the Race column and paste it into the penalty notice column. Then also paste in the additional content from the overflow lines. Remove any filters and fill down again on the Race column to fix the earlier error… The PEnalty Noptice column should now contain blank lines corresponding to rows that originally described the Grand Prix and overflow rows – facet the Penalty Notice column by text and highlight the blank rows so we can then delete them… So where are we now? We have a data file with one row per penalty and columns corresponding to the Grand Prix and the penalty notice. We can now start work on pulling data out of the penalty notice sentences. If you inspect the sentences, you will see they start with a dash, then have the driver name and the team in brackets. Let’s use a regular expression to grab that data: value.match(/- ([^$$]*)\s\(([^$$]*)\).*/).join('::') Here’s the actual regular expression: - ([^$$]*)\s\(([^$$]*)\).* It reads as follows: match a – followed by a space, then grab any characters that don’t contain an open bracket ([^$$]*) and that precede a space \s followed by an open bracket$$: all together ([^$$]*)\s\( That gives the driver name into the first matched pattern. Then grab the team – this is whatever appears before the first close bracket: ([^$$]*)\) Finally, match out all characters to the end of the string .* The two matches are then joined using :: We can then split these values to give driver and team columns: Learning from out previous error, we can use the text facet tool on the drive and team columns just to check the values are in order – it seems like there is one oops in the driver column, so we should probably edit that cell and remove the contents. We can also check the blank lines to see what’s happening there – in this case no driver is mentioned but a team is, but that hasn’t been grabbed into the team column, so we can edit it here: We can also check the text facet view of the team column to make sure there are no gotchas, and pick up/correct any that did slip through. So now we have a driver column and a team column too (it’s probably worth changing the column names to match…) Let’s look at the data again – what else can we pull out? How about the value of any fine? We notice that fine amounts seem to appear at the end of the sentence and be preceded by the word fined, so we gan grab data on that basis, then replace the euro symbol, strip out any commas, and cast the result to a number type: value.match(/.* fined (.*)/)[0].replace(/[€,]/,'').toNumber() We can check for other fines by filtering the the Penalty Notice column on the word fine (or the Euro symbol), applying a number facet to the Fine column and looking for blank rows in that column. Add in fine information by hand as required: So now we have a column that has the value of fines – which means if we export this data we could do plots that show fines per race, or fines per driver, or fines per team, or calculate the average size of fines, or the total number of fines, for example. What other data columns might we pull out? How about the session? Let’s look for phrases that identify free practice sessions or qualifying: Here’s the regular expression: value.match(/.*((FP[\d]+)|(Q[\d]+)|(qualifying)).*/i)[0].toUppercase() Note how we use the pipe symbol | to say ‘look for one pattern OR another’. We can cast everything to uppercase just to help normalise the values that appear. And once again, we can use the Text Facet to check that things worked as we expected: So that’s a column with the session the infringement occurred in (I think! We’d need to read all the descriptions to make absolutely sure!) What else? There’s another set of numbers appear in some of the notices – speeds. Let’s grab those into a new column – look for a space, followed by numbers or decimal points, and then a sapce and km/h, grabbing the numbers of interest and casting them to a number type: value.match(/.*\s([\d\.]+) km\/h.*/)[0].toNumber() So now we have a speed column. Which means we could start to look at speed vs fine scatterplots, perhaps, to see if there is a relationship. (Note, different pit lanes may have different speed limits.) What else? It may be worth trying to identify the outcome of each infringement investigation? value.match(/.*((fine)|(no further action)|([^\d]\d+.place grid.*)|(reprimand)|(drive.thr.*)|(drop of.*)|\s([^\s]+.second stop and go.*)|(start .*from .*)).*/i)[0].toUppercase() Here’s where we’re at now: If we do a text facet on the outcome column, we see there are several opportunities for clustering the data: We can try other cluster types too: If we look at the metaphone (soundalike) clusters: we notice a couple of other things – an opportunity to normalise 5 PLACE GRID DROP as DROP OF 5 GRID PLACES for example: value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF$1 GRID POSITIONS')

Or we might further standardise the outcome of that by fixing on GRID POSITIONS rather than GRID PLACES:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS') And we might further normalise on numbers rather than number words: value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF$1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS').replace('TWO','2').replace('THREE','3').replace('FIVE','5').replace('TEN','10')

Clustering again:

it might make sense to tidy out the (IN THIS CASE… statements:

value.replace(/ \(IN THIS.*/,'')

Depending on the questions we want to ask, it may be worth splitting out whether or not penalties like GRID DROPS are are this event of the next event, as well as generic penalty types (Drive through, stop and go, grid drop, etc)

Finally, let’s consider what sort of infringement has occurred:

If we create a new column from the Infraction column, we can then cluster items into the core infraction type:

After a bit of tidying, we can start to narrow down on a key set of facet levels:

Viewing down the list further there may be additional core infringements we might be able to pull out.

So here’s where we are now:

And here’s where we came from:

Having got the data into a data from, we can now start to ask questions of it (whilst possible using those conversations to retrun to the data ans tidy it more as we work with it). But that will have to be the subject of another post…

## F1Stats – Correlations Between Qualifying, Grid and Race Classification

Following directly on from F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification, and continuing in my attempt to replicate some of the methodology and results used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing, here’s a quick look at the correlation scores between the final practice, qualifying and grid positions and the final race classification.

I’ve already done brief review of what correlation means (sort of) in F1Stats – A Prequel to Getting Started With Rank Correlations, so I’m just going to dive straight in with some R code that shows how I set about trying to find the correlations between the different classifications:

Here’s the answer from the back of the book paper that we’re aiming for…

Here’s what I got:

> corrs.df[order(corrs.df$V1),] V1 p3pos.int qpos.int grid.int racepos.raw pval.grid pval.qpos pval.p3pos 2 AUSTRALIA 0.30075188 0.01503759 0.087218045 1 7.143421e-01 9.518408e-01 0.197072158 13 MALAYSIA 0.42706767 0.57293233 0.630075188 1 3.584362e-03 9.410805e-03 0.061725312 6 CHINA -0.26015038 0.57443609 0.514285714 1 2.183596e-02 9.193214e-03 0.266812583 3 BAHRAIN 0.13082707 0.73233083 0.739849624 1 2.900250e-04 3.601434e-04 0.581232598 16 SPAIN 0.25112782 0.80451128 0.804511278 1 2.179221e-05 2.179221e-05 0.284231482 14 MONACO 0.51578947 0.48120301 0.476691729 1 3.513870e-02 3.326706e-02 0.021403708 17 TURKEY 0.52330827 0.73082707 0.730827068 1 3.756531e-04 3.756531e-04 0.019344720 9 GREAT BRITAIN 0.65413534 0.83007519 0.830075188 1 8.921842e-07 8.921842e-07 0.002260234 8 GERMANY 0.32030075 0.46917293 0.452631579 1 4.657539e-02 3.844275e-02 0.168419054 10 HUNGARY 0.49649123 0.37017544 0.370175439 1 1.194050e-01 1.194050e-01 0.032293715 7 EUROPE 0.28120301 0.72030075 0.720300752 1 4.997719e-04 4.997719e-04 0.228898214 4 BELGIUM 0.06766917 0.62105263 0.621052632 1 4.222076e-03 4.222076e-03 0.777083014 11 ITALY 0.52932331 0.52481203 0.524812030 1 1.895282e-02 1.895282e-02 0.017815489 15 SINGAPORE 0.50526316 0.58796992 0.715789474 1 5.621214e-04 7.414170e-03 0.024579520 12 JAPAN 0.34912281 0.74561404 0.849122807 1 0.000000e+00 3.739715e-04 0.143204045 5 BRAZIL -0.51578947 -0.02105263 -0.007518797 1 9.771776e-01 9.316030e-01 0.021403708 1 ABU DHABI 0.42556391 0.66466165 0.628571429 1 3.684738e-03 1.824565e-03 0.062722332 The paper mistakenly reports the grid values as the qualifying positions, so if we look down the grid.int column that I use to contain the correlation values between the grid and final classifications, we see they broadly match the values quoted in the paper. I also calculated the p-values and they seem to be a little bit off, but of the right order. And here’s the R-code I used to get those results… The first chunk is just the loader, a refinement of the code I have used previously: require(RSQLite) require(reshape) #Data downloaded from my f1com scraper on scraperwiki f1 = dbConnect(drv="SQLite", dbname="f1com_megascraper.sqlite") getRacesData.full=function(year='2012'){ #Data query results.combined=dbGetQuery(f1, paste('SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==',year,' and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;',sep='')) #Data tidying results.combined=ddply(results.combined,.(race),mutate,racepos.raw=1:length(race)) for (i in c('racepos','grid','qpos','p3pos','driverNum')) results.combined[[paste(i,'.int',sep='')]]=as.integer( as.character(results.combined[[i]])) results.combined$race=reorder(results.combined$race,results.combined$raceNum)

results.combined
}

f1 = dbConnect(drv="SQLite", dbname="f1com_megascraper.sqlite")

results.combined=getRacesData.full(2009)
corrs.df[order(corrs.df$V1),] Here’s the actual correlation calculation – I use the cor function: #The cor() function returns data that looks like: # p3pos.int qpos.int grid.int racepos.raw #p3pos.int 1.0000000 0.31578947 0.28270677 0.30075188 #qpos.int 0.3157895 1.00000000 0.97744361 0.01503759 #grid.int 0.2827068 0.97744361 1.00000000 0.08721805 #racepos.raw 0.3007519 0.01503759 0.08721805 1.00000000 #Row/col 4 relates to the correlation with the race classification, so for now just return that corr.rank.race=function(results.combined,cmethod='spearman'){ ##Correlations corrs=NULL #Run through the races for (i in levels(factor(results.combined$race))){
results.classified = subset( results.combined,
race==i,
select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
#print(i)
#print( results.classified)
cp=cor(results.classified,method=cmethod,use="complete.obs")
#print(cp[4,])
corrs=rbind(corrs,c(i,cp[4,]))
}
corrs.df=as.data.frame(corrs)

signif=data.frame()
for (i in levels(factor(results.combined$race))){ results.classified = subset( results.combined, race==i, select=c('p3pos.int','qpos.int','grid.int','racepos.raw')) #p.value pval.grid=cor.test(results.classified$racepos.raw,results.classified$grid.int,method=cmethod,alternative = "two.sided")$p.value
pval.qpos=cor.test(results.classified$racepos.raw,results.classified$qpos.int,method=cmethod,alternative = "two.sided")$p.value pval.p3pos=cor.test(results.classified$racepos.raw,results.classified$p3pos.int,method=cmethod,alternative = "two.sided")$p.value

signif=rbind(signif,data.frame(race=i,pval.grid=pval.grid,pval.qpos=pval.qpos,pval.p3pos=pval.p3pos))
}

corrs.df$qpos.int=as.numeric(as.character(corrs.df$qpos.int))
corrs.df$grid.int=as.numeric(as.character(corrs.df$grid.int))
corrs.df$p3pos.int=as.numeric(as.character(corrs.df$p3pos.int))

corrs.df=merge(corrs.df,signif,by.y='race',by.x='V1')
corrs.df$V1=factor(corrs.df$V1,levels=levels(results.combined$race)) corrs.df } corrs.df=corr.rank.race(results.combined) It’s then trivial to plot the result: require(ggplot2) xRot=function(g,s=5,lab=NULL) g+theme(axis.text.x=element_text(angle=-90,size=s))+xlab(lab) g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int)) g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1) g=g+ggtitle('F1 2009 Correlation: grid and final classification') g Recalling that there are different types of rank correlation function, specifically “Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series”, I wondered whether it would make sense to look at correlations under this measure to see whether there were any obvious looking differences compared to Spearmans’s rho, that might prompt us to look at the actual grid/race classifications to see which score appears to be more meaningful. The easiest way to spot the difference is probably graphically: corrs.df2=corr.rank.race(results.combined,'kendall') corrs.df2[order(corrs.df2$V1),]

g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int),col='red',size=4)
g=g+geom_point(data=corrs.df2, aes(x=V1,y=grid.int),col='blue')
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g

corrs.df2[order(corrs.df2\$V1),]
V1   p3pos.int    qpos.int    grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.17894737 -0.01052632  0.04210526           1 8.226829e-01 9.744669e-01 0.288378196
13      MALAYSIA  0.26315789  0.41052632  0.46315789           1 3.782665e-03 1.110136e-02 0.112604127
6          CHINA -0.20000000  0.41052632  0.35789474           1 2.832863e-02 1.110136e-02 0.233266557
3        BAHRAIN  0.07368421  0.51578947  0.52631579           1 8.408301e-04 1.099522e-03 0.677108239
16         SPAIN  0.17894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.288378196
14        MONACO  0.38947368  0.35789474  0.35789474           1 2.832863e-02 2.832863e-02 0.016406081
17        TURKEY  0.37894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.019784403
9  GREAT BRITAIN  0.46315789  0.63157895  0.63157895           1 3.622261e-05 3.622261e-05 0.003782665
8        GERMANY  0.23157895  0.31578947  0.30526316           1 6.380788e-02 5.475355e-02 0.164976406
10       HUNGARY  0.36842105  0.36842105  0.36842105           1 2.860214e-02 2.860214e-02 0.028602137
7         EUROPE  0.21052632  0.62105263  0.62105263           1 5.176962e-05 5.176962e-05 0.208628398
4        BELGIUM  0.02105263  0.46315789  0.46315789           1 3.782665e-03 3.782665e-03 0.923502331
11         ITALY  0.35789474  0.36842105  0.36842105           1 2.373450e-02 2.373450e-02 0.028328627
15     SINGAPORE  0.35789474  0.45263158  0.55789474           1 3.589956e-04 4.748310e-03 0.028328627
12         JAPAN  0.26315789  0.57894737  0.69590643           1 6.491222e-06 3.109641e-04 0.124796908
5         BRAZIL -0.37894737 -0.05263158 -0.04210526           1 8.226829e-01 7.732195e-01 0.019784403
1      ABU DHABI  0.34736842  0.61052632  0.55789474           1 3.589956e-04 7.321900e-05 0.033643947

Hmm.. Kendall gives lower values for all races except Hungary – maybe put that on the “must look at Hungary compared to the other races” pile…;-)

One thing that did occur to me was that I have access to race data from other years, so it shouldn’t be too hard to see how the correlations play out over the years at different circuits (do grid/race correlations tend to be higher at some circuits, for example?).

testYears=function(years=2009:2012){
bd=NULL
for (year in years) {
d=getRacesData.full(year)
corrs.df=corr.rank.race(d)
bd=rbind(bd,cbind(year,corrs.df))
}
bd
}

a=testYears(2006:2012)
ggplot(a)+geom_point(aes(x=year,y=grid.int))+facet_wrap(~V1)+ylim(0,1)

g=ggplot(a)+geom_boxplot(aes(x=V1,y=grid.int))
g=xRot(g)
g


So Spain and Turkey look like they tend to the processional? Let’s see if a boxplot bears that out:

How predictable have the years been, year on year?

g=ggplot(a)+geom_point(aes(x=V1,y=grid.int))+facet_wrap(~year)+ylim(0,1)
g=xRot(g)
g

ggplot(a)+geom_boxplot(aes(x=factor(year),y=grid.int))

And as a boxplot:

From a betting point of view, (eg Getting Started with F1 Betting Data and The Basics of Betting as a Way of Keeping Score…) it possibly also makes sense to look at the correlation between the P3 times and the qualifying classification to see if there is a testable edge in the data when it comes to betting on quali?

I think I need to tweak my code slightly to make it easy to pull out correlations between specific columns, but that’ll have to wait for another day…