building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio:
- Download and unzip the f1db.sql.gz file to f1db.sql
- install these docker-mysql-scripts
- run boot2docker
- from the boot2docker shell, start up a MySQL server (ergastdb) with password f1: dmysql-server ergastdb f1 (by default, this exposes port 3306)
- create a new empty database (f1db): dmysql-create-database ergastdb f1db
- add the ergast data to it: dmysql-import-database ergastdb /path/to/ergastdb/f1db.sql --database f1db
- fire up a copy of RStudio, in this case using my psychemedia/wranglingf1data container, linking it to the MySQL database which has the alias db: docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data
- run boot2docker ip to find where RStudio is running (IPADDRESS) and in your browser go to: http://IPADDRESS:8788, logging in with username rstudio and password rstudio
- in RStudio, import the RMySQL library: library(RMySQL)
- in RStudio, connect to the database: con=dbConnect(MySQL(),user='root',password='f1',host='db',port=3306,dbname='f1db')
- in RStudio, run a test query: dbGetQuery(con,'SHOW TABLES')
I guess what I need to do now is pull the various bits into another script to make it a one-liner, perhaps with a few switches? For example, to create the database if it doesn’t exist, to download the ergast database file automatically, to populate the database for the first time, or update it with a more recent copy of the database, to fire up both containers and make sure they are appropriately linked etc. This would dramatically simplify things for use in the context of the Wrangling F1 Data With R book, for example. (If you beat me to it, please post details in the comments below.)
PS Hmm…. seems I get a UTF-8 encoding issue:
Not sure if this is with the database, or the RMySQL connector? Anyone got any ideas of a fix?
Ah ha – sort of via SO:
Running dbGetQuery(con,'SET NAMES utf8;') before querying seems to do the trick…
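Pulling the RStudio side of the recipe together, with the encoding fix applied up front before any real queries, the whole thing looks something like this (a sketch: the db host alias, root user and f1 password come from the container setup steps above):

```r
# Connect to the linked MySQL container from RStudio
# (host alias 'db' and password 'f1' as set up in the steps above)
library(RMySQL)

con = dbConnect(MySQL(),
                user = 'root', password = 'f1',
                host = 'db', port = 3306, dbname = 'f1db')

# Apply the UTF-8 encoding fix before running any queries
dbGetQuery(con, 'SET NAMES utf8;')

# A quick sanity check that the ergast tables are there
dbGetQuery(con, 'SHOW TABLES')
```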
One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.
In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.
In a competitive league, churn is defined as:

C_t = ( Σ_{i=1}^{N} | f_{i,t} − f_{i,t−1} | ) / N

where C_t is the churn in team standings for year t, | f_{i,t} − f_{i,t−1} | is the absolute value of the i-th team's change in finishing position going from season t−1 to season t, and N is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, C_t, by the maximum churn, C_max. The value of the maximum churn depends on whether there is an even or odd number of competitors:

C_max = N/2 (for an even number of competitors, N)
C_max = (N² − 1)/(2N) (for an odd number of competitors, N)
Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, f_{i,t} is the position of driver i at the end of race t, f_{i,t−1} is the starting position of driver i at the beginning of that same race (that is, the start of the race stands in for "race" t−1) and N is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.
Using these models, I created churn functions of the form:
is.even = function(x) x %% 2 == 0

churnmax = function(N)
  if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn = function(d) sum(d)/length(d)

adjchurn = function(d) churn(d)/churnmax(length(d))
and then used it to explore churn in a variety of contexts:
- comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
- churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
- churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)
For example, in the first case, we can process data from the ergast database as follows:
library(DBI)

ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Get each driver's grid and finishing positions for every 2013 race
q = paste('SELECT round, name, driverRef, code, grid,
                  position, positionText, positionOrder
           FROM results rs JOIN drivers d JOIN races r
           ON rs.driverId=d.driverId AND rs.raceId=r.raceId
           WHERE r.year=2013', sep='')
results = dbGetQuery(ergastdb, q)

library(plyr)

#The position change is the absolute difference between
#the grid position and the final classification
results['delta'] = abs(results['grid'] - results['positionOrder'])

#Calculate the churn and adjusted churn for each race
churn.df = ddply(results[, c('round','name','delta')], .(round,name), summarise,
                 churn = churn(delta),
                 adjchurn = adjchurn(delta))
For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.
I’ve been doodling today with some charts for the Wrangling F1 Data With R living book, trying to see how much information I can pack into a single chart.
The initial impetus came simply from thinking about a count of laps led in a particular race by each driver; this morphed into charting the number of laps in each position for each driver, and then into a more comprehensive race summary chart (see More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API for an earlier graphical attempt at producing a race summary chart).
The chart shows:
– grid position: identified using an empty grey square;
– race position after the first lap: identified using an empty grey circle;
– race position on each driver’s last lap: y-value (position) of corresponding pink circle;
– points cutoff line: a faint grey dotted line to show which positions are inside – or out of – the points;
– number of laps completed by each driver: size of pink circle;
– total laps completed by driver: greyed annotation at the bottom of the chart;
– whether a driver was classified or not: the total lap count is displayed using a bold font for classified drivers, and in italics for unclassified drivers;
– finishing status of each driver: classification statuses other than *Finished* are also recorded at the bottom of the chart.
The chart also shows drivers who started the race but did not complete the first lap.
What the chart doesn’t show is what stage of the race the driver was in each position, and how long for. But I have an idea for another chart that could help there, as well as being able to reuse elements used in the chart shown here.
FWIW, the following fragment of R code shows the ggplot function used to create the chart. The data came from the ergast API, though it did require a bit of wrangling to get it into a shape that I could use to power the chart.
#Reorder the drivers according to a final ranked position
g = ggplot(finalPos, aes(x=reorder(driverRef, finalPos)))

#Highlight the points cutoff
g = g + geom_hline(yintercept=10.5, colour='lightgrey', linetype='dotted')

#Highlight the position each driver was in on their final lap
g = g + geom_point(aes(y=position, size=lap), colour='red', alpha=0.15)

#Highlight the grid position of each driver
g = g + geom_point(aes(y=grid), shape=0, size=7, alpha=0.2)

#Highlight the position of each driver at the end of the first lap
g = g + geom_point(aes(y=lap1pos), shape=1, size=7, alpha=0.2)

#Provide a count of how many laps each driver held each position for
g = g + geom_text(data=posCounts,
                  aes(x=driverRef, y=position, label=poscount, alpha=alpha(poscount)),
                  size=4)

#Number of laps completed by driver
g = g + geom_text(aes(x=driverRef, y=-1, label=lap,
                      fontface=ifelse(is.na(classification), 'italic', 'bold')),
                  size=3, colour='grey')

#Record the status of each driver
g = g + geom_text(aes(x=driverRef, y=-2,
                      label=ifelse(status!='Finished', status, '')),
                  size=2, angle=30, colour='grey')

#Styling - tidy the chart by removing the transparency legend
g + theme_bw() + xRotn() + xlab(NULL) + ylab("Race Position") + guides(alpha=FALSE)
The fully worked code can be found in forthcoming update to the Wrangling F1 Data With R living book.
As we come up to the final two races of the 2014 Formula One season, the double points mechanism for the final race means that two drivers are still in with a shot at the Drivers’ Championship: Lewis Hamilton and Nico Rosberg.
As James Allen describes in Hamilton closes in on world title: maths favour him but Abu Dhabi threat remains:
Hamilton needs 51 points in the remaining races to be champion if Rosberg wins both races. Hamilton can afford to finish second in Brazil and at the double points finale in Abu Dhabi and still be champion. Mathematically he could also finish third in Brazil and second in the finale and take it on win countback, as Rosberg would have just six wins to Hamilton’s ten.
If Hamilton leads Rosberg home again in a 1-2 in Brazil, then he will go to Abu Dhabi needing to finish fifth or higher to be champion (echoes of Brazil 2008!!). If Rosberg does not finish in Brazil and Hamilton wins the race, then Rosberg would need to win Abu Dhabi with Hamilton not finishing; no other scenario would give Rosberg the title.
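We can sanity check James Allen's arithmetic with a simple points model. The sketch below assumes the 2014 points schedule (25, 18, 15, 12, 10, 8, 6, 4, 2, 1), double points in the final round, and a 24 point lead for Hamilton going into Brazil; the hamChampion function name and the lead value are my own assumptions for illustration, not taken from the quoted piece:

```r
# 2014 points schedule for positions 1..10; 0 otherwise
points = c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)

# Points scored for a given finishing position; NA means a non-finish
# (multiplier = 2 models the double points finale)
score = function(pos, multiplier = 1) {
  if (is.na(pos) || pos > 10) return(0)
  multiplier * points[pos]
}

# Hamilton's assumed lead over Rosberg going into Brazil
lead = 24

# Does Hamilton take the title for a given pair of results in Brazil
# (single points) and Abu Dhabi (double points)?
# A points tie goes to Hamilton on win countback (10 wins v at most 8)
hamChampion = function(hamBrazil, hamAbuDhabi, rosBrazil, rosAbuDhabi) {
  hamPts = score(hamBrazil) + score(hamAbuDhabi, 2)
  rosPts = score(rosBrazil) + score(rosAbuDhabi, 2)
  (lead + hamPts) >= rosPts
}

# Rosberg wins both; Hamilton second in both: Hamilton champion
hamChampion(2, 2, 1, 1)
# Rosberg wins both; Hamilton third then second: tie, Hamilton on countback
hamChampion(3, 2, 1, 1)
# Hamilton wins Brazil with Rosberg out: Rosberg needs to win Abu Dhabi
# with Hamilton not finishing
hamChampion(NA, NA, NA, 1)
```

The same function also confirms the "fifth or higher after a Brazil 1-2" scenario: hamChampion(1, 5, 2, 1) comes out in Hamilton's favour while hamChampion(1, 6, 2, 1) does not.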
A couple of years ago, I developed an interactive R/shiny app for exploring finishing combinations of two drivers in the last two races of a season to see what situations led to what result: Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship.
I’ve updated the app (taking into account the matter of double points in the final race) so you can check out James Allen’s calculations with it (assuming I got my sums right too!). I tried to pop up an interactive version to Shinyapps, but the Shinyapps publication mechanism seems to be broken (for me at least) at the moment…:-(
When I get a chance, I’ll weave elements of this recipe into the Wrangling F1 Data With R book.
PS I’ve also started using the F1dataJunkie blog again as a place to post drafts and snippets of elements I’m working on for that book…
Earlier this year I started trying to pull together some of my #f1datajunkie R-related ramblings together in a book form. The project stalled, but to try to reboot it I’ve started publishing it as a living book over on Leanpub. Several of the chapters are incomplete – with TO DO items sketched in, others are still unpublished. The beauty of the Leanpub model is that if you buy a copy, you continue to get access to all future updated versions of the book. (And my idea is that by getting the book out there as it is, I’ll feel as if there’s more (social) pressure on actually trying to keep up with it…)
I’ll be posting more details about how the Leanpub process works (for me at least) in the next week or two, but for now, here’s a link to the book: Wrangling F1 Data With R: A Data Junkie’s Guide.
Here’s the table of contents so far:
- A Note on the Data Sources
- What are we trying to do with the data?
- Choosing the tools
- The Data Sources
- Getting the Data into RStudio
- Example F1 Stats Sites
- How to Use This Book
- The Rest of This Book…
- An Introduction to RStudio and R dataframes
- Getting Started with RStudio
- Getting Started with R
- Getting the data from the Ergast Motor Racing Database API
- Accessing Data from the ergast API
- Getting the data from the Ergast Motor Racing Database Download
- Accessing SQLite from R
- Asking Questions of the ergast Data
- Exercises and TO DO
- Data Scraped from the F1 Website
- Problems with the Formula One Data
- How to use the FormulaOne.com data alongside the ergast data
- Reviewing the Practice Sessions
- The Weekend Starts Here
- Practice Session Data from the FIA
- Sector Times
- FIA Media Centre Timing Sheets
- A Quick Look at Qualifying
- Qualifying Session Position Summary Chart
- Another Look at the Session Tables
- Ultimate Lap Positions
- Annotated Lapcharts
- Race History Charts
- The Simple Laptime Chart
- Accumulated Laptimes
- Gap to Leader Charts
- The Lapalyzer Session Gap
- Eventually: The Race History Chart
- Pit Stop Analysis
- Pit Stop Data
- Total pit time per race
- Pit Stops Over Time
- Estimating pit loss time
- Tyre Change Data
- Career Trajectory
- The Effect of Age on Performance
- Statistical Models of Career Trajectories
- The Age-Productivity Gradient
- Spotting Runs
- Generating Streak Reports
- Streak Maps
- Team Streaks
- Time to N’th Win
- TO DO
- Appendix One – Scraping formula1.com Timing Data
- Appendix Two – FIA Timing Sheets
- Downloading the FIA timing sheets for a particular race
- Appendix – Converting the ergast Database to SQLite
If you think you deserve a free copy, let me know… ;-)
One of the comment themes I’ve noticed around the first Challenge in the Tata F1 Connectivity Innovation Prize – a challenge to rethink what’s possible around the timing screen given only the data in the real time timing feed – is that the non-programmers don’t get to play. I don’t think that’s true – the challenge seems to be open to ideas as well as practical demonstrations – but it got me thinking about what the technical ways in might be for non-programmers who wouldn’t know where to start when it came to working with the timing stream messages.
The answer is surely the timing screen itself… One of the issues I still haven’t fully resolved is a proven way of getting useful information events from the timing feed – it updates the timing screen on a cell by cell basis, so we have to finesse the way we associate new laptimes or sector times with a particular driver, bearing in mind cells update one at a time, in a potentially arbitrary order, and with potentially different timestamps.
So how about if we work with a “live information model” by creating a copy of an example timing screen in a spreadsheet? If we know how, we might be able to parse the real data stream to directly update the appropriate cells, but that’s largely by the by. At least we have something we can work with to start playing with the timing screen in terms of a literal reimagining of it. So what can we do if we put the data from an example timing screen into a spreadsheet?
If we create a new worksheet, we can reference the cells in the “original” timing sheet and pull values over. The timing feed updates cells on a cell by cell basis, but spreadsheets are really good at rippling changes through from one or more cells to any other cells that reference them.
The first thing we might do is just transform the shape of the timing screen. For example, we can take the cells in a column relating to sector 1 times and put them into a row.
The second thing we might do is start to think about some sums. For example, we might find the difference between each of those sector times and (for practice and qualifying sessions at least) the best sector time recorded in that session.
The third thing we might do is to use a calculated value as the basis for a custom cell format that colours the cell according to the delta from the best session time.
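For anyone who does want to see the sums written down, the three steps above are easy enough to sketch in R too. The sector times below are made-up illustrative values, not real session data:

```r
# Illustrative sector 1 times (seconds) for a handful of drivers -
# dummy values, not real session data
s1times = c(HAM = 31.9, ROS = 31.6, BOT = 32.4, MAS = 32.1)

# "First thing": reshape - a column of sector times becomes a row
s1row = t(as.matrix(s1times))

# "Second thing": delta from the best sector time in the session
s1delta = s1times - min(s1times)

# "Third thing": the sort of banding a custom cell format might
# colour by - best time purple, close green, otherwise yellow
s1band = cut(s1delta, breaks = c(-Inf, 0, 0.5, Inf),
             labels = c('purple', 'green', 'yellow'))
```

The banding thresholds are arbitrary here; in a spreadsheet they would be whatever the custom cell format rules say.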
Simple, but a start.
I’ve not really tried to push this idea very far – I’m not much of a spreadsheet jockey – but I’d be interested to know how folk who are might be able to push this idea…
PS FWIW, my entry to the competition is here: #f1datajunkie challenge 1 entry. It’s perhaps a little off-brief, but I’ve been meaning to do this sort of summary for some time, and this was a good starting point. If I get a chance, I’ll have a go at getting the parsers to work properly!
A lazyweb request, because I’m rushing for a boat, going to be away from reliable network connections for getting on for a week, and would like to be able to play from a running start when I get back next week…
In context of the Tata/F1 timing data competition, I’d like to be able to have a play with the data in Node-RED. A feed-based, flow/pipes like environment, Node-RED’s been on my “should play with” list for some time, and this provides a good opportunity.
The data as provided looks something like:
...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>
...
as a text file. (In the wild, it would be a real time data feed over http or https.)
What I’d like as a crib to work from is a Node-RED demo that has:
1) a file reader that reads the data in from the data file and plays it in as a stream in “real time” according to the timestamps, given a dummy start time;
2) an example of handling state – eg keeping track of driver number. (The row is effectively race position; looking at column 2 (driverNumber), we can see what position a driver is in. Keep track of (row, driverNumber) pairs and if a driver changes position, flag it along with what the previous position was);
3) an example of appending the result to a flat file – for example, building up a list of statements “Driver number x has moved from position M to position N” over time.
Shouldn’t be that hard, right? And it would provide a good starting point for other people to be able to have a play without hassling over how to do the input/output bits?
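While the ask above is for Node-RED, the state-handling part (2) is easy enough to prototype in R as a crib for the logic. The sketch below uses a quick regex-based parse rather than a proper XML parser, and the two example messages are made up along the lines of the sample feed above (driver 77 moving from row 3 to row 2):

```r
# Two made-up transaction messages in the style of the sample feed:
# driver 77 first appears in row 3, then later in row 2
msgs = c(
 '<transaction identifier="101" messagecount="1" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>',
 '<transaction identifier="101" messagecount="2" timestamp="14:58:02.100"><data column="2" row="2" colour="WHITE" value="77"/></transaction>'
)

# State: last known row (race position) for each driver number
positions = list()
events = c()

for (msg in msgs) {
  # Quick-and-dirty attribute extraction; a real parser would do better
  col = sub('.*column="([0-9]+)".*', '\\1', msg)
  # Column 2 carries the driver number; the row is the race position
  if (col == '2') {
    driver = sub('.*value="([^"]+)".*', '\\1', msg)
    row = as.integer(sub('.*row="([0-9]+)".*', '\\1', msg))
    prev = positions[[driver]]
    if (!is.null(prev) && prev != row)
      events = c(events, paste0('Driver number ', driver,
                                ' has moved from position ', prev,
                                ' to position ', row))
    positions[[driver]] = row
  }
}

events
```

Writing the accumulated events list out to a flat file (part 3) is then just a writeLines call; the “real time” replay driven by the message timestamps (part 1) is the bit that really wants a flow environment like Node-RED.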