With the new F1 season upon us, I’ve started tinkering with bits of code from the Wrangling F1 Data With R book and looking at the data in some new ways.
For example, I started wondering whether we might be able to learn something interesting about the race strategies by looking at laptimes on a stint by stint basis.
To begin with, we need some data – I’m going to grab it directly from the ergast API using some functions that are bundled in with the Leanpub book…
#ergast functions described in: https://leanpub.com/wranglingf1datawithr/
#ddply() and arrange() come from the plyr package
library(plyr)
#Get laptime data from the ergast API
l2=lapsData.df(2016,2)
#Get pits data from the ergast API
p2=pitsData.df(2016,2)
#merge pit data into the laptime data
l3=merge(l2,p2[,c('driverId','lap','rawduration')],by=c('driverId','lap'),all=T)
#generate an inlap flag (inlap is the lap assigned the pit time)
l3['inlap']=!is.na(l3['rawduration'])
#generate an outlap flag (outlap is the first lap of the race or laps starting from the pits)
l3=ddply(l3,.(driverId),transform,outlap=c(T,!is.na(head(rawduration,-1))))
#use the pitstop flag to number stints; note: a drive through penalty increments the stint count
l3=arrange(l3,driverId, -lap)
l3=ddply(l3,.(driverId),transform,stint=1+sum(inlap)-cumsum(inlap))
#number the laps in each stint
l3=arrange(l3,driverId, lap)
l3=ddply(l3,.(driverId,stint),transform,lapInStint=1:length(stint))
l3=arrange(l3,driverId, lap)
The in- and out-lap times either side of a pit stop add noise to the full lap times completed within each stint, so let’s use the inlap and outlap flags to filter those laps out:
#Discount the inlap and outlap
l4=l3[!l3$outlap & !l3$inlap,]
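To check what the flags and the stint numbering are doing, here is a minimal sketch on made-up data for a single hypothetical driver (driverA, with pit stops invented on laps 3 and 6). It is not the code used for the charts: it uses base R's ave() rather than ddply(), and it numbers the stints with a forward cumulative sum rather than the reverse-sorted one, which should give the same numbering. The column names follow the ones used in the real data.

```r
# Toy data: one hypothetical driver, 8 laps, pit stops recorded on laps 3 and 6
laps <- data.frame(driverId = "driverA", lap = 1:8,
                   rawduration = c(NA, NA, 22.1, NA, NA, 23.4, NA, NA))

# inlap: the lap that carries a pit stop duration
laps$inlap <- !is.na(laps$rawduration)

# outlap: the first lap of the race, or the lap immediately after an inlap
laps$outlap <- ave(laps$inlap, laps$driverId,
                   FUN = function(x) c(TRUE, head(x, -1)))

# stint: one more than the number of pit stops completed before this lap
# (the inlap itself still belongs to the stint it ends)
laps$stint <- ave(as.integer(laps$inlap), laps$driverId,
                  FUN = function(x) 1L + cumsum(x) - x)

# lapInStint: running lap count within each driver/stint group
laps$lapInStint <- ave(laps$lap, laps$driverId, laps$stint, FUN = seq_along)

# Keep only the full racing laps
clean <- laps[!laps$inlap & !laps$outlap, ]
print(laps)
print(clean)
```

With this toy data the three stints cover laps 1-3, 4-6 and 7-8, and only laps 2, 5 and 8 survive the filter.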
We can now look at the data… I’m going to facet by driver, and also group the laptimes associated with each stint. Then we can plot just the raw laptimes, and also a simple linear model based on the full lap times within each stint:
library(ggplot2)
#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime, col=factor(stint)))+facet_wrap(~driverId)
#Chart the raw laptimes within each stint
g+geom_line()
#Plot a simple linear model for each stint
g+ geom_smooth(method = "lm", formula = y ~ x)
So for example, here are the raw laptimes, excluding inlap and outlap, by stint for each driver in the recent 2016 Bahrain Formula One Grand Prix:
And here’s the simple linear model:
These charts highlight several things:
- trivially, the number and length of the stints completed by each driver;
- degradation effects, in terms of the gradient of each stint trace;
- fuel effects: the y-axis offset for each stint is the sum of the fuel effect (as the race progresses the cars get lighter and laptime goes down more or less linearly) and a basic tyre effect (the “base” time we might expect from a tyre). Based on the total number of laps completed in stints prior to a particular stint, we can calculate a fuel effect offset for the laptimes in each stint which should serve to normalise the y-axis laptimes and make more evident the base tyre laptime;
- looking at the charts as a whole, we get a feel for strategy – what sort of tyre/stint strategy do the cars start the race with, for example; are the cars going long on tyres without much degradation, or pushing for various length stints on tyres that lose significant time each lap? And so on… (What can you read into/from the charts? Let me know in the comments below;-)
If we assume a 0.083s per lap fuel weight penalty effect, we can replot the chart to account for this:
#Generate a base plot
g=ggplot(l4,aes(x=lapInStint, y=rawtime+0.083*lap, col=factor(stint)))
g+facet_wrap(~driverId) +geom_line()
Here’s what we get:
And here’s what the fuel corrected models look like:
UPDATE: the above fuel calculation goes the wrong way – oops! It should be:
MAXLAPS=max(l4['lap'])
FUEL_PENALTY =0.083
.e = environment()
g=ggplot(l4,aes(x=lapInStint,y=rawtime-(MAXLAPS-lap)*FUEL_PENALTY,col=factor(stint)),environment=.e)
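As a sanity check on the direction of that correction, here is a small sketch with invented numbers: a hypothetical 10 lap race and a car whose pace on an empty tank is a constant 95s. If the raw laptime carries a penalty for every lap of fuel still on board, then subtracting (MAXLAPS-lap)*FUEL_PENALTY should flatten the trace back to the empty-tank pace. The race length and base pace are assumptions made purely for illustration.

```r
# Toy check of the fuel correction (all values invented for illustration)
MAXLAPS <- 10          # hypothetical race length
FUEL_PENALTY <- 0.083  # assumed seconds per lap of fuel carried
base_pace <- 95        # hypothetical constant laptime on an empty tank

toy <- data.frame(lap = 1:MAXLAPS)

# Raw laptimes: slower early on, when more laps of fuel are still on board
toy$rawtime <- base_pace + (MAXLAPS - toy$lap) * FUEL_PENALTY

# Corrected laptimes: remove the time cost of the fuel still to be burned
toy$corrected <- toy$rawtime - (MAXLAPS - toy$lap) * FUEL_PENALTY

print(toy)
```

The corrected column comes out flat, whereas adding 0.083*lap (the first attempt) would tilt the trace upwards by twice the fuel effect.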
What we really need to do now is annotate the charts with additional tyre selection information for each stint.
We can also do a few more sums. For example, generate a simple average laptime per stint, excluding inlap and outlap times:
#Calculate some stint summary data
l5=ddply(l4,.(driverId,stint), summarise,
         stintav=sum(rawtime)/length(rawtime),
         stintsum=sum(rawtime),
         stinlen=length(rawtime))
which gives results of the form:
  driverId stint   stintav stintsum stinlen
1  rosberg     1  98.04445 1078.489      11
2  rosberg     2  97.55133 1463.270      15
3  rosberg     3  96.15543  673.088       7
4  rosberg     4  96.32494 1637.524      17
5    massa     1  99.13600  495.680       5
6    massa     2 100.48300 2009.660      20
7    massa     3  98.77862 2568.244      26
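The same sort of per-stint summary could also carry the gradient of the simple linear model for each stint, as a rough degradation figure in seconds per lap. The sketch below is one possible way of doing that in base R, run on invented laptimes for a hypothetical driverA rather than on the race data; the degradation column name is my own.

```r
# Toy clean laptimes for one hypothetical driver over two stints (values invented)
stints <- data.frame(
  driverId   = "driverA",
  stint      = rep(1:2, each = 5),
  lapInStint = rep(1:5, times = 2),
  rawtime    = c(98.0, 98.1, 98.2, 98.3, 98.4,
                 97.0, 97.2, 97.4, 97.6, 97.8)
)

# Fit rawtime ~ lapInStint separately for each driver/stint group
# and keep the slope as a seconds-per-lap degradation estimate
groups <- split(stints, list(stints$driverId, stints$stint), drop = TRUE)
slopes <- do.call(rbind, lapply(groups, function(d) {
  fit <- lm(rawtime ~ lapInStint, data = d)
  data.frame(driverId    = d$driverId[1],
             stint       = d$stint[1],
             degradation = unname(coef(fit)[2]))
}))

print(slopes)
```

On the toy data the two stints lose 0.1s and 0.2s per lap respectively. On real data the slope mixes tyre degradation with the fuel effect unless the laptimes are fuel corrected first.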
It would possibly be useful to also compare inlaps and outlaps somehow, as well as factoring in the pitstop time. I’m pondering a couple of possibilities for the latter:
- amortise the pitstop time over the laps leading up to a pitstop by adding a pitstop lap penalty to each lap in that stint, calculated as the pitstop time divided by the number of laps in the stint leading up to the pitstop; this essentially penalises the stint that leads up to the pitstop as a consequence of forcing the pitstop;
- amortise the pitstop time over the laps immediately following a pitstop by adding a pitstop lap penalty to each lap in that stint, calculated as the pitstop time divided by the number of laps in the stint following the pitstop; this essentially penalises the stint that immediately follows the pitstop, and discounts some of the benefit from the pitstop.
I haven’t run the numbers yet though, so I’m not sure how these different approaches will feel…
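For what it’s worth, the arithmetic for the two amortisation options might look something like the following sketch. The stint totals and pit stop times are invented for a hypothetical driverA, and the pen_before, pen_after, av_before and av_after column names are just placeholders; stintsum and stinlen follow the summary table columns.

```r
# Toy stint summary for one hypothetical driver (numbers invented)
toy <- data.frame(driverId = "driverA", stint = 1:3,
                  stintsum = c(1000, 1500, 800),  # total clean laptime (s)
                  stinlen  = c(10, 15, 8))        # laps counted in each stint

# Hypothetical pit stop times (s) for the stops that end stints 1 and 2
pit_times <- c(24, 25)

# Option 1: charge each stop to the stint leading up to it
# (the final stint ends at the flag, so it carries no penalty)
toy$pen_before <- c(pit_times, 0) / toy$stinlen

# Option 2: charge each stop to the stint that follows it
# (the first stint starts from the grid, so it carries no penalty)
toy$pen_after <- c(0, pit_times) / toy$stinlen

# Average laptime per stint, with and without the per-lap pit penalty
toy$stintav   <- toy$stintsum / toy$stinlen
toy$av_before <- toy$stintav + toy$pen_before
toy$av_after  <- toy$stintav + toy$pen_after

print(toy)
```

Because the penalty is spread over the stint length, a short stint is hit much harder than a long one, which may or may not be the behaviour we want.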
In much the same way that the IBM DataScientist Workbench seeks to provide some level of integration between analysis tools such as Jupyter notebooks and data access and storage, Azure Machine Learning studio also provides a suite of tools for accessing and working with data in one location. Microsoft’s offering is new to me, but it crossed my radar with the announcement that they have added native R kernel support, as well as Python 2 and 3, to their Jupyter notebooks: Jupyter Notebooks with R in Azure ML Studio.
Guest workspaces are available for free (I’m not sure if this is once only, or whether you can keep going back?) but there is also a free workspace if you have a (free) Microsoft account.
Once inside, you are provided with a range of tools – the one I’m interested in to begin with is the notebook (although the pipework/dataflow experiments environment also looks interesting):
Select a kernel:
give your new notebook a name, and it will launch into a new browser tab:
You can also arrange notebooks within separate project folders. For example, create a project:
and then add notebooks to it:
When creating a new notebook, you may have noted an option to View More in Gallery. The gallery includes examples of a range of project components, including example notebooks:
Thinking about things like the MyBinder app, which lets you launch a notebook in a container from a Github account, it would be nice to see additional buttons being made available to let folk run notebooks in Azure Machine Learning, or the Data Scientist Workbench.
It’s also worth noting how user tools – such as notebooks – seem to be being provided for free with a limited amount of compute and storage resource behind them as a way of recruiting users into platforms where they might then start to pay for more compute power.
From a course delivery perspective, I’m often unclear as to whether we can tell students to sign up for such services as part of a course or whether that breaks the service terms? (Some providers, such as Wakari, make it clear that “[f]or classes, projects, and long-term use, we strongly encourage a paid plan or Wakari Enterprise. Special academic pricing is available.”) It seems unfair that we should require students to sign up for accounts on a “free” service in their own name as part of our offering for a couple of reasons at least: first, we have no control over what happens on the service; second, it seems that it’s a commercial transaction that should be formalised in some way, even if only to agree that we can (will?) send our students to that particular service exclusively. Another possibility is that we say students should make their own service available, whether by installing software themselves or finding an online provider for themselves.
On the other hand, trying to get online services provided at course scale in a timely fashion within an HEI seems to be all but impossible, at least until such a time as the indie edtech providers such as Reclaim Hosting start to move even more into end-user app provision either at the individual level, or affordable class level (with an SLA)…
So I finally got round to pushing a revised (and typo corrected!) version of Wrangling F1 Data With R: A Data Junkie’s Guide, that also includes a handful of new sections and chapters, including descriptions of how to detect undercuts, the new style race history chart that shows the on-track position of each driver for each lap of a race relative to the lap leader, and a range of qualifying session analysis charts that show the evolution of session cut off times and drivers’ personal best times.
Code is described for every data manipulation and chart that appears in the book, along with directions for how to get hold of (most of) the data required to generate the charts. (Future updates will cover some of the scraping techniques required to get hold of the rest of it!)
As well as the simple book, there’s also a version bundled with the R code libraries that are loaded in as a short-cut in many of the chapters.
The book is published on Leanpub, which means you can access several different electronic versions of the book, and once you’ve bought a copy, you get access to any future updates for no further cost…
There is a charge on the book, with a set minimum price, but you also have the freedom to pay more! Any monies received for this book go to cover costs (I’ve started trying to pay for the webservices I use, rather than just keep using their free plan). If the monthly receipts bump up a little, I’ll try to get some services that generate some of the charts interactively hosted somewhere…
Part of the vision behind the Jupyter notebook ecosystem seems to be the desire to create a literate computing infrastructure that supports “the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components” (Fernando Perez, “Literate computing” and computational reproducibility: IPython in the age of data-driven journalism, 19/4/13).
The notebook approach complements other live document approaches such as the use of Rmd in applications such as RStudio, providing an interactive, editable rendered view of the live document, including inlined outputs, rather than just the source code view.
Notebooks don’t just have to be used for analysis though. A few months ago, I spotted a notebook being used to configure a database system, db-introspection-notebook – my gut reaction to which was to ponder Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?. (A problem with that approach, of course, is that it requires notebook machinery to get started, whereas you might typically want to run configuration scripts in as bare bones a system as possible.)
[a] web server that supports different mechanisms for spawning and communicating with Jupyter kernels, such as:
- A Jupyter Notebook server-compatible HTTP API for requesting kernels and talking the Jupyter kernel protocol with them over Websockets
- A[n] HTTP API defined by annotated notebook cells that maps HTTP verbs and resources to code to execute on a kernel
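As a rough illustration of the second mechanism, a single notebook code cell for an R kernel might look something like the sketch below. It assumes the kernel gateway is running in its notebook-http mode, in which (as I understand the documentation) a leading comment giving an HTTP verb and a resource path binds the cell to that route, and whatever the cell prints becomes the response body. The /hello/world route and the response variable are hypothetical, and I haven't run this behind a gateway.

```r
# GET /hello/world
# The comment on the first line is the (assumed) route annotation;
# everything below is ordinary R that runs when the route is requested.

# Hypothetical response body, written as a JSON string by hand
response <- '{"message": "hello world"}'

# Printing to standard output is what (it is assumed) gets returned to the caller
cat(response)
```

Run as a normal notebook cell, this simply prints the JSON string, which makes it easy to test the logic before exposing it as an API call.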
Tooling to support the creation of a literate API then, that fully respects Fernando Perez’ description of literate computing?!
At first glance it looks like all the API functions need to be defined within a single notebook – the notebook run by the kernel gateway. But another Jupyter project in incubation allows notebooks to be imported into other notebooks, as this demo shows: Notebooks as Reusable Modules and Cookbooks. Which means that a parent API defining notebook could pull in dependent child notebooks that each define a separate API call.
And because the Jupyter server can talk to a wide range of language kernels, this means the API can be implemented using an increasing range of languages (though I think that all the calls will need to be implemented using the same language kernel?). Indeed, the demo code has notebooks showing how to define notebook powered APIs in python and R.
The news today was led in part by a story broken by the BBC and BuzzFeed News – The Tennis Racket – about match fixing in Grand Slam tennis tournaments. (The BBC contribution seems to have been done under the ever listenable File on Four: Tennis: Game, Set and Fix?)
One interesting feature of this story was that “BuzzFeed News began its investigation after devising an algorithm to analyse gambling on professional tennis matches over the past seven years”, backing up evidence from leaked documents with “an original analysis of the betting activity on 26,000 matches”. (See also: How BuzzFeed News Used Betting Data To Investigate Match-Fixing In Tennis, and an open access academic paper that inspired it: Rodenberg, R. & Feustel, E.D. (2014), Forensic Sports Analytics: Detecting and Predicting Match-Fixing in Tennis, The Journal of Prediction Markets, 8(1).)
Feature detecting algorithms such as this (where the feature is an unusual betting pattern) are likely to play an increasing role in the discovery of stories from data, step 2 in the model described in this recent Tow Center for Digital Journalism Guide to Automated Journalism:
Another interesting aspect of the story behind the story was the way in which BuzzFeed News opened up the analysis they had applied to the data. You can find it described on Github – Methodology and Code: Detecting Match-Fixing Patterns In Tennis – along with the data and a Jupyter notebook that includes the code used to perform the analysis: Data and Analysis: Detecting Match-Fixing Patterns In Tennis.
You can even run the notebook to replicate the analysis yourself, either by downloading it and running it using your own Jupyter notebook server, or by using the online mybinder service: run the tennis analysis yourself on mybinder.org.
(I’m not sure if the BuzzFeed or BBC folk tried to do any deeper analysis, for example poking into point summary data as captured by the Tennis Match Charting Project? See also this Tennis Visuals project that makes use of the MCP data. Tennis betting data is also collected here: tennis-data.co.uk. If you’re into the idea of analysing tennis stats, this book is one way in: Analyzing Wimbledon: The Power Of Statistics.)
So what are these notebooks anyway? They’re magic, that’s what!:-)
The Jupyter project is an evolution of an earlier IPython (interactive Python) project that included a browser based notebook style interface for allowing users to write and execute code, as well as seeing the result of executing the code, a line at a time, all in the context of a “narrative” text document. The Jupyter project funding proposal describes it thus:
[T]he core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts.
[C]omputation in science is ultimately in service of a result that needs to be woven into the bigger narrative of the questions under study: that result will be part of a paper, will support or contest a theory, will advance our understanding of a domain. And those insights are communicated in papers, books and lectures: narratives of various formats.
The problem the Jupyter project tackles is precisely this intersection: creating tools to support in the best possible ways the computational workflow of scientific inquiry, and providing the environment to create the proper narrative around that central act of computation. We refer to this as Literate Computing, in contrast to Knuth’s concept of Literate Programming, where the emphasis is on narrating algorithms and programs. In a Literate Computing environment, the author weaves human language with live code and the results of the code, and it is the combination of all that produces a computational narrative.
At the heart of the entire Jupyter architecture lies the idea of interactive computing: humans executing small pieces of code in various programming languages, and immediately seeing the results of their computation. Interactive computing is central to data science because scientific problems benefit from an exploratory process where the results of each computation inform the next step and guide the formation of insights about the problem at hand. In this Interactive Computing focus area, we will create new tools and abstractions that improve the reproducibility of interactive computations and widen their usage in different contexts and audiences.
The Jupyter notebooks include two types of interactive cell – editable text cells into which you can write simple markdown and HTML text that will be rendered as text; and code cells into which you can write executable code. Once executed, the results of that execution are displayed as cell output. Note that the output from a cell may be text, a datatable, a chart, or even an interactive map.
There are multiple ways of running Jupyter notebooks, including the mybinder approach described above – I describe several of them in the post Seven Ways of Running IPython Notebooks.
As well as having an important role to play in reproducible data journalism and reproducible (scientific) research, notebooks are also a powerful, and expressive, medium for teaching and learning. For example, we’re just about to start using Jupyter notebooks, delivered via a virtual machine, for the new OU course Data management and analysis.
We also used them in the FutureLearn course Learn to Code for Data Analysis, showing how code could be used a line at a time to analyse a variety of opendata sets from sources such as the World Bank Indicators database and the UN Comtrade (import/export data) database.
PS for sports data fans, here’s a list of data sources I started to compile a year or so ago: Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post).
One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongoDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.
(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the wayside and I suspect the Jupyter container (IPython notebook server, as was) probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now…:-( )
Anyway, the IBM Data Scientist Workbench now sits alongside things like KMi’s longstanding KMi Crunch Learning Analytics Environment (RStudio + MySQL), and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of why the heck can’t we get our act together to offer this sort of SaaS thing to learners? And yes I know there are cost implications…. but, erm, sponsorship, cough… get-started tokens then PAYG, cough…
It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:
The toolbar also suggests that the ability to “discover” pre-identified data sources and run pre-configured modeling tools is on the cards.
The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…
So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)
See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.
So have you been looking for something like RStudio, but for Python?
It’s been out for some time, but a recently updated release of Rodeo gives an increasingly workable RStudio-like environment for Python users.
The layout resembles the RStudio layout – file editor top left, interactive console bottom left, variable inspector and history top right, charts, directory view and plugins bottom right. (For plugins, read: packages).
The preferences panel lets you set the initial working directory as well as the path to the required python executable.
Code selected in the file editor can be run in the console. Charts can be generated using matplotlib and are displayed in the chart view area bottom right.
As with RStudio, you can write reproducible research documents that blend markdown and code and render the result as HTML or PDF.
As you might expect, charts can be embedded as outputs in the document too.
Whilst the first version of Rodeo was a flask app viewable via a browser, and installable via pip, the latest version is an electron app, like RStudio. I found the ability to run Rodeo directly in the browser really useful, but the RStudio folks appear to have found a way of running RStudio via a browser using their RStudio server, so I’m hoping there’ll also be an open source version of Rodeo server available too?
One thing I’m wondering is whether Rodeo is a front end that can run against other Jupyter kernels? I notice that there is already a branch on the Rodeo github repo called r-backend, for example…?
Another thing I haven’t really clarified for myself is the difference between authoring (and teaching/learning) using the “Rmd/knitr” RStudio/Rodeo style workflow, and authoring in Jupyter notebooks. Notebook extensions are available that can suppress cell output etc to provide some level of control over what gets rendered from a notebook used as an authoring environment. I guess what I’d like for Jupyter notebooks is a simple dropdown that lets me specify the equivalent of knitr text result options that control how code cells are rendered in an output document.
And if you do prefer the notebook route, here are Seven Ways of Running IPython / Jupyter Notebooks.