I’ve been doodling today with some charts for the Wrangling F1 Data With R living book, trying to see how much information I can pack into a single chart.
The initial impetus came simply from thinking about a count of laps led in a particular race by each driver; this morphed into charting the number of laps in each position for each driver, and then into a more comprehensive race summary chart (see More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API for an earlier graphical attempt at producing a race summary chart).
The chart shows:
- grid position: identified using an empty grey square;
- race position after the first lap: identified using an empty grey circle;
- race position on each driver’s last lap: y-value (position) of corresponding pink circle;
- points cutoff line: a faint grey dotted line to show which positions are inside – or out of – the points;
- number of laps completed by each driver: size of pink circle;
- total laps completed by driver: greyed annotation at the bottom of the chart;
- whether a driver was classified or not: the total lap count is displayed using a bold font for classified drivers, and in italics for unclassified drivers;
- finishing status of each driver: classification statuses other than *Finished* are also recorded at the bottom of the chart.
The chart also shows drivers who started the race but did not complete the first lap.
What the chart doesn’t show is at what stage of the race a driver was in each position, and for how long. But I have an idea for another chart that could help there, one that could also reuse elements of the chart shown here.
FWIW, the following fragment of R code shows the ggplot function used to create the chart. The data came from the ergast API, though it did require a bit of wrangling to get it into a shape that I could use to power the chart.
```r
library(ggplot2)

#Reorder the drivers according to a final ranked position
g = ggplot(finalPos, aes(x=reorder(driverRef, finalPos)))
#Highlight the points cutoff
g = g + geom_hline(yintercept=10.5, colour='lightgrey', linetype='dotted')
#Highlight the position each driver was in on their final lap
g = g + geom_point(aes(y=position, size=lap), colour='red', alpha=0.15)
#Highlight the grid position of each driver
g = g + geom_point(aes(y=grid), shape=0, size=7, alpha=0.2)
#Highlight the position of each driver at the end of the first lap
g = g + geom_point(aes(y=lap1pos), shape=1, size=7, alpha=0.2)
#Provide a count of how many laps each driver held each position for
g = g + geom_text(data=posCounts,
                  aes(x=driverRef, y=position, label=poscount, alpha=alpha(poscount)),
                  size=4)
#Number of laps completed by driver
g = g + geom_text(aes(x=driverRef, y=-1, label=lap,
                      fontface=ifelse(is.na(classification), 'italic', 'bold')),
                  size=3, colour='grey')
#Record the status of each driver
g = g + geom_text(aes(x=driverRef, y=-2,
                      label=ifelse(status != 'Finished', status, '')),
                  size=2, angle=30, colour='grey')
#Styling - tidy the chart by removing the transparency legend
#(xRotn() is a small helper of mine that rotates the x-axis labels)
g + theme_bw() + xRotn() + xlab(NULL) + ylab("Race Position") + guides(alpha=FALSE)
```
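The wrangling step itself isn’t shown above. As a rough, hypothetical sketch of the sort of reshaping involved – in pandas rather than R, though the tabular logic is much the same – here’s how a `posCounts` style table (a count of laps each driver spent in each position) might be derived from lap-by-lap position data; the column names are just taken from the R fragment, and the data is made up for illustration:

```python
import pandas as pd

# Hypothetical lap-by-lap data: one row per driver per lap,
# recording the position the driver held on that lap.
laps = pd.DataFrame({
    'driverRef': ['hamilton', 'hamilton', 'hamilton',
                  'rosberg', 'rosberg', 'rosberg'],
    'lap':       [1, 2, 3, 1, 2, 3],
    'position':  [1, 1, 2, 2, 2, 1],
})

# Count how many laps each driver held each position -
# the shape the position-count text layer of the chart expects.
posCounts = (laps.groupby(['driverRef', 'position'])
                 .size()
                 .reset_index(name='poscount'))

print(posCounts)
```

The equivalent in R would be a `ddply()`/`dplyr` count over the same two grouping columns.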
The fully worked code can be found in a forthcoming update to the Wrangling F1 Data With R living book.
Whilst at an event over the weekend – from which I would generally have tweeted once or twice – I got the following two part message in what I regard as my personal Twitter feed to my phone:
1/2: Starting Nov 15, Vodafone UK will no longer support some Twitter SMS notifications. You will still be able to use 2-factor authentication and reset your pa
2/2: ssword over SMS. However, Tweet notifications and activity updates will cease. We are very sorry for the service disruption.
The notification appeared from a mobile number I have listed as “Twitter” in my contacts book. This makes both Twitter and Vodafone very much less useful to me – direct messages and mentions used to come direct to my phone as SMS text messages, and I used to be able to send tweets and direct messages to the same number (Docs: Twitter SMS commands). The service connected an SMS channel on my personal/private phone with a public address/communication channel that lives on the web.
Why not use a Twitter client? A good question that deserves an answer, one that may sound foolish: Vodafone offers crappy coverage in the places I need it most (at home, at the in-laws), where data connections are as good as useless. At home there’s wifi – and other screens running Twitter apps – elsewhere there typically isn’t. I also use my phone at events in the middle of fields, where coverage is often poor and even getting a connection can be chancy. (Client apps also draw on power – which is a factor on a weekend away at an event where there may be few recharging options; and they’re often keen to access contact book details, as well as all sorts of other permissions over your phone.)
So SMS works – it’s low power, typically on, low bandwidth, personal and public (via my Twitter ID).
But now both my phone contract and Twitter are worth very much less to me. One reason I kept my Vodafone contract was because of the Twitter connection. And one reason I stick with Twitter is the SMS route I have – I had – to it.
So now I’m looking for an alternative. To both.
I thought about rolling my own service using an SMS channel on IFTTT, but I don’t think it supports/is supported by Vodafone in the UK? (Do any mobile operators support it? If so, I think I may have to change to them…)
If I do change contract though, I hope it’s easier than the last contract we tried – are still trying – to kill. After several years it seems a direct debit on an old contract is still going out; after letters – and a phone call to Vodafone where they promised the direct debit was cancelled – it pops up again, paying out again, the direct debit that will never die. This week we’ll try again. Next month, if it pops up again, I guess we need to call on the ombudsman.
I guess what I’d really like is a mobile operator that offers me an SMS gateway so that I can call arbitrary webhooks in response to text messages I send, and field web requests that can then forward messages to my phone. (Support for the IFTTT SMS channel would be almost as good.)
From what I know of Twitter’s origins (as twttr), the mobile SMS context was an important part of it – “I want to have a dispatch service that connects us on our phones using text” @Jack [Dorsey, Twitter founder] possibly once said (How Twitter Was Born). Text still works for me in the many places I go where wifi isn’t available and data connections are unreliable and slow, if indeed they’re available at all. (If wifi is available, I don’t need my phone contract…)
The founding story continues: I remember that @Jack’s first use case was city-related: telling people that the club he’s at is happening. If Vodafone and Twitter hadn’t stopped playing over the weekend, I’d have tweeted that I was watching the Wales Rally (WRC/FIA World Rally Championship) Rallyfest stages in North Wales at Chirk Castle on Saturday, and Kinmel Park on Sunday. As it was, I didn’t – so WRC also lost out on the deal. And I don’t have a personal tweet record of the event I was at.
If I’m going to have to make use of a web client and data connection to make use of Twitter messaging, it’s probably time to look for a service that does it better. What’s WhatsApp like in this respect?
Or if I’m going to have to make use of a web client and data connection to make use of Twitter messaging, I need to find a mobile operator that offers reliable data connections in places where I need it, because Vodafone doesn’t.
Either way, this cessation of the service has made me realise where I get most value from Twitter, and where I get most value from Vodafone – and it was in a combination of those services. With them now separated, the value of both to me is significantly reduced. Reduced to such an extent that I am looking for alternatives – to both.
A couple of weeks ago, I had a little poke around some of the standard reports that we can get out of the OU VLE. OU course materials are generated from a structured document format – OU XML – that generates one or more HTML pages bound to a particular Moodle resource id. Additional Moodle resources are associated with forums, admin pages, library resource pages, and so on.
One of the standard reports provides a count of how many times each resource has been accessed within a given time period, such as a weekly block. Data can only be exported for so many weeks at a time, so to get stats for course materials over the presentation of a course (which may be up to 9 months long) requires multiple exports and the aggregation of the data.
We can then generate simple visual summaries over the data such as the following heatmap.
Usage is indicated by colour density; time, in weeks, is organised along the horizontal x-axis. From the chart, we can clearly see waves of activity over the course of the module as students access resources associated with particular study weeks. We can also see when materials aren’t being accessed, or are only being accessed a low number of times (that is, necessarily by a low proportion of students; if we had data about unique user accesses, or unique users’ first-use activity, we could get a better idea of the proportion of the cohort as a whole accessing a resource).
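As a rough sketch of the mechanics described above – the column names are hypothetical, since the actual VLE export format isn’t shown here – the separate weekly exports can be concatenated into one long table and then pivoted into a resource-by-week matrix, which is the natural input for a heatmap:

```python
import pandas as pd

# Hypothetical weekly exports, each with one row per resource:
# columns 'resource', 'week', 'views'. Two weeks' worth, faked for illustration.
week1 = pd.DataFrame({'resource': ['Intro', 'Forum'], 'week': 1, 'views': [120, 45]})
week2 = pd.DataFrame({'resource': ['Intro', 'Forum'], 'week': 2, 'views': [30, 60]})

# Aggregate the separate exports into a single long table...
usage = pd.concat([week1, week2], ignore_index=True)

# ...then pivot to a resource-by-week matrix of access counts.
matrix = usage.pivot(index='resource', columns='week', values='views').fillna(0)
print(matrix)

# Rendering the matrix as a colour-density heatmap needs a plotting
# library; guard the import so the data step runs standalone.
try:
    import matplotlib.pyplot as plt
    plt.pcolor(matrix)
    plt.yticks([i + 0.5 for i in range(len(matrix.index))], matrix.index)
    plt.xlabel('Week')
    plt.savefig('usage_heatmap.png')
except ImportError:
    pass
```

If the exports arrive as separate CSV files, the same `concat`-then-`pivot` pattern applies after a loop of `pd.read_csv()` calls.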
This sort of reporting – about material usage rather than student attainment – was what originally attracted me to thinking about data in the context of OU courses (eg Course Analytics in Context). That is, I wasn’t that interested in how well students were doing, per se, or interested in trying to find ways of spying on individual students to build clever algorithms behind experimental personalisation and recommender systems that would never make it out of the research context.
That could come later.
What I originally just wanted to know was whether this resource was ever looked at, whether that resource was accessed when I expected (eg if an end of course assessment page was accessed when students were prompted to start thinking about it during an exercise two thirds of the way into the course), whether students tended to study for half an hour or three hours (so I could design the materials accordingly), how (and when) students searched the course materials – and for what (keyphrase searches copied wholesale out of the continuous assessment materials) and so on.
Nothing very personal in there – everything aggregate. Nothing about students, particularly, everything about course materials. As a member of the course team, asking how are the course materials working rather than how is that student performing?
There’s nothing very clever about this – it’s just basic web stats run with an eye to looking for patterns of behaviour over the life of a course to check that the materials appear to be being worked in the way we expected. (At the OU, course team members are often a step removed from supporting students.)
But what it is, I think, is an important complement to the “student centred” learning analytics: analytics about the usage and utilisation of the course materials, the things we actually spend a couple of years developing but don’t really seem to track the performance of.
It’s data that can be used to inform and check on “learning designs”. Stats that act as indicators about whether the design is being followed – that is, used as expected, or planned.
As a course material designer, I may want to know how well students perform based on how they engage with the materials, but I should really want to know how the materials are being utilised, because they’re designed to be utilised in a particular way. And if they’re not being used in that way, maybe I need to have a rethink?
As we come up to the final two races of the 2014 Formula One season, the double points mechanism for the final race means that two drivers are still in with a shot at the Drivers’ Championship: Lewis Hamilton and Nico Rosberg.
As James Allen describes in Hamilton closes in on world title: maths favour him but Abu Dhabi threat remains:
Hamilton needs 51 points in the remaining races to be champion if Rosberg wins both races. Hamilton can afford to finish second in Brazil and at the double points finale in Abu Dhabi and still be champion. Mathematically he could also finish third in Brazil and second in the finale and take it on win countback, as Rosberg would have just six wins to Hamilton’s ten.
If Hamilton leads Rosberg home again in a 1-2 in Brazil, then he will go to Abu Dhabi needing to finish fifth or higher to be champion (echoes of Brazil 2008!!). If Rosberg does not finish in Brazil and Hamilton wins the race, then Rosberg would need to win Abu Dhabi with Hamilton not finishing; no other scenario would give Rosberg the title.
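James Allen’s scenarios are easy to check mechanically. The following sketch (in Python rather than R, purely for brevity) encodes the 2014 points table, the double-points finale, and the countback-on-wins tiebreak. The pre-Brazil state – a 24-point Hamilton lead, with ten wins to Rosberg’s four – is inferred from the figures in the quote above (51 needed out of a possible 75, and “six wins to Hamilton’s ten” if Rosberg were to win both remaining races):

```python
# 2014 points for positions 1-10; 0 for lower placings or a DNF (use position 0).
POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]

def points(pos, double=False):
    p = POINTS[pos - 1] if 1 <= pos <= 10 else 0
    return 2 * p if double else p

def champion(ham_brazil, ham_finale, ros_brazil, ros_finale):
    # Points added over the last two races, measured from Rosberg's total;
    # Hamilton starts 24 points ahead, per the quoted arithmetic.
    ham = 24 + points(ham_brazil) + points(ham_finale, double=True)
    ros = points(ros_brazil) + points(ros_finale, double=True)
    if ham != ros:
        return 'HAM' if ham > ros else 'ROS'
    # Tied on points: countback on wins (10 v 4 before Brazil).
    ham_wins = 10 + (ham_brazil == 1) + (ham_finale == 1)
    ros_wins = 4 + (ros_brazil == 1) + (ros_finale == 1)
    return 'HAM' if ham_wins > ros_wins else 'ROS'

# Rosberg wins both, Hamilton second in both: Hamilton champion.
print(champion(2, 2, 1, 1))   # HAM
# Hamilton third in Brazil, second in the finale: tied on points, won on countback.
print(champion(3, 2, 1, 1))   # HAM
# A 1-2 in Brazil, then Hamilton only sixth in Abu Dhabi: Rosberg champion.
print(champion(1, 6, 2, 1))   # ROS
```

The “fifth or higher” claim checks out too: with a 1–2 in Brazil, fifth in Abu Dhabi (20 doubled points) leaves Hamilton one point clear of a Rosberg win.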
A couple of years ago, I developed an interactive R/shiny app for exploring finishing combinations of two drivers in the last two races of a season to see what situations led to what result: Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship.
I’ve updated the app (taking into account the matter of double points in the final race) so you can check out James Allen’s calculations with it (assuming I got my sums right too!). I tried to pop up an interactive version to Shinyapps, but the Shinyapps publication mechanism seems to be broken (for me at least) at the moment…:-(
When I get a chance, I’ll weave elements of this recipe into the Wrangling F1 Data With R book.
PS I’ve also started using the F1dataJunkie blog again as a place to post drafts and snippets of elements I’m working on for that book…
Several years ago, I came across this mocked-up Google search that still makes me laugh now…
And a couple of days ago, I realised I’d misplaced my phone. An Android device. An Android device that I have associated with a secondary Google profile I set up specifically to work with my phone (and that is linked in certain respects, such as calendars, with my primary Google ID).
Not being overly trusting of Google, I thought I’d switched off the various location awareness services that Google, and others, keep trying to get me to enable. Which made me feel a little silly – because if I’d put a location tracker on my device, I would have been able to check whether I had accidentally lost it from my pocket in the place I thought. Or erase it if not.
Or perhaps I needn’t have felt silly. I thought I’d do a quick search for “locate Android phone” anyway, and turned up the so-called Android Device Manager (about, help). Logging in to that service, lo and behold, there was a map locating the phone… pretty much exactly at the point I’d thought – I’d hoped – I’d misplaced it.
There are also a couple of other device management services – you can call the phone (to help find it down the side of the sofa, for example), or erase and lock it. A service also exists to display a number on the locked phone in case a kindly soul finds it and wants to call you on that number to let you know they have it.
Another example of a loss of sovereignty? And another example of how Google operates enterprise-level control over our devices (the operating system and its deep-seated features), albeit giving us some sort of admin privileges too in a vague attempt to persuade us that we’re in control. Which we aren’t, of course. Useful, yes – but disconcerting and concerning too; because I really thought I’d tried to opt out of, and even disable, location-revealing service access on that phone. (I know for a fact GPS was disabled – but then, mobile cell triangulation topped up with wifi hotspot location seems to narrow things down pretty well…)
Earlier this year I started trying to pull some of my #f1datajunkie R-related ramblings together in book form. The project stalled, but to try to reboot it I’ve started publishing it as a living book over on Leanpub. Several of the chapters are incomplete – with TO DO items sketched in – and others are still unpublished. The beauty of the Leanpub model is that if you buy a copy, you continue to get access to all future updated versions of the book. (And my idea is that by getting the book out there as it is, I’ll feel as if there’s more (social) pressure on actually trying to keep up with it…)
I’ll be posting more details about how the Leanpub process works (for me at least) in the next week or two, but for now, here’s a link to the book: Wrangling F1 Data With R: A Data Junkie’s Guide.
Here’s the table of contents so far:
- A Note on the Data Sources
- What are we trying to do with the data?
- Choosing the tools
- The Data Sources
- Getting the Data into RStudio
- Example F1 Stats Sites
- How to Use This Book
- The Rest of This Book…
- An Introduction to RStudio and R dataframes
- Getting Started with RStudio
- Getting Started with R
- Getting the data from the Ergast Motor Racing Database API
- Accessing Data from the ergast API
- Getting the data from the Ergast Motor Racing Database Download
- Accessing SQLite from R
- Asking Questions of the ergast Data
- Exercises and TO DO
- Data Scraped from the F1 Website
- Problems with the Formula One Data
- How to use the FormulaOne.com alongside the ergast data
- Reviewing the Practice Sessions
- The Weekend Starts Here
- Practice Session Data from the FIA
- Sector Times
- FIA Media Centre Timing Sheets
- A Quick Look at Qualifying
- Qualifying Session Position Summary Chart
- Another Look at the Session Tables
- Ultimate Lap Positions
- Annotated Lapcharts
- Race History Charts
- The Simple Laptime Chart
- Accumulated Laptimes
- Gap to Leader Charts
- The Lapalyzer Session Gap
- Eventually: The Race History Chart
- Pit Stop Analysis
- Pit Stop Data
- Total pit time per race
- Pit Stops Over Time
- Estimating pit loss time
- Tyre Change Data
- Career Trajectory
- The Effect of Age on Performance
- Statistical Models of Career Trajectories
- The Age-Productivity Gradient
- Spotting Runs
- Generating Streak Reports
- Streak Maps
- Team Streaks
- Time to N’th Win
- TO DO
- Appendix One – Scraping formula1.com Timing Data
- Appendix Two – FIA Timing Sheets
- Downloading the FIA timing sheets for a particular race
- Appendix – Converting the ergast Database to SQLite
If you think you deserve a free copy, let me know… ;-)
Via a half quote by Adam Cooper in his SoLAR flare talk today, elucidated in his blog post Exploratory Data Analysis, I am led to a talk by John Tukey – The Technical Tools of Statistics – read at the 125th Anniversary Meeting of the American Statistical Association, Boston, November 1964.
As ever (see, for example, Quoting Tukey on Visual Storytelling with Data), it contains some gems… The following is a spoiler of the joy of reading the paper itself. I suggest you do that instead – you’ll more than likely find your own gems in the text: The Technical Tools of Statistics.
If you’re too lazy to click away, here are some of the quotes and phrases I particularly enjoyed.
To start with, the quote referenced by Adam:
Some of my friends felt that I should be very explicit in warning you of how much time and money can be wasted on computing, how much clarity and insight can be lost in great stacks of computer output. In fact, I ask you to remember only two points:
- The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful.
- Most uses of the classical tools of statistics have been, are, and will be, made by those who know not what they do.
And here’s one I’m going to use when talking about writing diagrams:
Hand-drawing of graphs, except perhaps for reproduction in books and in some journals, is now economically wasteful, slow, and on the way out.
(It strikes me that using a spreadsheet wizard to create charts in a research or production setting, where we are working in a reproducible document generation context, is akin to the “hand-drawing of graphs” of yesteryear?)
“I know of no person or group that is taking nearly adequate advantage of the graphical potentialities of the computer.”
[W]e are going to reach a position we should have reached long ago. We are going, if I have to build it myself, to have a programming system — a “language” if you like — with all that that implies, suited to the needs of data analysis. This will be planned to handle numbers in organized patterns of very different shapes, to apply a wide variety of data-analytical operations to make new patterns from old, to carry out the oddest sequences of apparently unrelated operations, to provide a wide variety of outputs, to automatically store all time-expensive intermediate results “on disk” until the user decides whether or not he will want to do something else with them, and to do all this and much more easily.
Since I’ve started playing with pandas, my ability to have written conversations with data has improved. Returning to R after a few months away, I’m also finding that easier to write as well (the tabular data models, and elements of the syntax, are broadly similar across the two).
Most of the technical tools of the future statistician will bear the stamp of computer manufacture, and will be used in a computer. We will be remiss in our duty to our students if we do not see that they learn to use the computer more easily, flexibly, and thoroughly than we ever have; we will be remiss in our duties to ourselves if we do not try to improve and broaden our own uses.
This does not mean that we shall have to continue to teach our students the elements of computer programming; most of the class of ’70 is going to learn that as freshmen or sophomores. Nor does it mean that each student will write his own program for analysis of variance or for seasonal adjustment, this would be a waste. … It must mean learning to put together, effectively and easily — on a program-self-modifying computer and by means of the most helpful software then available — data analytical steps appropriate to the need, whether this is to uncover an anticipated specific appearance or to explore some broad area for unanticipated, illuminating appearances, or, as is more likely, to do both.
Interesting to note that in the UK, “text-based programming” has made it into the curriculum. (Related: Text Based Programming, One Line at a Time (short course pitch).)
Tukey also talks about how computing will offer flexibility and fluidity. Flexibility includes the “freedom to introduce new approaches; freedom, in a word, to be a journeyman carpenter of data-analytical tools”. Fluidity “means that we are prepared to use structures of analysis that can flow rather freely … to fit the apparent desires of the data”.
As the computer revolution finally penetrates into the technical tools of statistics, it will not change the essential characteristics of these tools, no matter how much it changes their appearance, scope, appositeness and economy. We can only look for:
- more of the essential erector-set character of data analysis techniques, in which a kit of pieces are available for assembly into any of a multitude of analytical schemes,
- an increasing swing toward a greater emphasis on graphicality and informality of inference,
- a greater and greater role for graphical techniques as aids to exploration and incisiveness,
- steadily increasing emphasis on flexibility and on fluidity,
- wider and deeper use of empirical inquiry, of actual trials on potentially interesting data, as a way to discover new analytic techniques,
- greater emphasis on parsimony of representation and inquiry, on the focussing, in each individual analysis, of most of our attention on relatively specific questions, usually in combination with a broader spreading of the remainder of our attention to the exploration of more diverse possibilities.
In order that our tools, and their uses, develop effectively … we shall have to give still more attention to doing the approximately right, rather than the exactly wrong, …
All quotes from John Tukey, The Technical Tools of Statistics, 1964.