Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post)

Via my feeds, I noticed a package announcement today for cricketR!, a new package for analysing cricket performance data.

This got me wondering (again!) about what other sports related packages there might be out there, either in terms of functional thematic packages (to do with sport in general, or one sport in particular), or particular data packages, that either bundle up sports related data sets, or provide an API (that is, a wrapper for an official API, or a wrapper for a scraper that extracts data from one or more websites in a slightly scruffier way!).

This is just a first quick attempt, an unstructured listing that may also include data sets that are more generic than R-specific (e.g. CSV datafiles, or SQL database exports). I’ll try to keep this post updated as I find/hear about more packages, and also work a bit more on structuring it a little better. I really should post this as a wiki somewhere – or perhaps curate something on Github?

  • generic:
    • SportsAnalytics [CRAN]: “infrastructure for sports analysis. Anyway, currently it is a selection of data sets, functions to fetch sports data, examples, and demos”.
    • PlayerRatings [CRAN]: “schemes for estimating player or team skill based on dynamic updating. Implemented methods include Elo, Glicko and Stephenson” (via Twitter: @UTVilla)
  • athletics:
    • olympic {ade4} [Inside-R packages]: “performances of 33 men’s decathlon at the Olympic Games (1988)”.
    • decathlon {GDAdata} [CRAN]: “Top performances in the Decathlon from 1985 to 2006.” (via comments: Antony Unwin)
    • MexLJ {GDAdata} [CRAN]: “Data from the longjump final in the 1968 Mexico Olympics.” (via comments: Antony Unwin)
  • baseball:
  • basketball:
  • biathlon:
  • chess:
  • cricket:
  • darts:
    • darts [CRAN]: “Statistical Tools to Analyze Your Darts Game” (via comments: @MarchiMax)
  • football (American football):
  • football (soccer):
    • engsoccerdata [Github]: “a repository for complete soccer datasets, along with some built-in functions for analyzing parts of the data. Currently includes English League data, FA Cup data, Playoff data, some European leagues (Spain, Germany, Italy, Holland).”. Citation: James P. Curley (2015). engsoccerdata: English Soccer Data 1871-2015. R package version 0.1.4
    • UKSoccer {vcd} [Inside-R packages]: data “on the goals scored by Home and Away teams in the Premier Football League, 1995/6 season.”.
    • Soccer {PASWR} [Inside-R packages]: “how many goals were scored in the regulation 90 minute periods of World Cup soccer matches from 1990 to 2002”.
    • fbRanks [CRAN]: “Association Football (Soccer) Ranking via Poisson Regression: time dependent Poisson regression and a record of goals scored in matches to rank teams via estimated attack and defense strengths” (via comments: @MarchiMax)
  • golf:
  • gymnastics:
  • horse racing:
    • RcappeR [Github]: “tools to aid the analysis and handicapping of Thoroughbred Horse Racing” (via Twitter: @UTVilla)
    • rBloodstock [Github]: “datasets from Thoroughbred Bloodstock Sales, Tattersalls sales from 2010 to 2015 (incomplete)” (via Twitter: @UTVilla)
  • ice hockey:
    • nhlscrapr [CRAN]: “routines for extracting play-by-play game data for regular-season and playoff NHL games, particularly for analyses that depend on which players are on the ice”. [via comments – Triplethink]
    • hockey {gamlr} [Inside-R packages]: “information about play configuration and the players on ice (including goalies) for every goal from 2002-03 to 2012-13 NHL seasons” [via comments – Triplethink]
    • nhl-pbp [Github]: “code to parse and analyze NHL PBP data using R”.
    • ( liigadata (python) – utility for parsing Finnish ice hockey league game data from website)
  • motor sport:
  • skiing:
    • SpeedSki {GDAdata} [CRAN]: “World Speed Skiing Competition, Verbier 21st April, 2011.” (via comments: Antony Unwin)
  • sailing: I didn’t find any R packages, but I did find a sailing regatta results data interchange format: ISAF XML Regatta Reporting (XRR) Data Format
  • snooker:
  • swimming: I didn’t find any R packages, but I did find a swimming results data interchange format: Lenex; and a site that publishes data in that format: Omega Timing.
  • tennis:

It would perhaps make more sense to try to collect rather more structured (meta)data for each package. For example: homepage, sport/discipline; analysis, data (package or API), or analysis and data; if data: year-range, source, data coverage (e.g. table column headings); if analysis, brief synopsis of tools available (e.g. chart generators).
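As a rough idea of what such a structured record might look like, here’s a minimal Python sketch. The class name, field names and the grouping helper are all my own illustrative choices rather than any agreed schema, and the example entries just restate what the listing above already says about each package.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SportsPackage:
    """One entry in the listing; every field name here is hypothetical."""
    name: str
    sport: str                        # e.g. "ice hockey", or "generic"
    kind: str                         # "analysis", "data", or "analysis and data"
    homepage: Optional[str] = None    # CRAN / Github page, if known
    source: Optional[str] = None      # where the data comes from
    year_range: Optional[str] = None  # only meaningful for data packages
    columns: List[str] = field(default_factory=list)  # data coverage
    synopsis: str = ""                # brief summary of any analysis tools


def group_by_sport(entries):
    """Return {sport: [package names]}, with names sorted within each sport."""
    grouped = {}
    for e in sorted(entries, key=lambda e: (e.sport, e.name.lower())):
        grouped.setdefault(e.sport, []).append(e.name)
    return grouped


entries = [
    SportsPackage("nhlscrapr", "ice hockey", "data"),
    SportsPackage("darts", "darts", "analysis"),
    SportsPackage("fbRanks", "football (soccer)", "analysis"),
    SportsPackage("engsoccerdata", "football (soccer)", "analysis and data",
                  year_range="1871-2015"),
]

by_sport = group_by_sport(entries)
```

Records like this could then be dumped to CSV or JSON and used to regenerate the listing, rather than maintaining it by hand.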

If you know of any others, please let me know via the comments and I’ll try to keep this page updated with a reasonably current list.

As well as packages, here are some links to blog posts that look at sports data analysis using R:

Again, if you can recommend further posts, please let me know via the comments.

PS other sports data interchange formats: SportsML-G2

Sports Data Journalism and “Datatainment”

Over the last couple of years, you’ve probably noticed that data has become a Big Thing in commerce (Big Data for business advantage) as well as in the openness/transparency community, with governments and the media joining the party particularly in the context of the latter. But if you’re looking to develop data journalism skills, it’s probably also worth remembering the area of sports journalism, and the wealth of data produced around sporting events.

Part of the attraction of developing learning activities around sports data is that there’s a good chance that it’ll keep on delivering… If you develop a way of analysing or displaying sports data that pulls out interesting features or story elements from a set of sports data, you should be able to keep on using it… To set the scene, here’s an example: Driven By Data: Data Journalism in Sports. For a peek at my own fumblings, I’ve started exploring the automatic creation of F1DataJunkie Stats Graphics reports (still a lot to be done, but it’s a start…)

In the extreme case, you might be able to generate story outlines, or even canned prose… For example, in certain computer games in the sports genre, you might find you’re playing a game along to a “live commentary”, generated from the data being produced by the game. Automatic commentary generation is a form of sports journalism. And automated article generation is already here, as @RobbieAllen describes in How I automated my writing career, a brief overview of Automated Insights, a company that specialises in computer generated visualisations and prose.

See also: Automated Storytelling in Sports: A Rich Domain to Be Explored, Automated Event Recognition for Football Commentary Generation, Three RoboCup Simulation League Commentator Systems, and so on…

Getting hold of data is always an issue, of course, but I suspect that many larger newsrooms will take a subscription to the Press Association sports data feeds, for example…

Anyway, as an exercise, here’s some data to start with, from the Guardian datastore: Premier League’s top scorers: who is scoring the most goals? Is there a correlation with age, perhaps? (Where would you find the age data…?)

As well as sports reporting, I think we’re also likely to see an increase in what Head of Digital at Manchester City FC, Richard Ayers, refers to as datatainment: “where you use data as the primary source of entertainment. You might choose to make the visualisation of raw data entertaining or perhaps use data visualisation as part of the process of entertainment – but there’s definitely a strong editorial control which is focussed on entertaining the audience rather than exposing data.” (Data? Entertainment? You need Datatainment and Defining Data Visualisation, Data Journalism & Data Entertainment).

Devices such as FanVision already blend video and audio streams with data feeds, more and more sports have “live stats apps” associated with them, and it’s not hard to imagine the data crunching that goes on under the hood in things like Optiplay making an appearance on sports analysis and review sites.

I also think that the “data as entertainment” line might work well as a second screen activity. Things like the F1 Live Timing app already demonstrate this:

On the other hand, there’s an opportunity for data focussed sites that go into deep analysis for the hardcore fan. Again looking at Formula One, the Intelligent F1 blog features a data-powered model developed by a rocket scientist that provides engagement around a particular race over an extended period, from predicting Sunday race behaviour based on Friday practice data and previous outings, through analysis of practice and qualifying data, to a detailed series of post-race analyses. (Complement this with technical analyses applied to the cars on the Scarbs F1, and you have the ultimate F1 geek’s paradise! ;-)

PS This also caught my eye: Gametime [Assistant]: Girls’ Lacrosse Game Data, which steps through the design of a “datatainment” app…

PPS as the Lacrosse app suggests, the data collection thing can also improve engagement with a live event. For example, my own doodlings around a motorsport lapcharting app (Thoughts on a Couple of Possible Lap Charting Apps, initial code experiment)

PPPS Seems like the algorithmic story generation thing is itself a story that keeps coming round again… So for example, a couple of pieces that both appeared around the same time in April 2012: Can the Computers at Narrative Science Replace Paid Writers? in The Atlantic, and Can an Algorithm Write a Better News Story Than a Human Reporter? in Wired.

F1 Data Junkie, the Blog…

To try to bring a bit of focus back to this blog, I’ve started a new blog – F1 Data Junkie – that will act as the home for my “procedural” F1 Data postings. I’ll still post the occasional thing here – for example, reviewing the howto behind some of the novel visualisations I’m working on (such as practice/qualification session utilisation charts, and race battle maps) – but charts relating to particular races will, in the main, go onto the new blog…

I’m hoping by the end of the season to have an automated route for generating different sorts of visual reviews of practice, qualification and race sessions based on both official timing data, and (hopefully) the McLaren telemetry data. (If anyone has managed to scrape and decode the Mercedes F1 live telemetry data and is willing to share it with me, that would be much appreciated :-)

I also hope to use the spirit of F1 to innovate like crazy on the visualisations as and when I get the chance; I think that there’s a lot of mileage still to come in innovative sports graphics/data visualisations*, not only for the stats geek fan, but also for sports journalists looking to uncover stories from the data that they may have missed during an event. And with a backlog of data going back years for many sports, there’s also the opportunity to revisit previous events and reinterpret them… Over this weekend, I’ve been hacking around a few old scripts to automate the production of various data formatters, as well as working on a couple of my very own visualisation types :-) So if you want to see what I’ve been up to, you should probably pop over to F1 Data Junkie, the blog… ;-)

*A lot of innovation is happening in live sports graphics for TV overlays, such as the Piero system developed by the BBC, or the HawkEye ball tracking system (the company behind it has just been bought by Sony, so I wonder if we’ll see the tech migrate into TVs, start to play a role in capturing data that crosses over into gaming (e.g. Play Along With the Real World), or feed commercial data augmentations from Sony to viewers via widgets on Sony TVs…)

There’ve also been recent innovations in sports graphics in the press and online. For example, seeing this interactive football chalkboard on the Guardian website, which lets you pull up, in a graphical way, stats reviews of recent and historical games, or this Daily Telegraph interactive that provides a Hawk-Eye analysis of the Ashes (is there an equivalent for the Masters golf anywhere, I wonder, or Wimbledon tennis? See also Cricket visualisation tool), I wonder why there aren’t any interactive graphical viewers over recent and historical F1 data… (or maybe there are? If you know of any – or know of any interesting visualisations around motorsport in general and F1 in particular – please let me know in the comments… :-)

Visualising F1 Timing Sheet Data

Putting together a couple of tricks from recent posts (Visualising Vodafone Mclaren F1 Telemetry Data in Gephi and PDF Data Liberation: Formula One Press Release Timing Sheets), I thought I’d have a little play with the timing sheet data in Gephi…

The representations I have used to date are graph based, with each node corresponding to a particular lap performance by a particular driver, and edges connecting consecutive laps.

**If you want to play along, you’ll need to download Gephi and this data file: F1 timing, Malaysia 2011 (NB it’s not thoroughly checked… glitches may have got through in the scraping process :-( )**

The nodes carry the following data, as specified using the GDF format:

  • name VARCHAR: the ID of each node, given as driverNumber_lapNumber (e.g. 12_43)
  • label VARCHAR: the name of the driver (e.g. S. VETTEL)
  • driverID INT: the driver number (e.g. 7)
  • driverNum VARCHAR: an ID for the driver of the lap (e.g. driver_12)
  • team VARCHAR: the team name (e.g. Vodafone McLaren Mercedes)
  • lap INT: the lap number (e.g. 41)
  • pos INT: the position at the end of the lap (e.g. 5)
  • pitHistory INT: the number of pitstops to date (e.g. 2)
  • pitStopThisLap DOUBLE: the duration of any pitstop this lap, else 0 (e.g. 12.321)
  • laptime DOUBLE: the laptime, in seconds (e.g. 72.125)
  • lapdelta DOUBLE: the difference between the current laptime and the previous laptime (e.g. 1.327)
  • elapsedTime DOUBLE: the summed laptime to date (e.g. 1839.021)
  • elapsedTimeHun DOUBLE: the elapsed time divided by a hundred (e.g. 18.39021 for an elapsed time of 1839.021)
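To make the node and edge structure a little more concrete, here’s a minimal Python sketch of how a file along these lines might be generated. It isn’t the script I actually used: the function name, the input record layout and the lap times are all invented for illustration, only a subset of the columns listed above is written out, and I’ve assumed that lapdelta is the current laptime minus the previous one (0 on a driver’s first lap). The `nodedef>` and `edgedef>` header lines follow the GDF convention.

```python
from collections import defaultdict

# Subset of the node columns listed above, with their GDF types.
NODE_COLUMNS = [
    ("name", "VARCHAR"), ("label", "VARCHAR"), ("driverID", "INT"),
    ("lap", "INT"), ("pos", "INT"), ("laptime", "DOUBLE"),
    ("lapdelta", "DOUBLE"), ("elapsedTime", "DOUBLE"),
]


def build_gdf(records):
    """Build GDF text from per-lap records (hypothetical input layout).

    Each record is a dict with driverID, label, lap, pos and laptime keys.
    Labels containing commas would need quoting, which is not handled here.
    """
    by_driver = defaultdict(list)
    for r in records:
        by_driver[r["driverID"]].append(r)

    node_rows, edge_rows = [], []
    for driver_id, laps in by_driver.items():
        laps.sort(key=lambda r: r["lap"])
        elapsed, prev = 0.0, None
        for r in laps:
            elapsed += r["laptime"]  # summed laptime to date
            # Assumed sign convention: current laptime minus previous laptime.
            delta = 0.0 if prev is None else r["laptime"] - prev["laptime"]
            name = f"{driver_id}_{r['lap']}"  # driverNumber_lapNumber
            node_rows.append(",".join([
                name, r["label"], str(driver_id), str(r["lap"]), str(r["pos"]),
                f"{r['laptime']:.3f}", f"{delta:.3f}", f"{elapsed:.3f}",
            ]))
            if prev is not None:
                # Edge connecting this driver's consecutive laps.
                edge_rows.append(f"{driver_id}_{prev['lap']},{name}")
            prev = r

    header = "nodedef>" + ",".join(f"{n} {t}" for n, t in NODE_COLUMNS)
    return "\n".join([header, *node_rows,
                      "edgedef>node1 VARCHAR,node2 VARCHAR", *edge_rows])


# Invented lap times for two made-up drivers, purely to exercise the function.
sample_laps = [
    {"driverID": 1, "label": "DRIVER A", "lap": 1, "pos": 1, "laptime": 100.0},
    {"driverID": 1, "label": "DRIVER A", "lap": 2, "pos": 1, "laptime": 98.5},
    {"driverID": 2, "label": "DRIVER B", "lap": 1, "pos": 2, "laptime": 101.0},
    {"driverID": 2, "label": "DRIVER B", "lap": 2, "pos": 2, "laptime": 99.0},
]

gdf_text = build_gdf(sample_laps)
```

Writing `gdf_text` out to a file with a `.gdf` suffix should give something Gephi can open, though I’d check the first few rows by eye before trusting it.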

Using the geolayout with an equirectangular (presumably this means Cartesian?) layout, we can generate a range of charts simply by selecting suitable co-ordinate dimensions. For example, if we select the laptime as the y (“latitude”) co-ordinate and x (“longitude”) as the lap, filtering out the nodes with a null laptime value, we can generate a graph of the form:

We can then tweak this a little – e.g. colour the nodes by driver (using a Partition based colouring), and edges according to node, resize the nodes to show the number of pit stops to date, and then filter to compare just a couple of drivers:

This sort of lap time comparison is all very well, but it doesn’t necessarily tell us relative track positions. If we size the nodes non-linearly according to position, with a larger size for the “smaller” numerical position (so first is less than second, and hence first is sized larger than second), we can see whether the relative positions change (in this case, they don’t…)

Another sort of chart we might generate will be familiar to many race fans, with a tweak – simply plot position against lap, colour according to driver, and then size the nodes according to lap time:

Again, filtering is trivial:

If we plot the elapsed time against lap, we get a view of separations (deltas between cars are available in the media centre reports, but I haven’t used this data yet…):

In this example, lap number increases up the graph and elapsed time increases left to right. Nodes are coloured by driver, and sized according to position. If a driver has a higher lap count and a lower total elapsed time than a driver on the previous lap, then they have lapped that car… Within a lap, we also see the separation of the various cars. (This difference should be the same as the deltas that are available via FIA press releases.)
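The lapping rule can also be checked directly from the elapsed times rather than by eye. Here’s a small Python sketch under my own assumptions: the function name and the `{driver: {lap: elapsed time}}` layout are hypothetical, and the numbers are invented rather than taken from the Malaysia data.

```python
def lapped_pairs(elapsed):
    """Find (leader, backmarker, lap) triples from elapsed times.

    elapsed maps driver -> {lap number: elapsed time at the end of that lap}.
    A leader is reported as having lapped a backmarker on `lap` when the
    leader finishes `lap` before the backmarker finishes `lap - 1`.
    """
    found = []
    for leader, leader_laps in elapsed.items():
        for other, other_laps in elapsed.items():
            if leader == other:
                continue
            for lap, t_leader in leader_laps.items():
                t_other = other_laps.get(lap - 1)
                if t_other is not None and t_leader < t_other:
                    found.append((leader, other, lap))
    return sorted(found)


# Invented elapsed times (seconds): A is much quicker than B.
elapsed_times = {
    "A": {1: 100.0, 2: 200.0, 3: 300.0},
    "B": {1: 160.0, 2: 320.0, 3: 480.0},
}

lapped = lapped_pairs(elapsed_times)
```

With these made-up numbers, A finishes lap 3 (300s) before B finishes lap 2 (320s), so the only triple reported is A lapping B on lap 3. Note that once a car has been lapped it will keep being reported on every later lap for which the condition still holds.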

If we zoom into a lap, we can better see the separation between cars. (Using the data I have, I’m hoping I haven’t introduced any systematic errors arising from essentially dead reckoning the deltas between cars…)
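For what it’s worth, the dead reckoning amounts to something like the following Python sketch, which takes each driver’s summed lap times and works out the gap to the leader and to the car ahead at the end of a given lap. The function name and data layout are hypothetical and the times are invented; it also assumes every car listed is on the same lap, so lapped cars would need handling separately.

```python
def gaps_on_lap(elapsed, lap):
    """Return [(driver, gap_to_leader, gap_to_car_ahead)] for one lap.

    elapsed maps driver -> {lap number: elapsed time at the end of that lap}.
    Drivers are ordered by elapsed time, so the first row is the leader.
    """
    times = sorted((laps[lap], driver)
                   for driver, laps in elapsed.items() if lap in laps)
    if not times:
        return []
    leader_time = times[0][0]
    rows, prev_time = [], leader_time
    for t, driver in times:
        rows.append((driver, t - leader_time, t - prev_time))
        prev_time = t
    return rows


# Invented elapsed times (seconds) at the end of laps 1 and 2.
elapsed_times = {
    "A": {1: 100.0, 2: 200.0},
    "B": {1: 102.0, 2: 203.5},
    "C": {1: 101.0, 2: 201.25},
}

lap2_gaps = gaps_on_lap(elapsed_times, 2)
```

Any rounding in the published lap times will accumulate in the summed elapsed times, which is exactly the sort of systematic error worth checking against the official deltas.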

Also note that where lines between two laps cross, we have a change of position between laps.
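Those crossings can be picked out programmatically too. A minimal Python sketch, again with a hypothetical function name and data layout and with invented positions: it compares each pair of drivers on consecutive laps and reports where their relative order flips.

```python
from itertools import combinations


def position_swaps(positions):
    """Return sorted (lap, gainer, loser) triples.

    positions maps driver -> {lap number: position at the end of that lap}.
    A swap is reported on `lap` when two drivers' relative order differs
    between the end of `lap - 1` and the end of `lap`.
    """
    swaps = []
    for a, b in combinations(sorted(positions), 2):
        common = sorted(set(positions[a]) & set(positions[b]))
        for prev, cur in zip(common, common[1:]):
            if cur != prev + 1:
                continue  # only compare consecutive laps
            before = positions[a][prev] - positions[b][prev]
            after = positions[a][cur] - positions[b][cur]
            if before * after < 0:  # sign change means the order flipped
                gainer, loser = (a, b) if after < 0 else (b, a)
                swaps.append((cur, gainer, loser))
    return sorted(swaps)


# Invented positions: B passes A on lap 2 and stays ahead.
race_positions = {
    "A": {1: 1, 2: 2, 3: 2},
    "B": {1: 2, 2: 1, 3: 1},
}

swaps = position_swaps(race_positions)
```

This only says that the order changed somewhere during the lap; it can’t distinguish an on-track pass from a position gained through a pit stop without also looking at the pit stop columns.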

[ADDED] Here’s another view, plotting elapsed time against itself to see where folk are on the track-as-laptime:

Okay, that’s enough from me for now.. Here’s something far more beautiful from @bencc/Ben Charlton that was built on top of the McLaren data…

First up, a 3D rendering of the lap data:

And then a rather nice lap-by-lap visualisation:

So come on F1 teams – give us some higher resolution data to play with and let’s see what we can really do… ;-)

PS I see that Joe Saward is a keen user of Lap charts…. That reminds me of an idea for an app I meant to do for race days that makes grabbing position data as cars complete a lap as simple as clicking…;-) Hmmm….

PPS for another take on visualising the timing data/timing stats, see Keith Collantine/F1Fanatic’s Malaysia summary post.