OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Intellectual Talisman, Baudelaire

leave a comment »

During tumultuous times there is often an individual, an intellectual talisman if you like, who watches events unfold and extracts the essence of what is happening into a text, which then provides a handbook for the oppressed. For the frustrated Paris-based artists battling with the Academy during the second half of the nineteenth century, Baudelaire was that individual, his essay, The Painter of Modern Life, the text.

… He claimed that ‘for the sketch of manners, the depiction of bourgeois life … there is a rapidity of movement which calls for an equal speed of execution of the artist’. Sound familiar? The essay goes on to feature several references to the word ‘flâneur‘, the concept of a man-about-town, which Buudelaire was responsible for bringing to the public’s attention, describing the role thus: ‘Observer, philosopher, flâneur – call him what you will… the crowd is his element, as the air is that of the birds and water of fishes. His passion and his profssion are to become one flesh with the crowd. For the perfect flâneur, for the passionate spectator, it is an immense joy to set up house in the heart of the multitude, amid the ebb and flow of movement, in the midst of the fugitive and the infinite.’

There was no better provocation for the Impressionists to go out and paint en plein air. Baudelaire passionately believed that it was incumbent upon living artists to document their time…

And the way to do that was by immersing oneself in the day-to-day of metroplitan living: watching, thinking, feeling and finally recording. …”

Will Gompertz, What Are You Looking At? 150 Years of Modern Art in the Blink of an Eye pp. 28-29

Written by Tony Hirst

December 19, 2014 at 11:01 am

Posted in Anything you want

A Briefest of Looks at the REF 2014 Results Data – Reflection on a Data Exercise

leave a comment »

At a coursemodule team meeting for the new OU data course […REDACTED…], which will involve students exploring data sets that we’ve given them, as well as ones that they will hopefully find for themselves, it was mentioned that we really should get an idea about how the the exercises we’ve written, are writing, and have yet to write will take students to do.

In that context, I noticed that the UK Higher Education Research Excellence Framework, 2014 results were out today, so I gave myself an hour to explore the data, see what’s there, and get an idea for some of the more obvious stories we might try to pull out.

Here’s as far as I got: an hour long conversation from a standing start with the REF 2014 data.

Although I did have a couple of very minor interruptions, I didn’t get as far as I’d expected/hoped.

So here are a few reflections, as well as some comments on the constraints I put myself under:

  • the students will be working in a headless virtual machine we have provided them with; we don’t make any requirements of students to have access to a spreadsheet application; OpenRefine runs on the VM, so that could be used to get a preview of the spreadsheet (I’m not sure how well OpenRefine copes with multiple sheets, if a spreadsheet does contain multiple sheets?); given all that, I thought I’d try to explore the notebook purely within an IPython notebook, without (as @andysc put it) eyeballing the data in a spreadsheet first;
  • I didn’t really read any of the REF docs, so I wasn’t really sure how the data would be reported. I’m not sure how much time it would have taken to read up on the reporting data used, or what sort of explanatory notes and/or external metadata are provided?
  • I had no real idea what questions to ask or reports to generate. “League tables” was an obvious one, but calculated how? You need to know what numbers are available in the data set and how they may (or may not) relate to each other to start down that track. I guess I could have looked at distributions down a column, and then grouped in different ways, and then started to look for and identify the outliers, at least as visually revealed.
  • I didn’t do any charts at all. I had it half in mind to do some dodged bar charts eg within an institution to show how the different profiles were scored within each unit of assessment for a given institution, but ran out of time before I tried that. (I couldn’t remember offhand what sort of shape the data needs to be in to plot that sort of chart, and then wasted a minute or two, gardener’s foot on fork staring into the distance, pondering what we could do if I cast (unmelted) the separate profile data into different columns for each return), but then decided it’d use up too much of my hour trying to remember/look-up how to do that, let alone then trying to make up stuff to do with the data once it was in that form.
  • The exploration/conversation demonstrated grouping, sorting and filtering, though I didn’t limit the range of columns displayed. I did use a few cribs, both from the pandas online documentation, and referenced other notebooks we have drafted for student use (eg on how to generate sorted group/aggregate reports on dataframe).
  • our assessment will probably mark folk down for not doing graphical stuff… so I’d have lost marks on not just putting a random chart in, such as a bar chart counting numbers of institutions by unit of assessment;
  • I didn’t generate any derived data – again, this is something we’d maybe mark students down on; an example I saw just now in the OU internal report on the REF results is GPA – grade point average. I’m not sure what it means, but while in the data I was wondering whether I should explore some function of the points (eg 4 x (num x 4*) + 3 x (num x 3*) … etc) or some function of the number of FTEs and the star rating results.
  • Mid-way through my hour, I noticed that Chris Gutteridge had posted the data as Linked Data; Linked Data and SPARQL querying is another part of the course, so maybe I should spend an hour seeing what I can do with that endpoint from a standing start? (Hmm.. I wonder – does the Gateway to Research also have a SPARQL endpoint?)
  • The course is about database management systems, in part, but I didn’t put the data into either PostgreSQL or MongoDB, which are the two systems we introduce students to, or discuss rationale for which db may have been a useful store for the data, extent to which normalisation was required, (eg taking the data to third normal form or wherever, and perhaps actually demonstrating that etc). (In the course, we’ll probably also show students how to generate RDF triples they can run their own SPARQL queries against.) Nor did I throw the dataframe in SQLite using pandasql, which would have perhaps made it easier (and quicker?) to write some of the queries using SQL rather than the pandas syntax?
  • I didn’t link in to any other datasets, which again is something we’d like students to be able to do. At the more elaborate end might have been pulling in data from something like Gateway to Research? A quicker hack might have been to try annotating the data with administrative information, which I guess can be pulled from one of the datasets on data.ac.uk?
  • I didn’t do any data joining or merging; again, I expect we’ll want students to be able to demonstrate this sort of thing in an appropriate way, eg as a result of merging data in from another source.
  • writing filler text (setting the context, explaining what you’re going to do, commenting on results etc) in the notebook takes time… (That is, the hour is not just taken up by writing the code/queries; and there is also time spent, but not seen, in coming up with questions to ask, as well as then converting them to queries and and then reading, checking and mentally interpreting the results.)

One thing I suggested to the course team was that we all spend an hour on the data and see what we come up with. Another thing that comes to mind is what might I now be able to achieve in a second hour; and then a third hour. (This post has taken maybe half an hour?)

Another approach might have been to hand off notebooks to each other, the second person building on the first’s notebook etc. (We’d need to think how that would work for time: would the second person’s hour start before – or after – reading the first person’s notebook?) This would in some way model providing the student with an overview of a dataset and then getting them to explore it further, giving us an estimate of timings based on how well we can build on work started by someone else, but still getting us to work under a time limit.

Hmmm.. does this also raise the possibility of some group exercises? Eg one person has to normalise the data and get it into PostgreSQL, someone else get some additional linkable data into the mix, someone to start generating summary textual reports and derived data elements, someone generating charts/graphical reports, someone exploring the Linked Data approach?

PS One other thing I didn’t look at, but that is a good candidate for all sorts of activity, would be to try to make comparisons to previous years. Requires finding and ordering previous results, comparing rankings, deciding whether rankings actually refer to similar things (and extent to which we can compare them at all). Also, data protection issues: could we identify folk likely to have been included in a return just from the results data, annotated with data from Gateway to Research, perhaps, or institutional repositories?

Written by Tony Hirst

December 18, 2014 at 1:59 pm

Sketching Scatterplots to Demonstrate Different Correlations

with 7 comments

Looking just now for an openly licensed graphic showing a set of scatterplots that demonstrate different correlations between X and Y values, I couldn’t find one.

[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]

So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):


  data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
  X = data[, 1]  # standard normal (mu=0, sd=1)
  Y = data[, 2]  # standard normal (mu=0, sd=1)

for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){


g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')

And here’s an example of the result:


It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:


So what sort of OER does that make this post?!;-)

PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?

UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)

//based on http://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184
#The NORmal-To-Anything (NORTA) algorithm

#NORTA - h/t rseiter
corrdata2=function(samples, r){
  mu <- rep(0,4)
  Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r)
  rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma)
  #unifvars <- pnorm(rawvars)
  unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions

for (i in c(1,0.9,0.6,0.4,0)){
g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()

Here’s what it looks like with 1000 points:


Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…


The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…

If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this:


Hmmm. Not sure about that…?

Written by Tony Hirst

December 17, 2014 at 1:24 pm

Posted in Anything you want, Rstats

Tagged with

Seven Ways of Running IPython Notebooks

leave a comment »

We’re looking at using IPython notebooks for a MOOC on something or other, so here’s a quick review of the different ways I think we can provide access to them. Please let me know via the comments if there are other models…

User Desktop – Native App

Download a version of a scientific python distribution such as Anaconda (python 2.7 & 3) or Enthought Canopy (python 2.7) and run the notebook from within that. Runs cross platform, requires user admin privileges to install application.

In the context of a MOOC, this approach would require participants to download and install the python distribution on a desktop or laptop computer. This approach will not work on a tablet.
Delivery/use costs: none.
Publisher demands: notebook development.
Maintenance: dependence on distribution provider.
Support issues: installation of 3rd party software.
Custom branding/styling/extensions: can be installed after running a config script.

Browser Extension

The CoLaboratory (about) Chrome extension allows you to run IPython notebooks within the Chrome browser without the need to install any other software. Notebooks are saved to/opened from Google Drive. Python 2.7(?). Requires Google Chrome (cross-platform), Google Account (for Google Drive integration). This approach does not support the installation of arbitrary third party python libraries – only libraries compiled into the extension will work.

In the context of a MOOC, this approach would require participants to download and install Chrome and the CoLaboratory extension. This approach will not work on a tablet.
Delivery/use costs: none.
Publisher demands: notebook development.
Maintenance: dependence on extension publisher; (code available to fork).
Support issues: installation of 3rd party software.
Custom branding/styling/extensions: not supported(?)

iOS App

An iOS app, Computable (blog) makes notebooks available on an iPad. The app is free to preview notebooks bundled with the app, but requires a $10 in-app purchase to run your own notebooks. Integrated with Dropbox. Source code not available. Other reports of demos – but again, no code – available, such as: IPython notebook on iPhone. I don’t know if this approach supports the installation of arbitrary third party python libraries; that is, I don’t know if only libraries compiled into the application will work.

In the context of a MOOC, this approach would require participants to download and install the app and then pay the $10 fee. A Dropbox account is also required(?). This approach will only work on an iOS device.
Delivery/use costs: $10 to student.
Publisher demands: notebook development.
Maintenance: dependence on app publisher.
Support issues: installation of 3rd party software.
Custom branding/styling/extensions: not supported(?)

User Desktop – Virtual Machine

Run an IPython server within a virtual machine on the user’s desktop, exposing the notebook via a browser. Requires a virtual machine runner (eg VirtualBox, VMWare), port forwarding; some mechanism for starting up the VM and auto-running the notebook server. Devops tools may be used to deploy VMs either locally or in the cloud (eg JiffyLab; the OU course TM351 (in production) is currently exploring the use of VMs managed using vagrant to deliver IPython notebooks along with a range of other services).

Pre-defined IPython notebook VM examples: docker: ipython/notebook; Chef: IPython notebook cookbook.

In the context of a MOOC, this approach would require participants to download and install a virtual machine runner and the virtual machine image. This approach will not work on a tablet.
Delivery/use costs: none.
Publisher demands: development of VM image.
Maintenance: tracking VM runner compatability.
Support issues: installation of 3rd party software; installation of VM; port forwarding conflicts, locating the notebook server address.
Custom branding/styling/extensions: yes.

In the Cloud – Managed Services

Open a notebook running on a managed IPython notebook server such as Wakari or Authorea. Free plans available, requires personal account on managed service, internet connection.

In the context of a MOOC, this approach would require participants to have access to a network connection and create an account on a provider service. This approach will work on any device.
Deliver/use costs: Notebooks created under free plans will be public.
Custom branding/styling/extensions: no – branding may be associated with provider; provider identified extensions may be offered as part of hosted service.

In the Cloud – Ad Hoc tmp Notebooks

Nature recently did a splash on IPython notebooks (Interactive notebooks: Sharing the code) that included a demo notebook for readers to experiment with: IPython interactive demo.

The system uses a ‘temporary notebook’ server described here: Instant Temporary IPython Notebooks (code: Jupyter tmpnb). The sever spawns temporary notebooks that can be used to run particular activities. Other example use cases include the provision of notebooks at conferences/workshops for demo purposes, hour of code demos etc.

In the context of a MOOC, this approach would require participants to have access to a network connection. Notebooks will be temporary and cannot be persisted. This approach would be ideal for one off activities. This approach will work on any device. My assumption is that we would not be able direct participants to the current 3rd party provider of this service without prior agreement.
Delivery/use costs: server provision for running notebooks.
Publisher demands: management of cloud services if self-hosted.
Maintenance: tracking VM runner compatability.
Support issues:
Custom branding/styling/extensions: no.

In the Cloud – Hosted VMs

Another virtual machine approach, but participants fire up a prebuilt virtual machine in the cloud. Examples include: Yhat Sciencebox. Another example is the notebookcloud, a semi-managed service built around Google authentication and user’s personal AWS credentials; ((forked) legacy code). (From the docs: NotebookCloud is service that allows you to launch and control IPython Notebook servers on Amazon EC2 from your browser. This enables you to host your own Python programming environment, on your own Amazon virtual machine, and access it from any modern web browser.)

In the context of a MOOC, this approach would require participants to have access to a network connection and create an account on a provider service. This approach will work on any device.
Delivery/use costs: participant pays machine, storage and bandwidth costs according to usage.
Publisher demands: management of cloud services if self-hosted. Alternatively, we could host a version of NotebookCloud.
Maintenance: potential reliance on 3rd party VM configuration.
Support issues: account creation, VM management.
Custom branding/styling/extensions: no – branding may be associated with provider; provider identified extensions may be offered as part of hosted service.

Hosted MultiUser Systems

IPython and IPython Notebooks are under active development and it looks as if future releases will support multi-user services, for example JupyterHub: A multi-user server for Jupyter notebooks.


In order to make interactive use of IPython notebooks within a course, we need to make notebooks available, and provide a way of running them. A wide variety of models that support the running of notebooks exists. We can distinguish: where does the notebook server run; where does it load pre-existing notebooks from; where does it save notebooks to.

Running a notebook server incurs a resource cost in terms of installation, maintenance, (remote) access (eg managing multiple instances, port availability etc when offering a hosted service), any financial costs associated with running the service. Who covers the costs and meets any load issues depends on the solution adopted.

As far as usage goes, notebooks are accessed via a web browser and as such are accessible on any device. However, if a server runs locally rather than on a remote host, there are device dependencies to contend with that rule out some solutions on some platforms (VM can’t run on iOS; iOS app won’t run on Windows machine etc).

In an online course environment, we may be able separate concerns and suggest a variety of ways of running notebooks to participants, leaving it up to each participant to find a way of running the notebooks that works for them. Duty of care might then extend only insofar as making notebooks available that will run on all the platforms we have recommended.

In terms of pedagogy, we might distinguish between notebooks as used:

  1. to run essentially standalone exercises, and which we might treat as disposable (a model which the tmpnb solution fits beautifully because no persistence of state is required); this exercises might be pre-prepared notebooks made available to participants, or user created notebooks that users create as a scratch pad to run through a particular set of activities described elsewhere;
  2. for activities where the participant may want to save a work in progress and return to it at a later date (which requires persistence of state on a per user basis); this might be in the context of a notebook we provide to the students, or one they have created and are working on themselves;
  3. for activities where participants create their own notebooks and wish to preserve them.

From the OU perspective, we should probably take a view on the extent to which we develop solutions that work across one or more contexts, such as MOOCs delivered ex- of OU provisioned services (eg FutureLean); MOOCs delivered via an OU context (eg OpenLearn); courses delivered to OU fee paying students via OU systems.

There may be opportunities to develop solutions that work to support the delivery of OU courses, as well as OU MOOCs, and that are further licensable to other institutions, eg to support their own course delivery or MOOC delivery, or appropriate for release as open software in support of the wider community.

Written by Tony Hirst

December 12, 2014 at 12:37 pm

Posted in OU2.0

Tagged with ,

Thoroughly Confused About Student VMs & Docker

with 11 comments

The story so far… We’re looking at using a virtual machine (VM) preconfigured with all sorts of software and services for a distance education course. The VM runs in headless mode (no graphical desktop) and exposes the applications we want students to be able to run as services accessed through a web browser. The VM is built from a vagrant script using puppet, in order to support maintenance and as a demonstration of a potentially generic production model. It also means that we should be able to build VMs for different VM runners (Virtualbox, VMWare etc), as well as being able to generate machine images for use in cloud hosted VMs as well as getting them up and running. The activities to be run inside the VM include a demonstration of a distributed MongoDB network into which network partitions are introduced. The separate database instances run inside individual docker containers and firewall rules are used to network partition them.

The set-up looks something like this – green blocks are services running in the VM, orange blocks are containers:


The containers are started up from within the IPython notebook using a python wrapper to docker.io.

One problem with this approach is that we have two Mongo DB downloads – one in the VM and one for use in containers. This makes me think that it might make more sense to run all the applications in containers of their own. For example, the notebook server in a container of its own, the original MongoDB instance in a container of its own, PostgreSQL in a container of its own, and finally a data container or other area of the VM that can be used to persist data within the VM such that it can be accessed by one or more of the other services as and when required.

I’m not sure what the best strategy would be for persisting state, for example as used by the database services? If we mount a database’s datastore in a volume within the VM, we can destroy the container that is used to run a database service and our database datastore will be preserved. If the mounted volume is located in the VM/host share area, if the VM itself is destroyed the data will persist in a volume on the host. This is perhaps a bit scrappy because it means that we might nominally take up a large amount of space “on the host”, compared with the situation of providing students with a pre-populated database in which the data volume was mounted “inside” the VM proper (i.e. away from the share area). In such a case, it would nice if any data tables that students built were mounted into the host shared area (so the students can clearly “retain ownership” of those tables), but I suspect that DBMS don’t like putting different data tables or databases into different volumes..?

This approach to using docker is perhaps at odds with the typical way of thinking about how we might make use of it. It is very much in the style of a VM acting as an app runner for a host, and docker containers being used to run the individual apps.

One problem with this approach is that we need a control panel that:

  • is always running;
  • is exposed via an http/HTML service that can be accessed on the host machine;
  • allows students to bring up and shut down containers/services/”apps” as required.

The container approach is nice because if something breaks with one of the databases, for example, it should be easy for a student to switch it off and switch it back on again… That said, if we are popping up and ripping down containers, we need to work out a connection manager so that eg IPython knows where to find a particular database (I think that docker starts services up on essentially arbitrary internal IP addresses?). This could possibly be done by naming docker processes sensibly and then creating a little python library to look up the docker processes by name so that we know where to find them?

As ever, there are tradeoffs in terms of making this easy approach easy for us (as production engineers), making them easy for students, and making them easy for the helpdesk to support…

I don’t know enough about any of this stuff to know whether it makes sense rebuilding the VM we currently have, in which services roaming around the base VM, with a completely containerised version. The core requirement from the user perspective is that a student should be able to download a base box, fire it up as easily as possible in Virtualbox, and then access a control panel (ideally in a browser) that allows them to start up and shut down applications-in-containers, as well as seeing a clear dashboard view of what services are up and running and what (localhost) ports they can be accessed on through their browser.

If you can help talk me through what the issues are or might be, and whether any of the above makes sense or is complete nonsense, I would be most grateful…

Written by Tony Hirst

December 10, 2014 at 2:34 pm

Posted in Anything you want, OU2.0

Tagged with

Crowd-Sourced Social Media Subtitling of the F1 Video Archive – Or Not…

with 2 comments

As a team entry with Martin Hawksey, we put in an entry to the third Tata F1 Connectivity Prize Challenge, which was to catalogue the F1 video archive. We didn’t win (prizewinning entries here), but here’s the gist of our entry… You may recognise it as something we’d bounced ideas around with before…

Social Media Subtitling of F1 Races

Every race weekend, a multitude of F1 fans informally index live race coverage. Along with broadcasters and F1 teams, audiences use Twitter and other social media platforms to generate real-time metadata which could be used to index video footage. The same approach can be used to index the 60,000 hours of footage dating back to 1981.

We propose an approach that harnesses the collection of race commentaries with the promotion of the race being watched, an approach referred to as social media subtitling. Social media style updates collected while a race is being watched are harvested and used to provide commentary-like subtitles for each race. These subtitles can then be used to index and search into each race video.


Annotating Archival Footage

Audiences log in to an authenticated area using an account linked to one or more of their linked social media profiles. They select a video to watch and are presented with a DRM-enabled embedded streaming video player. An associated text editor can be used to create the social media subtitles. On starting to type into the text editor, a timestamp is grabbed from the video for a few seconds before the typing started (so a replay of the commented event can be seen) and associated with the text entry. On posting the subtitle, it is placed into timestamped comment database. Optionally, the comment can be published via a public social media account with an appropriate hashtag and a link to the timestamped part of the video. The link could lead to an authentication page to gain access to the video and the commentary client, or it may lead to a teaser video clip containing a second or two of the commented upon scene.


Examples of the evolution of the original iTitle Twitter subtitler by M. Hawksey, showing: timestamped social media subtitle editor linked to video player; searching into a video using timestamped social media updates; video transcript from social media harvested updates collected in realtime.

The subtitles can be searched and used to act as a timestamped text index describing the video. Subtitles can also be used to generate commentary transcripts.

If a fan watches a replay of a race and comments on it using their own social media account, a video start time tweet could be sent from the player when they start watching the video (“I’ve just started watching the XXXX #F1 race on #TataF1.tv [LINK]”). This tweet then acts to timestamp their social media updates relative the corresponding video timestamp as well as publicising the video.

Subtitle Quality Control

The quality of subtitles can be controlled in several ways:

  • a stream of subtitles can be played alongside the video and liked (positively scored) or disliked (negatively scored) by a viewer. (There is a further opportunity here for liked comments to be shared to public social media (along with a timestamped link into the video).) This feedback can also be used to generate trust ratings for commenters (someone whose comments are “liked” by a wide variety of people may be seen as providing trusted commentary);
  • text mining / topic modeling of aggregated comments around the same time can be used to identify crowd consensus topic or keywords.

If available, historical race timing data may be used to confirm certain sorts of information. For example, from timing sheets we can get data about pitstops, or the laps on which cars exited a race from an accident or mechanical failure. This information can be matched to the racetime timestamp of a comment; if comment topics match timing data identified events at about the right time, those comments can automatically be rated positively.

Making Use of Public Social Media Updates

For current and future races, logging social media updates around live races provides a way of bootstrapping the comment database. (Timestamps would be taken as realtime updates, although offsetting mechanisms to account for several second delays in digital TV feeds, for example, would need to be accounted for.) Feeds from known F1 journalists, race teams etc. would be taken as trusted feeds. Harvesting hashtagged feeds from the wider F1 audience would allow the collection of race comment social media updates more widely.


Social media updates can also be harvested in real time around live races or replayed races if we know the video start time.

For recent historical races, archived social media updates, as for example collected by Datasift, could be purchased and used to bootstrap the social media subtitle database.

Race Club

Social media subtitling provides a great opportunity for social activity. Groups of individuals can choose to watch a race at the same time, commenting to each other either through the bespoke subtitler client or by using public social media updates and an appropriate hashtag. If a user logs in to the video playback area, timestamps of concurrent updates from their linked public social media accounts can be reconciled with timestamps associated with the streamed video they are watching in the authenticated race video area.

In the off-season, or in the days leading up to a particular race “Historic Race Weekend” videos could be shown, perhaps according to a streamed broadcast model. That is, a race is streamed from within the authenticated area at a particular set time. Fans watch this scheduled event (under authentication) but comment on it in public using social media. These updates are harvested and the timestamps reconciled with the streamed video.


Social media subtitling draws on the idea that social media updates can be used to provide race commentary. Live social media comments collected around live events can be used to bootstrap a social media commentary database. Replayed streamed events can be annotated by associating social media update timestamps with known start/stop times of video replays. A custom client tied to a video player can be used to enter commentary directly to the database as well as issuing it as a social media update.

Team entry: Tony Hirst and Martin Hawksey

PS Rather than referring to social media subtitles and social media subtitling, I think social media captions and social media captioning is more generic?

Written by Tony Hirst

December 10, 2014 at 11:24 am

Posted in Anything you want

Tagged with

Identifying Position Change Groupings in Rank Ordered Lists

leave a comment »

The title says it all, doesn’t it?!

Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.


The question I had in mind was – how can I identify positions that are being contested during a particular window of time, where by contested I mean that the particular position was held by more than one person in a particular window of time?

Let’s zoom in to look at a couple of particular steps.


We see distinct groups of individuals who swap positions with each other between those two consecutive steps, so how can we automatically detect the positions that these drivers are fighting over?

A solution given to a Stack Overflow question on how to get disjoint sets from a list in R gives what I thought was a really nice solution: treat it as a graph, and then grab the connected components.

Here’s my working of it. Start by getting a list of results that show a particular driver held different positions in the window selected – each row in the original dataframe identifies the position held by a particular driver at the end of a particular lap:

ergastdb =dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Get a race identifier for a specific race
                  'SELECT raceId FROM races WHERE year="2012" AND round="1"')

q=paste('SELECT * FROM lapTimes WHERE raceId=',raceId[[1]])



#Sort by lap first just in case

#Create a couple of new columns
#pre is previous lap position held by a driver given their current lap
#ch is position change between the current and previous lap

#Find rows where there is a change between a given lap and its previous lap
#In particular, focus on lap 17
llx=tlx[tlx['ch']!=0 & tlx['lap']==17,c("position","pre")]


This filters the complete set of data to just those rows where there is a difference between a driver’s current position and previous position (the first column in the result just shows row numbers and can be ignored).

##      position pre
## 17          2   1
## 191        17  18
## 390         9  10
## 448         1   2
## 506         6   4
## 719        10   9
## 834         4   5
## 892        18  19
## 950         5   6
## 1008       19  17

We can now create a graph in which nodes represent positions (position or pre values) and edges connect a current and previous position.


posGraph = graph.data.frame(llx)


The resulting graph is split into several components:


We can then identify the connected components:

posBattles=split(V(posGraph)$name, clusters(posGraph)$membership)
#Find the position change battles
for (i in 1:length(posBattles)) print(posBattles[[i]])

This gives the following clusters, and their corresponding members:

## [1] "2" "1"
## [1] "17" "18" "19"
## [1] "9"  "10"
## [1] "6" "4" "5"

To generalise this approach, I think we need to do a couple of things:

  • allow a wider window within which to identify battles (so look over groups of three or more consecutive laps);
  • simplify the way we detect position changes for a particular driver; for example, if we take the set of positions held by a driver within the desired window, if the cardinality of the set (that is, its size) is greater than one, then we have had at least one position change for that driver within that window. Each set of size > 1 of unique positions held by different drivers can be used to generate a set of distinct, unordered pairs that connect the positions (I think it only matters that they are connected, not that a driver specifically went from position x to position y going from one lap to the next?). If we generate the graph from the set of distinct unordered pairs taken across all drivers, we should then be able to identify the contested/driver change position clusters.

Hmm… I need to try that out… And when I do, if and when it works(?!), I’ll add a complete demonstration of it – and how we might make use of it – to the Wrangling F1 Data With R book.

Written by Tony Hirst

December 9, 2014 at 10:44 am

Posted in f1stats, Rstats


Get every new post delivered to your Inbox.

Join 865 other followers