Together with Martin Hawksey, I put in a team entry to the third Tata F1 Connectivity Prize Challenge, which was to catalogue the F1 video archive. We didn’t win (prizewinning entries here), but here’s the gist of our entry… You may recognise it as something we’ve bounced ideas around before…
Social Media Subtitling of F1 Races
Every race weekend, a multitude of F1 fans informally index live race coverage. Along with broadcasters and F1 teams, audiences use Twitter and other social media platforms to generate real-time metadata which could be used to index video footage. The same approach can be used to index the 60,000 hours of footage dating back to 1981.
We propose an approach that couples the collection of race commentaries with the promotion of the race being watched, an approach referred to as social media subtitling. Social media style updates collected while a race is being watched are harvested and used to provide commentary-like subtitles for each race. These subtitles can then be used to index and search into each race video.
Annotating Archival Footage
Audiences log in to an authenticated area using an account linked to one or more of their social media profiles. They select a video to watch and are presented with a DRM-enabled embedded streaming video player. An associated text editor can be used to create the social media subtitles. When the viewer starts typing into the text editor, a timestamp is grabbed from the video for a point a few seconds before the typing started (so a replay of the commented-on event can be seen) and associated with the text entry. When the subtitle is posted, it is placed into a timestamped comment database. Optionally, the comment can be published via a public social media account with an appropriate hashtag and a link to the timestamped part of the video. The link could lead to an authentication page to gain access to the video and the commentary client, or it may lead to a teaser video clip containing a second or two of the commented-upon scene.
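By way of illustration, here is a minimal sketch of the capture step described above. It is not code from the entry: the `SubtitleStore` class, the five second pre-roll and the `1994-monaco` video id are all assumptions made up for the example, and an in-memory list stands in for the comment database.

```python
from dataclasses import dataclass, field

PRE_ROLL_SECONDS = 5.0  # assumed value; the proposal only says "a few seconds"


@dataclass
class Subtitle:
    video_id: str
    video_time: float  # seconds from the start of the video
    author: str
    text: str


@dataclass
class SubtitleStore:
    """In-memory stand-in for the timestamped comment database (hypothetical)."""
    rows: list = field(default_factory=list)

    def post(self, video_id, typing_started_at, author, text):
        # Wind back from the moment typing started, so that replaying from
        # the stored timestamp shows the event being commented on.
        start = max(0.0, typing_started_at - PRE_ROLL_SECONDS)
        row = Subtitle(video_id, start, author, text)
        self.rows.append(row)
        return row

    def search(self, video_id, term):
        # Case-insensitive substring match, returned in playback order.
        hits = [r for r in self.rows
                if r.video_id == video_id and term.lower() in r.text.lower()]
        return sorted(hits, key=lambda r: r.video_time)


store = SubtitleStore()
store.post("1994-monaco", 3.0, "fan_a", "Lights out, clean start")
store.post("1994-monaco", 612.4, "fan_b", "Pit stop for the leader")
hits = store.search("1994-monaco", "pit")
```

Searching for "pit" returns the second comment, with a timestamp wound back to just before the viewer started typing, which is the point a player would seek to.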
Examples of the evolution of the original iTitle Twitter subtitler by M. Hawksey, showing: timestamped social media subtitle editor linked to video player; searching into a video using timestamped social media updates; video transcript from social media harvested updates collected in realtime.
The subtitles can be searched and used to act as a timestamped text index describing the video. Subtitles can also be used to generate commentary transcripts.
If a fan watches a replay of a race and comments on it using their own social media account, a video start time tweet could be sent from the player when they start watching the video (“I’ve just started watching the XXXX #F1 race on #TataF1.tv [LINK]”). This tweet then acts to timestamp their social media updates relative to the corresponding video timestamp, as well as publicising the video.
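The reconciliation step is just a subtraction. A rough sketch, assuming we have the wall-clock time of the start tweet and of each later update (the function name and the optional `broadcast_delay` allowance are my own inventions for the example):

```python
from datetime import datetime, timezone


def video_offset(update_time, video_start_time, broadcast_delay=0.0):
    """Seconds into the video that a social media update refers to.

    broadcast_delay is an assumed, optional allowance for a viewer seeing
    the pictures a few seconds late. Returns None if the update falls
    before the start of the video.
    """
    offset = (update_time - video_start_time).total_seconds() - broadcast_delay
    return offset if offset >= 0 else None


# Times below are invented for the example.
start = datetime(2014, 3, 1, 14, 0, 0, tzinfo=timezone.utc)
update = datetime(2014, 3, 1, 14, 12, 30, tzinfo=timezone.utc)
offset = video_offset(update, start)
delayed = video_offset(update, start, broadcast_delay=8.0)
```

Pausing and scrubbing would break this simple model, so a real player would presumably need to send further marker updates whenever playback position and wall-clock time drift apart.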
Subtitle Quality Control
The quality of subtitles can be controlled in several ways:
- a stream of subtitles can be played alongside the video and liked (positively scored) or disliked (negatively scored) by a viewer. (There is a further opportunity here for liked comments to be shared to public social media (along with a timestamped link into the video).) This feedback can also be used to generate trust ratings for commenters (someone whose comments are “liked” by a wide variety of people may be seen as providing trusted commentary);
- text mining / topic modeling of aggregated comments around the same time can be used to identify crowd consensus topic or keywords.
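As a very crude stand-in for the text mining idea (simple keyword counting, not proper topic modelling), the following sketch looks for words used by several different commenters around the same video time. The stopword list, window size and sample comments are all assumptions for the example.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "in", "on", "for", "and", "to", "of", "is", "at"}


def consensus_keywords(comments, centre, window=15.0, min_authors=2):
    """comments: iterable of (video_time, author, text) tuples.

    Returns the words used by at least min_authors distinct authors within
    +/- window seconds of centre, most widely used first.
    """
    authors_by_word = {}
    for video_time, author, text in comments:
        if abs(video_time - centre) > window:
            continue
        words = set(re.findall(r"[a-z0-9']+", text.lower())) - STOPWORDS
        for word in words:
            authors_by_word.setdefault(word, set()).add(author)
    counts = Counter({w: len(a) for w, a in authors_by_word.items()})
    return [w for w, n in counts.most_common() if n >= min_authors]


comments = [
    (600.0, "fan_a", "Hamilton pits for new tyres"),
    (604.0, "fan_b", "Pit stop for Hamilton"),
    (609.0, "fan_c", "Hamilton in the pits"),
    (900.0, "fan_d", "Safety car"),
]
keywords = consensus_keywords(comments, centre=605.0)
```

Counting distinct authors rather than raw word frequency means one excitable commenter can't manufacture a consensus on their own. Note that "pit" and "pits" are counted separately here; stemming would be an obvious next step.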
If available, historical race timing data may be used to confirm certain sorts of information. For example, from timing sheets we can get data about pitstops, or the laps on which cars exited a race from an accident or mechanical failure. This information can be matched to the racetime timestamp of a comment; if comment topics match timing data identified events at about the right time, those comments can automatically be rated positively.
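A sketch of how that matching might work, assuming the timing sheet events have already been turned into (race time, keyword list) pairs. The function name, the 20 second tolerance, the keyword lists and the sample data are all hypothetical.

```python
def rate_against_timing(comments, events, tolerance=20.0):
    """comments: list of (race_time, text); events: list of (race_time, keywords).

    Returns the indices of comments that mention an event keyword within
    `tolerance` seconds of that event, i.e. the ones we might rate up.
    """
    confirmed = []
    for i, (c_time, text) in enumerate(comments):
        lowered = text.lower()
        for e_time, keywords in events:
            close = abs(c_time - e_time) <= tolerance
            # Plain substring matching: crude, "out" would also match "about".
            if close and any(k in lowered for k in keywords):
                confirmed.append(i)
                break
    return confirmed


events = [
    (1812.0, ("pit", "stop", "tyres")),    # a pitstop from the timing sheet
    (2950.0, ("retire", "crash", "out")),  # a car exiting the race
]
comments = [
    (1820.0, "Button pits from second"),
    (1825.0, "What a lovely day"),
    (2400.0, "Another pit stop"),
    (2960.0, "He's out of the race!"),
]
confirmed = rate_against_timing(comments, events)
```

The third comment mentions a pit stop but at a time when the timing data shows none, so it is left unrated rather than marked down; it may simply refer to a different car.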
Making Use of Public Social Media Updates
For current and future races, logging social media updates around live races provides a way of bootstrapping the comment database. (Timestamps would be taken as realtime updates, although offsetting mechanisms would be needed to account for delays of several seconds in digital TV feeds, for example.) Feeds from known F1 journalists, race teams etc. would be taken as trusted feeds. Harvesting hashtagged feeds from the wider F1 audience would allow the collection of race comment social media updates more widely.
Social media updates can also be harvested in real time around live races or replayed races if we know the video start time.
For recent historical races, archived social media updates, as for example collected by Datasift, could be purchased and used to bootstrap the social media subtitle database.
Social media subtitling provides a great opportunity for social activity. Groups of individuals can choose to watch a race at the same time, commenting to each other either through the bespoke subtitler client or by using public social media updates and an appropriate hashtag. If a user logs in to the video playback area, timestamps of concurrent updates from their linked public social media accounts can be reconciled with timestamps associated with the streamed video they are watching in the authenticated race video area.
In the off-season, or in the days leading up to a particular race, “Historic Race Weekend” videos could be shown, perhaps according to a streamed broadcast model. That is, a race is streamed from within the authenticated area at a particular set time. Fans watch this scheduled event (under authentication) but comment on it in public using social media. These updates are harvested and the timestamps reconciled with the streamed video.
Social media subtitling draws on the idea that social media updates can be used to provide race commentary. Live social media comments collected around live events can be used to bootstrap a social media commentary database. Replayed streamed events can be annotated by associating social media update timestamps with known start/stop times of video replays. A custom client tied to a video player can be used to enter commentary directly to the database as well as issuing it as a social media update.
Team entry: Tony Hirst and Martin Hawksey
PS Rather than referring to social media subtitles and social media subtitling, I think social media captions and social media captioning is more generic?
I often get quizzical looks when I drop F1 related visualisations into random presentations (“Tony slacking around again”), whereas if I said “Raspberry Pi” then it would somehow be rather more legitimate… However, one of the ways I see it is that I’m trying to engage in an informal way with a large audience in a target demographic, a significant proportion of which are prequalified as ‘interested in STEM’. I’m also trying to engage, albeit slackly, in some sort of weak knowledge transfer (hey, motor racing folk: you increasingly haz data, and maybe there are ways of visualising it to try and gain value from it that you haven’t really thought about yet…)
In case you didn’t already know, motorsport is worth shedloads* to quite a lot* of UK companies in both domestic and export sales and employs probably more than seven* people. (*Official trade association stats.)
Anyway, what prompted this post? This did:
learndirect as sponsor of the Marussia F1 Team?!
I have to admit, for some reason I associate learndirect with DirectGov, the government one stop-shop (will gov.uk be rebranded as DirectGov when it comes out of beta, I wonder? Or will DirectGov go the way of the open2.net and be quietly run down and then out?!)… but the truth of the matter is that learndirect is a private equity operated outfit, “the UK’s leading online learning provider”, apparently, “[acquired in] October 2011 [by LDC] … in a transaction valued in the order of £40 million.” LDC Portfolio: learndirect.
Ah, here’s where my memory tricked me (like it does with supermarket and bank “promises”…): “LDC bought learndirect by acquiring its parent Ufi Limited from the Ufi Charitable Trust (UCT). UCT, a registered charity, was set up in 1998 to use new technology to transform the delivery of learning and skills.” Ufi, of course, was the University for Industry, an ill-fated government venture that I seem to remember the OU partnered to a certain extent…
So why would LDC be splashing the learndirect brand all over the MarussiaF1 racing car (aside from the fact that learndirect owners LDC also have a stake in the Marussia F1 team (one aim of which is to “meet our latent sponsorship potential”, which presumably means getting sponsorship mileage for other LDC companies?), as well as having at least one person on both the learndirect and Marussia Virgin(?, or should that be F1?) Racing boards)…?
And there was me thinking there were absolutely no opportunities for wrangling F1 freebies, seeing as I am stuck in the education sector… Hmmm… time to dig out some of my old science, technology, engineering and maths outreach pitches, maybe…?! (If anyone at the Marussia F1 Racing team fancy chatting about exploring the use of data visualisation either for outreach, or maybe in research, please feel free to get in touch…:-) The (nearby, Milton Keynes based) OU also has various lab facilities and experience in instrumentation (including space flown instruments – so good on the heat, mass, volume and vibration front, I’m guessing…?), materials and CFD (though I suspect too much CFD may be something of a sore point!?), and I’ll happily put you in touch with folk who can tell you more if you’re interested…;-) There’s also some experience in Twitter audience interest profiling, heh heh;-)
PS MarussiaF1 also happen to have appointed a female test driver, Maria de Villota, which may or may not also be a good thing as far as WISE-like initiatives go (I know the drivers aren’t engineers, but it’s an aspiration-related funnel thing; see also James Allen on Why aren’t there more women engineers in F1, where he writes: “F1 in Schools has a very high ratio of female competitors, around 35%, and all-girl teams are quite common. And yet when they get to around 15 years of age, the numbers fall away and few girls pursue engineering degrees.”)
PPS During National Motorsport Week last year, I won a trip round the Marussia(-Virgin, as it then was) F1 factory in Dinnington, near Sheffield (it’s since moved to Banbury; the factory, that is, not Dinnington…;-). Here’s the obligatory blog post: Marussia Virgin Racing F1 Factory Visit. Btw, National Motorsport Week runs again this year too: National Motorsport Week 2012.
PPPS this reminds me of a noticing by @barnstormed (?) a couple of weeks ago that the OU had an ad on ?rotating digital hoardings during Six Nations rugby? (Confirmed by @stuartbrown: “the ou was advertising on boards during scot vs england in the 6 nations rugby”. Photo of that anyone?) Anyone got other examples of education related orgs sponsoring sports to a significant extent?
Yesterday, I had the good fortune to visit the F1 Marussia Virgin Racing factory at Dinnington, near Sheffield, as a result of “winning” a lucky dip competition run via GoMotorSport (part of a series of National Motorsport Week promotions being run by the F1 teams based in the UK).
[Thanks to @markhendy for the pic…]
Thanks to Finance Director Mark Hendy and engineer Shakey for the insight into the team’s operations:-)
Over the next few days and weeks, I’ll try to pick up on a few of the things I learned from the tour on the F1DataJunkie blog, tying them in to the corresponding technical regulations and other bits and pieces, but for now, here are some of the noticings I came away with…
– the engines aren’t that big, weighing 90kg or so and looking smaller than the engine in my own car…
– wheels are slotted onto the axles using a 3 pin mount on the front and a six(?) pin mount on the rear. (The engines are held on using a 6(?) point fixing.)
– the drivers aren’t that heavy either, weight wise (not that we met either of the drivers: neither Timo Glock nor Jerome D’Ambrosio is a frequent visitor to the Dinnington factory, where the team’s cars are prepared before, and overhauled after, each race…): 70 kg or so. With cars prepared to meet racing weight regulations to a tolerance of 0.5kg or so, a large mixed grill and a couple of pints can make a big difference… (Hmm, I guess it would be easy enough to calculate the “big dinner weight effect” penalty on laptime?!)
I’m not sure if this was a “right-handed vs left-handed spanner” remark, but a comment was also made that the adhesive sponsor sticker can have a noticeable effect on the car’s aerodynamics as the corners become unstuck and start to flap. (Which made me wonder, if that is the case, is the shape of stickers taken into account? Is a leading edge on a label with a point/right angled corner rather than a smooth curve likely to come unstuck more easily, for example?!) Cars also need repainting every few races (stripping back to the carbon, and repainting afresh) because of pitting and chipping and other minor damage that can affect smooth airflow.
– side impact tubes are an integral part of the safety related design of the car:
– to track the usage of tyres during a race weekend, an FIA official scans a barcode on each tyre as it is used on the car:
The data junkie in me in part wonders whether this data could be made available in a timely fashion via the Pirelli website (or a Pirelli gadget on each team’s website) – or would that be giving away too much race intelligence to the other teams? That way, we could get an insight into the tyre usage over the course of the weekend…
– IT plays an increasingly important part in the pit garage setup; local area networks (cabled and wifi?) are set up by each team for the weekend, the data engineers sitting behind the screen and viewing area in the garage (rather than having a fixed set up in one of the 5(?) trucks that attends each race).
– the cars are rigged up with 60 or so sensors; there is only redundancy on throttle and clutch sensors. Data analysis is in part provided through engineers provided by parts suppliers: McLaren Electronics, who supply the car’s ECU (and telemetry box(?)), provide a dedicated person(?) to support the team, and data analysis is, in part, carried out using the Atlas (9?) Advanced Telemetry Linked Acquisition System from McLaren Electronic Systems. Data collected during a stint is transmitted under encryption back to the pits, as well as being logged on the car itself. A full data dump is available to the team and the FIA scrutineers via an umbilical/wired connection when the car is pitted.
UST Global, one of the team’s partners, also provide 3(?) data analysts to support the team during a race (presumably using UST Global’s “Race Management System”?).
– for design and testing, weekly reporting is required that conforms to a trade-off between the number of hours per week that each team can spend on wind tunnel testing (60 hours per week) and CFD (“can’t find downforce”;-) simulation (40 teraflops per week). My first impression there was that efficient code could effectively mean more simulation testing?! (CFD via CSC? CSC expands relationship with Marussia Virgin Racing, doubling computing power for the team’s 2011 formula 1 season, or are things set to change with the replacement of Nick Wirth by Pat Symonds…?)
– the resource restriction agreement also limits the number of people who can work on the chassis. For a race weekend, teams are limited to 50 (47?) people. We were given a quick run down of at least (8?) engineer roles assigned to each car, but I forget them…
So – that’s a quick summary of some of the things I can remember off the top of my head…
…but here are a couple of other things to note that may be of interest…
Marussia Virgin are making the most of their Virgin partnership over the Silverstone race weekend with a camping party/Virgin Experience at Stowe School (Silverstone Weekend) and a hook-up with Joe Saward’s “An Audience With Joe“… (If you don’t listen to @sidepodcast’s An Aside With Joe podcast series, you should…;-)
The team has also got an education thing going with race ticket sweeteners for folk signing up to the course: Motorsport Management Online Course.
I can’t help thinking there may be a market for a “hardcore fans” course on F1 that could run over a race season and run as an informal, open online course… I still don’t really know how a car works, for example ;-)
Anyway – that’s by the by: thanks again to GoMotorsport and the Marussia Virgin Racing team (esp. Mark Hendy and Shakey) for a great day out :-)
PS I think the @marussiavirgin team are trying to build up their social media presence too… to see who they’re listening to, here’s how their friends connect:
I *love* treemaps. If you’re not familiar with them, they provide a very powerful way of visualising categorically organised hierarchical data that bottoms out with a quantitative, numerical dimension in a single view.
For example, consider the total population of students on the degrees offered across UK HE by HESA subject code. As well as the subject level, we might also categorise the data according to the number of students in each year of study (first year, second year, third year).
If we were to tabulate this data, we might have columns: institution, HESA subject code, no. of first year students, no. of second year students, no. of third year students. We could also restructure the table so that the data was presented in the form: institution, HESA subject code, year of study, number of students. And then we could visualise it in a treemap… (which I may do one day… but not now; if you beat me to it, please post a link in the comments;-)
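The restructuring from the "wide" table to the "long" one is mechanical. A quick sketch, with made-up institutions, placeholder subject codes and invented student numbers (none of this is real HESA data, and the column keys are my own):

```python
def wide_to_long(rows):
    """Turn one row per (institution, subject) with a column per year of
    study into one row per (institution, subject, year)."""
    long_rows = []
    for row in rows:
        for year, key in enumerate(("year1", "year2", "year3"), start=1):
            long_rows.append({
                "institution": row["institution"],
                "subject": row["subject"],
                "year": year,
                "students": row[key],
            })
    return long_rows


# Illustrative values only.
wide = [
    {"institution": "Univ A", "subject": "SUBJ1", "year1": 120, "year2": 100, "year3": 90},
    {"institution": "Univ A", "subject": "SUBJ2", "year1": 60, "year2": 55, "year3": 50},
]
long_rows = wide_to_long(wide)
```

The long form is the one a treemap tool wants: every column except the last is a level in the hierarchy, and the last column is the quantity that sets the box size.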
Instead, what I will show is how to visualise data from a sports championship, in particular the start of the Formula One 2011 season. This championship has the same entrants in each race, each a member of one of a fixed number of teams. Points are awarded for each race (that is, each round of the championship) and totalled across rounds to give the current standing. As well as the driver championship (based on points won by individual drivers) there is the team championship (where the points contribution from drivers within a team is totalled).
Here’s what the results from the third round (China) look like:
|Paul di Resta||Force India-Mercedes||0|
|Adrian Sutil||Force India-Mercedes||0|
F1 2011 Results – China, © 2011 Formula One World Championship Ltd
We can represent data from across all the races using a table of the form:
Sample of F1 2011 Results, © 2011 Formula One World Championship Ltd
Here’s what it looks like when we view it in a treemap visualisation:
The size of the boxes is proportional to the (summed) values within the hierarchical categories. In the above case, the large blocks are the total points awarded to each driver across teams and races. (The team field might be useful if a driver were to change team during the season.)
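To make the "summed values within the hierarchical categories" idea concrete, here is a small sketch of the sums a treemap needs before it can draw anything. This is my own illustration, not how Many Eyes does it internally; the rows are a hand-typed sample for four drivers over the first three rounds and should be treated as illustrative rather than as the official table.

```python
from collections import defaultdict


def nested_totals(rows, levels, value="points"):
    """Sum `value` at every node of the hierarchy described by `levels`.

    Returns {path_tuple: total}, where a path such as ("Vettel",) is a
    top-level block and ("Vettel", "RBR-Renault", "China") is a leaf.
    """
    totals = defaultdict(int)
    for row in rows:
        path = ()
        for level in levels:
            path += (row[level],)
            totals[path] += row[value]
    return dict(totals)


# Illustrative sample: driver -> (team, points in each of three rounds).
races = ["Australia", "Malaysia", "China"]
sample = {
    "Vettel": ("RBR-Renault", [25, 25, 18]),
    "Webber": ("RBR-Renault", [10, 12, 15]),
    "Hamilton": ("McLaren-Mercedes", [18, 4, 25]),
    "Button": ("McLaren-Mercedes", [8, 18, 12]),
}
rows = [
    {"driver": d, "team": team, "race": race, "points": pts}
    for d, (team, points) in sample.items()
    for race, pts in zip(races, points)
]

by_driver = nested_totals(rows, ["driver", "team", "race"])
by_team = nested_totals(rows, ["team", "driver", "race"])
```

Changing the order of the `levels` list is all it takes to switch between a driver-led view and a team-led view of the same rows.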
I’m not certain, but I think the Many Eyes treemap algorithm populates the map using a sorted list of summed numerical values taken through the hierarchical path from left to right, top to bottom. Which means top left is the category with the largest summed points. If this is the case, in the above example we can directly see that Webber is in fourth place overall in the championship. We can also look within each blocked area for more detail: for example, we can see Hamilton didn’t score as many points in Malaysia as he did in the other two races.
One of the nice features about the Many Eyes treemap is that it allows you to reorder the levels of the hierarchy that is being displayed. So for example, with a simple reordering of the labels we can get a view over the team championship too:
What might be interesting would be to feed Protovis or the JIT with data dynamically from a Google Spreadsheet, for example, so that a single page could be used to display the treemap with the data being maintained in a spreadsheet.
Hmm, I wonder – does Google spreadsheets have a treemap gadget? Ooh – it does: treemap-gviz. It looks as if a bit of wrangling may be required around the data, but if the display works out then just popping the points data into a Google spreadsheet and creating the gadget should give an embeddable treemap display with no code required:-) (It will probably be necessary to format the data hierarchy by hand, though, requiring differently laid out data tables to act as source for individual and team based reports.)
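The hand formatting could probably be scripted. The sketch below assumes a node / parent / value row layout, which is what Google's own TreeMap chart expects; I haven't checked whether the treemap-gviz gadget wants exactly the same thing, so treat the layout, the function name and the illustrative points totals as assumptions.

```python
def to_parent_child(rows, levels, value="points", root="Championship"):
    """Flatten hierarchical rows into (node, parent, value) tuples.

    Node ids have to be unique, so anything below the top level is named
    by its full path. Assumes one input row per leaf; parent rows carry 0
    on the expectation that the chart sums the children itself.
    """
    table = [(root, None, 0)]
    seen = {root}
    for row in rows:
        parent = root
        for depth, level in enumerate(levels):
            node = parent + " / " + str(row[level]) if depth else str(row[level])
            is_leaf = depth == len(levels) - 1
            if node not in seen:
                seen.add(node)
                table.append((node, parent, row[value] if is_leaf else 0))
            parent = node
    return table


# Illustrative totals after three rounds.
rows = [
    {"team": "RBR-Renault", "driver": "Vettel", "points": 68},
    {"team": "RBR-Renault", "driver": "Webber", "points": 37},
    {"team": "McLaren-Mercedes", "driver": "Hamilton", "points": 47},
    {"team": "McLaren-Mercedes", "driver": "Button", "points": 38},
]
table = to_parent_child(rows, ["team", "driver"])
```

Each tuple would become one spreadsheet row. Swapping the `levels` list around produces the differently laid out table needed for the other report.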
So – how long before we see some “live” treemap displays for F1 results on the F1 blogs then? Or championship tables from other sports? Or is the treemap too confusing as a display for the uninitiated? (I personally don’t think so.. but then, I love macroscopic views over datasets:-)
PS see also More Olympics Medal Table Visualisations which includes a demonstration of a treemap visualisation over Olympic medal standings.
Just a quick post (that I could actually have published 20 mins or so ago), showing a couple of graphics generated from my scrape of the 2011 China Formula One Grand Prix timing data (via FIA press releases).
First up, the race to the podium:
The full lap chart, with pit stops:
Both the above graphics were generated using data scraped from press releases published on the FIA media centre website. You can find the data in the GDF format I used to generate the images using Gephi here (howto).
PPS which reminds me – here’s an example of how to use Gephi to visualise telemetry data captured from the McLaren website: Visualising Vodafone Mclaren F1 Telemetry Data in Gephi
Putting together a couple of tricks from recent posts (Visualising Vodafone Mclaren F1 Telemetry Data in Gephi and PDF Data Liberation: Formula One Press Release Timing Sheets), I thought I’d have a little play with the timing sheet data in Gephi…
The representations I have used to date are graph based, with each node corresponding to a particular lap performance by a particular driver, and edges connecting consecutive laps.
The nodes carry the following data, as specified using the GDF format:
- name VARCHAR: the ID of each node, given as driverNumber_lapNumber (e.g. 12_43)
- label VARCHAR: the name of the driver (e.g. S. VETTEL)
- driverID INT: the driver number (e.g. 7)
- driverNum VARCHAR: an ID for the driver of the lap (e.g. driver_12)
- team VARCHAR: the team name (e.g. Vodafone McLaren Mercedes)
- lap INT: the lap number (e.g. 41)
- pos INT: the position at the end of the lap (e.g. 5)
- pitHistory INT: the number of pitstops to date (e.g. 2)
- pitStopThisLap DOUBLE: the duration of any pitstop this lap, else 0 (e.g. 12.321)
- laptime DOUBLE: the laptime, in seconds (e.g. 72.125)
- lapdelta DOUBLE: the difference between the current laptime and the previous laptime (e.g. 1.327)
- elapsedTime DOUBLE: the summed laptime to date (e.g. 1839.021)
- elapsedTimeHun DOUBLE: the elapsed time divided by a hundred (e.g. )
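For anyone who hasn't met GDF before, the file is just a `nodedef>` header line followed by comma separated node rows, then an `edgedef>` header followed by edge rows. The sketch below writes a cut-down version using only some of the columns listed above; it is a reconstruction for illustration rather than the script I actually used, and the lap records are invented.

```python
NODE_COLUMNS = [
    ("name", "VARCHAR"), ("label", "VARCHAR"), ("driverID", "INT"),
    ("lap", "INT"), ("pos", "INT"), ("laptime", "DOUBLE"),
    ("elapsedTime", "DOUBLE"),
]


def to_gdf(laps):
    """laps: dicts with driverID, label, lap, pos and laptime keys.

    Returns GDF text with one node per driver-lap and an edge joining each
    lap to the same driver's next lap. Values must not contain commas.
    """
    header = "nodedef>" + ",".join(f"{n} {t}" for n, t in NODE_COLUMNS)
    by_driver = {}
    for rec in sorted(laps, key=lambda r: (r["driverID"], r["lap"])):
        by_driver.setdefault(rec["driverID"], []).append(rec)

    nodes, edges = [], []
    for driver, recs in by_driver.items():
        elapsed, previous = 0.0, None
        for rec in recs:
            elapsed += rec["laptime"]  # running total of lap times
            name = f"{driver}_{rec['lap']}"
            fields = (name, rec["label"], driver, rec["lap"], rec["pos"],
                      rec["laptime"], round(elapsed, 3))
            nodes.append(",".join(str(v) for v in fields))
            if previous is not None:
                edges.append(f"{previous},{name}")
            previous = name
    return "\n".join([header, *nodes,
                      "edgedef>node1 VARCHAR,node2 VARCHAR", *edges])


# Invented lap records for two cars over two laps.
laps = [
    {"driverID": 1, "label": "S. VETTEL", "lap": 1, "pos": 1, "laptime": 105.5},
    {"driverID": 1, "label": "S. VETTEL", "lap": 2, "pos": 1, "laptime": 102.25},
    {"driverID": 3, "label": "L. HAMILTON", "lap": 1, "pos": 2, "laptime": 106.0},
    {"driverID": 3, "label": "L. HAMILTON", "lap": 2, "pos": 2, "laptime": 102.5},
]
gdf = to_gdf(laps)
```

Saving `gdf` to a file with a .gdf extension should give something Gephi can open directly, with the typed columns showing up as node attributes.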
Using the geolayout with an equirectangular (presumably this means Cartesian?) layout, we can generate a range of charts simply by selecting suitable co-ordinate dimensions. For example, if we select the laptime as the y (“latitude”) co-ordinate and x (“longitude”) as the lap, filtering out the nodes with a null laptime value, we can generate a graph of the form:
We can then tweak this a little – e.g. colour the nodes by driver (using a Partition based colouring), and edges according to node, resize the nodes to show the number of pit stops to date, and then filter to compare just a couple of drivers:
This sort of lap time comparison is all very well, but it doesn’t necessarily tell us relative track positions. If we size the nodes non-linearly according to position, with a larger size for the “smaller” numerical position (so first is less than second, and hence first is sized larger than second), we can see whether the relative positions change (in this case, they don’t…)
Again, filtering is trivial:
If we plot the elapsed time against lap, we get a view of separations (deltas between cars are available in the media centre reports, but I haven’t used this data yet…):
In this example, lap number increases up the graph and elapsed time increases left to right. Nodes are coloured by driver, and sized according to position. If a driver has a higher lap count and a lower total elapsed time than a driver on the previous lap, then they have lapped that car… Within a lap, we also see the separation of the various cars. (This difference should be the same as the deltas that are available via FIA press releases.)
If we zoom into a lap, we can better see the separation between cars. (Using the data I have, I’m hoping I haven’t introduced any systematic errors arising from essentially dead reckoning the deltas between cars…)
Also note that where lines between two laps cross, we have a change of position between laps.
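The same information can be pulled out of the numbers directly. Here is a sketch that ranks cars by elapsed time at the end of each lap, works out the gap to the leader, and flags changes of position between laps (the crossing lines). The driver codes and times are invented, and the sketch ignores lapped cars and retirements by only looking at laps that every listed car completed.

```python
def lap_orders(elapsed):
    """elapsed: {driver: [elapsed time at end of lap 1, lap 2, ...]}.

    Returns, for each lap, a list of (driver, gap to leader) in track order.
    """
    laps = min(len(times) for times in elapsed.values())
    orders = []
    for i in range(laps):
        ranked = sorted(elapsed, key=lambda d: elapsed[d][i])
        leader_time = elapsed[ranked[0]][i]
        orders.append([(d, round(elapsed[d][i] - leader_time, 3))
                       for d in ranked])
    return orders


def position_changes(orders):
    """Returns (lap, driver, old position, new position) for every change."""
    changes = []
    for lap, (before, after) in enumerate(zip(orders, orders[1:]), start=2):
        prev_pos = {d: p for p, (d, _) in enumerate(before, start=1)}
        for pos, (driver, _) in enumerate(after, start=1):
            if prev_pos[driver] != pos:
                changes.append((lap, driver, prev_pos[driver], pos))
    return changes


# Invented elapsed times, in seconds.
elapsed = {"VET": [100.0, 199.0, 299.5], "HAM": [101.0, 200.5, 299.0]}
orders = lap_orders(elapsed)
changes = position_changes(orders)
```

Comparing the gaps computed this way against the deltas published in the press releases would be one way of checking whether the dead reckoning has introduced any systematic error.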
[ADDED] Here’s another view, plotting elapsed time against itself to see where folk are on the track-as-laptime:
Okay, that’s enough from me for now.. Here’s something far more beautiful from @bencc/Ben Charlton that was built on top of the McLaren data…
First up, a 3D rendering of the lap data:
And then a rather nice lap-by-lap visualisation:
So come on F1 teams – give us some higher resolution data to play with and let’s see what we can really do… ;-)
PS I see that Joe Saward is a keen user of Lap charts…. That reminds me of an idea for an app I meant to do for race days that makes grabbing position data as cars complete a lap as simple as clicking…;-) Hmmm….
PPS for another take of visualising the timing data/timing stats, see Keith Collantine/F1Fanatic’s Malaysia summary post.
If you want F1 summary timing data from practice sessions, qualifying and the race itself, you might imagine that the FIA Media Centre is the place to go:
Some of the documents provide all the results on a single page in a relatively straightforward fashion:
Others are split into tables over multiple pages:
Following the race, the official classification was available as a scrapable PDF in preliminary form, but the final result – with handwritten signature – looked to be a PDF of a photocopy, and as such defies scraping without an OCR pass first… which I didn’t try…
I did consider setting up separate scrapers for each timing document, and saving the data into a corresponding Scraperwiki database, but a quick look at the license conditions made me a little wary…
No part of these results/data may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording, broadcasting or otherwise without prior permission of the copyright holder except for reproduction in local/national/international daily press and regular printed publications on sale to the public within 90 days of the event to which the results/data relate and provided that the copyright symbol appears together with the address shown below …
Instead, I took the scrapers just so far such that I (that is, me ;-) could see how I would be able to get hold of the data without too much additional effort, but I didn’t complete the job… there’s partly an ulterior motive for this too… if anyone really wants the data, then you’ll probably have to do a bit of delving into the mechanics of Scraperwiki;-)
(The other reason for my not spending more time on this at the moment is that I was looking for a couple of simple exercises to get started with grabbing data from PDFs, and the FIA docs seemed quite an easy way in… Writing the scrapers is also a bit like doing Sudoku, or Killer, which is one of my weekend pastimes…;-)
The scraper I set up is here: F1 Timing Scraperwiki
To use the scrapers, you need to open up the Scraperwiki editor, and do a little bit of configuration:
(Note that the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)
When you’ve configured the scraper, run it…
The results of the scrape should now be displayed…
Scraperwiki does allow scraped data to be deposited into a database, and then accessed via an API, or other scrapers, or uploaded to Google Spreadsheets. However, my code stops at the point of getting the data into a Python list. (If you want a copy of the code, I posted it as a gist: F1 timings – press release scraper; you can also access it via Scraperwiki, of course).
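To give a flavour of what "getting the data into a Python list" involves once the text has been pulled out of the PDF, here is a generic sketch. It is not the code from the gist, and the line layout it assumes (position, car number, driver name, lap time) is a simplification I have made up for the example; the real documents vary from sheet to sheet.

```python
import re

# Assumed layout: "<pos> <car number> <I. SURNAME> <m:ss.mmm>"
LINE = re.compile(
    r"^(?P<pos>\d+)\s+(?P<num>\d+)\s+"
    r"(?P<name>[A-Z]\. [A-Z' ]+?)\s+"
    r"(?P<time>\d+:\d{2}\.\d{3})$"
)


def to_seconds(laptime):
    """Convert an 'm:ss.mmm' string to seconds."""
    minutes, seconds = laptime.split(":")
    return int(minutes) * 60 + float(seconds)


def parse_lines(text):
    """Return [pos, car number, driver, time in seconds] for each line that
    looks like a timing row; headers and page furniture are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append([int(m["pos"]), int(m["num"]), m["name"],
                         to_seconds(m["time"])])
    return rows


# Invented sample text standing in for a page of extracted PDF text.
sample = """Pos No Driver Time
1 1 S. VETTEL 1:38.993
2 3 L. HAMILTON 1:39.218
Page 1 of 2"""
rows = parse_lines(sample)
```

Anything that doesn't match the pattern is silently dropped, which is convenient for headers but also hides rows the pattern fails on, so it is worth counting the matches against the number of cars you expect.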
Note that so far I’ve only tried the docs from a single race, so the scrapers may break on the releases published for future (or previous) races… Such is life when working with scrapers… I’ll try to work on robustness as the races go by. (I also need to work on the session/qualifying times and race analysis scrapers… they currently report unstructured data and also display an occasional glitch that I need to handle via a post-scrape cleanser.)
If you want to use the scraper code as a starting point for building a data grabber that publishes the timing information as data somewhere, that’s what it’s there for (please let me know in the comments;-)
PS by the by, Mercedes GP publish an XML file of the latest F1 Championship Standings. They also appear to be publishing racetrack information in XML form using URLs of the form http://assets.mercedes-gp.com/—9—swf/assets/xml/race_23_en.xml. Presumably the next race will be 24?
If you know of any other “data” sources or machine readable, structured/semantic data relating to F1, please let me know via a comment below:-)