
PDF Data Liberation: Formula One Press Release Timing Sheets

If you want F1 summary timing data from practice sessions, qualifying and the race itself, you might imagine that the FIA Media Centre is the place to go:

Hmm… PDFs…

Some of the documents provide all the results on a single page in a relatively straightforward fashion:

Others are split into tables over multiple pages:

Following the race, the official classification was available as a scrapable PDF in preliminary form, but the final result – with handwritten signature – looked to be a PDF of a photocopy, and as such defies scraping without an OCR pass first… which I didn’t try…

I did consider setting up separate scrapers for each timing document, and saving the data into a corresponding Scraperwiki database, but a quick look at the license conditions made me a little wary…

No part of these results/data may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording, broadcasting or otherwise without prior permission of the copyright holder except for reproduction in local/national/international daily press and regular printed publications on sale to the public within 90 days of the event to which the results/data relate and provided that the copyright symbol appears together with the address shown below …

Instead, I took the scrapers just far enough that I (that is, me ;-) could see how I’d be able to get hold of the data without too much additional effort, but I didn’t complete the job… there’s partly an ulterior motive for this too… if anyone really wants the data, then you’ll probably have to do a bit of delving into the mechanics of Scraperwiki;-)

(The other reason for my not spending more time on this at the moment is that I was looking for a couple of simple exercises to get started with grabbing data from PDFs, and the FIA docs seemed quite an easy way in… Writing the scrapers is also a bit like doing Sudoku, or Killer, which is one of my weekend pastimes…;-)

The scraper I set up is here: F1 Timing Scraperwiki
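To give a flavour of how the scraping works (this isn’t the actual scraper code, just a minimal sketch, and the PDF URL below is a placeholder rather than a real FIA address), the general pattern is: grab the PDF, run it through the scraperwiki library’s pdftoxml helper, then use the position attributes on each text element to stitch the strings back into table rows:

```python
import scraperwiki
import lxml.etree

# Placeholder URL: the real FIA timing sheet URLs change from race to race
# (and tend to disappear after a few days), so fill this in yourself.
url = 'http://example.com/some-f1-timing-sheet.pdf'

pdfdata = scraperwiki.scrape(url)        # raw bytes of the PDF
xmldata = scraperwiki.pdftoxml(pdfdata)  # pdftohtml-style XML, one <text> element per string
if not isinstance(xmldata, bytes):
    xmldata = xmldata.encode('utf-8')    # lxml prefers bytes when there's an encoding declaration
root = lxml.etree.fromstring(xmldata)

# Each <text> element carries top/left coordinates; grouping elements that
# share the same 'top' value gives back the rows of the timing table.
rows = {}
for el in root.findall('.//text'):
    top = el.get('top')
    if top is None:
        continue
    rows.setdefault(int(top), []).append(''.join(el.itertext()).strip())

for top in sorted(rows):
    print(' | '.join(rows[top]))
```

In practice there’s a bit more to it than that (rows split across pages, repeated headers, ‘top’ values that wobble by a pixel or two), which is where the Sudoku-like fun comes in…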

To use the scrapers, you need to open up the Scraperwiki editor, and do a little bit of configuration:

(Note that the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)

When you’ve configured the scraper, run it…

The results of the scrape should now be displayed…

Scraperwiki does allow scraped data to be deposited into a database, and then accessed via an API, or other scrapers, or uploaded to Google Spreadsheets. However, my code stops at the point of getting the data into a Python list. (If you want a copy of the code, I posted it as a gist: F1 timings – press release scraper; you can also access it via Scraperwiki, of course).
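For what it’s worth, pushing those rows on into the datastore only takes a line or two more; something along these lines (a sketch only, with made-up column names rather than anything the scraper actually uses):

```python
import scraperwiki

# A couple of made-up rows of the sort the scraper builds up in a Python list.
rows = [
    {'race': 'MAL', 'session': 'race', 'pos': 1, 'driver': 'VETTEL', 'laps': 56},
    {'race': 'MAL', 'session': 'race', 'pos': 2, 'driver': 'BUTTON', 'laps': 56},
]

# unique_keys means re-running the scraper updates existing rows rather than duplicating them.
scraperwiki.sqlite.save(unique_keys=['race', 'session', 'pos'], data=rows)
```

Once the data’s in there, it can be queried via the Scraperwiki API, which is what makes the onward publishing step straightforward.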

Note that so far I’ve only tried the docs from a single race, so the scrapers may break on the releases published for future (or previous) races… Such is life when working with scrapers… I’ll try to work on robustness as the races go by. (I also need to work on the session/qualifying times and race analysis scrapers… they currently report unstructured data and also display an occasional glitch that I need to handle via a post-scrape cleanser.)
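To give an idea of what I mean by a post-scrape cleanser, it’s along these lines (a made-up illustration rather than the actual cleaning code): fish the laptimes out of the slightly noisy strings the PDF scrape returns and turn them into numbers you can do sums with.

```python
import re

# Laptimes in the timing sheets look like '1:38.276'; the scraped strings
# sometimes have stray characters or page-break debris tangled round them.
LAPTIME = re.compile(r'(\d+):(\d{2})\.(\d{3})')

def laptime_to_seconds(s):
    """Pull the first laptime out of a scraped string and return it in seconds."""
    m = LAPTIME.search(s)
    if m is None:
        return None
    mins, secs, millis = m.groups()
    return int(mins) * 60 + int(secs) + int(millis) / 1000.0

print(laptime_to_seconds('  1:38.276 P'))  # 98.276
```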

If you want to use the scraper code as a starting point for building a data grabber that publishes the timing information as data somewhere, that’s what it’s there for (please let me know in the comments;-)

PS by the by, Mercedes GP publish an XML file of the latest F1 Championship Standings. They also appear to be publishing racetrack information in XML form using URLs of the form http://assets.mercedes-gp.com/—9—swf/assets/xml/race_23_en.xml. Presumably the next race will be 24?
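I haven’t looked at how those files are structured at all yet, so the following is just a quick and dirty way of poking around inside one of them to see what’s there (fill in one of the URLs mentioned above; I’m making no assumptions about the element names):

```python
import scraperwiki
import lxml.etree

# Fill in one of the Mercedes GP XML URLs mentioned above.
url = 'http://assets.mercedes-gp.com/...'

xmldata = scraperwiki.scrape(url)
if not isinstance(xmldata, bytes):
    xmldata = xmldata.encode('utf-8')
root = lxml.etree.fromstring(xmldata)

# No idea what the schema looks like in advance, so just walk the tree and
# print each element's tag, attributes and text to get a feel for it.
for el in root.iter():
    print('%s %s %s' % (el.tag, el.attrib, (el.text or '').strip()))
```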

If you know of any other “data” sources or machine readable, structured/semantic data relating to F1, please let me know via a comment below:-)

Written by Tony Hirst

April 10, 2011 at 9:52 pm

Posted in Data, Tinkering


11 Responses


  1. (Note that the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)

    They get replaced as soon as the next race info appears, so you probably have until Wednesday to test this and then they’re gone, until data appears for Friday.

    Mr C

    April 10, 2011 at 10:03 pm

    • @Mr C – agreed, the links on the FIA page disappear in time for the next race… I thought last year I managed to track down some old releases (i.e. releases from previous races) down another URL path, but that may be a false memory…? (I guess the old releases are available for accredited press?) I certainly try to remember to grab copies of the files after each race… hmmm… maybe I should script that collection activity?!;-)

      Tony Hirst

      April 10, 2011 at 10:15 pm

  2. The URLs are completely persistent – the trouble is they’re password-protected when the next race approaches, and only FIA members get the password (as far as I can tell, and assuming last year’s system is still in effect).

    Alianora La Canta

    April 10, 2011 at 10:24 pm

  3. I tried one of the PDFs with DocumentCloud, and it did quite a good job very quickly – https://www.documentcloud.org/documents/83759-mal-race-sectors.html

    You’d still have to get that into a spreadsheet, which I haven’t attempted.

    Which was the photocopied one with the signature? Would be interesting to see how DC copes with that.

    Mr Paul J Bradshaw

    April 11, 2011 at 10:39 am

  4. [...] examples can be found at Mercedes GP’s XML file of the latest F1 Championship Standings (see the PS at the end of Tony Hirst’s post for an explanation of how this is structured), and MySociety’s Parliament Parser, which [...]

  5. [...] Visualising Vodafone Mclaren F1 Telemetry Data in Gephi and PDF Data Liberation: Formula One Press Release Timing Sheets), I thought I’d have a little play with the timing sheet data in [...]

  6. [...] managed to get F1 timing data data through my cobbled together F1 timing data Scraperwiki, it becomes much easier to try out different visualisation approaches that can be used to review [...]

  7. Has anybody been saving the PDFs for older races? I wanted to do some analysis of lap timings throughout the race by each of the drivers. Can’t find them for the past races :(

    Will be saving for the future ones in case anybody needs it, ping me

    frosty

    July 17, 2011 at 10:18 pm


