PDF Data Liberation: Formula One Press Release Timing Sheets

If you want F1 summary timing data from practice sessions, qualifying and the race itself, you might imagine that the FIA Media Centre is the place to go:

Hmm… PDFs…

Some of the documents provide all the results on a single page in a relatively straightforward fashion:

Others are split into tables over multiple pages:

Following the race, the official classification was available as a scrapable PDF in preliminary form, but the final result – with handwritten signature – looked to be a PDF of a photocopy, and as such defies scraping without an OCR pass first… which I didn’t try…

I did consider setting up separate scrapers for each timing document, and saving the data into a corresponding Scraperwiki database, but a quick look at the license conditions made me a little wary…

No part of these results/data may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording, broadcasting or otherwise without prior permission of the copyright holder except for reproduction in local/national/international daily press and regular printed publications on sale to the public within 90 days of the event to which the results/data relate and provided that the copyright symbol appears together with the address shown below …

Instead, I took the scrapers just far enough that I (that is, me ;-) could see how to get hold of the data without too much additional effort, but I didn’t complete the job… There’s partly an ulterior motive for this too… if anyone really wants the data, then you’ll probably have to do a bit of delving into the mechanics of Scraperwiki;-)

(The other reason for my not spending more time on this at the moment is that I was looking for a couple of simple exercises to get started with grabbing data from PDFs, and the FIA docs seemed quite an easy way in… Writing the scrapers is also a bit like doing Sudoku, or Killer, which is one of my weekend pastimes…;-)

The scraper I set up is here: F1 Timing Scraperwiki

To use the scrapers, you need to open up the Scraperwiki editor, and do a little bit of configuration:

(Note that the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are.)

When you’ve configured the scraper, run it…

The results of the scrape should now be displayed…

Scraperwiki does allow scraped data to be deposited into a database, and then accessed via an API, or other scrapers, or uploaded to Google Spreadsheets. However, my code stops at the point of getting the data into a Python list. (If you want a copy of the code, I posted it as a gist: F1 timings – press release scraper; you can also access it via Scraperwiki, of course).
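For anyone wanting to take the Python list a step further, here is a minimal sketch of dropping scraped rows into a database table. It is not the code in the gist: it uses the sqlite3 module from the Python standard library as a stand-in for the Scraperwiki datastore, and the table name, the column names and the save_rows helper are all made up for illustration, as are the placeholder rows.

```python
import sqlite3


def save_rows(conn, rows):
    """Store (position, driver, time) tuples in a 'classification' table.

    The table layout is an assumption for illustration; it is not the
    structure produced by the original scraper.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classification "
        "(pos INTEGER PRIMARY KEY, driver TEXT, time TEXT)"
    )
    # INSERT OR REPLACE keyed on pos, so re-running a scrape
    # overwrites rows rather than duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO classification (pos, driver, time) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


# Placeholder rows standing in for the scraper's output (not real results).
scraped = [(1, "DRIVER A", "1:25.000"), (2, "DRIVER B", "1:25.500")]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing hits disk
save_rows(conn, scraped)

for row in conn.execute(
    "SELECT pos, driver, time FROM classification ORDER BY pos"
):
    print(row)
```

Keying the table on position is just one choice; for lap-by-lap sheets a (driver, lap) key would probably make more sense.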

Note that so far I’ve only tried the docs from a single race, so the scrapers may break on the releases published for future (or previous) races… Such is life when working with scrapers… I’ll try to work on robustness as the races go by. (I also need to work on the session/qualifying times and race analysis scrapers… they currently report unstructured data and also display an occasional glitch that I need to handle via a post-scrape cleanser.)
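As a rough idea of what one small piece of a post-scrape cleanser might look like, the sketch below turns a scraped time string into a number of seconds. It assumes times arrive as text in the usual m:ss.sss form (or plain ss.sss for sector times); laptime_to_seconds is a hypothetical helper, not part of the scraper, and since I haven’t pinned down what the glitches actually look like it simply returns None for anything it can’t parse so those cells can be picked up in a later pass.

```python
def laptime_to_seconds(text):
    """Convert 'm:ss.sss' or 'ss.sss' text to seconds as a float.

    Returns None when the text cannot be parsed, so that glitchy
    cells can be flagged rather than crashing the scrape.
    """
    text = text.strip()
    try:
        if ":" in text:
            # Split into the minutes part and the seconds part.
            minutes, seconds = text.split(":", 1)
            return int(minutes) * 60 + float(seconds)
        # No colon: treat the whole string as seconds.
        return float(text)
    except ValueError:
        # Empty strings, stray labels and so on end up here.
        return None


# Made-up cell values, not taken from a real timing sheet.
for cell in ["1:30.500", " 28.250 ", "PIT", ""]:
    print(repr(cell), "->", laptime_to_seconds(cell))
```

Keeping the original string alongside the converted value would make it easier to check the conversion against the PDF.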

If you want to use the scraper code as a starting point for building a data grabber that publishes the timing information as data somewhere, that’s what it’s there for (please let me know in the comments;-)

PS by the by, Mercedes GP publish an XML file of the latest F1 Championship Standings. They also appear to be publishing racetrack information in XML form using URLs of the form http://assets.mercedes-gp.com/—9—swf/assets/xml/race_23_en.xml. Presumably the next race will be 24?

If you know of any other “data” sources or machine readable, structured/semantic data relating to F1, please let me know via a comment below:-)

11 comments

  1. Mr C

    (Note the the press releases may disappear a few days after the race – I’m not sure how persistent the URLs are?)

    they get replaced as soon as the next race info appears. so you probably have until wednesday to test this and then they’re gone, until data appears for friday.

    • Tony Hirst

      @Mr C – agreed the links on the FIA page disappear in time for next race… I thought last year I managed to track down some old releases (i.e. releases from previous races) down another URL path, but that may be a false memory…? (I guess the old releases are available for accredited press?) I certainly try to remember to grab copies of the files after each race… hmmm… maybe I should script that collection activity?!;-)

  2. Alianora La Canta

    The URLs are completely persistent – the trouble is they’re password-protected when the next race approaches, and only FIA members get the password (as far as I can tell, and assuming last year’s system is still in effect).

  3. Pingback: Data for journalists: understanding XML and RSS | Online Journalism Blog
  4. Pingback: Visualising F1 Timing Sheet Data « OUseful.Info, the blog…
  5. Pingback: A First Attempt at Looking at F1 Timing Data in Google Motion Charts (aka “Gapminder”) « OUseful.Info, the blog…
  6. frosty

    Has anybody been saving the pdf for older races? I wanted to do some analysis of lap timings throughout the race by each of the drivers. Can’t find them for the past races :(

    Will be saving for the future ones in case anybody needs it, ping me
