Fragments – Scraping Tabular Data from PDFs

Over the weekend, we went to Snetterton to watch the BTCC touring cars and go-for-it Ginetta Juniors. Timing sheets from the event are available on the TSL website, so I thought I’d have a play with the data…

Each series has its own results booklet, a multi-page PDF document containing a range of timing sheets. Here’s an example of part of one of them:

[Screenshot: a timing sheet from the Ginetta Junior Snetterton 2015 results booklet (page 28 of 44)]

It’s easy enough to use tools like Tabula (at version 1.0 as of August 2015) to extract the data from regular(ish) tables, but for more complex tables we’d need to do some additional cleaning.

For example, on a page like:

[Screenshot: a more regular table from the Ginetta Junior Snetterton 2015 results booklet (page 35 of 44)]

we get the data out simply by selecting the bits of the PDF we are interested in:

[Screenshot: selecting a table area in Tabula]

and preview (or export it):

[Screenshot: previewing/exporting the extracted data in Tabula]

Note that this would still require a bit of work to regularise it further, perhaps using something like OpenRefine.

When I scrape PDFs, I tend to use pdftohtml (from the poppler package, I think?) and then parse the resulting XML:

import os
fn = 'ginettaJnrSnetterton2015'
# Convert the PDF to an XML file that records position info for each text element
cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "%s" "%s"' % (fn+'.pdf', fn)
# Can't turn off output? Throw it away...
cmd = cmd + " >/dev/null 2>&1"
os.system(cmd)

import lxml.etree

# Read the XML as bytes so lxml can honour the file's own encoding declaration
xmldata = open(fn+'.xml','rb').read()
root = lxml.etree.fromstring(xmldata)
# Each child of the root element is a <page>
pages = list(root)

We can then quickly preview the “raw” data we’re getting from the PDF:

def flatten(el):
    # Concatenate all the text contained in an element and its children
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)

def pageview(pages, page):
    # Print the left and top co-ordinates, and the text, of each element on a page
    for el in pages[page]:
        if el.tag != 'text':
            continue  # skip <fontspec> and other non-text elements
        print(el.attrib['left'], el.attrib['top'], flatten(el))
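
For example, to dump the elements from one of the timing sheet pages (the page index here is just illustrative, counting from zero):

# Preview the raw data elements on one page (0-indexed, so 27 is the 28th page)
pageview(pages, 27)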

[Screenshot: notebook output previewing the raw data elements scraped from one page]

The scraped data includes top and left co-ordinates for each text element. We can count how many data elements are found at each x (left) co-ordinate and use that to help build our scraper.
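
As a crude sketch of that counting step (the variable names and page index are again just illustrative), something like the following should do the job:

from collections import Counter

# Count how many text elements start at each left co-ordinate on a given page
lefts = [int(el.attrib['left']) for el in pages[27] if el.tag == 'text']
colcounts = Counter(lefts)

# Show the most heavily populated x positions - candidate column locations
print(colcounts.most_common(10))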

By eye, we can spot natural breaks in the counts…:

[Screenshot: counts of text elements found at each left co-ordinate]

but can we also detect them automatically? The Jenks Natural Breaks algorithm [code] looks like it tries to do that…
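
As a sketch, rather than using the implementation linked above, the third-party jenkspy package should give something similar (the number of classes is a guess that would need tuning):

# pip install jenkspy - a third-party Jenks natural breaks implementation,
# not the one linked above; the number of classes is a guess
import jenkspy

breaks = jenkspy.jenks_breaks(lefts, 7)
print(breaks)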

[Screenshot: breaks in the left co-ordinate counts identified by the Jenks natural breaks algorithm]

The centres identified by the Jenks natural breaks algorithm could then be used as part of a default hierarchy to assign a particular data element to a particular column. Crudely, we might use something like the following:

[Screenshot: code assigning each data element to a column based on its left co-ordinate]
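
A sketch along those lines might put each element into the column whose break value its left co-ordinate is closest to, then group elements by their top co-ordinate to rebuild rows (the details here are illustrative rather than definitive):

# Crude sketch: assign each text element to the nearest break value, then
# group by top co-ordinate to reassemble rows (details purely illustrative)
def assign_column(left, breaks):
    return min(range(len(breaks)), key=lambda i: abs(breaks[i] - left))

rows = {}
for el in pages[27]:
    if el.tag != 'text':
        continue
    col = assign_column(int(el.attrib['left']), breaks)
    top = int(el.attrib['top'])
    rows.setdefault(top, {})[col] = flatten(el)

# Print the reassembled rows in top-to-bottom order
for top in sorted(rows):
    print(top, rows[top])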

Whilst it’s quite possible to hand-build scrapers that inspect each element scraped from the PDF document in turn, I notice that the Tabula extraction engine now has a command line interface, so it may be worth spending some time looking at that instead. (It would also be nice if the Tabula GUI could be used to export configuration info, so you could highlight areas of a PDF using the graphical tools and then generate the command line parameter values for reuse from the command line?)

PS Another handy PDF table extractor is published by Scraperwiki: pdftables.com, which is probably the way to go if you have the funds to pay for it…

PPS A handy summary from the Scraperwiki blog about the different sorts of table-containing documents you often come across as PDFs: The four kinds of data PDF (“large tables”, “pivotted tables”, “transactions”, “reports”).

PPPS This also looks relevant – an MSc thesis by Anssi Nurminen, from Tampere University of Technology, on Algorithmic Extraction of Data in Tables in PDF; also this report by Burcu Yildiz, Katharina Kaiser, and Silvia Miksch on pdf2table: A Method to Extract Table Information from PDF Files and an associated Masters thesis by Burcu Yildiz, Information Extraction – Utilizing Table Patterns.
