Category: Tinkering

Fragments – Scraping Tabular Data from PDFs

Over the weekend, we went to Snetterton to watch the BTCC touring cars and go-for-it Ginetta Juniors. Timing sheets from the event are available on the TSL website, so I thought I’d have a play with the data…

Each series has its own results booklet, a multi-page PDF document containing a range of timing sheets. Here’s an example of part of one of them:


It’s easy enough to use tools like Tabula (at version 1.0 as of August, 2015) to extract the data from regular (ish) tables, but for more complex tables we’d need to do some additional cleaning.

For example, on a page like:


we get the data out simply by selecting the bits of the PDF we are interested in:


and preview (or export it):


Note that this would still require a bit of work to regularise it further, perhaps using something like OpenRefine.

When I scrape PDFs, I tend to use pdftohtml (from the poppler package, I think?) and then parse the resulting XML:

import os

# fn is the filename stem, without the .pdf suffix
cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes %s "%s" "%s"' % ( '',fn+'.pdf', os.path.splitext(fn+'.xml')[0])
# Can't turn off output? Throw it away...
os.system(cmd + " >/dev/null 2>&1")

import lxml.etree

xmldata = open(fn+'.xml','r').read()
root = lxml.etree.fromstring(xmldata)
pages = list(root)

We can then quickly preview the “raw” data we’re getting from the PDF:

def flatten(el):
    result = [ (el.text or "") ]
    for sel in el:
        result.append(sel.tail or "")
    return "".join(result)

def pageview(pages,page):
    for el in pages[page]:
        print( el.attrib['left'], el.attrib['top'],flatten(el))


The scraped data includes top and left co-ordinates for each text element. We can count how many data elements are found at each x (left) co-ordinate and use that to help build our scraper.

By eye, we can spot natural breaks in the counts…:


but can we also detect them automatically? The Jenks Natural Breaks algorithm [code] looks like it tries to do that…


The centres identified by the Jenks natural breaks algorithm could then be used as part of a default hierarchy to assign a particular data element to a particular column. Crudely, we might use something like the following:
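
As a crude illustration of the idea (this isn’t the Jenks algorithm itself, just a simple gap-based stand-in, and the tolerance value is a guess that would need tuning against real pages), we might count elements at each left co-ordinate, pick out column centres, and then assign each element to its nearest centre:

```python
# Crude column detection: count text elements at each left co-ordinate,
# treat gaps larger than `tolerance` pixels as column breaks, then assign
# each element to the nearest "column centre".
from collections import Counter

def column_centres(lefts, tolerance=10):
    """Cluster left co-ordinates: a co-ordinate more than `tolerance`
    pixels from the current cluster start opens a new column."""
    centres = []
    for left in sorted(Counter(lefts)):
        if not centres or left - centres[-1] > tolerance:
            centres.append(left)
    return centres

def nearest_column(left, centres):
    """Assign a left co-ordinate to its closest column centre."""
    return min(centres, key=lambda c: abs(c - left))

# Example left co-ordinates as scraped from a page
lefts = [50, 52, 51, 120, 122, 199, 200, 201]
centres = column_centres(lefts)
print(centres)                       # [50, 120, 199]
print(nearest_column(123, centres))  # 120
```

A proper natural breaks implementation would do better on messier pages, but this captures the shape of the approach.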


Whilst it’s quite possible to hand-build scrapers that inspect each element scraped from the PDF document in turn, I notice that the Tabula extraction engine now has a command line interface, so it may be worth spending some time looking at that instead. (It would also be nice if the Tabula GUI could be used to export configuration info, so you could highlight areas of a PDF using the graphical tools and then generate the command line parameter values for reuse from the command line?)

PS Another handy PDF table extractor is published by ScraperWiki, which is probably the way to go if you have the funds to pay for it…

PPS A handy summary from the ScraperWiki blog about the different sorts of table-containing documents you often come across as PDFs: The four kinds of data PDF (“large tables”, “pivotted tables”, “transactions”, “reports”).

PPPS This also looks relevant – an MSc thesis by Anssi Nurminen, from Tampere University of Technology, on Algorithmic Extraction of Data in Tables in PDF; also this report by Burcu Yildiz, Katharina Kaiser, and Silvia Miksch on pdf2table: A Method to Extract Table Information from PDF Files and an associated Masters thesis by Burcu Yildiz, Information Extraction – Utilizing Table Patterns.

Running a Shell Script Once Only in vagrant

Via somewhere (I’ve lost track of the link), here’s a handy recipe for running a shell script once and once only from Vagrantfile.

In the shell script:


if [ ! -f ~/runonce ]
then
  #...run the commands that should only happen once here...
  touch ~/runonce
fi

In the Vagrantfile:

  config.vm.provision :shell, :inline => <<-SH
    chmod ugo+x /vagrant/
  SH

Exporting and Distributing Docker Images and Data Container Contents

Although it was a beautiful day today, and I should really have spent it in the garden, or tinkering with F1 data, I lost the day to the screen and keyboard pondering various ways in which we might be able to use Kitematic to support course activities.

One thing I’ve had on pause for some time is the possibility of distributing docker images to students via a USB stick, and then loading them into Kitematic. To do this we need to get tarballs of the appropriate images so we could then distribute them.

docker save psychemedia/openrefine_ou:tm351d2test | gzip -c > test_openrefine_ou.tgz
docker save psychemedia/tm351_scipystacknserver:tm351d3test | gzip -c > test_ipynb.tgz
docker save psychemedia/dockerui_patch:tm351d2test | gzip -c > test_dockerui.tgz
docker save busybox:latest | gzip -c > test_busybox.tgz
docker save mongo:latest | gzip -c > test_mongo.tgz
docker save postgres:latest | gzip -c > test_postgres.tgz

On the to do list is getting these to work with the portable Kitematic branch (I’m not sure if that branch will continue, or whether the interest is too niche?!), but in the meantime, I could load it into the Kitematic VM from the Kitematic CLI using:

docker load < test_mongo.tgz

assuming the test_mongo.tgz file is in the current working directory.

Another thing I need to explore is how to set up the data volume containers on the students’ machines.

The current virtual machine build scripts aim to seed the databases from raw data, but to set up the student machines it would seem more sensible to either rebuild a database from a backup, or just load in a copy of the seeded data volume container. (All the while we have to be mindful of providing a route for the students to recreate the original, as distributed, setup, just in case things go wrong. At the same time, we also need to start thinking about backup strategies for the students so they can checkpoint their own work…)

The traditional backup and restore route for PostgreSQL seems to be something like the following:

#Use docker exec to run a postgres export
docker exec -t vagrant_devpostgres_1 pg_dumpall -Upostgres -c > dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql
#If it's a large file, maybe worth zipping: pg_dump dbname | gzip > filename.gz

#The restore route would presumably be something like:
cat postgres_dump.sql | docker exec -i vagrant_devpostgres_1 psql -Upostgres
#For the compressed backup: cat postgres_dump.gz | gunzip | psql -Upostgres

For mongo, things seem to be a little bit more complicated. Something like:

docker exec -t vagrant_mongo_1 mongodump

#Complementary restore command is: mongorestore

would generate a dump in the container, but then we’d have to tar it and get it out? Something like these mongodump containers may be easier? (mongo seems to have issues with mounting data containers on host, on a Mac at least?)

By the by, if you need to get into a container within a Vagrant launched VM (I use vagrant with vagrant-docker-compose), the following shows how:

#If you need to get into a container:
vagrant ssh
#Then in the VM:
  docker exec -it CONTAINERNAME bash

Another way of getting to the data is to export the contents of the seeded data volume containers from the build machine. For example:

#  Export data from a data volume container that is linked to a database server

docker run --volumes-from vagrant_devpostgres_1 -v $(pwd):/backup busybox tar cvf /backup/postgresbackup.tar /var/lib/postgresql/data 

#I wonder if these should be run with --rm to dispose of the temporary container once run?

docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db

We can then take the tar file, distribute it to students, and use it to seed a data volume container.

Again, from the Kitematic command line, I can run something like the following to create a couple of data volume containers:

#Create a data volume container
docker create -v /var/lib/postgresql/data --name devpostgresdata busybox true
#Restore the contents
docker run --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"
#Note - the docker helpfiles don't show how to use sh -c - which appears to be required...
#Again, I wonder whether this should be run with --rm somewhere to minimise clutter?

Unfortunately, things don’t seem to run so smoothly with mongo?

#Unfortunately, when trying to run a mongo server against a data volume container
#the presence of a mongod.lock seems to break things
#We probably shouldn't do this, but if the database has settled down and completed
#  all its writes, it should be okay?!
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db --exclude=*mongod.lock
#This generates a copy of the distributable file without the lock...

#Here's an example of the reconstitution from the distributable file for mongo
docker create -v /data/db --name devmongodata busybox true
docker run --volumes-from devmongodata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/mongobackup.tar"

(If I’m doing something wrong wrt getting the mongo data out of the container, please let me know… I wonder as well, with the cavalier way I treat the lock file, whether the mongo container should be started up in repair mode?!)

If we have a docker-compose.yml file in the working directory like the following:

mongo:
  image: mongo
  ports:
    - "27017:27017"
  volumes_from:
    - devmongodata

##We DO NOT need to declare the data volume here
#We have already created it
#Also, if we leave it in, a "docker-compose rm" command
#will destroy the data volume container...
#...which means we wouldn't persist the data in it
#devmongodata:
#    command: echo created
#    image: busybox
#    volumes: 
#        - /data/db

We can then run docker-compose up and it should fire up a mongo container and link it to the seeded data volume container, making the data contained in that data volume container available to us.

I’ve popped some test files here. Download and unzip, from the Kitematic CLI cd into the unzipped dir, create and populate the data containers as above, then run: docker-compose up

You should be presented with some application containers including OpenRefine and an OU customised IPython notebook server. You’ll need to mount the IPython notebooks folder onto the unzipped folder. The example notebook (if everything works!) should demonstrate calls to prepopulated mongo and postgres databases.


Doodling With 3d Animated Charts in R

Doodling with some Gapminder data on child mortality and GDP per capita in PPP$, I wondered whether a 3d plot of the data over time would show different trajectories for different countries, perhaps revealing different development pathways.

Here are a couple of quick sketches, generated using R (this is the first time I’ve tried to play with 3d plots…)

#data downloaded from Gapminder
library(xlsx)
library(reshape2)
library(plyr)

#wb=loadWorkbook("indicator gapminder gdp_per_capita_ppp.xlsx")

#Set up dataframes
gdp=read.xlsx("indicator gapminder gdp_per_capita_ppp.xlsx", sheetName = "Data")
mort=read.xlsx("indicator gapminder under5mortality.xlsx", sheetName = "Data")

#Tidy up the data a bit

gdpm=melt(gdp,id.vars = 'GDP.per.capita','year')
gdpm$year = as.integer(gsub('X', '', gdpm$year))
gdpm=rename(gdpm, c("GDP.per.capita"="country", "value"="GDP.per.capita"))

mortm=melt(mort,id.vars = 'Under.five.mortality','year')
mortm$year = as.integer(gsub('X', '', mortm$year))
mortm=rename(mortm, c("Under.five.mortality"="country", "value"="Under.five.mortality"))

#The following gives us a long dataset by country and year with cols for GDP and mortality
gdpmort=merge(gdpm, mortm, by=c('country','year'))

#Filter out some datasets by country
#(the variable names here are mine - the original assignments were lost)
usa = gdpmort[gdpmort['country']=='United States',]
bangladesh = gdpmort[gdpmort['country']=='Bangladesh',]
china = gdpmort[gdpmort['country']=='China',]

Now let’s have a go at some charts. First, let’s try a static 3d line plot using the scatterplot3d package:


library(scatterplot3d)

s3d = scatterplot3d(usa$year, usa$Under.five.mortality, usa$GDP.per.capita, 
                     color = "red", angle = -50, type='l', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year")
s3d$points3d(bangladesh$year, bangladesh$Under.five.mortality, bangladesh$GDP.per.capita,
             col = "purple", type = "l")
s3d$points3d(china$year, china$Under.five.mortality, china$GDP.per.capita,
             col = "blue", type = "l")

Here’s what it looks like… (it’s worth fiddling with the angle setting to get different views):


A 3d bar chart provides a slightly different view:

s3d = scatterplot3d(usa$year, usa$Under.five.mortality, usa$GDP.per.capita, 
                     color = "red", angle = -50, type='h', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year", pch = " ")
s3d$points3d(bangladesh$year, bangladesh$Under.five.mortality, bangladesh$GDP.per.capita,
             col = "purple", type = "h", pch = " ")
s3d$points3d(china$year, china$Under.five.mortality, china$GDP.per.capita,
             col = "blue", type = "h", pch = " ")


As well as static 3d plots, we can generate interactive ones using the rgl library.

Here’s the code to generate an interactive 3d plot that you can twist and turn with a mouse:

library(rgl)

#Get the data from required countries - data cols are GDP and child mortality
x.several = gdpmort[gdpmort$country %in% c('United States','China','Bangladesh'),]

plot3d(x.several$year, x.several$Under.five.mortality,  log10(x.several$GDP.per.capita),
       col=as.integer(x.several$country), size=3)

We can also set the 3d chart spinning….

play3d(spin3d(axis = c(0, 0, 1)))

We can also grab frames from the spinning animation and save them as individual png files. If you have ImageMagick installed, there’s a function that will generate the image files and weave them into an animated gif automatically.

It’s easy enough to install on a Mac if you have the Homebrew package manager installed. On the command line:

brew install imagemagick

Then we can generate a movie:

movie3d(spin3d(axis = c(0, 0, 1)), duration = 10,
        dir = getwd())

Here’s what it looks like:



A Quick Look at Planning Data on the Isle of Wight

One of the staples that I suspect many folk look to in the weekly Isle of Wight local press is the listing of recent planning notices.

The Isle of Wight Council website also provides a reasonably comprehensive online source about planning information. Notices are split across several listings:

It’s easy enough to knock up a scraper to grab the list of current applications, scrape each of the linked to application pages in turn, and then generate a map showing the locations of the current planning applications.
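
As a sketch of the sort of thing involved (the table layout and column order here are made up, not the actual structure of the council pages), a minimal listings-table parser using only the standard library might look like:

```python
# Parse the rows of a planning listings table into lists of cell texts.
# The markup fed in below is a hypothetical example.
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collect the cell texts of each table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td':
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

parser = ListingParser()
parser.feed('<table><tr><td>P/00001/15</td><td>1 High St, Newport</td></tr></table>')
print(parser.rows)  # [['P/00001/15', '1 High St, Newport']]
```

Each row’s address could then be geocoded and the applications dropped onto a map.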


Indeed, working with my local hyperlocal, here’s a sketch of exactly such an approach, published for the first time yesterday: Isle of Wight planning applications : Mapped (announcement).

I’m hoping to do a lot more with OnTheWight – and perhaps others…? – over the coming weeks and months, so it’d be great to hear any feedback you have either here, or on the OnTheWight site itself.

Where Next?

The sketch is a good start, but it’s exactly that. If we are going to extend the service, for example, by also providing a means of reviewing recently accepted (or rejected) applications, as well as applications currently under appeal, we perhaps need to think a little bit more clearly about how we store the data – and keep track of where it is in the planning process.

If we look at the page for a particular application, we see that there are essentially three tables:


The listings pages also take slightly different forms. All of them have an address, and all of them have a planning application identification number (though in two forms, albeit intersecting); but they differ in terms of the semantics of the third and possible fourth column, although each ultimately resolves to a date or null value.
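
Whichever listing a row comes from, the trailing columns ultimately resolve to a date or a null value, so a small normalising helper might be useful (the accepted date formats here are guesses at what the pages use):

```python
# Normalise a date-ish cell value to a date, or None for the null cases.
from datetime import datetime

def parse_date(val):
    """Return a date parsed from `val`, or None if empty/unparseable."""
    if not val or not val.strip():
        return None
    for fmt in ('%d/%m/%Y', '%d %b %Y'):  # assumed formats
        try:
            return datetime.strptime(val.strip(), fmt).date()
        except ValueError:
            pass
    return None

print(parse_date('01/08/2015'))  # 2015-08-01
print(parse_date(''))            # None
```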

– current (and archive) listings:


– recent decisions:


– appeals:


At the moment, the OnTheWight sketchmap is generated from a scrape of the Isle of Wight Council current planning applications page (latitude and longitude are generated by geocoding the address). A more complete solution would be to start to build a database of all applications, though this requires a little bit of thought when it comes to setting up the database so it becomes possible to track the current state of a particular application.

It might also be useful to put together a simple flow chart that shows how the public information available around an application evolves as an application progresses and then build a data model that can readily reflect that. We could then start to annotate that chart with different output opportunities – for example, as the map goes, it’s easy enough to imagine several layers: a current applications layer, a (current) appeals layer, a recent decisions layer, an archived decision layer.

A process diagram would also allow us to start spotting event opportunities around which we might be able to generate alerts. For example, generating feeds that allow you to identify changes in application activity within a particular unit postcode or postcode district (ONS: UK postcode structure) or ward could act as the basis of a simple alerting mechanism. It’s then easy enough to set up an IFTTT feed to email pipe, though longer term an “onsite” feed to email subscription service would allow for a more local service. (Is there a WordPress plugin that lets logged in users generate multiple email subscriptions to different feeds?)
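
As a sketch of the postcode matching, a simple regular expression (an assumption on my part, not exhaustive against all valid UK postcodes) can pull the postcode district out of an address string:

```python
# Extract the postcode district (outward code) from an address, if present.
import re

POSTCODE = re.compile(r'\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b')

def postcode_district(address):
    """Return the outward code (e.g. 'PO30') from an address, or None."""
    m = POSTCODE.search(address.upper())
    return m.group(1) if m else None

print(postcode_district('1 High St, Newport PO30 1UD'))  # PO30
```

A feed per district could then be generated by grouping scraped applications on this value.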

In terms of other value-adds that arise from processing the data, I can think of a few… For example, keeping track of repeated applications to the same property, analysing which agents are popular in terms of applications (and perhaps generating a league table of success rates!), linkage to other location based services (for example, license applications or prices paid data) and so on.

Takes foot off spade and stops looking into the future, surveys weeds and half dug hole…;-)

Things I Take for Granted #287 – Grabbing Stuff from Web Form Drop Down Lists

Over the years, I’ve collected lots of little hacks for tinkering with various data sets. Here’s an example…

A form on a web page with country names that map to code values:


If we want to generate a two column look up table from the names on the list to the values that encode them, we can look at the HTML source, grab the list of option elements, then use a regular expression to extract the names and values and rewrite them as a two column, tab separated text file, with one item per line:

regexp form extractor

NOTE: the last character in the replace is \n (newline character). I grabbed the screenshot while the cursor was blinking :-(
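
For a quick sketch of the same trick in code (the option markup below is a made-up example, not any particular site’s form), the regular expression approach looks something like:

```python
# Rewrite <option value="CODE">Name</option> elements as "Name\tCODE" lines.
import re

html = '''<select name="country">
<option value="AF">Afghanistan</option>
<option value="AL">Albania</option>
</select>'''

# findall returns (value, name) pairs, one per option element
rows = re.findall(r'<option value="([^"]*)">([^<]*)</option>', html)
tsv = "\n".join("%s\t%s" % (name, value) for value, name in rows)
print(tsv)
```

The same substitution can of course be done interactively in any regex-capable text editor, as in the screenshot.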

A Google Spreadsheets View Over DWP Tabulation Tool 1-Click Time Series Data

Whilst preparing for an open data training session for New Economy in Manchester earlier this week, I was introduced to the DWP tabulation tool that provides a quick way of analysing various benefits and allowances related datasets, including bereavement benefits, incapacity benefit and employment and support allowance.

The tool supports the construction of various data views as well as providing 1-click link views over “canned” datasets for each category of data.


The data is made available in the form of an HTML data table via a static URL (example):


To simplify working with the data, we can import the data table directly into a Google spreadsheet using the importHTML() formula, which allows you to specify a URL and then import a specified HTML data table from that page. In the following example, the first table from a results page – that contains the description of the table – is imported into cell A1, and the actual datatable (table 2) is imported via an importHTML() formula specified in cell A2.


Note that the first data row does not appear to import cleanly – inspection of the original HTML table shows why – the presence of what is presumably a split cell that declares the name of the timeseries index column along with the first time index value.

To simplify the import of these data tables into a Google Spreadsheet, we can make use of a small script to add an additional custom menu into Google spreadsheets that will import a particular dataset.


The following script shows one way of starting to construct such a set of menus:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('DWP Tabs')
      .addSubMenu(ui.createMenu('Bereavement Benefits')
          .addItem('1-click BW/BB timeseries', 'mi_bb_b')
          .addItem('1-click Region timeseries', 'mi_bb_r')
          .addItem('1-click Gender timeseries', 'mi_bb_g')
          .addItem('1-click Age timeseries', 'mi_bb_a'))
      .addSubMenu(ui.createMenu('Incapacity Benefit/Disablement')
          .addItem('1-click Region timeseries', 'mi_ic_r'))
      .addToUi();
}

function menuActionImportTable(url){
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];

  //Import the table description (table 1) into A1
  //and the actual datatable (table 2) into A2
  var cell = sheet.getRange("A1");
  cell.setFormula('=importhtml("'+url+'","table",1)');
  cell = sheet.getRange("A2");
  cell.setFormula('=importhtml("'+url+'","table",2)');
}

//--Incapacity Benefit/Disablement
function mi_ic_r() {
  var url='';
  menuActionImportTable(url);
}

//-- Bereavement Benefits
function mi_bb_r() {
  var url='';
  menuActionImportTable(url);
}

function mi_bb_g() {
  var url='';
  menuActionImportTable(url);
}

function mi_bb_a() {
  var url='';
  menuActionImportTable(url);
}

function mi_bb_b() {
  var url='';
  menuActionImportTable(url);
}

Copying the above script into the script editor associated with a spreadsheet, and then reloading the spreadsheet (permissions may need to be granted to the script the first time it is run), provides a custom menu that allows the direct import of a particular dataset:


Duplicating the spreadsheet carries the script along with it (I think) and can presumably also be shared… (It’s been some time since I played with Apps Script – I’m not sure how permissioning works or how easy it is to convert scripts to add-ons, though I note from the documentation that top-level app menus aren’t supported by add-ons.)