Category: Tinkering

Sketching Out a Python / pandas Remote Data Access Wrapper for the Nomis API

One of the things I keep failing to spend time looking at and playing with is the generation of text based reports from tabular datasets (“data2txt”, “data to text”, “textualisation”, “natural language generation (NLG)” etc).

One of my earlier fumblings was to look at generating “press releases” around monthly Jobseeker’s Allowance figures. One of the reasons that stalled was the amount of time it took me to trying to find my way around the nomis site and API, trying to piece together the URLs that would let me pull down the data in a meaningful way and help me properly understand what the data actually referred to.

So over the weekend, I started to put together a wrapper for the nomis API that would let have a conversation with it so that I could start to find out what sorts of datasets it knows about and how I can run queries into those datasets (that is, what dimensions are available for each dataset that we can query on) as well as pulling back actual datasets from it.

To make the data easier to work with once I have pulled it down, I put it into a pandas dataframe so that I can work with it in that context.

(With Open Knowledge, I’m running a series of Code Clubs in “Wrangling data with Python” for the Cabinet Office at the moment, based around pandas and IPython notebooks. If I can get the wrapper working reliably enough, it could be interesting to see what they make of it…)

The code can be found here ( and a notebook demonstrating it’s used can be found here: Demo Python Wrapper for nomis API.

This is a flavour of the sorts of thing I’ve been reaching for with it:



To make life easier, you can pass in dimension parameter values using either the dimension parameter codes or their actual values; because the nomis API requires the codes, legitimate values are automatically converted. (Note to self – add further checks to discard illegitimate values, where detected…)


Any comments, feedback, issues if you try it etc, please let me know via the comments to this post (for now…!).

PS next up – revisit the ONS API following this first, aborted attempt.

Recreating a Node.js Installation – Package Versions

Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.


Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:

### Show nodejs version and packages
> node -v

> npm list -g
├─┬ npm@1.4.28
│ ├── abbrev@1.0.5
│ ├── ansi@0.3.0
│ ├── ansicolors@0.3.2
│ ├── ansistyles@0.1.3
│ ├── archy@0.0.2
│ ├── block-stream@0.0.7
│ └── which@1.0.5
└─┬ tty.js@0.2.13
  ├─┬ express@3.1.0
  │ ├── buffer-crc32@0.1.1

Using this information, I could then use nvm, a node.js version manager, installed via:

curl | NVM_DIR=/usr/local/lib/ bash

to install, from a new shell, the version I knew worked:
nvm install 0.10.35
npm install tty.js

(I should probably add the tty.js version in there too? npm install tty.js@0.2.13 perhaps? )

The terminal can then be run as a demon from:

/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize

What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and the recreate it in a new machine. That is, export the node version number and versions of the installed packages, then create an installation script that will recreate that setup?

It would be handy if this approach could be extended further. For example, we can also look at the packages – and their version numbers – installed on the Linux box using:

### Show packages
dpkg -l

And we can get a list of Python packages – and their version numbers – using:

### Show Python packages
pip3 list

Surely there must be some simple tools/utilities out that support this sort of thing? Or even just cheatsheets that show you what commands to run to export the packages and versions into a file in a format that allows you to use that file as part of an installation script in a new machine to help rebuild the original one?

Getting Text Out Of Anything (docs, PDFs, Images) Using Apache Tika

So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, from .doc to .docx, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example). And your task is to identify the names of the people identified in those documents and the companies they have been associated with.

Or you’ve been presented with a set of scanned PDF documents, where the text is selectable, or worse, a set of png images of text documents. And you have a stack of them to search through to find a particular name. What do you do?

Apart from cry a little, inside?

If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML document to help you identify different meaningful elements within a webpage, and as a result try to recreate the database that contained the data that was used to generate the web pages.

But in a stack of arbitrary documents, or scanned image files, there is no consistent template you can work with to help you write the one scraper that will handle all the documents.

So how about a weaker form of document parsing? Text extraction, for example. Rather than trying to recreate a data base, how about we settle for just getting the text (the sort of thing a search engine might extract from a set of documents that it can index and search over, for example).

Something like this Microsoft Office (word) doc for example:

bio word doc

Or this scanned PDF (the highlighted text shows the text is copyable as such – so it is actually in the document as text):


Or this image I captured from a fragment of the scanned PDF – no text as such here…:


What are we to do?

Here’s where Apache Tika can help…

Apache Tika is like magic; give a document and it’ll (try) to give you back the text it contains. Even if that document is an image. Tika is quite a hefty bit of code, but it’s something you can run quite easily yourself as a service, using the magic of dockers containers.

In this example, I’m running Apache Tika as a web service in the cloud for a few pennies an hour; and yes, you can do this yourself – instructions for how to run Apache Tika in the cloud or on your own computer are described at the end of the post…

In my case, I had Apache Tika running at the address (that address is no longer working).

I was working in an IPython notebook running on a Linux machine (the recipe will also work on a Mac; on Windows, you may need to install curl).

There are two steps:

  1. PUT the file you want the text extracted from to the server; I use curl, with a command of the form curl -T path/to/myfile.png > path/to/myfile_txt.json
  2. Look at the result in the returned JSON file (path/to/myfile_txt.json)

Simple as that…simple as this:

Parse the word doc shown above…

You can see the start of the extracted text in the x-Tika:content element at the bottom…


Parse the PDF doc shown above…


Parse the actual image of fragment of the PDF doc shown above…


See how Tika has gone into image parsing and optical character recognition mode automatically, and done its best to extract the text from the image file? :-)

Running Apache Tika in the Cloud

As described in Getting Started With Personal App Containers in the Cloud, the first thing you need to do is set up an account with a cloud service provider – I’m using Digital Ocean at the moment: it has simple billing and lets you launch cloud hosted virtual machines of a variety of sizes in a variety of territories, including the UK. Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. You don’t need to do anything else – we’ll be using that service via another one… [Affiliate Link: sign up to Digital Ocean and get $10 credit]

Having got your cloud provider account set up, create an account with Tutum and then link your Digital Ocean account to it.

Launch a node cluster as described at the start of Getting Started With Personal App Containers in the Cloud. The 2GB/2 core server is plenty.

Now launch a container – the one you want is logicalspark/docker-tikaserver:


To be able to access the service over the web, you need to make its ports public:


I’m going to give it a custom port number, but you don’t have to, in which case a random one will be assigned:


Having created and deployed the container, look up it’s address from the Endpoints tab. The address will be something like tcp:// You can check the service is there by going to in your browser.


When you’re don terminate the container and the node cluster so you donlt get billed any more than is necessary.



Running Apache Tika on your own computer

  1. Install boot2docker
  2. Launch boot2docker
  3. In the boot2docker command line, enter: docker pull logicalspark/docker-tikaserver to grab the container image;
  4. To run the container: docker run -d -p 9998:9998 logicalspark/docker-tikaserver
  5. enter boot2docker ip to find the address bootdocker is publishing to (eg;
  6. Check the server is there – in your browser, go to eg:

(Don’t be afraid of command lines; you probably already know how to download an app (step 1), definitely know how to launch an app (step 2), know how to type (steps 3 to 5), and how to go to a web location (step 6; note: you do have to enter this URL in the browser location bar at the top of the screen – entering it into Google won’t help..;-) All steps 3 to 5 do are get you to write the commands the computer is to follow, rather than automatically selecting them from a nicely named menu option. (What do you think a computer actually does when you select a menu option?!)

PS via @Pudo, see also: textract – python library for “extracting text out of any document”.

Getting Started With Personal App Containers in the Cloud

…aka “how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour”…

I managed to get my first container up and running in the cloud today (yeah!:-), using tutum to launch a container I’d defined on Dockerhub and run it on a linked DigitalOcean server (or as they call them, “droplet”).

This sort of thing is probably a “so what?” to many devs, or even folk who do the self-hosting thing, where for example you can launch your own web applications using CPanel, setting up your own WordPress site, perhaps, or an online database.

The difference for me is that the instance of OpenRefine I got up and running in the cloud via a web browser was the result of composing several different, loosely coupled services together:

  • I’d already published a container on dockerhub that launches the latest release version of OpenRefine: psychemedia/docker-openrefine. This lets me run OpenRefine in a boot2docker virtual machine running on my own desktop and access it through a browser on the same computer.
  • Digital Ocean is a cloud hosting service with simple billing (I looked at Amazon AWS but it was just too complicated) that lets you launch cloud hosted virtual machines of a variety of sizes and in a variety of territories (including the UK). Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. So that’s all I did there – created an account and made a small payment. [Affiliate Link: sign up to Digital Ocean and get $10 credit]
  • tutum an intermediary service that makes it easy to launch servers and containers running inside them. By linking a DigitalOcean account to tutum, I can launch containers on DigitalOcean in a relatively straightforward way…

Launching OpenRefine via tutum

I’m going to start by launching a 2GB machine which comes in a 3 cents an hour, capped at $20 a month.



Now we need to get a container – which I’m thinking of as if it was a personal app, or personal app server:


I’m going to make use of a public container image – here’s one I prepared earlier…


We need to do a tiny bit of configuration. Specifically, all I need to do is ensure that I make the port public so I can connect to it; by default, it will be assigned to a random port in a particular range on the publicly viewable service. I can also set the service name, but for now I’ll leave the default.


If I create and deploy the container, the image will be pulled from dockerhub and a container launched based on it that I should be able to access via a public URL:


The first time I pull the container into a specific machine it takes a little time to set up as the container files are imported into the machine. If I create another container using the same image (another OpenRefine instance, for example), it should start really quickly because all the required files have already been loaded into the node.


Unfortunately, when I go through to the corresponding URL, there’s nothing there. Looking at the logs, I think maybe there wasn’t enough memory to launch a second OpenRefine container… (I could test this by launching a second droplet/server with more memory, and then deploying a couple of containers to that one.)


The billing is calculated on DigitalOcean on a hourly rate, based on the number and size of servers running. To stop racking up charges, you can terminate the server/droplet (so you also lose the containers).


Note than in the case of OpenRefine, we could allow several users all to access the same OpenRefine container (the same URL) and just run different projects within them.

So What?

Although this is probably not the way that dev ops folk think of containers, I’m seeing them as a great way of packaging service based applications that I might one to run at a personal level, or perhaps in a teaching/training context, maybe on a self-service basis, maybe on a teacher self-service basis (fire up one application server that everyone in a cohort can log on to, or one container/application server for each of them; I noticed that I could automatically launch as many containers as I wanted – a 64GB 20 core processor costs about $1 per hour on Digital Ocean, so for an all day School of Data training session, for example, with 15-20 participants, that would be about $10, with everyone in such a class of 20 having their own OpenRefine container/server, all started with the same single click? Alternatively, we could fire up separate droplet servers, one per participant, each running its own set of containers? That might be harder to initialise though (i.e. more than one or two clicks?!) Or maybe not?)

One thing I haven’t explored yet is mounting data containers/volumes to link to application containers. This makes sense in a data teaching context because it cuts down on bandwidth. If folk are going to work on the same 1GB file, it makes sense to just load it in to the virtual machine once, then let all the containers synch from that local copy, rather than each container having to download its own copy of the file.

The advantage of the approach described in the walkthrough above over “pre-configured” self-hosting solutions is the extensibility of the range of applications available to me. If I can find – or create – a Dockerfile that will configure a container to run a particular application, I can test it on my local machine (using boot2docker, for example) and then deploy a public version in the cloud, at an affordable rate, in just a couple of steps.

Whilst templated configurations using things like fig or panamax which would support the 1-click launch of multiple linked containers configurations aren’t supported by tutum yet, I believe they are in the timeline… So I look forward to trying out a click cloud version of Using Docker to Build Linked Container Course VMs when that comes onstream:-)

In an institutional setting, I can easily imagine a local docker registry that hosts images for apps that are “approved” within the institution, or perhaps tagged as relevant to particular courses. I don’t know if it’s similarly possible to run your own panamax configuration registry, as opposed to pushing a public panamax template for example, but I could imagine that being useful institutionally too? For example, I could put container images on a dockerhub style OU teaching hub or OU research hub, and container or toolchain configurations that pull from those on a panamax style course template register, or research team/project reregister? To front this, something like tutum, though with an even easier interface to allow me to fire up machines and tear them down?

Just by the by, I think part of the capital funding the OU got recently from HEFCE was slated for a teaching related institutional “cloud”, so if that’s the case, it would be great to have a play around trying to set up a simple self-service personal app runner thing ?;-) That said, I think the pitch in that bid probably had the forthcoming TM352 Web, Mobile and Cloud course in mind (2016? 2017??), though from what I can tell I’m about as persona non grata as possible with respect to even being allowed to talk to anyone about that course!;-)

OpenRefine Style Reconciliation Containers

Over the weekend, I rediscovered Michael Bauer/@mihi_tr’s Reconcile CSV [code] service that builds an OpenRefine reconciliation service on top of a CSV file. One column in the CSV file contains a list of values that you want to reconcile (that is, fuzzy match) against, the other is a set of key identifier values associated with the matched against value.

Having already popped OpenRefine into a docker container, I thought I’d also explore dockerising Michael’s service: docker-reconciliation.

The default container uses a CSV file of UK MP names (current and previous) and returns their full title and an identifier used in the UK Parliament Members’ names data platform.

To run service in boot2docker:

  • docker run -p 8002:8000 --name mprecon -d psychemedia/docker-reconciliation
  • boot2docker ip to get the IP address the service is running on, eg
  • Test the service in your browser: Cameroon

In OpenRefine, set the reconciliation service URL to

NOTE: I had thought I should be able to fire up linked OpenRefine and ReconcileCSV containers and address more conveniently, for example:

docker run --name openrefiner -p 3335:3333 --link mprecon:mprecon -d psychemedia/openrefine

and then setting something like http://mprecon:8000/reconcile as the reconciliation service endpoint, but that didn’t seem to work? Instead I had to use the endpoint routed to host (

I also added some command line parameters to the script so that you can fire up the container and reconcile against your own CSV file:

docker run -p 8003:8000 -v /path/to/:/tmp/import -e RECONFILE=myfile.csv -e SEARCHCOL=mysearchcol -e IDCOL=myidcol --name recon_mycsv -d psychemedia/docker-reconciliation

This loads in the file on your host computer at /path/to/myfule.csv using the column named mysearchcol for the search/fuzzy match values and the column named myidcol for the identifiers.

It struck me that I could then commit this customised container as a docker image, and push it to dockerhub as a tagged image. Permissions mean I can’t push to the original trusted/managed repository that builds containers from my github repo, but I can create a new dockerhub repository containing tagged images. For example:

docker commit recon_mycsv psychemedia/docker-reconciler:recon_mycsv
docker push psychemedia/docker-reconciler:recon_mycsv

This means I can collect a whole range of reconciliation services, each independently tagged, at psychemedia/docker-reconciler – tags.

So for example:

  • docker run --name reconcile_ukmps -p 8008:8000 -d psychemedia/docker-reconciler:ukmps_pastpresent runs a reconciliation service agains UK past and present MPs on port 8008;
  • docker run --name reconcile_westminster -p 8009:8000 -d psychemedia/docker-reconciler:westminster_constituency runs a reconciliation service against Westminster constituencies on port 8009.

In practice the current reconciliation service only seems to work well on small datasets, up to a few thousand lines, but nonetheless it can still be useful to be able to reconcile against such datasets. For larger files – such as the UK Companies House register, where we might use registered name for the search column and company number for the ID – it seems to take a while…! (For this latter example, a reconciliation service already exists at OpenCorporates.)

One problem with the approach I have taken is that the data file is mounted within the reconciliation server container. It would probably make more to sense have the RefineCSV container mount a data volume containing the CSV file, so that we can then upgrade the reconciliation server container once and then just link it to data containers. As it is, with the current model, we would have to rebuild each tagged image separately to update the reconciliation server they use.

Unfortunately, I don’t know of an easy way to package up data volume containers (an issue I’ve also come up against with database data stores). What I’d like to be able to do is have a simple “docker datahub” that I could push data volumes to, and then be able to say something like docker run -d --volumes-from psychemedia/reconciliation-data:westminster_constituency --name recon_constituencies psychemedia/reconciliation. Here, --volumes-from would look up data volume containers on something like and psychemedia/reconciliation from

So where’s all this going, and what next? First up, it would be useful to have an official Dockerfile that builds Michael’s Reconcile CSV server. (It would also be nice to see an example of a Python based reconciliation server – because I think I might be able to hack around with that! [UPDATE – there is one here that I forked here and dockerised here]) Secondly, I really need to find a way through the portable data volume container mess. Am I missing something obvious? Thirdly, the reconciliation server needs a bit of optimisation so it can work with larger files, a fast fuzzy match of some sort. (I also wonder whether a lite reconciliation wrapper for PostgreSQL would be useful that can leverage the PostgreSQL backend and fuzzy search plugin to publish a reconciliation service?)

And what’s the payoff? The ability to quickly fire up multiple reconciliation services against reference CSV documents.

Defining Environment Variables Indirectly in bash

I spent a chunk of time this morning engaged in what ended up being something of a red herring, but learning was involved along the way, so here it is… how to set an environment variable indirectly in a bash shell.

Suppose I have a variable TAG=key and a variable VARVAL=thisval.

#Set key_val=$VARVAL
eval ${TAG}_val=\$VARVAL

#Export key_val=$VARVAL
export eval ${TAG}_val=\$VARVAL

Now suppose I want to test if $TAG exists, and further whether it is set to the same value as $CURRTAG. The ${TAG:+1} tests whether that TAG variable exists and that it is not empty. The -a is a logical AND.

if [ -n "${TAG:+1}" -a "$TAG" != "$CURRTAG" ]; then
    export VARVAL=${!tmpf}
    export CURRTAG=$TAG

Erm, I think… I realised this wouldn’t actually be appropriate for the context I had in mind so never fully tested it…

Adding Metadata to Google Docs

A couple of months ago I had started working on an export tool that would export a Google doc in the OU-XML format. The rationale? The first couple of drafts of the teaching material that will be delivered through the VLE in the forthcoming (October, 2015) OU Data management and analysis course (TM351) have been prepared in Google docs, and the production process will soon have to move to the Open University’s XML workflow. This workflow is built around an OU defined schema, often referred to as OU-XML (or OUXML), and is supported by a couple of oXygen XML editor extensions that make it easy to preview rendered versions of the documents in a test VLE site.

The schema itself includes several elements that are more akin to metadata elements than actual content – things like the course code, course title, for example, or the byline (or lead author) of a particular unit.

Support for a small amount of metadata is provided by Google Drive, but the only easily customisable element is a free text description element.


So whilst patching a couple of “issues” today with the Google Docs to OU-XML generator, and adding a menu option that allows users to create a zip file in Google Drive that contains the OU-XML and any associated image files for a particular Google doc, I thought it might also be handy to add some support for additional metadata elements. Google Drive apps support a Properties class that allows metadata properties represented as key-value pairs to be associated with a particular document, user or script. Google Apps Script can be used to set and retrieve these properties. In addition, Google Apps Script can be used to generate templated HTML user interface forms that can be used to extend Google docs or spreadsheets functionality.

In particular, I created a handful of Google Apps Script functions to pop up a templated panel, save metadata descriptions entered into the metadata form as document properties, and retrieve the value of a particular metadata element.

//Pop up the metadata edit/display panel
//The document is created as a templated HTML document
function metadataView() {
  // Generate the HTML
  html= HtmlService
  //Pop up a panel and render the HTML describing the metadata form inside it
  DocumentApp.getUi().showModalDialog(html, 'Metadata');

//This function sets the document properties from the metadata form elements
function processMetadataForm(theForm) {
  var props=PropertiesService.getDocumentProperties()
  //Process each form element (atm, they are just input text elements)
  for (var item in theForm) {

The templated HTML form is configured using a set of desired metadata elements. Each element is described using a label that is displayed in the form, an attribute name which should be a single word) and an optional default value. The template also demonstrates how we can call a server side Apps Script function from the dialogue using the construction.

<script src="//"></script>
//Add metadata fields here in the following format:
//[Label, a unique identifier (unique word, no spaces or punctuation), an optional default value]
var metadataItems =[
    ["Lead Author","leadAuthor"],
    ["Course Code","courseCode"],
    ["Course Title","courseTitle"],
    ["Unit Title","unitTitle"],
    ["Rendering","rendering","VLE2 staff (learn3)"]
<? var metadata = PropertiesService.getDocumentProperties() ?>
//When the metadata has been successfully saved as document properties
//  close the metadata form panel
function onSave() {}
<form id='metadataForm'>
<!-- Construct a set of form elements, one for each metadata item -->
<? for (var i = 0; i < metadataItems.length; i++) { ?>
  <div><?= metadataItems[i][0] ?>: 
    <input type="text"
      name = "<?= metadataItems[i][1] ?>"
      <? val=''
        if (metadataItems[i].length>2) val= metadataItems[i][2]  ?>
      value= "<?= metadata.getProperty(metadataItems[i][1]) ? metadata.getProperty(metadataItems[i][1])  : val  ?>"
<? } ?>
    value="Save & Close"

When the metadataView() function is called from the Add-Ons menu, it pops a dialogue that looks (in unstyled form) something like this:


Metadata elements are loaded in to the form if they exist or a default value is specified.

When generating the export OU-XML, a helper function grabs the value of the relevant metadata element from the document properties. This value then then be inserted into the OU XML at the appropriate point.

//A helper function to display a particular metadata element
//This function is called from the metadata form
function getProp(key) {
  var props= PropertiesService.getDocumentProperties()
  return props.getProperty(key) ? props.getProperty(key) : '';

var COURSECODE= getProp('courseCode');

One issue with this approach is that if we have lots of documents relating to different units for the same course, we may need to enter the same values for several metadata elements across each document (for example, the course code and course title). Unfortunately, Google Drive does not support arbitrary properties for folders. One solution, suggested by Tom Smith/@everythingabili was to use the description element for a folder to store JSON represented metadata. I think we could actually simplify that, using a line based representation or a simple delimited representation that we can easily split on, something like:

courseCode :: TM351;;
courseTitle:: Data Management and Analysis

for example. We could then split on ;; to get each pair, strip whitespace, split on :: and strip whitespace again to get the key:value elements for each metadata item.


I guess one way of getting the folder decription given a particular document as a starting point is to find the parent folder using file#getParents() perhaps?) and then call folder#getDescription()?

Another approach might be to have a dummy, canonically named file in each folder (metadata for example), that we add metadata to, and then whenever we open a new file in the folder we look for the metadata file, get its metadata property values, and use those to seed the metadata values for our content document.

Finally, it’s maybe worth also pondering the issue of generating the OU-XML export for all the documents within a given folder? One way to do this might be to create a function off a each document that will find the parent folder, find all the files (except, perhaps, a metadata file?!) in that folder, and then run the OU-XML generator over all of them, bundling them up into a single zip file, perhaps with a directory structure that puts the OU XML for each document, along with any image files associated with it, into separate folders?

Only it probably isn’t.. I suspect that if the migration to the OU-XML format, if it hasn’t already happened, will involve copying and pasting…

PS for completeness, the menu option can be installed as follows:

function onOpen(e) {