Archive for the ‘Tinkering’ Category
On my to do list over the next few weeks is to pull together a set of resources that could be useful in supporting data related activities around the UK General Election.
For starters, I’ve popped up an example of using the folium Python library for plotting simple choropleth maps using geojson based Westminster Parliamentary constituency boundaries: example election maps notebook.
What I haven’t yet figured out – and don’t know if it’s possible – is how to generate qualitative/categorical maps using predefined colour maps (so eg filling boundaries using colour to represent the party of the current MP etc). If you know how to do this, please let me know via the comments…;-)
Also in the notebook is a reference to an election odds scraper I’m running over each (or at least, many… let me know if you spot any missing ones…) Parliamentary constituency. The names associated with the constituencies don’t correspond in an exact match sense to any standard vocabularies, so on the to do list is to work out a mapping from the election odds constituency names to standard constituency identifiers. I’m thinking this could represent a handy way of demonstrating my Westminster Constituency reconciliation service docker container… :-)
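As a crude first pass at that mapping, the Python standard library's fuzzy matcher might get much of the way there. The constituency names and identifier below are purely illustrative:

```python
import difflib

# Illustrative only: a "standard" vocabulary mapping constituency names
# to identifiers (names and codes here are placeholders, not checked)
STANDARD = {
    "Isle Of Wight": "E14000762",
    "Na h-Eileanan an Iar": "S14000027",
}

def best_match(name, cutoff=0.8):
    # Fuzzy match a scraped election odds constituency name against
    # the standard vocabulary; returns (standard name, identifier) or None
    matches = difflib.get_close_matches(name, STANDARD.keys(), n=1, cutoff=cutoff)
    return (matches[0], STANDARD[matches[0]]) if matches else None

print(best_match("Isle of Wight"))
```

Anything that falls below the cutoff could then be shunted off to the reconciliation service, or to a hand-curated exceptions list.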
Trying to clear my head of code on a dog walk after a couple of days tinkering with the nomis API and I started to ponder what an API is good for.
Chris Gutteridge and Alex Dutton’s open data excuses bingo card and Owen Boswarva’s Open Data Publishing Decision Tree both suggest that not having an API can be used as an excuse for not publishing a dataset as open data.
So what is an API good for?
I think one naive view is that this is what an API gets you…
It doesn’t of course, because folk actually want this…
Which is not necessarily that easy even with an API:
For a variety of reasons…
Even when the discovery part is done and you think you have a reasonable idea of how to call the API to get the data you want out of it, you’re still faced with the very real practical problem of how to actually get the data into the analysis environment in a form that is workable in that environment. Just because you publish standards based SDMX flavoured XML doesn’t mean anything to anybody if they haven’t got an “import from SDMX flavoured XML directly into some format I know how to work with” option.
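To make the “import from X” point concrete, here's the sort of flattening step a user shouldn't have to write for themselves. The payload shape here is made up for illustration – real SDMX is rather more involved:

```python
import pandas as pd

# A made-up fragment of the sort of nested JSON an API might hand back;
# the payload shape is illustrative, not any real SDMX serialisation
api_response = {
    "dataset": "jobseekers_allowance",
    "observations": [
        {"geography": "Isle of Wight", "date": "2015-01", "value": 2145},
        {"geography": "Isle of Wight", "date": "2015-02", "value": 2098},
    ],
}

# The "import from X" step: flatten the payload into a typed dataframe
df = pd.DataFrame(api_response["observations"])
df["date"] = pd.to_datetime(df["date"])
print(df)
```

Trivial for this toy payload; considerably less so for each publisher's own flavour of nested, standards based XML.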
And even then, once the data is in, the problems aren’t over…
(I’m assuming the data is relatively clean and doesn’t need any significant amount of cleaning, normalising, standardising, type-casting, date parsing etc etc. Which is of course likely to be a nonsense assumption;-)
So what is an API good for, and where does it actually exist?
I’m starting to think that for many users, if there isn’t some equivalent of an “import from X” option in the tool they are using or environment they’re working in, then the API-for-X is not really doing much of a useful job for them.
Also, if there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context and essentially means that the API-for-X is not really doing much of a useful job for them.
What I tried to do in doodling the Python / pandas Remote Data Access Wrapper for the Nomis API was create for myself some tools that would help me discover various datasets on the nomis platform from my working environment – an IPython notebook – and then fetch any data I wanted from the platform into that environment in a form in which I could immediately start working with it – which is to say, typically as a pandas dataframe.
I haven’t started trying to use it properly yet – and won’t get a chance to for a week or so at least now – but that was the idea. That is, the wrapper should support the discovery and access parts of the conversation I want to have with the nomis data from within my chosen environment. That’s what I want from an API. Maybe?!;-)
And note further – this does not mean that things like a pandas Remote Data Access plugin or a CRAN package for R (such as the World Bank Development Indicators package, or any of the other data/API packages referenced from the rOpenSci packages list) should be seen as extensions of the API. At worst, they should be seen as projections of the API into user environments. At best, it is those packages that should be seen as the actual API.
APIs for users – not programmers. That’s what I want from an API.
PS See also this response from @apievangelist: The API Journey.
One of the things I keep failing to spend time looking at and playing with is the generation of text based reports from tabular datasets (“data2txt”, “data to text”, “textualisation”, “natural language generation (NLG)” etc).
One of my earlier fumblings was to look at generating “press releases” around monthly Jobseeker’s Allowance figures. One of the reasons that stalled was the amount of time it took me to find my way around the nomis site and API, trying to piece together the URLs that would let me pull down the data in a meaningful way and help me properly understand what the data actually referred to.
So over the weekend, I started to put together a wrapper for the nomis API that would let me have a conversation with it, so that I could start to find out what sorts of datasets it knows about and how I can run queries into those datasets (that is, what dimensions are available for each dataset that we can query on), as well as pulling back actual datasets from it.
To make the data easier to work with once I have pulled it down, I put it into a pandas dataframe so that I can work with it in that context.
(With Open Knowledge, I’m running a series of Code Clubs in “Wrangling data with Python” for the Cabinet Office at the moment, based around pandas and IPython notebooks. If I can get the wrapper working reliably enough, it could be interesting to see what they make of it…)
This is a flavour of the sorts of thing I’ve been reaching for with it:
To make life easier, you can pass in dimension parameter values using either the dimension parameter codes or their actual values; because the nomis API requires the codes, legitimate values are automatically converted. (Note to self – add further checks to discard illegitimate values, where detected…)
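For illustration, here's a minimal sketch of the sort of URL-building the wrapper takes care of. The dataset id and dimension codes below follow the general nomis API pattern but haven't been checked against the live service:

```python
# Sketch of the sort of call the wrapper wraps up: nomis publishes datasets
# at URLs of the form /api/v01/dataset/<id>.data.csv?<dimension>=<code>&...
# The dataset id and dimension codes below are illustrative and unchecked.
NOMIS_API = "http://www.nomisweb.co.uk/api/v01/dataset"

def nomis_url(dataset, **dims):
    # Build a CSV download URL from dimension code/value pairs
    params = "&".join("{}={}".format(k, v) for k, v in sorted(dims.items()))
    return "{}/{}.data.csv?{}".format(NOMIS_API, dataset, params)

url = nomis_url("NM_1_1", geography="2038432081", date="latest")
print(url)
# A line like pd.read_csv(url) then pulls the result straight into pandas
```

The wrapper's job is exactly to hide this string-fiddling, and the code lookups that feed it, behind something conversational.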
Any comments, feedback, issues if you try it etc, please let me know via the comments to this post (for now…!).
PS next up – revisit the ONS API following this first, aborted attempt.
Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.
Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:
### Show nodejs version and packages
> node -v
v0.10.35
> npm list -g
/usr/local/lib
├─┬ …
│ ├── …
│ └── …
└─┬ …
  ├─┬ …
…
Using this information, I could then use nvm, a node.js version manager, installed via:
curl https://raw.githubusercontent.com/creationix/nvm/v0.23.3/install.sh | NVM_DIR=/usr/local/lib/ bash
to install, from a new shell, the version I knew worked:
nvm install 0.10.35
npm install tty.js
(I should probably add the tty.js version in there too? npm install tty.js@<version> perhaps?)
The terminal can then be run as a daemon from:
/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize
What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and then recreate it on a new machine? That is, export the node version number and the versions of the installed packages, then create an installation script that will recreate that setup?
It would be handy if this approach could be extended further. For example, we can also look at the packages – and their version numbers – installed on the Linux box using:
### Show packages
> dpkg -l
And we can get a list of Python packages – and their version numbers – using:
### Show Python packages
> pip freeze
Surely there must be some simple tools/utilities out there that support this sort of thing? Or even just cheatsheets that show you what commands to run to export the packages and versions into a file, in a format that allows you to use that file as part of an installation script on a new machine to help rebuild the original one?
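For the Python case at least, a few lines of code will dump installed package versions into a pip-installable requirements file – essentially what pip freeze does from the command line:

```python
# Dump installed Python packages and versions in pip's requirements format;
# a new machine can then be rebuilt with: pip install -r requirements.txt
from importlib.metadata import distributions

requirements = sorted(
    "{}=={}".format(d.metadata["Name"], d.version)
    for d in distributions() if d.metadata["Name"]
)
with open("requirements.txt", "w") as f:
    f.write("\n".join(requirements))
print(requirements[:5])
```

Something of the same shape for node (version plus npm package pins) and for apt packages is what I'm after.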
So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, from .doc to .docx, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example). And your task is to identify the names of the people identified in those documents and the companies they have been associated with.
Or you’ve been presented with a set of scanned PDF documents, where the text is selectable, or worse, a set of png images of text documents. And you have a stack of them to search through to find a particular name. What do you do?
Apart from cry a little, inside?
If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML document to help you identify different meaningful elements within a webpage, and as a result try to recreate the database that contained the data that was used to generate the web pages.
But in a stack of arbitrary documents, or scanned image files, there is no consistent template you can work with to help you write the one scraper that will handle all the documents.
So how about a weaker form of document parsing? Text extraction, for example. Rather than trying to recreate a database, how about we settle for just getting the text out (the sort of thing a search engine might extract from a set of documents so that it can index and search over them, for example).
Something like this Microsoft Office (word) doc for example:
Or this scanned PDF (the highlighted text shows the text is copyable as such – so it is actually in the document as text):
Or this image I captured from a fragment of the scanned PDF – no text as such here…:
What are we to do?
Here’s where Apache Tika can help…
Apache Tika is like magic: give it a document and it’ll (try to) give you back the text it contains. Even if that document is an image. Tika is quite a hefty bit of code, but it’s something you can run quite easily yourself as a service, using the magic of docker containers.
In this example, I’m running Apache Tika as a web service in the cloud for a few pennies an hour; and yes, you can do this yourself – instructions for how to run Apache Tika in the cloud or on your own computer are described at the end of the post…
In my case, I had Apache Tika running at the address http://quicktika-1.psychemedia.cont.tutum.io:8008/tika (that address is no longer working).
I was working in an IPython notebook running on a Linux machine (the recipe will also work on a Mac; on Windows, you may need to install curl).
There are two steps:
- PUT the file you want the text extracted from to the server; I use curl, with a command of the form curl -T path/to/myfile.png http://quicktika-1.psychemedia.cont.tutum.io:8008/rmeta > path/to/myfile_txt.json
- Look at the result in the returned JSON file (path/to/myfile_txt.json)
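If you'd rather stay inside the notebook than shell out to curl, the same PUT can be made from Python using just the standard library. The endpoint address is whatever your own Tika container reports:

```python
import json
from urllib import request

# Point this at your own Tika server (the tutum/boot2docker address will differ)
TIKA_RMETA = "http://localhost:9998/rmeta"

def extract_text(path, endpoint=TIKA_RMETA):
    # Equivalent of: curl -T path endpoint > out.json
    with open(path, "rb") as f:
        req = request.Request(endpoint, data=f.read(), method="PUT")
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# extract_text("myfile.png") returns the JSON metadata records,
# with the extracted text in the "X-TIKA:content" element
```

From there the result is already a Python data structure, ready to poke around in.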
Simple as that…simple as this:
Parse the word doc shown above…
You can see the start of the extracted text in the X-TIKA:content element at the bottom…
Parse the PDF doc shown above…
Parse the actual image of fragment of the PDF doc shown above…
See how Tika has gone into image parsing and optical character recognition mode automatically, and done its best to extract the text from the image file? :-)
Running Apache Tika in the Cloud
As described in Getting Started With Personal App Containers in the Cloud, the first thing you need to do is set up an account with a cloud service provider – I’m using Digital Ocean at the moment: it has simple billing and lets you launch cloud hosted virtual machines of a variety of sizes in a variety of territories, including the UK. Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. You don’t need to do anything else – we’ll be using that service via another one… [Affiliate Link: sign up to Digital Ocean and get $10 credit]
Launch a node cluster as described at the start of Getting Started With Personal App Containers in the Cloud. The 2GB/2 core server is plenty.
Now launch a container – the one you want is logicalspark/docker-tikaserver:
To be able to access the service over the web, you need to make its ports public:
I’m going to give it a custom port number, but you don’t have to, in which case a random one will be assigned:
Having created and deployed the container, look up its address from the Endpoints tab. The address will be something like tcp://thing-1.yourid.cont.tutum.io:NNNN. You can check the service is there by going to thing-1.yourid.cont.tutum.io:NNNN/tika in your browser.
When you’re done, terminate the container and the node cluster so you don’t get billed any more than is necessary.
Running Apache Tika on your own computer
- Install boot2docker
- Launch boot2docker
- In the boot2docker command line, enter: docker pull logicalspark/docker-tikaserver to grab the container image;
- To run the container: docker run -d -p 9998:9998 logicalspark/docker-tikaserver
- enter boot2docker ip to find the address boot2docker is publishing to (eg 192.168.59.103);
- Check the server is there – in your browser, go to eg: http://192.168.59.103:9998/tika
(Don’t be afraid of command lines; you probably already know how to download an app (step 1), definitely know how to launch an app (step 2), know how to type (steps 3 to 5), and know how to go to a web location (step 6; note: you do have to enter this URL in the browser location bar at the top of the screen – entering it into Google won’t help…;-). All steps 3 to 5 do is get you to write the commands the computer is to follow, rather than automatically selecting them from a nicely named menu option. (What do you think a computer actually does when you select a menu option?!)
PS via @Pudo, see also: textract – python library for “extracting text out of any document”.
…aka “how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour”…
I managed to get my first container up and running in the cloud today (yeah!:-), using tutum to launch a container I’d defined on Dockerhub and run it on a linked DigitalOcean server (or as they call them, “droplet”).
This sort of thing is probably a “so what?” to many devs, or even folk who do the self-hosting thing, where for example you can launch your own web applications using CPanel, setting up your own WordPress site, perhaps, or an online database.
The difference for me is that the instance of OpenRefine I got up and running in the cloud via a web browser was the result of composing several different, loosely coupled services together:
- I’d already published a container on dockerhub that launches the latest release version of OpenRefine: psychemedia/docker-openrefine. This lets me run OpenRefine in a boot2docker virtual machine running on my own desktop and access it through a browser on the same computer.
- Digital Ocean is a cloud hosting service with simple billing (I looked at Amazon AWS but it was just too complicated) that lets you launch cloud hosted virtual machines of a variety of sizes and in a variety of territories (including the UK). Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. So that’s all I did there – created an account and made a small payment. [Affiliate Link: sign up to Digital Ocean and get $10 credit]
- tutum, an intermediary service that makes it easy to launch servers and the containers running inside them. By linking a DigitalOcean account to tutum, I can launch containers on DigitalOcean in a relatively straightforward way…
Launching OpenRefine via tutum
I’m going to start by launching a 2GB machine, which comes in at 3 cents an hour, capped at $20 a month.
Now we need to get a container – which I’m thinking of as if it was a personal app, or personal app server:
I’m going to make use of a public container image – here’s one I prepared earlier…
We need to do a tiny bit of configuration. Specifically, all I need to do is ensure that I make the port public so I can connect to it; by default, it will be assigned to a random port in a particular range on the publicly viewable service. I can also set the service name, but for now I’ll leave the default.
If I create and deploy the container, the image will be pulled from dockerhub and a container launched based on it that I should be able to access via a public URL:
The first time I pull the container into a specific machine it takes a little time to set up as the container files are imported into the machine. If I create another container using the same image (another OpenRefine instance, for example), it should start really quickly because all the required files have already been loaded into the node.
Unfortunately, when I go through to the corresponding URL, there’s nothing there. Looking at the logs, I think maybe there wasn’t enough memory to launch a second OpenRefine container… (I could test this by launching a second droplet/server with more memory, and then deploying a couple of containers to that one.)
The billing is calculated on DigitalOcean on a hourly rate, based on the number and size of servers running. To stop racking up charges, you can terminate the server/droplet (so you also lose the containers).
Note that in the case of OpenRefine, we could allow several users all to access the same OpenRefine container (the same URL) and just run different projects within it.
Although this is probably not the way that dev ops folk think of containers, I’m seeing them as a great way of packaging service based applications that I might want to run at a personal level, or perhaps in a teaching/training context – maybe on a self-service basis, maybe on a teacher self-service basis (fire up one application server that everyone in a cohort can log on to, or one container/application server for each of them). I noticed that I could automatically launch as many containers as I wanted – a 64GB, 20 core server costs about $1 per hour on Digital Ocean, so for an all day School of Data training session, for example, with 15-20 participants, that would be about $10, with everyone in such a class of 20 having their own OpenRefine container/server, all started with the same single click. Alternatively, we could fire up separate droplet servers, one per participant, each running its own set of containers? That might be harder to initialise though (i.e. more than one or two clicks?!) Or maybe not?
One thing I haven’t explored yet is mounting data containers/volumes to link to application containers. This makes sense in a data teaching context because it cuts down on bandwidth. If folk are going to work on the same 1GB file, it makes sense to just load it in to the virtual machine once, then let all the containers synch from that local copy, rather than each container having to download its own copy of the file.
The advantage of the approach described in the walkthrough above over “pre-configured” self-hosting solutions is the extensibility of the range of applications available to me. If I can find – or create – a Dockerfile that will configure a container to run a particular application, I can test it on my local machine (using boot2docker, for example) and then deploy a public version in the cloud, at an affordable rate, in just a couple of steps.
Whilst templated configurations using things like fig or panamax – which would support the one-click launch of multiple linked container configurations – aren’t supported by tutum yet, I believe they are in the timeline… So I look forward to trying out a one-click cloud version of Using Docker to Build Linked Container Course VMs when that comes onstream:-)
In an institutional setting, I can easily imagine a local docker registry that hosts images for apps that are “approved” within the institution, or perhaps tagged as relevant to particular courses. I don’t know if it’s similarly possible to run your own panamax configuration registry, as opposed to pushing a public panamax template for example, but I could imagine that being useful institutionally too? For example, I could put container images on a dockerhub style OU teaching hub or OU research hub, and container or toolchain configurations that pull from those on a panamax style course template register, or research team/project register? To front this, something like tutum, though with an even easier interface to allow me to fire up machines and tear them down?
Just by the by, I think part of the capital funding the OU got recently from HEFCE was slated for a teaching related institutional “cloud”, so if that’s the case, it would be great to have a play around trying to set up a simple self-service personal app runner thing ?;-) That said, I think the pitch in that bid probably had the forthcoming TM352 Web, Mobile and Cloud course in mind (2016? 2017??), though from what I can tell I’m about as persona non grata as possible with respect to even being allowed to talk to anyone about that course!;-)
Over the weekend, I rediscovered Michael Bauer/@mihi_tr’s Reconcile CSV [code] service that builds an OpenRefine reconciliation service on top of a CSV file. One column in the CSV file contains a list of values that you want to reconcile (that is, fuzzy match) against, the other is a set of key identifier values associated with the matched against value.
The default container uses a CSV file of UK MP names (current and previous) and returns their full title and an identifier used in the UK Parliament Members’ names data platform.
To run the service in boot2docker:
- docker run -p 8002:8000 --name mprecon -d psychemedia/docker-reconciliation
- boot2docker ip to get the IP address the service is running on, eg 192.168.59.103
- Test the service in your browser: http://192.168.59.103:8002/reconcile?query=David Cameroon
In OpenRefine, set the reconciliation service URL to http://192.168.59.103:8002/reconcile.
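You can also poke the reconciliation API directly from Python – handy for checking what OpenRefine will see. The endpoint address is whatever your own container reports:

```python
import json
from urllib import request, parse

# Point this at wherever your reconciliation container is listening
RECON = "http://192.168.59.103:8002/reconcile"

def recon_url(name, endpoint=RECON):
    # URL-encode the query so names with spaces work
    return endpoint + "?" + parse.urlencode({"query": name})

def reconcile(name, endpoint=RECON):
    # Fetch and parse the JSON list of candidate matches
    with request.urlopen(recon_url(name, endpoint)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# reconcile("David Cameroon") returns candidate matches with fuzzy match scores
```

The same call pattern should work against any other reconciliation endpoint, such as the OpenCorporates one mentioned below.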
NOTE: I had thought I should be able to fire up linked OpenRefine and ReconcileCSV containers and address them more conveniently, for example:
docker run --name openrefiner -p 3335:3333 --link mprecon:mprecon -d psychemedia/openrefine
and then setting something like http://mprecon:8000/reconcile as the reconciliation service endpoint, but that didn’t seem to work? Instead I had to use the endpoint routed to host (http://192.168.59.103:8002/reconcile).
I also added some command line parameters to the script so that you can fire up the container and reconcile against your own CSV file:
docker run -p 8003:8000 -v /path/to/:/tmp/import -e RECONFILE=myfile.csv -e SEARCHCOL=mysearchcol -e IDCOL=myidcol --name recon_mycsv -d psychemedia/docker-reconciliation
This loads in the file on your host computer at /path/to/myfile.csv, using the column named mysearchcol for the search/fuzzy match values and the column named myidcol for the identifiers.
It struck me that I could then commit this customised container as a docker image, and push it to dockerhub as a tagged image. Permissions mean I can’t push to the original trusted/managed repository that builds containers from my github repo, but I can create a new dockerhub repository containing tagged images. For example:
docker commit recon_mycsv psychemedia/docker-reconciler:recon_mycsv
docker push psychemedia/docker-reconciler:recon_mycsv
This means I can collect a whole range of reconciliation services, each independently tagged, at psychemedia/docker-reconciler – tags.
So for example:
- docker run --name reconcile_ukmps -p 8008:8000 -d psychemedia/docker-reconciler:ukmps_pastpresent runs a reconciliation service against UK past and present MPs on port 8008;
- docker run --name reconcile_westminster -p 8009:8000 -d psychemedia/docker-reconciler:westminster_constituency runs a reconciliation service against Westminster constituencies on port 8009.
In practice the current reconciliation service only seems to work well on small datasets, up to a few thousand lines, but nonetheless it can still be useful to be able to reconcile against such datasets. For larger files – such as the UK Companies House register, where we might use registered name for the search column and company number for the ID – it seems to take a while…! (For this latter example, a reconciliation service already exists at OpenCorporates.)
One problem with the approach I have taken is that the data file is mounted within the reconciliation server container. It would probably make more sense to have the ReconcileCSV container mount a data volume containing the CSV file, so that we can then upgrade the reconciliation server container once and then just link it to data containers. As it is, with the current model, we would have to rebuild each tagged image separately to update the reconciliation server they use.
Unfortunately, I don’t know of an easy way to package up data volume containers (an issue I’ve also come up against with database data stores). What I’d like to be able to do is have a simple “docker datahub” that I could push data volumes to, and then be able to say something like docker run -d --volumes-from psychemedia/reconciliation-data:westminster_constituency --name recon_constituencies psychemedia/reconciliation. Here, --volumes-from would look up data volume containers on something like registry.datahub.docker.com and psychemedia/reconciliation from registry.hub.docker.com.
So where’s all this going, and what next? First up, it would be useful to have an official Dockerfile that builds Michael’s Reconcile CSV server. (It would also be nice to see an example of a Python based reconciliation server – because I think I might be able to hack around with that! [UPDATE – there is one here that I forked here and dockerised here]) Secondly, I really need to find a way through the portable data volume container mess. Am I missing something obvious? Thirdly, the reconciliation server needs a bit of optimisation so it can work with larger files, a fast fuzzy match of some sort. (I also wonder whether a lite reconciliation wrapper for PostgreSQL would be useful that can leverage the PostgreSQL backend and fuzzy search plugin to publish a reconciliation service?)
And what’s the payoff? The ability to quickly fire up multiple reconciliation services against reference CSV documents.