Getting Text Out Of Anything (docs, PDFs, Images) Using Apache Tika

So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, from .doc to .docx, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example). And your task is to identify the names of the people mentioned in those documents and the companies they have been associated with.

Or you’ve been presented with a set of scanned PDF documents, where the text is selectable, or worse, a set of png images of text documents. And you have a stack of them to search through to find a particular name. What do you do?

Apart from cry a little, inside?

If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML document to help you identify different meaningful elements within a webpage, and as a result try to recreate the database that contained the data that was used to generate the web pages.

But in a stack of arbitrary documents, or scanned image files, there is no consistent template you can work with to help you write the one scraper that will handle all the documents.

So how about a weaker form of document parsing? Text extraction, for example. Rather than trying to recreate a database, how about we settle for just getting the text (the sort of thing a search engine might extract from a set of documents so that it can index and search over them, for example)?

Something like this Microsoft Office (Word) doc, for example:

bio word doc

Or this scanned PDF (the highlighted text shows the text is copyable as such – so it is actually in the document as text):

scan_ocr

Or this image I captured from a fragment of the scanned PDF – no text as such here…:

ED121193

What are we to do?

Here’s where Apache Tika can help…

Apache Tika is like magic; give it a document and it’ll (try to) give you back the text it contains. Even if that document is an image. Tika is quite a hefty bit of code, but it’s something you can run quite easily yourself as a service, using the magic of Docker containers.

In this example, I’m running Apache Tika as a web service in the cloud for a few pennies an hour; and yes, you can do this yourself – instructions for how to run Apache Tika in the cloud or on your own computer are described at the end of the post…

In my case, I had Apache Tika running at the address http://quicktika-1.psychemedia.cont.tutum.io:8008/tika (that address is no longer working).

I was working in an IPython notebook running on a Linux machine (the recipe will also work on a Mac; on Windows, you may need to install curl).

There are two steps:

  1. PUT the file you want the text extracted from to the server; I use curl, with a command of the form curl -T path/to/myfile.png http://quicktika-1.psychemedia.cont.tutum.io:8008/rmeta > path/to/myfile_txt.json
  2. Look at the result in the returned JSON file (path/to/myfile_txt.json)
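If you’d rather stay inside the notebook than shell out to curl, the same PUT can be made from Python. This is only a minimal sketch using the standard library: the server address (a local Tika server on port 9998) and the function names are my own assumptions, so swap in whatever address your Tika server is actually running at.

```python
import json
import urllib.request

# Assumption: a Tika server on the local machine; use your own address here.
TIKA_RMETA_URL = "http://localhost:9998/rmeta"


def build_put_request(data, url=TIKA_RMETA_URL):
    """Build (without sending) the PUT request that `curl -T file url` would make."""
    headers = {
        # urllib would otherwise label the body as form-encoded data, so we
        # send a neutral type and leave the server to detect the file type.
        "Content-Type": "application/octet-stream",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=data, method="PUT", headers=headers)


def put_file_to_tika(path, url=TIKA_RMETA_URL):
    """PUT the file at `path` to the server and return the parsed JSON reply."""
    with open(path, "rb") as f:
        request = build_put_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a running server, so it is left commented out):
# metadata = put_file_to_tika("path/to/myfile.png")
```

Splitting the request building from the sending means you can inspect exactly what will be sent before pointing it at a live server.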

Simple as that…simple as this:

Parse the word doc shown above…

You can see the start of the extracted text in the X-TIKA:content element at the bottom…

tika-extract1
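To pull just the text out of the saved JSON file, something like the following should do. It’s a sketch that assumes the file holds a list of metadata records (one per document or embedded document), each of which may carry a content element; the key is matched case-insensitively because I’m not relying on its exact capitalisation. The function names are mine.

```python
import json

# Compared in lower case against each key in a record.
CONTENT_KEY = "x-tika:content"


def extract_text(records):
    """Join the text from every record that has a content element."""
    chunks = []
    for record in records:
        for key, value in record.items():
            if key.lower() == CONTENT_KEY and isinstance(value, str):
                chunks.append(value.strip())
    return "\n".join(chunks)


def text_from_json_file(path):
    """Load a saved JSON reply and return the text it contains."""
    with open(path, encoding="utf-8") as f:
        return extract_text(json.load(f))


# Example call on the file saved by the curl command:
# print(text_from_json_file("path/to/myfile_txt.json"))
```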

Parse the PDF doc shown above…

tika-extract2

Parse the actual image of a fragment of the PDF doc shown above…

tika-extract3

See how Tika has gone into image parsing and optical character recognition mode automatically, and done its best to extract the text from the image file? :-)
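With the text in hand, searching a stack of documents for a particular name becomes a simple matching job. Here’s a rough sketch, assuming you’ve already collected the extracted text into a dict keyed by filename (that structure, and the function name, are my own invention). OCR’d text is noisy, so exact matching like this will miss names that were misread.

```python
import re


def find_mentions(name, texts):
    """Return the keys of `texts` whose text mentions `name`.

    Matching ignores case and treats any run of whitespace (including
    line breaks in the extracted text) as a single space.
    """
    parts = [re.escape(part) for part in name.split()]
    pattern = re.compile(r"\s+".join(parts), re.IGNORECASE)
    return [key for key, text in texts.items() if pattern.search(text)]


# Hypothetical example data
texts = {
    "bio1_txt.json": "Dr Jane\nSmith is a director of Example Ltd",
    "bio2_txt.json": "John Doe has spoken at many events",
}
print(find_mentions("jane smith", texts))
```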

Running Apache Tika in the Cloud

As described in Getting Started With Personal App Containers in the Cloud, the first thing you need to do is set up an account with a cloud service provider – I’m using Digital Ocean at the moment: it has simple billing and lets you launch cloud hosted virtual machines of a variety of sizes in a variety of territories, including the UK. Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. You don’t need to do anything else – we’ll be using that service via another one… [Affiliate Link: sign up to Digital Ocean and get $10 credit]

Having got your cloud provider account set up, create an account with Tutum and then link your Digital Ocean account to it.

Launch a node cluster as described at the start of Getting Started With Personal App Containers in the Cloud. The 2GB/2 core server is plenty.

Now launch a container – the one you want is logicalspark/docker-tikaserver:

tutum_tika

To be able to access the service over the web, you need to make its ports public:

tutum_tika2

I’m going to give it a custom port number, but you don’t have to, in which case a random one will be assigned:

tika_tutum3

Having created and deployed the container, look up its address from the Endpoints tab. The address will be something like tcp://thing-1.yourid.cont.tutum.io:NNNN. You can check the service is there by going to thing-1.yourid.cont.tutum.io:NNNN/tika in your browser.

tika_titum4
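If you’re scripting against the service, the tcp:// endpoint address needs turning into an http:// URL before you can call it. A small helper along these lines would do it (the function name and the example port number are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit


def endpoint_to_http(endpoint, path="/tika"):
    """Swap the scheme of an endpoint address for http and append a path."""
    parts = urlsplit(endpoint)
    return urlunsplit(("http", parts.netloc, path, "", ""))


# Hypothetical endpoint, using 8008 as the published port
print(endpoint_to_http("tcp://thing-1.yourid.cont.tutum.io:8008"))
```

Passing a different path (for example "/rmeta") gives the address to PUT files to.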

When you’re done, terminate the container and the node cluster so you don’t get billed any more than is necessary.

quicktika___Tutum5

tika_tutum6

Running Apache Tika on your own computer

  1. Install boot2docker
  2. Launch boot2docker
  3. In the boot2docker command line, enter: docker pull logicalspark/docker-tikaserver to grab the container image;
  4. To run the container: docker run -d -p 9998:9998 logicalspark/docker-tikaserver
  5. Enter boot2docker ip to find the address boot2docker is publishing to (eg 192.168.59.103);
  6. Check the server is there – in your browser, go to eg: http://192.168.59.103:9998/tika
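The browser check in the last step can also be done from a notebook. This is a minimal sketch, assuming the IP address reported by boot2docker ip and the 9998 port mapping used above; the helper names are mine.

```python
import urllib.request


def tika_url(host, port=9998, endpoint="tika"):
    """Build the address of a Tika server endpoint."""
    return "http://{}:{}/{}".format(host, port, endpoint)


def server_is_up(url, timeout=5):
    """Return True if a GET on `url` succeeds with a 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers refused connections, timeouts and HTTP errors
        return False


# Example, using the sample address from the steps above:
# print(server_is_up(tika_url("192.168.59.103")))
```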

(Don’t be afraid of command lines; you probably already know how to download an app (step 1), definitely know how to launch an app (step 2), know how to type (steps 3 to 5), and how to go to a web location (step 6; note: you do have to enter this URL in the browser location bar at the top of the screen – entering it into Google won’t help..;-). All steps 3 to 5 do is get you to write the commands the computer is to follow, rather than automatically selecting them from a nicely named menu option. What do you think a computer actually does when you select a menu option?!)

PS via @Pudo, see also: textract – python library for “extracting text out of any document”.

Running RStudio on Digital Ocean, AWS etc Using Tutum and Docker Containers

Via RBloggers I noticed a tutorial today on Setting Rstudio server using Amazon Web Services (AWS).

In the post Getting Started With Personal App Containers in the Cloud I described how I linked my tutum account to a Digital Ocean hosting account and then launched a Digital Ocean server. (How to link tutum to Amazon AWS is described here: tutum support: Link your Amazon Web Services account.)

Having launched a server (also described in Getting Started With Personal App Containers in the Cloud), we can now create a new service that will fire up an RStudio container.

First up, we need to locate a likely container – the official one is the rocker/rstudio image:

New_Service_Wizard___Tutum

Having selected the image, we need to do a little bit of essential configuration (we could do more, like giving the service a new name):

New_Service_Wizard___Tutum2

Specifically, we need to publish the port so that it’s publicly viewable – then we can Create and Deploy the service:

New_Service_Wizard___Tutum3

After a minute or two, the service should be up and running:

rstudio-8a316887___Tutum4

We can now find the endpoint, and click through to it (note: we need to change the URL from a tcp:// address to an http:// one). (Am I doing something wrong in the set up to stop the http URL being minted as the service endpoint?)

rstudio-8a316887___Tutum5

URL tweaked, you should now be able to see an RStudio login screen. The default user is rstudio and the default password rstudio too:

RStudio_Sign_In

And there we have it:-)

RStudio

So we don’t continue paying for the server, I generally stop the container and then terminate it to destroy it…

rstudio-8a316887___Tutum

And then terminate the node…

Node_dashboard___Tutum5

So, assuming the Amazon sign-up process is painless, I’m guessing it shouldn’t be much harder than that on AWS?

By the by, it’s possible to link containers to other containers; here’s an example (on the desktop, using boot2docker) that links an RStudio container to a MySQL database: Connecting RStudio and MySQL Docker Containers – an example using the ergast db. When I get a chance, I’ll have a go at doing that via tutum too…

PS see also: How to Run An R Shiny App in the Cloud Using Tutum, Digital Ocean and Docker Containers.

How to Run A Shiny App in the Cloud Using Tutum Docker Cloud, Digital Ocean and Docker Containers

Via RBloggers, I spotted this post on Deploying Your Very Own Shiny Server (here’s another on a similar theme). I’ve been toying with the idea of running some of my own Shiny apps, so that post provided a useful prompt, though way too involved for me;-)

So here’s what seems to me to be an easier, rather more pointy-clicky, wiring stuff together way using Docker containers (though it might not seem that much easier to you the first time through!). The recipe includes: github, Dockerhub, Tutum and Digital Ocean.

To begin with, I created a minimal shiny app to allow the user to select a CSV file, upload it to the app and display it. The ui.R and server.R files, along with whatever else you need, should be placed into an application directory, for example shiny_demo, within a project directory, which I’m confusingly also calling shiny_demo (I should have called it something else to make it a bit clearer – for example, shiny_demo_project).

The shiny server comes from a prebuilt docker container on dockerhub – rocker/shiny.

This shiny server can run several Shiny applications, though I only want to run one: shiny_demo.

I’m going to put my application into its own container. This container will use the rocker/shiny container as a base, and simply copy my application folder into the shiny server folder from which applications are served. My Dockerfile is really simple and contains just two lines – it looks like this and goes into a file called Dockerfile in the project directory:

FROM rocker/shiny
ADD shiny_demo /srv/shiny-server/shiny_demo

The ADD command simply copies the contents of the child directory into a similarly named directory in the container’s /srv/shiny-server/ directory. You could add as many applications as you wanted to the server as long as each is in its own directory. For example, if I have several applications:

docker-containers

I can add the second application to my container using:

ADD shiny_other_demo /srv/shiny-server/shiny_other_demo
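If you end up with lots of application folders, you could generate the Dockerfile rather than typing each ADD line by hand. A quick sketch of the idea (the function is hypothetical, and it simply follows the one-folder-per-app pattern used above):

```python
def dockerfile_for_apps(app_dirs, base_image="rocker/shiny",
                        server_root="/srv/shiny-server"):
    """Return Dockerfile text that adds each app folder to the Shiny server."""
    lines = ["FROM {}".format(base_image)]
    for app in app_dirs:
        # One ADD line per application directory
        lines.append("ADD {app} {root}/{app}".format(app=app, root=server_root))
    return "\n".join(lines) + "\n"


print(dockerfile_for_apps(["shiny_demo", "shiny_other_demo"]))
```

Writing the returned text to a file called Dockerfile in the project directory would give the same file as the handwritten version.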

The next thing I need to do is check my shiny_demo project in to Github. (I don’t have a how to on this, unfortunately…) In fact, I’ve checked my project in as part of another repository (docker-containers).

docker-containers_shiny_demo_at_master_·_psychemedia_docker-containers

The next step is to build a container image on DockerHub. If I create an account and log in to DockerHub, I can link my Github account to it.

(I can do that locally if Docker is installed locally by cding into the folder containing the Dockerfile and running docker build -t shinydemo . Then I can run it using the command docker run -p 8881:3838 shinydemo (the server runs on port 3838 in the container, which I expose as localhost:8881; the -p option has to come before the image name, otherwise it gets passed to the container as a command). The application is at localhost:8881/shiny_demo.)

I can then create an Automated Build that will build a container image from my Github repository. First, identify the repository on my linked Github account and name the image:

Docker_Hub_

Then add the path to the project directory that contains the Dockerfile for the image you’re interested in:

Docker_Hub_git

Click on Trigger to build the image the first time. In the future, every time I update that folder in the repository, the container image will be rebuilt to include the updates.

So now I have a Docker container image on Dockerhub that contains the Shiny server from the rocker/shiny image and a copy of my shiny application files.

Now I need to go to Tutum (also part of the Docker empire), which is an application for launching containers on a range of cloud services. If you link your Digital Ocean account to tutum, you can use tutum to launch docker containers from Dockerhub on a Digital Ocean droplet.

Within tutum, you’ll need to create a new node cluster on Digital Ocean:

(Notwithstanding the below, I generally go for a single 4GB node…)

Now we need to create a service from a container image:

I can find the container image I want to deploy on the cluster that I previously built on Dockerhub:

New_Service_Wizard___Tutum1

Select the image and then configure it – you may want to rename it, for example. One thing you definitely need to do though is tick to publish the port – this will make the shiny server port visible on the web.

New_Service_Wizard___Tutum3a

Create and deploy the service. When the container is built, and has started running, you’ll be told where you can find it.

Uploading_Files_and_Welcome_to_Shiny_Server__and_shiny-demo-0aed1c00-1___Tutum

Note that if you click on the link to the running container, the default URL starts with tcp:// which you’ll need to change to http://. The port will be dynamically allocated unless you specified a particular port mapping on the service creation page.

To view your shiny app, simply add the name of the folder the application is in to the URL.

When you’ve finished running the app, you may want to shut the container down – and more importantly perhaps, switch the Digital Ocean droplet off so you don’t continue paying for it!

Node_dashboard___Tutum

As I said at the start, the first time round seems quite complicated. After all, you need to:

(Actually, you can miss out the dockerhub steps, and instead link your github account to your tutum account and do the automated build from the github files within tutum: Tutum automatic image builds from GitHub repositories. The service can then be launched by finding the container image in your tutum repository)

However, once you do have your project files in github, you can then easily update them and easily launch them on Digital Ocean. In fact, you can make it even easier by adding a deploy to tutum button to a project README.md file in Github.

See also: How to run RStudio on Digital Ocean via Tutum and How to run OpenRefine on Digital Ocean via Tutum.

PS to test the container locally, I launch a docker terminal from Kitematic, cd into the project folder, and run something like:

docker build -t psychemedia/shinydemo .
docker run --name shinydemo -i -t -p 8881:3838 psychemedia/shinydemo

In the above, I set the port map (the service runs in the container on 3838, which I expose on the host as localhost:8881). Alternatively, I could find a link to the server from within Kitematic.
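One thing to watch with docker run is argument order: anything that comes after the image name is treated as the command to run inside the container, so options such as -p need to come before it. If you build the command from a script, a helper like this keeps the order straight (a sketch only; the function name and defaults are mine):

```python
def docker_run_command(image, host_port, container_port=3838, name=None):
    """Build a `docker run` argument list with the options ahead of the image."""
    args = ["docker", "run"]
    if name:
        args += ["--name", name]
    # Map host_port on the host to container_port inside the container
    args += ["-i", "-t", "-p", "{}:{}".format(host_port, container_port)]
    args.append(image)  # the image name goes last, after all the options
    return args


print(" ".join(docker_run_command("psychemedia/shinydemo", 8881, name="shinydemo")))
```

The list form can be handed straight to subprocess.run if you want to launch the container from Python.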

Tutum Cloud Becomes Docker Cloud

In several previous posts, I’ve shown how to launch docker containers in the cloud using Tutum Cloud, for example to run OpenRefine, RStudio, or Shiny apps in the cloud. Docker bought Tutum out several months ago, and now it seems that they’re running the tutum environment as Docker Cloud (docs, announcement).

Release_Notes

Tutum – as was – looks like it’s probably on the way out…

Tutum

The onboarding screen confirms that the familiar Tutum features are available in much the same form as before:

Welcome_to_Docker_Cloud

The announcement post suggests that you can get a free node and one free repository, but whilst my Docker hub images seemed to have appeared, I couldn’t see how to get a free node. (Maybe I was too quick linking my Digital Ocean account?)

As with Tutum, (because it is, was, tutum!), Docker Cloud provides you with the opportunity to spin up one or more servers on a linked cloud host and run containers on those servers, either individually or linked together as part of a “stack” (essentially, a Docker Compose container composition). You also get an image repository within Docker Cloud – mine looks as if it’s linked to my DockerHub repository:

Repository_dashboard___Docker_Cloud_and_various

A nice feature of this is that you can 1-click start a container from your image repository if you have a server node running.

The Docker Cloud service currently provides a rebranding of the Tutum offering, so it’ll be interesting to see if product features continue to be developed. One thing I keep thinking might be interesting is a personal MyBinder style service that simplifies the deploy to Tutum (as was) service and allows users to use linked hosts and persistent logins to 1-click launch container services with persistent state (for example, Launch Docker Container Compositions via Tutum and Stackfiles.io – But What About Container Stashing?). I guess this would mean linking in some cheap storage somehow, rather than having to keep server nodes up to persist container state? By the by, C. Titus Brown has some interesting reflections on MyBinder here: Is mybinder 95% of the way to next-gen computational science publishing, or only 90%?

If nothing else, signing up to Docker Cloud does give you a $20 Digital Ocean credit voucher that can also be applied to pre-existing Digital Ocean accounts:-) (If you haven’t already signed up for Digital Ocean but want to give it a spin, this affiliate link should also get you $10 free credit.)

PS as is the way of these things, I currently run docker containers in the cloud on my own tab (or credit vouchers) rather than institutional servers, because – you know – corporate IT. So I was interested to see that Docker have also recently launched a Docker DataCenter service (docs), and the associated promise of “containers-as-a-service” (CaaS), that makes it easy to offer cloud-based container deployment infrastructure. Just sayin’…;-)

PPS So when are we going to get Docker Cloud integration in Kitematic, or a Kitematic client for Docker Cloud?