So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, from .doc to .docx, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example). And your task is to identify the people named in those documents and the companies they have been associated with.
Or you’ve been presented with a set of scanned PDF documents, where the text is at least selectable, or worse, a set of PNG images of text documents. And you have a stack of them to search through to find a particular name. What do you do?
Apart from cry a little, inside?
If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML document to help you identify different meaningful elements within a webpage, and as a result try to recreate the database that contained the data that was used to generate the web pages.
But in a stack of arbitrary documents, or scanned image files, there is no consistent template you can work with to help you write the one scraper that will handle all the documents.
So how about a weaker form of document parsing? Text extraction, for example. Rather than trying to recreate a database, how about we settle for just getting the text out (the sort of thing a search engine might extract from a set of documents so that it can index and search over them, for example)?
Something like this Microsoft Office (Word) doc, for example:
Or this scanned PDF (the highlighted text shows the text is copyable as such – so it is actually in the document as text):
Or this image I captured from a fragment of the scanned PDF – no text as such here…:
What are we to do?
Here’s where Apache Tika can help…
Apache Tika is like magic; give it a document and it’ll (try to) give you back the text it contains. Even if that document is an image. Tika is quite a hefty bit of code, but it’s something you can run quite easily yourself as a service, using the magic of Docker containers.
In this example, I’m running Apache Tika as a web service in the cloud for a few pennies an hour; and yes, you can do this yourself – instructions for how to run Apache Tika in the cloud or on your own computer are described at the end of the post…
In my case, I had Apache Tika running at the address http://quicktika-1.psychemedia.cont.tutum.io:8008/tika (that address is no longer working).
I was working in an IPython notebook running on a Linux machine (the recipe will also work on a Mac; on Windows, you may need to install curl).
There are two steps:
- PUT the file you want the text extracted from to the server; I use curl, with a command of the form curl -T path/to/myfile.png http://quicktika-1.psychemedia.cont.tutum.io:8008/rmeta > path/to/myfile_txt.json
- Look at the result in the returned JSON file (path/to/myfile_txt.json)
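If you’d rather stay inside the IPython notebook than shell out to curl, the two steps can be sketched in Python using only the standard library. (A sketch, not gospel: the `/rmeta` endpoint returns a JSON list of metadata dicts, one per document part, with the text in an `X-TIKA:content` field; the exact field name may vary across Tika versions, and the server URL below is whatever address your own Tika service is running at.)

```python
import json
import urllib.request

def tika_rmeta(server, path):
    """PUT a file to the Tika server's /rmeta endpoint and return the parsed
    JSON response. Equivalent to: curl -T path http://server/rmeta"""
    with open(path, 'rb') as f:
        req = urllib.request.Request(server.rstrip('/') + '/rmeta',
                                     data=f.read(), method='PUT')
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode('utf-8'))

def extract_text(rmeta):
    """Pull the flat text out of an /rmeta response: a list of metadata
    dicts, each (possibly) carrying an X-TIKA:content field."""
    return '\n'.join(d.get('X-TIKA:content', '') for d in rmeta)
```

So something like `extract_text(tika_rmeta('http://myserver:9998', 'myfile.png'))` should hand you back the text, assuming the server is up.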
Simple as that…simple as this:
Parse the word doc shown above…
You can see the start of the extracted text in the X-TIKA:content element at the bottom…
Parse the PDF doc shown above…
Parse the actual image of a fragment of the PDF doc shown above…
See how Tika has gone into image parsing and optical character recognition mode automatically, and done its best to extract the text from the image file? :-)
Running Apache Tika in the Cloud
As described in Getting Started With Personal App Containers in the Cloud, the first thing you need to do is set up an account with a cloud service provider – I’m using Digital Ocean at the moment: it has simple billing and lets you launch cloud hosted virtual machines of a variety of sizes in a variety of territories, including the UK. Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. You don’t need to do anything else – we’ll be using that service via another one… [Affiliate Link: sign up to Digital Ocean and get $10 credit]
Having got your cloud provider account set up, create an account with Tutum and then link your Digital Ocean account to it.
Launch a node cluster as described at the start of Getting Started With Personal App Containers in the Cloud. The 2GB/2 core server is plenty.
Now launch a container – the one you want is logicalspark/docker-tikaserver:
To be able to access the service over the web, you need to make its ports public:
I’m going to give it a custom port number, but you don’t have to, in which case a random one will be assigned:
Having created and deployed the container, look up its address from the Endpoints tab. The address will be something like tcp://thing-1.yourid.cont.tutum.io:NNNN. You can check the service is there by going to thing-1.yourid.cont.tutum.io:NNNN/tika in your browser.
When you’re done, terminate the container and the node cluster so you don’t get billed any more than is necessary.
Running Apache Tika on your own computer
- Install boot2docker
- Launch boot2docker
- In the boot2docker command line, enter: docker pull logicalspark/docker-tikaserver to grab the container image;
- To run the container: docker run -d -p 9998:9998 logicalspark/docker-tikaserver
- Enter boot2docker ip to find the address boot2docker is publishing to (e.g. 192.168.59.103);
- Check the server is there – in your browser, go to eg: http://192.168.59.103:9998/tika
(Don’t be afraid of command lines; you probably already know how to download an app (step 1), definitely know how to launch an app (step 2), know how to type (steps 3 to 5), and know how to go to a web location (step 6; note: you do have to enter this URL in the browser location bar at the top of the screen – entering it into Google won’t help…;-) All steps 3 to 5 do is get you to write the commands the computer is to follow, rather than automatically selecting them from a nicely named menu option. (What do you think a computer actually does when you select a menu option?!)
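And if you’d rather check the server from a notebook than a browser, a minimal sketch (assuming the boot2docker address from step 5; the /tika endpoint just answers a GET with a short greeting when the server is up):

```python
import urllib.request

def tika_alive(base_url, timeout=5):
    """GET the server's /tika endpoint and report whether it answered."""
    try:
        with urllib.request.urlopen(base_url.rstrip('/') + '/tika',
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, DNS failure and timeouts alike
        return False

# e.g. tika_alive('http://192.168.59.103:9998')
```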
PS via @Pudo, see also: textract – python library for “extracting text out of any document”.
7 thoughts on “Getting Text Out Of Anything (docs, PDFs, Images) Using Apache Tika”
Wow, does that look easy (cough).
Not automatic, but a little trick I learned recently for .docx files – change the extension to .zip and on Mac control click and Show Contents, elsewhere expand the ZIP. You can find all the bits of media in there.
Not automatic, but it’s been handy to extract images from the dreaded Word dungeon.
@cogdog Ok – so why’s it not?! (Serious question – would value critical response… we need to make these things easier, right? It may be checkmate in 4, but if you see it right, it’s forced all the way?)
Thanks Tony for the write up – very interesting! I was wondering if Tika allows one to extract highlighted text from a pdf.. have you had any experience with that by any chance?
Hi Michele – I’m still in the early days of playing with Tika myself. I don’t think it returns any style information at all – just flat text, so highlighting cues would presumably also be lost?
too bad. I’ll find another way :-)
It does return style information (in fact its internal object that it returns is in XHTML). The Tika-Server module just returns the Text and Metadata (but you can get the XHTML by calling another method). See: http://wiki.apache.org/tika/TikaJAXRS
@Chris Ah – thanks for clarifying that:-)