I came across Apache Tika a few weeks ago: a service that will identify pretty much any type of document from its metadata, and will have a good go at extracting the text from it.
With a prompt and a 101 from @IgorBrigadir, it was pretty easy to get started with it – sort of…
First up, I needed to get the Apache Tika server running. As there’s a containerised version available on Docker Hub (logicalspark/docker-tikaserver), it was simple enough for me to fire up a server in a click using Tutum, as described in this post on how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour. Pretty much all you need to do is fire up a server, start a container based on logicalspark/docker-tikaserver, and tick the box to make the port public…
His suggested recipe using the Python requests library borked for me – I couldn’t get Python to open the file to get the data bits to send to the server (file encoding issues; one reason for using Tika is that it’ll try to accept pretty much anything you throw at it…)
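One possible way round the encoding problem is to read the file as raw bytes, so Python never tries to decode it, and then PUT those bytes to the server. The following is only a minimal sketch using the standard library's urllib rather than the requests recipe mentioned above. The function name tika_rmeta is made up for illustration, the server address is the same placeholder used elsewhere in this post, and the Content-Type header is an assumption intended to leave type detection to Tika.

```python
import json
import urllib.request


def tika_rmeta(path, server="http://example.com:9998"):
    """PUT a file to a Tika server's /rmeta endpoint and return the parsed JSON.

    Hypothetical helper: the name and the default server are placeholders.
    """
    # Open in binary mode ("rb") so Python never tries to decode the document.
    with open(path, "rb") as f:
        payload = f.read()

    # curl -T uploads with an HTTP PUT, so do the same here.
    req = urllib.request.Request(
        server + "/rmeta",
        data=payload,
        method="PUT",
        headers={
            "Accept": "application/json",
            # Assumption: a generic type leaves detection to Tika.
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Called as tika_rmeta("Text/foo.doc"), this should hand back the same JSON that the curl command further down writes to a file, assuming the server is reachable.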
I had a look at pycurl:
!apt-get install -y libcurl4-openssl-dev
!pip3 install pycurl
but couldn’t make head or tail of how to use it: the pycurl equivalent of curl -T foo.doc http://example.com:9998/rmeta can’t be that hard to write, can it? (Translations appreciated via the comments…;-)
Instead I took the approach of dumping the result of a curl request on the command line into a file:
!curl -T Text/foo.doc http://example.com:9998/rmeta > tikatest.json
and then grabbing the response out of that:
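Something along these lines would do it. This is a minimal sketch, assuming the /rmeta response is a JSON list of metadata dicts (one per document, including any embedded ones) with the extracted text under the X-TIKA:content key; the function name load_tika_text is made up for illustration.

```python
import json


def load_tika_text(path="tikatest.json"):
    """Return the extracted text of the first document in a saved /rmeta response.

    Hypothetical helper: assumes the file holds a JSON list of metadata dicts.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)

    # The first record describes the uploaded document itself;
    # later records, if any, describe embedded documents.
    first = records[0]

    # The extracted text sits under "X-TIKA:content"; use "" if it is missing.
    return first.get("X-TIKA:content", "")
```

The other keys in each record carry the metadata (content type, author and so on), so printing first.keys() is a quick way to see what came back for a given file.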
Not elegant, and far from ideal, but a stopgap for now.
Part of the response from the Tika server is the text extracted from the document, which can then provide the basis for some style-free text analysis…
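As a taster of what that might look like, here is a minimal sketch that counts the most common words in a chunk of extracted text. The function name top_words and the crude letters-and-apostrophes tokenisation are my own assumptions, not anything Tika provides.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most common lowercase words in text as (word, count) pairs."""
    # Crude tokeniser: runs of letters and apostrophes, case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)
```

Feeding it the text pulled out of the Tika response gives a quick feel for what a document is about, before reaching for anything heavier.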
I haven’t tried it with any document types other than crappy old MS Word .doc files, but this looks like it could be a really easy tool to use.
And with the containerised version available, and Tutum and Digital Ocean to hand, it’s easy enough to fire up a version in the cloud, let alone on my desktop, whenever I need it:-)