A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained within PDFs…).
For example, Shell Nigeria has a site that lists oil spills, along with links to PDF docs containing photos of each spill.
Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.
import os,re
import urllib2
#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df = pd.read_csv('shell_30_11_13_ng.csv')

errors=[]
#For each line item:
for url in df[df.columns[15]]:
    try:
        print 'trying',url
        u = urllib2.urlopen(url)
        fn = url.split('/')[-1]
        #Grab a local copy of the downloaded picture-containing PDF
        #(open in binary mode - these are PDFs, not text files)
        localFile = open(fn, 'wb')
        localFile.write(u.read())
        localFile.close()
    except:
        print 'error with',url
        errors.append(url)
        continue
    #If we look at the filenames/urls, the filenames tend to start with the JIV id
    #...so we can try to extract this and use it as a key
    id = re.split(r'[_-]',fn)[0]
    #I'm going to move the PDFs and the associated images stripped from them into separate folders
    fo = 'data/'+id
    os.system(' '.join(['mkdir',fo]))
    idp = '/'.join([fo,id])
    #Try to cope with crappy filenames containing punctuation chars
    fn = re.sub(r'([()&])', r'\\\1', fn)
    #THIS IS THE LINE THAT PULLS OUT THE IMAGES
    #Available via poppler-utils
    #See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    #Note: the '; mv' bit moves the PDF file into the new JIV report directory
    cmd = ' '.join(['pdfimages -j', fn, idp, '; mv', fn, fo])
    os.system(cmd)

#Still a couple of errors on filenames
#just as quick to catch by hand/inspection of files that don't get moved properly
print 'Errors',errors
Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng
The important line of code in the above is:
pdfimages -j FILENAME OUTPUT_STUB
FILENAME is the PDF you want to extract the images from; OUTPUT_STUB sets the main part of the name of the resulting image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script via the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)
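As an aside, the shell-escaping dance in the script above (for filenames containing brackets or ampersands) could be avoided altogether by calling pdfimages via subprocess rather than os.system. A minimal sketch, assuming fn, idp and fo as in the script above:

import subprocess

#Arguments passed as a list bypass the shell, so awkward
#punctuation in the filename doesn't need escaping
subprocess.call(['pdfimages', '-j', fn, idp])
subprocess.call(['mv', fn, fo])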
pdfimages can be downloaded as part of poppler (specifically, the poppler-utils package, I think?!).
See also this Stack Exchange question/answer: Extracting images from a PDF
PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.
The python-wordpress-xmlrpc library (http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html) provides a Python API for WordPress. First thoughts were:
– generate a post containing the images, with body text made up from the data in the associated line of the CSV file.
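A first sketch of that, based on the library’s documented media-upload example (blog URL, credentials, filenames and the row dict are all placeholders, so treat the details as provisional):

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods import media, posts
from wordpress_xmlrpc.compat import xmlrpc_client

#Placeholder blog URL and credentials
client = Client('http://example.wordpress.com/xmlrpc.php', 'username', 'password')

#Upload one of the images pdfimages extracted for this spill
img = {'name': '911964-000.jpg', 'type': 'image/jpeg'}
with open('data/911964/911964-000.jpg', 'rb') as f:
    img['bits'] = xmlrpc_client.Binary(f.read())
resp = client.call(media.UploadFile(img))

#Build the post body from the corresponding CSV row (hypothetical row dict)
post = WordPressPost()
post.title = row['Incident Site']
post.content = '<img src="%s"/><p>Cause: %s. Estimated spill volume: %s bbl.</p>' % (
    resp['url'], row['Cause'], row['Estimated Spill Volume (bbl)'])
post.post_status = 'publish'
client.call(posts.NewPost(post))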
Example data:
Date Reported | Incident Site | JIV Date | Terrain | Cause | Estimated Spill Volume (bbl) | Clean-up Status | Comments | Photo |
02-Jan-13 | 10″ Diebu Creek – Nun River Pipeline at Onyoma | 05-Jan-13 | Swamp | Sabotage/Theft | 65 | Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013. | Site Certification was completed on 28th June 2013. | http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf |
So we can pull this out for the body of the post. We can also parse the image PDF filename to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).
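A crude geocoding sketch, assuming the geopy package (and bearing in mind that Niger Delta place names may well geocode patchily, if at all):

from geopy.geocoders import Nominatim

#Strip any leading pipe diameter, then try the place name against Nominatim
geolocator = Nominatim(user_agent='shell-ng-spills')
location = geolocator.geocode('Onyoma, Nigeria')
if location:
    print location.latitude, location.longitude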
A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg the 10″ in the example above).
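For example, a quick regex sketch for pulling a leading diameter out of the incident site string (the helper name is mine):

import re

#Match a leading figure such as 10" - straight or typographic double-prime quotes
def pipe_diameter(site):
    m = re.match(ur'\s*(\d+)\s*[\u2033"]', site)
    return m.group(1) if m else None

print pipe_diameter(u'10\u2033 Diebu Creek - Nun River Pipeline at Onyoma')
#prints: 10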
We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular subsets of posts (eg all posts relating to theft/sabotage).
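By way of illustration, a hypothetical helper for banding the spill volume into a tag-friendly range, building on the post object and row dict from the earlier sketch (terms_names is the library’s documented attribute for setting tags/categories):

#Hypothetical helper: band the estimated spill volume into a tag-friendly range
def vol_band(bbl):
    try:
        v = float(bbl)
    except (TypeError, ValueError):
        return 'volume unknown'
    if v < 10: return 'under 10 bbl'
    if v < 100: return '10-100 bbl'
    if v < 1000: return '100-1000 bbl'
    return 'over 1000 bbl'

#Tags/categories on the post sketched earlier
post.terms_names = {'post_tag': [row['Terrain'], row['Cause'],
                                 vol_band(row['Estimated Spill Volume (bbl)'])],
                    'category': ['Oil spill']}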
There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:
Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
If those phrases are common/templated refrains, we can parse the corresponding dates out?
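If they are, a minimal parsing sketch (assuming the status text really does stick to that phrasing; the helper name is mine):

import re
from datetime import datetime

def cleanup_dates(status):
    #Pull out (event, date) pairs from templated phrases like
    #'commenced on 6th January 2013' / 'completed on 22nd January 2013'
    events = []
    for verb, day, month, year in re.findall(
            r'(commenced|completed) on (\d{1,2})(?:st|nd|rd|th) (\w+) (\d{4})', status):
        events.append((verb, datetime.strptime(day+' '+month+' '+year, '%d %B %Y')))
    return events

print cleanup_dates('Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013.')
#[('commenced', datetime.datetime(2013, 1, 6, 0, 0)), ('completed', datetime.datetime(2013, 1, 22, 0, 0))]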
I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?
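For the record, one possible route (though not necessarily the one taken in the github code) is pdftotext, another poppler-utils tool, which dumps whatever text layer the PDF contains; any captions can then be fished out of the dump:

import os
#Dump the PDF's text layer to a text file; captions (if any) come out with it
os.system('pdftotext 911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf 911964_captions.txt')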
I’ve tried pdfimages on academic pdfs. Doesn’t work as desired. Don’t know if it’s the tool or just the awful way in which academic pdfs are made. Tended to get lots of tiny images out (seemingly pixel by pixel) rather than whole figures. Rather annoying!
Hi Ross..
I’m not really that familiar with how PDFs package images (in fact, how they package anything). Are there any other open source tools out there, I wonder, that are more specifically tuned to extracting images howsoever they are embedded in a PDF?