Extracting Images from PDFs

A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained within PDFs).

For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:


Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.

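import os,re
import urllib2

#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd


#Load in the data scraped from Shell
df= pd.read_csv('shell_30_11_13_ng.csv')

#Keep a note of any URLs that fail to download
errors=[]

#For each line item (column 15 of the scraped data holds the PDF link):
for url in df[df.columns[15]]:
    try:
        print 'trying',url
        u = urllib2.urlopen(url)

        #Name the local file after the last part of the URL
        #(assumption: this step is missing from the listing as published)
        fn = url.split('/')[-1]

        #Grab a local copy of the downloaded picture containing PDF
        #('wb' because a PDF is binary data)
        localFile = open(fn, 'wb')
        localFile.write(u.read())
        localFile.close()
    except Exception:
        print 'error with',url
        errors.append(url)
        continue

    #If we look at the filenames/urls, the filenames tend to start with the JIV id
    #...so we can try to extract this and use it as a key
    #(assumption: the id is everything before the first _ or -)
    jiv_id = re.split(r'[_-]', fn)[0]

    #I'm going to move the PDFs and the associated images stripped from them in separate folders
    #(assumption: one folder per JIV id, under data/)
    fo = 'data/'+jiv_id
    os.system(' '.join(['mkdir -p',fo]))
    #Filename stub for the extracted images, eg data/911964/911964
    idp = '/'.join([fo,jiv_id])
    #Try to cope with crappy filenames containing punctuation chars
    fn= re.sub(r'([()&])', r'\\\1', fn)

    #Available via poppler-utils
    #See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    #Note: the '; mv' etc etc bit moves the PDF file into the new JIV report directory
    cmd=' '.join(['pdfimages -j',fn, idp, '; mv',fn,fo  ])
    os.system(cmd)
    #Still a couple of errors on filenames
    #just as quick to catch by hand/inspection of files that don't get moved properly
print 'Errors',errors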

Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng

The important line of code in the above is:

pdfimages -j FILENAME OUTPUT_STUB

FILENAME is the PDF you want to extract the images from, and OUTPUT_STUB sets the main part of the name of the image files. The -j flag tells pdfimages to save images that are stored in the PDF as JPEG data as .jpg files, rather than converting them to PBM/PPM. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)

pdfimages is one of the command line utilities distributed with poppler (on Ubuntu it is packaged as poppler-utils).
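As a rough alternative to the os.system call, here is a minimal Python 3 sketch that runs the same command through subprocess with an argument list. The function names build_pdfimages_cmd and extract_images are made up for illustration; the only assumption about the tool is that pdfimages is on the PATH and accepts -j FILENAME OUTPUT_STUB, as in the command above.

```python
import shutil
import subprocess


def build_pdfimages_cmd(pdf_path, output_stub):
    """Return the argument list for: pdfimages -j FILENAME OUTPUT_STUB."""
    return ["pdfimages", "-j", pdf_path, output_stub]


def extract_images(pdf_path, output_stub):
    """Run pdfimages on one PDF; raises CalledProcessError if the tool fails."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH (it ships with poppler)")
    # Passing a list (and no shell=True) means names containing (, ) or &
    # reach pdfimages unchanged, so no backslash-escaping is needed.
    subprocess.run(build_pdfimages_cmd(pdf_path, output_stub), check=True)
```

Because no shell is involved, the regular expression that escapes awkward filename characters should not be needed with this approach, and moving the PDF afterwards could be done with shutil.move instead of '; mv'.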

See also this Stack Exchange question/answer: Extracting images from a PDF

PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.

The python-wordpress-xmlrpc library (see http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html) provides a Python API for creating posts and uploading media. First thoughts were:

– generate post containing images and body text made up from data in the associated line from the CSV file.
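A rough Python 3 sketch of that first thought, following the usage shown in the python-wordpress-xmlrpc documentation. Everything named here is a placeholder: build_post_fields and publish_spill_post are hypothetical helpers, the column names are assumed to match the headers of the spill table (the scraped CSV may name them differently), and the images are assumed to be JPEGs produced by pdfimages -j.

```python
import os


def build_post_fields(row):
    """Turn one CSV row (a dict keyed by column name) into a title and body."""
    title = "%s (reported %s)" % (row["Incident Site"], row["Date Reported"])
    body_lines = [
        "Cause: %s" % row["Cause"],
        "Terrain: %s" % row["Terrain"],
        "Estimated spill volume (bbl): %s" % row["Estimated Spill Volume (bbl)"],
        "Clean-up status: %s" % row["Clean-up Status"],
    ]
    return title, "\n".join(body_lines)


def publish_spill_post(xmlrpc_url, username, password, row, image_paths):
    """Upload the images, then create one draft post that embeds them."""
    # Imported here so build_post_fields() works without the library installed.
    from wordpress_xmlrpc import Client, WordPressPost
    from wordpress_xmlrpc.compat import xmlrpc_client
    from wordpress_xmlrpc.methods import media
    from wordpress_xmlrpc.methods.posts import NewPost

    client = Client(xmlrpc_url, username, password)
    title, body = build_post_fields(row)
    for path in image_paths:
        with open(path, "rb") as img:
            data = {
                "name": os.path.basename(path),
                "type": "image/jpeg",  # assumption: pdfimages -j gave us JPEGs
                "bits": xmlrpc_client.Binary(img.read()),
            }
        uploaded = client.call(media.UploadFile(data))
        body += '\n<img src="%s" />' % uploaded["url"]

    post = WordPressPost()
    post.title = title
    post.content = body
    post.post_status = "draft"  # leave as a draft until checked by hand
    return client.call(NewPost(post))
```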

Example data:

Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013. Site Certification was completed on 28th June 2013.
Comments: (blank)
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf

So we can pull this out for the post body. We can also parse the filename of the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).
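A minimal Python 3 sketch of pulling the JIV ID out of a photo URL, assuming (as in the example row) that the filename starts with a run of digits. The function name jiv_id_from_url is made up for illustration.

```python
import re
from urllib.parse import urlsplit


def jiv_id_from_url(url):
    """Return the leading run of digits in the PDF's filename, or None."""
    filename = urlsplit(url).path.rsplit("/", 1)[-1]
    match = re.match(r"\d+", filename)
    return match.group(0) if match else None
```

For the example URL this gives '911964'; filenames that do not start with digits return None, so they can be picked out and checked by hand.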

A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).
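One way this might be done, sketched in Python 3: look for a number at the very start of the incident site text, followed by an inch mark. The set of inch marks accepted here (″, " and ”) is a guess at what the data contains, and pipe_diameter_inches is a hypothetical helper name.

```python
import re

# A leading number followed by an inch mark: double prime, straight or curly quote.
DIAMETER_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*[″"”]')


def pipe_diameter_inches(incident_site):
    """Return the leading pipe diameter as a float, or None if there isn't one."""
    match = DIAMETER_RE.match(incident_site)
    return float(match.group(1)) if match else None
```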

We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
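A small Python 3 sketch of turning those columns into tags. The volume bands are arbitrary cut-offs chosen for illustration, and volume_band and tags_for_spill are made-up names; a real version would want bands that suit the spread of the data.

```python
def volume_band(barrels):
    """Bucket a spill volume (bbl) into a coarse range label."""
    if barrels < 10:
        return "under 10 bbl"
    if barrels < 100:
        return "10-99 bbl"
    if barrels < 1000:
        return "100-999 bbl"
    return "1000+ bbl"


def tags_for_spill(cause, terrain, barrels):
    """Build a list of tag strings from the cause, terrain and spill volume."""
    # A cause such as "Sabotage/Theft" becomes two separate tags.
    tags = [part.strip().lower() for part in cause.split("/") if part.strip()]
    tags.append(terrain.strip().lower())
    tags.append(volume_band(barrels))
    return tags
```

With python-wordpress-xmlrpc, a list like this can be attached to a post through its terms_names attribute, for example post.terms_names = {'post_tag': tags}.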

There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:

Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.

If those phrases are common/templated refrains, we can parse the corresponding dates out?
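A Python 3 sketch of both date ideas, assuming the date columns always look like 02-Jan-13 and that the clean-up text sticks to the "commenced on"/"completed on" wording seen in the example. The names days_until_jiv and status_dates are hypothetical, and month names are parsed on the assumption of an English locale.

```python
import re
from datetime import datetime

DATE_FORMAT = "%d-%b-%y"  # matches column values such as 02-Jan-13

# eg "completed on 22nd January 2013"
STATUS_DATE_RE = re.compile(
    r"(commenced|completed) on (\d{1,2})(?:st|nd|rd|th) ([A-Z][a-z]+) (\d{4})"
)


def days_until_jiv(date_reported, jiv_date):
    """Days between the reported date (col 1) and the JIV date (col 3)."""
    reported = datetime.strptime(date_reported, DATE_FORMAT)
    visited = datetime.strptime(jiv_date, DATE_FORMAT)
    return (visited - reported).days


def status_dates(status_text):
    """Return (verb, date) pairs for each templated phrase in the status text."""
    found = []
    for verb, day, month, year in STATUS_DATE_RE.findall(status_text):
        when = datetime.strptime("%s %s %s" % (day, month, year), "%d %B %Y").date()
        found.append((verb, when))
    return found
```

Note that this only captures the verb and the date; working out which activity (recovery, cleanup, site certification) each date belongs to would need a slightly richer pattern.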

I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?


  1. Ross Mounce (@rmounce)

    I’ve tried pdfimages on academic pdfs. Doesn’t work as desired. Don’t know if it’s the tool or just the awful way in which academic pdfs are made. Tended to get lots of tiny images out (seemingly pixel by pixel) rather than whole figures. Rather annoying!

    • Tony Hirst

      Hi Ross..
      I’m not really that familiar with how PDFs package images (in fact, how they package anything). Are there any other open source tools out there, I wonder, that are more specifically tuned to extracting images howsoever they are embedded in a PDF?