A Python XML Handling Gotcha – Namespaces

Just a quick note to self from a matter arising at the #IIP11 hackday earlier today – parsing XML from the PatientOpinion API using xml.etree.ElementTree library in Python. The issue – as Dan Hagon (aka @axiomsofchoice) discovered, and as I couldn’t help on out given my appalling lack of skills in using Python, was that the namespace the XML results file used needed handling explicitly. (I wonder if this is also why Yahoo Pipes choked on the XML?).

Anyway, here’s an example of the XML returned from Patient Choices:

<Opinions xmlns="http://www.patientopinion.org.uk/api/rest/v1" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
  <Opinion>
    <Author>*****</Author>
    <Body>My mother who is ...
      ...
    </Body>
    <PostingID>27290</PostingID>
    <dtSubmitted>2010-01-05T12:57:19.667</dtSubmitted>
    <syndicOriginalID/><syndicSourceID>po</syndicSourceID>
    <HealthServices>
      <HealthService>
        <NACS>RAJ01_430</NACS>
        <Name>Geriatric medicine</Name>
        <OrganisationNACS>RAJ</OrganisationNACS>
        <Postcode>SS0 0RY</Postcode>
        <SiteNACS>RAJ01</SiteNACS>
        <Town/>
        <Type>service</Type>
      </HealthService>
    </HealthServices>
    <Period>Today</Period>
    <PostingAs>a relative</PostingAs>
    <Responses/>
    <Tags>
      <Tag>
        <TagGroup>Condition</TagGroup>
        <TagName>confused</TagName>
      </Tag>
    </Tags>
    <Title>My darling dad</Title>
    <Type>Story</Type>
  </Opinion>
</Opinions>

And here’s a snippet for how to handle it, as gleaned from the ever helpful Stack Overflow…

import urllib2
from xml.etree.ElementTree import *

req = urllib2.Request(url='http://www.patientopinion.org.uk/api/rest.svc/v1/postings/search?tag=dirty&take=20&apikey=******')
f = urllib2.urlopen(req)

tree = ElementTree()
tree.parse(f)
doc = tree.getroot()

#http://stackoverflow.com/questions/1319385/need-help-using-xpath-in-elementtree
namespace = "{http://www.patientopinion.org.uk/api/rest/v1}"
t= doc.find("{0}Opinion/{0}HealthServices/{0}HealthService/{0}Postcode".format(namespace))
print t.text

It also seems as if the full path from the root is required?

PS are there any Python libraries out there that would have been able to handle the namespaced XML automagically….?

Extracting Images from PDFs

A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained with PDFs…).

For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:

shell ng oil spill

Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.

import os,re
import urllib2

#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df= pd.read_csv('shell_30_11_13_ng.csv')

errors=[]

#For each line item:
for url in df[df.columns[15]]:
	try:
		print 'trying',url
		u = urllib2.urlopen(url)
		fn=url.split('/')[-1]

		#Grab a local copy of the downloaded picture containing PDF
		localFile = open(fn, 'w')
		localFile.write(u.read())
		localFile.close()
	except:
		print 'error with',url
		errors.append(url)
		continue
	
	#If we look at the filenames/urls, the filenames tend to start with the JIV id
	#...so we can try to extract this and use it as a key
	id=re.split(r'[_-]',fn)[0]

	#I'm going to move the PDFs and the associated images stripped from them in separate folders
	fo='data/'+id
	os.system(' '.join(['mkdir',fo]))
	idp='/'.join([fo,id])
	
	#Try to cope with crappy filenames containing punctuation chars
	fn= re.sub(r'([()&])', r'\\\1', fn)

	#THIS IS THE LINE THAT PULLS OUT THE IMAGES
	#Available via poppler-utils
	#See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
	#Note: the '; mv' etc etc bit copies the PDF file into the new JIV report directory
	cmd=' '.join(['pdfimages -j',fn, idp, '; mv',fn,fo  ])
	os.system(cmd)
	#Still a couple of errors on filenames
	#just as quick to catch by hand/inspection of files that don't get moved properly
	
print 'Errors',errors

Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng

The important line of code in the above is:

pdfimages -j FILENAME OUTPUT_STUB

FILENAME is the PDF you want to extract the images from, OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line file, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)

pdfimages can be downloaded as part of poppler (I think?!)

See also this Stack Exchange question/answer: Extracting images from a PDF

PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.

http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:

– generate post containing images and body text made up from data in the associated line from the CSV file.

Example data:

Date Reported Incident Site JIV Date Terrain Cause Estimated Spill Volume (bbl) Clean-up Status Comments Photo
02-Jan-13 10″ Diebu Creek – Nun River Pipeline at Onyoma 05-Jan-13 Swamp Sabotage/Theft 65 Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013. Site Certification was completed on 28th June 2013. http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf

So we can pull this out for the body post. We can also parse the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).

A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 8″ in the example above).

We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).

There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the data reported (col 1) and the JIV date (col 3), or where we can scrape it, look for structure on the clean-up status filed. For example:

Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.

If those phrases are common/templated refrains, we can parse the corresponding dates out?

I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?

Using One Programming Language In the Context of Another – Python and R

Over the last couple of years, I’ve settled into using R an python as my languages of choice for doing stuff:

  • R, because RStudio is a nice environment, I can blend code and text using R markdown and knitr, ggplot2 and Rcharts make generating graphics easy, and reshapers such as plyr make wrangling with data realtvely easy(?!) once you get into the swing of it… (though sometimes OpenRefine can be easier…;-)
  • python, because it’s an all round general purpose thing with lots of handy libraries, good for scraping, and a joy to work with in iPython notebook…

Sometimes, however, you know – or remember – how to do one thing in one language that you’re not sure how to do in another. Or you find a library that is just right for the task hand but it’s in the other language to the one in which you’re working, and routing the data out and back again can be a pain.

How handy it would be if you could make use of one language in the context of another? Well, it seems as if we can (note: I haven’t tried any of these recipes yet…):

Using R inside Python Programs

Whilst python has a range of plotting tools available for it, such as matplotlib, I haven’t found anything quite as a expressive as R’s ggplot2 (there is a python port of ggplot underway but it’s still early days and the syntax, as well as the functionality, is still far from complete as compared to the original [though not a far as it was given the recent update;-)] ). So how handy would it be to be able to throw a pandas data frame, for example, into an R data frame and then use ggplot to render a graphic?

The Rpy and Rpy2 libraries support exactly that, allowing you to run R code within a python programme. For an example, see this Example of using ggplot2 from IPython notebook.

There also seems to be some magic help for running R in iPython notebooks and some experimental integrational work going on in pandas: pandas: rpy2 / R interface.

(See also: ggplot2 in Python: A major barrier broken.)

Using python Inside R

Whilst one of the things I often want to do in python is plot R style ggplots, one of the hurdles I often encounter in R is getting data in in the first place. For example, the data may come from a third party source that needs screenscraping, or via a web API that has a python wrapper but not an R one. Python is my preferred tool for writing scrapers, so is there a quick way I can add a python data grabber into my R context? It seems as if there is: rPython, though the way code is included looks rather clunky and WIndows support appears to be moot. What would be nice would be for RStudio to include some magic, or be able to support python based chunks…

(See also: Calling Python from R with rPython.)

(Note: I’m currently working on the production of an Open University course on data management and use, and I can imagine the upset about overcomplicating matters if I mooted this sort of blended approach in the course materials. But this is exactly the sort of pragmatic use that technologists use code for – as a tool that comes to hand and that can be used quickly and relatively efficiently in concert with other tools, at least when you’re working in a problem solving (rather than production) mode.)

“Natural Language” Time Periods in Python

Mulling over a search feed that includes date range limits, I had a quick look for a python library that includes “natural language” functions for describing different date ranges. Not finding anything offhand, I popped some quick starter-for-ten functions up at this gist, which should also be embedded below.

It includes things like today(), tomorrow(), last_week(), later_this_month() and so on.

If you know of a “proper” library that does this, please let me know via the comments…

UPDATE: code repo here: https://github.com/psychemedia/python-natural-time-periods

SEE ALSO: https://github.com/DavidAmison/natural_time

PS more handy fragments:

#http://stackoverflow.com/a/28290050/454773
#Get month and year between two dates
import datetime
from dateutil.rrule import rrule, MONTHLY

strt_dt = datetime.date(2015,4,1)
end_dt = datetime.date(2016,10,1)

dates = ['_'.join([dt.strftime('%B').lower(), dt.strftime('%Y')]) for dt in rrule(MONTHLY, dtstart=strt_dt, until=end_dt)]