Quick Command Line Reports from CSV Data Parsed Out of XML Data Files

It’s amazing how time flies, isn’t it..? Spotting today’s date I realised that there’s only a week left before the closing date of the UK Discovery Developer Competition, which is making available several UK “cultural metadata” datasets from library catalogue and activity data, EDINA OpenUrl resolver data, National Archives images and Engligh Heritage places metadata, as well as ArchivesHub project related Linked Data and Tyne and Wear Museums Collections metadata.

I was intending to have a look at how easy it was to engage with datasets (e.g. by blogging intitial explorations for each dataset along the lines of the text processing tricks I posted around the EDINA data in Postcards from a Text Processing Excursion, Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Visualising OpenURL Referrals Using Gource), but I seem to have left it a bit let considering other work I need to get done this week… (i.e. marking:-(

..except for posting this old bit of code I don’t think I’ve posted before that demonstrates how to use the Python scripting language to parse an XML file, such as the Huddersfield University library MOSAIC activity data, and create a CSV/text file that we can then run simple text processing tools against.

If you download the Huddersfield or Lincoln data (e.g. from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data) and have a look at a few lines from the XML files (e.g. using the Unix command line tool head, as in head -n 150 filename.xml to show the first 150 lines of the file), you will notice records of the form:

      <institution>University of Huddersfield</institution>
      <globalID type="ISBN">1903365430</globalID>
      <author>Elizabeth I, Queen of England, 1533-1603.</author>
      <title>Elizabeth I : the golden reign of Gloriana /</title>
      <publisher>National Archives</publisher>
      <courseCode type="ucas">LQV0</courseCode>
      <courseName>BA(H) Humanities</courseName>

Suppose we want to extract the data showing which courses each resource was borrowed against. That is, for each use record, we want to extract a localID and the courseCode. The following script achieves that:

from lxml import etree
import csv
#Inspired by http://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
def parseMOSAIC_Level1_XML(xmlFile,writer):
	context = etree.iterparse(xmlFile)
	record = {}
	# we are going to use record to create a record containing UCAS codes
	# record={ucasCode:[],localID:''}
	records = []
	print 'starting...'
	for action, elem in context:
		if elem.tag=='useRecord' and action=='end':
			#we have parsed the end of a useRecord, so output course data
			if 'ucasCode' in record and 'localID' in record:
				for cc in record['ucasCode']:
		if elem.tag=='localID':
		elif elem.tag=='courseCode' and 'type' in elem.attrib and elem.attrib['type']=='ucas':
			if 'ucasCode' not in record:
		elif elem.tag=='progression' and elem.text=='staff':
	#return records

writer = csv.writer(open("test.csv", "wb"))


Usage (if you save the code to the file mosaicXML2csv.py): python mosaicXML2csv.py
Note: this minimal example uses the file specified by f=, in the above case mosaic.2008.level1.1265378452.0000001.xml and writes the CSV out to test.csv

(You can also find the code as a gist on Github: simple Python XML2CSV converter)

Running the script gives data of the form:


Note that we might add an additional column for progression. Add in something like:
if elem.tag=='progression': record['progression']=elem.text
and modify the write command to something like writer.writerow([record['localID'],cc,record['progression']])

We can now generate quick reports over the simplified test.csv data file.

For example, how many records did we extract:
wc -l test.csv

If we sort the records (and by so doing, group duplicated rows) [sort test.csv], we can then pull out unique rows and count the number of times they repeat [uniq -c], then sort them by the number of reoccurrences [sort -k 1 -n -r] and pull out the top 20 [head -n 20] using the combined, piped Unix commandline command:

sort test.csv | uniq -c | sort -k 1 -n -r | head -n 20

This gives and output of the form:
186 220759,L500
134 176259,L500
130 176895,L500

showing that resource with localID 220759 was taken out on course L500 186 times.

If we just want to count the number of books taken out on a course as a whole, we can just pull out the coursecode column using the cut command, setting the delimiter to be a comma:
cut -f 2 -d ',' test.csv

Having extracted the course code column, we can sort, find repeat counts, sort again and show the courses with the most borrowings against them:

cut -f 2 -d ',' test.csv | sort | uniq -c | sort -k 1 -n -r | head -n 10

This gives a result of the form:
13476 L500
8799 M931
7499 P301

In other words, we can create very quick and dirty reports over the data using simple commandline tools once we generate a row based, simply delimited text file version of the original XML data report.

Having got the data in a simple test.csv text file, we can also load it directly into the graph plotting tool Gephi, where the two columns (localID and courseCode) are both interpreted as nodes, with an edge going from the localID to the courseCode. (That is, we can treat the two column CSV file as defining a bipartite graph structure.)

Running a clustering statistic and a statistic that allows us to size nodes according to degree, we can generate a view over the data that shows the relative activity against courses:

Huddersfield mosaic data

Here’s another view, using a different layout:

Huddersfield JISC MOSAIC activity data

Also note that by choosing an appropriate layout algorithm, the network structure visually identifies courses that are “similar” by virtue of being connected to similar resources. The thickness of the edges is proportional to the number of times a resource was borrowed against a particular course, so we can also visually identify such items at a glance.