Quick Command Line Reports from CSV Data Parsed Out of XML Data Files

It’s amazing how time flies, isn’t it..? Spotting today’s date I realised that there’s only a week left before the closing date of the UK Discovery Developer Competition, which is making available several UK “cultural metadata” datasets from library catalogue and activity data, EDINA OpenUrl resolver data, National Archives images and Engligh Heritage places metadata, as well as ArchivesHub project related Linked Data and Tyne and Wear Museums Collections metadata.

I was intending to have a look at how easy it was to engage with datasets (e.g. by blogging intitial explorations for each dataset along the lines of the text processing tricks I posted around the EDINA data in Postcards from a Text Processing Excursion, Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Visualising OpenURL Referrals Using Gource), but I seem to have left it a bit let considering other work I need to get done this week… (i.e. marking:-(

..except for posting this old bit of code I don’t think I’ve posted before that demonstrates how to use the Python scripting language to parse an XML file, such as the Huddersfield University library MOSAIC activity data, and create a CSV/text file that we can then run simple text processing tools against.

If you download the Huddersfield or Lincoln data (e.g. from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data) and have a look at a few lines from the XML files (e.g. using the Unix command line tool head, as in head -n 150 filename.xml to show the first 150 lines of the file), you will notice records of the form:

      <institution>University of Huddersfield</institution>
      <globalID type="ISBN">1903365430</globalID>
      <author>Elizabeth I, Queen of England, 1533-1603.</author>
      <title>Elizabeth I : the golden reign of Gloriana /</title>
      <publisher>National Archives</publisher>
      <courseCode type="ucas">LQV0</courseCode>
      <courseName>BA(H) Humanities</courseName>

Suppose we want to extract the data showing which courses each resource was borrowed against. That is, for each use record, we want to extract a localID and the courseCode. The following script achieves that:

from lxml import etree
import csv
#Inspired by http://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
def parseMOSAIC_Level1_XML(xmlFile,writer):
	context = etree.iterparse(xmlFile)
	record = {}
	# we are going to use record to create a record containing UCAS codes
	# record={ucasCode:[],localID:''}
	records = []
	print 'starting...'
	for action, elem in context:
		if elem.tag=='useRecord' and action=='end':
			#we have parsed the end of a useRecord, so output course data
			if 'ucasCode' in record and 'localID' in record:
				for cc in record['ucasCode']:
		if elem.tag=='localID':
		elif elem.tag=='courseCode' and 'type' in elem.attrib and elem.attrib['type']=='ucas':
			if 'ucasCode' not in record:
		elif elem.tag=='progression' and elem.text=='staff':
	#return records

writer = csv.writer(open("test.csv", "wb"))


Usage (if you save the code to the file mosaicXML2csv.py): python mosaicXML2csv.py
Note: this minimal example uses the file specified by f=, in the above case mosaic.2008.level1.1265378452.0000001.xml and writes the CSV out to test.csv

(You can also find the code as a gist on Github: simple Python XML2CSV converter)

Running the script gives data of the form:


Note that we might add an additional column for progression. Add in something like:
if elem.tag=='progression': record['progression']=elem.text
and modify the write command to something like writer.writerow([record['localID'],cc,record['progression']])

We can now generate quick reports over the simplified test.csv data file.

For example, how many records did we extract:
wc -l test.csv

If we sort the records (and by so doing, group duplicated rows) [sort test.csv], we can then pull out unique rows and count the number of times they repeat [uniq -c], then sort them by the number of reoccurrences [sort -k 1 -n -r] and pull out the top 20 [head -n 20] using the combined, piped Unix commandline command:

sort test.csv | uniq -c | sort -k 1 -n -r | head -n 20

This gives and output of the form:
186 220759,L500
134 176259,L500
130 176895,L500

showing that resource with localID 220759 was taken out on course L500 186 times.

If we just want to count the number of books taken out on a course as a whole, we can just pull out the coursecode column using the cut command, setting the delimiter to be a comma:
cut -f 2 -d ',' test.csv

Having extracted the course code column, we can sort, find repeat counts, sort again and show the courses with the most borrowings against them:

cut -f 2 -d ',' test.csv | sort | uniq -c | sort -k 1 -n -r | head -n 10

This gives a result of the form:
13476 L500
8799 M931
7499 P301

In other words, we can create very quick and dirty reports over the data using simple commandline tools once we generate a row based, simply delimited text file version of the original XML data report.

Having got the data in a simple test.csv text file, we can also load it directly into the graph plotting tool Gephi, where the two columns (localID and courseCode) are both interpreted as nodes, with an edge going from the localID to the courseCode. (That is, we can treat the two column CSV file as defining a bipartite graph structure.)

Running a clustering statistic and a statistic that allows us to size nodes according to degree, we can generate a view over the data that shows the relative activity against courses:

Huddersfield mosaic data

Here’s another view, using a different layout:

Huddersfield JISC MOSAIC activity data

Also note that by choosing an appropriate layout algorithm, the network structure visually identifies courses that are “similar” by virtue of being connected to similar resources. The thickness of the edges is proportional to the number of times a resource was borrowed against a particular course, so we can also visually identify such items at a glance.

Another Blooming Look at Gource and the Edina OpenURL Data

Having done a first demo of how to use Gource to visualise activity around the EDINA OpenURL data (Visualising OpenURL Referrals Using Gource), I thought I’d trying something a little more artistic, and use the colour features to try to pull out a bit more detail from the data [video].

What this one shows is how the mendeley referrals glow brightly green, which – if I’ve got my code right – suggests a lot of e-issn lookups are going on (the red nodes correspond to an issn lookup, blue to an isbn lookup and yellow/orange to an unknown lookup). The regularity of activity around particular nodes also shows how a lot of the activity is actually driven by a few dominant services, at least during the time period I sampled to generate this video.

So how was this visualisation created?

Firstly, I pulled out a few more data columns, specifically the issn, eissn, isbn and genre data. I then opted to set node colour according to whether the issn (red), eissn (green) or isbn (blue) columns were populated using a default reasoning approach (if all three were blank, I coloured the node yellow). I then experimented with colouring the actors (I think?) according to whether the genre was article-like, book-like or unkown (mapping these on to add, modify or delete actions), before dropping the size of the actors altogether in favour of just highlighting referrers and asset type (i.e. issn, e-issn, book or unknown).

cut -f 1,2,3,4,27,28,29,32,40 L2_2011-04.csv > openurlgource.csv

When running the Pythin script, I got a “NULL Byte” error that stopped the script working (something obviously snuck in via one of the newly added columns), so I googled around and turned up a little command line cleanup routine for the cut data file:

tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv

Here’s the new Python script too that shows the handling of the colour fields:

import csv
from time import *

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

f=open('openurlgourcenonulls.csv', 'rb')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt','wb'),delimiter='|')
headerline = reader.next()

for row in reader:
	if row[8].strip() !='':
		t=int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
		if row[4]!='':
		elif row[5]!='':
		elif row[6]!='':
		if row[7]=='article' or row[7]=='journal':
		elif row[7]=='book' or row[7]=='bookitem':

The new gource command is:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 openurlgource.txt

and the command to generate the video:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

If you’ve been tempted to try Gource out yourself on some of your own data, please post a link in the comments below:-) (AI wonder just how many different sorts of data we can force into the shape that Gource expects?!;-)

Visualising OpenURL Referrals Using Gource

Picking up on the OpenURL referrer data that I played with here, here’s a demo of how to visualise it using Gource [video]:

If you haven’t come across it before, Gource is a repository visualiser (Code Swarm is another one) that lets you visualise who has been checking documents into and out of a code repository. As the documentation describes it, “software projects are displayed by Gource as an animated tree with the root directory of the project at its centre. Directories appear as branches with files as leaves. Developers can be seen working on the tree at the times they contributed to the project.”

One of the nice things about Gource is that it accepts a simple custom log format that can be used to visualise anything you can represent as a series of actors, doing things to something that lives down a path, over time… (So for example, PyEvolve which visualises Google Analytics data to show website usage.)

In the case of the Edina OpenURL resolver, I mapped referring services onto the “flower”/file nodes, and institutional IDs onto the people. (If someone could clarify what the institutional IDs – column 4 of the log – actually refer to, I’d be really grateful?)

To generate the Gource log file – which needs to look like this:

  • timestamp – A unix timestamp of when the update occured.
  • username – The name of the user who made the update.
  • type – initial for the update type – (A)dded, (M)odified or (D)eleted.
  • file – Path of the file updated.

That is: 1275543595|andrew|A|src/main.cpp

I used a command line trick and a Python trick:

cut -f 1,2,3,4,40 L2_2011-04.csv > openurlgource.csv
head -n 100 openurlgource.csv > openurlgource100.csv

(Taking the head of the file containing just columns 1,2,3,4 and 40 of the log data meant I could try out my test script on a small file to start with…)

import csv
from time import *
f=open('openurlgource.csv', 'rb')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt','wb'),delimiter='|')
headerline = reader.next()
for row in reader:
	if row[4].strip() !='':
		t=int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))

(Thanks to @quentinsf for the Python time handling crib:-)

This gives me log data of the required form:
1301612414|117143|A|OVID/Ovid MEDLINE(R)

Running Gource uses commands of the form:

gource -s 1 --hide usernames --start-position 0.5 --stop-position 0.51 openurlgource.txt

The video was generated using ffmpeg with a piped command of the form:

gource -s 1 --hide usernames --start-position 0.5 --stop-position 0.51 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

Note that I had to compile ffmpeg myself, which required hunting down a variety of libraries (e.g. Lame, the WebM encoder, and the x264 encoder library), compiling them as shared resources (./configure --enable-shared) and then adding them into the build (in the end, on my Macbook Pro, I used ./configure –enable-libmp3lame –enable-shared –enable-libvpx –enable-libx264 –enable-gpl –disable-mmx –arch=x86_64 followed by the usual make and then sudo make install).

Getting ffmpeg and its dependencies configured and compiled was the main hurdle (I had an older version installed for transforming video between formats, as described in ffmpeg – Handy Hints, but needed the update), but now it’s in place, it’s yet another toy in the toybox that can do magical things when given data in the right format: gource:-)