Quick Command Line Reports from CSV Data Parsed Out of XML Data Files

It’s amazing how time flies, isn’t it..? Spotting today’s date I realised that there’s only a week left before the closing date of the UK Discovery Developer Competition, which is making available several UK “cultural metadata” datasets from library catalogue and activity data, EDINA OpenUrl resolver data, National Archives images and English Heritage places metadata, as well as ArchivesHub project related Linked Data and Tyne and Wear Museums Collections metadata.

I was intending to have a look at how easy it was to engage with the datasets (e.g. by blogging initial explorations for each dataset along the lines of the text processing tricks I posted around the EDINA data in Postcards from a Text Processing Excursion, Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Visualising OpenURL Referrals Using Gource), but I seem to have left it a bit late considering other work I need to get done this week… (i.e. marking :-( )

…except for posting this old bit of code, which I don’t think I’ve posted before, that demonstrates how to use the Python scripting language to parse an XML file, such as the Huddersfield University library MOSAIC activity data, and create a CSV/text file that we can then run simple text processing tools against.

If you download the Huddersfield or Lincoln data (e.g. from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data) and have a look at a few lines from the XML files (e.g. using the Unix command line tool head, as in head -n 150 filename.xml to show the first 150 lines of the file), you will notice records of the form:

      <institution>University of Huddersfield</institution>
      <globalID type="ISBN">1903365430</globalID>
      <author>Elizabeth I, Queen of England, 1533-1603.</author>
      <title>Elizabeth I : the golden reign of Gloriana /</title>
      <publisher>National Archives</publisher>
      <courseCode type="ucas">LQV0</courseCode>
      <courseName>BA(H) Humanities</courseName>

Suppose we want to extract the data showing which courses each resource was borrowed against. That is, for each use record, we want to extract a localID and the courseCode. The following script achieves that:

from lxml import etree
import csv
#Inspired by http://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
def parseMOSAIC_Level1_XML(xmlFile,writer):
	context = etree.iterparse(xmlFile)
	record = {}
	# we are going to use record to accumulate a record containing UCAS codes
	# record={ucasCode:[],localID:''}
	print 'starting...'
	for action, elem in context:
		if elem.tag=='useRecord' and action=='end':
			#we have parsed the end of a useRecord, so output course data
			if 'ucasCode' in record and 'localID' in record:
				for cc in record['ucasCode']:
					writer.writerow([record['localID'],cc])
			#reset the accumulator ready for the next useRecord
			record = {}
		if elem.tag=='localID':
			record['localID']=elem.text
		elif elem.tag=='courseCode' and 'type' in elem.attrib and elem.attrib['type']=='ucas':
			if 'ucasCode' not in record:
				record['ucasCode']=[]
			record['ucasCode'].append(elem.text)
		elif elem.tag=='progression' and elem.text=='staff':
			#flag staff borrowing records (could be used to filter them out)
			record['staff']=True

writer = csv.writer(open("test.csv", "wb"))
f='mosaic.2008.level1.1265378452.0000001.xml'
parseMOSAIC_Level1_XML(f,writer)


Usage (if you save the code to the file mosaicXML2csv.py): python mosaicXML2csv.py
Note: this minimal example uses the file specified by f=, in the above case mosaic.2008.level1.1265378452.0000001.xml, and writes the CSV out to test.csv.

(You can also find the code as a gist on Github: simple Python XML2CSV converter)
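For quick experimentation, here is a minimal, self-contained sketch of the same iterparse pattern using the standard library’s ElementTree (Python 3); the inline sample record, including the localID element, is illustrative rather than taken from the real data files.

```python
# Self-contained sketch of the iterparse approach; sample data is illustrative.
import io
import xml.etree.ElementTree as ET

sample = """<useRecords>
  <useRecord row="1">
    <localID>220759</localID>
    <courseCode type="ucas">L500</courseCode>
    <courseCode type="ucas">B390</courseCode>
  </useRecord>
</useRecords>"""

def mosaic_to_rows(xml_source):
    rows = []
    record = {}
    for action, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "useRecord":
            # end of a useRecord: emit one row per UCAS code seen
            for cc in record.get("ucasCode", []):
                rows.append([record.get("localID", ""), cc])
            record = {}
        elif elem.tag == "localID":
            record["localID"] = elem.text
        elif elem.tag == "courseCode" and elem.get("type") == "ucas":
            record.setdefault("ucasCode", []).append(elem.text)
    return rows

rows = mosaic_to_rows(io.StringIO(sample))
print(rows)  # [['220759', 'L500'], ['220759', 'B390']]
```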

Running the script gives data of the form:

220759,L500
176259,L500
Note that we might add an additional column for progression. Add in something like:
if elem.tag=='progression': record['progression']=elem.text
and modify the write command to something like writer.writerow([record['localID'],cc,record['progression']])

We can now generate quick reports over the simplified test.csv data file.

For example, how many records did we extract:
wc -l test.csv

If we sort the records (and by so doing, group duplicated rows) [sort test.csv], we can then pull out unique rows and count the number of times they repeat [uniq -c], then sort them by the number of reoccurrences [sort -k 1 -n -r] and pull out the top 20 [head -n 20] using the combined, piped Unix commandline command:

sort test.csv | uniq -c | sort -k 1 -n -r | head -n 20

This gives an output of the form:
186 220759,L500
134 176259,L500
130 176895,L500

showing that resource with localID 220759 was taken out on course L500 186 times.
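The same sort | uniq -c | sort -n -r | head pipeline can be sketched in Python using collections.Counter (the rows here are illustrative):

```python
from collections import Counter

# Illustrative CSV rows of the form localID,courseCode
rows = ["220759,L500", "220759,L500", "176259,L500", "176895,L500"]

# Count repeated rows and take the most frequent, as uniq -c | sort -n -r | head does
top = Counter(rows).most_common(20)
print(top[0])  # ('220759,L500', 2)
```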

If we just want to count the number of books taken out on a course as a whole, we can just pull out the course code column using the cut command, setting the delimiter to be a comma:
cut -f 2 -d ',' test.csv

Having extracted the course code column, we can sort, find repeat counts, sort again and show the courses with the most borrowings against them:

cut -f 2 -d ',' test.csv | sort | uniq -c | sort -k 1 -n -r | head -n 10

This gives a result of the form:
13476 L500
8799 M931
7499 P301

In other words, we can create very quick and dirty reports over the data using simple commandline tools once we generate a row based, simply delimited text file version of the original XML data report.

Having got the data in a simple test.csv text file, we can also load it directly into the graph plotting tool Gephi, where the two columns (localID and courseCode) are both interpreted as nodes, with an edge going from the localID to the courseCode. (That is, we can treat the two column CSV file as defining a bipartite graph structure.)
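The degree of each node in that bipartite graph (which Gephi uses below to size the nodes) can be sketched directly in Python; the edges here are illustrative:

```python
from collections import Counter

# Treat each two-column CSV row as an edge from a localID node to a courseCode node
edges = [("220759", "L500"), ("176259", "L500"), ("220759", "B390")]

degree = Counter()
for local_id, course in edges:
    degree[local_id] += 1  # resource borrowed against one more course
    degree[course] += 1    # course has one more distinct borrowing edge
print(degree["L500"], degree["220759"])  # 2 2
```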

Running a clustering statistic and a statistic that allows us to size nodes according to degree, we can generate a view over the data that shows the relative activity against courses:

Huddersfield mosaic data

Here’s another view, using a different layout:

Huddersfield JISC MOSAIC activity data

Also note that by choosing an appropriate layout algorithm, the network structure visually identifies courses that are “similar” by virtue of being connected to similar resources. The thickness of the edges is proportional to the number of times a resource was borrowed against a particular course, so we can also visually identify such items at a glance.

My (Rather Scruffy) MOSAIC Library Data Competition Entry

Several weeks ago, I put in an entry to the JISC MOSAIC Library Data Competition based on a couple of earlier hacks I’d put together that were built around @daveyp’s MOSAIC API wrapper for the competition dataset. Here’s the howto…

First thing to note is that the application built on stuff I’d posted about previously:

– an elaboration of UCAS course choice pages with a link that identified books related to a particular course based on course code: First Dabblings With @daveyp’s MOSAIC Library Competition Data API;
– a bookmarklet that would “look up to see whether there are any courses associated with a particular book (or its other ISBN variants) if its ISBN appears in the URI”: People Who Referred To This Book Were Taking This Course.

So what was the submission? A page, keyed in a RESTful way by a UCAS course code, that:

  • takes a course code,
  • looks for the set of books that have been borrowed more than a certain number of times by students associated with that course code (“find popular books associated with this course”),
  • grabs a review of each book from the Amazon mobile site,
  • annotates each book with a list of courses whose students have borrowed the book (“find other courses whose students have used this book”).
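The logic of those steps can be sketched in Python; the (courseCode, ISBN) record structure and the helper names here are hypothetical stand-ins for the MOSAIC API call and the reverse course lookup described above (the Amazon review scrape is not shown):

```python
from collections import Counter

# Hypothetical loan records as (courseCode, ISBN) pairs
records = [("L500", "1903365430"), ("L500", "1903365430"),
           ("L500", "1903365430"), ("L500", "1856752321"),
           ("B390", "1856752321")]

def popular_books(course_code, records, min_loans=3):
    # keep only books borrowed at least min_loans times on this course
    counts = Counter(isbn for cc, isbn in records if cc == course_code)
    return sorted(isbn for isbn, n in counts.items() if n >= min_loans)

def related_courses(isbn, records):
    # find other courses whose students have borrowed this book
    return sorted({cc for cc, i in records if i == isbn})

print(popular_books("L500", records))          # ['1903365430']
print(related_courses("1856752321", records))  # ['B390', 'L500']
```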

This page can be called using the following URI pattern:

(Note that the service can be quite slow at times – the pipe is doing a fair bit of work and I’m not sure how quick the MOSAIC API is either. But that’s not the point, right? The point is identifying the logical glue required to join up the MOSAIC service API with a range of other index keys and web services, so that a production service could then potentially be implemented against a proven (-ish!;-) and demonstrably working (sometimes!) functional spec.)

The pipework itself can be found here: MOSAIC Data: Books on a course, with related courses

The first part takes a UCAS course code as key/index value, and queries the MOSAIC API; only books that have been taken out more than a specified minimum number of times in the context of a particular course are passed through:

A link to the book cover is then included in the description item of each book, and a call made to find the courses related to that book. A crude bit of screenscraping on the Amazon mobile page for each book brings in a book review. (I had originally intended to pull in more reviews from the Amazon API, but over the summer Amazon introduced a key-based handshake for access to the API – so s***w ’em.)

The output of the pipe is pulled into this page as a JSON feed, and the data from it is used to populate the page.

The first column in the table simply displays the title of each book associated with the specified course code.

The book cover is pulled in from… Worldcat, I think…, keyed by ISBN10 (maybe there are licensing issues involved..?) The preview link pops up a preview of the book in a shadowbox, if available, or failing that a link to the generic book info page on Google Books. (Close the shadowbox using the X on the bottom right of the Shadowbox view.)

It should be possible to handle the previewer using code that is displayed directly within the shadowbox, but I ran out of time trying to get it to work, so resorted to using a helper page instead that could be embedded via an iframe in the shadowbox (the helper might be quite useful in its own right as a piece of rapid prototyping componentware? http://ouseful.open.ac.uk/gbookIframeEmbed.php?isbn=159059858X). The lightbox code is from Shadowbox.js.

The Amazon info (the ‘first’ listed Amazon review for the book) is pulled in from the Amazon mobile site via a Yahoo pipe. (The Amazon URIs I use look something like http://www.amazon.co.uk/gp/aw/d.html/?a=ISBN10&er=1) Since Amazon started requiring API requests to be signed, it’s made quick hacks difficult; the next quickest thing is to scrape the mobile site, which is what I’ve done here. A regular expression in the page rewrites the Amazon mobile URIs to the normal web URIs.
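That mobile-to-web URI rewrite can be sketched with a regular expression in Python; the /dp/<ISBN10> target form is an assumption about the “normal” Amazon product page URI, and the sample URI is illustrative:

```python
import re

# Illustrative Amazon mobile URI of the form described above
mobile = "http://www.amazon.co.uk/gp/aw/d.html/?a=159059858X&er=1"

# Assumed target: the standard /dp/<ISBN10> product page form
normal = re.sub(r"/gp/aw/d\.html/?\?a=([0-9X]{10}).*", r"/dp/\1", mobile)
print(normal)  # http://www.amazon.co.uk/dp/159059858X
```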

The course info column shows info for courses that are also associated with the book; because not enough people in HE appear to care about URIs and “pivot data”, we often can’t just take a course code and create a URI that links to a corresponding course description. In the short term, I pivot back to this page, so you can see what other books are associated with the specified course. And as a hack, I munge together a Google search query that at least tries to track the course down based on course code and HEI name (e.g. http://www.google.com/search?q=BSc%28H%29+Computer+Games+Programming+%22University+of+Huddersfield%22. I don’t query UCAS pages directly because the UCAS search uses session variables and a handshake as part of a short-lived URI to a set of search results.) Many of the results are to timed out UCAS searches indexed by Google though. It also amuses me that for some HEIs, searching their public site with the course code for a course they have ‘advertised’ on the UCAS site turns up no results. Zero. Zilch. None.
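The query munging can be sketched with the standard urllib module, reproducing the example Google search URI above (the course name and HEI are taken from that example):

```python
from urllib.parse import quote_plus

# Course name and HEI from the example above
course, hei = "BSc(H) Computer Games Programming", "University of Huddersfield"

# Build a Google search query of the form: <course name> "<HEI name>"
uri = "http://www.google.com/search?q=" + quote_plus('%s "%s"' % (course, hei))
print(uri)
# http://www.google.com/search?q=BSc%28H%29+Computer+Games+Programming+%22University+of+Huddersfield%22
```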

I also posted a complementary bookmarklet that can be used to annotate course search results page on the UCAS website with a link to the appropriate related books’n’courses page.

You may have noticed that the competition entry was posted in a minimal, unstyled form (I had hoped to make use of a Google visualisation API table widget to display the results, but proper work intruded ;-) This is in part to make the point that it is not – never was meant to be – a production service. It’s a working rapid prototype intended to demonstrate how Library data might be used outside the Library domain to act as a marketing support tool for Higher Education courses on the one hand, and an informal learning/related resources recommendation tool on the other.

These sorts of prototypes can be constructed in 1 to 2 hours and provide something tangible for Library folk to talk around (as opposed to documents produced at length describing the needs of fictional characters and user scenarios generated by break out groups in “let’s reinvent our web presence” departmental away days…)

They’re also the sorts of thing that we should be creating and discussing as throwaways on a regular basis, not hiding away for months on end because they’re in some competition or other…

(As to why I haven’t posted this before – a huge half written blog post backlog; the how to has been available on the demo page, which was tweeted about widely as soon as I’d got it on to the server…)

PS see also @ostephen’s Read To Learn competition entry, which takes a set of ISBNs from an uploaded file, looks them up against the MOSAIC API and returns the UCAS course codes of courses that are associated through book loans with the uploaded ISBNs (and maybe xISBNs too?). Each course code is then looked up against the current UCAS course catalogue, and the search results (i.e. the list of corresponding courses at institutions across the UK) is retrieved and displayed. In short, Read To Learn takes a set of ISBNs and finds related courses from a course code lookup on the UCAS site. My app just took a single course code and tried to find related books (along with reviews of the books) and courses.

People Who Referred To This Book Were Taking This Course

In First Dabblings With @daveyp’s MOSAIC Library Competition Data API I posted a recipe for using the Mosaic HEI library loans data to augment UCAS course search results pages with a link to a list of books related to each course based on a minimum specified number of people borrowing those books whilst taking a course with the same course code in previous years.

This post is exactly the same, but completely different: augmenting Amazon book pages with a list of qualifications that were being studied by people who borrowed the same book from a university library. If it’s obvious to you why it’s the same, you have a Good Attitude. If you think it’s completely different, you need ejukating…

So – to start: if you haven’t already done so, go and read @daveyp’s blog post on Simple API for JISC MOSAIC Project Developer Competition data [competition announcement], because that’s what we’re playing with…

Looking at the API calls, there is one we can use that will pull back the courses associated with a particular ISBN:


So if we can grab an ISBN from an Amazon book page, we can then get to a list of related courses from the book page, or even augment the page with related courses. (Linking from, or augmenting Amazon book pages with Library related data has a long history. For example, see Jon Udell’s LibraryLookup Project, LibraryLookup Greasemonkey scripts, Coming Together around Library 2.0, OU Library Traveller – Title Lookup etc.)

Anyway, something like this regular expression will book an ISBN out of a URI:

var isbn = window.location.href.match(/([\/-]|is[bs]n|searchData=)(\d{7,9}[\dX])/i)[2];

[based on a 2002 script by Jon Udell]
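The same extraction can be sketched in Python for quick testing; the URI here is illustrative:

```python
import re

# Illustrative book page URI with an ISBN10 embedded in the path
uri = "http://www.amazon.co.uk/gp/product/1903365430"

# Same pattern as the bookmarklet: a delimiter/prefix, then 7-9 digits plus a digit or X
m = re.search(r"([\/-]|is[bs]n|searchData=)(\d{7,9}[\dX])", uri, re.I)
print(m.group(2))  # 1903365430
```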

so we can readily create a bookmarklet that will take the ISBN and display the Mosaic results for a lookup of the ISBN on that page:

(function (){
 var isbn = window.content.location.href.match(/([\/-]|is[bs]n|searchData=)(\d{7,9}[\dX])/i)[2];
 // assumption: api.pl accepts an isbn parameter, mirroring the ucas= pattern used elsewhere in these posts
 window.location.href = 'http://library.hud.ac.uk/mosaic/api.pl?isbn=' + isbn;
})()
So for example if you visit the Amazon page for this book and run the bookmarklet, you’ll get the Mosaic API results for the corresponding ISBN. (ISBNs also appear in many other book related URIs, so the bookmarklet is likely to work in many other places too; except library catalogues, of course. They rarely, if ever, expose the ISBN of a book in a way that means you can reliably identify and extract it.)

So can you see why this is pretty much exactly the same mashup pattern as before?

1) grab an identifier that we know to be universal, in a given context;
2) use that identifier as a key value in the appropriate API call and construct a URI to call the appropriate query;
3) do something with the result (or not!)
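The three steps can be sketched for the Mosaic ISBN lookup; the isbn= parameter is an assumption based on the api.pl?ucas= pattern used elsewhere in these posts, and no request is actually made here:

```python
# 1) grab an identifier that we know to be universal in this context
isbn = "1856752321"

# 2) use it as a key value to construct the API query URI (parameter name assumed)
api_uri = "http://library.hud.ac.uk/mosaic/api.pl?isbn=" + isbn

# 3) do something with the result (here, just show the URI we would call)
print(api_uri)
```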

One problem with using a single ISBN however is that many versions of the same book may exist, each with their own ISBN. In some cases, the librarians are right – you need to know exactly which edition of a book you should be looking at. Most of the time, however, you just want a copy of the book that has the same title and author as the one you were told to find. Hardback, paperback, whatever…

In the context of the course lookup, what this might mean is that given the ISBN of a book being browsed on Amazon, or elsewhere, you want to look up courses associated with that particular ISBN, or with ISBNs of the different variants of the same book.

There are a couple of tools that do this – xISBN (from WorldCat), and ThingISBN (from LibraryThing) – and they work as follows: given an ISBN, return the ISBNs of all the other variants of that book.

Both are implemented as RESTful webservices, so given the ISBN you can easily generate a URI that will return an XML file containing the variant ISBNs. (See the ISBN Playground for a list of more than a few URIs constructed around an ISBN key. Linked data without the RDF, maybe? Or pivot data, at least…;-)
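Parsing such a variant-ISBN response can be sketched with ElementTree; the sample below is a canned, ThingISBN-style list rather than a real service response:

```python
import xml.etree.ElementTree as ET

# Canned, ThingISBN-style response: an idlist of variant isbn elements (illustrative)
sample = "<idlist><isbn>1903365430</isbn><isbn>1903365449</isbn></idlist>"

variants = [e.text for e in ET.fromstring(sample).iter("isbn")]
print(variants)  # ['1903365430', '1903365449']
```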

A quick tweet to @daveyp over the weekend, and as if by magic, the ability to call the Mosaic API with more than one ISBN appeared:


Another tweet, and an xisbn service was implemented on the server side (using ThingISBN maybe? I’m not sure…?):


Which is to say, look up all the courses associated with all ISBN variants of the specified ISBN.

So if we retweak the bookmarklet, we can look up to see whether there are any courses associated with a particular book (or its other ISBN variants) if its ISBN appears in the URI:

(function (){
 var isbn = window.location.href.match(/([\/-]|is[bs]n|searchData=)(\d{7,9}[\dX])/i)[2];
 // assumption: the server-side variant-ISBN lookup is exposed as an xisbn parameter on api.pl
 window.location.href = 'http://library.hud.ac.uk/mosaic/api.pl?xisbn=' + isbn;
})()

And the use case? An engine for generating contextual, book specific qualification adverts, perhaps?;-) Or an extension to a library traveler script/toolbar?

PS Just by the by, whilst looking through the Mosaic XML, I noticed that it appeared to assume the presence of ISBN10 formatted ISBNs, which prompted me to ask @daveyp whether ISBNs, or other ‘universal’ identifiers (such as ISSNs, and maybe even ebook identifiers, DOIs and suchlike (though I didn’t ask him about the latter two)) could be catered for. Apparently, the spec currently has nothing to say on such things…

PPS See also The ISBN Is Dead (via @lorcanD) for a commentary on the lack of support that the ISBN has from some quarters… like Amazon, who apparently only use ISBNs sort of – the number that actually appears in the URI is an Amazon ASIN (which may or may not correspond to an ISBN…).

First Dabblings With @daveyp’s MOSAIC Library Competition Data API

A couple of days ago, Dave Pattern published a simple API to the JISC MOSAIC project HE library loans data that has been opened up as part of a developer competition running over the summer (Simple API for JISC MOSAIC Project Developer Competition data [competition announcement]).

The API offers a variety of simple queries on the records data (we like simple queries:-) that allow the user to retrieve records according to the ISBN of the book that was borrowed, books that were taken out by people on a particular course (using a UCAS course code identifier), or a combination of the two.

The results file from a search is returned as an XML file containing zero or more results records; each record has the form:

<useRecord row="19">
  <institution>University of Huddersfield</institution>
  <globalID type="ISBN">1856752321</globalID>
  <author>Mojay, Gabriel.</author>
  <title>Aromatherapy for healing the spirit : a guide to restoring emotional and mental balance through essential oils /</title>
  <courseCode type="ucas">B390</courseCode>
  <courseName>FdSc Holistic Therapies</courseName>

[How to style code fragments in a hosted WordPress.com blog, via WordPress Support: “Code” shortcode]

(For a more complete description of the record format, see Mosaic data collection – A guide [PDF])

As a warm up exercise to familiarise myself with the data, I did a proof of concept hack (in part inspired by the Library Traveller script) that would simply linkify course codes appearing on a UCAS courses results page such that each course code would link to a results page listing the books that had been borrowed by students associated with courses with a similar course code in previous years.

Looking at the results page, we can see the course code appears next to the name of each course:

A simple bookmarklet can be used to linkify the qualification code so that it points to the results of a query on the MOSAIC data with the appropriate course code:

 var regu=new RegExp("\([A-Z0-9]{4}\)");
 var s=document.getElementsByTagName('span');
 for (i=0;i<s.length;i++){
  if (s[i].getAttribute('class')=='bodyTextSmallGrey')
   if (regu.test(s[i].innerHTML)){
    var id=regu.exec(s[i].innerHTML);
    // the tail of this statement was truncated in the original posting; this
    // completion wraps the matched code in a link to the MOSAIC API query
    s[i].innerHTML=s[i].innerHTML.replace( regu,
     "<a href=\"http://library.hud.ac.uk/mosaic/api.pl?ucas="
      + id[0] + "\">" + id[0] + "</a>");
   }
 }

(It’s trivial to convert this to a Greasemonkey script: simply add the required Greasemonkey header and save the file with an appropriate filename – e.g. ucasCodeLinkify.user.js)

Clicking on the bookmarklet linkifies the qualification code to point to a search on http://library.hud.ac.uk/mosaic/api.pl?ucas= with the appropriate code.

To make the results a little friendlier, I created a simple Yahoo pipe that generates an RSS feed containing a list of books (along with their book covers) that had been borrowed more than a specified minimum number of times by people associated with that course code in previous years.

To start with, create the URI with a particular UCAS qualification code to call the web service and pull in the XML results feed:

Map elements onto legitimate RSS feed elements:

and strip out duplicate books. Note that the pipe also counts how many duplicate (?!) items there are:

Now filter the items based on the duplication (replication?) count – we want to only see books that have been borrowed at least a minimum number of times (this rules out ‘occasional’ loans on unrelated courses by single individuals – we only want the popular or heavily borrowed books associated with a particular UCAS qualification code in the first instance):

Finally, create a link to each book on Google books, and grab the book cover. Note that I assume the global identifier is an ISBN10… (if it’s an ISBN13, see Looking Up Alternative Copies of a Book on Amazon, via ThingISBN for a defensive measure using the LibraryThing Check ISBN API. That post also points towards a way by which we might find other courses that are associated with different editions of a particular book… ;-)
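Where the global identifier turns out to be an ISBN13, converting it to the ISBN10 form is straightforward using the standard check-digit rule (the sample ISBN is the Huddersfield one from earlier in the post):

```python
def isbn13_to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN13 to ISBN10 using the standard check-digit rule."""
    core = isbn13[3:12]  # drop the 978 prefix and the ISBN13 check digit
    total = sum((10 - i) * int(d) for i, d in enumerate(core))
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))

print(isbn13_to_isbn10("9781903365434"))  # 1903365430
```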


You can find the pipe here: MOSAIC data: books borrowed on particular courses.

If we now change the linkifying URL to the RSS output of the pipe, we can provide a link for each course on the UCAS course search results page that points to an RSS feed of reading material presumably popularly associated with the course in previous years (note, however, that not all codes have books associated with them).

To do this, simply change the following URI stub in the bookmarklet:

The “popular related borrowings” reading list generated now allows a potential student to explore some of the books associated with a particular course at decision time:-)

One possible follow on step would be to look up other courses related to each original course by virtue of several people having borrowed the same book (or other editions of it) on other courses. Can you see how you might achieve that, or at least what sorts of steps you need to implement?
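One way that follow-on step might be sketched: from (courseCode, ISBN) loan pairs, collect the books borrowed on the original course, then find the other courses sharing at least one of those books (the records here are hypothetical, and the variant-edition lookup is not shown):

```python
# Hypothetical (courseCode, ISBN) loan pairs
records = [("L500", "1903365430"), ("B390", "1903365430"),
           ("L500", "1856752321"), ("M931", "1111111111")]

def courses_sharing_books(course, records):
    # books borrowed on the original course
    books = {isbn for cc, isbn in records if cc == course}
    # other courses on which at least one of those books was also borrowed
    return sorted({cc for cc, isbn in records if isbn in books and cc != course})

print(courses_sharing_books("L500", records))  # ['B390']
```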

PS If anyone would like to work this recipe up properly as (part of) a competition entry, feel free, though I’d appreciate a cut of any prize money if you win;-)