Visualising OpenURL Referrals Using Gource

Picking up on the OpenURL referrer data that I played with here, here’s a demo of how to visualise it using Gource [video]:

If you haven’t come across it before, Gource is a repository visualiser (Code Swarm is another one) that lets you visualise who has been checking documents into and out of a code repository. As the documentation describes it, “software projects are displayed by Gource as an animated tree with the root directory of the project at its centre. Directories appear as branches with files as leaves. Developers can be seen working on the tree at the times they contributed to the project.”

One of the nice things about Gource is that it accepts a simple custom log format that can be used to visualise anything you can represent as a series of actors, doing things to something that lives down a path, over time… (So for example, PyEvolve which visualises Google Analytics data to show website usage.)

In the case of the Edina OpenURL resolver, I mapped referring services onto the “flower”/file nodes, and institutional IDs onto the people. (If someone could clarify what the institutional IDs – column 4 of the log – actually refer to, I’d be really grateful?)

To generate the Gource log file – which needs to look like this:

  • timestamp – A unix timestamp of when the update occurred.
  • username – The name of the user who made the update.
  • type – initial for the update type – (A)dded, (M)odified or (D)eleted.
  • file – Path of the file updated.

That is: 1275543595|andrew|A|src/main.cpp

I used a command line trick and a Python trick:

cut -f 1,2,3,4,40 L2_2011-04.csv > openurlgource.csv
head -n 100 openurlgource.csv > openurlgource100.csv

(Taking the head of the file containing just columns 1,2,3,4 and 40 of the log data meant I could try out my test script on a small file to start with…)
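If cut and head aren't to hand, the same column-picking and previewing can be roughed out in Python. This is only a sketch written for Python 3: the cut_columns helper and the three-column sample are made up for illustration and are not part of the original workflow or the real log data.

```python
import csv
import io
from itertools import islice

def cut_columns(lines, columns, limit=None):
    """Yield the chosen 1-based columns of tab-separated rows, like `cut -f`.

    `limit` caps the number of rows read, like `head -n`.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in islice(reader, limit):
        # Skip column numbers that a short row does not have
        yield [row[c - 1] for c in columns if c <= len(row)]

# Made-up three-column sample, not real log data
sample = io.StringIO("date\ttime\tid\n2011-04-01\t00:00:04\tabc\n")
preview = list(cut_columns(sample, [1, 3], limit=2))
print(preview)
```

For the real file you would pass an open file handle instead of the StringIO sample, with [1, 2, 3, 4, 40] as the column list.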

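import csv
from time import mktime, strptime

# Python 3: csv files are opened in text mode with newline='' (Python 2 used 'rb'/'wb')
f = open('openurlgource.csv', newline='')
out = open('openurlgource.txt', 'w', newline='')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(out, delimiter='|')
headerline = next(reader)  # skip the header row
for row in reader:
	# only keep rows that have a referrer ID (the last of the cut columns)
	if row[4].strip() != '':
		# date + time columns -> unix timestamp (mktime reads them as local time)
		t = int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
		# user = institutional ID; file path = referrer ID with ':' turned into '/'
		writer.writerow([t, row[3], 'A', row[4].rstrip(':').replace(':','/')])

f.close()
out.close()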

(Thanks to @quentinsf for the Python time handling crib:-)
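One thing to be aware of with the time handling: mktime treats the parsed time as local time on whatever machine runs the script, so the same log row can come out with a different timestamp in a different timezone. If the log times were known to be UTC (an assumption on my part; nothing above says which they are), a variant using calendar.timegm would give machine-independent values. The to_unix_utc name is just for illustration.

```python
import calendar
from time import strptime

def to_unix_utc(date_str, time_str):
    """Parse 'YYYY-MM-DD' and 'HH:MM:SS' as UTC and return a unix timestamp."""
    parsed = strptime(date_str + " " + time_str, "%Y-%m-%d %H:%M:%S")
    # timegm is the UTC counterpart of mktime: no local timezone is applied
    return calendar.timegm(parsed)

print(to_unix_utc("2011-04-01", "00:00:04"))  # 1301616004
```

For the animation itself only the relative ordering and spacing of events matters, so either approach should give the same picture as long as it is applied consistently.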

This gives me log data of the required form:
1301612404|687369|A|www.isinet.com/WoK/UA
1301612413|305037|A|www.isinet.com/WoK/WOS
1301612414|117143|A|OVID/Ovid MEDLINE(R)
1301612436|822542|A|mendeley.com/mendeley
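Before handing a big file to Gource it can be worth checking that every line really has this shape. The following is a minimal sketch (Python 3) of such a check; parse_gource_line is a hypothetical helper, and it allows for the optional fifth colour field that the second script further down writes out. It splits naively on the pipe character, so it assumes no field contains a pipe or csv-style quoting.

```python
def parse_gource_line(line):
    """Split one custom-log line and check it has the shape described above."""
    fields = line.rstrip("\r\n").split("|")
    if len(fields) not in (4, 5):
        raise ValueError("expected 4 or 5 pipe-separated fields: %r" % line)
    timestamp, username, action, path = fields[:4]
    if not timestamp.isdigit():
        raise ValueError("timestamp is not an integer: %r" % timestamp)
    if action not in ("A", "M", "D"):
        raise ValueError("type must be A, M or D: %r" % action)
    colour = fields[4] if len(fields) == 5 else None
    return int(timestamp), username, action, path, colour

# The four sample lines shown above
sample = [
    "1301612404|687369|A|www.isinet.com/WoK/UA",
    "1301612413|305037|A|www.isinet.com/WoK/WOS",
    "1301612414|117143|A|OVID/Ovid MEDLINE(R)",
    "1301612436|822542|A|mendeley.com/mendeley",
]
for line in sample:
    print(parse_gource_line(line))
```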

Running Gource uses commands of the form:

gource -s 1 --hide usernames --start-position 0.5 --stop-position 0.51 openurlgource.txt

The video was generated using ffmpeg with a piped command of the form:

gource -s 1 --hide usernames --start-position 0.5 --stop-position 0.51 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

Note that I had to compile ffmpeg myself, which required hunting down a variety of libraries (e.g. Lame, the WebM encoder, and the x264 encoder library), compiling them as shared resources (./configure --enable-shared) and then adding them into the build (in the end, on my Macbook Pro, I used ./configure --enable-libmp3lame --enable-shared --enable-libvpx --enable-libx264 --enable-gpl --disable-mmx --arch=x86_64 followed by the usual make and then sudo make install).

Getting ffmpeg and its dependencies configured and compiled was the main hurdle (I had an older version installed for transforming video between formats, as described in ffmpeg – Handy Hints, but needed the update), but now it’s in place, it’s yet another toy in the toybox that can do magical things when given data in the right format: gource:-)

Another Blooming Look at Gource and the Edina OpenURL Data

Having done a first demo of how to use Gource to visualise activity around the EDINA OpenURL data (Visualising OpenURL Referrals Using Gource), I thought I’d try something a little more artistic, and use the colour features to try to pull out a bit more detail from the data [video].

What this one shows is how the mendeley referrals glow brightly green, which – if I’ve got my code right – suggests a lot of e-issn lookups are going on (the red nodes correspond to an issn lookup, blue to an isbn lookup and yellow/orange to an unknown lookup). The regularity of activity around particular nodes also shows how a lot of the activity is actually driven by a few dominant services, at least during the time period I sampled to generate this video.
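The impression that a few services dominate can also be checked directly against the Gource log rather than by eye. As a rough sketch (Python 3), the hypothetical referrer_counts helper below tallies the top-level path segment of each log line; it is run here on the four sample lines from the first post, so the counts say nothing about the full dataset.

```python
from collections import Counter

def referrer_counts(log_lines):
    """Count log entries per top-level path segment (the referring service)."""
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\r\n").split("|")
        if len(fields) >= 4:
            # The path is the fourth field; its first segment is the service
            counts[fields[3].split("/")[0]] += 1
    return counts

sample = [
    "1301612404|687369|A|www.isinet.com/WoK/UA",
    "1301612413|305037|A|www.isinet.com/WoK/WOS",
    "1301612414|117143|A|OVID/Ovid MEDLINE(R)",
    "1301612436|822542|A|mendeley.com/mendeley",
]
for service, n in referrer_counts(sample).most_common():
    print(service, n)
```

Passing an open file handle on openurlgource.txt in place of the sample list would give the tally for the whole log.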

So how was this visualisation created?

Firstly, I pulled out a few more data columns, specifically the issn, eissn, isbn and genre data. I then opted to set node colour according to whether the issn (red), eissn (green) or isbn (blue) columns were populated using a default reasoning approach (if all three were blank, I coloured the node yellow). I then experimented with colouring the actors (I think?) according to whether the genre was article-like, book-like or unknown (mapping these on to add, modify or delete actions), before dropping the size of the actors altogether in favour of just highlighting referrers and asset type (i.e. issn, e-issn, book or unknown).

cut -f 1,2,3,4,27,28,29,32,40 L2_2011-04.csv > openurlgource.csv

When running the Python script, I got a “NULL Byte” error that stopped the script working (something obviously snuck in via one of the newly added columns), so I googled around and turned up a little command line cleanup routine for the cut data file:

tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
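The same cleanup can be sketched in Python for anyone without tr available. This is an illustrative alternative rather than what was actually run here, and strip_nulls is a made-up name; it works on raw bytes so it does not need to know the file's text encoding.

```python
def strip_nulls(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # NUL is a single byte, so chunk boundaries cannot split it
            dst.write(chunk.replace(b"\x00", b""))

# Usage with the filenames from the tr command above:
# strip_nulls("openurlgource.csv", "openurlgourcenonulls.csv")
```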

Here’s the new Python script too that shows the handling of the colour fields:

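import csv
from time import mktime, strptime

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

# Python 3: csv files are opened in text mode with newline='' (Python 2 used 'rb'/'wb')
f = open('openurlgourcenonulls.csv', newline='')
out = open('openurlgource.txt', 'w', newline='')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(out, delimiter='|')
headerline = next(reader)  # skip the header row

for row in reader:
	# columns after the cut: 0 date, 1 time, 3 institutional ID,
	# 4 issn, 5 eissn, 6 isbn, 7 genre, 8 referrer ID
	if row[8].strip() != '':
		t = int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
		# node colour: the first populated identifier column wins
		if row[4] != '':
			col = 'FF0000'  # issn: red
		elif row[5] != '':
			col = '00FF00'  # eissn: green
		elif row[6] != '':
			col = '0000FF'  # isbn: blue
		else:
			col = '666600'  # none of the three: dark yellow
		# genre mapped onto Gource's add/modify/delete action types
		if row[7] in ('article', 'journal'):
			typ = 'A'
		elif row[7] in ('book', 'bookitem'):
			typ = 'M'
		else:
			typ = 'D'
		agent = row[8].rstrip(':').replace(':','/')
		writer.writerow([t, row[3], typ, agent, col])

f.close()
out.close()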

The new gource command is:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 openurlgource.txt

and the command to generate the video:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

If you’ve been tempted to try Gource out yourself on some of your own data, please post a link in the comments below:-) (I wonder just how many different sorts of data we can force into the shape that Gource expects?!;-)