gource – OUseful.Info, the blog…

Picking up on the OpenURL referrer data that I played with here, here’s a demo of how to visualise it using Gource [video]:

If you haven’t come across it before, Gource is a repository visualiser (Code Swarm is another one) that lets you visualise who has been checking documents into and out of a code repository. As the documentation describes it, “software projects are displayed by Gource as an animated tree with the root directory of the project at its centre. Directories appear as branches with files as leaves. Developers can be seen working on the tree at the times they contributed to the project.”

One of the nice things about Gource is that it accepts a simple custom log format that can be used to visualise anything you can represent as a series of actors, doing things to something that lives down a path, over time… (For example, PyEvolve visualises Google Analytics data to show website usage.)

In the case of the Edina OpenURL resolver, I mapped referring services onto the “flower”/file nodes, and institutional IDs onto the people. (If someone could clarify what the institutional IDs – column 4 of the log – actually refer to, I’d be really grateful?)

To generate the Gource log file – which needs to look like this:

timestamp – A unix timestamp of when the update occurred.

username – The name of the user who made the update.

type – initial for the update type – (A)dded, (M)odified or (D)eleted.
file – Path of the file updated.

That is: 1275543595|andrew|A|src/main.cpp

I used a command line trick and a Python trick:

cut -f 1,2,3,4,40 L2_2011-04.csv > openurlgource.csv
head -n 100 openurlgource.csv > openurlgource100.csv

(Taking the head of the file containing just columns 1,2,3,4 and 40 of the log data meant I could try out my test script on a small file to start with…)
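If cut and head aren't to hand, the same column selection can be roughed out in Python. This is only a sketch (written for Python 3, unlike the Python 2 script in this post): the helper name cut_columns is made up for illustration, and it assumes plain tab-separated lines with 1-indexed column numbers, as cut -f uses.

```python
import io
from itertools import islice

def cut_columns(lines, columns, limit=None):
    """Yield the chosen 1-indexed tab-separated columns, a bit like `cut -f`.

    `limit` caps the number of lines read, a bit like `head -n`.
    """
    for line in islice(lines, limit):
        fields = line.rstrip("\n").split("\t")
        # Skip column numbers that a short line does not have
        yield [fields[i - 1] for i in columns if i <= len(fields)]

# Made-up demo data standing in for the real log file
demo = io.StringIO("a\tb\tc\td\ne\tf\tg\th\n")
for row in cut_columns(demo, [1, 3], limit=1):
    print("\t".join(row))
```

For the real file, the equivalent call would pass an open file handle and the column list [1, 2, 3, 4, 40].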

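import csv
from time import mktime, strptime

# Tab-separated input produced by the cut command above
f = open('openurlgource.csv', 'rb')
reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt', 'wb'), delimiter='|')
headerline = reader.next()  # skip the header row
for row in reader:
	# Only keep rows that name a referring service (column 40 of the full log)
	if row[4].strip() != '':
		# Combine the date and time columns into a unix timestamp (local time)
		t = int(mktime(strptime(row[0] + " " + row[1], "%Y-%m-%d %H:%M:%S")))
		# Institutional ID as the user; referrer as the path, with ':' mapped to '/'
		writer.writerow([t, row[3], 'A', row[4].rstrip(':').replace(':', '/')])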

(Thanks to @quentinsf for the Python time handling crib:-)
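For anyone on Python 3, the per-row conversion might look something like the sketch below. The function name row_to_gource and the sample row are made up for illustration. Note one deliberate difference: it treats the log's date and time as UTC (via calendar.timegm) so the result doesn't depend on the machine's timezone, whereas the script above uses local time via mktime.

```python
import calendar
import time

def row_to_gource(row):
    """Convert one cut-down log row into Gource fields, or None if no referrer."""
    referrer = row[4].strip()
    if not referrer:
        return None
    # Assumption: the log's date and time are treated as UTC here
    stamp = time.strptime(row[0] + " " + row[1], "%Y-%m-%d %H:%M:%S")
    ts = calendar.timegm(stamp)
    # Referrer IDs use ':' as a separator; Gource wants a '/'-separated path
    path = referrer.rstrip(":").replace(":", "/")
    return [ts, row[3], "A", path]

# Hypothetical row: date, time, an unused column, institutional ID, referrer
sample = ["2011-04-01", "00:00:04", "", "687369", "www.isinet.com:WoK:UA"]
print("|".join(str(field) for field in row_to_gource(sample)))
```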

This gives me log data of the required form:
1301612404|687369|A|www.isinet.com/WoK/UA
1301612413|305037|A|www.isinet.com/WoK/WOS
1301612414|117143|A|OVID/Ovid MEDLINE(R)
1301612436|822542|A|mendeley.com/mendeley
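Before handing a big file to Gource, it can be worth sanity-checking that every line has this shape. Here is a minimal Python 3 sketch; the helper name check_gource_line is made up, and it assumes that no field contains a '|' character and that an optional fifth field, if present, carries a hex colour.

```python
def check_gource_line(line):
    """Return True if `line` looks like a Gource custom log entry."""
    fields = line.rstrip("\n").split("|")
    # timestamp|username|type|path, plus an optional colour field
    if len(fields) not in (4, 5):
        return False
    timestamp, username, update_type, path = fields[:4]
    return (
        timestamp.isdigit()
        and username != ""
        and update_type in ("A", "M", "D")
        and path != ""
    )

lines = [
    "1301612404|687369|A|www.isinet.com/WoK/UA",
    "1301612414|117143|A|OVID/Ovid MEDLINE(R)",
]
print(all(check_gource_line(line) for line in lines))
```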

Running Gource uses commands of the form:

gource -s 1 --hide usernames --start-position 0.5 --stop-position 0.51 openurlgource.txt

The video was generated using ffmpeg with a piped command of the form:

gource -s 1 --hide usernames --start-position 0.5 --stop-position 0.51 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

Note that I had to compile ffmpeg myself, which required hunting down a variety of libraries (e.g. Lame, the WebM encoder, and the x264 encoder library), compiling them as shared resources (./configure --enable-shared) and then adding them into the build (in the end, on my MacBook Pro, I used ./configure --enable-libmp3lame --enable-shared --enable-libvpx --enable-libx264 --enable-gpl --disable-mmx --arch=x86_64 followed by the usual make and then sudo make install).

Getting ffmpeg and its dependencies configured and compiled was the main hurdle (I had an older version installed for transforming video between formats, as described in ffmpeg – Handy Hints, but needed the update), but now it’s in place, it’s yet another toy in the toybox that can do magical things when given data in the right format: gource:-)

import csv
from time import *

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

f=open('openurlgourcenonulls.csv', 'rb')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt','wb'),delimiter='|')
headerline = reader.next()
for row in reader:
	if row[8].strip() !='':
		t=int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
		if row[4]!='':
			col='FF0000'
		elif row[5]!='':
			col='00FF00'
		elif row[6]!='':
			col='0000FF'
		else:
			col='666600'
		if row[7]=='article' or row[7]=='journal':
			typ='A'
		elif row[7]=='book' or row[7]=='bookitem':
			typ='M'
		else:
			typ='D'
		agent=row[8].rstrip(':').replace(':','/')
		writer.writerow([t,row[3],typ,agent,col])
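
The colour and update-type branches in that script can be pulled out into a small function, which makes them easier to test on their own. This is a Python 3 sketch with a made-up function name (classify) and a made-up sample row; it mirrors the branches above and relies on the optional fifth field of Gource's custom log format, which takes a hex colour.

```python
def classify(row):
    """Return (update_type, colour) for one row, mirroring the branches above."""
    colours = ["FF0000", "00FF00", "0000FF"]
    # The first non-empty column among row[4], row[5], row[6] picks the colour
    colour = next(
        (c for value, c in zip(row[4:7], colours) if value != ""),
        "666600",
    )
    genre = row[7]
    if genre in ("article", "journal"):
        update_type = "A"
    elif genre in ("book", "bookitem"):
        update_type = "M"
    else:
        update_type = "D"
    return update_type, colour

# Hypothetical row: only the second of the three flag columns is filled in
sample = ["2011-04-01", "00:00:04", "", "687369", "", "x", "", "book", "mendeley.com:mendeley"]
print(classify(sample))
```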