Another Blooming Look at Gource and the Edina OpenURL Data

Having done a first demo of how to use Gource to visualise activity around the EDINA OpenURL data (Visualising OpenURL Referrals Using Gource), I thought I’d try something a little more artistic, and use the colour features to try to pull out a bit more detail from the data [video].

What this one shows is how the mendeley referrals glow brightly green, which – if I’ve got my code right – suggests a lot of e-issn lookups are going on (the red nodes correspond to an issn lookup, blue to an isbn lookup and yellow/orange to an unknown lookup). The regularity of activity around particular nodes also shows how a lot of the activity is actually driven by a few dominant services, at least during the time period I sampled to generate this video.

So how was this visualisation created?

Firstly, I pulled out a few more data columns, specifically the issn, eissn, isbn and genre data. I then opted to set node colour according to whether the issn (red), eissn (green) or isbn (blue) columns were populated, using a default reasoning approach (if all three were blank, I coloured the node yellow). I then experimented with colouring the actors (I think?) according to whether the genre was article-like, book-like or unknown (mapping these on to add, modify or delete actions), before dropping the size of the actors altogether in favour of just highlighting referrers and asset type (i.e. issn, e-issn, book or unknown). The cut command to pull the required columns out of the original log file is:

cut -f 1,2,3,4,27,28,29,32,40 L2_2011-04.csv > openurlgource.csv

When running the Python script, I got a “NULL Byte” error that stopped the script working (something had obviously snuck in via one of the newly added columns), so I googled around and turned up a little command line cleanup routine for the cut data file:

tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv

Here’s the new Python script, too, which shows the handling of the colour fields:

import csv
from time import *

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

f=open('openurlgourcenonulls.csv', 'rb')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt','wb'),delimiter='|')
headerline = reader.next()

# Column order after the cut (as I read it): 0 date, 1 time, 2 (unused here),
# 3 institution ID, 4 issn, 5 eissn, 6 isbn, 7 genre, 8 sid (the referring service)
for row in reader:
	if row[8].strip() !='':
		t=int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
		# Node colour by asset type: issn -> red, eissn -> green, isbn -> blue,
		# none of them -> yellow
		if row[4]!='':
			c='FF0000'
		elif row[5]!='':
			c='00FF00'
		elif row[6]!='':
			c='0000FF'
		else:
			c='FFFF00'
		# Genre maps onto a Gource action: article-like -> add, book-like -> modify,
		# anything else -> delete
		if row[7]=='article' or row[7]=='journal':
			a='A'
		elif row[7]=='book' or row[7]=='bookitem':
			a='M'
		else:
			a='D'
		# Gource custom log line: timestamp|user|action|node|colour
		# (my reading: institution ID as the "user", sid as the node)
		writer.writerow([t,row[3],a,row[8],c])
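
Each line written to openurlgource.txt is in Gource’s custom log format (timestamp|user|action|node|optional hex colour), so, with the institution ID as the “user” and the sid as the node, the output rows should look something like this made-up example:

1301907600|553329|A|EBSCO:Business Source Premier|FF0000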

The new gource command is:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 openurlgource.txt

and the command to generate the video:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

If you’ve been tempted to try Gource out yourself on some of your own data, please post a link in the comments below:-) (I wonder just how many different sorts of data we can force into the shape that Gource expects?!;-)

Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs

You know those files that are too large for you to work with, or even open? Maybe they’re not….

Picking up on Postcards from a Text Processing Excursion where I started dabbling with Unix command line text processing tools (it sounds scarier than it is… err… maybe?!;-), I thought it would make sense to have a quick play with them in the context of some “real” data files.

The files I’ve picked are intended to be intimidating (maybe?) at first glance because of their size: in this post I’ll look at a set of OpenURL activity data from Edina (24.6MB download, unpacking to 76MB), and for a future post, I thought it might be interesting to see whether this approach would work with a dump of local council spending data from OpenlyLocal (73.2MB download, unzipping to 1,011.9MB).

To start with, let’s have a quick play with the OpenURL data: you can download it from here: OpenURL activity data (April 2011)

What I intend to do in this post is track my own preliminary exploration of the file using what I learned in the “Postcards” post. I may also need to pick up a few new tricks along the way… One thing I think I want to look for as I start this exploration is an idea of how many referrals are coming in from particular institutions and particular sources…

Let’s start at the beginning though by seeing how many lines/rows there are in the file, which I downloaded as L2_2011-04.csv:

wc -l L2_2011-04.csv

I get the result 289,691; older versions of Excel only supported 65,536 rows per sheet, though I believe more recent versions (Excel 2007 and Excel 2010) can support over a million; Google Apps currently limits sheet sizes to 200,000 cells (max 256 columns), so even if the file was only one column wide, it would still be too big to upload into a single Google spreadsheet. Google Fusion Tables can accept CSV files up to 100MB, so that would work, if we could actually get the file to upload: Spanish accented characters seemed to break things when I tried. The workaround I found was to split the original file, separately upload and resave the parts using Google Refine, and then load the files into Google Fusion Tables (upload one part to a new table, then import and append the other parts into the same table).
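
For what it’s worth, the splitting step can be done at the command line too; something along these lines should do it (the 100,000 line chunk size is an arbitrary choice, and I’m keeping a copy of the header row to paste back onto each part):

head -n 1 L2_2011-04.csv > header.csv
tail -n +2 L2_2011-04.csv | split -l 100000 - part_
for f in part_*; do cat header.csv "$f" > "withheader_$f.csv"; done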

…which is to say: the almost 300,000 rows in the downloaded CSV file are probably too many for many people to know what to do with, unless they know how to drive a database… which is why I thought it might be interesting to see how far we can get with just the unix command line text processing tools.

To see what’s in the file, let’s take a peek at the first few rows (we might also look to the documentation):

head L2_2011-04.csv

Column 40 looks interesting to me: sid (service ID); in the data, there’s a reference in there to mendeley, as well as some other providers I recognise (EBSCO, Elsevier and so on), so I think this refers to the source of the referral to the EDINA openurl resolver (@ostephens and @lucask suggested they thought so too). @lucask also noted that “OpenUrl from Endnote has ISIWOS as default SID too!”, so we may find that some sources mask their true origin, hiding low referral numbers (maybe very few people ever go from Endnote to the EDINA openurl resolver?), or inflate other sources’ numbers (Endnote inflating the apparent referrals from ISIWOS).

Rather than attack the rather large original file, let’s start by creating a smaller sample file with a couple of hundred rows that we can use as a test file for our text processing attempts:

head -n 200 L2_2011-04.csv > samp_L2_2011-04.csv

Let’s pull out column 40, sort, and then look for unique entries in the sample file we created:

cut -f 40 samp_L2_2011-04.csv | sort | uniq -c

I get a response that includes lines like:

1 EBSCO:Academic Search Premier
7 EBSCO:Business Source Premier

so in the sample file there were 12 blank entries, 1 from CAS:MEDLINE, 1 from EBSCO:Academic Search Premier, 7 from EBSCO:Business Source Premier and so on, so this appears to work okay. Let’s try it on the big file (it may take a few seconds…) and save the result into a file (uniqueSID.csv):

cut -f 40 L2_2011-04.csv | sort | uniq -c > uniqueSID.csv

The results of the count will be in arbitrary order, so it’s possible to add a sort into the pipeline in order to sort the entries according to the number of occurrences. The column we want to sort on is column 1 (so we set the sort -k key to 1); and because sort sorts into increasing order by default, we can reverse the order (-r) to get the most referenced entries at the top (the following is NOT RECOMMENDED… read on to see why…):

cut -f 40 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r > uniqueSID.csv

We can now view the uniqueSID.csv file using the more command (more uniqueSID.csv), or look at the top 5 rows using the head command:

head -n 5 uniqueSID.csv

Here’s what I get as the result (treat this with suspicion…):

9181 OVID:medline
9006 Elsevier:Scopus
6720 EBSCO:jlh

If we look through the file, we actually see:

1817 OVID:embase
1720 EBSCO:CINAHL with Full Text
1529 EBSCO:cmedm
1505 OVID:ovftdb

I actually was alerted to this oops when looking to see how many referrals were from mendeley, by using grep on the counts file (if grep complains about a “Binary file”, just use the -a switch…):

grep mendeley uniqueSID.csv


The counts that came back for the mendeley-related sids, 17119 and 16885, beat the “top count” of 9181 from OVID:medline – obviously I’d done something wrong!

Specifically, the sort had sorted by character, not by numerical value: 17119 and 16885 are numerically greater than 1720, but “171…” and “168…” are less, in string sorting terms, than “172…” (the reasoning is the same as why we’d index aardman before aardvark).
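
You can see the effect in miniature by reverse string-sorting a few of the numbers in question:

printf '9181\n17119\n1720\n16885\n' | sort -r

which returns 9181, 1720, 17119, 16885: the largest number ends up third from the top.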

To force sort to sort using numerical values, rather than string values, we need to use the -n switch (so now I know!):

cut -f 40 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > uniqueSID.csv

Here’s what we get now:

9181 OVID:medline
9006 Elsevier:Scopus
6720 EBSCO:jlh

To compare the referrals from the actual sources (e.g. the aggregated EBSCO sources, rather than EBSCO:CINAHL, EBSCO:jlh and so on), we can split on the “:” character to create two columns from one: the first containing the bit before the ‘:’, the second containing the bit after:

sed s/:/'ctrl-v<TAB>'/ uniqueSID.csv | sort -k 2 > uniquerootSID.csv

(Some versions of sed may let you identify the tab character as \t; I had to explicitly put in a tab by using ctrl-V then tab.)

What this does is leave the counts on each line untouched, but sort the file so that all the EBSCO referrals are next to each other, all the Elsevier referrals are next to each other, and so on.

Via an answer on Stack Overflow, I found this bit of voodoo that would then sum the contributions from the same root referrers:

cat uniquerootSID.csv | awk '{a[$2]+=$1}END{for(i in a ) print i,a[i] }' | sort -k 2 -r -n > uniquerootsumSID.csv

Using data from the file uniquerootSID.csv, the awk command sets up an array (a) that has indices corresponding to the different sources (EBSCO, Elsevier, and so on). It then runs an accumulator that sums the contributions from each unique source. After processing all the rows (END), the routine then loops through all the unique sources in the a array, and emits the source and the total. The sort command then sorts the output by total for each source and puts the list into the file uniquerootsumSID.csv.
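
Laid out over a few lines, with comments, the same awk program looks like this:

awk '
  { a[$2] += $1 }                      # accumulate the count (field 1) against the root source (field 2)
  END { for (i in a) print i, a[i] }   # after the last row, print each source and its total
' uniquerootSID.csv | sort -k 2 -r -n > uniquerootsumSID.csv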

Here are the top 15 root sources and their total referral counts (the 17119 and 16885 rows are the two mendeley-related sids):

99453
EBSCO 44870
OVID 27545
17119
16885
Elsevier 9446
CSA 6938
EI 6180
Ovid 4353
3399
jstor 2558
2553
2175
Dialog 2070
Refworks 1034

If we add the two Mendeley referral counts, that gives around 34,000 referrals (17119 + 16885 = 34,004). How much are the referrals from commercial databases costing, I wonder, by comparison? Of course, it may be that the distribution of referrals from different institutions is different. Some institutions may see all their traffic through EBSCO, or Ovid, or indeed Mendeley… If nothing else though, this report suggests that Mendeley is generating a fair amount of EDINA openurl traffic…

Let’s use the cut command again to see how much traffic is coming from each unique institution (not that I know how to decode these identifiers…); column 4 is the one we want (remember, we use the uniq command to count the occurrences of each identifier):

cut -f 4 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > uniqueInstID.csv

Here are the top 10 referrer institutions (columns are: no. of referrals, institution ID):

41268 553329
31999 592498
31168 687369
29442 117143
24144 290257
23645 502487
18912 305037
18450 570035
11138 446861
10318 400091

How about column 5, the routerRedirectIdentifier?
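
The same pipeline does the job, just swapping in the column number; something along these lines (the publication year, journal title and ISBN counts below come out of the same pattern):

cut -f 5 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n | head -n 5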

195499 athens
39381 wayf
29904 ukfed
24766 no identifier
140 ip

How about the publication year of requests (column 17):

26400 2010
16284 2009
13425 2011
13134 2008
10731 2007
8922 2006
8088 2005
7288 2004

The number of requests seems to tail off roughly with the age of the publication (2011 presumably being lower because the year is far from over)?!

How about unique journal title (column 15):

277 Journal of World Business
263 Journal of Financial Economics
263 Annual Review of Nuclear and Particle Science
252 Communications in Computer and Information Science
212 Journal of the Medical Association of Thailand Chotmaihet thangphaet
208 Scandinavian Journal of Medicine & Science in Sports
204 Paleomammalia
194 Astronomy & Astrophysics
193 American Family Physician

How about books (column 29 gives ISBN):

1695 9780470669303
750 9780470102497
151 0761901515
102 9781874400394

And so it goes..

What’s maybe worth remembering is that I haven’t had to use any tools other than command line tools to start exploring this data, notwithstanding the fact that the source file may be too large to open in some everyday applications…

The quick investigation I was able to carry out on the EDINA openurl data also built directly on what I’d learned in doing the Postcards post (except for the voodoo awk script to sum similarly headed rows, and the sort switches to reverse the order of the sort, and force a numerical rather than string based sort). Also bear in mind that three days ago, I didn’t know how to do any of this…

…but what I do suspect is that it’s the sort of thing that Unix sys admins play around with all the time, e.g. in the context of log file hacking…

PS so what else can we do…? It strikes me that by using the date and time stamps, as well as the institutional ID and referrer ID, we can probably identify searches that are taking place: a) within a particular session; b) maybe by the same person over several days (e.g. someone coming in from the same place within a short window of time (1-2 hours), or at around the same time on the same day of the week, from the same IDs and searching around a similar topic).
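
Just to sketch how the within-session grouping might work (GNU awk assumed, and the one hour gap that marks a session boundary is an arbitrary choice):

tail -n +2 L2_2011-04.csv | cut -f 1,2,4 | sort -t$'\t' -k3,3 -k1,1 -k2,2 | awk -F'\t' '
  {
    ts = $1 " " $2
    gsub(/[-:]/, " ", ts)                 # "2011-04-01 12:34:56" -> "2011 04 01 12 34 56"
    t = mktime(ts)                        # seconds since the epoch
    if ($3 != inst || t - prev > 3600)    # new institution, or more than an hour since the last request
      session++
    print session, $3, $1, $2
    inst = $3; prev = t
  }' > sessions.txt

cut -d" " -f 1 sessions.txt | uniq -c | sort -k 1 -r -n | head

The first command sorts the requests by institution and time, and starts a new “session” number whenever the institution changes or there’s a gap of more than an hour; the second then counts how many requests the busiest sessions contain.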