Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs

You know those files that are too large for you to work with, or even open? Maybe they’re not….

Picking up on Postcards from a Text Processing Excursion where I started dabbling with Unix command line text processing tools (it sounds scarier than it is… err… maybe?!;-), I thought it would make sense to have a quick play with them in the context of some “real” data files.

The files I’ve picked are intended to be intimidating (maybe?) at first glance because of their size: in this post I’ll look at a set of OpenURL activity data from EDINA (24.6MB download, unpacking to 76MB), and for a future post, I thought it might be interesting to see whether this approach would work with a dump of local council spending data from OpenlyLocal (73.2MB download, unzipping to 1,011.9MB).

To start with, let’s have a quick play with the OpenURL data: you can download it from here: OpenURL activity data (April 2011)

What I intend to do in this post is track my own preliminary exploration of the file using what I learned in the “Postcards” post. I may also need to pick up a few new tricks along the way… One thing I think I want to look for as I start this exploration is an idea of how many referrals are coming in from particular institutions and particular sources…

Let’s start at the beginning though by seeing how many lines/rows there are in the file, which I downloaded as L2_2011-04.csv:

wc -l L2_2011-04.csv

I get the result 289,691; older versions of Excel only supported 65,536 rows per sheet, though more recent versions (Excel 2007 and Excel 2010) can support over a million; Google Apps currently limits sheet sizes to 200,000 cells (max 256 columns), so even if the file were only one column wide, it would still be too big to upload into a single Google spreadsheet. Google Fusion Tables can accept CSV files up to 100MB, so that would work (if we could actually get the file to upload… Spanish accent characters seemed to break things when I tried; the workaround I found was to split the original file, then separately upload and resave the parts using Google Refine, before uploading the files to Google Fusion Tables: upload one to a new table, then import and append the other files into the same table).
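
(For what it’s worth, the splitting itself can also be done from the command line: a minimal sketch using the Unix split command to chop the file into 100,000-line pieces with a part_ filename prefix – bearing in mind that any header row will only end up in the first piece:)

split -l 100000 L2_2011-04.csv part_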

…which is to say: the almost 300,000 rows in the downloaded CSV file are probably too many for many people to know what to do with, unless they know how to drive a database… which is why I thought it might be interesting to see how far we can get with just the Unix command line text processing tools.

To see what the file contains, let’s take a peek at the first few rows (we might also look to the documentation):

head L2_2011-04.csv

Column 40 looks interesting to me: sid (service ID). In the data there’s a reference in there to mendeley, as well as some other providers I recognise (EBSCO, Elsevier and so on), so I think this refers to the source of the referral to the EDINA OpenURL resolver (@ostephens and @lucask suggested they thought so too). @lucask also noted that “OpenUrl from Endnote has ISIWOS as default SID too!”, so we may find that some sources either mask their true origin, hiding low referral numbers (maybe very few people ever go from Endnote to the EDINA OpenURL resolver?), or inflate other numbers (Endnote inflating apparent referrals from ISIWOS).
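
(To check which column is which, one trick – assuming the file is tab-separated, which is what cut assumes by default – is to number the fields of the first row:)

head -n 1 L2_2011-04.csv | tr '\t' '\n' | cat -n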

Rather than attack the rather large original file, let’s start by creating a smaller sample file with a couple of hundred rows that we can use as a test file for our text processing attempts:

head -n 200 L2_2011-04.csv > samp_L2_2011-04.csv

Let’s pull out column 40, sort, and then look for unique entries in the sample file we created:

cut -f 40 samp_L2_2011-04.csv | sort | uniq -c

I get a response that starts:

12
1 CAS:MEDLINE
1 EBSCO:Academic Search Premier
7 EBSCO:Business Source Premier
1 EBSCO:CINAHL
...

so in the sample file there were 12 blank entries, 1 from CAS:MEDLINE, 7 from EBSCO:Business Source Premier, and so on – this appears to work okay. Let’s try it on the big file (it may take a few seconds…) and save the result into a file (uniqueSID.csv):

cut -f 40 L2_2011-04.csv | sort | uniq -c > uniqueSID.csv

The results of the count will be in arbitrary order, so we can add a second sort to the pipeline in order to sort the entries by their counts. The column we want to sort on is column 1 (so we set the sort key, -k, to 1); and because sort sorts into increasing order by default, we reverse the order (-r) to get the most referenced entries at the top (the following is NOT RECOMMENDED… read on to see why…):

cut -f 40 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r > uniqueSID.csv

We can now view the uniqueSID.csv file using the more command (more uniqueSID.csv), or look at the top 5 rows using the head command:

head -n 5 uniqueSID.csv

Here’s what I get as the result (treat this with suspicion…):

9181 OVID:medline
9006 Elsevier:Scopus
7929 EBSCO:CINAHL
74740 http://www.isinet.com:WoK:UA
6720 EBSCO:jlh

If we look through the file, we actually see:

1817 OVID:embase
1720 EBSCO:CINAHL with Full Text
17119 mendeley.com:mendeley
16885 mendeley.com/mendeley:
1529 EBSCO:cmedm
1505 OVID:ovftdb

I was actually alerted to this oops when looking to see how many referrals were from Mendeley, by using grep on the counts file (if grep complains about a “Binary file”, just use the -a switch…):

grep mendeley uniqueSID.csv

17119 mendeley.com:mendeley
16885 mendeley.com/mendeley:

17119 beat the “top count” 9181 from OVID:medline – obviously I’d done something wrong!

Specifically, the sort had sorted by character rather than by numerical value: 17119 and 16885 are numerically greater than 1720, but as strings “171…” and “168…” come before “172…” – the reasoning is the same as why we’d index aardman before aardvark.
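
(You can see the difference with a toy example – the first command sorts as strings, the second numerically:)

printf '9181\n17119\n1720\n' | sort -r
printf '9181\n17119\n1720\n' | sort -r -n

The first puts 9181 at the top and 17119 at the bottom; the second gives the numerical order 17119, 9181, 1720.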

To force sort to sort by numerical value rather than string value, we need to use the -n switch (so now I know!):

cut -f 40 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > uniqueSID.csv

Here’s what we get now:

74740 http://www.isinet.com:WoK:UA
34186
20900 http://www.isinet.com:WoK:WOS
17119 mendeley.com:mendeley
16885 mendeley.com/mendeley:
9181 OVID:medline
9006 Elsevier:Scopus
7929 EBSCO:CINAHL
6720 EBSCO:jlh
...
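
(That unlabelled 34186, by the way, is the count of rows with an empty sid field; if you want to double-check it, something along these lines – again assuming tab-delimited data – should give a similar figure:)

awk -F'\t' '$40==""' L2_2011-04.csv | wc -l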

To compare the referrals from the actual sources (e.g. the aggregated EBSCO sources, rather than EBSCO:CINAHL, EBSCO:jlh and so on), we can split on the “:” character to create two columns from one: the first containing the bit before the ‘:’, the second containing the bit after:

sed s/:/'ctrl-v<TAB>'/ uniqueSID.csv | sort -k 2 > uniquerootSID.csv

(Some versions of sed may let you identify the tab character as \t; I had to explicitly put in a tab by using ctrl-V then tab.)
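
(Alternatively, if you’re using bash, ANSI-C quoting lets you write the tab as \t and have the shell expand it before sed sees it – a sketch; GNU sed will also interpret a \t written directly in the replacement, though the stock BSD sed on a Mac won’t:)

sed $'s/:/\t/' uniqueSID.csv | sort -k 2 > uniquerootSID.csv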

What this does is retain the number of lines, but sort the file so all the EBSCO referrals are next to each other, all the Elsevier referrals are next to each other, and so on.

Via an answer on Stack Overflow, I found this bit of voodoo that would then sum the contributions from the same root referrers:

cat uniquerootSID.csv | awk '{a[$2]+=$1}END{for(i in a ) print i,a[i] }' | sort -k 2 -r -n > uniquerootsumSID.csv

Using data from the file uniquerootSID.csv, the awk command sets up an array (a) that has indices corresponding to the different sources (EBSCO, Elsevier, and so on). It then runs an accumulator that sums the contributions from each unique source. After processing all the rows (END), the routine then loops through all the unique sources in the a array, and emits the source and the total. The sort command then sorts the output by total for each source and puts the list into the file uniquerootsumSID.csv.
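
(As a quick sanity check – not part of the original recipe – you can sum the counts column of uniqueSID.csv and compare it against the wc -l line count from earlier; the two totals should agree:)

awk '{s+=$1} END {print s}' uniqueSID.csv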

Here are the top 15:

http://www.isinet.com 99453
EBSCO 44870
OVID 27545
mendeley.com 17119
mendeley.com/mendeley 16885
Elsevier 9446
CSA 6938
EI 6180
Ovid 4353
wiley.com 3399
jstor 2558
mimas.ac.uk 2553
summon.serialssolutions.com 2175
Dialog 2070
Refworks 1034

If we add the two Mendeley referral counts together, that gives ~34,000 referrals. How much are the referrals from commercial databases costing by comparison, I wonder? Of course, it may be that the distribution of referrals from different institutions is different – some institutions may see all their traffic come through EBSCO, or Ovid, or indeed Mendeley… If nothing else, though, this report suggests that Mendeley is generating a fair amount of EDINA OpenURL traffic…
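
(Rather than adding those two figures up by hand, a one-liner along these lines does the arithmetic – assuming the only rows matching ‘mendeley’ are the two Mendeley ones:)

grep -i mendeley uniquerootsumSID.csv | awk '{s+=$2} END {print s}'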

Let’s use the cut command again to see how much traffic is coming from each unique institution (not that I know how to decode these identifiers…); column 4 is the one we want (remember, we use the uniq command to count the occurrences of each identifier):

cut -f 4 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > uniqueInstID.csv

Here are the top 10 referrer institutions (columns are: no. of referrals, institution ID):

41268 553329
31999 592498
31168 687369
29442 117143
24144 290257
23645 502487
18912 305037
18450 570035
11138 446861
10318 400091

How about column 5, the routerRedirectIdentifier:
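
(Using the same pipeline as before, just swapping in column 5 – something like:)

cut -f 5 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n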

195499 athens
39381 wayf
29904 ukfed
24766 no identifier
140 ip

How about the publication year of requests (column 17):

45867
26400 2010
16284 2009
13425 2011
13134 2008
10731 2007
8922 2006
8088 2005
7288 2004

The number of requests seems to roughly track publication year, falling off for older years?!

How about unique journal title (column 15):

258740
277 Journal of World Business
263 Journal of Financial Economics
263 Annual Review of Nuclear and Particle Science
252 Communications in Computer and Information Science
212 Journal of the Medical Association of Thailand Chotmaihet thangphaet
208 Scandinavian Journal of Medicine & Science in Sports
204 Paleomammalia
194 Astronomy & Astrophysics
193 American Family Physician

How about books (column 29 gives ISBN):

278817
1695 9780470669303
750 9780470102497
151 0761901515
102 9781874400394

And so it goes..
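
(Once the pattern is clear, a little shell loop will crank the handle over whichever columns take your fancy – a sketch that writes each set of counts into its own file:)

for col in 5 17 15 29
do
  cut -f "$col" L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > "col_${col}_counts.csv"
done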

What’s maybe worth remembering is that I haven’t had to use any tools other than command line tools to start exploring this data, notwithstanding the fact that the source file may be too large to open in some everyday applications…

The quick investigation I was able to carry out on the EDINA OpenURL data also built directly on what I’d learned in doing the Postcards post (except for the voodoo awk script to sum similarly headed rows, and the sort switches to reverse the order of the sort and force a numerical rather than string based sort). Also bear in mind that three days ago, I didn’t know how to do any of this…

…but what I do suspect is that it’s the sort of thing that Unix sys admins play around with all the time, e.g. in the context of log file hacking…

PS So what else can we do…? It strikes me that by using the date and timestamp, as well as the institutional ID and referrer ID, we can probably identify searches that are taking place: a) within a particular session, or b) maybe by the same person over several days (e.g. someone coming in from the same place within a short window of time (1–2 hours), or at around the same time on the same day of the week, from the same IDs, searching around a similar topic).
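
(A very rough sketch of the sort of thing I mean – note that columns 1 and 2 are placeholders for wherever the date and timestamp actually live (check the documentation), and 553329 is one of the institution IDs from the table above:)

grep -w 553329 L2_2011-04.csv | cut -f 1,2,40 | sort | head -n 20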

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

17 thoughts on “Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs”

  1. Re: “what I do suspect is that it’s the sort of thing that Unix sys admins play around with all the time”.

    Yes.

    For your next task, learn Perl :-)

    1. I remember one OU residential school in Bath where I bought the Perl book and tried working through the first few pages; the only thing I can remember is that there’s always more than one way to do it… ;-)

      I think there is an issue, though, with the practical skills of being able to work with data in text formats, particularly with recent interest in opening up data. If the datasets are large, and provided as CSV, for example, then they often aren’t accessible to mortals?

      It strikes me that a lot of infoskills are actually concentrated within different disciplines, and that a few bridges that allow skills to transfer between disciplines might be in order?

    2. Sorry, I was being somewhat flippant. What I like about this post is it shows the power of the Unix command line and the ‘small tools, loosely joined’ mentality that comes with it. That said, as soon as you are doing lots of stuff with chains of grep, sed, awk, sort, uniq, etc., my experience says that perl is often a useful next step. This is the kind of thing that Perl was designed for.

      1. @andy I have no probs with being flippant (I was soliciting it after all!); one point I was trying to get across was that there exist very powerful tools that have been around forever, but that are maybe unknown to folk who suddenly find themselves coming across large text files and not knowing what to do with them?

        Another issue is that a lot of the tools, and the command line way of doing things, haven’t necessarily evolved much in terms of usability (maybe they don’t need to?)

        The point about perl is a good one, but I fear that maybe a step too far?! (I think that the next thing on my list is getting to grips with something like SQLite?)

        re: the routerRedirectIdentifier – all I know is it tends to come out as athens or ukfed?

  2. One other thing… I’m interested in the results around routerRedirectIdentifier (column 5). Do you know what this field is? Or, perhaps more likely, do you know someone I could ask!? TIA

    1. @mark Interesting, thanks. I wonder if anyone else will produce comparative methods for other languages/toolsets?!;-)

      @laurence – thanks for the link in to that additional set of documentation:-)
