OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Library’ Category

Appropriate IT – My ILI2011 Presentation

Here’s a copy of the slides from my ILI2011 presentation on Appropriate IT:

One thing I wanted to explore was, if discovery happens elsewhere, and the role of the librarian is no longer focussed on discovery related issues, where can library folk help out? Here’s where I think we need to start placing some attention: sensemaking, and knowing what’s possible (aka helping redistribute the future that is already around us;-) Allied with this is the idea that we need to make more out of using appropriate IT for particular tasks, as well as appropriating IT where we can to make our lives easier.

In part, sensemaking is turning the wealth of relevant data out there into something meaningful for the question or issue at hand, or the choice we have to make. My own dabblings with social network analysis are one approach I’m working on to help me make sense of interest networks and social positioning within those networks, so I can get a feel for how those communities are structured and who the major actors are within them.

As far as knowing what’s possible, I think we have a real issue with “folk IT” knowledge. Most of us have a reasonable grasp of folk physics and folk psychology. That is, we have a reasonable common-sense model of how the world works at the human scale (let go of an apple, it falls to the floor), and we can generally read other people from their behaviour; but how well developed is “folk IT” knowledge? Given that the idea of using ctrl-F as a keyboard shortcut to a “search within page/document” feature in a wide variety of electronic documents is alien to most people, I think our folk understanding of IT is limited to the principle of “if you switch it off and on again it should start working again”.

Folk IT is also tied up with computational thinking, but at a practical, “human scale”. So here are a few ideas I think librarians need to start pushing:

- the idea of a graph; it’s what the web’s based around, after all, and it also helps us understand social networks. If you think of your website as a graph, with edges representing links that connect nodes/pages together, and realise that your on-site homepage is whatever page someone lands on from a search engine or third party link, you soon start to realise that maybe your website is not as usefully structured as you thought…
- some sort of common sense understanding of the role that URLs/URIs play in the browser, along with the idea that URIs are readable and hackable and also may say something about the way a website, or the resources it makes available, is organised;
- the notion of “View Source”, that allows you to copy and crib the work of others when constructing your own applications, along with the very idea that you might be able to build web pages yourself out of free standing components;
- the idea of document types and applications that can work all sorts of magic given documents of that type; the knowledge that an MP3 file works well with an audio player or audio editor, for example, or that a PNG or JPG encodes an image, along with more esoteric formats such as KML (paste a URL to a KML file into the search box of a Google Maps search and see what happens, for example…). Knowledge of the filetype/document type gives you some sort of power over it, and helps you realise what sorts of thing you can do with it… (except for things like PDF, for example, which is to all intents and purposes a “can’t do anything with it” filetype;-)

I also think an understanding of pattern based string matching and what regular expressions allow you to do would go a long way towards helping folk who ever have to manipulate text or text-based data files, at least in terms of letting them know that there are often better ways of cleaning up a text file automagically than having to repeat the same operation over and over again on each separate row of a file containing several thousand lines… They don’t need to know how to write the regular expression from the off, just that the sorts of operation regular expressions support are possible, and that someone will probably be able to show them how to do it…
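To make that concrete, here’s a minimal sketch in Python (the file names and the “Surname, Forename” pattern are invented purely for illustration): a single regular expression substitution applies the same fix to every line of a file in one pass.

import re

# Invented example: rewrite "Surname, Forename" author lines as "Forename Surname",
# applying the same edit to every line of the file in one go.
with open("authors.txt") as f:
    lines = f.read().splitlines()

cleaned = [re.sub(r"^(\w+),\s*(\w+)$", r"\2 \1", line) for line in lines]

with open("authors_clean.txt", "w") as f:
    f.write("\n".join(cleaned))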

Written by Tony Hirst

October 31, 2011 at 3:50 pm

Posted in Infoskills, Library, Presentation


Search Engine Powered Courses…

How can we use customised search engines to support uncourses, or the course models used to support MOOC style offerings?

To set the scene, here’s what Stephen Downes wrote recently on the topic of How to participate in a MOOC:

You will notice quickly that there is far too much information being posted in the course for any one person to consume. We tried to start slowly with just a few resources, but it quickly turns into a deluge.

You will be provided with summaries and links to dozens, maybe hundreds, maybe even thousands of web posts, articles from journals and magazines, videos and lectures, audio recordings, live online sessions, discussion groups, and more. Very quickly, you may feel overwhelmed.

Don’t let it intimidate you. Think of it as being like a grocery store or marketplace. Nobody is expected to sample and try everything. Rather, the purpose is to provide a wide selection to allow you to pick and choose what’s of interest to you.

This is an important part of the connectivist model being used in this course. The idea is that there is no one central curriculum that every person follows. The learning takes place through the interaction with resources and course participants, not through memorizing content. By selecting your own materials, you create your own unique perspective on the subject matter.

It is the interaction between these unique perspectives that makes a connectivist course interesting. Each person brings something new to the conversation. So you learn by interacting rather than by merely consuming.

When I put together the OU course T151, the original vision revolved around a couple of principles:

1) the course would be built in part around materials produced in public as part of the Digital Worlds uncourse;

2) each week’s offering would follow a similar model: one or two topic explorations, plus an activity and forum discussion time.

In addition, the topic explorations would have a standard format:

- scene setting, and maybe a teaser question with answer reveal or call to action in the forums;
- a set of topic exploration questions to frame the topic exploration;
- a set of resources related to the topic at hand, organised by type: academic readings (via a libezproxy link for subscription content, so no downstream logins are required to access the content), Digital Worlds resources, weblinks (industry or well informed blogs, news sites etc), audio and video resources;
- a reflective essay by the instructor exploring some of the themes raised in the questions and referring to some of the resources.

The aim of the reflective essay was to model the sort of exploration or investigation the student might engage in.

(I’d probably just have a mixed bag of resources listed now, along with a faceting option to focus in on readings, videos, etc.)

The idea behind designing the course in this way was that it would be componentised as much as possible, to allow flexibility in swapping resources or even topics in and out, as well as (though we never managed this), allowing the freedom to study the topics in an arbitrary order. Note: I realised today that to make the materials more easily maintainable, a set of ‘Recent links’ might be identified that weren’t referred to in the ‘My Reflections’ response. That is, they could be completely free standing, and would have no side effects if replaced.

As far as the provision of linked resources went, the original model was that the links should be fed into the course materials from an instructor maintained bookmark collection (for an early take on this, see Managing Bookmarks, with a proof of concept demo at CourseLinks Demo – though everything except the dynamic link injection appears to have rotted:-().

The design of the questions/resources page was intended to have the scoping questions at the top of the page, and then the suggested resources presented in a style reminiscent of a search engine results listing, the idea being that we would present the students with too many resources for them to comfortably read in the allocated time, so that they would have to explore the resources from their own perspective (eg given their current level of understanding/knowledge, their personal interests, and so on). In one of my more radical moments, I suggested that the resources would actually be pulled in from a curated/custom search engine ‘live’, according to search terms specially selected around the current topic and framing questions, but I was overruled on that. However, the course does have a Google custom search engine associated with it which searches over materials that are linked to from the course.

So that’s the context…

Where I’m at now is pondering how we can use an enhanced custom search engine as a delivery platform for a resource based uncourse. So here’s my first thought: using a Google Custom Search Engine populated with curated resources in a particular area, can we use Google CSE Promotions to help scaffold a topic exploration?

Here’s my first promotions file:

<Promotions>
   <Promotion id="t151_1a" 
        queries="topic 1a, Topic 1A, topic exploration 1a, topic exploration 1A, topic 1A, what is a game, game definition" 
        title="T151 Topic Exploration 1A - So what is a game?" 
        url="http://digitalworlds.wordpress.com/2008/03/05/so-what-is-a-game/"
        description="The aim of this topic is to think about what makes a game a game. Spend a minute or two to come up with your own definition. If you're stuck, read through the Digital Worlds post 'So what is a game?'"
        image_url="http://kmi.open.ac.uk/images/ou-logo.gif" />
</Promotions>

It’s running on the Digital Worlds Search Engine, so if you want to try it out, try entering the search phrase what is a game or game definition.

T151 CSE promotion - game definition

(This example suggests to me that it would also make sense to use result boosting to boost the key readings/suggested resources I proposed in the topic materials so that they appear nearer the top of the results (that’ll be the focus of a future post;-))

The promotion displays at the top of the results listing if the specified queries match the search terms the user enters. My initial feeling is that to bootstrap the process, we need to handle:

- queries that allow a user to call on a starting point for a topic exploration by specifically identifying that topic;
- “naive queries”: one reason for using the resource-search model is to try to help students develop effective information skills relating to search. Promotions (and result boosting) allow us to pick up on anticipated naive queries (or popular queries identified from search logs), and suggest a starting point for a sensible way in to the topic. Alternatively, they could be used to offer suggestions for improved or refined searches, or search strategy hints. (I’m reminded of Dave Pattern’s work with guided searches/keyword refinements in the University of Huddersfield Library catalogue in this context).

Here’s another example using the same promotion, but on a different search term:

T151 CSE - topic 1a

Of course, we could also start to turn the search engine into something like an adventure game engine. So for example, if we type: start or about, we might get something like:

T151 CSE - start

(The link I associated with start should really point to the course introduction page in the VLE…)

We can also use the search context to provide pastoral or study skills support:

T151 CSE - pastoral

These sort of promotions/enhancements might be produced centrally and rolled out across course search engines, leaving the course and discipline related customisations to the course team and associated subject librarians.

Just a final note: ignoring resource limitations on Google CSEs for a moment, we might imagine the following scenarios for their roll out:

1) course wide: bespoke CSEs are commissioned for each course, although they may be supplemented by generic enhancements (eg relating to study skills);

2) qualification based: the CSE is defined at the qualification level, and students call on particular course enhancements by prefacing the search with the course code; it might be that students also see a personalised view of the qualification CSE that is tuned to their current year of study.

3) university wide: the CSE is defined at the university level, and students call on particular course or qualification level enhancements by prefacing the search with the course or qualification code.

Written by Tony Hirst

September 15, 2011 at 2:03 pm

Open Data Processes: the Open Metadata Laundry

Another quick note from yesterday’s mini-mash at Cambridge, hosted by Ed Chamberlain, and with participation from consultant Owen Stephens, Lincoln’s Paul Stainthorp and his decentralised developers, and Sussex’s Chris Keene. This idea came from the Lincoln Jerome project (I’m not sure if this has been blogged on the Jerome project blog?), and provides a way of scrubbing MARC based records to free the metadata up from license restrictions.

The recipe goes along the lines of reconciling the record for each item with openly licensed equivalents, and creating a new record for each item where data fields are populated with content that is known to be openly licensed. In part, this relies on having a common identifier. One approach that was discussed was generating hashes based on titles with punctuation removed. This feels a bit arbitrary to me…? I’d probably reduce all the letters to the same case at the very least in an attempt to normalise the things we might be trying to hash?
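By way of illustration, here’s a minimal sketch of the sort of normalisation I have in mind (the choice of MD5 and the exact cleaning steps are my own guesses, not necessarily what Jerome does):

import hashlib
import re

def normalised_hash(title):
    # Lowercase the title, strip punctuation and collapse whitespace before
    # hashing, so trivially different renderings of the same title match.
    key = title.lower()
    key = re.sub(r"[^\w\s]", "", key)
    key = re.sub(r"\s+", " ", key).strip()
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# These two should produce the same key:
print(normalised_hash("The Hobbit, or There and Back Again"))
print(normalised_hash("The hobbit - or, there and back again"))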

I wonder if Ed’s mapping of metadata ownership might also have a role to play in developing a robust laundry service? (e.g. “Ownership” of MARC-21 records and Where exactly DOES a record come from?).

We also discussed recipes where different libraries, each with their own MARC records for a work, might be compared field by field to identify differences between the ways similar items might be catalogued differently. As well as identifying records that maybe contain errors, this approach might also enhance discovery, for example through widening a set of keywords or classification indices.

One of the issues we keep returning to is why it might be interesting to release lots of open data in a given context. Being able to pivot from a resource in one context to a resource in another context is a general/weak way of answering this question, but here are a couple of more specific issues that came up in conversation:

1) having unique identifiers is key, and becomes useful when people use the same identifier, or same-as’d identifiers, to refer to the same thing;

2) we need tool support to encourage people creating metadata to start linking in to recognised/shared identifier spaces. I wonder if there might be value in institutions starting to publish reconciliation services that can be addressed from tools like Google Refine. (For example, How to use OpenCorporates to match companies in Google Refine or Google Refine Reconciliation Service API). Note that it might make sense for reconciliation services to employ various string similarity heuristics as part of the service (see the sketch after this list).

3) we still don’t have enough compelling use cases demonstrating the benefits of linked IDs, or tools that show why they’re powerful. (I think of linked identifier spaces that are rich enough to offer benefits as if they were (super)saturated solutions, where it’s easy to crystallise out interesting things…) One example I like is how OpenCorporates use reconciliation to allow you to map company names in local council accounts to specific corporate entities. In time, one can imagine mapping company directors and local council councillors onto person entities and then starting to map these councillor-corporate-contract networks out…;-)
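On the string similarity point mentioned above, even the ratio matcher in the Python standard library gets you a surprisingly long way as a first pass; here’s a toy sketch (the names and threshold are invented):

from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.85):
    # Return the candidate most similar to name, or None if nothing
    # scores above the threshold.
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

companies = ["Acme Holdings Ltd", "Acme Services Limited", "Bloggs & Co"]
print(best_match("ACME HOLDINGS LTD.", companies))  # -> Acme Holdings Ltd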

Finally, something Owen mentioned that resonates with some of my thinking on List Intelligence: Superduping/Work Superclusters, in which we take an ISBN, look at its equivalents using ThingISBN or xISBN, and then for each of those alternatives, look at their ThingISBN/xISBN alternatives, until we reach a limit set. (cf my approaches for looking at lists a Twitter UserID is included on, looking at the other members of the same lists, then finding the other lists they are mentioned on, etc. Note in the case of Twitter lists, this doesn’t necessarily hit a limit without the use of thresholding!)
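The limit set construction is easy enough to sketch – keep expanding the frontier of newly seen ISBNs until nothing new turns up. Here’s a toy version in Python, with the actual ThingISBN/xISBN web service call abstracted away behind an alternatives function (the lookup table below is fake, for illustration only):

def supercluster(seed_isbn, alternatives):
    # Grow the "work supercluster" around seed_isbn by repeatedly looking
    # up alternate ISBNs until no new ones appear (the fixed point).
    seen, frontier = {seed_isbn}, {seed_isbn}
    while frontier:
        new = set()
        for isbn in frontier:
            new.update(alternatives(isbn))
        frontier = new - seen  # only ISBNs we haven't already expanded
        seen |= frontier
    return seen

# Fake lookup table standing in for a real ThingISBN/xISBN call:
table = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(supercluster("A", lambda i: table.get(i, set())))  # -> {'A', 'B', 'C'}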

Written by Tony Hirst

August 9, 2011 at 12:19 pm

Integrating Course Related Search and Bookmarking?

Not surprisingly, I’m way behind on the two eSTEeM projects I put proposals in for – my creative juices don’t seem to have been flowing in those areas for a bit:-( – but as a marking avoidance strategy I thought I’d jot down some thoughts that have been coming to mind about how the custom search project at least might develop (eSTEeM Project: Custom Course Search Engines).

The original idea was to provide a custom search engine that indexes pages and domains that are referenced within a course in order to provide a custom search engine for that course. The OU course T151 is structured as a series of topic explorations using the structure:

- topic overview
- framing questions
- suggested resources
- my reflections on the topic, guided by the questions, drawing on the suggested resources and a critique of them

One original idea for the course was that rather than give an explicit list of suggested resources, we would provide a set of links pulled in live from a predefined search query. The list would look as if it had been suggested by the course team but it would actually be created dynamically. As instructors, we wouldn’t be specifying particular readings; instead we would be trusting the search algorithm to return relevant resources. (You might argue this is a neglectful approach… a more realistic model might be to have specifically recommended items as well as a dynamically created list of “Possibly related resources”.)

At this point it’s maybe worth stepping back a moment to consider what goes into producing a set of search results. Essentially, there are three key elements:

- the index: the set of content that the search engine has “searched” and from which it can return a set of results;
- the search query: this is run against the index to identify a set of candidate search results;
- a presentation algorithm that determines how to order the search results as presented to the user.

If the search engine and the presentation algorithm are fixed, then for a given set of search terms, and a given index, we can specify a search term and get a known set of results back. So in this case, we could use a fixed custom search engine, with known search terms, and return a known list of suggested readings. The search engine would provide some sort of “ground truth” – same answer for the same query, always.

If we trust the sources and the presentation algorithm, and we trust that we have written an effective search query, then if the index is not fixed, or if a personalised ranking algorithm (that we trust) is used as part of the search engine, we would potentially be returning search results that the instructor has not seen before. For example, the resources may be more recent than the last time the instructor searched for resources to recommend, or they better fit the personalisation criteria for the user under the ranking algorithm used as part of the presentation algorithm.

In this case, the instructor is not saying: “I want you to read this particular resource”. They are saying something more along the lines of: “these are potentially the sorts of resource I might suggest you look at in order to study this topic”. (Lots of caveats in there… If you believe in content led instruction, with students referring to specifically referenced resources, I imagine you would totally rail against this approach!)

At times, we might want to explicitly recommend one or two particular resources, but also open up some other recommendations to “the algorithm”. It struck me that it might be possible to do this within the context of a Google Custom Search approach using “special results” (e.g. Google CSEs: Creating Special Results/Promotions).

For example, Google CSEs support:

- promotions: “A promotion is simply an association between a pre-defined set of query terms and a link to a webpage. When a user types a search that exactly matches one of your query terms, the promotion appears at the top of the page.” So by using a specific search term, we can force the return of a specific result as the top result. In the context of a topic exploration, we could thus prepopulate the search form of an embedded search engine with a known search phrase, and use a promotion to force a “recommend reading” link to the top of the results listing.

Promotion links are stored in a separate config file and have the form:

<Promotions>
  <Promotion id="1"
    queries="wanderer, the wanderer" 
    title="Groo the Wanderer" 
    url="http://www.groo.com/"
    description="Comedy. American series illustrated by Sergio Aragonés."
    image_url="http://www.newsfromme.com/images5/groo11.jpg" />
</Promotions>

- subscribed links: subscribed links allow you to return results in a specific format (such as text, or text and a link, or other structured results) based on a perfect match with a specific search term. In a sense, subscribed links represent a generalised version of promotions. Subscribed links are also available to users outside the context of a CSE. If a user subscribes to a particular subscribed link file, then if there is an exact match between one of the search phrases in the subscribed link file and a search phrase used by the subscribing user on Google web search (i.e. on google.com or google.co.uk), the subscribed link will be returned in the results listing.

In the simplest case, subscribed links can be defined at the individual link level:

Google subscribed link definition

If your search term is an exact match for the term in the subscribed link definition, it will appear in the main search results page:

Google subscribed links

It’s also possible to define subscribed link definition files, either as simple tab separated docs or RSS/Atom feeds, or using a more formal XML document structure. One advantage of creating subscribed links files for use within a custom search engine is that users (i.e. students) can subscribe to them as a way of augmenting or enhancing their own Google search results. This has the joint effect of increasing the surface area of the course, so that course related recommendations can be pushed to the student for relevant queries made through the Google search engine, as well as providing a legacy offering: students can potentially take away a subscription when they finish the course and continue to receive “academically credible” results on relevant search topics. By issuing subscription links on a per course presentation basis (or even on a personalised, unique feed per student basis), feeds to course alumni might be customised, for example by removing links to subscription content (or suggesting how such content might be obtained through a subscription to the university library), or by occasionally adding in advertising related links (so if a student searches using a “course” keyword, recommendations can be made around that via the subscribed links feed; in the limit, this could even take on the form of a personalised, subscription based advertising channel).

Another way in which “recommended” links can be boosted in a custom search result listing is through boosting search results via their ranking factors (Changing the Ranking of Your Search Results).

In the case of both subscribed links and boosted search results, it’s possible to create a configuration file dynamically. Where students are bookmarking search results relating to a course, it would therefore be possible to feed these into a course related custom search engine definition file, or a subscribed link file. If subscribed link files are maintained at a personal level, it would also be possible to integrate a student’s bookmarked links in to their subscribed links feed, at least for use on Google websearch (probably not in the custom search engine context?). This would support rediscovery of content bookmarked by the student through subscribed link recommendations.
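Since a promotions file is just XML (see the example above), generating one from a set of bookmarks is straightforward; here’s a minimal sketch in Python, with a bookmark record structure invented purely for illustration:

from xml.sax.saxutils import quoteattr

# Hypothetical bookmark records - in practice these might be pulled from a
# course bookmarking service; the field names here are my own invention.
bookmarks = [
    {"id": "t151_1a", "queries": "topic 1a, what is a game",
     "title": "T151 Topic Exploration 1A - So what is a game?",
     "url": "http://digitalworlds.wordpress.com/2008/03/05/so-what-is-a-game/"},
]

promotions = ["<Promotions>"]
for b in bookmarks:
    promotions.append(
        '  <Promotion id={id} queries={queries} title={title} url={url} />'.format(
            **{k: quoteattr(b[k]) for k in ("id", "queries", "title", "url")}))
promotions.append("</Promotions>")
print("\n".join(promotions))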

Just by the by, a PR mailing in my inbox today threw up another example of how search and bookmarking might be brought more closely together: SearchTeam (screenshots [pdf]).

The model here is based around defining search contexts that one or more users can contribute to, and then saving out results from a search into a topic based bookmark area. The video suggests that particular results can also be blocked (and maybe boosted? The greyed plus on the left hand side?) – presumably this is a persistent feature, so if you, or another member of your “search team” runs the search, the blocked result doesn’t appear? (Is a list of blocked results and their corresponding search terms available anywhere I wonder?) In common with the clipping blog model used by sites such as posterous, it’s possible to post links and short blog posts into a topic area. Commenting is also supported.

Given that search was Google’s initial big idea, it’s surprising that it seems to play no significant role in Google’s offerings for education through Google Apps. Thinking back, search related topics were what got me into blogging and quick hacks; maybe it’s time to return to that area…

Written by Tony Hirst

July 26, 2011 at 12:34 pm

Posted in Library, Search, SEO


Open Book Talk

“A booktalk in the broadest terms is what is spoken with the intent to convince someone to read a book.” Wikipedia

Whilst putting together yesterday’s post about personal art collections online (for a wider take on this, see Mia Ridge’s The rise of the non-museum (and death by aggregation), which offers all manner of food for thought around personal collection building…), I started thinking again about how we might use recorded discussions or book talks focussing on particular books as a component in the “content scaffolding” around works that might be used as resources in an informal learning context.

(For an earlier foray in to the book talk world, see my post on BBC “In Our Time” Reading List using Linked Data.)

So the (really simple and obvious) idea is this (and I fully appreciate other sites out there may already exist that do this: if so, please let me know in the comments): how about we build a lookup service that allows you to search by author, book title, ISBN (or cross ISBN), and it returns details for the book as well as links to audio or video recordings of book talks around the book.

I’ve started trying to cobble together a few resources around this, setting up (a not yet complete set of) scrapers (in various states of disrepair) on Scraperwiki to collate books and book talk audio links from sources including BBC programme pages and the IT Conversations/Tech Nation archives.

It might also be appropriate to try to pull in “quality” book reviews* to annotate book listings, given that part of my idea at least is to find ways of enriching book references with discussion around them that can help folk make sense of the big ideas contained within the book, as well as maybe encouraging them to buy the book (the ever required sustainability model: in this case, Amazon referral fees!). Note that several of the sites use Amazon referrals as part of their own sustainability model, so it would only be fair to use their affiliate codes at least part of the time if their playable audio content was embedded on the site, even if that content is openly licensed… Share and share alike, right?! That is, trickle back a portion of any income you do make off the work of others, even if it is openly licensed for commercial use;-)

Another strand to all of this, of course, is sensemaking annotations around books pulled from “OERs” (what is it about education that makes the sector want its content to be somehow regarded as “special” and deserving of all sorts of qualification?!;-)

*Maybe the Guardian Platform API or one of the New York Times APIs could play a role here?

So, as ever, I’ve made a start, and as ever, that’ll probably be the end of it…. Sigh… Nice thought while it lasted though…

PS if I were to do next steps, it would probably be to take the scraped data and try to normalise it in some ad hoc way in a triple store, maybe on the Talis platform? Note that in the current incarnation, some of the scraped BBC data contains multiple book references in a single record, and these should be separated out; also note that a lot of book references are informal (author/title), though I did manage to grab ISBNs (I think?!) from the IT Conversations/Tech Nation pages.

PPS In passing, I note that some of the older archived episodes of A Good Read have been split into chapters covering the different books reviewed in the programme. Was this some sort of experimental enrichment, or just the start of a more general roll out of chapterisation…?

Written by Tony Hirst

June 24, 2011 at 10:42 am

Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs

You know those files that are too large for you to work with, or even open? Maybe they’re not….

Picking up on Postcards from a Text Processing Excursion where I started dabbling with Unix command line text processing tools (it sounds scarier than it is… err… maybe?!;-), I thought it would make sense to have a quick play with them in the context of some “real” data files.

The files I’ve picked are intended to be intimidating (maybe?) at first glance because of their size: in this post I’ll look at a set of OpenURL activity data from Edina (24.6MB download, unpacking to 76MB), and for a future post, I thought it might be interesting to see whether this approach would work with a dump of local council spending data from OpenlyLocal (73.2MB download, unzipping to 1,011.9MB).

To start with, let’s have a quick play with the OpenURL data: you can download it from here: OpenURL activity data (April 2011)

What I intend to do in this post is track my own preliminary exploration of the file using what I learned in the “Postcards” post. I may also need to pick up a few new tricks along the way… One thing I think I want to look for as I start this exploration is an idea of how many referrals are coming in from particular institutions and particular sources…

Let’s start at the beginning though by seeing how many lines/rows there are in the file, which I downloaded as L2_2011-04.csv:

wc -l L2_2011-04.csv

I get the result 289,691; older versions of Excel used to only support 65,536 rows per sheet, though I believe more recent versions (Excel 2007, and Excel 2010) can support over a million; Google Apps currently limits sheet sizes to up to 200,000 cells (max 256 columns), so even if the file was only one column wide, it would still be too big to upload into a single Google spreadsheet. Google Fusion Tables can accept CSV files up to 100MB, so that would work – if we could actually get the file to upload… Spanish accent characters seemed to break things when I tried; the workaround I found was to split the original file, then separately upload and resave the parts using Google Refine, before uploading the files to Google Fusion Tables (upload one to a new table, then import and append the other files into the same table).

…which is to say: the almost 300,000 rows in the downloaded CSV file are probably too many for many people to know what to do with, unless they know how to drive a database… which is why I thought it might be interesting to see how far we can get with just the Unix command line text processing tools.

To see what’s in the file, let’s take a peek at the first few rows (we might also look to the documentation):

head L2_2011-04.csv

Column 40 looks interesting to me: sid (service ID); in the data, there’s a reference in there to mendeley, as well as some other providers I recognise (EBSCO, Elsevier and so on), so I think this refers to the source of the referral to the EDINA openurl resolver (@ostephens and @lucask suggested they thought so too). Also, @lucask suggested “OpenUrl from Endnote has ISIWOS as default SID too!”, so we may find that some sources either mask their true origin to hide low referral numbers (maybe very few people ever go from Endnote to the EDINA openurl resolver?), or inflate other numbers (Endnote inflating apparent referrals from ISIWOS).

Rather than attack the rather large original file, let’s start by creating a smaller sample file with a couple of hundred rows that we can use as a test file for our text processing attempts:

head -n 200 L2_2011-04.csv > samp_L2_2011-04.csv

Let’s pull out column 40, sort, and then look for unique entries in the sample file we created:

cut -f 40 samp_L2_2011-04.csv | sort | uniq -c

I get a response that starts:

12
1 CAS:MEDLINE
1 EBSCO:Academic Search Premier
7 EBSCO:Business Source Premier
1 EBSCO:CINAHL
...

so in the sample file there were 12 blank entries, 1 from CAS:MEDLINE, 7 from EBSCO:Business Source Premier and so on, so this appears to work okay. Let’s try it on the big file (it may take a few seconds…) and save the result into a file (uniqueSID.csv):

cut -f 40 L2_2011-04.csv | sort | uniq -c > uniqueSID.csv

The results of the count will be in arbitrary order, so it’s possible to add a sort into the pipeline in order to sort the entries according to the number of entries. The column we want to sort on is column 1 (so we set the sort -k key to 1); and because sort sorts into increasing order by default, we can reverse the order (-r) to get the most referenced entries at the top (the following is NOT RECOMMENDED… read on to see why…):

cut -f 40 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r > uniqueSID.csv

We can now view the uniqueSID.csv file using the more command (more uniqueSID.csv), or look at the top 5 rows using the head command:

head -n 5 uniqueSID.csv

Here’s what I get as the result (treat this with suspicion…):

9181 OVID:medline
9006 Elsevier:Scopus
7929 EBSCO:CINAHL
74740 http://www.isinet.com:WoK:UA
6720 EBSCO:jlh

If we look through the file, we actually see:

1817 OVID:embase
1720 EBSCO:CINAHL with Full Text
17119 mendeley.com:mendeley
16885 mendeley.com/mendeley:
1529 EBSCO:cmedm
1505 OVID:ovftdb

I actually was alerted to this oops when looking to see how many referrals were from mendeley, by using grep on the counts file (if grep complains about a “Binary file”, just use the -a switch…):

grep mendeley uniqueSID.csv

17119 mendeley.com:mendeley
16885 mendeley.com/mendeley:

17119 beat the “top count” 9181 from OVID:medline – obviously I’d done something wrong!

Specifically, the sort had sorted by character not by numerical value… (17119 and 16885 are numerically greater than 1720, but 171 and 168 are less (in string sorting terms) than 172. The reasoning is the same as why we’d index aardman before aardvark).

To force sort to sort using numerical values, rather than string values, we need to use the -n switch (so now I know!):

cut -f 40 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > uniqueSID.csv

Here’s what we get now:

74740 http://www.isinet.com:WoK:UA
34186
20900 http://www.isinet.com:WoK:WOS
17119 mendeley.com:mendeley
16885 mendeley.com/mendeley:
9181 OVID:medline
9006 Elsevier:Scopus
7929 EBSCO:CINAHL
6720 EBSCO:jlh
...

To compare the referrals from the actual sources (e.g. the aggregated EBSCO sources, rather than EBSCO:CINAHL, EBSCO:jlh and so on), we can split on the “:” character, to create two columns from one: the first containing the bit before the ‘:’, the second containing the bit after:

sed s/:/'ctrl-v<TAB>'/ uniqueSID.csv | sort -k 2 > uniquerootSID.csv

(Some versions of sed may let you identify the tab character as \t; I had to explicitly put in a tab by using ctrl-V then tab.)

What this does is retain the number of lines, but sort the file so all the EBSCO referrals are next to each other, all the Elsevier referrals are next to each other, and so on.

Via an answer on Stack Overflow, I found this bit of voodoo that would then sum the contributions from the same root referrers:

cat uniquerootSID.csv | awk '{a[$2]+=$1}END{for(i in a ) print i,a[i] }' | sort -k 2 -r -n > uniquerootsumSID.csv

Using data from the file uniquerootSID.csv, the awk command sets up an array (a) that has indices corresponding to the different sources (EBSCO, Elsevier, and so on). It then runs an accumulator that sums the contributions from each unique source. After processing all the rows (END), the routine then loops through all the unique sources in the a array, and emits the source and the total. The sort command then sorts the output by total for each source and puts the list into the file uniquerootsumSID.csv.

Here are the top 15:

http://www.isinet.com 99453
EBSCO 44870
OVID 27545
mendeley.com 17119
mendeley.com/mendeley 16885
Elsevier 9446
CSA 6938
EI 6180
Ovid 4353
wiley.com 3399
jstor 2558
mimas.ac.uk 2553
summon.serialssolutions.com 2175
Dialog 2070
Refworks 1034

If we add the two Mendeley referral counts that gives ~34,000 referrals. How much are the referrals from commercial databases costing, I wonder, by comparison? Of course, it may be that the distribution of referrals from different institutions is different. Some institutions may see all their traffic through EBSCO, or Ovid, or indeed Mendeley… If nothing else though, this report suggests that Mendeley is generating a fair amount of EDINA openurl traffic…

Let’s use the cut command again to see how much traffic is coming from each unique institution (not that I know how to decode these identifiers…); column 4 is the one we want (remember, we use the uniq command to count the occurrences of each identifier):

cut -f 4 L2_2011-04.csv | sort | uniq -c | sort -k 1 -r -n > uniqueInstID.csv

Here are the top 10 referrer institutions (columns are: no. of referrals, institution ID):

41268 553329
31999 592498
31168 687369
29442 117143
24144 290257
23645 502487
18912 305037
18450 570035
11138 446861
10318 400091

How about column 5, the routerRedirectIdentifier:

195499 athens
39381 wayf
29904 ukfed
24766 no identifier
140 ip

How about the publication year of requests (column 17):

45867
26400 2010
16284 2009
13425 2011
13134 2008
10731 2007
8922 2006
8088 2005
7288 2004

It seems to roughly follow year?!

How about unique journal title (column 15):

258740
277 Journal of World Business
263 Journal of Financial Economics
263 Annual Review of Nuclear and Particle Science
252 Communications in Computer and Information Science
212 Journal of the Medical Association of Thailand Chotmaihet thangphaet
208 Scandinavian Journal of Medicine & Science in Sports
204 Paleomammalia
194 Astronomy & Astrophysics
193 American Family Physician

How about books (column 29 gives ISBN):

278817
1695 9780470669303
750 9780470102497
151 0761901515
102 9781874400394

And so it goes..

What’s maybe worth remembering is that I haven’t had to use any tools other than command line tools to start exploring this data, notwithstanding the fact that the source file may be too large to open in some everyday applications…

The quick investigation I was able to carry out on the EDINA openurl data also built directly on what I’d learned in doing the Postcards post (except for the voodoo awk script to sum similarly headed rows, and the sort switches to reverse the order of the sort, and force a numerical rather than string based sort). Also bear in mind that three days ago, I didn’t know how to do any of this…

…but what I do suspect is that it’s the sort of thing that Unix sys admins play around with all the time, e.g. in the context of log file hacking…

PS so what else can we do…? It strikes me that by using the date and timestamp, as well as the institutional ID and referrer ID, we can probably identify searches that are taking place: a) within a particular session, b) maybe by the same person over several days (e.g. in the case of someone coming in from the same place within a short window of time (1-2 hours), or around about the same time on the same day of the week, from the same IDs and searching around a similar topic).
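Here’s a sketch of the sort of grouping I mean, in Python this time; nb I’m guessing that column 1 holds a parseable timestamp (check the data documentation for the real layout and format) and assuming the log rows are in time order – column 4 is the institution ID, as above:

import csv
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=2)  # visits closer together than this = one "session"
sessions = {}  # institution ID -> list of sessions, each a list of timestamps

with open("L2_2011-04.csv") as f:
    for row in csv.reader(f, delimiter="\t"):
        # Assumed layout: row[0] a timestamp, row[3] the institution ID
        inst = row[3]
        ts = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        runs = sessions.setdefault(inst, [])
        if runs and ts - runs[-1][-1] <= SESSION_GAP:
            runs[-1].append(ts)  # continues the current session
        else:
            runs.append([ts])    # starts a new session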

Written by Tony Hirst

June 4, 2011 at 10:23 pm

Posted in Infoskills, Library


eSTEeM Project: Library Website Tracking For VLE Referrals

Assuming my projects haven’t been cut at the final acceptance stage because I haven’t yet submitted a revised project plan, here’s an outline of one of them…

Preamble
As OU courses are increasingly presented through the VLE, many of them opt to have one or more “Library Resources” pages that contain links to course related resources either hosted on the OU Library website or made available through a Library operated web service. Links to Library hosted or moderated resources may also appear inline in course content on the VLE. However, at the current time, it is difficult to get much idea about the extent to which any of these resources are ever accessed, or how students on a course make use of other Library resources.

With the state of the collection and reporting of activity data from the VLE still evolving, this project will explore the extent to which we can make use of data I do know exists, and to which I do have access, specifically Google Analytics data for the library.open.ac.uk domain.

The intention is to produce a three-way reporting framework using Google Analytics for visitors to the OU Library website and Library managed resources from the VLE. The reports will be targeted at: subject librarians who liaise with course teams; course teams; subscription managers.

Google Analytics (to which I have access) is already running on the library website, and the matter just(?!) arises now of:

1) Identifying appropriate filters and segments to capture visits from different courses;

2) Developing Google Analytics API wrapper calls to capture data by course or resource based segments and enable analysis, visualisation and reporting not supported within the Google Analytics environment (a sketch of the sort of call I have in mind follows this list);

3) Providing a meaningful reporting format for the three audience types. (note: we might also explore whether a view over the activity data may be appropriate for presenting back to students on a course.)
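For what it’s worth, here’s a sketch of the sort of wrapper call I have in mind for the second item, using the Google Analytics Core Reporting API via the google-api-python-client library; the profile ID and the VLE hostname filter are placeholders rather than real values:

from googleapiclient.discovery import build

def vle_referral_report(http_auth, profile_id="ga:XXXXXXXX"):
    # Visits/pageviews on the Library site for traffic referred from the VLE;
    # profile_id and the ga:source filter value are placeholders.
    service = build("analytics", "v3", http=http_auth)
    return service.data().ga().get(
        ids=profile_id,
        start_date="2011-01-01",
        end_date="2011-03-31",
        metrics="ga:visits,ga:pageviews",
        dimensions="ga:referralPath,ga:landingPagePath",
        filters="ga:source==vle.example.ac.uk",
        sort="-ga:visits",
    ).execute()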

The Project
The OU Library has been running Google Analytics for several years, but to my knowledge has not started to exploit the data being collected as part of a reporting strategy on the usage of library resources resulting from referrals from the VLE. (Whenever a user clicks on a link in the VLE that leads to the Library website, the Google Analytics on the Library website can capture that fact.)

At the moment, we do not tend to work on optimising our online courses as websites so that they deliver the sorts of behaviour we want to encourage. If we were a web company, we would regularly analyse user behaviour on our course websites and modify them as a result.

This project represents the first step in a web analytics approach to understanding how our students access Library resources from the VLE: reporting. The project will then provide the basis for a follow on project that can look at how we can take insight from those reports and make them actionable, for example in the redesign of the way links to library resources are presented or used in the VLE, or how visitors from the VLE are handled when they hit the Library website.

The project complements work that has just started in the Library on a JISC funded project on making journal recommendations to students based on previous user actions.

The first outcome will be a set of Google Analytics filters and advanced segments tuned to the VLE visitor traffic and resource usage on the Library website. The second will be a set of Google analytics API wrappers that allow us to export this data and use it outside the Google Analytics environment.

The final deliverables are three report types in two possible flavours:

1) a report to subject librarians about the usage of library resources from visitors referred from the VLE for courses they look after

2) a report to librarians responsible for particular subscription databases showing how that resource is accessed by visitors referred from the VLE, broken down by course

3) a report to course teams showing how library resources linked to from the VLE for their course are used by visitors referred to those resources from the VLE.

The two flavours are:

a) Google analytics reports

b) custom dashboard with data accessed via the Google Analytics API

Recommendations will also be made based on the extent to which Library website usage by anonymous students on particular OU courses may be tracked by other means, such as affinity strings in the SAMS cookie, and the benefits that may accrue from this more comprehensive form of tracking.

If course team members on any OU courses presenting over the next 9 months are interested in how students are using the library website following a referral from the VLE, please get in touch. If academics on courses outside the OU would like to discuss the use of Google Analytics in an educational context, I’d love to hear from you too:-)

eSTEeM is joint initiative between the Open University’s Faculty of Science and Faculty of Maths, Computing and Technology to develop new approaches to teaching and learning both within existing and new programmes.

Written by Tony Hirst

April 13, 2011 at 11:01 am

Posted in Analytics, Library, OU2.0, Project


Library Catalogue SRU Queries via YQL and Yahoo Pipes

I got a question from @liwazi last week wondering why a SRU request to the Cambridge Library catalogue wasn’t being handled properly in Yahoo pipes… I think it’s because the Yahoo Pipes XML parser is sometimes like that!

Anyway, here was my fix – to use YQL as a proxy, based around a SRU request URL of the form:
http://search.lib.cam.ac.uk/sru.ashx?version=1.1&recordSchema=dc&query=SEARCH+TERMS
&operation=searchRetrieve&maximumRecords=10

Here’s the form of the YQL query (try it in the YQL developer console):

select * from xml where url='http://search.lib.cam.ac.uk/sru.ashx?version=1.1&recordSchema=dc&query=SEARCH+TERMS&operation=searchRetrieve&maximumRecords=10'

You can find a copy of the pipe here: SRU demo pipe

Note that as well as accessing the data via the pipe, you can also pull the results of a search into a web page directly from YQL as a JSON feed:
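For example, something along these lines (a minimal sketch using the Python standard library; the search phrase is hard-wired for illustration):

import json
import urllib.parse
import urllib.request

# Run the YQL query against the public endpoint and ask for JSON back
yql = ("select * from xml where url='http://search.lib.cam.ac.uk/sru.ashx"
       "?version=1.1&recordSchema=dc&query=learning+perl"
       "&operation=searchRetrieve&maximumRecords=10'")
url = ("http://query.yahooapis.com/v1/public/yql?"
       + urllib.parse.urlencode({"q": yql, "format": "json"}))
data = json.load(urllib.request.urlopen(url))
print(data["query"]["results"])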

If you’re really keen, you might also define a YQL data table that would allow you to make a request of the form “select * from camsru where q=’learning perl'”, and then set up a short alias for the query so you could run it using a construction of the form http://query.yahooapis.com/v1/public/yql/EXAMPLEUSER/camsru?q=learning%20perl based on a YQL query of the form select * from camsru where q=@q

PS tomorrow is Mashed Library day at Lincoln – Pancakes and Mash. Be sure to follow #mashlib and chip in if you can ;-)

Written by Tony Hirst

March 7, 2011 at 3:19 pm

Posted in Library


BBC “In Our Time” Reading List using Linked Data

If you’re a regular listener of BBC Radio 4, you will almost certainly have come across In Our Time, a weekly, single topic discussion programme (with a longstanding archive of listen again material) hosted by Melvyn Bragg on matters scientific, philosophical, historical and cultural. In certain respects, In Our Time may be thought of as a discussion based audio encyclopedia. The format sees a panel of three experts (made up of academics, commentators and critics knowledgeable on the topic for that week) teaching the host about the topic. A diligent student, he will of course have done some background reading, and posted links to the references consulted on the programme’s web page.

I’ve already had a quick play with the In Our Time data, looking to see how easy it is to relate programmes to expert academics from various UK universities (Visualising OU Academic Participation with the BBC’s “In Our Time”), but I also wondered whether it would be possible to do anything with the book references, such as using them to identify courses that may be related to a particular programme. (This is reminiscent of a couple of MOSAIC competition entries that looked at ways of recommending books based on courses, and courses based on books, using @daveyp’s data from Huddersfield University library that associated course codes with the books borrowed by students taking those courses.)

Being a lazy sort, I posted an idea to the OKF Ideas Incubator suggesting that it might be worth considering extracting references from In Our Time programme pages and then reconciling them with Linked Data representations of the corresponding book data.

And then, as if by magic, a solution appeared, from Orangeaurochs: “In Our Time” booklist, which describes a method for parsing out the book data and then getting a Linked Data resource reference back from Bibliographica.

The original recipe suggested screenscraping the raw book references from the page HTML, but I posted a comment (at the time of writing, still in the moderation queue) which suggests:

Hi
Great to see you taking this challenge on. Re your step 2 – obtaining the reading list – a possibly more structured way of doing this is to get the appropriate section out of the xml or json representation of the programme page (eg http://www.bbc.co.uk/programmes/b00xhz8d.xml or http://www.bbc.co.uk/programmes/b00xhz8d.json).

I wonder if the BBC will start to structure the data even more – for example by adding explicitly marked up biblio data to book references?

Anyway, you can see an example of the results at pages with URLs of the form http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?http://www.bbc.co.uk/programmes/b00xhz8d – just add the appropriate IOT programme page URL to extract the data from it.

There are a few hits and misses, but it’s a great start, and something that can be used as a starting point for thinking about how to annotate programme related booklists with structured bibliographic data and exploring what that might mean in a world of linked educational resources that can also reference linked BBC content… :-)

PS Hmmm, I wonder what other programmes are associated with books? A Good Read and Desert Island Discs certainly…

Written by Tony Hirst

February 24, 2011 at 4:06 pm

Posted in BBC, Data, Library, OBU, OU2.0


UK HE Libraries Using Google Analytics

How many UK Higher Education Library websites are running Google Analytics, and how many of them are actually using them to report anything other than sitewide pageviews and visitor numbers?

A couple of years ago, I ran a series of posts on Library Analytics where I started to explore some of the ways in which Google Analytics (as it was then) could be used to help us start to understand how a library website was being used by its different sorts of visitors.

Two years on, and I’ve started looking again at Googalytics in the Library, and will hopefully get round to publishing a few posts at least about what I’ve learned about using it as it currently stands for making sense of Library website usage, and about what we may be able to report back to course teams about library website activity of users referred from course pages on the OU VLE.

One thing I thought I’d like to try to do is come up with custom reports, segments and goal recipes that might be transferable, or useful to other HE Library websites, as well as identify “best practice” approaches that are used by other HE libraries running Google Analytics… But which libraries are running Google Analytics?

Using a list of HE Library websites grabbed from a November, 2009 dump of a scrape of the Sconul website (by @ostephens, I think?), I ran a quick python script to sniff library websites for evidence of Google Analytics tracking codes (results).
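Something along these lines would do the job (a reconstruction, not the actual script; the signature strings are my guesses at the telltales worth checking for):

import urllib.request

# Crude sniff for Google Analytics: fetch each homepage and look for the
# tracking script/tracker object in the returned HTML.
SIGNATURES = ["google-analytics.com/ga.js", "_gaq.push", "urchinTracker"]

def has_ga(url):
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        return None  # page failed to load
    return any(sig in html for sig in SIGNATURES)

print(has_ga("http://www.open.ac.uk/library/"))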

Total number of websites checked: 181
Number with Google Analytics code detected: 110 (60.8%)
Number without Google Analytics code detected: 67 (37.0%)
Number of pages that failed to load: 4 (2.2%)

So, it seems like a fair few folk are running Google Analytics… but I wonder: what are they reporting, what segments and custom reports do they find most useful, what goals have they defined (and do they carry a meaningful “financial” conversion value? If so, defined how?), and are they in any sense “actionable” (that is, have they been used to prompt interventions to increase traffic, influence on-site behaviour, feed in to website design changes, feed in to subscription or book acquisition policies, improve links with course academics, update reading lists, contribute to VLE content or structure, schedule and staff online help services, influence opening hours, etc.)?

If you are working in an HE library, running Google Analytics, and can provide even fragmentary answers to any of the above questions, please reply in a comment below, or feel free to email me (in confidence, if required) at: a.j.hirst@open.ac.uk

PS I’m even going to start looking to the literature, too… So for example, this is next on my reading list: Turner, S. J. (2010). Website Statistics 2.0: Using Google Analytics to Measure Library Website Effectiveness. Technical Services Quarterly, 27(3), 261-278. doi:10.1080/07317131003765910

PPS I thought I’d follow the single citation to that paper too, but it seems I can’t unless I pay for it…

This is interesting, methinks. Not only is the content of the paper kept behind a paywall, but so is its incoming link context…

Written by Tony Hirst

February 11, 2011 at 12:31 pm

Posted in Library, OU2.0

