OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Library’ Category

Citation Positioning

It’s been years and years since I did either a formal literature review, or used a reference manager like EndNote or RefWorks in anger, but whilst at the Arcadia Project review in Cambridge a couple of days ago, I started wondering what sorts of ‘added value’ features I’d like to see, maybe even expect, from referencing software nowadays…

One of the ideas I’ve been playing with recently is the idea of emergent social positioning (ESP;-) in online social networks, which I’m defining in terms of where an individual or an expression of a particular interest group might be positioned in terms of the socially projected interests of people following that person or interest group.

For the case of an individual, the approach I’m taking is to look at who the followers of that individual follow to any great extent; for the case of an interest group, as evidenced by users of a particular hashtag, for example, it might be to look at who the followers of the users of the hashtag also follow in significant numbers.

A slightly more constrained approach might be to look at how the followers of the individual or the hashtag users follow each other (a depth 1.5 follower network about an indvidual or set of individals, in effect).

So for example, here’s a map I just grabbed of folk who are followed by 3 or more followers from a sampling of the followers recent users of the #gdslaunch (Government Digital Service launch) hashtag.

In the vicinity of #gdslaunch

So what does this have to do with reference managers? Let’s start with a single academic paper (the ‘target’ paper), that contains a list of references to other works. If we can easily grab the reference lists from all those works, we can generate a depth 1.5 reference map that show how the works referenced in the first paper reference each other. Exploring the structural properties of this map may help us better understand the support basis for the ideas covered in our target paper.

By looking at the depth 2 reference network (that is, the network that shows references included in the target paper, and all their references), we may be able to discover additional (re)sources relevant to the target paper.

Unfortunately, getting free and and easy machine readable access to the lists of references contained within journal articles, conference papers and books is not trivial. There are patchy services such as CiteSeer, Citebase or opencitations.net, but I don’t think services like Mendeley, Zotero or CiteUlike are yet expressing this sort of data? Or maybe they are, and I’m missing a trick somewhere.

(Just by the by, presumably some of the commercial citation services have APIs that support at least accessing this data? If you know of any, could you add a link in the comments please?:-)

Another hack I’d like to try is to generate what more closely corresponds to the social positioning idea, which is to grab the references from a target paper, and then the papers that cite those references and see how they all link together. This would help position the target paper in the space of other papers referencing similar works. I think CiteSeer has this sort of functionality, though not in a graphical form?

PS on my to do list is seeing whether I can get reference lists for articles out of Citeseer using the Citeseer OAI-PMH endpoint. I’ve got as far as installing the pyoai Python library, but not had time to try it out yet. If anyone knows of a guide to OAI for complete novices, ideally with pyoai examples I can crib from, please post a link (or some examples) via the comments:-)

Written by Tony Hirst

December 8, 2011 at 1:47 pm

Posted in Infoskills, Library

Tagged with

Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents

Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.

The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.

One XML doc was used per week, and was used to generate the separate HTML pages for each week’s study.

One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)

Through using the course custom search engine myself, I have found several issues with it:

1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.

2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, more specific queries building on the original search term delivered more and different results. This gives a really odd search experience, and one that I suspect would put many users off.

3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for example of what’s been lost, see e.g. Stone Temple Consulting: Google Co-Op Subscribed Links).

4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:

<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif"
  	title="Week 4"
  	queries="week 4,T151 week 4,t151 week 4"
  	description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>

Course CSE - week promotion

There are several components we need to consider here:

  1. queries: these are the phrases that are used to trigger the display of the particular promotions links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that the at most three (3) promotions can be displayed for any query. Queries may be based at least around either structural components (such as study week, topic number), subject matter terms (e.g. tags, keywords, or headings) and resource type (eg audio/video material, academic readings etc), although we might argue the resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T51 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
  2. title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
  3. description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
  4. url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that the a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?

Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.

In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):

Course CSE - multiple promotions

Or what a particular question is within a particular topic:

COurse CSE - what's the question?

The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:

  	title="Topic Exploration 4A Question 4"
  	queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a"
  	... />

The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3″ and “topic 1a q4″… (You can try out the CSE here.)

The promotions I have specified so far also lack several things:

1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)

2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly though a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)

At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?

At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…

[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]

PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the searchengine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are a several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?

My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.

Search hierarchy

It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?

*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:

  • academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
  • news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
  • video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
  • books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.

I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?

Fudging the CSE with a local searchengine

However, that wasn’t what this experiment was about!;-)

Course Resources as part of a larger connected graph

Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
– links to papers referenced in the article;
– links to papers that cite the article;
– links to other papers written by the same author;
– links to other papers in collections containing the article on services such as Mendeley;
– links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.

Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (eg though following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.

Written by Tony Hirst

November 8, 2011 at 1:54 pm

Digital Scholarship and Academically Discoverable Blogs

What does it take for a digital scholar’s blog to become academically credible?

At a time when we know that folk go to Google for a lot of their search needs, the academic library argues it’s case, in part, as a place where you can go to get access to “good quality” (academically credible) and comprehensive information through what we might term academic search engines.

The library’s search offerings are presumably subscription based (?) and their results often link through to subscription content; but the academic life is a privileged one, and our institutions cover the access costs on our behalf. (I guess this could almost be considered one of the “+ benefits” you might imagine an enthusiastic copywriter assuming for an academic job ad!)

The library and information access privilege extends to students too, so we might imagine a well-intentioned, but perhaps naive, student thinking that if they run a search using the Library’s “academically certified” search engine, they will get the sort of result they can happily cite in an essay, without fear of criticism about the academic credibility of the source publication.

We might imagine, too, that academics and researchers also place an element of trust in the credibility of sources returned as results to search queries raised using library discovery services.

So here’s a claim (which is untested and may or may not be true): if you want your work to stand a chance of being referenced in a piece of scholarly work, you need it to be discoverable in the places that the scholar goes to discover supporting claims or related material for the work they’re doing. The assumption is that the scholar will use a library provided discovery service because it is less noisy than a general web search engine and is likely to return to resources from credible sources. The curation of sources – and what is not included in the index – is in part what the subscription discovery service offers.

What this means is that if digital scholars want their blogging activity to be discoverable in the academic context, they need to find some way of getting some of their blogposts at least into academic discovery service indices.

But this is not likely to happen, right? Wrong… Here’s what I noticed when I ran a search using the OU Library’s “one-stop” search earlier today:

Acaemically verified???

A top two reference to a Mashable article (albeit identified as a news item) via the Newsbank database and a top ranked periodical article from Fast Company (via the UK/EIRE Reference Centre database). (Hmmm, I wonder how quickly this content is indexed? That is, how soon after posting on Mashable does an article become discoverable here?)

So maybe I need to start writing for Mashable?!

Or maybe not…?

One of the attractive features of WordPress as a publishing platform is that it provides feeds for everything, including category and tag level feeds. A handful of my category feeds are syndicated, for example to R-Bloggers, the Guardian Datablog blogroll and (I’m not sure if this still works?) the Online Journalism blog. Only posts tagged in a particular way are sent to the syndicated feeds.

So I’m wondering this: how much mileage would there be in setting up aggregation blogs around particular academic areas that not only syndicate content from publisher members, but also act as a focus for indexing by a service such as Newsbank? The content would be publisher-moderated (I don’t post content on non-R related matters to my R-bloggers syndication feed) and hopefully responsive to the norms of the aggregation community itself.

Precendents already exist of course; for example, Nature.com blogs aggregates blogs from a variety of working scientists. Is this content discoverable via the OU Library’s one stop/Ebsco search?

For an academic’s work to count in RAE terms, it needs to be cited. In order to be cited, it needs to be discoverable. Even if it isn’t citeable as a formal article, it can still make a contribution if it’s discoverable. To be academically discoverable, content needs to be discoverable via academic search engines. So why should Mashable count, but not personal academic blogs that are respected within their own communities?

PS I’m a bit out of touch with referencing converntions; I remember that pers. comm. used to be an acceptable way of crediting someone’s ideas they had personally communicated to you; is there a pub. comm. (that’s pub. comm. not just pub comm. ;-) equivalent that might be used to refer to online or offline public communications that might not otherwise be citeable?

Written by Tony Hirst

November 2, 2011 at 11:41 am

Posted in Infoskills, Library, OU2.0

Appropriate IT – My ILI2011 Presentation

Here’s a copy of the slides from my ILI2011 presentation on Appropriate IT:

One thing I wanted to explore was, if discovery happens elsewhere, and the role of the librarian is no longer focussed on discovery related issues, where can library folk help out? Here’s where I think we need to start placing some attention: sensemaking, and knowing what’s possible (aka helping redistribute the future that is already around us;-) Allied with this is the idea that we need to make more out of using appropriate IT for particular tasks, as well as appropriating IT where we can to make our lives easier.

In part, sensemaking is turning the wealth of relevant data out there into something meaningful for the question or issue at hand, or the choice we have to make. My own dabblings with social network analysis are approaches I’m working on that help me make sense of interest networks and social positioning within those networks so I can get a feel for how those communities are structured and who the major actors are within them.

As far as knowing what’s possible, I think we have a real issue with “folk IT” knowledge. Most of us have a reasonable grasp of folk physics and folk psychology. That is, we have a reasonable common-sense model of how the world works at the human scale (let go of an apple, it falls to the floor), and we can generally read other people from their behaviour; but how well developed is “folk IT” knowledge? Given that to most people the idea that you can search within a page in a wide variety of electronic documents using crtrl-F as a keyboard shortcut to a “search within page/document” feature is alien to them, I think our folk understanding of IT is limited to the principle of “if you switch it off and on again it should start working again”.

Folk IT is also tied up with computational thinking, but at a practical, “human scale”. So here are a few ideas I think the librarians need to start pushing:

– the idea of a graph; it’s what the web’s based around, after all, and it also helps us understand social networks. If you think of your website as a graph, with edges representing links that connect nodes/pages together, and realise that your on-site homepage is whatever page someone lands on from a search engine or third party link, you soon start to realise that maybe your website is not as usefully structured as you thought…
– some sort of common sense understanding of the role that URLs/URIs play in the browser, along with the idea that URIs are readable and hackable and also may say something about the way a website, or the resources it makes available, organised;
– the notion of “View Source”, that allows you to copy and crib the work of others when constructing your own applications, along with the very idea that you might be able to build web pages yourself out of free standing components.
– the idea of document types and applications that can work all sorts of magic given documents of that type; the knowledge that an MP3 file works well with an audio player or audio editor, for example, or that a PNG or JPG encodes an image, along with more esoteric formats such as KML (paste a URL to a KML file into the search box of a Google Maps search and see what happens, for example…). Knowledge of the filetype/document type gives you some sort of power over it, and helps you realise what sorts of thing you can do with it… (except for things like PDF, for example, which is to all intents and purposes a “can’t do anything with it” filetype;-)

I also think an understanding of pattern based string matching and what regular expressions allow you to do would go a long way towards helping folk who ever have to manipulate text or text-based data files, at least in terms of letting them know that there are often better ways of cleaning up a text file automagically rather than having to repeat the same operation over and over again on each separate row in file containing several thousand lines… They don’t need to know how to write the regular expression from the off, just that the sorts of operation regular expressions support are possible, and that someone will probably be able to show you how to do it…

Written by Tony Hirst

October 31, 2011 at 3:50 pm

Posted in Infoskills, Library, Presentation

Tagged with

Search Engine Powered Courses…

How can we use customised search engines to support uncourses, or the course models used to support MOOC style offerings?

To set the scene, here’s what Stephen Downes wrote recently on the topic of How to partcipate in a MOOC:

You will notice quickly that there is far too much information being posted in the course for any one person to consume. We tried to start slowly with just a few resources, but it quickly turns into a deluge.

You will be provided with summaries and links to dozens, maybe hundreds, maybe even thousands of web posts, articles from journals and magazines, videos and lectures, audio recordings, live online sessions, discussion groups, and more. Very quickly, you may feel overwhelmed.

Don’t let it intimidate you. Think of it as being like a grocery store or marketplace. Nobody is expected to sample and try everything. Rather, the purpose is to provide a wide selection to allow you to pick and choose what’s of interest to you.

This is an important part of the connectivist model being used in this course. The idea is that there is no one central curriculum that every person follows. The learning takes place through the interaction with resources and course participants, not through memorizing content. By selecting your own materials, you create your own unique perspective on the subject matter.

It is the interaction between these unique perspectives that makes a connectivist course interesting. Each person brings something new to the conversation. So you learn by interacting rather than by mertely consuming.

When I put together the the OU course T151, the original vision revolved around a couple of principles:

1) the course would be built in part around materials produced in public as part of the Digital Worlds uncourse;

2) each week’s offering would follow a similar model: one or two topic explorations, plus an activity and forum discussion time.

In addition, the topic explorations would have a standard format: scene setting, and maybe a teaser question with answer reveal or call to action in the forums; a set of topic exploration questions to frame the topic exploration; a set of resources related to the topic at hand, organised by type (academic readings (via a libezproxy link for subscription content so no downstream logins are required to access the content), Digital Worlds resources, weblinks (industry or well informed blogs, news sites etc), audio and video resources); and a reflective essay by the instructor exploring some of the themes raised in the questions and referring to some of the resources. The aim of the reflective essay was to model the sort of exploration or investigation the student might engage in.

(I’d probably just have a mixed bag of resources listed now, along with a faceting option to focus in on readings, videos, etc.)

The idea behind designing the course in this way was that it would be componentised as much as possible, to allow flexibility in swapping resources or even topics in and out, as well as (though we never managed this), allowing the freedom to study the topics in an arbitrary order. Note: I realised today that to make the materials more easily maintainable, a set of ‘Recent links’ might be identified that weren’t referred to in the ‘My Reflections’ response. That is, they could be completely free standing, and would have no side effects if replaced.

As far as the provision of linked resources went, the original model was that the links should be fed into the course materials from an instructor maintained bookmark collection (for an early take on this, see Managing Bookmarks, with a proof of concept demo at CourseLinks Demo (Hmmm, everything except the dynamic link injection appears to have rotted:-().

The design of the questions/resources page was intended to have the scoping questions at the top of the page, and then the suggested resources presented in a style reminiscent of a search engine results listing, the idea being that we would present the students with too many resources for them to comfortably read in the allocated time, so that they would have to explore the resources from their own perspective (eg given their current level of understanding/knowledge, their personal interests, and so on). In one of my more radical moments, I suggested that the resources would actually be pulled in from a curated/custom search engine ‘live’, according to search terms specially selected around the current topic and framing questions, but I was overruled on that. However, the course does have a Google custom search engine associated with it which searches over materials that are linked to from the course.

So that’s the context…

Where I’m at now is pondering how we can use an enhanced custom search engine as a delivery platform for a resource based uncourse. So here’s my first thought: using a Google Custom Search Engine populated with curated resources in a particular area, can we use Google CSE Promotions to help scaffold a topic exploration?

Here’s my first promotions file:

   <Promotion id="t151_1a" 
        queries="topic 1a, Topic 1A, topic exploration 1a, topic exploration 1A, topic 1A, what is a game, game definition" 
        title="T151 Topic Exploration 1A - So what is a game?" 
        description="The aim of this topic is to think about what makes a game a game. Spend a minute or two to come up with your own definition. If you're stuck, read through the Digital Worlds post 'So what is a game?'"
        image_url="http://kmi.open.ac.uk/images/ou-logo.gif" />

It’s running on the Digital Worlds Search Engine, so if you want to try it out, try entering the search phrase what is a game or game definition.

T151 CSE promotion - game definition

(This example suggests to me that it would also make sense to use result boosting to boost the key readings/suggested resources I proposed in the topic materials so that they appear nearer the top of the results (that’ll be the focus of a future post;-))

The promotion displays at the top of the results listing if the specified queries match the search terms the user enters. My initial feeling is that to bootstrap the process, we need to handle:

– queries that allow a user to call on a starting point for a topic exploration by specifically identifying that topic;
– “naive queries”: one reason for using the resource-search model is to try to help students develop effective information skills relating to search. Promotions (and result boosting) allow us to pick up on anticipated naive queries (or popular queries identified from search logs), and suggest a starting point for a sensible way in to the topic. Alternatively, they could be used to offer suggestions for improved or refined searches, or search strategy hints. (I’m reminded of Dave Pattern’s work with guided searches/keyword refinements in the University of Huddersfield Library catalogue in this context).

Here’s another example using the same promotion, but on a different search term:

T151 CSE - topic 1a

Of course, we could also start to turn the search engine into something like an adventure game engine. So for example, if we type: start or about, we might get something like:

T151 CSE - start

(The link I associated with start should really point to the course introduction page in the VLE…)

We can also use the search context to provide pastoral or study skills support:

T151 CSE - pastoral

These sort of promotions/enhancements might be produced centrally and rolled out across course search engines, leaving the course and discipline related customisations to the course team and associated subject librarians.

Just a final note: ignoring resource limitations on Google CSEs for a moment, we might imagine the following scenarios for their role out:

1) course wide: bespoke CSEs are commissioned for each course, although they may be supplemented by generic enhancements (eg relating to study skills);

2) qualification based: the CSE is defined at the qualification level, and students call on particular course enhancements by prefacing the search with the course code; it might be that students also see a personalised view of the qualification CSE that is tuned to their current year of study.

3) university wide: the CSE is defined at the university level, and students students call on particular course or qualification level enhancements by prefacing the search with the course or qualification code.

Written by Tony Hirst

September 15, 2011 at 2:03 pm

Open Data Processes: the Open Metadata Laundry

Another quick note from yesterday’s mini-mash at Cambridge, hosted by Ed Chamberlain, and with participation from consultant Owen Stephens, Lincoln’s Paul Stainthorp and his decentralised developers, and Sussex’s Chris Keene. This idea came from the Lincoln Jerome project (I’m not sure if this has been blogged on the Jerome project blog?), and provides a way of scrubbing MARC based records to free the metadata up from license restrictions.

The recipe goes along the lines of reconciling the record for each item with openly licensed equivalents, and creating a new record for each item where data fields are populated with content that is know to be openly licensed. In part, this relies on having a common identifier. One approach that was discussed was generating hashes based on titles with punctuation removed. This feels a bit arbitrary to me…? I’d probably reduce all the letters to the same case at the very least in an attempt to normalise the things we might be trying to hash?

I wonder if Ed’s mapping of metadata ownership might also have a role to play in developing a robust laundry service? (e.g. “Ownership” of MARC-21 records and Where exactly DOES a record come from?).

We also discussed recipes where different libraries, each with their own MARC records for a work, might be compared field by field to identify differences between the ways similar items might be catalogued differently. As well as identifying records that maybe contain errors, this approach might also enhance discovery, for example through widening a set of keywords or classification indices.

One of the issues we keep returning to is why it might be interesting to release lots of open data in a given context. Being able to pivot from a resource in one context to a resource in another context is a general/weak way of answering this question, but here are a couple of more specific issues that came up in conversation:

1) having unique identifiers is key, and becomes useful when people use the same identifier, or same-as’d identifiers, to refer to the same thing;

2) we need tool support to encourage people creating metadata to start linking in to a recognised/shared identifier spaces. I wonder if there might be value in institutions starting to publish reconciliation services that can be addressed from tools like Google Refine. (For example, How to use OpenCorporates to match companies in Google Refine or Google Refine Reconciliation Service API). Note that it might make sense for reconciliation services to employ various string similarity heuristics as part of the service.

3) we still don’t have enough compelling use cases about the benefits of linked IDs, or tools that show why it’s powerful. (I think of linked identifier spaces that are rich enough to offer benefits as if they were (super)saturated solutions, where it’s easy to crystallise out interesting things…) One example I like is how Open Corporates use reconciliation to allow you to map companies names in local council accounts to specific corporate entities. In time, one can imagine mapping company directors and local council councillors onto person entities and then starting to map these councillor-corporate-contract networks out…;-)

Finally, something Owen mentioned that resonates with some of my thinking on List Intelligence: Superduping/Work Superclusters, in which we take an ISBN, look at its equivalents using ThingISBN or xISBN, and then for each of those alternatives, look at their ThingISBN/xISBN alternatives, until we reach a limit set. (cf my approaches for looking at lists a Twitter UserID is included on, looking at the other members of the same lists, then finding the other lists they are mentioned on, etc. Note in the case of Twitter lists, this doesn’t necessarily hit a limit without the use of thresholding!)

Written by Tony Hirst

August 9, 2011 at 12:19 pm

Integrating Course Related Search and Bookmarking?

Not surprisingly, I’m way behind on the two eSTEeM projects I put proposals in for – my creative juices don’t seem to have been flowing in those areas for a bit:-( – but as a marking avoidance strategy I thought I’d jot down some thoughts that have been coming to mind about how the custom search project at least might develop (eSTEeM Project: Custom Course Search Engines).

The original idea was to provide a custom search engine that indexes pages and domains that are referenced within a course in order to provide a custom search engine for that course. The OU course T151 is structured as a series of topic explorations using the structure:

– topic overview
– framing questions
– suggested resources
– my reflections on the topic, guided by the questions, drawing on the suggested resources and a critique of them

One original idea for the course was that rather than give an explicit list of suggested resources, we provide a set of links pulled in live from a predefined search query. The list would look as if it was suggested by the course team but it would actually be created dynamically. As instructors, we wouldn’t be specifying particular readings, instead we would be trusting the search algorithm to return relevant resources. (You might argue this is a neglectful approach… a more realistic model might be to have specifically recommended items as well as a dynamically created list of “Possibly related resources”.)

At this point it’s maybe worth stepping back a moment to consider what goes into producing a set of search results. Essentially, there are three key elements:

– the index, the set of content that the search engine has “searched” and from which it can return a set of results;
– the search query; this is run against the index to identify a set of candidate search results;
– a presentation algorithm that determines how to order the search results as presented to the user.

If the search engine and the presentation algorithm are fixed, then for a given set of search terms, and a given index, we can specify a search term and get a known set of results back. So in this case, we could use a fixed custom search engine, with know search terms, and return a known list of suggested readings. The search engine would provide some sort of “ground truth” – same answer for the same query, always.

If we trust the sources and the presentation algorithm, and we trust that we have written an effective search query, then if the index is not fixed, or if a personalised ranking algorithm (that we trust) is used as part of the search engine, we would potentially be returning search results that the instructor has not seen before. For example, the resources may be more recent than the last time the instructor searched for resources to recommend, or they better fit the personalisation criteria for the user under the ranking algorithm used as part of the presentation algorithm.

In this case, the instructor is not saying: “I want you to read this particular resource”. They are saying something more along the lines of: “these are potentially the sorts of resource I might suggest you look at in order to study this topic”. (Lots of caveats in there… If you believe in content led instruction, with students referring to to specifically referenced resources, I imagine that you would totally rile against this approach!)

At times, we might want to explicitly recommend one or two particular resources, but also open up some other recommendations to “the algorithm”. It struck me that it might be possible to do this within the context of a Google Custom Search approach using “special results” (e.g. Google CSEs: Creating Special Results/Promotions).

For example, Google CSEs support:

promotions: “A promotion is simply an association between a pre-defined set of query terms and a link to a webpage. When a user types a search that exactly matches one of your query terms, the promotion appears at the top of the page.” So by using a specific search term, we can force the return of a specific result as the top result. In the context of a topic exploration, we could thus prepopulate the search form of an embedded search engine with a known search phrase, and use a promotion to force a “recommend reading” link to the top of the results listing.

Promotion links are stored in a separate config file and have the form:

  <Promotion id="1"
    queries="wanderer, the wanderer" 
    title="Groo the Wanderer" 
    description="Comedy. American series illustrated by Sergio Aragonés."
    image_url="http://www.newsfromme.com/images5/groo11.jpg" />

subscribed links: subscribed links allow you to return results in a specific format (such as text, or text and a link, or other structured results) based on a perfect match with a specific search term. In a sense, subscribed links represent a generalised version of promotions. Subscribed links are also available to users outside the context of a CSE. If a user subscribes to a particular subscribed link file, then if there is an exact match against of one the search phrases in the subscribed link file and a search phrase used by a subscribing user on Google web search (i.e. on google.com or google.co.uk), the subscribed link will be returned in the results listing.

In the simplest case, subscribed links can be defined at the individual link level:

Google subscribed link definition

If your search term is an exact match for the term in the subscribed link definition, it will appear in the main search results page:

Google subscribed links

It’s also possible to define subscribed link definition files, either as simple tab separated docs or RSS/Atom feeds, or using a more formal XML document structure. One advantage of creating subscribed links files for use within in custom search engine is that users (i.e. students) can subscribe to them as a way of augmenting or enhancing their own Google search results. This has the joint effect of increasing the surface area of the course, so that course related recommendations can be pushed to the student for relevant queries made through the Google search engine, as well as providing a legacy offering: students can potentially take away a subscription when then finish the course to continue to receive “academically credible” results on relevant search topics. (By issuing subscription links on a per course presentation basis (or even on a personalised, unique feed per student basis), feeds to course alumni might be customised, or example by removing links to subscription content (or suggesting how such content might be obtained through a subscription to the university library), or occasionally adding in advertising related links (so if a student searches using a “course” keyword, make recommendations around that via a subscribed links feed; in the limit, this could even take on the form of a personalised, subscription based advertising channel).

Another way in which “recommended” links can be boosted in a custom search result listing is through boosting search results via their ranking factors (Changing the Ranking of Your Search Results).

In the case of both subscribed links and boosted search results, it’s possible to create a configuration file dynamically. Where students are bookmarking search results relating to a course, it would therefore be possible to feed these into a course related custom search engine definition file, or a subscribed link file. If subscribed link files are maintained at a personal level, it would also be possible to integrate a student’s bookmarked links in to their subscribed links feed, at least for use on Google websearch (probably not in the custom search engine context?). This would support rediscovery of content bookmarked by the student through subscribed link recommendations.

Just by the by, a PR mailing in my inbox today threw up another example of how search and bookmarking might be brought more closely together: SearchTeam (screenshots [pdf]).

The model here is based around defining search contexts that one or more users can contribute to, and then saving out results from a search into a topic based bookmark area. The video suggests that particular results can also be blocked (and maybe boosted? The greyed plus on the left hand side?) – presumably this is a persistent feature, so if you, or another member of your “search team” runs the search, the blocked result doesn’t appear? (Is a list of blocked results and their corresponding search terms available anywhere I wonder?) In common with the clipping blog model used by sites such as posterous, it’s possible to post links and short blog posts into a topic area. Commenting is also supported.

To say that search was Google’s initial big idea, it’s surprising that it seems to play no significant role in Google’s offerings for education through Google Apps. Thinking back, search related topics were what got me into blogging and quick hacks; maybe it’s time to return to that area…

Written by Tony Hirst

July 26, 2011 at 12:34 pm

Posted in Library, Search, SEO

Tagged with


Get every new post delivered to your Inbox.

Join 1,416 other followers