Sketching a datasette-powered Jupyter Notebook Search Engine: nbsearch

Every so often, I’ve pondered the question of "notebook search": how can we easily support searches over Jupyter notebooks? I don’t really understand why this area seems so underserved, especially given the explosion in the number of notebooks and the way in which notebooks are increasingly used as documents for writing technical documentation, tutorials and instructional material.

One approach I have seen as a workaround is to produce an HTML site from a set of notebooks using something like nbsphinx or Jupyter Book, simply to get access to an inbuilt search engine. But that somehow feels redundant to me. The HTML Jupyter Book form is not a collection of notebooks, nor does it provide a satisfying search environment. To access runnable notebooks you need to click through to open the notebook in another environment (for example, a MyBinder environment built from the repository of notebooks used to create the HTML pages), or return to the HTML environment and run code cells inline using something like Thebelab.

So I finally got round to considering this whole question again in the form of a quick sketch to see what an integrated Jupyter notebook server search engine might feel like. It’s still early days — the nbsearch tool is provided as a Jupyter server proxy application, rather than integrated as a Jupyter server extension available via an integrated tab, but that does mean it also works in a standalone mode.

The search engine is built on top of a SQLite database, served using datasette. The base UI was stolen wholesale from Simon Willison’s Fast Autocomplete Search for Your Website demo.

The repo is currently here.

The search index is currently based on a full text search index over notebook code and markdown cells. (At the moment, you have to generate the index manually from the command line. On the to do list for another sketch is an indexer that monitors the file system.) Cells are returned in a cell-type sensitive way:

Screenshot of initial nbsearch UI.
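As a rough sketch of the sort of thing the indexer has to do (the table and column names here are my own illustrative assumptions, not nbsearch’s actual schema), the code and markdown cells of each notebook can be dropped into an SQLite full text search table using sqlite-utils:

import json
from pathlib import Path

import sqlite_utils  # pip install sqlite-utils

db = sqlite_utils.Database("nbsearch.db")

def index_notebooks(folder="."):
    rows = []
    for path in Path(folder).rglob("*.ipynb"):
        nb = json.loads(path.read_text())
        for i, cell in enumerate(nb.get("cells", [])):
            if cell["cell_type"] in ("code", "markdown"):
                rows.append({
                    "notebook": str(path),
                    "cell_index": i,
                    "cell_type": cell["cell_type"],
                    "source": "".join(cell.get("source", [])),
                })
    db["cells"].insert_all(rows, pk=("notebook", "cell_index"), replace=True)
    # add a full text search index over the cell contents
    db["cells"].enable_fts(["source"], create_triggers=True)

index_notebooks()

The resulting database can then be served in the usual way with datasette nbsearch.db.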

Code cells are syntax highlighted using Prism.js, and feature a Copy button for copying the (unstyled) code (clipboard.js). Markdown cells are styled using a simple Javascript markdown parser (marked.js).

The code cells should also have line numbers, but this seems a little erratic at the moment; I can’t get local static js and css files to load properly under the Jupyter server proxy, so I’m using a CDN. The Prism.js line number extension is a separate CDN delivered script to the main Prism script, and it seems that the line number extension doesn’t necessarily load correctly? A race condition, maybe?

Each result item displays a link to the original notebook (although this doesn’t necessarily resolve correctly at the moment), along with a description of which cell in the notebook the result corresponds to. An inline graphic depicts the structure of the notebook (markdown cells are blue, code cells pink). Clicking the graphic toggles the display (show / hide) of that result’s cell group.

The contents of a cell are limited in terms of the number of characters displayed. Clicking the Show all cell button displays the full content. Two other buttons — Show previous cell and Show next cell — allow you to repeatedly grab additional cells surrounding the originally retrieved result cell.

I’ve also started experimenting with Thebelab code execution support. At the moment this is hardwired to use a MyBinder backend, but the intention is that if a local Jupyter server is available (eg as in the case when running nbsearch as a Jupyter server proxy application), it will use the local Jupyter server. (Ideally, it would also ensure the correct kernel is selected for any given notebook result.)

nbsearch UI with ThebeLab code execution example.

At the moment, things don’t work completely properly with Thebelab. If you run a query and "activate" Thebelab in the normal way, things work fine. But when I dynamically add new cells, they aren’t activated.

If I try to manually activate them via a cell-centric button:

then the run/restart buttons appear, but trying to run the cell just hangs on the "Waiting for kernel…" message.

At the moment, the code cell is non-editable, but making it editable should just be a case of tweaking the code cell attributes.

There are lots of other issues to consider regarding cell execution, such as when a cell requires other cells to have run previously. This could be managed by running another query to grab all the previous code cells associated with a particular code cell, and running those cells on a restarted kernel using Thebelab before running the current cell.
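For example, against the illustrative schema sketched above, grabbing the preceding code cells is a simple query (again, table and column names are assumptions, not nbsearch’s actual schema):

import sqlite3

def preceding_code_cells(db_path, notebook, cell_index):
    # fetch the source of all code cells before a given cell in a notebook
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT source FROM cells
        WHERE notebook = ? AND cell_type = 'code' AND cell_index < ?
        ORDER BY cell_index
        """,
        (notebook, cell_index),
    ).fetchall()
    conn.close()
    return [source for (source,) in rows]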

Providing an option to grab and display (and even copy) all the previous code in a notebook, or perhaps exploring the gather package for finding precursor cells, might be a useful facility anyway, even without the ability to execute the code directly.

At the moment, results are limited to the first ten. This needs tweaking, perhaps with a slider ranged over the total number of results for a particular query that lets you select how many of them to display.

A switch to limit results to just code or just markdown cells might also be useful, as would an indicator somewhere that shows the grouped number of hits per notebook, perhaps with selection of this group acting as a facet: selecting a particular notebook would then limit cell results to just that notebook, perhaps grouping and ordering cells within a notebook by cell order.

The ranking algorithm is something else that may be worth exploring more generally. One simple ranking tweak that may be useful in an educational setting could be to order results by notebook and cell order (for example, if notebooks are named according to some numbering convention: 01.1 Introduction to X, 01.2 X in more detail, 02.1, etc). Again, Simon Willison has led the way in some of the practicalities associated with exploring custom ranking schemes in his post Exploring search relevance algorithms with SQLite.
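Against the illustrative schema sketched earlier (not nbsearch’s actual tables), that sort of course-order ranking is just a SQL tweak:

# hypothetical query: return full text search hits in course order (notebook
# filename, then cell position) rather than by relevance score
query = """
SELECT cells.notebook, cells.cell_index,
       snippet(cells_fts, 0, '[', ']', '...', 12) AS hit
FROM cells_fts
JOIN cells ON cells.rowid = cells_fts.rowid
WHERE cells_fts MATCH :q
ORDER BY cells.notebook, cells.cell_index
LIMIT 10;
"""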

Way back when, when I originally started blogging, search was one of my favourite topics. I’ve neglected it over the years, but still think it has a lot to offer as a teaching and learning tool (eg things like Search Engine Powered Courses… and search hubs / discovered custom search engines). Many educators disagree with this approach because they like to think they are in control of the narrative, whereas I think that search, with a properly tuned ranking algorithm, can help support a student-demand-led, query-result-constructed, personalised structured narrative. Maybe it’s time for me to start playing with these ideas again…

Contextualised Search Result Displays

One of the prevalent topics covered in the early days of this blog concerned how to appropriate search tools and technologies and explore how they could be used as more general purpose technologies. Related to this were posts on making the most of document collections, or exploiting technologies complementary to search that returned results or content based on context.

For example:

There’s probably more…

(Like looking for shared text across documents to try to work out the provenance of a particular section of text as a document goes through multiple versions…)

Whatever…

So where are we at now…?

Mulling over recent updates to Parliamentary search, I started wondering about ranking and the linear display of results. I’ve always quite liked facet limits that filter out a subset of the results returned by the search term based on a particular attribute. For example, in an Amazon search, we’re probably all familiar with entering a general search term and then using category filters / facets in the left hand sidebar to narrow results down to books, or to subject categories within books.

Indeed, the faceted search basis of  “search + filter” is one that inspired many of my own hacks.

As well as linear displays of ranked results (ranked how is always an issue), every so often multi-dimensional result displays appear. For example, things like Bento box displays (examples) were all the rage in university library catalogues several years ago, where multiple topical results panels display results from different facets or collections in different boxes distributed on a 2D grid. I’m not sure if they’re still “a thing” or whether preferences have gone back to a linear stream of results, perhaps with faceting to limit results within a topic? I guess one of the issues now is the limited real estate of portrait orientation mobile phone displays compared to the more expansive real estate you get in a landscape oriented large screen desktop display? (Hmmm, thinks… are Netvibes and PageFlakes still a thing?)

Anyway, I’ve not been thinking about search much for years, so in a spirit of playfulness, here’s a line of thinking I think could be fun to explore: contextualised search results, or more specifically, contextualised search result displays.

This phrase unpacks in several ways depending on where you think the emphasis on “contextualised” lies. (“contextualised lies”… “contextualies”…. hmmm…)

For example, if we interpret contextualised in the sense of context sensitive relative to the “natural arranging” of the results returned, we might trivially think of things like map based displays for displaying the results of a search where search items are geotagged. Complementary to this are displays where the results have some sort of time dependency. This is often displayed in the form of a date based ranking, but why not display results in a calendar interface or timeline (e.g. tracking Parliamentary bill progress via a timeline)? Or where dates and locations are relevant to each resource, return the results via a calendar map display such as TimeMapper (more generally, see these different takes on storymaps). (I’ve always thought such displays should have two modes: a “show all” mode, and then a filtered mode, e.g. one that shows just the results for a particular time/geographical area limit.)

(One of the advantages of making search results available via a feed is that tinkerers can then easily wire the results into other sorts of display, particularly when feed items are also tagged, eg with facet information, dates, times, names of entities identified in the text, etc.)

A second sense in which we might think of contextualised search result displays is to identify the context of the user based on their interests. Given a huge set of linear search results, how might they then group, arrange or organise the results so that they can work with them more effectively?

Bento box displays offer a trivial start here for the visual display, for example by grouping differently faceted results in their own results panel. Looking at something like Parliamentary search, this might mean the user entering a search term and the results coming back in panels relating to different content types: results from research briefings in one panel, for example, from Hansard in another, written questions / answers in a third, and so on.

(Again, if results from the search engine are available as a tagged feed, it should be easy enough to roll your own display? Hmm… thinks.. what Javascript libraries would you use to display such a thing nowadays?)

It might also be possible to derive additional information from the results. For example, if results are tagged with the members associated with them (sat on the committee, asked the question, or was the person speaking in the returned Hansard result), then a simple ranked facet of the members appearing across all the resource types might identify a particular person as someone interested in the topic (expert search / discovery also used to be a big thing, I seem to remember?).

In terms of trying to imagine differently contextualised displays, what sorts of user / user interest might there be? Off the top of my head, I can imagine:

  • someone searching for a topic “in general”: so just give them a list of stuff ranked however the search algo ranks it;
  • someone searching for a topic in general, organised by format or type (e.g. research briefing, written question/answer, parliamentary debate, committee report, etc), in which case a faceted display or bento box display might work;
  • someone searching for something in response to a news item, in which case they might want something ordered by time and maybe boosted by news mentions as a ranking factor (reminds me of trying to track media mentions of press releases and my press release / poll report CSE);
  • someone searching around the activities of an MP, in which case, you might want something like TheyWorkForYou member pages or perhaps a calendar or timeline view of their activity, or a narrative chart (e.g. with one line for a member, then other lines for the sorts of interaction they have with a topic – committee, question, debate – with each node linking to the associated document);
  • someone trying to track something in the context of the progress of a piece of legislation (or committee inquiry), in which case you may want a timeline, narrative chart or storyline style view; and maybe a custom search hub that searches over all documents relating to that piece of evolving legislation;
  • someone interested in people interested in a topic – expert search, in other words;
  • someone interested in the engagement of a person or organisation with Parliamentary processes, such as witness appearances at committee, submissions to written evidence, etc; it would also be handy if this turned up government relations, such as an association with a government group (it would be nice if that was a register, with each group having a formal register entry that included things like members…). Showing the different sorts of process, and the stage of the process at which the interaction or mention occurred, could also be useful…

There are probably more…

Anyway, perhaps thinking about search could be fun again… So: does the new Parliamentary search make feeds available? And when are the Release TBC items listed on explore.data.parliament.uk going to be available?!:-)

Google Asks the Question

Via my feeds, a post on the Google Operating System blog that notes Google Converts Queries Into Questions:

When searching for [alcohol with the highest boiling], Google converted my query into a question: “Which alcohol has the highest boiling point?”

Me too:

Screenshot: Google search results for [alcohol with the highest boiling].

I ran a related query – alcohol with highest boiling point – which offered a range of related questions, albeit further down the results list:

Screenshot: Google search results for [alcohol with highest boiling point], with related questions further down the listing.

Google results trying to draw you into a conversation – and hence running more queries (or questions…)?

Creating a Google Custom Search Engine Over Hyperlocal Sites

Years ago, I used to spend quite a bit of time playing with Google Custom Search Engines, which allow you to run searches over a specified list of sites, trying to encourage librarians and educators to think about ways in which we might make use of them. I was reminded of this technology yesterday at a Community Journalism conference, so thought it might be worth posting a quick how-to on setting up a CSE, in particular one that searches over the websites of hyperlocals listed on LocalWebList.net. (If you don’t want to see how it’s done, but do want to try it out, here’s my half-hour hack LocalWebList UK hyperlocal CSE.)

One way of creating a CSE is to manually enter the URLs of the sites you want to search over. Another is to use an annotations file that contains the URLs of the sites you want to search over. These files can be hosted on your own site, or uploaded to Google (in the latter case, there is a (small) limit on the size of file you can upload – 30KB).

Screenshot: "Annotations: Defining Sites to Search", from the Google Custom Search developer documentation.

The simplest annotations file is a two column (URL and Label) tab separated value file containing one row per site you want to include. Typically, sites are included using a URL pattern – onthewight.com/*, for example, to say “index all the pages on the onthewight.com domain”.
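For example, a minimal annotations file might look something like the following (one tab separated row per site; the label value here is a placeholder – the actual value is the CSE code we grab below):

onthewight.com/*	_cse_yourcselabel
example.com/*	_cse_yourcselabel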

The data file published by the LocalWebList includes a column containing the URL of the homepage of each hyperlocal site listed. We can download the datafile and then open it in the powerful data cleaning tool OpenRefine to inspect it:

Screenshots: the LocalWebList data file opened for inspection in OpenRefine.

If you skim through the URLs, you might notice that several sites have simple URLs (example.com), others are a bit more cluttered (example.com/index2.html), others point to sites like facebook. I’m going to make an arbitrary decision to ignore facebook sites and define patterns based on all the pages in a single domain.

To do that, I’m going to create a new column (url2) in OpenRefine from the URL column, that defines just such a pattern based on the original URL.

Screenshot: creating the new url2 column in OpenRefine.

The following expression:

value.replace(/https?:\/\/([^\/]*)/,'$1').split('/')[0]+'/*'

uses a regular expression to manage just such a transformation.
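For what it’s worth, the same cleaning step is easy enough outside OpenRefine too; here’s a rough Python equivalent of that GREL expression:

import re

def domain_pattern(url):
    # strip the scheme, keep everything up to the first "/", append "/*"
    return re.sub(r"https?://([^/]*)", r"\1", url).split("/")[0] + "/*"

assert domain_pattern("http://example.com/index2.html") == "example.com/*"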

Screenshot: the url2 transformation applied in OpenRefine.

I can inspect the unique values generated by this transformation by looking at a text facet applied to the new url2 column:

Screenshot: a text facet applied to the new url2 column.

If you sort by count in the text facet, you will see several of the hyperlocal sites have websites hosted on aboutmyarea, or facebook. (Click on one of those links in the text facet to show the sites associated with those domains.) I am going to discount those links from my CSE, so hover over the link and click on the “include” setting to toggle it to “exclude”. Then click on the “invert” option to show all the sites that aren’t the ones you’ve selected as excluded.

Screenshot: excluding the aboutmyarea and facebook domains via the text facet.

This leaves us with sites that are more likely unique:

Screenshot: the filtered list of sites in OpenRefine.

Having got a filtered list of sites, we can generate an annotations file containing the URL patterns we want to search over and the CSE label. The label identifies to Google which CSE the URLs in the annotations file apply to. We get that code by generating a CSE.

When creating a new CSE, along with giving it a name, you’ll also have to seed it with at least one URL. Simply enter a pattern for a URL you know you want to include in the search engine.

Screenshot: creating a new Custom Search Engine.

Hit create, and you’ll have a new CSE…

Screenshot: CSE creation confirmation page.

From the “Advanced” tab, go to the CSE annotations area and find the code for your CSE:

Screenshot: the CSE Advanced tab, showing the CSE annotations area.

Now we’re in a position to add the CSE code to our annotations file – so copy the CSE label for your CSE… We can create the annotations file in OpenRefine from the “Export” menu,  where we select “Templating”:

Screenshot: the OpenRefine Export menu, with the Templating option.

The templating option allows us to define a custom export template. The template is built up from a header, a row separator, a footer, and a row pattern that describes how to write out each row. I define a simple template as follows, and then export the file.

Screenshot: defining a simple export template in OpenRefine.
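For reference, and assuming OpenRefine’s templating syntax (with an empty header and footer, a newline row separator, and a placeholder CSE label), the row pattern might look something like:

{{cells["url2"].value}}	_cse_yourcselabel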

(Note  – there are other ways I could have done this (indeed, there are often “other ways”!). For example, I could have created a new column containing just the CSE label value, and then done a custom table export, selecting the url2 column and label column, along with the TSV output format.)

Export the annotations file and then import it into the CSE – hit the “Add” button in the CSE annotations area.

Screenshot: uploading the annotations file to the CSE.

Once uploaded (and remember, there is a 30KB file size limit on this route), go back to the Basics tab: you should find that your custom search engine now lists the sites you included in your annotations file as the sites to be searched over, and provides a link to your CSE.

Screenshot: the CSE Basics tab listing the sites to be searched over.

You can tweak some of the styling for the CSE from the “Look and Feel” menu option in the CSE admin pages sidebar.

Screenshot: the CSE Look and Feel layout options.

If you now click on your CSE URL you should find you have a minimal Google Custom Search engine that searches over several hundred UK hyperlocal websites.

Screenshot: the finished Google Custom Search engine.

To add in some of the sites we originally excluded, eg the ones on the aboutmyarea domain, we could explicitly add specific URL patterns via the CSE control panel.

Google Custom search engines can be really quick to set up in a minimal form, but can also be customised further – for example, with tweaks to the ranking algorithm or with custom annotations (see for example Search Engine Powered Courses).

You can also generate lists of URLs from things like homepage links in Twitter bios grabbed from a Twitter list (eg Using Twitter Lists to Define Custom Search Engines – that code appears to have rotted slightly, but I have a fix… Let me know via the comments if you’re interested in generating CSEs from Twitter lists etc).

As I mentioned at the start, it’s been some years since I played with Google Custom Search Engines – I was really hopeful for them at one point, but Google never really seems to give them any love (not necessarily a bad thing – perhaps they are just enough over and under the radar for Google to cut them?), and I couldn’t seem to persuade anyone else (in the OU at least) that they were worth spending any time on.

I think a few librarians did pick up on them though! And if there is interest in the hyperlocal community for seeing what we might do with them, I’d be happy to put my thinking cap back on, work up some more tutorials or use cases, and run training workshops etc etc.

Polling the News…

Towards the end of last week I attended a two day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for students, but it got me thinking again about the way that content gets created and shunted around the news world.

Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian Open Platform content API.) But I also started to wonder about how we could map the fan out from independent or commissioned polls or surveys as they get reported in the news media, and then maybe start to find their way into other reports and documents by virtue of having been reported in the news.

This sort of thing is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.

Screenshot: Aberdeen press release items alongside related news stories in a timeline view.

I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.

So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).

Here’s part of the setup, showing the page URL patterns to be searched over.

List the sites you want to search over

I added additional refinements to the tab that searches over the news organisations so that it only pulls out pages where “poll” or “survey” is mentioned. Note that if these words are indexed in the chrome around the news story (eg in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page where an irrelevant story is mentioned because a poll is linked to in the sidebar).

Add refinements

From way back when, when I took more of an interest in search than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I’m not so sure any more…

Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:

Screenshot: results from the pollsters for a tuition fees related query.

And an example of results from the news orgs:

Tuition fees media

For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):

PR search

The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…

PR search for quote

To make life easier, an old bookmarklet generator I produced way back when on an Arcadia fellowship at the Cambridge University Library can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.

Give it a sensible title; then this is the URL chunk you need to add:

https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q=

Bookmarklet generator

Sigh.. I used to have so much fun…

PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:

javascript:(function()%7Bvar t%3Dwindow.getSelection%3Fwindow.getSelection().toString()%3Adocument.selection.createRange().text%3Bwindow.location%3D%27https%3A%2F%2Fwww.google.com%2Fcse%2Fpublicurl%3Fcx%3D016419300868826941330%3Awvfrmcn2oxc%26q%3D"%27%2Bt%2B%27"%27%3B%7D)()
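Decoded, for readability, that’s just:

javascript:(function(){
  var t = window.getSelection ? window.getSelection().toString()
                              : document.selection.createRange().text;
  window.location = 'https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc&q="' + t + '"';
})()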

PPS I’ve started to add additional search domains to the PR search engine to include political speeches.

Invisible Library Support – Now You Can’t Afford Not to be Social?

If you live by pop tech feeds or Twitter, you’ve probably heard that Google is rolling out a new style of socially powered search results. If not, or if you’re still not clear about what it entails, read Phil Bradley’s post on the matter: Why Google Search Plus is a disaster for search.

Done that? If not, why not? This post isn’t likely to make much sense at all if you don’t know the context. Here’s the link again: Why Google Search Plus is a disaster for search

So the starting point for this post is this: Google is in the process of rolling out a new web search service that (optionally) offers very personal search results containing content from folk that Google thinks you’re associated with, and that Google is willing to show you based on license agreements and corporate politics.

Think about this for a minute… in the totally personalised view, folk will only see content that their friends have published or otherwise shared…

In Could Librarians Be Influential Friends?, I wondered aloud whether it made sense for librarians and other folk involved with providing support relating to resource discovery and recommendation to start a) creating social network profiles and encouraging their patrons to friend them, and b) recommending resources using those profiles in order to start influencing the ordering/ranking of results in patrons’ search results based on those personal recommendations. The idea here was that you could start to make invisible, frictionless recommendations by influencing the search engine results returned to your patrons. (The results aren’t quite invisible, because your profile picture may appear by a result showing that you recommend it. They’re frictionless in the sense that, having made the original recommendation, you no longer have to do any work in trying to bring it to the attention of your patron – the search engines take care of that for you (okay, I know that’s a simplistic view;-).) [Hmm.. how about referring to it as recommendation mode support?]

(Note that there is a complementary form of support to this approach, which I’ve previously referred to as Invisible Library Tech Support (responsive mode support?; also frictionless, I guess, at least from the perspective of the patron), in which librarians friend their patrons or monitor generic search terms/tags on Q&A sites and then proactively respond to requests that users post into their social networks more generally.)

With the aggressive stance Google now seems to be taking towards pushing social circle powered results, I think we need to face up to the fact – as Phil Bradley pointed out – that if librarians want to make sure they’re heard by their patrons, they’re going to need to start setting up social profiles, getting their patrons to friend them, and making content and resource recommendations anyway in order to make them available as resources that are indexed by patrons’ personal search engines. The same goes for publishers of OERs, academic teaching staff, and “courses”.

If we think of Google social search as searching over custom search engines bounded by resources created and recommended by members of a user’s social circle, then if you want to make (invisible) recommendations to a user via their (personalised) web search results, you’re going to need to make sure that the resources/content you want to recommend are indexed by their personal search engines. Which means: a) you need to friend them; and b) you need to share that content/those resources in that social context.

(Hmmm… this makes me think there may be something in the course custom search engine approach after all… Specifically, if the course has a social profile, and recommends the links contained within the course via that profile, they become part of the personalised search index of students following that course profile?)

Just by the by, as another example of Google completely messing things up at the moment, I notice that when I share links to posts on this blog via Google+, they don’t appear as trackbacks to the post in question. Which means that if someone refers to a post on this blog on Google+, I don’t know about it… whereas if they blog the link, I do…

See also my chronologically ordered posts on the eroding notion of “Google Ground Truth”.

[Invisible vs frictionless (and various notions of that word) is all getting a bit garbled; see eg @briankelly’s Should Higher Education Welcome Frictionless Sharing and my comments to it for a little more on this…]

PS I’ve been getting increasingly infuriated by the clutter around, and lack of variation within, Google search results lately, so I changed my default search engine to Bing. The results are a bit all over the place compared to the Google results I tend to get, but this may be down in part to personalisation/training. I am still making occasional forays to Google, but for now, Bing is it… (because Bing is not Google…)

PPS Hah – just noticed: Google Search Plus doesn’t mean plus in the sense of search more, it means search Google+, which is less, or minus the wider world view…;-)

PPPS I keep meaning to blog this, and keep forgetting: Turn[ing] off [Google] search history personalization, in particular: “If you’ve disabled signed-out search history personalization, you’ll need to disable it again after clearing your browser cookies. Clearing your Google cookie clears your search settings, thereby turning history-based customizations back on.” Which is to say, when you disable personalisation, you don’t disable personalisation against your Google account, you disable it only insofar as it relates to your current cookie ID?

Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents

Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.

The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.

One XML doc was used per week to generate the separate HTML pages for that week’s study.

One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)

Through using the course custom search engine myself, I have found several issues with it:

1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.

2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, while more specific queries building on the original search term delivered more, and different, results. This gives a really odd search experience, and one that I suspect would put many users off.

3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for example of what’s been lost, see e.g. Stone Temple Consulting: Google Co-Op Subscribed Links).

4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:

<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif"
  	title="Week 4"
  	id="T151_Week4"
  	queries="week 4,T151 week 4,t151 week 4"
  	url="http://www.open.ac.uk"
  	description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>

Course CSE - week promotion

There are several components we need to consider here:

  1. queries: these are the phrases that are used to trigger the display of the particular promotions links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that at most three (3) promotions can be displayed for any query. Queries may be based around structural components (such as study week or topic number), subject matter terms (e.g. tags, keywords, or headings) and resource type (eg audio/video material, academic readings etc), although we might argue that resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T151 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
  2. title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
  3. description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
  4. url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?

Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.

In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):

Course CSE - multiple promotions

Or what a particular question is within a particular topic:

Course CSE - what's the question?

The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion for question 4 in topic 4A:

<Promotion 
  	title="Topic Exploration 4A Question 4"
  	queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a"
  	... />

The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3” and “topic 1a q4”… (You can try out the CSE here.)
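As a sketch (the function here is mine, not the actual generator code), query lists like the one in the fragment above can be spun up from the structural IDs along these lines:

def promotion_queries(topic, question=None):
    # generate the canned navigational queries for a topic / question promotion
    base = f"topic {topic}" + (f" q{question}" if question is not None else "")
    variants = [base, f"T151 {base}", f"t151 {base}"]
    if question is not None:
        # a question promotion is also triggered by the bare topic queries
        variants += promotion_queries(topic)
    return variants

print(",".join(promotion_queries("4a", 4)))
# topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a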

The promotions I have specified so far also lack several things:

1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)

2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly through a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)

At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?

At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…

[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]

PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the search engine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?

My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.

Search hierarchy

It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?

*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:

  • academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
  • news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
  • video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
  • books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.

I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?

Fudging the CSE with a local search engine
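A minimal sketch of that sort of workaround, using an SQLite full text index (the URL, table and column names here are illustrative; a real version would also want to extract the text from the HTML rather than index the raw markup):

import sqlite3
import urllib.request

conn = sqlite3.connect("course_resources.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS resources USING fts5(url, text)")

# grab a full text copy of each resource linked to from the course
for url in ["https://example.com/linked-resource"]:
    html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    conn.execute("INSERT INTO resources (url, text) VALUES (?, ?)", (url, html))
conn.commit()

# search the local index, aliasing results back to the original source links
for (url,) in conn.execute("SELECT url FROM resources WHERE resources MATCH ?", ("game",)):
    print(url)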

However, that wasn’t what this experiment was about!;-)

Course Resources as part of a larger connected graph

Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
– links to papers referenced in the article;
– links to papers that cite the article;
– links to other papers written by the same author;
– links to other papers in collections containing the article on services such as Mendeley;
– links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.

Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is a gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (eg through following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.

A Quick Intro to Google Custom Search Engine Definition Files

In Search Engine Powered Courses…, I took an initial, baby step to demonstrate one way in which a promoted link might be used be within a course specific custom search engine. In the next post in this series, I will describe how to influence the positioning of results within a Google custom search engine by boosting their ranking, as well as how results may be ‘faceted’ into different results sets through the use of labels.

In this post, I thought it would be worth taking a step back and reviewing the three configuration files we have access to when defining a Google custom search engine: the configuration file, the promotions file, and the annotations file. If you create a minimal Google custom search engine using the CSE management tools, and then go to the Advanced page, you will see options that allow you to upload the configuration and annotations file. The promotions file can be imported via the Promotions page.

So what does each of these files do?

  • The configuration file defines the top level configuration of the search engine. The easiest way of obtaining a template for a CSE is to create a minimal search engine using the CSE management tools, and then export the configuration file from the Advanced page. The configuration file defines, among other things: whether the search engine will search over the whole web, prioritising (or ‘BOOSTing’) sites and pages indexed explicitly by the CSE, or whether it will just return results from the explicitly indexed pages (a FILTER style search engine); a definition of the labels, or facets, that allow different search refinements to be applied as different search strategy contexts within the CSE; some styling information; and information relating to Subscribed Links (more of them in another post, if they’re still supported by then…).
  • The promotions file allows you to define promoted links within a CSE; in Search Engine Powered Courses…, I give an example of how these might be used in a course search engine.
  • The annotations file identifies the sites and pages that are specific members of the CSE index, as well as how they should be handled (eg the extent to which they should be positively or negatively boosted in the search engine results listing, whether they should appear in the top few results, and what labels or facets should apply to them).

It’s also possible to customise the styling/presentation of the search engine, but that’s a shiny, shiny feature, so probably not something I’ll be looking at…

PS I just noticed you can now manage Google Analytics settings for custom search engines (which allows you to log search queries) from within the CSE control panel… I’m still not sure how easy it is to track which results get clicked through, though?

Search Engine Powered Courses…

How can we use customised search engines to support uncourses, or the course models used to support MOOC style offerings?

To set the scene, here’s what Stephen Downes wrote recently on the topic of How to participate in a MOOC:

You will notice quickly that there is far too much information being posted in the course for any one person to consume. We tried to start slowly with just a few resources, but it quickly turns into a deluge.

You will be provided with summaries and links to dozens, maybe hundreds, maybe even thousands of web posts, articles from journals and magazines, videos and lectures, audio recordings, live online sessions, discussion groups, and more. Very quickly, you may feel overwhelmed.

Don’t let it intimidate you. Think of it as being like a grocery store or marketplace. Nobody is expected to sample and try everything. Rather, the purpose is to provide a wide selection to allow you to pick and choose what’s of interest to you.

This is an important part of the connectivist model being used in this course. The idea is that there is no one central curriculum that every person follows. The learning takes place through the interaction with resources and course participants, not through memorizing content. By selecting your own materials, you create your own unique perspective on the subject matter.

It is the interaction between these unique perspectives that makes a connectivist course interesting. Each person brings something new to the conversation. So you learn by interacting rather than by merely consuming.

When I put together the OU course T151, the original vision revolved around a couple of principles:

1) the course would be built in part around materials produced in public as part of the Digital Worlds uncourse;

2) each week’s offering would follow a similar model: one or two topic explorations, plus an activity and forum discussion time.

In addition, the topic explorations would have a standard format: scene setting, and maybe a teaser question with answer reveal or call to action in the forums; a set of topic exploration questions to frame the topic exploration; a set of resources related to the topic at hand, organised by type (academic readings (via a libezproxy link for subscription content so no downstream logins are required to access the content), Digital Worlds resources, weblinks (industry or well informed blogs, news sites etc), audio and video resources); and a reflective essay by the instructor exploring some of the themes raised in the questions and referring to some of the resources. The aim of the reflective essay was to model the sort of exploration or investigation the student might engage in.

(I’d probably just have a mixed bag of resources listed now, along with a faceting option to focus in on readings, videos, etc.)

The idea behind designing the course in this way was that it would be componentised as much as possible, to allow flexibility in swapping resources or even topics in and out, as well as (though we never managed this), allowing the freedom to study the topics in an arbitrary order. Note: I realised today that to make the materials more easily maintainable, a set of ‘Recent links’ might be identified that weren’t referred to in the ‘My Reflections’ response. That is, they could be completely free standing, and would have no side effects if replaced.

As far as the provision of linked resources went, the original model was that the links should be fed into the course materials from an instructor maintained bookmark collection (for an early take on this, see Managing Bookmarks, with a proof of concept demo at CourseLinks Demo – hmmm, everything except the dynamic link injection appears to have rotted:-().

The design of the questions/resources page was intended to have the scoping questions at the top of the page, and then the suggested resources presented in a style reminiscent of a search engine results listing, the idea being that we would present the students with too many resources for them to comfortably read in the allocated time, so that they would have to explore the resources from their own perspective (eg given their current level of understanding/knowledge, their personal interests, and so on). In one of my more radical moments, I suggested that the resources would actually be pulled in from a curated/custom search engine ‘live’, according to search terms specially selected around the current topic and framing questions, but I was overruled on that. However, the course does have a Google custom search engine associated with it which searches over materials that are linked to from the course.

So that’s the context…

Where I’m at now is pondering how we can use an enhanced custom search engine as a delivery platform for a resource based uncourse. So here’s my first thought: using a Google Custom Search Engine populated with curated resources in a particular area, can we use Google CSE Promotions to help scaffold a topic exploration?

Here’s my first promotions file:

<Promotions>
   <Promotion id="t151_1a" 
        queries="topic 1a, Topic 1A, topic exploration 1a, topic exploration 1A, topic 1A, what is a game, game definition" 
        title="T151 Topic Exploration 1A - So what is a game?" 
        url="http://digitalworlds.wordpress.com/2008/03/05/so-what-is-a-game/"
        description="The aim of this topic is to think about what makes a game a game. Spend a minute or two to come up with your own definition. If you're stuck, read through the Digital Worlds post 'So what is a game?'"
        image_url="http://kmi.open.ac.uk/images/ou-logo.gif" />
</Promotions>

It’s running on the Digital Worlds Search Engine, so if you want to try it out, try entering the search phrase what is a game or game definition.

T151 CSE promotion - game definition

(This example suggests to me that it would also make sense to use result boosting to boost the key readings/suggested resources I proposed in the topic materials so that they appear nearer the top of the results (that’ll be the focus of a future post;-))

The promotion displays at the top of the results listing if the specified queries match the search terms the user enters. My initial feeling is that to bootstrap the process, we need to handle:

– queries that allow a user to call on a starting point for a topic exploration by specifically identifying that topic;
– “naive queries”: one reason for using the resource-search model is to try to help students develop effective information skills relating to search. Promotions (and result boosting) allow us to pick up on anticipated naive queries (or popular queries identified from search logs), and suggest a starting point for a sensible way in to the topic. Alternatively, they could be used to offer suggestions for improved or refined searches, or search strategy hints. (I’m reminded of Dave Pattern’s work with guided searches/keyword refinements in the University of Huddersfield Library catalogue in this context).

Here’s another example using the same promotion, but on a different search term:

T151 CSE - topic 1a

Of course, we could also start to turn the search engine into something like an adventure game engine. So for example, if we type: start or about, we might get something like:

T151 CSE - start

(The link I associated with start should really point to the course introduction page in the VLE…)

We can also use the search context to provide pastoral or study skills support:

T151 CSE - pastoral

These sorts of promotions/enhancements might be produced centrally and rolled out across course search engines, leaving the course and discipline related customisations to the course team and associated subject librarians.

Just a final note: ignoring resource limitations on Google CSEs for a moment, we might imagine the following scenarios for their roll out:

1) course wide: bespoke CSEs are commissioned for each course, although they may be supplemented by generic enhancements (eg relating to study skills);

2) qualification based: the CSE is defined at the qualification level, and students call on particular course enhancements by prefacing the search with the course code; it might be that students also see a personalised view of the qualification CSE that is tuned to their current year of study.

3) university wide: the CSE is defined at the university level, and students call on particular course or qualification level enhancements by prefacing the search with the course or qualification code.

From “Special Result” to “Promotion” in Google CSEs

In passing, I noticed I had a broken link to a Google CSE documentation page:

http://code.google.com/apis/customsearch/docs/special_results.html

Searching a little, I found the page had moved to

http://code.google.com/apis/customsearch/docs/promotions.html

A cached version of the originally linked page is still available, so I did a side-by-side comparison:

From 'special result' to 'promotion'

Hmmm…