One of the prevalent topics covered in the early days of this blog concerned how to appropriate search tools and technologies and explore how they could be used as more general purpose technologies. Related to this were posts on making the most of document collections, or exploiting technologies complementary to search that returned results or content based on context.
- custom search engines (see also this example from an Arcadia Project blog post – Custom Search Engines On Library Websites – which uses a tabbed Google CSE to search over university websites, Parliamentary research briefings, science protocols and other authoritative sources). If you check the right hand sidebar of the OUseful.info blog, you’ll see some links to custom search engines (though I’m not sure if they still work!);
- search hubs and feed powered / dynamic custom search engines: the idea behind these search engines was that any collection of links could be used to limit a search either to just those links, or prioritising results from those links; so for example, a Committee inquiry could act as a search hub and generate a custom search engine that searched over all the committee evidence as well as reports identified as relevant to the inquiry (for example, the sorts of report that turn up being referenced in a committee report);
- structured products from content repositories (for example, generating alternative interfaces such as topical mindmap representations of search results from structured content);
- using enhanced search results to deliver a course;
- from WriteToReply, paragraph level search results (and paragraph embedding / transclusion); we also explored report unbundling, displaying sections of a report via a Bento box dashboard display;
- appropriating ad servers (not strictly search, but it is relevant to exploiting context and using ad servers to deliver contextualised content rather than ads).
There’s probably more…
(Like looking for shared text across documents to try to work out the provenance of a particular section of text as a document goes through multiple versions…)
So where are we at now…?
Mulling over recent updates to Parliamentary search, I started wondering about ranking and the linear display of results. I’ve always quite liked facet limits that will filter out a subset of results returned by the search term based on a particular attribute. For example, in an Amazon search, we’re probably all familiar with entering a general search term then using category filters / facets in the left hand sidebar to narrow results down to books, or subject categories within books.
Indeed, the faceted search basis of “search + filter” is one that inspired many of my own hacks.
As well as linear displays of ranked results (ranked how is always an issue), every so often multi-dimensional result displays appear. For example, things like Bento box displays (examples) were all the rage in university library catalogues several years ago, where multiple topical results panels display results from different facets or collections in different boxes distributed on a 2D grid. I’m not sure if they’re still “a thing” or whether preferences have gone back to a linear stream of results, perhaps with faceting to limit results within a topic? I guess one of the issues now is limited real estate on portrait orientation mobile phone displays compared to more expansive real estate you get in a landscape oriented large screen desktop display? (Hmmm, thinks… are Netvibes and PageFlakes still a thing?)
Anyway, I’ve not been thinking about search much for years, so in a spirit of playfulness, here’s a line of thinking I think could be fun to explore: contextualised search results, or more specifically, contextualised search result displays.
This phrase unpacks in several ways depending on where you think the emphasis on “contextualised” lies. (“contextualised lies”… “contextualies”…. hmmm…)
For example, if we interpret contextualised in the sense of context sensitive relative to the “natural arranging” of the results returned, we might trivially think of things like map based displays for displaying the results of a search where search items are geotagged. Complementary to this are displays where the results have some sort of time dependency. This is often displayed in the form of a date based ranking, but why not display results in a calendar interface or timeline (e.g. tracking Parliamentary bill process via a timeline)? Or where dates and locations are relevant to each resource, return the results via a calendar map display such as TimeMapper (more generally, see these different takes on storymaps). (I’ve always thought such displays should have two modes: a “show all” mode, and then a filtered mode, e.g. that shows just results for a particular time/geographical area limit.)
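To make the two modes concrete, here is a minimal Python sketch. It assumes each result carries a date and a latitude/longitude pair; the `RESULTS` data, the field names and the `filter_results` function are all made up for illustration. Calling it with no limits gives the “show all” mode, and passing a date range and/or bounding box gives the filtered mode.

```python
from datetime import date

# Hypothetical search results: each has a title, a date and a (lat, lon) location.
RESULTS = [
    {"title": "Bill second reading", "date": date(2017, 3, 1), "latlon": (51.50, -0.12)},
    {"title": "Committee visit", "date": date(2017, 6, 15), "latlon": (53.48, -2.24)},
    {"title": "Written evidence", "date": date(2018, 1, 10), "latlon": (51.45, -2.59)},
]


def filter_results(results, start=None, end=None, bbox=None):
    """Return results within an optional date range and bounding box.

    bbox is (min_lat, min_lon, max_lat, max_lon). With no limits set,
    every result is returned, which gives the "show all" mode.
    """
    kept = []
    for r in results:
        if start and r["date"] < start:
            continue
        if end and r["date"] > end:
            continue
        if bbox:
            lat, lon = r["latlon"]
            min_lat, min_lon, max_lat, max_lon = bbox
            if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
                continue
        kept.append(r)
    return kept
```

The filtered list could then be handed to whatever map, calendar or timeline widget is doing the display.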
(One of the advantages of making search results available via a feed is that tinkerers can then easily wire the results into other sorts of display, particularly when feed items are also tagged, eg with facet information, dates, times, names of entities identified in the text, etc.)
A second sense in which we might think of contextualised search result displays is to identify the context of the user based on their interests. Given a huge set of linear search results, how might they then group, arrange or organise the results so that they can work with them more effectively?
Bento box displays offer a trivial start here for the visual display, for example by grouping differently faceted results in their own results panel. Looking at something like Parliamentary search, this might mean the user entering a search term and the results coming back in panels relating to different content types: results from research briefings in one panel, for example, from Hansard in another, written questions / answers in a third, and so on.
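As a rough sketch of that grouping step, assuming the search results arrive as a ranked list of items that each carry a content type tag (the `type` field name and the sample titles are invented for the example):

```python
from collections import defaultdict


def bento_panels(results, facet="type", per_panel=3):
    """Group a ranked, linear result list into panels keyed by a facet value.

    Order within each panel follows the original ranking; each panel is
    truncated to per_panel items.
    """
    panels = defaultdict(list)
    for r in results:
        key = r.get(facet, "other")
        if len(panels[key]) < per_panel:
            panels[key].append(r["title"])
    return dict(panels)


# Hypothetical results, in the order the search engine ranked them.
results = [
    {"title": "Briefing A", "type": "research briefing"},
    {"title": "Debate B", "type": "Hansard"},
    {"title": "Briefing C", "type": "research briefing"},
    {"title": "Question D", "type": "written question"},
]
panels = bento_panels(results)
```

Each key in `panels` would then map onto one box in the grid, with the panel order following the first appearance of each content type in the ranked list.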
It might also be possible to derive additional information from the results. For example, if results are tagged with the members associated with them (the member sat on that committee, asked that question, or was the person speaking in the Hansard extract returned), then a simple ranked facet of members across all the resource types might identify the people most interested in the topic (expert search / discovery also used to be a big thing, I seem to remember?).
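A ranked member facet of that sort is little more than a count. The following sketch assumes each result has a `members` list of names (a made-up field), and counts a member once per result regardless of resource type:

```python
from collections import Counter


def rank_members(results):
    """Count how many results each member is tagged on, most frequent first."""
    counts = Counter()
    for r in results:
        # Count each member once per result, however often they are tagged in it.
        counts.update(set(r.get("members", [])))
    return counts.most_common()


# Hypothetical tagged results drawn from different resource types.
tagged = [
    {"type": "written question", "members": ["Member A"]},
    {"type": "Hansard", "members": ["Member A", "Member B"]},
    {"type": "committee report", "members": ["Member A", "Member C", "Member C"]},
]
ranking = rank_members(tagged)
```

A raw count like this says who turns up most often, not who knows most, so it is only a starting point for anything claiming to be expert search.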
In terms of trying to imagine differently contextualised displays, what sorts of user / user interest might there be? Off the top of my head, I can imagine:
- someone searching for a topic “in general”: so just give them a list of stuff ranked however the search algo ranks it;
- someone searching for a topic in general, organised by format or type (e.g. research briefing, written question/answer, parliamentary debate, committee report, etc), in which case a faceted display or bento box display might work;
- someone searching for something in response to a news item, in which case they might want something ordered by time and maybe boosted by news mentions as a ranking factor (reminds me of trying to track media mentions of press releases and my press release / poll report CSE);
- someone searching around the activities of an MP, in which case, you might want something like TheyWorkForYou member pages or perhaps a calendar or timeline view of their activity, or a narrative chart (e.g. with one line for a member, then other lines for the sorts of interaction they have with a topic – committee, question, debate – with each node linking to the associated document);
- someone trying to track something in the context of the progress of a piece of legislation (or committee inquiry), in which case you may want a timeline, narrative chart or storyline style view; and maybe a custom search hub that searches over all documents relating to that piece of evolving legislation;
- someone interested in people interested in a topic – expert search, in other words;
- someone interested in the engagement of a person or organisation with Parliamentary processes, such as witness appearances at committee, submissions to written evidence, etc; it would also be handy if this turned up government relations, such as an association with a government group (it would be nice if that was a register, with each group having a formal register entry that included things like members…). Showing the different sorts of process, and the stage of the process at which the interaction or mention occurred could also be useful….
There are probably more…
Anyway, perhaps thinking about search could be fun again… So: does the new Parliamentary search make feeds available? And when are the Release TBC items listed on explore.data.parliament.uk going to be available?!:-)
Via my feeds, a post on the Google Operating System blog that notes Google Converts Queries Into Questions:
When searching for [alcohol with the highest boiling], Google converted my query into a question: “Which alcohol has the highest boiling point?”
I ran a related query – alcohol with highest boiling point – which offered a range of related questions, albeit further down the results list:
Google results trying to draw you into a conversation – and hence running more queries (or questions…)?
Years ago, I used to spend quite a bit of time playing with Google Custom Search Engines, which allow you to run searches over a specified list of sites, trying to encourage librarians and educators to think about ways in which we might make use of them. I was reminded of this technology yesterday at a Community Journalism conference, so thought it might be worth posting a quick how-to on setting up a CSE, in particular one that searches over the websites of hyperlocals listed on LocalWebList.net. (If you don’t want to see how it’s done, but do want to try it out, here’s my half-hour hack LocalWebList UK hyperlocal CSE.)
One way of creating a CSE is to manually enter the URLs of the sites you want to search over. Another is to use an annotations file that contains the URLs of sites you want to search over. These files can be hosted on your own site, or uploaded to Google (in the latter case, there is a (small) limit on the size of file you can upload – 30KB).
The simplest annotations file is a two column (URL and Label) tab separated value file containing one row per site you want to include. Typically, sites are included using a URL pattern – onthewight.com/* for example, to say “index over all the pages on the onthewight.com domain”.
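By way of illustration, here is a minimal Python sketch that writes such a two column file. The header row, the `annotations_tsv` function name and the `_cse_exampleid` label are assumptions made for the example; the real label comes from the CSE admin pages.

```python
import csv
import io


def annotations_tsv(patterns, label):
    """Build a two column (URL, Label) tab separated annotations file as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["URL", "Label"])  # header row is an assumption
    for pattern in patterns:
        writer.writerow([pattern, label])
    return buffer.getvalue()


# "_cse_exampleid" is a placeholder, not a real CSE label.
tsv = annotations_tsv(["onthewight.com/*", "example.org/*"], "_cse_exampleid")
```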
The data file published by the LocalWebList includes a column containing the URL for the homepage for each hyperlocal site listed. We can download the datafile and then open it in the powerful data cleaning tool OpenRefine to inspect it:
If you skim through the URLs, you might notice that several sites have simple URLs (example.com), others are a bit more cluttered (example.com/index2.html), others point to sites like facebook. I’m going to make an arbitrary decision to ignore facebook sites and define patterns based on all the pages in a single domain.
To do that, I’m going to create a new column (url2) in OpenRefine from the URL column, that defines just such a pattern based on the original URL.
The following expression:
uses a regular expression to manage just such a transformation.
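For anyone working outside OpenRefine, a rough Python equivalent of that transformation might look like the following sketch. It is not the OpenRefine expression itself; the `domain_pattern` name is made up, and it leans on the standard library URL parser, with one small regular expression to spot a missing scheme.

```python
import re
from urllib.parse import urlparse


def domain_pattern(url):
    """Reduce a homepage URL to a 'domain/*' pattern covering the whole site."""
    url = url.strip()
    # urlparse only fills in netloc when a scheme (or leading //) is present.
    if not re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", url):
        url = "http://" + url
    host = urlparse(url).netloc.lower()
    return host + "/*" if host else ""
```

So a cluttered URL such as example.com/index2.html and a bare example.com both end up as the same example.com/* pattern.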
I can inspect the unique values generated by this transformation by looking at a text facet applied to the new url2 column:
If you sort by count in the text facet, you will see several of the hyperlocal sites have websites hosted on aboutmyarea, or facebook. (Click on one of those links in the text facet to show the sites associated with those domains.) I am going to discount those links from my CSE, so hover over the link and click on the “include” setting to toggle it to “exclude”. Then click on the “invert” option to show all the sites that aren’t the ones you’ve selected as excluded.
This leaves us with sites that are more likely unique:
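The same facet-and-exclude step can be sketched in Python: count how often each pattern occurs (the text facet sorted by count), then drop any pattern whose host is on an exclusion list. The function names are invented, and the exact hostnames to exclude (here `facebook.com` and `aboutmyarea.co.uk`) are assumptions to be checked against the actual data.

```python
from collections import Counter


def facet_counts(patterns):
    """Mimic a text facet sorted by count: (pattern, count), most common first."""
    return Counter(patterns).most_common()


def exclude_hosts(patterns, excluded):
    """Drop patterns whose host is, or is a subdomain of, an excluded host."""
    kept = []
    for p in patterns:
        host = p.split("/", 1)[0]
        if any(host == e or host.endswith("." + e) for e in excluded):
            continue
        kept.append(p)
    return kept


# Hypothetical url2 values.
patterns = [
    "www.facebook.com/*",
    "www.facebook.com/*",
    "onthewight.com/*",
    "www.aboutmyarea.co.uk/*",
]
kept = exclude_hosts(patterns, ["facebook.com", "aboutmyarea.co.uk"])
```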
Having got a filtered list of sites, we can generate an annotations file containing the URL patterns we want to search over and the CSE label. The label identifies to Google which CSE the URLs in the annotations file apply to. We get that code by generating a CSE…
When creating a new CSE, along with giving it a name, you’ll also have to seed it with at least one URL. Simply enter a pattern for a URL you know you want to include in the search engine.
Hit create, and you’ll have a new CSE…
From the “Advanced” tab, go to the CSE annotations area and find the code for your CSE:
Now we’re in a position to add the CSE code to our annotations file – so copy the CSE label for your CSE… We can create the annotations file in OpenRefine from the “Export” menu, where we select “Templating”:
The templating option allows us to define a custom export template. The template is built up from a header, a row separator, a footer, and a row pattern that describes how to write out each row. I define a simple template as follows, and then export the file.
(Note – there are other ways I could have done this (indeed, there are often “other ways”!). For example, I could have created a new column containing just the CSE label value, and then done a custom table export, selecting the url2 column and label column, along with the TSV output format.)
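To show how a header / row template / row separator / footer export hangs together, here is a small Python sketch of the same idea. It is not the template used above: the function names and the `_cse_exampleid` label are placeholders, and the XML shape is written from memory of Google’s annotations format, so check it against the current CSE documentation before relying on it.

```python
from xml.sax.saxutils import quoteattr


def templated_export(rows, row_template, prefix="", separator="\n", suffix=""):
    """Mimic a header / row template / row separator / footer style export."""
    body = separator.join(row_template(row) for row in rows)
    return prefix + body + suffix


CSE_LABEL = "_cse_exampleid"  # placeholder; copy the real label from the CSE admin pages


def annotation_row(row):
    # quoteattr escapes the value and wraps it in quotes, ready for use as an attribute.
    return "<Annotation about=%s><Label name=%s/></Annotation>" % (
        quoteattr(row["url2"]),
        quoteattr(CSE_LABEL),
    )


annotations_xml = templated_export(
    [{"url2": "onthewight.com/*"}, {"url2": "example.org/*"}],
    annotation_row,
    prefix="<Annotations>\n",
    suffix="\n</Annotations>\n",
)

# The upload route has a 30KB limit, so it is worth checking the size.
fits_upload_limit = len(annotations_xml.encode("utf-8")) <= 30 * 1024
```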
Export the annotations file and then import it into the CSE – hit the “Add” button in the CSE annotations area.
Once uploaded (and remember, there is a 30KB file size limit on this route), go back to the Basics tab: you should find that your custom search engine now lists as sites to be searched over the sites you included in your annotations file, as well as being provided with a link to your CSE.
You can tweak some of the styling for the CSE from the “Look and Feel” menu option in the CSE admin pages sidebar.
If you now click on your CSE URL you should find you have a minimal Google Custom Search engine that searches over several hundred UK hyperlocal websites.
To add in some of the sites we originally excluded, eg the ones on the aboutmyarea domain, we could add specific URL patterns in explicitly via the CSE control panel.
Google Custom search engines can be really quick to set up in a minimal form, but can also be customised further – for example, with tweaks to the ranking algorithm or with custom annotations (see for example Search Engine Powered Courses).
You can also generate lists of URLs from things like homepage links in Twitter bios grabbed from a Twitter list (eg Using Twitter Lists to Define Custom Search Engines – that code appears to have rotted slightly, but I have a fix…Let me know via the comments if you’re interested in generating CSEs from Twitter lists etc).
As I mentioned at the start, it’s been some years since I played with Google Custom Search Engines – I was really hopeful for them at one point, but Google never really seems to give them any love (not necessarily a bad thing – perhaps they are just enough over and under the radar for Google to cut them?), and I couldn’t seem to persuade anyone else (in the OU at least) that they were worth spending any time on.
I think a few librarians did pick up on them though! And if there is interest in the hyperlocal community for seeing what we might do with them, I’d be happy to put my thinking cap back on, work up some more tutorials or use cases, and run training workshops etc etc.
Towards the end of last week I attended a two day symposium on Statistics in Journalism Practice and Education at the University of Sheffield. The programme was mixed, with several reviews of what data journalism is or could be, and the occasional consideration of what stats might go into a statistics curriculum for students, but it got me thinking again about the way that content gets created and shunted around the news world.
Take polls, for example. At one point a comment got me idly wondering about the percentage of news copy that is derived from polls or surveys, and how it might be possible to automate the counting of such things. (My default position in this case is usually to wonder what might be possible with the Guardian open platform content API.) But I also started to wonder about how we could map the fan out from independent or commissioned polls or surveys as they get reported in the news media, then maybe start to find their way into other reports and documents by virtue of having been reported in the news.
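A very crude first pass at automating the counting might look like the sketch below. It assumes you already have the article texts to hand from somewhere (no particular API is implied), and it only checks whether the words “poll” or “survey” appear, which is a long way short of showing that a story was actually derived from one. The pattern and function name are made up for the example.

```python
import re

# Matches poll, polls, survey, surveys as whole words, in any case.
POLL_PATTERN = re.compile(r"\b(?:poll|survey)s?\b", re.IGNORECASE)


def poll_share(articles):
    """Return the fraction of article texts that mention a poll or survey."""
    if not articles:
        return 0.0
    hits = sum(1 for text in articles if POLL_PATTERN.search(text))
    return hits / len(articles)
```

Whole word matching means that “polling station” is not counted, but a story that only mentions a poll in passing still would be.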
This sort of thing is a corollary to tracking the way in which news stories might make their way from the newswires and into the papers via a bit of cut-and-pasting, as Nick Davies wrote so damningly about several years ago now in Flat Earth News, his indictment of churnalism and all that goes with it; it also reminds me of this old, old piece of Yahoo Pipes pipework where I tried to support the discovery of Media Release Related News Stories by putting university press release feeds into the same timeline view as news stories about that university.
I don’t remember whether I also built a custom search engine at the time for searching over press releases and news sites for mentions of universities, but that was what came immediately to mind this time round.
So for starters, here’s a quick Google Custom Search Engine that searches over a variety of polling organisation and news media websites looking for polls and surveys – Churnalism Times (Polls & Surveys Edition).
Here’s part of the setup, showing the page URL patterns to be searched over.
I added additional refinements to the tab that searches over the news organisations so as to only pull out pages where “poll” or “survey” is mentioned. Note that if these words are indexed in the chrome around the news story (eg in a banner or sidebar), then we can get a false positive hit on the page (i.e. pull back a page where an irrelevant story is mentioned because a poll is linked to in the sidebar).
From way back when, when I took more of an interest in search than I do now, I thought Google was trying to find ways of distinguishing content from furniture, but I’m not so sure any more…
Anyway, here’s an example of a search into polls and surveys published by some of the big pollsters:
And an example of results from the news orgs:
For what it’s worth I also put together a custom search engine for searching over press releases – Churnalism Times (PR wires edition):
The best way of using this is to just paste in a quote, or part of a quote, from a news story, in double quotes, to see which PR notice it came from…
To make life easier, an old bookmarklet generator I produced way back when on an Arcadia fellowship at the Cambridge University Library, can be used to knock up a simple bookmarklet that will let you highlight a chunk of text and then search for it – get-selection bookmarklet generator.
Give it a sensible title; then this is the URL chunk you need to add:
Sigh.. I used to have so much fun…
PS it actually makes more sense to enclose the selected quote in quotes. Here’s a tweaked version of the bookmarklet code I grabbed from my installation of it in Chrome:
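The same idea, wrapping the selected text in double quotes before passing it to the search engine, can be sketched in Python. This is not the bookmarklet code itself, just an illustration of the URL it needs to build; the `base_url` and the `cx` parameter are my assumptions about how a public CSE results page is addressed, and `EXAMPLE_CSE_ID` is a placeholder.

```python
from urllib.parse import urlencode


def quoted_search_url(selection, base_url="https://cse.google.com/cse", cx="EXAMPLE_CSE_ID"):
    """Build a search URL that runs an exact phrase search on the selected text."""
    # Strip any quotes already in the selection, then wrap the whole phrase in quotes.
    phrase = '"%s"' % selection.strip().replace('"', "")
    return base_url + "?" + urlencode({"cx": cx, "q": phrase})
```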
PPS I’ve started to add additional search domains to the PR search engine to include political speeches.
If you live by pop tech feed or Twitter, you’ve probably heard that Google is rolling out a new style of socially powered search results. If not, or if you’re still not clear about what it entails, read Phil Bradley’s post on the matter: Why Google Search Plus is a disaster for search.
Done that? If not, why not? This post isn’t likely to make much sense at all if you don’t know the context. Here’s the link again: Why Google Search Plus is a disaster for search
So the starting point for this post is this: Google is in the process of rolling out a new web search service that (optionally) offers very personal search results that contain content from folk that Google thinks you’re associated with, and that Google is willing to show you based on license agreements and corporate politics.
Think about this for a minute…. in the totally personalised view, folk will only see content that their friends have published or otherwise shared…
In Could Librarians Be Influential Friends?, I wondered aloud whether it made sense for librarians and other folk involved with providing support relating to resource discovery and recommendation to start a) creating social network profiles and encouraging their patrons to friend them, and b) recommending resources using those profiles in order to start influencing the ordering/ranking of results in patrons’ search results based on those personal recommendations. The idea here was that you could start to make invisible frictionless recommendations by influencing the search engine results returned to your patrons (the results aren’t invisible because your profile picture may appear by the result showing that you recommend it. They’re frictionless in the sense that having made the original recommendation, you no longer have to do any work in trying to bring it to the attention of your patron – the search engines take care of that for you (okay, I know that’s a simplistic view;-)). [Hmm.. how about referring to it as recommendation mode support?]
(Note that there is a complementary form of support to the approach which I’ve previously referred to as Invisible Library Tech Support (responsive mode support?; which I guess is also frictionless, at least from the perspective of the patron) in which librarians friend their patrons or monitor generic search terms/tags on Q&A sites and then proactively respond to requests that users post into their social networks more generally.)
With the aggressive stance Google now seems to be taking towards pushing social circle powered results, I think we need to face up to the fact – as Phil Bradley pointed out – that if librarians want to make sure they’re heard by their patrons, they’re going to need to start setting up social profiles, getting their patrons to friend them, and start making content and resource recommendations just anyway in order to make them available as resources that are indexed by patrons’ personal search engines. The same goes for publishers of OERs, academic teaching staff, and “courses”.
If we think of Google social search as searching over custom search engines bound by resources created and recommended by members of a user’s social circle, if you want to make (invisible) recommendations to a user via their (personalised) web search results, you’re going to need to make sure that the resources/content you want to recommend are indexed by their personal search engines. Which means: a) you need to friend them; and b) you need to share that content/those resources in that social context.
(Hmmm…this makes me think there may be something in the course custom search engine approach after all… Specifically, if the course has a social profile, and recommends the links contained within the course via that profile, they become part of the personalised search index of students following that course profile?)
Just by the by, as another example of Google completely messing things up at the moment, I notice that when I share links to posts on this blog via Google+, they don’t appear as trackbacks to the post in question. Which means that if someone refers to a post on this blog on Google+, I don’t know about it… whereas if they blog the link, I do…
See also my chronologically ordered posts on the eroding notion of “Google Ground Truth”.
[Invisible vs frictionless (and various notions of that word) is all getting a bit garbled; see eg @briankelly’s Should Higher Education Welcome Frictionless Sharing and my comments to it for a little more on this…]
PS I’ve been getting increasingly infuriated by the clutter around, and lack of variation within, Google search results lately, so I changed my default search engine to Bing. The results are a bit all over the place compared to the Google results I tend to get, but this may be down in part to personalisation/training. I am still making occasional forays to Google, but for now, Bing is it… (because Bing is not Google…)
PPS Hah – just noticed: Google Search Plus doesn’t mean plus in the sense of search more, it means search Google+, which is less, or minus the wider world view…;-)
PPPS I keep meaning to blog this, and keep forgetting: Turn[ing] off [Google] search history personalization, in particular: “If you’ve disabled signed-out search history personalization, you’ll need to disable it again after clearing your browser cookies. Clearing your Google cookie clears your search settings, thereby turning history-based customizations back on.” Which is to say, when you disable personalisation, you don’t disable personalisation against your Google account, you disable it only insofar as it relates to your current cookie ID?
Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.
The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.
One XML doc was used per week, and was used to generate the separate HTML pages for each week’s study.
One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)
Through using the course custom search engine myself, I have found several issues with it:
1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.
2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, while more specific queries building on the original search term delivered more and different results. This gives a really odd search experience, and one that I suspect would put many users off.
3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for an example of what’s been lost, see e.g. Stone Temple Consulting: Google Co-Op Subscribed Links).
4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:
<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif" title="Week 4" id="T151_Week4" queries="week 4,T151 week 4,t151 week 4" url="http://www.open.ac.uk" description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>
There are several components we need to consider here:
- queries: these are the phrases that are used to trigger the display of the particular promotions links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that at most three (3) promotions can be displayed for any query. Queries may be based around structural components (such as study week, topic number), subject matter terms (e.g. tags, keywords, or headings) and resource type (e.g. audio/video material, academic readings etc), although we might argue the resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T151 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
- title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
- description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
- url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?
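Pulling those components together, a promotion element could be generated along the following lines. This is a minimal sketch rather than the code I actually use: the `make_promotion` name is made up, the attribute names follow the example definition above, and the 160 and 200 character limits are the ones just quoted.

```python
import xml.etree.ElementTree as ET

TITLE_LIMIT = 160        # character limits as stated in the list above
DESCRIPTION_LIMIT = 200


def make_promotion(promo_id, title, description, url, queries, image_url=None):
    """Return a <Promotion .../> element as a string, checking the length limits."""
    if len(title) > TITLE_LIMIT:
        raise ValueError("title longer than %d characters" % TITLE_LIMIT)
    if len(description) > DESCRIPTION_LIMIT:
        raise ValueError("description longer than %d characters" % DESCRIPTION_LIMIT)
    if not url:
        raise ValueError("a url is required even for informational promotions")
    attrs = {
        "id": promo_id,
        "title": title,
        "queries": ",".join(queries),  # comma separated trigger phrases
        "url": url,
        "description": description,
    }
    if image_url:
        attrs["image_url"] = image_url
    # ElementTree takes care of escaping attribute values.
    return ET.tostring(ET.Element("Promotion", attrs), encoding="unicode")


promo = make_promotion(
    "T151_Week4",
    "Week 4",
    "Topic Exploration 4A - An animated experience",
    "http://www.open.ac.uk",
    ["week 4", "T151 week 4", "t151 week 4"],
)
```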
Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.
In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):
Or what a particular question is within a particular topic:
The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:
<Promotion title="Topic Exploration 4A Question 4" queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a" ... />
The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3” and “topic 1a q4″… (You can try out the CSE here.)
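The procedural flavour of those query strings is easy to see in code. The sketch below (with a made-up `navigational_queries` function) builds the same comma separated pattern as the fragment above: each base phrase on its own, then prefixed with the upper and lower case course code.

```python
def navigational_queries(topic, question=None, course="T151"):
    """Build query variants for a topic, optionally led by question-specific ones."""
    bases = []
    if question is not None:
        bases.append("topic %s q%d" % (topic.lower(), question))
    bases.append("topic %s" % topic.lower())
    queries = []
    for base in bases:
        # Each base phrase on its own, then with upper and lower case course codes.
        queries.extend([
            base,
            "%s %s" % (course.upper(), base),
            "%s %s" % (course.lower(), base),
        ])
    return ",".join(queries)
```

Calling `navigational_queries("4A", 4)` gives the queries attribute for question 4 in topic 4A; leaving out the question number gives just the topic level queries.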
The promotions I have specified so far also lack several things:
1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)
2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there were such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly through a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)
At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?
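One naive starting point for subject meaningful queries, offered purely as a sketch, is to strip stopwords from the question text and treat what is left as candidate query terms. The stopword list, the function name and the sample question below are all invented for illustration; nothing here comes from the T151 materials.

```python
import re

# A deliberately tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {
    "a", "an", "and", "are", "do", "does", "for", "how", "in", "is",
    "of", "on", "or", "the", "to", "what", "which", "why", "with",
}


def candidate_queries(question_text, max_terms=5):
    """Return up to max_terms distinct non-stopword terms, in order of appearance."""
    words = re.findall(r"[a-z0-9']+", question_text.lower())
    terms = []
    for word in words:
        if word not in STOPWORDS and len(word) > 2 and word not in terms:
            terms.append(word)
    return terms[:max_terms]


# Made-up question text, used only to exercise the function.
print(candidate_queries("What is the role of rules in the design of a game?"))
```

Terms found this way could be appended to the queries attribute of the matching promotion, though whether that makes for useful promotions is exactly the open question.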
At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…
[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]
PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the search engine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?
My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.
It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?
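Resolving shortened links before adding them to the annotations file might be sketched as follows, using only the Python standard library. The function names are hypothetical, and the resolver is passed in as a parameter so the de-duplication step can be tried without touching the network.

```python
import urllib.request


def resolve_url(url, opener=urllib.request.urlopen, timeout=10):
    """Follow redirects and return the final URL, or the original URL on failure."""
    try:
        # urlopen follows HTTP redirects by default; geturl() reports where we ended up.
        with opener(url, timeout=timeout) as response:
            return response.geturl()
    except OSError:
        # Network and HTTP errors (URLError subclasses OSError): keep the link as given.
        return url


def resolved_links(urls, resolver=resolve_url):
    """Resolve each link and drop duplicates, preserving first-seen order."""
    seen = set()
    result = []
    for url in urls:
        final = resolver(url)
        if final not in seen:
            seen.add(final)
            result.append(final)
    return result
```

De-duplicating after resolution matters because the same resource is often shared under several different short links.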
*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:
- academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
- news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
- video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
- books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.
I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?
However, that wasn’t what this experiment was about!;-)
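Even so, a toy version of the local repository workaround is easy to sketch: keep a full text copy of each resource keyed by its original URL, build an inverted index over the copies, and return the original URLs as the search results. The function names and sample documents below are invented for illustration, and a real deployment would more likely use an off-the-shelf search engine than anything like this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of original URLs whose local copy contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the original URLs whose text contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical local copies, keyed by the link used in the course materials.
documents = {
    "http://example.com/paper": "Full text of a paper about game design rules",
    "http://example.org/article": "A magazine article about game audio",
}
index = build_index(documents)
print(search(index, "game design"))
```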
Course Resources as part of a larger connected graph
Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
– links to papers referenced in the article;
– links to papers that cite the article;
– links to other papers written by the same author;
– links to other papers in collections containing the article on services such as Mendeley;
– links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.
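One way to picture this, as a sketch only, is a small graph in which each edge carries the type of relationship between two resources. The class, the relation names and the resource identifiers below are all hypothetical.

```python
from collections import defaultdict


class ResourceGraph:
    """A tiny typed-edge graph: (source, relation) -> set of targets."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_link(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Resources reachable from source by one edge of the given type."""
        return sorted(self._edges.get((source, relation), set()))

    def cited_by(self, target):
        """Invert the 'references' edges to find resources that cite target."""
        return sorted(
            source
            for (source, relation), targets in self._edges.items()
            if relation == "references" and target in targets
        )


graph = ResourceGraph()
graph.add_link("paper-a", "references", "paper-b")
graph.add_link("paper-c", "references", "paper-b")
graph.add_link("paper-a", "same_author", "paper-d")
print(graph.related("paper-a", "references"), graph.cited_by("paper-b"))
```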
Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is a gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (eg through following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.
In Search Engine Powered Courses…, I took an initial, baby step to demonstrate one way in which a promoted link might be used within a course specific custom search engine. In the next post in this series, I will describe how to influence the positioning of results within a Google custom search engine by boosting their ranking, as well as how results may be ‘faceted’ into different results sets through the use of labels.
In this post, I thought it would be worth taking a step back and reviewing the three configuration files we have access to when defining a Google custom search engine: the configuration file, the promotions file, and the annotations file. If you create a minimal Google custom search engine using the CSE management tools, and then go to the Advanced page, you will see options that allow you to upload the configuration and annotations files. The promotions file can be imported via the Promotions page.
So what does each of these files do?
- The configuration file defines the top level configuration of the search engine. The easiest way of obtaining a template for a CSE is to create a minimal search engine using the CSE management tools, and then export the configuration file from the Advanced page. The configuration file defines, among other things: whether the search engine will search over the whole web, prioritising (or ‘BOOSTing’) sites and pages indexed explicitly by the CSE, or whether it will just return results from the explicitly indexed pages (a FILTER style search engine); a definition of the labels, or facets, that allow different search refinements to be applied as different search strategy contexts within the CSE; some styling information; and information relating to Subscribed Links (more of them in another post, if they’re still supported by then…).
- The promotions file allows you to define promoted links within a CSE; in Search Engine Powered Courses…, I give an example of how these might be used in a course search engine.
- The annotations file identifies the sites and pages that are specific members of the CSE index, as well as how they should be handled (eg the extent to which they should be positively or negatively boosted in the search engine results listing, whether they should appear in the top few results, and what labels or facets should apply to them).
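As a rough illustration of the annotations file, the sketch below writes out a couple of entries with Python’s ElementTree. The element and attribute names (Annotations, Annotation, about, score, Label, name) are written from memory of the CSE annotations format and should be treated as assumptions to check against a file exported from the Advanced page; the label names, URL patterns and function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def annotations_xml(entries, cse_label):
    """Serialise (url_pattern, score, extra_labels) tuples as an annotations document."""
    root = ET.Element("Annotations")
    for pattern, score, labels in entries:
        # 'about' names the site or page pattern; 'score' is the assumed boost weight.
        annotation = ET.SubElement(root, "Annotation", about=pattern, score=str(score))
        # Every entry carries the label that ties it to this particular CSE...
        ET.SubElement(annotation, "Label", name=cse_label)
        # ...plus any facet labels used for search refinements.
        for label in labels:
            ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(root, encoding="unicode")


entries = [
    ("www.example.com/course/*", 1, ["week1"]),
    ("blog.example.org/*", 0.5, []),
]
print(annotations_xml(entries, "_cse_hypotheticalid"))
```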
It’s also possible to customise the styling/presentation of the search engine, but that’s a shiny, shiny feature, so probably not something I’ll be looking at…
PS I just noticed you can now manage Google Analytics settings for custom search engines (which allows you to log search queries) from within the CSE control panel… I’m still not sure how easy it is to track which results get clicked through, though?