Getting Lots of Results Out of a Google Custom Search Engine (CSE) via RSS

In Getting an RSS Feed Out of a Google Custom Search Engine (CSE), I described a Yahoo! pipe that can be used to get an RSS feed out of a Google Custom Search Engine using the Google AJAX Search API.

One of the limitations of the API appears to be that it only returns 8 search results at a time, although these results can be paged.

So for example, if you run a normal Google search that returns lots of results, those results are presented over several results pages. If you hover over the links for the different pages, and look at the status bar at the bottom of your browser where the link is displayed, you’ll see that the URL for each page of results is largely the same; the difference comes in the &start= argument in the URI that says which number search result should be at the top of the page; something like this:
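(An illustrative example, rather than the exact URL – the important bit is the start argument, which here asks for the page of results beginning at the eleventh hit:)

http://www.google.com/search?q=your+search+terms&start=10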

The same argument – start – can be used to page the CSE results from the AJAX API, which means we can add it to the URI that calls the Google AJAX Search API within a pipe:

This gives us a quick fix for getting more than 8 results out of a CSE: use the get 8 CSE results starting at a given result pipe to get the first 8 results (counted as results 0..7), then another copy of the pipe to get results 9–16 (counted as 8..15 – i.e. starting at result 8), a third copy of the pipe to get results 17–24 (counted as 16..23), and so on, and then aggregate all the results…
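If you want to play with the same trick outside of Pipes, here’s a minimal sketch in Python of paging the API and merging the pages. The endpoint and the v/q/cx/rsz/start arguments are as I read them from the AJAX Search API docs, and the CSE ID is a placeholder you’d need to swap for your own:

```python
# Minimal sketch: page the Google AJAX Search API for a CSE and aggregate the results.
# Endpoint, arguments and result field names are as described in the AJAX Search API
# docs; the CSE id is a placeholder.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://ajax.googleapis.com/ajax/services/search/web"
CSE_ID = "YOUR_CSE_ID"          # placeholder - use your own CSE id
QUERY = "library analytics"

def fetch_page(start):
    """Fetch one page of (up to 8) CSE results, starting at result number `start`."""
    params = urlencode({"v": "1.0", "q": QUERY, "cx": CSE_ID, "rsz": "large", "start": start})
    with urlopen(API_URL + "?" + params) as resp:
        data = json.load(resp)
    return (data.get("responseData") or {}).get("results") or []

# Results 0..7, 8..15 and 16..23, merged in order - the same trick the pipe performs
# by calling the "get 8 CSE results" sub-pipe several times and Union-ing the output.
all_results = []
for start in (0, 8, 16):
    all_results.extend(fetch_page(start))

for r in all_results:
    print(r.get("titleNoFormatting"), r.get("url"))
```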

Here’s an example – lots of Google CSE results as RSS pipe:

Notice that each CSE calling pipe is called with the same CSE ID and the same search query, but different start numbers.

This recipe hopefully also gives you the clue that you could use the Union pipe block to merge results from different CSEs (just make sure you use the right CSE ID and the right start values!).

Situated Video Advertising With Tesco Screens

In 2004, Tesco launched an in-store video service under the name Tesco TV as part of its Digital Retail Network service. The original service is described in TESCO taps into the power of satellite broadband to create a state-of-the-art “Digital Retail Network” and is well worth a read. A satellite delivery service provided “news and entertainment, as well as promotional information on both TESCO’s own products and suppliers’ branded products” that was displayed on video screens around the store.
In order to make content as relevant as possible (i.e. to maximise the chances of it influencing a purchase decision;-), the content was zoned:

Up to eight different channels are available on TESCO TV, each channel specifically intended for a particular zone of the store. The screens in the Counters area, for instance, display different content from the screens in the Wines and Spirits area. The latest music videos are shown in the Home Entertainment department and Health & Beauty has its own channel, too. In the Cafe, customers can relax watching the latest news, sports clips, and other entertainment programs.

I’d have loved to have seen the control room:

Remote control from a central location of which content is played on each screen, at each store, in each zone, is an absolute necessity. One reason is that advertisers are only obligated to pay for their advertisements if they are shown in the contracted zones and at the contracted times.

In parallel to the large multimedia files, smaller files with the scripts and programming information are sent to all branches simultaneously or separately, depending on what is required. These scripts are available per channel and define which content is played on which screen at which time. Of course, it is possible to make real-time changes to the schedule enabling TESCO to react within minutes, if required.

In 2006, dunnhumby, the company that runs the Tesco Clubcard service and that probably knows more about your shopping habits at Tesco than you do, won the ad sales contract for Tesco TV’s “5,000 LCD and plasma screens across 100 Tesco Superstores and Extra outlets”. Since then, it has “redeveloped the network to make it more targeted, so that it complements in-store marketing and ties in with above-the-line campaigns”, renaming Tesco TV as Tesco Screens in 2007 as part of that effort (Dunnhumby expands Tesco TV content, dunnhumby relaunches Tesco in-store TV screens). Apparently, “[a]ll campaigns on Tesco Screens are analysed with a bespoke control group using EPOS and Clubcard data.” (If you’ve read any of my previous posts on the topic (e.g. The Tesco Data Business (Notes on “Scoring Points”)) you’ll know that dunnhumby excels at customer profiling and targeting.)

Now I don’t know about you, but dunnhumby’s apparent reach and ability to influence millions of shoppers at points of weakness is starting to scare me…(as well as hugely impressing me;-)

On a related note, it’s not just Tesco that uses video screen advertising, of course. In Video, Video, Everywhere…, for example, I described how video advertising has now started appearing throughout the London Underground network.

So with the growth of video advertising, it’s maybe not so surprising that Joel Hopwood, one of the management team behind Tesco Screens Retail Media Group, should strike out with a start-up: Capture Marketing.

[Capture Marketing] may well be the first agency in the UK to specialise in planning, buying and optimising Retail Media across all retailers – independent of any retailer or media owner!!

They aim to buy from the likes of dunnhumby, JCDecaux, Sainsbury, Asda Media Centre etc in order to give clients a single, independent and authoritative buying and planning point for the whole sector. [DailyDOOH: What On Earth Is Shopper Marketing?]

So what’s the PR strapline for Capture Marketing? “Turning insight into influence”.

If you step back and look at our marketing mix across most of the major brands, it’s clearly shifting, and it’s shifting to in-store, to the internet and to trial activity.
So what’s the answer? Marketing to shoppers. We’ll help you get your message to the consumer when they’re in that crucial zone, after they’ve become a shopper, but before they’ve made a choice. We’ll help you take your campaign not just outside the home, but into the store. Using a wide range of media vehicles, from digital screens to web favourite interrupts to targeted coupons, retail media is immediate, proximate, effective and measurable.

I have no idea where any of this is going… Do you? Could it shift towards making use of VRM (“vendor relationship management”) content, in which customers are able to call up content they desire to help them make a purchase decision (such as price, quality, or nutrition information comparisons)? After all, scanner apps are already starting to appear on Android phones (e.g. ShopSavvy) and the iPhone (Snappr), not to mention the ability to recognise books from their covers or music from the sound of it (The Future of Search is Already Here).

PS Just by the by, here are some thoughts about how Tesco might make use of SMS:

PPS for a quick A-Z of all you need to know to start bluffing about video based advertising, see Billboards and Beyond: The A-Z of Digital Screens.

Are You Ready To Play “Search Engine Consequences”?

In the world of search, what happens when your local library categorises the film “Babe” as a teen movie in the “classic DVDs” section of the Library website? What happens if you search for “babe teen movie” on Google using any setting other than Safe Search set to “Use strict filtering”? Welcome, in a roundabout way, to the world of search engine consequences.

To set the scene, you need to know a few things. Firstly, how the web search engines know about your web pages in the first place. Secondly, how they rank your pages. And thirdly, how the ads and “related items” that annotate many search listings and web pages are selected (because the whole point about search engines is that they are run by advertising companies, right?).

So, how do the search engines know where your pages are? Any offers? You at the back there – how does Google know about your library home page?

As far as I know, there are three main ways:

  • a page that is already indexed by the search engine links to your page. When a search engine indexes a page, it makes a note of all the pages that are linked to from that page, so these pages can be indexed in turn by the search engine crawler (or spider). (If you have a local search engine, it will crawl your website in a similar way, as this documentation about the Google Search Appliance crawler describes.)
  • the page URL is listed in a Sitemap, and your website manager has told the search engine where that Sitemap lives. The Sitemap lists all the pages on your site that you want indexed. This helps the search engine out – it doesn’t have to crawl your site looking for all the pages – and it helps you out: you can tell the search engine how often each page changes, and how often it needs to be re-indexed, for example. (A minimal example of what a Sitemap contains is sketched just after this list.)
  • you tell the search engine at a page level that the page exists. For example, if your page includes any Google Adsense modules, or Google Analytics tracking codes, Google will know that page exists the first time it is viewed.
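If you’ve never looked at one, a Sitemap is just an XML file listing the URLs you want indexed, plus optional hints. Here’s a minimal sketch of generating one – the URLs and change frequencies are made up for illustration:

```python
# A minimal sketch (made-up URLs) of generating a Sitemap of the kind described above:
# an XML list of page URLs plus optional hints such as how often each page changes.
from xml.sax.saxutils import escape

pages = [
    # (URL, how often the page tends to change)
    ("http://library.example.ac.uk/", "daily"),
    ("http://library.example.ac.uk/find/eresources/", "weekly"),
    ("http://library.example.ac.uk/help/citing-references", "monthly"),
]

entries = "\n".join(
    "  <url><loc>%s</loc><changefreq>%s</changefreq></url>" % (escape(url), freq)
    for url, freq in pages
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    "%s\n</urlset>\n" % entries
)

with open("sitemap.xml", "w") as f:
    f.write(sitemap)
```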

And once a page has been indexed by a search engine, it becomes a possible search result within that search engine.

So when someone makes a query, how are the results selected?

During the actual process of indexing, the search engine does some voodoo magic to try and understand what the page is about. This might be as simple as counting the occurrence of every different word on the page, or it might be a more sophisticated attempt using all manner of heuristics and semantic engineering approaches. Pages are deemed a “hit” for a particular query if the search terms can be used to look up the page in the index. The hits are then rank ordered according to a score for each page, calculated according to whatever algorithm the search engine uses. Typically this is some function of both “relevance” and “quality”.

“Relevance” is identified in part by comparing how the page has been indexed with the query.

“Quality” often relates to how well regarded the page is in the greater scheme of things; for example, link analysis identifies how many other pages link to the page (where we assume a link is some sort of vote of confidence for the page) and clickthrough analysis monitors how many times people click through on a particular result for a particular query in a search engine results listing (as search engine companies increasingly run web analytics packages too, they can potentially factor this information back in to calculating how satisfied a user was with a particular page).
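To make that “some function of relevance and quality” idea a bit more concrete, here’s a toy sketch – this isn’t any real engine’s algorithm, and the signals and weightings are entirely made up:

```python
# Toy illustration of ranking hits by a mix of "relevance" and "quality".
# Not any real search engine's algorithm - the signals and weights are made up.

def relevance(query_terms, page_term_counts, page_length):
    """Crude relevance: what fraction of the page's words are query terms."""
    matches = sum(page_term_counts.get(t, 0) for t in query_terms)
    return matches / float(page_length or 1)

def quality(inbound_links, clickthroughs):
    """Crude quality: votes of confidence from links and past clickthroughs."""
    return 1 + inbound_links + 0.1 * clickthroughs

def score(query_terms, page):
    return relevance(query_terms, page["term_counts"], page["length"]) * quality(
        page["inbound_links"], page["clickthroughs"]
    )

pages = [
    {"url": "http://example.ac.uk/citing-references", "term_counts": {"references": 12},
     "length": 400, "inbound_links": 30, "clickthroughs": 250},
    {"url": "http://example.com/job-references", "term_counts": {"references": 20},
     "length": 300, "inbound_links": 400, "clickthroughs": 5000},
]

query = ["references"]
for p in sorted(pages, key=lambda p: score(query, p), reverse=True):
    print(p["url"], round(score(query, p), 2))
```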

So what are the consequences of publishing content to the web, and letting a search engine index it?

At this point it’s worth considering not just web search engines, but also image and video search engines. Take Youtube, for example. Posting a video to Youtube means that Youtube will index it. And one of the nice things about Youtube is that it will index your video according to (at least) the title, description and tags you add to the movie, and use this as the basis for recommending other “related” videos to you. You might think of this in terms of Youtube indexing your video in a particular way, and then running a query for other videos using those index terms as the search terms.

Bearing that in mind, a major issue is that you can’t completely control where a page might turn up in a search engine results listing or what other items might be suggested as “related items”. If you need to be careful managing your “brand” online, or you have a duty of care to your users, being associated with inappropriate material in a search engine results listing can be a risky affair.

To try and ground this with a real world example, check out David Lee King’s post from a couple of weeks ago on YouTube Being Naughty Today. It tells a story of how “[a] month or so ago, some of my library’s teen patrons participated in a Making Mini Movie Masterpieces program held at my library. Cool program!”

One of our librarians just posted the videos some of the teens made to YouTube … and guess what? In the related videos section of the video page (and also on the related videos flash thing that plays at the end of an embedded YouTube video), some … let’s just say “questionable” videos appeared.

Here’s what I think happened: YouTube found “similar” videos based on keywords. And the keywords it found in our video include these words in the title and description: mini teen teens . Dump those into YouTube and you’ll unfortunately find some pretty “interesting” videos.

And here’s how I commented on the post:

If you assume everything you publish on the web will be subject to simple term extraction or semantic term extraction, then in a sense it becomes a search query crafted around those terms that will potentially associate the results of that “emergent query” with the content itself.

One of the functions of the Library used to be classifying works so they could be discovered. Maybe now there is a need for understanding how machines will classify web published content so we can try to guard against “consequential annotations”?

For a long time I’ve thought one role for the Library going forwards is SEO – and raising the profile of the host institution’s content in the dominant search engines. But maybe an understanding of SEO is also necessary in a *defensive* capacity?

This need for care is particularly relevant if you run Adsense on your website. Adsense works in a conceptually similar way to the Youtube related videos service, so you might think of it in the following terms: the page is indexed, key terms are identified and run as “queries” on the Adsense search engine, returning “relevant ads” as a result. “Relevance” in this case is calculated (in part) based on how much the advertiser is willing to pay for their advert to be returned against a particular search term or in the context of a page that appears to be about a particular topic.
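As a crude illustration of that “extract the key terms, then run them as a query” model (the same thing that tripped up the library videos above), here’s a sketch – the stopword list and scoring are made up, and real services use far more sophisticated term extraction:

```python
# Crude sketch of the "emergent query" idea: pull the most frequent non-stopword
# terms out of a page's title/description and treat them as a search query for
# "related" items or ads. Real term extraction is far more sophisticated.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "at", "to", "in", "and", "our", "by", "made"}

def emergent_query(text, n_terms=3):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n_terms)]

title_and_description = (
    "Mini Movie Masterpieces - teen videos made by teens at our library teen movie program"
)
print(emergent_query(title_and_description))
# prints something like ['movie', 'teen', 'mini'] - which, run as a query against a
# video or ad index, may pull in exactly the sort of "questionable" related items
# described above.
```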

Although controls are starting to appear that give the page publisher an element of control over which ads appear, there is still uncertainty in the equation, as there is whenever your content appears alongside content that is deemed “relevant” or related in some way – whether that’s in the context of a search engine results listing, an Adsense placement, or a related video.

So one thing to keep in mind – always – is how might your page be indexed, and what sorts of query, ad placement or related item might it be the perfect match for? What are the search engine consequences of writing your page in a particular way, or including particular key words in it?

David’s response to my comment identified a major issue here: “The problem is SO HUGE though, because … at least at my library’s site … everyone needs this training. Not just the web dudes! In my example above, a youth services librarian posted the video – she has great training in helping kids and teens find stuff, in running successful programs for that age group … but probably not in SEO stuff.”

I’m not sure I have the answer, but I think this is another reason Why Librarians Need to Know About SEO – not just so they can improve the likelihood of content they do approve of appearing in search engine listings or as recommended items, but also so they can defend against unanticipated SEO, where they unconsciously optimise a page so that it fares well on an unanticipated, or unwelcome, search.

What that means is, you need to know how SEO works so you don’t inadvertently do SEO on something you didn’t intend to optimise for; or so you can institute “SEO countermeasures” to try to defuse any potentially unwelcome search engine consequences that might arise for a particular page.

Library Analytics (Part 8)

In Library Analytics (Part 7), I suggested that the Library might start crafting URLs for the Library resources pages for individual courses in the Moodle VLE that contain a campaign tracking code, so that we could track the behaviour of students coming into the Library site by course.

From a quick peek at a handful of courses in the VLE, that recommendation either doesn’t appear to have been taken up, or it’s just “too hard” to do, so that’s another couple of months’ data we don’t have easy access to in the Google Analytics environment. (Or maybe the Library have moved over to using the OU’s Site Analytics service for this sort of insight?)

Just to recall, we need to put some sort of additional measures in place because Moodle generates crappy URLs (e.g. URLs of the form http://learn.open.ac.uk/mod/resourcepage/view.php?id=119070) and crafting nice URLs or using mod_rewrite (or similar) definitely is way too hard for the VLE’n’network people to manage;-) The default setup of Google Analytics dumps everything after the “?”, unless the arguments are official campaign tracking arguments or are captured otherwise.
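For what it’s worth, here’s a sketch of the sort of campaign-tagged link I had in mind. The utm_source/utm_medium/utm_campaign arguments are the standard Google Analytics campaign tracking parameters (which Analytics keeps hold of, unlike the Moodle ?id= argument), but the tagging convention itself is just something I’ve invented for illustration:

```python
# Sketch of campaign-tagged Library URLs for VLE resource pages. Course codes and
# the tagging convention are made up; utm_source/utm_medium/utm_campaign are the
# standard Google Analytics campaign tracking arguments.
from urllib.parse import urlencode

def campaign_url(base_url, course_code):
    params = {
        "utm_source": "vle",              # where the traffic comes from
        "utm_medium": "resource-page",    # the kind of link
        "utm_campaign": course_code,      # which course's Library resources page it was
    }
    return base_url + "?" + urlencode(params)

print(campaign_url("http://library.open.ac.uk/find/eresources/index.cfm", "M882"))
# -> http://library.open.ac.uk/find/eresources/index.cfm?utm_source=vle&utm_medium=resource-page&utm_campaign=M882
```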

(From a quick scan of Google Analytics Tracking API, I’m guessing that setting pageTracker._setCampSourceKey(“id”); in the tracking code on each Library web page might also capture the id from referrer URLs? Can anyone confirm/deny that?)

Aside: from what I’ve been told, I don’t think we offer server side compression for content served from most http://www.open.ac.uk/* sites, either (though I haven’t checked)? Given that there are still a few students on low bandwidth connections, and that relatively modern browsers can all cope with compressed content, this is probably an avoidable breach of some sort of accessibility recommendation? For example, over the last 3 weeks or so, here’s the number of dial-up visits to the Library website:

A quick check of the browser stats shows that IE breaks down almost completely as IE6 and above; all of which cope with compressed files, I think?

[Clarification (?! heh heh) re: dial-in stats – “when you’re looking at the dial-up use of the Library website is that we have a dial-up PC in the Library to replicate off-campus access and to check load times of our resources. So it’s probably worth filtering out that IP address (***.***.***.***) to cut out library staff checking out any problems as this will inflate the perceived use of dial-up by our students. Even if we’ve only used it once a day then that’s a lot of hits on the website that aren’t really students using dial-up” – thanks, Clari :-)]

Anyway – back to the course tracking: as a stopgap, I created a few of my own reports that use a user defined argument corresponding to the full referrer URL:

We can then view reports according to this user defined segment to see which VLE pages are sending traffic to the Library website:

Clicking through on one of these links gives a report for that referrer URL, and then it’s easy to see which landing pages the users are arriving at (and by induction, which links on the VLE page they clicked on):

If we look at the corresponding VLE page:

Then we can say that the analytics suggest that the Open University Library – http://library.open.ac.uk/, the Online collections by subject – http://library.open.ac.uk/find/eresources/index.cfm and the Library Help & Support – http://library.open.ac.uk/about/index.cfm?id=6939 are the only links that have been clicked on.

[Ooops… “Safari & Info Skills for Researchers are our sites, but don’t sit within the library.open.ac.uk domain ([ http://www.open.ac.uk/safari ]www.open.ac.uk/safari and [ http://www.open.ac.uk/infoskills-researchers ]www.open.ac.uk/infoskills-researchers respectively) and the Guide to Online Information Searching in the Social Sciences is another Moodle site.” – thanks Clari:-) So it may well be that people are clicking on the other links… Note to self – if you ever see 0 views for a link, be suspicious and check everything!]

(Note that I have only reported on data from a short period within the lifetime of the course, rather than data taken from over the life of the course. Looking at the incidence of traffic over a whole course presentation would also give an idea of when during the course students are making use of the Library resource page within the course.)

Another way of exploring how VLE referrer traffic is impacting on the Library website is to look at the most popular Landing pages and then see which courses (from the user defined segment) are sourcing that traffic.

So for example, here are the VLE pages that are driving traffic to the elluminate registration page:

One VLE page seems responsible:

Hmmm… ;-)

How about the VLE pages driving traffic to the ejournals page?

And the top hit is….

… the article for question 3 on TMA01 of the November 2008 presentation of M882.

The second most popular referrer page is interesting because it contains two links to the Library journals page:


Unfortunately, there’s no way of disambiguating which link is driving the tracking – which is one good reason why a separate campaign related tracking code should be associated with each link.
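For example (reusing the made-up tagging convention from the sketch earlier in this post), the two links could carry different utm_content values so that Google Analytics can tell them apart – the ejournals path here is illustrative rather than the real one:

http://library.open.ac.uk/find/ejournals/index.cfm?utm_source=vle&utm_medium=resource-page&utm_campaign=M882&utm_content=tma01-reading
http://library.open.ac.uk/find/ejournals/index.cfm?utm_source=vle&utm_medium=resource-page&utm_campaign=M882&utm_content=further-resources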

(Do you also see the reference to Google books in there? Heh heh – surely they aren’t suggesting that students try to get what they need from the book via the Google books previewer?!;-)

Okay – enough for now. To sum up, we have the opportunity to provide two sorts of report – one for the Library to look at how VLE sourced traffic as a whole impacts on the Library website; and a different set of reports that can be provided to course teams and course link librarians to show how students on the course are using the VLE to access Library resources.

PS if you haven’t yet watched Dave Pattern’s presentation on mining lending data records, do so NOW: Can You Dig It? A Systems Perspective.

Revisiting the Library Flip – Why Librarians Need to Know About SEO

What does information literacy mean in the age of web search engines? I’ve been arguing for some time (e.g. in The Library Flip) that one of the core skills going forward for those information professionals who “help people find stuff” is going to be SEO – search engine optimisation. Why? Because increasingly people are attuned to searching for “stuff” using a web search engine (you know who I’m talking about…;-); and if your “stuff” doesn’t appear near the top of the organic results listing (or in the paid for links) for a particular query, it might as well not exist…

Whereas once academics and students would have traipsed into the library to ask one of the High Priestesses to perform some magical incantation on a Dialog database through a privileged access terminal, for many people research now starts with a G. Which means that if you want your academics and students to find the content that you’d recommend, then you have to help get that content to the top of the search engine listings.

With the rate of content production growing to seventy three tera-peta-megabits a second, or whatever it is, does it make sense to expect library staffers to know what the good content is any more (in the sense of “here, read this – it’s just what you need”)? Does it even make sense to expect people to know where to find it (in the sense of “try this database, it should contain what you need”)? Or is the business now more one of showing people how to go about finding good stuff, wherever it is (in the sense of “here’s a search strategy for finding what you need”), and helping the search engines see that stuff as good stuff?

Just think about this for a moment. If your service is only usable by members of your institution and only usable within the locked down confines of your local intranet, how useful is it?

When your students leave your institution, how many reusable skills are they taking away? How many people doing informal learning or working within SMEs have access to highly priced, subscription content? How useful is the content in those archives anyway? How useful are “academic information skills” to non-academics and non-students? (I’m just asking the question…;-)

And some more: do academic courses set people up for life outside? Irrespective of whether they do or not, does the library serve students on those courses well within the context of their course? Does the library provide students with skills they will be able to use when they leave the campus and go back to the real world and live with Google? (“Back to”? Hah – I wonder how much traffic on HEI networks is launched by people clicking on links from pages that sit on the google.com domain?) Should libraries help students pass their courses, or give them skills that are useful after graduation? Are those skills the same skills? Or are they different skills (and if so, are they compatible with the course related skills?)?

Here’s where SEO comes in – help people find the good stuff by improving the likelihood that it will be surfaced on the front page of a relevant web search query. For example, “how to cite an article“. (If you click through, it will take you to a Google results page for that query. Are you happy with the results? If not, you need to do one of two things – either start to promote third party resources you do like from your website (essentially, this means you’re doing off-site SEO for those resources) OR start to do onsite and offsite SEO on resources you want people to find on your own site.)

(If you don’t know what I’m talking about, you’re well on the way to admitting that you don’t understand how web search engines work. Which is a good first step… because it means you’ve realised you need to learn about it…)

As to how to go about it, I’d suggest one way is to get a better understanding of how people actually use library or course websites. (Another is Realising the Value of Library Data and finding ways of mining behavioural data to build recommendation engines that people might find useful.)

So to start off – find out what search terms are the most popular in terms of driving traffic to your Library website (ideally relating to some sort of resource on your site, such as a citation guide, or a tutorial on information skills); run that query on Google and see where your page comes in the results listing. If it’s not at the top, try to improve its ranking. That’s all…
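If you wanted to automate that check, here’s a rough sketch that reuses the (same, now rather creaky) Google AJAX Search API paging trick from the CSE post above to find where a given domain first shows up for a query – the positions it reports won’t exactly match what you see on google.co.uk, so treat it as indicative only:

```python
# Rough sketch: where does a given site first appear in Google's results for a query?
# Reuses the AJAX Search API paging trick; positions are indicative only.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://ajax.googleapis.com/ajax/services/search/web"

def first_position(query, domain, max_results=32):
    """Return the 1-based position of the first result whose URL mentions `domain`."""
    position = 0
    for start in range(0, max_results, 8):
        params = urlencode({"v": "1.0", "q": query, "rsz": "large", "start": start})
        with urlopen(API_URL + "?" + params) as resp:
            results = (json.load(resp).get("responseData") or {}).get("results") or []
        for r in results:
            position += 1
            if domain in r.get("unescapedUrl", ""):
                return position
    return None  # not in the first max_results hits

print(first_position("how to cite an article", "open.ac.uk"))
```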

For example, take a look at the following traffic (as collected by Google Analytics) coming in to the OU Library site over a short period some time ago.

A quick scan suggests that we maybe have some interesting content on “law cases” and “references”. For the “references” link, there’s a good proportion of new visitors to the OU site, and it looks from the bounce rate that half of those visited more than one page on the OU site. (We really should do a little more digging at this point to see what those people actually did on site, but this is just for argument’s sake, okay?!;-)

Now do a quick Google on “references” and what do we see?

On the first page, most of the links are relating to job references, although there is one citation reference near the bottom:

Leeds University library makes it in at 11 (at the time of searching, on google.co.uk):

So here would be a challenge – try to improve the ranking of an OU page on this results listing (or try to boost the Leeds University ranking). As to which OU page we could improve, first look at what Google thinks the OU library knows about references:

Now check that Google favours the page we favour for a search on “references” and, if it does, try to boost its ranking on the organic SERP. If Google isn’t favouring the page we want as its top hit on the OU site for a search on “references”, do some SEO to correct that (maybe we want “Manage Your References” to come out as the top hit?).

Okay, enough for now – in the next post on this topic I’ll look at the related issue of Search Engine Consequences, which is something that we’re all going to have to become increasingly aware of…

PS Ah, what the heck – here’s how to find out what the people who arrived at the Library website from a Google search on “references” were doing onsite. Create an advanced segment:

Google analytics advanced segment

(PS I first saw these and learned how to use them at a trivial level maybe 5 minutes ago;-)

Now look to see where the traffic came in (i.e. the landing pages for that segment):

Okay? The power of segmentation – isn’t it lovely:-)

We can also go back to the “All Visitors” segment, and see what other keywords people were using who ended up on the “How to cite a reference” page, because we’d possibly want to optimise for those, too.

Enough – time for the weekend to start :-)

PS if you’re not sure what techniques to use to actually “do SEO”, check on Academic Search Premier (or whatever it’s called), because Google and Google Blogsearch won’t return the right sort of information, will they?;-)

Realising the Value of Library Data

For anyone listening out there in library land who hasn’t picked up on Dave Pattern’s blog post from earlier today – WHY NOT? Go and read it, NOW: Free book usage data from the University of Huddersfield:

I’m very proud to announce that Library Services at the University of Huddersfield has just done something that would have perhaps been unthinkable a few years ago: we’ve just released a major portion of our book circulation and recommendation data under an Open Data Commons/CC0 licence. In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13 year period.

http://library.hud.ac.uk/usagedata/

I would like to lay down a challenge to every other library in the world to consider doing the same.

So are you going to pick up the challenge…?

And if not, WHY NOT? (Dave posts some answers to the first two or three objections you’ll try to raise, such as the privacy question and the licensing question.)

He also sketches out some elements of a possible future:

I want you to imagine a world where a first year undergraduate psychology student can run a search on your OPAC and have the results ranked by the most popular titles as borrowed by their peers on similar courses around the globe.

I want you to imagine a book recommendation service that makes Amazon’s look amateurish.

I want you to imagine a collection development tool that can tap into the latest borrowing trends at a regional, national and international level.

DON’T YOU DARE NOT DO THIS…
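Just to give a feel for what even the simplest “borrowers of this book also borrowed…” service built on circulation data might look like, here’s a toy sketch – the transaction data is invented for illustration and is not the format of the Huddersfield release:

```python
# Toy "borrowers of this book also borrowed..." sketch built from circulation
# transactions. The loan data here is invented for illustration - it is NOT the
# Huddersfield release format - but co-occurrence counting is the essence of
# this kind of recommender.
from collections import defaultdict
from itertools import combinations

# (borrower id, book title) pairs derived from loan records
loans = [
    ("b1", "Intro to Psychology"), ("b1", "Research Methods"),
    ("b2", "Intro to Psychology"), ("b2", "Research Methods"), ("b2", "Statistics"),
    ("b3", "Intro to Psychology"), ("b3", "Statistics"),
]

books_by_borrower = defaultdict(set)
for borrower, title in loans:
    books_by_borrower[borrower].add(title)

co_borrowed = defaultdict(lambda: defaultdict(int))
for titles in books_by_borrower.values():
    for a, b in combinations(sorted(titles), 2):
        co_borrowed[a][b] += 1
        co_borrowed[b][a] += 1

def recommend(title, n=2):
    """Titles most often borrowed by people who also borrowed `title`."""
    ranked = sorted(co_borrowed[title].items(), key=lambda kv: kv[1], reverse=True)
    return [t for t, _ in ranked[:n]]

print(recommend("Intro to Psychology"))
# -> ['Research Methods', 'Statistics'] (both co-borrowed twice in this toy data)
```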

See also a presentation Dave gave to announce this release – Can You Dig It? A Systems Perspective:

What else… Library website analytics – are you making use of them yet? I know the OU Library is collecting analytics on the OU Library website, although I don’t think they’re using them? (Knowing that you had x thousand page views last week is NOT INTERESTING. Most of them were probably people flailing round the site failing to find what they wanted? (And before anyone from the Library says that’s not true, PROVE IT TO ME – or at least to yourself – with some appropriate analytics reports.)) For example, I haven’t noticed any evidence of changes to the website or A/B testing going on as a result of using Googalytics on the site??? (Hmmm – that’s probably me in trouble again…!;-)

PS I’ve just realised I didn’t post a link to the Course Analytics presentation from Online Info last week, so here it is:

Nor did I mention the follow up podcast chat I had about the topic with Richard Wallis from Talis: Google Analytics to analyse student course activity – Tony Hirst Talks with Talis.

Or the “commendation” I got at the IWR Information Professional Award ceremony. I like to think this was for being the “unprofessional” of the year (in the sense of “unconference”, of course…;-). It was much appreciated, anyway :-)