Getting Lots of Results Out of a Google Custom Search Engine (CSE) via RSS

In Getting an RSS Feed Out of a Google Custom Search Engine (CSE), I described a Yahoo! pipe that can be used to get an RSS feed out of a Google Custom Search Engine using the Google AJAX Search API.

One of the limitations of the API appears to be that it only returns 8 search results at a time, although these can be paged.

So for example, if you run a normal Google search that returns lots of results, those results are presented over several results pages. If you hover over the links for the different pages, and look at the status bar at the bottom of your browser where the link is displayed, you’ll see that the URL for each page of results is largely the same; the difference is in the &start= argument in the URI, which says which numbered search result should appear at the top of the page; something like this:

The same argument – start – can be used to page the CSE results from the AJAX API, which means we can add it to the URI that calls the Google AJAX Search API within a pipe:

This gives us a quick fix for getting more than 8 results out of a CSE: use the get 8 CSE results starting at a given result pipe to get the first 8 results (counted as results 0..7), then another copy of the pipe to get results 9–16 (counted as 8..15 – i.e. starting at result 8), a third copy of the pipe to get results 17–24 (starting at result 16), and so on, and then aggregate all the results…
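For the code-minded, here’s a rough Python sketch of the same paging-and-aggregating trick outside of Pipes. The endpoint, parameter names and response layout are my reading of the AJAX Search API documentation rather than anything lifted from the pipe itself, so treat it as illustrative and double-check against the current docs:

```python
# A minimal sketch (in Python rather than Pipes) of the paging idea described
# above. The endpoint, parameter names and response layout are assumptions
# based on the Google AJAX Search API docs, so treat this as illustrative.
import json
import urllib.parse
import urllib.request

AJAX_SEARCH_URL = "http://ajax.googleapis.com/ajax/services/search/web"

def cse_page(query, cse_id, start):
    """Fetch one page of (up to 8) results from a CSE, starting at `start`."""
    params = urllib.parse.urlencode({
        "v": "1.0",        # API version
        "rsz": "large",    # supposedly bumps the page size from 4 to 8 results
        "q": query,
        "cx": cse_id,      # the Custom Search Engine ID
        "start": start,    # index of the first result on this page
    })
    with urllib.request.urlopen(f"{AJAX_SEARCH_URL}?{params}") as resp:
        data = json.load(resp)
    return data.get("responseData", {}).get("results", [])

def cse_results(query, cse_id, pages=4):
    """Aggregate several pages - results 0..7, 8..15, 16..23, and so on."""
    items = []
    for page in range(pages):
        items.extend(cse_page(query, cse_id, start=page * 8))
    return items

# Each result carries (at least) a title, URL and snippet, which is all an
# RSS item needs: title, link, description.
```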

Here’s an example – lots of Google CSE results as RSS pipe:

Notice that each CSE calling pipe is called with the same CSE ID and the same search query, but different start numbers.

This recipe hopefully also gives you the clue that you could use the Union pipe block to merge results from different CSEs (just make sure you use the right CSE ID and the right start values!).
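Outside of Pipes, the Union step is really just concatenating the result lists you get back from each CSE (and maybe dropping duplicate URLs along the way). A minimal sketch, assuming you already have some fetch function like the one above and that each result carries a url field:

```python
# A sketch of the Union step outside Pipes: merging result lists pulled from
# two (or more) different CSE IDs, dropping duplicate URLs along the way.
# `cse_results` below stands in for whatever fetch function you use - e.g. the
# paging sketch above, or a call out to the Yahoo pipe itself.

def union(*result_lists):
    """Concatenate result lists, keeping only the first occurrence of each URL."""
    seen, merged = set(), []
    for results in result_lists:
        for item in results:
            url = item.get("url")   # assumes each result dict has a "url" key
            if url not in seen:
                seen.add(url)
                merged.append(item)
    return merged

# e.g. union(cse_results(q, CSE_ID_1), cse_results(q, CSE_ID_2))
```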

Situated Video Advertising With Tesco Screens

In 2004, Tesco launched an in-store video service under the name Tesco TV as part of its Digital Retail Network service. The original service is described in TESCO taps into the power of satellite broadband to create a state-of-the-art “Digital Retail Network” and is well worth a read. A satellite delivery service provided “news and entertainment, as well as promotional information on both TESCO’s own products and suppliers’ branded products” that was displayed on video screens around the store.
In order to make content as relevant as possible (i.e. to maximise the chances of it influencing a purchase decision;-), the content was zoned:

Up to eight different channels are available on TESCO TV, each channel specifically intended for a particular zone of the store. The screens in the Counters area, for instance, display different content from the screens in the Wines and Spirits area. The latest music videos are shown in the Home Entertainment department and Health & Beauty has its own channel, too. In the Cafe, customers can relax watching the latest news, sports clips, and other entertainment programs.

I’d have loved to have seen the control room:

Remote control from a central location of which content is played on each screen, at each store, in each zone, is an absolute necessity. One reason is that advertisers are only obligated to pay for their advertisements if they are shown in the contracted zones and at the contracted times.

In parallel to the large multimedia files, smaller files with the scripts and programming information are sent to all branches simultaneously or separately, depending on what is required. These scripts are available per channel and define which content is played on which screen at which time. Of course, it is possible to make real-time changes to the schedule enabling TESCO to react within minutes, if required.

In 2006, dunnhumby, the company that runs the Tesco Clubcard service and that probably knows more about your shopping habits at Tesco than you do, won the ad sales contract for Tesco TV’s “5,000 LCD and plasma screens across 100 Tesco Superstores and Extra outlets”. Since then, it has “redeveloped the network to make it more targeted, so that it complements in-store marketing and ties in with above-the-line campaigns”, renaming Tesco TV as Tesco Screens in 2007 as part of that effort (Dunnhumby expands Tesco TV content, dunnhumby relaunches Tesco in-store TV screens). Apparently, “[a]ll campaigns on Tesco Screens are analysed with a bespoke control group using EPOS and Clubcard data.” (If you’ve read any of my previous posts on the topic (e.g. The Tesco Data Business (Notes on “Scoring Points”)) you’ll know that dunnhumby excels at customer profiling and targeting.)

Now I don’t know about you, but dunnhumby’s apparent reach and ability to influence millions of shoppers at points of weakness is starting to scare me…(as well as hugely impressing me;-)

On a related note, it’s not just Tesco that use video screen advertising, of course. In Video, Video, Everywhere…, for example, I described how video advertising has now started appearing throughout the London Underground network.

So with the growth of video advertising, it’s maybe not so surprising that Joel Hopwood, one of the management team behind Tesco Screens Retail Media Group, should strike out with a start-up: Capture Marketing.

[Capture Marketing] may well be the first agency in the UK to specialise in planning, buying and optimising Retail Media across all retailers – independent of any retailer or media owner!!

They aim to buy from the likes of dunnhumby, JCDecaux, Sainsbury, Asda Media Centre etc in order to give clients a single, independent and authoritative buying and planning point for the whole sector. [DailyDOOH: What On Earth Is Shopper Marketing?]

So what’s the PR strapline for Capture Marketing? “Turning insight into influence”.

If you step back and look at our marketing mix across most of the major brands, it’s clearly shifting, and it’s shifting to in-store, to the internet and to trial activity.
So what’s the answer? Marketing to shoppers. We’ll help you get your message to the consumer when they’re in that crucial zone, after they’ve become a shopper, but before they’ve made a choice. We’ll help you take your campaign not just outside the home, but into the store. Using a wide range of media vehicles, from digital screens to web favourite interrupts to targeted coupons, retail media is immediate, proximate, effective and measurable.

I have no idea where any of this is going… Do you? Could it shift towards making use of VRM (“vendor relationship management”) content, in which customers are able to call up content they desire to help them make a purchase decision (such as price, quality, or nutrition information comparisons)? After all, scanner apps are already starting to appear on Android phones (e.g. ShopSavvy) and the iPhone (Snappr), not to mention the ability to recognise books from their cover or music from the sound of it (The Future of Search is Already Here).

PS Just by the by, here’s some thoughts about how Tesco might make use of SMS:

PPS for a quick A-Z of all you need to know to start bluffing about video based advertising, see Billboards and Beyond: The A-Z of Digital Screens.

Are You Ready To Play “Search Engine Consequences”?

In the world of search, what happens when your local library categorises the film “Babe” as a teen movie in the “classic DVDs” section of the Library website? What happens if you search for “babe teen movie” on Google using any setting other than Safe Search set to “Use strict filtering”? Welcome, in a roundabout way, to the world of search engine consequences.

To set the scene, you need to know a few things. Firstly, how the web search engines know about your web pages in the first place. Secondly, how they rank your pages. And thirdly, how the ads and “related items” that annotate many search listings and web pages are selected (because the whole point about search engines is that they are run by advertising companies, right?).

So, how do the search engines know where your pages are? Any offers? You at the back there – how does Google know about your library home page?

As far as I know, there are three main ways:

  • a page that is already indexed by the search engine links to your page. When a search engine indexes a page, it makes a note of all the pages that are linked to from that page, so these pages can be indexed in turn by the search engine crawler (or spider). (If you have a local search engine, it will crawl your website in a similar way, as this documentation about the Google Search Appliance crawler describes.)
  • the page URL is listed in a Sitemap, and your website manager has told the search engine where that Sitemap lives. The Sitemap lists all the pages on your site that you want indexing (there’s a minimal sketch of one just after this list). This helps the search engine out – it doesn’t have to crawl your site looking for all the pages – and it helps you out: you can tell the search engine how often the page changes, and how often it needs to be re-indexed, for example.
  • you tell the search engine at a page level that the page exists. For example, if your page includes any Google Adsense modules, or Google Analytics tracking codes, Google will know that page exists the first time it is viewed.
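To make the Sitemap idea concrete, here’s a minimal sketch of one being generated. The element names come from the sitemaps.org protocol; the URLs and change frequencies are made up for illustration:

```python
# A minimal sketch of what a Sitemap actually is: a small XML file listing the
# pages you want indexed, plus hints like how often each one changes. The URLs
# below are made up; the element names come from the sitemaps.org protocol.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: list of (url, changefreq) tuples."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, changefreq in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

print(build_sitemap([
    ("http://library.example.ac.uk/", "daily"),
    ("http://library.example.ac.uk/help/citing.html", "monthly"),
]))
```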

And once a page has been indexed by a search engine, it becomes a possible search result within that search engine.

So when someone makes a query, how are the results selected?

During the actual process of indexing, the search engine does some voodoo magic to try and understand what the page is about. This might be as simple as counting the occurrence of every different word on the page, or it might be an actual attempt to understand what the page is about using all manner of heuristics and semantic engineering approaches. Pages are deemed a “hit” for a particular query if the search terms can be used to look up the page in the index. The hits are then rank ordered according to a score for each page, calculated according to whatever algorithm the search engine uses. Typically this is some function of both “relevance” and “quality”.

“Relevance” is identified in part by comparing how the page has been indexed compared to the query.

“Quality” often relates to how well regarded the page is in the greater scheme of things; for example, link analysis identifies how many other pages link to the page (where we assume a link is some sort of vote of confidence for the page) and clickthrough analysis monitors how many times people click through on a particular result for a particular query in a search engine results listing (as search engine companies increasingly run web analytics packages too, they can potentially factor this information back in to calculating how satisfied a user was with a particular page).
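As a toy illustration of that “relevance times quality” idea – and nothing like any real engine’s algorithm – imagine a crude index of term counts per page, plus a made-up quality score per page standing in for link and clickthrough analysis:

```python
# A toy illustration of the "relevance x quality" idea. Relevance here is a
# crude count of query-term occurrences; quality is a made-up per-page score
# standing in for link analysis and clickthrough analysis.
from collections import Counter

pages = {
    "how-to-cite": "how to cite a reference in an essay cite sources reference list",
    "opening-hours": "library opening hours contact the library help desk",
}
quality = {"how-to-cite": 0.9, "opening-hours": 0.4}  # e.g. from link analysis

index = {page: Counter(text.split()) for page, text in pages.items()}

def score(page, query_terms):
    relevance = sum(index[page][t] for t in query_terms)
    return relevance * quality[page]

def search(query):
    terms = query.lower().split()
    hits = [p for p in index if any(index[p][t] for t in terms)]
    return sorted(hits, key=lambda p: score(p, terms), reverse=True)

print(search("cite a reference"))  # ['how-to-cite']
```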

So what are the consequences of publishing content to the web, and letting a search engine index it?

At this point it’s worth considering not just web search engines, but also image and video search engines. Take Youtube, for example. Posting a video to Youtube means that Youtube will index it. And one of the nice things about Youtube is that it will index your video according to at least the title, description and tags you add to the movie, and use this as the basis for recommending other “related” videos to you. You might think of this in terms of Youtube indexing your video in a particular way, and then running a query for other videos using those index terms as the search terms.

Bearing that in mind, a major issue is that you can’t completely control where a page might turn up in a search engine results listing or what other items might be suggested as “related items”. If you need to be careful managing your “brand” online, or you have a duty of care to your users, being associated with inappropriate material in a search engine results listing can be a risky affair.

To try and ground this with a real world example, check out David Lee King’s post from a couple of weeks ago on YouTube Being Naughty Today. It tells a story of how “[a] month or so ago, some of my library’s teen patrons participated in a Making Mini Movie Masterpieces program held at my library. Cool program!”

One of our librarians just posted the videos some of the teens made to YouTube … and guess what? In the related videos section of the video page (and also on the related videos flash thing that plays at the end of an embedded YouTube video), some … let’s just say “questionable” videos appeared.

Here’s what I think happened: YouTube found “similar” videos based on keywords. And the keywords it found in our video include these words in the title and description: mini teen teens . Dump those into YouTube and you’ll unfortunately find some pretty “interesting” videos.

And here’s how I commented on the post:

If you assume everything you publish on the web will be subject to simple term extraction or semantic term extraction, then in a sense it becomes a search query crafted around those terms that will potentially associate the results of that “emergent query” with the content itself.

One of the functions of the Library used to be classifying works so they could be discovered. Maybe now there is a need for understanding how machines will classify web published content so we can try to guard against “consequential annotations”?

For a long time I’ve thought one role for the Library going forwards is SEO – and raising the profile of the host institution’s content in the dominant search engines. But maybe an understanding of SEO is also necessary in a *defensive* capacity?

This need for care is particularly relevant if you run Adsense on your website. Adsense works in a conceptually similar way to the Youtube related videos service, so you might think of it in the following terms: the page is indexed, key terms are identified and run as “queries” on the Adsense search engine, returning “relevant ads” as a result. “Relevance” in this case is calculated (in part) based on how much the advertiser is willing to pay for their advert to be returned against a particular search term or in the context of a page that appears to be about a particular topic.
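Here’s a toy sketch of that flow – emphatically not the actual Adsense algorithm – in which the page’s most frequent words are treated as a query against a small ad inventory, and candidate ads are ranked by term overlap weighted by advertiser bid:

```python
# A toy sketch of the idea described above, not the actual Adsense algorithm:
# key terms pulled from the page are run as a "query" against an ad inventory,
# and candidate ads are ranked by term overlap weighted by advertiser bid.
from collections import Counter

ads = [
    {"text": "cheap car insurance quotes",    "bid": 2.50},
    {"text": "teen movie dvd box sets",       "bid": 0.40},
    {"text": "reference management software", "bid": 1.10},
]

def page_terms(page_text, top_n=10):
    """Crude 'indexing': the page's most frequent words stand in for its topic."""
    return {word for word, _ in Counter(page_text.lower().split()).most_common(top_n)}

def pick_ads(page_text, k=2):
    terms = page_terms(page_text)
    def ad_score(ad):
        overlap = len(terms & set(ad["text"].split()))  # crude "relevance"
        return overlap * ad["bid"]                       # weighted by the bid
    return sorted(ads, key=ad_score, reverse=True)[:k]

print(pick_ads("how to manage your reference list and cite a reference"))
```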

Although controls are starting to appear that give the page publisher an element of control over what ads appear, there is still uncertainty in the equation, as there is whenever your content appears alongside content that is deemed “relevant” or related in some way – whether that’s in the context of a search engine results listing, an Adsense placement, or a related video.

So one thing to keep in mind – always – is how might your page be indexed, and what sorts of query, ad placement or related item might it be the perfect match for? What are the search engine consequences of writing your page in a particular way, or including particular key words in it?

David’s response to my comment identified a major issue here: “The problem is SO HUGE though, because … at least at my library’s site … everyone needs this training. Not just the web dudes! In my example above, a youth services librarian posted the video – she has great training in helping kids and teens find stuff, in running successful programs for that age group … but probably not in SEO stuff.”

I’m not sure I have the answer, but I think this is another reason Why Librarians Need to Know About SEO – not just so they can improve the likelihood of content they do approve of appearing in search engine listings or as recommended items, but also so they can defend against unanticipated SEO, where they unconsciously optimise a page so that it fares well on an unanticipated, or unwelcome, search.

What that means is, you need to know how SEO works so you don’t inadvertently do SEO on something you didn’t intend optimising for; or so you can institute “SEO countermeasures” to try to defuse any potentially unwelcome search engine consequences that might arise for a particular page.

Library Analytics (Part 8)

In Library Analytics (Part 7), I posted a couple of ideas about how it might be an idea if the Library started crafting URLs for the Library resources pages for individual courses in the Moodle VLE that contained a campaign tracking code, so that we could track the behaviour of students coming into the Library site by course.

From a quick peek at a handful of courses in the VLE, that recommendation either doesn’t appear to have been taken up, or it’s just “too hard” to do, so that’s another couple of months’ data we don’t have easy access to in the Google Analytics environment. (Or maybe the Library have moved over to using the OU’s Site Analytics service for this sort of insight?)

Just to recall, we need to put some sort of additional measures in place because Moodle generates crappy URLs (e.g. URLs of the form http://learn.open.ac.uk/mod/resourcepage/view.php?id=119070) and crafting nice URLs or using mod_rewrite (or similar) definitely is way too hard for the VLE’n’network people to manage;-) The default setup of Google Analytics dumps everything after the “?”, unless they are official campaign tracking arguments or are captured otherwise.
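Just to make the stopgap concrete: the campaign tracking arguments Google Analytics does keep are the standard utm_ parameters, so the course-tagged links suggested in Part 7 might look something like the sketch below (the source/medium/campaign values are hypothetical, not an agreed convention). Note utm_content, which is exactly what you’d use to tell apart two links on the same VLE page pointing at the same Library page:

```python
# A sketch of the sort of campaign-tagged links suggested in Part 7, using
# Google Analytics' standard campaign parameters. The parameter *names* are
# GA's own; the source/medium/campaign values below are hypothetical.
from urllib.parse import urlencode

def campaign_url(target, course_code, link_label):
    params = {
        "utm_source": "vle",                  # where the traffic comes from
        "utm_medium": "moodle-resource-page",
        "utm_campaign": course_code,          # e.g. the course the page belongs to
        "utm_content": link_label,            # distinguishes links to the same page
    }
    return f"{target}?{urlencode(params)}"

print(campaign_url("http://library.open.ac.uk/find/eresources/index.cfm",
                   "M882-08K", "ejournals-tma01"))
```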

(From a quick scan of Google Analytics Tracking API, I’m guessing that setting pageTracker._setCampSourceKey(“id”); in the tracking code on each Library web page might also capture the id from referrer URLs? Can anyone confirm/deny that?)

Aside: from what I’ve been told, I don’t think we offer server side compression for content served from most http://www.open.ac.uk/* sites, either (though I haven’t checked)? Given that there are still a few students on low bandwidth connections and relatively modern browsers, this is probably an avoidable breach of some sort of accessibility recommendation? For example, over the last 3 weeks or so, here’s the number of dial-up visits to the Library website:

A quick check of the browser stats shows that IE breaks down almost entirely into IE6 and above, all of which cope with compressed files, I think?
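For what it’s worth, the compression question is easy enough to check: ask for a page with gzip allowed and see whether the response comes back marked as compressed. A quick sketch:

```python
# A quick way to check the "I haven't checked" aside above: request a page
# with gzip allowed and see whether the server says it compressed the response.
import urllib.request

def serves_gzip(url):
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Content-Encoding") == "gzip"

print(serves_gzip("http://www.open.ac.uk/"))
```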

[Clarification (?! heh heh) re: dial-in stats – “when you’re looking at the dial-up use of the Library website is that we have a dial-up PC in the Library to replicate off-campus access and to check load times of our resources. So it’s probably worth filtering out that IP address (***.***.***.***) to cut out library staff checking out any problems as this will inflate the perceived use of dial-up by our students. Even if we’ve only used it once a day then that’s a lot of hits on the website that aren’t really students using dial-up” – thanks, Clari :-)]

Anyway – back to the course tracking: as a stop gap, I created a few of my own reports that use a user defined argument corresponding to the full referrer URL:

We can then view reports according to this user defined segment to see which VLE pages are sending traffic to the Library website:

Clicking through on one of these links gives a report for that referrer URL, and then it’s easy to see which landing pages the users are arriving at (and by induction, which links on the VLE page they clicked on):

If we look at the corresponding VLE page:

Then we can say that the analytics suggest that the Open University Library – http://library.open.ac.uk/, the Online collections by subject – http://library.open.ac.uk/find/eresources/index.cfm and the Library Help & Support – http://library.open.ac.uk/about/index.cfm?id=6939 are the only links that have been clicked on.

[Ooops… “Safari & Info Skills for Researchers are our sites, but don’t sit within the library.open.ac.uk domain (www.open.ac.uk/safari and www.open.ac.uk/infoskills-researchers respectively) and the Guide to Online Information Searching in the Social Sciences is another Moodle site.” – thanks Clari:-) So it may well be that people are clicking on the other links… Note to self – if you ever see 0 views for a link, be suspicious and check everything!]

(Note that I have only reported on data from a short period within the lifetime of the course, rather than data taken from over the life of the course. Looking at the incidence of traffic over a whole course presentation would also give an idea of when during the course students are making use of the Library resource page within the course.)

Another way of exploring how VLE referrer traffic is impacting on the Library website is to look at the most popular Landing pages and then see which courses (from the user defined segment) are sourcing that traffic.

So for example, here are the VLE pages that are driving traffic to the elluminate registration page:

One VLE page seems responsible:

Hmmm… ;-)

How about the VLE pages driving traffic to the ejournals page?

And the top hit is….

… the article for question 3 on TMA01 of the November 2008 presentation of M882.

The second most popular referrer page is interesting because it contains two links to the Library journals page:


Unfortunately, there’s no way of disambiguating which link is driving the tracking – which is one good reason why a separate campaign related tracking code should be associated with each link.

(Do you also see the reference to Google books in there? Heh heh – surely they aren’t suggesting that students try to get what they need from the book via the Google books previewer?!;-)

Okay – enough for now. To sum up, we have the opportunity to provide two sorts of report – one for the Library to look at how VLE sourced traffic as a whole impacts on the Library website; and a different set of reports that can be provided to course teams and course link librarians to show how students on the course are using the VLE to access Library resources.

PS if you haven’t yet watched Dave Pattern’s presentation on mining lending data records, do so NOW: Can You Dig It? A Systems Perspective.

Revisiting the Library Flip – Why Librarians Need to Know About SEO

What does information literacy mean in the age of web search engines? I’ve been arguing for some time (e.g. in The Library Flip) that one of the core skills going forward for those information professionals who “help people find stuff” is going to be SEO – search engine optimisation. Why? Because increasingly people are attuned to searching for “stuff” using a web search engine (you know who I’m talking about…;-); and if your “stuff” doesn’t appear near the top of the organic results listing (or in the paid for links) for a particular query, it might as well not exist…

Whereas once academics and students would have traipsed into the library to ask one of the High Priestesses to perform some magical incantation on a Dialog database through a privileged access terminal, for many people research now starts with a G. Which means that if you want your academics and students to find the content that you’d recommend, then you have to help get that content to the top of the search engine listings.

With the rate of content production growing to seventy three tera-peta-megabits a second, or whatever it is, does it make sense to expect library staffers to know what the good content is, any more (in the sense of “here, read this – it’s just what you need”)? Does it even make sense to expect people to know where to find it (in the sense of “try this database, it should contain what you need”)? Or is the business now more one of showing people how to go about finding good stuff, wherever it is (in the sense of “here’s a search strategy for finding what you need”) and helping the search engines see that stuff as good stuff?

Just think about this for a moment. If your service is only usable by members of your institution and only usable within the locked down confines of your local intranet, how useful is it?

When your students leave your institution, how many reusable skills are they taking away? How many people doing informal learning or working within SMEs have access to highly priced, subscription content? How useful is the content in those archives anyway? How useful are “academic information skills” to non-academics and non-students? (I’m just asking the question…;-)

And some more: do academic courses set people up for life outside? Irrespective of whether they do or not, does the library serve students on those courses well within the context of their course? Does the library provide students with skills they will be able to use when they leave the campus and go back to the real world and live with Google? (“Back to”? Hah – I wonder how much traffic on HEI networks is launched by people clicking on links from pages that sit on the google.com domain?) Should libraries help students pass their courses, or give them skills that are useful after graduation? Are those skills the same skills? Or are they different skills (and if so, are they compatible with the course related skills?)?

Here’s where SEO comes in – help people find the good stuff by improving the likelihood that it will be surfaced on the front page of a relevant web search query. For example, “how to cite an article“. (If you click through, it will take you to a Google results page for that query. Are you happy with the results? If not, you need to do one of two things – either start to promote third party resources you do like from your website (essentially, this means you’re doing off-site SEO for those resources) OR start to do onsite and offsite SEO on resources you want people to find on your own site.)

(If you don’t know what I’m talking about, you’re well on the way to admitting that you don’t understand how web search engines work. Which is a good first step… because it means you’ve realised you need to learn about it…)

As to how to go about it, I’d suggest one way is to get a better understanding of how people actually use library or course websites. (Another is Realising the Value of Library Data and finding ways of mining behavioural data to build recommendation engines that people might find useful.)

So to start off – find out what search terms are the most popular in terms of driving traffic to your Library website (ideally relating to some sort of resource on your site, such as a citation guide, or a tutorial on information skills); run that query on Google and see where your page comes in the results listing. If it’s not at the top, try to improve its ranking. That’s all…
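Once you’ve got the ordered list of result URLs for a query – from the CSE/RSS pipe described earlier, or just copied off a results page – checking where your page comes is trivial. The URLs below are placeholders:

```python
# Once you have the ordered list of result URLs for a query (however obtained),
# checking "where does our page come?" is a one-liner. Placeholder URLs only.
def rank_of(domain, result_urls):
    """1-based position of the first result on `domain`, or None if absent."""
    for position, url in enumerate(result_urls, start=1):
        if domain in url:
            return position
    return None

results_for_references = [
    "http://example.com/job-references",
    "http://example.ac.uk/library/references",
    "http://library.open.ac.uk/help/references",
]
print(rank_of("library.open.ac.uk", results_for_references))  # 3
```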

For example, take a look at the following traffic (as collected by Google Analytics) coming in to the OU Library site over a short period some time ago.

A quick scan suggests that we maybe have some interesting content on “law cases” and “references”. For the “references” link, there’s a good proportion of new visitors to the OU site, and it looks from the bounce rate that half of those visited more than one page on the OU site. (We really should do a little more digging at this point to see what those people actually did on site, but this is just for argument’s sake, okay?!;-)

Now do a quick Google on “references” and what do we see?

On the first page, most of the links are relating to job references, although there is one citation reference near the bottom:

Leeds University library makes it in at 11 (at the time of searching, on google.co.uk):

So here would be a challenge – try to improve the ranking of an OU page on this results listing (or try to boost the Leeds University ranking). As to which OU page we could improve, first look at what Google thinks the OU library knows about references:

Now check that Google favours the page we favour for a search on “references” and, if it does, try to boost its ranking on the organic SERP. If Google isn’t favouring the page we want as its top hit on the OU site for a search on “references”, do some SEO to correct that (maybe we want “Manage Your References” to come out as the top hit?).

Okay, enough for now – in the next post on this topic I’ll look at the related issue of Search Engine Consequences, which is something that we’re all going to have to become increasingly aware of…

PS Ah, what the heck – here’s how to find out what the people who arrived at the Library website from a Google search on “references” were doing onsite. Create an advanced segment:

Google analytics advanced segment

(PS I first saw these and learned how to use them at a trivial level maybe 5 minutes ago;-)

Now look to see where the traffic came in (i.e. the landing pages for that segment):

Okay? The power of segmentation – isn’t it lovely:-)

We can also go back to the “All Visitors” segment, and see what other keywords people were using who ended up on the “How to cite a reference” page, because we’d possibly want to optimise for those, too.

Enough – time for the weekend to start :-)

PS if you’re not sure what techniques to use to actually “do SEO”, check on Academic Search Premier (or whatever it’s called), because Google and Google Blogsearch won’t return the right sort of information, will they?;-)

Realising the Value of Library Data

For anyone listening out there in library land who hasn’t picked up on Dave Pattern’s blog post from earlier today – WHY NOT? Go and read it, NOW: Free book usage data from the University of Huddersfield:

I’m very proud to announce that Library Services at the University of Huddersfield has just done something that would have perhaps been unthinkable a few years ago: we’ve just released a major portion of our book circulation and recommendation data under an Open Data Commons/CC0 licence. In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13 year period.

http://library.hud.ac.uk/usagedata/

I would like to lay down a challenge to every other library in the world to consider doing the same.

So are you going to pick up the challenge…?

And if not, WHY NOT? (Dave posts some answers to the first two or three objections you’ll try to raise, such as the privacy question and the licensing question.)

He also sketches out some elements of a possible future:

I want you to imagine a world where a first year undergraduate psychology student can run a search on your OPAC and have the results ranked by the most popular titles as borrowed by their peers on similar courses around the globe.

I want you to imagine a book recommendation service that makes Amazon’s look amateurish.

I want you to imagine a collection development tool that can tap into the latest borrowing trends at a regional, national and international level.

DON’T YOU DARE NOT DO THIS…

See also a presentation Dave gave to announce this release – Can You Dig It? A Systems Perspective:

What else… Library website analytics – are you making use of them yet? I know the OU Library is collecting analytics on the OU Library website, although I don’t think they’re using them? Knowing that you had x thousand page views last week is NOT INTERESTING. Most of them were probably people flailing round the site failing to find what they wanted? (And before anyone from the Library says that’s not true, PROVE IT TO ME – or at least to yourself – with some appropriate analytics reports.) For example, I haven’t noticed any evidence of changes to the website or A/B testing going on as a result of using Googalytics on the site??? (Hmmm – that’s probably me in trouble again…!;-)

PS I’ve just realised I didn’t post a link to the Course Analytics presentation from Online Info last week, so here it is:

Nor did I mention the follow up podcast chat I had about the topic with Richard Wallis from Talis: Google Analytics to analyse student course activity – Tony Hirst Talks with Talis.

Or the “commendation” I got at the IWR Information Professional Award ceremony. I like to think this was for being the “unprofessional” of the year (in the sense of “unconference”, of course…;-). It was much appreciated, anyway :-)

Arise Ye Databases of Intention

In what counts as one of my favourite business books (“The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture”), John Battelle sets the scene with a chapter entitled “The Database of Intentions”.

The Database of Intentions is simply this: the aggregate results of every search ever entered, every result list ever tendered, and every path taken as a result.

(Also described in this blog post: Database of Intentions).

The phrase “the database of intentions” is a powerful one; but whilst I don’t necessarily agree with the above definition any more (and I suspect Battelle’s thinking about his own definition of this term may also have moved on since then) I do think that the web’s ability to capture intentional data is being operationalised in a far more literal form than even search histories reveal.

Here’s something I remarked to myself on this topic a couple of days ago, following a particular Google announcement:

The announcement was this one: New in Labs: Tasks and it describes the release of a Task list (i.e. a to do list) into the Google Mail environment.

As Simon Perry notes in his post on the topic (“Google Tasks: Gold Dust Info For Advertisers”):

It’s highly arguable that no piece of information could be more valuable to Google that what your plans / tasks / desire are.

In the world of services driven by advertising in exchange for online services, this stuff is gold dust.

We’d imagine that Google will be smart enough not to place ads related to your todo list directly next to the list, as that could well freak people out. Don’t forget that as soon as Google know this info about you, they can place the adverts where ever and when ever they feel like.

“Don’t forget that as soon as Google know this info about you, they can place the adverts where ever and when ever they feel like.”… Visit 10 web pages that run advertising, and I would bet that close to the majority are running Google Adsense.

Now I know everybody knows this, but I suspect most people don’t…

How does Google use cookies to serve ads?
A cookie is a snippet of text that is sent from a website’s servers and stored on a web browser. Like most websites and search engines, Google uses cookies in order to provide a better user experience and to serve relevant ads. Cookies are set based on your viewing of web pages in Google’s content network and do not contain any information that can identify you personally. This information helps Google deliver ads that are relevant to your interests, control the number of times you see a given ad, and measure the effectiveness of ad campaigns. Anyone who prefers not to see ads with this level of customization can opt out of advertising cookies. This opt-out will be specific only to the browser that you are using when you click the “opt out” button. [Advertising Privacy FAQ]

Now I’m guessing that “your viewing of web pages in Google’s content network” includes viewing pages in GMail, pages which might include your “to do” list… So a consequence of adding an item to your Task list might be an advert that Google serves to you next time you do a search on Google or visit a site running Google Ads.

Here’s another way of thinking about those ads: as predictive searches. Google knows what you intend to do from your “to do” list, so in principle it can look at what people with similar “to do” items search for, and serve the most common of these up next time you go to http://getitdone.google.com. Just imagine it – whereas you go to google.com and see an empty search box, you go to getitdone.google.com and you load a page that has guessed what query you were going to make, and runs it for you. So if the item on your task list for Thursday afternoon was “buy car insurance”, you’ll see something like this:

Heh, heh ;-) Good old Google – just being helpful ;-)

Explicit intention information is not just being handed over to Google in increasingly literal and public ways, of course. A week or so ago, John Battelle posted the following (Shifting Search from Static to Real-time):

I’ve been mulling something that keeps tugging at my mind as a Big Idea for some time now, and I may as well Think Out Loud about it and see what comes up.

To summarize, I think Search is about to undergo an important evolution. It remains to be seen if this is punctuated equilibrium or a slow, constant process (it sort of feels like both), but the end result strikes me as extremely important: Very soon, we will be able to ask Search a very basic and extraordinarily important question that I can best summarize as this: What are people saying about (my query) right now?
Imagine AdSense, Live. …

[I]magine a service that feels just like Google, but instead of gathering static web results, it gathers liveweb results – what people are saying, right now (or some approximation of now – say the past few hours or so) … ? And/or, you could post your query to that engine, and you could get realtime results that were created – by other humans – directly in response to you? Well, you can get a taste of what such an engine might look like on search.twitter.com, but that’s just a taste.

A few days later, Nick Bilton developed a similar theme (The Twitter Gold Mine & Beating Google to the Semantic Web):

Twitter, potentially, has the ability to deliver unbelievably smart advertising; advertising that I actually want to see, and they have the ability to deliver search results far superior and more accurate to Google, putting Twitter in the running to beat Google in the latent quest to the semantic web. With some really intelligent data mining and cross pollination, they could give me ads that makes sense not for something I looked at 3 weeks ago, or a link my wife clicked on when she borrowed my laptop, but ads that are extremely relevant to ‘what I’m doing right now’.

If I send a tweet saying “I’m looking for a new car does anyone have any recommendations”, I would be more than happy to see ‘smart’ user generated advertising recommendations based on my past tweets, mine the data of other people living Brooklyn who have tweeted about their car and deliver a tweet/ad based on those result leaving spammers lost in the noise. I’d also expect when I send a tweet saying ‘I got a new car and love it!’ that those car ads stop appearing and something else, relevant to only me, takes its place.

(See also Will Lack of Relevancy be the Downfall of Google?, where I fumble around a thought or two about whether Google will lose out on identifying well liked content because links are increasingly being shared in real time in places that Google doesn’t index.)

It seems to me that To Do/Task lists, Calendars, search queries and tweets all lie somewhere different along a time vs. commitment graph of our intentions. The to do list is something you plan to do; searching on Google or Amazon is an action executed in pursuit of that goal. Actually buying your car insurance is completing the action.

Things like wishlists also blend in our desires. Calendars, click-thrus and actual purchases all record what might be referred to as commitments (a click thru on an ad could be seen as a very weak commitment to buy, for example; an event booked into your calendar is a much stronger commitment; handing over your credit card and hitting the “complete transaction” button is the strongest commitment you can make regarding a purchase decision).

Way back when, I used to play with software agents, constructed according to a BDI (“beady eye”) model – Beliefs, Desires and Intentions. I also looked at agent teams, and the notion of “joint persistent goals” (I even came up with a game for them to play – “DIFFOBJ – A Game for Exercising Teams of Agents” – although I’m not sure the logic was sound!). Somewhere in there is the basis of a logic for describing the Database of Intentions, and the relationship between an individual with a goal and an engine that is trying to help the searcher achieve that goal, whether by serving them with “organic” content or paid for content.

PS I don’t think I’ve linked to this yet? All about Google

It’s worth skimming through…

PPS See also Status and Intent: Twoogle, in which I idly wonder whether a status update from Google just before I start searching there could provide explicit intent information to Google about the sort of thing I want to achieve from a particular search.

More Remarks on the Tesco Data Play

A little while ago, I posted some notes I’d made whilst reading “Scoring Points”, which looked at the way Tesco developed its ClubCard business and started using consumer data to improve a whole range of operational and marketing functions within the Tesco operation (The Tesco Data Business (Notes on “Scoring Points”)). For anyone who’s interested, here are a few more things I managed to dig up on Tesco’s data play, and their relationship with Dunnhumby, who operate the service.

[UPDATE – most of the images were removed from this post because I got a take down notice from Dunnhumby’s lawyers in the US…]

Firstly, here’s a couple of snippets from a presentation by Giles Pavey, Head of Analysis at dunnhumby, presented earlier this year. The first thing to grab me was this slide summarising how to turn data into insight, and then $$$s (the desired result of changing customer behaviour from less, to more profitable!):

In the previous post, I mentioned how Tesco segments shoppers according to their “lifestyle profile”. This is built up from the data a shopper generates – what they buy, when they buy it, and what stories you can tell about them as a result.

So how well does Tesco know you, for example?

(I assume Tesco knows Miss Jones drives to Tesco on a Saturday because she uses her Clubcard when topping up on fuel at the Tesco petrol station…).

Clustering shopped for items in an appropriate way lets Tesco identify the “Lifestyle DNA” of each shopper:

(If you self-categorise according to those meaningful sounding lifestyle categories, I wonder how well it would match the profile Tesco has allocated to you?!)

It’s quite interesting to see what other players in the area think is important, too. One way of doing this is to have a look around at who else is speaking at the trade events Giles Pavey turns up at. For example, earlier this year there was a day of impressive-looking talks at The Business Applications of Marketing Analytics.

Not sure what “Marketing Analytics” are? Maybe you need to become a Master of Marketing Analysis to find out?! Here’s what appears to be involved:

The course website also features an interview with three members of dunnhumby: Orlando Machado (Head of Insight Analysis), Martin Hayward (Director of Strategy) and Giles Pavey (head of Customer Insight) [view it here].

You can see/hear a couple more takes on dunnhumby here:
Martin Hayward, Director of Consumer Strategy and Futures at dunnhumby on the growth of dunnhumby;
Life as an “intern” at dunnhumby.

And here’s another event that dunnhumby presented at: The Future of Geodemographics – 21st Century datasets and dynamic segmentation: New methods of classifying areas and individuals. Although the dunnhumby presentation isn’t available for download, several others are. I may try to pull out some gems from them in a later post, but in the meantime, here are some titles to try to tease you into clicking through and maybe pulling out the nuggets, and adding them as comments to this post, yourself:
Understanding People on the Move in London (I’m guessing this means “Oyster card tracking”?!);
Geodemographics and Privacy (something we should all be taking an interest in?);
Real Time Geodemographics – New Services and Business Opportunities from Analysing People in Time and Space: real-time? Maybe this ties in with things like behavioural analytics and localised mobile phone tracking in shopping centres?

So what are “geodemographics” (or “geodems”, as they’re known in the trade;-)? No idea – but I’m guessing it’s the demographics of particular locales?

Here’s one of the reasons why Tesco are interested, anyway:

And finally (for now at least…) it seems that Tesco and dunnhumby may be looking for additional ways of using Clubcard data, in particular for targeted advertising:

Tesco is working with Dunnhumby, the marketing group behind Tesco Clubcard, to integrate highly targeted third-party advertising across Tesco.com when the company’s new-look site launches next year.
Jean-Pierre Van Lin, head of markets at Dunnhumby, explained to NMA that, once a Clubcard holder had logged in to the website, data from their previous spending could be used to select advertising of specific relevance to that user.
[Ref: Tesco.com to use Clubcard data to target third-party advertising (thanks, Ben:-)]

Now I’m guessing that this will represent a change in the way the data has been used to date – so I wonder, have Tesco ClubCard Terms and Conditions changed recently?

Looking at the global reach of dunnhumby, I wonder whether they’re building capacity for a global targeted ad service, via the back door?

Does it matter, anyway, if profiling data from our offline shopping habits are reconciled with our online presence?

In “Diving for Data” (Supermarket News, 00395803, 9/26/2005, Vol. 53, Issue 39), Lucia Moses reports that the Tesco Clubcard in the UK “boasts 10 million households and captures 85% of weekly store sales”, along with 30% of UK food sales. The story in the US could soon be similar, where dunnhumby works with Kroger to analyse “6.5 million top shopper households” (identified as the “slice of the total 42 million households that visit Kroger stores that drive more than 50% of sales”). With “Kroger claim[ing] that 40% of U.S. households hold one of its cards”, does dunnhumby’s “goal … to understand the customer better than anyone” rival Google in its potential for evil?!

OpenLearn WordPress Plugins

Just before the summer break, I managed to persuade Patrick McAndrew to use some of his OpenLearn cash to get a WordPress-MU plugin built that would allow anyone to republish OpenLearn materials across a set of WordPress Multi-User blogs. A second WordPress plugin was commissioned that would allow any learners happening by the blogs to subscribe to those courses using “Daily feeds” that would deliver course material to them on a daily basis.

The plugins were coded by Greg Gaughan at Isotoma, and tested by Jim and D’Arcy, among others… I haven’t acted on your feedback yet – sorry, guys…:-( For all manner of reasons, I didn’t post the plugins (partly because I wanted to do another pass on usability/pick up on feedback, but mainly because I wanted to set up a demo site first)… but I still haven’t done that… so here’s a link to the plugins anyway, in case anyone fancies having a play over the next few weeks: OpenLearn WordPress Plugins.

I’ll keep coming back to this post – and the download page – to add in documentation and some of the thoughts and discussions we had about how to evolve the WPMU plugin workflow/UI etc, as well as the daily feeds widget functionality.

In the meantime, here’s the minimal info I gave the original testers:

The story is:
– one ‘openlearn republisher’ plugin, that will take the URL of an RSS feed describing OpenLearn courses (e.g. on the Modern Languages page, the RSS: Modern Languages feed), and suck those courses into WPMU, one WPMU blog per course, via the full content RSS feed for each course.

– one “daily feeds” widget; this can be added to any WP blog and should provide a ‘daily’ feed of the content from that blog, that sends e.g. one item per day to the subscriber from the day they subscribe. The idea here is if a WP blog is used as a content publishing sys for ‘static’, unchanging content (e.g. a course, or a book, where each chapter is a post, or a fixed length podcast series), users can still get it delivered in a paced/one-per-day fashion. This widget should work okay…
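The widget itself is a WordPress plugin (so PHP), but the pacing logic it implements is simple enough to sketch: treat the blog’s posts as a fixed, ordered sequence and send a subscriber item N on the Nth day after they subscribed. Something like:

```python
# Not the plugin code itself - just a sketch of the pacing logic described
# above: treat the blog's posts as a fixed, ordered sequence and serve a
# subscriber item N on the Nth day after the day they subscribed.
from datetime import date

def items_due(posts, subscribed_on, today=None):
    """Return the posts a subscriber should have received so far, one per day."""
    today = today or date.today()
    days_elapsed = (today - subscribed_on).days
    return posts[: max(0, days_elapsed + 1)]  # day 0 gets the first item

course_posts = ["Unit 1", "Unit 2", "Unit 3", "Unit 4"]
print(items_due(course_posts, subscribed_on=date(2008, 12, 1),
                today=date(2008, 12, 3)))  # ['Unit 1', 'Unit 2', 'Unit 3']
```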

Here’s another link to the page where you can find the downloads: OpenLearn WordPress Plugins. Enjoy – all comments welcome. Please post a link back here if you set up any blogs using either of these two plugins.

Decoding Patents – An Appropriate Context for Teaching About Technology?

A couple of nights ago, as I was having a rummage around the European patent office website, looking up patents by company to see what the likes of Amazon, Google, Yahoo and, err, Technorati have been filing recently, it struck me that IT and engineering courses might be able to use patents in much the same way that Business Schools use Case Studies as a teaching tool (e.g. Harvard Business Online: Undergraduate Course Materials)?

This approach would seem to offer several interesting benefits:

  • the language used in patents is opaque – so patents can be used to develop reading skills;
  • the ideas expressed are likely to come from a commercial research context; with universities increasingly tasked with taking technology transfer more seriously, looking at patents situates theoretical understanding in an application area, as well as providing the added advantage of transferring knowledge into the ivory tower, too, and maybe influencing curriculum development as educators try to keep up with industrial inventions;-)
  • many patents locate an invention within both a historical context and a systemic context;
  • scientific and mathematical principles can be used to model or explore ideas expressed in a patent in more detail, and in the “situated” context of an expression or implementation of the ideas described within the patent.

As an example of how patents might be reviewed in an uncourse blog context, see one of my favourite blogs, SEO by the SEA, in which Bill Slawski regularly decodes patents in the web search area.

To see whether there may be any mileage in it, I’m going to keep an occasional eye on patents in the web area over the next month or two, and see what sort of response they provoke from me. To make life easier, I’ve set up a pipe to scrape the search results for patents issued by company, so I can now easily subscribe to a feed of new patents issued by Amazon, or Yahoo, for example.

You can find the pipe here: European Patent Office Search by company pipe.

I’ve also put several feeds into an OPML file on Grazr (Web2.0 new patents), and will maybe look again at the styling of my OPML dashboard so I can use that as a display surface (e.g. Web 2.0 patents dashboard).