Writing Diagrams

One of the reasons I don’t tend to use many diagrams in the OUseful.info blog is that I’ve always been mindful that the diagrams I do draw rarely turn out how I want them to (the process of converting a mind’s eye vision to a well executed drawing always fails somewhere along the line, I imagine in part because I’ve never really put the time into practising drawing, even with image editors, drawing packages, and so on).

Which is one reason why I’m always on the lookout for tools that let me write the diagram (e.g. Scripting Diagrams).

So for example, I’m very fond of Graphviz, which I can use to create network diagrams/graphs from a simple textual description of the graph (or a description of the graph that has been generated algorithmically…).
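Part of what makes Graphviz so writable is that the DOT language it consumes is trivially easy to generate; a minimal Python sketch of the algorithmic route (plain string building, no Graphviz bindings assumed):

```python
# Build a DOT description of a small network programmatically -
# the sort of file you can then feed to Graphviz's dot/neato tools.
def to_dot(edges, name="G"):
    lines = ["digraph %s {" % name]
    for src, dst in edges:
        lines.append('  "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)

# A toy "who links to whom" graph
edges = [("blog", "feed"), ("feed", "reader"), ("reader", "blog")]
print(to_dot(edges))
```

Saving the output as `graph.dot` and running it through `dot -Tpng graph.dot -o graph.png` gives the rendered diagram.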

Out of preference, I tend to use the Mac version of Graphviz, although the appearance of a canvas/browser version of Graphviz is really appealing… (I did put in a soft request for a Drupal module that would generate a Graphviz plot from a URL that pointed to a dot file, but I’m not sure it went anywhere, and the canvas version looks far more interesting anyway…)

Hmmm – it seems there’s an iPhone/iPod touch Graphviz app too – Instaviz:

Another handy text2image service is the rather wonderful Web sequence diagrams, a service that lets you write out a UML sequence diagram:

There’s an API, too, that lets you write a sequence diagram within a <pre> tag in an HTML page, and a javascript routine will then progressively enhance it and provide you with the diagrammatic version, a bit like MathTran, or the Google Chart API etc etc (RESTful Image Generation – When Text Just Won’t Do).
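The diagram source itself is just plain text; the canonical example in the service’s own dialect looks something like this (the exact syntax is the service’s, and may have evolved):

```
Alice->Bob: Authentication Request
Bob-->Alice: Authentication Response
```

A solid arrow (`->`) is a call, a dashed arrow (`-->`) a response, and that’s most of what you need to get started.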

If graphs or sequence diagrams aren’t your thing, here’s a handy hierarchical mindmap generator: Text2Mindmap:

And finally, if you do have to resort to actually drawing diagrams yourself, there are a few tools out there that look promising: for example, the LucidChart flow chart tool crossed my feedreader the other day. More personally, since Gliffy tried to start charging me at some point during last year, I’ve been using the Project Draw Autodesk online editor on quite a regular basis.

PS Online scripting tool for UML diagrams: YUML

PPS This is neat – a quite general diagramming language for use in eg markdown scripts: pikchr.
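pikchr is PIC-like, so a diagram is just a few declarative statements; a minimal sketch (see the pikchr docs for the full syntax):

```pikchr
box "Markdown"
arrow
box "pikchr"
arrow
box "SVG"
```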

Decoding Patents – An Appropriate Context for Teaching About Technology?

A couple of nights ago, as I was having a rummage around the European Patent Office website, looking up patents by company to see what the likes of Amazon, Google, Yahoo and, err, Technorati have been filing recently, it struck me that IT and engineering courses might be able to use patents in a similar way to the way that Business Schools use Case Studies as a teaching tool (e.g. Harvard Business Online: Undergraduate Course Materials).

This approach would seem to offer several interesting benefits:

  • the language used in patents is opaque – so patents can be used to develop reading skills;
  • the ideas expressed are likely to come from a commercial research context; with universities increasingly tasked with taking technology transfer more seriously, looking at patents situates theoretical understanding in an application area, as well as providing the added advantage of transferring knowledge into the ivory tower, too, and maybe influencing curriculum development as educators try to keep up with industrial inventions;-)
  • many patents locate an invention within both a historical context and a systemic context;
  • scientific and mathematical principles can be used to model or explore ideas expressed in a patent in more detail, and in the “situated” context of an expression or implementation of the ideas described within the patent.

As an example of how patents might be reviewed in an uncourse blog context, see one of my favourite blogs, SEO by the SEA, in which Bill Slawski regularly decodes patents in the web search area.

To see whether there may be any mileage in it, I’m going to keep an occasional eye on patents in the web area over the next month or two, and see what sort of response they provoke from me. To make life easier, I’ve set up a pipe to scrape the search results for patents issued by company, so I can now easily subscribe to a feed of new patents issued by Amazon, or Yahoo, for example.

You can find the pipe here: European Patent Office Search by company pipe.

I’ve also put several feeds into an OPML file on Grazr (Web2.0 new patents), and will maybe look again at the styling of my OPML dashboard so I can use that as a display surface (e.g. Web 2.0 patents dashboard).

Situated Video Advertising With Tesco Screens

In 2004, Tesco launched an in-store video service under the name Tesco TV as part of its Digital Retail Network service. The original service is described in TESCO taps into the power of satellite broadband to create a state-of-the-art “Digital Retail Network” and is well worth a read. A satellite delivery service provided “news and entertainment, as well as promotional information on both TESCO’s own products and suppliers’ branded products” that was displayed on video screens around the store.
In order to make content as relevant as possible (i.e. to maximise the chances of it influencing a purchase decision;-), the content was zoned:

Up to eight different channels are available on TESCO TV, each channel specifically intended for a particular zone of the store. The screens in the Counters area, for instance, display different content from the screens in the Wines and Spirits area. The latest music videos are shown in the Home Entertainment department and Health & Beauty has its own channel, too. In the Cafe, customers can relax watching the latest news, sports clips, and other entertainment programs.

I’d have loved to have seen the control room:

Remote control from a central location of which content is played on each screen, at each store, in each zone, is an absolute necessity. One reason is that advertisers are only obligated to pay for their advertisements if they are shown in the contracted zones and at the contracted times.

In parallel to the large multimedia files, smaller files with the scripts and programming information are sent to all branches simultaneously or separately, depending on what is required. These scripts are available per channel and define which content is played on which screen at which time. Of course, it is possible to make real-time changes to the schedule enabling TESCO to react within minutes, if required.

In 2006, dunnhumby, the company that runs the Tesco Clubcard service and that probably knows more about your shopping habits at Tesco than you do, won the ad sales contract for Tesco TV’s “5,000 LCD and plasma screens across 100 Tesco Superstores and Extra outlets”. Since then, it has “redeveloped the network to make it more targeted, so that it complements in-store marketing and ties in with above-the-line campaigns”, renaming Tesco TV as Tesco Screens in 2007 as part of that effort (Dunnhumby expands Tesco TV content, dunnhumby relaunches Tesco in-store TV screens). Apparently, “[a]ll campaigns on Tesco Screens are analysed with a bespoke control group using EPOS and Clubcard data.” (If you’ve read any of my previous posts on the topic (e.g. The Tesco Data Business (Notes on “Scoring Points”)) you’ll know that dunnhumby excels at customer profiling and targeting.)

Now I don’t know about you, but dunnhumby’s apparent reach and ability to influence millions of shoppers at points of weakness is starting to scare me…(as well as hugely impressing me;-)

On a related note, it’s not just Tesco that use video screen advertising, of course. In Video, Video, Everywhere…, for example, I described how video advertising has now started appearing throughout the London Underground network.

So with the growth of video advertising, it’s maybe not so surprising that Joel Hopwood, one of the management team behind Tesco Screens at Retail Media Group, should strike out with a start-up: Capture Marketing.

[Capture Marketing] may well be the first agency in the UK to specialise in planning, buying and optimising Retail Media across all retailers – independent of any retailer or media owner!!

They aim to buy from the likes of dunnhumby, JCDecaux, Sainsbury, Asda Media Centre etc in order to give clients a single, independent and authoritative buying and planning point for the whole sector. [DailyDOOH: What On Earth Is Shopper Marketing?]

So what’s the PR strapline for Capture Marketing? “Turning insight into influence”.

If you step back and look at our marketing mix across most of the major brands, it’s clearly shifting, and it’s shifting to in-store, to the internet and to trial activity.
So what’s the answer? Marketing to shoppers. We’ll help you get your message to the consumer when they’re in that crucial zone, after they’ve become a shopper, but before they’ve made a choice. We’ll help you take your campaign not just outside the home, but into the store. Using a wide range of media vehicles, from digital screens to web favourite interrupts to targeted coupons, retail media is immediate, proximate, effective and measurable.

I have no idea where any of this is going… Do you? Could it shift towards making use of VRM (“vendor relationship management”) content, in which customers are able to call up content they desire to help them make a purchase decision (such as price, quality, or nutrition information comparisons)? After all, scanner apps are already starting to appear on Android phones (e.g. ShopSavvy) and the iPhone (Snappr), not to mention the ability to recognise books from their cover or music from the sound of it (The Future of Search is Already Here).

PS Just by the by, here’s some thoughts about how Tesco might make use of SMS:

PPS for a quick A-Z of all you need to know to start bluffing about video based advertising, see Billboards and Beyond: The A-Z of Digital Screens.

Tinkering With Time

A few weeks ago now, I was looking for a context within which I could have a play with the deprecated BBC Web API. Now this isn’t the most useful of APIs, as far as I’m concerned, because rather than speaking in the language of iPlayer programme identifiers it uses a different set of programme IDs (and I haven’t yet found a way of mapping one onto the other). But as I’d found the API, I wanted to try to do something with it, and this is what I came up with: a service you can tweet to that will tell you what’s on a specified BBC TV or radio channel now (or sometime in the recent past).

Now I didn’t actually get round to building the tweetbot, and the time handling is a little ropey, but if I write it up it may spark some more ideas. So here goes…

The first part of the pipe parses a message of the form “#remindme BBCChannel time statement”. The BBCChannel needs to be in the correct format (e.g. BBCOne, BBCRFour) and only certain time constructs work (now, two hours ago, 3 hours later all seem to work).

The natural language-ish time expression gets converted to an actual time by the Date Builder block, and is then written into a string format that the BBC Web API requires:
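Outside of Pipes, the message parsing and time handling might be sketched like this in Python – the channel names and time phrases are just the handful the sketch recognises, and the timestamp format is a guess at the sort of string the BBC Web API wanted, not its actual format:

```python
import re
from datetime import datetime, timedelta

# Parse messages of the form "#remindme BBCChannel time statement",
# e.g. "#remindme BBCOne two hours ago".
MSG = re.compile(r"#remindme\s+(\w+)\s+(.+)")

# The few natural language-ish time phrases we handle (illustrative only)
OFFSETS = {
    "now": timedelta(0),
    "two hours ago": timedelta(hours=-2),
    "3 hours later": timedelta(hours=3),
}

def parse_remindme(text, at=None):
    at = at or datetime.now()
    m = MSG.match(text.strip())
    if not m:
        return None
    channel, phrase = m.group(1), m.group(2).strip()
    when = at + OFFSETS.get(phrase, timedelta(0))
    # An ISO-ish string of the kind a web API typically wants
    return channel, when.strftime("%Y-%m-%dT%H:%M:%S")
```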

Then we construct the URI that references the BBC Web API, grab the data back from that URI and do a tiny bit of tidying up:

If you run the pipe, you get something like this:

Time expressions such as “last Friday” seem to calculate the correct date and use the current time of day. So you could use this service to remind yourself what was on at the current time last week, for example.

A second pipe grabs the programme data from the programme ID, by constructing the web service call:

then grabbing the programme data and constructing a description based on it:

It’s then easy enough to call this description-getting pipe at the end of the original pipe, remembering to call the pipe with the appropriate programme ID:

So now we get the description too:

To see what’s on (or what was on) between two times, we need to construct a URI to call the BBC Web API with the appropriate time arguments, suitably encoded:

and then call the web service with that URI.
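In ordinary code, that construct-and-encode step looks something like the following Python sketch – the base URL and parameter names here are invented for illustration, since the real (now deprecated) BBC Web API had its own scheme:

```python
from urllib.parse import urlencode

# Hypothetical base URL standing in for the real BBC Web API endpoint
BASE = "http://example.org/bbc/schedule"

def schedule_uri(channel, start, end):
    # urlencode takes care of escaping the colons in the timestamps
    return BASE + "?" + urlencode({"channel": channel, "start": start, "end": end})

print(schedule_uri("BBCOne", "2009-01-02T19:00:00", "2009-01-02T21:00:00"))
```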

It’s easy enough to embed this pipe in a variant of the original pipe that generates the two appropriately encoded time strings from two natural language time strings:

If we add the programme details fetching pipe to the end of the pipe, we can grab the details for each programme and get this sort of output from the whole pipeline:

OpenLearn WordPress Plugins

Just before the summer break, I managed to persuade Patrick McAndrew to use some of his OpenLearn cash to get a WordPress-MU plugin built that would allow anyone to republish OpenLearn materials across a set of WordPress Multi-User blogs. A second WordPress plugin was commissioned that would allow any learners happening by the blogs to subscribe to those courses using “daily feeds” that deliver course material to them on a daily basis.

The plugins were coded by Greg Gaughan at Isotoma, and tested by Jim and D’Arcy, among others… (I haven’t acted on your feedback yet – sorry, guys… :-( ) For all manner of reasons, I didn’t post the plugins (partly because I wanted to do another pass on usability/pick up on feedback, but mainly because I wanted to set up a demo site first)… but I still haven’t done that… so here’s a link to the plugins anyway in case anyone fancies having a play over the next few weeks: OpenLearn WordPress Plugins.

I’ll keep coming back to this post – and the download page – to add in documentation and some of the thoughts and discussions we had about how to evolve the WPMU plugin workflow/UI etc, as well as the daily feeds widget functionality.

In the meantime, here’s the minimal info I gave the original testers:

The story is:
– one ‘openlearn republisher’ plugin, that will take the URL of an RSS feed describing OpenLearn courses (e.g. on the Modern Languages page, the RSS: Modern Languages feed), and suck those courses into WPMU, one WPMU blog per course, via the full content RSS feed for each course.

– one “daily feeds” widget; this can be added to any WP blog and should provide a ‘daily’ feed of the content from that blog, that sends e.g. one item per day to the subscriber from the day they subscribe. The idea here is if a WP blog is used as a content publishing sys for ‘static’, unchanging content (e.g. a course, or a book, where each chapter is a post, or a fixed length podcast series), users can still get it delivered in a paced/one-per-day fashion. This widget should work okay…
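The pacing logic at the heart of the daily feeds widget (which is actually PHP, this being WordPress) can be sketched in a few lines of Python – function and variable names are mine, not the plugin’s:

```python
from datetime import date

# Given all posts (oldest first) and the date a reader subscribed,
# return the posts they should have received so far: one per day,
# starting from the day they subscribed.
def daily_feed_items(posts, subscribed_on, today=None):
    today = today or date.today()
    days_elapsed = (today - subscribed_on).days
    if days_elapsed < 0:
        return []
    return posts[: days_elapsed + 1]

posts = ["Unit 1", "Unit 2", "Unit 3", "Unit 4"]
# Two days after subscribing, the reader has the first three units
print(daily_feed_items(posts, date(2008, 9, 1), today=date(2008, 9, 3)))
```

Serving the feed is then just a matter of rendering that list as RSS, with the slice growing by one item a day.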

Here’s another link to the page where you can find the downloads: OpenLearn WordPress Plugins. Enjoy – all comments welcome. Please post a link back here if you set up any blogs using either of these two plugins.

Arise Ye Databases of Intention

In what counts as one of my favourite business books (“The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture”), John Battelle sets the scene with a chapter entitled “The Database of Intentions”.

The Database of Intentions is simply this: the aggregate results of every search ever entered, every result list ever tendered, and every path taken as a result.

(Also described in this blog post: Database of Intentions).

The phrase “the database of intentions” is a powerful one; but whilst I don’t necessarily agree with the above definition any more (and I suspect Battelle’s thinking about his own definition of this term may also have moved on since then) I do think that the web’s ability to capture intentional data is being operationalised in a far more literal form than even search histories reveal.

Here’s something I remarked to myself on this topic a couple of days ago, following a particular Google announcement:

The announcement was this one: New in Labs: Tasks and it describes the release of a Task list (i.e. a to do list) into the Google Mail environment.

As Simon Perry notes in his post on the topic (“Google Tasks: Gold Dust Info For Advertisers”):

It’s highly arguable that no piece of information could be more valuable to Google than what your plans / tasks / desires are.

In the world of services driven by advertising in exchange for online services, this stuff is gold dust.

We’d imagine that Google will be smart enough not to place ads related to your todo list directly next to the list, as that could well freak people out. Don’t forget that as soon as Google know this info about you, they can place the adverts where ever and when ever they feel like.

“Don’t forget that as soon as Google know this info about you, they can place the adverts where ever and when ever they feel like.”… Visit 10 web pages that run advertising, and I would bet that close to the majority are running Google Adsense.

Now I know everybody knows this, but I suspect most people don’t…

How does Google use cookies to serve ads?
A cookie is a snippet of text that is sent from a website’s servers and stored on a web browser. Like most websites and search engines, Google uses cookies in order to provide a better user experience and to serve relevant ads. Cookies are set based on your viewing of web pages in Google’s content network and do not contain any information that can identify you personally. This information helps Google deliver ads that are relevant to your interests, control the number of times you see a given ad, and measure the effectiveness of ad campaigns. Anyone who prefers not to see ads with this level of customization can opt out of advertising cookies. This opt-out will be specific only to the browser that you are using when you click the “opt out” button. [Advertising Privacy FAQ]

Now I’m guessing that “your viewing of web pages in Google’s content network” includes viewing pages in GMail, pages which might include your “to do” list… So a consequence of adding an item to your Task list might be an advert that Google serves to you next time you do a search on Google or visit a site running Google Ads.

Here’s another way of thinking about those ads: as predictive searches. Google knows what you intend to do from your “to do” list, so in principle it can look at what people with similar “to do” items search for, and serve the most common of these up next time you go to http://getitdone.google.com. Just imagine it – whereas at google.com you see an empty search box, at getitdone.google.com you load a page that has already guessed what query you were going to make, and runs it for you. So if the item on your task list for Thursday afternoon was “buy car insurance”, you’ll see something like this:

Heh, heh ;-) Good old Google – just being helpful ;-)

Explicit intention information is not just being handed over to Google in increasingly literal and public ways, of course. A week or so ago, John Battelle posted the following (Shifting Search from Static to Real-time):

I’ve been mulling something that keeps tugging at my mind as a Big Idea for some time now, and I may as well Think Out Loud about it and see what comes up.

To summarize, I think Search is about to undergo an important evolution. It remains to be seen if this is punctuated equilibrium or a slow, constant process (it sort of feels like both), but the end result strikes me as extremely important: Very soon, we will be able to ask Search a very basic and extraordinarily important question that I can best summarize as this: What are people saying about (my query) right now?
Imagine AdSense, Live. …

[I]magine a service that feels just like Google, but instead of gathering static web results, it gathers liveweb results – what people are saying, right now (or some approximation of now – say the past few hours or so) … ? And/or, you could post your query to that engine, and you could get realtime results that were created – by other humans – directly in response to you? Well, you can get a taste of what such an engine might look like on search.twitter.com, but that’s just a taste.

A few days later, Nick Bilton developed a similar theme (The Twitter Gold Mine & Beating Google to the Semantic Web):

Twitter, potentially, has the ability to deliver unbelievably smart advertising; advertising that I actually want to see, and they have the ability to deliver search results far superior and more accurate to Google, putting Twitter in the running to beat Google in the latent quest to the semantic web. With some really intelligent data mining and cross pollination, they could give me ads that makes sense not for something I looked at 3 weeks ago, or a link my wife clicked on when she borrowed my laptop, but ads that are extremely relevant to ‘what I’m doing right now’.

If I send a tweet saying “I’m looking for a new car does anyone have any recommendations”, I would be more than happy to see ‘smart’ user generated advertising recommendations based on my past tweets, mine the data of other people living Brooklyn who have tweeted about their car and deliver a tweet/ad based on those result leaving spammers lost in the noise. I’d also expect when I send a tweet saying ‘I got a new car and love it!’ that those car ads stop appearing and something else, relevant to only me, takes its place.

(See also Will Lack of Relevancy be the Downfall of Google?, where I fumble around a thought or two about whether Google will lose out on identifying well liked content because links are increasingly being shared in real time in places that Google doesn’t index.)

It seems to me that To Do/Task lists, calendars, search queries and tweets all lie at different points along a time vs. commitment graph of our intentions. The to do list is something you plan to do; searching on Google or Amazon is an action executed in pursuit of that goal. Actually buying your car insurance is completing the action.

Things like wishlists also blend in our desires. Calendars, click-thrus and actual purchases all record what might be referred to as commitments (a click thru on an ad could be seen as a very weak commitment to buy, for example; an event booked into your calendar is a much stronger commitment; handing over your credit card and hitting the “complete transaction” button is the strongest commitment you can make regarding a purchase decision).

Way back when, I used to play with software agents, constructed according to a BDI (“beady eye”) model – Beliefs, Desires and Intentions. I also looked at agent teams, and the notion of “joint persistent goals” (I even came up with a game for them to play – “DIFFOBJ – A Game for Exercising Teams of Agents” – although I’m not sure the logic was sound!). Somewhere in there is the basis of a logic for describing the Database of Intentions, and the relationship between an individual with a goal and an engine that is trying to help the searcher achieve that goal, whether by serving them with “organic” content or paid for content.

PS I don’t think I’ve linked to this yet? All about Google

It’s worth skimming through…

PPS See also Status and Intent: Twoogle, in which I idly wonder whether a status update from Google just before I start searching there could provide explicit intent information to Google about the sort of thing I want to achieve from a particular search.

Realising the Value of Library Data

For anyone listening out there in library land who hasn’t picked up on Dave Pattern’s blog post from earlier today – WHY NOT? Go and read it, NOW: Free book usage data from the University of Huddersfield:

I’m very proud to announce that Library Services at the University of Huddersfield has just done something that would have perhaps been unthinkable a few years ago: we’ve just released a major portion of our book circulation and recommendation data under an Open Data Commons/CC0 licence. In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13 year period.


I would like to lay down a challenge to every other library in the world to consider doing the same.

So are you going to pick up the challenge…?

And if not, WHY NOT? (Dave posts some answers to the first two or three objections you’ll try to raise, such as the privacy question and the licensing question.)

He also sketches out some elements of a possible future:

I want you to imagine a world where a first year undergraduate psychology student can run a search on your OPAC and have the results ranked by the most popular titles as borrowed by their peers on similar courses around the globe.

I want you to imagine a book recommendation service that makes Amazon’s look amateurish.

I want you to imagine a collection development tool that can tap into the latest borrowing trends at a regional, national and international level.
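That kind of recommendation service is, at heart, co-occurrence counting over circulation records; here’s a toy Python sketch over made-up data (nothing to do with Huddersfield’s actual schema):

```python
from collections import Counter
from itertools import combinations

# Toy "borrowers who took out X also took out Y" recommender, built
# from raw circulation records of (borrower_id, title) pairs.
def co_borrow_counts(transactions):
    by_borrower = {}
    for borrower, title in transactions:
        by_borrower.setdefault(borrower, set()).add(title)
    pairs = Counter()
    for titles in by_borrower.values():
        for a, b in combinations(sorted(titles), 2):
            pairs[(a, b)] += 1
            pairs[(b, a)] += 1
    return pairs

def recommend(title, pairs, n=3):
    scored = [(count, other) for (a, other), count in pairs.items() if a == title]
    return [other for count, other in sorted(scored, reverse=True)[:n]]

# Entirely fictional circulation data
tx = [(1, "Stats"), (1, "R"), (2, "Stats"), (2, "R"), (2, "SPSS"),
      (3, "Stats"), (3, "SPSS")]
pairs = co_borrow_counts(tx)
print(recommend("Stats", pairs))
```

Aggregate the counts by course code and year, as the Huddersfield data does, and you start to get the “borrowed by their peers on similar courses” ranking Dave describes.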


See also a presentation Dave gave to announce this release – Can You Dig It? A Systems Perspective:

What else… Library website analytics – are you making use of them yet? I know the OU Library is collecting analytics on the OU Library website, although I don’t think they’re using them? (Knowing that you had x thousand page views last week is NOT INTERESTING. Most of them were probably people flailing round the site failing to find what they wanted? And before anyone from the Library says that’s not true, PROVE IT TO ME – or at least to yourself – with some appropriate analytics reports.) For example, I haven’t noticed any evidence of changes to the website or A/B testing going on as a result of using Googalytics on the site??? (Hmmm – that’s probably me in trouble again…!;-)

PS I’ve just realised I didn’t post a link to the Course Analytics presentation from Online Info last week, so here it is:

Nor did I mention the follow up podcast chat I had about the topic with Richard Wallis from Talis: Google Analytics to analyse student course activity – Tony Hirst Talks with Talis.

Or the “commendation” I got at the IWR Information Professional Award ceremony. I like to think this was for being the “unprofessional” of the year (in the sense of “unconference”, of course…;-). It was much appreciated, anyway :-)

Revisiting the Library Flip – Why Librarians Need to Know About SEO

What does information literacy mean in the age of web search engines? I’ve been arguing for some time (e.g. in The Library Flip) that one of the core skills going forward for those information professionals who “help people find stuff” is going to be SEO – search engine optimisation. Why? Because increasingly people are attuned to searching for “stuff” using a web search engine (you know who I’m talking about…;-); and if your “stuff” doesn’t appear near the top of the organic results listing (or in the paid for links) for a particular query, it might as well not exist…

Whereas once academics and students would have traipsed into the library to ask one of the High Priestesses to perform some magical incantation on a Dialog database through a privileged access terminal, for many people research now starts with a G. Which means that if you want your academics and students to find the content that you’d recommend, then you have to help get that content to the top of the search engine listings.

With the rate of content production growing to seventy three tera-peta-megabits a second, or whatever it is, does it make sense to expect library staffers to know what the good content is any more (in the sense of “here, read this – it’s just what you need”)? Does it even make sense to expect people to know where to find it (in the sense of “try this database, it should contain what you need”)? Or is the business now more one of showing people how to go about finding good stuff, wherever it is (in the sense of “here’s a search strategy for finding what you need”), and helping the search engines see that stuff as good stuff?

Just think about this for a moment. If your service is only usable by members of your institution and only usable within the locked down confines of your local intranet, how useful is it?

When your students leave your institution, how many reusable skills are they taking away? How many people doing informal learning or working within SMEs have access to highly priced, subscription content? How useful is the content in those archives anyway? How useful are “academic information skills” to non-academics and non-students? (I’m just asking the question…;-)

And some more: do academic courses set people up for life outside? Irrespective of whether they do or not, does the library serve students on those courses well within the context of their course? Does the library provide students with skills they will be able to use when they leave the campus and go back to the real world and live with Google? (“Back to”? Hah – I wonder how much traffic on HEI networks is launched by people clicking on links from pages that sit on the google.com domain?) Should libraries help students pass their courses, or give them skills that are useful after graduation? Are those skills the same skills? Or are they different skills (and if so, are they compatible with the course-related skills)?

Here’s where SEO comes in – help people find the good stuff by improving the likelihood that it will be surfaced on the front page of a relevant web search query. For example, “how to cite an article”. (If you click through, it will take you to a Google results page for that query.) Are you happy with the results? If not, you need to do one of two things – either start to promote third party resources you do like from your website (essentially, this means you’re doing off-site SEO for those resources), OR start to do onsite and offsite SEO on resources you want people to find on your own site.

(If you don’t know what I’m talking about, you’re well on the way to admitting that you don’t understand how web search engines work. Which is a good first step… because it means you’ve realised you need to learn about it…)

As to how to go about it, I’d suggest one way is to get a better understanding of how people actually use library or course websites. (Another is Realising the Value of Library Data and finding ways of mining behavioural data to build recommendation engines that people might find useful.)

So to start off – find out what search terms are the most popular in terms of driving traffic to your Library website (ideally relating to some sort of resource on your site, such as a citation guide, or a tutorial on information skills); run that query on Google and see where your page comes in the results listing. If it’s not at the top, try to improve its ranking. That’s all…
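That rank check is scriptable, too; a sketch, assuming you’ve already obtained an ordered list of result URLs for the query by legitimate means (scraping SERPs directly may breach a search engine’s terms of service, and the example URLs below are made up):

```python
from urllib.parse import urlparse

# Given an ordered list of result URLs for a query, report the rank
# (1-based) of the first result from your own domain, or None if absent.
def serp_rank(result_urls, domain):
    for i, url in enumerate(result_urls, start=1):
        if urlparse(url).netloc.endswith(domain):
            return i
    return None

results = [
    "http://en.wikipedia.org/wiki/Reference",
    "http://www.jobreferences.example/",
    "http://library.open.ac.uk/help/howto/citeref/",
]
print(serp_rank(results, "open.ac.uk"))  # → 3
```

Run it weekly for your key queries and you have a crude rank tracker for measuring whether your SEO tweaks are working.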

For example, take a look at the following traffic (as collected by Google Analytics) coming in to the OU Library site over a short period some time ago.

A quick scan suggests that we maybe have some interesting content on “law cases” and “references”. For the “references” link, there’s a good proportion of new visitors to the OU site, and it looks from the bounce rate that half of those visited more than one page on the OU site. (We really should do a little more digging at this point to see what those people actually did on site, but this is just for argument’s sake, okay?!;-)

Now do a quick Google on “references” and what do we see?

On the first page, most of the links are relating to job references, although there is one citation reference near the bottom:

Leeds University library makes it in at 11 (at the time of searching, on google.co.uk):

So here would be a challenge – try to improve the ranking of an OU page on this results listing (or try to boost the Leeds University ranking). As to which OU page we could improve, first look at what Google thinks the OU library knows about references:

Now check that Google favours the page we favour for a search on “references” and, if it does, try to boost its ranking on the organic SERP. If Google isn’t favouring the page we want as its top hit on the OU site for a search on “references”, do some SEO to correct that (maybe we want “Manage Your References” to come out as the top hit?).

Okay, enough for now – in the next post on this topic I’ll look at the related issue of Search Engine Consequences, which is something that we’re all going to have to become increasingly aware of…

PS Ah, what the heck – here’s how to find out what the people who arrived at the Library website from a Google search on “references” were doing onsite. Create an advanced segment:

Google analytics advanced segment

(PS I first saw these and learned how to use them at a trivial level maybe 5 minutes ago;-)

Now look to see where the traffic came in (i.e. the landing pages for that segment):

Okay? The power of segmentation – isn’t it lovely:-)

We can also go back to the “All Visitors” segment, and see what other keywords people were using who ended up on the “How to cite a reference” page, because we’d possibly want to optimise for those, too.

Enough – time for the weekend to start :-)

PS if you’re not sure what techniques to use to actually “do SEO”, check on Academic Search Premier (or whatever it’s called), because Google and Google Blogsearch won’t return the right sort of information, will they?;-)

Are You Ready To Play “Search Engine Consequences”?

In the world of search, what happens when your local library categorises the film “Babe” as a teen movie in the “classic DVDs” section of the Library website? What happens if you search for “babe teen movie” on Google using any setting other than Safe Search set to “Use strict filtering”? Welcome, in a roundabout way, to the world of search engine consequences.

To set the scene, you need to know a few things. Firstly, how the web search engines know about your web pages in the first place. Secondly, how they rank your pages. And thirdly, how the ads and “related items” that annotate many search listings and web pages are selected (because the whole point about search engines is that they are run by advertising companies, right?).

So, how do the search engines know where your pages are? Any offers? You at the back there – how does Google know about your library home page?

As far as I know, there are three main ways:

  • a page that is already indexed by the search engine links to your page. When a search engine indexes a page, it makes a note of all the pages that are linked to from that page, so these pages can be indexed in turn by the search engine crawler (or spider). (If you have a local search engine, it will crawl your website in a similar way, as this documentation about the Google Search Appliance crawler describes.)
  • the page URL is listed in a Sitemap, and your website manager has told the search engine where that Sitemap lives. The Sitemap lists all the pages on your site that you want indexing. This helps the search engine out – it doesn’t have to crawl your site looking for all the pages – and it helps you out: you can tell the search engine how often the page changes, and how often it needs to be re-indexed, for example.
  • you tell the search engine at a page level that the page exists. For example, if your page includes any Google Adsense modules, or Google Analytics tracking codes, Google will know that page exists the first time it is viewed.
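To give a flavour of the second route, a minimal Sitemap file might look something like the following (the URL, change frequency and priority values are purely illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://library.example.ac.uk/find/eresources/index.cfm</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

You then point the search engine at the file, typically via a `Sitemap:` line in robots.txt or the search engine’s webmaster tools.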

And once a page has been indexed by a search engine, it becomes a possible search result within that search engine.

So when someone makes a query, how are the results selected?

During the actual process of indexing, the search engine does some voodoo magic to try and understand what the page is about. This might be as simple as counting the occurrence of every different word on the page, or it might be an actual attempt to understand what the page is about using all manner of heuristics and semantic engineering approaches. Pages are deemed a “hit” for a particular query if the search terms can be used to look up the page in the index. The hits are then rank ordered according to a score for each page, calculated according to whatever algorithm the search engine uses. Typically this is some function of both “relevance” and “quality”.

“Relevance” is identified in part by comparing how the page has been indexed compared to the query.

“Quality” often relates to how well regarded the page is in the greater scheme of things; for example, link analysis identifies how many other pages link to the page (where we assume a link is some sort of vote of confidence for the page) and clickthrough analysis monitors how many times people click through on a particular result for a particular query in a search engine results listing (as search engine companies increasingly run web analytics packages too, they can potentially factor this information back in to calculating how satisfied a user was with a particular page).
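To make that concrete, here’s a deliberately toy sketch – emphatically *not* any real search engine’s algorithm – of scoring pages as a function of relevance (naive term counting) and a “quality” prior, with invented page content:

```python
# Toy ranking sketch: score = relevance (term frequency) x quality prior.
# The pages, text and quality scores are all made up for illustration.
from collections import Counter

pages = {
    "cite-refs": ("how to cite a reference in your assignment references list", 0.9),
    "job-refs": ("asking for job references from a previous employer", 0.5),
}

def score(query, text, quality):
    counts = Counter(text.split())
    relevance = sum(counts[term] for term in query.split())
    return relevance * quality

def rank(query):
    # Highest-scoring pages first
    return sorted(pages, key=lambda p: score(query, *pages[p]), reverse=True)

print(rank("references"))
```

Even in this caricature you can see the point: two pages can be equally “relevant” to the query, and the quality signal decides who wins.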

So what are the consequences of publishing content to the web, and letting a search engine index it?

At this point it’s worth considering not just web search engines, but also image and video search engines. Take Youtube, for example. Posting a video to Youtube means that Youtube will index it. And one of the nice things about Youtube is that it will index your video according to at least the title, description and tags you add to the movie, and use this as the basis for recommending other “related” videos to you. You might think of this in terms of Youtube indexing your video in a particular way, and then running a query for other videos using those index terms as the search terms.

Bearing that in mind, a major issue is that you can’t completely control where a page might turn up in a search engine results listing or what other items might be suggested as “related items”. If you need to be careful managing your “brand” online, or you have a duty of care to your users, being associated with inappropriate material in a search engine results listing can be a risky affair.

To try and ground this with a real world example, check out David Lee King’s post from a couple of weeks ago on YouTube Being Naughty Today. It tells a story of how “[a] month or so ago, some of my library’s teen patrons participated in a Making Mini Movie Masterpieces program held at my library. Cool program!”

One of our librarians just posted the videos some of the teens made to YouTube … and guess what? In the related videos section of the video page (and also on the related videos flash thing that plays at the end of an embedded YouTube video), some … let’s just say “questionable” videos appeared.

Here’s what I think happened: YouTube found “similar” videos based on keywords. And the keywords it found in our video include these words in the title and description: mini teen teens . Dump those into YouTube and you’ll unfortunately find some pretty “interesting” videos.

And here’s how I commented on the post:

If you assume everything you publish on the web will be subject to simple term extraction or semantic term extraction, then in a sense it becomes a search query crafted around those terms that will potentially associate the results of that “emergent query” with the content itself.

One of the functions of the Library used to be classifying works so they could be discovered. Maybe now there is a need for understanding how machines will classify web published content so we can try to guard against “consequential annotations”?

For a long time I’ve thought one role for the Library going forwards is SEO – and raising the profile of the host institution’s content in the dominant search engines. But maybe an understanding of SEO is also necessary in a *defensive* capacity?
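The “emergent query” idea in that comment can be sketched crudely as naive term extraction: strip the stopwords from a title and description, and what’s left is effectively a query that will run against everything else in the index (the stopword list and metadata here are invented for illustration):

```python
# Naive term extraction: drop stopwords, keep the rest as an "emergent
# query" - roughly how keyword-based "related items" matching behaves.
STOPWORDS = {"a", "the", "at", "my", "some", "of", "in", "to", "and"}

def emergent_query(title, description):
    words = (title + " " + description).lower().split()
    return [w for w in words if w not in STOPWORDS]

# Hypothetical metadata along the lines of David's example:
q = emergent_query("Mini Movie Masterpieces", "teen videos made at my library")
print(q)
```

Run terms like those back through a video search engine and you can see how the “questionable” related items arise.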

This need for care is particularly relevant if you run Adsense on your website. Adsense works in a conceptually similar way to the Youtube related videos service, so you might think of it in the following terms: the page is indexed, key terms are identified and run as “queries” on the Adsense search engine, returning “relevant ads” as a result. “Relevance” in this case is calculated (in part) based on how much the advertiser is willing to pay for their advert to be returned against a particular search term or in the context of a page that appears to be about a particular topic.

Although controls are starting to appear that give the page publisher an element of control over what ads appear, there is still uncertainty in the equation, as there is whenever your content appears alongside content that is deemed “relevant” or related in some way – whether that’s in the context of a search engine results listing, an Adsense placement, or a related video.

So one thing to keep in mind – always – is how might your page be indexed, and what sorts of query, ad placement or related item might it be the perfect match for? What are the search engine consequences of writing your page in a particular way, or including particular key words in it?

David’s response to my comment identified a major issue here: “The problem is SO HUGE though, because … at least at my library’s site … everyone needs this training. Not just the web dudes! In my example above, a youth services librarian posted the video – she has great training in helping kids and teens find stuff, in running successful programs for that age group … but probably not in SEO stuff.”

I’m not sure I have the answer, but I think this is another reason Why Librarians Need to Know About SEO – not just so they can improve the likelihood of content they do approve of appearing in search engine listings or as recommended items, but also so they can defend against unanticipated SEO, where they unconsciously optimise a page so that it fares well on an unanticipated, or unwelcome, search.

What that means is, you need to know how SEO works so you don’t inadvertently do SEO on something you didn’t intend optimising for; or so you can institute “SEO countermeasures” to try to defuse any potentially unwelcome search engine consequences that might arise for a particular page.

Library Analytics (Part 8)

In Library Analytics (Part 7), I posted a couple of ideas about how it might be an idea if the Library started crafting URLs for the Library resources pages for individual courses in the Moodle VLE that contained a campaign tracking code, so that we could track the behaviour of students coming into the Library site by course.

From a quick peek at a handful of courses in the VLE, that recommendation either doesn’t appear to have been taken up, or it’s just “too hard” to do, so that’s another couple of months’ data we don’t have easy access to in the Google Analytics environment. (Or maybe the Library have moved over to using the OU’s Site Analytics service for this sort of insight?)

Just to recall, we need to put some sort of additional measures in place because Moodle generates crappy URLs (e.g. URLs of the form http://learn.open.ac.uk/mod/resourcepage/view.php?id=119070) and crafting nice URLs or using mod-rewrite (or similar) definitely is way too hard for the VLE’n’network people to manage;-) The default set up of Google Analytics dumps everything after the “?”, unless they are official campaign tracking arguments or are captured otherwise.

(From a quick scan of Google Analytics Tracking API, I’m guessing that setting pageTracker._setCampSourceKey(“id”); in the tracking code on each Library web page might also capture the id from referrer URLs? Can anyone confirm/deny that?)
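I can’t vouch for what `_setCampSourceKey` actually does, but the equivalent offline step – pulling the `id` out of a Moodle referrer URL when post-processing raw logs – is trivial:

```python
# Pull the Moodle resource id out of a referrer URL, e.g. when
# post-processing raw logs after the analytics tool has dropped the
# query string.
from urllib.parse import urlparse, parse_qs

def moodle_id(referrer):
    qs = parse_qs(urlparse(referrer).query)
    return qs.get("id", [None])[0]

print(moodle_id("http://learn.open.ac.uk/mod/resourcepage/view.php?id=119070"))
# -> 119070
```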

Aside: from what I’ve been told, I don’t think we offer server side compression for content served from most http://www.open.ac.uk/* sites, either (though I haven’t checked)? Given that there are still a few students on low bandwidth connections and relatively modern browsers, this is probably an avoidable breach of some sort of accessibility recommendation? For example, over the last 3 weeks or so, here’s the number of dial-up visits to the Library website:

A quick check of the browser stats shows that IE breaks down almost completely as IE6 and above; all of which cope with compressed files, I think?
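For what it’s worth, if the sites are served by Apache, switching compression on can be as simple as something like the following (assuming mod_deflate is available – I have no idea what the OU servers actually run):

```apache
# Compress text content on the fly for clients that advertise support
# (Accept-Encoding: gzip); images etc. are already compressed.
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css text/plain
    AddOutputFilterByType DEFLATE application/javascript application/xml
</IfModule>
```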

[Clarification (?! heh heh) re: dial-in stats – “when you’re looking at the dial-up use of the Library website is that we have a dial-up PC in the Library to replicate off-campus access and to check load times of our resources. So it’s probably worth filtering out that IP address (***.***.***.***) to cut out library staff checking out any problems as this will inflate the perceived use of dial-up by our students. Even if we’ve only used it once a day then that’s a lot of hits on the website that aren’t really students using dial-up” – thanks, Clari :-)]

Anyway – back to the course tracking: as a stop gap, I created a few of my own reports that use a user defined argument corresponding to the full referrer URL:

We can then view reports according to this user defined segment to see which VLE pages are sending traffic to the Library website:

Clicking through on one of these links gives a report for that referrer URL, and then it’s easy to see which landing pages the users are arriving at (and by induction, which links on the VLE page they clicked on):

If we look at the corresponding VLE page:

Then we can say that the analytics suggest that the Open University Library – http://library.open.ac.uk/, the Online collections by subject – http://library.open.ac.uk/find/eresources/index.cfm and the Library Help & Support – http://library.open.ac.uk/about/index.cfm?id=6939 are the only links that have been clicked on.

[Ooops… “Safari & Info Skills for Researchers are our sites, but don’t sit within the library.open.ac.uk domain ([ http://www.open.ac.uk/safari ]www.open.ac.uk/safari and [ http://www.open.ac.uk/infoskills-researchers ]www.open.ac.uk/infoskills-researchers respectively) and the Guide to Online Information Searching in the Social Sciences is another Moodle site.” – thanks Clari:-) So it may well be that people are clicking on the other links… Note to self – if you ever see 0 views for a link, be suspicious and check everything!]

(Note that I have only reported on data from a short period within the lifetime of the course, rather than data taken from over the life of the course. Looking at the incidence of traffic over a whole course presentation would also give an idea of when during the course students are making use of the Library resource page within the course.)

Another way of exploring how VLE referrer traffic is impacting on the Library website is to look at the most popular Landing pages and then see which courses (from the user defined segment) are sourcing that traffic.

So for example, here are the VLE pages that are driving traffic to the elluminate registration page:

One VLE page seems responsible:

Hmmm… ;-)

How about the VLE pages driving traffic to the ejournals page?

And the top hit is….

… the article for question 3 on TMA01 of the November 2008 presentation of M882.

The second most popular referrer page is interesting because it contains two links to the Library journals page:


Unfortunately, there’s no way of disambiguating which link is driving the tracking – which is one good reason why a separate campaign related tracking code should be associated with each link.
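By way of example, the two links could carry distinct (made-up) campaign arguments – same landing page, different `utm_content` values – so each click shows up separately in the reports:

```html
<!-- Hypothetical campaign-tagged links: the landing page path and the
     utm_* values here are invented for illustration -->
<a href="http://library.open.ac.uk/ejournals/?utm_source=vle&amp;utm_campaign=M882&amp;utm_content=link1">Library journals</a>
<a href="http://library.open.ac.uk/ejournals/?utm_source=vle&amp;utm_campaign=M882&amp;utm_content=link2">Library journals</a>
```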

(Do you also see the reference to Google books in there? Heh heh – surely they aren’t suggesting that students try to get what they need from the book via the Google books previewer?!;-)

Okay – enough for now. To sum up, we have the opportunity to provide two sorts of report – one for the Library to look at how VLE sourced traffic as a whole impacts on the Library website; and a different set of reports that can be provided to course teams and course link librarians to show how students on the course are using the VLE to access Library resources.

PS if you haven’t yet watched Dave Pattern’s presentation on mining lending data records, do so NOW: Can You Dig It? A Systems Perspective.