In the world of search, what happens when your local library categorises the film “Babe” as a teen movie in the “classic DVDs” section of the Library website? What happens if you search for “babe teen movie” on Google using any setting other than Safe Search set to “Use strict filtering”? Welcome, in a roundabout way, to the world of search engine consequences.
To set the scene, you need to know a few things. Firstly, how the web search engines know about your web pages in the first place. Secondly, how they rank your pages. And thirdly, how the ads and “related items” that annotate many search listings and web pages are selected (because the whole point about search engines is that they are run by advertising companies, right?).
So, how do the search engines know where your pages are? Any offers? You at the back there – how does Google know about your library home page?
As far as I know, there are three main ways:
- a page that is already indexed by the search engine links to your page. When a search engine indexes a page, it makes a note of all the pages that are linked to from that page, so these pages can be indexed in turn by the search engine crawler (or spider). (If you have a local search engine, it will crawl your website in a similar way, as this documentation about the Google Search Appliance crawler describes.)
- the page URL is listed in a Sitemap, and your website manager has told the search engine where that Sitemap lives. The Sitemap lists all the pages on your site that you want indexing. This helps the search engine out – it doesn’t have to crawl your site looking for all the pages – and it helps you out: you can tell the search engine how often the page changes, and how often it needs to be re-indexed, for example.
- you tell the search engine at a page level that the page exists. For example, if your page includes any Google Adsense modules, or Google Analytics tracking codes, Google will know that page exists the first time it is viewed.
And once a page has been indexed by a search engine, it becomes a possible search result within that search engine.
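To make the first of those routes a little more concrete, here is a minimal sketch of the link discovery step, using only the Python standard library. The `LinkCollector` class, the `discover_links` function and the example URLs are all made up for illustration; a real crawler also has to fetch pages, respect robots.txt, avoid revisiting pages, and a great deal more.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> on a page (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page being indexed.
                    self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    """Return the absolute URLs linked to from one page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content and URLs.
page = '<p><a href="/films/babe">Babe</a> <a href="https://example.org/about">About</a></p>'
found = discover_links("https://library.example.org/dvds", page)
```

Each URL in `found` would then be queued up to be fetched and indexed in its turn.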
So when someone makes a query, how are the results selected?
During the actual process of indexing, the search engine does some voodoo magic to try and understand what the page is about. This might be as simple as counting the occurrences of every different word on the page, or it might be a more ambitious attempt to extract meaning using all manner of heuristics and semantic engineering approaches. A page is deemed a “hit” for a particular query if the search terms can be used to look up the page in the index. The hits are then rank ordered according to a score for each page, calculated using whatever algorithm the search engine uses. Typically this is some function of both “relevance” and “quality”.
“Relevance” is identified in part by comparing the terms the page has been indexed under with the terms in the query.
“Quality” often relates to how well regarded the page is in the greater scheme of things. For example, link analysis identifies how many other pages link to the page (where we assume a link is some sort of vote of confidence for the page), and clickthrough analysis monitors how many times people click through on a particular result for a particular query in a search engine results listing. (As search engine companies increasingly run web analytics packages too, they can potentially factor this information back into calculating how satisfied a user was with a particular page.)
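Here is a toy sketch of that index-then-rank idea. It is emphatically not how any real search engine works: I am assuming that “relevance” is just a count of how often the query terms appear on the page, that “quality” is just the number of inbound links, and that the two are combined by a made-up formula. The page paths, the text and the link counts are all invented.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the pages it appears on, with a per-page count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index


def search(index, inbound_links, query):
    """Rank hits by relevance * (1 + quality). The formula is an assumption."""
    relevance = Counter()
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            relevance[url] += count

    def score(url):
        return relevance[url] * (1 + inbound_links.get(url, 0))

    # Highest score first; ties broken alphabetically so the order is stable.
    return sorted(relevance, key=lambda url: (-score(url), url))


# Hypothetical pages and inbound link counts.
pages = {
    "/dvds/babe": "Babe classic teen movie about a pig",
    "/events/teen-film-club": "Teen film club: teen movie night",
    "/opening-hours": "Library opening hours",
}
inbound_links = {"/dvds/babe": 2}

index = build_index(pages)
results = search(index, inbound_links, "babe teen movie")
```

Both film pages are hits for “babe teen movie”, and the one with more inbound links ranks first. The opening hours page is never looked up because none of the query terms appear on it.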
So what are the consequences of publishing content to the web, and letting a search engine index it?
At this point it’s worth considering not just web search engines, but also image and video search engines. Take Youtube, for example. Posting a video to Youtube means that Youtube will index it. And one of the nice things about Youtube is that it will index your video according to at least the title, description and tags you add to the movie, and use this as the basis for recommending other “related” videos to you. You might think of this in terms of Youtube indexing your video in a particular way, and then running a query for other videos using those index terms as the search terms.
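That way of thinking about it can be sketched in a few lines. To be clear, this is my mental model rather than Youtube’s actual recommender: I am assuming related videos are simply those that share the most index terms, and the catalogue entries and function names are invented for the example.

```python
import re


def index_terms(title, description, tags):
    """Reduce a video's metadata to a set of lower-cased index terms."""
    text = " ".join([title, description, " ".join(tags)])
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def related(video_id, catalogue, limit=3):
    """Use one video's index terms as a query against all the other videos."""
    query = index_terms(**catalogue[video_id])
    scored = []
    for other_id, meta in catalogue.items():
        if other_id == video_id:
            continue
        overlap = len(query & index_terms(**meta))
        if overlap:
            scored.append((overlap, other_id))
    # Most shared terms first; ties broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:limit]]


# A hypothetical three-video catalogue.
catalogue = {
    "library-mini-movie": {
        "title": "Mini Movie Masterpieces",
        "description": "Short films made by teens at the library",
        "tags": ["teen", "library"],
    },
    "unrelated-teen-clip": {
        "title": "Teen mini clip",
        "description": "Nothing to do with libraries",
        "tags": ["teens"],
    },
    "cooking": {
        "title": "Baking bread",
        "description": "A sourdough tutorial",
        "tags": ["food"],
    },
}

suggestions = related("library-mini-movie", catalogue)
```

The library video and the unrelated clip share three terms, so the clip comes back as “related” even though its uploader has never heard of the library.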
Bearing that in mind, a major issue is that you can’t completely control where a page might turn up in a search engine results listing or what other items might be suggested as “related items”. If you need to be careful managing your “brand” online, or you have a duty of care to your users, being associated with inappropriate material in a search engine results listing can be a risky affair.
To try and ground this with a real world example, check out David Lee King’s post from a couple of weeks ago on YouTube Being Naughty Today. It tells a story of how “[a] month or so ago, some of my library’s teen patrons participated in a Making Mini Movie Masterpieces program held at my library. Cool program!”
One of our librarians just posted the videos some of the teens made to YouTube … and guess what? In the related videos section of the video page (and also on the related videos flash thing that plays at the end of an embedded YouTube video), some … let’s just say “questionable” videos appeared.
Here’s what I think happened: YouTube found “similar” videos based on keywords. And the keywords it found in our video include these words in the title and description: mini teen teens . Dump those into YouTube and you’ll unfortunately find some pretty “interesting” videos.
And here’s how I commented on the post:
If you assume everything you publish on the web will be subject to simple term extraction or semantic term extraction, then in a sense it becomes a search query crafted around those terms that will potentially associate the results of that “emergent query” with the content itself.
One of the functions of the Library used to be classifying works so they could be discovered. Maybe now there is a need for understanding how machines will classify web published content so we can try to guard against “consequential annotations”?
For a long time I’ve thought one role for the Library going forwards is SEO – and raising the profile of the host institution’s content in the dominant search engines. But maybe an understanding of SEO is also necessary in a *defensive* capacity?
This need for care is particularly relevant if you run Adsense on your website. Adsense works in a conceptually similar way to the Youtube related videos service, so you might think of it in the following terms: the page is indexed, key terms are identified and run as “queries” on the Adsense search engine, returning “relevant ads” as a result. “Relevance” in this case is calculated (in part) based on how much the advertiser is willing to pay for their advert to be returned against a particular search term or in the context of a page that appears to be about a particular topic.
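As a sketch of that mental model (and only that: I have no inside knowledge of how Adsense actually selects ads), suppose each ad carries a list of keywords and a bid, and the score is the number of matching page terms multiplied by the bid. The advertiser names, keywords and bids below are all invented.

```python
import re


def page_terms(text):
    """Reduce page text to a set of lower-cased terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_ads(page_text, ads, limit=2):
    """Score each ad as (number of matching terms) * bid. The formula is an assumption."""
    terms = page_terms(page_text)
    scored = []
    for ad in ads:
        matched = terms & set(ad["keywords"])
        if matched:
            scored.append((len(matched) * ad["bid"], ad["name"]))
    # Highest score first; ties broken by name so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical ad inventory.
ads = [
    {"name": "film-festival", "keywords": ["movie", "film"], "bid": 0.10},
    {"name": "unwelcome-advertiser", "keywords": ["teen", "mini"], "bid": 0.50},
    {"name": "garden-centre", "keywords": ["plants"], "bid": 1.00},
]

chosen = pick_ads("Mini movie masterpieces made by teen patrons", ads)
```

In this toy setup the unwelcome advertiser matches more terms and bids more per term, so it is placed ahead of the film festival. The garden centre bids the most but matches nothing on the page, so it never appears.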
Although controls are starting to appear that give the page publisher an element of control over what ads appear, there is still uncertainty in the equation, as there is whenever your content appears alongside content that is deemed “relevant” or related in some way – whether that’s in the context of a search engine results listing, an Adsense placement, or a related video.
So one thing to keep in mind – always – is how might your page be indexed, and what sorts of query, ad placement or related item might it be the perfect match for? What are the search engine consequences of writing your page in a particular way, or including particular key words in it?
David’s response to my comment identified a major issue here: “The problem is SO HUGE though, because … at least at my library’s site … everyone needs this training. Not just the web dudes! In my example above, a youth services librarian posted the video – she has great training in helping kids and teens find stuff, in running successful programs for that age group … but probably not in SEO stuff.”
I’m not sure I have the answer, but I think this is another reason Why Librarians Need to Know About SEO – not just so they can improve the likelihood of content they do approve of appearing in search engine listings or as recommended items, but also so they can defend against unanticipated SEO, where they unconsciously optimise a page so that it fares well on an unanticipated, or unwelcome, search.
What that means is, you need to know how SEO works so you don’t inadvertently do SEO on something you didn’t intend to optimise for; or so you can institute “SEO countermeasures” to try to defuse any potentially unwelcome search engine consequences that might arise for a particular page.
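One very simple countermeasure might be a pre-publication check that extracts the terms from a title and description and flags any that appear on a watchlist. This is only a sketch of the idea: the `flag_risky_terms` function and the watchlist itself are hypothetical, and deciding what belongs on such a list is exactly the kind of judgement that needs training.

```python
import re


def flag_risky_terms(text, watchlist):
    """Return, in alphabetical order, the watchlist terms that appear in the text."""
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(terms & watchlist)


# A hypothetical watchlist, based on the terms mentioned in this post.
watchlist = {"babe", "mini", "teen", "teens"}

flagged = flag_risky_terms("Mini Movie Masterpieces: films made by teens", watchlist)
```

A flagged term is a prompt for a second look at the wording, not a ban: sometimes “teens” really is the right word for the page.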