A Custom Search Engine for the Computer Weekly IT Blog Awards 2010 Nominees

If a list of award nominees is a collection of some merit, how can we extract value from it? Yesterday, I took a cheap swipe at the Computer Weekly IT Blog Awards (Adding Value to the Blog Award Nomination Collections…), which the Computer Weekly folk were very gracious about and asked for “non-developer” help with…

So by way of a peace offering(?!;-), here’s a recipe I came up with last night and this morning for getting a custom search together from a list of blog URLs. What the search engine does is allow you to search over the sites that have been nominated for an award, either across all the nominees, or just the sites nominated in a particular award.

You’ll need:
– several lists of blog homepage URLs (e.g. a separate list for each award);
– a copy of Excel; (I’m assuming this is a better bet than a text editor that supports regular expressions…?!;-)
– a web browser;
– a Google account.

Step 1 – The Basic CSE
The first thing we need to do is set up a new Google custom search engine (CSE):

Create a google cse

When you create a new CSE, there’s very little to do in the definition…

defining a CSE

Choose a new theme, or just click “Next”…

CSE - theme choice

You now have a minimal CSE… The next step is to add in a few refinements to our search engine…

Basic CSE - now add more sites

Step 2 – Refinements
We’re actually not ready to add the URLs we want to search over yet – there’s a little more preparation to do. Click on the Refinements option…

CSE Refinements..

We’re now going to create a search refinement for each award. This will let us just search over the blogs nominated for a particular award. (The “top level” search engine will search over nominees from all the awards.)

Add a refinement for each award, limiting the refinement to search over only the listed sites:

Adding CSE refinements

We’re not ready to add the sites yet – we’re still adding refinement labels… click Close:

Adding CSE refinements

Continue adding refinements until you have listed all the award categories, then go to Sites…

All refinements in...

Step 3 – Finding How to Associate Blog Homepage URLs with Refinement Labels
If we have a set of URLs listed in the CSE, we can add labels to them from the Sites page:

Adding labels to a site

This can take some time, though, so if we’re building a CSE from scratch it can be easier to upload the URLs we want the CSE to search over from an index file, and include in that file the search refinement labels we want to apply to each URL. To find out how to label the URLs, we need to go to the Advanced settings page:

CSE - Advanced settings

We then need to look at the XML context file (don’t worry, it’s not that scary;-):

View CSE XML context file

Here’s what we see (if it doesn’t display, try the “View Source” option which should be somewhere in your browser View menu, or download the XML file and open it in a text editor.)

CSE config file

As you can hopefully see, each search refinement has a Label name, such as cio_it_director (the title element shows the human readable label we added to the refinements list). If we associate a label name with a URL, that URL will be associated with that search refinement.
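If you'd rather not read the XML by eye, the label names can be pulled out with a few lines of script. This is only a sketch: the sample XML below is a made-up fragment, and the element names (FacetItem, Label, Title) and the it_consultant label are my assumptions about the layout, based on the description above. Check them against your own context file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for a real context file; the element
# names and the second label are assumptions, not copied from a real CSE.
SAMPLE_CONTEXT = """<CustomSearchEngine>
  <Context>
    <Facet>
      <FacetItem>
        <Label name="cio_it_director" mode="FILTER"/>
        <Title>CIO/IT Director</Title>
      </FacetItem>
    </Facet>
    <Facet>
      <FacetItem>
        <Label name="it_consultant" mode="FILTER"/>
        <Title>IT Consultant</Title>
      </FacetItem>
    </Facet>
  </Context>
</CustomSearchEngine>"""


def refinement_labels(xml_text):
    """Return (label name, human readable title) pairs, in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("FacetItem"):
        label = item.find("Label")
        title = item.find("Title")
        if label is None or not label.get("name"):
            continue  # skip items with no usable label name
        text = title.text.strip() if title is not None and title.text else ""
        pairs.append((label.get("name"), text))
    return pairs


for name, title in refinement_labels(SAMPLE_CONTEXT):
    print(name, "->", title)
```

The label name (the first item in each pair) is the value that goes next to a URL in the index file.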

Step 4 – Preparing the URLs
Right… time to start preparing the URLs… In a spreadsheet, put a header in cell A1 (it doesn’t matter what, we won’t be uploading this column…); for the first award, add one nominee URL per row in column A:

List of URLs in a spreadsheet

We need to prepare the URLs so that the search engine will search over all the pages within the site specified by each URL, and tidy them up a little by removing the leading http://. Let’s trim that first using the formula (typed with plain straight quotes, which is what Excel expects):
=SUBSTITUTE(A3,"http://","")

excel substitute
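For anyone who would rather script this step than use Excel, here is a minimal Python sketch of the same trimming. The function name strip_scheme is my own invention; like the SUBSTITUTE formula, it only removes http:// and leaves anything else (such as https://) untouched.

```python
def strip_scheme(url):
    """Remove every occurrence of 'http://', as the SUBSTITUTE formula does."""
    return url.replace("http://", "")


print(strip_scheme("http://example.com/blog/"))
```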

We also need to use a so-called wildcard character to tell the search engine that we want it to search all the pages contained within the site. So for example, to search all the pages within example.com, we need to rewrite the location as example.com/*.

The following formula does this crudely, and should catch most of the URLs we provide to it…

=IF(RIGHT(B2)="/",CONCATENATE(B2,"*"),IF(OR(RIGHT(B2,4)=".com",RIGHT(B2,6)=".co.uk"),CONCATENATE(B2,"/*"),B2))

(Arghh – there’s an error in the screenshot – the RIGHT argument needs a character length specifier for the .com and .co.uk conditionals.)

Excel - adding a wildcard at end of URL

(If you know a better way, please let me know; I’ve already had a suggestion from @herrdoktorc that the CONCATENATE can be replaced using constructions of the form A1&"*")
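The same crude rule can be written as a short Python function, which may be easier to read than the nested IFs. This is a sketch that mirrors the formula rather than improving on it; add_wildcard is a name I've made up, and the comparison is lower-cased because Excel's = test ignores case.

```python
def add_wildcard(location):
    """Mirror the spreadsheet formula: append a wildcard where the rule applies."""
    lowered = location.lower()  # Excel's "=" comparison is case-insensitive
    if lowered.endswith("/"):
        return location + "*"
    if lowered.endswith((".com", ".co.uk")):
        return location + "/*"
    # Anything else (e.g. .edu sites, or a specific page) is left for a manual check
    return location


for loc in ["example.com/", "example.com", "example.co.uk", "example.edu"]:
    print(loc, "->", add_wildcard(loc))
```

As with the formula, anything that doesn't end in /, .com or .co.uk comes back unchanged, so it still needs checking by hand.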

We now need to associate the URL with two things. Firstly, a label that identifies our CSE; secondly, a refinement label that specifies the blog award. In Step 3, we saw how to find the refinement labels for each award, but where do we find the identifier for the CSE itself? Here, on the Advanced settings page:

CSE id

(Hmm.. I don’t know if these CSE identifiers are supposed to be secret…? If other people upload config files to their CSE using the above ID, will those results be included in my CSE??? Ooops, if so;-)

So we now add the appropriate labels to the URLs – one for the CSE, one for the blog award:

Putting a cse config file with refinements together

We can now add the URLs and their labels for the other awards to our spreadsheet… We also need to add some column headings – URL for the rewritten URLs (with wildcard), Label for the other two:

CSE URLs file

Now copy these three columns and paste them by value to a new sheet, and if you’re diligent, just scan down them looking for URLs that might need the wildcard adding. For example, I noticed a .edu blog, which can be tweaked to .edu/*. A formula such as =IF(RIGHT(CELL)<>"*","check","") might help detect URLs that have no wildcard associated with them. (If the URL specifies a page, e.g. index.html, that page will be indexed; make a judgement call if/how to tweak the URL if you want to search more than just that page…)
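The same check is easy to script if the list is long. A minimal sketch, with needs_check as a made-up name and example locations standing in for the real nominee URLs:

```python
def needs_check(locations):
    """Return the locations that do not end in a wildcard and so need a manual look."""
    return [loc for loc in locations if not loc.endswith("*")]


locations = ["example.com/*", "example.edu", "example.org/blog/index.html"]
print(needs_check(locations))
```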

Save the file as tab delimited text:

Save as text in excel
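If you’re scripting rather than using Excel, the tab delimited file can be written directly. Here’s a sketch using Python’s csv module; the column headings follow the ones described above (URL, then two Label columns), while the function name and the my_cse_label value are placeholders of mine. Use the identifier shown on your own Advanced settings page.

```python
import csv
import io


def annotations_tsv(rows):
    """Build tab delimited text with a URL column and two Label columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["URL", "Label", "Label"])
    for url, cse_label, award_label in rows:
        writer.writerow([url, cse_label, award_label])
    return buf.getvalue()


# "my_cse_label" is a placeholder, not a real CSE identifier
rows = [("example.com/*", "my_cse_label", "cio_it_director")]
print(annotations_tsv(rows))
```

Write the returned text to a plain .txt file and it should be in the same shape as the file saved from Excel.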

Step 5 – Upload the file
In the Advanced area of the CSE control panel, you can now select and upload the file that contains the list of URLs and their refinement labels:

Add annotation file to CSE

(I had an issue with filetypes/file associations? The uploader thought I had an Excel file, not a tab separated text file, so I just copied everything, pasted it into a new text document, and saved it. Uploading this file worked fine…)

Once uploaded, you can preview your CSE:

CSE Preview

You can also see the refinement labels: clicking on one of these will limit the search to just the blogs nominated in that award category.

Here’s the homepage (unstyled):

CSE homepage

And here’s an instantised version, via Martin Hawksey’s Instant CSE app:

Instantised...

PS The original proof of concept for this took half an hour. The write-up and screenshots took just over 2 hours…:-(

PPS With the uploaded links file, if the same URL is listed twice, the CSE will cope with it, and will just add the additional labels, rather than overwriting previous entries. So, for example, the “Tech for Tesco” blog appears in a couple of categories.
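To make that merging behaviour concrete, here’s a small Python model of it. To be clear, this is my own illustration of what the CSE appears to do with repeated URLs, not Google’s code; the URL and the second award label are made up.

```python
from collections import defaultdict


def merge_labels(rows):
    """Collect labels per URL, adding new ones rather than overwriting old ones."""
    merged = defaultdict(list)
    for url, *labels in rows:
        for label in labels:
            if label not in merged[url]:
                merged[url].append(label)
    return dict(merged)


# Hypothetical rows: the same blog nominated in two award categories
rows = [
    ("example.com/*", "my_cse_label", "cio_it_director"),
    ("example.com/*", "my_cse_label", "it_consultant"),
]
print(merge_labels(rows))
```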

PPPS @daveyp mentioned reuse of lists like this might contravene database rights (e.g. where the list is viewed as a database that took significant effort to compile). So if that’s the case and Computer Weekly sue me, I won’t be too happy…;-)

PPPPS Note that I reserve the right to delete the CSE at any time… the number of annotations (i.e. URLs) that can be accommodated by an individual’s CSE account is limited… and there are plenty of other CSEs I’d like to try out;-)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

11 thoughts on “A Custom Search Engine for the Computer Weekly IT Blog Awards 2010 Nominees”

  1. I’m interested in setting up a search engine almost exactly like google’s newsfeed with the difference that it doesn’t include Murdoch’s pay for view sites.

    Any ideas as to how I can do this?

  2. mmmm. used http://news.google.co.uk

    got something working on the ‘try out’ section. Posted the code into my blog (blogger) but there seems to be some incompatibility.

    Shame.

    (It appears to provide scope for the exclusions above).

    Shame ‘cos another project would be to have a dedicated search engine for the uk govs webpage, http://www.legislation.gov.uk/

    When I do this in ‘try-out’ section I get fantastic results but I don’t see any point in posting to my blogger website.

    Will keep on probing etc and thanks for the original suggestions.
