A Custom Search Engine for the Computer Weekly IT Blog Awards 2010 Nominees
If a list of award nominees is a collection of some merit, how can we extract value from it? Yesterday, I took a cheap swipe at the Computer Weekly IT Blog Awards (Adding Value to the Blog Award Nomination Collections…). The Computer Weekly folk were very gracious about it, and asked for “non-developer” help with…
So by way of a peace offering(?!;-), here’s a recipe I came up with last night and this morning for getting a custom search together from a list of blog URLs. What the search engine does is allow you to search over the sites that have been nominated for an award, either across all the nominees, or just the sites nominated in a particular award.
To follow the recipe, you will need:
- several lists of blog homepage URLs (e.g. a separate list for each award);
- a copy of Excel; (I’m assuming this is a better bet than a text editor that supports regular expressions…?!;-)
- a web browser;
- a Google account.
Step 1 – The Basic CSE
The first thing we need to do is set up a new Google custom search engine.
When you create a new CSE, there’s very little to do in the definition…
Choose a new theme, or just click “Next”…
You now have a minimal CSE… The next step is to add in a few refinements to our search engine…
Step 2 – Refinements
We’re actually not ready to add the URLs we want to search over yet – there’s a little more preparation to do. Click on the Refinements option…
We’re now going to create a search refinement for each award. This will let us just search over the blogs nominated for a particular award. (The “top level” search engine will search over nominees from all the awards.)
Add a refinement for each award, limiting the refinement to search over only the listed sites:
We’re not ready to add the sites yet – we’re still adding refinement labels… Click Close:
Continue adding refinements until you have listed all the award categories, then go to Sites…
Step 3 – Finding How to Associate Blog Homepage URLs with Refinement Labels
If we have a set of URLs listed in the CSE, we can add labels to them from the Sites page:
This can take some time, though, so if we’re building a CSE from scratch it can be easier to upload the URLs we want the CSE to search over from an index file, and to include in that file the search refinement labels we want to apply to each URL. To find out how to label the URLs, we need to go to the Advanced settings page:
We then need to look at the XML context file (don’t worry, it’s not that scary;-):
Here’s what we see (if it doesn’t display, try the “View Source” option which should be somewhere in your browser View menu, or download the XML file and open it in a text editor.)
As you can hopefully see, each search refinement has a Label name, such as cio_it_director (the title element shows the human readable label we added to the refinements list). If we associate a label name with a URL, that URL will be associated with that search refinement.
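If you’d rather not read the XML by eye, here’s a minimal Python sketch that pulls out the label names and their human readable titles. Big caveat: the sample XML is my own illustrative guess at the shape of a context file, based only on the description above (a Label with a name, plus a title element). The element names FacetItem, Label and Title are assumptions, and example_award is a made-up second refinement, so check them against your own file.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: the element names and the second refinement are
# assumptions, not copied from a real CSE context file.
SAMPLE_CONTEXT = """
<CustomSearchEngine>
  <Context>
    <Facet>
      <FacetItem>
        <Label name="cio_it_director" mode="FILTER"/>
        <Title>CIO/IT Director</Title>
      </FacetItem>
    </Facet>
    <Facet>
      <FacetItem>
        <Label name="example_award" mode="FILTER"/>
        <Title>Example Award</Title>
      </FacetItem>
    </Facet>
  </Context>
</CustomSearchEngine>
"""


def refinement_labels(xml_text):
    """Map each human readable refinement title to its label name."""
    root = ET.fromstring(xml_text.strip())
    labels = {}
    for item in root.iter("FacetItem"):
        label = item.find("Label")
        title = item.find("Title")
        # Skip anything that does not have both parts.
        if label is not None and title is not None and title.text:
            labels[title.text.strip()] = label.get("name")
    return labels


for title, name in refinement_labels(SAMPLE_CONTEXT).items():
    print(f"{title}: {name}")
```

The label names (the right-hand side of each printed line) are what get attached to the URLs later on.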
Step 4 – Preparing the URLs
Right… time to start preparing the URLs… In a spreadsheet, put a header in cell A1 (it doesn’t matter what, we won’t be uploading this column…); for the first award, add one nominee URL per row in column A:
We need to prepare the URLs so that the search engine will search over all the pages within the site specified by the URL, as well as tidy it up a little by removing the leading http://. Let’s trim that first using the formula:
We also need to use a so-called wildcard character to tell the search engine that we want it to search all the pages contained within the site. So for example, to search all the pages within example.com, we need to rewrite the location as example.com/*.
The following formula does this crudely, and should catch most of the URLs we provide to it…
(Arghh – there’s an error in the screenshot – the RIGHT argument needs a character length specifier for the .com and .co.uk conditionals.)
(If you know a better way, please let me know; I’ve already had a suggestion from @herrdoktorc that the CONCATENATE can be replaced using constructions of the form A1&"*")
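For anyone who’d rather script this than wrestle with spreadsheet formulas, here’s a rough Python sketch of the same two tidy-up steps: trim the leading http://, then add the wildcard. It is not a translation of the spreadsheet formula. Rather than testing for .com and .co.uk endings, it guesses that a URL names a single page if its last path segment contains a dot. The function name to_cse_pattern is my own invention.

```python
def to_cse_pattern(url):
    """Strip the scheme and add a trailing wildcard so the whole site is searched."""
    pattern = url.strip()
    # Remove a leading http:// (or https://) if there is one.
    for scheme in ("http://", "https://"):
        if pattern.lower().startswith(scheme):
            pattern = pattern[len(scheme):]
            break
    if pattern.endswith("/"):
        return pattern + "*"      # example.com/ -> example.com/*
    if "/" not in pattern:
        return pattern + "/*"     # bare domain: example.com -> example.com/*
    last_segment = pattern.rsplit("/", 1)[-1]
    if "." in last_segment:
        return pattern            # looks like a page, e.g. index.html: leave it alone
    return pattern + "/*"         # a path such as example.co.uk/blog


for url in ["http://example.com", "http://example.co.uk/blog", "http://example.com/index.html"]:
    print(to_cse_pattern(url))
```

As with the formula, this is crude: it leaves page URLs untouched, so you still need to eyeball the results.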
We now need to associate the URL with two things. Firstly, a label that identifies our CSE; secondly, a refinement label that specifies the blog award. In Step 3, we saw how to find the refinement labels for each award, but where do we find the identifier for the CSE itself? Here, on the Advanced settings page:
(Hmm.. I don’t know if these CSE identifiers are supposed to be secret…? If other people upload config files to their CSE using the above ID, will those results be included in my CSE??? Ooops, if so;-)
So we now add the appropriate labels to the URLs – one for the CSE, one for the blog award:
We can now add the URLs, and their labels to our spreadsheet for the other awards… We also need to add some column headings – URL for the rewritten URLs (with wildcard), Label for the other two:
Now copy these three columns and paste them by value to a new sheet, and if you’re diligent, just scan down them looking for URLs that might need the wildcard adding. For example, I noticed a .edu blog, which can be tweaked to .edu/*. A formula such as =IF(RIGHT(CELL,1)<>"*","check","") might help detect URLs that have no wildcard associated with them. (If the URL specifies a page, e.g. index.html, that page will be indexed; make a judgement call if/how to tweak the URL if you want to search more than just that page…)
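The same “does it end in a wildcard?” check is easy to sketch in Python if the list is long. The sample patterns below are placeholders of my own, not real nominees:

```python
def needs_check(pattern):
    """Flag any rewritten URL that does not end in the * wildcard."""
    return not pattern.strip().endswith("*")


# Placeholder patterns for illustration only.
patterns = ["example.com/*", "example.edu", "example.org/index.html"]
flagged = [p for p in patterns if needs_check(p)]

for pattern in flagged:
    print("check:", pattern)
```

Anything flagged still needs a human decision: a bare domain probably wants /* adding, whereas a specific page may be fine as it is.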
Save the file as tab delimited text:
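If you’re not using a spreadsheet, a tab delimited file with the same three columns (URL, Label, Label) can be written directly. This is a sketch under assumptions: _cse_placeholder stands in for whatever CSE identifier your Advanced settings page shows, and write_annotations is just a name I’ve made up.

```python
import csv
import io


def write_annotations(rows, handle):
    """Write (pattern, cse_label, award_label) rows as tab delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Column headings as described above: URL, then Label for each of the two labels.
    writer.writerow(["URL", "Label", "Label"])
    for pattern, cse_label, award_label in rows:
        writer.writerow([pattern, cse_label, award_label])


# "_cse_placeholder" is a stand-in, not a real CSE identifier.
rows = [
    ("example.com/*", "_cse_placeholder", "cio_it_director"),
    ("example.co.uk/blog/*", "_cse_placeholder", "cio_it_director"),
]

buffer = io.StringIO()
write_annotations(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To produce a file rather than a string, pass an open file handle instead of the StringIO buffer, e.g. open("nominees.txt", "w", newline="").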
Step 5 – Upload the file
In the Advanced area of the CSE control panel, you can now select and upload the file that contains the list of URLs and their refinement labels:
(I had an issue with filetypes/file associations? The uploader thought I had an Excel file, not a tab separated text file, so I just copied everything, pasted it into a new text document, and saved it. Uploading this file worked fine…)
Once uploaded, you can preview your CSE:
You can also see the refinement labels: clicking on one of these will limit the search to just the blogs nominated in that award category.
Here’s the homepage (unstyled):
PS The original proof of concept for this took half an hour. The write-up and screenshots took just over 2 hours…:-(
PPS With the uploaded links file, if the same URL is listed twice, the CSE will cope with it, and will just add the additional labels rather than overwriting previous entries. So, for example, the “Tech for Tesco” blog appears in a couple of categories.
PPPS @daveyp mentioned reuse of lists like this might contravene database rights (e.g. where the list is viewed as a database that took significant effort to compile). So if that’s the case and Computer Weekly sue me, I won’t be too happy…;-)
PPPPS Note that I reserve the right to delete the CSE at any time… the number of annotations (i.e. URLs) that can be accommodated by an individual’s CSE account is limited… and there are plenty of other CSEs I’d like to try out;-)