OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘cse’

A Quick Intro to Google Custom Search Engine Definition Files

In Search Engine Powered Courses…, I took an initial baby step to demonstrate one way in which a promoted link might be used within a course-specific custom search engine. In the next post in this series, I will describe how to influence the positioning of results within a Google custom search engine by boosting their ranking, as well as how results may be ‘faceted’ into different result sets through the use of labels.

In this post, I thought it would be worth taking a step back and reviewing the three configuration files we have access to when defining a Google custom search engine: the configuration file, the promotions file, and the annotations file. If you create a minimal Google custom search engine using the CSE management tools, and then go to the Advanced page, you will see options that allow you to upload the configuration and annotations files. The promotions file can be imported via the Promotions page.

So what does each of these files do?

  • The configuration file defines the top level configuration of the search engine. The easiest way of obtaining a template for a CSE is to create a minimal search engine using the CSE management tools, and then export the configuration file from the Advanced page. The configuration file defines, among other things: whether the search engine will search over the whole web, prioritising (or ‘BOOSTing’) sites and pages indexed explicitly by the CSE, or whether it will just return results from the explicitly indexed pages (a FILTER style search engine); a definition of the labels, or facets, that allow different search refinements to be applied as different search strategy contexts within the CSE; some styling information; and information relating to Subscribed Links (more of them in another post, if they’re still supported by then…).
  • The promotions file allows you to define promoted links within a CSE; in Search Engine Powered Courses…, I give an example of how these might be used in a course search engine.
  • The annotations file identifies the sites and pages that are specific members of the CSE index, as well as how they should be handled (eg the extent to which they should be positively or negatively boosted in the search engine results listing, whether they should appear in the top few results, and what labels or facets should apply to them).
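To make the annotations file a little more concrete, here is a minimal Python sketch that builds one from a list of sites and labels. It assumes the Annotations/Annotation/Label XML layout that Google documented for CSE annotations; the site patterns and the label names (`_cse_exampleid`, `course_notes`) are made up for illustration, and boosting scores are left out.

```python
import xml.etree.ElementTree as ET


def build_annotations(entries):
    """Build an annotations document from (url_pattern, [label names]) pairs.

    Assumed layout: one <Annotation about="..."> per URL pattern,
    containing one <Label name="..."/> per label.
    """
    root = ET.Element("Annotations")
    for about, labels in entries:
        annotation = ET.SubElement(root, "Annotation", about=about)
        for name in labels:
            ET.SubElement(annotation, "Label", name=name)
    return ET.tostring(root, encoding="unicode")


# Hypothetical entries: a whole site and a single page.
entries = [
    ("example.com/*", ["_cse_exampleid", "course_notes"]),
    ("example.org/reading-list.html", ["_cse_exampleid"]),
]

print(build_annotations(entries))
```

Treat the output as a starting point to compare against a file exported from your own CSE, rather than as a definitive template.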

It’s also possible to customise the styling/presentation of the search engine, but that’s a shiny, shiny feature, so probably not something I’ll be looking at…

PS I just noticed you can now manage Google Analytics settings for custom search engines (which allows you to log search queries) from within the CSE control panel… I’m still not sure how easy it is to track which results get clicked through, though?

Written by Tony Hirst

September 28, 2011 at 10:33 am

Posted in Search

From “Special Result” to “Promotion” in Google CSEs

In passing, I noticed I had a broken link to a Google CSE documentation page:

http://code.google.com/apis/customsearch/docs/special_results.html

Searching a little, I found the page had moved to

http://code.google.com/apis/customsearch/docs/promotions.html

A cached version of the originally linked page is still available, so I did a side-by-side comparison:

From 'special result' to 'promotion'

Hmmm…

Written by Tony Hirst

September 15, 2011 at 11:25 am

Posted in Search

A Custom Search Engine for the Computer Weekly IT Blog Awards 2010 Nominees

If a list of award nominees is a collection of some merit, how can we extract value from it? Yesterday, I took a cheap swipe at the Computer Weekly IT Blog Awards (Adding Value to the Blog Award Nomination Collections…), which the Computer Weekly folk were very gracious about and asked for “non-developer” help with…

So by way of a peace offering(?!;-), here’s a recipe I came up with last night and this morning for getting a custom search together from a list of blog URLs. What the search engine does is allow you to search over the sites that have been nominated for an award, either across all the nominees, or just the sites nominated in a particular award.

You’ll need:
- several lists of blog homepage URLs (e.g. a separate list for each award);
- a copy of Excel; (I’m assuming this is a better bet than a text editor that supports regular expressions…?!;-)
- a web browser;
- a Google account.

Step 1 – The Basic CSE
The first thing we need to do is set up a new Google custom search engine:

Create a google cse

When you create a new CSE, there’s very little to do in the definition…

defining a CSE

Choose a new theme, or just click “Next”…

CSE - theme choice

You now have a minimal CSE… The next step is to add in a few refinements to our search engine…

Basic CSE - now add more sites

Step 2 – Refinements
We’re actually not ready to add the URLs we want to search over yet – there’s a little more preparation to do. Click on the Refinements option…

CSE Refinements..

We’re now going to create a search refinement for each award. This will let us just search over the blogs nominated for a particular award. (The “top level” search engine will search over nominees from all the awards.)

Add a refinement for each award, limiting the refinement to search over only the listed sites:

Adding CSE refinements

We’re not ready to add the sites yet – we’re still adding refinement labels… click Close:

Adding CSE refinements

Continue adding refinements until you have listed all the award categories, then go to Sites…

All refinements in...

Step 3 – Finding How to Associate Blog Homepage URLs with Refinement Labels
If we have a set of URLs listed in the CSE, we can add labels to them from the Sites page:

Adding labels to a site

This can take some time, though, so if we’re building a CSE from scratch it can be easier to upload the URLs we want the CSE to search over from an index file, and include details about the search refinement labels we want to apply to each URL in the index file. To find out how to label the URLs, we need to go to the Advanced settings page:

CSE - Advanced settings

We then need to look at the XML context file (don’t worry, it’s not that scary;-):

View CSE XML context file

Here’s what we see (if it doesn’t display, try the “View Source” option which should be somewhere in your browser View menu, or download the XML file and open it in a text editor.)

CSE config file

As you can hopefully see, each search refinement has a Label name, such as cio_it_director (the title element shows the human readable label we added to the refinements list). If we associate a label name with a URL, that URL will be associated with that search refinement.
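If there are a lot of refinements, reading the label names off by eye gets tedious. Here's a rough Python sketch that pulls them out of a context file instead. The sample XML is a hypothetical, cut-down fragment (a real export contains more elements), the second label is invented, and I'm assuming each refinement appears as a FacetItem wrapping a Label, with the human readable title held either as a title attribute or a Title child element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified context file; check it against your own export.
SAMPLE_CONTEXT = """
<CustomSearchEngine>
  <Context>
    <Facet>
      <FacetItem title="CIO/IT Director">
        <Label name="cio_it_director" mode="FILTER"/>
      </FacetItem>
    </Facet>
    <Facet>
      <FacetItem title="Individual IT Professional">
        <Label name="individual_it_professional" mode="FILTER"/>
      </FacetItem>
    </Facet>
  </Context>
</CustomSearchEngine>
"""


def refinement_labels(context_xml):
    """Return {label name: human readable title} for each refinement."""
    root = ET.fromstring(context_xml.strip())
    labels = {}
    for item in root.iter("FacetItem"):
        label = item.find("Label")
        if label is None:
            continue
        # The title may be an attribute or a child element (assumption).
        title = item.get("title") or item.findtext("Title")
        labels[label.get("name")] = title
    return labels


for name, title in refinement_labels(SAMPLE_CONTEXT).items():
    print(name, "->", title)
```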

Step 4 – Preparing the URLs
Right… time to start preparing the URLs… In a spreadsheet, put a header in cell A1 (it doesn’t matter what, we won’t be uploading this column…); for the first award, add one nominee URL per row in column A:

List of URLs in a spreadsheet

We need to prepare the URLs so that the search engine will search over all the pages within the site specified by the URL, as well as tidy it up a little by removing the leading http://. Let’s trim that first using the formula:
=SUBSTITUTE(A3,"http://","")

excel substitute

We also need to use a so-called wildcard character to indicate that we want the search engine to search all the pages contained within the site. So for example, to search all the pages within example.com, we need to rewrite the location as example.com/*.

The following formula does this crudely, and should catch most of the URLs we provide to it…

=IF(RIGHT(B2)="/",CONCATENATE(B2,"*"),IF(OR(RIGHT(B2,4)=".com",RIGHT(B2,6)=".co.uk"),CONCATENATE(B2,"/*"),B2))

(Arghh – there’s an error in the screenshot – the RIGHT argument needs a character length specifier for the .com and .co.uk conditionals.)

Excel - adding a wildcard at end of URL

(If you know a better way, please let me know; I’ve already had a suggestion from @herrdoktorc that the CONCATENATE can be replaced using constructions of the form A1&"*")
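For anyone who would rather script this than use a spreadsheet, the two formulas above translate fairly directly into Python. This is just a sketch that mirrors the same crude rules (strip http://, then add a wildcard after a trailing slash or a .com/.co.uk ending); the function names and example URLs are my own inventions.

```python
def strip_scheme(url):
    """Drop any http:// from the URL (mirrors the SUBSTITUTE formula)."""
    return url.replace("http://", "")


def add_wildcard(location):
    """Mirror the nested IF formula: add a wildcard where one is easy to guess."""
    if location.endswith("/"):
        return location + "*"
    if location.endswith((".com", ".co.uk")):
        return location + "/*"
    # Anything else is left untouched and needs a manual check.
    return location


# Made-up example URLs.
for url in ["http://example.com", "http://blog.example.co.uk/", "http://example.edu"]:
    print(add_wildcard(strip_scheme(url)))
```

As with the spreadsheet version, anything that doesn't match one of the rules (the .edu example here) passes through unchanged and needs tweaking by hand.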

We now need to associate the URL with two things. Firstly, a label that identifies our CSE; secondly, a refinement label that specifies the blog award. In Step 3, we saw how to find the refinement labels for each award, but where do we find the identifier for the CSE itself? Here, on the Advanced settings page:

CSE id

(Hmm… I don’t know if these CSE identifiers are supposed to be secret…? If other people upload config files to their CSE using the above ID, will those results be included in my CSE??? Ooops, if so;-)

So we now add the appropriate labels to the URLs – one for the CSE, one for the blog award:

Putting a cse config file with refinements together

We can now add the URLs, and their labels to our spreadsheet for the other awards… We also need to add some column headings – URL for the rewritten URLs (with wildcard), Label for the other two:

CSE URLs file

Now copy these three columns and paste them by value to a new sheet, and if you’re diligent, just scan down them looking for URLs that might need the wildcard adding. For example, I noticed a .edu blog, which can be tweaked to .edu/*. A formula such as =IF(RIGHT(CELL)<>"*","check","") might help detect URLs that have no wildcard associated with them. (If the URL specifies a page, e.g. index.html, that page will be indexed; make a judgement call if/how to tweak the URL if you want to search more than just that page…)
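The same wildcard check is a one-liner outside the spreadsheet too. A small Python sketch, with made-up locations, that lists the entries ending in something other than a wildcard so they can be eyeballed:

```python
def needs_check(location):
    """True when the rewritten location does not end in a wildcard."""
    return not location.endswith("*")


# Made-up rewritten locations.
locations = ["example.com/*", "example.edu", "example.org/blog/index.html"]

flagged = [loc for loc in locations if needs_check(loc)]
print(flagged)
```

Whether a flagged entry actually needs changing is still a judgement call, as with the single-page case mentioned above.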

Save the file as tab delimited text:

Save as text in excel
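If the labelled URLs are already sitting in a script rather than a spreadsheet, the tab delimited file can be written out directly. A minimal Python sketch using the column headings described above (URL, Label, Label); the CSE label `_cse_exampleid` and the example rows are hypothetical.

```python
import csv
import io


def write_annotations_tsv(rows, out):
    """Write (location, cse_label, award_label) rows as tab delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["URL", "Label", "Label"])
    for location, cse_label, award_label in rows:
        writer.writerow([location, cse_label, award_label])


# Hypothetical rows: rewritten URL, CSE label, award refinement label.
rows = [
    ("example.com/*", "_cse_exampleid", "cio_it_director"),
    ("blog.example.co.uk/*", "_cse_exampleid", "cio_it_director"),
]

buffer = io.StringIO()
write_annotations_tsv(rows, buffer)
print(buffer.getvalue())
```

To save to disk instead, pass a file opened with `open("annotations.tsv", "w", newline="")` in place of the `StringIO` buffer (the filename is just an example).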

Step 5 – Upload the file
In the Advanced area of the CSE control panel, you can now select and upload your file that contains the list of URLs and their refinement labels:

Add annotation file to CSE

(I had an issue with filetypes/file associations? The uploader thought I had an Excel file, not a tab separated text file, so I just copied everything, pasted it into a new text document, and saved it. Uploading this file worked fine…)

Once uploaded, you can preview your CSE:

CSE Preview

You can also see the refinement labels: clicking on one of these will limit the search to just the blogs nominated in that award category.

Here’s the homepage (unstyled):

CSE homepage

And here’s an instantised version, via Martin Hawksey’s Instant CSE app:

Instantised...

PS The original proof of concept for this took half an hour. The write-up and screenshots took just over 2 hours…:-(

PPS With the uploaded links file, if the same URL is listed twice, the CSE will cope with it, and will just add the additional labels, rather than overwriting previous entries. So, for example, the “Tech for Tesco” blog appears in a couple of categories.
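To see what that merging amounts to, here is a small Python sketch that imitates the behaviour on the rows of a links file. It only illustrates the outcome described above, not how Google implements it, and the URL and label names are invented.

```python
from collections import defaultdict


def merge_labels(rows):
    """Collect every label seen for each URL, rather than overwriting."""
    merged = defaultdict(set)
    for location, *labels in rows:
        merged[location].update(labels)
    return {location: sorted(labels) for location, labels in merged.items()}


# Invented rows: the same blog nominated in two award categories.
rows = [
    ("example.com/*", "_cse_exampleid", "award_one"),
    ("example.com/*", "_cse_exampleid", "award_two"),
]

print(merge_labels(rows))
```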

PPPS @daveyp mentioned reuse of lists like this might contravene database rights (e.g. where the list is viewed as a database that took significant effort to compile). So if that’s the case and Computer Weekly sue me, I won’t be too happy…;-)

PPPPS Note that I reserve the right to delete the CSE at any time… the number of annotations (i.e. URLs) that can be accommodated by an individual’s CSE account is limited… and there are plenty of other CSEs I’d like to try out;-)

Written by Tony Hirst

October 22, 2010 at 10:52 am
