Creating a Google Custom Search Engine Over Hyperlocal Sites

Years ago, I used to spend quite a bit of time playing with Google Custom Search Engines, which allow you to run searches over a specified list of sites, trying to encourage librarians and educators to think about ways in which we might make use of them. I was reminded of this technology yesterday at the a Community Journalism conference, so thought it might be worth posting a quick how to about how to set up a CSE, in particular one that searches over the websites of hyperlocals listed on LocalWebList.net. (If you don’t want to see how it’s done, but do want to try it out, here’s my half-hour hack LocalWebList UK hyperlocal CSE.)

One way of creating a CSE is to manually enter the URLs of the sites you want to search over. Another is to use an annotations file that contains the URLs of sites you want to search over. These files can be hosted on your own site, or uploaded to Google (in the latter case, there is (small) limit on the size of file you can upload – 30KB.

Annotations__Defining_Sites_to_Search_ _ _Custom_Search_ _ _Google_Developers

The simplest annotations file is a two column (URL and Label) tab separated value file containing one row per site you want to include. Typically, sites are included using a URL pattern – onthewight.com/* for example, to say “index over all the pages on the onthewight.com domain.

The data file published by the LocalWebList includes a column containing the URL for the homepage for each hyperlocal site listed. We can download the datafile and then open it in the powerful data cleaning tool OpenRefine to inspect it:

OpenRefine1

OpenRefine2

If you skim through the URLs, you might notice that several sites have simple URLs (example.com), others are a bit more cluttered (example.com/index2.html), others point to sites like facebook. I’m going to make an arbitrary decision to ignore facebook sites and define patterns based on all the pages in a single domain.

To do that, I’m going to create a new column (url2) in OpenRefine from the URL column, that defines just such a pattern based on the original URL.

localweblist_csv_-_OpenRefine4

The following expression:

value.replace(/https?:\/\/([^\/]*)/,'$1').split('/')[0]+'/*'

uses a regular expression to manage just such a transformation.

localweblist_csv_-_OpenRefine5

I can inspect the unique values generated by this transformation by looking at a text facet applied to the new url2 column:

localweblist_csv_-_OpenRefine6

If you sort by count in the text facet, you will see several of the hyperlocal sites have websites hosed on aboutmyarea, or facebook. (Click on one of those links in the text facet to show the sites associated with those domains.) I am going to discount those links from my CSE, so hover over the link and click on the “include” setting to toggle it to “exclude”. Then click on the “invert” option to show all the sites that aren’t the ones you’ve selected as excluded.

localweblist_csv_-_OpenRefine7

This leaves us with sites that are more likely unique:

localweblist_csv_-_OpenRefine8

Having got a filtered lists of sites, we can generate an annotations file containing the URL patterns we want to search over and the CSE label. The label identifies to Google which CSE the URLs in the annotations file apply to. We get that code by generating a CSE

When creating a new CSE, along with giving it a name, you;ll also have to seed it with at least one URL. Simply enter a pattern for a URL you know you want to include in the search engine.

Custom_Search_-_Create_CSE

Hit create, and you’ll have a new CSE…

Custom_Search_-_Congratulations_

From the “Advanced” tab, go to the CSE annotations area and find the code for your CSE:

Custom_Search_-_Advanced

Now we’re in a position to add the CSE code to our annotations file – so copy the CSE label for your CSE… We can create the annotations file in OpenRefine from the “Export” menu,  where we select “Templating”:

localweblist_csv_-_OpenRefine

The templating option allows us to define a custom export template. The template is built up from a header, a row separator, a footer, and a row pattern that describes how to write out each row. I define a simple template as follows, and then export the file.

localweblist_csv_-_OpenRefine11

(Note  – there are other ways I could have done this (indeed, there are often “other ways”!). For example, I could have created a new column containing just the CSE label value, and then done a custom table export, selecting the url2 column and label column, along with the TSV output format.)

Export the annotations file and then import it into the CSE – hit the “Add” button in the CSE annotations area.

Custom_Search_-_Advanced4

Once uploaded (and remember, there is a 30KB file size limit on this route), go back to the Basics tab:  you should find that your custom search engine now lists as sites to be searched over the sites you included in your annotations file, as well as being provided with a link to your CSE.

Custom_Search_-_Basic

You can tweak with some of the styling for the CSE from the “Look and Feel” menu option in the CSE admin pages sidebar.

CSE_-_Look_and_Feel_Layout

If you now click on your CSE URL you should find you have a minimal Google Custom Search engine that searches over several hundred UK hyperlocal websites.

Google_Custom_Search

To add in some of the sites we originally excluded, eg the ones on the aboutmyarea domain, we could add specific URL patterns in explicitly via the CSE control panel.

Google Custom search engines can be really quick to set up in a minimal form, but can also be customised further – for example, with tweaks to the ranking algorithm or with custom annotations (see for example Search Engine Powered Courses).

You can also generate lists of URLs from things like homepage links in Twitter bios grabbed from a Twitter list (eg Using Twitter Lists to Define Custom Search Engines – that code appears to have rotted slightly, but I have a fix…Let me know via the comments if you’re interested in generating CSEs from Twitter lists etc).

As I mentioned at the start, it’s been some years since I played with Google Custom Search Engines – I was really hopeful for them at one point, but Google never really seems to give them any love (not necessarily a bad thing – perhaps they are just enough over and under the radar for Google to cut them?), and I couldn’t seem to persuade anyone else (in the OU at least) that they were worth spending any time on.

I think a few librarians did pick up on them though! And if there is interest in the hyperlocal community for seeing what we might do with them, I’d be happy to put my thinking cap back on, work up some more tutorials or use cases, and run training workshops etc etc.

A Quick Intro to Google Custom Search Engine Definition Files

In Search Engine Powered Courses…, I took an initial, baby step to demonstrate one way in which a promoted link might be used be within a course specific custom search engine. In the next post in this series, I will describe how to influence the positioning of results within a Google custom search engine by boosting their ranking, as well as how results may be ‘faceted’ into different results sets through the use of labels.

In this post, I thought it would be worth taking a step back and reviewing the three configuration files we have access to when defining a Google custom search engine: the configuration file, the promotions file, and the annotations file. If you create a minimal Google custom search engine using the CSE management tools, and then go to the Advanced page, you will see options that allow you to upload the configuration and annotations file. The promotions file can be imported via the Promotions page.

So what do each of these file do?

  • The configuration file defines the top level configuration of the search engine. The easiest way of obtaining a template for a CSE is to create a minimal search engine using the CSE management tools, and then export the configuration file from the Advanced page. The configuration file defines, among other things: whether the search engine will search over the whole web, prioritising (or ‘BOOSTing’) sites and pages indexed explicitly by the CSE, or whether it will just return resuts from the explicilty indexed pages (a FILTER style search engine); a definition of the labels, or facets, that allow different search refinements to be applied as different search strategy contexts within the CSE; some styling information; and information relating to Subscribed Links (more of them in another post, if they’re still supported by then..)..
  • The promotions file allows you do define promoted links within a CSE; in Search Engine Powered Courses…, I give an example of how these might be used in a course search engine.
  • The annotations file identifies the sites and pages that are specific members of the CSE index, as well as how they should be handled (eg the extent to which they should be positively or negatively boosted in the search engine results listing, whether they should appear in the top few results, and what labels or facets should apply to them).

It’s also possible to customise the styling/presentation of the search engine, but that’s a shiny, shiny feature, so probably not something I’ll be looking at…

PS I just noticed you can now manage Google Analytics settings for custom search engines (which allows you to log search queries) from within the CSE control panel… I’m still not sure how easy it is to track which results get clicked through, though?

From “Special Result” to “Promotion” in Google CSEs

In passing, I noticed I had a broken link to a Google CSE documentation page:

http://code.google.com/apis/customsearch/docs/special_results.html

Searching a little, I found the page had moved to

http://code.google.com/apis/customsearch/docs/promotions.html

A cached version of the originally linked page is still available, so I did a side-by-side comparison:

From 'special result' to 'promotion'

Hmmm…

A Custom Search Engine for the Computer Weekly IT Blog Awards 2010 Nominees

If a list of award nominees is a collection of some merit, how can we extract value from it. Yesterday, I took a cheap swipe at the Computer Weekly IT Blog Awards, (Adding Value to the Blog Award Nomination Collections…), which the Computer Weekly folk were very gracious about and asked for “non-developer” help with…

So by way of a peace offering(?!;-), here’s a recipe I came up with last night and this morning for getting a custom search together from a list of blog URLs. What the search engine does is allow you to search over the sites that have been nominated for an award, either across all the nominees, or just the sites nominated in a particular award.

You’ll need:
– several lists of blog homepage URLs (e.g. a separate list for each award);
– a copy of Excel; (I’m assuming this is a better bet than a text editor that supports regular expressions…?!;-)
– a web browser;
– a Google account.

Step 1 – The Basic CSE
The first thing we need to do is set up a new Google custom search engine

Create a google cse

When you create a new CSE, there’s very little to do in the definition…

defining a CSE

Choose a new theme, or just click “Next”…

CSE - theme choice

You now have a minimal CSE… The next step is to add in a few refinements to our search engine…

BAseoc CSE - now add more sites

Step 2 – Refinements
We’re actually not ready to add the URLs we want to search over yet – there’s a little more preparation to do. Click on the Refinements option…

CSE Refinements..

We’re now going to create a search refinement for each award. This will let us just search over the blogs nominated for a particular award. (The “top level” search engine will search over nominees from all the awards.)

Add a refinement for each award, limiting the refinement to search over only the listed sites:

Adding CSE refinements

We’re not ready to add the sites yet – we’re still adding refinements labels… click Close:

Adding CSE refinements

Continue adding refinements until you have listed all the award categories, then go to Sites…

All refinements in...

Step 3 – Finding How to Associate Blog Homepage URLs with Refinement Labels
If we have a set of URLs listed in the CSE, we can add labels to them from the Sites page:

Adding labels to a site

This can take some time, though, so if we’re building a CSE from scratch it can be easier to upload the URLs we want the CSE to search over from a index file, and include details about the search refinement labels we want to apply to each URL in the index file. To find out how to label the URLs, we need to go to the Advanced settings page:

CSE - Advanced settings

We then need to look at the XML context file (don’t worry, it’s not that scary;-):

View CSE XML context file

Here’s what we see (if it doesn’t display, try the “View Source” option which should be somewhere in your browser View menu, or download the XML file and open it in a text editor.)

CSE config file

As you can hopefully see, each search refinement has a Label name, such as cio_it_director (the title element shows the human readable label we added to the refinements list). If we associate a label name with a URL, that URL will be associated with that search refinement.

Step 4 – Preparing the URLs
Right… time to start preparing the URLs… In a spreadsheet, put a header in cell A1 (it doesn’t matter what, we won’t be uploading this column…); for the first award, add one nominee URL per row in column A:

List of URLs in a spreadsheet

We need to prepare the URLs so that the search engine will search over all the pages within the site specified by the URL, as well as tidy it up a little by removing the leading http://. Let’s trim that first using the formula:
=SUBSTITUTE(A3,”http://”,””)

excel substitute

We also need to use a so-called wildcard character that we want the search engine to search the pages contained within the site. So for example, to search all the pages within example.com, we need to rewrite the location as example.com/*.

The following formula does this crudely, and should catch most of the URLs we provide to it…

=IF(RIGHT(B2)=”/”,CONCATENATE(B2,”*”),IF(OR(RIGHT(B2,4)=”.com”,RIGHT(B2,6)=”.co.uk”),CONCATENATE(B2,”/*”),B2))

(Arghh – there’s an error in the screenshot – the RIGHT argument needs a character length specifier for the .com and .co.uk conditionals.)

Excel - adding a wildcard at end of URL

(If you know a better way, please let me know; I’ve already had a suggestion from @herrdoktorc that the CONCATENATE can be replaced using constructions of the form A1&”*”)

We now need to associate the URL with two things. Firstly, a label that identifies our CSE; secondly, a refinement label that specifies the blog award. In Step 3, we saw how to find the refinement labels for each award, but where do we find the identifier for the CSE itself? Here, on the Advanced settings page:

CSE id

(Hmm.. I don’t know if these CSE identifers are supposed to be secret…? if other people upload config files to their CSE using the above ID, will those results be included in my CSE??? Ooops, if so;-)

So we now add the appropriate labels to the URLs – one for the CSE, one for the blog award:

Putting a cse config fiel with refinements together

We can now add the URLs, and their labels to our spreadsheet for the other awards… We also need to add some column headings – URL for the rewritten URLs (with wildcard), Label for the other two:

CSE URLs file

Now copy these three columns and paste them by value to a new sheet, and if you’re diligent, just scan down them looking for URLs that might need the wildcard adding. For example, I noticed a .edu blog, which can be tweaked to .edu/*. A formula such as =IF(RIGHT(CELL)”*”,”check”,””) might help detect URLs that have no wildcard associated with them. (If the URL specifies a page, e.g. index.html, that page will be indexed; make a judgement call if/how to tweak the URL if you want to search more than just that page…)

Save the file as tab delimited text:

Save as text in excel

Step 5 – Upload the file
In the Advanced area of the CSE control panel, you can now select upload your file that contains the list of URLs and their refinement labels:

Add annotation file to CSE

(I had an issue with filetypes/file associations? The uploader thought I had an Excel file, not a tab separated text file, so I just copied everything, pasted it into a new text document, and saved it. Uploading this file worked fine…)

Once uploaded, you can preview your CSE:

CSE Preview

You can also see the refinement labels: clicking on one of these will limit the search to just the blogs nominated in that award category.

Here’s the homepage (unstyled):

CSE homepage

And here’s an instantised version, via Martin Hawksey’s Instant CSE app:

Instantised...

PS The original proof of concept for this took half an hour. The write-up and screenshots took just over 2 hours…:-(

PPS With the uploaded links file, if the same URL is listed twice, the CSE will cope with it, and will just add the additional labels, rather than overwriting previous entries. SO for example, the “Tech for Tesco” blog appears in a couple of categories.

PPPS @daveyp mentioned reuse of lists like this might contravene database rights (e.g. where the list is viewed as a database that took significant effort to compile). So if that’s the case and Computer Weekly sue me, I won’t be too happy…;-)

PPPPS Note that I reserve the right to delete the CSE at any time… the number of annotations (i.e. URLs) that can be accommodated by an individual’s CSE account is limited… and there are plenty of other CSEs I’d like to try out;-)