Years ago, I used to spend quite a bit of time playing with Google Custom Search Engines, which allow you to run searches over a specified list of sites, trying to encourage librarians and educators to think about ways in which we might make use of them. I was reminded of this technology yesterday at the a Community Journalism conference, so thought it might be worth posting a quick how to about how to set up a CSE, in particular one that searches over the websites of hyperlocals listed on LocalWebList.net. (If you don’t want to see how it’s done, but do want to try it out, here’s my half-hour hack LocalWebList UK hyperlocal CSE.)
One way of creating a CSE is to manually enter the URLs of the sites you want to search over. Another is to use an annotations file that contains the URLs of sites you want to search over. These files can be hosted on your own site, or uploaded to Google (in the latter case, there is (small) limit on the size of file you can upload – 30KB.
The simplest annotations file is a two column (URL and Label) tab separated value file containing one row per site you want to include. Typically, sites are included using a URL pattern – onthewight.com/* for example, to say “index over all the pages on the onthewight.com domain.
The data file published by the LocalWebList includes a column containing the URL for the homepage for each hyperlocal site listed. We can download the datafile and then open it in the powerful data cleaning tool OpenRefine to inspect it:
If you skim through the URLs, you might notice that several sites have simple URLs (example.com), others are a bit more cluttered (example.com/index2.html), others point to sites like facebook. I’m going to make an arbitrary decision to ignore facebook sites and define patterns based on all the pages in a single domain.
To do that, I’m going to create a new column (url2) in OpenRefine from the URL column, that defines just such a pattern based on the original URL.
The following expression:
uses a regular expression to manage just such a transformation.
I can inspect the unique values generated by this transformation by looking at a text facet applied to the new url2 column:
If you sort by count in the text facet, you will see several of the hyperlocal sites have websites hosed on aboutmyarea, or facebook. (Click on one of those links in the text facet to show the sites associated with those domains.) I am going to discount those links from my CSE, so hover over the link and click on the “include” setting to toggle it to “exclude”. Then click on the “invert” option to show all the sites that aren’t the ones you’ve selected as excluded.
This leaves us with sites that are more likely unique:
Having got a filtered lists of sites, we can generate an annotations file containing the URL patterns we want to search over and the CSE label. The label identifies to Google which CSE the URLs in the annotations file apply to. We get that code by generating a CSE…
When creating a new CSE, along with giving it a name, you;ll also have to seed it with at least one URL. Simply enter a pattern for a URL you know you want to include in the search engine.
Hit create, and you’ll have a new CSE…
From the “Advanced” tab, go to the CSE annotations area and find the code for your CSE:
Now we’re in a position to add the CSE code to our annotations file – so copy the CSE label for your CSE… We can create the annotations file in OpenRefine from the “Export” menu, where we select “Templating”:
The templating option allows us to define a custom export template. The template is built up from a header, a row separator, a footer, and a row pattern that describes how to write out each row. I define a simple template as follows, and then export the file.
(Note – there are other ways I could have done this (indeed, there are often “other ways”!). For example, I could have created a new column containing just the CSE label value, and then done a custom table export, selecting the url2 column and label column, along with the TSV output format.)
Export the annotations file and then import it into the CSE – hit the “Add” button in the CSE annotations area.
Once uploaded (and remember, there is a 30KB file size limit on this route), go back to the Basics tab: you should find that your custom search engine now lists as sites to be searched over the sites you included in your annotations file, as well as being provided with a link to your CSE.
You can tweak with some of the styling for the CSE from the “Look and Feel” menu option in the CSE admin pages sidebar.
If you now click on your CSE URL you should find you have a minimal Google Custom Search engine that searches over several hundred UK hyperlocal websites.
To add in some of the sites we originally excluded, eg the ones on the aboutmyarea domain, we could add specific URL patterns in explicitly via the CSE control panel.
Google Custom search engines can be really quick to set up in a minimal form, but can also be customised further – for example, with tweaks to the ranking algorithm or with custom annotations (see for example Search Engine Powered Courses).
You can also generate lists of URLs from things like homepage links in Twitter bios grabbed from a Twitter list (eg Using Twitter Lists to Define Custom Search Engines – that code appears to have rotted slightly, but I have a fix…Let me know via the comments if you’re interested in generating CSEs from Twitter lists etc).
As I mentioned at the start, it’s been some years since I played with Google Custom Search Engines – I was really hopeful for them at one point, but Google never really seems to give them any love (not necessarily a bad thing – perhaps they are just enough over and under the radar for Google to cut them?), and I couldn’t seem to persuade anyone else (in the OU at least) that they were worth spending any time on.
I think a few librarians did pick up on them though! And if there is interest in the hyperlocal community for seeing what we might do with them, I’d be happy to put my thinking cap back on, work up some more tutorials or use cases, and run training workshops etc etc.