If you want to search across all university prospectuses, what do you do? Suffer, that’s what…
At the CETIS/DevCSI hackday, a group of undevelopers* got together to build a Google Custom Search Engine that searches over the undergraduate prospectus pages on all the UK university websites.
If you want to try it out, there’s a basic version running at: CourseDetective.co.uk
The first thing we had to do was grab a list of HEIs – @dkernohan grabbed one off the HEFCE website (I think? Do you have a link to the one you grabbed, DK?) and popped it into a Google Spreadsheet so we could all work on it.
A quick first pass meant Googling each university (e.g. for Undergraduate course foo university) and then finding the prospectus home page. A bit of digging then took us to an example of an actual course page. The intention was to find as deep a path as possible into the website that would still return individual pages for each course.
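We did the "deepest path" step by eye, but it could probably be semi-automated. Here is a minimal sketch, assuming we have collected a few example course page URLs for one institution; the example.ac.uk URLs and the function name deepest_common_path are made up for illustration and are not part of what we built on the day.

```python
from urllib.parse import urlsplit

def deepest_common_path(urls):
    """Return the deepest directory path shared by every URL in the list."""
    parts = [urlsplit(u) for u in urls]
    hosts = {(p.scheme, p.netloc) for p in parts}
    if len(hosts) != 1:
        raise ValueError("URLs must share a scheme and host")
    scheme, netloc = hosts.pop()
    # Split each path into segments, dropping the leading empty string
    # and the final segment (the course page itself).
    segment_lists = [p.path.split("/")[1:-1] for p in parts]
    common = []
    for segments in zip(*segment_lists):
        if len(set(segments)) != 1:
            break  # the paths diverge at this depth
        common.append(segments[0])
    path = "/".join(common)
    return f"{scheme}://{netloc}/{path}/" if path else f"{scheme}://{netloc}/"

# Hypothetical course pages for a made-up institution.
examples = [
    "http://www.example.ac.uk/study/ug/courses/2011/physics.html",
    "http://www.example.ac.uk/study/ug/courses/2011/history.html",
    "http://www.example.ac.uk/study/ug/courses/2012/maths.html",
]
root = deepest_common_path(examples)
```

The result is only a candidate: it still needs a human check that everything under that path really is a course page.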
The number of URL patterns is, unsurprisingly, as large as the number of institutions, with few commonalities. For example, here are the first five (alphabetical order):
Some URLs embed the year of entry:
Sometimes course identifiers appear as variables:
And so on…
Having got paths into the prospectus, we used them to define a custom search engine. The first pass was to paste the links directly into the search engine definition wizard. (We maybe need to check we’ve done this correctly: we actually have two link types – one where we do “URL contains”, such as http://www.beds.ac.uk/courses, the other where we should be checking against a pattern e.g. http://www.nottingham.ac.uk/ugstudy/course.php?inc=course&code=*. I wonder, how would we cope with capturing both ?inc=course&code=* and ?&code=*&inc=course)
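On the parameter ordering question, I don't know how the CSE itself matches patterns, so this doesn't answer that. For checking our own link list offline, though, one option is to parse the query string rather than compare it as text, so the order of the arguments stops mattering. A rough sketch, using the Nottingham pattern above; the function name is_course_page and the example course code are my own inventions.

```python
from urllib.parse import urlsplit, parse_qs

def is_course_page(url, host="www.nottingham.ac.uk", path="/ugstudy/course.php"):
    """True if the URL is a course page, whatever order its arguments come in."""
    parts = urlsplit(url)
    if parts.netloc != host or parts.path != path:
        return False
    # parse_qs returns a dict of lists, e.g. {"inc": ["course"], "code": ["X100"]},
    # and skips empty pairs such as the one produced by a leading "&".
    params = parse_qs(parts.query)
    return params.get("inc") == ["course"] and bool(params.get("code"))

# "X100" is a made-up course code.
one_order = is_course_page("http://www.nottingham.ac.uk/ugstudy/course.php?inc=course&code=X100")
other_order = is_course_page("http://www.nottingham.ac.uk/ugstudy/course.php?&code=X100&inc=course")
not_a_course = is_course_page("http://www.nottingham.ac.uk/ugstudy/course.php?inc=other&code=X100")
```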
Something we did notice that is a *huge* problem with Google Custom Search Engines is that if you collaborate with other people in populating the same CSE, you can only view the links added by one person at a time. So I could look at the links David had added, and he could look at the links I had added, but we couldn’t look at all the links at the same time :-(
The next step was to generate a CSE definition file from the imported links, so that we could (in theory) start to craft a machine generated CSE definition file (see Transitioning a CSE). At least one copy of the CSE file is available on the coursedetective github site (look for the annotations.xml file).
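For what it's worth, here is roughly how a machine-generated definition file might be produced from a list of URL patterns. This is a sketch under assumptions: the Annotations/Annotation/Label element names and the about and name attributes are how I understand the CSE annotations format, so check them against the annotations.xml file in the repository, and the label _cse_examplelabel is a placeholder, not our real one.

```python
import xml.etree.ElementTree as ET

def build_annotations(url_patterns, label="_cse_examplelabel"):
    """Build an annotations XML string with one Annotation per URL pattern.

    The label value is a placeholder; a real CSE has its own label name.
    """
    root = ET.Element("Annotations")
    for pattern in url_patterns:
        annotation = ET.SubElement(root, "Annotation", {"about": pattern})
        ET.SubElement(annotation, "Label", {"name": label})
    # ElementTree escapes characters such as "&" in attribute values for us.
    return ET.tostring(root, encoding="unicode")

patterns = [
    "www.beds.ac.uk/courses/*",
    "www.nottingham.ac.uk/ugstudy/course.php?inc=course&code=*",
]
annotations_xml = build_annotations(patterns)
```

Driving this from the shared spreadsheet of links would also get round the problem of only seeing one person's links at a time, since the spreadsheet becomes the master list.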
To host the site, my first thought was to use Blogger, but this was a bit limiting in terms of possible site design; my second was Google Sites. However, Google Sites seems to strip out the embed code that the Google Custom Search Engine wizard generates, so instead we opted for Google App Engine using this template. (It would be really helpful if Google Sites provided a trivial way of embedding a Google Custom Search Engine in a Sites page…?)
To make the hack relevant to the OERhackday, David added some course OER links and an OER category to the search engine that would allow users to (ideally) locate topic related OERs. The longer term vision is that users should be able to discover courses via OERs, and also check out OERs associated with a course as part of the “what course should I do?” research process.
To enrich the search results further, we also started to collate the URLs for the official institutional YouTube pages so we could search into those videos as well as course prospectus pages and OERs. I’m not sure if YouTube videos can be previewed in CSE results listings, but it’s something to explore… ;-)
On the design side, we didn’t manage to get any CSS out, but James and Joel did come up with a stylish design, as you’ve seen above:-)
In terms of usage, the site is currently unstyled, but it is functional. The results can also be accessed via API calls (the current CSE ID is 006974165492396950327:xvnuayaygic). For universities wanting to compare “Google” searches against their online prospectuses and those of other HEIs, CourseDetective might be an appropriate tool for the SEO toolbox?
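As an illustration of the API route, here is a sketch that builds a request URL for the CSE. I'm assuming Google's Custom Search JSON API endpoint and its key, cx and q parameters; the API key is a placeholder you would need to supply yourself, and the function name build_search_url is just for this example. The code only constructs the URL and does not make the call.

```python
from urllib.parse import urlencode

CSE_ID = "006974165492396950327:xvnuayaygic"  # the CourseDetective CSE ID given above
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"  # assumed endpoint

def build_search_url(query, api_key, cse_id=CSE_ID):
    """Return a search request URL for the given query against the CSE."""
    params = {"key": api_key, "cx": cse_id, "q": query}
    # urlencode percent-encodes the ":" in the CSE ID and turns spaces into "+".
    return f"{API_ENDPOINT}?{urlencode(params)}"

# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_search_url("physics degree", "YOUR_API_KEY")
```

Fetching that URL with a valid key should return the results as JSON, which could then be compared against a plain Google search for the same terms.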
It struck me just now that by driving the CSE from a linked file, we could actually define multiple linked definition files for different flavours of website, for example boosting or suppressing course results according to user preferences (for example, geography, or other properties we can associate with or derive from, the course prospectus root URL or common search terms.)
A couple of other things I’d like to be able to do: search for foundation degrees; search for part-time degrees; search for distance education degrees; search for postgraduate taught courses.
I also managed to waste a bit of time (i.e. I still haven’t found a workaround) on the analytics side. What I wanted to do was use the AJAX version of the CSE and then use Google Analytics event tracking to track:
– which results were clicked on.
Again, it would be really helpful if the Google Custom Search Engine and Google Analytics folk had a little sit down together to work out how to do at least the first of these, if not the second (they might be protective of folk knowing which links get clicked on in the CSE results?). Although that’s not to say that someone else might not come up with a solution… Please feel free to let me know in the comments if you have just such a fix ;-)
In terms of time and effort, I reckon it took about 6 person hours to collate the links. If anyone fancies helping develop the site further, I think we’re up for that… :-)
* i.e. folk who aren’t developers but aspire to doing developery things howsoever they can ;-) The team included: me, David Kernohan, designers James Roscoe and Joel Reed, Shelagh Finlay and Tracey Murray.