A Quick Lookup Service for UK University Bursary & Scholarship Pages

Here’s a quick recipe for grabbing a set of links from an alphabetised set of lookup pages and then providing a way of looking them up… The use case is to look up URLs of pages on the websites of colleges and universities offering financial support for students as part of the UK National Scholarship Programme, as described on the DirectGov website:

National Scholarship programme

The index pages list institutions alphabetically, with entries marked up in HTML fragments of the form:

<div class="subContent">
						<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www.anglia.ac.uk/nsp"   target="_blank">Anglia Ruskin University<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
				<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx"   target="_blank">Arts University College at Bournemouth<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
				<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www1.aston.ac.uk/study/undergraduate/student-finance/tuition-fees/2012-entry/ "   target="_blank">Aston University Birmingham<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>

I’ve popped a quick scraper onto Scraperwiki (University Bursaries / Scholarship / Bursary Pages) that trawls the index pages A-Z, grabs the names of the institutions and the URLs they link to, and pops them into a database.

import scraperwiki
import string,lxml.html

# A function I usually bring in with lxml that strips tags and just gives you the text contained in an XML subtree
## via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)
#As it happens, we're not actually going to use this function in this scraper, so we could remove it from the code...

#The Directgov A-Z index page URL pattern, one page per letter (actual pattern omitted here)
INDEX_URL_PATTERN='...'

# We want to poll through page URLs indexed by an uppercase alphachar
allTheLetters = string.uppercase

for letter in allTheLetters:
    print letter
    #Generate the URL for this letter's index page
    url = INDEX_URL_PATTERN % letter
    #Grab the HTML page from the URL and generate an XML object from it
    #There are probably more efficient ways of doing this scrape...
    html = scraperwiki.scrape(url)
    page = lxml.html.fromstring(html)
    for element in page.findall('.//div'):
        if element.find('h3') is not None and element.find('h3').text==letter:
            for uni in element.findall('.//li/a'):
                print uni.text,uni.get('href')
                scraperwiki.sqlite.save(unique_keys=["href"], data={"href":uni.get('href'), "uni":uni.text})

Running this gives a database containing the names of the institutions that signed up to the National Scholarship Programme and links to the information they have about scholarships and bursaries available in that context.

The Scraperwiki API allows you to run queries on this database and get the results back as JSON, HTML, CSV or RSS: University Bursaries API. So for example, we can search for bursary pages on Liverpool colleges and universities websites:

Scraperwiki API
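A query of this sort can be sketched in Python as follows; note that the scraper shortname, the `swdata` table and column names, and the minimal URL-encoding helper are my own assumptions for illustration, not taken from the actual scraper:

```python
# Sketch: build a Scraperwiki datastore API query URL (classic v1.0 API).
# The shortname and table/column names below are illustrative assumptions.
def build_query_url(shortname, query):
    base = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
    # Minimal percent-encoding for the characters we expect in a short SQL query
    encoded = query.replace("%", "%25").replace(" ", "%20").replace("'", "%27")
    return base + "?format=json&name=" + shortname + "&query=" + encoded

url = build_query_url("university_bursaries",
                      "select uni,href from swdata where uni like '%Liverpool%'")
print(url)
```

Fetching that URL (with `urllib`, say) would then return the matching rows as JSON, ready to drop into a webpage or another script.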

We can also generate a view over the data on Scraperwiki… (this script shows how to interrogate the Scraperwiki database from within a webpage).

Finally, if we take the URLs from the bursary pages and pop them into a Google custom search engine, we can now search over just those pages… UK HE Financial Support (National Scholarship Programme) Search Engine. (Note that this is a bit ropey at the moment.) If you own the CSE, it’s easy enough to grab embed codes that allow you to pop search and results controls for the CSE into your own webpage.

(On the to do list is generate a view over the data that defines a Google Custom Search Engine Annotations file that can be used to describe the sites/pages searched over by the CSE.)
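As a rough sketch of what that might look like, the following builds a CSE Annotations XML file from a list of URLs; the label name is an assumption on my part, and would need to match a label declared in the CSE's definition file:

```python
# Sketch: generate a Google CSE Annotations file from a list of page URLs.
# The label name below is a hypothetical placeholder - it must match a label
# defined in the CSE context/definition file.
def make_annotations(urls, label="_cse_bursaries"):
    lines = ["<Annotations>"]
    for url in urls:
        lines.append('  <Annotation about="%s">' % url)
        lines.append('    <Label name="%s"/>' % label)
        lines.append("  </Annotation>")
    lines.append("</Annotations>")
    return "\n".join(lines)

xml = make_annotations(["http://www.anglia.ac.uk/nsp"])
print(xml)
```

In principle, a Scraperwiki view could emit exactly this XML over all the hrefs in the database, giving a self-updating annotations feed for the CSE.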

Just Back From #DevXS

Just back home from #devXS, the first DevCSI student developer event held at the University of Lincoln, in which a shed load (literally!) of student developers gave up their weekend for a 24 hour code bash (and 2 minute Remembrance Sunday silence) on projects of their own design. Well done to all the teams for their hacks and apps – I’m guessing a list of prize winners will appear on the DevXS blog, but you can find a full list on the wiki.

It was really encouraging to see several teams hacking out apps and services around course code data – it’s just a shame that UCAS Terms and Conditions make it so hard for folk to find an open way in to getting hold of a national catalogue of course codes. In the same way that restrictions on UK postcode data held back grass roots development for way too long until recently, access to course code data – which UCAS could help out with – is really holding back the development of grass roots apps around course choice and selection…if crappy license conditions are respected of course… (is there an “in the public interest” defence that could be mounted against respecting such terms and conditions?!)

Here’s the overall winning app, from St Andrews’ Another Team: UUG: the Unofficial University Guide

Many congrats and thanks to the local organisers Alex Bilbie, Nick Jackson, Joss Winn, Jamie Mahoney and any others I may have omitted (apols…) as well as UKOLN’s DevCSI co-ordinator Mahendra Mahey. Great stuff, chaps:-)

PS FWIW, here are my slides from the presentation I gave at the event, as well as a hack I did along the way.