A Quick Lookup Service for UK University Bursary & Scholarship Pages

Here’s a quick recipe for grabbing a set of links from an alphabetised set of lookup pages and then providing a way of looking them up… The use case is to look up the URLs of pages on the websites of colleges and universities offering financial support for students as part of the UK National Scholarship Programme, as described on the DirectGov website:

National Scholarship programme

Index pages have URLs indexed by an uppercase letter, and each page lists institutions using markup of the form:

<div class="subContent">
						<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www.anglia.ac.uk/nsp"   target="_blank">Anglia Ruskin University<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
				<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx"   target="_blank">Arts University College at Bournemouth<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
				<div class="subContent">
					<ul class="subLinks">
						<li><a href="http://www1.aston.ac.uk/study/undergraduate/student-finance/tuition-fees/2012-entry/ "   target="_blank">Aston University Birmingham<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>

I’ve popped a quick scraper onto Scraperwiki (University Bursaries / Scholarship / Bursary Pages) that trawls the index pages A-Z, grabs the names of the institutions and the URLs they link to, and pops them into a database.

import scraperwiki
import string,lxml.html

# A function I usually bring in with lxml that strips tags and just gives you the text contained in an XML subtree
## via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):
    result = [ (el.text or "") ]
    for sel in el:
        #Recurse into each child element so that its text is included too
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)
#As it happens, we're not actually going to use this function in this scraper, so we could remove it from the code...

# We want to poll through page URLs indexed by an uppercase alphachar
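allTheLetters = string.ascii_uppercase

for letter in allTheLetters:
    #Generate the URL
    #(INDEX_URL_STEM is a placeholder: the stem of the index page URL is not reproduced in this listing)
    url = INDEX_URL_STEM + letter
    print(letter)
    #Grab the HTML page from the URL and generate an XML object from it
    page = lxml.html.fromstring(scraperwiki.scrape(url))
    #There are probably more efficient ways of doing this scrape...
    for element in page.findall('.//div'):
        #Only look inside the div whose h3 heading matches the current letter
        heading = element.find('h3')
        if heading is not None and heading.text == letter:
            for uni in element.findall('.//li/a'):
                print("%s %s" % (uni.text, uni.get('href')))
                #Using href as the unique key means a rerun updates rows rather than duplicating them
                scraperwiki.sqlite.save(unique_keys=["href"], data={"href":uni.get('href'), "uni":uni.text})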
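
If you want to try the link extraction step without Scraperwiki or lxml to hand, here's a rough offline sketch using only the Python 3 standard library. The sample markup is modelled on the excerpt shown earlier; the h3 letter heading and the closing tags are my assumptions to make the fragment complete, and LinkGrabber and extract_links are just illustrative names. Like uni.text in the scraper, it only keeps the link text that comes before the first child tag, so the "Opens new window" tooltip text is dropped.

```python
from html.parser import HTMLParser

# Sample markup modelled on the index page excerpt; the <h3> heading and the
# closing tags are assumptions added to make the fragment complete.
SAMPLE = """
<div class="subContent">
  <h3>A</h3>
  <ul class="subLinks">
    <li><a href="http://www.anglia.ac.uk/nsp" target="_blank">Anglia Ruskin University<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
    <li><a href="http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx" target="_blank">Arts University College at Bournemouth<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
  </ul>
</div>
"""


class LinkGrabber(HTMLParser):
    """Collect (name, href) pairs for links that sit inside <li> elements."""

    def __init__(self):
        super().__init__()
        self.links = []           # finished (name, href) pairs
        self._in_li = False       # are we inside an <li>?
        self._href = None         # href of the currently open <a>, if any
        self._parts = []          # text collected for the current link
        self._capturing = False   # still before the link's first child tag?

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
        elif tag == "a" and self._in_li:
            self._href = dict(attrs).get("href")
            self._parts = []
            self._capturing = self._href is not None
        elif self._href is not None:
            # A child tag inside the link: stop collecting text here
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._parts).strip(), self._href))
            self._href = None
            self._capturing = False
        elif tag == "li":
            self._in_li = False


def extract_links(html):
    """Return a list of (institution name, URL) pairs found in the markup."""
    grabber = LinkGrabber()
    grabber.feed(html)
    grabber.close()
    return grabber.links


for name, href in extract_links(SAMPLE):
    print(name, href)
```

Unlike the scraper, this sketch doesn't check the h3 letter heading, so it would pick up links from every list on a page rather than just the one for the current letter.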

Running this gives a database containing the names of the institutions that signed up to the National Scholarship Programme, along with links to the information they have about scholarships and bursaries available in that context.

The Scraperwiki API allows you to run queries on this database and get the results back as JSON, HTML, CSV or RSS: University Bursaries API. So for example, we can search for bursary pages on the websites of Liverpool colleges and universities:

Scraperwiki API
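
To get a feel for the sort of query involved, here's a minimal local sketch using Python's built-in sqlite3 module rather than the Scraperwiki API itself. I'm assuming a table called swdata (which, as far as I know, is the default table name Scraperwiki saves into) with the href and uni columns used by the scraper. The save and search helpers are hypothetical names, and the Liverpool row is made up purely for illustration.

```python
import sqlite3

# In-memory stand-in for the scraper's database; the table name swdata is an
# assumption, the href and uni columns match the ones the scraper saves.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (href TEXT PRIMARY KEY, uni TEXT)")


def save(conn, href, uni):
    """Insert a row, replacing any existing row that has the same href."""
    conn.execute(
        "INSERT OR REPLACE INTO swdata (href, uni) VALUES (?, ?)", (href, uni)
    )


def search(conn, term):
    """Return (uni, href) rows whose institution name contains the term."""
    cur = conn.execute(
        "SELECT uni, href FROM swdata WHERE uni LIKE ? ORDER BY uni",
        ("%" + term + "%",),
    )
    return cur.fetchall()


save(conn, "http://www.anglia.ac.uk/nsp", "Anglia Ruskin University")
# Made-up row, for illustration only
save(conn, "http://example.org/liverpool/bursaries", "Example University of Liverpool")
# Saving the same href again updates the row rather than adding a duplicate
save(conn, "http://www.anglia.ac.uk/nsp", "Anglia Ruskin University")

for uni, href in search(conn, "Liverpool"):
    print(uni, href)
```

Because href is the primary key here, rerunning the saves leaves one row per URL, which is roughly the behaviour I'd expect from saving with unique_keys=["href"] in the scraper.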

We can also generate a view over the data on Scraperwiki… (this script shows how to interrogate the Scraperwiki database from within a webpage).

Finally, if we take the URLs from the bursary pages and pop them into a Google custom search engine, we can now search over just those pages… UK HE Financial Support (National Scholarship Programme) Search Engine. (Note that this is a bit ropey at the moment.) If you own the CSE, it’s easy enough to grab embed codes that allow you to pop search and results controls for the CSE into your own webpage.

(On the to do list is to generate a view over the data that defines a Google Custom Search Engine Annotations file that can be used to describe the sites/pages searched over by the CSE.)
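
As a starting point for that to do item, here's a hedged sketch of how a list of scraped URLs might be turned into an annotations file. It assumes the Annotations / Annotation / Label layout I've seen in Google's CSE examples, with URL patterns written without the http:// prefix. The annotations_xml function name and the _cse_example label are hypothetical; a real label would come from your own CSE's control panel.

```python
import xml.etree.ElementTree as ET


def annotations_xml(urls, cse_label):
    """Build a CSE-style annotations document listing each URL as a pattern."""
    root = ET.Element("Annotations")
    for url in urls:
        # Trim stray whitespace and drop the scheme (assumed pattern style)
        pattern = url.strip().split("://", 1)[-1]
        annotation = ET.SubElement(root, "Annotation", about=pattern)
        # Each annotation is tagged with the label identifying the CSE
        ET.SubElement(annotation, "Label", name=cse_label)
    return ET.tostring(root, encoding="unicode")


urls = [
    "http://www.anglia.ac.uk/nsp",
    "http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx",
]
# "_cse_example" is a placeholder label, not a real CSE identifier
print(annotations_xml(urls, "_cse_example"))
```

In practice the urls list would be pulled from the href column of the scraper database rather than typed in by hand.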
