Screenscraping the OU Set Books Webpage – Sometimes it’s Worth Asking…

A couple of days ago, I thought I’d complement the OU’s course related Linked Data with some data relating to the set books students need to buy on some of our courses. (Some courses provide the books as part of the course materials, others require you to buy them yourself.)

OU set books

The books required for each course (if any) are listed in a separate HTML table, one table per course. Here’s what the HTML looked like:

OU set books - view src

If you inspect the HTML, you’ll see that the course code and the name of the course are contained in an element outside the table that contains the book details for the course. If you’ve ever managed a children’s party, where cards and presents are easily separated from each other, you’ll maybe get a sense of what this means when trying to screenscrape the booklist for each course… Screenscraping can be a bit like looking inside a present: it’s easy enough to grab hold of all the table elements as separate bundles (one per course), and then look inside each of them, but it can be a real pain picking up the tables as presents and the course headings as a set of separate envelopes, and then trying to keep track of which envelope goes with which present. Which is a bit like what would happen with the HTML above…
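To make the envelope problem concrete, here’s a minimal sketch of the sort of bookkeeping the original page layout would need: walk the page in document order, remember the most recent heading, and attach it to the next table that turns up. It uses the Python 3 standard library html.parser rather than the BeautifulSoup setup on Scraperwiki. The h2 heading element, the HeadingTablePairer class and the sample course codes are all assumptions for illustration; the real page may well use different markup.

```python
from html.parser import HTMLParser


class HeadingTablePairer(HTMLParser):
    """Remember the most recent heading and attach it to the next table."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.last_heading = ''
        self.table_count = 0
        self.pairs = []  # one (heading text, table index) tuple per table

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':  # assumed heading element, not taken from the real page
            self.in_heading = True
            self.last_heading = ''
        elif tag == 'table':
            # The "envelope" is whatever heading we saw most recently
            self.pairs.append((self.last_heading.strip(), self.table_count))
            self.table_count += 1

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.last_heading += data


# Made-up sample markup: two courses, each with a heading outside its table
sample = """
<h2>XX100 Example course one</h2>
<table><tr><td>Book one</td></tr></table>
<h2>XX200 Example course two</h2>
<table><tr><td>Book two</td></tr></table>
"""

pairer = HeadingTablePairer()
pairer.feed(sample)
print(pairer.pairs)
```

This only works if every table is preceded by its own heading; a table with a missing heading would silently inherit the previous course, which is exactly the sort of mix-up that makes this approach fragile.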

What would make life easier is for each table to carry with it some sort of information about the course the table is associated with. So I sent a tweet to someone I thought might be able to help, and the tweet had repercussions:

From: ******
Sent: 16 November 2010 20:10
To: *******
Subject: set books

Tony H asks

do you know who owns template of http://bit.ly/c6qht7 ? would be scraper friendly if table summary attribute had course code and title...

-----------------

From: *******
Date: 17 November 2010 10:35:05 GMT
To: ******
Subject: RE: set books

No problem, I’ve altered the template and the summary will include code and title from tomorrow……

:-)

So here’s what the page looks like now – you’ll see the summary attribute of each table contains the course code and description.

OU set books - new html
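With the course details travelling inside the table tag, each table can be handled on its own. As a rough sketch, again using the Python 3 standard library rather than BeautifulSoup, this is one way of pulling the summary attribute off each table and splitting it into a course code and a title. It assumes the summary always starts with the course code followed by whitespace; the split_summary and SummaryCollector names and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


def split_summary(summary):
    """Split 'CODE Course title' into (code, title) at the first run of whitespace."""
    parts = summary.strip().split(None, 1)
    code = parts[0] if parts else ''
    title = parts[1] if len(parts) > 1 else ''
    return code, title


class SummaryCollector(HTMLParser):
    """Collect the summary attribute of every table that has one."""

    def __init__(self):
        super().__init__()
        self.summaries = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            summary = dict(attrs).get('summary')
            if summary:
                self.summaries.append(summary)


# Made-up sample markup: one table with a summary, one without
sample = (
    '<table summary="XX100 Example course one"><tr><td>Book</td></tr></table>'
    '<table><tr><td>No summary here</td></tr></table>'
)

collector = SummaryCollector()
collector.feed(sample)
courses = [split_summary(s) for s in collector.summaries]
print(courses)
```

Splitting once on whitespace, rather than removing the code from the whole string, means a title that happens to mention the course code again is left intact.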

Which makes scraping the data much easier. Here’s my Scraperwiki script (OU Set book scraper):

###############################################################################
# Basic scraper
###############################################################################
import scraperwiki
from BeautifulSoup import BeautifulSoup

unknown=0
# retrieve a page
starting_url = 'http://www3.open.ac.uk/about/setbooks/'
html = scraperwiki.scrape(starting_url)
print html
soup = BeautifulSoup(html)

# The books for each course are listed in a separate table
# use BeautifulSoup to get all <table> tags
count=0
tables = soup.findAll('table') 
for table in tables:
    for attr,val in table.attrs:
        # The course code and course title are contained in the table summary attribute
        if attr=='summary':
            print val
            ccode=val.split(' ')[0]
            # Strip only the leading course code, in case the code also appears in the title
            ctitle=val.replace(ccode,'',1).strip()
    firstrow = True
    
    # Work through each row in the table - one row per book - ignoring the header row
    for row in table.findAll('tr'):
        blankLine=False
        for attr,val in row.attrs:
            if attr=='class' and val=='white':
                blankLine=True
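        if blankLine:
            # NB break abandons the rest of this table at the first blank ('white') row;
            # if blank rows can also appear between books, continue would skip just that row
            break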
        if not firstrow:
            cells=row.findAll('td')
            print cells
            author=cells[0].text.replace('&nbsp;','')
            author=author.replace('&amp;','and')
            title=cells[1].text
            isbn=cells[2].text
            publisher=cells[3].text
            rrp=cells[4].text.replace('£','')
            print ccode, ctitle,author,title,isbn,publisher,rrp
            # An empty ISBN cell may come through as an (escaped) non-breaking space entity
            if isbn in ('&nbsp;','&amp;nbsp;',''):
                key=ccode+'_unknown:'+str(unknown)
                unknown+=1
            else:
                key=ccode+'_'+isbn
            count +=1
            record = {'id':key, 'Course Code':ccode, 'Course title':ctitle,'Author':author,'Title':title,'ISBN':isbn,'Publisher':publisher,'RRP':rrp }
    
            # save records to the datastore
            scraperwiki.datastore.save(['id'], record)
        else:
            firstrow=False

print count

[Thanks to @ostephens for pointing out my code was broken and was only scraping the first line of each table… oops:-( Note to self: always check, and run just one more test… ]
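The fiddly bit in the script is making sure every record gets a unique id, even when a book has no ISBN. Here’s a small standalone sketch of that idea in Python 3. It isn’t the code running on Scraperwiki: the make_key_factory helper and the sample course codes and ISBN are invented for the example, and it assumes a missing ISBN shows up as an empty string or a non-breaking space.

```python
import itertools


def make_key_factory():
    """Return a function that builds a unique record key for each book."""
    counter = itertools.count()  # shared counter for books with no ISBN

    def make_key(course_code, isbn):
        # Treat non-breaking spaces (raw or as an entity) as "no ISBN"
        isbn = isbn.replace('\xa0', '').replace('&nbsp;', '').strip()
        if isbn:
            return course_code + '_' + isbn
        return course_code + '_unknown:' + str(next(counter))

    return make_key


make_key = make_key_factory()
key_with_isbn = make_key('XX100', '9780000000002')
key_missing_1 = make_key('XX100', '&nbsp;')
key_missing_2 = make_key('XX200', '')
print(key_with_isbn, key_missing_1, key_missing_2)
```

Because the counter only ever goes up, two books without an ISBN never end up sharing a key, even if they are on the same course.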

And here’s the result (data as CSV):

OU set books - scraped

And the moral? Sometimes it’s worth asking, just on the off chance, whether a page owner can make a pragmatic little change to a page that makes all the difference between a scrape being easy to achieve and rather more involved…

PS it’d be nice to see this added to the course linked data on data.open.ac.uk?

PPS Maybe I should have asked the Lucero team about the Linked Data… @mdaquin tweeted: http://bit.ly/aCXDCx last part of second URI is ISBN of book (will add more info, and more “course material” soon) :-)

OU Linked Data - course books

There are more books listed here than I scraped from the set book list though, so I wonder what http://data.open.ac.uk/saou/ontology#hasBook means? Are these all published books associated with a course, irrespective of whether students have to buy them themselves (as on the set book list) or whether they are supplied as part of the course materials? Or maybe this list includes more courses than the set book page for some reason? [UPDATE: the scraper was broken, and was only grabbing the first row of each table into the database… bah:-( Apols… The results are closer now: I scrape 356 compared to 367 reported from the LD query, but there are only 340-odd in the database, so maybe I still have a bug or two in the scraper:-( Ah, some records have no ISBN, and I was using the ISBN as part of the unique ID for each record, so records on the same course with no ISBN would have been overwriting each other… Fixed that, but I’m still not getting the counts to tally:-( ] In any case, I think the distinction between books students have to buy and books supplied as part of the course is an important one: for example, if I want to find out the cost of a course, it would be useful to be able to price in the cost of any books I have to buy myself. That said, being able to find all the books associated with courses is also handy.
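A couple of quick checks can help track down this sort of mismatch. The sketch below (Python 3, with made-up keys and ISBNs rather than the real scraped data) shows two of them: spotting record ids that occur more than once, which a datastore keyed on id would collapse into a single record, and listing the ISBNs that appear in one source but not the other. The duplicate_keys and compare_isbns function names are hypothetical.

```python
from collections import Counter


def duplicate_keys(keys):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


def compare_isbns(scraped, linked):
    """Return (only in the scrape, only in the linked data) as sorted lists."""
    scraped, linked = set(scraped), set(linked)
    return sorted(scraped - linked), sorted(linked - scraped)


# Made-up record ids: two books on XX100 have no ISBN and so share an id
keys = ['XX100_111', 'XX100_&nbsp;', 'XX100_&nbsp;', 'XX200_222']
dupes = duplicate_keys(keys)
print(len(keys), 'rows scraped,', len(set(keys)), 'distinct ids')
print(dupes)

# Made-up ISBN lists standing in for the scrape and the Linked Data query
only_scraped, only_linked = compare_isbns(['111', '222'], ['222', '333'])
print(only_scraped, only_linked)
```

If the number of distinct ids is lower than the number of rows scraped, some records are overwriting each other when they are saved.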

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

2 thoughts on “Screenscraping the OU Set Books Webpage – Sometimes it’s Worth Asking…”

  1. In the data at data.open.ac.uk it looks like we aren’t currently including material without ISBNs – I’ll need to double check with the rest of the team, but for example for AA100, one of the ‘set materials’ is a DVD of Bhaji on the Beach – which isn’t included in the list on our Linked Data page for the course (http://data.open.ac.uk/page/course/aa100).

    One of the next data sources we are going to tackle for data.open.ac.uk is the information the library catalogue has about course materials – this includes both ‘set texts’ – i.e. those that the student has to additionally purchase – and ‘course texts/materials’ – i.e. those that are delivered to the student as part of the course.

    I agree being able to differentiate ‘set’ from ‘course’ material would be useful – we need to have a look at this and see if we can model it.

    If I get a chance I’ll have a look at your current results and see if I can get to the bottom of the discrepancies you are still seeing.
