A couple of days ago, I thought I’d complement the OU’s course-related Linked Data with some data relating to the set books students need to buy on some of our courses. (Some courses provide the books as part of the course materials; others require you to buy them yourself.)
The books required for each course (if any) are listed in a separate HTML table, one table per course. Here’s what the HTML look(ed) like:
If you inspect the HTML, you’ll see that the course code and the name of the course are contained in an element outside the table that contains the book details for the course. If you’ve ever managed a children’s party, where cards and presents are easily separated from each other, you’ll maybe get a sense of what this means when trying to screenscrape the booklist for each course… Because screenscraping can be a bit like looking inside a present. That is, it’s easy enough to grab hold of all the table elements as separate bundles (one per course), and then look inside them separately, but it can be a real pain picking up those tables as presents along with a set of separate envelopes, and trying to make sure you keep track of which envelope goes with which present. Which is a bit like what would happen above…
What would make life easier is for each table to carry with it some sort of information about the course the table is associated with. So I sent a tweet to someone I thought might be able to help, and the tweet had repercussions:
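To make the envelope-and-present problem concrete, here is a minimal Python 3 sketch (standard library only) of the bookkeeping needed when the course heading sits outside the table. The original markup isn't reproduced here, so the `<h2>` heading tag, the sample HTML and the course codes A123 and B456 are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: each course heading is followed by its table of books.
SAMPLE = """
<h2>A123 Example course one</h2>
<table><tr><td>Author</td></tr><tr><td>Smith</td></tr></table>
<h2>B456 Example course two</h2>
<table><tr><td>Author</td></tr><tr><td>Jones</td></tr></table>
"""


class HeadingTablePairer(HTMLParser):
    """Pair each <table> with the text of the most recent heading before it."""

    def __init__(self, heading_tag="h2"):
        super().__init__()
        self.heading_tag = heading_tag  # assumption: course headings use <h2>
        self.in_heading = False
        self.last_heading = None        # the most recently seen "envelope"
        self.table_count = 0
        self.pairs = []                 # (heading text, table index)

    def handle_starttag(self, tag, attrs):
        if tag == self.heading_tag:
            self.in_heading = True
            self.last_heading = ""
        elif tag == "table":
            # Attach whichever heading was seen last to this table.
            self.pairs.append((self.last_heading, self.table_count))
            self.table_count += 1

    def handle_endtag(self, tag):
        if tag == self.heading_tag and self.in_heading:
            self.in_heading = False
            self.last_heading = self.last_heading.strip()

    def handle_data(self, data):
        if self.in_heading:
            self.last_heading += data


def pair_headings_with_tables(html):
    parser = HeadingTablePairer()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    for heading, index in pair_headings_with_tables(SAMPLE):
        print(index, heading)
```

This only holds together while every table is preceded by its own heading. If one heading is missing, the previous course is silently attached to the wrong table, which is exactly the fragility described above.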
From: ******
Sent: 16 November 2010 20:10
To: *******
Subject: set books
Tony H asks
do you know who owns template of http://bit.ly/c6qht7 ? would be scraper friendly if table summary attribute had course code and title...
-----------------
From: *******
Date: 17 November 2010 10:35:05 GMT
To: ******
Subject: RE: set books
No problem, I’ve altered the template and the summary will include code and title from tomorrow……
:-)
So here’s what the page looks like now – you’ll see the summary attribute of each table contains the course code and description.
Which makes scraping the data much easier. Here’s my Scraperwiki script (OU Set book scraper):
    ###############################################################################
    # Basic scraper
    ###############################################################################
    import scraperwiki
    from BeautifulSoup import BeautifulSoup

    unknown=0

    # retrieve a page
    starting_url = 'http://www3.open.ac.uk/about/setbooks/'
    html = scraperwiki.scrape(starting_url)
    print html
    soup = BeautifulSoup(html)

    # The books for each course are listed in a separate table
    # use BeautifulSoup to get all <table> tags
    count=0
    tables = soup.findAll('table')
    for table in tables:
        for attr,val in table.attrs:
            # The course code and course title are contained in the table summary attribute
            if attr=='summary':
                print val
                ccode=val.split(' ')[0]
                ctitle=val.replace(ccode,'').strip()
        firstrow = True
        # Work through each row in the table - one row per book - ignoring the header row
        for row in table.findAll('tr'):
            blankLine=False
            for attr,val in row.attrs:
                if attr=='class' and val=='white':
                    blankLine=True
            if blankLine: break
            if not firstrow:
                cells=row.findAll('td')
                print cells
                author=cells[0].text.replace(' ','')
                author=author.replace('&','and')
                title=cells[1].text
                isbn=cells[2].text
                publisher=cells[3].text
                rrp=cells[4].text.replace('£','')
                print ccode, ctitle,author,title,isbn,publisher,rrp
                if isbn=='&nbsp;':
                    key=ccode+'_unkown:'+str(unknown)
                    unknown+=1
                else:
                    key=ccode+'_'+isbn
                count +=1
                record = {'id':key, 'Course Code':ccode, 'Course title':ctitle,'Author':author,'Title':title,'ISBN':isbn,'Publisher':publisher,'RRP':rrp }
                # save records to the datastore
                scraperwiki.datastore.save(['id'], record)
            else:
                firstrow=False
    print count
[Thanks to @ostephens for pointing out that my code was broken and was only scraping the first line of each table… oops:-( Note to self: always check, and run just one more test… ]
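The script above is written for Python 2 and BeautifulSoup 3, as hosted on ScraperWiki. For anyone wanting to try the same idea elsewhere, the following is a rough Python 3 sketch using only the standard library; it is not the script as run. It assumes that each table's summary attribute holds the course code followed by the title, and that the first row of each table is a header. The class and function names, the sample HTML, the course code and the ISBNs are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: one table per course, summary = "CODE Title".
SAMPLE = """
<table summary="A123 Example course one">
<tr><td>Author</td><td>Title</td><td>ISBN</td></tr>
<tr><td>Smith</td><td>First book</td><td>0000000001</td></tr>
<tr><td>Jones</td><td>Second book</td><td>0000000002</td></tr>
</table>
"""


def split_summary(summary):
    """Split 'CODE Course title' into (code, title)."""
    code, _, title = summary.strip().partition(" ")
    return code, title.strip()


class SetBookParser(HTMLParser):
    """Collect (code, title, cell, cell, ...) for each non-header table row."""

    def __init__(self):
        super().__init__()
        self.course = None     # (code, title) for the current table, if any
        self.first_row = True  # the first row of each table is the header
        self.cells = None      # cell texts for the row being read
        self.in_td = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            summary = dict(attrs).get("summary")
            self.course = split_summary(summary) if summary else None
            self.first_row = True
        elif tag == "tr":
            self.cells = []
        elif tag == "td" and self.cells is not None:
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr":
            if self.first_row:
                self.first_row = False  # skip the header row
            elif self.course and self.cells:
                cells = tuple(c.strip() for c in self.cells)
                self.rows.append(self.course + cells)
            self.cells = None
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.cells:
            self.cells[-1] += data


def scrape_rows(html):
    parser = SetBookParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in scrape_rows(SAMPLE):
        print(row)
```

Because the course code and title travel with the table itself, there is no need to remember which heading was seen last. Note that `split_summary` cuts at the first space, which differs slightly from the original script's approach of removing the code from the summary string wherever it occurs.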
And here’s the result (data as CSV):
And the moral? Sometimes it’s worth asking, just on the off chance, whether a page owner can make a pragmatic little change to the page that makes all the difference between a scrape being easy to achieve and rather more involved…
PS it’d be nice to see this added to the course linked data on data.open.ac.uk?
PPS Maybe I should have asked the Lucero team about the Linked Data… @mdaquin tweeted: http://bit.ly/aCXDCx last part of second URI is ISBN of book (will add more info, and more “course material” soon) :-)
There are more books listed here than I scraped from the set book list though, so I wonder what http://data.open.ac.uk/saou/ontology#hasBook means? Are these all published books associated with a course, irrespective of whether a student has to buy them themselves (as on the set book list) or whether they are supplied as part of the course materials? Or maybe this list includes more courses than the set book page for some reason?
[UPDATE: the scraper was broken, and was only grabbing the first row of each table into the database… bah:-( Apols… The results are closer now: I scrape 356 compared to the 367 reported from the LD query, but there are only 340-odd in the database, so maybe I still have a bug or two in the scraper:-( Ah – some records have no ISBN, and I was using the ISBN as part of the unique ID for each record… Fixed that, but the counts still don’t tally:-( ]
In any case, I think the distinction between books a student has to buy and books that are supplied with the course is an important one: for example, if I want to find out the cost of a course, then it would be useful to be able to price in the cost of any books I have to buy myself? That said, being able to find all books associated with courses is also handy?
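The missing-ISBN problem from the update can be sketched as follows. This is an illustration in Python 3 rather than the fix actually applied: `make_key_factory` and `tally` are hypothetical helper names, and the course codes and ISBNs are made up. The idea is to fall back to a running counter whenever the ISBN cell is empty, and then to compare the number of rows scraped with the number of distinct keys.

```python
from itertools import count


def make_key_factory():
    """Return a function that builds a record id from course code and ISBN."""
    unknown = count()  # running number for books that have no ISBN

    def make_key(course_code, isbn):
        # A non-breaking space (or a literal '&nbsp;') counts as "no ISBN".
        isbn = (isbn or "").strip()
        if isbn in ("", "&nbsp;"):
            return "{}_unknown:{}".format(course_code, next(unknown))
        return "{}_{}".format(course_code, isbn)

    return make_key


def tally(keys):
    """Return (rows scraped, distinct keys) for a sequence of record ids."""
    keys = list(keys)
    return len(keys), len(set(keys))


if __name__ == "__main__":
    make_key = make_key_factory()
    # Hypothetical rows: (course code, ISBN cell text)
    rows = [("A123", "0000000001"), ("A123", "&nbsp;"), ("A123", ""), ("B456", None)]
    keys = [make_key(code, isbn) for code, isbn in rows]
    print(keys)
    print(tally(keys))
```

If the datastore keeps one record per distinct `id`, as the drop from 356 scraped to 340-odd stored suggests, then any gap between the two numbers returned by `tally` points at rows overwriting one another, for example the same ISBN listed twice under one course.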
In the data at data.open.ac.uk it looks like we aren’t currently including material without ISBNs – I’ll need to double check with the rest of the team, but for example for AA100, one of the ‘set materials’ is a DVD of Bhaji on the Beach – which isn’t included in the list on our Linked Data page for the course (http://data.open.ac.uk/page/course/aa100).
One of the next data sources we are going to tackle for data.open.ac.uk is the information the library catalogue has about course materials – this includes both ‘set texts’ – i.e. those that the student has to additionally purchase – and ‘course texts/materials’ – i.e. those that are delivered to the student as part of the course.
I agree being able to differentiate ‘set’ from ‘course’ material would be useful – we need to have a look at this and see if we can model it.
If I get a chance I’ll have a look at your current results and see if I can get to the bottom of the discrepancies you are still seeing.