A Tracking Inspired Hack That Breaks the Web…? Naughty OpenLearn…

So it’s not just me who wonders Why Open Data Sucks Right Now and comes to this conclusion:

What will make open data better? What will make it usable and useful? What will push people to care about the open data they produce?
Simply that. If we start using the data, we can email, write, text and punch people until their data is in a standard, useful and usable format. How do I know if my data is correct until someone tries to put pins on a map for ever meal I’ve eaten? I simply don’t. And this is the rock/hard place that open data lies in at the moment:

It’s all so moon-hoveringly bad because no-one uses it.
No-one uses it because what is out there is moon-hoveringly bad

Or broken…

Earlier today, I posted some, erm, observations about OpenLearn XML, and in doing so appear to have logged, in a roundabout and indirect way, a couple of bugs. (I did think about raising the issues internally within the OU, but as the above quote suggests, the iteration has to start somewhere, and I figured it may be instructive to start it in the open…)

So here’s another, erm, issue I found relating to accessing OpenLearn xml content. It’s actually something I have a vague memory of colliding with before, but I don’t seem to have blogged it, and since moving to an institutional mail server that limits mailbox size, I can’t check back with my old email messages to recap on the conversation around the matter from last time…

The issue started with this error message that was raised when I tried to parse an OU XML document via Scraperwiki:

Line 85 - tree = etree.parse(cr)
lxml.etree.pyx:2957 -- lxml.etree.parse (src/lxml/lxml.etree.c:56230)(())
parser.pxi:1533 -- lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)(())
parser.pxi:1562 -- lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)(())
parser.pxi:1462 -- lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)(())
parser.pxi:1002 -- lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)(())
parser.pxi:569 -- lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)(())
parser.pxi:650 -- lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)(())
parser.pxi:590 -- lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74722)(())
XMLSyntaxError: Entity 'nbsp' not defined, line 155, column 34

nbsp is an HTML entity that shouldn’t appear untreated in an arbitrary XML doc. So I assumed this was a fault of the OU XML doc, and huffed and puffed and sighed for a bit and tried with another XML doc; and got the same result. A trawl around the web looking for whether there were workarounds for the lxml Python library I was using to parse the “XML” turned up nothing… Then I thought I should check…

A command line call to an OU XML URL using curl:

curl http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397313&content=1

returned the following:

<meta http-equiv="refresh" content="0; url=http://openlearn.open.ac.uk/login/index.php?loginguest=true" /><script type="text/javascript">

Ah… vague memories… there’s some sort of handshake goes on when you first try to access OpenLearn content (maybe something to do with tracking?), before the actual resource that was called is returned to the calling party. Browsers handle this handshake automatically, but the etree.parse(URL) function I was calling to load in and parse the XML document doesn’t. It just sees the HTML response and chokes, raising the error that first alerted me to the problem.

[Seems the redirect is a craptastic Moodle fudge /via @ostephens]

So now it’s two hours later than it was when I started a script, full of joy and light and happy intentions, that would generate an aggregated glossary of glossary items from across OpenLearn and allow users to look up terms, link to associated units, and so on; (the OU-XML document schema that OpenLearn uses has markup for explicitly describing glossary items). Then I got the error message, ran round in circles for a bit, got ranty and angry and developed a really foul mood, probably tweeted some things that I may regret, one day, figured out what the issue was, but not how to solve it, thus driving my mood fouler and darker… (If anyone has a workaround that lets me get an XML file back directly from OpenLearn (or hides the workaround handshake in a Python script I can simply cut and paste), please enlighten me in the comments.)

I also found at least one OpenLearn unit that has glossary items, but just dumps then in paragraph tags and doesn’t use the glossary markup. Sigh…;-)

So… how was your day?! I’ve given up on mine…


  1. pallih

    Use mechanize to fetch the url:

    import mechanize
    response = mechanize.urlopen(“http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397313&content=1”)
    print response.read()

  2. jenny gray

    Love that quote – it expresses exactly the problem of why we never have time to put effort into these areas of OpenLearn.

    I had my twitter switched off yesterday because I too was having a bad day. I think you’ve worked everything out already but…

    We have talked about the handshake before. Its that Moodle relies on cookies to track sessions and provide access to content. Browsers and many feed readers cope with it seamlessly but if you’re writing your own you will have to cater for it.

  3. Pingback: Deconstructing OpenLearn Units – Glossary Items, Learning Outcomes and Image Search « OUseful.Info, the blog…