OpenLearn WordPress Plugins

Just before the summer break, I managed to persuade Patrick McAndrew to use some of his OpenLearn cash to get a WordPress-MU plugin built that would allow anyone to republish OpenLearn materials across a set of WordPress Multi-User blogs. A second WordPress plugin was commissioned that would allow any learners happening by the blogs to subscribe to those courses using “Daily feeds” that deliver course material to them on a daily basis.

The plugins were coded by Greg Gaughan at Isotoma, and tested by Jim and D’Arcy, among others… (I haven’t acted on your feedback yet – sorry, guys… :-( ) For all manner of reasons, I didn’t post the plugins (partly because I wanted to do another pass on usability/pick up on feedback, but mainly because I wanted to set up a demo site first)… but I still haven’t done that… so here’s a link to the plugins anyway in case anyone fancies having a play over the next few weeks: OpenLearn WordPress Plugins.

I’ll keep coming back to this post – and the download page – to add in documentation and some of the thoughts and discussions we had about how to evolve the WPMU plugin workflow/UI etc, as well as the daily feeds widget functionality.

In the meantime, here’s the minimal info I gave the original testers:

The story is:
– one ‘openlearn republisher’ plugin, that will take the URL of an RSS feed describing OpenLearn courses (e.g. on the Modern Languages page, the RSS: Modern Languages feed), and suck those courses into WPMU, one WPMU blog per course, via the full content RSS feed for each course.

– one “daily feeds” widget; this can be added to any WP blog and should provide a ‘daily’ feed of the content from that blog, sending e.g. one item per day to the subscriber from the day they subscribe. The idea here is that if a WP blog is used as a content publishing system for ‘static’, unchanging content (e.g. a course, or a book, where each chapter is a post, or a fixed-length podcast series), users can still get it delivered in a paced/one-per-day fashion (the pacing idea is sketched below). This widget should work okay…
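For what it’s worth, the pacing logic at the heart of the daily feeds idea is easy enough to sketch outside WordPress. The plugin itself is PHP, so treat the following as a toy Python illustration of the idea only – the function and field names are made up for the purposes of the example:

from datetime import date

def items_released(posts, subscribed_on, today=None, per_day=1):
    """Return the slice of (oldest-first) posts a subscriber should see so far.

    posts         - list of post items, oldest first
    subscribed_on - the date the user subscribed
    per_day       - how many items to drip out per day
    """
    today = today or date.today()
    days_elapsed = (today - subscribed_on).days
    # Day 0 (the day you subscribe) gets the first item(s)
    count = min(len(posts), (days_elapsed + 1) * per_day)
    return posts[:count]

# e.g. someone who subscribed on 1 June sees the first three items on 3 June
print items_released(['unit 1', 'unit 2', 'unit 3', 'unit 4'],
                     date(2009, 6, 1), today=date(2009, 6, 3))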

Here’s another link to the page where you can find the downloads: OpenLearn WordPress Plugins. Enjoy – all comments welcome. Please post a link back here if you set up any blogs using either of these two plugins.

OpenLearn Website Refresh, and the Re-emergence of SocialLearn…

It seems like today has been a busy day for a couple of the OU’s web teams…

First up, and with a beta launch today, the new OpenLearn site makes an appearance, including integration of content from the Open2.net site. As I understand it, the new OpenLearn website amounts to something akin to the “public service educator” presence of the OU (complemented by OU Platform, the OU’s (open to all) social community site, and presumably SocialLearn, about which, more later…)

OpenLearn Relaunched

openlearn - www.open.ac.uk/openlearn

As well as providing the access point to the OU’s openly licensed (and free to use) educational material that was hosted on the original OpenLearn LearningSpace site, and content that is published to iTunesU and Youtube, OpenLearn (http://www.open.ac.uk/openlearn) will also support the OU’s “broadcast” strategy. This will include support for OU co-produced programming with the BBC, taking over this role from open2.net (apparently: “Open2.net will stay live for a while so we can tell our existing users about the changes and manage any current broadcast related activity on the site. We will then close the open2.net and anyone following links to open2.net will be redirected to the new site.”), as well as providing opportunities for publishing materials in order to support major news events, perhaps along the lines of The COP15 University Expert Press Room; (I’m not sure if OpenLearn will also act as a channel for teaching and research related news, as well, cf. Social Media Releases and the University Press Office?).

Course materials are organised by topic, as well as resource type (in a way that reminds me of the OpenLearn content promotion that (used to? or still does?) appear on Sky’s Skylearning website):

OpenLearn beta

We can also search by media type:

OpenLearn - - media resources

(One thing that might be handy would be the ability to subscribe to a podcast feed from a search on a particular topic area?)

Note that the search function may be a little ropey today, as the site is still being indexed…

The What’s On area of the site seems to have links to recent OU broadcast content, though I’m not sure whether it will also start to promote academic presentations and webcast events of general interest e.g. from the OU’s Berrill Stadium?

OpenLearn - what's on

I didn’t spot an RSS feed though…:-(

(Hmmm, which reminds me – I wonder if my Recent OU Programmes on the BBC, via iPlayer hack, or the mobile/iPhone or Boxee versions still work?!;-)

All the OpenLearn content appears commentable, although I think an OU registration/login is required? I seem to remember logins being provided to all comers for the Platform site, so maybe anyone can just register?

In fact, registration opportunities provide a good link to the announcement of a private beta for SocialLearn that opened up today…

(Re)Introducing SocialLearn

Long time readers of this blog may remember an OU project called “SocialLearn” that was initiated to explore the opportunities for a web scale social learning platform to straddle formal and informal learning. After various fits, starts, and consumption of budget, SocialLearn is back at http://sociallearn.org/ in a new widgetised form, appearing to offer (in the early stages at least) a Netvibes-like dashboard that can host custom created widgets or (I think?) iGoogle gadgets that can be used to support your learning… (whatever that means!;-)

A couple of nice features that struck me: firstly, as a social platform, you can use credentials from the most popular social networking services to log in/create your account, and you can also import personal details from those networks. (I’m not sure how this works in practice – I don’t have an invite yet…) Secondly, a SocialLearn toolbar can be raised on any page from a bookmarklet (rather than a browser extension?), and then used to display individual widgets from different widget sets as overlays on the current page (I assume widget sets are like tabs on Netvibes, or Pageflakes?)

The best way to demo this is by a video… err.. Hmm… Being a web platform, site features are described using a video tour; and being a social platform, the video player has viral sharing/embedding features… errr…. probably… maybe… errr.. well if there is, I can’t spot them, at least, not in my browser…and being a WordPress hosted blog, I’m limited as to what I can embed anyway…

Ho hum…

This new, lighter weight view of SocialLearn harkens back to some of the original ideas that were mooted at the early stages of the first iteration of SocialLearn, but since then we’ve had the Facebook effect and a shift in terms of attention to that platform and the Facebook way of interacting, as well as the explosion in availability of smartphones and the app economy. Maybe the time is now right for portable toolbars that carry your applications with you to separate websites? Will the widget base of SocialLearn allow widgets to act as standalone apps on a mobile device, or work nicely together in a combined app? Who knows…? It’ll be interesting to see…

(It’s also interesting to wonder whether the SocialLearn gadget approach is being developed with an eye on the Google Apps for Edu, which a little bird told me will start to roll out to OU students over the summer… (can anyone confirm that? And more specifically confirm what will be rolling out to whom?))

PS as far as user behaviour and UI aesthetics go, I wonder if the Mac Exposé-inspired Firefox Tab Candy means we’ll start to see more of this style of interaction too?

Educative Media?

Another interesting looking job ad from the OU, this time for a Web Assistant Producer with Open Learn (Explore) in the OBU (Open Broadcasting Unit).

Here’s how it reads:

Earlier this year the OU launched an updated public facing, topical news and media driven site. The site bridges the gap between BBC TV viewing and OU services and functions as the new ‘front door’ to Open Learn and all of the Open University’s open, public content. We are looking for a Web Assistant Producer with web production/editing skills.

You will work closely with a Producer, 2 Web Assistant Producers, the Head of Online Commissioning and many others in the Open University, as well as the BBC.

You need to demonstrate a real interest in finding and building links between popular media/news stories, OU curriculum content, research and more. You must have experience of producing online educational material including: Researching online content, writing articles; sourcing images or other assets and/or placing and managing content text, FLASH and video/audio content within a Content Management System.

(I have to say, I’m quite tempted by the idea of this role…)

One of the things I wonder about is the extent to which “news” editorial guidelines will apply? When the OU ran the Open2.net website (now replaced by the revamped OpenLearn), content was nominally managed under BBC editorial guidelines, though I have to say I never read them… Nor did I realise how comprehensive they appear to be: BBC Editorial Guidelines. (Does the OU have an equivalent for teaching materials, I wonder?!)

As a publisher of informal, academic educational content, to what extent might editorial guidelines originating from a news and public service broadcaster be appropriate, and in what ways, if any, might they be inappropriate? (I think I need to try out a mapping from the BBC guidelines into an educational/educative context, if one hasn’t been done already…?)

Anyway, for a long time I’ve thought that we could be getting more mileage out of news stories by providing deeper analysis and wider contextualisation/explanation than the news media can offer. (In this respect, I just spotted something – now a couple of days old: oops! – in my mailbox along exactly these lines. I’m working towards inbox zero and a shift to a new email client in the new year, so fingers crossed visiting my email inbox won’t be so offputting in future!) So it’s great to see that the new OpenLearn appears to be developing along exactly those lines.

A complementary thing (at least in the secondary sense of OpenLearn as open courseware and open educational resources) is to find a way of accrediting folk who have participated in open online courses and who want to be accredited against that participation in some way … and it just so happens that’s something I’m working on at the moment and hoping to pitch within the OU in the new year…

PS in passing, as the HE funding debate and demos rage on, anyone else think the OU should be license fee funded as a public service educator?!;-)

Generating Mind Maps from OU/OpenLearn Structured Authoring XML Documents

One of the really useful things about publishing documents in a structured way is that we can treat the document as a database, or generate an outline view of it automatically.

Whilst looking through the OU Structured Authoring XML docs for things I could reliably extract from them in order to configure a course custom search engine (Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents), I put together a quick script to generate a course mind map based around the course structure.

It struck me that as structured document/XML views of OpenLearn material are available, I could do the same for OpenLearn docs. So here’s an example. If you visit the OpenLearn site, you should be able to find several modules derived from the old OU course T175. Going to the first page proper for each of the derived modules (URLs have the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398868&direct=1), it is possible to grab a copy of the source XML document for the unit by rewriting the URL to include the setting &content=1: for example, http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398868&content=1. UPDATE: the switch is now &content=scxml
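Just to make that rewrite explicit, here’s a trivial helper (my own quick sketch, nothing official) for turning an OpenLearn oucontent view URL into the corresponding XML URL:

def xml_url(view_url, switch='content=scxml'):
    """Rewrite an OpenLearn oucontent view URL to point at the source XML.

    The switch used to be content=1; it now appears to be content=scxml.
    """
    return view_url.replace('direct=1', switch)

print xml_url('http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398868&direct=1')
## http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398868&content=scxml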

OpenLearn source XML

Downloading the XML files for each of the T175 derived modules on OpenLearn into a single folder, I put together a quick script to mine the structure of each document and pull out the learning objectives for each unit, as well as the headings of each section and subsection. The resulting mindmap provides a macroscopic outline view over the course as a whole, as well as a document that could be made available to people following the unit as a scaffold for organising their own notes or annotations.

T175 on Openlearn mindmap

Download a copy of the T175 on OpenLearn Outline Freemind/.mm mindmap

If we could find a way of getting the OpenLearn page URLs for each section, we could add them in as links within the mindmap, thus allowing it to be used as a navigation surface. (See also MindMap Navigation for Online Courses in this regard.)
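(The hook for that is already there in the Freemind file format, as far as I can tell: node elements take a LINK attribute, which Freemind renders as a clickable hyperlink. So once the section URLs can be worked out, the script would just need something along these lines – the sectionUrl value below is a placeholder for that yet-to-be-found URL:)

from lxml import etree

#Sketch only: Freemind treats a LINK attribute on a node as a clickable link
sectionUrl = 'http://openlearn.open.ac.uk/...' #placeholder - the yet-to-be-found section page URL
node = etree.Element("node")
node.set("TEXT", "Some section heading")
node.set("LINK", sectionUrl)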

Here’s a copy of the Python script I ran over the folder to generate the Freemind mindmap definition file (filetype .mm) based on the section and subsection elements used to structure the document.

# DEPENDENCIES
## We're going to load files in from a course related directory
import os
## Quick hack approach - use lxml parser to parse SA XML files
from lxml import etree
# We may find it handy to generate timestamps...
import time


# CONFIGURATION

## The directory the course XML files are in (separate directory for each course for now) 
SA_XMLfiledir='data'
## We can get copies of the XML versions of Structured Authoring documents
## that are rendered in the VLE by adding &content=1 to the end of the URL
## [via Colin Chambers]
## eg http://learn.open.ac.uk/mod/oucontent/view.php?id=526433&content=1


# UTILITIES

#lxml flatten routine - grab text from across subelements
#via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):           
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)

#Quick and dirty handler for saving XML trees as files
def xmlFileSave(fn,xml):
	# Output
	txt = etree.tostring(xml, pretty_print=True)
	#print txt
	fout=open(fn,'wb+')
	#fout.write('<?xml version="1.0" encoding="UTF-8" ?>\n')
	fout.write(txt)
	fout.close()


#GENERATE A FREEMIND MINDMAP FROM A SINGLE T151 SA DOCUMENT
## The structure of the T151 course lends itself to a mindmap/tree style visualisation
## Essentially what we are doing here is recreating an outline view of the course that was originally used in the course design phase
def freemindRoot(page):
	tree = etree.parse('/'.join([SA_XMLfiledir,page]))
	courseRoot = tree.getroot()
	mm=etree.Element("map")
	mm.set("version", "0.9.0")
	root=etree.SubElement(mm,"node")
	root.set("CREATED",str(int(time.time())))
	root.set("STYLE","fork")
	#We probably need to bear in mind escaping the text strings?
	#courseRoot: The course title is not represented consistently in the T151 SA docs, so we need to flatten it
	title=flatten(courseRoot.find('CourseTitle'))
	root.set("TEXT",title)
	
	## Grab a listing of the SA files in the target directory
	listing = os.listdir(SA_XMLfiledir)

	#For each SA doc, we need to handle it separately
	for page in listing:
		print 'Page',page
		#Week 0 and Week 10 are special cases and don't follow the standard teaching week layout
		if page!='week0.xml' and page!='week10.xml':
			tree = etree.parse('/'.join([SA_XMLfiledir,page]))
			courseRoot = tree.getroot()
			parsePage(courseRoot,root)
	return mm

def learningOutcomes(courseRoot,root):
	mmlos=etree.SubElement(root,"node")
	mmlos.set("TEXT","Learning Outcomes")
	mmlos.set("FOLDED","true")
	
	los=courseRoot.findall('.//FrontMatter/LearningOutcomes/LearningOutcome')
	for lo in los:
		mmsession=etree.SubElement(mmlos,"node")
		mmsession.set("TEXT",flatten(lo))

def parsePage(courseRoot,root):
	unitTitle=courseRoot.find('.//Unit/UnitTitle')

	mmweek=etree.SubElement(root,"node")
	mmweek.set("TEXT",flatten(unitTitle))
	mmweek.set("FOLDED","true")

	learningOutcomes(courseRoot,mmweek)
	
	sessions=courseRoot.findall('.//Unit/Session')
	for session in sessions:
		title=flatten(session.find('.//Title'))
		mmsession=etree.SubElement(mmweek,"node")
		mmsession.set("TEXT",title)
		mmsession.set("FOLDED","true")
		subsessions=session.findall('.//Section')
		for subsession in subsessions:
			heading=subsession.find('.//Title')
			if heading !=None:
				title=flatten(heading)
				mmsubsession=etree.SubElement(mmsession,"node")
				mmsubsession.set("TEXT",title)
				mmsubsession.set("FOLDED","true")


mm=freemindRoot('t175_1.xml')
print etree.tostring(mm, pretty_print=True)
xmlFileSave('reports/test_t175_full.mm',mm)

If you try to run it over other OpenLearn materials, you may need to tweak the parser slightly. For example, some documents may make use of InnerSection elements, or Header rather than Title elements.
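By way of example, here’s the sort of tweak I have in mind – a more tolerant section handler that falls back from Title to Header elements and also picks up InnerSection headings. It’s an untested sketch (the function names are mine, and flatten() is the helper defined in the script above), so treat it accordingly:

## Untested sketch of a more forgiving section handler;
## flatten() is the helper defined in the script above
from lxml import etree

def sectionTitle(section):
    #Some OpenLearn docs use Header rather than Title for section headings
    heading = section.find('.//Title')
    if heading is None:
        heading = section.find('.//Header')
    if heading is None:
        return None
    return flatten(heading)

def addSections(parent, mmparent):
    #Grab Section and InnerSection headings (flattened to a single level here)
    for section in parent.findall('.//Section') + parent.findall('.//InnerSection'):
        title = sectionTitle(section)
        if title is None: continue
        mmnode = etree.SubElement(mmparent, "node")
        mmnode.set("TEXT", title)
        mmnode.set("FOLDED", "true")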

If you do try using the above script to generate mindmaps/outlines of other OpenLearn courses, please let me know how you got on in the comments below (eg whether you needed to tweak the script, or whether you found other structural elements that could be pulled into the mindmap.)

Do We Need an OpenLearn Content Liberation Front?

For me, one of the defining attributes of openness relates to accessibility of the machine kind: if I can’t write a script to handle the repetitive stuff for me, or can’t automate the embedding of image and/or video resources, then whatever the content is, it’s not open enough in a practical sense for me to do what I want with it.

So here’s an, erm, how can I put this politely, little niggle I have with OpenLearn XML. For those of you not keeping up: OpenLearn is the OU’s open course materials site; the materials published on the site as contentful course unit HTML pages are also available as structured XML documents. (When I say “structured”, I mean that certain elements of the materials are marked up in a semantically meaningful way; lots of elements aren’t, but we have to start somewhere ;-))

The context is this: following on from my presentation on Making More of Structured Course Materials at the eSTeEM conference last week, I left a chat with Jonathan Fine with the intention of seeing what sorts of secondary products I could easily generate from the OpenLearn content. I’m in the middle of building a scraper and structured content extractor at the moment, grabbing things like learning outcomes, glossary items, references and images, but I almost immediately hit a couple of problems, first with actually locating the OU XML docs, and secondly with locating the images…

Getting hold of a machine readable list of OpenLearn units is easy enough via the OpenLearn OPML feed (much easier to work with than the “all units” HTML index page). Units are organised by topic and are listed using the following format:

<outline type="rss" text="Unit content for Water use and the water cycle" htmlUrl="http://openlearn.open.ac.uk/course/view.php?name=S278_12" xmlUrl="http://openlearn.open.ac.uk/rss/file.php/stdfeed/4307/S278_12_rss.xml"/>

URLs of the form http://openlearn.open.ac.uk/course/view.php?name=S278_12 link to a “homepage” for each unit, which then links to the first page of actual content, content which is also available in XML form. The content page URLs have the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398820&direct=1, where the ID is one-one uniquely mapped to the course name identifier. The XML version of the page can then be accessed by changing direct=1 in the URL to content=1. Only, we don’t know the mapping from course unit name to page id. The easiest way I’ve found of doing that is to load in the RSS feed for each unit and grab the first link URL, which points to the first HTML content page view of the unit.

I’ve popped a scraper up on Scraperwiki to build the lookup for XML URLs for OpenLearn units – OpenLearn XML Processor:

import scraperwiki

from lxml import etree

#===
#via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):           
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)
#===

def getcontenturl(srcUrl):
    rss= etree.parse(srcUrl)
    rssroot=rss.getroot()
    try:
        contenturl= flatten(rssroot.find('./channel/item/link'))
    except:
        contenturl=''
    return contenturl

def getUnitLocations():
    #The OPML file lists all OpenLearn units by topic area
    srcUrl='http://openlearn.open.ac.uk/rss/file.php/stdfeed/1/full_opml.xml'
    tree = etree.parse(srcUrl)
    root = tree.getroot()
    topics=root.findall('.//body/outline')
    #Handle each topic area separately?
    for topic in topics:
        tt = topic.get('text')
        print tt
        for item in topic.findall('./outline'):
            it=item.get('text')
            if it.startswith('Unit content for'):
                it=it.replace('Unit content for','')
                url=item.get('htmlUrl')
                rssurl=item.get('xmlUrl')
                ccu=url.split('=')[1]
                cctmp=ccu.split('_')
                cc=cctmp[0]
                if len(cctmp)>1: ccpart=cctmp[1]
                else: ccpart=1
                slug=rssurl.replace('http://openlearn.open.ac.uk/rss/file.php/stdfeed/','')
                slug=slug.split('/')[0]
                contenturl=getcontenturl(rssurl)
                print tt,it,slug,ccu,cc,ccpart,url,contenturl
                scraperwiki.sqlite.save(unique_keys=['ccu'], table_name='unitsHome', data={'ccu':ccu, 'uname':it,'topic':tt,'slug':slug,'cc':cc,'ccpart':ccpart,'url':url,'rssurl':rssurl,'ccurl':contenturl})

getUnitLocations()

The next step in the plan (because I usually do have a plan; it’s hard to play effectively without some sort of direction in mind…) as far as images go was to grab the figure elements out of the XML documents and generate an image gallery that allows you to search through OpenLearn images by title/caption and/or description, and preview them. Getting the caption and description from the XML is easy enough, but getting the image URLs is not…

Here’s an example of a figure element from an OpenLearn XML document:

<Figure id="fig001">
<Image src="\\DCTM_FSS\content\Teaching and curriculum\Modules\Shared Resources\OpenLearn\S278_5\1.0\s278_5_f001hi.jpg" height="" webthumbnail="false" x_imagesrc="s278_5_f001hi.jpg" x_imagewidth="478" x_imageheight="522"/>
<Caption>Figure 1 The geothermal gradient beneath a continent, showing how temperature increases more rapidly with depth in the lithosphere than it does in the deep mantle.</Caption>
<Alternative>Figure 1</Alternative>
<Description>Figure 1</Description>
</Figure>

Looking at the HTML page for the corresponding unit on OpenLearn, we see it points to the image resource file at http://openlearn.open.ac.uk/file.php/4178/!via/oucontent/course/476/s278_5_f001hi.jpg:

So how can we generate that image URL from the resource link in the XML document? The filename is the same, but where do the presumably contextually relevant path elements – http://openlearn.open.ac.uk/file.php/4178/!via/oucontent/course/476/ – come from?

If we look at the OpenLearn OPML file that lists all current OpenLearn units, we can find the first identifier in the path to the RSS file:

<outline type="rss" text="Unit content for Energy resources: Geothermal energy" htmlUrl="http://openlearn.open.ac.uk/course/view.php?name=S278_5" xmlUrl="http://openlearn.open.ac.uk/rss/file.php/stdfeed/4178/S278_5_rss.xml"/>

But I can’t seem to find a crib for the second identifier – 476 – anywhere? Which means I can’t mechanise the creation of links to actual OpenLearn image assets from the XML source. Also note that there are no credits, acknowledgements or license conditions associated with the image contained within the figure description. Which also makes it hard to reuse the image in a legal, rights recognising sense.

[Doh – I can surely just look at the URL for an image in an OpenLearn unit RSS feed and pick the path up from there, can’t I? Only I can’t, because the image links in the RSS feeds are: a) relative links, without path information, and b) broken as a result…]

Reusing images on the basis of the OpenLearn XML “sourcecode” document is therefore: NOT OBVIOUSLY POSSIBLE.

What this suggests to me is that if you release “source code” documents, they may actually need some processing in terms of asset resolution – generating publicly resolvable locators for any assets that are encoded within the source code document as “private”/non-resolvable identifiers.

Where necessary, acknowledgements/credits are provided in the backmatter using elements of the form:

<Paragraph>Figure 7 Willes-Richards, J., et al. (1990) ; HDR Resource/Economics’ in Baria, R. (ed.) <i>Hot Dry Rock Geothermal Energy</i>, Copyright CSM Associates Limited</Paragraph>

Whilst OU-XML does support the ability to make a meaningful link to a resource within the XML document, using an element of the form:

<CrossRef idref="fig007">Figure 7</CrossRef>

(which presumably uses the Alternative label as the cross-referenced identifier, although not the figure element id (eg fig007) which is presumably unique within any particular XML document?), this identifier is not used to link the informally stated figure credit back to the uniquely identified figure element?

If the same image asset is used in several course units, there is presumably no way of telling from the element data (or even, necessarily, the credit data?) whether the images are in fact one and the same. That is, we can’t audit the OpenLearn materials in a mechanised way to see whether or not particular images are reused across two or more OpenLearn units.

Just in passing, it’s maybe also worth noting that in the above case at least, a description for the image is missing. In actual OU course materials, the description element is used to capture a textual description of the image that explicates the image in the context of the surrounding text. This represents a partial fulfilment of accessibility requirements surrounding images and is, even if not best, at least effective practice.

Where else might content need liberating within OpenLearn content? At the end of the course unit XML documents, in the “backmatter” element, there is often a list of references. References have the form:

<Reference>Sheldon, P. (2005) Earth’s Physical Resources: An Introduction (Book 1 of S278 Earth’s Physical Resources: Origin, Use and Environmental Impact), The Open University, Milton Keynes</Reference>

Hmmm… no structure there… so how easy would it be to reliably generate a link to an authoritative record for that item? (Note that other records occasionally use presentational markup such as italics (or emphasis) tags to presentationally style certain parts of some references (confusing presentation with semantics…).)
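(Just to illustrate how fragile any automated parse of those reference strings is likely to be, here’s the sort of crude regex hack you end up reaching for – a sketch only, and it will misfire on plenty of records:)

import re

ref = ("Sheldon, P. (2005) Earth's Physical Resources: An Introduction "
       "(Book 1 of S278 Earth's Physical Resources: Origin, Use and "
       "Environmental Impact), The Open University, Milton Keynes")

#Assume an "Author(s) (year) everything-else" pattern, which many - but by no
#means all - of the reference strings appear to follow
m = re.match(r'^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\s*(?P<rest>.+)$', ref)
if m:
    print m.group('authors') # Sheldon, P.
    print m.group('year')    # 2005
    print m.group('rest')    # title, series note and publisher all run together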

Finally, just a quick note on why I’m blogging this publicly rather than raising it, erm, quietly within the OU. My reasoning is similar to the reasoning we use when we tell students to not be afraid of asking questions, because it’s likely that others will also have the same question… I’m asking a question about the structure of an open educational resource, because I don’t quite understand it; by asking the question in public, it may be the case that others can use the same questioning strategy to review the way they present their materials, so when I find those, I don’t have to ask similar sorts of question again;-)

PS sort of related to this, see TechDis’ Terry McAndrew’s Accessible courses need an accessibility-friendly schema standard.

PPS see also another take on ways of trying to reduce cognitive waste – Joss Winn’s latest bid in progress, which will examine how the OAuth 2.0 specification can be integrated into a single sign on environment alongside Microsoft’s Unified Access Gateway. If that’s an issue or matter of interest in your institution, why not fork the bid and work it up yourself, or maybe even fork it and contribute elements back?;-) (Hmm, if what was essentially the same bid was submitted from multiple institutions, how would they cope during the marking process?!;-)

A Tracking Inspired Hack That Breaks the Web…? Naughty OpenLearn…

So it’s not just me who wonders Why Open Data Sucks Right Now and comes to this conclusion:

What will make open data better? What will make it usable and useful? What will push people to care about the open data they produce?
SOMEONE USING IT!
Simply that. If we start using the data, we can email, write, text and punch people until their data is in a standard, useful and usable format. How do I know if my data is correct until someone tries to put pins on a map for ever meal I’ve eaten? I simply don’t. And this is the rock/hard place that open data lies in at the moment:

It’s all so moon-hoveringly bad because no-one uses it.
No-one uses it because what is out there is moon-hoveringly bad

Or broken…

Earlier today, I posted some, erm, observations about OpenLearn XML, and in doing so appear to have logged, in a roundabout and indirect way, a couple of bugs. (I did think about raising the issues internally within the OU, but as the above quote suggests, the iteration has to start somewhere, and I figured it may be instructive to start it in the open…)

So here’s another, erm, issue I found relating to accessing OpenLearn xml content. It’s actually something I have a vague memory of colliding with before, but I don’t seem to have blogged it, and since moving to an institutional mail server that limits mailbox size, I can’t check back with my old email messages to recap on the conversation around the matter from last time…

The issue started with this error message that was raised when I tried to parse an OU XML document via Scraperwiki:

Line 85 - tree = etree.parse(cr)
lxml.etree.pyx:2957 -- lxml.etree.parse (src/lxml/lxml.etree.c:56230)(())
parser.pxi:1533 -- lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)(())
parser.pxi:1562 -- lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)(())
parser.pxi:1462 -- lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)(())
parser.pxi:1002 -- lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)(())
parser.pxi:569 -- lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)(())
parser.pxi:650 -- lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)(())
parser.pxi:590 -- lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74722)(())
XMLSyntaxError: Entity 'nbsp' not defined, line 155, column 34

nbsp is an HTML entity that shouldn’t appear untreated in an arbitrary XML doc. So I assumed this was a fault of the OU XML doc, and huffed and puffed and sighed for a bit and tried with another XML doc; and got the same result. A trawl around the web looking for whether there were workarounds for the lxml Python library I was using to parse the “XML” turned up nothing… Then I thought I should check…

A command line call to an OU XML URL using curl:

curl "http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397313&content=1"

returned the following:

<meta http-equiv="refresh" content="0; url=http://openlearn.open.ac.uk/login/index.php?loginguest=true" /><script type="text/javascript">
//<![CDATA[
location.replace('http://openlearn.open.ac.uk/login/index.php?loginguest=true');
//]]></script>

Ah… vague memories… there’s some sort of handshake goes on when you first try to access OpenLearn content (maybe something to do with tracking?), before the actual resource that was called is returned to the calling party. Browsers handle this handshake automatically, but the etree.parse(URL) function I was calling to load in and parse the XML document doesn’t. It just sees the HTML response and chokes, raising the error that first alerted me to the problem.

[Seems the redirect is a craptastic Moodle fudge /via @ostephens]

So now it’s two hours later than it was when I started a script, full of joy and light and happy intentions, that would generate an aggregated glossary of glossary items from across OpenLearn and allow users to look up terms, link to associated units, and so on; (the OU-XML document schema that OpenLearn uses has markup for explicitly describing glossary items). Then I got the error message, ran round in circles for a bit, got ranty and angry and developed a really foul mood, probably tweeted some things that I may regret, one day, figured out what the issue was, but not how to solve it, thus driving my mood fouler and darker… (If anyone has a workaround that lets me get an XML file back directly from OpenLearn (or hides the workaround handshake in a Python script I can simply cut and paste), please enlighten me in the comments.)
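For what it’s worth, the fix that eventually emerged (see the follow-up post below, and its comments) was to use the Python mechanize library, which maintains session cookies and can be told to follow the meta refresh handshake rather than choking on it. Something along these lines is the sort of thing I mean, though treat it as an untested sketch rather than a recipe:

#Untested sketch: mechanize maintains cookies and can follow the
#guest-login meta refresh redirect before returning the real resource
import mechanize
from lxml import etree

def getOpenLearnXML(url):
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_redirect(True)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    response = br.open(url)
    return etree.fromstring(response.read())

tree = getOpenLearnXML('http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397313&content=1')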

I also found at least one OpenLearn unit that has glossary items, but just dumps them in paragraph tags and doesn’t use the glossary markup. Sigh…;-)

So… how was your day?! I’ve given up on mine…

Deconstructing OpenLearn Units – Glossary Items, Learning Outcomes and Image Search

It turns out that part of the grief I encountered here in trying to access OpenLearn XML content was easily resolved (check the comments: mechanize did the trick…), though I’ve still to try to sort out a workaround for accessing OpenLearn images (a problem described here), but at least now I have another stepping stone: a database of some deconstructed OpenLearn content.

Using Scraperwiki to pull down and parse the OpenLearn XML files, I’ve created some database tables that contain the following elements scraped from across the OpenLearn units by this OpenLearn XML Processor:

  • glossary items;
  • learning objectives;
  • figure captions and descriptions.

You can download CSV data files corresponding to the tables, or the whole SQLite database. (Note that there is also an “errors” table that identifies units that threw an error when I tried to grab, or parse, the OpenLearn XML.)

Unfortunately, I haven’t had a chance yet to pop up a view over the data (I tried, briefly, but today was another of those days where something that’s probably very simple and obvious prevented me from getting the code I wanted to write working; if anyone has an example Scraperwiki view that chucks data into a sortable HTML table or a Simile Exhibit searchable table, please post a link below; or even better, add a view to the scraper:-)

So in the meantime, if you want to have a play, you need to make use of the Scraperwiki API wizard.

Here are some example queries:

  • a search for figure descriptions containing the word “communication” – select * from `figures` where desc like '%communication%': try it
  • a search over learning outcomes that include the phrase “how to” followed at some point by the word “data” – select * from `learningoutcomes` where lo like '%how to%data%': try it
  • a search of glossary items for glossary terms that contain the word “period” or a definition that contains the word “ancient” – select * from `glossary` where definition like '%ancient%' or term like '%period%': try it
  • find figures with empty captions – select * from `figures` where caption=='': try it

I’ll try to add some more examples when I get a chance, as well as knocking up a more friendly search interface. Unless you want to try…?!;-)
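If you’d rather poke at the data from a script than via the wizard, something along the following lines should do the job. (The base URL is my guess at the pattern the API explorer uses, and the scraper shortname is a placeholder – copy the exact values from the API wizard itself:)

import urllib, simplejson

#Assumed endpoint pattern - copy the exact base URL from the Scraperwiki API explorer
API = 'https://api.scraperwiki.com/api/1.0/datastore/sqlite'
SCRAPER = 'openlearn_xml_processor' #placeholder shortname - use the one the API wizard shows

def sqliteQuery(sql):
    url = API + '?' + urllib.urlencode({'format': 'jsondict', 'name': SCRAPER, 'query': sql})
    return simplejson.load(urllib.urlopen(url))

for row in sqliteQuery("select * from `glossary` where term like '%period%' limit 5"):
    print row['term'], '::', row['definition'][:80]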

OU/BBC Co-Pros Currently on iPlayer

Given the continued state of presentational disrepair of the OpenLearn What’s On feed, I assume I’m the only person who subscribes to it?

Despite its looks, though, I have to say I find it *really useful* for keeping up with OU/BBC co-pros.

The feed displays links to OpenLearn pages relating to programmes that are scheduled for broadcast in the next 24 hours or so (I think?). This includes programmes that are being repeated, as well as first broadcast. However, clicking through some of the links to the supporting programme pages on OpenLearn, I notice a couple of things:

Firstly, the post is timestamped around the time of the original broadcast. This approach is fine if you want to root a post in time, but it makes the page look out-of-date if I stumble onto it either from a What’s On feed link or from a link to the supporting page on the corresponding BBC /programmes page. I think canonical programme pages for individual programmes have listings of when the programme was broadcast, so it should also be possible to display this information?

Secondly, as a piece of static, “archived” content, there is not necessarily any way of knowing that the programme is currently available. I grabbed the above screenshot because it doesn’t even appear to provide a link to the BBC programme page for the series, let alone actively promote the fact that the programme itself, or at least other programmes from the same series, are currently: 1) upcoming for broadcast; 2) already, or about to be, available on iPlayer. Note that as well as full broadcasts, many programmes also have clips available on BBC iPlayer. Even if the full programmes aren’t embeddable within the OpenLearn programme pages (for rights reasons, presumably, rather than technical reasons?), might we be able to get the clips locally viewable? Or do we need to distinguish between BBC “official” clips, and the extra clips the OU sometimes gets for local embedding as part of the co-pro package?

If the OU is to make the most of repeat broadcasts of OU-BBC co-pros, then I think OpenLearn could do a couple of things in the short term, such as creating a carousel of images on the homepage that link through to “timeless” series or episode programme support pages. The programme support pages should also have a very clearly labelled, dynamically generated, “Now Available on iPlayer” link for programmes that are currently available, along with other available programmes from the same series. The next step would be to find some way of making more of persistent clips on iPlayer?

Anyway – enough of the griping. To provide some raw materials for anyone who would like to have a play around this idea, or maybe come up with a Twitter Bootstrap page that promotes OU/BBC co-pro programmes currently on iPlayer, here’s a (very) raw example: a simple HTML web page that grabs a list of OU/BBC co-pro series pages I’ve been on-and-off maintaining on delicious for some time now (if there are any omissions, please let me know;-), extracts the series IDs, pulls down the corresponding list of series episodes currently on iPlayer via a YQL JSON-P proxy, and then displays a simple list of currently available programmes:

Here’s the code:

<html><head>
<title></title>

<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js">
</script>

<script type="text/javascript">
//Routine to display programmes currently available on iPlayer given series ID
// The output is attached to a uniquely identified HTML item

var seriesID='b01dl8gl'
// The BBC programmes series ID

//The id of the HTML element you want to contain the displayed feed
var containerID="test";

//------------------------------------------------------

function cross_domain_JSON_call(seriesID){
 // BBC json does not support callbacks, so use YQL as JSON-P proxy
 
 var url = 'http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20json%20where%20url%3D%22http%3A%2F%2Fwww.bbc.co.uk%2Fprogrammes%2F' + seriesID + '%2Fepisodes%2Fplayer.json%22%20and%20itemPath%20%3D%20%22json.episodes%22&format=json&callback=?'
 
 //fetch the feed from the address specified in 'url'
// then call "myCallbackFunction" with the resulting feed items
 $.getJSON(
   url,
   function(data) { myCallbackFunction(data.query.results); }
 )
}

// A simple utility function to display the title of the feed items
function displayOutput(txt){
  $('#'+containerID).append('<div>'+txt+'</div>');
}

function myCallbackFunction(items){
  console.log(items.episodes)
  items=items.episodes
  // Run through each item in the feed and print out its title
  for (prog in items){
    displayOutput('<img src="http://static.bbc.co.uk/programmeimages/272x153/episode/' + items[prog].programme.pid+'.jpg"/>' + items[prog].programme.programme.title+': <a href="http://www.bbc.co.uk/programmes/' + items[prog].programme.pid+'">' + items[prog].programme.title+'</a> (' + items[prog].programme.short_synopsis + ', ' + items[prog].programme.media.availability + ')');
  }
}

function parseSeriesFeed(items){
  for (var i in items) {
    seriesID=items[i].u.split('/')[4]
    console.log(seriesID)
    if (seriesID !='')
      cross_domain_JSON_call(seriesID)
  }
}

function getSeriesList(){
  var seriesFeed = 'http://feeds.delicious.com/v2/json/psychemedia/oubbccopro?count=100&callback=?'
  $.getJSON(
   seriesFeed,
   function(data) { parseSeriesFeed(data); }
 )
}

// Tell JQuery to call the feed loader when the page is all loaded
//$(document).ready(cross_domain_JSON_call(seriesID));
$(document).ready(getSeriesList())
</script>

</head>

<body>
<div id="test"></div>
</body>

</html>

If you copy the (raw) code to a file and save it as an .html file, you should be able to preview it in your own browser.

I’ll try to make any updated versions of the code available on github: iplayerSeriesCurrProgTest.html

If you have a play with it, and maybe knock up a demo, please let me know via a comment;-)

PS seems I should have dug around the OpenLearn website a bit more – there is a What’s on this week page, linked to from the front page, that lists upcoming transmissions/broadcasts:

I’m guessing this is done as a Saturday-Friday weekly schedule, in line with TV listings magazines, but needless to say I have a few issues with this approach!;-)

For example, the focus is on linear schedules of upcoming broadcast content in the next 0-7 days, depending when the updated list is posted. But why not have a rolling “coming up over the next seven days” schedule, as well as a “catch-up” service linking to content currently on iPlayer from programmes that were broadcast maybe last Thursday, or even longer ago?

The broadcast schedule is still a handy thing for viewers who don’t have access to digital on-demand services, but it also provides a focus for “event telly” for folk who do typically watch on-demand content. I’m not sure any OU-BBC co-pro programmes have made a point of running an online, realtime social media engagement exercise around a scheduled broadcast (and I think second screen experiments have only been run as pilots?), but again, it’s an opportunity that doesn’t seem to be being reflected anywhere?

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});

    google.setOnLoadCallback(drawTable);

    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?
    //json_table.setColumns([0,1,2,3,4,5,6,7])

    var formatter = new google.visualization.PatternFormat('<a href="http://www.bbc.co.uk/programmes/{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply formatter and set the formatted value of the first column.

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'
      }
    });

  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);

    }

The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI)? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for some time whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

    ocrecURL='http://opencorporates.com/reconcile?query='+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
    try:
        recData=simplejson.load(urllib.urlopen(ocrecURL))
    except:
        recData={'result':[]}
    print ocrecURL,[recData]
    if len(recData['result'])>0:
        if recData['result'][0]['score']>=0.7:
            record['ocData']=recData['result'][0]
            record['ocID']=recData['result'][0]['uri']
            record['ocName']=recData['result'][0]['name']

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)
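(As a quick sketch of that last sanity check – and leaving aside how you’d actually pull a company homepage URL out of OpenCorporates, which I haven’t looked at yet – comparing the two domains is easy enough:)

from urlparse import urlparse

def domain(url):
    #Crude normalisation: lower case the hostname and drop any leading www.
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith('www.') else netloc

#e.g. compare the company link scraped from a Bottom Line page with a company homepage
print domain('http://www.example.com/about') == domain('http://example.com/') # True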

As @stuartbrown suggested in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn Learning Outcomes Search

Scraperwiki Powered OpenLearn Searches – Learning Outcomes and Glossary Items

A quick follow up to Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API demonstrating how to reuse that pattern (a little more tinkering is required to fully generalise it, but that’ll probably have to wait until after the Easter wifi-free family tour… I also need to do a demo of a pure HTML/JS version of the approach).

In particular, a search over OpenLearn learning outcomes:

and a search over OpenLearn glossary items:

Both are powered by tables from my OpenLearn XML Processor scraperwiki.