Archive for the ‘OU2.0’ Category
Visualising Networks in Gephi via a Scraperwiki Exported GEXF File
How do you visualise data scraped from the web using Scraperwiki as a network using a graph visualisation tool such as Gephi? One way is to import a two-dimensional data table (i.e. a CSV file) exported from Scraperwiki into Gephi using the Data Explorer, but at times this can be a little fiddly and may require you to mess around with column names to make sure they’re the names Gephi expects. Another way is to get the data into a graph based representation using an appropriate file format such as GEXF or GraphML that can be loaded directly (and unambiguously) into Gephi or other network analysis and visualisation tools.
A quick bit of backstory first…
A couple of related key features for me of a “data management system” (eg the joint post from Francis Irving and Rufus Pollock on From CMS to DMS: C is for Content, D is for Data) are the ability to put data into shapes that play nicely with predefined analysis and visualisation routines, and the ability to export data in a variety of formats or representations that allow that data to be readily imported into, or used by, other applications, tools, or software libraries. Which is to say, I’m into glue…
So here’s some glue – a recipe for generating a GEXF formatted file that can be loaded directly into Gephi and used to visualise networks like this one of how OpenLearn units are connected by course code and top level subject area:
The inspiration for this demo comes from a couple of things: firstly, noticing that networkx is one of the third party supported libraries on ScraperWiki (as of last night, I think the igraph library is also available; thanks @frabcus ;-); secondly, having broken ground for myself on how to get Scraperwiki views to emit data feeds rather than HTML pages (eg OpenLearn Glossary Items as a JSON feed).
As a rather contrived demo, let’s look at the data from this scrape of OpenLearn units, as visualised above:
The data is available from the openlearn-units scraper in the table swdata. The columns of interest are name, parentCourseCode, topic and unitcode. What I’m going to do is generate a graph file that represents which unitcodes are associated with which parentCourseCodes, and which topics are associated with each parentCourseCode. We can then visualise a network that shows parentCourseCodes by topic, along with the child (unitcode) course units generated from each Open University parent course (parentCourseCode).
From previous dabblings with the networkx library, I knew it’d be easy enough to generate a graph representation from the data in the Scraperwiki data table. Essentially, two steps are required: 1) create and label nodes, as required; 2) tie nodes together with edges. (If a node hasn’t been defined when you use it to create an edge, networkx will create it for you.)
I decided to create and label some of the nodes in advance: unit nodes would carry their name and unitcode; parent course nodes would just carry their parentCourseCode; and topic nodes would carry a newly created ID and the topic name itself. (The topic name is a string of characters and would make for a messy ID for the node!)
To keep Gephi happy, I’m going to explicitly add a label attribute to some of the nodes that will be used, by default, to label nodes in Gephi views of the network. (Here are some hints on generating graphs in networkx.)
Here’s how I built the graph:
import scraperwiki
import urllib
import networkx as nx

scraperwiki.sqlite.attach( 'openlearn-units' )
q = '* FROM "swdata"'
data = scraperwiki.sqlite.select(q)

G=nx.Graph()
topics=[]
for row in data:
    # Unit nodes carry a label, the unit name and the parent course code
    G.add_node(row['unitcode'],label=row['unitcode'],name=row['name'],parentCC=row['parentCourseCode'])
    topic=row['topic']
    if topic not in topics:
        topics.append(topic)
    # Use the position of the topic in the list to make a tidy node ID
    tID=topics.index(topic)
    topicID='topic_'+str(tID)
    G.add_node(topicID,label=topic,name=topic)
    # Parent course nodes are created implicitly by the edges
    G.add_edge(topicID,row['parentCourseCode'])
    G.add_edge(row['unitcode'],row['parentCourseCode'])
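A side note on the topic IDs: topics.index(topic) rescans the list for every row, which is fine for a few hundred units but not essential. Here is a minimal sketch of the same idea using a dictionary, written in plain Python 3 with no networkx. The rows are invented stand-ins for the swdata table, and the helper name topic_node_id is my own, not part of the scraper:

```python
# Hypothetical rows standing in for the swdata table; the values are invented.
rows = [
    {'unitcode': 'S278_5', 'parentCourseCode': 'S278', 'topic': 'Science'},
    {'unitcode': 'S278_12', 'parentCourseCode': 'S278', 'topic': 'Science'},
    {'unitcode': 'A200_1', 'parentCourseCode': 'A200', 'topic': 'Arts and History'},
]

topic_ids = {}  # topic name -> short node ID


def topic_node_id(topic):
    """Return a stable short ID for a topic, creating one on first sight."""
    if topic not in topic_ids:
        topic_ids[topic] = 'topic_' + str(len(topic_ids))
    return topic_ids[topic]


# Collect the same two kinds of edge that the networkx code adds.
edges = set()
for row in rows:
    edges.add((topic_node_id(row['topic']), row['parentCourseCode']))
    edges.add((row['unitcode'], row['parentCourseCode']))

for edge in sorted(edges):
    print(edge)
```

Because the edges are held in a set, the repeated topic-to-parent-course pairs collapse to a single edge, which is also what an undirected networkx Graph does.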
Having generated a representation of the data as a graph using networkx, we now need to export the data. networkx supports a variety of export formats, including GEXF. Looking at the documentation for the GEXF exporter, we see that it offers methods for exporting the GEXF representation to a file. But for scraperwiki, we want to just print out a representation of the file, not actually save the printed representation of the graph to a file. So how do we get hold of an XML representation of the GEXF formatted data so we can print it out? A peek into the source code for the GEXF exporter (other exporter file sources here) suggests that the functions we need can be found in the networkx.readwrite.gexf file: a constructor (GEXFWriter), and a method for loading in the graph (.add_graph()). An XML representation of the file can then be obtained and printed out using the ElementTree tostring function.
Here’s the code I hacked out as a result of that little investigation:
import networkx.readwrite.gexf as gf
writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(G)
scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")
from xml.etree.cElementTree import tostring
print tostring(writer.xml)
Note the use of scraperwiki.utils.httpresponseheader to set the MIME type of the view. If we don’t do this, Scraperwiki will by default publish an HTML page view, along with a Scraperwiki logo embedded in the page.
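For a feel of what the exporter is producing, here is a minimal sketch that builds a small GEXF document by hand with ElementTree (Python 3, standard library only). It is not the networkx writer, just an illustration of the file’s shape: the function name to_gexf and the sample nodes are assumptions of mine, and it targets the GEXF 1.2draft namespace rather than the 1.1draft version requested above.

```python
import xml.etree.ElementTree as ET


def to_gexf(nodes, edges):
    """Serialise nodes ({id: label}) and edges ([(source, target)]) as GEXF.

    Hand-rolled illustration only; networkx's GEXFWriter also handles node
    attributes, edge weights and much else that is ignored here.
    """
    root = ET.Element('gexf', {'xmlns': 'http://www.gexf.net/1.2draft',
                               'version': '1.2'})
    graph = ET.SubElement(root, 'graph', {'defaultedgetype': 'undirected'})
    nodes_el = ET.SubElement(graph, 'nodes')
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, 'node', {'id': node_id, 'label': label})
    edges_el = ET.SubElement(graph, 'edges')
    for i, (source, target) in enumerate(edges):
        ET.SubElement(edges_el, 'edge',
                      {'id': str(i), 'source': source, 'target': target})
    return ET.tostring(root, encoding='unicode')


# Invented sample data in the shape used by the OpenLearn graph.
sample_nodes = {'topic_0': 'Science', 'S278': 'S278', 'S278_5': 'S278_5'}
sample_edges = [('topic_0', 'S278'), ('S278_5', 'S278')]
gexf_xml = to_gexf(sample_nodes, sample_edges)
print(gexf_xml)
```

The label attribute on each node is the one Gephi picks up by default when labelling nodes.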
Here’s the full code for the view.
And here’s the GEXF view:
Save this file with a .gexf suffix and you can then open the file directly into Gephi.
Hopefully, what this post shows is how you can generate your own, potentially complex, output file formats within Scraperwiki that can then be imported directly into other tools.
PS see also Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API, which shows how to generate a Google Visualisation API JSON from Scraperwiki, allowing for the quick and easy generation of charts and tables using Google Visualisation API components.
OU/BBC Co-Pros Currently on iPlayer, via ScraperWiki
A quick update to yesterday’s post on OU/BBC Co-Pros Currently on iPlayer: I’ve popped the first draft of a daily scraper onto Scraperwiki that looks at my delicious bookmark list of OU/BBC series co-pros and tries to find corresponding programmes that are currently available on iPlayer: OU BBC Co-pros on iPlayer Scraperwiki
This is probably not the most efficient solution, but at least it provides some sort of API to at least some relevant iPlayer data.
I’ve also popped up a quick Scraperwiki view over the data OU BBC Co-pros on iPlayer (Scraperwiki HTML View); note that this data is unsorted (I need to think about how best to do that?)
[I've added a couple more columns since that screenshot was grabbed; please feel free to work on the scraper, or the view, to improve them further; if you grab a copy of the view to work on your own, please add a link back to it in the comments below, along with a brief description of what you're trying to achieve with your view...]
PS hmm, maybe I should pop the academics on In Our Time code onto Scraperwiki too?
OU/BBC Co-Pros Currently on iPlayer
Given the continued state of presentational disrepair of the OpenLearn What’s On feed, I assume I’m the only person who subscribes to it?
Despite its looks, though, I have to say I find it *really useful* for keeping up with OU/BBC co-pros.
The feed displays links to OpenLearn pages relating to programmes that are scheduled for broadcast in the next 24 hours or so (I think?). This includes programmes that are being repeated, as well as first broadcast. However, clicking through some of the links to the supporting programme pages on OpenLearn, I notice a couple of things:
Firstly, the post is timestamped around the time of the original broadcast. This approach is fine if you want to root a post in time, but it makes the page look out-of-date if I stumble onto it either from a What’s On feed link or from a link on the supporting page on the corresponding BBC /programme page. I think canonical programme pages for individual programmes have listings of when the programme was broadcast, so it should also be possible to display this information?
Secondly, as a piece of static, “archived” content, there is not necessarily any way of knowing that the programme is currently available. I grabbed the above screenshot because it doesn’t even appear to provide a link to the BBC programme page for the series, let alone actively promote the fact that the programme itself, or at least, other programmes from the same series, are currently: 1) upcoming for broadcast; 2) already, or about to be, available on iPlayer. Note that as well as full broadcasts, many programmes also have clips available on BBC iPlayer. Even if the full programmes aren’t embeddable within the OpenLearn programme pages (for rights reasons, presumably, rather than technical reasons?), might we be able to get the clips locally viewable? Or do we need to distinguish between BBC “official” clips, and the extra clips the OU sometimes gets for local embedding as part of the co-pro package?
If the OU is to make the most of repeat broadcasts of OU-BBC co-pro, then I think OpenLearn could do a couple of things in the short term, such as create a carousel of images on the homepage that link through to “timeless” series or episode supporting programmes. The programme support pages should also have a very clearly labelled, dynamically generated, “Now Available on iPlayer” link for programmes that are currently available, along with other available programmes from the same series. The next step would be to find some way of making more of persistent clips on iPlayer?
Anyway – enough of the griping. To provide some raw materials for anyone who would like to have a play around this idea, or maybe come up with a Twitter Bootstrap page that promotes OU/BBC co-pro programmes currently on iPlayer, here’s a (very) raw example: a simple HTML web page that grabs a list of OU/BBC co-pro series pages I’ve been on-and-off maintaining on delicious for some time now (if there are any omissions, please let me know;-), extracts the series IDs, pulls down the corresponding list of series episodes currently on iPlayer via a YQL JSON-P proxy, and then displays a simple list of currently available programmes:
Here’s the code:
<html><head>
<title></title>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js">
</script>
<script type="text/javascript">
//Routine to display programmes currently available on iPlayer given series ID
// The output is attached to a uniquely identified HTML item

// The BBC programmes series ID (only used by the single-series test call at the end)
var seriesID='b01dl8gl';

//The id of the HTML element you want to contain the displayed feed
var containerID="test";

//------------------------------------------------------

function cross_domain_JSON_call(seriesID){
  // BBC json does not support callbacks, so use YQL as JSON-P proxy
  var url = 'http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20json%20where%20url%3D%22http%3A%2F%2Fwww.bbc.co.uk%2Fprogrammes%2F' + seriesID + '%2Fepisodes%2Fplayer.json%22%20and%20itemPath%20%3D%20%22json.episodes%22&format=json&callback=?';

  //fetch the feed from the address specified in 'url'
  // then call "myCallbackFunction" with the resulting feed items
  $.getJSON(
    url,
    function(data) { myCallbackFunction(data.query.results); }
  );
}

// A simple utility function to display the title of the feed items
function displayOutput(txt){
  $('#'+containerID).append('<div>'+txt+'</div>');
}

function myCallbackFunction(items){
  // Guard against an empty result (e.g. nothing from the series is on iPlayer)
  if (!items || !items.episodes) return;
  items=items.episodes;
  // Run through each item in the feed and print out its title
  for (var prog in items){
    displayOutput('<img src="http://static.bbc.co.uk/programmeimages/272x153/episode/' + items[prog].programme.pid+'.jpg"/>' + items[prog].programme.programme.title+': <a href="http://www.bbc.co.uk/programmes/' + items[prog].programme.pid+'">' + items[prog].programme.title+'</a> (' + items[prog].programme.short_synopsis + ', ' + items[prog].programme.media.availability + ')');
  }
}

function parseSeriesFeed(items){
  for (var i in items) {
    // The series ID is the fifth '/'-separated part of the bookmarked URL
    var id=items[i].u.split('/')[4];
    if (id)
      cross_domain_JSON_call(id);
  }
}

function getSeriesList(){
  var seriesFeed = 'http://feeds.delicious.com/v2/json/psychemedia/oubbccopro?count=100&callback=?';
  $.getJSON(
    seriesFeed,
    function(data) { parseSeriesFeed(data); }
  );
}

// Tell JQuery to call the feed loader when the page is all loaded
// (pass the function itself; calling it here would run it straight away)
//$(document).ready(function(){ cross_domain_JSON_call(seriesID); });
$(document).ready(getSeriesList);
</script>
</head>
<body>
<div id="test"></div>
</body>
</html>
If you copy the (raw) code to a file and save it as an .html file, you should be able to preview it in your own browser.
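If you’d rather do the series ID step outside the browser (in a Scraperwiki scraper, say), here is a minimal Python 3 sketch of it. The function names are my own, and it assumes the bookmarked URLs have the form http://www.bbc.co.uk/programmes/SERIESID, which is what the split('/')[4] in the page above relies on. The episodes URL is the one the YQL query in the page wraps.

```python
from urllib.parse import urlparse


def series_id_from_url(url):
    """Pull the programme ID out of a bbc.co.uk/programmes/<pid> URL.

    Returns None when the URL does not have that shape.
    """
    parts = urlparse(url).path.strip('/').split('/')
    if len(parts) >= 2 and parts[0] == 'programmes' and parts[1]:
        return parts[1]
    return None


def episodes_feed_url(series_id):
    """Build the 'episodes currently on iPlayer' JSON URL for a series."""
    return ('http://www.bbc.co.uk/programmes/' + series_id
            + '/episodes/player.json')


bookmarks = [
    'http://www.bbc.co.uk/programmes/b01dl8gl',  # series ID used in the page
    'http://www.bbc.co.uk/programmes/',          # malformed: no ID, skipped
]
for url in bookmarks:
    sid = series_id_from_url(url)
    if sid:
        print(episodes_feed_url(sid))
```

Fetched server side, the JSON shouldn’t need the YQL JSON-P proxy, since that is only there to get round the browser’s cross-domain restrictions.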
I’ll try to make any updated versions of the code available on github: iplayerSeriesCurrProgTest.html
If you have a play with it, and maybe knock up a demo, please let me know via a comment;-)
PS seems I should have dug around the OpenLearn website a bit more – there is a What’s on this week page, linked to from the front page, that lists upcoming transmissions/broadcasts:
I’m guessing this is done as a Saturday-Friday weekly schedule, in line with TV listings magazines, but needless to say I have a few issues with this approach!;-)
For example, the focus is on linear schedules of upcoming broadcast content in the next 0-7 days, depending when the updated list is posted. But why not have a rolling “coming up over the next seven days” schedule, as well as a “catch-up” service linking to content currently on iPlayer from programmes that were broadcast maybe last Thursday, or even longer ago?
The broadcast schedule is still a handy thing for viewers who don’t have access to digital on-demand services, but it also provides a focus for “event telly” for folk who do typically watch on-demand content. I’m not sure any OU-BBC co-pro programmes have made a point of running an online, realtime social media engagement exercise around a scheduled broadcast (and I think second screen experiments have only been run as pilots?), but again, it’s an opportunity that doesn’t seem to be being reflected anywhere?
Guardian Telly on Google TV… Is the OU There, Yet?
A handful of posts across several Guardian blogs brought my attention to the Guardian’s new Google TV app (eg Guardian app for Google TV: an introduction (announcement), Developing the Google TV app in Beta (developer notes), The Guardian GoogleTV project, innovation & hacking (developer reflection)). Launched for the US, initially, “[i]t’s a new way to view [the Guardian's] latest videos, headlines and photo galleries on a TV.”
The OU has had a demo Google TV app for several months now, courtesy of ex-of-the-OU, now of MetaBroadcast, Liam Green Hughes – An HTML5 Leanback TV webapp that brings SPARQL to your living room:
[Try the demo here: OU Google TV App [ demo ]]
Liam’s app is interesting for a couple of reasons: first, it demonstrates how to access data – and then content – from the OU’s open Linked Data store (in a similar way, the Guardian app draws on the Guardian Platform API, I think?); secondly, it demonstrates how to use the Google TV templates to put a TV app together.
(It’s maybe also worth noting that the Google TV wasn’t Liam’s first crack at OU-TV – he also put together a Boxee app way back when: Rising to the Boxee developer challenge with an Open University app.)
As well as video and audio based course materials, seminar/lecture recordings, video shorts (such as The History of the English Language in Ten Animated Minutes series (I couldn’t quickly find a good OU link?)), the OU also co-produces broadcast video with both the BBC (now under the OU-BBC “sixth agreement”) and Channel 4 (eg The Secret Life of Buildings was an OU co-pro).
Many of the OU/BBC co-pro programmes have video clips available on BBC iPlayer via the corresponding BBC programmes sites (I generate a quite possibly incomplete list through this hack – Linked Data Without the SPARQL – OU/BBC Programmes on iPlayer (here’s the current clips feed – I really should redo this script in something like Scraperwiki…); as far as I know, there’s no easy way of getting any sort of list of series codes/programme codes for OU/BBC co-pros, let alone an authoritative and complete one). The OU also gets access to extra clips, which appear on programme related pages on one of the OpenLearn branded sites (OpenLearn), but again, there’s no easy way of navigating these clips, and, erm, no TV app to showcase them.
Admittedly, Google TV enabled TVs are still in the minority and internet TV is still to prove itself with large audiences. I’m not sure what the KPIs are around OU/BBC co-pros (or how much the OU gives the BBC each year in broadcast related activity?), but I can’t for the life of me understand why we aren’t engaging more actively in beta styled initiatives around second screen in particular, but also things like Google TV. (If you think of apps on internet TV platforms such as Google TV or Boxee as channels that you can programme linearly or as on-demand services, might it change folks’ attitude towards them?)
Note that I’m not thinking of apps for course delivery, necessarily… I’m thinking more of ways of making more of the broadcast spend, increasing its surface area/exposure, and (particularly in the case of second screen) enriching broadcast materials and providing additional academic/learning journey value. Second screen activity might also contribute to community development and brand enhancement through online social media engagement in an OU-owned and branded space parallel to the BBC space. Or it might not, of course…;-)
Of course, you might argue that this is all off-topic for the OU… but it isn’t if your focus is the OU’s broadcast activities, rather than formal education. If a fraction of the SocialLearn spend had gone on thinking about second screen applications, and maybe keeping Boxee/Google TV app development ticking over to see what insights it might bring about increasing engagement with broadcast materials, I also wonder if we might have started to think our way round to how second screen and leanback apps could also be used to support actual course delivery and drive innovation in that area?
PS two more things about the Guardian TV app announcement; firstly, it was brought to my attention through several different vectors (different blog subscriptions, Twitter); secondly, it introduced me to the Guardian beta minisite, which acts as an umbrella over/container for several of the Guardian blogs I follow… Now, where was the OU bloggers aggregated feed again? Planet OU wasn’t it? Another @liamgh initiative, I seem to remember…
PPS via a tweet from @barnstormed, I am reminded of something I keep meaning to blog about – OU Playlists on Youtube. For example, Digital Nepal or 60 Second Adventures in Thought, as well as The History of English in Ten Minutes. Given those playlists, one question might be: how might you build an app round them?!
PPPS via @paulbradshaw, it seems that the Guardian is increasingly into the content business, rather than just the news business: Guardian announces multimedia partnerships with prestigious arts institutions [doh! of course it is....!] In this case, “partnering with Glyndebourne, the Royal Opera House, The Young Vic, Art Angel and the Roundhouse the Guardian [to] offer all more arts multimedia content than ever before”. “Summits” such as the recent Changing Media Summit are also candidate content factory events (eg in the same way that TED, O’Reilly conference and music festival events generate content…)
Deconstructing OpenLearn Units – Glossary Items, Learning Outcomes and Image Search
It turns out that part of the grief I encountered here in trying to access OpenLearn XML content was easily resolved (check the comments: mechanize did the trick…), though I’ve still to try to sort out a workaround for accessing OpenLearn images (a problem described here). At least now I have another stepping stone: a database of some deconstructed OpenLearn content.
Using Scraperwiki to pull down and parse the OpenLearn XML files, I’ve created some database tables that contain the following elements scraped from across the OpenLearn units by this OpenLearn XML Processor:
- glossary items;
- learning objectives;
- figure captions and descriptions.
You can download CSV data files corresponding to the tables, or the whole SQLite database. (Note that there is also an “errors” table that identifies units that threw an error when I tried to grab, or parse, the OpenLearn XML.)
Unfortunately, I haven’t had a chance yet to pop up a view over the data (I tried, briefly, but today was another of those days where something that’s probably very simple and obvious prevented me from getting the code I wanted to write working; if anyone has an example Scraperwiki view that chucks data into a sortable HTML table or a Simile Exhibit searchable table, please post a link below; or even better, add a view to the scraper:-)
So in the meantime, if you want to have a play, you need to make use of the Scraperwiki API wizard.
Here are some example queries:
- a search for figure descriptions containing the word “communication” – select * from `figures` where desc like '%communication%': try it
- a search over learning outcomes that include the phrase how to followed at some point by the word data – select * from `learningoutcomes` where lo like '%how to%data%': try it
- a search of glossary items for glossary terms that contain the word “period” or a definition that contains the word “ancient” – select * from `glossary` where definition like '%ancient%' or term like '%period%': try it
- find figures with empty captions – select * from `figures` where caption=='': try it
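If you download the SQLite database, queries like these can also be run locally. Here’s a minimal sketch using Python’s built-in sqlite3 module against an in-memory stand-in for the glossary table; the rows are invented for illustration and the function name search_glossary is mine, but the term and definition column names are the ones used in the glossary query above.

```python
import sqlite3

# In-memory stand-in for the downloaded database; the rows are invented.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE glossary (term TEXT, definition TEXT)')
conn.executemany(
    'INSERT INTO glossary VALUES (?, ?)',
    [
        ('Bronze Age', 'An ancient era characterised by the use of bronze.'),
        ('period', 'The time taken for one complete cycle.'),
        ('glue', 'Code that joins tools together.'),
    ],
)


def search_glossary(definition_word, term_word):
    """Same WHERE clause as the glossary query, with bound parameters."""
    sql = ('SELECT term FROM glossary '
           'WHERE definition LIKE ? OR term LIKE ? ORDER BY term')
    params = ('%' + definition_word + '%', '%' + term_word + '%')
    return [row[0] for row in conn.execute(sql, params)]


print(search_glossary('ancient', 'period'))
```

Binding the search words as parameters, rather than pasting them into the SQL string, avoids quoting problems if a search term itself contains a quote mark.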
I’ll try to add some more examples when I get a chance, as well as knocking up a more friendly search interface. Unless you want to try…?!;-)
A Tracking Inspired Hack That Breaks the Web…? Naughty OpenLearn…
So it’s not just me who wonders Why Open Data Sucks Right Now and comes to this conclusion:
What will make open data better? What will make it usable and useful? What will push people to care about the open data they produce?
SOMEONE USING IT!
Simply that. If we start using the data, we can email, write, text and punch people until their data is in a standard, useful and usable format. How do I know if my data is correct until someone tries to put pins on a map for ever meal I’ve eaten? I simply don’t. And this is the rock/hard place that open data lies in at the moment:
It’s all so moon-hoveringly bad because no-one uses it.
No-one uses it because what is out there is moon-hoveringly bad
Or broken…
Earlier today, I posted some, erm, observations about OpenLearn XML, and in doing so appear to have logged, in a roundabout and indirect way, a couple of bugs. (I did think about raising the issues internally within the OU, but as the above quote suggests, the iteration has to start somewhere, and I figured it may be instructive to start it in the open…)
So here’s another, erm, issue I found relating to accessing OpenLearn xml content. It’s actually something I have a vague memory of colliding with before, but I don’t seem to have blogged it, and since moving to an institutional mail server that limits mailbox size, I can’t check back with my old email messages to recap on the conversation around the matter from last time…
The issue started with this error message that was raised when I tried to parse an OU XML document via Scraperwiki:
Line 85 - tree = etree.parse(cr)
lxml.etree.pyx:2957 -- lxml.etree.parse (src/lxml/lxml.etree.c:56230)(())
parser.pxi:1533 -- lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)(())
parser.pxi:1562 -- lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)(())
parser.pxi:1462 -- lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)(())
parser.pxi:1002 -- lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)(())
parser.pxi:569 -- lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)(())
parser.pxi:650 -- lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)(())
parser.pxi:590 -- lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74722)(())
XMLSyntaxError: Entity 'nbsp' not defined, line 155, column 34
nbsp is an HTML entity that shouldn’t appear untreated in an arbitrary XML doc. So I assumed this was a fault of the OU XML doc, and huffed and puffed and sighed for a bit and tried with another XML doc; and got the same result. A trawl around the web looking for whether there were workarounds for the lxml Python library I was using to parse the “XML” turned up nothing… Then I thought I should check…
A command line call to an OU XML URL using curl:
curl "http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397313&content=1"
returned the following:
<meta http-equiv="refresh" content="0; url=http://openlearn.open.ac.uk/login/index.php?loginguest=true" /><script type="text/javascript">
//<![CDATA[
location.replace('http://openlearn.open.ac.uk/login/index.php?loginguest=true');
//]]></script>
Ah… vague memories… there’s some sort of handshake goes on when you first try to access OpenLearn content (maybe something to do with tracking?), before the actual resource that was called is returned to the calling party. Browsers handle this handshake automatically, but the etree.parse(URL) function I was calling to load in and parse the XML document doesn’t. It just sees the HTML response and chokes, raising the error that first alerted me to the problem.
[Seems the redirect is a craptastic Moodle fudge /via @ostephens]
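One way of at least failing more helpfully is to check for the redirect stub before handing anything to the XML parser. The sketch below (Python 3, standard library) does that, and also includes an untested guess at a workaround: keep cookies, visit the guest login URL once, then ask for the document again. Whether that visit is enough to satisfy the handshake is an assumption on my part, and the names login_redirect_target and fetch_xml are made up for the example.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Matches the meta refresh stub shown above and captures the login URL.
LOGIN_REDIRECT = re.compile(
    r'<meta\s+http-equiv="refresh"[^>]*url=([^"]*login/index\.php[^"]*)"',
    re.IGNORECASE)


def login_redirect_target(body):
    """Return the guest-login URL if body is the redirect stub, else None."""
    match = LOGIN_REDIRECT.search(body)
    return match.group(1) if match else None


def fetch_xml(url):
    """Untested guess: keep cookies, follow the stub once, then retry."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    body = opener.open(url).read().decode('utf-8', 'replace')
    target = login_redirect_target(body)
    if target:
        opener.open(target).read()  # visit the guest login to pick up cookies
        body = opener.open(url).read().decode('utf-8', 'replace')
    return body


stub = ('<meta http-equiv="refresh" content="0; '
        'url=http://openlearn.open.ac.uk/login/index.php?loginguest=true" />')
print(login_redirect_target(stub))
```

The detection function only looks at the text it is given, so it can be tried out without touching the network; fetch_xml is defined but deliberately not called here.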
So now it’s two hours later than it was when I started a script, full of joy and light and happy intentions, that would generate an aggregated glossary of glossary items from across OpenLearn and allow users to look up terms, link to associated units, and so on; (the OU-XML document schema that OpenLearn uses has markup for explicitly describing glossary items). Then I got the error message, ran round in circles for a bit, got ranty and angry and developed a really foul mood, probably tweeted some things that I may regret, one day, figured out what the issue was, but not how to solve it, thus driving my mood fouler and darker… (If anyone has a workaround that lets me get an XML file back directly from OpenLearn (or hides the workaround handshake in a Python script I can simply cut and paste), please enlighten me in the comments.)
I also found at least one OpenLearn unit that has glossary items, but just dumps them in paragraph tags and doesn’t use the glossary markup. Sigh…;-)
So… how was your day?! I’ve given up on mine…
Do We Need an OpenLearn Content Liberation Front?
For me, one of the defining attributes of openness relates to accessibility of the machine kind: if I can’t write a script to handle the repetitive stuff for me, or can’t automate the embedding of image and/or video resources, then whatever the content is, it’s not open enough in a practical sense for me to do what I want with it.
So here’s an, erm, how can I put this politely, little niggle I have with OpenLearn XML. (For those of you not keeping up, one of the many OpenLearn sites is the OU’s open course materials site; the materials published on the site as course unit contentful HTML pages are also available as structured XML documents. (When I say “structured”, I mean that certain elements of the materials are marked up in a semantically meaningful way; lots of elements aren’t, but we have to start somewhere ;-))
The context is this: following on from my presentation on Making More of Structured Course Materials at the eSTeEM conference last week, I left a chat with Jonathan Fine with the intention of seeing what sorts of secondary product I could easily generate from the OpenLearn content. I’m in the middle of building a scraper and structured content extractor at the moment, grabbing things like learning outcomes, glossary items, references and images, but I almost immediately hit a couple of problems, first with actually locating the OU XML docs, and secondly locating the images…
Getting hold of a machine readable list of OpenLearn units is easy enough via the OpenLearn OPML feed (much easier to work with than the “all units” HTML index page). Units are organised by topic and are listed using the following format:
<outline type="rss" text="Unit content for Water use and the water cycle" htmlUrl="http://openlearn.open.ac.uk/course/view.php?name=S278_12" xmlUrl="http://openlearn.open.ac.uk/rss/file.php/stdfeed/4307/S278_12_rss.xml"/>
URLs of the form http://openlearn.open.ac.uk/course/view.php?name=S278_12 link to a “homepage” for each unit, which then links to the first page of actual content, content which is also available in XML form. The content page URLs have the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398820&direct=1, where the ID maps one-to-one to the course name identifier. The XML version of the page can then be accessed by changing direct=1 in the URL to content=1. Only, we don’t know the mapping from course unit name to page id. The easiest way I’ve found of getting it is to load in the RSS feed for each unit and grab the first link URL, which points to the first HTML content page view of the unit.
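The direct=1 to content=1 switch is simple enough to do with a string replace, but here is a slightly more careful Python 3 sketch that rewrites the query string properly. The function name xml_url_from_page_url is my own; the example URL is the one given above.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def xml_url_from_page_url(page_url):
    """Swap direct=1 for content=1 in an OpenLearn content page URL."""
    parts = urlparse(page_url)
    # Drop any existing direct/content flags, keep everything else (e.g. id).
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in ('direct', 'content')]
    query.append(('content', '1'))
    return urlunparse(parts._replace(query=urlencode(query)))


print(xml_url_from_page_url(
    'http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398820&direct=1'))
```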
I’ve popped a scraper up on Scraperwiki to build the lookup for XML URLs for OpenLearn units – OpenLearn XML Processor:
import scraperwiki
from lxml import etree

#===
#via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)
#===

def getcontenturl(srcUrl):
    #The first item link in a unit RSS feed points to the first content page
    rss= etree.parse(srcUrl)
    rssroot=rss.getroot()
    try:
        contenturl= flatten(rssroot.find('./channel/item/link'))
    except:
        contenturl=''
    return contenturl

def getUnitLocations():
    #The OPML file lists all OpenLearn units by topic area
    srcUrl='http://openlearn.open.ac.uk/rss/file.php/stdfeed/1/full_opml.xml'
    tree = etree.parse(srcUrl)
    root = tree.getroot()
    topics=root.findall('.//body/outline')
    #Handle each topic area separately?
    for topic in topics:
        tt = topic.get('text')
        print tt
        for item in topic.findall('./outline'):
            it=item.get('text')
            if it.startswith('Unit content for'):
                it=it.replace('Unit content for','')
                url=item.get('htmlUrl')
                rssurl=item.get('xmlUrl')
                #eg S278_12 -> course code S278, part 12
                ccu=url.split('=')[1]
                cctmp=ccu.split('_')
                cc=cctmp[0]
                if len(cctmp)>1: ccpart=cctmp[1]
                else: ccpart=1
                #The slug is the numeric identifier in the RSS feed path
                slug=rssurl.replace('http://openlearn.open.ac.uk/rss/file.php/stdfeed/','')
                slug=slug.split('/')[0]
                contenturl=getcontenturl(rssurl)
                print tt,it,slug,ccu,cc,ccpart,url,contenturl
                scraperwiki.sqlite.save(unique_keys=['ccu'], table_name='unitsHome', data={'ccu':ccu, 'uname':it,'topic':tt,'slug':slug,'cc':cc,'ccpart':ccpart,'url':url,'rssurl':rssurl,'ccurl':contenturl})

getUnitLocations()
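To make the field extraction a bit more concrete, here’s a small worked example that runs the same string handling over the single OPML outline element shown earlier. It’s a Python 3 sketch using the standard library ElementTree rather than lxml, with no network access or Scraperwiki datastore; the function name parse_outline is mine, and unlike the scraper it strips the leading space left behind in the unit name.

```python
import xml.etree.ElementTree as ET

# The example outline element from the OPML feed, as quoted in the post.
OUTLINE = (
    '<outline type="rss" '
    'text="Unit content for Water use and the water cycle" '
    'htmlUrl="http://openlearn.open.ac.uk/course/view.php?name=S278_12" '
    'xmlUrl="http://openlearn.open.ac.uk/rss/file.php/stdfeed/4307/S278_12_rss.xml"/>'
)

FEED_PREFIX = 'http://openlearn.open.ac.uk/rss/file.php/stdfeed/'


def parse_outline(item):
    """Split one OPML outline element into the fields the scraper saves."""
    ccu = item.get('htmlUrl').split('=')[1]          # e.g. S278_12
    cc, _, ccpart = ccu.partition('_')               # course code, unit part
    return {
        'uname': item.get('text').replace('Unit content for', '').strip(),
        'ccu': ccu,
        'cc': cc,
        'ccpart': ccpart or '1',
        'slug': item.get('xmlUrl').replace(FEED_PREFIX, '').split('/')[0],
    }


record = parse_outline(ET.fromstring(OUTLINE))
print(record)
```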
The next step in the plan (because I usually do have a plan; it’s hard to play effectively without some sort of direction in mind…) as far as images go was to grab the figure elements out of the XML documents and generate an image gallery that allows you to search through OpenLearn images by title/caption and/or description, and preview them. Getting the caption and description from the XML is easy enough, but getting the image URLs is not…
Here’s an example of a figure element from an OpenLearn XML document:
<Figure id="fig001">
  <Image src="\\DCTM_FSS\content\Teaching and curriculum\Modules\Shared Resources\OpenLearn\S278_5\1.0\s278_5_f001hi.jpg" height="" webthumbnail="false" x_imagesrc="s278_5_f001hi.jpg" x_imagewidth="478" x_imageheight="522"/>
  <Caption>Figure 1 The geothermal gradient beneath a continent, showing how temperature increases more rapidly with depth in the lithosphere than it does in the deep mantle.</Caption>
  <Alternative>Figure 1</Alternative>
  <Description>Figure 1</Description>
</Figure>
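Pulling the easy bits out of a figure element might look something like the following Python 3 sketch, which uses the standard library ElementTree on a copy of the example element (with the caption abbreviated to keep things short). The function name figure_record is my own invention; the element and attribute names are the ones in the example.

```python
import xml.etree.ElementTree as ET

# The example figure element, with the caption text abbreviated.
FIGURE = r'''<Figure id="fig001">
  <Image src="\\DCTM_FSS\content\Teaching and curriculum\Modules\Shared Resources\OpenLearn\S278_5\1.0\s278_5_f001hi.jpg" height="" webthumbnail="false" x_imagesrc="s278_5_f001hi.jpg" x_imagewidth="478" x_imageheight="522"/>
  <Caption>Figure 1 The geothermal gradient beneath a continent.</Caption>
  <Alternative>Figure 1</Alternative>
  <Description>Figure 1</Description>
</Figure>'''


def figure_record(fig):
    """Extract the id, image filename, caption and description of a figure."""
    image = fig.find('Image')
    return {
        'id': fig.get('id'),
        # x_imagesrc holds the bare filename; src is an internal file path
        'filename': image.get('x_imagesrc') if image is not None else None,
        'caption': (fig.findtext('Caption') or '').strip(),
        'description': (fig.findtext('Description') or '').strip(),
    }


record = figure_record(ET.fromstring(FIGURE))
print(record)
```

Note that this only recovers the filename, not a URL that can actually be resolved.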
Looking at the HTML page for the corresponding unit on OpenLearn, we see it points to the image resource file at http://openlearn.open.ac.uk/file.php/4178/!via/oucontent/course/476/s278_5_f001hi.jpg:

So how can we generate that image URL from the resource link in the XML document? The filename is the same, but how can we generate what are presumably contextually relevant path elements: http://openlearn.open.ac.uk/file.php/4178/!via/oucontent/course/476/
If we look at the OpenLearn OPML file that lists all current OpenLearn units, we can find the first identifier in the path to the RSS file:
<outline type="rss" text="Unit content for Energy resources: Geothermal energy" htmlUrl="http://openlearn.open.ac.uk/course/view.php?name=S278_5" xmlUrl="http://openlearn.open.ac.uk/rss/file.php/stdfeed/4178/S278_5_rss.xml"/>
But I can’t seem to find a crib for the second identifier – 476 – anywhere? Which means I can’t mechanise the creation of links to actual OpenLearn image assets from the XML source. Also note that there are no credits, acknowledgements or license conditions associated with the image contained within the figure description. Which also makes it hard to reuse the image in a legal, rights recognising sense.
[Doh - I can surely just look at URL for an image in an OpenLearn unit RSS feed and pick the path up from there, can't I? Only I can't, because the image links in the RSS feeds are: a) relative links, without path information, and b) broken as a result...]
Reusing images on the basis of the OpenLearn XML “sourcecode” document is therefore: NOT OBVIOUSLY POSSIBLE.
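To make the gap concrete, here is a minimal sketch (Python 3, standard library only, rather than whatever the scraper itself imports) that pulls the filename out of the figure element and the first path identifier out of the OPML entry quoted above. The function names are my own, and `course_id` is a hypothetical parameter that has to be supplied by hand, precisely because nothing in either document seems to provide it.

```python
import xml.etree.ElementTree as ET

# The two fragments quoted in the post (caption and description left out).
FIGURE_XML = r'''<Figure id="fig001"><Image src="\\DCTM_FSS\content\Teaching and curriculum\Modules\Shared Resources\OpenLearn\S278_5\1.0\s278_5_f001hi.jpg" height="" webthumbnail="false" x_imagesrc="s278_5_f001hi.jpg" x_imagewidth="478" x_imageheight="522"/></Figure>'''
OUTLINE_XML = '''<outline type="rss" text="Unit content for Energy resources: Geothermal energy" htmlUrl="http://openlearn.open.ac.uk/course/view.php?name=S278_5" xmlUrl="http://openlearn.open.ac.uk/rss/file.php/stdfeed/4178/S278_5_rss.xml"/>'''

URL_PATTERN = 'http://openlearn.open.ac.uk/file.php/{feed}/!via/oucontent/course/{course}/{name}'


def feed_id(outline):
    """First path identifier: the segment after 'stdfeed/' in the RSS URL."""
    return outline.get('xmlUrl').split('/stdfeed/')[1].split('/')[0]


def image_url(figure, outline, course_id):
    """Build an image URL; course_id is NOT derivable from either document."""
    filename = figure.find('Image').get('x_imagesrc')
    return URL_PATTERN.format(feed=feed_id(outline), course=course_id, name=filename)


figure = ET.fromstring(FIGURE_XML)
outline = ET.fromstring(OUTLINE_XML)

# 476 is copied by hand from the HTML page; that is the missing piece.
print(image_url(figure, outline, '476'))
```

Everything except the `'476'` argument comes from the source documents, which is exactly the problem.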
What this suggests to me is that if you release “source code” documents, they may actually need some processing in terms of asset resolution that generates publicly resolvable locators to assets if they are encoded within the source code document as “private” assets/non-resolvable identifiers.
Where necessary, acknowledgements/credits are provided in the backmatter using elements of the form:
<Paragraph>Figure 7 Willes-Richards, J., et al. (1990) ; HDR Resource/Economics’ in Baria, R. (ed.) <i>Hot Dry Rock Geothermal Energy</i>, Copyright CSM Associates Limited</Paragraph>
Whilst OU-XML does support the ability to make a meaningful link to a resource within the XML document, using an element of the form:
<CrossRef idref="fig007">Figure 7</CrossRef>
(which presumably uses the Alternative label as the cross-referenced identifier, although not the figure element id (eg fig007) which is presumably unique within any particular XML document?), this identifier is not used to link the informally stated figure credit back to the uniquely identified figure element?
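In the absence of an explicit link, about the best I can think of is a heuristic join on the "Figure N" label that starts both the caption and the credit paragraph. The sketch below (Python 3, standard library) does that; the sample document, its `Unit` and `BackMatter` wrapper tags and the caption text are made up for illustration, and the approach assumes that labels are unique within a document and that credits always start with the label.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample: wrapper tag names and caption text are assumptions.
SAMPLE = '''<Unit>
  <Figure id="fig007"><Caption>Figure 7 (caption text)</Caption></Figure>
  <BackMatter>
    <Paragraph>Figure 7 Willes-Richards, J., et al. (1990)</Paragraph>
    <Paragraph>Some other acknowledgement</Paragraph>
  </BackMatter>
</Unit>'''

LABEL = re.compile(r'^\s*(Figure \d+)\b')


def text_of(element):
    """Flatten an element to plain text, ignoring inline markup such as <i>."""
    return ''.join(element.itertext()).strip()


def label_of(element):
    """Return a leading 'Figure N' label, or None if there isn't one."""
    match = LABEL.match(text_of(element))
    return match.group(1) if match else None


def credits_by_figure_id(root):
    """Map figure ids to credit text by matching on the 'Figure N' label."""
    ids = {}
    for fig in root.iter('Figure'):
        caption = fig.find('Caption')
        if caption is not None and label_of(caption):
            ids[label_of(caption)] = fig.get('id')
    credits = {}
    for para in root.iter('Paragraph'):
        label = label_of(para)
        if label in ids:
            credits[ids[label]] = text_of(para)
    return credits


credits = credits_by_figure_id(ET.fromstring(SAMPLE))
print(credits)
```

An explicit idref on the credit would make all of this unnecessary.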
If the same image asset is used in several course units, there is presumably no way of telling from the element data (or even, necessarily, the credit data?) whether the images are in fact one and the same. That is, we can’t audit the OpenLearn materials in a mechanised, text-based way to see whether or not particular images are reused across two or more OpenLearn units.
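For what it's worth, the only mechanical check I can see is a crude one: compare the filenames in the x_imagesrc attribute across units. The sketch below (Python 3, standard library) does just that over a couple of made-up documents with hypothetical unit codes. Since the example filename above has the unit code baked into it (s278_5_f001hi.jpg), I'd expect this to miss most genuine reuse, so treat a match as a hint at best.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Made-up documents keyed by hypothetical unit codes.
UNITS = {
    'UNIT_A': '<Unit><Figure><Image x_imagesrc="shared.jpg"/></Figure>'
              '<Figure><Image x_imagesrc="a_only.jpg"/></Figure></Unit>',
    'UNIT_B': '<Unit><Figure><Image x_imagesrc="SHARED.jpg"/></Figure></Unit>',
}


def shared_images(units):
    """Return {filename: [unit codes]} for filenames seen in 2+ units."""
    seen = defaultdict(set)
    for code, xml_text in units.items():
        for image in ET.fromstring(xml_text).iter('Image'):
            name = image.get('x_imagesrc')
            if name:
                # Case-insensitive match on the bare filename only.
                seen[name.lower()].add(code)
    return {name: sorted(codes) for name, codes in seen.items() if len(codes) > 1}


shared = shared_images(UNITS)
print(shared)
```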
Just in passing, it’s maybe also worth noting that in the above case at least, a description for the image is missing. In actual OU course materials, the description element is used to capture a textual description of the image that explicates the image in the context of the surrounding text. This represents a partial fulfilment of accessibility requirements surrounding images and represents, even if not best, at least effective practice.
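This is the sort of thing that could be checked mechanically. As a rough sketch (Python 3, standard library), the following flags figures whose Description is missing, empty, or just a "Figure N" placeholder like the one above. The sample document, its `Unit` wrapper and the second and third figures are made up, and the placeholder pattern is my own guess at what a non-description looks like.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample; only the first figure mirrors the example in the post.
SAMPLE = '''<Unit>
  <Figure id="fig001"><Caption>Figure 1 The geothermal gradient beneath a continent</Caption><Alternative>Figure 1</Alternative><Description>Figure 1</Description></Figure>
  <Figure id="fig002"><Caption>Figure 2 (caption text)</Caption><Description>A line chart showing temperature rising with depth.</Description></Figure>
  <Figure id="fig003"><Caption>Figure 3 (caption text)</Caption></Figure>
</Unit>'''

# Assumed shape of a placeholder description: "Figure 1", "Fig. 2" and so on.
PLACEHOLDER = re.compile(r'^(Figure|Fig\.?)\s*\d+$', re.IGNORECASE)


def figures_lacking_description(root):
    """Return ids of figures with no meaningful Description element."""
    flagged = []
    for fig in root.iter('Figure'):
        text = (fig.findtext('Description') or '').strip()
        if not text or PLACEHOLDER.match(text):
            flagged.append(fig.get('id'))
    return flagged


flagged = figures_lacking_description(ET.fromstring(SAMPLE))
print(flagged)
```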
Where else might content need liberating within OpenLearn content? At the end of the course unit XML documents, in the “backmatter” element, there is often a list of references. References have the form:
<Reference>Sheldon, P. (2005) Earth’s Physical Resources: An Introduction (Book 1 of S278 Earth’s Physical Resources: Origin, Use and Environmental Impact), The Open University, Milton Keynes</Reference>
Hmmm… no structure there… so how easy would it be to reliably generate a link to an authoritative record for that item? (Note that other records occasionally use presentational markup such as italics (or emphasis) tags to style certain parts of some references (confusing presentation with semantics…).)
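To get a feel for how far pattern matching gets you, here is a small sketch (Python 3, standard library) that tries to split the reference above into authors, year and "everything else". The pattern assumes an "Authors (YYYY) rest" layout, which is my reading of this one example rather than any documented convention, and it makes no attempt to separate title from publisher because I can't see a reliable way to do that.

```python
import re
import xml.etree.ElementTree as ET

REFERENCE_XML = ('<Reference>Sheldon, P. (2005) Earth’s Physical Resources: '
                 'An Introduction (Book 1 of S278 Earth’s Physical Resources: '
                 'Origin, Use and Environmental Impact), The Open University, '
                 'Milton Keynes</Reference>')

# Assumed layout: "Authors (YYYY) rest of the reference".
PATTERN = re.compile(r'^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\s+(?P<rest>.+)$')


def split_reference(element):
    """Best-effort split of a free-text reference; returns None on no match."""
    # itertext() drops inline tags such as <i>; split/join normalises spaces.
    text = ' '.join(''.join(element.itertext()).split())
    match = PATTERN.match(text)
    return match.groupdict() if match else None


parsed = split_reference(ET.fromstring(REFERENCE_XML))
print(parsed)
```

Anything that doesn't follow the assumed layout simply comes back as `None`, which is probably the honest answer for a lot of references.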
Finally, just a quick note on why I’m blogging this publicly rather than raising it, erm, quietly within the OU. My reasoning is similar to the reasoning we use when we tell students to not be afraid of asking questions, because it’s likely that others will also have the same question… I’m asking a question about the structure of an open educational resource, because I don’t quite understand it; by asking the question in public, it may be the case that others can use the same questioning strategy to review the way they present their materials, so when I find those, I don’t have to ask similar sorts of question again;-)
PS sort of related to this, see TechDis’ Terry McAndrew’s Accessible courses need an accessibility-friendly schema standard.
PPS see also another take on ways of trying to reduce cognitive waste – Joss Winn’s latest bid in progress, which will examine how the OAuth 2.0 specification can be integrated into a single sign on environment alongside Microsoft’s Unified Access Gateway. If that’s an issue or matter of interest in your institution, why not fork the bid and work it up yourself, or maybe even fork it and contribute elements back?;-) (Hmm, if several institutions submitted what was essentially the same bid, how would the markers cope during the marking process?!;-)
Media, Erm, Studies?
Over the weekend, I noticed an advert in the Guardian Review for a course on creative writing operated by the Guardian but accredited by the UEA: UEA-Guardian Masterclasses. A little dig around and I see the Guardian are actually offering a whole host of masterclasses in a variety of subjects: Guardian Masterclasses. They are also offering their first(? more to come) masterclass with General Assembly (“a campus for technology, design, and entrepreneurship based in New York City”) on Understanding the Digital Economy; of note here is the additional comment that “General Assembly will be opening a campus in London at the end of 2012.” Campus; not hackspace or officespace, or workspace (though that may well be what it actually is): but campus.
[Update: via @jukesie, I'm also reminded of the Guardian's teacher resources site, learnthings/learn.co.uk; for completeness, maybe also worth mentioning other innovations the Guardian is up to publishing-wise, eg wrt ebooks: second half of A Tinkerer’s Toolbox….]
Alongside this, we have Condé Nast announcing a College of Fashion and Design to start from 2013 (as described in If Courses are About Content, We Have Competition…) and accredited by, erm, Vogue.
Educators in the area of IT will be well aware of the preponderance of vendor certification, where (arguably justifiably) vendors create a training curriculum that covers the key principles relating to one or more of their products. Institutions renowned for their training in certain areas have also been known to make their content available, as for example via the BBC College of Journalism.
In the OU, we’ve had a couple of rapidly produced courses* that wrap a pre-existing vendor qualification with an academic wrapper and academic assessment, and then provide the student an opportunity to earn both a vendor certificate and formal academic credit using the same vehicle. (See also: Towards Vendor Certification on the Open Web? Google Training Resources and Due Out Soon – The Google “Qualified Developer Program”.)
*For example, CCNA/Cisco Networking; T155 Linux: An Introduction provides a route to CompTIA accreditation, and T189 Digital Photography is “recognised by The Royal Photographic Society (RPS) as suitable preparatory work and a foundation for a Licentiateship Distinction (LRPS) in still photography”. And if you want badges, then try iSpot…;-)
The OU has also, in the past, produced short courses around broadcast television programmes co-produced with the BBC: S180 Life in the Oceans around Blue Planet, for example; (was S198 Exploring Mars tied to a TV series?; or A178 Perspectives on Leonardo da Vinci?). I’m not sure about the extent to which the OU is allowed to make use of BBC archive footage (could someone let me have a peek of the Sixth Agreement? Discretion assured/NDA signed if required; or is it FOIable?!;-) but I keep on wondering about how we might be able to make more of co-pro’d content, especially content that had courses developed around it (and which may or may not already be on OpenLearn?) (NB it’s worth noting that OU strategy appears at the moment to be focussed on competing for full time, younger students with other HEI entrants into the distance learning market, and moving away from shorter “leisure learning” courses which is a market that the media appear to be encroaching on. I can’t help wondering what might have happened if the OU had hooked up with the Guardian two or three years ago…[Disclaimer: this post barely represents my own beliefs, let alone those of my employer... etc etc...])
And finally, in Learning around F1…?!;-), I commented on how private equity owned learndirect are sponsoring a Formula One motor racing team; and so it goes…
Something is happening; but even if we can’t figure out what, at the very least we need to identify where higher education is placed in it all and what value it adds and what unique service(s) it offers… (See also: So What Do Universities Sell?, incl. comments.)
PS I think I need to read the Innovator’s Dilemma, and consequent books, again; wasn’t one of the claims that new entrants could pick some of the low hanging fruit (short courses, leisure learning, partnered accreditation and accreditation scheme/trust development) and then slowly build up capacity to take on the incumbents (longer form courses; credit + experience equivalents)?
PPS In passing, I notice that the Economist offers a suite of courses: Economist Education: Courses. The FT suggests ways of Enhanc[ing] your curriculum with the Financial Times, as well as branding a series of Pearson published textbooks (FT Publishing). Publishers such as O’Reilly are big in the conference organisation area (O’Reilly Conferences), and the Guardian (again) has also made inroads into this area of content and buzz generation through things like the Activate Summit or the (CPD Certified) Higher Education Summit (note to self: does anyone else use the word summit for this sort of offering?).
eSTeEM Conference Presentation – Making More of Structured Course Materials
A copy of the presentation I gave at the OU-eSTeEM conference (no event URL?) on generating custom course search engines and mining OU XML documents to generate course mindmaps (Making More of Structured Documents presentation; delicious stack/bookmark list of related resources):
Chatting to Jonathan Fine after the event, he gave me the phrase secondary products to describe things like course mindmaps that can be generated from XML source files of OU course materials. From what I can tell, there isn’t much if any work going on in the way of finding novel ways of exploiting the structure of OU structured course materials, other than using them simply as a way of generating different presentational views of the course materials as a whole (that is, HTML versions, maybe mobile friendly versions, PDF versions). (If that’s not the case, please feel free to put me right in the comments:-)
One thing Jonathan has been scouring the documents for is evidence of mathematical content across the courses; he also mentioned a couple of ideas relating to access audits over the content itself, such as extracting figure headings, or image captions. (This reminded me of the OpenLearn XML processor (and redux) I first played with 4 years ago (sigh… and nothing’s changed… sigh….), which stripped assets by type from the first generation of OU XML docs.) So on my to do list is to have a deeper look at the structure of OU XML, have a peek at what sorts of things might meaningfully (and easily;-) be extracted, and figure out two or three secondary products that can be generated as a result. Note that these products might be products for different audiences, at different times of the course lifecycle: tools for use by course team or LTS during production (such as accessibility checks), products to support maintenance (there is already a link checker, but maybe there is more that can be done here?), products for students (such as the mindmap), products for alumni, products for OpenLearn views over the content, products to support “learning analytics”, and so on. (If you have any ideas of what forms the secondary products might take, or what structures/elements/entities you’d like to see mined from OU XML, please let me know via the comments. For an example of an OU XML doc, see here.)
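As a first step towards that deeper look, a quick way of seeing what is in a document is simply to tally the element names and pull out one element type, such as the captions. A minimal sketch (Python 3, standard library); the sample document, including its `Unit` and `Session` tags and caption text, is made up, so a real OU XML doc would need to be loaded in its place.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up stand-in for an OU XML document; tag names are assumptions.
SAMPLE = '''<Unit>
  <Session>
    <Paragraph>Some text.</Paragraph>
    <Figure id="fig001"><Caption>Figure 1 First caption</Caption></Figure>
    <Figure id="fig002"><Caption>Figure 2 Second caption</Caption></Figure>
  </Session>
</Unit>'''

root = ET.fromstring(SAMPLE)

# How many of each element type does the document contain?
counts = Counter(element.tag for element in root.iter())

# One candidate "secondary product": a flat list of figure captions.
captions = [''.join(caption.itertext()).strip() for caption in root.iter('Caption')]

for tag, n in counts.most_common():
    print(tag, n)
print(captions)
```

The tally gives a rough map of which elements are common enough to be worth mining.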
OU Marketers Go After Competition Supported Editorial…?
Over the weekend, I noticed that the Guardian was running a competition offering readers the chance to study for an OU degree for free. Today, via a tweet, I see a link to a piece of editorial coverage from Friday – Live and learn with distance learning – on some of the motivations for studying for an OU degree, as well as a look at the commitment that’s involved in taking a distance learning degree.
The competition is prominently linked to:
I suspect we are going to see more of this…
I was also interested to see this tweet from @barnstormed on Sunday: Nice to see the @openuniversity on one of the electronic pitch-side advertising boards at Murrayfield :) #rugby #6nations [Anyone got a screenshot?]
See also a previous campaign: OU Course Discounts with the Tesco Clubcard, although I note this is about to come to an end?
Hmmm…