Open as in Closed

Lorcan Dempsey was revisiting an old favourite last week, in a discussion about inside-out and outside-in library activities (Discovery vs discoverability …), where outside-in relates to managing collections of, and access to, external resources, versus the inside-out strategy whereby the library accepts that discovery happens elsewhere, and sees its role as making library mediated resources (and resources published by the host institution) available in the places where the local patrons are likely to be engaging in resource discovery (i.e. on the public web…)

A similar notion can be applied to innovation, as fumblingly described in this old post Innovating from the Inside, Outside. The idea there was that if institutions made their resources and data public and openly licensed, then internal developers would be able to make use of them for unofficial and skunkwork internal projects. (Anyone who works for a large institution will know how painful it can be getting hold of resources that are “owned” by other parts of the institution). A lot of the tinkering I’ve done around OU services has only been possible because I’ve been able to hold of the necessary resources via public (and unauthenticated) URLs. A great example of this relates to my OpenLearn tinkerings (e.g. as described in both the above linked “Innovation” post and more recently in Derived Products from OpenLearn/OU XML Documents).

But with the recent migration of OpenLearn to the domain, it seems as if the ability to just add ?content=1 to the end of a unit URL and as a result get access to the “source” XML document (essentially, a partially structured “database” of the course unit) has been disabled:

openlearn closed

Of course, this could just be an oversight, a switch that failed to be flicked when the migration happened; although from the unit homepage, there is no obvious invitation to download an XML version of the unit.

OPenlearn unit homepage

[UPDATE: see comments – seems as if this should be currently classed as “broken” rather than “removed”.]

In a sense, then, access to a useful format of the course materials for the purpose of deriving secondary products has been removed. (I also note that the original, machine readable ‘single full list’ of available OpenLearn units has disappeared, making the practical act of harvesting harder even if the content is available…) Which means I can no longer easily generate meta-glossaries over all the OpenLearn units, nor image galleries or learning objective directories, all of which are described in the Derived Products from OpenLearn post. (If I started putting scrapes on the OU network, which I’ve considered many times, I suspect the IT police would come calling…) Which is a shame, especially at a time when the potential usefulness of text mining appears to be being recognised (eg BIS press release on ‘Consumers given more copyright freedom’, December 20, 2012: “Data analytics for non-commercial research – to allow non-commercial researchers to use computers to study published research results and other data without copyright law interfering;”, interpreted by Peter Murray Rust as the UK government says it’s legal to mine content for the purposes of non-commercial research. By the by, I also notice that the press release also mentions “Research and private study – to allow sound recordings, films and broadcasts to be copied for non-commercial research and private study purposes without permission from the copyright holder.” Which could be handy…).

This effective closing down of once open services is (deliberate or not), of course, common to anyone who plays with web APIs, which are often open and free in early beta development phase, but then get locked down as companies are faced with the need to commercialise them. Faced with the need to commercialise them.

Returning to Lorcan’s post for a moment, in which he notes “growing interest in connecting the library’s collections to external discovery environments so that the value of the library investment is actually released for those for whom it was made” on the one hand; and “a parallel interest in making institutional resources (research and learning materials, digitized special materials, faculty expertise, etc) more actively discoverable.” More actively discoverable.

If part of the mission is also to promote reuse of content, as well as affording the possibility of third parties opening up additional discovery channels (for example, through structured indices and recommendation engines), not to say creating derived and value-add products, then making content available in “source” form, where structural metadata can be mined for added value discovery (for example, faceted search over learning objectives, or images or glossary items, blah, blah, blah..) is good for everyone.

Unless you’re precious about the product of course, and don’t really want it to be open (whatever “open” means…).

As as pragmatist, and a personal learner/researcher, I often tend not to pay too much attention to things like copyright. In effect, I assert the right to read and “reuse” content for my own personal research and learning purposes. So the licensing part of openness doesn’t really bother me in that respect too much anyway. It might become a problem if I built something that I made public that started getting use and starting “stealing” from, or misrepresenting the original publisher, and then I’d have to do worry about the legal side of things… But not for personal research.

Note that as I play with things like Scraperwiki more and more, I find myself more and more attracted to the idea of pulling content in to a database so that I can add enhanced discovery services over the content for my own purposes, particularly if I can pull structural elements out o the scraped content to enable more particular search queries. When building scrapers, I tend to limit myself to scraping sites that do not present authentication barriers, and whose content is generally searchable via public web search engines (i.e. it has already been indexed and is publicly discoverable).

Which brings me to consider a possibly disturbing feature of MOOC platforms such as Coursera. The course may be open (if you enrol, but the content of, and access to, the materials ins’t discoverable. That is, it’s not open as to search. It’s not open as to discovery. (Udacity on the other hand does seem to let you search course content; e.g. search with limits

I’m not sure what the business model behind FutureLearn will be, but when (if?!) the platform actually appears, I wonder whether course content will be searchable/outside-discoverable on it? (I also wonder to what extent the initial offerings will relate to course resources that JISC OER funding helped to get openly licensed? And what sort of license will apply to the content on the site (for folk who do pay heed to the legalistic stuff;-)

So whilst Martin Weller victoriously proclaims Openness has won – now what?, saying “we’ll never go back to closed systems in academia”, I just hope that we don’t start seeing more and more lock dawn, that we don’t start seeing less and less discovery of useful content published sites, that competition between increasingly corporatised universities doesn’t mean that all we get access to is HE marketing material in the form of course blurbs, and undiscoverable content that can only be accessed in exchange for credentials and personal tracking data.

In the same way that academics have always worked round the journal subscription racket that the libraries were complicit in developing with with academic publishers (if you get a chance, go to UKSG, where publisher reps with hospitality accounts do the schmooze with the academic library folk;-), sharing copies of papers if anyone ever asked, I hope that they do the same with their teaching materials, making them discoverable and sharing the knowledge.

Viewing OpenLearn Mindmaps Using d3.js

In a comment on Generating OpenLearn Navigation Mindmaps Automagically, Pete Mitton hinted that the d3.js tree layout example might be worth looking at as a way of visualising hierarchical OpenLearn mindmaps/navigation layouts.

It just so happens that there is a networkx utility that can publish a tree structure represented as a networkx directed graph in the JSONic form that d3.js works with (networkx.readwrite.json_graph), so I had a little play with the code I used to generate Freemind mind maps from OpenLearn units and refactored it to generate a networkx graph, and from that a d3.js view:

(The above view is a direct copy of Mike Bostock’s example code, feeding from an automagically generated JSON representation of an OpenLearn unit.)

For demo purposes, I did a couple of views: a pure HTML/JSON view, and a Python one, that throws the JSON into an HTML template.

The d3.js JSON generating code can be found on Scraperwiki too: OpenLearn Tree JSON. When you run the view, it parses the OpenLearn XML and generates a JSON representation of the unit (pass the unit code via a ?ucode=UNITCODE URL parameter, for example

The Python powered d3.js view also responds to the unit URL parameter, for example:

The d3.js view is definitely very pretty, although at times the layout is a little cluttered. I guess the next step is a functional one, though, which is to find how to linkify some of the elements so the tree view can act as a navigational surface.

Generating OpenLearn Navigation Mindmaps Automagically

I’ve posted before about using mindmaps as a navigation surface for course materials, or as way of bootstrapping the generation of user annotatable mindmaps around course topics or study weeks. The OU’s XML document format that underpins OU course materials, including the free course units that appear on OpenLearn, makes for easy automated generation of secondary publication products.

So here’s the next step in my exploration of this idea, a data sketch that generates a Freemind .mm format mindmap file for a range of OpenLearn offerings using metadata puled into Scraperwiki. The file can be downloaded to your desktop (save it with a .mm suffix), and then opened – and annotated – within Freemind.

You can find the code here: OpenLearn mindmaps.

By default, the mindmap will describe the learning outcomes associated with each course unit published on the Open University OpenLearn learning zone site.

By hacking the view URL, other mindmaps are possible. For example, we ca make the following additions to the actual mindmap file URL (reached by opening the Scraperwiki view) as follows:

  • ?unit=UNITCODE, where UNITCODE= something like T180_5 or K100_2 and you will get a view over section headings and learning outcomes that appear in the corresponding course unit.
  • ?unitset=UNITSET where UNITSET= something like T180 or K100 – ie the parent course code from which a specific unit was derived. This view will give a map showing headings and Learning Outcomes for all the units derived from a given UNITSET/course code.
  • ?keywordsearch=KEYWORD where KEYWORD= something like: physics This will identify all unit codes marked up with the keyword in the RSS version of the unit and generate a map showing headings and Learning Outcomes for all the units associated with the keyword. (This view is still a little buggy…)

In the first iteration, I haven’t added links to actual course units, so the mindmap doesn’t yet act as a clickable navigation surface, but that it is on the timeline…

It’s also worth noting that there is a flash browser available for simple Freemind mindmaps, which means we could have an online, in-browser service that displays the mindmap as such. (I seem to have a few permissions problems with getting new files onto at the moment – Mac side, I think? – so I haven’t yet been able to demo this. I suspect that browser security policies will require the .mm file to be served from the same server as the flash component, which means a proxy will be required if the data file is pulled from the Scraperwiki view.)

What would be really nice, of course, would be an HTML5 route to rendering a JSONified version of the .mm XML format… (I’m not sure how straightforward it would be to port the original Freemind flash browser Actionscript source code?)

The Learning Journey Starts Here: and OpenLearn Resource Linkage

Mulling over the OU’s OULearn pages on Youtube a week or two ago, colleague Bernie Clark pointed out to me how the links from the OU clip descriptions could be rather hit or miss:

Via @lauradee, I see that the OU has a new offering on is far more supportive of links to related content, links that can represent the start of a learning journey through OU educational – and commentary – content on the OU website.

Here’s a way in to the first bit of OU content that seems to have appeared:

This links through to a playlist page with a couple of different sorts of opportunity for linking to resources collated at the “Course materials” or “Lecture materials” level:

(The language gives something away, I think, about the expectation of what sort of content is likely to be uploaded here…)

So here, for example, are links at the level of the course/playlist:

And here are links associated with each lecture, erm, clip:

In this first example, several types of content are being linked to, although from the link itself it’s not immediately obvious what sort of resource a link points to? For example, some of the links lead through to course units on OpenLearn/Learning Zone:

Others link through to “articles” posted on the OpenLearn “news” site (I’m not ever really sure how to refer to that site, or the content posts that appear on it?)

The placing of content links into the Assignments and Others tabs always seems a little arbitrary to me from this single example, but I suspect that when a few more lists have been posted some sort of feeling about what sorts of resources should go where (i.e. what folk might expect by “Assignment” or “Other” resource links). If there’s enough traffic generated through these links, a bit of A/B testing might even be in order relating to the positioning of links within tabs and the behaviour of students once they click through (assuming you can track which link they clicked through, of course…)?

The transcript link is unambiguous though! And, in this case at least), resolves to a PDF hosted somewhere on the OU podcasts/media filestore:

(I’m not sure if caption files are also available?)

Anyway – it’ll be interesting to hear back about whether this enriched linking experience drives more traffic to the OpenLearn resources, as well as whether the positioning of links in the different tab areas has any effect on engagement with materials following a click…

And as far as the linkage itself goes, I’m wondering: how are the links to OpenLearn course units and articles generated/identified, and are those links captured in one of the stores? Or is the process that manages what resource links get associated with lists and list items on Youtube/edu one that doesn’t leave (or readily support the automated creation of) public data traces?

PS How much (if any( of the linked resource goodness is grabbable via the Youtube API, I wonder? If anyone finds out before me, please post details in the comments below:-)

Asset Stripping OpenLearn – Images

A long time ago, I tinkered with various ways of disaggregating OpenLearn course units into various components – images, audio files, videos, etc. (OpenLearn_XML Asset stripper (long since rotted)). Over the last few weeks, I’ve returned to the idea, using Scraperwiki to trash through the OpenLearn XML (and RSS) in order to build collections out of various different parts of the OpenLearn materials. So for example, a searchable OpenLearn meta-glossary, that generates one big glossary out of all the separate glossary entries in different OpenLearn units, and an OpenLearn learning outcomes explorer, that allows you to search through learning outcomes as described in different OpenLearn courses.

I’ve also been pulling out figure captions and descriptions, so last night I added a view that allows you to preview images used across OpenLearn: OpenLearn image viewer.

There’s a bit of niggle in using the viewer at the moment (as Jenny Gray puts it, “it’ll be a session cookie called MoodleSession in the domain (if you can grab it?)”) which, if you don’t have a current OpenLearn session cookie, requires you to click on one of the broken images in the righthand-most column and then go back to the gallery viewer (at which point, the images should load okay…unless you have some cookie blocking or anti-tracking features in place, which may well break things further:-( )

(If anyone can demonstrate a workaround for me for how to set the cookie before displaying the images, that’s be appreciated…)

To limit the viewed images, you can filter results according to terms appearing in the captions or descriptions or by course unit number.

One thing to note is that although the OpenLearn units are CC licensed, some of the images used in the units (particularly third party images) may not be so liberally licensed. At the moment, there is a disconnect in the OU XML between images and any additional rights information (typically a set of unstructured acknowledgements at the end of the unit XML), which makes a fully automated “open images from OpenLearn” gallery/previewer tricky to knock together. (When I get a chance, I’ll put together a few thoughts about what would be required to support such a service. It probably won’t be much, just an appropriate metadata filed or two…)

PS here’s an example of why the ‘need a cookie to get the image’ thing is really rather crap… I embedded an image from OpenLearn, via a link/url, in a post (making sure to link back to the original page). Good for me – I get a relevant image, I don’t have to upload it anywhere – good for OpenLearn, they get a link back, good for OpenLearn, they get a loggable server hit when anyone views the image (although bad for them in that it’s their server and bandwidth that has to deliver the image).

However, as it stands, it’s bad for OpenLearn because all the users see is a broken link, rather than the image, unless you have a current OpenLearn cookie session already set. The fix for me is more work: download the image, upload it to my own server, and then embed my copy of the image. OpenLearn no longer gets any of the “paradata” surrounding the views on that image, and indeed may never even know that I’m reusing it…

Scraperwiki Powered OpenLearn Searches – Learning Outcomes and Glossary Items

A quick follow up to Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API demonstrating how to reuse that pattern (a little more tinkering is required to fully generalise it, but that’ll probably have to wait until after the Easter wifi-free family tour… I also need to do a demo of a pure HTML/JS version of the approach).

In particular, a search over OpenLearn learning outcomes:

and a search over OpenLearn glossary items:

Both are powered by tables from my OpenLearn XML Processor scraperwiki.

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});


    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?

    var formatter = new google.visualization.PatternFormat('<a href="{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply formatter and set the formatted value of the first column.

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'

  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);


The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI) in the displayed table? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for sometime whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

ocrecURL=''+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
    print ocrecURL,[recData]
    if len(recData['result'])>0:
        if recData['result'][0]['score']>=0.7:

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)

As @stuartbrown suggeted in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn LEarning Outcomes Search