Appropriating OpenLearn Content and Republishing Edited Versions Of It Via a “Simple” Automated Text Blogging Workflow

I had intended on using my (unpaid) strike days to catch up with some books and harp practice, and maybe even the garden, and keep away from the keyboard; or failing that, to have a push on my rally data tinkering and get another LeanPub book started to try to reboot the £50 a quarter or so my previous publication (Wrangling F1 Data With R) generated, which keeps things like recurring Dropbox and Flickr etc etc charges covered (no-one has ever bought me a KoFi, as far as I can tell…).

And I was determined not to do any of the mounting workload associated with the day job, no matter how much fun some it is likely to be (like getting Ev3devSim working as an ipywidget in Jupyter notebooks).

Whilst I did manage to stick to the determined not to path, I never even really started down the intended one, instead spending hours and hours in front of keyboard trying to hack something together around my OpenLearn publishing workflow.

So here’s what I’ve come up with…

An OpenLearn Unit Text Publishing Thing

Firstly, it’s a thing that lets you grab the “source” content of an OpenLearn unit (at least, some of it; I still haven’t got round to grabbing things like video files or audio files, or scraping PDFs etc.) and churn it into a simple text format, markdown, which looks like this:

Headers are prefixed with a #, you can emphasise things by wrapping it in * characters, eg *italics* -> italics, or **double them up** for strong emphasis. Embedding links — [link text](link/path/file.html) — and images — ![Alt text](path/to/image.file) — is also pretty easy when you get the hang of it.

So how do you get started? First, you need a Github account (sign up here; just get one: you’re not going to have to do any hard Github stuff, you’re just making use of their free hosting). Get one, and sign in.

Second, visit my demo repo — psychemedia/openlearn-publish-test — (the URL will change at some point, but I’ll archive the original and link the new address from it…) and grab a copy of your own repo from mine by clicking the big green Use this template button:

You’ll be presented with a form:

Give your repo a name (no spaces). Optionally add a description. Keep the repo public. And click the big green Create repository from template button.

Things will churn for a moment or two:

And then you’ll have your own repo, containing a copy of the files in mine:

Behind the scenes, there is work going on…

Click the Actions tab on your copy of the repo to see what…

At first, it may look like nothing… but wait a moment or two and refresh the page:

A couple of actions will start running to initialise, and customise, your repo for you.

When the actions are done, you’ll be informed… (you shouldn’t have to refresh the page, the status indicators should update when things are done…):

If you go back to your repo homepage, you’ll see it’s been updated with a new README that’s slightly different to the original copy from my repo, and that has been personalised to yours:

So… now you can grab some OpenLearn content into your repo.

Click on the file link in your repo:

You will be presented with a list of units on OpenLearn.

Find one you like the look of and click the Grab Unit into this repo link:

This will open a new issue for you in the Issues tab of your repo, and prepopulate it with a title that will tell a Github Action you want to grab some OpenLearn content, and an issue body that tells the action where the unit can be found.

Click the Submit new issue button to get things started.

Back in the Actions tab, you can see the helper elves have started doing their thing again…

If you click on a running Action, you can check its progress in more detail:

Click through on the actual job name to see what’s happening inside:

You can expand a step by clicking the arrow to see what each step is doing or has already done…

If you read through the steps, you’ll see several things are done: for example, we grab some OUXML (the OpenLearn content), convert to markdown, build some HTML files (these are what gets published), and deploy them, then build a LaTeX version of the material (which is used to generate a PDF), and an ePub ebook. (The LaTeX step takes some time; I should perhaps simplify things so that only the HTML build is done by default.)

When the Actions are green circled / green ticked and done (which may take a few minutes…), or at least, when the Deploy HTML to gh-pages step has run, go back to the repo home page, where you should see a new commit has been made to your repo:

If you click into the content folder you’ll see one or more session folders:

If you click into a session folder, you see some markdown files:

If you click on one of those, you’ll see some scraped and converted OpenLearn content:

So the content has been grabbed from OpenLearn and saved to your repo.

But that’s not all.

If you scroll down on your README page (I really should make this link more prominent in the README…) you’ll see a link to a site published from your repo:

Click it…

If you see a “404”, page not found, don’t panic

On the repo home page, select the Settings tab:

and scroll down to the Github Pages area:

Change the Source from gh-pages branch to master branch:

And then, select the master branch:

And set the Source back to the gh-pages branch:

When you see something like this, you knows all good to go:

Note that cacheing of a previous build of the site may last for up to 10 minues, so grab yourself a cup of tea, or perhaps look through the markdown files in the content directory, or even go back to the Actions tab and, if the actions have completed.

If the Actions have completed, select the OpenLearnXML2 (or a completed nbsphinx publisher action if you have committed your own changes to the markdown files) and you should see and the availability of an Artifacts download.

Down load and unzip the artifacts file. If the build process has been able to build a PDF file and/or an ePub file from the content, it will be found in the unzipped downloaded directory.

Right… time to try your site link again:

An OpenLearn Editing Thing

This will have to be in a part two to this post… I’ve run out of time for now and need to get back to the day job…

If you are itching to get started, this may work, if I’ve got my autopublishing things fixed…

In the content folder (on the default master branch of the repo),  find the markdown file you want to edit, and click on the pencil icon to open the editor:


Edit the file / make the changes you want, and commit it (you may want to set a meaningful commit message title summarising the chnages, and perhaps even a longer description about the motivation for the changes, but both are optional…):


Click the big green Commit changes button to commit the changes. If you look in the Actions tab, you should see that an nbsphinx publisher action has started that should publish your changes to your site.


Note that even when the publishing action has generated and pushed updated site pages  to where they need to be, the site may take a few minutes to update because of page cacheing on the Github site.

The Future

One of the spinoffs of this for me was the realisation that I could use Github Actions to run arbitrary code in response to particular events, such as.. commits or issue postings. The current machinery uses a Sphinx / nbSphinx publishing route, but I’ve also started exploring a recipe for Jekyll based Jupyter Book publishing. (Next on the to do list will be an Executable Book project / MyST workflow](; I also need to split out the workflows into actions of their own, but I haven’t figured out how to do that for myself yet.) It strikes me that I could bundle all these in the same repo with some way of flagging which build process I want to use. This would allow the user to then republish their material using the publishing tool, and its various peculiarities, customisationa and affordances, of their choice.

Immediate to dos, that may not happen because I’m the only user, I know it’s possible, and I’m not that interested, are to: make the Github Pages / pubished site link more prominent in the README; get movie and audio downloads and embeds working. Also a way of handling PDFs linked from the OpenLearn materials, and perhaps extracting text from those, even, to support republishing…

On the publish side, it would be useful to be able to publis to HTML only by default, with some optional way of invoking the PDF and ePub builds. The ePub build also needs things like title and author setting. The PDF build sometimes breaks, eg due to the inability to detect a bounding box size round a gif image. I maybe need to use another PDF generator, eg some hints here.

I also need to refactor the code, two ways: firstly, a simplification, that uses the bare minimum of packages and just churns the markdown direct from XML in one simple step. Secondly, fixing the current workflow, which stages the XML in a SQLite database, so that the database can properly handly content from multiple unitis and I can reliably churn the md from the database for any single unit. At the moment, I think things pretty much assume there’s content from just a single unit in the database. Putting the md into the db might be useful too… Then I could imagine a datasette powered publishing route too…

As a recent tweet from Martin Hawksey reveals, he’s been blogging about how we can turn Google’s App Script to our own purposes as a hosted code runner, and I think Github now provides a similar opportunity for anyone who wants to appropriate it to that end…

OpenLearn OER (Re)Publishing the Text Way

In response to a provocation, I built a thing that will let you grab an OpenLearn unit, convert it to a simple text format, and publish it on your own website.

[For the next step in this journey, see: Appropriating OpenLearn Content and Republishing Edited Versions Of It Via a “Simple” Automated Text Blogging Workflow.]

It doesn’t require much:

  • if you haven’t got one already, create a Github account (just don’t “ooh, Github, that’s really hard, so I won’t be able to do it…”; just f***ing get an account);
  • visit my repo and read down the page to see what to do…

And what to do essentially boils down to:

  • press a BIG GREEN BUTTON to grab your a copy of the repo;
  • raise an issue, which is to say: click a BIG GREEN button, copy and paste Fetch as the title, and an OpenLearn course unit URL (if it ends in content-section-overview-0 or content-section-overview-0?SOMETHING it should work) as the first line in the issue body; for example:
  • go to your repo’s Github Pages website. For a repo at this will be and after a few minutes and a page refresh or or two you should see your website there. If it doesn’t appear, check the README for a possible fix.

As for changing the content – it’s not that hard once you’ve done it a few times and just go with the flow of writing what feels natural… “Easy” to edit text files are in the content directory and you can edit them via the Github website.

Fragment: Hard to Use OpenLearn OU-XML to Markdown Tool, If You Fancy Trying It…

Over the years, I’ve dabbled on and off with OU-XML, the XML document format that OU and OpenLearn texts are mastered in. Over the last year I’ve been exploring convertng OU-XML to the simple markdown text format (eg here).

There are a several advantages to using markdown: firstly, it’s a simple text format; secondly, you can open and edit markdown docs in a Jupyter notebook UI via Jupytext; thirdly, there are well proven (though still fiddly…) workflows for publising websites from markdown source docs (eg on of my experiments here).

As to why editing markdown docs in a notebook UI is useful: for one, you can edit — and preview — Latex, which means you can write maths equations and chemical formulae in a simple text way; for another, you can add code into your document that can embed interactives: for example, my folium magic lets you embed maps with markers or shaperfiles in to the document with a single, relatively straightforward, one-liner; or code to generate charts from data; or create simple interactive applications using ipywidgets. And so on. In short, the notebook is a medium that affords you lots of possibilities for incorporating generated, as well as interactive, content.

Following a proviocation by Marco Kalz / @mkalz yesterday, I cobbled together various bits of code into this repo — innovationOUtside/open-ouxml-tools — which doubles as the src for an installable Python package’n’CLI, that lets you:

  • download and grab the OU-XML for an OpenLearn unit, along with all its image assets, into a SQLite database;
  • generate a set of markdown files from the SQLite database.

With the single test unit I tried it on, it seems to work okay in MyBinder (just click on the button on the repo homepage, than click on the file when the notebook UI loads).

To get the files out, the nbarchive extension is preinstalled into the Binderised environment so you should be able to zip and export the all the generated files.

They could then be uploaded into a clone of something like ouseful-template-repos/oer-md-publish for autopublishing. (That example uses CircleCI as per this). I’ll try to figure out a Github Action way of doing something similar over the next few days, perhaps in a repo that will also grab a specified OpenLEarn unit for you (eg by using a Git commit performative CLI call, for example…?!;-)

Note that I’m still not claiming that this is easy, but I think the pieces are there if anyone wants to work through it and try it out. If folk do play with it, I’m more likely to try to make it a bit easier. But I know that because it isn’t easy, most folk won’t try it. (S’like a built in defense mechanism for me; matched time. If no-one else bothers, I don’t have to either… So if you want this thing to become real, you have to invest time into it now, too…)

PS I’m working on a new way of introducing recipes like this, as TINEWY (tin yui) ones: There Is No Easy Way Yet.

Open as in Closed

Lorcan Dempsey was revisiting an old favourite last week, in a discussion about inside-out and outside-in library activities (Discovery vs discoverability …), where outside-in relates to managing collections of, and access to, external resources, versus the inside-out strategy whereby the library accepts that discovery happens elsewhere, and sees its role as making library mediated resources (and resources published by the host institution) available in the places where the local patrons are likely to be engaging in resource discovery (i.e. on the public web…)

A similar notion can be applied to innovation, as fumblingly described in this old post Innovating from the Inside, Outside. The idea there was that if institutions made their resources and data public and openly licensed, then internal developers would be able to make use of them for unofficial and skunkwork internal projects. (Anyone who works for a large institution will know how painful it can be getting hold of resources that are “owned” by other parts of the institution). A lot of the tinkering I’ve done around OU services has only been possible because I’ve been able to hold of the necessary resources via public (and unauthenticated) URLs. A great example of this relates to my OpenLearn tinkerings (e.g. as described in both the above linked “Innovation” post and more recently in Derived Products from OpenLearn/OU XML Documents).

But with the recent migration of OpenLearn to the domain, it seems as if the ability to just add ?content=1 to the end of a unit URL and as a result get access to the “source” XML document (essentially, a partially structured “database” of the course unit) has been disabled:

openlearn closed

Of course, this could just be an oversight, a switch that failed to be flicked when the migration happened; although from the unit homepage, there is no obvious invitation to download an XML version of the unit.

OPenlearn unit homepage

[UPDATE: see comments – seems as if this should be currently classed as “broken” rather than “removed”.]

In a sense, then, access to a useful format of the course materials for the purpose of deriving secondary products has been removed. (I also note that the original, machine readable ‘single full list’ of available OpenLearn units has disappeared, making the practical act of harvesting harder even if the content is available…) Which means I can no longer easily generate meta-glossaries over all the OpenLearn units, nor image galleries or learning objective directories, all of which are described in the Derived Products from OpenLearn post. (If I started putting scrapes on the OU network, which I’ve considered many times, I suspect the IT police would come calling…) Which is a shame, especially at a time when the potential usefulness of text mining appears to be being recognised (eg BIS press release on ‘Consumers given more copyright freedom’, December 20, 2012: “Data analytics for non-commercial research – to allow non-commercial researchers to use computers to study published research results and other data without copyright law interfering;”, interpreted by Peter Murray Rust as the UK government says it’s legal to mine content for the purposes of non-commercial research. By the by, I also notice that the press release also mentions “Research and private study – to allow sound recordings, films and broadcasts to be copied for non-commercial research and private study purposes without permission from the copyright holder.” Which could be handy…).

This effective closing down of once open services is (deliberate or not), of course, common to anyone who plays with web APIs, which are often open and free in early beta development phase, but then get locked down as companies are faced with the need to commercialise them. Faced with the need to commercialise them.

Returning to Lorcan’s post for a moment, in which he notes “growing interest in connecting the library’s collections to external discovery environments so that the value of the library investment is actually released for those for whom it was made” on the one hand; and “a parallel interest in making institutional resources (research and learning materials, digitized special materials, faculty expertise, etc) more actively discoverable.” More actively discoverable.

If part of the mission is also to promote reuse of content, as well as affording the possibility of third parties opening up additional discovery channels (for example, through structured indices and recommendation engines), not to say creating derived and value-add products, then making content available in “source” form, where structural metadata can be mined for added value discovery (for example, faceted search over learning objectives, or images or glossary items, blah, blah, blah..) is good for everyone.

Unless you’re precious about the product of course, and don’t really want it to be open (whatever “open” means…).

As as pragmatist, and a personal learner/researcher, I often tend not to pay too much attention to things like copyright. In effect, I assert the right to read and “reuse” content for my own personal research and learning purposes. So the licensing part of openness doesn’t really bother me in that respect too much anyway. It might become a problem if I built something that I made public that started getting use and starting “stealing” from, or misrepresenting the original publisher, and then I’d have to do worry about the legal side of things… But not for personal research.

Note that as I play with things like Scraperwiki more and more, I find myself more and more attracted to the idea of pulling content in to a database so that I can add enhanced discovery services over the content for my own purposes, particularly if I can pull structural elements out o the scraped content to enable more particular search queries. When building scrapers, I tend to limit myself to scraping sites that do not present authentication barriers, and whose content is generally searchable via public web search engines (i.e. it has already been indexed and is publicly discoverable).

Which brings me to consider a possibly disturbing feature of MOOC platforms such as Coursera. The course may be open (if you enrol, but the content of, and access to, the materials ins’t discoverable. That is, it’s not open as to search. It’s not open as to discovery. (Udacity on the other hand does seem to let you search course content; e.g. search with limits

I’m not sure what the business model behind FutureLearn will be, but when (if?!) the platform actually appears, I wonder whether course content will be searchable/outside-discoverable on it? (I also wonder to what extent the initial offerings will relate to course resources that JISC OER funding helped to get openly licensed? And what sort of license will apply to the content on the site (for folk who do pay heed to the legalistic stuff;-)

So whilst Martin Weller victoriously proclaims Openness has won – now what?, saying “we’ll never go back to closed systems in academia”, I just hope that we don’t start seeing more and more lock dawn, that we don’t start seeing less and less discovery of useful content published sites, that competition between increasingly corporatised universities doesn’t mean that all we get access to is HE marketing material in the form of course blurbs, and undiscoverable content that can only be accessed in exchange for credentials and personal tracking data.

In the same way that academics have always worked round the journal subscription racket that the libraries were complicit in developing with with academic publishers (if you get a chance, go to UKSG, where publisher reps with hospitality accounts do the schmooze with the academic library folk;-), sharing copies of papers if anyone ever asked, I hope that they do the same with their teaching materials, making them discoverable and sharing the knowledge.

Viewing OpenLearn Mindmaps Using d3.js

In a comment on Generating OpenLearn Navigation Mindmaps Automagically, Pete Mitton hinted that the d3.js tree layout example might be worth looking at as a way of visualising hierarchical OpenLearn mindmaps/navigation layouts.

It just so happens that there is a networkx utility that can publish a tree structure represented as a networkx directed graph in the JSONic form that d3.js works with (networkx.readwrite.json_graph), so I had a little play with the code I used to generate Freemind mind maps from OpenLearn units and refactored it to generate a networkx graph, and from that a d3.js view:

(The above view is a direct copy of Mike Bostock’s example code, feeding from an automagically generated JSON representation of an OpenLearn unit.)

For demo purposes, I did a couple of views: a pure HTML/JSON view, and a Python one, that throws the JSON into an HTML template.

The d3.js JSON generating code can be found on Scraperwiki too: OpenLearn Tree JSON. When you run the view, it parses the OpenLearn XML and generates a JSON representation of the unit (pass the unit code via a ?ucode=UNITCODE URL parameter, for example

The Python powered d3.js view also responds to the unit URL parameter, for example:

The d3.js view is definitely very pretty, although at times the layout is a little cluttered. I guess the next step is a functional one, though, which is to find how to linkify some of the elements so the tree view can act as a navigational surface.

Generating OpenLearn Navigation Mindmaps Automagically

I’ve posted before about using mindmaps as a navigation surface for course materials, or as way of bootstrapping the generation of user annotatable mindmaps around course topics or study weeks. The OU’s XML document format that underpins OU course materials, including the free course units that appear on OpenLearn, makes for easy automated generation of secondary publication products.

So here’s the next step in my exploration of this idea, a data sketch that generates a Freemind .mm format mindmap file for a range of OpenLearn offerings using metadata puled into Scraperwiki. The file can be downloaded to your desktop (save it with a .mm suffix), and then opened – and annotated – within Freemind.

You can find the code here: OpenLearn mindmaps.

By default, the mindmap will describe the learning outcomes associated with each course unit published on the Open University OpenLearn learning zone site.

By hacking the view URL, other mindmaps are possible. For example, we ca make the following additions to the actual mindmap file URL (reached by opening the Scraperwiki view) as follows:

  • ?unit=UNITCODE, where UNITCODE= something like T180_5 or K100_2 and you will get a view over section headings and learning outcomes that appear in the corresponding course unit.
  • ?unitset=UNITSET where UNITSET= something like T180 or K100 – ie the parent course code from which a specific unit was derived. This view will give a map showing headings and Learning Outcomes for all the units derived from a given UNITSET/course code.
  • ?keywordsearch=KEYWORD where KEYWORD= something like: physics This will identify all unit codes marked up with the keyword in the RSS version of the unit and generate a map showing headings and Learning Outcomes for all the units associated with the keyword. (This view is still a little buggy…)

In the first iteration, I haven’t added links to actual course units, so the mindmap doesn’t yet act as a clickable navigation surface, but that it is on the timeline…

It’s also worth noting that there is a flash browser available for simple Freemind mindmaps, which means we could have an online, in-browser service that displays the mindmap as such. (I seem to have a few permissions problems with getting new files onto at the moment – Mac side, I think? – so I haven’t yet been able to demo this. I suspect that browser security policies will require the .mm file to be served from the same server as the flash component, which means a proxy will be required if the data file is pulled from the Scraperwiki view.)

What would be really nice, of course, would be an HTML5 route to rendering a JSONified version of the .mm XML format… (I’m not sure how straightforward it would be to port the original Freemind flash browser Actionscript source code?)

The Learning Journey Starts Here: and OpenLearn Resource Linkage

Mulling over the OU’s OULearn pages on Youtube a week or two ago, colleague Bernie Clark pointed out to me how the links from the OU clip descriptions could be rather hit or miss:

Via @lauradee, I see that the OU has a new offering on is far more supportive of links to related content, links that can represent the start of a learning journey through OU educational – and commentary – content on the OU website.

Here’s a way in to the first bit of OU content that seems to have appeared:

This links through to a playlist page with a couple of different sorts of opportunity for linking to resources collated at the “Course materials” or “Lecture materials” level:

(The language gives something away, I think, about the expectation of what sort of content is likely to be uploaded here…)

So here, for example, are links at the level of the course/playlist:

And here are links associated with each lecture, erm, clip:

In this first example, several types of content are being linked to, although from the link itself it’s not immediately obvious what sort of resource a link points to? For example, some of the links lead through to course units on OpenLearn/Learning Zone:

Others link through to “articles” posted on the OpenLearn “news” site (I’m not ever really sure how to refer to that site, or the content posts that appear on it?)

The placing of content links into the Assignments and Others tabs always seems a little arbitrary to me from this single example, but I suspect that when a few more lists have been posted some sort of feeling about what sorts of resources should go where (i.e. what folk might expect by “Assignment” or “Other” resource links). If there’s enough traffic generated through these links, a bit of A/B testing might even be in order relating to the positioning of links within tabs and the behaviour of students once they click through (assuming you can track which link they clicked through, of course…)?

The transcript link is unambiguous though! And, in this case at least), resolves to a PDF hosted somewhere on the OU podcasts/media filestore:

(I’m not sure if caption files are also available?)

Anyway – it’ll be interesting to hear back about whether this enriched linking experience drives more traffic to the OpenLearn resources, as well as whether the positioning of links in the different tab areas has any effect on engagement with materials following a click…

And as far as the linkage itself goes, I’m wondering: how are the links to OpenLearn course units and articles generated/identified, and are those links captured in one of the stores? Or is the process that manages what resource links get associated with lists and list items on Youtube/edu one that doesn’t leave (or readily support the automated creation of) public data traces?

PS How much (if any( of the linked resource goodness is grabbable via the Youtube API, I wonder? If anyone finds out before me, please post details in the comments below:-)

Asset Stripping OpenLearn – Images

A long time ago, I tinkered with various ways of disaggregating OpenLearn course units into various components – images, audio files, videos, etc. (OpenLearn_XML Asset stripper (long since rotted)). Over the last few weeks, I’ve returned to the idea, using Scraperwiki to trash through the OpenLearn XML (and RSS) in order to build collections out of various different parts of the OpenLearn materials. So for example, a searchable OpenLearn meta-glossary, that generates one big glossary out of all the separate glossary entries in different OpenLearn units, and an OpenLearn learning outcomes explorer, that allows you to search through learning outcomes as described in different OpenLearn courses.

I’ve also been pulling out figure captions and descriptions, so last night I added a view that allows you to preview images used across OpenLearn: OpenLearn image viewer.

There’s a bit of niggle in using the viewer at the moment (as Jenny Gray puts it, “it’ll be a session cookie called MoodleSession in the domain (if you can grab it?)”) which, if you don’t have a current OpenLearn session cookie, requires you to click on one of the broken images in the righthand-most column and then go back to the gallery viewer (at which point, the images should load okay…unless you have some cookie blocking or anti-tracking features in place, which may well break things further:-( )

(If anyone can demonstrate a workaround for me for how to set the cookie before displaying the images, that’s be appreciated…)

To limit the viewed images, you can filter results according to terms appearing in the captions or descriptions or by course unit number.

One thing to note is that although the OpenLearn units are CC licensed, some of the images used in the units (particularly third party images) may not be so liberally licensed. At the moment, there is a disconnect in the OU XML between images and any additional rights information (typically a set of unstructured acknowledgements at the end of the unit XML), which makes a fully automated “open images from OpenLearn” gallery/previewer tricky to knock together. (When I get a chance, I’ll put together a few thoughts about what would be required to support such a service. It probably won’t be much, just an appropriate metadata filed or two…)

PS here’s an example of why the ‘need a cookie to get the image’ thing is really rather crap… I embedded an image from OpenLearn, via a link/url, in a post (making sure to link back to the original page). Good for me – I get a relevant image, I don’t have to upload it anywhere – good for OpenLearn, they get a link back, good for OpenLearn, they get a loggable server hit when anyone views the image (although bad for them in that it’s their server and bandwidth that has to deliver the image).

However, as it stands, it’s bad for OpenLearn because all the users see is a broken link, rather than the image, unless you have a current OpenLearn cookie session already set. The fix for me is more work: download the image, upload it to my own server, and then embed my copy of the image. OpenLearn no longer gets any of the “paradata” surrounding the views on that image, and indeed may never even know that I’m reusing it…

Scraperwiki Powered OpenLearn Searches – Learning Outcomes and Glossary Items

A quick follow up to Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API demonstrating how to reuse that pattern (a little more tinkering is required to fully generalise it, but that’ll probably have to wait until after the Easter wifi-free family tour… I also need to do a demo of a pure HTML/JS version of the approach).

In particular, a search over OpenLearn learning outcomes:

and a search over OpenLearn glossary items:

Both are powered by tables from my OpenLearn XML Processor scraperwiki.

Tinkering With Scraperwiki – The Bottom Line, OpenCorporates Reconciliation and the Google Viz API

Having got to grips with adding a basic sortable table view to a Scraperwiki view using the Google Chart Tools (Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API), I thought I’d have a look at wiring in an interactive dashboard control.

You can see the result at BBC Bottom Line programme explorer:

The page loads in the contents of a source Scraperwiki database (so only good for smallish datasets in this version) and pops them into a table. The searchbox is bound to the Synopsis column and and allows you to search for terms or phrases within the Synopsis cells, returning rows for which there is a hit.

Here’s the function that I used to set up the table and search control, bind them together and render them:

    google.load('visualization', '1.1', {packages:['controls']});


    function drawTable() {

      var json_data = new google.visualization.DataTable(%(json)s, 0.6);

    var json_table = new google.visualization.ChartWrapper({'chartType': 'Table','containerId':'table_div_json','options': {allowHtml: true}});
    //i expected this limit on the view to work?

    var formatter = new google.visualization.PatternFormat('<a href="{0}">{0}</a>');
    formatter.format(json_data, [1]); // Apply formatter and set the formatted value of the first column.

    formatter = new google.visualization.PatternFormat('<a href="{1}">{0}</a>');
    formatter.format(json_data, [7,8]);

    var stringFilter = new google.visualization.ControlWrapper({
      'controlType': 'StringFilter',
      'containerId': 'control1',
      'options': {
        'filterColumnLabel': 'Synopsis',
        'matchType': 'any'

  var dashboard = new google.visualization.Dashboard(document.getElementById('dashboard')).bind(stringFilter, json_table).draw(json_data);


The formatter is used to linkify the two URLs. However, I couldn’t get the table to hide the final column (the OpenCorporates URI) in the displayed table? (Doing something wrong, somewhere…) You can find the full code for the Scraperwiki view here.

Now you may (or may not) be wondering where the OpenCorporates ID came from. The data used to populate the table is scraped from the JSON version of the BBC programme pages for the OU co-produced business programme The Bottom Line (Bottom Line scraper). (I’ve been pondering for sometime whether there is enough content there to try to build something that might usefully support or help promote OUBS/OU business courses or link across to free OU business courses on OpenLearn…) Supplementary content items for each programme identify the name of each contributor and the company they represent in a conventional way. (Their role is also described in what looks to be a conventionally constructed text string, though I didn’t try to extract this explicitly – yet. (I’m guessing the Reuters OpenCalais API would also make light work of that?))

Having got access to the company name, I thought it might be interesting to try to get a corporate identifier back for each one using the OpenCorporates (Google Refine) Reconciliation API (Google Refine reconciliation service documentation).

Here’s a fragment from the scraper showing how to lookup a company name using the OpenCorporates reconciliation API and get the data back:

ocrecURL=''+urllib.quote_plus("".join(i for i in record['company'] if ord(i)<128))
    print ocrecURL,[recData]
    if len(recData['result'])>0:
        if recData['result'][0]['score']>=0.7:

The ocrecURL is constructed from the company name, sanitised in a hack fashion. If we get any results back, we check the (relevance) score of the first one. (The results seem to be ordered in descending score order. I didn’t check to see whether this was defined or by convention.) If it seems relevant, we go with it. From a quick skim of company reconciliations, I noticed at least one false positive – Reed – but on the whole it seemed to work fairly well. (If we look up more details about the company from OpenCorporates, and get back the company URL, for example, we might be able to compare the domain with the domain given in the link on the Bottom Line page. A match would suggest quite strongly that we have got the right company…)

As @stuartbrown suggeted in a tweet, a possible next step is to link the name of each guest to a Linked Data identifier for them, for example, using DBPedia (although I wonder – is @opencorporates also minting IDs for company directors?). I also need to find some way of pulling out some proper, detailed subject tags for each episode that could be used to populate a drop down list filter control…

PS for more Google Dashboard controls, check out the Google interactive playground…

PPS see also: OpenLearn Glossary Search and OpenLearn LEarning Outcomes Search