Quick Way in to Hacking Legacy OU Course Materials Using Markdown

By some arcane process, OU course materials authored typically in MS Word are converted to an XML format (OU-XML) and then rendered variously to HTML for the Moodle website, ebook formats, and perhaps PDF (we don’t want to make it too easy for students to print of the materials…).

An internal project that ran for a couple of years (maybe a bit more) looking at more direct authoring workflows was shelved earlier this year. (I was banned from blogging about it whilst it was under development, so I’m afraid I don’t have screen shots to show what it looked like from the time I was given preview access.) As far as I know, the authoring tool was completely distinct from the one developed by the OU’s bastard offspring that is FutureLearn. Nowt like sharing.

One of the things I’m slated to do over the next few months is update, or possibly rewrite, a unit in a first year equivalent module.

My preferred way of authoring for some time has been to keep it simple and just use markdown.

So that’s what I’m probably going to do.

If there’s any griping or sniping that it doesn’t fit the OU workflow, I’ll just run it through pandoc to generate an MS Word docx version and hand that over.

(I’ve been saying *for years* we should have pandoc read/write filters for OU-XML (the most recent notes are here). It would have been a damn site cheaper than the aborted authoring tool project and would have allowed authors to explain a whole range of tools for creating their warez, with pandoc handling the conversion to OU-XML. And yes, I f**king know that some hand cleaning of the OU-XML would almost certainly have been required but we’d have got a far better feeling for what sorts of document structures folk produce if they were allowed to use the tools that suit them. And authors’ shonky mark-up (including my own) *always* needs some fettling anyway: we already know that…)

So… markdown…

If I’m going to revise the current materials, I need to get them out of the current format and into markdown. I’ve previously started looking at an XSLT to convert OU-XML to markdown, eg as described in Fragment – OpenLearn Jupyter Books Remix; a copy of the current-ish XSLT, and some code fragments to grab and convret an example OU-XML document, can be found here.

But today, I thought of an even scruffier and quicker way…

Within the VLE, a single OU-XML source document is rendered across multiple HTML pages, along  with a navigation index:

A single HTML page view (for easier printing) is also available… Hmmm…there are plenty of HTML2markdown converters out there, aren’t there?

#!pip3 install markdownify
from bs4 import BeautifulSoup
from markdownify import markdownify as md

with open('Robotics study week 1 – Introduction_ View as single page.html', 'r') as f:
    # Let's just grab the HTML body...
    tree = BeautifulSoup(f.read(), 'lxml')
    body = tree.body
    txt = md(str(body))
with open('week1-mardownify.md','w') as f:
    # There'll still be script tag cruft, videos won't be embedded / linked etc
    # but it's enough to get started with and the diffs should be easy to see...

The output is a bit flakey in parts, but most of the stuff I need is there.  Certainly, there’s more than enough of it in useable form for me to start using as an outline. Indeed, much of the work will be ripping out and replacing the huge chunks of content that are now rather dated.

I can also edit the markdown in a notebook environment using Jupytext, using metadata cells to highlight certain blocks of content with additional structural or semantic metadata, saving the metadata into the markdown document from where it could be processed (I’m not sure how it would turn up if the enhanced markup were converted to docx using pandoc, for example?).

From what I saw of the aborted OpenCreate editor, it used a block/cell style metaphor for creating separate content elements within a page, so it’d also be interesting to compare the jupytext/metadata enhanced markdown, or even the notebook ipynb output format, with the OpenCreate document format / representation to see whether there are similarities in the block level semantic / structural markup.