Appropriating OpenLearn Content and Republishing Edited Versions Of It Via a “Simple” Automated Text Blogging Workflow

I had intended to use my (unpaid) strike days to catch up with some books and harp practice, and maybe even the garden, and keep away from the keyboard; or failing that, to have a push on my rally data tinkering and get another LeanPub book started to try to reboot the £50 a quarter or so my previous publication (Wrangling F1 Data With R) generated, which keeps things like recurring Dropbox and Flickr charges covered (no-one has ever bought me a KoFi, as far as I can tell…).

And I was determined not to do any of the mounting workload associated with the day job, no matter how much fun some of it is likely to be (like getting Ev3devSim working as an ipywidget in Jupyter notebooks).

Whilst I did manage to stick to the "determined not to" path, I never even really started down the intended one, instead spending hours and hours in front of the keyboard trying to hack something together around my OpenLearn publishing workflow.

So here’s what I’ve come up with…

An OpenLearn Unit Text Publishing Thing

Firstly, it’s a thing that lets you grab the “source” content of an OpenLearn unit (at least, some of it; I still haven’t got round to grabbing things like video files or audio files, or scraping PDFs etc.) and churn it into a simple text format, markdown, which looks like this:

Headers are prefixed with a #, you can emphasise things by wrapping them in * characters, eg *italics* -> italics, or **double them up** for strong emphasis. Embedding links — [link text](link/path/file.html) — and images — ![Alt text](path/to/image.file) — is also pretty easy when you get the hang of it.
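For example, a short fragment of markdown (the heading, link and image paths here are made up, purely for illustration) might look like this:

# Session 1: Getting Started

Some text with *emphasis* in it, a [link](https://www.open.edu/openlearn/), and an embedded image:

![An example image](images/example_figure.png)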

So how do you get started? First, you need a Github account (sign up here; just get one: you’re not going to have to do any hard Github stuff, you’re just making use of their free hosting). Get one, and sign in.

Second, visit my demo repo — psychemedia/openlearn-publish-test — (the URL will change at some point, but I’ll archive the original and link the new address from it…) and grab a copy of your own repo from mine by clicking the big green Use this template button:

You’ll be presented with a form:

Give your repo a name (no spaces). Optionally add a description. Keep the repo public. And click the big green Create repository from template button.

Things will churn for a moment or two:

And then you’ll have your own repo, containing a copy of the files in mine:

Behind the scenes, there is work going on…

Click the Actions tab on your copy of the repo to see what…

At first, it may look like nothing's happening… but wait a moment or two and refresh the page:

A couple of actions will start running to initialise, and customise, your repo for you.

When the actions are done, you’ll be informed… (you shouldn’t have to refresh the page, the status indicators should update when things are done…):

If you go back to your repo homepage, you’ll see it’s been updated with a new README that’s slightly different to the original copy from my repo, and that has been personalised to yours:

So… now you can grab some OpenLearn content into your repo.

Click on the SET_UP.md file link in your repo:

You will be presented with a list of units on OpenLearn.

Find one you like the look of and click the Grab Unit into this repo link:

This will open a new issue for you in the Issues tab of your repo, and prepopulate it with a title that will tell a Github Action you want to grab some OpenLearn content, and an issue body that tells the action where the unit can be found.

Click the Submit new issue button to get things started.

Back in the Actions tab, you can see the helper elves have started doing their thing again…

If you click on a running Action, you can check its progress in more detail:

Click through on the actual job name to see what’s happening inside:

You can expand a step by clicking the arrow to see what each step is doing or has already done…

If you read through the steps, you'll see several things are done: for example, we grab some OUXML (the OpenLearn content), convert it to markdown, build some HTML files (these are what gets published) and deploy them, then build a LaTeX version of the material (which is used to generate a PDF) and an ePub ebook. (The LaTeX step takes some time; I should perhaps simplify things so that only the HTML build is done by default.)

When the Actions are green circled / green ticked and done (which may take a few minutes…), or at least, when the Deploy HTML to gh-pages step has run, go back to the repo home page, where you should see a new commit has been made to your repo:

If you click into the content folder you’ll see one or more session folders:

If you click into a session folder, you see some markdown files:

If you click on one of those, you’ll see some scraped and converted OpenLearn content:

So the content has been grabbed from OpenLearn and saved to your repo.

But that’s not all.

If you scroll down on your README page (I really should make this link more prominent in the README…) you’ll see a link to a github.io site published from your repo:

Click it…

If you see a “404” (page not found) error, don’t panic…

On the repo home page, select the Settings tab:

and scroll down to the Github Pages area:

Change the Source from the gh-pages branch to the master branch:

And then set the Source back to the gh-pages branch:

When you see something like this, you know it's all good to go:

Note that caching of a previous build of the site may last for up to 10 minutes, so grab yourself a cup of tea, or perhaps look through the markdown files in the content directory, or even go back to the Actions tab and check whether the actions have completed.

If the Actions have completed, select the OpenLearnXML2 action (or a completed nbsphinx publisher action, if you have committed your own changes to the markdown files) and you should see an Artifacts download available.

Download and unzip the artifacts file. If the build process was able to build a PDF file and/or an ePub file from the content, you'll find them in the unzipped download directory.

Right… time to try your github.io site link again:

An OpenLearn Editing Thing

This will have to be in a part two to this post… I’ve run out of time for now and need to get back to the day job…

If you are itching to get started, this may work, if I’ve got my autopublishing things fixed…

In the content folder (on the default master branch of the repo), find the markdown file you want to edit, and click on the pencil icon to open the editor:


Edit the file / make the changes you want, and commit it (you may want to set a meaningful commit message title summarising the changes, and perhaps even a longer description about the motivation for the changes, but both are optional…):


Click the big green Commit changes button to commit the changes. If you look in the Actions tab, you should see that an nbsphinx publisher action has started that should publish your changes to your site.


Note that even when the publishing action has generated and pushed updated site pages to where they need to be, the site may take a few minutes to update because of page caching on the Github site.

The Future

One of the spinoffs of this for me was the realisation that I could use Github Actions to run arbitrary code in response to particular events, such as commits or issue postings. The current machinery uses a Sphinx / nbSphinx publishing route, but I’ve also started exploring a recipe for Jekyll based Jupyter Book publishing. (Next on the to do list will be an [Executable Book project / MyST workflow](https://ebp.jupyterbook.org/en/latest/); I also need to split out the workflows into actions of their own, but I haven’t figured out how to do that for myself yet.) It strikes me that I could bundle all these in the same repo with some way of flagging which build process I want to use. This would allow the user to republish their material using the publishing tool of their choice, with its various peculiarities, customisations and affordances.

Immediate to dos, which may not happen because I’m the only user, I know it’s possible, and I’m not that interested, are to: make the Github Pages / published site link more prominent in the README; and get movie and audio downloads and embeds working. Also a way of handling PDFs linked from the OpenLearn materials, and perhaps even extracting text from those to support republishing…

On the publish side, it would be useful to be able to publish to HTML only by default, with some optional way of invoking the PDF and ePub builds. The ePub build also needs things like the title and author setting. The PDF build sometimes breaks, eg due to the inability to detect a bounding box size round a gif image; I maybe need to use another PDF generator, eg some hints here.

I also need to refactor the code, two ways: firstly, a simplification that uses the bare minimum of packages and just churns the markdown direct from the XML in one simple step; secondly, fixing the current workflow, which stages the XML in a SQLite database, so that the database can properly handle content from multiple units and I can reliably churn the md from the database for any single unit. At the moment, I think things pretty much assume there’s content from just a single unit in the database. Putting the md into the db might be useful too… Then I could imagine a datasette powered publishing route too…

As a recent tweet from Martin Hawksey reveals, he’s been blogging about how we can turn Google Apps Script to our own purposes as a hosted code runner, and I think Github now provides a similar opportunity for anyone who wants to appropriate it to that end…

Adding Metadata to Google Docs

A couple of months ago I started working on a tool that would export a Google doc in the OU-XML format. The rationale? The first couple of drafts of the teaching material that will be delivered through the VLE in the forthcoming (October 2015) OU Data management and analysis course (TM351) have been prepared in Google docs, and the production process will soon have to move to the Open University’s XML workflow. This workflow is built around an OU-defined schema, often referred to as OU-XML (or OUXML), and is supported by a couple of oXygen XML editor extensions that make it easy to preview rendered versions of the documents in a test VLE site.

The schema itself includes several elements that are more akin to metadata elements than actual content – things like the course code, course title, for example, or the byline (or lead author) of a particular unit.

Support for a small amount of metadata is provided by Google Drive, but the only easily customisable element is a free text description element.


So whilst patching a couple of “issues” today with the Google Docs to OU-XML generator, and adding a menu option that allows users to create a zip file in Google Drive that contains the OU-XML and any associated image files for a particular Google doc, I thought it might also be handy to add some support for additional metadata elements. Google Drive apps support a Properties class that allows metadata properties represented as key-value pairs to be associated with a particular document, user or script. Google Apps Script can be used to set and retrieve these properties. In addition, Google Apps Script can be used to generate templated HTML user interface forms that can be used to extend Google docs or spreadsheets functionality.

In particular, I created a handful of Google Apps Script functions to pop up a templated panel, save metadata descriptions entered into the metadata form as document properties, and retrieve the value of a particular metadata element.

//Pop up the metadata edit/display panel
//The panel content is created as a templated HTML document
function metadataView() {
  // Generate the HTML from the 'metadata' template file
  var html = HtmlService
      .createTemplateFromFile('metadata')
      .evaluate()
      .setSandboxMode(HtmlService.SandboxMode.IFRAME);
  //Pop up a panel and render the HTML describing the metadata form inside it
  DocumentApp.getUi().showModalDialog(html, 'Metadata');
}

//This function sets the document properties from the metadata form elements
function processMetadataForm(theForm) {
  var props = PropertiesService.getDocumentProperties();
  //Process each form element (atm, they are just input text elements)
  for (var item in theForm) {
    props.setProperty(item, theForm[item]);
    Logger.log(item + ':::' + theForm[item]);
  }
}

The templated HTML form is configured using a set of desired metadata elements. Each element is described using a label that is displayed in the form, an attribute name (which should be a single word) and an optional default value. The template also demonstrates how we can call a server side Apps Script function from the dialogue using the google.script.run.FUNCTION_NAME construction.

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
  
<? 
//Add metadata fields here in the following format:
//[Label, a unique identifier (unique word, no spaces or punctuation), an optional default value]
var metadataItems =[
    ["Lead Author","leadAuthor"],
    ["Course Code","courseCode"],
    ["Course Title","courseTitle"],
    ["Unit Title","unitTitle"],
    ["Rendering","rendering","VLE2 staff (learn3)"]
]
?>
  
<? var metadata = PropertiesService.getDocumentProperties() ?>
<script>
//When the metadata has been successfully saved as document properties
//  close the metadata form panel
function onSave() {google.script.host.close()}
</script>
  
<form id='metadataForm'>
<!-- Construct a set of form elements, one for each metadata item -->
<? for (var i = 0; i < metadataItems.length; i++) { ?>
  <div><?= metadataItems[i][0] ?>: 
    <input type="text"
      name = "<?= metadataItems[i][1] ?>"
      <? var val = '';
         if (metadataItems[i].length > 2) val = metadataItems[i][2]; ?>
      value="<?= metadata.getProperty(metadataItems[i][1]) ? metadata.getProperty(metadataItems[i][1]) : val ?>"
    /> 
  </div>
<? } ?>
    
</form>
  
<div>
  <input
    type="button"
    value="Save & Close"
    onclick="google.script.run.withSuccessHandler(onSave).processMetadataForm(document.getElementById('metadataForm'))"
  />
  
  <input
    type="button"
    value="Cancel"
    onclick="google.script.host.close()"
  />
</div>

When the metadataView() function is called from the Add-Ons menu, it pops up a dialogue that looks (in unstyled form) something like this:


Metadata elements are loaded into the form if they exist; otherwise, a default value is used if one is specified.

When generating the export OU-XML, a helper function grabs the value of the relevant metadata element from the document properties. This value can then be inserted into the OU-XML at the appropriate point.

//A helper function to retrieve a particular metadata element
//This is called when generating the OU-XML export
function getProp(key) {
  var props = PropertiesService.getDocumentProperties();
  return props.getProperty(key) ? props.getProperty(key) : '';
}

var COURSECODE = getProp('courseCode');

One issue with this approach is that if we have lots of documents relating to different units for the same course, we may need to enter the same values for several metadata elements across each document (for example, the course code and course title). Unfortunately, Google Drive does not support arbitrary properties for folders. One solution, suggested by Tom Smith/@everythingabili, was to use the description element for a folder to store JSON-represented metadata. I think we could actually simplify that, using a line-based representation or a simple delimited representation that we can easily split on, something like:

courseCode :: TM351;;
courseTitle:: Data Management and Analysis

for example. We could then split on ;; to get each pair, strip whitespace, split on :: and strip whitespace again to get the key:value elements for each metadata item.
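By way of illustration, a minimal Apps Script parser for that representation (an untested sketch) might look something like this:

//Parse a folder description of the form "key :: value;; key :: value"
//into an object of key:value metadata pairs
function parseFolderMetadata(description) {
  var metadata = {};
  var pairs = description.split(';;');
  for (var i = 0; i < pairs.length; i++) {
    var parts = pairs[i].split('::');
    if (parts.length == 2) {
      metadata[parts[0].trim()] = parts[1].trim();
    }
  }
  return metadata;
}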


I guess one way of getting the folder description, given a particular document as a starting point, is to find the parent folder (using file#getParents(), perhaps?) and then call folder#getDescription()?
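Something like the following might do it (again, an untested sketch, reusing the parser above):

//Given the active document, look up its (first) parent folder's description
//and parse it into metadata key:value pairs
function getFolderMetadata() {
  var file = DriveApp.getFileById(DocumentApp.getActiveDocument().getId());
  var folders = file.getParents();
  if (folders.hasNext()) {
    return parseFolderMetadata(folders.next().getDescription());
  }
  return {};
}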

Another approach might be to have a dummy, canonically named file in each folder (metadata, for example) that we add metadata to; then, whenever we open a new file in the folder, we look for the metadata file, get its metadata property values, and use those to seed the metadata values for our content document.

Finally, it’s maybe worth also pondering the issue of generating the OU-XML export for all the documents within a given folder. One way to do this might be to create a function, callable from each document, that will find the parent folder, find all the files (except, perhaps, a metadata file?!) in that folder, and then run the OU-XML generator over all of them, bundling them up into a single zip file, perhaps with a directory structure that puts the OU-XML for each document, along with any image files associated with it, into separate folders?

Only it probably isn’t… I suspect that the migration to the OU-XML format, if it hasn’t already happened, will involve copying and pasting…
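If it were worth automating, though, a first sketch might look something like the following, where generateOUXML() is a hypothetical stand-in for the real per-document OU-XML generator (assumed here to return the XML as a Blob):

//Sketch only: export OU-XML for every Google doc in this doc's parent folder
function exportFolderToOUXML() {
  var thisFile = DriveApp.getFileById(DocumentApp.getActiveDocument().getId());
  var folder = thisFile.getParents().next();
  var files = folder.getFilesByType(MimeType.GOOGLE_DOCS);
  var blobs = [];
  while (files.hasNext()) {
    var file = files.next();
    if (file.getName() == 'metadata') continue; //skip any dummy metadata file
    //generateOUXML() is hypothetical; it would wrap the existing generator
    blobs.push(generateOUXML(file).setName(file.getName() + '.xml'));
  }
  //Bundle the lot into a single zip file alongside the source docs
  folder.createFile(Utilities.zip(blobs, folder.getName() + '_ouxml.zip'));
}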

PS for completeness, the menu option can be installed as follows:

function onOpen(e) {
  DocumentApp.getUi().createAddonMenu()
      .addItem('Metadata','metadataView')
      .addToUi();
}