Republishing OpenLearn Materials In Markdown – Next Steps Taken…

Following on from yesterday’s post, I made a little more progress today trying to sort out a workflow.

First up I had a look at my binder-base-boxes to see if I could automate the building of those using repo2docker. It seems I can and there is an example build at binder-examples/continuous-build as referenced from the repo2docker docs: Using repo2docker as part of your Continuous Integration.

I needed to make a slight tweak to the CircleCI config to allow pushing containers built in repo branches to DockerHub, but it was easy enough to spot where (removing the lines that limited builds to only run in master). There’s also a slight complication in that my choice of Github repo name has a - in it, and that symbol is disallowed in DockerHub repo names; so rather than just lazily use the repo orgname when pushing the image, I had to set another org name (without the -) as an env var in my CircleCI project profile that the script could pull on (support for this is built in to the script). I also added a tweak to the container naming to use the branch name as the container image tag. There’s an example box here: binder-base-boxes:chemistry, though I haven’t tried to use it as part of a CircleCI build yet… (I guess I need to check it includes the CircleCI-required packages…) The associated DockerHub repo is here.

So that’s one dangling jigsaw piece…

I also created a template repo for publishing Github Pages sites using nbsphinx under CircleCI. This should have all you need to get going: dump a load of .md files into a repo and it is then automatically published under CircleCI to Github Pages. (Actually, I probably need to add a few docs to the README…) There’s an example repo here — a markdown version of the OpenLearn course The molecular world — and the site is here: The molecular world – OpenLearn Reimagined.

Next on the to do list:

  • automatically generate a simple index.rst file;
  • sort out image dereferencing for nested directories (path to a common image dir);
  • put together a reusable script or CLI tool that, given an OpenLearn course URL, downloads the OU-XML source of the module and generates a set of markdown documents from it, with dereferenced image links.

What this would then do is make it easy for anyone to convert an OpenLearn course that has a source OU-XML document to an equivalent markdown source site that can be automatically republished as an HTML site and that they can edit directly in the markdown source on Github.
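To make that concrete, here’s the sort of skeleton I have in mind for the CLI (a sketch only: the helper names are made up, and the OU-XML lookup and XSLT/markdown-generation steps are stubbed out pending the handcranked code I already have):

    import click
    import requests

    def ouxml_url_for(course_url):
        # Hypothetical helper: resolve an OpenLearn course URL to the URL
        # of its OU-XML source document.
        raise NotImplementedError

    def ouxml_to_markdown(xml_text, outdir):
        # Hypothetical helper: apply the XSLT / post-processing and write
        # one markdown file per section into outdir, dereferencing image links.
        raise NotImplementedError

    @click.command()
    @click.argument("course_url")
    @click.option("--outdir", default="md_src", help="Where to write the markdown files")
    def cli(course_url, outdir):
        """Download the OU-XML for an OpenLearn course and generate markdown from it."""
        xml_text = requests.get(ouxml_url_for(course_url)).text
        ouxml_to_markdown(xml_text, outdir)

    if __name__ == "__main__":
        cli()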

The other major workflow issue I need to sort out is how best to manage “Binder” environments required to execute documents via Jupytext as part of the nbsphinx publishing step. (The chemistry base box takes quite a long time to build, for example, so if it’s used to build pages as part of an nbsphinx workflow it would be good to be able to pull a cached build in CircleCI (I really need to get my head round CircleCI cacheing) or use a prebuilt Docker image.)

There’s also some thinking that needs doing about the differences between a publishing step where a notebook is executed and generates, for example, some HTML/JS that can be embedded and work standalone as an interactive on Github Pages, vs. interactive widgets that need a Jupyter server on the back end to work. I’ve already spotted at least one opportunity for recasting an ipywidgets decorated function that generates views over different 3D molecules as a simple “pure” JS display that works without the need for the py function on the backend. Related to this I need to explore ThebeLab and nbinteract support in nbsphinx. If anyone has demos, please share… :-)

OER Text Publishing Workflows Rooted on OpenLearn OU-XML Via Github, CircleCI and Github Pages Using Jupytext and nbSphinx

Slowly, slowly, my recipes are coming together for generating markdown from OU-XML sourced, variously, from modules on the OU VLE and units on OpenLearn.

The code needs a couple more passes but at some point I should be able to pull a simple CLI together (hopefully!). I’m still manually running some handcranked steps spread across a couple of notebooks at the moment :-(

So… where am I currently at?

First up, I have chunks of code that can generate markdown from OU-XML, sort of. The XSLT is still a bit ropey (lists were occasionally broken, repeating the text, for example [FIXED]) and the image link reconciliation for OpenLearn images doesn’t work, although I may have a way of accessing the images directly from the OU-XML image paths. (There could still be image rights issues if I was archiving the images in my own repo, which perhaps counts as a redistribution step…?)

The markdown can be handled in various ways.

Firstly, it can be edited/viewed as markdown. Chatting to colleague Jon Rosewell the other day, I realised that JupyterLab provides one way of editing and previewing markdown: in the JupyterLab file browser, right click on an .md file and you should be able to preview it:

There is also a WYSIWYG editor extension for JupyterLab (which looks like it may enter core at some point): Jupyter Scribe / jupyterlab-richtext-mode.

If you have Jupytext installed, then clicking on an .md file in the notebook tree browser opens the document in a Jupyter notebook editor, where markdown and code cells can be edited separately. An .ipynb file can then be downloaded from the notebook editor, and/or Jupytext can be used to pair markdown and .ipynb docs from the notebook file menu if you install the Jupytext notebook extension. Jupytext can also be called on the command line to convert .md to .ipynb files. If the markdown file is prefaced with Jupytext YAML metadata (i.e. if the markdown file is a “Jupytext markdown” file), then notebook metadata (which includes cell tags, for example) is preserved in the markdown and can be used for round-tripping between markdown and notebook document formats. (This is handy for RISE slideshows, for example; the slide tags are preserved in the markdown so you can edit a RISE slideshow as a markdown document and then present it via Jupytext and a notebook server.)
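The same conversion can also be scripted from Python rather than via the command line; a minimal sketch using the jupytext package’s read/write API (the filenames are just examples):

    import jupytext

    # Read a (Jupytext) markdown document into an in-memory notebook object...
    nb = jupytext.read("01_getting_started.md")

    # ...and write it back out as an .ipynb file; notebook metadata declared
    # in the YAML header (cell tags etc.) survives the round trip.
    jupytext.write(nb, "01_getting_started.ipynb")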

In a couple of simple tests I tried, the .ipynb generated from markdown using Jupytext seemed to open okay in the new Netflix Polynote notebook application (early review). This is handy, because Polynote has a WYSIWYG markdown editor… So for anyone who gripes that notebooks are too hard because writing markdown is too hard, this provides an alternative.

I also note that the wrong code language has been selected (presumably a default applied in the absence of any specified language?). So I need to make sure I do tag code cells with a default language somehow… I wonder if Jupytext can do that?
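One possible workaround, if Jupytext doesn’t handle it already, would be to patch a default kernelspec into the notebook metadata as part of the conversion step; a rough sketch, with the kernel details here as assumptions:

    import jupytext

    nb = jupytext.read("example.md")

    # If the markdown header doesn't declare a kernel, set a Python default so
    # downstream tools (Polynote, nbsphinx, etc.) pick up the right code language.
    nb.metadata.setdefault("kernelspec", {
        "name": "python3",
        "display_name": "Python 3",
        "language": "python",
    })

    jupytext.write(nb, "example.ipynb")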

Having a bunch of markdown documents, or notebooks derived from markdown documents using Jupytext, is one thing: it provides a set of documents that can be easily edited and interacted with, albeit in the context of a Jupyter notebook server.

However, we can also generate HTML websites based on those documents using tools such as Jupyter Book and nbsphinx. Jupyter Book uses a Jekyll engine to build HTML sites, which is a bit of a pain (I noted a demo here that used CircleCI to build a site from notebooks and md using Jupyter Book), but the nbsphinx Python package that extends the (also pip installable) Sphinx documentation engine is a much easier proposition…

As a proof-of-concept demo, the ouseful-oer/openlearn-learntocode repo contains markdown files generated from the OpenLearn Learn to code for data analysis course.

Whenever the master branch on the repository is updated, CircleCI kicks in and uses nbsphinx to build a documentation site from the markdown docs and pushes them to the repository’s gh-pages branch, which makes the site available via Github Pages: “Learn To Code…” on Github Pages.

What this means is that I should be able to edit the markdown directly via the Github website, or using an online editor such as prose.io connected to my Github account, commit changes and then let CircleCI rebuild the site for me.

(I’m pretty sure I haven’t set things up as efficiently as I could in terms of CI; what I would like is for only things that have changed to be rebuilt, but as it is, everything gets rebuilt (although the installed Python environment should be cached?).) Hints / tips / suggestions about improving my CircleCI config.yml file would be much appreciated…

At the moment, nbsphinx is set up to run .md files through Jupytext to convert them to .ipynb, which nbsphinx then eventually churns back to HTML. I’ve also disabled code cell execution in the current set up (which means the routing through .ipynb in this instance is superfluous – the site could just be generated from the .md files). But the principle is there: at the flick of a switch, the code cells could be executed and their outputs immortalised in the published site HTML.
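For reference, the relevant bits of the Sphinx conf.py look something like the following (a sketch based on my reading of the nbsphinx docs; the jupytext format string may need tweaking for a particular markdown flavour):

    # conf.py (extract)
    extensions = ["nbsphinx"]

    # Route .md files through jupytext on their way to .ipynb
    nbsphinx_custom_formats = {
        ".md": ["jupytext.reads", {"fmt": "md"}],
    }

    # Flip this to "auto" or "always" to execute code cells at build time
    nbsphinx_execute = "never"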

So… what next?

I need to automate the production of the root index file (index.rst) so that the table of contents is built from the parsed OU-XML. I think Sphinx handles navigation menu nesting based on header levels, which is a bit of a pain in the demo site. (It would be nice if there were a Sphinx trick that lets me increase the de facto heading level for files in a subdirectory so that in the navigation sidebar menu each week’s content could be given its own heading and then the week’s pages listed as child pages within that. Is there such a trick?)
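Generating the index itself shouldn’t be too hard; something along these lines, perhaps (a rough sketch, assuming one markdown file per page sitting in per-week subdirectories below the Sphinx source directory):

    from pathlib import Path

    def write_index(srcdir=".", title="OpenLearn Reimagined"):
        # Collect the markdown pages, in whatever layout the download step left them
        pages = sorted(p.with_suffix("").as_posix()
                       for p in Path(srcdir).glob("**/*.md"))
        lines = [title, "=" * len(title), "",
                 ".. toctree::",
                 "   :maxdepth: 2", ""]
        lines += ["   " + page for page in pages]
        Path(srcdir, "index.rst").write_text("\n".join(lines) + "\n")

    write_index()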

Slowly, slowly, I can see the pieces coming together. A tool chain looks possible that will:

  • download OU-XML;
  • generate markdown;
  • optionally, cast markdown as notebook files (via jupytext);
  • publish markdown / (un)executed notebooks (via nbsphinx).

A couple of next steps I want to tack on to the end as and when I get a chance and top up my creative energy levels: firstly, a routine that will wrap the published pages in an electron app for different platforms (Mac, Windows, Linux); secondly, publishing the content to different formats (for example, PDF, ebook) as well as HTML.

I also need to find a way of adding interaction — as Jupyter Book does — integrating something like ThebeLab or nbinteract buttons to support in-page code execution (ThebeLab) and interactive widgets (nbinteract).

Fragment: Transit Mapping

Noticing @edent’s take on a semantic tube map using data from wikidata, I started wondering (again?) about transit map layout engines.

There’s theory behind it (eg Martin Nöllenburg’s Automated Drawing of Metro Maps (2005)) and examples of folk having built tools to support automated layout (eg Automatic layout of schematic transit maps and Transportation maps – creation by optimisation), but I haven’t found a layout engine package that I can make use of (something that plays nice with networkx, and perhaps complements osmnx for getting data out of OpenStreetMap, would be nice… Perhaps even a netwulf filter for laying out transit maps via a GUI?).

Poking around, public-transport/generating-transit-maps links to a couple of repos that style an optimised graph for a couple of German transit routes. There’s a couple of links to possible optimisers / layout engines: this solution in Julia — dirkschumacher/TransitmapSolver.jl — and this nodejs application — juliuste/transit-map. The latter uses a commercial solver, Gurobi, but this post on Optimization Modeling in Python: PuLP, Gurobi, and CPLEX shows equivalent solutions to a (different) optimisation problem using both Gurobi and a free Python solver, PuLP. So it might be straightforward enough to create a Py equivalent of the nodejs solver?
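As a quick sanity check that the free-solver route is at least workable, here’s a toy PuLP fragment — emphatically not a metro map layout algorithm — that just snaps a couple of made-up station coordinates onto a unit grid while minimising total displacement, the kind of sub-problem such a layout engine has to solve:

    import pulp

    stations = {"A": (0.2, 1.7), "B": (2.9, 1.1)}   # made-up coordinates
    spacing = 1.0                                    # grid spacing

    prob = pulp.LpProblem("snap_to_grid", pulp.LpMinimize)
    snapped, errors = {}, []
    for name, (x, y) in stations.items():
        # integer grid indices for the snapped position
        gx = pulp.LpVariable("gx_" + name, cat="Integer")
        gy = pulp.LpVariable("gy_" + name, cat="Integer")
        # auxiliary variables modelling the absolute displacement
        dx = pulp.LpVariable("dx_" + name, lowBound=0)
        dy = pulp.LpVariable("dy_" + name, lowBound=0)
        prob += gx * spacing - x <= dx
        prob += x - gx * spacing <= dx
        prob += gy * spacing - y <= dy
        prob += y - gy * spacing <= dy
        snapped[name] = (gx, gy)
        errors += [dx, dy]

    prob += pulp.lpSum(errors)   # objective: total displacement
    prob.solve()

    for name, (gx, gy) in snapped.items():
        print(name, gx.value() * spacing, gy.value() * spacing)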

Here’s a more recent package, again in node: gipong/automatic-metro-map.

As far as styling goes, there look to be various things out there. For example, this d3-tube-map, and another: d3-tube. And an HTML5 canvas solution.

Not quite what I had in mind re: layouts, but… wha’….? The London tube network as a git graph?!

In passing, having got a map, animating it might be nice… Here’s an old animation package — vasile/transit-map — that might help with that? Or maybe something lifted from this transport routing demo (the Cesium variant does animations along a route I think? More here).

For some 3D relief, harp.gl; and though not directly relevant, I do still like these ridge maps.

Querying DBPedia Linked Data From Jupyter Notebooks – Music Genres Related to Heavy Metal and Music Venues in England

Some time ago I did some examples of querying DBPedia to find related music genres (Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi) as well as other sorts of influence networks (eg programming languages).

After visiting the Black Sabbath exhibition in Birmingham recently (following the awesome Dawn After Dark gig supporting Balaam and the Angel) which had the most dubious of “metal” relationship maps on display in the shop, I thought I’d see how Wikipedia, via DBpedia, mapped that area out.

Gephi’s getting a bit long in the tooth now (netwulf is starting to look handy as a tool for styling networkx graphs; works in Jupyter notebooks too… more on this in another post…) and my original recipe seems to have broken (plus WordPress keeps crapping on the code, removing angle brackets etc), so I started scribbling notes around a recipe for trying to map band genres; it’s ages since I’ve had to try writing SPARQL queries, the notes are very scrappy / fragmentary, and some of the queries are quite possibly nonsense; but FWIW, you can find the notes here: Linked Data bands. I’ll try to revisit them and produce some tidier recipes at some point…

I still used Gephi to render the network, though (this was before I found netwulf…). As an example, here’s a map of genres related to Heavy metal music [original svg].

I also started wondering about what other live music related things I might be able to dredge up out of DBPedia queries. One of the categories used to tag entities in DBPedia is Music_venues_in_England, and from that, music venues in other, smaller locales; venues are also tagged with geo-coordinates (latitude and longitude values), so we can quite easily run a query for music venues in England and from that generate a map.
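The query itself is along these lines (a sketch using the SPARQLWrapper package against the public DBPedia endpoint; the exact category and property choices may need double-checking against the notebook linked below):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbc: <http://dbpedia.org/resource/Category:>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?venue ?label ?lat ?long WHERE {
      ?venue dct:subject dbc:Music_venues_in_England ;
             rdfs:label ?label ;
             geo:lat ?lat ;
             geo:long ?long .
      FILTER (lang(?label) = "en")
    }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    for r in results["results"]["bindings"]:
        print(r["label"]["value"], r["lat"]["value"], r["long"]["value"])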

An example notebook is here: Venues – Linked Data.ipynb. A preview of the map can be found rendered from the gist here.

As I’ve noted previously (an insight I think Martin Hawksey first made me grok fully), visualisations like this can be great for spotting errors, or gaps, in datasets. For example, Southampton has several other excellent venues aside from the Joiners Arms, and on the Isle of Wight, we have Strings as our local indie venue. [Actually, that might be the wrong version of the map; several venues that I thought were in the map aren’t, as I recall, on that map…]

PS this is neat, from Terence Eden / @edent, a semantic take on the London Tube Map using data from Wikidata, with lines relating topical categories and stations to people’s names: The Great(er) Bear – using Wikidata to generate better artwork. [Picking up on that, some notes on transit mapping.]

Convert Notebook to ThebeLab HTML

A watched issue on Github reminded me of something I’d forgotten I’d started to look at — an nbconvert template to convert an .ipynb file to an HTML page that could execute the code against a specified MyBinder provided environment.

(Apparently, the Jupyter Book jupyter-book page path/to/notebook.ipynb command should achieve much the same thing, though I’m not sure where you have to set the Binder repo URL. In a config file?)

FWIW, here’s the nbconvert template I’d started to sketch.

{% extends 'full.tpl'%}

{% block header %}

{{ super() }}

<!-- ThebeLab configuration and loader -->
<script type="text/x-thebe-config">
  {
    requestKernel: true,
    binderOptions: {
      repo: "binder-examples/requirements",
    },
  }
</script>
<script src="https://unpkg.com/thebelab@latest/lib/index.js"></script>

{% endblock header %}


{%- block body %}
<button id="activateButton">Activate</button>
<script>
// Bootstrap ThebeLab (i.e. request a Binder kernel) when the button is clicked
var bootstrapThebe = function() {
    thebelab.bootstrap();
}

document.querySelector("#activateButton").addEventListener('click', bootstrapThebe)
</script>

{{ super() }}
{%- endblock body %}

It doesn’t quite work at the moment because the <pre> code tag doesn’t carry the correct attributes (though a proposed patch to thebelab may address that).

Chris Holdgraf uses Javascript to rewrite the tags dynamically in a related Jupyter Book template:

            // Find all code cells, replace with Thebelab interactive code cells
            const codeCells = document.querySelectorAll('.input_area pre')
            codeCells.forEach((codeCell, index) => {
                codeCell.setAttribute('data-executable', 'true')
                // Figure out the language it uses and add this too
                var parentDiv = codeCell.parentElement.parentElement;
                var arrayLength = parentDiv.classList.length;
                for (var ii = 0; ii < arrayLength; ii++) {
                    var parts = parentDiv.classList[ii].split('language-');
                    if (parts.length === 2) {
                        // If found, assign dataLanguage and break the loop
                        var dataLanguage = parts[1];
                        break;
                    }
                }
                codeCell.setAttribute('data-language', dataLanguage)
                // If the code cell is hidden, show it
                var inputCheckbox = document.querySelector(`input#hidebtn${codeCell.id}`);
                if (inputCheckbox !== null) {
                    setCodeCellVisibility(inputCheckbox, 'visible');
                }
            });

This can be put in a <script> tag immediately before the Jinja {%- endblock body %} directive.

Example gist here.

Track this issue for more…

News: Arise All Ye Notebooks

A handful of brief news-y items…

Netflix Polynote Notebooks

Netflix have announced a new notebook candidate, Polynote [code], capable of running polyglot notebooks (scala, Python, SQL) with fixed cell ordering, variable inspector and WYSIWYG text authoring.

At the moment you need to download and install it yourself (no official Docker container yet?) but from the currently incomplete installation docs, it looks like there may be other routes on the way…

The UI is clean, and whilst perhaps slightly more cluttered than vanilla Jupyter notebooks it’s easier on the eye (to my mind) than JupyterLab.

Cells are code cells or text cells, the text cells offering a WYSIWYG editor view:

One of the things I note is the filetype: .ipynb.

Code cells are sensitive to syntax, with a code completion prompt:

I really struggle with the code completion. I can’t write import pandas as pd RETURN because that renders as import pandas as pandas. Instead I have to enter import pandas as pd ESC RETURN.

Running cells are indicated with a green sidebar to the cell (you can get a similar effect in Jupyter notebooks with the multi-outputs extension):

I couldn’t see how to connect to a SQL database, nor did I seem to get an error from running a presumably badly formed SQL query?

The execution model is supposed to enforce linear execution, but I could insert a cell after an unrun cell and get an error from it (so the execution model presumably isn’t “run all cells above”, either literally or based on analysis of the program’s abstract syntax tree?).

There is a variable inspector, although rather than showing or previewing cell state, you just get a listing of variables and then need to click through to view the value:

I couldn’t see how to render a matplotlib plot:

The IPython magic used in Jupyter notebooks throws an error, for example:

This did make me realise that cell lines are numbered on one side and there’s a highlight shown on the other side indicating which line errored. I couldn’t seem to click through to raise a more detailed error trace though?

On the topic of charts, if you have a Vega chart spec, you can paste that into a Vega spec type code cell and it will render the chart when you run the cell:

The developers also seem to be engaging with the “open” thing…

Take it for a spin today by heading over to our website or directly to the code and let us know what you think! Take a look at our currently open issues and to see what we’re planning, and, of course, PRs are always welcome!

Streamlit.io

Streamlit.io is another new not-really-a-notebook alternative, pip installable and locally runnable. The model appears to be that you create a Python file and run the streamlit server against that file. Trying to print("Hello World") doesn’t appear to have any effect — so that’s a black mark as far as I’m concerned! — but the display is otherwise very clean.

Hovering top right will raise the context menu (if it has timed out and closed itself), showing whether the source file has recently been saved but not rerun, or allowing you to always rerun the execution each time the file is saved.

I’m not sure if there’s any cacheing of steps that are slow to run if associated code hasn’t changed up to that point in a newly saved file.

Ah, it looks like there is…

… and the docs go into further detail, with the use of decorators to support cacheing the output of particular functions.
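From a skim of those docs, the pattern looks something like the following (a sketch; the data URL is made up, and the decorator is as described in the streamlit docs at the time of writing):

    import pandas as pd
    import streamlit as st

    @st.cache   # re-runs only if the arguments (or the function body) change
    def load_data(url):
        return pd.read_csv(url)

    df = load_data("https://example.com/data.csv")   # hypothetical URL
    st.title("Cached data demo")
    st.write(df.describe())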

I need to play with this a bit more, but it looks to me like it’d make for a really interesting VS Code extension. It also has the feel of Scripted Forms (as was; a range of widgets are available in streamlit as UI components), and of R’s Shiny application framework. It also feels like something you could probably do in JupyterLab, perhaps with a bit of Jupytext wiring.

In a similar vein, a package called Handout also appeared a few weeks ago, offering the promise of “[t]urn[ing] Python scripts into handouts with Markdown comments and inline figures”. I didn’t spot it in the streamlit UI, but it’d be useful to be able to save or export the rendered streamlit document, eg as an HTML file, or even as an .ipynb notebook with run cells, rather than having to save it via the browser save menu?

Wolfram Notebooks

Wolfram have just announced their new, “free” Wolfram Notebooks service, the next step in the evolution of Wolfram Cloud (announcement, review), I guess? (I scare-quote “free” because, well, Wolfram; you’d also need to carefully think about the “open” and “portable” aspects…)

Actually, I did try to have a play, but I went to the various sites labelled as “Wolfram Notebooks” and I couldn’t actually find a 1-click get started (at all, let alone for “free”) link button anywhere obvious?

Ah… here we go:

[W]e’ve set it up so that anyone can make their own copy of a published notebook, and start using it; all they need is a (free) Cloud Basic account. And people with Cloud Basic accounts can even publish their own notebooks in the cloud, though if they want to store them long term they’ll have to upgrade their account.

Fragment: Indexing Local Jupyter Notebooks for Search

It’s been some time since I last explored this (eg here and here), and as far as I know no other solutions have appeared since, but a question still remains as to how to effectively search over a set of notebooks.

Partial alternative solutions maybe worth noting include:

  • nbscan for searching over notebooks from the command-line;
  • nbgallery bakes in Solr/sunspot (it’d be really nice if the nbgallery search tools could be easily decoupled so the search could be added to an arbitrary Jupyter notebook, or JupyterHub, server as an extension…);
  • this simple search engine with autocomplete by Simon Willison.

There is also the lunr based search of Jupyter Book (related issue). (The more recent elasticlunr Javascript search engine also looks interesting… perhaps even more so than lunr.js…)

One of the things I often wondered about in respect of building a notebook search engine index would be how to crawl / index freshly updated notebooks.

One way would presumably be to regularly crawl the directory path in which notebooks live looking for notebook files that have a changed timestamp compared to the last time they were indexed; another might be to set up some sort of watcher on the operating system that calls the indexer whenever it spots a file being updated (maybe something like fswatch?).
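By way of illustration of the second approach, here’s a rough sketch using the watchdog Python package (rather than fswatch) to call an indexing function whenever a notebook file is created or modified; index_notebook() is just a placeholder for a real indexing step:

    import time
    from watchdog.observers import Observer
    from watchdog.events import PatternMatchingEventHandler

    def index_notebook(path):
        print("(re)indexing", path)   # placeholder for a real indexing step

    class NotebookHandler(PatternMatchingEventHandler):
        def __init__(self):
            super().__init__(patterns=["*.ipynb"], ignore_directories=True)
        def on_created(self, event):
            index_notebook(event.src_path)
        def on_modified(self, event):
            index_notebook(event.src_path)

    observer = Observer()
    observer.schedule(NotebookHandler(), path=".", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()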

Another way might be to use something like the pgcontents contents manager to save (or process) notebooks into a search engine index database. (For other examples of Jupyter notebook content managers, see this Tracking Jupyter round-up. I wonder, is there a sqlite content manager that can save notebooks directly into SQLite? Would the pgcontents extension handle that with little or no modification, other than to the supplied database connection string?) If notebooks were saved as notebooks to disk, and into a database for indexing as part of the search engine, how would the indexed notebook also be linked back to the notebook on disk so it could be linked to via search results?

Thinks: how is nbgallery architected? Where are notebooks saved to? How is the Solr search engine index managed?

More generally, I wonder: are there any Python based, simple full-text search engines with local filesystem crawlers/monitors/indexers out there?
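One candidate for rolling your own, at least for the indexing and querying part, would be Whoosh, a pure-Python full-text search library; here’s a quick sketch of indexing and searching the markdown cells of local notebooks (the index directory, field names and query term are my own choices, and there’s no incremental update handling):

    import glob, json, os
    from whoosh import index
    from whoosh.fields import Schema, ID, TEXT
    from whoosh.qparser import QueryParser

    schema = Schema(path=ID(stored=True, unique=True), content=TEXT(stored=True))
    os.makedirs("nbindex", exist_ok=True)
    ix = index.create_in("nbindex", schema)

    # Crawl the local filesystem for notebooks and index their markdown cells
    writer = ix.writer()
    for nb_path in glob.glob("**/*.ipynb", recursive=True):
        with open(nb_path) as f:
            nb = json.load(f)
        text = "\n".join("".join(c["source"]) for c in nb.get("cells", [])
                         if c["cell_type"] == "markdown")
        writer.add_document(path=nb_path, content=text)
    writer.commit()

    # Run a simple query over the index
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("pandas")
        for hit in searcher.search(query):
            print(hit["path"])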

PS Other search engines to have a look at: