Looking Up R / CRAN Package Maintainers With an ac.uk Affiliation

Trying to find an examiner for a particular PhD thesis relating to a rather interesting datastructure for wrangling messy datatables, I wondered whether we might find a likely suspect amongst the R package maintainer community.

We can get a list of R package maintainers here and a list of package name / short descriptions here.

FWIW, here’s the code fragment:

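import pandas as pd

# Check summary table grouped by maintainer; column 0 is assumed to hold the maintainer name / email
maintainers = pd.read_html('https://cran.r-project.org/web/checks/check_summary_by_maintainer.html')[0]
# Drop rows that have no maintainer entry
maintainers_email = maintainers.dropna(subset=[0])

# Package names and short descriptions
packages = pd.read_html('https://cran.r-project.org/web/packages/available_packages_by_name.html')[0]

# Keep maintainers with a .ac.uk address; regex=False makes the dots match literally
maintainers_email_acuk = maintainers_email[maintainers_email[0].str.contains('.ac.uk', regex=False)][[0,1]]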
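
As an aside on the filter: by default, pandas treats the pattern passed to str.contains as a regular expression, so an unescaped '.' matches any character. A minimal standard-library sketch of the difference follows; the maintainer strings are made-up examples, and the "name at domain" layout is an assumption about how the addresses are displayed.

```python
import re

# Literal ".ac.uk" followed by a word boundary, so ".ac.ukraine" would not match.
AC_UK = re.compile(r"\.ac\.uk\b", re.IGNORECASE)

def is_ac_uk(maintainer: str) -> bool:
    """Return True if the maintainer string contains a literal .ac.uk domain."""
    return bool(AC_UK.search(maintainer))

# Hypothetical maintainer strings, for illustration only.
academic = "A Person <a.person at open.ac.uk>"
lookalike = "B Person <b.person at bigmac.uk>"

# The unescaped pattern also matches the lookalike, because '.' matches the 'm' in "mac.uk".
naive_hit = re.search(".ac.uk", lookalike) is not None
```

With the escaped pattern, is_ac_uk(academic) is True and is_ac_uk(lookalike) is False, whereas the naive pattern flags both.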

See also: What Do you Mean You Write Code EVERY DAY?, examples of which I’ve just turned into a new blog category: WDYMYWCED.

Trying to Get Hold of UK Air Quality Data Via a Python API

It’s that time of year again for prepping the end of course assessment material for our TM351 Data Management and Analysis course (not that I typically have much to do with preparing such things…!).

The end of course assessment is typically framed as a data project that requires students to link several datasets and find interesting things to say about them. This final project is set up via a continuous assessment activity that introduces one of the datasets and gets students started working with it: exploring what the dataset looks like, getting it into a database, generating some basic charts from it and starting to formulate some questions around it.

As with many of the data activities, my preference is for ones that make use of national datasets with local relevance. This can add variety (if students compare data from three local authorities selected from across the UK, there’s a good chance they might select different locations) and it also provides them with the opportunity to carry out a data investigation for their local area using data that they may not have been aware even existed…

This year’s topic is likely to be bootstrapped around air quality data. Sites such as the London Air Quality network make data available for London boroughs, but it’d be nice to be able to offer data for a more national scope.

Looking at Defra’s UK Air website, data does seem to be available for sites across the UK, but the download form is horrible, hugely restrictive on the amount of data you can download, and not obviously built around URLs that can be machine generated and used to download data programmatically.

Which is not ideal…

However, it does seem that an API exists for R users in the form of David Carslaw’s openair package. So how does that work, then???

Poking around in the code, it seems that sampling site metadata as well as air quality sample data is available via .Rdata files.

Hmm… a bit more poking in the code turns up some URL patterns, and a quick search for Python packages that can read .Rdata files without the need to install R turns up pyreadr.

So here’s a quick first attempt at a Python downloader for UK air quality data:
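
A minimal sketch of such a downloader, which is not necessarily the code originally embedded here. The base URL and file naming patterns are assumptions based on what the openair source appears to use, MY1 is just an example site code, and the helper names are hypothetical. The pyreadr import is deferred so the URL helpers work without it installed.

```python
import os
import tempfile
import urllib.request

# ASSUMPTION: base URL and file naming as they appear to be used by openair.
BASE_URL = "https://uk-air.defra.gov.uk/openair/R_data"

def metadata_url() -> str:
    """URL of the (assumed) site metadata file."""
    return f"{BASE_URL}/AURN_metadata.RData"

def site_year_url(site: str, year: int) -> str:
    """URL of the (assumed) data file for one monitoring site and year."""
    return f"{BASE_URL}/{site.upper()}_{year}.RData"

def read_rdata(url: str) -> dict:
    """Download an .RData file and return its objects as pandas dataframes."""
    import pyreadr  # third-party: pip install pyreadr

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "download.RData")
        urllib.request.urlretrieve(url, path)
        # pyreadr.read_r returns a dict-like mapping of object name to dataframe
        return dict(pyreadr.read_r(path))

example_url = site_year_url("my1", 2019)
```

Calling read_rdata(metadata_url()) should then give a dictionary containing the metadata dataframe, and read_rdata(site_year_url("MY1", 2019)) the samples for that site and year, if the assumed URL patterns hold.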


The location IDs look a bit ad hoc/made up, but there is lat/long data, so it should be easy enough to call something like postcodes.io to find some rather more standardised administrative codes.
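
As a rough sketch of that lookup, the following builds a postcodes.io reverse geocoding request from a lat/long pair and pulls out the local authority name and code from the nearest postcode. The response field names (result, admin_district, codes) are assumptions about the API's JSON layout, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.postcodes.io/postcodes"

def reverse_geocode_url(lat: float, lon: float) -> str:
    """Build a reverse geocoding query URL for a lat/long pair."""
    query = urllib.parse.urlencode({"lon": lon, "lat": lat})
    return f"{API}?{query}"

def admin_district(response: dict):
    """Return (name, code) for the nearest postcode's local authority, or None.

    ASSUMPTION: the response has a "result" list whose items carry an
    "admin_district" name and a "codes" dict with an "admin_district" code.
    """
    results = response.get("result") or []
    if not results:
        return None
    nearest = results[0]
    codes = nearest.get("codes") or {}
    return nearest.get("admin_district"), codes.get("admin_district")

def lookup(lat: float, lon: float):
    """Call the API (network access required) and extract the local authority."""
    with urllib.request.urlopen(reverse_geocode_url(lat, lon)) as resp:
        return admin_district(json.load(resp))
```

Applying lookup to the lat/long columns of the site metadata would then give a local authority name and code for each monitoring site.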

With a couple of tiny functions, it should be easy enough to grab data from the metadata dataframe to generate a simple ipywidgets powered UI that lets you select a local authority by name, perhaps pre-filtered to LAs within a particular selected region, and download just the data for that authority.

But that, as they say, is an exercise left for the reader…

Quick Way in to Hacking Legacy OU Course Materials Using Markdown

By some arcane process, OU course materials authored typically in MS Word are converted to an XML format (OU-XML) and then rendered variously to HTML for the Moodle website, ebook formats, and perhaps PDF (we don’t want to make it too easy for students to print off the materials…).

An internal project that ran for a couple of years (maybe a bit more) looking at more direct authoring workflows was shelved earlier this year. (I was banned from blogging about it whilst it was under development, so I’m afraid I don’t have screen shots to show what it looked like from the time I was given preview access.) As far as I know, the authoring tool was completely distinct from the one developed by the OU’s bastard offspring that is FutureLearn. Nowt like sharing.

One of the things I’m slated to do over the next few months is update, or possibly rewrite, a unit in a first year equivalent module.

My preferred way of authoring for some time has been to keep it simple and just use markdown.

So that’s what I’m probably going to do.

If there’s any griping or sniping that it doesn’t fit the OU workflow, I’ll just run it through pandoc to generate an MS Word docx version and hand that over.
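
For what it’s worth, a minimal sketch of that pandoc step wrapped in Python. It assumes pandoc is installed and on the path; unit.md is a hypothetical filename, and the function names are illustrative.

```python
import subprocess
from pathlib import Path

def pandoc_docx_command(md_path: str) -> list:
    """Build the pandoc command that converts a markdown file to docx."""
    src = Path(md_path)
    out = src.with_suffix(".docx")
    return ["pandoc", str(src), "--from", "markdown",
            "--to", "docx", "--output", str(out)]

def convert_to_docx(md_path: str) -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_docx_command(md_path), check=True)

# Hypothetical filename, for illustration only.
command = pandoc_docx_command("unit.md")
```

Running convert_to_docx("unit.md") would write unit.docx alongside the markdown source.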

(I’ve been saying *for years* we should have pandoc read/write filters for OU-XML (the most recent notes are here). It would have been a damn sight cheaper than the aborted authoring tool project and would have allowed authors to explore a whole range of tools for creating their warez, with pandoc handling the conversion to OU-XML. And yes, I f**king know that some hand cleaning of the OU-XML would almost certainly have been required but we’d have got a far better feeling for what sorts of document structures folk produce if they were allowed to use the tools that suit them. And authors’ shonky mark-up (including my own) *always* needs some fettling anyway: we already know that…)

So… markdown…

If I’m going to revise the current materials, I need to get them out of the current format and into markdown. I’ve previously started looking at an XSLT to convert OU-XML to markdown, eg as described in Fragment – OpenLearn Jupyter Books Remix; a copy of the current-ish XSLT, and some code fragments to grab and convert an example OU-XML document, can be found here.

But today, I thought of an even scruffier and quicker way…

Within the VLE, a single OU-XML source document is rendered across multiple HTML pages, along with a navigation index:

A single HTML page view (for easier printing) is also available… Hmmm…there are plenty of HTML2markdown converters out there, aren’t there?

#!pip3 install markdownify
from bs4 import BeautifulSoup
from markdownify import markdownify as md

with open('Robotics study week 1 – Introduction_ View as single page.html', 'r', encoding='utf-8') as f:
    # Let's just grab the HTML body...
    tree = BeautifulSoup(f.read(), 'lxml')
    body = tree.body
    # Convert the body HTML to markdown
    txt = md(str(body))
with open('week1-mardownify.md', 'w', encoding='utf-8') as f:
    # There'll still be script tag cruft, videos won't be embedded / linked etc
    # but it's enough to get started with and the diffs should be easy to see...
    f.write(txt)

The output is a bit flaky in parts, but most of the stuff I need is there. Certainly, there’s more than enough of it in useable form for me to start using as an outline. Indeed, much of the work will be ripping out and replacing the huge chunks of content that are now rather dated.

I can also edit the markdown in a notebook environment using Jupytext, using metadata cells to highlight certain blocks of content with additional structural or semantic metadata, saving the metadata into the markdown document from where it could be processed (I’m not sure how it would turn up if the enhanced markup were converted to docx using pandoc, for example?).

From what I saw of the aborted OpenCreate editor, it used a block/cell style metaphor for creating separate content elements within a page, so it’d also be interesting to compare the jupytext/metadata enhanced markdown, or even the notebook ipynb output format, with the OpenCreate document format / representation to see whether there are similarities in the block level semantic / structural markup.