Simple Link Checking from OU-XML Documents

Another of those very buried lede posts…

Over the years, I’ve spent a lot of time pondering the way the OU produces and publishes course materials. The OU is a publisher and a content factory, and many of its production modes resemble a factory system, not least in terms of the scale of delivery: OU course populations can run at over 1000 students per presentation, and first year undergrad equivalent modules can be presented (in the same form, largely unchanged) twice a year for five years or more.

One of the projects currently being undertaken internally is the intriguingly titled Redesigning Production project, although I still can’t quite make sense (for myself, in terms I understand!) of what the remit or the scope actually is.

Whatever. The project is doing a great job soliciting contributions through online workshops, forums, and the painfully horrible Yammer channel (it demands third-party cookies are set and repeatedly prompts me to reauthenticate; with the rest of the university moving gung ho to Teams, that a future-looking project is using a deprecated comms channel seems… whatever). So I’ve been dipping my oar in, pub bore style, with what are probably overbearing and overlong (and maybe out of scope? I can’t fathom it out…) “I remember when”, “why don’t we…” and “so I hacked together this thing for myself” style contributions…

So here’s a little something inspired by a current, and ongoing, discussion about detecting broken links in live course materials: a simple link checker.

# Run a link check on a single link

import requests

def link_reporter(url, display=False, redirect_log=True):
    """Attempt to resolve a URL and report on how it was resolved."""
    if display:
        print(f"Checking {url}...")
    # Make the request and follow any redirects
    r = requests.head(url, allow_redirects=True)
    # Optionally report each step of redirection/resolution,
    # not just the final response
    steps = r.history + [r] if redirect_log else [r]
    step_reports = []
    for step in steps:
        step_report = (step.ok, step.url, step.status_code, step.reason)
        step_reports.append(step_report)
        if display:
            print(f'\tok={step.ok} :: {step.url} :: {step.status_code} :: {step.reason}')

    return step_reports

That bit of Python code, which took maybe 10 minutes to put together, will take a URL and try to resolve it, keeping track of any redirects along the way as well as the status from the final page request (for example, whether the page was successfully loaded with a code 200, or whether a 404 page not found was encountered; other status messages are also possible).

[UPDATE: I am informed that there is a VLE link checker to check module links available from the administration block on a module’s VLE site. If there is, and I’m looking in the right place, it’s possibly not something I can see or use due to permissioning… I’d be interested to see what sort of report it produces though:-)]

The code is a hacky recipe intended to prove a concept quickly that stands a chance of working at least some of the time. It’s also the sort of thing that could probably be improved on, and evolved, over time. But it works, mostly, now, and could be used by someone who could create their own simple program to take in a set of URLs and iterate through them generating a link report for each of them.

Here’s an example of the style of report it can create, using a link that was included in the materials as a Library proxied link, and that I cleaned to give a non-proxied link (note to self: I should perhaps create a flag that identifies links of that type as Library proxied links; and perhaps also flag at least one other link type, library managed (proxied) links keyed by a link ID value and routed via…). The reasons reported for the redirect steps included:

  'Moved Permanently'),
  'Moved Temporarily'),

So.. link checker.

At the moment, module teams manually check links in web materials published on the VLE over many many pages. To check a hundred links spread over a hierarchical tree of pages to depth two or three takes a lot of time and navigation.

More often than not, dead links are reported by students in module forums. Some links are perhaps never clicked on and have been broken for years, but we wouldn’t know it; others have been clicked on but gone unreported. (This raises another question: why do never-clicked links remain in the materials anyway? Reporting on link activity is yet another of those stats we could and should act on internally (course analytics, a quality issue) but we don’t: the institution prefers to try to shape students by tracking them using learning analytics, rather than improving things we have control over using course analytics. We analyse our students, not our materials, even though our students’ performance is shaped by our materials. Go figure.)

This is obviously not a “production” tool, but if you have a set of links from a set of course materials, perhaps collected together in a spreadsheet, and you had a Python code environment, and you were prepared to figure out how to paste a set of URLs into a Python script, and you could figure out a loop to iterate through them and call the link checker, you could automate the link checking process in some sort of fashion.
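For what it’s worth, a minimal sketch of that loop might look like the following; it is self-contained rather than reusing the link_reporter fragment above, and the only dependency is requests:

```python
import requests

def check_links(urls, timeout=10):
    """Run a simple link check over a list of URLs.

    Returns one (url, ok, status_code, note) tuple per URL, recording
    request failures (timeouts, DNS errors, etc.) rather than crashing.
    """
    report = []
    for url in urls:
        try:
            r = requests.head(url, allow_redirects=True, timeout=timeout)
            report.append((url, r.ok, r.status_code, r.reason))
        except requests.RequestException as e:
            report.append((url, False, None, type(e).__name__))
    return report
```

The URLs themselves could just be pasted in as a Python list, or read in from a column of a CSV file exported from the spreadsheet.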

So: tools to make life easier can be quickly created for, and made available to (and can also be created or extended by) folk with access to certain environments that let them run automation scripts and who have the skills to use the tools provided (or the skills and time to make them for themselves).

Is It Worth the Time?

By the by, anyone who has been tempted, or actually attempted, to create their own (end user development) automation tools will know that even though you know it should only take a half hour hack to create a thing, that half an hour is elastic:

Having created that simple link checker fragment to drop into the “broken link” Redesigning Production forum thread, in part to demonstrate that a link checker that works at the protocol level can identify a range of redirects and errors (for example, ‘content not available in your region’ / HTTP 451 Unavailable For Legal Reasons is one that GDPR has resulted in when trying to access various US based news sites), I figured I really should get round to creating a link checker that will trawl through links automatically extracted from one or more OU-XML documents in a local directory. (I did have code to grab OU-XML documents from the VLE, but the OU auth process has changed since I last used that code, which means I need to move the scraper from mechanicalsoup to selenium…) You can find the current OU-XML link checker command line tool here:
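The extraction step itself needn’t be complicated. As a sketch (the real OU-XML vocabulary marks up weblinks with its own elements, so treat the attribute handling here as illustrative rather than as what the tool actually does), something like the following pulls candidate URLs out of an XML document:

```python
import xml.etree.ElementTree as ET

def extract_links(xml_text):
    """Pull candidate URLs from an XML document.

    Grabs any href attribute, plus any element text that looks like a
    URL; real OU-XML link elements may need special-casing.
    """
    links = []
    for el in ET.fromstring(xml_text).iter():
        if 'href' in el.attrib:
            links.append(el.attrib['href'])
        elif el.text and el.text.strip().startswith(('http://', 'https://')):
            links.append(el.text.strip())
    return links
```

Feeding the resulting list into the link checker loop then gives you a per-document link report.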

So, we now have a link checker that anyone can use, right? Well, not really… It doesn’t work like that. You can use the link checker if you have Python 3 installed, and you know how to go onto the command line to install the package, and you know that copying the pip install instruction I posted in the Yammer group won’t work because the Github URL is shortened by an ellipsis, and that if you call “pip” when Python 2 has the focus you’ll get an error; and when you run the link checker on the command line you know how to navigate to, or specify, the path (including paths with spaces…); and you know how to open a CSV file and/or open and make sense of a JSON file with the full report; and you can get copies of the OU-XML files for the materials you are interested in and get them onto a path you can call the link checker command against in the first place… then you have access to a link checker.

So this is why it can take months rather than minutes to make tools generally available. Plus there is the issue of scale – what happens if folk on hundreds of OU modules start running link checkers over the full set of links referenced in each of their courses on a regular basis? If (when) the code breaks parsing a document, or trying to resolve a particular URL, what does the user do then? The hacker who created it (or anyone else with the requisite skills) could possibly fix the script quite quickly, even if just by adding in an exception handler, or excluding particular source documents or URLs and remembering they hadn’t been checked automatically.

But it does also raise the issue that quick fixes that would save chunks of time, and that some, maybe even many, folk could make use of right now, aren’t generally available. So every time a module presents, some poor soul on each module has to manually check, one at a time, potentially hundreds of links in web materials published on the VLE, spread over many many pages published in a hierarchical tree to depth two or three.

PS As I looked at the link checker today, deciding whether I should post about it, I figured it might also be useful to add in a couple of extra features: specifically, a screenshot grabber to grab a snapshot image of the final page retrieved from each link, and a tool to submit the URL to a web archiving service such as the Internet Archive or the UK Web Archive, or create a proxy link to an automatically archived version of it using something like the Mementoweb Robust Links service. So that’s the tinkering for my next two coffee breaks sorted… And again, I’ll make them generally available in a way that probably isn’t…

And maybe I should also look at more generally adding in a typo and repeated word checker, eg as per More Typo Checking for Jupyter Notebooks — Repeated Words and Grammar Checking?

PPS the quality question of never-clicked links also raises a question that for me would be in scope as a Redesigning Production question and relates to the issue of continual improvement of course material, contrasted with maintenance (fixing broken links or typos that are identified, for example) and update (more significant changes to course materials that may happen after several years to give the course a few more years of life).

Our TM351 Data Management and Analysis module has been a rare beast in that we have essentially been engaged in a rolling rewrite of it ever since we first presented it. Each year (it presents once a year), we update the software and review the practical activities distributed via Jupyter notebooks, which take up about 40% of the module study time. (Revising the VLE materials is much harder because there is a long, slow production process associated with making those updates. Updating notebooks is handled purely within the module team and without reference to external processes that require scheduling and formal scheduled handovers.)

To my mind, the production process for some modules at least should be capable of supporting continual improvement, and move away from the “fixed for three years then significant update” model.

Running SQLite in a Zero2Kubernetes (Azure) JupyterHub Spawned Jupyter Notebook Server

I think this is an issue, or it may just be a quirk of a container I built for deployment via JupyterHub using Kubernetes on Azure to run user containers, but it seems that SQLite does things with file locks that can break the sqlite3 package…

For example, the hacky cross-notebook search engine I built, the PyPi installable nbsearch (which is not the same as the IBM semantic notebook search of the same name, WatViz/nbsearch), indexes notebooks into a SQLite database saved into a hidden directory in home.

The nbsearch UI is published using Jupyter server proxy. When the Jupyter notebook server starts, the jupyter-server-proxy extension looks for packages with jupyter-server-proxy registered start hooks (code).

If the jupyter-server-proxy setup fails for one registered service, it seems to fail for them all. During testing of a deployment, I noticed none of the jupyter-server-proxy services I expected to be visible from the notebook homepage New menu were there.

Checking logs (via @yuvipanda, kubectl logs -n <namespace> jupyter-<username>) it seemed that an initialisation script in nbsearch was failing the whole jupyter-server-proxy setup (sqlite3.OperationalError: database is locked; related issue).

Scanning the JupyterHub docs, I noted that:

> The SQLite database should not be used on NFS. SQLite uses reader/writer locks to control access to the database. This locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. Therefore, you should avoid putting SQLite database files on NFS since it will not handle well multiple processes which might try to access the file at the same time.

This relates to setting up the JupyterHub service, but it did put me on the track of various other issues perhaps related to my issue posted variously around the web. For example, this issue (“Allow nobrl parameter like docker to use sqlite over network drive”) suggests alternative file mountOptions, which seemed to fix things…
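For the record, a sketch of the sort of thing involved; the StorageClass name and the mode settings here are illustrative, and the key line is the nobrl mount option, which disables the byte-range locking that trips up SQLite on network mounted filesystems:

```yaml
# Illustrative Azure Files storage class for user home directories;
# nobrl disables byte-range locks so sqlite3 can open its database.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile-nobrl
provisioner: kubernetes.io/azure-file
mountOptions:
  - nobrl
  - dir_mode=0777
  - file_mode=0777
```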

Partly Solipsist Conversational Blogging Over Time

This blog is one of the places where I have, over the years, chatted not just to myself, but also to others.

Microblogging tools, as they used to be called — things like Twitter — are now often little more than careless public resharing sites for links and images and “memes” (me-me-me-me, look what I can do in taking this object other people have repurposed and I have repurposed too).

Blog posts, too, are linky things, or can be. Generally, most of my blog posts include links because they situate the post in a wider context of “other stuff”. The stuff that prompted me to write the post, the stuff that I’ve written previously that relates to or informs my current understanding, other stuff around the web that I have gone out and discovered to see what other folk have written about the thing I’m writing about, stuff that riffs further on something I don’t have space or time or inclination to cover in any more depth than a casual tease of a reference, or that I’d like to follow up later, stuff that prospectively references posts I haven’t written yet, and so on.

So, this post was inspired by a conversation I heard between Two Old Folks in the Park Waxing On about something or other, and also picks up on something I didn’t mention explicitly in something I posted yesterday, but touched on in passing: social bookmarking.

Social bookmarking is personal link sharing: saving a link to something you think you might want to refer to later in a collection that you deliberately add to, rather than traipse through your browser search history.

At this point, I was going to link to something I thought I’d read on Simon Willison’s blog about using datasette to explore browser search history sqlite files, but I can’t seem to find the thing I perhaps falsely remember? There is a post on Simon’s blog about trawling through Apple photos sqlite db though.

That said, I will drop in this aside about how to find and interrogate your Safari browser history as a SQLite database, and how to find the myriad number of Google Chrome data stashes, which I should probably poke through in a later post. And in auto-researching another bit of this post, I also note that I have actually posted on searching Google history sqlite files using datasette: Practical DigiSchol – Refinding Lost Recent Web History.

The tool I use for social bookmarking now is pinboard (I note in passing from @simonw’s dogsheep personal analytics page that there is a third-party dogsheep aligned pinboard-to-sqlite exporter). But as one of the old folks muttering in the park, Alan Levine, mentioned, I also have fond memories of del.icio.us (are the .’s in the right place?!), the first social bookmarking tool I got into a daily habit of using.

One of the handy things about those early web apps was that they had easy to use public APIs that didn’t require any tokens or authentication: you just created a URL and pulled back JSON data. This meant it was relatively easy to roll your own applications, often as simple single page web apps. One tool I created off the back of that was deliSearch, which let you retrieve a list of tagged links from delicious for a particular tag, then do an OR’d search over the associated pages, or the domains they were on, using Yahoo search. (For several years now, if you try running more than a couple of hand crafted, advanced searches using logical constructs and/or advanced search limits using Google web search, you get a captcha challenge under the assumption that you’re a robot.)

deliSearch led to several other search experiments, generalised as searchfeedr. This would let you roll a list of links from numerous sources, such as social bookmark feeds, all the links in a set of course materials, or even just the links on a web page such as this one, and roll your own search engine over them. See for example this presentation from ILI 2007, back in those happy days when I used to get speaking gigs: Search Hubs and Custom Search Engines (ILI2007).

Something else I picked up from social bookmarking sites was the power of collections recast as graphs over people and/or tags. Here’s an early example of a crude view over folk using the edupunk tag on delicious circa 2008 (eduPunk Chatter):

edupunk bookmarks

And here was another experiment from several years later (Dominant Tags in My Delicious Network), looking at the dominant tags used across folk in my delicious network defined as the folk I followed on delicious at the time:

Another experiment used a delicious feature that let you retrieve the 100 most recent bookmarks saved with a particular tag, along with data on who bookmarked it and what other tags they used, to give us a snapshot at a very particular point in time around a particular tag (Visualising Delicious Tag Communities Using Gephi). Here’s the activity around the ds106 tag one particular day in 2011:

And finally, pinboard also has a JSON API that I think replicates many of the calls that could be made to the delicious API. I thought I’d done some recipes updating some of my old delicious hacks to use the pinboard API but offhand, I can’t seem to find them anywhere. Should’a bookmarked them, I guess!

PS In getting on for decades of using WordPress, I don’t think I have ever mistakenly published a page rather than a post. With the new WordPress.com UI, perhaps showing me some new features I don’t want, I made that mistake for the first time I can think of today. The WordPress.com hosted blog new authoring experience sucks, and if it’s easy to make publishing mistakes like that, I imagine other f**d-up ness can follow along too.

Fragment: More Typo Checking for Jupyter Notebooks — Repeated Words and Grammar Checking

One of the typographical error types that isn’t picked up in the recipe I used in Spellchecking Jupyter Notebooks with pyspelling is the repeated word error type (for example, the the).

A quick way to spot repeated words is to use egrep on the command line over a set of notebooks-as-markdown (via Jupytext) files: egrep -o  "\b(\w+)\s+\1\b" */.md/*.md

I do seem to get some false positives with this; generating an output file of the report and then doing a quick filter on that would tidy things up.
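The same pattern can be applied in Python, which makes scripting that false positive filtering easier (a minimal sketch):

```python
import re

# Same idea as the egrep recipe: a word, some whitespace,
# then the same word again (case-insensitively)
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def repeated_words(text):
    """Return the repeated words found in a text, in document order."""
    return [m.group(1) for m in REPEATED_WORD.finditer(text)]
```

Filtering out legitimate doubles (such as “had had”) then just means post-processing the returned list against a whitelist.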

An alternative route might be to extend pyspelling and look at tokenised word pairs for duplicates. Packages such as spacy also support things like Rule-Based Phrase Text Extraction and Matching at a token-based, as well as regex, level. Spacy also has extensions for hunspell [spacy_hunspell]. A wide range of contextual spell checkers are also available (for example, neuspell seems to offer a meta-tool over several of them), although care would need to be taken when it comes to (not) detecting US vs UK English spellings as typos. For nltk based spell-checking, see eg sussex_nltk/spell.

Note that adding an autofix would be easy enough but may make for false positives if there is a legitimate repeated word pair in a text. Falsely autocorrecting that, then detecting the created error / tracking down the incorrect deletion so it can be repaired, would be non-trivial.

Increasingly, I think it might be useful to generate a form with suggested autocorrections and checkboxes pre-checked by default that could be used to script corrections. It could also generate a change history.

For checking grammar, the Java based LanguageTool seems to be one of the most popular tools out there, being as it is the engine behind the OpenOffice grammar checker. Python wrappers are available for it (for example, jxmorris12/language_tool_python).

Browser Based “Custom Search Engines”

Back in the days when search was my first love of web tech, and blogging was a daily habit, I’d have been all over this… But for the last year or two, work has sucked blogging all but dry, and by not writing post(card)s I don’t pay as much attention to my digital flâneury now as I used to and as I’d like.

There are still folk out there who do keep turning up interesting stuff on a (very) regular basis, though, and who do manage to stick to the regular posting which I track, as ever, via a filter I curate: RSS feed subscriptions.

So for example, via @cogdog (Alan Levine’s) CogDogBlog post A Tiny Tool for Google Image Searches set to Creative Commons, I learn (?…) about something I really should immediately just be able to recall off the top of my head, and yet also seems somehow familiar…:

 In Chrome, create a saved search engine under Preferences -> Search Engine -> Manage Search Engines or directly chrome://settings/searchEngines.

Alan Levine

So what’s there?

Chrome browser preferences: search engine settings

First up, we notice that this is yet another place where Google appears to have essentially tracked a large part of my web behaviour, creating Other search engine links for sites I have visited. I guess this is how it triggers behaviours such as “within site search” in the location bar:

Within-site search in Chrome location bar

…which I now recall is (was?) called the omnibar, from this post from just over 10 years ago in February 2011: Coming Soon, A Command Line to the Web in Google Chrome?.

That post also refers to Firefox’s smart keywords which were exactly what I was trying to recall, and which I’d played with back in 2008: Time for a TinyNS? (a post which also features guess who in the first line…).

Firefox: smart keywords, circa 2008

So with that idea brought to mind again, I’ll be mindful of opportunities to roll my own again…

Alan’s recent post also refers to the magic that is (are) bookmarklets. I still use a variety of these all the time, but I haven’t been minded to create any in months and months… in fact, probably not in years…

My top three bookmarklets:

  • pinboard bookmarklet: social bookmarking; many times a day;
  • nbviewer bookmarklet: reliably preview Jupyter notebooks from Github repos and gists, with js output cells rendered; several times a day;
  • OU Library ezproxy: open subscription academic content (online journals, ebooks etc.) via OU redirect (access to full text, where OU subscription available).

What this all makes me think is that the personal DIY productivity tools that gave some of us, at least, some sort of hope, back in the day, have largely not become common parlance. These tools are likely still alien to the majority, not least in terms of the very idea of them, let alone how to write incantations of your own to make the most of them.

Which reminds me. At one point I did start to explore bookmarklet generator tools (I don’t recall pitching an equivalent for smart keyword scripts), in a series of posts I did whilst on the Arcadia Project:

An Introduction to Bookmarklets
The ‘Get Current URL’ Bookmarklet Pattern
The ‘Get Selection’ Bookmarklet Pattern

Happy days… Now I’m just counting down the 5000 or so days to retirement from the point I wake up to the point I go to sleep.

Thanks, Alan, for the reminders; your daily practice is keeping this stuff real…

Digital Art(efacts)

So.. non-fungible tokens for signing digital art. FFS. I wonder when someone will set up an NFT a long way down the chain and relate the inflating cost of their artwork to the increasing amounts of energy you need to spend to mint the next token. Thinks: Drummond and Cauty burned a million; now you can burn a million on top of the cost of the artwork minting your certificate of ownership.

Hmm… maybe that’s a world of art projects in itself: mint new tokens on a server powered by electricity from the windmill in your garden and limit how often ownership can change by virtue of the slow release of new tokens. Create fairygold tokens that disintegrate after a period of time, so you are forced to resell the art work within a particular period of time. Run your blockchain miner on a GPU powered by a server where you control the energy tariff. Etc etc.

Alternatively, why not sell “a thing”.

Back in the day, I used to enjoy buying signed limited edition prints. There was no way I could afford originals or studies, but the signed limited edition print had the benefit of relative scarcity, and the knowledge that the artist had presumably looked at and touched the piece. And if they weren’t happy with it, they wouldn’t release it. A chef at the pass.

I have a cheap RPi (Raspberry Pi) connected by an ethernet cable to my home internet router. I run a simple Jupyter notebook server on it as a sketchpad, essentially. In a Jupyter notebook, I can write text and code, execute the code, embed code outputs (tables, charts, generated images, 3D models, sounds, movies, and so on) and export those generated code output assets as digital files. I could sign them as NFTs.

I can also run other things on the RPi, such as a text editor, or a web browser.

I can use the RPi to perform the calculations on which I create a piece.

So… rather than sell a signed digital artefact, I could connect to the RPi, write some code on it, or open a drawing package on it, and create a piece.

Then I could make the files read only on the device and sell you the means of production plus the files I actually worked on and created.

If it’s an RPi, where the files are saved on an SD card, I could sell you just the SD card.

In the first case, where I sell you the RPi, you would have the silicon that did the computation that created the digital asset, as well as the files I, in a particular sense, touched and worked on, with a timestamp on each file representing the last time I edited and saved it.

In the second case, where I sell you the SD card, I sell you the original files that define the created digital asset, files that I, in a particular sense, touched and worked on, with a timestamp on each file representing the last time I edited and saved it.

This seems to me to be a much more tangible way of selling digital artworks as touched by the creator.

Spellchecking Jupyter Notebooks with pyspelling

One of the things I failed to do at the end of last year was put together a spellchecking pipeline to try to pick up typos across several dozen Jupyter notebooks used as course materials.

I’d bookmarked pyspelling as a possible solution, but didn’t have the drive to do anything with it.

So with a need to try to correct typos for the next presentation (some students on the last presentation posted about typos but didn’t actually point out where they thought they were so we could fix them), I thought I’d have a look at whether pyspelling could actually help, having spotted a Github spellcheck action — rojopolis/spellcheck-github-actions — that reminded me of it (and that also happens to use pyspelling).

The pyspelling package uses a matrix and pipeline ideas. The matrix lets you define and run separate pipelines, the pipelines let you sequence a series of filter steps. Available filters include markdown, html and python filters that preprocess files and pass text elements for spellchecking to the spellchecker. The Python filter allows you to extract things like comments and docstrings and run spell checks over those; the markdown and HTML filters can work together so you can transform markdown to HTML, then ignore the content of code, pre and tt tags, for example, and spell check the rest of the content. A url filter lets you remove URLs before spellchecking.

By default, there is no Jupyter notebook / ipynb filter, so I started off by running the spellchecker against Jupytext markdown files generated from my notebooks. A filter to strip out the YAML header at the start of the jupytext-md file was there to help minimise false positive typos from the spell checker report.

In passing, I often use a Jupytext --pre-commit filter to commit a markdown version of Git committed notebooks to a hidden .md directory. For example, in .git/hooks/pre-commit, add the line: jupytext --from ipynb --to .md//markdown --pre-commit [docs]. Whenever you commit a notebook, a Jupytext markdown version of the notebook (excluding the code cell output content) will also be added and committed into a .md hidden directory in the same directory as the notebook.

Here’s the first attempt a pyspelling config file:

# -- .pyspelling.yml --

matrix:
- name: Markdown
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.context:
      # Cribbed from pyspelling docs
      context_visible_first: true
      # Ignore YAML at the top of jupytext-md file
      # (but may also exclude other content?)
      delimiters:
      - open: '(?s)^(?P<open> *-{3,})$'
        close: '^(?P=open)$'
  - pyspelling.filters.url:
  - pyspelling.filters.markdown:
      markdown_extensions:
      - pymdownx.superfences:
  - pyspelling.filters.html:
      comments: false
      ignores:
      - code
      - pre
      - tt
  sources:
  - '**/.md/*.md'
  default_encoding: utf-8

Note that the config also includes a reference to a custom wordlist in .wordlist.txt that includes additional whitelist terms over the default dictionary.

Running pyspelling using the above configuration runs the spell checker over the desired files in the desired way: pyspelling > typos.txt

The output typos.txt file then has the form:

Misspelled words:
<htmlcontent> content/02. Getting started with robot and Python programming/02.1 Robot programming constructs.ipynb: html>body>p

Misspelled words:
<htmlcontent> content/02. Getting started with robot and Python programming/02.1 Robot programming constructs.ipynb: html>body>p

We can create a simple pandas script to parse the result and generate a report that counts the prevalence of particular typos. For example, something of the form:

datalog          37
dataset          32
pre              31
convolutional    19
RGB              17
pathologies       1
Microsfot         1
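A quick sketch of such a parser, using collections.Counter rather than pandas, and assuming the report lists one candidate word per line between the header, source and rule lines:

```python
from collections import Counter

def count_typos(report_text):
    """Tally misspelled words from a pyspelling report.

    Assumes each block is a 'Misspelled words:' header, a source line
    (starting '<'), horizontal rule lines, then one word per line.
    """
    counts = Counter()
    for line in report_text.splitlines():
        line = line.strip()
        # Skip blanks, headers, source lines and rule lines
        if (not line or line.startswith('Misspelled words:')
                or line.startswith('<') or set(line) == {'-'}):
            continue
        counts[line] += 1
    return counts.most_common()
```

The same tallies could then be loaded into a pandas dataframe for sorting and filtering, if preferred.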

One possible way of using that information is to identify terms that maybe aren’t in the dictionary but should be added to the whitelist. Another way of using that information might be to identify jargon or potential glossary terms. Reverse ordering the list is more likely to give you occasional typos; middling prevalence items might be common typos; and so on.

That recipe works okay, and could be used to support spell checking over a wide range of literate programming file formats (Jupyter notebooks, Rmd, various structured Python and markdown formats, for example). Basing the process around a format Jupytext exports into allows us to put a Jupytext step at the front of a small-pieces-lightly-joined text file pipeline that takes a literate programming document, converts it to e.g. Jupytext-md, and then passes it to the pyspelling pipeline.

But a problem with that approach is that we are throwing away perfectly good structure in the original document. One of the nice things about the ipynb JSON format is that it separates code and markdown in a very clean way (and by so doing makes things like my innovationOUtside/nb_quality_profile notebook quality profiler relatively easy to put together). So can we create our own ipynb filter for pyspelling?

Cribbing the markdown filter definition, it was quite straightforward to hack a first pass attempt at an ipynb filter that lets you extract the content of code or markdown cells into the spell checking pipeline:

# -- --

"""Jupyter ipynb document format filter."""

from .. import filters
import codecs
import markdown
import nbformat

class IpynbFilter(filters.Filter):
    """Spellchecking Jupyter notebook ipynb cells."""

    def __init__(self, options, default_encoding='utf-8'):

        super().__init__(options, default_encoding)

    def get_default_config(self):
        """Get default configuration."""

        return {
            'cell_type': 'markdown',  # Cell type to filter
            'language': '',  # This is the code language for the notebook
            # Optionally specify whether code cell outputs should be spell checked
            'output': False,  # TO DO
            # Allow tagged cells to be excluded
            'tags-exclude': ['code-fails']
        }

    def setup(self):
        """Setup."""

        self.cell_type = self.config['cell_type'] if self.config['cell_type'] in ['markdown', 'code'] else 'markdown'
        self.language = self.config['language'].upper()
        self.tags_exclude = set(self.config['tags-exclude'])

    def filter(self, source_file, encoding):  # noqa A001
        """Parse ipynb file."""

        nb =, as_version=4)
        self.lang = nb.metadata['language_info']['name'].upper() if 'language_info' in nb.metadata else None
        # Allow possibility to ignore code cells if language is set and
        # does not match parameter specified language? E.g. in extreme case:
        #if self.cell_type=='code' and self.config['language'] and self.config['language']!=self.lang:
        #    nb=nbformat.v4.new_notebook()
        # Or maybe better to just exclude code cells and retain other cells?

        encoding = 'utf-8'  # ipynb files are UTF-8 encoded JSON

        return [filters.SourceText(self._filter(nb), source_file, encoding, 'ipynb')]

    def _filter(self, nb):
        """Extract the source of each matching cell in the notebook."""

        text_list = []
        for cell in nb.cells:
            # Skip cells tagged with an excluded tag
            if 'tags' in cell['metadata'] and \
                    set(cell['metadata']['tags']).intersection(self.tags_exclude):
                continue
            if cell['cell_type'] == self.cell_type:
                text_list.append(cell['source'])
        return '\n'.join(text_list)

    def sfilter(self, source):
        """Filter notebook content provided as a string."""

        nb = nbformat.reads(source.text, as_version=4)
        return [filters.SourceText(self._filter(nb), source.context, source.encoding, 'ipynb')]

def get_plugin():
    """Return the filter."""

    return IpynbFilter

We can then create a config file to run a couple of matrix pipelines: one over notebook markdown cells, one over code cells:

# -- ipyspell.yml --

- name: Markdown
    lang: en
    - .wordlist.txt
    encoding: utf-8
  - pyspelling.filters.ipynb:
      cell_type: markdown
  - pyspelling.filters.url:
  - pyspelling.filters.markdown:
        - pymdownx.superfences:
  - pyspelling.filters.html:
      comments: false
      #  - '*|*:not(script,style,code)'
      #  - 'code > *:not(.c1)'
        - code
        - pre
        - tt
    - 'content/*/*.ipynb'
    #- '**/.md/*.md'
  default_encoding: utf-8
- name: Python
    lang: en
    - .wordlist.txt
    encoding: utf-8
  - pyspelling.filters.ipynb:
      cell_type: code
  - pyspelling.filters.url:
  - pyspelling.filters.python:
    - 'content/*/*.ipynb'
    #- '**/.md/*.md'
  default_encoding: utf-8

We can then run that config as: pyspelling -c ipyspell.yml > typos.txt

The following Python code then generates a crude dataframe of the results:

import pandas as pd

fn = 'typos.txt'
with open(fn, 'r') as f:
    txt = f.readlines()

# Parse the aspell report into (filename, cell_type, typo) records
rows = []
currfile = ''
cell_type = ''

for t in txt:
    t = t.strip('\n').strip()
    # Skip blank lines and report boilerplate
    if not t or t in ['Misspelled words:', '!!!Spelling check failed!!!'] or t.startswith('-----'):
        continue
    # Context header lines identify the source file and the cell type
    if t.startswith('<htmlcontent>') or t.startswith('<py-'):
        if t.startswith('<html'):
            cell_type = 'md'
        elif t.startswith('<py-'):
            cell_type = 'code'
        currfile = t.split('/')[-1].split('.ipynb')[0]#+'.ipynb'
        continue
    # Everything else is a flagged word
    rows.append({'filename': currfile, 'cell_type': cell_type, 'typo': t})

df = pd.DataFrame(rows, columns=['filename', 'cell_type', 'typo'])

The resulting dataframe lets us filter by code or markdown cell:
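For example, with a hypothetical sample of the parsed dataframe (the filenames and flagged words here are illustrative, not actual report output):

```python
import pandas as pd

# Hypothetical sample of the parsed typo report dataframe
df = pd.DataFrame([
    {'filename': '02.1 Robot programming constructs', 'cell_type': 'md', 'typo': 'datalog'},
    {'filename': '02.1 Robot programming constructs', 'cell_type': 'code', 'typo': 'pre'},
    {'filename': '02.1 Robot programming constructs', 'cell_type': 'md', 'typo': 'Microsfot'},
])

# Filter on the cell_type column to view each class of typo separately
md_typos = df[df['cell_type'] == 'md']
code_typos = df[df['cell_type'] == 'code']
```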

We can also generate reports over the typos found in markdown cells, grouped by notebook:

df_group = df[(df['filename'].str.startswith('0')) & (df['cell_type']=='md')][['filename','typo']].groupby(['filename'])
for key, group in df_group:
    # Count occurrences of each typo within the notebook
    print(group.value_counts(), "\n\n")

This gives basic results of the form:

Something that might be worth exploring is a tool that presents a user with a form that lets them enter (or select from a list of options?) a corrected version and that will then automatically fix the typo in the original file. To reduce the chance of false positives, it might also be worth showing the typo in its original context using the sort of display that is typical in a search engine results snippet, for example (eg ouseful-testing/nbsearch).

When Less is More: Data Tables That Make a Difference

In the previous post, From Visual Impressions to Visual Opinions, I gave various examples of charts that express opinions. In this post, I’ll share a few examples of how we can take a simple data table and derive multiple views from it that each provide a different take on the same story (or does that mean, tell different stories from the same set of "facts"?)

Here’s the original, base table, showing the recorded split times from a single rally stage. The time is the accumulated stage time to each split point (i.e. the elapsed stage time you see for a driver as they reach each split point):

From this, we immediately note the ordering (more on this in another post) which seems not useful. It is, in fact, the road order (i.e. the order in which each driver started the stage).

We also note that the final split is not the actual final stage time: the final split in this case was a kilometer or so before the stage end. So from the table, we can’t actually determine who won the stage.

Making a Difference

The times presented are the actual split times. But one thing we may be more interested in is the differences, to see how far ahead or behind one driver another driver was at a particular point. We can subtract one driver’s time from another’s to find this difference. For example, how did the times at each split compare to first on road Ogier’s (OGI)?

Note that we can “rebase” the table relative to any driver by subtracting the required driver’s row from every other row in the original table.
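As a minimal sketch of that rebasing step, using pandas and made-up split times (the driver codes are from the post, but the numbers are illustrative only):

```python
import pandas as pd

# Hypothetical accumulated split times (seconds), one row per driver
df = pd.DataFrame(
    {'split_1': [78.2, 77.9, 78.5],
     'split_2': [155.0, 154.1, 155.8],
     'split_3': [244.9, 243.2, 245.5]},
    index=['OGI', 'TAN', 'SOL'])

# Rebase relative to OGI: subtract OGI's row from every row;
# OGI's own row becomes all zeros
rebased = df - df.loc['OGI']
print(rebased)
```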

From this “rebased” table, which has fewer digits (less ink) in it than the original, we can perhaps more easily see who was in the lead at each split, specifically, the person with the minimum relative time. The minimum value is trivially the most negative value in a column (i.e. at each split), or, if there are no negative values, the zero value of the driver we rebased against.

As well as subtracting one row from every other row to find the differences relative to a specified driver, we can also subtract the first column from the second, the second from the third, and so on, to find the time it took to get from one split point to the next (we subtract 0 from the first split point time since the elapsed time into stage at the start of the stage is 0 seconds).

The above table shows the time taken to traverse the distance from one split point to the next; the extra split_N column is based on the final stage time. Once again, we could subtract one row from all the other rows to rebase these times relative to a particular driver to see the difference in time it took each driver to traverse a split section, relative to a specified driver.
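That column-wise differencing can be sketched in pandas (again with illustrative times, not real stage data):

```python
import pandas as pd

# Hypothetical accumulated split times (seconds)
df = pd.DataFrame(
    {'split_1': [78.2, 77.9], 'split_2': [155.0, 154.1]},
    index=['OGI', 'TAN'])

# Section traverse time: subtract the previous split's accumulated time.
# diff() leaves the first column as NaN, so restore the first split time
# (the elapsed time at the stage start is 0 seconds)
sections = df.diff(axis=1)
sections['split_1'] = df['split_1']
print(sections)
```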

As well as rebasing relative to an actual driver, we can also rebase relative to variously defined “ultimate” drivers. For example, if we find the minimum of each of the “split traverse” table columns, we create a dummy driver whose split section times represent the ultimate quickest times taken to get from one split to the next. We can then subtract this dummy row from every row of the split section times table:

In this case, the 0 in the first split tells us who got to the first split first, but then we lose information (without further calculation) about anything other than relative performance on each split section traverse. Zeroes in the other columns tell us who completed that particular split section traverse in the quickest time.

Another class of ultimate time dummy driver is the accumulated ultimate section time driver. That is, take the ultimate split sections then find the cumulative sum of them. These times then represent the dummy elapsed stage times of an ultimate driver who completed each split in the fastest split section time. If we rebase against that dummy driver:

In this case, there may be only a single 0, specifically at the first split.

A third possible ultimate dummy driver is the one who “as if” recorded the minimum actual elapsed time at each split. Again, we can rebase according to that driver:

In this case, there will be at least one zero in each column (for the driver who recorded the fastest actual elapsed time at that split).
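All three “ultimate” dummy drivers can be sketched as one-liners over the same made-up table of accumulated split times (times illustrative only):

```python
import pandas as pd

# Hypothetical accumulated split times (seconds)
df = pd.DataFrame(
    {'split_1': [78.2, 77.9, 78.5],
     'split_2': [155.0, 154.1, 155.8]},
    index=['OGI', 'TAN', 'SOL'])

# Per-section traverse times (the first section's time is the first split time)
sections = df.diff(axis=1)
sections['split_1'] = df['split_1']

# 1. Ultimate split section dummy: fastest traverse of each section
vs_ultimate_sections = sections - sections.min()

# 2. Accumulated ultimate dummy: cumulative sum of the fastest section times
vs_accumulated_ultimate = df - sections.min().cumsum()

# 3. Fastest actual elapsed dummy: minimum recorded time at each split
vs_fastest_elapsed = df - df.min()
```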

Visualising the Difference

Viewing the above tables as purely numerical tables is fine as far as it goes, but we can also add visual cues to help us spot patterns, and different stories, more readily.

For example, looking at times rebased to the ultimate split section dummy driver, we get the following:

We see that SOL was flying from the second split onwards, getting from one split to another in pretty much the fastest time after a relatively poor start.

The variation in columns may also have something interesting to say. SOL somehow made time against pretty much everyone between splits 4 and 5, but in the other sections (apart from the short last section to the finish), there is quite a lot of variability. Checking this view against a split sectioned route map might help us understand whether there were particular features of the route that might explain these differences.

How about if we visualise the accumulated ultimate split section time dummy driver?

Here, we see that TAN was recording the best time compared to the ultimate time as calculated against the sum of best split section times, but was still off the ultimate pace: it was his first split that made the difference.

How about if we rebase against the dummy driver that represents the driver with the fastest actual recorded accumulated time at each split:

Here, we see that TAN led the stage at each split point based on actual accumulated time.

Remember, all these stories were available in the original data table, but sometimes it takes a bit of differencing to see them clearly…

From Visual Impressions to Visual Opinions

In The Analytics Trap I scribbled some notes on how I like using data not as a source of "truth", but as a lens, or a perspective, from a particular viewpoint.

One idea I’ve increasingly noticed being talked about explicitly across various software projects I follow is the idea of opinionated software and opinionated design.

According to the Basecamp bible, Getting Real, [th]e best software takes sides. … [Apps should] have an attitude. This seems to lie at the heart of opinionated design.

A blog post from 2015, The Rise of Opinionated Software presents a widely shared definition: Opinionated Software is a software product that believes a certain way of approaching a business process is inherently better and provides software crafted around that approach. Other widely shared views relate to software design: opinionated software should have "a view" on how things are done and should enforce that view.

So this idea of opinion is perhaps one we can riff on.

I’ve been playing with data for years, and one of the things I’ve believed, throughout, in my opinionated way, is that it’s an unreliable and opinionated witness.

In the liminal space between wake and sleep this morning, I started wondering about how visualisations in particular could range from providing visual impressions to visual opinions.

For example, here’s a view of a rally stage, overlaid onto a map:

This sort of thing is widely recognisable to anybody who has used an online map, and anyone who has seen a printed map with a route drawn on it.

Example interactive map view

Here’s a visual impression of just the route:

View of route

Even this view is opinionated because the co-ordinates are projected to a particular co-ordinate system, albeit the one we are most familiar with when viewing online maps; but other projections are available.

Now here’s a more opinionated view of the route, with it cut into approximately 1km segments:

Or the chart can express an opinion about where it thinks significant left and right hand corners are:

The following view has strong opinions about how to display each kilometer section: not only does it make claims about where it thinks significant right and left corners are, it also rotates each segment so that the start and end point of the section lie on the same horizontal line:

Another viewpoint brings in another dimension: elevation. It also transforms the flat 2D co-ordinates of each point along the route to a 1-D distance-along-route measure allowing us to plot the elevation against a 1-D representation of the route in a 2D (1D!) line chart.

Again, the chart expresses an opinion about where the significant right and left corners are. The chart also chooses not to be as helpful as it could be: if vertical grid lines corresponded to the start and end distance-into-stage values for the segmented plots, it would be easier to see how this chart relates to the 1km segmented sections.

At this point, you may say that the points are "facts" from the data, but again, they really aren’t. There are various ways of trying to define the intensity of a turn, and there may be various ways of calculating any particular measure that give slightly different results. Many definitions rely on particular parameter settings (for example, if you measure radius of curvature from three points on a route, how far apart should those points be? 1m? 10m? 20m? 50m?)
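For illustration, one such measure (the circumradius of the circle through three consecutive route points) might be sketched as follows; the function and its naming are mine, not taken from any of the charts’ code:

```python
import math

def radius_of_curvature(p1, p2, p3):
    """Sketch: circumradius of the circle through three (x, y) points,
    using R = abc / (4 * area) for the triangle they form."""
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # Twice the triangle's signed area, via the cross product
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    area = abs(cross) / 2
    if area == 0:
        return math.inf  # collinear points: effectively a straight line
    return a * b * c / (4 * area)
```

Sampling the three points 10m apart rather than 50m apart can give quite different radii on the same bend, which is exactly the parameter sensitivity the paragraph above describes.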

The "result" is only a "fact" insofar as it represents the output of a particular calculation of a particular measure using a particular set of parameters, things that are typically not disclosed in chart labels, often aren’t mentioned in chart captions, and may or may not be disclosed in the surrounding text.

On the surface, the chart is simply expressing an opinion about how tight any of the particular corners are. If we take it at face value, and trust its opinion is based on reasonable foundations, then we can accept (or not accept) the chart’s opinion about where the significant turns are.

If we were really motivated to understand the chart’s opinion further, if we had access to the code that generated it we could start to probe its definition of "significant curvature" to see if we agree with the principles on which the chart has based its opinion. But in most cases, we don’t do that. We take the chart for what it is, typically accept it for what it appears to say, and ascribe some sort of truth to it.

But at the end of the day, it’s just an opinion.

The charts were generated using R based on ideas inspired by Visualising WRC Rally Stages With rayshader and R [repo].

Punk is an aesthetic I never really subscribed to…

… the mohicans, the fashion sense and the apparently nihilistic attitude, the appearance of potentially looming violence, the drugs of preference — needles have no place in recreation other than knitting (?!) — and that particular subculture…

…which isn’t to say I didn’t know folk who other people classed as punks, but we were not quite goths, not quite ‘evvy metal, sort of grebo, not quite hippy, not quite rock and not quite blues, not crusties, not travellers (hitchers, yes, most definitely), and definitely not ravers (though some probably were).

Three or four years ago I started hitting the road again, trekking round the country following Hands Off Gretel, who tend towards the punk aesthetic with dayglo overtones, but the tunes I like are the poppy ones of Nevermind era Nirvana, crossed with Pink, and voice to match in both respects.

Over the last few years, I’d seen Ferocious Dog t-shirts, hoodies, caps and more getting ever more prevalent at the festivals we frequent, but from the look of the stage photos they were “punk” so not my thing…

…till I heard them, of course, and the folk punk rock melodies and social political nature, the family feel of an FD gig and the merch you can’t not get a habit for once you get the habit means they are hugely habit forming…

…and despite the lockdown and the many tickets to gigs that keep getting rolled back, we had Thosdis to look forward to, and Red Ken’s lockdown sessions (what could possibly go wrong…) and amongst the classics (Ken only plays classics), some new bands to me I’d not really heard before, done as solo acoustics, from punk named bands but with melodies and rhythms to die for…

…so enter Rancid and Social Distortion to my regular listening mix….

…and a thought that maybe, maybe, I need to start listening a bit more widely to the punk rock back catalogue, because there are some cracking tunes out there, and even if the aesthetic isn’t your thing, the melodies may be…

… and some fantastically singalong-a-lyrics, particularly in the choruses…

Punk? Not me, not never, ever… But maybe, maybe, I need to rethink what I thought I thought I thought I understood by punk rock.

FWIW, I always thought of the Sex Pistols as a rock band (at least as far as Bollocks goes…); and Green Day; and Dog’s d’Amour (whom Social Distortion keep reminding me of….). And Iggy & the Stooges; and The Ramones. And the mother of all rock and roll bands: Motörhead.