Spellchecking Jupyter Notebooks with pyspelling

One of the things I failed to do at the end of last year was put together a spellchecking pipeline to try to pick up typos across several dozen Jupyter notebooks used as course materials.

I’d bookmarked pyspelling as a possible solution, but didn’t have the drive to do anything with it.

So with a need to try to correct typos for the next presentation (some students on the last presentation posted about typos but didn’t actually point out where they thought were so we could fix them) I thought I’d have a look at whether pyspelling could actual help having spotted a Github spellcheck action — rojopolis/spellcheck-github-actions — that reminded me of it (and that also happens to use pyspelling).

The pyspelling package uses a matrix and pipeline ideas. The matrix lets you define and run separate pipelines, the pipelines let you sequence a series of filter steps. Available filters include markdown, html and python filters that preprocess files and pass text elements for spellchecking to the spellchecker. The Python filter allows you to extract things like comments and docstrings and run spell checks over those; the markdown and HTML filters can work together so you can transform markdown to HTML, then ignore the content of code, pre and tt tags, for example, and spell check the rest of the content. A url filter lets you remove URLs before spellchecking.

By default, there is no Jupyter notebook / ipynb filter, so I started off by running the spellchecker against Jupytext markdown files generated from my notebooks. A filter to strip out the YAML header at the start of the jupytext-md file was there to help minimise false positive typos from the spell checker report.

In passing, I often use a Jupytext -pre-commit filter to commit a markdown version of Git committed notebooks to a hidden .md directory. For example, in .git/hooks/pre-commit, add the line: jupytext –from ipynb –to .md//markdown –pre-commit [docs]. Whenever you commit a notebook, a Jupytext markdown version of the notebook (ex- of the code cell output content) will also be added and commited into a .md hidden directory in the same directory as the notebook.

Here’s the first attempt a pyspelling config file:

# -- .pyspelling.yml --

matrix:
- name: Markdown
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.context:
      # Cribbed from pyspelling docs
      context_visible_first: true
      # Ignore YAML at the top of juptext-md file
      # (but may also exclude other content?)
      delimiters:
        - open: '(?s)^(?P<open> *-{3,})$'
          close: '^(?P=open)$'
  - pyspelling.filters.url:
  - pyspelling.filters.markdown:
      markdown_extensions:
        - pymdownx.superfences:
  - pyspelling.filters.html:
      comments: false
      ignores:
        - code
        - pre
        - tt
  sources:
    - '**/.md/*.md'
  default_encoding: utf-8

Note that the config also includes a reference to a custom wordlist in .wordlist.txt that includes additional whitelist terms over the default dictionary.

Running pyspelling using the above confguration runs the spell checker over the desired files in the desired way: pyspelling > typos.txt

The output typos.txt file then has the form:

Misspelled words:
<htmlcontent> content/02. Getting started with robot and Python programming/02.1 Robot programming constructs.ipynb: html>body>p
--------------------------------------------------------------------------------
accidently
--------------------------------------------------------------------------------

Misspelled words:
<htmlcontent> content/02. Getting started with robot and Python programming/02.1 Robot programming constructs.ipynb: html>body>p
--------------------------------------------------------------------------------
autorunning
--------------------------------------------------------------------------------

We can create a simple pandas script to parse the result and generate a report that counts the prevalence of particular typos. For example, something of the form:

datalog          37
dataset          32
pre              31
convolutional    19
RGB              17
                 ..
pathologies       1
Microsfot         1

One possible way of using that information is to identify terms that maybe aren’t in the dictionary but should be added to the whitelist. Another way of using that infomation might be to identify jargon or potential glossary terms. Reverse ordering the list is more likely to give you occasional typos; middling prevalence items might be common typos; and so on.

That recipe works okay, and could be used to support spell checking over a wide range of literate programming file formats (Jupyter notebooks, Rmd, various structured Python and markdown formats, for example). Basing the process around a format Jupytext exports into allows us to then have a Jupytext step at the front a small pieces lightly joined text file pipeline that takes a literate programming document, converts it to eg Jupytext-md, and then passes it to the pyspelling pipeline.

But a problem with that approach is that we are throwing away perfectly good structure in the orginal document. One of the nice things about the ipynb JSON format is that it separates code and markdown in a very clean way (and by so doing makes things like my innovationOUtside/nb_quality_profile notebook quality profiler relatively easy to put together). So can we create our own ipynb filter for pyspelling?

Cribbing the markdown filter definition, it was quite straightforward to hack a first pass attempt at an ipynb filter that lets you extract the content of code or markdown cells into the spell checking pipeline:

# -- ipynb.py --

"""Jupyter ipynb document format filter."""

from .. import filters
import codecs
import markdown
import nbformat

class IpynbFilter(filters.Filter):
    """Spellchecking Jupyter notebook ipynb cells."""

    def __init__(self, options, default_encoding='utf-8'):
        """Initialization."""

        super().__init__(options, default_encoding)

    def get_default_config(self):
        """Get default configuration."""

        return {
            'cell_type': 'markdown', # Cell type to filter
            'language': '', # This is the code language for the notebook
            # Optionally specify whether code cell outputs should be spell checked
            'output': False, # TO DO
            # Allow tagged cells to be excluded
            'tags-exclude': ['code-fails']
        }

    def setup(self):
        """Setup."""

        self.cell_type = self.config['cell_type'] if self.config['cell_type'] in ['markdown', 'code'] else 'markdown'
        self.language = self.config['language'].upper()
        self.tags_exclude = set(self.config['tags-exclude'])

    def filter(self, source_file, encoding):  # noqa A001
        """Parse ipynb file."""

        nb = nbformat.read(source_file, as_version=4)
        self.lang = nb.metadata['language_info']['name'].upper() if 'language_info' in nb.metadata else None
        # Allow possibility to ignore code cells if language is set and
        # does not match parameter specified language? E.g. in extreme case:
        #if self.cell_type=='code' and self.config['language'] and self.config['language']!=self.lang:
        #    nb=nbformat.v4.new_notebook()
        # Or maybe better to just exclude code cells and retain other cells?

        encoding = 'utf-8'

        return [filters.SourceText(self._filter(nb), source_file, encoding, 'ipynb')]

    def _filter(self, nb):
        """Filter ipynb."""

        text_list = []
        for cell in nb.cells:
            if 'tags' in cell['metadata'] and \
                set(cell['metadata']['tags']).intersection(self.tags_exclude):
                continue
            if cell['cell_type']==self.cell_type:
                text_list.append(cell['source'])
        
        return '\n'.join(text_list)

    def sfilter(self, source):
        """Filter."""

        return [filters.SourceText(self._filter(source.text), source.context, source.encoding, 'ipynb')]


def get_plugin():
    """Return the filter."""

    return IpynbFilter

We can then create a config file to run a couple of matrix pipelines: one over ntoebook markdown cells, one over code cells:

# -- ipyspell.yml --

matrix:
- name: Markdown
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.ipynb:
      cell_type: markdown
  - pyspelling.filters.url:
  - pyspelling.filters.markdown:
      markdown_extensions:
        - pymdownx.superfences:
  - pyspelling.filters.html:
      comments: false
      # https://github.com/facelessuser/pyspelling/issues/110#issuecomment-800619907
      #captures:
      #  - '*|*:not(script,style,code)'
      #ignores:
      #  - 'code > *:not(.c1)'
      ignores:
        - code
        - pre
        - tt
  sources:
    - 'content/*/*.ipynb'
    #- '**/.md/*.md'
  default_encoding: utf-8
- name: Python
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.ipynb:
      cell_type: code
  - pyspelling.filters.url:
  - pyspelling.filters.python:
  sources:
    - 'content/*/*.ipynb'
    #- '**/.md/*.md'
  default_encoding: utf-8

We can then run that config as: pyspelling -c ipyspell.yml > typos.txt

The following Python code then generates a crude dataframe of the results:

import pandas as pd

fn = 'typos.txt'
with open(fn,'r') as f:
    txt = f.readlines()

# aspell
df = pd.DataFrame(columns=['filename', 'cell_type', 'typo'])

currfile = ''
cell_type = ''

for t in txt:
    t = t.strip('\n').strip()
    if not t or t in ['Misspelled words:', '!!!Spelling check failed!!!'] or t.startswith('-----'):
        continue
    
    if t.startswith('<htmlcontent>') or t.startswith('<py-'):
        if t.startswith('<html'):
            cell_type = 'md'
        elif t.startswith('<py-'):
            cell_type = 'code'
        else:
            cell_type=''

        currfile = t.split('/')[-1].split('.ipynb')[0]#+'.ipynb'
        continue
        
    df = df.append({'filename': currfile, 'cell_type': cell_type,
                    'typo': t}, ignore_index=True)

The resulting dataframe lets us filter by code or markdown cell:

We can also generate reports over the typos found in markdown cells, grouped by notebook:

df_group = df[(df['filename'].str.startswith('0')) & (df['cell_type']=='md')][['filename','typo']].groupby(['filename'])
    
for key, item in df_group:

    print(df_group.get_group(key).value_counts(), "\n\n")

Thsi gives basic results of the form:

Something that might be worth exploring is a tool that present a user with form that lets them enter (or select from a list of options?) a corrected version and that will then automatically fix the typo in the original file. To reduce the chance of false positives, it might also be worth showing the typo in it’s original context using the sort of display that is typical in a search engine results snippet, for example (eg ouseful-testing/nbsearch).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...