Fragment: More Typo Checking for Jupyter Notebooks — Repeated Words and Grammar Checking

One of the typographical error types that isn’t picked up in the recipe I used in Spellchecking Jupyter Notebooks with pyspelling is the repeated word error type (for example, the the).

A quick way to spot repeated words is to use egrep on the command line over a set of notebooks-as-markdown (via Jupytext) files: egrep -o "\b(\w+)\s+\1\b" */.md/*.md

I do seem to get some false positives with this, but generating an output file of the report and then doing a quick filter on that would tidy things up.
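As a rough Python equivalent of the egrep one-liner, the following sketch reports the line number alongside each hit, which makes a later filtering pass easier. The function names are my own invention for illustration, and unlike the egrep pattern it ignores case, so "The the" is also caught. Like egrep, it works line by line, so a repeat split across a line break is missed.

```python
import re
from pathlib import Path

# Same idea as the egrep pattern: a word, some whitespace, then the same word.
# IGNORECASE is an addition here so that "The the" is also reported.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def find_repeated_words(text):
    """Return (line_number, matched_text) pairs for repeated words in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in REPEATED_WORD.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_markdown(root="."):
    """Yield (path, line_number, matched_text) for each .md file under root."""
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for lineno, matched in find_repeated_words(text):
            yield path, lineno, matched


if __name__ == "__main__":
    for path, lineno, matched in scan_markdown("."):
        print(f"{path}:{lineno}: {matched}")
```

Redirecting the printed report to a file gives something that can be filtered by hand or with a second script.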

An alternative route might be to extend pyspelling and look at tokenised word pairs for duplicates. Packages such as spacy also support things like Rule-Based Phrase Text Extraction and Matching at the token level as well as the regex level. Spacy also has extensions for hunspell [spacy_hunspell]. A wide range of contextual spell checkers are also available (for example, neuspell seems to offer a meta-tool over several of them), although care would need to be taken to avoid flagging US vs UK English spelling differences as typos. For nltk-based spell-checking, see e.g. sussex_nltk/spell.
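To give a feel for the tokenised word pair idea, here is a minimal sketch that uses a plain regular expression as a stand-in tokeniser rather than pyspelling or spacy. The allow-list of legitimate repeats and the function name are hypothetical. Only pairs separated purely by whitespace are flagged, so "No, no" is ignored, while a repeat split across a line break is caught.

```python
import re

# Hypothetical allow-list: words whose repetition can be legitimate prose.
ALLOWED_REPEATS = {"had", "that"}

# Stand-in tokeniser: runs of letters and apostrophes count as words.
TOKEN = re.compile(r"[A-Za-z']+")


def repeated_token_pairs(text, allowed=ALLOWED_REPEATS):
    """Return (word, offset_of_second_word) for each suspicious repeat."""
    matches = list(TOKEN.finditer(text))
    hits = []
    for prev, curr in zip(matches, matches[1:]):
        gap = text[prev.end():curr.start()]
        same_word = prev.group(0).lower() == curr.group(0).lower()
        # Flag only when nothing but whitespace separates the two tokens.
        if same_word and gap.strip() == "" and curr.group(0).lower() not in allowed:
            hits.append((curr.group(0).lower(), curr.start()))
    return hits
```

Swapping in a real tokeniser should only mean replacing the TOKEN.finditer call with something that yields tokens and their character offsets.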

Note that adding an autofix would be easy enough, but it risks false positives where a text contains a legitimate repeated word pair (for example, had had). Falsely autocorrecting that, then detecting the created error / tracking down the incorrect deletion so it can be repaired, would be non-trivial.

Increasingly, I think it might be useful to generate a form with suggested autocorrections, and checkboxes pre-checked by default, that could be used to script corrections. It could also generate a change history.
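A minimal sketch of what that might look like, assuming a plain-text checklist stands in for the form: each suggestion is accepted by default, can be unticked before applying, and every applied change is logged. The Suggestion class and the function names are hypothetical and not part of any existing tool.

```python
import re
from dataclasses import dataclass

REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


@dataclass
class Suggestion:
    start: int           # character offset where the repeated pair begins
    end: int             # character offset just past the repeated pair
    original: str        # the text as found, e.g. "the the"
    replacement: str     # the proposed fix, e.g. "the"
    accept: bool = True  # "pre-checked" by default


def suggest_fixes(text):
    """Propose collapsing each repeated word pair to a single word."""
    return [
        Suggestion(m.start(), m.end(), m.group(0), m.group(1))
        for m in REPEATED_WORD.finditer(text)
    ]


def render_checklist(suggestions):
    """Render suggestions as a checklist a reviewer could edit."""
    lines = []
    for i, s in enumerate(suggestions):
        box = "x" if s.accept else " "
        lines.append(f"- [{box}] {i}: {s.original!r} -> {s.replacement!r}")
    return "\n".join(lines)


def apply_fixes(text, suggestions):
    """Apply accepted suggestions; return the new text and a change history."""
    history = []
    # Work backwards so that earlier offsets remain valid after each edit.
    for s in sorted(suggestions, key=lambda s: s.start, reverse=True):
        if s.accept:
            text = text[:s.start] + s.replacement + text[s.end:]
            history.append((s.start, s.original, s.replacement))
    history.reverse()  # report changes in document order
    return text, history
```

The history records the original offset and text of each change, which is the information needed to track down and revert a false correction later.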

For checking grammar, the Java-based LanguageTool seems to be one of the most popular tools out there, being as it is the engine behind the OpenOffice spellchecker. Python wrappers are available for it (for example, jxmorris12/language_tool_python).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
