On Not Porting My Classic Jupyter Notebook Extensions to JupyterLab

With the deprecation of the classic notebook, and future notebook releases to be based on JupyterLab UI components, we’ll have to make a decision somewhen soon about whether to pin the notebook version in the environment we release to students on September-start modules to <7, move to RetroLab/the new notebook, or move to JupyterLab. Or go somewhere else.

If we move, a question arises as to the extent to which we try to preserve extended functionality and theming.

Currently, our classic notebook environment:

  • has custom branding (the institutional logo replaces the Jupyter logo);
  • uses “official unofficial” off-the-shelf classic notebook extensions;
  • uses in-house developed classic notebook extensions;
  • uses in-house developed ipywidgets (via jp_proxy_widget);
  • uses in-house developed and off-the-shelf magics.

In a move to JupyterLab UIs, the magics (MUST HAVE) should continue to work; the jp_proxy_widget applications (MUST HAVE) should work as long as that package continues to work; the branding (SHOULD HAVE) should be fixable (I was going to say “how hard can it be?” to replace a logo, but I guess this is JupyterLab, so maybe REALLY HARD?! ;-); the off-the-shelf extensions (SHOULD HAVE tending to MUST HAVE) may get ported or may already have JupyterLab equivalents (see, for example, Scoping Out JupyterLab Extensions); and the custom notebook extensions (at the end of the day, these are probably strictly SHOULD HAVE rather than MUST HAVE, because materials are intended to degrade gracefully without them, but they are part of our value add and student experience proposition, so it would be nice to keep them) will not (at least, not by me) get ported.

So what are we to do?

I think there are two sorts of decision that need to be made with a view to presentation:

  • for modules already in presentation, do they stay with classic notebook or do we try to migrate?
  • for modules in production, which environment do they develop notebooks for? Options are: classic notebook (I think this would be foolish); new notebook / JupyterLab (sensible option if the Jupyter environment is adopted); alternative environment (for example, VS Code, RStudio, Quarto, other). When it comes to alternatives, I think there should also be a consideration of more radical alternatives where the full notebook mechanic may not necessarily be required; Jupyter Book with ThebeLab, for example, if some way can be found to support persisting code edits (eg to browser storage).

I also think we should take the opportunity to explore alternative authoring and editing environments, as well as quality process automation (for example, Using Automation To Support Quality Assured Jupyter Notebook Publishing Processes): I tend to move between classic notebook, RStudio and VS Code, and if/when a standalone Quarto editor is made available, I’ll probably also try that for certain side-projects. Others might prefer things like Quarto, which is a bit more word-processor hand-holdy (see, for example, Previewing Richly Formatted Jupyter Book Style Content Authored Using MyST-md).

One of the most widely used of our custom extensions colour-highlights notebook cells:

Poking around, it seems that there is a JupyterLab extension that allows you to set cell CSS style via metadata: educational-technology-collective/etc_jupyterlab_cell_properties. This looks ridiculously complicated to me (maybe it needs an architecture statement?!), and seems to require a server extension among other things… (What is it that I keep saying about JupyterHub providing a really hostile architecture to have-a-go end user developers?!).

This makes me a little concerned: will the extension support cell highlighting in RetroLab? (Part of the promise of RetroLab for me was that it shared similar underpinnings to JupyterLab, but does that extend to “server companion” elements? And HTML / CSS structure?) Will the extension even run in JupyterLab in JupyterLite?

Whatever. I shall try to remain positive… When run in MyBinder, for the source repo at least (though I note that uses an old version of JupyterLab?), it seems to work (though I note the JupyterLab metadata editor is really shonky – sometimes it lets you save, sometimes it doesn’t…?).

I also note the extension calls on a tornado decorator at one point, and I thought JupyterLab was trying to dump that package?

It would be more convenient if you could create a cell tag and then associate custom styling with it (I don’t know if / how JupyterLab extensions have a simple configurator panel set-up like the classic jupyter_nbextensions_configurator?) but it sort of provides a workflow…?

And there is a migration path, of a sort. Our current notebooks use cell tags to identify cells for styling, so it would be easy enough to create a simple notebook processor that would check each cell for styling tags, and then add styling metadata to a cell as appropriate. Hugely inefficient and horrible (it would be so much easier if a cell tag could propagate into the UI DOM at an appropriate level, then to be styled via a custom CSS stylesheet, perhaps set via a cell styletags configurator) but workable as a workaround?
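For example, a minimal sketch of that sort of processor using nbformat (the "style" metadata key and the tag-to-CSS mapping below are placeholders of my own; whatever metadata structure the cell properties extension actually expects would need checking):

import nbformat

# Hypothetical mapping from our existing styling cell tags to CSS declarations
TAG_STYLES = {
    "style-activity": {"background-color": "lightblue"},
    "style-commentate": {"background-color": "lightyellow"},
}

def tags_to_style_metadata(path):
    """Copy recognised styling cell tags into cell styling metadata."""
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        for tag in cell.metadata.get("tags", []):
            if tag in TAG_STYLES:
                # "style" is a placeholder key, not necessarily the key the extension reads
                cell.metadata["style"] = TAG_STYLES[tag]
    nbformat.write(nb, path)

# Usage: tags_to_style_metadata("notebook.ipynb")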

Only one way to find out, I guess, so I’ll have to try it and see…

PS I’d originally started this post, if not in an upbeat mood about JupyterLab, at least with a moderately hopeful sense. The jupyterlab-myst extension offers the promise of previewing various bits of MyST enhanced syntax in JupyterLab notebooks (though I haven’t tried or seen the extension working in RetroLab or JupyterLab in JupyterLite…), as briefly reviewed in the previous post, and the educational-technology-collective/etc_jupyterlab_cell_properties extension looked like it could be used in a hacky way to render styling rewritten into cell metadata based on tags in my original notebooks.

But on reflection, I find myself falling back to my original position: that JupyterLab is not fit for our purposes. For a start, it’s too complicated. The cell properties extension does seem to work with an old build of JupyterLab, and even if it still works with recent builds, how long is that likely to continue, particularly given the extension also seems to require some sort of service level hacking? The repo does not appear to be maintained, and given the apparent complexity of it (I am not a developer; I don’t have a clue how either JupyterLab, or typescript, or the build process works etc etc), we’re f****d if it breaks and I’m supposed to fix it (and we need things to keep working for years). I have no idea whether the extension works in RetroLab, or JupyterLite environments, or what would be involved to make it work if it doesn’t (or whether it is a bespoke JupyterLab desktop server solution). A comment I made in an issue to the jupyterlab-myst repo on leveraging cell tags in display fills me with no hope that there is an easy way to leverage cell tag data or cell metadata. If I recall correctly, cell tags were quite late to the notebook party, so it may be that architectural decisions were made early on that make it difficult for cell content processors to exploit cell metadata / cell tags from the notebook file, or the surrounding presentation layer (HTML DOM, CSS elements, etc.). I have no idea; if it is easy, I’m surprised I haven’t stumbled across some easy-to-follow demos around that which I would have tried to co-opt for my own purposes? From the cell properties extension, which appears to require mucking about with a server extension, it does look like it is not-just-for-me complicated!

Every time I try to show willing and get started with JupyterLab, it just seems to push me further and further away.

Fragments – Previewing Richly Formatted Jupyter Book Style Content Authored Using MyST-md

Of the many flavours of markdown out there that can be used to author executable documents (Rmd, jupytext-md, and variants thereof, MyST-md), I find myself increasingly making use of MyST-md and Jupyter Book with various plugins to render the content to HTML (I haven’t tried rendering to other formats such as PDF, docx, or various ebook formats yet).

One of the major blockers to wider adoption that I’d expect “colleagues” to raise is that raw markdown is too hard to write in and there’s no easy preview, let alone WYSIWYG, view.

But that appears to be changing, with several demos and proofs of concept around MyST-flavoured authoring now in play. It’ll be interesting to see whether this crop of tools provides enough of a basis for folk to start exploring MyST, and maybe driving some competition that will encourage further adoption and development.

So, what’s possible today? Let’s look at four possibilities: the JupyterLab-MyST extension, Irydium notebooks, the Curvenote editor and the VS Code myst-vs-code extension.

JupyterLab-MyST Extension

Perhaps most interestingly, a proof-of-concept JupyterLab extension, very much under development, in the form of jupyterlab-myst (Binder demo available, although the long build time and the fast rate of updates to the repo mean you might well have to wait for a build to complete!).

The extension provides rendered views of an increasing amount of MyST native enhanced markdown features (I’m not sure if it will extend to rendering elements provided by plugins?), with markdown entered as normal in the edit view of a markdown cell.

Demo of executablebooks/jupyterlab-myst extension in action

I’m not sure if the extension works (yet?) in the pure-browser JupyterLite environment?

One thing I’d quite like to see is the ability to use a cell tag to configure a markdown cell as a special content block rather than have to use a labelled code fenced block.

Some more examples of the jupyterlab-myst extension in action, at least as it stands today, including tables and footnotes (which are currently rendered at the bottom of a cell, and numbered relative just to the cell?):

I also note that nested admonitions work, term definitions almost work, and Sphinx function docs style rendering still needs work:

In Jupyter Book land, there are quite a few extensions that add further richness to content rendered via Jupyter Book, including dropdowns/sphinx-togglebutton, sphinx-panels, sphinx-tabs, proofs, exercises and so on. It’s not immediately clear to me whether there is an obvious path for incorporating those rendering extensions in the views that jupyterlab-myst can render, or whether the Sphinx extensions and the JupyterLab extension could share some of the same templates and styling? (I would hope they could share some of it at least…)

This all makes me wonder how easy it is for the end-user to define their own special content block tags and styles. If a really simple recipe could be found for supporting this sort of end-user development in the Jupyter Book / MyST context, as well as in the jupyterlab-myst extension, it would go a long way towards making JupyterLab / RetroLab more palatable to me. (For example, would a “custom block proxy” be possible, where you define the code fenced block label and provide a template for how to render the content within the block?)

See also: A Personal Take on Customising Jupyter Via End-User Innovation

Irydium

The irydium project is a web-native browser based notebook offering MyST based authoring, in-browser Python code execution via Pyodide, and interactive graphical output via plotly charts. Javascript flavoured coding is also available using Observable Plot, which I think sits on top of d3.js.

The Irydium browser provides a split window display, with a MyST markdown text editor on the left and a previewer on the right. The two windows appear not to be locked – each scrolls independently of the other. As you type in the editor, the previewer tries to display a live preview, although as the notebook gets longer the previewer gets slower and slower…

Basic admonitions are supported, although the {admonition} block, which lets you set the block title, isn’t; nested blocks don’t seem to work, nor do the :class: attributes; the dropdown block does not appear to be supported (yet?) either.

Python and Javascript code cells are supported. Python code execution is provided via a WASM Python environment by default. It doesn’t look like you can print Hello World? But you can (somehow?) pass state to a plotly Javascript widget.

Tables and tabbed content don’t appear to work (yet?) but panels do:

Maths notation / equations don’t appear to work (do we need to install a js package?), nor do footnotes:

Perhaps the biggest bugbear for me is that you need to log in to save the notebook. It would be nice to be able to:

  • save and load named notebooks to browser storage, or to disk via the filesystem API;
  • export / load myst-md files to / from the desktop.

Curvenote Editor

Compared to the side-by-side “source and preview” approach offered by Irydium, the Curvenote editor is a WYSIWYG web-based editor (open source, demo) that fronts the Curvenote publishing, collaboration and reuse platform.

Basic Markdown elements (headers, emphasis, lists, etc.) are entered as raw markdown, and code cells are triggered by entering three back ticks. Other custom blocks are created via a slash command that indexes into a drop down list:

Tables are created via a /table command:

Additional rows / columns can be added via a pop-up toolbar, or slash commands inside the table:

Footnotes and equations are also entered from slash commands.

Slash commands can be invoked inline, for example, when adding a footnote:

The document source is available as MyST-md (somehow???); blocks, equations and tables are all generated as expected, but not the footnote (which appears as a pop-up in the HTML, rather than appearing inline at the foot of the page), unless I’m doing it wrong.

For example, for this WYSIWYG view:

The following MyST-md is generated:

I’d prefer it if the table were not enclosed in a block.

For folk who like WYSIWYG editors, and who are comfortable with / commands, for example, as used in the WordPress Gutenberg block editor, the editor makes it easy to flip between edit and rendered views of styled elements, with the syntactic boilerplate provided for you. Styled blocks are invoked inline from slash commands and selected from a filtered list.

This approach balances WYSIWYG and “toolbar button” applied styling or raised wizards, and appears to be “a thing” in many (developer driven?) environments. (Keyboard shortcuts are also supported, but I’m not sure how deep they go? Eg can you raise the table editor or a note block from a keyboard shortcut?) But it goes against my personal preferences (I really dislike the WordPress editor, for example: I’d rather just write in markdown.)

VS Code myst-vs-code Extension

The executablebooks/myst-vs-code extension provides rich preview rendering of MyST markdown documents. The rich preview view is rendered in a separate panel to the edited document.

For example, we can preview various flavours of special content block / admonition block:

We can also define footnotes, which are rendered at the bottom of the preview, and create tables from markdown table syntax.

Equations / maths notation didn’t seem to render for me though because I couldn’t seem to find a way to enable additional extensions?

I also wonder how additional extensions are added, and what these have in common with the Sphinx extensions? Eg how easy would it be to add support for the exercise, panel or proof extensions?

The previewer doesn’t work if the shd101wyy/markdown-preview-enhanced extension is enabled, though (my preferred VS Code markdown extension; there is also a slight variation in the syntax preferred by markdown-preview-enhanced compared to MyST, as noted in this related issue).

See also: VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation.

Fragment – R and CPython REPLs Running Purely in the Browser via WASM

In passing, I note a console based R REPL running in the browser via WASM, courtesy of georgestagg/webR (demo here).

My immediate thought was: so how easy would it be to wrap the underlying R environment as a JupyterLite kernel? (Related issue here.) I guess an open question then is: will R packages just install as cross-platform PyPI wheels do, or will compiling be necessary?

For a related tale about the joys of trying to get various programming languages to run via WASM, see FORTRAN In The Browser.

In further passing, I note the availability of Build scripts and configuration for building CPython for Emscripten at ethanhs/python-wasm, with a REPL demo here.

SQLite Functions for Retrieving Sentences and Paragraphs from a Text

As part of my sin eater correspondence debate unproject, one of the things I keep pondering is effective ways of retrieving content from the corpus database for inclusion, or potentially even transclusion (which is to say, dynamically included content), in the text.

As I seem to be on something of a roll with regards to creating dodgy SQLite functions at the moment (see previous posts), here are a couple of possible queries for extracting sentences and paragraphs.

For the sentence extraction, we can use a spacy pipeline component:

# Assumes spacy, the en_core_web_sm model and the sqlite-vtfunc package are installed
import en_core_web_sm
from vtfunc import TableFunction

nlp = en_core_web_sm.load()
sentencizer = nlp.add_pipe("sentencizer")

class SQLite_Sentenciser(TableFunction):
    params = ['start', 'end', 'text']
    columns = ['sentence', 'sent_idx']
    name = 'get_sentences'

    def initialize(self, start=None, end=None, text=None):
        doc = nlp( text )
        # Use inclusive bounds; start index: 0
        self._iter = zip(list(doc.sents)[start:end+1], range(start,end+1))

    def iterate(self, idx):
        sentence, sent_idx = next(self._iter)
        return (sentence.text, sent_idx, )

# And register the function
SQLite_Sentenciser.register(db.conn)

Calling this in a query returns sentences:

# `q` is a full text search term, e.g. q = '"sin eater"'
_q = f"""
SELECT title, s.*
FROM sources_fts, get_sentences(4, 5, sources_fts.text) AS s
WHERE sources_fts MATCH {db.quote(q)}
LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'sentence': 'From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them.',
  'sent_idx': 4},
 {'title': 'ANSWERS.',
  'sentence': "A loaf of bread was provided, which the 'sin-eater' first placed upon the dead man's chest, then muttered some incantations over it, finally eating it.",
  'sent_idx': 5}]
"""

For the paragraph extraction, we need to create our own function – or crib one!

#https://gist.github.com/wpm/bf1f2301b98a883b50e903bc3cc86439
def paragraphs(document):
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]

# Example usage:
paras  = list(paragraphs(nlp("""This is a sentence. And another.\n\nA new sentence.\n\nA new para.""")))
paras
# For the text: [p.text for p in paras]

"""
[This is a sentence. And another.,
 
 
 A new sentence.,
 
 
 A new para.]
"""

We can then use this function to extract paragraphs, in this case using the rule that paragraphs are sequences of text separated by more than one end-of-line character:

class SQLite_Paragraphiser(TableFunction):
    params = ['start', 'end', 'text']
    columns = ['paragraph', 'para_idx']
    name = 'get_paragraphs'

    def initialize(self, start=None, end=None, text=None):
        doc = nlp( text )
        # Use inclusive bounds; start index: 0
        self._iter = zip(list(paragraphs(doc))[start:end+1], range(start,end+1))

    def iterate(self, idx):
        (paragraph, idx) = next(self._iter)
        return (paragraph.text.strip(), idx,)

# And register the function
SQLite_Paragraphiser.register(db.conn)

And in use:

# Bad example - the record only has one para.

_q = f"""
SELECT title, s.*
FROM sources_fts, get_paragraphs(0,2,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'paragraph': 'AN EXTRAORDINARY WELSH SUPERSTITION.\nThe great majority of your readers will have read the account of the "scapegoat" ceremony recorded in the Sacred Volume. Some, too, will remember a striking parallel custom (or, to put it more correctly, a parallel custom with a disparity) observed by the North American Indians. From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them. A loaf of bread was provided, which the \'sin-eater\' first placed upon the dead man\'s chest, then muttered some incantations over it, finally eating it. By this he was believed to have taken from the deceased the heavy weight of his sins, appropriating them to himself, for which act of kindness he was regarded by everybody as an outcast. After the ceremony was finished, and he had received his pay, the \'sin-devourer\' vanished in double-quick time, it being the usual custom for the bereaved friends to belabour him with sticks."\nH., Newport, Monmouthshire.',
  'para_idx': 0}]
"""

With these two functions, I can now pull out one or more contiguous paragraphs, or one or more contiguous sentences, from each database record.

We can also create a custom aggregation function that can be applied to grouped rows. For example, suppose we want to join the first two sentences retrieved from every book, by book.

The custom aggregator is a class that provides step() and finalize() methods:

class SentenceJoin:
    """Join sentences by group."""

    def __init__(self):
        self.sentences = []

    # Define action to be taken for each row
    def step(self, item):
        self.sentences.append(item)

    # Return final aggregated value
    def finalize(self):
        return ' '.join(self.sentences)
    
db.conn.create_aggregate('sentence_join', 1, SentenceJoin)

We can then call the function as a grouping function:

_q = f"""
SELECT title, sentence_join(s.sentence) AS js
FROM sources_fts, get_sentences(4,5,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} 
GROUP BY sources_fts.title
LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'js': 'From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them. A loaf of bread was provided, which the \'sin-eater\' first placed upon the dead man\'s chest, then muttered some incantations over it, finally eating it.'}]
"""

Fuzzy Search With Multiple Items Table Return in SQLite

In Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite I reviewed several tools that could be used to support fuzzy search, giving an example of a scalar SQLite custom application function that could be used to retrieve a document containing a fuzzy matched search term.

In Custom SQLite Functions That Return A Table: A SQLite Spell-Checker, I showed how we can create functions that can return multiple values via a virtual table.

In this post, I’ll describe a couple of custom virtual table generating functions that support fuzzy searching within document records and the return of document fragments, one per row, associated with the matches.

The trick to creating table returning functions is provided by Charles Leifer’s coleifer/sqlite-vtfunc package, as described in SQLite Table-Valued Functions with Python. The demos for that package show how to create a regular expression search function:

#https://charlesleifer.com/blog/sqlite-table-valued-functions-with-python/
#https://github.com/coleifer/sqlite-vtfunc
#%pip install git+https://github.com/coleifer/sqlite-vtfunc.git

import re

from vtfunc import TableFunction

class RegexSearch(TableFunction):
    params = ['pattern', 'search_string']
    columns = ['match']
    name = 'regex_search'

    def initialize(self, pattern=None, search_string=None):
        self._iter = re.finditer(pattern, search_string)

    def iterate(self, idx):
        # We do not need `idx`, so just ignore it.
        return (next(self._iter).group(0),)

# The `db` object is a sqlite_utils Database object
from sqlite_utils import Database
db = Database('test.db')

# Register the function with SQLite
RegexSearch.register(db.conn)

Here’s what the underlying regular expression matching looks like if we try it out directly with re.finditer:

results = re.finditer(
            '<a[^\>]+?href="([^\"]+?)"[^\>]*?>([^\<]+?)</a>',
            'Junk  <a href="http://www.bbc.co.uk">Link text</a> wrapper; Junk  <a href="http://www.open.co.uk">Link text</a> wrapper')

next(results).groups(), next(results).groups()

"""
(('http://www.bbc.co.uk', 'Link text'), ('http://www.open.co.uk', 'Link text'))
"""

It’s not too hard to crib this to create a partial regex matcher using the regex package. We can extend the returned table to give the start and end character locations of the matched term in the response table:

#%pip install regex
import regex

class RegexSearch2(TableFunction):
    params = ['pattern', 'search_string']
    columns = ['match', 'start', 'end']
    name = 'regex_search2'

    def initialize(self, pattern=None, search_string=None):
        self._iter = regex.finditer(pattern, search_string)

    def iterate(self, idx):
        item = next(self._iter)
        return (item.group(0), item.start(), item.end(),)

RegexSearch2.register(db.conn)

Here’s how it works:

query_params = ("(sin eater){e<=1}", "The bin eater is a sin-eater thing...")

for i in db.execute('SELECT * FROM regex_search2(?, ?);', query_params):
    print(i)

"""
('bin eater', 4, 13)
('sin-eater', 19, 28)
"""

We can also create fuzzy matchers using the fuzzysearch.find_near_matches() function:

#%pip install fuzzysearch
from fuzzysearch import find_near_matches

class FindNearMatches(TableFunction):
    params = ['target_string', 'search_string']
    columns = ['matched', 'start', 'end', 'dist']
    name = 'find_near_matches'

    def initialize(self, target_string=None, search_string=None):
        self._iter = iter(find_near_matches(target_string, search_string, max_l_dist=2))

    def iterate(self, idx):
        r = next(self._iter)
        return (r.matched, r.start, r.end, r.dist,)

# And register the function
FindNearMatches.register(db.conn)

Here’s how it looks:

query_params = ('PATTERN', '--PRTTERN-PATERN---PATEN----PATN--PATREN-')

for i in db.execute('SELECT *, matched FROM find_near_matches(?, ?);', query_params):
    print(i)

"""
('PRTTERN', 2, 9, 1, 'PRTTERN')
('PATERN', 10, 16, 1, 'PATERN')
('PATEN', 19, 24, 2, 'PATEN')
('PATREN', 34, 40, 2, 'PATREN')
"""

One disadvantage of the above is that we have hardwired the max_l_dist parameter; but we can add in any number of additional parameters as we require:

class FindNearMatches2(TableFunction):
    params = ['target_string', 'search_string', 'max_l_dist']
    columns = ['matched', 'start', 'end', 'dist']
    name = 'find_near_matches2'

    def initialize(self, target_string=None, search_string=None, max_l_dist=2):
        self._iter = iter(find_near_matches(target_string, search_string, max_l_dist=max_l_dist))

    def iterate(self, idx):
        r = next(self._iter)
        return (r.matched, r.start, r.end, r.dist,)

# And register the function
FindNearMatches2.register(db.conn)

Here’s how it looks this time:

query_params = ('PATTERN', '--PRTTERN-PATERN---PATEN----PATN--PATREN-', 1)
for i in db.execute('SELECT *, matched FROM find_near_matches2(?, ?, ?);', query_params):
    print(i)

"""
('PRTTERN', 2, 9, 1, 'PRTTERN')
('PATERN', 10, 16, 1, 'PATERN')
"""

We can, of course, use the functions as part of a query that passes in the text values from other tables. So for example, here is a test table:

fuzzytests = db["fuzzytests"]
 
# Add some rows
fuzzytests.insert({'id':1,
                   'text':'Some text about the bin-eater.'})
fuzzytests.insert({'id':2,
                   'text':'Something about sin eating'})
fuzzytests.insert({'id':3,
                   'text':'Nothing to see here...'})

And we can run some queries against it:

# Run our fuzzy searcher over the rows
for i in db.execute(f'SELECT * FROM fuzzytests, find_near_matches2("sin eater", fuzzytests.text, 2);'):
    print(i)

"""
(1, 'Some text about the bin-eater.', 'bin-eater', 20, 29, 2)
(2, 'Something about sin eating', 'sin eatin', 16, 25, 2)
"""

Other fuzzy search tools are available, of course. So here’s an example using the gandersen101/spaczz fuzzy matcher for spaCy:

import en_core_web_sm
from spaczz.matcher import FuzzyMatcher

nlp = en_core_web_sm.load()

class SQLite_FuzzyMatcher(TableFunction):
    params = ['target_string', 'search_string']
    columns = ['matched', 'start', 'end', 'ratio']
    name = 'find_fuzzy_matches'
    
    def initialize(self, target_string=None, search_string=None):
        doc = nlp( search_string )
        matcher = FuzzyMatcher(nlp.vocab)
        # We can tune the sensitivity for the match; default is 75
        matcher.add(target_string, [nlp(target_string)], kwargs=[{"min_r2": 75}])
        self._iter = iter(matcher(doc))

    def iterate(self, idx):
        # ratio is a quality score for the match
        matched, start, end, ratio = next(self._iter)
        return (matched, start, end, ratio,)

# And register the function
SQLite_FuzzyMatcher.register(db.conn)

And when we run it?

_q = f"""
SELECT *
FROM fuzzytests, find_fuzzy_matches('sin eater', fuzzytests.text) AS fz ;
"""

list(db.query(_q))

"""
[{'id': 1,
  'text': 'Some text about the bin-eater.',
  'matched': 'sin eater',
  'start': 4,
  'end': 7,
  'ratio': 78}]
"""

Next up – a porter stemmer search tool, maybe?!

A SQLite Function for Finding Repeated Phrases in Fairy Tales, Part I

Way back when, when I first started looking at virtualised environments for distributing student materials, I came across a project created by @DataminerUK called infinite interns. The idea was to create an “intern” for each stage of a data investigation (a particular instance of a virtualised environment) that would be “virgin” and “uncontaminated”, sandboxed and reproducible. The insight was that you’d have a fresh pair of hands each time you wanted to perform a task, set up just so, in a very particular way for the task at hand. And when the task was done you’d just throw that infinite intern away. (At this point, you could probably make some cultural observations about how interns might be treated in particular industries!)

Another way of considering infinite interns might be to consider them as a scaleable resource for performing a particular task, repeatedly, with no real cost to yourself. In coding terms, the infinite intern might thus also correspond to the repeated application of a particular function, which is to say, a set of instructions that are bundled together under a memorable name that can perform a particular task: hey, you, make_my_tea("egg and chips"); and so it will be done.

What this means is that you can create a function once, and then use it repeatedly: hey, you, make_my_tea("fishfinger sandwich").

But what’s even better is that we can find ways of scaling how we call functions even more. If I have a database containing tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, even, of records, how powerful would it be if we could write a function once and then apply it to each of those tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, even, of records from a single command? Pretty powerful, right?

In the previous two scarily named blogposts — Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite and Custom SQLite Functions That Return A Table: A SQLite Spell-Checker — I described how we can define custom Python programming functions for the simple, file based SQLite database that can return either a single valued item, or a whole set of items, from each record in the database.

That probably sounds more complicated than it actually is. For example, suppose each separate record in the database describes what I had for tea on a particular day. And then, with a single command, I could search that database for everything I’ve had for tea, ever, and call the make_my_tea() function separately for each item. But being a database, I could also search for just those items that I ate on a Friday, or during a particular month of a particular year, or that explicitly involved eggs. And get those made for me, all in one go.

Now I don’t happen to have a database of things I’ve had for tea in the past, but in the blog post Searchable SQLite Database of Andrew Lang Fairy Stories I did describe a simple database I had created around Andrew Lang’s coloured fairy story books.

This database contains several hundred stories, each as a separate database record, each one from two to twenty or so pages of text. It’s the sort of resource that would take quite a long time to read through if you were to retrieve every story and then read them all. One reason I created the database was so that I could more easily search for stories that just involved a king or queen or three sons.

_q = f"""
SELECT * FROM books_fts
WHERE books_fts MATCH {db.quote('"pretty hen"')} ;
"""

for row in db.query(_q):
    print(row["title"])

# For future convenience, grab the text for the last reported item
story_text = row["text"]

"""
The House In The Wood
"""

But it also struck me that I should also be able to search for repeated refrains within a particular story. You undoubtedly know the sort of thing: the phrase “and he huffed and he puffed and he” that appears three times in the story of the Three Little Pigs.

If the repeated phrase is four words long, one way of trying to find such a phrase is to take the first four words of the text, and then compare them to every set of four contiguous words in the text. If the repeated phrase crops up again, you will get an exact match at that point.

  • If the repeated phrase
  • the repeated phrase is
  • repeated phrase is four
  • the text. If the
  • text. If the repeated
  • If the repeated phrase
  • the repeated phrase crops

If you start with a long sequence of, say, twenty words, and you don’t get an exact match for twenty words, try with nineteen. If that fails, try eighteen, and so on, all the way down to some minimum desired phrase length. If you do get a match, keep count of how many matches you get as you work through the whole text. If you get at least a desired number of repetitions, then you can return that matched phrase.

Here’s a way of doing that in code. In part, it makes use of some functions in a Python package called nltk (“natural language toolkit”) that separate out each word in a text as a separate token:

import nltk

tokens = nltk.word_tokenize(story_text)

We can find every (overlapping) set of words of a given length in the text, so-called ngrams, using the nltk_ngrams function (five word ngrams in this case):

from nltk.util import ngrams as nltk_ngrams

size = 5  # the number of words in each ngram
ngrams = nltk_ngrams(tokens, size)

for i in ngrams:
    print(' '.join(i))

"""
From the German of Grimm
the German of Grimm .
German of Grimm . A
of Grimm . A poor
Grimm . A poor woodcutter
...
"""

The Python pandas package is a very rich and very powerful package for creating and working with tabular datasets, which is to say, tables of data. In pandas, a data table is referred to as a dataframe.

For example, if we put the ngrams into a table (dataframe) with one column, and one ngram per row, we can count all the instances of each unique ngram directly from the dataframe:

import pandas as pd

df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens, size)]})
df['phrase'].value_counts()

"""
, pretty brindled cow ,        4
And you , pretty brindled      4
you , pretty brindled cow      4
pretty brindled cow , What     4
brindled cow , What do         4
                              ..
leaving him all day without    1
"""

If we take one of those repeating phrases, we can search the text to find each occurrence, and then report the wider context, showing the text before and after the phrase. This is often referred to as a concordance.

import re

q2 = 'pretty brindled cow'

for m in re.finditer(q2, row["text"]):
    # Display the matched terms and the 50 characters
    # immediately preceding and following the phrase 
    print(f'===\n{q2}: ', m.start(), m.end(), row["text"][max(0, m.start()-50):m.end()+50])

"""
===
pretty brindled cow:  1566 1585 
The man said:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' answered the beast
===
pretty brindled cow:  3505 3524 ed the beasts:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The beasts answered, 'Duks
===
pretty brindled cow:  4932 4951  beasts again:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' they said. Then th
===
pretty brindled cow:  6119 6138  to rest now?'

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The animals said, 'Duks:
"""

To make that a little easier to work with, we can wrap it in its own function:

def find_contexts(text, phrase,  width=50):
    """Find the context(s) of the phrase."""
    contexts = []
    for m in re.finditer(phrase, text):
        # Display the matched terms and the `width` characters
        # immediately preceding and following the phrase 
        contexts.append(text[max(0, m.start()-width):m.end()+width])
    return contexts

## Usage:
# for i in find_contexts(row['text'], 'pretty brindled cow'):
#    print(i,"\n==")

So we have a couple of possible pieces for our jigsaw that let us find repeated sequences and then, if required, report them in a wider context.

The following function automates the process, which is to say “implements the algorithm”, described above for starting with a long sequence of tokens – a long ngram — and then reducing the sequence length down to a minimum until we get a match.

Following the basic rule of three, in which many things are repeated in stories three times, we set the minimum repetition count to 3 as the default.

def scanner(text, minlen=4, startlen=50, min_repeats = 3, autostop=True):
    """Search a text for repeated phrases above a minimum length."""
    # Tokenise the text
    tokens = nltk.word_tokenize(text)
    
    #nltk_ngrams returns an empty list if we ask for an ngram longer than the sentence
    # So set the (long) start length to the lesser of the original provided
    # start length or the token length of the original text;
    # which is to say, the minimum of the provided start length 
    # or the length of the text
    startlen = min(startlen, len(tokens))
    
    # Start with a long sequence then iterate down to a minimum length sequence
    for size in range(startlen, minlen-1, -1):
        
        # Generate a dataframe containing all the ngrams, one row per ngram
        df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens,size)]})
        
        # Find the occurrence counts of each phrase
        value_counts_series = df['phrase'].value_counts()
        
        # If we have at least the specified number of occurrences
        # don't bother searching for any more
        if autostop and max(value_counts_series) >= min_repeats:
            break
    # Return a pandas series (an indexed list, essentially)
    # containing the longest phrase (or phrases) we found
    return  value_counts_series[(value_counts_series>=min_repeats) & (value_counts_series==max(value_counts_series))]

If we call that function on the text, we get a long phrase back:

# Display the first (0'th indexed) item
# (In this case there is only one item that repeats this number of times anyway.)
scanner( row["text"] ).index[0], scanner( row["text"] ).values[0]

"""
('Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?',
 4)
"""

We can create another function that calls the scanner function and returns the first (or only) long detected phrase, or nothing:

def find_repeating_phrase(text):
    """Return the longest repeating phrase found in a text.
       If there is more than one of the same length, return the first.
    """
    phrase = scanner(text)
    
    #If there is at least one response, take the first
    if not phrase.empty:
        return phrase.index[0]

We can now make this function available to the SQLite database by registering it as a custom application function.

# The `db` object is a sqlite_utils database object
# Pass in:
# - the name of the function we want to use in the database
# - the number of arguments it takes
# - the function we want to invoke
db.conn.create_function('find_repeating_phrase', 1, find_repeating_phrase)

We can now use the function as part of a query, for example applying it to the text retrieved for the book The House In The Wood:

_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase 
FROM books WHERE title="The House In The Wood" ;
"""

for row2 in db.query(_q):
    print(row2)

"""
{'book': 'The Pink Fairy Book', 'title': 'The House In The Wood', 'phrase': 'Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}
"""

At this point, I still haven’t run this function over anything other than a single test case, the text of The House In The Wood, which I knew contained a repeating phrase.

But what will happen if I run it over all the stories in The Pink Fairy Book? It probably sounds a bit daft, but I’m almost shaking in excited anticipation of writing this query and trying it out…

Here’s the query:

_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase
FROM books WHERE book="The Pink Fairy Book" ;
"""

Now I’m going to shut my eyes and run it… and then peek at the response…

_q = """
SELECT  title, find_repeating_phrase(text) AS phrase
FROM books WHERE book="The Pink Fairy Book" ;
"""

for row3 in db.query(_q):
    if row3['phrase'] is not None:
        print(row3)

How did it do?

{'title': 'Catherine And Her Destiny', 'phrase': 'the court , and'}
{'title': 'Esben And The Witch', 'phrase': "? ' 'Ye -- e -- s ! ' 'Are you coming back again ? ' 'That may be , ' said Esben . 'Then you 'll catch it , '"}
{'title': "Hans, The Mermaid's Son", 'phrase': ", ' said Hans ; ' I"}
{'title': 'How The Dragon Was Tricked', 'phrase': ", ' said the dragon"}
{'title': "How The Hermit Helped To Win The King's Daughter", 'phrase': "'Ask him if he will come with us"}
{'title': 'I Know What I Have Learned', 'phrase': 'and asked his wife whether the cow had calved'}
{'title': 'King Lindorm', 'phrase': 'rode out into the forest'}
{'title': 'Maiden Bright-Eye', 'phrase': ". 'Good evening , ' it said . 'Thanks , Maiden Bright-eye , ' said the dog . 'Where is my brother ? ' 'He is in the serpent-pit . ' 'Where is my wicked sister ? ' 'She is with the noble king . ' 'Alas ! alas !"}
{'title': 'Master And Pupil', 'phrase': ", ' said the boy ."}
{'title': 'Peter Bull', 'phrase': "'Oh , yes , ' said the"}
{'title': 'Princess Minon-Minette', 'phrase': ", ' replied the old woman"}
{'title': "The Bird 'Grip'", 'phrase': 'the horse with the golden shoes'}
{'title': 'The Cunning Shoemaker', 'phrase': ", ' replied the shoemaker"}
{'title': 'The Fir-Tree', 'phrase': "' thought the tree ."}
{'title': 'The Flying Trunk', 'phrase': ". '' ' ''"}
{'title': 'The Goblin And The Grocer', 'phrase': ', and he had'}
{'title': 'The Golden Lion', 'phrase': ', and the young man'}
{'title': 'The House In The Wood', 'phrase': ': Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}
{'title': 'The Jackal, The Dove, And The Panther', 'phrase': "which side do you turn to ? '"}
{'title': 'The King Who Would Have A Beautiful Wife', 'phrase': ". ' 'And I"}
{'title': 'The Little Hare', 'phrase': 'the little hare , the little hare ,'}
{'title': 'The Man Without A Heart', 'phrase': ", ' said the"}
{'title': 'The Merry Wives', 'phrase': ", ' said the"}
{'title': 'The Princess In The Chest', 'phrase': "'Sentry , where are you ? Sentry , where are you ?"}
{'title': 'The Shirt-Collar', 'phrase': "! ' said the shirt-collar ,"}
{'title': 'The Slaying Of The Tanuki', 'phrase': '. The Tanuki ,'}
{'title': 'The Snow-Man', 'phrase': "? ' asked the Snow-man ."}
{'title': 'The Snow-Queen', 'phrase': ", ' said the crow ,"}
{'title': 'The Sparrow With The Slit Tongue', 'phrase': 'the house , and'}
{'title': 'The Sprig Of Rosemary', 'phrase': "'Do you , rich as you are ,"}
{'title': 'The Story Of Ciccu', 'phrase': "accept them with my humble duty . '"}
{'title': 'The Three Brothers', 'phrase': "the house . '"}
{'title': "The Troll's Daughter", 'phrase': 'at the bottom of the sea .'}
{'title': 'The Two Brothers', 'phrase': 'seven years and seven months'}
{'title': 'The Water Of Life', 'phrase': 'the talking bird , and a branch of the tree of beauty'}
{'title': 'The White Dove', 'phrase': ", ' said the prince ,"}
{'title': 'The Wounded Lion', 'phrase': 'will hire me for a servant ?'}
{'title': 'Uraschimataro And The Turtle', 'phrase': 'the sea , and'}

Not bad. For example, we get good long phrases from Maiden Bright-Eye and Esben And The Witch, and not just The House In The Wood. And both How The Hermit Helped To Win The King’s Daughter and The Wounded Lion look like they may have a nice repeating riff in the story.

It does look as if the punctuation can get in the way sometimes though; and it could be handy if we were able to set the minimum token length as part of our query (a quick sketch of that follows).
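The minimum length is easy enough to expose, at least. Here’s a quick sketch (not something I’ve run over the full database) of a two argument variant of the custom function:

def find_repeating_phrase_min(text, minlen):
    """As find_repeating_phrase(), but with a caller supplied minimum phrase length."""
    phrase = scanner(text, minlen=minlen)
    if not phrase.empty:
        return phrase.index[0]

# Register as a two argument custom application function
db.conn.create_function('find_repeating_phrase_min', 2, find_repeating_phrase_min)

# e.g. SELECT title, find_repeating_phrase_min(text, 6) AS phrase FROM books ...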

As for the punctuation, there are a couple of ways we might try to get round it. One is to “depunctuate” the strings; another is perhaps to try tokenising on sentences themselves?

But for that, we will have to wait until another day.

Custom SQLite Functions That Return A Table: A SQLite Spell-Checker

In Fragments – Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite, I noted the hugely powerful SQLite feature that allows you to create custom application defined SQLite functions that can be used as part of a query. The examples I gave included functions for retrieving a record based on fuzzy matching a particular search term within the record. (Not shown, but also supported, are custom aggregate functions that can be used as part of a grouping operator.)

The custom functions MUST return a single, non-iterable, scalar value. But what if you want to return multiple items retrieved from within a single record? Which is to say, what if you need to return a table from a query onto a record?

Poking around in the SQLite docs, it seems as if you can define functions that return a virtual table as part of an extension. But how might you achieve a similar thing with a simpler Python API?

A quick search turns up a really magical recipe, described by Charles Leifer in SQLite Table-Valued Functions with Python and implemented in the coleifer/sqlite-vtfunc package.

All you need to do is define a class with a couple of methods: an initialisation routine that creates an iterator over the items you want to return via the virtual table; and an iterator method that returns each row of your return table. Provide the names of your input arguments and output columns, and that’s pretty much it.
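In skeleton form, the pattern looks something like this (a toy example of my own, splitting a comma separated string into one row per item):

from vtfunc import TableFunction

class SplitItems(TableFunction):
    params = ['text']        # input argument(s)
    columns = ['item']       # output column(s) of the virtual table
    name = 'split_items'     # the function name used in SQL queries

    def initialize(self, text=None):
        # Create an iterator over the items we want to return, one per row
        self._iter = iter(text.split(','))

    def iterate(self, idx):
        # Return the next virtual table row (the trailing comma makes it a tuple)
        return (next(self._iter).strip(),)

# Register against a sqlite3 connection, then call it in a query:
# SELECT * FROM split_items('egg, chips, beans');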

So here’s an example, based on an old code fragment I came across for highlighting typos whilst looking for something else (which reminds me: I still haven’t found the post I was looking for…).

The code uses the jxmorris12/language_tool_python package that provides a Python API to the Java’n’http server based LanguageTool spellchecking server and returns highlighted typographical errors in the presented text.

So can we riff on that to create a spell-checker that we can apply to records in a SQLite database, and for each record return rows of a “spellcheck” virtual table with one typo per row?

Let’s start with a class that creates the spellchecker. When you call the languagetool, it takes forever to start up a server, so it makes sense to create a class wrapper with a global language server, assuming that you’re working in a single language.

# %pip install --upgrade language_tool_python
import language_tool_python

# Loading the languagetool is slow, so it makes sense to only do it once.
# So put it in a class and refer to it from a persistent object context

class TypoHighlighter:
    """Find typographical errors in a text and highlight them.
        Note that this class may be slow to instantiate as the
        LanguageTool HTTP server is started.
    """
    # Shared instance
    tool = language_tool_python.LanguageTool('en-UK')
    
    def __init__(self, style='html', #lang='en-UK',
                 html_color='red', md_style='__'):
        self.style = style
        #self.lang = lang
        self.html_color = html_color
        self.md_style = md_style
        
    def typo_styler(self, error):
        """Highlight error term."""
        typo = error.context
        from_ = error.offsetInContext
        to_ = from_ + error.errorLength
        txt = typo[from_:to_]
        if self.style == 'html':
            typo = typo[:from_] + f'<span style="color:{self.html_color}">{txt}</span>{typo[to_:]}'
        else:
            # Otherwise, wrap the error in a simple markdown style delimiter
            typo = f"{typo[:from_]}{self.md_style}{txt}{self.md_style}{typo[to_:]}"
        return typo

    def highlight_typos(self, text, highlight=True):
        """Highlight spelling errors in text."""
        matches = TypoHighlighter.tool.check(text)
        if not highlight:
            return matches
        else:
            return [self.typo_styler(m) for m in matches]

We can then run the spellchecker as follows to retrieve the raw response of the spell-checker:

tool = TypoHighlighter()

text = 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'

matches = tool.highlight_typos(text, False)
matches

"""
[Match({'ruleId': 'EN_A_VS_AN', 'message': 'Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.', 'replacements': ['an'], 'offsetInContext': 39, 'context': 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he ...', 'offset': 39, 'errorLength': 1, 'category': 'MISC', 'ruleIssueType': 'misspelling', 'sentence': "A sentence with a error in the Hitchhiker's Guide tot he Galaxy"}),
 Match({'ruleId': 'TOT_HE', 'message': 'Did you mean “to the”?', 'replacements': ['to the'], 'offsetInContext': 43, 'context': '... with a error in the Hitchhiker’s Guide tot he Galaxy', 'offset': 73, 'errorLength': 6, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', 'sentence': "A sentence with a error in the Hitchhiker's Guide tot he Galaxy"})]
"""

Or we can get inline highlighted errors:

matches = tool.highlight_typos(text)
matches

"""
['This sentence is fine. A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...',
 '... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy']
"""

So how might we make use of this in a SQLite context?

Let’s start with the error highlighter, where we just return a simple string that shows each error, in context, with the error highlighted. If we run our function over a record in the database, we want to return a set of rows with a single column that contains that highlighted text.

We start the declaration of the class as follows, defining the input argument(s), the column names for the virtual table output, the name of the custom function we can call in a SQLite SQL query, and a class global spellchecker object to perform the spell check:

class DBTypoHighlighter(TableFunction):
    # Input parameter - the text we want to check for typos
    params = ['text']
    # A single output column containing highlighted errors in context
    columns = ['highlight']
    # The name of the function we can call in SQLite
    name = 'typo_highlighter'

    # A class global spellchecker object
    typo_highlighter = TypoHighlighter()

The next thing we need to do is define an initialisation function inside the class that will return an iterator over items we are to return on a per row basis inside our virtual table. The spellchecker returns a Python list containing separate error items, which we can easily return as an iterator:

    def initialize(self, text=None):
        """Return an iterator to generate output virtual table rows."""
        self._iter = iter(DBTypoHighlighter.typo_highlighter.highlight_typos(text))

Next, we need to define the iterator method for the class that returns each row:

    def iterate(self, idx):
        """Return the next row for the virtual table."""
        # We don't need to make use of the idx value,
        # but it is required in the method signature
        item = next(self._iter)
        # The final comma is required so that we return a tuple
        return (item,)

We can now register the function with a SQLite database object:

from sqlite_utils import Database
db = Database('test.db')

DBTypoHighlighter.register(db.conn)

We should now be able to run a typo_highlighter() query over a test string:

text = 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'

for i in db.execute(f'SELECT * FROM typo_highlighter("{text}");'):
    print(i)

"""
('This sentence is fine. A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...',)
('... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy',)
"""

And if we can run it over one string, we can apply it to multiple rows in a query:

spelltests = db["spelltests"]

# Add some rows
spelltests.insert({'id':1,
                   'text':'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'})
spelltests.insert({'id':2,
                   'text':'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'})
spelltests.insert({'id':3,
                   'text':'This sentence is fine.'})

# Run our spellchecker over the rows
for i in db.execute(f'SELECT * FROM spelltests, typo_highlighter(spelltests.text);'):
    print(i)

"""
(1, 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', 'This sentence is fine. A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...')
(1, 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', '... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy')
(2, 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', 'A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...')
(2, 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', '... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy')
"""

So that seems to work :-)

How about a more general function that returns a more complete table detailing the errors found by the languagetool, with the virtual table containing separate columns for each element in the language tool error object?

We note that one of the attributes (replacements) is returned as a list, which we need to serialise somehow, either by returning just the first item, or creating a single string from all the listed suggestions, with each item distinguished within that string.

Constructing the class proceeds as before, although this time we need to make sure we declare all the columns we want to return in the virtual table, and make sure that the iterator method populates those columns appropriately:

class DBTypoFinder(TableFunction):
    """Return a virtual table containing error data associated
       with typos in the presented text."""
    params = ['text']
    columns = ['message', 'replacements', 'offset', 'context',
               'sentence', 'category', 'ruleId', 'ruleIssueType',
               'offsetInContext', 'errorLength']
    name = 'typo_finder'
    typo_highlighter = TypoHighlighter()
    
    def initialize(self, text=None):
        self._iter = iter(DBTypoFinder.typo_highlighter.highlight_typos(text, False))

    def iterate(self, idx):
        i = next(self._iter)
        return (i.message, '::'.join(i.replacements), i.offset, i.context,
               i.sentence, i.category, i.ruleId, i.ruleIssueType,
               i.offsetInContext, i.errorLength,)

DBTypoFinder.register(db.conn)

When we run this query, we get back a much richer table:

import pandas as pd

pd.read_sql('SELECT * FROM spelltests, typo_finder(spelltests.text);',
            db.conn)

This recipe is so much fun! :-)

Fragments – Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite

Scrappy notes on some possibly useful fuzzy and partial match tricks and tips.

First up, we can easily find one or more exact matches for a pattern across a text using a regular expression of the form:

import re

txt = 'The sin eater was a tradition whereby a poor individual – the sin eater – would... '

_q = 'sin eater'

for m in re.finditer(_q, txt):
    print(f'{_q}: ', m.start(), m.end())

But what if we want to start finding patterns where there is only a partial match? For example, in many OCRed texts, or texts corresponding to historical documents with old variant spellings of particular words, we might want to match on the near misses.

The jellyfish Python package provides functions for phonetic encoding (American Soundex, Metaphone, NYSIIS (New York State Identification and Intelligence System), Match Rating Codex) and string comparison / approximate matching (Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance).

Cribbing from simonw/datasette-jellyfish, a Datasette plugin that adds jellyfish functions to SQLite queries in datasette (we can use them at least in the SELECT part of a query), we can add the functions to SQLite with the following:

# Crib from https://github.com/simonw/datasette-jellyfish

import jellyfish

one_args = (
    # Phonetic
    # https://jellyfish.readthedocs.io/en/latest/phonetic.html
    "soundex",
    "metaphone",
    "nysiis",
    "match_rating_codex",
    # Stemming
    # https://jellyfish.readthedocs.io/en/latest/stemming.html
    "porter_stem",
)
two_args = (
    # String Comparison
    # https://jellyfish.readthedocs.io/en/latest/comparison.html
    "levenshtein_distance",
    "damerau_levenshtein_distance",
    "hamming_distance",
    "jaro_similarity",
    "jaro_winkler_similarity",
    "match_rating_comparison",
)

# `db` is a sqlite_utils Database object
from sqlite_utils import Database
db = Database('test.db')
# For in memory db, use: db = Database(memory=True)

for fn in one_args:
    db.conn.create_function(fn, 1, getattr(jellyfish, fn))
for fn in two_args:
    db.conn.create_function(fn, 2, getattr(jellyfish, fn))

Functions with optional (default valued) parameters can be registered using .conn.create_function(NAME, -1, FUNCTION); the -1 declares a variable number of arguments, so such a function can then be called with, say, 1, 2 or 3 parameters.

We can then use the function as part of a query:

# `db` is a sqlite_utils Database object
db["my_table"].create({"id": int, "txt": str,}, pk="id")
db["my_table"].insert({"id":1, "txt": "sin eater in Wales."})
db["my_table"].insert({"id":2, "txt": "The bin eater."})

for i in db.query("SELECT txt, levenshtein_distance('sin eater', txt) AS dist FROM my_table WHERE levenshtein_distance('sin eater', txt) < 10"):
    print(i)

This limits us to “exact matching” near miss phrases or terms, rather than finding near miss phrases within a larger text.

The fuzzysearch package provides a simple function for finding matches based on maximum Levenshtein distance (max_l_dist), maximum number of substitutions (max_substitutions), maximum number of deletions / skipped characters in the sub-sequence (max_deletions), maximum number of insertions / skipped characters in a sequence (max_insertions).

#%pip install fuzzysearch
from fuzzysearch import find_near_matches

# search for 'PATTERN' with a maximum Levenshtein Distance of 2
find_near_matches('PATTERN', '---PATERN---PATEN----PATN--PATREN-', max_l_dist=2)

[Match(start=3, end=9, dist=1, matched='PATERN'),
 Match(start=12, end=17, dist=2, matched='PATEN'),
 Match(start=27, end=33, dist=2, matched='PATREN')]

It could also be interesting to wrap the fuzzysearch.find_near_matches function for use in a SQLite query. Here’s a minimal example:

def find_near_matches2(pattern, search_string):
    """Return the first near match for the pattern in the string, or an empty string."""
    response = find_near_matches(pattern, search_string, max_l_dist=3)
    if response:
        return response[0].matched
    return ''

# Make function available in SQLite queries
db.conn.create_function('find_near_matches2', 2, find_near_matches2)

# Try a query
for i in db.query("SELECT txt, find_near_matches('sin eater', txt) AS matched FROM my_table WHERE find_near_matches('sin eater', txt) !=''"):
    print(i)

"""
{'txt': 'The is a sin eater in Wales.', 'matched': 'sin eater'}
{'txt': 'The bin eater.', 'matched': 'in eater'}
"""

The question is: what should we really return from this function? And can we parameterise it better?

A key parameter is the max_l_dist value; we can specify three arguments to the custom function if one of them is provided with a default value, so let’s use max_l_dist for such a value:

def find_near_matches3(pattern, search_string, max_l_dist=3):
    response = find_near_matches(pattern, search_string, max_l_dist=max_l_dist)
    if response:
        return response[0].matched
    return ''

# The -1 value for the number of args says we may have a default.
# The function can then take 1, 2 or 3 arguments.
db.conn.create_function('find_near_matches3', -1, find_near_matches3)

For spacy pipelines, the spaczz package provides fuzzy entity matching.

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """SIB- EATERZ. The tradition of the sin eater is contested. 
Claimed as a Walesian tradition by the English, the Welsh didn't seem to have heard of sin-eating!"""
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("GPE", [nlp("Wales")])
matches = matcher(doc)

for match_id, start, end, ratio in matches:
    print(match_id, doc[start:end], ratio)

"""
GPE Walesian 77
"""

The minimum confidence level for which a token is tagged can be configured.

The matcher function can be customised to return additional information, such as the location of the matched item(s) in the text.

from spacy.tokens import Span

# Create custom matcher to return location in document
def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "THING" entities to doc."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label="THING")
    doc.ents += (entity,)
    
matcher.add("THING", [nlp("sin eater")], on_match=add_name_ent)
matches = matcher(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))

A regex matcher is also supported. See the docs for more features. This package definitely merits some further investigation.

It is worth noting that the Python regex package natively supports a range of options that can be used to find approximate matches that allow a certain number of deletion or substitution errors, or a certain number of single character errors of whatever flavour.

# https://github.com/mrabarnett/mrab-regex
#%pip install --upgrade regex
import regex

#https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109

print(regex.search("(sin eater){e<=1}", "The bin eater is...")) # Errors
print(regex.search("(sin eater){d<=1}", "The in eater is...")) # Deletions
print(regex.search("(sin eater){s<=1}", "The bin eater is...")) # Substitutions
print(regex.search("(sin eater){e<=3}", "The bin eating tradition is...")) # Errors

"""
<regex.Match object; span=(4, 13), match='bin eater', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(4, 12), match='in eater', fuzzy_counts=(0, 0, 1)>
<regex.Match object; span=(4, 13), match='bin eater', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(4, 13), match='bin eatin', fuzzy_counts=(3, 0, 0)>
"""

There is also a partial matching option that can be used against a truncated document, if the match pattern could potentially have matched if the document continued.
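
If I’m reading the regex docs correctly, partial matching is switched on with a partial=True flag (worth checking against the current docs); a quick sketch:

# Sketch: partial matching against a truncated string
# (the partial=True flag is my reading of the regex module docs)
m = regex.search("sin eater", "Notes on the sin eat", partial=True)
print(m, m.partial if m else None)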

In terms of using “native” SQLite functions to support fuzzy search, a Stack Overflow answer suggests using full text search along with a Spellfix virtual table.

We can add the Spellfix extension to a SQLite db in the following way:

#https://github.com/karlb/sqlite-spellfix
#%pip install git+https://github.com/karlb/sqlite-spellfix
import sqlite_spellfix

# The `db` object is a sqlite_utils Database object
db.conn.enable_load_extension(True)
db.conn.load_extension(sqlite_spellfix.extension_path())

When querying simple tables, eg a table with a column for index entries in a book, the editdist3() function will match terms within a prescribed “distance” of the search term.

# Select based on not exceeding edit cost:
# - default single-character insertion and deletion costs are 100;
# - default single-character to single-character substitution cost is 150.
# A cost of 10000 or more is considered "infinite" and
# causes the rule to be ignored.
SELECT * FROM mytable WHERE editdist3(indexTerm, 'sin eater') < 300
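
A minimal sketch of running that sort of query from Python, using the db connection the extension was loaded into above (mytable and indexTerm are placeholder names, as in the SQL):

# Sketch: approximate matching on an index term column via editdist3()
# ("mytable" / "indexTerm" are placeholder names, as in the SQL above)
for row in db.query("SELECT indexTerm, editdist3(indexTerm, 'sin eater') AS cost "
                    "FROM mytable WHERE editdist3(indexTerm, 'sin eater') < 300"):
    print(row)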

However, to match a search phrase within a longer document, for example using full text search, requires searching across another virtual spellfix table. Trying this out and creating a minimal working recipe for it is still on my to do list. See here for a possible crib.

Pencil Sketch – Building a Simple Search Engine Around a Particular Internet Archive Document Collection

Over the last few weeks, I’ve been making a fair bit of use of the Internet Archive, not least to look up items in copies of Notes & Queries from the 19th century. In part, this has rekindled my interest in things like indexes (for human readers) and custom search engines. So before I try to hack anything together, here are a few notes on various bits and bobs that come to mind about how I could start to play a bit with the Internet Archive content.

Inspection of various sample records shows a few things we can work with:

Particularly:

  • the pub_notes-and-queries tag appears to be used to assign related documents to an associated document collection;
  • the OCR search text is available as a text file;
  • an OCR page index document contains a list of four-tuples, one per scanned page; the first two elements in each tuple appear to give the start and end character index of that page's text in the OCR search text file. (I have no idea what the third and fourth elements in the tuple relate to.)
  • a page numbers text file that gives a list of records, one per page, with a page number estimate and confidence score:

To retrieve the data, we can use the Python internetarchive package. This provides various command line tools, including bulk downloaders, as well as a Python API.

For example, we can use the API to search for items within a particular collection:

#%pip install internetarchive

from internetarchive import search_items

item_ids = []
# Limit search to collection items
#collection:"pub_notes-and-queries"
# Limit search to a particular publication and  volume
#sim_pubid:1250 AND volume:6
# Limit search to a particular publication and year
#sim_pubid:1250 AND year:1867

for i in search_items('collection:"pub_notes-and-queries"').iter_as_items():
    # Iterate through retrieved item records
    pass

The item records include information that is likely to be useful for helping us retrieve items and construct a search engine over the documents (though what form that search engine might take, I am still not sure).

{'identifier': 'sim_notes-and-queries_1867_12_index',
 'adaptive_ocr': 'true',
 'auditor': 'associate-jerald-capanay@archive.org',
 'betterpdf': 'true',
 'boxid': 'IA1641612',
 'canister': 'IA1641612-06',
 'collection': ['pub_notes-and-queries', 'sim_microfilm', 'periodicals'],
 'contrast_max': '250',
 'contrast_min': '147',
 'contributor': 'Internet Archive',
 'copies': '4',
 'date': '1867',
 'derive_version': '0.0.19',
 'description': 'Notes and Queries 1867: <a href="https://archive.org/search.php?query=sim_pubid%3A1250%20AND%20volume%3A12" rel="nofollow">Volume 12</a>, Issue Index.<br />Digitized from <a href="https://archive.org/details/sim_raw_scan_IA1641612-06/page/n1668" rel="nofollow">IA1641612-06</a>.<br />Previous issue: <a href="https://archive.org/details/sim_notes-and-queries_1867-06-29_11_287" rel="nofollow">sim_notes-and-queries_1867-06-29_11_287</a>.<br />Next issue: <a href="https://archive.org/details/sim_notes-and-queries_1867-07-06_12_288" rel="nofollow">sim_notes-and-queries_1867-07-06_12_288</a>.',
 'issn': '0029-3970',
 'issue': 'Index',
 'language': 'English',
 'mediatype': 'texts',
 'metadata_operator': 'associate-berolyn-gilbero@archive.org',
 'next_item': 'sim_notes-and-queries_1867-07-06_12_288',
 'ppi': '400',
 'previous_item': 'sim_notes-and-queries_1867-06-29_11_287',
 'pub_type': 'Scholarly Journals',
 'publisher': 'Oxford Publishing Limited(England)',
 'scanner': 'microfilm03.cebu.archive.org',
 'scanningcenter': 'cebu',
 'sim_pubid': '1250',
 'software_version': 'nextStar 4.5.0.20626',
 'source': ['IA1641612-06', 'microfilm'],
 'sponsor': 'Kahle/Austin Foundation',
 'subject': ['Classical Studies',
  'Library And Information Sciences',
  'Literary And Political Reviews',
  'Literature',
  'Publishing And Book Trade',
  'Scholarly Journals',
  'microfilm'],
 'title': 'Notes and Queries  1867: Vol 12 Index',
 'volume': '12',
 'uploader': 'arthur+microfilm02@archive.org',
 'publicdate': '2021-10-19 11:27:07',
 'addeddate': '2021-10-19 11:27:07',
 'identifier-access': 'http://archive.org/details/sim_notes-and-queries_1867_12_index',
 'identifier-ark': 'ark:/13960/t3gz6zr86',
 'imagecount': '29',
 'ocr': 'tesseract 5.0.0-beta-20210815',
 'ocr_parameters': '-l eng',
 'ocr_module_version': '0.0.13',
 'ocr_detected_script': 'Latin',
 'ocr_detected_script_conf': '0.9685',
 'ocr_detected_lang': 'en',
 'ocr_detected_lang_conf': '1.0000',
 'page_number_confidence': '100.00',
 'pdf_module_version': '0.0.15',
 'foldoutcount': '0'}

So for example, we might pull out the volume, issue, date, and title; we can check whether page numbers were identified; we have reference to the next and previous items.

From inspection of other records, we also note that if an item is restricted (for example, in its preview), then an 'access-restricted-item': 'true' attribute is also set.
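
So a first pass at harvesting the collection metadata might look something like the following minimal sketch (the .metadata attribute on the returned item objects is my assumption about the internetarchive API):

# Sketch: collect a few potentially useful fields from each item record
# (assumes each returned item exposes the record shown above as a .metadata dict)
records = []
for item in search_items('collection:"pub_notes-and-queries"').iter_as_items():
    md = item.metadata
    records.append({'id': md.get('identifier'),
                    'title': md.get('title'),
                    'date': md.get('date'),
                    'volume': md.get('volume'),
                    'issue': md.get('issue'),
                    'previous_item': md.get('previous_item'),
                    'next_item': md.get('next_item'),
                    'restricted': md.get('access-restricted-item') == 'true',
                    'has_page_numbers': 'page_number_confidence' in md})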

We can retrieve documents from the API by id and document type:

from internetarchive import download

# Downloads go into a directory named after the item id;
# the files we want include the page numbers JSON (_page_numbers.json)
# and the gzipped OCR search text (_hocr_searchtext.txt.gz)

doc_id = 'sim_notes-and-queries_1849-11-03_1_1'

download(doc_id, destdir='ia-downloads',
         formats=[ "OCR Page Index", "OCR Search Text", "Page Numbers JSON"])

To create a search index from the downloads, what are we to do?

By inspection of several OCR Search Text files, it appears as if the content is arranged as one paragraph per line of file, where paragraphs are perhaps more correctly blocks of text that appear visually separated from other blocks of text.

For a full text search engine, for example, over a SQLite FTS4 extension virtual table, we could just add each document as a separate record.
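
As a minimal sketch of that document level approach (the database, table and file names here are illustrative, and follow the download comments above):

import gzip
from sqlite_utils import Database

# Sketch: one full text searchable record per downloaded document
db = Database("ia_nq.db")

fn = f"ia-downloads/{doc_id}/{doc_id}_hocr_searchtext.txt.gz"
with gzip.open(fn, "rt", encoding="utf-8") as f:
    full_text = f.read()

docs = db["ia_docs"]
docs.insert({"id": doc_id, "text": full_text}, pk="id", replace=True)

# Add a full text search index over the text column
# (sqlite_utils defaults to FTS5 where available; fts_version="FTS4" can force FTS4)
docs.enable_fts(["text"], create_triggers=True)

for row in docs.search("sin eater"):
    print(row["id"])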

However, it might also be interesting to have full text search over page level records. We could do this by splitting content in the full text document according to the OCR page index file, and also numbering pages with the “scan page” index and also any extracted “actual” page number from the page numbers text file. (The length of the lists in the OCR page index file and the page numbers text file should be the same.)

A search over the pages tables would then be able to return page numbers.
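
And a sketch of the page splitting step, reusing the db, doc_id and full_text from the previous sketch; the page index file name, and the assumption that each entry is a tuple whose first two elements are character offsets into the search text, are both guesses based on my notes above:

import gzip, json

# Sketch: split the OCR search text into page level records
# (page index file name and structure are assumptions from the notes above)
with gzip.open(f"ia-downloads/{doc_id}/{doc_id}_hocr_pageindex.json.gz", "rt") as f:
    page_index = json.load(f)

pages = db["ia_pages"]
for scan_page, idx in enumerate(page_index):
    start, end = idx[0], idx[1]
    pages.insert({"doc_id": doc_id,
                  "scan_page": scan_page,
                  "text": full_text[start:end]})

# A page level full text index means a search can report back page numbers;
# the "actual" page number estimates from the page numbers file could be joined in too
pages.enable_fts(["text"], create_triggers=True)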

In some cases it may be that the searcher will want to view the actual original document scan, for example, in the case of dodgy OCR, or to check the emphasis or layout of the original text. So it probably makes sense to also grab the scanned documents, either as a PDF, or as a collection of JPG images, loading the file as binary data and saving into the database as a SQLite BLOB.
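
A minimal sketch of the BLOB storage step (sqlite_utils will store a Python bytes value as a BLOB; the PDF file name is an assumption):

# Sketch: stash a downloaded PDF in the database as a BLOB
# (the PDF file name is an assumption; bytes values are stored as BLOBs)
with open(f"ia-downloads/{doc_id}/{doc_id}.pdf", "rb") as f:
    pdf_bytes = f.read()

db["ia_pdfs"].insert({"doc_id": doc_id, "pdf": pdf_bytes}, pk="doc_id", replace=True)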

We can preview the files downloadable for a particular item by getting a list of the associated files and then reporting their file format:

from internetarchive import get_item

#Retrieve an item record by id
jx = get_item('sim_notes-and-queries_1867_12_index')

# The item .get_files() method returns an iterator of
# available file types
for jj in jx.get_files():
    #Report the file format
    print(jj.format)

"""
Item Tile
JPEG 2000
JPEG 2000
Text PDF
Archive BitTorrent
chOCR
DjVuTXT
Djvu XML
Metadata
JSON
hOCR
OCR Page Index
OCR Search Text
Item Image
Single Page Processed JP2 ZIP
Metadata
Metadata
Page Numbers JSON
JSON
Scandata
"""

The Single Page Processed JP2 ZIP contains separate JPEG images, one per scan page, so that is a good candidate. We could also grab the Text PDF, which is a searchable PDF document. If page level access were required, we could then grab the whole PDF and just extract and display the required page.
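
As a sketch of the “extract and display the required page” idea, something like PyMuPDF (my assumption; it isn’t mentioned above) could render a single page from the downloaded Text PDF:

#%pip install pymupdf
import fitz  # PyMuPDF

# Sketch: render a single scan page of a downloaded PDF as a PNG for preview
# (PyMuPDF is my assumption here; the file path is illustrative)
pdf = fitz.open(f"ia-downloads/{doc_id}/{doc_id}.pdf")
page = pdf[10]               # zero-indexed scan page
pix = page.get_pixmap()      # render the page to a pixmap
pix.save("preview_page.png")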

However, downloading all the JPG / PDF files feels a bit excessive. So it might be more interesting to try to build something that will only download and store PDFs / images on demand, as and when a user wants to preview an actual scanned page.

PS in passing, I note this recent package – https://github.com/cbrennig/pdf_sqlite_fts – for OCRing, indexing and full-text searching PDF docs.

A Personal Take on Customising Jupyter Via End-User Innovation

Following a Github discussion on The future of the classic notebook interface and the Jupyter Notebook version 7 JEP (Jupyter Enhancement Proposal), a Pre-release plan for Notebook v7 is now in play that will see RetroLab, as was (the notebook-style, JupyterLab-powered, single document UI), form the basis of future notebook UIs.

I’ve been asked for my take on how the original notebook supported end-user development, which I’ll try to describe here. But first I should add some caveats:

  • I am not a developer;
  • I am not interested in environments that developers use for doing things developers do;
  • I am not a data scientist;
  • I am not interested in environments that data scientists use for doing things that data scientists do;
  • I am a note taker; I am a tinkerer and explorer of the potential for using newly available technologies in combination with other technologies; I am a doodler, creating code-exploiting sketches to perform particular tasks, often a single line of code at a time; I am an author of texts that exploit interactivity in a wide variety of subject areas using third party, off-the-shelf packages that exploit IPython (and hence, Jupyter notebook) display machinery.
A line of code can be used, largely in ignorance of how it works and simply by copying, pasting, and changing a single recognisable value, to embed a rich interactive into a document. In this case, a 3D molecular model is embedded based on a standardised compound code passed in via a variable defined elsewhere in the document.
  • I am interested in environments that help me take notes, that help me tinker and explore the potential for using newly available technologies in combination with other technologies, that help me author the sort of texts that I want to explore.
  • I am interested in media that can be used to support the open and distance education, both teaching (in sense of making materials available to learners) and learning (which might be done either in a teaching context, or independently). My preference for teaching materials is that they support learning.
  • I am interested in end-user innovation where an end-user with enthusiasm and only a modicum of skills can extend and/or co-opt an environment, or the features or services it offers, for their own personal use, without having to ask for permission or modify the environment’s core offering or code base (i.e. end-user innovation that allows a user to lock themselves in through extras they have added; in certain situations, this may be characterised as developing on top of undocumented features (it certainly shares many similar features));
  • In my organisation, the lead times are ridiculous. The following is only a slight caricature: a module takes 2+ years to create then is expected to last for 5-10 years largely unchanged. A technology freeze might essentially be put in place a year before student first use date. Technology selection is often based on adopting a mature technology at the start of the production process (two years prior to first use date).
  • When we adopted Jupyter notebooks for the first time for a module due for first release in 2016, it was a huge punt. The notebooks (IPython notebooks) were immature and unstable at the time; we also adopted pandas which was still in early days. There were a lot of reasons why I felt comfortable recommending both those solutions based on observation of the developer communities and practical experience of using the technologies. One practical observation was that I could get started very quickly, over a coffee, without docs, and do something useful. That meant other people would be able to too. Which would mean low barrier to first use. Which meant easy adoption. Which meant few blockers to trying to scale use. (Note that getting the environment you wanted set up as you wanted could be a faff, but we could mitigate that by using virtual machines to deliver software. It was also likely that installation would get easier.)
  • One of the attractive things about the classic Jupyter notebook UI was that I could easily hack the branding to loosely match organisational branding (a simple CSS tweak, based on inspection by someone who didn’t really know CSS (i.e. me), to point to our logo). As a distance learning organisation, I felt it was important to frame the environment in a particular way, so that students would feel as if they were working in an institutional context. When you’re working in that (branded) space, you are expected to behave, and work, in a particular way:
  • there were also third party extensions, written in simple JS and HTML. These could be created and installed by an end-user, taking inspiration and code from pre-existing extensions. As an end-user, I was interested in customising the appearance of content in the notebook. For teaching / publishing purposes, I was interested in being able to replicate the look of materials in our VLE (virtual learning environment). The materials use colour theming to cue different sorts of behaviour. For example, activities and SAQs (self-assessment questions) use colour highlighted blocks to identify particular sorts of content:
  • the open source nature of the Jupyter code base meant that we could run things as local users or as an on-prem service, or as a rented hosted service from a third party provider; in my opinion all three are important. I think students need to be able to run code locally so that they can: work offline, as well as share or use the provided environment in another context, eg a work context; I think being able to offer an institutionally hosted service provides equitable access to students who may be limited in terms of personal access to computers; I think the external provider route demonstrates a more widespread adoption of a particular approach, which means longer term viability and support as well as a demonstration that we are using “authentic” tools that are used elsewhere.

One of our original extensions sought to colour theme activities in Jupyter notebooks in a similar way. (This could be improved, probably, but the following was a quick solution based on inspection of the HTML to try to find attributes that could be easily styled.)

The cells are highlighted by selecting a cell and then clicking a button. How to add the button to a toolbar was cribbed from a simple, pre-existing extension.

If anything, I’m a “View source” powered tinkerer, copying fragments of code from things I’ve found that do more or less the sort of thing I want to do. This strategy is most effective when the code to achieve a particular effect appears in a single place. It’s also helpful if it’s obvious how to load in any required packages, and what those packages might be.

At the time I created the colour theming extension I ran into an issue identifying how to address cells in Javascript and queried it somewhere (I thought in a Github issue, but I can’t find it); IIRC, @minrk promptly responded with a very quick and simple idea for how to address my issue. Not only could I hack most of what I wanted, I could articulate enough of a question to be able to ask for help, and help could also be quickly and relatively easily given: if it’s easy to answer a query, or offer a fix, eg Stack Overflow style, you might; if it’s complex and hard, and takes a lot of effort to answer a query, you are less likely to; and as a result, other people are less likely to get past a blocker and continue to help themselves.

The ability to add toolbar buttons to the user interface meant that it was easy enough to customise the tools offered to the end-user via the toolbar. How to add buttons was cribbed from inspection of the Javascript used by the simplest pre-existing extension I could find that added buttons to the toolbar.

Another thing the classic notebook extensions offered was a simple extensions configurator. This is based on a simple YAML file. The extensions configurator means the end-user developer can trivially specify an extensions settings panel. Here’s an example of what our current cell colour highlighter extension configurator supports, specifically, which buttons are displayed and what colours to use:

And the corresponding fragment of the YAML file that creates it:

How to create the YAML file was cribbed from the content of a YAML config script from the simplest extensions I could find that offered the configurator controls I required.

How to access state set by the extension configurator from the extension Javascript was based on inspection of very simple Javascript of the simplest pre-existing extension I could find that made use of a similar configurator setting.

The state of the extension settings can be persisted, and is easily shared as part of a distribution via a config file. (This means we can easily configure an environment with pre-enabled extensions with particular settings to give the end user a pre-configured, customised environment that they can then, if they so choose, personalise / customise further).

This is important: we can customise the environment we give to students, and those users can then personalise it.

What this means is that there is a way for folk to easily lock themselves in to their own customised environment.

In the wider world, there are a great many JupyterHub powered environments out there serving Jupyter end-user interfaces as the default UI (JupyterLab, classic notebook, perhaps RetroLab). In order to support differentiation, these different environments may brand themselves, may offer particular back-end resources (compute/GPU, storage, access to particular datasets etc.), may offer particular single-sign on and data access authentication controls, may offer particular computational environments (certain packages preinstalled etc), may offer particular pre-installed extensions, including in-house developed extensions which may or may not be open sourced / generally available, may wire those extensions together or “co-/cross-configure” them in such a way as to offer a “whole-is-more-than-sum-of-parts” user experience, and so on.

For the personal user, running locally or on a third party server that enables extension installation and persistence, they can configure their own environment from available third party extensions.

And the have-a-go tinkerer can develop their own extensions, and perhaps make them available to others.

In each case, the service operator or designer can lock themselves in to a particular set-up. In our case, we have locked ourselves into a classic Jupyter notebook environment through extensions and configurations we have developed locally. And we are not in a position to migrate, in part because we have accreted workflows and presentation styles through our own not-fork, in part because of the technical barriers to entry to creating extensions in the JupyterLab environment. Because as I see it, that requires a knowledge of particular frameworks and “modern” ways of using Javascript, as well as of the JupyterLab frameworks themselves.

The current version of my own extensions has started to use things like promises, cribbed from others rather than written from a position of knowledge or understanding; but I’ve only got to that by iterating on simpler approaches and by cribbing diffs from other, pre-existing extensions that have migrated from the original way of working to more contemporary methods (all hail developers for helpful commit messages!). In the classic notebook, by contrast, I could, over a half-hour coffee break, crib some simple HTML and CSS from the classic UI, and crib some simple JS from a pre-existing extension that had a feature I wanted to use, or that appeared to use a method for achieving something similar to the effect I wanted to achieve.

There has been work in the JupyterLab extensions repo to try to provide simple examples, and I have to admit I don’t check there very often to see if they have added the sorts of examples I tend to crib from, because from experience they tend to be targeted at developers doing developery things.

I. Am. Not. A. Developer. And the development I want to do is often end user interface customisation. (I also note from the Jupyter notebook futures discussions comments along the lines of “the core devs aren’t front end developers, so keeping the old’n’hacky notebook UI going is unsustainable”, which I both accept and appreciate (appreciate in the sense of understand).) But it raises the question: who is there looking for ways to offer “open” and “casual” (i.e. informal, end-user) UI development?

It is also worth noting that the original notebook UI was developed by folk who were perhaps not web developers and so got by on simple HTML/CSS/JS techniques, because that was their skill level in that domain at the time. And they were also new to Jupyter frameworks in the sense that those frameworks were still new and still relatively small in feature and scope. But the core devs are now seasoned pros in working in those frameworks. Whereas have-a-go end-user developers wanting to scratch that one itch are always brand new to it. And they may have zero requirement to ever do another bit of development again. On anything. Ever.

The “professionalisation” of examples and extensions in the JupyterLab ecosystem is also hostile to not-developers. For example, here’s a repo for a collapsible headings extension I happened to have in an open tab:

I have no idea what many (most) of those files are, or how necessary they are to build a working extension. I’m not sure I could figure out how to actually build the extension either (because I think they do need building before they can be installed?). I. Am. Not. A. Developer. Just as I don’t think users should have to be sys admins to be able to install and run a notebook (which is one reason we increasingly offer hosted solutions), I don’t think end user developers who want to hack a tiny bit of code should have to be developers with the knowledge, skills and toolchains available to be able to build a package before it can be used. (I think there are tools in the JupyterLab UI context that are starting to explore making things a bit more “build free”.)

To help set the scene, imagine the user is a music teacher who wants to use music21 in a class. They get by using what is essentially a DSL (domain specific language) in the form of music21 functions in a notebook environment. Their use case for Jupyter is to write what are essentially interactive handouts relating to music teaching. They also note that they can publish the materials on the web using Jupyter Book. They see that Jupyter Book has a particular UI feature, such as a highlighted note, and they think “how hard could it be to add that to the notebook UI?”.

One approach I have taken previously with regard to such UI tweaks is to make use of cell tag attributes to identify cells that should be styled in a particular way. (Early on, I’d hoped tags would be exposed appropriately in cell HTML as class attributes, e.g. expose cell tags as HTML classes.) This opens up end user development in a really simple way (hacking CSS, essentially, or iterating HTML based on class attributes; though ideally you’d work with the notebook JSON data structure and index cells based on metadata tag values).

As an example of a hacky workflow from a “not a developer” perspective to achieve a styling effect similar to the Jupyter Book style effect above, I use a “tags2style” extension to style things in the notebook, and a tag processor to churn notebook .ipynb content into appropriately marked up markdown for Jupyter Book, as sketched below. (Contributing to Jupyter Book extensions is also a little beyond me. I can proof-of-concept, but all the “proper” developer stuff of lint’n’tests and sensible commit messages, as well as how to use git properly, etc., are beyond me…! Not a team player, I guess… Just trying to get stuff done for my own purposes.)
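
The document level half of that workflow is easy enough to sketch with nbformat; the tag name and the MyST wrapping below are illustrative rather than what our actual processor does:

import nbformat

# Toy sketch: read a notebook, and wrap markdown cells carrying a particular tag
# in a MyST admonition block (tag name and wrapping are illustrative; code cells ignored)
nb = nbformat.read("example.ipynb", as_version=4)

chunks = []
for cell in nb.cells:
    if cell.cell_type != "markdown":
        continue
    tags = cell.get("metadata", {}).get("tags", [])
    if "style-activity" in tags:
        chunks.append("```{note}\n" + cell.source + "\n```")
    else:
        chunks.append(cell.source)

with open("example.md", "w") as f:
    f.write("\n\n".join(chunks))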

So… in terms of things I’d find useful for end user development, and allowing users to help themselves, a lot of it boils down to not a lot (?!;-):

  • I want to be able to access a notebook data structure and iterate over cells (either all cells, or just cells of a particular type);
  • I want to be able to read and write tag state;
  • I want to be able to render style based on cell tag; failing that, I want to be able to control cell UI class attributes so I can modify them based on cell tags.
  • I want to be able to add buttons to the toolbar UI;
  • I want to be able to trigger operations from toolbar button clicks that apply to the current in-focus cell, a set of selected cells, or all cells / all cells of a particular type;
  • I want to be able to configure extension state in a simply defined configuration panel;
  • I want to be able to easily access extension configuration state and use it within my extension;
  • I want to be able to easily persist and distribute extension configuration state;
  • It would be nice to be able to group cells in a nested tree; eg identify a set of cells as an exercise block and within that as exercise-question and exercise-answer cells, and style the (sub-)grouped cells together and potentially the first, rest, and last cells in each (sub-)group differently to the rest.

In passing, several more areas of interest.

First, JupyterLab workspaces. When these were originally announced, really early on, they seemed really appealing to me. A workspace can be used to preconfigure / persist / share a particular arrangement of panels in the JupyterLab UI. This means you can define a “workbench” with a particular arrangement of panels, much as you might set up a physical lab with a particular arrangement of equipment. (Imagine a school chemistry lab; the lab assistant sets up each bench with the apparatus needed for that day’s experiment.) In certain senses, the resulting workspace might be also be thought of as an “app” or a “lab”.

I would have aggressively explored workspaces, but I was locked into using the custom styling extensions of classic notebook, and this blocked me from exploring JupyterLab further.

I firmly believe value can be added to an environment by providing preconfigured workspaces, where the computational environment is set up as required (necessary packages installed, maybe some configuration of particular package settings, appropriate panels opened and arranged on screen), particularly in an educational setting. But I haven’t really seen people talking about using workspaces in such ways, or even many examples of workspaces being used and customised at all.

Example of Dask Examples repo launched on MyBinder – a custom workspace opens predefined panels on launch.

A lot of work has gone on around dashboards in a Jupyter context, which is perhaps Jupyter used in a business reporting context, but not so much around JupyterLab workspaces, which are really powerful for education.

I note that various discussions relating to classic notebook versus JupyterLab turn on the perceived complexity of the JupyterLab UI. My own take on the JupyterLab UI is that it can be very cluttered and have a lot of menu options or elements available that are irrelevant to a particular user in a particular use case. For different classes of user, we might want to add lightness to the UI, to simplify it to just what is required for a particular activity, and strip out the unnecessary. Workspaces offer that possibility. Dashboard and app style views, if used creatively, can also be used that way BUT they don’t provide access to the JupyterLab controls.

On the question of what to put into workspaces, jupyter-sidecar could be extremely useful in that respect. It was a long time coming, but sidecar now lets you display a widget directly in a panel, rather than first having to display it as cell output.
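
For reference, the sidecar usage pattern is pleasingly minimal (the widget here is just a placeholder):

#%pip install sidecar
from sidecar import Sidecar
import ipywidgets as widgets
from IPython.display import display

# Render a widget into its own JupyterLab panel rather than as cell output
sc = Sidecar(title='Simulator')
with sc:
    display(widgets.IntSlider(description='Demo'))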

This means I could demo for myself using my nbev3devsim simulator in a JupyterLab context.

Note that the ONLY reason I can access that JS app as a Jupyter widget is via the jp_proxy_widget, which to my mind should be maintained in core along with things like Jupyter server proxy, JupyterLite, and maybe jhsingle-native-proxy. All these things allow not-developers and not-sysadmins to customise a distributed environment without core developer skills.

A final area of concern for me relates to environments that are appropriate for authoring new forms of document, particularly those that:

  • embed standalone interactive elements generated from a single line of magic code (for example, embed an interactive map centred on a location, given that location);
Using parameterised magic to embed a customised interactive.
  • generate and embed non-textual media elements from a text description;

Note that innovating in the production of tools to create such outputs does not require “Jupyter developer skills”. The magics can often be simple Python, largely cribbed from other, pre-existing magics, applied to new off-the-shelf packages that support rich IPython object output displays.
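
As a toy sketch of the pattern, rather than one of our actual magics: a line magic that takes its argument and returns a rich display object.

from IPython.core.magic import register_line_magic
from IPython.display import HTML

# Toy sketch (run inside an IPython/notebook session): a line magic that
# wraps its argument in a styled HTML block
@register_line_magic
def boxed(line):
    """Wrap the line's text in a simple styled HTML box."""
    return HTML(f'<div style="border:1px solid #ccc; padding:0.5em">{line}</div>')

# Usage in a notebook cell:
# %boxed This text appears in a styled box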

In terms of document rendering, Jupyter Book currently offers one of the richest and most powerful HTML output pathways, but I think support for PDF generation may still lag behind R/bookdown workflows. I’m not sure about e-book generation. To support end-user innovation around the publication path, there are several considerations: the document format (eg MyST) and how to get content into that format (eg generating it from magics); the parsed representation (how easy it is to manipulate a document model and render content from it); the templates, which provide examples of how to go from the parsed object representation to output format content (HTML, LaTeX, etc); and the stylesheets, which allow you to customise content rendered by a particular template.

In terms of document authoring, I think there are several issues. First, rich editors that allow you to preview or edit directly in a styled display view, i.e. WYSIWYG editors. Jupyter notebook has had a WYSIWYG markdown cell extension for a long time and it sucks: as soon as you use it, your original markdown is converted to crappy HTML. The WYSIWYG editor needs to preserve markdown, insofar as that is possible, which means it needs to work with and generate a rich enough flavour of markdown (such as MyST) to provide the author with the freedom to author the content they want to author. It would be nice if such an editor could be extended to allow you to embed high level object generators, for example IPython line or block magics.

Ideally, I’d be able to have a rich display in a full notebook editor that resembles, in terms of broad styling features, the look of Jupyter Book HTML output, perhaps provided via a Jupyter Book default theme styling extension for JupyterLab / RetroLab.

I’m not sure what the RStudio / Quarto gameplan is for the Quarto editor, which currently seems to take the form of a rich editor inside RStudio. The docs appear to suggest it is likely to be spun out as its own editor at some point. How various flavours of output document are generated and published will be a good indicator of how much traction Quarto will gain. RStudio allows “Shiny app publishing” from RStudio, so integration with e.g. Github Pages for “static” or WASM powered live code docs, or with server backed environments for live code documents executing against a hosted code server, would demonstrate a similar keenness to support online published outputs.

Personally, I’d like to see a Leanpub style option somewhere for Jupyter Book style ebooks, which would open up a commercialisation route that could help drive a certain amount of adoption; but currently, the Leanpub flavour of markdown is not easily generated from eg MyST via tools such as pandoc or Jupytext, which means there is no easy / direct conversion workflow. In terms of supporting end user innovation, having support in Jupytext for “easy” converters, eg where you can specify rules for how Jupytext/MyST object model elements map onto output text, both in terms of linear documents (eg Markdown) and nested ones (eg HTML, XML), would help.

Internally, we use a custom XML format based on DocBook (I think?). I proposed an internal project to develop pandoc converters for it to allow conversion to/from that format into other pandoc supported formats, which would have enabled notebook mediated authoring and display of such content. After hours of meetings quibbling over what the validation process for the output should be (I’d have been happy with: is it good enough to get started converting docs that already exist?), I gave up. In the time I spent on the proposal, its revisions, and in meetings, I could probably have learned enough Haskell to hack something together myself. But that wasn’t the point!

At the moment, Curvenote looks to be offering rich WYSIWYG authoring, along with various other value adds. For me to feel confident in exploring this further, I would like to be able to run the editor locally and load and save files to both disk and browser storage. Purely as a MyST / markdown editor, a minimum viable Github Pages demo would demonstrate to me that that is possible. In terms of code execution, my initial preference would be to be able to execute code using something like Thebelab connected invisibly to a JupyterLite powered kernel. More generally, I’d want the ability to connect to a local or remote Binder Jupyter notebook server and launch kernels against it, and then the ability to connect to a local or remote Jupyter notebook server, both requiring and not requiring authentication.

In terms of what else WASM powered code execution might support, and noting that you can call on things like the pillow image editing package directly in the browser, I wonder whether it is possible to do the following purely within the browser (and if not, why not / what are the blockers?).

It is also interesting to consider the possibility of extensions to the Jupyter Book UI that allow it to be used as an editor, both in terms of low hanging fruit and also in terms of more ridiculous What if? wonderings. Currently, Thebelab enabled books allow readers to edit code cells as well as executing code against an externally launched kernel. However, there is no ability to save edited code to browser storage (or local or remote storage), or to load modified pages from browser storage. (In a shared computer setting, there is also the question of how browser local storage is managed. In Chrome, or when using Chromebooks, for example, can a user sign in to the browser, have their local storage synched to other browsers, and have it cleared from the actual browser they were using in a session when they sign out?)

There is also no option to edit the markdown cell source, but this would presumably not be markdown anyway, but rather rendered HTML. (Unless the browser was rendering the HTML on the fly from source markdown?!) This perhaps limits the use of such books in education, where we might want to treat the notebooks as interactive worksheets that users can modify and retain their edits to. But the ability to edit, save and reload code cell content at least, and maybe even add and delete code cells, would be a start.

One approach might be a simple recipe for running Jupyter Book via Jupyter server proxy (e.g. a simple hacky POC), or for Jupyter Book serving JupyterLite. In the first case, if there was a watcher on a file directory, a Jupyter Book user could perhaps open the file in the local / server Jupyter notebook environment, save the file, and then have the directory watcher trigger a jupyter-book build to update the book. In the second case, could JupyterLite let the user edit the source HTML of locally stored Jupyter Book HTML content and then re-serve it? Or could we run the Jupyter Book build process in the browser and make changes to the source notebook or markdown document?! Which leads to the following more general set of questions about WASM powered code execution in the browser. For example, can we / could we:

  • run Jupyter-proxied flask apps, API publishing services such as datasette, or browser-based code executing Voila dashboards?
  • run Jupyter server extensions, eg jupytext?
  • run Sphinx / Jupyter Book build processes?
  • run pandoc to support pandoc powered document conversion and export?
  • connect to remote storage and/or mount the local file system into the JupyterLite environment (also: what would the security implications of that be?)?

Are any of these currently or near-future possible? Actually impossible? Are there demos? If they are not possible, why not? What are the blockers?

One of the major issues I had, and continue to have, with Jupyter notebook server handling from the notebook UI is in connecting to kernels. Ideally, a user would trivially be able to connect to kernels running on either a local server or listed by one or more remote servers, all from the same notebook UI. This would mean a student could work locally much of the time, but connect to a remote server (from the same UI) if they need to access a server with a particular resource availability, such as a GPU or proximity to a large dataset. VS Code makes it relatively easy to connect, from the VS Code client, to new Jupyter servers, but at the cost of breaking other connections. Using a Jupyter notebook server, remote kernel approaches typically appear to require the use of ssh tunneling to establish a connection to launch and connect to a remote server.

One way round the problem of server connections for code execution is to have in-browser local code execution. Extending Thebelab to support in-browser JupyterLite / WASM powered kernel connections will enable users of tools such as Jupyter Book to publish books capable of executing code from just a simple webserver, eg using a site such as Github Pages. Trivially, JupyterLite kernels incorporate a useful range of packages, although a very simple mechanism for adding additional packages that are “pre-installed” in the WASM environment, which would support the easy creation of end-user distributions, is not available. (The result is that the end user needs to install packages before they can be used.) JupyterLite also currently lacks an easy / natural way of loading files from browser storage into code.
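
In practice, and if I understand the JupyterLite docs correctly, that currently means the end user doing something like the following at the top of a JupyterLite (pyolite) notebook:

# In a JupyterLite pyolite notebook cell (my understanding of the current docs):
import piplite
await piplite.install("mypackage")  # placeholder package name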

Here ends the core dump!