Fragments – Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite

Scrappy notes on some possibly useful fuzzy and partial match tricks and tips.

First up, we can easily find one or more exact matches for a pattern across a text using a regular expression of the form:

import re

txt = 'The sin eater was a tradition whereby a poor individual – the sin eater – would... '

_q = 'sin eater'

for m in re.finditer(_q, txt):
    print(f'{_q}: ', m.start(), m.end())

But what if we want to start finding patterns where there is only a partial match? For example, in many OCRed texts, or texts corresponding to historical documents with old variant spellings of particular words, we might want to match on the near misses.

The jellyfish Python package provides functions for phonetic encoding (American Soundex, Metaphone, NYSIIS (New York State Identification and Intelligence System), Match Rating Codex) and string comparison / approximate matching (Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance).

Cribbing simonw/datasette-jellyfish, a Datasette plugin that makes the jellyfish functions available to SQLite queries in Datasette (at least in the SELECT part of a query), we can add the functions to SQLite with the following:

# Crib from https://github.com/simonw/datasette-jellyfish

import jellyfish

one_args = (
    # Phonetic
    # https://jellyfish.readthedocs.io/en/latest/phonetic.html
    "soundex",
    "metaphone",
    "nysiis",
    "match_rating_codex",
    # Stemming
    # https://jellyfish.readthedocs.io/en/latest/stemming.html
    "porter_stem",
)
two_args = (
    # String Comparison
    # https://jellyfish.readthedocs.io/en/latest/comparison.html
    "levenshtein_distance",
    "damerau_levenshtein_distance",
    "hamming_distance",
    "jaro_similarity",
    "jaro_winkler_similarity",
    "match_rating_comparison",
)

# `db` is a sqlite_utils Database object
from sqlite_utils import Database
db = Database('test.db')
# For in memory db, use: db = Database(memory=True)

for fn in one_args:
    db.conn.create_function(fn, 1, getattr(jellyfish, fn))
for fn in two_args:
    db.conn.create_function(fn, 2, getattr(jellyfish, fn))

Functions with default parameter values can be registered using .conn.create_function(NAME, -1, FUNCTION); the -1 means the function may be called with a variable number of arguments (for example, 1, 2 or 3).

We can then use the function as part of a query:

# `db` is a sqlite_utils Database object
db["my_table"].create({"id": int, "txt": str,}, pk="id")
db["my_table"].insert({"id":1, "txt": "sin eater in Wales."})
db["my_table"].insert({"id":2, "txt": "The bin eater."})

for i in db.query("SELECT txt, levenshtein_distance('sin eater', txt) AS dist FROM my_table WHERE levenshtein_distance('sin eater', txt) < 10"):
    print(i)
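The one argument functions can be used in the same way. For example, a minimal sketch computing phonetic codes directly in a SELECT (output not shown; the exact codes returned depend on the jellyfish version):

# Compute phonetic encodings as part of a query
for i in db.query("SELECT soundex('eater') AS sdx, metaphone('eater') AS mp"):
    print(i)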

This limits us to “exact matching” near miss phrases or terms, rather than finding near miss phrases within a larger text.

The fuzzysearch package provides a simple function for finding matches based on maximum Levenshtein distance (max_l_dist), maximum number of substitutions (max_substitutions), maximum number of deletions / skipped characters in the sub-sequence (max_deletions), maximum number of insertions / skipped characters in a sequence (max_insertions).

#%pip install fuzzysearch
from fuzzysearch import find_near_matches

# search for 'PATTERN' with a maximum Levenshtein Distance of 2
find_near_matches('PATTERN', '---PATERN---PATEN----PATN--PATREN-', max_l_dist=2)

[Match(start=3, end=9, dist=1, matched='PATERN'),
 Match(start=12, end=17, dist=2, matched='PATEN'),
 Match(start=27, end=33, dist=2, matched='PATREN')]
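The other constraints can be applied in the same way. For example, a minimal sketch allowing a single deletion but no substitutions or insertions (so 'PATERN' would match, but 'PATEN', which needs two deletions, would not):

# Allow one deletion only; no substitutions or insertions
find_near_matches('PATTERN', '---PATERN---PATEN----',
                  max_substitutions=0, max_insertions=0, max_deletions=1)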

It could also be interesting to wrap the fuzzysearch.find_near_matches function for use in a SQLite query. Here’s a minimal example:

def find_near_matches2(*args, **kwargs):
    response = find_near_matches(*args, **kwargs, max_l_dist=3)
    if response:
        return response[0].matched
    return ''

# Make function available in SQLite queries
db.conn.create_function('find_near_matches2', 2, find_near_matches2)

# Try a query
for i in db.query("SELECT txt, find_near_matches2('sin eater', txt) AS matched FROM my_table WHERE find_near_matches2('sin eater', txt) != ''"):
    print(i)

"""
{'txt': 'sin eater in Wales.', 'matched': 'sin eater'}
{'txt': 'The bin eater.', 'matched': 'in eater'}
"""

The question is: what should we really return from this function? And can we parameterise it better?

A key parameter is the max_l_dist value; we can specify three arguments to the custom function if one of them is provided with a default value, so let’s use max_l_dist for such a value:

def find_near_matches3(pattern, search_string, max_l_dist=3):
    response = find_near_matches(pattern, search_string, max_l_dist=max_l_dist)
    if response:
        return response[0].matched
    return ''

# The -1 value for the number of args says we may have a default.
# The function can then take 1, 2 or 3 arguments.
db.conn.create_function('find_near_matches3', -1, find_near_matches3)
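We can then call the function with or without the optional third argument. For example, a minimal sketch against the my_table table created earlier:

# Use the default max_l_dist
for i in db.query("SELECT txt, find_near_matches3('sin eater', txt) AS matched FROM my_table"):
    print(i)

# Pass an explicit max_l_dist as a third argument
for i in db.query("SELECT txt, find_near_matches3('sin eater', txt, 1) AS matched FROM my_table"):
    print(i)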

For spaCy, the spaczz package provides fuzzy entity extraction within spaCy pipelines.

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """SIB- EATERZ. The tradition of the sin eater is contested. 
Claimed as a Walesian tradition by the English, the Welsh didn't seem to have heard of sin-eating!"""
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("GPE", [nlp("Wales")])
matches = matcher(doc)

for match_id, start, end, ratio in matches:
    print(match_id, doc[start:end], ratio)

"""
GPE Walesian 77
"""

The minimum confidence level for which a token is tagged can be configured.
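For example, riffing on the kwargs pattern used with the fuzzy matcher later in this post (treat the exact setting as an assumption against the current spaczz API), we might require a higher minimum match ratio before a span is tagged:

matcher = FuzzyMatcher(nlp.vocab)
# Require a higher minimum match ratio (the default is 75)
matcher.add("GPE", [nlp("Wales")], kwargs=[{"min_r2": 85}])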

The matcher function can be customised to return additional information, such as the location of the matched item(s) in the text.

from spacy.tokens import Span

# Create custom matcher to return location in document
def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "THING" entities to doc."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label="THING")
    doc.ents += (entity,)
    
matcher.add("THING", [nlp("sin eater")], on_match=add_name_ent)
matches = matcher(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))

A regex matcher is also supported. See the docs for more features. This package definitely merits some further investigation.

It is worth noting that the Python regex package natively supports a range of options that can be used to find approximate matches that allow a certain number of deletion or substitution errors, or a certain number of single character errors of whatever flavour.

# https://github.com/mrabarnett/mrab-regex
#%pip install --upgrade regex
import regex

#https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109

print(regex.search("(sin eater){e<=1}", "The bin eater is...")) # Errors
print(regex.search("(sin eater){d<=1}", "The in eater is...")) # Deletions
print(regex.search("(sin eater){s<=1}", "The bin eater is...")) # Substitutions
print(regex.search("(sin eater){e<=3}", "The bin eating tradition is...")) # Errors

"""
<regex.Match object; span=(4, 13), match='bin eater', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(4, 12), match='in eater', fuzzy_counts=(0, 0, 1)>
<regex.Match object; span=(4, 13), match='bin eater', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(4, 13), match='bin eatin', fuzzy_counts=(3, 0, 0)>
"""

There is also a partial matching option that can be used against a truncated document, flagging cases where the pattern could potentially have matched had the document continued.
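For example, a minimal sketch using the partial keyword argument (which I believe the regex package supports on its matching functions):

# The truncated text could still have matched 'sin eater'
# had the document continued
m = regex.search('sin eater', 'The sin eat', partial=True)
print(m, m.partial if m else None)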

In terms of using “native” SQLite functions to support fuzzy search, a Stack Overflow answer suggests using full text search along with a Spellfix virtual table.

We can add the Spellfix extension to a SQLite db in the following way:

#https://github.com/karlb/sqlite-spellfix
#%pip install git+https://github.com/karlb/sqlite-spellfix
import sqlite_spellfix

# The `db` object is a sqlite_utils Database object
db.conn.enable_load_extension(True)
db.conn.load_extension(sqlite_spellfix.extension_path())

When querying simple tables, eg a table with a column for index entries in a book, the editdist3() function will match terms within a prescribed “distance” of the search term.

# Select based on not exceeding edit cost:
# - default single-character insertion and deletion costs are 100;
# - default single-character to single-character substitution cost is 150.
# A cost of 10000 or more is considered "infinite" and
# causes the rule to be ignored.
SELECT * FROM mytable WHERE editdist3(indexTerm, "sin eater") < 300
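Called from Python via sqlite_utils, that might look like the following (a minimal sketch, assuming a mytable table with an indexTerm column):

for row in db.query('SELECT * FROM mytable WHERE editdist3(indexTerm, "sin eater") < 300'):
    print(row)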

However, matching a search phrase within a longer document, for example using full text search, requires searching across another virtual spellfix table. Trying this out / creating a minimal working recipe for it is still on my to do list. See here for a possible crib.

Custom SQLite Functions That Return A Table: A SQLite Spell-Checker

In Fragments – Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite, I noted the hugely powerful SQLite feature that allows you to create custom application defined SQLite functions that can be used as part of a query. The examples I gave included functions for retrieving a record based on fuzzy matching a particular search term within the record. (Not shown, but also supported, are custom aggregate functions that can be used as part of a grouping operation.)

The custom functions MUST return a single, non-iterable, scalar value. But what if you want to return multiple items retrieved from within a single record? Which is to say, what if you need to return a table from a query onto a record?

Poking around in the SQLite docs, it seems as if you can define functions that return a virtual table as part of an extension. But how might you achieve a similar thing with a simpler Python API?

A quick search turns up a really magical recipe, described by Charles Leifer in SQLite Table-Valued Functions with Python and implemented in the coleifer/sqlite-vtfunc package.

All you need to do is define a class with a couple of methods: an initialisation routine that creates an iterator over the items you want to return via the virtual table; and an iterator that returns each row for your return table. Provide the names of your input arguments and output columns, and that’s pretty much it.

So here’s an example, based on an old code fragment I came across for highlighting typos whilst looking for something else (which reminds me: I still haven’t found the post I was looking for…).

The code uses the jxmorris12/language_tool_python package, which provides a Python API to the Java based LanguageTool HTTP spellchecking server, and returns highlighted typographical errors in the presented text.

So can we riff on that to create a spell-checker that we can apply to records in a SQLite database, and for each record return rows of a “spellcheck” virtual table with one typo per row?

Let’s start with a class that creates the spellchecker. When you first call LanguageTool, it takes forever to start up a server, so it makes sense to create a class wrapper with a shared language server, assuming that you’re working in a single language.

# %pip install --upgrade language_tool_python
import language_tool_python

# Loading the languagetool is slow, so it makes sense to only do it once.
# So put it in a class and refer to it from a persistent object context

class TypoHighlighter:
    """Find typographical errors in a text and highlight them.
        Note that this class make be slow to instantiate as the
        LanguageTool http server started.
    """
    # Shared instance
    tool = language_tool_python.LanguageTool('en-UK')
    
    def __init__(self, style='html', #lang='en-UK',
                 html_color='red', md_style='__'):
        self.style = style
        #self.lang = lang
        self.html_color = html_color
        self.md_style = md_style
        
    def typo_styler(self, error):
        """Highlight error term."""
        typo = error.context
        from_ = error.offsetInContext
        to_ = from_ + error.errorLength
        txt = typo[from_:to_]
        if self.style=='html':
            typo =  typo[:from_] + f'<span style="color:{self.html_color}">{txt}</span>{typo[to_:]}'
        else:
            typo =  f"{typo[:from_]}{self.md_style}{txt}{self.md_style}{typo[to_:]}"
        #print(f"**{html}")
        return typo

    def highlight_typos(self, text, highlight=True):
        """Highlight spelling errors in text."""
        matches = TypoHighlighter.tool.check(text)
        if not highlight:
            return matches
        else:
            return [self.typo_styler(m) for m in matches]

We can then run the spellchecker as follows to retrieve the raw response of the spell-checker:

tool = TypoHighlighter()

text = 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'

matches = tool.highlight_typos(text, False)
matches

"""
[Match({'ruleId': 'EN_A_VS_AN', 'message': 'Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.', 'replacements': ['an'], 'offsetInContext': 39, 'context': 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he ...', 'offset': 39, 'errorLength': 1, 'category': 'MISC', 'ruleIssueType': 'misspelling', 'sentence': "A sentence with a error in the Hitchhiker's Guide tot he Galaxy"}),
 Match({'ruleId': 'TOT_HE', 'message': 'Did you mean “to the”?', 'replacements': ['to the'], 'offsetInContext': 43, 'context': '... with a error in the Hitchhiker’s Guide tot he Galaxy', 'offset': 73, 'errorLength': 6, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', 'sentence': "A sentence with a error in the Hitchhiker's Guide tot he Galaxy"})]
"""

Or we can get inline highlighted errors:

matches = tool.highlight_typos(text)
matches

"""
['This sentence is fine. A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...',
 '... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy']
"""

So how might we make use of this in a SQLite context?

Let’s start with the error highlighter, where we just return a simple string that shows each error, in context, with the error highlighted. If we run our function over a record in the database, we want to return a set of rows with a single column that contains that highlighted text.

We start the declaration of the class as follows, defining the input argument(s), the column names for the virtual table output, the name of the custom function we can call in a SQLite SQL query, and a class global spellchecker object to perform the spell check:

# Uses the coleifer/sqlite-vtfunc package
#%pip install git+https://github.com/coleifer/sqlite-vtfunc.git
from vtfunc import TableFunction

class DBTypoHighlighter(TableFunction):
    # Input parameter - the text we want to check for typos
    params = ['text']
    # A single output column containing highlighted errors in context
    columns = ['highlight']
    # The name of the function we can call in SQLite
    name = 'typo_highlighter'

    # A class global spellchecker object
    typo_highlighter = TypoHighlighter()

The next thing we need to do is define an initialisation function inside the class that will return an iterator over items we are to return on a per row basis inside our virtual table. The spellchecker returns a Python list containing separate error items, which we can easily return as an iterator:

    def initialize(self, text=None):
        """Return an iterator to generate output virtual table rows."""
        self._iter = iter(DBTypoHighlighter.typo_highlighter.highlight_typos(text))

Next, we need to define the iterator method for the class that returns each row:

    def iterate(self, idx):
        """Return the next row for the virtual table."""
        # We don't need to make use of the idx value
        # but it is required in the method signature.
        item = next(self._iter)
        # The final , is required.
        return (item,)

We can now register the function with a SQLite database object:

from sqlite_utils import Database
db = Database('test.db')

DBTypoHighlighter.register(db.conn)

We should now be able to run a typo_highlighter() query over a test string:

text = 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'

for i in db.execute(f'SELECT * FROM typo_highlighter("{text}");'):
    print(i)

"""
('This sentence is fine. A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...',)
('... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy',)
"""

And if we can run it over one string, we can apply it to multiple rows in a query:

spelltests = db["spelltests"]

# Add some rows
spelltests.insert({'id':1,
                   'text':'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'})
spelltests.insert({'id':2,
                   'text':'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'})
spelltests.insert({'id':3,
                   'text':'This sentence is fine.'})

# Run our spellchecker over the rows
for i in db.execute(f'SELECT * FROM spelltests, typo_highlighter(spelltests.text);'):
    print(i)

"""
(1, 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', 'This sentence is fine. A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...')
(1, 'This sentence is fine. A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', '... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy')
(2, 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', 'A sentence with <span style="color:red">a</span> error in the Hitchhiker’s Guide tot he ...')
(2, 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy', '... with a error in the Hitchhiker’s Guide <span style="color:red">tot he</span> Galaxy')
"""

So that seems to work :-)

How about a more general function that returns a more complete table detailing the errors found by LanguageTool, with the virtual table containing separate columns for each element in the LanguageTool error object?

We note that one of the attributes (replacements) is returned as a list, which we need to serialise somehow, either by returning just the first item, or creating a single string from all the listed suggestions, with each item distinguishable within that string.

Constructing the class proceeds as before, although this time we need to make sure we declare all the columns we want to return in the virtual table, and make sure that the iterator method populates those columns appropriately:

class DBTypoFinder(TableFunction):
    """Return a virtual table containing error data associated
       with typos in the presented text."""
    params = ['text']
    columns = ['message', 'replacements', 'offset', 'context',
               'sentence', 'category', 'ruleId', 'ruleIssueType',
               'offsetInContext', 'errorLength']
    name = 'typo_finder'
    typo_highlighter = TypoHighlighter()
    
    def initialize(self, text=None):
        self._iter = iter(DBTypoFinder.typo_highlighter.highlight_typos(text, False))

    def iterate(self, idx):
        i = next(self._iter)
        return (i.message, '::'.join(i.replacements), i.offset, i.context,
               i.sentence, i.category, i.ruleId, i.ruleIssueType,
               i.offsetInContext, i.errorLength,)

DBTypoFinder.register(db.conn)

When we run this query, we get back a much richer table:

import pandas as pd

pd.read_sql('SELECT * FROM spelltests, typo_finder(spelltests.text);',
            db.conn)

This recipe is so much fun! :-)

A SQLite Function for Finding Repeated Phrases in Fairy Tales, Part I

Way back when, when I first started looking at virtualised environments for distributing student materials, I came across a project created by @DataminerUK called infinite interns. The idea was to create an “intern” for each stage of a data investigation (a particular instance of a virtualised environment) that would be “virgin” and “uncontaminated”, sandboxed and reproducible. The insight was that you’d have a fresh pair of hands each time you wanted to perform a task, set up just so, in a very particular way for the task at hand. And when the task was done you’d just throw that infinite intern away. (At this point, you could probably make some cultural observations about how interns might be treated in particular industries!)

Another way of considering infinite interns might be to think of them as a scalable resource for performing a particular task, repeatedly, with no real cost to yourself. In coding terms, the infinite intern might thus also correspond to the repeated application of a particular function, which is to say, a set of instructions that are bundled together under a memorable name that can perform a particular task: hey, you, make_my_tea("egg and chips"); and so it will be done.

What this means is that you can create a function once, and then use it repeatedly: hey, you, make_my_tea("fishfinger sandwich").

But what’s even better is that we can find ways of scaling how we call functions even more. If I have a database containing tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, even, of records, how powerful would it be if we could write a function once and then apply it to each of those tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, even, of records from a single command? Pretty powerful, right?

In the previous two scarily named blogposts — Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite and Custom SQLite Functions That Return A Table: A SQLite Spell-Checker — I described how we can define custom Python programming functions for the simple, file based SQLite database that can return either a single valued item, or a whole set of items, from each record in the database.

That probably sounds more complicated than it actually is. For example, suppose each separate record in the database describes what I had for tea on a particular day. Then, with a single command, I could search that database for everything I’ve had for tea, ever, and call the make_my_tea() function separately for each item. But being a database, I could also search for just those items that I ate on a Friday, or during a particular month of a particular year, or that explicitly involved eggs. And get those made for me, all in one go.

Now I don’t happen to have a database of things I’ve had for tea in the past, but in the blog post Searchable SQLite Database of Andrew Lang Fairy Stories I did describe a simple database I had created around Andrew Lang’s coloured fairy story books.

This database contains several hundred stories, each as a separate database record, each one from two to twenty or so pages of text. It’s the sort of resource that would take quite a long time to read through if you were to retrieve every story and then read them all. One reason I created the database was so that I could more easily search for stories that just involved a king or queen or three sons.

q = """
SELECT * FROM books_fts
WHERE books_fts MATCH {db.quote('"pretty hen"')} ;
"""

for row in db.query(_q):
    print(row["title"])

# For future convenience, grab the text for the last reported item
story_text = row["text"]

"""
The House In The Wood
"""

But it also struck me that I should also be able to search for repeated refrains within a particular story. You undoubtedly know the sort of thing: the phrase “and he huffed and he puffed and he” that appears three times in the story of the Three Little Pigs.

If the repeated phrase is four words long, one way of trying to find such a phrase is to take the first four words of the text, and then compare them to every set of four contiguous words in the text. If the repeated phrase crops up again, you will get an exact match at that point (a quick code sketch of this follows the list below).

  • If the repeated phrase
  • the repeated phrase is
  • repeated phrase is four
  • the text. If the
  • text. If the repeated
  • If the repeated phrase
  • the repeated phrase crops
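A minimal sketch of generating those overlapping four word sequences in plain Python:

words = "If the repeated phrase is four words long".split()
four_grams = [' '.join(words[i:i+4]) for i in range(len(words) - 3)]
print(four_grams[:3])

"""
['If the repeated phrase', 'the repeated phrase is', 'repeated phrase is four']
"""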

Start with a long sequence: twenty words, for example. If you don’t get an exact match for twenty words, try with nineteen. If that fails, try eighteen, all the way down to some minimum desired phrase length. If you do get a match, keep count of how many matches you get as you work through the whole text. If you get at least a desired number of repetitions, then you can return that matched phrase.

Here’s a way of doing that in code. In part, it makes use of some functions in a Python package called nltk (“natural language toolkit”) that separates each word in a text as a separate token:

import nltk

# You may need to download the tokeniser model first: nltk.download('punkt')
tokens = nltk.word_tokenize(story_text)

We can find every (overlapping) set of five words in the text, so called ngrams, using the nltk_ngrams function:

from nltk.util import ngrams as nltk_ngrams

# The ngram size, in tokens
size = 5

ngrams = nltk_ngrams(tokens, size)

for i in ngrams:
    print(' '.join(i))

"""
From the German of Grimm
the German of Grimm .
German of Grimm . A
of Grimm . A poor
Grimm . A poor woodcutter
...
"""

The Python pandas package is a very rich and very powerful package for creating and working with tabular datasets, which is to say, tables of data. In pandas, a data table is referred to as a dataframe.

For example, if we put the ngrams into a table (dataframe) with one column, and one ngram per row, we can count all the instances of each unique ngram directly from the dataframe:

import pandas as pd

df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens, size)]})
df['phrase'].value_counts()

"""
, pretty brindled cow ,        4
And you , pretty brindled      4
you , pretty brindled cow      4
pretty brindled cow , What     4
brindled cow , What do         4
                              ..
leaving him all day without    1
"""

If we take one of those repeating phrases, we can search the text to find each occurrence, and then report the wider context, showing the text before and after the phrase. This is often referred to as a concordance.

import re

q2 = 'pretty brindled cow'

for m in re.finditer(q2, row["text"]):
    # Display the matched terms and the 50 characters
    # immediately preceding and following the phrase 
    print(f'===\n{q2}: ', m.start(), m.end(), row["text"][max(0, m.start()-50):m.end()+50])

"""
===
pretty brindled cow:  1566 1585 
The man said:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' answered the beast
===
pretty brindled cow:  3505 3524 ed the beasts:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The beasts answered, 'Duks
===
pretty brindled cow:  4932 4951  beasts again:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' they said. Then th
===
pretty brindled cow:  6119 6138  to rest now?'

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The animals said, 'Duks:
"""

To make that a little easier to work with, we can wrap it in its own function:

def find_contexts(text, phrase, width=50):
    """Find the context(s) of the phrase."""
    contexts = []
    for m in re.finditer(phrase, text):
        # Display the matched terms and the `width` characters
        # immediately preceding and following the phrase 
        contexts.append(text[max(0, m.start()-width):m.end()+width])
    return contexts

## Usage:
# for i in find_contexts(row['text'], 'pretty brindled cow'):
#    print(i,"\n==")

So we have a couple of possible pieces for our jigsaw that let us find repeated sequences and then, if required, report them in a wider context.

The following function automates the process, which is to say “implements the algorithm”, described above for starting with a long sequence of tokens – a long ngram — and then reducing the sequence length down to a minimum until we get a match.

Following the basic rule of three, in which many things are repeated in stories three times, we set the minimum repetition count to 3 as the default.

def scanner(text, minlen=4, startlen=50, min_repeats = 3, autostop=True):
    """Search a text for repeated phrases above a minimum length."""
    # Tokenise the text
    tokens = nltk.word_tokenize(text)
    
    # nltk_ngrams returns an empty list if we ask for an ngram longer than the text.
    # So set the (long) start length to the lesser of the originally provided
    # start length or the token length of the original text;
    # which is to say, the minimum of the provided start length 
    # or the length of the text
    startlen = min(startlen, len(tokens))
    
    # Start with a long sequence then iterate down to a minimum length sequence
    for size in range(startlen, minlen-1, -1):
        
        # Generate a dataframe containing all the ngrams, one row per ngram
        df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens,size)]})
        
        # Find the occurrence counts of each phrase
        value_counts_series = df['phrase'].value_counts()
        
        # If we have at least the specified number of occurrences,
        # don't bother searching for any more
        if max(value_counts_series) >= min_repeats and autostop:
            break
    # Return a pandas series (an indexed list, essentially)
    # containing the longest phrase (or phrases) we found
    return  value_counts_series[(value_counts_series>=min_repeats) & (value_counts_series==max(value_counts_series))]

If we call that function on the text, we get a long phrase back:

# Display the first (0'th indexed) item
# (In this case there is only one item that repeats this number of times anyway.)
scanner( row["text"] ).index[0], scanner( row["text"] ).values[0]

"""
('Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?',
 4)
"""

We can create another function that calls the scanner function and returns the first (or only) long detected phrase, or nothing:

def find_repeating_phrase(text):
    """Return the longest repeating phrase found in a text.
       If there is more than one of the same length, return the first.
    """
    phrase = scanner(text)
    
    #If there is at least one response, take the first
    if not phrase.empty:
        return phrase.index[0]

We can now make this function available to the SQLite database by registering it as a custom application function.

# The `db` object is a sqlite_utils database object
# Pass in:
# - the name of the function we want to use in the database
# - the number of arguments it takes
# - the function we want to invoke
db.conn.create_function('find_repeating_phrase', 1, find_repeating_phrase)

We can now use the function as part of a query, for example applying it to the text retrieved for the book The House In The Wood:

_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase 
FROM books AS phrase WHERE title="The House In The Wood" ;
"""

for row2 in db.query(_q):
    print(row2)

"""
{'book': 'The Pink Fairy Book', 'title': 'The House In The Wood', 'phrase': 'Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}
"""

At this point, I still haven’t run this function over anything other than a single test case, the text of The House In The Wood, which I knew contained a repeating phrase.

But what will happen if I run it over all the stories in The Pink Fairy Book? It probably sounds a bit daft, but I’m almost shaking in excited anticipation of writing this query and trying it out…

Here’s the query:

_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase
FROM books AS phrase WHERE book="The Pink Fairy Book" ;
"""

Now I’m going to shut my eyes and run it… and then peek at the response…

_q = """
SELECT  title, find_repeating_phrase(text) AS phrase
FROM books AS phrase WHERE book="The Pink Fairy Book" ;
"""

for row3 in db.query(_q):
    if row3['phrase'] is not None:
        print(row3)

How did it do?

{'title': 'Catherine And Her Destiny', 'phrase': 'the court , and'}
{'title': 'Esben And The Witch', 'phrase': "? ' 'Ye -- e -- s ! ' 'Are you coming back again ? ' 'That may be , ' said Esben . 'Then you 'll catch it , '"}
{'title': "Hans, The Mermaid's Son", 'phrase': ", ' said Hans ; ' I"}
{'title': 'How The Dragon Was Tricked', 'phrase': ", ' said the dragon"}
{'title': "How The Hermit Helped To Win The King's Daughter", 'phrase': "'Ask him if he will come with us"}
{'title': 'I Know What I Have Learned', 'phrase': 'and asked his wife whether the cow had calved'}
{'title': 'King Lindorm', 'phrase': 'rode out into the forest'}
{'title': 'Maiden Bright-Eye', 'phrase': ". 'Good evening , ' it said . 'Thanks , Maiden Bright-eye , ' said the dog . 'Where is my brother ? ' 'He is in the serpent-pit . ' 'Where is my wicked sister ? ' 'She is with the noble king . ' 'Alas ! alas !"}
{'title': 'Master And Pupil', 'phrase': ", ' said the boy ."}
{'title': 'Peter Bull', 'phrase': "'Oh , yes , ' said the"}
{'title': 'Princess Minon-Minette', 'phrase': ", ' replied the old woman"}
{'title': "The Bird 'Grip'", 'phrase': 'the horse with the golden shoes'}
{'title': 'The Cunning Shoemaker', 'phrase': ", ' replied the shoemaker"}
{'title': 'The Fir-Tree', 'phrase': "' thought the tree ."}
{'title': 'The Flying Trunk', 'phrase': ". '' ' ''"}
{'title': 'The Goblin And The Grocer', 'phrase': ', and he had'}
{'title': 'The Golden Lion', 'phrase': ', and the young man'}
{'title': 'The House In The Wood', 'phrase': ': Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}
{'title': 'The Jackal, The Dove, And The Panther', 'phrase': "which side do you turn to ? '"}
{'title': 'The King Who Would Have A Beautiful Wife', 'phrase': ". ' 'And I"}
{'title': 'The Little Hare', 'phrase': 'the little hare , the little hare ,'}
{'title': 'The Man Without A Heart', 'phrase': ", ' said the"}
{'title': 'The Merry Wives', 'phrase': ", ' said the"}
{'title': 'The Princess In The Chest', 'phrase': "'Sentry , where are you ? Sentry , where are you ?"}
{'title': 'The Shirt-Collar', 'phrase': "! ' said the shirt-collar ,"}
{'title': 'The Slaying Of The Tanuki', 'phrase': '. The Tanuki ,'}
{'title': 'The Snow-Man', 'phrase': "? ' asked the Snow-man ."}
{'title': 'The Snow-Queen', 'phrase': ", ' said the crow ,"}
{'title': 'The Sparrow With The Slit Tongue', 'phrase': 'the house , and'}
{'title': 'The Sprig Of Rosemary', 'phrase': "'Do you , rich as you are ,"}
{'title': 'The Story Of Ciccu', 'phrase': "accept them with my humble duty . '"}
{'title': 'The Three Brothers', 'phrase': "the house . '"}
{'title': "The Troll's Daughter", 'phrase': 'at the bottom of the sea .'}
{'title': 'The Two Brothers', 'phrase': 'seven years and seven months'}
{'title': 'The Water Of Life', 'phrase': 'the talking bird , and a branch of the tree of beauty'}
{'title': 'The White Dove', 'phrase': ", ' said the prince ,"}
{'title': 'The Wounded Lion', 'phrase': 'will hire me for a servant ?'}
{'title': 'Uraschimataro And The Turtle', 'phrase': 'the sea , and'}

Not bad. For example, we get good long phrases from Maiden Bright-Eye and Esben And The Witch, and not just The House In The Wood. And both How The Hermit Helped To Win The King’s Daughter and The Wounded Lion look like they may have a nice repeating riff in the story.

It does look as if the punctuation can get in the way sometimes though; and it could be handy if we were able to set the minimum token length as part of our query.

There are a couple of ways we might try to get round this. One is to “depunctuate” the strings; another, perhaps, is to try tokenising on sentences themselves?
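As a quick sketch of the first of those ideas, we might strip punctuation with str.translate before tokenising:

import string

def depunctuate(text):
    """Remove punctuation from a text prior to tokenising."""
    return text.translate(str.maketrans('', '', string.punctuation))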

But for that, we will have to wait until another day.

Fuzzy Search With Multiple Items Table Return in SQLite

In Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite I reviewed several tools that could be used to support fuzzy search, giving an example of a scalar SQLite custom application function that could be used to retrieve a document containing a fuzzy matched search term.

In Custom SQLite Functions That Return A Table: A SQLite Spell-Checker, I showed how we can create functions that can return multiple values via a virtual table.

In this post, I’ll describe a couple of custom virtual table generating functions that support fuzzy searching within document records and the return of document fragments, one per row, associated with the matches.

The trick to creating table returning functions is provided by Charles Leifer’s coleifer/sqlite-vtfunc package, as described in SQLite Table-Valued Functions with Python. The demos for that package show how to create a regular expression search function:

#https://charlesleifer.com/blog/sqlite-table-valued-functions-with-python/
#https://github.com/coleifer/sqlite-vtfunc
#%pip install git+https://github.com/coleifer/sqlite-vtfunc.git

from vtfunc import TableFunction
import re

class RegexSearch(TableFunction):
    params = ['pattern', 'search_string']
    columns = ['match']
    name = 'regex_search'

    def initialize(self, pattern=None, search_string=None):
        self._iter = re.finditer(pattern, search_string)

    def iterate(self, idx):
        # We do not need `idx`, so just ignore it.
        return (next(self._iter).group(0),)

# The `db` object is a sqlite_utls Database object
from sqlite_utils import Database
db = Database('test.db')

# Register the function with SQLite
RegexSearch.register(db.conn)

Here’s what happens if we try out the underlying regular expression pattern directly:

results = re.finditer(
            '<a[^\>]+?href="([^\"]+?)"[^\>]*?>([^\<]+?)</a>',
            'Junk  <a href="http://www.bbc.co.uk">Link text</a> wrapper; Junk  <a href="http://www.open.co.uk">Link text</a> wrapper')

next(results).groups(), next(results).groups()

"""
(('http://www.bbc.co.uk', 'Link text'), ('http://www.open.co.uk', 'Link text'))
"""

It’s not too hard to crib this to create a partial regex matcher using the regex package. We can also extend the returned table to give the start and end character locations of the matched term:

#%pip install regex
import regex

class RegexSearch2(TableFunction):
    params = ['pattern', 'search_string']
    columns = ['match', 'start', 'end']
    name = 'regex_search2'

    def initialize(self, pattern=None, search_string=None):
        self._iter = regex.finditer(pattern, search_string)

    def iterate(self, idx):
        item = next(self._iter)
        return (item.group(0), item.start(), item.end(),)

RegexSearch2.register(db.conn)

Here’s how it works:

query_params = ("(sin eater){e<=1}", "The bin eater is a sin-eater thing...")

for i in db.execute('SELECT * FROM regex_search2(?, ?);', query_params):
    print(i)

"""
('bin eater', 4, 13)
('sin-eater', 19, 28)
"""

We can also create fuzzy matchers using the fuzzysearch.find_near_matches() function:

#%pip install fuzzysearch
from fuzzysearch import find_near_matches

class FindNearMatches(TableFunction):
    params = ['target_string', 'search_string']
    columns = ['matched', 'start', 'end', 'dist']
    name = 'find_near_matches'

    def initialize(self, target_string=None, search_string=None):
        self._iter = iter(find_near_matches(target_string, search_string, max_l_dist=2))

    def iterate(self, idx):
        r = next(self._iter)
        return (r.matched, r.start, r.end, r.dist,)

# And register the function
FindNearMatches.register(db.conn)

Here’s how it looks:

query_params = ('PATTERN', '--PRTTERN-PATERN---PATEN----PATN--PATREN-')

for i in db.execute('SELECT *, matched FROM find_near_matches(?, ?);', query_params):
    print(i)

"""
('PRTTERN', 2, 9, 1, 'PRTTERN')
('PATERN', 10, 16, 1, 'PATERN')
('PATEN', 19, 24, 2, 'PATEN')
('PATREN', 34, 40, 2, 'PATREN')
"""

One disadvantage of the above is that we have hardwired the max_l_dist parameter; but we can add in any number of additional parameters as we require:

class FindNearMatches2(TableFunction):
    params = ['target_string', 'search_string', 'max_l_dist']
    columns = ['matched', 'start', 'end', 'dist']
    name = 'find_near_matches2'

    def initialize(self, target_string=None, search_string=None, max_l_dist=2):
        self._iter = iter(find_near_matches(target_string, search_string, max_l_dist=max_l_dist))

    def iterate(self, idx):
        r = next(self._iter)
        return (r.matched, r.start, r.end, r.dist,)

# And register the function
FindNearMatches2.register(db.conn)

Here’s how it looks this time:

query_params = ('PATTERN', '--PRTTERN-PATERN---PATEN----PATN--PATREN-', 1)
for i in db.execute('SELECT *, matched FROM find_near_matches2(?, ?, ?);', query_params):
    print(i)

"""
('PRTTERN', 2, 9, 1, 'PRTTERN')
('PATERN', 10, 16, 1, 'PATERN')
"""

We can, of course, use the functions as part of a query that passes in the text values from other tables. So for example, here is a test table:

fuzzytests = db["fuzzytests"]
 
# Add some rows
fuzzytests.insert({'id':1,
                   'text':'Some text about the bin-eater.'})
fuzzytests.insert({'id':2,
                   'text':'Something about sin eating'})
fuzzytests.insert({'id':3,
                   'text':'Nothing to see here...'})

And we can run some queries against it:

# Run our fuzzy searcher over the rows
for i in db.execute(f'SELECT * FROM fuzzytests, find_near_matches2("sin eater", fuzzytests.text, 2);'):
    print(i)

"""
(1, 'Some text about the bin-eater.', 'bin-eater', 20, 29, 2)
(2, 'Something about sin eating', 'sin eatin', 16, 25, 2)
"""

Other fuzzy search tools are available, of course. So here’s an example using the gandersen101/spaczz fuzzy matcher for spaCy:

from spaczz.matcher import FuzzyMatcher

# Assumes the spaCy en_core_web_sm model is installed
import en_core_web_sm

nlp = en_core_web_sm.load()

class SQLite_FuzzyMatcher(TableFunction):
    params = ['target_string', 'search_string']
    columns = ['matched', 'start', 'end', 'fragment', 'ratio']
    name = 'find_fuzzy_matches'
    
    def initialize(self, target_string=None, search_string=None):
        self.doc = nlp( search_string )
        matcher = FuzzyMatcher(nlp.vocab)
        # We can tune the sensitivity for the match; default is 75
        matcher.add(target_string, [nlp(target_string)], kwargs=[{"min_r2": 75}])
        self._iter = iter(matcher(self.doc))

    def iterate(self, idx):
        # ratio is a quality score for the match
        matched, start, end, ratio = next(self._iter)
        # start and end are token offsets into the searched document
        fragment = self.doc[start:end].text
        return (matched, start, end, fragment, ratio,)

# And register the function
SQLite_FuzzyMatcher.register(db.conn)

And when we run it?

_q = f"""
SELECT *
FROM fuzzytests, find_fuzzy_matches('sin eater', fuzzytests.text) AS fz ;
"""

list(db.query(_q))

"""
[{'id': 1,
  'text': 'Some text about the bin-eater.',
  'matched': 'sin eater',
  'start': 4,
  'end': 7,
  'fragment': 'bin-eater',
  'ratio': 78}]
"""

Next up – a porter stemmer search tool, maybe?!

SQLite Functions for Retrieving Sentences and Paragraphs from a Text

As part of my sin eater correspondence debate unproject, one of the things I keep pondering is effective ways of retrieving content from the corpus database for inclusion, or potentially even transclusion (which is to say, dynamic inclusion of content), in the text.

As I seem to be on something of a roll with regards to creating dodgy SQLite functions at the moment (see previous posts), here are a couple of possible queries for extracting sentences and paragraphs.

For the sentence extraction, we can use a spacy pipeline component:

# Assumes the spaCy en_core_web_sm model is installed
import en_core_web_sm

nlp = en_core_web_sm.load()
sentencizer = nlp.add_pipe("sentencizer")

class SQLite_Sentenciser(TableFunction):
    params = ['start', 'end', 'text']
    columns = ['sentence', 'sent_idx']
    name = 'get_sentences'

    def initialize(self, start=None, end=None, text=None):
        doc = nlp( text )
        # Use inclusive bounds; start index: 0
        self._iter = zip(list(doc.sents)[start:end+1], range(start,end+1))

    def iterate(self, idx):
        sentence, sent_idx = next(self._iter)
        return (sentence.text, sent_idx, )

# And register the function
SQLite_Sentenciser.register(db.conn)

Calling this in a query returns sentences:

_q = f"""
SELECT title, s.*
FROM sources_fts, get_sentences(4,5,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'sentence': 'From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them.',
  'sent_idx': 4},
 {'title': 'ANSWERS.',
  'sentence': "A loaf of bread was provided, which the 'sin-eater' first placed upon the dead man's chest, then muttered some incantations over it, finally eating it.",
  'sent_idx': 5}]
"""

For the paragraph extraction, we need to create our own function – or crib one!

#https://gist.github.com/wpm/bf1f2301b98a883b50e903bc3cc86439
def paragraphs(document):
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]

# Example usage:
paras  = list(paragraphs(nlp("""This is a sentence. And another.\n\nA new sentence.\n\nA new para.""")))
paras
# For the text: [p.text for p in paras]

"""
[This is a sentence. And another.,
 
 
 A new sentence.,
 
 
 A new para.]
"""

We can then use this function to extract paragraphs, e.g. in this case using the rule that paragraphs are sequences of text separated by more than one end of line character:

class SQLite_Paragraphiser(TableFunction):
    params = ['start', 'end', 'text']
    columns = ['paragraph', 'para_idx']
    name = 'get_paragraphs'

    def initialize(self, start=None, end=None, text=None):
        doc = nlp( text )
        # Use inclusive bounds; start index: 0
        self._iter = zip(list(paragraphs(doc))[start:end+1], range(start,end+1))

    def iterate(self, idx):
        (paragraph, idx) = next(self._iter)
        return (paragraph.text.strip(), idx,)

# And register the function
SQLite_Paragraphiser.register(db.conn)

And in use:

# Bad example - the record only has one para.

_q = f"""
SELECT title, s.*
FROM sources_fts, get_paragraphs(0,2,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'paragraph': 'AN EXTRAORDINARY WELSH SUPERSTITION.\nThe great majority of your readers will have read the account of the "scapegoat" ceremony recorded in the Sacred Volume. Some, too, will remember a striking parallel custom (or, to put it more correctly, a parallel custom with a disparity) observed by the North American Indians. From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them. A loaf of bread was provided, which the \'sin-eater\' first placed upon the dead man\'s chest, then muttered some incantations over it, finally eating it. By this he was believed to have taken from the deceased the heavy weight of his sins, appropriating them to himself, for which act of kindness he was regarded by everybody as an outcast. After the ceremony was finished, and he had received his pay, the \'sin-devourer\' vanished in double-quick time, it being the usual custom for the bereaved friends to belabour him with sticks."\nH., Newport, Monmouthshire.',
  'para_idx': 0}]
"""

With these two functions, I can now pull out one or more contiguous paragraphs, or one or more contiguous sentences, from each database record.

We can also create a custom aggregation function that can be applied to grouped rows. For example, suppose we want to join the first two sentences retrieved from every book, by book.

The custom aggregator is a class that provides a step() method, called once per row, and a finalize() method that returns the final aggregated value:

class SentenceJoin:
    """Join sentences by group."""

    def __init__(self):
        self.sentences = []

    # Define the action to be taken for each row
    def step(self, item):
        self.sentences.append(item)

    # Return final aggregated value
    def finalize(self):
        return ' '.join(self.sentences)
    
db.conn.create_aggregate('sentence_join', 1, SentenceJoin)

We can then call the function as a grouping function:

_q = f"""
SELECT title, sentence_join(s.sentence) AS js
FROM sources_fts, get_sentences(4,5,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} 
GROUP BY sources_fts.title
LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'js': 'From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them. A loaf of bread was provided, which the \'sin-eater\' first placed upon the dead man\'s chest, then muttered some incantations over it, finally eating it.'}]
"""

Here’s a function that will grab a fragment of text from a record between a starting phrase and an end phrase (we could extend it to allow the returned segment to be inclusive or exclusive of the start and end phrases, and perhaps also trim the returned phrase):

def get_fragment(text, startend):
    """Return substring from a text based on start and end substrings delimited by ::."""
    startend = startend.split("::")
    start_idx = text.index(startend[0])
    end_idx = text.index(startend[1])
    
    return text[start_idx: end_idx+len(startend[1])]

db.conn.create_function("get_fragment", 2, get_fragment)

for i in db.query("SELECT get_fragment('Start here. This is the end. To here.', 'This::the end') AS fragment"):
    print(i)

"""
{'fragment': 'This is the end'}
"""

Finally, here is a simple function that will return the longest matching substring between two texts:

from difflib import SequenceMatcher 
#https://docs.python.org/3/library/difflib.html#sequencematcher-objects

def get_longest_common_substring(text_a, text_b):
    """Find longest common subtring."""
    # isjunk=None, a='', b='', autojunk=True
    seqMatch = SequenceMatcher(None, text_a, text_b, autojunk=False)
    #Also: 
    # autojunk = True (default)
    # isjunk = None (default), same as: lambda x: False;
    # or return True for junk (ignored) using: isjunk = lambda x: x in " \t"
    
    # find_longest_match(alo=0, ahi=None, blo=0, bhi=None)
    # Find longest matching block in a[alo:ahi] and b[blo:bhi].
    match = seqMatch.find_longest_match(0, len(text_a), 0, len(text_b))
    
    if (match.size):
        return text_a[match.a: match.a + match.size]

db.conn.create_function("get_longest_common_substring", 2, get_longest_common_substring)

for i in db.query("SELECT get_longest_common_substring('This is a text with blah blah', \
                                                       'And here is a text with blah, whatever') AS match"):
    print(i)

"""
{'match': ' is a text with blah'}
"""

Fragment – R and CPython REPLs Running Purely in the Browser via WASM

In passing, I note a console based R REPL running in the browser via WASM, courtesy of georgestagg/webR (demo here).

My immediate thought was: so how easy would it be to wrap the underlying R environment as a JupyterLite kernel? (Related issue here.) I guess an open question then is: will R packages just install as cross-platform PyPI wheels do, or will compiling be necessary?

For a related tale about the joys of trying to get various programming languages to run via WASM, see FORTRAN In The Browser.

In further passing, I note the availability of Build scripts and configuration for building CPython for Emscripten at ethanhs/python-wasm, with a REPL demo here.

Fragments – Previewing Richly Formatted Jupyter Book Style Content Authored Using MyST-md

Of the many flavours of markdown out there that can be used to author executable documents (Rmd, jupytext-md, and variants thereof, MyST-md), I find myself increasingly making use of MyST-md and Jupyter Book with various plugins to render the content to HTML (I haven’t tried rendering to other formats such as PDF, docx, or various ebook formats yet).

One of the major blockers to wider adoption that I’d expect “colleagues” to raise is that raw markdown is too hard to write in, and there are no easy preview, let alone WYSIWYG, views.

But that appears to be changing, with several demos and proofs of concept around MyST-flavoured authoring now in play. It’ll be interesting to see whether this crop of tools provides enough of a basis for folk to start exploring MyST, and maybe driving some competition that will encourage further adoption and development.

So, what’s possible today? Let’s look at four possibilities: JupyterLab-MyST extension, Irydium notebooks, the Curvenote editor and the VS Code myst-vs-code extension.

Jupyterlab-MyST Extension

Perhaps most interestingly, a proof-of-concept JupyterLab extension, very much under development, in the form of jupyterlab-myst (Binder demo available, although the long build time and the fast rate of updates to the repo means you might well have to wait for a build to complete!).

The extension provides rendered views of an increasing amount of MyST native enhanced markdown features (I’m not sure if it will extend to rendering elements provided by plugins?), with markdown entered as normal in the edit view of a markdown cell.

Demo of executablebooks/jupyterlab-myst extension in action

I’m not sure if the extension works (yet?) in the pure-browser JupyterLite environment?

One thing I’d quite like to see is the ability to use a cell tag to configure a markdown cell as a special content block rather than have to use a labeled code fenced block.

Some more examples of the jupyterlab-myst extension in action, at least as it stands today, including tables and footnotes (which are currently rendered at the bottom of a cell, and numbered relative just to the cell?):

I also note that nested admonitions work, term definitions almost work, and Sphinx function docs style rendering still needs work:

In Jupyter Book land, there are quite a few extensions that add further richness to content rendered via Jupyter Book, including dropdowns/sphinx-togglebutton, sphinx-panels, sphinx-tabs, proofs, exercises and so on. It’s not immediately clear to me whether there is an obvious path for incorporating those rendering extensions in the views that jupyterlab-myst can render, or whether the Sphinx extensions and the JupyterLab extension could share some of the same templates and styling? (I would hope they could share some of it at least…)

This all makes me wonder how easy it is for the end-user to define their own special content block tags and styles. If a really simple recipe could be found for supporting this sort of end-user development in the Jupyter Book / MyST context, as well as the jupyterlab-myst extension, it would go a long way towards making JupyterLab / RetroLab more palatable to me. For example, would a “custom block proxy” be possible where you define the code fenced block label and provide a template for how to render the content within the block?

See also: A Personal Take on Customising Jupyter Via End-User Innovation

Irydium

The irydium project is a web-native browser based notebook offering MyST based authoring, in-browser Python code execution via Pyodide, and interactive graphical output via plotly charts. Javascript flavoured coding is also available using Observable Plot, which I think sits on top of d3.js.

The Irydium browser provides a split window display, with a MyST markdown text editor on the left and a previewer on the right. The two windows appear not to be locked – each scrolls independently of the other. As you type in the editor, the previewer tries to display a live preview, although as the notebook gets longer the previewer gets slower and slower…

Basic admonitions are supported, although the {admonition} block, which lets you set the block title, isn’t; nested blocks don’t seem to work, nor do the :class: attributes; the dropdown block does not appear to be supported (yet?) either.

Python and Javascript code cells are supported, with Python code execution provided by default via a WASM Python environment. It doesn’t look like you can print Hello World? But you can (somehow?) pass state to a plotly Javascript widget.

Tables and tabbed content don’t appear to work (yet?) but panels do:

Maths notation / equations don’t appear to work (do we need to install a js package?), nor do footnotes:

Perhaps the biggest bugbear for me is that you need to log in to save the notebook. It would be nice to be able to:

  • save and load named notebooks to browser storage, or to disk via the filesystem API;
  • export / load myst-md files to / from the desktop.

Curvenote Editor

Compared to the side-by-side “source and preview” approach offered by Irydium, the Curvenote editor is a WYSIWYG web-based editor (open source, demo) that fronts the Curvenote publishing, collaboration and reuse platform.

Basic Markdown elements (headers, emphasis, lists, etc.) are entered as raw markdown, and code cells are triggered by entering three backticks. Other custom blocks are created via a slash command that indexes into a drop-down list:

Tables are created via a /table command:

Additional rows / columns can be added via a pop-up toolbar, or slash commands inside the table:

Footnotes and equations are also entered from slash commands.

Slash commands can be invoked inline, for example, when adding a footnote:

The document source is available as MyST-md (somehow???); blocks, equations and tables are all generated as expected, but not the footnote (which appears as a pop-up in the HTML, rather than appearing inline at the foot of the page), unless I’m doing it wrong.

For example, for this WYSIWYG view:

The following MyST-md is generated:

I’d prefer it if the table were not enclosed in a block.

For folk who like WYSIWYG editors, and who are comfortable with / commands, for example, as used in the WordPress Gutenberg block editor, the editor makes it easy to flip between edit and rendered views of styled elements, with the syntactic boilerplate provided for you. Styled blocks are invoked inline from slash commands and selected from a filtered list.

This approach balances WYSIWYG and “toolbar button” applied styling or raised wizards, and appears to be “a thing” in many (developer driven?) environments. (Keyboard shortcuts are also supported, but I’m not sure how deep they go? E.g. can you raise the table editor or a note block from a keyboard shortcut?) But they go against my personal preferences (I really dislike the WordPress editor, for example: I’d rather just write in markdown.)

VS Code myst-vs-code Extension

The executablebooks/myst-vs-code extension provides rich preview rendering of MyST markdown documents. The rich preview view is rendered in a separate panel to the edited document.

For example, we can preview various flavours of special content block / admonition block:

We can also define footnotes, which are rendered at the bottom of the preview, and create tables from markdown table syntax.

Equations / maths notation didn’t seem to render for me, though, because I couldn’t find a way to enable additional extensions?

I also wonder how additional extensions are added, and what these have in common with the Sphinx extensions? E.g. how easy would it be to add support for the exercise, panel or proof extensions?

The previewer doesn’t work if the shd101wyy/markdown-preview-enhanced extension is enabled, though (my preferred VS Code markdown extension; there is also a slight variation in the syntax preferred by markdown-preview-enhanced compared to MyST, as noted in this related issue).

See also: VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation.

SQL Databases in the Browser, via WASM: SQLite and DuckDB

Several years ago, in a post on Asking Questions of CSV Data, Using SQL In the Browser, I described a rather nifty application called franchise (from the repo, it seems to have been deprecated for some time, or at least, is no longer actively maintained there). It worked as a general SQL database client, or could be used to manage and query a SQLite database powered by sql.js.

In this post, I’ll review a couple of WASM compiled databases that you can run purely within a browser: SQLite and DuckDB.

SQLite

Several ways of working with SQLite in the browser exist; the following does not claim to be a complete list, just a representative sample, and includes:

  • sql.js and sql.js-httpvfs;
  • SQLime SQLite Playground;
  • JupyterLite xeus-sqlite-kernel.

See also: Seven Ways of Making Use of SQLite.

sql.js

The sql.js package provides a WASM compiled version of SQLite that runs purely in the browser. Databases can be loaded from an uploaded file, or retrieved from a remote URL. If you want to try it out, there’s a simple demo page here:

Barebones sql.js UI

Moving applications off the server and into the browser is really handy in many educational contexts because it means:

  • there is no requirement on the user to install any software on their desktop: all they need is a web browser, a network connection to download the environment, and a powerful enough computer to run the application in the browser;
  • there is no requirement on the provider to provide a code execution server: a static web server is all that is required to provide the environment to the user.

A Brief Aside – Serving Webpages Locally Without a Webserver

For purely local running, there may be a requirement for the user to run a local webserver to serve the environment: if you double click an HTML file on your desktop and open it in a browser with a URL starting file://, the browser may well throw CORS (cross-origin) security errors as it tries to load the page. Fortunately, applications such as servefolder.dev make it possible to run a webserver in your web browser to serve content held locally, by uploading it to the browser and serving it from there:

Unfortunately, you can’t run that application locally from a file:// addressed location.

Fortunately, you can install the application as a Chrome app that does work when you are offline:

Install servefolder.dev for offline running as a Chrome App

Essentially, Chrome will trust apps you have installed from the web into Chrome, but not things you want to run from your own desktop…

Serve folder app
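As an aside, if you do happen to have Python installed, the standard library will also serve the current folder over HTTP with a one-liner, though of course that rather defeats the “nothing to install locally” object of the exercise:

# Serve the current directory at http://localhost:8000
# using nothing but the Python standard library
python -m http.server 8000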

For running simple applications, installing something like servefolder.dev as an app so you can run it offline is really handy when it comes to not requiring a user to run their own webserver. But there is another huge issue that is perhaps currently a blocker for using in-browser apps in an educational setting: if you edit content in the app – for example, a SQL query you spent ages crafting – you lose it when you close the web page. There is nowhere you can save it to and then reload it from, nowhere you can persist it, unless you export it / download it to the desktop, and then import it / upload it from the desktop the next time you run the application.

In the case of sql.js, the minimal demo UI does not persist the database, nor provide any means for saving and reusing queries: it is just a minimal, temporary UI. IndexedDB is a NoSQL storage solution that is supported inside contemporary browsers. Packages such as dexie.js provide a convenient wrapper around it, and tools such as manics/jupyter-offlinenotebook can use it to persist items in the browser from otherwise transient web application sessions, such as MyBinder powered Jupyter notebook sessions. It would be handy if a simple template were available for creating sql.js apps with custom queries, and with database and query persistence via IndexedDB. If you know of such an example, please let me know via the comments.

sql.js-httpvfs

If the database you want to query is hosted via a web URL, sql.js can load the database from the URL. If the database is large, however, this could cause a delay as the database is downloaded, or knock your browser over completely if the database file is very large. Via @simonw, I learned of the phiresky/sql.js-httpvfs package, a fork of sql.js that “provide[s] a read-only HTTP-Range-request based virtual file system for SQLite”. This means that you can host an arbitrarily large SQLite database on a (static) file hoster and then query the database from the browser without needing to download the whole database. The technique used is described in Phiresky’s blog post Hosting SQLite databases on Github Pages (or any static file hoster) (for a full demo, use the netlify hosted version of the post).
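I haven’t dug into the sql.js-httpvfs internals, but the underlying trick is easy enough to demonstrate from Python: ask the server for just the byte range you need. As a minimal sketch (the database URL is hypothetical, and the server must support HTTP range requests), the following grabs just the 100 byte SQLite file header from a remotely hosted database and reads the page size out of it:

import requests

# Hypothetical URL of a remotely hosted SQLite database file
url = 'https://example.com/static/mydatabase.db'

# Request just the first 100 bytes - the SQLite file header
r = requests.get(url, headers={'Range': 'bytes=0-99'})
header = r.content

# The header starts with the magic string 'SQLite format 3\x00'...
assert header[:16] == b'SQLite format 3\x00'
# ...and stores the page size as a big-endian two byte value at offset 16
page_size = int.from_bytes(header[16:18], 'big')
print('Page size:', page_size)

sql.js-httpvfs does essentially this at the virtual file system level, fetching database pages on demand as the query engine asks for them.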

The Phiresky blog post also includes a rather nifty web component for embedding, editing and executing SQL queries over the (remote) database, as well as displaying the results.

The results can be displayed in the context of that component, or embedded into the DOM (that is, anywhere in the web page) as inline content. This raises the intriguing prospect of having a web page that is essentially a set of SQL queries over a remote database, rendered as the web page content.

Also notably, the source code for the original blog post is a markdown file that is rendered to HTML via a custom pandoc filter. I tried to make sense of the source code repository to see how easy it would be to just copy and paste the bits I’d need to run my own minimal demo of rendering a simple markdown file to HTML, with SQL editor components embedded, and querying my own online database, but it’s too web-dev uber-foo for me to be able to even rebuild from the source repository. (Again, if you know of, or create, a minimal example repo that shows how to build a Github Page from the repo that lets you query a database hosted in the repo using the above components – assuming there’s a license or implied license that allows their reuse – please let me know via the comments…)

In passing, I was also pointed, via @simonw, to this fascinating interview with Richard Hipp, creator of SQLite. It’s well worth a listen.

SQLime SQLite Playground

Another way of exploring SQLite in the browser that does provide a means of persistence if you have a network connection available is the SQLime SQLite Playground [repo]:

This client can upload a locally available database file, or download and then connect to a database from a remote location:

SQLime “remote” database access is a download rather than a connect…

It would be interesting if SQLime wrapped phiresky/sql.js-httpvfs and also offered its remote query support, methinks (issue)…

File persistence is supported by connecting to Github:

If you then “share” a query, it is saved to a gist and you are provided with the gist link:

The content of the database and the query are then saved as a private gist:

This is obviously quite an expensive operation – each time you save a query to a new gist you also save the contents of the database there – so it would be neater if you could alternatively just save the URL location of the database, or be able to support remote connections as per sql.js-httpvfs (issue).

JupyterLite xeus-sqlite-kernel

JupyterLite is a Jupyter distribution that runs purely in the browser. One of the available kernels is the jupyterlite/xeus-sqlite-kernel (based on the jupyter-xeus/xeus-sqlite Jupyter kernel). Code cells in this kernel run SQL queries against an in-memory SQLite database, with additional magics defined to support loading of data and rendering of query results as Vega charts.

An example notebook demonstrates how databases can be created and queried.

Example RetroLab notebook running with the xeus-sqlite kernel.

You can try a live example of the notebook running, via JupyterLite in the browser, here: https://jupyterlite.github.io/demo/retro/notebooks/?path=xeus-sqlite/simple-operations.ipynb

Currently, databases must be created and populated within the notebook. However, a repo issue comment claims a recent PR to the upstream jupyter-xeus/xeus-sqlite supports magics for downloading a database from a URL (although I haven’t been able to get it to work as yet [issue]) and then opening a database connection to it, so this support will hopefully make its way into the xeus-sqlite-kernel at some point.

Notebooks are persisted in browser storage, and the previously mentioned PR looks like it will also allow the database to be persisted in storage too (though I’m not sure whether browser permissions will allow it to be accessed from other JupyterLite notebooks with a different filename/URL?)

DuckDB

DuckDB is available as a WebAssembly compiled application that runs purely in the browser (Chrome, Firefox, Safari) and claims to offer full support for Apache Arrow.

The initial release blog post – DuckDB-Wasm: Efficient Analytical SQL in the Browser – suggests that DuckDB “reads Parquet, CSV and JSON files from either your local filesystem or HTTP servers”. But whilst it was easy enough to figure out how to download a file from a web URL to my desktop from the DuckDB CLI (.files download $URL), I couldn’t work out how to open a file from my desktop (I thought .files add might open a desktop file picker, but it just seems to hang the browser without any error messages I could see anywhere…). And whilst I had no trouble querying web-hosted parquet or CSV files, I didn’t manage to query a JSON file. (I also note that it would be handy to be able to query a remote SQLite database, especially one configured to best support efficient remote sql.js-httpvfs style connections.)

The WASM client is fronted by a simple terminal, shell.duckdb.org. (I found that if I did anything wrong it would just hang and I had to reload the page; no ctrl-c etc to get out of trouble.)

DuckDB WASM terminal

The client appears to support queries over remote files. For example, queries can be run over remotely hosted CSV files found randomly on Github:

The first query appears to download the whole file into memory, and the second query then essentially runs against that.

Queries can also be run over remotely hosted parquet files, which I assume could be of an arbitrary size, without the need to download them in full:

Queries also seemed to run against arbitrary parquet data files I found on Github:
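For comparison, and as a sanity check of the query style, the same sort of remote parquet query can also be run from the duckdb Python package (not the WASM build; the file URL below is just a placeholder) once the httpfs extension is loaded:

#%pip install duckdb
import duckdb

con = duckdb.connect()

# The httpfs extension adds support for querying files over HTTP(S)
con.execute('INSTALL httpfs')
con.execute('LOAD httpfs')

# Hypothetical remotely hosted parquet file
url = 'https://example.com/data/mydata.parquet'

print(con.execute(f"SELECT COUNT(*) FROM read_parquet('{url}')").fetchall())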

I couldn’t find a way to load or query a JSON file, though. Or a way to persist, save and retrieve queries? If you know of a simple DuckDB tutorial that gives a complete set of worked CLI examples, please… let me know via the comments:-)

On Not Porting My Classic Jupyter Notebook Extensions to JupyterLab

With the deprecation of classic notebook, and future notebook releases to be based on JupyterLab UI components, we’ll have to make a decision somewhen soon about whether to pin the notebook version to <7 in the environment we release to students on modules starting in September, move to RetroLab / new notebook, or move to JupyterLab. Or go somewhere else.

If we move, a question arises as to the extent to which we try to preserve extended functionality and theming.

Currently, our classic notebook environment:

  • has custom branding (the institutional logo replaces the Jupyter logo);
  • uses “official unofficial” off-the-shelf classic notebook extensions;
  • uses in-house developed classic notebook extensions;
  • uses in-house developed ipywidgets (via jp_proxy_widget);
  • uses in-house developed and off-the-shelf magics.

In a move to JupyterLab UIs, the magics (MUST HAVE) should continue to work; the jp_proxy_widget applications (MUST HAVE) should work as long as that package continues to work; the branding should be fixable (SHOULD HAVE) (I was going to say “how hard can it be?” to replace a logo, but I guess this is JupyterLab, so maybe REALLY HARD?!;-); the off-the-shelf extensions (SHOULD HAVE tending to MUST HAVE) may get ported or may already have JupyterLab equivalents (see, for example, Scoping Out JupyterLab Extensions); and the custom notebook extensions will not (at least, not by me) get ported. (At the end of the day, the custom extensions are probably strictly SHOULD HAVE rather than MUST HAVE, because materials are intended to degrade gracefully without them, but they are part of our value add and student experience proposition, so it would be nice to keep them.)

So what are we to do?

I think there are two sorts of decision that need to be made with a view to presentation:

  • for modules already in presentation, do they stay with classic notebook or do we try to migrate?
  • for modules in production, which environment do they develop notebooks for? Options are: classic notebook (I think this would be foolish); new notebook / JupyterLab (a sensible option if the Jupyter environment is adopted); or an alternative environment (for example, VS Code, RStudio, Quarto, other). When it comes to alternatives, I think there should also be a consideration of more radical alternatives where the full notebook mechanic may not necessarily be required; Jupyter Book with ThebeLab, for example, if some way can be found to support persisting code edits (e.g. to browser storage).

I also think we should take the opportunity to explore alternative authoring and editing environments, as well as quality process automation (for example, Using Automation To Support Quality Assured Jupyter Notebook Publishing Processes): I tend to move between classic notebook, RStudio and VS Code, and if/when a standalone Quarto editor is made available, I’ll probably also try that for certain side-projects. Others might prefer things like Quarto, which is a bit more word-processor hand-holdy (see, for example, Previewing Richly Formatted Jupyter Book Style Content Authored Using MyST-md).

One of the most widely used of our custom extensions colour-highlights notebook cells:

Poking around, it seems that there is a JupyterLab extension that allows you to set cell CSS style via metadata: educational-technology-collective/etc_jupyterlab_cell_properties. This looks ridiculously complicated to me (maybe it needs an architecture statement?!), and seems to require a server extension among other things… (What is it that I keep saying about JupyterLab providing a really hostile architecture for have-a-go end user developers?!)

This makes me a little concerned: will the extension support cell highlighting in RetroLab? (Part of the promise of RetroLab for me was that it shared similar underpinnings to JupyterLab, but does that extend to “server companion” elements? And HTML / CSS structure?) Will the extension even run in JupyterLab in JupyterLite?

Whatever. I shall try to remain positive… When run in MyBinder, for the source repo at least (though I note that uses an old version of JupyterLab?), it seems to work (though I note the JupyterLab metadata editor is really shonky – sometimes it lets you save, sometimes it doesn’t…?).

I also note the extension calls on a tornado decorator at one point, and I thought JupyterLab was trying to dump that package?

It would be more convenient if you could create a cell tag and then associate custom styling with it (I don’t know if / how JupyterLab extensions can have a simple configurator panel set-up like the classic jupyter_nbextensions_configurator?), but it sort of provides a workflow…?

And there is a migration path, of a sort. Our current notebooks use cell tags to identify cells for styling, so it would be easy enough to create a simple notebook processor that would check each cell for styling tags, and then add styling metadata to a cell as appropriate, as sketched below. Hugely inefficient and horrible (it would be so much easier if a cell tag could propagate into the UI DOM at an appropriate level, then be styled via a custom CSS stylesheet, perhaps set via a cell styletags configurator), but workable as a workaround?
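By way of illustration, a minimal sketch of such a processor using nbformat might look something like the following; note that the styling metadata key and structure used here are just guesses on my part, not necessarily the schema the etc_jupyterlab_cell_properties extension actually expects:

#%pip install nbformat
import nbformat

# Map our styling cell tags onto CSS properties (tag names illustrative)
TAG_STYLES = {
    'style-activity': {'background-color': 'lightblue'},
    'style-student': {'background-color': 'lightyellow'},
}

nb = nbformat.read('notebook.ipynb', as_version=4)

for cell in nb.cells:
    for tag in cell['metadata'].get('tags', []):
        if tag in TAG_STYLES:
            # ASSUMED metadata key and structure - check the extension
            # docs / source for the actual schema it reads
            cell['metadata']['cell_properties'] = {'style': TAG_STYLES[tag]}

nbformat.write(nb, 'notebook.ipynb')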

Only one way to find out, I guess, so I’ll have to try it and see…

PS I’d originally started this post, if not in an upbeat mood about JupyterLab, at least with a moderately hopeful sense. The jupyterlab-myst extension offers the promise of previewing various bits of MyST enhanced syntax in JupyterLab notebooks (though I haven’t tried or seen the extension working in RetroLab or JupyterLab in JupyterLite…), as briefly reviewed in the previous post, and the educational-technology-collective/etc_jupyterlab_cell_properties extension looked like it could be used in a hacky way to render styles rewritten into cell metadata based on tags in my original notebooks.

But on reflection, I find myself falling back to my original position: that JupyterLab is not fit for our (which is to say, my!;-) purposes of long term availability (at least, when it comes to extensions that keep working) and support of easy end-user innovation. For a start, it’s too complicated. The cell properties extension does seem to work with an old build of JupyterLab, and if it still works with recent builds, how long is that likely to continue, particularly given the extension also seems to require some sort of service level hacking? The repo does not appear to be maintained, and given the apparent complexity of it (I am not a developer; I don’t have a clue how either JupyterLab, or typescript, or the build process works etc etc), we’re f****d if it breaks and I’m supposed to fix it (and we need things to keep working for years). I have no idea whether the extension works in RetroLab or JupyterLite environments, or what would be involved in making it work if it doesn’t (or whether it is a bespoke JupyterLab desktop server solution). A comment I made in an issue to the jupyterlab-myst repo on leveraging cell tags in display fills me with no hope that there is an easy way to leverage cell tag data or cell metadata. If I recall correctly, cell tags were quite late to the notebook party, so it may be that architectural decisions were made early on that make it difficult for cell content processors to exploit cell metadata / cell tags from the notebook file, or the surrounding presentation layer (HTML DOM, CSS elements, etc.). I have no idea; if it is easy, I’m surprised I haven’t stumbled across some easy-to-follow demos that I could have tried to co-opt for my own purposes. From the cell properties extension, which appears to require mucking about with a server extension, it does look like it is not-just-for-me complicated!

Every time I try to show willing and get started with JupyterLab, it just seems to push me further and further away.

Idly Dreaming – End-User UX Development in JupyterLab and MyST

The following probably demonstrates my total lack of understanding about how things actually work, but I’m always happy to demonstrate my ignorance… Anyway, here are a couple of things I (think I) would like to be able to do in Jupyter’n’MyST land…

They take inspiration from things like SQLite application defined functions, where you can call something like conn.create_function('custom_fn_name', nargs, custom_fn) to register a new SQLite function, or create_aggregate('custom_agg_fn_name', nargs, CustomAggClass), where CustomAggClass has required step() and finalize() methods, or custom table returning functions registered using CustomTableFunction.register(db.conn) on a user defined CustomTableFunction with required params, columns and names attributes and initialize() and iterate() methods; as well as my own hacky innovationOUtside/nb_js_diagrammers package for defining block magics that return an object that uses simple templates to wrap off-the-shelf Javascript packages and text-based diagram descriptions from the magicked cell in order to render diagrams. (Ideally, I’d simplify that to something like register_diagram_magic('diagram_magic_name', DIAGRAM_TEMPLATE).)
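By way of example, here’s a minimal create_aggregate() illustration of my own, registering a population standard deviation aggregate against a standard library sqlite3 connection:

import sqlite3

class StdevAgg:
    """User defined aggregate: population standard deviation."""
    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once at the end to return the aggregate result
        n = len(self.values)
        if not n:
            return None
        mean = sum(self.values) / n
        return (sum((v - mean) ** 2 for v in self.values) / n) ** 0.5

conn = sqlite3.connect(':memory:')
conn.create_aggregate('stdev', 1, StdevAgg)
conn.execute('CREATE TABLE readings (v REAL)')
conn.executemany('INSERT INTO readings VALUES (?)', [(1,), (2,), (4,)])
print(conn.execute('SELECT stdev(v) FROM readings').fetchone())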

The key idea is that, at the programmatic level, I really don’t care what goes on under the hood, I just want a really high level function that lets me register an additional function or handler for a particular attribute value in a generic handler.

So for example, in JupyterLab, I’d love to be able to call something like:

register_tag_styler("tagname", tagcss) that would apply the tagcss styling to cells tagged with tagname.

It’s still not clear to me where I might run this code, but ideally there’d be a trivial way to define extension config panels, much as the jupyter_nbextensions_configurator does, to provide a UI over the registration function. The educational-technology-collective/etc_jupyterlab_cell_properties extension lets you add styling metadata to a cell, but it would be neater to have an extension that lets you define one or more tag names, and some CSS style associated with each tag, and then apply that styling to correspondingly tagged cells.
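Purely as a sketch of the shape of the API I’m imagining – none of this exists, and the DOM selector is a made-up assumption about how tags might surface in the page – the registration side might be no more complicated than:

# Hypothetical registry: none of this is a real JupyterLab API
TAG_STYLERS = {}

def register_tag_styler(tagname, tagcss):
    """Associate a CSS rule with a cell tag."""
    TAG_STYLERS[tagname] = tagcss

def as_stylesheet():
    # ASSUMPTION: tags are surfaced on the cell's DOM node as a data
    # attribute that an extension could set; that bit is the hard part!
    return '\n'.join(
        f'.jp-Cell[data-celltags~="{tag}"] {{ {css} }}'
        for tag, css in TAG_STYLERS.items()
    )

register_tag_styler('style-activity', 'background-color: lightblue;')
print(as_stylesheet())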

Obviously, this should work in both JupyterLab and RetroLab served from a “full” server, or in JupyterLite.

Second up, MyST. MyST supports block level directives of the form:

```{mydirective}
TEXT
```

where either core MyST or MyST extensions identify the mydirective block and then do something to the TEXT.

As a parallel to my nb_js_diagrammers block magics, it’d be really handy to be able to say register_myst_html_directive('my_directive_label', TEMPLATE, js_load=["mypackage.js"], css_load=["mystyle.css"]) and have it load in the required packages and return some HTML templating that consumes the TEXT. Where parameters are parsed from the top of the directive block, or space separated from the directive label, I guess they could be passed as such into the template.
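Again, purely as a sketch – this function doesn’t exist anywhere, it’s just the shape of the thing I’d like – the registration and rendering might be as simple as:

# Hypothetical registry for HTML-template-backed MyST directives
MYST_HTML_DIRECTIVES = {}

def register_myst_html_directive(label, template, js_load=None, css_load=None):
    """Map a directive label to an HTML template plus required assets."""
    MYST_HTML_DIRECTIVES[label] = {
        'template': template,
        'js': js_load or [],
        'css': css_load or [],
    }

def render_directive(label, text):
    """Render the TEXT body of a labelled directive block via its template."""
    return MYST_HTML_DIRECTIVES[label]['template'].format(text=text)

register_myst_html_directive(
    'my_directive_label',
    '<div class="mydirective">{text}</div>',
    js_load=['mypackage.js'],
    css_load=['mystyle.css'],
)
print(render_directive('my_directive_label', 'TEXT'))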

It would also be handy if the MyST-md parser, and perhaps even cell magics, could get access to cell tags. In the case of MyST, this might then allow tagged cells (e.g. of the form directive-note) to essentially identify the contents of a cell as the contents of a {note} directive block (related issue).

In each of the above cases, as an end user trying to customise the look and feel of notebooks, or rich outputs in a rendered document, I really don’t care how horrible or complicated it is underneath. All I should have to do is follow a class structure or some simple template conventions.