Fragments – Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite

Scrappy notes on some possibly useful fuzzy and partial match tricks and tips.

First up, we can easily find one or more exact matches for a pattern across a text using a regular expression of the form:

import re

txt = 'The sin eater was a tradition whereby a poor individual – the sin eater – would... '

_q = 'sin eater'

for m in re.finditer(_q, txt):
    print(f'{_q}: ', m.start(), m.end())
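This prints the start and end character offsets of each exact match:

"""
sin eater:  4 13
sin eater:  62 71
"""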

But what if we want to start finding patterns where there is only a partial match? For example, in many OCR'd texts, or in texts corresponding to historical documents with old variant spellings of particular words, we might want to match on the near misses.

The jellyfish Python package provides functions for phonetic encoding (American Soundex, Metaphone, NYSIIS (New York State Identification and Intelligence System), Match Rating Codex) and string comparison / approximate matching (Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance).
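To get a feel for what these functions return, we can call them directly; a quick sketch:

#%pip install jellyfish
import jellyfish

# Phonetic encodings map similar sounding words to short codes
print(jellyfish.soundex("eater"), jellyfish.metaphone("eater"))

# Distance / similarity functions score how far apart two strings are
print(jellyfish.levenshtein_distance("sin eater", "bin eater"))  # 1
print(jellyfish.jaro_winkler_similarity("sin eater", "bin eater"))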

Cribbing from simonw/datasette-jellyfish, a Datasette plugin that makes the jellyfish functions available to SQLite queries in Datasette (at least in the SELECT part of a query), we can register the functions with SQLite as follows:

# Crib from https://github.com/simonw/datasette-jellyfish

import jellyfish

one_args = (
    # Phonetic
    # https://jellyfish.readthedocs.io/en/latest/phonetic.html
    "soundex",
    "metaphone",
    "nysiis",
    "match_rating_codex",
    # Stemming
    # https://jellyfish.readthedocs.io/en/latest/stemming.html
    "porter_stem",
)
two_args = (
    # String Comparison
    # https://jellyfish.readthedocs.io/en/latest/comparison.html
    "levenshtein_distance",
    "damerau_levenshtein_distance",
    "hamming_distance",
    "jaro_similarity",
    "jaro_winkler_similarity",
    "match_rating_comparison",
)

# Create a sqlite_utils Database object
from sqlite_utils import Database
db = Database('test.db')
# For an in-memory db, use: db = Database(memory=True)

for fn in one_args:
    db.conn.create_function(fn, 1, getattr(jellyfish, fn))
for fn in two_args:
    db.conn.create_function(fn, 2, getattr(jellyfish, fn))

Functions with default parameter values can be registered using .conn.create_function(NAME, -1, FUNCTION): the -1 value for the number of arguments tells SQLite the function accepts a variable number of arguments, so a single function can be called with, for example, 1, 2 or 3 parameters.
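For example, here's a hypothetical is_close() helper (the name and threshold are my own invention) registered this way:

# A minimal sketch: a wrapper with a default parameter, registered
# with narg=-1 so it can be called with two or three arguments
def is_close(s1, s2, max_dist=2):
    return int(jellyfish.levenshtein_distance(s1, s2) <= max_dist)

db.conn.create_function("is_close", -1, is_close)

# SELECT is_close('sin eater', 'bin eater')     -- uses the default max_dist
# SELECT is_close('sin eater', 'bin eater', 0)  -- overrides it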

We can then use the registered jellyfish functions as part of a query:

# `db` is a sqlite_utils Database object
db["my_table"].create({"id": int, "txt": str,}, pk="id")
db["my_table"].insert({"id":1, "txt": "sin eater in Wales."})
db["my_table"].insert({"id":2, "txt": "The bin eater."})

for i in db.query("SELECT txt, levenshtein_distance('sin eater', txt) AS dist FROM my_table WHERE levenshtein_distance('sin eater', txt) < 10"):
    print(i)
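Note the catch: the distance is computed against the whole column value, so with the two rows above only the shorter near miss passes the threshold; the row containing the exact phrase inside a longer sentence is 21 edits away and gets filtered out:

"""
{'txt': 'The bin eater.', 'dist': 6}
"""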

This limits us to “exact matching” near miss phrases or terms, rather than finding near miss phrases within a larger text.

The fuzzysearch package provides a simple function for finding matches based on maximum Levenshtein distance (max_l_dist), maximum number of substitutions (max_substitutions), maximum number of deletions / skipped characters in the sub-sequence (max_deletions), maximum number of insertions / skipped characters in a sequence (max_insertions).

#%pip install fuzzysearch
from fuzzysearch import find_near_matches

# search for 'PATTERN' with a maximum Levenshtein Distance of 2
find_near_matches('PATTERN', '---PATERN---PATEN----PATN--PATREN-', max_l_dist=2)

"""
[Match(start=3, end=9, dist=1, matched='PATERN'),
 Match(start=12, end=17, dist=2, matched='PATEN'),
 Match(start=27, end=33, dist=2, matched='PATREN')]
"""

It could also be interesting to wrap the fuzzysearch.find_near_matches function for use in a SQLite query. Here's a minimal example:

def find_near_matches2(pattern, search_string):
    """Return the first near match of `pattern` in `search_string`, or ''."""
    response = find_near_matches(pattern, search_string, max_l_dist=3)
    if response:
        return response[0].matched
    return ''

# Make function available in SQLite queries
db.conn.create_function('find_near_matches2', 2, find_near_matches2)

# Try a query
for i in db.query("SELECT txt, find_near_matches('sin eater', txt) AS matched FROM my_table WHERE find_near_matches('sin eater', txt) !=''"):
    print(i)

"""
{'txt': 'There is a sin eater in Wales.', 'matched': 'sin eater'}
{'txt': 'The bin eater.', 'matched': 'in eater'}
"""

The question is: what should we really return from this function? And can we parameterise it better?

A key parameter is the max_l_dist value; a custom function registered with a variable argument count can take an optional argument if the extra parameter is provided with a default value, so let's give max_l_dist a default:

def find_near_matches3(pattern, search_string, max_l_dist=3):
    response = find_near_matches(pattern, search_string, max_l_dist=max_l_dist)
    if response:
        return response[0].matched
    return ''

# The -1 value for the number of args says the function takes a variable
# number of arguments; this one can then be called with 2 or 3 arguments.
db.conn.create_function('find_near_matches3', -1, find_near_matches3)
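We can then call the wrapper from a query with or without the optional third argument:

# Using the default max_l_dist (3)
for i in db.query("SELECT txt, find_near_matches3('sin eater', txt) AS matched FROM my_table"):
    print(i)

# Stricter matching, passing max_l_dist explicitly
for i in db.query("SELECT txt, find_near_matches3('sin eater', txt, 1) AS matched FROM my_table"):
    print(i)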

For spacy, the spaczz package provides fuzzy matching, including fuzzy entity extraction, for spacy pipelines.

#%pip install spaczz
import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """SIB- EATERZ. The tradition of the sin eater is contested. 
Claimed as a Walesian tradition by the English, the Welsh didn't seem to have heard of sin-eating!"""
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("GPE", [nlp("Wales")])
matches = matcher(doc)

for match_id, start, end, ratio in matches:
    print(match_id, doc[start:end], ratio)

"""
GPE Walesian 77
"""

The minimum confidence level for which a token is tagged can be configured.
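For example (an untested sketch, assuming the per-pattern kwargs option described in the spaczz docs), raising the minimum match ratio should make the matcher more conservative:

# Untested sketch: per-pattern settings are passed via `kwargs`;
# min_r2 is the minimum fuzzy ratio for a match to be returned
matcher.add("GPE_STRICT", [nlp("Wales")], kwargs=[{"min_r2": 90}])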

The matcher function can be customised to return additional information, such as the location of the matched item(s) in the text.

from spacy.tokens import Span

# Create custom matcher to return location in document
def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "THING" entities to doc."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label="THING")
    doc.ents += (entity,)
    
matcher.add("THING", [nlp("sin eater")], on_match=add_name_ent)
matches = matcher(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))

A regex matcher is also supported. See the docs for more features. This package definitely merits some further investigation.

It is worth noting that the Python regex package natively supports a range of options for finding approximate matches, allowing a certain number of insertion, deletion or substitution errors, or a certain overall number of single character errors of whatever flavour.

# https://github.com/mrabarnett/mrab-regex
#%pip install --upgrade regex
import regex

#https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109

print(regex.search("(sin eater){e<=1}", "The bin eater is...")) # Errors
print(regex.search("(sin eater){d<=1}", "The in eater is...")) # Deletions
print(regex.search("(sin eater){s<=1}", "The bin eater is...")) # Substitutions
print(regex.search("(sin eater){e<=3}", "The bin eating tradition is...")) # Errors

"""
<regex.Match object; span=(4, 13), match='bin eater', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(4, 12), match='in eater', fuzzy_counts=(0, 0, 1)>
<regex.Match object; span=(4, 13), match='bin eater', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(4, 13), match='bin eatin', fuzzy_counts=(3, 0, 0)>
"""

There is also a partial matching option that can be used against a truncated document: it reports a match if the pattern could potentially have matched, had the document continued.
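A minimal sketch, using the partial=True keyword described in the regex docs:

# The match object reports partial=True if the pattern ran off the
# end of the (truncated) string but could still have matched
m = regex.search(r"sin eater", "The tradition of the sin eat", partial=True)
print(m, m.partial if m else None)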

In terms of using “native” SQLite functions to support fuzzy search, a Stack Overflow answer suggests using full text search along with a Spellfix virtual table.

We can add the Spellfix extension to a SQLite db in the following way:

#https://github.com/karlb/sqlite-spellfix
#%pip install git+https://github.com/karlb/sqlite-spellfix
import sqlite_spellfix

# The `db` object is a sqlite_utils Database object
db.conn.enable_load_extension(True)
db.conn.load_extension(sqlite_spellfix.extension_path())

When querying simple tables, eg a table with a column for index entries in a book, the editdist3() function will match terms within a prescribed “distance” of the search term.

-- Select based on not exceeding an edit cost:
-- - default single-character insertion and deletion costs are 100;
-- - default single-character to single-character substitution cost is 150.
-- A cost of 10000 or more is considered "infinite" and
-- causes the rule to be ignored.
SELECT * FROM mytable WHERE editdist3(indexTerm, 'sin eater') < 300
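From Python, via the db object used above (mytable and indexTerm are the placeholder names from the query):

# Run the editdist3 query via sqlite_utils
for row in db.query(
    "SELECT * FROM mytable WHERE editdist3(indexTerm, 'sin eater') < 300"
):
    print(row)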

However, matching a search phrase within a longer document, for example using full text search, requires searching across another virtual spellfix table. Trying this out and creating a minimal working recipe for it is still on my to do list. See here for a possible crib.
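As a starting point, here's an untested sketch of the shape that recipe might take: an FTS5 table over the documents, an fts5vocab table to expose the indexed terms, and a spellfix1 table to fuzzy match against them:

# Untested sketch of the Stack Overflow recipe: index the documents,
# copy the indexed terms into a spellfix1 table, then fuzzy-match terms
db.conn.executescript("""
CREATE VIRTUAL TABLE docs_fts USING fts5(txt);
INSERT INTO docs_fts(txt) SELECT txt FROM my_table;
CREATE VIRTUAL TABLE docs_terms USING fts5vocab('docs_fts', 'row');
CREATE VIRTUAL TABLE docs_spellfix USING spellfix1;
INSERT INTO docs_spellfix(word) SELECT term FROM docs_terms;
""")

# Fuzzy match a (misspelled) term against the document vocabulary
for row in db.query("SELECT word FROM docs_spellfix WHERE word MATCH 'eeter'"):
    print(row)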
