SQLite Functions for Retrieving Sentences and Paragraphs from a Text

As part of my sin eater correspondence debate unproject, one of the things I keep pondering is effective ways of retrieving content from the corpus database for inclusion, or potentially even transclusion, (which is to say, dynamically included content), in the text.

As I seem to be on something of a roll with regards to creating dodgy SQLite functions at the moment (see previous posts), here are a couple of possible queries for extracting sentences and paragraphs.

For the sentence extraction, we can use a spacy pipeline component:

nlp = en_core_web_sm.load()
sentencizer = nlp.add_pipe("sentencizer")

class SQLite_Sentenciser(TableFunction):
    params = ['start', 'end', 'text']
    columns = ['sentence', 'sent_idx']
    name = 'get_sentences'

    def initialize(self, start=None, end=None, text=None):
        doc = nlp( text )
        # Use inclusive bounds; start index: 0
        self._iter = zip(list(doc.sents)[start:end+1], range(start,end+1))

    def iterate(self, idx):
        sentence, sent_idx = next(self._iter)
        return (sentence.text, sent_idx, )

# And register the function
SQLite_Sentenciser.register(db.conn)

Calling this in a query returns sentences:

_q = f"""
SELECT title, s.*
FROM sources_fts, get_sentences(4,5,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'sentence': 'From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them.',
  'sent_idx': 4},
 {'title': 'ANSWERS.',
  'sentence': "A loaf of bread was provided, which the 'sin-eater' first placed upon the dead man's chest, then muttered some incantations over it, finally eating it.",
  'sent_idx': 5}]
"""

For the paragraph extraction, we need to create our own function – or crib one!

#https://gist.github.com/wpm/bf1f2301b98a883b50e903bc3cc86439
def paragraphs(document):
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]

# Example usage:
paras  = list(paragraphs(nlp("""This is a sentence. And another.\n\nA new sentence.\n\nA new para.""")))
paras
# For the text: [p.text for p in paras]

"""
[This is a sentence. And another.,
 
 
 A new sentence.,
 
 
 A new para.]
"""

We can then use this function to extract paragraphs, e.g. in this case using the rule that pargraphs are sequences of text separated by more than one end of line character:

class SQLite_Paragraphiser(TableFunction):
    params = ['start', 'end', 'text']
    columns = ['paragraph', 'para_idx']
    name = 'get_paragraphs'

    def initialize(self, start=None, end=None, text=None):
        doc = nlp( text )
        # Use inclusive bounds; start index: 0
        self._iter = zip(list(paragraphs(doc))[start:end+1], range(start,end+1))

    def iterate(self, idx):
        (paragraph, idx) = next(self._iter)
        return (paragraph.text.strip(), idx,)

# And register the function
SQLite_Paragraphiser.register(db.conn)

And in use:

# Bad example - the record only has one para.

_q = f"""
SELECT title, s.*
FROM sources_fts, get_paragraphs(0,2,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'paragraph': 'AN EXTRAORDINARY WELSH SUPERSTITION.\nThe great majority of your readers will have read the account of the "scapegoat" ceremony recorded in the Sacred Volume. Some, too, will remember a striking parallel custom (or, to put it more correctly, a parallel custom with a disparity) observed by the North American Indians. From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them. A loaf of bread was provided, which the \'sin-eater\' first placed upon the dead man\'s chest, then muttered some incantations over it, finally eating it. By this he was believed to have taken from the deceased the heavy weight of his sins, appropriating them to himself, for which act of kindness he was regarded by everybody as an outcast. After the ceremony was finished, and he had received his pay, the \'sin-devourer\' vanished in double-quick time, it being the usual custom for the bereaved friends to belabour him with sticks."\nH., Newport, Monmouthshire.',
  'para_idx': 0}]
"""

With these two functions, I can now pull out one or more contiguous paragraphs, or one or more contiguous sentences, from each database record.

We can also create a custom aggregation function that can be applied to grouped rows. For example, suppose we want to join the first two sentences retrieved from every book, by book.

The custom aggregator is a class that provides step an

class SentenceJoin:
    """Join sentences by group."""

    def __init__(self):
        self.sentences = []

    # Define action to be take for each row
    def step(self, item):
        self.sentences.append(item)

    # Return final aggregated value
    def finalize(self):
        return ' '.join(self.sentences)
    
db.conn.create_aggregate('sentence_join', 1, SentenceJoin)

We can then call the function as a grouping function:

_q = f"""
SELECT title, sentence_join(s.sentence) AS js
FROM sources_fts, get_sentences(4,5,sources_fts.text) AS s WHERE sources_fts MATCH {db.quote(q)} 
GROUP BY sources_fts.title
LIMIT 5 ;
"""
list(db.query(_q))

"""
[{'title': 'ANSWERS.',
  'js': 'From goat to dog, and from dog to man, are remarkable bounds; but here is an account of a human "scapegoat" from a trustworthy authority :— "In a certain county in Wales, some years ago when a person died, his friends sent for the \'sin-eater\' of the district, who, for half-a-crown, *took upon himself the sins of the deceased*, by the simple process of eating them. A loaf of bread was provided, which the \'sin-eater\' first placed upon the dead man\'s chest, then muttered some incantations over it, finally eating it.'}]
"""

Here’s a function that will grab a fragment of text from a record between a starting phrase and end phrase (we could extend it to allow the retruned segment to be includive or exclusive of the start and end phrase, and perhaps also trim the returned phrase):

def get_fragment(text, startend):
    """Return substring from a text based on start and end substrings delimited by ::."""
    startend = startend.split("::")
    start_idx = text.index(startend[0])
    end_idx = text.index(startend[1])
    
    return text[start_idx: end_idx+len(startend[1])]

db.conn.create_function("get_fragment", 2, get_fragment)

for i in db.query("SELECT get_fragment('Start here. This is the end. To here.', 'This::the end') AS fragment"):
    print(i)

"""
{'fragment': 'This is the end'}
"""

Finally, here is a simple function that will return the longest matching substring between two texts:

from difflib import SequenceMatcher 
#https://docs.python.org/3/library/difflib.html#sequencematcher-objects

def get_longest_common_substring(text_a, text_b):
    """Find longest common subtring."""
    # isjunk=None, a='', b='', autojunk=True
    seqMatch = SequenceMatcher(None, text_a, text_b, autojunk=False)
    #Also: 
    # autojunk = True (default)
    # isjunk = None (deafult), same as: lambda x: False;
    # or return True for junk (ignored) using: isjunk = lambda x: x in " \t"
    
    # find_longest_match(alo=0, ahi=None, blo=0, bhi=None)
    # Find longest matching block in a[alo:ahi] and b[blo:bhi].
    match = seqMatch.find_longest_match(0, len(text_a), 0, len(text_b))
    
    if (match.size):
        return text_a[match.a: match.a + match.size]

db.conn.create_function("get_longest_common_substring", 2, get_longest_common_substring)

for i in db.query("SELECT get_longest_common_substring('This is a text with blah blah', \
                                                       'And here is a text with blah, whatever') AS match"):
    print(i)

"""
{'match': ' is a text with blah'}
"""

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: