A SQLite Function for Finding Repeated Phrases in Fairy Tales, Part I

Way back when, when I first started looking at virtualised environments for distributing student materials, I came across a project created by @DataminerUK called infinite interns. The idea was to create an “intern” for each stage of a data investigation (a particular instance of a virtualised environment) that would be “virgin” and “uncontaminated”, sandboxed and reproducible. The insight was that you’d have a fresh pair of hands each time you wanted to a perform a task, set up just so, in a very particular way for the task at hand. And when the task was done you’d just throw that infinite intern away. (At this point, you could probably make some cultural observations about how interns might be treated in particular industries!)

Another way of considering infinite interns might be to consider them as a scaleable resource for performing a particular task, repeatedly, with no real cost to yourself. In coding terms, the infinite intern might thus also correspond to the repeated application of a particular function, which is to say, aset of instructions that are bundled together under a memorable name that can perform a particular task: hey, you, make_my_tea("egg and chips"); and so it will be done.

What this means is that you can create a function once, and then use it repeatedly: hey, you, make_my_tea("fishfinger sandwich").

But what’s even better is that we can find ways of scaling how we call functions even more. If I have a database containing tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, even, of records, how powerful would it be if we could write a function once and then apply it to each of those tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, even, of records from a single command? Pretty powerful, right?

In the previous two scarily named blogposts —Fuzzy Search, Pattern Matching and Retrieval Using Python and SQLite and Custom SQLite Functions That Return A Table: A SQLite Spell-Checker — I described how we can define custom Python programming functions for the simple, file based SQLite database that can return either a single valued item, or a whole set of items, from each record in the database.

That probably sounds more complicated than it actually is. For example, suppose each separate record in the database describes what I had for tea on a particular day. And then, with a single command, I could search that database for everything I’ve had for tea, ever, and call the make_my_tea() function separately for that separate item. But being a database, I could also search for just those items that I ate on a Friday, or during a particular month of a particular year, of that explicitly involved eggs. And get those made for me, all in one go.

Now I don’t happen to have a database of things I’ve had for tea in the past, but in the blog post Searchable SQLite Database of Andrew Lang Fairy Stories I did describe a simple database I had created around Andrew Lang’s coloured fairy story books.

This database contains several hundred stories, each as a separate database record, each one from two to twenty or so pages of text. It’s the sort of resource that would take quite a long time to read through if you were to retrieve every story and then read them all. One reason I created the database was so that I could more easily search for stories that just involved a king or queen or three sons.

q = """
SELECT * FROM books_fts
WHERE books_fts MATCH {db.quote('"pretty hen"')} ;
"""

for row in db.query(_q):
    print(row["title"])

# For future convenience, grab the text for the last reported item
story_text = row["text"]

"""
The House In The Wood
"""

But it also struck me that I should also be able to search for repeated refrains within a particular story. You undoubtedly know the sort of thing: the phrase “and he huffed and he puffed and he” that appears three times in the story of the Three Little Pigs.

If the repeated phrase is four words long, one way of trying to find such a phrase is take the first four words of the text, and then compare them to every set of four contiguous words in the text. If the repeated phrase crops up again, you will get an exact match at that point.

  • If the repeated phrase
  • the repeated phrase is
  • repeated phrase is four
  • the text. If the
  • text. If the repeated
  • If the repeated phrase
  • the repeated phrase crops

If you start with a long sequence, twenty words, for example, you can start with that. If you don’t get an exact match for twenty words, try with nineteen. If that fails, try eighteen. All the way donw to some minimum desired phrase length. If you do get a match, keep count of how many matches you get as you work through the whole text. If you get at leaest a desired number of repetitions, then you can return that match phrase.

Here’s a way of doing that in code. In part, it makes ue of some functions in a Python package called nltk (“natural language toolkit”) that separates each word in a text as a seperate token:

import nltk

tokens = nltk.word_tokenize(story_text)

We can find every (overlapping) set of 10 words in the text, so called ngrams, using the nltk_ngrams function:

from nltk.util import ngrams as nltk_ngrams

ngrams = nltk_ngrams(tokens, size)

for i in ngrams:
    print(' '.join(i))

"""
From the German of Grimm
the German of Grimm .
German of Grimm . A
of Grimm . A poor
Grimm . A poor woodcutter
...
"""

The Python pandas package is a very rich and very powerful package for creating and working with tabular datasets, which is to say, tables of data. In pandas, a data table is referred to as a dataframe.

For example, if we put the ngrams into a table (dataframe) with one column, and one ngram per row, we can count all the instances of each unique ngram directly from the dataframe:

import pandas as pd

df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens, size)]})
df['phrase'].value_counts()

"""
, pretty brindled cow ,        4
And you , pretty brindled      4
you , pretty brindled cow      4
pretty brindled cow , What     4
brindled cow , What do         4
                              ..
leaving him all day without    1
"""

If we take one of those repeating phrases, we can search the text to find each occurrence, and then report the wider context, showing the text before and after the phrase. This is often referred to as a concordance.

import re

q2 = 'pretty brindled cow'

for m in re.finditer(_q, row["text"]):
    # Display the matched terms and the 50 characters
    # immediately preceding and following the phrase 
    print(f'===\n{q2}: ', m.start(), m.end(), row["text"][max(0, m.start()-50):m.end()+50])

"""
===
pretty brindled cow:  1566 1585 
The man said:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' answered the beast
===
pretty brindled cow:  3505 3524 ed the beasts:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The beasts answered, 'Duks
===
pretty brindled cow:  4932 4951  beasts again:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' they said. Then th
===
pretty brindled cow:  6119 6138  to rest now?'

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The animals said, 'Duks:
"""

To make that a little easier to work with, we can wrap it in its own function:

def find_contexts(text, phrase,  width=50):
    """Find the context(s) of the phrase."""
    contexts = []
    for m in re.finditer(phrase, text):
        # Display the matched terms and the `width` characters
        # immediately preceding and following the phrase 
        contexts.append(text[max(0, m.start()-width):m.end()+width])
    return contexts

## Usage:
# for i in find_contexts(row['text'], 'pretty brindled cow'):
#    print(i,"\n==")

So we have a couple of possible pieces for our jigsaw that let us find repeated sequences and then, if required, report them in a wider context.

The following function automates the process, which is to say “implements the algorithm”, described above for starting with a long sequence of tokens – a long ngram — and then reducing the sequence length down to a minimum until we get a match.

Following the basic rule of three, in which many things are repeated in stories three times, we set the minimm repetition count to 3 as the default.

def scanner(text, minlen=4, startlen=50, min_repeats = 3, autostop=True):
    """Search a text for repeated phrases above a minimum length."""
    # Tokenise the text
    tokens = nltk.word_tokenize(text)
    
    #nltk_ngrams returns an empty list if we ask for an ngram longer than the sentence
    # So set the (long) start length to the lesser of the original provided
    # start length or the token length of the original text;
    # which is to say, the minimum of the provided start length 
    # or the length of the text
    startlen = min(startlen, len(tokens))
    
    # Start with a long sequence then iterate down to a minumum length sequence
    for size in range(startlen, minlen-1, -1):
        
        # Generate a dataframe containing all the ngrams, one row per ngram
        df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens,size)]})
        
        # Find the occurrence counts of each phrase
        value_counts_series = df['phrase'].value_counts()
        
        # If we have at least the specified number of occurrences
        # don't bother searching for any more
        if max(value_counts_series) >= min_repeats:
            if autostop:
                break
            pass
    # Return a pandas series (an indexed list, essentially)
    # containing the longest phrase (or phrases) we found
    return  value_counts_series[(value_counts_series>=min_repeats) & (value_counts_series==max(value_counts_series))]

If we call that function on the text, we get a long phrase back:

# Display the first (0'th indexed) item
# (In this case there is only one item hat repeats this number of times anyway.)
scanner( row["text"] ).index[0], scanner( row["text"] ).values[0]

"""
('Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?',
 4)
"""

We can create another function that calls the scanner function and returns the first (or only) long detected phrase, or nothing:

def find_repeating_phrase(text):
    """Return the longest repeating phrase found in a text.
       If there are more than one of the same lentgh, return the first.
    """
    phrase = scanner(text)
    
    #If there is at least one response, take the first
    if not phrase.empty:
        return phrase.index[0]

We can now make this function available to the SQLite database by registing it as a custom appliction function.

# The `db` object is a sqlite_utils database object
# Pass in:
# - the name of the function we want to use in the database
# - the number of arguments it takes
# - the function we want to invoke
db.conn.create_function('find_repeating_phrase', 1, find_repeating_phrase)

We can now use the function as part of a query, for example applying it to the text retrieved for the book The House In The Wood:

_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase 
FROM books AS phrase WHERE title="The House In The Wood" ;
"""

for row2 in db.query(_q):
    print(row2)

"""
{'book': 'The Pink Fairy Book', 'title': 'The House In The Wood', 'phrase': 'Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}
"""

At this point, I still haven’t yet run this function over anyhting other than a single test case, the text of The House In The Wood, which I knew contained a repeating phrase.

But what will happen if I run it over all the stories in The Pink Fairy Book? It probably sounds a bit daft, but I’m almost shaking in excited anticipation of writing this query and trying it out…

Here’s the query:

_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase
FROM books AS phrase WHERE book="The Pink Fairy Book" ;
"""

Now I’m going to shut my eyes and run it… and then peek at the response…

_q = """
SELECT  title, find_repeating_phrase(text) AS phrase
FROM books AS phrase WHERE book="The Pink Fairy Book" ;
"""

for row3 in db.query(_q):
    if row3['phrase'] is not None:
        print(row3)

How did it do?

{'title': 'Catherine And Her Destiny', 'phrase': 'the court , and'}
{'title': 'Esben And The Witch', 'phrase': "? ' 'Ye -- e -- s ! ' 'Are you coming back again ? ' 'That may be , ' said Esben . 'Then you 'll catch it , '"}
{'title': "Hans, The Mermaid's Son", 'phrase': ", ' said Hans ; ' I"}
{'title': 'How The Dragon Was Tricked', 'phrase': ", ' said the dragon"}
{'title': "How The Hermit Helped To Win The King's Daughter", 'phrase': "'Ask him if he will come with us"}
{'title': 'I Know What I Have Learned', 'phrase': 'and asked his wife whether the cow had calved'}
{'title': 'King Lindorm', 'phrase': 'rode out into the forest'}
{'title': 'Maiden Bright-Eye', 'phrase': ". 'Good evening , ' it said . 'Thanks , Maiden Bright-eye , ' said the dog . 'Where is my brother ? ' 'He is in the serpent-pit . ' 'Where is my wicked sister ? ' 'She is with the noble king . ' 'Alas ! alas !"}
{'title': 'Master And Pupil', 'phrase': ", ' said the boy ."}
{'title': 'Peter Bull', 'phrase': "'Oh , yes , ' said the"}
{'title': 'Princess Minon-Minette', 'phrase': ", ' replied the old woman"}
{'title': "The Bird 'Grip'", 'phrase': 'the horse with the golden shoes'}
{'title': 'The Cunning Shoemaker', 'phrase': ", ' replied the shoemaker"}
{'title': 'The Fir-Tree', 'phrase': "' thought the tree ."}
{'title': 'The Flying Trunk', 'phrase': ". '' ' ''"}
{'title': 'The Goblin And The Grocer', 'phrase': ', and he had'}
{'title': 'The Golden Lion', 'phrase': ', and the young man'}
{'title': 'The House In The Wood', 'phrase': ': Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}
{'title': 'The Jackal, The Dove, And The Panther', 'phrase': "which side do you turn to ? '"}
{'title': 'The King Who Would Have A Beautiful Wife', 'phrase': ". ' 'And I"}
{'title': 'The Little Hare', 'phrase': 'the little hare , the little hare ,'}
{'title': 'The Man Without A Heart', 'phrase': ", ' said the"}
{'title': 'The Merry Wives', 'phrase': ", ' said the"}
{'title': 'The Princess In The Chest', 'phrase': "'Sentry , where are you ? Sentry , where are you ?"}
{'title': 'The Shirt-Collar', 'phrase': "! ' said the shirt-collar ,"}
{'title': 'The Slaying Of The Tanuki', 'phrase': '. The Tanuki ,'}
{'title': 'The Snow-Man', 'phrase': "? ' asked the Snow-man ."}
{'title': 'The Snow-Queen', 'phrase': ", ' said the crow ,"}
{'title': 'The Sparrow With The Slit Tongue', 'phrase': 'the house , and'}
{'title': 'The Sprig Of Rosemary', 'phrase': "'Do you , rich as you are ,"}
{'title': 'The Story Of Ciccu', 'phrase': "accept them with my humble duty . '"}
{'title': 'The Three Brothers', 'phrase': "the house . '"}
{'title': "The Troll's Daughter", 'phrase': 'at the bottom of the sea .'}
{'title': 'The Two Brothers', 'phrase': 'seven years and seven months'}
{'title': 'The Water Of Life', 'phrase': 'the talking bird , and a branch of the tree of beauty'}
{'title': 'The White Dove', 'phrase': ", ' said the prince ,"}
{'title': 'The Wounded Lion', 'phrase': 'will hire me for a servant ?'}
{'title': 'Uraschimataro And The Turtle', 'phrase': 'the sea , and'}

Not bad. For example, we get good long phrases from Maiden Bright-Eye and Esben And The Witch, and not just The House In The Wood. And both How The Hermit Helped To Win The King’s Daughter and The Wounded Lion look like they may have a nice repeating riff in the story.

It does looks as if the punctation can get in the way some times though; and it could be handy if we were able to to set the minimum token length as part of our query.

There are a couple of ways we might try to find round this. One is to “depunctuate” the strings; another is perhaps to try tokenising on sentences themselves?

But for that, we will have to wait until another day.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: