Finding Common Phrases or Sentences Across Different Documents

As mentioned in the previous post, I picked up on a nice little challenge from my colleague Ray Corrigan a couple days ago to find common sentences across different documents.

My first, rather naive, thought was to segment each of the docs into sentences and then compare sentences using a variety of fuzzy matching techniques, retaining the ones that sort-of matched. That approach was a bit ropey (I’ll describe it in another post), but whilst pondering it over a dog walk a much neater idea suggested itself – compare n-grams of various lengths over the two documents. At it’s heart, all we need to do is find the intersection of the ngrams that occur in each document.

So here’s a recipe to do that…

First, we need to get documents into a text form. I started off with PDF docs, but it was easy enough to extract the text using textract.

!pip install textract

import textract
txt = textract.process('ukpga_19840012_en.pdf')

The next step is to compare docs for a particular size n-gram – the following bit of code finds the common ngrams of a particular size and returns them as a list:

import nltk
from nltk.util import ngrams as nltk_ngrams

def common_ngram_txt(tokens1,tokens2,size=15):
    print('Checking ngram length {}'.format(size))
    ng1=set(nltk_ngrams(tokens1, size))
    ng2=set(nltk_ngrams(tokens2, size))

    match=set.intersection(ng1,ng2)
    print('..found {}'.format(len(match)))

    return match

I want to be able to find common ngrams of various lengths, so I started to put together the first fumblings of an n-gram sweeper.

The core idea was really simple – starting with the largest common n-gram, detect increasingly smaller n-grams; then do a concordance report on each of the common ngrams to show how that ngram appeared in the context of each document. (See n-gram / Multi-Word / Phrase Based Concordances in NLTK.)

Rather than generate lots of redundant reports – if I detected the common 3gram “quick brown fox”, I would also find the common ngrams “quick brown” and “brown fox” – I started off with the following heuristic: if a common n-gram is part of a longer common n-gram, ignore it. But this immediately turned up a problem. Consider the following case:

Document 1: the quick brown fox
Document 2: the quick brown fox and the quick brown cat and the quick brown dog

Here, there is a common 4-tuple:the quick brown fox. There is also a common 3-tuple: the quick brown, which a concordance plot would reveal as being found in the context of a cat and a dog as well as a fox. What I really need to do is keep a copy of common n-gram locations that are not contained within the context of a longer n-gram context in the second document, but drop copies of locations where it is subsumed in an already found longer ngram.

Indexing on token number within the second doc, I need to return something like this:

([('the', 'quick', 'brown', 'fox'),
  ('the', 'quick', 'brown'),
  ('the', 'quick', 'brown')],
 [[0, 3], [10, 12], [5, 7]]

which shows up the shorter common ngrams only in places where it is not part of the longer common ngram.

In the following, n_concordance_offset() finds the location of a phrase token list within a document token list. The ngram_sweep_txt() scans down a range of ngram lengths, starting with the longest, trying to identify locations that are not contained with an already discovered longer ngram.

def n_concordance_offset(text,phraseList):
    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())
    
    #Find the offset for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    #via http://stackoverflow.com/a/3852792/454773
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])
    
    return intersects
    
def ngram_sweep_txt(txt1,txt2,ngram_min=8,ngram_max=50):    
    tokens1 = nltk.word_tokenize(txt1)
    tokens2 = nltk.word_tokenize(txt2)

    text1 = nltk.Text( tokens1 )
    text2 = nltk.Text( tokens2 )

    ngrams=[]
    strings=[]
    ranges=[]
    for i in range(ngram_max,ngram_min-1,-1):
        #Find long ngrams first
        newsweep=common_ngram_txt(tokens1,tokens2,size=i)
        for m in newsweep:
            localoffsets=n_concordance_offset(text2,m)

            #We need to avoid the problem of masking shorter ngrams by already found longer ones
            #eg if there is a common 3gram in a doc2 4gram, but the 4gram is not in doc1
            #so we need to see if the current ngram is contained within the doc index of longer ones already found
            
            for o in localoffsets:
                fnd=False
                for r in ranges:
                    if o>=r[0] and o<=r[1]:
                        fnd=True
                if not fnd:
                    ranges.append([o,o+i-1])
                    ngrams.append(m)
    return ngrams,ranges,txt1,txt2

def ngram_sweep(fn1,fn2,ngram_min=8,ngram_max=50):
    txt1 = textract.process(fn1).decode('utf8')
    txt2 = textract.process(fn2).decode('utf8')
    ngrams,ranges,txt1,txt2=ngram_sweep_txt(txt1,txt2,ngram_min=ngram_min,ngram_max=ngram_max)
    return ngrams,ranges,txt1,txt2

What I really need to do is automatically detect the largest n-gram and work back from there, perhaps using a binary search starting with an n-gram the size of the number of tokens in the shortest doc… But that’s for another day…

Having discovered common phrases, we need to report them. The following n_concordance() function, (based on this) does just that; the concordance_reporter() function manages the outputs.

import textract

def n_concordance(txt,phrase,left_margin=5,right_margin=5):
    #via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    phraseList=nltk.word_tokenize(phrase)

    intersects= n_concordance_offset(text,phraseList)
    
    concordance_txt = ([text.tokens[map(lambda x: x-left_margin if (x-left_margin)>0 else 0,[offset])[0]:offset+len(phraseList)+right_margin]
                        for offset in intersects])
                         
    outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def concordance_reporter(fn1='Draft_Equipment_Interference_Code_of_Practice.pdf',
                         fn2='ukpga_19940013_en.pdf',fo='test.txt',ngram_min=10,ngram_max=15,
                         left_margin=5,right_margin=5,n=5):
    
    fo=fn2.replace('.pdf','_ngram_rep{}.txt'.format(n))
    
    f=open(fo, 'w+')
    f.close()
     
    print('Handling {}'.format(fo))
    ngrams,strings, txt1,txt2=ngram_sweep(fn1,fn2,ngram_min,ngram_max)
    #Remove any redundancy in the ngrams...
    ngrams=set(ngrams)
    with open(fo, 'a') as outfile:
        outfile.write('REPORT FOR ({} and {}\n\n'.format(fn1,fn2))
        print('found {} ngrams in that range...'.format(len(ngrams)))
        for m in ngrams:
            mt=' '.join(m)
            outfile.write('\n\-------\n{}\n\n'.format(mt.encode('utf8')))
            for c in n_concordance(txt1,mt,left_margin,right_margin):
                outfile.write('<<<<<{}\n\n'.format(c.encode('utf8')))
            for c in n_concordance(txt2,mt,left_margin,right_margin):
                outfile.write('>>>>>{}\n\n'.format(c.encode('utf8')))
    return

Finally, the following function makes it easier to compare a document of interest with several other documents:

for f in ['Draft_Investigatory_Powers_Bill.pdf','ukpga_19840012_en.pdf',
          'ukpga_19940013_en.pdf','ukpga_19970050_en.pdf','ukpga_20000023_en.pdf']:
    concordance_reporter(fn2=f,ngram_min=10,ngram_max=40,left_margin=15,right_margin=15)

Here’s an example of the sort of report it produces:

REPORT FOR (Draft_Equipment_Interference_Code_of_Practice.pdf and ukpga_19970050_en.pdf


\-------
concerning an individual ( whether living or dead ) who can be identified from it

>>>>>personal information is information held in confidence concerning an individual ( whether living or dead ) who can be identified from it , and the material in question relates 

<<<<<section `` personal information '' means information concerning an individual ( whether living or dead ) who can be identified from it and relating—- ( a ) to his 


\-------
satisfied that the action authorised by it is no longer necessary .

>>>>>must cancel a warrant if he is satisfied that the action authorised by it is no longer necessary . 4.13 The person who made the application 

<<<<<an authorisation given in his absence if satisfied that the action authorised by it is no longer necessary . ( 6 ) If the authorising officer 

<<<<<cancel an authorisation given by him if satisfied that the action authorised by it is no longer necessary . ( 5 ) An authorising officer shall 


\-------
involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose

>>>>>one or more offences and : It involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose ; or a person aged twenty-one or 

<<<<<if , — ( a ) it involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose , or ( b ) the offence 


\-------
to an express or implied undertaking to hold it in confidence

>>>>>in confidence if it is held subject to an express or implied undertaking to hold it in confidence or is subject to a restriction on 

>>>>>in confidence if it is held subject to an express or implied undertaking to hold it in confidence or it is subject to a restriction 

<<<<<he holds it subject— ( a ) to an express or implied undertaking to hold it in confidence , or ( b ) to a 


\-------
no previous convictions could reasonably be expected to be sentenced to

>>>>>a person aged twenty-one or over with no previous convictions could reasonably be expected to be sentenced to three years’ imprisonment or more . 4.5 

<<<<<attained the age of twenty-one and has no previous convictions could reasonably be expected to be sentenced to imprisonment for a term of three years 


\-------
considers it necessary for the authorisation to continue to have effect for the purpose for which it was

>>>>>to have effect the Secretary of State considers it necessary for the authorisation to continue to have effect for the purpose for which it was given , the Secretary of State may 

<<<<<in whose absence it was given , considers it necessary for the authorisation to continue to have effect for the purpose for which it was issued , he may , in writing 

The first line in a block is the common phrase, the >>> elements are how the phrase appears in the first doc, the <<< elements are how it appears in the second doc. The width of the right and left margins of the contextual / concordance plot are parameterised and can be easily increased.

This seems such a basic problem – finding common phrases in different documents – I’d have expected there to be a standard solution to this? But in the quick search I tried, I couldn’t find one? It was quite a fun puzzle to play with though, and offers lots of scope for improvement (I suspect it’s a bit ropey when it comes to punctuation, for example). But it’s a start…:-)

There’s lots could be done on a UI front, too. For example, it’d be nice to be able to link documents, so you can click through from the first to show where the phrase came from in the second. But to do that requires annotating the original text, which in turn means being able to accurately identify where in a doc a token sequence appears. But building UIs is hard and time consuming… it’d be so much easier if folk could learn to use a code line UI!;-)

If you know of any “standard” solutions or packages for dealing with this sort of problem, please let me know via the comments:-)

PS The code could also do with some optimisation – eg if we know we’re repeatedly comparing against a base doc, it’s foolish to keep opening and tokenising the base doc…

4 comments

  1. Pingback: Identifying similar text (and plagiarism) | Finding Knowledge
  2. Pingback: Identifying similar text (and plagiarism) – Writing Analytics