n-gram / Multi-Word / Phrase Based Concordances in NLTK

A couple of days ago, my colleague Ray Corrigan shared with me a time consuming problem he was working on: looking for the original uses, in previously published documents, drafts and bills, of sentences that appear in a draft code of practice currently out for consultation. On the one hand you might think of this as an ‘original use’ detection problem, on the other, a “plagiarism detection” issue. Text analysis is a data problem I’ve not really engaged with before, so this seemed like an interesting problem to try – how can I detect common sentences across two documents? It has the advantage of being easily stated, and easily tested…

I’ll describe how I tried to solve the problem in another post; in this post I’ll separate out one of the useful tricks I stumbled across along the way – how to display a multiple word (phrase) concordance using the Python text analysis package, NLTK.

To set the scene, the NLTK concordance method finds the places in a document where a search word appears and displays the word along with the bit of text just before and after it.
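For example, here’s a minimal sketch of the built-in single word concordance (the sample sentence is one I’ve made up for illustration):

import nltk

#A made-up sample sentence, just to show the built-in behaviour
#(nltk.word_tokenize() needs the punkt tokeniser data: nltk.download('punkt'))
sample = "The cat sat on the mat. The cat then chased the dog off the mat."
text = nltk.Text(nltk.word_tokenize(sample))

#Print each occurrence of 'cat' with a window of surrounding context
text.concordance('cat')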


Here’s my hacked together code – you’ll see that the trick used to detect the start of a phrase is actually a piece of code I found on Stack Overflow for intersecting multiple Python lists, wrapped inside another found recipe that describes how to emulate nltk.Text.concordance() and return the found segments:

import nltk

def n_concordance_tokenised(text,phrase,left_margin=5,right_margin=5):
    #concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList=phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())
    
    #Find the offset for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    #--
    # http://stackoverflow.com/a/3852792/454773
    #the intersection method takes an arbitrary amount of arguments
    #result = set(d[0]).intersection(*d[1:])
    #--
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])
    
    #Slice a window of tokens around each matched phrase,
    #clamping the left edge at the start of the text
    concordance_txt = [text.tokens[max(offset-left_margin, 0): offset+len(phraseList)+right_margin]
                       for offset in intersects]

    outputs=[' '.join(con_sub) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt,phrase,left_margin=5,right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
 
    return n_concordance_tokenised(text,phrase,left_margin=left_margin,right_margin=right_margin)
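To show how it’s meant to be called, here’s a quick usage sketch – the sample text and phrase are made up for illustration:

#Made-up sample text and phrase, just to exercise the function
sample = ("The code of practice shall be laid before Parliament. "
          "A copy of the code of practice shall be published online.")

for line in n_concordance(sample, 'code of practice', left_margin=3, right_margin=3):
    print(line)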

If there are better ways of doing this, please let me know via a comment:-)

PS thinking about it, should I perhaps tokenise the phrase rather than just splitting it on spaces? That way the phrase tokens would be produced in the same way as the tokens used in the matcher.
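Something like the following one line change might do it (an untested sketch) – swapping the split() inside n_concordance_tokenised() for nltk.word_tokenize(), so the phrase goes through the same tokeniser as the text:

    #Tokenise the phrase with the same tokeniser as the text,
    #rather than naively splitting on spaces
    phraseList = nltk.word_tokenize(phrase)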