A couple of days ago, my colleague Ray Corrigan shared with me a time-consuming problem he was working on: looking for original uses of sentences, from previously published documents, drafts and bills, that are contained in a draft code of practice currently out for consultation. On the one hand you might think of this as an ‘original use’ detection problem, on the other, a “plagiarism detection” issue. Text analysis is a data problem I’ve not really engaged with before, so I thought this provided an interesting problem set – how can I detect common sentences across two documents? It has the advantage of being easily stated, and easily tested…
I’ll describe in another post how I tried to solve the problem, but in this post I’ll separate out one of the useful tricks I stumbled across along the way – how to display a multiple-word concordance using the Python text analysis package, NLTK.
To generate a tokenised text, and a set of overlapping n-grams over it, we can use a pattern of the form:
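Something along the lines of the following sketch, for example (the filename and the n-gram length are just placeholders, not from the original problem):

import nltk   # requires the punkt tokenizer models: nltk.download('punkt')

# Read the raw document text in as a single string
rawtext = open('mydoc.txt').read()

# Tokenise the raw text and wrap the tokens in an nltk.Text object
tokens = nltk.word_tokenize(rawtext)
text = nltk.Text(tokens)

# Generate overlapping n-grams (here, 5-grams) over the token stream
ngrams = list(nltk.ngrams(tokens, 5))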
To set the scene, the NLTK concordance method will find the places in a document where a search word appears and display the word along with the bit of text just before and after it:
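For example, searching the tokenised text above for a single word (the search term is just for illustration):

# concordance() prints each match in a fixed-width window of surrounding context
text.concordance('practice')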
Here’s my hacked-together code – you’ll see the trick for detecting the start of a phrase is actually a piece of code I found on Stack Overflow for finding the intersection of multiple Python lists, wrapped up in another found recipe describing how to emulate nltk.text.concordance() and return the found segments:
import nltk

def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    #concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/

    phraseList = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    #Find the offsets for each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]
    offsets_norm = []
    #For each token in the phraseList, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])

    #We have found the offset of a phrase if the rebased values intersect
    #--
    # http://stackoverflow.com/a/3852792/454773
    #the intersection method takes an arbitrary amount of arguments
    #result = set(d[0]).intersection(*d[1:])
    #--
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    concordance_txt = ([text.tokens[map(lambda x: x - left_margin if (x - left_margin) > 0 else 0, [offset])[0]:offset + len(phraseList) + right_margin]
                        for offset in intersects])

    outputs = [''.join([x + ' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)

If there are better ways of doing this, please let me know via a comment :-)
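By way of a usage example – the text and phrase here are made up purely for illustration (and in Python 3 you’ll need the list() cast described in the PPS below):

rawtext = 'The code of practice sets out the principles. This code of practice applies to all providers.'

# Print each occurrence of the phrase with three tokens of context either side
for hit in n_concordance(rawtext, 'code of practice', left_margin=3, right_margin=3):
    print(hit)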
PS Thinking about it, I should possibly tokenise the phrase rather than split it? Then the phrase tokens would be generated in the same way as the tokens used in the matcher?
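Something like the following (untested) tweak to the first line of n_concordance_tokenised is what I have in mind:

#Tokenise the phrase the same way the document text is tokenised,
#rather than naively splitting on spaces
phraseList = nltk.word_tokenize(phrase)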
PPS In Python 3, you need to cast the map() to a subscriptable list:

    concordance_txt = ([text.tokens[list(map(lambda x: x - left_margin if (x - left_margin) > 0 else 0, [offset]))[0]:offset + len(phraseList) + right_margin]
                        for offset in intersects])