A couple of days ago, my colleague Ray Corrigan shared with me a time-consuming problem he was working on: looking for the original uses of sentences in previously published documents, drafts and bills that also appear in a draft code of practice currently out for consultation. On the one hand you might think of this as an ‘original use’ detection problem; on the other, a “plagiarism detection” issue. Text analysis is a data problem I’ve not really engaged with before, so this struck me as an interesting problem set – how can I detect common sentences across two documents? It has the advantage of being easily stated, and easily tested…
I’ll describe in another post how I tried to solve the problem, but in this post I’ll separate out one of the useful tricks I stumbled across along the way – how to display a multiple-word concordance using the Python text analysis package, NLTK.
To set the scene, the NLTK concordance() method will find the places in a document where a search word appears, and display each occurrence of the word along with the bit of text just before and after it:
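For example, on a toy sentence (using simple whitespace tokenisation here so the snippet runs without downloading NLTK’s punkt models – in practice you’d use nltk.word_tokenize on a real document):

```python
import nltk

# A toy "document"; in practice this would be the full text of a real document
raw = "the cat sat on the mat and the cat slept on the rug"
tokens = raw.split()  # simple whitespace tokenisation, no model download needed
text = nltk.Text(tokens)

# concordance() prints each occurrence of the word with its surrounding context
text.concordance("cat", width=40)
```

Each printed line shows one match, centred on the search word.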
Here’s my hacked-together code – you’ll see the trick for the phrase-start detection is actually a piece of code I found on Stack Overflow relating to the intersection of multiple Python lists, wrapped up in another found recipe describing how to emulate nltk.text.concordance() and return the found segments:
```python
import nltk

def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # Concordance replication via
    # https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    # Find the offsets for each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]

    # For each token in the phrase, rebase its offsets to the start of the phrase
    offsets_norm = []
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])

    # We have found the offset of a phrase if the rebased values intersect
    # http://stackoverflow.com/a/3852792/454773
    # (the intersection method takes an arbitrary number of arguments)
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    # Grab the phrase plus its margins, clamping the left edge at the start of the text
    concordance_txt = [text.tokens[max(0, offset - left_margin):offset + len(phraseList) + right_margin]
                       for offset in intersects]

    outputs = [' '.join(con_sub) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)
```
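To see the phrase-start trick in isolation, here’s the core of it pulled out on a toy token list – each word’s offsets are rebased by its position in the phrase, so any position where all the rebased offset lists agree is the start of a full phrase match:

```python
import nltk

# Toy token list containing the phrase "cat sat" twice
tokens = "the cat sat on the mat and the cat sat down".split()
c = nltk.ConcordanceIndex(tokens, key=lambda s: s.lower())

phrase = ["cat", "sat"]
offsets = [c.offsets(w) for w in phrase]  # where each word appears
# Rebase: a "sat" at position x belongs to a phrase starting at x-1, etc.
offsets_norm = [[x - i for x in offs] for i, offs in enumerate(offsets)]
# Positions present in every rebased list are phrase starts
starts = set(offsets_norm[0]).intersection(*offsets_norm[1:])
print(sorted(starts))  # → [1, 8]
```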
If there are better ways of doing this, please let me know via a comment:-)
PS thinking about it, I should possibly tokenise the phrase rather than just split it on spaces? Then the phrase tokens would be guaranteed to match the tokens used in the matcher?
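To illustrate why that could matter (using NLTK’s regex-based WordPunctTokenizer here, rather than word_tokenize, so the snippet needs no model download): splitting on spaces keeps punctuation glued to words, while a proper tokeniser splits it off, so the phrase tokens may never line up with the text tokens:

```python
from nltk.tokenize import WordPunctTokenizer

tok = WordPunctTokenizer()  # regex tokeniser; needs no downloaded models

phrase = "code of practice, as drafted"
print(phrase.split(' '))     # → ['code', 'of', 'practice,', 'as', 'drafted']
print(tok.tokenize(phrase))  # → ['code', 'of', 'practice', ',', 'as', 'drafted']
```

If the document was tokenised the second way, the split-based phrase token 'practice,' would never match, so tokenising the phrase with the same tokeniser as the text seems the safer bet.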