After having a quick play at building a script for Finding Common Phrases or Sentences Across Different Documents, I thought I should spend a few minutes trying to find how this problem is more formally identified, described and attacked…
…and in doing so came across a rather interesting looking network on digital text forensics: PAN. (By the by, see also: EuSpRiG – European Spreadsheet Risks Interest Group.)
PAN run a series of challenges on originality which relate directly to the task I was looking at, with two in particular jumping out at me:
- External Plagiarism Detection: given a document and a set of documents, determine whether parts of the former have been reused from the latter (2009-2011)
- Text Alignment: given a pair of documents, extract all pairs of reused passages of maximal length (2012-2015)
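As a rough illustration of what the text alignment task is asking for, here's a minimal sketch using Python's standard-library `difflib.SequenceMatcher` to pull out maximal matching character runs between two documents (the function name, threshold and sample texts are my own invention, not part of the PAN task definitions, and real entries work at a much more robust level, e.g. handling paraphrase and obfuscation):

```python
from difflib import SequenceMatcher

def aligned_passages(doc_a, doc_b, min_len=20):
    """Return (offset_a, offset_b, text) for identical passages shared
    by two documents -- a naive take on the 'text alignment' task:
    find maximal matching runs and keep those above a length threshold."""
    matcher = SequenceMatcher(None, doc_a, doc_b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:
            passages.append((block.a, block.b,
                             doc_a[block.a:block.a + block.size]))
    return passages

# Toy example: one reused sentence embedded in otherwise different text
a = "The quick brown fox jumps over the lazy dog. Unrelated text here."
b = "Intro words. The quick brown fox jumps over the lazy dog. More."
for start_a, start_b, text in aligned_passages(a, b):
    print(start_a, start_b, repr(text))
```

Character-level matching like this only catches verbatim reuse; the PAN corpora also include obfuscated plagiarism, which is where the real work lies.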
The challenges come with training and test cases, and the performance metrics are described for each year – here’s the 2015 Plagiarism Detection / Text Reuse Detection Task Definitions.
Maybe I should give them a go to see how my code stacks up against the default benchmark… Hmmm… If I do worse, I can find a better solution; if I do better, my code wasn’t that bad; if I do the same, at least I got the starting point right… Just no time to do it right now… ;-)
PS Thinks: what would be really useful would be a git repository somewhere with code examples for all the challenge entries…;-)