Digital Text Forensics

After having a quick play at building a script for Finding Common Phrases or Sentences Across Different Documents, I thought I should spend a few minutes trying to find out how this problem is more formally identified, described and attacked…

…and in doing so came across a rather interesting-looking network on digital text forensics: PAN. (By the by, see also: EuSpRiG – European Spreadsheet Risks Interest Group.)

PAN run a series of challenges on originality which relate directly to the task I was looking at, with two in particular jumping out at me:

  • External Plagiarism Detection: given a document and a set of documents, determine whether parts of the former have been reused from the latter (2009-2011)
  • Text Alignment: given a pair of documents, extract all pairs of reused passages of maximal length (2012-2015)
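The text alignment task is essentially what my quick-play script was groping towards. As a very naive sketch (nothing like a proper PAN entry – just shared word n-grams merged into maximal runs, with all the names and parameters below my own invention), it might look something like:

```python
# Naive text-alignment sketch: find word 5-grams shared by two documents,
# then merge overlapping matches into maximal reused passages in the first.
# Illustration only - not a PAN baseline or a competition-grade detector.

def ngrams(words, n=5):
    """Yield (start_index, ngram_tuple) pairs for a list of words."""
    for i in range(len(words) - n + 1):
        yield i, tuple(words[i:i + n])

def shared_passages(doc_a, doc_b, n=5):
    """Return maximal passages in doc_a whose n-grams also occur in doc_b."""
    a_words, b_words = doc_a.lower().split(), doc_b.lower().split()
    b_grams = {g for _, g in ngrams(b_words, n)}
    # Start positions in doc_a of n-grams that also appear in doc_b
    hits = sorted(i for i, g in ngrams(a_words, n) if g in b_grams)
    # Merge overlapping/adjacent hits into maximal spans (start, end)
    passages = []
    for i in hits:
        if passages and i <= passages[-1][1]:   # overlaps the previous span
            passages[-1] = (passages[-1][0], i + n)
        else:
            passages.append((i, i + n))
    return [" ".join(a_words[s:e]) for s, e in passages]
```

Real entries do rather more, of course – obfuscated reuse, paraphrase, and so on – which is presumably why the performance metrics matter.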

The challenges come with training and test cases, and performance metrics are described for each year – here are the 2015 Plagiarism Detection / Text Reuse Detection Task Definitions.

Maybe I should give them a go to see how my code stacks up against the default benchmark… Hmmm… If I do worse, I find a better solution; if I do better, my code wasn’t that bad; if I do the same, at least I got the starting point right… Just no time to do it right now…;-)

PS Thinks: what would be really useful would be a git repository somewhere with code examples for all the challenge entries…;-)

Author: Tony Hirst

I'm a lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...