Several years ago, I picked up a fascinating book, Morphology of the Folktale by Vladimir Propp. It identifies a set of primitives that can be used to describe the structure of Russian folktales, though they also serve to describe lots of other Western folktales. (A summary of the primitives identified by Propp can be found here: Propp functions. See also Levi-Strauss and structural analysis of folk-tales.)
Over the weekend, I started wondering whether there are any folktale corpora out there annotated with Propp functions that might be used as the basis for a “structural search engine”, or that could perhaps be used to build a model that attempts to automatically analyse other folktales in structural terms.
Thus far, I’ve found one, as described in ProppLearner: Deeply annotating a corpus of Russian folktales to enable the machine learning of a Russian formalist theory, Mark A. Finlayson, Digital Scholarship in the Humanities, Volume 32, Issue 2, June 2017, Pages 284–300, https://doi.org/10.1093/llc/fqv067 . The paper describes the method used to annotate the tale collection, and also links to a data download containing annotation guides and the annotated collection as an XML file: Supplementary materials for “ProppLearner: Deeply Annotating a Corpus of Russian Folktales to Enable the Machine Learning of a Russian Formalist Theory” [description, zip file]. The bundle includes fifteen annotated tales marked up using the Story Workbench XML format; a guide to the format is also included.
(The Story Workbench is (was?) an Eclipse-based tool for annotating texts. I wonder if anything has replaced it that also opens the Story Workbench files? In passing, I note Prodigy, a commercial text annotation tool that integrates tightly with spaCy, as well as a couple of free, browser-based tools powered by a local server, brat and doccano. The metanno package promises “a JupyterLab extension that allows you build your own annotator”, but from the README, I can’t really make sense of what it supports…)
The markup file looks a bit involved, and will take some time to make sense of. It includes POS tags, sentences, extracted events, timex3 / TimeML tags, Proppian functions and archetypes, among other things. To get a better sense of it, I need to build a parser and have a play with it…
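As a first step before writing a proper parser, something like the following could be used to survey which elements and attributes an XML file actually contains. It is a minimal sketch that assumes nothing about the Story Workbench schema: the `survey_xml` helper and the tiny `sample` document are made up for illustration and are not taken from the ProppLearner bundle.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def survey_xml(source):
    """Count element tags and collect the attribute names seen on each tag.

    `source` may be a file path or a binary file-like object.
    """
    tag_counts = Counter()
    attrs_by_tag = {}
    # iterparse streams the document, so a large annotation file
    # does not need to be held in memory all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag_counts[elem.tag] += 1
        attrs_by_tag.setdefault(elem.tag, set()).update(elem.attrib)
        elem.clear()  # release the element's content once it has been counted
    return tag_counts, attrs_by_tag


# Hypothetical stand-in document, NOT the real Story Workbench format.
sample = (
    b"<doc>"
    b"<layer name='a'><item id='1'>Once</item><item id='2'>upon</item></layer>"
    b"<layer name='b'><item id='3' kind='x'>a time</item></layer>"
    b"</doc>"
)

tag_counts, attrs_by_tag = survey_xml(io.BytesIO(sample))
for tag, count in tag_counts.most_common():
    print(tag, count, sorted(attrs_by_tag[tag]))
```

Pointing `survey_xml` at the path of the downloaded XML file, rather than the in-memory sample, should give a quick inventory of the annotation layers to compare against the format guide.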
The TimeML tagging was new to me, and looks like it could be handy. It provides an XML-based tagging scheme for describing temporal events, expressions and relationships [Wikipedia]. The timexy Python package provides a spaCy pipeline component that “extracts and normalizes temporal expressions” and represents them using TimeML tags.
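To get a feel for what TimeML temporal expressions look like, here is a small sketch that pulls `TIMEX3` tags out of a sentence using the Python standard library. The fragment is hand-written for illustration (it is neither from the ProppLearner corpus nor output from timexy); it uses the standard `tid`, `type` and `value` attributes, with values given in ISO 8601 style.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment; the wrapping <s> element is just a
# convenience so that the sentence parses as a single XML document.
fragment = (
    "<s>The hero set off on "
    '<TIMEX3 tid="t1" type="DATE" value="1900-05-01">1 May 1900</TIMEX3>'
    " and travelled for "
    '<TIMEX3 tid="t2" type="DURATION" value="P3D">three days</TIMEX3>'
    ".</s>"
)

root = ET.fromstring(fragment)

# Collect (id, type, normalised value, surface text) for each temporal expression.
timexes = [
    (t.get("tid"), t.get("type"), t.get("value"), t.text)
    for t in root.iter("TIMEX3")
]

for tid, ttype, value, text in timexes:
    print(f"{tid}: {text!r} is a {ttype} normalised as {value}")
```

The useful part is the normalisation: the surface text can be as loose as “three days”, while the `value` attribute carries a machine-readable form.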
In passing, I also note a couple of spaCy-related things: a phrase matcher that could be used to reconcile different phrases (e.g. Finn Mac Cool and Fionn MacCumHail etc.): PhraseMatcher, and this note on Training Custom NER models in spaCy to auto-detect named entities.
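As a rough sketch of how the phrase matcher might be used to reconcile name variants, the following maps two spellings onto a single label. It assumes spaCy v3 is installed; a blank English pipeline is enough, so no trained model needs downloading. The `FINN_MAC_COOL` label and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline provides just the tokenizer, which is all PhraseMatcher needs.
nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text, so matching is case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# All spelling variants are registered under one (hypothetical) canonical label.
variants = ["Finn Mac Cool", "Fionn MacCumHail"]
matcher.add("FINN_MAC_COOL", [nlp.make_doc(v) for v in variants])

doc = nlp("Fionn MacCumHail, also written Finn Mac Cool, built the causeway.")

# Each match is (match_id, start_token, end_token); the id resolves to the label.
found = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(found)
```

Because both variants share one label, downstream code can treat every hit as the same character regardless of which spelling appears in a given tale.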