Via a comment to Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy by Adam G, I learn that, as well as training statistical models (as used in that post), spacy lets you write simple pattern matching rules that can be used to identify entities: Rule-based entity recognition.
import spacy
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
# Each rule pairs an entity label with a list of per-token conditions
patterns = [
    {"label": "ROYALTY", "pattern": [{"LOWER": "king"}]},
    {"label": "ROYALTY", "pattern": [{"LOWER": "queen"}]},
    {"label": "ROYALTY", "pattern": [{"TEXT": {"REGEX": "[Pp]rinc[es]*"}}]},
    {"label": "MONEY", "pattern": [
        # Optional metal name, then "coin(s)" or "piece(s)"
        {"LOWER": {"REGEX": "(gold|silver|copper)"}, "OP": "?"},
        {"TEXT": {"REGEX": "(coin|piece)s?"}},
    ]},
]
ruler.add_patterns(patterns)
doc = nlp("The King gave 100 gold coins to the Queen, 1 coin to the prince and ten COPPER pieces the Princesses")
print([(ent.text, ent.label_) for ent in doc.ents])
"""
[('King', 'ROYALTY'), ('100', 'CARDINAL'), ('gold coins', 'MONEY'), ('Queen', 'ROYALTY'), ('1', 'CARDINAL'), ('coin', 'MONEY'), ('prince', 'ROYALTY'), ('ten', 'CARDINAL'), ('COPPER pieces', 'MONEY'), ('Princesses', 'ROYALTY')]
"""
There is a handy tool for trying out patterns at https://demos.explosion.ai/matcher [example]:
(I note that REGEX is not available in the playground though?)
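A side note on the REGEX conditions in the rules above: as far as I can tell, spacy applies them to each token with search semantics (a match anywhere in the token counts), so an unanchored pattern can pick up more than intended. The sketch below uses only Python's re module on a hand-picked word list (the word list and variable names are mine, purely for illustration) to show the difference that anchoring with ^ and $ makes:

```python
import re

# The two regexes used in the rules above
royalty = "[Pp]rinc[es]*"
money = "(coin|piece)s?"

# Illustrative tokens (not from the original example sentence)
words = ["prince", "Princesses", "principal", "coins", "coincidence", "masterpiece"]

def hits(patterns, tokens):
    """Return the tokens that match any of the patterns under re.search."""
    return [t for t in tokens if any(re.search(p, t) for p in patterns)]

# Unanchored: a match anywhere inside the token is enough
unanchored = hits([royalty, money], words)

# Anchored: the whole token has to match
anchored = hits([f"^(?:{royalty})$", f"^(?:{money})$"], words)

print(unanchored)
print(anchored)
```

If the search behaviour is as I assume, writing the rule patterns as "^[Pp]rinc[es]*$" and "^(coin|piece)s?$" would be the safer form.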
The playground also generates the pattern match rule for you:
However, if I try that rule in my own pattern match, the cardinal element is not matched?
There appears to be a lot of scope as to what you can put in the patterns to be matched, including parts of speech. Which reminds me that I meant to look at Using Pandas DataFrames to analyze sentence structure, which uses dependency parsing on spacy parsed sentences to pull out relationships, such as people’s names and the associated job titles, from company documents. This probably also means digging into spacy’s Linguistic features.
This also makes me wonder again about the extent to which it might be possible to extract certain Propp functions from sentences parsed using spacy and explicit pattern matching rules on particular texts, with bits of hand tuning (eg using hand crafted rules in the identification of actors)?
PS I guess if this part of the pipeline is creating the entity types, they may not be available to the matcher, even if ENT_TYPE is allowed as part of the rule conditions? In which case, can we fettle the pipeline somehow so we can get rules to match on previously identified entity types?
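One way this might work, sketched under some assumptions: a ruler that runs after the entities have been set can test ENT_TYPE, but by default it will not replace spans that overlap existing entities, so the overwrite_ents setting needs switching on. To keep the sketch deterministic I use a blank English pipeline and a first entity ruler (LIKE_NUM tokens labelled CARDINAL) as a stand-in for the statistical ner component; the component names number_ruler and money_ruler are my own.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical model
nlp = spacy.blank("en")

# Stand-in for the statistical "ner" step (an assumption for this sketch):
# label every number-like token as CARDINAL
number_ruler = nlp.add_pipe("entity_ruler", name="number_ruler")
number_ruler.add_patterns([{"label": "CARDINAL", "pattern": [{"LIKE_NUM": True}]}])

# A second ruler, added later in the pipeline, so ENT_TYPE is already set
# on the tokens it sees. overwrite_ents=True lets its longer spans replace
# the overlapping CARDINAL entities instead of being dropped.
money_ruler = nlp.add_pipe(
    "entity_ruler", name="money_ruler", config={"overwrite_ents": True}
)
money_ruler.add_patterns([{
    "label": "MONEY",
    "pattern": [
        {"ENT_TYPE": "CARDINAL"},
        {"LOWER": {"IN": ["gold", "silver", "copper"]}, "OP": "?"},
        {"LOWER": {"REGEX": "^(coin|piece)s?$"}},
    ],
}])

doc = nlp("The King gave 100 gold coins to the Queen, 1 coin to the prince "
          "and ten COPPER pieces to the Princesses")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With the real en_core_web_sm pipeline the equivalent would presumably be nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True}), though which tokens arrive labelled CARDINAL then depends on the model.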
Hello again! Your blogs are a fun and thought-provoking read as always.
For the regex issue, you may need to prepend the string with an r in the pattern to make it a raw string for the regex interpreter like so: r"(gold|silver|copper)". I’m not entirely sure why this is the case as I’m not great at regexes (and usually rely on spacy patterns), but this Stack Overflow answer gets at the issue:
https://stackoverflow.com/questions/280435/escaping-regex-string/73068412#73068412
And this helpful spacy youtube video uses raw strings:
To your point about Propp functions, you may find the Holmes library that is built on top of spacy useful:
https://explosion.ai/blog/introduction-to-holmes
I’m still trying to wrap my head around all it can do, but you may be able to reinterpret some subset of Propp functions as queries on a corpus.
Happy exploring and thanks for the reply!
Ah, good point about r""; should have tried that… Holmes looks interesting… Also on my to-do list in the information retrieval context is https://haystack.deepset.ai/tutorials/first-qa-system, along with its claimed ability to support Q and A on tables, eg https://github.com/deepset-ai/haystack/blob/main/docs/v1.8.0/_src/tutorials/tutorials/15.md
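Having since tried the r"" suggestion: as far as I can tell, the prefix only changes how Python reads backslashes in the literal, so for a pattern like (gold|silver|copper) the raw and plain forms are the same string. It starts to matter once the pattern contains something like \b or \d. A quick check using just the standard library (the variable names are mine):

```python
import re

# No backslashes: the raw and plain literals are identical strings
plain = "(gold|silver|copper)"
raw = r"(gold|silver|copper)"
same_without_backslash = (plain == raw)

# With backslashes they differ: "\b" is a backspace character in a plain
# literal, but a regex word boundary when written as a raw string
plain_b = "\bcoins?\b"
raw_b = r"\bcoins?\b"

plain_hits = re.findall(plain_b, "gold coins")
raw_hits = re.findall(raw_b, "gold coins")

print(same_without_backslash, plain_hits, raw_hits)
```

So raw strings look like a sensible habit for regex patterns generally, even if they would not change the behaviour of the particular rules in this post.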