I’ve just been having a quick look at getting custom NER (named entity recognition) working in spacy. The training data seems to be presented as a list of tuples with the form:
(TEXT_STRING,
{'entities': [(START_IDX, END_IDX, ENTITY_TYPE), ...]})
To simplify things, I wanted to create the training data structure from a simpler representation, a two-tuple of the form (TEXT, PHRASE), where PHRASE is the entity you want to match.
Let’s start by finding the index values of a match phrase in a text string:
import re

def phrase_index(txt, phrase):
    """Return (start, end) index pairs for each match of phrase in txt."""
    # Escape the phrase so characters such as "." or "$" are matched literally
    matches = [(m.start(), m.end())
               for m in re.finditer(re.escape(phrase), txt)]
    return matches
# Example
#phrase_index("this and this", "this")
#[(0, 4), (9, 13)]
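One caveat with plain substring matching is that "King" would also match inside "Kingdom", giving an entity span that stops mid-word. Here is a minimal sketch of a whole-word variant. The name phrase_index_words is hypothetical, and it assumes each phrase starts and ends with a word character so that the regex word-boundary anchors behave as expected.

```python
import re

def phrase_index_words(txt, phrase):
    """Return (start, end) pairs for whole-word matches of phrase in txt."""
    # Hypothetical variant of phrase_index: the word-boundary anchors stop
    # "King" matching inside "Kingdom". Assumes the phrase starts and ends
    # with a word character.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return [(m.start(), m.end()) for m in re.finditer(pattern, txt)]

print(phrase_index_words("The King ruled the Kingdom", "King"))
# [(4, 8)]
```

Whether partial-word matches are wanted depends on the data, so this is a design choice rather than a fix.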
I’m using training documents of the following form:
_train_strings = [("The King gave the pauper three gold coins and the pauper thanked the King.", [("three gold coins", "MONEY"), ("King", "ROYALTY")]) ]
We can then generate the formatted training data as follows:
def generate_training_data(_train_strings):
    """Generate training data from text and match phrases."""
    # Collect one (text, {"entities": [...]}) tuple per training string
    training_data = []
    for (txt, items) in _train_strings:
        _ents_list = []
        for (phrase, typ) in items:
            matches = phrase_index(txt, phrase)
            for (start, end) in matches:
                _ents_list.append((start, end, typ))
        if _ents_list:
            training_data.append((txt, {"entities": _ents_list}))
    return training_data
# Call as:
#generate_training_data(_train_strings)
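If two match phrases overlap in the text (say "gold coins" and "three gold coins"), the entity list will contain overlapping spans, which, as far as I can tell, spacy rejects when entities are set on a doc. Here is a minimal sketch of a filter that could be applied to each entity list. The drop_overlaps helper is hypothetical, and "longest span wins" is an assumed policy, not something spacy requires.

```python
def drop_overlaps(ents):
    """Keep the longest non-overlapping (start, end, label) spans.

    Hypothetical helper: longer spans win, ties go to the earlier span.
    """
    kept = []
    # Consider the longest spans first, then the earliest start
    for start, end, label in sorted(ents, key=lambda e: (e[0] - e[1], e[0])):
        # Keep a span only if it ends before, or starts after,
        # every span that has already been kept
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

text = "The King gave the pauper three gold coins."
ents = [(25, 41, "MONEY"), (31, 41, "MONEY"), (4, 8, "ROYALTY")]
print(drop_overlaps(ents))
# [(4, 8, 'ROYALTY'), (25, 41, 'MONEY')]
```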
I was a bit surprised this sort of utility doesn’t already exist? Or did I miss it? (I haven’t really read the spacy docs, but then again, spacy seems to keep getting updated…)
I think you want the EntityRuler as detailed here:
https://spacy.io/usage/rule-based-matching#entityruler
You can add your entity patterns as shown in the example and then run your doc text through the nlp pipe; the text will be tokenized and you can check that entities were picked up in doc.ents.
Training uses DocBin files, per this link:
https://spacy.io/usage/training#training-data
Save your train/dev split to disk and pass it in your config file. Happy to ping back and forth on this via the comments section!
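For reference, the (phrase, type) pairs from the post map fairly directly onto the EntityRuler's phrase-pattern format, a list of dicts with "label" and "pattern" keys. Here is a minimal sketch. The make_ruler_patterns helper is hypothetical, and the spacy calls appear only as comments, assuming spacy v3.

```python
def make_ruler_patterns(train_strings):
    """Build EntityRuler-style phrase patterns from (text, [(phrase, type)]) items."""
    patterns = []
    seen = set()
    for _txt, items in train_strings:
        for phrase, typ in items:
            if (phrase, typ) not in seen:  # skip duplicate phrase/type pairs
                seen.add((phrase, typ))
                patterns.append({"label": typ, "pattern": phrase})
    return patterns

train_strings = [
    ("The King gave the pauper three gold coins and the pauper thanked the King.",
     [("three gold coins", "MONEY"), ("King", "ROYALTY")]),
]
patterns = make_ruler_patterns(train_strings)
print(patterns)

# Assumed usage with spacy v3 (not run here):
#   import spacy
#   nlp = spacy.blank("en")
#   ruler = nlp.add_pipe("entity_ruler")
#   ruler.add_patterns(patterns)
#   doc = nlp(train_strings[0][0])
#   print([(ent.text, ent.label_) for ent in doc.ents])
```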
Ah, thanks for that — will take a look.
(Apols for delay in approving the comment…)
Just had a quick play and that looks handy.
E.g. rules of the form:
{"label": "MONEY", "pattern": [{"TEXT": {"REGEX": "(gold|silver|copper)"}, "OP": "?"},
{"TEXT": {"REGEX": "coins?"}}]}
If I try to add an {"ENT_TYPE": "CARDINAL", "OP": "?"} at the start, though, it doesn’t return things like “5 gold coins”: 5 is matched as CARDINAL, and “gold coins” is matched as MONEY.
However, in the playground, I can get a pattern match on “ten gold coins” etc? eg https://demos.explosion.ai/matcher?text=The%20King%20gave%20100%20gold%20coins%20to%20the%20Queen%2C%205%20coins%20to%20the%20prince%20and%20%20ten%20COPPER%20coins%20the%20Princesses&model=en_core_web_sm&pattern=%5B%7B%22id%22%3A0%2C%22attrs%22%3A%5B%7B%22name%22%3A%22ENT_TYPE%22%2C%22value%22%3A%22CARDINAL%22%7D%2C%7B%22name%22%3A%22OP%22%2C%22value%22%3A%22%3F%22%7D%5D%7D%2C%7B%22id%22%3A1%2C%22attrs%22%3A%5B%7B%22name%22%3A%22LOWER%22%2C%22value%22%3A%22copper%22%7D%2C%7B%22name%22%3A%22OP%22%2C%22value%22%3A%22%3F%22%7D%5D%7D%2C%7B%22id%22%3A2%2C%22attrs%22%3A%5B%7B%22name%22%3A%22LOWER%22%2C%22value%22%3A%22coins%22%7D%5D%7D%5D
Adam – quickest of plays based on your suggestion. With a niggle in adding CARDINAL elements to matched entities: https://blog.ouseful.info/2022/09/30/creating-rule-based-entity-pattern-matchers-in-spacy/