Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy

I’ve just been having a quick look at getting custom NER (named entity recognition) working in spacy. The training data seems to be presented as a list of tuples with the form:

  (TEXT, {'entities': [(START_IDX, END_IDX, ENTITY_TYPE), ...]})

To simplify things, I wanted to create the training data structure from a simpler representation, a two-tuple of the form (TEXT, PHRASE), where PHRASE is the entity text you want to match.
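As a concrete illustration of the target structure, here is a hand-built training item for a made-up sentence. The sentence and the MONEY label are just example values I picked, not anything spacy requires. The offsets are character positions with an exclusive end index, so slicing the text with them should give back the phrase.

```python
# A hand-built training item of the form:
# (TEXT, {"entities": [(START_IDX, END_IDX, ENTITY_TYPE), ...]})
text = "She paid three gold coins."   # example sentence (made up)
phrase = "three gold coins"

start = text.find(phrase)   # character offset where the phrase begins
end = start + len(phrase)   # end offset is exclusive

training_item = (text, {"entities": [(start, end, "MONEY")]})

# Slicing with the offsets recovers the phrase
print(training_item)
print(text[start:end])
```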

Let’s start by finding the index values of a match phrase in a text string:

import re

def phrase_index(txt, phrase):
    """Return the (start, end) index of each occurrence of phrase in txt."""
    # Escape the phrase so regex metacharacters are matched literally
    matches = [(idx.start(), idx.end())
               for idx in re.finditer(re.escape(phrase), txt)]
    return matches

# Example
#phrase_index("this and this", "this")
#[(0, 4), (9, 13)]
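One thing to watch with simple substring matching is that a phrase such as "King" will also match inside "Kingdom". A minimal sketch of a whole-word variant, assuming that is the behaviour you want: the name `phrase_index_whole` is my own, and the lookarounds are used in place of `\b` so that phrases starting or ending with punctuation (such as "$5.00") still match.

```python
import re

def phrase_index_whole(txt, phrase):
    """Return (start, end) offsets of phrase in txt, skipping partial-word hits."""
    # re.escape makes the phrase literal; the lookarounds require that the
    # match is not immediately preceded or followed by a word character.
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return [(m.start(), m.end()) for m in re.finditer(pattern, txt)]

print(phrase_index_whole("The Kingdom has a King.", "King"))
print(phrase_index_whole("It cost $5.00 (approx).", "$5.00"))
```

Whether partial-word matches should count as entities depends on the data, so treat this as an option, not a replacement.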

I’m using training documents of the following form:

_train_strings = [("The King gave the pauper three gold coins and the pauper thanked the King.", [("three gold coins", "MONEY"), ("King", "ROYALTY")]) ]

We can then generate the formatted training data as follows:

def generate_training_data(_train_strings):
    """Generate training data from text and match phrases."""
    training_data = []
    for (txt, items) in _train_strings:
        _ents_list = []
        for (phrase, typ) in items:
            matches = phrase_index(txt, phrase)
            for (start, end) in matches:
                _ents_list.append( (start, end, typ) )
        if _ents_list:
            training_data.append( (txt, {"entities": _ents_list}) )

    return training_data

# Call as:
# generate_training_data(_train_strings)
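To see the whole thing end to end, here is a compact, self-contained sketch of the same idea run on the example document above. The function name `build_training_data` and the overlap check are my additions; the check assumes, as I understand it, that spacy will not accept overlapping entity spans, so it seemed safer to fail early.

```python
import re

def build_training_data(train_strings):
    """Convert (text, [(phrase, label), ...]) pairs into (text, {"entities": [...]}) items."""
    training_data = []
    for txt, items in train_strings:
        ents = []
        for phrase, label in items:
            for m in re.finditer(re.escape(phrase), txt):
                ents.append((m.start(), m.end(), label))
        ents.sort()  # order spans by start offset
        # Sanity check (my assumption): entity spans should not overlap
        for (s1, e1, _), (s2, e2, _) in zip(ents, ents[1:]):
            if s2 < e1:
                raise ValueError(
                    f"Overlapping entities in {txt!r}: {(s1, e1)} and {(s2, e2)}")
        if ents:
            training_data.append((txt, {"entities": ents}))
    return training_data

train_strings = [
    ("The King gave the pauper three gold coins and the pauper thanked the King.",
     [("three gold coins", "MONEY"), ("King", "ROYALTY")]),
]

# Print each span alongside the text it covers, as a quick visual check
for txt, annotations in build_training_data(train_strings):
    for start, end, label in annotations["entities"]:
        print(start, end, label, repr(txt[start:end]))
```

Printing the sliced text next to each offset is a cheap way to spot off-by-one errors before handing the data to a trainer.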

I was a bit surprised this sort of utility doesn’t already exist? Or did I miss it? (I haven’t really read the spacy docs, but then again, spacy seems to keep getting updated…)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

4 thoughts on “Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy”

  1. I think you want the EntityRuler as detailed here:

    You can add your entity patterns as shown in the example and then run your doc text through the nlp pipe: the text will be tokenized and you can check that entities were picked up in doc.ents.

    Training uses docbins per this link:

    Save your train/dev split to disk and pass it in your config file. Happy to ping back and forth on this via the comments section!

    1. Just had a quick play and that looks handy.

      Eg rules of form:

      {"label": "MONEY", "pattern": [{"TEXT": {"REGEX": "(gold|silver|copper)"}, "OP": "?"},
      {"TEXT": {"REGEX": "coins?"}}]}

      If I try to add an {"ENT_TYPE": "CARDINAL", "OP": "?"} at the start, though, it doesn’t return things like “5 gold coins”: 5 is matched as CARDINAL, and “gold coins” is matched as MONEY.

      However, in the playground, I can get a pattern match on “ten gold coins” etc? eg
