Simple Text Analysis Using Python – Identifying Named Entities, Tagging, Fuzzy String Matching and Topic Modelling

Text processing is not really my thing, but here’s a round-up of some basic recipes that allow you to get started with some quick’n’dirty tricks for identifying named entities in a document, and tagging entities in documents.

In this post, I’ll briefly review some getting started code for:

  • performing simple entity extraction from a text; for example, when presented with a text document, label it with named entities (people, places, organisations); entity extraction is typically based on statistical models that rely on document features such as correct capitalisation of names to work correctly;
  • tagging documents that contain exact matches of specified terms: in this case, we have a list of specific text strings we are interested in (for example, names of people or companies) and we want to know if there are exact matches in the text document and where those matches occur in the document;
  • partial and fuzzing string matching of specified entities in a text: in this case, we may want to know whether something resembling a specified text string occurs in the document (for example, mis0spellings of name);
  • topic modelling: the identification, using statistical models, of “topic terms” that appear across a set of documents.

You can find a gist containing a notebook that summarises the code here.

Simple named entity recognition

spaCy is a natural language processing library for Python library that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.

According to the spaCy entity recognition documentation, the built in model recognises the following types of entity:

  • PERSON People, including fictional.
  • NORP Nationalities or religious or political groups.
  • FACILITY Buildings, airports, highways, bridges, etc.
  • ORG Companies, agencies, institutions, etc.
  • GPE Countries, cities, states. (That is, Geo-Political Entitites)
  • LOC Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT Objects, vehicles, foods, etc. (Not services.)
  • EVENT Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART Titles of books, songs, etc.
  • LANGUAGE Any named language.
  • LAW A legislation related entity(?)

Quantities are also recognised:

  • DATE Absolute or relative dates or periods.
  • TIME Times smaller than a day.
  • PERCENT Percentage, including “%”.
  • MONEY Monetary values, including unit.
  • QUANTITY Measurements, as of weight or distance.
  • ORDINAL “first”, “second”, etc.
  • CARDINAL Numerals that do not fall under another type.

Custom models can also be trained, but this requires annotated training documents.

#!pip3 install spacy
from spacy.en import English
parser = English()
example='''
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé.
'''
#Code "borrowed" from somewhere?!
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)

    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can do access the parsed examples "ents" property like this:
    ents = list(parsedEx.ents)
    tags={}
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if ' '.join(term) not in tags:
            tags[term]=[(entity.label, entity.label_)]
        else:
            tags[term].append((entity.label, entity.label_))
    print(tags)
entities(example)
-------------- entities only ---------------
{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}
q= "Bob Smith was in the Houses of Parliament the other day"
entities(q)
-------------- entities only ---------------
{'Bob Smith': [(377, 'PERSON')]}

Note that the way that models are trained typically realises on cues from the correct capitalisation of named entities.

entities(q.lower())
-------------- entities only ---------------
{}

polyglot

A simplistic, and quite slow, tagger, supporting limited recognition of Locations (I-LOC), Organizations (I-ORG) and Persons (I-PER).

#!pip3 install polyglot

##Mac ??
#!brew install icu4c
#I found I needed: pip3 install pyicu, pycld2, morfessor
##Linux
#apt-get install libicu-dev
!polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
from polyglot.text import Text

text = Text(example)
text.entities
[I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Halifax']),
 I-LOC(['Girvan']),
 I-LOC(['Poland']),
 I-PER(['Nestlé']),
 I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Newcastle']),
 I-ORG(['EU']),
 I-ORG(['EU']),
 I-ORG(['Government']),
 I-ORG(['GMB']),
 I-LOC(['Nestlé'])]
Text(q).entities
[I-PER(['Bob', 'Smith'])]

Partial Matching Specific Entities

Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs’ names, or a list of ogranisations of subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.

To do this, we can search a text for strings that exactly match any of the specified terms or where any of the specified terms match part of a longer string in the text.

Naive implementations can take a signifcant time to find multiple strings within a tact, but the Aho-Corasick algorithm will efficiently match a large set of key values within a particular text.

## The following recipe was hinted at via @pudo

#!pip3 install pyahocorasick
#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py

First, construct an automaton that identifies the terms you want to detect in the target text.

from ahocorasick import Automaton

A=Automaton()
A.add_word("Europe",('VOCAB','Europe'))
A.add_word("European Union",('VOCAB','European Union'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson'))
A.add_word("Boris",('PERSON','Boris Johnson'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson (LC)'))

A.make_automaton()
q2='Boris Johnson went off to Europe to complain about the European Union'
for item in A.iter(q2):
    print(item, q2[:item[0]+1])
(4, ('PERSON', 'Boris Johnson')) Boris
(12, ('PERSON', 'Boris Johnson')) Boris Johnson
(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe
(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe
(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union

Once again, case is important.

q2l = q2.lower()
for item in A.iter(q2l):
    print(item, q2l[:item[0]+1])
(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson

We can tweak the automata patterns to capture the length of the string match term, so we can annotate the text with matches more exactly:

A=Automaton()
A.add_word("Europe",(('VOCAB', len("Europe")),'Europe'))
A.add_word("European Union",(('VOCAB', len("European Union")),'European Union'))
A.add_word("Boris Johnson",(('PERSON', len("Boris Johnson")),'Boris Johnson'))
A.add_word("Boris",(('PERSON', len("Boris")),'Boris Johnson'))

A.make_automaton()
for item in A.iter(q2):
    start=item[0]-item[1][0][1]+1
    end=item[0]+1
    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))
(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo
(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we
(31, (('VOCAB', 6), 'Europe')) to *Europe* to
(60, (('VOCAB', 6), 'Europe')) he *Europe*an 
(68, (('VOCAB', 14), 'European Union')) he *European Union*

Fuzzy String Matching

Whilst the Aho-Corasick approach will return hits for strings in the text that partially match the exact match key terms, sometimes we want to know whether there are terms in a text that almost match terms in specific set of terms.

Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a fuzzy way with entities in a specified list.

fuzzyset

The python fuzzyset package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.

For example, if we extract the name Boris Johnstone in a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.

A confidence value expresses the degree of match to terms in the fuzzy match set list.

import fuzzyset

fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
([(0.8666666666666667, 'Boris Johnson')],
 [(0.8333333333333334, 'Diane Abbott')],
 [(0.23076923076923073, 'Diane Abbott')])

fuzzywuzzy

If we want to try to find a fuzzy match for a term within a text, we can use the python fuzzywuzzy library. Once again, we spcify a list of target items we want to try to match against.

from fuzzywuzzy import process
from fuzzywuzzy import fuzz
terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']

q= "Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day"
process.extract(q,terms)
[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]

By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:

process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)
[('Houses of Parliament', 90), ('Boris Johnson', 85)]

A range of fuzzy match scroing algorithms are supported:

  • WRatio – measure of the sequences’ similarity between 0 and 100, using different algorithms
  • QRatio – Quick ratio comparison between two strings
  • UWRatio – a measure of the sequences’ similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode
  • UQRatio – Unicode quick ratio
  • ratio
  • `partial_ratio – ratio of the most similar substring as a number between 0 and 100
  • token_sort_ratio – a measure of the sequences’ similarity between 0 and 100 but sorting the token before comparing
  • partial_token_set_ratio
  • partial_token_sort_ratio – ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing

More usefully, perhaps, is to return items that match above a particular confidence level:

process.extractBests(q,terms,score_cutoff=90)
[('Houses of Parliament', 90), ('Diane Abbott', 90)]

However, one problem with the fuzzywuzzy matcher is that it doesn’t tell us where in the supplied text string the match occurred, or what string in the text was matched.

The fuzzywuzzy package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this is optionally the first item in the original list?)

names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']
process.dedupe(names, threshold=80)
['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']

It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:

import hashlib

clusters={}
fuzzed=[]
for t in names:
    fuzzyset=process.extractBests(t,names,score_cutoff=85)
    #Generate a key based on the sorted members of the set
    keyvals=sorted(set([x[0] for x in fuzzyset]),key=lambda x:names.index(x),reverse=False)
    keytxt=''.join(keyvals)
    key=hashlib.md5(keytxt).hexdigest()
    if len(keyvals)>1 and key not in fuzzed:
        clusters[key]=sorted(set([x for x in fuzzyset]),key=lambda x:names.index(x[0]),reverse=False)
        fuzzed.append(key)
for cluster in clusters:
    print(clusters[cluster])
[('Diane Abbott', 100), ('Diana Abbot', 87)]
[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]

OpenRefine Clustering

As well as running as a browser accessed application, OpenRefine also runs as a service that can be accessed from Python using the refine-client.py client libary.

In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items.

#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git
#NOTE - this requires a python 2 kernel
#Initialise the connection to the server using default or environment variable defined server settings
#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))
#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))
from google.refine import refine, facet
server = refine.RefineServer()
orefine = refine.Refine(server)
#Create an example CSV file to load into a test OpenRefine project
project_file = 'simpledemo.csv'
with open(project_file,'w') as f:
    for t in ['Name']+names+['Boris Johnstone']:
        f.write(t+ '\n')
!cat {project_file}
Name
Diane Abbott
Boris Johnson
Boris Johnstone
Diana Abbot
Boris Johnston
Joanna Lumley
Boris Johnstone
p=orefine.new_project(project_file=project_file)
p.columns
[u'Name']

OpenRefine supports a range of clustering functions:

- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}
clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')
for cluster in clusters:
    print(cluster)
[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]
[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]

Topic Models

Topic models are statistical models that attempts to categorise different “topics” that occur across a set of docments.

Several python libraries provide a simple interface for the generation of topic models from text contained in multiple documents.

gensim

#!pip3 install gensim
#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb
from gensim import corpora, models

def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling

def print_top_terms(lda, num_terms=10):
    txt=[]
    num_terms=min([num_terms,lda.num_topics])
    for i in range(0, num_terms):
        terms = [term for term,val in lda.show_topic(i,num_terms)]
        txt.append("\t - top {} terms for topic #{}: {}".format(num_terms,i,' '.join(terms)))
    return '\n'.join(txt)

To start with, let’s create a list of dummy documents and then generate word lists for each document.

docs=['The banks still have a lot to answer for the financial crisis.',
     'This MP and that Member of Parliament were both active in the debate.',
     'The companies that work in finance need to be responsible.',
     'There is a reponsibility incumber on all participants for high quality debate in Parliament.',
     'Corporate finance is a big responsibility.']

#Create lists of words from the text in each document
from nltk.tokenize import word_tokenize
docs = [ word_tokenize(doc.lower()) for doc in docs ]

#Remove stop words from the wordlists
from nltk.corpus import stopwords
docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]

Now we can generate the topic models from the list of word lists.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: parliament debate active
     - top 3 terms for topic #1: responsible work need
     - top 3 terms for topic #2: corporate big responsibility

The model is randomised – if we run it again we are likely to get a different result.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: finance corporate responsibility
     - top 3 terms for topic #1: participants quality high
     - top 3 terms for topic #2: member mp active

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s