Playing with story stuff, and trying to write a simple BibTeX record generator based on archive.org (see a forthcoming post…), I wanted to parse out publisher and location elements from strings such as Joe Bloggs & Sons (London) or London : Joe Bloggs & Sons.
If the string structures are consistent (PUBLISHER (LOCATION) or LOCATION : PUBLISHER, for example), we can start to create a set of regular expression / pattern matching expressions to pull out the fields.
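For example, a first pass at such a matcher might look something like the following minimal sketch (the patterns, the parse_imprint helper name and the test strings are all just illustrative assumptions on my part):

import re

# Two illustrative patterns for the structures mentioned above; real
# catalogue strings would doubtless need more (and more forgiving) patterns.
PATTERNS = [
    # PUBLISHER (LOCATION)
    re.compile(r"^(?P<publisher>.+?)\s*\(\s*(?P<location>[^)]+?)\s*\)$"),
    # LOCATION : PUBLISHER
    re.compile(r"^(?P<location>[^:]+?)\s*:\s*(?P<publisher>.+)$"),
]

def parse_imprint(s):
    """Return (publisher, location) if a known pattern matches, else None."""
    for pattern in PATTERNS:
        m = pattern.match(s.strip())
        if m:
            return m.group("publisher"), m.group("location")
    return None

print(parse_imprint("Joe Bloggs & Sons (London)"))  # ('Joe Bloggs & Sons', 'London')
print(parse_imprint("London : Joe Bloggs & Sons"))  # ('London', 'Joe Bloggs & Sons')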
But there is another way, albeit one that relies on a much heavier-weight computational approach than a simple regex, and that is to do some language modelling and see if we can extract entities of a particular type, such as an organisation and a geographical location.
Something like this, maybe:
import spacy

# Load spaCy's small English pipeline, which includes a named entity recogniser
nlp = spacy.load("en_core_web_sm")

refs = ["Joe Bloggs & Sons, London",
        "Joe Bloggs & Sons, (London)",
        "Joe Bloggs & Sons (London)",
        "London: Joe Bloggs & Sons",
        "London: Joe, Bloggs & Doe"]

# Run each test string through the pipeline and print out the entities it finds
for ref in refs:
    doc = nlp(ref)
    print("---")
    for ent in doc.ents:
        print(ref, "::", ent.text, ent.label_)
Which gives the following result:
---
Joe Bloggs & Sons, London :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, London :: London GPE
---
Joe Bloggs & Sons, (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, (London) :: London GPE
---
Joe Bloggs & Sons (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons (London) :: London GPE
---
London: Joe Bloggs & Sons :: London GPE
London: Joe Bloggs & Sons :: Joe Bloggs & Sons ORG
---
London: Joe, Bloggs & Doe :: London GPE
London: Joe, Bloggs & Doe :: Joe, Bloggs & Doe ORG
My natural inclination would probably be to get frustrated by writing ever more, and ever more elaborate, regular expressions to try to capture “always one more” outliers in how the publisher/location data might be presented in a single string. But I wonder: is that an outmoded way of doing compute now? Are the primitives we can readily work with now conveniently available as features at a higher representational level of abstraction?
See also things like storysniffer, a Python package that includes a pretrained model for sniffing a URL and estimating whether the related page is likely to contain a news story.
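If I recall the interface correctly, usage is something along these lines, though treat the exact call signature as an assumption on my part rather than gospel:

from storysniffer import StorySniffer

# Assumed class-based interface; check the storysniffer docs for the
# current API before relying on this.
sniffer = StorySniffer()

# Returns a guess at whether the URL is likely to point at a news story
sniffer.guess("https://www.example.com/news/2021/some-headline/")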
Some folk will say, of course, that these model-based approaches aren’t exact or provable. But in the case of my regexer to GUESS at the name of a publisher and a location, there is still uncertainty as to whether the correct response will be provided for an arbitrary string: my regexes might be incorrect, or I might have missed a pattern in likely presented strings. The model-based approach will also be uncertain in its ability to correctly identify the publisher and location, but the source of the uncertainty will be different. As a user, though, all I likely care about is that most of the time the approach does work, and does work reasonably correctly. I may well be willing to tolerate, or at least put up with, errors, and don’t really care how or why they arise unless I can do something about it.
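In that spirit, one pragmatic option, sketched here purely as an illustration and reusing the parse_imprint helper and the nlp pipeline from the snippets above, might be to try the cheap regex patterns first and only fall back to the heavier model when none of them match:

def guess_publisher_location(s):
    """Try the regex patterns first; fall back to the NER model if none match."""
    parsed = parse_imprint(s)  # regex sketch from earlier
    if parsed:
        return parsed
    # Fall back to the (heavier, but more flexible) language model,
    # taking the first ORG as the publisher and the first GPE as the location
    publisher = location = None
    for ent in nlp(s).ents:
        if ent.label_ == "ORG" and publisher is None:
            publisher = ent.text
        elif ent.label_ == "GPE" and location is None:
            location = ent.text
    return publisher, location

# No colon or bracketed location, so this falls through to the model
print(guess_publisher_location("Joe Bloggs & Sons, London"))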