Fragment: Is English, Rather than Maths, a Better Grounding for the Way Large Areas of Computing are Going?

Or maybe Latin…?

Having a quick play with the spacy natural language processing package just now, in part because I started wondering again about how to reliably extract “facts” from the MPs’ register of interests (I blame @fantasticlife for bringing that to mind again; data available via a @simonw datasette here: simonw/register-of-members-interests-datasette).

Skimming over the Linguistic Features docs, I realise that I probably need to brush up on my grammar. It also helped crystallise out a bit further some niggling concerns I’ve got about what form practical computing is taking, or might take in the near future, based on the sorts of computational objects we can readily work with.

Typically when you start teaching computing, you familiarise learners with the primitive types of computational object that they can work with: integers, floating point numbers, strings (that is, text), then slightly more elaborate structures, such as Python lists or dictionaries (dicts). You might then move up to custom defined classes, and so on. What sort of thing a computational object is largely determines what you can do with it, and what you can extract from it.

Traditionally, getting stuff out of natural language, free text strings has been a bit fiddly. If you’re working with a text sentence, represented typically as a string, one way of trying to extract “facts” from it is to pattern match on it. This means that strings (texts) with a regular structure are among the easier things to work with.

As an example, a few weeks ago I was trying to reliably parse out the name of a publisher, and the town/city of publication, from free text citation data such as Joe Bloggs & Sons (London) or London: Joe Bloggs & Sons (see Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing). A naive approach to this might be to try to write some templated regular expression pattern matching rules. A good way of understanding how this can work is to consider their opposite: template generators.

Suppose I have a data file with a list of book references; the data has a column for PUBLISHER, and one for CITY. If I want to generate a text based citation, I might create different rules to generate citations of different forms. For example, the template {PUBLISHER} ({CITY}) might display the publisher and then the city in brackets (the curly brackets show we want to populate the string with a value pulled from a particular column in a row of data); or the template {CITY}: {PUBLISHER} the city, followed by a colon, and then the publisher.
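In Python, those curly bracket templates are just format strings, so a minimal sketch of a template generator might look something like this (the PUBLISHER and CITY column names follow the example above; the sample row data is made up for illustration):

```python
# Two template styles for rendering a citation from a row of data.
templates = [
    "{PUBLISHER} ({CITY})",
    "{CITY}: {PUBLISHER}",
]

# A row of data, e.g. as read from a CSV file into a dict.
row = {"PUBLISHER": "Joe Bloggs & Sons", "CITY": "London"}

for template in templates:
    # str.format(**row) fills each {COLUMN} slot from the row dict.
    print(template.format(**row))

# Joe Bloggs & Sons (London)
# London: Joe Bloggs & Sons
```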

If we reverse this process, we then need to create a rule that can extract the publisher and the city from the text. This may or may not be easy to do. If our text phrase only contains the publisher and the city, things should be relatively straightforward. For example, if all the references were generated by a template {PUBLISHER} ({CITY}), I could just match on something like THIS IS THE PUBLISHER TEXT (THIS IS THE CITY): anything between the brackets is the city, anything before the brackets is the publisher. (That said, if the publisher name included something in brackets, things might start to break…). However, if the information appears in a more general context, things are likely to get more complicated quite quickly. For example, suppose we have the text “the book was published by Joe Bloggs and Sons in London“. How then do we pull out the publisher name? What pattern are we trying to match?
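As a sketch of the straightforward case, here is the {PUBLISHER} ({CITY}) template reversed into a regular expression with named capture groups (the example strings are made up; the last one shows the sort of breakage the caveat above warns about):

```python
import re

# Reverse of the "{PUBLISHER} ({CITY})" template: everything before a
# final bracketed group is the publisher, the bracketed text is the city.
pattern = re.compile(r"^(?P<PUBLISHER>.+?)\s*\((?P<CITY>[^()]+)\)$")

m = pattern.match("Joe Bloggs & Sons (London)")
print(m.group("PUBLISHER"), "/", m.group("CITY"))
# Joe Bloggs & Sons / London

# But a bracketed phrase that isn't a city still matches the pattern,
# so the rule silently extracts the wrong thing:
m = pattern.match("Penguin (Publishers)")
print(m.group("CITY"))
# Publishers  <- wrongly treated as the city
```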

When I first looked at this problem, I started writing rules at the character level in the string. But then it occurred to me that I could use a more structured, higher level representation of the text based on “entities”. Tools such as the spacy Python package provide a range of tools for representing natural language text at much higher levels of representation than simple text strings. For example, give spacy a text document, and it then lets you work with it at the sentence level, or the word level. It will also try to recognise “entities”: dates, numerical values, monetary values, peoples names, organisation names, geographical entities (or geopolitical entities, GPEs), and so on. With a sentence structured this way, I should be able to parse my publisher/place sentence and extract the recognised organisation and place:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The book was published by Joe Bloggs & Sons of London.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Joe Bloggs & Sons', 'ORG'), ('London', 'GPE')]

As well as using “AI”, which is to say, trained statistical models, to recognise entities, spacy will also let you define rules to match your entities. These definitions can be based on matching explicit strings, or regular expression patterns (see a quick example in Creating Rule Based Entity Pattern Matchers in spacy) as well as linguistic “parts of speech” features: nouns, verbs, prepositions etc etc.
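A minimal sketch of the rule based approach, using spacy’s EntityRuler component on a blank pipeline (so no trained model download is needed; the patterns themselves are just illustrative):

```python
import spacy

# A rule based matcher needs no trained model: start from a blank
# English pipeline and add an EntityRuler component to it.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Phrase patterns: tag these exact token sequences as entities.
ruler.add_patterns([
    {"label": "ORG", "pattern": "Joe Bloggs & Sons"},
    {"label": "GPE", "pattern": "London"},
])

doc = nlp("The book was published by Joe Bloggs & Sons of London.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The EntityRuler also accepts token level patterns, where each token can be matched on attributes such as its part of speech tag, which is where the grammar comes in.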

I don’t really remember when we did basic grammar at school, back in the day, but I do remember that I learned more about tenses in French and Latin than I ever did in English, and I learned more about grammar in general in Latin than I ever did in English.

So I wonder: how is grammar taught in school now? I wonder whether things like the spacy playground exist as interactive teaching / tutorial playground tools in English lessons, as a way of allowing learners to explore grammatical structures in a text?

In addition, I wonder whether tools like that are used in IT and computing lessons, where pupils get to explore writing linguistic pattern matching rules in support of teaching “computational thinking” (whatever that is…), and then build their own information extraction tools using their own rules.

Certainly, to make best progress, a sound understanding of grammar would help when writing rules, so where should that grammar be taught? In computing, or in English? This is not to argue that English teaching should become subservient to the computing curriculum. Whilst there is a risk that the joy of language is taught instrumentally, with a view to how it can be processed by machines, or how we can (are forced to?) modify our language to get machines to accept our input or do what we want them to do, this sort of knowledge might also help provide us with defences against computational processing and obviously opinionated (in a design sense) textual interfaces. Increasing amounts of natural language text are parsed and processed by machines, and used to generate responses: chat bots, email responders, SMS responders, search engines, etc etc. In one sense it’s wrong that we need to understand how our inputs may be interpreted by machines, particularly if we find ourselves changing how we interact to suit the machine. On the other, a little bit of knowledge may also give us power over those machines… [This is probably all sounding a little paranoid, isn’t it?! ;-)]

And of course, all the above is before we even start to get on to the subject of how to generate texts using grammar based templating tools. If we do this for ourselves, it can be really powerful. If we realise this is being done to us (and spot tells that suggest we are being presented with machine generated texts), it can give us a line of defence (paranoia again?! ;-).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
