OCR and Generative AI: what could possibly go wrong?

OCR — optical character recognition — describes the process of converting images of text, whether typed or handwritten, into digital text documents.

The original text, as written down, is in some sense fixed. A mark was made to denote a particular letter a, or a particular numeral 5, or to indicate a particular punctuation mark. In some cases, the text may contain typographical errors, such as misspelled or duplicated words; in others, an obviously incorrect word may have been included: homophones (soundalike words) or antonyms (the opposite of the meaning that was perhaps intended), for example.

Many of the approaches used for OCR have been classed as “AI” over the years (AI is a moveable thing; one classic definition of it is “Almost Invented”, in the sense of “almost working”). These include techniques for segmenting, or separating out, individual characters (easy with many print typefaces, more difficult in handwritten manuscript) and then using a variety of techniques to recognise each character: simple rule-based approaches, for example, or neural network based classifiers. In some documents, such as newspapers, additional steps may be required prior to the text recognition: isolating text in columns, identifying breaks between articles, and so on.

In many cases of scanned print materials, the image quality may be poor: uneven illumination, misaligned text, and worn characters from the days of metal type all add to the difficulty of accurately re-presenting a printed text document as a digital text document, rather than just a digital image.

From working with lots of scanned documents recently, retrieved in the main from the Internet Archive/archive.org (19th century books) and the British Newspaper Archive (newspapers from the 1850s), I have noticed some common errors:

  • numerals are often not reported, or are incorrectly returned, sometimes as other digits, at other times as letters;
  • italicised (which is to say, emphasised, or particularly stressed) words are often incorrectly represented;
  • certain letters in certain contexts are regularly misreported (tbe rather than the, whicb rather than which, and so on);
  • proper names are often garbled or otherwise incorrectly returned;
  • short words, and in particular the definite and indefinite articles (the, a, an), are often missing in the digital text;
  • the adverb not is often omitted.

And so on.

What this means is that if we are searching over OCR’d documents for known items (proper names, particular amounts, emphasised words, negated terms), we may find we miss a lot of the documents that contain the words we are looking for.

It also means that if you are using things like automated sentiment analysis over such texts, the extracted sentiment may be exactly wrong… (He was not a nice man.)

If these documents are being used to train an LLM, and the LLM is being used to “generatively recall” a response to some stated question or as a completion to a partial sentence, the recall may be based on incorrect statistics, with the result that the generated text contains an incorrect name, a hallucinated amount, a statement expressed in positive rather than negative (not) terms, and so on.

Where an OCR model returns a value for a particular character, or word, it may do so with a particular level of confidence: that character is definitely an e; that character may be an S, or an s, or a 5; and so on. It is not hard to then imagine OCR system architectures that attempt to improve certainty, for example by checking uncertain words against dictionary terms, or, indeed, checking all words against dictionary terms, and then assessing the likelihood that a particular suggested dictionary term is in fact the intended word (this rather than thi5, for example).
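By way of a minimal sketch of that sort of dictionary check (the confusion table and the tiny word list here are illustrative; in practice the confidence scores that flag a word as suspect would come from the OCR engine itself):

```python
import difflib
import string

# Common OCR confusions: characters engines regularly swap.
# This table is illustrative, not exhaustive.
CONFUSIONS = str.maketrans({"5": "s", "0": "o", "1": "l", "8": "b"})

def suggest_correction(word, dictionary, cutoff=0.8):
    """Return the dictionary word most likely intended by `word`.

    A suspect token such as 'thi5' is first normalised through the
    confusion table ('this') and checked against the dictionary; failing
    that, the closest dictionary match above `cutoff` is proposed.
    """
    stripped = word.strip(string.punctuation)
    if stripped.lower() in dictionary:
        return word  # already a known word; leave it alone
    swapped = stripped.translate(CONFUSIONS)
    if swapped.lower() in dictionary:
        return swapped
    matches = difflib.get_close_matches(
        stripped.lower(), list(dictionary), n=1, cutoff=cutoff
    )
    return matches[0] if matches else word

dictionary = {"this", "the", "which", "rather", "than"}
print(suggest_correction("thi5", dictionary))   # -> this
print(suggest_correction("whicb", dictionary))  # -> which
```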

The Internet Archive is home to a great many out-of-copyright digitally scanned books and journals from the 19th century. In many cases, multiple different scans, and the OCR’d texts generated from them, are available for the same work (different scans of the same edition of a particular book, for example). I keep meaning to build a tool for myself that will attempt to align such texts and improve the quality of the extracted text by comparing them: returning consensus words where they agree, dictionary words where they differ but one of the variants is in a dictionary, and so on. BUT I am increasingly wary that if one of the texts has been genAI corrected, it won’t be a fair representation of the original text that was scanned.
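For what it’s worth, a minimal sketch of the sort of consensus step I have in mind, using Python’s difflib to align two OCR’d versions word by word (the example sentences and the tiny dictionary are made up for illustration):

```python
import difflib

def consensus(text_a, text_b, dictionary):
    """Merge two OCR'd versions of the same passage, word by word.

    Where the versions agree, keep the shared word; where they differ,
    prefer whichever variant is in the dictionary; otherwise keep both,
    flagged for manual review.
    """
    a, b = text_a.split(), text_b.split()
    merged = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op == "equal":
            merged.extend(a[i1:i2])  # both scans agree
        else:
            # The scans disagree; arbitrate word by word. (Naively: spans
            # of unequal length are truncated to the shorter one here.)
            for wa, wb in zip(a[i1:i2], b[j1:j2]):
                if wa.lower() in dictionary:
                    merged.append(wa)
                elif wb.lower() in dictionary:
                    merged.append(wb)
                else:
                    merged.append(f"[{wa}|{wb}]")  # flag for review
    return " ".join(merged)

dictionary = {"he", "was", "not", "a", "nice", "man"}
print(consensus("He was uot a nice man", "He was not a ni3e man", dictionary))
# -> He was not a nice man
```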

One of my current interests is in texts relating to a certain Reverend Carus Wilson, resident on the Isle of Wight in the 1850s, author of an inflammatory tract criticising Newport as a place of depravity, and, incidentally, founder of a school that gained a certain notoriety by way of association with an establishment described in a certain book by a certain Currer Bell. (In passing, my search terms within the British Newspaper Archive variously include Carms Wilson, Carvus Wilson, Carnus Wilsov and so on.) As one of the approaches to tidying up the poor OCR of many of the newspaper texts, and replacing corrupted words with accurate transcriptions of the words as printed, I have built up a dictionary of terms — proper names, particular spellings — that are accepted by the spell checker.

On my to-do list is a tool for “normalising” particular words, or names, or phrases, to a canonical term using just such a dictionary (for example, “correcting” all manifestations of the name Carus Wilson, and variants thereof, to a consistent form).
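A minimal sketch of that normalisation step, assuming a hand-built mapping from observed variant spellings to a canonical form (the variants listed are the sort of garbled renderings that turn up in my own searches):

```python
import re

# Hand-curated mapping from observed OCR variants to a canonical form.
CANONICAL = {
    "carus wilson": "Carus Wilson",
    "carms wilson": "Carus Wilson",
    "carvus wilson": "Carus Wilson",
    "carnus wilsov": "Carus Wilson",
}

# One alternation pattern covering every known variant, longest first
# so that longer variants are not shadowed by shorter ones.
VARIANT_RE = re.compile(
    "|".join(sorted((re.escape(v) for v in CANONICAL), key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def normalise(text):
    """Replace every known variant with its canonical form."""
    return VARIANT_RE.sub(lambda m: CANONICAL[m.group(0).lower()], text)

print(normalise("The Rev. Carvus Wilson preached; Carms Wilson then left."))
# -> The Rev. Carus Wilson preached; Carus Wilson then left.
```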

However, there are risks to using that approach from a research / digital humanities point of view…

In a previous dive into the archives (when exploring the legend, or otherwise, of the sin eater), where I was trying to trace the history of a very particular paragraph of text through various articles, I needed the EXACT transcription of the text, typos and all, because one of the lineages of the paragraph through various sources included a very particular, and preserved, typographical error. In this case, autocorrected texts, which are perhaps easier to read in a narrative that includes quoted texts, would have broken the research line.

But back to genAI.

Many multimodal LLMs are now starting to offer OCR/text extraction as part of the feature set for working with images. One of the things that concerns me is the way that these models may be using language statistics to generate the text they extract. Text with a low degree of confidence may be extracted from an image and passed through an LLM to “correct” it; the corrected text is then perhaps compared back against the original to see if it looks plausible, and, if it does, the corrected text is used. This is fine until it isn’t. This is fine unless you want an exact and accurate copy of the original text, typos and all. This is fine unless you are happy with the numerical values stated in a particular table being hallucinated back as completely different values in the generated text. This is fine if you are happy with dates, names and addresses being changed. This is fine if you are happy with original statements and facts being creatively reinterpreted back at you. This is fine if you want Google-style statistical English phrases rather than carefully recorded phonetic vernacular transcriptions of orally collected texts. This is fine if you want what wasn’t written down in the first place.
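To make the worry concrete, here is a purely hypothetical sketch of such a pipeline: the llm_correct function stands in for whatever model a vendor might call (it is not a real API), and the plausibility check is a simple character-level similarity.

```python
import difflib

def llm_correct(raw_ocr_text):
    """Stand-in for an LLM 'correction' pass (hypothetical, not a real API).

    Imagine a model that rewrites low-confidence OCR output into fluent
    text; here it fixes the l/1 confusion but also 'regularises' the
    amount as it goes.
    """
    return raw_ocr_text.replace("£l,500", "£1,000")

def plausible(original, corrected, threshold=0.9):
    """Accept a correction if it is 'close enough' to the original.

    A character-level similarity ratio cannot tell a repaired typo from
    a silently altered amount: both are small edits.
    """
    return difflib.SequenceMatcher(a=original, b=corrected).ratio() >= threshold

raw = "The estate was valued at £l,500 in 1853."
corrected = llm_correct(raw)
if plausible(raw, corrected):
    print(corrected)  # -> The estate was valued at £1,000 in 1853.
```

The “correction” repairs the l/1 confusion and silently rewrites the amount, and the plausibility check waves both through.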

Author: Tony Hirst
