Fragment — Using LLMs and Multimodal Models for OCR

Researching some storynotes on the Ludding times via contemporary newspaper reports accessed through the British Newspaper Archive, a huge chunk of my time is spent tidying up the extracted text.

Some of the scans are pretty ropey and difficult to read, particularly for the Leeds Intelligencer, where the bleed is significant.

So how well can something like ChatGPT do the task?

A comment from @simonw suggests that ChatGPT's 4o model can do OCR directly: 4o can do “OCR” directly using the model, but if it decides to use Code Interpreter it uses pytesseract instead which isn’t as good. You can prompt it to NOT use Python and then tell it to use Python after it has extracted the text.

But I’m not sure I want that quite yet (I’m not that trusting). I’d rather see an iterative pass over the original, going from whatever can be extracted to the repaired text.

Let’s see what ChatGPT does with a simple prompt that hints at some of the preprocessing I think is necessary:

In the course of a debate on the State of the Nation, in-

Mr. E. Addington assured the House, that his Noble Relation (Lord Sidmouth) had never known nor heard of Coates, until the beginning of January last, and though Nichols, named in a pamphlet which had been mentioned, had called it the office of the Secretary of State, he had never been employed by Government. As the Hon. and Learned Member who opened the debate, had made allusions to an individual whom he considered to be greatly slandered and calumniated, he meant Mr. OLIVER. (Hear, hear!) — he felt it his duty as an act of justice, to say a few words in his behalf. Within a fortnight last, two Magistrates in the county of York, who had been led to entertain sentiments by no means favorable to that person, requested that an examination concerning him might be entered into at the Secretary of State’s Office. The examination accordingly took place, of which these Magistrates attended; and the result was, a Declaration from them BOTH, that the charges urged against Mr. Oliver were UTTERLY UNFOUNDED. It was about the beginning of April that Mr. Oliver first came to the Office of the Secretary of State; and in a manner essential to secure prepared to make a voluntary disclosure of what he knew relative to the proceedings of certain disaffected persons. This was done without any stipulation or hint of a reward, and no reward or remuneration except his bare expenses (Hear, hear!) had been given to him. About the fourteenth of May, he set off for Nottingham and Liverpool, and the final riots in these places was designed to take place on the 8th or 10th of June. Mr. Oliver was instructed to communicate with six or seven Magistrates, three only of whom were known to the Secretary of State.

This refined text preserves the essence and context of the original document

The output also provides a link to the Python preprocessing step that ChatGPT applied to the image:


from PIL import Image, ImageOps, ImageFilter
import pytesseract

# Load the image
image_path = '/mnt/data/image.png'
image = Image.open(image_path)

# Convert to grayscale
gray_image = ImageOps.grayscale(image)

# Apply filters to reduce bleed-through
enhanced_image = gray_image.filter(ImageFilter.MedianFilter(size=3))

# Further enhancement to clean the image
enhanced_image = enhanced_image.point(lambda x: 0 if x < 128 else 255, '1')

# Use pytesseract to extract text from the enhanced image
extracted_text = pytesseract.image_to_string(enhanced_image)

extracted_text

That’s a step I can take and try at home…

Can ChatGPT show me the cleaned up image it produced?

One of the issues I have with the LLM-based extraction is the sense that the model wants to “do more”, and perhaps summarise or interpret the extracted text, or improve it in languaging terms:

This refined text preserves the essence and context of the original document

But that’s not what I want…

PROMPT:> Please extract the text as literally as you can, then repair it without summarising or rephrasing. I want a text as close to the original as possible.

I can help manually refine the text from the previous extraction, ensuring it is as close to the original as possible without summarizing or rephrasing. Here’s a literal extraction followed by a repair:

In the comet o fuption, deate on the State of the Nation, in- 0penued ty Mr, pachot. Mr, h, Adldington assured the Houne, that his Koble Relation (Lard Gidumouth) had never known nor hoard of Coates, until the beginning of Jnuary test, and though Hichole, nemed in a pamphiet which had bain mentiened, had called it the office of the Berretary of State, he had never been employed by Govern- ment. As the Hon. and Learned Mamher who opcned the debate, had made elluoions to an individual whom be considered to be greatly slandered and calumnlated, he meant Mr OLIVER. ( Hear, hear! )—Jone felt it his duty as an act of justice, to say a few words in his behalf. Within a fortnight least, two Magis- trates in the county of York, who bad been led to entertain sentiments by no means favourable to that person, re- quested that an examination concerning him might be entered into at the Becretary of State’s Office. The examination ac- cordingly took place, of which these Magiatrates attended ; and the result waa, a Declaration from them BOTH, that the charges urged against Mr, Oliver were UTTERLY UNFOUNDED— It was about the beginning of April that Mr, Oliver first came to the Office of the Secretary of Btate ; and in a manner essential to secure prepared to make a voluntary disclosure of what he knew relative to the proceedings of certain disaffected persons— This was done without any stipulation or hint of a reward, and no reward or remuneration except his bare expenses ( Hear, hear! ) had been given to him, About the fourteenth of May, he set off for Nottingham and Liverpool, and the final riots in those places was designed to take place on the 8th or 10th of June, Mr. Oliver was instructed to communicate with six or seven Magistrates, three only of whom were known to the Go- vernment.

Repaired Text:

In the course of a debate on the State of the Nation, in- quired by Mr. [unreadable]. Mr. E. Addington assured the House, that his Noble Relation (Lord Sidmouth) had never known nor heard of Coates, until the beginning of January last, and though Nichols, named in a pamphlet which had been mentioned, had called it the office of the Secretary of State, he had never been employed by Govern- ment. As the Hon. and Learned Member who opened the debate, had made allusions to an individual whom he considered to be greatly slandered and calumniated, he meant Mr. OLIVER. (Hear, hear!)—he felt it his duty as an act of justice, to say a few words in his behalf. Within a fortnight last, two Magis- trates in the county of York, who had been led to entertain sentiments by no means favourable to that person, re- quested that an examination concerning him might be entered into at the Secretary of State’s Office. The examination ac- cordingly took place, of which these Magistrates attended; and the result was, a Declaration from them BOTH, that the charges urged against Mr. Oliver were UTTERLY UNFOUNDED— It was about the beginning of April that Mr. Oliver first came to the Office of the Secretary of State; and in a manner essential to secure, prepared to make a voluntary disclosure of what he knew relative to the proceedings of certain disaffected persons— This was done without any stipulation or hint of a reward, and no reward or remuneration except his bare expenses (Hear, hear!) had been given to him. About the fourteenth of May, he set off for Nottingham and Liverpool, and the final riots in those places were designed to take place on the 8th or 10th of June. Mr. Oliver was instructed to communicate with six or seven Magistrates, three only of whom were known to the Go- vernment.
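The repaired text still carries the line-break hyphens from the newspaper column ("Govern- ment", "Magis- trates"). That bit of tidying doesn't need an LLM. Here's a minimal sketch using a regular expression; the function name `rejoin_hyphenated` is my own, and the rule is an assumption that any hyphen followed by whitespace and a lowercase letter is a line-break hyphen, so it will also wrongly fuse a genuine compound that happened to break across a line.

```python
import re

# A word character, a hyphen, some whitespace (a space or a line break),
# then a lowercase continuation, e.g. "Govern- ment" or "Magis-\ntrates"
LINE_BREAK_HYPHEN = re.compile(r"(\w)-\s+([a-z])")


def rejoin_hyphenated(text: str) -> str:
    """Join words that were split across a line break in the original column."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)


sample = "he had never been employed by Govern- ment. Two Magis-\ntrates attended."
print(rejoin_hyphenated(sample))
```

Hyphens with no whitespace after them (as in "bleed-through") are left alone.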

My next step would have been to ask it to compare the cleaned up text with the pytesseract extracted text, and also compare the cleaned up text with the text it originally extracted. (I’m not sure if it can do a diff view; previously when I’ve prompted that sort of request, it generated a list of the individual changes it claimed to have made.)
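A comparison like that doesn't have to come from the model, of course. As a rough sketch (this is my own code, not something ChatGPT produced), Python's standard-library `difflib` can list the word-level changes between two versions of a text. The function name `word_changes` is made up for illustration, and the sample strings are a fragment taken from the literal extraction and the repaired text above.

```python
import difflib


def word_changes(before: str, after: str) -> list[tuple[str, str, str]]:
    """List the word-level edits that turn `before` into `after`.

    Each item is (operation, old words, new words), where the operation
    is 'replace', 'delete' or 'insert'.
    """
    old_words = before.split()
    new_words = after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only the spans that actually differ
            changes.append((op, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


raw = "his Koble Relation (Lard Gidumouth) had never known nor hoard of Coates"
repaired = "his Noble Relation (Lord Sidmouth) had never known nor heard of Coates"

for op, old, new in word_changes(raw, repaired):
    print(f"{op}: {old!r} -> {new!r}")
```

Because the changes are computed from the two texts rather than reported by the model, the list shows what actually changed, not what the model claims to have changed.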

However, as I’m on the free plan:

I’ve also experimented with using Bing Copilot for OCR, but it seems really keen on “improving” the text, which is useless from a transcription point of view (and something to be very wary of if using LLMs for this sort of task).

That said, is using ChatGPT etc. overkill? I have been exploring using pytesseract and a local model for repairing the text and that may be all I need. So more on that in a follow-up post at some point…

PS there’s also the potential for doing that in the browser, eg using tesseract-wasm (related: my In-browser WASM powered OCR Word Add-In) and an in-browser model (eg phi-3 looks like it works with onnx runtime web (microsoft/Phi-3-mini-4k-instruct-onnx-web); and phi-3-vision comes in at only 4.2B parameters, which is fine for local use).

PPS here’s a quick demo of using a Phi3 Vision 128K Instruct model (official demo here):

OLIVER AND CASTLEX.
HOUSE OF COMMONS-FID.4Y.
In the course of a very short debate on the State of the Nation, in-
trudged by Mr. Dudley
Mr. H. Addington secured the House, that his noble Relation
(Lord Falknath) had never known not heard of Costion, until
the beginning of January and, and thoughichols, tuned in a
pamphlet which had been mentioned, had called at the office of
the Secretary of State, he had never been employed by Govern-
ment. As the Hon. and Learned Member who opted the
detate, had made allusion to an individual whom he considered
to be greatly admired and endorsed, he sent Mr. OLIVER.
(Near, here?/he (fits it his duty on an act of Justice, to say a
four words in his behalf. Within a fortnight back, two may re-
spectable Magistrates in the county of York, who had been led to
entertain suspicions by no means fanciful to that purse, re-
quested that an examination concerning his might be carried
into at the Secretary of State’s Office. The examination was
conducted the next day, at which Mr. Addington, who had
the real name, a Districting from then BOTH, that the charge
cryed out against Mr. Dudley were UTTERLY UTTERED.
It was about the beginning of April that Mr. Oliver first came to
the Office of the Secretary of State, and in a neat respectful
manner proposed to make a voluntary declaration of what he
knew relative to the proceedings of certain disaffected persons.
This was done without my intervention or hint of a reward, and
so received or promised except his due expression (here,
here, and here) and his pains to time. About the beginning of May,
he was at the Westminster and Liverpool, and the instant riding
to these quarters was expected to take place the 16th or 18th of
June. Mr. Oliver was instructed to communicate with six or
seven Magistrates, those only of whom were known to the Govern-
ment of State, and he was ordered to return on the 5th of June, in
order that admissions and precautionary measures might be taken
before that election.
Government. He put to be immediately drawn, that he could not
venture to be absolutely certain, that he had not been
Rem

This is something that it would be easy to set up an eval (evaluation) pipeline for, with some sample scans of articles of the sort I am interested in and some gold master transcripts of them.
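The core of such a pipeline could be as simple as a character error rate (CER): the edit distance between the OCR output and the gold transcript, divided by the length of the gold transcript. Here's a minimal sketch in plain Python; the function names, the pipeline labels and the two sample strings are all placeholders of mine (the "gold" string is just the repaired spelling from earlier standing in for a real hand-checked transcript).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free if they match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, gold_text: str) -> float:
    """Edit distance divided by the length of the gold transcript."""
    # Collapse runs of whitespace so line-break differences are not counted as errors
    ocr = " ".join(ocr_text.split())
    gold = " ".join(gold_text.split())
    if not gold:
        raise ValueError("gold transcript is empty")
    return edit_distance(ocr, gold) / len(gold)


# Placeholder data: in practice, load each pipeline's output and the gold transcripts from files
outputs = {"pipeline_a": "Lard Gidumouth", "pipeline_b": "Lord Sidmouth"}
gold = "Lord Sidmouth"

for name, text in outputs.items():
    print(f"{name}: CER = {character_error_rate(text, gold):.3f}")
```

A lower score is better, and a score of zero means the output matches the gold transcript exactly (ignoring whitespace differences). A word-level version of the same calculation would be the obvious companion measure.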

Maybe that’s worth doing, given I’d like to spend even more time rooting through the British Newspaper Archive and 19th century scanned texts on archive.org.

So… are there any off-the-shelf evaluation tools for rating OCR pipelines? This looks like overkill — Yuliang-Liu/MultimodalOCR / OCRBench?!

See also: my OCR tools bookmarks.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
