Fragment — Using LLMs and Multimodal Models for OCR

Researching some storynotes on the Luddite times via contemporary newspaper reports accessed via the British Newspaper Archive, a huge chunk of my time is spent tidying up the extracted text.

Some of the scans are pretty ropey and difficult to read, particularly for the Leeds Intelligencer, where the bleed is significant:

So how well can something like ChatGPT do the task?

A comment from @simonw suggests that the ChatGPT 4o model can do OCR directly: 4o can do “OCR” directly using the model, but if it decides to use Code Interpreter it uses pytesseract instead, which isn’t as good. You can prompt it to NOT use Python and then tell it to use Python after it has extracted the text.

But I’m not sure I want that quite yet (I’m not that trusting). I’d rather see an iterative pass over the original, going from whatever can be extracted to the repaired text.

Let’s see what ChatGPT does with a simple prompt that hints at some of the preprocessing I think is necessary:

In the course of a debate on the State of the Nation, in-

Mr. E. Addington assured the House, that his Noble Relation (Lord Sidmouth) had never known nor heard of Coates, until the beginning of January last, and though Nichols, named in a pamphlet which had been mentioned, had called it the office of the Secretary of State, he had never been employed by Government. As the Hon. and Learned Member who opened the debate, had made allusions to an individual whom he considered to be greatly slandered and calumniated, he meant Mr. OLIVER. (Hear, hear!) — he felt it his duty as an act of justice, to say a few words in his behalf. Within a fortnight last, two Magistrates in the county of York, who had been led to entertain sentiments by no means favorable to that person, requested that an examination concerning him might be entered into at the Secretary of State’s Office. The examination accordingly took place, of which these Magistrates attended; and the result was, a Declaration from them BOTH, that the charges urged against Mr. Oliver were UTTERLY UNFOUNDED. It was about the beginning of April that Mr. Oliver first came to the Office of the Secretary of State; and in a manner essential to secure prepared to make a voluntary disclosure of what he knew relative to the proceedings of certain disaffected persons. This was done without any stipulation or hint of a reward, and no reward or remuneration except his bare expenses (Hear, hear!) had been given to him. About the fourteenth of May, he set off for Nottingham and Liverpool, and the final riots in these places was designed to take place on the 8th or 10th of June. Mr. Oliver was instructed to communicate with six or seven Magistrates, three only of whom were known to the Secretary of State.

This refined text preserves the essence and context of the original document

The output also provides a link to the Python preprocessing step that ChatGPT applied to the image:


from PIL import Image, ImageOps, ImageFilter
import pytesseract

# Load the image
image_path = '/mnt/data/image.png'
image = Image.open(image_path)

# Convert to grayscale
gray_image = ImageOps.grayscale(image)

# Apply filters to reduce bleed-through
enhanced_image = gray_image.filter(ImageFilter.MedianFilter(size=3))

# Further enhancement to clean the image
enhanced_image = enhanced_image.point(lambda x: 0 if x < 128 else 255, '1')

# Use pytesseract to extract text from the enhanced image
extracted_text = pytesseract.image_to_string(enhanced_image)

extracted_text

That’s a step I can take and try at home…

Can ChatGPT show me the cleaned up image it produced?

One of the issues I have with the LLM-based extraction is the sense that the model wants to “do more”, and perhaps summarise or interpret the extracted text, or otherwise “improve” its language:

This refined text preserves the essence and context of the original document

But that’s not what I want…

PROMPT:> Please extract the text as literally as you can, then repair it without summarising or rephrasing. I want a text as close to the original as possible.

I can help manually refine the text from the previous extraction, ensuring it is as close to the original as possible without summarizing or rephrasing. Here’s a literal extraction followed by a repair:

In the comet o fuption, deate on the State of the Nation, in- 0penued ty Mr, pachot. Mr, h, Adldington assured the Houne, that his Koble Relation (Lard Gidumouth) had never known nor hoard of Coates, until the beginning of Jnuary test, and though Hichole, nemed in a pamphiet which had bain mentiened, had called it the office of the Berretary of State, he had never been employed by Govern- ment. As the Hon. and Learned Mamher who opcned the debate, had made elluoions to an individual whom be considered to be greatly slandered and calumnlated, he meant Mr OLIVER. ( Hear, hear! )—Jone felt it his duty as an act of justice, to say a few words in his behalf. Within a fortnight least, two Magis- trates in the county of York, who bad been led to entertain sentiments by no means favourable to that person, re- quested that an examination concerning him might be entered into at the Becretary of State’s Office. The examination ac- cordingly took place, of which these Magiatrates attended ; and the result waa, a Declaration from them BOTH, that the charges urged against Mr, Oliver were UTTERLY UNFOUNDED— It was about the beginning of April that Mr, Oliver first came to the Office of the Secretary of Btate ; and in a manner essential to secure prepared to make a voluntary disclosure of what he knew relative to the proceedings of certain disaffected persons— This was done without any stipulation or hint of a reward, and no reward or remuneration except his bare expenses ( Hear, hear! ) had been given to him, About the fourteenth of May, he set off for Nottingham and Liverpool, and the final riots in those places was designed to take place on the 8th or 10th of June, Mr. Oliver was instructed to communicate with six or seven Magistrates, three only of whom were known to the Go- vernment.

Repaired Text:

In the course of a debate on the State of the Nation, in- quired by Mr. [unreadable]. Mr. E. Addington assured the House, that his Noble Relation (Lord Sidmouth) had never known nor heard of Coates, until the beginning of January last, and though Nichols, named in a pamphlet which had been mentioned, had called it the office of the Secretary of State, he had never been employed by Govern- ment. As the Hon. and Learned Member who opened the debate, had made allusions to an individual whom he considered to be greatly slandered and calumniated, he meant Mr. OLIVER. (Hear, hear!)—he felt it his duty as an act of justice, to say a few words in his behalf. Within a fortnight last, two Magis- trates in the county of York, who had been led to entertain sentiments by no means favourable to that person, re- quested that an examination concerning him might be entered into at the Secretary of State’s Office. The examination ac- cordingly took place, of which these Magistrates attended; and the result was, a Declaration from them BOTH, that the charges urged against Mr. Oliver were UTTERLY UNFOUNDED— It was about the beginning of April that Mr. Oliver first came to the Office of the Secretary of State; and in a manner essential to secure, prepared to make a voluntary disclosure of what he knew relative to the proceedings of certain disaffected persons— This was done without any stipulation or hint of a reward, and no reward or remuneration except his bare expenses (Hear, hear!) had been given to him. About the fourteenth of May, he set off for Nottingham and Liverpool, and the final riots in those places were designed to take place on the 8th or 10th of June. Mr. Oliver was instructed to communicate with six or seven Magistrates, three only of whom were known to the Go- vernment.

My next step would have been to ask it to compare the cleaned up text with the pytesseract extracted text, and also compare the cleaned up text with the text it originally extracted. (I’m not sure if it can do a diff view; previously when I’ve prompted that sort of request, it generated a list of the individual changes it claimed to have made.)

However, as I’m on the free plan:

I’ve also experimented with using Bing CoPilot for OCR, but it seems really keen on “improving” the text, which is useless from a transcription point of view (and something to be very wary of if using LLMs for this sort of task.)

That said, is using ChatGPT etc. overkill? I have been exploring using pytesseract and a local model for repairing the text, and that may be all I need. So more on that in a follow-up post at some point…
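
The comparison step mentioned above is also something that could be done locally anyway, without any LLM in the loop. A minimal sketch using Python’s difflib (the file names are just placeholders):

# Compare the raw pytesseract output with a repaired transcript, line by line.
import difflib

raw_text = open("pytesseract_output.txt").read()          # placeholder file names
repaired_text = open("repaired_transcript.txt").read()

diff = difflib.unified_diff(
    raw_text.splitlines(),
    repaired_text.splitlines(),
    fromfile="pytesseract",
    tofile="repaired",
    lineterm="",
)
print("\n".join(diff))

# difflib.HtmlDiff().make_file(raw_text.splitlines(), repaired_text.splitlines())
# gives a side-by-side HTML diff, which is closer to the "diff view" idea.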

PS there’s also the potential for doing that in the browser, eg using tesseract-wasm (related: my In-browser WASM powered OCR Word Add-In) and an in-browser model (eg phi-3 looks like it works with onnx runtime web (microsoft/Phi-3-mini-4k-instruct-onnx-web); and phi-3-vision comes in at only 4.2B parameters, which is fine for local use).

PPS here’s a quick demo of using a Phi3 Vision 128K Instruct model (official demo here):

OLIVER AND CASTLEX.
HOUSE OF COMMONS-FID.4Y.
In the course of a very short debate on the State of the Nation, in-
trudged by Mr. Dudley
Mr. H. Addington secured the House, that his noble Relation
(Lord Falknath) had never known not heard of Costion, until
the beginning of January and, and thoughichols, tuned in a
pamphlet which had been mentioned, had called at the office of
the Secretary of State, he had never been employed by Govern-
ment. As the Hon. and Learned Member who opted the
detate, had made allusion to an individual whom he considered
to be greatly admired and endorsed, he sent Mr. OLIVER.
(Near, here?/he (fits it his duty on an act of Justice, to say a
four words in his behalf. Within a fortnight back, two may re-
spectable Magistrates in the county of York, who had been led to
entertain suspicions by no means fanciful to that purse, re-
quested that an examination concerning his might be carried
into at the Secretary of State’s Office. The examination was
conducted the next day, at which Mr. Addington, who had
the real name, a Districting from then BOTH, that the charge
cryed out against Mr. Dudley were UTTERLY UTTERED.
It was about the beginning of April that Mr. Oliver first came to
the Office of the Secretary of State, and in a neat respectful
manner proposed to make a voluntary declaration of what he
knew relative to the proceedings of certain disaffected persons.
This was done without my intervention or hint of a reward, and
so received or promised except his due expression (here,
here, and here) and his pains to time. About the beginning of May,
he was at the Westminster and Liverpool, and the instant riding
to these quarters was expected to take place the 16th or 18th of
June. Mr. Oliver was instructed to communicate with six or
seven Magistrates, those only of whom were known to the Govern-
ment of State, and he was ordered to return on the 5th of June, in
order that admissions and precautionary measures might be taken
before that election.
Government. He put to be immediately drawn, that he could not
venture to be absolutely certain, that he had not been
Rem

This is something it would be easy to set up an eval (evaluation) pipeline for: some sample scans of articles of the sort I am interested in, and some gold master transcripts of them.
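
For example, a simple character-level similarity score against a gold master transcript only needs a few lines of standard library Python (the file names here are placeholders):

# Score OCR transcripts against a hand-checked "gold master" transcript.
# SequenceMatcher.ratio() gives a similarity between 0 and 1; (1 - ratio)
# is a rough stand-in for a character error rate.
from difflib import SequenceMatcher

def ocr_similarity(candidate: str, gold: str) -> float:
    return SequenceMatcher(None, candidate, gold).ratio()

gold = open("gold/leeds_intelligencer_article.txt").read()   # placeholder paths
for name in ["pytesseract.txt", "chatgpt.txt", "phi3_vision.txt"]:
    candidate = open(f"candidates/{name}").read()
    print(name, round(ocr_similarity(candidate, gold), 3))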

Maybe that’s worth doing, given I’d like to spend even more time rooting through the British Newspaper Archive and 19th century scanned texts on archive.org.

So… are there any off-the-shelf evaluation tools for rating OCR pipelines? This looks like overkill — Yuliang-Liu/MultimodalOCR / OCRBench?!

See also: my OCR tools bookmarks.

From print debugging to logger, with added TTS

Via my feeds yesterday, I spotted a post entitled Say Goodbye to Print Statements: How to Use Logger for Effective Debugging which suggests using the Python logging package rather than print statements for casual debugging.

Hmmm…

So I wondered whether this might be a handy thing to suggest to students.

It’s a little bit of faff to set up a logger, but if we want students to use this sort of thing we need to minimise the boilerplate, and try to resist the temptation to pre-load packages into the student Python environment*.

* The reason being we want students to be able to do things elsewhere, and if we load too much stuff behind the scenes, then when they move to a new environment they may wonder why things don’t work the same way as they seemed to work before.

Repo for the package I cobbled together here: https://github.com/innovationOUtside/ou-logger-py

One good reason for using a logger is that you can set a logging level, and then include logging (“print”) statements that only produce an output at or above that logging level:

logger.setLevel(LEVEL)
# LEVEL: DEBUG, INFO, WARNING, ERROR, CRITICAL

logger.error('This is an error message')
logger.warning('This is a warning message')

Thinks: it would be nice to have logging groups as well, so maybe something like:

logger.setGroups(["ingest", "analysis"])

logger.error('This is an ingest error message', "ingest")

I guess the way to do that is just to have different loggers (logger1, logger2, etc)? Related issue here.
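
For what it’s worth, the standard library’s dotted logger names already give you something group-like for free; a minimal sketch (this is plain logging, not the ou-logger package):

import logging

logging.basicConfig(level=logging.DEBUG)

# Child loggers share the root handler but can have their own levels,
# so "groups" can be silenced or enabled independently.
ingest_logger = logging.getLogger("ou.ingest")
analysis_logger = logging.getLogger("ou.analysis")

ingest_logger.setLevel(logging.ERROR)     # only ingest errors get through
analysis_logger.setLevel(logging.DEBUG)   # chatty analysis logging

ingest_logger.error("This is an ingest error message")
ingest_logger.warning("This ingest warning is suppressed")
analysis_logger.debug("This is an analysis debug message")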

Anyway. Basic student use…

STEP 1. A simple package that lets you just get going.

from ou_logger import logger

But then I wondered… how about if the logger can also speak the message aloud? We’re working in Jupyter notebooks, so we can get access to the browser text-to-speech module.

So, STEP 2:

# Via ChatGPT
import logging

from IPython.display import display, Javascript

# Define a custom logging handler that speaks the log messages using browser TTS
class BrowserTTSHandler(logging.Handler):
    def emit(self, record):
        msg = self.format(record)
        # Use JavaScript to speak the message in the browser
        display(
            Javascript(f'speechSynthesis.speak(new SpeechSynthesisUtterance("{msg}"))')
        )
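
Presumably the handler then just gets registered with the logger in the usual way; something like this (a guess at the wiring rather than the actual ou_logger code, and the logger name is just for illustration):

# Attach the TTS handler to a logger alongside (or instead of) the usual
# text handler.
logger = logging.getLogger("ou_logger")
tts_handler = BrowserTTSHandler()
tts_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(tts_handler)

logger.error("This is a spoken error message")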

So now with a from ou_logger import logger, set_handler we can say:

set_handler("text")
set_handler("text, tts")
set_handler("tts")

If we have the "tts" handler set, the message will be spoken aloud by the browser speechSynthesis module.

When I tested this, the JupyterLab environment I was using also had my jupyterlab_cell_status_extension installed with the audible cell completion alert and error message TTS enabled which made for quite an audible experience…

Imagining “Google AdPrompts”…

Watching the Google I/O keynote, and I’m bored witless by the “creative” tools it offers, and noting that the Google Search I no longer use will soon have “generated answers”.

I also wonder when we’ll see a “Google AdPrompts” service that:

  • parses a user query/prompt;
  • sells it to bidders;
  • lets bidders add an “advertiser prompt” that gets mixed in with the system prompt and user prompt.

Every day is a day closer to retiring and a day closer to the day where I don’t feel I need to track any of this sh*t any more.

PS noting, throughout the keynote, the repeated use of “agentive”. Boll*cks. Complete and utter boll*cks.

From Assessment Questions to Prompts, and Back Again. And what counts as cheating anyway?

Writing a good (effective?) assessment question can be hard — it needs to be unambiguous, it needs to be able to assess something of what a student might be expected to have learned by studying a particular course, it needs to not give the answer but does need to be structured enough that it prompts (prompts…) an answer that is markable according to a marking guide (particularly in cases where you have large numbers of markers who are ideally marking in a consistent way), and so on. In many cases, there may be different components that are credit bearing. For example, in a numerical question, the “final answer” may get a mark, but showing the working may also gain marks; in an essay question, the central argument may gain credit, but you might also get negative marks for poor spelling, or poor grammar.

It’s not surprising, then, that once you have a question or question format that works well, it often makes sense to recycle it. Depending on the question structure, there are various ways in which you can mint new questions from old ones. In numerical questions, you might change some of the numbers, but the working remains largely the same. As a corollary, you might consider “personalised” questions where each student is given a variant of the same question that tests them equally (typically, by requiring the same working) but generates a different “final answer”. For example, and trivially: what is the sum of squares for 2 and 7, 6 and 3, 4 and 5, etc.? Making this non-commutative, eg “the difference of squares”, provides even more final answer variants for a limited range of variable numbers, but requires more careful phrasing of the question (“find the difference of squares, x**2-y**2, for x= and y=”, etc.; but then, if you aren’t careful, you are much of the way there to giving the working part of the answer). An essay style question that requires students to include “examples from the last twelve months” is evergreen in terms of requiring contemporary matters to be included. If I were a scholar, I’d probably try to include a load of references here…
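
As a trivial sketch of that sort of parameterised question minting (purely illustrative):

# Mint personalised "difference of squares" question variants: each student
# gets different numbers, the same working, and a different final answer.
import random

def difference_of_squares_question(seed):
    rng = random.Random(seed)       # seed on, say, a student ID for reproducibility
    x, y = rng.sample(range(2, 13), 2)
    question = f"Find the difference of squares, x**2 - y**2, for x={x} and y={y}."
    answer = x**2 - y**2
    return question, answer

for student_id in (101, 102, 103):
    print(difference_of_squares_question(student_id))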

One of the reasons for having to create new questions, of course, is to minimise the chances that students can recycle answers. But now, it seems, there is also the additional problem of using generative AI tools to answer assessment questions. A naive take on this is to consider an assessment to be “genAI safe” if it somehow defends against students using a simple zero shot “cut and paste” strategy to paste a question in, get an answer out, and use that as their assessment answer. But that’s a really simplistic way of using genAI. Anyone who has spent any time at all in a genAI conversation knows that there are a few simple tricks you can adopt to improve a prompt. For example, OpenAI guidance suggests the following:

Challenging or critiquing the response and forcing the genAI to review or modify its answer as part of an iterative, “conversational” approach can also dramatically improve an answer. Compare this to an open book assessment, where students are expected to work on their own solutions but may also be encouraged to form study groups to discuss ways in which a particular question might be addressed, without actually sharing their own answers.

Cutting and pasting in a question, then copying the answer out might be classed as “cheating”, but if a student works with the genAI model to iterate on the answer, under the direction of the student, is that still cheating? If the student uses their understanding of the question to add guiding context in the original prompt (a description of how to approach the answering of the question, as the student understands it) is that cheating, compared to copying and pasting generic guidance about how to approach an assessment question from a study skills site?

From one of my must-read blogs, commentator on matters publishing Ian Mulvany posted some musings over the weekend on the matter of data to paper, and some reflections on LLMs, noting that:

What strikes me right now about both of these projects is that there is still a very high level of effort required to get something that begins to be marginally useful. Too much effort to be radically disruptive, but certainly a level of disruption is available that was not available before. 

Which is to say: power tools for power users who put the work in when using them.

But perhaps even more interesting were a couple of example prompts that Ian linked to. For example, this prompt to review a legal question in the context of the draft AI act. The first part of the prompt reads very much like the framing of an assessment question. The prompt then provides a couple of few-shot examples of the form the answer might take, akin to a sample answer in a generic marking guide for a question type where a student is expected to consider a legal question in the context of a particular regulation. The second example Ian gives also embeds prompt elements that are very assessment-question-like.

So I wonder:

  • what are the similarities and differences between a good assessment question and a good prompt, e.g. in leading the co-respondent (the student in the case of an assessment question, a genAI/LLM in the case of a prompt) to provide an answer that is “correct” and of a form the presenter (of the question or prompt) expects?

I also wonder:

  • to what extent should a good prompt include “marking guide” constructs such as “a good answer might be expected to take the following form…”?
  • if a student adds the “a good answer might be expected to take the following form…” element to a prompt as additional context around a pasted in question, has the student committed the same sort of academic offence as if they just pasted in the question, then copied the generated answer?
  • if a student iterates on a generated answer by prompting the genAI to address concerns raised by the student about the answer generated to date, has that student committed the same sort of academic offence as a student using a one-shot, single step prompt approach to using LLMs to generate answers?

Information held vs “information” generated

Noting that noyb.eu (“none of your business”; info rights org) has a complaint in, in Austria, against OpenAI — ChatGPT provides false information about people, and OpenAI can’t correct it.

So I wonder: if you make a DPA subject access request for everything a company holds about you in its databases, it can look that data up.

But what if you ask the company what it “believes” about you, e.g. if you search any of its information systems, and assuming that genAI models are classed as “information systems” (if not, what are they classed as?).

If the model has been trained on personal information, then that information has influenced the model weights, and the model may return elements of that information through a statistical process. If you ask a model what it “knows” about a particular person, then could you argue that what it returns is what that organisation believes about you, and is therefore subject to personal information subject access requests? If the process is a statistical one, how can it return with any degree of confidence what that “information” is? And how can it correct it?

How responsible is a company for any generated statements a model may make about you if a model operated by the company:

  • has been trained from the ground up on the company’s data?
  • is the result of a third party model having been further trained or fine-tuned on the company’s data?
  • is purely a third party model?

If a company operates a retrieval augmented generation (RAG) process where your (actual) data is “interpreted” and returned after processing through a model as generated text, is that generated text what the company believes about you?

If a marketing company has a database that puts me into a particular labeled demographic group, can I request what those groups are and what those labels are?

If someone in a company looks me up and a conversational AI labels me based on its training and based on my data (eg provided via a RAG mechanism), am I allowed to request what that information was, eg via chat logs? But what about the next time someone in the company asks a corporate chatUI about me: what will it say then? Can I ask for everything the model may say about me, ever, along with probabilities for each response?!

What is the status, in GDPR terms, of “generated information” (is it even information)? How does “generated information” relate to information (or data) “held” about me?

For a person using a user interface, how is “generated information” distinguished from retrieved information? In each case, to what extent might that information be said to be what the company “believes” about me?

PS see also Appropriate/ing Knowledge and Belief Tools?

Simple Playwright Tests for Jupyter Environments

In passing, I’ve been revising a couple of JupyterLab extensions I’ve cobbled together in the past and started looking at producing some simple Playwright tests for them.

For years now, there has been a testing helper framework called Galata, now bundled as part of JupyterLab, although the docs are scarce, the examples few, and things you might expect to be there are lacking. (There are also seemingly obvious signals lacking in the core JupyterLab codebase — you can find out when a notebook code cell is queued to run, stops running, and whether it has successfully or unsuccessfully run, but not find out when it actually starts running — but that’s another story.)

Anyway… I wasted a whole chunk of time trying to use Galata when it would probably have been easier just writing simple, native Playwright tests, and then a chunk of time more trying to use Galata in a minimal demo of hooking up a node.js Playwright Docker container to a container running a JupyterLab environment so I can run tests in the Playwright container with the testing framework completely isolated from the Jupyter container.

Minimally, I used a simple Docker Compose script (run using docker-compose up) to create the two containers (I used a node Playwright container, put into a holding position) and networked them together. A shared folder on the host containing the tests was mounted into the playwright container (I could also have mounted some test notebooks from another directory into the Jupyter container).

# ./docker-compose.yml
networks:
  playwright-test-network:
    driver: bridge
 
services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.43.0-jammy
    stdin_open: true
    tty: true
    # Perhaps tighten up things by requiring the container
    # to be tested to actually start running:
    #depends_on:
      #- tm351
    volumes:
      - ./:/home/pwuser
    networks:
      - playwright-test-network

  tm351:
    image: mmh352/tm351:23j
    networks:
      - playwright-test-network

A playwright.config.js file of the form:

// ./playwright.config.js

module.exports = {
  use: {
    // Browser options
    headless: true,

  },
};

and an example test file showing how to log in to the auth’ed Jupyter environment:

// tests/demo-login.spec.ts

import { test, expect } from "@playwright/test";

test("form submission test", async ({ page }) => {
  // If we run this from the docker-compose.yml entrypoint,
  // we need to give the other container time to start up
  // and publish the Jupyter landing page.
  // await page.waitForTimeout(10000);
  // Does playwright have a function that polls a URL a
  // few times every so often for a specified time or
  // number of poll attempts?
  // That would be a more effective way of waiting.

  // Navigate to the webpage
  await page.goto("http://tm351:8888/login?next=%2Flab");

  // Enter a value into the form
  await page.fill("#password_input", "TM351-23J");
  await page.screenshot({ path: "screenshot1.png" });

  // Click the button
  await page.click("#login_submit");

  // Wait for response or changes on the page
  await page.waitForResponse((response) => response.status() === 200);

  await page.waitForTimeout(10000);

  // Take a screenshot
  await page.screenshot({ path: "screenshot2.png" });
});

Inside the Playwright container, we can run npm install @jupyterlab/galata to install the Galata test framework and use galata test functions and helpers. However, the Galata test package just seemed to mess things up for me and not work, whereas the plain Playwright test package did just work.

After looking up the name of the Playwright container (docker ps), I logged into it (docker exec -it playtest-playwright-1 bash; by default this was as root, but we can make that explicit by adding -u root to the exec command) and then manually ran the tests: npx playwright test. Ctrl-C to stop the containers, and docker-compose rm to delete them.

We can run the tests directly from the docker-compose.yml:

networks:
  playwright-test-network:
    driver: bridge
 

services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.43.0-jammy
    #stdin_open: true
    #tty: true
    entrypoint: ["npx","playwright","test"]
    working_dir: /home/pwuser
    volumes:
      - ./:/home/pwuser
    networks:
      - playwright-test-network

  tm351:
    image: mmh352/tm351:23j
    networks:
      - playwright-test-network

Ideally, we’d just want to start up both containers, run the tests, then shut them down, but docker-compose doesn’t seem to offer that sort of facility (does kubernetes?).

In passing, I also note that there is a pre-built Docker container that uses the Python Playwright API (mcr.microsoft.com/playwright/python:v1.42.0-jammy), which creates opportunities for using something like playwright-pytest. I also note that we could inject and run Python code into notebooks (or include Python scripts in pre-written notebooks) to control the JupyterLab UI using ipylab. The ipylab package lets you script and manipulate the behaviour of the JupyterLab UI from a notebook code cell. This creates the interesting possibility of having notebook based Python scripts control the JupyterLab UI, with their execution controlled by the test script (either a Python test script or a Typescript test script).
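
For example, a rough sketch of the same login test written against the pytest-playwright page fixture (the JupyterLab selector is my guess at a suitable “it loaded” signal, and is untested in this setup):

# tests/test_demo_login.py: run with `pytest` inside the Python Playwright container.
# The container hostname and password match the docker-compose example above.
from playwright.sync_api import Page, expect

def test_jupyter_login(page: Page):
    page.goto("http://tm351:8888/login?next=%2Flab")
    page.fill("#password_input", "TM351-23J")
    page.click("#login_submit")
    # Wait for the JupyterLab shell to appear (selector is an assumption).
    expect(page.locator("#jp-main-dock-panel")).to_be_visible(timeout=60000)
    page.screenshot(path="screenshot-py.png")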

And related to pytest testing, via (who else but?!) @simonw here, I note inline-snapshot, a rather handy tool that lets you specify a test against a thing, and if the thing doesn’t exist, it will grab a gold-master copy of the thing for you. Playwright does this natively for screenshots, for example the first time a screenshot comparison test is run as part of its visual comparisons offering.

Automatically generating SQL equivalents to dataframe method chains

Back to the “genAI is not necessarily the best alternative” thing again, this time in the context of generating SQL code that performs a particular sort of query. Natural language may be a starting point for this, but code might also provide the entry point, as for example if you have prototyped something you know works using in-memory pandas or R dataframes, and you now want to move that over to a set of operations performed on corresponding relational tables inside a SQL database.

And if you have code that works, why would you want a statistical genAI process to have a guess at some code that might work for you, rather than using a deterministic process and mechanically translating something that works into something that is logically equivalent (even if it’s not the most idiomatic or optimal equivalent)?

In the R world, dbplyr is a handy utility that will convert your dplyr code (transformations applied over a dataframe using dplyr methods) to equivalent, though possibly not optimal or idiomatic, SQL code. The dbplyr package works with several different db backends, including MariaDB, PostgreSQL and SQLite, so presumably any dialect differences are accommodated if you provide an appropriate connection or db description?

From the docs:

# lazily generates query
summary <- mtcars2 %>% 
  group_by(cyl) %>% 
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>% 
  arrange(desc(mpg))

# see query
summary %>% show_query()
#> <SQL>
#> SELECT `cyl`, AVG(`mpg`) AS `mpg`
#> FROM `mtcars`
#> GROUP BY `cyl`
#> ORDER BY `mpg` DESC

If required, you can also explicitly inject SQL into a dplyr expression (example; I don’t think the code will also try to execute this on the R side, eg by dumping the query into SQLite…).

On the question of producing appropriate SQL dialects, I note the SQLGlot Python package, a “no-dependency SQL parser, transpiler, optimizer, and engine” that “can be used to format SQL or translate between 21 different dialects”. As far as the optimiser goes, I assume this means it could accept suboptimal SQL code generated using dbplyr and then return something more efficient?
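
A minimal sketch of what that might look like, using the dbplyr output from above as the input query (the dialect names and schema hint are my choices rather than anything from the SQLGlot docs):

# Translate the backtick-quoted (MySQL-style) query generated by dbplyr
# into PostgreSQL's quoting conventions.
import sqlglot

query = "SELECT `cyl`, AVG(`mpg`) AS `mpg` FROM `mtcars` GROUP BY `cyl` ORDER BY `mpg` DESC"

print(sqlglot.transpile(query, read="mysql", write="postgres")[0])

# The optimizer can also be pointed at a parsed query, given a schema hint:
# from sqlglot.optimizer import optimize
# print(optimize(sqlglot.parse_one(query, read="mysql"),
#                schema={"mtcars": {"cyl": "int", "mpg": "double"}}).sql(dialect="postgres"))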

Whilst the Python pandas package is happy to connect to a database backend, and can read and write tables from and to a connected database, it doesn’t try to generate equivalent SQL queries for a chained set of pandas dataframe manipulating methods.
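
That is, the database round-trip is easy enough, but the chained methods themselves run in memory; a minimal sketch:

# Tables move in and out of the database easily enough, but the groupby /
# aggregation below is executed by pandas in memory, not pushed down as SQL.
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
pd.DataFrame({"cyl": [4, 6, 8, 4], "mpg": [30.0, 20.0, 15.0, 28.0]}).to_sql(
    "mtcars", con, index=False
)

df = pd.read_sql("SELECT * FROM mtcars", con)
summary = (
    df.groupby("cyl", as_index=False)["mpg"]
    .mean()
    .sort_values("mpg", ascending=False)
)
print(summary)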

Although pandas was the original widely used dataframe package on the Python block, several other alternatives have appeared over the years, with improved efficiency and with a syntax that resembles the pandas syntax, although it may not match it exactly. If you know the various operations that SQL supports, you’ll have a fair idea of what verbs are available for manipulating dataframes in any dataframe package; so if you’ve worked with dplyr, or pandas, dask (built with parallelism in mind), or polars (a Python package powered by Rust), there’s a good chance you’ll be able to make sense of any of the others.

At least one of the pandas alternative dataframes-in-Python packages does seem to have given the “conversion-to-SQL” thing some serious consideration: ibis (repo). From the docs:

con = ibis.connect("duckdb://")

t = con.read_parquet("penguins.parquet")

g = t.group_by(["species", "island"]).agg(count=t.count()).order_by("count")
ibis.to_sql(g)

which gives:

SELECT
  `t1`.`species`,
  `t1`.`island`,
  `t1`.`count`
FROM (
  SELECT
    `t0`.`species`,
    `t0`.`island`,
    COUNT(*) AS `count`
  FROM `ibis_read_parquet_t2ab23nqsnfydeuy5zpg4yg2im` AS `t0`
  GROUP BY
    1,
    2
) AS `t1`
ORDER BY
  `t1`.`count` ASC NULLS LAST

Actually that looks a bit cryptic and shorthand (t0, t1, 1, 2?) and could perhaps have been made to look a bit more friendly?

There’s also the rather nifty option of combining method chains with SQL code:

sql = """
SELECT
  species,
  island,
  COUNT(*) AS count
FROM penguins
GROUP BY species, island
""".strip()

If you sometimes think in pandas and sometimes in SQL, then this sort of approach might be really handy… I note that the DuckDB docs give it a nod, possibly because it uses DuckDB as the default backend? But I don’t have much sense about the trajectory it’s on in terms of development (first, second, and/or third party), adoption, and support.

The ibis machinery is built using Substrait, a logical framework for supporting interoperability across different structured data processing platforms. I wonder if that sort of conceptualisation might be a useful framing in an educational context? Here’s their primer document. The repo shows some level of ongoing activity and engagement, but I’m not sure where the buy-in comes from or how committed it is. In terms of process, it seems the project is looking to get its foundations right, as Apache Arrow did; and that really seems to have taken off as a low level columnar in-memory and inter-process communication / messaging serialisation format with excellent support for things like parquet file reading and writing.

PS hmm… so I wonder…. might we expect something like a substrait sql2pandas utility? A pandoc for sql2sql, sql2dfFramework, dfFramework2sql and dfFramework2dfFramework conversions? Maybe I need to keep tabs on this PyPi package: substrait (repo). I note other packages in the substrait-io organisation for other languages…

GenAI Outputs as “Almost Information” from Second-hand Secondary Information Sources?

Checking my feeds, AI-hype continues, but I get the sense there are also other signs of “yes, but…” and “well, actually…” starting to appear. In the RAG (retrieval augmented generation) space, I’m also noticing more items relating to getting your documentary source data better organised, which in part seems to boil down to improving the retrieval part of the problem (retrieving appropriate documents / information based on a particular information request), and in part to improving that step as a search problem (addressing the needs of an information user seeking particular information for a particular reason).

A lot of genAI applications, not just those that are conversational front-ends to RAG pipelines, seem to have the flavour of “interpreted search” systems: a user has an information need, makes an informational request to the system, and the system provides an informative response (ideally!) in a natural language (often conversational) way, albeit with possible added noise (bias, hallucination) at each step. Loosely related: GenAI in Edu and The Chinese Room.

From the archive: interesting to note the remarks I made on my first encounter with OpenAI GPT3-beta in July 2021: Appropriate/ing Knowledge and Belief Tools?

In passing, we might note different flavours of what might be termed AI in a RAG pipeline. The retrieval step is often a semantic search step where a search query is encoded as a vector (an “embedding”) and this is then compared with the embedding vectors of documents in the corpus. Ones that are close in the embedding vector space are then returned. This could be classed as one form of AI. At a lower level, tokenisation of a text (identifying the atomic “words” that are mapped into the embedding space) might be seen as another “AI” function; trivially, how do you cope with punctuation that abuts a word, do you stem words, etc.; but then there are also mappings at a higher level, such as named entities (“Doo Dah Whatsit Company Ltd.”, “The Sisters of Mercy”) being recognised, potentially disambiguated (e.g. “The Sisters of Mercy (band)” rather than “Sisters of Mercy (religious order)”), and then represented as a single token. On the output side is the generative AI part, which creates a stream of tokens (and then decodes them) in response to an input prompt.
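
As a toy illustration of the retrieval step, assuming the documents and the query have already been mapped to unit-length embedding vectors by some embedding model:

# Nearest-neighbour retrieval over pre-computed embeddings: cosine similarity
# reduces to a dot product if the vectors are unit length. The vectors here
# are random stand-ins for real embedding model output.
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(3, 384))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# Fake a query that happens to sit close to document 1 in the vector space.
query_vec = doc_vecs[1] + 0.1 * rng.normal(size=384)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec            # cosine similarities
print(np.argsort(scores)[::-1])          # document indices, best match first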

Perhaps it’s a sign of my growing discontent with AI code demos that do things in a less efficient and potentially more error-prone way than “traditional” ways (hmm… “traditional computing”… but then, I know several folk who would class that as assembler, Lisp and C), but a couple of posts this morning also really wound me up: R and Python Together: A Second Case Study Using Langchain’s LLM Tools, and its precursor, Using R And Python Together, Seamlessly: A Case Study Using Openai’s Gpt Models.

At this point, I should probably point out that that sort of demo is typical of the sort of mash-up approach I’d have used back in the day when I was looking for no- and low-code ways of grabbing information from the web and then having a play with it. But being older and more jaded, and increasingly wary of anything that appears on the web in terms of my ability to: a) trust it, and b) reliably retrieve it, I am starting to reconsider what the value of doing that sort of play in public actually is. For me, the play was related to seeing what I could do with tools and information that were out there, the assumption being that if I could hack something together over a coffee break: a) it was quite likely that someone was building a version to do a similar thing “properly”; b) the low barrier to entry meant that lots of people could tinker with similar ideas at a similarly low opportunity cost to themselves, which meant the ideas or tools might end up being adopted at scale, driven by early interest and adoption. [I note a couple of notable misses: Google Reader / Feedburner; Yahoo Pipes; and a thing I liked but no-one else did: the OPML feed widget and in-browser browser that was Grazr, and my own StringLE (“String’n’glue Learning Environment”) hack. (Thinking about it, JupyterLab is not such a dissimilar sort of thing to either of those in terms of how I think it can be used as an integration environment; it’s just that it is, and always has been, such a hostile environment to work with…)] The tinkering also allowed me to just lift up the mat or the bed-clothes or whatever the metaphor is, to have a peek at how bad things could get if the play was escalated. Though I typically didn’t write up my “Oh, so that means we can do this really evil thing…” thoughts, I just grokked them and filed them away, or noted them in passing, as “we are so f****d”. Maybe that was a mistake and I should have been more of a lighthouse against danger.

Anyway. The R and Py thing. Doing a useful job comparing the ability of OpenAI to answer a fact-y thing, and using some “mechanical” R and Py hacks to scrape film data from Wikipedia info-boxes. As I said, exactly the sort of “chain of hacks” tinkering I used to do, but that was from a perspective of “folk have data ‘locked up’ in raw form but exposed on the web”; as an amateur, I can extract that data from the website, albeit maybe with the odd issue from ropey scraping; I can then make some sort of observation, caveated with “subject to only having some of the data, perhaps with added noise”, but hopefully giving an answer in the order-of-magnitude ball-park of correct-ish and interesting-or-not-ness.

But today, it wound me up. Scraping a Wikipedia infobox, the author notes, having grabbed the content, that cleaning it requires some code of the “# I have no idea how this works # I just got it online” form. Again, the sort of thing I have done in the past, and still do, but in this case I think several things. Firstly, there is possibly a better way of getting the infobox than scraping on the basis of CSS selectors (getting a Wikipedia page as something closer to a (semi-)structured XML or JSON doc, for example). Secondly, can we properly retrieve rather more structured data from a related data source? For example: a) what cryptic Wikidata codes might I query on, and b) what cryptic SPARQL incantation should I use to pull the data from Wikidata or DBPedia? (I note the Wikidata examples include examples of querying for awards related data, which is pertinent to one of the original demo’s questions.) Thirdly, would downloading the IMDb database make more sense, and then using SQL to query that data?

Now I know this requires arguably more “formal” skills than some hacky scraping and third-party regex-ing, but… But. Much of the data is available in a structured form, either via an open API, or as a bulk data download. Admittedly, the latter requires some handling to put it into a form you can start to query, but tools like datasette and DuckDB are making it much easier to get data from flat files into a form you can use SQL to query.

I have to admit I’m a bit wary of the direction of travel, and level, of LLM support that is being introduced into datasette. I would rather write my own f**ked up queries over the data and suffer the results than let an AI SQL generator do it in the background and then wonder and worry about what sort of data my new intern has managed to pick up based on asking the folk in the public bar what sort of query they should run, even if some of the folk in the bar are pretty good coders, or used to be, back in the day when the syntax was different.

Trying to pin down my concerns, they are these: why scrape data from a page (with the risk that entails) when you can request the page in a more structured form? Why scrape data when you can make a more structured request on a data source that feeds the web page you were scraping, and receive the data in a structured way? Why use a generic data source when you can use a domain specific datasource, with the added authority that follows from that? In part, this boils down to: why use a secondary source when you can use a primary one?

The whole AI and genAI thing is potentially making the barrier entry lower to hacking your own information tools together. When I used to create hacky mashup pipelines, I made the decision at each step of what to hook-up to what, and how. And with each decision was an awareness of the problems that might arise as a consequence. For example, from knowing that when Y consumed from X, the output from X was a bit flaky, and when Z consumed from Y, the interface/conversion was a bit ropey, so the end result was this but subject to loads of caveats I knew about. There was a lot of white-box stuff in the pipeline that I had put together.

And at the output stage, where a genAI takes information in (that may be correct and appropriate) and then re-presents it, there are two broad classes of concern: how was that information retrieved (and what risks etc. are associated with that), and what sort of reinterpretation has been applied in the parsing of that information and its re-presentation in the provided response?

The foundational models are building up layers of language from whatever crap they found on the web or in people’s chat conversations (which makes me think of anthrax (not the band)), and, increasingly, whatever AI generated training and test sets are being used to develop the latest iterations of them (which always makes me think of BSE / mad cow disease).

PS per chance, I also stumbled across another old post today — Translate to Google Statistical (“Google Standard”?!) English? — where I closed with a comment “keep an eye on translate.google.com to see when the English to English translation ceases to be a direct copy”.

Here’s what happens now – note the prompt on the left includes errors and I’m asked in that context if I want to correct them.

If I say no — that is, translate the version with typos — do the typos get fixed by the translation?

Hmm.. do we also seem to have some predictive translation starting to appear?

In-browser WASM powered OCR Word Add-In

One of the longstanding off-the-shelf models used for OCR — optical character recognition — is provided in the form of Tesseract. It started life over thirty years ago and gets updated every so often. It’s also available as an in-browser model in the form of tesseract.js, a Javascript wrapper around a WASM implementation of the tesseract engine.

At some point last week, Simon Willison posted a single page web app [code, about, demo] that combines tesseract.js and pdf.js, a Javascript package for parsing and rendering PDF docs in the browser, to provide a simple image and PDF to text service.

A single page web app…

So I wondered if I could follow on from the previous two posts and just cut and paste the code and CSS and run it as an MS Word add-in:

And pleasingly, it did “just work”.

So whilst you’re in Microsoft Office, you can drop an image or PDF into the sidebar, and get the text out, which can then be edited, copied and pasted into the Word doc.

The next level of integration would be to click to paste the text into the Word doc. Another obvious next step is to grab an image out of the Word doc, OCR it, and paste the text back into the doc. More elaborately, I wonder if there are plugins for the pdf/image viewer that would let you select particular areas of the image for processing, or otherwise process the image, before running the OCR? For example, the photon WASM app [repo] seems to provide really powerful image manipulation features that all run in the browser.

I did wonder about whether the in-browser LLM chat demo based on the Google tflite/Mediapipe demo would also “just work” but either I messed something up, or something doesn’t quite work in the (Edge?) browser that runs the Word add-in. With an LLM in the sidebar, we should be able to run a local model for basic summarisation or document based Q & A.

Something else I want to try is audio to text as a self-contained add-in, using Whisper.cpp WASM.

I’m guessing there are already Microsoft Office add-ins that do a lot of this already, but I’m more interested in what I can build myself: a) using off-the-shelf code and models, b) running locally, in the sidebar (browser); c) to create things that might even just be one-shot, disposable apps to help me do a particular thing, or scratch a particular itch, when working in a particular document.

More DIY MS Word Add-ins — Python and R code execution using Pyodide and WebR

Picking up on the pattern I used in DIY Microsoft Office Add-Ins – Postgres WASM in the Task Pane, it’s easy enough to do the same thing for Python code execution using pyodide WASM and R code using WebR WASM.

As in the postgres/SQL demo, this lets you select code in the Word doc, then edit and execute it in the task pane. If you modify the code in the task pane, you can use it to replace the highlighted text in the Word doc. (I haven’t yet looked to see what else the Word API supports…) The result of the code execution can be pasted back to the end of the Word doc.

The Pyodide and WebR environments persist between code execution steps, so you can build up state. I added a button to reset the state of the Pyodide environment back to an initial state, but haven’t done that for the WebR environment yet.

I’m not sure how useful this is? It’s a scratchpad thing as much as anything, for lightly checking whether the code fragments in a Word doc are valid and run as expected. It can be used to bring back the results of the code execution into the Word doc, which may be useful. The coupling is tighter than if you are copying and pasting code and results to/from a code editor, but it’s still weaker than the integration you get from a reproducible document type such as a Jupyter notebook or an executable MyST markdown doc.

One thing that might be interesting to explore is whether I can style the code in the Word doc, then extract and run the code from the Task Pane to check it works, maybe even checking the output against some sort of particularly styled output in the Word doc. But again, that feels a bit clunky compared to authoring in a notebook or MyST and then generating a Word doc, or whatever format doc, with actual code execution generating the reported outputs etc.

Here’s the code:

<!-- Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License. -->
<!-- This file shows how to design a first-run page that provides a welcome screen to the user about the features of the add-in. -->

<!DOCTYPE html>
<html>

<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Contoso Task Pane Add-in</title>

    <!-- Office JavaScript API -->
    <script src="https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js"></script>

    <!-- Load pyodide -->
    <script src="https://cdn.jsdelivr.net/pyodide/v0.25.0/full/pyodide.js"></script>
    <script src="./pyodide.js"></script>
    <!-- For more information on Fluent UI, visit https://developer.microsoft.com/fluentui#/. -->
    <link rel="stylesheet" href="https://static2.sharepointonline.com/files/fabric/office-ui-fabric-core/11.0.0/css/fabric.min.css"/>

    <!-- Template styles -->
    <link href="taskpane.css" rel="stylesheet" type="text/css" />
</head>

<body class="ms-font-m ms-welcome ms-Fabric">
    <header class="ms-welcome__header ms-bgColor-neutralLighter">
        <h1 class="ms-font-su">pyodide & WebR demo</h1>
    </header>
    <section id="sideload-msg" class="ms-welcome__main">
        <h2 class="ms-font-xl">Please <a target="_blank" href="https://learn.microsoft.com/office/dev/add-ins/testing/test-debug-office-add-ins#sideload-an-office-add-in-for-testing">sideload</a> your add-in to see app body.</h2>
    </section>
    <main id="app-body" class="ms-welcome__main" style="display: none;">
        <h2 class="ms-font-xl"> Pyodide & WebR demo </h2>
        <div>Execute Python or R code using Pyodide and WebR WASM powered code execution environments.</div>
        <textarea id="query" rows="4" cols="30"></textarea>
    <div><button id="getsel">Get Selection</button><button id="execute-py">Execute Py</button><button id="execute-r">Execute R</button><button id="exepaste">Paste Result</button><button id="replacesel">Replace selection</button><button id="reset">Reset Py</button></div>
     <div id="output-type"></div>
    <div id="output"></div>

    </main>
</body>

</html>

import { WebR } from "https://webr.r-wasm.org/latest/webr.mjs";

window.addEventListener("DOMContentLoaded", async function () {
  const buttonpy = /** @type {HTMLButtonElement} */ (document.getElementById("execute-py"));
  const buttonr = /** @type {HTMLButtonElement} */ (document.getElementById("execute-r"));
 
  let pyodide = await loadPyodide();

  const webR = new WebR();
  await webR.init();

  const resetpybutton = /** @type {HTMLButtonElement} */ (document.getElementById("reset"));
  resetpybutton.addEventListener("click", async function () {
    pyodide = await loadPyodide();
  });

  // Execute py on button click.
  buttonpy.addEventListener("click", async function () {
    buttonpy.disabled = true;

    // Get the code from the editor textarea.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    //const timestamp = document.getElementById("timestamp");
    //timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    document.getElementById("output-type").innerHTML = "Executing Py code...";
    try {
      const queries = document.getElementById("query").value;
      let output_txt = pyodide.runPython(queries);
      output.innerHTML = output_txt;
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
      //timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      buttonpy.disabled = false;
      document.getElementById("output-type").innerHTML = "Py code result:";
    }
  });

  // Execute R on button click.
  buttonr.addEventListener("click", async function () {
    buttonr.disabled = true;

    // Get the code from the editor textarea.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    //const timestamp = document.getElementById("timestamp");
    //timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    document.getElementById("output-type").innerHTML = "Executing R code...";
    try {
      const queries = document.getElementById("query").value;
      let output_r = await webR.evalR(queries);
      let output_json = await output_r.toJs();
      output.innerHTML = JSON.stringify(output_json);
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
        document.getElementById("output-type").innerHTML = "R code result:";
      //timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      buttonr.disabled = false;
    }
  });

});

/*
 * Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

/* global document, Office, Word */

Office.onReady((info) => {
  if (info.host === Office.HostType.Word) {
    document.getElementById("sideload-msg").style.display = "none";
    document.getElementById("app-body").style.display = "flex";
    document.getElementById("exepaste").onclick = exePaste;
    document.getElementById("getsel").onclick = getSel;
    document.getElementById("replacesel").onclick = replaceSel;
  }
});

async function getSel() {
  await Word.run(async (context) => {
    // Get code from selection
    const selected = context.document.getSelection();
    selected.load("text");
    await context.sync();
    document.getElementById("query").value = selected.text;
  });
}

async function replaceSel() {
  await Word.run(async (context) => {
    // Replace selected code
    const selected = context.document.getSelection();
    const replace_text = document.getElementById("query").value;
    selected.insertText(replace_text, Word.InsertLocation.replace);
    await context.sync();
  });
}

async function exePaste() {
  await Word.run(async (context) => {
    var output = document.getElementById("output").innerHTML;
  const docBody = context.document.body;
  docBody.insertParagraph(
    output,
    Word.InsertLocation.end
  );
    await context.sync();
  });
}