OUseful.Info, the blog...

Imagining “Google AdPrompts”…

Watching the Google I/O keynote, and I’m bored witless by the “creative” tools it offers, and noting that the Google Search I no longer use will soon have “generated answers”.

I also wonder when we’ll see a “Google AdPrompts” service that:

parses a user query/prompt;
sells it to bidders
lets bidders add an “advertiser prompt” that gets mixed in with the system prompt and user prompt.

Every day is a day closer to retiring and a day closer to the day where I don’t feel I need to track any of this sh*t any more.

PS noting, throughout the keynote, the repeated use of “agentive”. Boll*cks. Complete and utter boll*cks.

From Assessment Questions to Prompts, and Back Again. And what counts as cheating anyway?

Writing a good (effective?) assessment question can be hard — it needs to be unambiguous, it needs to be able to assess something of what a student might be expected to have learned by studying a particular course, it needs to not give the answer but does need to be structured enough that it prompts (prompts…) an answer that is markable according to a marking guide (particularly in cases where you have large numbers of markers who are ideally marking in a consistent way), and so on. In many cases, there may be different components that are credit bearing. For example, in a numerical question, the “final answer” may get a mark, but showing the working may also gain marks; in an essay question, the central argument may gain credit, but you might also get negative marks for poor spelling, or poor grammar.

It’s not surprising, then, that once you have a question or question format that works well, it often makes sense to recycle it. Depending on the question structure, there are various ways in which you can mint new questions from old ones. In numerical questions, you might change some of the numbers, but the working remains largely the same (as a corollary, you might consider “personalised” questions where each student is given a variant of the same question that tests them equally (typically, requires the same working), but generates a different “final answer”. For example, and trivially, what is the sum of squares for: 2 and 7, 6 and 3, 4 and 5, etc etc. Making this non-commutative, eg “the difference of squares”, provides even more final answer variants for a limited range of variable numbers, but requires more careful phrasing of the question, (find the difference of squares, x**2-y**2 for x= and y= etc. But then, if you aren’t careful, you are much of the way there to giving the working part of the answer. ). An essay style question that requires students to includes “examples from the last twelve months” is evergreen in terms of requiring contemporary matters to be included. If I were a scholar, I’d probably try to include a load of references here…

One of the reasons for having to create new questions, of course, is to minimise the chances that students can recycle answers. But now, it seems, there is also the additional problem of using generative AI tools to answer assessment questions. A naive take on this is to consider an assessment to be “genAI safe” if it somehow defends against students using a simple zero shot “cut and paste” strategy to paste a question in, get an answer out, and use that as their assessment answer. But that’s a really simplistic way of using genAI. Anyone who has spent any time at all in a genAI conversation knows that there a few simple tricks you can adopt to improve a prompt. For example, OpenAI guidance suggests the following

Challenging or critiquing the response and forcing the genAI to review or modify its answer as part of an iterative, “conversational” approach can also dramatically improve an answer. Compare this to an open book assessment, where students are expected to work on their own solutions but may also be encouraged to form study groups to discuss ways in which a particular question might be addressed, without actually sharing their own answers.

Cutting and pasting in a question, then copying the answer out might be classed as “cheating”, but if a student works with the genAI model to iterate on the answer, under the direction of the student, is that still cheating? If the student uses their understanding of the question to add guiding context in the original prompt (a description of how to approach the answering of the question, as the student understands it) is that cheating, compared to copying and pasting generic guidance about how to approach an assessment question from a study skills site?

From one of my must read blogs, commentator on matters publishing Ian Mulvany posted some musings over the weekend on the matter of data to paper, and some reflections on LLMs, noting that:

What strikes me right now about both of these projects is that there is still a very high level of effort required to get something that begins to be marginally useful. Too much effort to be radically disruptive, but certainly a level of disruption is available that was not available before.

Which is to say: power tools for power users who put the work in when using them.

But perhaps even more interesting were a couple of example prompts that Ian linked to. For example, this prompt to review a legal qeustion in the context of the draft AI act. The first part of the prompt reads very much like the framing of an assessment question. The prompt then provides a couple of few-shot examples of the form the answer might take, akin to a sample answer in a generic marking guide for a question type where a student is expected to consider a legal question in the context of a particular regulation. The second example Ian gives also embeds prompt elements that are very assessment question like.

So I wonder:

what are the similarities and differences between a good assessment question and a good prompt, e.g. in leading the co-respondent (the student in the case of an assessment question, a genAI/LLM in the case of a prompt) to provide an answer that is “correct” and of a form the presenter (of the question or prompt) expects?

I also wonder:

to what extent should a good prompt include “marking guide” constructs such as “a good answer might be expected to take the following form…”?
if a student adds the “a good answer might be expected to take the following form…” element to a prompt as additional context around a pasted in question, has the student committed the same sort of academic offence as if they just pasted in the question, then copied the generated answer?
if a student iterates on a generated answer by prompting the genAI to address concerns raised by the student about the answer generated to date, has that student committed the same sort of academic offence as a student using a one-shot, single step prompt approach to using LLMs to generate answers?

Information held vs “information” generated

Noting that noyb.eu (“none of your business”; info rights org) has a compliant in, in Austria, against OpenAI — ChatGPT provides false information about people, and OpenAI can’t correct it.

So I wonder: if you make a DPA subject access request for everything a company holds about you in its databases, it can look that data up.

But what if you ask the company what it “believes” about you, e.g. if you search any of its information systems, and assuming that genAI models are classed as “information systems” (if not, what are they classed as?).

If the model has been trained on personal information, then that information has influenced the model weights and may return elements of that information through a statistical process. If you ask a model what it “knows” about a particular person, then could you argue that what it returns is what that organisation believes about you, and is therefore subject to personal information subject access requests? If the process is a statistical one, how can be return with any degree of confidence what that “information” is? And how can it correct it?

How responsible is a company for any generated statements a model may make about you if a model operated by the company:

has been trained from the ground up on the company’s data?
is the result of a third party model have been further trained or fine-tuned on the company’s data?
is purely a third party model?

If a company operates a retrieval augmented generation (RAG) process where your (actual) data is ‘interpreted” and returned after processing through a model as generated text, is that generated text what the company believes about you?

If a marketing company has a database that puts me into a particular labeled demographic group, can I request what those groups and what those labels are?

If someone in a company looks me up and a conversational AI labels me based on its training and based on my data (eg provided via a RAG mechanism), am I allowed to request what that information was, eg via chat logs? But what about the next time someone in the company ask a corporate chatUI about me, what will it say then? Can I ask for everything the information may say about me, ever, along with probabilities for each response?!

What is the status, in GDPR terms, of “generated information” (is it even information)? How does “generated information” relate to information (or data) “held” about me.

For a person using a user interface, how is “generated information” distinguished from retrieved information? In each case, to what extent might that information be said to be what the company “believes” about me?

Simple Playwright Tests for Jupyter Environments

In passing, I’ve been revising a couple of JupyterLab extensions I’ve cobbled together in the past and started looking at producing some simple Playwright tests for them.

For years now, there has been a testing helper framework called Galata, now bundled as part of JupyterLab, although the docs are scarce, the examples few, and things you might expect to be there are lacking. (There are also seemingly obvious signals lacking in the core JupyterLab codebase — you can find out when a notebook code cell is queued to run, stops running, and whether it has successfully or unsuccessfully run, but not find out when it actually starts running — but that’s another story.)

Anyway… I wasted a whole chunk of time trying to use Galata when it would probably have been easier just writing simple, native Playwright tests, and then a chunk of time more trying to use Galata in a minimal demo of hooking up a node.js Playwright Docker container to a container running a JupyterLab environment so I can run tests in the Playwright container with the testing framework completely isolated from the Jupyter container.

Minimally, I used a minimal Docker Compose script (run using docker-compose up) to create the two containers (I used a node Playwright container, put into a holding position) and networked them together. A shared folder on host containing the tests was mounted in into the playwright container (I could also have mounted some test notebooks from another directory into the Jupyter container).

# ./docker-compose.yml
networks:
  playwright-test-network:
    driver: bridge
 
services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.43.0-jammy
    stdin_open: true
    tty: true
    # Perhaps tighten up things by requiring the container
    # to be tested to actually start running:
    #depends_on:
      #- tm351
    volumes:
      - ./:/home/pwuser
    networks:
      - playwright-test-network

  tm351:
    image: mmh352/tm351:23j
    networks:
      - playwright-test-network

A playwright.config.js file of the form:

// ./playwright.config.js

module.exports = {
  use: {
    // Browser options
    headless: true,

  },
};

and an example test file showing how to log in to the auth’ed Jupyter environment:

// tests/demo-login.spec.ts

import { test, expect } from "@playwright/test";

test("form submission test", async ({ page }) => {
  // If we run this from the docker-compose.yml entrypoint,
  // we need to give the other container time to start up
  // and publish the Jupyter landing page.
  // await page.waitForTimeout(10000);
  // Does playwright have a function that polls a URL a
  // few times every so often for a specified time or
  // number of poll attempts?
  // That would be a more effective way of waiting.

  // Navigate to the webpage
  await page.goto("http://tm351:8888/login?next=%2Flab");

  // Enter a value into the form
  await page.fill("#password_input", "TM351-23J");
  await page.screenshot({ path: "screenshot1.png" });

  // Click the button
  await page.click("#login_submit");

  // Wait for response or changes on the page
  await page.waitForResponse((response) => response.status() === 200);

  await page.waitForTimeout(10000);

  // Take a screenshot
  await page.screenshot({ path: "screenshot2.png" });
});

Inside the Playwright container, we can run npx install @jupyterlab/galata to install the Galata test framework and use galata test functions and helpers. However, the Galata test package just seemed to mess things up for me and not work, in a way that the Playwright test package didn’t, and did just work.

After looking up the name of the Playwright container (docker ps) I logged into it (docker exec -it playtest-playwright-1 ; by default this was as root, but we can force that by adding -u root to the exec command) and then manually ran the tests: npx playwright test. Ctrl-C to stop the containers and docker-compose rm to delete them.

We can run the tests directly from the docker-compose.yml:

networks:
  playwright-test-network:
    driver: bridge
 

services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.43.0-jammy
    #stdin_open: true
    #tty: true
    entrypoint: ["npx","playwright","test"]
    working_dir: /home/pwuser
    volumes:
      - ./:/home/pwuser
    networks:
      - playwright-test-network

  tm351:
    image: mmh352/tm351:23j
    networks:
      - playwright-test-network

Ideally, we’d just want to start up both containers, run the tests, then shut them down, but docker-compose doesn’t seem to offer that sort of facility (does kubernetes?).

In passing, I also note that there is a pre-built Docker container that uses the Python Playwright API (mcr.microsoft.com/playwright/python:v1.42.0-jammy), which creates opportunities for using something like playwright-pytest. I also note that within a notebook, we could inject and run Python code into notebooks (or include Python scripts in pre-written notebooks) that control the JupyterLab UI using ipylab. The ipylab package lets you script and manipulate the behaviour of the JupyterLab UI from a notebook code cell. This creates the interesting possibility of having notebook based Python scripts control the JupyterLab UI, and whose execution is controlled by the test script (either a Python test script or a Typescript test script).

And related to pytest testing, via (who else but?!) @simonw here, I note inline-snapshot, a rather handy tool that lets you specify a test against a thing, and if the thing doesn’t exist, it will grab a gold-master copy of the thing for you. Playwright does this natively for screenshots. for example, the first time a screenshot comparison test is run as part of its visual comparisons offering.

Automatically generating SQL equivalents to dataframe method chains

Back to the “genAI is not necessarily the best alternative” thing again, this time in the context of generating SQL code that performs a particular sort of query. Natural language may be a starting point for this, but code might also provide the entry point, as for example if you have prototyped something you know that works using in-memory pandas or R dataframes, and you know want to move that over to set of operations performed on corresponding relational tables inside a SQL database.

And if you have code that works, why would you want a statistical genAI process to have a guess at some code that might work for you, rather than using a deterministic process and mechanically translating something that works into something that is logically equivalent (even if it’s not the most idiomatic or optimal equivalent)?

In the R world dbplyr is a handy utility that will convert your dplyr code (transformations applied over a dataframe using dplyr methods) to equivalent, thoug possibly not optimal or idiomatic, SQL code. The dplyr package works with several different db backends, including MariaDB, PostgreSQL and SQLite, so presumably any dialect differences are accommodated if you provide an appropriate connection or db description?

From the docs:

# lazily generates query
summary <- mtcars2 %>% 
  group_by(cyl) %>% 
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>% 
  arrange(desc(mpg))

# see query
summary %>% show_query()
#> <SQL>
#> SELECT `cyl`, AVG(`mpg`) AS `mpg`
#> FROM `mtcars`
#> GROUP BY `cyl`
#> ORDER BY `mpg` DESC

If required, you can also explicitly inject SQL into a dplyr expression (example; I don’t think the code will also try to execute this on the R side, eg by dumping the query into SQLite…).

On the question of producing appropriate SQL dialects, I note the SQLGlot Python package, a “no-dependency SQL parser, transpiler, optimizer, and engine” that “can be used to format SQL or translate between 21 different dialects“. As far as the optimiser goes, I assume this means it could accept suboptimal SQL code generated using dbplyr and then return something more efficient?

Whilst the Python pandas package is happy to connect to a database backend, and can read and write tables from and to a connected database, it doesn’t try to generate SQL equivalent queries to a chained set of pandas dataframe manipulating methods.

Although pandas was the original widely used dataframe package on the Python block, several other alternatives have appeared over the years, with improved efficiency and with a syntax that resembles the pandas syntax, although it may not match it exactly. If you know the various operations that SQL supports, you’ll have a fair idea of what verbs are available for manipulating dataframes in any dataframe package; so if you’ve worked with dplyr, or pandas, dask (built with parallelism in mind), or polars (a Python package powered from Rust), there’s a good chance you’ll be able to make sense of any of the others.

At least one of the pandas alternative dataframes-in-Python packages does seem to have given the “conversion-to-SQL” thing some serious consideration: ibis (repo). From the docs:

con = ibis.connect("duckdb://")

t = con.read_parquet("penguins.parquet")

g = t.group_by(["species", "island"]).agg(count=t.count()).order_by("count")
ibis.to_sql(g)

which gives:

SELECT
  `t1`.`species`,
  `t1`.`island`,
  `t1`.`count`
FROM (
  SELECT
    `t0`.`species`,
    `t0`.`island`,
    COUNT(*) AS `count`
  FROM `ibis_read_parquet_t2ab23nqsnfydeuy5zpg4yg2im` AS `t0`
  GROUP BY
    1,
    2
) AS `t1`
ORDER BY
  `t1`.`count` ASC NULLS LAST

Actually that looks a bit cryptic and shorthand (t0, t1, 1, 2?) and could perhaps have been made to look a bit more friendly?

There’s also the rather nifty option of combining method chains with SQL code:

sql = """
SELECT
  species,
  island,
  COUNT(*) AS count
FROM penguins
GROUP BY species, island
""".strip()

If you sometimes think in pandas and sometimes in SQL, then this sort of approach might be really handy… I note that the DuckDB docs give it a nod, possibly because it uses DuckDB as the default backend? But I don’t have much sense about the trajectory it’s on in terms of development (first, second, and/or third party), adoption, and support.

The ibis machinery is built using Substrait, a logical framework for supporting interoperability across different structured data processing platforms. I wonder if that sort of conceptualisation might be a useful framing in an educational context? Here’s their primer document. The repo show some level of ongoing activity and engagement, but I’m not sure where the buy-in comes from or how committed it is. In terms of process, it seems the project it lloking to getting its foundations right, as Apache Arrow did; and that really seems to have taken off as a low level columnar in-memory and inter-process communication messaging serialisation format with excellent support for things like parquet file-reading and writing.

PS hmm… so I wonder…. might we expect something like a substrait sql2pandas utility? A pandoc for sql2sql, sql2dfFramework, dfFramework2sql and dfFramework2dfFramework conversions? Maybe I need to keep tabs on this PyPi package: substrait (repo). I note other packages in the substrait-io organisation for other languages…

GenAI Outputs as “Almost Information” from Second-hand Secondary Information Sources?

Checking my feeds, AI-hype continues but I get the sense there are also other signs of “yes, but…” and “well, actually…” starting to appear. In the RAG (retrieval augmented generation), I’m also noticing more items relating to organising your documentary source data better organised, which in part seems to boil down to improving the retrieval part of the problem (retrieving appropriate documents / information based on a particular information request), in part to improving that step as a search problem (addressing the needs of an information user seeking particular information for a particular reason).

A lot of genAI applications, not just those that are conversational front-ends to RAG pipelines, seem to have the flavour of “interpreted search” systems: a user has an information need, makes an informational request to the system, and the systems provides an informative response (ideally!) in a natural language (often conversational way), albeit with possible added noise (bias, hallucination) at each step. Loosely related: GenAI in Edu and The Chinese Room.

From the archive: interesting to note the remarks I made on my first encounter with OpenAI GPT3-beta in July 2021: Appropriate/ing Knowledge and Belief Tools?

In passing, we might note different flavours of what might be termed AI in a RAG pipeline. The retrieval step is often a semantic search step where a search query is encoded as a vector (an “embedding”) and this is then compared with the embedding vectors of documents in the corpus. Ones that are close in the embedding vector space are then returned. This could be classed as one form of AI. At a lower level, tokenisation of a text (identifying the atomic “words” that are mapped into the embedding space) might be seen as being another “AI” function; trivially, how do you cope with punctuation that abuts a word; do you stem words; etc.; but then there are also mappings at a higher level, such as named entities (“Doo Dah Whatsit Company Ltd.”, “The Sisters of Mercy”) recognised and potentially disambiguated, (e.g. “The Sisters of Mercy (band)” rather than “Sisters of Mercy (religious order)”) and then represented as a single token. At the output side, the generative AI part, which creates a stream of tokens (and then decodes them) in response to an input prompt.

Perhaps a sign of my growing discontent with AI code demos that do things in less efficient and potentially more error prone way than “traditional” ways (hmm.. “traditional computing”… But then, I know several folk who would class that as assembler, Lisp and C) but a couple of posts this morning also really wound me up: R and Python Together: A Second Case Study Using Langchain’s LLM Tools, and its precursor, Using R And Python Together, Seamlessly: A Case Study Using Openai’s Gpt Models.

At this point, I should probably point out that that sort of demo is typical of the sort of mash-up approach I’d have used back in the day when I was looking or no- and low-code ways of grabbing information from the web and then having a play with it. But being older and more jaded, and increasingly wary of anything that appears on the web in terms of my ability to: a) trust it, and b) reliably retrieve it, I am starting to reconsider what the value of doing that sort of play in public actually is. For me, the play was related to seeing what I could do with tools and information that were out there, the assumption being that if I could hack something together over a coffee break: a) it was quite likely that someone was building a version to do a similar thing “properly”; b) the low barrier to entry meant that lots of people could tinker with similar ideas at a similarly low opportunity cost to themselves, which meant the ideas or tools might end up being adopted at scale, driven by early interest and adoption [I note a couple of noteable misses: Google Reader / Feedburner ; Yahoo Pipes; and a thing I liked but no-one else did: the OPML feed widget and in-browser browser that was Grazr, and my own StringLE (“String’n’glue Learning Environment”) hack. (Thinking about it, JupyterLab is not such a dissimilar sort of thing to either of those in terms of how I think it can be used as an integration environment; it’s just that it is and always has been such a hostile environment to work with…).] The tinkering also allowed me to just lift up the mat or the bed-clothes or whatever the metaphor is, to have a peek at how bad things could get if the play was escalated. Though I typically didn’t write-up my “Oh, so that means we can do this really evil thing…” thoughts, I just grokked them and filed them away, or noted them in passing, as “we are so f****d”. Maybe that was a mistake and I should have been more of a lighthouse against danger.

Anyway. The R and Py thing. Doing a useful job comparing the ability of OpenAI to answer a fact-y thing, and using some “mechanical” R and Py hacks to scrape film data from Wikipedia info-boxes. As I said, exactly the sort of “chain of hacks” tinkering I used to do, but that was from a perspective of “folk have data ‘locked up’ in raw from but exposed on the web”; as an amateur, I can extract that data from the website, albeit maybe with the odd issue from ropey scraping; I can then make some sort of observation, caveated with “subject to only having some of the data, perhaps with added with noise”, but hopefully giving an answer in the order of magnitude ball-park of correct-ish and interesting-or-not-ness.

But today, it wound me up. Scraping a Wikipedia infobox, the author notes, having grabbed the content, that clean requires some code of the “# I have no idea how this works # I just got it online” form. Again, the sort of thing I have done in the past, and still do, but in this case I think several things. Firstly, there is possibly a better way of getting the infobox that scraping on the basis of CSS selectors (getting a Wikipedia page as something closer to a (semi-)structured XML or JSON doc, for example. Secondly, can we properly retrieve rather more structured data from a related data source? For example, a) what cryptic Wikidata codes might I query on, b) and what cryptic SPARQL incantation should I use to pull the data from Wikidata or DBPedia ((I note the Wikidata examples include examples of querying for awards related data, which is pertinent to one of the original demo’s questions). Thirdly, would downloading the IMDb database make more sense, and then use SQL to query that data?

Now I know this takes requires arguably more “formal” skills than some hacky scraping and third-party regex-ing, but… But. Much of the data is available in a structured form, either via an open API, or as a bulk data download. Admittedly the latter requires some handling to put it into a form you can start to query it, but tools like datasette and DuckDB are making it much easier to get data from flat files into into a form you can use SQL to query it.

I have to admit I’m a bit wary of the direction of travel, and level, of LLM support that is being introduced into datasette. I would rather write my own f**ked up queries over the data and suffer the results than let an AI SQL generator do it in the background and then wonder and worry about what sort of data my new intern has managed to pick up based on asking the folk in the public bar what sort of query they should run, even if some of the folk in the bar are pretty good coders, or used to be, back in the day when the syntax was different.

Trying to pin down my concerns, they are these: why scrape data from a page (with the risk that entails) when you can request the page in a more structured form? Why scrape data when you can make a more structured request on a data source that feeds the web page you were scraping, and receive the data in a structured way? Why use a generic data source when you can use a domain specific datasource, with the added authority that follows from that that. In part, this boils down to: why use a secondary source when you can use a primary one?

The whole AI and genAI thing is potentially making the barrier entry lower to hacking your own information tools together. When I used to create hacky mashup pipelines, I made the decision at each step of what to hook-up to what, and how. And with each decision was an awareness of the problems that might arise as a consequence. For example, from knowing that when Y consumed from X, the output from X was a bit flaky, and when Z consumed from Y, the interface/conversion was a bit ropey, so the end result was this but subject to loads of caveats I knew about. There was a lot of white-box stuff in the pipeline that I had put together.

And at the output stage, where a genAI takes information in (that may be correct and appropriate) and then re-presents it, there are two broad classes of concern: how was that information retrieved (and and what risks etc are associated with that) and what sort of reinterpretation has been applied in the parsing of that information and re-presentation of it in the provided response?

The foundational models are building up layers of language from whatever crap they found on the web or in peoples’ chat conversations (which makes me think of anthrax (not the band)), and, increasingly, whatever AI generated training and test sets are being used to develop the latest iterations of them (which always makes me think of BSE / mad cow disease ).

PS per chance, I also stumbled across another old post today — Translate to Google Statistical (“Google Standard”?!) English? — where I closed with a comment “keep an eye on translate.google.com to see when the English to English translation ceases to be a direct copy”.

Here’s what happens now – note the prompt on the left includes errors and I’m asked in that context if I want to correct them.

If I say no — that is, translate the version with typos — do the typos get fixed by the translation?

Hmm.. do we also seem to have some predictive translation starting to appear?

In-browser WASM powered OCR Word Add-In

One of the longstanding off-the-shelf models used for OCR — optical character recognition — is provided in the form of Tesseract. It started life over thirty years and gets updated every so often. It’s also available as an in-browser model in the form of tesseract.js, a Javascript wrapper around a WASM implementation of the tesseract engine.

At some point last week, Simon Willison posted a single page web app [code, about, demo] that combines tesseract.js and pdf.js, a Javascript package for parsing and rendering PDF docs in the browser, to provide a simple image and PDF to text service.

A single page web app…

So I wondered if I could follow on from the previous two posts just cut and past the code and CSS and run it as a MS WOrd add-in:

And pleasingly, it did “just work”.

So whilst you’re in Microsoft office, you can drop an image or PDF into the side bar, and get the text out which can then be edited, copied and pasted into the Word doc.

The next level of integration would be to click to paste the text into the Word doc. Another obvious next step is to grab an image out of the word doc, OCR it, and paste the text back into the doc. More elaborately, I wonder if there are plugins for the pdf/image viewer that would let you select particular areas of the image for processing, or otherwise process the image, before running the OCR? For example, the photon WASM app [repo] seems to provide really powerful image manipulation features that all run in the browser.

I did wonder about whether the in-browser LLM chat demo based on the Google tflite/Mediapipe demo would also “just work” but either I messed something up, or something doesn’t quite work in the (Edge?) browser that runs the Word add-in. With an LLM in the sidebar, we should be able to run a local model for basic summarisation or document based Q & A.

Something else I want to try is audio to text as a self-contained add-in using Whisper.cpp WASM…

I’m guessing there are already Microsoft Office add-ins that do a lot of this already, but I’m more interested in what I can build myself: a) using off-the-shelf code and models, b) running locally, in the sidebar (browser); c) to create things that might even just be one-shot, disposable apps to help me do a particular thing, or scratch a particular itch, when working in a particular document.

More DIY MS Word Add-ins — Python and R code execution using Pyodide and WebR

Picking up on the pattern I used in DIY Microsoft Office Add-Ins – Postgres WASM in the Task Pane, it’s easy enough to do the same thing for Python code execution using pyodide WASM and R code using WebR WASM.

As in the postgres/SQL demo, this lets you select code in the Word doc, then edit and execute it in the task pane. If you modify the code in the task pane, you can use it to replace the highlighted text in the Word doc. (I haven’t yet looked to see what else the Word API supports…) The result of the code execution can be pasted back to the end of the Word doc.

The Pyodide and WebR environments persist between code execution steps, so you can build up state. I added a button to reset the state of the Pyodide environment back to an initial state, but haven’t done that for the WebR environment yet.

I’m not sure how useful this is? It’s a scratchpad thing as much as anything for lightly checking whether the code fragment in a Word doc are valid and run as expected. It can be used to bring back the results of the code execution into the Word doc, which may be useful. The coupling is tighter than if you are copying and pasting code and results to/from a code editor, but it’s still weaker than the integration you get from a reproducible document type such as a Jupyter notebook or an executable MyST markdown doc.

One thing that might be interesting to explore is whether I can style the code in the Word doc, then extract and run the code from the Task Pane to check it works, maybe even checking the output against some sort of particularly style output in the Word doc. But again, that feels a bit clunky compared to authoring in a notebook or Myst and then generating a Word doc, or whatever format doc, with actual code execution generating the reported outputs etc.

Here’s the code:

<!-- Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License. -->
<!-- This file shows how to design a first-run page that provides a welcome screen to the user about the features of the add-in. -->

<!DOCTYPE html>
<html>

<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Contoso Task Pane Add-in</title>

    <!-- Office JavaScript API -->
    https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js

    <!-- Load pyodide -->
    https://cdn.jsdelivr.net/pyodide/v0.25.0/full/pyodide.js
    http://./pyodide.js
    <!-- For more information on Fluent UI, visit https://developer.microsoft.com/fluentui#/. -->
    <link rel="stylesheet" href="https://static2.sharepointonline.com/files/fabric/office-ui-fabric-core/11.0.0/css/fabric.min.css"/>

    <!-- Template styles -->
    <link href="taskpane.css" rel="stylesheet" type="text/css" />
</head>

<body class="ms-font-m ms-welcome ms-Fabric">
    <header class="ms-welcome__header ms-bgColor-neutralLighter">
        <h1 class="ms-font-su">pyodide & WebR demo</h1>
    </header>
    <section id="sideload-msg" class="ms-welcome__main">
        <h2 class="ms-font-xl">Please <a target="_blank" href="https://learn.microsoft.com/office/dev/add-ins/testing/test-debug-office-add-ins#sideload-an-office-add-in-for-testing">sideload</a> your add-in to see app body.</h2>
    </section>
    <main id="app-body" class="ms-welcome__main" style="display: none;">
        <h2 class="ms-font-xl"> Pyodide & WebR demo </h2>
        <div>Execute Python or R code using Pyodide and WebR WASM powered code execution environments.</div>
        <textarea id="query" rows="4" cols="30"></textarea>
    <div><button id="getsel">Get Selection</button><button id="execute-py">Execute Py</button><button id="execute-r">Execute R</button><button id="exepaste">Paste Result</button><button id="replacesel">Replace selection</button><button id="reset">Reset Py</button></div>
     <div id="output-type"></div>
    <div id="output"></div>

    </main>
</body>

</html>

import { WebR } from "https://webr.r-wasm.org/latest/webr.mjs";

window.addEventListener("DOMContentLoaded", async function () {
  const buttonpy = /** @type {HTMLButtonElement} */ (document.getElementById("execute-py"));
  const buttonr = /** @type {HTMLButtonElement} */ (document.getElementById("execute-r"));
 
  let pyodide = await loadPyodide();

  const webR = new WebR();
  await webR.init();

  const resetpybutton = /** @type {HTMLButtonElement} */ (document.getElementById("reset"));
  resetpybutton.addEventListener("click", async function () {
    pyodide = await loadPyodide();
  });

  // Execute py on button click.
  buttonpy.addEventListener("click", async function () {
    buttonpy.disabled = true;

    // Get SQL from editor.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    //const timestamp = document.getElementById("timestamp");
    //timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    document.getElementById("output-type").innerHTML = "Executing Py code...";
    try {
      const queries = document.getElementById("query").value;
      let output_txt = pyodide.runPython(queries);
      output.innerHTML = output_txt;
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
      //timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      buttonpy.disabled = false;
      document.getElementById("output-type").innerHTML = "Py code result:";
    }
  });

  // Execute R on button click.
  buttonr.addEventListener("click", async function () {
    buttonr.disabled = true;

    // Get SQL from editor.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    //const timestamp = document.getElementById("timestamp");
    //timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    document.getElementById("output-type").innerHTML = "Executing R code...";
    try {
      const queries = document.getElementById("query").value;
      let output_r = await webR.evalR(queries);
      let output_json = await output_r.toJs();
      output.innerHTML = JSON.stringify(output_json);
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
        document.getElementById("output-type").innerHTML = "R code result:";
      //timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      buttonr.disabled = false;
    }
  });

});

/*
 * Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

/* global document, Office, Word */

Office.onReady((info) => {
  if (info.host === Office.HostType.Word) {
    document.getElementById("sideload-msg").style.display = "none";
    document.getElementById("app-body").style.display = "flex";
    document.getElementById("exepaste").onclick = exePaste;
    document.getElementById("getsel").onclick = getSel;
    document.getElementById("replacesel").onclick = replaceSel;
  }
});

async function getSel() {
  await Word.run(async (context) => {
    // Get code from selection
    const selected = context.document.getSelection();
    selected.load("text");
    await context.sync();
    document.getElementById("query").value = selected.text;
  });
}

async function replaceSel() {
  await Word.run(async (context) => {
    // Replace selected code
    const selected = context.document.getSelection();
    const replace_text = document.getElementById("query").value;
    selected.insertText(replace_text, Word.InsertLocation.replace);
    await context.sync();
  });
}

async function exePaste() {
  await Word.run(async (context) => {
    var output = document.getElementById("output").innerHTML;
  const docBody = context.document.body;
  docBody.insertParagraph(
    output,
    Word.InsertLocation.end
  );
    await context.sync();
  });
}

DIY Microsoft Office Add-Ins – Postgres WASM in the Task Pane

Poking around the org’s Office 365 offering today, I noticed in Word the ability to add “Add-ins”, sidebar extensions essentially.

There’s lots of them in the list, but many of them require some sort of enabling from OUr 364 admins, whoch rather begs the question why Microsoft should be allowed to pump such stuff into our corporate Office environment, but I guess that’s another thing that just being a customer of behemoth means you haver to suck up and accept.

Anyway… it looks like you can upload your own… (whether they’d word without approval I don’t know…)

Anyway, new to me, so how can it be to write one?

There’s set-up docs and tutorials and API docs, and at its simplest for a developer preview you “just” need to install some node stuff for the build and then tinker with some HTML, js, and css.

# On the command line
npm install -g yo generator-office
yo office

# I also found this useful to kill node webwerver
# npx kill-port 3000

I selected the Office Add-in Task Pane project option, then Javascript then Word.

As with modt mode things, this downloads the internet. For each new extension you build.

Unlike the battles I’ve had trying to build things for JupyterLab, it dodnlt take much time to repurpose my pglite (minimal postgres) demo that runs a WASM based version of Postgres in the browser to something that runs it in the Word task pane:

The demo runs via a local server in the Word app on my desktop.

In the above example, a “for example” use case might be:

I’m writing some SQL teaching materials;
I knock up some SQL in the Word doc;
I select it, and click a button in the task pane that copies the selected text over to the task pane;
I click another button that runs the SQL against the postgres WASM app running in the task pane, and it displays the result;
if the query doesn’t work as intended, I can fettle it in the task pane until it does work;
if necessary, I click a button to replace the original (broken) SQL in the Word doc with the corrected SQL;
if required, I click a button and it grabs my SQL result table and pastes it as a Word table at the end of the doc (hmm, maybe this should be to the cursor location?)

Here’s the HTML for the task pane view:

<pre class="wp-block-syntaxhighlighter-code"><!-- Originally Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License. -->

<!DOCTYPE html>
<html>

<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>OUseful-pglite Task Pane Add-in</title>

    <!-- Office JavaScript API -->
    <a href="https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js">https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js</a>

    <!-- Bring in a OUseful file-->
    <a href="http://pglite.js">http://pglite.js</a>

    <link rel="stylesheet" href="https://static2.sharepointonline.com/files/fabric/office-ui-fabric-core/11.0.0/css/fabric.min.css"/>

    <!-- Template styles -->
    <link href="taskpane.css" rel="stylesheet" type="text/css" />
</head>

<body class="ms-font-m ms-welcome ms-Fabric">
    <header class="ms-welcome__header ms-bgColor-neutralLighter">
        <h1 class="ms-font-su">ouseful.info WASM demo</h1>
    </header>
    <section id="sideload-msg" class="ms-welcome__main">
        <h2 class="ms-font-xl">Please <a target="_blank" href="https://learn.microsoft.com/office/dev/add-ins/testing/test-debug-office-add-ins#sideload-an-office-add-in-for-testing">sideload</a> your add-in to see app body.</h2>
    </section>
    <main id="app-body" class="ms-welcome__main" style="display: none;">
        <h2 class="ms-font-xl">WASM app running in sidebar...</h2>
        <div>Simple pglite application running in the browser. Example query: <tt>select * from test;</tt><br/><br/></div>
        <textarea id="query" rows="4" cols="30"></textarea>
    <div><button id="getsel">Get Selection</button><button id="execute">Execute</button><button id="exepaste">Paste Result</button><button id="replacesel">Replace selection</button></div>
    <div id="timestamp"></div>
    <div id="output"></div>
    <div id="results"></div>
    </main>
</body>

</html></pre>

And the JS from my pglite demo:

// PGLite loader
import { PGlite } from "https://cdn.jsdelivr.net/npm/@electric-sql/pglite/dist/index.js";

// Initialize PGlite
const db = new PGlite();
// We can persist the db in the browser
//const db = new PGlite('idb://my-pgdata')

const DEFAULT_SQL = `
-- Optionally select statements to execute.

CREATE TABLE IF NOT EXISTS test  (
        id serial primary key,
        title varchar not null
      );

INSERT INTO test (title) values ('dummy');

`.trim();

async function createTable() {
  await db.query(DEFAULT_SQL);
  await db.query("select * from test").then((resultsX) => {
    console.log(JSON.stringify(resultsX));
  });
}
createTable();
window.addEventListener("DOMContentLoaded", async function () {
  const button = /** @type {HTMLButtonElement} */ (
    document.getElementById("execute")
  );

  // Execute SQL on button click.
  button.addEventListener("click", async function () {
    button.disabled = true;

    // Get SQL from editor.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    const timestamp = document.getElementById("timestamp");
    timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    try {
      const results = await db.query(`${queries}`);
      //.then(results => {console.log("results are"+JSON.stringify(results))});

      const resultsDiv = document.getElementById("results");
      resultsDiv.innerHTML = "";
      const table = formatTable(results);
      formatRows(results, table);
      resultsDiv.appendChild(table);
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
      timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      button.disabled = false;
    }
  });
});

function formatTable(results) {
  const table = document.createElement("table");

  const headerRow = table.insertRow();
  Object.keys(results[0]).forEach((key) => {
    const th = document.createElement("th");
    th.textContent = key;
    headerRow.appendChild(th);
  });
  return table;
}

function formatRows(results, table) {
  results.forEach((rowData) => {
    const row = table.insertRow();
    Object.values(rowData).forEach((value) => {
      const cell = row.insertCell();
      cell.textContent = value;
    });
  });
}

And the very task pane js:

/*
 * Originally Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

/* global document, Office, Word */

Office.onReady((info) => {
  if (info.host === Office.HostType.Word) {
    document.getElementById("sideload-msg").style.display = "none";
    document.getElementById("app-body").style.display = "flex";
    document.getElementById("exepaste").onclick = exePaste;
    document.getElementById("getsel").onclick = getSel;
    document.getElementById("replacesel").onclick = replaceSel;
  }
});


async function getSel() {
  await Word.run(async (context) => {
    // Get SQL from selection
    const selected = context.document.getSelection();
    selected.load("text");
    await context.sync();
    document.getElementById("query").value = selected.text;
    //const paragraph = context.document.body.insertParagraph("Running command: " + selected.text, Word.InsertLocation.end);
    await context.sync();

  });
}

async function replaceSel() {
  await Word.run(async (context) => {
    // Get SQL from selection
    const selected = context.document.getSelection();
    const replace_text = document.getElementById("query").value;
    selected.insertText(replace_text, Word.InsertLocation.replace);
    //const paragraph = context.document.body.insertParagraph("Running command: " + selected.text, Word.InsertLocation.end);
    await context.sync();
  });
}

async function exePaste() {
  await Word.run(async (context) => {
    // Get SQL from editor.
    //const txt = document.getElementById("results").innerHTML;
    //const paragraph = context.document.body.insertParagraph(txt, Word.InsertLocation.end);
    //var parser = new DOMParser();
    //var htmlDoc = parser.parseFromString(txt, "text/html");
    //var table = htmlDoc.getElementsByTagName("table")[0];
    var table = document.getElementById("results").firstChild;
    var rows = table.rows;
    var numCols = rows[0].cells.length;
    var numRows = rows.length;
    // Extract table data into a two-dimensional array
    var tableData = [];
    for (var i = 0; i < numRows; i++) {
      var rowData = [];
      for (var j = 0; j < numCols; j++) {
        rowData.push(rows[i].cells[j].innerText);
      }
      tableData.push(rowData);
    }

    // Insert the table into the Word document
    const wtable = context.document.body.insertTable(numRows, numCols, Word.InsertLocation.end, tableData);

    await context.sync();

  });
}

For completeness, here’s the css:

/* 
 * Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

 html,
 body {
     width: 100%;
     height: 100%;
     margin: 0;
     padding: 0;
 }
 
 ul {
     margin: 0;
     padding: 0;
 }
 
 .ms-welcome__header {
    padding: 20px;
    padding-bottom: 30px;
    padding-top: 100px;
    display: -webkit-flex;
    display: flex;
    -webkit-flex-direction: column;
    flex-direction: column;
    align-items: center;
 }

 .ms-welcome__main {
    display: -webkit-flex;
    display: flex;
    -webkit-flex-direction: column;
    flex-direction: column;
    -webkit-flex-wrap: nowrap;
    flex-wrap: nowrap;
    -webkit-align-items: center;
    align-items: center;
    -webkit-flex: 1 0 0;
    flex: 1 0 0;
    padding: 10px 20px;
 }
 
 .ms-welcome__main > h2 {
     width: 100%;
     text-align: center;
 }
 
 .ms-welcome__features {
     list-style-type: none;
     margin-top: 20px;
 }
 
 .ms-welcome__features.ms-List .ms-ListItem {
     padding-bottom: 20px;
     display: -webkit-flex;
     display: flex;
 }
 
 .ms-welcome__features.ms-List .ms-ListItem > .ms-Icon {
     margin-right: 10px;
 }
 
 .ms-welcome__action.ms-Button--hero {
     margin-top: 30px;
 }
 
.ms-Button.ms-Button--hero .ms-Button-label {
  color: #0078d7;
}

.ms-Button.ms-Button--hero:hover .ms-Button-label,
.ms-Button.ms-Button--hero:focus .ms-Button-label{
  color: #005a9e;
  cursor: pointer;
}

b {
    font-weight: bold;
}

      #editor-container {
        width: 100%;
        height: 20vh;
        border:lightgray;
        border-style: solid;
      }

      #vfs-container {
        margin-top: 0.5em;
        margin-bottom: 0.5em;
      }
      #timestamp {
        margin-top: 0.5em;
      }

      #output {
        display: flex;
        flex-wrap: wrap;
        width: 100%;
      }

      table {
        margin-top: 1em;
        margin-left: 0.4em;
        margin-right: 0.4em;
        border-collapse: collapse;
      }

      td, th {
        border: 1px solid #999;
        padding: 0.5rem;
        text-align: left;
      }

Next up, I think I’lll have a look at adding a simple pyodide / Python sandbox into the task pane, perhaps trying the “react” rather than javascript cookiecutter (I think that offers a different way of loading js and wasm assets?)

This would then allow a similar pattern of “author code in the Word doc, try it out, fix it, paste the fix back, paste the result back”. It’s nto a reperoducible pattern, but it does shorten the loop between code and output, and also means you can test the code (sort of) whilst authoring.

PS Hmm, I wonder, could I style the code in the word doc then extract just all the styled code, or “all above” styled code, and run it in the sidebar? One to try tomorrow, maybe…