Simple Playwright Tests for Jupyter Environments

In passing, I’ve been revising a couple of JupyterLab extensions I’ve cobbled together in the past and started looking at producing some simple Playwright tests for them.

For years now, there has been a testing helper framework called Galata, now bundled as part of JupyterLab, although the docs are scarce, the examples few, and things you might expect to be there are lacking. (There are also seemingly obvious signals lacking in the core JupyterLab codebase — you can find out when a notebook code cell is queued to run, stops running, and whether it has successfully or unsuccessfully run, but not find out when it actually starts running — but that’s another story.)

Anyway… I wasted a whole chunk of time trying to use Galata when it would probably have been easier to just write simple, native Playwright tests, and then a further chunk of time trying to use Galata in a minimal demo that hooks a node.js Playwright Docker container up to a container running a JupyterLab environment, so that I can run tests in the Playwright container with the testing framework completely isolated from the Jupyter container.

I used a minimal Docker Compose script (run using docker-compose up) to create the two containers (a node Playwright container, put into a holding position, and the Jupyter container) and network them together. A shared folder on the host containing the tests was mounted into the Playwright container (I could also have mounted some test notebooks from another directory into the Jupyter container).

# ./docker-compose.yml
networks:
  playwright-test-network:
    driver: bridge
 
services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.43.0-jammy
    stdin_open: true
    tty: true
    # Perhaps tighten up things by requiring the container
    # to be tested to actually start running:
    #depends_on:
      #- tm351
    volumes:
      - ./:/home/pwuser
    networks:
      - playwright-test-network

  tm351:
    image: mmh352/tm351:23j
    networks:
      - playwright-test-network

A playwright.config.js file of the form:

// ./playwright.config.js

module.exports = {
  use: {
    // Browser options
    headless: true,

  },
};

and an example test file showing how to log in to the auth’ed Jupyter environment:

// tests/demo-login.spec.ts

import { test, expect } from "@playwright/test";

test("form submission test", async ({ page }) => {
  // If we run this from the docker-compose.yml entrypoint,
  // we need to give the other container time to start up
  // and publish the Jupyter landing page.
  // await page.waitForTimeout(10000);
  // Does playwright have a function that polls a URL a
  // few times every so often for a specified time or
  // number of poll attempts?
  // That would be a more effective way of waiting.
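  // (It looks like it does: expect.poll() retries an async callback
  // until the matcher passes or a timeout is hit. Something like the
  // following sketch, using page.request to probe the login page,
  // might do the job -- treat it as an untested sketch:
  //
  //   await expect
  //     .poll(async () => {
  //       try {
  //         const resp = await page.request.get("http://tm351:8888/login");
  //         return resp.status();
  //       } catch {
  //         return 0; // server not up yet
  //       }
  //     }, { timeout: 60_000, intervals: [2_000] })
  //     .toBe(200);
  // )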

  // Navigate to the webpage
  await page.goto("http://tm351:8888/login?next=%2Flab");

  // Enter a value into the form
  await page.fill("#password_input", "TM351-23J");
  await page.screenshot({ path: "screenshot1.png" });

  // Click the button
  await page.click("#login_submit");

  // Wait for response or changes on the page
  await page.waitForResponse((response) => response.status() === 200);

  await page.waitForTimeout(10000);

  // Take a screenshot
  await page.screenshot({ path: "screenshot2.png" });
});

Inside the Playwright container, we can run npm install @jupyterlab/galata to install the Galata test framework and use the Galata test functions and helpers. However, the Galata test package just seemed to mess things up for me and not work, whereas the plain Playwright test package did just work.

After looking up the name of the Playwright container (docker ps), I logged into it (docker exec -it playtest-playwright-1 bash; by default this logged me in as root, which can also be made explicit by adding -u root to the exec command) and then manually ran the tests: npx playwright test. Ctrl-C stops the containers, and docker-compose rm deletes them.

We can run the tests directly from the docker-compose.yml:

networks:
  playwright-test-network:
    driver: bridge
 

services:
  playwright:
    image: mcr.microsoft.com/playwright:v1.43.0-jammy
    #stdin_open: true
    #tty: true
    entrypoint: ["npx","playwright","test"]
    working_dir: /home/pwuser
    volumes:
      - ./:/home/pwuser
    networks:
      - playwright-test-network

  tm351:
    image: mmh352/tm351:23j
    networks:
      - playwright-test-network

Ideally, we’d just want to start up both containers, run the tests, then shut them down, but docker-compose doesn’t seem to offer that sort of facility (does kubernetes?).

In passing, I also note that there is a pre-built Docker container that uses the Python Playwright API (mcr.microsoft.com/playwright/python:v1.42.0-jammy), which creates opportunities for using something like playwright-pytest. I also note that we could inject and run Python code into notebooks (or include Python scripts in pre-written notebooks) that controls the JupyterLab UI using ipylab; the ipylab package lets you script and manipulate the behaviour of the JupyterLab UI from a notebook code cell. This creates the interesting possibility of having notebook-based Python scripts control the JupyterLab UI, with their execution in turn controlled by the test script (either a Python or a TypeScript test script).
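
For what it's worth, here's a minimal sketch of the sort of ipylab scripting I mean, based on my reading of the ipylab README (the command id is illustrative; the command registry listing tells you what is actually available):

# Run from a notebook code cell: drive the JupyterLab UI via ipylab.
# A minimal sketch based on my reading of the ipylab docs.
from ipylab import JupyterFrontEnd

app = JupyterFrontEnd()

# See which JupyterLab commands can be invoked...
print(app.commands.list_commands()[:10])

# ...and invoke one (the command id here is illustrative).
app.commands.execute("apputils:activate-command-palette")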

And related to pytest testing, via (who else but?!) @simonw here, I note inline-snapshot, a rather handy tool that lets you specify a test against a thing, and if the expected value doesn't exist yet, it will grab a gold-master copy of it for you. Playwright does this natively for screenshots, for example the first time a screenshot comparison test is run as part of its visual comparisons offering.
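
By way of illustration, here's roughly how I understand inline-snapshot to work (a sketch from my reading of the docs, untested by me): you write the assertion with an empty snapshot(), and the tool fills in the gold-master value the first time you run pytest with snapshot creation enabled.

# test_demo.py -- minimal inline-snapshot sketch (untested).
from inline_snapshot import snapshot


def get_thing():
    # Stand-in for whatever the test is really exercising.
    return {"status": "ok", "count": 3}


def test_thing():
    # On a first run with `pytest --inline-snapshot=create`, the empty
    # snapshot() call is rewritten in place with the observed value;
    # subsequent runs compare against that stored gold-master copy.
    assert get_thing() == snapshot()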

Automatically generating SQL equivalents to dataframe method chains

Back to the “genAI is not necessarily the best alternative” thing again, this time in the context of generating SQL code that performs a particular sort of query. Natural language may be a starting point for this, but code might also provide the entry point, as for example if you have prototyped something you know works using in-memory pandas or R dataframes, and you now want to move that over to a set of operations performed on corresponding relational tables inside a SQL database.

And if you have code that works, why would you want a statistical genAI process to have a guess at some code that might work for you, rather than using a deterministic process and mechanically translating something that works into something that is logically equivalent (even if it’s not the most idiomatic or optimal equivalent)?

In the R world, dbplyr is a handy utility that will convert your dplyr code (transformations applied over a dataframe using dplyr verbs) to equivalent, though possibly not optimal or idiomatic, SQL code. The dbplyr package works with several different database backends, including MariaDB, PostgreSQL and SQLite, so presumably any dialect differences are accommodated if you provide an appropriate connection or database description?

From the docs:

# lazily generates query
summary <- mtcars2 %>% 
  group_by(cyl) %>% 
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>% 
  arrange(desc(mpg))

# see query
summary %>% show_query()
#> <SQL>
#> SELECT `cyl`, AVG(`mpg`) AS `mpg`
#> FROM `mtcars`
#> GROUP BY `cyl`
#> ORDER BY `mpg` DESC

If required, you can also explicitly inject SQL into a dplyr expression (example; I don’t think the code will also try to execute this on the R side, eg by dumping the query into SQLite…).

On the question of producing appropriate SQL dialects, I note the SQLGlot Python package, a “no-dependency SQL parser, transpiler, optimizer, and engine” that “can be used to format SQL or translate between 21 different dialects”. As far as the optimiser goes, I assume this means it could accept suboptimal SQL code generated using dbplyr and then return something more efficient?
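
For example, a quick sketch (using what I understand to be the core SQLGlot calls) of transpiling between dialects and running the optimiser:

import sqlglot
from sqlglot.optimizer import optimize

# Dialect-to-dialect transpilation, e.g. backtick-quoted MySQL/SQLite
# style SQL (as dbplyr might emit) rewritten for Postgres.
sql = "SELECT `cyl`, AVG(`mpg`) AS `mpg` FROM `mtcars` GROUP BY `cyl`"
print(sqlglot.transpile(sql, read="mysql", write="postgres")[0])

# Optimise a (possibly machine-generated) query; supplying a schema
# lets the optimiser qualify columns and prune redundant subqueries.
expr = sqlglot.parse_one("SELECT a FROM (SELECT a, b FROM t) AS sub")
print(optimize(expr, schema={"t": {"a": "INT", "b": "INT"}}).sql())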

Whilst the Python pandas package is happy to connect to a database backend, and can read and write tables from and to a connected database, it doesn’t try to generate SQL equivalent queries to a chained set of pandas dataframe manipulating methods.

Although pandas was the original widely used dataframe package on the Python block, several other alternatives have appeared over the years, with improved efficiency and with a syntax that resembles the pandas syntax, although it may not match it exactly. If you know the various operations that SQL supports, you’ll have a fair idea of what verbs are available for manipulating dataframes in any dataframe package; so if you’ve worked with dplyr, pandas, dask (built with parallelism in mind), or polars (a Python package powered by Rust), there’s a good chance you’ll be able to make sense of any of the others.
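
As a trivial example of that transferability (a sketch, not checked against the latest releases; polars has renamed groupby to group_by in recent versions), the same “group and aggregate” operation looks much the same in pandas and polars:

import pandas as pd
import polars as pl

data = {"cyl": [4, 4, 6, 6, 8], "mpg": [26.0, 30.4, 19.7, 21.0, 15.8]}

# pandas: mean mpg per cylinder count
pd_result = pd.DataFrame(data).groupby("cyl", as_index=False)["mpg"].mean()

# polars: the same operation, near enough verb-for-verb
pl_result = pl.DataFrame(data).group_by("cyl").agg(pl.col("mpg").mean())

print(pd_result)
print(pl_result)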

At least one of the pandas alternative dataframes-in-Python packages does seem to have given the “conversion-to-SQL” thing some serious consideration: ibis (repo). From the docs:

con = ibis.connect("duckdb://")

t = con.read_parquet("penguins.parquet")

g = t.group_by(["species", "island"]).agg(count=t.count()).order_by("count")
ibis.to_sql(g)

which gives:

SELECT
  `t1`.`species`,
  `t1`.`island`,
  `t1`.`count`
FROM (
  SELECT
    `t0`.`species`,
    `t0`.`island`,
    COUNT(*) AS `count`
  FROM `ibis_read_parquet_t2ab23nqsnfydeuy5zpg4yg2im` AS `t0`
  GROUP BY
    1,
    2
) AS `t1`
ORDER BY
  `t1`.`count` ASC NULLS LAST

Actually that looks a bit cryptic and shorthand (t0, t1, 1, 2?) and could perhaps have been made to look a bit more friendly?

There’s also the rather nifty option of combining method chains with SQL code:

sql = """
SELECT
  species,
  island,
  COUNT(*) AS count
FROM penguins
GROUP BY species, island
""".strip()

If you sometimes think in pandas and sometimes in SQL, then this sort of approach might be really handy… I note that the DuckDB docs give it a nod, possibly because it uses DuckDB as the default backend? But I don’t have much sense about the trajectory it’s on in terms of development (first, second, and/or third party), adoption, and support.

The ibis machinery is built using Substrait, a logical framework for supporting interoperability across different structured data processing platforms. I wonder if that sort of conceptualisation might be a useful framing in an educational context? Here’s their primer document. The repo shows some level of ongoing activity and engagement, but I’m not sure where the buy-in comes from or how committed it is. In terms of process, it seems the project is looking to get its foundations right, as Apache Arrow did; and Arrow really does seem to have taken off as a low-level columnar in-memory and inter-process messaging/serialisation format, with excellent support for things like parquet file reading and writing.

PS hmm… so I wonder… might we expect something like a substrait sql2pandas utility? A pandoc for sql2sql, sql2dfFramework, dfFramework2sql and dfFramework2dfFramework conversions? Maybe I need to keep tabs on this PyPI package: substrait (repo). I note other packages in the substrait-io organisation for other languages…

GenAI Outputs as “Almost Information” from Second-hand Secondary Information Sources?

Checking my feeds, AI hype continues, but I get the sense that signs of “yes, but…” and “well, actually…” are also starting to appear. In the RAG (retrieval augmented generation) space, I’m also noticing more items relating to getting your documentary source data better organised, which in part seems to boil down to improving the retrieval part of the problem (retrieving appropriate documents/information based on a particular information request), and in part to treating that step as a search problem (addressing the needs of an information user seeking particular information for a particular reason).

A lot of genAI applications, not just those that are conversational front-ends to RAG pipelines, seem to have the flavour of “interpreted search” systems: a user has an information need, makes an informational request to the system, and the system provides an informative response (ideally!) in a natural language (often conversational) way, albeit with possible added noise (bias, hallucination) at each step. Loosely related: GenAI in Edu and The Chinese Room.

From the archive: interesting to note the remarks I made on my first encounter with OpenAI GPT3-beta in July 2021: Appropriate/ing Knowledge and Belief Tools?

In passing, we might note different flavours of what might be termed AI in a RAG pipeline. The retrieval step is often a semantic search step where a search query is encoded as a vector (an “embedding”) and this is then compared with the embedding vectors of documents in the corpus; documents that are close in the embedding vector space are then returned. This could be classed as one form of AI. At a lower level, tokenisation of a text (identifying the atomic “words” that are mapped into the embedding space) might be seen as another “AI” function; trivially, how do you cope with punctuation that abuts a word, do you stem words, etc.; but there are also mappings at a higher level, such as named entities (“Doo Dah Whatsit Company Ltd.”, “The Sisters of Mercy”) being recognised, potentially disambiguated (e.g. “The Sisters of Mercy (band)” rather than “Sisters of Mercy (religious order)”), and then represented as a single token. And at the output side sits the generative AI part, which creates a stream of tokens (and then decodes them) in response to an input prompt.
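
To make the retrieval step a little more concrete, the core of it is just nearest-neighbour search in an embedding space. In the sketch below, embed() is a hypothetical stand-in for whatever embedding model you use (a sentence-transformer, a hosted embeddings API, etc.); the ranking logic is the bit that matters:

import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function: in practice this would call a
    sentence-transformer model or a hosted embeddings endpoint."""
    raise NotImplementedError


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 means "pointing the same way" in embedding space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents whose embeddings lie closest to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]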

Perhaps it’s a sign of my growing discontent with AI code demos that do things in a less efficient and potentially more error-prone way than “traditional” approaches (hmm… “traditional computing”… but then, I know several folk who would class that as assembler, Lisp and C), but a couple of posts this morning really wound me up: R and Python Together: A Second Case Study Using Langchain’s LLM Tools, and its precursor, Using R And Python Together, Seamlessly: A Case Study Using Openai’s Gpt Models.

At this point, I should probably point out that that sort of demo is typical of the sort of mash-up approach I’d have used back in the day when I was looking for no- and low-code ways of grabbing information from the web and then having a play with it. But being older and more jaded, and increasingly wary of anything that appears on the web in terms of my ability to: a) trust it, and b) reliably retrieve it, I am starting to reconsider what the value of doing that sort of play in public actually is. For me, the play was related to seeing what I could do with tools and information that were out there, the assumption being that if I could hack something together over a coffee break: a) it was quite likely that someone was building a version to do a similar thing “properly”; b) the low barrier to entry meant that lots of people could tinker with similar ideas at a similarly low opportunity cost to themselves, which meant the ideas or tools might end up being adopted at scale, driven by early interest and adoption. [I note a couple of notable misses: Google Reader / Feedburner; Yahoo Pipes; and a thing I liked but no-one else did: the OPML feed widget and in-browser browser that was Grazr, and my own StringLE (“String’n’glue Learning Environment”) hack. (Thinking about it, JupyterLab is not such a dissimilar sort of thing to either of those in terms of how I think it can be used as an integration environment; it’s just that it is, and always has been, such a hostile environment to work with…)] The tinkering also allowed me to just lift up the mat or the bed-clothes or whatever the metaphor is, to have a peek at how bad things could get if the play was escalated. Though I typically didn’t write up my “Oh, so that means we can do this really evil thing…” thoughts; I just grokked them and filed them away, or noted them in passing, as “we are so f****d”. Maybe that was a mistake and I should have been more of a lighthouse against danger.

Anyway. The R and Py thing. Doing a useful job comparing the ability of OpenAI to answer a fact-y thing, and using some “mechanical” R and Py hacks to scrape film data from Wikipedia info-boxes. As I said, exactly the sort of “chain of hacks” tinkering I used to do, but that was from a perspective of “folk have data ‘locked up’ in raw form but exposed on the web”; as an amateur, I can extract that data from the website, albeit maybe with the odd issue from ropey scraping; I can then make some sort of observation, caveated with “subject to only having some of the data, perhaps with added noise”, but hopefully giving an answer in the order of magnitude ball-park of correct-ish and interesting-or-not-ness.

But today, it wound me up. Scraping a Wikipedia infobox, the author notes, having grabbed the content, that cleaning it requires some code of the “# I have no idea how this works # I just got it online” form. Again, the sort of thing I have done in the past, and still do, but in this case I think several things. Firstly, there is possibly a better way of getting the infobox than scraping on the basis of CSS selectors (getting a Wikipedia page as something closer to a (semi-)structured XML or JSON doc, for example). Secondly, can we properly retrieve rather more structured data from a related data source? For example: a) what cryptic Wikidata codes might I query on, and b) what cryptic SPARQL incantation should I use to pull the data from Wikidata or DBpedia? (I note the Wikidata examples include examples of querying for awards-related data, which is pertinent to one of the original demo’s questions.) Thirdly, would downloading the IMDb database make more sense, and then using SQL to query that data?
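
For the record, here’s the sort of thing I mean by a structured request (a sketch: the SPARQL endpoint is real, but the property/item identifiers and the example director are from memory and should be checked against Wikidata):

import requests

# Wikidata SPARQL endpoint; the P/Q identifiers below are illustrative
# and should be double checked (P31 ~ "instance of", Q11424 ~ "film",
# P57 ~ "director").
ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?film ?filmLabel WHERE {
  ?film wdt:P31 wd:Q11424 ;
        wdt:P57 ?director .
  ?director rdfs:label "Ridley Scott"@en .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

r = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "infobox-demo/0.1 (example)"},
)
for row in r.json()["results"]["bindings"]:
    print(row["filmLabel"]["value"])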

Now I know this requires arguably more “formal” skills than some hacky scraping and third-party regex-ing, but… But. Much of the data is available in a structured form, either via an open API or as a bulk data download. Admittedly the latter requires some handling to put it into a form you can start to query, but tools like datasette and DuckDB are making it much easier to get data from flat files into a form you can query with SQL.
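
By way of example (a sketch: title.basics.tsv.gz is one of the IMDb bulk download files, but I haven’t double checked the column names used here), DuckDB will happily query a gzipped TSV in place, with no separate “load into a database” step required:

import duckdb

# Query an IMDb bulk-download file (https://datasets.imdbws.com/) directly.
# Column names are as I remember them; check against the file header.
con = duckdb.connect()
df = con.sql("""
    SELECT primaryTitle, startYear
    FROM read_csv_auto('title.basics.tsv.gz', delim='\t', header=true)
    WHERE titleType = 'movie' AND startYear = '1979'
    LIMIT 10
""").df()
print(df)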

I have to admit I’m a bit wary of the direction of travel, and level, of LLM support that is being introduced into datasette. I would rather write my own f**ked up queries over the data and suffer the results than let an AI SQL generator do it in the background and then wonder and worry about what sort of data my new intern has managed to pick up based on asking the folk in the public bar what sort of query they should run, even if some of the folk in the bar are pretty good coders, or used to be, back in the day when the syntax was different.

Trying to pin down my concerns, they are these: why scrape data from a page (with the risk that entails) when you can request the page in a more structured form? Why scrape data when you can make a more structured request on a data source that feeds the web page you were scraping, and receive the data in a structured way? Why use a generic data source when you can use a domain-specific data source, with the added authority that follows from that? In part, this boils down to: why use a secondary source when you can use a primary one?

The whole AI and genAI thing is potentially lowering the barrier to entry for hacking your own information tools together. When I used to create hacky mashup pipelines, I made the decision at each step of what to hook up to what, and how. And with each decision came an awareness of the problems that might arise as a consequence: for example, knowing that when Y consumed from X, the output from X was a bit flaky, and when Z consumed from Y, the interface/conversion was a bit ropey, so the end result came subject to loads of caveats I knew about. There was a lot of white-box stuff in the pipeline that I had put together.

And at the output stage, where a genAI takes information in (that may be correct and appropriate) and then re-presents it, there are two broad classes of concern: how was that information retrieved (and what risks etc. are associated with that), and what sort of reinterpretation has been applied in the parsing of that information and the re-presentation of it in the provided response?

The foundational models are building up layers of language from whatever crap they found on the web or in peoples’ chat conversations (which makes me think of anthrax (not the band)), and, increasingly, whatever AI generated training and test sets are being used to develop the latest iterations of them (which always makes me think of BSE / mad cow disease ).

PS per chance, I also stumbled across another old post today — Translate to Google Statistical (“Google Standard”?!) English? — where I closed with a comment “keep an eye on translate.google.com to see when the English to English translation ceases to be a direct copy”.

Here’s what happens now – note the prompt on the left includes errors and I’m asked in that context if I want to correct them.

If I say no — that is, translate the version with typos — do the typos get fixed by the translation?

Hmm.. do we also seem to have some predictive translation starting to appear?

In-browser WASM powered OCR Word Add-In

One of the longstanding off-the-shelf models used for OCR — optical character recognition — is provided in the form of Tesseract. It started life over thirty years ago and gets updated every so often. It’s also available as an in-browser model in the form of tesseract.js, a Javascript wrapper around a WASM implementation of the tesseract engine.

At some point last week, Simon Willison posted a single page web app [code, about, demo] that combines tesseract.js and pdf.js, a Javascript package for parsing and rendering PDF docs in the browser, to provide a simple image and PDF to text service.

A single page web app…

So I wondered if I could follow on from the previous two posts and just cut and paste the code and CSS and run it as an MS Word add-in:

And pleasingly, it did “just work”.

So whilst you’re in Microsoft Office, you can drop an image or PDF into the sidebar and get the text out, which can then be edited, copied and pasted into the Word doc.

The next level of integration would be to click to paste the text into the Word doc. Another obvious next step is to grab an image out of the Word doc, OCR it, and paste the text back into the doc. More elaborately, I wonder if there are plugins for the PDF/image viewer that would let you select particular areas of the image for processing, or otherwise process the image, before running the OCR? For example, the photon WASM app [repo] seems to provide really powerful image manipulation features that all run in the browser.

I did wonder about whether the in-browser LLM chat demo based on the Google tflite/Mediapipe demo would also “just work” but either I messed something up, or something doesn’t quite work in the (Edge?) browser that runs the Word add-in. With an LLM in the sidebar, we should be able to run a local model for basic summarisation or document based Q & A.

Something else I want to try is audio-to-text as a self-contained add-in using Whisper.cpp WASM.

I’m guessing there are already Microsoft Office add-ins that do a lot of this already, but I’m more interested in what I can build myself: a) using off-the-shelf code and models, b) running locally, in the sidebar (browser); c) to create things that might even just be one-shot, disposable apps to help me do a particular thing, or scratch a particular itch, when working in a particular document.

More DIY MS Word Add-ins — Python and R code execution using Pyodide and WebR

Picking up on the pattern I used in DIY Microsoft Office Add-Ins – Postgres WASM in the Task Pane, it’s easy enough to do the same thing for Python code execution using pyodide WASM and R code using WebR WASM.

As in the postgres/SQL demo, this lets you select code in the Word doc, then edit and execute it in the task pane. If you modify the code in the task pane, you can use it to replace the highlighted text in the Word doc. (I haven’t yet looked to see what else the Word API supports…) The result of the code execution can be pasted back to the end of the Word doc.

The Pyodide and WebR environments persist between code execution steps, so you can build up state. I added a button to reset the state of the Pyodide environment back to an initial state, but haven’t done that for the WebR environment yet.

I’m not sure how useful this is? It’s a scratchpad thing as much as anything for lightly checking whether the code fragments in a Word doc are valid and run as expected. It can be used to bring back the results of the code execution into the Word doc, which may be useful. The coupling is tighter than if you are copying and pasting code and results to/from a code editor, but it’s still weaker than the integration you get from a reproducible document type such as a Jupyter notebook or an executable MyST markdown doc.

One thing that might be interesting to explore is whether I can style the code in the Word doc, then extract and run the code from the Task Pane to check it works, maybe even checking the output against some sort of particularly styled output in the Word doc. But again, that feels a bit clunky compared to authoring in a notebook or MyST and then generating a Word doc, or whatever format of doc, with actual code execution generating the reported outputs etc.

Here’s the code:

<!-- Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License. -->
<!-- This file shows how to design a first-run page that provides a welcome screen to the user about the features of the add-in. -->

<!DOCTYPE html>
<html>

<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Contoso Task Pane Add-in</title>

    <!-- Office JavaScript API -->
    <script src="https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js"></script>

    <!-- Load pyodide -->
    <script src="https://cdn.jsdelivr.net/pyodide/v0.25.0/full/pyodide.js"></script>
    <script type="module" src="./pyodide.js"></script>
    <!-- For more information on Fluent UI, visit https://developer.microsoft.com/fluentui#/. -->
    <link rel="stylesheet" href="https://static2.sharepointonline.com/files/fabric/office-ui-fabric-core/11.0.0/css/fabric.min.css"/>

    <!-- Template styles -->
    <link href="taskpane.css" rel="stylesheet" type="text/css" />
</head>

<body class="ms-font-m ms-welcome ms-Fabric">
    <header class="ms-welcome__header ms-bgColor-neutralLighter">
        <h1 class="ms-font-su">pyodide & WebR demo</h1>
    </header>
    <section id="sideload-msg" class="ms-welcome__main">
        <h2 class="ms-font-xl">Please <a target="_blank" href="https://learn.microsoft.com/office/dev/add-ins/testing/test-debug-office-add-ins#sideload-an-office-add-in-for-testing">sideload</a> your add-in to see app body.</h2>
    </section>
    <main id="app-body" class="ms-welcome__main" style="display: none;">
        <h2 class="ms-font-xl"> Pyodide & WebR demo </h2>
        <div>Execute Python or R code using Pyodide and WebR WASM powered code execution environments.</div>
        <textarea id="query" rows="4" cols="30"></textarea>
    <div><button id="getsel">Get Selection</button><button id="execute-py">Execute Py</button><button id="execute-r">Execute R</button><button id="exepaste">Paste Result</button><button id="replacesel">Replace selection</button><button id="reset">Reset Py</button></div>
     <div id="output-type"></div>
    <div id="output"></div>

    </main>
</body>

</html>

import { WebR } from "https://webr.r-wasm.org/latest/webr.mjs";

window.addEventListener("DOMContentLoaded", async function () {
  const buttonpy = /** @type {HTMLButtonElement} */ (document.getElementById("execute-py"));
  const buttonr = /** @type {HTMLButtonElement} */ (document.getElementById("execute-r"));
 
  let pyodide = await loadPyodide();

  const webR = new WebR();
  await webR.init();

  const resetpybutton = /** @type {HTMLButtonElement} */ (document.getElementById("reset"));
  resetpybutton.addEventListener("click", async function () {
    pyodide = await loadPyodide();
  });

  // Execute py on button click.
  buttonpy.addEventListener("click", async function () {
    buttonpy.disabled = true;

    // Get the code from the editor.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    //const timestamp = document.getElementById("timestamp");
    //timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    document.getElementById("output-type").innerHTML = "Executing Py code...";
    try {
      const queries = document.getElementById("query").value;
      let output_txt = pyodide.runPython(queries);
      output.innerHTML = output_txt;
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
      //timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      buttonpy.disabled = false;
      document.getElementById("output-type").innerHTML = "Py code result:";
    }
  });

  // Execute R on button click.
  buttonr.addEventListener("click", async function () {
    buttonr.disabled = true;

    // Get the code from the editor.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    //const timestamp = document.getElementById("timestamp");
    //timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    document.getElementById("output-type").innerHTML = "Executing R code...";
    try {
      const queries = document.getElementById("query").value;
      let output_r = await webR.evalR(queries);
      let output_json = await output_r.toJs();
      output.innerHTML = JSON.stringify(output_json);
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
        document.getElementById("output-type").innerHTML = "R code result:";
      //timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      buttonr.disabled = false;
    }
  });

});

/*
 * Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

/* global document, Office, Word */

Office.onReady((info) => {
  if (info.host === Office.HostType.Word) {
    document.getElementById("sideload-msg").style.display = "none";
    document.getElementById("app-body").style.display = "flex";
    document.getElementById("exepaste").onclick = exePaste;
    document.getElementById("getsel").onclick = getSel;
    document.getElementById("replacesel").onclick = replaceSel;
  }
});

async function getSel() {
  await Word.run(async (context) => {
    // Get code from selection
    const selected = context.document.getSelection();
    selected.load("text");
    await context.sync();
    document.getElementById("query").value = selected.text;
  });
}

async function replaceSel() {
  await Word.run(async (context) => {
    // Replace selected code
    const selected = context.document.getSelection();
    const replace_text = document.getElementById("query").value;
    selected.insertText(replace_text, Word.InsertLocation.replace);
    await context.sync();
  });
}

async function exePaste() {
  await Word.run(async (context) => {
    var output = document.getElementById("output").innerHTML;
  const docBody = context.document.body;
  docBody.insertParagraph(
    output,
    Word.InsertLocation.end
  );
    await context.sync();
  });
}

DIY Microsoft Office Add-Ins – Postgres WASM in the Task Pane

Poking around the org’s Office 365 offering today, I noticed in Word the ability to add “Add-ins”, sidebar extensions essentially.

There are lots of them in the list, but many of them require some sort of enabling from OUr 364 admins, which rather begs the question of why Microsoft should be allowed to pump such stuff into our corporate Office environment; but I guess that’s another thing that being a customer of a behemoth means you have to suck up and accept.

Anyway… it looks like you can upload your own… (whether they’d work without approval I don’t know…)

Anyway, new to me, so how hard can it be to write one?

There are set-up docs and tutorials and API docs, and at its simplest, for a developer preview, you “just” need to install some node stuff for the build and then tinker with some HTML, js, and css.

# On the command line
npm install -g yo generator-office
yo office

# I also found this useful to kill the node webserver
# npx kill-port 3000

I selected the Office Add-in Task Pane project option, then Javascript, then Word.

As with most node things, this downloads the internet. For each new extension you build.

Unlike the battles I’ve had trying to build things for JupyterLab, it didn’t take much time to repurpose my pglite (minimal Postgres) demo, which runs a WASM-based version of Postgres in the browser, into something that runs it in the Word task pane:

The demo runs via a local server in the Word app on my desktop.

In the above example, a “for example” use case might be:

  • I’m writing some SQL teaching materials;
  • I knock up some SQL in the Word doc;
  • I select it, and click a button in the task pane that copies the selected text over to the task pane;
  • I click another button that runs the SQL against the postgres WASM app running in the task pane, and it displays the result;
  • if the query doesn’t work as intended, I can fettle it in the task pane until it does work;
  • if necessary, I click a button to replace the original (broken) SQL in the Word doc with the corrected SQL;
  • if required, I click a button and it grabs my SQL result table and pastes it as a Word table at the end of the doc (hmm, maybe this should be to the cursor location?)

Here’s the HTML for the task pane view:

<pre class="wp-block-syntaxhighlighter-code"><!-- Originally Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License. -->

<!DOCTYPE html>
<html>

<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>OUseful-pglite Task Pane Add-in</title>

    <!-- Office JavaScript API -->
    <a href="https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js">https://appsforoffice.microsoft.com/lib/1.1/hosted/office.js</a>

    <!-- Bring in an OUseful file -->
    <script type="module" src="./pglite.js"></script>

    <link rel="stylesheet" href="https://static2.sharepointonline.com/files/fabric/office-ui-fabric-core/11.0.0/css/fabric.min.css"/>

    <!-- Template styles -->
    <link href="taskpane.css" rel="stylesheet" type="text/css" />
</head>

<body class="ms-font-m ms-welcome ms-Fabric">
    <header class="ms-welcome__header ms-bgColor-neutralLighter">
        <h1 class="ms-font-su">ouseful.info WASM demo</h1>
    </header>
    <section id="sideload-msg" class="ms-welcome__main">
        <h2 class="ms-font-xl">Please <a target="_blank" href="https://learn.microsoft.com/office/dev/add-ins/testing/test-debug-office-add-ins#sideload-an-office-add-in-for-testing">sideload</a> your add-in to see app body.</h2>
    </section>
    <main id="app-body" class="ms-welcome__main" style="display: none;">
        <h2 class="ms-font-xl">WASM app running in sidebar...</h2>
        <div>Simple pglite application running in the browser. Example query: <tt>select * from test;</tt><br/><br/></div>
        <textarea id="query" rows="4" cols="30"></textarea>
    <div><button id="getsel">Get Selection</button><button id="execute">Execute</button><button id="exepaste">Paste Result</button><button id="replacesel">Replace selection</button></div>
    <div id="timestamp"></div>
    <div id="output"></div>
    <div id="results"></div>
    </main>
</body>

</html>

And the JS from my pglite demo:

// PGLite loader
import { PGlite } from "https://cdn.jsdelivr.net/npm/@electric-sql/pglite/dist/index.js";

// Initialize PGlite
const db = new PGlite();
// We can persist the db in the browser
//const db = new PGlite('idb://my-pgdata')

const DEFAULT_SQL = `
-- Optionally select statements to execute.

CREATE TABLE IF NOT EXISTS test  (
        id serial primary key,
        title varchar not null
      );

INSERT INTO test (title) values ('dummy');

`.trim();

async function createTable() {
  await db.query(DEFAULT_SQL);
  await db.query("select * from test").then((resultsX) => {
    console.log(JSON.stringify(resultsX));
  });
}
createTable();
window.addEventListener("DOMContentLoaded", async function () {
  const button = /** @type {HTMLButtonElement} */ (
    document.getElementById("execute")
  );

  // Execute SQL on button click.
  button.addEventListener("click", async function () {
    button.disabled = true;

    // Get SQL from editor.
    const queries = document.getElementById("query").value;

    // Clear any previous output on the page.
    const output = document.getElementById("output");
    while (output.firstChild) output.removeChild(output.lastChild);

    const timestamp = document.getElementById("timestamp");
    timestamp.textContent = new Date().toLocaleTimeString();

    let time = Date.now();
    console.log(`${queries}`);
    try {
      const results = await db.query(`${queries}`);
      //.then(results => {console.log("results are"+JSON.stringify(results))});

      const resultsDiv = document.getElementById("results");
      resultsDiv.innerHTML = "";
      const table = formatTable(results);
      formatRows(results, table);
      resultsDiv.appendChild(table);
    } catch (e) {
      // Adjust for browser differences in Error.stack().
      const report = (window["chrome"] ? "" : `${e.message}\n`) + e.stack;
      output.innerHTML = `<pre>${report}</pre>`;
    } finally {
      timestamp.textContent += ` ${(Date.now() - time) / 1000} seconds`;
      button.disabled = false;
    }
  });
});

function formatTable(results) {
  const table = document.createElement("table");

  const headerRow = table.insertRow();
  Object.keys(results[0]).forEach((key) => {
    const th = document.createElement("th");
    th.textContent = key;
    headerRow.appendChild(th);
  });
  return table;
}

function formatRows(results, table) {
  results.forEach((rowData) => {
    const row = table.insertRow();
    Object.values(rowData).forEach((value) => {
      const cell = row.insertCell();
      cell.textContent = value;
    });
  });
}

And the very task pane js:

/*
 * Originally Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

/* global document, Office, Word */

Office.onReady((info) => {
  if (info.host === Office.HostType.Word) {
    document.getElementById("sideload-msg").style.display = "none";
    document.getElementById("app-body").style.display = "flex";
    document.getElementById("exepaste").onclick = exePaste;
    document.getElementById("getsel").onclick = getSel;
    document.getElementById("replacesel").onclick = replaceSel;
  }
});


async function getSel() {
  await Word.run(async (context) => {
    // Get SQL from selection
    const selected = context.document.getSelection();
    selected.load("text");
    await context.sync();
    document.getElementById("query").value = selected.text;
    //const paragraph = context.document.body.insertParagraph("Running command: " + selected.text, Word.InsertLocation.end);
    await context.sync();

  });
}

async function replaceSel() {
  await Word.run(async (context) => {
    // Get SQL from selection
    const selected = context.document.getSelection();
    const replace_text = document.getElementById("query").value;
    selected.insertText(replace_text, Word.InsertLocation.replace);
    //const paragraph = context.document.body.insertParagraph("Running command: " + selected.text, Word.InsertLocation.end);
    await context.sync();
  });
}

async function exePaste() {
  await Word.run(async (context) => {
    // Get SQL from editor.
    //const txt = document.getElementById("results").innerHTML;
    //const paragraph = context.document.body.insertParagraph(txt, Word.InsertLocation.end);
    //var parser = new DOMParser();
    //var htmlDoc = parser.parseFromString(txt, "text/html");
    //var table = htmlDoc.getElementsByTagName("table")[0];
    var table = document.getElementById("results").firstChild;
    var rows = table.rows;
    var numCols = rows[0].cells.length;
    var numRows = rows.length;
    // Extract table data into a two-dimensional array
    var tableData = [];
    for (var i = 0; i < numRows; i++) {
      var rowData = [];
      for (var j = 0; j < numCols; j++) {
        rowData.push(rows[i].cells[j].innerText);
      }
      tableData.push(rowData);
    }

    // Insert the table into the Word document
    const wtable = context.document.body.insertTable(numRows, numCols, Word.InsertLocation.end, tableData);

    await context.sync();

  });
}

For completeness, here’s the css:

/* 
 * Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
 * See LICENSE in the project root for license information.
 */

 html,
 body {
     width: 100%;
     height: 100%;
     margin: 0;
     padding: 0;
 }
 
 ul {
     margin: 0;
     padding: 0;
 }
 
 .ms-welcome__header {
    padding: 20px;
    padding-bottom: 30px;
    padding-top: 100px;
    display: -webkit-flex;
    display: flex;
    -webkit-flex-direction: column;
    flex-direction: column;
    align-items: center;
 }

 .ms-welcome__main {
    display: -webkit-flex;
    display: flex;
    -webkit-flex-direction: column;
    flex-direction: column;
    -webkit-flex-wrap: nowrap;
    flex-wrap: nowrap;
    -webkit-align-items: center;
    align-items: center;
    -webkit-flex: 1 0 0;
    flex: 1 0 0;
    padding: 10px 20px;
 }
 
 .ms-welcome__main > h2 {
     width: 100%;
     text-align: center;
 }
 
 .ms-welcome__features {
     list-style-type: none;
     margin-top: 20px;
 }
 
 .ms-welcome__features.ms-List .ms-ListItem {
     padding-bottom: 20px;
     display: -webkit-flex;
     display: flex;
 }
 
 .ms-welcome__features.ms-List .ms-ListItem > .ms-Icon {
     margin-right: 10px;
 }
 
 .ms-welcome__action.ms-Button--hero {
     margin-top: 30px;
 }
 
.ms-Button.ms-Button--hero .ms-Button-label {
  color: #0078d7;
}

.ms-Button.ms-Button--hero:hover .ms-Button-label,
.ms-Button.ms-Button--hero:focus .ms-Button-label{
  color: #005a9e;
  cursor: pointer;
}

b {
    font-weight: bold;
}

      #editor-container {
        width: 100%;
        height: 20vh;
        border:lightgray;
        border-style: solid;
      }

      #vfs-container {
        margin-top: 0.5em;
        margin-bottom: 0.5em;
      }
      #timestamp {
        margin-top: 0.5em;
      }

      #output {
        display: flex;
        flex-wrap: wrap;
        width: 100%;
      }

      table {
        margin-top: 1em;
        margin-left: 0.4em;
        margin-right: 0.4em;
        border-collapse: collapse;
      }

      td, th {
        border: 1px solid #999;
        padding: 0.5rem;
        text-align: left;
      }

Next up, I think I’ll have a look at adding a simple pyodide / Python sandbox into the task pane, perhaps trying the “react” rather than javascript cookiecutter (I think that offers a different way of loading js and wasm assets?)

This would then allow a similar pattern of “author code in the Word doc, try it out, fix it, paste the fix back, paste the result back”. It’s not a reproducible pattern, but it does shorten the loop between code and output, and also means you can test the code (sort of) whilst authoring.

PS Hmm, I wonder, could I style the code in the Word doc, then extract just all the styled code, or “all above” styled code, and run it in the sidebar? One to try tomorrow, maybe…

Data Exfiltration Using Copilot in Edge?

So… I’m on a work machine, which has all sorts of gubbins that IT security folk put on it. But it’s also classed as an “autonomous” machine that I have root access on, which lets me install and build things. I’m running Microsoft’s Edge browser, although I forget where I installed it from. I’m pretty sure this would have been from the Edge download site, rather than the organisation’s software download catalogue.

The Edge browser is running a Copilot sidebar. I imagine this is something I might have opted into originally when CoPilot first became available in Edge.

I have a profile selected, but I’m not sure I’ve signed in. I have no direct recollection of knowingly using my current organisation password to sign in to the Edge browser at least since the last time I changed that password.

I’m looking at an OU authenticated page for an old module on the VLE website. This particular page is quite light on content in the central pane, and there is lots of navigation with meaningful titles, so the split between main text and navigational text is fairly even. I ask Copilot to summarise the page, which it does.

[UPDATE]: according to the IT folk, “It’s the enterprise version of Chat GPT, assuming you’re signed in the main advantage of which is it doesn’t store the prompt data, unlike Chat GPT. If signed in you should see this at the top of the panel”:

From the summary, it obviously has access to the page content, the page being an authenticated page.

So how does it do it?

There are several possibilities:

  • it generates the summary using a local model by taking a copy of the page content, passing it to the local model as context, and prompting for a summary; the contents of the web page stay on my machine; see, for example, Running AI Models in the Browser;
  • it looks at the URL of the page I am on, finds that on the public web, and uses that as context either in a local model (which would be a bit silly doing the remote lookup), or using a third party service. But the page I’m looking at is authenticated, so there is no online public version of it;
  • the browser sends the page URL to a remote server along with some of my credentials (e.g. the auth tokens/cookies set when I logged into the web page); the remote server logs in and pretends to be me, looks up the URL, extracts the page content and uses it as context; this would not be ideal…
  • the browser knows I’m me from my browser profile, and knows I’m me from logging into the web page, so it uses some sort of backend Microsoft federated auth thing to allow the summariser model access to the authed web page. Again, this would not be ideal…
  • the CoPilot sidebar grabs a copy of the page content and sends it to the remote service, where it is used as context for the summariser.

So… how can I check what the browser may or may not be phoning home?

Most browsers make a powerful debugging environment available in the form of a Developer Tools or Dev Tools application.

If you want to know anything about a web page, what it’s loading, what it’s storing, what it’s phoning home, this is where to look.

So for example, I can look at the network traffic.

This tells me what the page is doing, but not the CoPilot sidebar.

However, if I right click in the sidebar, I get an option to Inspect.

Which opens up devtools for the extension.

Which shows us that the sidebar talks to Bing…

…and can phone the page contents back there:

We also seem to have content chunked on linebreaks:

If I copy the webcontents, we can trivially find our module content in there:

So… can I do the same in private browsing? Nope… it seems that CoPilot is not available in that view?

How about if I create a new browser profile without any credentials? Using CoPilot to view the authenticated page, I get:

Ah, so… maybe I had accepted that previously when I first tried out CoPilot in Edge months ago.

I let that dialogue time out and disappear without me accepting either way, and ask for a page summary. The dialogue reappears briefly and then I get:

so it seems that my organisation is happy for me to be using CoPilot? If I accept, will the page get a summary?

I get the dialogue, again, asking me to allow Microsoft access to page content. (The timeout without me answering is an implied not yes, not no…)

Let’s quit that profile and start a new one… I’ll visit a public page – a BBC news page for example, and ask CoPilot for a summary. I get the Allow Microsoft to Access Page Content dialogue, and accept. I ask for a page summary, and a copy of the page content is sent to Microsoft. I’m not sure what the BBC web page license conditions are when it comes to me sending a copy of their content to a third party?

Maybe because I am introducing third party content into the chat, the chat disables chat history before going any further. This may or may not have anything to do with sharing arbitrary content from a viewed page with Microsoft/Bing.

Note that I seem to have granted Microsoft access to all pages I now view, not just the previous page or the previous domain. This is unlike a cookie notice, which typically applies to a domain.

If I now auth into my organisational site, view a VLE page and ask for a summary, I get one.

I don’t notice any prompt this time about whether my org has given me access to CoPilot.

However, by creating a new profile, with no contact details or other history, it seems that I can access CoPilot, grant it permission to read page content, log in to an authenticated site, and send its content off to whatever chat model CoPilot/Bing is running. Or use a profile where I perhaps once in the past granted CoPilot a universal privilege, and henceforth don’t get any challenges about sharing content from other domains back to Bing.

I guess the next step to watch out for, e.g. if CoPilot starts allowing me to add third-party “agents” to my CoPilot experience, will be: can I get Bing to then send the page content to an arbitrary third-party service?

PS I wonder… could we have some content in each VLE page in a hidden CSS style that can act as a prompt injection thing when CoPilot uses the page content as context? Add additional guardrails, nudge CoPilot into a particular role, get it to give users a warning about not cheating etc etc?

PPS tinkering a bit more, in a conversation that already included a previous prompt asking to summarise an (authed) page, I can:

  • load a new authed page (I note: this was in the same tab and the result of a click through on a link in that page to a new page on the same domain; I haven’t yet tried to see if changing browser tabs and loading a page in a new tab results in that page being pre-emptively sent to Bing without further prompting in the CoPilot window; or if it will pre-emptively send content from a page on a new domain if I explicitly enter a url to a new / authed domain in a tab in which I had just summarised an unauthed public page);
  • CoPilot will send the content back to Bing before I make any further prompts.

This means that I can visit a page that contains personal information and without entering anything further into the CoPilot dialogue, it will send the contents of the page to Bing. And yes, I do have screenshots.

Running AI Models in the Browser

AI qua LLMs, multimodal models, etc. seem likely to be deployed, for the short term at least, in a couple of ways:

  • as (part of) server based services, running really big models on ridiculous hardware, metered and paid for (e.g. by the token) or as part of some sort of subscription plan;
  • as (part of) local services running either on your own device, or directly in the browser*.

* Running in the browser does not necessarily mean requiring a network connection once the app has loaded. Many applications that run in the browser have essentially been downloaded to the browser and run just as if they were an app you had installed on your desktop. As long as you don’t close the browser tab, the app will continue to work. And in some cases, even if you do close the tab and then open it again, the app may be able to reload from the browser cache, without requiring network access to the original website the app appears to be running from. See for example progressive web apps.

For example, consider Google’s recently released magika model-based app (MBA?), which attempts to detect what sort of file a particular file might be purely from its contents. You can see how running this purely in the browser might be useful for letting the browser check that an uploaded file is of an appropriate type before sending it to a remote server. Here’s a demo: magika.

In terms of distance education, running apps in the browser has several advantages:

  • works on all platforms that run a browser capable of running the model (so desktop machines, laptops, Chromebooks, tablets, phones);
  • works offline once the model site is loaded;
  • may be “installable”/persistent if deployed as a progressive web app (PWA);

On the downside:

  • may be slow on first run as the model is downloaded (or on each run if not cached);
  • contributes to hiding large files on your machine somewhere that have been downloaded by the browser;
  • if multiple sites use the same model, your browser may be stashing multiple copies of the same model, once per website;
  • ties you to using a particular browser on a particular machine if you want to keep reusing the same app, e.g. because you have saved files from it to browser storage associated with that app and that browser.

I’m pretty sure that browser sync does not sync browser storage. If you sign in to e.g. Google from Chrome, files saved to browser storage will not get synced too unless the app you are running supports file syncing somehow.

Architecturally, when designing learning systems, why might it be useful to be able to run models in the browser?

One reason is that we can run a service without having to run a service. The OU model for distributing teaching and learning materials is largely based around providing students access to contentful HTML webpages via the VLE. If the VLE ships a text summariser model to a student’s browser, the students can ask the model to summarise or redescribe the content in the page without making a further request to an OU server or calling on the need for the OU to centrally provide an on-the-fly summarisation service. The analytics vultures might complain, of course, in their desire to track and monitor anything they can, just in case the data might be useful (or commercially valuable in some way, including as a uniquely available asset used to bolster the weight of research bids.

From the student perspective, this means they can do things anonymously, unless, of course, the analytics vultures are tracking actions and sending telemetry logs back to base. To try to shape how students might make use of a model, I could imagine some folk wanting to bake in custom guardrails trained into the model that don’t just reply with “I don’t do that” to “talk sexy to me” requests, but also get snarky if you obviously try something that looks like pasting an assessment question in and expecting an answer.

Another advantage of shipping things like a summariser service to students is to stop them using third party services that do the same thing. I’m not sure if the Copilot sidebar in Edge is phoning home the content of my authed pages, or whether it’s using a local model to summarise content from the page, but something is getting access to the authed page content… The screenshot below is from a module presentation several years ago.

So how do things stand at the moment for running models in the browser? And what’s required?

To use a model in a browser, we need two things: something to run the model, and the model itself.

One of the ways of packaging applications so that they can be deployed efficiently in a browser is to use a WASM (web-assembly) app. WASM apps can in principle run anywhere, although they are often custom built for running in a particular context, such as in a web page, on a node server or as a callable module from another programming language.

One early tool for running models in the browser, and one that is still under active development, is Google’s TensorFlow Lite Javascript library. TensorFlow Lite can be used as part of Google’s MediaPipe framework for delivering in-browser and on-device machine learning/AI applications, which is to say, letting you run apps on your phone or in your browser.

A recent web demo shows how you can run Large Language Models On-Device with MediaPipe and TensorFlow Lite. The model needs to be downloaded to your desktop and then uploaded to the browser page that runs the demo. But note that the model doesn’t leave your machine (you don’t need a network connection to run inference): it runs purely within the browser.

An even simpler demo can be run locally from a trivial HTML file and a tiny bit of JavaScript that loads the model in directly: web demo code for use with a local webserver. I’m not sure what the context length is, or how good the model is at summarising. The chat can be a bit hit and miss, and also seems to have quite a few internal guardrails.

<!-- Via https://github.com/googlesamples/mediapipe/blob/main/examples/llm_inference/js/index.html -->
<html lang="en">
<head>
    <title>LLM Inference Web Demo</title>
</head>
<body>
    Input:<br />
    <textarea id="input" style="height: 300px; width: 600px"></textarea><br />
    <input type="button" id="submit" value="Get Response" disabled /><br />
    <br />
    Result:<br />
    <textarea id="output" style="height: 300px; width: 600px"></textarea>
    <script type="module" src="index.js"></script>
</body>
</html>
//Via: https://github.com/googlesamples/mediapipe/blob/main/examples/llm_inference/js/index.js

import {FilesetResolver, LlmInference} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

const input = document.getElementById('input');
const output = document.getElementById('output');
const submit = document.getElementById('submit');

const modelFileName = 'gemma-2b-it-gpu-int4.bin'; /* Update the file name */

/**
 * Display newly generated partial results to the output text box.
 */
function displayPartialResults(partialResults, complete) {
  output.textContent += partialResults;

  if (complete) {
    if (!output.textContent) {
      output.textContent = 'Result is empty';
    }
    submit.disabled = false;
  }
}

/**
 * Main function to run LLM Inference.
 */
async function runDemo() {
  const genaiFileset = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');
  let llmInference;

  submit.onclick = () => {
    output.textContent = '';
    submit.disabled = true;
    llmInference.generateResponse(input.value, displayPartialResults);
  };

  submit.value = 'Loading the model...'
  LlmInference
      .createFromOptions(genaiFileset, {
        baseOptions: {modelAssetPath: modelFileName},
        // maxTokens: 512,  // The maximum number of tokens (input tokens + output
        //                  // tokens) the model handles.
        // randomSeed: 1,   // The random seed used during text generation.
        // topK: 1,  // The number of tokens the model considers at each step of
        //           // generation. Limits predictions to the top k most-probable
        //           // tokens. Setting randomSeed is required for this to make
        //           // effects.
        // temperature:
        //     1.0,  // The amount of randomness introduced during generation.
        //           // Setting randomSeed is required for this to make effects.
      })
      .then(llm => {
        llmInference = llm;
        submit.disabled = false;
        submit.value = 'Get Response'
      })
      .catch(() => {
        alert('Failed to initialize the task.');
      });
}

runDemo();

It is also possible to run image generation models such as Stable Diffusion purely in the browser (note that the first run may be quite slow as the large model is loaded into the browser cache): https://github.com/mlc-ai/web-stable-diffusion/ .

Another tool for running models in the browser is Microsoft’s ONNX Runtime (repo). ONNX is an open standard for representing models, and is supported by various model frameworks including TensorFlow, PyTorch and scikit-learn. The ONNX Runtime Web package is a JavaScript library for running models in the browser using WASM (CPU) or WebGL (GPU) (see ONNX Runtime Web—running your machine learning model in browser for a good introduction).
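By way of illustration, a minimal sketch of the onnxruntime-web flow looks something like the following; the model path, input name and tensor shape are placeholders that depend entirely on whichever model you have exported, so treat this as the shape of the API rather than working code for any particular model.

// Sketch: minimal onnxruntime-web usage.
// Model path, input name and tensor shape are illustrative placeholders.
import * as ort from 'onnxruntime-web';

// Use the WASM (CPU) execution provider; 'webgl' is the GPU alternative.
const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['wasm'],
});

// Dummy input tensor; real code would pass actual feature values.
const input = new ort.Tensor('float32', new Float32Array(4), [1, 4]);
const results = await session.run({ input });
console.log(results);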

A simple ONNX Runtime Web demo describes how to locally serve a Whisper speech-to-text model for use in the browser.

The ONNX runtime is also used by transformers.js to run models in the browser. For example, using a Whisper speech-to-text transcription model in the browser against an audio file URL, an uploaded audio file, or microphone input: https://huggingface.co/spaces/Xenova/whisper-web. Or how about a text-to-speech model running in the browser: Xenova/text-to-speech [repo]?
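The transformers.js side of that sort of thing is pleasingly compact. The following is a hedged sketch (the model name and audio URL are placeholders) of in-browser transcription using the pipeline API.

// Sketch: in-browser speech-to-text with transformers.js.
// Model name and audio URL are illustrative placeholders.
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// The first call downloads the model into the browser cache.
const transcribe = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');

const result = await transcribe('https://example.com/some-audio.wav');
console.log(result.text);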

Fragment — Apps in Notebooks

Almost ten years ago, when we were first exploring the possible use of IPython notebooks, as they were then known, for the data management and analysis course that is still running today, I posted about an interactive JavaScript app that let you perform drag-and-drop mediated pivot table operations on a table rendered from a pandas dataframe, using Nicolas Kruchten’s pivottable.js package (Pivot Tables, pandas and IPython Notebooks).

One of the advantages of the embedded app was (is) that it allowed you to manipulate the data directly. One of the disadvantages is that your interactions are lost and that you lose reproducibility, which is one of the motivating reasons for using notebooks in the first place.

To make the pivottable-in-a-notebook properly useful in an educational context, as well as in an exploratory data analysis context, we need to be able to export to a code cell the code that mimics the transformation the user has applied.

For example, in the pivottable, if you drag some row and column labels around, you should be able to either copy, or insert into the following code cell (half-way example), the pandas code that performs a similar transformation given the original table.

I still don’t think this is supported?

Then you can use the pivottable for interactive exploration, or as a doodle pad, until the table looks the way you want it, and then generate the code that will perform that operation on your original data to give you a dataframe whose structure matches the view you can see in the interactive pivot table.

This would be more fun, more direct, more energy efficient, and less likely to make stuff up than using overkill genAI models that try to parse your crappy verbal instructions or decode a crappy sketch that you have drawn. And it would only require a few rules and some basic JavaScript to work.
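As a rough illustration of what those “few rules” might look like: pivottable.js exposes the current UI state via an onRefresh callback, and the sketch below (the aggregator mapping and function names are my own guesses, not part of either library) turns that config into a candidate pandas pivot_table() call that could be copied into a code cell.

// Sketch: translate a pivottable.js UI config into a pandas code string.
// The aggregator mapping and naming here are illustrative guesses.
const AGG_MAP = {
  'Count': 'count',
  'Sum': 'sum',
  'Average': 'mean',
  'Minimum': 'min',
  'Maximum': 'max',
};

function pivotConfigToPandas(config, dfName = 'df') {
  const quote = (names) => '[' + names.map((n) => `"${n}"`).join(', ') + ']';
  const aggfunc = AGG_MAP[config.aggregatorName] || 'count';
  return `${dfName}.pivot_table(index=${quote(config.rows)}, ` +
         `columns=${quote(config.cols)}, values=${quote(config.vals)}, ` +
         `aggfunc="${aggfunc}")`;
}

// Hooked up via pivotUI's onRefresh option, e.g.:
// $('#pivot').pivotUI(data, {
//   onRefresh: (config) => console.log(pivotConfigToPandas(config)),
// });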

Today I note the latest release of Marc Wouts’ iTables package (Make your Pandas or Polars DataFrames Interactive with ITables 2.0). This is similar to the pivottable package except that, rather than displaying your pandas dataframe in a pivot table app, it renders it as an interactive table using the DataTables JavaScript package.

One thing that looks particularly nice about this package is that it works in things like Shiny and Quarto as well as Jupyter notebooks, Jupyter Book, VS Code, &c.

Again, the tool lets you do all sorts of interactive manipulations over an interactive HTML rendering of the contents of a provided pandas dataframe; and again, it would be more useful for anything other than ephemeral exploratory use if you could use it to interactively discover and generate some reproducible pandas code that implements the overall transformation a user has applied to the original pandas dataframe to get to the view they are currently looking at.

Again, this is a more direct and useful thing to do than pratting around with, and prattling on about, crappy and pointless genAI interactions.

And again, it’s something you can probably do with a few rules rather than a giga- or a billion of anything.