Classic Jupyter Notebook Branding Hack

Noting that another OU module has just started using classic Jupyter notebooks, with a deployment route based on students creating a miniconda environment, installing the notebook server and an additional R kernel into that environment, and then running the Jupyter notebook server from it, I tried a quick hack to package up a simple Python extension to brand the notebook server.

The trick I use for branding our deployed environments is simply to use a custom logo and custom stylesheet. These need placing in a custom directory on the Jupyter server environment path. Getting the files into the correct place caused some issues, notably:

  • the jupyter_core package, in recent versions, has various routes to finding Jupyter config directories via jupyter_core.paths. But older versions don’t seem to support that, so updating jupyter_core simply to add a logo feels a bit risky to me;
  • getting the path to the environment is a faff, and can be sensitive to the environment in which you run the jupyter --paths command to discover paths. The hack I came up with was to install by default into the “generic” ~/.jupyter path, with an option to pass in a conda environment name and then try to find a path that relates to that (and so far, only tested on a Mac).
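For what it’s worth, in a recent jupyter_core the config directories can be listed programmatically, which is where the custom/ directory (containing the logo and custom.css) needs to end up. A minimal sketch, assuming a recent jupyter_core:

from jupyter_core.paths import jupyter_config_dir, jupyter_config_path

# The "home" Jupyter config directory (typically ~/.jupyter);
# the custom/ directory containing the logo and custom.css goes in here
print(jupyter_config_dir())

# The full list of config directories on the path, including environment
# level ones, which is where the conda environment sensitivity comes in
print(jupyter_config_path())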

Anyway, the repo is currently at https://github.com/ouseful-testing/classic-nb-ou-branding and can be installed as

pip install --upgrade  git+https://github.com/ouseful-testing/classic-nb-ou-branding.git

ou_nb_branding install --conda ENVIRONMENT_NAME

The ENVIRONMENT_NAME is the name of the conda environment you want to install into, if you installed and are running the Jupyter server in a conda environment. Using the M348 defaults, this means running the following command before you start the Jupyter server:

ou_nb_branding install --conda r_env

Getting the path to install into is a hack, so it may break (plus, I only tested on Mac).

Without the --conda switch, the installation defaults to whatever home Jupyter environment path is found. So in a simple environment, installing the custom branding to the default path just means running:

ou_nb_branding install

For branding JupyterLab (and I have no idea if this works for retrolab/new notebook etc.), see https://github.com/innovationOUtside/jupyterlab_ou_brand_extension

Creating Terminal Based Screencast Movies With VHS

Via a fun blogpost from Leigh Dodds — Recreating sci-fi terminals using VHS — and the fascinating repo behind it, I come across charmbracelet/vhs, a nifty package for creating animated gif based screencasts / movies of terminal based interactions.

At first, I thought this was just about creating fake replays from a canned script, which obviously has issues if you are looking to create a reproducible and true video of an actual terminal interaction…

But then, looking at Leigh’s repo, it suddenly became clear to me that this was just a hack based on using a real terminal, and that the script can be used to script, and record, actual terminal interactions. Doh!

So for example, Leigh’s .tape script to create the above video looks like this:

Output gifs/jurassic-park-nedry.gif
Output gifs/jurassic-park-nedry.mp4

Set FontSize 14

Set Width 800
Set Height 600

Set Theme { "name": "Neo", "black": "#0D0208", "red": "#ef6487", "green": "#5eca89", "yellow": "#fdd877", "blue": "#65aef7", "magenta": "#aa7ff0", "cyan": "#43c1be", "white": "#ffffff", "brightBlack": "#0D0208", "brightRed": "#ef6487", "brightGreen": "#5eca89", "brightYellow": "#fdd877", "brightBlue": "#65aef7", "brightMagenta": "#aa7ff0", "brightCyan": "#43c1be", "brightWhite": "#ffffff", "background": "#203085", "foreground": "#ffffff", "selection": "#000000", "cursor": "#5eca89" }

# Change the default prompt before we start the script
Hide
Type "source jurassic-park-nedry.sh" Enter
Type "export PS1=''" Enter
Ctrl+L
Type "Jurassic Park, System Security Interface" Enter
Show

Type@100ms "access security"
Sleep 1s
Enter
Sleep 1s
Type@100ms "access security grid"
Enter
Sleep 1s
Type@100ms "access main security grid"
Enter  

Sleep 5s
Ctrl+U

Those are actual commands, as you can see from the source jurassic-park-nedry.sh typed command. What that does is load in a script that creates definitions for “commands” that appear later in the tape script (i.e. commands that might not be recognised by the terminal, or that we might want to override):

Jurassic() {
 export PS1='> '
 echo "Version 4.0.5, Alpha E";
 echo "Ready...";
}

access() {
 sleep 0.5;
 case $1 in
   security)
     echo "access: PERMISSION DENIED.";
     ;;
   main)
     echo "access: PERMISSION DENIED....and....";
     for i in {1..100}
     do
     	echo "YOU DIDN'T SAY THE MAGIC WORD!";
        sleep 0.1;
     done
     ;;
esac
}

So for example, when the command Jurassic is entered into the real terminal, we want a particular echoed response. Similarly when the access commands are Typed.

If you have Docker installed, the simplest way to run vhs is probably via a container. For example, hacking the following simple .tape script together:

Output demo.gif

# Set up a 1200x600 terminal with 46px font.
Set FontSize 46
Set Width 1200
Set Height 600
Set TypingSpeed 100ms

# Type a command in the terminal.
Type "echo 'A simple VHS demo...'"

# Pause for dramatic effect...
Sleep 2s

# Run the command by pressing enter.
Enter

Type "mkdir -p demo"
Sleep 500ms
Enter
Sleep 1s

Type@100ms "ls demo"
Sleep 500ms
Enter
Sleep 1s

Type "echo 'Hello world' > demo/test.txt"
Enter

Sleep 500ms

Type@1s "ls demo"
Enter
Sleep 2s

Type "cat demo/test.txt"
Enter

Sleep 500ms

Type "echo '....and that is the end of the demo...'"

Sleep 1s

Enter
Sleep 3s

I can generate an animated gif as easily as typing:

docker run --rm -v $PWD:/vhs ghcr.io/charmbracelet/vhs MYTAPE.tape

I’ve also started wondering about using a Jupyter notebook to draft the interaction, then some sort of simple script to convert the notebook code cells to .tape instructions, perhaps with suitable delays…
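As a first pass, something like the following (untested) sketch might work, using nbformat to read the notebook and emit a Type / Enter / Sleep sequence for each code cell; quoting and multi-line commands would need more care:

import nbformat

def nb_to_tape(nb_path, pause="1s"):
    """Generate vhs .tape instructions from the code cells of a notebook."""
    nb = nbformat.read(nb_path, as_version=4)
    lines = ["Output demo.gif", "Set TypingSpeed 100ms", ""]
    for cell in nb.cells:
        if cell.cell_type != "code":
            continue
        # Type each line of the code cell as if it were entered at the prompt
        for src_line in cell.source.splitlines():
            lines.append(f'Type "{src_line}"')
            lines.append("Enter")
        lines.append(f"Sleep {pause}")
    return "\n".join(lines)

# Example call:
#print(nb_to_tape("demo.ipynb"))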

It might also be possible to spoof the notebook interaction, by echoing the commands in the code cell as if they were real commands and then echoing the code cell output? Then you could use a pre-run notebook as the basis for a tape that replayed the “mocked” taped console commands from the notebook and the actual outputs from the notebook code cells?

In Search of JupyterLab Workspaces Config Files

Way, way, way back when JupyterLab was first mooted, I raised concerns about the complexity of an IDE vs the simplicity of the notebook document views, and was told not to worry because there would be an easily customisable workspace facility whereby you could come up with a custom layout, save a config file and then ship that view of the IDE.

This would have been brilliant for educational purposes, because it means you can customise and ship different desktop layouts for different purposes. For example, in one activity, you might ship an instructional rendered markdown document with just text based instructions in one panel, and a terminal in the panel next to it.

In another, you might pre-open a particular notebook, and so on.

In a third, you might ship a pre-opened interactive map preview, and so on.

What this does is potentially reduce the time to get started on an activity: “load this and read what’s in front of you” rather than “just load this, then open that thing from the file menu, in that directory, and then open that other thing, then click on the top of that thing and drop it so that you get two things side by side, then…” etc etc.

Whilst workspaces were available early on, they never had more than a rare demo, very little documentation and no tutorials, so it was really tricky to find a way into experimenting with them, let alone using them.

Anyway, I’ve started trying again, and again fall at the first hurdle: there are no menu options I can see (no Save Workspace, no Save Workspace As…, no Open Workspace, no Reset Workspace, etc.), but there are some URLs for cloning one workspace from another (but not for cloning the current default workspace into another one that you can then name?).

It’s also not obvious where the workspace config files are. I think their default home is /home/jovyan/.jupyter/lab/workspaces/, but they look fiddly to hand craft. Also, I’m not sure if/where/how you can share a custom default workspace file (eg how should it be named; what path does it need to be on?).

Creating a custom config file manually (which IIRC was also one of the early promises) is probably also not an option; the best way seems to be to mess around with the panels in a workspace until you get them arranged just so, then poke around the /home/jovyan/.jupyter/lab/workspaces/ directory until you find one that looks like it encodes the space you want.
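For example, a quick (untested) sketch for poking around that directory from a notebook, assuming the workspace files are the JSON .jupyterlab-workspace files I think they are:

import json
from pathlib import Path

workspace_dir = Path("/home/jovyan/.jupyter/lab/workspaces")

for ws_file in sorted(workspace_dir.glob("*.jupyterlab-workspace")):
    ws = json.loads(ws_file.read_text())
    # The metadata "id" seems to identify the workspace; the keys of the
    # "data" dict seem to name the widgets / documents open in the layout
    print(ws_file.name, "-", ws.get("metadata", {}).get("id"))
    for key in ws.get("data", {}):
        print("   ", key)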

Having found myself defaulting to VS Code more and more, using VS Code Jupyter notebooks running against Jupyter kernels inside local Docker containers, as well as spending time in RStudio again (accessing quarto workflows etc.), the main thing I get from Jupyter UIs seems to be frustration:-(

PostgreSQL Running in the Browser via WASM

I’ve previously written about WASM support for in-browser SQLite databases, as well as using the DuckDB query engine to run queries over various data sources from a browser based terminal (for example, SQL Databases in the Browser, via WASM: SQLite and DuckDB and Noting Python Apps, and Datasette, Running Purely in the Browser), but now it seems we can also get access to a fully blown PostgreSQL database in the browser via snaplet/postgres-wasm (announcement; demo; another demo). At the moment, I don’t think there’s an option to save and reload the database as a browser app, so you’ll need to initially load it into a tab from either a local or remote webserver (so it’s not completely serverless yet…).

Key points that jump out at me from the full demo:

  • you get a psql terminal in the browser that lets you run psql commands as well as SQL queries;
  • you can save and load the database state into browser storage:

You can also save and load the database state to/from a desktop file.

  • a web proxy service is available that lets you query the database from a remote connection; that is, the db running in your browser can be exposed via a web proxy and you can connect to it over the network. For example, I connected to the proxy from a Python Jupyter kernel running in a Docker container on my local machine; the database was running in a browser on the same machine.
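By way of example, here’s a minimal sketch of the sort of thing I mean, with made-up connection details (use whatever host, port and credentials the proxy reports):

import pandas as pd
import psycopg2

# Placeholder connection details - replace with the ones the web proxy gives you
conn = psycopg2.connect(host="PROXY_HOST", port=5432,
                        user="postgres", password="postgres",
                        dbname="postgres")

# Run a query against the in-browser database from the remote Python kernel
df = pd.read_sql("SELECT datname FROM pg_database;", conn)
print(df)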

From an educational perspective, the advantage of having access to a fully blown DBMS engine, rather than just a simple SQLite database, for example, is that you get access not just to the psql command line, but also to database management tools such as roles and role based permissions. Which means you can teach a lot more purely within the browser.

Note that I think a webserver is still required to load the environment until such a time as a PWA/progressive web app version is available (I don’t think datasette-lite is available as a PWA yet either? [issue]).

In terms of managing a learning environment, one quick and easy way would be to run two open browser windows side by side: one containing the instructional material, the other containing the terminal.

Questions that immediately come to mind:

  • What’s the easiest way to run the proxy service on localhost?
  • Is it in-principle possible for the database server to run as a browser app and publish its service from there onto the localhost network?
  • Is there an in-principle way for the database server to run in one browser tab and expose itself to a client running in another browser tab?
  • Can you have multiple connections onto the same browser storage persisted database from clients open in different/multiple tabs, or would you have to hack the common storage by repeatedly saving and loading state from each tab?
  • At the moment, we can’t connect to postgres running wheresoever from an in-browser Python/Pyodide environment (issue), which means we can’t connect to it from a JupyterLite Python kernel. Would it be possible to create some sort of JupyterLite shim so that you could load a combined JupyterLite+postgres-wasm environment to give you access to in-browser postgres db storage via JupyterLite notebook scripts?
  • How easy would it be to fork the jupyterlite/xeus-sqlite-kernel to create a xeus-postgres-wasm kernel? How easy would it be to also bundle in pandas and some SQL magic for a pandas/postgres hybrid (even if you had access to no other Python commands than pd. methods, and what would that feel like to use?!), along with support for pandas plots/charts?
  • How easy would be to wire in some custom visual chart generating Postgres functions?!
  • With a python-postgres-wasm, could you support the creation of Postgres/Python custom functions?

It could be so much fun to work on a computing course that tried to deliver everything via the browser using a progressive web app, or at most a web-server…

Fragment: On Failing to Automate the Testing of a Jupyter Notebook Environment Using playwright and Github Actions

One of those days (again) when I have no idea whether what I’m trying to do is possible or not, or whether I’m doing something wrong…

On my local machine, everything works fine:

  • a customised classic notebook Jupyter environment is running in a Docker container on my local machine, with a port mapped on the host’s localhost network;
  • the playwright browser automation framework is installed on the host machine, with a test script that runs tests against the Jupyter environment exposed on the localhost network;
  • the tests I’m running are to log in to the notebook server, grab a screenshot, then launch a Jupyter server proxy mapped OpenRefine environment and grab another screenshot.

So far, so simples: I can run tests locally. But it’d be nice to be able to do that via Github Actions. My understanding of Github Actions is that I should be able to run my notebook container as a service container, and then use the same playwright script as for my local tests.

But it doesn’t seem to work:-( [UPDATE: seems like the image was borked somehow; I updated that image (using the same tag)… Connection to the server works now.]

Here’s my playwright script:

import { test, expect, PlaywrightTestConfig } from '@playwright/test';

// Create test screenshots
// npx playwright test --update-snapshots


// Allow minor difference - eg in the new version
const config: PlaywrightTestConfig = {
  expect: {
    toHaveScreenshot: { maxDiffPixels: 100 },
  },
};
export default config;

test('test', async ({ page }) => {
  // Go to http://localhost:8888/login?next=%2Ftree%3F
  await page.goto('http://localhost:8888/login?next=%2Ftree%3F');
  await expect(page).toHaveScreenshot('notebook-login-page.png');
  // Click text=Password or token: Log in >> input[name="password"]
  await page.locator('text=Password or token: Log in >> input[name="password"]').click();
  // Fill text=Password or token: Log in >> input[name="password"]
  await page.locator('text=Password or token: Log in >> input[name="password"]').fill('letmein');
  // Press Enter
  await page.locator('text=Password or token: Log in >> input[name="password"]').press('Enter');
  await expect(page).toHaveURL('http://localhost:8888/tree?');
  await expect(page).toHaveScreenshot('notebook_homepage-test.png');
  // Click text=New Toggle Dropdown
  await page.locator('text=New Toggle Dropdown').click();
  await expect(page).toHaveScreenshot('notebook_new-test.png');
  // Click a[role="menuitem"]:has-text("OpenRefine")
  const [page1] = await Promise.all([
    page.waitForEvent('popup'),
    page.locator('a[role="menuitem"]:has-text("OpenRefine")').click()
  ]);
  await page1.waitForSelector('#right-panel');
  await expect(page1).toHaveScreenshot('openrefine-homepage.png');
});

And here’s my Github Action script:

name: test

on:
  workflow_dispatch:

# This job installs dependencies and runs the tests
jobs:
  notebook-tests:
    runs-on: ubuntu-latest
    services:
      jupyter-tm351:
        image: ouvocl/vce-tm351-monolith:22j-b2
        ports:
        - 8888:8888
    steps:
    - uses: actions/checkout@v3

    - uses: actions/setup-node@v3
      with:
        node-version: 16
    - name: Install playwright dependencies
      run: |
        npx playwright install-deps
        npx playwright install
        npm install -D @playwright/test
    # Test setup
    - name: Test screenshots
      run: |
        npx playwright test

It’s been so nice not doing any real coding over the last few weeks. I realise again I don’t really have a clue how any of this stuff works, and I don’t really find code tinkering to be enjoyable at all any more.

FWIW, test repo I was using is here. If you spot what I’m doing wrong, please let me know via the comments… FIXED NOW – SEEMS TO WORK

PS hmmm… maybe that container was broken in starting up… not sure how; I have a version running locally?? Maybe my local tag version is out of synch with pushed version somehow? FIXED NOW – SEEMS TO WORK

PPS I just added another action that will generate gold master screenshots and temporarily stash them as action artefacts. Much the same as the above, except with an npx playwright test --update-snapshots step to create the gold master screenshots (which assumes things are working as they should…) and then a step to save the screenshot image artefacts.

name: generate-gold-master-images

on:
  workflow_dispatch:

# This job installs dependencies and generates gold master screenshots
jobs:
  notebook-tests:
    runs-on: ubuntu-latest
    services:
      jupyter-tm351:
        image: ouvocl/vce-tm351-monolith:22j-b2
        ports:
        - 8888:8888
    steps:
    - uses: actions/checkout@v3

    - uses: actions/setup-node@v3
      with:
        node-version: 16
    - name: Install playwright dependencies
      run: |
        npx playwright install-deps
        npx playwright install
        npm install -D @playwright/test

    # Generate screenshots
    - name: Generate screenshots
      run: |
        npx playwright test --update-snapshots

    - uses: actions/upload-artifact@v3
      with:
        name: test-images
        path: tests/openrefine.spec.ts-snapshots/

Fragment: Is English, Rather than Maths, a Better Grounding for the Way Large Areas of Computing are Going?

Or maybe Latin…?

Having a quick play with the spacy natural language processing package just now, in part because I started wondering again about how to reliably extract “facts” from the MPs’ register of interests (I blame @fantasticlife for bringing that to mind again; data available via a @simonw datasette here: simonw/register-of-members-interests-datasette).

Skimming over the Linguistic Features docs, I realise that I probably need to brush up on my grammar. It also helped crystallise out a bit further some niggling concerns I’ve got about what form practical computing is taking, or might take in the near future, based on the sorts of computational objects we can readily work with.

Typically when you start teaching computing, you familiarise learners with the primitive types of computational object that they can work with: integers, floating point numbers, strings (that is, text), then slightly more elaborate structures, such as Python lists or dictionaries (dicts). You might then move up to custom defined classes, and so on. What sort of thing a computational object is largely determines what you can do with it, and what you can extract from it.

Traditionally, getting stuff out of natural language, free text strings has been a bit fiddly. If you’re working with a text sentence, represented typically as a string, one way of trying to extract “facts” from it is to pattern match on it. This means that strings (texts) with a regular structure are among the easier things to work with.

As an example, a few weeks ago I was trying to reliably parse out the name of a publisher, and the town/city of publication, from free text citation data such as Joe Bloggs & Sons (London) or London: Joe Bloggs & Sons (see Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing). A naive approach to this might be to try to write some templated regular expression pattern matching rules. A good way of understanding how this can work is to consider their opposite: template generators.

Suppose I have a data file with a list of book references; the data has a column for PUBLISHER, and one for CITY. If I want to generate a text based citation, I might create different rules to generate citations of different forms. For example, the template {PUBLISHER} ({CITY}) might display the publisher and then the city in brackets (the curly brackets show we want to populate the string with a value pulled from a particular column in a row of data); or the template {CITY}: {PUBLISHER} might display the city, followed by a colon, and then the publisher.

If we reverse this process, we then need to create a rule that can extract the publisher and the city from the text. This may or may not be easy to do. If our text phrase only contains the publisher and the city, things should be relatively straightforward. For example, if all the references were generated by the template {PUBLISHER} ({CITY}), I could just match on something like THIS IS THE PUBLISHER TEXT (THIS IS THE CITY): anything between the brackets is the city, anything before the brackets is the publisher. (That said, if the publisher name included something in brackets, things might start to break…) However, if the information appears in a more general context, things are likely to get more complicated quite quickly. For example, suppose we have the text “the book was published by Joe Bloggs and Sons in London“. How then do we pull out the publisher name? What pattern are we trying to match?
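To make that a bit more concrete, here’s a minimal sketch of the template-and-reverse idea for the simple {PUBLISHER} ({CITY}) case (the example data is made up):

import re

row = {"PUBLISHER": "Joe Bloggs & Sons", "CITY": "London"}

# Template generator: populate the placeholders from a row of data
citation = "{PUBLISHER} ({CITY})".format(**row)
print(citation)
# Joe Bloggs & Sons (London)

# Reversing the template: anything between the brackets is the city,
# anything before the brackets is the publisher
pattern = re.compile(r"^(?P<PUBLISHER>.+?)\s*\((?P<CITY>[^)]+)\)$")
print(pattern.match(citation).groupdict())
# {'PUBLISHER': 'Joe Bloggs & Sons', 'CITY': 'London'}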

When I first looked at this problem, I started writing rules at the character level in the string. But then it occurred to me that I could use a more structured, higher level representation of the text based on “entities”. Packages such as spacy provide a range of tools for representing natural language text at much higher levels than simple text strings. For example, give spacy a text document, and it then lets you work with it at the sentence level, or the word level. It will also try to recognise “entities”: dates, numerical values, monetary values, people’s names, organisation names, geographical entities (or geopolitical entities, GPEs), and so on. With a sentence structured this way, I should be able to parse my publisher/place sentence and extract the recognised organisation and place:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The book was published by Joe Bloggs & Sons of London.")
print([(ent.text, ent.label_) for ent in doc.ents])

"""
[('Joe Bloggs & Sons', 'ORG'), ('London', 'GPE')]
"""

As well as using “AI”, which is to say, trained statistical models, to recognise entities, spacy will also let you define rules to match your entities. These definitions can be based on matching explicit strings, or regular expression patterns (see a quick example in Creating Rule Based Entity Pattern Matchers in spacy) as well as linguistic “parts of speech” features: nouns, verbs, prepositions etc etc.
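For example, here’s a minimal sketch of a parts-of-speech based pattern using spacy’s Matcher (the exact matches you get will depend on how the model tags the words):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Match an adjective followed by a noun, eg "old king"
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

doc = nlp("The old king gave the young queen three shiny coins.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)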

I don’t really remember when we did basic grammar at school, back in the day, but I do remember that I learned more about tenses in French and Latin than I ever did in English, and I learned more about grammar in general in Latin than I ever did in English.

So I wonder: how is grammar taught in school now? I wonder whether things like the spacy playground at https://demos.explosion.ai/matcher exist as interactive teaching / tutorial playground tools in English lessons, as a way of allowing learners to explore grammatical structures in a text?

In addition, I wonder whether tools like that are used in IT and computing lessons, where pupils get to explore writing linguistic pattern matching rules in support of teaching “computational thinking” (whatever that is…), and then build their own information extracting tools using their own rules.

Certainly, to make best progress, a sound understanding of grammar would help when writing rules, so where should that grammar be taught? In computing, or in English? (This is not to argue that English teaching should become subservient to the computing curriculum. Whilst there is a risk that the joy of language is taught instrumentally, with a view to how it can be processed by machines, or how we can (are forced to?) modify our language to get machines to “accept our input” or do what we want them to do, this sort of knowledge might also help provide us with defences against computational processing and obviously opinionated (in a design sense) textual interfaces.) (Increasing amounts of natural language text are parsed and processed by machines, and used to generate responses: chat bots, email responders, SMS responders, search engines, etc etc. In one sense it’s wrong that we need to understand how our inputs may be interpreted by machines, particularly if we find ourselves changing how we interact to suit the machine. On the other, a little bit of knowledge may also give us power over those machines… [This is probably all sounding a little paranoid, isn’t it?! ;-)])

And of course, all the above is before we even start to get on to the subject of how to generate texts based on grammar based templating tools. If we do this for ourselves, it can be really powerful. If we realise this is being done to us (and spot tells that suggest to us we are being presented with machine generated texts), it can give us a line of defence (paranoia again?! ;-).

Creating Rule Based Entity Pattern Matchers in spacy

Via a comment by Adam G on Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy, I learn that as well as training statistical models (as used in that post), spacy lets you write simple pattern matching rules that can be used to identify entities: Rule-based entity recognition.

import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ROYALTY", "pattern": [{"LOWER": "king"}]},
            {"label": "ROYALTY", "pattern": [{"LOWER": "queen"}]},
            {"label": "ROYALTY", "pattern": [{"TEXT": {"REGEX": "[Pp]rinc[es]*"}}]},
            {"label": "MONEY", "pattern": [{"LOWER": {"REGEX": "(gold|silver|copper)"},
                                             "OP": "?"},
                                            {"TEXT": {"REGEX": "(coin|piece)s?"}}]}]
ruler.add_patterns(patterns)

doc = nlp("The King gave 100 gold coins to the Queen, 1 coin to the prince and  ten COPPER pieces the Princesses")
print([(ent.text, ent.label_) for ent in doc.ents])

"""
[('King', 'ROYALTY'), ('100', 'CARDINAL'), ('gold coins', 'MONEY'), ('Queen', 'ROYALTY'), ('1', 'CARDINAL'), ('coin', 'MONEY'), ('prince', 'ROYALTY'), ('ten', 'CARDINAL'), ('COPPER pieces', 'MONEY'), ('Princesses', 'ROYALTY')]
"""

There is a handy tool for trying out patterns at https://demos.explosion.ai/matcher [example]:

(I note that REGEX is not available in the playground though?)

The playground also generates the pattern match rule for you:

However, if I try that rule in my own pattern match, the cardinal element is not matched?

There appears to be a lot of scope as to what you can put in the patterns to be matched, including parts of speech. Which reminds me that I meant to look at Using Pandas DataFrames to analyze sentence structure, which uses dependency parsing on spacy parsed sentences to pull out relationships, such as people’s names and the associated job titles, from company documents. This probably also means digging into spacy’s Linguistic features.

This also makes me wonder again about the extent to which it might be possible to extract certain Propp functions from sentences parsed using spacy and explicit pattern matching rules on particular texts, with bits of hand tuning (eg using hand crafted rules in the identification of actors)?

PS I guess if this part of the pipeline is creating the entity types, they may not be available to the matcher, even if ENT_TYPE is allowed as part of the rule conditions? In which case, can we fettle the pipeline somehow so we can get rules to match on previously identified entity types?

Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy

I’ve just been having a quick look at getting custom NER (named entity recognition) working in spacy. The training data seems to be presented as a list of tuples with the form:

(TEXT_STRING,
  {'entities': [(START_IDX, END_IDX, ENTITY_TYPE), ...]})

To simplify things, I wanted to create the training data structure from a simpler representation: a two-tuple of the form (TEXT, [(PHRASE, ENTITY_TYPE), …]), where each PHRASE is the entity text you want to match.

Let’s start by finding the index values of a match phrase in a text string:

import re

def phrase_index(txt, phrase):
    """Return start and end index of each occurrence of phrase in string."""
    # Escape the phrase in case it contains regex special characters
    matches = [(idx.start(),
                idx.start()+len(phrase)) for idx in re.finditer(re.escape(phrase), txt)]
    return matches

# Example
#phrase_index("this and this", "this")
#[(0, 4), (9, 13)]

I’m using training documents of the following form:

_train_strings = [("The King gave the pauper three gold coins and the pauper thanked the King.", [("three gold coins", "MONEY"), ("King", "ROYALTY")]) ]

We can then generate the formatted training data as follows:

def generate_training_data(_train_strings):
    """Generate training data from text and match phrases."""
    training_data = []
    for (txt, items) in _train_strings:
        _ents_list = []
        for (phrase, typ) in items:
            matches = phrase_index(txt, phrase)
            for (start, end) in matches:
                _ents_list.append( (start, end, typ) )
        if _ents_list:
            training_data.append( (txt, {"entities": _ents_list}) )

    return training_data

# Call as:
#generate_training_data(_train_strings)
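For the example training string above, assuming I’ve counted the character offsets correctly, that should give something like:

[("The King gave the pauper three gold coins and the pauper thanked the King.",
  {"entities": [(25, 41, "MONEY"), (4, 8, "ROYALTY"), (69, 73, "ROYALTY")]})]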

I was a bit surprised this sort of utility doesn’t already exist? Or did I miss it? (I haven’t really read the spacy docs, but then again, spacy seems to keep getting updated…)

Fragments: Structural Search

Several years ago, I picked up a fascinating book on the Morphology of the Folktale by Vladimir Propp. This identifies a set of primitives that could be used to describe the structure of Russian folktales, though they also serve to describe lots of other Western folktales too. (A summary of the primitives identified by Propp can be found here: Propp functions. (See also Levi-Strauss and structural analysis of folk-tales).)

Over the weekend, I started wondering whether there are any folk-tale corpuses out there annotated with Propp functions, that might be used as the basis for a “structural search engine”, or that could perhaps be used to build a model that could attempt to automatically analyse other folktales in structural terms.

Thus far, I’ve found one, as described in ProppLearner: Deeply annotating a corpus of Russian folktales to enable the machine learning of a Russian formalist theory, Mark A. Finlayson, Digital Scholarship in the Humanities, Volume 32, Issue 2, June 2017, Pages 284–300, https://doi.org/10.1093/llc/fqv067 . The paper gives a description of the method used to annotate the tale collection, and also links to a data download containing annotation guides and the annotated collection as an XML file: Supplementary materials for “ProppLearner: Deeply Annotating a Corpus of Russian Folktales to Enable the Machine Learning of a Russian Formalist Theory” [description, zip file]. The bundle includes fifteen annotated tales marked up using the Story Workbench XML format; a guide to the format is also included.

(The Story Workbench is (was?) an Eclipse based tool for annotating texts. I wonder if anything has replaced it that also opens the Story Workbench files? In passing, I note Prodigy, a commercial text annotation tool that integrates tightly with spacy, as well as a couple of free local server powered browser based tools, brat and doccano. The metanno promises “a JupyterLab extension that allows you build your own annotator”, but from the README, I can’t really make sense of what it supports…)

The markup file looks a bit involved, and will take some time to make sense of. It includes POS tags, sentences, extracted events, timex3 / TimeML tags, Proppian functions and archetypes, among other things. To get a better sense of it, I need to build a parser and have a play with it…
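As a first pass, a generic way in to an unfamiliar XML format is just to count the element tags used in one of the annotated tale files; something like the following sketch (the filename is a placeholder):

from collections import Counter
import xml.etree.ElementTree as ET

tree = ET.parse("annotated_tale.xml")

# Count the distinct element tags to get a sense of the markup used
tags = Counter(element.tag for element in tree.iter())
for tag, count in tags.most_common():
    print(tag, count)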

The TimeML tagging was new to me, and looks like it could be handy. It provides an XML based tagging scheme for describing temporal events, expressions and relationships [Wikipedia]. The timexy Python package provides a spacy pipeline component that “extracts and normalizes temporal expressions” and represents them using TimeML tags.

In passing, I also note a couple of things spacy related, such as a phrase matcher that could be used to reconcile different phrases (eg Finn Mac Cool and Fionn MacCumHail etc): PhraseMatcher, and this note on Training Custom NER models in spacy to auto-detect named entities.

Documenting AI Models

I don’t (currently) work on any AI related courses, but this strikes me as something that could be easily co-opted to support: a) best practice; b) assessment — the use of model cards. Here is where I first spotted a mention of them:

My first thought was whether this could be a “new thing” for including in bibliographic/reference data (eg when citing a module model, you’d ideally cite its model card).

Looking at the model card for the Whisper model, I then started to wonder whether this would also be a Good Thing to teach students to create to describe their model, particularly as the format also looks like the sort of thing you could easily assess: the Whisper model card, for example, includes the following headings:

  • Model Details
  • Release Date
  • Model Type
  • Paper / Samples
  • Model Use
  • Training Data
  • Performance and Limitations
  • Broader Implications

The broader implications section is an interesting one…

It also struck me that the model card might also provide a useful cover sheet for a data investigation.

Via Jo Walsh/@ultrazool, I was also tipped off to the “documentation” describing the model card approach: https://huggingface.co/docs/hub/models-cards. The blurb suggests:

The model card should describe:

  • the model
  • its intended uses & potential limitations, including biases and ethical considerations
  • the training params and experimental info (you can embed or link to an experiment tracking platform for reference)
  • which datasets were used to train your model
  • your evaluation results
Hugging Face: model cards, https://huggingface.co/docs/hub/models-cards

The model card format is more completely described in Model Cards for Model Reporting, Margaret Mitchell et al., https://arxiv.org/abs/1810.03993 .

A similarly structured card might also be something that could usefully act as a cover sheet / “executive report metadata” for a data investigation?

PS also via Jo, City of Helsinki AI Register, “a window into the artificial intelligence systems used by the City of Helsinki. Through the register, you can get acquainted with the quick overviews of the city’s artificial intelligence systems or examine their more detailed information”. For more info on that idea, see their Public AI Registers white paper (pdf). And if that sort of thing interests you, you should probably also read Dan McQuillan’s Resisting AI, whose comment on the Helsinki register was “transparency theatre”.

PPS for an example of using Whisper to transcribe, and translate, an audio file, see @electricarchaeo’s Whisper, from OpenAI, for Transcribing Audio.