Fragment: Is English, Rather than Maths, a Better Grounding for the Way Large Areas of Computing are Going?

Or maybe Latin…?

Having a quick play with the spacy natural language processing package just now, in part because I started wondering again about how to reliably extract “facts” from the MPs’ register of interests (I blame @fantasticlife for bringing that to mind again; data available via a @simonw datasette here: simonw/register-of-members-interests-datasette).

Skimming over the Linguistic Features docs, I realise that I probably need to brush up on my grammar. It also helped crystallise out a bit further some niggling concerns I’ve got about what form practical computing is taking, or might take in the near future, based on the sorts of computational objects we can readily work with.

Typically when you start teaching computing, you familiarise learners with the primitive types of computational object that they can work with: integers, floating point numbers, strings (that is, text), then slightly more elaborate structures, such as Python lists or dictionaries (dicts). You might then move up to custom defined classes, and so on. What sort of thing a computational object is largely determines what you can do with it, and what you can extract from it.

Traditionally, getting stuff out of natural language, free text strings has been a bit fiddly. If you’re working with a text sentence, represented typically as a string, one way of trying to extract “facts” from it is to pattern match on it. This means that strings (texts) with a regular structure are among the easier things to work with.

As an example, a few weeks ago I was trying to reliably parse out the name of a publisher, and the town/city of publication, from free text citation data such as Joe Bloggs & Sons (London) or London: Joe Bloggs & Sons (see Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing). A naive approach to this might be to try to write some templated regular expression pattern matching rules. A good way of understanding how this can work is to consider their opposite: template generators.

Suppose I have a data file with a list of book references; the data has a column for PUBLISHER, and one for CITY. If I want to generate a text based citation, I might create different rules to generate citations of different forms. For example, the template {PUBLISHER} ({CITY}) might display the publisher and then the city in brackets (the curly brackets show we want to populate the string with a value pulled from a particular column in a row of data); or the template {CITY}: {PUBLISHER} the city, followed by a colon, and then the publisher.
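By way of illustration, here’s a minimal sketch of the template generator idea using Python’s built-in string formatting (the row values are made up):

```python
# A row of data, as might be pulled from the data file
row = {"PUBLISHER": "Joe Bloggs & Sons", "CITY": "London"}

# Two alternative citation templates
template1 = "{PUBLISHER} ({CITY})"
template2 = "{CITY}: {PUBLISHER}"

# Populate each template from the row
print(template1.format(**row))  # Joe Bloggs & Sons (London)
print(template2.format(**row))  # London: Joe Bloggs & Sons
```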

If we reverse this process, we then need to create a rule that can extract the publisher and the city from the text. This may or may not be easy to do. If our text phrase only contains the publisher and the city, things should be relatively straightforward. For example, if all the references were generated by a template {PUBLISHER} ({CITY}), I could just match on something like THIS IS THE PUBLISHER TEXT (THIS IS THE CITY) : anything between the brackets is the city, anything before the brackets is the publisher. (That said, if the publisher name included something in brackets, things might start to break…). However, if the information appears in a more general context, things are likely to get more complicated quite quickly. For example, suppose we have the text “the book was published by Joe Bloggs and Sons in London“. How then do we pull out the publisher name? What pattern are we trying to match?

When I first looked at this problem, I started writing rules at the character level in the string. But then it occurred to me that I could use a more structured, higher level representation of the text based on “entities”. Tools such as the spacy Python package provide a range of tools for representing natural language text at much higher levels of representation than simple text strings. For example, give spacy a text document, and it then lets you work with it at the sentence level, or the word level. It will also try to recognise “entities”: dates, numerical values, monetary values, people’s names, organisation names, geographical entities (or geopolitical entities, GPEs), and so on. With a sentence structured this way, I should be able to parse my publisher/place sentence and extract the recognised organisation and place:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The book was published by Joe Bloggs & Sons of London.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Joe Bloggs & Sons', 'ORG'), ('London', 'GPE')]

As well as using “AI”, which is to say, trained statistical models, to recognise entities, spacy will also let you define rules to match your entities. These definitions can be based on matching explicit strings, or regular expression patterns (see a quick example in Creating Rule Based Entity Pattern Matchers in spacy) as well as linguistic “parts of speech” features: nouns, verbs, prepositions etc etc.

I don’t really remember when we did basic grammar at school, back in the day, but I do remember that I learned more about tenses in French and Latin than I ever did in English, and I learned more about grammar in general in Latin than I ever did in English.

So I wonder: how is grammar taught in school now? I wonder whether things like the spacy playground exist as interactive teaching / tutorial tools in English lessons, as a way of allowing learners to explore the grammatical structures in a text?

In addition, I wonder whether tools like that are used in IT and computing lessons, where pupils get to explore writing linguistic pattern matching rules in support of teaching “computational thinking” (whatever that is…), and then build their own information extraction tools using their own rules.

Certainly, to make best progress, a sound understanding of grammar would help when writing rules, so where should that grammar be taught? In computing, or in English? (This is not to argue that English teaching should become subservient to the computing curriculum. Whilst there is a risk that the joy of language is taught instrumentally, with a view to how it can be processed by machines, or how we can (are forced to?) modify our language to get machines to “accept our input” or do what we want them to do, this sort of knowledge might also help provide us with defences against computational processing and obviously opinionated (in a design sense) textual interfaces. Increasing amounts of natural language text are parsed and processed by machines, and used to generate responses: chat bots, email responders, SMS responders, search engines, etc. In one sense, it’s wrong that we need to understand how our inputs may be interpreted by machines, particularly if we find ourselves changing how we interact to suit the machine. On the other, a little bit of knowledge may also give us power over those machines… [This is probably all sounding a little paranoid, isn’t it?! ;-)])

And of course, all the above is before we even start to get on to the subject of how to generate texts based on grammar based templating tools. If we do this for ourselves, it can be really powerful. If we realise this is being done to us (and spot tells that suggest we are being presented with machine generated texts), it can give us a line of defence (paranoia again?! ;-).

Creating Rule Based Entity Pattern Matchers in spacy

Via a comment to Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy by Adam G, I learn that, as well as training statistical models (as used in that post), spacy lets you write simple pattern matching rules that can be used to identify entities: Rule-based entity recognition.

import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ROYALTY", "pattern": [{"LOWER": "king"}]},
            {"label": "ROYALTY", "pattern": [{"LOWER": "queen"}]},
            {"label": "ROYALTY", "pattern": [{"TEXT": {"REGEX": "[Pp]rinc[es]*"}}]},
            {"label": "MONEY", "pattern": [{"LOWER": {"REGEX": "(gold|silver|copper)"},
                                             "OP": "?"},
                                            {"TEXT": {"REGEX": "(coin|piece)s?"}}]}]

doc = nlp("The King gave 100 gold coins to the Queen, 1 coin to the prince and  ten COPPER pieces the Princesses")
print([(ent.text, ent.label_) for ent in doc.ents])

[('King', 'ROYALTY'), ('100', 'CARDINAL'), ('gold coins', 'MONEY'), ('Queen', 'ROYALTY'), ('1', 'CARDINAL'), ('coin', 'MONEY'), ('prince', 'ROYALTY'), ('ten', 'CARDINAL'), ('COPPER pieces', 'MONEY'), ('Princesses', 'ROYALTY')]

There is a handy tool for trying out patterns at [example]:

(I note that REGEX is not available in the playground though?)

The playground also generates the pattern match rule for you:

However, if I try that rule in my own pattern match, the cardinal element is not matched?

There appears to be a lot of scope as to what you can put in the patterns to be matched, including parts of speech. Which reminds me that I meant to look at Using Pandas DataFrames to analyze sentence structure, which uses dependency parsing on spacy parsed sentences to pull out relationships, such as people’s names and the associated job titles, from company documents. This probably also means digging into spacy’s Linguistic features.

This also makes me wonder again about the extent to which it might be possible to extract certain Propp functions from sentences parsed using spacy and explicit pattern matching rules on particular texts, with bits of hand tuning (eg using hand crafted rules in the identification of actors)?

PS I guess if this part of the pipeline is creating the entity types, they may not be available to the matcher, even if ENT_TYPE is allowed as part of the rule conditions? In which case, can we fettle the pipeline somehow so we can get rules to match on previously identified entity types?

Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy

I’ve just been having a quick look at getting custom NER (named entity recognition) working in spacy. The training data seems to be presented as a list of two-tuples of the form:

  (TEXT, {'entities': [(START_IDX, END_IDX, ENTITY_TYPE), ...]})

To simplify things, I wanted to create the training data structure from a simpler representation: a two-tuple of the form (TEXT, PHRASES), where PHRASES is a list of (PHRASE, TYPE) pairs giving the entity text you want to match and its entity label.

Let’s start by finding the index values of a match phrase in a text string:

import re

def phrase_index(txt, phrase):
    """Return start and end index phrase in string."""
    matches = [(idx.start(),
                idx.start()+len(phrase)) for idx in re.finditer(phrase, txt)]
    return matches

# Example
#phrase_index("this and this", "this")
#[(0, 4), (9, 13)]

I’m using training documents of the following form:

_train_strings = [("The King gave the pauper three gold coins and the pauper thanked the King.", [("three gold coins", "MONEY"), ("King", "ROYALTY")]) ]

We can then generate the formatted training data as follows:

def generate_training_data(_train_strings):
    """Generate training data from text and match phrases."""
    training_data = []
    for (txt, items) in _train_strings:
        _ents_list = []
        for (phrase, typ) in items:
            matches = phrase_index(txt, phrase)
            for (start, end) in matches:
                _ents_list.append( (start, end, typ) )
        if _ents_list:
            training_data.append( (txt, {"entities": _ents_list}) )

    return training_data

# Call as:
# training_data = generate_training_data(_train_strings)

I was a bit surprised this sort of utility doesn’t already exist? Or did I miss it? (I haven’t really read the spacy docs, but then again, spacy seems to keep getting updated…)

Fragments: Structural Search

Several years ago, I picked up a fascinating book on the Morphology of the Folktale by Vladimir Propp. This identifies a set of primitives that could be used to describe the structure of Russian folktales, though they also serve to describe lots of other Western folktales too. (A summary of the primitives identified by Propp can be found here: Propp functions. (See also Levi-Strauss and structural analysis of folk-tales).)

Over the weekend, I started wondering whether there are any folk-tale corpuses out there annotated with Propp functions, that might be used as the basis for a “structural search engine”, or that could perhaps be used to build a model that could attempt to automatically analyse other folktales in structural terms.

Thus far, I’ve found one, as described in ProppLearner: Deeply annotating a corpus of Russian folktales to enable the machine learning of a Russian formalist theory, Mark A. Finlayson, Digital Scholarship in the Humanities, Volume 32, Issue 2, June 2017, Pages 284–300. The paper gives a description of the method used to annotate the tale collection, and also links to a data download containing annotation guides and the annotated collection as an XML file: Supplementary materials for “ProppLearner: Deeply Annotating a Corpus of Russian Folktales to Enable the Machine Learning of a Russian Formalist Theory” [description, zip file]. The bundle includes fifteen annotated tales marked up using the Story Workbench XML format; a guide to the format is also included.

(The Story Workbench is (was?) an Eclipse based tool for annotating texts. I wonder if anything has replaced it that also opens the Story Workbench files? In passing, I note Prodigy, a commercial text annotation tool that integrates tightly with spacy, as well as a couple of free local server powered browser based tools, brat and doccano. The metanno promises “a JupyterLab extension that allows you build your own annotator”, but from the README, I can’t really make sense of what it supports…)

The markup file looks a bit involved, and will take some time to make sense of. It includes POS tags, sentences, extracted events, timex3 / TimeML tags, Proppian functions and archetypes, among other things. To get a better sense of it, I need to build a parser and have a play with it…

The TimeML tagging was new to me, and looks like it could be handy. It provides an XML based tagging scheme for describing temporal events, expressions and relationships [Wikipedia]. The timexy Python package provides a spacy pipeline component that “extracts and normalizes temporal expressions” and represents them using TimeML tags.

In passing, I also note a couple of things spacy related, such as a phrase matcher that could be used to reconcile different phrases (eg Finn Mac Cool and Fionn MacCumHail etc): PhraseMatcher, and this note on Training Custom NER models in spacy to auto-detect named entitities.

Documenting AI Models

I don’t (currently) work on any AI related courses, but this strikes me as something that could be easily co-opted to support: a) best practice; b) assessment — the use of model cards. Here is where I first spotted a mention of them:

My first thought was whether this could be a “new thing” for including in bibliographic/reference data (eg when citing a model, you’d ideally cite its model card).

Looking at the model card for the Whisper model, I then started to wonder whether this would also be a Good Thing to teach students to create to describe their model, particularly as the format also looks like the sort of thing you could easily assess: the Whisper model card, for example, includes the following headings:

  • Model Details
  • Release Date
  • Model Type
  • Paper / Samples
  • Model Use
  • Training Data
  • Performance and Limitations
  • Broader Implications

The broader implications heading is an interesting one…

It also struck me that the model card might also provide a useful cover sheet for a data investigation.

Via Jo Walsh/@ultrazool, I was also tipped off to the “documentation” describing the model card approach. The blurb suggests:

The model card should describe:

  • the model
  • its intended uses & potential limitations, including biases and ethical considerations
  • the training params and experimental info (you can embed or link to an experiment tracking platform for reference)
  • which datasets were used to train your model
  • your evaluation results
Hugging Face: model cards

The model card format is more completely described in Model Cards for Model Reporting, Margaret Mitchell et al.

A similarly structured card might also be something that could usefully act as a cover sheet / “executive report metadata” for a data investigation?

PS also via Jo, City of Helsinki AI Register, “a window into the artificial intelligence systems used by the City of Helsinki. Through the register, you can get acquainted with the quick overviews of the city’s artificial intelligence systems or examine their more detailed information”. For more info on that idea, see their Public AI Registers white paper (pdf). And if that sort of thing interests you, you should probably also read Dan McQuillan’s Resisting AI, whose comment on the Helsinki register was “transparency theatre”.

PPS for an example of using Whisper to transcribe, and translate, an audio file, see @electricarchaeo’s Whisper, from OpenAI, for Transcribing Audio.

Using dogsheep-beta to Create a Combined SQLite Free Text Metasearch Engine Over Several SQLite Databases

I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibb’s Tiny Tales books. The scrapers pull the stories into simple, free text searchable SQLite databases to support discovery of particular stories based on simple, free text search terms. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)

See the technical recipes: World of Tales scraper; Tiny Tales scraper.

The databases I’ve been using aren’t very consistent in the table structure I’m using to store the scraped data, and thus far I tended to search the databases separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.
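The core trick is SQLite’s ATTACH: a single connection can query tables from several database files at once. A minimal sketch using Python’s sqlite3 module (the filenames and table layouts here are made up for illustration):

```python
import sqlite3

# Two throwaway databases standing in for the separate scraper outputs
for fname, table in [("wot_demo.db", "tales"), ("tiny_demo.db", "stories")]:
    db = sqlite3.connect(fname)
    db.execute(f"DROP TABLE IF EXISTS {table}")
    db.execute(f"CREATE TABLE {table} (title TEXT, text TEXT)")
    db.execute(f"INSERT INTO {table} VALUES ('A Tale', 'Once upon a time...')")
    db.commit()
    db.close()

# ATTACH lets a single connection query across both database files
con = sqlite3.connect("wot_demo.db")
con.execute("ATTACH DATABASE 'tiny_demo.db' AS tiny")
rows = con.execute(
    "SELECT title FROM tales UNION ALL SELECT title FROM tiny.stories"
).fetchall()
print(rows)  # [('A Tale',), ('A Tale',)]
```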

And it turns out, I can. Really easily…

The dogsheep-beta demo essentially bundles two things:

  • a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
  • a datasette extension for rendering a faceted full text search page over the combined database.

The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are type (in the dogsheep world, this is the original data source, eg a tweet, or a github record); key (a unique key for the record); title; timestamp; category; is_public; search_1, search_2, search_3.

I forked the repo to tweak these columns slightly, changing timestamp to pubdate and adding a new book column to replace type with the original book from which a story came; the database records I’m interested in searching over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.

Items in the config file have the form:

        sql: |-
            select
              MY_ID as key,
              MY_TEXT as search_1
            from MY_TABLE
The select MUST be in lower case (a hack in the code searches for the first select in the provided query as part of a query rewrite). Also, the query MUST NOT end with a ;, as the aforementioned query rewrite appends LIMIT 0 to the original query en route to identifying the column headings.

Here’s the config file I used for creating my metasearch database:

        sql: |-
            select
              "lang::" || replace(title, " ", "") as key,
              text as search_1
            from books

        sql: |-
            select
              "jacobs::" || replace(story_title, " ", "") as key,
              story_title as title,
              book_title as book,
              story_text as search_1
            from stories

        sql: |-
            select
              "wot::" || replace(title, " ", "") as key,
              text as search_1
            from tales

        sql: |-
            select
              "dash::" || replace(title, " ", "") as key,
              title as title,
              metadata as book,
              text as search_1
            from ashliman_stories

        sql: |-
            select
              "mtdf::" || replace(title, " ", "") as key,
              title as title,
              text as search_1
            from english_stories

Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:
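In the same spirit, here’s a minimal sketch of that kind of free text query using Python’s sqlite3 module against a hypothetical FTS5 metasearch table with the columns described above (the sample rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A toy version of the combined full text search table
con.execute("CREATE VIRTUAL TABLE search_index USING fts5(key, title, book, search_1)")
con.execute("""INSERT INTO search_index VALUES
    ('wot::TheThreeBears', 'The Three Bears', 'World of Tales',
     'Once upon a time there were three bears...')""")
con.execute("""INSERT INTO search_index VALUES
    ('dash::Cinderella', 'Cinderella', 'Ashliman', 'A tale of a slipper...')""")

# A simple free text search, best matches first
rows = con.execute(
    "SELECT key, title FROM search_index WHERE search_index MATCH ? ORDER BY rank",
    ("bears",)
).fetchall()
print(rows)  # [('wot::TheThreeBears', 'The Three Bears')]
```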

One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). However, it’s not too hard to generate a simple semantic search over a corpus using doc2vec, as this old demo shows. Unfortunately, the vtfunc trick that demo relies on seems to have rotted in Python 3.10 [update: ensuring Cython is installed fixes the vtfunc install; for another Python wrapper for sqlite, see also apsw]; there may be an alternative way to access a TableFunction via the peewee package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nicely with sqlite_utils or pandas.read_sql().

What I’m thinking is, it would be really handy to have a template repo associated with sqlite_utils / datasette that provides tools to:

  • create a metasearch database (dogsheep-beta pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
  • provide an easily customisable index.html template for datasette that gives you a simple free text search UI (related issue);
  • provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the index.html template should then give you the option of running a free text search or a semantic search;
  • (it might also be handy to support other custom fuzzy search functions (example);)
  • a simpler config where you select free text and/or semantic search, and the index/doc2vec builders are applied appropriately, and the index.html template serves queries appropriately.

PS I posted this related item to the sqlite_utils Discord server:

SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.

The coleifer/sqlite-vtfunc package provided a fairly straightforward way of defining custom table functions (eg I used it to create a custom fuzzy search function here).

The recipe is to define a class My_Custom_Function(TableFunction) and then register it on a connection (My_Custom_Function.register(db.conn)). This worked fine with sqlite_utils database connections and meant you could easily add custom functions to a running server.

However, that package is no longer maintained and seems to break when installing under at least Python 3.10?

An equivalent (ish) function is provided by the peewee package (functions imported from playhouse.sqlite_ext), but it seems to be rather more tightly bound to a SqliteDatabase object, rather than simply being registered to a database connection.

Is there any support for registering such table returning functions in sqlite_utils?

Generating “Print Pack” Material (PDF, epub) From Jupyter Notebooks Using pandoc

For one of the courses I work on, we produce “print pack” materials (PDF, epub), on request, for students who require print, or alternative format, copies of the Jupyter notebooks used in the module. This does and doesn’t make a lot of sense. On the one hand, having a print copy of notebooks may be useful for creating physical annotations. On the other, the notebooks are designed as interactive, and often generative, materials: interactive in the sense that students are expected to run code, as well as modify, create and execute their own code; generative in the sense that outputs are generated by code execution and the instructional content may explicitly refer to things that have been so generated.

In producing the print / alternative format material, we generally render the content from un-run notebooks, which is to say that the print material notebooks do not include code outputs. Part of the reason for this is that we want the students to do the work: if we handed out completed worksheets, there’d be no need to work through and complete the worksheets, right?

Furthermore, whilst for some notebooks it may be possible to meaningfully run all cells and then provide executed/run cell notebooks as print materials, in other modules this may not make sense. In our TM129 robotics block, an interactive simulator widget is generated and controlled from the run notebook, and it doesn’t make sense to naively share a notebook with all cells run. Instead, we would have to share screenshots of the simulator widget following each notebook activity, and there may be several such activities in each notebook. (It might be instructive to try to automate the creation of such screenshots, eg using the JupyterLab galata test framework.)

Anyway, I started looking at how to automate the generation of print packs. The following code is hard wired for a directory structure where the content is in a ./content directory, with subdirectories for each week starting with a two digit week number (01., 02. etc) and notebooks in each week directory numbered according to week/notebook number (eg 01.1, 01.2, …, 02.1, 02.2, … etc.).

# # ``
# Script for generating print items (weekly PDF, weekly epub).

# # Install requirements
# - Python
# - pandoc
# - Python packages:
#   - ipython
#   - nbconvert
#   - nbformat
#   - pymupdf

from pathlib import Path

#import nbconvert
import nbformat
from nbconvert import HTMLExporter

#import pypandoc
import os
import secrets
import shutil
import subprocess

import fitz  # pip install pymupdf

html_exporter = HTMLExporter(template_name='classic')

pwd = Path.cwd()
print(f'Starting in: {pwd}')

# +
nb_wd = "content"  # Path to weekly content folders
pdf_output_dir = "print_pack"  # Path to output dir

# Create print pack output dir if required
Path(pdf_output_dir).mkdir(parents=True, exist_ok=True)
# -

# Iterate through weekly content dirs
# We assume the dir starts with a week number
for p in Path(nb_wd).glob("[0-9]*"):
    print(f'- processing: {p}')
    if not p.is_dir():

    # Get the week number
    weeknum =". ")[0]

    # Settings for pandoc
    pdoc_args = ['-s', '-V geometry:margin=1in',
                 #f'--resource-path="{p.resolve()}"', # Doesn't work?
                 '--metadata', f'title="TM129 Robotics — Week {weeknum}"']

    # cd to week directory

    # Create a tmp directory for html files
    # Rather than use tempfile, create our own lest we want to persist it
    _tmp_dir = Path(secrets.token_hex(5))
    _tmp_dir.mkdir(parents=True, exist_ok=True)

    # Find notebooks for the current week
    for _nb in Path.cwd().glob("*.ipynb"):
        nb =, as_version=4)
        # Generate HTML version of document
        (body, resources) = html_exporter.from_notebook_node(nb)
        with open(_tmp_dir /".ipynb", ".html"), "w") as f:

    # Now convert the HTML files to PDF
    # We need to run pandoc in the correct directory so that
    # relatively linked image files are correctly picked up.

    # Specify output PDF and epub paths
    pdf_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf")
    epub_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.epub")

    # It seems pypandoc is not sorting the files in ToC etc?
    #pypandoc.convert_file(f'{_tmp_dir}/*html',
    #                      to='pdf',
    #                      #format='html',
    #                      extra_args=pdoc_args,
    #                      outputfile=pdf_out)

    # Hacky - requires IPython:
    #! pandoc -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html
    #! pandoc -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html'pandoc --quiet -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html',
                   shell=True)'pandoc --quiet -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html',
                   shell=True)

    # Tidy up tmp dir

    # Just in case we need to know relatively where we are...
    # Go back to the home dir

# ## Add OU Logo to First Page of PDF

# Add an OU logo to the first page of the PDF documents

# +
logo_file = ".print_assets/OU-logo-83x65.png"
img = open(logo_file, "rb").read()

# Define the position (upper-left corner)
logo_container = fitz.Rect(60, 40, 143, 105)

for f in Path(pdf_output_dir).glob("*.pdf"):
    print(f'- branding: {f}')
    with as pdf:
        pdf_first_page = pdf[0]
        pdf_first_page.insert_image(logo_container, stream=img)
        pdf_out =".pdf", "_logo.pdf")
        # Add a copyright footer to the foot of each page
        txt_origin = fitz.Point(350, 770)
        text = "Copyright © The Open University, 2022"
        for page in pdf:
            page.insert_text(txt_origin, text) / pdf_out)
    # Remove the unbranded PDF
# -

The script uses pandoc to generate the PDF and epub documents, one per weekly directory. The PDF generator also includes a table of contents, automatically generated from headings by pandoc. A second pass using fitz/pymupdf then adds a logo and copyright notice to each PDF.

PDF with post-processed addition of a custom logo
PDF with post-processed addition of a copyright footer

Prompt Engineering, aka Generative System Query Optimisation

Over the last few months, I’ve seen increasing numbers of references to prompt engineering in my feeds. Prompt engineering is to generative AI systems — things like GPT-3 (natural language/text generation), CoPilot (code and comment (i.e. code explanation) generation), and DALL-E (text2image generation) — as query optimisation is to search engines: the ability to iterate a query/prompt, in light of the responses it returns, in order to find, or generate, a response that meets your needs.

Back in the day when I used to bait librarians, developing web search skills was an ever present theme at the library-practitioner conferences I used to regularly attend: advanced search forms and advanced search queries (using advanced search/colon operators, etc.), as well as power search tricks based around URL hacking or the use of particular keywords that reflected your understanding of the search algorithm, etc.

There was also an ethical dimension: should you teach particular search techniques that could be abused in some way, if generally known, but that might be appropriate if you were working in an investigative context: strategies that could be used to profile — or stalk — someone, for example.

So now I’m wondering: are the digital skills folk in libraries starting to look at ways of developing prompt engineering skills yet?

Again, there are ethical questions: for example, should you develop skills or give prompt examples of how to create deepfakes in unfiltered generative models such as Stable Diffusion (Deepfakes for all: Uncensored AI art model prompts ethics questions, h/t Tactical Tech/@info_activism)?

And there are also commercial and intellectual property right considerations. For example, what’s an effective prompt worth and how do you protect it if you are selling it on a prompt marketplace such as PromptBase? Will there be an underground market in “cheats” that can be deployed against certain commercial generative models (because pre-trained models are now also regularly bought and sold), and so on.

There’s also the skills question of how to effectively (and iteratively) work with a generative AI to create and tune meaningful solutions, which requires the important skill of evaluating, and perhaps editing or otherwise tweaking, the responses (cf. evaluating the quality of websites returned in the results of a web search engine), as well as improving or refining your prompts (for example, students who naively enter a complete question as a search engine query or a generative system prompt, compared to those who use the question as a basis for their own, more refined, query that attempts to extract key “features” from the question…).

For education, there’s also the question of how to respond to students using generative systems in open assessments: will we try to “forbid” students from using such systems at all, will we have the equivalent of “calculator” papers in maths exams, where you are expected, or even required, to use the generative tools, or will we try to enter into an arms race with questions which, if naively submitted as a prompt, will return an incorrect answer (in which case, how soon will the generative systems be improved to correctly respond to such prompts…)?

See also: Fragment: Generative Plagiarism and Technologically Enhanced Cheating?

PS For some examples of prompt engineering in practice, Simon Willison has blogged several examples of some of his playful experiments with generative AI:

PPS another generative approach that looks interesting is provided by, which allows you to submit several training images of a novel concept, and then generate differently rendered styles of that concept.

Automating the Production of Student Software Guides With Annotated Screenshots Using Playwright and Jupyter Notebooks

With a bit of luck, we’ll be updating software for our databases course starting in Sept/Oct to use JupyterLab. It’s a bit late for settling this, but I find the adrenaline of going live, and the interaction with pathfinder students in particular, to be really invigorating (a bit like the old days of residential schools) so here’s hoping…

[I started this post last night, closed the laptop, opened it this morning and did 75 mins of work. The sleep broke WordPress’ autobackup, so when I accidentally backswiped the page, I seem to have lost over an hour’s work. Which is to say, this post was originally better written, but now it’s rushed. F**k you, WordPress.]

If we are to do the update, one thing I need to do is update the software guide. The software guide looks a bit like this:

Which is to say:

  • process text describing what to do;
  • a (possibly annotated) screenshot showing what to do, or the outcome of doing it;
  • more process text, etc.

Maintaining this sort of documentation can be a faff:

  • if the styling of the target website/application changes, but the structure and process steps remain the same, the screenshots drift from actuality, even if they are still functionally correct;
  • if the application structure or process steps change, then the documentation properly breaks. This can be hard to check for unless you rerun everything and manually test things before each presentation at least.

So is there a better way?

There is, and it’s another example of why it’s useful for folk to learn to code, which is to say, have some sort of understanding about how the web works and how to construct simple computational scripts.

Regular readers will know that I’ve tinkered on and off with the Selenium browser automation toolkit in the past, using it for scraping as well as automating repetitive manual tasks (such as downloading scores of student scripts, one at a time, from one exams system, and grabbing a 2FA code for each of them for use in submitting marks into a second system). But playwright, Microsoft’s browser automation tool (freely available), seems to be what all the cool kids are using now, so I thought I’d try that.

The playwright app itself is a node app, which makes me twitchy because node is a pain to install. But the playwright-python package, which is installable from PyPi and which wraps playwright, bundles its own version of node, which makes things much simpler. (See Simon Willison’s Bundling binary tools in Python wheels for a discussion of this; for any edtechies out there, this is a really useful pattern, because if students have Python installed, you can use it as a route to deploy other things…)

Just as a brief aside, playwright is also wrapped by @simonw’s shot-scraper command line tool which makes it dead easy to grab screenshots. For example, we can grab a screenshot of the OU home page as simply as typing shot-scraper

Note that because the session runs in a new, incognito, browser, we get the cookie notice.

We can also grab a screenshot of just a particular, located CSS element: shot-scraper -s '#ou-org-footer'. See the shot-scraper docs for many more examples.

In many cases, screenshots that appear in OU course materials but that no longer match reality are identified and reported by students. Tools like shot-scraper and pytest can both be used as part of a documentation testing suite where we create gold master images and then, as required, test “current” screenshots to see if they match the distributed originals.
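As a minimal sketch of the gold master idea (the helper names and file paths here are hypothetical, and assume the screenshots have already been grabbed, e.g. by shot-scraper), we might compare a freshly grabbed screenshot against the distributed original by hashing the image bytes:

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def matches_gold_master(current_path, gold_master_path):
    """Report whether a freshly grabbed screenshot is byte-identical
    to the distributed gold master image."""
    return file_digest(current_path) == file_digest(gold_master_path)
```

Byte-identical comparison is brittle (font rendering and anti-aliasing can differ across machines), so in practice a pixel-diff comparison with a tolerance threshold would probably be more robust.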

But back to the creation or reflowing of documentation.

As well as command line control using shot-scraper, we can also drive playwright from Python code, executed synchronously, as in a pytest test, or asynchronously, as in the case of Python running inside a Jupyter notebook. This is what I spent yesterday exploring, in particular, whether we could create reproducible documentation in the sense of something that has the form text, screenshot, text, … and looks like this:

but is actually created by something that has the form text, code, (code output), text, … and looks like this:

And as you’ve probably guessed, we can.

For some reason, my local version of nbconvert seems to now default to using no-input display settings and I can’t see why (there are no nbconvert settings files that I can see). Anyone got any ideas how/why this might be happening? The only way I can work around it atm is to explicitly enable the display of the inputs: jupyter nbconvert --to pdf --TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=False --TemplateExporter.exclude_input_prompt=False notebook.ipynb.

It’s worth noting a couple of things here:

  • if we reflow the document to generate new output screenshots, they will be a faithful representation of what the screen looks like at the time the document is reflowed. So if the visual styling (and nothing else) has been updated, we can capture the latest version;
  • the code should ideally follow the text description of the process steps, so if the code stops working for some reason, that might suggest the process has changed and so the text might well be broken too.

Generating the automation code requires knowledge of a couple of things:

  • how to write the code itself: the documentation helps here (eg in respect of how to wait for things, how to grab screenshots, how to move to things so they are in focus when you take a screenshot), but a recipe / FAQ / crib sheet would also be really handy;
  • how to find locators.

In terms of finding locators, one way is to do it manually, by using browser developer tools to inspect elements and grab locators for the required elements. But another, simpler, way is to record a set of playwright steps using the playwright codegen URL command line tool: simply walk through (click through) the process you want to document in the automatically launched interactive browser, and record the corresponding playwright steps.

With the script recorded, you can then use that as the basis for your screenshot-generating reproducible documentation.

For any interested OU internal readers, there is an example software guide generated using playwright in MS Teams > Jupyter notebook working group team. I’m happy to share any scripts etc I come up with, and am interested to see other examples of using browser automation to test and generate documentation etc.

Referring back to the original software guide, we note that some screenshots have annotations. Another nice feature of playwright is that we can inject Javascript into the controlled browser and evaluate it in that context, which means we can inject Javascript that tinkers with the DOM.

The shot-scraper tool has a nifty function that will accept a set of locators for a particular page and, using Javascript injection, create a div surround them that can then be used to grab a screenshot of the area covering those locators.

It’s trivial to amend that function to add a border round an element:

# Based on @simonw's shot-scraper
import json
import secrets
import textwrap

from IPython.display import Image, display

def _selector_javascript(selectors, selectors_all,
                         padding=0, border='none', margin=0):
    selector_to_shoot = "shot-scraper-{}".format(secrets.token_hex(8))
    selector_javascript = textwrap.dedent(
        """
    new Promise(takeShot => {
        let padding = %s;
        let margin = %s;
        let minTop = 100000000;
        let minLeft = 100000000;
        let maxBottom = 0;
        let maxRight = 0;
        let els = %s.map(s => document.querySelector(s));
        // Add the --selector-all elements
        %s.map(s => els.push(...document.querySelectorAll(s)));
        els.forEach(el => {
            let rect = el.getBoundingClientRect();
            if (rect.top < minTop) {
                minTop = rect.top;
            }
            if (rect.left < minLeft) {
                minLeft = rect.left;
            }
            if (rect.bottom > maxBottom) {
                maxBottom = rect.bottom;
            }
            if (rect.right > maxRight) {
                maxRight = rect.right;
            }
        });
        // Adjust them based on scroll position
        let top = minTop + window.scrollY;
        let bottom = maxBottom + window.scrollY;
        let left = minLeft + window.scrollX;
        let right = maxRight + window.scrollX;
        // Apply padding
        top = top - padding;
        bottom = bottom + padding;
        left = left - padding;
        right = right + padding;
        // Create a div that bounds the located elements,
        // with an optional border and margin
        let div = document.createElement('div');
        div.style.position = 'absolute';
        div.style.top = top + 'px';
        div.style.left = left + 'px';
        div.style.width = (right - left) + 'px';
        div.style.height = (bottom - top) + 'px';
        div.style.border = %s;
        div.style.margin = margin + 'px';
        div.setAttribute('id', %s);
        document.body.appendChild(div);
        setTimeout(() => {
            takeShot();
        }, 300);
    });
    """
        % (
            padding, margin,
            json.dumps(selectors),
            json.dumps(selectors_all),
            json.dumps(border),
            json.dumps(selector_to_shoot),
        )
    )
    return selector_javascript, "#" + selector_to_shoot

async def screenshot_bounded_selectors(page, selectors=None, selectors_all=None,
                                       padding=0, border='none', margin=0,
                                       full_page=False,
                                       path="screenshot.png", show=True):
    selectors = selectors or []
    selectors_all = selectors_all or []
    selector_to_shoot = None
    if selectors or selectors_all:
        selector_javascript, selector_to_shoot = _selector_javascript(
            selectors, selectors_all, padding, border, margin
        )
        # Evaluate the javascript in the context of the browser page
        await page.evaluate(selector_javascript)
    if selector_to_shoot and not full_page:
        # Grab a screenshot of just the bounding div
        await page.locator(selector_to_shoot).screenshot(path=path)
    else:
        if selector_to_shoot:
            # Bring the bounding div into view
            await page.locator(selector_to_shoot).hover()
        await page.screenshot(path=path, full_page=full_page)

    # Should we include a step to then remove the
    # bounding div and leave the page as it was?
    if show:
        display(Image(path))
    return path

Then we can use a simple script to grab a screenshot with a highlighted area:

from time import sleep
from playwright.async_api import async_playwright

playwright = await async_playwright().start()

# If we run headless=False, playwright will launch a
# visible browser we can track progress in
browser = await playwright.chromium.launch(headless=False)

# Create a reference to a browser page
page = await browser.new_page()

# The page we want to visit
PAGE = ""

# And a locator for the "Accept all cookies" button
cookie_id = "#cassie_accept_all_pre_banner"

# Load the page
await page.goto(PAGE)

# Accept the cookies
await page.locator(cookie_id).click()

# The selectors we want to screenshot
selectors = []
selectors_all = [".int-grid3"]

# Grab the screenshot
await screenshot_bounded_selectors(page, selectors, selectors_all,
                                   border='5px solid red')

Or we can just screenshot and highlight the element of interest:

Simon has an open shot-scraper issue on adding things like arrows to the screenshot, so this is obviously something that might repay a bit more exploration.

I note that the Jupyter accessibility docs have a section on the DOM / locator structure of the JupyterLab UI that include highlighted screenshots annotated, I think, using drawio. It might be interesting to try to replicate / automatically generate those, using playwright?

Finally, it’s worth noting that there is another playwright based tool, galata, that provides a set of high level tools for controlling JupyterLab and scripting JupyterLab actions (this will be the subject of another post). However, galata is currently a developer-only tool, in that it is expected to run against a wide open JupyterLab environment (no authentication). It does this by over-riding playwright‘s .goto() method to expect a particular JupyterLab locator, which means that if you want to test a deployment that sits behind the Jupyter authenticator (which is the recommended and default way of running a Jupyter server), or you want to go through some browser steps involving arbitrary web pages, you can’t. (I have opened an issue regarding at least getting through Jupyter authentication here, and a related Jupyter discourse discussion here.)

What would be useful in the general case would be a trick to use a generic playwright script to automate steps up to a JupyterLab page and then hand over browser state to a galata script. But I don’t know how to “just” do that. Which is why I say this is a developer tool, and as such is hostile to non-developer users who, for example, might only be able to run scripts against a hosted server accessed through multiple sorts of institutional and Jupyter based authentication. The current set up also means it’s not possible to use galata out of the can for testing a deployment. Devs only, n’est-ce pas?

Displaying Wide pandas Dataframes Over Several, Narrower Dataframes

One of the problems associated with using Jupyter notebooks, and Jupyter Book, for publishing content that incorporates pandas dataframes is that wide dataframes are overflowed in the displayed page. In an HTML output, the wide dataframe is rendered via a scroll box; in a PDF output, the content is just lost over the edge of the page.

Controls are available to limit the width of the displayed dataframe:

# Pandas dataframe display options
import pandas as pd

pd.options.display.width = 80
pd.options.display.max_colwidth = 10
pd.options.display.max_columns = 10

but sometimes you want to see all the columns.
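To see the effect, here’s a quick demonstration (using a throwaway 25 column dataframe) of how capping max_columns makes pandas elide the middle columns in the default printed output:

```python
import pandas as pd

# Cap the number of columns shown in the default repr
pd.options.display.max_columns = 10

# A deliberately wide dataframe
wdf = pd.DataFrame({f"col{x}": range(3) for x in range(25)})

# The middle columns are now elided with "..." in the printed output
print(wdf)
```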

The easiest way of rendering all the columns is to split the wide table over several smaller tables, each with a limited number of columns that do fit in the page width, but I haven’t spotted a simple way to do that (which is really odd, because I suspect this is a widely (?!) desired feature).

Anyway, here’s a first attempt at a simple function to do that:

import pandas as pd
from IPython.display import display

def split_wide_table(df, split=True,
                     hide=None, padding=1, width=None):
    """Split a wide table over multiple smaller tables."""
    if not split:
        display(df)
        return
    # Columns to hide; accept a single column name or a list
    if hide:
        hide = [hide] if isinstance(hide, str) else hide
    else:
        hide = []
    cols = [c for c in df.columns if c not in hide]
    width = width or pd.options.display.width
    # Note: column widths may be much larger than heading widths
    _cols = []
    for col in cols:
        # If adding the next column header would overflow the
        # display width, render the columns collected so far
        if _cols and (len("".join(_cols + [col]))
                      + (len(_cols) + 2) * padding) >= width:
            display(df[_cols])
            _cols = []
        _cols.append(col)
    # Render any remaining columns
    if _cols:
        display(df[_cols])

Given a wide dataframe such as the following:

wdf = pd.DataFrame({f"col{x}":range(3) for x in range(25)})

We can instead render the table over multiple, narrower tables:
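As an aside, if a fixed number of columns per sub-table is good enough, rather than fitting to a character width, a cruder sketch using numpy’s array_split also does the job (the chunk size of 10 columns here is arbitrary):

```python
import math

import numpy as np
import pandas as pd

wdf = pd.DataFrame({f"col{x}": range(3) for x in range(25)})

# Split the column index into chunks of at most 10 columns each
chunk_size = 10
n_chunks = math.ceil(len(wdf.columns) / chunk_size)

for chunk in np.array_split(wdf.columns, n_chunks):
    # In a notebook, display() each sub-table; print() works anywhere
    print(wdf[list(chunk)].to_string(), "\n")
```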

I haven’t really written any decorators before, but it also seems to me that this sort of approach might make sense as a display decorator…

For example, given the decorator:

def split_wide_display(func):
    """Split a wide table returned by a function over multiple tables."""
    def inner(*args, **kwargs):
        resp = func(*args)
        if (not "display" in kwargs) or kwargs.pop("display"):
            # Optionally display the potentially wide
            # table as a set of narrower tables
            split_wide_table(resp, **kwargs)
        return resp
    return inner

we can decorate a function that returns a wide dataframe to optionally display the dataframe over multiple narrower tables:

# Use the fastf1 Formula One data grabbing package
from fastf1 import ergast

@split_wide_display
def ergast_race_results(year, event, session):
    erd_race = ergast.fetch_results(year, event, session)
    df = pd.json_normalize(erd_race)
    return df

It seems really odd that I have to hack this solution together, and that there isn’t a default way of doing this. Or maybe there is, and I just missed it? If so, please let me know via the comments etc…

PS the splitter is far from optimal, because it splits columns on the basis of the cumulative column header widths, rather than the larger of the column header or column content width. I’m not sure what the best way might be of finding the max displayed column content width?
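One possible approach, as a sketch: render each column’s values to strings and take the wider of the header and the longest cell (the helper name here is my own, and this sidesteps pandas’ own float formatting, so it’s only approximate):

```python
import pandas as pd

def displayed_col_width(df, col):
    """Estimate the display width of a column as the wider of the
    column header and the longest rendered cell value."""
    content_width = int(df[col].astype(str).str.len().max())
    return max(len(str(col)), content_width)

df = pd.DataFrame({"name": ["a", "quite long value"], "n": [1, 22]})
print(displayed_col_width(df, "name"))  # → 16 (the widest cell wins)
print(displayed_col_width(df, "n"))     # → 2
```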

PPS is there a Sphinx extension to magically handle splitting overflowing tables into narrower subtables somehow? And a LaTeX one?