Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy

I’ve just been having a quick look at getting custom NER (named entity recognition) working in spacy. The training data seems to be presented as a list of tuples with the form:

(TEXT_STRING,
  {'entities': [(START_IDX, END_IDX, ENTITY_TYPE), ...]})

To simplify things, I wanted to create the training data structure from a simpler representation, a two-tuple of the form (TEXT, PHRASE), where PHRASE is the entity you want to match.

Let’s start by finding the index values of a match phrase in a text string:

import re

def phrase_index(txt, phrase):
    """Return start and end index phrase in string."""
    matches = [(idx.start(),
                idx.start()+len(phrase)) for idx in re.finditer(phrase, txt)]
    return matches

# Example
#phrase_index("this and this", "this")
#[(0, 4), (9, 13)]

I’m using training documents of the following form:

_train_strings = [("The King gave the pauper three gold coins and the pauper thanked the King.", [("three gold coins", "MONEY"), ("King", "ROYALTY")]) ]

We can then generate the formatted training data as follows:

def generate_training_data(_train_strings):
    """Generate training data from text and match phrases."""
    for (txt, items) in _train_strings:
        _ents_list = []
        for (phrase, typ) in items:
            matches = phrase_index(txt, phrase)
            for (start, end) in matches:
                _ents_list.append( (start, end, typ) )
        if _ents_list:
            training_data.append( (txt, {"entities": _ents_list}) )

    return training_data

# Call as:
#generate_training_data(_train_strings)
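If you then want to feed the generated data into the spacy v3 training workflow, the annotated tuples can be packed into a DocBin. Here’s a minimal, untested sketch of that step (the train.spacy output path is just an example):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

for text, annotations in generate_training_data(_train_strings):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        # char_span() returns None if the indices don't align with token boundaries
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("./train.spacy")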

I was a bit surprised this sort of utility doesn’t already exist? Or did I miss it? (I haven’t really read the spacy docs, but then again, spacy seems to keep getting updated…)

Fragments: Structural Search

Several years ago, I picked up a fascinating book, Vladimir Propp’s Morphology of the Folktale. This identifies a set of primitives that can be used to describe the structure of Russian folktales, though they also serve to describe lots of other Western folktales too. (A summary of the primitives identified by Propp can be found here: Propp functions; see also Levi-Strauss and structural analysis of folk-tales.)

Over the weekend, I started wondering whether there are any folk-tale corpora out there annotated with Propp functions that might be used as the basis for a “structural search engine”, or that could perhaps be used to build a model that could attempt to automatically analyse other folktales in structural terms.

Thus far, I’ve found one, as described in ProppLearner: Deeply annotating a corpus of Russian folktales to enable the machine learning of a Russian formalist theory, Mark A. Finlayson, Digital Scholarship in the Humanities, Volume 32, Issue 2, June 2017, Pages 284–300, https://doi.org/10.1093/llc/fqv067 . The paper gives a description of the method used to annotate the tale collection, and also links to a data download containing annotation guides and the annotated collection as an XML file: Supplementary materials for “ProppLearner: Deeply Annotating a Corpus of Russian Folktales to Enable the Machine Learning of a Russian Formalist Theory” [description, zip file]. The bundle includes fifteen annotated tales marked up using the Story Workbench XML format; a guide to the format is also included.

(The Story Workbench is (was?) an Eclipse-based tool for annotating texts. I wonder if anything has replaced it that can also open Story Workbench files? In passing, I note Prodigy, a commercial text annotation tool that integrates tightly with spacy, as well as a couple of free, local-server-powered, browser-based tools, brat and doccano. The metanno package promises “a JupyterLab extension that allows you build your own annotator”, but from the README, I can’t really make sense of what it supports…)

The markup file looks a bit involved, and will take some time to make sense of. It includes POS tags, sentences, extracted events, timex3 / TimeML tags, Proppian functions and archetypes, among other things. To get a better sense of it, I need to build a parser and have a play with it…
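As a first step, something like the following quick sketch might at least surface which tag types are used, and how often; the file name is a placeholder, and the element names will need checking against the bundled format guide:

import xml.etree.ElementTree as ET
from collections import Counter

# Parse one of the annotated tales (placeholder file name)
tree = ET.parse("story_workbench_tale.xml")
root = tree.getroot()

# Count the element types used in the markup as a crude first profile
tag_counts = Counter(el.tag for el in root.iter())
print(tag_counts.most_common(20))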

The TimeML tagging was new to me, and looks like it could be handy. It provides an XML based tagging scheme for describing temporal events, expressions and relationships [Wikipedia]. The timexy Python package provides a spacy pipeline component that “extracts and normalizes temporal expressions” and represents them using TimeML tags.

In passing, I also note a couple of spacy related things, such as a phrase matcher that could be used to reconcile different phrases (eg Finn Mac Cool and Fionn MacCumHail etc): PhraseMatcher, and this note on Training Custom NER models in spacy to auto-detect named entities.
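By way of a quick sketch of the phrase matcher idea (the name variants listed here are purely illustrative):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Variant spellings of the same character name (illustrative only)
variants = ["Finn Mac Cool", "Fionn MacCumhail", "Fionn mac Cumhaill"]
matcher.add("FINN", [nlp.make_doc(v) for v in variants])

doc = nlp("The tales of Finn Mac Cool, also written Fionn mac Cumhaill...")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "::", doc[start:end].text)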

Documenting AI Models

I don’t (currently) work on any AI related courses, but this strikes me as something that could be easily co-opted to support: a) best practice; b) assessment — the use of model cards. Here is where I first spotted a mention of them:

My first thought was whether this could be a “new thing” for including in bibliographic/reference data (eg when citing a module model, you’d ideally cite its model card).

Looking at the model card for the Whisper model, I then started to wonder whether this would also be a Good Thing to teach students to create in order to describe their own models, particularly as the format also looks like the sort of thing you could easily assess: the Whisper model card, for example, includes the following headings:

  • Model Details
  • Release Date
  • Model Type
  • Paper / Samples
  • Model Use
  • Training Data
  • Performance and Limitations
  • Broader Implications

The broader implications heading is an interesting one…

It also struck me that the model card might also provide a useful cover sheet for a data investigation.

Via Jo Walsh/@ultrazool, I was also tipped off to the “documentation” describing the model card approach: https://huggingface.co/docs/hub/models-cards. The blurb suggests:

The model card should describe:

  • the model
  • its intended uses & potential limitations, including biases and ethical considerations
  • the training params and experimental info (you can embed or link to an experiment tracking platform for reference)
  • which datasets were used to train your model
  • your evaluation results
Hugging Face: model cards, https://huggingface.co/docs/hub/models-cards

The model card format is more completely described in Model Cards for Model Reporting, Margaret Mitchell et al., https://arxiv.org/abs/1810.03993 .

A similarly structured card might also be something that could usefully act as a cover sheet / “executive report metadata” for a data investigation?

PS also via Jo, the City of Helsinki AI Register, “a window into the artificial intelligence systems used by the City of Helsinki. Through the register, you can get acquainted with the quick overviews of the city’s artificial intelligence systems or examine their more detailed information”. For more info on that idea, see their Public AI Registers white paper (PDF). And if that sort of thing interests you, you should probably also read Dan McQuillan’s Resisting AI, whose comment on the Helsinki register was “transparency theatre”.

PPS for an example of using Whisper to transcribe, and translate, an audio file, see @electricarchaeo’s Whisper, from OpenAI, for Transcribing Audio.

Using dogsheep-beta to Create a Combined SQLite Free Text Metasearch Engine Over Several SQLite Databases

I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibbs’ Tiny Tales books. The scrapers pull the stories into simple, free text searchable SQLite databases to support discovery of particular stories based on simple, free text search terms. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)

See the technical recipes: World of Tales scraper; Tiny Tales scraper.

The databases I’ve been using aren’t very consistent in the table structure I’m using to store the scraped data, and thus far I’ve tended to search the databases separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.

And it turns out, I can. Really easily…
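(As a reminder of the underlying mechanism, here’s a minimal sketch of ATTACH used directly from Python’s sqlite3 package, using table names taken from the config file shown below; the metasearch.db file name is just an example:)

import sqlite3

# Open (or create) the combined database and attach two of the source databases
conn = sqlite3.connect("metasearch.db")
conn.execute("ATTACH DATABASE 'lang_fairy_tale.db' AS lang")
conn.execute("ATTACH DATABASE 'jacobs_fairy_tale.db' AS jacobs")

# Tables in the attached databases can now be queried in a single statement
rows = conn.execute("""
    SELECT title FROM lang.books
    UNION ALL
    SELECT story_title FROM jacobs.stories
""").fetchall()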

The dogsheep-beta demo essentially bundles two things:

  • a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
  • a datasette extension for rendering a faceted full text search page over the combined database.

The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are type (in the dogsheep world, this is the original data source, eg a tweet, or a github record); key (a unique key for the record); title; timestamp; category; is_public; search_1, search_2, search_3.

I forked the repo to tweak these columns slightly, changing timestamp to pubdate and adding a new book column to replace type with the original book from which a story came; the database records I’m interested in searching over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.

Items in the config file have the form:

DATABASE_FILE.db:
    DB_LABEL:
        sql: |-
            select
              MY_ID as key,
              title,
              MY_TEXT as search_1
            from MY_TABLE

The select MUST be in lower case (a hack in the code searches for the first select in the provided query as part of a query rewrite). Also, the query MUST NOT end with a ;, as the aforementioned query rewrite appends LIMIT 0 to the original query en route to identifying the column headings.

Here’s the config file I used for creating my metasearch database:

lang_fairy_tale.db:
    lang:
        sql: |-
            select
              "lang::" || replace(title, " ", "") as key,
              title,
              book,
              text as search_1
            from books

jacobs_fairy_tale.db:
    jacobs:
        sql: |-
            select
              "jacobs::" || replace(story_title, " ", "") as key,
              story_title as title,
              book_title as book,
              story_text as search_1
            from stories

word_of_tales.db:
    world_of_tales:
        sql: |-
            select
              "wot::" || replace(title, " ", "") as key,
              title,
              book,
              text as search_1
            from tales

ashliman_demo.db:
    ashliman:
        sql: |-
            select
              "dash::" || replace(title, " ", "") as key,
              title as title,
              metadata as book,
              text as search_1
            from ashliman_stories

mtdf_demo.db:
    mtdf:
        sql: |-
            select
              "mtdf::" || replace(title, " ", "") as key,
              title as title,
              text as search_1
            from english_stories

Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:
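(Programmatically, and assuming the default dogsheep-beta search_index table name, which may differ in a tweaked fork, a free text query over the combined index might look something like the following; the search term is just an example:)

import sqlite_utils

db = sqlite_utils.Database("metasearch.db")

# Full text search over the combined index table
for row in db["search_index"].search("golden bird"):
    print(row["title"], "::", row["book"])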

One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). It’s not too hard to generate a simple semantic search over a corpus using doc2vec, as this old demo shows, but the vtfunc trick that demo relies on seems to have rotted in Python 3.10 [update: ensuring Cython is installed fixes the vtfunc install; for another Python wrapper for SQLite, see also apsw]; there may be an alternative way to access a TableFunction via the peewee package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nice with sqlite_utils or pandas.read_sql().

What I’m thinking is, it would be really handy to have a template repo associated with sqlite_utils / datasette that provides tools to:

  • create a metasearch database (dogsheep-beta pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
  • provide an easily customisable index.html template for datasette that gives you a simple free text search UI (related issue);
  • provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the index.html template should then give you the option of running a free text search or a semantic search;
  • (it might also be handy to support other custom fuzzy search functions (example);)
  • a simpler config where you select free text and/or semantic search, and the index/doc2vec builders are applied appropriately, and the index.html template serves queries appropriately.

PS I posted this related item to the sqlite_utils Discord server:

SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.

The coleifer/sqlite-vtfunc package provided a fairly straightforward way of defining custom table functions (eg I used it to create a custom fuzzy search function here:

The recipe is to define a class of the form class My_Custom_Function(TableFunction) and then register it on a connection (My_Custom_Function.register(db.conn)). This worked fine with sqlite_utils database connections and meant you could easily add custom functions to a running server.

However, that package is no longer maintained and seems to be breaking when installing under at least Python 3.10?

An equivalent (ish) function is provided by the peewee package (functions imported from playhouse.sqlite_ext), but it seems to be rather more tightly bound to a SqliteDatabase object, rather than simply being registered to a database connection.

Is there any support for registering such table returning functions in sqlite_utils?
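For reference, the TableFunction recipe mentioned in that post follows a pattern something like the following rough sketch, based on my recollection of the sqlite-vtfunc series example (check the original README for the details; db here is assumed to be a sqlite_utils Database object):

from vtfunc import TableFunction

class GenerateSeries(TableFunction):
    # Declare the SQL function name, its parameters and the columns it returns
    name = 'series'
    params = ['start', 'stop']
    columns = ['value']

    def initialize(self, start=0, stop=None):
        self.current = start
        self.stop = stop

    def iterate(self, idx):
        # Raise StopIteration to signal the end of the virtual table rows
        if self.stop is not None and self.current > self.stop:
            raise StopIteration
        value = self.current
        self.current += 1
        return (value,)

# Register on the connection, then use in SQL, eg: SELECT * FROM series(1, 10);
GenerateSeries.register(db.conn)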

Generating “Print Pack” Material (PDF, epub) From Jupyter Notebooks Using pandoc

For one of the courses I work on, we produce “print pack” materials (PDF, epub), on request, for students who require print, or alternative format, copies of the Jupyter notebooks used in the module. This does and doesn’t make a lot of sense. On the one hand, having a print copy of notebooks may be useful for creating physical annotations. On the other, the notebooks are designed as interactive, and often generative, materials: interactive in the sense that students are expected to run code, as well as modify, create and execute their own code; generative in the sense that outputs are generated by code execution and the instructional content may explicitly refer to things that have been so generated.

In producing the print / alternative format material, we generally render the content from un-run notebooks, which is to say the print material notebooks do not include code outputs. Part of the reason for this is that we want the students to do the work: if we handed out completed worksheets, there’d be no need to work through and complete the worksheets, right?

Furthermore, whilst for some notebooks it may be possible to meaningfully run all cells and then provide executed/run cell notebooks as print materials, in other modules this may not make sense. In our TM129 robotics block, an interactive simulator widget is generated and controlled from the run notebook, and it doesn’t make sense to naively share a notebook with all cells run. Instead, we would have to share screenshots of the simulator widget following each notebook activity, and there may be several such activities in each notebook. (It might be instructive to try to automate the creation of such screenshots, eg using the JupyterLab galata test framework.)

Anyway, I started looking at how to automate the generation of print packs. The following code (gist: https://gist.github.com/psychemedia/6288db9ef97dc17b2fbd909a7516f12b) is hard-wired for a directory structure where the content is in a ./content directory, with subdirectories for each week starting with a two digit week number (01., 02. etc) and notebooks in each week directory numbered according to week/notebook number (eg 01.1, 01.2, …, 02.1, 02.2, … etc.).

# # `print_publication.py`
#
# Script for generating print items (weekly PDF, weekly epub).

# # Install requirements
#
# - Python
# - pandoc
# - python packages:
#   - ipython
#   - nbconvert
#   - nbformat
#   - pymupdf

from pathlib import Path
#import nbconvert
import nbformat
from nbconvert import HTMLExporter
#import pypandoc
import os
import secrets
import shutil
import subprocess

import fitz #pip install pymupdf

html_exporter = HTMLExporter(template_name='classic')

pwd = Path.cwd()
print(f'Starting in: {pwd}')

# +
nb_wd = "content" # Path to weekly content folders
pdf_output_dir = "print_pack" # Path to output dir

# Create print pack output dir if required
Path(pdf_output_dir).mkdir(parents=True, exist_ok=True)
# -

# Iterate through weekly content dirs
# We assume the dir starts with a week number
for p in Path(nb_wd).glob("[0-9]*"):
    print(f'- processing: {p}')
    if not p.is_dir():
        continue

    # Get the week number
    weeknum = p.name.split(". ")[0]

    # Settings for pandoc
    pdoc_args = ['-s', '-V geometry:margin=1in',
                 '--toc',
                 #f'--resource-path="{p.resolve()}"', # Doesn't work?
                 '--metadata', f'title="TM129 Robotics — Week {weeknum}"']

    # cd to week directory
    os.chdir(p)

    # Create a tmp directory for html files
    # Rather than use tempfile, create our own lest we want to persist it
    _tmp_dir = Path(secrets.token_hex(5))
    _tmp_dir.mkdir(parents=True, exist_ok=True)

    # Find notebooks for the current week
    for _nb in Path.cwd().glob("*.ipynb"):
        nb = nbformat.read(_nb, as_version=4)
        # Generate HTML version of document
        (body, resources) = html_exporter.from_notebook_node(nb)
        with open(_tmp_dir / _nb.name.replace(".ipynb", ".html"), "w") as f:
            f.write(body)

    # Now convert the HTML files to PDF
    # We need to run pandoc in the correct directory so that
    # relatively linked image files are correctly picked up.

    # Specify output PDF path
    pdf_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf")
    epub_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.epub")

    # It seems pypandoc is not sorting the files in ToC etc?
    #pypandoc.convert_file(f"{temp_dir}/*html",
    #                      to='pdf',
    #                      #format='html',
    #                      extra_args=pdoc_args,
    #                      outputfile=str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf"))

    # Hacky - requires IPython
    # #! pandoc -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html
    # #! pandoc -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html
    subprocess.call(f'pandoc --quiet -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html', shell=True)
    subprocess.call(f'pandoc --quiet -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html', shell=True)

    # Tidy up tmp dir
    shutil.rmtree(_tmp_dir)

    # Just in case we need to know relatively where we are…
    #Path.cwd().relative_to(pwd)

    # Go back to the home dir
    os.chdir(pwd)

os.chdir(pwd)

# ## Add OU Logo to First Page of PDF
#
# Add an OU logo to the first page of the PDF documents

# +
logo_file = ".print_assets/OU-logo-83x65.png"
img = open(logo_file, "rb").read()

# define the position (upper-left corner)
logo_container = fitz.Rect(60, 40, 143, 105)

for f in Path(pdf_output_dir).glob("*.pdf"):
    print(f'- branding: {f}')
    with fitz.open(f) as pdf:
        pdf_first_page = pdf[0]
        pdf_first_page.insert_image(logo_container, stream=img)
        pdf_out = f.name.replace(".pdf", "_logo.pdf")

        txt_origin = fitz.Point(350, 770)
        text = "Copyright © The Open University, 2022"

        for page in pdf:
            page.insert_text(txt_origin, text)

        pdf.save(Path(pdf_output_dir) / pdf_out)

    # Remove the unbranded PDF
    os.remove(f)

The script uses pandoc to generate the PDF and epub documents, one per weekly directory. The PDF generator also includes a table of contents, automatically generated from headings by pandoc. A second pass using fitz/pymupdf then adds a logo and copyright notice to each PDF.

PDF with post-processed addition of a custom logo
PDF with post-process addition of a copyright footer

Prompt Engineering, aka Generative System Query Optimisation

Over the last few months, I’ve seen increasing numbers of references to prompt engineering in my feeds. Prompt engineering is to generative AI systems — things like GPT-3 (natural language/text generation), CoPilot (code and comment (i.e. code explanation) generation), and DALL-E (text2image generation) — as query optimisation is to search engines: the ability to iterate a query/prompt, in light of the responses it returns, in order to find, or generate, a response that meets your needs.

Back in the day when I used to bait librarians, developing web search skills was an ever-present theme at the library-practitioner conferences I used to regularly attend: advanced search forms and advanced search queries (using advanced search/colon operators, etc.), as well as power search tricks based around URL hacking or the use of particular keywords that reflected your understanding of the search algorithm, etc.

There was also an ethical dimension: should you teach particular search techniques that could be abused in some way, if generally known, but that might be appropriate if you were working in an investigative context: strategies that could be used to profile — or stalk — someone, for example.

So now I’m wondering: are the digital skills folk in libraries starting to look at ways of developing prompt engineering skills yet?

Again, there are ethical questions: for example, should you develop skills or give prompt examples of how to create deepfakes in unfiltered generative models such as Stable Diffusion (Deepfakes for all: Uncensored AI art model prompts ethics questions, h/t Tactical Tech/@info_activism)?

And there are also commercial and intellectual property right considerations. For example, what’s an effective prompt worth and how do you protect it if you are selling it on a prompt marketplace such as PromptBase? Will there be an underground market in “cheats” that can be deployed against certain commercial generative models (because pre-trained models are now also regularly bought and sold), and so on.

There’s also the skills question of how to effectively (and iteratively) work with a generative AI to create and tune meaningful solutions, which requires the important skill of evaluating, and perhaps editing or otherwise tweaking, the responses (cf. evaluating the quality of websites returned in the results of a web search engine), as well as improving or refining your prompts (for example, students who naively enter a complete question as a search engine query or a generative system prompt, compared to those who use the question as a basis for their own, more refined, query that attempts to extract key “features” from the question…).

For education, there’s also the question of how to respond to students using generative systems in open assessments: will we try to “forbid” students from using such systems at all, will we have the equivalent of “calculator” papers in maths exams, where you are expected, or even required, to use the generative tools, or will we try to enter into an arms race with questions which, if naively submitted as a prompt, will return an incorrect answer (in which case, how soon will the generative systems be improved to correctly respond to such prompts…)?

See also: Fragment: Generative Plagiarism and Technologically Enhanced Cheating?

PS For some examples of prompt engineering in practice, Simon Willison has blogged several examples of some of his playful experiments with generative AI:

PPS another generative approach that looks interesting is provided by https://github.com/rinongal/textual_inversion, which allows you to submit several training images of a novel concept, and then generate differently rendered styles of that concept.

Automating the Production of Student Software Guides With Annotated Screenshots Using Playwright and Jupyter Notebooks

With a bit of luck, we’ll be updating software for our databases course starting in Sept/Oct to use JupyterLab. It’s a bit late for settling this, but I find the adrenaline of going live, and the interaction with pathfinder students in particular, to be really invigorating (a bit like the old days of residential schools) so here’s hoping…

[I started this post last night, closed the laptop, opened it this morning and did 75 mins of work. The sleep broke WordPress’ autobackup, so when I accidentally backswiped the page, I seem to have lost over an hour’s work. Which is to say, this post was originally better written, but now it’s rushed. F**k you, WordPress.]

If we are to do the update, one thing I need to do is update the software guide. The software guide looks a bit like this:

Which is to say:

  • process text describing what to do;
  • a (possibly annotated) screenshot showing what to do, or the outcome of doing it;
  • more process text, etc.

Maintaining this sort of documentation can be a faff:

  • if the styling of the target website/application changes, but the structure and process steps remain the same, the screenshots drift from actuality, even if they are still functionally correct;
  • if the application structure or process steps change, then the documentation properly breaks. This can be hard to check for unless you rerun everything and manually test things before each presentation at least.

So is there a better way?

There is, and it’s another example of why it’s useful for folk to learn to code, which is to say, have some sort of understanding about how the web works and how to construct simple computational scripts.

Regular readers will know that I’ve tinkered on and off with the Selenium browser automation toolkit in the past, using it for scraping as well as automating repetitive manual tasks (such as downloading scores of student scripts, one at a time, from one exams system, and grabbing a 2FA code for each of them for use in submitting marks into a second system). But playwright, Microsoft’s browser automation tool (freely available), seems to be what all the cool kids are using now, so I thought I’d try that.

The playwright app itself is a node app, which also makes me twitchy because node is a pain to install. But the playwright Python package, which is installable from PyPi and which wraps playwright, bundles its own version of node, which makes things much simpler. (See Simon Willison’s Bundling binary tools in Python wheels for a discussion of this; for any edtechies out there, this is a really useful pattern, because if students have Python installed, you can use it as a route to deploy other things…)

Just as a brief aside, playwright is also wrapped by @simonw’s shot-scraper command line tool which makes it dead easy to grab screenshots. For example, we can grab a screenshot of the OU home page as simply as typing shot-scraper https://www.open.ac.uk/:

Note that because the session runs in a new, incognito, browser, we get the cookie notice.

We can also grab a screenshot of just a particular, located CSS element: shot-scraper https://www.open.ac.uk/ -s '#ou-org-footer'. See the shot-scraper docs for many more examples.

In many cases, screenshots that appear in OU course materials that mismatch with reality are identified and reported by students. Tools like shot-scraper and pytest can both be used as a part of a documentation testing suite where we create gold master images and then, as required, test “current” screenshots to see if they match distributed originals.
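For example, here’s a minimal, hypothetical sketch of that sort of check using pytest and playwright’s synchronous API; the URL, selector and gold master path are all placeholders, and a perceptual diff would be more robust than the naive byte comparison used here:

from pathlib import Path
from playwright.sync_api import sync_playwright

def test_footer_matches_gold_master(tmp_path):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://www.open.ac.uk/")
        # locator.screenshot() returns the image bytes (and optionally saves them)
        current = page.locator("#ou-org-footer").screenshot(path=tmp_path / "footer.png")
        browser.close()
    gold = Path("gold_masters/footer.png").read_bytes()
    assert current == gold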

But back to the creation or reflowing of documentation.

As well as command line control using shot-scraper, we can also drive playwright from Python code, executed synchronously, as in a pytest test, or asynchronously, as in the case of Python running inside a Jupyter notebook. This is what I spent yesterday exploring, in particular, whether we could create reproducible documentation in the sense of something that has the form text, screenshot, text, … and looks like this:

but is actually created by something that has the form text, code, (code output), text, … and looks like this:

And as you’ve probably guessed, we can.

For some reason, my local version of nbconvert seems to now default to using no-input display settings and I can’t see why (there are no nbconvert settings files that I can see). Anyone got any ideas how/why this might be happening? The only way I can work around it atm is to explicitly enable the display of the inputs: jupyter nbconvert --to pdf --TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=False --TemplateExporter.exclude_input_prompt=False notebook.ipynb.

It’s worth noting a couple of things here:

  • if we reflow the document to generate new output screenshots, they will be a faithful representation of what the screen looks like at the time the document is reflowed. So if the visual styling (and nothing else) has been updated, we can capture the latest version;
  • the code should ideally follow the text description of the process steps, so if the code stops working for some reason, that might suggest the process has changed and so the text might well be broken too.

Generating the automation code requires knowledge of a couple of things:

  • how to write the code itself: the documentation helps here (eg in respect of how to wait for things, how to grab screenshots, how to move to things so they are in focus when you take a screenshot), but a recipe / FAQ / crib sheet would also be really handy;
  • how to find locators.

In terms of finding locators, one way is to do it manually, by using browser developer tools to inspect elements and grab locators for required elements. But another, simpler, way is to record a set of playwright steps using the playwright codegen URL command line tool: simply walk through (click through) the process you want to document in the automatically launched interactive browser, and record the corresponding playwright steps.

With the script recorded, you can then use that as the basis for your screenshot-generating reproducible documentation.

For any interested OU internal readers, there is an example software guide generated using playwright in MS Teams > Jupyter notebook working group team. I’m happy to share any scripts etc I come up with, and am interested to see other examples of using browser automation to test and generate documentation etc.

Referring back to the original software guide, we note that some screenshots have annotations. Another nice feature of playwright is that we can inject JavaScript into the controlled browser and evaluate it in that context, which means we can inject JavaScript that tinkers with the DOM.

The shot-scraper tool has a nifty function that will accept a set of locators for a particular page and, using JavaScript injection, create a div surrounding them that can then be used to grab a screenshot of the area covering those locators.

It’s trivial to amend that function to add a border round an element:

# Based on @simonw's shot-scraper
# https://github.com/simonw/shot-scraper/blob/467a6c16ce5f4139c7ef92ca2efb45fc6de4e8bf/shot_scraper/cli.py#L784
import json
import secrets
import textwrap

from IPython.display import display, Image

def _selector_javascript(selectors, selectors_all,
                         padding=0, border='none', margin='none'):
    selector_to_shoot = "shot-scraper-{}".format(secrets.token_hex(8))
    selector_javascript = textwrap.dedent(
        f"""
    new Promise(takeShot => {{
        let padding = %s;
        let margin = %s;
        let minTop = 100000000;
        let minLeft = 100000000;
        let maxBottom = 0;
        let maxRight = 0;
        let els = %s.map(s => document.querySelector(s));
        // Add the --selector-all elements
        %s.map(s => els.push(...document.querySelectorAll(s)));
        els.forEach(el => {{
            let rect = el.getBoundingClientRect();
            if (rect.top < minTop) {{
                minTop = rect.top;
            }}
            if (rect.left < minLeft) {{
                minLeft = rect.left;
            }}
            if (rect.bottom > maxBottom) {{
                maxBottom = rect.bottom;
            }}
            if (rect.right > maxRight) {{
                maxRight = rect.right;
            }}
        }});
        // Adjust them based on scroll position
        let top = minTop + window.scrollY;
        let bottom = maxBottom + window.scrollY;
        let left = minLeft + window.scrollX;
        let right = maxRight + window.scrollX;
        // Apply padding
        // TO DO - apply margin?
        // margin demo: https://developer.mozilla.org/en-US/docs/Web/CSS/margin
        top = top - padding;
        bottom = bottom + padding;
        left = left - padding;
        right = right + padding;
        let div = document.createElement('div');
        div.style.position = 'absolute';
        div.style.top = top + 'px';
        div.style.left = left + 'px';
        div.style.width = (right - left) + 'px';
        div.style.height = (bottom - top) + 'px';
        div.style.border = '{border}';
        div.style.margin = margin + 'px';
        div.setAttribute('id', %s);
        document.body.appendChild(div);
        setTimeout(() => {{
            takeShot();
        }}, 300);
    }});
    """
        % (
            padding, margin,
            json.dumps(selectors),
            json.dumps(selectors_all),
            json.dumps(selector_to_shoot),
        )
    )
    return selector_javascript, "#" + selector_to_shoot

async def screenshot_bounded_selectors(page, selectors=None,
                                 selectors_all=None,
                                 padding=0,
                                 border='none',
                                 margin='none',
                                 path="test.png",
                                 full_page=False,
                                 show=True):
    # Default locator id in case no selectors are provided
    selector_to_shoot = None
    if selectors or selectors_all:
        selector_javascript, selector_to_shoot = _selector_javascript(
                selectors, selectors_all, padding, border, margin
            )

        # Evaluate javascript
        await page.evaluate(selector_javascript)
        
    if (selectors or selectors_all) and not full_page:
        # Grab screenshot
        await page.locator(selector_to_shoot).screenshot(path=path)
    else:
        if selector_to_shoot:
            await page.locator(selector_to_shoot).hover()
        await page.screenshot(path=path)

    # Should we include a step to
    # then remove the selector_to_shoot and leave the page as it was
    
    if show:
        display(Image(path))
    return path

Then we can use a simple script to grab a screenshot with a highlighted area:

from time import sleep
from playwright.async_api import async_playwright

playwright = await async_playwright().start()

# If we run with headless=False, playwright will launch a
# visible browser we can track progress in
browser = await playwright.chromium.launch(headless = False)

# Create a reference to a browser page
page = await browser.new_page()

# The page we want to visit
PAGE = "https://www.open.ac.uk"

# And a locator for the "Accept all cookies" button
cookie_id = "#cassie_accept_all_pre_banner"

# Load the page
await page.goto(PAGE)

# Accept the cookies
await page.locator(cookie_id).click()

# The selectors we want to screenshot
selectors = []
selectors_all = [".int-grid3"]

# Grab the screenshot
await screenshot_bounded_selectors(page, selectors, selectors_all,
                                   border='5px solid red',
                                   padding=2,
                                   margin=0,
                                   full_page=True)

Or we can just screenshot and highlight the element of interest:

Simon has an open shot-scraper issue on adding things like arrows to the screenshot, so this is obviously something that might repay a bit more exploration.

I note that the Jupyter accessibility docs have a section on the DOM / locator structure of the JupyterLab UI that includes highlighted screenshots annotated, I think, using drawio. It might be interesting to try to replicate / automatically generate those using playwright?

Finally, it’s worth noting that there is another playwright based tool, galata, that provides a set of high level tools for controlling JupyterLab and scripting JupyterLab actions (this will be the subject of another post). However, galata is currently a developer-only tool, in that it expects to run against a wide open Jupyter environment (no authentication), and only a wide open JupyterLab environment. It does this by overriding playwright’s .goto() method to expect a particular JupyterLab locator, which means that if you want to test a deployment that sits behind the Jupyter authenticator (which is the recommended and default way of running a Jupyter server), or you want to go through some browser steps involving arbitrary web pages, you can’t. (I have opened an issue regarding at least getting through Jupyter authentication here, and a related Jupyter discourse discussion here.) What would be useful in the general case would be a trick to use a generic playwright script to automate steps up to a JupyterLab page and then hand over browser state to a galata script. But I don’t know how to “just” do that. Which is why I say this is a developer tool, and as such is hostile to non-developer users, who, for example, might only be able to run scripts against a hosted server accessed through multiple sorts of institutional and Jupyter based authentication. The current set up also means it’s not possible to use galata out of the can for testing a deployment. Devs only, n’est-ce pas?

Displaying Wide pandas Dataframes Over Several, Narrower Dataframes

One of the problems associated with using Jupyter notebooks, and Jupyter Book, for publishing content that incorporates pandas dataframes is that wide dataframes overflow the displayed page. In an HTML output, the wide dataframe is rendered via a scroll box; in a PDF output, the content is just lost over the edge of the page.

Controls are available to limit the width of the displayed dataframe:

# Pandas dataframe display options
pd.options.display.show_dimensions=False
pd.options.display.width = 80
pd.options.display.max_colwidth = 10
pd.options.display.max_columns = 10

but sometimes you want to see all the columns.

The easiest way of rendering all the columns is to split the wide table over several smaller tables, each with a limited number of columns that do fit in the page width, but I haven’t spotted a simple way to do that (which is really odd, because I suspect this is a widely (?!) desired feature…).

Anyway, here’s a first attempt at a simple function to do that:

#| echo: false
import pandas as pd

def split_wide_table(df, split=True,
                     hide=None, padding=1, width=None):
    """Split a wide table over multiple smaller tables."""
    # Note: column widths may be much larger than heading widths
    if hide:
        hide = [hide] if isinstance(hide, str) else hide
    else:
        hide = []
    cols = [c for c in df.columns if c not in hide]
    width = width or pd.options.display.width
    _cols = []
    for col in cols:
        if (len("".join(_cols)) + (len(_cols)+1)*padding) < width:
            _cols.append(col)
        else:
            print("\n")
            display(df[_cols].style.hide(axis="index"))
            # Start the next set of columns with the one that overflowed,
            # otherwise it would be dropped completely
            _cols = [col]
    if _cols:
        print("\n")
        display(df[_cols].style.hide(axis="index"))

Given a wide dataframe such as the following:

wdf = pd.DataFrame({f"col{x}":range(3) for x in range(25)})
wdf

We can instead render the table over multiple, narrower tables:
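For example, a call along the lines of the following (the width value is arbitrary) renders the columns over several tables, each fitting within the specified character width:

split_wide_table(wdf, width=40)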

I haven’t really written any decorators before, but it also seems to me that this sort of approach might also make sense packaged as a decorator…

For example, given the decorator:

def split_wide_display(func):
    """Optionally display a returned wide table as multiple narrower tables."""
    def inner(*args, **kwargs):
        # Pop the display flag so it isn't passed through to split_wide_table()
        show = kwargs.pop("display", True)
        resp = func(*args)
        if show:
            # Optionally display the potentially wide
            # table as a set of narrower tables
            split_wide_table(resp, **kwargs)
        return resp

    return inner

we can decorate a function that returns a wide dataframe to optionally display the dataframe over multiple narrower tables:

# Use fastf1 Formula One data grabbing package
from fastf1 import ergast

@split_wide_display
def ergast_race_results(year, event, session):
    erd_race = ergast.fetch_results(year, event, session)
    df = pd.json_normalize(erd_race)
    return df

It seems really odd that I have to hack this solution together, and there isn’t a default way of doing this? Or maybe there is, and I just missed it? If so, please let me know via the comments etc…

PS the splitter is far from optimal, because it splits columns on the basis of the cumulative column header widths, rather than the larger of the column header or column content width. I’m not sure what the best way might be of finding the max displayed column content width?

PPS is there a Sphinx extension to magically handle splitting overflowing tables into narrower subtables somehow? And a LaTeX one?

Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing

Playing with story stuff, trying to write a simple BibTeX record generator based on archive.org (see a forthcoming post…), I wanted to parse out publisher and location elements from strings such as Joe Bloggs & Sons (London) or London : Joe Bloggs & Sons.

If the string structures are consistent (PUBLISHER (LOCATION) or LOCATION : PUBLISHER, for example), we can start to create a set of regular expression / pattern matching expressions to pull out the fields.
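For example, a quick regex sketch to handle just those two patterns might look something like the following (real records would soon demand more patterns, and more exceptions):

import re

# One pattern per imprint style: PUBLISHER (LOCATION) and LOCATION : PUBLISHER
patterns = [
    re.compile(r"^(?P<publisher>.+?)\s*\((?P<location>[^)]+)\)$"),
    re.compile(r"^(?P<location>[^:]+?)\s*:\s*(?P<publisher>.+)$"),
]

def parse_imprint(s):
    for p in patterns:
        m = p.match(s.strip())
        if m:
            return m.groupdict()
    return None

parse_imprint("Joe Bloggs & Sons (London)")
# {'publisher': 'Joe Bloggs & Sons', 'location': 'London'}
parse_imprint("London : Joe Bloggs & Sons")
# {'location': 'London', 'publisher': 'Joe Bloggs & Sons'}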

But there is another way, albeit one that relies on a much heavier weight computational approach than a simple regex, and that is to do some language modeling and see if we can extract entities of a particular type, such as an organisation and a geographical location.

Something like this, maybe:

import spacy

nlp = spacy.load("en_core_web_sm")

refs = ["Joe Bloggs & Sons, London",
        "Joe Bloggs & Sons, (London)",
        "Joe Bloggs & Sons (London)",
        "London: Joe Bloggs & Sons",
         "London: Joe, Bloggs & Doe"]

for ref in refs:
    doc = nlp(ref)
    print("---")
    for ent in doc.ents:
        print(ref, "::", ent.text, ent.label_)

Which gives as a result:

---
Joe Bloggs & Sons, London :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, London :: London GPE
---
Joe Bloggs & Sons, (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, (London) :: London GPE
---
Joe Bloggs & Sons (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons (London) :: London GPE
---
London: Joe Bloggs & Sons :: London GPE
London: Joe Bloggs & Sons :: Joe Bloggs & Sons ORG
---
London: Joe, Bloggs & Doe :: London GPE
London: Joe, Bloggs & Doe :: Joe, Bloggs & Doe ORG

My natural inclination would probably be to get frustrated writing ever more, and ever more elaborate, regular expressions to try to capture “always one more” outliers in how the publisher/location data might be presented in a single string. But I wonder: is that an outmoded way of doing compute now? Are the primitives we can readily work with now conveniently available as features at a higher representational level of abstraction?

See also things like storysniffer, a Python package that includes a pretrained model for sniffing a URL and estimating whether the related page is likely to contain a news story.

Some folk will say, of course, that these model based approaches aren’t exact or provable. But in the case of my regexer GUESSing at the name of a publisher and a location, there is still uncertainty as to whether the correct response will be provided for an arbitrary string: my regexes might be incorrect, or I might have missed a pattern in likely presented strings. The model based approach will also be uncertain in its ability to correctly identify the publisher and location, but the source of the uncertainty will be different. As a user, though, all I likely care about is that most of the time the approach does work, and does work reasonably correctly. I may well be willing to tolerate, or at least put up with, errors, and don’t really care how or why they arise unless I can do something about it.

Generating and Visualising JSON Schemas

If you’re presented with a 2D tabular dataset (eg a spreadsheet or CSV file), it’s quite easy to get a sense of the data by checking the column names and looking at a few of the values in each column. You can also use a variety of tools that will “profile” or summarise the data in each column for you. For example, a column of numeric values might be summarised by the mean and standard deviation of the values, or a histogram, etc. Geographic co-ordinates might be “summarised” by plotting them onto a map. And so on.
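(For tabular data, something as simple as pandas’ built-in summary often goes a long way; the file name here is a placeholder:)

import pandas as pd

df = pd.read_csv("some_data.csv")  # placeholder file
df.describe(include="all")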

If you’re presented with the data in a JSON file, particularly a large data file, getting a sense of the structure of the dataset can be much harder, particularly if the JSON data is “irregular” (that is, if the records differ in some way; for example, some records might have more fields than other records).

A good way of summarising the structure of a JSON file is via its schema. This is an abstracted representation of the tree structure of the JSON object that extracts the unique keys in the object and the data type of any associated literal values.

One JSON dataset I have played with over the years is rally timing and results data from the WRC live timing service. The structure of some of the returned data objects can be quite complicated, so how can we get a handle on them?

One way is to automatically extract the schema and then visualise it, so here’s a recipe for doing that.

For example:

#%pip install genson
from genson import SchemaBuilder

# Get some data
import json  # needed later for saving the schema to a file
import requests

season_url = "https://api.wrc.com/contel-page/83388/calendar/active-season/"
jdata = requests.get(season_url).json()

# Create a schema object
builder = SchemaBuilder()
builder.add_object(jdata)

# Generate the schema
schema = builder.to_schema()

# Write the schema to a file
with open("season_schema.json", "w") as f:
    json.dump(schema, f)

The schema itself can be quite long and hard to read…

{'$schema': 'http://json-schema.org/schema#',
 'type': 'object',
 'properties': {'seasonYear': {'type': 'integer'},
  'seasonImages': {'type': 'object',
   'properties': {'format16x9': {'type': 'string'},
    'format16x9special': {'type': 'string'},
    'timekeeperLogo': {'type': 'string'},
    'timekeeperLogoDark': {'type': 'string'}},
   'required': ['format16x9',
    'format16x9special',
    'timekeeperLogo',
    'timekeeperLogoDark']},
  'rallyEvents': {'type': 'object',
   'properties': {'total': {'type': 'integer'},
    'items': {'type': 'array',
     'items': {'type': 'object',
      'properties': {'id': {'type': 'integer'},
       'name': {'type': 'string'},

...

To get a sense of the schema, we can visualise it interactively using the json_schema_for_humans visualiser.

from json_schema_for_humans.generate import generate_from_filename

generate_from_filename("season_schema.json",
                      "season_schema.html")

We can open the generated HTML in a web browser (note, this doesn’t seem to render correctly in JupyterLab via IPython.display.HTML; however, we can open the HTML file from the JupyterLab file browser and, as long as we trust it, it will render correctly):

With the HTML trusted, we can then explore the schema:

Wondering: are there other JSON-schema visualisers out there, particularly that work either as a JupyterLab extension, IPython magic or via an appropriate __repr__ display method?