Generating “Print Pack” Material (PDF, epub) From Jupyter Notebooks Using pandoc

For one of the courses I work on, we produce “print pack” materials (PDF, epub), on request, for students who require print, or alternative format, copies of the Jupyter notebooks used in the module. This does and doesn’t make a lot of sense. On the one hand, having a print copy of notebooks may be useful for creating physical annotations. On the other, the notebooks are designed as interactive, and often generative, materials: interactive in the sense that students are expected to run code, as well as modify, create and execute their own code; generative in the sense that outputs are generated by code execution and the instructional content may explicitly refer to things that have been so generated.

In producing the print / alternative format material, we generally render the content from un-run notebooks, which is to say the print material notebooks do not include code outputs. Part of the reason for this is that we want the students to do the work: if we handed out completed worksheets, there’d be no need to work through and complete the worksheets, right?

Furthermore, whilst for some notebooks, it may be possible to meaningfully run all cells and then provide executed/run cell notebooks as print materials, in other modules this may not make sense. In our TM129 robotics block, an interactive simulator widget is generated and controlled from the run notebook, and it doesn’t make sense to naively share a notebook with all cells run. Instead, we would have to share screenshots of the simulator widget following each notebook activity, and there may be several such activities in each notebook. (It might be instructive to try to automate the creation of such screenshots, eg using the JupyterLab galata test framework.)
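
For the notebooks where it does make sense to run all the cells, the execution step is easy enough to bolt on before the HTML conversion using nbconvert’s ExecutePreprocessor. A minimal sketch (the notebook path is made up):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load an unrun notebook (hypothetical path)
nb = nbformat.read("content/01. Getting started/01.1 Example.ipynb", as_version=4)

# Execute all the cells, using the notebook's own directory as the working directory
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "content/01. Getting started"}})

# Save the run notebook; this could then be passed to the HTML exporter used below
nbformat.write(nb, "content/01. Getting started/01.1 Example__run.ipynb")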

Anyway, I started looking at how to automate the generation of print packs. The following code (https://gist.github.com/psychemedia/6288db9ef97dc17b2fbd909a7516f12b) is hard wired for a directory structure where the content is in a ./content directory, with subdirectories for each week starting with a two digit week number (01., 02., etc.) and notebooks in each week directory numbered according to week/notebook number (eg 01.1, 01.2, …, 02.1, 02.2, … etc.).

# # `print_publication.py`
#
# Script for generating print items (weekly PDF, weekly epub).

# # Install requirements
#
# - Python
# - pandoc
# - python packages:
#   - ipython
#   - nbconvert
#   - nbformat
#   - pymupdf

from pathlib import Path
#import nbconvert
import nbformat
from nbconvert import HTMLExporter
#import pypandoc
import os
import secrets
import shutil
import subprocess

import fitz  # pip install pymupdf

html_exporter = HTMLExporter(template_name='classic')

pwd = Path.cwd()
print(f'Starting in: {pwd}')

# +
nb_wd = "content"  # Path to weekly content folders
pdf_output_dir = "print_pack"  # Path to output dir

# Create print pack output dir if required
Path(pdf_output_dir).mkdir(parents=True, exist_ok=True)
# -

# Iterate through weekly content dirs
# We assume the dir starts with a week number
for p in Path(nb_wd).glob("[0-9]*"):
    print(f'- processing: {p}')
    if not p.is_dir():
        continue

    # Get the week number
    weeknum = p.name.split(". ")[0]

    # Settings for pandoc
    pdoc_args = ['-s', '-V geometry:margin=1in',
                 '--toc',
                 #f'--resource-path="{p.resolve()}"', # Doesn't work?
                 '--metadata', f'title="TM129 Robotics — Week {weeknum}"']

    # cd to week directory
    os.chdir(p)

    # Create a tmp directory for html files
    # Rather than use tempfile, create our own lest we want to persist it
    _tmp_dir = Path(secrets.token_hex(5))
    _tmp_dir.mkdir(parents=True, exist_ok=True)

    # Find notebooks for the current week
    for _nb in Path.cwd().glob("*.ipynb"):
        nb = nbformat.read(_nb, as_version=4)
        # Generate HTML version of document
        (body, resources) = html_exporter.from_notebook_node(nb)
        with open(_tmp_dir / _nb.name.replace(".ipynb", ".html"), "w") as f:
            f.write(body)

    # Now convert the HTML files to PDF
    # We need to run pandoc in the correct directory so that
    # relatively linked image files are correctly picked up.

    # Specify output PDF path
    pdf_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf")
    epub_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.epub")

    # It seems pypandoc is not sorting the files in ToC etc?
    #pypandoc.convert_file(f"{temp_dir}/*html",
    #                      to='pdf',
    #                      #format='html',
    #                      extra_args=pdoc_args,
    #                      outputfile=str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf"))

    # Hacky - requires IPython
    # #! pandoc -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html
    # #! pandoc -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html

    subprocess.call(f'pandoc --quiet -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html', shell=True)
    subprocess.call(f'pandoc --quiet -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html', shell=True)

    # Tidy up tmp dir
    shutil.rmtree(_tmp_dir)

    #Just in case we need to know relatively where we are…
    #Path.cwd().relative_to(pwd)

    # Go back to the home dir
    os.chdir(pwd)

os.chdir(pwd)

# ## Add OU Logo to First Page of PDF
#
# Add an OU logo to the first page of the PDF documents

# +
logo_file = ".print_assets/OU-logo-83x65.png"
img = open(logo_file, "rb").read()

# define the position (upper-left corner)
logo_container = fitz.Rect(60, 40, 143, 105)

for f in Path(pdf_output_dir).glob("*.pdf"):
    print(f'- branding: {f}')
    with fitz.open(f) as pdf:
        pdf_first_page = pdf[0]
        pdf_first_page.insert_image(logo_container, stream=img)

        pdf_out = f.name.replace(".pdf", "_logo.pdf")

        txt_origin = fitz.Point(350, 770)
        text = "Copyright © The Open University, 2022"
        for page in pdf:
            page.insert_text(txt_origin, text)

        pdf.save(Path(pdf_output_dir) / pdf_out)

    #Remove the unbranded PDF
    os.remove(f)

The script uses pandoc to generate the PDF and epub documents, one per weekly directory. The PDF generator also includes a table of contents, automatically generated from headings by pandoc. A second pass using fitz/pymupdf then adds a logo and copyright notice to each PDF.

PDF with post-processed addition of a custom logo
PDF with post-processed addition of a copyright footer

Using dogsheep-beta to Create a Combined SQLite Free Text Metasearch Engine Over Several SQLite Databases

I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibbs’ Tiny Tales books. The scrapers pull the stories into simple, free text searchable SQLite databases to support discovery of particular stories based on simple, free text search terms. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)

See the technical recipes: World of Tales scraper; Tiny Tales scraper.

The databases I’ve been using aren’t very consistent in the table structure I’m using to store the scraped data, and thus far I have tended to search the databases separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.

And it turns out, I can. Really easily…

The dogsheep-beta demo essentially bundles two things:

  • a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
  • a datasette extension for rendering a faceted full text search page over the combined database.

The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are type (in the dogsheep world, this is the original data source, eg a tweet, or a github record); key (a unique key for the record); title; timestamp; category; is_public; search_1, search_2, search_3.

I forked the repo to tweak these columns slightly, changing timestamp to pubdate and adding a new book column to replace type with the original book from which a story came; the database records I’m interested in searching over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.

Items in the config file have the form:

DATABASE_FILE.db:
    DB_LABEL:
        sql: |-
            select
              MY_ID as key,
              title,
              MY_TEXT as search_1
            from MY_TABLE

The select MUST be in lower case (a hack in the code searches for the first select in the provided query as part of a query rewrite). Also, the query MUST NOT end with a ;, as the aforementioned query rewrite appends LIMIT 0 to the original query en route to identifying the column headings.

Here’s the config file I used for creating my metasearch database:

lang_fairy_tale.db:
    lang:
        sql: |-
            select
              "lang::" || replace(title, " ", "") as key,
              title,
              book,
              text as search_1
            from books

jacobs_fairy_tale.db:
    jacobs:
        sql: |-
            select
              "jacobs::" || replace(story_title, " ", "") as key,
              story_title as title,
              book_title as book,
              story_text as search_1
            from stories

word_of_tales.db:
    world_of_tales:
        sql: |-
            select
              "wot::" || replace(title, " ", "") as key,
              title,
              book,
              text as search_1
            from tales

ashliman_demo.db:
    ashliman:
        sql: |-
            select
              "dash::" || replace(title, " ", "") as key,
              title as title,
              metadata as book,
              text as search_1
            from ashliman_stories

mtdf_demo.db:
    mtdf:
        sql: |-
            select
              "mtdf::" || replace(title, " ", "") as key,
              title as title,
              text as search_1
            from english_stories

Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:
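
A query along these lines (a sketch: this assumes dogsheep-beta’s default search_index / search_index_fts table names, with the book column coming from my fork) pulls back matching stories from all of the collections at once:

import sqlite3
import pandas as pd

# The combined index database built from the YAML config above (the filename is mine)
conn = sqlite3.connect("metasearch.db")

# Free text search over the combined index;
# search_index / search_index_fts are the default dogsheep-beta table names
q = """
SELECT key, title, book
FROM search_index
WHERE rowid IN (
    SELECT rowid FROM search_index_fts
    WHERE search_index_fts MATCH :q
)
LIMIT 10;
"""

pd.read_sql(q, conn, params={"q": "wolf"})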

One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). However, it’s not too hard to generate a simple semantic search over a corpus using doc2vec, as this old demo shows. Unfortunately, the vtfunc trick that demo relies on seems to have rotted in Py10 [update: ensuring Cython is installed fixes the vtfunc install; for another Python wrapper for sqlite, see also apsw]; there may be an alternative way to access a TableFunction via the peewee package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nice with sqlite_utils or pandas.read_sql().

What I’m thinking is, it would be really handy to have a template repo associated with sqlite_utils / datasette that provides tools to:

  • create a metasearch database (dogsheep-beta pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
  • provide an easily customisable index.html template for datasette that gives you a simple free text search UI (related issue);
  • provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the index.html template should then give you the option of running a free text search or a semantic search;
  • (it might also be handy to support other custom fuzzy search functions (example);)
  • a simpler config where you select free text and/or semantic search, and the index/doc2vec builders are applied appropriately, and the index.html template serves queries appropriately.

PS I posted this related item to the sqlite_utils discord server:

SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.

The coleifer/sqlite-vtfunc package provided a fairly straightforward way of defining custom table functions (eg I used it to create a custom fuzzy search function here:

The recipe is to define a class My_Custom_Function(TableFunction) and then register it on a connection (My_Custom_Function.register(db.conn)). This worked fine with sqlite_utils database connections and meant you could easily add custom functions to a running server.

However, that package is no longer maintained and seems to be breaking when installing in at least Py10?

An equivalent (ish) function is provided by the peewee package (functions imported from playhouse.sqlite_ext), but it seems to be rather more tightly bound to a SqliteDatabase object, rather than simply being registered to a database connection.

Is there any support for registering such table returning functions in sqlite_utils?
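
For reference, the sort of recipe I mean looks roughly like this (a from-memory sketch using the vtfunc flavour of the API; the example function just generates a numeric series):

import sqlite3
from vtfunc import TableFunction  # pip install sqlite-vtfunc

class GenerateSeries(TableFunction):
    """Table returning function: SELECT * FROM generate_series(0, 10, 2)"""
    params = ['start', 'stop', 'step']  # input parameters
    columns = ['value']                 # columns of the returned table
    name = 'generate_series'            # name used in SQL queries

    def initialize(self, start=0, stop=None, step=1):
        self.curr = start
        self.stop = float('inf') if stop is None else stop
        self.step = step

    def iterate(self, idx):
        if self.curr > self.stop:
            raise StopIteration
        value = self.curr
        self.curr += self.step
        return (value,)

conn = sqlite3.connect(':memory:')
GenerateSeries.register(conn)  # register on a connection, eg db.conn for a sqlite_utils Database
print(conn.execute('SELECT * FROM generate_series(0, 10, 2)').fetchall())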

Documenting AI Models

I don’t (currently) work on any AI related courses, but this strikes me as something that could be easily co-opted to support: a) best practice; b) assessment — the use of model cards. Here is where I first spotted a mention of them:

My first thought was whether this could be a “new thing” for including in bibliographic/reference data (eg when citing a model, you’d ideally cite its model card).

Looking at the model card for the Whisper model, I then started to wonder whether this would also be a Good Thing to teach students to create to describe their model, particularly as the format also looks like the sort of thing you could easily assess: the Whisper model card, for example, includes the following headings:

  • Model Details
  • Release Date
  • Model Type
  • Paper / Samples
  • Model Use
  • Training Data
  • Performance and Limitations
  • Broader Implications

The broader implications heading is an interesting one..

It also struck me that the model card might also provide a useful cover sheet for a data investigation.

Via Jo Walsh/@ultrazool, I was also tipped off to the “documentation” describing the model card approach: https://huggingface.co/docs/hub/models-cards. The blurb suggests:

The model card should describe:

  • the model
  • its intended uses & potential limitations, including biases and ethical considerations
  • the training params and experimental info (you can embed or link to an experiment tracking platform for reference)
  • which datasets were used to train your model
  • your evaluation results
Hugging Face: model cards, https://huggingface.co/docs/hub/models-cards

The model card format is more completely described in Model Cards for Model Reporting, Margaret Mitchell et al., https://arxiv.org/abs/1810.03993 .

A card with a largely similar structure might also be something that could usefully act as a cover sheet / “executive report metadata” for a data investigation?

PS also via Jo, City of Helsinki AI Register, “a window into the artificial intelligence systems used by the City of Helsinki. Through the register, you can get acquainted with the quick overviews of the city’s artificial intelligence systems or examine their more detailed information”. For more info on that idea, see their Public AI Registers white paper (pdf). And if that sort of thing interests you, you should probably also read Dan McQuillan’s Resisting AI, whose comment on the Helsinki register was “transparency theatre”.

PPS for an example of using Whisper to transcribe, and translate, an audio file, see @electricarchaeo’s Whisper, from OpenAI, for Transcribing Audio.

Fragments: Structural Search

Several years ago, I picked up a fascinating book on the Morphology of the Folktale by Vladimir Propp. This identifies a set of primitives that could be used to describe the structure of Russian folktales, though they also serve to describe lots of other Western folktales too. (A summary of the primitives identified by Propp can be found here: Propp functions. (See also Levi-Strauss and structural analysis of folk-tales).)

Over the weekend, I started wondering whether there are any folk-tale corpuses out there annotated with Propp functions, that might be used as the basis for a “structural search engine”, or that could perhaps be used to build a model that could attempt to automatically analyse other folktales in structural terms.

Thus far, I’ve found one, as described in ProppLearner: Deeply annotating a corpus of Russian folktales to enable the machine learning of a Russian formalist theory, Mark A. Finlayson, Digital Scholarship in the Humanities, Volume 32, Issue 2, June 2017, Pages 284–300, https://doi.org/10.1093/llc/fqv067. The paper gives a description of the method used to annotate the tale collection, and also links to a data download containing annotation guides and the annotated collection as an XML file: Supplementary materials for “ProppLearner: Deeply Annotating a Corpus of Russian Folktales to Enable the Machine Learning of a Russian Formalist Theory” [description, zip file]. The bundle includes fifteen annotated tales marked up using the Story Workbench XML format; a guide to the format is also included.

(The Story Workbench is (was?) an Eclipse based tool for annotating texts. I wonder if anything has replaced it that also opens the Story Workbench files? In passing, I note Prodigy, a commercial text annotation tool that integrates tightly with spacy, as well as a couple of free local server powered browser based tools, brat and doccano. The metanno promises “a JupyterLab extension that allows you build your own annotator”, but from the README, I can’t really make sense of what it supports…)

The markup file looks a bit involved, and will take some time to make sense of. It includes POS tags, sentences, extracted events, timex3 / TimeML tags, Proppian functions and archetypes, among other things. To get a better sense of it, I need to build a parser and have a play with it…
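
As a first pass at getting a feel for the markup, something like the following just counts the element tags used in one of the annotated tale files (a minimal sketch: the filename is made up and I haven’t checked the actual element names yet):

from collections import Counter
import xml.etree.ElementTree as ET

# Parse one of the Story Workbench annotated tale files (hypothetical filename)
tree = ET.parse("ProppLearner/tale01.xml")

# Tally the element tags to see what sorts of annotation layers are present
tag_counts = Counter(el.tag for el in tree.iter())

for tag, count in tag_counts.most_common():
    print(tag, count)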

The TimeML tagging was new to me, and looks like it could be handy. It provides an XML based tagging scheme for describing temporal events, expressions and relationships [Wikipedia]. The timexy Python package provides a spacy pipeline component that “extracts and normalizes temporal expressions” and represents them using TimeML tags.
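
For example, something like the following should tag temporal expressions in a tale (untested: I’m assuming the component registers itself under a timexy factory name, as per the package docs):

import spacy
from timexy import Timexy  # noqa: F401  (importing registers the "timexy" factory - assumed)

nlp = spacy.load("en_core_web_sm")
# Add the timexy component before the statistical NER so its spans take priority (assumed usage)
nlp.add_pipe("timexy", before="ner")

doc = nlp("The king hunted for seven years and returned home on 24 June 1852.")
for ent in doc.ents:
    print(ent.text, ent.label_)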

In passing, I also note a couple of things spacy related, such as a phrase matcher that could be used to reconcile different phrases (eg Finn Mac Cool and Fionn MacCumHail etc): PhraseMatcher, and this note on Training Custom NER models in spacy to auto-detect named entities.
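
For example, a quick sketch of using PhraseMatcher to spot variant name forms under a single canonical key (the variant spellings here are just illustrative):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Match on lowercased token text so capitalisation differences don't matter
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Register several surface forms under one canonical match key
variants = ["Finn Mac Cool", "Fionn MacCumhail", "Fionn mac Cumhaill"]
matcher.add("FINN_MAC_COOL", [nlp.make_doc(v) for v in variants])

doc = nlp("Fionn MacCumhail, also known as Finn Mac Cool, built the causeway.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)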

Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy

I’ve just been having a quick look at getting custom NER (named entity recognition) working in spacy. The training data seems to be presented as a list of tuples with the form:

(TEXT_STRING,
  {'entities': [(START_IDX, END_IDX, ENTITY_TYPE), ...]})

To simplify things, I wanted to create the training data structure from a simpler representation, a two-tuple of the form (TEXT, [(PHRASE, ENTITY_TYPE), …]), where PHRASE is the entity text you want to match.

Let’s start by finding the index values of a match phrase in a text string:

import re

def phrase_index(txt, phrase):
    """Return the start and end indices of each occurrence of phrase in a string."""
    # Escape the phrase so it is matched as a literal string, not as a regex
    matches = [(idx.start(),
                idx.start() + len(phrase)) for idx in re.finditer(re.escape(phrase), txt)]
    return matches

# Example
#phrase_index("this and this", "this")
#[(0, 4), (9, 13)]

I’m using training documents of the following form:

_train_strings = [("The King gave the pauper three gold coins and the pauper thanked the King.", [("three gold coins", "MONEY"), ("King", "ROYALTY")]) ]

We can then generate the formatted training data as follows:

def generate_training_data(_train_strings):
    """Generate training data from text and match phrases."""
    training_data = []
    for (txt, items) in _train_strings:
        _ents_list = []
        for (phrase, typ) in items:
            matches = phrase_index(txt, phrase)
            for (start, end) in matches:
                _ents_list.append( (start, end, typ) )
        if _ents_list:
            training_data.append( (txt, {"entities": _ents_list}) )

    return training_data

# Call as:
#generate_training_data(_train_strings)
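
For the example training string above, this should give something like:

[('The King gave the pauper three gold coins and the pauper thanked the King.',
  {'entities': [(25, 41, 'MONEY'), (4, 8, 'ROYALTY'), (69, 73, 'ROYALTY')]})]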

I was a bit surprised this sort of utility doesn’t already exist? Or did I miss it? (I haven’t really read the spacy docs, but then again, spacy seems to keep getting updated…)

Creating Rule Based Entity Pattern Matchers in spacy

Via a comment to Creating Training Data Sets for Custom Named Entity Recognition Taggers in spacy by Adam G, I learn that as well as training statistical models (as used in that post) spacy lets you write simple pattern matching rules that can be used to identify entities: Rule-based entity recognition.

import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ROYALTY", "pattern": [{"LOWER": "king"}]},
            {"label": "ROYALTY", "pattern": [{"LOWER": "queen"}]},
            {"label": "ROYALTY", "pattern": [{"TEXT": {"REGEX": "[Pp]rinc[es]*"}}]},
            {"label": "MONEY", "pattern": [{"LOWER": {"REGEX": "(gold|silver|copper)"},
                                             "OP": "?"},
                                            {"TEXT": {"REGEX": "(coin|piece)s?"}}]}]
ruler.add_patterns(patterns)

doc = nlp("The King gave 100 gold coins to the Queen, 1 coin to the prince and  ten COPPER pieces the Princesses")
print([(ent.text, ent.label_) for ent in doc.ents])

"""
[('King', 'ROYALTY'), ('100', 'CARDINAL'), ('gold coins', 'MONEY'), ('Queen', 'ROYALTY'), ('1', 'CARDINAL'), ('coin', 'MONEY'), ('prince', 'ROYALTY'), ('ten', 'CARDINAL'), ('COPPER pieces', 'MONEY'), ('Princesses', 'ROYALTY')]
"""

There is a handy tool for trying out patterns at https://demos.explosion.ai/matcher [example]:

(I note that REGEX is not available in the playground though?)

The playground also generates the pattern match rule for you:

However, if I try that rule in my own pattern match, the cardinal element is not matched?

There appears to be a lot of scope as to what you can put in the patterns to be matched, including parts of speech. Which reminds me that I meant to look at Using Pandas DataFrames to analyze sentence structure, which uses dependency parsing on spacy parsed sentences to pull out relationships, such as people’s names and their associated job titles, from company documents. This probably also means digging into spacy’s Linguistic features.

This also makes me wonder again about the extent to which it might be possible to extract certain Propp functions from sentences parsed using spacy and explicit pattern matching rules on particular texts, with bits of hand tuning (eg using hand crafted rules in the identification of actors)?

PS I guess if this part of the pipeline is creating the entity types, they may not be available to the matcher, even if ENT_TYPE is allowed as part of the rule conditions? In which case, can we fettle the pipeline somehow so we can get rules to match on previously identified entity types?
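
As a partial answer to my own question, the token based Matcher does at least support ENT_TYPE as a pattern attribute, so if we run a matcher over a doc after the ner / entity_ruler components have set doc.ents, rules can condition on the previously identified entity types. A quick sketch (the pattern here is just illustrative):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# Match a CARDINAL entity token, an optional metal word, then a coin/piece word
matcher.add("COIN_AMOUNT", [[
    {"ENT_TYPE": "CARDINAL"},
    {"LOWER": {"REGEX": "(gold|silver|copper)"}, "OP": "?"},
    {"LOWER": {"REGEX": "(coin|piece)s?"}}
]])

doc = nlp("The King gave 100 gold coins to the Queen and ten copper pieces to the Princess.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)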

Fragment: Is English, Rather than Maths, a Better Grounding for the Way Large Areas of Computing are Going?

Or maybe Latin…?

Having a quick play with the spacy natural language processing package just now, in part because I started wondering again about how to reliably extract “facts” from the MPs’ register of interests (I blame @fantasticlife for bringing that to mind again; data available via a @simonw datasette here: simonw/register-of-members-interests-datasette).

Skimming over the Linguistic Features docs, I realise that I probably need to brush up on my grammar. It also helped crystallise out a bit further some niggling concerns I’ve got about what form practical computing is taking, or might take in the near future, based on the sorts of computational objects we can readily work with.

Typically when you start teaching computing, you familiarise learners with the primitive types of computational object that they can work with: integers, floating point numbers, strings (that is, text), then slightly more elaborate structures, such as Python lists or dictionaries (dicts). You might then move up to custom defined classes, and so on. What sort of thing a computational object is largely determines what you can do with it, and what you can extract from it.

Traditionally, getting stuff out of natural language, free text strings has been a bit fiddly. If you’re working with a text sentence, represented typically as a string, one way of trying to extract “facts” from it is to pattern match on it. This means that strings (texts) with a regular structure are among the easier things to work with.

As an example, a few weeks ago I was trying to reliably parse out the name of a publisher, and the town/city of publication, from free text citation data such as Joe Bloggs & Sons (London) or London: Joe Bloggs & Sons (see Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing). A naive approach to this might be to try to write some templated regular expression pattern matching rules. A good way of understanding how this can work is to consider their opposite: template generators.

Suppose I have a data file with a list of book references; the data has a column for PUBLISHER, and one for CITY. If I want to generate a text based citation, I might create different rules to generate citations of different forms. For example, the template {PUBLISHER} ({CITY}) might display the publisher and then the city in brackets (the curly brackets show we want to populate the string with a value pulled from a particular column in a row of data); or the template {CITY}: {PUBLISHER} the city, followed by a colon, and then the publisher.

If we reverse this process, we then need to create a rule that can extract the publisher and the city from the text. This may or may not be easy to do. If our text phrase only contains the publisher and the city, things should be relatively straightforward. For example, if all the references were generated by a template {PUBLISHER} ({CITY}), I could just match on something like THIS IS THE PUBLISHER TEXT (THIS IS THE CITY): anything between the brackets is the city, anything before the brackets is the publisher. (That said, if the publisher name included something in brackets, things might start to break…). However, if the information appears in a more general context, things are likely to get more complicated quite quickly. For example, suppose we have the text “the book was published by Joe Bloggs and Sons in London“. How then do we pull out the publisher name? What pattern are we trying to match?
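
By way of a simple sketch of the templating and its naive regular expression inverse (for the easy, regularly structured case only):

import re

# Generate a citation string from structured data using a template
record = {"PUBLISHER": "Joe Bloggs & Sons", "CITY": "London"}
citation = "{PUBLISHER} ({CITY})".format(**record)
print(citation)  # Joe Bloggs & Sons (London)

# Invert the template: anything inside the final brackets is the city,
# anything before them is the publisher
pattern = r"^(?P<PUBLISHER>.+?)\s*\((?P<CITY>[^)]+)\)$"
print(re.match(pattern, citation).groupdict())
# {'PUBLISHER': 'Joe Bloggs & Sons', 'CITY': 'London'}

# But the same pattern tells us nothing about a free text sentence...
text = "the book was published by Joe Bloggs and Sons in London"
print(re.match(pattern, text))  # None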

When I first looked at this problem, I started writing rules at the character level in the string. But then it occurred to me that I could use a more structured, higher level representation of the text based on “entities”. The spacy Python package, for example, provides a range of tools for representing natural language text at much higher levels of representation than simple text strings. For example, give spacy a text document, and it then lets you work with it at the sentence level, or the word level. It will also try to recognise “entities”: dates, numerical values, monetary values, people’s names, organisation names, geographical entities (or geopolitical entities, GPEs), and so on. With a sentence structured this way, I should be able to parse my publisher/place sentence and extract the recognised organisation and place:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The book was published by Joe Bloggs & Sons of London.")
print([(ent.text, ent.label_) for ent in doc.ents])

"""
[('Joe Bloggs & Sons', 'ORG'), ('London', 'GPE')]
"""

As well as using “AI”, which is to say, trained statistical models, to recognise entities, spacy will also let you define rules to match your entities. These definitions can be based on matching explicit strings, or regular expression patterns (see a quick example in Creating Rule Based Entity Pattern Matchers in spacy) as well as linguistic “parts of speech” features: nouns, verbs, prepositions etc etc.

I don’t really remember when we did basic grammar at school, back in the day, but I do remember that I learned more about tenses in French and Latin than I ever did in English, and I learned more about grammar in general in Latin than I ever did in English.

So I wonder: how is grammar taught in school now? I wonder whether things like the spacy playground at https://demos.explosion.ai/matcher exist as interactive teaching / tutorial playground tools in English lessons, as a way of allowing learners to explore grammatical structures in a text?

In addition, I wonder whether tools like that are used in IT and computing lessons, where pupils get to explore writing linguistic pattern matching rules in support of teaching “computational thinking” (whatever that is…), and then build their own information extracting tools using their own rules.

Certainly, to make best progress, a sound understanding of grammar would help when writing rules, so where should that grammar be taught? In computing, or in English? (This is not to argue that English teaching should become subservient to the computing curriculum.) Whilst there is a risk that the joy of language is taught instrumentally, with a view to how it can be processed by machines, or how we can (are forced to?) modify our language to get machines to “accept our input” or do what we want them to do, this sort of knowledge might also help provide us with defences against computational processing and obviously opinionated (in a design sense) textual interfaces. (Increasing amounts of natural language text are parsed and processed by machines, and used to generate responses: chat bots, email responders, SMS responders, search engines, etc etc. In one sense it’s wrong that we need to understand how our inputs may be interpreted by machines, particularly if we find ourselves changing how we interact to suit the machine. On the other, a little bit of knowledge may also give us power over those machines… [This is probably all sounding a little paranoid, isn’t it?! ;-)])

And of course, all the above is before we even start to get on to the subject of how to generate texts using grammar based templating tools. If we do this for ourselves, it can be really powerful. If we realise this is being done to us (and spot tells that suggest we are being presented with machine generated texts), it can give us a line of defence (paranoia again?! ;-).

Fragment: On Failing to Automate the Testing of a Jupyter Notebook Environment Using playwright and Github Actions

One of those days (again) when I have no idea whether what I’m trying to do is possible or not, or whether I’m doing something wrong…

On my local machine, everything works fine:

  • a customised classic notebook Jupyter environment is running in a Docker container on my local machine, with a port mapped on the host’s localhost network;
  • the playwright browser automation framework is installed on the host machine, with a test script that runs tests against the Jupyter environment exposed on the localhost network;
  • the tests I’m running are to log in to the notebook server, grab a screenshot, then launch a Jupyter server proxy mapped OpenRefine environment and grab another screenshot.

So far, so simples: I can run tests locally. But it’d be nice to be able to do that via Github Actions. So my understanding of Github Actions is that I should be able to run my notebook container as a service using a service container, and then use the same playwright script as for my local tests.

But it doesn’t seem to work:-( [UPDATE: seems like the image was borked somehow; updated that image (using the same tag)… Connection to the server works now..]

Here’s my playwright script:

import { test, expect, PlaywrightTestConfig } from '@playwright/test';

// Create test screenshots
// npx playwright test --update-snapshots


// Allow minor difference - eg in the new version
const config: PlaywrightTestConfig = {
  expect: {
    toHaveScreenshot: { maxDiffPixels: 100 },
  },
};
export default config;

test('test', async ({ page }) => {
  // Go to http://localhost:8888/login?next=%2Ftree%3F
  await page.goto('http://localhost:8888/login?next=%2Ftree%3F');
  await expect(page).toHaveScreenshot('notebook-login-page.png');
  // Click text=Password or token: Log in >> input[name="password"]
  await page.locator('text=Password or token: Log in >> input[name="password"]').click();
  // Fill text=Password or token: Log in >> input[name="password"]
  await page.locator('text=Password or token: Log in >> input[name="password"]').fill('letmein');
  // Press Enter
  await page.locator('text=Password or token: Log in >> input[name="password"]').press('Enter');
  await expect(page).toHaveURL('http://localhost:8888/tree?');
  await expect(page).toHaveScreenshot('notebook_homepage-test.png');
  // Click text=New Toggle Dropdown
  await page.locator('text=New Toggle Dropdown').click();
  await expect(page).toHaveScreenshot('notebook_new-test.png');
  // Click a[role="menuitem"]:has-text("OpenRefine")
  const [page1] = await Promise.all([
    page.waitForEvent('popup'),
    page.locator('a[role="menuitem"]:has-text("OpenRefine")').click()
  ]);
  await page1.waitForSelector('#right-panel');
  await expect(page1).toHaveScreenshot('openrefine-homepage.png');
});

And here’s my Github Action script:

name: test

on:
  workflow_dispatch:

# This job installs dependencies and runs the tests
jobs:
  notebook-tests:
    runs-on: ubuntu-latest
    services:
      jupyter-tm351:
        image: ouvocl/vce-tm351-monolith:22j-b2
        ports:
        - 8888:8888
    steps:
    - uses: actions/checkout@v3

    - uses: actions/setup-node@v3
      with:
        node-version: 16
    - name: Install playwright dependencies
      run: |
        npx playwright install-deps
        npx playwright install
        npm install -D @playwright/test
    # Test setup
    - name: Test screenshots
      run: |
        npx playwright test

It’s been so nice not doing any real coding over the last few weeks. I realise again I don’t really have a clue how any of this stuff works, and I don’t really find code tinkering to be enjoyable at all any more.

FWIW, the test repo I was using is here. If you spot what I’m doing wrong, please let me know via the comments… FIXED NOW – SEEMS TO WORK

PS hmmm… maybe that container was broken in starting up… not sure how; I have a version running locally?? Maybe my local tag version is out of synch with pushed version somehow? FIXED NOW – SEEMS TO WORK

PPS I just added another action that will generate gold master screenshots and temporarily stash them as action artefacts. Much as the above, except with an npx playwright test --update-snapshots line to create the gold master screenshots (which assumes things are working as they should….) and then a step to save the screenshot image artefacts.

name: generate-gold-master-images

on:
  workflow_dispatch:

# This job installs dependencies, generates gold master screenshots, and stashes them as artefacts
jobs:
  notebook-tests:
    runs-on: ubuntu-latest
    services:
      jupyter-tm351:
        image: ouvocl/vce-tm351-monolith:22j-b2
        ports:
        - 8888:8888
    steps:
    - uses: actions/checkout@v3

    - uses: actions/setup-node@v3
      with:
        node-version: 16
    - name: Install playwright dependencies
      run: |
        npx playwright install-deps
        npx playwright install
        npm install -D @playwright/test

    # Generate screenshots
    - name: Generate screenshots
      run: |
        npx playwright test --update-snapshots

    - uses: actions/upload-artifact@v3
      with:
        name: test-images
        path: tests/openrefine.spec.ts-snapshots/

PostgreSQL Running in the Browser via WASM

I’ve previously written about WASM support for in-browser SQLite databases, as well as using the DuckDB query engine to run queries over various data sources from a browser based terminal (for example, SQL Databases in the Browser, via WASM: SQLite and DuckDB and Noting Python Apps, and Datasette, Running Purely in the Browser), but now it seems we can also get access to a fully blown PostgreSQL database in the browser via snaplet/postgres-wasm (announcement; demo; another demo). At the moment, I don’t think there’s an option to save and reload the database as a browser app, so you’ll need to initially load it into a tab from either a local or remote webserver (so it’s not completely serverless yet…).

Key points that jump out at me from the full demo:

  • you get a psql terminal in the browser that lets you run psql commands as well as SQL queries;
  • you can save and load the database state into browser storage:

You can also save and load the database state to/from a desktop file.

  • a web proxy service is available that lets you query the database from a remote connection; that is, the db running in your browser can be exposed via a web proxy and you can connect to it over the network. For example, I connected to the proxy from a Python Jupyter kernel running in a Docker container on my local machine; the database was running in a browser on the same machine.

From an educational perspective, the advantage of having access to a fully blown DBMS engine, rather than just a simple SQLite database, for example, is that you get access not only to the psql command line, but also to database management tools such as roles and role based permissions. Which means you can teach a lot more purely within the browser.

Note that I think a webserver is still required to load the environment until such a time as a PWA/progressive web app version is available (I don’t think datasette-lite is available as a PWA yet either? [issue]).

In terms of managing a learning environment, one quick and easy way would be to run two open browser windows side by side: one containing the instructional material, the other containing the terminal.

Questions that immediately come to mind:

  • What’s the easiest way to run the proxy service on localhost?
  • Is it in-principle possible for the database server to run as a browser app and publish its service from there onto the localhost network?
  • Is there in-principle way for the database server to run in one browser tab and expose itself to a client running in another browser tab?
  • Can you have multiple connections onto the same browser storage persisted database from clients open in different/multiple tabs, or would you have to hack the common storage by repeatedly saving and loading state from each tab?
  • At the moment, we can’t connect to postgres running wheresoever from a in-browser Python/Pyodide environment (issue), which means we can’t connect to it from a JupyterLite Python kernel. Would it be possible to create some sort of JupyterLite shim so that you could load a combined JupyterLite+postgres-wasm environment to give you access to in-browser postgres db storage via JupyterLite notebook scripts?
  • How easy would it be to fork the jupyterlite/xeus-sqlite-kernel to create a xeus-postgres-wasm kernel? How easy would it be to also bundle in pandas and some sql magic for a pandas/postgres hybrid (even if you had access to no other Python commands than pd. methods, and what would that feel like to use?!), along with support for pandas plots/charts?
  • How easy would be to wire in some custom visual chart generating Postgres functions?!
  • With a python-postgres-wasm, could you support the creation of Postgres/Python custom functions?

It could be so much fun to work on a computing course that tried to deliver everything via the browser using a progressive web app, or at most a web-server…

In Search of JupyterLab Workspaces Config Files

Way, way, way back when JupyterLab was first mooted, I raised concerns about the complexity of an IDE vs the simplicity of the notebook document views, and was told not to worry because there would be an easily customisable workspace facility whereby you could come up with a custom layout, save a config file and then ship that view of the IDE.

This would have been brilliant for educational purposes, because it means you can customise and ship different desktop layouts for different purposes. For example, in one activity, you might ship an instructional rendered markdown document with just text based instructions in one panel, and a terminal in the panel next to it.

In another, you might pre-open a particular notebook, and so on.

In a third, you might ship a pre-opened interactive map preview, and so on.

What this does is potentially reduce the time to get started on an activity: “load this and read what’s in front of you” rather than “just load this, then open that thing from the file menu, in that directory, and then open that other thing, then click on the top of that thing and drop it so that you get two things side by side, then…” etc etc.

Whilst workspaces were available early on, they never had more than a rare demo, v. little docs and no tutorials, so it was really tricky to find a way into experimenting with them, let alone using them.

Anyway, I’ve started trying again, and again fall at the first hurdle; there are no menu options I can see (no Save workspace, no Save Workspace As…, no Open Workspace, no Reset Workspace etc) but there are some URLs for cloning one workspace from another (but not cloning the current default workspace into another one that you can then name?).

It’s also not obvious where the workspace config files are. I think their default home is /home/jovyan/.jupyter/lab/workspaces/ but they look fiddly to hand craft. Also, I’m not sure if/where/how you can share a custom default workspace file (eg how should it be named; what path does it need to be on?)

Creating a custom config file manually (which IIRC was also one of the early promises) is probably also not an option; the best way seems to be to mess around with the panels in a workspace until you get them arranged just so, then try to poke around the /home/jovyan/.jupyter/lab/workspaces/ directory until you find one that looks like it encodes the space you want.

Having found myself defaulting to VS Code more and more, using VS Code Jupyter notebooks running against Jupyter kernels inside local docker containers, as well as spending time in RStudio again (accessing quarto workflows etc), the main thing I get from Jupyter UIs seems to be frustration:-(