Using dogsheep-beta to Create a Combined SQLite Free Text Metasearch Engine Over Several SQLite Databases

I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibb’s Tiny Tales books. The scrapers pull the stories into simple, free text searchable SQLite databases to support discovery of particular stories based on simple, free text search terms. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)

See the technical recipes: World of Tales scraper; Tiny Tales scraper.

The databases I’ve been using aren’t very consistent in the table structure used to store the scraped data, and thus far I have tended to search the databases separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.

And it turns out, I can. Really easily…

The dogsheep-beta demo essentially bundles two things:

  • a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
  • a datasette extension for rendering a faceted full text search page over the combined database.

The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are type (in the dogsheep world, this is the original data source, eg a tweet, or a github record); key (a unique key for the record); title; timestamp; category; is_public; search_1, search_2, search_3.

I forked the repo to tweak these columns slightly, changing timestamp to pubdate and adding a new book column to replace type with the original book from which a story came; the database records I’m interested in searching over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.

Items in the config file have the form:

DATABASE_FILE.db:
    DB_LABEL:
        sql: |-
            select
              MY_ID as key,
              title,
              MY_TEXT as search_1
            from MY_TABLE

The select MUST be in lower case (a hack in the code searches for the first select in the provided query as part of a query rewrite). Also, the query MUST NOT end with a ;, as the aforementioned query rewrite appends LIMIT 0 to the original query en route to identifying the column headings.

Here’s the config file I used for creating my metasearch database:

lang_fairy_tale.db:
    lang:
        sql: |-
            select
              "lang::" || replace(title, " ", "") as key,
              title,
              book,
              text as search_1
            from books

jacobs_fairy_tale.db:
    jacobs:
        sql: |-
            select
              "jacobs::" || replace(story_title, " ", "") as key,
              story_title as title,
              book_title as book,
              story_text as search_1
            from stories

word_of_tales.db:
    world_of_tales:
        sql: |-
            select
              "wot::" || replace(title, " ", "") as key,
              title,
              book,
              text as search_1
            from tales

ashliman_demo.db:
    ashliman:
        sql: |-
            select
              "dash::" || replace(title, " ", "") as key,
              title as title,
              metadata as book,
              text as search_1
            from ashliman_stories

mtdf_demo.db:
    mtdf:
        sql: |-
            select
              "mtdf::" || replace(title, " ", "") as key,
              title as title,
              text as search_1
            from english_stories
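
With the config file saved (for example as metasearch.yml, a name used here purely for illustration), the combined full text search database can then be built with something like dogsheep-beta index metasearch.db metasearch.yml and served with datasette metasearch.db — assuming my forked version of the indexer keeps the original dogsheep-beta command line interface.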

Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:

One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). However, it’s not too hard to generate a simple semantic search over a corpus using doc2vec, as this old demo shows. Unfortunately, the vtfunc trick that demo relies on seems to have rotted in Py3.10 [update: ensuring Cython is installed fixes the vtfunc install; for another Python wrapper for SQLite, see also apsw]; there may be an alternative way to access a TableFunction via the peewee package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nice with sqlite_utils or pandas.read_sql().
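
For reference, here’s a minimal sketch of the doc2vec idea using gensim (this is not the original demo code; story_texts is assumed to be a list of (key, text) tuples pulled from the scraped databases):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Build tagged documents from (key, text) tuples
docs = [TaggedDocument(simple_preprocess(text), [key]) for key, text in story_texts]
model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)

# "Semantic search": find the stories most similar to a free text query
query_vec = model.infer_vector(simple_preprocess("a wolf tricks a young girl"))
for key, similarity in model.dv.most_similar([query_vec], topn=5):
    print(key, similarity)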

What I’m thinking is, it would be really handy to have a template repo associated with sqlite_utils / datasette that provides tools to:

  • create a metasearch database (dogsheep-beta pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
  • provide an easily customisable index.html template for datasette that gives you a simple free text search UI (related issue);
  • provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the index.html template should then give you the option of running a free text search or a semantic search;
  • (it might also be handy to support other custom fuzzy search functions (example);)
  • a simpler config where you select free text and/or semantic search, and the index/doc2vec builders are applied appropriately, and the index.html template serves queries appropriately.

PS I posted this related item to the sqlite_utils Discord server:

SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.

The coleifer/sqlite-vtfunc package provided a fairly straightforward way of defining custom table functions (eg I used it to create a custom fuzzy search function here).

The recipe is to define a class My_Custom_Function(TableFunction) and then register it on a connection (My_Custom_Function.register(db.conn)). This worked fine with sqlite_utils database connections and meant you could easily add custom functions to a running server.
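
A minimal sketch of that recipe (the series example is purely illustrative; the class definition pattern is common to both sqlite-vtfunc and playhouse.sqlite_ext):

from playhouse.sqlite_ext import TableFunction

class Series(TableFunction):
    """Illustrative table function: return integers from start to stop."""
    params = ['start', 'stop']
    columns = ['value']
    name = 'series'

    def initialize(self, start=0, stop=10):
        self.curr = start
        self.stop = stop

    def iterate(self, idx):
        if self.curr > self.stop:
            raise StopIteration
        value = self.curr
        self.curr += 1
        return (value,)

# With sqlite-vtfunc, registration was against a connection:
#   Series.register(db.conn)
# after which: SELECT * FROM series(1, 5);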

However, that package is no longer maintained and seems to be breaking when installing under at least Py3.10?

An equivalent (ish) function is provided by the peewee package (functions imported from playhouse.sqlite_ext), but it seems to be rather more tightly bound to a SqliteDatabase object, rather than simply being registered to a database connection.

Is there any support for registering such table returning functions in sqlite_utils?

Generating “Print Pack” Material (PDF, epub) From Jupyter Notebooks Using pandoc

For one of the courses I work on, we produce “print pack” materials (PDF, epub), on request, for students who require print, or alternative format, copies of the Jupyter notebooks used in the module. This does and doesn’t make a lot of sense. On the one hand, having a print copy of notebooks may be useful for creating physical annotations. On the other, the notebooks are designed as interactive, and often generative, materials: interactive in the sense that students are expected to run code, as well as modify, create and execute their own code; generative in the sense that outputs are generated by code execution and the instructional content may explicitly refer to things that have been so generated.

In producing the print / alternative format material, we generally render the content from un-run notebooks, which is to say the print material notebooks do not include code outputs. Part of the reason for this is that we want the students to do the work: if we handed out completed worksheets, there’d be no need to work through and complete the worksheets, right?

Furthermore, whilst for some notebooks it may be possible to meaningfully run all cells and then provide executed/run cell notebooks as print materials, in other modules this may not make sense. In our TM129 robotics block, an interactive simulator widget is generated and controlled from the run notebook, and it doesn’t make sense to naively share a notebook with all cells run. Instead, we would have to share screenshots of the simulator widget following each notebook activity, and there may be several such activities in each notebook. (It might be instructive to try to automate the creation of such screenshots, eg using the JupyterLab galata test framework.)

Anyway, I started looking at how to automate the generation of print packs. The following code (https://gist.github.com/psychemedia/6288db9ef97dc17b2fbd909a7516f12b) is hard wired for a directory structure where the content is in the ./content directory, with subdirectories for each week starting with a two digit week number (01., 02. etc) and notebooks in each week directory numbered according to week/notebook number (eg 01.1, 01.2, …, 02.1, 02.2, … etc.).

# # `print_publication.py`
#
# Script for generating print items (weekly PDF, weekly epub).
#
# # Install requirements
#
# - Python
# - pandoc
# - Python packages:
#   - ipython
#   - nbconvert
#   - nbformat
#   - pymupdf

from pathlib import Path
#import nbconvert
import nbformat
from nbconvert import HTMLExporter
#import pypandoc
import os
import secrets
import shutil
import subprocess

import fitz  # pip install pymupdf

html_exporter = HTMLExporter(template_name='classic')

pwd = Path.cwd()
print(f'Starting in: {pwd}')

# +
nb_wd = "content"  # Path to weekly content folders
pdf_output_dir = "print_pack"  # Path to output dir

# Create print pack output dir if required
Path(pdf_output_dir).mkdir(parents=True, exist_ok=True)
# -

# Iterate through weekly content dirs
# We assume the dir starts with a week number
for p in Path(nb_wd).glob("[0-9]*"):
    print(f'- processing: {p}')
    if not p.is_dir():
        continue

    # Get the week number
    weeknum = p.name.split(". ")[0]

    # Settings for pandoc
    pdoc_args = ['-s', '-V geometry:margin=1in',
                 '--toc',
                 #f'--resource-path="{p.resolve()}"', # Doesn't work?
                 '--metadata', f'title="TM129 Robotics — Week {weeknum}"']

    # cd to week directory
    os.chdir(p)

    # Create a tmp directory for html files
    # Rather than use tempfile, create our own in case we want to persist it
    _tmp_dir = Path(secrets.token_hex(5))
    _tmp_dir.mkdir(parents=True, exist_ok=True)

    # Find notebooks for the current week
    for _nb in Path.cwd().glob("*.ipynb"):
        nb = nbformat.read(_nb, as_version=4)
        # Generate HTML version of document
        (body, resources) = html_exporter.from_notebook_node(nb)
        with open(_tmp_dir / _nb.name.replace(".ipynb", ".html"), "w") as f:
            f.write(body)

    # Now convert the HTML files to PDF
    # We need to run pandoc in the correct directory so that
    # relatively linked image files are correctly picked up.

    # Specify output PDF and epub paths
    pdf_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf")
    epub_out = str(pwd / pdf_output_dir / f"tm129_{weeknum}.epub")

    # It seems pypandoc is not sorting the files in ToC etc?
    #pypandoc.convert_file(f"{temp_dir}/*html",
    #                      to='pdf',
    #                      #format='html',
    #                      extra_args=pdoc_args,
    #                      outputfile=str(pwd / pdf_output_dir / f"tm129_{weeknum}.pdf"))

    # Hacky - requires IPython
    # #! pandoc -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html
    # #! pandoc -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html
    subprocess.call(f'pandoc --quiet -s -o {pdf_out} -V geometry:margin=1in --toc --metadata title="TM129 Robotics — Week {weeknum}" {_tmp_dir}/*html', shell=True)
    subprocess.call(f'pandoc --quiet -s -o {epub_out} --metadata title="TM129 Robotics — Week {weeknum}" --metadata author="The Open University, 2022" {_tmp_dir}/*html', shell=True)

    # Tidy up tmp dir
    shutil.rmtree(_tmp_dir)

    # Just in case we need to know relatively where we are...
    #Path.cwd().relative_to(pwd)

    # Go back to the home dir
    os.chdir(pwd)

os.chdir(pwd)

# ## Add OU Logo to First Page of PDF
#
# Add an OU logo to the first page of the PDF documents

# +
logo_file = ".print_assets/OU-logo-83x65.png"
img = open(logo_file, "rb").read()

# Define the position (upper-left corner) of the logo
logo_container = fitz.Rect(60, 40, 143, 105)

for f in Path(pdf_output_dir).glob("*.pdf"):
    print(f'- branding: {f}')
    with fitz.open(f) as pdf:
        pdf_first_page = pdf[0]
        pdf_first_page.insert_image(logo_container, stream=img)
        pdf_out = f.name.replace(".pdf", "_logo.pdf")
        # Add a copyright notice to the foot of each page
        txt_origin = fitz.Point(350, 770)
        text = "Copyright © The Open University, 2022"
        for page in pdf:
            page.insert_text(txt_origin, text)
        pdf.save(Path(pdf_output_dir) / pdf_out)
    # Remove the unbranded PDF
    os.remove(f)

The script uses pandoc to generate the PDF and epub documents, one per weekly directory. The PDF generator also includes a table of contents, automatically generated from headings by pandoc. A second pass using fitz/pymupdf then adds a logo and copyright notice to each PDF.

PDF with post-processed addition of a custom logo
PDF with post-processed addition of a copyright footer

Prompt Engineering, aka Generative System Query Optimisation

Over the last few months, I’ve seen increasing numbers of references to prompt engineering in my feeds. Prompt engineering is to generative AI systems — things like GPT-3 (natural language/text generation), CoPilot (code and comment (i.e. code explanation) generation), and DALL-E (text2image generation) — as query optimisation is to search engines: the ability to iterate a query/prompt, in light of the responses it returns, in order to find, or generate, a response that meets your needs.

Back in the day when I used to bait librarians, developing web search skills was an ever present theme at the library-practitioner conferences I used to regularly attend: advanced search forms and advanced search queries (using advanced search/colon operators, etc.), as well as power search tricks based around URL hacking or the use of particular keywords that reflected your understanding of the search algorithm, etc.

There was also an ethical dimension: should you teach particular search techniques that could be abused in some way, if generally known, but that might be appropriate if you were working in an investigative context: strategies that could be used to profile — or stalk — someone, for example.

So now I’m wondering: are the digital skills folk in libraries starting to look at ways of developing prompt engineering skills yet?

Again, there are ethical questions: for example, should you develop skills or give prompt examples of how to create deepfakes in unfiltered generative models such as Stable Diffusion (Deepfakes for all: Uncensored AI art model prompts ethics questions, h/t Tactical Tech/@info_activism)?

And there are also commercial and intellectual property right considerations. For example, what’s an effective prompt worth and how do you protect it if you are selling it on a prompt marketplace such as PromptBase? Will there be an underground market in “cheats” that can be deployed against certain commercial generative models (because pre-trained models are now also regularly bought and sold), and so on.

There’s also the skills question of how to effectively (and iteratively) work with a generative AI to create and tune meaningful solutions, which requires the important skill of evaluating, and perhaps editing or otherwise tweaking, the responses (cf. evaluating the quality of websites returned in the results of a web search engine), as well as improving or refining your prompts (for example, students who naively enter a complete question as a search engine query or a generative system prompt, compared to those who use the question as a basis for their own, more refined, query that attempts to extract key “features” from the question…).

For education, there’s also the question of how to respond to students using generative systems in open assessments: will we try to “forbid” students from using such systems at all, will we have the equivalent of “calculator” papers in maths exams, where you are expected, or even required, to use the generative tools, or will we try to enter into an arms race with questions which, if naively submitted as a prompt, will return an incorrect answer (in which case, how soon will the generative systems be improved to correctly respond to such prompts…)?

See also: Fragment: Generative Plagiarism and Technologically Enhanced Cheating?

PS For some examples of prompt engineering in practice, Simon Willison has blogged several of his playful experiments with generative AI.

PPS another generative approach that looks interesting is provided by https://github.com/rinongal/textual_inversion, which allows you to submit several training images of a novel concept, and then generate differently rendered styles of that concept.

Automating the Production of Student Software Guides With Annotated Screenshots Using Playwright and Jupyter Notebooks

With a bit of luck, we’ll be updating software for our databases course starting in Sept/Oct to use JupyterLab. It’s a bit late for settling this, but I find the adrenaline of going live, and the interaction with pathfinder students in particular, to be really invigorating (a bit like the old days of residential schools) so here’s hoping…

[I started this post last night, closed the laptop, opened it this morning and did 75 mins of work. The sleep broke WordPress’ autobackup, so when I accidentally backswiped the page, I seem to have lost over an hour’s work. Which is to say, this post was originally better written, but now it’s rushed. F**k you, WordPress.]

If we are to do the update, one thing I need to do is update the software guide. The software guide looks a bit like this:

Which is to say:

  • process text describing what to do;
  • a (possibly annotated) screenshot showing what to do, or the outcome of doing it;
  • more process text, etc.

Maintaining this sort of documentation can be a faff:

  • if the styling of the target website/application changes, but the structure and process steps remain the same, the screenshots drift from actuality, even if they are still functionally correct;
  • if the application structure or process steps change, then the documentation properly breaks. This can be hard to check for unless you rerun everything and manually test things before each presentation at least.

So is there a better way?

There is, and it’s another example of why it’s useful for folk to learn to code, which is to say, have some sort of understanding about how the web works and how to construct simple computational scripts.

Regular readers will know that I’ve tinkered on and off with the Selenium browser automation toolkit in the past, using it for scraping as well as automating repetitive manual tasks (such as downloading scores of student scripts, one at a time, from one exams system and grabbing a 2FA code for each of them for use in submitting marks into a second system). But playwright, Microsoft’s browser automation tool (freely available), seems to be what all the cool kids are using now, so I thought I’d try that.

The playwright app itself is a node app, which also makes me twitchy because node is a pain to install. But the playwright-python package, which is installable from PyPi and which wraps playwright, bundles its own version of node, which makes things much simpler. (See Simon Willison’s Bundling binary tools in Python wheels for a discussion of this; for any edtechies out there, this is a really useful pattern, because if students have Python installed, you can use it as a route to deploy other things…)

Just as a brief aside, playwright is also wrapped by @simonw’s shot-scraper command line tool which makes it dead easy to grab screenshots. For example, we can grab a screenshot of the OU home page as simply as typing shot-scraper https://www.open.ac.uk/:

Note that because the session runs in a new, incognito, browser, we get the cookie notice.

We can also grab a screenshot of just a particular, located CSS element: shot-scraper https://www.open.ac.uk/ -s '#ou-org-footer'. See the shot-scraper docs for many more examples.

In many cases, screenshots that appear in OU course materials that mismatch with reality are identified and reported by students. Tools like shot-scraper and pytest can both be used as a part of a documentation testing suite where we create gold master images and then, as required, test “current” screenshots to see if they match distributed originals.
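
By way of illustration, here’s a minimal sketch of the sort of test I mean, using the pytest-playwright page fixture (the gold master path and the locator are my own assumptions, and a perceptual image diff would be more robust than a naive byte comparison):

from pathlib import Path

def test_footer_matches_gold_master(page):
    # page is the pytest-playwright fixture
    page.goto("https://www.open.ac.uk/")
    current = page.locator("#ou-org-footer").screenshot()
    master = Path("screenshots/ou_footer_master.png").read_bytes()
    # Naive byte-for-byte comparison against the distributed gold master image
    assert current == master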

But back to the creation or reflowing of documentation.

As well as command line control using shot-scraper, we can also drive playwright from Python code, executed synchronously, as in a pytest test, or asynchronously, as in the case of Python running inside a Jupyter notebook. This is what I spent yesterday exploring, in particular, whether we could create reproducible documentation in the sense of something that has the form text, screenshot, text, … and looks like this:

but is actually created by something that has the form text, code, (code output), text, … and looks like this:

And as you’ve probably guessed, we can.

For some reason, my local version of nbconvert seems to now default to using no-input display settings and I can’t see why (there are no nbconvert settings files that I can see). Anyone got any ideas how/why this might be happening? The only way I can work around it atm is to explicitly enable the display of the inputs: jupyter nbconvert --to pdf --TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=False --TemplateExporter.exclude_input_prompt=False notebook.ipynb.

It’s worth noting a couple of things here:

  • if we reflow the document to generate new output screenshots, they will be a faithful representation of what the screen looks like at the time the document is reflowed. So if the visual styling (and nothing else) has been updated, we can capture the latest version;
  • the code should ideally follow the text description of the process steps, so if the code stops working for some reason, that might suggest the process has changed and so the text might well be broken too.

Generating the automation code requires knowledge of a couple of things:

  • how to write the code itself: the documentation helps here (eg in respect of how to wait for things, how to grab screenshots, how to move to things so they are in focus when you take a screenshot), but a recipe / FAQ / crib sheet would also be really handy;
  • how to find locators.

In terms of finding locators, one way is to do it manually, by using browser developer tools to inspect elements and grab locators for required elements. But another, simpler way is to record a set of playwright steps using the playwright codegen URL command line tool: simply walk through (click through) the process you want to document in the automatically launched interactive browser, and record the corresponding playwright steps.

With the script recorded, you can then use that as the basis for your screenshot generating reproducible documentation.

For any interested OU internal readers, there is an example software guide generated using playwright in MS Teams > Jupyter notebook working group team. I’m happy to share any scripts etc I come up with, and am interested to see other examples of using browser automation to test and generate documentation etc.

Referring back to the original software guide, we note that some screenshots have annotations. Another nice feature of playwright is that we can inject JavaScript into the controlled browser and evaluate it in that context, which means we can inject JavaScript that tinkers with the DOM.

The shot-scraper tool has a nifty function that will accept a set of locators for a particular page and, using JavaScript injection, create a div surrounding them that can then be used to grab a screenshot of the area covering those locators.

It’s trivial to amend that function to add a border round an element:

# Based on @simonw's shot-scraper
# https://github.com/simonw/shot-scraper/blob/467a6c16ce5f4139c7ef92ca2efb45fc6de4e8bf/shot_scraper/cli.py#L784
import json
import secrets
import textwrap

from IPython.display import Image, display

def _selector_javascript(selectors, selectors_all,
                         padding=0, border='none', margin='none'):
    selector_to_shoot = "shot-scraper-{}".format(secrets.token_hex(8))
    selector_javascript = textwrap.dedent(
        f"""
    new Promise(takeShot => {{
        let padding = %s;
        let margin = %s;
        let minTop = 100000000;
        let minLeft = 100000000;
        let maxBottom = 0;
        let maxRight = 0;
        let els = %s.map(s => document.querySelector(s));
        // Add the --selector-all elements
        %s.map(s => els.push(...document.querySelectorAll(s)));
        els.forEach(el => {{
            let rect = el.getBoundingClientRect();
            if (rect.top < minTop) {{
                minTop = rect.top;
            }}
            if (rect.left < minLeft) {{
                minLeft = rect.left;
            }}
            if (rect.bottom > maxBottom) {{
                maxBottom = rect.bottom;
            }}
            if (rect.right > maxRight) {{
                maxRight = rect.right;
            }}
        }});
        // Adjust them based on scroll position
        let top = minTop + window.scrollY;
        let bottom = maxBottom + window.scrollY;
        let left = minLeft + window.scrollX;
        let right = maxRight + window.scrollX;
        // Apply padding
        // TO DO - apply margin?
        // margin demo: https://developer.mozilla.org/en-US/docs/Web/CSS/margin
        top = top - padding;
        bottom = bottom + padding;
        left = left - padding;
        right = right + padding;
        let div = document.createElement('div');
        div.style.position = 'absolute';
        div.style.top = top + 'px';
        div.style.left = left + 'px';
        div.style.width = (right - left) + 'px';
        div.style.height = (bottom - top) + 'px';
        div.style.border = '{border}';
        div.style.margin = margin + 'px';
        div.setAttribute('id', %s);
        document.body.appendChild(div);
        setTimeout(() => {{
            takeShot();
        }}, 300);
    }});
    """
        % (
            padding, margin,
            json.dumps(selectors),
            json.dumps(selectors_all),
            json.dumps(selector_to_shoot),
        )
    )
    return selector_javascript, "#" + selector_to_shoot

async def screenshot_bounded_selectors(page, selectors=None,
                                 selectors_all=None,
                                 padding=0,
                                 border='none',
                                 margin='none',
                                 path="test.png",
                                 full_page=False,
                                 show=True):
    # Initialise in case no selectors are passed in
    selector_to_shoot = None
    if selectors or selectors_all:
        selector_javascript, selector_to_shoot = _selector_javascript(
                selectors, selectors_all, padding, border, margin
            )

        # Evaluate javascript
        await page.evaluate(selector_javascript)
        
    if (selectors or selectors_all) and not full_page:
        # Grab screenshot
        await page.locator(selector_to_shoot).screenshot(path=path)
    else:
        if selector_to_shoot:
            await page.locator(selector_to_shoot).hover()
        await page.screenshot(path=path)

    # Should we include a step to
    # then remove the selector_to_shoot and leave the page as it was
    
    if show:
        display(Image(path))
    return path

Then we can use a simple script to grab a screenshot with a highlighted area:

from time import sleep
from playwright.async_api import async_playwright

playwright = await async_playwright().start()

# If we run headless=False, playwright will launch a
# visible browser we can track progress in
browser = await playwright.chromium.launch(headless = False)

# Create a reference to a browser page
page = await browser.new_page()

# The page we want to visit
PAGE = "https://www.open.ac.uk"

# And a locator for the "Accept all cookies" button
cookie_id = "#cassie_accept_all_pre_banner"

# Load the page
await page.goto(PAGE)

# Accept the cookies
await page.locator(cookie_id).click()

# The selectors we want to screenshot
selectors = []
selectors_all = [".int-grid3"]

# Grab the screenshot
await screenshot_bounded_selectors(page, selectors, selectors_all,
                                   border='5px solid red',
                                   padding=2,
                                   margin=0,
                                   full_page=True)

Or we can just screenshot and highlight the element of interest:

Simon has an open shot-scraper issue on adding things like arrows to the screenshot, so this is obviously something that might repay a bit more exploration.

I note that the Jupyter accessibility docs have a section on the DOM / locator structure of the JupyterLab UI that includes highlighted screenshots annotated, I think, using drawio. It might be interesting to try to replicate / automatically generate those using playwright?

Finally, it’s worth noting that there is another playwright based tool, galata, that provides a set of high level tools for controlling JupyterLab and scripting JupyterLab actions (this will be the subject of another post). However, galata is currently a developer only tool, in that it is expected to run against a wide open JupyterLab environment (no authentication). It does this by over-riding playwright‘s .goto() method to expect a particular JupyterLab locator, which means that if you want to test a deployment that sits behind the Jupyter authenticator (which is the recommended and default way of running a Jupyter server), or you want to go through some browser steps involving arbitrary web pages, you can’t. (I have opened an issue regarding at least getting through Jupyter authentication here, and a related Jupyter discourse discussion here.) What would be useful in the general case would be a trick to use a generic playwright script to automate steps up to a JupyterLab page and then hand over browser state to a galata script. But I don’t know how to “just” do that. Which is why I say this is a developer tool, and as such is hostile to non-developer users, who, for example, might only be able to run scripts against a hosted server accessed through multiple sorts of institutional and Jupyter based authentication. The current set up also means it’s not possible to use galata out of the can for testing a deployment. Devs only, n’est-ce pas?

Displaying Wide pandas Dataframes Over Several, Narrower Dataframes

One of the problems associated with using Jupyter notebooks, and Jupyter Book, for publishing content that incorporates pandas dataframes is that wide dataframes are overflowed in the displayed page. In an HTML output, the wide dataframe is rendered via a scroll box; in a PDF output, the content is just lost over the edge of page.

Controls are available to limit the width of the displayed dataframe:

# Pandas dataframe display options
import pandas as pd

pd.options.display.show_dimensions = False
pd.options.display.width = 80
pd.options.display.max_colwidth = 10
pd.options.display.max_columns = 10

but sometimes you want to see all the columns.

The easiest way of rendering all the columns is to split the wide table over several smaller tables, each with a limited number of columns that do fit in the page width, but I haven’t spotted a simple way to do that (which is really odd, because I suspect this is a widely (?!) desired feature…).

Anyway, here’s a first attempt at a simple function to do that:

#| echo: false
import pandas as pd
from IPython.display import display

def split_wide_table(df, split=True,
                     hide=None, padding=1, width=None):
    """Split a wide table over multiple smaller tables."""
    # Note: column widths may be much larger than heading widths
    if hide:
        hide = [hide] if isinstance(hide, str) else hide
    else:
        hide = []
    cols = [c for c in df.columns if c not in hide]
    width = width or pd.options.display.width
    _cols = []
    for col in cols:
        if (len("".join(_cols)) + (len(_cols)+1)*padding) < width:
            _cols.append(col)
        else:
            # Width budget exceeded: display the columns collected so far
            # and start a new batch with the current column
            print("\n")
            display(df[_cols].style.hide(axis="index"))
            _cols = [col]
    if _cols:
        print("\n")
        display(df[_cols].style.hide(axis="index"))

Given a wide dataframe such as the following:

wdf = pd.DataFrame({f"col{x}":range(3) for x in range(25)})
wdf

We can instead render the table over multiple, narrower tables:
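
A call like the following (a minimal usage sketch; the width value is arbitrary, and the count is over the joined column headings) produces that effect:

# Render the 25 column dataframe as several narrower tables
split_wide_table(wdf, width=60)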

I haven’t really written any decorators before, but it also seems to me that this sort of approach might make sense as a display decorator…

For example, given the decorator:

def split_wide_display(func):
    """Split a wide table over multiple tables."""
    def inner(*args, **kwargs):
        resp = func(*args[:3])
        # Pop the display flag so it isn't passed on to split_wide_table()
        if kwargs.pop("display", True):
            # Optionally display the potentially wide
            # table as a set of narrower tables
            split_wide_table(resp, **kwargs)
        return resp

    return inner

we can decorate a function that returns a wide dataframe to optionally display the dataframe over multiple narrower tables:

# Use fastf1 Formula One data grabbing package
from fastf1 import ergast

@split_wide_display
def ergast_race_results(year, event, session):
    erd_race = ergast.fetch_results(year, event, session)
    df = pd.json_normalize(erd_race)
    return df
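
Hypothetical usage (the argument values are purely illustrative):

# Fetch and display the results, with the wide dataframe rendered
# as a set of narrower tables; pass display=False to skip the display
results_df = ergast_race_results(2022, 1, 'R')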

It seems really odd that I have to hack this solution together, and there isn’t a default way of doing this? Or maybe there is, and I just missed it? If so, please let me know via the comments etc…

PS the splitter is far from optimal, because it splits columns on the basis of the cumulative column header widths, rather than the larger of the column heading or column content width. I’m not sure what the best way might be of finding the max displayed column content width?

PPS is there a Sphinx extension to magically handle splitting overflowing tables into narrower subtables somehow? And a LaTeX one?

Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing

Playing with story stuff, trying to write a simple BibTeX record generator based on archive.org (see a forthcoming post…), I wanted to parse out publisher and location elements from strings such as Joe Bloggs & Sons (London) or London : Joe Bloggs & Sons.

If the string structures are consistent (PUBLISHER (LOCATION) or LOCATION : PUBLISHER, for example), we can start to create a set of regular expression / pattern matching expressions to pull out the fields.
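
For example, a minimal sketch of the regex approach, assuming just those two patterns:

import re

patterns = [
    r"^(?P<publisher>.+?)\s*\((?P<location>[^)]+)\)$",  # Joe Bloggs & Sons (London)
    r"^(?P<location>[^:]+?)\s*:\s*(?P<publisher>.+)$",  # London : Joe Bloggs & Sons
]

def parse_imprint(s):
    """Try each pattern in turn and return the named groups of the first match."""
    for p in patterns:
        m = re.match(p, s.strip())
        if m:
            return m.groupdict()
    return None

print(parse_imprint("Joe Bloggs & Sons (London)"))
# {'publisher': 'Joe Bloggs & Sons', 'location': 'London'}
print(parse_imprint("London : Joe Bloggs & Sons"))
# {'location': 'London', 'publisher': 'Joe Bloggs & Sons'}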

But there is another way, albeit one that relies on a much heavier weight computational approach than a simple regex, and that is to do some language modelling and see if we can extract entities of a particular type, such as an organisation and a geographical location.

Something like this, maybe:

import spacy

nlp = spacy.load("en_core_web_sm")

refs = ["Joe Bloggs & Sons, London",
        "Joe Bloggs & Sons, (London)",
        "Joe Bloggs & Sons (London)",
        "London: Joe Bloggs & Sons",
         "London: Joe, Bloggs & Doe"]

for ref in refs:
    doc = nlp(ref)
    print("---")
    for ent in doc.ents:
        print(ref, "::", ent.text, ent.label_)

Which gives as a result:

---
Joe Bloggs & Sons, London :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, London :: London GPE
---
Joe Bloggs & Sons, (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, (London) :: London GPE
---
Joe Bloggs & Sons (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons (London) :: London GPE
---
London: Joe Bloggs & Sons :: London GPE
London: Joe Bloggs & Sons :: Joe Bloggs & Sons ORG
---
London: Joe, Bloggs & Doe :: London GPE
London: Joe, Bloggs & Doe :: Joe, Bloggs & Doe ORG

My natural inclination would probably be to get frustrated by writing ever more, and ever more elaborate, regular expressions to try to capture “always one more” outliers in how the publisher/location data might be presented in a single string. But I wonder: is that an outmoded way of doing compute now? Are the primitives we can readily work with now conveniently available as features at a higher representational level of abstraction?

See also things like storysniffer, a Python package that includes a pretrained model for sniffing a URL and estimating whether the related page is likely to contain a news story.

Some folk will say, of course, that these model based approaches aren’t exact or provable. But in the case of my regexer to GUESS at the name of a publisher and a location, there is still uncertainty as to whether the correct response will be provided for an arbitrary string: my regexes might be incorrect, or I might have missed a pattern in likely presented strings. The model based approach will also be uncertain in its ability to correctly identify the publisher and location, but the source of the uncertainty will be different. As a user, though, all I likely care about is that most of the time the approach does work, and does work reasonably correctly. I may well be willing to tolerate, or at least put up with, errors, and don’t really care how or why they arise unless I can do something about it.

Generating and Visualising JSON Schemas

If you’re presented with a 2D tabular dataset (eg a spreadsheet or CSV file), it’s quite easy to get a sense of the data by checking the column names and looking at a few of the values in each column. You can also use a variety of tools that will “profile” or summarise the data in each column for you. For example, a column of numeric values might be summarised by the mean and standard deviation of the values, or a histogram, etc. Geographic co-ordinates might be “summarised” by plotting them onto a map. And so on.

If you’re presented with the data in a JSON file, particularly a large data file, getting a sense of the structure of the dataset can be much harder, particularly if the JSON data is “irregular” (that is, if the records differ in some way; for example, some records might have more fields than other records).

A good way of summarising the structure of a JSON file is via its schema. This is an abstracted representation of the tree structure of the JSON object that extracts the unique keys in the object and the data type of any associated literal values.

One JSON dataset I have played with over the years is rally timing and results data from the WRC live timing service. The structure of some of the returned data objects can be quite complicated, so how can we get a handle on them?

One way is to automatically extract the schema and then visualise it, so here’s a recipe for doing that.

For example:

#%pip install genson
import json

from genson import SchemaBuilder

# Get some data
import requests

season_url = "https://api.wrc.com/contel-page/83388/calendar/active-season/"
jdata = requests.get(season_url).json()

# Create a schema object
builder = SchemaBuilder()
builder.add_object(jdata)

# Generate the schema
schema = builder.to_schema()

# Write the schema to a file
with open("season_schema.json", "w") as f:
    json.dump(schema, f)

The schema itself can be quite long and hard to read…

{'$schema': 'http://json-schema.org/schema#',
 'type': 'object',
 'properties': {'seasonYear': {'type': 'integer'},
  'seasonImages': {'type': 'object',
   'properties': {'format16x9': {'type': 'string'},
    'format16x9special': {'type': 'string'},
    'timekeeperLogo': {'type': 'string'},
    'timekeeperLogoDark': {'type': 'string'}},
   'required': ['format16x9',
    'format16x9special',
    'timekeeperLogo',
    'timekeeperLogoDark']},
  'rallyEvents': {'type': 'object',
   'properties': {'total': {'type': 'integer'},
    'items': {'type': 'array',
     'items': {'type': 'object',
      'properties': {'id': {'type': 'integer'},
       'name': {'type': 'string'},

...

To get a sense of the schema, we can visualise it interactively using the json_schema_for_humans visualiser.

from json_schema_for_humans.generate import generate_from_filename

generate_from_filename("season_schema.json",
                      "season_schema.html")

We can open the generated HTML in a web browser (note, this doesn’t seem to render correctly in JupyterLab via IPython.display.HTML; however, we can open the HTML file from the JupyterLab file browser, and as long as we trust it, it will render correctly).

With the HTML trusted, we can then explore the schema.

Wondering: are there other JSON-schema visualisers out there, particularly ones that work either as a JupyterLab extension, IPython magic or via an appropriate __repr__ display method?

My JupyterLite Blockers Are Rapidly Being Addressed…

Three months or so ago, in My Personal Blockers to Adopting JupyterLite for Distance and Open Educational Use, I described several issues that I saw as blocking when it came to using JupyterLite in an educational context. These included the inability to distribute pre-installed packages as part of the JupyterLite distribution, the inability to read and write files programmatically, and the inability to work on files in the local filesystem.

Many of my personal blockers have now been addressed, to such an extent that I think I should probably get round to updating our simple Learn to Code for Data Analysis open educational course to use JupyterLite or RetroLite (I suspect I might actually have to invoke the open license and publish a modified version of my own: the way we institutionally publish content is via a not very flexible Moodle platform and trying to accommodate the JupyterLite environment could be a step too far!).

So: what can we now do with JupyterLite that we couldn’t three months ago that make it more acceptable as an environment we can use to support independent distance and open educational users?

Perhaps the biggest blocker then was the inability to read and write files into the JupyterLite filesystem. This meant that workarounds were required when running packages such as pandas to open and save files from and to JupyterLite file storage. This has now been addressed, so packages such as pandas are now able to read some data files shipped as part of the JupyterLite distribution and also save and read back files into the JupyterLite file system. The JupyterLite file system support also means you can list directory contents, for example, from code within a notebook. (Data file type read/writes that aren’t currently supported by pandas, including SQLite file read/writes, are being tracked via this issue.) Indeed, the pandas docs now include a JupyterLite console that allows you to try out pandas code, including file read/writes, directly in the browser.

Another issue was the disconnect between files in the browser and files on the desktop. If you are working with the files in the default JupyterLite file panel, modifications to these files are saved into browser storage. You can add and create new files saved to this file area, as well as deleting those files from browser storage. If files are shipped as part of the JupyterLite distribution, deleting those files, for example, after modification, resets the file to the originally distributed version. (A recent issue raises the possibility of how to alert users to an updated version of the notebook on the host repository.)

In many educational settings, however, we may want students to work from copies of notebooks that are stored on their local machine, or perhaps on a USB drive plugged into it. A recent extension that works in Chrome, Edge and Opera browsers — jupyterlab-contrib/jupyterlab-filesystem-access — added the ability to open, edit and save files on the local desktop from the JupyterLite environment. (In use, it’s a little bit fiddly in the way you need to keep granting permission to the browser to save files; but better that than the browser grabbing all sorts of permissions over the local file system without the user’s knowledge.)

In passing, I’m not sure if there’s support or demos yet for mounting files from online drives such as OneDrive, Dropbox or Google Drive, which would provide another useful way of persisting files, albeit one that raises the question of how to handle credentials safely.

When it comes to producing a JupyterLite distribution, which is to say, publishing a JupyterLite environment containing a predefined set of notebooks and a set of preinstalled packages, adding additional packages to the distribution has been non-trivial. The practical consequence of this is that packages need to be explicitly installed from notebooks using micropip or piplite, which adds clutter to notebooks, as well as code that will not run in a “traditional”, non-JupyterLite environment unless you add the appropriate code guards. However (and I have yet to test this), it seems that the new jupyterlite/xeus-python-kernel can be built relatively easily under automation to include additional Python packages that are not part of the default pyodide environment (presumably this requires that platform independent PyPi wheels are available, or custom build scripts that can build an appropriate Emscripten target built wheel?): minimal docs. (I note that this kernel also has the benefit that from time import sleep works!) The ipycanvas docs apparently demo this, so the answer to the question of how to create a JupyterLite distribution with custom additional pre-installed packages is presumably available somewhere in the repo (I think these commits are related: setup, package requirements.) It would be really handy if a minimal self-documenting jupyterlite-custom-package-demo were available to complement jupyterlite/demo with a minimal yet well commented/narrated example of how to add a custom package to a JupyterLite distribution.
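
By way of illustration, here’s a minimal sketch of the sort of code guard I mean (the sys.platform check distinguishes a pyodide/JupyterLite kernel from a “traditional” one; pandas is just an example package):

import sys

# Only attempt a piplite install when running under pyodide/JupyterLite
if sys.platform == "emscripten":
    import piplite
    await piplite.install("pandas")

import pandas as pd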

I would have tried a demo of bundling a custom package as a demo for this post, but from reading the docs and skimming the ipycanvas repo, I wasn’t sure what all-and-only the necessary steps are, and I don’t have time to go rat/rabbit-holing atm!

Installing JupyterLite in Chrome as a Browser Web App for Offline Use

If you are working in the Chrome browser on a desktop machine, you can install JupyterLite as a web application. (Firefox discontinued support for progressive web apps in the desktop version of Firefox in 2021.) A benefit of doing this is that you can then use the application in an offline mode, without the need for a web connection.

With a web connection available, if you launch a JupyterLite instance from the JupyterLite homepage, or from the JupyterLite demo page, you will see an option to install the environment as a progressive web application:

In the Chrome browser, you can view your Chrome installed web applications from the chrome://apps URL:

The application will open into its own browser window and can be used with or without a web connection. Files are saved into local browser storage, just as they would be if you were running the application from the original website. This means you can work against the website or the local web app, and the files you have saved into local browser storage will be available in both contexts.

If you do want to work in an offline mode, you need to ensure that all the packages you might want to call on have been “progressively” downloaded and cached by the app. Typically, JupyterLite will only download a Python package when it is first imported. This means that if your notebooks import different packages, you may find in offline use that a required package is not available. To get around this, you should create a single notebook importing all required packages and run that when you do have a network connection to make sure that all the packages you are likely to need are available for offline use.
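
For example, a “warm the cache” notebook might consist of little more than a single cell of imports (the package list here is purely illustrative):

# Run this notebook once while online so that the packages are
# downloaded and cached for later offline use
import pandas
import numpy
import matplotlib
import sklearn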

Python Package Use Across Course Notebooks

Some time ago, I started exploring various ways of analysing the structure of Jupyter notebooks as part of an informal “notebook quality” unproject (innovationOUtside/nb_quality_profile).

Over the last week or two, for want of anything else to do, I’ve been looking at that old repo again and made a start tinkering with some of the old issues, as well as creating some new ones.

One of the things I messed around with today was a simple plot showing how different packages are used across a set of course notebooks. (One of the quality reports lists the packages imported by each notebook, and can flag if any packages are missing from the Python environment in which the tool runs.)
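
For reference, here’s a minimal sketch of the sort of package extraction involved (this is not the nb_quality_profile implementation, and the content directory path is an assumption):

import ast
from pathlib import Path

import nbformat

def notebook_imports(nb_path):
    """Return the set of top level package names imported by a notebook."""
    nb = nbformat.read(nb_path, as_version=4)
    packages = set()
    for cell in nb.cells:
        if cell.cell_type != "code":
            continue
        try:
            tree = ast.parse(cell.source)
        except SyntaxError:
            # e.g. cells containing IPython magics or shell commands
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                packages.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                packages.add(node.module.split(".")[0])
    return packages

for nb_path in sorted(Path("content").glob("**/*.ipynb")):
    print(nb_path.name, notebook_imports(nb_path))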

The course lasts ~30 weeks, with a set of notebooks most weeks, and the plot shows the notebooks, in order of study, along the x-axis, with the packages listed as they are first encountered on the y-axis.

This chart is essentially a macroscopic view of package usage throughout the course module (and as long term readers will know, I do like a good macroscope:-).

In passing, I note that I could also add colour and/or shapes or size to identify whether a package is in the Python standard library or whether it is installed from PyPi, or highlight whether the package is not available in the Python environment that the chart generating tool is run in.

A quick glance at it reveals several things:

  • pandas is used heavily throughout (you might guess this is a data related course!), as we can see from the long horizontal run throughout the course;
  • several other packages are used over consecutive notebooks (short, contiguous horizontal runs of dots), suggesting a package that has a particular relevance for the subject matter studied that week;
  • vertical runs show that several new packages are used for the first time in the same notebook, perhaps to achieve a particular task. If the same vertical run appears in other notebooks, perhaps a similar task is being performed in each of those notebooks;
  • there is a steady increase in the number of packages used over the course. If there is a large number of packages introduced for the first time in a single notebook (i.e. a vertical run of dots), this might suggest a difficult notebook for students to work through in terms of new packages and new functions to get their head round;
  • if a package is used in only one notebook (which is a little hard to see — I need to explore gridlines in a way that doesn’t overly clutter the graphic), it might be worth visiting that notebook to see if we can simplify it and remove the singleton use of that package, or check the relevance of the topic it relates to to the course overall;
  • if a notebook imports no modules (or has no code cells), it might be worth checking to see whether it really should be a notebook;
  • probably more things…

I’m now wondering what sort of tabular or list style report might be useful to identify the notebooks each package appears in, at least for packages that only appear once or twice, or are widely separated in terms of when they are studied.

I also wonder if there are any tools out there that I can use to identify the package functions used in each notebook, to see how they are distributed over the module. (This all rather makes me think of topic analysis!)

In Passing, Noting Python Apps, and Datasette, Running Purely in the Browser

With pyodide now pretty much established, and from what I can tell, possibly with better optimised/lighter builds on the roadmap, I started wondering again about running Python apps purely in the browser.

One way of doing this is to create ipywidget powered applications in a JupyterLite context (although I don’t think you can “appify” these yet, Voila style?) but that ties you into the overhead of running JupyterLite.

The React in Python with Pyodide post on the official pyodide blog looked like it might provide a way in to this, but over the weekend I noticed an announcement from Anaconda regarding Python in the Browser using a new framework called PyScript (examples). This framework provides a (growing?) set of custom HTML components that appear to simplify the process of building pyodide Python powered web apps that run purely in the browser.

I also noticed over the weekend that sqlite_utils and datasette now run in a pyodide context, the latter providing the SQL API run against an in-memory database (datasette test example).

The client will also return the datasette HTML, so now I wonder: what would be required to be able to run a datasette app in the JupyterLite/JupyterLab context? The datasette server must be intercepting the local URL calls somehow, but I imagine that the Jupyter server is ignorant of them. So how could datasette “proxy” its URL calls via JupyterLite so that the local links in the datasette app can be resolved? (We surely wouldn’t want to have to make all the links handled button elements?)

UPDATE 5/5/22: It didn’t take Simon long… Datasette now runs as a full web app in the browser under pyodide. Announcement post here: Datasette Lite: a server-side Python web application running in a browser.

So now I’m wondering again… is there a way to “proxy” a Python app so that it can power a web app, running purely in the browser, via Pyodide?