Pencil Sketch – Building a Simple Search Engine Around a Particular Internet Archive Document Collection

Over the last few weeks, I’ve been making a fair bit of use of the Internet Archive, not least to look up items in copies of Notes & Queries from the 19th century. In part, this has rekindled my interest in things like indexes (for human readers) and custom search engines. So here are a few notes, before I try to hack anything together, on various bits and bobs that come to mind about how I could start to play with the Internet Archive content.

Inspection of various sample records shows a few things we can work with. In particular:

  • the pub_notes-and-queries tag appears to be used to assign related documents to an associated document collection;
  • the OCR search text is available as a text file;
  • an OCR page index document contains a list of four-element tuples, one per scanned page; the first two elements in each tuple appear to give the start and end character offsets into the OCR search text file for that page (I have no idea what the third and fourth elements in the tuple relate to);
  • a page numbers text file gives a list of records, one per page, with a page number estimate and confidence score.

To retrieve the data, we can use the Python internetarchive package. This provides various command line tools, including bulk downloaders, as well as a Python API.

For example, we can use the API to search for items within a particular collection:

#%pip install internetarchive

from internetarchive import search_items

item_ids = []
# Limit search to collection items
#collection:"pub_notes-and-queries"
# Limit search to a particular publication and  volume
#sim_pubid:1250 AND volume:6
# Limit search to a particular publication and year
#sim_pubid:1250 AND year:1867

for i in search_items('collection:"pub_notes-and-queries"').iter_as_items():
    # Iterate through retrieved item records, grabbing the identifiers
    item_ids.append(i.identifier)

The item records include information that is likely to be useful for helping us retrieve items and construct a search engine over the documents (though what form that search engine might take, I am still not sure).

{'identifier': 'sim_notes-and-queries_1867_12_index',
 'adaptive_ocr': 'true',
 'auditor': 'associate-jerald-capanay@archive.org',
 'betterpdf': 'true',
 'boxid': 'IA1641612',
 'canister': 'IA1641612-06',
 'collection': ['pub_notes-and-queries', 'sim_microfilm', 'periodicals'],
 'contrast_max': '250',
 'contrast_min': '147',
 'contributor': 'Internet Archive',
 'copies': '4',
 'date': '1867',
 'derive_version': '0.0.19',
 'description': 'Notes and Queries 1867: <a href="https://archive.org/search.php?query=sim_pubid%3A1250%20AND%20volume%3A12" rel="nofollow">Volume 12</a>, Issue Index.<br />Digitized from <a href="https://archive.org/details/sim_raw_scan_IA1641612-06/page/n1668" rel="nofollow">IA1641612-06</a>.<br />Previous issue: <a href="https://archive.org/details/sim_notes-and-queries_1867-06-29_11_287" rel="nofollow">sim_notes-and-queries_1867-06-29_11_287</a>.<br />Next issue: <a href="https://archive.org/details/sim_notes-and-queries_1867-07-06_12_288" rel="nofollow">sim_notes-and-queries_1867-07-06_12_288</a>.',
 'issn': '0029-3970',
 'issue': 'Index',
 'language': 'English',
 'mediatype': 'texts',
 'metadata_operator': 'associate-berolyn-gilbero@archive.org',
 'next_item': 'sim_notes-and-queries_1867-07-06_12_288',
 'ppi': '400',
 'previous_item': 'sim_notes-and-queries_1867-06-29_11_287',
 'pub_type': 'Scholarly Journals',
 'publisher': 'Oxford Publishing Limited(England)',
 'scanner': 'microfilm03.cebu.archive.org',
 'scanningcenter': 'cebu',
 'sim_pubid': '1250',
 'software_version': 'nextStar 4.5.0.20626',
 'source': ['IA1641612-06', 'microfilm'],
 'sponsor': 'Kahle/Austin Foundation',
 'subject': ['Classical Studies',
  'Library And Information Sciences',
  'Literary And Political Reviews',
  'Literature',
  'Publishing And Book Trade',
  'Scholarly Journals',
  'microfilm'],
 'title': 'Notes and Queries  1867: Vol 12 Index',
 'volume': '12',
 'uploader': 'arthur+microfilm02@archive.org',
 'publicdate': '2021-10-19 11:27:07',
 'addeddate': '2021-10-19 11:27:07',
 'identifier-access': 'http://archive.org/details/sim_notes-and-queries_1867_12_index',
 'identifier-ark': 'ark:/13960/t3gz6zr86',
 'imagecount': '29',
 'ocr': 'tesseract 5.0.0-beta-20210815',
 'ocr_parameters': '-l eng',
 'ocr_module_version': '0.0.13',
 'ocr_detected_script': 'Latin',
 'ocr_detected_script_conf': '0.9685',
 'ocr_detected_lang': 'en',
 'ocr_detected_lang_conf': '1.0000',
 'page_number_confidence': '100.00',
 'pdf_module_version': '0.0.15',
 'foldoutcount': '0'}

So for example, we might pull out the volume, issue, date, and title; we can check whether page numbers were identified; we have reference to the next and previous items.

From inspection of other records, we also note that if an item is restricted (for example, in its preview), then an 'access-restricted-item': 'true' attribute is also set.

We can retrieve documents from the API by id and document type:

from internetarchive import download

# downloads to a dir with name same as id
# _page_numbers.json
# _hocr_searchtext.txt.gz

doc_id = 'sim_notes-and-queries_1849-11-03_1_1'

download(doc_id, destdir='ia-downloads',
         formats=[ "OCR Page Index", "OCR Search Text", "Page Numbers JSON"])

To create a search index from the downloads, what are we to do?

By inspection of several OCR Search Text files, it appears as if the content is arranged as one paragraph per line of the file, where paragraphs are perhaps more correctly blocks of text that appear visually separated from other blocks of text.

For a full text search engine, for example one built over a SQLite FTS4 virtual table, we could just add each document as a separate record.
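
By way of a minimal sketch of that approach (the directory and filename follow the download comments above; the database, table and column names are just illustrative):

import gzip
import sqlite3

conn = sqlite3.connect("ia_nandq.db")

# Create a full-text searchable virtual table using the FTS4 module
conn.execute("""CREATE VIRTUAL TABLE IF NOT EXISTS docs_fts
                USING fts4(identifier, searchtext)""")

doc_id = "sim_notes-and-queries_1849-11-03_1_1"

# The OCR Search Text download is gzipped; read it as text
with gzip.open(f"ia-downloads/{doc_id}/{doc_id}_hocr_searchtext.txt.gz",
               "rt", encoding="utf-8") as f:
    searchtext = f.read()

# Add the whole document as a single record
conn.execute("INSERT INTO docs_fts VALUES (?, ?)", (doc_id, searchtext))
conn.commit()

# Simple full-text query over the documents
for (identifier,) in conn.execute(
        "SELECT identifier FROM docs_fts WHERE docs_fts MATCH ?", ("folklore",)):
    print(identifier)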

However, it might also be interesting to have full text search over page level records. We could do this by splitting the content of the full text document according to the OCR page index file, numbering pages with the “scan page” index, and adding any extracted “actual” page number from the page numbers text file. (The lists in the OCR page index file and the page numbers text file should be the same length.)

A search over the pages table would then be able to return page numbers.
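
As a rough sketch of that page-level splitting (the page index filename and the structure of the page numbers JSON are guesses from inspection, so the exact filenames and keys may need checking):

import gzip
import json

doc_id = "sim_notes-and-queries_1849-11-03_1_1"
item_dir = f"ia-downloads/{doc_id}"

# The full OCR text stream for the item
with gzip.open(f"{item_dir}/{doc_id}_hocr_searchtext.txt.gz",
               "rt", encoding="utf-8") as f:
    searchtext = f.read()

# OCR page index: a list of tuples, one per scanned page;
# the first two elements look like start/end character offsets
with gzip.open(f"{item_dir}/{doc_id}_hocr_pageindex.json.gz", "rt") as f:
    page_index = json.load(f)

# Page numbers: one record per page with a page number estimate and confidence
with open(f"{item_dir}/{doc_id}_page_numbers.json") as f:
    page_numbers = json.load(f)["pages"]

# The two lists should be the same length; pair them up to make page records
pages = []
for scan_page, (offsets, page_info) in enumerate(zip(page_index, page_numbers)):
    start, end = offsets[0], offsets[1]
    pages.append({"identifier": doc_id,
                  "scan_page": scan_page,
                  "page_number": page_info.get("pageNumber"),
                  "confidence": page_info.get("confidence"),
                  "text": searchtext[start:end]})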

In some cases, the searcher may want to view the actual original document scan, for example in the case of dodgy OCR, or to check the emphasis or layout of the original text. So it probably makes sense to also grab the scanned documents, either as a PDF, or as a collection of JPG images, loading the file as binary data and saving it into the database as a SQLite BLOB.
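
Something along these lines, perhaps (the PDF filename is a guess at the download naming convention):

import sqlite3

conn = sqlite3.connect("ia_nandq.db")
conn.execute("CREATE TABLE IF NOT EXISTS scans (identifier TEXT, pdf BLOB)")

doc_id = "sim_notes-and-queries_1849-11-03_1_1"

# Load the downloaded Text PDF as binary data...
with open(f"ia-downloads/{doc_id}/{doc_id}.pdf", "rb") as f:
    pdf_bytes = f.read()

# ...and store it in the database as a BLOB
conn.execute("INSERT INTO scans VALUES (?, ?)",
             (doc_id, sqlite3.Binary(pdf_bytes)))
conn.commit()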

We can preview the files downloadable for a particular item by getting a list of the associated files and then reporting their file format:

from internetarchive import get_item

#Retrieve an item record by id
jx = get_item('sim_notes-and-queries_1867_12_index')

# The item .get_files() method returns an iterator of
# available file types
for jj in jx.get_files():
    #Report the file format
    print(jj.format)

"""
Item Tile
JPEG 2000
JPEG 2000
Text PDF
Archive BitTorrent
chOCR
DjVuTXT
Djvu XML
Metadata
JSON
hOCR
OCR Page Index
OCR Search Text
Item Image
Single Page Processed JP2 ZIP
Metadata
Metadata
Page Numbers JSON
JSON
Scandata
"""

The Single Page Processed JP2 ZIP contains separate JPEG images, one per scan page, so that is a good candidate. We could also grab the Text PDF, which is a searchable PDF document. If page level access were required, we could then grab the whole PDF and just extract and display the required page.

However, downloading all the JPG / PDF files feels a bit excessive. So it might be more interesting to try to build something that will only download and store PDFs / images on demand, as and when a user wants to preview an actual scanned page.

PS in passing, I note this recent package – https://github.com/cbrennig/pdf_sqlite_fts – for OCRing, indexing and full-text searching PDF docs.

A Personal Take on Customising Jupyter Via End-User Innovation

Following a Github discussion on The future of the classic notebook interface and the Jupyter Notebook version 7 JEP (Jupyter Enhancement Proposal), a Pre-release plan for Notebook v7 is now in play that will see RetroLab (as was), the notebook style, JupyterLab powered, single document UI, form the basis of future notebook UIs.

I’ve been asked for comments on my take on how the original notebook supported end-user development, which I’ll try to describe here. But first I should add some caveats:

  • I am not a developer;
  • I am not interested in environments that developers use for doing things developers do;
  • I am not a data scientist;
  • I am not interested in environments that data scientists use for doing things that data scientists do;
  • I am a note taker; I am a tinkerer and explorer of the potential for using newly available technologies in combination with other technologies; I am a doodler, creating code-exploiting sketches to perform particular tasks, often a single line of code at a time; I am an author of texts that exploit interactivity in a wide variety of subject areas using third party, off-the-shelf packages that exploit IPython (and hence, Jupyter notebook) display machinery.
A line of code can be used, largely in ignorance of how it works and simply by copying, pasting, and changing a single recognisable value, to embed a rich interactive into a document. In this case, a 3D molecular model is embedded based on a standardised compound code passed in via a variable defined elsewhere in the document.
  • I am interested in environments that help me take notes, that help me tinker and explore the potential for using newly available technologies in combination with other technologies, that help me author the sort of texts that I want to explore.
  • I am interested in media that can be used to support open and distance education, both teaching (in the sense of making materials available to learners) and learning (which might be done either in a teaching context, or independently). My preference for teaching materials is that they support learning.
  • I am interested in end-user innovation where an end-user with enthusiasm and only a modicum of skill can extend and/or co-opt an environment, or the features or services it offers, for their own personal use, without having to ask for permission or modify the environment’s core offering or code base (i.e. end-user innovation that allows a user to lock themselves in through extras they have added; in certain situations, this may be characterised as developing on top of undocumented features (it certainly shares many similar features));
  • In my organisation, the lead times are ridiculous. The following is only a slight caricature: a module takes 2+ years to create then is expected to last for 5-10 years largely unchanged. A technology freeze might essentially be put in place a year before student first use date. Technology selection is often based on adopting a mature technology at the start of the production process (two years prior to first use date).
  • When we adopted Jupyter notebooks for the first time for a module due for first release in 2016, it was a huge punt. The notebooks (IPython notebooks) were immature and unstable at the time; we also adopted pandas which was still in early days. There were a lot of reasons why I felt comfortable recommending both those solutions based on observation of the developer communities and practical experience of using the technologies. One practical observation was that I could get started very quickly, over a coffee, without docs, and do something useful. That meant other people would be able to too. Which would mean low barrier to first use. Which meant easy adoption. Which meant few blockers to trying to scale use. (Note that getting the environment you wanted set up as you wanted could be a faff, but we could mitigate that by using virtual machines to deliver software. It was also likely that installation would get easier.)
  • One of the attractive things about the classic Jupyter notebook UI was that I could easily hack the branding to loosely match organisational branding (a simple CSS tweak, based on inspection by someone who didn’t really know CSS (i.e. me), to point to our logo). As a distance learning organisation, I felt it was important to frame the environment in a particular way, so that students should feel as if they were working in what felt like an institutional context. When you’re working in that (branded) space, you are expected to behave, and work, in a particular way:
  • there were also third party extensions, written in simple JS and HTML. These could be created and installed by an end-user, taking inspiration and code from pre-existing extensions. As an end-user, I was interested in customising the appearance of content in the notebook. For teaching / publishing purposes, I was interested in being able to replicate the look of materials in our VLE (virtual learning environment). The materials use colour theming to cue different sorts of behaviour. For example, activities and SAQs (self-assessment questions) use colour highlighted blocks to identify particular sorts of content:
  • the open source nature of the Jupyter code base meant that we could run things as local users or as an on-prem service, or as a rented hosted service from a third party provider; in my opinion all three are important. I think students need to be able to run code locally so that they can: work offline, as well as share or use the provided environment in another context, eg a work context; I think being able to offer an institutionally hosted service provides equitable access to students who may be limited in terms of personal access to computers; I think the external provider route demonstrates a more widespread adoption of a particular approach, which means longer term viability and support as well as a demonstration that we are using “authentic” tools that are used elsewhere.

One of our original extensions sought to colour theme activities in Jupyter notebooks in a similar way. (This could be improved, probably, but the following was a quick solution based on inspection of the HTML to try to find attributes that could be easily styled.)

The cells are highlighted by selecting a cell and then clicking a button. How to add the button to a toolbar was cribbed from a simple, pre-existing extension.

If anything, I’m a “View source” powered tinkerer, copying fragments of code from things I’ve found that do more or less the sort of thing I want to do. This strategy is most effective when the code to achieve a particular effect appears in a single place. It’s also helpful if it’s obvious how to load in any required packages, and what those packages might be.

At the time I created the colour theming extension, I ran into an issue identifying how to address cells in Javascript and queried it somewhere (in a Github issue, I thought, but I can’t find it); IIRC, @minrk promptly responded with a very quick and simple idea for how to address my issue. Not only could I hack most of what I wanted, I could articulate enough of a question to be able to ask for help, and help could also be quickly and relatively easily given: if it’s easy to answer a query, or offer a fix, eg Stack Overflow style, you might; if it’s complex and hard, and takes a lot of effort to answer a query, you are less likely to; and as a result, other people are then less likely to be helped past a blocker and to continue helping themselves.

The ability to add toolbar buttons to the user interface meant that it was easy enough to customise the tools offered to the end-user via the toolbar. How to add buttons was cribbed from inspection of the Javascript used by the simplest pre-existing extension I could find that added buttons to the toolbar.

Another thing the classic notebook extensions offered was a simple extensions configurator. This is based on a simple YAML file. The extensions configurator means the end-user developer can trivially specify an extensions panel. Here’s an example of what our current cell colour highlighter extension configurator supports, specifically, which buttons are displayed and what colours to use:

And the corresponding fragment of the YAML file that creates it:

How to create the YAML file was cribbed from the content of a YAML config script from the simplest extensions I could find that offered the configurator controls I required.

How to access state set by the extension configurator from the extension Javascript was based on inspection of very simple Javascript of the simplest pre-existing extension I could find that made use of a similar configurator setting.

The state of the extension settings can be persisted, and is easily shared as part of a distribution via a config file. (This means we can easily configure an environment with pre-enabled extensions with particular settings to give the end user a pre-configured, customised environment that they can then, if they so choose, personalise / customise further).

This is important: we can customise the environment we give to students, and those users can then personalise it.

What this means is that there is a way for folk to easily lock themselves in to their own customised environment.

In the wider world, there are a great many JupyterHub powered environments out there serving Jupyter end-user interfaces as the default UI (JupyterLab, classic notebook, perhaps RetroLab). In order to support differentiation, these different environments may brand themselves, may offer particular back-end resources (compute/GPU, storage, access to particular datasets etc.), may offer particular single-sign on and data access authentication controls, may offer particular computational environments (certain packages preinstalled etc), may offer particular pre-installed extensions, including in-house developed extensions which may or may not be open sourced / generally available, may wire those extensions together or “co-/cross-configure” them in such a way as to offer a “whole-is-more-than-sum-of-parts” user experience, and so on.

For the personal user, running locally or on a third party server that enables extension installation and persistence, they can configure their own environment from available third party extensions.

And for the have-a-go tinkerer, they can develop and share their own extensions, and perhaps make them available to others.

In each case, the service operator or designer can lock themselves in to a particular set-up. In our case, we have locked ourselves into a classic Jupyter notebook environment through extensions and configurations we have developed locally. And we are not in a position to migrate, in part because we have accreted workflows and presentation styles through our own not-fork, in part because of the technical barriers to entry to creating extensions in the JupyterLab environment. Because as I see it, that requires: a knowledge of particular frameworks and “modern” ways of using Javascript.

(The current version of my own extensions has started to use things like promises, cribbed from others rather than written from a position of knowledge or understanding; but I’ve only got there by iterating on simpler approaches and by cribbing diffs from other, pre-existing extensions that have migrated from the original way of working to more contemporary methods (all hail developers for helpful commit messages!).) It also requires a knowledge of the JupyterLab frameworks (in the classic notebook, I could, over a half-hour coffee break, crib some simple HTML and CSS from the classic UI, and crib some simple JS from a pre-existing extension that had a feature I wanted to use, or that appeared to use a method for achieving something similar to the effect I wanted to achieve).

There has been work in the JupyterLab extensions repo to try to provide simple examples, and I have to admit, I don’t check there very often to see if they have added the sorts of examples that I tend to crib from, because from experience they tend to be targeted at developers doing developery things.

I. Am. Not. A. Developer. And the development I want to do is often end user interface customisations. (I also note, from the Jupyter notebook futures discussions, comments along the lines of “the core devs aren’t front end developers, so keeping the old’n’hacky notebook UI going is unsustainable”, which I both accept and appreciate (appreciate in the sense of understand).) But it raises the question: who is there looking for ways to offer “open” and “casual” (i.e. informal, end-user) UI developments?

It is also worth noting that the original notebook UI was developed by folk who were perhaps not web developers and so got by on simple HTML/CSS/JS techniques, because that was their skill level in that domain at the time. And they were also new to Jupyter frameworks in the sense that those frameworks were still new and still relatively small in feature and scope. But the core devs are now seasoned pros in working in those frameworks. Whereas have-a-go end-user developers wanting to scratch that one itch are always brand new to it. And they may have zero requirement to ever do another bit of development again. On anything. Ever.

The “professionalisation” of examples and extensions in the JupyterLab ecosystem is also hostile to not-developers. For example, here’s a repo for a collapsible headings extension I happened to have in an open tab:

I have no idea what many (most) of those files are, or how necessary they are to build a working extension. I’m not sure I could figure out how to actually build the extension either (because I think they do need building before they can be installed?). I. Am. Not. A. Developer. Just as I don’t think users should have to be sys admins to be able to install and run a notebook (which is one reason we increasingly offer hosted solutions), I don’t think end user developers who want to hack a tiny bit of code should have to be developers with the knowledge, skills and toolchains available to be able to build a package before it can be used. (I think there are tools in the JupyterLab UI context that are starting to explore making things a bit more “build free”.)

To help set the scene, imagine the user is a music teacher who wants to use music21 in a class. They get by using what is essentially a DSL (domain specific language), in the form of music21 functions, in a notebook environment. Their use case for Jupyter is to write what are essentially interactive handouts relating to music teaching. They also note that they can publish the materials on the web using Jupyter Book. They see that Jupyter Book has a particular UI feature, such as a highlighted note, and they think “how hard could it be to add that to the notebook UI?”

One approach I have taken previously with regard to such UI tweaks is to make use of cell tag attributes to identify cells that should be styled in a particular way. (Early on, I’d hoped tags would be exposed appropriately in cell HTML as class attributes, e.g. expose cell tags as HTML classes.) This opens up end user development in a really simple way (hacking CSS, essentially, or iterating HTML based on class attributes; though ideally you’d work with the notebook JSON data structure and index cells based on metadata tag values).
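
For example, a minimal sketch of working with the notebook JSON via the nbformat package (the tag name and the wrapper div are purely illustrative):

import nbformat

nb = nbformat.read("notebook.ipynb", as_version=4)

# Iterate over cells and pick out those carrying a particular metadata tag
for cell in nb.cells:
    tags = cell.get("metadata", {}).get("tags", [])
    if "activity" in tags and cell.cell_type == "markdown":
        # For example, wrap tagged markdown cells in a classed div
        # that a stylesheet can then colour-highlight
        cell.source = f'<div class="activity">\n\n{cell.source}\n\n</div>'

nbformat.write(nb, "notebook-styled.ipynb")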

As an example of a hacky workflow from a “not a developer” perspective to achieve a styling effect similar to the Jupyter Book style effect above, I use a “tags2style” extension to style things in the notebook, and a tag processor to churn notebook .ipynb content into appropriately marked up markdown for Jupyter Book. (Contributing to Jupyter Book extensions is also a little beyond me. I can proof-of-concept, but all the “proper” developer stuff of lint’n’tests and sensible commit messages, as well as how to use git properly, etc., are beyond me…! Not a team player, I guess… Just trying to get stuff done for my own purposes.)

So… in terms of things I’d find useful for end user development, and allowing users to help themselves, a lot of it boils down to not a lot (?!;-):

  • I want to be able to access a notebook datastructure and iterate over cells (either all cells, or just cells of a particular type);
  • I want to be able to read and write tag state;
  • I want to be able to render style based on cell tag; failing that, I want to be able to control cell UI class attributes so I can modify them based on cell tags.
  • I want to be able to add buttons to the toolbar UI;
  • I want to be able to trigger operations from toolbar button clicks that apply to the current in-focus cell, a set of selected cells, or all cells / all cells of a particular type;
  • I want to be able to configure extension state in a simply defined configuration panel;
  • I want to be able to easily access extension configuration state and use it within my extension;
  • I want to be able to easily persist and distribute extension configuration state;
  • It would be nice to be able to group cells in a nested tree; eg identify a set of cells as an exercise block and within that as exercise-question and exercise-answer cells, and style the (sub-)grouped cells together and potentially the first, rest, and last cells in each (sub-)group differently to the rest.

In passing, several more areas of interest.

First, JupyterLab workspaces. When these were originally announced, really early on, they seemed really appealing to me. A workspace can be used to preconfigure / persist / share a particular arrangement of panels in the JupyterLab UI. This means you can define a “workbench” with a particular arrangement of panels, much as you might set up a physical lab with a particular arrangement of equipment. (Imagine a school chemistry lab; the lab assistant sets up each bench with the apparatus needed for that day’s experiment.) In certain senses, the resulting workspace might be also be thought of as an “app” or a “lab”.

I would have aggressively explored workspaces, but I was locked into using the custom styling extensions of classic notebook, and this blocked me from exploring JupyterLab further.

I firmly believe value can be added to an environment by providing preconfigured workspaces, where the computational environment is set up as required (necessary packages installed, maybe some configuration of particular package settings, appropriate panels opened and arranged on screen), particularly in an educational setting. But I haven’t really seen people talking about using workspaces in such ways, or even many examples of workspaces being used and customised at all.

Example of Dask Examples repo launched on MyBinder – a custom workspace opens predefined panels on launch.

A lot of work has taken place around dashboards in a Jupyter context (which is perhaps Jupyter used in a business reporting context), but not around JupyterLab workspaces, which are really powerful for education.

I note that various discussions relating to classic notebook versus JupyterLab relate to the perceived complexity of the JupyterLab UI. My own take on the JupyterLab UI is that it can be very cluttered and have a lot of menu options or elements available that are irrelevant to a particular user in a particular use case. For different classes of user, we might want to add lightness to the UI, to simplify it to just what is required for a particular activity, and strip out the unnecessary. Workspaces offer that possibility. Dashboard and app style views, if used creatively, can also be used that way BUT they don’t provide access to the JupyterLab controls.

On the question of what to put into workspaces, jupyter-sidecar could be extremely useful in that respect. It was a long time coming, but sidecar now lets you display a widget directly in a panel, rather than first having to display it as cell output.

This means I could demo for myself using my nbev3devsim simulator in a JupyterLab context.

Note that the ONLY reason I can access that JS app as a Jupyter widget is via the jp_proxy_widget, which to my mind should be maintained in core along with things like Jupyter server proxy, JupyterLite, and maybe jhsingle-native-proxy. All these things allow not-developers and not-sysadmins to customise a distributed environment without core developer skills.

A final area of concern for me relates to environments that are appropriate for authoring new forms of document, particularly those that:

  • embed standalone interactive elements generated from a single line of magic code (for example, embed an interactive map centred on a location, given a location);
Using parameterised magic to embed a customised interactive.
  • generate and embed non-textual media elements from a text description;

Note that to innovate in the production of tools to create such outputs does not require “Jupyter developer skills”. The magics can often be simple Python, largely cribbed from other, pre-existing magics, applied to new off-the-shelf packages that support rich IPython object output displays.
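
By way of a sketch, a line magic wrapping the folium package might look something like this (the magic name and argument handling are made up for the example):

from IPython.core.magic import register_line_magic
import folium

@register_line_magic
def folium_map(line):
    """Embed an interactive map centred on "lat, lon[, zoom]"."""
    parts = [p.strip() for p in line.split(",")]
    lat, lon = float(parts[0]), float(parts[1])
    zoom = int(parts[2]) if len(parts) > 2 else 12
    # Returning the map object hands it to IPython's rich display machinery
    return folium.Map(location=[lat, lon], zoom_start=zoom)

# Then, in a notebook code cell:
# %folium_map 52.0406, -0.7594, 15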

In terms of document rendering, Jupyter Book currently offers one of the richest and most powerful HTML output pathways, but I think support for PDF generation may still lag behind R/bookdown workflows. I’m not sure about e-book generation. To support end-user innovation around the publication path, there are several considerations: the document format (eg MyST) and how to get content into that format (eg generating it from magics); the parsed representation (how easy it is to manipulate a document model and render content from it); the templates, that provide examples for how to go from the parsed object representation to output format content (HTML, LaTeX, etc); and the stylesheets, that allow you to customise content rendered by a particular template.

In terms of document authoring, I think there are several issues: first, rich editors that allow you to preview or edit directly in a styled display view, for example WYSIWYG editors. Jupyter notebook has had a WYSIWYG markdown cell extension for a long time and it sucks: as soon as you use it, your original markdown is converted to crappy HTML. The WYSIWYG editor needs to preserve markdown, insofar as that is possible, which means it needs to work with and generate a rich enough flavour of markdown (such as MyST) to provide the author with the freedom to author the content they want to author. It would be nice if such an editor could be extended to allow you to embed high level object generators, for example IPython line or block magics.

Ideally, I’d be able to have a rich display in a full notebook editor that resembles, in terms of broad styling features, the look of Jupyter Book HTML output, perhaps provided via a Jupyter Book default theme styling extension for JupyterLab / RetroLab.

I’m not sure what the RStudio / Quarto gameplan is for the Quarto editor, which currently seems to take the form of a rich editor inside RStudio. The docs appear to suggest it is likely to be spun out as its own editor at some point. How various flavours of output document are generated and published will be a good indicator of how much traction Quarto will gain. RStudio allows “Shiny app publishing” from RStudio, so integration with e.g. Github Pages for “static” or WASM powered live code docs, or server backed environments for live code documents executing against a hosted code server, would demonstrate a similar keenness to support online published outputs.

Personally, I’d like to see a Leanpub style option somewhere for Jupyter Book style ebooks, which would open up a commercialisation route that could help drive a certain amount of adoption; but currently, the Leanpub flavour of markdown is not easily generated from eg MyST via tools such as pandoc or Jupytext, which means there is no easy / direct conversion workflow. In terms of supporting end user innovation, it would help to have support in Jupytext for “easy” converters, eg where you can specify rules for how Jupytext/MyST object model elements map onto output text, both in terms of linear documents (eg Markdown) and nested ones (eg HTML, XML).

Internally, we use a custom XML format based on DocBook (I think?). I proposed an internal project to develop pandoc converters for it to allow conversion to/from that format into other pandoc supported formats, which would have enabled notebook mediated authoring and display of such content. After hours of meetings quibbling over what the validation process for the output should be (I’d have been happy with: is it good enough to get started converting docs that already exist?), I gave up. In the time I spent on the proposal, its revisions, and in meetings, I could probably have learned enough Haskell to hack something together myself. But that wasn’t the point!

At the moment, Curvenote looks to be offering rich WYSIWYG authoring, along with various other value adds. For me to feel confident in exploring this further, I would like to be able to run the editor locally and load and save files to both disk and browser storage. Purely as a MyST / markdown editor, a minimum viable Github Pages demo would demonstrate to me that that is possible. In terms of code execution, my initial preference would be to be able to execute code using something like Thebelab connected invisibly to a JupyterLite powered kernel. More generally, I’d want the ability to connect to a local or remote Binder-launched Jupyter notebook server and launch kernels against it, and the ability to connect to a local or remote Jupyter notebook server, both requiring and not requiring authentication.

In terms of what else WASM powered code execution might support, and noting that you can call on things like the pillow image editing package directly in the browser, I wonder whether it is possible to do the sorts of things listed below purely within the browser (and if not, why not / what are the blockers?).

It is also interesting to consider the possibility of extensions to the Jupyter Book UI that allow it to be used as an editor, both in terms of low hanging fruit and in terms of more ridiculous What if? wonderings. Currently, Thebelab enabled books allow readers to edit code cells as well as executing code against an externally launched kernel. However, there is no ability to save edited code to browser storage (or local or remote storage), or to load modified pages from browser storage. (In a shared computer setting, there is also the question of how browser local storage is managed. In Chrome, or when using Chromebooks, for example, can a user sign in to the browser and have their local storage synched to other browsers, and have it cleared from the actual browser they were using for a session when they sign out?) There is also no option to edit the markdown cell source, but this would presumably not be markdown anyway, but rather rendered HTML. (Unless the browser was rendering the HTML on the fly from source markdown?!)

This perhaps limits their use in education, where we might want to treat the notebooks as interactive worksheets that users can modify and retain their edits to. But the ability to edit, save and reload code cell content at least, and maybe even add and delete code cells, would be a start. One approach might be a simple recipe for running Jupyter Book via Jupyter server proxy (e.g. a simple hacky POC), or for Jupyter Book serving JupyterLite. In the first case, if there was a watcher on a file directory, a Jupyter Book user could perhaps open the file in the local / server Jupyter notebook environment, save the file, and then have the directory watcher trigger a jupyter-book build to update the book. In the second case, could JupyterLite let the user edit the source HTML of locally stored Jupyter Book HTML content and then re-serve it? Or could we run the Jupyter Book build process in the browser and make changes to the source notebook or markdown document?!

Which leads to the following more general set of questions about WASM powered code execution in the browser. For example, can we / could we:

  • run Jupyter-proxied flask apps, API publishing services such as datasette, or browser-based code executing Voila dashboards?
  • run Jupyter server extensions, eg jupytext?
  • run Sphinx / Jupyter Book build processes?
  • run pandoc to support pandoc powered document conversion and export?
  • connect to remote storage and/or mount the local file system into the JupyterLite environment (also: what would the security implications of that be?)?

Are any of these currently or near-future possible? Actually impossible? Are there demos? If they are not possible, why not? What are the blockers?

One of the major issues I had, and continue to have, with Jupyter notebook server handling from the notebook UI is in connecting to kernels. Ideally, a user would trivially be able to connect to kernels running on either a local server or listed by one or more remote servers, all from the same notebook UI. This would mean a student could work locally much of the time, but connect to a remote server (from the same UI) if they need to access a server with a particular resource availability, such as a GPU or proximity to a large dataset. VS Code makes it relatively easy to connect, from the VS Code client, to new Jupyter servers, but at the cost of breaking other connections. Using a Jupyter notebook server, remote kernel approaches typically appear to require the use of ssh tunneling to establish a connection to launch and connect to a remote server.

One way round the problem of server connections for code execution is to have in-browser local code execution. Extending Thebelab to support in-browser JupyterLite / WASM powered kernel connections would enable users of tools such as Jupyter Book to publish books capable of executing code from just a simple webserver, eg using a site such as Github Pages. Trivially, JupyterLite kernels incorporate a useful range of packages, although to support the easy creation of end-user distributions, a very simple mechanism for adding additional packages that come “pre-installed” in the WASM environment is not yet available. (The result is that the end user needs to install packages before they can be used.) JupyterLite also currently lacks an easy / natural way of loading files from browser storage into code.
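
For example, in a JupyterLite (pyolite) notebook, the end user currently has to do something like the following at the start of a session, rather than the distributor being able to bake the packages in:

# In a JupyterLite notebook cell: install an extra package
# into the in-browser (WASM) Python environment
import piplite
await piplite.install("pandas")

import pandas as pd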

Here ends the core dump!

Fragment – Searchable SQLite Database of Andrew Lang Fairy Stories

A week or two ago, I bought a couple of ready-made print-on-demand public domain volumes that collected together all of Andrew Lang’s coloured Fairy Books. There are no contents lists and no index, but the volumes didn’t cost that much more than printing them on demand myself, and they saved me the immediate hassle of compiling my own PDFs.

But… there’s too much to skim through if you’re trying to find a particular story. So I started to wonder about creating a simple full-text search tool to search through the stories. A first attempt, that scrapes the story texts from sacred-texts.com, can be accessed here but it’s in pretty raw form – a SQL query interface essentially published via GitHub Pages and running against a db in the repo. (The query interface is powered via SQLite compiled to WASM and running in the browser, a trick I discovered several years ago… I’m still waiting for datasette in the browser! ;-))

Anyway… code for the scraper and the db constructor is in the repo, with an earlier version available as a gist. And of course, the query UI is available here. The scraper and sample db queries took maybe a couple of hours to pull together in all. And then another half hour today to set the repo up with the SQL GUI and write this blog post…


Note to self – the db is intended to run as a full-text searchable db via a neater user interface, ideally with some sensible facet-based search options (with facet elements identified using entity extraction etc. I think I need to start working on my own “fairy story entities” model too…)

Also on the to do list:

  • annotate stories with Aarne-Thompson or Aarne-Thompson-Uther (ATU) story classification codes (is there a dataset I can pull in to do this keyed on story title? There is one for Grimm here. There’s a motif index here.)
  • pull out first lines as a separate item;
  • explore generating book indexes based on a hacky pagination estimate;
  • put together a fairy story entity model.

It’d also be really interesting to come up with a way of tagging each record with a story structure / morphology (e.g. the Propp morphology of each story), eg so I could easily search for stories with different structure types.

PS (suggested via @TarkabarkaHolgy / @OnlineCrsLady) also add link to Wikipedia page for each story (thinks: that should be easy enough to at least partially automate…)

Fragment: AutoBackup to git

Noting, as ever, that sometimes students either don’t save, or accidentally delete, all their work (I take it as read that most folk don’t back-up: I certainly don’t (“one copy, no back-ups”)) I started pondering whether we could create local git repos in student working directories in provided Docker containers with a background process running to commit changed files every 5 minutes or so.

Note that I haven’t had a chance to test that any of this works yet!

The inotify-tools package wraps the Linux inotify file system monitoring interface with handy CLI tools. (inotify itself provides facilities for monitoring the state of the file system; see, for example, Monitor file system activity with inotify.) The gitwatch package uses inotify-tools as part of “a bash script to watch a file or folder and commit changes to a git repo” (you can also run it as a service).

So I wonder: can we use gitwatch to back-up (ish) files in a student’s working directory in a provided container into a local persistent git repo mounted into the container?

# With git installed, we need to create a repo and set a nominal user identity
git init
git config user.email "junk@example.com"
git config user.name "student"

# Example of adding a file...
git add test.md
# And committing it...
git commit -m "first commit"
# But students won't remember to do that very often...

# So we need to schedule something
# inotify-tools gives us access to a directory monitor...
apt-get update -y && apt-get install -y inotify-tools

# The gitwatch package uses inotify-tools to trigger
# automatic git add/commits for changed files
# https://github.com/gitwatch/gitwatch#bpkg
# It's most conveniently installed using bpkg package manager
curl -Lo- "https://raw.githubusercontent.com/bpkg/bpkg/master/setup.sh" | bash
bpkg install gitwatch/gitwatch

# Note that it requires that USER is set
# The following is probably not recommended...
#export USER=root

# Run as a service
# https://github.com/gitwatch/gitwatch/wiki/gitwatch-as-a-service-on-Debian-with-supervisord

# Example of viewing state of github repo at at particular time # https://stackoverflow.com/a/56187629/454773

It strikes me that if students are working on notebooks, we really want to commit the notebooks cleared of cell outputs. One way of doing this would be to tweak gitwatch to spot changed files and, if they are .ipynb notebook files, actually back up a cleared copy of them (perhaps as .ipynb files via a hidden directory, perhaps via hidden Jupytext paired markdown files) rather than the notebook itself; pre-commit actions might also be useful here…
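
As a sketch of the notebook-clearing step, using the nbformat package (the backup directory convention is just an example):

import nbformat
from pathlib import Path

def backup_cleared_notebook(nb_path, backup_dir=".backups"):
    """Save a copy of a notebook with its code cell outputs cleared."""
    nb = nbformat.read(nb_path, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            cell.outputs = []
            cell.execution_count = None
    out_dir = Path(backup_dir)
    out_dir.mkdir(exist_ok=True)
    nbformat.write(nb, str(out_dir / Path(nb_path).name))

backup_cleared_notebook("test.ipynb")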

Having got files backed up, -ish, into a git repo, the next issue is how students can revisit a point back in time. If the repo is mirrored in GitHub, it looks like you can revisit the state of a repo if you don’t want to go too far back in time (“simply add the date in the following format – HEAD@{2019-04-29} – in the place of the URL where the commit would usually belong”).

On the command line, it seems we can get a commit immediately prior to a particular date: git rev-list master --max-count=1 --before=2014-10-01 (can we also do that relative to a datetime?). Then it seems that git branch NEW_BRANCH_NAME COMMIT_HASH will create a branch based around the state of the files at the time of that commit, or git checkout -b NEW_BRANCH_NAME COMMIT_HASH will create the branch and check it out (just make sure you have all your current files checked into the repo so you can revert back to them…)

Ah… this looks handy:

# To keep your current changes

You can keep your work stashed away, without committing it, with `git stash`. You would then use `git stash pop` to get it back. Or you can `git commit` it to a separate branch.

# Checkout by date using `rev-parse`

You can checkout a commit by a specific date using rev-parse like this:

`git checkout 'master@{1979-02-26 18:30:00}'`

More details on the available options can be found in the `git-rev-parse` documentation.

As noted in the comments this method uses the reflog to find the commit in your history. By default these entries expire after 90 days. Although the syntax for using the reflog is less verbose you can only go back 90 days.

# Checkout out by date using `rev-list`

The other option, which doesn't use the reflog, is to use `rev-list` to get the commit at a particular point in time with:

```
git checkout `git rev-list -n 1 --first-parent --before="2009-07-27 13:37" master`
```

Note the `--first-parent` if you want only your history and not versions brought in by a merge. That's what you usually want.

PS seems like @choldgraf got somewhere close to this previously… choldgraf/gitautopush: watch a local git repository for any changes and automatically push them to GitHub.

Fragment – Embedding srcdoc IFrames in Jupyter Notebooks

Whilst trying to create IFrame generating magics to embed content in Jupyter Book output, I noticed that the IPython.display.IFrame element only appears to let you refer to external src linked HTML content and not inline/embedded srcdoc content. This has the downside that you need to find a way of copying any generated src-linked HTML page into the Jupyter Book / Sphinx generated distribution directory (Sphinx/Jupyter Book doesn’t seem to copy linked local pages over; I think the bookdown publishing workflow does?).

Noting that folium maps render okay in Jupyter Book without manual copying of the map-containing HTML file, I had a peek at the source code and noticed it was using embedded srcdoc content.

Cribbing the mechanics, the following approach can be used to create an object with a _repr_html_ method that returns an IFrame with embedded srcdoc content that will render the content in Jupyter Book output without the need for an externally linked src file. The HTML is generated from a template page (template) populated using named template attributes passed via a Python dict (data). Once the object is created, when used as the last item in a notebook code cell it will return the inlined IFrame as the display object.

from html import escape
from IPython.display import IFrame

class myDisplayObject:
    def __init__(self, data, template, width="100%", height=None, ratio=1):
        self.width = width
        self.height = height
        self.ratio = ratio
        self.html = self.js_html(data, template)

    def js_html(self, data, template):
        """Generate the HTML for the js diagram."""
        return template.format(**data)

    # cribbed from branca Py package
    def _repr_html_(self, **kwargs):
        """Displays the Diagram in a Jupyter notebook."""
        html = escape(self.html)
        if self.height is None:
            iframe = (
                '<div style="width:{width};">'
                '<div style="position:relative;width:100%;height:0;padding-bottom:{ratio};">'  # noqa
                '<span style="color:#565656">Make this Notebook Trusted to load map: File -> Trust Notebook</span>'  # noqa
                '<iframe srcdoc="{html}" style="position:absolute;width:100%;height:100%;left:0;top:0;'  # noqa
                'border:none !important;" '
                'allowfullscreen webkitallowfullscreen mozallowfullscreen>'
                '</iframe>'
                '</div></div>'
            ).format(html=html, width=self.width, ratio=self.ratio)
        else:
            iframe = (
                '<iframe srcdoc="{html}" width="{width}" height="{height}"'
                'style="border:none !important;" '
                '"allowfullscreen" "webkitallowfullscreen" "mozallowfullscreen">'
                '</iframe>'
            ).format(html=html, width=self.width, height=self.height)
        return iframe
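
By way of a trivial usage example (the template and data here are made up purely for illustration):

template = """
<html>
  <body>
    <h1>{title}</h1>
    <div id="output">{message}</div>
  </body>
</html>
"""

data = {"title": "Demo", "message": "Hello from an inlined srcdoc IFrame..."}

# As the last line of a notebook code cell, this renders the inlined IFrame
myDisplayObject(data, template, height=200)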

For an example of how this is used, see innovationOUtside/nb_js_diagrammers (functionality added via this commit).

More Scripted Diagram Extensions For Jupyter Notebook, Sphinx and Jupyter Book

Following on from Previewing Sphinx and Jupyter Book Rendered Mermaid and Wavedrom Diagrams in VS Code, I note several more sphinx extensions for rendering diagrams from source script in appropriately tagged code fenced blocks:

  • blockdiag/sphinxcontrib-blockdiag: a rather dated, but still working, extension, that generates png images from source scripts. (The resolution of the text in the image is very poor. It would perhaps be useful to be able to specify outputting SVG?) See also this Jupyter notebook renderer extension: innovationOUtside/ipython_magic_blockdiag. I haven’t spotted a VS Code preview extension for blockdiag yet. Maybe this is something I should try to build for myself? Maybe a strike day activity for me when the strikes return…
  • sphinx-contrib/plantuml: I haven’t really looked at PlantUML before, but it looks like it can generate a whole host of diagram types, including sequence diagrams, activity diagrams, state diagrams, deployment diagrams, timing diagrams, network diagrams, wireframes and more.
PlantUML Activity Diagram
PlantUML Deployment Diagram
PlantUML Timing Diagram
PlantUML Wireframe (1)
PlantUML Wireframe (2)

The jbn/IPlantUML IPython extension and the markdown-preview-enhanced VS Code extension will also preview PlantUML diagrams in Jupyter notebooks and VS Code respectively. For example, in a Jupyter notebook we can render a PlantUML sequence diagram via a block magicked code cell.
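
For example, something like the following (assuming the iplantuml package is installed; the diagram source is just a toy example):

#%pip install iplantuml
import iplantuml

# Then, in a separate code cell, the block magic renders the sequence diagram:

%%plantuml
@startuml
Alice -> Bob: query
Bob --> Alice: response
@enduml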

Simple Jupytext Github Action to Update Jupyter .ipynb Notebooks From Markdown

In passing, a simple Github Action that will look for updates to markdown files in a GitHub push or pull request and, if it finds any, will run jupytext --sync over them to update any paired files identified in the markdown metadata (and/or via jupytext config settings?).

Such files might have been modified, for example, by an editor proof reading the markdown materials in a text editor.

If I read the docs right, the --use-source-timestamp flag will set the notebook timestamp to the same as that of the modified markdown file(s)?

The modified markdown files themselves are identified using the dorny/paths-filter action. Any updated .ipynb files are then auto-committed to the repo using the stefanzweifel/git-auto-commit-action action.

name: jupytext-changes

on:
  push

jobs:
  sync-jupytext:
    runs-on: ubuntu-latest
    steps:

    # Checkout
    - uses: actions/checkout@v2

    # Test for markdown
    - uses: dorny/paths-filter@v2
      id: filter
      with:
        # Enable listing of files matching each filter.
        # Paths to files will be available in `${FILTER_NAME}_files` output variable.
        # Paths will be escaped and space-delimited.
        # Output is usable as command-line argument list in Linux shell
        list-files: shell

        # In this example, changed markdown files will be synched using jupytext
        # If we specify we are only interested in added or modified files, deleted files are ignored
        filters: |
            notebooks:
                - added|modified: '**.md'
        # Should we also identify deleted md files
        # and then try to identify (and delete) .ipynb docs otherwise paired to them?
        # For example, remove .ipynb file on same path ($FILEPATH is a file with .md suffix)
        # rm ${FILEPATH%.md}.ipynb

    - name: Install Packages if changed files
      if: ${{ steps.filter.outputs.notebooks == 'true' }}
      run: |
        pip install jupytext

    - name: Synch changed files
      if: ${{ steps.filter.outputs.notebooks == 'true' }}
      run: |
        # If a command accepts a list of files,
        # we can pass them directly
        # This will only synch files if the md doc include jupytext metadata
        # and has one or more paired docs defined
        # The timestamp on the synched ipynb file will be set to the
        # same time as the changed markdown file
        jupytext --use-source-timestamp  --sync ${{ steps.filter.outputs.notebooks_files }}

    # Auto commit any updated notebook files
    - uses: stefanzweifel/git-auto-commit-action@v4
      with:
        # This would be more useful if the git hash were referenced?
        commit_message: Jupytext synch - modified, paired .md files

Note that the action does not execute the notebook code cells (adding --execute to the jupytext command would fix that, although steps would also need to be taken to ensure that an appropriate code execution environment is available): for the use case I’m looking at, the assumption is that edits to the markdown do not include making changes to code.

Supporting Playful Exploration of Data Clustering and Classification Using datadraw

One of the most powerful learning techniques I know that works for me is play, the freedom to explore an idea or concept or principle in an open-ended, personally directed way, trying things out, test them, making up “what if?” scenarios, and so on.

Playing takes time of course, and the way we construct courses means that we don’t give students time to play, preferring to overload them with lots of stuff to read, presumably on the basis that stuff = value.

If I were to produce a 5 hour chunk of learning material that was little more than three or four pages of text, defining various bits of playful activity, I suspect that questions would be asked on the basis that 5 hours of teaching should include lots more words… I also suspect that the majority of students would not know how to play constructively within the prescribed bounds for that length of time.

Whatever.

In passing, I note this rather neat Python package, drawdata, that plays nice with Jupyter notebooks:

Example of using drawdata.draw_scatter()

Select a group (a, b, or c), draw a buffered line, and it will be filled (ish) with random dots. Click the copy csv button to grab the data into the clipboard, and then you can retrieve it from there into a pandas dataframe:

Retrieve data from clipboard into pandas dataframe
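
A minimal sketch of that round trip (assuming the drawdata package is installed and exposes draw_scatter() as in the screenshot above):

#%pip install drawdata
from drawdata import draw_scatter
import pandas as pd

# Open the drawing widget in the notebook; draw some points for groups a/b/c,
# then click the "copy csv" button to put the data on the clipboard
draw_scatter()

# In a new cell, pull the copied CSV from the clipboard into a dataframe
df = pd.read_clipboard(sep=",")
df.head()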

At the risk of complicating the UI, I wonder about adding a couple more controls, one to tune the width of the buffered line (and also ensure that points are only generated inside the line envelope), another to set the density of the points.

Another tool allows you to generate randomly sampled points along a line:

I note this could be a limiting case of a zero-width line in a drawdata widget with a controllable buffer size.

Could using such a widget in a learning activity provide an example of technology enhanced learning, I wonder?! (I still don’t know what that phrase is supposed to mean…)

For example, I can easily imagine creating a simple activity where students get to draw different distributions and then run their own simple classifiers over them. The playfulness aspect would come in when students started wondering about how different data groups might interact, or how linear classifiers might struggle with particular multigroup distributions.

As a related example of supporting such playfulness, the tensorflow playground provides several different test distributions with different interesting properties:

Distributions in tensorflow playground

To run your own local version of tensorflow playground via a jupyter-server-proxy, see innovationOUtside/nb_tensorflow_playground_serverproxy.

With drawdata, students could quite easily create their own test cases to test their own understanding of how a particular classifier works. To my mind, developing such an understanding is supported if we can also visualise the evolution of a classifier over time. For example, the following animation (taken from some material I developed for a first year module that never made it past the “optional content” stage) shows the result of training a simple classifier over a small dataset with four groups of points.

Evolution of a classifier

See also: How to Generate Test Datasets in Python with scikit-learn, a post on the Machine Learning Mastery blog, and Generating Fake Data – Quick Roundup, which summarises various other takes on generating synthetic data.

PS This also reminds me a little bit of Google Correlate (for example, Google Correlate: What Search Terms Does Your Time Series Data Correlate With?), where you could draw a simple timeseries and then try to find search terms on Google Trends with the same timeseries behaviour. On a quick look, none of the original URLs I had for that seem to work anymore. I’m not sure if it’s still available via Google Trends, for example?

PPS Here’s another nice animation from Doug Blank demonstrating a PCA based classification: https://nbviewer.org/github/Calysto/conx-notebooks/blob/60106453bdb66a83da7c2741d7644b7f8ee94517/PCA.ipynb

30 Second Bookmarklet for Saving a Web Location to the Wayback Machine

In passing, I just referenced a web page in another post, the content of which I really don’t want to lose access to if the page disappears. A quick fix is to submit the page to the Internet Archive Wayback Machine, so that at least I know a copy of the page will be available there.

From the Internet Archive homepage, you can paste in a URL and the Archive will check to see if it has a copy of the page. In many cases, the page will have been grabbed multiple times over the years, which also means you can track a page’s evolution over time.

Also on the homepage is a link that allows you to submit a URL to request that that page is also saved to the archive:

Here’s the actual save page:

When you save the page, a snapshot is grabbed:

Saving a page to the Wayback Machine

Checking the URL for that page, it looks like we can grab a snapshot by passing the URL https://web.archive.org/save/ followed by the URL of the page we want to save…
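
So, for example, a quick sketch of that URL pattern in Python using the requests package:

import requests

def archive_url(url):
    """Ask the Wayback Machine to grab a snapshot of a page."""
    r = requests.get("https://web.archive.org/save/" + url)
    r.raise_for_status()
    # The final response URL should point at the archived snapshot
    return r.url

archive_url("https://blog.ouseful.info/")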

Hmmm… 30s bookmarklet time. Many years ago, I spent some of the happiest and most productive months (for me) doing an Arcadia Fellowship with the University Library in Cambridge, tinkering with toys and exploring that incredible place.

During my time there, I posted to the Arcadia Mashups Blog, which still exists as a web fossil. One of the posts there, The ‘Get Current URL’ Bookmarklet Pattern, is a blog post and single page web app in one, that lets you generate simple redirection bookmarklets:

Window location bookmarklet form, Arcadia Mashups Blog

If you haven’t come across bookmarklets before, you could think of them as automation web links that run a bit of Javascript to do something useful for you, either by modifying the current web page, or by doing something with its web location / URL.

When you save a bookmarklet, you should really check that the bookmarklet javascript code isn’t doing anything naughty, or make sure you only install bookmarklets from trusted locations.

In the above Archive-it example, the code grabs the current page location and passes it to https://web.archive.org/save/ . If you drag the bookmarklet to your browser toolbar, open a web page, and click the bookmarklet, the page is archived:

Oh, happy days…

So, a 30s hack and I have built myself a tool to quickly archive a web URL. (Writing this blog post took much longer than remembering that post existed and generating the bookmarklet.)

There are of course other tools for doing similar things, not least robustlinks.mementoweb.org, but it was as quick to create my own as to try to re-find that…

See also: Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs and Name (Date) Title, Available at: URL (Accessed: DATE): So What?

Previewing Sphinx and Jupyter Book Rendered Mermaid and Wavedrom Diagrams in VS Code

In VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation, I linked to a rather magical VS Code extension (shd101wyy.markdown-preview-enhanced) that lets you preview diagrams rendered from various diagram scripts, such as documents defined using Mermaid markdown script or wavedrom.

The diagram script is incorporated in code fenced block qualified by the scripting language type, such as ```mermaid or ```wavedrom.

Pondering whether this was also a route to live previews of documents rendered from the original markdown using Sphinx (the publishing engine used in Jupyter Book workflows, for example), I had a poke around for related extensions and found a couple of likely candidates, such as:

After installing the packages from PyPi, these extensions are enabled in a Jupyter Book workflow by adding the following to the _config.yml file:

sphinx:
  extra_extensions:
    - sphinxcontrib.mermaid
    - sphinxcontrib.wavedrom

Building a Sphinx generated book from a set of markdown files using Jupyter Book (e.g. by running jupyter-book build .) does not render the diagrams… Boo…

However, changing the code fence label to a MyST style label (as suggested here), does render the diagrams in the Sphinx generated Jupyter Book output, albeit at the cost of not now being able to preview the diagram directly in the VS Code editor.

It’s not so much of an overhead to flip between the two, and an automation step could probably be set up quite straightforwardly to convert between the forms as part of a publishing workflow, but I’ve raised an issue anyway suggesting it might be nice if the shd101wyy.markdown-preview-enhanced extension also supported the MyST flavoured syntax…

See also: A Simple Pattern for Embedding Third Party Javascript Generated Graphics in Jupyter Notebooks, which shows a simple recipe for adding js diagram generation support to classic Jupyter notebooks, at least, using simple magics. A simple transformation script should be able to map between the magic cells and an appropriately fenced code block that can render the diagram in a Sphinx/Jupyter Book workflow.