Custom Branded Logos for JupyterLab and RetroLab (Jupyter notebook v.7)

Of the three main blockers in terms of look and feel that I’ve used as an excuse not to start thinking about moving course materials over to JupyterLab/RetroLab, I’ve now got hacky extensions for styling notebooks, empinken style, and for enabling cell run status indicators. The next one is purely cosmetic – adding custom logos.

The proper way to do this (?!) is probably to use a custom theme. See https://github.com/g2nb/jupyterlab-theme/ for an example of adding custom logos to a custom theme.

Whilst there doesn’t appear to be a straightforward, “parameter configurable” way of doing this, and there is zero help on the various forums and issue trackers for anyone who does want to achieve it (because, believe it or not, the setting does matter sometimes, and learners in particular often benefit from the sense of being in a “safe space” that a branded environment suggests), I finally had a poke around for some hacky ways of doing it.

It does, of course, require all the pain of building an extension, but we don’t need much more than some simple CSS and some simple JS, so we can get away with using the JupyterLab JavaScript extension cookiecutter.

To generate the cookiecutter files into a pre-existing directory, such as one created by cloning a remote GitHub repo onto your desktop, run the command:

cookiecutter https://github.com/jupyterlab/extension-cookiecutter-js -fs

You can then set up the environment:

cd my_extension/

python -m pip install .

And do a test build:

jlpm run build && python3 -m build

You can then install the resulting package:

pip3 install ./dist/my_extension-0.1.0-py3-none-any.whl

To customise the logos, we can use the ./style/base.css file to hide the default JupyterLab and RetroLab logos and add our own:

/* JupyterLab: add our own logo as a background image... */
#jp-MainLogo {
    background-image: url(./images/OU-logo-36x28.png);
    background-repeat: no-repeat;
}

/* ...and hide the default SVG logo */
#jp-MainLogo > svg {
    visibility: hidden;
}

/* RetroLab: add our own logo as a background image... */
#jp-RetroLogo {
    background-image: url(./images/OU-logo-53x42.png);
    background-repeat: no-repeat;
}

/* ...and hide the default SVG logo */
#jp-RetroLogo > svg {
    visibility: hidden;
}

The images should be placed in a new ./style/images/ folder.

To have a go at hacking the favicon (which works on a “full” server, ish, but not in JupyterLite?), we need some simple JavaScript in ./style/index.js:

import './base.css';

// Via: https://discourse.jupyter.org/t/changing-favicon-with-notebook-extension/2721

// Grab the document head and any existing favicon link element
let head = document.head || document.getElementsByTagName('head')[0];

let link = document.createElement('link');
let oldLink = document.getElementsByClassName('favicon');
link.rel = 'icon';
link.type = 'image/x-icon';
link.href = 'https://www.open.ac.uk/oudigital/v4/external/img/favicons/favicon.png';
// If a favicon link already exists, copy its classes and remove it
if (oldLink.length) {
    link.classList = oldLink[0].classList;
    head.removeChild(oldLink[0]);
}
head.appendChild(link);

I’m not sure how to reference a local, statically packaged favicon shipped with the extension, so for now I pull in a remote one.

Rebuild the extension, and reinstall it:

jlpm run build && python3 -m build && pip3 install --upgrade ./dist/my_extension-0.1.0-py3-none-any.whl

To make it easier to distribute, I remove the dist/ entry from the .gitignore file and push everything to a repo.

The following GitHub Action can be manually triggered to build a JupyterLite environment and push it to GitHub Pages (in your GitHub repo, you need to go to Settings > Pages and select the gh-pages branch as the target for the site).

I include the custom extension in the JupyterLite build via a requirements-jupyterlite.txt file which includes the following line:

./dist/my_extension-0.1.0-py3-none-any.whl

name: JupyterLite Build and Deploy

on:
  release:
    types: [published]
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install the dependencies
        run: |
          python -m pip install -r requirements-jupyterlite.txt
      - name: Build the JupyterLite site
        run: |
          cp README.md content
          jupyter lite build --contents content
      - name: Upload (dist)
        uses: actions/upload-artifact@v2
        with:
          name: jupyterlite-demo-dist-${{ github.run_number }}
          path: ./_output

  deploy:
    if: github.ref == 'refs/heads/main'
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2.3.1
      - uses: actions/download-artifact@v2
        with:
          name: jupyterlite-demo-dist-${{ github.run_number }}
          path: ./dist
      - name: Deploy
        uses: JamesIves/github-pages-deploy-action@4.1.3
        with:
          branch: gh-pages
          folder: dist

A demo of the extension can then be tried purely within the browser via JupyterLite.

Repo is here: https://github.com/innovationOUtside/jupyterlab_ou_brand_extension

Demo is here: http://innovationoutside.github.io/jupyterlab_ou_brand_extension/

Files changed compared to the cookiecutter-generated files are here: https://github.com/innovationOUtside/jupyterlab_ou_brand_extension/commit/688863cb79557920b1950a9a9b0331ccedcdac39

Extracting GeoJSON Data From Leaflet Maps with shot-scraper

The shot-scraper package is a crazy piece of command-line magic from Simon Willison that, among other things, lets you grab a web page, and all its attendant Javascript state, into a headless browser, inject a bit of scraper JavaScript into it, and return the result.

For some time, I’ve been wondering how to grab rally route data from the rather wonderful Rally Maps website (the last time I looked, the route info seemed to be baked into the page rather than being pulled in as data from its own easy to grab URI). One approach I looked at was a related technique described in Grabbing Javascript Objects Out of Web Pages And Into Python but IIRC, I’d got a little stuck in getting a clean set of route features out.

Anyway, when reading about Web Scraping via Javascript Runtime Heap Snapshots (again via @simonw), it struck me again that the route info must be in the Leaflet map somewhere, so could we get it out? Thinking to search this time for “how to export route leaflet”, I found a simple trick in a Stack Overflow question here that gives the following recipe (I think) for grabbing the route info from a Leaflet map (assuming the map object is in the variable map):

shot-scraper javascript https://www.rally-maps.com/Rallye-Festival-Hoznayo-2022 "var collection = {'type':'FeatureCollection','features':[]}; map.eachLayer(function (layer) {if (typeof(layer.toGeoJSON) === 'function') collection.features.push(layer.toGeoJSON())}); collection" > scraped-routes.geojson

Having a quick peek in a GeoJSON viewer, it seems to work (I just need to scrape some of the other data too, such as marker labels, etc.).
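As a quick sanity check on what actually got scraped, we can also poke at the file from Python. Here’s a minimal sketch, assuming the scraped-routes.geojson file produced by the command above:

import json
from collections import Counter

with open("scraped-routes.geojson") as f:
    collection = json.load(f)

# Count the geometry types present (LineStrings for routes, Points for markers, etc.);
# layer groups may serialise as nested FeatureCollections, so fall back to the top-level type
geometry_types = Counter(
    (feature.get("geometry") or {}).get("type", feature.get("type"))
    for feature in collection["features"]
)
print(len(collection["features"]), "features scraped:", dict(geometry_types))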

Presumably, I could automate looking for variables of the leaflet map type in order to make this recipe even easier to use?

In Passing, Noting Python Apps, and Datasette, Running Purely in the Browser

With pyodide now pretty much established, and from what I can tell, possibly with better optimised/lighter builds on the roadmap, I started wondering again about running Python apps purely in the browser.

One way of doing this is to create ipywidgets-powered applications in a JupyterLite context (although I don’t think you can “appify” these yet, Voila style?), but that ties you into the overhead of running JupyterLite.

The React in Python with Pyodide post on the official pyodide blog looked like it might provide a way in to this, but over the weekend I noticed an announcement from Anaconda regarding Python in the Browser using a new framework called PyScript (examples). This framework provides a (growing?) set of custom HTML components that appear to simplify the process of building pyodide-powered Python web apps that run purely in the browser.

I also noticed over the weekend that sqlite_utils and datasette now run in a pyodide context, the latter providing the SQL API run against an in-memory database (datasette test example).

The client will also return the datasette HTML, so now I wonder: what would be required to be able to run a datasette app in the JupyterLite/JupyterLab context? The datasette server must be intercepting the local URL calls somehow, but I imagine that the Jupyter server is ignorant of them. So how could datasette “proxy” its URL calls via JupyterLite so that the local links in the datasette app can be resolved? (We surely wouldn’t want to have to make all the links handled as button elements?)

UPDATE 5/5/22: It didn’t take Simon long… Datasette now runs as a full web app in the browser under pyodide. Announcement post here: Datasette Lite: a server-side Python web application running in a browser.

So now I’m wondering again… is there a way to “proxy” a Python app so that it can power a web app, running purely in the browser, via Pyodide?

Python Package Use Across Course Notebooks

Some time ago, I started exploring various ways of analysing the structure of Jupyter notebooks as part of an informal “notebook quality” unproject (innovationOUtside/nb_quality_profile).

Over the last week or two, for want of anything else to do, I’ve been looking at that old repo again and made a start tinkering with some of the old issues, as well as creating some new ones.

One of the things I messed around with today was a simple plot showing how different packages are used across a set of course notebooks. (One of the quality reports lists the packages imported by each notebook, and can flag if any packages are missing from the Python environment in which the tool runs.)

The course lasts ~30 weeks, with a set of notebooks most weeks, and the plot shows the notebooks, in order of study, along the x-axis, with the packages listed in the order they are first encountered on the y-axis.

This chart is essentially a macroscopic view of package usage throughout the course module (and as long term readers will know, I do like a good macroscope :-).

In passing, I note that I could also add colour and/or shape or size to identify whether a package is in the Python standard library or is imported from a project installed from PyPI, or to highlight whether the package is not available in the Python environment that the chart-generating tool is run in.
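For what it’s worth, here’s a minimal sketch of the sort of plot I mean, using some made-up notebook/import data rather than the real report data described above:

import matplotlib.pyplot as plt

# Made-up example data: (notebook, imported packages), in study order
notebook_imports = [
    ("01.1", ["pandas"]),
    ("01.2", ["pandas", "matplotlib"]),
    ("02.1", ["pandas", "sqlite3"]),
    ("02.2", ["pandas", "matplotlib", "folium"]),
]

# Assign each package a y-position in the order it is first encountered
package_order = {}
xs, ys = [], []
for x, (notebook, packages) in enumerate(notebook_imports):
    for package in packages:
        package_order.setdefault(package, len(package_order))
        xs.append(x)
        ys.append(package_order[package])

fig, ax = plt.subplots()
ax.scatter(xs, ys)
ax.set_xticks(range(len(notebook_imports)))
ax.set_xticklabels([name for name, _ in notebook_imports], rotation=90)
ax.set_yticks(range(len(package_order)))
ax.set_yticklabels(list(package_order))
ax.set_xlabel("Notebook (study order)")
ax.set_ylabel("Package (order of first use)")
plt.show()

The real chart, of course, is generated from the full set of course notebooks.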

A quick glance at the chart reveals several things:

  • pandas is used heavily throughout (you might guess this is a data related course!), as we can see from the long horizontal run throughout the course;
  • several other packages are used over consecutive notebooks (short, contiguous horizontal runs of dots), suggesting a package that has a particular relevance for the subject matter studied that week;
  • vertical runs show that several new packages are used for the first time in the same notebook, perhaps to achieve a particular task. If the same vertical run appears in other notebooks, perhaps a similar task is being performed in each of those notebooks;
  • there is a steady increase in the number of packages used over the course. If there is a large number of packages introduced for the first time in a single notebook (i.e. a vertical run of dots), this might suggest a difficult notebook for students to work through in terms of new packages and new functions to get their head round;
  • if a package is used in only one notebook (which is a little hard to see — I need to explore gridlines in a way that doesn’t overly clutter the graphic), it might be worth visiting that notebook to see if we can simplify it and remove the singleton use of that package, or check how relevant the topic it relates to is to the course overall;
  • if a notebook imports no modules (or has no code cells), it might be worth checking to see whether it really should be a notebook;
  • probably more things…

I’m now wondering what sort of tabular or list-style report might be useful to identify the notebooks each package appears in, at least for packages that only appear once or twice, or whose uses are widely separated in terms of when they are studied.
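Continuing with the made-up data from the sketch above, the simplest version of such a report might look something like this:

from collections import defaultdict

# Which notebooks does each package appear in?
package_notebooks = defaultdict(list)
for notebook, packages in notebook_imports:
    for package in packages:
        package_notebooks[package].append(notebook)

# Report packages that only appear in one or two notebooks
for package, notebooks in sorted(package_notebooks.items(), key=lambda kv: len(kv[1])):
    if len(notebooks) <= 2:
        print(f"{package}: {', '.join(notebooks)}")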

I also wonder if there are any tools out there that I can use to identify the package functions used in each notebook, to see how they are distributed over the module. (This all rather makes me think of topic analysis!)

My JupyterLite Blockers Are Rapidly Being Addressed…

Three months or so ago, in My Personal Blockers to Adopting JupyterLite for Distance and Open Educational Use, I described several issues that I saw as blocking when it came to using JupyterLite in an educational context. These included the inability to distribute pre-installed packages as part of the JupyterLite distribution, the inability to read and write files programmatically, and the inability to work on files in the local filesystem.

Many of my personal blockers have now been addressed, to such an extent that I think I should probably get round to updating our simple Learn to Code for Data Analysis open educational course to use JupyterLite or RetroLite (I suspect I might actually have to invoke the open license and publish a modified version of my own: the way we institutionally publish content is via a not very flexible Moodle platform and trying to accommodate the JupyterLite environment could be a step too far!).

So: what can we now do with JupyterLite that we couldn’t three months ago that make it more acceptable as an environment we can use to support independent distance and open educational users?

Perhaps the biggest blocker then was the inability to read and write files in the JupyterLite filesystem. This meant that workarounds were required when running packages such as pandas to open and save files from and to JupyterLite file storage. This has now been addressed, so packages such as pandas are now able to read some data files shipped as part of the JupyterLite distribution, and also to save files into the JupyterLite file system and read them back. The JupyterLite file system support also means you can list directory contents, for example, from code within a notebook. (Data file type read/writes that aren’t currently supported by pandas, including SQLite file read/writes, are being tracked via this issue.) Indeed, the pandas docs now include a JupyterLite console that allows you to try out pandas code, including file read/writes, directly in the browser.
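By way of example, the sort of round trip that previously needed workarounds should now just work in a JupyterLite notebook (a minimal sketch; the example.csv filename is made up):

import os
import pandas as pd

# Write a dataframe into the JupyterLite file system...
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.to_csv("example.csv", index=False)

# ...check it appears in a directory listing...
print(os.listdir("."))

# ...and read it back in again
df2 = pd.read_csv("example.csv")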

Another issue was the disconnect between files in the browser and files on the desktop. If you are working with the files in the default JupyterLite file panel, modifications to these files are saved into browser storage. You can add and create new files saved to this file area, as well as delete files from browser storage. If files are shipped as part of the JupyterLite distribution, deleting those files, for example after modification, resets the file to the originally distributed version. (A recent issue raises the question of how to alert users to an updated version of the notebook on the host repository.)

In many educational settings, however, we may want students to work from copies of notebooks that are stored on their local machine, or perhaps on a USB drive plugged into it. A recent extension that works in Chrome, Edge and Opera browsers — jupyterlab-contrib/jupyterlab-filesystem-access — added the ability to open, edit and save files on the local desktop from the JupyterLite environment. (In use, it’s a little bit fiddly in the way you need to keep granting permission to the browser to save files; but better that than the browser grabbing all sorts of permissions over the local file system without the user’s knowledge.)

In passing, I’m not sure if there’s support or demos yet for mounting files from online drives such as OneDrive, Dropbox or Google Drive, which would provide another useful way of persisting files, albeit one that raises the question of how to handle credentials safely.

When it comes to producing a JupyterLite distribution, which is to say, publishing a JupyterLite environment containing a predefined set of notebooks and a set of preinstalled packages, adding additional packages to the distribution has been non-trivial. The practical consequence of this is that packages need to be explicitly installed from notebooks using micropip or piplite, which adds clutter to notebooks, as well as code that will not run in a “traditional”, non-JupyterLite environment unless you add the appropriate code guards (of the sort sketched below). However (and I have yet to test this), it seems that the new jupyterlite/xeus-python-kernel can be built relatively easily under automation to include additional Python packages that are not part of the default pyodide environment (presumably this requires that platform-independent PyPI wheels are available, or custom build scripts that can build an appropriate Emscripten-targeted wheel?): minimal docs. (I note that this kernel also has the benefit that from time import sleep works!) The ipycanvas docs apparently demo this, so the answer to the question of how to create a JupyterLite distribution with custom additional pre-installed packages is presumably available somewhere in the repo (I think these commits are related: setup, package requirements.) It would be really handy if a minimal, self-documenting jupyterlite-custom-package-demo were available to complement jupyterlite/demo with a minimal yet well commented/narrated example of how to add a custom package to a JupyterLite distribution.
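For reference, the sort of code guard I mean looks something like the following sketch: piplite only exists in the JupyterLite pyodide kernel, so the install step is skipped in a conventional environment, where we assume the package is installed by other means (folium is just an example package here):

# In a JupyterLite notebook cell: install the package via piplite if we are
# running in the pyodide kernel, otherwise assume it is already installed
try:
    import piplite
    await piplite.install(["folium"])
except ImportError:
    pass

import folium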

I would have tried a demo of bundling a custom package as a demo for this post, but from reading the docs and skimming the ipycanvas repo, I wasn’t sure what all-and-only the necessary steps are, and I don’t have time to go rat/rabbit-holing atm!

Installing JupyterLite in Chrome as a Browser Web App for Offline Use

If you are working in the Chrome browser on a desktop machine, you can install JupyterLite as a web application. (Firefox discontinued support for progressive web apps in the desktop version of Firefox in 2021.) A benefit of doing this is that you can then use the application in an offline mode, without the need for a web connection.

With a web connection available, if you launch a JupyterLite instance from the JupyterLite homepage, or from the JupyterLite demo page, you will see an option to install the environment as a progressive web application:

In the Chrome browser, you can view your Chrome installed web applications from the chrome://apps URL:

The application will open into its own browser window and can be used with or without a web connection. Files are saved into local browser storage, just as they would be if you were running the application from the original website. This means you can work against the website or the local web app, and the files you have saved into local browser storage will be available in both contexts.

If you do want to work in an offline mode, you need to ensure that all the packages you might want to call on have been “progressively” downloaded and cached by the app. Typically, JupyterLite will only download a Python package when it is first imported. This means that if your notebooks import different packages, you may find in offline use that a required package is not available. To get around this, you should create a single notebook importing all required packages and run that when you do have a network connection to make sure that all the packages you are likely to need are available for offline use.
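In other words, something as simple as a single “warm the cache” cell, run once while you are online, does the job (the package list here is just an example; import whatever your own notebooks need):

# Run this cell while online so the packages get downloaded and cached
# by the web app for later offline use (example package list only)
import pandas
import numpy
import matplotlib
import sqlite3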

Generating and Visualising JSON Schemas

If you’re presented with a 2D tabular dataset (e.g. a spreadsheet or CSV file), it’s quite easy to get a sense of the data by checking the column names and looking at a few of the values in each column. You can also use a variety of tools that will “profile” or summarise the data in each column for you. For example, a column of numeric values might be summarised by the mean and standard deviation of the values, or a histogram, etc. Geographic co-ordinates might be “summarised” by plotting them onto a map. And so on.
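For tabular data loaded into pandas, for example, something as simple as the describe() method gives a quick per-column profile (a trivial sketch with made-up data):

import pandas as pd

# Made-up data: one numeric column, one categorical column
df = pd.DataFrame({"price": [1.5, 2.0, 3.25, 2.75],
                   "town": ["Leeds", "York", "Leeds", "Hull"]})

# count, mean, std, min/max for numeric columns; count, unique, top for the rest
print(df.describe(include="all"))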

If you’re presented with the data in a JSON file, particularly a large data file, getting a sense of the structure of the dataset can be much harder, particularly if the JSON data is “irregular” (that is, if the records differ in some way; for example, some records might have more fields than other records).

A good way of summarising the structure of a JSON file is via its schema. This is an abstracted representation of the tree structure of the JSON object that extracts the unique keys in the object and the data type of any associated literal values.

One JSON dataset I have played with over the years is rally timing and results data from the WRC live timing service. The structure of some of the returned data objects can be quite complicated, so how can we get a handle on them?

One way is to automatically extract the schema and then visualise it, so here’s a recipe for doing that.

For example:

#%pip install genson
import json

import requests
from genson import SchemaBuilder

# Get some data
season_url = "https://api.wrc.com/contel-page/83388/calendar/active-season/"
jdata = requests.get(season_url).json()

# Create a schema object
builder = SchemaBuilder()
builder.add_object(jdata)

# Generate the schema
schema = builder.to_schema()

# Write the schema to a file
with open("season_schema.json", "w") as f:
    json.dump(schema, f)

The schema itself can be quite long and hard to read…

{'$schema': 'http://json-schema.org/schema#',
 'type': 'object',
 'properties': {'seasonYear': {'type': 'integer'},
  'seasonImages': {'type': 'object',
   'properties': {'format16x9': {'type': 'string'},
    'format16x9special': {'type': 'string'},
    'timekeeperLogo': {'type': 'string'},
    'timekeeperLogoDark': {'type': 'string'}},
   'required': ['format16x9',
    'format16x9special',
    'timekeeperLogo',
    'timekeeperLogoDark']},
  'rallyEvents': {'type': 'object',
   'properties': {'total': {'type': 'integer'},
    'items': {'type': 'array',
     'items': {'type': 'object',
      'properties': {'id': {'type': 'integer'},
       'name': {'type': 'string'},

...

To get a sense of the schema, we can visualise it interactively using the json_schema_for_humans visualiser.

from json_schema_for_humans.generate import generate_from_filename

generate_from_filename("season_schema.json",
                       "season_schema.html")

We can open the generated HTML in a web browser (note, this doesn’t seem to render correctly in JupyterLab via IPython.display.HTML; however, we can open the HTML file from the JupyterLab file browser, and as long as we trust it, it will render correctly):

With the HTML trusted, we can then explore the schema:

Wondering: are there other JSON-schema visualisers out there, particularly ones that work as a JupyterLab extension, an IPython magic, or via an appropriate __repr__ display method?

Semantic Feature / Named Entity Extraction Rather Than Regular Expression Based Pattern Matching and Parsing

Playing with story stuff, trying to write a simple BibTeX record generator based on archive.org (see a forthcoming post…), I wanted to parse out publisher and location elements from strings such as Joe Bloggs & Sons (London) or London : Joe Bloggs & Sons.

If the string structures are consistent (PUBLISHER (LOCATION) or LOCATION : PUBLISHER, for example), we can start to create a set of regular expression / pattern matching expressions to pull out the fields.
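For example, a couple of patterns along those lines might look something like this (a sketch only; real-world strings would doubtless need many more cases):

# A sketch of the regular expression approach for the two string forms above:
# "PUBLISHER (LOCATION)" and "LOCATION : PUBLISHER"
import re

patterns = [
    re.compile(r"^(?P<publisher>.+?)\s*\((?P<location>[^)]+)\)$"),   # Joe Bloggs & Sons (London)
    re.compile(r"^(?P<location>[^:]+?)\s*:\s*(?P<publisher>.+)$"),   # London : Joe Bloggs & Sons
]

def parse_imprint(text):
    for pattern in patterns:
        match = pattern.match(text.strip())
        if match:
            return match.groupdict()
    return None

print(parse_imprint("Joe Bloggs & Sons (London)"))
print(parse_imprint("London : Joe Bloggs & Sons"))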

But there is another way, albeit one that relies on a much heavier weight computational approach than a simple regex, and that is to do some language modelling and see if we can extract entities of a particular type, such as an organisation and a geographical location.

Something like this, maybe:

import spacy

nlp = spacy.load("en_core_web_sm")

refs = ["Joe Bloggs & Sons, London",
        "Joe Bloggs & Sons, (London)",
        "Joe Bloggs & Sons (London)",
        "London: Joe Bloggs & Sons",
         "London: Joe, Bloggs & Doe"]

for ref in refs:
    doc = nlp(ref)
    print("---")
    for ent in doc.ents:
        print(ref, "::", ent.text, ent.label_)

Which gives as a result:

---
Joe Bloggs & Sons, London :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, London :: London GPE
---
Joe Bloggs & Sons, (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons, (London) :: London GPE
---
Joe Bloggs & Sons (London) :: Joe Bloggs & Sons ORG
Joe Bloggs & Sons (London) :: London GPE
---
London: Joe Bloggs & Sons :: London GPE
London: Joe Bloggs & Sons :: Joe Bloggs & Sons ORG
---
London: Joe, Bloggs & Doe :: London GPE
London: Joe, Bloggs & Doe :: Joe, Bloggs & Doe ORG

My natural inclination would probably be to get frustrated by writing ever more, and ever more elaborate, regular expressions to try to capture “always one more” outliers in how the publisher/location data might be presented in a single string. But I wonder: is that an outmoded way of doing compute now? Are the primitives we can readily work with now conveniently available as features at a higher representational level of abstraction?

See also things like storysniffer (https://palewi.re/docs/storysniffer/), a Python package that includes a pretrained model for sniffing a URL and estimating whether the related page is likely to contain a news story.

Some folk will say, of course, that these model-based approaches aren’t exact or provable. But in the case of my regexer GUESSING at the name of a publisher and a location, there is still uncertainty as to whether the correct response will be provided for an arbitrary string: my regexes might be incorrect, or I might have missed a pattern in the strings likely to be presented. The model-based approach will also be uncertain in its ability to correctly identify the publisher and location, but the source of the uncertainty will be different. As a user, though, all I likely care about is that most of the time the approach works, and works reasonably correctly. I may well be willing to tolerate, or at least put up with, errors, and don’t really care how or why they arise unless I can do something about it.