Sketching a datasette powered Jupyter Notebook Search Engine: nbsearch

Every so often, I’ve pondered the question of "notebook search": how can we easily support searches over Jupyter notebooks. I don’t really understand why this area seems so underserved, especially given the explosion in the number of notebooks and the way in which notebooks are increasingly used as a document for writing technical documentation, tutorial and instructional material.

One approach I have seen as a workaround is to produce an HTML site from a set of notebooks using something like nbsphinx or Jupyter Book simply to generate access to an inbuilt search engine. But that somehow feels redundant to me. The HTML Jupyter book form is not a collection of notebooks, nor does it provide a satisfying search environment. To access runnable notebooks you need to click through to open the notebook in another environment (for example, a MyBinder environment built from a repository of notebooks that created the HTML pages), or return the the HTML environment and run code cells inline using something like Thebelab.

So I finally got round to considering this whole question again in the form of a quick sketch to see what an integrated Jupyter notebook server search engine might feel like. It’s still early days — the nbsearch tool is provided as a Jupyter server proxy application, rather than integrated as a Jupyter server extension available via a integrated tab, but that does mean it also works in a standalone mode.

The search engine is built on top of a SQLite database, served using datasette. The base UI was stolen wholesale from Simon Willison’s Fast Autocomplete Search for Your Website demo.

The repo is currently here.

The search index is currently based on a full text search index of notebook code and markdown cells. (At the moment, you have to manually generate the index from a command line command. On the to do list for another sketch is an indexer that monitors the file system.) Cells are returned in a cell-type sensitive way:

Screenshot of initial nbsearch UI.

Code cells are syntax highlighted using Prism.js, and feature a Copy button for copying the (unstyled) code (clipboard.js). Markdown cells are styled using a simple Javascript markdown parser (marked.js).

The code cells should also have line numbers but this seems a little erratic at the moment; I can’t get local static js and css files to load properly under the Jupyter server proxy at the moment, so I’m using a CDN. The prism.js line number extension is a separate CDN delivered script to the main Prism script, and it seems that the line number extension doesnlt necessarily load correctly? A race condition maybe?

Each result item displays a link to the original notebook (although this doesn’t necessarily resolve correctly at the moment), and a description of which cell in the notebook the result corresponds to. An inline graphic depicts the structure of the notebook (markdown cells are blue, and code cells pink). Clicking the graphic toggles the display (show / hide) of that results cell group.

The contents of a cell are limited in terms of number of characters displayed. Clicking the the Show all cell button displays the full range of content. Two other buttons — Show previous cell and Show next cell — allow you to repeatedly grab additional cells that surround the originally retrieved results cell.

I’ve also started experimenting with a Thebelab code execution support. At the moment this is hardwired to use a MyBinder backend, but the intention is that if a local Jupyer server is available (eg as in the case when running nbsearch as a Jupyter server proxy application), it will use the local Jupyter server. (Ideally, it would also ensure the correct kernel is selected for any given notebook result.)

nbsearch UI with ThebeLab code execution example.

At the moment, things don’t work completely properly with Thebelab. If you run a query, and "activate" Thebelab in the normal way, things work fine. But when I dynamically add new cells, they arenlt activated.

If I try to manually activate them via a cell-centric button:

then the run/restart buttons appear, but trying to run the cell just hangs on the "Waiting for kernel…" message.

At the moment, the code cell is non-editable, but making it editable should just be a case of tweaking the code cell attributes.

There are lots of other issues to consider regarding cell execution, such as when a cell requires other cells to have run previously. This could be managed by running another query to grab all the previous code cells associated with a particular code code, and running those cells on a restarted kernel using Thebelab before running the current cell.

Providing an option to grab and display (and even copy) all the previous code in a notebook, or perhaps explore the gather package for finding precursor cells, might be a useful facility anyway, even without the ability to execute the code directly.

At the moment, results are limited to the first ten. This needs tweaking, perhaps with a slider ranged to the total number of results for a particular query and then letting you slide to select how many of them you want to display.

A switch to limit results to just code or just markdown cells might also be useful, as would an indicator somewhere that shows the grouped number of hits per notebook, perhaps with selection of this group acting as a facet: selecting a particular notebook would then limit cell results to just that notebook, perhaps grouping and ordering cells within a notebook by cell otde.

The ranking algorithm is something else that may be worth exploring more generally. One simple ranking tweak that may be useful in an educational setting could be to order results by notebook and cell order (for example, if notebooks are named according to some numbering convention: 01.1 Introduction to X, O1.2 X in more detail, 02.1 etc). Again, Simon Willison has led the way in some of the practicalities associated with exploring custom ranking schemes in his post Exploring search relevance algorithms with SQLite.

Way back when, when I originally started blogging, search was one of my favourite topics. I’ve neglected it over the years, but still think it has a lot to offer as a teaching and learning tool (eg things like Search Engine Powered Courses… and search hubs / discovered custom search engines). Many educators disagree with this approach because they like to think they are in control of the narrative, whereas I think that search, with a properly tuned ranking algorithm, can help support a student demand led, query result constructed, personalised structured narrative. Maybe it’s time for me to start playing with these ideas again…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...