Sketching a datasette-powered Jupyter Notebook Search Engine: nbsearch

Every so often, I’ve pondered the question of "notebook search": how can we easily support searching over Jupyter notebooks? I don’t really understand why this area seems so underserved, especially given the explosion in the number of notebooks and the way in which notebooks are increasingly used as a document format for technical documentation, tutorials and instructional material.

One approach I have seen as a workaround is to produce an HTML site from a set of notebooks using something like nbsphinx or Jupyter Book, simply to get access to an inbuilt search engine. But that somehow feels redundant to me. The HTML Jupyter Book form is not a collection of notebooks, nor does it provide a satisfying search environment. To access runnable notebooks you need to click through to open the notebook in another environment (for example, a MyBinder environment built from the repository of notebooks that created the HTML pages), or return to the HTML environment and run code cells inline using something like Thebelab.

So I finally got round to considering this whole question again, in the form of a quick sketch to see what an integrated Jupyter notebook server search engine might feel like. It’s still early days: the nbsearch tool is provided as a Jupyter server proxy application, rather than integrated as a Jupyter server extension available via its own tab, but that does mean it also works in a standalone mode.

The search engine is built on top of a SQLite database, served using datasette. The base UI was stolen wholesale from Simon Willison’s Fast Autocomplete Search for Your Website demo.

The repo is currently here.

The search index is currently based on a full text search index over notebook code and markdown cells, with results returned in a cell-type sensitive way, as shown in the screenshot below. (At the moment, you have to generate the index manually from the command line; on the to do list for another sketch is an indexer that monitors the file system.)
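A minimal sketch of how such an index might be built, using sqlite-utils alongside nbformat; the database filename, table name and column names here are my own illustrative choices rather than the actual nbsearch schema:

from pathlib import Path

import nbformat
import sqlite_utils

db = sqlite_utils.Database("nbsearch.db")

def index_notebooks(root="."):
    """Add each notebook cell as a row in a cells table, then FTS index the source."""
    rows = []
    for nb_path in Path(root).rglob("*.ipynb"):
        nb = nbformat.read(str(nb_path), as_version=4)
        for i, cell in enumerate(nb.cells):
            rows.append({
                "notebook": str(nb_path),
                "cell_index": i,
                "cell_type": cell.cell_type,
                "source": cell.source,
            })
    db["cells"].insert_all(rows)
    # Full text search over the cell source; datasette can then query the FTS table
    db["cells"].enable_fts(["source"], create_triggers=True)

index_notebooks()
print(list(db["cells"].search("pandas"))[:3])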

Screenshot of initial nbsearch UI.

Code cells are syntax highlighted using Prism.js and feature a Copy button for copying the (unstyled) code (clipboard.js). Markdown cells are styled using a simple JavaScript markdown parser (marked.js).

The code cells should also have line numbers, but this seems a little erratic at the moment; I can’t currently get local static JS and CSS files to load properly under the Jupyter server proxy, so I’m using a CDN. The Prism.js line number extension is a separate CDN-delivered script from the main Prism script, and it seems that the line number extension doesn’t always load correctly? A race condition maybe?

Each result item displays a link to the original notebook (although this doesn’t necessarily resolve correctly at the moment), and a description of which cell in the notebook the result corresponds to. An inline graphic depicts the structure of the notebook (markdown cells are blue, code cells pink). Clicking the graphic toggles the display (show / hide) of that result’s cell group.

The contents of a cell are limited in terms of the number of characters displayed. Clicking the Show all cell button displays the full content. Two other buttons — Show previous cell and Show next cell — allow you to repeatedly grab additional cells that surround the originally retrieved result cell.

I’ve also started experimenting with Thebelab code execution support. At the moment this is hardwired to use a MyBinder backend, but the intention is that if a local Jupyter server is available (e.g. as in the case when running nbsearch as a Jupyter server proxy application), it will use the local Jupyter server. (Ideally, it would also ensure the correct kernel is selected for any given notebook result.)

nbsearch UI with ThebeLab code execution example.

At the moment, things don’t work completely properly with Thebelab. If you run a query and "activate" Thebelab in the normal way, things work fine. But when I dynamically add new cells, they aren’t activated.

If I try to manually activate them via a cell-centric button:

then the run/restart buttons appear, but trying to run the cell just hangs on the "Waiting for kernel…" message.

At the moment, the code cell is non-editable, but making it editable should just be a case of tweaking the code cell attributes.

There are lots of other issues to consider regarding cell execution, such as when a cell requires other cells to have run previously. This could be managed by running another query to grab all the previous code cells associated with a particular code cell, and running those cells on a restarted kernel using Thebelab before running the current cell.
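A sketch of what that "grab the preceding code cells" query might look like, assuming the illustrative cells(notebook, cell_index, cell_type, source) table from the indexing sketch above; the database path and example call values are also just placeholders:

import sqlite3

def previous_code_cells(db_path, notebook, cell_index):
    """Return the source of all code cells that precede a given cell in a notebook."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT source FROM cells
           WHERE notebook = ? AND cell_type = 'code' AND cell_index < ?
           ORDER BY cell_index""",
        (notebook, cell_index),
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

# Illustrative call: these cells could then be run on a restarted kernel
# before executing the result cell itself
cells = previous_code_cells("nbsearch.db", "notebooks/01.1 Introduction.ipynb", 12)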

Providing an option to grab and display (and even copy) all the previous code in a notebook, or perhaps explore the gather package for finding precursor cells, might be a useful facility anyway, even without the ability to execute the code directly.

At the moment, results are limited to the first ten. This needs tweaking, perhaps with a slider ranged over the total number of results for a particular query, letting you select how many of them you want to display.

A switch to limit results to just code or just markdown cells might also be useful, as would an indicator somewhere showing the grouped number of hits per notebook, perhaps with selection of this group acting as a facet: selecting a particular notebook would then limit cell results to just that notebook, perhaps grouping and ordering cells within a notebook by cell order.
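Both of those could probably be handled with simple SQL against the same illustrative cells table, assuming the cells_fts full text search table that sqlite-utils creates by default:

import sqlite3

conn = sqlite3.connect("nbsearch.db")
q = "pandas"

# Limit hits to code cells only (swap in 'markdown' for the other toggle)
code_hits = conn.execute(
    """SELECT notebook, cell_index, source FROM cells
       WHERE rowid IN (SELECT rowid FROM cells_fts WHERE cells_fts MATCH ?)
         AND cell_type = 'code'""",
    (q,),
).fetchall()

# Grouped hit counts per notebook, usable as a facet for narrowing down results
notebook_counts = conn.execute(
    """SELECT notebook, COUNT(*) AS hits FROM cells
       WHERE rowid IN (SELECT rowid FROM cells_fts WHERE cells_fts MATCH ?)
       GROUP BY notebook ORDER BY hits DESC""",
    (q,),
).fetchall()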

The ranking algorithm is something else that may be worth exploring more generally. One simple ranking tweak that may be useful in an educational setting could be to order results by notebook and cell order (for example, if notebooks are named according to some numbering convention: 01.1 Introduction to X, 01.2 X in more detail, 02.1 etc.). Again, Simon Willison has led the way in some of the practicalities associated with exploring custom ranking schemes in his post Exploring search relevance algorithms with SQLite.

Way back when, when I originally started blogging, search was one of my favourite topics. I’ve neglected it over the years, but still think it has a lot to offer as a teaching and learning tool (eg things like Search Engine Powered Courses… and search hubs / discovered custom search engines). Many educators disagree with this approach because they like to think they are in control of the narrative, whereas I think that search, with a properly tuned ranking algorithm, can help support a student demand led, query result constructed, personalised structured narrative. Maybe it’s time for me to start playing with these ideas again…

Finding the Path to a Jupyter Notebook Server Start Directory… Or maybe not…

For the notebook search engine I’ve been tinkering with, I want to be able to index notebooks rooted on the same directory path as a notebook server the search engine can be added to as a Jupyter server proxy extension. There doesn’t seem to be a reliably set or accessible environment variable containing this path, so how can we create one?

Here’s a recipe that I think may help: it uses the nbclient package to run a minimal notebook that just executes a simple, single %pwd command against the available Jupyter server.

import nbformat
from nbclient import NotebookClient

_nb =  '''{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pwd"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}'''

nb = nbformat.reads(_nb, as_version=nbformat.NO_CONVERT)

client = NotebookClient(nb, timeout=600)
# Available parameters include:
# kernel_name='python3'
# resources={'metadata': {'path': 'notebooks/'}})

client.execute()
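# After execution, the %pwd result is available as the text/plain payload of the first cell's first output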

path = nb['cells'][0]['outputs'][0]['data']['text/plain'].strip("'").strip('"')

Or maybe it doesn’t? Maybe it actually just runs in the directory you run the script from, in which case it’s just a labyrinthine pwd… Hmmm…

Search Assist With ChatGPT

Via my feeds, a tweet from @john_lam:

The tools for prototyping ideas are SO GOOD right now. This afternoon, I made a “citations needed” bot for automatically adding citations to the stuff that ChatGPT makes up

https://twitter.com/john_lam/status/1614778632794443776

A corresponding gist is here.

Having spent a few minutes prior to that doing a “traditional” search, using good old fashioned search terms and the Google Scholar search engine, to try to find out how defendants in English trials of the early 19th century could challenge jurors (Brown, R. Blake, “Challenges for Cause, Stand-Asides, and Peremptory Challenges in the Nineteenth Century”, Osgoode Hall Law Journal 38.3 (2000): 453-494, http://digitalcommons.osgoode.yorku.ca/ohlj/vol38/iss3/3, looks relevant), I wondered whether ChatGPT, and John Lam’s search assist, might have been able to support the process:

Firstly, can ChatGPT help answer the question directly?

Secondly, can ChatGPT provide some search queries to help track down references?

The original rationale for the JSON-based response was so that it could be used as part of an automated citation generator.

So this gives us a pattern of: write a prompt, get a response, request search queries relating to key points in response.

Suppose, however, that you have a set of documents on a topic and that you would like to be able to ask questions around them using something like ChatGPT. I note that Simon Willison has just posted a recipe on this topic — How to implement Q&A against your documentation with GPT3, embeddings and Datasette — that independently takes a similar approach to a recipe described in OpenAI’s cookbook: Question Answering using Embeddings.

The recipe begins with a semantic search of a set of papers. This is done by generating embeddings for the documents you want to search over using the OpenAI embeddings API, though we could roll our own that runs locally, albeit with a smaller model. (For example, here’s a recipe for a simple doc2vec powered semantic search.) To perform a semantic search, you find the embedding of the search query and then find nearby embeddings generated from your source documents to provide the results. To speed up this part of the process in datasette, Simon created the datasette-faiss plugin to use FAISS.
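The nearest-embedding step itself is simple enough to sketch with plain numpy, where embed() stands in for whichever model generates the vectors (the OpenAI embeddings API, doc2vec, or similar):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_search(query, docs, embed, top_n=3):
    """Rank docs by cosine similarity between the query embedding and each doc embedding."""
    query_vec = embed(query)
    # In practice the document vectors would be precomputed and cached (or held in a FAISS index)
    doc_vecs = [embed(doc) for doc in docs]
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(zip(scores, docs), key=lambda t: t[0], reverse=True)[:top_n]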

The content of the discovered documents is then used to seed a ChatGPT prompt with some “context”, and the question is applied to that context. So the recipe is something like: use a query to find some relevant documents, grab the content of those documents as context, then create a ChatGPT prompt of the form “given {context}, and this question: {question}”.
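The prompt assembly step is little more than string templating; the wording here is just illustrative:

def build_prompt(context_docs, question):
    """Assemble a "given this context, answer this question" style prompt."""
    context = "\n\n".join(context_docs)
    return ("Answer the question as truthfully as possible using the provided context.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt(["...retrieved document text...", "...another document..."],
                      "How could a defendant challenge a juror in an early 19th century English trial?")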

It shouldn’t be too difficult to hack together a thing that runs this pattern against OU-XML materials. In other words:

  • generate simple text docs from OU-XML (I have scrappy recipes for this already);
  • build a semantic search engine around those docs (useful anyway, and I can reuse my doc2vec thing);
  • build a ChatGPT query around a contextualised query, where the context is pulled from the semantic search results. (I wonder, has anyone built a ChatGPT-like thing around an open source GPT-2 model?)

PS Another source of data / facts are data tables. There are various packages out there that claim to provide natural language query support for interrogating tabular data, e.g. abhijithneilabraham/tableQA, and this review article, or the Hugging Face table-question-answering transformer, but I forget which I’ve played with. Maybe I should write a new RallyDataJunkie unbook that demonstrates those sorts of tools around tabulated rally results data?
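For example, a quick sketch using the Hugging Face table-question-answering pipeline; the model name is the pipeline’s usual TAPAS default and the toy results table is purely illustrative:

import pandas as pd
from transformers import pipeline

# TAPAS-style models expect a table of string-valued cells
results = pd.DataFrame({
    "driver": ["Ogier", "Evans", "Rovanpera"],
    "stage_time": ["3:12.4", "3:14.1", "3:13.0"],
    "position": ["1", "2", "3"],
}).astype(str)

tqa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")
print(tqa(table=results, query="Which driver finished in position 1?")["answer"])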