Initial Sketch – Searching Jupyter Notebooks Using lunr

Coming round as it is to that time of year for updating, testing and freezing/“gold mastering” the TM351 VM that we distribute to students for the October presentation of our Data Analysis and Management course, I’ve been thinking about how we can make the VM more useful for students, and whether the things we’re looking at might also be useful in an Institute of Coding context. (I’m on a workpackage looking at infrastructure to support coding education: please get in touch if you’re up for a conversation around such matters:-)

One of the things I’ve been pondering is how to search across notebooks – a lot of the TM351 teaching material is in notebooks and there’s no obvious way of searching over them. (There’s also no obvious way of printing them all out in one go, or saving them to a merged document – I’ll post more about that in a separate post…)

In my sketches for the new VM, I’ve added a simple Python webserver that exposes a homepage that links to the various services running inside the VM. (Ideally, there’d also be indicator lights showing whether the associated Linux service is running or not: anyone know of a simple package to help with that?)

This made me think that it might be useful to provide a simple search tool over the notebooks in the (shared) directory that the VM shares with the host.

One way of doing this might be to put the notebook content into a simple SQLite database and serve it using datasette, or query it via a Scripted Form style UI. SQLite has a full text search extension (FTS3-5) and some support for fuzzy matching (eg spellfix1), although I’m not sure how well it would fare as a code search engine.
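A minimal sketch of the SQLite route, using Python’s built-in sqlite3 module. The table name, column names and example rows are all made up for illustration, and whether FTS5 (or FTS4) is available depends on how the local SQLite library was compiled, so the sketch tries one and falls back to the other:

```python
import sqlite3


def make_fts_table(conn):
    """Create a full text search table, preferring FTS5 and falling back to FTS4."""
    for module in ("fts5", "fts4"):
        try:
            conn.execute(
                f"CREATE VIRTUAL TABLE cells USING {module}(notebook, cell_index, body)"
            )
            return module
        except sqlite3.OperationalError:
            # This SQLite build lacks the module; try the next one
            continue
    raise RuntimeError("This SQLite build has no FTS4/FTS5 support")


def search(conn, query):
    """Return (notebook, cell_index) rows for cells matching a full text query."""
    cur = conn.execute(
        "SELECT notebook, cell_index FROM cells WHERE cells MATCH ?", (query,)
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
make_fts_table(conn)

# Hypothetical rows: (notebook path, cell index, markdown cell text)
rows = [
    ("01_intro.ipynb", 0, "Introducing pandas dataframes"),
    ("02_sql.ipynb", 3, "Querying a PostgreSQL database with SQL"),
]
conn.executemany("INSERT INTO cells VALUES (?, ?, ?)", rows)

for row in search(conn, "pandas"):
    print(row[0])
```

Writing the table to a file rather than `:memory:` would give something datasette could then be pointed at.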

But I also came across a lightweight Javascript search engine called lunr – “[a] bit like Solr, but much smaller and not as bright” – and an example of How [Matthew Daly] Added Search to [His] Site With Lunr.js, so I thought I’d give that a go…

At the moment, I’m only testing against a couple of notebooks. The search results are at the markdown cell level, so if a cell contains a lot of text, the whole cell will be displayed, which may not be optimal. I’m rendering the cell markdown as HTML in the browser using the Showdown Javascript package although this could be disabled to show just the raw markdown. My guess is that any relatively linked images embedded in the markdown will show as broken.

The search terms are supposed to be highlighted using mark.js, but while I had it working in a preliminary sketch, it seems to be borked now and I’m not sure where I’m setting it up incorrectly or using it wrong.

It strikes me that if a markdown cell in the results contains a lot of text, it might be worth trying to identify where in the text the query terms appear and then prune the result text around them.
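One way the pruning might work, sketched in Python on the assumption that it happens when the store is queried (it could equally be done in the browser). The `snippet` function and its `width` parameter are hypothetical names, and the matching is a plain case-insensitive substring search, so a stemmed match found by lunr (for example “searching” for the query “search”) will not always be located exactly:

```python
import re


def snippet(text, query, width=60):
    """Return the text around the first query term found, else the start of the text."""
    terms = [re.escape(t) for t in query.split() if t]
    match = re.search("|".join(terms), text, flags=re.IGNORECASE) if terms else None
    if match is None:
        # No literal match: fall back to the opening of the cell
        return text[: 2 * width]
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    # Mark where text has been cut away
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


long_cell = "a" * 100 + " lunr " + "b" * 100
print(snippet(long_cell, "lunr", width=5))
```

A fuller version might cut on word boundaries and return one snippet per matched term rather than just the first.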

I’m making no attempt to search code cells, though I did think about trying to extract lines of comment text using a crib along the lines of if LINE.strip().startswith('#').
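Fleshed out a little, that crib might look something like the following sketch. The function name is made up; it assumes the notebook has already been loaded as a dict (or is passed as a JSON string), and it only catches whole-line comments, not comments trailing a line of code:

```python
import json


def comment_lines(notebook):
    """Yield (cell_index, comment_text) for whole-line '#' comments in code cells."""
    nb = json.loads(notebook) if isinstance(notebook, str) else notebook
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows source to be a string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            stripped = line.strip()
            if stripped.startswith("#"):
                yield i, stripped.lstrip("#").strip()


# A tiny made-up notebook for illustration
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Heading"]},
        {"cell_type": "code", "source": ["# load data\n", "x = 1  # inline\n", "    ## nested"]},
    ]
}
print(list(comment_lines(example)))
```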

I’m generating the lunr index using lunr.py and saving it along with a store of the cell content in a JSON file that’s loaded into the search page. Whilst I’m testing the search page served from a simple Python httpserver, it struck me that it could also be served along a /view path in the Jupyter notebook context. When I first tried this, using JSON data loaded into the search page using jQuery as a JSON object, I got a CORS error. Rather than waste too much time trying to solve that (I wasted a little!) I worked around it instead and loaded my lunr.json search index and store into the page as JSONP.
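The index-and-store step might be sketched roughly as follows. This is not the code from the gist: the document ids, the `body` field, the payload layout and the `loadSearchData` callback name are all assumptions made for illustration. It does rely on lunr.py’s `lunr()` builder function and the index’s `serialize()` method, and lunr.py is imported lazily so the helper functions can be used without it installed:

```python
import json


def markdown_cells(nb, path):
    """Return one document per markdown cell, each with a unique id."""
    docs = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        docs.append({"id": f"{path}#{i}", "notebook": path, "body": source})
    return docs


def build_index(docs):
    """Build a lunr index over the 'body' field (requires: pip install lunr)."""
    from lunr import lunr  # lazy import: only needed when actually indexing

    return lunr(ref="id", fields=("body",), documents=docs)


def to_jsonp(payload, callback="loadSearchData"):
    """Wrap a JSON-serialisable payload in a call to a (hypothetical) callback."""
    return f"{callback}({json.dumps(payload)});"


def write_search_data(docs, filename="lunr.json"):
    """Save the serialised index plus a cell store as a JSONP file."""
    payload = {
        "index": build_index(docs).serialize(),
        "store": {d["id"]: d for d in docs},
    }
    with open(filename, "w", encoding="utf-8") as f:
        f.write(to_jsonp(payload))
```

The search page would then define a function with the callback’s name and load the file via a script tag, which sidesteps the CORS check because nothing is fetched as data.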

One thing I need to do is provide an easy to use tool to generate the search index and lookup store from a set of notebooks. (In the TM351 VM context, this would be in the context of the mounted /shared notebooks folder that the notebook server runs at the top of.)
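As a starting point for such a tool, here is a sketch of the file-gathering part: walk a folder, skip the checkpoint directories, and load each notebook as JSON. The function names are hypothetical, and the root folder is passed in rather than hard-coded to /shared:

```python
import json
import os


def find_notebooks(root):
    """Yield paths of .ipynb files under root, skipping .ipynb_checkpoints folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune checkpoint folders in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d != ".ipynb_checkpoints"]
        for name in sorted(filenames):
            if name.endswith(".ipynb"):
                yield os.path.join(dirpath, name)


def load_notebook(path):
    """Load a notebook file as a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    import sys

    # Usage: python find_notebooks.py /path/to/notebooks
    for nb_path in find_notebooks(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(nb_path)
```

Each loaded notebook could then be passed to whatever builds the index and the lookup store.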

There still needs to be some clear thinking about what to link to – my initial thought is to link to the notebook running in the VM. If anchors are in the original markdown cell text it should be possible to deeplink to those. It might also be possible to link to an HTML render of the notebook. This could be done via nbconvert (although I am not currently running this as a service in the VM) or perhaps as an in-browser rendering of the .ipynb JSON using something like Notebook.js / nbpreview. (FWIW, I also note react-jupyter).

But if nothing else, this is a thing that can be used and poked around to find out where it’s most painful in use and how it can be improved. A couple of things that immediately come to mind in terms of Jupyter integration, for example:

  • Jupyter notebook classic UI could come with a ‘Search notebooks’ tab (and maybe a search indexer running in the background as and when notebooks in scope are saved);
  • JupyterLab could be extended with a lunr-based notebook search plugin.

Code for my initial pencil sketch of a lunr Jupyter notebook markdown cell search tool can be found in this gist.

PS via Grant Nestor on the Jupyter Google group:

grep --include='*.ipynb' --exclude-dir='.ipynb_checkpoints' -rliw . -e 'search query'

This will search your Jupyter server root recursively for files that contain the whole word (case-insensitive) “search query” and only return the file names of matches.

More info: https://stackoverflow.com/questions/16956810/how-do-i-find-all-files-containing-specific-text-on-linux
