Following on from initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it’s located in a document.
The concordancer means we can offer a results listing more in accordance with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I’d need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display the dumping the contents of a complete cell into the results listing.
The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming that we could apply is not best suited to indexing code.
Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as
%load_ext, aren’t recognised; instead, they’re split into separate tokens:
A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.