Following on from the initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it, and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it appears in a document.
The concordancer means we can offer a results listing more in keeping with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I’d need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display than dumping the contents of a complete cell into the results listing.
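To make the idea concrete, here is a minimal keyword-in-context sketch in plain Python. It is a stand-in for illustration, not the modified NLTK concordancer used in the gist: the function name `concordance` and the `width` parameter are made up here, and the choice to return one snippet per occurrence is just one way of dealing with a term that appears several times in the same cell.

```python
import re


def concordance(text, phrase, width=40):
    """Return one snippet per occurrence of phrase, with up to
    `width` characters of context on either side."""
    snippets = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Mark snippets that were cut out of a longer text
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        # Flatten newlines so each hit displays on a single line
        snippets.append(prefix + text[start:end].replace("\n", " ") + suffix)
    return snippets


cell_source = "import pandas as pd\ndf = pd.read_csv('x.csv')\n# pandas is handy"
for snippet in concordance(cell_source, "pandas", width=10):
    print(snippet)
```

Matching on the raw string rather than on tokens sidesteps tokenisation altogether, at the cost of also matching inside longer words.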
The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming that we could apply is not best suited to indexing code.
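As a rough sketch of the general approach (not the code in the gist), the following uses only the standard library. It assumes the SQLite build bundled with Python includes the FTS5 extension; the table name `nbcells`, its columns, and the helper names are all invented for the example. FTS5 only stems if you ask for the `porter` tokenizer, so leaving it out keeps code tokens unstemmed.

```python
import sqlite3


def create_index(conn):
    # Assumes FTS5 is available in this SQLite build.
    # No stemming tokenizer is requested, so terms are matched as written.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS nbcells "
        "USING fts5(path UNINDEXED, cell_type UNINDEXED, source)"
    )


def index_notebook(conn, path, nb):
    """Add the code and markdown cells of a parsed notebook (a dict) to the index."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") not in ("code", "markdown"):
            continue
        source = cell.get("source", "")
        # Notebook JSON stores source as a string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        conn.execute(
            "INSERT INTO nbcells (path, cell_type, source) VALUES (?, ?, ?)",
            (path, cell["cell_type"], text),
        )


def search(conn, query):
    """Run a full text query over the indexed cell sources."""
    return conn.execute(
        "SELECT path, cell_type, source FROM nbcells WHERE nbcells MATCH ?",
        (query,),
    ).fetchall()


# A tiny in-memory notebook stands in for a file read from disk
demo_nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo\n", "Load the data with pandas."]},
        {"cell_type": "code", "source": "import pandas as pd"},
        {"cell_type": "code", "source": "print('hello')"},
    ]
}

conn = sqlite3.connect(":memory:")
create_index(conn)
index_notebook(conn, "demo.ipynb", demo_nb)
for path, cell_type, source in search(conn, "pandas"):
    print(path, cell_type, repr(source))
```

For a real notebook, the dict would come from parsing the .ipynb file with `json.load`.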
Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as %load_ext, aren’t recognised; instead, they’re split into separate tokens.
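One possible workaround is a small regular-expression tokeniser that tries to match a magic before anything else. The pattern below is a hypothetical illustration written with the standard `re` module, not part of NLTK or the gist, and the names `CODE_TOKEN` and `tokenize_code` are made up for the example.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   %name or %%name (line and cell magics), a # comment marker,
#   a run of word characters, then any other single symbol
CODE_TOKEN = re.compile(r"%{1,2}\w+|#|\w+|[^\w\s]")


def tokenize_code(text):
    """Split text into tokens, keeping magics such as %load_ext whole."""
    return CODE_TOKEN.findall(text)


print(tokenize_code("%load_ext sql"))
print(tokenize_code("x = 10 % 3  # remainder"))
```

A pattern this simple has obvious blind spots: a modulo written without spaces, as in `10%3`, would be read as a magic called `%3`, so it is a starting point rather than a proper code tokeniser.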
A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.
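One simple way to approach this, sketched here under several assumptions, is to re-scan the notebook directory on demand and compare file modification times against what the database last saw. The table names `nbfiles` and `nbcells` and the function `sync_notebooks` are invented for the example, plain tables are used rather than a full text index to keep it short, and nothing here hooks into Jupyter's own save events.

```python
import json
import os
import sqlite3


def sync_notebooks(conn, root):
    """Bring the cell table in line with the .ipynb files found under root."""
    conn.execute("CREATE TABLE IF NOT EXISTS nbfiles (path TEXT PRIMARY KEY, mtime REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS nbcells (path TEXT, cell_type TEXT, source TEXT)")
    known = dict(conn.execute("SELECT path, mtime FROM nbfiles"))
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Jupyter's autosave copies
        dirnames[:] = [d for d in dirnames if d != ".ipynb_checkpoints"]
        for name in filenames:
            if not name.endswith(".ipynb"):
                continue
            path = os.path.join(dirpath, name)
            seen.add(path)
            mtime = os.path.getmtime(path)
            if known.get(path) == mtime:
                continue  # unchanged since the last sync
            with open(path, encoding="utf-8") as f:
                nb = json.load(f)
            # New or updated notebook: replace its rows wholesale
            conn.execute("DELETE FROM nbcells WHERE path = ?", (path,))
            for cell in nb.get("cells", []):
                src = cell.get("source", "")
                text = "".join(src) if isinstance(src, list) else src
                conn.execute(
                    "INSERT INTO nbcells VALUES (?, ?, ?)",
                    (path, cell.get("cell_type"), text),
                )
            conn.execute("INSERT OR REPLACE INTO nbfiles VALUES (?, ?)", (path, mtime))
    # Anything indexed earlier but no longer on disk has been deleted
    for path in set(known) - seen:
        conn.execute("DELETE FROM nbcells WHERE path = ?", (path,))
        conn.execute("DELETE FROM nbfiles WHERE path = ?", (path,))
    conn.commit()
```

Calling `sync_notebooks(conn, ".")` before each search would keep results fresh without any notebook server integration, though it gets slower as the number of notebooks grows.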
PS sqlbiter provides a way of ingesting, and unpacking, Jupyter notebooks into a SQLite database.
PPS Handy Python command line tool for searching notebooks: https://github.com/conery/nbscan
Install it into the TM351 VM from a Jupyter notebook code cell by running the following command when connected to the internet:
!sudo pip install git+https://github.com/conery/nbscan.git
Search for things in notebooks using commands like:
- search in code cells in notebooks in current directory (.) and all child directories for a phrase:
!nbscan.py --dir . --grep 'import pandas' --code
- search in all cells for the word ‘pandas’:
!nbscan.py --dir . --grep pandas
- search in markdown cells for the pattern 'data repr\w*' (that is, the phrase starting data repr…):
!nbscan.py --dir . --grep 'data repr\w*' --markdown
Would be handy to make a simple magic for this?
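Something along these lines might do as a starting point. It is an untested-against-nbscan sketch: it assumes `nbscan.py` is on the path and accepts the `--dir`, `--grep`, `--code` and `--markdown` options used above, and the magic name `%nbscan` and the helper `nbscan_command` are made up for the example.

```python
import shlex
import subprocess


def nbscan_command(line, directory="."):
    """Build the nbscan.py argument list from a magic line such as:
    'import pandas' --code"""
    args = shlex.split(line)
    if not args:
        raise ValueError("usage: %nbscan PATTERN [--code | --markdown]")
    pattern, flags = args[0], args[1:]
    return ["nbscan.py", "--dir", directory, "--grep", pattern] + flags


def nbscan(line):
    """Line magic body: run nbscan.py and print whatever it reports."""
    result = subprocess.run(nbscan_command(line), capture_output=True, text=True)
    print(result.stdout or result.stderr)


# Only register the magic when running inside IPython / Jupyter
try:
    from IPython import get_ipython
    shell = get_ipython()
except ImportError:
    shell = None

if shell is not None:
    shell.register_magic_function(nbscan, magic_kind="line", magic_name="nbscan")
```

After running that in a notebook cell, `%nbscan 'import pandas' --code` should be equivalent to the first search command above.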