More Thoughts On Jupyter Notebook Search

Following on from initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it’s located in a document.

import os
def nbpathwalk(path):
''' Walk down a directory path looking for ipynb notebook files… '''
for path, _, files in os.walk(path):
if '.ipynb_checkpoints' in path: continue
for f in [i for i in files if i.endswith('.ipynb')]:
yield os.path.join(path, f)
import nbformat
def get_cell_contents(nb_fn, c_md=None, cell_typ=None):
''' Extract the content of Jupyter notebook cells. '''
if cell_typ is None: cell_typ=['markdown']
if c_md is None: c_md = []
nb=nbformat.read(nb_fn,nbformat.NO_CONVERT)
_c_md=[i for i in nb.cells if i['cell_type'] in cell_typ]
ix=len(c_md)
for c in _c_md:
c.update( {"ix":str(ix)})
c.update( {"title":nb_fn})
ix = ix+1
c_md = c_md + _c_md
return c_md
import sqlite3
def index_notebooks_sqlite(nbpath='.', outfile='notebooks.sqlite', jsonp=None):
''' Get content from each notebook down a path and index it. '''
conn = sqlite3.connect(outfile)
# Create table
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS nbindex''')
#Enable full text search
c.execute('''CREATE VIRTUAL TABLE IF NOT EXISTS nbindex USING fts4(title text, source text, ix text PRIMARY KEY, cell_type text)''')
c_md=[]
for fn in nbpathwalk(nbpath):
cells = get_cell_contents(fn,c_md, cell_typ=['markdown','code'])
for cell in cells:
# Insert a row of data
c.execute("INSERT INTO nbindex VALUES (?,?,?,?)",(cell['title'],cell['source'],
cell['ix'], cell['cell_type']))
# Save (commit) the changes and close the db connection
conn.commit()
conn.close()
#https://blog.ouseful.info/2015/12/13/n-gram-phrase-based-concordances-in-nltk/
import nltk
def n_concordance_tokenised(text,phrase,left_margin=5,right_margin=5):
''' Token concordance for multiple contiguous tokens. '''
#concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
phraseList=phrase.split(' ')
c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())
#Find the offset for each token in the phrase
offsets=[c.offsets(x) for x in phraseList]
offsets_norm=[]
#For each token in the phraselist, find the offsets and rebase them to the start of the phrase
for i in range(len(phraseList)):
offsets_norm.append([xi for x in offsets[i]])
#We have found the offset of a phrase if the rebased values intersect
#–
# http://stackoverflow.com/a/3852792/454773
#the intersection method takes an arbitrary amount of arguments
#result = set(d[0]).intersection(*d[1:])
#–
intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])
concordance_txt = ([text.tokens[list(map(lambda x: xleft_margin if (xleft_margin)>0 else 0,[offset]))[0]:offset+len(phraseList)+right_margin]
for offset in intersects])
outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
return outputs
def n_concordance(txt,phrase,left_margin=5,right_margin=5):
''' Find text concordance for a phrase. '''
tokens = nltk.word_tokenize(txt)
text = nltk.Text(tokens)
return n_concordance_tokenised(text,phrase,left_margin=left_margin,right_margin=right_margin)

view raw
nb_sqlite_db.py
hosted with ❤ by GitHub

#Generate sqlite db of notebook(s) cell contents
index_notebooks_sqlite('.')
import pandas as pd
# Run query and pull results into a pandas dataframe
with sqlite3.connect('notebooks.sqlite') as conn:
df = pd.read_sql_query("SELECT * from nbindex WHERE source MATCH 'this notebook' LIMIT 10", conn)
#Apply concordance to source column in each row in dataframe
df['source'].apply(n_concordance,args=('this notebook',1,1))

view raw
nb_sqlite_db_demo.py
hosted with ❤ by GitHub

The concordancer means we can offer a results listing more in accordance with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I’d need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display the dumping the contents of a complete cell into the results listing.

The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming that we could apply is not best suited to indexing code.

Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as %load_ext, aren’t recognised; instead, they’re split into separate tokens: % and load_ext.

A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.

PS sqlbiter provides a way of ingesting – and unpacking – JUpyter notebooks into a sqlite database.

PPS Handy Python command line tool for searching notebooks: https://github.com/conery/nbscan

Install it into TM351 VM from a Jupyter notebook code cell by running the following command when connected to the internet:

!sudo pip install git+https://github.com/conery/nbscan.git

Search for things in notebooks using commands like:

  • search in code cells in notebooks in current directory (.) and all child directories for a phrase: !nbscan.py --dir . --grep 'import pandas' --code
  • search in all cells for the word ‘pandas’: !nbscan.py --dir . --grep pandas
  • search in markdown cells for the pattern 'data repr\w*' (that is, the phrase starting data repr…):!nbscan.py --dir . --grep 'data repr\w*' --markdown

Would be handy to make a simple magic for this?

It might also be useful to take nbscan as a quick real time search tool then run results through the concordancer when displaying them?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

One thought on “More Thoughts On Jupyter Notebook Search”

Comments are closed.