It’s coming round to that time of year for updating, testing and freezing (“gold mastering”) the TM351 VM that we distribute to students for the October presentation of our Data Analysis and Management course, so I’ve been thinking about how we can make the VM more useful for students, and whether the things we’re looking at might also be useful in an Institute of Coding context. (I’m on a workpackage looking at infrastructure to support coding education: please get in touch if you’re up for a conversation around such matters :-)
One of the things I’ve been pondering is how to search across notebooks: a lot of the TM351 teaching material is in notebooks and there’s no obvious way of searching over them. (There’s also no obvious way of printing them all out in one go, or saving them to a merged document; I’ll post more about that in a separate post…)
In my sketches for the new VM, I’ve added a simple Python webserver that exposes a homepage that links to the various services running inside the VM. (Ideally, there’d also be indicator lights showing whether the associated Linux service is running or not: anyone know of a simple package to help with that?)
This made me think that it might be useful to provide a simple search tool over the notebooks in the (shared) directory that the VM shares with the host.
One way of doing this might be to put the notebook content into a simple SQLite database and serve it using datasette, or query it via a Scripted Form style UI. SQLite has a full text search extension (FTS3-5) and some support for fuzzy matching (eg spellfix1), although I’m not sure how well it would fare as a code search engine.
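As a rough idea of what the SQLite route might look like, here is a minimal sketch using Python’s built-in sqlite3 module. It assumes the SQLite library bundled with your Python was compiled with the FTS5 extension (most recent builds are, but it isn’t guaranteed). The table name, column names and demo cell content are all made up for illustration.

```python
import sqlite3

def build_fts_index(cells):
    """Load (notebook, cell text) pairs into an in-memory FTS5 table.

    ASSUMPTION: the underlying SQLite build includes the FTS5 extension.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE cells USING fts5(notebook, source)")
    conn.executemany("INSERT INTO cells (notebook, source) VALUES (?, ?)", cells)
    conn.commit()
    return conn

def search_cells(conn, query):
    """Return (notebook, source) rows matching an FTS5 query, best match first."""
    sql = "SELECT notebook, source FROM cells WHERE cells MATCH ? ORDER BY rank"
    return conn.execute(sql, (query,)).fetchall()

# Hypothetical markdown cell content, purely for demonstration
demo_cells = [
    ("01_intro.ipynb", "# Getting started\nHow to load a CSV file with pandas."),
    ("02_sql.ipynb", "## Joins\nUsing an inner join across two tables."),
]
conn = build_fts_index(demo_cells)
hits = search_cells(conn, "join")
```

The default FTS5 tokenizer doesn’t stem, so a search for join matches the word “join” but not “Joins”; whether that is a help or a hindrance for searching teaching material is something that would need testing.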
But I also came across a lightweight Javascript search engine called lunr (“[a] bit like Solr, but much smaller and not as bright”) and an example of How [Matthew Daly] Added Search to [His] Site With Lunr.js, so I thought I’d give that a go…
At the moment, I’m only testing against a couple of notebooks. The search results are at the markdown cell level, so if a cell contains a lot of text, the whole cell will be displayed, which may not be optimal. I’m rendering the cell markdown as HTML in the browser using the Showdown Javascript package, although this could be disabled to show just the raw markdown. My guess is that any relatively linked images embedded in the markdown will show as broken.
The search terms are supposed to be highlighted using mark.js, but while I had it working in a preliminary sketch, it seems to be borked now and I’m not sure whether I’m setting it up incorrectly or using it wrong.
It strikes me that if a markdown cell in the results contains a lot of text, it might be worth trying to identify where in the text the query terms appear and then prune the result text around them.
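A minimal sketch of that sort of pruning, written in Python for convenience (in the search page itself it would need doing in Javascript). The function name and the width parameter are my own invention, and it matches the raw query words rather than the stemmed terms lunr actually searches on, so it may miss matches that lunr finds.

```python
import re

def snippet(text, query, width=60):
    """Return a window of text around the first query word found.

    Falls back to the start of the text if no query word appears verbatim.
    NOTE: this matches raw query words, not lunr's stemmed terms.
    """
    terms = [re.escape(t) for t in query.split() if t]
    match = re.search("|".join(terms), text, flags=re.IGNORECASE) if terms else None
    if match is None:
        return text[:2 * width]
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    # Mark where text has been cut away at either end
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix

long_cell = "a" * 100 + " lunr " + "b" * 100
pruned = snippet(long_cell, "lunr", width=10)
```

Cutting on character offsets like this can slice through words or markdown markup, so a more careful version would probably snap to word or sentence boundaries.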
I’m making no attempt to search code cells, though I did think about trying to extract lines of comment text using a crib along the lines of if LINE.strip().startswith('#').
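Fleshing that crib out a little, something like the following might do as a first pass. It’s a sketch that works on the notebook JSON loaded as a plain Python dict (rather than via nbformat), the function name is hypothetical, and it only catches whole-line comments: trailing comments on a line of code are ignored, and a line inside a multi-line string that happens to start with # would be picked up by mistake.

```python
def comment_lines(nb):
    """Yield (cell index, comment text) for whole-line comments in code cells."""
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores cell source as either a string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            stripped = line.strip()
            if stripped.startswith("#"):
                yield i, stripped.lstrip("#").strip()

# Hypothetical notebook content for demonstration
demo_nb = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "code",
         "source": ["# load the data\n", "df = read()  # inline\n", "    # tidy up\n"]},
    ]
}
comments = list(comment_lines(demo_nb))
```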
I’m generating the lunr index using lunr.py and saving it along with a store of the cell content in a JSON file that’s loaded into the search page. Whilst I’m testing the search page served from a simple Python httpserver, it struck me that it could also be served along a /view path in the Jupyter notebook context. When I first tried this, using JSON data loaded into the search page using JQuery as a JSON object, I got a CORS error. Rather than waste too much time trying to solve that (I wasted a little!) I worked around it instead and loaded my lunr.json search index and store into the page as JSONP.
One thing I need to do is provide an easy to use tool to generate the search index and lookup store from a set of notebooks. (In the TM351 VM context, this would be in the context of the mounted /shared notebooks folder that the notebook server runs at the top of.)
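As a starting point for such a tool, here is a sketch of a command-line style entry point that walks a directory and builds just the lookup store of markdown cells, using only the standard library. The lunr indexing step is deliberately left out, and the function names, argument names and default output filename are all assumptions made for the sake of the example.

```python
import argparse
import json
import os

def find_notebooks(root):
    """Yield paths to .ipynb files under root, skipping checkpoint folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune checkpoint directories in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d != ".ipynb_checkpoints"]
        for name in sorted(filenames):
            if name.endswith(".ipynb"):
                yield os.path.join(dirpath, name)

def build_store(root):
    """Return a dict mapping a string id to the title and text of each markdown cell."""
    store = {}
    for path in find_notebooks(root):
        with open(path, encoding="utf-8") as f:
            nb = json.load(f)
        for cell in nb.get("cells", []):
            if cell.get("cell_type") != "markdown":
                continue
            source = cell.get("source", "")
            if isinstance(source, list):
                source = "".join(source)
            store[str(len(store))] = {"title": os.path.relpath(path, root), "cell": source}
    return store

def main(argv=None):
    """Parse arguments, write the store as JSON and return the number of cells stored."""
    parser = argparse.ArgumentParser(description="Build a notebook markdown cell store")
    parser.add_argument("nbpath", nargs="?", default=".", help="directory to search")
    parser.add_argument("-o", "--outfile", default="store.json", help="output file")
    args = parser.parse_args(argv)
    store = build_store(args.nbpath)
    with open(args.outfile, "w", encoding="utf-8") as f:
        json.dump({"store": store}, f)
    return len(store)
```

Calling main(['/shared', '-o', 'store.json']) would then build the store for the shared folder; wiring main() up to a console script or an if __name__ == '__main__' guard would turn it into a proper command-line tool.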
There still needs to be some clear thinking about what to link to: my initial thought is to link to the notebook running in the VM. If anchors are in the original markdown cell text it should be possible to deeplink to those. It might also be possible to link to an HTML render of the notebook. This could be done via nbconvert (although I am not currently running this as a service in the VM) or perhaps as an in-browser rendering of the .ipynb JSON using something like Notebook.js / nbpreview. (FWIW, I also note react-jupyter.)
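On the deeplinking front, here is a sketch of how links into a running notebook might be constructed. It rests on two assumptions that would need checking against the notebook server actually in use: that the classic notebook UI opens notebooks at a /notebooks/ path, and that it generates heading ids by replacing spaces in the heading text with hyphens. The base URL and function names are hypothetical, and lines starting with # inside fenced code in a markdown cell would be wrongly treated as headings.

```python
import re
from urllib.parse import quote

def heading_anchors(markdown):
    """Return guessed anchor ids for ATX-style (#) headings in a markdown cell.

    ASSUMPTION: the notebook UI builds heading ids by swapping spaces for hyphens.
    """
    anchors = []
    for line in markdown.splitlines():
        m = re.match(r"^\s{0,3}#{1,6}\s+(.*?)\s*#*\s*$", line)
        if m and m.group(1):
            anchors.append(m.group(1).replace(" ", "-"))
    return anchors

def notebook_link(nb_path, anchor=None, base="http://localhost:8888"):
    """Build a URL to a notebook (and optionally a heading) on a notebook server.

    ASSUMPTION: notebooks are opened under the /notebooks/ path of the server.
    """
    url = "{}/notebooks/{}".format(base.rstrip("/"), quote(nb_path))
    if anchor:
        url += "#" + quote(anchor)
    return url

anchors = heading_anchors("# Getting started\nSome text\n## SQL joins ##")
link = notebook_link("Part 01/intro.ipynb", anchors[0])
```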
But if nothing else, this is a thing that can be used and poked around to find out where it’s most painful in use and how it can be improved. A couple of things that immediately come to mind in terms of Jupyter integration, for example:
- Jupyter notebook classic UI could come with a ‘Search notebooks’ tab (and maybe a search indexer running in the background as and when notebooks in scope are saved);
- JupyterLab could be extended with a lunr-based notebook search plugin.
Code for my initial pencil sketch of a lunr Jupyter notebook markdown cell search tool can be found in this gist.
<html><head>
<script type="application/javascript" src="assets/js/jquery-3.3.1.js"></script>
<script type="application/javascript" src="assets/js/showdown.min.js"></script>
<!--
https://markjs.io/
<script type="application/javascript" src="assets/js/jquery.mark.min.js"></script>
-->
<script type="application/javascript" src="assets/js/lunr.js"></script>
<script src="lunr.jsonp"></script>
<link rel="stylesheet" href="assets/css/bootstrap.min.css" />
<style>
ul {margin-bottom:50px;}
ul li{margin-bottom:50px; background-color: #f8f8f8;}
</style>
</head>
<body>
<div class='container' >
<div><img src='assets/images/OU_logo_unofficial.png' alt='OU logo' /></div>
<h1>TM351 Notebook Search</h1><div><input id='search' /></div>
<hr/>
<div><ul id='searchresults' style='list-style-type: none'></ul></div>
<hr/>
<div><em>To refresh the index, …</em></div></div></body><script type="text/javascript">
//https://matthewdaly.co.uk/blog/2015/04/18/how-i-added-search-to-my-site-with-lunr-dot-js/
$(document).ready(function () {
  'use strict';
  // Set up search
  var index, store;
  //I'm importing the lunr.json as JSONP to get around CORS issues
  //$.getJSON('./lunr.json', function (response) {
    // Create index
    index = lunr.Index.load(response.index);
    // Create store
    store = response.store;
    // Handle search
    $('input#search').on('keyup', function () {
      // Get query
      var query = $(this).val();
      // Search for it
      var result = index.search(query);
      // Output it
      var resultdiv = $('ul#searchresults');
      // Keep track of search terms in result
      var terms = new Set();
      if (result.length === 0) {
        // Hide results
        resultdiv.hide();
      } else {
        // Show results
        resultdiv.empty();
        for (var item in result) {
          var ref = result[item].ref;
          var converter = new showdown.Converter();
          var html = converter.makeHtml(store[ref].cell);
          var searchitem = '<li>'+html+'<br/>Link: <a href="' + store[ref].title+ '">' + store[ref].title + '</a></li>';
          //alert(JSON.stringify(result),null,4)
          // Keep track of search terms in result
          //result.forEach(function (item) {
          //  Object.keys(item.matchData.metadata).forEach(function (term) {
          //    terms.add(term)
          //  })
          //})
          resultdiv.append(searchitem);
        }
        //Highlight search terms - was working, now broken?
        //resultdiv.mark(query);
        resultdiv.show();
      }
    });
  //});
});
</script>
</html>
import os
import nbformat
from lunr import lunr
import json

def nbpathwalk(path):
    ''' Walk down a directory path looking for ipynb notebook files… '''
    for path, _, files in os.walk(path):
        if '.ipynb_checkpoints' in path: continue
        for f in [i for i in files if i.endswith('.ipynb')]:
            yield os.path.join(path, f)

def get_md(nb_fn, c_md=None):
    ''' Extract the content of markdown '''
    if c_md is None: c_md = []
    nb=nbformat.read(nb_fn,nbformat.NO_CONVERT)
    _c_md=[i for i in nb.cells if i['cell_type']=='markdown']
    ix=len(c_md)
    for c in _c_md:
        c.update( {"ix":str(ix)})
        c.update( {"title":nb_fn})
        ix = ix+1
    c_md = c_md + _c_md
    return c_md

def index_notebooks(nbpath='.', outfile='lunr.json', jsonp=None):
    ''' Get content from each notebook down a path and index it '''
    c_md=[]
    for fn in nbpathwalk(nbpath):
        c_md = get_md(fn,c_md)
    idx = lunr(ref='ix', fields=('title','source'), documents=c_md)
    #Create a lookup for each md cell
    store = {}
    for c in c_md:
        store[c['ix']]={'title':c['title'],'cell':c['source']}
    out ={'index':idx.serialize(),'store':store}
    with open(outfile, 'w') as f:
        #Provide ability to write JSON or JSONP output file
        if jsonp is None and not outfile.endswith('.jsonp'):
            json.dump(out, f)
        else:
            if jsonp is None:
                jsonp="var response = "
            else:
                jsonp="var {} = ".format(jsonp)
            f.write('{}{}'.format(jsonp,json.dumps(out)))
PS via Grant Nestor on the Jupyter Google group:
grep --include='*.ipynb' --exclude-dir='.ipynb_checkpoints' -rliw . -e 'search query'
This will search your Jupyter server root recursively for files that contain the whole word (case-insensitive) “search query” and only return the file names of matches.
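For anyone without grep to hand (Windows hosts, for example), a rough Python approximation of the same idea might look like the following. It’s a sketch rather than a faithful port: the function name is made up, and Python’s \b word boundary is only an approximation of grep’s -w whole word matching. Like the grep version, it searches the raw notebook JSON, so it will also match text in code cells, outputs and metadata.

```python
import os
import re

def grep_notebooks(root, phrase):
    """Return sorted paths of .ipynb files under root containing phrase.

    Matching is case-insensitive and (approximately) whole word.
    """
    pattern = re.compile(r"\b{}\b".format(re.escape(phrase)), re.IGNORECASE)
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip checkpoint folders, as with grep's --exclude-dir
        dirnames[:] = [d for d in dirnames if d != ".ipynb_checkpoints"]
        for name in filenames:
            if not name.endswith(".ipynb"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                if pattern.search(f.read()):
                    matches.append(path)
    return sorted(matches)
```

Calling grep_notebooks('.', 'search query') from the server root should then return a list of matching notebook filenames.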
re: creating a server extension to provide a search page as a tab in jupyter notebooks, this may be a handy place to start: http://jupyter-notebook.readthedocs.io/en/stable/extending/handlers.html#writing-a-notebook-server-extension The nbgrader extension also implements various tabs.