Fragment: Indexing Local Jupyter Notebooks for Search

It’s been some time since I last explored this (eg here and here), and as far as I know other solutions have appeared since, but a question still remains as to how to effectively search over a set of notebooks.

Partial alternative solutions that may be worth noting include:

  • nbscan for searching over notebooks from the command-line;
  • nbgallery bakes in Solr/sunspot (it’d be really nice if the nbgallery search tools could be easily decoupled so the search could be added to an arbitrary Jupyter notebook, or JupyterHub, server as an extension…);
  • this simple search engine with autocomplete by Simon Willison.

There is also the lunr based search of Jupyter Book (related issue).

One of the things I often wondered about in respect of building a notebook search engine index would be how to crawl / index freshly updated notebooks.

One way would presumably be to regularly crawl the directory path in which notebooks live looking for notebook files that have a changed timestamp compared to the last time they were indexed; another might be to set up some sort of watcher on the operating system that calls the indexer whenever it spots a file being updated (maybe something like fswatch?).
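As a minimal sketch of the timestamp-comparison approach, something like the following might do. The function name changed_notebooks and the last_indexed dictionary (notebook path mapped to the modification time recorded at index time) are my own inventions for illustration; a real indexer would presumably persist that record somewhere.

```python
import os
import tempfile
from pathlib import Path


def changed_notebooks(root, last_indexed):
    """Return (path, mtime) pairs for notebooks modified since last indexed.

    `last_indexed` maps a notebook path (str) to the modification time
    recorded when that notebook was last indexed.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        # Skip Jupyter's autosave copies
        if ".ipynb_checkpoints" in path.parts:
            continue
        mtime = path.stat().st_mtime
        if mtime > last_indexed.get(str(path), 0.0):
            changed.append((str(path), mtime))
    return changed


# Demo on a throwaway directory containing one (empty) notebook file
demo_dir = tempfile.mkdtemp()
nb_path = os.path.join(demo_dir, "demo.ipynb")
Path(nb_path).write_text('{"cells": []}')

seen = {}
first_pass = changed_notebooks(demo_dir, seen)    # new notebook is reported
seen.update(first_pass)                           # record mtimes after "indexing"
second_pass = changed_notebooks(demo_dir, seen)   # nothing has changed

# Simulate a later edit by bumping the file's modification time
path_key, indexed_mtime = first_pass[0]
os.utime(nb_path, (indexed_mtime + 10, indexed_mtime + 10))
third_pass = changed_notebooks(demo_dir, seen)    # notebook is reported again
```

Run on a schedule, this is the "regular crawl" option; a watcher such as fswatch would instead call the indexer as soon as a file changes.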

Another way might be to use something like the pgcontents contents manager to save (or process) notebooks into a search engine index database. (For other examples of Jupyter notebook content managers, see this Tracking Jupyter round-up. I wonder, is there a sqlite content manager that can save notebooks directly into SQLite? Would the pgcontents extension handle that with little or no modification, other than to the supplied database connection string?) If notebooks were saved as notebooks to disk, and into a database for indexing as part of the search engine, how would the indexed notebook also be linked back to the notebook on disk so it could be linked to via search results?
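On the linking-back question, one possibility is simply to store the notebook's path on disk, and the cell position, alongside each piece of indexed text. The sketch below is not how pgcontents or nbgallery do things; the cells table, its column names, and the index_notebook and search helpers are all made up for illustration, and a plain LIKE substring match stands in for a proper full-text index.

```python
import sqlite3


def index_notebook(conn, path, notebook):
    """Store each cell's text alongside the notebook path and cell position."""
    # Drop any rows left over from a previous version of this notebook
    conn.execute("DELETE FROM cells WHERE path = ?", (path,))
    for i, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        if isinstance(source, list):  # on disk, source is often a list of lines
            source = "".join(source)
        conn.execute(
            "INSERT INTO cells (path, cell_index, cell_type, source) "
            "VALUES (?, ?, ?, ?)",
            (path, i, cell.get("cell_type", ""), source),
        )
    conn.commit()


def search(conn, term):
    """Return (path, cell_index) pairs for cells containing `term`."""
    rows = conn.execute(
        "SELECT path, cell_index FROM cells WHERE source LIKE ? "
        "ORDER BY path, cell_index",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (path TEXT, cell_index INTEGER, cell_type TEXT, source TEXT)"
)

# A hypothetical two-cell notebook, as it would look once loaded from JSON
demo_nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Lap times"]},
        {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.DataFrame()"]},
    ]
}
index_notebook(conn, "notebooks/demo.ipynb", demo_nb)
hits = search(conn, "pandas")
```

Each hit carries the path needed to build a link back to the notebook on disk, and the cell index could be used to jump to the matching cell.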

Thinks: how is nbgallery architected? Where are notebooks saved to? How is the Solr search engine index managed?

More generally, I wonder: are there any Python based, simple full-text search engines with local filesystem crawlers/monitors/indexers out there?

Rescuing Python Module Code From Cluttered Jupyter Notebooks

One of the ways I use Jupyter notebooks is to write stream-of-consciousness code.

Whilst the code doesn’t include formal tests (I never got into the swing of test-driven development, partly because I’m forever changing my mind about what I need a particular function to do!) the notebooks do contain a lot of implicit testing as I tried to build up a function (see Programming in Jupyter Notebooks, via the Heavy Metal Umlaut for an example of some of the ways I use notebooks to iterate the development of code either across several cells or within a single cell).

The resulting notebooks tend to be messy in a way that makes it hard to reuse code contained in them easily. In particular, there are lots of parameter setting cells and code fragment cells where I test specific things out, and then there are cells containing functions that pull together separate pieces to perform a particular task.

So for example, in the fragment below, there is a cell where I’m trying something out, a cell where I put that thing into a function, and a cell where I test the function:

My notebooks also tend to include lots of markdown cells where I try to explain what I want to achieve, or things I still need to do. Some of these usefully document completed functions, others are more note form that relate to the development of an idea or act as notes-to-self.

As the notebooks get more cluttered, it gets harder to use them to perform a particular task. I can’t load the notebook into another notebook as a module because as well as loading the functions in, all the directly run code cells will be loaded in and then executed.

Jupytext comes partly to the rescue here. As described in Exploring Jupytext – Creating Simple Python Modules Via a Notebook UI, we can add active-ipynb tags to a cell that instruct Jupytext where code cells should be executable:

In the case of the active-ipynb tag, if we generate a Python file from a notebook using Jupytext, the active-ipynb tagged code cells will be commented out. But that can still make for quite a messy Python file.

Via Marc Wouts comes this alternative solution for using an nbconvert template to strip out code cells that are tagged active-ipynb; I’ve also tweaked the template to omit cell count numbers and only include markdown cells that are tagged docs.

echo """{%- extends 'python.tpl' -%}

{% block in_prompt %}
{% endblock in_prompt %}

{% block markdowncell scoped %}
{%- if \"docs\" in cell.metadata.tags -%}
{{ super() }}
{%- else -%}
{%- endif -%}
{% endblock markdowncell %}

{% block input_group -%}
{%- if \"active-ipynb\" in cell.metadata.tags  -%}
{%- else -%}
{{ super() }}
{%- endif -%}
{% endblock input_group %}""" > clean_py_file.tpl

Running nbconvert using this template over a notebook:

jupyter nbconvert "My Messy Notebook.ipynb" --to script --template clean_py_file.tpl

generates a My Messy Notebook.py file that includes code from cells not tagged as active-ipynb, along with commented out markdown from docs tagged markdown cells, which makes for a much cleaner Python module file.

With this workflow, I can revisit my old messy notebooks, tag the cells appropriately, and recover useful module code from them.

If I only ever generate (and never edit by hand) the module/Python files, then I can maintain the code from the messy notebook as long as I remember to generate the Python file from the notebook via the clean_py_file.tpl template. Ideally, this would be done via a Jupyter content manager hook so that whenever the notebook was saved, as per Jupytext paired files, the clean Python / module file would be automatically generated from it.
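The classic Jupyter notebook server supports a post-save hook on its file contents manager (c.FileContentsManager.post_save_hook), which might be one way of wiring this up. The following is only a sketch: the build_command and post_save names are mine, and it assumes the clean_py_file.tpl template sits in the same directory as the notebook and that the installed nbconvert version accepts this style of template.

```python
import os
import subprocess

TEMPLATE = "clean_py_file.tpl"  # assumed to live alongside the notebook


def build_command(os_path, template=TEMPLATE):
    """Build the nbconvert command that regenerates the clean Python file."""
    return ["jupyter", "nbconvert", os_path,
            "--to", "script", "--template", template]


def post_save(model, os_path, contents_manager, **kwargs):
    """Post-save hook: regenerate the module file whenever a notebook is saved."""
    if model.get("type") != "notebook":
        return None  # ignore plain files and directories
    cmd = build_command(os_path)
    # Run from the notebook's directory so the relative template path resolves
    subprocess.check_call(cmd, cwd=os.path.dirname(os_path) or ".")
    return cmd


# In jupyter_notebook_config.py, where `c` is the config object Jupyter provides:
# c.FileContentsManager.post_save_hook = post_save

demo_cmd = build_command("My Messy Notebook.ipynb")
```

Because the command is passed as a list rather than a shell string, notebook filenames containing spaces need no extra quoting.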

Just by the by, we can load in Python files that contain spaces in the filename as modules into another Python file or notebook using the formulation:

tsl = __import__('TSL Timing Screen Screenshot and Data Grabber')

and then call functions via tsl.myFunction() in the normal way. If running in a Jupyter notebook setting (which may be a notebook UI loaded from a .py file under Jupytext) where the notebook includes the magics:

%load_ext autoreload
%autoreload 2

then whenever a function from a loaded module file is called, the module (and any changes to it since it was last loaded) is reloaded.
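The same trick also seems to work with importlib.import_module, which is the more commonly recommended way of importing a module by name. In the sketch below, the module file is a throwaway one created in a temporary directory, and its name and contents are hypothetical.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module file whose name contains spaces
module_dir = tempfile.mkdtemp()
module_name = "My Messy Notebook"  # hypothetical module name
with open(os.path.join(module_dir, module_name + ".py"), "w") as f:
    f.write("def myFunction():\n    return 'hello from a spaced-out module'\n")

# Make the directory importable, then import the module by its string name
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
messy = importlib.import_module(module_name)

result = messy.myFunction()
```

A plain import statement cannot be used here because a name containing spaces is not a valid Python identifier; passing the name as a string sidesteps that.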

PS thinks… it’d be quite handy to have a simple script that would autotag all notebook cells as active-ipynb; or perhaps just have another template that ignores all code cells not tagged with something like active-py or module-py. That would probably make tag gardening in old notebooks a bit easier…
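A first stab at the autotagging script might look something like the following, which works directly on the notebook JSON rather than going via nbformat. The function names are my own; the sketch assumes the usual notebook layout in which each cell's tags live in a metadata.tags list.

```python
import json


def tag_code_cells(notebook, tag="active-ipynb"):
    """Add `tag` to every code cell that lacks it; return the number changed."""
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)
            changed += 1
    return changed


def tag_notebook_file(path, tag="active-ipynb"):
    """Tag all the code cells in the notebook at `path`, rewriting it in place."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    changed = tag_code_cells(notebook, tag)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)
    return changed


# A hypothetical notebook: one markdown cell, one untagged code cell,
# and one code cell that already carries the tag
demo_nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Notes to self"]},
        {"cell_type": "code", "metadata": {}, "source": ["x = 1"]},
        {"cell_type": "code", "metadata": {"tags": ["active-ipynb"]}, "source": ["x"]},
    ]
}
newly_tagged = tag_code_cells(demo_nb)
```

Having tagged everything, the gardening job is then to remove the tag from the cells that should make it into the module file.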

SQL Murder Mystery, Notebook Style

In passing, I noticed that Simon Willison had posted a datasette mediated version of the Knight Lab SQL Murder Mystery.

The mystery ships as a SQLite database and a clue…

To my mind, a Jupyter notebook provides an ideal medium for playing this sort of game. In between writing queries onto the database, and displaying the responses inline within the notebook, detectives can also use markdown cells to write notes, pull out salient points, formulate hypotheses and write natural language questions that can then be cast into SQLese.

So as an aside for TM351 students that they really don’t need, I put together a notebook that sets the scene for the murder mystery, along the way showing how to create PostgreSQL databases and users, set database permissions on a per user basis, and import the original SQLite database into Postgres.

You can find the notebook here along with a link for how to download and install the TM351 VM yourself if you fancy giving it a spin…

Splitting Strings in pandas Dataframe Columns

A quick note on splitting strings in columns of pandas dataframes.

If we have a column that contains strings that we want to split and from which we want to extract particular split elements, we can use the .str. accessor to call the split function on the string, and then the .str. accessor again to obtain a particular element in the split list.

import pandas as pd

df_str = pd.DataFrame( {'col':['']*3} )
# Split on / and keep the last element
df_str['path'] = df_str['col'].str.split('/').str[-1]
# Split on . and keep the first element
df_str['stub'] = df_str['path'].str.split('.').str[0]
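To see the effect, here is the same pattern applied to a couple of made-up URLs (the example.com addresses are invented purely for illustration):

```python
import pandas as pd

# Hypothetical URLs, invented for illustration
df = pd.DataFrame({"col": [
    "http://example.com/data/report.final.csv",
    "http://example.com/archive/notes.txt",
]})

# Split on "/" and keep the last element: the filename
df["path"] = df["col"].str.split("/").str[-1]
# Split the filename on "." and keep the first element: the stub
df["stub"] = df["path"].str.split(".").str[0]

paths = df["path"].tolist()   # ['report.final.csv', 'notes.txt']
stubs = df["stub"].tolist()   # ['report', 'notes']
```

Note that taking element 0 after splitting on . truncates at the first dot, so report.final.csv gives report rather than report.final.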

Fragment – Quantum Coding in Python

Noting that we now may be in an age of quantum supremacy (original docs possibly available via here), here’s yet more stuff for my “to learn about” list, quantum programming simulators in Python from the big guns:

There’s also:

  • QuTiP — Quantum Toolkit in Python.

Nudging Student Coders into Conforming with the PEP8 Python Style Guide Using Jupyter Notebooks, flake8 and pycodestyle_magic

My code is often a mishmash of styles, although I do try to be internally consistent in style in any given notebook or module. And whilst we had the intention that all the code in our TM351 notebooks would be strongly PEP8 compliant, much of it probably isn’t.

So as we start another presentation of TM351, I think that this year I am going to run the risk of adding even more stuff to the student workload in the form of optional, yet regularly posted, notebook productivity tips.

Whilst these will not directly address any of the module learning outcomes that I can recall offhand, they may help students develop their own code in a more efficient way than they might otherwise, and also present it rather more tidily in assessment material. (The latter can often have the effect of improving a marker’s mood, which in turn may influence the mark awarded…)

So what sorts of thing do I intend to cover?

  • simple debugging strategies for one thing: we don’t really teach, or debug, any formal approaches to debugging, although we do encourage “an interactive line at a time” approach to trying out, and developing, data cleaning, shaping, analysis and visualisation code sequences in the notebooks; however, the Python interactive debugger is available in the notebooks too and I think that providing some simple, and relevant, examples of how to use it once students have developed some familiarity with both the notebooks and the style of coding we are using in the course, may be helpful to some of them;

  • simple notebook extensions for monitoring cell execution state on the one hand, and profiling code execution on the other, are another area that doesn’t directly address the topic matter (coding for data management and analysis), but will provide students with tools that allow them to explore and interrogate their own code in a rather more structured way than they might otherwise;

  • code styling and linting is the first thing I’m going to focus on, however; the intention here is to introduce students to some simple tools and strategies for writing code that conforms to PEP8 style guidelines.

The approach I’m probably going to take is to publish “nudging” posts into the forums once every week or two. Here’s an example of the sort of thing I posted today to introduce the notion of linting and code styling:

Writing Nicely Styled Python Code

In professional code development projects, code is typically written according to a style guide that describes a convention for how to present and lay out the code.

In Python projects, the PEP8 style guide defines one such widely followed convention (the code we have provided in the notebooks tends towards PEP8 compliance… Each time we revisit a notebook, we try to tighten it up a bit further!).

Several tools are available for use with Jupyter notebooks that support the creation of PEP8 conformant code. The attached notebook provides instruction on how to install and enable one such tool, pycodestyle_magic, which can provide warnings about when your code style diverges from PEP8 conventions.

The notebook describes how to configure your VM to automatically load pycodestyle_magic and, if required, automatically enable it, in each of your notebooks.

The output of the magic takes the form of a report at the bottom of each code cell identifying any stylistic errors in a particular code cell each time that code cell is run:


alt-text: Example of pink warning message area listing PEP8 style guide contraventions generated via pycodestyle_magic

Each line of the report takes the form:


You can toggle line numbers on and off in a code cell by clicking in the code cell and using the keyboard shortcut: ESC-l

You are not required to install the extension, or even write PEP8 compliant code. However, you may find that it does help make your code readable, and that with practice you soon start to write code that does not raise many PEP8 errors.

(Note that some error reports could do with disabling, such as D100; the extension treats each code cell as a Python module, which is conventionally started with a triple double quoted (sic) comment string (eg """My Module."""). The magic does not currently support ignoring specific errors.)

The notebook itself can be found here: Notebook Code Linting.ipynb.

Fragment: Keeping an Eye on What’s Trackable, Where, and When — Tools for Data Protection Officers as well as the Rest of Us?

Way back when, in the early days of FOI and then “open data”, I naively believed that open data and FOI contact points in organisations would act as advocates for us outside the organisation getting access to information from inside the organisation. The reality seems to be that as appointees and employees of the organisation, those individuals instead become gatekeepers and often seem to act to find ways of defending the organisation against such requests rather than trying to open the organisation up to them.

When it comes to those appointed to oversee data protection and data privacy issues, I would like to think that whoever is appointed to such a role sees it as the role of an advocate for those who work for or come into contact with the organisation, as well as providing an opportunity to aggressively defend the rights of those outside the organisation against the unnecessary and disproportionate collection, processing and sharing of data about them by the organisation. That said, I suspect in many cases the role is more about trying to make sure the company doesn’t get sued under GDPR.

Whilst it would also be nice to think that the data protection person is a geek w/ skillz who can hack their way around an organisation’s systems and websites, poking around to find things that shouldn’t be there and demonstrating how other things can be potentially misused, I suspect they aren’t.

So do we need tools for such officers to keep tabs on their organisation, or perhaps tools to help privacy advocates provide oversight of them?

Poking around traffic generated as I visited the OU VLE a week or two ago, I saw a couple of requests I thought were unnecessary and raised an internal query about them. But it also got me thinking…

The requests appear to be made from tags loaded into the web page using the Google Tag Manager. The Google Tag Manager code appears to be delivered via a gtm.js script with the structure:

  {
    "resource": {
      "version": "XXX",
      "macros": [ {} ],
      "tags": [ {} ],
      "predicates": [ {} ],
      "rules": [ {} ]
    },
    "runtime": [ [], [] ]
  }

followed by a chunk of Javascript code.

The gtm.js file includes rules of the form [["if",1,31],["unless",34,35],["add",51]] that appear to index into the predicates list in the conditional part (logically or’d tests?) and then add a particular tag, which may reference a macro, when the condition is met.
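If that reading of the rules is right, a first step towards a static analyser might be to resolve the indices in each rule against the predicates and tags lists. The sketch below does just that and nothing more: the describe_rules name is mine, the predicate and tag values are placeholder strings rather than real GTM structures, and it makes no claim about how the conditions are actually combined when a rule is evaluated.

```python
PREDICATE_CLAUSES = {"if", "unless"}  # clause types assumed to index into predicates
TAG_CLAUSES = {"add"}                 # clause types assumed to index into tags


def describe_rules(resource):
    """Replace the indices in each rule with the items they appear to point at."""
    reports = []
    for rule in resource["rules"]:
        report = {}
        for clause in rule:
            kind, indices = clause[0], clause[1:]
            if kind in PREDICATE_CLAUSES:
                resolved = [resource["predicates"][i] for i in indices]
            elif kind in TAG_CLAUSES:
                resolved = [resource["tags"][i] for i in indices]
            else:
                resolved = list(indices)  # unknown clause type: keep raw indices
            report.setdefault(kind, []).extend(resolved)
        reports.append(report)
    return reports


# Placeholder data: real predicates and tags are dictionaries, not strings
demo_resource = {
    "predicates": ["predicate 0", "predicate 1", "predicate 2"],
    "tags": ["tag 0", "tag 1"],
    "rules": [[["if", 0, 1], ["unless", 2], ["add", 1]]],
}
demo_report = describe_rules(demo_resource)
```

A human readable report could then be generated by pretty printing each resolved rule, and by following any macro references found in the tags.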

Predicates take the form:


Tags can take a variety of forms, including:

      "vtp_html":"\n\u003Cscript type=\"text\/gtmscript\"\u003E!function(b,e,f,g,a,c,d){b.fbq||(a=b.fbq=function(){a.callMethod?a.callMethod.apply(a,arguments):a.queue.push(arguments)},b._fbq||(b._fbq=a),a.push=a,a.loaded=!0,a.version=\"2.0\",a.queue=[],c=e.createElement(f),c.async=!0,c.src=g,d=e.getElementsByTagName(f)[0],d.parentNode.insertBefore(c,d))}(window,document,\"script\",\"https:\/\/\/en_US\/fbevents.js\");fbq(\"init\",\"870490019710405\");fbq(\"track\",\"PageView\");\u003C\/script\u003E\n\u003Cnoscript\u003E\n\u003Cimg height=\"1\" width=\"1\" src=\"https:\/\/\/tr?id=870490019710405\u0026amp;ev=PageView\n\u0026amp;noscript=1\"\u003E\n\u003C\/noscript\u003E\n\n\n",

And macros take the form:


So what I’m wondering is: is there an offline, static analyser for gtm.js scripts that would allow someone to point to a website from which a gtm.js script can be downloaded, and that then lets them generate human readable reports that:

  • identify in general which trackers are loaded by which rules on which events with what arguments; and
  • identify which trackers are loaded by which rules on which events with what arguments for a specific URL.

This would then allow a university data protection officer, for example, or a student, to provide a URL, such as a URL into the VLE, and get a simple, statically generated report back that shows what trackers are loaded when visiting that environment.

Which is simpler than running Ghostery or opening developer tools in a wide open by default browser like Chrome, rather than the rather more privacy defending Firefox, for example, and searching the network logs for incriminating evidence.

Google Tag Manager has been around for some time, and I’m assuming that organisational web folk have read each line of code in the gtm.js they load into users’ browsers to make sure that it’s not doing anything untoward. (That everyone else uses it is no excuse, unless perhaps it meets some sort of international software quality standard such that folk can just embed it without looking at it.)

So I’m wondering:

  • is there a line by line annotated version of the code at the bottom of the gtm.js script anywhere?
  • are there line by line examples out there of a simple gtm.js script and how to read it / analyse it (so eg walking through: this rule says this, which adds that tag, which is then parsed this way?)
  • are there static gtm.js analysers out there that generate the static reports suggested above and that allow folk to analyse arbitrary gtm.js scripts that are loaded into their browser in many of the sites they visit?

So for example, here’s a blog post that describes, line by line, how the Google tag manager container snippet that webmasters embed in their webpages runs so as to load in the gtm.js script: The Container Snippet: GTM Secrets Revealed. What I want is something similar for the gtm.js script…

PS it seems that GTM Spy [h/t Simo Ahava] helps browse the tags loaded in from a Google Tag Manager container. For example, here’s a look at code associated with a tag loaded via GTM by a particular org:

and here’s the trigger condition associated with it (I notice that after a single, naive pass of my image blurrer, I can still read the contains text…):

This is a good start, but the navigation falls short of being usable by a ‘never-reads-the-manual’ such as myself. For example, looking at one of the triggers on another tab in the GTM Spy view:

I see the tag ID identified but no obvious way of finding out what tag that relates to, or what parameters / data might be returned via that tag?