Fragment: On Reproducible Open Educational Resources

Via O’Reilly’s daily Four Short Links feed, I notice the Open Logic Project, “a collection of teaching materials on mathematical logic aimed at a non-mathematical audience, intended for use in advanced logic courses as taught in many philosophy departments”. In particular, “it is open-source: you can download the LaTeX code [from Github]”.

However:

the TeX source means you need a (La)TeX environment to compile it (though the project does bundle some of the custom .sty style files you need in the repo, which is handy).

Compare this with the Simple Maths Equations and Notation notebook I’ve started sketching as part of a self-started, informal “reproducible OERs with Jupyter notebooks” project I’m dabbling with:

Here, a Jupyter notebook contains LaTeX code that can then be rendered (in part?) through the notebook previewer – at least in so far as expressions are written in MathJax parseable code – and also within a live / running Jupyter notebook. Not only do I share the reproducible source code (as a notebook), I also share a link to at least one environment capable of running it, and that allows it to be reused with modification. (Okay, in this case, not openly so, because you have to have an Azure Notebooks account. But the notebook could equally run on Binderhub or a local install, perhaps with one or two additional requirements if you don't already run a scientific Python environment.)

In short, for a reproducible OER that supports reuse with modification, sharing the means of production also means sharing the machinery of production.

To simplify the user experience, the notebook environment can be preinstalled with the packages needed to render a wider range of TeX code, such as drawings rendered using TikZ. Alternatively, code cells can be populated with package installation commands to customise a more vanilla environment, as I do in several demo notebooks:
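For example, a cell at the top of a notebook might install the extra TeX tooling itself – a sketch only, since the exact package name (here, texlive-pictures, which bundles TikZ/PGF on Debian/Ubuntu based images) and whether you have sudo access will depend on the environment:

!sudo apt-get update && sudo apt-get install -y texlive-pictures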

What the Open Logic Project highlights is that reproducible OERs not only provide ready access to the “source code” of a resource so that it can be easily reused with modification, but that access to an open environment capable of processing that source code and rendering the output document also needs to be provided. (Open / reproducible science researchers have known this for some time…)

Getting a TeX/LaTeX environment up and running can be a faff – and can also take up a lot of disk space – so the runtime environment requirements are not negligible.

In the case of Jupyter notebooks, LaTeX support is available, and container images capable of running on Binderhub, for example, are relatively easily defined (see for example the Binder LaTeX example). (I'm not sure how rich Stencila's LaTeX support is, and/or whether it requires an external LaTeX environment when running the Stencila desktop app?)

It also strikes me that another thing we should be doing is exporting a copy of the finished work, e.g. as a PDF or a complete, self-standing HTML archive, in case the machinery does break. This is also important where third party services are called. It may actually make sense to use something like requests for all third party URL requests, and save a cached version of all requests (using requests-cache) to provide a local copy of whatever it was that was called when the document was originally generated.
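A minimal requests-cache sketch might look like the following (the cache name is made up; install_cache() transparently routes subsequent requests calls through a local SQLite cache file):

import requests
import requests_cache

#Cache every HTTP request made while the document is being generated
requests_cache.install_cache('oer_http_cache')

#The first run hits the network; re-running the document is then served from the local cache
r = requests.get('https://api.example.com/data')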

See also: OER Methods – Generative Designs for Reuse-With-Modification

An Easier Approach to Electrical Circuit Diagram Generation – lcapy

Whilst I might succumb in my crazier evangelical moments to the idea that academic authors (other than those who speak LaTeX natively) and media developers might engage in the raw circuitikz authoring described in Reproducible Diagram Generators – First Thoughts on Electrical Circuit Diagrams, the reality is that it's probably just way too clunky, and a little bit too far removed from the everyday diagrams educators are likely to want to create, to get much, if any, take-up.

However, understanding something of the capabilities of quite low level drawing packages, and reflecting (as in the last post) on some of the strategies we might adopt for creating reusable, maintainable, extensible diagram scripts that support revision-with-modification, stands us in good stead for looking out for more usable approaches.

One such example is the Python lcapy package, a linear circuit analysis package that supports:

  • the description of simple electrical circuits at a slightly higher level than the raw circuitikz circuit creation model;
  • the rendering of the circuits, with a few layout cues, using circuitikz;
  • numerical analysis of the circuits in terms of response in time and frequency domains, and the charting of the results of the analysis; and
  • various forms of symbolic analysis of circuit descriptions in various domains.

Here are some quick examples to give a taste of what’s possible.

You can run the notebook (albeit subject to significant changes) that contains the original working for examples used in this post on Binderhub.

Here’s a simple circuit:

And here’s how we can create it in lcapy from a netlist, annotated with cues for the underlying circuitikz generator about how to lay out the diagram.

from lcapy import Circuit

cct = Circuit()
cct.add("""
Vi 1 0_1 step 20; down
C 1 2; right, size=1.5
R 2 0; down
W 0_1 0; right
W 0 0_2; right, size=0.5
P1 2_2 0_2; down
W 2 2_2;right, size=0.5""")

cct.draw(style='american')

The things to the right of the semicolon on each line are the optional layout elements – they’re not required when defining the actual circuit itself.

The display of nodes and node numbers is controllable, and the symbol styles are selectable between american, british and european stylings.

The lcapy/schematic.py package describes the various stylings as composites of circuitikz regionalisations, and could be easily extended to support a named house style, or perhaps accommodate a regionalisation passed in as an explicit argument value.

if style == 'american':
    style_args = 'american currents, american voltages'
elif style == 'british':
    style_args = 'american currents, european voltages'
elif style == 'european':
    style_args = ('european currents, european voltages, european inductors, european resistors')
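For example, a named house style – or a raw pass-through of an explicit circuitikz argument string – might be hacked in along the following (purely hypothetical) lines:

elif style == 'ou_house':
    #Hypothetical named house style defined as a composite of circuitikz regionalisations
    style_args = 'american currents, european voltages, european resistors'
else:
    #Or treat any other value as an explicit circuitikz style argument string, passed straight through
    style_args = style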

As well as constructing circuits from netlist descriptions, we can also create them from network style descriptions:

from lcapy import R, C, L

cct2= (R(1e6) + L(2e-3)) | C(3e-6)
#For some reason, the style= argument is not respected
cct2.draw()

The diagrams generated from networks are open linear circuits rather than loops, which may not be quite what we want. But these circuits are quicker to write, so we can use them to draft netlists for us that we may then want to tidy up a bit further.

print(cct2.netlist())

'''
W 1 2; right=0.5
W 2 4; up=0.4
W 3 5; up=0.4
R 4 6 1000000.0; right
W 6 7; right=0.5
L 7 5 0.002; right
W 2 8; down=0.4
W 3 9; down=0.4
C 8 9 3e-06; right
W 3 0; right=0.5
'''

Circuit descriptions can also be loaded in from a named text file, which is handy for course material maintenance as well as reuse of circuits across materials: it’s easy enough to imagine a library of circuit descriptions.

#Create a file containing a circuit netlist
sch='''
Vi 1 0_1 {sin(t)}; down
R1 1 2 22e3; right, size=1.5
R2 2 0 1e3; down
P1 2_2 0_2; down, v=V_{o}
W 2 2_2; right, size=1.5
W 0_1 0; right
W 0 0_2; right
'''

fn="voltageDivider.sch"
with open(fn, "w") as text_file:
    text_file.write(sch)

#Create a circuit from a netlist file
cct = Circuit(fn)

The ability to create – and share – circuit diagrams in a Python context that plays nicely with Jupyter notebooks is handy, but the lcapy approach becomes really useful if we want to produce other assets around the circuit we’ve just created.

For example, in the case of the above circuit, how do the various voltage levels across the resistors respond when we switch on the sinusoidal source?

import numpy as np
t = np.linspace(0, 5, 1000)
vr = cct.R2.v.evaluate(t)
from matplotlib.pyplot import figure, savefig
fig = figure()
ax = fig.add_subplot(111, title='Resistor R2 voltage')
ax.plot(t, vr, linewidth=2)
ax.plot(t, cct.Vi.v.evaluate(t), linewidth=2, color='red')
ax.plot(t, cct.R1.v.evaluate(t), linewidth=2, color='green')
ax.set_xlabel('Time (s)')
ax.set_ylabel('Resistor voltage (V)');

Not the best example, admittedly, but you get the idea!

Here’s another example, where I’ve created a simple interactive to let me see the effect of changing one of the component values on the response of a circuit to a step input:

(The nice plotting of the diagram gets messed up unfortunately, at least in the way I’ve set things up for this example…)

As the code below shows, the @interact decorator from ipywidgets makes it trivial to create a set of interactive controls based around the arguments passed into a function:

import numpy as np
from matplotlib.pyplot import figure, savefig
from ipywidgets import interact
from lcapy import Circuit

@interact(R=(1,10,1))
def response(R=1):
    cct = Circuit()

    cct.add('V 0_1 0 step 10;down')
    cct.add('L 0_1 0_2 1e-3;right')
    cct.add('C 0_2 1 1e-4;right')
    cct.add('R 1 0_4 {R};down'.format(R=R))
    cct.add('W 0_4 0; left')

    t = np.linspace(0, 0.01, 1000)
    vr = cct.R.v.evaluate(t)

    fig = figure()
    #Note that we can add Greek symbols from LaTeX into the figure text
    ax = fig.add_subplot(111, title=r'Resistor voltage (R={}$\Omega$)'.format(R))
    ax.plot(t, vr, linewidth=2)
    ax.set_xlabel('Time (s)')
    ax.set_ylabel('Resistor voltage (V)')
    ax.grid(True)
    
    cct.draw()

Using the network description of a circuit, it only takes a couple of lines to define a circuit and then get its transient response to a step function:

Again, it doesn’t take much more effort to create an interactive that lets us select component values and explore the effect they have on the damping:

As well as the numerical analysis, lcapy also supports a range of symbolic analysis functions. For example, given a parallel resistor circuit, defined using a network description, we can find the overall resistance in simplest terms:

Or for parallel capacitors:
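Roughly speaking – and assuming I'm remembering the network simplify() method correctly (check the lcapy docs for the current API) – the calls look something like this:

from lcapy import R, C

#Assumed API: networks compose in parallel with | and can be reduced with simplify()
(R('R_1') | R('R_2')).simplify()   # equivalent single resistor, R_1*R_2/(R_1 + R_2)
(C('C_1') | C('C_2')).simplify()   # equivalent single capacitor, C_1 + C_2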

Some other elementary transformations we can apply – providing expressions for an input voltage in the time or Laplace/s domain:

We can also create pole-zero plots quite straightforwardly, directly from an expression in the s-domain:

This is just a quick skim through of some of what’s possible with lcapy. So how and why might it be useful as part of a reproducible educational resource production process?

One reason is that several of the functions can reduce the “production distance” between different likely components of a set of educational materials.

For example, given a particular circuit description as a netlist, we can annotate it with direction cribs in order to generate a visual circuit diagram, and we can use a circuit created from it directly (or from the direction annotated script) to generate time or frequency response charts. (We can also obtain symbolic transfer functions.)

When trying to plot things like pole zero charts, where it is important that the chart matches a particular s-domain expression, we can guarantee that the chart is correct by deriving it directly from the s-domain expression, and then rendering that expression in pretty LaTeX equation form in the materials.

The ability to simplify expressions  – as in the example of the simplified expressions for overall capacitance or resistance in the parallel circuit examples above – directly from a circuit description whilst at the same time using that circuit description to render the circuit diagram, also reduces the amount of separation between those two outputs to zero – they are both generated from the self-same source item.

You can run the notebook (albeit subject to significant changes) that contains the original working for examples used in this post on Binderhub.

Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy

Over the last couple of weeks I’ve been circling, but failing to make much actual progress on using, OpenStack as a platform for making self-serve OU hosted VMs available to students. (I’m increasingly starting to think this is not sensible, but I’m struggling to find someone I can chat to about it… OpenStack is too enterprise, like a heavy “Java” thing where I need a just works “Python” thing…).

Anyway.

One of the issues with the OU Faculty OpenStack setup is the way the security model locks everything down. Not only is no API access available, there is also a limit on IP address allocation and open ports are limited to port 80 (and maybe port 22? Or maybe not.)

For the TM351 VM – which is what we’re looking to put onto OU OpenStack – we have been exposing services on at least two http ports, one for the Jupyter notebooks and one for OpenRefine. (The latest build also has a simple VM webserver, and I’m experimenting with a notebook search engine. Optionally, we have also allowed students to open up ports to the PostgreSQL and MongoDB services.)

If I do find a sensible way to get the VM running on OpenStack, finding a way to shove all the http services through port 80 looks like a necessary requirement. Previously, I’d noticed that @betatim’s openrefineder demo made use of a proxy to expose the OpenRefine service via the Jupyter notebook port, and looking at it again today I noticed that the nbopenrefineproxy package it was using is available as a Jupyterhub project package: jupyterhub/nbserverproxy.

In the current TM351 VM set-up, we have the following:

  • Jupyter notebook on guest port 8888, host port 35180
  • OpenRefine on guest port 3334, host port 35181

However, if I install and enable nbserverproxy, and restart the Jupyter notebook server, I can now find OpenRefine proxied as http://localhost:35180/proxy/3334/ as well as on http://localhost:35181.
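(For reference, getting the proxy in place is, as far as I recall, just a couple of commands along these lines – check the nbserverproxy README for the current incantation:)

pip install nbserverproxy
jupyter serverextension enable --py nbserverproxy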

One gotcha to note is that the OpenRefine page doesn’t render properly from that URL without the trailing slash because the OpenRefine HTML includes relative links to assets:

...
<link type="text/css" rel="stylesheet" href="externals/select2/select2.css" />
 <link type="text/css" rel="stylesheet" href="externals/tablesorter/theme.blue.css" />
...

which resolve as e.g. http://localhost:35180/proxy/externals/select2/select2.css (404).

However, with the trailing slash added, the links do resolve correctly (e.g. as http://localhost:35180/proxy/3334/externals/select2/select2.css).

Handy… and the way to go if we do get this running on OpenStack.

PS if you know of a baby steps tutorial that shows how I can build a custom VM image on a Mac that I can upload to OpenStack, please let me know via the comments. Or otherwise get in touch if you can talk me through the various approaches.

First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.

From the same repo root, you can get:

And from previously, here’s a workaround for displaying R/HTMLwidgets in a Jupyter notebook.

OpenRefine is also available from a simple URL – https://mybinder.org/v2/gh/betatim/openrefineder/master?urlpath=openrefine – courtesy of betatim/openrefineder:

Perhaps it’s time for me to try to get my head round what the Jupyter notebook proxy handlers are doing…

PS see also Scripted Forms for a simple markdown script way of building interactive Jupyter widget powered UIs.

More Thoughts On Jupyter Notebook Search

Following on from initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it’s located in a document.


import os

def nbpathwalk(path):
    ''' Walk down a directory path looking for ipynb notebook files… '''
    for path, _, files in os.walk(path):
        if '.ipynb_checkpoints' in path: continue
        for f in [i for i in files if i.endswith('.ipynb')]:
            yield os.path.join(path, f)

import nbformat

def get_cell_contents(nb_fn, c_md=None, cell_typ=None):
    ''' Extract the content of Jupyter notebook cells. '''
    if cell_typ is None: cell_typ=['markdown']
    if c_md is None: c_md = []
    nb=nbformat.read(nb_fn,nbformat.NO_CONVERT)
    _c_md=[i for i in nb.cells if i['cell_type'] in cell_typ]
    ix=len(c_md)
    for c in _c_md:
        c.update( {"ix":str(ix)})
        c.update( {"title":nb_fn})
        ix = ix+1
    c_md = c_md + _c_md
    return c_md

import sqlite3

def index_notebooks_sqlite(nbpath='.', outfile='notebooks.sqlite', jsonp=None):
    ''' Get content from each notebook down a path and index it. '''
    conn = sqlite3.connect(outfile)
    # Create table
    c = conn.cursor()
    c.execute('''DROP TABLE IF EXISTS nbindex''')
    #Enable full text search
    c.execute('''CREATE VIRTUAL TABLE IF NOT EXISTS nbindex USING fts4(title text, source text, ix text PRIMARY KEY, cell_type text)''')
    c_md=[]
    for fn in nbpathwalk(nbpath):
        cells = get_cell_contents(fn,c_md, cell_typ=['markdown','code'])
        for cell in cells:
            # Insert a row of data
            c.execute("INSERT INTO nbindex VALUES (?,?,?,?)",(cell['title'],cell['source'],
                                                              cell['ix'], cell['cell_type']))
    # Save (commit) the changes and close the db connection
    conn.commit()
    conn.close()

#https://blog.ouseful.info/2015/12/13/n-gram-phrase-based-concordances-in-nltk/
import nltk

def n_concordance_tokenised(text,phrase,left_margin=5,right_margin=5):
    ''' Token concordance for multiple contiguous tokens. '''
    #concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList=phrase.split(' ')
    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())
    #Find the offsets for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    # http://stackoverflow.com/a/3852792/454773
    #the intersection method takes an arbitrary amount of arguments
    #result = set(d[0]).intersection(*d[1:])
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])
    concordance_txt = ([text.tokens[list(map(lambda x: x - left_margin if (x - left_margin) > 0 else 0, [offset]))[0]:offset + len(phraseList) + right_margin]
                        for offset in intersects])
    outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt,phrase,left_margin=5,right_margin=5):
    ''' Find text concordance for a phrase. '''
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
    return n_concordance_tokenised(text,phrase,left_margin=left_margin,right_margin=right_margin)



#Generate sqlite db of notebook(s) cell contents
index_notebooks_sqlite('.')

import pandas as pd

# Run query and pull results into a pandas dataframe
with sqlite3.connect('notebooks.sqlite') as conn:
    df = pd.read_sql_query("SELECT * from nbindex WHERE source MATCH 'this notebook' LIMIT 10", conn)

#Apply concordance to source column in each row in dataframe
df['source'].apply(n_concordance, args=('this notebook', 1, 1))

The concordancer means we can offer a results listing more in accordance with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I'd need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display than dumping the contents of a complete cell into the results listing.

The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming that we could apply is not best suited to indexing code.

Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as %load_ext, aren’t recognised; instead, they’re split into separate tokens: % and load_ext.

A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.

PS sqlbiter provides a way of ingesting – and unpacking – Jupyter notebooks into a sqlite database.

PPS Handy Python command line tool for searching notebooks: https://github.com/conery/nbscan

Install it into TM351 VM from a Jupyter notebook code cell by running the following command when connected to the internet:

!sudo pip install git+https://github.com/conery/nbscan.git

Search for things in notebooks using commands like:

  • search in code cells in notebooks in current directory (.) and all child directories for a phrase: !nbscan.py --dir . --grep 'import pandas' --code
  • search in all cells for the word ‘pandas’: !nbscan.py --dir . --grep pandas
  • search in markdown cells for the pattern 'data repr\w*' (that is, the phrase starting data repr…): !nbscan.py --dir . --grep 'data repr\w*' --markdown

Would be handy to make a simple magic for this?
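Something like the following (untested) line magic sketch, which just shells out to nbscan.py with whatever arguments it's given, might be a start:

from IPython.core.magic import register_line_magic

@register_line_magic
def nbsearch(line):
    '''Hypothetical %nbsearch magic: pass the line straight through to nbscan.py as arguments.'''
    get_ipython().system('nbscan.py --dir . --grep {}'.format(line))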

It might also be useful to take nbscan as a quick real time search tool then run results through the concordancer when displaying them?

Jigsaw Pieces – Linux Service Indicators, Jupyter Kernel Monitoring and Environment Management

Something I've been pondering for some time is how to set up some simple Linux service monitoring so that I can display an indicator light in a web page to show whether a Linux service is running or not.

For example, in the TM351 VM, it could be handy to display some indicator lights in a Jupyter notebook status bar showing whether the database services we connect to from the notebooks are running correctly.

So here are some pieces that may contribute to that:

My thinking is:

  • use monit to monitor a process; if the process is down, write to a service status file in my www server directory, eg service_servicename_status.txt. If a service is running the contents of this file are 1, otherwise 0 (a monit sketch for this is given after this list);
  • use the JQuery fragment to poll the status file every few seconds;
  • if the status file returns 0, display a red indicator, otherwise green.
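For the monit piece of that first bullet, the config stanza might look something like this – a sketch only: the service name, file paths and in particular the exec / "else if succeeded" recovery syntax need checking against the monit docs:

check process postgresql matching "postgres"
  start program = "/usr/sbin/service postgresql start"
  stop program = "/usr/sbin/service postgresql stop"
  if does not exist then exec "/bin/bash -c 'echo 0 > /var/www/html/service_postgresql_status.txt'"
  else if succeeded then exec "/bin/bash -c 'echo 1 > /var/www/html/service_postgresql_status.txt'"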

Here are some other monitoring / environment managing fragments I’m pondering:

  • something like ps_mem, a Python utility *to accurately report the in core memory usage for a program*. I'm wondering if I could use that to track how much memory each Jupyter notebook python kernel is taking up (or maybe monit can do that?) There's an old extension that looks like it shows reports: nbtop. Or perhaps use psutil (via this issue, which seems to offer a solution?);
  • a minimal example of setting up a notebook homepage tab for a hello world webpage; Writing a notebook server extension looks like it has the ingredients, and nb_conda provides a fuller working example. Actually, that extension looks useful for *Jupyter-as-a-learning-environment* because it lets you select different conda environments, which could be handy for running different activities.

Any other examples out there of Jupyter monitoring / environment management?

Interactive Authoring Environments for Reproducible Media: Stencila

One of the problems associated with keeping up with tech is that a lot of things that “make sense” are not the result of the introduction or availability of a new tool or application in and of itself, but in the way that it might make a new combination of tools possible that support a complete end to end workflow or that can be used to reengineer (a large part of) an existing workflow.

In the OU, it’s probably fair to say that the document workflow associated with creating course materials has its issues. I’m still keen to explore how a Jupyter notebook or Rmd workflow would work, particularly if the authored documents included recipes for embedded media objects such as diagrams, items retrieved from a third party API, or rendered from a source representation or recipe.

One “obvious” problem is that the Jupyter notebook or RStudio Rmd editor is “too hard” to work with (that is, it’s not Word).

A few days ago I saw a tweet mentioning the use of Stencila with Binderhub. Stencila? Apparently, "[a]n open source office suite for reproducible research". From the blurb:

[T]oday’s tools for reproducible research can be intimidating – especially if you’re not a coder. Stencila make reproducible research more accessible with the intuitive word processor and spreadsheet interfaces that you and your colleagues are already used to.

That sounds appropriate… It’s available as a desktop app, but courtesy of minrk/jupyter-dar (I think?), it runs on binderhub and can be accessed via a browser too:

 

You can try it here.

As with Jupyter notebooks, you can edit and run code cells, as well as authoring text. But the UI is smoother than in Jupyter notebooks.

(This is one of the things I don’t understand about colleagues’ attitude towards emerging tech projects: they look at today’s UX and think that’s it, because that’s how it is inside an organisation – you take what you’re given and it stays the same for decades. In a living project, stuff tends to get better if it’s being used and there are issues with it…)

The Jupyter-Dar strapline pitches “Jupyter + DAR compatibility exploration for running Stencila on binder”. Hmm. DAR? That’s also new to me:

Dar stands for (Reproducible) Document Archive and specifies a virtual file format that holds multiple digital documents, complete with images and other assets. A Dar consists of a manifest file (manifest.xml) that describes the contents.

Dar is being designed for storing reproducible research publications, but the underlying concepts are suitable for any kind of digital publications that can be bundled together with their assets.

Repo: [substance/dar](https://github.com/substance/dar)

Sounds interesting. And which reminds me: how’s OpenCreate coming along, I wonder? (My permissions appear to have been revoked again; or the URL has changed.)

PS seems like there’s more activity in the “pure web” notebook application world. Hot on the heels of Mike Bostock’s Observable notebooks (rationale) comes iodide, “[a] frictionless portable notebook-style interface for literate scientific computing in the browser” (examples).

I don’t know if these things just require you to use Javascript, or whether they can also embed things like Brython.

I’m not sure I fully get the js/browser notebooks yet? I like the richer extensibility of things like Jupyter in terms of arbitrary language/kernel availability, though I suppose the web notebooks might be able to hook into other kernels using similar mechanics to those used by things like Thebelab?

I guess one advantage is that you can do stuff on a Chromebook, and without a network connection if you cache all the required JS packages locally? Although with new ChromeOS offering support for Linux – and hence, Docker containers – natively, Chromebooks could get a whole lot more exciting over the next few months. From what I can tell, crosvm looks like a ChromeOS native equivalent to something like Virtualbox (with an equivalent of Guest Additions?). It'll be interesting to see how well things like audio work. Reports suggest that graphical UIs will work, presumably using some sort of native X11 support rather than noVNC, so now could be a good time to start looking out for a souped up Pixelbook…

PS March 2019 – Stencila desktop appears to have stalled for some time. As it’s built on the Texture wordprocessor / editor, it may end up as a plugin for that…

PPS June 2021 – have things rebooted again for Stencila? https://elifesciences.org/labs/a04d2b80/announcing-the-next-phase-of-executable-research-articles

Keeping Up With OpenRefine – Database Connections

It’s been a few months since I last checked out updates to OpenRefine, but reading a (completed) phase 1 project plan associated with some funding the OpenRefine Foundation received from Google News Labs it looks like database support is on the cards.

Database Table import/export – COMPLETED

Historically, OpenRefine has been limited compared to other data tools in that it does not have a way to connect to a database table. This is especially useful at export time, when there is a need to save a cleaned CSV for example into a database table. Importing from a database is useful also. It can help to join clean data in a database table against messy data in OpenRefine, in order to clean and prepare it for use. Database Drivers exist for many databases such as Oracle, MySQL, Postgres, and even many schema-less databases such as MongoDB. Most database drivers use JDBC which makes it easier for us to develop against, and others typically use a custom Java driver that sometimes is non-trivial to integrate with. Since OpenRefine is built with Java this should be relatively straightforward to utilize existing JDBC drivers for our import/export operations and for support of MongoDB there is a Java driver available.

Looking through the repo, it looks like there are a couple of related PRs:

I’m not sure about the export to a db?

The tests suggest drivers are in place for PostgreSQL, MySQL and MariaDB:

public class DatabaseTestConfig extends DBExtensionTests {

private DatabaseConfiguration mysqlDbConfig;
private DatabaseConfiguration pgsqlDbConfig;
private DatabaseConfiguration mariadbDbConfig;

It also looks like an upgrade to the internal data representation may be being considered: Research Apache Arrow to improve in-memory data model. FWIW, I think Apache Arrow really is one to watch.

Via the OpenRefine Google Group, I also noticed a couple of references to future planned activity / roadmap items:

Phase 2

Front / Backend separation

Scope: completely separating the backend so that an full API can be exposed for all OpenRefine operations and commands. Once the decoupling done, we can move to a modern front end framework and
Deliverable: Functional and documented API covering all the commands available in OpenRefine 3 front end.

Phase 3
R Lang support
Work with community to bring support for R lang via an extension.
https://github.com/OpenRefine/OpenRefine/issues/1226
There is significant use of statistics within News Organizations where the goal of minimizing the back and forth between R tooling and OpenRefine would be explored and assessed by the community.

rrefine is around and needs investigation – https://github.com/vpnagraj/rrefine

Hmmm… rrefine?

rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby. rrefine extends this functionality to R users.

Okay – that makes me think of the OpenRefine Python Client Library?

But how about that Edit cells > Transform > Language support for R #1226 issue? "This is a feature-request to add R support in Edit cells > Transform > Language."

That fits in with an earlier thought I had along the lines of “what if OpenRefine was a Jupyter client?” In an imagining frame of mind, this seems to me to offer a couple of potential benefits:

  • if the Transform > Language utility supports hooks into a Jupyter kernel and exposes an executable code cell onto that (state persisting) kernel, and the data can be transferred efficiently using serialisations like feather or deeper hooks into Apache Arrow representations that might be supported in R or Python pandas, then any language with a Jupyter kernel could be used for transformations?
  • if OpenRefine was exposed as a panel in Jupyterlab, which it presumably could be simply by embedding the HTML UI in an IFrame, then it could have a role as part of the look and feel of a single working environment, even if it was only loading and saving CSV files into the environment workspace.

But then let's imagine something a bit more extreme (I'm not sure if / how this might fit into the Jupyterlab architecture, indeed whether it's possible or just imagined magic, I'm just riffing…): if the data being manipulated within OpenRefine could be synched with a representation of the data being manipulated elsewhere in the Jupyterlab environment, then we could be viewing a dataset in one panel (Jupyterlab has crazy efficient support for viewing large datafiles), manipulating it in an OpenRefine panel, and running analysis scripts over it in a third. The reticulate package suddenly comes to mind here as an example of accessing data objects from one environment in another.

It also strikes me that use cases of the data represented in OpenRefine reflecting updates to the data from the analysis environment are less likely. The analysis should be operating on data after it has been cleaned, rather than passing it to OpenRefine?

PS by the by, if you want to run OpenRefine using the Jupyter ecosystem Binderhub machinery, here’s a proof of concept from @betatim: openrefineder.

Generating Printable MS Word Versions of Merged Jupyter Notebooks

One of the issues we know students have with the Jupyter notebooks that we provide as part of the course is that there is no straightforward way of printing them all out for offscreen reading / annotation. (As well as code, there is a certain amount of practical and code related explanatory material in the notebooks.)

One of the things I started to doodle with last year was a simple script to merge several notebooks and then render the result as a Microsoft Word doc. This has a dependency on pandoc, though not LaTeX, and requires that the conversion takes place via HTML: ipynb is converted to HTML using nbconvert, then from HTML to docx. If there are image files transcluded into the notebook, this also means that the pandoc conversion process needs to be executed in the same directory as the notebook so that the image paths are correctly recognised. (When running nbconvert with the html_embed output, pandoc fell over.)

Having to run pandoc in a local, image path respecting directory is a pain because it means I can't run it over a merged notebook file composed of notebooks from multiple directories. Which means that I have to generate a separate docx file for the notebooks in each separate directory. Whilst I could move these into the same directory to make accessing them all a bit easier, it still means students have to print out multiple documents. I did try using a python package to merge the Word docs, but it borked on the images.

There are Python packages that can merge PDF documents in a more reliable way, but I am having issues with getting a sensible PDF workflow together. In the first case, for pandoc to render documents to PDF seems to require the texlive-xetex package, which adds considerable weight to the VM (and I don't know the dependency voodoo required to get a minimum viable LaTeX distribution in place). In the second, my test notebooks included a pymarkdown inline element that embedded a pandas dataframe in a markdown cell and this seemed to break the pandoc PDF conversion at that point.

One thing I haven’t done yet is look at customising the output templates so that we can brand the exported documents. For this, I need to look at custom templates.

My initial sketch code for the ‘export merged notebooks in a directory as docx’ routine is available via this gist. One thing I need to do is wrap it in a simple CLI command. Comments / suggestions for improvement, or links to better alternatives, more than welcome!


#https://stackoverflow.com/a/3207973/454773
from nbformat.v4 import new_notebook, new_markdown_cell
import nbformat
import io
import os
import subprocess
import random
import string
#from PyPDF2 import PdfFileMerger, PdfFileReader

def merged_notebooks_in_dir(dirpath, filenames):
    ''' Merge all notebooks in a directory into a single notebook '''
    fns = ['{}/{}'.format(dirpath, fn) for fn in filenames if '.ipynb_checkpoints' not in dirpath and fn.endswith('.ipynb')]
    if fns:
        merged = new_notebook()
        #Identify directory containing merged notebooks
        cell = '\n\n\n\n# {}\n\n\n\n'.format(dirpath)
        merged.cells.append(new_markdown_cell(cell))
    else:
        return
    for fn in fns:
        #print(fn)
        notebook_name = fn.split('/')[1]
        with io.open(fn, 'r', encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
            #Identify filename of notebook
            cell = '\n\n\n\n# {}\n\n\n\n'.format(fn)
            merged.cells.append(new_markdown_cell(cell))
            merged.cells.extend(nb.cells)
    if not hasattr(merged.metadata, 'name'):
        merged.metadata.name = ''
    merged.metadata.name += "_merged"
    return nbformat.writes(merged)

def merged_notebooks_down_path(path, typ='docx', execute=False):
    ''' Walk a path, creating an output file in each directory that merges all notebooks in the directory '''
    for (dirpath, dirnames, filenames) in os.walk(path):
        if '.ipynb_checkpoints' in dirpath: continue
        #Should we run the execute processor here on each notebook separately,
        # ensuring that images are embedded, and then merge the executed notebook files?
        merged_nb = merged_notebooks_in_dir(dirpath, filenames)
        if not merged_nb: continue
        fn = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
        with open('{}/{}.ipynbx'.format(dirpath, fn), 'w') as f:
            f.write(merged_nb)
        # Execute the merged notebook in its directory so that images are correctly handled
        # Using html_embed seems to cause pandoc to fall over?
        # The pdf conversion requires installation of texlive-xetex and inkscape
        # This adds significant weight to the VM: maybe we need an MT/production VM and a student build?
        # Inline code execution generated using python-markdown extension seems to break PDF generation
        # at the first instance of inline code? Need to add a preprocessor?
        # We could maybe process the notebook inline rather than via the commandline
        # In such a case, the following may be a useful reference:
        #https://github.com/ipython-contrib/jupyter_contrib_nbextensions/blob/master/docs/source/exporting.rst
        execute = ' --ExecutePreprocessor.timeout=600 --ExecutePreprocessor.allow_errors=True --execute' if execute else ''
        if typ == 'pdf':
            cmd = 'jupyter nbconvert --to pdf {exe} "{fn}".ipynbx'.format(exe=execute, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
        elif typ in ['docx']:
            cmd = 'jupyter nbconvert --to html {exe} "{fn}".ipynbx'.format(exe=execute, fn=fn)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            cmd = 'pandoc -s "{fn_out}".html -o _merged_notebooks.{typ}'.format(fn_out=fn, typ=typ)
            subprocess.check_call(cmd, shell=True, cwd=dirpath)
            os.remove("{}/{}.html".format(dirpath, fn))
            os.remove("{}/{}.ipynbx".format(dirpath, fn))
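As a starting point for that CLI wrapper, an (untested) argparse shim over merged_notebooks_down_path() would probably be enough:

#Hypothetical command line wrapper sketch for the routine above
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Merge the notebooks in each directory down a path and export a document per directory.')
    parser.add_argument('path', nargs='?', default='.', help='root directory to walk')
    parser.add_argument('--typ', choices=['docx', 'pdf'], default='docx', help='output document format')
    parser.add_argument('--execute', action='store_true', help='execute notebooks as part of the conversion')
    args = parser.parse_args()
    merged_notebooks_down_path(args.path, typ=args.typ, execute=args.execute)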

Initial Sketch – Searching Jupyter Notebooks Using lunr

Coming round as it is to that time of year for updating, testing and freezing/”gold mastering” the TM351 VM that we distribute to students for the October presentation of our Data Analysis and Management course,  I’ve been thinking about how we can make the VM more useful for students, and whether the things we’re looking at might also be useful in an Institute of Coding context (I’m on a workpackage looking at infrastructure to support coding education: please get in touch if you’re up for a conversation around such matters:-)

One of the things I’ve been pondering is how to search across notebooks – a lot of the TM351 teaching material is in notebooks and there’s no obvious way of searching over them. (There’s also no obvious way of printing them all out in one go, or saving them to a merged document – I’ll post more about that in separate post…)

In my sketches for the new VM, I’ve added a simple python webserver that exposes a homepage that links to the various services running inside the VM. (Ideally, there’d also be indicator lights showing whether the associated Linux service is running or no: anyone know of a simple package to help with that?)

This made me think that it might be useful to provide simple search tool over the notebooks in the (shared) directory that the VM shares with the host.

One way of doing this might be to put the notebook content into a simple sqlite database and serve it using datasette, or query it via a Scripted Form style UI. SQLite has a full text search extension (FTS3-5) and some support for fuzzy matching (eg spellfix1), although I'm not sure how well it would fare as a code search engine.
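(For what it's worth, serving such a database with datasette should just be a matter of pip installing datasette and pointing it at the file, e.g. datasette notebooks.sqlite.)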

But I also came across a lightweight Javascript search engine called lunr – "[a] bit like Solr, but much smaller and not as bright" – and an example of How [Matthew Daly] Added Search to [His] Site With Lunr.js, so I thought I'd give that a go…

At the moment, I’m only testing against a couple of notebooks. The search results are at the markdown cell level, so if a cell contains a lot of text, the whole cell will be displayed, which may not be optimal. I’m rendering the cell markdown as HTML in the browser using the Showdown Javascript package although this could be disabled to show just the raw markdown. My guess is that any relatively linked images embedded in the markdown will show as broken.

The search terms are supposed to be highlighted using mark.js, but while I had it working in a preliminary sketch, it seems to be borked now and I’m not sure where I’m setting it up incorrectly or using it wrong.

It strikes me that if a markdown cell in the results contains a lot of text, it might be worth trying to identify where in the text the query terms appear and then prune the result text around them.

I’m making no attempt to search code cells, though I did think about trying to extract lines of comment text using a crib along the lines of if LINE.strip().startswith('#').
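That crib might look something like the following sketch, reusing nbformat much as the indexing code below does:

import nbformat

def code_cell_comments(nb_fn):
    ''' Sketch: pull comment-only lines out of the code cells of a notebook '''
    nb = nbformat.read(nb_fn, nbformat.NO_CONVERT)
    comments = []
    for cell in nb.cells:
        if cell['cell_type'] != 'code':
            continue
        for line in cell['source'].splitlines():
            #Keep lines that are solely comments
            if line.strip().startswith('#'):
                comments.append(line.strip())
    return comments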

I'm generating the lunr index using lunr.py and saving it along with a store of the cell content in a JSON file that's loaded into the search page. Whilst I'm testing the search page served from a simple Python httpserver, it struck me that it could also be served along a /view path in the Jupyter notebook context. When I first tried this, using JSON data loaded in to the search page using JQuery as a JSON object, I got a CORS error. Rather than waste too much time trying to solve that (I wasted a little!) I worked around it instead and loaded my lunr.json search index and store in to the page as JSONP instead.

One thing I need to do is provide an easy to use tool to generate the search index and lookup store from a set of notebooks. (In the TM351 VM context, this would be in the context of the mounted /shared notebooks folder that the notebook server runs at the top of.)

There still needs to be some clear thinking about what to link to – my initial thought is to link to the notebook running in the VM. If anchors are in the original markdown cell text it should be possible to deeplink to those. It might also be possible to link to an HTML render of the notebook. This could be done via nbconvert (although I am not currently running this as a service in the VM) or perhaps as an in-browser rendering of the .ipynb JSON using something like Notebook.js / nbpreview. (FWIW, I also note react-jupyter).

But if nothing else, this is a thing that can be used and poked around to find out where it’s most painful in use and how it can be improved. A couple of things that immediately come to mind in terms of Jupyter integration, for example:

  • the Jupyter notebook classic UI could come with a 'Search notebooks' tab, and maybe a search indexer running in the background as and when notebooks in scope are saved;
  • JupyterLab could be extended with a lunr based notebook search plugin.

Code for my initial pencil sketch of a lunr Jupyter notebook markdown cell search tool can be found in this gist.


<html><head>
<script type="application/javascript" src="assets/js/jquery-3.3.1.js"></script>
<script type="application/javascript" src="assets/js/showdown.min.js"></script>
<!--
https://markjs.io/
<script type="application/javascript" src="assets/js/jquery.mark.min.js"></script>
-->
<script type="application/javascript" src="assets/js/lunr.js"></script>
<script src="lunr.jsonp"></script>
<link rel="stylesheet" href="assets/css/bootstrap.min.css" />
<style>
ul {margin-bottom:50px;}
ul li {margin-bottom:50px; background-color: #f8f8f8;}
</style>
</head>
<body>
<div class='container'>
<div><img src='assets/images/OU_logo_unofficial.png' alt='OU logo' /></div>
<h1>TM351 Notebook Search</h1><div><input id='search' /></div>
<hr/>
<div><ul id='searchresults' style='list-style-type: none'></ul></div>
<hr/>
<div><em>To refresh the index, …</em></div></div></body><script type="text/javascript">
//https://matthewdaly.co.uk/blog/2015/04/18/how-i-added-search-to-my-site-with-lunr-dot-js/
$(document).ready(function () {
    'use strict';
    // Set up search
    var index, store;
    // I'm importing the lunr.json as JSONP to get around CORS issues
    //$.getJSON('./lunr.json', function (response) {
    // Create index
    index = lunr.Index.load(response.index);
    // Create store
    store = response.store;
    // Handle search
    $('input#search').on('keyup', function () {
        // Get query
        var query = $(this).val();
        // Search for it
        var result = index.search(query);
        // Output it
        var resultdiv = $('ul#searchresults');
        // Keep track of search terms in result
        var terms = new Set();
        if (result.length === 0) {
            // Hide results
            resultdiv.hide();
        } else {
            // Show results
            resultdiv.empty();
            for (var item in result) {
                var ref = result[item].ref;
                var converter = new showdown.Converter();
                var html = converter.makeHtml(store[ref].cell);
                var searchitem = '<li>' + html + '<br/>Link: <a href="' + store[ref].title + '">' + store[ref].title + '</a></li>';
                //alert(JSON.stringify(result), null, 4)
                // Keep track of search terms in result
                //result.forEach(function (item) {
                //    Object.keys(item.matchData.metadata).forEach(function (term) {
                //        terms.add(term)
                //    })
                //})
                resultdiv.append(searchitem);
            }
            // Highlight search terms – was working, now broken?
            //resultdiv.mark(query);
            resultdiv.show();
        }
    });
    //});
});
</script>
</html>



import os
import nbformat
from lunr import lunr
import json

def nbpathwalk(path):
    ''' Walk down a directory path looking for ipynb notebook files… '''
    for path, _, files in os.walk(path):
        if '.ipynb_checkpoints' in path: continue
        for f in [i for i in files if i.endswith('.ipynb')]:
            yield os.path.join(path, f)

def get_md(nb_fn, c_md=None):
    ''' Extract the content of markdown '''
    if c_md is None: c_md = []
    nb=nbformat.read(nb_fn,nbformat.NO_CONVERT)
    _c_md=[i for i in nb.cells if i['cell_type']=='markdown']
    ix=len(c_md)
    for c in _c_md:
        c.update( {"ix":str(ix)})
        c.update( {"title":nb_fn})
        ix = ix+1
    c_md = c_md + _c_md
    return c_md

def index_notebooks(nbpath='.', outfile='lunr.json', jsonp=None):
    ''' Get content from each notebook down a path and index it '''
    c_md=[]
    for fn in nbpathwalk(nbpath):
        c_md = get_md(fn, c_md)
    idx = lunr(ref='ix', fields=('title','source'), documents=c_md)
    #Create a lookup for each md cell
    store = {}
    for c in c_md:
        store[c['ix']]={'title':c['title'],'cell':c['source']}
    out = {'index':idx.serialize(),'store':store}
    with open(outfile, 'w') as f:
        #Provide ability to write JSON or JSONP output file
        if jsonp is None and not outfile.endswith('.jsonp'):
            json.dump(out, f)
        else:
            if jsonp is None:
                jsonp = "var response = "
            else:
                jsonp = "var {} = ".format(jsonp)
            f.write('{}{}'.format(jsonp, json.dumps(out)))


PS via Grant Nestor on the Jupyter Google group:

grep --include='*.ipynb' --exclude-dir='.ipynb_checkpoints' -rliw . -e 'search query'

This will search your Jupyter server root recursively for files that contain the whole word (case-insensitive) “search query” and only return the file names of matches.

More info: https://stackoverflow.com/questions/16956810/how-do-i-find-all-files-containing-specific-text-on-linux