Jupyter Notebooks as Part of a Publishing System – “Executable” Inline Maths and Music Notations

One of the books I’m reading at the moment is Michael Hiltzik’s Dealers of Lightning: Xerox PARC and the Dawn of the Computer Age (my copy is second hand, ex-library stock…). PARC was the birthplace of ethernet and the laser printer, as well as many of the computer user interactions we take for granted today. One thing I hadn’t fully appreciated was Xerox’s interest in publishing systems, which is in part what put me in mind of this post. The chapter I just finished reading tells of their invention of a modeless, WYSIWYG word processor, something that would be less hostile than the mode based editors of the time. (I like the joke about accidentally entering command mode and typing edit – e: select entire document, d: delete selection, i: insert, t: the letter inserted. Oops – you just replaced your document with the letter t.)

It must have been a tremendously exciting time there, having to invent the tools you wanted to use because they didn’t exist yet (some may say that’s still the case, but in a different way now, I think: we have many more building blocks at our disposal). But it’s still an exciting time, because while a lot of stuff has been invented, whether or not there is more to come, there are still ways of figuring out how to make it work more easily, still ways of figuring out how to work the technology into our workflows in a more sensible way, and still many, many ways of trying to figure out how to use different bits of tech in combination with each other in order to get what feels like much more than we might reasonably expect from considering them as a set of separate parts, piled together.

One of the places this exploration could – should – take place is in education. Whilst at HE we often talk down tools in place of concepts, introducing new tools to students provides one way of exporting ideas embodied as tools into wider society. Tools like Jupyter notebooks, for example.

The more I use Jupyter notebooks, the more I see their potential as a powerful general purpose tool, not just for reproducible research, but also as a general purpose computational workbench and as a powerful authoring medium.

Enlightened publishers such as O’Reilly seem to have got on board with using interactive notebooks in a publishing context (for example, Embracing Jupyter Notebooks at O’Reilly) and colleges such as Bryn Mawr in the US keep coming up with all manner of interesting ways of using notebooks in a course context – if you know of other great (or even not so great) use case examples in publishing or education, please let me know via the comments to this post – but I still get the feeling that many other people don’t get it.

“Initially the reaction to the concept [of the Gypsy, GUI powered wordprocessor that was to become part of the Ginn publishing system] was ‘You’re going to have to drag me kicking and screaming,'” Mott recalled. “But everyone who sat in front of that system and used it, to a person, was a convert within an hour.”
Michael Hiltzik, Dealers of Lightning: Xerox PARC and the Dawn of the Computer Age, p210

For example, in writing computing related documents, the ability to show a line of code and the output of that code, automatically generated by executing the code, and then automatically inserted into the document, means that when writing code examples, “helpful corrections” by an over-zealous editor go out of the window. The human hand should go nowhere near the output text.

week_3_exercise_notebook

Similarly when creating charts from data, or plotting equations: the charts should be created from the data or the equation by running a script over a source dataset, or plotting an equation directly.

week_3_exercise_notebook2

Again, the editor, or artist, should have no hand in “tweaking” the output to make it look better.

If the chart needs restyling, the artist needs to learn how to use a theme (like this?!) or theme generator rather than messing around with a graphics package (wrong sort of graphic). To add annotations, again, use code because it makes the graphic more maintainable.

supreme_annotations_-_moar_splainin_here__http___rud_is_b_2016_03_16_supreme-annotations__-_note__this_requires_the_github_version_of_ggplot2

We can also use various off-the-shelf libraries to generate HTML/Javascript fragments for creating inline interactives that can be embedded within the notebook, or saved and then reused elsewhere.

simpleMapDemo.png

There are also several toolkits around for creating other sorts of diagram from code, as I’ve written about previously, such as the tools provided on blockdiag.com:

sample_diagrams__packetdiag_-_blockdiag_1_0_documentation

Aside from making diagrams more easily maintainable, and letting them be rendered inline within a Jupyter notebook that also contains the programmatic “source code” for the diagram, written diagrams also provide a way in to the automatic generation of figure longdesc text.

Electrical circuit schematics can also be written and embedded in a Jupyter notebook, as this Schemdraw example shows:

cdelker_bitbucket_org_schemdraw_html

So far, I haven’t found an example of a schematic plotting library that also allows you to simulate the behaviour of the circuit from the same definition though (eg I can’t simulate(d, …) in the above example, though I could presumably parameterise a circuit definition for a simulation package and use the same parameter values to label a corresponding Schemdraw circuit).

There are some notations that are “executable”, though. For example, the sympy (symbolic Python) package lets you write texts using python variables that can be rendered either as a symbol using mathematical notation, or by their value.

sympydemo1

(There’s a rendering bug in the generated Mathjax in the notebook I was using – I think this has been corrected in more recent versions.)

We can also use interactive widgets to help us identify and set parameter values to generate the sort of example we want:

sympydemo2

Sympy also provides support for a wide range of calculations. For example, we can “write” a formula, render it using mathematical notation, and then evaluate it. A Jupyter notebook plugin (not shown) allows python statements to be included and executed inline, which means that expressions and calculations can be included – and evaluated – inline. Changing the parameters in an example is then easy to achieve, with the added benefit that the guaranteed correct result of automatically evaluating the modified expression can also be inlined.

sympdemo3

(For interactive examples, see the notebooks in the sympy folder here; the notebooks are also runnable by launching a mybinder container – click on the launch:binder button to fire one up.) 
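The write, render, evaluate pattern described above might look something like the following minimal sketch. The kinetic energy formula and the parameter values are made up for illustration, and are not taken from the notebooks linked above; the sketch assumes sympy is installed.

```python
from sympy import Rational, latex, symbols

# Hypothetical example formula: kinetic energy, E = 1/2 m v^2
m, v = symbols("m v", positive=True)
energy = Rational(1, 2) * m * v**2

# "Render": the LaTeX source a notebook would hand to MathJax for display
latex_source = latex(energy)

# "Evaluate": substitute concrete parameter values into the same expression
value = energy.subs({m: 2, v: 3})

print(latex_source)
print(value)
```

Because the displayed notation and the evaluated result both come from the same expression object, changing the parameter values changes the worked answer without anyone retyping it.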

It looks like there are also tools out there for converting from LaTeX math expressions to sympy equivalents.

As well as writing mathematical expressions that can be both expressed using mathematical notation, and evaluated as a mathematical expression, we can also write music, expressing a score in notational form or creating an admittedly beepy audio file corresponding to it.

midimusic8

(For an interactive example, run the midiMusic.ipynb notebook by clicking through on the launch:binder button from here.)

We can also generate audio files from formulae (I haven’t tried this in a sympy context yet, though) and then visualise them as data.

audio6

Packages such as librosa also seem to provide all sorts of tools for analysing and visualising audio files.

When we put together the Learn to Code MOOC for FutureLearn, which uses Jupyter notebooks as an interactive exercise environment for learners, we started writing the materials (web pages for the FutureLearn teaching text, notebooks for the interactive exercises) in Jupyter notebooks. The notebooks can export as markdown, and the FutureLearn publishing system is based around content entered as markdown, so we should have been able to publish direct from the notebooks to FutureLearn, right? Wrong. The workflow doesn’t support it: the editor takes content in Microsoft Word, passes it back to authors for correction, then someone does something to turn it into markdown for FutureLearn. Or at least, that’s the OU’s publishing route (which has plenty of other quirks too…).

Or perhaps will be was the OU’s publishing route, because there’s a project on internally (the workshops around which I haven’t been able to make, unfortunately) to look at new authoring environments for producing OU content, though I’m not sure if this is intended to feed into the backend of the current route – Microsoft Word, Oxygen XML editor, OU-XML, HTML/PDF etc output – or envisages a different pathway to final output. I started to explore using Google docs as an OU XML exporter, but that raised little interest – it’ll be interesting to see what sort of authoring environment(s) the current project delivers.

(By the by, I remember being really excited about the OU-XML publishing system route when it was being developed, not least because I could imagine its potential for feeding other use cases, some of which I started to explore a few years later; I was less enthused by its actual execution and the lack of imagination around putting it to work though… I also thought we might be able to use FutureLearn as a route to exploring how we might not just experiment with workflows and publishing systems, but also the tech – and business models around the same – for supporting stateful and stateless interactive, online student activities. Like hosting a mybinder style service, for example, or embedded interactions like the O’Reilly Thebe demo, or even delivering a course as a set of linked Jupyter notebooks. You can probably guess how successful that’s been…)

So could Jupyter notebooks have a role to play in producing semi-automated content (automated, for example, in the production of graphical objects and the embedding of automatically evaluated expressions)? Markdown export is already supported and it shouldn’t take someone too long (should it?!) to put together an nbformat exporter that could generate OU-XML (if that is still the route we’re going?). It’d be interesting to hear how O’Reilly are getting on…

Whatever, again…

Datadive Reproducibility – Time for a DataBox?

Whilst at the Global Witness “Beneficial Ownership” datadive a couple of weeks ago, one of the things I was pondering – how to make the weekend’s discoveries reproducible on the one hand, useful as a set of still working legacy tooling on the other – blended into another: how to provide an on-ramp for folk attending the event who were not familiar with the data or the way in which it was provided.

Event facilitators DataKind worked in advance with Global Witness to produce an orientation exercise based around a sample dataset. Several other prepped datasets were also made available via USB memory sticks distributed as required to the three different working groups.

The orientation exercise was framed as a series of questions applied to a core dataset, a denormalised flat 250MB or so CSV file containing just over a million or so rows, with headers. (I think Excel could cope with this – not sure if that was by design or happy accident.)

For data wranglers expert at working with raw datafiles and their own computers, this doesn’t present much of a problem. My gut reaction was to open the datafile into a pandas dataframe in a Jupyter notebook and twiddle with it there; but as pandas holds dataframes in memory, this may not be the best approach, particularly if you have multiple large dataframes open at the same time. As previously mentioned, I think the data also fit into Excel okay.

Another approach after previewing the data, even if just by looking at it on the command line with a head command, was to load the data into a database and look at it from there.

This immediately raises several questions of course – if I have a database set up on my machine and import the data without thinking about it, how can someone else recreate that? If I don’t have a database on my machine (so I need to install one and get it running) and/or I don’t then know how to get data into the database, I’m no better off. (It may well be that there are great analysts who know how to work with data stored in databases but don’t know how to do the data engineering stuff of getting the database up and running and populated with data in the first place.)

My preferred solution for this at the moment is to see whether Docker containers can help. And in this case, I think they can. I’d already had a couple of quick plays looking at getting the Companies House significant ownership data into various databases (Mongo, neo4j) and used a recipe that linked a database container with a Jupyter notebook server that I could write my analysis scripts in (linking RStudio rather than Jupyter notebooks is just as straightforward).

Using those patterns, it was easy enough to create a similar recipe to link a Postgres database container to a Jupyter notebook server. The next step – loading the data in. Now it just so happens that in the days before the datadive, I’d been putting together some revised notebooks for an OU course on data management and analysis that dealt with quick ways of loading data into a Postgres database, so I wondered whether those notes provided enough scaffolding to help me load the sample core data into a database: a) even if I was new to working with databases, and b) in a reproducible way. The short answer was “yes”. Putting the two steps together, the results can be found here: Getting started – Database Loader Notebook.
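To give a flavour of the "load a CSV file into a queryable database, reproducibly" step, here is a minimal sketch. Note that it swaps in Python's built-in sqlite3 for the Postgres container used in the recipe, so that it runs with no server at all; the table name, column names and rows are invented stand-ins, not the datadive data.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the large denormalised CSV file (with headers).
SAMPLE_CSV = """company_number,company_name,owner_name
00000001,Example Widgets Ltd,A. Person
00000002,Sample Holdings Ltd,B. Person
00000003,Demo Trading Ltd,A. Person
"""


def load_csv(connection, table, csv_file):
    """Create `table` from the CSV header row, then bulk insert the other rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute(f'CREATE TABLE "{table}" ({columns})')
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    connection.commit()


conn = sqlite3.connect(":memory:")  # swap for a file path to persist the database
load_csv(conn, "ownership", io.StringIO(SAMPLE_CSV))

# The data is now immediately queryable with SQL.
owner_counts = conn.execute(
    "SELECT owner_name, COUNT(*) FROM ownership GROUP BY owner_name ORDER BY owner_name"
).fetchall()
print(owner_counts)
```

Every column is loaded as text here, which is crude but in keeping with a minimal, not properly normalised, starting point; casting and indexing can be added once you know what questions you want to ask.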

With the data in a reproducibly shareable and “live” queryable form, I put together a notebook that worked through the orientation exercises. Along the way, I found a new-to-me HTML5/d3js package for displaying small interactive network diagrams, visjs2jupyter. My attempt at the orientation exercises can be found here: Orientation Activities.

Whilst I am all in favour of expert datawranglers using their own recipes, tools and methods for working with the data – that’s part of the point of these expert datadives – I think there may also be mileage in providing a base install where the data is in some sort of immediately queryable form, such as in a minimal, even if not properly normalised, database. This means that datasets too large to be manipulated in memory or loaded into Excel can be worked with immediately. It also means that orientation materials can be produced that pose interesting questions that can be used to get a quick overview of the data, or tutorial materials produced that show how to work with off-the-shelf powertool combinations (Jupyter notebooks / Python/pandas / PostgreSQL, for example, or RStudio /R /PostgreSQL ).

Providing a base set up to start from also acts as an invitation to extend that environment in a reproducible way over the course of the datadive. (When working on your own computer with your own tooling, it can be way too easy to forget what packages (apt-get, pip and so on) you have pre-installed that will cause breaking changes to any outcome code you share with others who do not have the same environment. Creating a fresh environment for the datadive, and documenting what you add to it, can help with that, but testing in a linked, but otherwise isolated, container context really helps you keep track of what you needed to add to make things work!)

If you also keep track of what you needed to do to handle undeclared file encodings, weird separator characters, or password protected zip files from the provided files, it means that others should be able to work with the files in a reliable way…

(Just a note on that point for datadive organisers – metadata about file encodings, unusual zip formats, weird separator encodings etc is a useful thing to share, rather than have to painfully discover….)
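One way of keeping track of that sort of discovery is to do the discovering in code. The following is a rough sketch only: the helper names, the candidate encodings and the sample bytes are all assumptions for illustration, and a real file may well need a longer list of things to try.

```python
import csv
import io


def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; return the text and the encoding that worked."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")


def guess_delimiter(header_line, candidates=(",", ";", "\t", "|")):
    """Pick the candidate separator that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# Hypothetical file contents: semicolon separated and saved as Windows-1252.
raw = "name;town\nJosé;Málaga\n".encode("cp1252")

text, encoding = decode_with_fallback(raw)
delimiter = guess_delimiter(text.splitlines()[0])
rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Record what worked so that the next person does not have to rediscover it.
recipe = {"encoding": encoding, "delimiter": delimiter}
print(recipe, rows)
```

The `recipe` dictionary is the bit worth sharing alongside the data: it is exactly the metadata that would otherwise have to be painfully rediscovered.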

Using tools like Docker is one way of improving the shareability of immediately queryable data, but is there an even quicker way? One thing on my to do list that I want to explore is the idea of a “databox”, a Raspberry Pi image that when booted runs a database server and Jupyter notebook (or RStudio) environment. The database can be pre-seeded with data for the datadive, so all that should be required is for an individual to plug the Raspberry Pi into their computer with an ethernet cable, and run from there. (This won’t work for really large datasets – the Raspberry Pi lacks grunt – but it’s enough to get you started.)

Note that these approaches scale out to other domains, such as data journalism projects (each project on its own Raspberry Pi SD card or docker-compose setup…)

An Alternative Way of Motivating the Use of Functions?

At the end of the first day of the Curriculum Development Hackathon on Reproducible Research using Jupyter Notebooks, held at BIDS in Berkeley yesterday, discussion turned to whether we should include a short how-to on the use of interactive IPython widgets to support exploratory data analysis. This would provide workshop participants with an example of how to rapidly prototype a simple exploratory data analysis application such as an interactive chart, enabling them to explore a range of parameter values associated with the data being plotted in a convenient way.

In summarising how the ipywidgets interact() function works, Fernando Perez made a comment that made me wonder whether we could use the idea of creating simple interactive chart explorers as a way of motivating the use of functions.

More specifically, interact() takes a function name and the set of parameters passed into that function and creates a set of appropriate widgets for setting the parameters associated with the function. Changing the widget setting runs the function with the currently selected values of the parameters. If the function returns a chart object, then the function essentially defines an interactive chart explorer application.

So one reason for creating a function is that you may be able to automatically convert it into an interactive application using interact().
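As a minimal sketch of that idea, assuming ipywidgets and matplotlib are available in the notebook environment: the sine wave example and the function names are my own invention, and the widget and plotting imports are kept inside functions so that the plain calculation can be run and tested anywhere.

```python
import math


def sine_wave(frequency=1.0, amplitude=1.0, n_points=100):
    """Return x and y samples of amplitude * sin(frequency * x) over 0..2*pi."""
    xs = [2 * math.pi * i / (n_points - 1) for i in range(n_points)]
    ys = [amplitude * math.sin(frequency * x) for x in xs]
    return xs, ys


def plot_sine_wave(frequency=1.0, amplitude=1.0):
    """Chart the wave; this is the function that interact() wraps."""
    import matplotlib.pyplot as plt  # imported lazily: only needed when plotting

    xs, ys = sine_wave(frequency, amplitude)
    plt.plot(xs, ys)
    plt.ylim(-2, 2)
    plt.show()


def make_explorer():
    """Call from a notebook cell to get one slider per function parameter."""
    from ipywidgets import interact

    # Each (min, max, step) tuple is turned into a slider for that parameter.
    return interact(plot_sine_wave, frequency=(0.5, 5.0, 0.5), amplitude=(0.1, 2.0, 0.1))
```

Running `make_explorer()` in a notebook cell should then display two sliders and redraw the chart whenever either is moved. The point for learners is that all the effort went into writing an ordinary function; the application came for free.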

Here’s a quick first sketch notebook that tries to set up a motivating example: An Alternative Way of Motivating Functions?


PS to embed an image of a rendered widget in the notebook, select the Save notebook with snapshots option from the Widgets menu:

simplewidgetdemo

See also: Simple Interactive View Controls for pandas DataFrames Using IPython Widgets in Jupyter Notebooks

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.
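The "constructing sentences" and "organising them" steps can be sketched very simply with string templates. This is only a toy illustration, with made-up spending areas and amounts and hypothetical function names; making the sentences read well together is the hard part and is not attempted here.

```python
def spend_sentence(area, amount, total):
    """Turn one row of data into a templated sentence."""
    share = 100 * amount / total
    return f"{area} accounted for £{amount:,.0f}, or {share:.0f}% of the total."


def spend_report(spend_by_area):
    """Order the areas by value (largest first) and join their sentences."""
    total = sum(spend_by_area.values())
    ranked = sorted(spend_by_area.items(), key=lambda item: item[1], reverse=True)
    return " ".join(spend_sentence(area, amount, total) for area, amount in ranked)


# Hypothetical data
report = spend_report({"Highways": 250000, "Adult Services": 750000})
print(report)
```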

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

exceldata2slide

#render via: http://blockdiag.com/en/blockdiag/demo.html
{
  X1[label='macro']
  X2[label='macro']

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

iw_transparency_spending_data

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.

jupyterslidetextgen

{
  X1[label="text"]
  X2[label="Jupyter notebook\n(slide mode)"]
  X3[label="reveal.js\npresentation"]

  Y1[label="text"]
  Y2[label="text"]
  Y3[label="text"]

  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];
}
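The intermediate representation in that pipeline is just a notebook whose cells carry a slideshow metadata element. As a hedged sketch of what generating one might look like, the following builds the notebook JSON directly with the standard library (the nbformat package could be used instead); the helper names and the slide text are invented for illustration.

```python
import json


def slide_cell(text, slide_type="slide"):
    """Build a markdown cell tagged with a slide type.

    Useful slide types include "slide", "subslide" and "fragment".
    """
    return {
        "cell_type": "markdown",
        "metadata": {"slideshow": {"slide_type": slide_type}},
        "source": text,
    }


def build_notebook(cells):
    """Wrap the cells in a minimal version 4 notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 2}


# Hypothetical generated text, one cell per slide element
cells = [
    slide_cell("# Spending review"),
    slide_cell("## Spend by directorate"),
    slide_cell("Directorate A was the largest spender.", "fragment"),
    slide_cell("### Directorate A: major suppliers", "subslide"),
]

notebook_json = json.dumps(build_notebook(cells), indent=1)
print(notebook_json[:200])
```

If the JSON is saved to a file such as slides.ipynb (a made-up name), running `jupyter nbconvert --to slides slides.ipynb` should then produce the reveal.js presentation.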

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time the slide should scroll down to follow them.

The structure of the presentation is broadly as follows:

demo_-_interactive_shell_for_blockdiag_-_blockdiag_1_0_documentation

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.

testfullslidenotebook2_slides1

The next slide reviews the capital versus revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.

testfullslidenotebook2_slides2

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

testfullslidenotebook2_slides3

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to try to prioritise the information displayed in order to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.
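The selection rule can be sketched in a few lines. This is an illustrative reconstruction rather than the code behind the slides: the function name and sample figures are invented, and the thresholds follow the description above (enough items, in descending order, to reach at least 80% of the spend, plus any further items individually worth more than 5%).

```python
def pareto_items(spend_by_item, cumulative_share=0.8, individual_share=0.05):
    """Select the items needed to reach the cumulative share, plus any other big ones."""
    total = sum(spend_by_item.values())
    ranked = sorted(spend_by_item.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running = 0  # spend accounted for by the items ranked above the current one
    for name, amount in ranked:
        still_below_target = running < cumulative_share * total
        large_in_own_right = amount > individual_share * total
        if still_below_target or large_in_own_right:
            selected.append((name, amount))
        running += amount
    return selected


# Hypothetical spend by expenses type
spend = {"A": 50, "B": 25, "C": 10, "D": 8, "E": 4, "F": 3}
major_items = pareto_items(spend)
print(major_items)
```

Applying the same function first to spending areas, and then to the suppliers within each selected area, gives the repeated drill down described above.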

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

Displaying Differences in Jupyter Notebooks – nbdime / nbdiff

One of the challenges of working with Jupyter notebooks to date has been the question of diffing, spotting the differences between two versions of the same notebook. This made collaborative authoring and reviewing of notebooks a bit tricky. It also acted as a brake on using notebooks for student assessment. It’s easy enough to set an exercise using a templated notebook and then get students to work through it, but marking the completed notebook in return can be a bit fiddly. (The nbgrader system addresses this in part but at the expense of the overhead in terms of having to use additional nbgrader formatting and markup.)

However, there’s ongoing effort now around nbdime (docs). Given past success in getting Jupyter notebook previews displayed in Github, it wouldn’t be unreasonable to think that the diff view might make it into that environment at some point too…

See also: Sensible Diff-ing of Jupyter Notebook ipynb Documents Using VS Code

At the moment, nbdime works from the command line. It can produce a text diff in the console, or launch a notebook viewer in the browser that shows differences between two notebooks.

The differ works on a cell by cell basis and highlights changes and additions. (Extra emphasis on the changed text in a markdown cell doesn’t seem to work at the moment?)
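To make the cell by cell idea concrete, here is a deliberately simplified sketch that compares the cell sources of two notebooks using Python's difflib. It is not how nbdime works internally (nbdime also handles outputs, metadata and merging); the function names and sample notebooks are made up.

```python
import difflib


def cell_sources(notebook):
    """Return each cell's source as one string (it may be stored as a list of lines)."""
    return ["".join(cell["source"]) for cell in notebook["cells"]]


def diff_cells(old_notebook, new_notebook):
    """List the runs of cells that were replaced, inserted or deleted."""
    old = cell_sources(old_notebook)
    new = cell_sources(new_notebook)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two hypothetical versions of a notebook, parsed from their JSON
before = {"cells": [{"source": "import pandas as pd"}, {"source": "df.head()"}]}
after = {
    "cells": [
        {"source": "import pandas as pd"},
        {"source": "df.head(10)"},
        {"source": "df.describe()"},
    ]
}

changes = diff_cells(before, after)
print(changes)
```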

nbdime_-_diff_and_merge_your_jupyter_notebooks

If you change the contents of a code cell, or the outputs of a code cell have changed, those differences are identified too. (Note the extra emphasis in the code cell on the changed text, but not in the output.)

nbdime_-_diff_and_merge_your_jupyter_notebooks2

To improve readability, you can collapse the display of changed code cell output.

nbdime_-_diff_and_merge_your_jupyter_notebooks3

Where cell outputs include graphical objects, differences to these are highlighted too.

nbdime_-_diff_and_merge_your_jupyter_notebooks4

(Whilst I note that Github has various tools for exploring the differences between two versions of the same image, I suspect that sort of comparison will be difficult to achieve inline in the notebook differencer.)

I suspect one common way of using nbdime will be to compare the current state of a notebook with a checkpointed version. (Jupyter notebooks autosave the current state of the notebook quite regularly. If you force a save, the current state is saved but a “checkpoint” version of the notebook is also saved to a hidden folder. If things go really wrong with your current notebook, you can restore it to the checkpointed version.)

If you’ve saved a checkpoint of a notebook, and want to compare the current (autosaved) version with it, you need to point to the checkpointed file in the checkpoint folder: nbdiff-web .ipynb_checkpoints/MY_FILE-checkpoint.ipynb MY_FILE.ipynb. It’d be nice if a switch could handle this automatically, eg nbdime_web --compare-checkpoint MY_FILE.ipynb. (It would also be nice if the nbdiff command could force the notebook to autosave before a diff is run, but I’m not sure how that could be achieved?)
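In the meantime, a small helper can work out the checkpoint path for you. This is a sketch with made-up function names, and it assumes the checkpoint naming pattern shown above (a hidden .ipynb_checkpoints folder alongside the notebook).

```python
from pathlib import Path


def checkpoint_path(notebook):
    """Return the path of the checkpointed copy of a notebook."""
    nb = Path(notebook)
    return nb.parent / ".ipynb_checkpoints" / f"{nb.stem}-checkpoint{nb.suffix}"


def nbdiff_web_command(notebook):
    """Build the nbdiff-web command comparing the checkpoint with the current file."""
    return ["nbdiff-web", str(checkpoint_path(notebook)), str(notebook)]


command = nbdiff_web_command("MY_FILE.ipynb")
print(" ".join(command))

# To actually launch the differ (requires nbdime to be installed):
# import subprocess
# subprocess.run(command, check=True)
```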

It also strikes me that when restoring from a checkpoint, it might be possible to combine the restoration action with the differencer view so that you can decide which bits of the current notebook you might want to keep (i.e. essentially treat the differences between the current and checkpointed version as conflicts that need to be resolved?)

This is probably pushing things a bit far, but I also wonder if lightweight, inline, cell level differencing would be possible, given that each cell in a running notebook has an undo feature that goes back multiple steps?

Finally, a note about using the differencer to support marking. The differencer view is an HTML file, so whilst you can compare a student’s notebook with the original, you can’t edit their notebook directly in the differencer to add marks or feedback. (I really do need to have another play with nbgrader, I think…)

PS It’s also worth noting that SageMathCloud has a history slider that lets you run over different autosaved versions of a notebook, although differences are not highlighted.

PPS Thinks: what I’d like is a differencer that generates a new notebook with addition/deletion cells highlighted and colour styled so that I could retain – or delete – the cell and add cells of my own… Something akin to track changes, for example. That way I could run different cells, add annotations, etc etc (related issue).

Python Code Stepper / Debugger / Tutor for Jupyter Notebooks – nbtutor

Whilst reviewing / scoping* possible programming editor environments for the new level 1 courses, one of the things I was encouraged to look at was Philip Guo’s interactive Python Tutor.

According to the original writeup (Philip J. Guo. Online Python Tutor: Embeddable Web-Based Program Visualization for CS Education. In Proceedings of the ACM Technical Symposium on Computer Science Education (SIGCSE), March 2013), the application has an HTML front end that calls on a backend debugger: “the Online Python Tutor backend takes the source code of a Python program as input and produces an execution trace as output. The backend executes the input program under supervision of the standard Python debugger module (bdb), which stops execution after every executed line and records the program’s run-time state.”

The current version of the online tutor supports a wider range of languages – Python, Java, JavaScript, TypeScript, Ruby, C, and C++ – which presumably have their own backend interpreter and use a common trace response format?

The tutor itself allows you to step through code snippets a line at a time, displaying a trace of the current variable values.
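To get a feel for the sort of trace that stepping relies on, here is a toy tracer built on the standard library bdb module mentioned in the writeup. It is my own minimal sketch, not the Online Python Tutor backend: the class and function names are invented, and it only records line numbers and simple variable snapshots.

```python
import bdb


class LineTracer(bdb.Bdb):
    """Record the line number and visible variables just before each line runs."""

    def __init__(self):
        super().__init__()
        self.steps = []

    def user_line(self, frame):
        # Bdb.run() compiles a source string under the filename "<string>",
        # so this check limits the trace to the snippet being stepped through.
        if frame.f_code.co_filename == "<string>":
            state = {
                name: value
                for name, value in frame.f_locals.items()
                if not name.startswith("__")  # hide __builtins__ and friends
            }
            self.steps.append((frame.f_lineno, state))


def trace(source):
    """Run `source` in a fresh namespace and return the recorded steps."""
    tracer = LineTracer()
    tracer.run(source, {})
    return tracer.steps


steps = trace("x = 1\ny = x + 2\nz = x * y\n")
for line_number, variables in steps:
    print(line_number, variables)
```

Each step shows the state before the numbered line executes, which is why the effect of an assignment only shows up in the following step. The snapshots hold references rather than deep copies, so mutable values would need copying for a faithful history.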

Another nice feature of the Online Python Tutor, though it was a bit ropey when I first tried it out a few months ago, is the shared session support: a learner and a tutor can see the same session via a shared link, with an additional chat box allowing them to chat over the shared experience in realtime.

The Online Python Tutor also allows URLs to saved programs (“tutorials”) to be generated and shared: link to the demo shown in the movie above. The code is actually passed via the URL.

One of the problems with the Online Python Tutor is that it requires a network connection so that the code can be passed to the interpreter back end, executed to generate the code trace, and then passed back to the browser. It didn’t take long for folk to start embedding the tutor in an iframe to give a pseudo-traceability experience in the notebook context, but now the Online Python Tutor inspired nbtutor extension makes cell based tracing against the local python kernel possible**.

The nbtutor extension provides cell by cell tracing: when running a cell, all the code in the cell is executed, the trace is returned, and it is then available for visualising. Note that all variables in scope are displayed in the trace, even if they have been set in other cells outside of the nbtutor magic. (I’m not sure if there’s a setting that allows you just to display the variables that are referenced within the cell?) It is also possible to clear all variables in the global scope via a magic parameter, with a prompt to confirm that you really do want to clear out all those variable values.

I’m not sure what the best way would be to go about framing nbtutor exercises in a Jupyter notebook context, but I note that the notebooks used to support the MPR213 (Programming and Information Technology) course from the Department of Mechanical and Aeronautical Engineering in the Faculty of Engineering, Built Environment and Information Technology at the University of Pretoria now include nbtutor examples.

Footnotes:

* A cynic might say scoping in the sense of not seriously considering anything other than the environments that had already been decided on before the course production process had really started… ;-) I also preferred BlockPy over Scratch, for example. My feeling was that if the OU was going to put developer effort in (the original claim was we wouldn’t have to put effort into Scratch, though of course we are because Scratch wasn’t quite right…) we could add more value to the OU and the community by getting involved with BlockPy, rather than a programming environment developed for primary school kids. Looking again at the “friendly” error messages that the BlockPy environment offers, I’m starting to wonder if elements of that could be reused for some IPython notebook magic…

** Again, I’m of the mind that were it 20 years ago, porting the Online Python Tutor to the Jupyter notebook context might have been something we’d have considered doing in the OU…

Reporting in a Repeatable, Parameterised, Transparent Way

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset. You can see a rendered example of the notebook here (download).
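The notebook itself leans on pandas and folium, but the matching step at its heart can be sketched with just the standard library. In the following, the column names, practice codes, LSOA codes and counts are all made up for illustration (the real NHS Digital file and boundary file will differ), and patients_by_lsoa and attach_counts are hypothetical helper names.

```python
import csv
import io

# Hypothetical sample data: column names and values are invented for illustration
GP_LIST_CSV = """PRACTICE_CODE,LSOA_CODE,PATIENTS
P00001,E01000001,120
P00001,E01000002,30
P00002,E01000001,75
"""

# Hypothetical boundary file content; geometries are omitted to keep things short
BOUNDARIES = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"LSOA11CD": "E01000001"}, "geometry": None},
        {"type": "Feature", "properties": {"LSOA11CD": "E01000002"}, "geometry": None},
        {"type": "Feature", "properties": {"LSOA11CD": "E01000003"}, "geometry": None},
    ],
}


def patients_by_lsoa(csv_text, practice_code):
    """Ingest the list data and keep the per-LSOA counts for a single practice."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["PRACTICE_CODE"] == practice_code:
            counts[row["LSOA_CODE"]] = counts.get(row["LSOA_CODE"], 0) + int(row["PATIENTS"])
    return counts


def attach_counts(geojson, counts, key="LSOA11CD"):
    """Return a new feature collection holding only matched features, each with a patient count."""
    features = []
    for feature in geojson["features"]:
        code = feature["properties"][key]
        if code in counts:
            props = dict(feature["properties"], patients=counts[code])
            features.append(dict(feature, properties=props))
    return {"type": "FeatureCollection", "features": features}


counts = patients_by_lsoa(GP_LIST_CSV, "P00001")
matched = attach_counts(BOUNDARIES, counts)
print(counts)
```

The matched feature collection is the sort of thing that could then be handed on to a mapping library to colour each LSOA by its patient count.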

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.

Whilst I find the Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio and in particular the knitr tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.


For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive leaflet map, or a static PDF document.

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.
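For comparison, the same one-template-many-reports pattern can be sketched on the Python side. This is a minimal illustration using the standard library's string.Template rather than a real report toolchain; the template text, practice codes, names and patient numbers are all hypothetical.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical base template; the $-placeholders are filled in per practice
BASE_TEMPLATE = Template(
    "GP practice report: $name ($code)\n"
    "Registered patients: $patients\n"
)

# Hypothetical parameter sets, one per report
PRACTICES = [
    {"code": "P00001", "name": "Example Surgery A", "patients": 150},
    {"code": "P00002", "name": "Example Surgery B", "patients": 75},
]


def generate_reports(template, parameter_sets, output_dir):
    """Render the same template once per parameter set, writing one file for each."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for params in parameter_sets:
        report_path = output_dir / "report_{}.txt".format(params["code"])
        report_path.write_text(template.substitute(params), encoding="utf-8")
        written.append(report_path)
    return written


reports = generate_reports(BASE_TEMPLATE, PRACTICES, tempfile.mkdtemp())
for report in reports:
    print(report.name)
```

The point is only that the report logic lives in one place and the per-practice detail lives in the parameters, which is what makes regular updating cheap.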

The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation).

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:

  • to look for sensible ways of comparing the workflow associated with each,
  • the ramp-up skills required, and blockers (including cultural blockers (also administrative / organisational blockers, h/t @dasbarrett)) associated with getting started with new tools such as Jupyter or RStudio, and
  • the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, and what sorts of issue would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.

Experimenting With Sankey Diagrams in R and Python

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R, or python/pandas, so it was only natural (?!) that I thought I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyportfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitted HTML output is here.

The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.
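The data wrangling behind those last two steps is simple enough to sketch in plain Python. The flows below are invented for illustration (the area codes just follow the pattern of a leading letter indicating the country), and the helper names are hypothetical; the Rmd file and notebook do the equivalent with data frames.

```python
from collections import defaultdict

# Illustrative flows only: (source area code, target area code, people moving).
# The first letter of each code indicates the country; the values are made up.
FLOWS = [
    ("E12000001", "E12000007", 10),
    ("E12000007", "S92000003", 4),
    ("E12000001", "S92000003", 3),
    ("W92000004", "E12000007", 5),
    ("S92000003", "W92000004", 2),
]

COUNTRY_NAMES = {"E": "England", "W": "Wales", "S": "Scotland", "N": "Northern Ireland"}


def country_of(area_code):
    """Look up the country from the indicator at the start of the area code."""
    return COUNTRY_NAMES[area_code[0]]


def between_countries(flows):
    """Drop edges whose source and target are in the same country."""
    return [(s, t, v) for s, t, v in flows if country_of(s) != country_of(t)]


def country_level_flows(flows):
    """Group edges by (source country, target country) and sum the flow values."""
    totals = defaultdict(int)
    for source, target, value in flows:
        totals[(country_of(source), country_of(target))] += value
    return dict(totals)


cross_border = between_countries(FLOWS)
country_flows = country_level_flows(cross_border)
for (source, target), value in country_flows.items():
    print(source, "->", target, value)
```

Each (source, target, value) triple corresponds to one edge in the Sankey diagram, so the same shape of data can feed whichever charting library is being used.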

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankeywidget package. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs that can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…

Creating a Jupyter Bundler Extension to Download Zipped Notebook and HTML Files

In the first version of the TM351 VM, we had a simple toolbar extension that would download a zipped ipynb file, along with an HTML version of the notebook, so it could be uploaded and previewed in the OU Open Design Studio. (Yes, I know, it would have been much better to have an nbviewer handler as an ODS plugin, but then we don’t do that sort of tech innovation, apparently…)

Looking at updating the extension today for the latest version of Jupyter notebooks, I noticed the availability of custom bundler extensions that allow you to add additional tools to support notebook downloads and deployment (I’m not sure what deployment relates to?). Adding a new download option allows it to be added to the notebook File -> Download as menu:

The extension is created as a python package:

# odszip/setup.py
from setuptools import setup

setup(name='odszip',
      version='0.0.1',
      description='Save Jupyter notebook and HTML in zip file with .nbk suffix',
      author='',
      author_email='',
      license='MIT',
      packages=['odszip'],
      zip_safe=False)

# odszip/odszip/download.py

# Copyright (c) The Open University, 2017
# Copyright (c) Jupyter Development Team.

# Distributed under the terms of the Modified BSD License.
# Based on: https://github.com/jupyter-incubator/dashboards_bundlers/

import os
import shutil
import tempfile

#THIS IS A REQUIRED FUNCTION
def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
                'name': 'odszip_download',
                'label': 'ODSzip (.nbk)',
                'module_name': 'odszip.download',
                'group': 'download'
            }]


def make_download_bundle(abs_nb_path, staging_dir, tools):
    '''
    Stages the notebook alongside an HTML rendering of it and returns the
    path to a zip file bundling the two.
    :param abs_nb_path: The path to the notebook
    :param staging_dir: Temporary work directory, created and removed by the caller
    :param tools: Bundler helper tools provided by the handler (unused here)
    '''

    # Clean up bundle dir if it exists
    shutil.rmtree(staging_dir, True)
    os.makedirs(staging_dir)

    # Get name of notebook from filename
    notebook_basename = os.path.basename(abs_nb_path)

    # Add the notebook
    shutil.copy2(abs_nb_path, os.path.join(staging_dir, notebook_basename))

    # Include HTML version of file
    cmd = 'jupyter nbconvert --to html "{abs_nb_path}" --output-dir "{staging_dir}"'.format(abs_nb_path=abs_nb_path, staging_dir=staging_dir)
    os.system(cmd)

    # Zip the contents of the staging directory; the archive is created alongside it
    zip_file = shutil.make_archive(staging_dir, format='zip', root_dir=staging_dir, base_dir='.')
    return zip_file

#THIS IS A REQUIRED FUNCTION       
def bundle(handler, model):
    '''
    Downloads a notebook as an HTML file and zips it with the notebook
    '''

    # Based on https://github.com/jupyter-incubator/dashboards_bundlers

    abs_nb_path = os.path.join(
        handler.settings['contents_manager'].root_dir,
        model['path']
    )

    notebook_basename = os.path.basename(abs_nb_path)
    notebook_name = os.path.splitext(notebook_basename)[0]

    tmp_dir = tempfile.mkdtemp()

    output_dir = os.path.join(tmp_dir, notebook_name)
    bundle_path = make_download_bundle(abs_nb_path, output_dir, handler.tools)

    # Serve the zip file under a .nbk filename
    handler.set_header('Content-Disposition', 'attachment; filename="%s"' % (notebook_name + '.nbk'))
    handler.set_header('Content-Type', 'application/zip')

    with open(bundle_path, 'rb') as bundle_file:
        handler.write(bundle_file.read())

    handler.finish()

    # We read and send synchronously, so we can clean up safely after finish
    shutil.rmtree(tmp_dir, True)

We can then create the python package and install the extension, remembering to restart the Jupyter server for the extension to take effect.

#Install the ODSzip extension package
pip3 install --upgrade --force-reinstall ./odszip

#Enable the ODSzip extension
jupyter bundlerextension enable --py odszip.download  --sys-prefix
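The staging-and-zipping step can also be tried out away from the notebook server. The following sketch mirrors the logic of make_download_bundle in the download.py listing, but skips the nbconvert call and writes a placeholder HTML file instead; stage_and_zip is a hypothetical helper name and the notebook content is a stub.

```python
import os
import shutil
import tempfile
import zipfile


def stage_and_zip(nb_path, staging_dir):
    """Copy a notebook into a clean staging directory and zip that directory."""
    shutil.rmtree(staging_dir, True)
    os.makedirs(staging_dir)
    shutil.copy2(nb_path, os.path.join(staging_dir, os.path.basename(nb_path)))

    # Placeholder standing in for the HTML file that nbconvert would generate
    html_name = os.path.splitext(os.path.basename(nb_path))[0] + ".html"
    with open(os.path.join(staging_dir, html_name), "w", encoding="utf-8") as f:
        f.write("<html><body>placeholder</body></html>")

    # make_archive appends ".zip" to the base name it is given
    return shutil.make_archive(staging_dir, format="zip", root_dir=staging_dir, base_dir=".")


# Create a stub notebook file to bundle
work_dir = tempfile.mkdtemp()
nb_path = os.path.join(work_dir, "demo.ipynb")
with open(nb_path, "w", encoding="utf-8") as f:
    f.write('{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}')

zip_path = stage_and_zip(nb_path, os.path.join(work_dir, "bundle", "demo"))

# List the files that ended up in the archive, ignoring any directory entries
with zipfile.ZipFile(zip_path) as zf:
    members = sorted(os.path.basename(n) for n in zf.namelist() if not n.endswith("/"))
print(members)

shutil.rmtree(work_dir, True)
```

Because the archive is an ordinary zip file, the .nbk suffix is purely a naming convention applied when the file is served.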

Pondering a Jupyter Notebook “Diff”er Extension and Its Use as a Marking Tool

One of the problems with trivially using simple text based differencing tools on Jupyter notebooks is that the differencing is based on the content of the JSON file used to represent the notebook, rather than the content of the notebook cells themselves.
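A quick way to see the problem is to compare two notebooks whose cells are identical but which were executed at different times. In this minimal sketch the notebook dicts are cut-down stand-ins built by a hypothetical make_notebook helper, not complete notebook documents, and the comparison uses the standard library's difflib rather than nbdime.

```python
import difflib
import json


def make_notebook(sources, execution_count):
    """Build a minimal notebook-like dict (a stand-in, not a complete notebook document)."""
    return {
        "cells": [
            {"cell_type": "code", "execution_count": execution_count,
             "metadata": {}, "outputs": [], "source": src}
            for src in sources
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }


def json_diff(nb_a, nb_b):
    """Line-based diff of the serialised JSON, as a plain text diff tool would see it."""
    a = json.dumps(nb_a, indent=1, sort_keys=True).splitlines()
    b = json.dumps(nb_b, indent=1, sort_keys=True).splitlines()
    return list(difflib.unified_diff(a, b, lineterm=""))


def cell_source_diff(nb_a, nb_b):
    """Diff that only looks at the source of each cell."""
    a = [cell["source"] for cell in nb_a["cells"]]
    b = [cell["source"] for cell in nb_b["cells"]]
    return list(difflib.unified_diff(a, b, lineterm=""))


before = make_notebook(["x = 1", "print(x)"], execution_count=1)
after = make_notebook(["x = 1", "print(x)"], execution_count=7)

print(len(json_diff(before, after)), "lines of JSON diff")
print(len(cell_source_diff(before, after)), "lines of cell source diff")
```

The JSON-level diff reports changes even though nothing a reader would care about has changed, whereas the cell-level comparison comes back empty.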

The nbdime tools (which I’ve commented on before) address this problem by reporting differences at the cell content level. (Hmm, I wonder: would a before/after image comparison view for charts also be useful (example)?)

I still don’t really understand how Jupyter notebook checkpointing works (e.g. I think I’d like it to work so that I can force a save to a checkpoint, whilst the autosave updates the current notebook file) but I started wondering yesterday about a simple Jupyter extension that would compare the current (saved) notebook with the checkpointed version.

My first attempt can be found in this gist:

// main.js: call "nbdime" and compare current version with new version in new tab
define([
    'base/js/namespace',
    'jquery',
    'base/js/events',
    'base/js/utils'
], function(IPython, $, events, utils) {
    "use strict";

    /**
     * Call nbdiff-web with current notebook against checkpointed version
     */
    var nbCheckpointDiffView = function () {
        var kernel = IPython.notebook.kernel;
        var name = IPython.notebook.notebook_name;
        var path = IPython.notebook.notebook_path;
        var base_url = window.location.protocol + '//' + window.location.hostname;
        // Settings specific to the TM351 VM setup
        var abspath = '/vagrant/notebooks';
        var hostport = '35101';
        var guestport = '8899';
        var url = base_url + ':' + hostport + '?remote=' + abspath + '/.ipynb_checkpoints/' + utils.splitext(name)[0] + '-checkpoint.ipynb&base=' + abspath + '/' + path;
        // Launch the nbdiff-web service from the notebook kernel
        var command = 'import subprocess; subprocess.run(\'nbdiff-web --port=' + guestport + ' --ip=* \', shell=True)';
        kernel.execute(command);
        window.open(url);
        $('#doCheckpointDiffView').blur();
    };

    var load_ipython_extension = function() {
        IPython.toolbar.add_buttons_group([
            {
                id: 'doCheckpointDiffView',
                label: 'Display nbdiff to checkpointed version',
                icon: 'fa-list-alt',
                callback: nbCheckpointDiffView
            }
        ]);
    };

    return {
        load_ipython_extension : load_ipython_extension
    };
});

Type: IPython Notebook Extension
Name: Checkpointdiffer
Description: Calls nbdiff with current notebook and checkpointed version
Link: readme.md
Compatibility: 4.x

Clicking the appropriate button (I need to find a better icon?) launches the nbdiff-web service on a specified port and then pops open a tab comparing the current saved notebook file with the checkpointed version. Note there are several settings specific to my setup where the notebooks are running inside the TM351 VM:

  • guestport is the port inside the guest VM that the nbdiff-web service runs on
  • hostport is the port on the host machine that the nbdiff-web service port is mapped to
  • abspath is the path to the notebook root directory; ideally, this should be picked up by the script.
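The path and URL construction the extension does in JavaScript can be sketched in Python, which also makes it easier to see what a less hard-wired version would need to handle. This assumes the default arrangement in which a checkpoint lives in an .ipynb_checkpoints directory alongside the notebook; checkpoint_path and diff_url are hypothetical helper names, and the query parameters simply copy the ones used in the extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlencode


def checkpoint_path(notebook_path):
    """Return the assumed default checkpoint location for a notebook path."""
    nb = PurePosixPath(notebook_path)
    return nb.parent / ".ipynb_checkpoints" / (nb.stem + "-checkpoint.ipynb")


def diff_url(base_url, hostport, abspath, notebook_path):
    """Build the diff URL using the same query parameters as the extension."""
    current = PurePosixPath(abspath) / notebook_path
    query = urlencode(
        {"remote": str(checkpoint_path(current)), "base": str(current)},
        safe="/",  # leave path separators readable rather than percent-encoding them
    )
    return "{}:{}?{}".format(base_url, hostport, query)


url = diff_url("http://localhost", "35101", "/vagrant/notebooks", "demo/test.ipynb")
print(url)
```

Unlike the string concatenation in the extension, this version copes with notebooks that sit in a subdirectory of the notebook root.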

In testing, it seems my setup has checkpoints being saved regularly and automatically, which means the diffing is not that useful… Maybe I need to have a “backup-save” or “version-save” option somewhere to force saves at particular times and then compare to those?

The code used for the extension was inspired by the printview extension code. Looking at that, I note how the defining YAML file makes it easy to set up the extension configuration options:

Which got me wondering… one of the issues we have had with marking student work specified in and completed as Jupyter notebooks is helping markers quickly navigate to the cells that students have modified so they can be marked. (Finding an efficient way of allowing markers to annotate and return scripts is another issue. Note that we still haven’t really experimented with nbgrader either.) So I wonder if the differencer can help?

For example, we could bundle a set of ‘provided’ notebooks in which assessments are defined within a marking extension, and then allow a marker to compare the student returned version of the notebook with the provided copy, using the option to hide unchanged cells and just highlight the ones that have been changed to include the student’s answers. Ideally, we’d also want the marker to be able to annotate those cells with marks and feedback, and return the annotated script to the student – who could then diff back to their submitted transcript to see what the marker had to say?
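As a rough sketch of the "find the cells the student changed" step, the following compares the cell sources of a provided notebook and a returned one using the standard library's difflib. It is not nbdime and not a marking tool, just an illustration of the idea; the notebook dicts are cut-down, made-up examples and changed_cells is a hypothetical helper name.

```python
import difflib


def cell_sources(notebook):
    """Return each cell's source as one string (source may be a string or a list of lines)."""
    sources = []
    for cell in notebook["cells"]:
        src = cell["source"]
        sources.append("".join(src) if isinstance(src, list) else src)
    return sources


def changed_cells(provided, submitted):
    """Return indices (in the submitted notebook) of cells that were added or modified."""
    matcher = difflib.SequenceMatcher(
        None, cell_sources(provided), cell_sources(submitted), autojunk=False
    )
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed


# Made-up example notebooks, reduced to the fields used here
provided = {"cells": [
    {"cell_type": "markdown", "source": "## Question 1"},
    {"cell_type": "code", "source": "# YOUR ANSWER HERE"},
    {"cell_type": "markdown", "source": "## Question 2"},
    {"cell_type": "code", "source": "# YOUR ANSWER HERE"},
]}
submitted = {"cells": [
    {"cell_type": "markdown", "source": "## Question 1"},
    {"cell_type": "code", "source": "total = sum(range(10))"},
    {"cell_type": "markdown", "source": "## Question 2"},
    {"cell_type": "code", "source": "# YOUR ANSWER HERE"},
    {"cell_type": "markdown", "source": "Extra notes from the student"},
]}

print(changed_cells(provided, submitted))
```

A marker's view could then jump straight to those cell indices, and an unanswered question shows up by its absence from the list.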

PS via @biztechpm, How to grade programming assignments on GitHub