Notebook Practice – Data Sci Ed 101, With a Nod To TM351

One of the ways our Data management and analysis (TM351) course differs from the module it replaced, a traditional databases module, is the way in which we designed it to cover a full data pipeline, from data acquisition, through cleaning, management (including legal issues), analysis, visualisation and reporting. The database content takes up about a third of the course and covers key elements of relational and NoSQL databases. Half the course content is reading, half is practical. To help frame the way we created the course, we imagined a particular persona: a postdoc researcher who has to manage everything data related in a research project.

Jupyter notebooks, delivered with PostgreSQL and MongoDB databases, along with OpenRefine, inside a VirtualBox virtual machine, provide the practical environment. Through the assessment model (two tutor marked assessments as continuous assessment, and a final project style assessment), we try to develop the idea that notebooks can be used to document small data investigations in a reproducible way.

The course has been running for several years now — the notebooks were still called IPython notebooks when we started — but literature we can use to post hoc rationalise the approach we’ve taken is now starting to appear…

For example, this preprint recently appeared on arXiv — Three principles of data science: predictability, computability, and stability (PCS), Bin Yu, Karl Kumbier, arXiv:1901.08152 — and it provides a description of how to use notebooks in a way that resembles the model we use in TM351:

We propose the following steps in a notebook:

1. Domain problem formulation (narrative). Clearly state the real-world question one would like to answer and describe prior work related to this question.

2. Data collection and relevance to problem (narrative). Describe how the data were generated, including experimental design principles, and reasons why data is relevant to answer the domain question.

3. Data storage (narrative). Describe where data is stored and how it can be accessed by others.

4. Data cleaning and preprocessing (narrative, code, visualization). Describe steps taken to convert raw data into data used for analysis, and why these preprocessing steps are justified. Ask whether more than one preprocessing methods should be used and examine their impacts on the final data results.

5. PCS inference (narrative, code, visualization). Carry out PCS inference in the context of the domain question. Specify appropriate model and data perturbations. If necessary, specify null hypothesis and associated perturbations (if applicable). Report and post-hoc analysis of data results.

6. Draw conclusions and/or make recommendations (narrative and visualization) in the context of domain problem.

As early work kicks off on a new course on data science, a module I’m hoping will be notebook mediated, I’m thinking now would be a good time for us to revisit how we use the notebooks for teaching, and how we expect the students to use them for practice and assessment, and to pull together some comprehensive notes on emerging notebook pedagogy from a distance HE perspective.

So How Do I Export Styled Pandas Tables From a Jupyter Notebook as a PNG or PDF File?

In order to render tweetable PNGs of my WRC rally stage chartables, I’ve been using selenium to render the table in its own web page and then grab a screenshot (Converting Pandas Generated HTML Data Tables to PNG Images), but that’s really clunky. So I started to wonder: are there any HTML2PNG converters out there?
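For reference, the clunky selenium route boils down to something like the following sketch (not the exact code from that post; it assumes chromedriver and a local Chrome install are available, and the helper name is made up):

import tempfile

import pandas as pd
from selenium import webdriver


def styled_table_to_png(styler, png_path="table.png"):
    """Render a pandas Styler to HTML, open it in headless Chrome, screenshot it."""
    # Newer pandas Stylers have .to_html(); older ones use .render().
    html = styler.to_html() if hasattr(styler, "to_html") else styler.render()

    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write(html)
        tmp_path = f.name

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("file://" + tmp_path)
        driver.save_screenshot(png_path)
    finally:
        driver.quit()
    return png_path


# Example: a trivially styled table.
df = pd.DataFrame({"SS1": [0.0, 3.4], "SS2": [1.2, 0.0]}, index=["LAP", "MIK"])
styled_table_to_png(df.style.highlight_max(axis=0), "stage_table.png")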

This post has some handy pointers, including reference to the following packages:

  • HTML2Canvas, which allows you to take “[s]creenshots with JavaScript” and export them as PNG files;
  • TableExport, which seems to work with jspdf (“the leading HTML5 client solution for generating PDFs”) and jsPDF-AutoTable (a “jsPDF plugin for generating PDF tables with javascript”) to allow you to export an HTML table as a PDF.

The html2canvas route is also demonstrated in this Stack Overflow answer to a query asking how to Download table as PNG using JQuery.

So now I’m wondering a couple of things about styled pandas tables in Jupyter notebooks.

Firstly, would some magic __repr__ extensibility that provides a button for exporting a styled pandas table as a PNG or a PDF be handy?

Secondly, how could the above recipes be woven into a simple end user developed ipywidgets powered app (such as the one I used to explore WRC stage charts) to provide a button to download a rendered, styled HTML table as a PNG or PDF?

Anyone know of such magic / extensions already out there? Or can anyone knock up a demo, ideally a Binderised one, showing me how to do it?
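For what it’s worth, the button part at least is straightforward. Here’s a minimal ipywidgets sketch, assuming an export helper along the lines of the hypothetical styled_table_to_png() function sketched earlier:

import ipywidgets as widgets
import pandas as pd
from IPython.display import display

df = pd.DataFrame({"SS1": [0.0, 3.4], "SS2": [1.2, 0.0]}, index=["LAP", "MIK"])
styled = df.style.highlight_max(axis=0)

btn = widgets.Button(description="Export table as PNG")
out = widgets.Output()

def on_export(_):
    with out:
        # styled_table_to_png() is the hypothetical helper from the earlier sketch.
        path = styled_table_to_png(styled, "stage_table.png")
        print("Saved", path)

btn.on_click(on_export)
display(btn, out)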

Coding for Graphics and a BBC House Style

Ever since I discovered the ggplot2 R graphics package, and read Leland Wilkinson’s The Grammar of Graphics, the book that underpins its design philosophy, that way of thinking has hugely influenced how I approach the creation of custom graphics.

The separation of visual appearance from the underlying graphical model is very powerful. The web works in a similar way: the same HTML coded website can be presented in a myriad of different ways by applying different CSS styles, without making any changes to the underlying HTML at all. Check out the CSS Zen Garden for over 200 examples of the same web content presented differently using CSS styling.

A recent blog post by the BBC Visual and Data Journalism team — How the BBC Visual and Data Journalism team works with graphics in R — suggests they had a similar epiphany:

In March last year, we published our first chart made from start to finish using ggplot2.

Since then, change has been quick.

ggplot2 gives you far more control and creativity than a chart tool and allows you to go beyond a limited number of graphics. Working with scripts saves a huge amount of time and effort, in particular when working with data that needs updating regularly, with reproducibility a key requirement of our workflow.

In short, it was a game changer, so we quickly turned our attention to how best to manage this newly-discovered power.

The approach they took was to create a cookbook in which to collect and curate useful recipes for creating particular types of graphic: you can find the BBC R Cookbook here.

The other thing they did was create a BBC News ggplot2 house style: bbplot.

At the OU, graphics production is still a cottage industry: academics produce badly drawn sketches that get given to artists, who return proper illustrations in a house style, which then need a tweak here or a tweak there communicated by the author making red pen annotations on printed out versions of the graphic, etc. etc. I keep on wondering why we don’t use powerful code based graphics packages for writing diagrams, and why we don’t have a house style designed for use with them.
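In the Python world, the same separation is available via matplotlib’s rcParams: a house style can be a small, versioned chunk of code that anyone applies with one line. The settings below are placeholders rather than an actual OU style, but they show the shape of the idea:

import matplotlib.pyplot as plt

# A hypothetical "house style", expressed as rcParams overrides.
HOUSE_STYLE = {
    "figure.figsize": (8, 4.5),
    "font.family": "sans-serif",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "axes.grid": True,
    "grid.alpha": 0.3,
}

plt.rcParams.update(HOUSE_STYLE)

# Any chart drawn after this point picks up the house style automatically.
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [3, 1, 4, 1])
ax.set_title("Same content, house styled")
plt.show()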

PS By the by, via Downes, a post on The Secret Weapon to Learning CSS, which references another post on Teaching a Correct CSS Mental Model, asks what are “the mental patterns: ways to frame the problem in our heads, so we can break problems into their constituent parts and notice recurring patterns” that help a CSS designer ply their craft. Which is to say: what are the key elements of The Grammar of Webstyle?

From Charts to Interactive Apps With Jupyter Notebooks and IPywidgets…

One of the nice things about Jupyter notebooks is that once you’ve got some script in place to generate a particular sort of graphic, you can very easily turn it into a parameterised, widgetised app that lets you generate chart views at will.

For example, here’s an interactive take on a  WRC chartable(?!) I’ve been playing with today. Given a function that generates a table for a given stage, rebased against a specified driver, it takes only a few lines of code and some very straightforward widget definitions to create an interactive custom chart generating application around it:
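The screenshot doesn’t reproduce here, but a minimal sketch of the pattern looks something like the following; the stage_table() and drivers_in_class() functions, the entry lists and the driver codes are made-up stand-ins for the real thing:

import ipywidgets as widgets
import pandas as pd
from IPython.display import display

# Hypothetical stand-ins for the real functions and entry lists.
ENTRIES = {"RC1": ["LAP", "MIK", "LOE"], "RC4": ["HUT", "KRI"]}

def drivers_in_class(rally_class):
    return ENTRIES.get(rally_class, [])

def stage_table(stage, driver):
    # In the real app this returns the styled stage table, rebased on the driver.
    return pd.DataFrame({"stage": [stage], "rebased_on": [driver]})

classes = widgets.Dropdown(options=list(ENTRIES), description="Class:")
drivers = widgets.Dropdown(options=drivers_in_class(classes.value), description="Driver:")
stages = widgets.Dropdown(options=["SS{}".format(i) for i in range(1, 9)], description="Stage:")

def on_class_change(change):
    # Repopulate the driver list whenever a new class is selected.
    drivers.options = drivers_in_class(change["new"])

classes.observe(on_class_change, names="value")

def show_table(stage, driver):
    display(stage_table(stage, driver))

display(classes)
widgets.interact(show_table, stage=stages, driver=drivers);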

In this case, I am dynamically populating the drivers list based on which class is selected. (Unfortunately, it only seems to work for RC1 and RC4 at the moment. But I can select drivers and stages within that class…)

It also struck me that we can add further controls to select which columns are displayed in the output chart:
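Building on the sketch above, a column picker can be as simple as an extra SelectMultiple widget passed to the same interact call (the column names here are made up):

cols = widgets.SelectMultiple(
    options=["roadPos", "previous", "overall", "stageTime"],  # hypothetical column names
    value=("overall",),
    description="Columns:",
)

def show_table_with_cols(stage, driver, columns):
    table = stage_table(stage, driver)
    # Only keep the requested columns that actually exist in the generated table.
    display(table[[c for c in columns if c in table.columns]])

widgets.interact(show_table_with_cols, stage=stages, driver=drivers, columns=cols);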

What this means is that we can easily create simple applications capable of producing a wide variety of customised chart outputs. Such a tool might be useful for a sports journalist wanting to use different sorts of table to illustrate different sorts of sport report.

Tinkering with Stage Charts for WRC Rally Sweden

Picking up on some doodles I did around the Dakar 2019 rally, a quick review of a couple of chart types I’ve been tweeting today…

First up is a chart showing the evolution of the rally over the course of the first day.

[Figure: overall_lap_ss8 stage chart]

This chart mixes metaphors a little…

The graphics relate directly to the driver identified by the row. The numbers are rebased relative to a particular driver (Lappi, in this case).

The first column, the stepped line chart, tracks overall position over the stages, the vertical bar chart next to it identifying the gap to the overall leader at the end of each stage (that is, how far behind the overall leader each driver is at the end of each stage). The green dot highlights that the driver was in overall lead of the rally at the end of that stage.

The SS_N_overall numbers represent rebased overall times. So we see that at the end of SS2, MIK was 9s ahead of LAP overall, and LAP was 13.1 seconds ahead of LOE. The stagePosition stepped line shows how the driver specified by each row fared on each stage. The second vertical bar chart shows the time that driver lost compared to the stage winner; again, a green dot highlights a nominal first position, in this case a stage win. The SS_N numbers are once again rebased times, this time showing how much time the rebased driver gained (green) or lost (red) compared to the driver named on that row.
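The rebasing itself is just a row subtraction. A minimal sketch of the idea, using made-up overall times and the three-letter driver codes:

import pandas as pd

# Hypothetical overall elapsed times (seconds) at the end of each stage, one row per driver.
overall = pd.DataFrame(
    {"SS1": [135.2, 133.9, 140.0],
     "SS2": [270.1, 261.1, 283.2]},
    index=["LAP", "MIK", "LOE"],
)

# Rebase against LAP: negative means that driver is ahead of LAP overall, positive means behind.
rebased = overall - overall.loc["LAP"]
print(rebased)
# At the end of SS2, MIK is 9.0s ahead of LAP and LOE is 13.1s behind.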

I still need to add a few channels into the stage report. The ones I have for WRC are still the basic ones without any inline charts, but the tables are a bit more normalised and I’d need to sit down and think through what I need to pull from where to best generate the appropriate rows and from them the charts…

Here’s a reminder of what a rebased single stage chart looks like: the first column is road position, the second the overall gap at the end of the previous stage. The first numeric columns show how far the rebased driver was ahead (green) or behind (red) each other driver at each split. The Overall* column is the gap at the end of the stage (I should rename this, dropping the Overall* or maybe replacing it with Final); then come overall position and the overall rally time delta (i.e. the column that takes on the role of the Previous column in the next stage). The DN columns are the time gained/lost going between split points, which often highlights any particularly good or bad parts of the stage. For example, in the chart above, rebased on Lappi, the first split was dreadful, but then he was fastest going between splits 1 and 2, and fared well 2-3 and 3-4.
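The DN columns fall out of the same kind of calculation: rebase the split times against the chosen driver, then difference across consecutive splits. Again a sketch, with made-up split times:

import pandas as pd

# Hypothetical elapsed times (seconds) at each split point on a single stage.
splits = pd.DataFrame(
    {"split_1": [112.4, 111.1, 112.0],
     "split_2": [245.9, 246.0, 246.5],
     "split_3": [398.1, 399.0, 400.2]},
    index=["LAP", "MIK", "LOE"],
)

# Gap to LAP at each split point (positive: behind LAP, negative: ahead).
rebased = splits - splits.loc["LAP"]

# Time gained/lost relative to LAP between consecutive split points:
# the change in the rebased gap from one split to the next.
section_deltas = rebased.diff(axis=1)
print(section_deltas)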

Not Quite Serverless Reproducible Environments With Digital Ocean

I’ve been on something of a code rollercoaster over the last few days, fumbling with Jupyter server proxy settings in MyBinder, and fighting with OpenRefine, but I think I’ve stumbled on a new pattern that could be quite handy. In fact, I think it’s frickin’ ace, even though it’s just another riff, another iteration, on what we’ve been able to do before (e.g. along the lines of deploy to tutum, Heroku or zeit.now).

(It’s probably nothing that wasn’t out there already, but I have seen the light that issues forth from it.)

One of the great things about MyBinder is that it helps make your work reproducible. There’s a good, practical review of why in this presentation: Reproducible, Reliable, Reusable Analyses with BinderHub on Cloud. One of the presentation’s claims is that it costs about $5000 a month to run MyBinder, covered at the moment by the project, but following the reference link I couldn’t see that number anywhere. What I did see, though, was something worth bearing in mind: “[MyBinder] users are guaranteed at least 1GB of RAM, with a maximum of 2GB”.

For OpenRefine running in MyBinder, along with other services, this shows why it may struggle at times…

So, how can we help?

And how can we get around not knowing what other stuff repo2docker, the build agent for MyBinder, might be putting into the server, or not being able to use Docker Compose to link across several services in the Binderised environment, or having to run MyBinder containers in public (although it looks as though evil Binder auth in general may now be available)?

One way would be for institutions to chip in the readies to help keep the public MyBinder service free. Another way could be a sprint on a federated Binderhub in which institutions could chip in server resource. Another would be for institutions to host their own Binderhub instances, either publicly available or just available to registered users. (In the latter case, it would be good if the institutions also contributed developer effort, code, docs or community development back to the Jupyter project as a whole.)

Alternatively, we can roll our own server. But setting up a Binderhub instance is not necessarily the easiest of tasks (I’ve yet to try it…) and isn’t really the sort of thing your average postgrad or data journalist who wants to run a Binderised environment should be expected to have to do.

So what’s the alternative?

To my mind, Binderhub offers a serverless sort of experience, though that’s not to say no servers are involved. The model is that I can pick a repo, click a button, a server is started, my image built and a container launched, and I get to use the resulting environment. Find repo. Click button. Use environment. The servery stuff in the middle — the provisioning of a physical web server and the building of the environment — that’s nothing I have to worry about. It’s seamless and serverless.

Another thing to note is that MyBinder use cases are temporary / ephemeral. Launch Binderised app. Use it. Kill it.

This stands in contrast to setting services running for extended periods of time, and managing them over that period, which is probably what you’d want to do if you ran your own Binderhub instance. I’m not really interested in that. I want to: launch my environment; use it; kill it. (I keep trying to find a way of explaining this “personal application server” position clearly, but not with much success so far…)

So here’s where I’m at now: a nearly, or not quite, serverless solution; a bring your own server approach, in fact, using Digital Ocean, which is the easiest cloud hosting solution I’ve found for the sorts of things I want to do.

It’s based around the User data text box in the Digital Ocean droplet creation page:

If you pop a shell script in there, it will run the code that appears in that box once the server is started.

But what code?

That’s the pattern I’ve started exploring.

Something like this:

#!/bin/bash

#Optionally:
export JUPYTER_TOKEN=myOwnPA5%w0rD

#Optionally:
#export REFINEVERSION=2.8

GIST=d67e7de29a2d012183681778662ef4b6
git clone https://gist.github.com/$GIST.git
cd $GIST
docker-compose up -d

which will grab a script (saved as a set of files in a public gist) that downloads and installs an OpenRefine server inside a token protected Jupyter notebook server (the OpenRefine server runs via a Jupyter server proxy; see also OpenRefine Running in MyBinder, Several Ways… for various ways of running OpenRefine behind a Jupyter server proxy in MyBinder).
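I haven’t reproduced the gist contents here, but the proxying part of that sort of setup typically comes down to a few lines of jupyter-server-proxy configuration in the notebook server’s config file; the OpenRefine path, port and flags below are assumptions rather than the values the gist actually uses:

# jupyter_notebook_config.py (sketch)
c.ServerProxy.servers = {
    "openrefine": {
        # {port} is substituted by jupyter-server-proxy when it launches the process.
        "command": ["/opt/openrefine/refine", "-p", "{port}", "-i", "0.0.0.0"],
        "port": 3334,
        "timeout": 120,
        "launcher_entry": {"title": "OpenRefine"},
    }
}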

Or this (original gist):

#!/bin/bash

#Optionally:
#export JUPYTER_TOKEN=myOwnPA5%w0rD

GIST=8fa117e34c62b7f80b6c595b8ba4f488

git clone https://gist.github.com/$GIST.git
cd $GIST

docker-compose up -d

that will download and install a docker-compose defined set of services:

  • a Jupyter notebook server, seeded with a demo notebook and various SQL magics;
  • a Postgres server (empty; I really need to add a fragment showing how to seed it with data (a rough sketch follows after this list), or you should be able to figure it out from here: Running a PostgreSQL Server in a MyBinder Container);
  • an AgensGraph server; AgensGraph is a graph database built on Postgres. The demo notebook currently uses the first part of the AgensGraph tutorial to show how to load data into it.
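In the meantime, here’s roughly what seeding the Postgres container from a notebook might look like; the hostname, credentials and database name are assumptions, so check the docker-compose.yml for the real values:

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection details: a compose service called "postgres" with default credentials.
# (Needs the psycopg2 driver installed alongside sqlalchemy.)
engine = create_engine("postgresql://postgres:postgres@postgres:5432/postgres")

# Push a small demo table into the database...
df = pd.DataFrame({"name": ["asha", "bram"], "score": [3, 5]})
df.to_sql("demo_scores", engine, if_exists="replace", index=False)

# ...and read it back to check it arrived.
print(pd.read_sql("SELECT * FROM demo_scores", engine))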

(The latter example includes a zip file that you can’t upload via the Gist web interface; so here’s a recipe for adding binary files (zip files, image files) to a Github gist.)

So what do you need to do to get the above environments up and running?

  • go to Digital Ocean (https://www.digitalocean.com/) (this link will get you $100 credit if you need to create an account);
  • create a droplet;
  • select ‘one-click’ type, and the docker flavour;
  • select a server size (get the cheap 3GB server and the demos will be fine);
  • select a region (or don’t); I generally go for London cos I figure it’s locallest;
  • check the User data check box and paste in one of the above recipes (make sure it starts with the hashbang line, #!/bin/bash, with no blank lines above it);
  • optionally name the image (for your convenience and lack of admin panel eyesoreness);
  • click create;
  • copy the IP address of the server that’s created;
  • after 3 or 4 minutes (it may take some time to download the app containers into the server), paste the IP address into a browser location bar;
  • when presented with the Jupyter notebook login page, enter the default token (letmein; or the token you added in the User data script, if you did), or use it to set a different password at the bottom of the login page;
  • use your apps…
  • destroy the droplet (so you don’t continue to pay for it).

If that sounds too hard / too many steps, there are some pictures to show you what to do in the Creating a Digital Ocean Docker Droplet section of this post.

It’s really not that hard…

Though it could be easier. For example, imagine a “deploy to Digital Ocean” button that took something like the form http://deploygist.digitalocean.com/GIST and that looked for user_data and maybe other metadata files (region, server size, etc.) to set a server running on your account and then redirect you to the appropriate webpage.

We don’t need to rely on just web clients either. For example, here’s a recipe for Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code.

The next thing I need to think about is preserving state. This looks like it may be promising in that regard? Filestash [docs].

For anyone still reading and still interested, here are some more reasons why I think this post is a useful one…

The linked gists are both based around Docker deployments (it makes sense to use Docker, I think, because a lot of hard to install software is already packaged in Docker containers), although they demonstrate different techniques:

  • the first (OpenRefine) demo extends a Jupyter container so that it includes the OpenRefine server; OpenRefine then hides behind the Jupyter notebook auth and is proxied using Jupyter server proxy;
  • the second (AgensGraph) demo uses Docker compose. The database services are headless and not exposed.

What I have tried to do in the docker-compose.yml and Dockerfile files is show a variety of techniques for getting stuff done. I’ll comment them more liberally, or write a post about them, when I get the chance. One thing I still need to do is a demo using nginx as a reverse proxy, with and without simple http auth. One thing I’m not sure how to do, if indeed it’s doable, is proxy services from a separate container using Jupyter server proxy; nginx would provide a way around that.

Adding Zip Files to Github Gists

Over the years, I’ve regularly used Github gists as a convenient place to post code fragments. Using the web UI, it’s easy enough to create new text files. But how do you add images, or zip files to a gist..?

…because there’s no way I can see of doing it using the web UI?

But a gist is just a git repo, so we should be able to commit binary files to it.

Ish via this gist — How to add an image to a gist — here’s a simple recipe. It requires that you have git installed on your computer…

#GISTID is the ID in the gist URL: https://gist.github.com/USERNAME/GISTID
GIST=GISTID

git clone https://gist.github.com/$GIST.git

#MYBINARYFILENAME is something like mydata.zip or myimage.png
cp MYBINARYPATH/MYBINARYFILENAME $GIST/

cd $GIST
git add MYBINARYFILENAME

git commit -m "MY COMMIT MESSAGE"

git push origin master
#If prompted, provide your Github credentials associated with the gist

Handy…