Tagged: jupyter

Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose

I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, any of the Apache Drill containers I tried wouldn’t fire up properly.

So I eventually (3am…:-( went for a simpler approach, synching data through a local directory on host.

The result is something that looks like this:

The Apache Drill container, and an Apache Zookeeper container to keep it in check, I found via Dockerhub. I also reused an official RStudio container. The Jupyter container is one I rolled for TM351.

The Jupyter and RStudio containers can both talk to the Apache Drill container, and both analysis apps have access to their own data folder mounted in an application folder in the current directory on host.The data folders mount into separate directories in the Apache Drill container. Both applications can query into data files contained in either data directory as viewable from Apache Drill.

This is far from ideal, but it works. (The structure is as suggested so that RStudio and Jupyter scripts can both be used to download data into a data directory viewable from the Apache Drill container. Another approach would be to mount a separate ./data directory and provide some means for populating it with data files. Alternatively, if the files already exist on host,  mounting the host data directory onto a /data volume in the Apache Drill container would work too.

Here’s the docker-compose.yaml file I’ve ended up with:

  image: dialonce/drill
    - 8047:8047
    - zookeeper
    -  ./notebooks/data:/nbdata
    -  ./R/data:/rdata

  image: jplock/zookeeper

  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
    - 35200:8888
    - ./notebooks:/notebooks/
    - drill:drill

  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
    - PASSWORD=letmein
  #default user is: rstudio
    - ./R:/home/rstudio
    - 8787:8787
    - drill:drill

If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch three linked containers: Jupyter notebook on localhost port 35200, RStudio on port 8787, and Apache Drill on port 8047. If the ./notebooks, ./notebooks/data, ./R and ./R/data subfolders don’t exist they will be created.

We can use the clients to variously download data files and run Apache Drill queries against them. In Jupyter notebooks, I used the pydrill package to connect. Note the hostname used is the linked container name (in this case, drill).

If we download data to the ./notebooks/data folder which is mounted inside the Apache Drill container as /nbdata, we can query against it.

(Note – it probably would make sense to used a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)

We can also query against that same data file from the RStudio container. In this case I used the DrillR package (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh..;-) but it uses the RJDBC package which expects to find java installed, rather than DBI, and java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without Java dependency... Thanks, Bob:-)

I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.

So , getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have docker installed:-)

Tinkering With Apache Drill – JOINed Queries Across JSON and CSV files

Coming down from a festival high, I spent the day yesterday jamming with code and trying to get a feel for Apache Drill. As I’ve posted before, Apache Drill is really handy for querying large CSV files.

The test data I’ve been using is Evan Odell’s 3GB Hansard dataset, downloaded as a CSV file but recast in the parquet format to speed up queries (see the previous post for details). I had another look at the dataset yesterday, and popped together some simple SQL warm up exercises in notebook (here).

Something I found myself doing was flitting between running SQL queries over the data using Apache Drill to return a pandas dataframe, and wrangling the pandas dataframes directly, following the path of least resistance to doing the thing I wanted to do at each step. (Which is to say, if I couldn’t figure out the SQL, I’d try moving into pandas; and if the pandas route was too fiddly, rethinking the SQL query! That said, I also noticed I had got a bit rusty with SQL…) Another pattern of behaviour I found myself falling into was using Apache Drill to run summarising queries over the large original dataset, and then working with these smaller, summary datasets as in-memory pandas dataframes. This could be a handy strategy, I think.

As well as being able to query large flat CSV files, Apache Drill also allows you to run queries over JSON files, as well directories full of similarly structured JSON or CSV files. Lots of APIs export data as JSON, so being able to save the response of multiple calls on a similar topic as uniquely named (and the name doesn’t matter..) flat JSON files in the same folder, and then run a SQL query over all of them simply by pointing to the host directory, is really appealing.

I explored these features in a notebook exploring UK Parliament Written Answers data. In particular, the notebook shows:

  • requesting multiple pages on the same from the UK Parliament data API (not an Apache Drill feature!)
  • querying deep into a single large JSON file;
  • running JOIN queries over data contained in a JSON file and a CSV file;
  • running JOIN queries over data contained in a JSON file and data contained in multiple, similarly structured, data files in the same directory.

All I need to do now is come up with a simple “up and running” recipe for working Apache Drill. I’m thinking: Docker compose and some linked containers: Jupyter notebook, RStudio, Apache Drill, and maybe a shared data volume between them?

Pondering a Jupyter Notebook “Diff”er Extension and Its Use as a Marking Tool

One of the problems with trivially using simple text based differencing tools on Jupyter notebooks is that the differencing is based on the content of the JSON file used to represent the notebook, rather than the content of the notebook cells themselves.

The nbdime tools (which I’ve commented on before) address his problem by reporting differences at the cell content level. (Hmm, I wonder: would a before/after image comparison view for charts also be useful (example)?

I still don’t really understand how Jupyter notebook checkpointing works (e.g. I think I’d like it to work so that I can force a save to a checkpoint, whilst the autosave updates the current notebook file) but I started wondering yesterday about a simple Jupyter extension that would compare the current (saved) notebook with the checkpointed version.

My first attempt can be found in this gist:

Clicking the appropriate button (I need to find a better icon?) launches the nbdiff-web service on a specified port and then pops open a tab comparing the current saved notebook file with the checkpointed version. Note there are several settings specific to my setup where the notebooks are running inside the TM351 VM:

  • guestport is the port inside the guest VM that the nbdiff-web service runs on
  • hostport is the port on the host machine that the nbdiff-web service port is mapped to
  • abspath is the path to the notebook root directory; ideally, this should be picked up by the script.

In testing, it seems my setup seems to have checkpoints being saved regularly and automatically, which means the diffing is not that useful… I maybe to have a “backup-save” or “version-save” option somewhere to force saves at particular times and then compare to those?

The code used for the extension was inspired by the printview extension code. Looking at that, I note how the defining YAML file makes it easy to set up the extension configuration options:

Which got me wondering… one of the issues we have had with marking student work specified in and completed as Jupyter notebooks is helping markers quickly navigate to the cells that students have modified so they can be marked. (Finding an efficient way of allowing markers to annotate and return scripts is another issue. Note that we still haven’t really experimented with nbgrader either.) So I wonder if the differencer can help?

For example, we could bundle a set of ‘provided’ notebooks in which assessments are defined within a marking extension, and then allow a marker to compare the student returned version of the notebook with the provided copy, using the option to hide unchanged cells and just highlight the ones that have been changed to include the student’s answers. Ideally, we’d also want the marker to be able to annotate those cells with marks and feedback, and return the annotated scrip to the student – who could then diff back to their submitted transcript to see what the marker had to say?

PS via @biztechpm, How to grade programming assignments on GitHub

Creating a Jupyter Bundler Extension to Download Zipped Notebook and HTML Files

In the first version of the TM351 VM, we had a simple toolbar extension that would download a zipped ipynb file, along with an HTML version of the notebook, so it could be uploaded and previewed in the OU Open Design Studio. (Yes, I know, it would have been much better to have an nbviewer handler as an ODS plugin, but the we don’t do that sort of tech innovation, apparently…)

Looking at updating the extension today for the latest version of Jupyter notebooks, I noticed the availability of custom bundler extensions that allow you to add additional tools to support notebook downloads and deployment (I’m not sure what deployment relates to?). Adding a new download option allows it to be added to the notebook Edit -> Download menu:

The extension is created as a python package:

# odszip/setup.py
from setuptools import setup

      description='Save Jupyter notebook and HTML in zip file with .nbk suffix',

# Copyright (c) The Open University, 2017
# Copyright (c) Jupyter Development Team.

# Distributed under the terms of the Modified BSD License.
# Based on: https://github.com/jupyter-incubator/dashboards_bundlers/

import os
import shutil
import tempfile

def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
                'name': 'odszip_download',
                'label': 'ODSzip (.nbk)',
                'module_name': 'odszip.download',
                'group': 'download'

def make_download_bundle(abs_nb_path, staging_dir, tools):
	Assembles the notebook and resources it needs, returning the path to a
	zip file bundling the notebook and its requirements if there are any,
	the notebook's path otherwise.
	:param abs_nb_path: The path to the notebook
	:param staging_dir: Temporary work directory, created and removed by the caller
	# Clean up bundle dir if it exists
	shutil.rmtree(staging_dir, True)
	# Get name of notebook from filename
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	# Add the notebook
	shutil.copy2(abs_nb_path, os.path.join(staging_dir, notebook_basename))
	# Include HTML version of file
	cmd='jupyter nbconvert --to html "{abs_nb_path}" --output-dir "{staging_dir}"'.format(abs_nb_path=abs_nb_path,staging_dir=staging_dir)

	zip_file = shutil.make_archive(staging_dir, format='zip', root_dir=staging_dir, base_dir='.')
	return zip_file

def bundle(handler, model):
	Downloads a notebook as an HTML file and zips it with the notebook
	# Based on https://github.com/jupyter-incubator/dashboards_bundlers
	abs_nb_path = os.path.join(
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	tmp_dir = tempfile.mkdtemp()

	output_dir = os.path.join(tmp_dir, notebook_name)
	bundle_path = make_download_bundle(abs_nb_path, output_dir, handler.tools)
	handler.set_header('Content-Disposition', 'attachment; filename="%s"' % (notebook_name + '.nbk'))
	handler.set_header('Content-Type', 'application/zip')
	with open(bundle_path, 'rb') as bundle_file:


	# We read and send synchronously, so we can clean up safely after finish
	shutil.rmtree(tmp_dir, True)

We can then create the python package and install the extension, remmebering to restart the Jupyter server for the extension to take effect.

#Install the ODSzip extension package
pip3 install --upgrade --force-reinstall ./odszip

#Enable the ODSzip extension
jupyter bundlerextension enable --py odszip.download  --sys-prefix

Experimenting With Sankey Diagrams in R and Python

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R, or python/pandas, so it was only natural (?!) that I though I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyporfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitred HTML output is here.

The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleViz Sankey diagram generator produces an output that uses default colours to label the nodes.

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s orignal (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankey widget. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs than can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…

Reporting in a Repeatable, Parameterised, Transparent Way

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset.You can see a rendered example of the notebook here (download).

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.

Whilst I find the Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio and in particular the knitr tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.


For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive leaflet map, or a static PDF document.

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.

The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation).

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:

  • to look for sensible ways of comparing the workflow associated with each,
  • the ramp-up skills required, and blockers (including cultural blockers (also administrative / organisational blockers, h/t @dasbarrett)) associated with getting started with new tools such as Jupyter or RStudio, and
  • the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, and what sorts of issue would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.