When Less is More: Data Tables That Make a Difference

In the previous post, From Visual Impressions to Visual Opinions, I gave various examples of charts that express opinions. In this post, I’ll share a few examples of how we can take a simple data table and derive multiple views from it that each provide a different take on the same story (or does that mean each tells a different story from the same set of "facts"?).

Here’s the original, base table, showing the recorded split times from a single rally stage. The time is the accumulated stage time to each split point (i.e. the elapsed stage time you see for a driver as they reach each split point):

From this, we immediately note the ordering (more on this in another post), which does not seem useful. It is, in fact, the road order (i.e. the order in which each driver started the stage).

We also note that the final split is not the actual final stage time: the final split in this case was a kilometer or so before the stage end. So from the table, we can’t actually determine who won the stage.

Making a Difference

The times presented are the actual split times. But one thing we may be more interested in is the differences between drivers, to see how far ahead of, or behind, another driver a particular driver was at a given point. We can subtract one driver’s time from another’s to find this difference. For example, how did the times at each split compare to those of Ogier (OGI), who was first on the road?

Note that we can “rebase” the table relative to any driver by subtracting the required driver’s row from every other row in the original table.

From this “rebased” table, which has fewer digits (less ink) in it than the original, we can perhaps more easily see who was in the lead at each split, specifically, the person with the minimum relative time. The minimum value is trivially the most negative value in a column (i.e. at each split), or, if there are no negative values, zero (the reference driver’s own value).
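As a minimal sketch of the rebasing step, here is one way it might look in plain Python. The driver codes are borrowed from the stage, but the split times are made-up values for illustration only, and the `rebase` and `leaders` helper names are my own assumptions rather than anything from the original analysis.

```python
# Hypothetical accumulated split times in seconds (made-up values, not real results).
# Rows are in road order; each list holds the elapsed time at successive splits.
split_times = {
    "OGI": [120.5, 250.1, 371.0],
    "TAN": [119.8, 248.9, 369.2],
    "SOL": [121.4, 250.0, 369.9],
}


def rebase(times, driver):
    """Subtract one driver's row from every row (including their own)."""
    base = times[driver]
    return {
        code: [round(t - b, 1) for t, b in zip(row, base)]
        for code, row in times.items()
    }


def leaders(rebased):
    """Return the driver with the minimum relative time at each split."""
    n_splits = len(next(iter(rebased.values())))
    return [
        min(rebased, key=lambda code: rebased[code][i])
        for i in range(n_splits)
    ]


rebased_to_ogi = rebase(split_times, "OGI")
split_leaders = leaders(rebased_to_ogi)
```

The reference driver’s row collapses to zeroes, and the leader at each split is simply whoever has the smallest value in that column.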

As well as subtracting one row from every other row to find the differences relative to a specified driver, we can also subtract the first column from the second, the second from the third, etc., to find the time it took to get from one split point to the next (for the first split point, we subtract 0, since the elapsed time into the stage at the start of the stage is 0 seconds).

The above table shows the time taken to traverse the distance from one split point to the next; the extra split_N column is based on the final stage time. Once again, we could subtract one row from all the other rows to rebase these times relative to a particular driver to see the difference in time it took each driver to traverse a split section, relative to a specified driver.
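The column differencing might be sketched as follows. Again, the times are hypothetical values I have made up, with the last entry in each row standing in for the final stage time, and `section_times` is an illustrative helper name.

```python
def section_times(elapsed):
    """Difference consecutive elapsed times; the first section starts from 0 s."""
    previous = [0.0] + elapsed[:-1]
    return [round(t - p, 1) for t, p in zip(elapsed, previous)]


# Hypothetical elapsed times (seconds) at three splits, plus the final stage time.
elapsed = {
    "OGI": [120.5, 250.1, 371.0, 402.3],
    "TAN": [119.8, 248.9, 369.2, 400.1],
}

sections = {code: section_times(row) for code, row in elapsed.items()}
```

Each row of `sections` now holds the time taken on each split section, and summing a row recovers that driver’s stage time.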

As well as rebasing relative to an actual driver, we can also rebase relative to variously defined “ultimate” drivers. For example, if we find the minimum of each of the “split traverse” table columns, we create a dummy driver whose split section times represent the ultimate quickest times taken to get from one split to the next. We can then subtract this dummy row from every row of the split section times table:

In this case, the 0 in the first split tells us who got to the first split first, but then we lose information (without further calculation) about anything other than relative performance on each split section traverse. Zeroes in the other columns tell us who completed that particular split section traverse in the quickest time.

Another class of ultimate time dummy driver is the accumulated ultimate section time driver. That is, take the ultimate split sections then find the cumulative sum of them. These times then represent the dummy elapsed stage times of an ultimate driver who completed each split in the fastest split section time. If we rebase against that dummy driver:

In this case, there may be only a single 0, specifically at the first split.

A third possible ultimate dummy driver is the one who “as if” recorded the minimum actual elapsed time at each split. Again, we can rebase according to that driver:

In this case, there will be at least one zero in each column (for the driver who recorded the fastest elapsed time at that split).
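To make the three “ultimate” dummy drivers concrete, here is a small sketch using made-up elapsed times (chosen so the arithmetic is exact). The helper names are my own, and this is just one way of laying out the calculation.

```python
from itertools import accumulate

# Hypothetical elapsed times in seconds at each split (made-up values).
elapsed = {
    "OGI": [120.0, 250.0, 371.0],
    "TAN": [119.0, 249.5, 369.0],
    "SOL": [121.0, 249.0, 369.5],
}


def to_sections(row):
    """Convert elapsed times to split section times."""
    return [t - p for t, p in zip(row, [0.0] + row[:-1])]


def column_min(table):
    """Minimum of each column across all drivers."""
    return [min(col) for col in zip(*table.values())]


def rebase_to(table, dummy):
    """Subtract a dummy driver's row from every row in the table."""
    return {
        code: [t - d for t, d in zip(row, dummy)]
        for code, row in table.items()
    }


sections = {code: to_sections(row) for code, row in elapsed.items()}

# 1. Ultimate split section driver: quickest time on each section.
ultimate_sections = column_min(sections)
# 2. Accumulated ultimate driver: running total of the quickest sections.
accumulated_ultimate = list(accumulate(ultimate_sections))
# 3. Best elapsed driver: quickest actual elapsed time at each split.
best_elapsed = column_min(elapsed)

vs_ultimate_sections = rebase_to(sections, ultimate_sections)
vs_accumulated = rebase_to(elapsed, accumulated_ultimate)
vs_best_elapsed = rebase_to(elapsed, best_elapsed)
```

With these numbers, `vs_accumulated` has a single zero (TAN at the first split), whereas `vs_best_elapsed` has a zero in every column.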

Visualising the Difference

Viewing the above tables as purely numerical tables is fine as far as it goes, but we can also add visual cues to help us spot patterns, and different stories, more readily.

For example, looking at times rebased to the ultimate split section dummy driver, we get the following:

We see that SOL was flying from the second split onwards, getting from one split to another in pretty much the fastest time after a relatively poor start.

The variation in columns may also have something interesting to say. SOL somehow made time against pretty much everyone between splits 4 and 5, but in the other sections (apart from the short last section to the finish), there is quite a lot of variability. Checking this view against a split sectioned route map might help us understand whether there were particular features of the route that might explain these differences.

How about if we visualise the accumulated ultimate split section time dummy driver?

Here, we see that TAN was recording the best time compared to the ultimate time as calculated against the sum of best split section times, but was still off the ultimate pace: it was his first split that made the difference.

How about if we rebase against the dummy driver that represents the driver with the fastest actual recorded accumulated time at each split:

Here, we see that TAN led the stage at each split point based on actual accumulated time.

Remember, all these stories were available in the original data table, but sometimes it takes a bit of differencing to see them clearly…

From Visual Impressions to Visual Opinions

In The Analytics Trap I scribbled some notes on how I like using data not as a source of "truth", but as a lens, or a perspective, from a particular viewpoint.

One idea I’ve increasingly noticed being talked about explicitly across various software projects I follow is the idea of opinionated software and opinionated design.

According to the Basecamp bible, Getting Real, [th]e best software takes sides. … [Apps should] have an attitude. This seems to lie at the heart of opinionated design.

A blog post from 2015, The Rise of Opinionated Software, presents a widely shared definition: Opinionated Software is a software product that believes a certain way of approaching a business process is inherently better and provides software crafted around that approach. Other widely shared views relate to software design: opinionated software should have "a view" on how things are done and should enforce that view.

So this idea of opinion is perhaps one we can riff on.

I’ve been playing with data for years, and one of the things I’ve believed, throughout, in my opinionated way, is that it’s an unreliable and opinionated witness.

In the liminal space between wake and sleep this morning, I started wondering about how visualisations in particular could range from providing visual impressions to visual opinions.

For example, here’s a view of a rally stage, overlaid onto a map:

This sort of thing is widely recognisable to anybody who has used an online map, and to anyone who has seen a printed map and drawn a route on it.

Example interactive map view

Here’s a visual impression of just the route:

View of route

Even this view is opinionated because the co-ordinates are projected to a particular co-ordinate system, albeit the one we are most familiar with when viewing online maps; but other projections are available.

Now here’s a more opinionated view of the route, with it cut into approximately 1km segments:

Or the chart can express an opinion about where it thinks significant left and right hand corners are:

The following view has strong opinions about how to display each kilometer section: not only does it make claims about where it thinks significant right and left corners are, it also rotates each segment so the start and end points of the section lie on the same horizontal line:

Another viewpoint brings in another dimension: elevation. It also transforms the flat 2D co-ordinates of each point along the route to a 1-D distance-along-route measure allowing us to plot the elevation against a 1-D representation of the route in a 2D (1D!) line chart.
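The flattening of 2D route co-ordinates onto a 1-D distance-along-route measure can be sketched as a running sum of straight-line step lengths. This assumes the co-ordinates are already in a projected, metre-based system; the route points and elevations below are made up, and `distance_along_route` is an illustrative name.

```python
from itertools import accumulate
from math import hypot


def distance_along_route(points):
    """Cumulative planar distance to each point, starting at 0."""
    steps = [
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]
    return [0.0] + list(accumulate(steps))


# Hypothetical projected route co-ordinates (metres) and elevations (metres).
route = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0), (11.0, 16.0)]
elevation = [100.0, 101.5, 103.0, 102.0]

# Pairs of (distance into stage, elevation), ready for a line chart.
profile = list(zip(distance_along_route(route), elevation))
```

Even here there is an opinion baked in: the distance is measured along straight segments between sample points, so a coarser sampling of the route gives a shorter stage.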

Again, the chart expresses an opinion about where the significant right and left corners are. The chart also chooses not to be more helpful than it could be: if vertical grid lines corresponded to the start and end distance-into-stage values for the segmented plots, it would be easier to see how this chart relates to the 1km segmented sections.

At this point, you may say that the points are "facts" from the data, but again, they really aren’t. There are various ways of trying to define the intensity of a turn, and there may be various ways of calculating any particular measure that give slightly different results. Many definitions rely on particular parameter settings (for example, if you measure radius of curvature from three points on a route, how far apart should those points be? 1m? 10m? 20m? 50m?).
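To see how much a parameter choice can matter, here is a sketch of one possible radius of curvature measure: the radius of the circle through three route points (the circumradius). This is only one of several possible definitions, not necessarily the one used for the charts, and the route below is a made-up right-angle kink sampled every 10m.

```python
from math import hypot, inf


def circumradius(p1, p2, p3):
    """Radius of the circle through three points (inf if they are collinear)."""
    a = hypot(p2[0] - p3[0], p2[1] - p3[1])
    b = hypot(p1[0] - p3[0], p1[1] - p3[1])
    c = hypot(p1[0] - p2[0], p1[1] - p2[1])
    # The cross product gives twice the signed area of the triangle.
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    if cross == 0:
        return inf
    # R = abc / (4 * area), with area = |cross| / 2.
    return a * b * c / (2 * abs(cross))


def radius_at(points, i, k):
    """Radius at sample i, using the samples k steps either side of it."""
    return circumradius(points[i - k], points[i], points[i + k])


# Hypothetical route sampled every 10 m, with a right-angle kink at index 2.
route = [(-20.0, 0.0), (-10.0, 0.0), (0.0, 0.0), (0.0, 10.0), (0.0, 20.0)]

r_10m = radius_at(route, 2, 1)  # neighbours 10 m away
r_20m = radius_at(route, 2, 2)  # neighbours 20 m away
```

Same corner, same data, but doubling the spacing doubles the reported radius (roughly 7m versus roughly 14m), which would be enough to shift a corner from one severity category to another.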

The "result" is only a "fact" insofar as it represents the output of a particular calculation of a particular measure using a particular set of parameters, things that are typically not disclosed in chart labels, often aren’t mentioned in chart captions, and may or may not be disclosed in the surrounding text.

On the surface, the chart is simply expressing an opinion about how tight any of the particular corners are. If we take it at face value, and trust its opinion is based on reasonable foundations, then we can accept (or not accept) the chart’s opinion about where the significant turns are.

If we were really motivated to understand the chart’s opinion further, if we had access to the code that generated it we could start to probe its definition of "significant curvature" to see if we agree with the principles on which the chart has based its opinion. But in most cases, we don’t do that. We take the chart for what it is, typically accept it for what it appears to say, and ascribe some sort of truth to it.

But at the end of the day, it’s just an opinion.

The charts were generated using R based on ideas inspired by Visualising WRC Rally Stages With rayshader and R [repo].

Thinks Another: Using Spectrograms to Identify Stage Wiggliness?

Last night I started wondering about ways in which I might be able to use signal processing (Fourier analysis) or symbolic dynamics (eg Thinks: Symbolic Dynamics for Categorising Rally Stage Wiggliness?) to help categorise the nature of rally stage twistiness.

Over a morning coffee break, I reminded myself of spectrograms, graphical devices that chunk a time series into a sequence of steps, and then display a frequency plot of each part. Which got me wondering: could I use a spectrogram to segment a stage route and analyse the spectrum of some signal taken along the route to identify wiggliness at that part of the stage?
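As a rough illustration of the chunk-then-transform idea, here is a deliberately naive sketch in plain Python. It uses non-overlapping windows and a direct discrete Fourier transform, whereas real spectrogram functions typically use overlapping, tapered windows and an FFT. The test signal is synthetic: a slow wiggle followed by a faster one.

```python
import cmath
import math


def dft_magnitudes(chunk):
    """Magnitude of each frequency bin (0 to n/2) for one chunk of signal."""
    n = len(chunk)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(chunk)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, window):
    """Split the signal into non-overlapping windows and transform each one."""
    return [
        dft_magnitudes(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, window)
    ]


# Synthetic signal: one cycle per window, then four cycles per window.
window = 16
signal = [math.sin(2 * math.pi * 1 * i / window) for i in range(window)]
signal += [math.sin(2 * math.pi * 4 * i / window) for i in range(window)]

spec = spectrogram(signal, window)
# Index of the strongest frequency bin in each window.
dominant = [max(range(len(m)), key=m.__getitem__) for m in spec]
```

The dominant bin moves from 1 to 4 between the two windows, which is the sort of change in “wiggliness” I’d hope a spectrogram along a stage route might reveal.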

If I’m reading it right [I wasn’t… the distances were wrong for a start: note to self – check the default parameter settings!], I think the following spectrogram does show some possible differences in wiggliness for different segments along the stage?

Image

The question then becomes: what signal (as a function of distance along line) to use? The above spectrogram is based on the perpendicular distance of the route from the straight line connecting the start and end points of the route.

# trj is a trajr route
straight = st_linestring(data.matrix(rbind(head(trj[,c('x','y')], 1),
                                           tail(trj[,c('x','y')], 1))))

straight_sf = st_sfc(straight,
                     crs=st_crs(utm_routes))

trj_d = TrajRediscretize(trj, 10)
utm_discretised = trj_d %>% 
                    sf::st_as_sf(coords = c("x","y")) %>% 
                    sf::st_set_crs(st_crs(utm_routes[route_index,]))

# Get the rectified distance from the midline
# Can we also get whether it's to left or right?
perp_distances = data.frame(d_ = st_distance(utm_discretised,
                                             straight_sf))
# Returned distance is given as units
perp_distances$d = as.integer(perp_distances$d_)

perp_distances$i = 10 * (1:nrow(perp_distances))
#perp_distances$i = units::set_units(10 * (1:nrow(perp_distances)), 'm')
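On the question in the comment above about whether we can also tell left from right: one possible approach, sketched here in plain Python rather than with sf, is to use the sign of a 2D cross product. Note the assumption that this measures the offset from the infinite line through the start and end points, which can differ from the distance to the line segment for points that lie beyond either end. The points are made-up values.

```python
from math import hypot


def signed_offset(point, start, end):
    """Signed perpendicular distance of a point from the line start -> end.

    Positive values lie to the left of the direction of travel,
    negative values to the right.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    px, py = point[0] - start[0], point[1] - start[1]
    return (dx * py - dy * px) / hypot(dx, dy)


# Hypothetical route points (metres) between a start and an end point.
start, end = (0.0, 0.0), (100.0, 0.0)
route = [(0.0, 0.0), (10.0, 5.0), (20.0, -3.0), (100.0, 0.0)]

offsets = [signed_offset(p, start, end) for p in route]
```

A signed signal like this would preserve the side-to-side swing of the route rather than rectifying it.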

We can then do something like a high pass filter:

library(signal)

# High pass filter
bf <- butter(2, 0.9, type="high")
perp_distances$d_hi <- filter(bf, perp_distances$d)

and generate the spectrogram shown above:

# We could just plot this direct
spec = specgram(perp_distances$d_hi)

# Or make pretty
# Via:https://hansenjohnson.org/post/spectrograms-in-r/
library(oce)
# discard phase information
P = abs(spec$S)

# normalize
P = P/max(P)

# convert to dB
P = 10*log10(P)

# config time axis
t = spec$t

# plot spectrogram
imagep(x = t,
       y = spec$f,
       z = t(P),
       col = oce.colorsViridis,
       ylab = 'Frequency [Hz]',
       xlab = 'Time [s]',
       drawPalette = T,
       decimate = F
)

However, it would possibly make more sense to use something like the angle of turn, convexity index, or radius of curvature at each 10m step as the signal…
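For the angle of turn option, a minimal sketch might take the change in heading at each interior point of the (rediscretised) route. The function name and the sample points are my own illustrative assumptions.

```python
from math import atan2, degrees


def turning_angles(points):
    """Signed change of heading (degrees) at each interior point.

    Positive values are left turns, negative values are right turns.
    """
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        heading_in = atan2(y1 - y0, x1 - x0)
        heading_out = atan2(y2 - y1, x2 - x1)
        turn = degrees(heading_out - heading_in)
        # Wrap into the range [-180, 180).
        turn = (turn + 180) % 360 - 180
        angles.append(turn)
    return angles


# Hypothetical route: a 90 degree left turn followed by a 90 degree right turn.
route = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (20.0, 10.0)]
turns = turning_angles(route)
```

With a fixed step length, the sequence of turning angles is a signal indexed by distance along the route, which is what the spectrogram needs.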

Hmmm…


Related: Rapid ipywidgets Prototyping Using Third Party Javascript Packages in Jupyter Notebooks With jp_proxy_widget (example of a wavesurfer.js spectrogram js app widgetised for use in Jupyter notebooks).

If you listen to that track it’s really interesting seeing how the imagery maps onto the sound. Eg in the above image you can see a lag in an edge between right and left channels towards the end of the trace, which translates to hearing an effect in the left channel echoed a moment later in the right.

Which makes me think: could I use telemetry from two drivers as left and right stereo tracks and try to sonify the telemetry differences between them using distance along stage as the x axis value and some mapping of different telemetry channels onto frequency…? For example, brake on the bass, throttle at the top, and lateral acceleration in the mid-range?

Automatically Detecting Corners on Rally Stage Routes Using R

One of the things I’ve started pondering with my rally stage route metrics is the extent to which we might be able to generate stage descriptions of the sort you might find on the It Gets Faster Now blog. The idea wouldn’t necessarily be to create finished stage descriptions, more a set of notes that a journalist or fan could use as a prompt to create a more relevant description. (See these old Notes on Robot Churnalism, Part I – Robot Writers for a related discussion.)

So here’s some sketching related to that: identifying corners.

We can use the rLFT (Linear Feature Tools) R package to calculate a convexity measure at fixed sample points along a route (for a fascinating discussion of the curvature/convexity metric, see Albeke, S.E. et al. Measuring boundary convexity at multiple spatial scales using a linear ‘moving window’ analysis: an application to coastal river otter habitat selection. Landscape Ecology 25 (2010): 1575-1587).

By filtering on high absolute convexity sample points, we can do a little bit of reasoning around the curvature at each point to make an attempt at identifying the start of a corner:

library(rLFT)

stepdist = 10
window = 20
routeConvTable <- bct(utm_routes[1,],
                      # distance between measurements 
                      step = stepdist,
                      window = window, ridName = "Name")

head(routeConvTable)

We can then use the convexity index to highlight the sample points with a high convexity index:

corner_conv = 0.1

tight_corners = routeConvTable[abs(routeConvTable$ConvexityIndex)>corner_conv,]
tight_corners_zoom1 = tight_corners$Midpoint_Y>4964000 & tight_corners$Midpoint_Y<4965000

ggplot(data=trj[zoom1, ],
       aes(x=x, y=y)) + geom_path(color='grey') + coord_sf() +
  geom_text(data=tight_corners[tight_corners_zoom1,],
                           aes(label = ConvexityIndex,
                               x=Midpoint_X, y=Midpoint_Y),
                           size=2) +
  geom_point(data=tight_corners[tight_corners_zoom1,],
             aes(x=Midpoint_X, y=Midpoint_Y,
                 color= (ConvexityIndex>0) ), size=1) +
  theme_classic()+
  theme(axis.text.x = element_text(angle = 45))
High convexity points along a route

We can now do a bit of reasoning to find the start of a corner (see Automatically Generating Stage Descriptions for more discussion about the rationale behind this):

cornerer = function (df, slight_conv=0.01, closeby=25){
  df %>%
    mutate(dirChange = sign(ConvexityIndex) != sign(lag(ConvexityIndex))) %>%
    mutate(straightish =  (abs(ConvexityIndex) < slight_conv)) %>%
    mutate(dist =  (lead(MidMeas)-MidMeas)) %>%
    mutate(nearby =  dist < closeby) %>%
    mutate(firstish = !straightish & 
                        ((nearby & !lag(straightish) & lag(dirChange)) |
                        # We don't want the previous node nearby
                        (!lag(nearby)) )  & !lag(nearby) )
}

tight_corners = cornerer(tight_corners)

Let’s see how it looks, labeling the points as we do so with the distance to the next sample point:

ggplot(data=trj[zoom1,],
       aes(x=x, y=y)) + geom_path(color='grey') + coord_sf() +
  ggrepel::geom_text_repel(data=tight_corners[tight_corners_zoom1,],
                           aes(label = dist,
                               x=Midpoint_X, y=Midpoint_Y),
                           size=3) +
  geom_point(data=tight_corners[tight_corners_zoom1,],
             aes(x=Midpoint_X, y=Midpoint_Y,
                 color= (firstish) ), size=1) +
  theme_classic()+
  theme(axis.text.x = element_text(angle = 45))
Corner entry

In passing, we note we can identify the large gap distances as "straights" (and then perhaps look for lower convexity index corners along the way we could label as "flowing" corners, perhaps).
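Picking out straights from the gaps might be sketched like this. The positions are made-up distances along the route (in metres) of the high convexity sample points, and the `min_gap` threshold is a hypothetical parameter that would need tuning.

```python
def find_straights(corner_positions, min_gap=200):
    """Return (start, end) pairs where consecutive corner points are far apart."""
    return [
        (a, b)
        for a, b in zip(corner_positions, corner_positions[1:])
        if b - a >= min_gap
    ]


# Hypothetical distances along the route (metres) of tight corner sample points.
corner_positions = [120, 130, 140, 520, 530, 900]
straights = find_straights(corner_positions)
```

As with everything else here, whether 200m counts as a "straight" is an opinion, not a fact.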

Something else we might do is number the corners:

Numbered corners
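One possible way of numbering the corners, sketched here in Python with made-up flags, is to take a running count of the "corner start" markers so that every sample point in a corner shares that corner’s number. This is an assumption about how the numbering could be done, not a description of how the figure was produced.

```python
from itertools import accumulate

# Hypothetical "is this the first point of a corner?" flags, in route order.
firstish = [False, True, False, False, True, False, True]

# Running count of corner starts seen so far.
running = list(accumulate(int(flag) for flag in firstish))

# Points before the first corner start get no number.
corner_number = [n if n > 0 else None for n in running]
```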

There’s all sorts of fun to be had here, I think!

Personally Learning

Notes and reflections on a curiosity driven personal learning journey into geo and rasters and animal movement trajectory categorisation and all sorts of things that weren’t the point when I started…

Somewhen over the last month or so, I must have noticed a 3D map produced using the rayshader R package somewhere because I idly started wondering about whether I could use it to render a 3D rally stage map.

Just under three weeks ago, I started what was intended to be a half hour hack to give it a go, and it didn’t take too long to get something up and running…

Rally stage map rendered using rayshader

I then started tinkering a bit more and thinking about what else we might be able to do with linear geographies, such as generating elevation along route maps, for example, and also started keeping notes on various useful bits and bobs along the way: some notes on how geographic projections work, for example (which has been something of a blocker to me in the past) or how rasters work and how to process them.

I also had to try to get my head around R again (it’s been several years since I last used it) and started pondering about a useful way to structure my notes and then publish them somewhere: bookdown was the obvious candidate as I was working in RStudio (after a work burn out over the last 9 months of last year, I seem to have developed a sick-in-the-stomach allergic reaction to Jupyter notebooks, Python, VS Code and Javascript; they really are physically revolting / nausea inducing to me).

I use code just as a matter of course for all sorts of things, pretty much every day, and also use it recreationally, so R has provided a handy escape route for my code related urges (maybe I should pick up the opportunity to learn something new? The issue is, I tend to be task focussed when it comes to my personal learning, so I’d need to use a language that somehow made sense for a practical thing I want to achieve…)

Anyway, the rally route thing quickly turned into a curiosity driven learning journey: how could I render a raster in a ggplot, could I overlay tiles on a 3D rendered map:

Could I generate a ridge plot?

Ridge plot

Could I buffer a route and use it to crop a 3D model?

Could we convert an image to an elevation raster?

And so on..

When poking around looking for ideas about how to characterise how twisty or turny a route was, I stumbled across sinuosity as a metric, and from that idea quickly discovered a range of R packages that implement tools to characterise animal movement trajectories which we can easily apply to rally stage routes.

Enriching maps with data pulled in from OpenStreetMap also suggests how we might be able to generate maps that might be useful in event planning (access roads, car parks, viewpoints, etc); and casting routes onto road networks (graph representations of road networks; think osmnx in Python, for example) made me wonder if I’d be able to generate road books and tulip maps (answer: not yet, at least…).

I’ve written my learning journey from the last 20 days or so up at RallyDataJunkie: Visualising Rally Stages; the original repo is here. A summary of topics is in the previous blog post: Visualising Rally Route Stages (with help from rayshader and some ecologists…).

Reflecting on what I’ve ended up with, the structure is partly reflective of the journey I followed, but it’s also a bit revisionist. The original motivation was the chapter on rendering 3D stage maps; to do this I needed to get a sense of what I could do with 2D rayshader maps first (the 3D plot is just a change in the plot command from plot_map() to plot_3d()), and to do that properly I had to get better at working with route files and elevation matrices. Within the earlier chapters, I do try to follow the route I took learning about each topic, rather than presenting things in the order an academic treatment or traditional teaching route might follow: the point of the resource is not to “teach” linear geo hacking in a formal way, it’s a report of how I learned it, with some backdropped “really useful to know this” pointers added back to earlier stages as I realised I needed them for later things.

Something else you may note about the individual chapters is that there are chunks of repetition of code from earlier on: this is deliberate. The book is a personal reference manual for me, so when I refer back to it for how to do something in the future, there’s enough to get going (hopefully!) without having to keep referring explicitly to too many early chapters.

Another observation: I see this sort of personal learning as availing myself of (powerful) ideas or techniques that are out there that other people have already figured out, ideas or tools or techniques that can help me do a particular thing that I want to do, or make sense of a particular thing that I can’t get my head round (i.e. that help me (help myself) understand the how or the why of a particular thing). I don’t want to be taught. I want enough that I can use and learn from. In my personal learning journey, I’ll come to see why some things that were really handy or useful to help me get started may not be the best way of doing something as I get more skilled, but the more advanced idea would have hindered my learning journey if it had been forced on me. (When I see a new function with dozens of parameters, I strip it down to what I think is all I need to get it to work, then start to perhaps add parameters back in…)

As teachers, we are often sort of experts, and follow a teaching line based on our expert knowledge, and what we know is good for folk to know foundationally, or that follows a canonical textbook line. But as a curiosity driven personal learner, I chase all manner of desire lines, sometimes having to go around an obstacle I can’t yet figure out, sometimes having to retrace my steps, sometimes having to go back to the beginning to collect something I didn’t realise I’d actually need.

I don’t care about the canon or the curriculum. I want to know how, then I want to know why, and at some point I may come to understand “oh yeah, it would have been really handy to have known that at the start”. But whilst teaching is often about making sure everyone is prepared at each step for the step that comes next, learning for me is about heading out into the unknown and picking up stuff that’s useful as I find I need it. And that includes picking up on the theory.

For example, Finding the Racing Line collates a set of very mathematical references around finding optimal racing lines that I’ll perhaps pick into for nudges and examples and blind copying without understanding at times if it helps once I start to try to get my head round the lines rally drivers take round corners. Then I’ll go back to the pictures and equations and try to make sense of it once I’ve got to a position where things maybe work (eg visualised possible routes round a corner) but can I now figure out why and how, and can I make them work better. It may take years to understand the papers, if ever (I’ve been reading Racecar Engineering magazine for 15 years and most of it still doesn’t make much sense to me…), but I’ll pick the bits that look useful, and use the bits I can, and maybe go away to learn a bit more about something else that helps me then use a bit more of the papers, and so on. But doing a maths course, or a physics course wouldn’t help, because the teaching line would probably not be my curiosity driven learning line.

For me, playful curiosity is the driver that allows you to stick at a problem till you figure it out (but why doesn’t it work?), or at least get into a frame of mind where you can just ignore it (for now) or park it until you figure something else out, or whatever… I’m not sure how the play relates to curiosity, or curiosity to play, but together they allow you to be creative and give you the persistence you need to figure stuff out enough to get stuff done…

Visualising Rally Route Stages (with help from rayshader and some ecologists…)

Inspired by some 3D map views generated using the rayshader and rgl R packages, I wondered how easy it would be to render some 3D maps of rally stages.

It didn’t take too long to get a quick example up and running but then I started wondering what else I could do with route and elevation data. And it turns out, quite a lot.

The result of my tinkerings to date is at rallydatajunkie.com/visualising-rally-stages. It concentrates solely on a "static analysis" of rally routes: no results, no telemetry, just the route.

Along the way, it covers the following topics:

  • using R spatial (sp) and simple features (sf) packages to represent routes;
  • using the leaflet, mapview and ggplot2 packages to render routes;
  • annotating and splitting routes / linestrings;
  • downloading elevation rasters using elevatr;
  • using the raster package to work with elevation rasters;
  • a review of map projections;
  • exploring various ways of rendering rasters and annotating them with derived terrain features;
  • rendering elevation rasters in 2D using rayshader;
  • an aside on converting images to elevation rasters;
  • rendering and cropping elevation rasters in 3D using rayshader;
  • rendering shadows for particular datetimes at specific locations (suncalc);
  • stage route analysis: using animal movement trajectory analysis tools (trajr, amt, rLFT) to characterise stage routes;
  • stage elevation visualisation and analysis (including elevation analysis using slopes);
  • adding OpenStreetMap data including highways and buildings to maps (osmdata);
  • steps towards creating a road book / tulip map by mapping stage routes onto OSM road networks (sfnetworks, dodgr).

Along the way, I had to learn various knitr tricks, eg for rendering images, HTML and movies in the output document.

The book itself was written using Rmd and then published via bookdown and Github Pages. The source repo is on Github at RallyDataJunkie/visualising-rally-stages.

Running R Projects in MyBinder – Dockerfile Creation With Holepunch

For those who don’t know it, MyBinder is a reproducible research automation tool that will take the contents of a Github repository, build a Docker container based on requirements files found inside the repo, and then present the user with a temporary, running container that can serve a Jupyter notebook, JupyterLab or RStudio environment to the user. All at the click of a button.

Although the primary, default, UI is the original Jupyter notebook interface, it is also possible to open a MyBinder environment into JupyterLab or, if the R packaging is installed, RStudio.

For example, using the demo https://github.com/binder-examples/r repository, which contains a simple base R environment, with RStudio installed, we can use MyBinder to launch RStudio running over the contents of that repository:

When we launch the binderised repo, we get — RStudio in the browser:

Part of the Binder magic is to install a set of required packages into the container, along with “content” documents (Jupyter notebooks, for example, or Rmd files), based on requirements identified in the repo. The build process is managed using a tool called repo2docker, and the way requirements / config files need to be defined can be found here.

To make building requirements files easier for R projects, the rather wonderful holepunch package will automatically parse the contents of an R project looking for package dependencies, and will then create a DESCRIPTION metadata file itemising the found R package dependencies. (holepunch can also be used to create install.R files.) Alongside it, a Dockerfile is created that references the DESCRIPTION file and allows Binderhub to build the container based on the project’s requirements.

For an example of how holepunch can be used in support of academic publishing, see this repo — rgayler/scorecal_CSCC_2019 — which contains the source documents for a recent presentation by Ross Gayler to the Credit Scoring & Credit Control XVI Conference. This repo contains the Rmd document required to generate the presentation PDF (via knitr) and Binder build files created by holepunch.

Clicking the repo’s MyBinder button takes you, after a moment or two, to a running instance of RStudio, within which you can open, and edit, the presentation .Rmd file and knit it to produce a presentation PDF.

In this particular case, the repository is also associated with a Zenodo DOI.

As well as launching Binderised repositories from the Github (or other repository) URL, MyBinder can also launch a container from a Zenodo DOI reference.

The screenshot actually uses the incorrect DOI…

For example, https://mybinder.org/v2/zenodo/10.5281/zenodo.3402938/?urlpath=rstudio.
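The URL pattern in that example can be wrapped in a tiny helper. This is a sketch that simply mirrors the example URL above; `binder_zenodo_url` is an illustrative name of my own, not part of any MyBinder tooling.

```python
from urllib.parse import quote


def binder_zenodo_url(doi, urlpath="rstudio"):
    """Build a MyBinder launch URL for a Zenodo DOI, following the pattern above."""
    return f"https://mybinder.org/v2/zenodo/{doi}/?urlpath={quote(urlpath)}"


launch_url = binder_zenodo_url("10.5281/zenodo.3402938")
```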

Looking Up R / CRAN Package Maintainers With an ac.uk Affiliation

Trying to find an examiner for a particular PhD thesis relating to a rather interesting datastructure for wrangling messy datatables, I wondered whether we might find a likely suspect amongst the R package maintainer community.

We can get a list of R package maintainers here and a list of package name / short descriptions here.

FWIW, here’s the code fragment:

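import pandas as pd

# Column 0 is treated as the maintainer (with email), column 1 as the package name
maintainers = pd.read_html('https://cran.r-project.org/web/checks/check_summary_by_maintainer.html')[0]
maintainers_email = maintainers.dropna(subset=[0])
# regex=False matches the literal string '.ac.uk' (otherwise '.' matches any character)
maintainers_email[maintainers_email[0].str.contains('.ac.uk', regex=False)][[0,1]]

# Package names and short descriptions
packages = pd.read_html('https://cran.r-project.org/web/packages/available_packages_by_name.html')[0]
packages

# Join the ac.uk maintainers' packages to their descriptions
maintainers_email_acuk = maintainers_email[maintainers_email[0].str.contains('.ac.uk', regex=False)][[0,1]]
maintainers_email_acuk.merge(packages,left_on=1,right_on=0)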

See also: What Do you Mean You Write Code EVERY DAY?, examples of which I’ve just turned into a new blog category: WDYMYWCED.

Fragment – Running Multiple Services, such as Jupyter Notebooks and a Postgres Database, in a Single Docker Container

Over the last couple of days, I’ve been fettling the build scripts for the TM351 VM, which typically uses vagrant to build a VirtualBox VM from a set of shell scripts, so they can be used to build a single Docker container that runs all the TM351 services, specifically Jupyter notebooks, OpenRefine, PostgreSQL and MongoDB.

Docker containers are typically constructed to run a single service, with compositions of containers wired together using Docker Compose to create applications that deliver, or rely on, more than one running service. For example, in a previous post (Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API) I showed how to set up a couple of containers to work together, one running a MySQL database server, the other an http service that provided an API to the database.

So how to run multiple services in the same container? Docs on the Docker website suggest using supervisord to run multiple services in a single container, so here’s a fragment on how I’ve done that from my TM351 build.

To begin with, I’ve built the container up as a tiered set of containers, in a similar way to the way the stack of opinionated Jupyter notebook Docker containers are constructed:

#Define a stub to identify the images in this image stack
IMAGESTUB=psychemedia/tm361testm

# minimal
## Define a minimal container, eg a basic Linux container
## using whatever flavour of Linux we prefer
docker build --rm -t ${IMAGESTUB}-minimal-test ./minimal

# base
## The base container installs core packages
## The intention is to define a common build environment
## populated with packages likely to be common to many courses
docker build --rm --build-arg BASE=${IMAGESTUB}-minimal-test -t ${IMAGESTUB}-base-test ./base

#...

One of the things I’ve done to try to generalise the build steps is to allow a named base container to be used to bootstrap a new one, by passing the name of the base image in via an optional variable (in the above case, --build-arg BASE=${IMAGESTUB}-minimal-test). Each Dockerfile in a build step directory uses the following construction to work out which image to use as the FROM basis:

#Set ARG values using --build-arg <varname>=<value>
#Each ARG value can also have a default value
ARG BASE=psychemedia/ou-tm351-base-test
FROM ${BASE}

Using the same approach, I have used separate build tiers for the following components:

  • jupyter base: minimal Jupyter notebook install;
  • jupyter custom: add some customisation onto a pre-existing Jupyter notebook install;
  • openrefine: add the OpenRefine application; (note, we could just use BASE=ubuntu to create this as a simple, standalone OpenRefine container);
  • postgres: create a seeded PostgreSQL database; note, this could be split into two: a base postgres tier and then a customisation that adds users, creates and seeds databases etc;
  • mongodb: add in a seeded mongo database; again, the seeding could be added as an extra tier on a minimal database tier;
  • topup: a tier to add in anything I’ve missed without having to go back to rebuild from an earlier step…

The intention behind splitting out these tiers is that we might want to have a battle hardened OU postgres tier, for example, that could be shared between different courses. Alternatively, we might want to have tiers offering customisations for specific presentations of a course, whilst reusing several other fixed tiers intended to last out the life of the course.
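The way each tier is chained onto the previous one can be sketched in a few lines of Python that just generate the docker build commands (nothing is actually run). The function name is my own, and it assumes each tier lives in a directory named after the tier and is tagged using the stub-tier-test pattern from the commands shown earlier:

```python
def tier_build_commands(image_stub, tiers, suffix="-test"):
    """Generate docker build commands for a stack of tiers.

    Hypothetical helper: assumes each tier's Dockerfile lives in ./<tier>
    and starts with ARG BASE=... / FROM ${BASE}.
    """
    commands = []
    previous = None
    for tier in tiers:
        tag = f"{image_stub}-{tier}{suffix}"
        parts = ["docker", "build", "--rm"]
        if previous is not None:
            # Pass the image built in the previous step in as the base image
            parts += ["--build-arg", f"BASE={previous}"]
        parts += ["-t", tag, f"./{tier}"]
        commands.append(" ".join(parts))
        previous = tag
    return commands


for cmd in tier_build_commands("psychemedia/tm361testm", ["minimal", "base"]):
    print(cmd)
```

Further tiers could be appended to the list in the order they should be stacked, assuming their directory names follow the same convention.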

By the by, it can be quite handy to poke inside an image once you’ve created it to check that everything is in the right place:

#Explore inside an image by entering it with a shell command
docker run -it --entrypoint=/bin/bash psychemedia/ou-tm351-jupyter-base-test -i

Once the services are in place, I add a final layer to the container that ensures supervisord is available and set up with an appropriate supervisord.conf configuration file:

##Dockerfile
#Final tier Dockerfile
ARG BASE=psychemedia/testpieces
FROM ${BASE}

USER root
RUN apt-get update && apt-get install -y supervisor

RUN mkdir -p /openrefine_projects  && chown oustudent:100 /openrefine_projects
VOLUME /openrefine_projects

RUN mkdir -p /notebooks  && chown oustudent:100 /notebooks
VOLUME /notebooks

RUN mkdir -p /var/log/supervisor
COPY monolithic_container_supervisord.conf /etc/supervisor/conf.d/supervisord.conf

EXPOSE 3334
EXPOSE 8888

CMD ["/usr/bin/supervisord"]

The supervisord.conf file is defined as follows:

##supervisord.conf
##We can check running processes under supervisord with: supervisorctl

[supervisord]
nodaemon=true
logfile=/dev/stdout
loglevel=trace
logfile_maxbytes=0
#The HOME envt needs setting to the correct USER
#otherwise jupyter throws: [Errno 13] Permission denied: '/root/.local'
#https://github.com/jupyter/notebook/issues/1719
environment=HOME=/home/oustudent

[program:jupyternotebook]
#Note the auth is a bit ropey on this atm!
command=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --allow-root --NotebookApp.password= --NotebookApp.token=
#The directory we want to start in
#(replaces jupyter notebook parameter: --notebook-dir=/notebooks)
directory=/notebooks
autostart=true
autorestart=true
startsecs=5
user=oustudent
stdout_logfile=NONE
stderr_logfile=NONE

[program:postgresql]
command=/usr/lib/postgresql/9.5/bin/postgres -D /var/lib/postgresql/9.5/main -c config_file=/etc/postgresql/9.5/main/postgresql.conf
user=postgres
autostart=true
autorestart=true
startsecs=5

[program:mongodb]
command=/usr/bin/mongod --dbpath=/var/lib/mongodb --port=27351
user=mongodb
autostart=true
autorestart=true
startsecs=5

[program:openrefine]
command=/opt/openrefine-3.0-beta/refine -p 3334 -i 0.0.0.0 -d /vagrant/openrefine_projects
user=oustudent
autostart=true
autorestart=true
startsecs=5
stdout_logfile=NONE
stderr_logfile=NONE

One thing I need to do better is to find a way to stage the construction of the supervisord.conf file, bearing in mind that multiple tiers may relate to the same service; for example, I have a jupyter-base tier to create a minimal Jupyter notebook server and then a jupyter-base-custom tier that adds in specific customisations, such as branding and course related notebook extensions.
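One possible route, sketched here rather than tested in the actual build, would be for each tier to contribute its own config fragment and then merge the fragments into a single file. The sketch below uses Python's configparser; the function name, the fragment contents and the idea of per-tier fragments are all assumptions for illustration, not part of the current build.

```python
import configparser
import io


def merge_supervisord_fragments(fragments):
    """Merge ini-style fragments; later fragments add to, or override,
    settings in sections that earlier fragments already defined."""
    # interpolation=None so any % characters in commands are left alone
    parser = configparser.ConfigParser(interpolation=None)
    for fragment in fragments:
        parser.read_string(fragment)
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


# Hypothetical fragment from a jupyter-base tier
jupyter_base = """
[program:jupyternotebook]
command=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser
directory=/notebooks
user=oustudent
"""

# Hypothetical fragment from a customisation tier for the same service
jupyter_custom = """
[program:jupyternotebook]
autostart=true
autorestart=true
"""

merged = merge_supervisord_fragments([jupyter_base, jupyter_custom])
print(merged)
```

Note that this approach drops any comments in the fragments and lowercases the setting names, which may or may not matter.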

When a container is started from the final image, the supervisord command is run and the multiple services are started.

One other thing to note: we’re hoping to run TM351 environments on an internal OpenStack cluster. The current cluster only allows students to expose a single port, and port 80 at that, from the VM (IP addresses are in scant supply, and network security lockdowns are in place all over the place). The current VM exposes at least two http services: Jupyter notebooks and OpenRefine, so we need a proxy in place if we are to expose them both via a single port. Helpfully, the nbserverproxy Jupyter extension (as described in Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy), allows us to do just that. One thing to note, though – I had to enable it via the same user that launches the notebook server in the supervisord.conf settings:

##Dockerfile fragment

RUN $PIP install nbserverproxy

USER oustudent
RUN jupyter serverextension enable --py nbserverproxy
USER root

To run the VM, I can call something like:

docker run -p 8899:8888 -d psychemedia/tm351dockermonotest

and then to access the additional services, I can browse to e.g. localhost:8899/proxy/3334/ to see the OpenRefine application.
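The proxied address follows a simple pattern: the host and the published notebook port, then /proxy/ and the port the service listens on inside the container. As a minimal sketch (the function name is my own, and http is assumed):

```python
def proxied_service_url(host, notebook_port, service_port, scheme="http"):
    """URL of a service reached via the notebook server's /proxy/ route.

    Hypothetical helper: notebook_port is the port published on the host,
    service_port is the port the service listens on inside the container.
    """
    return f"{scheme}://{host}:{notebook_port}/proxy/{service_port}/"


# OpenRefine listens on 3334 inside the container; notebooks are published on 8899
print(proxied_service_url("localhost", 8899, 3334))
```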

PS in case you’re wondering why I syndicated this through RBloggers too, the same recipe will work if you’re using Jupyter notebooks with an R kernel, rather than the default IPython one.

Embedded Audio Players in Jupyter Notebooks Running IRKernel

For ref, when running IRkernel Jupyter R notebooks, media objects can be embedded by making use of the shiny::tags function, which can return HTML elements with appropriate MIME types that are renderable using the _repr_html machinery (h/t @flying-sheep):

For example:

PS By the by, I notice the existence of another R kernel for Jupyter notebooks, JuniperKernel. An advantage of this over IRkernel is that it supports xwidgets, a C++ widget implementation for Jupyter notebooks akin to ipywidgets. (As far as I know, there isn’t a widget implementation for IRkernel, although maybe something can be finessed using shiny::tags and other bits of shiny machinery?) Although I could install JuniperKernel on my Mac, I had some issues getting it to work that I haven’t had a chance to explore yet, so I don’t yet have a widgets demo…

PS having to save an audio file to then load back into the player may be a faff; that said, it looks like there may be a route to using a data URI? Not tried this yet; if you have a short reproducible demo that works, please share via the comments:-)
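For what it's worth, here's a sketch of what the data URI route looks like in general terms. It's written in Python rather than R, and not tested in an IRkernel notebook, so treat it as an illustration of the format only: the audio bytes are base64 encoded and placed directly in the src attribute of an audio tag. The function names and the generated test tone are my own.

```python
import base64
import io
import math
import struct
import wave


def wav_bytes(freq_hz=440.0, seconds=0.25, rate=8000):
    """Generate a short mono 16-bit sine tone as in-memory WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        n = int(seconds * rate)
        frames = b"".join(
            struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate)))
            for i in range(n)
        )
        w.writeframes(frames)
    return buf.getvalue()


def audio_tag(data, mime="audio/wav"):
    """Wrap raw audio bytes in an HTML audio element using a data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f'<audio controls src="data:{mime};base64,{encoded}"></audio>'


html = audio_tag(wav_bytes())
print(html[:60])
```

No file is saved at any point; whether large audio clips embed comfortably this way is something I haven't checked.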