Category: Rstats

Generative Assessment Creation

It’s coming round to that time of year where we have to create the assessment material for courses with an October start date. In many cases, we reuse question forms from previous presentations but change the specific details. If a question is suitably defined, then large parts of this process could be automated.

In the OU, automated question / answer option randomisation is used to provide iCMAs (interactive computer marked assessments) via the student VLE using OpenMark. As well as purely text based questions, questions can include tables or images as part of the question.

One way of supporting such question types is to manually create a set of answer options, perhaps with linked media assets, and then allow randomisation of them.

Another way is to define the question in a generative way so that the correct and incorrect answers are automatically generated. (This seems to be one of those use cases for why ‘everyone should learn to code’ ;-)
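As a rough illustration of the generative approach (this is not how OpenMark itself defines questions), here is a minimal Python sketch in which a seeded random number generator picks the question variables, and the correct answer, distractors and feedback are all derived from them. The `make_question` function and the choice of distractors are made up for the example.

```python
import random


def make_question(seed):
    """Generate one multiple-choice question variant from a seeded template."""
    rng = random.Random(seed)  # same seed -> same question variant
    a = rng.randint(2, 9)
    b = rng.randint(2, 9)
    correct = a * b
    # Distractors modelled on plausible slips (hypothetical choices):
    # adding instead of multiplying, or being one operand / one unit out.
    candidates = [a + b, a * (b + 1), (a + 1) * b, a * b + 1, a * b - 1]
    distractors = []
    for value in candidates:
        if value != correct and value not in distractors:
            distractors.append(value)
    options = [correct] + distractors[:3]
    rng.shuffle(options)
    return {
        "text": f"What is {a} x {b}?",
        "options": options,
        "answer": correct,
        "feedback": f"{a} x {b} = {correct}; check you multiplied rather than added.",
    }


question = make_question(seed=42)
```

Seeding per student (or per attempt) would give each student a reproducible variant that can be regenerated for marking.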

Pinching screenshots from an (old?) OpenMark tutorial, we can see how a dynamically generated question might be defined. For example, create a set of variables:

and then generate a templated question, and student feedback generator, around them:

Packages also exist for creating generative questions/answers more generally. For example, the R exams package allows you to define question/answer templates in Rmd and then generate questions and solutions in a variety of output document formats.

You can also write templates that include the creation of graphical assets such as charts:


Via my feeds over the weekend, I noticed that this package now also supports the creation of more general diagrams created from a TikZ diagram template. For example, logic diagrams:

Or automata diagrams:

(You can see more exam templates here:

As I’m still on a “we can do everything in Jupyter” kick, one of the things I’ve explored is various IPython/notebook magics that support diagram creation. At the moment, these are just generic magics that allow you to write TikZ diagrams, for example, that make use of various TikZ packages:

On the to do list is to create some example magics that template different question types.

I’m not sure if OpenCreate is following a similar model? (I seem to have lost access permissions again…)

FWIW, I’ve also started looking at my show’n’tell notebooks again, trying to get them working in Azure notebooks. (OU staff should be able to log in to using credentials.) For the moment, I’m depositing them at, although some tidying may happen at some point. There are also several more basic demo notebooks I need to put together (e.g. on creating charts and using interactive widgets, digital humanities demos, R demos and (if they work!) polyglot R and python notebook demos, etc.). To use the notebooks interactively, log in and clone the library into your own user space.

R HTMLWidgets in Jupyter Notebooks

A quick note for displaying R htmlwidgets in Jupyter notebooks without requiring pandoc – there may be a more native way but this acts as a workaround in the meantime if not:


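library(leaflet)      # leaflet(), addTiles()
library(htmlwidgets)  # saveWidget()
library(IRdisplay)    # display_html()
m = leaflet() %>% addTiles()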
saveWidget(m, 'demo.html', selfcontained = FALSE)
display_html('<iframe src="demo.html"></iframe>')

PS and from the other side, using reticulate for Python powered Shiny apps.

Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate

In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story generation tool to generate a variety of texts from Python pandas data frames.

For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example).

Whilst there is a basic port of tracery to R, I want to make use of the various extensions I’ve been doodling with to pytracery, so it seemed like a good opportunity to start exploring the R reticulate package.

It was a bit of a faff trying to get things to work the first time, so here are some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.

Python Environment

My first attempt was to use python blocks in an Rmd document:

import sys

but R insisted on using the base Python path on my Mac that was not the path I wanted to use… The fix turned out to be setting the engine…

```{python, engine.path ='/Users/f1dj/anaconda3/bin/python' }
import sys
```

This could also be done via a setting: opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')

One of the problems with this approach is that a Python environment is created for each chunk – so you can’t easily carry state over from one Python chunk to another.

So I had a look at a workaround using reticulate instead.

Calling pytracery from R using reticulate

The solution I think I’m going for is to put Python code into a file, call that into R, then pass an R dataframe as an argument to a called Python function and get a response back into R as an R dataframe.

For example, here’s a simple python test file:

import tracery
from tracery.modifiers import base_english
import pandas as pd

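def pandas_row_mapper(row, rules, root, modifiers=base_english):
    ''' Function to parse single row of dataframe '''
    # Work on a dict view of the row and a copy of the rules,
    # so that the shared rules dict is not mutated between rows
    row = row.to_dict()
    rules = rules.copy()

    # Add each column value as a rule keyed by column name
    for k in row:
        rules[k] = str(row[k])

    grammar = tracery.Grammar(rules)
    if modifiers is not None:
        if isinstance(modifiers, list):
            for modifier in modifiers:
                grammar.add_modifiers(modifier)
        else:
            grammar.add_modifiers(modifiers)

    return grammar.flatten(root)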

def pandas_tracery(df, rules, root, modifiers=base_english):
  return df.apply(lambda row: pandas_row_mapper(row, rules, root, modifiers), axis=1)

def pdt_inspect(df):
  # Body missing in the source; handing the dataframe straight back is an assumption
  return df

def pdt_test1(df):
  return type(df)

def pdt_demo(df):
  return pandas_tracery(df, _demo_rules, "#origin#",  modifiers=base_english)

#Create example rule to apply to each row of dataframe
_demo_rules = {'origin': "#code# was placed #position#!",
         'position': "#pos.uppercase#"}

We can access a python environment using reticulate:


library(reticulate)

#Show conda environments
conda_list()

#Use a particular, named conda environment
use_condaenv(condaenv='anaconda3', required=T)

#Check the availability of a particular module in the environment
py_module_available("tracery")

Now we can load in the python file – and the functions it defines – and then call one of the loaded Python functions.

Note that I seemed to have to force the casting of the R dataframe to a python/pandas dataframe using r_to_py(), although I’d expected the type mapping to be handled automatically? (Perhaps there is a setting somewhere?)

df1=data.frame(code=c('Jo','Sam'), pos=c('first','Second'))
df1$result = pdt_demo(r_to_py(df1, convert=T))

Jo	first	Jo was placed FIRST!
Sam	Second	Sam was placed SECOND!

(Note: I also spotted a gotcha – things don’t work so well if you define an R column name called name… )

So now I can start looking at converting sports reporting tropes like these:

into tracery story models I can call using my pandas/pytracery hacks:-)

PS here’s a quick demo of inlining Python code:


#Go into python shell - this persists
#Access R variables with r.

#Return to R shell

#Access Python variable with py$

Sketch – Data Trivia

A bit more tinkering with F1 data from the ergast db, this time trying to generate trivia / facts around races.

The facts are identified using SQL queries:

#starts for a team
q=paste0('SELECT d.code, COUNT(code) AS startsforteam, c.name AS name FROM drivers d JOIN races r JOIN results rs JOIN constructors c WHERE c.constructorId=rs.constructorId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND d.code IN (',driversThisYear_str,') ',upto,' GROUP BY d.code, c.name HAVING (startsforteam+1) % 50 = 0')
startsTeammod50=dbGetQuery(ergastdb, q)

#looking for poles to date modulo 5 
q=paste0('SELECT d.code, COUNT(code) AS poles FROM drivers d JOIN qualifying q JOIN races r WHERE r.raceId=q.raceId AND d.code IN (',driversThisYear_str,') AND d.driverId=q.driverId AND q.position=1',upto,' GROUP BY code HAVING poles>1 AND (poles+1) % 5 = 0')
lookingpolesmod5=dbGetQuery(ergastdb, q)

Some of the queries also embed query fragments, which I intend to develop further…

upto=paste0(' AND (year<',year,' OR (year=',year,' AND round<',round,')) ')
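The same pattern of a reusable query fragment plus a "next one is a milestone" test can be sketched in Python. This is a minimal example under stated assumptions: it uses an in-memory SQLite database with a made-up single `starts` table rather than the ergast schema, the function names are hypothetical, and it passes values as query parameters rather than pasting them into the SQL string.

```python
import sqlite3


def upto_clause(year, rnd):
    """Reusable fragment: only races before the given round of the given season."""
    return " AND (year < ? OR (year = ? AND round < ?)) ", [year, year, rnd]


def next_start_milestones(conn, year, rnd, step=50):
    """Drivers whose next start would be a multiple of `step`."""
    clause, params = upto_clause(year, rnd)
    sql = (
        "SELECT code, COUNT(*) AS starts FROM starts WHERE 1=1"
        + clause
        + "GROUP BY code HAVING (COUNT(*) + 1) % ? = 0"
    )
    return conn.execute(sql, params + [step]).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE starts (code TEXT, year INTEGER, round INTEGER)")
# Toy data: AAA has 4 earlier starts; BBB has 3 earlier starts plus the current race
rows = [("AAA", 2017, r) for r in range(1, 5)]
rows += [("BBB", 2017, r) for r in range(1, 4)] + [("BBB", 2018, 2)]
conn.executemany("INSERT INTO starts VALUES (?, ?, ?)", rows)

# With step=5, AAA's next start would be their 5th; BBB's current race is filtered out
milestones = next_start_milestones(conn, year=2018, rnd=2, step=5)
```

Keeping the fragment in one function means each trivia query can share the same "up to this race" cut-off.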

I'm using knitr to generate Github flavoured markdown (gfm) from my Rmd docs – here’s part of the header:

    variant: gfm

The following recipe then takes results from the trivia queries and spiels the output:

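if (nrow(startsTeammod50)>0) {
  for (row in 1:nrow(startsTeammod50)) {
    text = '- `r startsTeammod50[row, "code"]` is looking for their `r toOrdinal(startsTeammod50[row, "startsforteam"]+1)` start for `r startsTeammod50[row, "name"]`'
    # Assumed output step (missing in the source): knit the inline `r` expressions in a results='asis' chunk
    cat(paste0(knit_child(text=text, quiet=TRUE), '\n'))
  }
}

if (nrow(lookingpolesmod5)>0) {
  for (row in 1:nrow(lookingpolesmod5)) {
    text = '- `r lookingpolesmod5[row, "code"]` is looking for their `r toOrdinal(lookingpolesmod5[row, "poles"]+1)` ever pole position'
    cat(paste0(knit_child(text=text, quiet=TRUE), '\n'))
  }
}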

We then get outputs of the form:

  • BOT is looking for their 100th race start
  • HAM is looking for their 100th start for Mercedes
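For comparison, the ordinal formatting and sentence templating can be sketched in a few lines of Python. The `to_ordinal` and `start_trivia` functions are illustrative stand-ins (not the R `toOrdinal` package), and the rows are assumed to be plain dicts with the same column names as the query results.

```python
def to_ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def start_trivia(rows):
    """Render one bullet per row; each row records starts made so far for a team."""
    return [
        f"- {r['code']} is looking for their "
        f"{to_ordinal(r['startsforteam'] + 1)} start for {r['name']}"
        for r in rows
    ]


lines = start_trivia([{"code": "HAM", "startsforteam": 99, "name": "Mercedes"}])
```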

See more example outputs here: Bahrain F1 2018 – Race Trivia.

This is another recipe I need to work up a little further and add to Wrangling F1 Data With R.

Tinkering with Competitive Supertimes

I’m back on the R thang with F1 data from, and started having a look at how drivers and teams compare at a circuit.

One metric I came across for comparing teams over a season is the supertime, typically calculated for each manufacturer as the average of their fastest single lap recorded by the team at each race weekend expressed as a percentage of the fastest single lap overall.

It struck me that we can also derive a reduced competitive supertime by basing the calculation on best laptime recorded across the qualifying and race sessions, omitting laptimes recorded in the practice sessions.

We can draw on the notion of supertimes to derive two simple measures for comparing team performances based on laptime:

  • evolution of manufacturer competitive supertime for a circuit over the years;
  • evolution of manufacturer competitive supertime for each circuit over the course of a season.

We can also produce driver performance metrics based on the competitive supertime of each driver.
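To make the calculation concrete, here is a small Python sketch of a per-circuit competitive supertime: the best lap for each team in each year (qualifying and race laps only), expressed as a percentage of the best lap by anyone that year. The function names and the tuple layout of the input are assumptions for the example; the R code later in the post works on dataframes instead.

```python
def time_in_s(laptime):
    """Convert 'm:ss.sss' (or 'ss.sss') to seconds; None for missing values."""
    if not laptime:
        return None
    parts = laptime.split(":")
    seconds = float(parts[-1])
    if len(parts) == 2:
        seconds += 60 * int(parts[0])
    return seconds


def competitive_supertimes(laps):
    """laps: iterable of (year, team, laptime string) from quali and race sessions.

    Returns {(year, team): best team lap as a percentage of that year's best lap}.
    """
    best = {}
    for year, team, laptime in laps:
        t = time_in_s(laptime)
        if t is None:
            continue  # no time set in that session
        key = (year, team)
        best[key] = min(t, best.get(key, t))
    fastest = {}
    for (year, _team), t in best.items():
        fastest[year] = min(t, fastest.get(year, t))
    return {key: 100 * t / fastest[key[0]] for key, t in best.items()}


laps = [
    (2018, "A", "1:30.000"),
    (2018, "A", "1:31.000"),
    (2018, "B", "1:30.900"),
    (2018, "B", None),
]
supertimes = competitive_supertimes(laps)
```

A value of 100 marks the team that set the best lap of the weekend; averaging these values over every race weekend in a season would give the season-long figure.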

So here are some notes on doing just that… I’m pulling the data from a MySQL database built from a datadump published via

q=paste0('SELECT circuitRef, c.name AS circuit, r.name AS race, location, country FROM races r JOIN circuits c WHERE r.circuitId=c.circuitId AND year=',year,' AND round=',round)
cct = as.list(dbGetQuery(ergastdb, q))

[1] "bahrain"

[1] "Bahrain International Circuit"

[1] "Bahrain Grand Prix"

[1] "Sakhir"

[1] "Bahrain"

Calculate some competitive supertimes for manufacturers:


q=paste0('SELECT q1, q2, q3, fastestLapTime, cn.name, d.code, year FROM races r JOIN circuits c JOIN results rs JOIN constructors cn JOIN qualifying q JOIN drivers d WHERE r.circuitId=c.circuitId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND cn.constructorId=rs.constructorId AND q.raceId=r.raceId AND q.driverId=rs.driverId AND cn.name IN (',teamsThisYear_str,') AND circuitref="',cct$circuitRef,'" ORDER BY year')
st = dbGetQuery(ergastdb, q)

# melt() is from reshape2; ddply() below is from plyr
library(reshape2)
library(plyr)
st = melt(st,
          id.vars = c("name","code", "year"),
          measure.vars = c("q1", "q2", "q3", "fastestLapTime"))

st['time'] = as.numeric(apply(st['value'], 1, timeInS))

# Normalise the time
st = ddply(st, .(year), transform, ntime = time/min(time, na.rm=TRUE))

#Find the best time for each manufacturer per race weekend (quali+race)
stt = ddply(st, .(year, name), summarise,
            stime = min(ntime, na.rm=TRUE))

stt$label = as.factor(stt$name)

Here’s a chart theme I’ve used elsewhere:


  g = g + theme_minimal(base_family="Arial Narrow")
  #g = g + theme(panel.grid.major.y=element_blank())
  g = g + theme(panel.grid.minor=element_blank())
  g = g + theme(axis.line.y=element_line(color="#2b2b2b", size=0.15))
  g = g + theme(axis.text.y=element_text(margin=margin(r=0, l=0)))
  g = g + theme(plot.margin=unit(rep(30, 4), "pt"))
  g = g + theme(plot.title=element_text(face="bold"))
  g = g + theme(plot.subtitle=element_text(margin=margin(b=10)))
  g = g + theme(plot.caption=element_text(size=8, margin=margin(t=10)))

Create a supertime chart plotting function:

# Requires ggplot2 and directlabels (for geom_dl)
compSupertime = function(df, cct, labelSize=0.7, smooth=FALSE){
  # Assumed base plot (this line is missing in the source): year on x, supertime on y
  g=ggplot(df, aes(x=year, y=stime))
  if (smooth){g=g+stat_smooth(aes(colour=label),method = "lm", formula= y ~ x + I(x^2), se=FALSE, size=0.7)}
  else { g=g+geom_line(aes(colour=label))}
  g=g+ guides(colour=FALSE)+xlim(min(df$year),max(df$year)+1)
  #cex is label size
  g=g+geom_dl(aes(label = label, colour=label),method = list(cex = labelSize, dl.trans(x = x + .3), "last.bumpup"))
  g=g+labs(x=NULL,y='Competitive Supertime (% of best)',
           title=paste0('F1 ', cct$race, ' - ','Competitive Supertimes',', ', min(df$year), ' to ', max(df$year)),
           subtitle = paste0(cct$circuit,', ',cct$location,', ',cct$country),
           caption="Data from Ergast database,")
  g
}

Let’s have a look…





The best fit / model line obviously leaves something to be desired, eg in the case of Renault. But it’s a start.

I’ve also started working on a workflow to autopublish stuff to Gitbooks via github. For example, here’s the site in progress around the Bahrain F1 2018 Grand Prix. Here’s an earlier example (subject to updates/reflowing!) around the Australia 2018 F1 Grand Prix.

PS As I get back in to this, I’ll probably start updating the Wrangling F1 Data With R with recipes again… Also on the to do list is a second edition based on the tidyverse way of doing things…

Note On My Emerging Workflow for Working With Binderhub

Yesterday saw the public reboot of Binder / MyBinder (which I first wrote about a couple of years ago here), as reported in The Jupyter project blog post Binder 2.0, a Tech Guide and this practical guide: Introducing Binder 2.0 — share your interactive research environment.

For anyone not familiar with Binder / MyBinder, it’s a service that will launch a fully running Jupyter notebook server and computing environment based on the contents of a Github repository (config files as well as notebooks). What this means is that if you put your Jupyter notebooks into a Github repository, along with one or two simple files that list any Linux or Python packages you need to install in order to run the code in the notebooks (or R packages and perhaps Rmd files if you also install an R kernel/RStudio), you can get browser access to that running environment at just the click of a link. And the generosity of whoever is paying for the servers the notebook server runs on.

The system has been rebuilt to use Jupyterhub, with a renaming as far as the codebase goes to Binderhub. There are also several utility tools associated with the project, including the really handy repo2docker that builds a Docker image from the contents of a local folder or Github repository.

One of the things that particularly interested me in the announcement blog posts was the following aspirational remark:

We would love to see others deploy their own BinderHub servers, either for their own communities, or as part of a federated public service of BinderHubs.

I’d love to see the OU get behind this, either directly or under the banner of OpenLearn, as part of an effort to help make Jupyter powered interactive open educational materials available without the need to install any software.

(I tried to pitch it to FutureLearn to help support the OU/FutureLearn Learn to Code for Data Analysis MOOC when we were writing that course, but they weren’t interested…)

One disadvantage is that Binderhub is a stateless service, which means you need to download any notebooks you’re working on and then upload them again yourself if you stop an interactive session: the environment you were working in is personal to you, but it’s also destroyed whenever you close the session (or after a particular amount of time?). So other solutions are required for persisting state (i.e. having a personal file storage area). Jupyterhub is one way to do that (and one of the things we’re starting to explore in the OU at the moment).

Through playing with Binderhub over the last couple of weeks as part of an attempt to put together some demos for how to use Jupyter notebooks to support the creation of educational content that contains rich content (images, interactives) from specifications contained within the notebook document (think: writing diagrams) I’ve come to the following workflow:

  • create a Github repository to host different builds (example). In my case, these are for different topic areas; but they could be different research projects, courses, data journalism investigations, etc.
  • put each build in a branch (example);
  • work up the build instructions for the environment either using Github/Binder or locally; I was having to use Github/Binder because I was working on a slow network connection that made building my evolving image difficult. But it meant that every time I made a change to the build, it used up Binder resources to do so.
  • if the build is a big one, it can take time to complete. I think that Binder will rebuild the Docker image each time you update the repo, so even if you only update notebook files, then *I think* that the package installation steps are also run even if those files *haven’t* changed? To simplify this process, we can instead create a Docker image from our build files and push that to Dockerhub (example).
  • We can then create a new build process for our working repository that pulls the pre-built image (containing all the required packages) and adds in the working notebooks (example).
  • We can also share a minimum viable repository that can be forked to allow other people to use the same environment (example).

One advantage of this route is that it separates “sys admin” concerns – building and installing the required packages – from “working” concerns relating to developing the contents of the notebooks. (I think the working repository that uses the Dockerfile build can also draw on the postbuild file to add in any additional or missing packages, which can then be added to the container build as part of a maintenance step.)

PS picking up on a recent related Downes presentation – Applications, Algorithms and Data: Open Educational Resources and the Next Generation of Virtual Learning – and a response from @jimgroom that I really need to comment back on – Containing the Future of OER – this phrase comes to mind: “syndicated runtime” eg if you syndicate the HTML version of a notebook via an RSS feed with a link back to the Binder runnable version of it…

Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Got some data, relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.
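As a sketch of how the edge list for such a graph might be derived from the raw rows, the following Python uses only the standard library. It assumes records arrive as (student ID, module code, presentation) tuples and that presentation codes sort chronologically as strings; both function names are hypothetical. Modules taken together in one presentation share the same precursors and follow-on modules, as described above.

```python
from collections import Counter, defaultdict


def module_flows(records):
    """Count students moving directly from one module to another."""
    by_student = defaultdict(lambda: defaultdict(set))
    for student, module, presentation in records:
        by_student[student][presentation].add(module)
    flows = Counter()
    for history in by_student.values():
        ordered = [history[p] for p in sorted(history)]
        for earlier, later in zip(ordered, ordered[1:]):
            for src in earlier:
                for dst in later:
                    flows[(src, dst)] += 1
    return flows


def leading_to(flows, module):
    """Keep only the edges that lie on some path into `module`."""
    keep, seen, frontier = set(), {module}, {module}
    while frontier:
        target = frontier.pop()
        for src, dst in flows:
            if dst == target:
                keep.add((src, dst))
                if src not in seen:
                    seen.add(src)
                    frontier.add(src)
    return {edge: flows[edge] for edge in keep}


records = [
    ("s1", "M1", "2015J"), ("s1", "M2", "2016J"), ("s1", "M3", "2016J"),
    ("s2", "M1", "2015J"), ("s2", "M2", "2016B"), ("s2", "M1", "2016J"),
    ("s3", "M4", "2015J"), ("s3", "M5", "2016J"),
]
flows = module_flows(records)
```

The resulting `(source, target): count` pairs are the kind of weighted edge list a graph library could then draw.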

The R diagram package has a couple of functions that can generate similar sorts of network diagram using its plotweb() function. For example, a simple flow graph:

Or something that looks a bit more like a finite state machine diagram:

(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

Another way of visualising this blocked data might be to use a simple two dimensional flow diagram, such as a transition plot from the R Gmisc package.

For example, the following could describe total flow from one module to another over a given period of time:

If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.
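One way such a catch-all bin might be built, sketched in Python: modules whose total flow (in plus out) falls below a threshold are relabelled before the flows are re-aggregated. The `bin_small_flows` function, its threshold parameter and the `(source, target): count` input shape are all assumptions for the example.

```python
from collections import Counter


def bin_small_flows(flows, min_total, other="Other"):
    """Relabel modules whose total in+out flow is below min_total as `other`."""
    totals = Counter()
    for (src, dst), n in flows.items():
        totals[src] += n
        totals[dst] += n

    def relabel(module):
        return module if totals[module] >= min_total else other

    binned = Counter()
    for (src, dst), n in flows.items():
        binned[(relabel(src), relabel(dst))] += n
    return binned


flows = {("A", "B"): 10, ("A", "C"): 1, ("B", "D"): 2, ("C", "D"): 1}
binned = bin_small_flows(flows, min_total=5)
```

Note that flows between two binned modules end up as an Other to Other self-loop, which may need dropping before plotting.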

The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and the ipysankeywidget.

We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problem with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about the module they take next from a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel co-ordinates plot, such as an alluvial diagram from the R alluvial package.

With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).
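The linearisation step can be sketched as follows, assuming the same (student ID, module code, presentation) tuples with presentation codes that sort chronologically. Because every node is keyed on module and presentation, and links only run between consecutive presentations in a student's own history, every link points forward in time and no loops can arise. The function names and the source/target/value output shape are assumptions for the example.

```python
from collections import Counter, defaultdict


def presentation_links(records):
    """Count students moving between module_presentation keyed nodes."""
    by_student = defaultdict(lambda: defaultdict(set))
    for student, module, presentation in records:
        by_student[student][presentation].add(module)
    links = Counter()
    for history in by_student.values():
        ordered = sorted(history)
        for p1, p2 in zip(ordered, ordered[1:]):
            for m1 in history[p1]:
                for m2 in history[p2]:
                    links[(f"{m1}_{p1}", f"{m2}_{p2}")] += 1
    return links


def as_link_records(links):
    """Flatten to a list of source/target/value dicts, sorted for stable output."""
    return [
        {"source": src, "target": dst, "value": n}
        for (src, dst), n in sorted(links.items())
    ]


records = [
    ("s2", "M1", "2015J"), ("s2", "M2", "2016B"), ("s2", "M1", "2016J"),
    ("s4", "M1", "2015J"), ("s4", "M2", "2016B"),
]
links = presentation_links(records)
link_records = as_link_records(links)
```

Here the student who returns to M1 no longer creates a loop: M1_2015J and M1_2016J are distinct nodes.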

So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python sankeyview and ipysankeywidget packages.

First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:

If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).

The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.

The R sankey package generates a simple Sankey diagram from a simple data table:

Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also has support for Sankey diagrams via the matplotlib.sankey module (see also this tutorial):

What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.

Via the comments…

H/T @Richard: process maps using the R bupaR package:


PS for creating Sankey diagrams in a browser, see SankeyMATIC.