Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate

In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story generation tool to generate a variety of texts from Python pandas dataframes.

For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example).

Whilst there is a basic port of tracery to R, I want to make use of the various extensions to pytracery I’ve been doodling with, so it seemed like a good opportunity to start exploring the R reticulate package.

It was a bit of a faff trying to get things to work the first time, so here are some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.

Python Environment

My first attempt was to use python blocks in an Rmd document:

```{python}
import sys
print(sys.executable)
```

but R insisted on using the base Python path on my Mac, which was not the path I wanted to use… The fix turned out to be setting the engine path…

```{python, engine.path ='/Users/f1dj/anaconda3/bin/python' }
import sys
print(sys.executable)
```

This could also be done via a setting: opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')
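For example, in a setup chunk at the top of the Rmd document:

```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(engine.path = '/Users/f1dj/anaconda3/bin/python')
```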

One of the problems with this approach is that a Python environment is created for each chunk – so you can’t easily carry state over from one Python chunk to another.

So I had a look at a workaround using reticulate instead.

Calling pytracery from R using reticulate

The solution I think I’m going for is to put the Python code into a file, source that file into R, then pass an R dataframe as an argument to a called Python function and get a response back into R as an R dataframe.

For example, here’s a simple python test file:

import tracery
from tracery.modifiers import base_english
import pandas as pd

def pandas_row_mapper(row, rules, root, modifiers=base_english):
    ''' Function to parse single row of dataframe '''
    row=row.to_dict()
    rules=rules.copy()

    #Add each column value to the ruleset as a flat rule
    for k in row:
        rules[k] = str(row[k])

    #Build the grammar once, after all the row values have been added
    grammar = tracery.Grammar(rules)
    if modifiers is not None:
        if isinstance(modifiers,list):
            for modifier in modifiers:
                grammar.add_modifiers(modifier)
        else:
            grammar.add_modifiers(modifiers)

    return grammar.flatten(root)

def pandas_tracery(df, rules, root, modifiers=base_english):
  return df.apply(lambda row: pandas_row_mapper(row, rules, root, modifiers), axis=1)

def pdt_inspect(df):
  return(df)

def pdt_test1(df):
  return type(df)

def pdt_demo(df):
  return pandas_tracery(df, _demo_rules, "#origin#",  modifiers=base_english)

#Create example rule to apply to each row of dataframe
_demo_rules = {'origin': "#code# was placed #position#!",
         'position': "#pos.uppercase#"}

We can access a python environment using reticulate:

library(reticulate)

#Show conda environments
conda_list("auto")

#Use a particular named conda environment
use_condaenv(condaenv='anaconda3', required=T)

#Check the availability of a particular module in the environment
py_module_available("tracery")

Now we can load in the Python file – and the functions it defines – and then call one of the loaded Python functions.

Note that I seemed to have to force the casting of the R dataframe to a Python/pandas dataframe using r_to_py(), although I’d expected the type mapping to be handled automatically. (Perhaps there is a setting somewhere?)

```{r}
source_python("pd_tracery_demo.py")
df1=data.frame(code=c('Jo','Sam'), pos=c('first','Second'))
df1$result = pdt_demo(r_to_py(df1, convert=T))
df1
```

#Displays:
code  pos     result
Jo    first   Jo was placed FIRST!
Sam   Second  Sam was placed SECOND!

(Note: I also spotted a gotcha – things don’t work so well if you define an R dataframe column called name…)

So now I can start looking at converting sports reporting tropes into tracery story models I can call using my pandas/pytracery hacks :-)

PS here’s a quick demo of inlining Python code:

library(reticulate)
rvar=1

#Go into python shell - this persists
repl_python()
#Access R variables with r.
pyvar=r.rvar+1

#Return to R shell
exit

#Access Python variable with py$
py$pyvar

Sketch – Data Trivia

A bit more tinkering with F1 data from the ergast db, this time trying to generate trivia/facts around races.

The facts are identified using SQL queries:

#starts for a team
q=paste0('SELECT d.code, COUNT(code) AS startsforteam, c.name AS name FROM drivers d JOIN races r JOIN results rs JOIN constructors c WHERE c.constructorId=rs.constructorId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND d.code IN (',driversThisYear_str,') ',upto,' GROUP BY d.code, c.name HAVING (startsforteam+1) % 50 = 0')
startsTeammod50=dbGetQuery(ergastdb, q)

#looking for poles to date modulo 5 
q=paste0('SELECT d.code, COUNT(code) AS poles FROM drivers d JOIN qualifying q JOIN races r WHERE r.raceId=q.raceId AND d.code IN (',driversThisYear_str,') AND d.driverId=q.driverId AND q.position=1',upto,' GROUP BY code HAVING poles>1 AND (poles+1) % 5 = 0')
lookingpolesmod5=dbGetQuery(ergastdb, q)

Some of the queries also embed query fragments, which I intend to develop further…

upto=paste0(' AND (year<',year,' OR (year=',year,' AND round<',round,')) ')

I'm using knitr to generate Github flavoured markdown (gfm) from my Rmd docs – here’s part of the header:

---
output:
  md_document:
    variant: gfm
---

The following recipe then takes results from the trivia queries and spiels the output:

if (nrow(startsTeammod50)>0) {
  for (row in 1:nrow(startsTeammod50)) {
    text = '- `r startsTeammod50[row, "code"]` is looking for their `r toOrdinal(startsTeammod50[row, "startsforteam"]+1)` start for `r startsTeammod50[row, "name"]`'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

if (nrow(lookingpolesmod5)>0) {
  for (row in 1:nrow(lookingpolesmod5)) {
    text = '- `r lookingpolesmod5[row, "code"]` is looking for their `r toOrdinal(lookingpolesmod5[row, "poles"]+1)` ever pole position'
    cat(paste0(knit_child(text=text,quiet=TRUE),'\n'))
  }
}

We then get outputs of the form:

  • BOT is looking for their 100th race start
  • HAM is looking for their 100th start for Mercedes

See more example outputs here: Bahrain F1 2018 – Race Trivia.

This is another recipe I need to work up a little further and add to Wrangling F1 Data With R.

Tinkering with Competitive Supertimes

I’m back on the R thang with F1 data from ergast.com, and started having a look at how drivers and teams compare at a circuit.

One metric I came across for comparing teams over a season is the supertime, typically calculated for each manufacturer as the average of the fastest single lap recorded by the team at each race weekend, expressed as a percentage of the fastest single lap recorded overall at each weekend.
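In other words, something like:

$$\textit{supertime} = \frac{100}{N}\sum_{r=1}^{N}\frac{\textit{fastest team lap at weekend } r}{\textit{fastest overall lap at weekend } r}$$

for the $N$ race weekends in a season, so a supertime of 100 would mean the team set the outright fastest lap at every weekend.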

It struck me that we can also derive a reduced competitive supertime by basing the calculation on the best laptime recorded across the qualifying and race sessions, omitting laptimes recorded in the practice sessions.

We can draw on the notion of supertimes to derive two simple measures for comparing team performances based on laptime:

  • evolution of manufacturer competitive supertime for a circuit over the years;
  • evolution of manufacturer competitive supertime for each circuit over the course of a season.

We can also produce driver performance metrics based on the competitive supertime of each driver.

So here are some notes on doing just that… I’m pulling the data from a MySQL database built from a datadump published via ergast.com.

q=paste0('SELECT circuitRef, c.name AS circuit, r.name as race, location, country FROM races r JOIN circuits c WHERE r.circuitId=c.circuitId AND year=',year,' AND round=',round)
cct = as.list(dbGetQuery(ergastdb, q))

cct
$circuitRef
[1] "bahrain"

$circuit
[1] "Bahrain International Circuit"

$race
[1] "Bahrain Grand Prix"

$location
[1] "Sakhir"

$country
[1] "Bahrain"

Calculate some competitive supertimes for manufacturers:

library(reshape2)
library(plyr)

q=paste0('SELECT q1, q2, q3, fastestLapTime, cn.name, d.code, year FROM races r JOIN circuits c JOIN results rs JOIN constructors cn JOIN qualifying q JOIN drivers d WHERE r.circuitId=c.circuitId AND d.driverId=rs.driverId AND r.raceId=rs.raceId AND cn.constructorId=rs.constructorId AND q.raceId=r.raceId AND q.driverId=rs.driverId AND cn.name IN (',teamsThisYear_str,') AND circuitref="',cct$circuitRef,'" ORDER BY year')
st = dbGetQuery(ergastdb, q)


st = melt(st,
          id.vars = c("name","code", "year"),
          measure.vars = c("q1", "q2", "q3", "fastestLapTime"))

st['time'] = as.numeric(apply(st['value'], 1, timeInS))

# Normalise the time
st = ddply(st, .(year), transform, ntime = time/min(time, na.rm=TRUE))

#Find the best time for each manufacturer per race weekend (quali+race)
stt = ddply(st,
            .(name,year),
            summarise, 
            stime = min(ntime, na.rm=TRUE))

stt$label = as.factor(stt$name)
stt
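The timeInS() function used above is a helper defined elsewhere in my scripts for casting laptime strings to seconds. A minimal sketch, assuming laptimes arrive either as plain seconds ("92.345") or as minute:second strings ("1:32.345"), might look like:

timeInS = function(tStr){
  #Return NA for missing values
  if (is.na(tStr) || tStr == '') return(NA)
  #Split on the colon, if there is one
  parts = as.numeric(strsplit(as.character(tStr), ':')[[1]])
  #Times already expressed in seconds
  if (length(parts) == 1) return(parts[1])
  #minute:second times
  60 * parts[1] + parts[2]
}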

Here’s a chart theme I’ve used elsewhere:

library(ggplot2)
library(directlabels)
library(ggrepel)
library(ggthemes)

champChartTheme=function(g){
  g = g + theme_minimal(base_family="Arial Narrow")
  #g = g + theme(panel.grid.major.y=element_blank())
  g = g + theme(panel.grid.minor=element_blank())
  g = g + theme(axis.line.y=element_line(color="#2b2b2b", size=0.15))
  g = g + theme(axis.text.y=element_text(margin=margin(r=0, l=0)))
  g = g + theme(plot.margin=unit(rep(30, 4), "pt"))
  g = g + theme(plot.title=element_text(face="bold"))
  g = g + theme(plot.subtitle=element_text(margin=margin(b=10)))
  g = g + theme(plot.caption=element_text(size=8, margin=margin(t=10)))
  g
}

Create a supertime chart plotting function:

compSupertime = function(df, cct, labelSize=0.7, smooth=FALSE){
  g=ggplot(data=df,aes(x=year,y=stime))
  if (smooth){g=g+stat_smooth(aes(colour=label),method = "lm", formula= y ~ x + I(x^2), se=FALSE, size=0.7)}
  else { g=g+geom_line(aes(colour=label))}
  g=g+ guides(colour=FALSE)+xlim(min(df$year),max(df$year)+1)
  #cex is label size
  g=g+geom_dl(aes(label = label, colour=label),method = list(cex = labelSize, dl.trans(x = x + .3), "last.bumpup"))
  g=g+labs(x=NULL,y='Competitive Supertime (% of best)',
           title=paste0('F1 ', cct$race, ' - ','Competitive Supertimes',', ', min(df$year), ' to ', max(df$year)),
           subtitle = paste0(cct$circuit,', ',cct$location,', ',cct$country),
           caption="Data from Ergast database, ergast.com/mrd")
  champChartTheme(g)
}

Let’s have a look…

compSupertime(stt,cct,smooth=F)

 


Alternatively:

compSupertime(stt,cct,smooth=T)

The best fit / model line obviously leaves something to be desired, eg in the case of Renault. But it’s a start.

I’ve also started working on a workflow to autopublish stuff to Gitbooks via github. For example, here’s the site in progress around the Bahrain F1 2018 Grand Prix. Here’s an earlier example (subject to updates/reflowing!) around the Australia 2018 F1 Grand Prix.

PS As I get back in to this, I’ll probably start updating the Wrangling F1 Data With R with recipes again… Also on the to do list is a second edition based on the tidyverse way of doing things…

Note On My Emerging Workflow for Working With Binderhub

Yesterday saw the public reboot of Binder / MyBinder (which I first wrote about a couple of years ago here), as reported in The Jupyter project blog post Binder 2.0, a Tech Guide and this practical guide: Introducing Binder 2.0 — share your interactive research environment.

For anyone not familiar with Binder / MyBinder, it’s a service that will launch a fully running Jupyter notebook server and computing environment based on the contents of a Github repository (config files as well as notebooks). What this means is that if you put your Jupyter notebooks into a Github repository, along with one or two simple files that list any Linux or Python packages you need to install in order to run the code in the notebooks (or R packages, and perhaps Rmd files, if you also install an R kernel/RStudio), you can get browser access to that running environment at just the click of a link – and the generosity of whoever is paying for the servers the notebook server runs on.

The system has been rebuilt to use Jupyterhub, with a renaming, as far as the codebase goes, to Binderhub. There are also several utility tools associated with the project, including the really handy repo2docker, which builds a Docker image from the contents of a local folder or Github repository.

One of the things that particularly interested me in the announcement blog posts was the following aspirational remark:

We would love to see others deploy their own BinderHub servers, either for their own communities, or as part of a federated public service of BinderHubs.

I’d love to see the OU get behind this, either directly or under the banner of OpenLearn, as part of an effort to help make Jupyter powered interactive open educational materials available without the need to install any software.

(I tried to pitch it to FutureLearn to help support the OU/FutureLearn Learn to Code for Data Analysis MOOC when we were writing that course, but they weren’t interested…)

One disadvantage is that Binderhub is a stateless service, which means you need to download any notebooks you’re working on and then upload them again yourself if you stop an interactive session: the environment you were working in is personal to you, but it’s also destroyed whenever you close the session (or after a particular amount of time?). So other solutions are required for persisting state (i.e. having a personal file storage area). Jupyterhub is one way to do that (and one of the things we’re starting to explore in the OU at the moment).

Through playing with Binderhub over the last couple of weeks – as part of an attempt to put together some demos for how to use Jupyter notebooks to support the creation of educational content that contains rich assets (images, interactives) generated from specifications contained within the notebook document (think: writing diagrams) – I’ve come to the following workflow:

  • create a Github repository to host different builds (example). In my case, these are for different topic areas; but they could be different research projects, courses, data journalism investigations, etc.
  • put each build in a branch (example);
  • work up the build instructions for the environment either using Github/Binder or locally; I was having to use Github/Binder because I was working on a slow network connection that made building my evolving image difficult. But it meant that every time I made a change to the build, it used up Binder resources to do so.
  • if the build is a big one, it can take time to complete. I think that Binder will rebuild the Docker image each time you update the repo, so even if you only update notebook files, *I think* the package installation steps are also run, even though those files *haven’t* changed? To simplify this process, we can instead create a Docker image from our build files and push that to Dockerhub (example).
  • We can then create a new build process for our working repository that pulls the pre-built image (containing all the required packages) and adds in the working notebooks (example); see the Dockerfile sketch after this list.
  • We can also share a minimum viable repository that can be forked to allow other people to use the same environment (example).
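A minimal sketch of the sort of Dockerfile the working repository might use (the image name and paths here are hypothetical, and this glosses over Binder’s additional requirements around users and permissions):

#Start from a prebuilt image that already contains the required packages (hypothetical image name)
FROM psychemedia/demo-base-environment:latest
#Add the working notebooks into the image
COPY ./notebooks ${HOME}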

One advantage of this route is that it separates “sys admin” concerns – building and installing the required packages – from “working” concerns relating to developing the contents of the notebooks. (I think the working repository that uses the Dockerfile build can also draw on a postBuild file to add in any additional or missing packages, which can then be added to the container build as part of a maintenance step.)

PS picking up on a recent related Downes presentation – Applications, Algorithms and Data: Open Educational Resources and the Next Generation of Virtual Learning – and a response from @jimgroom that I really need to comment back on – Containing the Future of OER – this phrase comes to mind: “syndicated runtime” eg if you syndicate the HTML version of a notebook via an RSS feed with a link back to the Binder runnable version of it…

Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Got some data relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow-on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.

The R diagram package can generate similar sorts of network diagram using its plotweb() function. For example, a simple flow graph:

Or something that looks a bit more like a finite state machine diagram:

(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

Another way of visualising this blocked data might be to use a simple two dimensional flow diagram, such as a transition plot from the R Gmisc package.

For example, the following could describe total flow from one module to another over a given period of time:
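A minimal sketch of the sort of call that generates one, using the Gmisc transitionPlot() function with dummy data (the module names and counts are made up):

library(Gmisc)

#Dummy transition matrix: rows are "from" modules, columns are "to" modules
trn = matrix(c(50, 10,
                5, 40),
             ncol = 2, byrow = TRUE,
             dimnames = list(c('Module A', 'Module B'),
                             c('Module C', 'Module D')))

transitionPlot(trn)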

If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.

The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and the ipysankeywidget.

We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problem with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about which module they take next after a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel co-ordinates plot, such as an alluvial diagram from the R alluvial package.
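A minimal sketch of the sort of data and call the alluvial package expects (dummy module_presentation flows and counts):

library(alluvial)

#Dummy student flows between time-indexed module presentations
flows = data.frame(first = c('A 2016J', 'A 2016J', 'B 2017B'),
                   second = c('B 2017B', 'C 2017B', 'C 2017J'),
                   freq = c(10, 5, 3))

alluvial(flows[, c('first', 'second')], freq = flows$freq)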

With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).

So, for example, here are a few more packages that might be able to help with that, in addition to the aforementioned Python sankeyview and ipysankeywidget packages.

First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:
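The basic call looks something like this (dummy data; the link source and target columns index, zero-based, into the nodes dataframe):

library(networkD3)

nodes = data.frame(name = c('Module A', 'Module B', 'Module C'))
links = data.frame(source = c(0, 0, 1),
                   target = c(1, 2, 2),
                   value = c(10, 5, 3))

sankeyNetwork(Links = links, Nodes = nodes,
              Source = 'source', Target = 'target',
              Value = 'value', NodeID = 'name')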

If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).
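Something along these lines (dummy data again):

library(googleVis)

dat = data.frame(from = c('Module A', 'Module A', 'Module B'),
                 to = c('Module B', 'Module C', 'Module C'),
                 weight = c(10, 5, 3))

plot(gvisSankey(dat, from = 'from', to = 'to', weight = 'weight'))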

The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.

The R sankey package generates a simple Sankey diagram from a simple data table:

Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also has support for Sankey diagrams as matplotlib.sankey() (see also this tutorial):

What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.

Via the comments…

H/T @Richard: process maps using the R bupar package:

Process_Maps

PS for creating Sankey diagrams in a browser, see SankeyMATIC.

Programming, meh… Let’s Teach How to Write Computational Essays Instead

From Stephen Wolfram, a nice phrase to describe the sorts of thing you can create using tools like Jupyter notebooks, Rmd and Mathematica notebooks – computational essays – that complements the “computational narrative” phrase also used to describe such documents.

Wolfram’s recent blog post What Is a Computational Essay?, part essay, part computational essay, is primarily a pitch for using Mathematica notebooks and the Wolfram Language. (The Wolfram Language provides computational support plus access to a “fact engine” database that can be used to pull factual information into the coding environment.)

But it also describes nicely some of the generic features of other “generative document” media (Jupyter notebooks, Rmd/knitr) and how to start using them.

There are basically three kinds of things [in a computational essay]. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these three kinds of things all work together to express what’s being communicated.

In Mathematica, the view is something like this:


In Jupyter notebooks:

In its raw form, an RStudio Rmd document source looks something like this:

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

When we originally drafted the OU/FutureLearn course Learn to Code for Data Analysis (also available on OpenLearn), we wrote the explanatory text – delivered as HTML but including static code fragments and code outputs – as a notebook, and then “ran” the notebook to generate static HTML (or markdown) that provided the static course content. These notebooks were complemented by actual notebooks that students could work with interactively themselves.

(Actually, we prototyped authoring both the static text, and the elements to be used in the student notebooks, in a single document, from which the static HTML and “live” notebook documents could be generated: Authoring Multiple Docs from a Single IPython Notebook.)

Whilst the notion of the computational essay as a form is really powerful, I think the added distinction between generative and generated documents is also useful. For example, a raw Rmd document or Jupyter notebook is a generative document that can be used to create a document containing text, code, and the output generated from executing the code. A generated document is an HTML, Word, or PDF export from an executed generative document.

Note that the generating code can be omitted from the generated output document, leaving just the text and the code-generated outputs. Code cells can also be collapsed so the code itself is hidden from view but still available for inspection at any time:

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

Even if notebooks are not used interactively, they can be used to create correct static texts where outputs that are supposed to relate to some fragment of code in the main text actually do so because they are created by the code, rather than being cut and pasted from some other environment.

However, making the generative – as well as generated – documents available means readers can learn by doing, as well as reading:

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

In terms of our own learnings to date about how to use notebooks most effectively as part of a teaching communication (i.e. as learning materials), Wolfram seems to have come to many similar conclusions. For example, try to limit the amount of code in any particular code cell:

In a typical computational essay, each piece of input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

...

So what can go wrong? Well, like English prose, [code] can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

It can also be useful to "preview" the output of a particular operation that populates a variable for use in the following expression, to help the reader understand what sort of thing that expression is evaluating:

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like.

In many respects, the computational narrative format forces you to construct an argument in a particular way: if a piece of code operates on a particular thing, you need to access, or create, the thing before you can operate on it.

[A]nother thing that helps is that the nature of a computational essay is that it must have a “computational narrative”—a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

One of the arguments I've been trying to develop in an attempt to persuade some of my colleagues to consider the use of notebooks to support teaching is precisely their notebook nature. Several years ago, one of the en vogue ideas being pushed in our learning design discussions was to try to find ways of supporting and encouraging the use of "learning diaries", where students could reflect on their learning, recording not only things they'd learned but also ways they'd come to learn them. Slightly later, portfolio style assessment became "a thing" to consider.

Wolfram notes something similar from way back when...

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

One of the advantages that notebooks have over some other environments in which students learn to code is that the structure of the notebook can encourage you to develop a solution to a problem whilst retaining your earlier working.

The earlier working is where you can engage in the minutiae of trying to figure out how to apply particular programming concepts, creating small, playful, test examples of the sort of thing you need to use in the task you have actually been set. (I think of this as a "trial driven" software approach rather than a "test driven" one; in a trial, you play with a bit of code in the margins to check that it does the sort of thing you want, or expect, it to do before using it in the main flow of a coding task.)

One of the advantages for students using notebooks is that they can doodle with code fragments to try things out, and keep a record of the history of their own learning, as well as producing working bits of code that might be used for formative or summative assessment, for example.

Another advantage of creating notebooks – which may include recorded fragments of dead ends from trying to solve a particular problem – is that you can refer back to them, and reuse what you learned, or discovered how to do, in them.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

Looking at many of the notebooks students have created from scratch to support assessment activities in TM351, it's evident that many of them are not using them other than as an interactive code editor with history. The documents contain code cells and outputs, with little if any commentary (what comments there are are often just simple inline code comments in a code cell). They are barely computational narratives, let alone computational essays; they're more of a computational scratchpad containing small code fragments, without context.

This possibly reflects the prior history in terms of code education that students have received, working "out of context" in an interactive Python command line editor, or a traditional IDE, where the idea is to produce standalone files containing complete programmes or applications. Not pieces of code, written a line at a time, in a narrative form, with example output to show the development of a computational argument.

(One argument I've heard made against notebooks is that they aren't appropriate as an environment for writing "real programmes" or "applications". But that's not strictly true: Jupyter notebooks can be used to define and run microservices/APIs as well as GUI driven applications.)

However, if you start to see computational narratives as a form of narrative documentation that can be used to support a form of literate programming, then once again the notebook format can come in to its own, and draw on styling more common in a text document editor than a programming environment.

(By default, Jupyter notebooks expect you to write text content in markdown or markdown+HTML, but WYSIWYG editors can be added as an extension.)

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

As well as allowing you to create documents in which the content is generated interactively - code cells can be changed and re-run, for example - it is also possible to embed interactive components in both generative and generated documents.

On the one hand, it's quite possible to generate and embed an interactive map or interactive chart that supports popups or zooming in a generated HTML output document.

On the other, Mathematica and Jupyter both support the dynamic creation of interactive widget controls in generative documents that give you control over code elements in the document, such as sliders to change numerical parameters or list boxes to select categorical text items. (In the R world, there is support for embedded shiny apps in Rmd documents.)
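For example – a minimal sketch, rather than anything from an actual course doc – an Rmd document with runtime: shiny in its YAML header can include a chunk like this:

```{r, echo=FALSE}
library(shiny)

#A slider widget bound to input$n...
sliderInput("n", "Number of points:", min = 10, max = 100, value = 50)

#...controlling a reactively re-rendered plot
renderPlot({ plot(rnorm(input$n)) })
```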

These can be useful when creating narratives that encourage exploration (for example, in the sense of explorable explanations), though I seem to recall Michael Blastland expressing concern several years ago about how ineffective interactives could be in data journalism stories.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like Manipulate, [interact/interactive in Jupyter notebooks] into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay.

I've also thought previously that interactive functions are a useful way of motivating the use of functions in general when teaching introductory programming. For example, An Alternative Way of Motivating the Use of Functions?.

One of the issues in trying to set up student notebooks is how to handle boilerplate code that is required before the student can create, or run, the code you actually want them to explore. In TM351, we preload notebooks with various packages and bits of magic; in my own tinkerings, I'm starting to try to package stuff up so that it can be imported into a notebook in a single line.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.

As far as opportunities for making increasing use of notebooks as a kind of technology goes, I came to a similar conclusion some time ago to Stephen Wolfram when he writes:

[I]t’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Regarding his final conclusion, I'm a little bit more circumspect:

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new.

In many respects, HTML+Javascript pages have been capable of delivering, and actually delivering, computationally generated documents for some time. Whether computational notebooks offer some sort of step-change away from that, or actually represent a return to the original read/write imaginings of the web with portable and computed facts accessed using Linked Data?

Fragment – DIT4C – Docker Base Containers for Edu Remote Computing Labs

What’s an effective way of helping a student run a desktop application when their own computer won’t run the application, for whatever reason, locally? Virtualised software, running remotely, provides one solution. So here’s an example of a project that looks at doing just that: DIT4C (“Data Intensive Tools for the Cloud”), a platform for hosting data analysis tools “in the cloud” using containers [repo].

Prepackaged, standalone containers are defined for a range of applications, including RStudio, Jupyter notebooks, Jupyter+R and OpenRefine.

Standalone Containers With Branded Landing Page

The application containers are built on top of a base container that includes an nginx webserver/proxy, a GoTTY shell and a file uploader. The individual containers then have a “homepage” that links to the particular application:

So what do we have at this point?

  • a branded landing page;
  • a browser accessed shell;
  • a browser accessed file uploader.

These services are all running within a single container. I don’t know if there’s a way of linking multiple containers using docker-compose? This would require finding some way of announcing the services provided by each container to a central nginx server, which could then link to each from a single homepage. But this would mean separate terminals and file loaders into each one (though maybe the shared files could be handled as a single mounted volume shared across all the linked containers?).

Once again, I’m coming round to the idea that using a single container to run multiple services, rather than several linked containers each running a single service, is simpler, even if it does go against the (ideal?) model of using containers as part of a small pieces, loosely joined architecture. I think I need to post a simple recipe (or recipes) somewhere that shows different ways of running multiple services within a single container. The docker docs – Run multiple services in a container – provide a crib into this at the moment.

X11 Applications

Skimming the docs, I notice reference to a base X11 desktop container. Interesting… I have a PhD student looking for an easy way to host a Qt widget based application in the cloud for evaluation purposes. To this end I’ve just started looking around for X11/noVNC web client containers that would allow us to package the app in a simple container and then access it from something like Digital Ocean (given there’s no internal OU docker container hosting service that I’m allowed to access (or am aware of… maybe on the Faculty cluster?)).

So things like this show the way – a container that offers a link to a containerised “desktop” application, in this case QGIS (dit4c/dockerfile-dit4c-container-qgis). (Does the background colour mean anything, I wonder? How could we make use of background colour in OU containers?):


Following the X11 Session link, we get to a desktop:

There’s an icon in the toolbar for the application we want – QGIS:

What I’m thinking now is this could be handy for running the V-REP robot simulator, and maybe Gephi…

It also makes me think that things could be simplified a little further by offering a link to QGIS, rather than X11 Application, and opening the application in full screen mode (on the virtualised desktop) on start-up. (See Distributing Virtual Machines That Include a Virtual Desktop To Students – V-REP + Jupyter Notebooks for some thoughts on how to use VMs to distribute a single pre-launched on startup desktop application to try to simplify the student experience.)

It also makes me even more concerned about the apparent lack of interest in, and even awareness of, the possibilities of virtualised software offerings within the OU. For example, at a recent SIG group on (interactive) maps/mapping, brief mention was made of using QGIS, and problems arising therefrom (though I forget the context of the problems). Here we have a solution – out there for all to see and anyone to find – that demonstrates the use of QGIS in a prebuilt container. But who, internally, would think to mention that? I don’t think any of the Tech Enhanced Learning folk I’ve spoken to would even consider it, if they are even aware of it as an option?

(Of course, in testing, it might be rubbish… how much bandwidth is required for a responsive experience when creating detailed maps? See also one of my earlier related experiments: Accessing GUI Apps Via a Browser from a Container Using Guacamole, which demonstrated remotely accessing the Audacity audio editor running in a cloud hosted container.)

The Platform Offering

Skimming through the repos, I thought (mistakenly, as it happens) that I saw a reference to resbaz (ResBaz Cloud – Containerised Research Apps as a Service). There was no such reference in the code I skimmed, but it seems there is a relationship:

And so, perhaps more interestingly than the standalone containers, it seems that DIT4C is a platform offering (architecture docs), providing authenticated access for users, file persistence (presumably?) and the ability to launch prebuilt docker images as required.

That said, looking at the Github repository commits for the project, there appears to have been little activity since March 2017, and the gitter channel seems to have gone silent at the end of 2016. In addition, the docs for getting an instance of the platform up and running are a little bit too sparse for me to follow easily… [UPDATE: it seems as if the funding did run out/get pulled :-(]

So maybe, as a project, DIT4C is now “of historical interest” only, rather than being a live project we might have been able to jump on the back of to get an OU hosted remote computing lab up and running? :-( That said, the ResBaz (Research Bazaar) initiative, a “worldwide festival promoting the digital literacy emerging at the center of modern research”, still seems to be around…