OpenRefine Running in MyBinder, Several Ways…

Python packages such as the Jupyter Server Proxy allow you to use a Jupyter notebook server as a proxy for other services running in the same environment, such as a MyBinder container.

The jupyter-server-proxy package represents a generalisation of earlier demonstrations that showed how to proxy RStudio and OpenRefine in a MyBinder container.

In this post, I’ll serve up several ways of getting OpenRefine running in a MyBinder container. You can find all the examples in branches off psychemedia/jupyterserverproxy-openrefine.

Early original work on getting OpenRefine running in MyBinder was done by @betatim (betatim/openrefineder) using an earlier package, nbserverproxy; @yuvipanda helped me get my head round various bits of jupyterhub/jupyter-server-proxy/ which is key to proxying web services via Jupyter (jupyter-server-proxy docs). @manics provided the jupyter-server-proxy PR for handling predefined, rather than allocated, port mappings, which also made life much easier…

Common Installation Requirements

The following steps are pretty much common to all the recipes, and are responsible for installing OpenRefine and its dependencies.

First, in a binder/apt.txt file, the Java dependency:


A binder/postBuild step to install a specific version of OpenRefine:

set -e

wget -q -O openrefine-$VERSION.tar.gz$VERSION/openrefine-linux-$VERSION.tar.gz
mkdir -p $HOME/.openrefine
tar xzf openrefine-$VERSION.tar.gz -C $HOME/.openrefine
rm openrefine-$VERSION.tar.gz

mkdir -p $HOME/openrefine

A binder/requirements.txt file to install the Python OpenRefine API client:


Note that this is a fork of the original client that supports Python 3. It works with OpenRefine 2.8 but I’m not sure if it works properly with OpenRefine 3. There are multiple forks of the client and from what I can tell they are differently broken. It would be great if the OpenRefine repo took on one fork as the official client that everyone could contribute to.

start definition – Autostarting headless OpenRefine Server


This start branch (repo) demonstrates:

  • using a binder/start file to auto start OpenRefine;
  • a notebook/client demo; this essentially runs in a headless mode.

The binder/start file extends the MyBinder start CMD, running the commands included in the file in addition to the default command that starts the Jupyter notebook server. (The binder/start file can also be used for things like setting environment variables. I’m not sure how to make an environment variable defined in binder/postBuild available inside binder/start?)


#Start OpenRefine
nohup $HOME/.openrefine/openrefine-2.8/refine -p 3333 -d "$HOME/openrefine" > /dev/null 2>&1 &

exec "$@"

In this demo, you won’t be able to see the OpenRefine GUI. Instead, you can access OpenRefine via its API using an OpenRefine Python client. An included notebook gives a worked example (note that at the moment you can’t run the first few parts of the demo because they assume the presence of a pre-existing OpenRefine project; instructions appear further down the notebook for creating a project and working with it using the API client. I’ll do a separate post on the OpenRefine Python client at some point…)

simpleproxy definition

The simpleproxy branch (repo) extends the start branch with a proxy that can be used to render the OpenRefine GUI.

The binder/requirements.txt needs an additional package: the jupyter-server-proxy package. (I’m using the repo version because at the time of writing the PyPI released version doesn’t include all the features we need…)


If you launch the Binder, it uses the serverproxy to proxy the OpenRefine port to proxy/3333/; note that the trailing slash is important. Without it, the static files (CSS etc) required to render the page are not resolved correctly.
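The trailing slash is easy to forget when building links by hand. Here’s a minimal sketch of a helper that always adds it; the function name proxy_url and the example server address are made up for illustration, and only the proxy/3333/ path comes from the setup above:

```python
def proxy_url(base_url, port):
    """Return the proxied URL for a port, always ending in a slash.

    base_url is the address of the notebook server (a placeholder in the
    example below). Without the trailing slash, relative links to CSS and
    other static files are resolved against the wrong path.
    """
    return "{}/proxy/{}/".format(base_url.rstrip("/"), port)


# Example with a made-up notebook server address
print(proxy_url("http://localhost:8888/", 3333))
```

Stripping any trailing slash from the base before joining means the helper gives the same result whether or not the base address already ends in one.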

traitlet-nolab definition

The traitlet-nolab branch (repo) uses the traitlet method (docs) to add a menu option to the Jupyter notebook homepage that allows OpenRefine to be started and launched from the notebook home New menu.

OpenRefine will also be started automatically if you start the MyBinder container with ?urlpath=openrefine or navigate directly to http://MYBINDERURL/openrefine.
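As a rough sketch, a launch link that opens straight into OpenRefine can be put together like this. The helper name binder_url is my own, and I’m assuming the usual mybinder.org/v2/gh/USER/REPO/REF URL pattern for GitHub repos:

```python
from urllib.parse import urlencode


def binder_url(user, repo, ref, urlpath=None):
    """Build a mybinder.org launch URL for a GitHub repo and branch."""
    url = "https://mybinder.org/v2/gh/{}/{}/{}".format(user, repo, ref)
    if urlpath:
        # urlpath tells MyBinder which path to open once the container starts
        url += "?" + urlencode({"urlpath": urlpath})
    return url


print(binder_url("psychemedia", "jupyterserverproxy-openrefine",
                 "traitlet-nolab", urlpath="openrefine"))
```

Leaving out the urlpath argument gives the plain launch link that opens on the notebook homepage.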

Start on Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

In this case, the binder/start invocation is not required.

Once started, OpenRefine will appear on a named proxy path, openrefine (the slash may be omitted in this case).

The traitlet is defined in a file:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'title': 'OpenRefine'
        },
    }
}

This is copied into the correct location by an additional binder/postBuild step:

mkdir -p $HOME/.jupyter/

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv $HOME/.jupyter/

The traitlet definition file is loaded in as a notebook server configuration file prior to starting the notebook server.

Note that the definition file uses the port: 3333 attribute to explicitly set the port that the server will be served against. If this is omitted, then a port will be dynamically allocated by the proxy server. In the case of OpenRefine, I am defining a port explicitly so that the Python API client can connect to it directly on the assumed default port 3333.

Note that if we try to use the Python client without starting the OpenRefine server by launching it, the connection will fail because there will be no running OpenRefine server for the client to connect to.
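One way to guard against that in a notebook is to check that something is listening on the port before creating the client. The following is a minimal sketch using only the standard library; the function name wait_for_port and the default timeout are my own choices, and the check only tests for an open TCP port, it does not start OpenRefine or confirm that it is OpenRefine that answers:

```python
import socket
import time


def wait_for_port(host="localhost", port=3333, timeout=30.0):
    """Return True once a TCP connection to host:port succeeds.

    Retries every half second and returns False if nothing is listening
    by the time `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing listening yet (or connection refused); try again
            time.sleep(0.5)
    return False
```

In a notebook, calling wait_for_port() before connecting the client gives a clear True/False answer rather than a connection error from deep inside the client library.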

Python package setup definition

The setup branch (repo) demonstrates:

  • using serverproxy (setup definition (docs)) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment once again so that we can work with the client package using default port settings.

Start in Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

For this build, we go back to the base setup (no binder/start, no traitlet definition files) and add a file:

import setuptools

setuptools.setup(
  # py_modules rather than packages, since we only have 1 file
  py_modules=['openrefine'],
  entry_points={
      'jupyter_serverproxy_servers': [
          # name = packagename:function_name
          'openrefine = openrefine:setup_openrefine',
      ]
  },
)

This calls on a file to define the configuration:

import os

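def setup_openrefine():
  home = os.environ['HOME']
  path = os.path.join(home, 'openrefine')
  # Build the executable path in Python to avoid relying on shell expansion of $HOME
  refine = os.path.join(home, '.openrefine', 'openrefine-2.8', 'refine')
  return {
    'command': [refine, '-p', '{port}', '-d', path],
    'port': 3333,
    'launcher_entry': {
        'title': 'OpenRefine',
    },
  }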

As before, an OpenRefine option is added to the start menu and can be used to start the OpenRefine server and launch the UI client on the path openrefine. (As we started the server on a known port, we can also find it explicitly at proxy/3333.)

Calling the aliased URL directly will also start the server. This means we can tell MyBinder to open on the openrefine path (or add ?urlpath=openrefine to the Binder URL) and the container will open into the OpenRefine application.

Once again, we need to launch the OpenRefine app before we can connect to it from the Python client.

master branch – traitlet definition, Notebook and JupyterLab Support

The master branch (repo) builds on the traitlet definition branch and demonstrates:

  • using serverproxy (traitlet definition) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment so that we can work with the client package.
  • a button is also enabled and added to the JupyterLab launcher.

OpenRefine can now be started and launched from the notebook homepage New menu or from the JupyterLab launcher, via a ?urlpath=openrefine MyBinder launch invocation, or by navigating directly to the proxied path openrefine.

Open to Notebook homepage: Binder

Open to OpenRefine: Binder

Open to Jupyterlab: Binder

In this case, we need to enable the JupyterLab extension with the following addition to the binder/postBuild file:

#Enable the OpenRefine icon in JupyterLab desktop launcher
jupyter labextension install jupyterlab-server-proxy

This will enable a default start button in the JupyterLab launcher.

We can also provide an icon for the start button. Further modify the binder/postBuild file to copy the logo to a desired location:

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv open-refine-logo.svg $HOME/.jupyter/

and modify the traitlet definition file with the addition of a path to the start logo, also ensuring that the launcher entry is enabled:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'enabled': True,
            'icon_path': '/home/jovyan/.jupyter/open-refine-logo.svg',
            'title': 'OpenRefine',
        },
    }
}

We should now see a start button for OpenRefine in the JupyterLab launcher.

Clicking on the button will autostart the server and open a browser tab onto the OpenRefine application GUI.


Running OpenRefine in MyBinder using JupyterServerProxy allows us to use OpenRefine as part of a shareable, on demand, serverless, Jupyter mediated workbench, defined via a public GitHub repository.

As well as being accessed as a GUI application, and via the Python API client, OpenRefine can also connect to a PostgreSQL server running inside the MyBinder container. For running PostgreSQL inside MyBinder, see Running a PostgreSQL Server in a MyBinder Container; for connecting OpenRefine to a Postgres server running in the same MyBinder container, see OpenRefine Database Connections in MyBinder.

OpenRefine Database Connections in MyBinder

With the version 3.0 release of OpenRefine last year, database integration was introduced that allows data to be imported into OpenRefine from a connected database, or exported to a downloadable SQL datadump. (It doesn’t look like you can save/export data to a new database table in the connected database, or upsert the contents of a cleaned table.) This was the release of the OpenRefine Database Import Extension and the SqlDump export mentioned in this earlier post.

If you want to try it out, I’ve created a MyBinder / repo2docker configuration repo that will launch a MyBinder repo containing both a running OpenRefine server and a running PostgreSQL server, although the test table is very small…

For how to run Postgres in a MyBinder container, see Running a PostgreSQL Server in a MyBinder Container.

Start in OpenRefine client: Binder

Details are:

  • Host: localhost
  • Port: 5432
  • User: testuser
  • Password: testpass
  • Database: testdb
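If you want to reach the same database from a notebook rather than from the OpenRefine GUI, the details above can be folded into a single connection URL of the sort accepted by libpq based tools and SQLAlchemy. This is just a sketch: the helper name postgres_url is mine, and it only builds the string rather than opening a connection:

```python
from urllib.parse import quote


def postgres_url(user, password, host="localhost", port=5432, database=""):
    """Build a postgresql:// connection URL, escaping the credentials."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, database
    )


# Connection details from the list above
print(postgres_url("testuser", "testpass", database="testdb"))
```

Escaping the user name and password means characters such as @ or / in a password don’t break the URL.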

There’s also a tiny seeded table in the database called quickdemo from which we can import data into OpenRefine:

I said it was a small table!

The rest of the db integration — SQL export — is described in the aforementioned post on OpenRefine’s SQL integration.

I have to admit I’m not sure what the workflow is? You’d typically want to put clean data into a database, rather than pull data from a database into OpenRefine for cleaning.

If you are using OpenRefine as a data cleaning tool, it would be useful to be able to export the data directly back into the connected database, either as an upserted table (as well as perhaps some row deletions) or as a new ..._clean table (“Upsert to database…”).

If you’re using OpenRefine as a data enrichment tool, being able to create a new, enriched table back in the connected database (“Export to database…”) would also make sense.

One of the things I’ll add to the to-do list is an example of how to export data from OpenRefine and then import it into the database using a simple Jupyter notebook script (a Jupyter notebook server is also running in the MyBinder container; just delete the openrefine/ path from the MyBinder URL).
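As a placeholder for that to-do item, here is a rough sketch of the shape such a script might take. To keep it self-contained it loads the CSV into an in-memory SQLite database rather than the PostgreSQL server, the CSV text is made up rather than a real OpenRefine export, and the function name load_csv and table name quickdemo_clean are my own:

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Create `table` from CSV text (header row required) and insert the rows.

    Table and column names are interpolated into the SQL, so they must come
    from a trusted source. Every column is stored as TEXT.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('DROP TABLE IF EXISTS "{}"'.format(table))
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), data
    )
    conn.commit()
    return len(data)


# Made-up data standing in for a CSV file exported from OpenRefine
exported = "id,name\n1,Alice\n2,Bob\n"
conn = sqlite3.connect(":memory:")
print(load_csv(conn, "quickdemo_clean", exported))
print(conn.execute('SELECT * FROM "quickdemo_clean"').fetchall())
```

Dropping and recreating the table each time is the crudest possible approach; a real workflow would more likely want an upsert against the original table.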

One of the new (to me) things I’ve spotted in OpenRefine 3 is the ability to export a Project Data Package. I mistakenly thought this might be something like a Frictionless Data data package format, but it looks to just be an export format for the OpenRefine project data? There are fields for import settings as well as descriptive metadata, but I don’t see any dialogues in the UI where you’d enter things like creator, contributors or description?

  "name": "clipboard",
  "tags": [],
  "created": "2019-02-09T22:13:42Z",
  "modified": "2019-02-09T22:14:31Z",
  "creator": "",
  "contributors": "",
  "subject": "",
  "description": "",
  "rowCount": 2,
  "title": "",
  "homepage": "",
  "image": "",
  "license": "",
  "version": "",
  "customMetadata": {},
  "importOptionMetadata": [
      "guessCellValueTypes": false,
      "projectTags": [
      "ignoreLines": -1,
      "processQuotes": true,
      "fileSource": "(clipboard)",
      "encoding": "",
      "separator": ",",
      "storeBlankCellsAsNulls": true,
      "storeBlankRows": true,
      "skipDataLines": 0,
      "includeFileSources": false,
      "headerLines": 1,
      "limit": -1,
      "quoteCharacter": "\"",
      "projectName": "clipboard"

One of the column operations you can perform in OpenRefine is to cast columns to text, dates or numerics, but I don’t think that is saved as metadata anywhere? You can also define column types in the SQL exporter, but again, I’m not sure that then becomes project metadata. It’d be good to see these things unified a bit, and framing such a process in terms of supporting a tabular data package (with things like column typing specified) could be useful.

Another foil for this might be supporting a SQLite export format?

I have to admit I’m a bit confused as to where OpenRefine sits in different workflows, particularly with data that is managed, and as such is most likely to be stored in some sort of database? (Lots of the OpenRefine tooling still harkens to a Linked Data future, so maybe it fits better in Linked Data workflows?) I also get the feeling that it shares a possible overlap with query engine tools such as Apache Drill, and maybe even document data extraction tools such as Apache Tika or Tabula. Again, seeing demonstrated toolchains and workflows in this area could be interesting.

Note to self: there are several other PDF table extractor tools out there alongside Tabula (Java) that I haven’t played with; eg R/pdftools, Python/Camelot and Python/pdfplumber.

Simon Willison is doing all sorts of useful stuff framing datasette as a datasette / SQLite ecosystem play. It could be useful to think a bit more about OpenRefine in terms of how it integrates with other data tools. For example, the X-to-sqlite tools help you start to structure variously formatted data sources in terms of a common SQLite representation, which can naturally incorporate things like column typing, but also the notion of database primary and foreign key columns. In a sense, OpenRefine provides a similar “import from anything-export to one format (CSV)” with a data cleaning step in the middle, but CSV is really informally structured in terms of its self-descriptive representation.

One of the insights I had when revising our TM351 relational database notebooks was that database table constraints can play a really useful role when helping clean a dataset by automatically identifying things that are wrong with it… I’ll maybe try to demonstrate an OpenRefine / Jupyter notebook hybrid workflow around that too…
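To give a flavour of what I mean, here’s a small self-contained sketch using SQLite from the Python standard library. The table, its constraints and the rows are all invented for the example; the point is just that rows the database refuses to accept are exactly the rows that need cleaning:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age INTEGER CHECK (age BETWEEN 0 AND 120)
    )
    """
)

# Made-up rows: a missing name, an impossible age and a duplicate key
rows = [(1, "Alice", 34), (2, None, 41), (3, "Carol", 340), (1, "Dave", 29)]

rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO person VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as err:
        # The error message names the kind of constraint that was violated
        rejected.append((row, str(err)))

for row, reason in rejected:
    print(row, "->", reason)
```

The rejected list could then be handed over to OpenRefine (or a notebook) as the set of records that need attention.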

By the by, I noticed this post the other day Exploring the dystopian future of a Javascript Gephi. Gephi, like OpenRefine, is a Java app, and like OpenRefine is one I’ve never been tempted to doodle with code wise for a couple of reasons: a) Java doesn’t appeal to me as a language; b) I don’t have a Java environment to hand, and the thought of trying to set up an environment, and all the build tools, as a novice, for a complex legacy project just leaves me cold. As the Gephi developers see it, “[w]e have to face it: the multiplatform is moving from Java to web technologies. Oracle wants a Java that powers backends, not a user interface framework.”

I’ve dabbled with OpenRefine off and on for years now, and whilst its browser accessibility is really handy, the docs could do with some attention (I guess that’s something I could make a positive contribution to). Also, if it was a Python, rather than Java, application, I’d be more comfortable with it and would possibly start to poke around inside it a bit…

I guess one of the things it can do (though I’ve never really had to push it) is scale with larger datasets, although the memory overhead may then become an issue? I think the R/Pandas crossover folk have been doing a lot of work on efficient datatable representations and scalable tabular data interchange formats, and I’m not sure if OpenRefine is drawing, or will draw, on any of that work?

It’s also been some time since I looked at Workbench, (indeed, I haven’t really looked at it since I posted an early review), but a quick peek at the repo shows a fair amount of activity. Maybe I should look at it again…?

Note to OUseful.Info Blog Email Subscribers…

I’m not sure how the email subscription works, but just so you know: the content of blog posts can go through multiple edits, not just for typos, but also rejoinders and clarifications, in the 10-30 minutes after the post is first published. So it’s worth viewing an email post you’re actually sent as a rough first draft, and clicking through to see the actual latest version. My blog, my rules!;-)

Research into Practice?

Picking up on So I was Wrong… Someone Does Look at the Webstats…, and the second part of the title of the Martin Weller blogpost that prompted it (Learning design – the long haul of institutional change), the question: how do we shorten the feedback loop so data can be used by course teams?

IET are a research wing who do their own thing for academic research credit and also contribute to internal innovation and change. (UPDATE: or not.. see @R3beccaF’s comment…) The research I referred to in the previous post drew on institutionally sourced data that looked like it required some sort of project in place in order to have it collected and is not something (I think) I have direct access to.

I get the need for research and folk to do stats and etc etc, but I also believe that folk with an interest can use often quite scruffy data to provide anecdotal evidence about what’s working and what isn’t (the “first draft” of more formal research perhaps?).

So for example, I’ve been (casually) interested in this for years, but have never done more than play around the edges, and not as much as I’d have liked. The only time I’ve managed to get access to reasonably granular page level tracking data was several years ago, when I persuaded someone to pop a Google tracking code, one I had dashboard access for, onto a set of course pages for a course I was sole author on. More recently, I’ve struggled to find many VLE stats on course pages I am still involved with (maybe I should check again; it’s been a while…).

On the other hand, I have a modicum of data skills, data storytelling / exploratory analysis skills, and end user app developer skills. And I’m interested in rapidly prototyping tools that may help make the data useful.

So I was Wrong… Someone Does Look at the Webstats…

Via a Martin Weller blogpost (Learning design – the long haul of institutional change), the phrase:

we now have a uniform design process across the university, and are one of the world leaders in this approach. It has allowed us to then match analytics against designs, and to develop a common language and representation.

I asked him for examples of “match[ing] analytics against designs” and got a pointer back to work by Bart Rienties et al. which I guess I should have been following over the years (I have reached out to various folk in IET over the years but never really got anywhere…).

Here’s an example, quickly found; in Linking students’ timing of engagement to learning design and academic performance, Nguyen, Quan; Huptych, Michal and Rienties, Bart (2018), Proceedings of the 8th International Conference on Learning Analytics and Knowledge, ACM, New York, pp. 141–150:

2.3 VLE engagement
The second dataset consisted of clickstream data of individual learners from the VLE and was retrieved using SAS Enterprise 9.4. The data were captured from four weeks before the start of the module until four weeks after the end of the module. Learning activities were planned over 30 weeks. Data were gathered in two semesters (Fall 2015 and Fall 2016) in order to validate the findings from two independent implementations. First, we would like to mention that the student behaviour record includes all students’ VLE activity. In other words, “the spent time” is determined as the time between any two clicks of a student, regardless a course and a type of the VLE activity. Further, not each click can be associated with studying time; for instance, there are clicks related to downloading of some material. We have this information about an action type which is connected with the click. Thus, we can determinate that a click with the connected action “download” was not included in the spent time of student in the analysis. Nonetheless, we can assume that the time of a click with the connected action “view” is associated with the time of learning of a study material for which the click is logged.

To compare the LD with the actual student behaviour, time spent on task was calculated as the duration between clicks. As pointed out by previous research [17], this metric could be problematic due to (1) the inability to differentiate between active time and non-active time (students leave the respective web page open and go for a coffee), and (2) the last click of the day is followed by a click next day), which makes the duration excessively long. Any attempt to set an arbitrary cut-off value would pose a threat in underestimating or overestimating of the actual engagement time.

Taking into account the context and LD of a module could produce a more informed cut-off value. Ideally, this cutoff value should be tailored to the design and context of each individual activity. For example, the cut-off value should be different between a 20 minutes activity and a 1-hour activity. While this study does not fully address the aforementioned problems, it leveraged the design of learning activities (discussion between researchers and designers) to set a cut-off value at 1 hour for all activity (e.g. any activity goes beyond 1 hour will be set as 1 hour).
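As I read it, the capped calculation described there comes down to something like the following sketch. This is my own paraphrase rather than the authors’ code: the function name and the click times are invented, and it ignores the action types (“view”, “download”) that the paper also takes into account:

```python
from datetime import datetime, timedelta


def time_on_task(click_times, cutoff=timedelta(hours=1)):
    """Sum the gaps between consecutive clicks, capping each gap at `cutoff`."""
    ordered = sorted(click_times)
    total = timedelta()
    for earlier, later in zip(ordered, ordered[1:]):
        total += min(later - earlier, cutoff)
    return total


# Made-up click times for one student
clicks = [
    datetime(2016, 10, 3, 9, 0),
    datetime(2016, 10, 3, 9, 20),   # 20 minute gap
    datetime(2016, 10, 3, 12, 20),  # 3 hour gap, capped at 1 hour
    datetime(2016, 10, 4, 8, 0),    # overnight gap, capped at 1 hour
]
print(time_on_task(clicks))  # 2:20:00
```

The example makes the weakness the authors mention quite visible: the overnight gap still contributes a full hour of “study time”.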

So, there is work going on, and it looks related to some of the approaches I’d like to be able to draw on to review the first year of presentation (at least) of a new course and I should apologise for that. I probably should make more effort to attend internal research events (I used to…) and should track their research outputs more rather than digging my own cess pit of vitriol and bile. (I guess I don’t help make the “friendly” course teams that Martin mentioned as being part of this effort…)

I guess I need to try to find more constructive ways of reaching out to folk in the OU.

PS By the by, the title of another paper, “A multi-modal study into students’ timing and learning regulation: time is ticking” reminds me of a thing I’d built but wasn’t allowed to deploy in my first engagement with HTML delivered learning materials, the T396 eSG (electronic study guide) (Keeping a Distance-Education Course Current Through eLearning and Contextual Assessment) where I used client side Javascript to pop up a widget if you’d spent too long on an eSG page and ask how you were doing. (I think I also tried to experiment with tracking time over a study session too (the eSG was frame based)). We dropped it on the grounds that it would probably be unreliable, and would almost certainly be irritating, as per Clippy. (I still think it’d have been interesting to try to iterate on a couple of times though…) We also had an experimental WAP site containing micro-info about the course, TMA submissions dates and so on. Anyone else remember WAP?!

Fragmentary Thoughts on Data (and “Analytics”) in Online Distance Education

A recent episode of TWiT Triangulation features Shoshana Zuboff, author of the newly released The Age of Surveillance Capitalism (which I’ve still to get, let alone read).

Watching the first ten minutes reminds me of Google’s early reluctance to engage in advertising. For example, in their 1998 paper The Anatomy of a Large-Scale Hypertextual Web Search Engine, Brin and Page (the founders of Google) write, in Appendix A of that paper, the following:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is “The Effect of Cellular Phone Use Upon Driver Attention”, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who “deserves” to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias are likely to be tolerated by the market. For example, a search engine could add a small factor to search results from “friendly” companies, and subtract a factor from results from competitors. This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline’s homepage when the airline’s name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.

How times change.

Back to the start of the Triangulation episode, and Leo Laporte reminisces on how in the early days of Google the focus was on using data to optimise the performance of the search engine — that is, to optimise the way in which search results were presented on a page in response to a user query. Indeed, the first design goal listed in the Anatomy of a Search Engine paper is to “improve the quality of web search engines”.

In contrast, today’s webcos seek to maximise revenues by modeling, predicting, and even influencing, user behaviours in order to encourage users to enter into financial transactions. Google takes an early cut from others’ potential revenues arising from potential transactions in the form of advertising revenue.

At which point, let’s introduce learning analytics. I think the above maps well on to how I see the role of analytics in education. I am still firmly in the camp of Appendix A. I think we should use data to improve the performance of the things we control and use data to inform changes to the things we control. I see learning analytics as a bastard child of a Surveillance Capitalism worldview.

Looking back to the early archives, here and in my original (partially complete) blog archive, I’ve posted several times over the years about how we might make use of “analytics” data to maintain and improve the things we control.

Treating our VLE course pages as a website

In the OU, a significant portion of the course content of an increasing number of courses is delivered as VLE website content. Look at an OpenLearn course to get a feel for what this content looks like. In the OU, the VLE is not used as a place to dump lecture notes: it is the lecture.

The VLE content is under our control. We should use website performance data to improve the quality of our web pages (which is to say, our module content). During module production (in some modules at least) a lot of design effort is put into limiting and chunking content so as not to overload students (word limits in the content we produce; guides about how much time to spend on a particular activity).

So do we make use of simple (basic) web analytics to track this? To track how long students spend on a particular web page, to track whether they ever click on links to external resources, to track the sorts of study patterns students appear to have so we can better chunk our content (eg from the web stats, do they study in one hour blocks, two hour blocks, four hour blocks?) or better advise online forum moderators as to when students are online so we can maybe even provide a bit of realtime interaction/support?

If students appear to spend far longer on a page than the design budgeted for it, is that ever flagged up to us?
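The sort of check I have in mind needn’t be complicated. As a sketch, with made-up page names, timings and budgets, and an arbitrary 1.5x tolerance that I’ve plucked out of the air:

```python
from statistics import median


def over_budget(page_times, budgets, tolerance=1.5):
    """Return pages whose median time on page exceeds tolerance x budget.

    page_times maps page name -> list of minutes spent per student visit;
    budgets maps page name -> minutes the design allowed for that page.
    """
    flagged = {}
    for page, times in page_times.items():
        if page in budgets and times:
            typical = median(times)
            if typical > tolerance * budgets[page]:
                flagged[page] = typical
    return flagged


# Made-up data: minutes per visit, and the design budget for each page
page_times = {"week1/intro": [12, 15, 9, 14], "week1/activity2": [55, 70, 48, 90]}
budgets = {"week1/intro": 15, "week1/activity2": 30}
print(over_budget(page_times, budgets))  # {'week1/activity2': 62.5}
```

Using the median rather than the mean stops one student who left a tab open all afternoon from flagging a page all on their own.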

From my perspective, I don’t get to see that data or the opportunity to make changes based on it.

(There’s “too much data” to try to collect it all, apparently. (By the by, was that a terabyte SD card I saw recently go on sale?) At one point crude stats for daily(?) page usage were available to us in the VLE, but I haven’t checked recently to see what stats I can easily download from there (pointers would be much welcomed…). Even crude data might be useful to module teams (eg see the heatmap in this post on Teaching Material Analytics).)

I’ve posted similar rants before. See also rants on things like not doing A/B testing. I also did a series of posts on Library web analytics and have a scraggy script for analysing FutureLearn data as available a couple of years ago here.

Note that there is one area where I know we do use stats to improve materials, or modify internal behaviour, and that’s in assessment. Looking at data from online quiz questions can identify if questions are too easy, or too hard, or whether we maybe need to teach something better if one of the distractors is getting selected as the right answer too often.
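For a single multiple choice question, that sort of check might look something like the sketch below. The responses are invented and I have no idea how the stats are actually run internally; the function name item_stats is mine, and “facility” here just means the proportion of students who picked the correct option:

```python
from collections import Counter


def item_stats(responses, correct):
    """Return the facility (share correct) and the share picking each option."""
    n = len(responses)
    if n == 0:
        return 0.0, {}
    counts = Counter(responses)
    shares = {option: count / n for option, count in counts.items()}
    return shares.get(correct, 0.0), shares


# Made-up responses to one question where "A" is the right answer
responses = list("AABACBBABB")
facility, shares = item_stats(responses, correct="A")

# Find the most popular wrong answer
distractors = {option: share for option, share in shares.items() if option != "A"}
top = max(distractors, key=distractors.get)
print(facility, top, distractors[top])
```

Here distractor B is chosen more often than the correct answer, which is the sort of pattern that suggests either the question or the teaching needs another look.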

In tutor marked and end of course assessment, we also use stats to shape question level stats or modify individual tutor marks (the numbers are such that excessively harsh or generous markers can often be identified, and their awarded marks statistically tweaked to bring them into line with other markers as a whole).

In both those cases, we do use data to modify OUr behaviour and things we control.

Search Data

This is something we don’t get to see from course material or conveniently at new module/curriculum planning time.

For example, what are (new) students searching for on the OU website in subject related terms? (I used to get quite het up about the way we wrote course descriptions in course listings on the OU website, arguing that it’s all very well putting in words describing the course that students will understand once they’ve finished the course, but it doesn’t help folk find that page when they don’t have the vocabulary and won’t be using those search terms…) Or what subjects are folk searching for on OpenLearn or FutureLearn (the OU owns FutureLearn, though I’m not sure what benefits accrue from it back to the OU)?

In terms of within-course related searching, what terms are students searching for and how might we use that information to improve navigation, glossary items, within-module “SEO”? Again, how might we use data that is available, or that can be collected, to improve the thing we control (the course content)?

UPDATE — Okay, So Maybe We Do Run the Numbers

Via a blog post in my feeds, a tweet chaser from me to the author, and a near immediate response, maybe I was wrong: maybe we are closing the loop (at least, in a small part of the OU): see here: So I was Wrong… Someone Does Look at the Webstats….

I know I live on the Isle of Wight, but for years it’s felt like I’ve been sent to Coventry.

Learning Analytics

The previous two sections correspond to my Appendix 8 world view, and original design goal of “improving the quality of module content web pages”, a view that never got traction because… I don’t know. I really don’t know. Too mundane, maybe?

That approach also stands in marked contrast to the learning analytics view, which is more akin to the current dystopia being developed by Google et al. In this world, data is collected not to improve the thing we control (the course content, structure and navigation) but to control the user so they better meet our metrics. Data is collected not so that we can make interventions in the thing we control (the course content, structure and navigation) but “the product” — the student. Interventions are there so we can tell the students where they are going wrong, where they are not performing.

The fact that we spend £loads on electronic resources that (perhaps) no-one ever uses (I don’t know – they may do? I don’t see the click stats) is irrelevant.

The fact that students do or don’t watch videos, or bail out of watching videos after 3 minutes (so maybe we shouldn’t make four minute videos?), is not something that gets back to the course team. I can imagine that more likely would be an email to a student as an intervention saying “we notice you don’t seem to be watching the videos…”

But in such a case, IT’S NOT A STUDENT PROBLEM, IT’S A CONTENT DESIGN PROBLEM. Which is to say, it’s OUr problem, and something we can do something about.


It would be so refreshing to have a chance to explore a data driven course maintenance model on a short course presented a couple of times a year for a couple of years. We could use this as a testbed to explore setting up feedback loops to monitor intended design goals (time on activity, for example, or designed pacing of materials compared to actual use pacing) and maybe even engage in a bit of A/B testing.

How to Create a Simple Dockerfile for Building an OpenRefine Docker Image

Over the last few weeks, I’ve been exploring serving OpenRefine in various ways, such as on a vanilla Digital Ocean Linux server or using Docker, as well as using MyBinder (blog post to come…).

So picking up on the last post (OpenRefine on Digital Ocean using Docker), here’s a quick walkthrough of how we can go about creating a Dockerfile, the script used to create a Docker container, for OpenRefine.

First up, an annotated recipe for building OpenRefine from scratch from the current repo from Thad Guidry (via):

#Bring in a base container
#Alpine is quite lite, and we can get a build with JDK-8 already installed
FROM maven:3.6.0-jdk-8-alpine

#We need to install git so we can clone the OpenRefine repo
RUN apk add --no-cache git

#Clone the current repo
RUN git clone 

#Build the OpenRefine application
RUN OpenRefine/refine build

#Create a directory we can save OpenRefine user project files into
RUN mkdir /mnt/refine

#Mount a Docker volume against that directory.
#This means we can save data to another volume and persist it
#if we get rid of the current container.
VOLUME /mnt/refine

#Expose the OpenRefine server port outside the container

#Command to start the OpenRefine server when the container starts
CMD ["OpenRefine/refine", "-i", "", "-d", "/mnt/refine"]

You can build the container from that Dockerfile by cding into the same directory as the Dockerfile and running something like:

docker build -t psychemedia/openrefine .

The -t flag tags the image (that is, names it); the . says look to the current directory for the Dockerfile.

You could then run the container using something like:

docker run --rm -d --name openrefine -p 3334:3333 psychemedia/openrefine

One of the disadvantages of the above build process is that it produces a container that still contains the build files, and the tooling required to build it, as well as the application files. This means that the container is larger than it need be. It’s also not quite a release?

I think we can also add RUN OpenRefine/refine dist RELEASEVERSION to then create a release, but there is a downside that this step will fail if a test fails.

We’d then have to tidy up a bit, which we could do with a multistage build. Simon Willison has written a really neat sketch around this on building smaller Python Docker images that provides a handy crib. In our case, we could FROM the same base container (or maybe a JRE, rather than JDK, populated version, if OpenRefine can run just with a JRE?) and copy across the distribution file created from the distribution build step; from that, we could then install the application.
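As a rough, untested sketch of what such a multistage build might look like: the first stage is the from-scratch build recipe, and the second stage copies the built OpenRefine directory (rather than a packaged distribution file) into a smaller JRE image. Several things here are my assumptions rather than anything confirmed: REPO_URL is a hypothetical build argument that you would need to supply, the /OpenRefine paths are my choice, port 3333 is taken from the docker run port mapping, binding to 0.0.0.0 is assumed so the server listens on all interfaces, and it remains an open question whether the refine script runs happily with just a JRE.

```dockerfile
# Stage 1: build OpenRefine using the full JDK + Maven toolchain
FROM maven:3.6.0-jdk-8-alpine AS builder

RUN apk add --no-cache git

# REPO_URL is a hypothetical build argument (no default):
# pass the OpenRefine repo URL in with --build-arg REPO_URL=...
ARG REPO_URL
RUN git clone --depth 1 "$REPO_URL" /OpenRefine
RUN /OpenRefine/refine build

# Stage 2: start again from a smaller JRE-only image
# (assumption: OpenRefine can run with just a JRE)
FROM openjdk:8-jre-alpine

# Assumption: the refine launcher script needs bash
RUN apk add --no-cache bash

# Copy only the built application across from the first stage;
# Maven, git and the Maven cache are left behind
COPY --from=builder /OpenRefine /OpenRefine

RUN mkdir /mnt/refine
VOLUME /mnt/refine

# Assumption: 3333 is the port the server listens on
EXPOSE 3333

CMD ["/OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]
```

This still copies the source tree along with the built classes, so it is a halfway house: smaller than the single stage image, but not as tidy as installing from a proper distribution archive.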

So let’s go to that other extreme and look at a Dockerfile for building a container from a specific release/distribution.

The OpenRefine releases page lists all the OpenRefine releases. Looking at the download links for the Linux distribution, the URLs take the form:$RELEASE/openrefine-linux-$RELEASE.tar.gz.

So how do we install an OpenRefine server from a distribution file?

#We can use the smaller JRE rather than the JDK
FROM openjdk:8-jre-alpine as builder


#Download a couple of required packages
RUN apk update && apk add --no-cache wget bash

#We can pass variables into the build process via --build-arg variables
#We name them inside the Dockerfile using ARG, optionally setting a default value

#ENV vars are environment variables that get baked into the image
#We can pass an ARG value into a final image by assigning it to an ENV variable

#There's a handy discussion of ARG versus ENV here:

#Download a distribution archive file
RUN wget --no-check-certificate$RELEASE/openrefine-linux-$RELEASE.tar.gz

#Unpack the archive file and clear away the original download file
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz  && rm openrefine-linux-$RELEASE.tar.gz

#Create an OpenRefine project directory
RUN mkdir /mnt/refine

#Mount a Docker volume against the project directory
VOLUME /mnt/refine

#Expose the server port

#Create the start command.
#Note that the application is in a directory named after the release
#We use the environment variable to set the path correctly
CMD openrefine-$RELEASE/refine -i -d /mnt/refine

We can now build an image of the default version as baked into the Dockerfile:

docker build -t psychemedia/openrefinedemo .

Or we can build against a specific version:

docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .
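To see the ARG to ENV hand-off in isolation, here is a minimal throwaway Dockerfile. It is not part of the OpenRefine build: the alpine base image and the default value of 3.1 are assumptions chosen purely for illustration.

```dockerfile
# Throwaway demo image; the alpine base is an arbitrary choice
FROM alpine

# ARG defines a build-time variable; the default (3.1 here, an example value)
# is used unless --build-arg RELEASE=... is passed to docker build
ARG RELEASE=3.1

# ARG values are not available in the running container,
# so copy the value into an ENV variable to bake it into the image
ENV RELEASE=$RELEASE

# Shell form CMD, so $RELEASE is expanded from the environment at run time
CMD echo "This image was built for release $RELEASE"
```

Built with --build-arg RELEASE=3.1-beta, running the resulting image echoes 3.1-beta; built without the flag, it echoes the default. The same mechanism is what lets a shell form CMD pick up the release specific directory name at run time.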

To peek inside the container, we run it and jump into a bash shell inside it:

docker run --rm -i -t psychemedia/openrefinedemo /bin/bash

We run the container as before:

docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo


PS Note that when running an OpenRefine container on something like Digital Ocean using the default OpenRefine memory settings, you may have trouble starting OpenRefine on machines smaller than 3GB. (I’ve had some trouble getting it started on a 2GB server.)