Fragmentary Thoughts on Data (and “Analytics”) in Online Distance Education

A recent episode of TWiT Triangulation features Shoshana Zuboff, author of the newly released The Age of Surveillance Capitalism (which I’ve still to get, let alone read).

Watching the first ten minutes reminds me of Google's early reluctance to engage in advertising. For example, in Appendix A of their 1998 paper The Anatomy of a Large-Scale Hypertextual Web Search Engine, Brin and Page (the founders of Google) write the following:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is “The Effect of Cellular Phone Use Upon Driver Attention”, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who “deserves” to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias are likely to be tolerated by the market. For example, a search engine could add a small factor to search results from “friendly” companies, and subtract a factor from results from competitors. This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline’s homepage when the airline’s name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.

How times change.

Back to the start of the Triangulation episode, and Leo Laporte reminisces on how in the early days of Google the focus was on using data to optimise the performance of the search engine — that is, to optimise the way in which search results were presented on a page in response to a user query. Indeed, the first design goal listed in the Anatomy of a Search Engine paper is to “improve the quality of web search engines”.

In contrast, today's webcos seek to maximise revenues by modelling, predicting, and even influencing user behaviours in order to encourage users to enter into financial transactions. Google takes an early cut of the revenues arising from those potential transactions in the form of advertising revenue.

At which point, let’s introduce learning analytics. I think the above maps well on to how I see the role of analytics in education. I am still firmly in the camp of Appendix A. I think we should use data to improve the performance of the things we control and use data to inform changes to the things we control. I see learning analytics as a bastard child of a Surveillance Capitalism worldview.

Looking back to the early OUseful.info archives, here and in my original (partially complete) blog archive, I’ve posted several times over the years about how we might make use of “analytics” data to maintain and improve the things we control.

Treating our VLE course pages as a website

In the OU, a significant portion of the course content of an increasing number of courses is delivered as VLE website content. Look at an OpenLearn course to get a feel for what this content looks like. In the OU, the VLE is not used as a place to dump lecture notes: it is the lecture.

The VLE content is under our control. We should use website performance data to improve the quality of our web pages (which is to say, our module content). During module production (in some modules at least) a lot of design effort is put into limiting and chunking content so as not to overload students (word limits in the content we produce; guidance about how much time to spend on a particular activity).

So do we make use of simple (basic) web analytics to track this? To track how long students spend on a particular web page, to track whether they ever click on links to external resources, to track what sorts of study patterns students appear to have so we can better chunk our content (eg from the web stats, do they study in one hour blocks, two hour blocks, four hour blocks?) or better advise online forum moderators as to when students are online, so we can maybe even provide a bit of realtime interaction/support?

If students appear to spend far longer on a page than the design budgeted for it, is that ever flagged up to us?

From my perspective, I don't get to see that data, nor do I get the opportunity to make changes based on it.

(There's "too much data" to try to collect it all, apparently. (By the by, was that a terabyte SD card I saw has recently gone on sale?) At one point, crude stats for daily(?) page usage were available to us in the VLE, but I haven't checked recently to see what stats I can download from there easily (pointers would be much welcomed…). Even crude data might be useful to module teams (eg see the heatmap in this post on Teaching Material Analytics).)

I've posted similar rants before. See also rants on things like not doing A/B testing. I also did a series of posts on Library web analytics, and I have a scraggy script here for analysing FutureLearn data as it was available a couple of years ago.

Note that there is one area where I know we do use stats to improve materials, or modify internal behaviour, and that's in assessment. Looking at data from online quiz questions can identify if questions are too easy, or too hard, or whether we maybe need to teach something better if one of the distractors is getting selected as the right answer too often.

In tutor marked assignments and end of course assessment, we also use stats to review performance at the question level and to moderate individual tutor marks (the numbers are such that excessively harsh or generous markers can often be identified, and their awarded marks statistically tweaked to bring them into line with other markers as a whole).
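Purely by way of illustration (this is a toy sketch of the sort of adjustment described, not the actual OU procedure), marks from different markers might be rescaled to a common mean and spread:

#Toy sketch: rescale each marker's marks to the overall mean and spread
import statistics

def moderate(marks_by_marker):
    all_marks = [m for marks in marks_by_marker.values() for m in marks]
    target_mean = statistics.mean(all_marks)
    target_sd = statistics.pstdev(all_marks)
    adjusted = {}
    for marker, marks in marks_by_marker.items():
        mu = statistics.mean(marks)
        sd = statistics.pstdev(marks) or 1  #Guard against zero spread
        adjusted[marker] = [round(target_mean + (m - mu) * target_sd / sd, 1) for m in marks]
    return adjusted

#A generous marker and a harsh marker marking comparable work
print(moderate({"generous": [85, 90, 78], "harsh": [55, 60, 48]}))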

In both those cases, we do use data to modify OUr behaviour and things we control.

Search Data

This is something we don't get to see, either from within course materials or, conveniently, at new module/curriculum planning time.

For example, what are (new) students searching for on the OU website in subject related terms? (I used to get quite het up about the way we wrote course descriptions in course listings on the OU website, arguing that it's all very well putting in words describing the course that students will understand once they've finished the course, but it doesn't help folk find that page when they don't have the vocabulary and won't be using those search terms…) Or what subjects are folk searching for on OpenLearn or FutureLearn (the OU owns FutureLearn, though I'm not sure what benefits accrue from it back to the OU?).

In terms of within-course searching, what terms are students searching for, and how might we use that information to improve navigation, glossary items, or within-module "SEO"? Again, how might we use data that is available, or that can be collected, to improve the thing we control (the course content)?

UPDATE — Okay, So Maybe We Do Run the Numbers

Via a blog post in my feeds, a tweet chaser from me to the author, and a near immediate response, maybe I was wrong: maybe we are closing the loop (at least, in a small part of the OU): see here: So I was Wrong… Someone Does Look at the Webstats….

I know I live on the Isle of Wight, but for years it’s felt like I’ve been sent to Coventry.

Learning Analytics

The previous two sections correspond to my Appendix A world view, and the original design goal of "improving the quality of module content web pages", a view that never got traction because… I don't know. I really don't know. Too mundane, maybe?

That approach also stands in marked contrast to the learning analytics view, which is more akin to the current dystopia being developed by Google et al. In this world, data is collected not to improve the thing we control (the course content, structure and navigation) but to control the user so that they better meet our metrics. Data is collected not so that we can make interventions in the thing we control, but so that we can make interventions in "the product" — the student. Interventions are there so we can tell the students where they are going wrong, where they are not performing.

The fact that we spend £loads on electronic resources that (perhaps) no-one ever uses (I don’t know – they may do? I don’t see the click stats) is irrelevant.

The fact that students do or don’t watch videos, or bail out of watching videos after 3 minutes (so maybe we shouldn’t make four minute videos?), is not something that gets back to the course team. I can imagine that more likely would be an email to a student as an intervention saying “we notice you don’t seem to be watching the videos…”

But in such a case, IT’S NOT A STUDENT PROBLEM, IT’S A CONTENT DESIGN PROBLEM. Which is to say, it’s OUr problem, and something we can do something about.

Conclusion

It would be so refreshing to have a chance to explore a data driven course maintenance model on a short course presented a couple of times a year for a couple of years. We could use this as a testbed to explore setting up feedback loops to monitor intended design goals (time on activity, for example, or designed pacing of materials compared to actual use pacing) and maybe even engage in a bit of A/B testing.

How to Create a Simple Dockerfile for Building an OpenRefine Docker Image

Over the last few weeks, I've been exploring serving OpenRefine in various ways, such as on a vanilla Digital Ocean Linux server or using Docker, as well as using MyBinder (blog post to come…).

So picking up on the last post (OpenRefine on Digital Ocean using Docker), here’s a quick walkthrough of how we can go about creating a Dockerfile, the script used to create a Docker container, for OpenRefine.

First up, an annotated recipe for building OpenRefine from scratch from the current repo from Thad Guidry (via):

#Bring in a base container
#Alpine is quite light, and we can get a build with JDK-8 already installed
FROM maven:3.6.0-jdk-8-alpine
MAINTAINER thadguidry@gmail.com

#We need to install git so we can clone the OpenRefine repo
RUN apk add --no-cache git

#Clone the current repo
RUN git clone https://github.com/OpenRefine/OpenRefine.git 

#Build the OpenRefine application
RUN OpenRefine/refine build

#Create a directory we can save OpenRefine user project files into
RUN mkdir /mnt/refine

#Mount a Docker volume against that directory.
#This means we can save data to another volume and persist it
#if we get rid of the current container.
VOLUME /mnt/refine

#Expose the OpenRefine server port outside the container
EXPOSE 3333

#Command to start the OpenRefine server when the container starts
CMD ["OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]

You can build an image from that Dockerfile by cd-ing into the same directory as the Dockerfile and running something like:

docker build -t psychemedia/openrefine .

The -t flag tags the image (that is, names it); the . says look in the current directory for the Dockerfile.

You could then run the container using something like:

docker run --rm -d --name openrefine -p 3334:3333 psychemedia/openrefine

One of the disadvantages of the above build process is that it produces an image that still contains the build files, and the tooling required to build them, as well as the application files. This means that the image is larger than it needs to be. It's also not quite a release?

I think we can also add RUN OpenRefine/refine dist RELEASEVERSION to then create a release, but there is a downside that this step will fail if a test fails.

We'd then have to tidy up a bit, which we could do with a multistage build. Simon Willison has written a really neat sketch around this on building smaller Python Docker images that provides a handy crib. In our case, we could FROM the same base container (or maybe a JRE, rather than JDK, populated version, if OpenRefine can run with just a JRE?) and copy across the distribution file created by the distribution build step; from that, we could then install the application.
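Purely as an untested sketch (the location and name of the generated distribution archive are guesses on my part, as is the assumption that the built distribution runs happily under a JRE), a multistage version of the build might look something like this:

#Stage 1: build a distribution archive from source
FROM maven:3.6.0-jdk-8-alpine AS builder
RUN apk add --no-cache git
RUN git clone https://github.com/OpenRefine/OpenRefine.git
#Build, then package a release (as noted above, the dist step may fail if a test fails)
RUN OpenRefine/refine build && OpenRefine/refine dist 3.1

#Stage 2: copy just the built distribution into a smaller, JRE-only image
FROM openjdk:8-jre-alpine
RUN apk add --no-cache bash
#The path and filename of the generated archive are assumptions that would need checking
COPY --from=builder /OpenRefine/build/openrefine-linux-3.1.tar.gz /tmp/
RUN tar -xzf /tmp/openrefine-linux-3.1.tar.gz && rm /tmp/openrefine-linux-3.1.tar.gz

RUN mkdir /mnt/refine
VOLUME /mnt/refine
EXPOSE 3333
CMD openrefine-3.1/refine -i 0.0.0.0 -d /mnt/refine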

So let’s go to that other extreme and look at a Dockerfile for building a container from a specific release/distribution.

The OpenRefine releases page lists all the OpenRefine releases. Looking at the download links for the Linux distribution, the URLs take the form:

https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz.

So how do we install an OpenRefine server from a distribution file?

#We can use the smaller JRE rather than the JDK
FROM openjdk:8-jre-alpine as builder

MAINTAINER tony.hirst@gmail.com

#Download a couple of required packages
RUN apk update && apk add --no-cache wget bash

#We can pass variables into the build process via --build-arg variables
#We name them inside the Dockerfile using ARG, optionally setting a default value
ARG RELEASE=3.1

#ENV vars are environment variables that get baked into the image
#We can pass an ARG value into a final image by assigning it to an ENV variable
ENV RELEASE=$RELEASE

#There's a handy discussion of ARG versus ENV here:
#https://vsupalov.com/docker-arg-vs-env/

#Download a distribution archive file
RUN wget --no-check-certificate https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz

#Unpack the archive file and clear away the original download file
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz  && rm openrefine-linux-$RELEASE.tar.gz

#Create an OpenRefine project directory
RUN mkdir /mnt/refine

#Mount a Docker volume against the project directory
VOLUME /mnt/refine

#Expose the server port
EXPOSE 3333

#Create the start command.
#Note that the application is in a directory named after the release
#We use the environment variable to set the path correctly
CMD openrefine-$RELEASE/refine -i 0.0.0.0 -d /mnt/refine

We can now build an image of the default version as baked into the Dockerfile:

docker build -t psychemedia/openrefinedemo .

Or we can build against a specific version:

docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .

To peek inside the container, we run it and jump into a bash shell inside it:

docker run --rm -i -t psychemedia/openrefinedemo /bin/bash

We run the container as before:

docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo
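If you want to check that the server has come up okay, the standard docker logs command works against the named container:

#Follow the OpenRefine server log output (Ctrl-C to stop following)
docker logs -f openrefine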

Useful?

PS Note that when running an OpenRefine container on something like Digital Ocean using the default OpenRefine memory settings, you may have trouble starting OpenRefine on machines smaller than 3GB. (I've had some trouble getting it started on a 2GB server.)
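One workaround may be to cap the Java heap when starting the container by overriding the default command and passing the refine script's memory option (the -m flag is OpenRefine's documented way of setting its memory allocation, but the exact invocation below, with the path for the 3.1 release baked in, is an untested sketch):

docker run --rm -d -p 3333:3333 --name openrefine \
  psychemedia/openrefinedemo \
  openrefine-3.1/refine -i 0.0.0.0 -d /mnt/refine -m 1024M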

Is Your Phone Listening to You? Fragmentary Notes on Trusting Corporates…

Many folk will have seen stories or posts floating around the internet claiming that someone was talking to someone else about X one day and they suddenly started receiving adverts about it on their phone, the assumption being that the phone listened in on the conversation, picked out the keywords and sent an ad on that basis.

One likely alternative explanation is that the person experiencing this had just primed, or sensitised, themselves to that ad. We blank out thousands of ads every day, at a conscious level at least, but that doesn't mean we don't see them. And by talking about a thing we are then primed (self-primed?) to consciously notice it if it does cross our attention path soon after (my cog psych knowledge is not that good; there are probably some really good experiments and mechanisms around to explain this… eg stuff).

Another possible contributory factor is that the models are getting better at prediction. You do a sporty thing at a particular location (your phone knows where you are) and talk about different deodorant products afterwards. You then spot a deodorant ad. Your phone has been listening to you. Or maybe your phone (or the services or networks it is connected to) spotted you were at a sporty location, there was no phone activity for an hour and a half, (not even jiggling around, as detected by the gyros, so you left your phone somewhere; or maybe the signal died when you put it in a locker) so maybe you were doing something sporty, so maybe: worth a shot at advertising a deodorant?

Now the phone may or may not be being used to listen to you in the audible sense  of hearing your spoken conversations (it’s certainly being used to “listen” to your actions in web tracking ways, for example), and the webcos et al. tend to protest that they don’t. But they don’t make life easy for themselves with the sorts of things they do announce they can do.

For example, in a recent blog post on the Google AI blog, Real-time Continuous Transcription with Live Transcribe, there’s this:

Today, we’re announcing Live Transcribe, a free Android service that makes real-world conversations more accessible by bringing the power of automatic captioning into everyday, conversational use. Powered by Google Cloud, Live Transcribe captions conversations in real-time…

Okay, so if you have a network connection, your phone could transcribe any audio it heard in real time. Google are celebrating that fact. There is no technology blocker if access to the microphone and internet connection are available, and the microphone is in range of the conversation.

Potential future improvements in mobile-based automatic speech transcription include on-device recognition, …

So they also want to be able to do it on the phone…

The world is full of such apparent contradictions. On the one hand, conspiracy theories about what the tech giants (increasingly, rather than "the state", as in the case of China) are doing; on the other, announcements by the same companies about what they don't (as a matter of policy), can (technically), and do want to (technically) do.

Which comes down to a question of trust that the policies they operate under are: a) sound; b) followed; c) not not followed.

Here are some more possible contradictions…

We trust the Amazon store as a place to shop, right? Like a supermarket or department store selling branded goods. But hold on a minute… For a start, it's increasingly a marketplace, and just like a free market or a car boot sale, buyer beware. At scale. Why? Well, Amazon Warned Apple of Counterfeit Products in 2016 and is now Warning Investors that Counterfeit Products are a Problem; there are plenty of other stories about counterfeit products on Amazon out there.

Something else to note about Amazon is that they are like a supermarket in a certain respect: they sell a wide range of own brand items, although you may not realise it. One way of trying to track down what brands they own is to look at the WIPO Trademark database.

(I built a trademarks by company explorer once, using OpenCorporates data and OpenRefine, but I suspect it’s rotted by now. Maybe worth revisiting, along with something that mines companies in a corporate grouping and grabs trademarks associated with all of them?)

So how about another Google story — Advancing research on fake audio detection:

When you listen to Google Maps driving directions in your car, get answers from your Google Home, or hear a spoken translation in Google Translate, you’re using Google’s speech synthesis, or text-to-speech (TTS) technology. …

Over the last few years, there’s been an explosion of new research using neural networks to simulate a human voice. These models, including many developed at Google, can generate increasingly realistic, human-like speech.

While the progress is exciting, we’re keenly aware of the risks this technology can pose if used with the intent to cause harm. Malicious actors may synthesize speech to try to fool voice authentication systems, or they may create forged audio recordings to defame public figures.

We’re taking action. When we launched the Google News Initiative last March, we committed to releasing datasets that would help advance state-of-the-art research on fake audio detection.  Today, we’re delivering on that promise…

On the one hand, the Goog is trying to create ever more authentic voices. On the other, so are the bad guys. (Google, by implication, is not a bad guy.)

By releasing some of their data, they hope to encourage third parties to create systems to distinguish real voices from machine generated ones.

The way research works, of course, is that the folk (at Google…) who create machine generated voices will presumably try to improve their creations to avoid detection by the new improved machine generated voice detectors on the grounds of “improving customer experience”…

In their defense:

As we published in our AI Principles last year, we take seriously our responsibility both to engage with the external research community, and to apply strong safety practices to avoid unintended results that create risks of harm.

So I wonder: in a spirit of responsibility, will Google add something inaudible to its machine generated voices that flags a voice as machine generated, even one sent over a low pass filtered phone connection? This would make it trivially easy to detect a Google generated voice and prevent lazy bad guys from using it to fool other machines.

Finally, and this may seem like more Google bashing, but hey, this is just a sample pulled from today's feeds (I'm on catch up): how do these big cos develop the trust needed to support a belief that they will formulate and adhere to sound policies that are not just for corporate benefit but, at the very least, do no harm to the rest of society, if not actually benefit it? By being good (corporate) citizens in wider society? Google pays more in EU fines than it does in taxes. Erm…?

JupyterTips: Launching Jupyter Notebooks Into a Particular Browser

There are just so many Jupyter related settings and configs that I’m going to start making short posts about them, tagged JupyterTips (feed), to try to help me remember what they are and how to invoke them…

TIL (Today I Learned)…

…you can define which browser a newly launched Jupyter notebook server will open into. By default, this is the system default browser, but you can override it with the --NotebookApp.browser argument. For example:

jupyter notebook --NotebookApp.browser=firefox
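If you always want the same behaviour, you can presumably also set the corresponding option in a Jupyter config file rather than passing it on the command line each time (a minimal sketch; on some platforms you may need a fuller browser specification than a bare name):

#In ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.browser = 'firefox'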

See more commandline settings at: Jupyter Notebook — Config file and command line options

What is Coding?

I have no idea…

Here’s a first attempt:

the act of creating machine readable representations using formal syntax.

Which is to say:

  • act: something practical, possibly purposive (so should that be intentional act?), which also makes it a skill and a craft?
  • creating: so it’s about doing something new, that also admits of having to solve problems along the way, perhaps be inventive, and playful.
  • machine readable: so coding produces something that a computer is capable of processing; does this implicitly unpack further, though, to take in notions of the machine actually processing the code in order to bring about some sort of state transformation? So maybe replace machine readable with machine interpretable and executable? But you don't have to execute code? Eg if I encode a mathematical formula in LaTeX, the machine will interpret that code to render the typographically laid out equation, but it hasn't executed the code (see the short LaTeX example after this list). So maybe machine interpretable and/or executable?
  • representations: this is not so much about what the code looks like to us, but the way we use it to create models that represent something "meaningful" to us, in a form that the machine can process and that is also meaningful to us. Again, this admits of problem solving and the need to be creative, but also starts to bring in unstated ideas that the representation somehow needs to be coherent and stand in some sort of sensible relationship to the sort of thing being represented?
  • using: so coding is about doing something with something…
  • formal: …that something being formally defined and bound/constrained…
  • syntax: …by a set of rules that determine how the representations are declared and the form in which those representations should be stated. Does adding "and grammar" help? Do programming languages add grammatical elements over and above syntactic rules? Is dot notation, for example, a morphological feature or a syntactic one?
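To make the LaTeX example above concrete, here's a trivial (and arbitrary) formula encoded as a machine interpretable representation that gets rendered rather than executed:

x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}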

Note that there is nothing in there that distinguishes between text based languages and graphical languages (for example). Nor is the word language mentioned explicitly.

Self-help Edjucashun

Never having learned to read music or play a musical instrument as a kid, I'm finding learning to play the harp quite incredible. The feedback loop between seeing marks on paper, speaking out the name of each note played (as recommended by several of the guides/tutorials I've seen), developing muscle memory, and hearing audio feedback is just an amazing learning experience.

Progress is slow, and I’m struggling with metre and note length. I really should get a lesson or two with a teacher, not least so I can hear what my elementary practice tunes are supposed to sound like. (I have no idea what sort of models Google is building around all the Youtube videos of young children I seem to be watching (kids doing their practice pieces… You can probably imagine the level I’m at given I aspire to be that good!))

So… self-help… there's loads of music related web apps out there, so I figured it might be useful to try to transcribe some of my practice tunes into a form from which I can get some idea of what they should sound like.

The language I've opted for is abcjs (repo), which I discovered via the music21 package (see some music21 demos here); but it doesn't need any of the Python machinery to run — it works directly in the browser.

Here’s an example of what it looks like:

X: 1
T: Blue Bells of Scotland
M: 4/4
L: 1/8
K: C
V:R
G2|c4B2A2|G4A2Bc|z8|z4z2G2|
V:L
z2|z8|z8|E2E2E2D2|C6z2|
V:R
|c4B2A2|G4A2Bc|z8|z8|
V:L
|z8|z8|E2E2F2D2|C6G2|
V:R
z8|c4G2Bc|B2G2A2B2|G4A2B2|
V:L
E2C2E2G2|z8|z8|z8|
V:R
c4B2A2|G4A2Bc|z8|z6|]
V:L
z8|z8|E2E2F2D2|C6|]

The M field gives the meter, the L the unit note length for the piece, and K is the key. V:R and V:L record right and left hand staves. Each separate line in the abcjs script corresponds to a separate line of music.

There are some handy notes (doh!) here — How to understand abc (the basics) — and some more complete docs here: The abc music standard 2.1.

I've found that transcribing from sheet music to abcjs notation is also helping my music reading. The editor I use — https://abcjs.net/abcjs-editor.html — provides live rendering of the notes, so it's easy to get visual feedback, as I write in the notation, about whether I've read to myself, and written, the correct note.

(The red highlight in the score follows the cursor position in the text editor.)

As well as live rendering of the score as you transcribe, you can also play back the tune using the embedded music player. (I'm not sure if it's possible to change the instrument type? It defaults to a sort-of piano…) The tempo is set by the Q parameter in beats per minute, so it's easy enough to speed up and slow down the playback.
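For example, adding something like the following field to the tune header should set a playback tempo of sixty quarter-note beats per minute (the value itself is arbitrary):

Q: 1/4=60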

FWIW, I'll start popping related tinkerings and doodlin's here: psychemedia/harperin-on. abcjs will also support adding things like fingerings for each note, but I don't want to break copyright too much when I do post transcribed scores, so I'll be omitting that…

As far as learning goes, learning to write abcjs will also help me learn to read music better, I think, as well as reading it a bit more deeply.

It’s ages since I learned a new sort of thing (though I have also been trying to learn Polish pronunciation so I can sound out names appropriately in a history of Poland I’m reading at the moment). It’s fun, isn’t it?! And soooo time disappearing…

Running OpenRefine in the Clear on Digital Ocean

In a couple of earlier posts, I've described how to get OpenRefine up and running remotely over the web by installing the OpenRefine server onto a Digital Ocean Linux server and running it there behind a simple authenticating proxy (Running OpenRefine On Digital Ocean Using Simple Auth and the more automated Authenticated OpenRefine Server on Digital Ocean, Redux).

In this post I'll show how to set up a simple OpenRefine server, without authentication, using Docker (I'll show how to add in the authenticating nginx proxy in a follow on post).

Docker is a virtualisation technology that draws heavily on the idea of "containers": isolated computational environments that provide just enough operating system to run a particular application within them.

As well as hosting raw Linux servers, Digital Ocean also provides Linux-servers-with-docker as a one-click application.

Here’s how to start a docker machine on Digital Ocean.

Creating a Digital Ocean Docker Droplet

First up, create a new droplet as a one-click app, selecting docker as the one-click application type:

To give ourselves some space to work with, I’m going to choose the 3GB server (it may work with default settings in a 2GB server, or it may ruin your day…). It’s metered by the hour, but it’ll still only cost a few pennies for a quick demo. (You can also get $100 free credit as a new user if you sign up here.)


Select a data center region (I typically go for a local one):

If you want to, add your SSH key (recipe here, but it’s not really necessary: the ssh key just makes it easier for you to login to the server from your own computer if you need to. If you haven’t heard of ssh keys before, ignore this step!)

Hit the big green button to create your droplet (if you want to, give the server a nicer hostname first…).

Accessing the Digital Ocean Droplet Server Terminal

Your one-click docker server will now start up. Once it's there (it should take less than a minute), click through to its admin page. Assuming you haven't added ssh keys, you'll need to log in through the console. The login details for your server should have been emailed to the email address associated with your Digital Ocean account. Use them to log in.

On first login, you’ll be prompted to change the password (it was emailed to you in plain text after all!)

If you choose a really simple replacement password, you may need to choose another one. Also note that the (current) UNIX password was the one you were emailed, so you'll essentially be providing this password twice in quick succession (once for the first login, then again to authorise the enforced password change). Copying and pasting the password into the console from your email should work…

Once you’ve changed your password, you’ll be logged out and you’ll have to log back in again with your new password. (Isn’t security a faff?! That’s why ssh keys…!;-)

Now you get to install and launch OpenRefine. I've got an example image here, and the recipe for creating it here, but you don't really need to look at that if you trust me…

What you do need to do is run:

docker run -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

What this command does is download the psychemedia/openrefinedemo image and run it as a container, naming the container (purely for our convenience) openrefine.

You can learn how to create an OpenRefine docker image here: How to Create a Simple Dockerfile for Building an OpenRefine Docker Image.

The -d flag runs the container in "detached", standalone mode (in the background, essentially). The -p 3333:3333 is read as -p PUBLICPORT:INTERNALPORT. The OpenRefine server is started on INTERNALPORT=3333 and we're also going to view it on public port 3333.

The image will take a few seconds to download if this is the first time you've called for it, and then docker will print out a long id number once the container is launched and running in the background.

(You can check it’s running by running the command docker ps.)

In both the terminal and the droplet admin pages, as well as the droplet status line in the current droplet listing pages, you should see the public IP address associated with the droplet. Copy that address into your browser and add the port mapping (:3333). You should now be able to see a running version of OpenRefine. (And so should anyone else who wanders by that URL:PORT combination…)

Let’s now move the application to another port. We could do this by launching another container, with a new unique name (container names, when we assign them, need to be unique) and assigned to another port. (The OpenRefine internal service port will remain the same). For example:

docker run -d -p 3334:3333 --name openrefine2 psychemedia/openrefinedemo

This creates a new container running a fresh instance of the OpenRefine server. You should see it on IPADDRESS:3334.

(Alternatively, we can omit the name and a random one will be assigned; for example, docker run -d -p 3335:3333 psychemedia/openrefinedemo.)

Note that the docker image does not need to be downloaded again. We simply reuse the one we downloaded previously, and spawn a new instance of it as a new container.

Each container does take up memory though, so kill the original container:

docker kill openrefine

and remove it:

docker rm openrefine

For a last quick demo, let's create a new instance of the container, once again called openrefine (assuming we've removed the one previously called that), and run it on port 80, the default http port, which means we should be able to see it directly by going to just the IPADDRESS (with no port specified) in our browser:

docker run -d -p 80:3333 --name openrefine psychemedia/openrefinedemo

When you’re done, you can halt the droplet (in which case, you’ll keep on paying rent for it) or destroy it (which means you won’t be billed for any additional hours, or parts thereof, on top of the time you’ve already been running the droplet):

You don’t need to tidy up around the docker containers, they’ll die with the droplet.

So, not all that hard, is it? Probably a darn sight easier than trying to get anything out of your local IT unit?!

In the next post, I'll show how to combine the container with another one containing nginx to provide some simple authentication. (There are lots of prebuilt containers out there we can just take "off-the-shelf", and nginx is one of them.) I'll maybe also have a look at how you might persist projects in a hibernating container/droplet, perhaps look at how we might be able to upload files that OpenRefine can work on, and maybe even try to figure out a way to simply synch your project files from Digital Ocean to your own file storage location somewhere. Maybe…

PS third party nginx proxy example: https://github.com/beevelop/docker-nginx-basic-auth

Viewing Dockerised Desktops via an X11 Bridge, novnc and RDP, Sort of…

So… the story so far…

As regular readers of this blog will know, I happen to be of the opinion that we should package more of OUr software using Docker containers, for a couple of reasons (at least):

  • we control the software environment, including all required dependencies, and avoiding any conflicts with preinstalled software;
  • the same image can be used to launch containers locally or remotely.

I also happen to believe that we should deliver all UIs through a browser. Taken together with the containerised services, this means that students just need a browser to run course related software. Which could be on their phone, for all I care.

I keep umming and ahhing about electron apps. If the apps that are electron wrapped are also packaged so that they can be run as a service and accessed via a browser, that's fine, I guess…

There are some cases in which this won't work. For example, not all applications we may want to distribute come with an HTML UI; instead they may be native applications (which is an issue because we are supposed to be platform independent), or cross platform applications that use native widgets (for example, Java apps, or electron apps).

One way round this is to run a desktop application in a container and then expose its UI using X11 (aka the X Window System), although this looks like it may be on the way out in favour of other windowing alternatives, such as Wayland… See also Chrome OS Is Working To Remove The Last Of Its X11 Dependencies. (I am so out of my depth here!)

Although X11 does provide a way of rendering windows created on a remote (or containerised guest) system using native windows on your own desktop, a downside is that it requires X11 support on your own machine, and I haven't found a cross-platform X11 client that looks to be a popular de facto standard.

Another approach is to use VNC, in which the remote (or guest) system sends a compressed rendered version of the desktop back to your machine, which then renders it. (See X11 on Raspberry Pi – remote login from your laptop for a discussion of some of the similarities and differences between X11 and VNC.)

Note to self – one of the issues I’ve had with VNC is the low screen resolution of the rendered desktop… but is that just because I used a default low resolution in the remote VNC server? Another issue I’ve had in the past with novnc, a VNC client that renders desktops using HTML via a browser window, relates to video and audio support… Video is okay, but VNC doesn’t do audio?

Earlier today, I came across x11docker, which claims to run GUI applications and desktops in docker (though on Windows and Linux desktops only). The idea is that you "just type x11docker IMAGENAME [COMMAND]" to launch a container and an X11 connection is made that allows the application to be rendered in a native X11 window. You can find a recipe for doing something similar on a Mac here: Running GUI's with Docker on Mac OS X.

But that all seems a little fiddly, not least because of a dependency on an X11 client which might need to be separately installed. However, it seems that we can use another Docker container — JAremko/docker-x11-bridge — running xpra ("an open-source multi-platform persistent remote display server and client for forwarding applications and desktop screens") as a bridge that can connect to an X11 serving docker container and render the desktop in a browser.

For example, Jess Frazelle's collection of Dockerfiles containerises all manner of desktop applications (though I couldn't get them all to work over X11; maybe I wasn't starting the containers correctly?). I can get them running, in my browser, by starting the bridge (the xpra HTML5 client should then be viewable in a browser on the mapped port, eg http://localhost:10000):

docker run -d \
 --name x11-bridge \
 -e MODE="tcp" \
 -e XPRA_HTML="yes" \
 -e DISPLAY=:14 \
 -p 10000:10000 \
 jare/x11-bridge

and then firing up a couple of applications:

docker run -d --rm  \
  --name firefox \
  --volumes-from x11-bridge \
  -e DISPLAY=:14 \
  jess/firefox

docker run -d --rm  \
  --name gimp \
  --volumes-from x11-bridge \
  -e DISPLAY=:14 \
  jess/gimp

#Housekeeping
#docker kill gimp firefox
#docker rm gimp firefox
#docker rmi jess/gimp jess/firefox

Another approach is to use VNC within a container, an approach I've used with this DIT4C Inspired RobotLab Container. (The DIT4C container is quite old now; perhaps there's something more recent I should use? In particular, audio support was lacking.)

It’s been a while since I had a look around for good examples of novnc containers, but this Collection of Docker images with headless VNC environments could be a useful start:

Desktop environment Xfce4 or IceWM
VNC-Server (default VNC port 5901)
noVNC – HTML5 VNC client (default http port 6901)

The containers also allow screen resolution and colour depth to be set via environment variables. The demo seems to work (without audio) using novnc in a browser, and I can connect using TigerVNC to the VNC port, though again, without audio support.
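For example, I think something along these lines should launch one of those headless VNC containers with a specified resolution and colour depth (the image name and the VNC_RESOLUTION / VNC_COL_DEPTH / VNC_PW environment variable names are as I recall them from that collection's docs, so treat this as an unchecked sketch):

docker run -d -p 5901:5901 -p 6901:6901 \
  -e VNC_RESOLUTION=1600x900 \
  -e VNC_COL_DEPTH=24 \
  -e VNC_PW=vncpassword \
  --name headlessvnc consol/ubuntu-xfce-vnc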

Audio is a pain. On a Linux machine, you can mount an audio device when you start a novnc container (eg fcwu/docker-ubuntu-vnc-desktop), but I'm not sure if that works on a Mac? Or how it'd work on Windows? That said, a few years ago I did find a recipe for getting audio out of a remote container that did seem to work — More Docker Doodlings – Accessing GUI Apps Via a Browser from a Container Using Guacamole — although it seems to be broken now (did the container format change in that period, I wonder?). Is there a more recent (and robust) variant of this out there somewhere, I wonder?

Hmm… here’s another approach: using a remote desktop client. Microsoft produce RDP (Remote Desktop Protocol) clients for different platforms so that might provide a useful starting point.

This repo — danielguerra69/firefox-rdp — builds on danielguerra69/dockergui (fork of this) and shows how to create a container running Firefox that can be accessed via RDP. If I run it:

docker run --rm -d --shm-size 1g -p 3389:3389 --name firefox danielguerra/firefox-rdp

I can create a connection using the Microsoft remote desktop client at the address localhost:3389, login with my Mac credentials, and then use the application. Testing on Youtube shows that video and audio work too. So that’s promising…

(Docker housekeeping: docker kill firefox; docker rm firefox; docker rmi danielguerra/firefox-rdp.)

Hmmm… so maybe now we’re getting somewhere more recent. Eg danielguerra69/ubuntu-xrdp although this doesn’t render the desktop properly for me, and danielguerra69/alpine-xfce4-xrdp doesn’t play out the audio? Ah, well… I’ve wasted enough time on this for today…

On Not Faffing Around With Jupyter Notebook Docker Container Auth Tokens

Mark this post as deprecated… There already exists an easy way of setting the token when starting one of the Jupyter notebook Docker containers: -e JUPYTER_TOKEN="easy; it's already there". In fact, things are even easier if you export JUPYTER_TOKEN='easy' in the local environment, and then start the container with docker run --rm -d --name democontainer -p 9999:8888 -e JUPYTER_TOKEN jupyter/base-notebook (which is equivalent to -e JUPYTER_TOKEN=$JUPYTER_TOKEN). You can then autolaunch into the notebook with open "http://localhost:9999?token=${JUPYTER_TOKEN}". H/t @minrk for that…

[UPDATE: an exercise in reinventing the wheel… This is why I should really do something else with my life…]

I know they’re there for good reason, but starting the official Jupyter containers requires that you enter a token created when you launch the container, which means you need to check the docker logs…

In terms of usability, this is a bit of a faff. For example, the example URL printed in the logs is not necessarily the correct one (it specifies the port the notebook server is running on inside the container rather than the exposed port you have mapped it to).

If you start the container with a -d flag, you don't see the token (the long string that is printed out might look like a token, but it's not: it's the container id created by docker…). However, you can see the log stream containing the token using Kitematic.
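You can also grab the token from the command line rather than via Kitematic; something like either of the following should work (democontainer is just the example container name used in the update note above):

#Search the notebook server startup logs for the tokenised login URL
docker logs democontainer 2>&1 | grep token

#Or ask the notebook server inside the container to list running servers and their tokens
docker exec democontainer jupyter notebook list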

If you go directly to the notebook page without the token argument, you’ll need to login with it, or with a default password (which is not set in the official Jupyter Docker images).

To provide continued authenticated access, you also have the opportunity at the bottom of that screen to swap the token for a new password (this is via the c.NotebookApp.allow_password_change setting which by default is set to True):

I think the difference between the default token and a password is that in the config file, if you specify a token via the c.NotebookApp.token argument, you do so in plain text, whereas the c.NotebookApp.password setting takes a hashed value. If you set c.NotebookApp.token='', you can get in without a token. For a full set of config settings, see the Jupyter notebook config file and command line options.
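For example, the notebook package provides a helper for generating a suitably hashed password to drop into the config file (a minimal sketch):

#Generate a hashed password for use with c.NotebookApp.password
from notebook.auth import passwd

hashed = passwd('not-a-good-password')
print(hashed)

#Then, in jupyter_notebook_config.py, set something like:
#c.NotebookApp.password = 'sha1:...'  #ie the hash printed above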

So, can we balance the need for a small amount security without going to the extreme of disabling auth altogether?

Here’s a Dockerfile I’ve just popped together that allows you to build a variant of the official containers with support for tokenless or predefined token access:

#Dockerfile
FROM jupyter/minimal-notebook

#Configure container to support easier access
ARG TOKEN=-1
RUN mkdir -p $HOME/.jupyter/
RUN if [ "$TOKEN" != "-1" ]; then echo "c.NotebookApp.token='$TOKEN'" >> $HOME/.jupyter/jupyter_notebook_config.py; fi

We can then build variations on a theme as follows by running the following build commands in the same directory as the Dockerfile:

# Automatically generated token (default behaviour)
docker build -t psychemedia/quicknotebook .

# Tokenless access (no auth)
docker build -t psychemedia/quicknotebook --build-arg TOKEN='' .

# Specified one time token (set your own plain text one time token)
docker build -t psychemedia/quicknotebook --build-arg TOKEN='letmein' .

And some more handy administrative commands, just for the record:

#Run the container
docker run --rm -d -p 8899:8888 --name quicknotebook psychemedia/quicknotebook
##Or:
docker run --rm -d --expose 8888 --name quicknotebook psychemedia/quicknotebook

#Stop the container
docker kill quicknotebook

#Tidy up after running if you didn't --rm
docker rm quicknotebook

#Push container to Docker hub (must be logged in)
docker push psychemedia/quicknotebook

I’m also starting to wonder whether there’s an easy way of using Docker ENV vars (passed in the docker run command via a -e MYVAR='myval' pattern) to allow containers to be started up with a particular token, not just created with specified tokens at build time? That would take some messing around with the container start command though…

There’s a handy guide to Dockerfile ARG and ENV vars here: Docker ARG vs ENV.

Hmm… looking at the start.sh script that runs as part of the base notebook start CMD, it looks like there’s a /usr/local/bin/start-notebook.d/ directory that can contain files that are executed prior to the notebook server starting…

So we can presumably just hack that to take an environment variable?

So let’s extend the Dockerfile:

ENV TOKEN=$TOKEN
USER root
RUN mkdir -p /usr/local/bin/start-notebook.d/
RUN echo "if [ \"\$TOKEN\" != \"-1\" ]; then echo \"c.NotebookApp.token='\$TOKEN'\" >> $HOME/.jupyter/jupyter_notebook_config.py; fi" >> /usr/local/bin/start-notebook.d/tokeneffort.sh
RUN chmod +x /usr/local/bin/start-notebook.d/tokeneffort.sh
USER $NB_USER

Now we should also be able to set a one time token when we run the container:

docker run -d -p 8899:8888 --name quicknotebook -e TOKEN='letmeout' psychemedia/quicknotebook

Useful? [Not really, completely pointless; passing the token as an environment variable is already supported (which raises the question: how come I've kept missing this trick?!). At best, it was a refresher in the use of Dockerfile ARG and ENV vars.]

Running a PostgreSQL Server in a MyBinder Container

The original MyBinder service used to run an optional PostgreSQL DBMS alongside the Jupyter notebook service inside a Binder container (my original review).

But if you want to run a Postgres database in the same MyBinder environment nowadays, you need to add it in yourself.

Here are some recipes with different pros and cons. As @manics comments here, “[m]ost distributions package postgres to be run as a system service, so the user permissions are locked down.”, which means that you can’t run Postgres as an arbitrary user. The best approach is probably the last one, which uses an Anaconda packaged version of Postgres that has a more liberal attitude…

Recipe the First – Hacking Permissions

I picked up this approach from dchud/datamanagement-notebook/ based around Docker. It gets around the problem that the Postgres Linux package requires a particular user (postgres) or an alternative user with root permissions to start and stop the server.

Use a Dockerfile to install postgres and create a simple database test user, as well as escalating the default notebook user, jovyan, to sudoers (along with the password redspot). The jovyan user can then start / stop the Postgres server via an appropriate entrypoint script.

USER root

RUN chown -R postgres:postgres /var/run/postgresql
RUN echo "jovyan ALL=(ALL)   ALL" >> /etc/sudoers
RUN echo "jovyan:redspot" | chpasswd

COPY ./entrypoint.sh /
RUN chmod +x /entrypoint.sh

USER $NB_USER
ENTRYPOINT ["/entrypoint.sh"]

The entrypoint.sh script will start the Postgres server and then continue with any other start-up actions required to start the Jupyter notebook server installed by repo2docker/MyBinder by default:

#!/bin/bash
set -e

echo redspot | sudo -S service postgresql start

exec "$@"

Try it on MyBinder from here.

A major issue with this approach is that you may not want jovyan, or another user, to have root privileges.

Recipe the Second – Hacking Fewer Permissions

The second example comes from @manics/@crucifixkiss and is based on manics/omero-server-jupyter.

In this approach, which also uses a Dockerfile, we again escalate the privileges of the jovyan user, although this time in a more controlled way:

USER root

#The trick in this Dockerfile is to change the ownership of /run/postgresql
RUN  apt-get update && \
    apt-get install -qq -y \
        postgresql postgresql-client && apt-get clean && \
    chown jovyan /run/postgresql/

COPY ./entrypoint.sh  /
RUN chmod +x /entrypoint.sh

In this case, the entrypoint.sh script doesn’t require any tampering with sudo:

#!/bin/bash
set -e

PGDATA=${PGDATA:-/home/jovyan/srv/pgsql}

if [ ! -d "$PGDATA" ]; then
  /usr/lib/postgresql/10/bin/initdb -D "$PGDATA" --auth-host=md5 --encoding=UTF8
fi
/usr/lib/postgresql/10/bin/pg_ctl -D "$PGDATA" status || /usr/lib/postgresql/10/bin/pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

psql postgres -c "CREATE USER testuser PASSWORD 'testpass'"
createdb -O testuser testdb

exec "$@"

You can try it on MyBinder from here.

Recipe the Third – An Alternative Distribution

The third approach is again via @manics and uses an Anaconda packaged version of Postgres, installing the postgresql package via an environment.yml file.

A postbuild step initialises everything and pulls in a script to set up a dummy user and database.

#!/bin/bash
set -eux

#Make sure that everything is initialised properly
PGDATA=${PGDATA:-/home/jovyan/srv/pgsql}
if [ ! -d "$PGDATA" ]; then
  initdb -D "$PGDATA" --auth-host=md5 --encoding=UTF8
fi

#Start the database during the build process
# so that we can seed it with users, a dummy seeded db, etc
pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

#Call a script to create a dummy user and seeded dummy db
#Make sure that the script is executable...
chmod +x $HOME/init_db.sh
$HOME/init_db.sh

For example, here’s a simple init_db.sh script:

#!/bin/bash
set -eux

THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

#Demo PostgreSQL Database initialisation
psql postgres -c "CREATE USER testuser PASSWORD 'testpass'"

#The -O flag below sets the user: createdb -O DBUSER DBNAME
createdb -O testuser testdb

psql -d testdb -U testuser -f $THISDIR/seed_db.sql

which in turn pulls in a simple .sql file to seed the dummy database:

-- Demo PostgreSQL Database initialisation

DROP TABLE IF EXISTS quickdemo CASCADE;
CREATE TABLE quickdemo(id INT, name VARCHAR(20), value INT);
INSERT INTO quickdemo VALUES(1,'This',12);
INSERT INTO quickdemo VALUES(2,'That',345);

Picking up on the recipe described in an earlier post (AutoStarting A Headless OpenRefine Server in MyBinder Using Repo2Docker and a start Config File), the database is autostarted using a start file:

#!/bin/bash
set -eux
PGDATA=${PGDATA:-/home/jovyan/srv/pgsql}
pg_ctl -D "$PGDATA" -l "$PGDATA/pg.log" start

exec "$@"

In a Jupyter notebook, we can connect to the database in several ways.

For example, we can connect directly using the psycopg2 package:

import psycopg2

conn = psycopg2.connect("dbname='postgres'")
cur = conn.cursor()
cur.execute("SELECT datname from pg_database")

cur.fetchall()

Alternatively, we can connect using something like ipython-sql magic, with a passwordless connection string that attaches us as the default (jovyan) user using the default connection details (default ports etc.): postgresql:///postgres

Or we can go to the other extreme, and use a connection string that connects us using the test user credentials, explicit host/port details, and a specified database: postgresql://testuser:testpass@localhost:5432/testdb
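For example, assuming the ipython-sql extension has been added to the environment (it isn't installed by the recipes above, so you'd need to include it in the requirements), usage in a notebook might look something like this:

%load_ext sql

#Connect as the test user to the seeded demo database
%sql postgresql://testuser:testpass@localhost:5432/testdb

#Run a query against the seeded table
%sql SELECT * FROM quickdemo;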

You can try it on MyBinder from here.