Paying for Free

A recent post on the BBC News / Technology blog — Why Big Tech pays poor Kenyans to teach self-driving cars — describes how Kenyan knowledge workers spend 8 hour shifts creating machine learning training data for the likes of Google, Microsoft and VW as employees of Samasource, providers of “humans-in-the-loop to help you build quality ground truth training data for your natural language or computer vision algorithms”. (Seems like I missed Samasource when I blogged about these sorts of companies previously: Robot Workers?)

No matter how little you pay people, they’re still expensive, so it’s better if you can get free labour. That’s what Captchas do. One of the tasks the Samasource people do is trace around meaningful objects that appear in an image and associate them with labels that describe the thing: “car”, “bus”, “bicycle” and so on. But if you can extract time and attention from folk browsing the web to do it for you, even better.

For example, I got captcha’d the other day hacking URLs on the Bloomberg website (sites often take umbrage and challenge you to prove you aren’t a robot if you do anything other than click links on their site, such as hacking URLs or using advanced search queries).

But selecting things that appear in a grid isn’t as good as tracing them, surely? Well, it is if you run the test thousands of times, move the grid around a pixel at a time, and do some sums.
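To make the “do some sums” idea a bit more concrete, here’s a toy sketch in Python. It’s purely my own illustration, not a description of how any real captcha pipeline works: the function names, the simulated “user” (the is_object callback) and the choice of threshold are all assumptions. Each challenge overlays a grid at a slightly different offset, the “user” selects every cell that contains part of the object, and we simply count how often each pixel ends up inside a selected cell.

```python
def accumulate_votes(width, height, cell, is_object, offsets):
    """Sum per-pixel votes over many grid challenges, each with a shifted grid.

    is_object(x, y) stands in for the human: a grid cell counts as "selected"
    if any pixel inside it belongs to the object. (Hypothetical helper.)
    """
    votes = [[0] * width for _ in range(height)]
    for ox, oy in offsets:
        # Start one cell early so the shifted grid still covers the whole image.
        for cy in range(oy - cell, height, cell):
            for cx in range(ox - cell, width, cell):
                pixels = [
                    (x, y)
                    for y in range(max(cy, 0), min(cy + cell, height))
                    for x in range(max(cx, 0), min(cx + cell, width))
                ]
                if any(is_object(x, y) for x, y in pixels):
                    for x, y in pixels:
                        votes[y][x] += 1
    return votes


def recover_outline(votes, threshold):
    """Keep only the pixels selected in at least `threshold` challenges."""
    return {
        (x, y)
        for y, row in enumerate(votes)
        for x, count in enumerate(row)
        if count >= threshold
    }


# Toy image: a "car" occupying x=3..7, y=2..5 in a 12x10 pixel image.
def is_car(x, y):
    return 3 <= x <= 7 and 2 <= y <= 5


CELL = 4
# Move the grid around one pixel at a time, in both directions.
OFFSETS = [(ox, oy) for ox in range(CELL) for oy in range(CELL)]

votes = accumulate_votes(12, 10, CELL, is_car, OFFSETS)
car_pixels = recover_outline(votes, threshold=len(OFFSETS))
```

For this rectangular toy object, keeping only the pixels that were selected at every grid offset recovers the 20 object pixels exactly, even though no single challenge was finer than a 4x4 cell. Irregular shapes and noisy human answers would only give an approximation, which is presumably where the thousands of repeats come in.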

It’s much the same with lots of the sites and services you use “for free”. They’re not free to run, of course: they may cost tens or even hundreds of millions of dollars to put together and deliver, so someone has to pay. Ads cover some of it (the money there is advertising dollars in exchange for targeted audiences (Ad-Tech – A Great Way in To OSINT), and those audiences are constructed by mining user data to find all the people who work in universities and look at pr0n on the bus for a bit of excitement, for example). Surveys have also been used as a “partial payment” mechanism (From Paywalls and Attention Walls to Data Disclosure Walls and Survey Walls).

Another partial payment mechanism is your time. For example, when your GPS app sends you on a weird route, it’s quite possibly using you as a guinea pig to see how effective that part of the route is at that time of day. It needs to learn somehow, right? Gives new meaning to the rat run, doesn’t it? (Didn’t you think of yourself as a lab rat running a maze before?)

Which leads to a handful of things on my to read list…

First up, Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI, the abstract for which reads as follows:

In the field of computer science, large-scale experimentation on users is not new. However, driven by advances in artificial intelligence, novel autonomous systems for experimentation are emerging that raise complex, unanswered questions for the field. Some of these questions are computational, while others relate to the social and ethical implications of these systems. We see these normative questions as urgent because they pertain to critical infrastructure upon which large populations depend, such as transportation and healthcare. Although experimentation on widely used online platforms like Facebook has stoked controversy in recent years, the unique risks posed by autonomous experimentation have not received sufficient attention, even though such techniques are being trialled on a massive scale. In this paper, we identify several questions about the social and ethical implications of autonomous experimentation systems. These questions concern the design of such systems, their effects on users, and their resistance to some common mitigations.

Here’s how they set the scene:

Consider, for example, navigation services that are responsible for providing millions of users with real-time directions. Given the current traffic conditions, these services attempt to suggest optimal routes for drivers. Experimentation is likely a core part of suggesting optimal routes. This is because service providers often lack information about traffic conditions on those routes to which they have purposefully not directed drivers. To determine whether a previously slow route is still slow, these services will deliberately send some users along it.
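The explore/exploit trade-off in that passage can be sketched with a toy “epsilon-greedy” chooser. To be clear, this is my own minimal illustration and not anything taken from the paper or from a real navigation service: the route names, the journey times and the simple averaging rule are all made up.

```python
import random


def choose_route(estimated_minutes, epsilon, rng):
    """Pick a route: usually the best-known one, occasionally a random one."""
    if rng.random() < epsilon:
        # Explore: deliberately send this driver down any route to refresh its estimate.
        return rng.choice(sorted(estimated_minutes))
    # Exploit: send the driver down the route currently believed to be fastest.
    return min(estimated_minutes, key=estimated_minutes.get)


def update_estimate(estimated_minutes, route, observed_minutes, weight=0.5):
    """Blend a newly observed journey time into the running estimate."""
    old = estimated_minutes[route]
    estimated_minutes[route] = (1 - weight) * old + weight * observed_minutes


rng = random.Random(0)
estimates = {"main road": 10.0, "back lane": 25.0}  # made-up numbers

# With no exploration, the back lane is never tried, so its stale estimate never improves.
never_explores = {choose_route(estimates, 0.0, rng) for _ in range(100)}

# A few guinea-pig drivers report that the back lane has cleared (5 minutes each).
for _ in range(3):
    update_estimate(estimates, "back lane", 5.0)

best_now = choose_route(estimates, 0.0, rng)
```

With epsilon set to zero the service only ever recommends the main road. It takes a few drivers being sent down the back lane, whether they wanted to be or not, before the estimate drops far enough for the back lane to become the recommended route.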

As I said, on my to-read pile. I’ll try to pull out my own TL:DR nuggets in another post when I have the spare cycles to take it in properly.

Second up, Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation, a much longer, footnoted piece which again I’m not in a mood to read right now… Maybe later…

Finally, a brief review article — What’s Behind Your Navigation App — which perhaps leads to more things to read…

I have such a backlog of half started posts, of which this is one… Normally, I’d have tried to complete it, but I’m losing stuff in the queue, so posting it as is means this bit is done and I may be more likely to get round to reading those papers and doing a part 2…

Getting the TM351 VM Running on OU OpenStack

One of the original motivations for delivering the TM351 software and services via a virtual machine, with user interfaces provided via a browser, was that we should be able to use the same VM  as a locally run machine on a student’s own computer, or as a hosted machine (accessible via the web) running on an OU server.

A complementary third year equivalent course, TM352 Web, Mobile and Cloud Technologies, uses a Faculty managed OpenStack instance as a dogfooding teaching environment on that course (students learn about cloud stuff in the course, get to deploy some canned machines and develop their own services using OpenStack, and the department develops skills in deploying and managing such environments with hundreds of users).

I think part of the pitch for the OpenStack cluster was that it would be available to other courses, but a certain level of twitchiness in keeping it stable for the original course use case has meant that getting access to the machine has not been as easy as it might have been.

(There is no dev server that I can access, at least not from a connection outside the OU network. So the only server I can play on is the live server, as used by students. If you’re confident managing OpenStack, this is probably fine (it should be able to cope with lots of tenants with different requirements, right?), but if you’re not, making a dev server, open to all who want to try it out, and available sooner rather than later, probably makes more sense: more people solving problems, more use cases being explored and ruled out, more issues being debugged; more learning going on generally…)


I’ve finally got an account, and a copy of the TM351 VM image, originally built for VirtualBox, uploaded to it.

You’d think that part at least would have been easy, but it took the best part of four months or so at least… First, getting an account on the OpenStack server. Second, getting a copy of the TM351 VM image that could be loaded onto it. I got stuck going nowhere trying to convert the original VirtualBox image until it was pointed out to me that there was a VBoxManage tool for doing it (Converting between VM Formats). Faculty advice suggests the clonehd command:

vboxmanage clonehd box-disk001.vmdk /Users/USER/Desktop/tm351.img --format raw

but that looks deprecated in recent versions of VirtualBox to me… The following seems more contemporary:

VBoxManage clonemedium ~/VirtualBox\ VMs/tm351_18J-student/box-disk001.vmdk tm351_18J-student.raw --format RAW

Third, loading the image onto OpenStack. A raw box format image I thought I had managed to create myself came in at 64GB (the original box was ~8GB), but it seems this is because that’s the size of the virtual disk. Presumably vagrant is setting this in my original build (or VirtualBox is defaulting to it?), so one thing I need to figure out is how to reduce it without compromising anything. Looking at Resizing Vagrant box disk space, I wonder if we could move along steps from vmdk to vdi to resize and then raw?

Uploading a 64 GB image from home to OpenStack using an http file uploader on the OpenStack user admin page is just asking for trouble, but even copying the image from OU networked machines is not just-do-it-able: it requires copying the file from one machine to another and then onto the OpenStack server by someone-not-me with the appropriate logins and scp permissions.

(Building the machine on OpenStack myself using an OpenStack vagrant provisioner is not an option on the live server at least: API access addresses seem to only be provided for a private network that I don’t have access to. If we manage to get a development server that I am allowed to access using VPN, or even better, without VPN, and I can get permissions to use the API, and we can connect to things like the apt-get and Pypi/pip repos, using a build provisioner makes sense to me.)

So there is now an image visible on the OpenStack server.

You’ll note we haven’t tried to brand the OpenStack user’s admin panel at all  (I would have…;-).

What next? Trying to spin up an instance from the image kept giving me errors (I started trying with a small machine instance, then tried creating an instance with ever larger machine flavours). The issue was indeed the 64GB default disk size associated with the image. Faculty IT changed a setting that meant the larger disk sizes would spin up and reported that it worked for them with the VM on a large flavour machine.

But it didn’t for me… I kept getting the message [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance XXXXXX.] I think the issue must have been a permissions thing manifested as a network thing. Faculty IT restarted the image as private to me (and with my own private network?) and I tried again… (For this reason, I’m not convinced that anyone else just given an account will be able to get their own version of the TM351 VM up and running. I need to understand better what requirements, if any, are placed on the creation of the OpenStack user account for it to work. And I need a second test user account (at least) to test it.)

Anyway – success for me – a running instance of the TM351 VM. And now I could use the OpenStack web console to log in to the machine using the default vagrant credentials. Which I need to change… (and find a sensible method for students to use to change the defaults).

So now I can poke around inside the VM. But I can’t see any of the services it’s running for a couple of reasons: firstly, the VM has no public IP address; secondly, the only port I think I’m allowed to expose publicly is port 80, and there are no services running on port 80. And unlike vagrant and docker, which make it easy to map and expose an arbitrary port inside the VM onto a specified port outside the VM, such as port 80, I haven’t found a way to do that in OpenStack. (The documentation sucks. Really badly. And there is no internal FAQ to give me even the slightest crib as to what to do next.)

The TM352 course materials come to my rescue here, sort of. As OU central academic staff, I can log in to course VLEs and see the published teaching material, although not the student forums. Looking in the current presentation, the materials that show TM352 students how to make their VM visible to the world haven’t been released yet so I can’t see them.. Bah… But I can look at the materials provided to students on the previous presentation… Which are out of date compared to the current version of OpenStack. But never mind, because the materials are enough of a crib to figure out what to do where-ish: Block 2 Part 2: Designing a cloud, 8 Getting started with OpenStack. The essential steps boil down to the following (apols for the vagueness; I don’t want to actually restep through everything to check it works in case I break my current instance; next time I run through from scratch, I’ll tidy up the instructions. Ideally, I’d do a fresh run through in a new, virgin test user account):

  1. Create a new private network for the VM to run on: I seemed to have a network already created, but here’s a howto: under Network, select the Networks option, and then Create Network with the  Admin State as UP (i.e. running and usable) and the Create Subnet box ticked. Use IP/v4 and set an IP address range in CIDR format (e.g.;
  2. Create a router that interconnects the public network and the private network: from the Network menu select Routers . Set Admin State to UP and External Network to public then Create Router. In the Network Topology view, select the router and then Add Interface, using Subnet set to the private created network and the IP Address left blank.
  3. Configure the network security rules: from Network select Security Groups ; if there’s no default group create one; once there is, select Manage Rules. We need to add three rules:
     Direction: Ingress; Remote: CIDR
     Direction: Ingress; Remote: CIDR
     Remote: CIDR
  4. Create a VM instance from the TM351 image: bearing in mind the previous set-up, choose appropriately!
  5. Attach a public IP address to the VM: in Network select Floating IPs and then Allocate IP to Project. With the new floating IP address, select Associate and choose appropriately.
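As a crib for the security rules step, here’s a small Python sketch that just describes the rules as data. It doesn’t talk to OpenStack at all. The field names are the ones I understand the OpenStack Networking API to use for security group rules, which is an assumption worth checking against your own version; the 0.0.0.0/0 (“anywhere”) CIDR is also my assumption, and only ports 22 and 80 are covered because those are the two ports mentioned here.

```python
import ipaddress


def ingress_rule(port, remote_cidr):
    """Describe one TCP ingress rule using Neutron-style field names (assumed)."""
    network = ipaddress.ip_network(remote_cidr)  # raises ValueError on a bad CIDR
    return {
        "direction": "ingress",
        "ethertype": "IPv%d" % network.version,
        "protocol": "tcp",
        "port_range_min": port,
        "port_range_max": port,
        "remote_ip_prefix": str(network),
    }


# Assumption: open to everyone; a tighter CIDR would be safer, for ssh in particular.
ANYWHERE = "0.0.0.0/0"
rules = [ingress_rule(22, ANYWHERE), ingress_rule(80, ANYWHERE)]
```

Writing the rules down like this at least gives something that could be checked, shared in a FAQ, or fed to whatever tooling eventually gets API access.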

Hopefully now there should be a public IP address associated with the VM and ports 80 and 22 (ssh) exposed. Using the public IP address, from a terminal on my own local machine:

ssh vagrant@VM.IP.ADDR.ESS

followed by the password, and I should be in…

(I can’t help thinking that typing vagrant up is a much easier way to launch a VM. And then vagrant ssh to SSH in…)

Next step – try to see the public services running inside the VM, bearing in mind that we can only access services through port 80.

To test things, we can just try a simple http server on port 80:

python3 -m http.server 80

That works, so port 80 is live on my VM and I can see it from the public internet. So kill the test http server…
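A quick way to repeat that “can I see the port from outside?” check from my own machine, without reaching for a browser, is a few lines of Python. This is a minimal sketch: the helper name is my own and the host would be whatever floating IP address got assigned.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts alike.
        return False
```

For example, port_is_open("VM.IP.ADDR.ESS", 80) with the real address filled in should come back True while the test server is running. It only shows that something is listening on the port and that the security rules let the connection through, not what that something is.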

Running services inside VM against port 80 requires them to run as root (ports <1024 are privileged), but in the last rebuild of the VM we tried to move away from running everything as root and instead run them under a user account. Which means that the Jupyter server is defined to run under a user account on a non-privileged port.

I went round in circles on this one for getting on for an hour, trying to run Jupyter notebooks on port 80, but running into permissions errors accessing port 80 unless I ran the service as root.  (Things like tail /var/log/syslog helped in the debugging…)

I also had to manually fix the missing notebook directory that the notebook service is supposed to start in. (I think this is another permissions snafu – the service runs as a user but the mkdir guard run via ExecStartPre needs permissions tweaking to run as root using PermissionsStartOnly=true (issue).)

The simplest thing to do is run a proxy like nginx. Which isn’t installed in the VM. No problem, the vagrant user I ssh into the VM with can run via sudo so I should be able to just do a sudo apt-get update && sudo apt-get install -y nginx. Only I can’t because the security rules upstream of the OpenStack server won’t let me. F**k. It’s a Saturday afternoon, and there are zero, no, zilch, none, Faculty IT help files or FAQs that have been shared with me, or that I’m even aware of the existence of, with possible workarounds. But there is Twitter, and various other Saturday working friends, which gives me a result: set up an ssh tunnel and do it via my home machine:

sudo ssh -R vagrant@IP.ADDR

With that tunnel set up, inside the VM I can run sudo nano /etc/apt/apt.conf and edit in the following lines:

Acquire::http::Proxy "http://localhost:8899";
Acquire::https::Proxy "https://localhost:8899";

Then I can apt-get update, apt-get install etc inside the VM

sudo apt-get update
sudo apt-get install -y nginx

To try and pre-empt any other issues, it’s worth checking that the required folders (again) are in place (/vagrant/notebooks and /vagrant/openrefine-projects) and with the appropriate owner and group (oustudent:users) permissions:

sudo chown -R oustudent:users /vagrant

As mentioned, the current ExecStartPre lines in the Jupyter notebook and OpenRefine service definition files were supposed to check that the folders exist, but I think they need changing to incorporate things like the following:

ExecStartPre=/bin/mkdir -p /vagrant/notebooks
ExecStartPre=/bin/chown oustudent:users /vagrant/notebooks

Right, so permissions should be sorted, and the Jupyter notebook server should be runnable against port 80 via the nginx proxy; but I need an nginx config file… If we were running notebooks as a service in the OU this is the sort of thing I’d hope would be in an examples FAQ, battle tested in an OU context; but we don’t so it isn’t so I rely on other people having solved the problem and being willing to share their answer in public:

Unfortunately, it didn’t work for me out of the can… the post supposedly describes how to proxy the server down a path, but (jumping ahead) the login page URL didn’t rewrite down the path for me; tweaking the proxy definition so that the Jupyter notebook server runs at the top level (/) on port 80 did work though – so here’s the nginx definition file I ended up using:

sudo nano /etc/nginx/sites-available/default

and then:

# Assumes an upstream named upstream_groot is defined elsewhere in the nginx
# config, pointing at the notebook server, and that these location blocks
# sit inside the server block.
location / {
  error_page 403 = @proxy_groot;

  allow all;

  # set a webroot, if there is one
  #root /web_root;
  try_files $uri @proxy_groot;
}

location @proxy_groot {
  #rewrite /notebooks(.*) $1 break;
  proxy_read_timeout 300s;
  proxy_pass http://upstream_groot;

  # pass some extra stuff to the backend
  proxy_set_header Host $host;
  proxy_set_header X-Real-Ip $remote_addr;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

location ~ /api/kernels/ {
  proxy_pass http://upstream_groot;
  proxy_set_header Host $host;

  # websocket support
  proxy_http_version 1.1;
  proxy_set_header Upgrade "websocket";
  proxy_set_header Connection "Upgrade";
  proxy_read_timeout 86400;
}

location ~ /terminals/ {
  proxy_pass http://upstream_groot;
  proxy_set_header Host $host;

  # websocket support
  proxy_http_version 1.1;
  proxy_set_header Upgrade "websocket";
  proxy_set_header Connection "Upgrade";
  proxy_read_timeout 86400;
}

followed by:

sudo nginx -s reload

To try to make the notebook server slightly more secure than wide open — it will be running on a public IP address after all — I need to add a password (the original TM351 VM runs everything wide open).

First, create a password hash:

echo -n "my cool password" | sha1sum

then edit the system service file:

sudo nano /lib/systemd/system/jupyter.service

We need to tweak the startup along the lines of:

ExecStart=/usr/local/bin/jupyter notebook --port=8888 --ip= --y --log-level=WARN --no-browser --notebook-dir=/vagrant/notebooks --allow-root --NotebookApp.token='' --NotebookApp.password='sha1:WHATEVER' --allow_origin='*'

We can probably drop the --allow-root ? (Although the default notebook user can sudo some commands…)
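A note on the password hash itself: as far as I understand it, the classic notebook server’s own helper (notebook.auth.passwd) generates salted strings of the form algorithm:salt:digest, so a bare sha1sum digest pasted in after sha1: may not be accepted by every notebook version. Treat that as an assumption to check against whatever version is installed in the VM. The sketch below uses only the Python standard library to show the salted format as I understand it; the function names are my own and it is not the notebook’s actual code.

```python
import hashlib
import secrets


def make_password_hash(passphrase, algorithm="sha1"):
    """Build an 'algorithm:salt:digest' string (hypothetical helper name)."""
    salt = secrets.token_hex(6)  # 12 hex characters of random salt
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))


def check_password(hashed, passphrase):
    """Check a passphrase against an 'algorithm:salt:digest' string."""
    try:
        algorithm, salt, digest = hashed.split(":", 2)
    except ValueError:
        return False  # e.g. a bare 'sha1:DIGEST' with no salt part
    expected = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return secrets.compare_digest(expected, digest)
```

If the notebook package is available, generating the value with its own helper is the safer route; the sketch is only meant to show why the string has three colon-separated parts.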

Reload the daemon to acknowledge the service definition changes and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart jupyter.service

So this seems to work: I can see Jupyter notebook and login via port 80 on the floating public IP address I assigned to the TM351VM instance. I can open a notebook, run cells, call the PostgreSQL and basic Mongo databases at least, open a terminal. What I can’t do is curl or wget or run Python requests to load data files from the internet using a notebook because of the upstream IT network security rules. This is a bit of a blocker for the course. We may be able to finesse a way round with an ssh tunnel in testing, but I don’t think we should be expecting that of our students. (Thinks: how do IT security rules / policies apply when we define activities for students that we expect them to run on their own computers?! File as: whatever… We’ll just have to do something really crappy instead for students. Or set up a best-not-tell-IT proxy on the OU network somewhere…)

The next step is – can I expose the other core teaching application in the VM: OpenRefine?

A possible blocker is that we only have one port exposed on the public internet (port 80) so we need to find a way to expose OpenRefine. Fortunately, the nbserverproxy package allows the Jupyter server to proxy services running on localhost in the VM. So I should be able to run that. But first things first:  pip installs are borked even with an ssh tunnel (open questions on Stack Overflow confirm that this is not just me…).

Okay… pip packages can be downloaded and installed from a local file, so I can download the nbserverproxy pip package on my own machine and then scp it into the running OpenStack hosted VM at /vagrant/notebooks . Then from a notebook inside the VM I can run !pip install --user ./ (just to show the notebook is working properly! ;-)  and enable it: ! jupyter serverextension enable --py nbserverproxy.

Restart the notebook server from the VM command line and I should be able to see OpenRefine at http://MY.FLOATING.IP.ADDR/proxy/3334/ (the trailing slash is required or the styling fails, as the path to the style files is incorrectly resolved). I think that this should also be down the password protected path? i.e. if I hadn’t logged in to the notebook server, I don’t think I should be able to get this far? (NEED TO CHECK.)

One of the VM Easter Eggs, nbdime, is also visible on http://MY.FLOATING.IP.ADDR/proxy/8899/. Go team me… :-)

Grab a snapshot of the working VM in the idle hope that maybe if someone else tries to launch from that image, it will just work. Although things like the network and security rules will presumably need setting up?

For student use, I’d need a simple way / recipe to set up different/personalised ssh credentials into the VM, otherwise anyone with the public IP address could ssh in. This must be a common issue, so it’d be good to see a Faculty OpenStack FAQ suggesting what the possible options are. I guess a simple one is on starting the instance? Can we force keys into the VM when it launches? Another issue is (re)setting the password for the Jupyter notebook server so each student is assigned, or can easily set (and recover…) their own password.

Other next steps: is there something in OpenStack where I can define network settings, security rules, etc, and provide students with an easier way of deploying a TM351 instance on the Faculty OpenStack and making its public services available on the public internet? Can I do this with an OpenStack stack? If so, that would be a handy thing to have an OU OpenStack tutorial for…

This is obvs the sort of support that should be available in Faculty IT tutorials, FAQs, and God Forbid, in person if we’re running the OpenStack server as a Faculty service and trying to encourage people to use it, so that’s what I’ll probably spend my next day of miserable OpenStack hacking doing when I can motivate myself to do it: trying to figure out if and how to make things closer to one-click simple for students to launch their own TM351 VM. (In the first instance for TM351, we want students to be able to run course VMs on an OU server because they’re struggling with getting things running on their own computer; this is often highly correlated with them having poor computer skills, poor problem solving skills, and poor instruction following skills, so we’re on a hiding to nothing if we expect them to launch instances, choose flavours, create routers, create and assign floating IP addresses and set up security rules. On their own. Because I’m not going to do that tech support for them.) (I am ranty typing; my keyboard is suddenly VERY LOUD. [REDACTED])

Converting between VM Formats

Trying to get our current TM351 VirtualBox virtual machine into a raw format that we can run on OpenStack… how many times did I go round the houses failing to discover that VBoxManage has a conversion tool (VBoxManage clonemedium) taking something like the form:

VBoxManage clonemedium ~/VirtualBox\ VMs/tm351_18J-student/box-disk001.vmdk tm351_18J-student.raw --format RAW

Export  / conversion / clone formats include: VDI, VMDK, VHD, RAW. There’s also “other”, but I’m not sure what that entails.

This might also be handy: Convert VDI (VirtualBox) to raw, qcow2, qed, vmdk, vhd in Windows.

The Library as the Natural Home for Emerging Technology…

I was back in the Library today after waaaay too long away to give a staff development session on things related to virtual machines, docker, “digital application shelves”, Jupyter notebooks and reproducible educational materials. (We also tried a bit of consensus humming… :-)

In conversation afterwards, we briefly chatted about the Library as being a possible home for providing such services, then over coffee with Richard Nurse riffed fleetingly on the idea of a Digital Skills Lab, which is a phrase that has been sitting with me all day since…

In my second public outing of the day, a conversation with Stephen Downes for his e-learning 3.0 MOOC, I riffed on Docker and notebooks again, and whilst chatting after that event riffed casually on the notion of using Docker as a means of delivering personal productivity apps / information tools, and why the Library, rather than IT, might be better suited to supporting such an offering…

Between the two, I tried to hijack a server we’ve acquired to explore some infrastructure experiments to support delivery of Institute of Coding activities that the OU has a work package to deliver and give it to the Library… Here are some quick thoughts relating to a possible case for the defence I may have to make tomorrow…

  • to explore useful infrastructure offerings for supporting coding related education, we need to consider: 1) the environments that are user (learner and/or researcher) facing; 2) the architecture that lets us scale that offering;
  • the original server was supposed to satisfy both needs, but the lack of resource to develop the scaling infrastructure part was blocking the end-user development work;
  • grabbing a server, situating it in the Library, and calling it a Digital Skills Lab development server makes a statement about the sorts of things we might want to use it for. Specifically:
  • utility of running experimental Jupyter notebook servers so people can start to explore their own use of notebooks, notebook environment customisations using extensions, etc;
  • utility of  running a local lab docker hub “digital application shelf” and docker machine to let folk check out and run pre-built “digital applications” (i.e. prebuilt Docker containers) taken off the shelf;
  • utility of running a local Binderhub to let folk explore building their own pre-configured computational environment and distribute it as a live environment with notebook content that exploits that environment;
  • developing a lab mentality as a space / server where folk can try stuff out, and bring queries and requests as well as volunteering in their own ideas;
  • situating it in the Library means it’s not a STEM computing thing: it’s accessible to all faculties;
  • more specifically, taking it out of the Computing Department and STEM Faculty makes a statement that we’re trying to offer computation stuff to people in general, not provide a computing environment for computing people per se; that is, we can explore, and maybe even help develop, a different set of expectations and use models for “code” – not necessarily writing big programs, but perhaps just finding the single-line-of-code-at-a-time that helps you complete a particular task.

Anyway – that was today… we’ll have to see what tomorrow’s email returns bring to see how much trouble I’m in!

PS By the by, waiting for a boat home, a most enjoyable piece by Tim Harford appeared in my streams: Why big companies squander brilliant ideas. Heh, heh… ;-)

“Tracking Jupyter” Newsletter

The pace of change associated with the Jupyter ecosystem, the variety of notebook examples published daily across a wide range of disciplines and domains, the increasing use of notebooks in industry, the creativity of extension writers, the range of hosted solutions and hosting providers, let alone the technical and engineering issues associated with designing and deploying Jupyter environments means it can be hard to keep up…

…so with the Tracking Jupyter newsletter, I’ll try to produce an ongoing round up of Jupyter related news and announcements that I’ve managed to spot over the previous week or two…


Topics are likely to include:

    • official Jupyter announcements and releases;
    • Jupyter in education;
    • Jupyter in research;
    • Jupyter in industry;
    • new kernels and widget walkthroughs;
    • interesting notebooks and use cases (for example, notebooks behind news stories, notebooks demonstrating work in a particular topic area from computational sciences to digital humanities);
    • hosted solutions;
    • hosting and infrastructure (technical / engineering) solutions;
    • jobs.

Contributions / suggestions for news items are welcome (email:

To get a feeling for what the newsletter might include, the first issue is available here: Tracking Jupyter: Newsletter, the First…

Subscribe here: Tracking Jupyter signup.

(In)Distinguishable from Magic…

A classic physics experiment showing a magical physical world effect – the inverted water cup…

With a little bit of science/physics knowledge, nothing is hidden and the effect is explainable (how it works). No tricks, in other words. The trick is not only self-working, it’s also transparent. Scientific knowledge is the key to the secret.

But are the safety glasses really necessary? Really?

Here’s the same trick, as magic:


The same physics are at work but there’s a hidden element.

There’s also a risk here that people think there is a physics explanation for the trick (surface tension of water, for example) and the magic leaves them with a misplaced confidence or understanding of the physics…

(Penn and Teller riff on this by showing how a trick is done, breaking the secret, then rerunning the trick – with the same overall effect – but in a way that doesn’t use the secret, thus reinstilling the magic for people who think they know the secret.)

When Arthur C. Clarke wrote “Any sufficiently advanced technology is indistinguishable from magic”, which sort of magic was he referring to? The application of gimmicks, the application of trickery? Or the application of mechanisms that are transparent?