## Jupyter is Not Just Notebooks

Last week, I filled an hour in a department seminar showing ways in which we could use Jupyter notebooks to support the creation and use of interactive educational materials.

I’ve no idea if it converted anyone to the cause.

I could have done any number of other talks: about the architecture of the Jupyter ecosystem more widely (at least, insofar as I understand it), or the way in which Jupyter makes sense for reproducible research and how it fits into a containerised / virtualised way of working.

Because Jupyter is not just about notebooks.

It’s also about string and glue.

Here’s something I suddenly grokked the other day whilst chatting to somebody about different ways of accessing applications that have a graphical UI… (on a desktop, on a desktop in a VM, via X11 (“what’s that?” they asked… sigh…), via a browser if it has an HTML UI, via novnc in a browser window if it doesn’t (albeit w/ borked audio support); note to self – try out this novnc Jupyter extension): if you wrap an application that has a command line interface using metakernel, you can access it in a notebook, or JupyterLab.

Obvious, right? But that means I can also access it via a web page using something like ThebeLab (or Juniper, or nbinteract), run via a container launched using Binderhub.

This is all tied up with a couple of the Big Ideas that underlie Jupyter: firstly, that it supports the read/write web; secondly, that it supports remote code execution (and as such enables the read/write/execute web).

So for example, one of the many metakernel based kernels is the gnuplot_kernel that lets you run Gnuplot commands from a notebook code cell and display the generated figure in a notebook. Here’s a forked version with the repo tweaked so it runs on MyBinder.
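To make the "wrap a command line application" idea a bit more concrete, here's a minimal sketch using only the Python standard library. It is not metakernel's API: the `CommandLineWrapper` class and its `execute` method are made-up names, and a real wrapper kernel would typically keep one long-running process alive rather than starting a fresh one for each cell. The point is just that "run the contents of a code cell" can boil down to passing text to a program and handing back whatever it prints.

```python
import subprocess
import sys


class CommandLineWrapper:
    """Hypothetical stand-in for what a wrapper kernel does with each code cell."""

    def __init__(self, argv):
        # argv is the command used to launch the wrapped program,
        # e.g. ["gnuplot"]; the cell's text is sent to it on standard input.
        self.argv = list(argv)

    def execute(self, cell_source):
        # Run the program once per cell and capture whatever it prints.
        result = subprocess.run(
            self.argv,
            input=cell_source,
            capture_output=True,
            text=True,
            timeout=30,
        )
        return {
            "status": "ok" if result.returncode == 0 else "error",
            "stdout": result.stdout,
            "stderr": result.stderr,
        }


# Use the Python interpreter itself as the "application", since it is
# guaranteed to be available wherever this sketch runs ("-" means
# "read the program from standard input").
wrapper = CommandLineWrapper([sys.executable, "-"])
reply = wrapper.execute("print(6 * 7)")
print(reply["status"], reply["stdout"].strip())
```

Swapping the `argv` for another command line tool is then the only change needed to "wrap" that tool instead, which is roughly why so many small kernels of this kind exist.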

Using a gnuplot_kernel enabled Binder repo, we can now run Gnuplot commands via a web-browser using the ThebeLab JavaScript package, for example, and display the result in the same web page. The container on the back end is fired up in response to the first command issued from the page, which may take up to a minute or two, and will be used for future commands issued from the page in the same session.

Here’s what it looks like:

(The Gnuplot code is ripped from an example in the Gnuplot docs / gallery.)

The code seems to be repeated in the output, but I guess a tweak to the ThebeLab settings, or code, may fix that. Or maybe the kernel needs a tweak. But the proof of concept is there…

Here’s the code for the web page (image file, sorry… WordPress-com editor’n’sourcecode support sucks and I get fed up faffing around with tag brackets each time I re-edit the page):

That source code image does make a second point, though… Look closely, and compare the URLs in the two images above: I can edit an HTML file via the Jupyter notebook text file editor, and also render the page as a served HTML file.

So that’s a couple more things for my colleagues to say “ah, but it won’t work for my course because…”

Bring it on…

PS the code as a gist:

PPS Interested in keeping up to date with Jupyter news? Sign up to the Tracking Jupyter weekly newsletter.

## Launching Azure VMs, et al, From Code

Via RBloggers, a post showing how to launch Azure VMs from R as well as retrieving/restarting a previously launched VM.

By the by, you can also launch Azure VMs from Python code.

Packages like this make it relatively straightforward to write your own provisioners, at least in the sense of starting and stopping VMs.

There are also code wrappers for a wide range of other cloud services and virtualisation engines. For example, docker-py provides similar tooling for starting and stopping Docker containers; for my preferred cloud host, Digital Ocean, it looks like there’s python-digitalocean. I’ve also used Linode in the past, who provide their own official package: linode_api4-python.

I know all this, of course, though I don’t use this approach much. But it’s a useful set of things to remember you have available in the string’n’glue drawer.

Via an Inkdroid post on The Ferguson Principles, this handy suite of tools for archiving and normalising Twitter streams:

• twarc – a command line tool for collecting tweets from Twitter’s search and streaming APIs, and can collect threaded conversations and user profile information. It also comes with a kitchen sink of utilities contributed by members of the community.
• Catalog – a clearinghouse of Twitter identifier datasets that live in institutional repositories around the web. These have been collected by folks like the University of North Texas, George Washington University, UC Riverside, University of Maryland, York University, Society of Catalan Archivists, University of Virginia, University of Puerto Rico, North Carolina State University, University of Alberta, Library and Archives Canada, and more.
• Hydrator – A desktop utility for turning tweet identifier datasets (from the Catalog) back into structured JSON and CSV for analysis. It was designed to be able to run for weeks on your laptop, to slowly reassemble a tweet dataset, while respecting Twitter’s Terms of Service, and users right to be forgotten.
• unshrtn – A microservice that makes it possible to bulk normalize and extract metadata from a large number of URLs.
• DiffEngine – a utility that tracks changes on a website using its RSS feed, and publishes these changes to Twitter and Mastodon. As an example see whitehouse_diff which announces changes to the Executive orders made on the White House blog.
• DocNow – An application (still under development) that allows archivists to observe Twitter activity, do data collection, analyze referenced web content, and optionally send it off to the Internet Archive to be archived.

The post further remarks:

These tools emerged as part of doing work with social media archives. Rather than building one tool that attempts to solve some of the many problems of archiving social media, we wanted to create small tools that fit particular problems, and could be composed into other people’s projects and workflows.

Handy…

And of the principles mentioned in the original post title?

1. Archivists must engage and work with the communities they wish to document on the web. Archives are often powerful institutions. Attention to the positionality of the archive vis-à-vis content creators, particularly in the case of protest, is a primary consideration that can guide efforts at preservation and access.
2. Documentation efforts must go beyond what can be collected without permission from the web and social media. Social media collected with the consent of content creators can form a part of richer documentation efforts that include the collection of oral histories, photographs, correspondence, and more. Simply telling the story of what happens in social media is not enough, but it can be a useful start.
3. Archivists should follow social media platforms’ terms of service only where they are congruent with the values of the communities they are attempting to document. What is legal is not always ethical, and what is ethical is not always legal. Context, agency and (again) positionality matter.
4. When possible, archivists should apply traditional archival practices such as appraisal, collection development, and donor relations to social media and web materials. It is hard work adapting these concepts to the collection of social media content, but they matter now, more than ever.

These arise from trying to address several challenges associated with [p]reserving web and social media content in ethical ways that protect already marginalized people (Documenting the Now Ethics White Paper):

1. User awareness (or informed consent) of how social media platforms use their data or how it can be collected and accessed by third parties.
2. Potential for fraudulent use and manipulation of social media content.
3. Heightened potential of harm for members of marginalized communities when those individuals participate in activities such as protests and other forms of civil disobedience that are traditionally heavily monitored by law enforcement.
4. Difficulty of applying traditional archival practices to social media content given the sheer volume of data and complicated logistics of interacting with content creators.

## In the Wrong Job…?

Stuff…

Rewriting… longtime readers of this blog will know that for a long time I’ve thought we could do more in the way of creating written diagrams to help make courses more maintainable. Part of this comes from the ability to make minor edits and reflow a diagram, part of it comes from developing, over time, a set of reusable patterns and building on top of (or iterating around) what you’ve already done. Working out which bits of a diagram to parameterise in order to come up with parameterisable or programmable diagram generators (think things like the Blockdiag diagrams) is something that is likely to develop over time and provide accelerating returns if you need to generate diagrams of a similar type in future.

Updates to TM351 include a couple of diagram types. Entity relation diagrams of various kinds and some database transaction diagrams.

I thought there would be easy ways to do this, but not where I looked, so here are some fragments using TikZ (LaTeX)… Which is to say, old tech.

Generate arrows of the form:

Not all of these make sense, but that might be useful when creating nonsensical examples.

Here’s an example diagram (not necessarily a meaningful one):

LaTeX code in this gist.

(I’m finding latex4technics a handy online editor for previewing TikZ diagrams.)

With a few more hours, we could pinch from things like https://tex.stackexchange.com/a/133849/151162 or nicer https://tex.stackexchange.com/a/195694/151162 cf. https://tex.stackexchange.com/a/367337/151162, and generate a thing for doing nice (maintainable) ERDs from text. (The next thing in automation would then be to find a way to automatically lay out tables. Graphviz does provide a way round things like this, and there are ERD customisations for working with it (eg BurntSushi/erd or laowantong/mocodo), but the orthogonal projection seems to have an issue when it comes to handling directed edge directions?)
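As a hedged sketch of what "ERDs from text" might look like on the Graphviz side, here's a small Python function that turns a plain description of entities and relationships into DOT source. The function name `erd_to_dot` and the example entities are made up for illustration (they aren't taken from the TM351 materials), and this is not how BurntSushi/erd or mocodo work internally; it only uses Graphviz's standard record-shaped nodes.

```python
def erd_to_dot(entities, relationships):
    """Return Graphviz DOT source for a simple entity relationship diagram."""
    lines = ["graph ERD {", "  node [shape=record];"]
    for name, attributes in entities.items():
        # A record label of the form {title|attr1\lattr2\l} draws a box with
        # the entity name on top and left-aligned attributes underneath.
        fields = "".join(f"{attr}\\l" for attr in attributes)
        lines.append(f'  {name} [label="{{{name}|{fields}}}"];')
    for left, right, label in relationships:
        # Undirected edges, labelled with the relationship name.
        lines.append(f'  {left} -- {right} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Made-up example data: edit the text, regenerate the diagram.
entities = {
    "doctor": ["doctor_id", "doctor_name"],
    "patient": ["patient_id", "patient_name"],
}
relationships = [("doctor", "patient", "treats")]

print(erd_to_dot(entities, relationships))
```

The printed text can be saved to a file and rendered with the `dot` command line tool; Graphviz then takes care of the layout, which is the bit that's fiddly to do by hand in TikZ.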

Several transaction diagrams were also provided as sketches and here’s a first attempt at trying to recreate one of them:

```latex
\documentclass{standalone}
\usepackage{tikz}
\usetikzlibrary{arrows}

%https://tex.stackexchange.com/a/126310/151162
%This makes sure the arrowhead on the bendy line points the right way...
\makeatletter
\def\pgf@plot@curveto@handler@finish{%
\ifpgf@plot@started%
\pgfpathcurvebetweentimecontinue{0}{0.995}{\pgf@plot@curveto@first}{\pgf@plot@curveto@first@support}{\pgf@plot@curveto@second}{\pgf@plot@curveto@second}%
\fi%
}
\makeatother

\begin{document}

\begin{tikzpicture}

\def\Wa{71.6}
\def\Wb{75}

%Create some vars to allow alignment
\def\dashboxx{1.5}
\def\dashboxwidth{5.5}
\def\dashboxheight{2}

\node (origin){};

\draw (0,0) node[anchor=east] {Tamblin} -- ++(10,0) ;
\draw (0,2) node[anchor=east, align=center] {Paxton /\\ Thornton}  -- (10,2);
\draw (0,4) node[anchor=east] {Gibson} -- (10,4);

\draw[dashed](\dashboxx,-1) rectangle ++(\dashboxwidth,\dashboxheight);
\draw[dashed](\dashboxx,3) rectangle ++(\dashboxwidth,\dashboxheight);

\draw[arrows={-angle 60}] (\dashboxx-0.5,2) node[anchor=south east] {\Wa} -- (\dashboxx+0.5,4)  node[anchor=south] {\Wa} ;

\draw[arrows={-angle 60}] (1.2,2) -- (2.2,0)  node[anchor=north] {\Wa} ;
%alternative arrows: eg [-latex,thin]
\draw [arrows={-angle 60}] plot [smooth] coordinates { (1.3,2) (2,1.5) (5,1.5) (6.5,0)} node[anchor=north] {\Wb};

\draw[dashed](2.1,1.9) rectangle ++(3,0.6);
\draw[arrows={-angle 60}] (3,2.3) node[anchor=east] {\Wa} -- (4.5,2.3) node[anchor=west] {\Wb};

\draw[arrows={-angle 60}] (5.5,2) -- (6.5,4)  node[anchor=south] {\Wb} ;

\end{tikzpicture}
\end{document}
```


Here’s how it renders (no, me neither…):

So… the above were generated using code. And everyone should learn to code, apparently. But from what I can tell, everyone (academic-wise) tends either to hand over a sketch for an artist to draw, or to try to draw the diagram in Word or Excel. (The above diagrams can be rendered as PDF, EPS, or SVG, which can in turn be converted to PNG etc, although I’m not sure how nice the SVG is, e.g. when it comes to importing it into a drawing package.) (Also based on experience, people who teach code often don’t seem to think in terms of creating or using code, even the scruffy code that many of us write, in the everyday, to actually help get stuff done…)

And when it comes to recruitment, we recruit yet more academics to academic posts, with complete disdain and disinterest for practical skills (not right for an ‘academic’ job role). But then, “university” not “polytechnic”, I guess.

We are so missing out on making contributions around how to teach innovatively using emerging current tech and develop new teaching strategies with it. Which requires the confidence and ability to use it, and explore ways in which it can be used.

Instead, the best we can hope for is finding a way of co-opting something bought off-the-shelf and making do with it as best we can at the user level, forgetting that much of the off-the-shelf stuff may have been recently developed by small start-ups with little or no academic “technology enhanced learning” expertise. (And that there is a lot of stuff to be learned at the level of deploying new tech in new ways and new combinations.) But maybe we add value by showing how to take off-the-shelf productised tech and demonstrating how to “use it properly”.

(If we’re competing with other institutions to buy the same tech, what additional value do we bring? How to “use it properly” in a distance education setting? Maybe I’m being churlish. Maybe there is real value in us doing that.)

When you work at the UI layer, you’re working at the same level as every other muppet.

It’s like a weird inverse of the not-invented-here syndrome: we’re safer buying something in because we have no expertise or capacity to develop it internally. And we’re not interested in trying to develop capacity. (Compare that to when I joined the OU: it was a leading developer of educational software; but capacity and in-house expertise have been cut and lost year-on-year for years now…)

The thing about code is that it lets you build your own tools. The thing about code is that it builds up abstraction levels that let you combine things at each level. The thing about code is that unlike Lego, where you start with big chunky Duplo blocks, move to standard Lego then Lego Technic, code starts small and fiddly, and then builds layers on top: C, Python, pandas, pytorch|Tensorflow, Docker container running Jupyter notebooks|pytorch|tensorflow, Docker container running Jupyter notebooks|pytorch|tensorflow docker composed with a database, or notebook extended to run jobs on a remote cluster.

But then, academics are academics first, not technologists first, or engineers first. The ability to do magic in the real world is of no interest.

I am so fed up.

## Using Github For Editing Course Notebooks

One of the great things about working on the TM351 module is that the module team are generally up for trying stuff out. Over the last year or two, we’ve been fumbling towards a way of working with Github for managing module notebooks.

The latest spurt of activity has been around updating Jupyter notebooks relating to the relational databases part of the course, which has involved re-writing notebook activities from scratch. (We’ve also added a couple of tools to try to help students see what changes they’re making to a database which I’ll post about later.)

Part of the strength of the OU course production route is that it involves peer review, critical reading and editing of materials before they go live to students (this also partly explains the length of time it takes to get a new course out…) The review process provides opportunities for exploring and developing pedagogy (that word has to appear in every discussion of OU course material production!), as well as learning from each other about different ways of approaching this online teaching stuff.

Posting some final edits to a reworking by a colleague of a notebook I originally drafted (such is the process), I was struck again by how a Github workflow can help us capture our thinking about certain course design decisions, as well as argue through different approaches.

In the current thing we are trying out, notebooks are being revised in an 18j-notebooks-updates branch (the OU uses BBC year-month numbering: J is October, and 18J refers to a module presentation starting in October 2018. For any OU readers, yes, this does mean we are editing materials for a module that has already started the presentation for which they are intended. Agile, innit…;-).

I’m making my edits in sub-branches derived from that branch, one sub-branch per notebook. Associated with each notebook is a separate issue in the Github issue tracker where we can discuss issues relating to the notebook.

I’ve started trying to group edits made in the sub-branch so that each commit relates to a particular sort of change:

Clicking on the commit identifier allows for inspection, as well as general comment and review, and point by point comment and review (click the + on a particular change), of the changes made as part of that commit.

The commits I’m making are being made via the Mac Github desktop client. I find I’m making multiple passes of the notebook, but also collecting multiple changes to it at the same time that fit best in separate commits.

The following isn’t the best example, but it makes the point:

I can select (highlight dark blue) various changes in the notebook file and just add those to a particular commit. Clicking on the bar within a set of colour highlighted changed rows lets you select all contiguous changed lines that make up that change.

(The nbdime tool allows you to see differences in rendered notebooks, but I’ve found that if you’re working with a notebook with cleared output cells, and commit changes reasonably regularly, it’s easy enough to identify and add the changes you want in a particular commit.)

When I’ve committed all the changes I want to suggest for a particular notebook, I can make a pull request (PR) onto the edit branch I forked from, with a Fix: ISSUE comment that associates the PR with the issue related to the notebook. If the PR is accepted, the issue is automatically closed and the sub-branch can be automatically deleted. (Our repo is in a bit of a mess with its branches at the moment and needs some serious gardening!)

You might also notice that the branch has been set up so that a review is required before a PR can be merged. This is one of the checks and balances we’ve added to try to make sure the (team) workflow doesn’t get upset by someone mistakenly deleting everything, merging something into the wrong branch, or merging files that don’t really belong in the repo (or a particular branch at least…)

None of us are particularly expert at using git or Github, but I think we’re slowly working towards a workflow that allows us to discuss, and effect, changes to materials, keep track of them, and also keep track of our reflections around them. As problems are discovered, and then resolved, we can capture that learning so we don’t make the same mistakes again, or if we do, we can look up how to resolve them (or at least, how we resolved them previously). This is something that often gets lost in the editing process.

The low level change control that we can manage through commit reviews, and the comments and commit and acceptance messages we can associate with them, is far richer, and at better levels of granularity both up and down the scale, than track changes and comments in a Word document, which is the traditional way of making edits to documents in the OU.

One thing the Github process does force on us, though, is the requirement to work at the text level. That said, the OU’s Word document workflow does end up in a text format – OU-XML – which could be managed (and edited) at the text/XML level (but oh, so the claim goes, how academics would complain…)

Personally, I think we should allow Markdown and LaTeX document creation, or authoring direct in Jupyter notebooks (with an exporter to OU-XML; unfortunately, no-one in the OU admits to experience in Jinja templating or is willing to learn enough to create an nbconvert template to render OU-XML from notebook JSON/ipynb. (It’s been on my to do list for what feels like forever!)) If we did that, we’d be able to manage all our TM351 module materials and edits, not just the notebooks, using our emerging Github workflow.

A recent post on the BBC News / Technology blog — Why Big Tech pays poor Kenyans to teach self-driving cars — describes how Kenyan knowledge workers spend 8 hour shifts creating machine learning training data for the likes of Google, Microsoft and VW as employees of Samasource, providers of “humans-in-the-loop to help you build quality ground truth training data for your natural language or computer vision algorithms”. (Seems like I missed Samasource when I blogged about these sorts of companies previously: Robot Workers?)

No matter how little you pay people, they’re still expensive, so it’s better if you can get free labour. That’s what Captchas do. One of the tasks the Samasource people do is trace around meaningful objects that appear in an image and associate them with labels that describe the thing. “Car”, “bus”, “bicycle” and so on, but if you extract time and attention from folk browsing the web to do it for you, even better.

For example, I got captcha’d the other day hacking URLs on the Bloomberg website (sites often take umbrage and challenge you to prove you aren’t a robot if you do anything other than click links on their site, such as hacking URLs or using advanced search queries).

But by selecting things that appear in a grid, that’s not as good as tracing them surely? Well, it is if you run the test thousands of times, move the grid around a pixel at a time, and do some sums.
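To see why "move the grid and do some sums" can get you back to something like a traced outline, here's a toy simulation. Everything in it is an assumption made for illustration: the function names are made up, the "image" is a set of pixel coordinates, the people ticking squares are perfectly accurate, and I have no idea how any real Captcha provider actually aggregates responses.

```python
from itertools import product


def ticked_cells(object_pixels, cell, ox, oy):
    """One simulated challenge: the grid squares that contain part of the object."""
    return {((x + ox) // cell, (y + oy) // cell) for x, y in object_pixels}


def pixels_in_cells(cells, width, height, cell, ox, oy):
    """All the pixels covered by a set of ticked grid squares."""
    return {
        (x, y)
        for x, y in product(range(width), range(height))
        if ((x + ox) // cell, (y + oy) // cell) in cells
    }


def recover_shape(object_pixels, width, height, cell):
    """Count, over every grid offset, how often each pixel lands in a ticked square."""
    offsets = list(product(range(cell), repeat=2))
    votes = {}
    for ox, oy in offsets:
        cells = ticked_cells(object_pixels, cell, ox, oy)
        for pixel in pixels_in_cells(cells, width, height, cell, ox, oy):
            votes[pixel] = votes.get(pixel, 0) + 1
    # Keep the pixels that were ticked whichever way the grid was shifted.
    return {pixel for pixel, count in votes.items() if count == len(offsets)}


# A made-up 5x2 pixel "car" in a 12x12 image, with 4x4 pixel grid squares.
car = {(x, y) for x in range(3, 8) for y in range(5, 7)}
one_challenge = pixels_in_cells(ticked_cells(car, 4, 0, 0), 12, 12, 4, 0, 0)
recovered = recover_shape(car, 12, 12, 4)
print(len(car), len(one_challenge), len(recovered))
```

A single challenge only says "the car is somewhere in these squares", which covers many more pixels than the car itself; the pixels that get ticked under every shift of the grid are a much tighter fit. For a simple rectangle like this one the fit is exact, though shapes with small holes or notches would come back filled in.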

It’s much the same with lots of the sites and services you use “for free”. They’re not free to run of course, they may cost tens or even hundreds of millions of dollars to put together and deliver, so someone has to pay. Ads cover some of it (the money there is advertising dollars in exchange for targeted audiences (Ad-Tech – A Great Way in To OSINT), and they are constructed by mining user data to find all the people who work in universities and look at pr0n on the bus for a bit of excitement, for example). Surveys have also been used as a “partial payment” mechanism (From Paywalls and Attention Walls to Data Disclosure Walls and Survey Walls).

Another partial payment mechanism is your time. For example, when your GPS app sends you on a weird route, it’s quite possibly using you as a guinea pig to see how effective that part of the route is at that time of day. It needs to learn somehow, right? Gives new meaning to the rat run, doesn’t it? (Didn’t you think of yourself as a lab rat running a maze before?)

Which leads to a handful of things on my to read list…

First up, Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI, the abstract for which reads as follows:

In the field of computer science, large-scale experimentation on users is not new. However, driven by advances in artificial intelligence, novel autonomous systems for experimentation are emerging that raise complex, unanswered questions for the field. Some of these questions are computational, while others relate to the social and ethical implications of these systems. We see these normative questions as urgent because they pertain to critical infrastructure upon which large populations depend, such as transportation and healthcare. Although experimentation on widely used online platforms like Facebook has stoked controversy in recent years, the unique risks posed by autonomous experimentation have not received sufficient attention, even though such techniques are being trialled on a massive scale. In this paper, we identify several questions about the social and ethical implications of autonomous experimentation systems. These questions concern the design of such systems, their effects on users, and their resistance to some common mitigations.

Here’s how they set the scene:

Consider, for example, navigation services that are responsible for providing millions of users with real-time directions. Given the current traffic conditions, these services attempt to suggest optimal routes for drivers. Experimentation is likely a core part of suggesting optimal routes. This is because service providers often lack information about traffic conditions on those routes to which they have purposefully not directed drivers. To determine whether a previously slow route is still slow, these services will deliberately send some users along it.
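The scenario in that passage can be mocked up in a few lines. The sketch below uses an epsilon-greedy rule, which is my choice of a simple textbook explore/exploit strategy and not something the paper says navigation services actually use; the route names, journey times and the `simulate` function are all made up.

```python
import random


def simulate(true_minutes, believed_minutes, trips, epsilon, seed=0):
    """Route drivers one at a time, updating beliefs from what each trip reveals."""
    rng = random.Random(seed)
    beliefs = dict(believed_minutes)
    routes = sorted(beliefs)
    for _ in range(trips):
        if rng.random() < epsilon:
            # Explore: deliberately send this driver along a randomly chosen route.
            route = rng.choice(routes)
        else:
            # Exploit: send this driver along the route currently believed fastest.
            route = min(routes, key=beliefs.get)
        # The trip reveals the real journey time for the route that was driven.
        beliefs[route] = true_minutes[route]
    return beliefs


# The back road used to be slow, but (unknown to the service) has cleared.
true_minutes = {"main_road": 20, "back_road": 15}
believed_minutes = {"main_road": 20, "back_road": 30}

never_explores = simulate(true_minutes, believed_minutes, trips=200, epsilon=0.0)
sometimes_explores = simulate(true_minutes, believed_minutes, trips=200, epsilon=0.2)
print(never_explores)
print(sometimes_explores)
```

With no exploration the service never finds out that the back road has improved, because it never sends anyone down it; with a little exploration it does, at the cost of occasionally sending a driver the "wrong" way. Which is exactly the lab rat bit.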

As I said, on my to-read pile. I’ll try to pull out my own TL;DR nuggets in another post when I have the spare cycles to take it in properly.

Second up, Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation, a much longer, footnoted piece which again I’m not in a mood to read right now… Maybe later…

I have such a backlog of half started posts, of which this is one… Normally, I’d have tried to complete it, but I’m losing stuff in the queue, so posting it as is means this bit is done and I may be more likely to get round to reading those papers and doing a part 2…

## Getting the TM351 VM Running on OU OpenStack

One of the original motivations for delivering the TM351 software and services via a virtual machine, with user interfaces provided via a browser, was that we should be able to use the same VM as a locally run machine on a student’s own computer, or as a hosted machine (accessible via the web) running on an OU server.

A complementary third year equivalent course, TM352 Web, Mobile and Cloud Technologies, uses a Faculty managed OpenStack instance as a dogfooding teaching environment on that course (students learn about cloud stuff in the course, get to deploy some canned machines and develop their own services using OpenStack, and the department develops skills in deploying and managing such environments with hundreds of users).

I think part of the pitch for the OpenStack cluster was that it would be available to other courses, but a certain level of twitchiness in keeping it stable for the original course use case has meant that getting access to the machine has not been as easy as it might have been.

(There is no dev server that I can access, at least not from a connection outside the OU network. So the only server I can play on is the live server, as used by students. If you’re confident managing OpenStack, this is probably fine (it should be able to cope with lots of tenants with different requirements, right?), but if you’re not, making a dev server, open to all who want to try it out, and available sooner rather than later, probably makes more sense: more people solving problems, more use cases being explored and ruled out, more issues being debugged; more learning going on generally…)

Whatever.

I’ve finally got an account, and a copy of the TM351 VM image, originally built for VirtualBox, uploaded to it.

You’d think that part at least would have been easy, but it took the best part of four months or so at least… First, getting an account on the OpenStack server. Second, getting a copy of the TM351 VM image that could be loaded onto it. I got stuck going nowhere trying to convert the original VirtualBox image until it was pointed out to me that there was a VBoxManage tool for doing it (Converting between VM Formats). Faculty advice suggests the clonehd command:

```bash
vboxmanage clonehd box-disk001.vmdk /Users/USER/Desktop/tm351.img --format raw
```

but that looks deprecated in recent versions of VirtualBox to me… The following seems more contemporary:

```bash
VBoxManage clonemedium ~/VirtualBox\ VMs/tm351_18J-student/box-disk001.vmdk tm351_18J-student.raw --format RAW
```

Third, loading the image onto OpenStack. A raw box format image I thought I had managed to create myself came in at 64GB (the original box was ~8GB), but it seems this is because that’s the size of the virtual disk. Presumably vagrant is setting this in my original build (or VirtualBox is defaulting to it?), so one thing I need to figure out is how to reduce it without compromising anything. Looking at Resizing Vagrant box disk space I wonder if we could move along steps from vmdk to vdi to resize and then raw?

Uploading a 64GB image from home to OpenStack using an http file uploader on the OpenStack user admin page is just asking for trouble, but even copying the image from OU networked machines is not just-do-it-able: it requires copying the file from one machine to another and then onto the OpenStack server by someone-not-me with the appropriate logins and scp permissions.

(Building the machine on OpenStack myself using an OpenStack vagrant provisioner is not an option on the live server at least: API access addresses seem to only be provided for a private network that I don’t have access to. If we manage to get a development server that I am allowed to access using VPN, or even better, without VPN, and I can get permissions to use the API, and we can connect to things like the apt-get and Pypi/pip repos, using a build provisioner makes sense to me.)

So there is now an image visible on the OpenStack server.

You’ll note we haven’t tried to brand the OpenStack user’s admin panel at all  (I would have…;-).

What next? Trying to spin up an instance from the image kept giving me errors (I started trying with a small machine instance, then tried creating an instance with ever larger machine flavours): the issue was indeed the 64GB default disk size associated with the image. Faculty IT changed a setting that meant the larger disk sizes would spin up and reported that it worked for them with the VM on a large flavour machine.

But it didn’t for me… I kept getting the message [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance XXXXXX.] I think the issue must have been a permissions thing manifested as a network thing. Faculty IT restarted the image as private to me (and with my own private network?) and I tried again… (For this reason, I’m not convinced that anyone else just given an account will be able to get their own version of the TM351 VM up and running? I need to understand better what requirements, if any, are placed on the creation of the OpenStack user account for it to work. And I need a second test user account (at least) to test it.)

Anyway – success for me – a running instance of the TM351 VM. And now I could use the OpenStack web console to log in to the machine using the default vagrant credentials. Which I need to change… (and find a sensible method for students to use to change the defaults).

So now I can poke around inside the VM. But I can’t see any of the services it’s running for a couple of reasons: firstly, the VM has no public IP address; secondly, the only port I think I’m allowed to expose publicly is port 80, and there are no services running on port 80. And unlike vagrant and docker, which make it easy to map and expose an arbitrary port inside the VM onto a specified port outside the VM, such as port 80, I haven’t found a way to do that in OpenStack. (The documentation sucks. Really badly. And there is no internal FAQ to give me even the slightest crib as to what to do next.)

The TM352 course materials come to my rescue here, sort of. As OU central academic staff, I can log in to course VLEs and see the published teaching material, although not the student forums. Looking in the current presentation, the materials that show TM352 students how to make their VM visible to the world haven’t been released yet so I can’t see them. Bah… But I can look at the materials provided to students on the previous presentation… Which are out of date compared to the current version of OpenStack. But never mind, because the materials are enough of a crib to figure out what to do where-ish: Block 2 Part 2: Designing a cloud, 8 Getting started with OpenStack. The essential steps boil down to the following (apols for the vagueness; I don’t want to actually restep through everything to check it works in case I break my current instance; next time I run through from scratch, I’ll tidy up the instructions. Ideally, I’d do a fresh run through in a new, virgin test user account):

1. Create a new private network for the VM to run on: I seemed to have a network already created, but here’s a howto: under Network, select the Networks option, and then Create Network with the Admin State as UP (i.e. running and usable) and the Create Subnet box ticked. Use IPv4 and set an IP address range in CIDR format (e.g. 192.168.0.0/24);
2. Create a router that interconnects the public network and the private network: from the Network menu select Routers. Set Admin State to UP and External Network to public, then Create Router. In the Network Topology view, select the router and then Add Interface, with Subnet set to the newly created private network and the IP Address left blank;
3. Configure the network security rules: from Network select Security Groups; if there’s no default group, create one; once there is, select Manage Rules. We need to add three rules:
   - Rule: ALL TCP; Direction: Ingress; Remote: CIDR; CIDR: 0.0.0.0/0
   - Rule: ALL ICMP; Direction: Ingress; Remote: CIDR; CIDR: 0.0.0.0/0
   - Rule: HTTP; Remote: CIDR; CIDR: 0.0.0.0/0
4. Create a VM instance from the TM351 image: bearing in mind the previous set-up, choose appropriately!
5. Attach a public IP address to the VM: in Network select Floating IPs and then Allocate IP to Project. With the new floating IP address, select Associate and choose appropriately.
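
For what it’s worth, the same five steps can probably also be scripted with the `openstack` command line client rather than clicked through in the web console. I haven’t tried this against the Faculty server, so treat it as a sketch: the resource names (`tm351-net`, `tm351-subnet`, `tm351-router`, `tm351-vm`), the image and flavour names, and the external network being called `public` are all assumptions, and the helper only builds the command lines rather than running them.

```python
import shlex


def openstack_setup_commands(
    cidr="192.168.0.0/24",
    external_net="public",           # assumed name of the external network
    image="TM351",                   # hypothetical image name
    flavor="m1.small",               # hypothetical flavour
    server="tm351-vm",               # hypothetical instance name
    floating_ip="FLOATING.IP.ADDR",  # placeholder: the address allocated in step 5
):
    """Return the console steps as `openstack` CLI argument lists (nothing is run)."""
    net, subnet, router = "tm351-net", "tm351-subnet", "tm351-router"
    rule = ["openstack", "security", "group", "rule", "create",
            "--ingress", "--remote-ip", "0.0.0.0/0"]
    return [
        # 1. private network and subnet
        ["openstack", "network", "create", net],
        ["openstack", "subnet", "create", "--network", net,
         "--subnet-range", cidr, subnet],
        # 2. router joining the external and private networks
        ["openstack", "router", "create", router],
        ["openstack", "router", "set", "--external-gateway", external_net, router],
        ["openstack", "router", "add", "subnet", router, subnet],
        # 3. security rules on the default group: all TCP, all ICMP, HTTP
        rule + ["--protocol", "tcp", "default"],
        rule + ["--protocol", "icmp", "default"],
        rule + ["--protocol", "tcp", "--dst-port", "80", "default"],
        # 4. the VM instance
        ["openstack", "server", "create", "--image", image,
         "--flavor", flavor, "--network", net, server],
        # 5. allocate a floating IP and attach it
        ["openstack", "floating", "ip", "create", external_net],
        ["openstack", "server", "add", "floating", "ip", server, floating_ip],
    ]


if __name__ == "__main__":
    for cmd in openstack_setup_commands():
        print(shlex.join(cmd))
```

Printing the commands first, rather than executing them, means I can eyeball them (and swap in the real image, flavour and floating IP values) before letting them loose on a live project.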

Hopefully now there should be a public IP address associated with the VM and ports 80 and 22 (ssh) exposed. Using the public IP address, from a terminal on my own local machine:

ssh vagrant@VM.IP.ADDR.ESS

followed by the password, and I should be in…

(I can’t help thinking that typing vagrant up is a much easier way to launch a VM. And then vagrant ssh to SSH in…)

Next step – try to see the public services running inside the VM, bearing in mind that we can only access services through port 80.

To test things, we can just try a simple http server on port 80:

python3 -m http.server 80

That works, so port 80 is live on my VM and I can see it from the public internet. So kill the test http server…
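
A quick way to run the same check from my own machine without a browser is to request the page and look at the status code. This is a minimal sketch using only the standard library; `http_status` is my own helper name, and `VM.IP.ADDR.ESS` is the same placeholder as in the ssh command.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, blocked upstream, etc.
        return None


if __name__ == "__main__":
    print(http_status("http://VM.IP.ADDR.ESS/"))
```

A `None` result doesn’t say whether the problem is the service, the security group or the missing floating IP, but it is a cheap first test to repeat after each configuration change.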

## Running Everything Through Port 80

Running services inside the VM against port 80 requires them to run as root (ports <1024 are privileged), but in the last rebuild of the VM we tried to move away from running everything as root and instead run them under a user account. Which means that the Jupyter server is defined to run under a user account on a non-privileged port.

I went round in circles on this one for getting on for an hour, trying to run Jupyter notebooks on port 80, but running into permissions errors accessing port 80 unless I ran the service as root.  (Things like tail /var/log/syslog helped in the debugging…)

I also had to manually fix the missing notebook directory that the notebook service is supposed to start in. (I think this is another permissions snafu – the service runs as a user but the mkdir guard run via ExecStartPre needs permissions tweaking to run as root using PermissionsStartOnly=true (issue).)

The simplest thing to do is run a proxy like nginx. Which isn’t installed in the VM. No problem, the vagrant user I ssh into the VM with can run via sudo so I should be able to just do a sudo apt-get update && sudo apt-get install -y nginx. Only I can’t because the security rules upstream of the OpenStack server won’t let me. F**k. It’s a Saturday afternoon, and there are zero, no, zilch, none, Faculty IT help files or FAQs that have been shared with me, or that I’m even aware of the existence of, with possible workarounds. But there is Twitter, and various other Saturday working friends, which gives me a result: set up an ssh tunnel and do it via my home machine ( https://stackoverflow.com/questions/36353955/apt-get-install-via-tunnel-proxy-but-ssh-only-from-client-side ):

sudo ssh -R 8899:us.archive.ubuntu.com:80 vagrant@IP.ADDR

With that tunnel set up, inside the VM I can run sudo nano /etc/apt/apt.conf and edit in the following lines:

Acquire::http::Proxy "http://localhost:8899";
Acquire::https::Proxy "https://localhost:8899";

Then I can apt-get update, apt-get install etc. inside the VM:

sudo apt-get update
sudo apt-get install -y nginx
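
The thing that makes this trick work is that the same port number appears at both ends: `ssh -R` opens that port on the VM and forwards it, via my home machine, to the Ubuntu mirror, and apt inside the VM is told to use that port on localhost as its proxy. Here is a small sketch that keeps the two in step; `apt_tunnel_recipe` is a made-up helper name, I’ve only generated the http proxy line because the tunnel points at the mirror’s port 80, and I’ve left off the sudo, which I don’t think is needed for a non-privileged port.

```python
def apt_tunnel_recipe(vm_user, vm_ip, tunnel_port=8899,
                      mirror="us.archive.ubuntu.com"):
    """Return (ssh command for the home machine, apt.conf line for the VM)."""
    # -R opens tunnel_port on the VM and forwards it back through
    # the home machine to port 80 on the package mirror
    ssh_cmd = f"ssh -R {tunnel_port}:{mirror}:80 {vm_user}@{vm_ip}"
    # apt inside the VM is pointed at the VM end of the tunnel
    apt_line = f'Acquire::http::Proxy "http://localhost:{tunnel_port}";'
    return ssh_cmd, apt_line


if __name__ == "__main__":
    for line in apt_tunnel_recipe("vagrant", "IP.ADDR"):
        print(line)
```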

To try and pre-empt any other issues, it’s worth checking that the required folders (again) are in place (/vagrant/notebooks and /vagrant/openrefine-projects) and with the appropriate owner and group (oustudent:users) permissions:

sudo chown -R oustudent:users /vagrant

As mentioned, the current ExecStartPre lines in the Jupyter notebook and OpenRefine service definition files were supposed to check that the folders exist, but I think they need changing to incorporate things like the following:

PermissionsStartOnly=true
ExecStartPre=/bin/mkdir -p /vagrant/notebooks
ExecStartPre=/bin/chown oustudent:users /vagrant/notebooks
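
The guard itself is just "make the directory if it isn’t there, then hand it to the right user". For comparison, here is roughly the same logic as a Python sketch; `ensure_dir` is my own name for it, and the ownership change will only succeed when run with sufficient privileges (which is the whole point of `PermissionsStartOnly=true` in the service file).

```python
import shutil
from pathlib import Path


def ensure_dir(path, owner=None, group=None):
    """Create path (and any parents) if missing; optionally set owner/group."""
    target = Path(path)
    # Same effect as `mkdir -p`: no error if the directory already exists
    target.mkdir(parents=True, exist_ok=True)
    if owner is not None or group is not None:
        # Changing ownership normally needs root
        shutil.chown(target, user=owner, group=group)
    return target
```

Inside the VM this would be called along the lines of `ensure_dir("/vagrant/notebooks", "oustudent", "users")`, and similarly for `/vagrant/openrefine-projects`.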

Right, so permissions should be sorted, and the Jupyter notebook server should be runnable against port 80 via the nginx proxy; but I need an nginx config file… If we were running notebooks as a service in the OU, this is the sort of thing I’d hope would be in an examples FAQ, battle tested in an OU context; but we don’t, so it isn’t, so I rely on other people having solved the problem and being willing to share their answer in public: https://nathan.vertile.com/blog/2017/12/07/run-jupyter-notebook-behind-a-nginx-reverse-proxy-subpath/

Unfortunately, it didn’t work for me out of the can… the post supposedly describes how to proxy the server down a path, but (jumping ahead) the login page URL didn’t rewrite down the path for me; tweaking the proxy definition so that the Jupyter notebook server runs at the top level (/) on port 80 did work though – so here’s the nginx definition file I ended up using:

sudo nano /etc/nginx/sites-available/default

and then:

location / {
    error_page 403 = @proxy_groot;

    deny 127.0.0.1;
    allow all;

    # set a webroot, if there is one
    #root /web_root;
    try_files $uri @proxy_groot;
}

location @proxy_groot {
    #rewrite /notebooks(.*) $1 break;
    proxy_pass http://upstream_groot;

    # pass some extra stuff to the backend
    proxy_set_header Host $host;
    proxy_set_header X-Real-Ip $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

location ~ /api/kernels/ {
    proxy_pass http://upstream_groot;
    proxy_set_header Host $host;

    # websocket support
    proxy_http_version 1.1;
}

location ~ /terminals/ {
    proxy_pass http://upstream_groot;

    # websocket support
    proxy_http_version 1.1;
}



followed by:

sudo nginx -s reload

To try to make the notebook server slightly more secure than wide open — it will be running on a public IP address after all — I need to add a password (the original TM351 VM runs everything wide open).

echo -n "my cool password" | sha1sum

then edit the system service file:

sudo nano /lib/systemd/system/jupyter.service

We need to tweak the startup along the lines of:

ExecStart=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --notebook-dir=/vagrant/notebooks --allow-root --NotebookApp.token='' --NotebookApp.password='sha1:WHATEVER' --allow_origin='*'

We can probably drop the --allow-root ? (Although the default notebook user can sudo some commands…)
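
One thing to watch: as far as I can tell, the classic notebook server expects the password setting in the form algorithm:salt:digest, which is what `notebook.auth.passwd()` generates, rather than a bare sha1sum. If the notebook package is importable, calling `passwd()` is the simplest route. The sketch below reproduces my understanding of the sha1 scheme using just the standard library, so it is an illustration of the format rather than the official implementation; `salted_hash` and `check_password` are my own names.

```python
import hashlib
import secrets


def salted_hash(password, salt=None, algorithm="sha1"):
    """Return 'algorithm:salt:digest', hashing the password followed by the salt."""
    if salt is None:
        salt = secrets.token_hex(6)  # 12 random hex characters
    digest = hashlib.new(algorithm, (password + salt).encode("utf-8")).hexdigest()
    return f"{algorithm}:{salt}:{digest}"


def check_password(hashed, password):
    """Recompute the digest using the stored salt and compare."""
    try:
        algorithm, salt, digest = hashed.split(":", 2)
    except ValueError:
        # No salt field present, e.g. 'sha1:' plus the output of a plain sha1sum
        return False
    expected = hashlib.new(algorithm, (password + salt).encode("utf-8")).hexdigest()
    return secrets.compare_digest(expected, digest)


if __name__ == "__main__":
    print(salted_hash("my cool password"))
```

The printed string is what would replace `sha1:WHATEVER` in the `ExecStart` line. The random salt means the same password gives a different string each time, which is fine because the salt is stored alongside the digest.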

Reload the daemon to acknowledge the service definition changes and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart jupyter.service

So this seems to work: I can see the Jupyter notebook server and log in via port 80 on the floating public IP address I assigned to the TM351 VM instance. I can open a notebook, run cells, call the PostgreSQL and basic Mongo databases at least, and open a terminal. What I can’t do is curl or wget or run Python requests to load data files from the internet using a notebook because of the upstream IT network security rules. This is a bit of a blocker for the course. We may be able to finesse a way round with an ssh tunnel in testing, but I don’t think we should be expecting that of our students. (Thinks: how do IT security rules / policies apply when we define activities for students that we expect them to run on their own computers?! File as: whatever… We’ll just have to do something really crappy instead for students. Or set up a best-not-tell-IT proxy on the OU network somewhere…)

The next step is – can I expose the other core teaching application in the VM: OpenRefine?

A possible blocker is that we only have one port exposed on the public internet (port 80) so we need to find a way to expose OpenRefine. Fortunately, the nbserverproxy package allows the Jupyter server to proxy services running on localhost in the VM. So I should be able to run that. But first things first:  pip installs are borked even with an ssh tunnel (open questions on Stack Overflow confirm that this is not just me…).

Okay… pip packages can be downloaded and installed from a local file, so I can download the nbserverproxy pip package on my own machine and then scp it into the running OpenStack hosted VM at /vagrant/notebooks . Then from a notebook inside the VM I can run !pip install --user ./nbserverproxy-master.zip (just to show the notebook is working properly! ;-)  and enable it: ! jupyter serverextension enable --py nbserverproxy.

Restart the notebook server from the VM command line and I should be able to see OpenRefine at http://MY.FLOATING.IP.ADDR/proxy/3334/ (the trailing slash is required, or the styling fails as the path to the style files is incorrectly resolved). I think that this should also be down the password protected path? i.e. if I hadn’t logged in to the notebook server, I don’t think I should be able to get this far? (NEED TO CHECK.)
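
Since the trailing slash is so easy to forget, here is a tiny sketch of a helper that builds the proxied URL for a service listening on a given port inside the VM. `proxied_url` is a name I’ve made up, and the `/proxy/<port>/` pattern is just the one shown above.

```python
def proxied_url(base, port):
    """Build the /proxy/<port>/ URL for a service on localhost:<port> in the VM."""
    # Strip any trailing slash from the base to avoid '//',
    # then add the final slash back deliberately
    return f"{base.rstrip('/')}/proxy/{int(port)}/"


if __name__ == "__main__":
    print(proxied_url("http://MY.FLOATING.IP.ADDR", 3334))
```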

One of the VM Easter Eggs, nbdime, is also visible on http://MY.FLOATING.IP.ADDR/proxy/8899/. Go team me… :-)

Grab a snapshot of the working VM in the idle hope that maybe if someone else tries to launch from that image, it will just work. Although things like the network and security rules will presumably need setting up?

For student use, I’d need a simple way / recipe to set up different/personalised ssh credentials into the VM, otherwise anyone with the public IP address could ssh in using the defaults. This must be a common issue, so it’d be good to see a Faculty OpenStack FAQ suggesting what the possible options are. I guess a simple one is to set them when starting the instance? Can we force keys into the VM when it launches? Another issue is (re)setting the password for the Jupyter notebook server so each student is assigned, or can easily set (and recover…) their own password.

Other next steps: is there something in OpenStack where I can define network settings, security rules, etc, and provide students with an easier way of deploying a TM351 instance on the Faculty OpenStack and making its public services available on the public internet? Can I do this with an OpenStack stack? If so, that would be a handy thing to have an OU OpenStack tutorial for…

This is obvs the sort of support that should be available in Faculty IT tutorials, FAQs, and God Forbid, in person if we’re running the OpenStack server as a Faculty service and trying to encourage people to use it, so that’s what I’ll probably spend my next day of miserable OpenStack hacking doing when I can motivate myself to do it: trying to figure out if and how to make things closer to one-click simple for students to launch their own TM351 VM. (In the first instance for TM351, we want students to be able to run course VMs on an OU server because they’re struggling with getting things running on their own computer; this is often highly correlated with them having poor computer skills, poor problem solving skills, and poor instruction following skills, so we’re on a hiding to nothing if we expect them to launch instances, choose flavours, create routers, create and assign floating IP addresses and set up security rules. On their own. Because I’m not going to do that tech support for them.) (I am ranty typing; my keyboard is suddenly VERY LOUD. [REDACTED])