
A DevOps Approach to Common Environment Educational Software Provisioning and Deployment

In Distributing Software to Students in a BYOD Environment, I briefly reviewed a paper that reported on the use of Debian metapackages to support the configuration of Linux VMs for particular courses (each course had its own Debian metapackage that could install all the packages required for that course).

This idea of automating the build of machines comes under the wider banner of DevOps (development and operations). In a university setting, we might view this in several ways:

  • the development of course related software environments during course production, the operational distribution and deployment of software to students, updating and support of the software in use, and maintenance and updating of software between presentations of a course;
  • the development of software environments for use in research, the operation of those environments during the lifetime of a research project, and the archiving of those environments;
  • the development and operation of institutional IT services.

In an Educause review from 2014 (Cloud Strategy for Higher Education: Building a Common Solution, EDUCAUSE Center for Analysis and Research (ECAR) Research Bulletin, November, 2014 [link]), a pitch for universities making greater use of cloud services, the authors make the observation that:

In order to make effective use of IaaS [Infrastructure as a Service], an organization has to adopt an automate-first mindset. Instead of approaching servers as individual works of art with artisan application configurations, we must think in terms of service-level automation. From operating system through application deployment, organizations need to realize the ability to automatically instantiate services entirely from source code control repositories.

This is the approach I took from the start when thinking about the TM351 virtual machine, focussing more on trying to identify production, delivery, support and maintenance models that might make sense in a factory production model – one that should work in a scalable way not only across presentations of the same course and across different courses, but also across different platforms (students’ own devices, OU managed cloud hosts, student launched commercial hosts) – rather than just building a bespoke, boutique VM for a single course. (I suspect the module team would have preferred my focussing on the latter – getting something that works reliably, has been rigorously tested, and can be delivered to students – rather than pfaffing around with increasingly exotic and still-not-really-invented-yet tooling that I don’t really understand, trying to automate production of machines from scratch that still might be a bit flaky!;-)

Anyway, it seems that the folk at Berkeley have been putting together a “Common Scientific Compute Environment for Research and Education” [Clark, D., Culich, A., Hamlin, B., & Lovett, R. (2014). BCE: Berkeley’s Common Scientific Compute Environment for Research and Education, Proceedings of the 13th Python in Science Conference (SciPy 2014).]


The BCE – Berkeley Common Environment – is “a standard reference end-user environment” consisting of a simply skinned Linux desktop running in a virtual machine delivered as a VirtualBox appliance that “allows for targeted instructions that can assume all features of BCE are present. BCE also aims to be stable, reliable, and reduce complexity more than it increases it”. The development team adopted a DevOps style approach customised for the purposes of supporting end-user scientific computing, arising from the recognition that they “can’t control the base environment that users will have on their laptop or workstation, nor do we wish to! A useful environment should provide consistency and not depend on or interfere with users’ existing setup”, further “restrict[ing] ourselves to focusing on those tools that we’ve found useful to automate the steps that come before you start doing science”. Three main frames of reference were identified:

  • instructional: students may come from all backgrounds and are often unlikely to have sysadmin skills over and above the ability to use a simple GUI approach to software installation: “The most accessible instructions will only require skills possessed by the broadest number of people. In particular, many potential students are not yet fluent with notions of package management, scripting, or even the basic idea of commandline interfaces. … [W]e wish to set up an isolated, uniform environment in its totality where instructions can provide essentially pixel-identical guides to what the student will see on their own screen.”
  • scientific collaboration: that is, the research context: “It is unreasonable to expect any researcher to develop code along with instructions on how to run that code on any potential environment.” In addition, “[i]t is common to encounter a researcher with three or more Python distributions installed on their machine, and this user will have no idea how to manage their command-line path, or which packages are installed where. … These nascent scientific coders will have at various points had a working system for a particular task, and often arrive at a state in which nothing seems to work.”
  • central support: “The more broadly a standard environment is adopted across campus, the more familiar it will be to all students”, with obvious benefits when providing training or support based on the common environment.

Whilst it was recognised that personal laptop computers are perhaps the most widely used platform, the team argued that the “environment should not be restricted to personal computers”. Some scientific computing operations are likely to stretch the resources of a personal laptop, so the environment should also be capable of running on other platforms, such as hefty workstations or a scientific computing cloud.

The first consideration was to standardise on an O/S: Linux. Since the majority of users don’t run Linux machines, this required the use of a virtual machine (VM) to host the Linux system, whilst still recognising that “one should assume that any VM solution will not work for some individuals and provide a fallback solution (particularly for instructional environments) on a remote server”.

Another issue that can arise is dealing with mappings between the host and guest OS, which vary from system to system – arguing for the utility of an abstraction layer for VM configuration like Vagrant or Packer. This includes things like port mapping, shared files, and enabling control of the display for a GUI vs. enabling network routing for remote operation. These settings may also interact with the way the guest OS is configured.
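By way of illustration (and as a sketch only – the box name, port numbers and memory setting below are made up, rather than being the BCE or TM351 values), a minimal Vagrantfile captures exactly these sorts of host–guest mappings:

    # Minimal Vagrantfile sketch; box name, ports and memory are illustrative
    Vagrant.configure("2") do |config|
      config.vm.box = "example/common-environment"                  # assumed box name
      config.vm.network "forwarded_port", guest: 8888, host: 8888   # port mapping to the host
      config.vm.synced_folder ".", "/vagrant"                       # share files between host and guest
      config.vm.provider "virtualbox" do |vb|
        vb.gui = false          # flip to true if you want the guest desktop displayed
        vb.memory = 2048
      end
    end

Because it’s just a text file, it can be checked into version control alongside the provisioning scripts.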

Reflecting on the “traditional” way of building a computing environment, the authors argued for a more automated approach:

Creating an image or environment is often called provisioning. The way this was done in traditional systems operation was interactively, perhaps using a hybrid of GUI, networked, and command-line tools. The DevOps philosophy encourages that we accomplish as much as possible with scripts (ideally checked into version control!).

The tools explored included Ansible, packer, vagrant and docker:

  • Ansible: to declare what gets put into the machine (alternatives include shell scripts, puppet, etc.; for the TM351 monolithic VM, I used puppet). End-users don’t need to know anything about Ansible, unless they want to develop a new, reproducible, custom environment (see the playbook sketch at the end of this list).
  • packer: used to run the provisioners and construct and package up a base box. Again, end-users don’t need to know anything about this. (For the TM351 monolithic VM, I used vagrant to build a basebox in VirtualBox, and then package it; the power of Packer is that it lets you generate builds from a common source for a variety of platforms (AWS, VirtualBox, etc.).)
  • vagrant: their description is quite a handy one: “a wrapper around virtualization software that automates the process of configuring and starting a VM from a special Vagrant box image … . It is an alternative to configuring the virtualization software using the GUI interface or the system-specific command line tools provided by systems like VirtualBox or Amazon. Instead, Vagrant looks for a Vagrantfile which defines the configuration, and also establishes the directory under which the vagrant command will connect to the relevant VM. This directory is, by default, synced to the guest VM, allowing the developer to edit the files with tools on their host OS. From the command-line (under this directory), the user can start, stop, or ssh into the Vagrant-managed VM. It should be noted that (again, like Packer) Vagrant does no work directly, but rather calls out to those other platform-specific command-line tools.” However, “while Vagrant is conceptually very elegant (and cool), we are not currently using it for BCE. In our evaluation, it introduced another piece of software, requiring command-line usage before students were comfortable with it”. This is one issue we are facing with the TM351 VM: currently there is the requirement to use vagrant to manage the VM from the command line (albeit this only really requires a couple of commands – we can probably get away with just vagrant up && vagrant provision and vagrant suspend), though that also has a couple of benefits, like being able to trivially vagrant ssh in to the VM if absolutely necessary…
  • docker: was perceived as adding complexity, both computationally and conceptually: “Docker requires a Linux environment to host the Docker server. As such, it clearly adds additional complexity on top of the requirement to support a virtual machine. … the default method of deploying Docker (at the time of evaluation) on personal computers was with Vagrant. This approach would then also add the complexity of using Vagrant. However, recent advances with boot2docker provide something akin to a VirtualBox-only, Docker-specific replacement for Vagrant that eliminates some of this complexity, though one still needs to grapple with the cognitive load of nested virtual environments and tooling.” The recent development of Kitematic addresses some of the use-case complexity, and also provides GUI based tools for managing some of the issues described above associated with port mapping, file sharing, etc. Support for linked container compositions (using Docker Compose) is still currently lacking though…
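As a flavour of the Ansible item above, here’s a minimal playbook sketch; the package names are illustrative, not BCE’s actual manifest:

    # Minimal Ansible playbook sketch; package list is illustrative
    - hosts: all
      become: yes
      tasks:
        - name: Install base system packages
          apt:
            name: [git, python-pip]
            state: present
        - name: Install Python scientific libraries
          pip:
            name: [numpy, pandas, matplotlib]

The point is that what goes into the machine is declared in a file that can be diffed, reviewed and version controlled, rather than being the residue of a series of undocumented interactive installs.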

At the end of the day, Packer seems to rule them all – coping as it does with simple installation scripts and being able to then target the build for any required platform. The project homepage is here: Berkeley Common Environment and the github repo here: Berkeley Common Environment (Github).
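To give a flavour of the “one source, many targets” argument, here’s the rough shape of a (heavily trimmed, JSON-era) Packer template; the ISO, AMI and script names are placeholders, and a real virtualbox-iso or amazon-ebs builder needs several more settings than shown:

    {
      "builders": [
        {"type": "virtualbox-iso", "iso_url": "ubuntu-14.04.iso", "iso_checksum": "PLACEHOLDER"},
        {"type": "amazon-ebs", "region": "us-east-1", "source_ami": "ami-PLACEHOLDER", "instance_type": "t2.micro"}
      ],
      "provisioners": [
        {"type": "shell", "script": "provision.sh"}
      ],
      "post-processors": [
        {"type": "vagrant"}
      ]
    }

One provisioning script, several build targets – which is essentially the “rule them all” point.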

The paper also reviewed another common environment – OSGeo. Once again built on top of a common Linux base, well documented shell scripts are used to define package installations: “[n]otably, the project uses none of the recent DevOps tools. OSGeo-Live is instead configured using simple and modular combinations of Python, Perl and shell scripts, along with clear install conventions and examples. Documentation is given high priority. … Scripts may call package managers, and generally have few constraints (apart from conventions like keeping recipes contained to a particular directory)”. In addition, “[s]ome concerns, like port number usage, have to be explicitly managed at a global level”. This approach contrasts with the approach reviewed in Distributing Software to Students in a BYOD Environment where Debian metapackages were used to create a common environment installation route.


The idea of a common environment is a good one, and one that would work particularly well in a curriculum such as Computing, I think. One main difference between the BCE approach and the TM351 approach is that BCE is self-contained and runs a desktop environment within the VM, whereas the TM351 environment uses a headless VM and follows more of a microservice approach, publishing HTML based service UIs via http ports that can be viewed in a browser. One disadvantage of the latter approach is that you need to keep a more careful eye on port assignments (in the event of collisions) when running the VM locally.
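By way of illustration of that microservices pattern, here’s a hypothetical docker-compose style sketch (the images and port numbers are illustrative, not the actual TM351 stack); every service that wants a browser-accessible UI has to claim a port on the host, which is where the collisions come from:

    # Hypothetical compose sketch; images and ports are illustrative
    version: "2"
    services:
      notebook:
        image: jupyter/notebook
        ports:
          - "8888:8888"     # browser-accessible UI on the host
      db:
        image: postgres:9.4
        ports:
          - "5432:5432"     # another host port that something else may already be using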

What Happens When “Computers” Are Replaced by Tablets and Phones?

With personal email services managed online since what feels like forever (and probably is “forever”, for many users); productivity apps delivered via online services, perhaps with some minimal support for in-browser, offline use – things like Microsoft Office Online or Google Docs; video and music provided via online streaming services rather than large file downloads; image galleries stored in the cloud; and social networking provided exclusively online – and in the absence of data about connecting devices (which is probably available from both OU and OU-owned FutureLearn server logs) – I wonder whether the OU strategists and curriculum planners are considering a future in which a significant percentage of OUr distance education students do not have access to a “personal (general purpose) computer” onto which arbitrary software applications can be installed (rather than simply accessed), but do have access to a network connection via a tablet device, and perhaps a wireless keyboard?

And if the learners do have access to a desktop or laptop computer, what happens if that is a works machine, or perhaps a public access desktop computer (though I’m not sure how much longer those will remain around), probably with administrative access limits on it (if the OU IT department’s obsession with minimising general purpose and end-user defined computing is anything to go by…)?

If we are to require students to make use of “installed software” rather than software that can be accessed via browser based clients/user interfaces, then we will need to ask the access question: is it fair to require students to buy a desktop computer onto which software can be installed purely for the purposes of their studies, given they presumably have access otherwise to all the (online) digital services they need?

I seem to recall that the OU’s student computing requirements are now supposed to be agnostic as to operating system (the same is not true internally, unfortunately, where legacy systems still require Windows and may even require obsolete versions of IE!;-) although the general guidance on the matter is somewhat vague and perhaps not a little out of date…?!

I wish I’d kept copies of OU computing (and network) requirements over the years. Today, network access is likely to come in the form of wired or fibre broadband, wireless broadband (the latter particularly in rural areas (example)), or, for the cord-cutters, a mobile/3G-4G connection; personal computing devices that connect to the network are likely to be smartphones, tablets, laptop computers, Chromebooks and their ilk, and gaming desktop machines. Time was when a household was lucky to have a single personal desktop computer, a requirement that became expected of OU students. I suspect that is still largely true… (the yoof’s gaming machine; the 10 year old “office” machine).

If we require students to run “desktop” applications, should we then require the students to have access to computers onto which those applications can be installed, or should we be making those applications available in a way that allows them to be installed and run anywhere – either on local machines (for offline use), or on remote machines (either third party managed or managed by the OU) where a network connection is more or less always guaranteed?

One of the reasons I’m so taken by the idea of containerised computing is that it provides us with a mechanism for deploying applications to students that can be run in a variety of ways. Individuals can run the applications on their own computers, in the cloud, via service providers accessed and paid for directly by the students on a metered basis, or by the OU.

Container contents can be very strictly version controlled and archived, and are easily restored if something should go wrong (there are various ‘switch-it-off-and-switch-it-on-again’ possibilities with several degrees of severity!) Container image files can be distributed using physical media (USB memory sticks, memory cards) for local use, and for OU cloud servers, at least, those images could be pre-installed on student accessed container servers (meaning the containers can start up relatively quickly…)
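For example (image and file names illustrative), shifting an image around on physical media is only a couple of commands:

    # Write the image to a tar file, e.g. onto a USB stick or memory card
    docker save -o course_image.tar example/course-app
    # On the student's machine, load it back in and run it as usual
    docker load -i course_image.tar
    docker run -d -p 8888:8888 example/course-app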

If updates are required, these are likely to be lightweight – only those bits of the application that need updating will be updated.

At the moment, I’m not sure how easy it is to share a data container containing a student’s work with application containers that are arbitrarily launched on various local and remote hosts? (Linking containers to Dropbox containers is one possibility, but they would perhaps be slow to sync? Flocker is perhaps another route, with its increased emphasis on linked data container management?)
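One 2015-era pattern is the data volume container – the student’s work lives in a named container whose volumes get mounted into whichever application container happens to be running (names below are illustrative) – though that still doesn’t solve the problem of syncing the work across different hosts:

    # Create a named container that just holds the student's work directory
    docker create -v /home/student/work --name studentwork busybox
    # Mount that work volume into an application container
    docker run -d --volumes-from studentwork -p 8888:8888 example/notebook-app
    # ...and later reuse the same work volume from a different application container
    docker run --rm --volumes-from studentwork example/analysis-app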

If any other educational institutions, particularly those involved in distance education, are looking at using containers, I’d be interested to hear what your take is…

And if any folk in the OU are looking at containers in any context (teaching, research, project work), please get in touch – I need folk to bounce ideas around with, sanity check with, and ask for technical help!;-)

Festival Segregation

Isle of Wight Festival time again, and some immediate reflections from the first day…

I seem to remember a time-was-when festivals were social levellers – unless you were crew or had a guest pass that got you backstage. Then the backstage areas started to wend their way up the main stage margins so the backstage guests could see the stage from front-of-stage. Then you started to get the front-of-stage VIP areas with their own bars, and a special access area in front of the stage to give you a better view and keep you away from the plebs.

There has also been a growth in other third party retailed add-ons – boutique camping, for example:

[Image: Isle_of_Wight_Festival_2015_-_11th-14th_June]

and custom toilets:

[Image: Isle_of_Wight_Festival_2015]

One of the things I noticed about the boutique camping areas (which are further distinguished from the VIP camping areas…) was that they are starting to include their own bars, better toilets, and so on. Gated communities, for those who can afford a hefty premium on top of the base ticket price. Or a corporate hospitality/hostility perk.

I guess festivals always were a “platform” creating two-sided markets that could sell tickets to punters, location to third party providers (who were then free to sell goods and services to the audience), and sponsorship of every possible surface. But the festivals were, to an extent, open, level playing fields. Now they’re increasingly enclosed. So far, the music entertainment has remained free. But how long before you have to start paying to access “exclusive” events in some of the music tents?

PS I wonder: when it comes to eg toilet capacity planning, are the boutique poo-stations over and above the capacity provided by the festival promoter to meet sanitation needs, or are they factored in as part of that core capacity? Which is to say, if no-one paid the premium, would the minimum capacity requirements still be met?

PPS I also note that the IW Festival had a heliport this year (again…?)

PPPS On the toilet front, the public toilets all seemed pretty clean this year… and what really amused me was seeing a looooonnngggg queue for the purchased-access toilets…

Capital, Labour and Value… I Really Don’t Understand These Terms at All…

For most of my life, I’ve managed to avoid reading much, if anything, about political theory. I have to admit I struggle reading anything from a Marxist perspective because I don’t understand what any of the words mean (I’m not convinced I even know how to pronounce some of them…), or how the logic works that tries to play them off against each other.

The closest I do get to reading political books tend to be more related to organisational theories – things like Parkinson’s Law, for example…;-)

So at a second attempt, I’ve started reading David Graeber’s “The Utopia of Rules”. Such is my level of political naivety, I can’t tell whether it’s a rant, a critique, a satire, or a nonsense.

But if nothing else it does start to introduce words in a way that gives me a jumping off point to try to make my own sense out of them. So for example, on page 37, we have a quote claimed to be from Abraham Lincoln (whether the Abraham Lincoln, or another, possibly made up one, I have no idea – I didn’t follow the footnote to check!):

Labor is prior to, and independent of, capital. Capital is only the fruit of labor, and could never have existed if labor had not first existed. Labor is the superior of capital, and deserves much the higher consideration.

This followed on from the observation that “[m]ost Americans, for instance, used to subscribe to a rough-and-ready version of the labor theory of value.” Here’s my rough and ready understanding of that, in part generated as a riffed response to the Lincoln quote, as a picture:

[Image: myLabourTheoryOfValue]

The abstract thing created by labour is value. The abstract thing that capital is exchanged for is value. That capital (a fiction) can create more capital through loans of capital in exchange for capital+interest repayments suggests that the value capital creates – value that corresponds to interest on capital loaned – is a fiction created from a fiction. It only becomes real when the actor needing to repay the additional fiction must acquire it somehow through their own labour, though in some situations it will also be satiated through the creation of capital-interest, that is, through the creation of other fictions.

Such is the state of my political education!

PS here some other lines I’ve particularly liked so far: from p32: “The bureaucratisation of daily life means the imposition of impersonal rules and regulations; impersonal rules and regulations, in turn, can only operate if they are backed up by the threat of force.” Which follows from p.31: “Whenever someone starts talking about the ‘free market’, it’s a good idea to look around for the man with the gun. He’s never far away.”

And on international trade (p30): “(Much of what was being called ‘international trade’ in fact consisted merely of the transfer of materials back and forth between different branches of the same corporation.)”

Yahoo Pipes Retires…

And so it seems that Yahoo Pipes, a tool I first noted here (February 08, 2007), something I created lots of recipes for (see also on the original, archived OUseful site), ran many a workshop around (and even started exploring a simple recipe book around), is to be retired (end of life announcement)…

[Image: Pipes__Rewire_the_web]

It’s not completely unexpected – I stopped using Pipes much at all several years ago, as sites that started making content available via RSS and Atom feeds then started locking it down behind simple authentication, and then OAuth…

I guess I also started to realise that the world I once imagined, as for example in my feed manifesto, We Ignore RSS at OUr Peril, wasn’t going to play out like that…

However, if you still believe in pipe dreams, all is not lost… Several years ago, Greg Gaughan took up the challenge of producing a Python library that could take a Yahoo Pipe JSON definition file and execute the pipe. Looking at the pipe2py project on github just now, it seems the project is still being maintained, so if you’re wondering what to do with your pipes, that may be worth a look…

By the by, the last time I thought Pipes might not be long for this world, I posted a couple of posts that explored how it might be possible to bulk export a set of pipe definitions as well as compiling and running your exported Yahoo Pipes.

Hmmm… thinks… it shouldn’t be too hard to get pipe2py running in a docker container, should it…?
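Something like the following (untested) Dockerfile sketch might be a starting point – the install-from-github line and the way a compiled pipe actually gets invoked are assumptions to be checked against the pipe2py README:

    # Untested sketch; install source and invocation are assumptions
    FROM python:2.7
    RUN pip install git+https://github.com/ggaughan/pipe2py.git
    # Copy in a pipe definition exported from Yahoo Pipes
    COPY pipe_definition.json /pipes/pipe_definition.json
    WORKDIR /pipes
    # The actual command for compiling/running the pipe depends on pipe2py's CLI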

PS I don’t think pipe2py has a graphical front end, but javascript toolkits like jsPlumb look like they may do much of the job. (It would be nice if the Yahoo Pipes team could release the Pipes UI code, of course…;-)

PPS if you need a simple one step feed re-router, there’s always IFTTT. If realtime feed/stream processing apps are more your thing, here are a couple of alternatives that I keep meaning to explore, but never seem to get round to… Node-RED, a node.js thing (from IBM?) for doing internet-of-things inspired stream processing (I did intend to play with it once, but I couldn’t even figure out how to stream the data I had in…); and Streamtools (about), from The New York Times R&D Lab, that I think does something similar?

Standards or Interoperability?

An interesting piece, as ever, from Tim Davies (Slow down with the standards talk: it’s interoperability & information quality we should focus on) reflecting on the question of whether we need more standards, or better interoperability, in the world of (open) data publishing. Tim also links out to Friedrich Lindenberg’s warnings about 8 things you probably believe about your data standard, which usefully mock some of the claims often casually made about standards adoption.

My own take on many standards in this area is that conventions are the best we can hope for, and that even then they will be interpreted in a variety of ways, which means you have to be forgiving when trying to read them. All manner of monstrosities have been published in the guise of being HTML or RSS, so the parsers have had to do the best they could, getting the mess into a consistent internal representation on the consumer side of the transaction. Publishers can help by testing that whatever they publish does appear to parse correctly with the current “industry standard” importers, ideally open code libraries. It’s then up to the application developers to decide which parser to use, or whether to write their own.
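Python’s feedparser library is a nice example of a deliberately forgiving consumer-side parser: it swallows all manner of malformed RSS and Atom and still hands back a consistent internal representation (the feed URL below is illustrative):

    # feedparser is deliberately liberal in what it accepts
    import feedparser

    d = feedparser.parse("http://example.com/feed.rss")
    for entry in d.entries:
        print(entry.get("title"), entry.get("link"))
    # the bozo flag is set if the feed was malformed but parsed anyway
    print("Malformed but recovered:", bool(d.bozo))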

It’s all very well standardising your data interchange format, but the application developer will then want to work on that data using some other representation in a particular programming language. Even if you have a formal standard interchange format, and publishers stick to it religiously and unambiguously, you will still get different parsers generating internal representations for the application code to work on that are potentially very different, and may even have different semantics. [I probably need to find some examples of that to back up that claim, don’t I?!;-)]

I also look at standards from the point of view of trying to get things done with tools that are out there. I don’t really care if a geojson feed is strictly conformant with any geojson standard that’s out there, I just need to know that something claimed to be published as geojson works with whatever geojson parser the Leaflet Javascript library uses. I may get frustrated by the various horrors that are published using a CSV suffix, but if I can open it using pandas (a Python programming library), RStudio (an R programming environment) or OpenRefine (a data cleaning application), I can work with it.
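In practice, that often amounts to something like the following sketch – try a couple of likely encodings and settle for whichever one pandas manages to read (the URL is illustrative):

    import pandas as pd

    url = "http://example.org/spending.csv"   # illustrative
    df = None
    for enc in ("utf-8", "latin-1"):
        try:
            df = pd.read_csv(url, encoding=enc)
            break
        except UnicodeDecodeError:
            continue
    print(df.columns.tolist())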

At the data level, if councils published their spending data using the same columns, and the same number, character and date formats for those columns, it would make aggregating those datasets much easier. But even then, different councils use the same thing differently. Spending area codes or directorate names are not necessarily standardised across councils, so just having a spending area code or directorate name column (similarly identified) in each release doesn’t necessarily help.

What is important is that data publishers are consistent in what they publish so that you can start to take into account their own local customs and work around those. Of course, internal consistency is also hard to achieve. Look down any local council spending data transaction log and you’ll find the same company described in several ways (J. Smith, J. Smith Ltd, JOHN SMITH LIMITED, and so on), some of which may match the way the same company is recorded by another council, some of which won’t…
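The sort of rough-and-ready normaliser you end up writing for supplier names looks something like this (the suffix list is illustrative, not exhaustive); note that it still won’t match “J SMITH” to “JOHN SMITH”, which is where fuzzy matching or a reconciliation service has to come in:

    import re

    def normalise_name(name):
        """Crude supplier name normaliser for spending data."""
        name = name.upper().strip()
        name = re.sub(r"[.,]", " ", name)                      # strip punctuation
        name = re.sub(r"\b(LTD|LIMITED|PLC|LLP)\b", "", name)  # drop company suffixes
        return re.sub(r"\s+", " ", name).strip()

    for raw in ["J. Smith", "J. Smith Ltd", "JOHN SMITH LIMITED"]:
        print(raw, "->", normalise_name(raw))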

Stories are told from the Enigma codebreaking days of how the wireless listeners could identify Morse code operators by the cadence and rhythm of their transmissions, as unique to them as any other personal signature (you know that the way you walk identifies you, right?). In open data land, I think I can spot a couple of different people entering transactions into local council spending transaction logs, where the systems aren’t using controlled vocabularies and selection box or dropdown list entry methods, but instead support free text entry… Which is to say – even within a standard data format (a spending transaction schema) published using a conventional (though variously interpreted) document format (CSV) that may be variously encoded (UTF-8, ASCII, Latin-1), the stuff in the data file may be all over the place…

An approach I have been working towards for my own use over the last year or so is to adopt a working environment for data wrangling and analysis based around the Python pandas programming library. It’s then up to me how to represent things internally within that environment, and how to get the data into that representation within that environment. The first challenge is getting the data in, the second getting it into a state where I can start to work with it, the third getting it into a state where I can start to normalise it and then aggregate it and/or combine it with other data sets.

So for example, I started doodling a wrapper for nomis and looking at importers for various development data sets. I have things that call on the Food Standards Agency datasets (and, when I get round to it, their API) and scrape reports from the CQC website, I download and dump Companies House data into a database, and I have various scripts for calling out to various Linked Data endpoints.

Where different publishers use the same identifier schemes, I can trivially merge, join or link the various data elements. For approxi-matching, I run ad hoc reconciliation services.
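Where a shared identifier scheme does exist, the merge itself is the easy bit – a sketch using made-up data and an assumed company number column:

    import pandas as pd

    spend = pd.DataFrame({"company_no": ["01234567", "07654321"],
                          "amount": [1200.00, 340.50]})
    registry = pd.DataFrame({"company_no": ["01234567", "07654321"],
                             "name": ["ACME SUPPLIES LTD", "WIDGETS PLC"]})

    # Trivial join on the shared identifier scheme
    merged = spend.merge(registry, on="company_no", how="left")
    print(merged)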

All this is to say that, at the end of the day, the world is messy and standardised things often aren’t. Integration occurs in your application, which is why it can be handy to be able to code a little, so you can whittle and fettle the data you’re presented with into a representation and form that you can work with. Wherever possible, I use libraries that claim to be able to parse particular standards and put the data into representations I can cope with, and then, where data is published in various formats or standards, go for the option that I know has library support.

PS I realise this post stumbles all over the stack, from document formats (eg CSV) to data formats (or schema). But it’s also worth bearing in mind that just because two publishers use the same schema, you won’t necessarily be able to sensibly aggregate the datasets across all the columns (eg in spending data again, some council transaction codes may be parseable and include dates, accession based order numbers and department codes, while others may just be jumbles of numbers). And just because two things have the same name and the same semantics, doesn’t mean the format will be the same (2015-01-15, 15/1/15, 15 Jan 2015, etc etc).
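On that last point, a forgiving date parser copes with many of the common variants, though ambiguous forms still need a hint about day ordering – a quick sketch using Python’s dateutil:

    from dateutil import parser

    for d in ["2015-01-15", "15/1/15", "15 Jan 2015"]:
        # dayfirst=True disambiguates forms like 15/1/15
        print(d, "->", parser.parse(d, dayfirst=True).date())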

Problems of Data Quality

One of the advantages of working with sports data, you might have thought, is that official sports results are typically good quality data. With a recent redesign of the Formula One website, the official online (web) source of results is now the FIA website.

As well as publishing timing and classification (results) data in a PDF format intended, presumably, for consumption by the press, the FIA also publish “official” results via a web page.

But as I discovered today, using data from a scraper that scrapes results from the “official” web page rather than the official PDF documents is no guarantee that the “official” web page results bear any resemblance at all to the actual result.

[Image: formula_one_spanish_grand_prix_2015_q_off_class_pdf (page 2 of 2) and Session Classifications – Federation Internationale de l’Automobile]

Yet another sign that the whole F1 circus is exactly that – an enterprise promoted by clowns…