Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?

A week or so ago I came across a couple of IPython notebooks produced by Catherine Devlin covering the maintenance and tuning of a PostgreSQL server: DB Introspection Notebook (example 1: introspection, example 2: tuning, example 3: performance checklist). One of the things we have been discussing in the TM351 course team meetings is the extent to which we “show our working” to students in terms of how the virtual machine and the various databases used in the course were put together, even if we don’t actually teach that stuff.

Notebooks make an ideal way of documenting the steps taken to set up a particular system, blending commentary with executable command-line and code cells.

The various approaches I’ve explored to build the VM have arguably been over-complex – vagrant, puppet, docker and docker-compose – but I’ve always seen the OU as a place where we explore the technologies we’re teaching – or might teach – in the context of both course production and course delivery (that is, we can often use a reflexive approach whereby the content of the teaching also informs the development and delivery of the teaching).

In contrast, in A DevOps Approach to Common Environment Educational Software Provisioning and Deployment I referred to a couple of examples of a far simpler approach, in which common research, or research and teaching, VM environments were put together using simple scripts. This approach is perhaps more straightforwardly composable, in that if someone has additional requirements of the machine, they can just add a simple configuration script to bring in those elements.

In our current course example, where the multiple authors have a range of skill and interest levels when it comes to installing software and exploring different approaches to machine building, I’m starting to wonder whether I should have started with a simple base machine running just an IPython notebook server and no additional libraries or packages, and then created a series of notebooks, one for each part of the course (which broadly breaks down to one part per author), containing instructions for installing all the bits and pieces required for just that part of the course. If there’s duplication across parts – the same thing being installed for each part – that’s fine – the various package managers should be able to cope with that. (The only issue would arise if different authors needed different versions of the same package, for some reason, and I’m not sure what we’d do in that case?)

The notebooks would then include explanatory documentation and code cells to install Linux packages and python packages. Authors could take control of their own setup notebooks, or just make basic requests. At some level, we might identify a core offering (for example, in our course, this might be the inclusion of the pandas package) that might be pushed up into a core configuration installation notebook executed prior to the installation notebook for each part.

Configuring the machine would then be a case of running the separate configuration notebooks for each part (perhaps preceded by a core configuration notebook), perhaps by automated means. For example, ipython nbconvert --to=html --ExecutePreprocessor.enabled=True configNotebook_1.ipynb will do the job [via StackOverflow]. This generates an output HTML report from running the code cells in the notebook (which can include command line commands) in a headless IPython process (I think!).

The following switch may also be useful (it clears the output cells): ipython nbconvert --to=pdf --ExecutePreprocessor.enabled=True --ClearOutputPreprocessor.enabled=True RunMe.ipynb (note in this case we generate a PDF report).
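
As a concrete sketch (the notebook filenames here are purely illustrative), the core and per-part configuration notebooks might be run in sequence with something like:

# run the core configuration notebook first, then one notebook per course part;
# each run executes the code cells headlessly and leaves an HTML report behind
for nb in config_core.ipynb configNotebook_1.ipynb configNotebook_2.ipynb; do
  ipython nbconvert --to=html --ExecutePreprocessor.enabled=True "$nb" || break
done

(The || break is there on the assumption that nbconvert exits with a non-zero status if a cell errors, so the run stops at the first notebook that fails to execute cleanly.)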

To build the customised VM box, the following route should work (a rough sketch follows the list):

– set up a simple Vagrant file to import a base box
– install IPython into the box
– copy the configuration notebooks into the box
– run the configuration notebooks
– export the customised box
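
By way of a sketch only (the base box name and notebook filename are illustrative, and none of this has been tested end to end), that route might look something like:

# pull in and boot a base box
vagrant init ubuntu/trusty64
vagrant up

# install IPython in the guest; the project directory is synced to /vagrant by default,
# so the configuration notebooks just need copying into it on the host side
vagrant ssh -c "sudo apt-get update && sudo apt-get install -y python-pip && sudo pip install 'ipython[notebook]'"

# run a configuration notebook inside the guest
vagrant ssh -c "cd /vagrant && ipython nbconvert --to=html --ExecutePreprocessor.enabled=True config_core.ipynb"

# export the customised machine as a reusable box
vagrant package --output tm351_custom.box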

This approach has the benefits of using simple, literate configuration scripts described within a notebook. This makes them perhaps a little less “hostile” than shell scripts, and perhaps makes it easier to build in tests inline, and report on them nicely. (If running a cell results in an error, I think the execution of the notebook will stop at that point?) The downside is that to run the notebooks, we also need to have IPython installed first.

A DevOps Approach to Common Environment Educational Software Provisioning and Deployment

In Distributing Software to Students in a BYOD Environment, I briefly reviewed a paper that reported on the use of Debian metapackages to support the configuration of Linux VMs for particular courses (each course had its own Debian metapackage that could install all the packages required for that course).

This idea of automating the build of machines comes under the wider banner of DevOps (development and operations). In a university setting, we might view this in several ways:

  • the development of course related software environments during course production, the operational distribution and deployment of software to students, updating and support of the software in use, and maintenance and updating of software between presentations of a course;
  • the development of software environments for use in research, the operation of those environments during the lifetime of a research project, and the archiving of those environments;
  • the development and operation of institutional IT services.

In an Educause review from 2014 (Cloud Strategy for Higher Education: Building a Common Solution, EDUCAUSE Center for Analysis and Research (ECAR) Research Bulletin, November, 2014 [link]), a pitch for universities making greater use of cloud services, the authors make the observation that:

In order to make effective use of IaaS [Infrastructure as a Service], an organization has to adopt an automate-first mindset. Instead of approaching servers as individual works of art with artisan application configurations, we must think in terms of service-level automation. From operating system through application deployment, organizations need to realize the ability to automatically instantiate services entirely from source code control repositories.

This is the approach I took from the start when thinking about the TM351 virtual machine, focussing more on trying to identify production, delivery, support and maintenance models that might make sense in a factory production model – one that should work in a scalable way not only across presentations of the same course and across different courses, but also across different platforms (students’ own devices, OU managed cloud hosts, student launched commercial hosts) – rather than just building a bespoke, boutique VM for a single course. (I suspect the module team would have preferred my focussing on the latter – getting something that works reliably, has been rigorously tested, and can be delivered to students – rather than faffing around with increasingly exotic and still-not-really-invented-yet tooling that I don’t really understand, to automate the production of machines from scratch that still might be a bit flaky!;-)

Anyway, it seems that the folk at Berkeley have been putting together a “Common Scientific Compute Environment for Research and Education” [Clark, D., Culich, A., Hamlin, B., & Lovett, R. (2014). BCE: Berkeley’s Common Scientific Compute Environment for Research and Education, Proceedings of the 13th Python in Science Conference (SciPy 2014).]


The BCE – Berkeley Common Environment – is “a standard reference end-user environment” consisting of a simply skinned Linux desktop running in a virtual machine delivered as a Virtualbox appliance that “allows for targeted instructions that can assume all features of BCE are present. BCE also aims to be stable, reliable, and reduce complexity more than it increases it”. The development team adopted a DevOps style approach customised for the purposes of supporting end-user scientific computing, arising from the recognition that they “can’t control the base environment that users will have on their laptop or workstation, nor do we wish to! A useful environment should provide consistency and not depend on or interfere with users’ existing setup”, further “restrict[ing] ourselves to focusing on those tools that we’ve found useful to automate the steps that come before you start doing science”. Three main frames of reference were identified:

  • instructional: students could come from all backgrounds and are often unlikely to have sysadmin skills over and above the ability to use a simple GUI approach to software installation: “The most accessible instructions will only require skills possessed by the broadest number of people. In particular, many potential students are not yet fluent with notions of package management, scripting, or even the basic idea of commandline interfaces. … [W]e wish to set up an isolated, uniform environment in its totality where instructions can provide essentially pixel-identical guides to what the student will see on their own screen.”
  • scientific collaboration: that is, the research context: “It is unreasonable to expect any researcher to develop code along with instructions on how to run that code on any potential environment.” In addition, “[i]t is common to encounter a researcher with three or more Python distributions installed on their machine, and this user will have no idea how to manage their command-line path, or which packages are installed where. … These nascent scientific coders will have at various points had a working system for a particular task, and often arrive at a state in which nothing seems to work.”
  • central support: “The more broadly a standard environment is adopted across campus, the more familiar it will be to all students”, with obvious benefits when providing training or support based on the common environment.

Whilst it was recognised that personal laptop computers are perhaps the most widely used platform, the team argued that the “environment should not be restricted to personal computers”. Some scientific computing operations are likely to stretch the resources of a personal laptop, so the environment should also be capable of running on other platforms such as hefty workstations or on a scientific computing cloud.

The first consideration was to standardise on an O/S: Linux. Since the majority of users don’t run Linux machines, this required the use of a virtual machine (VM) to host the Linux system, whilst still recognising that “one should assume that any VM solution will not work for some individuals and provide a fallback solution (particularly for instructional environments) on a remote server”.

Another issue that can arise is dealing with mappings between host and guest OS, which vary from system to system – arguing for the utility of an abstraction layer for VM configuration like Vagrant or Packer … . This includes things like portmapping, shared files, enabling control of the display for a GUI vs. enabling network routing for remote operation. These settings may also interact with the way the guest OS is configured.

Reflecting on the “traditional” way of building a computing environment, the authors argued for a more automated approach:

Creating an image or environment is often called provisioning. The way this was done in traditional systems operation was interactively, perhaps using a hybrid of GUI, networked, and command-line tools. The DevOps philosophy encourages that we accomplish as much as possible with scripts (ideally checked into version control!).

The tools explored included Ansible, packer, vagrant and docker:

  • Ansible: to declare what gets put into the machine (alternatives include shell scripts, puppet, etc.; for the TM351 monolithic VM, I used puppet). End-users don’t need to know anything about Ansible, unless they want to develop a new, reproducible, custom environment.
  • packer: used to run the provisioners and construct and package up a base box. Again, end-users don’t need to know anything about this. (For the TM351 monolithic VM, I used vagrant to build a basebox in Virtualbox, and then package it; the power of Packer is that it lets you generate builds from a common source for a variety of platforms (AWS, Virtualbox, etc etc).)
  • vagrant: their description is quite a handy one: “a wrapper around virtualization software that automates the process of configuring and starting a VM from a special Vagrant box image … . It is an alternative to configuring the virtualization software using the GUI interface or the system-specific command line tools provided by systems like VirtualBox or Amazon. Instead, Vagrant looks for a Vagrantfile which defines the configuration, and also establishes the directory under which the vagrant command will connect to the relevant VM. This directory is, by default, synced to the guest VM, allowing the developer to edit the files with tools on their host OS. From the command-line (under this directory), the user can start, stop, or ssh into the Vagrant-managed VM. It should be noted that (again, like Packer) Vagrant does no work directly, but rather calls out to those other platform-specific command-line tools.” However, “while Vagrant is conceptually very elegant (and cool), we are not currently using it for BCE. In our evaluation, it introduced another piece of software, requiring command-line usage before students were comfortable with it”. This is one issue we are facing with the TM351 VM – currently the requirement to use vagrant to manage the VM from the commandline (albeit this only really requires a couple of commands – we can probably get away with just: vagrant up && vagrant provision and vagrant suspend – but it also has a couple of benefits, like being able to trivially vagrant ssh in to the VM if absolutely necessary…).
  • docker: was perceived as adding complexity, both computationally and conceptually: “Docker requires a Linux environment to host the Docker server. As such, it clearly adds additional complexity on top of the requirement to support a virtual machine. … the default method of deploying Docker (at the time of evaluation) on personal computers was with Vagrant. This approach would then also add the complexity of using Vagrant. However, recent advances with boot2docker provide something akin to a VirtualBox-only, Docker-specific replacement for Vagrant that eliminates some of this complexity, though one still needs to grapple with the cognitive load of nested virtual environments and tooling.” The recent development of Kitematic addresses some of the use-case complexity, and also provides GUI based tools for managing some of the issues described above associated with portmapping, file sharing etc. Support for linked container compositions (using Docker Compose) is still currently lacking though…

At the end of the day, Packer seems to rule them all – coping as it does with simple installation scripts and being able to then target the build for any required platform. The project homepage is here: Berkeley Common Environment and the github repo here: Berkeley Common Environment (Github).
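
For reference, the Packer workflow itself reduces to a couple of commands run against a template file (the template name here is made up; writing the template, with its builders and shell provisioners, is where the actual work lies):

# check the (hypothetical) template parses and has the required fields
packer validate tm351.json

# run the builders and provisioners defined in the template, emitting
# platform-specific artefacts (a Virtualbox image, an AWS AMI, and so on)
packer build tm351.json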

The paper also reviewed another common environment – OSGeo. Once again built on top of a common Linux base, well documented shell scripts are used to define package installations: “[n]otably, the project uses none of the recent DevOps tools. OSGeo-Live is instead configured using simple and modular combinations of Python, Perl and shell scripts, along with clear install conventions and examples. Documentation is given high priority. … Scripts may call package managers, and generally have few constraints (apart from conventions like keeping recipes contained to a particular directory)”. In addition, “[s]ome concerns, like port number usage, have to be explicitly managed at a global level”. This approach contrasts with the approach reviewed in Distributing Software to Students in a BYOD Environment where Debian metapackages were used to create a common environment installation route.
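
A minimal sketch of that sort of convention might be one documented, self-contained shell script per application, kept in a common directory; the script below is illustrative rather than taken from OSGeo-Live:

#!/bin/bash
# install_postgresql.sh - one modular recipe per application, OSGeo-Live style:
# call the package manager directly, keep the recipe idempotent, and note any
# globally managed concerns (here, PostgreSQL's default port 5432) in comments
set -e

sudo apt-get update
sudo apt-get install -y postgresql postgresql-contrib

# a simple inline check so the build log reports whether the install actually worked
pg_isready && echo "PostgreSQL installed and accepting connections on port 5432"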


The idea of a common environment is a good one, and one that would work particularly well in a curriculum such as Computing, I think. One main difference between the BCE approach and the TM351 approach is that BCE is self-contained and runs a desktop environment within the VM, whereas the TM351 environment uses a headless VM and follows more of a microservice approach that publishes HTML based service UIs via http ports that can be viewed in a browser. One disadvantage of the latter approach is that you need to keep a more careful eye on port assignments (in the event of collisions) when running the VM locally.

A Couple of Interesting Interactive Data Storytelling Devices

A couple of interesting devices for trying to engage folk in a data mediated story. First up, a chart that is reminiscent in feel to Hans Rosling’s ignorance test, in which (if you aren’t familiar with it) audiences are asked a range of data-style questions and then demonstrate their ignorance based on their preconceived ideas about what they think the answer to the question is – and which is invariably the wrong answer (with the result that audiences perform worse than random – or as Hans Rosling often puts it, worse than a chimpanzee; by the by, Rosling recently gave a talk at the World Bank, which included a rendition of the ignorance test. Rosling’s dressing down of the audience – who make stats based policy and help spend billions in the areas covered by Rosling’s questions, yet still demonstrated their complete lack of grasp about what the numbers say – is worth watching alone…).

Anyway – the chart comes from the New York Times, in a post entitled You Draw It: How Family Income Predicts Children’s College Chances. A question is posed and the reader is invited to draw the shape of the curve they think describes the relationship between family income and college chances:

[Screenshot: the NYT “You Draw It” chart, awaiting the reader’s guess]

Once you’ve drawn your line and submitted it, you’re told how close your answer is to the actual result:
[Screenshot: the reader’s guess compared with the actual relationship – You Draw It, The New York Times]

Another display demonstrates the general understanding calculated from across all the submissions.

[Screenshot: aggregated reader guesses – You Draw It, The New York Times]

Textual explanations also describe the actual relationship, putting it into context and trying to explain the basis of the relationship. As ever, a lovely piece of work, and once again with Amanda Cox in the credits…

The second example comes from Bloomberg, and riffs on the idea of immersive stories to produce a chart that gets updated as you scroll through (I saw this described as “scrollytelling” by @arnicas):

[Screenshot: What’s Really Warming the World? – Bloomberg]

The piece is about global warming and shows the effect of various causal factors on temperature change, at first separately, and then in additive composite form to show how they explain the observed increase. It’s a nice device for advocacy, methinks…

It also reminds me that I never got round to trying out the Knight Lab Storymap.js with a zoomified/tiled chart image as the basis for a storymap (or should that be, storychart? For other storymappers, see Seven Ways to Create a Storymap). I just paid out the $19 or so for a copy of zoomify to convert large images to tilesets to work with that app, though I guess I really should have tried to hack a solution out with something like Imagemagick (I think that can export tiles?) or Inkscape (which would let me convert a vector image to tiles, I think?). Anyway, I just need a big image to try out now, which I guess I could generate from some F1 data using ggplot?
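
For the record, ImageMagick can chop a large image into fixed-size tiles with a one-liner along the following lines (filenames made up; note this produces a flat set of tiles rather than the multi-resolution pyramid that zoomify generates):

# cut a large chart into 256x256 pixel tiles, numbered left-to-right, top-to-bottom
mkdir -p tiles
convert big_f1_chart.png -crop 256x256 +repage tiles/tile_%04d.png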

Running RStudio on Digital Ocean, AWS etc Using Tutum and Docker Containers

Via RBloggers I noticed a tutorial today on Setting Rstudio server using Amazon Web Services (AWS).

In the post Getting Started With Personal App Containers in the Cloud I described how I linked my tutum account to a Digital Ocean hosting account and then launched a Digital Ocean server. (How to link tutum to Amazon AWS is described here: tutum support: Link your Amazon Web Services account.)

Having launched a server (also described in Getting Started With Personal App Containers in the Cloud), we can now create a new service that will fire up an RStudio container.

First up, we need to locate a likely container – the official one is the rocker/rstudio image:

[Screenshot: Tutum new service wizard – selecting the rocker/rstudio image]

Having selected the image, we need to do a little bit of essential configuration (we could do more, like giving the service a new name):

[Screenshot: Tutum new service wizard – basic service configuration]

Specifically, we need to publish the port so that it’s publicly viewable – then we can Create and Deploy the service:

[Screenshot: Tutum new service wizard – publishing the container port]

After a minute or two, the service should be up and running:

[Screenshot: the rstudio service up and running in Tutum]

We can now find the endpoint, and click through to it (note: we need to change the URL from a tcp:// address to an http:// one). (Am I doing something wrong in the set up to stop the http URL being minted as the service endpoint?)

[Screenshot: the service endpoint listed in Tutum]

URL tweaked, you should now be able to see an RStudio login screen. The default user is rstudio and the default password rstudio too:

[Screenshot: RStudio sign-in page]

And there we have it:-)

[Screenshot: RStudio running in the browser]
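
For comparison, the same rocker/rstudio container can be run locally (for example, via boot2docker on the desktop) with a single command, using – at least at the time of writing – the same rstudio/rstudio default credentials:

# run the RStudio server container in the background and publish its default port
docker run -d -p 8787:8787 --name rstudio-test rocker/rstudio
# then browse to http://<docker host IP>:8787 and log in as rstudio / rstudio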

So we don’t continue paying for the server, I generally stop the container and then terminate it to destroy it…

[Screenshot: stopping and terminating the container in Tutum]

And then terminate the node…

[Screenshot: terminating the node from the Tutum node dashboard]

So, assuming the Amazon sign-up process is painless, I imagine it shouldn’t be much harder than that?

By the by, it’s possible to link containers to other containers; here’s an example (on the desktop, using boot2docker) that links an RStudio container to a MySQL database: Connecting RStudio and MySQL Docker Containers – an example using the ergast db. When I get a chance, I’ll have a go at doing that via tutum too…
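
The desktop version of that linking pattern looks something like the following (the container names and password are placeholders); whether Tutum’s service linking maps across as directly is what I want to find out:

# start a MySQL container to hold the ergast db (the root password here is a placeholder)
docker run -d --name ergastdb -e MYSQL_ROOT_PASSWORD=mysecretpw mysql

# link the RStudio container to it; inside RStudio the database is then reachable
# via the hostname alias "db"
docker run -d -p 8787:8787 --name rstudio --link ergastdb:db rocker/rstudio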

Notes on Robot Churnalism, Part II – Robots in the Journalism Workplace

In the first part of this series (Notes on Robot Churnalism, Part I – Robot Writers), I reviewed some of the ways in which robot writers are able to contribute to the authoring of news content.

In this part, I will consider some of the impacts that might arise from robots entering the workplace.

Robot Journalism in the Workplace

“Robot journalists” have some competitive advantages which are hard for human journalists to compete with. The strengths of automated content generation are the low marginal costs, the speed with which articles can be written and the broad spectrum of sport events which can be covered.
Arjen van Dalen, The Algorithms Behind the Headlines, Journalism Practice, 6:5-6, 648-658, 2012, p652

One thing machines do better is create value from large amounts of data at high speed. Automation of process and content is the most under-explored territory for reducing costs of journalism and improving editorial output. Within five to 10 years, we will see cheaply produced information monitored on networks of wireless devices.
Post Industrial Journalism: Adapting to the Present, Chris Anderson, Emily Bell, Clay Shirky, Tow Center for Digital Journalism Report, December 3, 2014

Year on year, it seems, the headlines report how the robots are coming to take over a wide range of professional jobs and automate away the need to employ people to fill a wide range of currently recognised roles (see, for example, this book: The Second Machine Age [review], this Observer article: Robots are leaving the factory floor and heading for your desk – and your job, this report: The Future of Employment: How susceptible are jobs to computerisation? [PDF], this other report: AI, Robotics, and the Future of Jobs [review], and this business case: Rethink Robotics: Finding a Market).

Stories also abound fearful of a possible robotic takeover of the newsroom: ‘Robot Journalist’ writes a better story than human sports reporter (2011), The robot journalist: an apocalypse for the news industry? (2012), Can an Algorithm Write a Better News Story Than a Human Reporter? (2012), Robot Writers and the Digital Age (2013), The New Statesman could eventually be written by a computer – would you care? (2013), The journalists who never sleep (2014), Rise of the Robot Journalist (2014), Journalists, here’s how robots are going to steal your job (2014), Robot Journalist Finds New Work on Wall Street (2015).

It has to be said, though, that many of these latter “inside baseball” stories add nothing new, perhaps reflecting the contributions of another sort of robot to the journalistic process: web search engines like Google…

Looking to the academic literature, in his 2015 case study around Narrative Science, Matt Carlson describes how “public statements made by its management reported in news about the company reveal two commonly expressed beliefs about how its technology will improve journalism: automation will augment— rather than displace — human journalists, and it will greatly expand journalistic output” p420 (Matt Carlson (2015), The Robotic Reporter, Digital Journalism, 3:3, 416-431).

As with the impact of many other technological innovations within the workplace, “[a]utomated journalism’s ability to generate news accounts without intervention from humans raises questions about the future of journalistic labor” (Carlson, 2015, p422). In contrast to the pessimistic view that “jobs will be lost”, there are at least two possible positive outcomes for jobs that may result from the introduction of a new technology: firstly, that the technology helps transform the original job, in so doing making it more rewarding, or allows the original worker to “do more”; secondly, that the introduction of the new technology creates new roles and new job opportunities.

On the pessimistic side, Carlson describes how:

many journalists … question Narrative Science’s prediction that its service would free up or augment journalists, including Mathew Ingram (GigaOm, April 25, 2012): “That’s a powerful argument, but it presumes that the journalists who are ‘freed up’ because of Narrative Science … can actually find somewhere else that will pay them to do the really valuable work that machines can’t do. If they can’t, then they will simply be unemployed journalists.” This view challenges the virtuous circle suggested above to instead argue that some degree of displacement is inevitable.(Carlson, 2015, p423)

On the other hand:

[a]ccording to the more positive scenario, machine-written news could be complementary to human journalists. The automation of routine tasks offers a variety of possibilities to improve journalistic quality. Stories which cannot be covered now due to lack of funding could be automated. Human journalists could be liberated from routine tasks, giving them more time to spend on quality, in-depth reporting, investigative reporting. (van Dalen, p653)

This view thus represents the idea of algorithms working alongside the human journalists, freeing them up from the mundane tasks and allowing them to add more value to a story… If a journalist has 20 minutes to spend on a story, and that time is spent searching a database and pulling out a set of numbers that may not even be very newsworthy, how much more journalistically productive could that journalist be if a machine gave them the data and a canned summary of it for free, allowing the journalist to use the few minutes allocated to that story to take the next step – adding in some context, perhaps, or contacting a second source for comment?

A good example of the time-saving potential of automated copy production can be seen in the publication of earnings reports by AP, as reported by trade blog journalism.co.uk, who quoted vice president and managing editor Lou Ferrara’s announcement of a tenfold increase in stories from 300 per quarter produced by human journalists, to 3,700 with machine support (AP uses automation to increase story output tenfold, June, 2015).

The process AP went through during testing appears to be one that I’m currently exploring with my hyperlocal, OnTheWight, for producing monthly JobSeekers Allowance reports (here’s an example of the human produced version, which in this case was corrected after a mistake was spotted when checking that an in-testing machine generated version of the report was working correctly! As journalism.co.uk reported about AP, “journalists were doing all their own manual calculations to produce the reports, which Ferrara said had ‘potential for error’.” Exactly the same could have been said of the OnTheWight process…)

In the AP case, “during testing, the earnings reports were produced via automation and journalists compared them to the relevant press release and figured out bugs before publishing them. A team of five reporters worked on the project, and Ferrara said they still had to check for everything a journalist would normally check for, from spelling mistakes to whether the calculations were correct.” (I wonder if they check the commas, too?!) The process I hope to explore with OnTheWight builds in the human checking route, taking the view that the machine should generate press-release style copy that does the grunt work in getting the journalist started on the story, rather than producing the complete story for them. At AP, it seems that automation “freed up staff time by one fifth”. The process I’m hoping to persuade OnTheWight to adopt is that to begin with, the same amount of time should be spent on the story each month, but month on month we automate a bit more and the journalistic time is then spent working up what the next paragraph might be, and then in turn automate the production of that…

Extending the Promise?

In addition to time-saving, there is the hope that the wider introduction of robot journalists will create new journalistic roles:

Beyond questions of augmentation or elimination, Narrative Science’s vision of automated journalism requires the transformation of journalistic labor to include such new positions as “meta-writer” or “metajournalist” to facilitate automated stories. For example, Narrative Science’s technology can only automate sports stories after journalists preprogram it with possible frames for sports stories (e.g., comeback, blowout, nail-biter, etc.) as well as appropriate descriptive language. After this initial programming, automated journalism requires ongoing data management. Beyond the newsroom, automated journalism also redefines roles for non-journalists who participate in generating data. (Carlson, 2015, p423)

In the first post of this series, I characterised the process used by Narrative Science, which included the application of rules for detecting signals and angles, and the linkage of detected “facts” to story points within a particular angle that could then be used to generate a narrative told through automatically generated natural language. Constructing angles, identifying logical processes that can identify signals and map them on to story elements, and generating turns of phrase that can help explicate narratives in a natural way are all creative acts that are likely to require human input for the near future at least, albeit tasking the human creative with the role of supporting the machine. This is not necessarily that far removed from some of the skills already employed by journalists, however. As Carlson suggests, “Scholars have long documented the formulaic nature underlying compositional forms of news exposed by the arrival of automated news. … much journalistic writing is standardized to exclude individual voice. This characteristic makes at least a portion of journalistic output susceptible to automation” (p425). What’s changing, perhaps, is that the journalists must now learn to capture those standardised forms and map them onto structures that act as programme fodder for their robot helpers.

Audience Development

Narrative Science also see potential in increasing the size of the total potential audience by accommodating the very specific needs of a large number of niche audiences.

“While Narrative Science flaunts the transformative potential of automated journalism to alter both the landscape of available news and the work practices of journalists, its goal when it comes to compositional form is conformity with existing modes of human writing. The relationship here is telling: the more the non-human origin of its stories is undetectable, the more it promises to disrupt news production. But even in emulating human writing, the application of Narrative Science’s automation technology to news prompts reconsiderations of the core qualities underpinning news composition. The attention to the quality and character of Narrative Science’s automated news stories reflects deep concern both with existing news narratives and with how automated journalistic writing commoditizes news stories.” Carlson, 2015, p424

In the midst of this mass of stories, it’s possible that there will be some “outliers” that are of more general interest which can, with some additional contextualisation and human reporting, be made relevant to a wider audience.

There is also the possibility of searching for “meta-stories” that tell not the specifics of particular cases, but identify trends across the mass of stories as a whole. (Indeed, it is by looking for such trends and patterns that outliers may be detected). In addition, patterns that only become relevant when looking across all the individual stories might in turn lead to additional stories. (For example, a failing school operated by a particular provider is perhaps of only local interest, but if it turns out that the majority of schools operated by that provider were turned round from excellent to failing, questions might, perhaps, be worth asking…?!)

When it comes to the case for expanding the range of content that is available, Narrative Science’s hope appears to be that:

[t]he narrativization of data through sophisticated artificial intelligence programs vastly expands the terrain of news. Automated journalism becomes a normalized component of the news experience. Moreover, Narrative Science has tailored its promotional discourse to reflect the economic uncertainty of online journalism business models by suggesting that its technology will create a virtuous circle in which increased news revenue supports more journalists (Carlson, 2015, p 421).

The alternative, fearful view, of course, is that revenues will be protected by reducing the human wage bill, using robot content creators operating at a near zero marginal cost on particular story types to replace human content creation.

Whether news organisations will use automation to extend the range of producers in the newsroom, or contribute to the reduction of human creative input to the journalistic process, is perhaps still to be seen. As Anderson, Bell & Shirky noted, “the reality is that most journalists at most newspapers do not spend most of their time conducting anything like empirically robust forms of evidence gathering.” Perhaps now is the time for them to stop churning the press releases and statistics announcements – after all, the machines can do that faster and better – and concentrate more on contextualising and explaining the machine generated stories, as well as spending more time out hunting for stories and pursuing their own investigative leads?

Distributing Software to Students in a BYOD Environment

Reading around a variety of articles on the various ways of deploying software in education, it struck me that in traditional institutions a switch may be taking place between students making use of centrally provided computing services – including physical access to desktop computers – and students bringing their own devices on which they may want to run the course software themselves. In addition, traditional universities are also starting to engage increasingly with their own distance education students; and the rise of the MOOCs is based around the idea of online course provision – that is, distance education.

The switch from centrally provided computers to a BYOD regime contrasts with the traditional approach in distance education, in which students traditionally provided their own devices, onto which they installed software packaged and provided by their educational institution. That is, distance education students have traditionally been BYOD users.

However, in much the same way that the library in a distance education institution like the OU could not originally provide physical information (book lending) services to students, instead brokering access agreements with other HE libraries, but can now provide a traditional library service through access to digital collections, academic computing services are perhaps now more in a position where they can provide central computing services, at scale, to their students. (Contributory factors include: readily available network access for students, and cheaper provider infrastructure costs (servers, storage, bandwidth, etc).)

With this in mind, it is perhaps instructive for those of us working in distance education to look at how the traditional providers are coping with an influx of BYOD users, and how they are managing access to, and the distribution of, software to this newly emerging class of user (for them), whilst at the same time continuing to provide access to managed facilities such as computing labs and student accessed machines.


Notes from: Supporting CS Education via Virtualization and Packages – Tools for Successfully Accommodating “Bring-Your-Own-Device” at Scale, Andy Sayler, Dirk Grunwald, John Black, Elizabeth White, and Matthew Monaco, SIGCSE’14, March 5–8, 2014, Atlanta, GA, USA [PDF]

The authors describe “a standardized development environment for all core CS courses across a range of both school-owned and student-owned computing devices”, leveraging “existing off-the-shelf virtualization and software management systems to create a common virtual machine that is used across all of our core computer science courses”. The goal was to “provide students with an easy to install and use development environment that they could use across all their CS courses. The development environment should be available both on department lab machines, and as a VM for use on student-owned machines (e.g. as a ‘lab in a box’).”

From the student perspective, our solution had to: a) Run on a range of host systems; b) Be easy to install; c) Be easy to use and maintain; d) Minimize side-effects on the host system; e) Provide a stable experience throughout the semester.

From the instructor perspective, our solution had to: a) Keep the students happy; b) Minimize instructor IT overhead; c) Provide consistent results across student, grader, and instructor machines; d) Provide all necessary software for the course; e) Provide the ability to update software as the course progresses.

Virtualbox was adopted on the grounds that it runs cross-platform, is free, open source software, and has good support for running Linux guest machines. The VM was based on Ubuntu 12.04 (presumably the long term support edition available at the time) and distributed as an .ova image.

To support the distribution of software packages for a particular course, Debian metapackages (that simply list dependencies; in passing, I note that the Anaconda python distribution supports the notion of python (conda) metapackages, but pip does not, specifically?) were created on a per course basis that could be used via apt-get to install all the necessary packages required for a particular course (example package files).

In terms of student support, the team published “a central web-page that provides information about the VM, download links, installation instructions, common troubleshooting steps, and related self-help information” along with “YouTube videos describing the installation and usage of the VM”. Initial distribution is provided using BitTorrent. Where face-to-face help sessions are required, VM images are provided on USB memory sticks to avoid download time delays. Backups are handled by bundling Dropbox into the VM and encouraging students to place their files there. (Github is also used.)

The following observation is useful in respect of student experience of VM performance:

“Modern CPUs provide extensions that enable a fast, smooth and enjoyable VM experience (i.e. VT-x). Unfortunately, many non-Apple PC manufacturers ship their machines with these extension disabled in the BIOS. Getting students to enable these extensions can be a challenge, but makes a big difference in their overall impression of VM usability. One way to force students to enable these extensions is to use a 64-bit and/or multi-core VM, which VirtualBox will not start without virtualization extensions enabled.”

The open issues identified by the team are: virtualisation support; corrupted downloads of the VM (mitigated by publishing a checksum for the VM and verifying downloads against it); and the lack of a computer capable of running the VM (ARM devices, low specification Intel Atom computers). [On this latter point, it may be worth highlighting the distinction between hardware that cannot cope with running computationally intensive applications, hardware that has storage limitations, and hardware that cannot run particular virtualisation services (for example, that cannot run x86 virtualisation). See also: What Happens When “Computers” Are Replaced by Tablets and Phones?]
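
Returning to the checksum mitigation mentioned above, in practice it amounts to no more than the following (the filename is illustrative):

# on the distribution side: publish this hash file alongside the download link
sha256sum course-vm.ova > course-vm.ova.sha256

# on the student side: recompute and compare; an "OK" means the download is intact
sha256sum -c course-vm.ova.sha256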


The idea of using package management is attractive, and contrasts with the approach I took when hacking together the TM351 VM using vagrant and puppet scripts. It might make sense to further abstract the machine components into a Debian metapackage and a simple python/pip “meta” package (i.e. one that simply lists dependencies). The result would be an installation reduced to a couple of lines of the form:

apt-get install ou-tm351=15J.0
pip install ou_tm351==15J.0

where packages are versioned to a particular presentation of an OU course, with a minor version number to accommodate any updates/patches. One downside to this approach is that it splits co-dependency relationships between python and Debian packages relative to a particular application. In the current puppet build files for the monolithic VM build, each application has its own puppet file that installs the additional libraries over the base libraries required for a particular application. (In addition, particular applications can specify dependencies on base libraries.) For the dockerised VM build, each container image has its own Dockerfile that identifies the dependencies for that image.
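
As a sketch of how the Debian half of that might be put together using the equivs helper tool (the package contents, versions and filenames below are purely illustrative), with a simple pinned requirements file standing in for the pip “meta” package:

# build a metapackage that does nothing but depend on the required system packages
sudo apt-get install -y equivs
cat > ou-tm351.control <<EOF
Package: ou-tm351
Version: 15J.0
Depends: postgresql, mongodb, python-pandas
Description: Metapackage pulling in the system software for the 15J presentation of TM351
EOF
equivs-build ou-tm351.control            # emits ou-tm351_15J.0_all.deb
sudo dpkg -i ou-tm351_15J.0_all.deb && sudo apt-get install -f -y

# the pip side, pinned to the presentation via a requirements file
cat > ou-tm351-15J.0.txt <<EOF
pandas==0.16.2
EOF
pip install -r ou-tm351-15J.0.txt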

Tracing its history (and reflecting the accumulated clutter of my personal VM learning journey!), the draft TM351 VM is currently launched and provisioned using vagrant, partly because I can’t seem to start the IPython Notebook reliably from a startup script:-( Distributing the machine as a start/stoppable appliance (i.e. as an Open Virtualization Format/.ova package) might be more convenient, if we could guarantee that file sharing with the host works as required (sharing against a specific folder on the host) and any port collisions experienced by the provided services can be managed and worked around?
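
Packaging the provisioned machine as an appliance is itself a one-liner against the VirtualBox command line tools (the VM name below is whatever the built machine happens to be called in VirtualBox):

# export the built VM as a distributable appliance...
VBoxManage export tm351vm --output tm351.ova
# ...which can then be imported on a student machine via the GUI, or from the command line
VBoxManage import tm351.ova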

Port collisions are less of an issue for Sayler et al. because their model is that students will be working within the VM context – a “desktop as a local service” (or “platform as a local service” model); the TM351 VM model provides services that run within the VM, some of which are exposed via http to the host – more of a “software as a local service” model. In the cloud, software-as-a-service and desktop-as-a-service models are end-user delivery models, where users access services through a browser or lightweight desktop client, compared with “platform-as-a-service” offerings where applications can be developed and delivered within a managed development environment offering high level support services, or “infrastructure as a service” offerings, which provide access to base computing components (computational processing, storage, networking, etc.)

Note that what interests me particularly are delivery models that support all three of the following models: BYOD, campus lab, and cloud/remotely hosted offerings (as a crude shorthand, I use ‘cloud’ to mean environments that are responsive in terms of firing up servers to meet demand). The notions of personal computing environments, course computing environments and personal course computing environments might also be useful (for example, a course computing environment might be a generic container populated with course software, a personal course computing container might then be a container linked to a student’s identity, with persisted state and linked storage, or a course container running on a student’s own device), alongside research computing environments and personal research computing environments.

What Happens When “Computers” Are Replaced by Tablets and Phones?

With personal email services managed online since what feels like forever (and probably is “forever”, for many users), personally accessed productivity apps delivered via online services (perhaps with some minimal support for in-browser, offline use) – things like Microsoft Office Online or Google Docs – video and music services provided via online streaming services, rather than large file downloads, image galleries stored in the cloud and social networking provided exclusively online, and in the absence of data about connecting devices (which is probably available from both OU and OU-owned FutureLearn server logs), I wonder if the OU strategists and curriculum planners are considering a future where a significant percentage of OUr distance education students do not have access to a “personal (general purpose) computer” onto which arbitrary software applications can be installed rather than from which they can simply be accessed, but do have access to a network connection via a tablet device, and perhaps a wireless keyboard?

And if the learners do have access to a desktop or laptop computer, what happens if that is likely to be a works machine, or perhaps a public access desktop computer (though I’m not sure how much longer they will remain around), probably with administrative access limits on it (if the OU IT department’s obsession with minimising general purpose and end-user defined computing is anything to go by…)?

If we are to require students to make use of “installed software” rather than software that can be accessed via browser based clients/user interfaces, then we will need to ask the access question: is it fair to require students to buy a desktop computer onto which software can be installed purely for the purposes of their studies, given they presumably have access otherwise to all the (online) digital services they need?

I seem to recall that the OU’s student computing requirements are now supposed to be agnostic as to operating system (the same is not true internally, unfortunately, where legacy systems still require Windows and may even require obsolete versions of IE!;-) although the general guidance on the matter is somewhat vague and perhaps not a little out of date…?!

I wish I’d kept copies of OU computing (and network) requirements over the years. Today, network access is likely to come in the form of wired, fibre, or wireless broadband access (the latter particularly in rural areas – example), or, for the cord-cutters, a mobile/3G-4G connection; personal computing devices that connect to the network are likely to be smartphones, tablets, laptop computers, Chromebooks and their ilk, and gaming desktop machines. Time was when a household was lucky to have a single personal desktop computer, a requirement that became expected of OU students. I suspect that is still largely true… (the yoof’s gaming machine; the 10 year old “office” machine).

If we require students to run “desktop” applications, should we then require the students to have access to computers capable of installing those applications on their own computer, or should we be making those applications available in a way that allows them to be installed and run anywhere – either on local machines (for offline use), or on remote machines (either third party managed or managed by the OU) where a network connection is more or less always guaranteed?

One of the reasons I’m so taken by the idea of containerised computing is that it provides us with a mechanism for deploying applications to students that can be run in a variety of ways. Individuals can run the applications on their own computers, in the cloud, via service providers accessed and paid for directly by the students on a metered basis, or by the OU.

Container contents can be very strictly version controlled and archived, and are easily restored if something should go wrong (there are various ‘switch-it-off-and-switch-it-on-again’ possibilities with several degrees of severity!) Container image files can be distributed using physical media (USB memory sticks, memory cards) for local use, and for OU cloud servers, at least, those images could be pre-installed on student accessed container servers (meaning the containers can start up relatively quickly…)
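
Distribution via physical media is supported directly by the docker tooling; for example (the image and file names are made up):

# on the production side: write a container image out to a tar file that can be copied to a USB stick
docker save -o tm351_notebook.tar ou/tm351-notebook

# on the student (or OU server) side: load the image from the file, no network connection required
docker load -i tm351_notebook.tar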

If updates are required, these are likely to be lightweight – only those bits of the application that need updating will be updated.

At the moment, I’m not sure how easy it is to share a data container containing a student’s work with application containers that are arbitrarily launched on various local and remote hosts? (Linking containers to Dropbox containers is one possibility, but they would perhaps be slow to synch? Flocker is perhaps another route, with its increased emphasis on linked data container management?)
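
The sort of data container pattern I have in mind is sketched below (the names are made up); how cleanly that separation travels between arbitrary local and remote hosts is exactly the open question:

# create a named, data-only container exposing a volume for the student's work
docker create -v /home/student/work --name tm351_work busybox

# mount that volume into whichever application container is currently being used
docker run -d -p 8888:8888 --volumes-from tm351_work ou/tm351-notebook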

If any other educational institutions, particularly those involved in distance education, are looking at using containers, I’d be interested to hear what your take is…

And if any folk in the OU are looking at containers in any context (teaching, research, project work), please get in touch – I need folk to bounce ideas around with, sanity check with, and ask for technical help!;-)