Running a Shell Script Once Only in Vagrant

Via somewhere (I’ve lost track of the link), here’s a handy recipe for running a shell script once and once only from a Vagrantfile.

In the shell script (runonce.sh):

#!/bin/bash

if [ ! -f ~/runonce ]
then

  #ONCE RUN CODE HERE

  touch ~/runonce
fi

In the Vagrantfile:

  config.vm.provision :shell, :inline => <<-SH
    chmod ugo+x /vagrant/runonce.sh
    /vagrant/runonce.sh
  SH

Robot Journalism in Germany

By chance, I came across a short post by uber-ddj developer Lorenz Matzat (@lorz) on robot journalism over the weekend: Robot journalism: Revving the writing engines. Along with a mention of Narrative Science, it namechecked another company that was new to me: [b]ased in Berlin, Retresco offers a “text engine” that is now used by the German football portal “FussiFreunde”.

A quick scout around brought up this Retresco post on Publishing Automation: An opportunity for profitable online journalism [translated] and their robot journalism pitch, which includes “weekly automatic Game Previews to all amateur and professional football leagues and with the start of the new season for every Game and detailed follow-up reports with analyses and evaluations” [translated], as well as finance and weather reporting.

I asked Lorenz if he was dabbling with such things and he pointed me to AX Semantics (an Aexea GmbH project). It seems their robot football reporting product has been around for getting on for a year (Robot Journalism: Application areas and potential [translated]) or so, which makes me wonder how siloed my reading has been in this area.

Anyway, it seems as if AX Semantics have big dreams. Like heralding Media 4.0: The Future of News Produced by Man and Machine:

The starting point for Media 4.0 is a whole host of data sources. They share structured information such as weather data, sports results, stock prices and trading figures. AX Semantics then sorts this data and filters it. The automated systems inside the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion. By pooling pertinent information, the system automatically pulls together an article. Editors tell the system which layout and text design to use so that the length and structure of the final output matches the required media format – with the right headers, subheaders, the right number and length of paragraphs, etc. Re-enter homo sapiens: journalists carefully craft the information into linguistically appropriate wording and liven things up with their own sugar and spice. Using these methods, the AX Semantics system is currently able to produce texts in 11 languages. The finishing touches are added by the final editor, if necessary livening up the text with extra content, images and diagrams. Finally, the text is proofread and prepared for publication.

A key technology bit is the analysis part: “the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion”. Spotting patterns and events in datasets is an area where automated journalism can help navigate the data beat and highlight things of interest to the journalist (see for example Notes on Robot Churnalism, Part I – Robot Writers for other takes on the robot journalism process). If notable features take the form of possible story points, narrative content can then be generated from them.

To support the process, it seems as if AX Semantics have been working on a markup language: ATML3 (I’m not sure what it stands for? I’d hazard a guess at something like “Automated Text ML” but could be very wrong…) A private beta seems to be in operation around it, but some hints at tooling are starting to appear in the form of ATML3 plugins for the Atom editor.

One to watch, I think…

Exporting and Distributing Docker Images and Data Container Contents

Although it was a beautiful day today, and I should really have spent it in the garden, or tinkering with F1 data, I lost the day to the screen and keyboard pondering various ways in which we might be able to use Kitematic to support course activities.

One thing I’ve had on pause for some time is the possibility of distributing docker images to students via a USB stick, and then loading them into Kitematic. To do this we need to get tarballs of the appropriate images so we can then distribute them.

docker save psychemedia/openrefine_ou:tm351d2test | gzip -c > test_openrefine_ou.tgz
docker save psychemedia/tm351_scipystacknserver:tm351d3test | gzip -c > test_ipynb.tgz
docker save psychemedia/dockerui_patch:tm351d2test | gzip -c > test_dockerui.tgz
docker save busybox:latest | gzip -c > test_busybox.tgz
docker save mongo:latest | gzip -c > test_mongo.tgz
docker save postgres:latest | gzip -c > test_postgres.tgz
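
The image names and tags above are just the ones I happen to be using; a quick check of what’s available locally, and of the exported file sizes before copying them onto a USB stick, might look something like this:

#List the images available locally (names/tags will differ)
docker images
#Check the size of the exported tarballs
ls -lh *.tgz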

On the to do list is getting these to work with the portable Kitematic branch (I’m not sure if that branch will continue, or whether the interest is too niche?!), but in the meantime, I could load an image into the Kitematic VM from the Kitematic CLI using:

docker load < test_mongo.tgz

assuming the test_mongo.tgz file is in the current working directory.
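
To load the whole set in one go, a quick loop along the following lines should work (again, assuming all the .tgz files are in the current working directory):

#Load every exported image tarball in the current directory
for f in *.tgz; do
  docker load < "$f"
done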

Another thing I need to explore is how to set up the data volume containers on the students’ machines.

The current virtual machine build scripts aim to seed the databases from raw data, but to set up the student machines it would seem more sensible to either rebuild a database from a backup, or just load in a copy of the seeded data volume container. (All the while we have to be mindful of providing a route for the students to recreate the original, as distributed, setup, just in case things go wrong. At the same time, we also need to start thinking about backup strategies for the students so they can checkpoint their own work…)

The traditional backup and restore route for PostgreSQL seems to be something like the following:

#Use docker exec to run a postgres export
docker exec -t vagrant_devpostgres_1 pg_dumpall -Upostgres -c > dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql
#If it's a large file, maybe worth zipping: pg_dump dbname | gzip > filename.gz

#The restore route would presumably be something like:
cat postgres_dump.sql | docker exec -i vagrant_devpostgres_1 psql -Upostgres
#For the compressed backup: cat postgres_dump.gz | gunzip | docker exec -i vagrant_devpostgres_1 psql -Upostgres

For mongo, things seem to be a little bit more complicated. Something like:

docker exec -t vagrant_mongo_1 mongodump

#Complementary restore command is: mongorestore

would generate a dump inside the container, but then we’d have to tar it up and get it out? Something like these mongodump containers may be easier? (mongo seems to have issues with mounting data containers on the host, on a Mac at least?)
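
One possible route for getting the dump out – untested on my part – would be to write it to a known directory inside the container and then copy that directory back onto the host with docker cp:

#Dump to a known directory inside the container...
docker exec -t vagrant_mongo_1 mongodump --out /dump
#...then copy that directory out of the container onto the host
docker cp vagrant_mongo_1:/dump ./mongo_dump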

By the by, if you need to get into a container within a Vagrant launched VM (I use vagrant with vagrant-docker-compose), the following shows how:

#If you need to get into a container:
vagrant ssh
#Then in the VM:
  docker exec -it CONTAINERNAME bash

Another way of getting to the data is to export the contents of the seeded data volume containers from the build machine. For example:

#  Export data from a data volume container that is linked to a database server

#postgres
docker run --volumes-from vagrant_devpostgres_1 -v $(pwd):/backup busybox tar cvf /backup/postgresbackup.tar /var/lib/postgresql/data 

#I wonder if these should be run with --rm to dispose of the temporary container once run?

#mongo - BUT SEE CAVEAT BELOW
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db

We can then take the tar file, distribute it to students, and use it to seed a data volume container.

Again, from the Kitematic command line, I can run something like the following to create a couple of data volume containers:

#Create a data volume container
docker create -v /var/lib/postgresql/data --name devpostgresdata busybox true
#Restore the contents
docker run --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"
#Note - the docker helpfiles don't show how to use sh -c - which appears to be required...
#Again, I wonder whether this should be run with --rm somewhere to minimise clutter?
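
As a quick sanity check that the restore worked, we can list the contents of the data volume from a throwaway container:

#Peek inside the data volume container to check the restored files are there
docker run --rm --volumes-from devpostgresdata busybox ls -l /var/lib/postgresql/data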

Unfortunately, things don’t seem to run so smoothly with mongo?

#Unfortunately, when trying to run a mongo server against a data volume container
#the presence of a mongod.lock seems to break things
#We probably shouldn't do this, but if the database has settled down and completed
#  all its writes, it should be okay?!
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db --exclude=*mongod.lock
#This generates a copy of the distributable file without the lock...

#Here's an example of the reconstitution from the distributable file for mongo
docker create -v /data/db --name devmongodata busybox true
docker run --volumes-from devmongodata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/mongobackup.tar"
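
And a similar check that the mongo data files made it across – and that the mongod.lock file really was excluded:

#List the restored mongo data files; mongod.lock should not appear
docker run --rm --volumes-from devmongodata busybox ls /data/db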

(If I’m doing something wrong wrt getting the mongo data out of the container, please let me know… I wonder as well, given the cavalier way I treat the lock file, whether the mongo container should be started up in repair mode?!)

If we have a docker-compose.yml file in the working directory like the following:

mongo:
  image: mongo
  ports:
    - "27017:27017"
  volumes_from:
    - devmongodata

##We DO NOT need to declare the data volume here
#We have already created it
#Also, if we leave it in, a "docker-compose rm" command
#will destroy the data volume container...
#...which means we wouldn't persist the data in it
#devmongodata:
#    command: echo created
#    image: busybox
#    volumes: 
#        - /data/db

We can then run docker-compose up and it should fire up a mongo container and link it to the seeded data volume container, making the data contained in that data volume container available to us.
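
So from the directory containing that docker-compose.yml, something like the following should be all that’s needed (a sketch):

#Fire up the mongo container, attached to the pre-seeded data volume container
docker-compose up -d
#Check it's running, and see what port it's exposed on
docker ps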

I’ve popped some test files here. Download and unzip them, cd into the unzipped directory from the Kitematic CLI, create and populate the data containers as above, then run: docker-compose up

You should be presented with some application containers including OpenRefine and an OU customised IPython notebook server. You’ll need to mount the IPython notebooks folder onto the unzipped folder. The example notebook (if everything works!) should demonstrate calls to the prepopulated mongo and postgres databases.

Hopefully!

Fighting With docker – and Pondering Innovation in an Institutional Context

I spent my not-OU day today battling with trying to bundle up a dockerised VM, going round in circles trying to simplify things a bit, and getting confused by docker-compose not working quite so well following an upgrade.

I think there’s still some weirdness going on (eg docker-ui showing messed up container names?) but I’m now way too confused to care or try to unpick it…

I also spent a chunk of time considering the 32 bit problem, but got nowhere with it…. Docker is predominantly a 64 bit thing, but the course has decided in its wisdom that we have to support 32 bit machines, which means I need to find a way of getting a 32 bit version of docker into a base box (apt-get install docker.io I think?), finding a way of getting the vagrant docker provisioner to use it (would an alias help?), and checking that vagrant-docker-compose works in a 32 bit VM, then tracking down 32 bit docker images for PostgreSQL, MongoDB, dockerUI and OpenRefine (or finding build files for them so I can build my own 32 bit images).
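
For what it’s worth, here’s a rough, untested sketch of what the 32 bit docker install step might look like inside an Ubuntu base box, assuming the distro packaged docker.io build is actually available for i386:

#Install the distro packaged docker (assumes an i386 build exists in the repos)
sudo apt-get update
sudo apt-get install -y docker.io
#Older Ubuntu packages install the binary as docker.io rather than docker,
#so an alias may be needed for tools that expect the docker command
alias docker='docker.io'
docker version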

We then need to be able to test the VM in a variety of regimes: 32 bit O/S on a 32 bit machine, 32 bit O/S on a 64 bit machine, 64 bit O/S on a 64 bit machine, with a variety of hardware virtualisation settings we might expect on students’ machines. I’m on a highly specced Macbook Pro, though, so my testing is skewed…

And I’m not sure I have it in me to try to put together 32 bit installs…:-( Perhaps that’s what LTS are for…?!;-)

(I keep wondering if we could get access to stats about the sorts of machines students are using to log in to the OU VLE from the user-agent strings of their browsers that can be captured in webstats? And take that two ways: 1) look to see how it’s evolving over time; 2) look to see what the profile of machines is for students in computing programmes, particularly those coming up to level 3 option study? That’s the sort of practical, useful data that could help inform course technology choices but that doesn’t have learning analytics buzzword kudos or budget attached to it though, so I suspect it’s not often championed…)

When LTS was an educational software house, I think there was also more opportunity, support and willingness to try to explore what the technology might be able to do for us and OUr students? Despite the continual round of job ads to support corporate IT, I fear that exploring the educational uses of software has not had much developer support in recent years…

An example of the sort of thing I think we could explore – if only we could find a forum to do so – is the following docker image, which contains an OU customised IPython notebook: psychemedia/ouflpy_scipystacknserver

The context is a forthcoming FutureLearn course on introductory programming. We’re currently planning on getting students to use Anaconda to run the IPython Notebooks that provide the programming environment for the course, but I idly wondered what a Kitematic route might be like. (The image is essentially the scipystack and notebook server with a few notebook extensions and OU customisations installed.)

There are some sample (testing) notebooks here that illustrate some of the features.

Here’s the installation recipe:

– download and unzip the notebooks (double click the downloaded file) and keep a note of where you unzipped the notebook directory to.

– download and install Kitematic. This makes use of docker and Virtualbox – but I think it should install them both for you if you don’t already have them installed.

– start Kitematic, search for psychemedia/ouflpy_scipystacknserver and create an application container.

[Screenshot: Kitematic_fl_1]

It should download and start up automatically.

When it’s running, click on the Notebooks panel and Enable Volumes. This allows the container to see a folder on your computer (“volumes” are a bit like folders that can be aliased or mapped on to other folders across devices).

[Screenshot: Kitematic_fl_2]

Click the cog (settings) symbol in the Notebooks panel to get to the Volumes settings. Select the directory that you created when you unzipped the downloaded notebooks bundle.

[Screenshot: Kitematic_fl_3]

Click on the Ports tab. If you click on the link that’s displayed, it should open an IPython notebook server homepage in your browser.

[Screenshot: Kitematic_fl_4]

Here’s what you should see…

[Screenshot: Kitematic_fl_5]

Click on a notebook link to open the notebook.

[Screenshot: Kitematic_fl_6]

The two demo notebooks are just simple demonstrations of some custom extensions and styling features I’ve been experimenting with. You should be able to create your own notebooks, open other people’s notebooks, etc.

You can also run the container in the cloud. Tweak the following recipe to try it out on Digital Ocean: Getting Started With Personal App Containers in the Cloud or Running RStudio on Digital Ocean, AWS etc Using Tutum and Docker Containers. (That latter example you could equally well run in Kitematic – just search for and install rocker/rstudio.)

The potential of using containers still excites me, even after 6 months or so of messing around at the fringes of what’s possible. In the case of writing a new level 3 computing course with a major practical element, limiting ourselves to a 32 bit build seems a backward step to me? I fully appreciate the need to make our courses as widely accessible as possible, and in as affordable a way as possible (ahem…), but here’s why I think supporting 32 bit machines for a new level 3 computing course is a backward step.

In the first place, I think we’re making life harder for OUrselves. (Trying to offer backwards compatibility is prone to this.) Docker is built for 64 bit and most of the (reusable) images are 64 bit. If we had the resource to contribute to a 32 bit docker ecosystem, that might be good for making this sort of technology accessible more widely internationally, as well as domestically, but I don’t think there’s the resource to do that? Secondly, we arguably worsen the experience for students with newer, more powerful machines (though perhaps this could be seen as levelling the playing field a bit?). I always liked the idea of making use of progressive enhancement as a way of trying to offer students the best possible experience using the technology they have, though we’d always have to ensure we weren’t then favouring some students over others. (That said, the OU celebrates diversity across a whole range of dimensions in every course cohort…)

Admittedly, students on a computing programme may well have bought a computer to see them through their studies – if the new course is the last one they do, that might mean the machine they bought for their degree is now 6 years old. But on the other hand, students buying a new computer recently may well have opted for an affordable netbook, or even a tablet computer, neither of which can support the installation of “traditional” software applications.

The solution I’d like to explore is a hybrid offering, where we deliver software that makes use of browser based UIs and software services that communicate using standard web protocols (http, essentially). Students who can install software on their computers can run the services locally and access them through their browser. Students who can’t install the software (because they have an older spec machine, or a newer netbook/tablet spec machine, or because they do their studies on a public access machine in a library, or using an IT crippled machine in their workplace (cough, optimised desktop, cOUgh..)) can access the same applications running in the cloud, or perhaps even from one or more dedicated hardware app runners (docker’s been running on a Raspberry Pi for some time I think?). Whichever you opt for, exactly the same software would be running inside the container and exposed in the same way through a browser… (Of course, this does mean you need a network connection. But if you bought a netbook, that’s the point, isn’t it?!)

There’s a cost associated with running things in the cloud, of course – someone has to pay for the hosting, storage and bandwidth. But in a context like FutureLearn, that’s an opportunity to get folk paying and then top slice them with a (profit generating…) overhead, management or configuration fee. And in the context of the OU – didn’t we just get a shed load of capital investment cash to spend on remote experimentation labs and yet another cluster?

There are also practical consequences – running apps on your own machine makes it easier to keep copies of files locally. When running in the cloud, the files have to live somewhere (unless we start exploring fast routes to filesharing – Dropbox can be a bit slow at synching large files, I think…)

Anyway – docker… 32 bit… ffs…

If you give the container a go, please let me know how you get on… I did half imagine we might be able to try this for a FutureLearn course, though I fear the timescales are way too short in OU-land to realistically explore this possibility.

Doodling With 3d Animated Charts in R

Doodling with some Gapminder data on child mortality and GDP per capita in PPP$, I wondered whether a 3d plot of the data over time would show different trajectories for different countries, perhaps revealing different development pathways.

Here are a couple of quick sketches, generated using R (this is the first time I’ve tried to play with 3d plots…)

library(xlsx)
#data downloaded from Gapminder
#dir()
#wb=loadWorkbook("indicator gapminder gdp_per_capita_ppp.xlsx")
#names(getSheets(wb))

#Set up dataframes
gdp=read.xlsx("indicator gapminder gdp_per_capita_ppp.xlsx", sheetName = "Data")
mort=read.xlsx("indicator gapminder under5mortality.xlsx", sheetName = "Data")

#Tidy up the data a bit
library(reshape2)
library(plyr) #rename() below comes from plyr

gdpm=melt(gdp,id.vars = 'GDP.per.capita',variable.name='year')
gdpm$year = as.integer(gsub('X', '', gdpm$year))
gdpm=rename(gdpm, c("GDP.per.capita"="country", "value"="GDP.per.capita"))

mortm=melt(mort,id.vars = 'Under.five.mortality',variable.name='year')
mortm$year = as.integer(gsub('X', '', mortm$year))
mortm=rename(mortm, c("Under.five.mortality"="country", "value"="Under.five.mortality"))

#The following gives us a long dataset by country and year with cols for GDP and mortality
gdpmort=merge(gdpm,mortm,by=c('country','year'))

#Filter out some datasets by country
x.us=gdpmort[gdpmort['country']=='United States',]
x.bg=gdpmort[gdpmort['country']=='Bangladesh',]
x.cn=gdpmort[gdpmort['country']=='China',]

Now let’s have a go at some charts. First, let’s try a static 3d line plot using the scatterplot3d package:

library(scatterplot3d)

s3d = scatterplot3d(x.cn$year,x.cn$Under.five.mortality,x.cn$GDP.per.capita, 
                     color = "red", angle = -50, type='l', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year")
s3d$points3d(x.bg$year,x.bg$Under.five.mortality, x.bg$GDP.per.capita, 
             col = "purple", type = "l")
s3d$points3d(x.us$year,x.us$Under.five.mortality, x.us$GDP.per.capita, 
             col = "blue", type = "l")

Here’s what it looks like… (it’s worth fiddling with the angle setting to get different views):

[Image: 3dline]

A 3d bar chart provides a slightly different view:

s3d = scatterplot3d(x.cn$year,x.cn$Under.five.mortality,x.cn$GDP.per.capita, 
                     color = "red", angle = -50, type='h', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year",pch = " ")
s3d$points3d(x.bg$year,x.bg$Under.five.mortality, x.bg$GDP.per.capita, 
             col = "purple", type = "h",pch = " ")
s3d$points3d(x.us$year,x.us$Under.five.mortality, x.us$GDP.per.capita, 
             col = "blue", type = "h",pch = " ")

[Image: 3dbar]

As well as static 3d plots, we can generate interactive ones using the rgl library.

Here’s the code to generate an interactive 3d plot that you can twist and turn with a mouse:

#Get the data from required countries - data cols are GDP and child mortality
x.several = gdpmort[gdpmort$country %in% c('United States','China','Bangladesh'),]

library(rgl)
plot3d(x.several$year, x.several$Under.five.mortality,  log10(x.several$GDP.per.capita),
       col=as.integer(x.several$country), size=3)

We can also set the 3d chart spinning….

play3d(spin3d(axis = c(0, 0, 1)))

We can also grab frames from the spinning animation and save them as individual png files. If you have Imagemagick installed, there’s a function that will generate the image files and weave them into an animated gif automatically.

It’s easy enough to install on a Mac if you have the Homebrew package manager installed. On the command line:

brew install imagemagick

Then we can generate a movie:

movie3d(spin3d(axis = c(0, 0, 1)), duration = 10,
        dir = getwd())

Here’s what it looks like:

[Animated gif: movie]

Handy…:-)

WordPress Quickstart With Docker

I need a WordPress install to do some automated publishing tests, so had a little look around to see how easy it’d be using docker and Kitematic. Remarkably easy, it turns out, once the gotchas are sorted. So here’s the route in four steps:

1) Create a file called docker-compose.yml in a working directory of your choice, containing the following:

somemysql:
  image: mysql
  environment:
    MYSQL_ROOT_PASSWORD: example
    
somewordpress:
  image: wordpress
  links:
    - somemysql:mysql
  ports:
    - 8082:80

The port mapping makes the WordPress container’s port 80 visible on the host at port 8082.

2) Using Kitematic, launch the Kitematic command-line interface (CLI), cd to your working directory and enter:

docker-compose up -d

(The -d flag runs the containers in detached mode – whatever that means?!;-) In practice, it just means they run in the background – see the note after this recipe for how to check up on them and stop them.

3) Find the IP address that Kitematic is running the VM on – on the command line, run:

docker-machine env dev

You’ll see something like export DOCKER_HOST="tcp://192.168.99.100:2376" – the address you want is the “dotted quad” in the middle; here, it’s 192.168.99.100
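
Depending on the docker-machine version you have installed, you may also be able to get just the address directly:

docker-machine ip dev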

4) In your browser, go to eg 192.168.99.100:8082 (or whatever values your setup is using) – you should see the WordPress setup screen:

[Screenshot: WordPress_›_Installation]

Easy:-)
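
By the by, to see what the detached containers from step 2 are up to, or to shut them down again, something like the following (run from the same working directory) should do the trick:

#Show the output from the detached containers
docker-compose logs
#Stop the containers (docker-compose rm would remove them - and any data stored inside them - completely)
docker-compose stop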

Here’s another way (via this docker tutorial: wordpress):

i) On the command line, get a copy of the MySQL image:

docker pull mysql:latest

ii) Start a MySQL container running:

docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=example -d mysql

iii) Get a WordPress image:

docker pull wordpress:latest

iv) And then get a WordPress container running, linked to the database container:

docker run --name wordpress-instance --link some-mysql:mysql -p 8083:80 -d wordpress

v) As before, lookup the IP address of the docker VM, and then go to port 8083 on that address.
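
To check both containers are up (and see the port mappings), and to tidy up again afterwards – bearing in mind that removing the containers also discards the database contents – something like the following should do:

#List the running containers and their port mappings
docker ps
#Tear down the test containers when done (this loses the WordPress database)
docker rm -f wordpress-instance some-mysql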

FOI and Communications Data

Last week, the UK Gov announced an Independent Commission on Freedom of Information (written statement) to consider:

  • whether there is an appropriate public interest balance between transparency, accountability and the need for sensitive information to have robust protection
  • whether the operation of the Act adequately recognises the need for a ‘safe space’ for policy development and implementation and frank advice
  • the balance between the need to maintain public access to information, the burden of the Act on public authorities and whether change is needed to moderate that while maintaining public access to information

To the, erm, cynical amongst us, this could be interpreted as the first step of a government trying to make it harder to access public information about decision making processes (that is, a step to reduce transparency), though it would be interesting if the commission reported that making more information proactively available as published public documents, open data and so on was an effective route to reducing the burden of FOIA on local authorities.

One thing I’ve been meaning to do for a long time is have a proper look at WhatDoTheyKnow, the MySociety site that mediates FOI requests in a public forum, as well as published FOI disclosure logs, to see what the most popular requests are by sector, and whether FOI requests can be used to identify datasets and other sources of information that are commonly requested and, by extension, should perhaps be made available proactively (for early fumblings, see FOI Signals on Useful Open Data? or The FOI Route to Real (Fake) Open Data via WhatDoTheyKnow, for example).

(Related, I spotted this the other day on the Sunlight Foundation blog: Pilot program will publicize all FOIA responses at select federal agencies: Currently, federal agencies are only required to publicly share released records that are requested three or more times. The new policy, known as “release to one, release to all,” removes this threshold for some agencies and instead requires that any records released to even one requester also be posted publicly online. I’d go further – if the same requests are made repeatedly (eg information about business rates seems to be one such example) the information should be published proactively.)

In a commentary on the FOI Commission, David Higgerson writes (Freedom of Information Faces Its Biggest Threat Yet – Here’s Why):

The government argues it wants to be the most transparent in the world. Noble aim, but the commission proves it’s probably just words. If it really wished to be the most transparent in the world, it would tell civil servants and politicians that their advice, memos, reports, minutes or whatever will most likely be published if someone asks for them – but that names and any references to who they are will be redacted. Then the public could see what those working within Government were thinking, and how decisions were made.

That is, the content should be made available, but the metadata should be redacted. This immediately put me in mind of the Communications Data Bill, which seems likely to resurface any time soon, and which wants to collect communications metadata (who spoke to whom) but doesn’t directly let folk get to peek at the content. (See also: From Communications Data to #midata – with a Mobile Phone Data Example. In passing, I also note that the cavalier attitude of previous governments to passing hasty, ill-thought out legislation in the communications data area at least is hitting home. It seems that the Data Retention and Investigatory Powers Act (DRIPA) 2014 is “inconsistent with European Union law”. Oops…)

Higgerson also writes:

Politicians aren’t stupid. Neither are civil servants. They do, however, often tend to be out of touch. The argument that ‘open data’ makes a government transparent is utter bobbins. Open data helps people to see the data being used to inform decisions, and see the impact of previous decisions. It does not give context, reasons or motives. Open data, in this context, is being used as a way of watering down the public’s right to know.

+1 on that… Transparency comes more from the ability to keep tabs on the decision making process, not just the data. (Some related discussion on that point here: A Quick Look Around the Open Data Landscape.)