Kiteflying Around Containers – A Better Alternative to Course VMs?

Eighteen months or so ago, I started looking at ways in which we might use a virtual machine to bundle up a variety of interoperating software applications for a distance education course on databases and data management. (This VM would run IPython notebooks as the programming surface, with PostgreSQL and MongoDB as the databases. I was also keen that OpenRefine should be made available, and as everything in the VM was being accessed via a browser, I added a browser based terminal app (tty.js) to the mix as well.) The approach I started to follow was to use vagrant as a provisioner and VM manager, and puppet scripts to build the various applications. One reason for this approach is that the OU is an industrial scale educator, and (to my mind) it made sense to explore a model that would support the factory line production model we have: one that would scale vertically, maintaining VMs for a course that runs over several years, as well as horizontally, across other courses with other software application requirements. You can see how my thinking evolved across the following posts: posts tagged “VM” on OUseful.info.

Since then, a lot has changed. IPython notebooks have forked into the Jupyter notebook server and IPython, and Jupyter has added a browser based terminal app to the base offerings of the notebook server. (It’s not as flexible as tty.js, which allowed for multiple terminals in the same browser window, but I guess there’s nothing to stop you loading multiple terminals into separate browser tabs.) docker has also become a thing…

To recap on some of my thinking about how we might provide software to students, I was preoccupied at various times with the following (not necessarily exhaustive) list of considerations:

  • how could we manage the installation and configuration of different software applications on students’ self-managed, remote computers, running arbitrary versions of arbitrary operating systems on arbitrarily specced machines over networks with unknown and perhaps low bandwidth internet connections;
  • how could we make sure those applications interoperated correctly on the students’ own machines;
  • how could we make sure the students retained access to local copies of all the files they had created as part of their studies, and that those local copies would be the ones they actually worked on in the provided software applications; (so for example, IPython notebook files, and perhaps even database data directories);
  • how could we manage the build of each application in the OU production context, with OU course teams requiring access to a possibly evolving version of the machine 18 months in advance of student first use date and an anticipated ‘gold master’ freeze date on elements of the software build ~9 months prior to students’ first use;
  • how could we manage the maintenance of VMs within a single presentation of a 9 month long course and across several presentations of the course spanning 1 presentation a year over a 5 year period;
  • how could the process support the build and configuration of the same software application for several courses (for example, an OU-standard PostgreSQL build);
  • how could the same process/workflow support the development, packaging, release to students, and maintenance of other software applications for other courses;
  • could the same process be used to manage the deployment of application sets to students on a cloud served basis, either through a managed OU cloud, or on a self-served basis, perhaps using an arbitrary cloud service provider.

All this bearing in mind that I know nothing about managing software packaging, maintenance and deployment in any sort of environment, let alone a production one…;-) And all this bearing in mind that I don’t think anybody else really cares about any of the above…;-)

Having spent a few weeks away from the VM, I’m now thinking that we would be better served by using a more piecemeal approach based around docker containers. These still require the use of something like Virtualbox, but rather than using vagrant to provision the necessary environment, we could use more of an appstore approach to starting and stopping services. So for example, today I had a quick play with Kitematic, a recent docker acquisition, and an app that doesn’t run on Windows yet, but for which Windows support is slated for June 2015 in the Kitematic roadmap on github.

So what’s involved? Install Kitematic (if Virtualbox isn’t already installed, I think it’ll grab it down for you?) and fire it up…

Kitematic_1

It starts up a dockerised virtual machine into which you can install various containers. Next up, you’re presented with an “app dashboard”, as well as the ability to search dockerhub for additional “apps”:

Kitematic_2

Find a container you want, and select it – this will download the required components and fire up the container.

Kitematic_3

The port tells you where you can find any services exposed by the container. In this case, for scipyserver, it’s an IPython notebook (HTML app) running on top of a scipy stack.

Kitematic_4

By default the service runs over https with a default password; we can go into the Settings for the container, reset the Jupyter server password, force it to use http rather than https, and save to force the container to use the new settings:

Kitematic_5

So for example…

kitematic_ipynb

In the Kitematic container homepage, if I click on the notebooks folder icon in the Edit Files panel, I can share the notebook folder across to my host machine:

scipyserver_share

I can also choose what directory on host to use as the shared folder:

Kitematic_7

I can also discover and fire up some other containers – a PostgreSQL database, for example, as well as a MongoDB database server:

Kitematic_6

From within my notebook, I can install additional packages and libraries and then connect to the databases. So for example, I can connect to the PostgreSQL database:

kitematic_ipynb_postgres
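For what it’s worth, here’s a minimal sketch of the sort of connection code involved, assuming the psycopg2 package has been installed into the notebook container and that the host address, mapped port and credentials are the ones reported by Kitematic for the PostgreSQL container – the values below are illustrative placeholders, not settings from the actual containers:

# Hedged sketch: connect to a Kitematic-managed PostgreSQL container from the notebook.
# The host IP, port and credentials are placeholders - read the real values off the
# container's home page in Kitematic.
import psycopg2

conn = psycopg2.connect(host="192.168.99.100",   # hypothetical docker VM address
                        port=5432,               # hypothetical mapped port
                        user="postgres",
                        password="postgres",     # hypothetical default credentials
                        dbname="postgres")
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone())
conn.close()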

or to mongo:

kitematic_ipynb_mongodb
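And similarly for MongoDB – a minimal sketch assuming pymongo is installed in the notebook container, with the address and mapped port again being placeholders to be read off the Kitematic container page:

# Hedged sketch: connect to a Kitematic-managed MongoDB container from the notebook.
# The address and port are placeholders - use the values Kitematic reports.
from pymongo import MongoClient

client = MongoClient("192.168.99.100", 27017)    # hypothetical address and mapped port
db = client["test"]                              # hypothetical database name
db.quickcheck.insert_one({"msg": "hello from the notebook"})
print(list(db.quickcheck.find()))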

Looking at the container Edit Files settings, it looks like I may also be able to share across the database datafiles – though I’m not sure how this would work if I had a default database configuration to begin with? (Working out how to pre-configure and then share database contents from containerised DBMSs is something that’s puzzled me for a while and that I still haven’t got my head round.)

So – how does this fit into the OU model (that doesn’t really exist yet?) for using VMs to make interoperating software collections available to students on their own machines?

First up, no Windows support at the moment, though that looks like it’s coming; secondly, the ability to mount shares with the host seems to work, though I haven’t tested what happens if you shut down and start up containers, or delete a scipyserver container and then fire up a clean replacement, for example. Nor do I know (yet?!) how to manage shares and pre-seeding for the database containers. One original argument for the VM was that interoperability between the various software applications could be hardwired and tested. Kitematic doesn’t support fig/Docker Compose (yet?), but it’s not too hard to look up the addresses and paste them into a notebook. I think it does mean we can’t provide hard coded notebooks with ‘guaranteed to work’ configurations (i.e. ones prewritten with service addresses and port numbers) baked in, but it’s not too hard to do this manually. In the docker container Dockerfiles, I’m not sure if we could fix the port number mappings to initial default values?

One thing we’d originally envisioned for the VM was shipping it on a USB stick. It would be handy to be able to point Kitematic to a local dockerhub – for example, a set of prebuilt containers on a USB stick with the necessary JSON metadata file announcing what containers were available there – so that containers could be installed from the USB stick. (Kitematic currently grabs the container elements down from dockerhub and pops the layers into the VM (I assume?), so it could do the same to grab them from the USB stick?) In the longer term, I could imagine an OU branded version of Kitematic that allows containers to be installed from a USB stick or pulled down from an OU hosted dockerhub.

But then again, I also imagined an OU USB study stick and an OU desktop software updater 9 years or so ago and they never went anywhere either..;-)

Whither the Library?

As I scanned my feeds this morning, a table in a blog post (Thoughts on KOS (Part 3): Trends in knowledge organization) summarising the results from a survey reported in a paywalled academic journal article – Saumure, Kristie, and Ali Shiri. “Knowledge organization trends in library and information studies: a preliminary comparison of the pre- and post-web eras.” Journal of Information Science 34.5 (2008): 651-666 [pay content] – really wound me up:

lib_trends

My immediate reaction to this was: so why isn’t cataloguing about metadata? (Or indexing, for that matter?)

In passing, I note that the actual paper presented the results in a couple of totally rubbish (technical term;-) pie charts:

wtf_infoProfessionalsmyRS

More recently (that was a report from 2008 on a lit review going back before then), JISC have just announced a job ad for a role as Head of scholarly and library futures, to “provide leadership on medium and long-term trends in the digital scholarly communication process, and the digital library”. (They didn’t call… You going for it, Owen?!;-)

The brief includes “[k]eep[ing] a close watch on developments in the library and research support communities, and practices in digital scholarship, and also in digital technology, data, on-line resources and behavioural analytics” and providing:

Oversight and responsibility for practical projects and experimentation in that context in areas such as, but not limited to:

  • Digital scholarly communication and publishing
  • Digital preservation
  • Management of research data
  • Resource discovery infrastructure
  • Citation indices and other measures of impact
  • Digital library systems and services
  • Standards, protocols and techniques that allow on-line services to interface securely

So the provision of library services at a technical level, then (which presumably also covers things like intellectual property rights and tendering – making sure the libraries don’t give their data and their organisation’s copyrights to the commercial publishers – but perhaps not providing a home for policy and information-ethics considerations such as algorithmic accountability?), rather than identifying and meeting the information skills needs of upcoming generations (sensemaking, data management and all the other day to day chores that benefit from being a skilled worker with information).

It would be interesting to know what a new appointee to the role would make of the recently announced Hague Declaration on Knowledge Discovery in the Digital Age (possibly in terms of a wider “publishing data” complement to “management of research data”), which provides a call for opening up digitally represented content to the content miners.

I’d need to read it more carefully, but at the very briefest of first glances, it appears to call for some sort of de facto open licensing when it comes to making content available to machines for processing by machines:

Generally, licences and contract terms that regulate and restrict how individuals may analyse and use facts, data and ideas are unacceptable and inhibit innovation and the creation of new knowledge and, therefore, should not be adopted. Similarly, it is unacceptable that technical measures in digital rights management systems should inhibit the lawful right to perform content mining.

The declaration also seems to be quite dismissive of database rights. A well-put-together database makes it easier – or harder – to ask particular sorts of question, and to a certain extent reflects the amount of creative effort involved in determining a database schema, leaving aside the physical effort involved in compiling, cleaning and normalising the data that secures the database right.

Also, if I was Google, I think I’d be loving this… As ever, the promise of open is one thing, the reality may be different, as those who are geared up to work at scale, and concentrate power further, inevitably do so…

By the by, the declaration also got me thinking: who do I go to in the library to help me get content out of APIs so that I can start analysing it? That is, who do I go to for help with “resource discovery infrastructure” and, perhaps more importantly in this context, “resource retrieval infrastructure”? The library developer (i.e. someone with programming skills who works with librarians;-)?

And that aside from the question I keep asking myself: who do I go to to ask for help in storing data, managing data, cleaning data, visualising data, making sense of data, putting data into a state where I can even start to make sense of it, etc etc… (Given those pie charts, I probably wouldn’t trust the library!;-) Though I keep thinking: that should be the place I’d go.)

The JISC Library Futures role appears silent on this (but then, JISC exists to make money from selling services and consultancy to institutions, right, not necessarily to help or represent the end academic or student user?).

But that’s a shame; because as things like the Stanford Center for Interdisciplinary Digital Research (CIDR) show, libraries can act as a hub and go-to place for sharing – and developing – digital skills, which increasingly includes digital skills that extend out of the scientific and engineering disciplines, out of the social sciences, and into the (digital) humanities.

When I started going into academic libraries, the librarian was the guardian of “the databases” and the CD-ROMs. Slowly access to these information resources opened up to the end user – though librarian support was still available. Now I’m as likely to need help with textmining and making calendar maps: so which bit of the library do I go to?

Problems of Data Quality

One of the advantages of working with sports data, you might have thought, is that official sports results are typically good quality data. With a recent redesign of the Formula One website, the official online (web) source of results is now the FIA website.

As well as publishing timing and classification (results) data in a PDF format intended, presumably, for consumption by the press, the FIA also publish “official” results via a web page.

But as I discovered today, using data from a scraper that scrapes results from the “official” web page rather than the official PDF documents is no guarantee that the “official” web page results bear any resemblance at all to the actual result.

formula_one_spanish_grand_prix_2015_q_off_class_pdf__page_2_of_2__and_Session_Classifications___Federation_Internationale_de_l_Automobile

Yet another sign that the whole F1 circus is exactly that – an enterprise promoted by clowns…

Routine Sources, Court Reporting, the Data Beat and Metadata Journalism

In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:

Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while later [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] established that local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], p114-15).

As well as human sources, news gatherers may also look to data sources as part of a regular beat, either at a local level, such as local council transparency (that is, spending) data, or national data sources with a local scope. For example, the NHS publish accident and emergency statistics at the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.

To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources, enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
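By way of illustration of what I mean by a textualisation, here’s a toy sketch that turns a set of “latest figures” into a templated sentence; the area, values and field names are all made up for the example rather than taken from an actual nomis release:

# Toy sketch of a data driven "latest figures" sentence; the numbers and
# field names are illustrative placeholders, not real statistics.
latest = {"area": "Milton Keynes", "month": "April 2015",
          "claimants": 2345, "change_on_month": -120}

direction = "fell" if latest["change_on_month"] < 0 else "rose"
print("The claimant count in {area} {direction} by {delta} in {month}, to {claimants}.".format(
    area=latest["area"], direction=direction,
    delta=abs(latest["change_on_month"]),
    month=latest["month"], claimants=latest["claimants"]))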

Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
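As a sketch of the numerical case, the following fragment shows the sort of thing I have in mind, using pandas over a hypothetical national dataset with one row per local area; the filename, column names and local patch are assumptions for the purposes of illustration:

# Hedged sketch: flag whether a local area sits in the national top/bottom N
# on some indicator. The file name and column names are hypothetical.
import pandas as pd

df = pd.read_csv("national_indicator.csv")          # hypothetical columns: area, indicator
ranked = df.sort_values("indicator", ascending=False).reset_index(drop=True)

N = 10
local_area = "Milton Keynes"                         # hypothetical local news patch
position = int(ranked.index[ranked["area"] == local_area][0]) + 1

if position <= N:
    print("{} is in the national top {} (rank {}) - possible story?".format(local_area, N, position))
elif position > len(ranked) - N:
    print("{} is in the national bottom {} (rank {}) - possible story?".format(local_area, N, position))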

In the context of this post, I will be considering the role that metadata about court cases contained within court lists and court registers might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which the metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, as well as to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view over the activity of the courts than the impression one might get of their behaviour simply from the balance of coverage provided by the media.


A Google Spreadsheets View Over DWP Tabulation Tool 1-Click Time Series Data

Whilst preparing for an open data training session for New Economy in Manchester earlier this week, I was introduced to the DWP tabulation tool, which provides a quick way of analysing various benefits- and allowances-related datasets, including bereavement benefits, incapacity benefit, and employment and support allowance.

The tool supports the construction of various data views as well as providing 1-click link views over “canned” datasets for each category of data.

DWP_Tabulation_Tool

The data is made available in the form of an HTML data table via a static URL (example):

Bereavement_Benefits__Bereavement_Benefit_and_Widows_Benefit_combined__--_On_Flows__thousands____Time_Series_by_Gender_of_claimant

To simplify working with the data, we can import the data table directly into Google Spreadsheets using the importHTML() formula, which allows you to specify a URL and then import a specified HTML data table from that page. In the following example, the first table from a results page – which contains the description of the table – is imported into cell A1, and the actual data table (table 2) is imported via an importHTML() formula specified in cell A2.

test1_-_Google_Sheets

Note that the first data row does not appear to import cleanly – inspection of the original HTML table shows why: the presence of what is presumably a split cell that declares the name of the timeseries index column along with the first time index value.

To simplify the import of these data tables into a Google Spreadsheet, we can make use of a small script to add an additional custom menu into Google spreadsheets that will import a particular dataset.

test1_-_Google_Sheets_addscript

The following script shows one way of starting to construct such a set of menus:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('DWP Tabs')
      .addSubMenu(ui.createMenu('Bereavement Benefits')
          .addItem('1-click BW/BB timeseries', 'mi_bb_b')
          .addItem('1-click Region timeseries', 'mi_bb_r')
          .addItem('1-click Gender timeseries', 'mi_bb_g')
          .addItem('1-click Age timeseries', 'mi_bb_a')
      )
      .addSubMenu(ui.createMenu('Incapacity Benefit/Disablement')
          .addItem('1-click Region timeseries', 'mi_ic_r')
      )
      .addToUi();
}

function menuActionImportTable(url){
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];

  var cell = sheet.getRange("A1");
  cell.setFormula('=importhtml("'+url+'","table",1)');
  cell = sheet.getRange("A2");
  cell.setFormula('=importhtml("'+url+'","table",2)');
}

//--Incapacity Benefit/Disablement
function mi_ic_r() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/ibsda/cdquarter/ccgor/a_carate_r_cdquarter_c_ccgor.html';
  menuActionImportTable(url)
}


//-- Bereavement Benefits
function mi_bb_r() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccgor/a_carate_r_cdquarter_c_ccgor.html';
  menuActionImportTable(url)
}

function mi_bb_g() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccsex/a_carate_r_cdquarter_c_ccsex.html';
  menuActionImportTable(url)
}

function mi_bb_a() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/cnage/a_carate_r_cdquarter_c_cnage.html';
  menuActionImportTable(url)
}

function mi_bb_b() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccbbtype/a_carate_r_cdquarter_c_ccbbtype.html';
  menuActionImportTable(url)
}

Copying the above script into the script editor associated with a spreadsheet, and then reloading the spreadsheet (permissions may need to be granted to the script the first time it is run), provides a custom menu that allows the direct import of a particular dataset:

test1_-_Google_Sheets_menu

Duplicating the spreadsheet carries the script along with it (I think), and the copy can presumably also be shared… (It’s been some time since I played with Apps Script – I’m not sure how permissioning works or how easy it is to convert scripts to add-ons, though I note from the documentation that top-level app menus aren’t supported by add-ons.)

Authoring Dynamic Documents in IPython / Jupyter Notebooks?

One of the reasons I started writing the Wrangling F1 Data With R book was to see how it felt writing combined text, code and code output materials in the RStudio/RMarkdown context. For those of you that haven’t tried it, RMarkdown lets you insert executable code elements inside a markdown document, either as code blocks or inline. The knitr library can then execute the code and display the code output (which includes tables and charts), and pandoc transforms the output to a desired output document format (such as HTML or PDF, for example). And all this at the click of a single button.

In IPython (now Jupyter) notebooks, I think we can start to achieve a similar effect using a combination of extensions. For example:

  • python-markdown allows you to embed (and execute) python code inline within a markdown cell by enclosing it in double braces (for example, I could say “{{ print(‘hello world’) }}”);
  • hide_input_all is an extension that will hide code cells in a document and just display their executed output; it would be easy enough to tweak this extension to allow a user to select which cells to show and hide, capturing that cell information as cell metadata (see the sketch after this list);
  • Readonly allows you to “lock” a cell so that it cannot be edited; using a notebook server that implements this extension means you can start to protect against accidental changes being made to a cell by mistake within a particular workflow; in a journalistic context, assigning a quote to a python variable, locking that code cell, and then referencing that quote/variable via python-markdown might be one way of working, for example;
  • Printview-button will call nbconvert to generate an HTML version of the current notebook – however, I suspect this does not respect the extension based customisations that operate on cell metadata. To do that, I guess we need to generate our output via an nbconvert custom template? (The Download As… notebook option doesn’t seem to save the current HTML view of a notebook either?)
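As a sketch of the cell metadata idea mentioned in the list above, the following uses the nbformat library to flag every code cell in a notebook; the hide_input key is an assumption about what a tweaked hide-input style extension might look for, not a documented contract, and the filename is made up:

# Hedged sketch: mark code cells via cell metadata using nbformat.
# "hide_input" is a hypothetical/assumed metadata key for a hide-input style extension.
import nbformat

nb = nbformat.read("report.ipynb", as_version=4)     # hypothetical notebook filename

for cell in nb.cells:
    if cell.cell_type == "code":
        cell.metadata["hide_input"] = True            # flag the cell to be hidden on render

nbformat.write(nb, "report.ipynb")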

So – my reading is: tools are there to support the editing side (inline code, marking cells to be hidden, etc.) of dynamic document generation, but not necessarily the rendering-to-hard-copy side, which needs to be done via nbconvert extensions?

Related: Seven Ways of Running IPython Notebooks

Keeping Track of an Evolving “Top N” Cutoff Threshold Value

In a previous post (Charts are for Reading), I noted how it was difficult to keep track of which times in an F1 qualifying session had made the cutoff time as a qualifying session evolved. The problem can be stated as follows: in the first session, with 20 drivers competing, the 15 drivers with the best ranked laptime will make it into the next session. Each driver can complete zero or more timed laps, with drivers completing laps in any order.

Finding the 15 drivers who will make the cutoff is therefore not simply a matter of ranking the best 15 laptimes at any point, because the same 5 drivers, say, may each record 3 fast laptimes, thus taking up the 15 slots that record the 15 fastest laptimes.

If we define a discrete time series with steps corresponding to each recorded laptime (from any driver), then at each time step we can find the best 15 drivers by finding each driver’s best laptime to date and ranking by those times. Conceptually, we need something like a lap chart which uses a ‘timed lap count’ rather than race lap index to keep track of the top 15 cars at any point.

example_fia_lapchart

At each index step, we can then find the laptime of the 15th ranked car to find the current session cut-off time.

In a dataframe that records laptimes in a session by driver code for each driver, along with a column that contains the current purple laptime, we can arrange the laptimes by cumulative session laptime (so the order of rows follows the order in which laptimes are recorded) and then iterate through those rows one at a time. At each step, we can summarise the best laptime recorded so far in the session for each driver.

#Requires the plyr library for arrange() and ddply()
library(plyr)

#cutoffvals is assumed to be defined elsewhere as a named vector giving the
#cut-off count for each session, e.g. cutoffvals=c(Q1=15,Q2=10)

df=arrange(df,cuml)
dfc=data.frame()
for (r in 1:nrow(df)) {
  #summarise best laptime recorded so far in the session for each driver
  dfcc=ddply(df[1:r,],.(qsession,code),summarise,dbest=min(stime))
  #Keep track of which session we are in
  session=df[r,]$qsession
  #Rank the best laptimes for each driver to date in the current session
  #(Really should filter by session at the start of this loop?)
  dfcc=arrange(dfcc[dfcc['qsession']==session,],dbest)
  #The different sessions have different cutoffs: Q1, top 15; Q2, top 10
  n=cutoffvals[df[r,]$qsession]
  #if we have at least as many driver best times recorded as the cutoff number
  if (nrow(dfcc) >=n){
    #Grab a record of the current cut-off time
    #along with info about each recorded laptime
    dfc=rbind(dfc,data.frame(df[r,]['qsession'],df[r,]['code'],df[r,]['cuml'],dfcc[n,]['dbest']) )
  }
}

We can then plot the evolution of the cut-off time as the sessions proceed. The chart in its current form is still a bit hard to parse, but it’s a start…

qualicutoff

In the above sketch, the lines connect the current purple time and the current cut-off time in each session (aside from the horizontal line, which represents the cut-off time at the end of the session). This gives a false impression of the evolution of the cut-off time – really, the line should be a stepped line that traces the current cut-off time horizontally until it is improved, at which point it should step vertically down. (In actual fact, the line does track horizontally as laptimes are recorded that do not change the cut-off time, as indicated by the horizontal tracks in the Q1 panel as the grey times (laptimes slower than the driver’s best time in the session so far) are completed.)

The driver labels are coloured according to: purple – current best in session time; green – driver best in session time to date (that wasn’t also purple); red – driver’s best time in session that was outside the final cut-off time. This colouring conflates two approaches to representing information – the purple/green colours represent online algorithmic processing (if we constructed the chart in real time from laptime data as laps were completed, that’s how we’d colour the points), whereas the red colouring represents the results of offline algorithmic processing (the colour can only be calculated at the end of the session, when we know the final session cut-off time). I think these mixed semantics contribute to making the chart difficult to read…

In terms of what sort of stories we might be able to pull from the chart, we see that in Q2, Hulkenberg and Sainz were only fractions of a second apart, and Perez looked like he had squeezed into the final top 10 slot until Sainz pushed him out. To make it easier to see which times contribute to the top 10 times, we could use font weight (e.g. bold font) to highlight each driver’s session best laptimes.

To make the chart easier to read, I think each improvement to the cut-off time should be represented by a faint horizontal line, with a slightly darker line tracing the evolution of the cut-off time as a stepped line. This would allow us to see which times were within the cut-off time at any point.

I also wonder whether it might be interesting to generate a table a bit like the race lap chart, using session timed lap count rather than race lap count, perhaps with additional colour fields to show which car recorded the time that increased the lap count index, and perhaps also where in the order that time fell if it didn’t change the order in the session so far. We could also generate online and offline differences between each laptime in the session and the current cutoff time (online algorithm) as well as the final overall session cutoff time (offline algorithm).

[As and when I get this chart sorted, it will appear in an update to the Wrangling F1 Data With R lean book.]