Getting Your Own Space on the Web…

To a certain extent, we can all be web publishers now: social media lets us share words, pictures and videos, online office suites allow us to publish documents and spreadsheets, code repositories allow us to share code, sites like Shinyapps.io allow us to publish specific sorts of applications, and so on.

So where do initiatives like a domain of one’s own come in – schemes that provide members of a university (originally), staff and students alike, with a web domain and web hosting of their own?

One answer is that they provide a place on the web for you to call your own. With a domain name registered (and nothing else – no server requirements, no applications to install) you can set up an email address that you and you alone own and use it to forward mail sent to that address to any other address. You can also use your domain as a forwarding address or alias for other locations on the web. My ouseful.info domain forwards traffic to a blog hosted on wordpress.com (I pay WordPress for the privilege of linking to my site there with my domain address); another domain I registered – f1datajunkie.com – acts as an alias to a site hosted on blogger.com.

The problem with using my domains like this is that I can only forward traffic to sites that other people operate – and what I can do on those sites is limited by those providers. WordPress is a powerful web publishing platform, but WordPress.com only offers a locked-down experience with no allowance for customising the site using your own plugins. If I paid for my own hosting and ran my own WordPress server, the site could be a lot richer. But then in turn I would have to administer the site myself, running updates and being responsible – ultimately – for the security and resource provisioning of the site.

For many people, taking the step towards hosting your own site is a big one (I’m too lazy to host my own sites, for example…). But initiatives like Reclaim Hosting, and more recently OU Create (from the University of Oklahoma, not the UK OU) – originally inspired by a desire to give students a personal playspace in which to explore their own digital creativity, and a home on the web in which they could run their own applications – have eased the pain for many: the host could also be trusted, was helpful, and was affordable.

The Oklahoma create space allows students to register a subdomain (e.g. myname.oucreate.com) or custom domain (e.g. ouseful.info) and associate it with what is presumably university-hosted server space into which they can install their own applications.

So it seems to me we can tease apart two things:

  • firstly, the ability to own a bit of the web’s “namespace” by registering your own domain (ouseful.info, for example);
  • secondly, the ability to own a bit of the web’s “functionality space”: running your own applications that other people can connect to and make use of; this might mean running your own blogging platform, possibly customised using your own, or third-party, extensions, or it might mean running one or more custom applications you have developed yourself.

But what if you don’t want the responsibility of running, and maintaining, your own applications day in, day out? What if you only want to share an application to the web for a short period of time? What if you want to be able to “show and tell” an application for a particular class, and then put it back on the shelf, available to use again but not always running? Or what if you want to access an application that might be difficult to install, or isn’t available for your computer? Or what if you’re running a netbook or tablet, and the application you want isn’t available as an app, just as a piece of “traditionally installed software”?

I’ve started to think that docker style containers may offer a way of doing this. I’ve previously posted a couple of examples of how to run RStudio or OpenRefine via docker containers using a cloud host. How much nicer it would be if I could run such containers on a (sub)domain of my own running via a university host…

Which is to say – I don’t necessarily want a full hosting solution on a domain of my own, at least not to start with, but I do want to be able to add my own bits of functionality to the web, at least for short periods of time. That is, what I’d quite like is a convenient place to “publish” (in the sense of “run”) my own containerised apps, and then rip them down. And then, perhaps at a later date, take them away and run them on my own fully hosted domain.

Authoring Multiple Docs from a Single IPython Notebook

It’s my not-OU day today, and whilst I should really be sacrificing it to work on some content for a FutureLearn course, I thought instead I’d tinker with a workflow tool related to the production process we’re using.

The course will be presented as a set of HTML docs on FutureLearn, supported by a set of IPython notebooks that learners will download and execute themselves.

The handover resources will be something like:

– a set of IPython notebooks;
– a Word document for each week containing the content to appear online (this document will be used as the basis for multiple pages on the course website; the content is entered into the FutureLearn system by someone else as markdown, though I’m not sure what flavour);
– for each video asset, a Word document containing the script;
– possibly separate image files (the images will also be in the Word doc).

Separate webpages provide teaching that leads into a linked-to IPython notebook. (Learners will be running IPython via Anaconda on their own desktops – which means tablet/netbook users won’t be able to do the interactive activities as currently delivered; we looked at using Wakari, but didn’t go with it; offering our own hosted solution or tmpnb server was considered out of scope.)

The way I have authored my week is to create a single IPython notebook that proceeds in a linear fashion, with “FutureLearn webpage” content authored as markdown, as well as incorporating executed code cells, followed by “IPython notebook” activity content relating to the previous “webpage”. The “IPython notebook” sections are preceded by a markdown cell containing a START NOTEBOOK marker, and closed with a markdown cell containing an END NOTEBOOK marker.

I then run a simple script that:

  • generates one IPython notebook per “IPython notebook” section;
  • creates a monolithic notebook containing all of, and only, the “FutureLearn webpage” content;
  • generates a markdown version of that monolithic notebook;
  • uses pandoc to convert the monolithic markdown doc to a Microsoft Word/docx file.

[Image: FutureLearn/IPython notebook authoring workflow]

Note that it would be easy enough to render each “FutureLearn webpage” doc as markdown directly from the original notebook source, into its own file that could presumably be added directly to FutureLearn, but that was seen as being overly complex compared to the original “copy rendered markdown from notebook into Word and then somehow generate markdown to put into FutureLearn editor” route.

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4

#Are we in a notebook segment?
innb=False

#Quick and dirty count of notebooks
c=1

#The monolithic notebook holds everything except the separate notebook content
monolith=nb4.new_notebook()

#Load the original doc in
mynb=nb.read('ORIGINAL.ipynb',nb.NO_CONVERT)

#For each cell in the original doc:
for i in mynb['cells']:
    if (i['cell_type']=='markdown'):
        #See if we can spot a standalone notebook delimiter
        if ('START NOTEBOOK' in i['source']):
            #At the start of a block, create a new notebook
            innb=True
            test=nb4.new_notebook()
        elif ('END NOTEBOOK' in i['source']):
            #At the end of the block, save the code to a new standalone notebook file
            innb=False
            nb.write(test,'test{}.ipynb'.format(c))
            c=c+1
        elif (innb):
            test.cells.append(nb4.new_markdown_cell(i['source']))
        else:
            monolith.cells.append(nb4.new_markdown_cell(i['source']))
    elif (i['cell_type']=='code'):
        #For the code cells, preserve any output text
        cc=nb4.new_code_cell(i['source'])
        for o in i['outputs']:
            cc['outputs'].append(o)
        #Route the code cell as required...
        if (innb):
            test.cells.append(cc)
        else:
            monolith.cells.append(cc)

#Save the monolithic notebook
nb.write(monolith,'monolith.ipynb')

#Convert it to markdown
!ipython nbconvert --to markdown monolith.ipynb

##On a Mac, I got pandoc via:
#brew install pandoc

#Generate a Microsoft .docx file from the markdown
!pandoc -o monolith.docx -f markdown -t docx monolith.md

What this means is that I can author a multiple-chapter, multiple-notebook minicourse within a single IPython notebook, then segment it into a set of standalone files in a variety of document formats.

Of course, what I really should have been doing was working on the course material… but then again, it was supposed to be my not-OU day today…;-)

Data Driven Press Releases From HSCIC Data – Diabetes Prescribing

By chance, I saw a tweet from the HSCIC yesterday announcing ‘Prescribing for Diabetes, England – 2005/06 to 2014/15’ http://bit.ly/1J3h0g8 #hscicstats.

The data comes via a couple of spreadsheets, broken down at the CCG level.

As an experiment, I thought I’d see how quickly I could come up with a story form and template for generating a “data driven press release” that localises the data, and presents it in a textual form, for a particular CCG.

It took a couple of hours, and at the moment my recipe is hard-coded to the Isle of Wight, but it should be easily generalisable to other CCGs. (The blocker at the moment is identifying regional codes from CCG codes – the spreadsheets in the release don’t provide that linkage, so another source for that data is required.)

Anyway, here’s what I came up with:

[Image: Sketching a handcrafted data2text report for diabetes prescribing]

Figures recently published by the HSCIC show that for the reporting period Financial 2014/2015, the total Net Ingredient Costs (NIC) for prescribed diabetes drugs was £2,450,628.59, representing 9.90% of overall Net Ingredient Costs. The ISLE OF WIGHT CCG prescribed 136,169 diabetes drugs items, representing 4.17% of all prescribed items. The average net ingredient cost (NIC) was £18.00 per item. This compares to 4.02% of items (9.85% of total NIC) in the Wessex (Q70) region and 4.45% of items (9.98% of total NIC) in England.

Of the total diabetes drugs prescribed, Insulins accounted for 21,170 items at a total NIC of £1,013,676.82 (£47.88 per item (on average), 0.65% of overall prescriptions, 4.10% of total NIC) and Antidiabetic Drugs accounted for 93,660 items at a total NIC of £825,682.54 (£8.82 per item (on average), 2.87% of overall prescriptions, 3.34% of total NIC).

For the NHS ISLE OF WIGHT CCG, the NIC in 2014/15 per patient on the QOF diabetes register in 2013/14 was £321.53. The QOF prevalence of diabetes, aged 17+, for the NHS ISLE OF WIGHT CCG in 2013/14 was 6.43%. This compares to a prevalence rate of 6.20% in Wessex and 5.70% across England.

All the text generator requires me to do is pass in the name of the CCG and the area code, and it does the rest. You can find the notebook that contains the code here: diabetes prescribing textualiser.
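By way of illustration, the core of such a textualiser is just templated sentence generation. A minimal, hypothetical sketch (the function name, field names and the totals used in the example call below are illustrative only, not lifted from the actual notebook or the HSCIC spreadsheets) might look something like:

#A minimal, hypothetical sketch of the sort of templating such a textualiser relies on;
#field names and example totals are illustrative only.
def diabetes_items_summary(ccg_name, items, all_items, nic, all_nic):
    #Generate a sentence summarising diabetes prescribing items and costs for a CCG
    return ("The {ccg} prescribed {items:,} diabetes drugs items, "
            "representing {item_pc:.2f}% of all prescribed items, at a total "
            "Net Ingredient Cost (NIC) of £{nic:,.2f} "
            "({nic_pc:.2f}% of overall NIC).").format(
                ccg=ccg_name, items=items,
                item_pc=100.0*items/all_items,
                nic=nic, nic_pc=100.0*nic/all_nic)

#Example call with made-up totals, just to show the shape of the output
print(diabetes_items_summary('NHS ISLE OF WIGHT CCG', 136169, 3265000, 2450628.59, 24750000))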

Fragments – Scraping Tabular Data from PDFs

Over the weekend, we went to Snetterton to watch the BTCC touring cars and go-for-it Ginetta Juniors. Timing sheets from the event are available on the TSL website, so I thought I’d have a play with the data…

Each series has its own results booklet, a multi-page PDF document containing a range of timing sheets. Here’s an example of part of one of them:

[Image: ginettaJnrSnetterton2015.pdf, page 28 of 44]

It’s easy enough to use tools like Tabula (at version 1.0 as of August 2015) to extract the data from regular(ish) tables, but for more complex tables we’d need to do some additional cleaning.

For example, on a page like:

[Image: ginettaJnrSnetterton2015.pdf, page 35 of 44]

we get the data out simply by selecting the bits of the PDF we are interested in:

[Image: selecting a table area in Tabula]

and previewing (or exporting) it:

[Image: previewing/exporting the extracted data in Tabula]

Note that this would still require a bit of work to regularise it further, perhaps using something like OpenRefine.

When I scrape PDFs, I tend to use pdftohtml (from the poppler package, I think?) and then parse the resulting XML:

import os
fn='ginettaJnrSnetterton2015'

#Convert the PDF to an XML representation using poppler's pdftohtml
#(the output filename gets a .xml suffix added automatically)
cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "%s" "%s"' % (fn+'.pdf', fn)
# Can't turn off output? Throw it away...
cmd = cmd + " >/dev/null 2>&1"
os.system(cmd)

import lxml.etree

#Read the file as bytes so lxml can handle the XML encoding declaration itself
xmldata = open(fn+'.xml','rb').read()
root = lxml.etree.fromstring(xmldata)
pages = list(root)

We can then quickly preview the “raw” data we’re getting from the PDF:

def flatten(el):
    #Recursively gather the text content of an element and all its children
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)

def pageview(pages,page):
    #Print the left and top co-ordinates of each text element on a page, plus its text
    for el in pages[page]:
        #Skip elements (such as <fontspec>) that don't carry position attributes
        if 'left' not in el.attrib: continue
        print( el.attrib['left'], el.attrib['top'],flatten(el))
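For example, a quick usage sketch (page indices here are just zero-based indices into the pages list):

#Preview the positioned text elements pulled from the first page of the PDF
pageview(pages, 0)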

[Image: preview of the scraped text elements and their co-ordinates]

The scraped data includes top and left co-ordinates for each text element. We can count how many data elements are found at each x (left) co-ordinate and use that to help build our scraper.
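A quick sketch of that counting step, building on the pages list and element attributes above, might just tally the left attribute values:

from collections import Counter

def left_counts(pages, page):
    #Tally how many positioned text elements start at each left co-ordinate on a page
    return Counter(int(el.attrib['left'])
                   for el in pages[page] if 'left' in el.attrib)

#The most commonly occurring left values are good candidates for column starts
print(left_counts(pages, 0).most_common(10))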

By eye, we can spot natural breaks in the counts…:

[Image: counts of text elements at each left co-ordinate]

but can we also detect them automatically? The Jenks Natural Breaks algorithm [code] looks like it tries to do that…

[Image: breaks in the left co-ordinate counts identified by the Jenks natural breaks algorithm]

The centres identified by the Jenks natural breaks algorithm could then be used as part of a default hierarchy to assign a particular data element to a particular column. Crudely, we might use something like the following:

[Image: crude column assignment code]
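Not having that snippet to hand here, a minimal sketch of that sort of nearest-centre assignment (the column centre values below are purely illustrative – in practice they would come from the break detection step above) might look something like:

#Illustrative column centres (left co-ordinates); in practice, use the detected break centres
column_centres = [60, 160, 310, 450, 590]

def assign_column(left, centres=column_centres):
    #Assign a text element to the column whose centre is nearest its left co-ordinate
    return min(range(len(centres)), key=lambda i: abs(centres[i]-left))

def page_cells(pages, page):
    #Crudely bin each positioned text element as a (top, column, text) triple
    return [(int(el.attrib['top']),
             assign_column(int(el.attrib['left'])),
             flatten(el))
            for el in pages[page] if 'left' in el.attrib]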

Whilst it’s quite possible to hand-build scrapers that inspect each element scraped from the PDF document in turn, I notice that the Tabula extraction engine now has a command line interface, so it may be worth spending some time looking at that instead. (It would also be nice if the Tabula GUI could be used to export configuration info, so you could highlight areas of a PDF using the graphical tools and then generate the corresponding command line parameter values for reuse from the command line?)

PS Another handy PDF table extractor is published by Scraperwiki: pdftables.com, which is probably the way to go if you have the funds to pay for it…

PPS A handy summary from the Scraperwiki blog about the different sorts of table-containing documents you often come across as PDFs: The four kinds of data PDF (“large tables”, “pivotted tables”, “transactions”, “reports”).

PPPS This also looks relevant – an MSc thesis by Anssi Nurminen, from Tampere University of Technology, on Algorithmic Extraction of Data in Tables in PDF; also this report by Burcu Yildiz, Katharina Kaiser, and Silvia Miksch on pdf2table: A Method to Extract Table Information from PDF Files and an associated Masters thesis by Burcu Yildiz, Information Extraction – Utilizing Table Patterns.

Seven Graphical Interfaces to Docker

From playing with docker over the last few weeks, I think it’s worth pursuing as a technology for deploying educational software to online and distance education students, not least because it offers the possibility of using containers as app runners that can run an app on your own desktop, or via the cloud.

The command line is probably something of a blocker to users who expect GUI tools, such as a one-click graphical installer, or double click to start an app, so I had a quick scout round for graphical user interfaces in and around the docker ecosystem.

I chose the following apps because they are directed more at the end user – launching prebuilt apps, and putting together simple app compositions. There are some GUI tools aimed at devops folk to help with monitoring clusters and running containers, but that’s out of scope for me at the moment…

1. Kitematic

Kitematic is a desktop app (Mac and Windows) that makes it one-click easy to download images from docker hub and run associated containers within a local docker VM (currently running via boot2docker?).

I’ve blogged about Kitematic several times, but to briefly summarise: Kitematic allows you to launch and configure individual containers as well as providing easy access to a boot2docker command line (which can be used to run docker-compose scripts, for example). Simply locate an image on the public docker hub, download it and fire up an associated container.

[Image: Kitematic, and Pinboard bookmarks for psychemedia tagged ‘docker’]

Where a mount point is defined to allow sharing between the container and the host, you can simply select the desktop folder you want to mount into the container.

At the moment, Kitematic doesn’t seem to support docker-compose in a graphical way, or allow users to deploy containers to a remote host.

2. Panamax

panamax.io is a browser-rendered graphical environment for pulling together image compositions, although it currently needs to be started from the command line. Once the application is up and running, you can search for images or templates:

[Image: Panamax search]

Templates seem to correspond to fig/docker-compose-like assemblages, with panamax providing an environment for running pre-existing ones or putting together new ones. I think the panamax folk ran a competition some time ago to try to encourage people to submit public templates, but that doesn’t seem to have gained much traction.

[Image: Panamax template search results]

Panamax supports deployment locally or to a remote web host.

[Image: Panamax remote deployment targets]

When I first came across docker, I found panamax really exciting because of the way it provided support for linking containers. Now I just wish Kitematic would offer some graphical support for docker compose that would let me drag different images into a canvas, create a container placeholder each time I do, and then wire the containers together. Underneath, it’d just build a docker compose file.

The public projects listing is useful – it’d be great to see more sharing of generally useful docker-compose scripts and associated quick-start tutorials (e.g. WordPress Quickstart With Docker).
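By way of illustration, the sort of minimal docker-compose file such tools produce – a hedged sketch along the lines of the classic two-container WordPress quickstart, using the stock wordpress and mysql images and compose v1 style links – might look something like:

# Hedged sketch: image names, ports and environment variables as commonly documented at the time
wordpress:
  image: wordpress
  links:
    - db:mysql
  ports:
    - "8080:80"
db:
  image: mysql
  environment:
    MYSQL_ROOT_PASSWORD: example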

3. Lorry.io

lorry.io is a graphical tool for building docker compose files, but doesn’t have the drag, drop and wire together features I’d like to see.

Lorry.io is published by CenturyLink, who also publish panamax (lorry.io is the newer development, I think?).

[Image: Lorry.io Docker Compose YAML editor]

[Image: Lorry.io Docker Compose YAML editor (continued)]

Lorry.io lets you specify your own images or build files, search for images on dockerhub, and configure well-formed docker compose YAML scripts from auto-populated drop-down menu selections that are sensitive to the current state of the configuration.

4. docker ui

dockerui is a simple container app that provides an interface, via the browser, into a currently running docker VM. As such, it allows you to browse the installed images and the state of any containers.

[Image: Docker_Hub]

Kitematic offers a similar sort of functionality in a slightly friendlier way. See additional screenshots here.

5. tutum Cloud

I’ve blogged about tutum.co a couple of times before – it was the first service that I could actually use to get containers running in the cloud: all I had to do was create a Digital Ocean account, pop some credit onto it, then I could link directly to it from tutum and launch containers on Digital Ocean directly from the tutum online UI.

[Image: Tutum new service wizard]

I’d love to see some of the cloud deployment aspects of tutum make it into Kitematic…

See also things like clusteriup.io

6. docker Compose UI

The docker compose UI looks as if it provides a browser-based interface to manage deployed container compositions, akin to some of the dashboards provided by online hosts.

[Image: francescou/docker-compose-ui]

I couldn’t get it to work… I get the feeling it’s like dockerui but with better support for managing all the containers associated with a particular docker-compose file.

7. ImageLayers

Okay – I said I was going to avoid devops tools, but this is another example of a sort of thing that may be handy when trying to put a composition of several containers together because it might help identify layers that can be shared across different images.

imagelayers.io looks like it pokes through the Dockerfile of one or more images and shows you the layers that get built.

[Image: ImageLayers – a Docker image visualiser]

I’m not sure if you can point it at a docker-compose file and let it automatically pull out the layers from identified sources (images, or build sources)?

Running a Shell Script Once Only in Vagrant

Via somewhere (I’ve lost track of the link), here’s a handy recipe for running a shell script once and once only from a Vagrantfile.

In the shell script (runonce.sh):

#!/bin/bash

#Only run the payload if the marker file doesn't already exist
if [ ! -f ~/runonce ]
then

  #ONCE RUN CODE HERE

  #Leave a marker behind so the code isn't run again on subsequent provisions
  touch ~/runonce
fi

In the Vagrantfile:

  config.vm.provision :shell, :inline => <<-SH
    chmod ugo+x /vagrant/runonce.sh
    /vagrant/runonce.sh
  SH

Robot Journalism in Germany

By chance, I came across a short post by uber-ddj developer Lorenz Matzat (@lorz) on robot journalism over the weekend: Robot journalism: Revving the writing engines. Along with a mention of Narrative Science, it namechecked another company that was new to me: [b]ased in Berlin, Retresco offers a “text engine” that is now used by the German football portal “FussiFreunde”.

A quick scout around brought up this Retresco post on Publishing Automation: An opportunity for profitable online journalism [translated] and their robot journalism pitch, which includes “weekly automatic Game Previews to all amateur and professional football leagues and with the start of the new season for every Game and detailed follow-up reports with analyses and evaluations” [translated], as well as finance and weather reporting.

I asked Lorenz if he was dabbling with such things and he pointed me to AX Semantics (an Aexea GmbH project). It seems their robot football reporting product has been around for getting on for a year or so (Robot Journalism: Application areas and potential [translated]), which makes me wonder how siloed my reading has been in this area.

Anyway, it seems as if AX Semantics have big dreams. Like heralding Media 4.0: The Future of News Produced by Man and Machine:

The starting point for Media 4.0 is a whole host of data sources. They share structured information such as weather data, sports results, stock prices and trading figures. AX Semantics then sorts this data and filters it. The automated systems inside the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion. By pooling pertinent information, the system automatically pulls together an article. Editors tell the system which layout and text design to use so that the length and structure of the final output matches the required media format – with the right headers, subheaders, the right number and length of paragraphs, etc. Re-enter homo sapiens: journalists carefully craft the information into linguistically appropriate wording and liven things up with their own sugar and spice. Using these methods, the AX Semantics system is currently able to produce texts in 11 languages. The finishing touches are added by the final editor, if necessary livening up the text with extra content, images and diagrams. Finally, the text is proofread and prepared for publication.

A key technology bit is the analysis part: “the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion”. Spotting patterns and events in datasets is an area where automated journalism can help navigate the data beat and highlight things of interest to the journalist (see for example Notes on Robot Churnalism, Part I – Robot Writers for other takes on the robot journalism process). If notable features take the form of possible story points, narrative content can then be generated from them.

To support the process, it seems as if AX Semantics have been working on a markup language: ATML3 (I’m not sure what it stands for? I’d hazard a guess at something like “Automated Text ML” but could be very wrong…). A private beta seems to be in operation around it, but some hints at tooling are starting to appear in the form of ATML3 plugins for the Atom editor.

One to watch, I think…