Going Round in Circles… or Iterating?

Listening to F1 technical pundit Gary Anderson on a 2014 panel (via Joe Saward) about lessons from F1 for business, I was struck by his comment that “motor racing is about going round in circles..racing drivers go round in circles all day long”, trying to improve lap on lap:

Each time round is another chance to improve, not just for the driver but for the teams, particularly during practice sessions, where real-time telemetry allows the team to offer suggested changes while the car is on track, and pit stops allow physical (and computational?) changes to be made to the car.

Each lap is another iteration. Each stint is another iteration. Each session is another iteration. (If you only get 20 laps in a session, that could still give you twenty useful iterations, twenty chances to change something to see if it makes a useful difference.) Each race weekend is another iteration. Each season is another iteration.

Each iteration gives you a chance to try something new and compare it with what you’ve done before.

Who else iterates? Google does. Google (apparently) runs experiments all the time. Potentially, every page impression is another iteration to test the efficacy of their search engine results in terms of converting searchers into revenue generating clickers.

But the thing about iteration is that changes might have negative effects too, which is one reason why you need to iterate fast and often.

But business processes often appear to act as a brake on such opportunities.

Which is why I’ve learned to be very careful writing anything down… because organisations that have had time to build up an administration and a bureaucracy seem tempted to treat things that are written down as somehow fixed (even if those things are written down in socially editable documents (woe betide anyone who changes what you added to the document…)); things that are written down become STOPs in the iteration process. Things that are written down become cast in stone… become things that force you to go round in circles, rather than iterating…

What’s On the Horizon in the Jupyter Ecosystem?

Having been given a “visioning” role for a new level 1 course in production, I’ve started trying to make sense of what an online, or at least, virtual, computing and IT lab might look like for use in an OU context.

One of the ways I’ve tried to carve up the problem is in terms of support tools (the sorts of things a classroom management system might offer – chat rooms, screen sharing, collaborative working, etc) and end-user, task related applications. Another is to try to get a feel for how ecosystems might develop around particular technologies or communities.

It probably won’t surprise regular readers that one of the communities I’ve been looking at is the one growing up around Jupyter notebooks. So here’s a quick summary of some of the Jupyter related projects currently under development that have caught my eye.

Dashboards and Alternative Browser Based UIs

Although I’ve still to start playing with Jupyter widgets, the Jupyter incubator dashboards project seems to be offering support for a structured way of using them in the form of grid-based dashboards generated directly from notebooks. (I guess this is a variant of creating interactive slide decks, eg using nbconvert --to slides, from notebooks?)

It seems as if the dashboard project came out of the IBM Cloud Emerging Technology group (Dynamic Dashboards from Jupyter Notebooks) which suggests that as a tool Jupyter notebooks might have some appeal for business, as well as education and research…

Another company that seems to have bought into the Jupyter ecosystem is technical book publisher O’Reilly. Their thebe code library claims to provide “an easy way to let users on a web page run code examples on a server”, such as a simple HTML UI for a Jupyter served process, as this thebe demo illustrates.

(The Jupyter project itself seems to be developing several Javascript libraries that could be handy, such as Jupyter JS Services, a “Javascript client for the Jupyter services REST APIs”, and maybe jupyter-js-utils, “JavaScript utilities for the Jupyter frontend”; but as with many Jupyter projects in active development, if you’re not on the developer channel it can be tricky working out what the code is for or how to use it!)

One thing I’ve been wondering about for a rewrite of our level 1 residential school robotics activity is whether we might be able to produce a browser or electron app based desktop or tablet based editor, inspired by the look and feel of the RobotLab drag’n’drop text based editor we’ve used in the course previously, to connect to a Jupyter server running on a Lego EV3 brick; and the thebe demo suggests to me that we might…

Collaboration

Collaboration around Jupyter notebooks comes in two forms: realtime collaborative editing within the same notebook (where two users have a copy of the same notebook open in separate editors and see each other’s updates in realtime), and collaboration around documents in a shared/social repository.

SageMathCloud already offers realtime collaboration within Jupyter notebooks, but official Jupyter support for this sort of feature is still on the official Jupyter project roadmap (using Google Drive as the backbone).

Realtime collaboration within notebooks is also available in the form of Livebook [code], which lives outside the main Jupyter project; the live demo site allows you to create – and collaborate around – temporary notebooks (pandas included): try opening a couple of copies of the same notebook (same URL) in a couple of browsers to get a feel for how it works…

In terms of asynchronous collaboration, this independent Commit-and-Push to GitHub from Jupyter Notebooks notebook extension looks interesting in terms of its ability to save the current notebook as a git commit (related issue here). The original nbdiff project [code] appears to have stalled, but there again, the SageMathCloud environment provides a history slider that lets you play through a whole series of (regular) saves of a notebook to show how it evolved and get access to “interim” versions of it.

There seems to be an independent NotebookDiff extension for comparing the state of notebook checkins, though I haven’t used it. I’m guessing the GitCheckpoints extension from the same developers (which I also haven’t tried) saves checkpoints as a git commit?

Jupyter on the Desktop

One of the “problems” of current Jupyter notebook usage is that the application does not run as a standalone app; instead, a server is started and then notebooks are accessed via a browser.

The Jupyter/Atom-notebook project takes a cross platform Atom editor (an electron framework app) and embeds Jupyter notebooks inside it. Double-clicking a notebook file will open it into the editor.

The nteract/composition app is a desktop based electron app, currently under development (I couldn’t get it to build with my node.js installation).

See also: this earlier, independently produced, proof of concept IPython Desktop project that offers a cleaner experience; the independent, proof-of-concept Jupyter sidecar electron app, that displays rich Jupyter kernel output from commands issued in a command line shell in an HTML presenting side display; and the Atom Hydrogen extension, which allows code to be executed against Jupyter kernels, Light Table style.

Summary

A quick scout around Jupyter related projects in progress shows much promise in the development of end-user tools that will make Jupyter notebooks easier to use, as well as tools that support collaborative working around particular notebooks.

The Jupyter project has an active community around it and recently advertised for a full time project manager.

Jupyter notebooks feature in the IBM Data Scientist Workbench (as well as things like Wakari and Domino Data Lab) and IBM also seemed to bootstrap the dashboard components. Technical book publisher O’Reilly use Jupyter notebooks as a first-class authoring environment for the O’Reilly publishing program and Github recognises the .ipynb file format as a first class document type, rendering HTML previews of .ipynb files uploaded to Github or as Github gists.

In a university context, Jupyter notebooks offer much potential for both teaching and research. It will be interesting to see how university IT departments react to this style of computing, and whether they try to find ways of supporting their community in the use of such systems, or whether their users will simply decide to go elsewhere.

See also: Seven Ways of Running IPython / Jupyter Notebooks.

PS I think this is probably going to become a living post…

  • nbpresent: next generation slideshows from notebooks, apparently…
  • nbbrowserpdf: “LaTeX-free PDF generation for Jupyter Notebooks”

Both of those come from the Anaconda developers, so it seems like Continuum are buying into the Jupyter ecosystem…

And some more from IBM: Jupyter Notebooks as RESTful Microservices that “turn notebooks into RESTful web APIs”. Hmm, literate API definitions that can be consumed by literate API consumer notebooks?

 

Running Docker Container Compositions in the Cloud on Digital Ocean

With TM351 about to start, I thought I’d revisit the docker container approach to delivering the services required by the course to see if I could get something akin to the course software running in the cloud.

Copies of the dockerfiles used to create the images can be found on Github, with prebuilt images available on dockerhub (https://hub.docker.com/r/psychemedia/ou-tm351-*).

A couple of handy ones out of the can are:

That said, at the current time, the images are not intended for use as part of the course.

The following docker-compose.yml file will create a set of linked containers that resemble (ish!) the monolithic VM we distributed to students as a Virtualbox box.

dockerui:
    container_name: tm351-dockerui
    image: dockerui/dockerui
    ports:
        - "35183:9000"
    volumes:
        - /var/run/docker.sock:/var/run/docker.sock
    privileged: true

devmongodata:
    container_name: tm351-devmongodata
    command: echo mongodb_created
    #Share same layers as the container we want to link to?
    image: mongo:3.0.7
    volumes: 
        - /data/db

mongodb:
    container_name: tm351-mongodb
    image: mongo:3.0.7
    ports:
        - "27017:27017"
    volumes_from:
        - devmongodata
    command: --smallfiles

mongodb-seed:
    container_name: tm351-mongodb-seed
    image: psychemedia/ou-tm351-mongodb-simple-seed
    links:
        - mongodb

devpostgresdata:
    container_name: tm351-devpostgresdata
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data
 
postgres:
    container_name: tm351-postgres
    environment:
        - POSTGRES_PASSWORD=PGPass
    image: psychemedia/ou-tm351-postgres
    ports:
        - "5432:5432"

openrefine:
    container_name: tm351-openrefine
    image: psychemedia/tm351-openrefine
    ports:
        - "35181:3333"
    privileged: true
    
notebook:
    container_name: tm351-notebook
    #build: ./tm351_scipystacknserver
    image: psychemedia/ou-tm351-pystack
    ports:
        - "35180:8888"
    links:
        - postgres:postgres
        - mongodb:mongodb
        - openrefine:openrefine
    privileged: true

Place a copy of the docker-compose.yml file somewhere convenient. Then, from Kitematic, open the command line, cd into the directory containing the YAML file, and enter the command docker-compose up -d. The images are on dockerhub and should be downloaded automatically.

Refer back to Kitematic and you should see running containers – the settings panel for the notebooks container shows the address you can find the notebook server at.

[Screenshot: Kitematic]

The notebooks and OpenRefine containers should also be linked to shared folders in the directory you ran the Docker Compose script from.

Running the Containers in the Cloud – Docker-Machine and Digital Ocean

As well as running the linked containers on my own machine, my real intention was to see how easy it would be to get them running in the cloud and using just the browser on my own computer to access them.

And it turns out to be really easy. The following example uses cloud host Digital Ocean.

To start with, you’ll need a Digital Ocean account with some credit in it and a Digital Ocean API token:

[Screenshot: Digital Ocean control panel]

(You may be able to get some Digital Ocean credit for free as part of the Github Education Student Developer Pack.)

Then it’s just a case of a few command line instructions to get things running using Docker Machine:

docker-machine ls
#Kitematic uses: default

#Create a droplet on Digital Ocean
docker-machine create -d digitalocean --digitalocean-access-token YOUR_ACCESS_TOKEN --digitalocean-region lon1 --digitalocean-size 4gb ou-tm351-test 

#Check the IP address of the machine
docker-machine ip ou-tm351-test

#Display information about the machine
docker-machine env ou-tm351-test
#This returns necessary config details
#For example:
##export DOCKER_TLS_VERIFY="1"
##export DOCKER_HOST="tcp://IP_ADDRESS:2376"
##export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"
##export DOCKER_MACHINE_NAME="ou-tm351-test"
# Run this command to configure your shell: 
# eval $(docker-machine env ou-tm351-test)

#Either set the environment variables by hand, as recommended...
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://IP_ADDRESS:2376"
export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"

#...or run the suggested command, which sets the same variables for the current docker-machine
eval "$(docker-machine env ou-tm351-test)"

#If the docker-compose.yml file is in .
docker-compose up -d
#This will launch the linked containers on Digital Ocean

#The notebooks should now be viewable at:
#http://IP_ADDRESS:35180

#OpenRefine should now be viewable at:
#http://IP_ADDRESS:35181

#To stop the machine
docker-machine stop ou-tm351-test
#To remove the Digital Ocean droplet (so you stop paying for it...)
docker-machine rm ou-tm351-test

#Reset the current docker machine to the Kitematic machine
eval "$(docker-machine env default)"
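
Once docker-compose up -d has returned, it can be handy to check from your own machine that the published ports really are answering before you open a browser. The following is a minimal Python sketch of that check, not part of the original setup: the function names are my own, and the only things taken from the post are the two host ports (35180 for the notebooks, 35181 for OpenRefine) from the docker-compose.yml file above.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Host ports published in the docker-compose.yml file above.
SERVICE_PORTS = {"notebook": 35180, "openrefine": 35181}


def service_urls(ip_address):
    """Build the browser URL for each published service on the droplet."""
    return {name: "http://{}:{}".format(ip_address, port)
            for name, port in SERVICE_PORTS.items()}


def is_up(url, timeout=5):
    """Return True if something answers an HTTP request at the URL."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered, even if only with an error status.
        return True
    except (URLError, OSError, ValueError):
        # Nothing listening, name not resolvable, timeout, or a malformed URL.
        return False
```

To use it, pass the address reported by docker-machine ip ou-tm351-test to service_urls() and call is_up() on each URL; a False result just after starting the containers may only mean the services are still booting.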

So that’s a start. Issues still arise in terms of persisting state, such as the database contents, notebook files* and OpenRefine projects: if you leave the containers running on Digital Ocean to persist the state, the meter will keep running.

(* I should probably also build a container that demos how to bake a few example notebooks into a container running the notebook server and TM351 python distribution.)

Programmer Employability – Would HE Prepare You for a Guardian Developer Job Interview?

A post on the Guardian developer blog – The Guardian’s new pairing exercises – describes part of the recruitment process used by the Guardian when appointing new developers: candidates are paired with a Guardian developer and set an exercise that assesses “how they approach solving a problem and structure code”.

Originally, all candidates were set the same exercise, but to try to reduce fatigue (in the sense of a loss of interest, enthusiasm or engagement) with the same task by the Guardian developers engaged in the pairing activity, a wider range of exercises was developed that the paired developer can choose from when it’s their turn to work with a candidate.

The exercises used in the process are publicly available – github: guardian/pairing-tests.

Candidates can prepare for the test if they wish but as there are so many tests it is possible to be given an exercise not seen before. It also gives an idea of the skills the Guardian is looking for, the problems do not test knowledge on method names of a language of choice but instead focus on solving a problem in manageable parts over an hour.

One example is to implement a set of rules as per Conway’s Game of Life:

Requirement Cards:

  • A cell can be made “alive”
  • A cell can be “killed”
  • A cell with fewer than two live neighbours dies of under-population
  • A cell with 2 or 3 live neighbours lives on to the next generation
  • A cell with more than 3 live neighbours dies of overcrowding
  • An empty cell with exactly 3 live neighbours “comes to life”
  • The board should wrap
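
To get a feel for how small the core of that exercise can be, here is one possible Python sketch of the requirement cards. It is just my own take, not the Guardian’s reference solution: the board is represented as a set of live (x, y) cells, the function name step and the width and height parameters are my own choices, and I assume a board at least three cells wide and high so that wrapping never counts the same neighbour twice.

```python
def step(live, width, height):
    """Return the next generation on a wrapping (toroidal) board.

    `live` is a set of (x, y) tuples for the cells that are currently alive.
    """
    # Count live neighbours for every cell adjacent to at least one live cell.
    counts = {}
    for x, y in live:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:  # skip the cell itself
                    cell = ((x + dx) % width, (y + dy) % height)
                    counts[cell] = counts.get(cell, 0) + 1
    # Exactly 3 neighbours: lives on or comes to life.
    # Exactly 2 neighbours: lives on only if already alive.
    # Anything else dies of under-population or overcrowding.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


# Making a cell "alive" or "killing" it is just set membership.
board = {(1, 0), (1, 1), (1, 2)}   # a vertical "blinker"
board.add((3, 3))                   # make a cell alive
board.discard((3, 3))               # kill it again
next_board = step(board, 5, 5)      # the blinker flips to horizontal
```

Using a set of live cells rather than a 2D array keeps the wrap rule down to a pair of modulo operations, which is one of the design decisions a pairing session might well end up discussing.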

Others include the parsing and preliminary analysis of an election results dataset, or a set of controls for a simple robot.

So this got me wondering about a few things…:

  • how does this sort of activity design compare with the sort of assessment activity we give to students as part of a course?
  • how does the assessment of the exercise – in terms of what the recruiter learns about the candidate’s problem solving, programming/coding and interpersonal skills – compare with the assessment design and marking guides we use in HE, and by extension, the sort of assessment we train students up for?
  • how comfortable would a recent graduate be taking part in a paired exercise?

It also made me think again how unemployable I am!;-)

In passing, I also note that if you take a peek behind the Guardian homepage, they’re still running their developer ad there:

[Screenshot: view-source of www.theguardian.com/uk]

But how many graduates would think to look? (I seem to remember that sort of ad made me laugh out loud the first time I saw one whilst rooting through someone’s web page trying to figure out how they’d done something or other…)

(Hmmm… thinks: could that make the basis of an exercise – generate an ASCII-art text banner for an arbitrary phrase? Or emoji?! How would *I* go about doing that, other than just installing a python package that already does it?! And if I did come up with an exercise and put in a pull request, or made a contribution to one of their other projects on github, would it land me an interview?! ;-))
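
As a first stab at the banner idea, here is a rough Python sketch that avoids any third party package. Everything in it is made up for illustration: the tiny 3x5 “font” only covers a handful of characters, and anything it doesn’t know about is rendered as a block of question marks, so a real answer to the “arbitrary phrase” part would need a complete font table.

```python
# A deliberately tiny, hand-made 3x5 font: each glyph is five rows of text.
FONT = {
    "H": ["# #", "# #", "###", "# #", "# #"],
    "I": ["###", " # ", " # ", " # ", "###"],
    "O": ["###", "# #", "# #", "# #", "###"],
    "U": ["# #", "# #", "# #", "# #", "###"],
    " ": ["   ", "   ", "   ", "   ", "   "],
}
UNKNOWN = ["???"] * 5   # stand-in for characters missing from the font
ROWS = 5


def banner(phrase):
    """Render a phrase as a five-line ASCII-art banner."""
    glyphs = [FONT.get(ch, UNKNOWN) for ch in phrase.upper()]
    # Build the banner one text row at a time, one space between glyphs.
    lines = [" ".join(glyph[row] for glyph in glyphs).rstrip()
             for row in range(ROWS)]
    return "\n".join(lines)


print(banner("hi"))
```

The interesting part of the exercise is probably not the rendering loop but the decisions around it: where the font comes from, what to do with unknown characters, and whether the phrase should wrap.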

The Rise of Transparent Data Journalism – The BuzzFeed Tennis Match Fixing Data Analysis Notebook

The news today was led in part by a story broken by the BBC and BuzzFeed News – The Tennis Racket – about match fixing in Grand Slam tennis tournaments. (The BBC contribution seems to have been done under the ever listenable File on Four: Tennis: Game, Set and Fix?)

One interesting feature of this story was that “BuzzFeed News began its investigation after devising an algorithm to analyse gambling on professional tennis matches over the past seven years”, backing up evidence from leaked documents with “an original analysis of the betting activity on 26,000 matches”. (See also: How BuzzFeed News Used Betting Data To Investigate Match-Fixing In Tennis, and an open access academic paper that inspired it: Rodenberg, R. & Feustel, E.D. (2014), Forensic Sports Analytics: Detecting and Predicting Match-Fixing in Tennis, The Journal of Prediction Markets, 8(1).)

Feature detecting algorithms such as this (where the feature is an unusual betting pattern) are likely to play an increasing role in the discovery of stories from data, step 2 in the model described in this recent Tow Center for Digital Journalism Guide to Automated Journalism:

[Figure: from the Guide to Automated Journalism]

See also: OUseful.info: Notes on Robot Churnalism, Part I – Robot Writers
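
To make the idea of a feature detector for unusual betting patterns a little more concrete, here is a toy Python sketch. To be clear, this is not BuzzFeed’s algorithm (their own notebook is the place to look for that): the field names, the 10 percentage point threshold and the use of decimal odds are all assumptions of mine, and the calculation ignores the bookmaker’s margin.

```python
def implied_probability(decimal_odds):
    """Convert decimal odds into the win probability they imply."""
    return 1.0 / decimal_odds


def odds_movement(opening_odds, closing_odds):
    """Change in implied probability between opening and closing odds."""
    return implied_probability(closing_odds) - implied_probability(opening_odds)


def flag_unusual(matches, threshold=0.10):
    """Return the ids of matches whose implied probability moved a lot.

    `matches` is a list of dicts with hypothetical keys
    'match_id', 'opening_odds' and 'closing_odds'.
    """
    return [m["match_id"] for m in matches
            if abs(odds_movement(m["opening_odds"], m["closing_odds"])) >= threshold]


# Made-up example data, purely to exercise the functions.
matches = [
    {"match_id": "A", "opening_odds": 2.0, "closing_odds": 1.25},
    {"match_id": "B", "opening_odds": 2.0, "closing_odds": 1.9},
]
flagged = flag_unusual(matches)
```

A flag like this is only ever a prompt for further reporting: a big swing in the odds can have perfectly innocent explanations, such as news of an injury.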

Another interesting aspect of the story behind the story was the way in which BuzzFeed News opened up the analysis they had applied to the data. You can find it described on Github – Methodology and Code: Detecting Match-Fixing Patterns In Tennis – along with the data and a Jupyter notebook that includes the code used to perform the analysis: Data and Analysis: Detecting Match-Fixing Patterns In Tennis.

[Screenshot: tennis-analysis.ipynb in the BuzzFeedNews/2016-01-tennis-betting-analysis repository on Github]

You can even run the notebook to replicate the analysis yourself, either by downloading it and running it using your own Jupyter notebook server, or by using the online mybinder service: run the tennis analysis yourself on mybinder.org.

(I’m not sure if the BuzzFeed or BBC folk tried to do any deeper analysis, for example poking into point summary data as captured by the Tennis Match Charting Project? See also this Tennis Visuals project that makes use of the MCP data. Tennis betting data is also collected here: tennis-data.co.uk. If you’re into the idea of analysing tennis stats, this book is one way in: Analyzing Wimbledon: The Power Of Statistics.)

So what are these notebooks anyway? They’re magic, that’s what!:-)

The Jupyter project is an evolution of an earlier IPython (interactive Python) project that included a browser based notebook style interface for allowing users to write and execute code, as well as seeing the result of executing the code, a line at a time, all in the context of a “narrative” text document. The Jupyter project funding proposal describes it thus:

[T]he core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts.

[C]omputation in science is ultimately in service of a result that needs to be woven into the bigger narrative of the questions under study: that result will be part of a paper, will support or contest a theory, will advance our understanding of a domain. And those insights are communicated in papers, books and lectures: narratives of various formats.

The problem the Jupyter project tackles is precisely this intersection: creating tools to support in the best possible ways the computational workflow of scientific inquiry, and providing the environment to create the proper narrative around that central act of computation. We refer to this as Literate Computing, in contrast to Knuth’s concept of Literate Programming, where the emphasis is on narrating algorithms and programs. In a Literate Computing environment, the author weaves human language with live code and the results of the code, and it is the combination of all that produces a computational narrative.

At the heart of the entire Jupyter architecture lies the idea of interactive computing: humans executing small pieces of code in various programming languages, and immediately seeing the results of their computation. Interactive computing is central to data science because scientific problems benefit from an exploratory process where the results of each computation inform the next step and guide the formation of insights about the problem at hand. In this Interactive Computing focus area, we will create new tools and abstractions that improve the reproducibility of interactive computations and widen their usage in different contexts and audiences.

The Jupyter notebooks include two types of interactive cell – editable text cells into which you can write simple markdown and HTML text that will be rendered as text; and code cells into which you can write executable code. Once executed, the results of that execution are displayed as cell output. Note that the output from a cell may be text, a datatable, a chart, or even an interactive map.
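
Under the hood, a notebook file is just a JSON document listing those cells in order. The Python sketch below builds a two cell notebook by hand using only the standard library; the field names follow the version 4 notebook format as I understand it, the helper function names are my own, and in practice you would more likely let the notebook editor (or the nbformat package) write the file for you.

```python
import json


def markdown_cell(text):
    """A text cell: markdown (or HTML) that is rendered rather than run."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """A code cell: not yet executed, so no outputs are recorded."""
    return {"cell_type": "code", "execution_count": None,
            "metadata": {}, "outputs": [], "source": source}


notebook = {
    "cells": [
        markdown_cell("# A tiny computational narrative\nSome *markdown* text."),
        code_cell("print(1 + 1)"),
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

# Serialise to the text that would be saved in a .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Saving notebook_json to a file with a .ipynb suffix should give something a Jupyter server can open; once the code cell has been run, its results are stored in that cell’s outputs list, which is how a rendered preview can show charts and tables without re-running anything.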

One of the nice things about the Jupyter notebook project is that the executable cells are connected via the Jupyter server to a programming kernel that executes the code. An increasing number of kernels are supported (e.g. for R, Javascript and Java as well as Python) so once you hook in to the Jupyter ecosystem you can use the same interface for a wide variety of computing tasks.

There are multiple ways of running Jupyter notebooks, including the mybinder approach described above; I describe several of them in the post Seven Ways of Running IPython Notebooks.

As well as having an important role to play in reproducible data journalism and reproducible (scientific) research, notebooks are also a powerful, and expressive, medium for teaching and learning. For example, we’re just about to start using Jupyter notebooks, delivered via a virtual machine, for the new OU course Data management and analysis.

We also used them in the FutureLearn course Learn to Code for Data Analysis, showing how code could be used a line at a time to analyse a variety of opendata sets from sources such as the World Bank Indicators database and the UN Comtrade (import/export data) database.

PS for sports data fans, here’s a list of data sources I started to compile a year or so ago: Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post).

Another Notebook-UI Cloud Environment – WolframCloud

If Wolfram didn’t make it so difficult to gain free and open access to their tools, I’d probably use them – and blog about them – a lot more. But as the impression I always seem to come away with is that it’ll probably cost me, or others, a not insignificant amount to use the tools regularly, or I’ll find myself using a proprietary (not open) system, I’m loath to invest much time in their products and even less likely to write about it and do their advertising for them.

But that’s not to say that they aren’t doing some interesting and probably useful stuff, or that my impressions are correct. A few weeks ago, for example, I came across the Wolfram Cloud that provides online access to Wolfram Mathematica Notebooks:

[Screenshot: test1.nb in the Wolfram Programming Lab]

Access is free, once you’ve given them an email address and a password, and allows you to create up to 5 notebooks and create 30 day “cloud deployments” (Mathematica objects that can be accessed via a URL). It’s a very nifty tool for quick interactive web publishing, but it just all feels too closed for me… :-(

And too commercially driven…

[Screenshot: Wolfram pricing page]

So what’s actually involved with the collaboration capabilities, I wonder? “As an instructor, share editable Explorations and notebooks with your students, creating an interactive learning environment.” Does that mean collaborative editing, so a tutor and a student can edit the same notebook at the same time and follow each other’s work? No idea – I can’t find a way to try it and if they want to make it that hard for me to evaluate their offering I won’t bother…

By the by, it’s also worth noting that having bought in to a plan, you then need to start watching how you spend credit within that plan…

[Screenshot: Wolfram Cloud Credits: On-Demand Computation for Cloud Applications]

I guess I want my computing to be in principle free and in principle open… and it seems to me that there is nothing in principle free (or open?) about anything to do with Mathematica… (I always get the feeling I should be grateful they’re letting me try something out…)

An approach – and all round offering – I personally find far more compelling is SageMathCloud (review).

See also: IBM DataScientistWorkBench = OpenRefine + RStudio + Jupyter Notebooks in the Cloud, Via Your Browser and Course Management and Collaborative Jupyter Notebooks via SageMathCloud.

Data as Justification?

Way back when, I spent a couple of years doing research on intelligent software agents, which included a chunk of time looking at formal agent logics.

One of the mantras of a particular flavour of epistemic logic we relied on comes in the form of the following definition: “knowledge is justified true belief”.

This unpacks as follows: you know something if you believe it AND your belief is true AND your belief is justified. So for example, I toss a coin and you believe it falls heads up. It is heads-up but I haven’t shown you that, so you don’t know it, even though you believe it and that belief is a true belief (the thing you believe – that the coin is heads up – is true). However, if I show you the coin is heads-up, you now know it because your belief is justified as well as being true. (I seem to recall it gets a bit more fiddly as you introduce time into this explicitly too…)
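
Written as code, the definition is just a three way conjunction. The following Python fragment is only a restatement of the coin toss example, using names I’ve made up; it makes no attempt to capture the fiddlier temporal versions.

```python
def knows(believes, is_true, is_justified):
    """Knowledge as justified true belief: all three conditions must hold."""
    return believes and is_true and is_justified


# The coin is heads up and you believe it is, but I haven't shown you yet.
before_showing = knows(believes=True, is_true=True, is_justified=False)

# Now I show you the coin: same belief, same fact, but now justified.
after_showing = knows(believes=True, is_true=True, is_justified=True)
```

Here before_showing is False and after_showing is True: the only thing that changed between the two calls is the justification.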

When we start to look at what data can do for us, one of those things is to provide justification for our beliefs. Hans Rosling’s ever amusing ignorance tests demonstrate why we sometimes need our beliefs challenging and his data rich presentations (such as the OU co-produced Don’t Panic shows on BBC2) use data to either confirm our beliefs – reinforcing our knowledge – or show them to be false beliefs (that is, beliefs we have, but that don’t correspond to the state of the world, i.e. beliefs that are untrue).

As well as acting as justification for a belief, data can also create beliefs. But even if the data is true, we still need to take care that any beliefs we generate from the data are justified.

For example, you may or may not find this sentence confusing – more than half of UK wage earners earn less than the average salary. If you think of an average as a mean value, then a quick example easily demonstrates this: four office workers are sat in a bar with “average”-ish incomes, and in walks Mark Zuckerberg. Add the respective incomes together and divide by five. How many people in the bar now have an income higher than that average (mean) value?

However, if you regard an average in terms of the median – mid-point – value, then one person will have the median income and, assuming the original four had slightly different incomes, two will have an income below it, and two will have an income above it.
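
The bar example is easy to check for yourself. In the Python sketch below all of the incomes are invented round numbers (including the billion for the new arrival), so only the pattern matters, not the values.

```python
from statistics import mean, median

# Four office workers with made-up, "average"-ish incomes...
incomes = [24000, 26000, 28000, 30000]
# ...and then a (made-up) billion-a-year earner walks in.
bar = incomes + [1000000000]

mean_income = mean(bar)
median_income = median(bar)

above_mean = [x for x in bar if x > mean_income]
below_median = [x for x in bar if x < median_income]
above_median = [x for x in bar if x > median_income]

print("mean:", mean_income, "- people above it:", len(above_mean))
print("median:", median_income,
      "- below:", len(below_median), "above:", len(above_median))
```

With these numbers only one person in the bar earns more than the mean, while the median splits the other four evenly, two below and two above.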

So when your data point is an average, even if it is correctly calculated (i.e. the data is true), you need to take care what sort of belief you take away from it… Because even if you correctly identify which average is being talked about, you may still come away with a false belief about how the values are distributed. (Not all distributions are, erm, normal…)

And it goes without saying that you also need to be critical of the data itself. Because it may or may not be true…