Tagged: docker

Querying Panama Papers Neo4j Database Container From a Linked Jupyter Notebook Container

A few weeks ago I posted some quick doodles showing, on the one hand, how to get the Panama Papers data into a simple SQLite database and, on the other, how to link a neo4j graph database to a Jupyter notebook server using Docker Compose.

As the original Panama Papers investigation used neo4j as its backend database, I thought putting the data into a neo4j container could give me the excuse I needed to start looking at neo4j.

Anyway, it seems as if someone has already pushed a neo4j Docker container image preseeded with the Panama Papers data, so here’s my quickstart.

To use it, you need to have Docker installed, download the docker-compose.yaml file and then run:

docker-compose up

If you do this from a command line launched from Kitematic, Kitematic should provide you with a link to the neo4j database, running on the Docker IP address and port 7474. Log in with the default credentials (neo4j/neo4j) and change the password to panamapapers (all lower case).

Download the quickstart notebook into the newly created notebooks directory, and you should be able to see it from the notebooks homepage on the Docker IP address, port 8890 (or again, just follow the link from Kitematic).

I’m still trying to find my way around both the py2neo Python wrapper and the neo4j Cypher query language, so the demo thus far is not that inspiring!
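For what it’s worth, the sort of thing I’ve been trying looks something like the following minimal sketch – hedged, because it assumes py2neo v3, guesses at the node labels used in the ICIJ data model (Officer, Entity), and you’ll need to swap in the Docker IP address for your own machine:

from py2neo import Graph

# Connect to the neo4j container - swap in your own Docker IP address,
# and the panamapapers password set above (py2neo v3 style - an assumption)
graph = Graph("http://192.168.99.100:7474/db/data/", user="neo4j", password="panamapapers")

# Guessing at the labels/properties used in the ICIJ data model
query = """
MATCH (o:Officer)-[r]->(e:Entity)
RETURN o.name AS officer, type(r) AS rel, e.name AS entity
LIMIT 10
"""

for record in graph.run(query):
    print(record)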

And I’m not sure when I’ll get a chance to look at it again…:-(

Application Shelves for the Digital Library – Fragments

Earlier today, I came across BioShaDock: a community driven bioinformatics shared Docker-based tools registry (BioShaDock registry). This collects together a wide range of containerised applications and tools relevant to the bioinformatics community. Users can take one or more applications “off-the-shelf” and run them, without having to go through any complicated software installation process themselves, even if the installation process is a nightmare confusion of dependency hell: the tools are preinstalled and ready to run…

The container images essentially represent reference images that can be freely used by the community. The application containers come pre-installed and ready to run, exact replicas of their parent reference image. The images can be versioned with different versions or builds of the application, so you can reference the use of a particular version of an application and provide a way of sharing exactly that version with other people.

So could we imagine this as a specialist reference shelf in a Digital Library? A specialist reference application shelf, with off-the-shelf, ready-to-run tools, anywhere, any time?

Another of the nice things about containers is that you can wire them together using things like Docker Compose or Panamax templates to provide a suite of integrated applications that can work with each other. Linked containers can pass information between each other in isolation from the rest of the world. One click can provision and launch all the applications, wired together. And everything can be versioned and archived. Containerised operations can also be sequenced (eg using DRAY docker pipelines or OpenAPI).

Sometimes, you might want to bundle a set of applications together in a single, shareable package as a virtual machine. These can be versioned, and shared, so everyone has access to the same tools installed in the same way within a single virtual machine. Things like the DH Box, “a push-button Digital Humanities laboratory” (DHBox on github); or the Data Science Toolbox. These could go on to another part of the digital library applications shelf – a more “general purpose toolbox” area, perhaps?

As a complement to the “computer area” in the physical library that provides access to software on particular desktops, the digital library could have “execution rooms” that will actually let you run the applications taken off the shelf, and access them through a web browser UI, for example. So runtime environments like mybinder or tmpnb. Go to the digital library execution room (which is just a web page, though you may have to authenticate to gain access to the server that will actually run the code for you…), say which container, container collection, or reference VM you want to run, and click “start”. Or take the images home with you (that is, run them on your own computer, or on a third party host).

Some fragments relating to the above to try to help me situate this idea of runnable, packaged application shelves within the context of the library in general…

  • libraries have been, and still are, one of the places you go to access IT equipment and learn IT skills;
  • libraries used to be, and still are, a place you could go to get advice on, and training in, advanced information skills, particularly discovery, critical reading and presentation;
  • libraries used to be, and still are, a locus for collections of things that are often valuable to the community or communities associated with a particular library;
  • libraries provide access to reference works or reference materials that provide a common “axiomatic” basis for particular activities;
  • libraries are places that provide access to commercial databases;
  • libraries provide archival and preservation services;
  • libraries may be organisational units that support data and information management needs of their host organisation.

Some more fragments:

  • the creation of a particular piece of work may involve many different steps;
  • one or more specific tools may be involved in the execution of each step;
  • general purpose tools may support the actions required to perform a range of tasks to a “good enough” level of quality;
  • specialist tools may provide a more powerful environment for performing a particular task to a higher level of quality.

Some questions:

  • what tools are available for performing a particular information related task or set of tasks?
  • what are the best tools for performing a particular information related task or set of tasks?
  • where can I get access to the tools required for a particular task without having to install them myself?
  • how can I effectively organise a workflow that requires the use of several different tools?
  • how can I preserve, document or reference the workflow so that I can reuse it or share it with others?

Some observations:

  • Docker containers provide a way of packaging an application or tool so that it can be “run anywhere”;
  • Docker containers may be linked together in particular compositions so that they can interoperate with each other;
  • Docker container images may be grouped together in collections within a subject specific registry: for example, BioShaDock.

OpenRobertaLab – Simple Robot Programming Simulator and UI for Lego EV3 Bricks

Rather regretting not having done a deep dive into programming environments for the Lego EV3 somewhat earlier, I came across the Blockly-inspired OpenRobertaLab (code, docs) only a couple of days ago.

[Screenshot: Open_Roberta_Lab]

(Way back when, in the first incarnation of the OU Robotics Outreach Group, we were part of the original Roberta project, which was developing a European educational robotics pack, so it’s nice to see it’s continued.)

OpenRobertaLab is a browser accessible environment that allows users to program a simulated robot using Blockly blocks.

[Screenshot: Open_Roberta_Lab2]

I’m not sure how easy it is to change the test track used in the simulator? That said, the default does have some nice features – a line to follow, colour bars to detect, a square to drive round.

The OU Robotlab simulator supported a pen down option that meant you could trace the path taken by the robot – I’m not sure if RobertaLab has a similar feature?

[Screenshot: robotlab]

It also looks as if user accounts are available, presumably so you can save your programmes and return to them at a later date:

[Screenshot: Open_Roberta_Lab5]

Account creation looks to be self-service:

[Screenshot: Open_Roberta_Lab6]

OpenRobertaLab also allows you to program a connected EV3 robot running leJOS, the community-developed Java programming environment for the EV3s. It seems that it’s also possible to connect a brick running ev3dev to OpenRobertaLab using the robertalab-ev3dev connector. This package is preinstalled in ev3dev, although it needs enabling (and the brick rebooting) to run. SSH into the brick and then, from the brick command line, run:

sudo systemctl unmask openrobertalab.service
sudo systemctl start openrobertalab.service

Following a reboot, the OpenRobertaLab client should now run automatically and be available from the OpenRobertaLab menu on the brick. To stop the service / cancel it from running automatically, run:

sudo systemctl stop openrobertalab.service
sudo systemctl mask openrobertalab.service

If the brick has access to the internet, you should now be able to simply connect to the OpenRobertaLab server (lab.open-roberta.org).

Requesting a connection from the brick gives you an access code you need to enter on the OpenRobertaLab server. From the robots menu, select connect...:

[Screenshot: Open_Roberta_Lab3]

and enter the provided connection code (use the connection code displayed on your EV3):

[Screenshot: Open_Roberta_Lab4]

On connecting, you should hear a celebratory beep!

Note that this was as far as I got – OpenRobertaLab told me a more recent version of the brick firmware was available and suggested I install it. Whilst it claimed it may still be possible to run commands using the old firmware, that didn’t seem to be the case?

As well as accessing the public OpenRobertaLab environment on the web, you can also run your own server. There are a few dependencies required for this, so I put together a Docker container psychemedia/robertalab (Dockerfile) containing the server, which means you should be able to run it using Kitematic:

[Screenshot: kitematic_robertalab]

(For persisting things like user accounts and saved programmes, there should probably be a shared data container to persist that info?)

A random port will be assigned, though you can change this to the original default (1999):

[Screenshot: kitematic_robertalab]

The simulator should run fine using the IP address assigned to the docker machine, but in order to connect a robot on the same local WiFi network to the OpenRobertaLab server, or connect to the programming environment from another computer on the local network, you will need to set up port forwarding from the Docker VM:

[Screenshot: virtualboxroboertacontainer]

See Exposing Services Running in a Docker Container Running in Virtualbox to Other Computers on a Local Network for more information on exposing the containerised Open Robertalab server to a local network.

On the EV3, you will need to connect to a custom OpenRobertaLab server. The settings you need are the IP address of the computer on which the server is running (on a Mac, you can find this from the Mac Network settings), along with the port number the server is running on:

So for example, if Kitematic has assigned the port number 32567, and you didn’t otherwise change it, and your host computer’s IP address is 192.168.1.86, you should connect to 192.168.1.86:32567 from the OpenRobertaLab connection settings on the brick. On connecting, you will be presented with a pass code as above, which you should enter on your local OpenRobertaLab webpage.

Note that when trying to run programmes on a connected brick, I suffered the firmware mismatch problem again.

Exposing Services Running in a Docker Container Running in Virtualbox to Other Computers on a Local Network

Most of my experiments with Docker on my desktop machine to date have been focused on reducing installation pain and side-effects by running applications and services that I can access from a browser on the same desktop.

The services are exposed against the IP address of the virtual machine running docker, rather than localhost of the host machine, which also means that the containerised services can’t be accessed by other machines connected to the same local network.

So how do we get the docker container ports exposed on the host’s localhost network IP address?

If docker is running the containers via Virtualbox in the virtual machine named default, it seems all we need to do is add a port forwarding rule in Virtualbox. So if I’m trying to get port 32769 on the docker IP address relayed to the same port on the host localhost, I can issue the following terminal command if the Docker Virtualbox VM is currently running:

VBoxManage controlvm "default" natpf1 "tcp-port32769,tcp,,32769,,32769"

which has syntax:

natpf<1-N> [<rulename>],tcp|udp,[<hostip>],<hostport>,[<guestip>],<guestport>

Alternatively, the rule can be created from the Virtualbox Network – Port Forwarding settings for the default box:

[Screenshot: default_-_Network]

To clear the rule, use:

VBoxManage controlvm "default" natpf1 delete "tcp-port32769"

or delete it from the Network – Port Forwarding rule dialogue in the Virtualbox box settings.

If the box is not currently running, use:

VBoxManage modifyvm "default" --natpf1 "tcp-port32769,tcp,,32769,,32769"
VBoxManage modifyvm "default" --natpf1 delete "tcp-port32769"

The port should now be visible at localhost:32769 and, by extension, may be exposed to machines on the same network as the host machine by calling the IP address of the host machine with the value of the forwarded host port.

On a Mac, you can find the local IP address of the machine from the Mac’s Network settings:

[Screenshot: Network]

Simples:-)

From Linked Data to Linked Applications?

Pondering how to put together some Docker IPython magic for running arbitrary command line functions in arbitrary docker containers (this is as far as I’ve got so far), I think the commands must include a couple of things:

  1. the name of the container image (perhaps rooted in a particular repository): psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example;
  2. the actual command to be called: for example, one of the contentmine commands: getpapers -q {QUERY} -o {OUTPUTDIR} -x

We might also optionally specify mount directories for the calling and called containers, using a conventional default otherwise.

This got me thinking that the called functions might be viewed as operating in a namespace (psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example). And this in turn got me thinking about “big-L, big-A” Linked Applications.

According to Tim Berners-Lee’s four rules of Linked Data, the web of data should:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs. so that they can discover more things.

So how about a web of containerised applications, that would:

  1. Use URIs as names for container images
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information (in the minimal case, this corresponds to a Dockerhub page, for example; in a user-centric world, this could just return a help file identifying the commands available in the container, along with help for individual commands)
  4. Include a Dockerfile. so that they can discover what the application is built from (also may link to other Dockerfiles).

Compared with Linked Data, where the idea is about relating data items one to another, the identifying HTTP URI actually represents the ability to make a call into a functional execution space. Linkage into the world of linked web resources might be provided through Linked Data relations that specify that a particular resource was generated from an instance of a Linked Application, or that the resource can be manipulated by an instance of a particular application.

So for example, files linked to on the web might have a relation that identifies the filetype, and the filetype is linked by another relation that says it can be opened in a particular linked application. Another file might link to a description of the workflow that created it, and the individual steps in the workflow might link to function/command identifiers that are linked to linked applications through relations that associate particular functions with a particular linked application.

Workflows may be defined generically, and then instantiated within a particular experiment. So for example: load file with particular properties, run FFT on particular columns, save output file becomes instantiated within a particular run of an experiment as load file with this URI, run the FFT command from this linked application on particular columns, save output file with this URI.
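To make that slightly more concrete, here’s a trivial sketch of what a single instantiated step might look like – the data URI and column name are made up, and I’m using pandas/numpy directly rather than calling out to a linked application:

import pandas as pd
import numpy as np

# "load file with this URI" - a hypothetical data URI and column name
df = pd.read_csv("http://example.org/experiment/run1.csv")

# "run the FFT command on particular columns"
spectrum = np.fft.fft(df["signal"].values)

# "save output file with this URI" - here, just a local file path
pd.DataFrame({"bin": range(len(spectrum)),
              "magnitude": np.abs(spectrum)}).to_csv("run1_fft.csv", index=False)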

Hmm… thinks.. there is a huge amount of work already done in the area of automated workflows and workflow execution frameworks/environments for scientific computing. So this is presumably already largely solved? For example, Integrating Containers into Workflows: A Case Study Using Makeflow, Work Queue, and Docker, C. Zheng & D. Thain, 2015 [PDF]?

A handful of other quick points:

  • the model I’m exploring in the Docker magic context is essentially a stateless/serverless computing approach, where a commandline container is created on demand and treated in a disposable way to just run a particular function before being destroyed (see also the OpenAPI approach).
  • The Linked Application notion extends to other containerised applications, such as ones that expose an HTML user interface over http that can be accessed via a browser. In such cases, things like WSDL (or WADL; remember WADL?) provided a machine-readable, formalised way of describing functional resource availability.
  • In the sense that commandline containerised Linked Applications are actually services, we can also think about web services publishing an http API in a similar way?
  • services such as Sandstorm, which have the notion of self-running containerised documents, have the potential to actually bind a specific document within an interactive execution environment for that document.

Hmmm… so how much nonsense is all of the above, then?

Steps Towards Some Docker IPython Magic – Draft Magic to Call a Contentmine Container from a Jupyter Notebook Container

I haven’t written any magics for IPython before (and it probably shows!) but I started sketching out some magic for the Contentmine command-line container I described in Using Docker as a Personal Productivity Tool – Running Command Line Apps Bundled in Docker Containers.

What I’d like to explore is a more general way of calling command line functions accessed from arbitrary containers via a piece of generic magic, but I need to learn a few things along the way, such as handling arguments for a start!

The current approach provides crude magic for calling the contentmine functions included in a public contentmine container from a Jupyter notebook running inside a container. The commandline contentmine container is started from within the notebook container and uses --volumes-from the notebook container to pass files between the two. The path to the directory mounted from the notebook is identified by a bit of jiggery pokery, as is the method for spotting what container the notebook is actually running in (I’m all ears if you know of a better way of doing either of these things?:-)

The magic has the form:

%getpapers /notebooks rhinocerous

to run the getpapers query (with fixed switch settings for now) and the search term rhinocerous; files are shared back from the contentmine container into the /notebooks folder of the Jupyter container.
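The guts of the approach amount to something like the following minimal sketch (not the actual code in the gist linked below – it assumes the docker commandline client can talk to the Docker daemon from inside the notebook container, and that the container’s HOSTNAME environment variable gives the notebook container’s ID, which is the bit of jiggery pokery mentioned above):

import os
import subprocess
from IPython.core.magic import register_line_magic

@register_line_magic
def getpapers(line):
    """Sketch: %getpapers MOUNTDIR QUERY - run a contentmine getpapers search
    in a disposable container, sharing files back via --volumes-from."""
    mountdir, query = line.split()
    # Inside a Docker container, HOSTNAME defaults to the container ID (an assumption)
    notebook_container = os.environ.get("HOSTNAME", "")
    subprocess.call(["docker", "run", "--rm",
                     "--volumes-from", notebook_container,
                     "psychemedia/contentmine",
                     "getpapers", "-q", query,
                     "-o", "{}/{}".format(mountdir, query), "-x"])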

Other functions include:

%norma /notebooks rhinocerous
%cmine /notebooks rhinocerous

These functions are applied to files in the folder created by the original search term (rhinocerous).

The magic needs updating so that it will also work in a Jupyter notebook that is not running within a container – this should simply be a case of switching in a different directory path. The magics also need tweaking so we can pass parameters in. I’m not sure if more flexibility should also be allowed on specifying the path (we need to make sure that the paths for the mounted directories are the correct ones!)

What I’d like to work towards is some sort of line magic along the lines of:

%docker psychemedia/contentmine -mountdir /CALLING_CONTAINER_PATH -v ${MOUNTDIR}:/PATH COMMAND -ARGS etc

or cell magic:

%%docker psychemedia/contentmine -mountdir /CALLING_CONTAINER_PATH -v ${MOUNTDIR}:/PATH
COMMAND -ARGS etc
...
COMMAND -ARGS etc

Note that these go against the docker command line syntax – should they be closer to it?
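As a rough sketch of how the cell magic version might be wired up (hypothetical – this isn’t what’s in the gist linked below; the argument handling is deliberately crude and the -v mapping is ignored for brevity):

import subprocess
from IPython.core.magic import register_cell_magic

@register_cell_magic
def docker(line, cell):
    """Sketch: %%docker IMAGE [-mountdir PATH] - run each line of the cell
    as a command in a fresh, disposable container."""
    tokens = line.split()
    image = tokens[0]
    mountdir = "/notebooks"  # a conventional default, as suggested above
    if "-mountdir" in tokens:
        mountdir = tokens[tokens.index("-mountdir") + 1]
    for command in cell.strip().splitlines():
        subprocess.call(["docker", "run", "--rm",
                         "-v", "{}:/data".format(mountdir),
                         image] + command.split())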

The code, and a walked through demo, are included in the notebook available via this gist, which should also be embedded below.


Docker as a Personal Application Runner

Consider, for a moment, the following scenarios associated with installing and running a desktop based application on your own computer:

  • a learner installing software for a distance education course: course materials are produced in advance of the course and may be written with a particular version of the software in mind, distributed as part of the course materials. Learners may have an arbitrary O/S (various versions of Windows and OS X), may be working on work computers with aggressive IT-enforced security policies, or may be working on shared/public computers. Some courses may require links between different applications (for example, a data analysis package and a database system); in addition, some students may not be able to install any software on their own computer – how can we support them?
  • academic research environment: much academic software is difficult to install and may require an element of sysadmin skills, as well as a particular O/S and particular versions of supporting libraries. Why should a digital humanities researcher who wants to work with the tools provided in a particular text analysis package also have to learn sysadmin skills to install the software before they can use the functions that actually matter to them? Or consider a research group environment, where it’s important that research group members have access to the same software configuration but on their own machines.
  • data journalism environment: another twist on the research environment, data journalists may want to compartmentalise and preserve a particular analysis of a particular dataset, along with the tools associated with running those analyses, as “evidence”, in case the story they write on it is challenged in court. Or maybe they need to fire up a particular suite of interlinked tools for producing a particular story in quick time (from accessing the raw data for the first time to publishing the story within a few hours), making sure they work from a clean setup each time.

What we have here is a packaging problem. We also have a situation where the responsibility for installing a single copy of the application, or set of linked applications, falls to an individual user or small team working on an arbitrary platform with few, if any, sysadmin skills.

So can Docker help?

A couple of recent posts on the Docker blog set out to explore what Docker is not.

The first – Containers are not VMs – argues that Docker “is not a virtualization technology, it’s an application delivery technology”. The post goes on:

In a VM-centered world, the unit of abstraction is a monolithic VM that stores not only application code, but often its stateful data. A VM takes everything that used to sit on a physical server and just packs it into a single binary so it can be moved around. But it is still the same thing. With containers the abstraction is the application; or more accurately a service that helps to make up the application.

With containers, typically many services (each represented as a single container) comprise an application. Applications are now able to be deconstructed into much smaller components which fundamentally changes the way they are managed in production.

So, how do you backup your container, you don’t. Your data doesn’t live in the container, it lives in a named volume that is shared between 1-N containers that you define. You backup the data volume, and forget about the container. Optimally your containers are completely stateless and immutable.

The key idea here is that with Docker we have a “something” (in the form of a self-contained container) that implements an application’s logic and publishes the application as a service, but isn’t really all that interested in preserving the state of, or any data associated with, the application. If you want to preserve data or state, you need to store it in a separate persistent data container, or alternative data storage service, that is linked to application containers that want to call on it.

The second post – There’s Application Virtualization and There’s Docker – suggests that “Docker is not application virtualization” in the sense of “put[ting] the application inside of a sandbox that includes the app and all its necessary DLLs. Or, … hosting the application on a server, and serving it up remotely…”, but I think I take issue with this in the way it can be misinterpreted as a generality.

The post explicitly considers such application virtualisation in the context of applications that are “monolithic in that they contain their own GUI (vs. a web app that is accessed via a browser)”, things like Microsoft Office or other “traditional” desktop based applications, for example.

But many of the applications I am interested in are ones that publish their user interface as a service, of sorts, over HTTP in the form of a browser-based HTML API, or that are accessed via the command line. For these sorts of applications, I believe that Docker represents a powerful environment for personal, disposable, application virtualisation. For example, dedicated readers of this blog may already be aware of my various demonstrations along these lines.

Via Paul Murrell, I also note this pipeline approach for running docker containers: An OpenAPI Pipeline for NZ Crime Data. Pipeline steps are defined in separate XML modules, and the whole pipeline is defined in another XML file. For example, the module step OpenAPI/NZcrime/region-Glendowie.xml runs a specified R command in a Docker container fired up to execute just that command. The pipeline definition file identifies the component modules as nodes in some sort of execution graph, along with the edges connecting them as steps in the pipeline. The pipeline manager handles the execution of the steps in order and passes state between the steps in one of several ways (for example, via a shared file or a passed variable). (Further work on the OpenAPI pipeline approach is described in An Improved Pipeline for CPI Data.)

What these examples show is that as well as providing devops provisioning support for scaleable applications on the one hand, and an environment for effective testing and rapid development of applications on the other, Docker containers may also have a role to play in “user-pulled” applications.

This is not so much thinking of Docker from an enterprise perspective, in which it acts as an environment that supports development and auto-scaled deployment of containerised applications and services; nor is it the view that a web hosting service might take of Docker images providing an appropriate packaging format for the self-service deployment of long-lived services, such as blogs or wiki applications (a Docker hub and deployment system to rival cPanel, for example).

Instead, it’s a user-centric, rather than devops-centric, view that sees containers from a single-user, desktop perspective, with Docker and its ilk providing an environment that can support off-the-shelf, ready-to-run tools and applications that can be run locally or in the cloud, individually or in concert with each other.

Next up in this series: a reflection on the possibilities of a “Digital Library Application Shelf”.