How to Create a Simple Dockerfile for Building an OpenRefine Docker Image

Over the last few weeks, I’ve been exploring serving OpenRefine in various ways, such as on a vanilla Digital Ocean Linux server or using Docker, as well as using MyBinder (blog post to come…).

So picking up on the last post (OpenRefine on Digital Ocean using Docker), here’s a quick walkthrough of how we can go about creating a Dockerfile, the script used to build a Docker image, for OpenRefine.

First up, an annotated recipe from Thad Guidry for building OpenRefine from scratch from the current repo (via):

#Bring in a base container
#Alpine is quite lite, and we can get a build with JDK-8 already installed
FROM maven:3.6.0-jdk-8-alpine
MAINTAINER thadguidry@gmail.com

#We need to install git so we can clone the OpenRefine repo
RUN apk add --no-cache git

#Clone the current repo
RUN git clone https://github.com/OpenRefine/OpenRefine.git 

#Build the OpenRefine application
RUN OpenRefine/refine build

#Create a directory we can save OpenRefine user project files into
RUN mkdir /mnt/refine

#Mount a Docker volume against that directory.
#This means we can save data to another volume and persist it
#if we get rid of the current container.
VOLUME /mnt/refine

#Expose the OpenRefine server port outside the container
EXPOSE 3333

#Command to start the OpenRefine server when the container starts
CMD ["OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]

You can build the image from that Dockerfile by cding into the same directory as the Dockerfile and running something like:

docker build -t psychemedia/openrefine .

The -t flag tags the image (that is, names it); the . sets the build context to the current directory, which is also where Docker looks for the Dockerfile by default.

You could then run the container using something like:

docker run --rm -d --name openrefine -p 3334:3333 psychemedia/openrefine
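
Since the Dockerfile declares /mnt/refine as a volume, one way to keep project files around after a container is removed is to mount a named Docker volume there at run time. The following is a minimal sketch of that idea; the helper name run_openrefine and the volume name openrefine-projects are placeholders of mine, not anything Docker or OpenRefine defines.

```shell
# Hypothetical helper: start OpenRefine with project data kept in a named volume.
# Arguments (all optional): image name, volume name, host port.
run_openrefine() {
  image="${1:-psychemedia/openrefine}"
  volume="${2:-openrefine-projects}"   # placeholder volume name
  port="${3:-3334}"
  # Docker creates the named volume on first use if it does not exist yet.
  docker run --rm -d --name openrefine \
    -p "${port}:3333" \
    -v "${volume}:/mnt/refine" \
    "$image"
}
```

Calling run_openrefine, and later stopping the container, should leave the projects in the openrefine-projects volume, ready to be mounted into the next container.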

One of the disadvantages of the above build process is that it produces an image that still contains the build files and the tooling required to build it, as well as the application files. This means that the image is larger than it need be. It’s also not quite a release?

I think we can also add RUN OpenRefine/refine dist RELEASEVERSION to then create a release, but there is a downside that this step will fail if a test fails.

We’d then have to tidy up a bit, which we could do with a multistage build. Simon Willison has written a really neat sketch around this on building smaller Python Docker images that provides a handy crib. In our case, we could FROM the same base container (or maybe a version populated with a JRE rather than a JDK, if OpenRefine can run just with a JRE?) and copy across the distribution file created from the distribution build step; from that, we could then install the application.
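
As a rough, untested sketch of what such a multistage build might look like, the following writes a two-stage Dockerfile to a separate file. It leans on several assumptions: that the repo still builds with this Maven base image, that OpenRefine really can run with just a JRE, that the refine launcher needs bash, and that copying the whole built source tree across is good enough (it skips the refine dist step entirely). The filename Dockerfile.multistage and the /build directory are arbitrary choices of mine.

```shell
# Write a hypothetical two-stage Dockerfile; the quoted 'EOF' stops the shell
# from expanding anything inside the heredoc.
cat > Dockerfile.multistage <<'EOF'
#Stage 1: build OpenRefine with the full Maven/JDK toolchain
FROM maven:3.6.0-jdk-8-alpine AS builder
RUN apk add --no-cache git
WORKDIR /build
RUN git clone https://github.com/OpenRefine/OpenRefine.git
RUN OpenRefine/refine build

#Stage 2: start again from a smaller JRE-only image
#(assumes OpenRefine does not need the JDK at run time)
FROM openjdk:8-jre-alpine
#Assumption: the refine launcher script needs bash
RUN apk add --no-cache bash
#Copy just the built tree; Maven, git and the build cache stay behind in stage 1
COPY --from=builder /build/OpenRefine /OpenRefine
RUN mkdir /mnt/refine
VOLUME /mnt/refine
EXPOSE 3333
CMD ["/OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]
EOF
```

It could then be built with something like docker build -f Dockerfile.multistage -t psychemedia/openrefine-slim . (the tag is again just a placeholder). The final image would still carry the source files, so this is a halfway house rather than a proper release image.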

So let’s go to that other extreme and look at a Dockerfile for building a container from a specific release/distribution.

The OpenRefine releases page lists all the OpenRefine releases. Looking at the download links for the Linux distribution, the URLs take the form:

https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz
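
To make the pattern concrete, here is a small sketch that fills in the release tag from a shell variable. The function name openrefine_url is made up for illustration, and it assumes every release follows the same naming pattern, which I haven't checked for all of them.

```shell
# Hypothetical helper: print the Linux download URL for a given release tag.
openrefine_url() {
  release="${1:-3.1}"   # default to the 3.1 release
  echo "https://github.com/OpenRefine/OpenRefine/releases/download/${release}/openrefine-linux-${release}.tar.gz"
}

# Example: show the URL for the 3.1-beta tag
openrefine_url 3.1-beta
```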

So how do we install an OpenRefine server from a distribution file?

#We can use the smaller JRE rather than the JDK
FROM openjdk:8-jre-alpine as builder

MAINTAINER tony.hirst@gmail.com

#Download a couple of required packages
RUN apk update && apk add --no-cache wget bash

#We can pass variables into the build process via --build-arg variables
#We name them inside the Dockerfile using ARG, optionally setting a default value
ARG RELEASE=3.1

#ENV vars are environment variables that get baked into the image
#We can pass an ARG value into a final image by assigning it to an ENV variable
ENV RELEASE=$RELEASE

#There's a handy discussion of ARG versus ENV here:
#https://vsupalov.com/docker-arg-vs-env/

#Download a distribution archive file
RUN wget --no-check-certificate https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz

#Unpack the archive file and clear away the original download file
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz  && rm openrefine-linux-$RELEASE.tar.gz

#Create an OpenRefine project directory
RUN mkdir /mnt/refine

#Mount a Docker volume against the project directory
VOLUME /mnt/refine

#Expose the server port
EXPOSE 3333

#Create the start command.
#Note that the application is in a directory named after the release
#We use the environment variable to set the path correctly
CMD openrefine-$RELEASE/refine -i 0.0.0.0 -d /mnt/refine

We can now build an image of the default version as baked into the Dockerfile:

docker build -t psychemedia/openrefinedemo .

Or we can build against a specific version:

docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .
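
Because the ARG value is copied into an ENV variable, it should be possible to ask an image which release it was built against. A small sketch of that check follows; the helper name image_release is hypothetical, and it assumes the image provides printenv (as BusyBox-based Alpine images normally do).

```shell
# Hypothetical helper: report the RELEASE value baked into an image.
image_release() {
  # Override the image's default command with printenv for a one-off run.
  docker run --rm "${1:-psychemedia/openrefinedemo}" printenv RELEASE
}
```

Running image_release against an image built with --build-arg RELEASE=3.1-beta should report that tag rather than the default baked into the Dockerfile.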

To peek inside the container, we run it and jump into a bash shell inside it:

docker run --rm -i -t psychemedia/openrefinedemo /bin/bash

We run the container as before:

docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

Useful?

PS Note that when running an OpenRefine container on something like Digital Ocean using the default OpenRefine memory settings, you may have trouble starting OpenRefine on machines smaller than 3GB. (I’ve had some trouble getting it started on a 2GB server.)
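
One thing that might help on a small server is capping the Java heap. As far as I recall, the refine launcher takes a -m option for the maximum heap size; treat that, and the 1024M figure, as assumptions to check against the launcher's usage message rather than as tested advice. The sketch below overrides the image's default command so the extra flag can be passed in. The helper name is hypothetical, and the path assumes the archive was unpacked in the image's root directory.

```shell
# Hypothetical helper: start the release-based image with a smaller Java heap.
# Arguments (optional): release the image was built with, max heap size.
run_openrefine_lowmem() {
  release="${1:-3.1}"   # must match the release baked into the image
  heap="${2:-1024M}"    # assumed value; tune to the size of the server
  docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo \
    "/openrefine-${release}/refine" -i 0.0.0.0 -d /mnt/refine -m "$heap"
}
```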

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...