Over the last few weeks, I’ve been exploring serving OpenRefine in various ways, such as on a vanilla Digital Ocean Linux server or using Docker, as well as using MyBinder (blog post to come…).
So picking up on the last post (OpenRefine on Digital Ocean using Docker), here’s a quick walkthrough of how we can go about creating a Dockerfile, the script used to create a Docker container, for OpenRefine.
First up, an annotated recipe for building OpenRefine from scratch from the current repo from Thad Guidry (via):
#Bring in a base container
#Alpine is quite lite, and we can get a build with JDK-8 already installed
FROM maven:3.6.0-jdk-8-alpine
MAINTAINER thadguidry@gmail.com

#We need to install git so we can clone the OpenRefine repo
RUN apk add --no-cache git

#Clone the current repo
RUN git clone https://github.com/OpenRefine/OpenRefine.git

#Build the OpenRefine application
RUN OpenRefine/refine build

#Create a directory we can save OpenRefine user project files into
RUN mkdir /mnt/refine

#Mount a Docker volume against that directory.
#This means we can save data to another volume and persist it
#if we get rid of the current container.
VOLUME /mnt/refine

#Expose the OpenRefine server port outside the container
EXPOSE 3333

#Command to start the OpenRefine server when the container starts
CMD ["OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]
You can build the container from that Dockerfile by cd-ing into the same directory as the Dockerfile and running something like:
docker build -t psychemedia/openrefine .
The -t flag tags the image (that is, names it); the . sets the build context to the current directory, which is also where Docker looks for the Dockerfile by default.
You could then run the container using something like:
docker run --rm -d --name openrefine -p 3334:3333 psychemedia/openrefine
One of the disadvantages of the above build process is that it produces a container that still contains the build files, and the tooling required to build it, as well as the application files. This means that the container is larger than it need be. It’s also not quite a release?
I think we can also add RUN OpenRefine/refine dist RELEASEVERSION to then create a release, but there is a downside that this step will fail if a test fails.
We’d then have to tidy up a bit, which we could do with a multistage build. Simon Willison has written a really neat sketch around this on building smaller Python Docker images that provides a handy crib. In our case, we could FROM the same base container (or maybe a JRE, rather than JDK, populated version, if OpenRefine can run just with a JRE?) and copy across the distribution file created from the distribution build step; from that, we could then install the application.
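As a rough, untested sketch of what a multistage build might look like, the following shell snippet writes out a two-stage Dockerfile. To keep things simple it copies the whole built source tree across rather than a distribution archive, so it only sheds Maven, the JDK and the Maven cache, not the source files. The filename Dockerfile.multistage, the stage name builder, the choice of openjdk:8-jre-alpine as the runtime base and the need for bash in that image are all my assumptions; whether the built tree actually runs on a JRE alone is exactly the open question raised above.

```shell
#!/bin/sh
# Write a hypothetical two-stage Dockerfile (untested sketch).
cat > Dockerfile.multistage <<'EOF'
#Stage 1: build OpenRefine with the full Maven + JDK toolchain
FROM maven:3.6.0-jdk-8-alpine AS builder
RUN apk add --no-cache git
RUN git clone https://github.com/OpenRefine/OpenRefine.git
RUN OpenRefine/refine build

#Stage 2: start again from a smaller JRE-only image (assumption: a JRE is enough)
FROM openjdk:8-jre-alpine
#Assumption: the refine launcher script needs bash
RUN apk add --no-cache bash
#Copy just the built application tree from the first stage
COPY --from=builder /OpenRefine /OpenRefine
RUN mkdir /mnt/refine
VOLUME /mnt/refine
EXPOSE 3333
CMD ["/OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]
EOF

echo "Wrote Dockerfile.multistage"
```

You could then try building it with something like docker build -f Dockerfile.multistage -t psychemedia/openrefine-slim . (the image name is made up for illustration) and compare the sizes of the two images using docker images.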
So let’s go to that other extreme and look at a Dockerfile for building a container from a specific release/distribution.
The OpenRefine releases page lists all the OpenRefine releases. Looking at the download links for the Linux distribution, the URLs take the form:
https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz
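To make that URL pattern concrete, here is a small shell sketch that fills in the release tag. The function name openrefine_url is made up for illustration; the pattern itself is just the one shown above.

```shell
#!/bin/sh
# Build the Linux download URL for a given OpenRefine release tag.
# openrefine_url is a hypothetical helper name, not part of OpenRefine.
openrefine_url() {
    release="$1"
    echo "https://github.com/OpenRefine/OpenRefine/releases/download/${release}/openrefine-linux-${release}.tar.gz"
}

# Print the URLs for a couple of release tags
openrefine_url 3.1
openrefine_url 3.1-beta
```

The release tag appears twice, once in the path and once in the filename, which is why it is handy to hold it in a single variable.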
So how do we install an OpenRefine server from a distribution file?
#We can use the smaller JRE rather than the JDK
FROM openjdk:8-jre-alpine as builder
MAINTAINER tony.hirst@gmail.com

#Download a couple of required packages
RUN apk update && apk add --no-cache wget bash

#We can pass variables into the build process via --build-arg variables
#We name them inside the Dockerfile using ARG, optionally setting a default value
ARG RELEASE=3.1

#ENV vars are environment variables that get baked into the image
#We can pass an ARG value into a final image by assigning it to an ENV variable
ENV RELEASE=$RELEASE

#There's a handy discussion of ARG versus ENV here:
#https://vsupalov.com/docker-arg-vs-env/

#Download a distribution archive file
RUN wget --no-check-certificate https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz

#Unpack the archive file and clear away the original download file
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz && rm openrefine-linux-$RELEASE.tar.gz

#Create an OpenRefine project directory
RUN mkdir /mnt/refine

#Mount a Docker volume against the project directory
VOLUME /mnt/refine

#Expose the server port
EXPOSE 3333

#Create the start command.
#Note that the application is in a directory named after the release
#We use the environment variable to set the path correctly
CMD openrefine-$RELEASE/refine -i 0.0.0.0 -d /mnt/refine
We can now build an image of the default version as baked into the Dockerfile:
docker build -t psychemedia/openrefinedemo .
Or we can build against a specific version:
docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .
To peek inside the container, we run it and jump into a bash shell inside it:
docker run --rm -i -t psychemedia/openrefinedemo /bin/bash
We run the container as before:
docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo
Useful?
PS Note that when running an OpenRefine container on something like Digital Ocean using the default OpenRefine memory settings, you may have trouble starting OpenRefine on machines smaller than 3GB. (I’ve had some trouble getting it started on a 2GB server.)
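One thing that may help on smaller machines is to lower the Java heap size. As far as I know, the refine launcher script accepts a -m flag for the maximum heap (check refine -h for your release), so a sketch of a run command that overrides the image's default start command might look like the following. The 1024M value is an arbitrary illustration, not a tested recommendation, and the /openrefine-3.1 path assumes the release image built above with the default RELEASE. The snippet only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Assemble a docker run command that overrides the default CMD
# and passes a smaller heap size to the refine launcher script.
RELEASE="3.1"        # must match the release baked into the image
REFINE_MEM="1024M"   # illustrative value only

run_cmd="docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo /openrefine-${RELEASE}/refine -i 0.0.0.0 -d /mnt/refine -m ${REFINE_MEM}"

# Print the command rather than executing it
echo "$run_cmd"
```

Anything given after the image name in docker run replaces the CMD from the Dockerfile, which is why the full refine command line has to be repeated here.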