Running OpenRefine On Digital Ocean Using Simple Auth

As a cloud host, Digital Ocean provides a really easy way in to getting services up and running on the web.

Here’s a quick recipe for getting OpenRefine up and running behind a simple authentication scheme.

Creating a Server With Digital Ocean

First up, create a Digital Ocean account if you don’t already have one (this link will get you started with $100 free credit).

Creating and launching a server is easy… Select Create then Droplet, choose the server type you want (let’s use a simple Ubuntu box) and choose the server size. For lots of quick tests, I use the smallest box, but from experience I think OpenRefine prefers more than 2GB of memory.

Really… If you pick a 2GB server, you may find that OpenRefine hangs on start and ruins your day when you end up trying to debug what you think are other problems… Be warned, kids… stay unfrustrated out there…

The servers are charged at a metered rate, and you can stop them any time, so for a quick test, it’ll cost you pennies… (A $100 credit can go a long way…!)

Next, choose a data center region; I generally pick a local one…

You also have the option of adding an ssh key. This makes life much easier when trying to log in to the server from your own machine using ssh (you can just run ssh root@IP_ADDRESS and it’ll log you straight in; there’s a recipe for setting up an SSH key here).
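If you don’t already have a key pair, something like the following will create one on your own machine. This is a minimal sketch rather than the recipe linked above: make_ssh_key is a hypothetical helper name, the key path and comment are placeholders, and the empty passphrase is only there to keep the example non-interactive.

```shell
#!/bin/sh
# Generate a passphrase-less ed25519 key pair at the given path (hypothetical helper).
# An empty passphrase keeps the sketch non-interactive; use a real one for anything that matters.
make_ssh_key() {
  key_file="$1"
  mkdir -p "$(dirname "$key_file")"
  ssh-keygen -q -t ed25519 -N "" -C "digitalocean-droplet" -f "$key_file"
}

# Example usage (paths are illustrative):
# make_ssh_key "$HOME/.ssh/do_openrefine"
# cat "$HOME/.ssh/do_openrefine.pub"        # paste this into the Digital Ocean SSH key form
# ssh -i "$HOME/.ssh/do_openrefine" root@IP_ADDRESS
```

The public half (the .pub file) is what you give to Digital Ocean; the private half stays on your machine.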

If you don’t want to set up an SSH key, a root password will be emailed to you. You can use this password to log in to your server via a web terminal, which means you can do everything via a web UI if you need to…

Create your server by clicking the big green button…

It should only take a few seconds to start up… And when it has, you’ll be presented with its public IP address.

If you need a web terminal, click through on the server name, and you should see a link to launch a web console.
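Once you have a console open, it can be worth confirming how much memory the box really has before going any further. A minimal sketch, assuming a Linux server that exposes /proc/meminfo; total_mem_mb is a hypothetical helper name and the 2048MB threshold is just my rule of thumb from above, not an official OpenRefine requirement.

```shell
#!/bin/sh
# Report total RAM in MB by reading the MemTotal line (given in kB) from /proc/meminfo.
total_mem_mb() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Assumed threshold: the "more than 2GB" rule of thumb, not an official requirement.
MIN_MB=2048
mem_mb=$(total_mem_mb)

if [ "$mem_mb" -le "$MIN_MB" ]; then
  echo "Only ${mem_mb}MB of RAM: OpenRefine may struggle to start"
else
  echo "${mem_mb}MB of RAM: should be enough for OpenRefine"
fi
```

Note that a nominal 2GB server will typically report a little under 2048MB, so it lands on the warning side of the check.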

Installing OpenRefine

From the console, we can install all we need to run OpenRefine. This is a minimum viable example — we should probably find a better place to install OpenRefine, and may want to run it as a particular user with limited permissions. Working as root with everything wide open makes life easier, though not necessarily safer…!

OpenRefine requires a Java environment, so we need to install that:

apt-get update && apt-get install -y openjdk-8-jre
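To check that the install worked, you can ask the shell whether java is now on the path. A small sketch; have_cmd is a hypothetical helper name, not part of any package.

```shell
#!/bin/sh
# Return success if the named command is on the PATH (hypothetical helper).
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd java; then
  # java -version writes to stderr, so redirect it in order to show the first line.
  java -version 2>&1 | head -n 1
else
  echo "java not found: install a JRE before starting OpenRefine"
fi
```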

We can download the OpenRefine application via the command-line using a command of the form wget -q -O DOWNLOADED_FILE_NAME URL; the download link for each release can be found on the OpenRefine releases page:

wget -q -O openrefine-2.8.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/2.8/openrefine-linux-2.8.tar.gz

The downloaded file is provided as a tar archive file, which we need to unpack:

tar xzf openrefine-2.8.tar.gz

This will unbundle the files into the directory ./openrefine-2.8.
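If you want to try a different release, the version number can be pulled out into a variable. The sketch below just builds the download URL from the 2.8 link pattern used above; openrefine_url is a hypothetical helper, and I’m assuming other releases follow the same naming scheme, which you should check against the releases page.

```shell
#!/bin/sh
# Build the download URL for a given release, following the 2.8 link pattern used above.
# Assumption: other releases follow the same naming scheme; check the releases page to confirm.
openrefine_url() {
  version="$1"
  echo "https://github.com/OpenRefine/OpenRefine/releases/download/${version}/openrefine-linux-${version}.tar.gz"
}

OPENREFINE_VERSION="2.8"
echo "Download URL: $(openrefine_url "$OPENREFINE_VERSION")"

# On the server, the download and unpack steps then become:
# wget -q -O "openrefine-${OPENREFINE_VERSION}.tar.gz" "$(openrefine_url "$OPENREFINE_VERSION")"
# tar tzf "openrefine-${OPENREFINE_VERSION}.tar.gz" | head   # list the contents without unpacking
# tar xzf "openrefine-${OPENREFINE_VERSION}.tar.gz"
```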

Let’s define a shell variable that points to a working directory in which to place the OpenRefine project files:

OPENREFINE_DIR="$HOME/openrefine"

And create that directory:

mkdir -p "$OPENREFINE_DIR"

You should now be able to run OpenRefine in the background using the command:

nohup openrefine-2.8/refine -p 3333 -d "$OPENREFINE_DIR" > /dev/null 2>&1 &

This will run OpenRefine on port 3333. If you copy the IP address of your Digital Ocean server, and go to http://IP.ADDRESS:3333, you should see OpenRefine running there. (Note that if you’re in Chrome, Google may well tell you that the address is dangerous…)
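Because the launch command throws its output away, there’s nothing on screen to tell you when OpenRefine has finished starting. One way round that is to poll until the service answers. A minimal sketch: wait_until is a hypothetical helper that retries any command once a second, and the commented example assumes curl is installed on the server.

```shell
#!/bin/sh
# Retry a command once a second until it succeeds or the timeout (in seconds) expires.
wait_until() {
  timeout="$1"
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Example (assumes curl is installed): wait up to 60s for OpenRefine to answer locally.
# wait_until 60 curl -sf -o /dev/null http://127.0.0.1:3333/ && echo "OpenRefine is up"
```

If the wait times out, that’s a hint to go back and check the memory on the box before debugging anything else.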

Adding Simple Authentication

The OpenRefine server is running as a public service on the public web. If you want to add a simple layer of authentication, you can add a web server proxy to the server that will prompt for a password when a new visit is made to the server.

One of the easiest proxies to get up and running is nginx. Let’s install it, along with a simple Apache toolkit that will help us create a simple password:

apt-get install -y nginx apache2-utils

Now create a simple user/password combination. Ever secure, I’ll go with user test and password letmein:

htpasswd -b -c /etc/nginx/.htpasswd test letmein
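For anything you intend to leave running, a generated password is a better bet than letmein. A small sketch, assuming /dev/urandom is available; random_password is a hypothetical helper name, and the htpasswd call in the comment is the same one as above with the generated value swapped in.

```shell
#!/bin/sh
# Generate a random alphanumeric password of the requested length from /dev/urandom.
random_password() {
  length="${1:-20}"
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
}

PASSWORD=$(random_password 20)
echo "Generated password: $PASSWORD"

# Then, on the server (same htpasswd call as above, with the generated password):
# htpasswd -b -c /etc/nginx/.htpasswd test "$PASSWORD"
```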

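Now we need to define the proxy. A Digital Ocean tutorial (How To Install Nginx on Ubuntu 18.04) describes how to set up a firewall. I’m selecting the 'Nginx Full' option, which opens ports 80 and 443; the more restrictive 'Nginx HTTP' option, which opens port 80 only, would also do here. The rule only has an effect if the ufw firewall is enabled (check with ufw status), and if you enable it while working over SSH, allow OpenSSH first (ufw allow OpenSSH) so that you don’t lock yourself out: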

sudo ufw allow 'Nginx Full'

If you try loading the OpenRefine server on port 3333, you should now find that it’s blocked: the firewall is only letting web traffic through on the standard web ports (80, plus 443 if you chose 'Nginx Full').

We now need to open access back up to the OpenRefine server, albeit via a password challenge. The following will create a default nginx configuration file that will expose the OpenRefine service running on port 3333 via default http port 80, mediated by a simple authorisation challenge:

config='''
server {
  listen 80;
  auth_basic Protected...;
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://127.0.0.1:3333;
  }
}
'''

echo "$config" > /etc/nginx/sites-available/default
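An equivalent way of writing the file is with a here-document, which also makes it easy to parameterise the port. The sketch below writes to a scratch file first so you can look at it before copying it into place; CONF_FILE and REFINE_PORT are names I’ve made up for the example, and the quoted "Protected" realm string is an assumption standing in for whatever prompt text you want.

```shell
#!/bin/sh
# Write the proxy configuration to a scratch file first (CONF_FILE is an assumed, illustrative path).
CONF_FILE="${CONF_FILE:-/tmp/openrefine-nginx.conf}"
REFINE_PORT=3333

# Unquoted EOF so that ${REFINE_PORT} is expanded by the shell.
cat > "$CONF_FILE" <<EOF
server {
  listen 80;
  auth_basic "Protected";
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://127.0.0.1:${REFINE_PORT};
  }
}
EOF

echo "Wrote $CONF_FILE"
# On the server, copy it into place and check the syntax before reloading:
# cp "$CONF_FILE" /etc/nginx/sites-available/default && nginx -t && nginx -s reload
```

Running nginx -t before the reload means a typo in the file gets reported rather than silently ignored.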

Reload the nginx proxy to put the new settings into effect:

nginx -s reload

If you now go to http://IPADDRESS you should be presented with a challenge. Enter the credentials you defined, and you should see your OpenRefine server.
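You can also check the challenge from the command line rather than the browser. A rough sketch, assuming curl is installed on the machine you run it from; http_status and explain_status are hypothetical helper names, and IPADDRESS is a placeholder for your droplet’s address.

```shell
#!/bin/sh
# Translate an HTTP status code into what it means for this setup.
explain_status() {
  case "$1" in
    401) echo "password challenge issued" ;;
    200) echo "access granted" ;;
    000) echo "no response from server" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Fetch only the status code for a URL, optionally with user:password credentials (assumes curl).
http_status() {
  url="$1"
  creds="${2:-}"
  if [ -n "$creds" ]; then
    curl -s -o /dev/null -w '%{http_code}' -u "$creds" "$url"
  else
    curl -s -o /dev/null -w '%{http_code}' "$url"
  fi
}

# Example, with IPADDRESS replaced by the droplet address:
# explain_status "$(http_status http://IPADDRESS/)"               # expect a challenge
# explain_status "$(http_status http://IPADDRESS/ test:letmein)"  # expect access
```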

Finishing Up

When you’ve finished your session, you can destroy the droplet. This will tear the server down and you won’t be billed for it anymore.

Alternatively, you can switch the droplet off, but keep it in a shutdown state that you can restart in the future.

However, as Digital Ocean’s power-off prompt points out, you will continue to be billed, even if the service is not running, because the droplet is still consuming Digital Ocean resources.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
