Keeping Up With OpenRefine – Database Connections

It’s been a few months since I last checked out updates to OpenRefine, but reading a (completed) phase 1 project plan associated with some funding the OpenRefine Foundation received from Google News Labs, it looks like database support is on the cards.

Database Table import/export – COMPLETED

Historically, OpenRefine has been limited compared to other data tools in that it does not have a way to connect to a database table. This is especially useful at export time, when there is a need to save a cleaned CSV for example into a database table. Importing from a database is useful also. It can help to join clean data in a database table against messy data in OpenRefine, in order to clean and prepare it for use. Database Drivers exist for many databases such as Oracle, MySQL, Postgres, and even many schema-less databases such as MongoDB. Most database drivers use JDBC which makes it easier for us to develop against, and others typically use a custom Java driver that sometimes is non-trivial to integrate with. Since OpenRefine is built with Java this should be relatively straightforward to utilize existing JDBC drivers for our import/export operations and for support of MongoDB there is a Java driver available.

Looking through the repo, it looks like there are a couple of related PRs:

I’m not sure about the export to a db?

The tests suggest drivers are in place for PostgreSQL, MySQL and MariaDB:

public class DatabaseTestConfig extends DBExtensionTests {

    private DatabaseConfiguration mysqlDbConfig;
    private DatabaseConfiguration pgsqlDbConfig;
    private DatabaseConfiguration mariadbDbConfig;
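
For a feel of what "utilizing existing JDBC drivers" for an import might look like, here is a minimal sketch of my own. It is not taken from the OpenRefine extension: the class name, the `jdbcUrl` and `readRows` helpers, and the host, database, table and credentials are all made up for illustration. The only things borrowed from the real world are the standard `java.sql` API and the usual `jdbc:<subprotocol>://host:port/database` URL shape used by the PostgreSQL, MySQL and MariaDB drivers.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the OpenRefine codebase.
class JdbcImportSketch {

    // Build a JDBC URL of the common form jdbc:<subprotocol>://host:port/database.
    // "postgresql", "mysql" and "mariadb" are the subprotocols those drivers expect.
    static String jdbcUrl(String subprotocol, String host, int port, String database) {
        return "jdbc:" + subprotocol + "://" + host + ":" + port + "/" + database;
    }

    // Read every row returned by a query into a grid of strings,
    // roughly the shape a data cleaning tool might start from.
    static List<List<String>> readRows(Connection conn, String query) throws SQLException {
        List<List<String>> rows = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            ResultSetMetaData meta = rs.getMetaData();
            int columns = meta.getColumnCount();
            while (rs.next()) {
                List<String> row = new ArrayList<>(columns);
                for (int i = 1; i <= columns; i++) { // JDBC columns are 1-indexed
                    row.add(rs.getString(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed connection details, purely for illustration.
        String url = jdbcUrl("postgresql", "localhost", 5432, "newsroom");
        System.out.println(url);
        try (Connection conn = DriverManager.getConnection(url, "refine", "secret")) {
            for (List<String> row : readRows(conn, "SELECT * FROM messy_table")) {
                System.out.println(row);
            }
        } catch (SQLException e) {
            // Raised when no driver is on the classpath or no server is listening.
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```

Swapping database would then mostly be a matter of changing the subprotocol and putting the matching driver jar on the classpath, which is presumably why the tests carry one configuration per database.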

It also looks like an upgrade to the internal data representation may be under consideration: "Research Apache Arrow to improve in-memory data model". FWIW, I think Apache Arrow really is one to watch.

Via the OpenRefine Google Group, I also noticed a couple of references to future planned activity / roadmap items:

Phase 2

Front / Backend separation

Scope: completely separating the backend so that an full API can be exposed for all OpenRefine operations and commands. Once the decoupling done, we can move to a modern front end framework and
Deliverable: Functional and documented API covering all the commands available in OpenRefine 3 front end.

Phase 3
R Lang support
Work with community to bring support for R lang via an extension.
https://github.com/OpenRefine/OpenRefine/issues/1226
There is significant use of statistics within News Organizations where the goal of minimizing the back and forth between R tooling and OpenRefine would be explored and assessed by the community.

rrefine is around and needs investigation – https://github.com/vpnagraj/rrefine

Hmmm… rrefine?

rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby. rrefine extends this functionality to R users.

Okay – that makes me think of the OpenRefine Python Client Library?
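
As far as I can tell, these client libraries all work by making HTTP requests against the running OpenRefine server, so here is a rough sketch of the idea in Java. The class and method names are my own invention. The `/command/core/get-all-project-metadata` path and the default `127.0.0.1:3333` address are what I understand the stock OpenRefine server to expose, but treat both as assumptions to check against whatever version you are running.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: not part of rrefine or any official OpenRefine client.
class RefineClientSketch {

    // Build a GET request asking the server to list its projects.
    // The command path is an assumption about the OpenRefine HTTP API.
    static HttpRequest listProjectsRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/command/core/get-all-project-metadata"))
                .GET()
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed default address of a locally running OpenRefine.
        HttpRequest request = listProjectsRequest("http://127.0.0.1:3333");
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The body should be a JSON description of the available projects.
            System.out.println(response.body());
        } catch (IOException e) {
            System.out.println("No OpenRefine server reachable: " + e.getMessage());
        }
    }
}
```

Importing, exporting or deleting a project would follow the same pattern with different commands and POSTed parameters, which I haven't sketched here.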

But how about that “Edit cells > Transform > Language support for R” issue (#1226)? “This is a feature-request to add R support in Edit cells > Transform > Language.”

That fits in with an earlier thought I had along the lines of “what if OpenRefine was a Jupyter client?” In an imagining frame of mind, this seems to me to offer a couple of potential benefits:

  • if the Transform > Language utility supports hooks into a Jupyter kernel and exposes an executable code cell onto that (state persisting) kernel, and the data can be transferred efficiently using serialisations like feather or deeper hooks into Apache Arrow representations that might be supported in R or Python pandas, then any language with a Jupyter kernel could be used for transformations?
  • if OpenRefine was exposed as a panel in Jupyterlab, which it presumably could be simply by embedding the HTML UI in an IFrame, then it could have a role as part of the look and feel of a single working environment, even if it was only loading and saving CSV files into the environment workspace.

But then let’s imagine something a bit more extreme (I’m not sure if / how this might fit into the Jupyterlab architecture, indeed whether it’s possible or just imagined magic, I’m just riffing…): if the data being manipulated within OpenRefine could be synched with a representation of the data being manipulated elsewhere in the Jupyterlab environment, then we could be viewing a dataset in one panel (Jupyterlab has crazy efficient support for viewing large datafiles), manipulating it in an OpenRefine panel, and running analysis scripts over it in a third. The reticulate package suddenly comes to mind here as an example of accessing data objects from one environment in another.

It also strikes me that use cases of the data represented in OpenRefine reflecting updates to the data from the analysis environment are less likely. The analysis should be operating on data after it has been cleaned, rather than passing it to OpenRefine?

PS by the by, if you want to run OpenRefine using the Jupyter ecosystem Binderhub machinery, here’s a proof of concept from @betatim: openrefineder.
