OpenRefine Database Connections in MyBinder

With the version 3.0 release of OpenRefine last year, database integration was introduced that allows data to be imported into OpenRefine from a connected database, or exported to a downloadable SQL datadump. (It doesn’t look like you can save/export data to a new database table in the connected database, or upsert the contents of a cleaned table). This was the release of the  OpenRefine Database Import Extension and the SqlDump export mentioned in this earlier post.

If you want to try it out, I’ve created a MyBinder / repo2docker configuration repo that will launch a MyBinder repo containing both a running OpenRefine server and a running PostgreSQL server, although the test table is very small…

For how to run Postgres in a MyBinder container, see Running a PostgreSQL Server in a MyBinder Container.

Start in OpenRefine client: Binder

Details are:

  • host: localhost
  • Port: 5432
  • User: testuser
  • Password: testpass
  • Database: testdb

There’s also a tiny seeded table in the database called quickdemo from which we can import data into OpenRefine:

I said it was a small table!

The rest of the db integration — SQL export — is described in the aforementioned post on OpenRefine’s SQL integration.

I have to admit I’m not sure what the workflow is? You’d typically want to put clean data into a database, rather than pull data from a database into OpenRefine for cleaning.

If you are using OpenRefine as a data cleaning tool, it would be useful to be able to export the data directly back into the connected database, either as an upserted table (as well as perhaps some row deletions) or as a new ..._clean table (“Upsert to database…”).

If you’re using OpenRefine as a data enrichment tool, being able to create a new, enriched table back in the connected database (“Export to database…”) would also make sense.

One of the things I’ll add to the to-do list is an example of how to export data from OpenRefine and then import it into the database using a simple Jupyter notebook script (a Jupyter notebook server is also running in the  MyBinder container (just delete the openrefine/ path from the MyBinder URL).

One of the new (to me) things I’ve spotted in OpenRefine 3 is the ability to export a Project Data Package. I mistakenly thought this might be something like a Frictionless Data data package format, but it looks to just be an export format for the OpenRefine project data? There are fields for import settings as well as descriptive metadata, but I don’t see any dialogues in the UI where you’d enter things like creator, contributors or description?

{
  "name": "clipboard",
  "tags": [],
  "created": "2019-02-09T22:13:42Z",
  "modified": "2019-02-09T22:14:31Z",
  "creator": "",
  "contributors": "",
  "subject": "",
  "description": "",
  "rowCount": 2,
  "title": "",
  "homepage": "",
  "image": "",
  "license": "",
  "version": "",
  "customMetadata": {},
  "importOptionMetadata": [
    {
      "guessCellValueTypes": false,
      "projectTags": [
        ""
      ],
      "ignoreLines": -1,
      "processQuotes": true,
      "fileSource": "(clipboard)",
      "encoding": "",
      "separator": ",",
      "storeBlankCellsAsNulls": true,
      "storeBlankRows": true,
      "skipDataLines": 0,
      "includeFileSources": false,
      "headerLines": 1,
      "limit": -1,
      "quoteCharacter": "\"",
      "projectName": "clipboard"
    }
  ],
  }
}

One of the column operations you can perform inOpenRefine is to cast columns to text, dates or numerics, but I don’t think that is saved as metadata anywhere? You can also define column types in the SQL exporter, but again, I’m not sure that then becomes project metadata. It’d be good to see these things unified a bit, and framing such a process in terms of supporting a tabular data package (with things like column typing specified) could be useful.

Another foil for this might be supporting a SQLite export format?

I have to admit I’m a bit confused as to how OpenRefine sits where in different workflows, particularly with data that is managed, and as such is most likely to be stored in some sort of database? (Lots of the OpenRefine tooling still harkens to a Linked Data future, so maybe it fits better in Linked Data workflows?). I also get the feeling that it shares a possible overlap with query engine tools such as Apache Drill, and maybe even document data extraction tools such as Apache Tika or Tabula. Again, seeing demonstrated toolchains and workflows in this area could be interesting.

Note to self: there are several other PDF table extractor tools out there alongside Tabula (Java) that I haven’t played with; eg R/pdftools, Python/Camelot and Python/pdfplumber.

Simon Willison is doing all sorts of useful stuff framing datasette as a datasette / SQLite ecosystem play. It could be useful to think a bit more about OpenRefine in terms of how it integrates with other data tools. For example, the X-to-sqlite tools help you start to structure variously formatted data sources in terms of a common SQLite representation, which can naturally incorporate things like column typing, but also the notion of database primary and foreign key columns. In a sense, OpenRefine provides a similar “import from anything-export to one format (CSV)” with a data cleaning step in the middle, but CSV is really informally structured in terms of its self-descriptive representation.

One of the insights I had when revising our TM351 relational database notebooks was that database table constraints can play a really useful role when helping clean a dataset by automatically identifying things that are wrong with it… I’ll maybe try to demonstrate an OpenRefine / Jupyter notebook hybrid workflow around that too…

By the by, I noticed this post the other day Exploring the dystopian future of a Javascript Gephi. Gephi, like OpenRefine, is a Java app, and like OpenRefine is one I’ve never been tempted to doodle with code wise for a couple of reasons: a) Java doesn’t appeal to me as a language; b) I don’t have a Java environment to hand, and the thought of trying to set up an environment, and all the build tools, as a novice, for a complex legacy project just leaves me cold. As the Gephi developers see it, “[w]e have to face it: the multiplatform is moving from Java to web technologies. Oracle wants a Java that powers backends, not a user interface framework.”

I’ve dabbled with OpenRefine off and on for years now, and whiles its browser accessibility is really handy, the docs could do with some attention (I guess that’s something I could make a positive contribution to). Also, if it was a Python, rather than Java, application, I’d be more comfortable with it and would possibly start to poke around inside it a bit….

I guess one of the things can do (though I’ve never really had to push it) is scale with larger datasets, although the memory overhead may then become an issue? I the the R/Pandas crossover folk have been doing a lot of work on efficient datatable representations and scaleable tabular data interchange formats, and I’m not sure if OpenRefine is/will draw on any of that work?

It’s also been some time since I looked at Workbench, (indeed, I haven’t really looked at it since I posted an early review), but a quick peek at the repo shows a fair amount of activity. Maybe I should look at it again…?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...