Quick Notes – OpenRefine – Working With Databases

It seems as if the latest version of OpenRefine – v. 3.0 – is out in beta ([repo]) and it looks at if some integration with databases is provided.

On the way in, you can connect to a database and import data from it into OpenRefine directly:

Looks interesting, though part of me wonders “why?”: if the data is in the database then presumably you’ve started to clean it anyway. One of the things I’ve realised working on some revisions for our TM351 data management and analysis course is that you can use table constraints as part of a data cleaning workflow to flag up errors in your data. It could be interesting to imagine how constraint violation errors might be used to support data cleaning activities in an application such as OpenRefine. (In a sense, you might think of applying facets to a column as the application of rules or constraints that can identify rows containing data that violates them.)

On the way out is the option to export a project as a SQL file. This includes defining column types:

and also configuring other aspects of the SQL export file, such as exporting table definitions, either with or without the data, and optionally including DROP and IF EXISTS statements:

Here’s an example of the sort of thing it can generate:

DROP TABLE IF EXISTS clipboard;
CREATE TABLE clipboard (
a INT NULL,
b VARCHAR(12) NULL,
c INT NULL
);
INSERT INTO clipboard (a,b,c) VALUES
( 1,'e',1 ),
( 2,'w',3 ),
( 2,'tg',56 )

Going back to the notion of constraints, if an exportable SQL table is defined with particular column types, it would be handy to be able to use the type definitions to check the data being exported. For example, having set the types, apply those types as exclusive facet rules on the relevant columns and then raise an alert and/or preview the violating rows if any rows are returned that break the type rules.

If such checking was possible, you could start to imagine round tripping on the data, loading in a table definition and using it to check some aspects of a dataset loaded alongside it (such as data types, uniqueness etc.) Which reminds me, the SQL exporter doesn’t allow seem to support UNIQUE or PRIMARY KEY definitions?

Author: Tony Hirst

I'm a lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...