Linear Cell Based Data Cleaning and Analysis Workflows With Workbench

[Things have moved on since this post was published… For a review of Workbench as of 2021, see Jon Udell’s more recent post A beautiful power tool to scrape, clean, and combine data]

One of the nice features of data cleaning tool OpenRefine is that whilst it is a spreadsheet like, GUI driven tool, you can view the (linear) history of operations you applied to a particular dataset, as well as exporting them:

The exported operations history can then be imported into another project with another dataset oft he same form, which means you can use OpenRefine to clean multiple documents in the same way. (The history is exported as a JSON file, which you can edit, for example by removing reordering operations. I’m not sure if operations can be reordered in OpenRefine, although they can be undone in reverse application order.)

The exported operations sidebar – and export file – thus provide a way of documenting, in part, a set of data cleaning operations, although you can’t annotate or comment the actions.

Via Owen Stephens, a long-time hard-core OpenRefine user, I was tipped off to another data cleaning tool, Workbench. An open source project from Columbia Journalism School (repo), this is being developed as a tool for data journalists, the history of operations is displayed up front and centre (well, to the side, but always in view) as a project workflow:

Clicking on any step in the workflow reflows the displayed data to the output of that step – an arrow indicates from which step in the workflow the central display is derived.

The workflow can also be documented by adding (optional) comments immediately before a particular step in the workflow:

The range of possible operations, or modules as they are referred to in Workbench, is impressive [module docs]. To start with, a range of data import modules are provided:

The file uploader can upload CSV or Excel files, though I’m not sure how really crappy spreadsheets or multi-sheet Excel files are supposed to be handled (it didn’t seem to work for me)? (The Join Workflow option also looks intriguing…)

As well as some simple scraping support:

There is a good selection of data cleaning operations:

As well as cleaning steps, a range of analysis modules can also be added to the workflow:

The Select SQL temporarily pops the data into a SQLite database (I’m guessing….) and lets you then run SQL queries on it. One thing perhaps lacking from the import options is an ‘import from SQL’ option (OpenRefine has recently added support for database connections).

Finally, there is a set of visualisation modules than can be used to generate simple charts:

In OpenRefine, the templated export facility provides a way of converting rows to text based summary reports, and offhand I don’t spot a correlate of that in the provided Workbench modules.

However, there is a more general cell that may provide a way of doing this:

although that didn’t seem to work for me?

It also looks like you can create and import your own modules from Github:

The modules are defined using a JSON configuration file (module metadata and parameters), and a Python script (which does the actual data processing) [see the docs for more details]. This makes me think Workbench might provide a useful playground for creating simple textualisers as custom modules designed to work with particular sorts of dataset (note to self: revisit pytracery… ).

By default, projects are private, although they can be shared to a public URL. Public workflows can also be cloned (Duplicate), which means you can grab someone else’s workflow and make it your own.

One other nice feature of the environment is that it supports “tutorials” as well as “workflows”, which allow you to add a further scaffolding sidebar that talks someone through the creation of their own worfklow:

Poking around the repo (CJWorkbench/cjworkbench/server/lessons), it looks as if separate lessons are each defined using a single HTML file.

As an open source project, instructions are also provided to set up and run your own Workbench development environment, deployed as a set of Docker containers.

This got me thing it would be nice to have a simple “single user” Docker setup that would provide me with a temporary workbench that I could use as a disposable (stateless) application, preserving the workflow by exporting it rather than saving it to a personal Workbench account. (Hmm… thinks: I couldn’t see a way to export workflows?)

Under the hood, Workbench seems to use pandas and provides a useful, and powerful, user interface abstraction over it.

On a quick look, I didn’t spot any Jupyter machinery. However, the workflow panel is very much notebook style (text and code cells, with text cells tied to successor code cells, and a single selected cell output view) and gets around the hidden state problem in notebooks by forcing a linear execution model, reactive output display for a particular selected cell.

All in all, a really neat application, although like Owen, who I think defaults to using OpenRefine as his universal hammer, I’ll probably still default to using Jupyter Notebooks as mine.

PS In passing, looking at Workbench got me idly wondering about the graphical flow based Orange graphical data analysis environment, and how learner experiences of Jupyter Notebooks vs Workbench vs Orange might compare?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: