OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Diff or Chop? Github, CSV data files and OpenRefine

A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:

csv diff

With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such a view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.

* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.

So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.

One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:

openrefine history

Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in this case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.

A more elaborate approach may allow us to view diffs between the data file at the current step and the previous step, or the current data file and an arbitrary previous step.
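As a rough illustration of what such a diff view would need to compute, here is a minimal Python sketch that compares two versions of a CSV file cell by cell. It assumes that rows have not been reordered, added or removed between the two versions (a real tool, such as the one described in the OKFNLabs post, has to cope with those cases too); the function name `cell_diff` and the sample data are made up for the example.

```python
import csv
import io


def cell_diff(original_csv, current_csv):
    """Return (row_number, column, old_value, new_value) for each changed cell.

    Assumes both versions share the same header and the same row order.
    """
    old_rows = list(csv.DictReader(io.StringIO(original_csv)))
    new_rows = list(csv.DictReader(io.StringIO(current_csv)))
    changes = []
    for row_number, (old, new) in enumerate(zip(old_rows, new_rows), start=1):
        for column in old:
            if old[column] != new.get(column):
                changes.append((row_number, column, old[column], new.get(column)))
    return changes


# Made-up data: the "original" file and the file as it stands at some later step.
original = "name,city\nAlice,london\nBob,Leeds\n"
cleaned = "name,city\nAlice,London\nBob,Leeds\n"

for change in cell_diff(original, cleaned):
    print(change)
```

Passing in the data file as it stood at any two steps, rather than the original and the current one, would give the step-to-step diff.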

Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:

open refine extract

This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?
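To make the idea of replaying a change history concrete, here is a minimal Python sketch that applies a chops file to a table held as a list of dictionaries. It handles just three operation types, and the operation and field names (`core/column-rename`, `core/column-removal`, `core/mass-edit`, `oldColumnName` and so on) reflect my understanding of what OpenRefine exports, so treat them as assumptions to check against a real export. The function `apply_chops` is hypothetical, and it ignores parts of the real JSON such as facet settings and expressions.

```python
import json


def apply_chops(rows, chops):
    """Apply a list of change operations to rows (a list of dicts); return new rows."""
    rows = [dict(row) for row in rows]  # work on a copy, leave the original alone
    for op in chops:
        kind = op["op"]
        if kind == "core/column-rename":
            old, new = op["oldColumnName"], op["newColumnName"]
            rows = [{(new if key == old else key): value for key, value in row.items()}
                    for row in rows]
        elif kind == "core/column-removal":
            rows = [{key: value for key, value in row.items() if key != op["columnName"]}
                    for row in rows]
        elif kind == "core/mass-edit":
            column = op["columnName"]
            # Map every "from" value onto its replacement.
            mapping = {old: edit["to"] for edit in op["edits"] for old in edit["from"]}
            for row in rows:
                row[column] = mapping.get(row[column], row[column])
        else:
            raise ValueError("Unsupported operation: " + kind)
    return rows


# A made-up chops file: tidy up a city name, then rename the column.
chops_json = """
[
  {"op": "core/mass-edit", "columnName": "city",
   "edits": [{"from": ["london", "LONDON"], "to": "London"}]},
  {"op": "core/column-rename", "oldColumnName": "city", "newColumnName": "town"}
]
"""

original = [{"name": "Alice", "city": "london"}, {"name": "Bob", "city": "Leeds"}]
changed = apply_chops(original, json.loads(chops_json))
print(changed)
```

Only the original data and the chops file need to be stored; the changed file can be regenerated on demand.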

For example, OpenRefine already integrates with Google Docs to allow users to import and export files from that service.

OpenRefine export to Google

So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file)? Note that we might also have to extend the JSON representation, or add another file fragment to the checkin, that associates a particular chops file with the particular checked-out version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files and their associated chops files?)
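One way such an association might be recorded is with a small sidecar record that stores the Git blob hash of the data file the chops were applied to. The sketch below is only an illustration: the record layout and the names `chops_manifest`, `data_blob_sha1` and the example file paths are invented here, though the hash itself is computed in the way Git computes blob ids (SHA-1 over a `blob <size>\0` header followed by the file contents).

```python
import hashlib
import json


def git_blob_sha1(content):
    """Return the SHA-1 that Git would assign to a blob holding these bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def chops_manifest(data_path, data_bytes, chops_path):
    """Build a record tying a chops file to one exact version of a data file."""
    return {
        "data": data_path,                            # hypothetical field names
        "data_blob_sha1": git_blob_sha1(data_bytes),  # identifies the checked-out version
        "chops": chops_path,
    }


# Made-up example: the bytes of the data file as checked out.
data_bytes = b"name,city\nAlice,london\nBob,Leeds\n"
manifest = chops_manifest("data/people.csv", data_bytes, "data/people.chops.json")
print(json.dumps(manifest, indent=2))
```

Because the hash identifies the file contents rather than the filename, the record still points at the right version if the file is later renamed or edited.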

Identifying which file or files in a particular Github repository are the actual data files to be pulled into OpenRefine may at first sight appear problematic, but again the ecosystem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)
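A minimal Python sketch of both halves of that idea follows: listing the CSV files named in a data package descriptor, and annotating one of them with a reference to a chops file. The `resources`, `name` and `path` keys come from the data package format; the `chops` key is something I have made up for this example, as are the function names and the sample descriptor.

```python
import json


def list_csv_resources(descriptor):
    """Return (name, path) pairs for resources that look like CSV files."""
    found = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path", "")
        if path.lower().endswith(".csv"):
            found.append((resource.get("name", path), path))
    return found


def annotate_with_chops(descriptor, resource_path, chops_path):
    """Return a copy of the descriptor with a (hypothetical) "chops" key on one resource."""
    updated = json.loads(json.dumps(descriptor))  # cheap deep copy
    for resource in updated.get("resources", []):
        if resource.get("path") == resource_path:
            resource["chops"] = chops_path
    return updated


# A made-up datapackage.json, as it might be fetched from a repository.
descriptor = json.loads("""
{
  "name": "example-package",
  "resources": [
    {"name": "people", "path": "data/people.csv"},
    {"name": "readme", "path": "README.md"}
  ]
}
""")

print(list_csv_resources(descriptor))
updated = annotate_with_chops(descriptor, "data/people.csv", "data/people.chops.json")
print(json.dumps(updated, indent=2))
```

An application could show the first list to the user as a file picker, and write the annotated descriptor back alongside the changed data file.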

And as far as ecosystems go, there are other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.

Written by Tony Hirst

August 27, 2013 at 9:49 am

Posted in OpenRefine


3 Responses


  1. Hmm… thinks… Yahoo Pipes publish JSON descriptions of themselves, which might also be thought of as chops files. Libraries like pipe2py – http://blog.ouseful.info/2010/09/30/yahoo-pipes-code-generator/ – “compile” these descriptions and allow the same change operations to be run on feeds outside of Yahoo Pipes. (That is, Yahoo Pipes can be used as the authoring environment.)

    I keep thinking that a Python script for executing OpenRefine chops files would be useful. Maybe there is an opportunity to define an extensible chops definition language?

    Tony Hirst

    August 27, 2013 at 10:21 am

  2. SutoCom

    August 27, 2013 at 10:28 am

  3. I know this might not be the appropriate place to post this question, but I could not tell where else it could go.

    So I am having what could be a tiny issue with collecting news about Syria. I am working on a news research project and I cannot tell exactly what the best way is to track the unfolding of a news story. My main concern is to make easy-to-read charts comparing “who published first” for a news story. This is mainly for verification reasons, as there is a lot of news out there about Syria and most of it could be repetition and copy and paste, nothing more. The problem is that the original source often tends not to be a credible one.

    So if I can hear any advice or suggestions in this regard I would be thankful. I tried some comparisons using Google News filters and so on, along with Topsy to track Twitter posts, but that doesn’t seem to be working well enough for me.

    Any notes on this?

    Thanks

    Bilal Zaiter

    September 5, 2013 at 12:41 pm

