Diff or Chop? Github, CSV data files and OpenRefine
A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:
With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.
* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.
So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.
One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:
Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in the case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.
A more elaborate approach may allow use to view diffs between the data file at the current step and the previous step, or the current data file and an arbitrary previous step.
Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:
This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?
For example, OpenRefine already integrates with Google Docs to allow users to import and export file from that service.
So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file?). Note that we might also have to extend the JSON representation, or add another file fragment to the checking, that associates a particular chops file with a particular checkout version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files their associated chops files?)
For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosytem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)
And as far as ecosystems go, there are already other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.