Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.
To generalise things a little, I have labelled the circuits as applications in the figure. But you can think of them as circuits if you prefer.
A piece of custom digital circuitry that can talk to both original circuits, and translate the outputs of each into a form that can be used as input to the other, is placed between them to take on this interfacing role: glue logic.
Sometimes, we might not need to transform all the data that comes out of the first circuit or application:
This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some mysterious and magical thing called software: instructions that computer processors follow in order to do incredible things. With an FPGA, the “software” actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change the hardware itself. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)
The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.
So what has this got to do with anything? In a post yesterday, I described a recipe for Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi. The recipe does what it says on the tin… but it actually isn’t really about that at all. It’s about using Google Refine as glue logic, for taking data out of one system in one format (JSON data from the Twitter search API, via a hand assembled URL) and getting it in to another (Gephi, using a simple CSV file).
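Stripped of the Refine user interface, that glue step amounts to something like the following sketch. The search URL and the JSON keys (results, from_user, to_user) are based on my memory of the old Twitter search API, so treat them as assumptions rather than a working recipe: pull down the JSON, pick out who replied to whom, and write that out as an edge-list CSV that Gephi can import.

```python
import csv
import json
import urllib.request

SEARCH_URL = "https://example.com/search.json?q=%23opened12"  # hypothetical endpoint

# grab the JSON search results; the "results" key is an assumption
with urllib.request.urlopen(SEARCH_URL) as response:
    results = json.load(response)["results"]

# write a simple edge list that Gephi's CSV importer understands
with open("conversation_edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])
    for tweet in results:
        # keep only tweets that are replies: link the sender to the person replied to
        if tweet.get("to_user"):
            writer.writerow([tweet["from_user"], tweet["to_user"]])
```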
The Twitter example was a contrived one… the idea of using Google Refine as demonstrated in that example is much more powerful, because it provides you with a way of thinking about how we might decompose data chain problems into simple steps, using particular tools that do certain conversions for you ‘for free’.
Simply knowing that Google Refine can import JSON (and if you look at the import screen, you’ll also notice it can read in Excel files, CSV files, XML files and so on, not just JSON), tidy it a little, and then export some or all of it as a simple CSV file means you now have a tool that might just do the job if you ever need to glue together an application that publishes or exports JSON data (or XML, and so on) with one that expects to read CSV. (Google Refine can also be persuaded to generate other output formats too, not just CSV…)
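To see the same pattern outside Refine, here’s a rough sketch using pandas as the glue (my choice of tool for illustration, not part of the original recipe): read whichever format the upstream application hands you, then export a plain CSV for the downstream one. The file names are made up.

```python
from pathlib import Path
import pandas as pd

# map file extensions to the pandas reader that understands them
READERS = {
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,
    ".xml": pd.read_xml,   # needs a reasonably recent pandas
    ".csv": pd.read_csv,
}

def to_csv(in_path, out_path="glued.csv"):
    """Read in_path with whichever reader matches its extension, write it back out as CSV."""
    reader = READERS[Path(in_path).suffix.lower()]
    reader(in_path).to_csv(out_path, index=False)

# to_csv("pipe_output.json")  # hypothetical input file name
```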
You can also imagine chaining separate tools together. Yahoo Pipes, for example, is a great environment for aggregating and filtering RSS feeds. As well as publishing RSS/XML via a URL, it also publishes a more comprehensive JSON feed. So what? So now you know that if you have data coming through a Yahoo Pipe, you can pull it out of the pipe and into Google Refine, and then produce a custom CSV file from it.
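In code, that chain might look something like the sketch below; the pipe URL shape and the value/items structure are from memory of the old Yahoo Pipes JSON output, so take them as assumptions.

```python
import csv
import json
import urllib.request

# hypothetical pipe id; &_render=json asked Pipes for its JSON output
PIPE_JSON = "https://pipes.yahoo.com/pipes/pipe.run?_id=PIPE_ID&_render=json"

with urllib.request.urlopen(PIPE_JSON) as response:
    items = json.load(response)["value"]["items"]  # assumed structure of the Pipes JSON

# hand the items on to the next tool in the chain as a simple CSV
with open("pipe_items.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link"])
    for item in items:
        writer.writerow([item.get("title", ""), item.get("link", "")])
```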
PS Here’s another really powerful idea: moving in and out of different representation spaces. In Automagic Map of #opened12 Attendees, Scott Leslie described a recipe for plotting out the locations of #opened12 attendees on a map. This involved a process of geocoding the addresses of the home institutions of the participants to obtain the latitude and longitude of those locations, so that they could be appropriately placed. So Scott had to hand: 1) institution names and/or messy addresses for those institutions; 2) latitude and longitude data for those addresses. (I use the word “messy” to get over the idea that the addresses may be specified in all manner of weird and wonderful ways… Geocoders are built to cope with all sorts of variation in the way addresses are presented to them, so we can pass the problem of handling these messy addresses over to the geocoder.)
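For a flavour of that forward step, here’s a minimal sketch using geopy’s Nominatim geocoder as a stand-in for whichever geocoding service Scott actually used; the addresses are purely illustrative.

```python
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="opened12-map-demo")  # hypothetical application name

# illustrative "messy" institutional addresses
messy_addresses = [
    "The Open University, Walton Hall, Milton Keynes",
    "Athabasca University, Athabasca, Alberta",
]

for address in messy_addresses:
    location = geocoder.geocode(address)  # the geocoder copes with the messiness for us
    if location:
        print(address, "->", location.latitude, location.longitude)
```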
In a comment to his original post, Scott then writes: “[O]nce I had the geocodes for each registrant, I was also interested in doing some graphs of attendees by country and province. I realized I could use a reverse geocode lookup to do this.” That is, by virtue of now having the lat/long data, Scott could present these co-ordinates to a reverse geocoding service that takes in a latitude/longitude pair and returns an address for it in a standardised form, including, for example, an explicit country or country code element. This clean data can then be used as the basis for grouping the data by country. The process is something like this:
Messy address data -> [geocode] -> latitude/longitude data -> [reverse geocode] -> structured address data.
Beautiful. A really neat, elegant solution that dances between ways of representing the data, getting it into different shapes or states that we can work with in different ways. :-)
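And here’s a sketch of the reverse leg of the dance, again using geopy’s Nominatim as a stand-in: hand the lat/long pairs back to the geocoder, pull a country out of the structured address it returns, and count attendees per country. The shape of the returned address is whatever that particular service provides, so treat the details as assumptions.

```python
from collections import Counter
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="opened12-map-demo")  # hypothetical application name

latlongs = [(52.0249, -0.7086), (54.7140, -113.2860)]  # illustrative coordinates

countries = Counter()
for lat, lon in latlongs:
    location = geocoder.reverse((lat, lon), language="en")
    if location:
        # Nominatim's raw response includes a structured address dict (an assumption to check)
        countries[location.raw["address"].get("country", "unknown")] += 1

print(countries)  # e.g. attendee counts keyed by country
```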
PPS Tom Morris tweeted this post likening Google Refine to an FPGA. For this to work more strongly as a metaphor, I think we might have to take the JSON transcript of a set of operations in Google Refine as the programming, and then bake it into executable code, as we can do for example with the Stanford Data Wrangler, or, using Pipe2Py, with Yahoo Pipes?
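As a toy illustration of what “baking a transcript into code” might look like, here’s a little interpreter that replays a recorded set of operations over some rows. The operation format below is made up for the purpose, not Refine’s own operation-history JSON, and real Refine transcripts and Pipe2Py do far more than this.

```python
import json

# a made-up, minimal operation-history format, loosely inspired by the idea of a Refine transcript
TRANSCRIPT = json.loads("""
[
  {"op": "rename-column", "old": "from_user", "new": "Source"},
  {"op": "remove-column", "column": "profile_image_url"}
]
""")

def replay(rows, transcript):
    """Apply each recorded operation, in order, to every row (a dict of column: value)."""
    for step in transcript:
        for row in rows:
            if step["op"] == "rename-column":
                row[step["new"]] = row.pop(step["old"], None)
            elif step["op"] == "remove-column":
                row.pop(step["column"], None)
    return rows

rows = [{"from_user": "example_user", "profile_image_url": "http://..."}]
print(replay(rows, TRANSCRIPT))
```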
You might be interested in Flow-Based Programming. See also the Google Group, and a Node.js implementation that’s fun to play with.