[Elements of this post have been largely deprecated since I drafted it a couple of weeks ago, but I’m posting it anyway because this is my open notebook, and as such it has a role in logging the things that are maybe dead ends, as well as hopefully more useful stuff…]
At its heart, Gephi is an environment for visualising graphs (or networks) in which “nodes” are connected to each other by “edges”. Nodes are represented using circles whose size and colour represent particular characteristics of the node. So for example, if you were to visualise your Facebook friends, a node might represent a particular friend, the size of the node might be proportional to the number of friends they have, and the colour to how many photos they have uploaded. Lines between nodes would then show who is a friend of whom. But must we always add the lines between the nodes? If we leave them out, can we effectively use Gephi as a tool for generating charts like the Many Eyes bubble charts?
One of the data import formats supported by Gephi is the gdf format (gdf documentation), which expects a list of node definitions, followed by a list of edge connections. If we ignore the edges, then we can just import a set of node definitions, and create a bubble chart.
As an example of this, let’s see what we can do with some of the Transparency in procurement and contracting information released by the Cabinet Office. As part of the data release, they publish a CSV file containing a summary of all the tender documents held:
Looking at the data, we see that each tender is represented by one or more documents. Each row in the CSV file gives us information about the tender (its project ID, originating department, expected value, expected duration) as well as the particular document. So if we view each tender as a “bubble” or node in Gephi, we might want to represent it as follows:
nodedef> name VARCHAR,label VARCHAR, procid VARCHAR, estVal DOUBLE,estDur DOUBLE,date VARCHAR, dept VARCHAR, desc VARCHAR, nature VARCHAR
402846,"Spring Electoral Events Contact Centre",402846,125000,48,"17/09/2010","Central Office of Information","Invitation to Tender","Competition as part of an existing framework agreement"
"2010CMTLSE00001","Supply of body armour to HMCS","2010CMTLSE00001",250000,48,"24/09/2010","Ministry of Justice","RFQ instructions","Competition as part of an existing framework agreement"
Note that the GDF file requires a particular sort of header, followed by CSV rows of data. It’s easy enough in this case to simply edit the original CSV file by deleting the first line, tweaking the column headers to the name VARCHAR, label VARCHAR… format required by the GDF file, and prefixing the new first row (the header row) with nodedef>.
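For larger files, the same header swap could be scripted rather than done by hand. Here’s a minimal Python sketch, not the route taken in this post: the function names, the three-column sample and the quote handling (double quotes inside a value are swapped for single quotes) are all assumptions for illustration.

```python
import csv
import io


def gdf_field(value, gdf_type):
    """Quote VARCHAR values; leave numeric values bare."""
    if gdf_type == "VARCHAR":
        # Assumption: swapping embedded double quotes for single quotes
        # is an acceptable way to keep the row parseable.
        return '"' + value.replace('"', "'") + '"'
    return value


def csv_to_gdf(csv_text, columns):
    """Swap a CSV header row for a GDF nodedef> header.

    columns: list of (name, gdf_type) pairs, one per CSV column, in order.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # discard the original CSV header row
    header = "nodedef>" + ",".join(f"{name} {gdf_type}" for name, gdf_type in columns)
    lines = [header]
    for row in reader:
        if not row:
            continue  # skip blank lines
        fields = [gdf_field(v, t) for v, (_, t) in zip(row, columns)]
        lines.append(",".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical three-column extract of the tender data.
sample_csv = (
    "Project ID,Title,Estimated value\n"
    "402846,Spring Electoral Events Contact Centre,125000\n"
)
sample_columns = [("name", "VARCHAR"), ("label", "VARCHAR"), ("estVal", "DOUBLE")]
print(csv_to_gdf(sample_csv, sample_columns))
```

The column list is passed in explicitly because the GDF types (VARCHAR, DOUBLE) can’t be read off the CSV header itself.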
However, I’ve recently started exploring the use of the browser based desktop application Google Refine (formerly Freebase Gridworks) as a step in my workflow for tidying up CSV data and then getting it into the GDF format.
Here’s what it looks like once the data has been imported:
(For a great overview of what Gridworks allows you to do with data, see @jenit’s Using Freebase Gridworks to Create Linked Data.)
The data I want to visualise in Gephi relates to the current tenders, rather than anything to do with the actual documents, so we can use Gridworks to simplify the data set by deleting the document type, document name and contact email columns. We can also check that columns using a restricted vocabulary (e.g. the type of tender being offered), do so consistently. For example, if we look at the Nature of the Tender Process column, and select Cluster and Edit…:
we can see that there may be the odd typo that we can correct automatically:
The Description column also has various categories we can tidy up:
Here are the data tidying steps I’ve applied:
At the time of writing, a new version of Google Refine/Gridworks is about to be released. In the version I’m using, I don’t think it’s possible to remove duplicate rows, which we have aplenty in my tidied up dataset (where several documents were listed for a tender, there are now several identical rows in my dataset). [Google Refine 2.0 is out now – and I don’t think it can de-dupe?] However, I happen to know that Gephi will ignore duplicates of nodes loaded into Gephi, so we can do the de-dupe/de-duplication step there…
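If you’d rather not rely on Gephi for this, the de-dupe could also be done in a few lines of script before import. A minimal Python sketch, assuming the rows have already been read into lists and that the project ID sits in the first column (both assumptions; the function name is hypothetical):

```python
def dedupe_rows(rows, key_index=0):
    """Keep the first row seen for each key value, preserving row order."""
    seen = set()
    unique = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the same tender listed once per document.
rows = [
    ["402846", "Spring Electoral Events Contact Centre"],
    ["402846", "Spring Electoral Events Contact Centre"],
    ["2010CMTLSE00001", "Supply of body armour to HMCS"],
]
print(dedupe_rows(rows))
```

Keying on the project ID alone mirrors what Gephi appears to do with the name column; it would silently drop rows that share an ID but differ elsewhere.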
To generate the GDF file, we need to create a header line, and then define the output pattern for each row. We can do this using Gridworks’ Templating support:
Here’s how I define the output document:
Note that the linebreaks will need removing in order to generate the correct output format. Also, in the version of Gridworks I’m using, it’s worth noting that whenever you run the template, you’re returned to the main data window and your template definition is lost… (so before running the template code, grab a copy of it into a text editor just to be safe;-)
When you export the data, it’s exported to your browser downloads directory, as a text file. Change the file extension from .txt to .gdf and import it into Gephi:
You’ll see that Gephi has detected the duplicate rows based on common name elements (that is, common project IDs), and ignored the copies/duplicates.
Now we can view the procurement data using proportional symbol visualisations – here I size the nodes by estimated value (and display the label size in proportion to node size), and colour the nodes according to estimated duration:
[Since drafting this post, I’ve found a far better way of getting just the node data into Gephi – load it into the data table as a node table. I’ll post more on how to do this in a follow on post…]
(The Many Eyes take on Bubble Charts ignores x/y co-ordinates as useful data, although other definitions of Bubble Charts include x/y location as important factors. In the current example, I allow Gephi to layout the nodes/bubbles. However, we can define x and y co-ordinates in the gdf file if we want to specifically locate the bubbles on the canvas.)
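As a rough illustration of pinning bubbles to positions, here’s a minimal Python sketch that writes x and y columns into the node definitions. It assumes the x DOUBLE and y DOUBLE node columns described in the gdf documentation; the choice of duration for x and scaled value for y, the value_scale parameter and the function name are all hypothetical.

```python
def gdf_with_positions(tenders, value_scale=1000.0):
    """Build GDF text that places each bubble at (estDur, estVal / value_scale)."""
    lines = [
        "nodedef>name VARCHAR,label VARCHAR,estVal DOUBLE,estDur DOUBLE,x DOUBLE,y DOUBLE"
    ]
    for t in tenders:
        x = float(t["estDur"])                 # horizontal position: duration
        y = float(t["estVal"]) / value_scale   # vertical position: scaled value
        lines.append(
            f'"{t["name"]}","{t["label"]}",{t["estVal"]},{t["estDur"]},{x},{y}'
        )
    return "\n".join(lines) + "\n"


tenders = [
    {
        "name": "402846",
        "label": "Spring Electoral Events Contact Centre",
        "estVal": 125000,
        "estDur": 48,
    },
]
print(gdf_with_positions(tenders))
```

Running a layout algorithm in Gephi after import would presumably move the nodes again, so fixed positions only make sense if the layout is left alone.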
We can also use Gephi to cluster the data according to calling departments, or type of procurement exercise:
I *think* the size of the resulting bubble is proportional to the sum of the values used to inform the node size of the original components, so we should be able to group by procurement exercise type and have the bubble size be proportional to the sum of the estimated values of the procurement projects in that procurement class.
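One way to check that guess would be to compute the group totals independently and compare them with what Gephi shows. A minimal Python sketch, assuming each tender is a dict using the estVal and nature field names from the node definition above (the function name is hypothetical):

```python
from collections import defaultdict


def total_value_by_group(tenders, group_field="nature"):
    """Sum the estimated values of the tenders in each group."""
    totals = defaultdict(float)
    for t in tenders:
        totals[t[group_field]] += float(t["estVal"])
    return dict(totals)


# The two example tenders from the node definition above.
tenders = [
    {"name": "402846", "estVal": 125000,
     "nature": "Competition as part of an existing framework agreement"},
    {"name": "2010CMTLSE00001", "estVal": 250000,
     "nature": "Competition as part of an existing framework agreement"},
]
print(total_value_by_group(tenders))
```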
We can also expand a clustered node to see what activity is related to it – in this case, here are tenders from the British Library:
Going back to the full list, here we size by estimated value and colour by type of procurement:
We can also generate views over the data using filters – so for example, COI sponsored procurement:
One thing that Gephi doesn’t currently support is a treemap style visualisation. However, now we have deduped the data by importing it into Gephi, we can export it as a simple CSV file from the datatable view, and then upload the data to Many Eyes and make use of its treemap:
We use TSV because that is the preferred format for Many Eyes… (data file on Many Eyes)
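If the export comes out comma separated, converting it to tab separated is a small job for a script. A minimal Python sketch using the standard csv module (the function name and sample data are assumptions for illustration):

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Re-write comma separated text as tab separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = 'label,dept\n"Supply of body armour to HMCS","Ministry of Justice"\n'
print(csv_to_tsv(sample))
```

Going through the csv module, rather than a plain find-and-replace on commas, keeps commas that appear inside quoted values intact.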
Here’s one configuration of the treemap:
With the data in Many Eyes, we can of course easily generate other views over it, such as a histogram:
(NB in the original data, the Estimated value column – which should contain numbers – also contained a few unknown elements:
Because code that expects numbers sometimes chokes on text, I should maybe have set the unknown values to a default value as shown above?)
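The same defaulting could be done in a script before the data goes anywhere near a charting tool. A minimal Python sketch, assuming the values arrive as strings and that 0.0 is an acceptable default (both assumptions; the function name is hypothetical):

```python
def to_number(value, default=0.0):
    """Parse a numeric string, falling back to a default for anything else."""
    try:
        # Strip whitespace and thousands separators before parsing.
        return float(value.replace(",", "").strip())
    except ValueError:
        return default


print([to_number(v) for v in ["125000", "Unknown", "1,250", ""]])
```

Whether a default of zero is sensible depends on the chart: in a treemap or bubble chart it effectively hides the tender, which may or may not be what you want.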
Okay – so what have we covered in this post?
– how to start cleaning/preparing data in Freebase Gridworks/Google Refine;
– how to use Freebase Gridworks/Google Refine to generate an output file according to a template;
– how to use Gephi to deduplicate data based on a common field (in this case, the project id);
– how to use Gephi as a proportional symbol/bubble chart visualisation tool;
– how to export data from Gephi and upload it into Many Eyes;
– how to use Many Eyes to generate a treemap.
As ever, this blog post took longer to write than it took me to work through the exercise originally.