A couple of weeks ago, I came across Gephi, a desktop application for visualising networks.
And quite by chance, a day or two after I was asked about any tools I knew of that could visualise and help analyse social network activity around an OU course… which I take as a reasonable justification for exploring exactly what Gephi can do :-)
So, after a few false starts, here’s what I’ve learned so far…
First up, we need to get some graph data – netvizz – facebook to gephi suggests that the netvizz facebook app can be used to grab a copy of your Facebook network in a format that Gephi understands, so I installed the app, downloaded my network file, and then uninstalled the app… (can’t be too careful ;-)
Once Gephi is launched (and updated, if it’s a new download – you’ll see an updates prompt in the status bar along the bottom of the Gephi window, right hand side) Open… the network file you downloaded.
NB I think the graph should probably be loaded as an undirected graph… That is, if A connects to B, B connects to A. But I’m committed to the directed version in this case, so we’ll stick with it… (The directed version would make sense for a Twitter network (which has an asymmetric friending model), where A may follow B, but B might choose not to follow A. In Facebook, friending is symmetric – A can only friend B if B friends A.)
(Btw, I’ve come across a few gotchas using Gephi so far, including losing the window layout shown above. Playing with the Reset Windows from the Windows menu sometimes helps… There may be an easier way, but I haven’t found it yet…)
The graph window gives a preview of the network – in this case, the nodes are people and the edges show that one person is following another. (Remember, I should have loaded this as an undirected graph. The directed edges are just an artefact of the way the edge list that states who is connected to whom was generated by netvizz.)
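As a rough illustration of what treating the data as undirected amounts to, here is a minimal Python sketch that collapses a directed edge list into undirected edges. The edge list is made up, and `to_undirected` is a hypothetical helper of my own, not something provided by Gephi or netvizz.

```python
def to_undirected(directed_edges):
    """Collapse (source, target) pairs into a set of unordered edges."""
    undirected = set()
    for source, target in directed_edges:
        if source != target:  # ignore self-loops
            # A frozenset has no order, so A->B and B->A become the same edge.
            undirected.add(frozenset((source, target)))
    return undirected


# Hypothetical edge list: A-B is listed in both directions, B-C only once.
directed = [("A", "B"), ("B", "A"), ("B", "C")]
edges = to_undirected(directed)
print(len(edges))  # 2
```

Either way of listing a friendship ends up as a single connection, which is all a symmetric friending model needs.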
Using the scroll wheel on a mouse (or two finger push on my Mac mousepad), you can zoom in and out of the network in the graph view. You can also move nodes around, view the labels, switch the edges on and off, and recenter the view.
Not shown – but possible – is deleting nodes from the graph, as well as editing their properties.
You can also generate views of the graph that show information about the network. In the Ranking panel, if you select the Nodes tab, set the option to Degree (the number of edges/connections attached to a node) and then choose the node size button (the jewel), you can set the size of the node to be proportional to the number of connections. Tune the min and max sizes as required, then hit apply:
You can also colour the nodes according to properties:
So for example, we might get something like this:
Label size and colour can also be proportional to node attributes:
To view the labels, make sure you click on the Text labels option at the bottom of the graph panel. You may also need to tweak the label size slider that’s also on the bottom of the panel.
If you want to generate a pretty version of the graph, you need to do a couple of things. Firstly, in the layout panel, select a layout algorithm. Force Atlas is the one that the original tutorial recommends. The repulsion strength determines how dispersed the final graph will be (i.e. it sets the “repulsive force” between nodes); I set a value of 2000, but feel free to play:
When you hit Run, the button label will change to Stop and the graph should start to move and reorganise itself. Hit Stop when the graph looks a little better laid out. Remember, you can also move nodes around in the graph as shown in the video above.
Having run the Layout routine, we can now generate a pretty view of the graph. In the Preview Settings panel on the left-hand side of the Gephi environment, select “Show Labels” and then hit “Refresh”:
In the Preview panel (next tab along from Preview Settings), you should see the prettified, 3D layout view:
Note that in this case I haven’t made much attempt at generating a nice layout, for example by moving nodes around in the graph window to better position them, but you can do… (just remember to Refresh the Preview view in the Preview Settings…) (There must be a shortcut way of doing that, but I haven’t found it…! :-( )
If you want to look at who any particular individual is connected to, you can go to the Data Table panel (again in the set of panels on the right hand side, just along from the Preview tab panel) and search for people by name. Here, I’m searching the edges to see who of my Facebook friends a certain Martin W is also connected to on Facebook:
It’s easy enough to highlight/select and copy these cells and then post them into a spreadsheet if required.
So that’s step 1 of getting started with Gephi… a way of using it to explore a graph in very general terms; but that’s not where the real fun lies. That starts when you start processing the graph by running statistics and filters over it. But for that, you’ll have to wait for the next post in this series… which is here: Getting Started With Gephi Network Visualisation App – My Facebook Network, Part II: Basic Filters
In Getting Started With Gephi Network Visualisation App – My Facebook Network, Part I I described how to get up and running with the Gephi network visualisation tool using social graph data pulled out of my Facebook account. In this post, I’ll explore some of the tools that Gephi provides for exploring a network in a more structured way.
If you aren’t familiar with Gephi, and if you haven’t read Part I of this series, I suggest you do so now…
Okay, so where do we begin? As before, I’m going to start with a fresh worksheet, and load my Facebook network data, downloaded via the netvizz app, into Gephi, but as an undirected graph this time! So far, so exactly the same as last time. Just to give me some pointers over the graph, I’m going to set the node size to be proportional to the degree of each node (that is, the number of people each person is connected to).
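For anyone who wants to see the sum behind that sizing, here is a minimal Python sketch: count the edges attached to each node, then stretch the counts between a minimum and maximum size. The names, the friendships and the 10 to 50 size range are all made up, and the straight-line scaling is my assumption rather than a statement of how Gephi interpolates.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node of an undirected edge list."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def node_sizes(degree, min_size=10.0, max_size=50.0):
    """Linearly map the smallest degree to min_size and the largest to max_size."""
    lo, hi = min(degree.values()), max(degree.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all degrees are equal
    return {
        node: min_size + (d - lo) / span * (max_size - min_size)
        for node, d in degree.items()
    }


# Hypothetical friendships (not my real network).
friends = [("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Bob", "Cat")]
deg = degrees(friends)
sizes = node_sizes(deg)
print(sizes["Ann"], sizes["Bob"], sizes["Dan"])  # 50.0 30.0 10.0
```

Note that someone with no connections at all never appears in an edge list, so they would need adding separately.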
I can activate text labels for the nodes that are proportional to the node sizes from the toolbar along the bottom of the Graph panel:
…remembering to turn on the text labels, of course!
So – how can we explore the data visually using Gephi? One way is to use filters. The notion of filtering is an incredibly powerful one, and one that I think is both often assumed and underestimated, so let’s just have a quick recap on what filtering is all about.
[“green beans” by House Of Sims]
Filters – such as sieves, or colanders, but also like EQ settings and graphic, bass or treble equalisers on music players, colour filters on cameras and so on – are things that can be used to separate one thing from another based on their different properties. So for example, a colander can be used to separate green beans from the water it was boiled in, and a bass filter can be used to filter out the low frequency pounding of the bass on an audio music track. In Gephi, we can use filters to separate out parts of a network that have particular properties from other parts of the network.
The graph of Facebook friends that we’re looking at shows people I know as nodes; a line connecting two nodes (generally known as an edge) shows that the two people represented by the corresponding nodes are also friends with each other. The size of the node depicts its degree, that is, the number of edges that are connected to it. We might interpret this as the popularity (or at least, the connectedness) of a particular person in my Facebook network, as determined by the number of my friends that they are also a friend of.
(In an undirected network like Facebook, where if A is a friend of B, B is also a friend of A, the edges are simple lines. In a directed network, such as the social graph provided by Twitter, the edges have a direction, and are typically represented by arrows. The arrow shows the direction of the relationship defined by the edge, so in Twitter an arrow going from A to B might represent that A is a follower of B; but if there is no second arrow going from B to A, then B is not following A.)
We’ve already used the degree property of the nodes to scale the size of the nodes as depicted in the network graph window. But we can also use this property to filter the graph, and see just who the most (or least) connected members of my Facebook friends are. That is, we can see which people are friends of lots of the people I am friends with.
So for example – of my Facebook friends, which of them are friends of at least 35 people I am friends with? In the Filter panel, click on the Degree Range element in the Topology folder of the Library and drag and drop it on to the “Drag Filter Here” target.
Adjust the Degree Range settings slider and hit the Filter button. The changes allow us to see different views over the network corresponding to the number of connections. So for example, in the view shown above, we can see members of my Facebook network who are friends with at least 30 other friends in my network. In my case, the best connected are work colleagues.
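In code terms, a degree range filter is little more than the following minimal Python sketch. It assumes the degree is measured once over the full graph and that only edges between surviving nodes are kept; `degree_range_filter` and the sample friendships are hypothetical, and Gephi’s own filter may well differ in the details.

```python
def degree_range_filter(edges, min_degree, max_degree):
    """Keep nodes whose degree lies in [min_degree, max_degree], plus edges between them."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    keep = {node for node, d in degree.items() if min_degree <= d <= max_degree}
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return keep, kept_edges


# Hypothetical friendships (not my real network).
friends = [("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Bob", "Cat")]
keep, kept_edges = degree_range_filter(friends, 2, 3)
print(sorted(keep))  # ['Ann', 'Bob', 'Cat']
```

Setting the range to the low end instead, for example `degree_range_filter(friends, 0, 1)`, picks out the poorly connected folk.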
Going the other way, we can see who is not well connected:
One of the nice things we can do with Gephi is use the filters to create new graphs to work with, using the notion of workspaces.
If I export the graph of people in my network with more than 35 connections, it is placed into a new workspace, where I can work on it separately from the complete graph.
Navigating between workspaces is achieved via a controller in the status bar at the bottom right of the Gephi environment:
The new workspace contains just the nodes that had 35 or more connections in the original graph. (I’m not sure if we can rename, or add description information, to the workspace? If you know how to do this, please add a comment to the post saying how:-)
If we go back to the original graph, we can now delete the filter (right click, delete) and see the whole network again.
One very powerful filter rule that it’s worth getting to grips with is the Union filter. This allows you to view nodes (and the connections between them) of different filtered views of the graph that might otherwise be disjoint. So for example, if I want to look at members of my network with ten or fewer connections, but also see how they connect to each other and to Martin Weller, who has over 60 connections, the Union filter is the way to do it:
That is, the Union filter will display all nodes, and the connections between them, that either have 10 or less connections, or 60 or more connections.
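Here is the same idea as a minimal Python sketch, assuming the union is simply the set union of the two filtered node sets, with an edge kept whenever both of its ends are in that union. The function name and the little hub-and-spokes network are invented for the purpose.

```python
def union_of_degree_ranges(edges, low_max, high_min):
    """Nodes with degree <= low_max OR degree >= high_min, plus the edges between them."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    low = {node for node, d in degree.items() if d <= low_max}
    high = {node for node, d in degree.items() if d >= high_min}
    keep = low | high  # set union of the two filtered views
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return keep, kept_edges


# Hypothetical network: one well connected hub and some less connected friends.
network = [
    ("Hub", "A"), ("Hub", "B"), ("Hub", "C"), ("Hub", "D"),
    ("A", "B"), ("B", "C"), ("C", "D"),
]
keep, kept_edges = union_of_degree_ranges(network, low_max=2, high_min=4)
print(sorted(keep))  # ['A', 'D', 'Hub']
```

A and D are not connected to each other, so on their own they would appear as isolated dots; it is only in the union that their shared connection to the hub shows up.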
As before, I can save just the members of this subnetwork to a new workspace, and save the whole project from the File menu in the normal way.
Okay, that’s enough for now… have a play with some of the other filter options, and paste a comment back here about any that look like they might be interesting. For example, can you find a way of displaying just the people who are connected to Martin Weller?
Getting Started With Gephi Network Visualisation App – My Facebook Network, Part III: Ego Filters and Simple Network Stats
In a couple of previous posts on exploring my Facebook network with Gephi, I’ve shown how to visualise the network, and how to start constructing various filtered views over it (Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and Getting Started With Gephi Network Visualisation App – My Facebook Network, Part II: Basic Filters). In this post, I’ll explore a new feature, ego filters, as well as looking at some simple social network analysis tools that can help us better understand the structure of a social network.
To start with, I’m going to load my Facebook network data (grabbed via the Netvizz app, as before) into Gephi as an undirected graph. As mentioned above, the ego network filter is a new addition to Gephi, which will show that part of a graph that is connected to a particular person. So for example, I can apply the ego filter (from the Topology folder in the list of filters) to “George Siemens” to see which of my Facebook friends George knows.
If I save this as a workspace, I can then tunnel into it a little more, for example by applying a new ego filter to the subgraph of my friends who George Siemens knows. In this case, let’s add Grainne to the mix – and see who of my friends know both George Siemens and Grainne:
Note that I could have achieved a similar effect with the full graph by using the intersection filter (as introduced in the previous post in this series):
The depth of the ego filter also allows you to see who of my friends the named individual knows either directly, or through one of my other friends. Using an ego filtered network to depth two (friend of a friend) around George Siemens, I can run some network statistics over just that group of people. So for example, if I run the Degree statistics over the network, and then set the node size according to node degree within that network this is what I get:
(I also turned node labels on and set their size proportional to node size.)
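The ego filter and its depth setting can be thought of as a breadth-first walk out from the chosen person that stops after a set number of hops. The following is a minimal Python sketch under that assumption; the single-letter names and connections are placeholders rather than anyone in my network, and `ego_network` is a hypothetical helper, not Gephi’s implementation.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbours mapping."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def ego_network(adj, ego, depth=1):
    """Return the ego plus every node within `depth` hops of it."""
    hops = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if hops[node] == depth:
            continue  # do not walk any further out than the requested depth
        for neighbour in adj.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return set(hops)


# Placeholder network: G and H play the part of the two egos.
edges = [("G", "A"), ("G", "B"), ("H", "B"), ("H", "C"), ("B", "D"), ("D", "E")]
adj = build_adjacency(edges)

print(sorted(ego_network(adj, "G", depth=1)))  # ['A', 'B', 'G']
print(sorted(ego_network(adj, "G", depth=2)))  # ['A', 'B', 'D', 'G', 'H']

# People known to both egos: intersect the two depth-one ego networks.
both = ego_network(adj, "G") & ego_network(adj, "H")
print(sorted(both))  # ['B']
```

The last step is the intersection approach in miniature: two ego networks, and the set of people who appear in both.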
Running Network Diameter stats generates the following sorts of report:
- betweenness centrality;
- closeness centrality;
- eccentricity.
These all sound pretty technical, so what do they refer to?
Betweenness centrality is a measure based on the number of shortest paths between any two nodes that pass through a particular node. Nodes around the edge of the network would typically have a low betweenness centrality. A high betweenness centrality might suggest that the individual is connecting various different parts of the network together.
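To make that a bit more concrete, here is a minimal Python sketch that counts shortest paths through each node of a small unweighted, undirected graph, using the standard approach due to Brandes. The scores are raw counts rather than normalised values, the four-person chain is made up, and I am not claiming this matches Gephi’s output digit for digit.

```python
from collections import deque


def betweenness(adj):
    """Unnormalised betweenness for an unweighted, undirected graph (Brandes' method)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                    # nodes in the order the search reaches them
        preds = {v: [] for v in adj}  # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the furthest nodes, sharing out path dependencies.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair of nodes has been counted from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}


# A made-up chain of four people: A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = betweenness(chain)
print(scores["A"], scores["B"])  # 0.0 2.0
```

The two people at the ends of the chain score zero, while B and C each sit on two shortest paths between other pairs.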
Closeness centrality is a measure that indicates how close a node is to all the other nodes in a network, whether or not the node lies on a shortest path between other nodes. A high closeness centrality means that there is a large average distance to other nodes in the network. (So a small closeness centrality means there is a short average distance to all other nodes in the network. Geddit? I think sometimes the reciprocal of this measure is given as closeness centrality, in which case high and low swap round :-) )
The eccentricity measure captures the distance between a node and the node that is furthest from it; so a high eccentricity means that the furthest away node in the network is a long way away, and a low eccentricity means that the furthest away node is actually quite close.
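Both of these distance-based measures fall out of a breadth-first search from each node. Here is a minimal Python sketch that treats closeness as the average distance to the other reachable nodes (the sense used in this post, not the reciprocal form) and eccentricity as the largest of those distances. The chain of four people is made up and the function names are my own.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closeness_and_eccentricity(adj):
    """Map each node to (average distance to the others, distance to the furthest)."""
    result = {}
    for node in adj:
        others = [d for n, d in distances_from(adj, node).items() if n != node]
        average = sum(others) / len(others) if others else 0.0
        furthest = max(others, default=0)
        result[node] = (average, furthest)
    return result


# A made-up chain of four people: A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
stats = closeness_and_eccentricity(chain)
print(stats["A"])  # (2.0, 3)
```

A, out on the end of the chain, is on average 2 hops from everyone else and 3 hops from the furthest person; B, nearer the middle, does better on both counts.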
So let’s have a look at the structure of my Facebook network, as filtered according to George’s ego filter, depth 2:
Plotting size proportional to betweenness centrality, we see Martin Weller, Grainne and Stephen Downes are influential in keeping different parts of my network connected:
As far as outliers go, we can look at the closeness centrality and eccentricity (to protect the innocent, I shall suppress the names!)
Here, the colour field defines the closeness centrality and the size of the node the eccentricity. It’s quite easy to identify the people in this network who are not well connected and who are unlikely to be able to reach each other easily through those of my friends they know.
From nodes with similar sizes and different colours, we also see how it’s quite possible for two nodes to have a similar eccentricity (similar distances to the furthest away nodes) and very different closeness centrality (that is, the node may have a small or large average distance to every other node in the graph). For example, if a node is connected to a very well connected node, it will lower the closeness centrality.
So for example, if we look at the ego network, within the above network, based around the very well connected Martin Weller, what do we see?
The colder, blue shaded circles (high closeness centrality) have disappeared. Being a Martin Weller friend (in my Facebook network at least) has the effect of lowering your closeness centrality, i.e. bringing you closer to all the other people in the network.
Okay, that’s definitely more than enough for now. Why not have a play looking at your Facebook network, and seeing if you can identify who the best connected folk are?
PS when plotting charts, I think Gephi uses data from the last statistics run it did, even if that was in another workspace, so it’s always worth running the statistics over the current graph if you intend to chart something based on those stats…
In the first two posts in this series, I described how to use Gephi to visualise various different views over a personal social network in Facebook using data pulled from a Facebook account using the Netvizz application. This was followed by a post describing how to run some simple social network analysis statistics over the network. In this post, we’ll look at another powerful analytic tool provided by Gephi: clustering. [Note that after publishing this post, a far better way of visualising the clustered groups was suggested to me – find out more in the next post in this series.]
Clustering is a mathematical process in which different elements in a particular set are grouped together based on certain similarities between the different elements. That is, like is grouped with like. In a social network, clustering algorithms typically group together individuals who form “subnetworks” – for example, subsets of the whole population who all know one another.
Being able to identify clusters within a social network allows you to identify “subnetworks” within that larger network. In Gephi, a graph can be clustered by running the Modularity statistic in the Statistics panel – so here’s what I get when I run this measure over my Facebook network:
The Modularity clustering tool identifies several different clusters, (that is, groups or classes) within the network as a whole, associating each node with one of the groups.
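The modularity score itself is easy enough to work out once you have a grouping to test. The following minimal Python sketch scores a given assignment of nodes to groups using the usual definition: for each group, the fraction of edges that fall inside it, minus the square of the group’s share of all edge ends. It does not attempt the search for a good grouping that the Modularity tool performs, and the two-triangle network is made up.

```python
def modularity(edges, community):
    """Modularity Q of an undirected graph for a given node -> community mapping."""
    m = len(edges)
    internal = {}    # edges with both ends inside each community
    degree_sum = {}  # total number of edge ends in each community
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (ends / (2 * m)) ** 2
        for c, ends in degree_sum.items()
    )


# Made-up network: two triangles joined by a single edge (C-D).
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(round(q_split, 4))  # 0.3571
```

Lumping everyone into one group scores zero, whereas splitting along the two triangles scores noticeably higher, which is the sort of grouping a modularity-based clustering is looking for.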
Once you have run the tool, you can view the cluster each node has been associated with by using the Modularity Class Ranking parameter; I find that the colour mapping is the most effective:
You can inspect the size of the various classes in a crude fashion via the Filter panel: select the Modularity Class option from the Partition folder in the filter Library and drag the filter to the query window. If you click on the Partition column element, you will be presented with a window showing each of the partition classes:
If you now filter on one of the partition classes, and switch on the node names, you can see which nodes have been clustered together. Looking separately at two of the largest clusters in my Facebook network, I can see two OU clusters:
and a North America/Canada dominated ed-tech cluster (which also includes some BBC folk via Bill Thompson…):
(Note that it is possible to highlight/filter on more than one cluster within a single filter (on a Mac, fn-click allows you to select multiple individual clusters).)
The Modularity statistic thus provides us with a powerful tool for identifying subgroupings within a social network. So why not try it using your own data – are the clusters that are identified meaningful to you?
PS a far better way of visualising the clustered groups was suggested to me – find out more in the next post in this series.
A comment from one of the Gephi developers to Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part IV, in which I described how to use the Modularity statistic to partition a network in terms of several different similar subnetwork groupings, suggested that a far better way of visualising the groups was to use the Partition parameter… and how right they were…
Running the Modularity statistic over my Facebook network, as captured using Netvizz, and then refreshing the view in the Partition panel allows us to colour the network using different partitions – such as the Modularity classes that the Modularity statistic generates and assigns nodes to:
Here’s what happens when we apply the colouring:
Selecting the Group view collects all the nodes in a partition together as a group:
These grouped nodes can be individually ungrouped by right-clicking on a group node and ungrouping it, or they can be expanded which maintains the group identity whilst still letting us look at the local structure:
Here’s what the expanded view of one of the classes looks like, with text labels turned on:
We see that the members of the group are visible, allowing us to explore the make-up of the subnetwork. As you might expect, we can then colour or resize nodes within the expanded group in the normal way:
To create a workspace containing just the members of a particular partition, ungroup all the nodes via the Partition module and filter on the required partition using a Modularity Class filter:
The Partition module is incredibly powerful, as you can hopefully see; but it isn’t limited to dealing with just partitions created using Gephi statistics – it can also deal with partitions defined over the graph as loaded into Gephi (see the GUESS format for more details on how to structure the input file).
So for example, the most recent version of Netvizz will return additional data alongside just the identities of your friends, such as their gender (if revealed to you by their profile privacy settings) and the number of their wall posts. Loading this richer network specification into Gephi, and refreshing the Partition module settings reveals the following:
Which in turn means we can colour the graph as follows:
The wall count parameter is made available through the Ranking panel:
So as we can see, if you have partition data available for network members, Gephi can provide a great way of visualising it :-)
Although on a day to day basis I’m a Mac user, every so often I need to dip into the Windows virtual machine on my laptop. This generally fills me with fear and trepidation, because as an infrequent Windows user, whenever I do go over to the dark side I know my internet connection will grind to a halt and I will get regular requests to restart the machine as Windows goes into update mode. In a similar vein, on a day to day basis it’s Twitter that meets my social web needs. But on the rare occasions I go into Facebook, I’m also filled with dread. Why? Because there is frequently a new privacy minefield to negotiate (e.g. Keeping Your Facebook Updates Private).
Over the last few days, there’s been a Facebook developers conference, so I thought it worth checking in to see what new horrors have been released; and here’s what I saw today:
So Facebook makes it easy for website owners to help you “tweet” a link to your Facebook stream… (I wonder, does this also work as a social bookmarking service? Can I browse through the links I’ve Liked?)
Ah – according to Deceiving Users with the Facebook Like Button, it appears that “Removing the feed item from your newsfeed does not remove your like — it stays in your profile. You have to click the button again to remove the ‘Like’ relationship.” So it could be used as a social bookmarking service, of a sort. Or at least a Facebook equivalent to “favorited” websites in your browser.
As you might have guessed from the previously linked to post, all may not (yet) be well with the Liked implementation though – because it seems that it’s possible to add a “Like” link on one page that actually Likes a page on another website. Which reminds me a little of phishing…
So, what other goodness (?!) does Facebook have in store for us?
Instant personalisation, hmm…? So if I go to Pandora, say, it can trawl my Facebook profile, decide from my Likes and updates that I’m a goth hippy groover, and generate a personalised radio station for me jus’ like that? The oo’s have it… (ooh…, cool… or spooky…?;-)
And guess what, Facebook have thoughtfully opted me in to that service, without me having to do anything, and without even forcing me to notice (I didn’t have to follow the link on the home page to read the new service announcement; and for mobile users, I wonder if any of the Facebook apps tell the users that they’ve been opted in to this new way of giving their personal data to third parties…?)
I think I’ll click here:
I think I’ll untick…
Am I sure…? Err, yes… Confirm.
But what does this mean…?
Please keep in mind that if you opt out, your friends may still share public Facebook information about you to personalize their experience on these partner sites unless you block the application.
Hmmm, I think I’ll Learn More… (do you ever get the feeling you’ve ended up in one of those Create Your Own Adventure style games, only for real… Is this Brazil, or a Trial?)
I guess this is the one:
What data is shared with instantly personalized partner sites?
When you and your friends visit an instantly personalized site, the partner can use your public Facebook information, which includes your name, profile picture, gender, and connections. To access any non-public information, the website is required to ask for you or your friend’s explicit permission.
Or is that “When you or your friends visit…”? That is, if my friend visits Pandora and goes for instant personalisation, can Pandora use my friend as a vector to grab my public information? A question that now follows is – can Pandora identify my Facebook identity through some mechanism or other (e.g. Facebook set cookies?) and reconcile that with what it has learned about me from my friends who have opted in to personalisation features? And if so, could it then offer me personalisation services anyway, even though I opted out on Facebook…?
I’m still unticking… because as Facebook adds partners, I probably won’t pick up on it…
So, do I dare walk up the Facebook Privacy tree…? Let’s go up to the Privacy Setting page:
So here’s the Profile Settings control panel:
Hmmm… there’s a link there to Application Settings, which I don’t think appears on the Privacy settings page. Where does it go?
I’m not sure I understand everything in that drop down menu…?
How about the Contact settings?
Sheesh.. So here are the tabs that I have to work through:
Many of the pages only require setting a simple drop down box (though thinking through the implications, and what relates to what, may be complex); but there are also quite a few that offer “Edit Settings…” links, and I suspect that some of those open up into rather more involved dialogues…
I reckon you could easily spend at least 1 week/10 hours of a 10 point short course just looking at Facebook privacy settings, and trying to think through what the implications are…
Which brings to mind the Facebook network visualisation I started working on with Gephi… Could we use visualisation tools to highlight who in your Facebook network can see what given your current privacy settings? Methinks there’s an app in that…
PS popping back in to Facebook just now to delete most of the apps I’m signed up to, I noticed on the “click here” page linked to above the option:
What your friends can share about you
Control what your friends can share about you when using applications and websites
Clicking through to Edit Settings, here’s what I see:
[Since grabbing that screenshot, I’ve unchecked all those boxes…]
I’ll spell out the text for you:
What your friends can share about you through applications and websites
When your friend visits a Facebook-enhanced application or website, they may want to share certain information to make the experience more social. For example, a greeting card application may use your birthday information to prompt your friend to send a card
If your friend uses an application that you do not use, you can control what types of information the application can access. Please note that applications will always be able to access your publicly available information (Name, Profile Picture, Gender, Current City, Networks, Friend List, and Pages) and information that is visible to Everyone
So… if I don’t take steps to protect my information, then my friends can give access to my presence, videos, links, photos, videos and photos I’m tagged in, my birthday, hometown etc etc to third party applications? Does that mean if I have various privacy settings set to share with friends only, they can still share the information on to third parties I did not anticipate seeing the data?
In the following set up, who can see photos and videos of me?
Answers in the comments please… If anyone’s done the experiments to see just how the various privacy settings inter-relate, I’d love to see a write-up. I’m also thinking: maybe Facebook should be required to publish a logical model of what’s going on? (Are there logics of privacy? You could probably get somewhere close using epistemic logic?)
(It’s all a bit like writing legislation that says that as yet unspecified powers will be given to a Minister, who may then devolve those powers to others…;-)
PPS a page I didn’t link to/show a screengrab of but should have included is the Applications page (this is not under the privacy settings). You can find it here: http://www.facebook.com/#!/editapps.php?v=allowed
If you don’t use an app, particularly an external one, I suggest you delete it…
Whenever Facebook rolls out a major change, there’s a backlash… Here’s why I posted recently about how to opt out of Facebook’s new services…
Firstly, I’m quite happy to admit that it might be that you will benefit from opting in to the Facebook personalisation and behavioural targeting services. If you take the line that better targeted ads are content, and behavioural advertising is one way to achieve that, all well and good. Just bear in mind that your purchasing decisions will be even more directly influenced ;-)
What does concern me is that part of the attraction of Facebook for many people is its privacy controls. But when they’re too confusing to understand, and potentially misleading, it’s a Bad Thing… (I suppose you could argue that Facebook is innovating in terms of privacy, openness, and data sharing on behalf of its users, but is that a Good Thing?)
If folk think they have set their privacy settings one way, but they operate in another through the myriad interactions of the different settings, users may find that the images and updates they think they are posting into a closed garden are in fact being made public in other ways, whether by the actions of their friends, applications they have installed, pages they have connected to, or websites they visit.
The Facebook privacy settings also seem to me to suggest various asymmetries. For example, if I think I am only sharing videos with friends, then if those friends can also share on content because of the way I have set/not changed the default on another setting, I may be publishing content in a way that was not intended. It seems to me that Facebook is set up to devolve trust to the edge of my network – I publish to the edge of my network, for example, but the people or pages on the edge of my network can then push the content out further.
So for example, in the case of connecting to pages, Facebook says: “Keep in mind that Facebook Pages you connect to are public. You can control which friends are able to see connections listed on your profile, but you may still show up on Pages you’re connected to. If you don’t want to show up on those Pages, simply disconnect from them by clicking the “Unlike” link in the bottom left column of the Page.”
The privacy settings around how friends can share on content I have shared with them are also confusing – do their privacy settings override mine on content I have published to them?
I’m starting to think (and maybe I’m wrong on this) that the best way of thinking about Facebook is to assume that everything you publish to your Facebook network can be republished by the members of your network under the terms of their privacy conditions. So if I publish a photo that you can see, then I have to assume that you can also publish it under your privacy settings. And so on…
This contrasts with a view of each object having a privacy setting, and that by publishing an object, the publisher controls that setting. So for example, I could publish an object and say it could only be seen by friends of me, and that setting would stick with the object. If you tried to republish it, it could only be republished to your friends who are also my friends. My privacy settings would set the scope, or maximum reach, of your republication of it.
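The difference between the two views is easy to state as a toy model. This minimal Python sketch is purely illustrative: the friend lists are invented, the function names are mine, and it says nothing about how Facebook actually works. It just compares the reach of a republished item when the republisher’s settings apply with the reach when my “friends only” scope sticks to the object.

```python
def reach_under_republisher_settings(republisher_friends):
    """First view: whoever republishes decides, so all of their friends can see it."""
    return set(republisher_friends)


def reach_under_sticky_scope(owner_friends, republisher_friends):
    """Second view: my scope travels with the object, capping the audience."""
    return set(owner_friends) & set(republisher_friends)


# Invented friend lists.
my_friends = {"Ann", "Bob", "Cat"}
anns_friends = {"Me", "Bob", "Dan"}

print(sorted(reach_under_republisher_settings(anns_friends)))      # ['Bob', 'Dan', 'Me']
print(sorted(reach_under_sticky_scope(my_friends, anns_friends)))  # ['Bob']
```

In the first case Dan, who I have never friended, gets to see the item; in the second, only Bob does, because he is the only person Ann and I have in common.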
Regular readers will know I’ve started looking at ways of visualising Facebook networks using Gephi. What I’m starting to think is that Facebook should offer a visualisation of the furthest reach of a person’s data, videos, images, updates, etc, given their current privacy settings (or preview changes to that reach if they want to test out new privacy settings.
PS re the visualisation thing – something like this, generated from your current settings, would do the job nicely:
More at The Evolution of Privacy on Facebook, including a view of just how open things are now…
[Elements of this post have been largely deprecated since I drafted it a couple of weeks ago, but I'm posting it anyway because this is my open notebook, and as such it has a role in logging the things that are maybe dead ends, as well as hopefully more useful stuff...]
At its heart, Gephi is an environment for visualising graphs (or networks) in which “nodes” are connected to each other by “edges”. Nodes are represented using circles whose size and colour represent particular characteristics of the node. So for example, if you were to visualise your Facebook friends, a node might represent a particular friend, the size of the node might be proportional to the number of friends they have, and the colour to how many photos they have uploaded. Lines between nodes would then show who is a friend of whom. But must we always add the lines between the nodes? If we leave them out, can we effectively use Gephi as a tool for generating charts like the Many Eyes bubble charts?
One of the data import formats supported by Gephi is the gdf format (gdf documentation), which expects a list of node definitions, followed by a list of edge connections. If we ignore the edges, then we can just import a set of node definitions, and create a bubble chart.
As an example of this, let’s see what we can do with some of the Transparency in procurement and contracting information released by the Cabinet Office. As part of the data release, they publish a CSV file containing a summary of all the tender documents held:
Looking at the data, we see that each tender is represented by one or more documents. Each row in the CSV file gives us information about the tender (its project ID, originating department, expected value, expected duration) as well as the particular document. So if we view each tender as a “bubble” or node in Gephi, we might want to represent it as follows:
nodedef> name VARCHAR,label VARCHAR, procid VARCHAR, estVal DOUBLE,estDur DOUBLE,date VARCHAR, dept VARCHAR, desc VARCHAR, nature VARCHAR
402846,"Spring Electoral Events Contact Centre",402846,125000,48,"17/09/2010","Central Office of Information","Invitation to Tender","Competition as part of an existing framework agreement"
"2010CMTLSE00001","Supply of body armour to HMCS","2010CMTLSE00001",250000,48,"24/09/2010","Ministry of Justice","RFQ instructions","Competition as part of an existing framework agreement"
Note that the GDF file requires a particular sort of header, followed by CSV rows of data. It’s easy enough in this case to simply edit the original CSV file by deleting the first line, tweaking the column headers to the name VARCHAR, label VARCHAR… format required by the GDF file, and prefixing the new first row (the header row) with nodedef>.
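If you'd rather script the hand edit described above, something like the following minimal sketch would do it: drop the CSV header row, write a nodedef> header, and quote the text columns. The function name, the sample column headings and the choice to swap embedded double quotes for single quotes are all assumptions made for the illustration, not anything the GDF documentation mandates.

```python
import csv
import io


def csv_to_gdf(csv_text, columns):
    """Turn CSV text into GDF node definitions.

    columns is a list of (gdf_name, gdf_type) pairs, one per CSV column,
    in the same order as the CSV columns.
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # discard the original CSV header row
    lines = ["nodedef> " + ",".join(f"{name} {typ}" for name, typ in columns)]
    for row in rows:
        cells = []
        for value, (_, gdf_type) in zip(row, columns):
            if gdf_type == "VARCHAR":
                # Quote text cells; replacing embedded double quotes
                # with single quotes is a simplifying assumption
                cells.append('"' + value.replace('"', "'") + '"')
            else:
                cells.append(value)
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical CSV headings; the data row is taken from the example above
sample = (
    "Project ID,Title,Estimated value\n"
    '402846,"Spring Electoral Events Contact Centre",125000\n'
)
gdf = csv_to_gdf(
    sample,
    [("name", "VARCHAR"), ("label", "VARCHAR"), ("estVal", "DOUBLE")],
)
print(gdf)
```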
However, I’ve recently started exploring the use of the browser based desktop application Google Refine (formerly Freebase Gridworks) as a step in my workflow for tidying up CSV data and then getting it into the GDF format.
Here’s what it looks like once the data has been imported:
(For a great overview of what Gridworks allows you to do with data, see @jenit’s Using Freebase Gridworks to Create Linked Data.)
The data I want to visualise in Gephi relates to the current tenders, rather than anything to do with the actual documents, so we can use Gridworks to simplify the data set by deleting the document type, document name and contact email columns. We can also check that columns using a restricted vocabulary (e.g. the type of tender being offered), do so consistently. For example, if we look at the Nature of the Tender Process column, and select Cluster and Edit…:
we can see that there may be the odd typo that we can correct automatically:
The Description column also has various categories we can tidy up:
Here are the data tidying steps I’ve applied:
At the time of writing, a new version of Google Refine/Gridworks is about to be released. In the version I’m using, I don’t think it’s possible to remove duplicate rows, which we have aplenty in my tidied up dataset (where several documents were listed for a tender, there are now several identical rows in my dataset). [Google Refine 2.0 is out now - and I don't think it can de-dupe?] However, I happen to know that Gephi will ignore duplicates of nodes loaded into Gephi, so we can do the de-dupe/de-duplication step there…
To generate the GDF file, we need to create a header line, and then define the output pattern for each row. We can do this using Gridworks’ Templating support:
Here’s how I define the output document:
Note that the linebreaks will need removing in order to generate the correct output format. Also, in the version of Gridworks I’m using, it’s worth noting that whenever you run the template, you’re returned to the main data window and your template definition is lost… (so before running the template code, grab a copy of it into a text editor just to be safe;-)
When you export the data, it’s exported to your browser downloads directory, as a text file. Change the file extension from .txt to .gdf and import it into Gephi:
You’ll see that Gephi has detected the duplicate rows based on common name elements (that is, common project IDs), and ignored the copies/duplicates.
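For anyone who wants to de-dupe before the data gets anywhere near Gephi, the same effect can be sketched in a few lines of Python. This is not Gephi's own code, just an illustration of de-duplicating on a key; keeping the first copy of each row is my assumption, and the sample rows are trimmed-down versions of the tender data.

```python
def dedupe_by_key(rows, key):
    """Keep the first row seen for each value of key, drop later copies."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Several documents per tender leave us with repeated project IDs
tenders = [
    {"name": "402846", "estVal": 125000},
    {"name": "402846", "estVal": 125000},
    {"name": "2010CMTLSE00001", "estVal": 250000},
]
unique_tenders = dedupe_by_key(tenders, "name")
print(len(unique_tenders))
```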
Now we can view the procurement data using proportional symbol visualisations – here I size the nodes by estimated value (and display the label size in proportion to node size), and colour the nodes according to estimated duration:
[Since drafting this post, I've found a far better way of getting just the node data into Gephi - load it into the data table as a node table. I'll post more on how to do this in a follow on post...]
(The Many Eyes take on Bubble Charts ignores x/y co-ordinates as useful data, although other definitions of Bubble Charts include x/y location as important factors. In the current example, I allow Gephi to layout the nodes/bubbles. However, we can define x and y co-ordinates in the gdf file if we want to specifically locate the bubbles on the canvas.)
We can also use Gephi to cluster the data according to calling departments, or type of procurement exercise:
I *think* the size of the resulting bubble is proportional to the sum of the values used to inform the node size of the original components, so we should be able to group by procurement exercise type and have the bubble size be proportional to the sum of the estimated values of the procurement projects in that procurement class.
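If that guess is right, the grouped bubble sizes should match a simple group-and-sum over the node table. Here is a sketch of that calculation for checking against what Gephi shows; the field names and the second category label are placeholders, and this is not a description of what Gephi does internally.

```python
from collections import defaultdict


def group_totals(rows, group_key, value_key):
    """Sum value_key over all rows sharing the same group_key value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Placeholder rows: 'nature' is the procurement type, 'estVal' the value
rows = [
    {"nature": "Framework competition", "estVal": 125000},
    {"nature": "Framework competition", "estVal": 250000},
    {"nature": "Open tender", "estVal": 40000},
]
totals = group_totals(rows, "nature", "estVal")
print(totals)
```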
We can also expand a clustered node to see what activity is related to it – in this case, here are tenders from the British Library:
Going back to the full list, here we size by estimated value and colour by type of procurement:
We can also generate views over the data using filters – so for example, COI sponsored procurement:
One thing that Gephi doesn’t currently support is a treemap style visualisation. However, now we have deduped the data by importing it into Gephi, we can export it as a simple CSV file from the datatable view, and then upload the data to Many Eyes and make use of its treemap:
We use TSV because that is the preferred format for Many Eyes… (data file on Many Eyes)
Here’s one configuration of the treemap:
With the data in Many Eyes, we can of course easily generate other views over it, such as a histogram:
(NB in the original data, the Estimated value column – which should contain numbers – also contained a few unknown elements:
Because code that expects numbers sometimes chokes on text, I should maybe have set the unknown values to a default value as shown above?)
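A minimal sketch of that sort of clean-up in Python: try to read each cell as a number and fall back to a default when that fails. The default of 0 and the stripping of thousands separators are assumptions for the illustration; a default of 0 will of course shrink those bubbles to nothing, so pick whatever suits the chart.

```python
def to_number(cell, default=0.0):
    """Read a text cell as a float, falling back to default for
    anything that isn't numeric (e.g. 'Unknown')."""
    try:
        # Strip thousands separators before converting
        return float(cell.replace(",", ""))
    except ValueError:
        return default


estimated_values = ["125000", "Unknown", "1,250"]
cleaned = [to_number(v) for v in estimated_values]
print(cleaned)
```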
Okay – so what have we covered in this post?
- how to start cleaning/preparing data in Freebase Gridworks/Google Refine;
- how to use Freebase Gridworks/Google Refine to generate an output file according to a template;
- how to use Gephi to deduplicate data based on a common field (in this case, the project id);
- how to use Gephi as a proportional symbol/bubble chart visualisation tool;
- how to export data from Gephi and upload it into Many Eyes;
- how to use Many Eyes to generate a treemap.
As ever, this blog post took longer to write than it took me to work through the exercise originally.
To corrupt a well known saying, “cook a man a meal and he’ll eat it; teach a man a recipe, and maybe he’ll cook for you…”, I thought it was probably about time I posted the recipe I’ve been using for laying out Twitter friends networks using Gephi, not least because I’ve been generating quite a few network files for folk lately, giving them copies, and then not having a tutorial to point them to. So here’s that tutorial…
The starting point is actually quite a long way down the “how did you do that?” chain, but I have to start somewhere, and the middle’s easier than the beginning, so that’s where we’ll step in (I’ll give some clues as to how the beginning works at the end…;-)
Here’s what we’ll be working towards: a diagram that shows how the people on Twitter that @wiredUK follows follow each other:
The tool we’re going to use to layout this graph from a data file is a free, extensible, open source, cross platform Java based tool called Gephi. If you want to play along, download the datafile. (Or try with a network of your own, such as your Facebook network or social data grabbed from Google+.)
From the Gephi file menu, Open the appropriate graph file:
Import the file as a Directed Graph:
The Graph window displays the graph in a raw form:
Sometimes a graph may contain nodes that are not connected to any other nodes. (For example, protected Twitter accounts do not publish – and are not published in – friends or followers lists publicly via the Twitter API.) Some layout algorithms may push unconnected nodes far away from the rest of the graph, which can affect generation of presentation views of the network, so we need to filter out these unconnected nodes. The easiest way of doing this is to filter the graph using the Giant Component filter.
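To give a feel for what a giant component filter keeps, here is a rough sketch that finds the largest connected group of nodes using a breadth-first search. It ignores edge direction when deciding what is connected, which is my assumption rather than a statement about Gephi's filter, and the node names are invented.

```python
from collections import defaultdict, deque


def giant_component(nodes, edges):
    """Return the largest set of nodes connected to each other,
    treating every edge as undirected."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    best = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search to collect everything reachable from start
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


nodes = ["a", "b", "c", "d", "e", "protected"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
kept = giant_component(nodes, edges)
print(sorted(kept))
```

The unconnected "protected" node and the small d/e pair both drop out, leaving just the main cluster.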
To colour the graph, I often make use of the modularity statistic. This algorithm attempts to find clusters in the graph by identifying components that are highly interconnected.
This algorithm is a random one, so it’s often worth running it several times to see how many communities typically get identified.
A brief report is displayed after running the statistic:
While we have the Statistics panel open, we can take the opportunity to run another measure: the HITS algorithm. This generates the well known Authority and Hub values which we can use to size nodes in the graph.
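For the curious, the idea behind HITS can be sketched as a simple iteration: a node's authority is the sum of the hub scores of the nodes pointing at it, and a node's hub score is the sum of the authority scores of the nodes it points to. The sketch below is a textbook-style version with a fixed number of iterations; the normalisation and stopping rule Gephi uses may well differ, and the tiny edge list is invented.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes linking in
        auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the nodes linked to
        hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# (follower, followed) pairs: a and b both follow c, c follows d
edges = [("a", "c"), ("b", "c"), ("c", "d")]
hub, auth = hits(edges)
print(max(auth, key=auth.get))
```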
The next step is to actually colour the graph. In the Partition panel, refresh the partition options list and then select Modularity Class.
Choose appropriate colours (right click on each colour panel to select an appropriate colour for each class – I often select pastel colours) and apply them to the graph.
The next thing we want to do is lay out the graph. The Layout panel contains several different layout algorithms that can be used to support the visual analysis of the structures inherent in the network (try some of them – each works in a slightly different way; some are also better than others for coping with large networks). For a network this size and this densely connected, I’d typically start out with one of the force directed layouts, which position nodes according to how tightly linked they are to each other.
When you select the layout type, you will notice there are several parameters you can play with. The default set is often a good place to start…
Run the layout tool and you should see the network start to lay itself out. Some algorithms require you to actually Stop the layout algorithm; others terminate themselves according to a stopping criterion, or because they are a “one-shot” application (such as the Expansion algorithm, which just scales the x and y values by a given factor).
We can zoom in and out on the layout of the graph using a mouse wheel (on my MacBook trackpad, I use a two finger slide up and down), or use the zoom slider from the “More options” tab:
To see which Twitter ID each node corresponds to, we can turn on the labels:
This view is very cluttered – the nodes are too close to each other to see what’s going on. The labels and the nodes are also all the same size, giving the same visual weight to each node and each label. One thing I like to do is resize the nodes relative to some property, and then scale the label size to be proportional to the node size.
Here’s how we can scale the node size and then set the text label size to be proportional to node size. In the Ranking panel, select the node size property, and the attribute you want to make the size proportional to. I’m going to use Authority, which is a network property that we calculated when we ran the HITS algorithm. Essentially, it’s a measure of how well linked to a node is.
The min size/max size slider lets us define the minimum and maximum node sizes. By default, a linear mapping from attribute value to size is used, but the spline option lets us use a non-linear mapping.
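The linear mapping amounts to a straightforward rescaling of the attribute range onto the size range. A minimal sketch, with parameter names of my own choosing (this is the general formula, not Gephi's source):

```python
def linear_size(value, lo, hi, min_size, max_size):
    """Map value from the attribute range [lo, hi] onto [min_size, max_size]."""
    if hi == lo:
        # Every node has the same attribute value, so use the minimum size
        return min_size
    return min_size + (value - lo) / (hi - lo) * (max_size - min_size)


# An authority score halfway up the range gets a size halfway up the slider
size = linear_size(0.5, 0.0, 1.0, 10, 50)
print(size)
```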
I’m going with the default linear mapping…
We can now scale the labels according to node size:
Note that you can continue to use the text size slider to scale the size of all the displayed labels together.
This diagram is now looking quite cluttered – to make it easier to read, it would be good if we could spread it out a bit. The Expansion layout algorithm can help us do this:
A couple of other layout algorithms that are often useful: the Transformation layout algorithm lets us scale the x and y axes independently (compared to the Expansion algorithm, which scales both axes by the same amount); and the Clockwise Rotate and Counter-Clockwise Rotate algorithms let us rotate the whole layout (this can be useful if you want to rotate the graph so that it fits neatly into a landscape view).
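These "one-shot" layouts are just coordinate transforms, so they are easy to sketch. In the illustration below the scaling and rotation are done about the origin, which is an assumption on my part (Gephi may work about the centre of the layout), and the function names are my own.

```python
import math


def expand(positions, factor):
    """Scale x and y by the same factor (cf. the Expansion layout)."""
    return {n: (x * factor, y * factor) for n, (x, y) in positions.items()}


def transform(positions, x_factor, y_factor):
    """Scale x and y independently (cf. the Transformation layout)."""
    return {n: (x * x_factor, y * y_factor) for n, (x, y) in positions.items()}


def rotate(positions, degrees):
    """Rotate counter-clockwise by the given angle in standard maths axes;
    pass a negative angle for a clockwise turn."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return {n: (x * c - y * s, x * s + y * c) for n, (x, y) in positions.items()}


layout = {"a": (1.0, 0.0), "b": (2.0, 3.0)}
spread = expand(layout, 2)
stretched = transform(layout, 3, 1)
turned = rotate(layout, 90)
print(spread, stretched, turned)
```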
The expanded layout is far easier to read, but some of the labels still overlap. The Label Adjust layout tool can jiggle the nodes so that they don’t overlap.
(Note that you can also move individual nodes by clicking on them and dragging them.)
So – nearly there… The final push is to generate a good quality output. We can do this from the preview window:
The preview window is where we can generate good quality SVG renderings of the graph. The node size, colour and scaled label sizes are determined in the original Overview area (the one we were working in), although additional customisations are possible in the Preview area.
To render our graph, I just want to make a couple of tweaks to the original Default preview settings: Show Labels and set the base font size.
Click on the Refresh button to render the graph:
Oops – I overdid the font size… let’s try again:
Okay – so that’s a good start. Now I find I often enter into a dance between the Preview and Overview panels, tweaking the layout until I get something I’m satisfied with, or at least, that’s half-way readable.
How to read the graph is another matter of course, though by using colour, sizing and placement, we can hopefully draw out in a visual way some interesting properties of the network. The recipe described above, for example, results in a view of the network that shows:
- groups of people who are tightly connected to each other, as identified by the modularity statistic and consequently group colour; this often defines different sorts of interest groups. (My follower network shows distinct groups of people from the Open University, and JISC, the HE library and educational technology sectors, UK opendata and data journalist types, for example.)
- people who are well connected in the graph, as displayed by node and label size.
Here’s my final version of the @wiredUK “inner friends” network:
You can probably do better though…;-)
To recap, here’s the recipe again:
- filter on connected component (private accounts don’t disclose friend/follower detail to the api key i use) to give a connected graph;
- run the modularity statistic to identify clusters; sometimes I try several attempts
- colour by modularity class identified in previous step, often tweaking colours to use pastel tones
- I often use a force directed layout, then Expansion to spread the network out a bit if necessary; the Clockwise Rotate or Counter-Clockwise Rotate will rotate the network view; I often try to get a landscape format; the Transformation layout lets you expand or contract the graph along a single axis, or both axes by different amounts.
- run HITS statistic and size nodes by authority
- size labels proportional to node size
- use label adjust and expand to tweak the layout
- use preview with proportional labels to generate a nice output graph
- iterate previous two steps to get a layout that is hopefully not completely unreadable…
Finally, to return to the beginning. The recipe I use to generate the data is as follows:
- grab a list of twitter IDs (call it L); there are several ways of doing this, for example: obtain a list of tweets on a particular topic by searching for a particular hashtag, then grab the set of unique IDs of people using the hashtag; grab the IDs of the members of one or more Twitter lists; grab the IDs of people following or followed by a particular person; grab the IDs of people sending geo-located tweets in a particular area;
- for each person P in L, add them as a node to a graph;
- for each person P in L, get a list of people followed by the corresponding person, e.g. Fr(P)
- for each X in Fr(P): if X is also in L, create an edge [P,X] and add it to the graph
- save the graph in a format that can be visualised in Gephi.
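The data generation recipe above can be sketched as follows. To keep it self-contained, the friends lists are passed in as a plain dictionary rather than fetched from the Twitter API, so friends_of and the sample IDs are stand-ins, not the script I actually use. The edgedef> header follows my reading of the gdf documentation and should be treated as an assumption.

```python
def friends_network(id_list, friends_of):
    """Build (nodes, edges) for the people in id_list, keeping only
    'follows' edges where both ends are in id_list."""
    members = set(id_list)
    nodes = list(id_list)
    edges = []
    for p in id_list:
        for x in friends_of.get(p, []):
            if x in members:
                edges.append((p, x))
    return nodes, edges


def to_gdf(nodes, edges):
    """Serialise nodes and edges as GDF text: node definitions first,
    then the edge list."""
    lines = ["nodedef> name VARCHAR"]
    lines += nodes
    lines.append("edgedef> node1 VARCHAR,node2 VARCHAR")
    lines += [f"{a},{b}" for a, b in edges]
    return "\n".join(lines) + "\n"


# Stand-in data: who each person in L follows
L = ["a", "b", "c"]
friends_of = {"a": ["b", "z"], "b": ["a"], "c": []}

nodes, edges = friends_network(L, friends_of)
print(to_gdf(nodes, edges))
```

Note that "z" is followed by "a" but never makes it into the graph, because "z" is not in L.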
To make this recipe, I use Tweepy and a Python script to call the Twitter API and get the friends lists from there, but you could use the Google Social API to get the same data. There’s an example of calling that API using JavaScript in my “live” Twitter friends visualisation script (Using Protovis to Visualise Connections Between People Tweeting a Particular Term) as well as in the A Bit of NewsJam MoJo – SocialGeo Twitter Map.
Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine
What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…
After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)
Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list (https://graph.facebook.com/me/friends?access_token=A_LONG_JUMBLE_OF_LETTERS) or the list of likes someone has made (https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS); replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.
(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)
Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:
Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:
If you click the highlighted selection, you should see the data that will be used to create your project:
You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:
We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:
The Likes URL has the form https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS which we’ll tinker with as follows:
The throttle control tells Refine how often to make each call. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what limit the Facebook API is happy with (if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…)?
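Outside of Refine, the same pattern (build a Likes URL from each friend's ID, then fetch them one at a time with a pause in between) might be sketched like this in Python. The URL form is the one quoted above with "me" swapped for the friend's ID; the function names are my own, and the fetch function is passed in as a parameter so the sketch makes no claim about how, or whether, the API will answer.

```python
import time

GRAPH_BASE = "https://graph.facebook.com"


def likes_url(friend_id, access_token):
    """Build the Likes URL for one friend from their Facebook ID."""
    return f"{GRAPH_BASE}/{friend_id}/likes?access_token={access_token}"


def fetch_all(friend_ids, access_token, fetch, delay_seconds=0.5):
    """Call fetch(url) for each friend in turn, pausing between calls
    so as not to hit the API too fast."""
    results = {}
    for i, friend_id in enumerate(friend_ids):
        if i:
            time.sleep(delay_seconds)
        results[friend_id] = fetch(likes_url(friend_id, access_token))
    return results


# Stand-in fetcher that just echoes the URL it was asked for
demo = fetch_all(
    ["123", "456"],
    "A_LONG_JUMBLE_OF_LETTERS",
    fetch=lambda url: url,
    delay_seconds=0,
)
print(demo["123"])
```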
Having imported the data, you should find a new column:
At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exported the data using the default Templating… export format, which produces some sort of JSON output…
I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:
import simplejson,csv
writer=csv.writer(open('fbliketest.csv','wb+'),quoting=csv.QUOTE_ALL)
fn='my-fb-friends-likes.txt'
data = simplejson.load(open(fn,'r'))
id=0
for d in data['rows']:
    id=id+1
    #'interests' is the column name containing the Likes data
    interests=simplejson.loads(d['interests'])
    for i in interests['data']:
        print str(id),i['name'],i['category']
        writer.writerow([str(id),i['name'].encode('ascii','ignore')])
[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]
I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:
Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.
PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
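In Python terms, the pattern I have in mind looks something like the sketch below: iterate over the items held in one cell, and for each item copy the original row and add a new column holding that item. The function and column names are made up for the illustration; this is a wish-list sketch, not a description of an existing Refine feature.

```python
def rows_from_column(rows, column, new_column):
    """For every item in row[column], emit a copy of the row with
    new_column set to that item."""
    out = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)          # 1) duplicate the original row
            new_row[new_column] = item   # 2) + 3) add and fill the new column
            out.append(new_row)
    return out


friends = [
    {"id": "1", "interests": ["Gephi", "OpenRefine"]},
    {"id": "2", "interests": ["Gephi"]},
]
expanded = rows_from_column(friends, "interests", "like")
print([(r["id"], r["like"]) for r in expanded])
```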
PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)
PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.
PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et al. suggests it’s possible to profile people based on their Likes. It would be interesting to compare how robust that profiling is, compared to profiles based on the common Likes of a person’s followers, or the common likes of folk in the same Facebook groups as an individual?