Plotting Comment Networks in Gephi, Part II – Merging Datasets Using Google Fusion Tables

Following on from the previous post on comment graphs, here’s something a little more interesting.

The data I’m working with is a CSV file that contains a list of data pairs: for each comment on a photo, it gives the user ID of the person making the comment and the photo ID of the photo the comment was made on. In a second file, I have pairs of photo ID and the user ID of the person who took the photo. I’m thinking it would be interesting to look directly at the edges between user IDs, using a directed graph in which the edges go from the person who made a comment to the person who uploaded the photo being commented on.

So how to generate this merged data file? One non-programmatic way that occurred to me was to use a Google Fusion Table. Simply upload the two data files, and then create a new one that merges the data around the photo ID (that is, photo ID is the common term that joins together the ID of someone commenting, and the person who uploaded the photo being commented on):

Google Fusion Table

That gives a merged data table that looks something like this:

Merged data

This data can be exported, and the photo ID column removed to give a two column CSV file that contains the user ID of someone who took a photo, paired (optionally) with the ID of a person who commented on that photo.

Where multiple people commented on the same photo, multiple rows will result.
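For anyone who would rather script the merge than use a Fusion Table, here is a minimal sketch of the same join in Python, using only the standard library. The column names, the sample IDs and the column order are all assumptions made up for illustration, not the real data. It starts from the photo table and merges in the commenters, so a photo with no comments still produces a row with an empty commenter column.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the column names and IDs are invented for illustration.
PHOTOS_CSV = """photo_id,photographer_id
p189,u-ab123
p226,u-cd546
p300,u-ef789
"""

COMMENTS_CSV = """commenter_id,photo_id
u-jd342,p189
u-ab123,p226
u-gh232,p226
"""


def merge_on_photo_id(photos_text, comments_text):
    """Join photos to comments on photo_id, then drop the photo_id column."""
    # Index the commenters by the photo they commented on.
    commenters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(comments_text)):
        commenters[row["photo_id"]].append(row["commenter_id"])

    merged = []
    for row in csv.DictReader(io.StringIO(photos_text)):
        # A photo with no comments still yields one row, with an empty commenter.
        for commenter in commenters.get(row["photo_id"], [""]):
            merged.append((row["photographer_id"], commenter))
    return merged


rows = merge_on_photo_id(PHOTOS_CSV, COMMENTS_CSV)
for photographer, commenter in rows:
    print(f"{photographer},{commenter}")
```

Photo p226 has two comments and so produces two rows, while p300 has none and produces a single row with only the photographer ID.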

Loading the data into Gephi, colouring nodes by out-degree (from photographer ID to commenter ID) and sizing them by in-degree (number of incoming comments) gives something like this:

comment/photographer network

If we auto-select neighbours and colour the edges according to direction, we get:

Neighbours in a directed graph (gephi)

We can also see what happens if we try to cluster the network using the Modularity filter – several partitions are identified, and expanding one lets us see which users were grouped together, presumably because there was a high degree of commenting between them:

Gephi - clustering using modularity statistic

If we now run the Network Diameter statistic, we can look at the Betweenness Centrality across the commenting network, colouring for InDegree and sizing for Betweenness:

Gephi - betweenness centrality

The resulting chart shows us which individuals are most active in terms of commenting on, and being commented upon, by other members of the network.

We can generate a similar sort of representation based on favouriting behaviour too. In this case, I started off with the Favouriter table and then merged in the Photo user ID table, which meant that every favourite had the user ID of the photographer associated with it. Compare this with the case above, where I started with the photo table and then merged in the commenter IDs, which meant there were some rows that only had a photographer ID (i.e. some photos had no comments). This in turn means that if we filter the nodes in the comment graph on in-degree zero, we can view the individuals who have received no comments? (Which may be because they didn’t upload any photos? E.g. they might be moderators?)

In the following image, we see individuals whose photos have been heavily favourited. In the top left we see an individual who has favourited lots of photos (lots of links going out) but not been favourited in return:

Favouriting behaviour in Gephi

If we rerun the Network stats, and look at Betweenness, we can see which individuals are favouriting widely across the whole userbase:

Betweenness

So there we have it. Using Google Fusion Tables, we can generate a user-to-user graph that relates the IDs of users commenting on (or favouriting) each other’s photos, based on two separate data sets: one that relates the user ID of a commenter to a photo; and a second that relates a photo to the ID of the user who uploaded it. The resulting userID-to-userID graph data allows us to use Gephi to plot diagrams that use directed edges to show that person A favourited person B’s photo, or that person A had a photo commented on by person B.

Gephi – Comment Graphs

In several recent posts, I’ve been exploring in a rather clumsy and overcomplicated way how we can use Gephi to look at commenting behaviour around shared photos. In this post, I take a step back to look at a much simpler approach… Just plotting the comment graph…

So for example, take a CSV file along the lines of:

c-jd342,p189
c-cd546,p226
c-gh232,p226
where each row gives a commenter ID and a photo ID, and look at the result as a directed graph.

Here’s the graph with nodes sized by in-degree (large nodes are well commented photos):

Gephi - popular photos

And then for nodes sized by out-degree (large nodes are people who have commented on lots of photos):

Gephi - photo comment graph
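The in-degree and out-degree used to size the nodes in the two plots above are easy enough to check by hand. Here's a rough sketch in Python that counts them from the three sample rows given earlier, treating each row as a directed edge from commenter to photo. Gephi works these figures out itself; this is only a way of sanity-checking them.

```python
import csv
import io
from collections import Counter

# The three sample rows from the example CSV: commenter ID, photo ID.
EDGES_CSV = """c-jd342,p189
c-cd546,p226
c-gh232,p226
"""

in_degree = Counter()   # comments received by each photo
out_degree = Counter()  # comments made by each commenter

for row in csv.reader(io.StringIO(EDGES_CSV)):
    if not row:
        continue  # skip any blank lines
    commenter, photo = row
    out_degree[commenter] += 1
    in_degree[photo] += 1

# Photos ranked by number of incoming comments, most commented first.
for photo, count in in_degree.most_common():
    print(photo, count)
```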

Here we colour nodes by out-degree and size them by in-degree, using colour range boundaries to help us colour the graph:

Colouring nodes in gephi

Using the Autoselect neighbour feature we can look at who commented on a photo:

Gephi autoselect

and then what other photos they commented on:

Gephi autoselect again

The ego filter trivially shows us, at depth 1, the photos a person has commented on:

Comment graph - a single person's comments

(If the photo nodes were sized according to the number of incoming comments from the original graph, we’d be able to see which popularly commented on photos an individual had commented on)

At ego filter depth 2, we can see who else has commented on the same photos:

Who else has commented on the same photos

Running statistics over the depth 2 ego filter graph and plotting the results would allow us to see which individuals have commented on a large number of the same photos as the ego filtered individual.
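As a sketch of what that statistic might look like outside Gephi, the following Python counts, for a chosen individual, how many photos each other commenter shares with them. The edge list is made-up sample data, and the walk ignores edge direction on the second hop (photo back to commenter), which is an assumption about how we'd want the depth 2 neighbourhood to behave.

```python
from collections import Counter, defaultdict

# Hypothetical (commenter, photo) edges, invented for illustration.
edges = [
    ("c-jd342", "p189"),
    ("c-jd342", "p226"),
    ("c-cd546", "p226"),
    ("c-gh232", "p226"),
    ("c-gh232", "p189"),
    ("c-zz999", "p300"),
]


def co_commenters(edges, ego):
    """Count the photos each other commenter has in common with `ego`."""
    photos_by_user = defaultdict(set)
    users_by_photo = defaultdict(set)
    for user, photo in edges:
        photos_by_user[user].add(photo)
        users_by_photo[photo].add(user)

    shared = Counter()
    for photo in photos_by_user[ego]:        # depth 1: photos the ego commented on
        for other in users_by_photo[photo]:  # depth 2: everyone else on those photos
            if other != ego:
                shared[other] += 1
    return shared


shared = co_commenters(edges, "c-jd342")
for user, count in shared.most_common():
    print(user, count)
```

Here c-gh232 shares two photos with the chosen individual and c-cd546 shares one, while c-zz999 never appears because they only commented on a photo outside the ego network.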

Gephi Bits 1: Comments on Social Objects in a Closed Community

This is the first in a series of bitty posts (if it makes less sense than usual, tough) just cobbling together a couple of observations about some of the things it looks like you can get Gephi to do with variously formatted network data…

The setting is data from an OU course (U101 Design thinking: creativity for the 21st century), in which students (with unique identifiers) post images to a course public space, and then comment on and favourite each other’s images.

A research project (that I’m not officially part of…;-) is looking at how the commenting and favouriting behaviour develops, whether it influences the work students do and I guess whether there is any correlation with grades. After a brainstorming chat with Jennefer Hart yesterday, I had a little tinker last night and again this morning with some of the data, and here’s where I’ve got… (This is open notebook science the informal and scruffy way, right?!;-)

The data comes in various spreadsheets:
– a sheet containing photo IDs (a number), user IDs (alphanumeric), date of upload, etc;
– a sheet containing photo IDs, comment IDs (a number), the comment, date of submission, and if it’s a reply to another comment, the ID of that other comment (a number);
– a sheet containing photo IDs, favourite IDs (a number), and the user ID of the person who favorited the image;
– a sheet containing a list of student group IDs; students are assigned to different groups for different epochs within the course. Every so often new groups (with new IDs) form and students are assigned to these new groups.

So – what can we do with this data? The first thing I did was to try to error trap confusion between numerical photo IDs, comment IDs and favourite IDs, so I rewrote these in the form pNNNN, cNNNN and fNNNN respectively. Gephi will use the ID to identify each separate node, so we need to make sure that a node representing photo id 234 is not treated as the same node as comment id 234.

I actually augmented the data using a text editor, e.g. taking three column data presented in CSV style as [commentID, photoID, username] and running the following search and replace expression over it:
(\d*),(\d*),(.*)\r -> c\1,p\2,\3\r
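The same search and replace can be done in a script. Here's a minimal sketch using Python's re module, run over a couple of made-up rows. It differs from the text editor version in two small ways, both my own assumptions: it uses \d+ rather than \d* so that empty ID fields are left alone, and it anchors on line starts and ends rather than matching a literal \r.

```python
import re

# Hypothetical rows in [commentID, photoID, username] order.
raw = "101,234,alice\n102,234,bob\n"

# One match per line: two numeric IDs, then the rest of the row.
pattern = re.compile(r"^(\d+),(\d+),(.*)$", re.MULTILINE)

# Prefix comment IDs with "c" and photo IDs with "p" so that, for example,
# comment 234 and photo 234 can never collide as Gephi node IDs.
prefixed = pattern.sub(r"c\1,p\2,\3", raw)

print(prefixed)
```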

The next thing was to decide on the file format to use to get the data into Gephi. Gephi can accept CSV data, where each row describes the connections from one node to the next (so if you have a list of edges “a connects to b”, “a connects to c” etc, a two column CSV file could easily describe this).

So for example, taking a CSV dump of “photo id, comment id” pairs, we can generate something along the lines of this graph, where node size is the degree of the node, which is to say the number of edges impinging on the node ;-) (That is, the number of comments a photo has, for example…)

Photos by number of comments

(The layout was achieved by running the Yifan Hu layout algorithm for a few seconds with an optimal distance of 1000.)

One handy feature of Gephi (I think?) is that it appears to let us add data to the network already open in Gephi from another file. So for example, I think I can augment the photo’n’comment data with photo’n’faves data:

Merging graphs in gephi?

This is the effect I get when I load in the second data set…

Importing a 2nd data set that should share node IDs..

Is Gephi seeing photos with the same ID as the same node, whether they’re linked to comments or favorites? How can I tell? Maybe I should refresh the statistics and then replot the graph? The random layout is as good as any to start with:

Gephi random layout

Seems to look ok…. err..?;-)

So what can we learn? First of all, let’s find a photo that has a large number of inlinks (presumably – hopefully – the sum of favorites and comments…?) – we can use a filter to do this:

Finding the popular photos

Maybe one way to see what connects to popular nodes is to look at the Ego network? [See a much better way in the PS below…] Remove the previous filter to regain the whole graph, and we can have a play… Because I’ve loaded the data in as a directed graph (from comment to photo, or from favourite to photo), I don’t think a depth one ego search will work (because there are no links of depth 1 going away from the photo node). But if we explore a little further, it seems that for some reason a depth 2 search works, which is handy… [UPDATE – I think I’d messed my settings up – seems to work fine with depth 1…]

Gephi - looking at comments and faves round a photo

We can also use the data table to look at the list of comment and favourite IDs.

Okay, that’s enough for now… what have we done?

– loaded simple edge connection data (simple pairs – comment to photo, for example) into Gephi using CSV; I used a directed edge to distinguish between photos and annotations.
– added one graph to another: we started with comment data then added the favourite data in on top; in order to view the new data, it’s probably best to run the in/out degree statistic over the combined data set just to be sure you’re not looking at just comment or favourite inlink stats;
– spotted which photos are popular based on combined favourite and comment views, and then used (abused?) the Ego filter to see which comments and favourites were associated with an image. If we’d used undirected edges, the Ego filter might have worked at depth 1?
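To check that the combined in-degree really is the sum of comments and favourites, the merge can be mimicked in a few lines of Python. This is only a sketch over invented edge lists in the pNNNN/cNNNN/fNNNN form; it says nothing about how Gephi combines the files internally. Keeping the two counts separate also gives the two numbers per photo you would need to drive colour and size independently.

```python
from collections import Counter

# Hypothetical edge lists using the prefixed ID scheme (cNNNN, fNNNN, pNNNN).
comment_edges = [("c1", "p234"), ("c2", "p234"), ("c3", "p500")]
favourite_edges = [("f1", "p234"), ("f2", "p777")]

# Count incoming links per photo, separately for each kind of annotation.
comments_per_photo = Counter(photo for _, photo in comment_edges)
favourites_per_photo = Counter(photo for _, photo in favourite_edges)

# Because photo IDs are shared across the two lists, the combined in-degree
# of a photo is simply the sum of its two separate counts.
combined_in_degree = comments_per_photo + favourites_per_photo

for photo, total in combined_in_degree.most_common():
    print(photo, comments_per_photo[photo], favourites_per_photo[photo], total)
```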

And what comes to mind next? Firstly, it would be useful to render 2 dimensions of data, for example, colour to show the number of favourites and node size to show the number of comments. (I’m not sure how to do this? Could we maybe label/colour the edges and get a count based on that? OR maybe fudge it, having inlinks for comments and outlinks for faves?) Secondly, we need to start bringing in personal data – who uploaded which photo, who made which comment, and start to explore how active individuals are. But that’s all for another day…

PS following a comment by Alan Cann, I realised that because the graph is largely disjoint – there are separate clusters for each photo, each of which is only linked to by favourites and comments, with each favourite and comment only linking to one photo – if we run the modularity statistic we get a modularity close to 1, with clusters around each image:

Modularity classes/partitions

If we expand one of the classes, we can see the photo at the centre and the favourites and comments that (I think) apply to it:

Expanding a class

This seems plausible – that the modularity stat identifies the disjoint bits of graph? I wonder if there is a tool that will definitely and only split the graph into disjoint partitions?
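The graph theory name for "the disjoint bits" is connected components, and finding them needs nothing more than a breadth-first search that ignores edge direction. Here's a minimal sketch in Python over a made-up edge list. It splits the graph only where there is no path at all between two nodes, unlike modularity, which may also split or merge groups depending on how densely they are linked.

```python
from collections import defaultdict, deque

# Hypothetical annotation-to-photo edges, invented for illustration.
edges = [
    ("c1", "p234"),
    ("f1", "p234"),
    ("c2", "p500"),
    ("f2", "p500"),
    ("c3", "p777"),
]


def connected_components(edges):
    """Return the disjoint pieces of the graph, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search outwards from an unvisited node.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


components = connected_components(edges)
for component in components:
    print(sorted(component))
```

With the sample edges, each photo ends up in its own component along with the comments and favourites that point at it.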