Following on from the previous post on comment graphs, here’s something a little more interesting.
The data I’m working with is a CSV file that contains a list of data pairs; for each comment on a photo, it gives the user ID of the person making the comment and the PhotoID the comment was made on. In a second file, I have pairs of photo ID and the user ID of the person who took the photo. I’m thinking it would be interesting to look directly at the edges between user IDs, using a directed graph in which the edges go from the person who made a comment to the person who uploaded the photo being commented on.
So how to generate this merged data file? One non-programmatic way that occurred to me was to use a Google Fusion Table. Simply upload the two data files, and then create a new one that merges the data around the photo ID (that is, photo ID is the common term that joins together the ID of someone commenting, and the person who uploaded the photo being commented on):
That gives a merged data table that looks something like this:
This data can be exported, and the photo ID column removed to give a two column CSV file that contains the user ID of someone who took a photo, paired (optionally) with the ID of a person who commented on that photo.
Where multiple people commented on the same photo, multiple rows will result.
Loading the data into Gephi, colouring nodes by out-degree (from photographer ID to commenter ID) and sizing them by in-degree (number of incoming comments) gives something like this:
If we auto-select neighbours and colour the edges according to direction, we get:
We can also see what happens if we try to cluster the network using the Modularity filter – several partitions are identified, and expanding one lets us see which users were grouped together, presumbaly becuase there was a high degree of commenting between them:
If we now run the Network Diameter statistic, we can look at the Betweeness Centrality across the commenting network, colouring for InDegree and sizing for Betweenness:
The resulting chart shows us which individuals are most active in terms of commenting on, and being commented upon, by other members of the network.
We can generate a similar saw of representation based on favouriting behaviour too. In this case, I started off withteh Favouriter table, and then merged in the Photo user id table – which meant that every favourite had a user ID of the photographer associated with it compared to the above case where I started with the photo table and then merged in the commenter IDs – which meant there were some rows that only had photographer ID (ie some photos had no comments… which means, if we filter the nodes in the comment graph based on in-degree zero, we can view the individuals who have received no comments? (Which may be because they didn’t upload any photos? Eg they might be moderators?)
In the following image, we see individuals whose photos have been heavily favourited. In the top left we see an individual who has favourited lots of of photos (lots of links going out) but not been favourited in return:
If we return the Network stats, and look at Betweenness, we can see which individuals are favouriting widely across the whole userbase:
So there we have it. Using Google Fusion tables, we can generate a user-to-user graph that relates the IDs of users commenting on (or favouriting) each others photos, based on two separate data sets: one that relates user ID of a commenter to a photo; and a second that relates a photo to the ID of the user who uploaded the photo. The resulting graph userID to userID data allows us to use Gephi to plot diagrams that use directed edges to show person A favourited person B’s photo, or person A had a photo commented on by person B.