This is the first in a series of bitty posts (if it makes less sense than usual, tough) just cobbling together a couple of observations about some of the things it looks like you can get Gephi to do with variously formatted network data…
The setting is data from an OU course (U101 Design thinking: creativity for the 21st century), in which students (with unique identifiers), post images to a course public space, and then comment on and favourite each other’s images.
A research project (that I’m not officially part of…;-) is looking at how the commenting and favouriting behaviour develops, whether it influences the work students do and, I guess, whether there is any correlation with grades. After a brainstorming chat with Jennefer Hart yesterday, I had a little tinker last night and again this morning with some of the data, and here’s where I’ve got… (This is open notebook science the informal and scruffy way, right?!;-)
The data comes in various spreadsheets:
– a sheet containing photo ids (a number), user IDs (alphanumeric), date of upload, etc;
– a sheet containing photo ids, comment ids (a number), the comment, date of submission, and if it’s a reply to another comment, the id of that other comment (a number);
– a sheet containing photo ids, favourite ids (a number), and user id of the person who favorited the image;
– a sheet containing a list of student group ids; students are assigned to different groups for different epochs within the course. Every so often new groups (with new ids) form and students are assigned to these new groups.
So – what can we do with this data? The first thing I did was to try to error trap confusion between numerical photo IDs, comment IDs and favourite IDs, so I rewrote these in the form pNNNN, cNNNN and fNNNN respectively. Gephi will use the ID to identify each separate node, so we need to make sure that a node representing photo id 234 is not treated as the same node as comment id 234.
I actually augmented the data using a text editor, e.g. taking three column data presented in CSV style as [commentID, photoID, username] and running the following search and replace expression over it:
(\d*),(\d*),(.*)\r -> c\1,p\2,\3\r
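The same rewrite can be scripted rather than done by hand in a text editor. This is only a rough sketch: the sample rows and usernames are made up, and I've used \d+ instead of \d* so that a row with a missing ID is left untouched rather than turned into a bare "c" or "p".

```python
import re

# Hypothetical sample rows in the [commentID, photoID, username] layout.
raw = "101,234,alice\n102,234,bob\n"

# One match per line: two numeric IDs followed by the rest of the row.
pattern = re.compile(r"^(\d+),(\d+),(.*)$", flags=re.MULTILINE)

# Prefix comment IDs with "c" and photo IDs with "p"; keep the rest as is.
prefixed = pattern.sub(r"c\1,p\2,\3", raw)
print(prefixed)
```

Favourite IDs could be handled the same way with an "f" prefix.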
The next thing was to decide on the file format to use to get the data into Gephi. Gephi can accept CSV data, where each row describes the connections from one node to the next (so if you have a list of edges “a connects to b”, “a connects to c” etc, a two column CSV file could easily describe this).
So for example, taking a CSV dump of “photo id, comment id” pairs, we can generate something along the lines of this graph, where node size is the degree of the node, which is to say the number of edges impinging on the node ;-) (That is, the number of comments a photo has, for example…)
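To make the "degree" idea concrete, here is a minimal sketch that reads a two column edge list and counts the edges touching each node. The CSV content is invented for illustration, and this is just the arithmetic, not a claim about how Gephi computes its statistic internally.

```python
import csv
import io
from collections import Counter

# Hypothetical "photo id, comment id" dump, using the prefixed IDs.
csv_text = "p1,c10\np1,c11\np2,c12\n"

# Each non-empty row is one edge between the two nodes it names.
edges = [tuple(row) for row in csv.reader(io.StringIO(csv_text)) if row]

# Degree = number of edges touching a node, whichever end it sits on.
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

for node, count in sorted(degree.items()):
    print(node, count)
```

With only photo/comment pairs loaded, a photo's degree is simply its number of comments, and every comment has degree 1.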
(The layout was achieved by running the Yifan Hu layout algorithm for a few seconds with an optimal distance of 1000.)
One handy feature of Gephi (I think?) is that it appears to let us add data to the network already open in Gephi from another file. So for example, I think I can augment the photo’n’comment data with photo’n’faves data:
This is the effect I get when I load in the second data set…
Is Gephi seeing photos with the same ID as the same node, whether they’re linked to comments or favorites? How can I tell? Maybe I should refresh the statistics and then replot the graph? The random layout is as good as any to start with:
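One way to sanity check the "same ID, same node" question outside Gephi is to merge the two edge lists yourself and see which node IDs turn up in both. A small sketch with made-up edges (each pair is annotation to photo); it shows what a merge keyed on ID should give, not what Gephi actually did with my files.

```python
from collections import Counter

# Hypothetical edge lists: (comment, photo) and (favourite, photo).
comment_edges = [("c10", "p1"), ("c11", "p1"), ("c12", "p2")]
fave_edges = [("f20", "p1"), ("f21", "p3")]

def nodes_of(edges):
    """Return the set of node IDs mentioned anywhere in an edge list."""
    return {node for edge in edges for node in edge}

# Nodes that appear in both data sets should be photos, and only photos.
shared = nodes_of(comment_edges) & nodes_of(fave_edges)

# If nodes are keyed on ID, a photo's in-degree over the merged list
# is its comments plus its favourites.
combined = comment_edges + fave_edges
in_degree = Counter(target for _source, target in combined)

print(shared, in_degree["p1"])
```

If the prefixing step had been skipped, comment 234 and photo 234 would show up in `shared` too, which is exactly the confusion the pNNNN/cNNNN/fNNNN rewrite is meant to avoid.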
Seems to look ok…. err..?;-)
So what can we learn? First of all, let’s find a photo that has a large number of inlinks (presumably – hopefully – the sum of favorites and comments…?) – we can use a filter to do this:
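The filter amounts to a threshold on in-degree. A minimal sketch of the same idea, with invented edges and an arbitrary cut-off (`min_inlinks` is my name for it, not a Gephi setting):

```python
from collections import Counter

# Hypothetical merged edge list: annotation -> photo.
edges = [("c10", "p1"), ("c11", "p1"), ("f20", "p1"),
         ("c12", "p2"), ("f21", "p3")]

# In-degree counts the edges pointing at each node.
in_degree = Counter(target for _source, target in edges)

# Keep only the nodes with at least this many inlinks (arbitrary value).
min_inlinks = 3
popular = sorted(n for n, d in in_degree.items() if d >= min_inlinks)
print(popular)
```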
Maybe one way to see what connects to popular nodes is to look at the Ego network? [See a much better way in the PS below…] Remove the previous filter to regain the whole graph, and we can have a play…
Because I’ve loaded the data in as a directed graph (from comment to photo, or from favourite to photo), I don’t think a depth one ego search will work (because there are no links of depth 1 going away from the photo node). But if we explore a little further, it seems that for some reason a depth 2 search works, which is handy… [UPDATE – I think I’d messed my settings up – seems to work fine with depth 1…]
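The worry about direction can be made concrete with a toy ego search. In this sketch (my own function, invented edges) a search that only follows edges forwards finds nothing around a photo at depth 1, whereas one that ignores direction finds all its annotations. Given the update above, Gephi's Ego filter presumably behaves more like the second case, but I haven't checked how it actually treats direction.

```python
from collections import defaultdict

# Hypothetical edges, each pointing from annotation to photo.
edges = [("c10", "p1"), ("c11", "p1"), ("f20", "p1"), ("c12", "p2")]

def ego(edges, centre, depth, directed):
    """Nodes within `depth` hops of `centre`, including the centre itself."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        if not directed:
            # Ignore direction: the edge can also be walked backwards.
            neighbours[target].add(source)
    seen = {centre}
    frontier = {centre}
    for _ in range(depth):
        # Step one hop outwards, skipping nodes already visited.
        frontier = {n for node in frontier for n in neighbours[node]} - seen
        seen |= frontier
    return seen

print(ego(edges, "p1", 1, directed=True))
print(ego(edges, "p1", 1, directed=False))
```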
We can also use the data table to look at the list of comment and favourite IDs.
Okay, that’s enough for now… what have we done?
– loaded simple edge connection data (simple pairs – comment to photo, for example) into Gephi using CSV; I used a directed edge to distinguish between photos and annotations.
– added one graph to another: we started with comment data then added the favourite data in on top; in order to view the new data, it’s probably best to run the in/out degree statistic over the combined data set just to be sure you’re not looking at just comment or favourite inlink stats;
– spotted which photos are popular based on combined favourite and comment views, and then used (abused?) the Ego filter to see which comments and favourites were associated with an image.
If we’d used undirected edges, the Ego filter might have worked at depth 1?
And what comes to mind next? Firstly, it would be useful to render 2 dimensions of data, for example, colour to show the number of favourites and node size to show the number of comments. (I’m not sure how to do this? Could we maybe label/colour the edges and get a count based on that? OR maybe fudge it, having inlinks for comments and outlinks for faves?) Secondly, we need to start bringing in personal data – who uploaded which photo, who made which comment, and start to explore how active individuals are. But that’s all for another day…
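One possible route to the two-dimension view is to lean on the ID prefixes and count comments and favourites per photo separately, before anything goes near Gephi. A sketch with invented edges; how best to get the resulting columns into Gephi as node attributes is something I haven't tried yet.

```python
from collections import Counter

# Hypothetical merged edge list: annotation -> photo.
edges = [("c10", "p1"), ("c11", "p1"), ("f20", "p1"),
         ("c12", "p2"), ("f21", "p3")]

# The prefix on the source ID says what kind of annotation it is.
comments = Counter(t for s, t in edges if s.startswith("c"))
faves = Counter(t for s, t in edges if s.startswith("f"))

# One row per photo: (photo id, number of comments, number of favourites).
photos = sorted(set(comments) | set(faves))
table = [(p, comments[p], faves[p]) for p in photos]

for row in table:
    print(row)
```

With two separate counts per photo, one could drive node size and the other node colour.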
PS following a comment by Alan Cann, I realised that the graph is largely disjoint: there is a separate cluster for each photo, each photo is only linked to by favourites and comments, and each favourite and comment only links to one photo. So if we run the modularity statistic we get a modularity close to 1, with clusters around each image:
If we expand one of the classes, we can see the photo at the centre and the favourites and comments that (I think) apply to it:
This seems plausible – that the modularity stat identifies the disjoint bits of graph? I wonder if there is a tool that will definitely and only split the graph into disjoint partitions?
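The graph theory name for "definitely and only the disjoint bits" is connected components: ignore edge direction, then group together every node you can reach from every other. Here is a small sketch of that calculation on invented edges (my own function, not anything taken from Gephi):

```python
from collections import defaultdict

def connected_components(edges):
    """Split an edge list into its disjoint pieces, ignoring direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Walk outwards from `start`, collecting everything reachable.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical edges: two photos, each with its own annotations.
edges = [("c10", "p1"), ("c11", "p1"), ("f20", "p1"), ("c12", "p2")]
components = connected_components(edges)
print(sorted(sorted(c) for c in components))
```

Unlike modularity, which is a heuristic clustering, this split is exact: two nodes land in the same group if and only if there is some path between them.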