For fear of being left way behind as Martin Hawksey starts to get to grips with R (see, for example, how he’s using R to automate the annotation of Google Spreadsheets with calculations that don’t come readily or efficiently to hand in Google Spreadsheets itself), I thought I’d better try to get to grips with R’s igraph library…
So here’s a script that gives me some hints as to how to start migrating chunks of my clunky Python script into R, as well as some ideas about how to start reporting on the structure of hashtag communities in a graphical as well as stats analytical way.
    require(igraph)

    #load in a graph from a graphml file; the graph contains nodes representing Twitter users
    #connected by directed edges that represent friend or follower relations,
    #depending on the actual experimental condition I ran
    g=read.graph('/Users/ajh59/code/twapps/newt/reports/scmvESP/scmvESP_2012-01-26-22-53-45/friends_outerfriendsdegree_X_25_25_X_esp.graphml',format='graphml')
    summary(g)

    #Vertices are obtained via V(g). The summary() tells us what attributes are available.
    #So for example, inspect the label attribute
    V(g)$label

    #in and out degree counts for each (labelled) node/vertex
    df=data.frame(name=V(g)$label,indegree=degree(g,mode='in'),outdegree=degree(g,mode='out'))

    #inspect the top 10 nodes sorted by indegree
    #the plyr arrange function makes sorting dataframes a doddle...
    require(plyr)
    df2=head(arrange(df,desc(indegree)),10)
    df2

    #get ready to do some plots
    require(ggplot2)

    #It might be interesting to look at the in-degree and out-degree distributions
    #out-degree, because we see how promiscuous folk are in their following behaviour
    #h/t to @mhawksey for pointing out the mode argument to me.. doh!
    ddout=degree.distribution(g,mode='out')

    #degree.distribution() returns "a numeric vector of the same length as the maximum degree plus one. The first element is the relative frequency zero degree vertices, the second vertices with degree one, etc."
    #We can use the vector vals as the y-value, but x is unspecified/implied by the position in the vector
    #So we need to generate the x vals explicitly: element i corresponds to degree i-1
    ggplot()+geom_point(aes(0:(length(ddout)-1),ddout))
    #If we want to ignore the outdegree==0 value, we can skip the first item in the vector
    ggplot()+geom_point(aes(1:(length(ddout)-1),ddout[-1]))

    #in-degree
    ddin=degree.distribution(g,mode='in')
    ggplot()+geom_point(aes(0:(length(ddin)-1),ddin))
    ggplot()+geom_point(aes(1:(length(ddin)-1),ddin[-1]))

    #We can also plot indegree and outdegree together
    #Use colour to distinguish the points, and make the upper layer smaller in case we overplot
    ggplot() + geom_point(aes(1:(length(ddin)-1),ddin[-1]),colour='red') + geom_point(aes(1:(length(ddout)-1),ddout[-1]),colour='blue',size=1)
Note that I really should have labelled the axes – x-axis is “in (or out) degree”, y-axis is “proportion of nodes with corresponding in (or out) degree”.
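A minimal sketch of how those axis labels could be added, assuming the distribution is first put into a data frame. The values in `dd` are made up to stand in for the output of degree.distribution(), and the names `dd` and `dd_df` are illustrative only.

```r
library(ggplot2)

# Toy stand-in for degree.distribution() output (hypothetical values):
# element i is the proportion of nodes with degree i-1
dd <- c(0.40, 0.30, 0.15, 0.10, 0.05)

# Make the implied x values explicit: degrees run from 0 to length(dd)-1
dd_df <- data.frame(degree = seq_along(dd) - 1, proportion = dd)

# Drop the degree==0 row, as in the plots that skip the first item
dd_df <- subset(dd_df, degree > 0)

# Same scatter plot, this time with the axes labelled
p <- ggplot(dd_df, aes(x = degree, y = proportion)) +
  geom_point() +
  xlab("in (or out) degree") +
  ylab("proportion of nodes with corresponding in (or out) degree")
p
```

Holding the degree values in a named column also makes it harder to get the x values out of step with the proportions.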
Out-degree (except out-degree==0):
In-degree (except in-degree==0):
One thing I notice about the in-degree is that there is a very high number of very low in-degree nodes, which tail off quickly, and then another head at in-degree 25 which then tails off. This is an artefact of the way the graph file was pre-processed – I generated a friends network of hashtag users, then filtered the network to only include nodes that had indegree of at least 25 and/or outdegree of at least 25. The nodes with in-degree between 1 and 25 are nodes corresponding to hashtaggers that are friended by other hashtaggers.
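The pre-processing itself happened outside R and isn't shown here, so what follows is only a sketch of that kind of filter, written in base R against a made-up edge list. The account names, the variable names and the threshold of 3 (the real filter used 25) are all assumptions for illustration.

```r
# Toy edge list (hypothetical accounts): each row reads "from follows to"
edges <- data.frame(
  from = c("a", "a", "a", "b", "b", "c", "d"),
  to   = c("b", "c", "d", "c", "d", "d", "a"),
  stringsAsFactors = FALSE
)

# The threshold was 25 in the real data; 3 here so the toy data shows an effect
min_degree <- 3

nodes <- unique(c(edges$from, edges$to))

# Tabulate degrees over the full node set so that zero counts are kept
indegree  <- as.vector(table(factor(edges$to,   levels = nodes)))
outdegree <- as.vector(table(factor(edges$from, levels = nodes)))

# Keep nodes that pass the threshold on in-degree and/or out-degree
keep <- nodes[indegree >= min_degree | outdegree >= min_degree]

# Keep only the edges that run between surviving nodes
filtered <- subset(edges, from %in% keep & to %in% keep)
filtered
```

Because a node can survive on its out-degree alone, accounts with quite low in-degree stay in the filtered network alongside the popular ones.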
In- (red) and out- (blue) degree:
Reflecting on the in-degree graph, we have a way of identifying those folk who used the hashtag and are connected to other hashtaggers:
arrange(subset(df,subset=(outdegree>0 & indegree>0)),desc(indegree))
The dataset I’m using is based on folk using the #bbcqt hashtag. Here are the hashtaggers most linked to by other hashtaggers:

    > head(arrange(subset(df,subset=(outdegree>0)),desc(indegree)))
                 name indegree outdegree
    1 bbcquestiontime      190       102
    2       DIMBLEBOT       76        61
    3   markinreading       34       121
    4 politicalhackuk       27       236
    5          10anta       25        73
    6 Parlez_me_nTory       24        63
So now I’m wondering… does this hint at a way of measuring some sort of engagement with the Twitter account set up to promote the programme and, presumably, the hashtag???
If we consider @bbcquestiontime, the high indegree tells us that the @bbcquestiontime account is being followed by a significant number of the hashtag users (we could find out what proportion by dividing through by the number of folk with out-degree>0, minus 1, because @bbcquestiontime is itself one of those hashtaggers). That @bbcquestiontime has outdegree > 0 tells us it was sampled as a user of the hashtag (the graph was originally generated with directed edges from folk who used the tag to their friends). The high (ish?!) out-degree tells us that this account is linking to a reasonable number of folk popularly followed by users of the #bbcqt hashtag or who used the hashtag; so @bbcquestiontime is listening to folk that the #bbcqt taggers listen to, which is probably a good thing. (I guess what we could do here is compare the outdegree of the @bbcquestiontime account with its total friend count, ie with the total number of accounts it follows. If the account was following 1000 people or so, and only 10% of them were being followed by #bbcqt hashtaggers, we might wonder whether they’re interested in different things?) Once again, we could also normalise the out-degree number with respect to one less than the number of accounts with indegree >0 (again, we subtract one to account for the self reference) to get the proportion of folk being followed by hashtaggers that are being followed by @bbcquestiontime. This gives us some idea of the extent to which @bbcquestiontime is listening to folk that the #bbcqt hashtaggers are listening to.
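As an aside, the friend-count comparison mentioned in passing could be sketched as follows. The graph doesn't carry total friend counts, so `friends_count` here is just the hypothetical "1000 people or so" from the text, and `friend_overlap` is an illustrative helper name rather than anything from igraph.

```r
# Proportion of an account's friends that also appear in the hashtag graph
friend_overlap <- function(outdegree, friends_count) {
  # An account can't link to more graph members than it has friends
  stopifnot(friends_count > 0, outdegree <= friends_count)
  outdegree / friends_count
}

# outdegree 102 is the @bbcquestiontime value reported earlier;
# friends_count 1000 is hypothetical, NOT the account's real friend count
friend_overlap(outdegree = 102, friends_count = 1000)
```

With those numbers the overlap comes out at roughly 10%, which is the sort of level at which the question about differing interests would arise. The normalisation against the number of accounts with indegree > 0 is a separate calculation.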
Let’s try that latter normalisation to get a feel for what the proportions are…
    #Count the number of rows where folk have indegree, or outdegree, as required, > 0
    df$inReach=df$indegree/(nrow(subset(df,df$outdegree>0))-1)
    df$outReach=df$outdegree/(nrow(subset(df,df$indegree>0))-1)

    #First let's see who reaches furthest out into the interest community
    head(arrange(subset(df,inReach>0),desc(outReach)))
                 name indegree outdegree  outReach     inReach
    1        Damientg        5       341 0.4782609 0.013054830
    2      danmknight        9       265 0.3716690 0.023498695
    3         martysm        1       261 0.3660589 0.002610966
    4       MrJacHart       18       257 0.3604488 0.046997389
    5           VMcAV        5       237 0.3323983 0.013054830
    6 politicalhackuk       27       236 0.3309958 0.070496084

    #now let's see who is touched by most of the community
    head(arrange(subset(df,outReach>0),desc(inReach)))
                 name indegree outdegree   outReach    inReach
    1 bbcquestiontime      190       102 0.14305750 0.49608355
    2       DIMBLEBOT       76        61 0.08555400 0.19843342
    3   markinreading       34       121 0.16970547 0.08877285
    4 politicalhackuk       27       236 0.33099579 0.07049608
    5          10anta       25        73 0.10238429 0.06527415
    6 Parlez_me_nTory       24        63 0.08835905 0.06266319
So, from that, we see that @Damientg is following a large number of the folk popularly followed by users of the #bbcqt hashtag or who used the hashtag. I don’t think this is interesting. However, the fact that @bbcquestiontime is followed by about half the folk who used the #bbcqt tag (in the sample I grabbed) is maybe useful as a measure of how engaged the hashtaggers may be with the programme Twitter account?
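As a quick sanity check on the "about half" figure, the denominators behind the two reach measures can be recovered from the @bbcquestiontime row of the report. The four numbers are copied from that output; the variable names are just for illustration.

```r
# Values copied from the @bbcquestiontime row of the report
indegree  <- 190
outdegree <- 102
inReach   <- 0.49608355
outReach  <- 0.14305750

# reach = degree / denominator, so denominator = degree / reach
in_denominator  <- round(indegree / inReach)    # hashtaggers, less the account itself
out_denominator <- round(outdegree / outReach)  # accounts with indegree > 0, less one

in_denominator   # 383
out_denominator  # 713
```

So the sample appears to hold 384 accounts with outdegree > 0 (the hashtaggers) and 714 accounts with indegree > 0, and 190 followers out of 383 other hashtaggers is just under 50%.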
The latter report also brings to mind another question – how many of the hashtaggers does any particular account follow – that is, how connected is any particular account to folk who used the hashtag (which is the set of folk with outdegree>0)? This is important I think – distinguishing between hashtaggers who link to each other as part of a conversation, and other accounts they follow en masse but who aren’t engaging in conversation via the hashtag?
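One way this count might be approached is sketched below in base R, on a made-up edge list rather than the real graph. The account names and the column names `followsTaggers` and `taggerReach` are hypothetical; the only thing taken from the post is the definition of a hashtagger as an account with outdegree > 0.

```r
# Toy edge list (hypothetical accounts): each row reads "from follows to"
edges <- data.frame(
  from = c("a", "a", "b", "b", "c"),
  to   = c("b", "x", "a", "y", "x"),
  stringsAsFactors = FALSE
)

# Hashtaggers are the accounts with outdegree > 0, ie any edge source
hashtaggers <- unique(edges$from)

# Keep only the edges that point at a hashtagger
to_tagger <- subset(edges, to %in% hashtaggers)

# Count, for every account, how many hashtaggers it follows
nodes <- unique(c(edges$from, edges$to))
report <- data.frame(
  name = nodes,
  indegree = as.vector(table(factor(edges$to, levels = nodes))),
  followsTaggers = as.vector(table(factor(to_tagger$from, levels = nodes))),
  stringsAsFactors = FALSE
)

# Normalise by the number of other hashtaggers (less one for the self reference)
report$taggerReach <- report$followsTaggers / (length(hashtaggers) - 1)
report
```

In the toy data, a and b follow each other and so look like part of a conversation, whereas x has the highest indegree but never used the tag, which is the en masse case.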
Hmmm…something to ponder over the weekend I think;-)