Fishing for OU Twitter Folk…

Just a quick observation inspired by the online “focus group” on Twitter yesterday around the #twitterou hashtag (a discussion for OU folk about Twitter usage): a few minutes into the discussion, I grabbed a list of the folk who had used the tag so far (about 10 or so people at the time), pulled down a list of the people they followed to construct a graph of hashtaggers->friends, and then filtered the resulting graph to show folk with a node degree of 5 or more.

twitterOU - folk followed by 5 or more folk using twitterou before 2.10 or so today

Because a large number of OU Twitter folk follow each other, the graph is quite dense, which means that if we take a sample of known OU users and look for people that a majority of that sample follow, we stand a reasonable chance of identifying other OU folk…
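Just to make the counting step concrete, here’s a minimal sketch of that recipe in Python; the Twitter API fetching is hidden behind a hypothetical getTwitterFriends() helper (not something defined in this post), and the threshold mirrors the “node degree of 5 or more” filter above:

from collections import Counter

def likelyCommunityMembers(taggers, getTwitterFriends, minDegree=5):
  # taggers: screen names seen using the hashtag
  # getTwitterFriends: hypothetical helper returning the screen names a user follows
  counts = Counter()
  for tagger in taggers:
    for friend in set(getTwitterFriends(tagger)):
      counts[friend] += 1
  # keep accounts followed by minDegree or more of the hashtaggers -
  # candidates for "other OU folk"
  return [(friend, n) for friend, n in counts.most_common() if n >= minDegree]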

Doing a bit of List Intelligence (looking up the lists that a significant number of hashtag users were on), I identified several OU folk Twitter lists, most notably @liamgh/planetou and @guyweb/openuniversity.

Just for completeness, it’s also worth pointing out that simple community analysis of followers of a known OU person might also turn up OU clusters, e.g. as described in Digging Deeper into the Structure of My Twitter Friends Network: Librarian Spotting. I suspect if we did clique analysis on the followers, this might also identify ‘core’ members of organisational communities that could be used to seed a snowball discovery mechanism for more members of that organisation.

PS hmmm… maybe I need to do a post or two on how we might go about discovering enterprise/organisation networks/communities on Twitter…?

Charting the Social Landscape – Who’s Notable Amongst Followers of UK HE Twitter Accounts?

Over the last week or two, I’ve been playing around with a few ideas relating to where Twitter accounts are located in the social landscape. There are several components to this: who does a particular Twitter account follow, and who follows it; do the friends, or followers cluster in any ways that we can easily and automatically identify (for example, by term analysis applied to the biographies of folk in an individual cluster); who’s notable amongst the friends or followers of an individual that aren’t also a friend or follower of the individual, and so on…

Just to place a stepping stone in my thinking so far, here’s a handful of examples, showing who’s notable amongst the followers of a couple of official HE Twitter accounts but who doesn’t follow the corresponding followed_by account.

Firstly, here’s a snapshot of who followers of @OU_Community follow in significant numbers:

Positioning @ou_community

Hmmm – seems the audience are into their satire… Should the OU be making some humorous videos to tap into that interest?

Here’s who a random sample (I think!) of 250 of @UCLnews’ followers seem to follow at the 4% or more level (that is, at least 0.04 * 250 = 10 of the sampled @UCLnews followers follow them…):

positioning of @uclnews co-followed accounts

Seems to be quite a clustering of other university accounts being followed in there, but also “notable” figures and some evidence of a passing interest in serious affairs/commentators? That other UCL accounts are also being followed might suggest evidence that the @UCLnews account is being followed by current students?

How about the followers of @boltonuni? (Again, using a sample of 250 followers, though from a much smaller total follower population when compared to @UCLnews):

@boltonuni cofollowed

The dominance of other university accounts is noticeable here. A couple of possible reasons suggest themselves: the sampled accounts may skew towards other “professional” accounts from within the sector (or that otherwise follow it), or the students and potential students may have a less coherent (in the nicest possible sense of the word!) world view… Or maybe there are lots of potential students out there following several university Twitter accounts, trying to get a feel for what the universities are offering.

If we actually look at friend connections between the @boltonuni 250 follower sample, 40% or so are not connected to other followers (either because they are private accounts or because they don’t follow any of the other followers – as we might expect from potential students, for example?)

The connected followers split into two camps:

Tunnelling in on boltonuni follower sample

A gut reaction reading of these communities is that they represent sector and locale camps.

Finally, let’s take a look at 250 random followers of @buckssu (Buckinghamshire University student union); this time we get about 75% of followers in the giant connected component:

@buckssu follower sample

Again, we get a locale and ‘sector’ cluster. If we look at folk followed by 4% or more of the follower sample, we get this:

Folk followed by a sample of followers of buckssu

My reading of this is that the student union accounts are pretty tightly connected (I’m guessing we’d find some quite sizeable SU account cliques), there’s a cluster of “other student society” type accounts top left, and then a bunch of celebs…

So what does this tell us? Who knows…?! I’m working on that…;-)

Circles vs Community Detection

One take on the story so far:

– Facebook supports symmetrical follows and allows you to see connections between your Facebook friends;
– Twitter supports asymmetric follows and allows you to see pretty much everyone’s friend and follower connections;
– Google+ supports asymmetric follows

Facebook and Twitter both support lists but hardly anyone uses them. Google+ encourages you to put people into addressable circles (i.e. lists).

If you can grab a copy of connections between folk in your social network, you can run social network statistics that will partition out different social groupings:

My annotated twitter follower network

If you’re familiar with the interests of people in a particular cluster, you can label them (there are also ways you might try to do this automagically).
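For a rough idea of how the partitioning step might be automated outside of Gephi, here’s a minimal sketch using networkx (an assumption on my part – it isn’t what’s used in this post, which relies on Gephi’s modularity statistic): it finds modularity-based clusters in a friendship edge list and attaches a group attribute you could then label by hand.

import networkx as nx
from networkx.algorithms import community

# edges: (follower, followed) pairs between people in my network,
# e.g. loaded from a CSV edge list
edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]

G = nx.Graph()
G.add_edges_from(edges)

# modularity-based community detection, roughly what Gephi's
# modularity statistic does
clusters = community.greedy_modularity_communities(G)

# record the cluster id as a node attribute, ready for manual labelling
for i, cluster in enumerate(clusters):
    for node in cluster:
        G.nodes[node]["group"] = i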

Now a Facebook app, Super Friends, will help you identify – and label – clusters in your Facebook network (via ReadWriteWeb):

Super friends facebook app

This is a great feature, and something I could imagine being supported to some extent in Gephi, for example, by allowing the user to create a node attribute where the values represent label mappings from different modularity clusters (or more simply by allowing a user to add a label to each modularity class?).

The SuperFriends app also stands in contrast to the Google+ approach. I’d class SuperFriends as gardening, whereas the Google+ approach is more one of planning. The Google+ approach encourages you to think you’re in control of different parts of your network and makes your life really complicated (which circle do I put this person in; do I need a new circle for this?); the SuperFriends approach helps you realise how complicated (or not) your social circle is. In terms of filters, the Google+ approach encourages you to add your own, whereas the SuperFriends approach helps you identify structure that emerges out of network properties.

Given that in many respects Google is an AI/machine learning company, it’s odd that they’re getting the user to define circle/set membership; maybe it’d be too creepy if they automatically suggested groups? Maybe there’s too much scope for error if you don’t deliberately place people into a group yourself (and instead trust an algorithm to do it?)

Superfriends helps uncover structure, Google+ forces you to make all sorts of choices and decisions every time you “follow” another person. Google+ makes you define tags and categories to label people up front; SuperFriends identifies clusters that might be covered by an obvious tag.

Looking at my delicious bookmarks, I have almost as many tags as bookmarks… But if I ran some sort of grouping analysis (not sure what?!), maybe natural clusters – and natural tags – would emerge as a result?
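If I were to try, one cheap and cheerful approach might be to build a weighted tag co-occurrence graph from my own bookmarks (tags that appear on the same bookmark get linked) and throw that at Gephi; a minimal sketch, assuming the bookmark tag lists are already to hand:

from collections import Counter
from itertools import combinations

# one entry per bookmark: the tags applied to it (assumed already fetched)
bookmarks = [["gephi", "visualisation"], ["python", "twitter", "visualisation"]]

cooc = Counter()
for tags in bookmarks:
    for t1, t2 in combinations(sorted(set(tags)), 2):
        cooc[(t1, t2)] += 1

# write a simple weighted edge list Gephi can import; clusters in this graph
# might suggest "natural" groupings of tags
with open("tag-cooccurrence.csv", "w") as f:
    f.write("tag1,tag2,weight\n")
    for (t1, t2), w in cooc.items():
        f.write("%s,%s,%d\n" % (t1, t2, w))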

Maybe I need to read Everything is Miscellaneous again…?

PS if you want to run a more hands on analysis of your Facebook network, try this: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I

PPS here’s another Facebook app that identifies clusters: http://www.fellows-exp.com/ h/t @jacomyal

PPPS @danmcquillan also tweeted that LinkedIn InMaps do a similar clustering job on LinkedIn connections. They do indeed; and they use Gephi. I wonder if they’ve released the code that handles things from the point at which the social network graph data is provided to the rendering of the map?

Dominant Tags in My Delicious Network

Following on from Social Networks on Delicious, here’s a view over my delicious network (that is, the folk I “follow” on delicious) and the dominant tags they use:

The image is created from a source file generated by:

1) grabbing the list of folk in my delicious network;
2) grabbing the tags each of them uses;
3) generating a bipartite network specification graph containing user and tag nodes, with weighted links corresponding to the number of times a user has used a particular tag (i.e. the number of bookmarks they have bookmarked using that tag).

Because the original graph is a large, sparse one (many users define lots of tags but only use them rarely), I filtered the output view to show only those tags that have been used 150 or more times by any particular user, based on the weight of each edge (remember, the edge weight describes the number of times a user has used a particular tag). (So if every user had used the same tag up to but not more than 149 times each, it wouldn’t be displayed.) The tag nodes are sized according to the number of users who have used the tag 150 or more times.

I also had a go at colouring the nodes to identify tags used heavily by a single user, compared to tags heavily used by several members of my network.

Here’s the Python code:

import urllib, simplejson

def getDeliciousUserNetwork(user,network):
  url='http://feeds.delicious.com/v2/json/networkmembers/'+user
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    network.append(u['user'])
    #time also available: u['dt']
  #print network
  return network

def getDeliciousTagsByUser(user):
  tags={}
  url='http://feeds.delicious.com/v2/json/tags/'+user
  data = simplejson.load(urllib.urlopen(url))
  for tag in data:
    tags[tag]=data[tag]
  return tags

def printDeliciousTagsByNetwork(user,minVal=2):
  # openTimestampedFile and gephiCoreGDFNodeHeader are helper functions defined
  # elsewhere: the first opens a (timestamped) output file, the second writes
  # the GDF nodedef> header line
  f=openTimestampedFile('delicious-socialNetwork','network-tags-'+user+'.gdf')
  f.write(gephiCoreGDFNodeHeader(typ='delicious')+'\n')
 
  network=[]
  network=getDeliciousUserNetwork(user,network)

  # write a node row for each member of the network...
  for member in network:
    f.write(member+','+member+',user\n')
  # ...then the edge header and weighted member -> tag edges, skipping tags
  # a member has used fewer than minVal times
  f.write('edgedef> user1 VARCHAR,user2 VARCHAR,weight DOUBLE\n')
  for member in network:
    tags=getDeliciousTagsByUser(member)
    for tag in tags:
      if tags[tag]>=minVal:
        f.write(member+',"'+tag.encode('ascii','ignore')+'",'+str(tags[tag])+'\n')
  f.close()

Looking at the network, it’s possible to see which members of my network are heavy users of a particular tag, and furthermore, which tags are heavily used by more than one member of my network. The question now is: to what extent might this information help me identify whether or not I am following people who are likely to turn up resources in my interest area, by virtue of the tags used by the members of my network?

Picking up on the previous post on Social Networks on Delicious, might it be worth looking at the tags used heavily by my followers to see what subject areas they are interested in, and potentially the topic area(s) in which they see me as acting as a resource investigator?

Social Networks on Delicious

One of the many things that the delicious social networking site appears to have got wrong is how to gain traction from its social network. As well as the incidental social network that arises from two or more different users using the same tag or bookmarking the same resource (for example, Visualising Delicious Tag Communities Using Gephi), there is also an explicit social network constructed using an asymmetric model similar to that used by Twitter: specifically, you can follow me (become a “fan” of me) without my permission, and I can add you to my network (become a fan of you, again without your permission).

Realising that you are part of a social network on delicious is not really that obvious though, nor is the extent to which it is a network. So I thought I’d have a look at the structure of the social network that I can crystallise out around my delicious account, by:

1) grabbing the list of my “fans” on delicious;
2) grabbing the list of the fans of my fans on delicious and then plotting:
2a) connections between my fans and their fans who are also my fans;
2b) all the fans of my fans.

(Writing “fans” feels a lot more ego-bollox than writing “followers”; is that maybe one of the nails in the delicious social SNAFU coffin?!)

Here’s the way my “fans” on delicious follow each other (maybe? I’m not sure if the fans call always grabs all the fans, or whether it pages the results?):

(The network is plotted using Gephi, of course; nodes are coloured according to modularity clusters, and the layout is derived from a Force Atlas layout.)

Here’s the wider network – that is, showing fans of my fans:

In this case, nodes are sized according to betweenness centrality and coloured according to in-degree (that is, the number of my fans who have this person as a fan). [This works in so far as we’re trying to identify reputation networks. If we’re looking for reach in terms of using folk as a resource discovery network, it would probably make more sense to look at the members of my network, and the networks of those folk…]

If you want to try to generate your own, here’s the code:

import simplejson, urllib, time

def getDeliciousUserFans(user,fans):
  url='http://feeds.delicious.com/v2/json/networkfans/'+user
  #needs paging? or does this grab all the fans?
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    fans.append(u['user'])
    #time also available: u['dt']
  #print fans
  return fans

def getDeliciousFanNetwork(user):
  # openTimestampedFile and gephiCoreGDFNodeHeader are helper functions defined
  # elsewhere; f holds the full fans-of-fans network, f2 just the "inner" network
  # of my fans following each other
  f=openTimestampedFile("fans-delicious","all-"+user+".gdf")
  f2=openTimestampedFile("fans-delicious","inner-"+user+".gdf")
  f.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  f2.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f2.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  fans=[]
  fans=getDeliciousUserFans(user,fans)
  for fan in fans:
    time.sleep(1) # be gentle with the feeds API
    fans2=[]
    print "Fetching data for fan "+fan
    fans2=getDeliciousUserFans(fan,fans2)
    for fan2 in fans2:
      # every fan -> fan-of-fan edge goes into the full network file; only
      # edges where the fan-of-fan is also one of my fans go into the inner file
      f.write(fan+","+fan2+"\n")
      if fan2 in fans:
        f2.write(fan+","+fan2+"\n")
  f.close()
  f2.close()

So what’s the next step…?!

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are co-located in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of Twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of Twitter user names…

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I’ve tweaked the pipe to make it a little more robust…]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result caching… in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then using the Unique block, filtering on hashtags, to count the reoccurrences, here’s a Python recipe:

import simplejson, urllib, re

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      return details['latitude'],details['longitude']
  else:
    return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
     if not i['text'].startswith('RT @'):
      u=i['from_user'].strip()
      if u in tweeters:
        tweeters[u]['count']+=1
      else:
        tweeters[u]={}
        tweeters[u]['count']=1
      # pull out any hashtags in the tweet and keep a running count of each
      ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
      for tag in ttags:
        if tag not in tags:
          tags[tag]=1
        else:
          tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,term)
'''

What this code does is:
– use Yahoo placemaker to geocode the address provided;
– search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
– identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
– identify the unique tags, as well as counting the number of times they appeared in the search results.

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U… Hmm… people there who shouldn’t be?!

And wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error…? Or is there maybe a UKOLN event on today, I wonder..?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there aren’t many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old (a minimal sketch of such a filter follows after this list); (it would be so nice if the Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
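For what it’s worth, here’s a minimal sketch of the sort of time window filter I mean, assuming we’re looking at the created_at strings returned in the search results (the field name and its “Sat, 23 Oct 2010 12:34:56 +0000” format are assumptions based on the old search API):

import time
from email.utils import parsedate_tz, mktime_tz

def isRecent(created_at, maxAgeSecs=3600):
  # created_at assumed to look like "Sat, 23 Oct 2010 12:34:56 +0000"
  tweetTime = mktime_tz(parsedate_tz(created_at))
  return (time.time() - tweetTime) <= maxAgeSecs

# inside the results loop we might then do something like:
# if not isRecent(i['created_at']): continue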

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter reweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…

PS if you want to see how folk tweeting around a location are socially connected (i.e. whether they follow each other), check out A Bit of NewsJam MoJo – SocialGeo Twitter Map.

Gephi Bits 2: A Further Look at Comments on Social Objects in a Closed Community

In the previous post in this set (Gephi Bits 1: Comments on Social Objects in a Closed Community), I started having a play with comment and favourites data from a series of peer review activities in the OU course Design thinking: creativity for the 21st century.

In particular, I loaded simple pairwise CSV data directly into Gephi, relating comment id and favourite ids to photo ids. The resulting images provided a view over the photos that showed which photos were heavily commented and/or favourited. Towards the end of the post, I suggested it might be interesting to be able to distinguish between the comment and favorite nodes by colouring them somehow. So let’s start by seeing how we might achieve that…

The easiest way I can think of is to preload Gephi with a definition of each node and the assignment of a type label to each node – photo, comment or favourite. We can then partition – and colour – each node based on the type label.

To define the nodes and type labels, we can use a file defined using the GUESS .gdf format. In particular, we define the nodes as follows:

nodedef> name VARCHAR, ltype VARCHAR
p189, photo
p191, photo

c1428, comment
c1429, comment

f1005, fave
f1006, fave

Load this file into Gephi, and then append the contents of the comment-photo and favourite-photo CSV files to the graph. We can then colour the nodes (sized according to, err, something!) according to partition:

Coloured partitions in Gephi

If we filter the network for a particular photo using an ego filter, we can get a colour coded view of the comment and favourite IDs associated with that image:

Coloured nodes and labels in Gephi

What we’ve achieved so far is a way of exploring how heavily commented or favourited a photo is, as well as picking up a tip or two about labeling and colouring nodes. But what about if we wanted a person level analysis, where we could visually identify the individuals who had posted the most images, or whose images were most heavily commented upon and favourited?

To start with, let’s capture some information about each of the nodes. In the following example, we have an identifier (for a photo, favourite or comment), followed by a user id (the person who made the comment or favourite, or who uploaded the photo), and a label (photo, comment or fave). (The ltype field also captures a sense of this.)

nodedef> name VARCHAR, username VARCHAR, ltype VARCHAR
p189,jd342,photo
p191,jd342,photo
p192,pn43,photo
..
c1189,pd73,comment
c1190,srs22,comment
..
f46,ww66,fave
f47,ee79,fave

Rather than describe edges based on connecting comment or favourite ID to photo ID, we can easily generate links of the form userID, photoID, where userID is the ID of the user making a comment or favouriting an image. In addition, it is possible to annotate the edges to describe whether or not the link relates to a comment or a favouriting action. So for example:

edgedef> otherUser VARCHAR, photo VARCHAR, etype VARCHAR
pd73,p189,comment
srs22,p226,comment

ww66,p176,fave

Alternatively, we might just use the simpler format:
edgedef> otherUser VARCHAR, photo VARCHAR
pd73,p189
srs22,p226

ww66,p176

In this simpler case, we can just load in the node definition gdf file, and follow it by adding the actual graph edge data from CSV files, which is what I’ve done for what follows.
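To give a flavour of the munging step, here’s a minimal sketch of how the raw comment data might be recast as user -> photo edges in the annotated GDF form above (the input column order and file names are assumptions based on the spreadsheet description in the previous post):

# assumed input rows: commentID,photoID,userID (the commenter)
with open("comments.csv") as infile, open("comment-edges.csv", "w") as outfile:
    outfile.write("edgedef> otherUser VARCHAR, photo VARCHAR, etype VARCHAR\n")
    for line in infile:
        commentID, photoID, userID = line.strip().split(",")
        # edge runs from the commenting user to the photo, tagged with its type
        outfile.write("%s,p%s,comment\n" % (userID, photoID))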

Firstly, here’s the partition colour palette:

Gephi - partition colours

The null entities relate to nodes that didn’t get an explicit node specification (i.e. the person nodes).

To provide a bit of flexibility over the graph, I loaded the favourites and comment edges in as directed edges from “Other user” to photo ID, where “Other user” is the user ID of the person making the comment or favourite.

If we size the graph by out-degree, we can look at which users are actively engaged in commenting on photos:

Gephi - who's commenting/favouriting

The size of the arrow depicts whether or not there are multiple edges going from one person to a photo, so we can see, for example, where someone has made multiple comments on the same photo.

If we size by in-degree, we can see which photos are popular:

Gephi - what photos are popular

If we run an ego filter over a photo id, we can see who commented on it.

However, what we would really like to be able to do is look at the connections between people via a photo (for example, to see who has favourited whose photos). If we add in another edge data file that links from a photo ID to a person ID (the person who uploaded the photo), we can start to explore these relationships.

NB the colour palette changes in what follows…

Having captured user to photo relationships based on commenting, favouriting or uploading behaviour, we can now do things like the following. Here, for example, is a use of a simple filter to see which of a user’s photos are popular:

Gephi - simple filter

If we run a simple ego filter, we can see the photos that a user has uploaded or commented on/favourited:

Gephi - ego filter

If we increase the depth to 2, we can see who else a particular user is connected to by virtue of a shared interest in the same photographs (I’m not sure what edge size relates to here…?):

Ego depth 2 in gephi - who connects to whom

Here, user ba49 is outsize because they uploaded a lot of the images that are shown. (The above graph shows linkage between ba49 and other users who either commented on/favourited one of ba49’s images, or who commented on/favourited a photo that ba49 also commented on/favourited.)

Doh – it appears I’ve crashed Gephi, so as it’s late, I’m going to stop for now! In the next post, I’ll show how we can further elaborate the nodes using extended user identifiers that designate the role a person is acting in (eg as a commenter, favouriter or photo uploader) to see what sorts of view this lets us take over the network.

Gephi Bits 1: Comments on Social Objects in a Closed Community

This is the first in a series of bitty posts (if it makes less sense than usual, tough) just cobbling together a couple of observations about some of the things it looks like you can get Gephi to do with variously formatted network data…

The setting is data from an OU course (U101 Design thinking: creativity for the 21st century), in which students (with unique identifiers), post images to a course public space, and then comment on and favourite each other’s images.

A research project (that I’m not officially part of…;-) is looking at how the commenting and favouriting behaviour develops, whether it influences the work students do and I guess whether there is any correlation with grades. After a brainstorming chat with Jennefer Hart yesterday, I had a little tinker last night and again this morning with some of the data, and here’s where I’ve got… (This is open notebook science the informal and scruffy way, right?!;-)

The data comes in various spreadsheets:
– a sheet containing photo id’s (a number), user IDs (alphanumeric), date of upload, etc;
– a sheet containing photo ids, comment ids (a number), the comment, date of submission, and if it’s a reply to another comment, the id of that other comment (a number);
– a sheet containing photo ids, favourite ids (a number), and the user id of the person who favourited the image;
– a sheet containing a list of student group ids; students are assigned to different groups for different epochs within the course. Every so often new groups (with new ids) form and students are assigned to these new groups.

So – what can we do with this data? The first thing I did was to try to error trap confusion between numerical photo IDs, comment IDs and favourite IDs, so I rewrote these in the form pNNNN, cNNNN and fNNNN respectively. Gephi will use the ID to identify each separate node, so we need to make sure that a node representing photo id 234 is not treated as the same node as comment id 234.

I actually augmented the data using a text editor, e.g. taking three column data presented in CSV style as [commentID, photoID, username] and running the following search and replace expression over it:
(\d*),(\d*),(.*)\r -> c\1,p\2,\3\r
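(For what it’s worth, the same rewrite could be scripted rather than done in a text editor; a minimal Python equivalent of that search and replace, with the file name assumed for illustration:)

import re

with open("comments.csv") as infile:
    for line in infile:
        # prefix the numeric comment and photo IDs with c and p respectively
        print re.sub(r"^(\d+),(\d+),(.*)$", r"c\1,p\2,\3", line.rstrip())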

The next thing was to decide on the file format to use to get the data into Gephi. Gephi can accept CSV data, where each row describes the connections from one node to the next (so if you have a list of edges “a connects to b”, “a connects to c” etc, a two column CSV file could easily describe this).

So for example, taking a CSV dump of “photo id, comment id” pairs, we can generate something along the lines of this graph, where node size is the degree of the node (which is to say, the number of edges impinging on the node;-), that is, the number of comments a photo has, for example…

Photos by number of comments

(The layout was achieved by running the Yifan Hu layout algorithm for a few seconds with an optimal distance of 1000.)

One handy feature of Gephi (I think?) is that it appears to let us add data to the network already open in Gephi from another file. So for example, I think I can augment the photo’n’comment data with photo’n’faves data:

Merging graphs in gephi?

This is the effect I get when I load in the second data set…

Importing a 2nd data set that should share node IDs..

Is Gephi seeing photos with the same ID as the same node, whether they’re linked to comments or favourites? How can I tell? Maybe I should refresh the statistics and then replot the graph? The random layout is as good as any to start with:

Gephi random layout

Seems to look ok…. err..?;-)

So what can we learn? First of all, let’s find a photo that has a large number of inlinks (presumably – hopefully – the sum of favorites and comments…?) – we can use a filter to do this:

Finding the popular photos

Maybe one way to see what connects to popular nodes is to look at the Ego network? [See a much better way in the PS below…] Remove the previous filter to regain the whole graph, and we can have a play… Because I’ve loaded the data in as a directed graph (from comment to photo, or from favourite to photo), I don’t think a depth one ego search will work (because there are no links of depth 1 going away from the photo node). But if we explore a little further, it seems that for some reason a depth 2 search works, which is handy… [UPDATE – I think I’d messed my settings up – seems to work fine with depth 1…]

Gephi - looking at comments and faves round a photo

We can also use the data table to look at the list of comment and favourite IDs.

Okay, that’s enough for now… what have we done?

– loaded simple edge connection data (simple pairs – comment to photo, for example) into gephi using csv; I used a directed edge to distinguish between photos and annotations.
– added one graph to another: we started with comment data then added the favourite data in on top; in order to view the new data, it’s probably best to run the in/out degree statistic over the combined data set just to be sure you’re not looking at just comment or favourite inlink stats;
– spotted which photos are popular based on combined favourite and comment views, and then used (abused?) the Ego filter to see which comments and favourites were associated with an image. If we’d used undirected edges, the Ego filter might have worked at depth 1?

And what comes to mind next? Firstly, it would be useful to render 2 dimensions of data, for example, colour to show the number of favourites and node size to show the number of comments. (I’m not sure how to do this? Could we maybe label/colour the edges and get a count based on that? OR maybe fudge it, having inlinks for comments and outlinks for faves?) Secondly, we need to start bringing in personal data – who uploaded which photo, who made which comment, and start to explore how active individuals are. But that’s all for another day…

PS following a comment by Alan Cann, I realised that because the graph is largely disjoint – there are separate clusters for each photo, each of which is only linked to by favourites and comments, with each favourite and comment only linking to one photo – if we run the modularity statistic we get a modularity close to 1, with clusters around each image:

Modularity classes/partitions

If we expand one of the classes, we can see the photo at the centre and the favourites and comments that (I think) apply to it:

Expanding a class

This seems plausible – that the modularity stat identifies the disjoint bits of graph? I wonder if there is a tool that will definitely and only split the graph into disjoint partitions?
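(As an aside, splitting a graph into its disjoint pieces is the “connected components” problem; a minimal sketch outside Gephi, using networkx – which isn’t something used in this post – might look like this:)

import networkx as nx

G = nx.Graph()
G.add_edges_from([("c1428", "p189"), ("f1005", "p189"), ("c1429", "p191")])

# each connected component should correspond to one photo plus its
# comments and favourites, if the graph really is disjoint per photo
components = list(nx.connected_components(G))
print(components)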

Asymmetric Disclosure in Social Networks

A thought in process…

In a social network, under what conditions should relationships between individuals be publicly discoverable?

So for example, if I am a member of a social network that supports private groups and I put You in one of My private groups, and You put Me in one of Your public groups, should Our relationship be publicly disclosed on Your profile?

A ‘real world’ version of this (?maybe): suppose You have a problem. You ask Me for a chat about it over coffee in a public coffee shop. Under what circumstances should I be able to disclose in public that You and I had that coffee together?

I haven’t got a proper definition of what I think I mean by asymmetric disclosure yet, but what I (think I) want is to find a way of representing (or at least, talking about) public and private relationships between individuals that allows us to reason about whether friend of a friend connections that are private might end up being disclosed in public just because it’s too complicated to work out whether something is, should, or might ‘reasonably’ be expected to be, public or private…

So here’s where I’m at: an asymmetry can be thought of as arising if one party in a relationship can reveal information about the other that the other believed they had disclosed to the one in a “private” way, or at least, not in a public way.

This all becomes relevant when we start thinking about ‘friend of a friend’ based friend recommendations or social search and potentially unwelcome disclosures that might result. It might also provide a way of helping us reason about situations where information flow can route around “privacy blocks” via network connections we might not be aware of?

PS here’s another example of possible asymmetric disclosure, this time taken from Twitter. Suppose @A, who has 50 or so followers, tweets “It’s my birthday”. If B, who is one of A’s followers, responds with “@A Happy Birthday”, that response will only appear in the feed of people who follow both A and B, although it can also be seen on B’s public page. If C, who has 1,000 ‘unmoderated’ followers (that is, C never blocks anyone) tweets “Hey, Happy Birthday @A”, all of C’s followers (which let’s assume are mainly spambots and social phishbots(?)) see the message. C has amplified A’s birthdate details. (Admittedly, A had already made that information public, but their intention may only have been to declare that fact to their 50 or so followers. So what we have here is potentially a case of unintended amplification…?)

See also: Brand Association and Your Twitter Followers

Finding New People to Follow in a Hashtag Community

Last night I spent an hour or two tinkering with the dev version of my prototype hashtag community explorer (Personal Twitter Networks in Hashtag Communities), in part prompted by a tweet from @sleslie, thinking about what sorts of features might help you decide whether or not to follow someone new from that community.

NOTE – at times this post reads like a mechanical, and very contrived, prescription for deciding whether to follow someone on Twitter according to how ‘useful’ they may be to you. I know friending/following is a lot more fluid/ad hoc than this, but that’s not the point, okay…? (though I’m not sure what the actual point is, yet…?!)

Part of the rationale for this is so that I can start reading about formal social network analysis with some sort of prior knowledge about what sorts of measures I think might be useful, and why, along with how easy they are to calculate in practice. And along with that, I was also looking for easy to do calculations that might be useful in the context of a friend recommendation algorithm. (It also occurs to me that this sort of thinking might be tangentially useful to the development of ‘trust’ or ‘reputation’ metrics that Martin is so keen on… e.g. Some more thoughts on metrics ;-)

So here’s where I got to, comparing myself and @jamesclay in the context of a sample of altc2009 hashtaggers:


The first metric is easy enough to calculate – @jamesclay’s friends/followers ratio. When rating how valuable a node might be in a network, I think the ratio of input (“friends”) edges to output (“followers”) edges is a useful one. If the number is close to zero, the node is acting in a largely broadcast mode. My friends/followers ratio is about 0.2-0.25 – approx 4 followers per friend, which works for me. Looking at the magnitude of the number of followers also gives you a clue as to how well connected a node is as a potential amplification channel.

The next pair of numbers I calculated related to the number of mutual friends and the number of mutual followers between myself and @jamesclay, normalised against my total number of friends and my total number of followers respectively.

The first measure – my “normalised mutual friends” – tells me what proportion of my friends are also jamesclay’s friends. That is, what proportion of my friends are mutually ‘trusted’ by the person I’m considering following (where friending someone on Twitter is taken as a vote of trust; we might also take the number of friends to be the number of people who can influence us on Twitter?). As this number tends to 1, it tells me the extent to which all the people I follow are also followed by @jamesclay. If this number equals one, @jamesclay has friended all the people I have. Although note that in that case, this may only be a small proportion of @jamesclay’s total friends list. (So maybe I need a measure to accommodate that? Eg the number of mutual friends normalised against @jamesclay’s total number of friends?) If the number tends to zero, then very few of the people who influence me (my friends) are also friended by @jamesclay.

My “normalised mutual followers” score tells me what proportion of my followers are also jamesclay’s followers. That is, what proportion of my followers mutually ‘trust’ both myself and jamesclay. If this number tends to one, all my followers are also following jamesclay; which would mean that a tweet from jamesclay would reach all my followers and maybe more. If the number tends to zero, we potentially influence completely different sets of people.

(I guess there’s a number we can grab here which is our shared audience size, that is, the number of our combined unique followers: my_followers+your_followers-mutual_followers. Dividing this by my_followers then gives an amplification factor if ‘you’ retweet me?)

The next two measures are based on the number of my friends who follow jamesclay. That is, the people I trust (as demonstrated by my friending them) who in turn trust (have friended/are following) jamesclay.

The first number is the number of my friends who follow jamesclay, divided by the total number of his followers. That is, what proportion of jamesclay’s followers are my friends? Or to put it another way, what proportion of jamesclay’s total following do I trust?

The last number is the number of my friends who follow jamesclay divided by the total number of my friends. That is, what proportion of my friends trust jamesclay.
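Pulling those measures together, here’s a minimal sketch of the arithmetic, assuming the friend and follower ID sets for “me” and “you” have already been pulled down from the API (the fetching isn’t shown, and empty sets aren’t guarded against):

def followMetrics(myFriends, myFollowers, yourFriends, yourFollowers):
  # all arguments are sets of user IDs, assumed already fetched
  m = {}
  m['your friends/followers ratio'] = float(len(yourFriends)) / len(yourFollowers)
  # proportion of my friends who are also your friends
  m['normalised mutual friends'] = float(len(myFriends & yourFriends)) / len(myFriends)
  # proportion of my followers who also follow you
  m['normalised mutual followers'] = float(len(myFollowers & yourFollowers)) / len(myFollowers)
  # combined unique audience, and the amplification factor if you retweet me
  sharedAudience = len(myFollowers | yourFollowers)
  m['amplification factor'] = float(sharedAudience) / len(myFollowers)
  # of your followers, what proportion are my friends...
  m['my friends as proportion of your followers'] = float(len(myFriends & yourFollowers)) / len(yourFollowers)
  # ...and what proportion of my friends follow you
  m['proportion of my friends who follow you'] = float(len(myFriends & yourFollowers)) / len(myFriends)
  return m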

Okay, so I have no idea where any of this is going, but I just needed to write it down so that I don’t have to remember it, but know that I can call on it if I do need it…;-) I fully expect that things relating to all the above have been properly worked out in the context of ‘proper’ social network analysis, but I’m still trying to generate my own context to make reading that stuff relevant.