Visualising Delicious Tag Communities Using Gephi

Years ago, I used the JavaScript InfoVis Toolkit to put together a handful of data visualisations around the idea of the “social life of a URL”, looking up bookmarked URLs on delicious and then seeing who had bookmarked them and what tags they had used (delicious URL History – Hyperbolic Tree Visualisation, More Hyperbolic Tree Visualisations – delicious URL History: Users by Tag). Whilst playing with some Twitter hashtag network visualisations today, I wondered whether I could do something similar based around delicious bookmark tags, so here’s a first-pass attempt…

As a matter of course, delicious publishes RSS and JSON feeds from tag pages, optionally containing up to 100 bookmarked entries. Each item in the response is a bookmarked URL, along with details of the person who saved that bookmark and the tags they used.

That is, for a particular tag on delicious we can trivially get hold of the 100 most recent bookmarks saved with that tag and data on:

– who bookmarked it;
– what tags they used.

Here’s a little script in Python to grab the user and tag data for each lak11 bookmark and generate a Gephi gdf file to represent the bipartite graph that associates users with the tags they have used:

import json
from urllib.request import urlopen

def getDeliciousTagURL(tag,typ='json', num=100):
  #need to add a pager to get data when more than 1 page
  #typ selects the feed format, num the number of entries requested
  return "http://feeds.delicious.com/v2/"+typ+"/tag/"+tag+"?count="+str(num)

def getDeliciousTaggedURLDetailsFull(tag):
  durl=getDeliciousTagURL(tag)
  #fetch the JSON feed and parse it into a list of bookmark records
  data = json.load(urlopen(durl))
  userTags={}
  uniqTags=[]
  for i in data:
    url= i['u']
    user=i['a']
    tags=i['t']
    title=i['d']
    if user in userTags:
      for t in tags:
        if t not in uniqTags:
          uniqTags.append(t)
        if t not in userTags[user]:
          userTags[user].append(t)
    else:
      userTags[user]=[]
      for t in tags:
        userTags[user].append(t)
        if t not in uniqTags:
          uniqTags.append(t)
  
  #open the gdf file for writing (the default mode is read-only)
  f=open('bookmarks-delicious_'+tag+'.gdf','w')
  f.write('nodedef> name VARCHAR,label VARCHAR, type VARCHAR\n')
  for user in userTags:
    f.write(user+','+user+',user\n')
  for t in uniqTags:
    f.write(t+','+t+',tag\n')

  f.write('edgedef> user VARCHAR,tag VARCHAR\n')
  for user in userTags:
    for t in userTags[user]:
      f.write(user+','+t+'\n')
  f.close()

tag='lak11'
getDeliciousTaggedURLDetailsFull(tag)

[Note to self: this script needs updating to grab additional results pages?]

Here’s an example of the output, in this case using the tag for Jim Groom’s Digital Storytelling course: ds106. The nodes are coloured according to whether they represent a user or a tag and sized according to degree; the layout is based on the Force Atlas layout, with a few tweaks so that the labels can be read clearly.
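For what it’s worth, the degree that drives the node sizing can also be worked out outside Gephi. Here’s a minimal sketch, assuming a userTags-style dictionary like the one built in the script above; the function name and the sample users are made up for illustration:

```python
from collections import Counter

def node_degrees(user_tags):
    """Count the edges attached to each node in the user/tag graph.

    Nodes are keyed as (type, name) so that a user and a tag that
    happen to share a name are counted separately.
    """
    degree = Counter()
    for user, tags in user_tags.items():
        for tag in set(tags):  # one edge per distinct user/tag pair
            degree[('user', user)] += 1
            degree[('tag', tag)] += 1
    return degree

# Hypothetical sample data, not taken from the real feed
sample = {'alice': ['ds106', 'video'], 'bob': ['ds106', 'audio', 'video']}
degrees = node_degrees(sample)
```

A user’s degree is the number of distinct tags they used; a tag’s degree is the number of users who used it.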

Note that the actual URLs that are bookmarked are not represented in any way in this visualisation. The network shows the connections between users and the tags they have used, irrespective of whether the tags were applied to the same or different URLs. Even if two users share common tags, they may not share any common bookmarks…
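To make that caveat concrete, here’s a rough sketch that compares shared tags with shared URLs for each pair of users. It assumes feed items shaped like the ones the script reads ('a' for the user, 't' for the tag list, 'u' for the URL); the function name, users and URLs are invented for the example:

```python
from itertools import combinations

def shared_tags_and_urls(items):
    """For each pair of users, return (shared tags, shared URLs)."""
    tags_by_user = {}
    urls_by_user = {}
    for item in items:
        user = item['a']
        tags_by_user.setdefault(user, set()).update(item['t'])
        urls_by_user.setdefault(user, set()).add(item['u'])
    overlap = {}
    for u1, u2 in combinations(sorted(tags_by_user), 2):
        overlap[(u1, u2)] = (tags_by_user[u1] & tags_by_user[u2],
                             urls_by_user[u1] & urls_by_user[u2])
    return overlap

# Hypothetical items: same tags, different bookmarks
items = [
    {'a': 'alice', 'u': 'http://example.com/a', 't': ['lak11', 'analytics']},
    {'a': 'bob', 'u': 'http://example.com/b', 't': ['lak11', 'analytics']},
]
overlap = shared_tags_and_urls(items)
```

In this made-up case the two users would be tightly connected in the tag network even though they have no bookmarks in common.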

Here’s another example, this time using the lak11 tag:

Looking at these networks, a couple of things struck me:

– the commonly used tags might express a category or conceptual tag that describes the original tag used to source the data;

– folk sharing similar tags may share similar interests.

Here’s a view over part of the LAK11 network with the LAK11 tag filtered out, and the Gephi ego filter applied with depth two to a particular user, in this case delicious user rosemary20:

The filtered view shows us:

– the tags a particular user (in this case, rosemary20) has used;

– the people who have used the same tags as rosemary20; note that this does not necessarily mean that they bookmarked any of the same URLs, nor that they are using the tags to mean the same thing*…

(* delicious does allow users to provide a description of a tag, though I’m not sure if this information is generally available via a public API?)

By sizing the nodes according to degree in this subnetwork, we can readily identify the tags commonly used alongside the tag used to source the data, and also the users who have used the largest number of identical tags.
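The filtering step can be approximated in code, too. This is only a sketch of what I take the depth-two ego filter to be doing on this particular graph, not Gephi’s implementation: starting from one user, keep the tags they used (depth one) and the users who used any of those tags (depth two), after dropping the tag that was used to source the data. The function name and sample data are hypothetical:

```python
def ego_network(user_tags, ego_user, exclude_tags=()):
    """Return the user/tag edges within two steps of ego_user.

    Depth 1: tags used by ego_user (minus any excluded tags).
    Depth 2: every user who has used at least one of those tags.
    """
    excluded = set(exclude_tags)
    ego_tags = set(user_tags.get(ego_user, [])) - excluded
    edges = set()
    for user, tags in user_tags.items():
        for tag in ego_tags.intersection(tags):
            edges.add((user, tag))
    return edges

# Hypothetical sample data
sample = {
    'ego': ['lak11', 'analytics', 'learning'],
    'alice': ['lak11', 'analytics', 'python'],
    'bob': ['lak11', 'python'],
}
edges = ego_network(sample, 'ego', exclude_tags=['lak11'])
```

With the source tag removed, bob drops out of the view because the only tag he shared with the ego user was the one used to collect the data.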

PS it struck me that a single page web app should be quite easy to put together to achieve something similar to the above visualisations. The JSON feed from delicious is easy enough to pull in to any page, and the Protovis code library has a force directed layout package that works on a simple graph representation not totally dissimilar to the Gephi/GDF format.

If I get an hour I’ll try to have a play to put a demo together. If you beat me to it, please post a link to your demo (or even fully blown app!) in the comments:-)
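As a starting point for that sort of demo, here’s a sketch that turns a userTags-style dictionary into a nodes-and-links structure of the general kind that force-directed layout libraries tend to work from (links referring to nodes by their position in the node list). The field names here are my own placeholders rather than anything a particular library requires, so check the library documentation before relying on them:

```python
import json

def to_nodes_and_links(user_tags):
    """Build a {'nodes': [...], 'links': [...]} graph description.

    Each link refers to its end points by index into the nodes list.
    """
    nodes = []
    index = {}

    def node_id(name, kind):
        key = (kind, name)
        if key not in index:
            index[key] = len(nodes)
            nodes.append({'name': name, 'type': kind})
        return index[key]

    links = []
    for user, tags in user_tags.items():
        source = node_id(user, 'user')
        for tag in tags:
            links.append({'source': source,
                          'target': node_id(tag, 'tag'),
                          'value': 1})
    return {'nodes': nodes, 'links': links}

# Hypothetical sample data
graph = to_nodes_and_links({'alice': ['lak11', 'analytics'], 'bob': ['lak11']})
graph_json = json.dumps(graph)
```

The resulting JSON string could then be loaded by the page and handed to whatever layout code is being used.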

Digging Deeper into the Structure of My Twitter Friends Network: Librarian Spotting

A couple of days ago, I grabbed the Twitter friends lists of all my Twitter friends (that is, lists of all the people that the people I follow on Twitter follow…) and plotted the connections between them filtered through the people I follow (Small World? A Snapshot of How My Twitter “Friends” Follow Each Other…). That is, for all of the people I follow on Twitter, I plotted the extent to which they follow each other… got that?
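In case it helps, the filtering step looks something like the following sketch. It assumes the friends lists have already been collected into a dictionary mapping each person I follow to the set of people they follow; the function name and the sample names are made up:

```python
def friend_to_friend_edges(friends_of):
    """Keep only 'follows' edges where both ends are people I follow.

    friends_of maps each of my friends to the set of accounts they follow.
    """
    my_friends = set(friends_of)
    edges = []
    for friend in sorted(friends_of):
        # discard anyone my friend follows who is not also one of my friends
        for followed in sorted(my_friends.intersection(friends_of[friend])):
            if followed != friend:
                edges.append((friend, followed))
    return edges

# Hypothetical sample: x and y are followed by my friends but not by me
sample = {'a': {'b', 'x'}, 'b': {'a', 'c'}, 'c': {'y'}}
edges = friend_to_friend_edges(sample)
```

The edges are directed, running from the follower to the person followed.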

Running the resulting network through Gephi’s modularity statistic (some sort of clustering algorithm; I really need to find out which), several distinct clusters of people turned up: OU folk, data journalism folk, ed techies, JISC/Museums/library folk, and open gov data folk.
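I haven’t checked exactly how Gephi finds its partitions, but the modularity score that this family of methods tries to maximise is easy enough to compute for a given partition. The sketch below uses the standard undirected definition (for each community, the fraction of edges that fall inside it minus the fraction expected from the node degrees alone); treating the graph as undirected is a simplification on my part, and the function name and sample graph are made up:

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected graph.

    edges is a list of (a, b) pairs; community_of maps node -> community.
    Assumes at least one edge.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for a, b in edges:
        degree_sum[community_of[a]] += 1
        degree_sum[community_of[b]] += 1
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single edge, split into the obvious two groups
edges = [('a', 'b'), ('b', 'c'), ('a', 'c'),
         ('d', 'e'), ('e', 'f'), ('d', 'f'),
         ('c', 'd')]
groups = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}
q = modularity(edges, groups)
```

A higher score means more of the edges sit inside the groups than would be expected by chance.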

(Gephi allows you to export the graph file for the current project, including any annotations (such as modularity class) that are added by running Gephi’s statistics. Extracting the list of nodes (i.e. Twitter users) and filtering them by modularity class means we can create separate lists of individuals based on which cluster they appear in, which in turn means that we could generate a Twitter list from those individuals.)
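Here’s a sketch of the extraction step, assuming the node data has been exported as a CSV file with one row per node. The column names (“Id” and “modularity_class”) are assumptions on my part and may need changing to match the actual export; the function name and the sample rows are made up:

```python
import csv
import io
from collections import defaultdict

def users_by_class(csv_file, id_column='Id', class_column='modularity_class'):
    """Group node ids by the value in the modularity class column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[class_column]].append(row[id_column])
    return dict(groups)

# Hypothetical export, inlined here so the example is self-contained
sample = io.StringIO(
    "Id,Label,modularity_class\n"
    "user_a,user_a,0\n"
    "user_b,user_b,1\n"
    "user_c,user_c,0\n"
)
clusters = users_by_class(sample)
```

Each value in the resulting dictionary is a list of users that could be used to seed a Twitter list.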

From my “curated” list of Twitter friends, we can identify a set of “OU twitterers” through a cluster analysis of the mass action of their own friending behaviour, and I could use this to automatically generate a Twitter list of (potential) OU Twitterers that other people can follow.

Here’s the total set of my friends, coloured by modularity class and sized by in-degree (that is, the number of my friends who follow that person).

My Twitter friends, coloured by modularity class

If we filter on modularity class, we can just look at the folk in what I have labelled “OU Twitterers”. There are one or two folk in there who don’t quite fit this label (e.g. University of Leicester folk, and a handful of otherwise “disconnected” folk…), but it’s not bad.

OU Twitterers

Note that if I grab the complete friends and followers lists of these individuals, and look for users who are commonly followed, who also tend to follow back, and who don’t have huge numbers of followers (i.e. they aren’t celebrities who automatically follow back…), I may discover other OU Twitterers that I don’t follow…
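That heuristic might be sketched along the following lines. Everything here is an assumption for the sake of illustration: the data structures, the function name, and in particular the thresholds, which would need tuning against real data:

```python
from collections import Counter

def candidate_members(friends_of, followers_of, follower_counts,
                      min_links=2, max_followers=5000):
    """Suggest people outside the cluster who look like they belong to it.

    friends_of / followers_of map each cluster member to the sets of
    accounts they follow / are followed by. A candidate must have a
    two-way connection with at least min_links members and no more than
    max_followers followers (unknown counts are treated as too large).
    """
    members = set(friends_of)
    mutual = Counter()
    for member in members:
        reciprocal = set(friends_of[member]) & set(followers_of.get(member, ()))
        for person in reciprocal - members:
            mutual[person] += 1
    return sorted(
        person for person, links in mutual.items()
        if links >= min_links
        and follower_counts.get(person, float('inf')) <= max_followers
    )

# Hypothetical sample data
friends_of = {'m1': {'x', 'celeb', 'm2'}, 'm2': {'x', 'celeb'}}
followers_of = {'m1': {'x', 'celeb'}, 'm2': {'x', 'celeb', 'm1'}}
follower_counts = {'x': 300, 'celeb': 2000000}
candidates = candidate_members(friends_of, followers_of, follower_counts)
```

In the made-up data both outsiders have mutual connections with two members, but the one with a huge follower count is screened out.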

If we run the modularity stat over this group of people, the “OU Twitterers” (most easily done by generating a new workspace from the filtered group), we see three more partitions fall out. Broadly, this first one corresponds to OU Library folk (ish…):

OU Library folk...

Twitterers from my faculty (several of whom rarely, if ever, tweet):

Twitterers I follow in my faculty

And the rest (the vast majority, in fact):

OU folk

(Note that a couple of folk are completely disconnected, and have nothing to do with the OU…)

Running the modularity stat over this larger group turns up nothing of interest.

So… so what? So this. Firstly, I can mine the friends lists of the friends of arbitrary people on Twitter and pull out clusters that may tell me something about the interests of those people. (For example, we might grab their Twitter biography statements and run them through a word cloud as a first approximation; or grab their recent tweets and do some text mining on those to see if there is any common interest. Hashtag analysis might also be revealing…) Secondly, we could use the members of a cluster as a first approximation to a list of connected members of a community interested in a particular topic area; for these community members we could then pull down lists of all their friends and followers and see if we can grow the list by adding other individuals that many of them are connected to.
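The word cloud idea boils down to counting words, so as a first approximation something like the following sketch would do. It assumes the biography statements for a cluster have already been fetched into a list of strings; the function name, the tiny stopword list and the sample biographies are all made up:

```python
import re
from collections import Counter

# A deliberately tiny stopword list, just for illustration
STOPWORDS = {'and', 'the', 'for', 'with', 'from'}

def bio_word_counts(bios, stopwords=STOPWORDS):
    """Count the words used across a list of biography strings."""
    counts = Counter()
    for bio in bios:
        # keep leading # and @ so hashtags and usernames stay recognisable
        for word in re.findall(r"[#@]?\w+", bio.lower()):
            if len(word) > 2 and word not in stopwords:
                counts[word] += 1
    return counts

# Hypothetical biographies
bios = ["Librarian at a university library",
        "Academic librarian and open data fan"]
counts = bio_word_counts(bios)
```

Calling counts.most_common(20) would then give the terms to feed into a word cloud, or simply to eyeball.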

PS after tweeting the original post, a couple of people asked if I could grab the data from their friends lists. For example, @neilkod’s turned up clusters relating to “Utah tweeps, my cycling ones, and of course data/#rstats.” So the approach appears to work in general…:-)