Where Next With The Hashtagging Twitterers List?

This post is a holding position, so it’s probably gonna be even more cryptic than usual…

In Who’s Tweeting Our Hashtag?, I described a recipe for generating a list of people who had been tweeting, twittering or whatever, using a particular hashtag.

So what’s next on my to do list with this info?

Well, first of all I thought it’d be interesting to try to plot a graph of connections between the followers of everyone on the list, to see how large the hashtag audience might be.

Using a list of about 60 or so twitterers, captured yesterday, I called the Twitter API http://twitter.com/followers/ids/USERNAME.xml function for each one to pull down an XML list of all of their followers by ID number, and topped it up with the user info (http://twitter.com/users/show/USERNAME.xml) for each person on the original list; this info meant I could in turn spot the ID for each of the hashtagging twitterers amongst the followers lists.

It’s easy enough to transform these lists into the dot format that can be plotted by GraphViz, but the 10,000 or so edges generated from the followers lists were too much for my version of GraphViz to cope with.
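As a rough illustration of that transformation, here is a minimal sketch. The function name followersToDot, the graph name and the ID numbers are all made up for the example, and I am assuming the followers lists have already been parsed into a PHP array keyed by hashtagger ID. Edges are drawn from follower to followed, which is an assumption about the convention rather than a record of what the original script did.

```php
<?php
// Sketch: turn "hashtagger ID => list of follower IDs" into Graphviz dot text.
// All IDs below are invented for illustration.
function followersToDot(array $followersOf, string $graphName = 'hashtagNet'): string
{
    $lines = ['digraph ' . $graphName . ' {'];
    foreach ($followersOf as $hashtagger => $followers) {
        foreach ($followers as $follower) {
            // One directed edge per relationship: follower -> person followed.
            $lines[] = sprintf('  %s -> %s;', $follower, $hashtagger);
        }
    }
    $lines[] = '}';
    return implode("\n", $lines) . "\n";
}

$followersOf = [
    101 => [201, 202, 102],
    102 => [201, 101],
];
echo followersToDot($followersOf);
```

With 60 or so hashtaggers and a few hundred followers each, the edge count climbs into the thousands very quickly, which is where the plotting problem comes from.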

So instead, I thought I’d just try to plot a subgraph, such as the graph of people who were following a minimum specified number of people in the original hashtag twittering list. So for example, the graph of people who were following at least five of the people who’d used the particular hashtag.

I hacked a piece of code to do this, but it’s far from ideal and I’m not totally convinced it works properly… Ideally what I want is a simple (efficient) utility that will accept a .dot file and prune it, removing nodes of less than a specified degree. (If you know of such a tool, please post a link to it in the comments:-)
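For what it’s worth, here is a minimal sketch of the kind of pruning utility I have in mind. It is not the code I hacked together; parseDotEdges and pruneByDegree are hypothetical names, the parser only understands simple "a -> b;" edge statements (no chained edges, no subgraphs), and degree is taken to mean in-degree plus out-degree in the original graph.

```php
<?php
// Pull simple "a -> b" edges out of .dot text. Label lines are ignored.
function parseDotEdges(string $dot): array
{
    preg_match_all('/(\w+)\s*->\s*(\w+)/', $dot, $matches, PREG_SET_ORDER);
    return array_map(fn(array $hit) => [$hit[1], $hit[2]], $matches);
}

// Keep only edges whose two endpoints both have degree >= $minDegree,
// where degree is counted once over the ORIGINAL edge list.
function pruneByDegree(array $edges, int $minDegree): array
{
    $degree = [];
    foreach ($edges as [$from, $to]) {
        $degree[$from] = ($degree[$from] ?? 0) + 1;
        $degree[$to]   = ($degree[$to] ?? 0) + 1;
    }
    return array_values(array_filter(
        $edges,
        fn(array $e) => $degree[$e[0]] >= $minDegree && $degree[$e[1]] >= $minDegree
    ));
}

$dot = "digraph g { a -> b; a -> c; b -> c; d -> a; }";
foreach (pruneByDegree(parseDotEdges($dot), 2) as [$from, $to]) {
    echo $from . ' -> ' . $to . ";\n";
}
```

Note that this is a single pass: removing a node lowers the degree of its neighbours, so a stricter prune would repeat the step until nothing changes.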

Here’s the first graph I managed to plot:

If my code is working, an edge points to a person if that person is following at least, err, lots of the other people [that is: lots of other people who used the hashtag]. So under the assumption that the code is working, this graph shows one person at the centre of the graph who is following lots of people who have tweeted the hashtag. Any guesses who that person might be? People who have edges directed towards them in this sort of plot are people who are heavily following the people using a particular hashtag. If you’re a conference organiser, I’m guessing that you’d probably want to appear in this sort of graph?

(If the code isn’t working, I’m not sure what the hell it is doing, or what the graph shows?!;-)

One other thing I thought I’d look at was the people who are following lots of people on the hashtagging list who haven’t themselves used the hashtag. These are the people to whom the event is being heavily amplified.

So for example, here we have a chart that is constructed as follows. The hashtag twitterers list is constructed from a sample of the most recent 500 opened09 hashtagged tweets around about the time stamp of this post and contains people who are in that list at least 3 times.

The edges on the chart are directed towards people who are not on the hashtag list but who are following more than 13 of the people who are on the list.
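The selection rule for that chart can be sketched as follows. This is a reconstruction under assumptions, not the script that drew the chart: amplifiedAudience is a hypothetical name, the input is assumed to be a map from each hashtagger’s ID to the list of their follower IDs, and the IDs are invented.

```php
<?php
// Count, for every follower, how many hashtaggers they follow, then keep
// those who follow MORE THAN $threshold hashtaggers and are not themselves
// on the hashtag list.
function amplifiedAudience(array $followersOf, int $threshold): array
{
    $counts = [];
    foreach ($followersOf as $followers) {
        foreach (array_unique($followers) as $follower) {
            $counts[$follower] = ($counts[$follower] ?? 0) + 1;
        }
    }
    $result = [];
    foreach ($counts as $follower => $n) {
        if ($n > $threshold && !array_key_exists($follower, $followersOf)) {
            $result[$follower] = $n;
        }
    }
    arsort($result); // heaviest followers of the community first
    return $result;
}

$followersOf = [
    1 => [10, 11, 2],
    2 => [10, 1],
    3 => [10, 11, 1],
];
print_r(amplifiedAudience($followersOf, 1));
```

Dropping the array_key_exists test gives the earlier variant, where hashtaggers who heavily follow other hashtaggers are included too.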

Hmmmm… anyway, that’s more than enough confusion for now… I’m going to try not to tinker with this any more for a bit, because a holiday beckons and this could turn into a mindf**k project… However, when I do return to it, I think I’m going to have a go at attacking it with a graph/network toolkit, such as NetworkX, and see if I can do a proper bit of network analysis on the resulting graphs.

More Thinkses Around Twitter Hashtag Networks: #JISCRI

A brief next step on from Preliminary Thoughts on Visualising the OpenEd09 Twitter Network and A Quick Peek at the IWMW2009 Twitter Network with a couple of graphs that look at the hashtag network around the JISCRI event that’s going on this week.

The sample was taken from a search of recent #jiscri hashtagged tweets captured last night using the Hashtag Twitterers pipe.

The first chart was to look at people who the hashtag twitterers were following in large numbers who weren’t using the hashtag (I think…my experimental protocol was a bit ropey last night… oops).

The graphs were plotted using Graphviz – firstly a radial plot:

jiscrinetExtGurus

And then a circular one:

jiscrinetExtGurus2

The circular one is quite fun, I think? :-) At a glance, it shows who the “external gurus” are, as well as the differences in their influence.

The second thing I looked at was the network graph of the JISCRI hashtaggers, showing who friended whom:

jiscriTwitterNet

Here’s the circular view:

jiscriTwitterNetCircular

For a large event, I think this sort of graph could be quite fun to generate at both the start of the event and at the end of the event, to show how connections can be formed during an event.
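If the network were captured as a list of directed edges at the start and again at the end, picking out the connections formed in between is a simple set difference. A minimal sketch, with the function name newConnections and the sample names invented for the example:

```php
<?php
// Each edge is a [follower, followed] pair. Returns the edges present in
// $after that were not present in $before.
function newConnections(array $before, array $after): array
{
    $key  = fn(array $e) => $e[0] . '->' . $e[1];
    $seen = array_flip(array_map($key, $before));
    return array_values(array_filter(
        $after,
        fn(array $e) => !isset($seen[$key($e)])
    ));
}

$start = [['ann', 'bob']];
$end   = [['ann', 'bob'], ['bob', 'ann'], ['cat', 'ann']];
foreach (newConnections($start, $end) as [$from, $to]) {
    echo $from . ' started following ' . $to . "\n";
}
```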

For conferences that publish lists of attendees, popping up a poster of the delegates’ twitter network might provide an interesting discussion thing for people to chat around.

PS See also Meet @HelloApp, Making Conferences More Fun.

Handling Yahoo Pipes Serialised PHP Output

One of the output formats supported by Yahoo Pipes is a PHP style array. In this post, which describes a way of seeing how well connected a particular Twitter user is to other Twitterers who have recently used a particular hashtag, I’ll show you how it can be used.

The following snippet (cribbed from Coding Forums) shows how to handle this array:

//Declare the required pipe, specifying the php output
$req = "http://pipes.yahoo.com/ouseful/hashtagtwitterers?_render=php&min=3&n=100&q=%23jiscri";

// Make the request
$phpserialized = file_get_contents($req);

// Parse the serialized response
$phparray = unserialize($phpserialized);

//Here's the raw contents of the array
print_r($phparray);

//Here's how to parse it
foreach ($phparray['value']['items'] AS $key => $val)
	printf("<div><p><a href=\"%s\">%s</a></p><p>%s</p></div>\n", $val['link'], $val['title'], $val['description']);

The pipe used in the above snippet (http://pipes.yahoo.com/ouseful/hashtagtwitterers) displays a list of people who have recently used a particular hashtag on Twitter a minimum specified number of times.

It’s easy enough to parse out the Twitter ID of each individual, and then for a particular named individual see which of those hashtagging Twitterers they are either following, or are following them. (Why’s this interesting? Well, for any given hashtag community, it can show you how well connected you are with that community).

So let’s see how to do it. First, parse out the Twitter ID:

foreach ($phparray['value']['items'] AS $key => $val) {
	$id=preg_replace("/@([^\s]*)\s.*/", "$1", $val['title']);
	$idList[] = $id; 
}
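To see what that regular expression does in isolation, here is a small sketch. The sample titles are assumptions (I’m guessing at a title of the form "@name something"), and screenNameFromTitle is just a wrapper name made up for the example.

```php
<?php
// Same pattern as above: grab the run of non-whitespace characters after
// the "@" and throw away everything from the first whitespace onwards.
function screenNameFromTitle(string $title): string
{
    return preg_replace('/@([^\s]*)\s.*/', '$1', $title);
}

echo screenNameFromTitle('@psychemedia 5 tweets'), "\n";
echo screenNameFromTitle('@psychemedia'), "\n";
```

The second call shows a caveat: the pattern needs some whitespace after the name, so a title that is only "@name" comes back unchanged, "@" and all.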

We have the Twitter screennames, but now we want the actual Twitter user IDs. There are several PHP libraries for accessing the Twitter API. The following relies on an old, rejigged version of the library available from http://github.com/jdp/twitterlibphp/tree/master/twitter.lib.php (the code may need tweaking to work with the current version…), and is really kludged together… (Note to self – tidy this up one day!)

The algorithm is basically as follows, and generates a GraphViz .dot file that will plot the connections a particular user has with the members of a particular hashtagging community:

  • get the list of hashtagger Twitter usernames (as above);
  • for each username, call the Twitter API to get the corresponding Twitter ID, and print out a label that maps each ID to a username;
  • for the user we want to investigate, pull down the list of people who follow them from the Twitter API; for each follower, if the follower is in the hashtaggers set, print out that relationship;
  • for the user we want to investigate, pull down the list of people who they follow (i.e. their ‘friends’) from the Twitter API; for each friend, if the friend is in the hashtaggers set, print out that relationship.

$Twitter = new Twitter($myTwitterID, $myTwitterPwd);

//Get the Twitter ID for each user identified by the hashtagger pipe
foreach ($idList as $user) {
	$user_det=$Twitter->showUser($user, 'xml');
 	$p = xml_parser_create();
	xml_parse_into_struct($p,$user_det,$results,$index);
	xml_parser_free($p);
	$id=$results[$index['ID'][0]]['value'];
	$userID[$user]=$id;
	//print out labels in the Graphviz .dot format
	echo $id."[label=\"".$user."\"];\r";
}

//$userfocus is the Twitter screenname of the person we want to examine
$currUser=$userID[$userfocus];
 
//So who in the hashtagger list is following them?
$follower_det=$Twitter->getFollowers($userfocus, 'xml');
$p = xml_parser_create();
xml_parse_into_struct($p,$follower_det,$results,$index);
xml_parser_free($p);
foreach ($index['ID'] as $item){
	$follower=$results[$item]['value'];
	//print out edges in the Graphviz .dot format
	if (in_array($follower,$userID)) echo $follower."->".$currUser.";\r";
}

//And who in the hashtagger list are they following?
$friends_det=$Twitter->getFriends($userfocus, 'xml');
$p = xml_parser_create();
xml_parse_into_struct($p,$friends_det,$results,$index);
xml_parser_free($p);
foreach ($index['ID'] as $item){
	$followed=$results[$item]['value'];
	//print out edges in the Graphviz .dot format
	if (in_array($followed,$userID)) echo $currUser."->".$followed.";\r";
}

For completeness, here are the Twitter object methods and their associated Twitter API calls that were used in the above code:

function showUser($id,$format){
	$api_call=sprintf("http://twitter.com/users/show/%s.%s",$id,$format);
  	return $this->APICall($api_call, false);
}

function getFollowers($id,$format){
  	$api_call=sprintf("http://twitter.com/followers/ids/%s.%s",$id,$format);
 	return $this->APICall($api_call, false);
}
  
function getFriends($id,$format){
  	$api_call=sprintf("http://twitter.com/friends/ids/%s.%s",$id,$format);
 	return $this->APICall($api_call, false);
}

Running the code uses N+2 Twitter API calls, where N is the number of different users identified by the hashtagger pipe.

The output of the script is almost a minimal Graphviz .dot file. All that’s missing is the wrapper, e.g. something like: digraph twitterNet { … }. Here’s what a valid file looks like:

(The labels can appear either before or after the edges – it makes no difference as far as GraphViz is concerned.)
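Adding the wrapper can itself be scripted. The following is a sketch only: wrapAsDigraph is a hypothetical helper, the IDs and names in the sample are invented, and it also converts the \r line endings the script emits into \n on the way through.

```php
<?php
// Wrap the label and edge statements printed by the script in a
// "digraph name { ... }" block so Graphviz will accept the result.
function wrapAsDigraph(string $body, string $name = 'twitterNet'): string
{
    $out = 'digraph ' . $name . " {\n";
    foreach (preg_split('/\r\n|\r|\n/', trim($body)) as $line) {
        if ($line !== '') {
            $out .= '  ' . $line . "\n";
        }
    }
    return $out . "}\n";
}

// Made-up script output: two labels and one edge.
$body = "1001[label=\"alice\"];\r1002[label=\"bob\"];\r1002->1001;\r";
echo wrapAsDigraph($body);
```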

Plotting the graph will show you who the individual of interest is connected to, and how, in the particular hashtag community.

So for example, in the recent #ukoer community, here’s how LornaMCampbell is connected. First a ‘circular’ view:

ukoerInternalNetLMC2

The arrow direction goes FROM one person TO a person they are following. In the circular diagram, it can be quite hard to see whether a connection is reciprocated or one way.
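If the picture doesn’t make it obvious, the reciprocated links can also be pulled out of the edge list directly. A minimal sketch, assuming the edges are available as [from, to] pairs; classifyLinks and the sample names are made up for the example.

```php
<?php
// Split a directed edge list into reciprocated pairs and one-way links.
function classifyLinks(array $edges): array
{
    $set = [];
    foreach ($edges as [$from, $to]) {
        $set[$from . '>' . $to] = true;
    }
    $mutual = [];
    $oneWay = [];
    foreach ($edges as [$from, $to]) {
        if (isset($set[$to . '>' . $from])) {
            // Record each reciprocated pair once, names in sorted order.
            $pair = [$from, $to];
            sort($pair);
            $mutual[implode('<->', $pair)] = true;
        } else {
            $oneWay[] = $from . '->' . $to;
        }
    }
    return ['mutual' => array_keys($mutual), 'oneWay' => $oneWay];
}

print_r(classifyLinks([['ann', 'bob'], ['bob', 'ann'], ['cat', 'ann']]));
```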

The Graphviz network diagram uses a separate edge for each connection and makes it easier to spot reciprocated links:

ukoerInternalNetLMC

So, there we have it. Another way of looking at Twitter hashtag networks to go along with Preliminary Thoughts on Visualising the OpenEd09 Twitter Network, A Quick Peek at the IWMW2009 Twitter Network and More Thinkses Around Twitter Hashtag Networks: #JISCRI

Personal Twitter Networks in Hashtag Communities

Another conference I’m not at, this time ALT-C, so time for another blatant attempt to raise my profile at the event even if I’m not there with another Twitter related hack…;-) This time, a little tool to help you explore the extent of your Twitter network within a community of people using a particular hashtag.

Here’s a tease of the sort of report it gives:

My place in a Twitter hashtag community http://ouseful.open.ac.uk/twitterMyhashtagNet.php?q=psychemedia&h=altc2009

The report contains:

  • some numbers (I’ll let you know what in a minute…);
  • a list of people in the hashtag network who are followed by a particular individual (their “friends”);
  • a list of people in the hashtag network who follow a particular individual, but are not followed (friended) back (their “serfs”);
  • a list of people in the hashtag network who are followed (friended) by a particular individual, but do not follow them (their “slebs”);
  • a list of people in the hashtag network who neither follow nor are followed (friended) by a particular individual (“the void”).

Before I go on, I should probably also define what I mean by a hashtag community, not least because there are some, err, pragmatic constraints on defining this;-)

For the purposes of this post, a hashtag community is a collection of people who have used a particular hashtag more than a certain minimum specified number of times in a set of Twitter posts that use the hashtag. In my default ad hoc set up, I tend to look for people who have used the hashtag more than 3 times in the most recent 500 or so tweets. For the proof of concept demo, I also limit the size of the hashtag network to 100, otherwise the pipework that underpins it starts to fall over…

UPDATE:
Here’s a bit more explanation about why the app doesn’t always show people in the community you ‘know’ to be there…

You may notice that not everyone you know to have used the hashtag appears in the friends and followers lists. This is because the size of the hashtag community is limited in three ways:

  • hashtag use sample size: for this proof of concept, the hashtag community analysis is based on a Twitter search that grabs the 500 most recent uses of the declared hashtag. If this were a production tool, it would pull the complete archive of hashtag use from one of the twitter archiving services. If you want that feature, build it yourself…;-)
  • minimum number of tweets: an optional parameter in the URI identifies the minimum number of hashtagged tweets that a user must have sent in the sample to be considered a member of the community. Setting this number large allows you to see just the heaviest hashtaggers in the community, or to filter out people who maybe just use the hashtag once in a retweet. (I think there’s a bug in the code – if you set this mintweets parameter to 2, the user must have hashtagged at least 3 times, i.e. one more; set it to 10 and the threshold is actually 11.)
  • max community size: an ‘issue’ in the Twitter search API means I need to call the Twitter API once for every person in the community. This overhead can break the pipework, so the community size can be limited arbitrarily.
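The three limits above can be sketched in a few lines. This is an illustration rather than the code behind the tool: hashtagCommunity is a hypothetical function, and the input is assumed to be a flat list with one screen name per sampled tweet.

```php
<?php
// $tweetAuthors: one entry per tweet in the sample (e.g. the 500 most recent).
// Returns screen name => tweet count for community members, heaviest first.
function hashtagCommunity(array $tweetAuthors, int $minTweets, int $maxUsers): array
{
    $counts = array_count_values($tweetAuthors);
    // ">=" gives "at least $minTweets". Writing ">" here instead would give
    // the "one more than you asked for" behaviour described above.
    $members = array_filter($counts, fn(int $n) => $n >= $minTweets);
    arsort($members);
    return array_slice($members, 0, $maxUsers, true);
}

$sample = ['ann', 'bob', 'ann', 'cat', 'ann', 'bob'];
print_r(hashtagCommunity($sample, 2, 100));
```

Whether "at least" or "more than" is the intended reading is a design choice; the point is just that the comparison operator decides it.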

The inspiration for the report is a typical ego thing – to what extent is my personal Twitter network dominated by the membership of a particular hashtag community. (Note I’ve explored related ideas in a variety of other ad hoc ways: Who’s Tweeting Our Hashtag?, Where Next With The Hashtagging Twitterers List?, Preliminary Thoughts on Visualising the OpenEd09 Twitter Network, A Quick Peek at the IWMW2009 Twitter Network, More Thinkses Around Twitter Hashtag Networks: #JISCRI and Handling Yahoo Pipes Serialised PHP Output).

Anyway, in the current example, the numbers I’ve started to look at are defined as follows. All numbers are either integers, or real numbers in the range 0..1.

So what do the numbers mean?

  • Number of hashtaggers: the number of people in the hashtag network, Ngalaxy.
  • Hashtaggers as followers (‘hashtag followers’): the number of people in the hashtag community who are following the named individual, Gfollowers.
  • Hashtaggers as friends (‘hashtag friends’): the number of people in the hashtag community that the named individual has friended, Gfriends.
  • Hashtagger followers not friended (‘serfs’): the number of people in the hashtag community that follow the named individual but that are not followed back (i.e. who are not friends of the named individual), Gserfs.
  • Hashtagger friends not following (‘slebs’): the number of people in the hashtag community that are followed by the named individual (i.e. friends) but that do not follow them back (i.e. who are not also followers of the named individual), Gslebs.
  • Hashtaggers not friends or followers (‘the hashtag void’): the number of people in the hashtag community who neither follow, nor are friended by, the named individual, Gvoid.
  • Reach into hashtag community: the proportion of the hashtag community that follow the named individual; a measure of the extent to which an individual can reach the hashtag community without actually using the hashtag; Greach=Gfollowers/Ngalaxy.
  • Reception of hashtag community: the proportion of the hashtag community that are followed by (i.e. are friends of) the named individual; a measure of the extent to which an individual sees messages from the hashtag community without directly tracking the hashtag; Greception=Gfriends/Ngalaxy.
  • Hashtag void (normalised): the size of the void normalised relative to the size of the hashtag community; the proportion of the hashtag community that are unlikely to be directly encountered outside of the hashtag community; Normvoid=Gvoid/Ngalaxy.
  • Total personal followers: the total number of followers of the named individual, Nfollowers.
  • Total personal friends: the total number of friends of the named individual, Nfriends.
  • Hashtag community dominance of personal reach: the extent to which the hashtag community dominates the set of people who follow the named individual, Domreach=Gfollowers/Nfollowers. If all the named individual’s followers are in the hashtag community, Domreach=1. If none of them are, Domreach=0.
  • Hashtag community dominance of personal reception: the extent to which the set of the named individual’s friends is dominated by members of the hashtag community, Domreception=Gfriends/Nfriends. If all the named individual’s friends are in the hashtag community, Domreception=1. If none of them are, Domreception=0.
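All of these fall out of a handful of set operations. The sketch below is not the code behind the tool; hashtagReport is a hypothetical function and the three inputs are assumed to be plain, de-duplicated lists of IDs or screen names that have already been fetched.

```php
<?php
// $community: members of the hashtag community
// $followers: everyone who follows the named individual
// $friends:   everyone the named individual follows
function hashtagReport(array $community, array $followers, array $friends): array
{
    $gFollowers = array_intersect($community, $followers);
    $gFriends   = array_intersect($community, $friends);
    $serfs = array_diff($gFollowers, $gFriends);   // follow, not followed back
    $slebs = array_diff($gFriends, $gFollowers);   // followed, don't follow back
    $void  = array_diff($community, $gFollowers, $gFriends);

    // Guard against dividing by zero when a list is empty.
    $ratio = fn(int $a, int $b): float => $b > 0 ? $a / $b : 0.0;
    $n = count($community);

    return [
        'Ngalaxy'      => $n,
        'Gfollowers'   => count($gFollowers),
        'Gfriends'     => count($gFriends),
        'Gserfs'       => count($serfs),
        'Gslebs'       => count($slebs),
        'Gvoid'        => count($void),
        'Greach'       => $ratio(count($gFollowers), $n),
        'Greception'   => $ratio(count($gFriends), $n),
        'Normvoid'     => $ratio(count($void), $n),
        'Nfollowers'   => count($followers),
        'Nfriends'     => count($friends),
        'Domreach'     => $ratio(count($gFollowers), count($followers)),
        'Domreception' => $ratio(count($gFriends), count($friends)),
    ];
}

print_r(hashtagReport(
    ['a', 'b', 'c', 'd', 'e'],   // community
    ['a', 'b', 'x', 'y'],        // followers of the individual
    ['b', 'c', 'z']              // friends of the individual
));
```

One thing the definitions leave open is whether the named individual counts as a member of their own hashtag community; the sketch does nothing special about that, so if they are in $community they land in the void.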

If you want to try the tool out, the interface is provided by the URI:
http://ouseful.open.ac.uk/twitterMyhashtagNet.php?q=ostephens&h=altc2009&mintweets=2&maxusers=99

I have no idea whether any of these measures are used in more formal analyses (I’ve yet to start my formal reading of the proper social network analysis stuff…) but it’s a way in for me to start thinking about what measures might be in some sense meaningful and both easy to explain and calculate;-)

Finding New People to Follow in a Hashtag Community

Last night I spent an hour or two tinkering with the dev version of my prototype hashtag community explorer (Personal Twitter Networks in Hashtag Communities), in part prompted by a tweet from @sleslie, thinking about what sorts of features might help you decide whether or not to follow someone new from that community.

NOTE – at times this post reads like a mechanical, and very contrived, prescription for deciding whether to follow someone on Twitter according to how ‘useful’ they may be to you. I know friending/following is a lot more fluid/ad hoc than this, but that’s not the point, okay…? (though I’m not sure what the actual point is, yet…?!)

Part of the rationale for this is so that I can start reading about formal social network analysis with some sort of prior knowledge about what sorts of measures I think might be useful, and why, along with how easy they are to calculate in practice. And along with that, I was also looking for easy to do calculations that might be useful in the context of a friend recommendation algorithm. (It also occurs to me that this sort of thinking might be tangentially useful to the development of ‘trust’ or ‘reputation’ metrics that Martin is so keen on… e.g. Some more thoughts on metrics ;-)

So here’s where I got to, comparing myself and @jamesclay in the context of a sample of altc2009 hashtaggers:


The first metric is easy enough to calculate – @jamesclay’s friends/followers ratio. When rating how valuable a node might be in a network, I think the ratio of input (“friends”) edges to output (“followers”) edges is a useful one. If the number is close to zero, the node is acting in a largely broadcast mode. My friends/followers ratio is about 0.2-0.25 – approx 4 followers per friend, which works for me. Looking at the magnitude of the number of followers also gives you a clue as to how well connected a node is as a potential amplification channel.

The next pair of numbers I calculated related to the number of mutual friends and the number of mutual followers between myself and @jamesclay, normalised against my total number of friends and my total number of followers respectively.

The first measure – my “normalised mutual friends” tells me what proportion of my friends are also jamesclay’s friends. That is, what proportion of my friends are mutually ‘trusted’ by the person I’m considering following (where friending someone on twitter is taken as a vote of trust; we might also take the number of friends to be the number of people who can influence us on Twitter?). As this number tends to 1, it tells me the extent to which all the people I follow are also followed by @jamesclay. If this number equals one, @jamesclay has friended all the people I have. Although note that in that case, this may only be a small proportion of @jamesclay’s total friends list. (So maybe I need a measure to accommodate that? Eg the number of mutual friends normalised against @jamesclay’s total number of friends?) If the number tends to zero, then very few of my influenced

My “normalised mutual followers” score tells me what proportion of my followers are also jamesclay’s followers. That is, what proportion of my followers mutually ‘trust’ both myself and jamesclay. If this number tends to one, all my followers are also following jamesclay; which would mean that a tweet from jamesclay would reach all my followers and maybe more. If the number tends to zero, we potentially influence completely different sets of people.

(I guess there’s a number we can grab here which is our shared audience size, that is, the number of our combined unique followers: my_followers+your_followers-mutual_followers. Dividing this by my_followers then gives an amplification factor if ‘you’ retweet me?)

The next two measures are based on the number of my friends who follow jamesclay. That is, the people I trust (as demonstrated by my friending them) who in turn trust (have friended/are following) jamesclay.

The first number is the number of my friends who follow jamesclay, divided by the total number of his followers. That is, what proportion of jamesclay’s followers are my friends? Or to put it another way, what proportion of jamesclay’s total following do I trust?

The last number is the number of my friends who follow jamesclay divided by the total number of my friends. That is, what proportion of my friends trust jamesclay.
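Pulling the measures from this post together, a minimal sketch might look like this. followMetrics and its array keys are names invented for the example, and the four inputs are assumed to be de-duplicated lists of IDs for "me" and for the person I’m thinking of following.

```php
<?php
function followMetrics(
    array $myFriends,
    array $myFollowers,
    array $theirFriends,
    array $theirFollowers
): array {
    // Guard against dividing by zero when a list is empty.
    $ratio = fn(int $a, int $b): float => $b > 0 ? $a / $b : 0.0;

    $mutualFriends   = count(array_intersect($myFriends, $theirFriends));
    $mutualFollowers = count(array_intersect($myFollowers, $theirFollowers));
    // People I follow who in turn follow the other person.
    $myFriendsFollowingThem = count(array_intersect($myFriends, $theirFollowers));
    // Combined unique followers: mine + theirs - the ones we share.
    $combined = count($myFollowers) + count($theirFollowers) - $mutualFollowers;

    return [
        'theirFriendsFollowersRatio' => $ratio(count($theirFriends), count($theirFollowers)),
        'normMutualFriends'          => $ratio($mutualFriends, count($myFriends)),
        'normMutualFollowers'        => $ratio($mutualFollowers, count($myFollowers)),
        'combinedAudience'           => $combined,
        'amplificationFactor'        => $ratio($combined, count($myFollowers)),
        'shareOfTheirFollowersITrust' => $ratio($myFriendsFollowingThem, count($theirFollowers)),
        'shareOfMyFriendsWhoTrustThem' => $ratio($myFriendsFollowingThem, count($myFriends)),
    ];
}

print_r(followMetrics(
    ['a', 'b', 'c', 'd'],            // my friends
    ['p', 'q', 'r', 's', 't'],       // my followers
    ['a', 'b', 'x'],                 // their friends
    ['p', 'q', 'c', 'd', 'y', 'z']   // their followers
));
```

In the made-up data, half of my friends are also their friends, two of my five followers also follow them, and a retweet from them would take my potential audience from 5 to 9 people.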

Okay, so I have no idea where any of this is going, but I just needed to write it down so that I don’t have to remember it, but know that I can call on it if I do need it…;-) I fully expect that things relating to all the above have been properly worked out in the context of ‘proper’ social network analysis, but I’m still trying to generate my own context to make reading that stuff relevant.

Asymmetric Disclosure in Social Networks

A thought in process…

In a social network, under what conditions should relationships between individuals be publicly discoverable?

So for example, if I am a member of a social network that supports private groups and I put You in one of My private groups, and You put Me in one of Your public groups, should Our relationship be publicly disclosed on Your profile?

A ‘real world’ version of this (?maybe): suppose You have a problem. You ask Me for a chat about it over coffee in a public coffee shop. Under what circumstances should I be able to disclose in public that You and I had that coffee together?

I haven’t got a proper definition of what I think I mean by asymmetric disclosure yet, but what I (think I) want is to find a way of representing (or at least, talking about) public and private relationships between individuals that allows us to reason about whether friend of a friend connections that are private might end up being disclosed in public just because it’s too complicated to work out whether something is, should be, or might ‘reasonably’ be expected to be, public or private…

So here’s where I’m at: an asymmetry can be thought of as arising if one party in a relationship can reveal information about the other that the other believed they had disclosed to the one in a “private” way, or at least, not in a public way.

This all becomes relevant when we start thinking about ‘friend of a friend’ based friend recommendations or social search and potentially unwelcome disclosures that might result. It might also provide a way of helping us reason about situations where information flow can route around “privacy blocks” via network connections we might not be aware of?

PS here’s another example of possible asymmetric disclosure, this time taken from Twitter. Suppose @A, who has 50 or so followers, tweets “It’s my birthday”. If B, who is one of A’s followers, responds with “@A Happy Birthday”, that response will only appear in the feed of people who follow both A and B, although it can also be seen on B’s public page. If C, who has 1,000 ‘unmoderated’ followers (that is, C never blocks anyone) tweets “Hey, Happy Birthday @A”, all of C’s followers (which let’s assume are mainly spambots and social phishbots(?)) see the message. C has amplified A’s birthdate details. (Admittedly, A had already made that information public, but their intention may only have been to declare that fact to their 50 or so followers. So what we have here is potentially a case of unintended amplification…?)

See also: Brand Association and Your Twitter Followers

Gephi Bits 1: Comments on Social Objects in a Closed Community

This is the first in a series of bitty posts (if it makes less sense than usual, tough) just cobbling together a couple of observations about some of the things it looks like you can get Gephi to do with variously formatted network data…

The setting is data from an OU course (U101 Design thinking: creativity for the 21st century), in which students (with unique identifiers), post images to a course public space, and then comment on and favourite each other’s images.

A research project (that I’m not officially part of…;-) is looking at how the commenting and favouriting behaviour develops, whether it influences the work students do and I guess whether there is any correlation with grades. After a brainstorming chat with Jennefer Hart yesterday, I had a little tinker last night and again this morning with some of the data, and here’s where I’ve got… (This is open netbook science the informal and scruffy way, right?!;-)

The data comes in various spreadsheets:
– a sheet containing photo ids (a number), user IDs (alphanumeric), date of upload, etc;
– a sheet containing photo ids, comment ids (a number), the comment, date of submission, and if it’s a reply to another comment, the id of that other comment (a number);
– a sheet containing photo ids, favourite ids (a number), and user id of the person who favorited the image;
– a sheet containing a list of student group ids; students are assigned to different groups for different epochs within the course. Every so often new groups (with new ids) form and students are assigned to these new groups.

So – what can we do with this data? The first thing I did was to try to error trap confusion between numerical photo IDs, comment IDs and favourite IDs, so I rewrote these in the form pNNNN, cNNNN and fNNNN respectively. Gephi will use the ID to identify each separate node, so we need to make sure that a node representing photo id 234 is not treated as the same node as comment id 234.

I actually augmented the data using a text editor, e.g. taking three column data presented in CSV style as [commentID, photoID, username] and running the following search and replace expression over it:
(\d*),(\d*),(.*)\r -> c\1,p\2,\3\r
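The same rewrite can be scripted rather than done by hand in a text editor. A sketch, assuming the same three column [commentID, photoID, username] layout; prefixIds is a made-up name, the sample usernames are invented, and unlike the editor expression it first normalises line endings to \n and insists on at least one digit in each ID.

```php
<?php
// Prefix numeric comment and photo IDs so that comment 234 and photo 234
// end up as distinct node IDs (c234 and p234).
function prefixIds(string $csv): string
{
    // The editor expression above matched \r line endings; normalise first.
    $csv = preg_replace('/\r\n?/', "\n", $csv);
    return preg_replace('/^(\d+),(\d+),(.*)$/m', 'c${1},p${2},${3}', $csv);
}

echo prefixIds("1428,189,ab12\r1429,191,cd34\r");
```

Rows that don’t start with two numeric columns (a header row, say) are left untouched, which is worth checking for before loading the result into Gephi.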

The next thing was to decide on the file format to use to get the data into Gephi. Gephi can accept CSV data, where each row describes the connections from one node to the next (so if you have a list of edges “a connects to b”, “a connects to c” etc, a two column CSV file could easily describe this).

So for example, taking a CSV dump of “photo id, comment id” pairs, we can generate something along the lines of this graph, where node size is the degree of the node, which is to say the number of edges impinging on the node;-) (That is, the number of comments a photo has, for example…)

Photos by number of comments

(The layout was achieved by running the Yifan Hu layout algorithm for a few seconds with an optimal distance of 1000.)
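The node sizes in that plot are just edge counts, so they can be sanity checked outside Gephi. A minimal sketch, assuming a two column CSV in which each row is "comment id,photo id" with the prefixed IDs described above; targetInDegree is a hypothetical name and the IDs are invented.

```php
<?php
// Count how many rows point at each ID in the second column,
// i.e. the number of comments per photo for a comment,photo edge list.
function targetInDegree(string $csv): array
{
    $degree = [];
    foreach (preg_split('/\r\n|\r|\n/', trim($csv)) as $line) {
        $cells = array_map('trim', explode(',', $line));
        if (count($cells) < 2 || $cells[1] === '') {
            continue; // skip blank or malformed rows
        }
        $degree[$cells[1]] = ($degree[$cells[1]] ?? 0) + 1;
    }
    arsort($degree); // most commented photo first
    return $degree;
}

print_r(targetInDegree("c1428,p189\nc1429,p189\nc1430,p191\n"));
```

This uses a plain explode rather than a proper CSV parser, which is fine for bare ID pairs but would not cope with quoted fields containing commas.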

One handy feature of Gephi (I think?) is that it appears to let us add data to the network already open in Gephi from another file. So for example, I think I can augment the photo’n’comment data with photo’n’faves data:

Merging graphs in gephi?

This is the effect I get when I load in the second data set…

Importing a 2nd data set that should share node IDs..

Is Gephi seeing photos with the same ID as the same node, whether they’re linked to comments or favorites? How can I tell? Maybe I should refresh the statistics and then replot the graph? The random layout is as good as any to start with:

Gephi random layout

Seems to look ok…. err..?;-)
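One way of getting some reassurance, independently of what Gephi shows, is to check how many photo IDs the two files actually have in common. A sketch under the assumption that both files are two column CSVs with the photo ID in the second column; secondColumnIds and sharedPhotoIds are made-up names, as are the IDs.

```php
<?php
// Unique IDs found in the second column of a simple two column CSV.
function secondColumnIds(string $csv): array
{
    $ids = [];
    foreach (preg_split('/\r\n|\r|\n/', trim($csv)) as $line) {
        $cells = array_map('trim', explode(',', $line));
        if (count($cells) >= 2 && $cells[1] !== '') {
            $ids[$cells[1]] = true;
        }
    }
    return array_map('strval', array_keys($ids));
}

// Photo IDs that appear in both the comment file and the favourite file.
function sharedPhotoIds(string $commentCsv, string $faveCsv): array
{
    return array_values(array_intersect(
        secondColumnIds($commentCsv),
        secondColumnIds($faveCsv)
    ));
}

$comments = "c1,p1\nc2,p2\n";
$faves    = "f1,p2\nf2,p3\n";
print_r(sharedPhotoIds($comments, $faves));
```

If the merge has worked, the node count in Gephi should equal comments plus favourites plus the number of distinct photo IDs across both files, with the shared photos counted once.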

So what can we learn? First of all, let’s find a photo that has a large number of inlinks (presumably – hopefully – the sum of favorites and comments…?) – we can use a filter to do this:

Finding the popular photos

Maybe one way to see what connects to popular nodes is to look at the Ego network? [See a much better way in the PS below…] Remove the previous filter to regain the whole graph, and we can have a play… Because I’ve loaded the data in as a directed graph (from comment to photo, or from favourite to photo), I don’t think a depth one ego search will work (because there are no links of depth 1 going away from the photo node). But if we explore a little further, it seems that for some reason a depth 2 search works, which is handy… [UPDATE – I think I’d messed my settings up – seems to work fine with depth 1…]

Gephi - looking at comments and faves round a photo

We can also use the data table to look at the list of comment and favourite IDs.

Okay, that’s enough for now… what have we done?

– loaded simple edge connection data (simple pairs – comment to photo, for example) into gephi using csv; I used a directed edge to distinguish between photos and annotations.
– added one graph to another: we started with comment data then added the favourite data in on top; in order to view the new data, it’s probably best to run the in/out degree statistic over the combined data set just to be sure you’re not looking at just comment or favourite inlink stats;
– spotted which photos are popular based on combined favourite and comment views, and then used (abused?) the Ego filter to see which comments and favourites were associated with an image. If we’d used undirected edges, the Ego filter might have worked at depth 1?

And what comes to mind next? Firstly, it would be useful to render 2 dimensions of data, for example, colour to show the number of favourites and node size to show the number of comments. (I’m not sure how to do this? Could we maybe label/colour the edges and get a count based on that? OR maybe fudge it, having inlinks for comments and outlinks for faves?) Secondly, we need to start bringing in personal data – who uploaded which photo, who made which comment, and start to explore how active individuals are. But that’s all for another day…

PS following a comment by Alan Cann, I realised that because the graph is largely disjoint – there are separate clusters for each photo, that is only linked to by favourites and comments, with each favourite and comment only linking to one photo – if we run the modularity statistic we get a modularity close to 1 and with clusters around each image:

Modularity classes/partitions

If we expand one of the classes, we can see the photo at the centre and the favourites and comments that (I think) apply to it:

Expanding a class

This seems plausible – that the modularity stat identifies the disjoint bits of graph? I wonder if there is a tool that will definitely and only split the graph into disjoint partitions?
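In the absence of a dedicated tool, here's a minimal sketch of how the disjoint bits could be pulled out directly from the edge list, ignoring edge direction. The function name and the sample IDs are made up for illustration; this is plain Python rather than anything Gephi provides.

```python
from collections import defaultdict

def connected_components(edges):
    """Group nodes into disjoint components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Walk outwards from an unvisited node to collect its whole component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

# Made-up comment/favourite to photo pairs, in the style of the CSV data.
edges = [("c1428", "p189"), ("f1005", "p189"), ("c1429", "p191"), ("f1006", "p191")]
components = connected_components(edges)
```

With data shaped like the comment and favourite pairs, each component should contain exactly one photo plus the comments and favourites attached to it.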

Gephi Bits 2: A Further Look at Comments on Social Objects in a Closed Community

In the previous post in this set (Gephi Bits 1: Comments on Social Objects in a Closed Community), I started having a play with comment and favourites data from a series of peer review activities in the OU course Design thinking: creativity for the 21st century.

In particular, I loaded simple pairwise CSV data directly into Gephi, relating comment ids and favourite ids to photo ids. The resulting images provided a view over the photos that showed which photos were heavily commented and/or favourited. Towards the end of the post, I suggested it might be interesting to be able to distinguish between the comment and favourite nodes by colouring them somehow. So let’s start by seeing how we might achieve that…

The easiest way I can think of is to preload Gephi with a definition of each node and the assignment of a type label to each node – photo, comment or favourite. We can then partition – and colour – each node based on the type label.

To define the nodes and type labels, we can use a file defined using the GUESS .gdf format. In particular, we define the nodes as follows:

nodedef> name VARCHAR, ltype VARCHAR
p189, photo
p191, photo

c1428, comment
c1429, comment

f1005, fave
f1006, fave

Load this file into Gephi, and then append the contents of the comment-photo and favourite-photo CSV files to the graph. We can then colour the nodes (sized according to, err, something!) according to partition:

Coloured partitions in Gephi

If we filter the network for a particular photo using an ego filter, we can get a colour coded view of the comment and favourite IDs associated with that image:

Coloured nodes and labels in Gephi

What we’ve achieved so far is a way of exploring how heavily commented or favourited a photo is, as well as picking up a tip or two about labeling and colouring nodes. But what about if we wanted a person level analysis, where we could visually identify the individuals who had posted the most images, or whose images were most heavily commented upon and favourited?

To start with, let’s capture some information about each of the nodes. In the following example, we have an identifier (for a photo, favourite or comment), followed by a user id (the person who made the comment or favourite, or who uploaded the photo), and a label (photo, comment or fave). (The ltype field also captures a sense of this.)

nodedef> name VARCHAR, username VARCHAR, ltype VARCHAR
p189,jd342,photo
p191,jd342,photo
p192,pn43,photo
..
c1189,pd73,comment
c1190,srs22,comment
..
f46,ww66,fave
f47,ee79,fave

Rather than describe edges based on connecting comment or favourite ID to photo ID, we can easily generate links of the form userID, photoID, where userID is the ID of the user making a comment or favouriting an image. It is also possible to annotate the edges to describe whether the link relates to a comment or a favouriting action. So for example:

edgedef> otherUser VARCHAR, photo VARCHAR, etype VARCHAR
pd73,p189,comment
srs22,p226,comment

ww66,p176,fave

Alternatively, we might just use the simpler format:
edgedef> otherUser VARCHAR, photo VARCHAR
pd73,p189
srs22,p226

ww66,p176

In this simpler case, we can just load in the node definition gdf file, and follow it by adding the actual graph edge data from CSV files, which is what I’ve done for what follows.
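For what it's worth, a file in the fuller format (node definitions plus annotated edges) could be generated with something like the following sketch. The function name gdf_text and the sample rows are made up for illustration; the header lines are simply copied from the examples above, and I haven't checked how strict Gephi is about the edge column names.

```python
def gdf_text(nodes, edges):
    """Build the text of a GDF file from node and edge tuples.

    nodes: (name, username, ltype) tuples
    edges: (otherUser, photo, etype) tuples
    """
    lines = ["nodedef> name VARCHAR, username VARCHAR, ltype VARCHAR"]
    lines += [",".join(row) for row in nodes]
    lines.append("edgedef> otherUser VARCHAR, photo VARCHAR, etype VARCHAR")
    lines += [",".join(row) for row in edges]
    return "\n".join(lines) + "\n"

# Illustrative rows only, in the style of the examples above.
nodes = [("p189", "jd342", "photo"), ("c1189", "pd73", "comment"), ("f46", "ww66", "fave")]
edges = [("pd73", "p189", "comment"), ("ww66", "p176", "fave")]
gdf = gdf_text(nodes, edges)

# To save it for loading into Gephi (hypothetical filename):
# with open("photos.gdf", "w") as f:
#     f.write(gdf)
```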

Firstly, here’s the partition colour palette:

Gephi - partition colours

The null entities relate to nodes that didn’t get an explicit node specification (i.e. the person nodes).

To provide a bit of flexibility over the graph, I loaded the favourites and comment edges in as directed edges from “Other user” to photo ID, where “Other user” is the user ID of the person making the comment or favourite.

If we size the graph by out-degree, we can look at which users are actively engaged in commenting on photos:

Gephi - who's commenting/favouriting

The size of the arrow depicts whether or not there are multiple edges going from one person to a photo, so we can see, for example, where someone has made multiple comments on the same photo.

If we size by in-degree, we can see which photos are popular:

Gephi - what photos are popular
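The same in-degree and out-degree counts can be sanity-checked outside Gephi. Here's a minimal sketch over a list of (user, photo) edges; the function name and the edge data are invented for illustration. Repeated pairs are counted each time, which mirrors the multiple edges mentioned above.

```python
from collections import Counter

def degree_counts(edges):
    """Count in-degree and out-degree for directed (source, target) pairs."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree

# Made-up user -> photo edges; pd73 comments twice on the same photo.
edges = [("pd73", "p189"), ("pd73", "p189"), ("srs22", "p189"), ("ww66", "p176")]
in_degree, out_degree = degree_counts(edges)

most_active = out_degree.most_common(1)   # who comments/favourites the most
most_popular = in_degree.most_common(1)   # which photo attracts the most attention
```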

If we run an ego filter over a photo id, we can see who commented on it.

However, what we would really like to be able to do is look at the connections between people via a photo (for example, to see who has favourited whose photos). If we add in another edge data file that links from a photo ID to a person ID (the person who uploaded the photo), we can start to explore these relationships.

NB the colour palette changes in what follows…

Having captured user to photo relationships based on commenting, favouriting or uploading behaviour, we can now do things like the following. Here for example is a use of a simple filter to see which of a user’s photos are popular:

Gephi - simple filter

If we run a simple ego filter, we can see the photos that a user has uploaded or commented on/favourited:

Gephi - ego filter

If we increase the depth to 2, we can see who else a particular user is connected to by virtue of a shared interest in the same photographs (I’m not sure what edge size relates to here…?):

Ego depth 2 in gephi - who connects to whom

Here, user ba49 is outsize because they uploaded a lot of the images that are shown. (The above graph shows linkage between ba49 and other users who either commented on/favourited one of ba49’s images, or who commented on/favourited a photo that ba49 also commented on/favourited.)
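To make the idea of ego depth a bit more concrete, here's a rough sketch of a depth-limited neighbourhood search over an edge list. It's my own approximation of what an ego filter does, not Gephi's implementation, and the function name and sample edges are made up. The directed flag shows why edge direction matters: a photo with only inbound links reaches nothing if we only follow outbound edges.

```python
from collections import defaultdict

def ego_network(edges, ego, depth=1, directed=False):
    """Return the set of nodes within `depth` hops of `ego`."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        if not directed:
            # Treat the link as two-way so we can walk back from photo to user.
            neighbours[target].add(source)

    reached = {ego}
    frontier = {ego}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            next_frontier |= neighbours.get(node, set())
        frontier = next_frontier - reached
        reached |= frontier
    return reached

# Made-up user -> photo edges.
edges = [("pd73", "p189"), ("ba49", "p189"), ("ba49", "p200"),
         ("ww66", "p200"), ("zz1", "p300")]

photos_for_user = ego_network(edges, "ba49", depth=1)
connected_users = ego_network(edges, "ba49", depth=2)
outbound_only = ego_network(edges, "p189", depth=1, directed=True)
```

At depth 1 we get the photos ba49 is linked to; at depth 2 we pick up the other users linked to those same photos.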

Doh – it appears I’ve crashed Gephi, so as it’s late, I’m going to stop for now! In the next post, I’ll show how we can further elaborate the nodes using extended user identifiers that designate the role a person is acting in (eg as a commenter, favouriter or photo uploader) to see what sorts of view this lets us take over the network.

Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…

As privacy erodes further and further, and more and more people start to reveal where they are using location services, how easy is it to identify communities based on location, say, or postcode, rather than hashtag? That is, how easy is it to find people who are colocated in space, rather than topic, as in the hashtag communities? Very easy, it turns out…

One of the things I’ve been playing with lately is “community detection”, particularly in the context of people who are using a particular hashtag on Twitter. The recipe in that case runs something along the lines of: find a list of twitter user names for people using a particular hashtag, then grab their Twitter friends lists and look to see what community structures result (e.g. look for clusters within the different twitterers). The first part of that recipe is key, and generalisable: find a list of twitter user names.

So, can we create a list of names based on co-location? Yep – easy: Twitter search offers a “near:” search limit that lets you search in the vicinity of a location.

Here’s a Yahoo Pipe to demonstrate the concept – Twitter hyperlocal search with map output:

Pipework for twitter hyperlocal search with map output

[UPDATE: since grabbing that screenshot, I’ve tweaked the pipe to make it a little more robust…]

And here’s the result:

Twitter local trend

It’s easy enough to generate a widget of the result – just click on the Get as Badge link to get the embeddable widget code, or add the widget direct to a dashboard such as iGoogle:

Yahoo pipes map badge

(Note that this pipe also sets the scene for a possible demo of a “live pipe”, e.g. one that subscribes to searches via pubsubhubbub, so that whenever a new tweet appears it’s pushed to the pipe, and that makes the output live, for example by using a webhook.)

You can also grab the KML output of the pipe using a URL of the form:
http://pipes.yahoo.com/pipes/pipe.run?_id=f21fb52dc7deb31f5fffc400c780c38d&_render=kml&distance=1&location=YOUR+LOCATION+STRING
and post it into a Google maps search box… like this:

Yahoo pipe in google map

(If you try to refresh the Google map, it may suffer from result caching… in which case you have to cache bust, e.g. by changing the distance value in the pipe URL to 1.0, 1.00, etc…;-)
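If you're generating these URLs from a script, the cache busting trick can be automated. Here's a small sketch that builds the KML URL and pads the distance with extra zeros; the function name and the bust parameter are my own inventions, and the pipe ID is just the one from the URL above.

```python
from urllib.parse import urlencode

PIPE_ID = "f21fb52dc7deb31f5fffc400c780c38d"  # pipe ID taken from the URL above

def pipe_kml_url(location, distance=1, bust=0):
    """Build the pipe's KML URL; `bust` is the number of decimal places to pad with."""
    # Extra trailing zeros (1 -> 1.0 -> 1.00) change the URL but not the value.
    dist = str(distance) if bust == 0 else "{:.{n}f}".format(distance, n=bust)
    params = [("_id", PIPE_ID), ("_render", "kml"),
              ("distance", dist), ("location", location)]
    return "http://pipes.yahoo.com/pipes/pipe.run?" + urlencode(params)

url_plain = pipe_kml_url("milton keynes")
url_busted = pipe_kml_url("milton keynes", distance=1, bust=2)
```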

Something else that could be useful for community detection is to search through the localised/co-located tweets for popular hashtags. Whilst we could probably do this in a separate pipe (left as an exercise for the reader), maybe by using a regular expression to extract hashtags and then the unique block filtering on hashtags to count the reoccurrences, here’s a Python recipe:

# Python 2 (uses urllib.quote_plus and urllib.urlopen)
import re, simplejson, urllib

def getYahooAppID():
  appid='YOUR_YAHOO_APP_ID_HERE'
  return appid

def placemakerGeocodeLatLon(address):
  # Geocode an address string; returns (lat, lon), or (False, False) if nothing is found
  encaddress=urllib.quote_plus(address)
  appid=getYahooAppID()
  url='http://where.yahooapis.com/geocode?location='+encaddress+'&flags=J&appid='+appid
  data = simplejson.load(urllib.urlopen(url))
  if data['ResultSet']['Found']>0:
    for details in data['ResultSet']['Results']:
      # return the first match
      return details['latitude'],details['longitude']
  return False,False

def twSearchNear(tweeters,tags,num,place='mk7 6aa,uk',term='',dist=1):
  # Page through up to num search results (100 per page) for tweets near place,
  # updating the tweeters and tags dicts with counts
  t=int(num/100)
  page=1
  lat,lon=placemakerGeocodeLatLon(place)
  while page<=t:
    url='http://search.twitter.com/search.json?geocode='+str(lat)+'%2C'+str(lon)+'%2C'+str(1.0*dist)+'km&rpp=100&page='+str(page)+'&q=+within%3A'+str(dist)+'km'
    if term!='':
      url+='+'+urllib.quote_plus(term)

    page+=1
    data = simplejson.load(urllib.urlopen(url))
    for i in data['results']:
      # skip old-style "RT @user" retweets
      if not i['text'].startswith('RT @'):
        u=i['from_user'].strip()
        if u in tweeters:
          tweeters[u]['count']+=1
        else:
          tweeters[u]={}
          tweeters[u]['count']=1
        # count each hashtag used in the tweet
        ttags=re.findall("#([a-z0-9]+)", i['text'], re.I)
        for tag in ttags:
          if tag not in tags:
            tags[tag]=1
          else:
            tags[tag]+=1

  return tweeters,tags

''' Usage:
tweeters={}
tags={}
num=100 #number of search results, best as a multiple of 100 up to max 1500
location='PLACE YOU WANT TO SEARCH AROUND'
term='OPTIONAL SEARCH TERM TO NARROW DOWN SEARCH RESULTS'
tweeters,tags=twSearchNear(tweeters,tags,num,location,term)
'''

What this code does is:
– use Yahoo placemaker to geocode the address provided;
– search in the vicinity of that area (note to self: allow additional distance parameter to be set; currently 1.0 km)
– identify the unique twitterers, as well as counting the number of times they tweeted in the search results;
– identify the unique tags, as well as counting the number of times they appeared in the search results.

Here’s an example output for a search around “Bath University, UK”:

Having got the list of Twitterers (as discovered by a location based search), we can then look at their social connections as in the hashtag community visualisations:

Community detected around Bath U… Hmm… people there who shouldn’t be?!

And wondering why the likes of @pstainthorp and @martin_hamilton appear to be in Bath? Is the location search broken, picking up stale data, or some other error….? Or is there maybe a UKOLN event on today I wonder..?

PS Looking at a search near “University of Bath” in the web based Twitter search, it seems that: a) there aren’t many recent hits; b) the search results pull up tweets going back in time…

Which suggests to me:
1) the code really should have a time window to filter the tweets by time, e.g. excluding tweets that are more than a day or even an hour old; (it would be so nice if Twitter search API offered a since_time: limit, although I guess it does offer since_id, and the web search does offer since: and until: limits that work on date, and that could be included in the pipe…)
2) where there aren’t a lot of current tweets at a location, we can get a profile of that location based on people who passed through it over a period of time?
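The time window in point 1 could be sketched along the following lines. I'm assuming each search result carries a created_at string in the RFC 2822 style (e.g. "Mon, 25 Oct 2010 12:30:00 +0000"), which is how I remember the search results looking; the function name and the sample data are made up.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def recent_tweets(results, max_age, now=None):
    """Keep only the results whose created_at falls within max_age of now."""
    now = now or datetime.now(timezone.utc)
    recent = []
    for tweet in results:
        # Assumes an RFC 2822 style timestamp with an explicit +0000 offset.
        created = parsedate_to_datetime(tweet["created_at"])
        if now - created <= max_age:
            recent.append(tweet)
    return recent

# Made-up results: one half an hour old, one more than a day old.
now = datetime(2010, 10, 25, 13, 0, tzinfo=timezone.utc)
results = [
    {"created_at": "Mon, 25 Oct 2010 12:30:00 +0000", "text": "in the library"},
    {"created_at": "Sun, 24 Oct 2010 09:00:00 +0000", "text": "passing through"},
]
last_hour = recent_tweets(results, timedelta(hours=1), now)
last_two_days = recent_tweets(results, timedelta(days=2), now)
```

Something like this could be dropped in just before the loop that counts tweeters and tags, so that only recent tweets contribute to the counts.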

UPDATE: Problem solved…

The location search is picking up tweets like this:

Twitter locations...

but when you click on the actual tweet link, it’s something different – a retweet:

Twitter retweets pass through the original location

So “official” Twitter retweets appear to pass through the location data of the original tweet, rather than the person retweeting… so I guess my script needs to identify official twitter retweets and dump them…

PS if you want to see how folk tweeting around a location are socially connected (i.e. whether they follow each other), check out A Bit of NewsJam MoJo – SocialGeo Twitter Map.

Social Networks on Delicious

One of the many things that the delicious social networking site appears to have got wrong is how to gain traction from its social network. As well as the incidental social network that arises from two or more different users using the same tag or bookmarking the same resource (for example, Visualising Delicious Tag Communities Using Gephi), there is also an explicit social network constructed using an asymmetric model similar to that used by Twitter: specifically, you can follow me (become a “fan” of me) without my permission, and I can add you to my network (become a fan of you, again without your permission).

Realising that you are part of a social network on delicious is not really that obvious though, nor is the extent to which it is a network. So I thought I’d have a look at the structure of the social network that I can crystallise out around my delicious account, by:

1) grabbing the list of my “fans” on delicious;
2) grabbing the list of the fans of my fans on delicious and then plotting:
2a) connections between my fans and their fans who are also my fans;
2b) all the fans of my fans.

(Writing “fans” feels a lot more ego-bollox than writing “followers”; is that maybe one of the nails in the delicious social SNAFU coffin?!)

Here’s the way my “fans” on delicious follow each other (maybe? I’m not sure if the fans call always grabs all the fans, or whether it pages the results?):

(The network is plotted using Gephi, of course; nodes are coloured according to modularity clusters, and the layout is derived from a Force Atlas layout.)

Here’s the wider network – that is, showing fans of my fans:

In this case, nodes are sized according to betweenness centrality and coloured according to in-degree (that is, the number of my fans who have this person as a fan). [This works in so far as we’re trying to identify reputation networks. If we’re looking for reach in terms of using folk as a resource discovery network, it would probably make more sense to look at the members of my network, and the networks of those folk…]

If you want to try to generate your own, here’s the code:

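# Python 2; openTimestampedFile() and gephiCoreGDFNodeHeader() are helper functions not shown here
import simplejson, urllib, time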

def getDeliciousUserFans(user,fans):
  url='http://feeds.delicious.com/v2/json/networkfans/'+user
  #needs paging? or does this grab all the fans?
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    fans.append(u['user'])
    #time also available: u['dt']
  #print fans
  return fans

def getDeliciousFanNetwork(user):
  f=openTimestampedFile("fans-delicious","all-"+user+".gdf")
  f2=openTimestampedFile("fans-delicious","inner-"+user+".gdf")
  f.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  f2.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f2.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  fans=[]
  fans=getDeliciousUserFans(user,fans)
  for fan in fans:
    time.sleep(1)
    fans2=[]
    print "Fetching data for fan "+fan
    fans2=getDeliciousUserFans(fan,fans2)
    for fan2 in fans2:
      f.write(fan+","+fan2+"\n")
      if fan2 in fans:
        f2.write(fan+","+fan2+"\n")
  f.close()
  f2.close()

So what’s the next step…?!