Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community
…got that?;-) Or in other words, this is a post looking at some visualisations of the #ddj hashtag community…
A couple of days ago, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre. I’ll try to post some notes on it, err, sometime; but for now, here’s a quick collection of some of the various things I’ve tinkered with around hashtag communities, using #ddj as an example, as a “note to self” that I really should pull these together somehow, or at least automate some of the bits of workflow; I also need to move away from Twitter’s Basic Auth (which gets switched off this week, I think?) to oAuth*…
*At least Twitter is offering a single access token which “is ideal for applications migrating to OAuth with single-user use cases”. Rather than having to request key and secret values in an oAuth handshake, you can just grab them from the Twitter Application Control Panel. Which means I should be able to just plug them into a handful of Tweepy commands:
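auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# the single access token's key and secret, copied from the control panel
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)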
So, first up is the hashtag community view showing how an individual is connected to the set of people using a particular hashtag (and which at the moment only works for as long as Twitter search turns up users around the hashtag…)
Having got a Twapperkeeper API key, I guess I really should patch this to allow a user to explore their hashtag community based on a Twapperkeeper archive (like the #ddj archive)…
One thing the hashtag community page doesn’t do is explore any of the structure within the network… For that, I currently have a multi-step process:
1) get a list of people using a hashtag (recipe: Who’s Tweeting our hashtag?) using this Yahoo Pipe and output the data as CSV using a URL with the pattern (replace ddj with your desired hashtag):
2) take the column of twitter homepage URLs of people tweeting the hashtag and replace http://twitter.com/ with nothing to give the list of Twitter usernames using the hashtag.
3) run a script that finds the twitter numerical IDs of users given their Twitter usernames listed in a simple text file:
import tweepy

auth = tweepy.BasicAuthHandler("????", "????")
api = tweepy.API(auth)

f = open('hashtaggers.txt')
f2 = open('hashtaggersIDs.txt', 'w')
f3 = open('hashtaggersIDnName.txt', 'w')

for uid in f:
    uid = uid.strip()  # drop the trailing newline read from the file
    user = api.get_user(uid)
    print(uid)
    f2.write(str(user.id) + '\n')
    f3.write(uid + ',' + str(user.id) + '\n')

f.close()
f2.close()
f3.close()
Note: this a) is Python; b) uses tweepy; c) needs changing to use oAuth.
4) run another script (gist here; note that this code is far from optimal or even well behaved: it needs refactoring, and also tweaking so it plays nice with the Twitter API) that gets lists of the friends and the followers of each hashtagger from their Twitter ID and writes these to CSV files in a variety of ways. In particular, for each list (friends and followers), generate three files where the edges represent: i) links between a hashtagger and other hashtaggers (“inner” edges within the community); ii) links between a hashtagger and non-hashtaggers (“outside” edges from the community); iii) links between a hashtagger and both hashtaggers and non-hashtaggers;
5) an edit of the friends/followers CSV files to put them into an appropriate format for viewing in tools such as Gephi or Graphviz. For Gephi, edges can be defined using comma separated pairs of nodes (e.g. userID, followerID) with a bit of syntactic sugar; we can also use the list of Twitter usernames/user IDs to provide nice labels for the community’s “inner” nodes:
nodedef> name VARCHAR,label VARCHAR
edgedef> user VARCHAR,friend VARCHAR
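Steps 2, 4 and 5 above can be sketched in a few lines of Python. This is only a rough illustration, not the code from the gist: the function names (username_from_url, split_edges, write_gdf) and the sample IDs and usernames are made up for the example, and edges are assumed to be (user, friend) pairs of ID strings.

```python
import io

TWITTER_PREFIX = "http://twitter.com/"


def username_from_url(url):
    """Step 2: strip the homepage prefix to leave the bare username."""
    url = url.strip()
    if url.startswith(TWITTER_PREFIX):
        return url[len(TWITTER_PREFIX):]
    return url


def split_edges(edges, hashtagger_ids):
    """Step 4: separate (user, friend) pairs into inner and outer edges."""
    members = set(hashtagger_ids)
    inner = [(u, f) for u, f in edges if f in members]
    outer = [(u, f) for u, f in edges if f not in members]
    return inner, outer


def write_gdf(out, labels, edges):
    """Step 5: write labelled nodes, then edges, under the gdf headers."""
    out.write("nodedef> name VARCHAR,label VARCHAR\n")
    for user_id, name in labels.items():
        out.write(f"{user_id},{name}\n")
    out.write("edgedef> user VARCHAR,friend VARCHAR\n")
    for user, friend in edges:
        out.write(f"{user},{friend}\n")


# Invented example data, not taken from the #ddj snapshot.
labels = {"101": "alice", "102": "bob"}
edges = [("101", "102"), ("101", "999"), ("102", "101")]

inner, outer = split_edges(edges, labels)
buf = io.StringIO()
write_gdf(buf, labels, inner)
print(buf.getvalue())
```

Writing to any file-like object means the same function can target a real .gdf file via open('ddj.gdf', 'w') (a hypothetical filename) or an in-memory buffer for checking the output.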
Having got a gdf formatted file, we can load it straight in to Gephi:
In 3d view we can get something like this:
Node size is proportional to number of hashtag users following a hashtagger. Colour (heat) is proportional to number of hashtaggers followed by a hashtagger. Arrows go from a hashtagger to the people they are following. So a large node size means lots of other hashtaggers follow that person. A hot/red colour shows that person is following lots of the other hashtaggers. Gephi allows you to run various statistics over the graph to allow you to analyse the network properties of the community, but I’m not going to cover that just now! (Search this blog using “gephi” for some examples in other contexts.)
Different layouts, colours, text sizes and so on can be used to modify this view in Gephi, which can also generate high quality PDF renderings of 2D versions of the graph:
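The two quantities behind the node size and colour are just the in-degree and out-degree of each node over the “inner” edges. Gephi calculates these itself; the following is only a hedged sketch of the same counts in Python, assuming edges are (hashtagger, person_they_follow) pairs and using made-up names.

```python
from collections import Counter


def community_degrees(inner_edges):
    """Count followers and followees within the hashtag community.

    Each edge runs from a hashtagger to someone they follow, so the
    second element of the pair gains a follower (node size) and the
    first element gains a followee (colour heat).
    """
    followers = Counter(friend for _, friend in inner_edges)
    following = Counter(user for user, _ in inner_edges)
    return followers, following


# Invented example: alice and carol both follow bob; bob follows alice.
example_edges = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
followers, following = community_degrees(example_edges)
print(followers["bob"], following["bob"])
```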
(If you want to produce your own visualisation in Gephi, I popped the gdf file for the network here.)
If we export the gexf representation of the graph, we can use Alexis Jacomy’s Flex gexfWalker component to provide an interactive explorer for the graph:
Clicking on a node allows you to explore who a particular hashtagger is following:
Remember, arrows go from a hashtagger to the person they are following. Note that the above visualisation allows you to see reciprocal links. The colourings are specified via the gexf file, which itself had its properties set in the Preview Settings dialogue in Gephi:
As well as looking at the internal structure of the hashtag community, we can look at all the friends and/or followers of the hashtaggers. The graph for this is rather large (70k nodes and 90k edges), so after a lazyweb request to @gephi I found I had to increase the memory allocation for the Gephi app (the app stalled on loading the graph when it had run out of memory…).
If we load in the graph of “outer friends” (that is, the people the hashtaggers follow who are not hashtaggers) and filter the graph to only show nodes that have more than 5 or so incoming edges, we can see which Twitter users are followed by large numbers of the hashtaggers but have not been using the hashtag themselves. Because the friends/followers lists return Twitter numerical IDs, we have to do a lookup on Twitter to find out the actual Twitter usernames. This is something I need to automate, maybe using the Twitter lookup API call that lets authenticated users look up the details of up to 100 Twitter users at a time given their numerical IDs. (If anyone wants this data from my snapshot of 23/8/10, drop me a line….)
Okay, that’s more than enough for now… As I’ve shared the gdf and gexf files for the #ddj internal hashtaggers network, if any more graphically talented than I individuals would like to see what sort of views they can come up with, either using Gephi or any other tool that accepts those data formats, I’d love to see them:-)
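As a starting point for that automation, here is a rough sketch of the filter-then-lookup idea. It is not tied to any particular Twitter client: lookup is a hypothetical callable that takes a list of at most 100 IDs and returns a dict mapping each ID to a username, and would need to wrap whatever authenticated lookup call you actually use. The function names and sample IDs are likewise invented.

```python
from collections import Counter


def popular_outsiders(outer_edges, min_followers=5):
    """Return IDs followed by more than min_followers hashtaggers."""
    counts = Counter(friend for _, friend in outer_edges)
    return [uid for uid, n in counts.items() if n > min_followers]


def batches(ids, size=100):
    """Split the IDs into lists of at most `size` (the lookup limit)."""
    ids = list(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def resolve_names(ids, lookup):
    """Call the (hypothetical) lookup once per batch and merge the results."""
    names = {}
    for batch in batches(ids):
        names.update(lookup(batch))
    return names


# Invented data: six hashtaggers follow user 42, only one follows user 7.
outer_edges = [(f"h{i}", "42") for i in range(6)] + [("h0", "7")]
popular = popular_outsiders(outer_edges)


def fake_lookup(batch):
    # Stand-in for a real API call; fabricates a name for each ID.
    return {uid: f"user_{uid}" for uid in batch}


print(resolve_names(popular, fake_lookup))
```

Passing the lookup in as a function keeps the batching logic testable offline, and makes it easier to add rate-limit handling around the real call later.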
PS It also strikes me that having got the list of hashtaggers, I need to package up this with a set of tools that would let you:
- create a Twitter list around a set of hashtaggers (and maybe then use that to generate a custom search engine over the hashtaggers’ linked to homepages);
- find other hashtags being used by the hashtaggers (that is, hashtags they may be using in arbitrary tweets).