What is the Potential Audience Size for a Hashtag Community?

What’s the potential audience size around, or ‘reach’ associated with, a Twitter hashtag?

Way back when, in the early days of web stats, reported figures tended to centre around the notion of hits, the number of calls made to a server via website activity. I forget the details, but the metric was presumably generated from server logs. This measure was always totally unreliable, because in the course of serving a web page, a server might be hit multiple times, once for each separately delivered asset, such as images, javascript files, css files and so on. Hits soon gave way to the notion of Page Views, which more accurately measured the number of pages (rather than assets) served via a website. This was complemented by the notions of Visits and Unique Visits: Visits, as tracked by cookies, represent a set of pages viewed around about the same time by the same person. Unique Visits (or “Uniques”) represent the number of different people who appear to have visited the site in any given period.

What we see here, then, is a steady evolution in the complexity of website metrics that reflects on the one hand dissatisfaction with one way of measuring or reporting activity, and on the other practical considerations with respect to instrumentation and the ability to capture certain metrics once they are conceived of.

Widespread social media monitoring/tracking is largely still in the realm of “hits” measurement. Personal dashboards for services such as Twitter typically display direct measures provided by the Twitter API, or measures trivially/directly identified from Twitter API or archived data – number of followers, numbers of friends, distribution of updates over time, number of mentions, and so on.

Something both myself and Martin Hawksey have been thinking about on and off for some time are ways of reporting activity around Twitter hashtags. A commonly(?!) asked question in this respect relates to how much engagement (whatever that means) there has been with a particular tag. So here’s a quick mark in the sand about some of my current thinking about this. (Note that these ideas may well have been more formally developed in the academic literature – I’m a bit behind in my reading! If you know something that covers this in more detail, or that I should cite, please feel free to add a link in the comments… #lazyAcademic.)

One of the first metrics that comes to my mind is the number of people who have used a particular hashtag, and the number of their followers. Easily stated, it doesn’t take a lot of thought to realise even these “simple” measures are fraught with difficulty:

  • what counts as a use of the hashtag? If I retweet a tweet of yours that contains a hashtag, have I used it in any meaningful sense? Does a “use” mean the creation of a new tweet containing the tag? What about if I reply to a tweet from you that contains the tag and I include the tag in my reply to you, even if I’m not sure what that tag relates to?
  • the potential audience size for the tag (potential uniques?), based on the number of followers of the tag users. At first glance, we might think this can be easily calculated by adding together the follower counts of the tag users, but this is more strictly an approximation of the potential audience: the set of followers of A may include some of the followers of B, or C; do we count the tag users themselves amongst the audience? If so, the upper bound also needs to take into account the fact that none of the users may be followers of any of the other tag users.
    Note there is also a lower bound – the largest follower count amongst the tag users (whatever that means…) of the hashtag. Furthermore, if we want to count the number of folk not using the tag but who may have seen the tag, this lower bound can be revised downwards by subtracting the number of tag users minus one (for the tag user with the largest follower count). The value is still only an approximation, though, because it assumes that each of the tag users is actually a follower of at least one of the other tag users. (If you think these points are “just academic”, they are and they aren’t – observations like these can often be used to help formulate gaming strategies around metrics based on these measures.)
  • the potential number of views of a tag, for example based on the product of the number of times a user tweets and their follower count?
  • the reach of (or active engagement with?) the tag, as measured by the number of people who actually see the tag, or the number of people who take an action around it (such as replying to a tagged tweet, RTing it, or clicking on a link a tagged tweet contains); note that we may be able to construct probabilistic models (albeit quite involved ones) of the potential reach based on factors like the number of people someone follows, when they are online, the rate at which the people they follow tweet, and so on.
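The upper and lower bound arithmetic sketched above is easy to play with using sets. Here’s a toy illustration (the follower sets and user names are entirely made up, purely for the sake of the example):

```python
# Illustrative only: made-up follower sets for three hypothetical tag users.
followers = {
    'A': {1, 2, 3, 4, 5},
    'B': {4, 5, 6},
    'C': {5, 6, 7, 8},
}

# Crude upper bound: sum the follower counts (double counts shared followers)
upper = sum(len(f) for f in followers.values())

# Actual unique potential audience: the union of the follower sets
unique_audience = set().union(*followers.values())

# Lower bound: the largest single follower count amongst the tag users
lower = max(len(f) for f in followers.values())

print(upper, len(unique_audience), lower)
```

The crude sum comes out at 12, but the union shows only 8 unique potential audience members, with the single biggest follower count (5) giving the lower bound.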

To try to make this a little more concrete, here are a couple of scripts for exploring the potential audience size of a tag based on the followers of the tag users (where a user is someone who publishes or retweets a tweet containing the tag over a specified period). The first, Python script runs a Twitter search and generates a list of unique users of the tag, along with the timestamp of their first use of the tag within the sample period. This script also grabs all the followers of the tag users, along with their counts, and generates a running cumulative (upper bound approximation) count of the tag user follower numbers, as well as calculating the rolling set of unique followers to date as each new tag user is observed. The second, R script plots the values.

The first thing we can do is look at the incidence of new users of the hashtag over time:

(For a little more discussion of this sort of chart, see Visualising Activity Around a Twitter Hashtag or Search Term Using R and its inspiration, @mediaczar’s How should Page Admins deal with Flame Wars?.)

More relevant to this post, however, is a plot showing some counts relating to followers of users of the hashtag:

In this case, the top, green line represents the summed total number of followers for tag users as they enter the conversation. If every user had completely different followers, this might be meaningful, but where conversation takes place around a tag between folk who know each other, it’s highly likely that they have followers in common.

The middle, red line shows a count of the number of unique followers to date, based on the followers of users of the tag to date.

The lower, blue line shows the difference between the red and green lines. This represents the error between the summed follower counts and the actual number of unique followers.

Here’s a view over the number of new unique potential audience members at each time step (I think the use of the line chart here may be a mistake… I think bars/lineranges would probably be more appropriate…):

In the following chart, I overplot one line with another. The lower layer (a red line) is the total follower count for each new tag user. The blue is the increase in the potential audience count (that is, the number of the new user’s followers who haven’t potentially seen the tag so far). The range of the visible part of the red line thus shows the number of a new tag user’s followers who have potentially already seen the tag. Err… maybe (that is, if my code is correct and all the scripts are doing what I think they’re doing! If they aren’t, then just treat this post as an exploration of the sorts of charts we might be able to produce to explore audience reach;-)
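The per-user delta calculation behind that overplotted chart boils down to a running union. Again by way of a toy sketch (made-up follower sets, users taken in order of first use of the tag):

```python
# Illustrative only: walk through hypothetical tag users in order of first use,
# recording how many of each new user's followers are new potential audience
# members, and how many have potentially seen the tag already.
followers_of = [
    ('A', {1, 2, 3, 4, 5}),
    ('B', {4, 5, 6}),
    ('C', {5, 6, 7, 8}),
]

seen = set()
deltas = []
for user, fo in followers_of:
    new = fo - seen                # followers who haven't potentially seen the tag yet
    overlap = len(fo) - len(new)   # followers who potentially have seen it already
    deltas.append((user, len(fo), len(new), overlap))
    seen |= fo

print(deltas)
```

So user B brings three followers but only one new potential audience member; the overlap column is exactly the visible part of the red line described above.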

Here are the scripts (such as they are!)

import newt, csv, tweepy #newt is my own rambling helper library
import networkx as nx

tag = 'ddj' #the term we're going to search for
num = 1500 #how many tweets to search for (max 1500)

##Something along the lines of:
cachetime = 3600 #eg cache Twitter API responses for an hour
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(SKEY, SSECRET)
api = tweepy.API(auth, cache=tweepy.FileCache('cache',cachetime), retry_errors=[500], retry_delay=5, retry_count=2)

#You need to do some work here to search the Twitter API
tweeters, tweets = yourSearchTwitterFunction(api, tag, num)
#tweeters is a list of folk who tweeted the term of interest
#tweets is a list of the Twitter tweet objects returned from the search
#My code for this is tightly bound up in a large and rambling library atm...

#Put tweets into chronological order
tweets.sort(key=lambda tweet: tweet['created_at'])

#I was being lazy and wasn't sure what vars I needed or what I was trying to do when I started this!
#The whole thing really needs rewriting...
#runtot is crude and doesn't measure overlap
runtot = 0
tweepFo = {} #follower ID lists, keyed by tag user
seenToDate = set() #rolling union of unique followers seen so far
uniqSourceFo = [] #report rows: timestamp, unique count, user, crude total, follower count, previously seen count

#Construct a digraph from folk using the tag to their followers
DG = nx.DiGraph()

for tweet in tweets:
	user = tweet['from_user']
	if user not in tweepFo:
		print "Getting follower data for", str(user), str(len(tweepFo)), 'of', str(len(tweeters))
		userID = tweet['from_user_id'] #check
		mi = api.followers_ids(userID) #follower IDs (cursoring left as an exercise)
		tweepFo[user] = mi
		runtot = runtot + len(mi)
		for m in mi:
			#construct graph
			#I'm weighting the edges so we can count how many times folk see the hashtag
			if DG.has_edge(userID, m):
				DG[userID][m]['weight'] += 1
			else:
				DG.add_edge(userID, m, weight=1)
		#seen to date is all people who have seen so far, plus new ones, so it's the union
		previouslySeen = len(seenToDate)
		seenToDate = seenToDate.union(mi)
		uniqSourceFo.append((tweet['created_at'], len(seenToDate), user, runtot, len(mi), previouslySeen))

for ts, l, u, ct, ufc, ols in uniqSourceFo:
	print ts, l


print "Writing graph.."
filter = []
for n in DG:
	if DG.degree(n) > 1: filter.append(n)
H = DG.subgraph(filter)
nx.write_graphml(H, 'reports/tmp/'+tag+'_ncount_2up.graphml')
print "Writing other graph.."
nx.write_graphml(DG, 'reports/tmp/'+tag+'_ncount.graphml')

Here’s the R script…

ddj_ncount <- read.csv("~/code/twapps/newt/reports/tmp/ddj_ncount.csv")
#Convert the datetime string to a time object
ddj_ncount$ttime=as.POSIXct(strptime(ddj_ncount$datetime, "%a, %d %b %Y %H:%M:%S"),tz='UTC')

#Order the newuser factor levels into the order in which they first use the tag
ddj_ncount$newuser=factor(ddj_ncount$newuser, levels = unique(as.character(ddj_ncount$newuser)))

#Plot when each user first used the tag against time
ggplot(ddj_ncount) + geom_point(aes(x=ttime,y=newuser)) + opts(axis.text.x=theme_text(size=6),axis.text.y=theme_text(size=4))

#Plot the cumulative and union flavours of increasing possible audience size, as well as the difference between them
ggplot(ddj_ncount) + geom_line(aes(x=ttime,y=count,col='Unique followers')) + geom_line(aes(x=ttime,y=crudetot,col='Cumulative followers')) + geom_line(aes(x=ttime,y=crudetot-count,col='Repeated followers')) + labs(colour='Type') + xlab(NULL)

#Number of new unique followers introduced at each time step
ggplot(ddj_ncount)+geom_line(aes(x=ttime,y=count-previousCount,col='Actual delta'))

#Try to get some idea of how many of the followers of a new user are actually new potential audience members
ggplot(ddj_ncount) + opts(axis.text.x=theme_text(angle=-90,size=4)) + geom_linerange(aes(x=newuser,ymin=0,ymax=userFoCount,col='Follower count')) + geom_linerange(aes(x=newuser,ymin=0,ymax=(count-previousCount),col='Actual new audience'))

#This is still a bit experimental
#I'm playing around trying to see what proportion or number of a users followers are new to, or subsumed by, the potential audience of the tag to date...
ggplot(ddj_ncount) + geom_linerange(aes(x=newuser,ymin=0,ymax=1-(count-previousCount)/userFoCount)) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)

In the next couple of posts in this series, I’ll start to describe how we can chart the potential increase in audience count as a delta for each new tagger, along with a couple of ways of trying to get some initial sort of sense out of the graph file, such as the distribution of the potential number of “views” of a tag across the unique potential audience members…

PS See also the follow on post More Thoughts on Potential Audience Metrics for Hashtag Communities

Do Retweeters Lack Commitment to a Hashtag?

I seem to be going down more ratholes than usual at the moment, in this case relating to activity round Twitter hashtags. Here’s a quick bit of reflection around a chart from Visualising Activity Around a Twitter Hashtag or Search Term Using R that shows activity around a hashtag that was minted for an event that took place before the sample period.

The y-axis is organised according to the time of first use (within the sample period) of the tag by a particular user. The x axis is time. The dots represent tweets containing the hashtag, coloured blue by default, red if they are an old-style RT (i.e. they begin RT @username:).

So what sorts of thing might we look for in this chart, and what are the problems with it? Several things jump out at me:

  • For many of the users, their first tweet (in this sample period at least) is an RT; that is, they are brought into the hashtag community through issuing an RT;
  • Many of the users whose first use is via an RT don’t use the hashtag again within the sample period. Is this typical? Does this signal represent amplification of the tag without any real sense of engagement with it?
  • A noticeable proportion of folk whose first use is not an RT go on to post further non-RT tweets. Does this represent an ongoing commitment to the tag? Note that this chart does not show whether tweets are replies or “open” tweets. Replies (that is, tweets beginning @username) are likely to represent conversational threads within a tag context rather than “general” tag usage, so it would be worth using an additional colour to identify reply based conversational tweets as such.
  • “New style” retweets are not displayed as retweets by the colouring… I need to check whether or not new-style RT information is available that I could use to colour such tweets appropriately. (Or alternatively, I’d have to do some sort of string matching to see whether or not a tweet was the same as a previously seen tweet, which is a bit of a pain:-(
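The classification being described above can be sketched with a couple of regular expressions. This is rough and ready (real tweet text is messier, and new-style RTs are invisible to it, as noted), but it captures the old-style RT vs reply vs original distinction:

```python
import re

# Rough tweet classifier: old-style RTs begin "RT @username:", replies begin
# "@username"; everything else is treated as an "original" tweet. A sketch
# only - it won't catch new-style (button) retweets.
def tweet_type(text):
    if re.match(r'RT @\w+', text):
        return 'RT'
    if re.match(r'@\w+', text):
        return 'reply'
    return 'original'

print(tweet_type('RT @example: worth a read #tag'))
print(tweet_type('@example quite agree #tag'))
print(tweet_type('Some new thought #tag'))
```

The `@example` screen names are placeholders; a third colour keyed off the `'reply'` class would give the reply highlighting mooted above.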

(Note that when I started mapping hashtag communities, I used to generate tag user names based on a filtered list of tweets that excluded RTs. This meant that folk who only used the tag as part of an RT, and did not originate tweets that contained the tag, either in general or as part of a conversation, would not be counted as members of the hashtag community. More recently, I have added filters that include RTs but exclude users who used the tag only once, for example, thus retaining serial RTers, but not single use users.)

So what else might this chart tell us? Looking at vertical slices, it seems that new entrants to the tag community appear to come in waves, maybe as part of rapid fire RT bursts. This chart doesn’t tell us for sure that this is happening, but it does highlight areas of the timeline that might be worth investigating more closely if we are interested in what happened at those times when there does appear to be a spike in activity. (Are there any modifications we could make to this chart to make it more informative in this respect? The time resolution is very poor, for example, so being able to zoom in on a particular time might be handy. Or are there other charts that might provide a different lens that can help us see what was happening at those times?)

And as a final point – this stuff may be all very interesting, but is it useful? And if so, how? I also wonder how generalisable it is to other sorts of communication analysis. For example, I think we could use similar graphical techniques to explore engagement with an active comment thread on a blog, or Google+, or additions to an online forum thread. (For forums with multiple threads, we maybe need to rethink how this sort of chart would work, or how it might be coloured/what symbols we might use, to distinguish between starting a new thread and adding to a pre-existing one, for example. I’m sure the literature is filled with dozens of examples of how we might visualise forum activity, so if you know of any good references/links…?! ;-) #lazyacademic)

Generating Twitter Wordclouds in R (Prompted by an Open Learning Blogpost)

A couple of weeks ago I saw a great example of an open learning blogpost from @katy_bird: Generating a word cloud (or not) from a Twitter hashtag. It described the trials and tribulations associated with trying to satisfy a request for the generation of a wordcloud based on tweets associated with a specific Twitter hashtag. A seemingly simple task, you might think, but things are never that easy… If you read the post, you’ll see Katy identified several problems, or stumbling blocks, along the way, as well as how she addressed them. There’s also a bit of reflection on the process as a whole.

Reading the post the first time (and again, just now) completely set me up for the day. It had a little bit of everything: a goal statement, the identification of a set of problems associated with trying to complete the task, some commentary on how the problems were tackled, and some reflection on the process as a whole. The post thus serves the purpose of capturing a problem discovery process, as well as the steps taken to try to solve each problem (although full documentation is lacking… this is something I have learned over the years: to use something like a gist on github to keep a copy of any code I generate to solve a problem, linked to for reuse by myself and others from the associated blog post). The post captures a glimpse back at a moment in time – when Katy didn’t know how to generate a wordcloud – from the joyful moment at which she has just learned how to generate said wordcloud. More importantly, the post describes the learning problems that became evident whilst trying to achieve the goal, in such a way that they can act as hooks on which others can hang alternative or additional ways of solving the problem, or act as mentors.

By identifying the learning journey and problems discovered along the way, Katy’s record of her learning strategy also provides an authentic, learner centric perspective on what’s involved in trying to create a wordcloud around a twitter hashtag.

Reading the post again has also prompted me to blog this recipe, largely copied from the RDataMining post Using Text Mining to Find Out What @RDataMining Tweets are About, for generating a word cloud around a twitter hashtag using R (I use RStudio; the recipe requires at least the twitteR and tm libraries):

#Grab the tweets (searchTerm is the tag or phrase you're interested in, eg '#lak12')
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df <- twListToDF(rdmTweets)

##Note: there are some handy, basic Twitter related functions here:
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))

##Wordcloud - scripts available from various sources; I used:

#Load the textmining and wordcloud libraries
library(tm)
library(wordcloud)

#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus= function(df,my.stopwords=c()){
  #The following is cribbed and seems to do what it says on the can
  tw.corpus= Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus = tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
  tw.corpus
}

#Generate a wordcloud from a corpus
#Call with eg: wordcloud.generate(tw.c, min.freq=3)
wordcloud.generate=function(corpus, min.freq=3){
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc=wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}


##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets), min.freq=3)
dev.off()

#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber=function(searchTerm, num=500){
  rdmTweets = searchTwitter(searchTerm, n=num)
  tw.df = twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
#tweets = tweets.grabber('#lak12')
#wordcloud.generate(generateCorpus(tweets))

Here’s the result:

PS for an earlier, was broken, now patched, route to sketching a wordcloud from a twitter search using Wordle, see How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps.

Doodling With a Conversation, or Retweet, Data Sketch Around LAK12

How can we represent conversations between a small sample of users, such as the email or SMS conversations between James Murdoch’s political lobbyist and a Government minister’s special adviser (Leveson inquiry evidence), or the pattern of retweet activity around a couple of heavily retweeted individuals using a particular hashtag?

I spent a bit of time on-and-off today mulling over ways of representing this sort of interaction, in search of something like a UML call sequence diagram but not, and here’s what I came up with in the context of the retweet activity:

The chart looks a bit complicated at first, but there’s a lot of information in there. The small grey dots on their own are tweets that aren’t identified as RTs within a body of tweets obtained via a Twitter search around a particular hashtag (that is, they don’t start with a pattern something like RT @[^:]*:). The x-axis represents the time a tweet was sent and the y-axis who sent it. Paired dots connected by a vertical line segment show two people, one of whom (light grey point) retweeted the other (dark grey point). RTs of two notable individuals are highlighted using different colours. The small black dots highlight original tweets sent by the individuals we highlight in terms of how they are retweeted. Whilst we can’t tell which tweet was retweeted, we may get an idea of how the pattern of RT behaviour related to the individuals of interest plays out relative to when they actually tweeted.
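Identifying who was RTd from the tweet text, using the pattern mentioned above, boils down to a one-line regular expression. A quick Python sketch (the screen names are placeholders):

```python
import re

# Sketch of extracting who was retweeted from an old-style RT, using the
# "RT @username:" pattern; returns None for tweets that aren't old-style RTs.
def rt_of(text):
    m = re.match(r'RT @([^:\s]+):', text)
    return m.group(1) if m else None

print(rt_of('RT @alice: interesting slide deck #lak12'))
print(rt_of('Just a normal tweet #lak12'))
```

The extracted name is what populates the `rtof` column used in the R code below each tweet's sender, giving the two ends of each vertical RT line segment.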

Here’s the R-code used to build up the chart. Note that the order in which the layers are constructed is important (for example, we need the small black dots to be in the top layer).

##RT chart, constructed in R using ggplot2
#the base data set - exclude tweets that aren't RTs
g = ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))
#Add in vertical grey lines connecting who RT'd whom
g = g + geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')
#Use a more complete dataset to mark *all* tweets with a lightgrey point
g = g + geom_point(data=(tw.df),aes(x=created,y=screenName),colour='lightgrey')
#Use points at either end of the RT line segment to distinguish who RTd whom
g = g + geom_point(aes(x=created,y=screenName),colour='lightgrey') + geom_point(aes(x=created,y=rtof),colour='grey') + opts(axis.text.y=theme_text(size=5))
#We're going to highlight RTs of two particular individuals
#Define a couple of functions to subset the data
subdata.rtof=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & rtof==u)))
subdata.user=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & screenName==u)))
#Grab user 1 (user1 holds the screen name of the first person to highlight)
ss1=subdata.rtof(user1) #tweets that RT user 1
ss1x=subset(tw.df, subset=(screenName==user1)) #tweets sent by user 1
sc1='orange' #highlight colour for user 1
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss1,aes(x=created,ymin=screenName,ymax=rtof),colour=sc1)
#Grab user 2 (user2 holds the screen name of the second person to highlight)
ss2=subdata.rtof(user2) #tweets that RT user 2
ss2x=subset(tw.df, subset=(screenName==user2)) #tweets sent by user 2
sc2='aquamarine' #highlight colour for user 2
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss2,aes(x=created,ymin=screenName,ymax=rtof),colour=sc2)
#Now we add another layer to colour the nodes associated with RTs of the two selected users
g = g + geom_point(data=ss1,aes(x=created,y=rtof),colour=sc1) + geom_point(data=ss1,aes(x=created,y=screenName),colour=sc1)
g = g + geom_point(data=ss2,aes(x=created,y=rtof),colour=sc2) + geom_point(data=ss2,aes(x=created,y=screenName),colour=sc2)
#Finally, add a highlight to mark when the RTd folk we are highlighting actually tweet
g = g + geom_point(data=(ss1x),aes(x=created,y=screenName),colour='black',size=1)
g = g + geom_point(data=(ss2x),aes(x=created,y=screenName),colour='black',size=1)
#Print the chart
print(g)

One thing I’m not sure about is the order of names on the y-axis. That said, one advantage of using the conversational, exploratory data visualisation approach that I favour is that if you let your eyes try to seek out patterns, you may be able to pick up clues for some sort of model around the data that really emphasises those patterns. So for example, looking at the chart, I wonder if there would be any merit in organising the y-axis so that folk who RTd orange but not aquamarine were in the bottom third of the chart, folk who RTd aqua but not orange were in the top third, folk who RTd both orange and aqua were between the two users of interest, and folk who RTd neither were arranged closer to the edges, with folk who RTd each other close to each other (invoking an ink minimisation principle)?

Something else that it would be nice to do would be to use the time an original tweet was sent as the x-axis value for the tweet marker for the original sender of a tweet that is RTd. We would then get a visual indication of how quickly a tweet was RTd.

PS I also created a script that generated a wealth of other charts around the lak12 hashtag [PDF]. The code used to generate the report can be found as the file exampleSearchReport.Rnw in this code repository.

Twitter Volume Controls

With a steady stream of tweets coming out today containing local election results, @GuardianData (as @datastore was recently renamed) asked whether or not regular, stream swamping updates were in order:

A similar problem can occur when folk are livetweeting an event – for a short period, one or two users can dominate a stream with a steady outpouring of live tweets.

Whilst I’m happy to see the stream, I did wonder about how we could easily wrangle a volume control, so here are a handful of possible approaches:

  • Tweets starting @USER ... are only seen in the stream of people following both the sender of the tweet and @USER. So if @GuardianData set up another, non-tweeting, account, @GuardianDataBlitz, and sent election results to that account (“@GuardianDataBlitz Mayor referendum results summary: Bradford NO (55.13% on ), Manchester NO (53.24%), Coventry NO (63.58%), Nottingham NO (57.49%) #vote2012” for example), only @GuardianData followers following @GuardianDataBlitz would see the result. There are a couple of problems with this approach, of course: for one, @GuardianDataBlitz takes up too many characters (although that can be easily addressed), but more significantly it means that most followers of @GuardianData will miss out on the data stream. (They can’t all be expected to know about the full fat feed switch.)
  • For Twitter users using a Twitter client that supports global filtering of tweets across all streams within a client, we may be able to set up a filter to exclude tweets of the form (@GuardianData AND #vote2012). This is a high maintenance approach, though, and will lead to the global filter getting cluttered over time, or at least requiring maintenance.
  • The third approach – again targeted at folk who can set up global filters – is for @GuardianData to include a volume control in their tweets, eg Mayor referendum results summary: Bradford NO (55.13% on ), Manchester NO (53.24%), Coventry NO (63.58%), Nottingham NO (57.49%) #vote2012 #blitz. Now users can set a volume control by filtering out terms tagged #blitz. To remind people that they have a volume filter in place, @GuardianData could occasionally post blitz items with #inblitz to show folk who have the filter turned on what they’re missing? Downsides to this approach are that it pollutes the tweets with more confusing metadata and maybe confuses folk about what hashtag is actually being used.
  • A more generic approach might be to use a loudness indicator or channel that can be filtered against, so for example channel 11: ^11 or ^loud (reusing the ^ convention that is used to identify individuals tweeting on a team account)? Reminders to folk who may have a volume filter set could take the form ^on11 or ^onloud on some of the tweets? Semantic channels might also be defined: ^ER (Election Results), ^LT (Live Tweets) etc, again with occasional reminders to folk who’ve set filters (^onLT, etc, or “We’re tweeting local election results on the LT ^channel today”). Again, this is a bit of a hack that’s only likely to appeal to “advanced” users and does require them to take some action; I guess it depends whether the extra clutter is worth it?
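A client-side filter for the ^channel convention mooted above is only a few lines. Here’s a sketch (the sample tweets are made up), which drops tweets marked with a channel but lets the ^on… reminders through:

```python
import re

# Sketch of a client-side "volume control" filter: drop tweets marked with
# e.g. ^loud or ^11, but let ^onloud/^on11 reminders through.
def volume_filter(tweets, channel='loud'):
    marker = re.compile(r'\^' + re.escape(channel) + r'\b')
    return [t for t in tweets if not marker.search(t)]

stream = [
    'Mayoral result: Bradford NO #vote2012 ^loud',
    'Normal service tweet',
    'Tweeting election results all day on the loud channel ^onloud',
]
print(volume_filter(stream))
```

Because `^onloud` doesn’t match the `^loud` marker pattern, the reminder tweet survives the filter, which is the behaviour the reminder convention relies on.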

So – any other volume control approaches I’ve missed?

PS by the by, here’s a search query (just for @daveyp;-) that I’ve been using to try to track results as folk tweet them:

-RT (#atthecount OR #vote2012 OR #le2012) AND (gain OR held OR los OR hold) AND (con OR lib OR lab OR ukip)

I did wonder about trying to parse out ward names to try to automate the detection of possible results as they appeared in the stream, but opted to go to bed instead! It’s something I could imagine trying to work up on Datasift, though…
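A first pass at that sort of detector might just mirror the search query above as a couple of regular expressions. A sketch (note that "lost" here is my assumption – the query uses "los", presumably with lost/loss variants in mind – and the sample tweets are invented):

```python
import re

# Crude sketch of what the search query above is fishing for: a non-RT tweet
# containing a result word and a party abbreviation.
result_word = re.compile(r'\b(gain|held|hold|lost)\b', re.IGNORECASE)
party = re.compile(r'\b(con|lib|lab|ukip)\b', re.IGNORECASE)

def looks_like_result(tweet):
    return (not tweet.startswith('RT')
            and bool(result_word.search(tweet))
            and bool(party.search(tweet)))

print(looks_like_result('Northfield ward: Lab hold #vote2012'))
print(looks_like_result('RT @someone: Lab hold in Northfield #vote2012'))
```

Matching against a list of ward names would then be the next refinement, along the lines wondered about above.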

Interest Differencing: Folk Commonly Followed by Tweeting MPs of Different Parties

Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.

Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:

There’s still a fair bit to do making the methodology robust – for example, being able to cope with comparing folk commonly followed by different sets of users where the size of the sets differs to a significant extent (there is a large difference between the number of tweeting Conservative and LibDem MPs, for example). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe it hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or particular perspective, about the way different groups differentially follow folk on Twitter.

A couple of other quirks I’ve noticed about the comparison.cloud as currently defined. Firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear) – there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud. (So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman – and @davegorman appeared as a small label in the friends part of the comparison.cloud, notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do…) What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed because it would be too large).

That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). One way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets. Suggestions welcome as to what they might be…:-)

PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter-based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…

Sketches Around Twitter Followers

I’ve been doodling… Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I’d have a quick look at some stats around follower groupings…

I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the “location”, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).
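As a rough sketch of the sort of summary record each user object boils down to (this assumes the user details have already been grabbed as Python dicts shaped like the Twitter v1.1 users payload – the field names come from that API, but treat the helper itself as illustrative rather than the actual grab code):

```python
from datetime import datetime, timezone

def summarise_user(user, now=None):
    """Reduce a Twitter-style user dict to the summary fields of interest."""
    now = now or datetime.now(timezone.utc)
    # created_at in the v1.1 API looks like "Mon Jan 01 00:00:00 +0000 2007"
    created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    days = max((now - created).days, 1)  # guard against brand-new accounts
    return {
        "screen_name": user["screen_name"],
        "followers": user["followers_count"],
        "friends": user["friends_count"],
        "statuses": user["statuses_count"],
        "days_on_twitter": days,
        "protected": user.get("protected", False),
        "time_zone": user.get("time_zone"),
    }
```

The `days_on_twitter` field is what the per-day ratios further down are computed against.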

So what can we do with that data? Simple scatter plots, for one thing – here’s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:


We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in…:


We can also calculate ratios and then plot them – followers per day (the number of followers divided by the number of days since the account was registered, for example) vs the followers per status update (to try to get a feeling of how the number of followers relates to the number of tweets):
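The ratios themselves are trivial to compute from the summary fields; a quick sketch (the guards against zero denominators are my own addition, for brand-new or silent accounts):

```python
def follower_ratios(followers, statuses, days_on_twitter):
    """Followers gained per day on Twitter, and followers per status update."""
    per_day = followers / max(days_on_twitter, 1)
    per_status = followers / max(statuses, 1)
    return per_day, per_status

# e.g. an account with 5000 followers, 2500 tweets, 1000 days old:
# on average, 5 new followers a day and 2 followers per tweet
```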


This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a ‘change of use’.

As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof – I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.
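The sampling rule is straightforward; something along these lines (a sketch, not the actual grab code – the real thing also has to page through the Twitter API to fetch the IDs in the first place):

```python
import random

def sample_followers(follower_ids, cap=10000, seed=None):
    """Return all follower IDs if there are no more than `cap` of them,
    otherwise a random sample of `cap` IDs."""
    follower_ids = list(follower_ids)
    if len(follower_ids) <= cap:
        return follower_ids
    return random.Random(seed).sample(follower_ids, cap)
```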

So for example, things like this – a scatterplot of friends/follower counts similar to the one above:


…sort of. There’s something obviously odd about that graph, isn’t there? The “step up” at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.

How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?


Alternatively, we can do counts by number of days the followers have been on Twitter:


The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:


Sometimes, you get a distribution that is very, very wrong… Suppose we build a histogram whose x-axis bins correspond to ranges of follower counts – a follower having 0-100 followers of their own, or 500-600 followers, etc. For each follower of a particular account, we pop them into the bin corresponding to their own follower count, then count the number of people in each bin once everyone has been allocated. We might normally expect to see something like this:

normally log followers

However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:

a dodgy follower distribution

There’s obviously something not quite, erm, normal(?!) about this account (at least, at the time I grabbed the data, there was something not quite normal etc etc…).
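The binning step described above can be sketched as follows (fixed-width bins keyed by their lower edge; the bin width is a parameter):

```python
from collections import Counter

def bin_follower_counts(counts, width=100):
    """Histogram counts: allocate each follower to a fixed-width bin
    keyed by their own follower count, then tally each bin."""
    bins = Counter((c // width) * width for c in counts)
    return dict(sorted(bins.items()))
```

A disproportionately tall 0 bin relative to the rest of the distribution is the tell-tale sign of the skew shown above.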

When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list – for example, the median number of followers, or various ratios (e.g. the mean of the friend/follower ratios across each person’s followers). Lots of possible stats – but which ones does it make sense to look at?

Here’s one… a plot of the median followers per status ratio versus the median friend/follower ratio:


Spot the outlier ;-)

So that’s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I’ve also been doing for some time:


It’s just a shame that my whitelisted Twitter API key is probably going to die in a few weeks:-(

[In the next post in this series I’ll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable “events” surrounding the person being followed…]