OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘Twitter’

Estimated Follower Accession Charts for Twitter

Just over a year or so ago, Mat Morrison/@mediaczar introduced me to a visualisation he’d been working on (How should Page Admins deal with Flame Wars?) that I started to refer to as an accession chart (Visualising Activity Around a Twitter Hashtag or Search Term Using R). The idea is that we provide each entrant into a conversation or group with an accession number: the first person has accession number 1, the second person accession number 2 and so on. The accession number is plotted in rank order on the vertical y-axis, with ranked/time ordered “events” along the horizontal x-axis: utterances in a conversation for example, or posts to a forum.

A couple of months ago, I wondered whether this approach might also be used to estimate when folk started following an individual on Twitter. My reasoning went something like this:

One of the things I think is true of the Twitter API call for the followers of an account is that it returns lists of followers in reverse accession order. So the person who followed an account most recently will be at the top of the list (the first to be returned) and the person who followed first will be at the end of the list. Unfortunately, we don’t know when followers started following, so it’s hard to spot bursty growth in the number of followers of an account. However, it struck me that we may be able to get a bound on this by looking at the dates at which followers joined Twitter, along with their ‘accession order’ as followers of an account. If we get the list of followers and reverse it, and assume that this gives an ordered list of followers (with the follower that started following the longest time ago first), we can then work through this list and keep track of the most recent ‘created_at’ date seen so far. This gives us a bound (the earliest possible date) for when followers that far through the list started following. (You can’t start following until you join Twitter…)

So for example, if followers A, B, C, D in that accession order (ie started following target in that order) have user account creation dates 31/12/09, 1/1/09, 15/6/12, 5/5/10 then:
- A started following no earlier than 31/12/09 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
- B started following no earlier than 31/12/09 (because they started following after A)
- C started following no earlier than 15/6/12 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
- D started following no earlier than 15/6/12 (because they started following after C, which gave us the most recent creation date seen so far)
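As a quick check on that reasoning, here is a minimal sketch in base R that reproduces the A, B, C, D example as a running maximum over the account creation dates. The data frame and its column names are made up for illustration.

```r
# Followers in accession order (A followed first), with the dates
# on which they joined Twitter (from the worked example above)
followers <- data.frame(
  name = c("A", "B", "C", "D"),
  created_at = as.Date(c("2009-12-31", "2009-01-01", "2012-06-15", "2010-05-05"))
)

# Running maximum of the creation dates: the earliest date on which
# each follower could have started following the target account
followers$earliest_follow <- as.Date(
  cummax(as.numeric(followers$created_at)),
  origin = "1970-01-01"
)

print(followers)
```

A and B both get 31/12/09 as their bound, and C and D both get 15/6/12, matching the list above.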

That’s probably confused you enough, so here’s a chart – accession number is along the bottom (i.e. the x-axis), joining date (in days ago) is on the y-axis:

recencyVacc

NOTE: this diverges from the accession graph described above, where accession number goes on the y-axis and rank ordered event along the x-axis.

What the chart shows is an estimate (the red line) of how many days ago a follower with a particular accession number started to follow a particular Twitter account.

As described in Sketches Around Twitter Followers, we see a clear break at 1500 days ago when Twitter started to get popular. This approach also suggests a technique for creating “follower probes” that we can use to date a follower record: if you know which day a particular user followed a target account, you can use that follower to put a datestamp into the follower record (assuming the Twitter API returned followers in reverse accession order).
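The "follower probe" idea might be sketched as follows, assuming we already have an estimated recency (in days ago) for each accession number and that we happen to know the day on which one particular follower started following. All the values, and the probe itself, are invented for illustration.

```r
# Accession numbers with estimated "days ago" bounds (illustrative values only)
est <- data.frame(acc = 1:6, recency = c(900, 900, 700, 700, 700, 300))

# Hypothetical probe: follower number 4 is known to have followed 650 days ago
probe_acc <- 4
probe_days <- 650

# Anyone who acceded at or after the probe cannot have followed longer ago
# than the probe did, so cap their bound at the probe's known follow time
est$recency_probed <- ifelse(est$acc >= probe_acc,
                             pmin(est$recency, probe_days),
                             est$recency)

# Anyone who acceded at or before the probe must have followed at least
# that long ago
est$at_least_days_ago <- ifelse(est$acc <= probe_acc, probe_days, NA)

print(est)
```

With several probes spread through the follower list, the same trick would pin the estimate down from both sides.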

Here’s an example of the code I used based on Twitter follower data grabbed for @ChrisPincher (whose follower profile appeared to be out of sorts from the analysis sketched in Visualising Activity Around a Twitter Hashtag or Search Term Using R). I’ve corrected the x/y axis ordering so follower accession number is now the vertical, y-component.

require(ggplot2)

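processUserData = function(data) {
    # Parse the account creation timestamps
    data$tz = as.POSIXct(data$created_at)
    # Age of each follower's account, in whole days before now
    data$days = as.integer(difftime(Sys.time(), data$tz, units = "days"))
    # The API lists the most recent followers first, so reverse to get accession order
    data = data[rev(rownames(data)), ]
    data$acc = seq_len(nrow(data))
    # Running minimum of account age, i.e. the most recent creation date seen so far
    data$recency = cummin(data$days)

    data
}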

mp_cp <- read.csv("~/code/MPs/ChrisPincher_fo_0__2013-02-16-01-29-28.csv", row.names = NULL)

ggplot(processUserData(mp_cp)) +
    geom_point(aes(x = -days, y = acc), size = 0.4) +
    geom_point(aes(x = -recency, y = acc), col = "red", size = 1) +
    xlim(-2000, 0)

Here's @ChrisPincher's chart:

cp_demo

The black dots reveal how many days ago a particular follower joined Twitter. The red line is the estimate of when a particular follower started following the account, estimated based on the most recently created account seen to date amongst the previously acceded followers.

We see steady growth in follower numbers to start with, and then the account appears to have been spam followed? (Can you spot when?!;-) The clumping of creation dates of accounts during the attack also suggests they were created programmatically.

[In the "next" in this series of posts [What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events], I'll show how spikes in follower acquisition on a particular day can often be used to "detect" historical news events.]

PS after coming up with this recipe, I did a little bit of "scholarly research" and I learned that a similar approach for estimating Twitter follower acquisition times had already been described at least once, at the opening of this paper: We Know Who You Followed Last Summer: Inferring Social Link Creation Times In Twitter – “We estimate the edge creation time for any follower of a celebrity by positing that it is equal to the greatest lower bound that can be deduced from the edge orderings and follower creation times for that celebrity”.

Written by Tony Hirst

April 5, 2013 at 10:31 am

Posted in Rstats

Sketches Around Twitter Followers

I’ve been doodling… Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I’d have a quick look at some stats around follower groupings…

I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the “location”, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).

So what can we do with that data? Simple scatter plots, for one thing – here’s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:

ukMPS_frfo_scatter

We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in…:

ukMPS_fo_dist

We can also calculate ratios and then plot them – followers per day (the number of followers divided by the number of days since the account was registered, for example) vs the followers per status update (to try to get a feeling of how the number of followers relates to the number of tweets):
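A minimal sketch of how those two ratios might be calculated, assuming a data frame of user details with columns named after the Twitter API user fields (followers_count, statuses_count, created_at). The accounts and the grab date are made up.

```r
# Illustrative user details (not real accounts)
users <- data.frame(
  screen_name = c("mp_one", "mp_two", "mp_three"),
  followers_count = c(12000, 800, 45000),
  statuses_count = c(3000, 40, 150),
  created_at = as.Date(c("2009-03-01", "2012-11-20", "2010-06-15"))
)

# Assumed date on which the data was grabbed
grab_date <- as.Date("2013-02-16")

# Days since the account was registered
users$days <- as.numeric(grab_date - users$created_at)

# Followers per day, and followers per status update
# (pmax guards against dividing by zero for accounts that have never tweeted)
users$fo_per_day <- users$followers_count / users$days
users$fo_per_status <- users$followers_count / pmax(users$statuses_count, 1)

print(users[, c("screen_name", "fo_per_day", "fo_per_status")])
```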

ukMPs_foday_fost

This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a ‘change of use’.

As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof – I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.

So for example, things like this – a scatterplot of friends/follower counts similar to the one above:

friendsfollowers

…sort of. There’s something obviously odd about that graph, isn’t there? The “step up” at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.

How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?

foperday_days

Alternatively, we can do counts by number of days the followers have been on Twitter:

Rplot

The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:

gtrends

Sometimes, you get a distribution that is very, very wrong… If we do a histogram that has bins along the x-axis specifying that a follower had 0-100 followers of their own, or 500-600 followers etc, and then for all the followers of a particular account, pop them into a corresponding bin given the number of their followers, counting the number of people in each bin once we have allocated them all, we might normally expect to see something like this:

normally log followers

However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:

a dodgy follower distribution

There’s obviously something not quite, erm, normal(?!) about this account (at least, at the time I grabbed the data, there was something not quite normal etc etc…).
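The binning step described above can be sketched in base R with cut() and table(). The follower counts here are invented, and the 100-wide bins are just one possible choice.

```r
# Number of followers that each follower of the target account has (made-up values)
followers_of_followers <- c(0, 3, 42, 87, 150, 510, 560, 99, 100, 2500)

# Bins of width 100: [0,100), [100,200), and so on
breaks <- seq(0, 3000, by = 100)
bins <- cut(followers_of_followers, breaks = breaks, right = FALSE, dig.lab = 4)

# Count how many followers fall into each bin
counts <- table(bins)
print(counts[counts > 0])

# One crude warning sign: the share of followers with very few followers of their own
share_small <- mean(followers_of_followers < 100)
print(share_small)
```

A very high share in the lowest bin is the sort of skew shown in the second chart, though where to draw the line is a judgement call.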

When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list – for example, the median number of followers, or different ratios (eg mean of the friend/follower ratios for each follower). Lots of possible stats – but which ones does it make sense to look at?
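As a rough sketch of that sort of roll-up, assuming one row per (target account, follower) pair with the follower's own friend and follower counts. The column names and values are hypothetical.

```r
# One row per follower of each target account (illustrative values)
fo <- data.frame(
  target = c("a", "a", "a", "b", "b", "b"),
  followers_count = c(10, 200, 50, 0, 1, 2),
  friends_count = c(100, 150, 50, 500, 400, 2000)
)

# Friend/follower ratio for each follower (guarding against zero followers)
fo$fr_fo_ratio <- fo$friends_count / pmax(fo$followers_count, 1)

# Summary statistics over the followers of each target account
median_followers <- tapply(fo$followers_count, fo$target, median)
mean_fr_fo <- tapply(fo$fr_fo_ratio, fo$target, mean)

print(median_followers)
print(mean_fr_fo)
```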

Here’s one… a plot of the median followers per status ratio versus the median friend/follower ratio:

fostvfrfo

Spot the outlier ;-)

So that’s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I’ve also been doing for some time:

davidevennett

It’s just a shame that my whitelisted Twitter API key is probably going to die in a few weeks:-(

[In the next post in this series I'll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable "events" surrounding the person being followed...]

Written by Tony Hirst

February 19, 2013 at 2:09 pm

Posted in Infoskills, Rstats

Interest Differencing: Folk Commonly Followed by Tweeting MPs of Different Parties

Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.

Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:

There’s still a fair bit to do making the methodology robust: for example, being able to cope with comparing folk commonly followed by different sets of users where the size of the set differs to a significant extent (there is a large difference between the number of tweeting Conservative and LibDem MPs, for instance). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there's some element of randomness in there. I guess this just adds to the "sketchy" nature of the visualisation; or maybe hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the "truthiness" of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or particular perspective about the way different groups differentially follow folk on Twitter.

A couple of other quirks I've noticed about the comparison.cloud as currently defined: firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets - the people that appear in the commonality cloud - tend not to appear) - there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don't appear in the cloud for that group, they may appear elsewhere in the cloud. (So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman - and @davegorman appeared as a small label in the friends part of the comparison.cloud, notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do... What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn't displayed because it would be too large).)

That said, as a quick sketch, I think there's some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party...). I guess one way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets? Suggestions welcome as to what they might be...:-)

PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project - Church or Beer? Americans on Twitter - which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning "church" or "beer"...

Written by Tony Hirst

July 6, 2012 at 1:37 pm

Twitter Volume Controls

With a steady stream of tweets coming out today containing local election results, @GuardianData (as @datastore was recently renamed) asked whether or not regular, stream swamping updates were in order:

A similar problem can occur when folk are livetweeting an event – for a short period, one or two users can dominate a stream with a steady outpouring of live tweets.

Whilst I’m happy to see the stream, I did wonder about how we could easily wrangle a volume control, so here are a handful of possible approaches:

  • Tweets starting @USER ... are only seen in the stream of people following both the sender of the tweet and @USER. So if @GuardianData set up another, non-tweeting, account, @GuardianDataBlitz, and sent election results to that account ("@GuardianDataBlitz Mayor referendum results summary: Bradford NO (55.13% on ), Manchester NO (53.24%), Coventry NO (63.58%), Nottingham NO (57.49%) #vote2012" for example), only @GuardianData followers following @GuardianDataBlitz would see the result. There are a couple of problems with this approach, of course: for one, @GuardianDataBlitz takes up too many characters (although that can be easily addressed), but more significantly it means that most followers of @GuardianData will miss out on the data stream. (They can't be expected to necessarily know about the full fat feed switch.)
  • For Twitter users using a Twitter client that supports global filtering of tweets across all streams within a client, we may be able to set up a filter to exclude tweets of the form (@GuardianData AND #vote2012). This is a high maintenance approach, though, and will lead to the global filter getting cluttered over time, or at least requiring maintenance.
  • The third approach - again targeted at folk who can set up global filters - is for @GuardianData to include a volume control in their tweets, eg Mayor referendum results summary: Bradford NO (55.13% on ), Manchester NO (53.24%), Coventry NO (63.58%), Nottingham NO (57.49%) #vote2012 #gblitz. Now users can set a volume control by filtering out tweets tagged #gblitz. To remind people that they have a volume filter in place, @GuardianData could occasionally post blitz items with #inblitz to show folk who have the filter turned on what they're missing? Downsides to this approach are that it pollutes the tweets with more confusing metadata and maybe confuses folk about what hashtag is being used.
  • A more generic approach might be to use a loudness indicator or channel that can be filtered against, so for example channel 11: ^11 or ^loud (reusing the ^ convention that is used to identify individuals tweeting on a team account)? Reminders to folk who may have a volume filter set could take the form ^on11 or ^onloud on some of the tweets? Semantic channels might also be defined: ^ER (Election Results), ^LT (Live Tweets) etc, again with occasional reminders to folk who've set filters (^onLT, etc, or "We're tweeting local election results on the LT ^channel today"). Again, this is a bit of a hack that's only likely to appeal to "advanced" users and does require them to take some action; I guess it depends whether the extra clutter is worth it?
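To make the filtering idea a bit more concrete, here is a minimal sketch of the sort of client-side volume control these approaches rely on: a function that drops any tweet carrying a given marker token, whether a hashtag like #gblitz or a channel marker like ^loud. The function name and the example tweets are made up, and the matching is deliberately naive (whole whitespace-separated tokens only, so a marker with trailing punctuation would slip through).

```r
# Drop tweets that contain a given marker token (case-insensitive)
mute_tag <- function(tweets, tag) {
  has_tag <- vapply(
    strsplit(tolower(tweets), "\\s+"),
    function(words) tolower(tag) %in% words,
    logical(1)
  )
  tweets[!has_tag]
}

tweets <- c(
  "Mayor referendum results summary #vote2012 #gblitz",
  "New post on the datablog about election maps #vote2012",
  "Another batch of council results #vote2012 #gblitz"
)

# Turn the volume down on the blitz tweets
print(mute_tag(tweets, "#gblitz"))
```

Matching on whole tokens rather than with a regular expression means markers such as ^loud can be used without worrying about escaping the ^ character.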

So - any other volume control approaches I've missed?

PS by the by, here's a search query (just for @daveyp;-) that I've been using to try to track results as folk tweet them:

-RT (#atthecount OR #vote2012 OR #le2012) AND (gain OR held OR los OR hold) AND (con OR lib OR lab OR ukip)

I did wonder about trying to parse out ward names to try to automate the detection of possible results as they appeared in the stream, but opted to go to bed instead! It's something I could imagine trying to work up on Datasift, though...

Written by Tony Hirst

May 4, 2012 at 9:39 am

Posted in Anything you want

Doodling With a Conversation, or Retweet, Data Sketch Around LAK12

How can we represent conversations between a small sample of users, such as the email or SMS conversations between James Murdoch’s political lobbyist and a Government minister’s special adviser (Leveson inquiry evidence), or the pattern of retweet activity around a couple of heavily retweeted individuals using a particular hashtag?

I spent a bit of time on-and-off today mulling over ways of representing this sort of interaction, in search of something like a UML call sequence diagram but not, and here’s what I came up with in the context of the retweet activity:

The chart looks a bit complicated at first, but there’s a lot of information in there. The small grey dots on their own are tweets using a particular hashtag that aren’t identified as RTs in a body of tweets obtained via a Twitter search around a particular hashtag (that is, they don’t start with a pattern something like RT @[^:]*:). The x-axis represents the time a tweet was sent and the y-axis who sent it. Paired dots connected by a vertical line segment show two people, one of whom (light grey point) retweeted the other (dark grey point). RTs of two notable individuals are highlighted using different colours. The small black dots highlight original tweets sent by the individuals who we highlight in terms of how they are retweeted. Whilst we can’t tell which tweet was retweeted, we may get an idea of how the pattern of RT behaviour related to the individuals of interest plays out relative to when they actually tweeted.
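The chart code relies on a data frame with an rtof column naming the person who was retweeted. One way such a column might be derived from old-style RTs is sketched below using the pattern mentioned above; this is an assumption about the preprocessing rather than a record of how the original data was built, and the tweets are invented.

```r
# Illustrative tweets (not real data)
tw.df <- data.frame(
  screenName = c("alice", "bob", "carol"),
  text = c("Interesting keynote #lak12",
           "RT @alice: Interesting keynote #lak12",
           "RT @bob: RT @alice: Interesting keynote #lak12"),
  stringsAsFactors = FALSE
)

# Old-style RTs start with something like "RT @username:"
rt_pattern <- "^RT @([^:]*):"

tw.df.rt <- tw.df
is_rt <- grepl(rt_pattern, tw.df.rt$text)

# rtof is the retweeted user for RTs, and NA for everything else
tw.df.rt$rtof <- NA_character_
tw.df.rt$rtof[is_rt] <- sub(paste0(rt_pattern, ".*"), "\\1", tw.df.rt$text[is_rt])

print(tw.df.rt[, c("screenName", "rtof")])
```

Note that for a chained RT only the most immediate retweeted user is picked up.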

Here’s the R-code used to build up the chart. Note that the order in which the layers are constructed is important (for example, we need the small black dots to be in the top layer).

##RT chart, constructed in R using ggplot2
require(ggplot2)
#the base data set - exclude tweets that aren't RTs
g = ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))
#Add in vertical grey lines connecting who RT'd whom
g = g + geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')
#Use a more complete dataset to mark *all* tweets with a lightgrey point
g = g + geom_point(data=(tw.df),aes(x=created,y=screenName),colour='lightgrey')
#Use points at either end of the RT line segment to distinguish who RTd whom
g = g + geom_point(aes(x=created,y=screenName),colour='lightgrey') + geom_point(aes(x=created,y=rtof),colour='grey') + theme(axis.text.y=element_text(size=5))
#We're going to highlight RTs of two particular individuals
#Define a couple of functions to subset the data
subdata.rtof=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & rtof==u)))
subdata.user=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & screenName==u)))
#Grab user 1
s1='gsiemens'
ss1=subdata.rtof(s1)
ss1x=subdata.user(s1)
sc1='aquamarine3'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss1,aes(x=created,ymin=screenName,ymax=rtof),colour=sc1)
#Grab user 2
s2='busynessgirl'
ss2=subdata.rtof(s2)
ss2x=subdata.user(s2)
sc2='orange'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss2,aes(x=created,ymin=screenName,ymax=rtof),colour=sc2)
#Now we add another layer to colour the nodes associated with RTs of the two selected users
g = g + geom_point(data=ss1,aes(x=created,y=rtof),colour=sc1) + geom_point(data=ss1,aes(x=created,y=screenName),colour=sc1)
g = g + geom_point(data=ss2,aes(x=created,y=rtof),colour=sc2) + geom_point(data=ss2,aes(x=created,y=screenName),colour=sc2)
#Finally, add a highlight to mark when the RTd folk we are highlighting actually tweet
g = g + geom_point(data=(ss1x),aes(x=created,y=screenName),colour='black',size=1)
g = g + geom_point(data=(ss2x),aes(x=created,y=screenName),colour='black',size=1)
#Print the chart
print(g)

One thing I'm not sure about is the order of names on the y-axis. That said, one advantage of using the conversational, exploratory visualisation data approach that I favour is that if you let your eyes try to seek out patterns, you may be able to pick up clues for some sort of model around the data that really emphasises those patterns. So for example, looking at the chart, I wonder if there would be any merit in organising the y-axis so that folk who RTd orange but not aquamarine were in the bottom third of the chart, folk who RTd aqua but not orange were in the top third of the chart, folk who RTd orange and aqua were between the two users of interest, and folk who RTd neither orange nor aqua were arranged closer to the edges, with folk who RTd each other close to each other (invoking an ink minimisation principle)?
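A rough sketch of how that sort of y-axis grouping might be set up: classify each user by which of the two highlighted accounts they RTd, then use the grouping to fix the factor level order (ggplot2 draws discrete axes in factor level order). The data is invented, and for simplicity the "neither" group is parked at the top rather than split between the edges.

```r
# Illustrative RT data: who tweeted, and who (if anyone) they RTd
rts <- data.frame(
  screenName = c("u1", "u2", "u2", "u3", "u4"),
  rtof = c("gsiemens", "busynessgirl", "gsiemens", "busynessgirl", NA),
  stringsAsFactors = FALSE
)

# Users who RTd each of the two highlighted accounts
rt_aqua <- unique(rts$screenName[!is.na(rts$rtof) & rts$rtof == "gsiemens"])
rt_orange <- unique(rts$screenName[!is.na(rts$rtof) & rts$rtof == "busynessgirl"])

# Classify every user into one of four groups
users <- unique(rts$screenName)
group <- ifelse(users %in% rt_aqua & users %in% rt_orange, "both",
         ifelse(users %in% rt_aqua, "aqua_only",
         ifelse(users %in% rt_orange, "orange_only", "neither")))

# Bottom-to-top ordering: orange only, both, aqua only, then neither
group_order <- c("orange_only", "both", "aqua_only", "neither")
y_levels <- users[order(match(group, group_order))]

# Fix the plotting order by setting the factor levels
rts$screenName <- factor(rts$screenName, levels = y_levels)
print(levels(rts$screenName))
```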

Something else that it would be nice to do would be to use the time an original tweet was sent as the x-axis value for the tweet marker for the original sender of a tweet that is RTd. We would then get a visual indication of how quickly a tweet was RTd.

PS I also created a script that generated a wealth of other charts around the lak12 hashtag [PDF]. The code used to generate the report can be found as the file exampleSearchReport.Rnw in this code repository.

Written by Tony Hirst

May 2, 2012 at 11:04 pm

Posted in Rstats, Tinkering

Generating Twitter Wordclouds in R (Prompted by an Open Learning Blogpost)

A couple of weeks ago I saw a great example of an open learning blogpost from @katy_bird: Generating a word cloud (or not) from a Twitter hashtag. It described the trials and tribulations associated with trying to satisfy a request for the generation of a wordcloud based on tweets associated with a specific Twitter hashtag. A seemingly simple task, you might think, but things are never that easy… If you read the post, you’ll see Katy identified several problems, or stumbling blocks, along the way, as well as how she addressed them. There’s also a bit of reflection on the process as a whole.

Reading the post the first time (and again, just now), completely set me up for the day. It had a little bit of everything: a goal statement, the identification of a set of problems associated with trying to complete the task, some commentary on how the problems were tackled, and some reflection on the process as a whole. The post thus serves the purpose of capturing a problem discovery process, as well as the steps taken to try and solve each problem (although full documentation is lacking… This is something I have learned over the years: to use something like a gist on github to actually keep a copy of any code I generated to solve the problem, linked to for reuse by myself and others from the associated blog post). The post captures a glimpse back at a moment in time – when Katy didn’t know how to generate a wordcloud – from the joyful moment at which she has just learned how to generate said wordcloud. More importantly, the post describes the learning problems that became evident whilst trying to achieve the goal in such a way that they can act as hooks on which others can hang alternative or additional ways of solving the problem, or act as mentor.

By identifying the learning journey and problems discovered along the way, Katy’s record of her learning strategy also provides an authentic, learner centric perspective on what’s involved in trying to create a wordcloud around a twitter hashtag.

Reading the post again has also prompted me to blog this recipe, largely copied from the RDataMining post Using Text Mining to Find Out What @RDataMining Tweets are About, for generating a word cloud around a twitter hashtag using R (I use RStudio; the recipe requires at least the twitteR and tm libraries):

require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)

##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))

##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/

#Load the text mining library (install it first with install.packages('tm') if need be)
require(tm)
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus= function(df,my.stopwords=c()){
  #The following is cribbed and seems to do what it says on the can
  tw.corpus= Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case (recent versions of tm need base functions wrapped in content_transformer)
  tw.corpus = tm_map(tw.corpus, content_transformer(tolower))
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)

  tw.corpus
}

wordcloud.generate=function(corpus,min.freq=3){
  require(wordcloud)
  doc.m = TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc=wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}

print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))

##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()

#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber=function(searchTerm,num=500){
  require(twitteR)
  rdmTweets = searchTwitter(searchTerm, n=num)
  tw.df=twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)

Here's the result:

PS for an earlier, was broken, now patched, route to sketching a wordcloud from a twitter search using Wordle, see How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps.

Written by Tony Hirst

February 15, 2012 at 9:40 pm

Posted in Rstats

Do Retweeters Lack Commitment to a Hashtag?

I seem to be going down more ratholes than usual at the moment, in this case relating to activity round Twitter hashtags. Here’s a quick bit of reflection around a chart from Visualising Activity Around a Twitter Hashtag or Search Term Using R that shows activity around a hashtag that was minted for an event that took place before the sample period.

The y-axis is organised according to the time of first use (within the sample period) of the tag by a particular user. The x axis is time. The dots represent tweets containing the hashtag, coloured blue by default, red if they are an old-style RT (i.e. they begin RT @username:).
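For reference, the quantities behind this sort of chart might be sketched as follows: an accession number for each user based on their first use of the tag within the sample, and a flag for old-style RTs. The tweets and column names are made up for illustration.

```r
# Illustrative tweets containing the hashtag
tweets <- data.frame(
  screenName = c("bob", "alice", "bob", "carol", "alice"),
  created = as.POSIXct(c("2012-02-01 10:00:00", "2012-02-01 10:05:00",
                         "2012-02-01 11:00:00", "2012-02-01 11:30:00",
                         "2012-02-01 12:00:00"), tz = "UTC"),
  text = c("RT @dave: slides are up #tag", "Enjoying the session #tag",
           "More notes #tag", "RT @alice: Enjoying the session #tag",
           "@carol thanks! #tag"),
  stringsAsFactors = FALSE
)

# Time of first use of the tag by each user within the sample
first_use <- vapply(split(as.numeric(tweets$created), tweets$screenName),
                    min, numeric(1))

# Accession number: 1 for the first user seen, 2 for the second, and so on
acc_order <- names(sort(first_use))
tweets$acc <- match(tweets$screenName, acc_order)

# Flag old-style RTs, i.e. tweets that begin "RT @username:"
tweets$is_rt <- grepl("^RT @[^:]*:", tweets$text)

print(tweets[, c("screenName", "acc", "is_rt")])
```

Plotting created on the x-axis against acc on the y-axis, coloured by is_rt, gives the basic form of the chart.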

So what sorts of thing might we look for in this chart, and what are the problems with it? Several things jump out at me:

  • For many of the users, their first tweet (in this sample period at least) is an RT; that is, they are brought into the hashtag community through issuing an RT;
  • Many of the users whose first use is via an RT don't use the hashtag again within the sample period. Is this typical? Does this signal represent amplification of the tag without any real sense of engagement with it?
  • A noticeable proportion of folk whose first use is not an RT go on to post further non-RT tweets. Does this represent an ongoing commitment to the tag? Note that this chart does not show whether tweets are replies, or "open" tweets. Replies (that is, tweets beginning @username) are likely to represent conversational threads within a tag context rather than "general" tag usage, so it would be worth using an additional colour to identify reply based conversational tweets as such.
  • "New style" retweets are not displayed as retweets by colouring... I need to check whether or not new-style RT information is available that I could use to colour such tweets appropriately (or alternatively, I'd have to do some sort of string matching to see whether or not a tweet was the same as a previously seen tweet, which is a bit of a pain:-( )

(Note that when I started mapping hashtag communities, I used to generate tag user names based on a filtered list of tweets that excluded RTs. This meant that folk who only used the tag as part of an RT and did not originate tweets that contained the tag, either in general or as part of a conversation, would not be counted as a member of the hashtag community. More recently, I have added filters that include RTs but exclude users who used the tag only once, for example, thus retaining serial RTers, but not single use users.)
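The two membership filters described in that note might be sketched like this, assuming a data frame of tagged tweets with a logical is_rt column. The data and names are illustrative.

```r
# Illustrative tagged tweets
tweets <- data.frame(
  screenName = c("alice", "bob", "bob", "carol", "dave", "dave"),
  is_rt = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Older approach: community members are users who posted
# at least one tagged tweet that was not an RT
originators <- unique(tweets$screenName[!tweets$is_rt])

# Newer approach: keep RTs, but drop users who used the tag only once
use_counts <- table(tweets$screenName)
repeat_users <- names(use_counts)[as.vector(use_counts) > 1]

print(originators)
print(repeat_users)
```

Here bob, a serial RTer, is excluded by the first filter but retained by the second, whereas alice, who tweeted once, goes the other way.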

So what else might this chart tell us? Looking at vertical slices, it seems that new entrants to the tag community appear to come in waves, maybe as part of rapid fire RT bursts. This chart doesn't tell us for sure that this is happening, but it does highlight areas of the timeline that might be worth investigating more closely if we are interested in what happened at those times when there does appear to be a spike in activity. (Are there any modifications we could make to this chart to make it more informative in this respect? The time resolution is very poor, for example, so being able to zoom in on a particular time might be handy. Or are there other charts that might provide a different lens that can help us see what was happening at those times?)

And as a final point - this stuff may be all very interesting, but is it useful? And if so, how? I also wonder how generalisable it is to other sorts of communication analysis. For example, I think we could use similar graphical techniques to explore engagement with an active comment thread on a blog, or Google+, or additions to an online forum thread. (For forums with multiple threads, we maybe need to rethink how this sort of chart would work, or how it might be coloured/what symbols we might use, to distinguish between starting a new thread, or adding to a pre-existing one, for example. I'm sure the literature is filled with dozens of examples for how we might visualise forum activity, so if you know of any good references/links...?! ;-) #lazyacademic)

Written by Tony Hirst

February 9, 2012 at 6:30 pm

What is the Potential Audience Size for a Hashtag Community?

What’s the potential audience size around, or ‘reach’ associated with, a Twitter hashtag?

Way back when, in the early days of web stats, reported figures tended to centre around the notion of hits, the number of calls made to a server via website activity. I forget the details, but the metric was presumably generated from server logs. This measure was always totally unreliable, because in the course of serving a web page, a server might be hit multiple times, once for each separately delivered asset, such as images, javascript files, css files and so on. Hits soon gave way to the notion of Page Views, which more accurately measured the number of pages (rather than assets) served via a website. This was complemented with the notion of Visits and Unique Visits: Visits, as tracked by cookies, represent a set of pages viewed around about the same time by the same person. Unique Visits (or “Uniques”) represent the number of different people who appear to have visited the site in any given period.

What we see here, then, is a steady evolution in the complexity of website metrics that reflects on the one hand dissatisfaction with one way of measuring or reporting activity, and on the other practical considerations with respect to instrumentation and the ability to capture certain metrics once they are conceived of.

Widespread social media monitoring/tracking is largely still in the realm of “hits” measurement. Personal dashboards for services such as Twitter typically display direct measures provided by the Twitter API, or measures trivially/directly identified from Twitter API or archived data – number of followers, numbers of friends, distribution of updates over time, number of mentions, and so on.

Something both Martin Hawksey and I have been thinking about on and off for some time is how to report activity around Twitter hashtags. A commonly(?!) asked question in this respect relates to how much engagement (whatever that means) there has been with a particular tag. So here’s a quick mark in the sand about some of my current thinking about this. (Note that these ideas may well have been more formally developed in the academic literature – I’m a bit behind in my reading! If you know something that covers this in more detail, or that I should cite, please feel free to add a link in the comments… #lazyAcademic.)

One of the first metrics that comes to my mind is the number of people who have used a particular hashtag, and the number of their followers. Easily stated, it doesn’t take a lot of thought to realise even these “simple” measures are fraught with difficulty:

  • what counts as a use of the hashtag? If I retweet a message of yours that contains a hashtag, have I used it in any meaningful sense? Does a “use” mean the creation of a new tweet containing the tag? What about if I reply to a tweet from you that contains the tag and I include the tag in my reply to you, even if I’m not sure what that tag relates to?
  • the potential audience size for the tag (potential uniques?), based on the number of followers of the tag users. At first glance, we might think this can be easily calculated by adding together the follower counts of the tag users, but this is more strictly an approximation of the potential audience: the set of followers of A may include some of the followers of B, or C; do we count the tag users themselves amongst the audience? If so, the upper bound also needs to take into account the fact that none of the users may be followers of any of the other tag users.
    Note there is also a lower bound – the largest follower count amongst the tag users (whatever that means…) of the hashtag. Furthermore, if we want to count the number of folk not using the tag but who may have seen the tag, this lower bound can be revised downwards by subtracting the number of tag users minus one (for the tag user with the largest follower count). The value is still only an approximation, though, because it assumes that all the tag users are actually included as followers of at least one, each, of the tag users. (If you think these points are “just academic”, they are and they aren’t – observations like these can often be used to help formulate gaming strategies around metrics based on these measures.)
  • the potential number of views of a tag, for example based on the product of the number of times a user tweets and their follower count?
  • the reach of (or active engagement with?) the tag, as measured by the number of people who actually see the tag, or the number of people who take an action around it (such as replying to a tagged tweet, RTing it, or clicking on a link a tagged tweet contains); note that we may be able to construct probabilistic models (albeit quite involved ones) of the potential reach based on factors like the number of people someone follows, when they are online, the rate at which the people they follow tweet, and so on.
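The bounds described in the list above can be illustrated with a few lines of Python. This is only a sketch on made-up data: the follower sets, tweet counts and function names (`audience_bounds`, `potential_views`) are hypothetical, and it ignores the question of whether tag users themselves count as audience members.

```python
# Hypothetical follower ID sets for three tag users
followers = {
    "alice": {1, 2, 3, 4},
    "bob": {3, 4, 5},
    "carol": {5, 6},
}
# Hypothetical number of tagged tweets sent by each tag user
tweet_counts = {"alice": 3, "bob": 1, "carol": 2}

def audience_bounds(followers):
    """Return (lower bound, unique audience, upper bound) for the potential audience."""
    counts = [len(f) for f in followers.values()]
    upper = sum(counts)             # reached only if no two tag users share a follower
    lower = max(counts, default=0)  # reached if every other follower also follows the biggest account
    unique = len(set().union(*followers.values()))  # set union removes the double counting
    return lower, unique, upper

def potential_views(followers, tweet_counts):
    """Sum over tag users of (tagged tweets sent) x (follower count)."""
    return sum(tweet_counts[user] * len(f) for user, f in followers.items())

lower, unique, upper = audience_bounds(followers)
views = potential_views(followers, tweet_counts)
print(lower, unique, upper, views)
```

With these toy numbers the unique audience (6) sits between the lower bound (4) and the summed follower counts (9), which is the gap the charts further down try to make visible.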

To try to make this a little more concrete, here are a couple of scripts for exploring the potential audience size of a tag based on the followers of the tag users (where a user is someone who publishes or retweets a tweet containing the tag over a specified period). The first, a Python script, runs a Twitter search and generates a list of unique users of the tag, along with the timestamp of their first use of the tag within the sample period. This script also grabs all the followers of the tag users, along with their counts, and generates a running cumulative (upper bound approximation) count of the tag user follower numbers, as well as calculating the rolling set of unique followers to date as each new tag user is observed. The second, an R script, plots the values.

The first thing we can do is look at the incidence of new users of the hashtag over time:

(For a little more discussion of this sort of chart, see Visualising Activity Around a Twitter Hashtag or Search Term Using R and its inspiration, @mediaczar’s How should Page Admins deal with Flame Wars?.)

More relevant to this post, however, is a plot showing some counts relating to followers of users of the hashtag:

In this case, the top, green line represents the summed total number of followers for tag users as they enter the conversation. If every user had completely different followers, this might be meaningful, but where conversation takes place around a tag between folk who know each other, it’s highly likely that they have followers in common.

The middle, red line shows a count of the number of unique followers to date, based on the followers of users of the tag to date.

The lower, blue line shows the difference between the red and green lines. This represents the error between the summed follower counts and the actual number of unique followers.

Here’s a view over the number of new unique potential audience members at each time step (I think the use of the line chart here may be a mistake… I think bars/lineranges would probably be more appropriate…):

In the following chart, I overplot one line with another. The lower layer (a red line) is the total follower count for each new tag user. The blue is the increase in the potential audience count (that is, the number of the new users’ followers that haven’t potentially seen the tag so far). The range of the visible part of the red line thus shows the number of a new tag user’s followers who have potentially already seen the tag. Err… maybe (that is, if my code is correct and all the scripts are doing what I think they’re doing! If they aren’t, then just treat this post as an exploration of the sorts of charts we might be able to produce to explore audience reach;-)
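To pin down what each of the plotted quantities means, here is a minimal, self-contained Python sketch of the rolling calculation on toy data. The keys `count`, `crudetot`, `userFoCount` and `previousCount` mirror the column names used in the CSV file described in this post; `newAudience`, `alreadyReached`, the function name and the sample follower sets are assumptions added for illustration.

```python
def rolling_audience(entrants):
    """entrants: list of (user, set of follower IDs), ordered by first use of the tag."""
    seen = set()      # unique potential audience to date
    crude_total = 0   # summed follower counts, overlap ignored
    rows = []
    for user, fo in entrants:
        previous = len(seen)
        seen |= fo
        crude_total += len(fo)
        new_audience = len(seen) - previous
        rows.append({
            "newuser": user,
            "userFoCount": len(fo),
            "crudetot": crude_total,        # the summed (upper bound) line
            "count": len(seen),             # the unique followers line
            "previousCount": previous,
            "newAudience": new_audience,    # followers not potentially reached before
            "alreadyReached": len(fo) - new_audience,  # followers already in the audience
        })
    return rows

# Hypothetical tag users and follower IDs
entrants = [
    ("alice", {1, 2, 3, 4}),
    ("bob", {3, 4, 5}),
    ("carol", {5, 6}),
]
rows = rolling_audience(entrants)
for row in rows:
    print(row["newuser"], row["crudetot"], row["count"], row["crudetot"] - row["count"])
```

The final printed column is the gap between the summed and unique counts, and `alreadyReached` corresponds to the visible part of the red line in the overplotted chart.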

Here are the scripts (such as they are!)

import newt,csv,tweepy
import networkx as nx

#the term we're going to search for
tag='ddj'
#how many tweets to search for (max 1500)
num=500

##Something along lines of:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(SKEY, SSECRET)
api = tweepy.API(auth, cache=tweepy.FileCache('cache',cachetime), retry_errors=[500], retry_delay=5, retry_count=2)

#You need to do some work here to search the Twitter API
tweeters, tweets=yourSearchTwitterFunction(api,tag,num)
#tweeters is a list of folk who tweeted the term of interest
#tweets is a list of the Twitter tweet objects returned from the search
#My code for this is tightly bound up in a large and rambling library atm...

#Put tweets into chronological order
tweets.reverse()

#I was being lazy and wasn't sure what vars I needed or what I was trying to do when I started this!
#The whole thing really needs rewriting...
tweepFo={}
seenToDate=set([])
uniqSourceFo=[]
#runtot is crude and doesn't measure overlap
runtot=0
oldseentodate=0

#Construct a digraph from folk using the tag to their followers
DG=nx.DiGraph()

for tweet in tweets:
	user=tweet['from_user']
	#Look up the node ID for the author of *this* tweet on every pass;
	#otherwise the else branch below would reweight the edges of whichever
	#new user happened to be processed most recently
	userID=tweet['from_user_id'] #check
	if user not in tweepFo:
		tweepFo[user]=[]
		print "Getting follower data for", str(user), str(len(tweepFo)), 'of', str(len(tweeters))
		mi=tweepy.Cursor(api.followers_ids,id=user).items()
		DG.add_node(userID,label=user)
		for m in mi:
			tweepFo[user].append(m)
			#construct graph
			DG.add_edge(userID,m,weight=1)
			DG.node[m]['label']=''
		ufc=len(tweepFo[user])
		runtot=runtot+ufc
		#seen to date is all people who have seen so far, plus new ones, so it's the union
		oldseentodate=len(seenToDate)
		seenToDate=seenToDate.union(set(tweepFo[user]))
		uniqSourceFo.append((tweet['created_at'],len(seenToDate),user,runtot,ufc,oldseentodate))
	else:
		#I'm weighting the edges so we can count how many times folk see the hashtag
		if len(DG.edges(userID))>0:
			tmp1,tmp2=DG.edges(userID)[0]
			weight=DG[userID][tmp2]['weight']+1
			for fromN,toN in DG.edges(userID):
				DG[fromN][toN]['weight']=weight


fo='reports/tmp/'+tag+'_ncount.csv'
f=open(fo,'wb+')
writer=csv.writer(f)
writer.writerow(['datetime','count','newuser','crudetot','userFoCount','previousCount'])
for ts,l,u,ct,ufc,ols in uniqSourceFo:
	print ts,l
	writer.writerow([ts,l,u,ct,ufc,ols])

f.close()

print "Writing graph.."
filter=[]
for n in DG:
	if DG.degree(n)>1: filter.append(n)
filter=set(filter)
H=DG.subgraph(filter)
nx.write_graphml(H, 'reports/tmp/'+tag+'_ncount_2up.graphml')
print "Writing other graph.."
nx.write_graphml(DG, 'reports/tmp/'+tag+'_ncount.graphml')

Here's the R script...

#ggplot2 provides the plotting functions; plyr provides arrange()
library(ggplot2)
library(plyr)

ddj_ncount <- read.csv("~/code/twapps/newt/reports/tmp/ddj_ncount.csv")
#Convert the datetime string to a time object
ddj_ncount$ttime=as.POSIXct(strptime(ddj_ncount$datetime, "%a, %d %b %Y %H:%M:%S"),tz='UTC')

#Order the newuser factor levels into the order in which they first use the tag
dda=subset(ddj_ncount,select=c('ttime','newuser'))
dda=arrange(dda,-desc(ttime))
ddj_ncount$newuser=factor(ddj_ncount$newuser, levels = dda$newuser)

#Plot when each user first used the tag against time
ggplot(ddj_ncount) + geom_point(aes(x=ttime,y=newuser)) + opts(axis.text.x=theme_text(size=6),axis.text.y=theme_text(size=4))

#Plot the cumulative and union flavours of increasing possible audience size, as well as the difference between them
ggplot(ddj_ncount) + geom_line(aes(x=ttime,y=count,col='Unique followers')) + geom_line(aes(x=ttime,y=crudetot,col='Cumulative followers')) + geom_line(aes(x=ttime,y=crudetot-count,col='Repeated followers')) + labs(colour='Type') + xlab(NULL)

#Number of new unique followers introduced at each time step
ggplot(ddj_ncount)+geom_line(aes(x=ttime,y=count-previousCount,col='Actual delta'))

#Try to get some idea of how many of the followers of a new user are actually new potential audience members
ggplot(ddj_ncount) + opts(axis.text.x=theme_text(angle=-90,size=4)) + geom_linerange(aes(x=newuser,ymin=0,ymax=userFoCount,col='Follower count')) + geom_linerange(aes(x=newuser,ymin=0,ymax=(count-previousCount),col='Actual new audience'))

#This is still a bit experimental
#I'm playing around trying to see what proportion or number of a users followers are new to, or subsumed by, the potential audience of the tag to date...
ggplot(ddj_ncount) + geom_linerange(aes(x=newuser,ymin=0,ymax=1-(count-previousCount)/userFoCount)) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)

In the next couple of posts in this series, I'll start to describe how we can chart the potential increase in audience count as a delta for each new tagger, along with a couple of ways of trying to get some initial sort of sense out of the graph file, such as the distribution of the potential number of "views" of a tag across the unique potential audience members...
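As a taster for the "views" distribution, here is a hedged Python sketch that skips the graph file altogether and simply tallies potential views with a `Counter`. This is a swapped-in simplification, not the edge weighting approach used in the script above, and the data and the function name `view_distribution` are made up.

```python
from collections import Counter

def view_distribution(tweet_authors, followers):
    """tweet_authors: one user name per tagged tweet; followers: user -> set of follower IDs."""
    views = Counter()  # follower ID -> number of tagged tweets they could have seen
    for user in tweet_authors:
        for follower in followers.get(user, ()):
            views[follower] += 1
    # Number of audience members who potentially saw the tag exactly k times
    distribution = Counter(views.values())
    return views, distribution

# Hypothetical data: alice tweets the tag twice, bob and carol once each
followers = {"alice": {1, 2, 3, 4}, "bob": {3, 4, 5}, "carol": {5, 6}}
tweet_authors = ["alice", "bob", "alice", "carol"]

views, distribution = view_distribution(tweet_authors, followers)
for k in sorted(distribution):
    print(k, "potential view(s):", distribution[k], "audience member(s)")
```

Someone who follows several active tag users ends up with a high count, which is one crude way of spotting the core of the potential audience.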

PS See also the follow on post More Thoughts on Potential Audience Metrics for Hashtag Communities

Written by Tony Hirst

February 9, 2012 at 12:30 am

Dangers of a Walled Garden…

Reading a recent Economist article (The value of friendship) about the announcement last week that Facebook is to float as a public company, and being amazed as ever about how these valuations, err, work, I recalled a couple of observations from a @currybet post about the Guardian Facebook app (“The Guardian’s Facebook app” – Martin Belam at news:rewired). The first related to using Facebook apps to (only partially successfully) capture attention of folk on Facebook and get them to refocus it on the Guardian website:

We knew that 77% of visits to the Guardian from facebook.com only lasted for one page. A good hypothesis for this was that leaving the confines of Facebook to visit another site was an interruption to a Facebook session, rather than a decision to go off and browse another site. We began to wonder what it would be like if you could visit the Guardian whilst still within Facebook, signed in, chatting and sharing with your friends. Within that environment could we show users a selection of other content that would appeal to them, and tempt them to stay with our content a little bit longer, even if they weren’t on our domain.

The second thing that came to mind related to the economic/business models around the Facebook app itself:

The Guardian Facebook app is a canvas app. That means the bulk of the page is served by us within an iFrame on the Facebook domain. All the revenue from advertising served in that area of the page is ours, and for launch we engaged a sponsor to take the full inventory across the app. Facebook earn the revenue from advertising placed around the edges of the page.

I’m not sure if Facebook runs CPM (cost per thousand) display based ads, where advertisers pay per impression, or follows the Google AdWords model, where advertisers pay per click (PPC), but it got me wondering… A large number of folk on Facebook (and Twitter) share links to third party websites external to Facebook. As Martin Belam points out, the user return rate back to Facebook for folk visiting third party sites from Facebook seems very high – folk seem to follow a link from Facebook, consume that item, return to Facebook. Facebook makes an increasing chunk of its revenue from ads it sells on Facebook.com (though with the amount of furniture and Facebook open graph code it’s getting folk to include on their own websites, it presumably wouldn’t be so hard for them to roll out their own ad network to place ads on third party sites?) so keeping eyeballs on Facebook is presumably in their commercial interest.

In Twitter land, where the VC folk are presumably starting to wonder when the money tap will start to flow, I notice “sponsored tweets” are starting to appear in search results:

Another twitter search irrelevance

Relevance still appears to be quite low, possibly because they haven’t yet got enough ads to cover a wide range of keywords or prompts:

Dodgy twitter promoted tweet

(Personally, if the relevance score was low, I wouldn’t place the ad, or I’d serve an ad tuned to the user, rather than the content, per se…)

Again, with Twitter, a lot of sharing results in users being taken to external sites, from which they quickly return to the Twitter context. Keeping folk in the Twitter context for images and videos through pop-up viewers or embedded content in the client is also a strategy pursued in many Twitter clients.

So here’s the thought, though it’s probably a commercially suicidal one: at the moment, Facebook and Twitter and Google+ all automatically “linkify” URLs (though Google+ also takes the strategy of previewing the first few lines of a single linked to page within a Google+ post). That is, given a URL in a post, they turn it into a link. But what if they turned that linkifier off for a domain, unless a fee was paid to turn it back on. Or what if the linkifier was turned off if the number of clickthrus on links to a particular domain, or page within a domain, exceeded a particular threshold, and could only be turned on again at a metered, CPM rate. (Memories here of different models for getting folk to pay for bandwidth, because what we have here is access to bandwidth out of the immediate Facebook, Twitter or Google+ context).

As a revenue model, the losses associated with irritating users would probably outweigh any revenue benefits, but as a thought experiment, it maybe suggests that we need to start paying more attention to how these large attention-consuming services are increasingly trying to cocoon us in their context (anyone remember AOL, or to a lesser extent Yahoo, or Microsoft?), rather than playing nicely with the rest of the web.

PS Hmmm…”app”. One default interpretation of this is “app on phone”, but “Facebook app” means an app that runs on the Facebook platform… So for any given app, that it is an “app” implies that that particular variant means “software application that runs on a proprietary platform”, which might actually be a combination of hardware and software platforms (e.g. Facebook API and Android phone)???

Written by Tony Hirst

February 8, 2012 at 11:46 am

Posted in Anything you want

Social Media Interest Maps of Newsnight and BBCQT Twitterers

I grabbed independent samples of 1500 recent users of the #newsnight and #bbcqt hashtags within a minute or two of each other about half an hour ago. Here’s who’s followed by 25 or more of the recent hashtaggers in each case. Can you distinguish the programmes each audience interest projection map relates to?

Here’s the first one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?

#bbcqt 1500 forward friends 25 25

Here’s the second one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?

#newsnight 1500   forward friends  projection 25 25

The answer is only a click away…

PS I’ve got a couple of scripts in the pipeline that should be able to generate data that I can use to generate this sort of differencing word cloud, the idea being I should be able to identify at a glance accounts that different hashtag communities both follow, and accounts that they differently follow…

UPDATE: so here’s a quick first pass at comparing the audiences. I’m not sure how reliable the method is, but it’s as follows:

- for each hashtag, grab 1500 recent tweets. Grab the list of folk the hashtagging users follow and retain a list (the ‘interest list’) of folk followed by at least 25 of the hashtaggers. Filter the hashtagger list so that it only contains hashtaggers who follow at least 25 people (this cuts out brand new users and newly created spam accounts). Count the number of filtered hashtaggers that follow each person in the interest list, and normalise by dividing through by the total number of filtered hashtaggers. To recap, for each tag, we now have a list of folk who were popularly followed by users of that tag, along with a number for each one between 0 and 1 describing proportionally how much of the hashtagging sample follow them.
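The steps just described might be sketched in Python as follows. This is an illustration on toy data rather than the code actually used: the function name `interest_list`, its parameter names and the sample accounts are assumptions, and the thresholds are dropped from 25 to 2 so that the small example produces something.

```python
from collections import Counter

def interest_list(friends, min_followed_by=25, min_friends=25):
    """friends: hashtagger -> set of accounts that hashtagger follows.

    Returns account -> proportion of filtered hashtaggers who follow it.
    """
    # Interest list: accounts followed by at least min_followed_by hashtaggers
    raw = Counter(account for fr in friends.values() for account in fr)
    interesting = {account for account, n in raw.items() if n >= min_followed_by}
    # Filter out hashtaggers who follow fewer than min_friends accounts
    filtered = {user: fr for user, fr in friends.items() if len(fr) >= min_friends}
    if not filtered:
        return {}
    # Count filtered hashtaggers following each interest list account, then normalise
    counts = Counter(a for fr in filtered.values() for a in fr if a in interesting)
    return {account: counts[account] / len(filtered) for account in interesting}

# Hypothetical sample of four hashtaggers and the accounts they follow
friends = {
    "u1": {"BBCBreaking", "BarackObama", "x"},
    "u2": {"BBCBreaking", "BarackObama"},
    "u3": {"BBCBreaking"},
    "u4": {"BBCBreaking", "y", "z"},
}
scores = interest_list(friends, min_followed_by=2, min_friends=2)
print(scores)
```

Here `u3` follows too few accounts and is filtered out, so the scores are proportions of the three remaining hashtaggers.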

(Note that there may be all sorts of sampling errors… I guess I need to qualify reports with the number of unique folk tweeting in the twitter sample captured. I maybe also need to improve sampling so rather than searching for 1500 tweets, I generate a sample of 1000 unique users of the tag?)

I then load these files into R and run through the following process:

#Multiply this normalised follower proportion by 1000 and round down to get an integer between 0 and 1000 representing a score relative to the proportion of filtered hashtaggers who follow each person in the interest list.
counts_newsnight$normIn=as.integer(counts_newsnight$inNorm*1000)
counts_bbcqt$normIn=as.integer(counts_bbcqt$inNorm*1000)

#Another filtering step: we're going to plot similarities and differences between folk followed by at least 25% of the corresponding filtered hashtaggers
newsnight=subset(counts_newsnight,select=c(username,normIn),subset=(inNorm>=0.25))
bbcqt=subset(counts_bbcqt,select=c(username,normIn),subset=(inNorm>=0.25))

#Now generate a dataframe
qtvnn=merge(bbcqt,newsnight,by="username",all=T)
colnames(qtvnn)=c('username','bbcqt','newsnight')

#Replace the NA cell values (where, for example, someone in the bbcqt list is not in the newsnight list) with 0
qtvnn[is.na(qtvnn)] <- 0

That generates a dataframe that looks something like this:

      username bbcqt newsnight
1    Aiannucci   414       408
2  BBCBreaking   455       464
3 BBCNewsnight   317       509
4  BBCPolitics     0       256
5   BBCr4today     0       356
6  BarackObama   296       334

Thanks to Josh O'Brien on Stack Overflow, I can recast this data frame into a term.matrix that plays nicely with the latest version of the R wordcloud package.

#comparison.cloud() and commonality.cloud() come from the wordcloud package
library(wordcloud)
mat <- as.matrix(qtvnn[-1])
dimnames(mat)[1] <- qtvnn[1]
comparison.cloud(term.matrix = mat)
commonality.cloud(term.matrix = mat)

Here's the result - commonly followed folk:

And differentially followed folk (at above the 25% level, remember...)

So from this what can we say? Both audiences have a general news interest, into pop politics and perhaps satirical comedy, maybe leaning to the left? The Question Time audience is a more casual audience, more minded to following celebrities, whereas the Newsnight audience is a bit more into following notable media folk (journalists, editors) and also political news. (I'd be keen to hear any other readings of these maps - please feel free to leave a comment containing your interpretations/observations/reading:-)

UPDATE2: to try to get a handle on what the word clouds might be telling us from an alternative visual perspective on the data, rather than inspecting the actual code for example, here's a scatterplot showing how the follower proportions compare directly:

Comparison of who #newsnight and #bbcqt hashtaggers follow

ggplot(na.omit(subset(qtvnn,bbcqt>0 & newsnight>0))) + geom_text(aes(x=bbcqt,y=newsnight,label=username),angle=45,size=4) + xlim(200,600) + ylim(200,600) + geom_abline(intercept=0, slope=1,colour='grey')

Here's another view - this time plotting followed folk for each tag who are not also followed by users of the other tag [at or above the 25% level]:

hashtag comparison - folk not followed by other tag

I couldn't remember/didn't have Google to hand to find the best way of reshaping the data for this, so I ended up with a horrible horrible hack...

nn=data.frame(typ='newsnight',subset(qtvnn,select=c(username,newsnight),subset=(newsnight>0 & bbcqt==0)))
qt=data.frame(typ='bbcqt',subset(qtvnn,select=c(username,bbcqt),subset=(newsnight==0 & bbcqt>0)))
colnames(nn)=c('typ','name','val')
colnames(qt)=c('typ','name','val')
qtnn=rbind(nn,qt)
ggplot()+geom_text(data=qtnn,aes(x=typ,y=val,label=name),size=3)

I think this is all starting to get to the point where I need to team up with a proper developer and get *all* the code properly written and documented before any errors that are currently there get baked in too deeply...

Written by Tony Hirst

January 26, 2012 at 11:23 pm

Posted in Anything you want, Rstats
