Visualising Activity Around a Twitter Hashtag or Search Term Using R

I think one of valid criticisms around a lot of the visualisations I post here and on my various #f1datajunkie blogs is that I often don’t post any explanatory context around the visualisations. This is partly a result of the way I use my blog posts in a selfish way to document the evolution of my own practice, but not necessarily the “so what” elements that represent any meaning or sense I take from the visualisations. In many cases, this is because the understanding I come to of a dataset is typically the result of an (inter)active exploration of the data set; what I blog are the pieces of the puzzle that show how I personally set about developing a conversation with a dataset, pieces that you can try out if you want to…;-)

An approach that might get me more readers would be to post commentary around what I’ve learned about a dataset from having a conversation with it. A good example of this can be seen in @mediaczar’s post on How should Page Admins deal with Flame Wars?, where this visualisation of activity around a Facebook post is analysed in terms of effective (or not!) strategies for moderating a flame war.

@mediaczar visualisation of engagement around facebook flamewars

The chart shows a sequential ordering of posts in the order they were made along the x-axis, and the unique individual responsible for each post, ordered by accession to the debate along the y-axis. For interpretation and commentary, see the original post: How should Page Admins deal with Flame Wars? ;-)

One take away of the chart for me is that it provides a great snapshot of new people entering into a conversation (vertical lines) as well as engagement by an individual (horizontal lines). If we use a time proportional axis on x, we can also see engagement over time.

In a Twitter context, it’s likely that a rapid increase in numbers of folk engaging with a hashtag, for example, might be the result of an RT related burst of activity. For folk who have already engaged in hashtag usage, for example as part of a live event backhannel, a large number of near co-occurring tweets that are not RTs might signal some notable happenstance within the event.

To explore this idea, here’s a quick bit of R tooling inspired by Mat’s post… It uses the twitteR library and sources tweets via a Twitter search.

#Pull in a search around a hashtag.
rdmTweets <- searchTwitter(searchTerm, n=500)
# Note that the Twitter search API only goes back 1500 tweets

#Plot of tweet behaviour by user over time
#Based on @mediaczar's
#Make use of a handy dataframe creating twitteR helper function
#@mediaczar's plot uses a list of users ordered by accession to user list
## 1) find earliest tweet in searchlist for each user [ ]
tw.dfx=ddply(tw.df, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
## 3) Use the username accession order to order the screenName factors in the searchlist
tw.df$screenName=factor(tw.df$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...

We can get a feeling for which occurrences were old-style RTs by identifying tweets that start with a classic RT, and then colouring each tweet appropriately (note there may be some overplotting/masking of points…I’m not sure how big the x-axis time bins are…)

#Identify and colour the RTs...
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#Identify classic style RTs
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if ( 'T' else 'RT')

So now we can see when folk entered into the hashtag community via a classic RT.

We can also start to explore who was classically retweeted when:

#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...

Another view might show who was classically RTd by whom (activity along a row indicating someone was retweeted a lot through one or more tweets, activity within a column identifying an individual who RTs a lot…):

#We can start to get a feel for who RTs whom...
#We don't want to display screenNames of folk who tweeted but didn't RT
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
ggplot(subset(tw.df.rt,subset=(!,y=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)

What sense you might make of all this, or where to take it next, is down to you of course… Err, erm…?! ;-)

PS see also:


  1. Pingback: What is the Potential Audience Size for a Hashtag Community? « OUseful.Info, the blog…
  2. Pingback: Do Retweeters Lack Commitment to a Hashtag? « OUseful.Info, the blog…
  3. Pingback: Visualización de la propagación de #estemambGrecia #estamosconGrecia -- La maldekstra kolono
  4. Pingback: Visualising Twitter User Timeline Activity in R « OUseful.Info, the blog…
  5. Pingback: Estimated Follower Accession Charts for Twitter | OUseful.Info, the blog...
  6. Pingback: Wrangling Data With “Free” Tools – LASI13 Workshop Round-Up | OUseful.Info, the blog...