Generating Twitter Wordclouds in R (Prompted by an Open Learning Blogpost)

A couple of weeks ago I saw a great example of an open learning blogpost from @katy_bird: Generating a word cloud (or not) from a Twitter hashtag. It described the trials and tribulations associated with trying to satisfy a request for the generation of a wordcloud based on tweets associated with a specific Twitter hashtag. A seemingly simple task, you might think, but things are never that easy… If you read the post, you’ll see Katy identified several problems, or stumbling blocks, along the way, as well as how she addressed them. There’s also a bit of reflection on the process as a whole.

Reading the post the first time (and again, just now), completely set me up for the day. It had a little bit of everyhting: a goal statement, the identification of a set of problems associated with trying to complete the task, some commentary on how the problems were tackled, and some reflection on the process as a whole. The post thus serves the purpose of capturing a problem discovery process, as well as the steps taken to try and solve each problem (although full documentation is lacking… This is something I have learned over the years: to use something like a gist on github to actually keep a copy of any code I generated to solve the problem, linked to for reuse by myself and others from the associated blog post). The post captures a glimpse back at a moment in time – when Katy didn’t know how to generate a wordcloud – from the joyful moment at which she has just learned how to generate said wordcloud. More importantly, the post describes the learning problems that became evident whilst trying to achieve the goal in such a way that they can act as hooks on which others can hang alternative or additional ways of solving the problem, or act as mentor.

By identifying the learning journey and problems discovered along the way, Katy’s record of her learning strategy also provides an authentic, learner centric perspective on what’s involved in trying to create a wordcloud around a twitter hashtag.

Reading the post again has also prompted me to blog this recipe, largely copied from the RDataMining post Using Text Mining to Find Out What @RDataMining Tweets are About, for generating a word cloud around a twitter hashtag using R (I use RStudio; the recipe requires at least the twitteR and tm libraries):

#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe

##Note: there are some handy, basic Twitter related functions here:
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
#Then for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))

##Wordcloud - scripts available from various sources; I used:

#Install the textmining library
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus= function(df,my.stopwords=c()){
  #The following is cribbed and seems to do what it says on the can
  tw.corpus= Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus = tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)


  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc=wordcloud(d$word, d$freq, min.freq=min.freq)


##Generate an image file of the wordcloud
png('test.png', width=600,height=600)

#We could make it even easier if we hide away the tweet grabbing code. eg:
  rdmTweets = searchTwitter(searchTerm, n=num)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
#Then we could do something like:

Here’s the result:

PS for an earlier, was broken, now patched, route to sketching a wordcloud from a twitter search using Wordle, see How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps.

Do Retweeters Lack Commitment to a Hashtag?

I seem to be going down more ratholes than usual at the moment, in this case relating to activity round Twitter hashtags. Here’s a quick bit of reflection around a chart from Visualising Activity Around a Twitter Hashtag or Search Term Using R that shows activity around a hashtag that was minted for an event that took place before the sample period.

The y-axis is organised according to the time of first use (within the sample period) of the tag by a particular user. The x axis is time. The dots represent tweets containing the hashtag, coloured blue by default, red if they are an old-style RT (i.e. they begin RT @username:).

So what sorts of thing might we look for in this chart, and what are the problems with it? Several things jump out at me:

  • For many of the users, their first tweet (in this sample period at least) is an RT; that is, they are brought into the hashtag community through issuing an RT;
  • Many of the users whose first use is via an RT don’t use the hashtag again within the sample period. Is this typical? Does this signal represent amplification of the tag without any real sense of engagement with it?
  • A noticeable proportion of folk whose first use is not an RT go on to post further non-RT tweets. Does this represent an ongoing commitment to the tag? Note that this chart does not show whether tweets are replies, or “open” tweets. Replies (that is, tweets beginning @username are likely to represent conversational threads within a tag context rather than “general” tag usage, so it would be worth using an additional colour to identify reply based conversational tweets as such.
  • “New style” retweets are diaplayed as retweets by colouring… I need to check whether or nor newstyle RT information is available that I could use to colour such tweets appropriately. (or alternatively, I’d have to do some sort of string matching to see whether or not a tweet was the same as a previously seen tweet, which is a bit of a pain:-(

(Note that when I started mapping hashtag communities, I used to generate tag user names based on a filtered list of tweets that excluded RTs. this meant that folk who only used the tag as part of an RT and did not originate tweets that contained the tag, either in general or as part of a conversation, would not be counted as a member of the hashtag community. More recently, I have added filters that include RTs but exclude users who used the tag only once, for example, thus retaining serial RTers, but not single use users.)

So what else might this chart tell us? Looking at vertical slices, it seems that news entrants to the tag community appear to come in waves, maybe as part of rapid fire RT bursts. This chart doesn’t tell us for sure that this is happening, but it does highlight areas of the timelime that might be worth investigating more closely if we are interested in what happened at those times when there does appear to be a spike in activity. (Are there any modifications we could make to this chart to make them more informative in this respect? The time resolution is very poor, for example, so being able to zoom in on a particular time might be handy. Or are there other charts that might provide a different lens that can help us see what was happening at those times?)

And as a final point – this stuff may be all very interesting, but is it useful?, And if so, how? I also wonder how generalisable it is to other sorts of communication analysis. For example, I think we could use similar graphical techniques to explore engagement with an active comment thread on a blog, or Google+, or additions to an online forum thread. (For forums with mutliple threads, we maybe need to rethink how this sort of chart would work, or how it might be coloured/what symbols we might use, to distinguish between starting a new thread, or adding to a pre-existing one, for example. I’m sure the literature is filled with dozens of examples for how we might visualise forum activity, so if you know of any good references/links…?! ;-) #lazyacademic)

What is the Potential Audience Size for a Hashtag Community?

What’s the potential audience size around, or ‘reach’ associated with, a Twitter hashtag?

Way back when, in the early days of webs stats, reported figures tended to centre around the notion of hits, the number of calls made to a server via website activity. I forget the details, but the metric was presumably generated from server logs. This measure was always totally unreliable, because in the course of serving a web page, a server might be hit multiple times, once for each separately delivered asset, such as images, javascript files, css files and so on. Hits soon gave way to the notion of Page Views, which more accurately measured the number of pages (rather than assets) served via a website. This was complemented with the notion of Visits and Unique Visits: Visits, as tracked by a cookies, represent a set of pages viewed around about the same time by the same person. Unique Visits (or “Uniques”), represent the number of different people who appear to have visited the site in any given period.

What we see here, then, is a steady evolution in the complexity of website metrics that reflects on the one hand dissatisfaction with one way of measuring or reporting activity, and on the other practical considerations with respect to instrumentation and the ability to capture certain metrics once they are conceived of.

Widespread social media monitoring/tracking is largely still in the realm of “hits” measurement. Personal dashboards for services such as Twitter typically display direct measures provided by the Twitter API, or measures trivially/directly identified from Twitter API or archived data – number of followers, numbers of friends, distribution of updates over time, number of mentions, and so on.

Something both myself and Martin Hawksey have been thinking about on and off for some time are ways of reporting activity around Twitter hashtags. A commonly(?!) asked question in this respect relates to how much engagement (whatever that means) there has been with a particular tag. So here’s a quick mark in the sand about some of my current thinking about this. (Note that these ideas may well have been more formally developed in the academic literature – I’m a bit behind in my reading! If you know something that covers this in more detail, or that I should cite, please feel free to add a link in the comments… #lazyAcademic.)

One of the first metrics that comes to my mind is the number of people who have used a particular hashtag, and the number of their followers. Easily stated, it doesn’t take a lot of thought to realise even these “simple” measures are fraught with difficulty:

  • what counts as a use of the hashtag? If I retweet a measure of yours that contains a hashtag, have I used it in any meaningful sense? Does a “use” mean the creation of a new tweet containing the tag? What about if I reply to a tweet from you than contains the tag and I include the tag in my reply to you, even if I’m not sure what that tag relates to?
  • the potential audience size for the tag (potential uniques?), based on the number of followers of the tag users. At first glance, we might think this can be easily calculated by adding together the follower counts of the tag users, but this is more strictly an approximation of the potential audience: the set of followers of A may include some of the followers of B, or C; do we count the tag users themselves amongst the audience? If so, the upper bound also needs to take into account the fact that none of the users may be followers of any of the other tag users.
    Note there is also a lower bound – the largest follower count amongst the tag users (whatever that means…) of the hashtag. Furthermore, if we want to count the number of folk not using the tag but who may have seen the tag, this lower bound can be revised downwards by subtracting the number of tag users minus one (for the tag user with the largest follower count). The value is still only an approximation, though, becuase it assumes that all the tag users are actually included as followers of at least one, each, of the tag users. (If you think these points are “just academic”, they are and they aren’t – observations like these can often be used to help formulate gaming strategies around metrics based on these measures.)
  • the potential number of views of a tag, for example based on the product of the number of times a user tweets and their follower count?
  • the reach of (or active engagement with?) the tag, as measured by the number of people who actually see the tag, or the number of people who take and action around it (such as replying to a tagged tweet, RTing it, or clicking on a link a tagged tweet contains); note that we may be able ot construct probabilistic models (albeit quite involved ones) of the potential reach based on factors like the number of people someone follows, when they are online, the rate at which the people they follow tweet, and so on..

To try to make this a little more concrete, here are a couple of scripts for exploring the potential audience size of a tag based on the followers of the tag users (where a user is someone who publishes or retweets a tweet containing the tag over a specified period). The first, Python script runs a Twitter search and generates a list of unique users of the tag, along with the timestamp of their first use of the tag within the sample period. This script also grabs all the followers of the tag users, along with their counts, and generates running cumulative (upper bound approximation) count of the tag user follower numbers as well as calculating the rolling set of unique followers to date as each new tag user is observed. The second, R script plots the values.

The first thing we can do is look at the incidence of new users of the hashtag over time:

(For a little more discussion of this sort of chart, see Visualising Activity Around a Twitter Hashtag or Search Term Using R and its inspiration, @mediaczar’s How should Page Admins deal with Flame Wars?.)

More relevant to this post, however, is a plot showing some counts relating to followers of users of the hashtag:

In this case, the top, green line represents the summed total number of followers for tag users as they enter the conversation. If every user had completely different followers, this might be meaningful, but where conversation takes place around a tag between folk who know each other, it’s highly likely that they have followers in common.

The middle, red line shows a count of the number of unique followers to date, based on the the followers of users of the tag to date.

The lower, blue line shows the difference between the red and green lines. This represents the error between the summed follower counts and the actual number of unique followers.

Here’s a view over the number of new unique potential audience members at each time step (I think the use of the line chart here may be a mistake… I think bars/lineranges would probably be more appropriate…):

In the following chart, I overplot oneline with another. The lower layer (a red line) is the total follower account for each new tag user. The blue is the increase in the potential audience count (that is, the number of the new users’ followers that haven’t potentially seen the tag so far). The range of the visible part of the red line thus shows the number of a new tag user’s followers who have potentially already seen the tag. Err… maybe (that is, if my code is correct and all the scripts are doing what I think they’re doing! If they aren’t, then just treat this post as an exploration of the sorts of charts we might be able to produce to explore audience reach;-)

Here are the scripts (such as they are!)

import newt,csv,tweepy
import networkx as nx

#the term we're going to search for
#how many tweets to search for (max 1500)

##Something along lines of:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(SKEY, SSECRET)
api = tweepy.API(auth, cache=tweepy.FileCache('cache',cachetime), retry_errors=[500], retry_delay=5, retry_count=2)

#You need to do some work here to search the Twitter API
tweeters, tweets=yourSearchTwitterFunction(api,tag,num)
#tweeters is a list of folk who tweeted the term of interest
#tweets is a list of the Twitter tweet objects returned from the search
#My code for this is tightly bound up in a large and rambling library atm...

#Put tweets into chronological order

#I was being lazy and wasn't sure what vars I needed or what I was trying to do when I started this!
#The whole thing really needs rewriting...
#runtot is crude and doesn't measure overlap

#Construct a digraph from folk using the tag to their followers

for tweet in tweets:
	if user not in tweepFo:
		print "Getting follower data for", str(user), str(len(tweepFo)), 'of', str(len(tweeters))
		userID=tweet['from_user_id'] #check
		for m in mi:
			#construct graph
		#seen to date is all people who have seen so far, plus new ones, so it's the union
		#I'm weighting the edges so we can count how many times folk see the hashtag
		if len(DG.edges(userID))>0:
			for fromN,toN in DG.edges(userID):

for ts,l,u,ct,ufc,ols in uniqSourceFo:
	print ts,l


print "Writing graph.."
for n in DG:
	if>1: filter.append(n)
nx.write_graphml(H, 'reports/tmp/'+tag+'_ncount_2up.graphml')
print "Writing other graph.."
nx.write_graphml(DG, 'reports/tmp/'+tag+'_ncount.graphml')

Here’s the R script…

ddj_ncount <- read.csv("~/code/twapps/newt/reports/tmp/ddj_ncount.csv")
#Convert the datetime string to a time object
ddj_ncount$ttime=as.POSIXct(strptime(ddj_ncount$datetime, "%a, %d %b %Y %H:%M:%S"),tz='UTC')

#Order the newuser factor levels into the order in which they first use the tag
ddj_ncount$newuser=factor(ddj_ncount$newuser, levels = dda$newuser)

#Plot when each user first used the tag against time
ggplot(ddj_ncount) + geom_point(aes(x=ttime,y=newuser)) + opts(axis.text.x=theme_text(size=6),axis.text.y=theme_text(size=4))

#Plot the cumulative and union flavours of increasing possible audience size, as well as the difference between them
ggplot(ddj_ncount) + geom_line(aes(x=ttime,y=count,col='Unique followers')) + geom_line(aes(x=ttime,y=crudetot,col='Cumulative followers')) + geom_line(aes(x=ttime,y=crudetot-count,col='Repeated followers')) + labs(colour='Type') + xlab(NULL)

#Number of new unique followers introduced at each time step
ggplot(ddj_ncount)+geom_line(aes(x=ttime,y=count-previousCount,col='Actual delta'))

#Try to get some idea of how many of the followers of a new user are actually new potential audience members
ggplot(ddj_ncount) + opts(axis.text.x=theme_text(angle=-90,size=4)) + geom_linerange(aes(x=newuser,ymin=0,ymax=userFoCount,col='Follower count')) + geom_linerange(aes(x=newuser,ymin=0,ymax=(count-previousCount),col='Actual new audience'))

#This is still a bit experimental
#I'm playing around trying to see what proportion or number of a users followers are new to, or subsumed by, the potential audience of the tag to date...
ggplot(ddj_ncount) + geom_linerange(aes(x=newuser,ymin=0,ymax=1-(count-previousCount)/userFoCount)) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)

In the next couple of posts in this series, I’ll start to describe how we can chart the potential increase in audience count as a delta for each new tagger, along with a couple of ways of trying to get some initial sort of sense out of the graph file, such as the distribution of the potential number of “views” of a tag across the unique potential audience members…

PS See also the follow on post More Thoughts on Potential Audience Metrics for Hashtag Communities

Dangers of a Walled Garden…

Reading a recent Economist article (The value of friendship) about the announcement last week that Facebook is to float as a public company, and being amazed as ever about how these valuations, err, work, I recalled a couple of observations from a @currybet post about the Guardian Facebook app (“The Guardian’s Facebook app” – Martin Belam at news:rewired). The first related to using Facebook apps to (only partially successfully) capture attention of folk on Facebook and get them to refocus it on the Guardian website:

We knew that 77% of visits to the Guardian from only lasted for one page. A good hypothesis for this was that leaving the confines of Facebook to visit another site was an interruption to a Facebook session, rather than a decision to go off and browse another site. We began to wonder what it would be like if you could visit the Guardian whilst still within Facebook, signed in, chatting and sharing with your friends. Within that environment could we show users a selection of other content that would appeal to them, and tempt them to stay with our content a little bit longer, even if they weren’t on our domain.

The second thing that came to mind related to the economic/business models around the app Facebook app itself:

The Guardian Facebook app is a canvas app. That means the bulk of the page is served by us within an iFrame on the Facebook domain. All the revenue from advertising served in that area of the page is ours, and for launch we engaged a sponsor to take the full inventory across the app. Facebook earn the revenue from advertising placed around the edges of the page.

I’m not sure if Facebook runs CPM (cost per thousand) display based ads, where advertisers pay per impression, or follow the Google AdWords model, where advertisers pay per click (PPC), but it got me wondering… A large number of folk on Facebook (and Twitter) share links to third party websites external to Facebook. As Martin Belam points out, the user return rate back to Facebook for folk visiting third party sites from Facebook seems very high – folk seem to follow a link from Facebook, consume that item, return to Facebook. Facebook makes an increasing chunk of its revenue from ads it sells on (though with the amount of furniture and Facebook open graph code it’s getting folk to include on their own websites, it presumably wouldn’t be so hard for them to roll out their own ad network to place ads on third party sites?) so keeping eyeballs on Facebook is presumably in their commercial interest.

In Twitter land, where the VC folk are presumably starting to wonder when the money tap will start to flow, I notice “sponsored tweets” are starting to appear in search results:

ANother twitter search irrelevance

Relevance still appears to be quite low, possibly because they haven’t yet got enough ads to cover a wide range of keywords or prompts:

Dodgy twitter promoted tweet

(Personally, if the relevance score was low, I wouldn’t place the ad, or I’d serve an ad tuned to the user, rather than the content, per se…)

Again, with Twitter, a lot of sharing results in users being taken to external sites, from which they quickly return to the Twitter context. Keeping folk in the Twitter context for images and videos through pop-up viewers or embedded content in the client is also a strategy pursued in may Twitter clients.

So here’s the thought, though it’s probably a commercially suicidal one: at the moment, Facebook and Twitter and Google+ all automatically “linkify” URLs (though Google+ also takes the strategy of previewing the first few lines of a single linked to page within a Google+ post). That is, given a URL in a post, they turn it into a link. But what if they turned that linkifier off for a domain, unless a fee was paid to turn it back on. Or what if the linkifier was turned off if the number of clickthrus on links to a particular domain, or page within a domain, exceeded a particular threshold, and could only be turned on again at a metered, CPM rate. (Memories here of different models for getting folk to pay for bandwidth, because what we have here is access to bandwidth out of the immediate Facebook, Twitter or Google+ context).

As a revenue model, the losses associated with irritating users would probably outweigh any revenue benefits, but as a thought experiment, it maybe suggests that we need to start paying more attention to how these large attention-consuming services are increasingly trying to cocoon us in their context (anyone remember AOL, or to a lesser extent Yahoo, or Microsoft?), rather than playing nicely with the rest of the web.

PS Hmmm…”app”. One default interpretation of this is “app on phone”, but “Facebook app” means an app that runs on the Facebook platform… So for any give app, that it is an “app” implies that that particular variant means “software application that runs on a proprietary platform”, which might actually be a combination of hardware and software platforms (e.g. Facebook API and Android phone)???

Social Media Interest Maps of Newsnight and BBCQT Twitterers

I grabbed independent samples of 1500 recent users of the #newsnight and #bbcqt hashtags within a minute or two of each other about half an hour ago. Here’s who’s followed by 25 or more of the recent hashtaggers in each case. Can you distinguish the programmes each audience interest projection map relates to?

Here’s the first one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?

#bbcqt 1500 forward friends 25 25

Here’s the second one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?

#newsnight 1500   forward friends  projection 25 25

The answer is a only a click away…

PS I’ve got a couple of scripts in the pipeline that should be able to generate data that I can use to generate this sort of differencing word cloud, the idea being I should be able to identify at a glance accounts that different hashtag communities both follow, and accounts that they differently follow…

UPDATE: so here’s a quick first pass at comparing the audiences. I’m not sure how reliable the method is, but it’s as follows:

– for each hashtag, grab 1500 recent tweets. Grab the list of folk the hashtagging users follow and retain a list (the ‘interest list’) of folk followed by at least 25 of the hashtaggers. Filter the hashtagger list so that it only contains hashtaggers who follow at least 25 people (this cuts out brand new users and newly created spam accounts). Count the number of filtered hashtaggers that follow each person in the interest list, and normalise by dividing through by the total number of filtered hashtaggers. To recap, for each tag, we now have a list of folk who were popularly followed by users of that tag, along with a number for each one between 0 and 1 describing proportionally how much of the hashtagging sample follow them.

(Note that there may be all sorts of sampling errors… I guess I need to qualify reports with the number of unique folk tweeting in the twitter sample captured. I maybe also need to improve sampling so rather than searching for 1500 tweets, I generate a sample of 1000 unique users of the tag?)

I then load these files into R and run through the following process:

#Multiply this nromalised follower proportion by 1000 and round down to get an integer between 0 and 1000 representing a score relative to the proportion of filtered hashtagger who follow each person in the interest list.

#ANother filtering step: we're going to plot similarities and differences between folk followed by at least 25% of the corresponding filtered hashtaggers

#Now generate a dataframe

#replace the NA cell values (where for example someone in the bbcqt list is not in the newsnight list
qtvnn[] <- 0

That generates a dataframe that looks something like this:

      username bbcqt newsnight
1    Aiannucci   414       408
2  BBCBreaking   455       464
3 BBCNewsnight   317       509
4  BBCPolitics     0       256
5   BBCr4today     0       356
6  BarackObama   296       334

Thanks to Josh O’Brien on Stack Overflow, I can recast this data frame into a term.matrix that plays nicely with the latest version of the R wordcloud package.

mat <- as.matrix(qtvnn[-1])
dimnames(mat)[1] <- qtvnn[1] = mat) = mat)

Here’s the result – commonly followed folk:

And differentially followed folk (at above the 25% level, remember…)

So from this what can we say? Both audiences have a general news interest, into pop politics and perhaps satirical comedy, maybe leaning to the left? The Question Time audience is a more casual audience, more minded to following celebrities, whereas the Newsnight audience is a bit more into following notable media folk (journalists, editors) and also political news. (I’d be keen to hear any other readings of these maps – please feel free to leave a comment containing your interpretations/observations/reading:-)

UPDATE2: to try to get a handle on what the word clouds might be telling us from an alternative visual perspective on the data, rather than inspecting the actual code for example, here’s a scatterplot showing how the follower proportions compare directly:

COmparison of who #newsnight and #bbcqt hashtaggers follow

ggplot(na.omit(subset(qtvnn,bbcqt>0 & newsnight>0))) + geom_text(aes(x=bbcqt,y=newsnight,label=username,angl=45),size=4) + xlim(200,600) + ylim(200,600) + geom_abline(intercept=0, slope=1,colour='grey')

Here’s another view – this time plotting followed folk for each tag who are not followed by the friends of the other tag [at at least the 25% level]:

hashtag comparison - folk not follwed by other tag

I couldn’t remember/didn’t have Google to hand to find the best way of reshaping the data for this, so I ended up with a horrible horrible hack…

nn=data.frame(typ='newsnight',subset(qtvnn,select=c(username,newsnight),subset=(newsnight>0 & bbcqt==0)))
qt=data.frame(typ='bbcqt',subset(qtvnn,select=c(username,bbcqt),subset=(newsnight==0 & bbcqt>0)))

I think this is all starting to get to the point where I need to team up with a proper developer and get *all* the code properly written and documented before any errors that are currently there get baked in too deeply…

Getting Started With Twitter Analysis in R

Earlier today, I saw a post vis the aggregating R-Bloggers service a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walktrhough of how to grab tweets into an R session using the twitteR library, and then do some text mining on it.

I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…

Starting from @RDataMiner’s lead, here’s what I did… (Notes: I use R in an R-Studio context. If you follow through the example and a library appears to be missing, from the Packages tab search for the missing library and import it, then try to reload the library in the script. The # denotes a commented out line.)

#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)

#Create a dataframe based around the results
df <-"rbind", lapply(rdmTweets,
#Here are the columns
#And some example content

So what can we do out of the can? One thing is look to see who was tweeting most in the sample we collected:


# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
barplot(cc,las=2,cex.names =0.3)

Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:

#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))

#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))

So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…


Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored with in a comment, or in your own post with a link in my comments;-)

Fishing for OU Twitter Folk…

Just a quick observation inspired by the online “focus group” on Twitter yesterday around the #twitterou hashtag (a discussion for OU folk about Twitter usage): a few minutes in to the discussion, I grabbed a list of the folk who had used the tag so far (about 10 or people at the time), pulled down a list of the people they followed to construct a graph of hashtaggers->friends, and then filtered the resulting graph to show folk with node degree of 5 or more.

twitterOU - folk followed by 5 or more folk using twitterou before 2.10 or so today

Because a large number of OU Twitter folk follow each other, the graph is quite dense, which means that if we take a sample of known OU users and look for people that a majority of that sample follow, we stand a reasonable chance of identifying other OU folk…

Doing a bit of List Intelligence (looking up the lists that a significant number of hashtag users were on, I identified several OU folk Twitter lists, most notably @liamgh/planetou and @guyweb/openuniversity.

Just for completeness, it’s also worth pointing out that simple community analysis of followers of a known OU person might also turn up OU clusters, e.g. as described in Digging Deeper into the Structure of My Twitter Friends Network: Librarian Spotting. I suspect if we did clique analysis on the followers, this might also identify ‘core’ members of organisational communities that could be used to seed a snowball discovery mechanism for more members of that organisation.

PS hmmm… maybe I need to do a post or two on how we might go about discovering enterprise/organisation networks/communities on Twitter…?