Getting Started With Twitter Analysis in R
Earlier today, via the R-Bloggers aggregator, I saw a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walkthrough of how to grab tweets into an R session using the twitteR library, and then do some text mining on them.
I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…
Starting from @RDataMining’s lead, here’s what I did… (Notes: I use R in RStudio. If you follow through the example and a library appears to be missing, search for it from the Packages tab and install it, then try loading the library in the script again. The # denotes a commented-out line.)
require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)
#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content
head(df,3)
So what can we do out of the can? One thing is to look to see who was tweeting most in the sample we collected:
counts=table(df$screenName)
barplot(counts)
# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)
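If you want to try the counting step without a live Twitter connection, here is a minimal sketch on made-up data. The `screen_names` vector is a hypothetical stand-in for df$screenName, and sorting the counts is my own addition rather than part of the original recipe:

```r
# Hypothetical stand-in for df$screenName (made-up user names)
screen_names <- c("alice", "bob", "alice", "carol", "bob", "alice", "dave")

# Count tweets per user
counts <- table(screen_names)

# Keep only users with two or more tweets, most active first
top <- sort(counts[counts > 1], decreasing = TRUE)

# las=2 turns the axis labels perpendicular so long names stay readable
barplot(top, las = 2)
```

Indexing the table directly with `counts[counts > 1]` should give the same result as the subset() call used above.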
Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:
#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))
#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
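To check what the two patterns actually pull out, here is a small sketch run against a few made-up tweets. The `tweets` vector is hypothetical, and I have used `+` rather than `*` in the patterns so that a bare @ on its own does not count as a match. As far as I can tell, str_extract and str_match work on whole vectors, so the sapply wrappers are not strictly needed:

```r
library(stringr)

# Hypothetical sample tweets: a directed message, a retweet and a plain tweet
tweets <- c("@alice are you at #mozfest?",
            "RT @bob: great session",
            "Just arrived at #mozfest")

# Who the tweet is addressed to (NA when it does not start with @name)
to <- sub("@", "", str_extract(tweets, "^@[[:alnum:]_]+"))

# Who was retweeted: column 2 of the match matrix holds the captured group
rt <- str_match(tweets, "^RT @([[:alnum:]_]+)")[, 2]

print(data.frame(text = tweets, to = to, rt = rt))
```

Tweets that match neither pattern end up with NA in both columns, which is what na.omit() relies on when plotting.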
So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…
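require(ggplot2)
#Note: opts() and theme_text() have since been removed from ggplot2; theme() and element_text() are the current equivalents
ggplot()+geom_bar(aes(x=na.omit(df$rt)))+theme(axis.text.x=element_text(angle=-90,size=6))+xlab(NULL)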
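One variation that might be worth a try is counting the retweets first and ordering the bars by frequency. This is only a sketch on made-up data: the `rt` vector is a hypothetical stand-in for df$rt, and the `rt_counts`, `user` and `n` names are my own:

```r
library(ggplot2)

# Hypothetical stand-in for df$rt: NA marks tweets that are not retweets
rt <- c("alice", NA, "bob", "alice", "carol", NA, "alice", "bob")

# table() drops NA values by default, so no na.omit() is needed here
rt_counts <- as.data.frame(table(rt), stringsAsFactors = FALSE)
names(rt_counts) <- c("user", "n")

# reorder() puts the most retweeted user first on the x axis
p <- ggplot(rt_counts, aes(x = reorder(user, -n), y = n)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = -90, size = 6)) +
  xlab(NULL) +
  ylab("Retweets in sample")
print(p)
```

Counting up front also leaves you with a small data frame of totals that can be inspected or saved separately from the chart.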
Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored in a comment, or in your own post with a link in my comments;-)



Hi Tony,
do you know the CRAN task view for NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html ?
I am sure you will find an inspiring collection of R packages to conduct more content analysis. My tip: look at the tm architecture and do something like topic maps or lsa on top of it to cluster keywords or utterances or persons — very interesting.
Best,
Fridolin
Fridolin Wild
November 10, 2011 at 12:01 pm
Hi Fridolin – thanks for the tip; duly added to my to do list;-)
Tony Hirst
November 10, 2011 at 1:04 pm
[...] #Data_Science – Quick Start in Tweet analysis using R [...]
State of Data #75 « Dr Data's Blog
November 17, 2011 at 11:17 pm
[...] on the R code from that example, along with a couple of routines from my own previous R’n'Twitter experiments, here’s some R code that will grab items from a Twapperkeeper archive and parse them into a [...]
Rescuing Twapperkeeper Archives Before They Vanish « OUseful.Info, the blog…
December 10, 2011 at 9:41 pm
[...] the by, we can also generate stats’n graphs of the contents of the archive. For example, via Getting Started With Twitter Analysis in R, we can generate a bar plot to show who was retweeted [...]
A Tool Chain for Plotting Twitter Archive Retweet Graphs – Py, R, Gephi « OUseful.Info, the blog…
December 21, 2011 at 4:55 pm
When I was trying the last command, I just got "Error: could not find function "ggplot"".
I’ve installed the package "ggplot2", so I cannot get what I want here.
Does anyone have the same problem as me?
oppih Xue
December 25, 2011 at 1:44 pm
Oh, I forgot to require(ggplot2) after I installed it :(
oppih Xue
December 25, 2011 at 1:48 pm
Excellent resource. I’ve posted some code on how to get started on Twitter and Ruby under http://twitterresearcher.wordpress.com/
plotti2k1
April 30, 2012 at 2:33 pm
Thanks for that link – and the examples:-)
I’ve put an example output of a script in (slow) progress for analysing hashtag activity at http://psychemedia.posterous.com/quick-and-dirty-report-on-recent-usage-of-the There’s also a link to the R script that generated that doc from the post itself.
Tony Hirst
April 30, 2012 at 4:38 pm
[...] then I discover that that whole paragraph (and around 5 hours work) could be achieved using R and about 15 lines of code. I think that’s a job for Monday evening though as I first need to learn how to import excel [...]
(Preparing to) Analyse Twitter data with excel
April 30, 2012 at 5:33 pm
Interesting example. It is great that R can now access tweets. Right now, I am trying to implement a digital marketing project that requires me to get 2-month’s worth of data from Twitter, during December and January. Basically I have to get all the tweets tagged with a location identifier (checkins from Foursquare or Gowalla), in a radius of 50 kilometers around the center of Brussels, Belgium, but only the locations that are supermarkets of certain brands (e.g. delhaize, colruyt, carrefour) AND the fast-food restaurants (mcdonalds, quick, pizza hut).
If I could somehow build a database containing these particular tweets with their corresponding timestamp and location stamp it would be absolutely great.
My questions are:
Can this be done?
If yes how would the data gathering be done. Every week for the previous week, until the 1st of February?
Where would I store all this data. In one file? If yes, how do I append it so that the tweets are numbered correctly?
Many thanks for an answer
Ciprian Begu (@ciprianbegu)
November 23, 2012 at 9:40 am
Hi Ciprian
I briefly considered grabbing samples of folk around supermarkets using Twitter/Foursquare a few months ago, eg searching for things like: “I’m at Tesco” near:”milton keynes” within:150mi
One thing you could do is use one of Martin Hawksey’s Twitter archiving spreadsheets – http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/
Tony Hirst
November 23, 2012 at 10:27 am
[...] The second which I will be posting examples of my work with is called “Getting Started with Twitter Analysis in R” by AJ Hirst and can be found here: < http://blog.ouseful.info/2011/11/09/getting-started-with-twitter-analysis-in-r/> [...]
twitteR Visualizations
April 29, 2013 at 5:05 am
[…] its package twitteR makes it even easier… After adapting the code from a really useful post (see here), I obtained data relating to twitter users and the number of times they used certain hashtags (see […]
…a scientific crowd | FreshBiostats
May 17, 2013 at 7:19 pm