Generating Twitter Wordclouds in R (Prompted by an Open Learning Blogpost)

A couple of weeks ago I saw a great example of an open learning blogpost from @katy_bird: Generating a word cloud (or not) from a Twitter hashtag. It described the trials and tribulations associated with trying to satisfy a request for the generation of a wordcloud based on tweets associated with a specific Twitter hashtag. A seemingly simple task, you might think, but things are never that easy… If you read the post, you’ll see Katy identified several problems, or stumbling blocks, along the way, as well as how she addressed them. There’s also a bit of reflection on the process as a whole.

Reading the post the first time (and again, just now), completely set me up for the day. It had a little bit of everyhting: a goal statement, the identification of a set of problems associated with trying to complete the task, some commentary on how the problems were tackled, and some reflection on the process as a whole. The post thus serves the purpose of capturing a problem discovery process, as well as the steps taken to try and solve each problem (although full documentation is lacking… This is something I have learned over the years: to use something like a gist on github to actually keep a copy of any code I generated to solve the problem, linked to for reuse by myself and others from the associated blog post). The post captures a glimpse back at a moment in time – when Katy didn’t know how to generate a wordcloud – from the joyful moment at which she has just learned how to generate said wordcloud. More importantly, the post describes the learning problems that became evident whilst trying to achieve the goal in such a way that they can act as hooks on which others can hang alternative or additional ways of solving the problem, or act as mentor.

By identifying the learning journey and problems discovered along the way, Katy’s record of her learning strategy also provides an authentic, learner centric perspective on what’s involved in trying to create a wordcloud around a twitter hashtag.

Reading the post again has also prompted me to blog this recipe, largely copied from the RDataMining post Using Text Mining to Find Out What @RDataMining Tweets are About, for generating a word cloud around a twitter hashtag using R (I use RStudio; the recipe requires at least the twitteR and tm libraries):

require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)

##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))

##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/

#Install the textmining library
require(tm)
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus= function(df,my.stopwords=c()){
  #The following is cribbed and seems to do what it says on the can
  tw.corpus= Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus = tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)

  tw.corpus
}

wordcloud.generate=function(corpus,min.freq=3){
  require(wordcloud)
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc=wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}

print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))

##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()

#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber=function(searchTerm,num=500){
  require(twitteR)
  rdmTweets = searchTwitter(searchTerm, n=num)
  tw.df=twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)

Here’s the result:

PS for an earlier, was broken, now patched, route to sketching a wordcloud from a twitter search using Wordle, see How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

6 thoughts on “Generating Twitter Wordclouds in R (Prompted by an Open Learning Blogpost)”

  1. Pingback: Dr Data's Blog
  2. Hi, thanks for sharing this script!

    The following line caused an error for me:

    tw.df=twListToDF(rdmTweets)

    The error was: Error: could not find function “twListToDF”

    I changed the command to the following which resolved the issue:

    tw.df=do.call(‘rbind’,lapply(rdmTweets, as.data.frame))

    I’m only new to R and lost as to why I encountered this error. The twitteR package had been installed and loaded.

    Any advise would be greatly appreciated.

    Looking forward to learning more and putting this script to use.

    Thank you.

    Johann

    1. @johann Hmm.. I thought that was a twitteR function? Have you got the latest version f the package installed? (Thanks for posting the workaround, btw :-)

  3. Hi,
    In order to run this script, I need “Rcpp” package. However, I couldn’t install package ‘Rcpp’ . Please help.

    1. @raj you should be able to install this package from CRAN I think? Or maybe it requires a more recent version of R than the one you are running before it will install?

Comments are closed.