I’ve largely avoided “time” in R to date, but following a chat with @mhawksey at #dev8d yesterday, I went down a rathole last night exploring a few ways of visualising a Twitter user timeline. As a result, I also had a quick initial play with some time handling features of R, such as timeseries objects, as well as a go at generating daily, weekly and monthly summary counts of data values.

To start, let’s grab a user timeline. As Martin started it (?!), we’ll use his…;-)

require(twitteR) username='TWITTERUSERNAME' #the most tweets we can bring back from a user timeline is the most recent 3600... mht=userTimeline(username,n=3200) tw.df=twListToDF(mht) #As I've done in previous scripts, pull out the names of folk who have been "old-fashioned RTd"... require(stringr) trim <- function (x) sub('@','',x) tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2])) tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')

The returned data includes a *created* attribute (of the form “2012-02-17 11:40:25”) and a *replyToSN* attribute that includes the username of a user Martin was replying to via a particular tweet.

The simplest way I can think of displaying the data is to just display the *screenName* atrribute of the sender (which in this case is always *mhawskey*) against time:

require(ggplot2) ggplot(tw.df)+geom_point(aes(x=created,y=screenName))

As ever, things are never that simple… some tweets with old dates appear to have crept in somehow… A couple of things I tried realting to time based filtering caused R to have all sorts of malloc errors, so here’s a fudge I found to just display tweets that were created within the last 8,000 hours…

tw.dfs=subset(tw.df,subset=((Sys.time()-created)<8000)) ggplot(tw.dfs)+geom_point(aes(x=created,y=screenName))

Okay, so not very interesting… It shows that Martin tweets…

Picking up on views of the style doodled in Visualising Activity Around a Twitter Hashtag or Search Term Using R, where we look at when new users appear in a hashtag stream, we can plot when Martin replies to another twitter user, arranging the user names in the order in which they were first publicly replied to:

require(plyr) #Order the replyToSN factor levels in the order in which they were first created tw.dfx=ddply(tw.dfs, .var = "replyToSN", .fun = function(x) {return(subset(x, created %in% min(created),select=c(replyToSN,created)))}) tw.dfxa=arrange(tw.dfx,-desc(created)) tw.dfs$replyToSN=factor(tw.dfs$replyToSN, levels = tw.dfxa$replyToSN) #and plot the result ggplot(tw.dfs)+geom_point(aes(x=created,y=replyToSN))

The line at the top are tweets where the replyToSN value was NA (not available).

We can then go a little further and plot when folk are replied to or retweeted, as well as tweets that are neither a reply nor an old-style retweet:

ggplot()+geom_point(data=subset(tw.dfs,subset=(!is.na(replyToSN))),aes(x=created,y=replyToSN),col='red') + geom_point(data=subset(tw.dfs,subset=(!is.na(rt))),aes(x=created,y=rt),col='blue') + geom_point(data=subset(tw.dfs,subset=(is.na(replyToSN) & is.na(rt))),aes(x=created,y=screenName),col='green')

Here, the blue dots are old-style retweets, the red dots are replies, and the green dots are tweets that are neither replies nor old-style retweets. If a blue dot appears on a row before a red dot, it shows Martin RT’d them before ever replying to them. If blue dots are on a row that contains no red dot, then it shows Martin has RT’d but not replied to that person. A heavily populated row shows Martin has repeated interactions with that user.

We can generate an ordered bar chart showing who is most heavily replied to:

#First we need to count how many replies a user gets... #http://stackoverflow.com/a/3255448/454773 r_table <- table(tw.dfs$replyToSN) #..rank them... r_levels <- names(r_table)[order(-r_table)] #..and use this ordering to order the factor levels... tw.dfs$replyToSN <- factor(tw.dfs$replyToSN, levels = r_levels) #Then we can plot the chart... ggplot(subset(tw.dfs,subset=(!is.na(replyToSN))),aes(x=replyToSN)) + geom_bar(aes(y = (..count..)))+opts(axis.text.x=theme_text(angle=-90,size=6))

(Hmmm… how would I filter this to only show folk replied to more than 50 times, for example? [UPDATE: here’s a partial, related recipe: ****BUT IT SEEMS TO BE BROKEN AND I CAN”T SEE HOW TO FIX IT ATM…

`require(gdata)
tt=as.data.frame(table(tw.dfs$replyToSN))
#Filter to retain users with Freq above some threshold, then drop spare levels
tts=drop.levels(subset(tt,subset=(Freq>5)))
ggplot(tts)+geom_bar(stat='identity',aes(x=Var1,y=Freq))`

Note that the order of the factors needs rearranging. Something like this maybe?

orderLevels=function(dfc.name,dfc.val){ factor(dfc.name, levels = reorder(dfc.name,dfc.val)) }

then:

`tts$Var1=orderLevels(tts$Var1,tts$Freq)
ggplot(tts)+geom_bar(stat='identity',aes(x=Var1,y=Freq))`

Taking the simplification (?!) further:

orderedSubset=function(dfc,min=5){ require(gdata) tmp1=as.data.frame(table(dfc)) tmp2=drop.levels(subset(tmp1,subset=(Freq>=min))) tmp2$Var1=factor(tmp2$Var1, levels = reorder(tmp2$Var1,tmp2$Freq)) tmp2 } ggplot(orderedSubset(tw.dfs$replyToSN))+geom_bar(stat='identity',aes(x=Var1,y=Freq)) ggplot(orderedSubset(tw.dfs$replyToSN,50))+geom_bar(stat='identity',aes(x=Var1,y=Freq))

..or even further…

plotOrderedSubset=function(dfc,min=5){ ggplot(orderedSubset(dfc,min))+geom_bar(stat='identity',aes(x=Var1,y=Freq)) } plotOrderedSubset(tw.dfs$replyToSN) plotOrderedSubset(tw.dfs$replyToSN,20)

])

Sometimes, a text view is easier…

head(table(tw.dfs$replyToSN)) #eg returns: #psychemedia wilm ambrouk sheilmcn dajbelshaw manmalik 394 66 59 53 48 43 #Hmm..can we generalise this? topTastic=function(dfc,num=5){ r_table <- table(dfc) r_levels <- names(r_table)[order(-r_table)] head(table(factor(dfc, levels = r_levels)),num) } #so now, for example, I should be able to display the most old-style retweeted folk? topTastic(tw.dfs$rt) #or the 10 most replied to... topTastic(tw.dfs$replyToSN,10)

Let’s try some time stuff now… From the R Cookbook, I find I can do this:

#label a tweet with the month number tw.dfs$month=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$mon}) #label a tweet with the hour tw.dfs$hour=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$hour}) #label a tweet with a number corresponding to the day of the week tw.dfs$wday=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$wday})

What this means is we can now chart a count of the number of tweets by day, week, or hour… For example, here’s hour vs. day of the week:

ggplot(tw.dfs)+geom_jitter(aes(x=wday,y=hour))

Note that this jittered scattergraph, where each dot is a tweet, only approximates the time each tweet occurred – the jitter applied is a random quantity designed to separate out tweets posted within the same hour-and-day-of-the-week bin.

What about Martin’s tweeting behaviour over time?

#We can also generate barplots showing the distribution of tweet count over time: ggplot(tw.dfs,aes(x=created))+geom_bar(aes(y = (..count..))) #Hmm... I'm not sure how to manually set binwidth= sensibly, though?!

Here’s a plot of the number of counts per… *I’m not sure*: the bin width was calculated automatically…

How about using the number of tweets in a particular day or hour bin to see what times of day or days of week Martin is tweeting?

#We can also plot the number of tweets within particular hour or time bins... ggplot(tw.dfs,aes(x=wday))+geom_bar(aes(y = (..count..)),binwidth=1) ggplot(tw.dfs,aes(x=hour))+geom_bar(aes(y = (..count..)),binwidth=1)

This chart shows activity (in terms of count…) per hour of day.

As well as doing the count of tweets per hour, for example, via a ggplot statistical graphical function, we can also get day, week, month, quarter and year counts from a set of functions associated with a particular sort of timeseries object…

Each element in a time series typically has two elements – a timestamp, and a numeric value. We can generate a time series of a sort around a twitter usertimeline by creating a dummy quantity – such as the unit value, 1 – and associate it with each timestamp:

require(xts) #The xts function creates a timeline from a vector of values and a vector of timestamps. #If we know how many tweets we have, we can just create a simple list or vector containing that number of 1s ts=xts(rep(1,times=nrow(tw.dfs)),tw.dfs$created) #We can now do some handy number crunching on the timeseries, such as applying a formula to values contained with day, week, month, quarter or year time bins. #So for example, if we sum the unit values in daily bin, we can get a count of the number of tweets per day ts.sum=apply.daily(ts,sum) #also apply. weekly, monthly, quarterly, yearly #If for any resason we need to turn the timeseries into a dataframe, we can: #http://stackoverflow.com/a/3387259/454773 ts.sum.df=data.frame(date=index(ts.sum), coredata(ts.sum)) colnames(ts.sum.df)=c('date','sum') #We can then use ggplot to plot the timeseries... ggplot(ts.sum.df)+geom_line(aes(x=date,y=sum))

#Having got the data in a timeseries form, we can do timeseries based things to it... such as checking the autocorrelation: acf(ts.sum)

Hmmm.. so, one day is much the same as another, but there also appears to be a weekly (7 day periodicity) pattern…

Finally, here’s a handy script I found on the Revolution Analytics site for Charting time series as calendar heat maps in R:

############################################################################## # Calendar Heatmap # # by # # Paul Bleicher # # an R version of a graphic from: # # http://stat-computing.org/dataexpo/2009/posters/wicklin-allison.pdf # # requires lattice, chron, grid packages # ############################################################################## ## calendarHeat: An R function to display time-series data as a calendar heatmap ## Copyright 2009 Humedica. All rights reserved. ## This program is free software; you can redistribute it and/or modify ## it under the terms of the GNU General Public License as published by ## the Free Software Foundation; either version 2 of the License, or ## (at your option) any later version. ## This program is distributed in the hope that it will be useful, ## but WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ## GNU General Public License for more details. ## You can find a copy of the GNU General Public License, Version 2 at: ## http://www.gnu.org/licenses/gpl-2.0.html calendarHeat <- function(dates, values, ncolors=99, color="r2g", varname="Values", date.form = "%Y-%m-%d", ...) { require(lattice) require(grid) require(chron) if (class(dates) == "character" | class(dates) == "factor" ) { dates <- strptime(dates, date.form) } caldat <- data.frame(value = values, dates = dates) min.date <- as.Date(paste(format(min(dates), "%Y"), "-1-1",sep = "")) max.date <- as.Date(paste(format(max(dates), "%Y"), "-12-31", sep = "")) dates.f <- data.frame(date.seq = seq(min.date, max.date, by="days")) # Merge moves data by one day, avoid caldat <- data.frame(date.seq = seq(min.date, max.date, by="days"), value = NA) dates <- as.Date(dates) caldat$value[match(dates, caldat$date.seq)] <- values caldat$dotw <- as.numeric(format(caldat$date.seq, "%w")) caldat$woty <- as.numeric(format(caldat$date.seq, "%U")) + 1 caldat$yr <- as.factor(format(caldat$date.seq, "%Y")) caldat$month <- as.numeric(format(caldat$date.seq, "%m")) yrs <- as.character(unique(caldat$yr)) d.loc <- as.numeric() for (m in min(yrs):max(yrs)) { d.subset <- which(caldat$yr == m) sub.seq <- seq(1,length(d.subset)) d.loc <- c(d.loc, sub.seq) } caldat <- cbind(caldat, seq=d.loc) #color styles r2b <- c("#0571B0", "#92C5DE", "#F7F7F7", "#F4A582", "#CA0020") #red to blue r2g <- c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384") #red to green w2b <- c("#045A8D", "#2B8CBE", "#74A9CF", "#BDC9E1", "#F1EEF6") #white to blue assign("col.sty", get(color)) calendar.pal <- colorRampPalette((col.sty), space = "Lab") def.theme <- lattice.getOption("default.theme") cal.theme <- function() { theme <- list( strip.background = list(col = "transparent"), strip.border = list(col = "transparent"), axis.line = list(col="transparent"), par.strip.text=list(cex=0.8)) } lattice.options(default.theme = cal.theme) yrs <- (unique(caldat$yr)) nyr <- length(yrs) print(cal.plot <- levelplot(value~woty*dotw | yr, data=caldat, as.table=TRUE, aspect=.12, layout = c(1, nyr%%7), between = list(x=0, y=c(1,1)), strip=TRUE, main = paste("Calendar Heat Map of ", varname, sep = ""), scales = list( x = list( at= c(seq(2.9, 52, by=4.42)), labels = month.abb, alternating = c(1, rep(0, (nyr-1))), tck=0, cex = 0.7), y=list( at = c(0, 1, 2, 3, 4, 5, 6), labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"), alternating = 1, cex = 0.6, tck=0)), xlim =c(0.4, 54.6), ylim=c(6.6,-0.6), cuts= ncolors - 1, col.regions = (calendar.pal(ncolors)), xlab="" , ylab="", colorkey= list(col = calendar.pal(ncolors), width = 0.6, height = 0.5), subscripts=TRUE ) ) panel.locs <- trellis.currentLayout() for (row in 1:nrow(panel.locs)) { for (column in 1:ncol(panel.locs)) { if (panel.locs[row, column] > 0) { trellis.focus("panel", row = row, column = column, highlight = FALSE) xyetc <- trellis.panelArgs() subs <- caldat[xyetc$subscripts,] dates.fsubs <- caldat[caldat$yr == unique(subs$yr),] y.start <- dates.fsubs$dotw[1] y.end <- dates.fsubs$dotw[nrow(dates.fsubs)] dates.len <- nrow(dates.fsubs) adj.start <- dates.fsubs$woty[1] for (k in 0:6) { if (k < y.start) { x.start <- adj.start + 0.5 } else { x.start <- adj.start - 0.5 } if (k > y.end) { x.finis <- dates.fsubs$woty[nrow(dates.fsubs)] - 0.5 } else { x.finis <- dates.fsubs$woty[nrow(dates.fsubs)] + 0.5 } grid.lines(x = c(x.start, x.finis), y = c(k -0.5, k - 0.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) } if (adj.start < 2) { grid.lines(x = c( 0.5, 0.5), y = c(6.5, y.start-0.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) grid.lines(x = c(1.5, 1.5), y = c(6.5, -0.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) grid.lines(x = c(x.finis, x.finis), y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) if (dates.fsubs$dotw[dates.len] != 6) { grid.lines(x = c(x.finis + 1, x.finis + 1), y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) } grid.lines(x = c(x.finis, x.finis), y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) } for (n in 1:51) { grid.lines(x = c(n + 1.5, n + 1.5), y = c(-0.5, 6.5), default.units = "native", gp=gpar(col = "grey", lwd = 1)) } x.start <- adj.start - 0.5 if (y.start > 0) { grid.lines(x = c(x.start, x.start + 1), y = c(y.start - 0.5, y.start - 0.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.start + 1, x.start + 1), y = c(y.start - 0.5 , -0.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.start, x.start), y = c(y.start - 0.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) if (y.end < 6 ) { grid.lines(x = c(x.start + 1, x.finis + 1), y = c(-0.5, -0.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.start, x.finis), y = c(6.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } else { grid.lines(x = c(x.start + 1, x.finis), y = c(-0.5, -0.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.start, x.finis), y = c(6.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } } else { grid.lines(x = c(x.start, x.start), y = c( - 0.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } if (y.start == 0 ) { if (y.end < 6 ) { grid.lines(x = c(x.start, x.finis + 1), y = c(-0.5, -0.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.start, x.finis), y = c(6.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } else { grid.lines(x = c(x.start + 1, x.finis), y = c(-0.5, -0.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.start, x.finis), y = c(6.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } } for (j in 1:12) { last.month <- max(dates.fsubs$seq[dates.fsubs$month == j]) x.last.m <- dates.fsubs$woty[last.month] + 0.5 y.last.m <- dates.fsubs$dotw[last.month] + 0.5 grid.lines(x = c(x.last.m, x.last.m), y = c(-0.5, y.last.m), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) if ((y.last.m) < 6) { grid.lines(x = c(x.last.m, x.last.m - 1), y = c(y.last.m, y.last.m), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) grid.lines(x = c(x.last.m - 1, x.last.m - 1), y = c(y.last.m, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } else { grid.lines(x = c(x.last.m, x.last.m), y = c(- 0.5, 6.5), default.units = "native", gp=gpar(col = "black", lwd = 1.75)) } } } } trellis.unfocus() } lattice.options(default.theme = def.theme) }

If we pass the dataframed time series data counting the sum (count) of tweets per day, we can get a calendar heatmap view of Martin’s twitter activity:

calendarHeat(ts.sum.df$date, ts.sum.df$sum, varname="@mhawksey Twitter activity")

I’m not sure if this is even interesting, let alone useful, but I do think now I’ve found out a little bit about working with time in R, *that* could be handy…

Still to do: extract hashtags and visualise them; extend the twitteR library so it exposes things like retweet counts. But that’s for another day…

Ignoring the gross invasion of privacy ;) my favourite summary is the calendar heatmap [who’d thought it would take so many lines of R code to produce this #theyMustBeDoingSomethingWrong].

Immediate questions are what was I saying on Tuesday 13th December and what would it look like if you excluded my more gabberish moments?

Martin

Wow!! :) It’d be great to have the same scripts to analize a whole community (Friends timeline)

@Victor I think the friends timeline only applies to an authenticated user. At the moment, I am exploring what I can do without any authentication.

Hi This is wonderful, I am trying to replicate this with 5 years of data. only last year is retained with a warning message

In trellis.panelArgs() :

Plot spans multiple pages, only last page can be updated

being a novice could’nt figure it out. Could you help me this. rgds, Krishna

I’m a complete novice with this too – the calendarheatmap is code i found on the web and I haven’t really looked to see how it works (or doesn’t…). If there’s a problem with it, it may be worth following the link to the original author and asking them…

I know this is probably a stupid question, but I am very new to R and do not have much experience with programming languages. How do you actually import someone’s twitter into R?

@jack If you instal the twitteR package from CRAN, all you need to do then is:

require(twitteR)

username=’TWITTERUSERNAME’

usertimeline=userTimeline(username,n=3200)

usertimeline.df=twListToDF(usertimeline)

Hey! Nice graphs!

Suppose you want R to retrive more than 3600 tweets, in order to analyze a given period of time (a month or more). Do you know how can I achieve this?

Thank you in advance.

@sergio The Twitter search API is quite limited in what you can do with it. If you know in advance what Twitter accounts/search terms you want to probe, you can always set up an archiver (eg ThinkUpApp, or Martin Hawksey’s google spreadsheets Twitter archiver http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/ ).

Looks great.

Really nice. I only have a problem: in The colored plot I can’t see blue dots (I only have green and red, even using your same Twitter timeline). What can I do?