Archive for the ‘Rstats’ Category
Visualising Twitter User Timeline Activity in R
I’ve largely avoided “time” in R to date, but following a chat with @mhawksey at #dev8d yesterday, I went down a rathole last night exploring a few ways of visualising a Twitter user timeline. As a result, I also had a quick initial play with some time handling features of R, such as timeseries objects, as well as a go at generating daily, weekly and monthly summary counts of data values.
To start, let’s grab a user timeline. As Martin started it (?!), we’ll use his…;-)
require(twitteR)
username='TWITTERUSERNAME'
#the most tweets we can bring back from a user timeline is the most recent 3200...
mht=userTimeline(username,n=3200)
tw.df=twListToDF(mht)
#As I've done in previous scripts, pull out the names of folk who have been "old-fashioned RTd"...
require(stringr)
trim <- function (x) sub('@','',x)
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
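The old-style RT test is just a regular expression anchored to the start of the tweet. As a minimal sketch of the same idea using only base R (the sample tweets and the `extract_rt` helper are made up for illustration, not part of the original script):

```r
# Hypothetical sample tweets (not real data)
tweets <- c("RT @psychemedia: Visualising timelines in R",
            "Just a plain tweet mentioning @wilm",
            "RT @ambrouk great session at #dev8d")

# regexec() returns the full match plus the bracketed group;
# element 2 is the name without the leading '@'
extract_rt <- function(tweet) {
  m <- regmatches(tweet, regexec("^RT @([[:alnum:]_]+)", tweet))[[1]]
  if (length(m) == 2) m[2] else NA_character_
}

rt <- vapply(tweets, extract_rt, character(1), USE.NAMES = FALSE)
# Label each tweet as an old-style retweet (RT) or an ordinary tweet (T)
rtt <- ifelse(is.na(rt), "T", "RT")
```

A mention that is not at the start of the tweet (the second example) is deliberately not counted as a retweet.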
The returned data includes a created attribute (of the form “2012-02-17 11:40:25”) and a replyToSN attribute that includes the username of a user Martin was replying to via a particular tweet.
The simplest way I can think of displaying the data is to just display the screenName attribute of the sender (which in this case is always mhawksey) against time:
require(ggplot2)
ggplot(tw.df)+geom_point(aes(x=created,y=screenName))
As ever, things are never that simple… some tweets with old dates appear to have crept in somehow… A couple of things I tried relating to time based filtering caused R to have all sorts of malloc errors, so here’s a fudge I found to just display tweets that were created within the last 8,000 hours…
#Ask for the time difference in hours explicitly; a bare subtraction lets R choose the units
tw.dfs=subset(tw.df,subset=(difftime(Sys.time(),created,units='hours')<8000))
ggplot(tw.dfs)+geom_point(aes(x=created,y=screenName))
Okay, so not very interesting… It shows that Martin tweets…
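A note on the time filter: subtracting two date-times gives a difftime whose units R picks automatically, so it is safer to name the units when comparing against a number. A minimal sketch with made-up timestamps and a fixed "now" so the result is reproducible:

```r
# Hypothetical timestamps, measured against a fixed "now"
now <- as.POSIXct("2012-02-17 12:00:00", tz = "UTC")
created <- as.POSIXct(c("2012-02-17 11:40:25",   # about 20 minutes old
                        "2011-12-01 09:00:00",   # about 11 weeks old
                        "2010-01-01 00:00:00"),  # over two years old
                      tz = "UTC")

# Ask for hours explicitly, then drop the difftime class for plain comparison
age_hours <- as.numeric(difftime(now, created, units = "hours"))
recent <- created[age_hours < 8000]
```

Here the first two timestamps survive the filter and the 2010 one is dropped.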
Picking up on views of the style doodled in Visualising Activity Around a Twitter Hashtag or Search Term Using R, where we look at when new users appear in a hashtag stream, we can plot when Martin replies to another twitter user, arranging the user names in the order in which they were first publicly replied to:
require(plyr)
#Order the replyToSN factor levels in the order in which they were first created
tw.dfx=ddply(tw.dfs, .var = "replyToSN", .fun = function(x) {return(subset(x, created %in% min(created),select=c(replyToSN,created)))})
tw.dfxa=arrange(tw.dfx,-desc(created))
tw.dfs$replyToSN=factor(tw.dfs$replyToSN, levels = tw.dfxa$replyToSN)
#and plot the result
ggplot(tw.dfs)+geom_point(aes(x=created,y=replyToSN))
The line at the top shows tweets where the replyToSN value was NA (not available), that is, tweets that were not replies.
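The ordering trick does not depend on plyr. As a rough sketch of the same idea in base R, using a small made-up data frame (the names and dates are hypothetical):

```r
# Hypothetical reply data: who was replied to, and when
df <- data.frame(
  replyToSN = c("wilm", "psychemedia", "wilm", NA, "ambrouk", "psychemedia"),
  created = as.POSIXct(c("2012-01-03", "2012-01-01", "2012-01-05",
                         "2012-01-02", "2012-01-04", "2012-01-06"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort by time, then keep each name the first time it appears
by_time <- df[order(df$created), ]
first_seen <- unique(by_time$replyToSN[!is.na(by_time$replyToSN)])

# Use that order for the factor levels; NA replies stay NA
df$replyToSN <- factor(df$replyToSN, levels = first_seen)
```

Plotting `replyToSN` on the y axis then stacks the names in order of first reply.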
We can then go a little further and plot when folk are replied to or retweeted, as well as tweets that are neither a reply nor an old-style retweet:
ggplot()+
geom_point(data=subset(tw.dfs,subset=(!is.na(replyToSN))),aes(x=created,y=replyToSN),col='red') +
geom_point(data=subset(tw.dfs,subset=(!is.na(rt))),aes(x=created,y=rt),col='blue') +
geom_point(data=subset(tw.dfs,subset=(is.na(replyToSN) & is.na(rt))),aes(x=created,y=screenName),col='green')
Here, the blue dots are old-style retweets, the red dots are replies, and the green dots are tweets that are neither replies nor old-style retweets. If a blue dot appears on a row before a red dot, it shows Martin RT’d them before ever replying to them. If blue dots are on a row that contains no red dot, then it shows Martin has RT’d but not replied to that person. A heavily populated row shows Martin has repeated interactions with that user.
We can generate an ordered bar chart showing who is most heavily replied to:
#First we need to count how many replies a user gets...
#http://stackoverflow.com/a/3255448/454773
r_table <- table(tw.dfs$replyToSN)
#..rank them...
r_levels <- names(r_table)[order(-r_table)]
#..and use this ordering to order the factor levels...
tw.dfs$replyToSN <- factor(tw.dfs$replyToSN, levels = r_levels)
#Then we can plot the chart, rotating and shrinking the x-axis labels so the names fit...
ggplot(subset(tw.dfs,subset=(!is.na(replyToSN))),aes(x=replyToSN)) + geom_bar(aes(y = (..count..)))+theme(axis.text.x=element_text(angle=-90,size=6))
(Hmmm… how would I filter this to only show folk replied to more than 50 times, for example? [UPDATE: here's a partial, related recipe:
require(gdata)
tt=as.data.frame(table(tw.dfs$replyToSN))
#Filter to retain users with Freq above some threshold, then drop spare levels
tts=drop.levels(subset(tt,subset=(Freq>5)))
ggplot(tts)+geom_bar(stat='identity',aes(x=Var1,y=Freq))
Note that the order of the factors needs rearranging. Something like this maybe?
orderLevels=function(dfc.name,dfc.val){
#reorder() returns a factor whose levels are sorted by dfc.val, so use its levels
factor(dfc.name, levels = levels(reorder(dfc.name,dfc.val)))
}
then:
tts$Var1=orderLevels(tts$Var1,tts$Freq)
ggplot(tts)+geom_bar(stat='identity',aes(x=Var1,y=Freq))
Taking the simplification (?!) further:
orderedSubset=function(dfc,min=5){
require(gdata)
#Name the table dimension explicitly so the column is always called Var1
tmp1=as.data.frame(table(Var1=dfc))
tmp2=drop.levels(subset(tmp1,subset=(Freq>=min)))
tmp2$Var1=factor(tmp2$Var1, levels = levels(reorder(tmp2$Var1,tmp2$Freq)))
tmp2
}
ggplot(orderedSubset(tw.dfs$replyToSN))+geom_bar(stat='identity',aes(x=Var1,y=Freq))
ggplot(orderedSubset(tw.dfs$replyToSN,50))+geom_bar(stat='identity',aes(x=Var1,y=Freq))
..or even further...
plotOrderedSubset=function(dfc,min=5){
ggplot(orderedSubset(dfc,min))+geom_bar(stat='identity',aes(x=Var1,y=Freq))
}
plotOrderedSubset(tw.dfs$replyToSN)
plotOrderedSubset(tw.dfs$replyToSN,20)
])
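For comparison, the threshold-and-order step can also be sketched with nothing but base R. The `top_replied` helper and the sample names below are hypothetical, purely for illustration:

```r
# Hypothetical vector of reply targets (NA = not a reply)
replies <- c(rep("psychemedia", 6), rep("wilm", 3), rep("ambrouk", 2), "manmalik", NA)

top_replied <- function(x, min = 2) {
  counts <- table(x)                       # table() drops NA by default
  counts <- counts[counts >= min]          # keep names at or above the threshold
  sort(counts, decreasing = TRUE)          # most replied-to first
}

tt <- top_replied(replies, min = 3)
# Factor levels in plotting order, ready to hand to ggplot as x
tt.df <- data.frame(Var1 = factor(names(tt), levels = names(tt)),
                    Freq = as.vector(tt))
```

Because the factor levels are set from the sorted names, a bar chart of `tt.df` keeps the bars in frequency order.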
Sometimes, a text view is easier…
head(table(tw.dfs$replyToSN))
#eg returns:
#psychemedia wilm ambrouk sheilmcn dajbelshaw manmalik
# 394 66 59 53 48 43
#Hmm..can we generalise this?
topTastic=function(dfc,num=5){
r_table <- table(dfc)
r_levels <- names(r_table)[order(-r_table)]
head(table(factor(dfc, levels = r_levels)),num)
}
#so now, for example, I should be able to display the most old-style retweeted folk?
topTastic(tw.dfs$rt)
#or the 10 most replied to...
topTastic(tw.dfs$replyToSN,10)
Let’s try some time stuff now… From the R Cookbook, I find I can do this:
#label a tweet with the month number
tw.dfs$month=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$mon})
#label a tweet with the hour
tw.dfs$hour=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$hour})
#label a tweet with a number corresponding to the day of the week
tw.dfs$wday=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$wday})
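It is worth knowing how these fields are numbered. A small sketch using a single timestamp in the format shown earlier (UTC is assumed here for the example):

```r
# A single timestamp in the format the timeline returns (UTC assumed here)
created <- as.POSIXct("2012-02-17 11:40:25", tz = "UTC")
p <- as.POSIXlt(created)

p$mon    # 1  (months are numbered 0-11, so 1 is February)
p$hour   # 11
p$wday   # 5  (days are numbered 0-6 from Sunday, so 5 is Friday)

# as.POSIXlt() is vectorised, so a whole column can be converted without sapply()
month <- as.POSIXlt(created)$mon + 1   # shift to the more familiar 1-12
```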
What this means is we can now chart a count of the number of tweets by month, day of the week, or hour… For example, here’s hour vs. day of the week:
ggplot(tw.dfs)+geom_jitter(aes(x=wday,y=hour))
Note that this jittered scattergraph, where each dot is a tweet, only approximates the time each tweet occurred – the jitter applied is a random quantity designed to separate out tweets posted within the same hour-and-day-of-the-week bin.
What about Martin’s tweeting behaviour over time?
#We can also generate barplots showing the distribution of tweet count over time:
ggplot(tw.dfs,aes(x=created))+geom_bar(aes(y = (..count..)))
#Hmm... I'm not sure how to manually set binwidth= sensibly, though?!
Here’s a plot of the number of counts per… I’m not sure: the bin width was calculated automatically…
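On the bin width question: date-times are stored as seconds, and as far as I can tell ggplot2 measures binwidth on a date-time axis in those same units, so a one-day bin would be 86400. A sketch of binning by named periods in base R, using made-up timestamps:

```r
# Hypothetical timestamps spread over three days
created <- as.POSIXct(c("2012-02-15 09:10:00", "2012-02-15 17:45:00",
                        "2012-02-16 08:00:00",
                        "2012-02-17 11:40:25", "2012-02-17 11:55:00",
                        "2012-02-17 23:30:00"), tz = "UTC")

# cut() on date-times accepts named bin widths such as "hour", "day", "week", "month"
per_day <- table(cut(created, breaks = "day"))

# POSIXct values are stored as seconds, so a one-day bin is this wide
day_in_seconds <- 24 * 60 * 60
```

Swapping `"day"` for `"week"` or `"month"` changes the bin size without any arithmetic.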
How about using the number of tweets in a particular day or hour bin to see what times of day or days of week Martin is tweeting?
#We can also plot the number of tweets within particular hour or time bins...
ggplot(tw.dfs,aes(x=wday))+geom_bar(aes(y = (..count..)),binwidth=1)
ggplot(tw.dfs,aes(x=hour))+geom_bar(aes(y = (..count..)),binwidth=1)
This chart shows activity (in terms of count…) per hour of day.
As well as doing the count of tweets per hour, for example, via a ggplot statistical graphical function, we can also get day, week, month, quarter and year counts from a set of functions associated with a particular sort of timeseries object…
Each element in a time series typically has two parts – a timestamp, and a numeric value. We can generate a time series of a sort around a Twitter user timeline by creating a dummy quantity – such as the unit value, 1 – and associating it with each timestamp:
require(xts)
#The xts function creates a timeline from a vector of values and a vector of timestamps.
#If we know how many tweets we have, we can just create a simple list or vector containing that number of 1s
ts=xts(rep(1,times=nrow(tw.dfs)),tw.dfs$created)
#We can now do some handy number crunching on the timeseries, such as applying a formula to values contained within day, week, month, quarter or year time bins.
#So for example, if we sum the unit values in each daily bin, we can get a count of the number of tweets per day
ts.sum=apply.daily(ts,sum)
#also apply.weekly, apply.monthly, apply.quarterly, apply.yearly
#If for any reason we need to turn the timeseries into a dataframe, we can:
#http://stackoverflow.com/a/3387259/454773
ts.sum.df=data.frame(date=index(ts.sum), coredata(ts.sum))
colnames(ts.sum.df)=c('date','sum')
#We can then use ggplot to plot the timeseries...
ggplot(ts.sum.df)+geom_line(aes(x=date,y=sum))
#Having got the data in a timeseries form, we can do timeseries based things to it... such as checking the autocorrelation:
acf(ts.sum)
Hmmm.. so, one day is much the same as another, but there also appears to be a weekly (7 day periodicity) pattern…
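One caveat when reading a lag of 7 as "a week": as far as I can tell, daily summaries of this kind only contain rows for days that had at least one tweet, whereas the autocorrelation treats consecutive values as evenly spaced. A rough base R sketch of filling in the empty days with zeros first, using made-up timestamps:

```r
# Hypothetical timestamps with a gap: nothing on 2012-02-16
created <- as.POSIXct(c("2012-02-14 09:00:00", "2012-02-14 18:30:00",
                        "2012-02-15 12:00:00",
                        "2012-02-17 11:40:25"), tz = "UTC")

days <- as.Date(created, tz = "UTC")                # calendar date of each tweet
all_days <- seq(min(days), max(days), by = "day")   # every calendar day in the range

# Tabulating against the full set of days gives 0 for days with no tweets
daily <- table(factor(as.character(days), levels = as.character(all_days)))
counts <- as.vector(daily)

# With evenly spaced daily values, lag 7 in the ACF corresponds to one week
daily.acf <- acf(counts, plot = FALSE)
```

For an account that tweets every day the two approaches give the same series, so this only matters when there are quiet days.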
Finally, here’s a handy script I found on the Revolution Analytics site for Charting time series as calendar heat maps in R:
##############################################################################
# Calendar Heatmap #
# by #
# Paul Bleicher #
# an R version of a graphic from: #
# http://stat-computing.org/dataexpo/2009/posters/wicklin-allison.pdf #
# requires lattice, chron, grid packages #
##############################################################################
## calendarHeat: An R function to display time-series data as a calendar heatmap
## Copyright 2009 Humedica. All rights reserved.
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
## You can find a copy of the GNU General Public License, Version 2 at:
## http://www.gnu.org/licenses/gpl-2.0.html
calendarHeat <- function(dates,
values,
ncolors=99,
color="r2g",
varname="Values",
date.form = "%Y-%m-%d", ...) {
require(lattice)
require(grid)
require(chron)
# is.character()/is.factor() also cope with objects that have more than one class, such as POSIXct
if (is.character(dates) || is.factor(dates)) {
dates <- strptime(dates, date.form)
}
caldat <- data.frame(value = values, dates = dates)
min.date <- as.Date(paste(format(min(dates), "%Y"),
"-1-1",sep = ""))
max.date <- as.Date(paste(format(max(dates), "%Y"),
"-12-31", sep = ""))
dates.f <- data.frame(date.seq = seq(min.date, max.date, by="days"))
# Merge moves data by one day, avoid
caldat <- data.frame(date.seq = seq(min.date, max.date, by="days"), value = NA)
dates <- as.Date(dates)
caldat$value[match(dates, caldat$date.seq)] <- values
caldat$dotw <- as.numeric(format(caldat$date.seq, "%w"))
caldat$woty <- as.numeric(format(caldat$date.seq, "%U")) + 1
caldat$yr <- as.factor(format(caldat$date.seq, "%Y"))
caldat$month <- as.numeric(format(caldat$date.seq, "%m"))
yrs <- as.character(unique(caldat$yr))
d.loc <- as.numeric()
for (m in min(yrs):max(yrs)) {
d.subset <- which(caldat$yr == m)
sub.seq <- seq(1,length(d.subset))
d.loc <- c(d.loc, sub.seq)
}
caldat <- cbind(caldat, seq=d.loc)
#color styles
r2b <- c("#0571B0", "#92C5DE", "#F7F7F7", "#F4A582", "#CA0020") #red to blue
r2g <- c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384") #red to green
w2b <- c("#045A8D", "#2B8CBE", "#74A9CF", "#BDC9E1", "#F1EEF6") #white to blue
assign("col.sty", get(color))
calendar.pal <- colorRampPalette((col.sty), space = "Lab")
def.theme <- lattice.getOption("default.theme")
cal.theme <-
function() {
theme <-
list(
strip.background = list(col = "transparent"),
strip.border = list(col = "transparent"),
axis.line = list(col="transparent"),
par.strip.text=list(cex=0.8))
}
lattice.options(default.theme = cal.theme)
yrs <- (unique(caldat$yr))
nyr <- length(yrs)
print(cal.plot <- levelplot(value~woty*dotw | yr, data=caldat,
as.table=TRUE,
aspect=.12,
layout = c(1, nyr%%7),
between = list(x=0, y=c(1,1)),
strip=TRUE,
main = paste("Calendar Heat Map of ", varname, sep = ""),
scales = list(
x = list(
at= c(seq(2.9, 52, by=4.42)),
labels = month.abb,
alternating = c(1, rep(0, (nyr-1))),
tck=0,
cex = 0.7),
y=list(
at = c(0, 1, 2, 3, 4, 5, 6),
labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday"),
alternating = 1,
cex = 0.6,
tck=0)),
xlim =c(0.4, 54.6),
ylim=c(6.6,-0.6),
cuts= ncolors - 1,
col.regions = (calendar.pal(ncolors)),
xlab="" ,
ylab="",
colorkey= list(col = calendar.pal(ncolors), width = 0.6, height = 0.5),
subscripts=TRUE
) )
panel.locs <- trellis.currentLayout()
for (row in 1:nrow(panel.locs)) {
for (column in 1:ncol(panel.locs)) {
if (panel.locs[row, column] > 0)
{
trellis.focus("panel", row = row, column = column,
highlight = FALSE)
xyetc <- trellis.panelArgs()
subs <- caldat[xyetc$subscripts,]
dates.fsubs <- caldat[caldat$yr == unique(subs$yr),]
y.start <- dates.fsubs$dotw[1]
y.end <- dates.fsubs$dotw[nrow(dates.fsubs)]
dates.len <- nrow(dates.fsubs)
adj.start <- dates.fsubs$woty[1]
for (k in 0:6) {
if (k < y.start) {
x.start <- adj.start + 0.5
} else {
x.start <- adj.start - 0.5
}
if (k > y.end) {
x.finis <- dates.fsubs$woty[nrow(dates.fsubs)] - 0.5
} else {
x.finis <- dates.fsubs$woty[nrow(dates.fsubs)] + 0.5
}
grid.lines(x = c(x.start, x.finis), y = c(k -0.5, k - 0.5),
default.units = "native", gp=gpar(col = "grey", lwd = 1))
}
if (adj.start < 2) {
grid.lines(x = c( 0.5, 0.5), y = c(6.5, y.start-0.5),
default.units = "native", gp=gpar(col = "grey", lwd = 1))
grid.lines(x = c(1.5, 1.5), y = c(6.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
grid.lines(x = c(x.finis, x.finis),
y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
if (dates.fsubs$dotw[dates.len] != 6) {
grid.lines(x = c(x.finis + 1, x.finis + 1),
y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
}
grid.lines(x = c(x.finis, x.finis),
y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
}
for (n in 1:51) {
grid.lines(x = c(n + 1.5, n + 1.5),
y = c(-0.5, 6.5), default.units = "native", gp=gpar(col = "grey", lwd = 1))
}
x.start <- adj.start - 0.5
if (y.start > 0) {
grid.lines(x = c(x.start, x.start + 1),
y = c(y.start - 0.5, y.start - 0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start + 1, x.start + 1),
y = c(y.start - 0.5 , -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.start),
y = c(y.start - 0.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
if (y.end < 6 ) {
grid.lines(x = c(x.start + 1, x.finis + 1),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
} else {
grid.lines(x = c(x.start + 1, x.finis),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
}
} else {
grid.lines(x = c(x.start, x.start),
y = c( - 0.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
}
if (y.start == 0 ) {
if (y.end < 6 ) {
grid.lines(x = c(x.start, x.finis + 1),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
} else {
grid.lines(x = c(x.start + 1, x.finis),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
}
}
for (j in 1:12) {
last.month <- max(dates.fsubs$seq[dates.fsubs$month == j])
x.last.m <- dates.fsubs$woty[last.month] + 0.5
y.last.m <- dates.fsubs$dotw[last.month] + 0.5
grid.lines(x = c(x.last.m, x.last.m), y = c(-0.5, y.last.m),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
if ((y.last.m) < 6) {
grid.lines(x = c(x.last.m, x.last.m - 1), y = c(y.last.m, y.last.m),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.last.m - 1, x.last.m - 1), y = c(y.last.m, 6.5),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
} else {
grid.lines(x = c(x.last.m, x.last.m), y = c(- 0.5, 6.5),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
}
}
}
}
trellis.unfocus()
}
lattice.options(default.theme = def.theme)
}
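The function places each day on the grid using two format() codes: `%w` for the day of the week and `%U` for the week of the year. A quick sketch of what they return for a single example date:

```r
# One example date
d <- as.Date("2012-02-17")

dotw <- as.numeric(format(d, "%w"))       # day of week, 0 = Sunday ... 6 = Saturday
woty <- as.numeric(format(d, "%U")) + 1   # week of year (weeks start on Sunday), shifted to start at 1
```

So 17 February 2012, a Friday, lands in row 5 of column 8.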
If we pass the dataframed time series data counting the sum (count) of tweets per day, we can get a calendar heatmap view of Martin’s twitter activity:
calendarHeat(ts.sum.df$date, ts.sum.df$sum, varname="@mhawksey Twitter activity")
I’m not sure if this is even interesting, let alone useful, but I do think now I’ve found out a little bit about working with time in R, that could be handy…
Still to do: extract hashtags and visualise them; extend the twitteR library so it exposes things like retweet counts. But that’s for another day…
Generating Twitter Wordclouds in R (Prompted by an Open Learning Blogpost)
A couple of weeks ago I saw a great example of an open learning blogpost from @katy_bird: Generating a word cloud (or not) from a Twitter hashtag. It described the trials and tribulations associated with trying to satisfy a request for the generation of a wordcloud based on tweets associated with a specific Twitter hashtag. A seemingly simple task, you might think, but things are never that easy… If you read the post, you’ll see Katy identified several problems, or stumbling blocks, along the way, as well as how she addressed them. There’s also a bit of reflection on the process as a whole.
Reading the post the first time (and again, just now), completely set me up for the day. It had a little bit of everything: a goal statement, the identification of a set of problems associated with trying to complete the task, some commentary on how the problems were tackled, and some reflection on the process as a whole. The post thus serves the purpose of capturing a problem discovery process, as well as the steps taken to try and solve each problem (although full documentation is lacking… This is something I have learned over the years: to use something like a gist on github to actually keep a copy of any code I generated to solve the problem, linked to for reuse by myself and others from the associated blog post). The post captures a glimpse back at a moment in time – when Katy didn’t know how to generate a wordcloud – from the joyful moment at which she has just learned how to generate said wordcloud. More importantly, the post describes the learning problems that became evident whilst trying to achieve the goal in such a way that they can act as hooks on which others can hang alternative or additional ways of solving the problem, or act as mentor.
By identifying the learning journey and problems discovered along the way, Katy’s record of her learning strategy also provides an authentic, learner centric perspective on what’s involved in trying to create a wordcloud around a twitter hashtag.
Reading the post again has also prompted me to blog this recipe, largely copied from the RDataMining post Using Text Mining to Find Out What @RDataMining Tweets are About, for generating a word cloud around a twitter hashtag using R (I use RStudio; the recipe requires at least the twitteR and tm libraries):
require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)
##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
gsub("@\\w+", "", tweet)
}
#Then for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/
#Install the textmining library
require(tm)
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus= function(df,my.stopwords=c()){
#The following is cribbed and seems to do what it says on the can
tw.corpus= Corpus(VectorSource(df))
# remove punctuation
tw.corpus = tm_map(tw.corpus, removePunctuation)
#normalise case
#(recent versions of tm need plain functions wrapped in content_transformer)
tw.corpus = tm_map(tw.corpus, content_transformer(tolower))
# remove stopwords
tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
tw.corpus
}
wordcloud.generate=function(corpus,min.freq=3){
require(wordcloud)
#keep words of any length (recent versions of tm use wordLengths rather than minWordLength)
doc.m = TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
dm = as.matrix(doc.m)
# calculate the frequency of words
v = sort(rowSums(dm), decreasing=TRUE)
d = data.frame(word=names(v), freq=v)
#Generate the wordcloud
wc=wordcloud(d$word, d$freq, min.freq=min.freq)
wc
}
print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))
##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()
#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber=function(searchTerm,num=500){
require(twitteR)
rdmTweets = searchTwitter(searchTerm, n=num)
tw.df=twListToDF(rdmTweets)
as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)
Here’s the result:
PS for an earlier route to sketching a wordcloud from a Twitter search using Wordle (it was broken, but is now patched), see How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps.
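To see what the clean-up and counting steps in the recipe boil down to, here is a rough base R sketch that produces the same kind of word/frequency table that gets handed to the wordcloud function. The tweets and the tiny stopword list are made up for illustration; tm's own stopword list is much longer:

```r
# Hypothetical tweets (not real data)
tweets <- c("Great talk at #dev8d by @mhawksey!",
            "Hacking on R at #dev8d, great fun",
            "R and Twitter: fun at #dev8d")

clean <- gsub("@\\w+", "", tweets)                # drop @names
clean <- tolower(gsub("[[:punct:]]", "", clean))  # drop punctuation, normalise case
words <- unlist(strsplit(clean, "\\s+"))          # split on whitespace
words <- words[nchar(words) > 0]

# A tiny hand-picked stopword list plus the search tag itself
my.stopwords <- c("at", "by", "on", "and", "dev8d")
words <- words[!(words %in% my.stopwords)]

# Word frequencies, most common first
freq <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = as.vector(freq))
```

Removing the search tag matters because it appears in every tweet and would otherwise dominate the cloud.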
More Thoughts on Potential Audience Metrics for Hashtag Communities
Following on from the sketched ideas relating to estimating the Potential Audience Size for a Hashtag Community?, here are a few quick doodles around the graph representation of the tag users – followers graph that explore the extent to which we can use quite simple counts and analyses to get a feel for how the followers of a set of hashtag users are distributed and the number of times they are likely to see a hashtagged tweet (I’m mulling over calling this potential view count “receipts”…)
require(igraph)
#Read in the graph: the graphs contain nodes representing Twitter users connected by directed weighted edges that represent 'is followed by' relations. The weights correspond to the number of hashtagged messages published by the from-node over the sample period
g2=read.graph('/Users/ajh59/code/twapps/newt/reports/tmp/ddj_ncount.graphml',format='graphml')
summary(g2)
#The summary provides an overview of the graph. The number of nodes corresponds to the number of folk in the union of the set of hashtaggers and their followers, for example.
#We can count how many nodes have a particular in-degree count (where in-degree represents the number of hashtaggers the node follows)
g.nodes=as.data.frame(table(degree(g2,mode='in')))
g.nodes$Var1=as.numeric(levels(g.nodes$Var1)[as.integer(g.nodes$Var1)])
#Check: if we sum the node occurrence frequencies, we should get the total number of nodes as a result
sum(g.nodes$Freq)
#We can then chart the result to look at the distribution of how many hashtaggers are followed by how many people
require(ggplot2)
ggplot(g.nodes)+geom_linerange(aes(x=Var1,ymin=0,ymax=Freq)) + scale_y_log10() + xlab('In-degree of followers')
To start with, we can get a view of how the indegree values of the follower nodes are distributed – this gives us an idea of how many of the hashtag users members of the follower set actually follow.
For a tight knit, coherent community, where tag users know each other, we might expect that folk who are likely to be interested in the tag are following several of the tag users.
Note the use of a log10 scale for the count… Most followers are following one tag user (most likely a single user of the tag with a large follower count). Folk following none of the tag users are likely to be tag users who don’t follow any of the other tag users captured during the sample period (erm, maybe? They could also be tag users with private settings, so their friend/follower lists aren’t public…)
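To make the in-degree counting concrete without needing the graph file, here is a toy sketch in base R. The edge list and node names are invented; each row is an 'is followed by' edge from a tag user to a follower:

```r
# Hypothetical 'is followed by' edges: tagger -> follower, weight = tagged tweets sent
edges <- data.frame(
  from   = c("taggerA", "taggerA", "taggerA", "taggerB", "taggerB", "taggerC"),
  to     = c("f1",      "f2",      "taggerB", "f1",      "f3",      "f1"),
  weight = c(3,         3,         3,         1,         1,         2),
  stringsAsFactors = FALSE
)
nodes <- union(edges$from, edges$to)

# In-degree: how many taggers each node follows (0 for nodes nobody points at)
indeg <- table(factor(edges$to, levels = nodes))

# Distribution: how many nodes have each in-degree value
g.nodes <- as.data.frame(table(as.vector(indeg)), stringsAsFactors = FALSE)
g.nodes$Var1 <- as.numeric(g.nodes$Var1)
```

In this toy graph the two nodes with in-degree 0 are tag users who follow none of the other tag users, which matches the reading of the zero bar given above.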
Here’s the code for a second sketch…
#The incoming edges to follower nodes are weighted according to the number of tagged tweets the corresponding hashtagger published in the sample period.
#What this means is that we can count the total number of tagged tweets seen by each follower by summing the weights of edges incident on each node
g.weights=as.data.frame(table(graph.strength(g2,mode='in')))
g.weights$Var1=as.numeric(levels(g.weights$Var1)[as.integer(g.weights$Var1)])
#If we sum the product of message counts and frequencies, we see how many potential "receipts" of a tagged tweet there were.
sum(g.weights$Var1*g.weights$Freq)
#We can also plot the distribution of the number of tagged tweets potentially received by each follower
ggplot(g.weights)+geom_linerange(aes(x=Var1,ymin=0,ymax=Freq)) + scale_y_log10() + xlab('Incoming tagged message count')
This time, we get to see the distribution of the number of receipts of a tagged message across the follower set, where a receipt represents a publication of a tagged tweet in the sample period from any one of the tag users followed by an individual. Because the graph uses edges weighted according to the number of tagged tweets published by a user, we can easily calculate the number of tagged tweets potentially seen by a user by summing the weights of their incoming edges from tag users.
This chart makes it clear that most folk in the potential hashtag audience only had one potential receipt of a tagged tweet… Which makes me start thinking about ways of considering “conversion” rates based in part on the likelihood of a follower joining a hashtag community given the number of tag users to date they follow and the number of followers each of those tag users has…
Note that the range of the incoming message count is greater than the range of the number of tag users followed because some tag users tweet using the tag more than once during the sample period.
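The difference between the two counts is easy to see on a toy example. This sketch uses an invented edge list in base R, with the weighted in-degree (strength) computed as a sum of edge weights per follower:

```r
# Hypothetical 'is followed by' edges: tagger -> follower,
# weight = number of tagged tweets the tagger sent in the sample period
edges <- data.frame(
  from   = c("taggerA", "taggerA", "taggerB", "taggerB", "taggerC"),
  to     = c("f1",      "f2",      "f1",      "f3",      "f1"),
  weight = c(3,         3,         1,         1,         2)
)

# In-strength: total tagged tweets each follower could have seen
receipts <- tapply(edges$weight, edges$to, sum)

# In-degree for comparison: number of taggers each follower follows
followed <- tapply(edges$weight, edges$to, length)

total.receipts <- sum(receipts)   # same as sum(edges$weight)
```

Follower f1 follows three tag users but has six potential receipts, because taggerA and taggerC each tweeted the tag more than once.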
Finally, we chart as a histogram the distribution of the number of followers of each tag user, simply because we can easily do so…
#It's also easy enough to chart the distribution of the follower counts for each hashtagger:
tagger.nodes=subset(as.data.frame(table(degree(g2,mode='out'))),subset=(Var1!='0'))
tagger.nodes$Var1=as.numeric(levels(tagger.nodes$Var1)[as.integer(tagger.nodes$Var1)])
#Quick check on the number of taggers
sum(tagger.nodes$Freq)
#And the distribution of how many followers they have
#weight=Freq counts each row once per tagger with that follower count
ggplot(tagger.nodes)+geom_histogram(aes(x=Var1,weight=Freq),binwidth=250) + xlab('Follower count')
Note the outliers…
For additional charts that can be generated from the graph representation, see: Experimenting With iGraph – and a Hint Towards Ways of Measuring Engagement?
PS Hmmm…pondering this.. focus of a tag user is the number of their followers who originate a tagged tweet in the sample period (RTs don’t count, and maybe neither do replies…) divided by the total number of their followers…? And maybe salience as the number of tagged tweets published by an individual during the sample period divided by the total number of tweets they published over the same period…?
What is the Potential Audience Size for a Hashtag Community?
What’s the potential audience size around a Twitter hashtag?
Way back when, in the early days of web stats, reported figures tended to centre around the notion of hits, the number of calls made to a server via website activity. I forget the details, but the metric was presumably generated from server logs. This measure was always totally unreliable, because in the course of serving a web page, a server might be hit multiple times, once for each separately delivered asset, such as images, javascript files, css files and so on. Hits soon gave way to the notion of Page Views, which more accurately measured the number of pages (rather than assets) served via a website. This was complemented with the notion of Visits and Unique Visits: Visits, as tracked by cookies, represent a set of pages viewed around about the same time by the same person. Unique Visits (or “Uniques”) represent the number of different people who appear to have visited the site in any given period.
What we see here, then, is a steady evolution in the complexity of website metrics that reflects on the one hand dissatisfaction with one way of measuring or reporting activity, and on the other practical considerations with respect to instrumentation and the ability to capture certain metrics once they are conceived of.
Widespread social media monitoring/tracking is largely still in the realm of “hits” measurement. Personal dashboards for services such as Twitter typically display direct measures provided by the Twitter API, or measures trivially/directly identified from Twitter API or archived data – number of followers, numbers of friends, distribution of updates over time, number of mentions, and so on.
Something both myself and Martin Hawksey have been thinking about on and off for some time are ways of reporting activity around Twitter hashtags. A commonly(?!) asked question in this respect relates to how much engagement (whatever that means) there has been with a particular tag. So here’s a quick mark in the sand about some of my current thinking about this. (Note that these ideas may well have been more formally developed in the academic literature – I’m a bit behind in my reading! If you know something that covers this in more detail, or that I should cite, please feel free to add a link in the comments… #lazyAcademic.)
One of the first metrics that comes to my mind is the number of people who have used a particular hashtag, and the number of their followers. Easily stated, it doesn’t take a lot of thought to realise even these “simple” measures are fraught with difficulty:
- what counts as a use of the hashtag? If I retweet a measure of yours that contains a hashtag, have I used it in any meaningful sense? Does a “use” mean the creation of a new tweet containing the tag? What about if I reply to a tweet from you than contains the tag and I include the tag in my reply to you, even if I’m not sure what that tag relates to?
- the potential audience size for the tag (potential uniques?), based on the number of followers of the tag users. At first glance, we might think this can be easily calculated by adding together the follower counts of the tag users, but this is more strictly an approximation of the potential audience: the set of followers of A may include some of the followers of B, or C; do we count the tag users themselves amongst the audience? If so, the upper bound also needs to take into account the fact that none of the users may be followers of any of the other tag users.
Note there is also a lower bound – the largest follower count amongst the tag users (whatever that means…) of the hashtag. Furthermore, if we want to count the number of folk not using the tag but who may have seen the tag, this lower bound can be revised downwards by subtracting the number of tag users minus one (for the tag user with the largest follower count). The value is still only an approximation, though, becuase it assumes that all the tag users are actually included as followers of at least one, each, of the tag users. (If you think these points are “just academic”, they are and they aren’t – observations like these can often be used to help formulate gaming strategies around metrics based on these measures.) - the potential number of views of a tag, for example based on the product of the number of times a user tweets and their follower count?
- the reach of (or active engagement with?) the tag, as measured by the number of people who actually see the tag, or the number of people who take an action around it (such as replying to a tagged tweet, RTing it, or clicking on a link a tagged tweet contains); note that we may be able to construct probabilistic models (albeit quite involved ones) of the potential reach based on factors like the number of people someone follows, when they are online, the rate at which the people they follow tweet, and so on..
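To make the bounds in the list above a bit more tangible, here’s a minimal Python sketch over some toy data. The usernames, follower ids and tweet counts are all made up for illustration, as are the function names; it ignores the question of whether the tag users themselves count as audience members, and simply shows the largest single follower count (lower bound), the summed follower counts (upper bound), the de-duplicated follower count in between, and the crude “tweets times followers” views figure.

```python
# Toy data: usernames, follower ids and tweet counts are made up for illustration.
followers = {
    "alice": {1, 2, 3, 4},
    "bob": {3, 4, 5},
    "carol": {5, 6},
}
tweet_counts = {"alice": 2, "bob": 1, "carol": 3}


def audience_bounds(followers):
    """Return (lower, unique, upper) estimates of the potential audience size."""
    sizes = [len(f) for f in followers.values()]
    lower = max(sizes, default=0)  # largest single follower count
    upper = sum(sizes)             # summed counts, ignoring any overlap
    unique = len(set().union(*followers.values()))  # overlap removed
    return lower, unique, upper


def potential_views(followers, tweet_counts):
    """Sum over tag users of (tagged tweets sent) x (follower count)."""
    return sum(tweet_counts.get(user, 0) * len(f) for user, f in followers.items())


print(audience_bounds(followers))
print(potential_views(followers, tweet_counts))
```

The gap between the middle and last numbers is the overlap that a simple sum of follower counts double counts.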
To try to make this a little more concrete, here are a couple of scripts for exploring the potential audience size of a tag based on the followers of the tag users (where a user is someone who publishes or retweets a tweet containing the tag over a specified period). The first, Python script runs a Twitter search and generates a list of unique users of the tag, along with the timestamp of their first use of the tag within the sample period. This script also grabs all the followers of the tag users, along with their counts, and generates a running cumulative (upper bound approximation) count of the tag user follower numbers, as well as calculating the rolling set of unique followers to date as each new tag user is observed. The second, R script plots the values.
The first thing we can do is look at the incidence of new users of the hashtag over time:
(For a little more discussion of this sort of chart, see Visualising Activity Around a Twitter Hashtag or Search Term Using R and its inspiration, @mediaczar’s How should Page Admins deal with Flame Wars?.)
More relevant to this post, however, is a plot showing some counts relating to followers of users of the hashtag:
In this case, the top, green line represents the summed total number of followers for tag users as they enter the conversation. If every user had completely different followers, this might be meaningful, but where conversation takes place around a tag between folk who know each other, it’s highly likely that they have followers in common.
The middle, red line shows a count of the number of unique followers to date, based on the followers of users of the tag to date.
The lower, blue line shows the difference between the red and green lines. This represents the error between the summed follower counts and the actual number of unique followers.
Here’s a view over the number of new unique potential audience members at each time step (I think the use of the line chart here may be a mistake… I think bars/lineranges would probably be more appropriate…):
In the following chart, I overplot one line with another. The lower layer (a red line) is the total follower count for each new tag user. The blue is the increase in the potential audience count (that is, the number of the new users’ followers that haven’t potentially seen the tag so far). The range of the visible part of the red line thus shows the number of a new tag user’s followers who have potentially already seen the tag. Err… maybe (that is, if my code is correct and all the scripts are doing what I think they’re doing! If they aren’t, then just treat this post as an exploration of the sorts of charts we might be able to produce to explore audience reach;-)
Here are the scripts (such as they are!)
import newt,csv,tweepy
import networkx as nx
#the term we're going to search for
tag='ddj'
#how many tweets to search for (max 1500)
num=500
##Something along lines of:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(SKEY, SSECRET)
api = tweepy.API(auth, cache=tweepy.FileCache('cache',cachetime), retry_errors=[500], retry_delay=5, retry_count=2)
#You need to do some work here to search the Twitter API
tweeters, tweets=yourSearchTwitterFunction(api,tag,num)
#tweeters is a list of folk who tweeted the term of interest
#tweets is a list of the Twitter tweet objects returned from the search
#My code for this is tightly bound up in a large and rambling library atm...
#Put tweets into chronological order
tweets.reverse()
#I was being lazy and wasn't sure what vars I needed or what I was trying to do when I started this!
#The whole thing really needs rewriting...
tweepFo={}
seenToDate=set([])
uniqSourceFo=[]
#runtot is crude and doesn't measure overlap
runtot=0
oldseentodate=0
#Construct a digraph from folk using the tag to their followers
DG=nx.DiGraph()
for tweet in tweets:
    user=tweet['from_user']
    #grab the user ID up front so that the else branch below refers to the current tweeter
    userID=tweet['from_user_id'] #check
    if user not in tweepFo:
        tweepFo[user]=[]
        print "Getting follower data for", str(user), str(len(tweepFo)), 'of', str(len(tweeters))
        mi=tweepy.Cursor(api.followers_ids,id=user).items()
        DG.add_node(userID,label=user)
        for m in mi:
            tweepFo[user].append(m)
            #construct graph
            DG.add_edge(userID,m,weight=1)
            DG.node[m]['label']=''
        ufc=len(tweepFo[user])
        runtot=runtot+ufc
        #seen to date is all people who have seen so far, plus new ones, so it's the union
        oldseentodate=len(seenToDate)
        seenToDate=seenToDate.union(set(tweepFo[user]))
        uniqSourceFo.append((tweet['created_at'],len(seenToDate),user,runtot,ufc,oldseentodate))
    else:
        #I'm weighting the edges so we can count how many times folk see the hashtag
        if len(DG.edges(userID))>0:
            tmp1,tmp2=DG.edges(userID)[0]
            weight=DG[userID][tmp2]['weight']+1
            for fromN,toN in DG.edges(userID):
                DG[fromN][toN]['weight']=weight
fo='reports/tmp/'+tag+'_ncount.csv'
f=open(fo,'wb+')
writer=csv.writer(f)
writer.writerow(['datetime','count','newuser','crudetot','userFoCount','previousCount'])
for ts,l,u,ct,ufc,ols in uniqSourceFo:
    print ts,l
    writer.writerow([ts,l,u,ct,ufc,ols])
f.close()
print "Writing graph.."
filter=[]
for n in DG:
    if DG.degree(n)>1: filter.append(n)
filter=set(filter)
H=DG.subgraph(filter)
nx.write_graphml(H, 'reports/tmp/'+tag+'_ncount_2up.graphml')
print "Writing other graph.."
nx.write_graphml(DG, 'reports/tmp/'+tag+'_ncount.graphml')
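Stripped of the Twitter API calls and the graph building, the bookkeeping behind the CSV file boils down to something like the following sketch. It assumes we already have a chronological list of first uses of the tag and a lookup from user to follower ids; the toy data and the function name are made up, though the keys mirror the CSV column names used in the script.

```python
def running_audience(first_uses, followers):
    """Build one row per new tag user, in chronological order.

    first_uses: list of (timestamp, user) pairs, one per user, oldest first.
    followers: dict mapping user -> iterable of follower ids.
    """
    seen_to_date = set()
    crude_total = 0
    rows = []
    for timestamp, user in first_uses:
        user_followers = set(followers.get(user, ()))
        previous_count = len(seen_to_date)
        seen_to_date |= user_followers      # the union removes repeated followers
        crude_total += len(user_followers)  # the crude total keeps them
        rows.append({
            "datetime": timestamp,
            "count": len(seen_to_date),
            "newuser": user,
            "crudetot": crude_total,
            "userFoCount": len(user_followers),
            "previousCount": previous_count,
        })
    return rows


# Toy data, made up for illustration
first_uses = [("10:00", "alice"), ("10:05", "bob"), ("10:20", "carol")]
followers = {"alice": {1, 2, 3, 4}, "bob": {3, 4, 5}, "carol": {5, 6}}

for row in running_audience(first_uses, followers):
    new_audience = row["count"] - row["previousCount"]
    print(row["newuser"], row["crudetot"], row["count"], new_audience)
```

The difference between count and previousCount is the number of genuinely new potential audience members each new tag user brings with them.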
Here’s the R script…
#arrange() comes from plyr and the plotting functions from ggplot2
require(plyr)
require(ggplot2)
ddj_ncount <- read.csv("~/code/twapps/newt/reports/tmp/ddj_ncount.csv")
#Convert the datetime string to a time object
ddj_ncount$ttime=as.POSIXct(strptime(ddj_ncount$datetime, "%a, %d %b %Y %H:%M:%S"),tz='UTC')
#Order the newuser factor levels into the order in which they first use the tag
dda=subset(ddj_ncount,select=c('ttime','newuser'))
dda=arrange(dda,-desc(ttime))
ddj_ncount$newuser=factor(ddj_ncount$newuser, levels = dda$newuser)
#Plot when each user first used the tag against time
ggplot(ddj_ncount) + geom_point(aes(x=ttime,y=newuser)) + opts(axis.text.x=theme_text(size=6),axis.text.y=theme_text(size=4))
#Plot the cumulative and union flavours of increasing possible audience size, as well as the difference between them
ggplot(ddj_ncount) + geom_line(aes(x=ttime,y=count,col='Unique followers')) + geom_line(aes(x=ttime,y=crudetot,col='Cumulative followers')) + geom_line(aes(x=ttime,y=crudetot-count,col='Repeated followers')) + labs(colour='Type') + xlab(NULL)
#Number of new unique followers introduced at each time step
ggplot(ddj_ncount)+geom_line(aes(x=ttime,y=count-previousCount,col='Actual delta'))
#Try to get some idea of how many of the followers of a new user are actually new potential audience members
ggplot(ddj_ncount) + opts(axis.text.x=theme_text(angle=-90,size=4)) + geom_linerange(aes(x=newuser,ymin=0,ymax=userFoCount,col='Follower count')) + geom_linerange(aes(x=newuser,ymin=0,ymax=(count-previousCount),col='Actual new audience'))
#This is still a bit experimental
#I'm playing around trying to see what proportion or number of a users followers are new to, or subsumed by, the potential audience of the tag to date...
ggplot(ddj_ncount) + geom_linerange(aes(x=newuser,ymin=0,ymax=1-(count-previousCount)/userFoCount)) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
In the next couple of posts in this series, I’ll start to describe how we can chart the potential increase in audience count as a delta for each new tagger, along with a couple of ways of trying to get some initial sort of sense out of the graph file, such as the distribution of the potential number of “views” of a tag across the unique potential audience members…
PS See also the follow on post More Thoughts on Potential Audience Metrics for Hashtag Communities
Visualising Activity Around a Twitter Hashtag or Search Term Using R
I think one of the valid criticisms around a lot of the visualisations I post here and on my various #f1datajunkie blogs is that I often don’t post any explanatory context around the visualisations. This is partly a result of the way I use my blog posts in a selfish way to document the evolution of my own practice, but not necessarily the “so what” elements that represent any meaning or sense I take from the visualisations. In many cases, this is because the understanding I come to of a dataset is typically the result of an (inter)active exploration of the data set; what I blog are the pieces of the puzzle that show how I personally set about developing a conversation with a dataset, pieces that you can try out if you want to…;-)
An approach that might get me more readers would be to post commentary around what I’ve learned about a dataset from having a conversation with it. A good example of this can be seen in @mediaczar’s post on How should Page Admins deal with Flame Wars?, where this visualisation of activity around a Facebook post is analysed in terms of effective (or not!) strategies for moderating a flame war.

The chart shows a sequential ordering of posts in the order they were made along the x-axis, and the unique individual responsible for each post, ordered by accession to the debate along the y-axis. For interpretation and commentary, see the original post: How should Page Admins deal with Flame Wars? ;-)
One take away of the chart for me is that it provides a great snapshot of new people entering into a conversation (vertical lines) as well as engagement by an individual (horizontal lines). If we use a time proportional axis on x, we can also see engagement over time.
In a Twitter context, it’s likely that a rapid increase in numbers of folk engaging with a hashtag, for example, might be the result of an RT related burst of activity. For folk who have already engaged in hashtag usage, for example as part of a live event backchannel, a large number of near co-occurring tweets that are not RTs might signal some notable happenstance within the event.
To explore this idea, here’s a quick bit of R tooling inspired by Mat’s post… It uses the twitteR library and sources tweets via a Twitter search.
require(twitteR)
#Pull in a search around a hashtag.
searchTerm='#ukgc12'
rdmTweets <- searchTwitter(searchTerm, n=500)
# Note that the Twitter search API only goes back 1500 tweets
#Plot of tweet behaviour by user over time
#Based on @mediaczar's http://blog.magicbeanlab.com/networkanalysis/how-should-page-admins-deal-with-flame-wars/
#Make use of a handy dataframe creating twitteR helper function
tw.df=twListToDF(rdmTweets)
#@mediaczar's plot uses a list of users ordered by accession to user list
## 1) find earliest tweet in searchlist for each user [ http://stackoverflow.com/a/4189904/454773 ]
require(plyr)
tw.dfx=ddply(tw.df, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
tw.df$screenName=factor(tw.df$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
require(ggplot2)
ggplot(tw.df)+geom_point(aes(x=created,y=screenName))
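For anyone more at home in Python, the accession ordering step (find each user’s earliest tweet, then sort the users by it) can be sketched as follows. This is just an illustration over made-up tweets, not a translation of the plyr code; the function name is my own, and the timestamps are strings in a format that happens to sort chronologically.

```python
def accession_order(tweets):
    """Return screen names ordered by the time each one first tweeted.

    tweets: iterable of (created, screen_name) pairs, in any order.
    """
    first_seen = {}
    for created, name in tweets:
        if name not in first_seen or created < first_seen[name]:
            first_seen[name] = created
    return sorted(first_seen, key=first_seen.get)


# Toy tweets, made up for illustration
tweets = [
    ("2012-01-20 10:05:00", "bob"),
    ("2012-01-20 10:00:00", "alice"),
    ("2012-01-20 10:10:00", "bob"),
    ("2012-01-20 10:07:00", "carol"),
    ("2012-01-20 10:12:00", "alice"),
]

print(accession_order(tweets))
```

The resulting list plays the same role as the factor levels in the R script: it fixes the order in which users are laid out along the y-axis.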
We can get a feeling for which occurrences were old-style RTs by identifying tweets that start with a classic RT, and then colouring each tweet appropriately (note there may be some overplotting/masking of points…I’m not sure how big the x-axis time bins are…)
#Identify and colour the RTs...
library(stringr)
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#Identify classic style RTs
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
ggplot(tw.df)+geom_point(aes(x=created,y=screenName,col=rtt))
So now we can see when folk entered into the hashtag community via a classic RT.
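The old-style RT detection is just a regular expression match on the start of the tweet text. Here’s a hedged Python sketch of the same idea; the function names are hypothetical, and the character class is my attempt to mirror the alphanumeric-or-underscore pattern used in the R code rather than any official definition of a Twitter username.

```python
import re

# Matches tweets that start with a classic "RT @username"
RT_PATTERN = re.compile(r"^RT @([A-Za-z0-9_]*)")


def retweeted_user(text):
    """Return the name of the user being old-style retweeted, or None."""
    match = RT_PATTERN.match(text)
    return match.group(1) if match else None


def tweet_type(text):
    """Label a tweet 'RT' if it starts with a classic RT, else 'T'."""
    return "T" if retweeted_user(text) is None else "RT"


# Made-up example tweets
for text in ["RT @mhawksey: a handy R trick", "Just a normal tweet", "I like it. RT @mhawksey: a trick"]:
    print(tweet_type(text), retweeted_user(text))
```

Note that, as with the R version, a tweet that merely quotes an RT part way through is treated as an ordinary tweet.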
We can also start to explore who was classically retweeted when:
#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...
ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))
Another view might show who was classically RTd by whom (activity along a row indicating someone was retweeted a lot through one or more tweets, activity within a column identifying an individual who RTs a lot…):
#We can start to get a feel for who RTs whom...
require(gdata)
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=drop.levels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(x=screenName,y=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
What sense you might make of all this, or where to take it next, is down to you of course… Err, erm…?! ;-)
PS see also: http://blog.ouseful.info/2012/01/21/a-quick-view-over-a-mashe-google-spreadsheet-twitter-archive-of-ukgc2012-tweets/
Experimenting With iGraph – and a Hint Towards Ways of Measuring Engagement?
For fear of being left way behind as Martin Hawksey starts to get to grips with R (see for example how he’s using R to automate the annotation of Google Spreadsheets with calculations that don’t come readily or efficiently to hand in Google Spreadsheets itself), I thought I’d better try to get to grips with R’s igraph library…
So here’s a script that gives me some hints as to how to start migrating chunks of my clunky Python script into R, as well as some ideas about how to start reporting on the structure of hashtag communities in a graphical as well as stats analytical way.
require(igraph)
#load in a graph from a graphml file; the graph contains nodes representing Twitter users connected by directed edges that represent friend or follower relations, depending on the actual experimental condition I ran
g=read.graph('/Users/ajh59/code/twapps/newt/reports/scmvESP/scmvESP_2012-01-26-22-53-45/friends_outerfriendsdegree_X_25_25_X_esp.graphml',format='graphml')
summary(g)
#Vertices are obtained via V(g). The summary() tells us what attributes are available.
#So for example, inspect the label attribute
V(g)$label
#in and out degree counts for each (labelled) node/vertex
df=data.frame(name=V(g)$label,indegree=degree(g,mode='in'),outdegree=degree(g,mode='out'))
#inspect the top 10 nodes sorted by indegree
#the plyr arrange function makes sorting dataframes a doddle...
require(plyr)
df2=head(arrange(df,desc(indegree)),10)
df2
#get ready to do some plots
require(ggplot2)
#It might be interesting to look at the in-degree and out-degree distributions
#out-degree, because we see how promiscuous folk are in their following behaviour
#h/t to @mhawksey for pointing out the mode argument to me.. doh!
ddout=degree.distribution(g,mode='out')
#degree.distribution() "a numeric vector of the same length as the maximum degree plus one. The first element is the relative frequency zero degree vertices, the second vertices with degree one,etc."
#We can use the vector vals as the y-value, but x is unspecified/implied by the row number
#So we need to generate the x vals explicitly; the first element corresponds to degree zero, so x starts at 0
ggplot()+geom_point(aes(0:(length(ddout)-1),ddout))
#If we want to ignore the outdegree==0 value, we can skip the first item in the list
ggplot()+geom_point(aes(1:(length(ddout)-1),ddout[-1]))
#in-degree
ddin=degree.distribution(g,mode='in')
ggplot()+geom_point(aes(0:(length(ddin)-1),ddin))
ggplot()+geom_point(aes(1:(length(ddin)-1),ddin[-1]))
#We can also plot indegree and outdegree together
#Use colour to distinguish the points, and make the upper layer smaller in case we overplot
ggplot() + geom_point(aes(1:(length(ddin)-1),ddin[-1]),colour='red') + geom_point(aes(1:(length(ddout)-1),ddout[-1]),colour='blue',size=1)
Note that I really should have labelled the axes – x-axis is “in (or out) degree”, y-axis is “proportion of nodes with corresponding in (or out) degree”.
Out-degree:
Out-degree (except out-degree==0):
In-degree:
In-degree (except in-degree==0):
One thing I notice about the in-degree is that there is a very high number of very low in-degree nodes, which tail off quickly, and then another head at in-degree 25 which then tails off. This is an artefact of the way the graph file was pre-processed – I generated a friends network of hashtag users, then filtered the network to only include nodes that had indegree of at least 25 and/or outdegree of at least 25. The nodes with in-degree between 1 and 25 are nodes corresponding to hashtaggers that are friended by other hashtaggers.
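To be clear about what a degree distribution vector actually contains, here’s a small Python sketch that builds one from a plain edge list. It’s my own illustration of the idea, not igraph’s implementation; the node names and edges are made up. The thing to notice is that position 0 in the result corresponds to degree zero, so the position in the vector is the degree.

```python
from collections import Counter


def degree_distribution(nodes, edges, mode="in"):
    """Relative frequency of each degree; result[d] is the share of nodes with degree d.

    nodes: iterable of node names.
    edges: iterable of (source, target) pairs for a directed graph.
    mode: 'in' counts incoming edges, anything else counts outgoing edges.
    """
    nodes = list(nodes)
    if not nodes:
        return []
    position = 1 if mode == "in" else 0
    degree = Counter(edge[position] for edge in edges)
    # A Counter returns 0 for nodes that never appear at that end of an edge
    frequency = Counter(degree[n] for n in nodes)
    return [frequency[d] / len(nodes) for d in range(max(frequency) + 1)]


# Toy directed graph, made up for illustration
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

print(degree_distribution(nodes, edges, mode="in"))
print(degree_distribution(nodes, edges, mode="out"))
```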
In- (blue) and out- (red) degree:
Reflecting on the in-degree graph, we have a way of identifying those folk who used the hashtag and are connected to other hashtaggers:
arrange(subset(df,subset=(outdegree>0 & indegree>0)),desc(indegree))
The dataset I’m using is based on folk using the #bbcqt hashtag. Here are the hashtaggers most linked to by other hashtaggers:
> head(arrange(subset(df,subset=(outdegree>0)),desc(indegree)))
name indegree outdegree
1 bbcquestiontime 190 102
2 DIMBLEBOT 76 61
3 markinreading 34 121
4 politicalhackuk 27 236
5 10anta 25 73
6 Parlez_me_nTory 24 63
So now I’m wondering… does this hint at a way of measuring some sort of engagement with the Twitter account set up to promote the programme and, presumably, the hashtag???
If we consider @bbcquestiontime, the high indegree tells us that the @bbcquestiontime account is being followed by a significant number of the hashtag users (we could find out what proportion by dividing through by the number of folk with out-degree>0, minus 1; minus 1 because @bbcquestiontime is one of those hashtaggers). That @bbcquestiontime has outdegree > 0 tells us it was sampled as a user of the hashtag (the graph was originally generated with directed edges from folk who used the tag to their friends). The high (ish?!) out-degree tells us that this account is linking to a reasonable number of folk popularly followed by users of the #bbcqt hashtag or who used the hashtag; so @bbcquestiontime is listening to folk that the #bbcqt taggers listen to, which is probably a good thing. (I guess what we could do here is compare the outdegree of the @bbcquestiontime account with its total friend count, ie with the total number of accounts it follows. Because if the account was following 1000 people or so, and only 10% of them were being followed by #bbcqt hashtaggers, we might wonder whether they’re interested in different things?) Once again, we could also normalise the out-degree number with respect to one less than the number of accounts with indegree >0 (again, we subtract one to account for the self reference) to get the proportion of folk being followed by hashtaggers that are being followed by @bbcquestiontime. This gives us some idea of the extent to which @bbcquestiontime is listening to folk that the #bbcqt hashtaggers are listening to.
Let’s try that latter normalisation to get a feel for what the proportions are…
#Count the number of rows where folk have indegree, or outdegree, as required, > 0
df$inReach=df$indegree/(nrow(subset(df,df$outdegree>0))-1)
df$outReach=df$outdegree/(nrow(subset(df,df$indegree>0))-1)
#First let's see who reaches furthest out into the interest community
head(arrange(subset(df,inReach>0),desc(outReach)))
name indegree outdegree outReach inReach
1 Damientg 5 341 0.4782609 0.013054830
2 danmknight 9 265 0.3716690 0.023498695
3 martysm 1 261 0.3660589 0.002610966
4 MrJacHart 18 257 0.3604488 0.046997389
5 VMcAV 5 237 0.3323983 0.013054830
6 politicalhackuk 27 236 0.3309958 0.070496084
#now let's see who is touched by most of the community
head(arrange(subset(df,outReach>0),desc(inReach)))
name indegree outdegree outReach inReach
1 bbcquestiontime 190 102 0.14305750 0.49608355
2 DIMBLEBOT 76 61 0.08555400 0.19843342
3 markinreading 34 121 0.16970547 0.08877285
4 politicalhackuk 27 236 0.33099579 0.07049608
5 10anta 25 73 0.10238429 0.06527415
6 Parlez_me_nTory 24 63 0.08835905 0.06266319
So, from that, we see that @Damientg is following a large number of the folk popularly followed by users of the #bbcqt hashtag or who used the hashtag. I don’t think this is interesting. However, the fact that @bbcquestiontime is followed by about half the folk who used the #bbcqt tag (in the sample I grabbed) is maybe useful as a measure of how engaged the hashtaggers may be with the programme Twitter account?
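As a sanity check on the normalisation, here’s the same calculation sketched in Python over a toy network. The account names and degree counts are invented, and the function name is hypothetical; the point is just that inReach divides the in-degree by one less than the number of accounts with outdegree > 0 (the hashtaggers), and outReach divides the out-degree by one less than the number of accounts with indegree > 0 (the followed folk).

```python
def reach(degrees):
    """Return {name: (in_reach, out_reach)} for each account.

    degrees: dict mapping name -> (indegree, outdegree).
    """
    taggers = sum(1 for indeg, outdeg in degrees.values() if outdeg > 0)
    followed = sum(1 for indeg, outdeg in degrees.values() if indeg > 0)
    result = {}
    for name, (indeg, outdeg) in degrees.items():
        # Subtract one from each denominator to discount the account itself
        in_reach = indeg / (taggers - 1) if taggers > 1 else 0.0
        out_reach = outdeg / (followed - 1) if followed > 1 else 0.0
        result[name] = (in_reach, out_reach)
    return result


# Toy (indegree, outdegree) values, made up for illustration
degrees = {
    "programme": (3, 1),
    "tagger1": (1, 2),
    "tagger2": (0, 2),
    "tagger3": (1, 1),
    "celebrity": (2, 0),
}

for name, (in_reach, out_reach) in reach(degrees).items():
    print(name, round(in_reach, 3), round(out_reach, 3))
```

In this toy case the programme account is followed by all three of the other hashtaggers, so its inReach comes out as 1.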
The latter report also brings to mind another question – how many of the hashtaggers does any particular account follow – that is, how connected is any particular account to folk who used the hashtag (which is the set of folk with outdegree>0)? This is important I think – distinguishing between hashtaggers who link to each other as part of a conversation, and other accounts they follow en masse but who aren’t engaging in conversation via the hashtag?
Hmmm…something to ponder over the weekend I think;-)
Social Media Interest Maps of Newsnight and BBCQT Twitterers
I grabbed independent samples of 1500 recent users of the #newsnight and #bbcqt hashtags within a minute or two of each other about half an hour ago. Here’s who’s followed by 25 or more of the recent hashtaggers in each case. Can you distinguish the programmes each audience interest projection map relates to?
Here’s the first one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?
Here’s the second one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?
The answer is only a click away…
PS I’ve got a couple of scripts in the pipeline that should be able to generate data that I can use to generate this sort of differencing word cloud, the idea being I should be able to identify at a glance accounts that different hashtag communities both follow, and accounts that they differently follow…
UPDATE: so here’s a quick first pass at comparing the audiences. I’m not sure how reliable the method is, but it’s as follows:
- for each hashtag, grab 1500 recent tweets. Grab the list of folk the hashtagging users follow and retain a list (the ‘interest list’) of folk followed by at least 25 of the hashtaggers. Filter the hashtagger list so that it only contains hashtaggers who follow at least 25 people (this cuts out brand new users and newly created spam accounts). Count the number of filtered hashtaggers that follow each person in the interest list, and normalise by dividing through by the total number of filtered hashtaggers. To recap, for each tag, we now have a list of folk who were popularly followed by users of that tag, along with a number for each one between 0 and 1 describing proportionally how much of the hashtagging sample follow them.
(Note that there may be all sorts of sampling errors… I guess I need to qualify reports with the number of unique folk tweeting in the twitter sample captured. I maybe also need to improve sampling so rather than searching for 1500 tweets, I generate a sample of 1000 unique users of the tag?)
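The recipe above can be sketched in a few lines of Python. This is an illustration of the steps as described rather than the script I actually ran: the function name is hypothetical, the friends data is made up, and I’ve dropped both thresholds from 25 to 2 so that the toy example has something to show.

```python
from collections import Counter


def interest_scores(friends, min_followers=25, min_friends=25):
    """Return {account: proportion of filtered hashtaggers following it}.

    friends: dict mapping hashtagger -> set of accounts they follow.
    """
    # Interest list: accounts followed by at least min_followers of the hashtaggers
    follower_counts = Counter(acct for accts in friends.values() for acct in accts)
    interest = {acct for acct, n in follower_counts.items() if n >= min_followers}
    # Filtered hashtaggers: those following at least min_friends accounts
    filtered = [accts for accts in friends.values() if len(accts) >= min_friends]
    if not filtered:
        return {}
    counts = Counter(acct for accts in filtered for acct in accts if acct in interest)
    return {acct: counts[acct] / len(filtered) for acct in interest}


# Toy data, made up for illustration
friends = {
    "tagger1": {"a", "b", "c"},
    "tagger2": {"a", "b"},
    "tagger3": {"a"},
    "tagger4": {"b", "c", "d"},
}

scores = interest_scores(friends, min_followers=2, min_friends=2)
for acct in sorted(scores):
    print(acct, round(scores[acct], 3))
```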
I then load these files into R and run through the following process:
#Multiply this normalised follower proportion by 1000 and round down to get an integer between 0 and 1000 representing a score relative to the proportion of filtered hashtaggers who follow each person in the interest list.
counts_newsnight$normIn=as.integer(counts_newsnight$inNorm*1000)
counts_bbcqt$normIn=as.integer(counts_bbcqt$inNorm*1000)
#Another filtering step: we're going to plot similarities and differences between folk followed by at least 25% of the corresponding filtered hashtaggers
newsnight=subset(counts_newsnight,select=c(username,normIn),subset=(inNorm>=0.25))
bbcqt=subset(counts_bbcqt,select=c(username,normIn),subset=(inNorm>=0.25))
#Now generate a dataframe
qtvnn=merge(bbcqt,newsnight,by="username",all=T)
colnames(qtvnn)=c('username','bbcqt','newsnight')
#replace the NA cell values (where for example someone in the bbcqt list is not in the newsnight list)
qtvnn[is.na(qtvnn)] <- 0
That generates a dataframe that looks something like this:
username bbcqt newsnight
1 Aiannucci 414 408
2 BBCBreaking 455 464
3 BBCNewsnight 317 509
4 BBCPolitics 0 256
5 BBCr4today 0 356
6 BarackObama 296 334
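The merge-and-fill step is essentially an outer join on username, with zeroes standing in for accounts missing from one list or the other. Here’s a rough Python equivalent using plain dictionaries; the function name is my own, and the scores are the ones from the sample rows shown above, split back into the two per-tag lists.

```python
def outer_merge(bbcqt, newsnight):
    """Return (username, bbcqt_score, newsnight_score) rows, using 0 where a name is missing."""
    names = sorted(set(bbcqt) | set(newsnight))
    return [(name, bbcqt.get(name, 0), newsnight.get(name, 0)) for name in names]


# Scores taken from the sample rows of the merged data frame
bbcqt = {"Aiannucci": 414, "BBCBreaking": 455, "BBCNewsnight": 317, "BarackObama": 296}
newsnight = {
    "Aiannucci": 408,
    "BBCBreaking": 464,
    "BBCNewsnight": 509,
    "BBCPolitics": 256,
    "BBCr4today": 356,
    "BarackObama": 334,
}

for row in outer_merge(bbcqt, newsnight):
    print(*row)
```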
Thanks to Josh O’Brien on Stack Overflow, I can recast this data frame into a term.matrix that plays nicely with the latest version of the R wordcloud package.
require(wordcloud)
mat <- as.matrix(qtvnn[-1])
dimnames(mat)[1] <- qtvnn[1]
comparison.cloud(term.matrix = mat)
commonality.cloud(term.matrix = mat)
Here’s the result – commonly followed folk:
And differentially followed folk (at above the 25% level, remember…)
So from this what can we say? Both audiences have a general news interest, into pop politics and perhaps satirical comedy, maybe leaning to the left? The Question Time audience is a more casual audience, more minded to following celebrities, whereas the Newsnight audience is a bit more into following notable media folk (journalists, editors) and also political news. (I’d be keen to hear any other readings of these maps – please feel free to leave a comment containing your interpretations/observations/reading:-)
UPDATE2: to try to get a handle on what the word clouds might be telling us from an alternative visual perspective on the data, rather than inspecting the actual code for example, here’s a scatterplot showing how the follower proportions compare directly:
ggplot(na.omit(subset(qtvnn,bbcqt>0 & newsnight>0))) + geom_text(aes(x=bbcqt,y=newsnight,label=username),angle=45,size=4) + xlim(200,600) + ylim(200,600) + geom_abline(intercept=0, slope=1,colour='grey')
Here’s another view – this time plotting followed folk for each tag who are not followed by the friends of the other tag [at the 25% level or above]:
I couldn’t remember/didn’t have Google to hand to find the best way of reshaping the data for this, so I ended up with a horrible horrible hack…
nn=data.frame(typ='newsnight',subset(qtvnn,select=c(username,newsnight),subset=(newsnight>0 & bbcqt==0)))
qt=data.frame(typ='bbcqt',subset(qtvnn,select=c(username,bbcqt),subset=(newsnight==0 & bbcqt>0)))
colnames(nn)=c('typ','name','val')
colnames(qt)=c('typ','name','val')
qtnn=rbind(nn,qt)
ggplot()+geom_text(data=qtnn,aes(x=typ,y=val,label=name),size=3)
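What the hack is doing is reshaping the wide table into a long one with a (typ, name, val) row for each account that appears on only one of the two lists. A Python sketch of the same reshaping, with a hypothetical function name and a few of the sample rows from the merged data frame as input, might look like this:

```python
def exclusive_long(rows):
    """Reshape (username, bbcqt, newsnight) rows into (typ, name, val) rows.

    Only accounts with a score for one tag and zero for the other are kept.
    """
    nn = [("newsnight", name, n) for name, q, n in rows if n > 0 and q == 0]
    qt = [("bbcqt", name, q) for name, q, n in rows if q > 0 and n == 0]
    return nn + qt


# A few rows in the same shape as the merged data frame
rows = [
    ("Aiannucci", 414, 408),
    ("BBCPolitics", 0, 256),
    ("BBCr4today", 0, 356),
    ("BarackObama", 296, 334),
]

for typ, name, val in exclusive_long(rows):
    print(typ, name, val)
```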
I think this is all starting to get to the point where I need to team up with a proper developer and get *all* the code properly written and documented before any errors that are currently there get baked in too deeply…
A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets
Following on from A Tool Chain for Plotting Twitter Archive Retweet Graphs – Py, R, Gephi, here’s a quick summary view over #UKGC12 tweets saved in a Google Spreadsheet archive as developed by Martin Hawksey, generated from an R script (R code available here; #ukgc12 tweet archive here)…
(I did mean to tidy these up, add in titles etc etc but it’s late and I’m realllly tired:-(
So for example, an ordered bar chart showing who was @’d most by hashtagged tweets:
And a scatterplot showing the number of tagged tweets to and from particular individuals, sized by how many RTs of a person’s tweets there were:
(Hmmm..strikes me I could use a fourth dimension (colour) to capture the number of RTs issued by each person too…? I wonder if I can also tie the angle of each label to a parameter value?!)
I also had a quick peek at looking at folk who were using the tag and/or were heavily followed by tag users (nodes sized according to betweenness centrality):
You can view a dynamic version of the conversation graph around the tag using Martin’s TAGSExplorer (about).
PS See the first comment below from Ben Marwick for a link to a text analysis script in R that can be easily tweaked to use archived tweets. When I get a chance, I’ll try to wrap this into a Sweave script (cf. How Might Data Journalists Show Their Working? Sweave for the automated generation of PDF and HTML reports.).
Amateur Mapmaking: Getting Started With Shapefiles
One of the great things about (software) code is that people build on it and out from it… Which means that as well as producing ever more complex bits of software, tools also get produced over time that make it easier to do things that were once hard to do, or required expensive commercial software tools.
Producing maps is a fine example of this. Not so very long ago, producing your own annotated maps was a hard thing to do. Then in June, 2005, or so, the Google Maps API came along and suddenly you could create your own maps (or at least, put markers on to a map if you had latitude and longitude co-ordinates available). Since then, things have just got easier. If you want to put markers on a map just given their addresses, it’s easy (see for example Mapping the New Year Honours List – Where Did the Honours Go?). You can make use of Ordnance Survey maps if you want to, or recolour and style maps so they look just the way you want.
Sometimes, though, when using maps to visualise numerical data sets, just putting markers onto a map, even when they are symbols sized proportionally in accordance with your data, doesn’t quite achieve the effect you want. Sometimes you just have to have a thematic, choropleth map:

The example above is taken from an Ordnance Survey OpenSpace tutorial, which walks you through the creation of thematic maps using the OS API.
But what do you do if the boundaries/shapes you want to plot aren’t supported by the OS API?
One of the common ways of publishing boundary data is in the form of shapefiles (suffix .shp, though they are often combined with several other files in a .zip package). So here’s a quick first attempt at plotting shapefiles and colouring them according to an appropriately defined data set.
The example is based on a couple of data sets – shapefiles of the English Government Office Regions (GORs), and a dataset from the Ministry of Justice relating to insolvencies that, amongst other things, describes numbers of insolvencies per time period by GOR.
The language I’m using is R, within the RStudio environment. Here’s the code:
#Download English Government Office Network Regions (GOR) from:
#http://www.sharegeo.ac.uk/handle/10672/50
##tmpdir/share geo loader courtesy of http://stackoverflow.com/users/1033808/paul-hiemstra
tmp_dir = tempdir()
url_data = "http://www.sharegeo.ac.uk/download/10672/50/English%20Government%20Office%20Network%20Regions%20(GOR).zip"
zip_file = sprintf("%s/shpfile.zip", tmp_dir)
download.file(url_data, zip_file)
unzip(zip_file, exdir = tmp_dir)
library(maptools)
#Load in the data file (could this be done from the downloaded zip file directly?)
gor=readShapeSpatial(sprintf('%s/Regions.shp', tmp_dir))
#I can plot the shapefile okay...
plot(gor)
Here’s what it looks like:
#I can use these commands to get a feel for the data contained in the shapefile...
summary(gor)
attributes(gor@data)
gor@data$NAME
#[1] North East North West
#[3] Greater London Authority West Midlands
#[5] Yorkshire and The Humber South West
#[7] East Midlands South East
#[9] East of England
#9 Levels: East Midlands East of England ... Yorkshire and The Humber
#download data from http://www.justice.gov.uk/downloads/publications/statistics-and-data/courts-and-sentencing/csq-q3-2011-insolvency-tables.csv
insolvency<- read.csv("http://www.justice.gov.uk/downloads/publications/statistics-and-data/courts-and-sentencing/csq-q3-2011-insolvency-tables.csv")
#Grab a subset of the data: just Q3 2011, and just the numbers that are aggregated by GOR
insolvencygor.2011Q3=subset(insolvency,Time.Period=='2011 Q3' & Geography.Type=='Government office region')
#tidy the data - you may need to download and install the gdata package first
#The subsetting step doesn't remove extraneous original factor levels, so I will.
require(gdata)
insolvencygor.2011Q3=drop.levels(insolvencygor.2011Q3)
names(insolvencygor.2011Q3)
#[1] "Time.Period" "Geography"
#[3] "Geography.Type" "Company.Winding.up.Petition"
#[5] "Creditors.Petition" "Debtors.Petition"
levels(insolvencygor.2011Q3$Geography)
#[1] "East" "East Midlands"
#[3] "London" "North East"
#[5] "North West" "South East"
#[7] "South West" "Wales"
#[9] "West Midlands" "Yorkshire and the Humber"
#Note that these names for the GORs don't quite match the ones used in the shapefile, though how they relate one to another is obvious to us...
#So what next? [That was the original question...!]
#Here's the answer I came up with...
#Convert factors to numeric [ http://stackoverflow.com/questions/4798343/convert-factor-to-integer ]
#There's probably a much better formulaic way of doing this/automating this?
insolvencygor.2011Q3$Creditors.Petition=as.numeric(levels(insolvencygor.2011Q3$Creditors.Petition))[insolvencygor.2011Q3$Creditors.Petition]
insolvencygor.2011Q3$Company.Winding.up.Petition=as.numeric(levels(insolvencygor.2011Q3$Company.Winding.up.Petition))[insolvencygor.2011Q3$Company.Winding.up.Petition]
insolvencygor.2011Q3$Debtors.Petition=as.numeric(levels(insolvencygor.2011Q3$Debtors.Petition))[insolvencygor.2011Q3$Debtors.Petition]
#Tweak the levels so they match exactly (really should do this via a lookup table of some sort?)
i2=insolvencygor.2011Q3
i2c=c('East of England','East Midlands','Greater London Authority','North East','North West','South East','South West','Wales','West Midlands','Yorkshire and The Humber')
i2$Geography=factor(i2$Geography,labels=i2c)
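#Join the data onto the shapefile's attribute table
#merge() re-sorts rows by the key, which would break the row-to-polygon alignment, so use match() to keep the shapefile's row order
gor@data=data.frame(gor@data,i2[match(gor@data$NAME,i2$Geography),])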
#Plot the data using a greyscale
plot(gor,col=gray(gor@data$Creditors.Petition/max(gor@data$Creditors.Petition)))
And here’s the result:
Okay – so it’s maybe not the most informative of maps: it needs a scale, the London data is skewed, etc etc… But it shows that the recipe seems to work.
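On the scale and the London skew: one possible tweak is to bin the counts into a handful of classes before shading, which also gives you labels for a legend. Here's a minimal sketch using quantile breaks. The numbers are made up rather than taken from the Ministry of Justice file, and names like `counts`, `n_classes` and `region_cols` are just for illustration.

```r
# Hypothetical counts per region (made-up numbers, not the MoJ data)
counts <- c("North East" = 120, "North West" = 340,
            "Greater London Authority" = 2100,
            "West Midlands" = 280, "South East" = 410)

# Quantile breaks put roughly equal numbers of regions in each class,
# so one very large value no longer washes out the rest of the map
n_classes <- 4
breaks <- unique(quantile(counts, probs = seq(0, 1, length.out = n_classes + 1)))
classes <- cut(counts, breaks = breaks, include.lowest = TRUE)

# Light grey for low counts, dark grey for high counts
shades <- gray(seq(0.9, 0.2, length.out = nlevels(classes)))
region_cols <- shades[as.integer(classes)]
names(region_cols) <- names(counts)

# With the real objects (and counts taken in polygon order) this might be used as:
# plot(gor, col = region_cols)
# legend("topleft", legend = levels(classes), fill = shades)
```

Whether quantile classes are the right choice depends on the story you want to tell; equal-width breaks or a log scale would be other options.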
(Here’s a glimpse of how I worked my way to this example using a question to Stack Overflow: Plotting Thematic Maps in R Using Shapefiles and Data Files from DIfferent Sources. Note that better solutions may have since been posted to that question, which may improve on the recipe provided in this post…)
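Two steps in the recipe above are flagged in the comments as candidates for automation: converting the factor columns to numbers, and renaming the regions via a lookup table. Here's a rough sketch of how both might look, using toy data in place of the real files. The helper `factor_to_numeric`, the `gor_lookup` table and the made-up counts are all assumptions for illustration, not part of the original script.

```r
# Toy stand-in for the insolvency table (made-up numbers); the counts
# arrive as factors, as they would when read.csv() converts strings to factors
i2 <- data.frame(
  Geography = c("East", "London", "North East", "Wales"),
  Creditors.Petition = factor(c("310", "950", "120", "140")),
  Debtors.Petition = factor(c("95", "400", "60", "70")),
  stringsAsFactors = FALSE
)

# Convert a factor of digit strings via its labels, not its integer codes
factor_to_numeric <- function(f) as.numeric(as.character(f))

# Apply the conversion to every count column in one go
count_cols <- c("Creditors.Petition", "Debtors.Petition")
i2[count_cols] <- lapply(i2[count_cols], factor_to_numeric)

# Lookup table: data-file name -> shapefile name (only names that differ)
gor_lookup <- c("East" = "East of England",
                "London" = "Greater London Authority")
needs_rename <- i2$Geography %in% names(gor_lookup)
i2$Geography[needs_rename] <- gor_lookup[i2$Geography[needs_rename]]

# Toy stand-in for gor@data$NAME, in the shapefile's polygon order
shape_names <- c("North East", "Greater London Authority", "East of England")

# match() finds, for each polygon, the row of i2 holding its data, so the
# result stays in polygon order; rows with no polygon (Wales) are not picked up
aligned <- i2[match(shape_names, i2$Geography), ]
```

The `as.numeric(levels(f))[f]` idiom used in the post does the same job as `factor_to_numeric`; the version here is just a little easier to read.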
PS If the R thing is just too scary, here’s a recipe for plotting data using shapefiles in Google Fusion Tables [PDF] (alternative example) that makes use of the ShpEscape service for importing shapefiles into Fusion Tables (note that shpescape can be a bit slow converting an uploaded file and may appear to be doing nothing much at all for 10-20 minutes…). See also: Quantum GIS
Over on F1DataJunkie, 2011 Season Review Doodles…
Things have been a little quiet here of late, post-wise, in part because of the holiday season… but I have been posting notes on a couple of charts in progress over on the F1DataJunkie blog. Here are links to the posts in chronological order – they capture the evolution of the chart design(s) to date:
- F1 2011 Progress Throughout the Year
- F1 2011 Review – Another Look at Fastest Laptime Evolution
- F1 2011 Review – Qualifying Progress
- F1 2011 Review – Grid/Final Classification Deltas
- F1 2011 Review – Grid vs FInal Classification, Redux
- F1 2011 Review – Driver and Race Position Charts
You can find a copy of the data I used to create the charts here: F1 2011 Year in Review spreadsheet.
I used R to generate the charts (scripts are provided and/or linked to from the posts, or included in the comments – I’ll tidy them and pop them into a proper Github repository if/when I get a chance), loading the data into RStudio using this sort of call:
require(RCurl)
#Query a Google spreadsheet and return the result as a data frame
gsqAPI = function(key,query,gid=0){
  return( read.csv(
    paste( sep="",
      'http://spreadsheets.google.com/tq?',
      'tqx=out:csv',
      '&tq=', curlEscape(query),
      '&key=', key,
      '&gid=', curlEscape(gid) ),
    na.strings = "null" ) )
}
key='0AmbQbL4Lrd61dEd0S1FqN2tDbTlnX0o4STFkNkc0NGc'
sheet=4
qualiResults2011=gsqAPI(key,'select *',sheet)
If any other folk out there are interested in using R to wrangle with F1 data, either from 2011 or looking forward to 2012, let me know and maybe we could get a script collection going on Github:-)
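If you want to see what the gsqAPI call actually requests before pointing it at a spreadsheet, here's a small sketch that builds the same URL without fetching anything. `gsqURL` is a hypothetical helper, base R's `URLencode` stands in for RCurl's `curlEscape`, and the key shown is a placeholder; the sketch doesn't check whether the endpoint still responds in this form.

```r
# Hypothetical helper: builds the query URL that gsqAPI() hands to read.csv()
gsqURL <- function(key, query, gid = 0) {
  paste0(
    "http://spreadsheets.google.com/tq?",
    "tqx=out:csv",
    # reserved = TRUE also escapes characters such as spaces, commas and "="
    "&tq=", URLencode(query, reserved = TRUE),
    "&key=", key,
    "&gid=", URLencode(as.character(gid), reserved = TRUE)
  )
}

# "ABC123" is a placeholder, not a real spreadsheet key
example_url <- gsqURL("ABC123", "select A,B where C > 5", gid = 4)
print(example_url)
```

Printing the URL like this makes it easier to debug a query in the browser before reading the result into a data frame.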