I grabbed independent samples of 1500 recent users of the #newsnight and #bbcqt hashtags within a minute or two of each other about half an hour ago. Here’s who’s followed by 25 or more of the recent hashtaggers in each case. Can you distinguish the programmes each audience interest projection map relates to?
Here’s the first one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?
Here’s the second one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?
The answer is only a click away…
PS I’ve got a couple of scripts in the pipeline that should be able to generate data that I can use to generate this sort of differencing word cloud, the idea being I should be able to identify at a glance accounts that different hashtag communities both follow, and accounts that they differently follow…
UPDATE: so here’s a quick first pass at comparing the audiences. I’m not sure how reliable the method is, but it’s as follows:
– For each hashtag, grab 1500 recent tweets.
– Grab the list of folk the hashtagging users follow, and retain a list (the ‘interest list’) of folk followed by at least 25 of the hashtaggers.
– Filter the hashtagger list so that it only contains hashtaggers who follow at least 25 people (this cuts out brand new users and newly created spam accounts).
– Count the number of filtered hashtaggers that follow each person in the interest list, and normalise by dividing through by the total number of filtered hashtaggers.

To recap: for each tag, we now have a list of folk who were popularly followed by users of that tag, along with a number for each one between 0 and 1 describing what proportion of the hashtagging sample follows them.
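The counting-and-normalising step could be sketched in Python along the following lines. (This is a minimal illustration of the logic described above, not the author's actual script; `followers_of`, and the function and parameter names, are all made up for the example.)

```python
# Sketch of the normalisation step: count, per interest-list account, what
# proportion of the filtered hashtaggers follow it.
# followers_of maps each hashtagger to the set of accounts they follow.

def interest_proportions(followers_of, min_followed=25, min_friends=25):
    """Return {account: proportion of filtered hashtaggers who follow it}."""
    # Filter out hashtaggers who follow fewer than min_friends people
    # (cuts out brand new users and newly created spam accounts).
    filtered = {u: f for u, f in followers_of.items() if len(f) >= min_friends}

    # Count how many filtered hashtaggers follow each account.
    counts = {}
    for friends in filtered.values():
        for account in friends:
            counts[account] = counts.get(account, 0) + 1

    # Keep the 'interest list' (accounts followed by at least min_followed
    # hashtaggers), normalised by the number of filtered hashtaggers.
    n = len(filtered)
    return {a: c / n for a, c in counts.items() if c >= min_followed}
```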
(Note that there may be all sorts of sampling errors… I guess I need to qualify reports with the number of unique folk tweeting in the twitter sample captured. I maybe also need to improve sampling so rather than searching for 1500 tweets, I generate a sample of 1000 unique users of the tag?)
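The unique-user sampling idea mentioned above might look something like this sketch, where `fetch_tweets` is a placeholder for whatever paged search call is actually used (the deduplication logic is the point, not the fetching):

```python
# Keep requesting pages of tweets until we have seen 'target' unique users
# of the tag, rather than a fixed number of tweets.

def sample_unique_users(fetch_tweets, target=1000, max_requests=50):
    users = set()
    for page in range(1, max_requests + 1):
        for tweet in fetch_tweets(page):
            users.add(tweet['from_user'])  # a set ignores duplicate users
            if len(users) >= target:
                return users
    return users  # ran out of pages before hitting the target
```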
I then load these files into R and run through the following process:
#Multiply this normalised follower proportion by 1000 and round down to get an
#integer between 0 and 1000 representing a score relative to the proportion of
#filtered hashtaggers who follow each person in the interest list.
counts_newsnight$normIn = as.integer(counts_newsnight$inNorm * 1000)
counts_bbcqt$normIn = as.integer(counts_bbcqt$inNorm * 1000)

#Another filtering step: we're going to plot similarities and differences
#between folk followed by at least 25% of the corresponding filtered hashtaggers
newsnight = subset(counts_newsnight, select = c(username, normIn), subset = (inNorm >= 0.25))
bbcqt = subset(counts_bbcqt, select = c(username, normIn), subset = (inNorm >= 0.25))

#Now generate a merged dataframe
qtvnn = merge(bbcqt, newsnight, by = "username", all = T)
colnames(qtvnn) = c('username', 'bbcqt', 'newsnight')

#Replace the NA cell values (where, for example, someone in the bbcqt list
#is not in the newsnight list)
qtvnn[is.na(qtvnn)] <- 0
That generates a dataframe that looks something like this:
      username bbcqt newsnight
1    Aiannucci   414       408
2  BBCBreaking   455       464
3 BBCNewsnight   317       509
4  BBCPolitics     0       256
5   BBCr4today     0       356
6  BarackObama   296       334
Thanks to Josh O’Brien on Stack Overflow, I can recast this data frame into a term.matrix that plays nicely with the latest version of the R wordcloud package.
#Drop the username column, and use the usernames as the matrix row names
mat <- as.matrix(qtvnn[-1])
rownames(mat) <- qtvnn$username
comparison.cloud(term.matrix = mat)
commonality.cloud(term.matrix = mat)
Here’s the result – commonly followed folk:
And differentially followed folk (at above the 25% level, remember…)
So from this what can we say? Both audiences have a general news interest, a taste for pop politics and perhaps satirical comedy, maybe leaning to the left? The Question Time audience is a more casual audience, more minded to following celebrities, whereas the Newsnight audience is a bit more into following notable media folk (journalists, editors) and also political news. (I’d be keen to hear any other readings of these maps – please feel free to leave a comment containing your interpretations/observations/readings :-)
UPDATE2: to try to get a handle on what the word clouds might be telling us, here’s an alternative visual perspective on the data – a scatterplot showing how the follower proportions for the two tags compare directly:
ggplot(na.omit(subset(qtvnn, bbcqt > 0 & newsnight > 0))) +
  geom_text(aes(x = bbcqt, y = newsnight, label = username, angle = 45), size = 4) +
  xlim(200, 600) + ylim(200, 600) +
  geom_abline(intercept = 0, slope = 1, colour = 'grey')
Here’s another view – this time plotting folk followed by users of each tag who are not followed by users of the other tag [at at least the 25% level]:
I couldn’t remember/didn’t have Google to hand to find the best way of reshaping the data for this, so I ended up with a horrible horrible hack…
nn=data.frame(typ='newsnight',subset(qtvnn,select=c(username,newsnight),subset=(newsnight>0 & bbcqt==0)))
qt=data.frame(typ='bbcqt',subset(qtvnn,select=c(username,bbcqt),subset=(newsnight==0 & bbcqt>0)))
I think this is all starting to get to the point where I need to team up with a proper developer and get *all* the code properly written and documented before any errors that are currently there get baked in too deeply…
10 thoughts on “Social Media Interest Maps of Newsnight and BBCQT Twitterers”
But how did you get the 1,500 tweets? I can never get more than 100 using searchTwitter in twitteR.
@Hywel there are a couple of arguments in the search API that let you get more results: rpp (results per page), which can be set up to 100; and page, for paged results (1 to 15). Using these together you can get 15 pages of 100 results each…
Thanks but I reckon you’re referring to the Twitter search api rather than twitteR’s searchTwitter(). I can’t see that it takes a rpp parameter. So, giving up on twitteR, I’m trying to understand the json output to a Twitter api search including ‘…rpp=100&page=15’. I’m trying to get it in Python as follows:
url = 'http://search.twitter.com/search.json?q=newsnight&rpp=15&page=10'
html_resp = urllib2.urlopen(url)
soup = json.load(html_resp)
But soup seems to give the output of the 10th page, rather than pages 1-10. (I freely admit I have no idea what a ‘page’ is in this context!). Should I be using something other than urllib2.urlopen()? I’d be grateful for any advice – code even better.
The following seems almost to answer my own question: https://derrickpetzold.com/p/twitter-python-search/
though it doesn’t seem to include a page parameter, and so far I’ve still only managed to get it to produce 200 tweets, even though I set max-results=1500.
Ah – yes, my mistake – I was talking about the Twitter search API, not the R library (I’m not that familiar with twitteR – will have to explore it again…). Many APIs use paged results, as do many search engines. If you run a search on Google, the first page of results is, err, the first page, the second page of results is, err, the second page, and so on. If you ask for page 15, you get the results for page 15, not the results for pages 1-15. (This makes sense, right? If you have 20 results per page, you don’t want the 500th page of results to have to return 10000 results and then require the client to work out what the last 20 results are.) If you want to get 1000 results at 100 results per page, you need to grab the first 10 results pages with 100 results on each.
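The paging pattern described above can be sketched generically like this. (Note the old rpp/page-style Twitter search API this thread discusses has long since been retired, so `fetch_page` is left as a placeholder for whatever actually makes the HTTP request – in the comment above it would wrap `urllib2.urlopen` against search.twitter.com.)

```python
# Collect paged results by requesting page 1, then page 2, and so on,
# accumulating each batch, rather than requesting a single high page number.

def fetch_all(fetch_page, rpp=100, pages=10):
    """fetch_page(rpp, page) -> list of results for that single page."""
    results = []
    for page in range(1, pages + 1):
        batch = fetch_page(rpp, page)
        if not batch:  # stop early once a page comes back empty
            break
        results.extend(batch)
    return results
```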