Rescuing Twapperkeeper Archives Before They Vanish

A couple of years ago, various JISC folk picked up on the idea that there might be value in them thar tweets and started encouraging the use of Twapperkeeper for archiving hashtagged tweets around events, supporting the development of that service in exchange for an open source version of the code. Since then, Twapperkeeper has been sold on, with news out this week that the current Twapperkeeper archives will be deleted early in the New Year.

Over on the MASHE blog (one of the few blogs in my feeds that just keeps on delivering…), Martin Hawksey has popped up a Twapperkeeper archive importer for Google Spreadsheets that will grab up to 15,000 tweets from a Twapperkeeper archive and store them in a Google Spreadsheet (from where I’m sure you’ll be able to tap into some of Martin’s extraordinary Twitter analysis and visualisation tools, as well as export tweets to NodeXL (I think?!)).

The great thing about Martin’s “Twapperkeeper Rescue Tool” (?!;-) is that archived Tweets are hosted online, which makes them accessible for web based hacks if you have the Spreadsheet key. But which archives need rescuing? One approach might be to look to see what archives have been viewed using @andypowe11’s Summarizr? I don’t know if Andy has kept logs of which tag archives folk have analysed using Summarizr, but these may be worth grabbing? I don’t know if JISC has a list of hashtags from events they want to continue to archive? (Presumably a @briankelly risk assessment goes into this somewhere? By the by, I wonder in light of the Twapperkeeper timeline whether Brian would now feel the need to change any risk analysis he might have posted previously advocating the use of a service like Twapperkeeper?)

A more selfish approach might be to grab one or more Twapperkeeper archives onto your own computer. Grab your TwapperKeeper Archive before Shutdown! describes how to use R to do this, and is claimed to work for rescues of up to 50,000 tweets from any one archive.
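
In essence, that recipe just pulls the archive’s RSS export from Twapperkeeper and runs a couple of XPath queries over it; here’s a minimal sketch (using the same rss.php export URL that the code below cribs from that post, with an example hashtag swapped in):

require(XML)

#Minimal sketch: grab a Twapperkeeper hashtag archive's RSS export and pull out
# the tweet titles and publication dates
hashtag='ukdiscovery' #example - swap in the tag of the archive you want to rescue
url <- paste("http://twapperkeeper.com/rss.php?type=hashtag&name=",hashtag,"&l=50000", sep="")
doc <- xmlTreeParse(url,useInternal=T)
tweet <- xpathSApply(doc, "//item//title", xmlValue)
pubDate <- xpathSApply(doc, "//item//pubDate", xmlValue)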

Building on the R code from that example, along with a couple of routines from my own previous R’n’Twitter experiments (Getting Started With Twitter Analysis in R), here’s some R code that will grab items from a Twapperkeeper archive and parse them into a dataframe that also includes Twitter IDs, sender, to and RT information:

**NOTE – IN TESTING, THIS CODE HAS CHOKED ON CERTAIN ARCHIVES (CHARACTER ENCODINGS IN SOME ARCHIVES BREAKING THINGS ON IMPORT) – FOR UPDATED CODE, SEE THE REDUX POST BELOW**

require(XML)
require(stringr)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

twapperkeeperRescue=function(hashtag){
    #startCrib: http://libreas.wordpress.com/2011/12/09/twapperkeeper/
    url <- paste("http://twapperkeeper.com/rss.php?type=hashtag&name=",hashtag,"&l=50000", sep="")
    doc <- xmlTreeParse(url,useInternal=T)
    tweet <- xpathSApply(doc, "//item//title", xmlValue)  
    pubDate <- xpathSApply(doc, "//item//pubDate", xmlValue)
    #endCrib
    df=data.frame(cbind(tweet,pubDate))
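    #The archived RSS item titles bundle the sender, message text and tweet id together;
    # the following lines pull those bits back out into separate columns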
    df$from=sapply(df$tweet,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    df$id=sapply(df$tweet,function(tweet) str_extract(tweet,"[[:digit:]/s]*$"))
    df$txt=sapply(df$tweet,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:]/s]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}
#usage: 
#tag='ukdiscovery'
#twarchive.df=twapperkeeperRescue(tag)

#if you want to save the parsed archive:
twapperkeeperSave=function(hashtag,path='./'){
    tweet.df=twapperkeeperRescue(hashtag)
    #use sep="" so the filename doesn't end up with spaces in it
    fn <- paste(path,"twArchive_",hashtag,".csv",sep="")
    write.csv(tweet.df,fn)
}
#usage:
#twapperkeeperSave(tag)

If I get a chance, I’ll try to post some visualisation/analysis functions too…
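
In the meantime, here’s a rough sketch of the sort of quick look you can take at a rescued archive once it’s in a dataframe (column names as per the parser above; the date handling assumes the archive’s pubDate values are in the usual RSS form, e.g. ‘Mon, 05 Dec 2011 12:34:56 +0000’, and an English locale):

#Quick and dirty sketch of some simple archive stats
#tw.df=twapperkeeperRescue('ukdiscovery')

#Who posted the most tweets into the archive?
head(sort(table(tw.df$from),decreasing=TRUE),10)

#Whose tweets were RT'd most often?
head(sort(table(na.omit(tw.df$rt)),decreasing=TRUE),10)

#How many tweets were archived per day?
tw.df$date=as.Date(strptime(as.character(tw.df$pubDate),"%a, %d %b %Y %H:%M:%S",tz="GMT"))
table(tw.df$date)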

PS I also doodled a Python script to download (even large) Twapperkeeper archives, by user

Rescuing Twapperkeeper Archives Before They Vanish, Redux

In Rescuing Twapperkeeper Archives Before They Vanish, I described a routine for grabbing Twapperkeeper archives, parsing them, and saving them to a local desktop file using the R programming language (downloading RStudio is the easiest way I know of getting R…).

Following a post from @briankelly (Responding to the Forthcoming Demise of TwapperKeeper), in which Brian described how to look up all the archives saved by a person on Twapperkeeper and use that as the basis for an archive rescue strategy, I thought I’d tweak my code to grab all the hashtag archives for a particular user (other archive types are also available, such as search term archives; I don’t grab the list of these… if you fancy generalising the code, please post a link to it in the comments;-)

What should have been a trivial task didn’t work, of course: the R XML parser seemed to choke on some of the archive files, claiming they weren’t in the UTF-8 encoding they declared. Character encodings are still something I don’t understand at all (and more than a few times they have caused me to give up on a hack), but on the offchance, I tried using a more resilient file loader (curl, if that means anything to you…;-) rather than the XML package loader, and it seems to do the trick (warnings are still raised, but that’s an improvement on errors, which tend to cause everything to stop).

Anyway, here’s the revised code, along with an additional routine for grabbing all the hashtag archives saved on Twapperkeeper by a named individual. If I get a chance (i.e. when I learn how to do it!), I’ll add in a line or two that will grab all the archives from a list of named individuals (there’s a rough sketch of one way that might go after the code below)…

require(XML)
require(stringr)
require(RCurl)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
tagtrim <- function (x) sub('#','',x)

twapperkeeperRescue=function(hashtag,num=10000){
    #startCrib: http://libreas.wordpress.com/2011/12/09/twapperkeeper/
    #tweak - reduce to a grab of 10000 archived tweets
    url <- paste("http://twapperkeeper.com/rss.php?type=hashtag&name=",hashtag,"&l=",num, sep="")
    print(url)
    #This is a hackfix I tried on spec - use the RCurl library to load in the file...
    lurl=getURL(url)
    #...then parse it, rather than loading it in directly using the XML parser...
    doc <- xmlTreeParse(lurl,useInternal=T,encoding = "UTF-8")
    tweet <- xpathSApply(doc, "//item//title", xmlValue)  
    pubDate <- xpathSApply(doc, "//item//pubDate", xmlValue)
    #endCrib
    df=data.frame(cbind(tweet,pubDate))
    print('...extracting from...')
    df$from=sapply(df$tweet,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    print('...extracting id...')
    df$id=sapply(df$tweet,function(tweet) str_extract(tweet,"[[:digit:]/s]*$"))
    print('...extracting txt...')
    df$txt=sapply(df$tweet,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:]/s]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    print('...extracting to...')
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    print('...extracting rt...')
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}

#usage: 
#tag='ukdiscovery'
#twarchive.df=twapperkeeperRescue(tag)

#if you want to save the parsed archive:
twapperkeeperSave=function(hashtag,num=10000,path='./'){
    tweet.df=twapperkeeperRescue(hashtag,num)
    #use sep="" so the filename doesn't end up with spaces in it
    fn <- paste(path,"twArchive_",hashtag,".csv",sep="")
    write.csv(tweet.df,fn)
}
#usage:
#twapperkeeperSave(tag)


#The following function grabs a list of hashtag archives saved by a given user
# and then rescues each archive in turn...
twapperkeeperUserRescue=function(uname='psychemedia',num=10000){
	#This routine only grabs hashtag archives;
	#Search archives and other archives can also be identified and downloaded if you feel like generalising this bit of code...;-)
	url=paste('http://twapperkeeper.com/allnotebooks.php?type=hashtag&name=&description=&tag=&created_by=',uname,sep='')
	archives=readHTMLTable(url,which=2,header=T)
	archives$Name=sapply(archives$Name,function(tag) tagtrim(tag))
	mapply(twapperkeeperSave,archives$Name,num)
}
#usage:
#user='psychemedia'
#twapperkeeperUserRescue(user)
#twapperkeeperUserRescue(user,1000)
#The numerical argument is the number of archived tweets you want to save (max 50000)
#Note to self: need to trap this maxval...
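
And here’s the promised rough sketch of one way of working through a list of named individuals, simply by looping over twapperkeeperUserRescue (the user names in the usage example are just placeholders):

#Sketch: rescue the hashtag archives saved by each user in a list of users
twapperkeeperUserListRescue=function(unames,num=10000){
	for (uname in unames) twapperkeeperUserRescue(uname,num)
}
#usage:
#users=c('psychemedia','someotheruser')
#twapperkeeperUserListRescue(users,1000)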

Now… do I build some archive analytics and visualisation on top of this, or do I have a play with building an archive rescuer in Scraperwiki?!

PS I also doodled a Python script to download (even large) Twapperkeeper archives, by user

Python Script for Exporting (Large) Twapperkeeper Archives By User

FWIW, I started putting together a script that will grab individual hashtag archives, or all the hashtag archives created by a single user, from Twapperkeeper (which is shutting off its archives in early January).

The script should be capable of grabbing tweets from even large archives (hundreds of thousands/millions of tweets), though it’s probably not very efficient in the way it does it…

You can find the script here: Twapperkeeper archive rescue

If you have any problems with the script, or make any improvements to it, please let me know via the comments…