OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Rescuing Twapperkeeper Archives Before They Vanish

A couple of years or so ago, various JISC folk picked up on the idea that there might be value in them thar tweets and started encouraging the use of Twapperkeeper for archiving hashtagged tweets around events, supporting the development of that service in exchange for an open source version of the code. Since then, Twapperkeeper has been sold on, with news out this week that the current Twapperkeeper archives will be deleted early in the New Year.

Over on the MASHe blog (one of the few blogs in my feeds that just keeps on delivering…), Martin Hawksey has popped up a Twapperkeeper archive importer for Google Spreadsheets that will grab up to 15,000 tweets from a Twapperkeeper archive and store them in a Google Spreadsheet, from where I’m sure you’ll be able to tap into some of Martin’s extraordinary Twitter analysis and visualisation tools, as well as export tweets to NodeXL (I think?!).

The great thing about Martin’s “Twapperkeeper Rescue Tool” (?!;-) is that archived Tweets are hosted online, which makes them accessible for web based hacks if you have the Spreadsheet key. But which archives need rescuing? One approach might be to look to see what archives have been viewed using @andypowe11’s Summarizr? I don’t know if Andy has kept logs of which tag archives folk have analysed using Summarizr, but these may be worth grabbing? I don’t know if JISC has a list of hashtags from events they want to continue to archive? (Presumably a @briankelly risk assessment goes into this somewhere? By the by, I wonder in light of the Twapperkeeper timeline whether Brian would now feel the need to change any risk analysis he might have posted previously advocating the use of a service like Twapperkeeper?)
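If you do have the Spreadsheet key, pulling the archive into your own code is just a matter of building the CSV export URL for the sheet. A minimal sketch in Python (the export endpoint shown is Google's current one, which may differ from the feeds-based URLs of the time, and `SPREADSHEET_KEY` is a placeholder, not a real key):

```python
import urllib.request


def spreadsheet_csv_url(key, gid=0):
    """Build a CSV export URL for a public Google Spreadsheet.

    `key` is the long identifier from the sheet's share URL; `gid`
    selects the worksheet (0 is the first one).
    """
    return ("https://docs.google.com/spreadsheets/d/%s/export?format=csv&gid=%d"
            % (key, gid))


# Hypothetical usage -- 'SPREADSHEET_KEY' is a placeholder:
# csv_data = urllib.request.urlopen(spreadsheet_csv_url("SPREADSHEET_KEY")).read()
```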

A more selfish approach might be to grab one or more Twapperkeeper archives onto your own computer. Grab your TwapperKeeper Archive before Shutdown! describes how to use R to do this, and is claimed to work for rescues of up to 50,000 tweets from any one archive.

Building on the R code from that example, along with a couple of routines from my own previous R’n’Twitter experiments (Getting Started With Twitter Analysis in R), here’s some R code that will grab items from a Twapperkeeper archive and parse them into a dataframe that also includes Twitter IDs, sender, to and RT information:

**NOTE – IN TESTING, THIS CODE HAS CHOKED ON CERTAIN ARCHIVES (CHARACTER ENCODINGS IN SOME ARCHIVES BREAK THINGS ON IMPORT) – FOR UPDATED CODE, SEE THE REDUX POST**

require(XML)
require(stringr)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

twapperkeeperRescue=function(hashtag){
    #startCrib: http://libreas.wordpress.com/2011/12/09/twapperkeeper/
    url <- paste("http://twapperkeeper.com/rss.php?type=hashtag&name=",hashtag,"&l=50000", sep="")
    doc <- xmlTreeParse(url, useInternalNodes=TRUE)
    tweet <- xpathSApply(doc, "//item//title", xmlValue)
    pubDate <- xpathSApply(doc, "//item//pubDate", xmlValue)
    #endCrib
    df=data.frame(tweet=tweet, pubDate=pubDate, stringsAsFactors=FALSE)
    #Sender: each item title starts "username: ..."
    df$from=sapply(df$tweet,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    #Tweet ID: each item title ends "- tweet id NNNN"
    df$id=sapply(df$tweet,function(tweet) str_trim(str_extract(tweet,"[[:digit:][:space:]]*$")))
    #Strip the trailing id and date, and the leading "username:", to leave the tweet text
    df$txt=sapply(df$tweet,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:][:space:]]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    #To: tweets that open with @username
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    #RT: tweets that open with RT @username
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}
#usage: 
#tag='ukdiscovery'
#twarchive.df=twapperkeeperRescue(tag)

#if you want to save the parsed archive:
twapperkeeperSave=function(hashtag,path='./'){
    tweet.df=twapperkeeperRescue(hashtag)
    #sep='' so the filename doesn't contain spaces
    fn <- paste(path,"twArchive_",hashtag,".csv",sep='')
    write.csv(tweet.df,fn)
}
#usage:
#twapperkeeperSave(tag)
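For anyone working outside R, the same field extraction can be sketched in Python. The title format assumed below ("sender: text – tweet id NNN") is inferred from the regular expressions in the R code above, so treat it as an assumption rather than a documented format (and note this sketch skips the fixed-width date trimming the R code does):

```python
import re


def parse_title(title):
    """Split a Twapperkeeper RSS item title into sender, id, to, RT
    and text fields, mirroring the R regexes above."""
    # Tweet ID: trailing "- tweet id NNNN"
    m_id = re.search(r"- tweet id (\d+)\s*$", title)
    tweet_id = m_id.group(1) if m_id else None
    # Drop the id suffix, then peel off the leading "username:"
    txt = re.sub(r"\s*- tweet id \d+\s*$", "", title)
    m_from = re.match(r"(\w+):\s*", txt)
    sender = m_from.group(1) if m_from else None
    if m_from:
        txt = txt[m_from.end():]
    # To: tweets that open with @username; RT: tweets that open with RT @username
    m_to = re.match(r"@(\w+)", txt)
    m_rt = re.match(r"RT @(\w+)", txt)
    return {"from": sender, "id": tweet_id,
            "to": m_to.group(1) if m_to else None,
            "rt": m_rt.group(1) if m_rt else None,
            "txt": txt.strip()}
```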

If I get a chance, I’ll try to post some visualisation/analysis functions too…

PS I also doodled a Python script to download (even large) Twapperkeeper archives, by user
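That script isn't reproduced here, but the shape of it is straightforward: hit the same RSS endpoint the R code uses and pull out the item titles and dates. A minimal sketch (the endpoint is long dead, so the fetch is left commented out as hypothetical usage):

```python
import urllib.request
import xml.etree.ElementTree as ET

# The same RSS endpoint the R code above calls
TWAPPERKEEPER_RSS = "http://twapperkeeper.com/rss.php?type=hashtag&name=%s&l=%d"


def archive_url(hashtag, limit=50000):
    """Build the Twapperkeeper RSS URL for a hashtag archive."""
    return TWAPPERKEEPER_RSS % (hashtag, limit)


def parse_archive(rss_xml):
    """Pull (title, pubDate) pairs out of a Twapperkeeper RSS document."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("pubDate"))
            for item in root.iter("item")]


# Hypothetical usage -- the service no longer exists, so this won't resolve:
# rss = urllib.request.urlopen(archive_url("ukdiscovery")).read()
# tweets = parse_archive(rss)
```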

Written by Tony Hirst

December 10, 2011 at 9:09 pm

Posted in Anything you want, Rstats


5 Responses


  1. http://thinkupapp.com/ is also handy opensource tool for Twitter archival

    Dan Brickley

    December 10, 2011 at 9:54 pm

  2. [...] Martin Hawksey has published a post on his MASHe blog which describes how you can Free the tweets! Export TwapperKeeper archives using Google Spreadsheet.  Martin’s post also links to a post entitled LIBREAS.Library Grab your TwapperKeeper Archive before Shutdown! which describes a technique which can be used by those familiar with R code. Tony Hirst on the OUseful.info blog has also listed a technical solution based on R code in his post on Rescuing Twapperkeeper Archives Before They Vanish. [...]

  3. [...] a great post on solutions for downloading tweets including Tony Hirst’s post on Rescuing Twitter Archives before they Vanish and using Martin Hawksey exporter tool that is build on a google spreadsheet. I’ve already [...]

  4. [...] Rescuing Twapperkeeper Archives Before They Vanish, I described a routine for grabbing Twapperkeeper archives, parsing them, and saving them to a [...]

  5. [...] Rescuing Twapperkeeper Archives Before They Vanish [...]


