Harvesting Searched for Tweets Using Python

Via Tanya Elias/eliast05, a query regarding tools for harvesting historical tweets. I haven’t been keeping track of Twitter-related tools over the last few years, so my first thought is often “could Martin Hawksey’s TAGSexplorer do it?”!

But I’ve also had the twecoll Python/command line package on my ‘to play with’ list for a while, so I thought I’d give it a spin. Note that the code requires Python to be installed (which it will be, by default, on a Mac).

On the command line, something like the following should be enough to get you up and running if you’re on a Mac (run the commands in a Terminal, available from the Utilities folder in the Applications folder). If wget is not available, download the twecoll file to the twitterstuff directory, and save it as twecoll (no suffix).

#Change directory to your home directory
$ cd

#Create a new directory - twitterstuff - in your home directory
$ mkdir twitterstuff

#Change directory into that directory
$ cd twitterstuff

#Fetch the twecoll code
$ wget https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
--2016-05-02 14:51:23--  https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
Resolving raw.githubusercontent.com... 23.235.43.133
Connecting to raw.githubusercontent.com|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31445 (31K) [text/plain]
Saving to: 'twecoll'
 
twecoll                    100%[=========================================>]  30.71K  --.-KB/s   in 0.05s  
 
2016-05-02 14:51:24 (564 KB/s) - 'twecoll' saved [31445/31445]

#If you don't have wget installed, download the file from:
#https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
#and save it in the twitterstuff directory as twecoll (no suffix)

#Show the directory listing to make sure the file is there
$ ls
twecoll

#Change the permissions on the file to 'user - executable'
$ chmod u+x twecoll

#Run the command file - the ./ reads as: 'in the current directory'
$ ./twecoll tweets -q "#lak16"

Running the code the first time prompts you for some Twitter API credentials (follow the guidance on the twecoll homepage), but this only needs doing once.

Testing the app, it works – tweets are saved as a text file in the current directory with an appropriate filename and suffix .twt – BUT the search doesn’t go back very far in time. (The standard Twitter search API only covers roughly the most recent week of tweets.)

Looking around for an alternative, I found the GetOldTweets python script, which again can be run from the command line: download the zip file from Github, move it into the twitterstuff directory, and unzip it. On the command line (if you’re still in the twitterstuff directory), run:

ls

to check the name of the folder (something like GetOldTweets-python-master), and then move into the unzipped folder:

cd GetOldTweets-python-master/

Note that I found I had to install pyquery to get the script to run; on the command line, run: easy_install pyquery.

This script does not require credentials – instead it scrapes the Twitter web search. Date limits and a maximum number of tweets for the search can be set explicitly.

python Exporter.py --querysearch '#lak15' --since 2015-03-10 --until 2015-09-12 --maxtweets 500

Tweets are saved into the file output_got.csv and are semicolon delimited.
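As a rough sketch of how a semicolon-delimited file like this might be read back into Python, the standard csv module can be told to split on semicolons. The column names and sample rows below are assumptions made up for illustration (they are not taken from the script's actual output), and the sketch also assumes that any field containing a semicolon is wrapped in double quotes.

```python
import csv
import io

def read_got_rows(text):
    """Parse semicolon-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return list(reader)

# Hypothetical sample standing in for the contents of output_got.csv;
# the column names are assumptions, not the script's documented header.
sample = (
    "username;date;text\n"
    'alice;2015-03-11 09:30;"Looking forward to #lak15; see you there"\n'
    "bob;2015-03-12 14:05;Slides from my #lak15 talk\n"
)

rows = read_got_rows(sample)
for row in rows:
    print(row["username"], "|", row["text"])
```

To try this against a real file under Python 3, the same reader could be given a file handle opened with open("output_got.csv", newline="", encoding="utf-8") in place of the io.StringIO wrapper.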

A couple of things I noticed with this script: it’s slow (because it “scrolls” through pages and pages of Twitter search results, which only have a small number of results on each) and on occasion seems to hang (I’m not sure if it gets stuck in an infinite loop; on a couple of occasions I used ctrl-z to break out, though note that ctrl-z only suspends the process, whereas ctrl-c interrupts it). In such a case, because the script doesn’t currently save results as it goes along, you end up with nothing; reduce the --maxtweets value, and try again. On occasion, when running the script under the default Mac python 2.7, I noticed that there may be encoding issues in tweets which break the output, so again the file can’t get written.
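Both problems (nothing saved after a hang, and encoding errors at write time) are the sort of thing that writing results out in batches, with an explicit UTF-8 encoding, might help with. The following is a minimal Python 3 sketch of that idea, not a description of how GetOldTweets works; the function name, file name and sample batches are all hypothetical.

```python
import csv
import os
import tempfile

def save_incrementally(batches, path):
    """Append each batch of rows to a UTF-8, semicolon-delimited file as it arrives."""
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=";")
        for batch in batches:
            writer.writerows(batch)
            handle.flush()  # hand this batch to the OS before fetching the next one
            written += len(batch)
    return written

# Hypothetical batches standing in for successive pages of scraped results.
batches = [
    [("alice", "Keynote at #lak15 was great \u2603")],
    [("bob", "caf\u00e9 chat about learning analytics")],
]

path = os.path.join(tempfile.mkdtemp(), "partial_results.csv")
count = save_incrementally(batches, path)

with open(path, encoding="utf-8") as handle:
    saved = handle.read()
print(count, "rows written")
```

Because each batch is flushed as soon as it is written, interrupting the run part way through should still leave the earlier batches in the file.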

Both packages run from the command line, or can be scripted from a Python programme (though I didn’t try that). If the GetOldTweets-python package can be tightened up a bit (eg in respect of UTF-8/tweet encoding issues, which are often a bugbear in Python 2.7), it looks like it could be a handy little tool. And for collecting stuff via the API (which requires authentication), rather than by scraping web results from advanced search queries, twecoll looks as if it could be quite handy too.
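For what it's worth, one way of scripting the command line call from Python might look something like the following. This is an untested sketch that simply wraps the Exporter.py command shown earlier using the standard subprocess module; the helper function names are made up, and the folder name is the one assumed above.

```python
import subprocess

def build_exporter_command(query, since, until, max_tweets):
    """Build the argument list for the Exporter.py call shown earlier."""
    return [
        "python", "Exporter.py",
        "--querysearch", query,
        "--since", since,
        "--until", until,
        "--maxtweets", str(max_tweets),
    ]

def run_exporter(command, folder="GetOldTweets-python-master"):
    """Run the command inside the unzipped folder and return its exit code."""
    return subprocess.call(command, cwd=folder)

command = build_exporter_command("#lak15", "2015-03-10", "2015-09-12", 500)
print(" ".join(command))
# run_exporter(command)  # uncomment once the script and pyquery are in place
```

Passing the arguments as a list means the query string (including the # character) reaches the script without needing any shell quoting.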