Harvesting Searched for Tweets Using Python

Via Tanya Elias/eliast05, a query regarding tools for harvesting historical tweets. I haven’t been keeping track of Twitter-related tools over the last few years, so my first thought is often “could Martin Hawksey’s TAGSExplorer do it?”!

But I’ve also had the twecoll Python/command line package on my ‘to play with’ list for a while, so I thought I’d give it a spin. Note that the code requires Python to be installed (which it will be, by default, on a Mac).

On the command line, something like the following should be enough to get you up and running if you’re on a Mac (run the commands in a Terminal, available from the Utilities folder in the Applications folder). If wget is not available, download the twecoll file to the twitterstuff directory, and save it as twecoll (no suffix).

#Change directory to your home directory
$ cd

#Create a new directory - twitterstuff - in your home directory
$ mkdir twitterstuff

#Change directory into that directory
$ cd twitterstuff

#Fetch the twecoll code
$ wget https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
--2016-05-02 14:51:23--  https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
Resolving raw.githubusercontent.com... 23.235.43.133
Connecting to raw.githubusercontent.com|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31445 (31K) [text/plain]
Saving to: 'twecoll'
 
twecoll                    100%[=========================================>]  30.71K  --.-KB/s   in 0.05s  
 
2016-05-02 14:51:24 (564 KB/s) - 'twecoll' saved [31445/31445]

#If you don't have wget installed, download the file from:
#https://raw.githubusercontent.com/jdevoo/twecoll/master/twecoll
#and save it in the twitterstuff directory as twecoll (no suffix)

#Show the directory listing to make sure the file is there
$ ls
twecoll

#Change the permissions on the file to 'user - executable'
$ chmod u+x twecoll

#Run the command file - the ./ reads as: 'in the current directory'
$ ./twecoll tweets -q "#lak16"

Running the code the first time prompts you for some Twitter API credentials (follow the guidance on the twecoll homepage), but this only needs doing once.

Testing the app, it works – tweets are saved as a text file in the current directory with an appropriate filename and suffix .twt – BUT the search doesn’t go back very far in time. (Is the Twitter search API crippled then…?)

Looking around for an alternative, I turned to the GetOldTweets Python script, which again can be run from the command line; download the zip file from Github, move it into the twitterstuff directory, and unzip it. On the command line (if you’re still in the twitterstuff directory), run:

ls

to check the name of the folder (something like GetOldTweets-python-master) and then cd into it:

cd GetOldTweets-python-master/

to move into the unzipped folder.

Note that I found I had to install pyquery to get the script to run; on the command line, run: easy_install pyquery.

This script does not require credentials – instead it scrapes the Twitter web search. Date limits and a maximum number of tweets for the search can be set explicitly.

python Exporter.py --querysearch '#lak15' --since 2015-03-10 --until 2015-09-12 --maxtweets 500

Tweets are saved into the file output_got.csv and are semicolon delimited.
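
For a quick look at what came back, the file can be loaded with Python’s csv module. A minimal sketch, assuming output_got.csv is in the current directory and treating the first row as a header rather than assuming particular column names (these vary between versions of the script):

import csv

#Load the semicolon delimited file generated by Exporter.py
with open('output_got.csv') as f:
    reader = csv.reader(f, delimiter=';')
    header = next(reader)
    rows = list(reader)

print(header)
print('Number of tweets harvested: %d' % len(rows))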

A couple of things I noticed with this script: it’s slow (because it “scrolls” through pages and pages of Twitter search results, which only contain a small number of results each), and it occasionally seems to hang (I’m not sure if it gets stuck in an infinite loop; on a couple of occasions I used ctrl-z to break out). In such a case, it doesn’t currently save results as it goes along, so you end up with nothing; reduce the --maxtweets value and try again. I also noticed that, when running the script under the default Mac Python 2.7, encoding issues in tweets can break the output, so again the file doesn’t get written.
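
One workaround I can imagine (a sketch only, untested; it assumes the command line flags shown above, and that each run writes its results to output_got.csv in the current directory) is to drive Exporter.py over a series of short date windows from Python, renaming each chunk’s output file as you go, so a hang or an encoding error only costs you one window’s worth of tweets:

import os
import subprocess
from datetime import date, timedelta

#Harvest in fortnightly chunks so a failed run only loses one chunk
start = date(2015, 3, 10)
end = date(2015, 9, 12)
window = timedelta(days=14)

chunk_start = start
while chunk_start < end:
    chunk_end = min(chunk_start + window, end)
    subprocess.call(['python', 'Exporter.py',
                     '--querysearch', '#lak15',
                     '--since', chunk_start.isoformat(),
                     '--until', chunk_end.isoformat()])
    #Stash the output file for this chunk before the next run overwrites it
    if os.path.exists('output_got.csv'):
        os.rename('output_got.csv', 'output_got_%s.csv' % chunk_start.isoformat())
    chunk_start = chunk_end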

Both packages run from the command line, or can be scripted from a Python programme (though I didn’t try that). If the GetOldTweets-python package can be tightened up a bit (eg in respect of UTF-8/tweet encoding issues, which are often a bugbear in Python 2.7), it looks like it could be a handy little tool. And for collecting stuff via the API (which requires authentication), rather than by scraping web results from advanced search queries, twecoll looks as if it could be quite handy too.

Revisiting My Twitter Harvesting Code

Despite having suffered a catastrophic/unrecoverable hard-disk failure on the (unbacked-up) machine I had my Twitter harvesting notebooks (and cached data database) on, I did manage to find a reasonably current version of the code (via Github gists and Dropbox) and have spent a few evening hours tinkering with it over the last ten days or so.

So, as a quick note-to-self, here’s a list of the functions I currently have to hand:

  • search for users using a recent search terms: get a list of users recently using a particular term or phrase;
  • search for users using a recent hashtag: get a list of users recently using a particular hashtag;
  • generate maps of folk commonly followed by users of the searchterm/tag: from the term or tag userlist, find the folk commonly followed by those users and generate a network edge list (a sketch of this style of function follows the list);
  • get members of a list: get a list of the members of a particular list;
  • get lists a person is a member of: get a list of the lists a user is a member of; optionally limit to lists with more than a certain number of followers;
  • triangulate lists: find lists that several specified users are a member of, thresholded (so e.g. lists where at least 3 of 5 people mentioned are on the list); also limit by minimum number of subscribers to list (so we can ignore lists with no subscribers etc). List triangulation can be applied to lists of users e.g. folk using a particular hashtag; so we have a route to finding lists that may be topically related to a particular tag;
  • download members of lists a specified user is a member of: for the lists a particular user is a member of, grab details of all the members of those lists;
  • get all friends/followers of a user: this can be limited to a maximum number of friends/followers (eg 5000);
  • get common friends of (sampled) followers of a user: for a particular user, get their followers, sample N of them, then find folk commonly followed by that sample; output as a graph edge list;
  • find common followers of a set of specified users: for a list of users (e.g. recent users of a particular hashtag), find folk who follow a minimum number of them, or who are followed by a minimum number of them;
  • tag user biographies using Thomson Reuters OpenCalais and IBM Alchemy APIs: this tagging can be easily applied to all users in a list, tagging their biographies one at a time.
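
By way of illustration (this isn’t the actual notebook code), here’s a minimal sketch of the “folk commonly followed” style of function, assuming a tweepy 3.x-era API object (friends_ids) and placeholder credentials and screen names (all hypothetical); it grabs the first page of up to 5000 friend ids per user, counts how many of the seed users follow each account, and writes out a simple thresholded edge list:

from collections import Counter

import tweepy

#Fill in your own Twitter API credentials here
consumer_key = ''
consumer_secret = ''
key = ''
secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

#e.g. a list of recent users of a particular hashtag (hypothetical names)
seed_users = ['user_one', 'user_two', 'user_three']

#Grab (the first page of up to 5000 of) the ids of the accounts each seed user follows
friends = {}
for screen_name in seed_users:
    friends[screen_name] = set(api.friends_ids(screen_name=screen_name))

#Count how many of the seed users follow each account
counts = Counter(fid for ids in friends.values() for fid in ids)

#Write an edge list, keeping only accounts followed by at least two seed users
threshold = 2
with open('commonfriends_edges.csv', 'w') as f:
    for screen_name, ids in friends.items():
        for fid in ids:
            if counts[fid] >= threshold:
                f.write('%s,%d\n' % (screen_name, fid))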

I’ve also started looking again at generating topic models around Twitter data, starting with user biographies (which so far is not very interesting!)

With these various functions, it’s easy enough to generate various combinations of emergent social positioning map. I’ve started exploring various Python libraries for clustering and laying out maps automatically, but tend to fall back to handcrafting the displays using Gephi. On the to-do list is to try to automate the Gephi side, at least for a first pass, using the Gephi toolkit, though at the moment that looks like it will require me to get my head round a bit of Java. Ideally, I’d like to be able to see a Gephi endpoint (perhaps from a Gephi headless server running in a Docker container…?:-), give it a graph file and a config file, and get a PDF, SVG or PNG layout back…
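
As a crude first pass on the Python side (a sketch, not my actual pipeline; it assumes networkx 2.x and an edge list file of the sort generated by the sketch above), something like the following will lay a graph out with a spring layout, save a quick PNG render, and export a GEXF file that can then be hand-finished in Gephi:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import networkx as nx

#Load a directed edge list of the form: follower,followed-account
G = nx.read_edgelist('commonfriends_edges.csv', delimiter=',',
                     create_using=nx.DiGraph())

#A rough automatic layout and a quick PNG render...
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos, node_size=30, with_labels=False)
plt.savefig('esp_map.png', dpi=150)

#...and a GEXF export so the layout and styling can be finished off in Gephi
nx.write_gexf(G, 'esp_map.gexf')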

I also need to do a couple of proof-of-concept one-off printed outputs for myself, like getting an ESP map printed as an A0 poster or folded map.

Fishing for OU Twitter Folk…

Just a quick observation inspired by the online “focus group” on Twitter yesterday around the #twitterou hashtag (a discussion for OU folk about Twitter usage): a few minutes into the discussion, I grabbed a list of the folk who had used the tag so far (about 10 or so people at the time), pulled down a list of the people they followed to construct a graph of hashtaggers->friends, and then filtered the resulting graph to show folk with node degree of 5 or more.

twitterOU - folk followed by 5 or more folk using twitterou before 2.10 or so today

Because a large number of OU Twitter folk follow each other, the graph is quite dense, which means that if we take a sample of known OU users and look for people that a majority of that sample follow, we stand a reasonable chance of identifying other OU folk…

Doing a bit of List Intelligence (looking up the lists that a significant number of hashtag users were on), I identified several OU folk Twitter lists, most notably @liamgh/planetou and @guyweb/openuniversity.

Just for completeness, it’s also worth pointing out that simple community analysis of followers of a known OU person might also turn up OU clusters, e.g. as described in Digging Deeper into the Structure of My Twitter Friends Network: Librarian Spotting. I suspect if we did clique analysis on the followers, this might also identify ‘core’ members of organisational communities that could be used to seed a snowball discovery mechanism for more members of that organisation.
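
Something like the following gives the flavour of that clique analysis (a sketch, assuming networkx and a directed follower edge list of the kind described above; filenames and the clique size threshold are illustrative), restricting attention to mutual-follow relationships:

import networkx as nx

#Directed edges of the form: follower,followed-account
G = nx.read_edgelist('followers_edges.csv', delimiter=',', create_using=nx.DiGraph())

#Keep only reciprocal (mutual follow) relationships
M = nx.Graph()
M.add_edges_from((u, v) for u, v in G.edges() if G.has_edge(v, u))

#Cliques of four or more mutual followers as candidate 'core' community members
cores = [c for c in nx.find_cliques(M) if len(c) >= 4]
print(cores)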

PS hmmm… maybe I need to do a post or two on how we might go about discovering enterprise/organisation networks/communities on Twitter…?

Segmented Communications on Twitter via @-partner Messaging

As this blog rarely attracts comments, it can be quite hard for me to know who, if anyone, regularly reads it (likely known suspects and the Googlebot aside). The anonymous nature of feed reader subscriptions also means it’s tricky to know who (if anyone) is reading the blog at all…

Twitter is slightly different in this regard, because for the majority of accounts, the friends and followers lists are public; which means it’s possible to “position” a particular account in terms of the interests of the folk it follows and who follow it.

Whilst I was putting together A Couple More Social Media Positioning Maps for UK HE Twitter Accounts, I considered including a brief comment on how the audience of a popular account will probably segment into different interest groups, and whether or not there was any mileage in trying to customise messages to particular segments without alienating the other parts of the audience.

Seeing @eingang’s use yesterday of a new (to me) Twitter convention of sending hashtagged messages to @hidetag, so that folk following the hashtag would see the tweet, but Michelle’s followers wouldn’t necessarily see the tagged tweets (no-one should follow @hidetag, NO_ONE ;-), it struck me that we might be able to use a related technique to send messages that are only visible to a particular segment of the followers of a Twitter account…

How so?

Firstly, you need to know that public Twitter messages sent to a particular person by starting the message with an @name are generally only visible in the streams of folk who follow both the sender and @name (identifying this population was one of the reasons I put together the Who Can See Whose Conversations In-stream on Twitter? tool).

Secondly, you need to do a bit of social network analysis. (In what follows, I assume a directed graph where a node from A to B means that A follows B, or equivalently, B is a friend of A.) A quick and dirty approach might be to use in-degree and out-degree, or maybe the HITS algorithm/authority and hub values, as follows: identify the audience segment you want to target by looking for clusters in how your followers follow each other, then do a bit of network analysis on that segment to look for Authority nodes or nodes that are followed by a large number of people in that segment who also follow you. If you now send a message to that Authority/high in-degree node, it will be seen in-stream by that user, as well as those of your followers who also follow that Authority account.
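
A sketch of that analysis, assuming networkx 2.x and a directed edge list for the segment where an edge A,B means A follows B (filenames illustrative):

import networkx as nx

#Edges of the form A,B meaning 'A follows B', restricted to one audience segment
G = nx.read_edgelist('segment_follows.csv', delimiter=',', create_using=nx.DiGraph())

#In-degree = number of followers within the segment
by_indegree = sorted(G.in_degree(), key=lambda pair: pair[1], reverse=True)

#HITS: authorities are the widely followed accounts, hubs the avid followers
hubs, authorities = nx.hits(G)
by_authority = sorted(authorities.items(), key=lambda pair: pair[1], reverse=True)
by_hub = sorted(hubs.items(), key=lambda pair: pair[1], reverse=True)

#Candidate @-partner accounts for this segment
print(by_indegree[:5])
print(by_authority[:5])

(The by_hub ranking is the one that comes into play for the monitoring idea discussed below.)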

This approach can be seen as a version of co-branding/brand partnership: conversational co-branding/conversational brand partnerships. Here’s how it may work: brand X has an audience that segments into groups A, B and C. Suppose that company Y is an authority in segment B. If X and Y form a conversational brand partnership, X can send messages ostensibly to Y that also reach segment B. For a mutually beneficial relationship, X would also have to be an authority in one of Y’s audience segments (for example, segment P out of segments P, Q, and R.) Ideally, P and B would largely overlap, meaning they can have a “sensible” conversation and it will hit both their targeted audiences…

For monitoring discussions within a particular segment, it strikes me that we could monitor the messages seen by an individual with a large Hub value/out-degree (that is, someone who follows large numbers of (influential) folk within the segment). By tapping into that Hub’s stream, we get some sort of sampling of the conversations taking place within the segment.

These ideas are completely untested (by me) of course… But they’re something I may well start to tinker with if an appropriate opportunity arises…

Using Twitter Lists to Define Custom Search Engines

A long time ago, I used to play with search engines all the time, particularly in the context of bounded search (that is, search over a particular set of web pages or web domains, e.g. Search Hubs and Custom Search at ILI2007). Although I’m not at IWMW this year, I can’t not have an IWMW-related tinker, so here’s a quick play around IWMW-related twittering folk…

To start with, let’s have a look at the IWMW Twitter account:

IWMW lists

We see there are several twitter lists associated with the account, including one for participants…

Looking around the IWMW10 website, I also spy a community area, with a Google Custom search engine that searches over institutional web management blogs that @briankelly, I presume, knows about:

Institutional Web Management blogs search engine

It seems a bit of a pain to manage though… “Please contact Brian Kelly if you would like your blog to be included in this list of blogs which are indexed”

Ever one to take the lazy approach, I wondered whether we could create a useful search engine around the URLs disclosed on the public Twitter profile page of folk listed on the various IWMW Twitter lists. The answer is “not necessarily”, because the URLs folk have posted on their Twitter profiles seem to point all over the place, but it’s easy enough to demonstrate the raw principle.

So here’s the recipe:

– find a Twitter list with interesting folk on it;
– use the Twitter API to grab the list of members on a list;
– the results include profile information of everyone on the list – including the URL they specified as a home page in their profile;
– grab the URLs and generate an annotations file that can be used to import the URLs into a Google Custom Search Engine;
– note that the annotations file should include a label identifier that specifies which CSE should draw on the annotations:

Google CSE config

Once the file is uploaded, you should have a custom search engine built around the URLs that the folk included in the Twitter list have revealed in their Twitter profiles (here’s my IWMW Participants CSE (list date: 12:00 12/7/10)).

Note that to create sensibly searchable URLs, I used the heuristics:

– if page URL is example.com or example.com/, search on example.com/*
– by default, if page is example.com/page.foo, just search on that page.

I used Python (badly!;-) and the tweepy library to generate my test CSE annotations feed:

import tweepy

#these are the keys you would normally use with oAuth
consumer_key=''
consumer_secret=''

#these are the special keys for single user apps from http://dev.twitter.com/apps
#as described in http://dev.twitter.com/pages/oauth_single_token
#select your app, then My Access Token from the sidebar
key=''
secret=''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth)

#this identifier is the identifier of the Google CSE you want to populate
cseLabelFromGoogle=''

listowner='iwmw'
tag='iwmw10participant'


f=open(tag+'listhomepages.xml','w')

cse=cseLabelFromGoogle

f.write("<GoogleCustomizations>\n\t<Annotations>\n")

#use the Cursor object so we can iterate through the whole list
for un in tweepy.Cursor(api.list_members, owner=listowner, slug=tag).items():
    if type(un) is tweepy.models.User:
        l = un.url
        if l:
            #Strip the protocol and apply the URL heuristics described above
            l = l.replace("http://", "").replace("https://", "")
            if l.endswith('/'):
                #example.com/ - search on example.com/*
                l = l + "*"
            elif '/' not in l:
                #bare domain, example.com - search on example.com/*
                l = l + "/*"
            #otherwise (example.com/page.foo) just search on that page
            f.write("\t\t<Annotation about=\"" + l + "\" score=\"1\">\n")
            f.write("\t\t\t<Label name=\"" + cse + "\"/>\n")
            f.write("\t\t</Annotation>\n")

f.write("\t</Annotations>\n</GoogleCustomizations>")

f.close()

(Here’s the code as a gist, with tweaks so it runs with OAuth.)

Running this code generates a file (the filename is built from the list slug – iwmw10participantlisthomepages.xml in this case) that contains Google Custom Search annotations for a particular Google CSE, based around the URLs declared in the public Twitter profiles of the people on a particular list. This file can then be uploaded to the Google CSE environment and used to help configure a bounded search engine.

So what does this mean? It means that if you have identified a set of people sharing a particular set of interests using a Twitter list, it’s easy enough to generate a custom search engine around the web pages or domains they have declared in their Twitter profiles.