Revisiting My Twitter Harvesting Code

Despite having suffered a catastrophic/unrecoverable hard-disk failure on the (unbacked up) machine I had my Twitter harvesting notebooks (and cached data database) on, I did manage to find a reasonably current version of the code (via Github gists and Dropbox) and spent a few evening hours tinkering with over the last ten days or so.

So as a quick to note-to-self, here’s a list of the functions I currently have to hand:

  • search for users using a recent search terms: get a list of users recently using a particular term or phrase;
  • search for users using a recent hashtag: get a list of users recently using a particular hashtag;
  • generate maps of folk commonly followed by users of the searchterm/tag: from the term or tag userlist, find the folk commonly followed by those users and generate a network edge list;
  • get members of a list: get a list of the members of a particular list;
  • get lists a person is a member of: get a list of the lists a user is a member of; optionally limit to lists with more than a certain number of followers;
  • triangulate lists: find lists that several specified users are a member of, thresholded (so e.g. lists where at least 3 of 5 people mentioned are on the list); also limit by minimum number of subscribers to list (so we can ignore lists with no subscribers etc). List triangulation can be applied to lists of users e.g. folk using a particular hashtag; so we have a route to finding lists that may be topically related to a particular tag;
  • download members of lists a specified user is a member of: for the lists a particular user is a member of, grab details of all the members of those lists’
  • get all friends/followers of a user: this can be limited to a maximum number of friends/followers (eg 5000);
  • get common friends of (sampled) followers of a user: for a particular user, get their followers, sample N of them, then find folk commonly followed by that sample; output as a graph edge list;
  • find common followers of a set of specified users: for a list of users (e.g. recent users of a particular hashtag), find folk who follow a minimum number of them, or who are followed by a minimum number of them;
  • tag user biographies using Thomson Reuters OpenCalais and IBM Alchemy APIs: this tagging can be easily applied to all users in a list, tagging their biographies one at a time

I’ve also started looking again at generating topic models around Twitter data, starting with user biographies (which so far is not very interesting!)

With these various functions, it’s easy enough to generate various combinations of emergent social positioning map. I’ve started exploring various Python libraries for clustering and laying out maps automatically, but tend to fall back to handcrafting the displays using Gephi. On the to do list is to try to automate the Gephi side, at least for a first pass, using the Gephi toolkit, though at the moment that looks like requiring that I get my head round a bit of Java. Ideally, I’d like to be able to see a Gephi endpoint (perhaps from a Gephi headless server running in a Docker container…?:-), give it a graph file and a config file, and get a PDF, SVG or PNG layout back…

I also need to do a couple of proof-of-concept one-off printed outputs for myself, like getting an ESP map printed as an A0 poster or folded map.