I haven’t played with my ESP / twitter mapping code for a bit, but I dug it out again last night for a quick play, and to see how much of it I could reuse if I moved from Mongo to a neo4j / graph database backend (an excuse, in part, to learn a bit more about neo4j, but also because I think it would be easier to write interesting queries over something properly represented as a graph).
One of my favourite maps shows the folk most commonly followed by followers of a person, or set of people, on Twitter. But there are other ways of doing this two step projection, and I think they describe different things:
- common friends of your followers: this is “people like me” from the perspective of someone’s audience; if lots of folk follow you on Twitter because you interest them, you represent a shared interest of those people. If lots of those folk follow other individuals in common, that’s maybe because the interest they share with respect to you also applies to other folk they follow in common; other folk somehow like you. Alternatively, it may be that there are “affiliated” interests: lots of folk follow a particular golfer because they share an interest in golf, but maybe lots of them also follow a particular brand of whisky because of an interest in the nineteenth hole; so maybe the golfer should try to tie up with the whisky brand. These common friends of your followers are also your competitors in the sense that they too are trying to gain the attention of your followers.
- common followers of your followers: birds of a feather flock together (homophily); if folk share an interest in you, and they are all followed by someone who doesn’t follow you, perhaps someone who shares their interests, then maybe those common followers (who don’t follow you) of your followers are a place to grow your audience? You also have a route to those people (via your followers). And there are easy to identify metrics for any campaign, such as the rate at which you convert folk who follow your followers but not you into folk who do follow you.
- common friends of your friends: you can’t choose your followers (although you can block folk to exclude them from your follower list) but you do choose your friends (that is, people you follow). Your friends influence you by virtue of the fact you see what they say. If you’re choosing friends as folk that you want to influence in turn, then by mapping who their common friends are (that is, who they commonly follow), you can see who influences them. If they don’t follow folk like you, but you want to gain their attention, you need to gain the attention of the folk they follow.
- common followers of your friends: you follow folk because of your particular interests; if other folk follow the same people as you, perhaps they share the same interests; which means they may be your competitors, or they may be potential collaborators. You might also be able to use them to find other folk to follow (that is, look for the folk your friends’ followers follow that you don’t currently follow). You might also be able to use this group to find new possible followers, from the folk who follow them but don’t follow you.
I keep meaning to formalise this stuff… hmmm…
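As a first step towards formalising the four projections listed above, here is a minimal sketch in Python. It assumes the follow data is already to hand as a list of (follower, followed) pairs; the function names `build_indexes` and `two_step`, and the toy account names, are made up for illustration and are not taken from my harvesting code.

```python
from collections import Counter

def build_indexes(follows):
    """Index (follower, followed) pairs as friends[u] and followers[u] sets."""
    friends, followers = {}, {}
    for src, dst in follows:
        friends.setdefault(src, set()).add(dst)
        followers.setdefault(dst, set()).add(src)
    return friends, followers

def two_step(target, first, second):
    """Count accounts reached by one `first` step from target, then one `second` step."""
    counts = Counter()
    for middle in first.get(target, ()):
        for other in second.get(middle, ()):
            if other != target:  # ignore paths that lead straight back to the target
                counts[other] += 1
    return counts

# Toy data (hypothetical): a, b, c follow "me"; "me" follows p and q
follows = [
    ("a", "me"), ("b", "me"), ("c", "me"),
    ("a", "x"), ("b", "x"), ("c", "y"),
    ("z", "a"), ("z", "b"),
    ("me", "p"), ("me", "q"),
    ("p", "r"), ("q", "r"),
    ("s", "p"), ("s", "q"),
]
friends, followers = build_indexes(follows)

common_friends_of_followers = two_step("me", followers, friends)
common_followers_of_followers = two_step("me", followers, followers)
common_friends_of_friends = two_step("me", friends, friends)
common_followers_of_friends = two_step("me", friends, followers)
```

Each result is a count of how many two step paths reach an account, so thresholding the counts gives the "commonly" part of each projection.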
Despite having suffered a catastrophic/unrecoverable hard-disk failure on the (unbacked up) machine I had my Twitter harvesting notebooks (and cached data database) on, I did manage to find a reasonably current version of the code (via Github gists and Dropbox) and spent a few evening hours tinkering with it over the last ten days or so.
So as a quick note-to-self, here’s a list of the functions I currently have to hand:
- search for users using a recent search term: get a list of users recently using a particular term or phrase;
- search for users using a recent hashtag: get a list of users recently using a particular hashtag;
- generate maps of folk commonly followed by users of the searchterm/tag: from the term or tag userlist, find the folk commonly followed by those users and generate a network edge list;
- get members of a list: get a list of the members of a particular list;
- get lists a person is a member of: get a list of the lists a user is a member of; optionally limit to lists with more than a certain number of followers;
- triangulate lists: find lists that several specified users are a member of, thresholded (so e.g. lists where at least 3 of 5 people mentioned are on the list); also limit by minimum number of subscribers to list (so we can ignore lists with no subscribers etc). List triangulation can be applied to lists of users e.g. folk using a particular hashtag; so we have a route to finding lists that may be topically related to a particular tag;
- download members of lists a specified user is a member of: for the lists a particular user is a member of, grab details of all the members of those lists;
- get all friends/followers of a user: this can be limited to a maximum number of friends/followers (eg 5000);
- get common friends of (sampled) followers of a user: for a particular user, get their followers, sample N of them, then find folk commonly followed by that sample; output as a graph edge list;
- find common followers of a set of specified users: for a list of users (e.g. recent users of a particular hashtag), find folk who follow a minimum number of them, or who are followed by a minimum number of them;
- tag user biographies using Thomson Reuters OpenCalais and IBM Alchemy APIs: this tagging can be easily applied to all users in a list, tagging their biographies one at a time
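The list triangulation item in the list above is perhaps the least obvious, so here is a rough sketch of the thresholding logic. It assumes the list memberships and subscriber counts have already been pulled from the API into plain dicts; `triangulate_lists` and the sample data are hypothetical names for illustration, not the actual function in my notebooks.

```python
from collections import Counter

def triangulate_lists(memberships, min_users=3, subscribers=None, min_subscribers=0):
    """Find lists that at least `min_users` of the given users are members of.

    memberships: dict mapping user -> iterable of list ids they appear on.
    subscribers: optional dict mapping list id -> subscriber count.
    """
    subscribers = subscribers or {}
    counts = Counter()
    for user, list_ids in memberships.items():
        counts.update(set(list_ids))  # count each user once per list
    hits = [(list_id, n) for list_id, n in counts.items()
            if n >= min_users and subscribers.get(list_id, 0) >= min_subscribers]
    # Most widely shared lists first, ties broken alphabetically
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

# Hypothetical example: lists where at least 3 of 5 users appear
memberships = {
    "user1": ["f1-news", "misc"],
    "user2": ["f1-news", "misc"],
    "user3": ["f1-news", "misc", "rare"],
    "user4": ["misc"],
    "user5": [],
}
subscribers = {"f1-news": 12, "misc": 0, "rare": 40}
shared = triangulate_lists(memberships, min_users=3,
                           subscribers=subscribers, min_subscribers=1)
```

Here "misc" is shared by four users but dropped because nobody subscribes to it, and "rare" has subscribers but only one of the five users on it.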
I’ve also started looking again at generating topic models around Twitter data, starting with user biographies (which so far is not very interesting!)
With these various functions, it’s easy enough to generate various combinations of emergent social positioning map. I’ve started exploring various Python libraries for clustering and laying out maps automatically, but tend to fall back to handcrafting the displays using Gephi. On the to do list is to try to automate the Gephi side, at least for a first pass, using the Gephi toolkit, though at the moment that looks like requiring that I get my head round a bit of Java. Ideally, I’d like to be able to see a Gephi endpoint (perhaps from a Gephi headless server running in a Docker container…?:-), give it a graph file and a config file, and get a PDF, SVG or PNG layout back…
I also need to do a couple of proof-of-concept one-off printed outputs for myself, like getting an ESP map printed as an A0 poster or folded map.
Having dusted off and reversioned my Twitter emergent social positioning (ESP) code, and in advance of starting to think about what sorts of analyses I might properly start running, here’s a look back at what I was doing before in terms of charting where particular Twitter accounts sat amongst the other accounts commonly followed by the target account’s followers.
No longer having a whitelisted Twitter API key means the sample sizes I’m running are smaller than they used to be, so maybe that’s a good thing because it means I’ll have to start working properly on the methodology…
Anyway, here’s a quick snapshot of where I think hyperlocal news bloggers @onthewight might be situated on Twitter…
The view aims to map out accounts that are followed by 10 or more people from a sample of about 200 or so followers of @onthewight. The network is laid out according to a force directed layout algorithm with a dash of aesthetic tweaking; nodes are coloured based on community grouping as identified using the Gephi modularity statistic. Which has its issues, but it’s a start. The nodes are sized in the first case according to PageRank.
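The thresholding and PageRank sizing described above can be sketched in plain Python. This is a minimal sketch rather than what Gephi does internally: `commonly_followed` and `pagerank` are hypothetical helper names, the PageRank is a textbook power iteration with a damping factor of 0.85, and the toy edge lists are made up.

```python
def commonly_followed(sample_edges, min_count=10):
    """Keep accounts followed by at least `min_count` distinct sampled followers.

    sample_edges: iterable of (sampled_follower, friend) pairs.
    """
    followers_per_account = {}
    for follower, friend in sample_edges:
        followers_per_account.setdefault(friend, set()).add(follower)
    return {account: len(fs) for account, fs in followers_per_account.items()
            if len(fs) >= min_count}

def pagerank(edges, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is shared equally among all nodes
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Toy sample (hypothetical): three sampled followers, threshold of 2
sample_edges = [("f1", "hub"), ("f2", "hub"), ("f3", "hub"),
                ("f1", "a"), ("f2", "a"), ("f3", "b")]
kept = commonly_followed(sample_edges, min_count=2)

# Toy map edges (hypothetical) used to size nodes
map_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
ranks = pagerank(map_edges)
```

For the real maps the threshold would be 10 and the sample around 200, as described above.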
The quick take home from this original sketchmap is that there are a bunch of key information providers in the middle group, local accounts on the left, and slebs on the right.
If we look more closely at the key information providers, they seem to make sense…
These folk are likely to be either competitors of @onthewight, or prospects who might be worth approaching for advertising on the basis that @onthewight’s followers also follow the target account. (Of course, you could argue that because they share followers, there’s no point also using @onthewight as a channel. Except that @onthewight also has a popular blog presence, which would be where any ads were placed. The @onthewight Twitter feed is generally news announcements and live reporting.) A better case could probably be made by looking at the follower profiles of the prospects, along with the ESP maps for the prospects, to see how well the audiences match, what additional reach could be offered, etc etc.
A broad brush view over the island community is a bit more cluttered:
If we rerun PageRank to resize the nodes (note this will no longer take into account contributions from the other communities) and tweak the layout a little, again using a force directed algorithm, we get a bit less of a mess, though the map is still hard to read. Arts to the top, perhaps, Cowes to the right?
Again, with a bit more data, or perhaps a bit more of a think about what sort of map would be useful (and hence, what sort of data to collect), this sort of map might become useful for B2B marketing purposes on the Island. (I’m not really interested in, erm, the plebs such as myself… i.e. people rather than bizs or slebs; though a pleb interest/demographic/reach analysis would probably be the one that would be most useful to take to prospects?).
If we look at the celebrity common follows, again resized and re-laid out, we see what I guess is a typical spread (it’s some time since I looked at these, so I’m not sure what the baseline is, though @stephenfry still seems to feature high up in the background radiation count).
For bigger companies with their own marketing $, I guess this sort of map is the sort of place to look for potential celebrity endorsements to reinforce a message (folk following these accounts are already aware of @onthewight because they follow @onthewight) as well as potentially widen reach. But I guess the endorsement as reinforcement is more valuable as a legitimising thing?
Just got to work out what to do next, now, and how to start tightening this up and making it useful rather than just of passing interest…
PS A related chart that could be plotted using Facebook data would be to grab down all the likes of the friends of a person or company on Facebook, though I’m not sure how that would work if their account is a page as opposed to a “person”? I’m not so hot on Facebook API/permissions etc, or what sort of information page owners can get about their audience? Also, I’m not sure about the extent to which I can get likes from folk who aren’t my friends or who haven’t granted me app permissions? I used to be able to grab lists of people from groups and trawl through their likes, but I’m not sure default Facebook permissions make that as easy pickings now compared to a year or two ago? (The advantage of Twitter is that the friend/follow data is open on most accounts…)
Prompted by an email request, I’ve revisited the code I used to generate emergent social positioning maps in Twitter as an IPython notebook that reuses chunks of code from, as well as the virtual machine used to support, Matthew A. Russell’s Mining the Social Web (2nd edition) [code].
As a reminder, the social positioning maps show folk commonly followed by the followers of a particular twitter user.
With the possibility that my effectively unlimited Twitter API key will die at some point in the Spring with the Twitter API upgrade, I’m starting to look around for alternative sources of interest signal (aka getting ready to say “bye, bye, Twitter interest mapping”). And Facebook groups look like they may offer one possibility…
Some time ago, I did a demo of how to map the common Facebook Likes of my Facebook friends (Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine). In part inspired by a conversation today about profiling the interests of members of particular Facebook groups, I thought I’d have a quick peek at the Facebook API to see if it’s possible to grab the membership list of arbitrary, open Facebook groups, and then pull down the list of Likes made by the members of the group.
As with my other social positioning/social interest mapping experiments, the idea behind this approach is broadly this: users express interest through some sort of public action, such as following a particular Twitter account that can be associated with a particular interest. In this case, the signal I’m associating with an expression of interest is a Facebook Like. To locate something in interest space, we need to be able to detect a set of users associated with that thing, identify each of their interests, and then find interests they have in common. These shared interests (ideally over and above a “background level of shared interest”, aka the Stephen Fry effect, from Twitter, where a large number of people in any set of people appear to follow Stephen Fry, oblivious of other more pertinent shared interests that are peculiar to that set of people) are then assumed to be representative of the interests associated with the thing. In this case, the thing is a Facebook group, the users associated with the thing are the group members, and the interests associated with the thing are the things commonly liked by members of the group.
So for example, here is the social interest positioning of the Red Bull Racing group on Facebook, based on a sample of 3000 members of the group. Note that a significant number of these members returned no likes, either because they haven’t liked anything, or because their personal privacy settings are such that they do not publicly share their likes.
As we might expect, the members of this group also appear to have an interest in other Formula One related topics, from F1 in general, to various F1 teams and drivers, and to motorsport and motoring in general (top half of the map). We also find music preferences (the cluster to the left of the map) and TV programmes (centre bottom of the map) that are of common interest, though I have no idea yet whether these are background radiation interests (that is, the Facebook equivalent of the Stephen Fry effect on Twitter) or are peculiar to this group. I’m not sure whether the cluster of beverage related preferences at the bottom right corner of the map is notable either?
This information is visualised using Gephi, using data grabbed via the following Python script (revised version of this code as a gist):
```python
#This is a really simple script (Python 2):
##Grab the list of members of a Facebook group (no paging as yet...)
###For each member, try to grab their Likes

import urllib,simplejson,csv,argparse

#Grab a copy of a current token from an example Facebook API call, eg from clicking a keyed link on:
#https://developers.facebook.com/docs/reference/api/examples/
#Something a bit like this:
#AAAAAAITEghMBAOMYrWLBTYpf9ciZBLXaw56uOt2huS7C4cCiOiegEZBeiZB1N4ZCqHgQZDZD

parser = argparse.ArgumentParser(description='Generate social positioning map around a Facebook group')

parser.add_argument('-gid',default='2311573955',help='Facebook group ID')
#gid='2311573955'

parser.add_argument('-FBTOKEN',help='Facebook API token')

args=parser.parse_args()
if args.gid!=None: gid=args.gid
if args.FBTOKEN!=None: FBTOKEN=args.FBTOKEN

#Quick test - output file is simple 2 column CSV that we can render in Gephi
fn='fbgroupliketest_'+str(gid)+'.csv'
writer=csv.writer(open(fn,'wb+'),quoting=csv.QUOTE_ALL)

#List of user IDs we have already fetched likes for
uids=[]

def getGroupMembers(gid):
    gurl='https://graph.facebook.com/'+str(gid)+'/members?limit=5000&access_token='+FBTOKEN
    data=simplejson.load(urllib.urlopen(gurl))
    if "error" in data:
        print "Something seems to be going wrong - check OAUTH key?"
        print data['error']['message'],data['error']['code'],data['error']['type']
        exit(-1)
    else:
        return data

#Grab the likes for a particular Facebook user by Facebook User ID
def getLikes(uid,gid):
    #Should probably implement at least a simple cache here
    lurl="https://graph.facebook.com/"+str(uid)+"/likes?access_token="+FBTOKEN
    ldata=simplejson.load(urllib.urlopen(lurl))
    print ldata

    if len(ldata['data'])>0:
        for i in ldata['data']:
            if 'name' in i:
                writer.writerow([str(uid),i['name'].encode('ascii','ignore')])
                #We could colour nodes based on category, etc, though would require richer output format.
                #In the past, I have used the networkx library to construct "native" graph based representations of interest networks.
                if 'category' in i:
                    print str(uid),i['name'],i['category']

#For each user in the group membership list, get their likes
def parseGroupMembers(groupData,gid):
    #x is just a fudge used in progress reporting
    #(initialised once, outside the loop, so that it actually counts up)
    x=0
    for user in groupData['data']:
        uid=user['id']
        writer.writerow([str(uid),str(gid)])
        #Prevent duplicate fetches
        if uid not in uids:
            getLikes(user['id'],gid)
            uids.append(uid)
            #Really crude progress reporting
            print x
            x=x+1
    #need to handle paging?
    #parse next page URL and recall this function

groupdata=getGroupMembers(gid)
parseGroupMembers(groupdata,gid)
```
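On the "need to handle paging?" note in the script: one way to do it, sketched below, is to keep following the `next` URL until there isn't one. This assumes each response carries a `paging` object with a `next` URL, as Graph API list responses generally do. The `iter_pages` helper, the injected `fetch` function and the fake pages are all hypothetical, used here so the sketch can be run without a token.

```python
def iter_pages(first_url, fetch):
    """Yield every item from a paged, Graph API style response.

    fetch: any callable that takes a URL and returns the decoded JSON as a dict
    (for example a small wrapper around urlopen plus a JSON parser).
    """
    url = first_url
    seen = set()
    while url and url not in seen:  # guard against a page pointing back at itself
        seen.add(url)
        page = fetch(url)
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "unknown API error"))
        for item in page.get("data", []):
            yield item
        url = page.get("paging", {}).get("next")

# Fake two-page response (hypothetical URLs and ids) standing in for the API
fake_pages = {
    "page1": {"data": [{"id": "1"}, {"id": "2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "3"}]},
}
member_ids = [member["id"] for member in iter_pages("page1", fake_pages.__getitem__)]
```

Passing the fetch function in keeps the paging logic separate from whichever HTTP and JSON libraries are doing the actual work.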
Note that I have no idea whether or not this is in breach of Facebook API terms and conditions, nor have I reflected on the ethical implications of running this sort of analysis, over and above remarking that it’s the same general approach I apply to mapping social interests on Twitter.
As to where next with this? It brings into focus again the question of identifying common interests pertinent to this particular group, compared to background popular interest that might be expressed by any random set of people. But having got a new set of data to play with, it will perhaps make it easier to test the generalisability of any model or technique I do come up with for filtering out, or normalising against, background interest.
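One simple candidate for normalising against background interest is a "lift" score: the rate at which the group likes something, divided by the rate at which some background sample likes it. The sketch below is only an illustration of that idea, not a method I have tested; the `interest_lift` function, the additive smoothing term and all the counts are assumptions invented for the example.

```python
def interest_lift(group_counts, group_size, background_counts, background_size,
                  smoothing=1.0):
    """Rate of interest in the group divided by the rate in a background sample.

    A lift near 1 suggests background popularity; a lift well above 1 suggests
    an interest that is peculiar to the group. The smoothing term avoids
    dividing by zero for items never seen in the background sample.
    """
    lift = {}
    for item, n in group_counts.items():
        group_rate = n / float(group_size)
        background_rate = ((background_counts.get(item, 0) + smoothing)
                           / float(background_size + smoothing))
        lift[item] = group_rate / background_rate
    return lift

# Made-up counts: 100 group members, 1000 background users
group_counts = {"popular_tv_show": 40, "f1_driver": 30}
background_counts = {"popular_tv_show": 400, "f1_driver": 10}
lift = interest_lift(group_counts, 100, background_counts, 1000)
```

With these made-up numbers the TV show scores close to 1 (liked at about the same rate everywhere), while the driver scores far higher, which is the sort of separation a background filter would need to produce.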
Other directions this could go? Using a single group to bootstrap a walk around the interest space? For example, in the above case, trying to identify groups associated with Sebastian Vettel, or F1, and then repeating the process? It might also make sense to look at the categories of the notable shared interests; (from a quick browse, these include, for example, things like Movie, Product/service, Public figure, Games/toys, Sports Company, Athlete, Interest, Sport; is there a full vocabulary available, I wonder? How might we use this information?)
Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.
Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:
There’s still a fair bit to do making the methodology robust (for example, being able to cope with comparing folk commonly followed by different sets of users where the size of the set differs to a significant extent; there is a large difference between the number of tweeting Conservative and LibDem MPs, for instance). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or particular perspective about the way different groups differentially follow folk on Twitter. A couple of other quirks I’ve noticed about the comparison.cloud as currently defined: firstly, very highly represented friends are sized too large to appear in the cloud, which is why very commonly followed folk across all sets (the people that appear in the commonality cloud) tend not to appear; there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud.
(So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman, and @davegorman appeared as a small label in the friends part of the comparison.cloud, notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do… What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed because it would be too large).)
That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). I guess one way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets? Suggestions welcome as to what they might be…:-)
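As a start on picking the calculation apart, here is a rough Python sketch of the scoring step as I understand it from the wordcloud package documentation: each term’s rate within a group is compared with its mean rate across groups, and the term is assigned to the group where that deviation is largest. This is an approximation for thinking with, not a port of the R code; `comparison_scores` and the counts are hypothetical, and it assumes every group has at least one count.

```python
def comparison_scores(counts_by_group):
    """Assign each term to the group where its rate most exceeds the mean rate.

    counts_by_group: dict mapping group -> dict mapping term -> count.
    Returns a dict mapping term -> (group, deviation from the mean rate).
    """
    groups = sorted(counts_by_group)
    terms = sorted({term for g in groups for term in counts_by_group[g]})
    rates = {}
    for g in groups:
        # Normalising by the group total is what lets groups of different sizes be compared
        total = float(sum(counts_by_group[g].values()))
        rates[g] = {term: counts_by_group[g].get(term, 0) / total for term in terms}
    scores = {}
    for term in terms:
        mean_rate = sum(rates[g][term] for g in groups) / len(groups)
        best = max(groups, key=lambda g: rates[g][term] - mean_rate)
        scores[term] = (best, rates[best][term] - mean_rate)
    return scores

# Made-up follow counts for two differently sized groups
counts = {
    "party_a": {"@a": 80, "@shared": 90, "@b": 10},
    "party_b": {"@c": 30, "@shared": 45, "@b": 5},
}
scores = comparison_scores(counts)
```

Because the rates are relative to each group’s total, a widely shared account such as "@shared" ends up with only a small deviation, which fits with commonly followed folk dropping out of the comparison view.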
PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…
This is a placeholder as much as anything, something I want to try out but don’t have time to do right now… The context is the social media mapping approach I’ve been doodling with for a few weeks now, where I try to position social media users in terms of who their followers follow (for example, A Couple More Social Media Positioning Maps for UK HE Twitter Accounts).
One of the problems with the approach is that you often get some of the same-old, same-old accounts appearing again and again (@stephenfry for example). So I’ve been wondering whether it might be worth generating funnel plots that plot the rate at which followers of a target account follow the other accounts identified in the positioning maps generated around the target account? On the x we’d plot the total number of followers of each account, and on the y, the rate at which they are followed by the followers of the target account (i.e. their in-degree in the map divided by the target account follower sample size used to generate the map). We might then get useful signal from the presence of accounts that appear to be over-represented within the target account followers sample, signal that can be used to identify those accounts that are more highly associated with the target account than we might expect by chance?
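The points for such a plot could be calculated along the following lines. This is a sketch under some loud assumptions: the "by chance" rate is taken to be an account’s total follower count divided by a guessed population of potential followers, and the upper limit uses a normal approximation to the binomial, which is crude when the expected rate is tiny. The `funnel_points` function, the `population` figure and the example numbers are all invented for illustration.

```python
import math

def funnel_points(in_degree, total_followers, sample_size, population, z=3.0):
    """Compute (x, y) funnel plot points plus a rough upper control limit.

    x: total number of followers of each account.
    y: in-degree in the map divided by the follower sample size.
    population: ASSUMED number of accounts that could follow anyone at all.
    """
    points = []
    for account, k in in_degree.items():
        x = total_followers[account]
        y = k / float(sample_size)
        expected = x / float(population)  # rate expected "by chance"
        spread = z * math.sqrt(expected * (1.0 - expected) / sample_size)
        upper = min(1.0, expected + spread)
        points.append({"account": account, "x": x, "y": y,
                       "expected": expected, "upper": upper,
                       "over_represented": y > upper})
    return points

# Invented numbers: a sample of 200 followers, a population of a million accounts
in_degree = {"celeb": 90, "local": 40}
total_followers = {"celeb": 400000, "local": 2000}
points = {p["account"]: p for p in funnel_points(in_degree, total_followers,
                                                 sample_size=200, population=1000000)}
```

In this toy case the celebrity account is followed by 45% of the sample, but that is roughly what its overall reach predicts, whereas the local account at 20% is way above its expected rate and so gets flagged.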
Another factor that I maybe need to take into account is the total number of accounts followed by the target account followers?
PS by the by, I notice that my map of folk “in the vicinity of the #gdslaunch hashtag” appears to have been posterised…:-)
(If anyone wants SVG or graphml based representations of any of the Gephi generated images I post either here or on my flickr account, it can probably be arranged;-)