
Emergent Social Interest Mapping – Red Bull Racing Facebook Group

With the possibility that my effectively unlimited Twitter API key will die at some point in the Spring with the Twitter API upgrade, I’m starting to look around for alternative sources of interest signal (aka getting ready to say “bye, bye, Twitter interest mapping”). And Facebook groups look like they may offer one possibility…

Some time ago, I did a demo of how to map the common Facebook Likes of my Facebook friends (Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine). In part inspired by a conversation today about profiling the interests of members of particular Facebook groups, I thought I’d have a quick peek at the Facebook API to see if it’s possible to grab the membership list of arbitrary, open Facebook groups, and then pull down the list of Likes made by the members of the group.

As with my other social positioning/social interest mapping experiments, the idea behind this approach is broadly this: users express interest through some sort of public action, such as following a particular Twitter account that can be associated with a particular interest. In this case, the signal I’m associating with an expression of interest is a Facebook Like. To locate something in interest space, we need to be able to detect a set of users associated with that thing, identify each of their interests, and then find interests they have in common. These shared interests (ideally over and above a “background level of shared interest”, aka the Stephen Fry effect – from Twitter, where a large number of people in any set of people appear to follow Stephen Fry oblivious of other more pertinent shared interests that are peculiar to that set of people) are then assumed to be representative of the interests associated with the thing. In this case, the thing is a Facebook group, the users associated with the thing are the group members, and the interests associated with the thing are the things commonly liked by members of the group.

Simples.

So for example, here is the social interest positioning of the Red Bull Racing group on Facebook, based on a sample of 3000 members of the group. Note that a significant number of these members returned no likes, either because they haven’t liked anything, or because their personal privacy settings are such that they do not publicly share their likes.

rbr_fbGroup_commonLikes

As we might expect, the members of this group also appear to have an interest in other Formula One related topics, from F1 in general, to various F1 teams and drivers, and to motorsport and motoring in general (top half of the map). We also find music preferences (the cluster to the left of the map) and TV programmes (centre bottom of the map) that are of common interest, though I have no idea yet whether these are background radiation interests (that is, the Facebook equivalent of the Stephen Fry effect on Twitter) or are peculiar to this group. I’m not sure whether the cluster of beverage related preferences at the bottom right corner of the map is notable either?

This information is visualised using Gephi, using data grabbed via the following Python script (revised version of this code as a gist):

#This is a really simple script:
##Grab the list of members of a Facebook group (no paging as yet...)
###For each member, try to grab their Likes

import urllib,simplejson,csv,argparse

#Grab a copy of a current token from an example Facebook API call, eg from clicking a keyed link on:
#https://developers.facebook.com/docs/reference/api/examples/
#Something a bit like this:
#AAAAAAITEghMBAOMYrWLBTYpf9ciZBLXaw56uOt2huS7C4cCiOiegEZBeiZB1N4ZCqHgQZDZD

parser = argparse.ArgumentParser(description='Generate social positioning map around a Facebook group')

parser.add_argument('-gid',default='2311573955',help='Facebook group ID')
#gid='2311573955'

parser.add_argument('-FBTOKEN',help='Facebook API token')

args=parser.parse_args()
if args.gid!=None: gid=args.gid
if args.FBTOKEN!=None: FBTOKEN=args.FBTOKEN

#Quick test - output file is simple 2 column CSV that we can render in Gephi
fn='fbgroupliketest_'+str(gid)+'.csv'
writer=csv.writer(open(fn,'wb+'),quoting=csv.QUOTE_ALL)

uids=[]

def getGroupMembers(gid):
	gurl='https://graph.facebook.com/'+str(gid)+'/members?limit=5000&access_token='+FBTOKEN
	data=simplejson.load(urllib.urlopen(gurl))
	if "error" in data:
		print "Something seems to be going wrong - check OAUTH key?"
		print data['error']['message'],data['error']['code'],data['error']['type']
		exit(-1)
	else:
		return data

#Grab the likes for a particular Facebook user by Facebook User ID
def getLikes(uid,gid):
	#Should probably implement at least a simple cache here
	lurl="https://graph.facebook.com/"+str(uid)+"/likes?access_token="+FBTOKEN
	ldata=simplejson.load(urllib.urlopen(lurl))
	print ldata
	
	#Some users return no 'data' element at all (e.g. because of privacy settings), so check for it first
	if 'data' in ldata and len(ldata['data'])>0:
		for i in ldata['data']:
			if 'name' in i:
				writer.writerow([str(uid),i['name'].encode('ascii','ignore')])
				#We could colour nodes based on category, etc, though would require richer output format.
				#In the past, I have used the networkx library to construct "native" graph based representations of interest networks.
				if 'category' in i: 
					print str(uid),i['name'],i['category']

#For each user in the group membership list, get their likes				
def parseGroupMembers(groupData,gid):
	#x is just a crude counter used for progress reporting
	x=0
	for user in groupData['data']:
		uid=user['id']
		writer.writerow([str(uid),str(gid)])
		#Prevent duplicate fetches
		if uid not in uids:
			getLikes(user['id'],gid)
			uids.append(uid)
			#Really crude progress reporting
			print x
			x=x+1
	#need to handle paging?
	#parse next page URL and recall this function


groupdata=getGroupMembers(gid)
parseGroupMembers(groupdata,gid)

Note that I have no idea whether or not this is in breach of Facebook API terms and conditions, nor have I reflected on the ethical implications of running this sort of analysis, over and above remarking that it’s the same general approach I apply to mapping social interests on Twitter.

As to where next with this? It brings into focus again the question of identifying common interests pertinent to this particular group, compared to background popular interest that might be expressed by any random set of people. But having got a new set of data to play with, it will perhaps make it easier to test the generalisability of any model or technique I do come up with for filtering out, or normalising against, background interest.
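By way of a crude illustration of what that filtering might look like, here’s a minimal sketch that works over the two-column member/like CSV file generated by the script above; the 10% threshold and the baseline like-fractions are made-up numbers purely for the sake of the example:

#Sketch: find likes shared by a large fraction of the sampled group members,
#then discount likes that are also popular in some baseline population
import csv
from collections import Counter

gid='2311573955'
fn='fbgroupliketest_'+gid+'.csv'

members=set()
likeCounts=Counter()
for row in csv.reader(open(fn)):
    uid,item=row[0],row[1]
    members.add(uid)
    if item!=gid: #skip the member->group edges the script also writes
        likeCounts[item]+=1

#Keep likes shared by at least 10% of the sampled members (threshold is a guess)
threshold=0.1*len(members)
commonLikes=[(name,count) for name,count in likeCounts.most_common() if count>=threshold]

#Hypothetical baseline: fraction of a "random" population liking each page;
#likes that are no more popular in the group than in the baseline are background radiation
baselineFraction={'Facebook':0.5,'YouTube':0.4} #made-up numbers
for name,count in commonLikes:
    groupFraction=count/float(len(members))
    lift=groupFraction/baselineFraction.get(name,0.01) #0.01 is an arbitrary floor
    print name,count,round(lift,2)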

Other directions this could go? Using a single group to bootstrap a walk around the interest space? For example, in the above case, trying to identify groups associated with Sebastian Vettel, or F1, and then repeating the process? It might also make sense to look at the categories of the notable shared interests (from a quick browse, these include, for example, things like Movie, Product/service, Public figure, Games/toys, Sports Company, Athlete, Interest, Sport); is there a full vocabulary available, I wonder? And how might we use this information?

Scribbled Ideas for “Research” Around the OU Online Conference…

So it seems I missed a meeting earlier this week planning a research strategy around the OU’s online conference, which takes place in a couple of weeks or so… (sigh: another presentation to prepare…;-)…

For what it’s worth, here are a few random thoughts about things I’ve done informally around confs before, or have considered doing… I’ve got the lurgy, though, so this is pretty much a raw core dump and is likely to have more typos and quirky grammatical constructions than usual – can’t concentrate at all :-(

Twitter hashtag communities: I keep thinking I should grab a bit more data (e.g. friends and followers details) around folk using a particular hashtag, and then do some social network analysis on the resulting network so I can write a “proper research paper” about it, but that would be selfish, because I suspect what would be more useful would be to spend that time making it easier for folk to decide whether or not they want to follow other hashtaggers, provide easy ways to create lists of hashtaggers, and so on. (That said, it would be really handy to get hold of the script that Dave Challis cobbled together around Dev8D (here and here) and then used to plot the growth of a Twitter community over the course of that event.) What’s required? Find the list of folk using the hashtag and then several times a day just grab a list of all their friends and followers. So we require two scripts: one to grab hashtaggers every hour or so and produce a list of “new” hashtaggers; one to grab the friends and followers of every hashtagger once an hour or so (or every half day, or whatever… if this is a research project, I guess it’d make sense to set quite a high sample rate and then learn from that what an effective sample rate would be?). Then at our leisure we can plot the network, and I guess run some SNA stats on it. (We could also use a hashtagger list to create a Twitter map view of where folk might be participating from?) One interesting network view would be to show the growth of new connections between two time periods. I’m not sure if the temporal graphs Gephi supports would be handy here, but it’d be a good reason to learn how to use Gephi’s temporal views :-) If the conf is mainly hashtagged by OU users, then it won’t be interesting, because I suspect the OU hashtag community is already pretty densely interconnected. As the conference is being held (I think) in Elluminate, it might be that a lot of the backchannel chatter occurs in that closed environment…? Is it possible to set up Elluminate with a panel showing part of someone’s desktop that is streaming the conference hashtag, I wonder – i.e. showing backchannel chat within the Elluminate environment using a backchannel that exists outside Elluminate? (Thinks: would it be worth having a conference Twitter user that autofollows anyone using the conf hashtag?) Other Twitter stuff we can do is described in Topic and Event based Twittering – Who’s in Your Community?. E.g. from the list of hashtaggers, we could see what other hashtags they were using/have recently used, helping identify the situation of the OU conf in other contexts according to the interests of people talking about it.
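For what it’s worth, here’s a very rough sketch of the shape of those two scripts; get_hashtag_users() and get_friend_ids() are hypothetical placeholders for whatever Twitter client/library is to hand, and the point is just the regular snapshotting, not the API calls:

#Two cron-able snapshot jobs: one grabs new hashtaggers, one grabs their friends/followers
import json,time

def snapshotHashtaggers(tag,known):
    users=get_hashtag_users(tag) #placeholder: folk recently tweeting the hashtag
    new=[u for u in users if u not in known]
    known.update(new)
    return new

def snapshotConnections(users):
    stamp=time.strftime('%Y%m%d%H%M')
    snapshot=dict((u,get_friend_ids(u)) for u in users) #placeholder call
    json.dump(snapshot,open('snapshot_'+stamp+'.json','w'))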

Facebook communities might be another thing to look at. The Netvizz app will grab an individual’s network, and the connections between members of that network (unless recent privacy changes have broken things?). This data is trivially visualised in Gephi, which can also determine various SNA stats. Again it would make sense to grab regular dumps of data in maybe two cases: 1) create a faux Facebook user and get folk to friend it, then grab a dump of its network every hour or so (is it possible to autofriend people back? Or maybe that’s a job for a research monkey…?!); 2) get folk to join a conference group and grab a dump of the members of the group every hour or so (or every whenever or so). The only problem with the latter is that if the group has more than 200 members, you only get a dump of a randomly selected 200 members.

– link communities – by which I mean look at activity around links that are being shared via e.g. Twitter (extract the links from the (archived) hashtag feed), or bookmarked on Delicious. I’ve doodled social life of URL ideas before that might help provide macroscopic views over what links folk are sharing, and who else might be interested in those links (e.g. delicious URL History: Users by Tag or edupunk chatter). From socially bookmarked links, we can also generate tag clouds.

– chatter clouds: term extraction word clouds based on stuff that’s being tweeted with a particular hashtag.

– blog post communities: just cobble together a list of blogs that contain posts written around conf sessions.

– googalytics, bit.lytics: not sure what Google analytics you’d collect from where, but an obvious thing to do with them would be to look at the incoming IP addresses/domains to see whether the audience was largely coming in from educational institutions. (Is there a list of IP ranges for UK HEIs, I wonder?) If any links are shared in the conference context, e.g. by backchannel, it might make sense to shorten all those links on bit.ly with a conf API key, so you could track all click-throughs on bit.ly shortened versions of each target link. The point would be to be able to produce a chart of something like “most clicked through links for this conf”.

Bleurghhhhh….

Using Graphviz to Explore the Internal Link Structure of a WordPress Blog

In The Structure of OUseful.Info, I showed how it was possible to extract an autopingback graph from a WordPress blog (that is, the graph that shows which of the posts in a particular WordPress blog link to other posts in that blog), illustrating the post with a visualisation of linkage within OUseful.info generated using Gephi.

What I didn’t do was post any examples of the views that we can generate in Graphviz – so here are a couple generated without additional flourishes from a simple statement of links between posts.
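By way of a minimal sketch of what that “simple statement of links” can look like (the post slugs below are invented for illustration), here’s one way of writing an edge list out as a dot file that Graphviz will happily lay out:

#Write a Graphviz dot file from a list of (newer post -> older post) links
links=[('library-analytics-part-2','library-analytics-part-1'),
       ('library-analytics-part-3','library-analytics-part-1')]

f=open('bloglinks.dot','w')
f.write('digraph bloglinks {\n')
for src,dst in links:
    f.write('  "%s" -> "%s";\n' % (src,dst))
f.write('}\n')
f.close()

#Then render with e.g.: dot -Tpng bloglinks.dot -o bloglinks.png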

Firstly, we see a series of posts relating to WriteToReply, and commentable documents:

Blog linkage examples using Graphviz

In the following example, we see a series of self-contained posts on Library Analytics:

Linkage structure in OUSeful info

Note that from the original library-analytics-part-1 post, we can see how two strands developed around this topic (remember, arrows typically go from a more recent post to an earlier one; that is, the links typically go from a new post to one that already exists…).

Here’s another set of posts on a topic – this time privacy and Facebook:

OUseful posts on Facebook and privacy

(The bidirectional linkage arose from me editing the body of a pre-existing post with a link to a later one.)

One thing I haven’t explored yet is the groupings that arise from an analysis of the tags and categories I used to annotate each post. But what the above shows is that even in the absence of tags and categories, link structure may also be used to aggregate posts on a particular topic, and allow clusters of blog posts, or partitions containing link related posts, to be easily identified – and extracted – from the blog…

…and my supposition is that this sort of structure might be used to facilitate value adding navigation structures…

Small World? A Snapshot of How My Twitter “Friends” Follow Each Other…

I’m now following about 500 or so people on Twitter, but to what extent are they following each other? Are there any noticeable subgroups in the folk I follow, by virtue of them being highly linked to each other in the friends and following stakes?

How my twitter friends are interconnected - size is # of my friends following my friends, colour is # of my friends they follow

Each of the nodes represents one of my Twitter friends (that is, each node represents a separate person I follow on Twitter).

Node size is proportional to the number of my friends who follow that person.

Node colour is proportional to the number of my friends that person is following (blue is cold – low number; red is hot – high number).

The graph is an indication of the extent to which the people I follow (that is, my friends…) form an echo chamber…

Running the Gephi “connected components” statistic, it seems that the group is pretty tightly connected… There is one noticeable separate component that contains more than a singleton, from a few accounts I followed last year…:

Partition over my twitter friends

If I look at the labels for the other separate components (not shown), they mainly correspond to people with private accounts, although there are a couple of people who are completely independent of the rest of my Twitter social circle.

The Gephi modularity class statistic, however, suggests there is a little more structure hiding in there…

My twitter network - modularity class

(This is a random algorithm, so it may give slightly different answers each time it is run…)

Let’s peek inside them…

My twitter friends - one cluster

Looks a bit educationalist to me…;-)

How about this one:

another of my twitter clusters

Hmm. Government and open data, maybe? What next…?

Another of my twitter friend clusters...

BBC and journ hack types, with a bit of datajourn thrown in maybe?

Hmmm – the next one looks like an OU cluster:

An OU cluster in my twitter friends

And that leaves….

Final twitter cluster

JISC, museums and libraries…

Seems about right to me:-)

PS Images produced using Gephi… Note to self: start spending a little more time on tidying up the presentation of some of these images…;-)

PPS for a similar exercise applied to my Facebook friends, see Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part IV

UK Journalists on Twitter

A post on the Guardian Datablog earlier today took a dataset collected by the Tweetminster folk and graphed the sorts of thing that journalists tweet about (Journalists on Twitter: how do Britain’s news organisations tweet?).

Tweetminster maintains separate lists of tweeting journalists for several different media groups, so it was easy to grab the names on each list, use the Twitter API to pull down the names of people followed by each person on the list, and then graph the friend connections between folk on the lists. The result shows that the hacks follow each other quite closely:

UK Media Twitter echochamber (via tweetminster lists)

Nodes are coloured by media group/Tweetminster list, and sized by PageRank, as calculated over the network using the Gephi PageRank statistic.

The force directed layout shows how folk within individual media groups tend to follow each other more intensely than they do people from other groups, but that said, inter-group following is still high. The major players across the media tweeps as a whole seem to be @arusbridger, @r4today, @skynews, @paulwaugh and @BBCLauraK.
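If you want to have a go at something similar, here’s a rough sketch of the graph construction step; get_list_members() and get_friends() are hypothetical placeholders for the relevant Twitter API calls, and the list names are made up:

#Keep just the friend links that point at other folk on the lists, and save for Gephi
import networkx as nx

lists=['guardian','bbc','telegraph'] #made-up Tweetminster list names
g=nx.DiGraph()

members={}
for l in lists:
    for name in get_list_members(l): #placeholder
        members[name]=l
        g.add_node(name,group=l)

for name in members:
    for friend in get_friends(name): #placeholder
        if friend in members:
            g.add_edge(name,friend)

nx.write_gexf(g,'ukMediaTwitter.gexf') #load this into Gephi, then size nodes by PageRank there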

I can generate an SVG version of the chart, and post a copy of the raw Gephi GDF data file, if anyone’s interested…

PS if you’re interested in trying out Gephi for yourself, you can download it from gephi.org. One of the easiest ways in is to explore your Facebook network.

PPS for details on how the above was put together, here are a couple of related approaches:
Trying to find useful things to do with emerging technologies in open education
Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community

For a slightly different view over the UK political Twittersphere, see Sketching the Structure of the UK Political Media Twittersphere. And for the House and Senate in the US: Sketching Connections Between US House and Senate Tweeps

Circles vs Community Detection

One take on the story so far:

– Facebook supports symmetrical follows and allows you to see connections between your Facebook friends;
– Twitter supports asymmetric follows and allows you to see pretty much everyone’s friend and follower connections;
– Google+ supports asymmetric follows.

Facebook and Twitter both support lists but hardly anyone uses them. Google+ encourages you to put people into addressable circles (i.e. lists).

If you can grab a copy of connections between folk in your social network, you can run social network statistics that will partition out different social groupings:

My annotated twitter follower network

If you’re familiar with the interests of people in a particular cluster, you can label them (there are also ways you might try to do this automagically).
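As a minimal sketch of doing that partitioning (and labelling) outside Gephi, assuming a recent version of networkx and a graphml dump of the friendship connections (the filename and cluster labels below are made up):

#Partition a friendship graph into clusters and attach a label to each cluster
import networkx as nx
from networkx.algorithms import community

g=nx.read_graphml('myfriends.graphml')
clusters=community.greedy_modularity_communities(g)

labels={0:'edtech',1:'opendata'} #hand-applied labels, as per the SuperFriends idea
for i,nodes in enumerate(clusters):
    for node in nodes:
        g.nodes[node]['cluster']=labels.get(i,'cluster-%d' % i)

nx.write_graphml(g,'myfriends_labelled.graphml')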

Now a Facebook app, Super Friends, will help you identify – and label – clusters in your Facebook network (via ReadWriteWeb):

Super friends facebook app

This is a great feature, and something I could imagine being supported to some extent in Gephi, for example, by allowing the user to create a node attribute where the values represent label mappings from different modularity clusters (or more simply by allowing a user to add a label to each modularity class?).

The SuperFriends app also stands in contrast to the Google+ approach. I’d class SuperFriends as gardening, whereas the Google+ approach is more one of planning. The Google+ approach encourages you to think you’re in control of different parts of your network and makes your life really complicated (which circle do I put this person in; do I need a new circle for this?); the SuperFriends approach helps you realise how complicated (or not) your social circle is. In terms of filters, the Google+ approach encourages you to add your own, whereas the SuperFriends approach helps you identify structure that emerges out of network properties.

Given that in many respects Google is an AI/machine learning company, it’s odd that they’re getting the user to define circle/set membership; maybe it’d be too creepy if they automatically suggested groups? Maybe there’s too much scope for error if you don’t deliberately place people into a group yourself (and instead trust an algorithm to do it?)

Superfriends helps uncover structure, Google+ forces you to make all sorts of choices and decisions every time you “follow” another person. Google+ makes you define tags and categories to label people up front; SuperFriends identifies clusters that might be covered by an obvious tag.

Looking at my Delicious bookmarks, I have almost as many tags as bookmarks… But if I ran some sort of grouping analysis (not sure what?!), maybe natural clusters – and natural tags – would emerge as a result?
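One possible grouping analysis, sketched here with made-up bookmarks and assuming a recent networkx: build a bipartite bookmark/tag graph, project it onto the tags, and see which tags cluster together:

#Do the tags cluster naturally? Bookmarks and tags below are made up for illustration
import networkx as nx
from networkx.algorithms import bipartite,community

bookmarks={'url1':['gephi','networks','visualisation'],
           'url2':['gephi','sna'],
           'url3':['rstats','ggplot2','visualisation']}

b=nx.Graph()
tagset=set()
for url,tags in bookmarks.items():
    for t in tags:
        b.add_edge(url,t)
        tagset.add(t)

tagGraph=bipartite.weighted_projected_graph(b,tagset)
for cluster in community.greedy_modularity_communities(tagGraph):
    print(sorted(cluster)) #each cluster is a candidate "natural tag" grouping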

Maybe I need to read Everything is Miscellaneous again…?

PS if you want to run a more hands on analysis of your Facebook network, try this: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I

PPS here’s another Facebook app that identifies clusters: http://www.fellows-exp.com/ h/t @jacomyal

PPPS @danmcquillan also tweeted that LinkedIn InMaps do a similar clustering job on LinkedIn connections. They do indeed; and they use Gephi. I wonder if they’ve released the code that handles things from the point at which the social network graph data is provided to the rendering of the map?

A Tinkerer’s Toolbox…

A couple of days ago, I ran a sort of repeated, 3 hour, Digital Sandbox workshop session to students on the Goldsmiths’ MA/MSc in Creating Social Media (thanks to @danmcquillan for the invite and the #castlondon students for being so tolerant and engaged ;-)

I guess the main theme was how messy tinkering can be, and how simple ideas often don’t work as you expect them to, often requiring hacks, workarounds and alternative approaches to get things working at all, even if not reliably (which is to say: some of the demos borked;-)

Anyway… the topics covered were broadly:

1) getting data into a form where we can make it flow, as demonstrated by “my hit”, which shows how to screenscrape tabular data from a Wikipedia page using Google spreadsheets, republish it as CSV (eventually!), pull it into a Yahoo pipe and geocode it, then publish it as a KML feed that can be rendered in a Google map and embedded in an arbitrary web page.

2) getting started with Gephi as a tool for visualising and interactively having a conversation with a network represented data set.

To support post hoc activities, I had a play with a Delicious stack as a way of aggregating a set of tutorial like blog posts I had laying around that were related to each of the activities:

Delicious stack

I’d been quite dismissive of Delicious stacks when they first launched (see, for example, Rediscovering playlists), but I’m starting to see how they might actually be quite handy as a way of bootstrapping my way into a set of uncourses and/or ebooks around particular apps and technologies. There’s nothing particularly new about being able to build ordered sets of resources, of course, but the interesting thing for me is that even if I don’t get as far as editing a set of posts into a coherent mini-guide, a well ordered stack may itself provide a useful guide to a particular application, tool, set of techniques or topic.

As to why a literal repackaging of blog posts around a particular tool or technology as an ebook may not be such a good idea in and of itself, see Martin Belam’s post describing his experiences editing a couple of Guardian Shorts*: “Who’s Who: The Resurrection of the Doctor”: Doctor Who ebook confidential and Editing the Guardian’s Facebook ebook

* One of the things I’ve been tracking lately is engagement by the news media in alternative ways of trying to sell their content. A good example of this is the Guardian, who have been repackaging edited collections of (medium and long form) articles on a particular theme as “Guardian Shorts“. So for example, there are e-book article collection wrappers around the breaking of the phone hacking story, or investigating last year’s UK riots. If you want a quick guide to jazz or an overview of the Guardian datastore approach to data journalism, they have those too. (Did I get enough affiliate links in there, do you think?!;-)

This rethinking of how to aggregate, reorder and repackage content into saleable items is something that may benefit content producing universities. This is particularly true in the case of the OU, of course, where we have been producing content for years, and recently making it publicly available through a variety of channels, such as OpenLearn, or, err, the other OpenLearn, via iTunesU, or YouTube, OU/BBC co-productions and so on. It’s also interesting to note how the OU is providing content (under some sort of commercial agreement…?) to other publishers/publications, such as the New Scientist:

OU youtube ads being in New Scientist context

There are other opportunities too, of course, such as Martin Weller’s suggestion that it’s time for the rebirth of the university press, or, from another of Martin’s posts, the creation of “special issue open access journal collections” (Launching Meta EdTech Journal), as well as things like The University Expert Press Room which provides a channel for thematic content around a news area and which complements very well, in legacy terms, the sort of model being pursued via Guardian Shorts?

Journalist Filters on Twitter – The Reuters View

It seems that Reuters has a new product out – Reuters Social Pulse. As well as highlighting “the stories being talked about by the newsmakers we follow”, there is an area highlighting “the Reuters & Klout 50 where we rank America’s most social CEOs.” Of note here is that this list is ordered by Klout score. Reuters don’t own Klout (yet?!) do they?!

The offering also includes a view of the world through the tweets of Reuters’ own staff. Apparently, “Reuters has over 3,000 journalists around the world, many of whom are doing amazing work on Twitter. That is too many to keep up with on a Twitter list, so we created a directory [Reuters Twitter Directory] that shows you our best tweeters by topic. It lets you find our reporters, bloggers and editors by category and location so you can drill down to business journalists in India, if you so choose, or tech writers in the UK.”

If you view the source of Reuters Twitter directory page, you can find a Javascript object that lists all(?) the folk in the Reuters Twitter directory and the tags they are associated with… Hmm, I thought… Hmmm…

If we grab that object, and pop it into Python, it’s easy enough to create a bipartite network that links journalists to the categories they are associated with:

import simplejson
import networkx as nx
#http://mlg.ucd.ie/files/summer/tutorial.pdf
from networkx.algorithms import bipartite

g = nx.Graph()

#need to bring in reutersJournalistList
users=simplejson.loads(reutersJournalistList)

#I had some 'issues' with the parsing for some reason? Required this hack in the end...
for user in users:
	for x in user:
		if x=='users':
			u=user[x][0]['twitter_screen_name']
			print 'user:',user[x][0]['twitter_screen_name']
			for topic in user[x][0]['topics']:
				print '- topic:',topic
				#Add edges from journalist name to each tag they are associated with
				g.add_edge(u,topic)
#print bipartite.is_bipartite(g)
#print bipartite.sets(g)

#Save a graph file we can visualise in Gephi corresponding to bipartite graph
nx.write_graphml(g, "usertags.graphml")

#We can find the sets of names/tags associated with the disjoint sets in the graph
users,tags=bipartite.sets(g)

#Collapse the bipartite graph to a graph of journalists connected via a common tag
ugraph= bipartite.projected_graph(g, users)
nx.write_graphml(ugraph, "users.graphml")

#Collapse the bipartite graph to a set of tags connected via a common journalist
tgraph= bipartite.projected_graph(g, tags)
nx.write_graphml(tgraph, "tags.graphml")

#Dump a list of the journalists Twitter IDs
f=open("users.txt","w+")
for uo in users: f.write(uo+'\n')
f.close()

Having generated graph files, we can then look to see how the tags cluster as a result of how they were applied to journalists associated with several tags:

Reuters journalists twitter directory cotags

Alternatively, we can look to see which journalists are connected by virtue of being associated with similar tags (hmm, I wonder if edge weight carries information about how many tags each connected pair may be associated through? [UPDATE: there is a projection that will calculate this – bipartite.projection.weighted_projected_graph]). In this case, I size the nodes by betweenness centrality to try to highlight journalists that bridge topic areas:

Reuters twitter journalists list via cotags, sized by betweenness centrality
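Following up the update above, the weighted projection is pretty much a one-liner if we reuse the bipartite graph g and the users set from the script earlier in this post:

#The weighted projection keeps a count of shared tags as an edge 'weight' attribute
from networkx.algorithms import bipartite
wgraph=bipartite.weighted_projected_graph(g,users)
nx.write_graphml(wgraph,"users_weighted.graphml")
#in Gephi, edge thickness can then be mapped onto the number of tags each pair shares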

Association through shared tags (as applied by Reuters) is one thing, but there is also structure arising from friendship networks… So to what extent do the Reuters Twitter list journalists follow each other (again, sizing by betweenness centrality):

Reuters twitter journalists friend connections sized by betweenness centrality

Finally, here’s a quick look at folk followed by 15 or more of the folk in the Reuters Twitter journalists list: this is the common source area on Twitter for the journalists on the list. This time, I size nodes by eigenvector centrality.

Folk followed by 15 or more of the folk on the Reuters Twitter journalists list, sized by eigenvector centrality

So why bother with this? Because journalists provide a filter onto the way the world is reported to us through the media, and as a result the perspective we have of the world as portrayed through the media. If we see journalists as providing independent fairwitness services, then having some sort of idea about the extent to which they are sourcing their information severally, or from a common pool, can be handy. In the above diagram, for example, I try to highlight common sources (folk followed by at least 15 of the journalists on the Twitter list). But I could equally have got a feeling for the range of sources by producing a much larger and sparser graph, such as all the folk followed by journalists on the list, or folk followed by only 1 person on the list (40,000 people or so in all – see below), or by 2 to 5 people on the list…

The twitterverse as directly and publicly followed by folk on the Reuters Journalists twitter list

Friends lists are one sort of filter every Twitter user has onto the content being shared on Twitter, and something that’s easy to map. There are other views of course – the list of people mentioning a user is readily available to every Twitter user, and it’s easy enough to set up views around particular hashtags or search terms. Grabbing the journalists associated with one or more particular tags, and then mapping their friends (or, indeed, followers) is also possible, as is grabbing the follower lists for one or more journalists and then looking to see who the friends of the followers are, thus positioning the journalist in the social media environment as perceived by their followers.

I’m not sure what value Reuters sees in the stream of tweets from the folk on its Twitter journalists lists, or the Twitter networks they have built up, but we can at least try to map out the friend lenses. And via the bipartite user/tag graph, it also becomes trivial for us to find journalists with interests in Facebook and advertising, for example…

PS for associated techniques related to the emergent social positioning of hashtags and shared links on Twitter, see Socially Positioning #Sherlock and Dr John Watson’s Blog… and Social Media Interest Maps of Newsnight and BBCQT Twitterers. For a view over @skynews Twitter friends, and how they connect, see Visualising How @skynews’ Twitter Friends Connect.

Is Twitter Starting to Make a Grab for the Interest Graph?

Over the last couple of years, I’ve dabbled with mapping parts of the interest graph defined by friends and follower relationships on Twitter. But with a couple of recent Twitter announcements, I’m starting to wonder if my ability to continue producing such maps will, to all intents and purposes, soon be curtailed?

Here’s the story so far: the technique I’ve been using so far to generate interest maps relies on finding how the people followed by a particular individual follow each other, or who the common friends of the followers of a particular user are, or who the common friends of a set of hashtag or search term users are. This allows us to generate maps that show common interests of the friends of a given user, the followers of a user, or the users of a hashtag. Note that we interpret ‘interests’ loosely as named Twitter accounts that we can associate with a particular sort of interest. So for example, following @OrdnanceSurvey may be indicative of an interest in maps or mapping, or following @BBCr4today indicative of an interest in politics and current affairs, and so on.

As an example, here’s a map from a year or so ago of some of the accounts commonly followed by a sample of followers of the RedBullRacing account on Google+:

If you look closely, you can spot clusters (that is, groups of names close to each other) relating to Formula One teams, F1 drivers, performance cars, and so on…

The pattern of friends and followers in a social graph may thus be interpreted as some sort of interest graph. I personally find these maps to be, in and of themselves, interesting, in much the same way some folk take interest in, and pleasure from looking at, cartographic maps.

Knowledge of how this interest space is structured is also of interest to ad-sales marketers, whose aim is to sell highly targeted audiences (that is, folk interested in a particular topic, or who focus interest onto a particular area of the interest space) to advertisers. (For example, see From Paywalls and Attention Walls to Data Disclosure Walls and Survey Walls or What’s the Point of Social Media Metrics?.) Twitter also appears to have woken up to the interest graph as a potentially rich source of revenue: Interest targeting: Broaden your reach, reach the right audience.

Today we’re taking an important next step by allowing you to target your Promoted Tweets and Promoted Accounts campaigns to a set of interests that you explicitly choose. By targeting people’s topical interests, you will be able to connect with a greater number of users and deliver tailored messages to people who are more likely to engage with your Tweets.

There are two flavors of interest targeting. For broader reach, you can target more than 350 interest categories, ranging from Education to Home and Garden to Investing to Soccer.

If you want to target more precise sets of users, you can create custom segments by specifying certain @​usernames that are relevant to the product, event or initiative you are looking to promote. Custom segments let you reach users with similar interests to that @​username’s followers; they do not let you specifically target the followers of that @​username. If you’re promoting your indie band’s next tour, you can create a custom audience by adding @​usernames of related bands, thus targeting users with the same taste in music. This new feature will help you reach beyond your followers and users with similar interests, and target the most relevant audience for your campaign.

Let’s try to unpick that: [c]ustom segments let you reach users with similar interests to that @username’s followers. In my naive definition, my first attempt to implement this would go something along the lines of: custom segments let you reach users who tend to follow similar people to the people followed en masse by that @username’s followers. That is, I would position each @username in an interest space defined by people commonly followed by the followers of @username, and then target people who tend to follow the people disproportionately represented in that interest space compared to some sort of “baseline” representation of interests of a more general population. I’ve no idea how Twitter do the targeting, but that would be my first step.
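A naive sketch of that sort of positioning might look something like the following; get_followers() and get_friends() are hypothetical placeholders for the relevant Twitter API calls, and the baseline sample is assumed to have been collected separately:

#Which accounts are disproportionately followed by a sample of @username's followers,
#compared with a baseline sample of "random" users?
from collections import Counter

def friendCounts(userSample):
    counts=Counter()
    for u in userSample:
        counts.update(get_friends(u)) #placeholder
    return counts

followerSample=get_followers('username')[:500] #placeholder; sample of the target's followers
baselineSample=load_baseline_sample() #placeholder; assumed pre-collected comparison sample

targetCounts=friendCounts(followerSample)
baselineCounts=friendCounts(baselineSample)

#Accounts followed by >10% of the follower sample, ranked by how over-represented they are
interestSpace=[]
for acct,n in targetCounts.items():
    if n>0.1*len(followerSample):
        baseFrac=max(baselineCounts[acct]/float(len(baselineSample)),0.001) #arbitrary floor
        interestSpace.append(((n/float(len(followerSample)))/baseFrac,acct))
interestSpace.sort(reverse=True)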

If targeted advertising is Twitter’s money play, then it’s obviously in their interest to keep hold of the data juice that lets them define audiences by interest. Which is to say, they need to keep the structure of the graph as closed as possible. [UPDATE – it seems as if Twitter is starting/continuing to block other social networks’ access to its social graph data…]

Unlike Facebook – which limits users to seeing friendships between their friends (or, painfully in terms of API calls, testing whether friendship connections appear between two specified individuals) – or LinkedIn, which doesn’t let you get hold of any data about how your friends connect other than in graphical form using InMaps (Visualize your LinkedIn network with InMaps), Twitter makes friend and follower data publicly available (unless you have a protected account). (If you want to visualise your own Facebook network, here’s a recipe for doing so: Getting Started With The Gephi Network Visualisation App – My Facebook Network.) Google+ also makes an individual’s connection data public in a roundabout way, although not via an easily accessed API (to access the graph data as data, you need to scrape it, as described here: So Where Am I Socially Situated on Google+?).

However, some changes have recently been announced relating to the Twitter API that look likely to limit the extent to which we can sample anything other than small, fragmentary snapshots of the interest graph: Changes coming in Version 1.1 of the Twitter API. In particular, the new 1.1 version of the Twitter API will apply differential rate limiting to different Twitter API endpoints, compared to the current limit of 350 API calls per hour summed across all API endpoints. For the last few years I’ve been lucky enough to benefit from a whitelisted Twitter API key (granted for research purposes) that has allowed me 20,000 calls per hour. (I’ve only ever used a fraction of the total possible number of API calls I could have made over that period, but on occasion have made several thousand calls over a short period to grab the friends or followers lists of hundreds or low thousands of users that I then use to generate the social interest maps.)

What my whitelisted API key allowed, and what the original 350 calls per hour limit for users not grandfathered in to the whitelisted key limit allowed, was the ability to grab the friends (or followers) lists of at least hundreds of users per hour. My own ad hoc experiments suggest that sampling the friends of 500 or so followers of a particular account that may have tens or hundreds of thousands of followers, and then looking for accounts followed by 10-20% of that follower sample, gives an idea of the social interest positioning of the target account. Which means with a 350 API call limit per hour, you could generate at least one guesstimate interest positioning map every couple of hours. (With my 20k limit, I could generate several, much larger sample-size maps, per hour.) However, if the whitelisted key limit does not continue to be offered under version 1.1 of the Twitter API, and if rate limiting of 60 calls per hour to the friends/follower list endpoints is enforced (as looks likely?), this means that we’d only be able to generate one or two small-ish sample interest maps per day, and to run larger sample size maps, we’d have to max out hourly friends/follower list collection API calls 24 hours a day for several days in order to collect the data. Which is a pain. And a de facto block on the harvesting of graph data for the purposes of generating interest maps. (Which is why I’m hoping against hope my whitelisted API access continues! Though I am starting to feel as if I have squandered it, and either should have built a proper business around it and milked it for all it was worth, or sold the key on;-)

PS This looks like it could lead to the second major loss of “interesting” functionality relating to “derived” services around Twitter that I can think of, the first being the disappearance of services like the Twist search tool, that used to show volume trends for keywords as used on Twitter over the previous week:

twitter trends

(There are still a couple of hour/day trend analysis tools based on archived legacy datasets (I think?). For example, Tweet-o-life or timeu.se, as I’ve previously described in A Couple More Webrhythm Identifying Tools.)

We can still pick up webrhythms from tools such as Google Insights for Search, cultural rhythms from things like Google Books NGrams, and possible correlates using tools like Google Correlate, but the Twitter rhythms gave a much more visceral account of daily life than the trends that things like Google Trends can detect.

One of the things I find really disheartening about the web, even in its short lifetime to date, is the way that access to these large scale behavioural patterns is promised as services start to scale up, but then disappears again as companies start to lock down their services or rein in access to the data, or the ad hoc third party services fail to scale (perhaps because there isn’t widespread enough interest to keep them ticking over?). Such is life, I suppose…

ILI2012 Workshop Prep – Appropriating IT: innovative uses of emerging technologies

Given that workshops at ILI2012 last a day (10 till 5), I thought I’d better start prepping the workshop I’m delivering with Martin Hawksey at this year’s Internet Librarian International early… W2 – Appropriating IT: innovative uses of emerging technologies:

Are you concerned that you are not maximising the potential of the many tools available to you? Do you know your mash-ups from your APIs? How are your data visualisation skills? Could you be using emerging technologies more imaginatively? What new technologies could you use to inspire, inform and educate your users? Learn about some of the most interesting emerging technologies and explore their potential for information professionals.

The workshop will combine a range of presentations and discussions about emerging information skills and techniques with some practical ‘makes’ to explore how a variety of free tools and applications can be appropriated and plugged together to create powerful information handling tools with few, if any, programming skills required.

Topics include:

– Visualisation tools
– Maps and timelines
– Data wrangling
– Social media hacks
– Screenscraping and data liberation
– Data visualisation

(If you would like to join in with the ‘makes’, please bring a laptop)

I have some ideas about how to fill the day – and I’m sure Martin does too – but I thought it might be worth asking what any readers of this blog might be interested in learning about in a little more detail and using slightly easier, starting from nowhere baby steps than I usually post.

My initial plan is to come up with five or six self contained elements that can also be loosely joined, structuring the day something like this:

  • opening, and an example of the sort of thing you’ll be able to do by the end of the day – no prior experience required, handheld walkthroughs all the way; intros from the floor along with what folk expect to get out of the day/want to be able to do at the day (h/t @briankelly in the comments; of course, if folks’ expectations differ from what we had planned….;-). As well as demo-ing how to use tools, we’ll also discuss why you might want to do these things and some of the strategies involved in trying to work out how to do them, knowing what you already know, or how to find out/work out how to do them if you don’t..
  • The philosophy of “appropriation”, “small pieces, lightly joined”, “minimum viability” and “why Twitter, blogs and Stack Overflow are Good Things”;
  • Visualising Data – because it’s fun to start playing straight away…
    • Google Fusion Tables – visualisations and queries
    • Google visualisation API/chart components

    Payoff: generate some charts and dashboards using pre-provided data (any ideas what data sets we might use…? At least one should have geo-data for a simple mapping demo…)

  • — Morning coffee break? —
  • Data scraping:
    • Google spreadsheets – import CSV, import HTML table;
    • Google Refine – import XLS, import JSON, import XML
    • (Briefly) – note the existence of other scraper tools, incl. Scraperwiki, and how they can be used

    Payoff: scrape some data and generate some charts/views… Any ideas what data to use? For the JSON, I thought about finishing with a grab of Twitter data, to set up after lunch…

  • — Lunch? —
  • (Social) Network Analysis with Gephi
    • Visually analyse Twitter data and/or Facebook data grabbed using Google Refine and/or TAGSExplorer
    • Wikipedia graphing using DBPedia
    • Other examples of how to think in graphs…
  • The scary session…
    • Working with large data files – examples of some simple text processing command line tools
    • Data cleansing and shaping – Google Refine, for the most part, including the use of reconciliation; additional examples based on regular expressions in a text editor, Google spreadsheets as a database, Stanford Data Wrangler, and R…
  • — Afternoon coffee break? —
  • Writing Diagrams – examples referring back to Gephi, mentioning Graphviz, then looking at R/ggplot2, finishing with R’s googleVis library as a way of generating Google Visualisation API Charts…
  • Wrap up – review of the philosophy, showing how it was applied throughout the exercises; maybe a multi-step mashup as a final demo?

Requirements: we’d need good wifi/network connections; also, it would help if participants pre-installed – and checked the set up of: a) a Google account; b) a modern browser (standardising on Google Chrome might be easiest?); c) Google Refine; d) Gephi (which may also require the installation of a Java runtime, e.g. on a new-ish Mac); e) R; f) RStudio and a raft of R libraries (ggplot2, plyr, reshape, RCurl, stringr, googleVis); g) a good text editor (I use TextWrangler on a Mac); h) commandline tools (Windows machines).

Throughout each session, participants will be encouraged to identify datasets or IT workflow issues they encounter at work and discuss how the ideas presented in the workshop may be appropriated for use in those contexts…

Of course, this is all subject to change (I haven’t asked Martin how he sees the day panning out yet;-), but it gives a flavour of my current thinking… So: what sorts of things would you like to see? And would you like to book any of the sessions for a workshop at your place…?!;-)