Archive for the ‘Visualisation’ Category
Do Retweeters Lack Commitment to a Hashtag?
I seem to be going down more ratholes than usual at the moment, in this case relating to activity round Twitter hashtags. Here’s a quick bit of reflection around a chart from Visualising Activity Around a Twitter Hashtag or Search Term Using R that shows activity around a hashtag that was minted for an event that took place before the sample period.
The y-axis is organised according to the time of first use (within the sample period) of the tag by a particular user. The x axis is time. The dots represent tweets containing the hashtag, coloured blue by default, red if they are an old-style RT (i.e. they begin RT @username:).
So what sorts of thing might we look for in this chart, and what are the problems with it? Several things jump out at me:
- For many of the users, their first tweet (in this sample period at least) is an RT; that is, they are brought into the hashtag community through issuing an RT;
- Many of the users whose first use is via an RT don’t use the hashtag again within the sample period. Is this typical? Does this signal represent amplification of the tag without any real sense of engagement with it?
- A noticeable proportion of folk whose first use is not an RT go on to post further non-RT tweets. Does this represent an ongoing commitment to the tag? Note that this chart does not show whether tweets are replies, or “open” tweets. Replies (that is, tweets beginning @username are likely to represent conversational threads within a tag context rather than “general” tag usage, so it would be worth using an additional colour to identify reply based conversational tweets as such.
- “New style” retweets are diaplayed as retweets by colouring… I need to check whether or nor newstyle RT information is available that I could use to colour such tweets appropriately. (or alternatively, I’d have to do some sort of string matching to see whether or not a tweet was the same as a previously seen tweet, which is a bit of a pain:-(
(Note that when I started mapping hashtag communities, I used to generate tag user names based on a filtered list of tweets that excluded RTs. this meant that folk who only used the tag as part of an RT and did not originate tweets that contained the tag, either in general or as part of a conversation, would not be counted as a member of the hashtag community. More recently, I have added filters that include RTs but exclude users who used the tag only once, for example, thus retaining serial RTers, but not single use users.)
So what else might this chart tell us? Looking at vertical slices, it seems that news entrants to the tag community appear to come in waves, maybe as part of rapid fire RT bursts. This chart doesn’t tell us for sure that this is happening, but it does highlight areas of the timelime that might be worth investigating more closely if we are interested in what happened at those times when there does appear to be a spike in activity. (Are there any modifications we could make to this chart to make them more informative in this respect? The time resolution is very poor, for example, so being able to zoom in on a particular time might be handy. Or are there other charts that might provide a different lens that can help us see what was happening at those times?)
And as a final point – this stuff may be all very interesting, but is it useful?, And if so, how? I also wonder how generalisable it is to other sorts of communication analysis. For example, I think we could use similar graphical techniques to explore engagement with an active comment thread on a blog, or Google+, or additions to an online forum thread. (For forums with mutliple threads, we maybe need to rethink how this sort of chart would work, or how it might be coloured/what symbols we might use, to distinguish between starting a new thread, or adding to a pre-existing one, for example. I’m sure the literature is filled with dozens of examples for how we might visualise forum activity, so if you know of any good references/links…?! ;-) #lazyacademic)
What is the Potential Audience Size for a Hashtag Community?
What’s the potential audience size around a Twitter hashtag?
Way back when, in the early days of webs stats, reported figures tended to centre around the notion of hits, the number of calls made to a server via website activity. I forget the details, but the metric was presumably generated from server logs. This measure was always totally unreliable, because in the course of serving a web page, a server might be hit multiple times, once for each separately delivered asset, such as images, javascript files, css files and so on. Hits soon gave way to the notion of Page Views, which more accurately measured the number of pages (rather than assets) served via a website. This was complemented with the notion of Visits and Unique Visits: Visits, as tracked by a cookies, represent a set of pages viewed around about the same time by the same person. Unique Visits (or “Uniques”), represent the number of different people who appear to have visited the site in any given period.
What we see here, then, is a steady evolution in the complexity of website metrics that reflects on the one hand dissatisfaction with one way of measuring or reporting activity, and on the other practical considerations with respect to instrumentation and the ability to capture certain metrics once they are conceived of.
Widespread social media monitoring/tracking is largely still in the realm of “hits” measurement. Personal dashboards for services such as Twitter typically display direct measures provided by the Twitter API, or measures trivially/directly identified from Twitter API or archived data – number of followers, numbers of friends, distribution of updates over time, number of mentions, and so on.
Something both myself and Martin Hawksey have been thinking about on and off for some time are ways of reporting activity around Twitter hashtags. A commonly(?!) asked question in this respect relates to how much engagement (whatever that means) there has been with a particular tag. So here’s a quick mark in the sand about some of my current thinking about this. (Note that these ideas may well have been more formally developed in the academic literature – I’m a bit behind in my reading! If you know something that covers this in more detail, or that I should cite, please feel free to add a link in the comments… #lazyAcademic.)
One of the first metrics that comes to my mind is the number of people who have used a particular hashtag, and the number of their followers. Easily stated, it doesn’t take a lot of thought to realise even these “simple” measures are fraught with difficulty:
- what counts as a use of the hashtag? If I retweet a measure of yours that contains a hashtag, have I used it in any meaningful sense? Does a “use” mean the creation of a new tweet containing the tag? What about if I reply to a tweet from you than contains the tag and I include the tag in my reply to you, even if I’m not sure what that tag relates to?
- the potential audience size for the tag (potential uniques?), based on the number of followers of the tag users. At first glance, we might think this can be easily calculated by adding together the follower counts of the tag users, but this is more strictly an approximation of the potential audience: the set of followers of A may include some of the followers of B, or C; do we count the tag users themselves amongst the audience? If so, the upper bound also needs to take into account the fact that none of the users may be followers of any of the other tag users.
Note there is also a lower bound – the largest follower count amongst the tag users (whatever that means…) of the hashtag. Furthermore, if we want to count the number of folk not using the tag but who may have seen the tag, this lower bound can be revised downwards by subtracting the number of tag users minus one (for the tag user with the largest follower count). The value is still only an approximation, though, becuase it assumes that all the tag users are actually included as followers of at least one, each, of the tag users. (If you think these points are “just academic”, they are and they aren’t – observations like these can often be used to help formulate gaming strategies around metrics based on these measures.) - the potential number of views of a tag, for example based on the product of the number of times a user tweets and their follower count?
- the reach of (or active engagement with?) the tag, as measured by the number of people who actually see the tag, or the number of people who take and action around it (such as replying to a tagged tweet, RTing it, or clicking on a link a tagged tweet contains); note that we may be able ot construct probabilistic models (albeit quite involved ones) of the potential reach based on factors like the number of people someone follows, when they are online, the rate at which the people they follow tweet, and so on..
To try to make this a little more concrete, here are a couple of scripts for exploring the potential audience size of a tag based on the followers of the tag users (where a user is someone who publishes or retweets a tweet containing the tag over a specified period). The first, Python script runs a Twitter search and generates a list of unique users of the tag, along with the timestamp of their first use of the tag within the sample period. This script also grabs all the followers of the tag users, along with their counts, and generates running cumulative (upper bound approximation) count of the tag user follower numbers as well as calculating the rolling set of unique followers to date as each new tag user is observed. The second, R script plots the values.
The first thing we can do is look at the incidence of new users of the hashtag over time:
(For a little more discussion of this sort of chart, see Visualising Activity Around a Twitter Hashtag or Search Term Using R and its inspiration, @mediaczar’s How should Page Admins deal with Flame Wars?.)
More relevant to this post, however, is a plot showing some counts relating to followers of users of the hashtag:
In this case, the top, green line represents the summed total number of followers for tag users as they enter the conversation. If every user had completely different followers, this might be meaningful, but where conversation takes place around a tag between folk who know each other, it’s highly likely that they have followers in common.
The middle, red line shows a count of the number of unique followers to date, based on the the followers of users of the tag to date.
The lower, blue line shows the difference between the red and green lines. This represents the error between the summed follower counts and the actual number of unique followers.
Here’s a view over the number of new unique potential audience members at each time step (I think the use of the line chart here may be a mistake… I think bars/lineranges would probably be more appropriate…):
In the following chart, I overplot oneline with another. The lower layer (a red line) is the total follower account for each new tag user. The blue is the increase in the potential audience count (that is, the number of the new users’ followers that haven’t potentially seen the tag so far). The range of the visible part of the red line thus shows the number of a new tag user’s followers who have potentially already seen the tag. Err… maybe (that is, if my code is correct and all the scripts are doing what I think they’re doing! If they aren’t, then just treat this post as an exploration of the sorts of charts we might be able to produce to explore audience reach;-)
Here are the scripts (such as they are!)
import newt,csv,tweepy
import networkx as nx
#the term we're going to search for
tag='ddj'
#how many tweets to search for (max 1500)
num=500
##Something along lines of:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(SKEY, SSECRET)
api = tweepy.API(auth, cache=tweepy.FileCache('cache',cachetime), retry_errors=[500], retry_delay=5, retry_count=2)
#You need to do some work here to search the Twitter API
tweeters, tweets=yourSearchTwitterFunction(api,tag,num)
#tweeters is a list of folk who tweeted the term of interest
#tweets is a list of the Twitter tweet objects returned from the search
#My code for this is tightly bound up in a large and rambling library atm...
#Put tweets into chronological order
tweets.reverse()
#I was being lazy and wasn't sure what vars I needed or what I was trying to do when I started this!
#The whole thing really needs rewriting...
tweepFo={}
seenToDate=set([])
uniqSourceFo=[]
#runtot is crude and doesn't measure overlap
runtot=0
oldseentodate=0
#Construct a digraph from folk using the tag to their followers
DG=nx.DiGraph()
for tweet in tweets:
user=tweet['from_user']
if user not in tweepFo:
tweepFo[user]=[]
print "Getting follower data for", str(user), str(len(tweepFo)), 'of', str(len(tweeters))
mi=tweepy.Cursor(api.followers_ids,id=user).items()
userID=tweet['from_user_id'] #check
DG.add_node(userID,label=user)
for m in mi:
tweepFo[user].append(m)
#construct graph
DG.add_edge(userID,m,weight=1)
DG.node[m]['label']=''
ufc=len(tweepFo[user])
runtot=runtot+ufc
#seen to date is all people who have seen so far, plus new ones, so it's the union
oldseentodate=len(seenToDate)
seenToDate=seenToDate.union(set(tweepFo[user]))
uniqSourceFo.append((tweet['created_at'],len(seenToDate),user,runtot,ufc,oldseentodate))
else:
#I'm weighting the edges so we can count how many times folk see the hashtag
if len(DG.edges(userID))>0:
tmp1,tmp2=DG.edges(userID)[0]
weight=DG[userID][tmp2]['weight']+1
for fromN,toN in DG.edges(userID):
DG[fromN][toN]['weight']=weight
fo='reports/tmp/'+tag+'_ncount.csv'
f=open(fo,'wb+')
writer=csv.writer(f)
writer.writerow(['datetime','count','newuser','crudetot','userFoCount','previousCount'])
for ts,l,u,ct,ufc,ols in uniqSourceFo:
print ts,l
writer.writerow([ts,l,u,ct,ufc,ols])
f.close()
print "Writing graph.."
filter=[]
for n in DG:
if DG.degree(n)>1: filter.append(n)
filter=set(filter)
H=DG.subgraph(filter)
nx.write_graphml(H, 'reports/tmp/'+tag+'_ncount_2up.graphml')
print "Writing other graph.."
nx.write_graphml(DG, 'reports/tmp/'+tag+'_ncount.graphml')
Here’s the R script…
ddj_ncount <- read.csv("~/code/twapps/newt/reports/tmp/ddj_ncount.csv")
#Convert the datetime string to a time object
ddj_ncount$ttime=as.POSIXct(strptime(ddj_ncount$datetime, "%a, %d %b %Y %H:%M:%S"),tz='UTC')
#Order the newuser factor levels into the order in which they first use the tag
dda=subset(ddj_ncount,select=c('ttime','newuser'))
dda=arrange(dda,-desc(ttime))
ddj_ncount$newuser=factor(ddj_ncount$newuser, levels = dda$newuser)
#Plot when each user first used the tag against time
ggplot(ddj_ncount) + geom_point(aes(x=ttime,y=newuser)) + opts(axis.text.x=theme_text(size=6),axis.text.y=theme_text(size=4))
#Plot the cumulative and union flavours of increasing possible audience size, as well as the difference between them
ggplot(ddj_ncount) + geom_line(aes(x=ttime,y=count,col='Unique followers')) + geom_line(aes(x=ttime,y=crudetot,col='Cumulative followers')) + geom_line(aes(x=ttime,y=crudetot-count,col='Repeated followers')) + labs(colour='Type') + xlab(NULL)
#Number of new unique followers introduced at each time step
ggplot(ddj_ncount)+geom_line(aes(x=ttime,y=count-previousCount,col='Actual delta'))
#Try to get some idea of how many of the followers of a new user are actually new potential audience members
ggplot(ddj_ncount) + opts(axis.text.x=theme_text(angle=-90,size=4)) + geom_linerange(aes(x=newuser,ymin=0,ymax=userFoCount,col='Follower count')) + geom_linerange(aes(x=newuser,ymin=0,ymax=(count-previousCount),col='Actual new audience'))
#This is still a bit experimental
#I'm playing around trying to see what proportion or number of a users followers are new to, or subsumed by, the potential audience of the tag to date...
ggplot(ddj_ncount) + geom_linerange(aes(x=newuser,ymin=0,ymax=1-(count-previousCount)/userFoCount)) + opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
In the next couple of posts in this series, I’ll start to describe how we can chart the potential increase in audience count as a delta for each new tagger, along with a couple of ways of trying to get some initial sort of sense out of the graph file, such as the distribution of the potential number of “views” of a tag across the unique potential audience members…
PS See also the follow on post More Thoughts on Potential Audience Metrics for Hashtag Communities
Visualising Activity Around a Twitter Hashtag or Search Term Using R
I think one of valid criticisms around a lot of the visualisations I post here and on my various #f1datajunkie blogs is that I often don’t post any explanatory context around the visualisations. This is partly a result of the way I use my blog posts in a selfish way to document the evolution of my own practice, but not necessarily the “so what” elements that represent any meaning or sense I take from the visualisations. In many cases, this is because the understanding I come to of a dataset is typically the result of an (inter)active exploration of the data set; what I blog are the pieces of the puzzle that show how I personally set about developing a conversation with a dataset, pieces that you can try out if you want to…;-)
An approach that might get me more readers would be to post commentary around what I’ve learned about a dataset from having a conversation with it. A good example of this can be seen in @mediaczar’s post on How should Page Admins deal with Flame Wars?, where this visualisation of activity around a Facebook post is analysed in terms of effective (or not!) strategies for moderating a flame war.

The chart shows a sequential ordering of posts in the order they were made along the x-axis, and the unique individual responsible for each post, ordered by accession to the debate along the y-axis. For interpretation and commentary, see the original post: How should Page Admins deal with Flame Wars? ;-)
One take away of the chart for me is that it provides a great snapshot of new people entering into a conversation (vertical lines) as well as engagement by an individual (horizontal lines). If we use a time proportional axis on x, we can also see engagement over time.
In a Twitter context, it’s likely that a rapid increase in numbers of folk engaging with a hashtag, for example, might be the result of an RT related burst of activity. For folk who have already engaged in hashtag usage, for example as part of a live event backhannel, a large number of near co-occurring tweets that are not RTs might signal some notable happenstance within the event.
To explore this idea, here’s a quick bit of R tooling inspired by Mat’s post… It uses the twitteR library and sources tweets via a Twitter search.
require(twitteR)
#Pull in a search around a hashtag.
searchTerm='#ukgc12'
rdmTweets <- searchTwitter(searchTerm, n=500)
# Note that the Twitter search API only goes back 1500 tweets
#Plot of tweet behaviour by user over time
#Based on @mediaczar's http://blog.magicbeanlab.com/networkanalysis/how-should-page-admins-deal-with-flame-wars/
#Make use of a handy dataframe creating twitteR helper function
tw.df=twListToDF(rdmTweets)
#@mediaczar's plot uses a list of users ordered by accession to user list
## 1) find earliest tweet in searchlist for each user [ http://stackoverflow.com/a/4189904/454773 ]
require(plyr)
tw.dfx=ddply(tw.df, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
tw.df$screenName=factor(tw.df$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
require(ggplot2)
ggplot(tw.df)+geom_point(aes(x=created,y=screenName))
We can get a feeling for which occurrences were old-style RTs by identifying tweets that start with a classic RT, and then colouring each tweet appropriately (note there may be some overplotting/masking of points…I’m not sure how big the x-axis time bins are…)
#Identify and colour the RTs...
library(stringr)
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#Identify classic style RTs
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
ggplot(tw.df)+geom_point(aes(x=created,y=screenName,col=rtt))
So now we can see when folk entered into the hashtag community via a classic RT.
We can also start to explore who was classically retweeted when:
#Generate a plot showing how a person is RTd tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2])) #Note that this doesn't show how many RTs each person got in a given time period if they got more than one... ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))
Another view might show who was classically RTd by whom (activity along a row indicating someone was retweeted a lot through one or more tweets, activity within a column identifying an individual who RTs a lot…):
#We can start to get a feel for who RTs whom...
require(gdata)
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=drop.levels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(x=screenName,y=rtof))+opts(axis.text.x=theme_text(angle=-90,size=6)) + xlab(NULL)
What sense you might make of all this, or where to take it next, is down to you of course… Err, erm…?! ;-)
PS see also: http://blog.ouseful.info/2012/01/21/a-quick-view-over-a-mashe-google-spreadsheet-twitter-archive-of-ukgc2012-tweets/
TV Audience Social Interest Mapping – Shameless vs. Newsnight vs Masterchef
How easy is it to differentiate between audiences of different types of TV programme based on their socially signalled interests?
This evening, I ran a couple of Twitter searches against the #shameless and #newsnight hashtags. In each case, I grabbed 1500 of the most recent tweets and generated lists of folk who had tweeted the corresponding hashtag at least twice in the sample set. I then grabbed the lists of all the friends of the folk in each list to generate a projection map of the friends of recent hashtaggers. The final preprocessing step was to filter each network to contain only nodes that had at least an indegree or outdegree of 25 (that is, I filtered the network to only include folk who had at least 25 friends, or were linked to by at least 25 of the folk in the corresponding hashtaggers list).
Here’s the resulting map generated around the #shameless tag – it gives an impression of folk who tend to be followed by folk using the #shameless tag:
Music, celebrities, footballers and comedians, err, I think?!
By way of comparison, here’s a sketch of who the folk using the #newsnight tag follow:
MPs, political hacks, and higher brow level of comedian maybe?! ;-)
PS for what it’s worth, here’s another map for tweets grabbed now around #masterchef, which aired a few hours ago:
So that’ll be a cooking show with some high profile talent (in the narrower scheme of things) then?!
Opening Up University Energy Data
Knowing how much energy a building uses is often the first step towards reducing it’s energy footprint, so here’s a quick round up of the university open data initiatives I know of that are based around energy data. (If you know of more, please let me know via the comments.)
First up, via @lncd, here’s a heatmap showing change in energy building usage compared aross the last two days:
This visulisation (built on top of Lincoln U’s open data feeds (http://data.lincoln.ac.uk/)) charts energy usage over a calendar month:
Read more about Lincoln’s energy data hacks here: University of Lincoln Energy Data …. an update!.
Over in Oxford, there’s a tool called OpenMeters that displays charts of energy usage by building:
The Oxford data is available as Linked Data from http://data.ox.ac.uk/datasets/, err, I think… As ever, it’d probably take me an hour or two to find out how to make my first query that returns anything meaningful to me in a form I could actually do anything with!;-) I also wonder whether the Oxford project feeds into another Oxford initiative: iMeasure?
(By the by, I misread the title of this place from the other place to Oxford as “A Linked Data Model Of Building Energy Consumption”: A Limited-Data Model Of Building Energy Consumption, which explores a model “targeted at practical, wide-scale deployment which produces an ongoing breakdown of building energy consumption”.)
Whilst I don’t think Warwick University’s energy monitoring page is based on exposed open data, it does have some dials/gauges on it:
(By the by – that page doesn’t have a Warwick favicon associated with it in my browser; should it?)
So – a quick round up (and huuuuuuge displacement activity from what I should have been doing this afternoon), but I think it’s worth tracking and reporting on these early demonstrations…
See also: JISC Green ICT projects (which I note don’t seem to return popular Google results for queries relating to university energy data, even when searches are limited to the .ac.uk domain…) and the Greening ICT Programme Community Site; Govspark, which compares energy usage across government departments; Innovations in Campus Mapping for a review of how open data is being used to support enhanced interactive campus maps; and Open Data Powered Location Based Services in UK Higher Education.
PS given Lincoln are publishing all manner of open data, I wonder whether there is enough there to do an ad hoc version of the Heat and light by timetable project using data they’re producing just anyway?!
Visualising New York Times Article API Tag Graphs Using d3.js
Picking up on Tinkering with the Guardian Platform API – Tag Signals, here’s a variant of the same thing using the New York Times Article Search API.
As with the Guardian OpenPlatform demo, the idea is to look up recent articles containing a particular search term, find the tag(s) used to describe the articles, and then graph them. The idea behind this approach is to get a quick snapshot of how the search target is represented, or positioned, in this case, by the New York Times.
Here is the code. The main differences compared to the Guardian API gist are as follows:
- a hacked recipe for getting several paged results back; I really need to sort this out properly, just as I need to generalise the code so it will work with either the Guardian or the NYT API, but that’s for another day now…
- the use of NetworkX as a way of representing the undirected tag-tag graph;
- the use of the NetworkX D3 helper library (networkx-d3) to generate a JSON output file that works with the d3.js force directed layout library.
Note that the D3 Python library generates a vanilla force directed layout diagram. In the end, I just grabbed the tiny bit of code that loads the JSON data into a D3 network, and then used it along with the code behind the rather more beautiful network layout used for this Mobile Patent Suits visualisation.
Here’s a snapshot of the result of a search for recent articles on the phrase Gates Foundation:
At this point, it’s probably worth pointing out that the Python script generates the graph file, and then the d3.js library generates the graph visualisation within a web browser. There is no human hand (other than force layout parameter setting) involved in the layout. I guess with the tweaking of a few parameters, maybe juggling the force layout parameters a little more, I could get an even clearer layout. It might also be worth trying to find a way of sizing, or at least colouring, the nodes according to degree (or even better, weighted degree?) I also need to find a way, of possible, of representing the weight of edges if the D3 Python library actually exports this (or if it exports multiple edges between the same two nodes).
Anyway, for an hour or so’s tinkering, it’s quite a neat little recipe to be able to add to my toolkit. Here’s how it works again: Python script calls NYT Article Search API and grabs articles based on a search term. Grab the tags used to describe each article and build up a graph using NetworkX that connects any and all tags that are mentioned on the same article. Dump the graph from its NetworkX representation as a JSON file using the D3 library, then use the D3 Patent Suits layout recipe to render it in the browser :-)
Now all I have to do is find out how I can grab an SVG dump of the network from a browser into a shareable file…
Active Lobbying Through Meetings with UK Government Ministers
In a move that seemed to upset collectors of UK ministerial meeting data, @whoslobbying, on grounds of wasted effort, the Guardian datastore published a spreadsheet last night containing data relating to ministerial meetings between May 2010 and March 2011.
(The first release of the spreadsheet actually omitted the column containing who the meeting was with, but that seems to be fixed now… There are, however, still plenty of character encoding issues (apostrophes, accented characters, some sort of em-dash, etc) that might cripple some plug and play tools.)
Looking over the data, we can use it as the basis for a network diagram with actors (Ministers and lobbiests) with edges representing meetings between Minsiters and lobbiests. There is one slight complication in that where there is a meeting between a Minister and several lobbiests, we ideally need to separate out the separate lobbiests into their own nodes.
This probably provides an ideal opportunity to have a play with the Stanford Data Wrangler and try forcing these separate lobbiests onto separate rows, but I didn’t allow myself much time for the tinkering (and the requisite learning!), so I resorted to Python script to read in the data file and split out the different lobbiests. (I also did an iterative step, cleaning the downloaded CSV file in a text editor by replacing nasty characters that caused the script to choke.) You can find the script here (note that it makes use of the networkx network analysis library, which you’ll need to install if you want to run the script.)
The script generates a directed graph with links from Ministers to lobbiests and dumps it to a GraphML file (available here) that can be loaded directly into Gephi. Here’s a view – using Gephi – of the hearth of the network. If we filter the graph to show nodes that met with at least five different Ministers…
we can get a view into the heart of the UK lobbying netwrok:
I sized the lobbiest nodes according to eigenvector centrality, which gives an indication of well connected they are in the network.
One of the nice things about Gephi is that it allows for interactive exploration of a graph, For example, I can hover over a lobbiest node – Barclays in this case – to see which Ministers were met:
Alternatively, we can see who of the well connected met with the Minister for Welfare Reform:
Looking over the data, we also see how some Ministers are inconsistently referenced within the original dataset:
Note that the layout algorithm is such that the different representations of the same name are likely to meet similar lobbiests, which will end up placing the node in a similar location under the force directed layout I used. Which is to say – we may be able to use visual tools to help us identify fractured representations of the same individual. (Note that multiple meetings between the same parties can be visualised using the thickness of the edges, which are weighted according to the number of times the edge is described in the GraphML file…)
Unifying the different representations of the same indivudal is something that Google Refine could help us tidy up with its various clustering tools, although it would be nice if the Datastore folk addressed this at source (or at least, as part of an ongoing data quality enhancement process…;-)
I guess we could also trying reconciling company names against universal company identifiers, for example by using Google Refine’s reconciliation service and the Open Corporates database? Hmmm, which makes me wonder: do MySociety, or Public Whip, offer an MP or Ministerial position reconciliation service that works with Google Refine?
A couple of things I haven’t done: represented the department (which could be done via a node attribute, maybe, at least for the Ministers); represented actual meetings, and what I guess we might term co-lobbying behaviour, where several organisations are in the same meeting.
Getting My Eye In Around F1 Quali Data – Parallel Coordinate Plots, Sort of…
Looking over the sector times from the qualifying session for tomorrow’s Hungarian Grand Prix, I noticed that Vettel was only fastest in one of the sectors.
Whilst looking for an easy way of shaping an R data frame so that I could plot categorical values sector1, sector2, sector3 on the x-axis, and then a line for each driver showing their time in the sector on the y-axis (I still haven’t worked out how to do that? Any hints? Add them to the comments please…;-), I came across a variant of the parallel coordinate plot hidden away in the lattice package:
What this plot does is for each row (i.e. each driver) take values from separate columns (i.e. times from each sector), normalise them, and then plot lines between the normalised value, one “axis” per column; each row defines a separate category.
The normalisation obviously hides the magnitude of the differences between the time deltas in each sector (the min-max range might be hundredths in one sector, tenths in another), but this plot does show us that there are different groupings of cars – there is clear(?!) white space in the diagram:
Whilst the parallel co-ordinate plot helps identify groupings of cars, and shows where they may have similar performance, it isn’t so good at helping us get an idea of which sector had most impact on the final lap time. For this, I think we need to have a single axis in seconds showing the delta from the fastest time in the sector. That is, we should have a parallel plot where the parallel axes have the same scale, but in terms of sector time, a floating origin (so e.g. the origin for one sector might be 28.6s and for another, 22.4s). For convenience, I’d also like to see the deltas shown on the y-axis, and the categorical ranges arranged on the x-axis (in contrast to the diagrams above, where the different ranges appear along the y-axis).
PS I also wonder to what extent we can identify signatures for the different teams? Eg the fifth and sixth slowest cars in sector 1 have the same signature across all three sectors and come from the same team; and the third and fourth slowest cars in sector 2 have a very similar signature (and again, represent the same team).
Where else might we look for signatures? In the speed traps maybe? Here’s what the parallel plot for the speed traps looks like:
(In this case, max is better = faster speed.)
To combine the views (timings and speed), we might use a formulation of the flavour:
parallel(~data.frame(a$sector1,a$sector2,a$sector3, -a$inter1,-a$inter2,-a$finish,-a$trap))
This is a bit too cluttered to pull much out of though? I wonder if changing the order of parallel axes might help, e.g. by trying to come up with an order than minimises the number of crossed lines?
And if we colour lines by team, can we see any characteristics?
Using a dashed, rather than solid, line makes the chart a little easier to read (more white space). Using a thinking line also helps bring out the colours.
parallel(~data.frame(a$sector1,-a$inter1,-a$inter2,a$sector2,a$sector3, -a$finish,-a$trap),col=a$team,lty=2,lwd=2)
Here’s another ordering of the axes:
Here are the sector times ordered by team (min is better):
Here are the speeds by team (max is better):
Again, we can reorder this to try to make it easier(?!) to pull out team signatures:
(I wonder – would it make sense to try to order these based on similarity eg derived from a circuit guide?)
Hmmm… I need to ponder this…
Surveying the Territory: Open Source, Open-Ed and Open Data Folk on Twitter
Over the last few weeks, I’ve been tinkering with various ways of using the Twitter API to discover Twitter lists relating to a particular topic area, whether discovered through a particular hashtag, search term, a list that already exists on a topic, or one or more people who may be associated with a particular topic area.
On my to do list is a map of the “open” community on Twitter – and the relationships between them – that will try to identify notable folk in different areas of openness (open government, open data, open licensing, open source software) and the communities around them, then aggregate all this open afficionados, plot the network connections between them, and remap the result (to see whether the distinct communities we started with fall out, as well as to discover who acts as the bridges between them, or alternatively discover whether new emergent groupings appear to crystallise out based on network connectivity).
As a step on the road to that, I had a quick peek around found who were tweeting using the #oscon hashtag over the weekend. Through analysing people who were tweeting regularly around the topic, I identified several lists in the area: @realist/opensource, @twongkee/opensource, @lemasney/opensource ,@suncao/open-linked-free, @jasebo/open-source
Pulling down the members of these lists, and then looking for connections between them, I came up with this map of the open source community on Twitter:
Using a different technique not based on lists, I generated a map of the open data community based on the interconnections between people followed by @openlylocal:
and the open education community based on the people that follow @opencontent:
(So that’s a different way of identifying the members of each community, right? One based on lists that mention users of a particular hashtag, one based on folk a particular individual follows, and one based on the folk that follow a particular individual.)
I’ve also toyed with looking at communities defined by members of lists that mention a particular individual, or people followed by a particular individual, as well as ones based on members of lists that contain folk listed on one or more trusted, curated lists in a particular topic area (got that?!;-).
Whilst the graphs based on mapping friends or followers of an individual give a good overview of that individual’s sphere of interest or influence, I think the community graphs derived from finding connections between people mentioned on “lists in the area” is a bit more robust in terms of mapping out communities in general, though I guess I’d need to do “proper” research to demonstrate that?
As mentioned at the start, the next thing on my list is a map across the aggregated “open” communities on Twitter. Of course, being digerati, many of these people will have decamped to GooglePlus. So maybe I shouldn’t bother, but instead wait for Google+ to mature a bit, an API to become available, blah, blah, blah…
Quick Command Line Reports from CSV Data Parsed Out of XML Data Files
It’s amazing how time flies, isn’t it..? Spotting today’s date I realised that there’s only a week left before the closing date of the UK Discovery Developer Competition, which is making available several UK “cultural metadata” datasets from library catalogue and activity data, EDINA OpenUrl resolver data, National Archives images and Engligh Heritage places metadata, as well as ArchivesHub project related Linked Data and Tyne and Wear Museums Collections metadata.
I was intending to have a look at how easy it was to engage with datasets (e.g. by blogging intitial explorations for each dataset along the lines of the text processing tricks I posted around the EDINA data in Postcards from a Text Processing Excursion, Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Visualising OpenURL Referrals Using Gource), but I seem to have left it a bit let considering other work I need to get done this week… (i.e. marking:-(
..except for posting this old bit of code I don’t think I’ve posted before that demonstrates how to use the Python scripting language to parse an XML file, such as the Huddersfield University library MOSAIC activity data, and create a CSV/text file that we can then run simple text processing tools against.
If you download the Huddersfield or Lincoln data (e.g. from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data) and have a look at a few lines from the XML files (e.g. using the Unix command line tool head, as in head -n 150 filename.xml to show the first 150 lines of the file), you will notice records of the form:
<useRecordCollection>
<useRecord>
<from>
<institution>University of Huddersfield</institution>
<academicYear>2008</academicYear>
<extractedOn>
<year>2009</year>
<month>6</month>
<day>4</day>
</extractedOn>
<source>LMS</source>
</from>
<resource>
<media>book</media>
<globalID type="ISBN">1903365430</globalID>
<author>Elizabeth I, Queen of England, 1533-1603.</author>
<title>Elizabeth I : the golden reign of Gloriana /</title>
<localID>585543</localID>
<catalogueURL>http://library.hud.ac.uk/catlink/bib/585543</catalogueURL>
<publisher>National Archives</publisher>
<published>2003</published>
</resource>
<context>
<courseCode type="ucas">LQV0</courseCode>
<courseName>BA(H) Humanities</courseName>
<progression>UG2</progression>
</context>
</useRecord>
</useRecordCollection>
Suppose we want to extract the data showing which courses each resource was borrowed against. That is, for each use record, we want to extract a localID and the courseCode. The following script achieves that:
from lxml import etree
import csv
#Inspired by http://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
#----------------------------------------------------------------------
def parseMOSAIC_Level1_XML(xmlFile,writer):
context = etree.iterparse(xmlFile)
record = {}
# we are going to use record to create a record containing UCAS codes
# record={ucasCode:[],localID:''}
records = []
print 'starting...'
for action, elem in context:
if elem.tag=='useRecord' and action=='end':
#we have parsed the end of a useRecord, so output course data
if 'ucasCode' in record and 'localID' in record:
for cc in record['ucasCode']:
writer.writerow([record['localID'],cc])
record={}
if elem.tag=='localID':
record['localID']=elem.text
elif elem.tag=='courseCode' and 'type' in elem.attrib and elem.attrib['type']=='ucas':
if 'ucasCode' not in record:
record['ucasCode']=[]
record['ucasCode'].append(elem.text)
elif elem.tag=='progression' and elem.text=='staff':
record['staff']='staff'
#return records
writer = csv.writer(open("test.csv", "wb"))
f='mosaic.2008.level1.1265378452.0000001.xml'
s='mosaic.2008.sampledata.xml'
parseMOSAIC_Level1_XML(f,writer)
Usage (if you save the code to the file mosaicXML2csv.py): python mosaicXML2csv.py
Note: this minimal example uses the file specified by f=, in the above case mosaic.2008.level1.1265378452.0000001.xml and writes the CSV out to test.csv
(You can also find the code as a gist on Github: simple Python XML2CSV converter)
Running the script gives data of the form:
185215,L500
231109,L500
180965,W400
181384,W400
180554,W400
201002,W400
...
Note that we might add an additional column for progression. Add in something like:
if elem.tag=='progression': record['progression']=elem.text
and modify the write command to something like writer.writerow([record['localID'],cc,record['progression']])
We can now generate quick reports over the simplified test.csv data file.
For example, how many records did we extract:
wc -l test.csv
If we sort the records (and by so doing, group duplicated rows) [sort test.csv], we can then pull out unique rows and count the number of times they repeat [uniq -c], then sort them by the number of reoccurrences [sort -k 1 -n -r] and pull out the top 20 [head -n 20] using the combined, piped Unix commandline command:
sort test.csv | uniq -c | sort -k 1 -n -r | head -n 20
This gives and output of the form:
186 220759,L500
134 176259,L500
130 176895,L500
showing that resource with localID 220759 was taken out on course L500 186 times.
If we just want to count the number of books taken out on a course as a whole, we can just pull out the coursecode column using the cut command, setting the delimiter to be a comma:
cut -f 2 -d ',' test.csv
Having extracted the course code column, we can sort, find repeat counts, sort again and show the courses with the most borrowings against them:
cut -f 2 -d ',' test.csv | sort | uniq -c | sort -k 1 -n -r | head -n 10
This gives a result of the form:
13476 L500
8799 M931
7499 P301
In other words, we can create very quick and dirty reports over the data using simple commandline tools once we generate a row based, simply delimited text file version of the original XML data report.
Having got the data in a simple test.csv text file, we can also load it directly into the graph plotting tool Gephi, where the two columns (localID and courseCode) are both interpreted as nodes, with an edge going from the localID to the courseCode. (That is, we can treat the two column CSV file as defining a bipartite graph structure.)
Running a clustering statistic and a statistic that allows us to size nodes according to degree, we can generate a view over the data that shows the relative activity against courses:
Here’s another view, using a different layout:
Also note that by choosing an appropriate layout algorithm, the network structure visually identifies courses that are “similar” by virtue of being connected to similar resources. The thickness of the edges is proportional to the number of times a resource was borrowed against a particular course, so we can also visually identify such items at a glance.




































