May 2012 – Page 3 – OUseful.Info, the blog…

<xs:complexType name="sponsors_struct">
  <xs:sequence>
    <xs:element name="lead_sponsor" type="sponsor_struct"/>
    <xs:element name="collaborator" type="sponsor_struct" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

import os
from lxml import etree
import networkx as nx
import networkx.readwrite.gexf as gf
from xml.etree.ElementTree import tostring

def flatten(el):
    # Concatenate the text of an element and all of its descendants
    if el is not None:
        result = [ (el.text or "") ]
        for sel in el:
            result.append(flatten(sel))
            result.append(sel.tail or "")
        return "".join(result)
    return ''

def graphOut(DG):
    writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
    writer.add_graph(DG)
    #print(tostring(writer.xml))
    # tostring() returns bytes under Python 3, so write in binary mode
    with open('workfile.gexf', 'wb') as f:
        f.write(tostring(writer.xml))

def sponsorGrapher(DG,xmlRoot,sponsorList):
    sponsors_xml=xmlRoot.find('.//sponsors')
    # Skip records that have no sponsors element
    if sponsors_xml is None:
        return DG, sponsorList
    lead=flatten(sponsors_xml.find('./lead_sponsor/agency'))
    if lead !='':
        if lead not in sponsorList:
            sponsorList.append(lead)
            DG.add_node(sponsorList.index(lead),label=lead,name=lead)
    for collab in sponsors_xml.findall('./collaborator/agency'):
        collabname=flatten(collab)
        if collabname !='':
            if collabname not in sponsorList:
                sponsorList.append(collabname)
                DG.add_node( sponsorList.index( collabname ), label=collabname, name=collabname, Label=collabname )
            # Only add an edge when a lead sponsor was actually found
            if lead !='':
                DG.add_edge( sponsorList.index(lead), sponsorList.index(collabname) )
    return DG, sponsorList

def parsePage(path,fn,sponsorGraph,sponsorList):
    fnp=os.path.join(path,fn)
    xmldata=etree.parse(fnp)
    xmlRoot = xmldata.getroot()
    sponsorGraph,sponsorList = sponsorGrapher(sponsorGraph,xmlRoot,sponsorList)
    return sponsorGraph,sponsorList

XML_DATA_DIR='./ukClinicalTrialsData'
listing = os.listdir(XML_DATA_DIR)

sponsorDG=nx.DiGraph()
sponsorList=[]
# Build up the graph one XML record at a time
for page in listing:
    if os.path.splitext( page )[1] =='.xml':
        sponsorDG, sponsorList = parsePage(XML_DATA_DIR,page, sponsorDG, sponsorList)

graphOut(sponsorDG)

How can we represent conversations between a small sample of users, such as the email or SMS conversations between James Murdoch’s political lobbyist and a Government minister’s special adviser (Leveson inquiry evidence), or the pattern of retweet activity around a couple of heavily retweeted individuals using a particular hashtag?

I spent a bit of time on-and-off today mulling over ways of representing this sort of interaction, in search of something like a UML call sequence diagram but not, and here’s what I came up with in the context of the retweet activity:

The chart looks a bit complicated at first, but there’s a lot of information in there. The tweets come from a Twitter search around a particular hashtag. The small grey dots on their own are tweets that aren’t identified as RTs (that is, they don’t start with a pattern something like RT @[^:]*:). The x-axis represents the time a tweet was sent and the y-axis who sent it. Paired dots connected by a vertical line segment show two people, one of whom (light grey point) retweeted the other (dark grey point). RTs of two notable individuals are highlighted using different colours. The small black dots mark original tweets sent by the two highlighted individuals. Whilst we can’t tell which tweet was retweeted, we may get an idea of how the pattern of RT behaviour around the individuals of interest plays out relative to when they actually tweeted.
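As a rough illustration of the RT test mentioned above, here is a minimal Python sketch (Python being the language of the other script on this page). The helper name `retweet_of` is made up for this example, and the pattern is only the approximate one quoted in the text, so RTs that carry leading commentary would not be picked up.

```python
import re

# Approximate pattern from the text: an RT starts with "RT @name:"
RT_PATTERN = re.compile(r'^RT @([^:]*):')

def retweet_of(text):
    """Return the retweeted screen name, or None if the tweet is not an RT.

    Hypothetical helper; it only looks at the start of the tweet text.
    """
    match = RT_PATTERN.match(text)
    if match is None:
        return None
    return match.group(1).strip()
```

Applied to each tweet, the returned value would play the role of the rtof column used in the chart code, with None standing in for NA.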

Here’s the R-code used to build up the chart. Note that the order in which the layers are constructed is important (for example, we need the small black dots to be in the top layer).

##RT chart, constructed in R using ggplot2
require(ggplot2)
#the base data set - exclude tweets that aren't RTs
g = ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))
#Add in vertical grey lines connecting who RT'd whom
g = g + geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')
#Use a more complete dataset to mark *all* tweets with a lightgrey point
g = g + geom_point(data=(tw.df),aes(x=created,y=screenName),colour='lightgrey')
#Use points at either end of the RT line segment to distinguish who RTd whom
g = g + geom_point(aes(x=created,y=screenName),colour='lightgrey') + geom_point(aes(x=created,y=rtof),colour='grey') + theme(axis.text.y=element_text(size=5))
#We're going to highlight RTs of two particular individuals
#Define a couple of functions to subset the data
subdata.rtof=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & rtof==u)))
subdata.user=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & screenName==u)))
#Grab user 1
s1='gsiemens'
ss1=subdata.rtof(s1)
ss1x=subdata.user(s1)
sc1='aquamarine3'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss1,aes(x=created,ymin=screenName,ymax=rtof),colour=sc1)
#Grab user 2
s2='busynessgirl'
ss2=subdata.rtof(s2)
ss2x=subdata.user(s2)
sc2='orange'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss2,aes(x=created,ymin=screenName,ymax=rtof),colour=sc2)
#Now we add another layer to colour the nodes associated with RTs of the two selected users
g = g + geom_point(data=ss1,aes(x=created,y=rtof),colour=sc1) + geom_point(data=ss1,aes(x=created,y=screenName),colour=sc1)
g = g + geom_point(data=ss2,aes(x=created,y=rtof),colour=sc2) + geom_point(data=ss2,aes(x=created,y=screenName),colour=sc2)
#Finally, add a highlight to mark when the RTd folk we are highlighting actually tweet
g = g + geom_point(data=(ss1x),aes(x=created,y=screenName),colour='black',size=1)
g = g + geom_point(data=(ss2x),aes(x=created,y=screenName),colour='black',size=1)
#Print the chart
print(g)

One thing I’m not sure about is the order of names on the y-axis. That said, one advantage of using the conversational, exploratory data visualisation approach that I favour is that if you let your eyes try to seek out patterns, you may be able to pick up clues for some sort of model around the data that really emphasises those patterns. So for example, looking at the chart, I wonder if there would be any merit in organising the y-axis so that folk who RTd orange but not aquamarine were in the bottom third of the chart, folk who RTd aqua but not orange were in the top third of the chart, folk who RTd orange and aqua were between the two users of interest, and folk who RTd neither orange nor aqua were arranged closer to the edges, with folk who RTd each other close to each other (invoking an ink minimisation principle)?
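One way the banding idea might be sketched, in Python rather than R, is shown here. The function name `order_users` and its arguments are assumptions made for illustration, and the sketch only handles the bands themselves; it makes no attempt to place folk who RTd each other next to each other.

```python
def order_users(rt_pairs, all_users, s1, s2):
    """Return screen names ordered from the bottom of the y-axis to the top.

    rt_pairs: iterable of (retweeter, retweeted) screen-name pairs.
    all_users: iterable of every screen name shown on the chart.
    s1, s2: the two highlighted users (s1 nearer the top, s2 nearer the bottom).
    """
    rt_pairs = list(rt_pairs)
    rt_s1 = {who for who, of in rt_pairs if of == s1}
    rt_s2 = {who for who, of in rt_pairs if of == s2}
    others = set(all_users) - {s1, s2}

    only_s1 = sorted((rt_s1 - rt_s2) & others)
    only_s2 = sorted((rt_s2 - rt_s1) & others)
    both = sorted(rt_s1 & rt_s2 & others)
    neither = sorted(others - rt_s1 - rt_s2)

    # Split the "neither" group between the two outer edges of the chart
    half = len(neither) // 2
    return (neither[:half] + only_s2 + [s2] + both + [s1]
            + only_s1 + neither[half:])
```

The resulting list could then be tried as the limits of the discrete y scale in the chart, to see whether the banding makes the RT patterns any easier to read.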

Something else that it would be nice to do would be to use the time an original tweet was sent as the x-axis value for the tweet marker for the original sender of a tweet that is RTd. We would then get a visual indication of how quickly a tweet was RTd.
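A minimal sketch of how that might be approached, assuming (and it is only an assumption) that an RT can be paired with its original by comparing the text after the "RT @name:" prefix with earlier tweets from the retweeted user. The function name, the dict layout and the sample tweets are all made up for illustration.

```python
from datetime import datetime

def rt_delays(tweets):
    """Pair each RT with the most recent matching original and time the gap.

    tweets: list of dicts with 'screenName', 'text', 'created' (datetime)
    and 'rtof' (retweeted screen name, or None for tweets that are not RTs).
    Returns a list of (retweeter, original_created, delay) tuples.
    """
    results = []
    for rt in tweets:
        if rt['rtof'] is None:
            continue
        # Drop the "RT @name:" prefix and any trailing truncation marks
        body = rt['text'].split(':', 1)[-1].strip().rstrip('.…').strip()
        if not body:
            continue
        candidates = [t for t in tweets
                      if t['screenName'] == rt['rtof']
                      and t['rtof'] is None
                      and t['created'] <= rt['created']
                      and t['text'].startswith(body)]
        if candidates:
            original = max(candidates, key=lambda t: t['created'])
            results.append((rt['screenName'], original['created'],
                            rt['created'] - original['created']))
    return results

# Made-up sample data, for illustration only
sample = [
    {'screenName': 'alice', 'text': 'an example tweet #tag',
     'created': datetime(2012, 5, 1, 10, 0), 'rtof': None},
    {'screenName': 'bob', 'text': 'RT @alice: an example tweet #tag',
     'created': datetime(2012, 5, 1, 10, 5), 'rtof': 'alice'},
    {'screenName': 'carol', 'text': 'RT @alice: text with no match',
     'created': datetime(2012, 5, 1, 10, 6), 'rtof': 'alice'},
]
delays = rt_delays(sample)
```

RTs with no matching original in the sample are simply skipped, so they would still need the RT time as a fallback x-axis value.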

PS I also created a script that generated a wealth of other charts around the lak12 hashtag [PDF]. The code used to generate the report can be found as the file exampleSearchReport.Rnw in this code repository.

Sketching Sponsor Partners Running UK Clinical Trials

Doodling With a Conversation, or Retweet, Data Sketch Around LAK12