Sketching Sponsor Partners Running UK Clinical Trials

Using data from the clinicaltrials.gov registry (search for UK clinical trials), I grabbed all records relating to trials that have at least in part run in the UK as an XML file download, then mapped links between project lead sponsors and their collaborators. Here’s a quick sketch of the result:

ukClinicalTrialPartners (PDF)

The XML data schema defines the corresponding fields as follows:

<!-- === Sponsors ==================================================== -->

  <xs:complexType name="sponsors_struct">
    <xs:sequence>
      <xs:element name="lead_sponsor" type="sponsor_struct"/>
      <xs:element name="collaborator" type="sponsor_struct" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

Here’s the Python script I used to extract the data and generate the graph representation of it (requires networkx), which I then exported as a GEXF file that could be loaded into Gephi and used to generate the sketch shown above.

import os
from lxml import etree
import networkx as nx


def flatten(el):
	# Concatenate all the text inside an element, including child text and tails
	if el is not None:
		result = [ (el.text or "") ]
		for sel in el:
			result.append(flatten(sel))
			result.append(sel.tail or "")
		return "".join(result)
	return ''

def graphOut(DG):
	# Write the graph out as a GEXF file that can be loaded into Gephi
	nx.write_gexf(DG, 'workfile.gexf', encoding='utf-8', prettyprint=True, version='1.1draft')

def sponsorGrapher(DG,xmlRoot,sponsorList):
	sponsors_xml=xmlRoot.find('.//sponsors')
	# Skip records that have no sponsors block at all
	if sponsors_xml is None:
		return DG, sponsorList
	lead=flatten(sponsors_xml.find('./lead_sponsor/agency'))
	# Without a named lead sponsor there is nothing to draw edges from
	if lead =='':
		return DG, sponsorList
	if lead not in sponsorList:
		sponsorList.append(lead)
		DG.add_node(sponsorList.index(lead),label=lead,name=lead)

	# Add an edge from the lead sponsor to each named collaborator
	for collab in sponsors_xml.findall('./collaborator/agency'):
		collabname=flatten(collab)
		if collabname !='':
			if collabname not in sponsorList:
				sponsorList.append(collabname)
				DG.add_node( sponsorList.index( collabname ), label=collabname, name=collabname )
			DG.add_edge( sponsorList.index(lead), sponsorList.index(collabname) )
	return DG, sponsorList

def parsePage(path,fn,sponsorGraph,sponsorList):
	fnp=os.path.join(path,fn)
	xmldata=etree.parse(fnp)
	xmlRoot = xmldata.getroot()
	sponsorGraph,sponsorList = sponsorGrapher(sponsorGraph,xmlRoot,sponsorList)
	return sponsorGraph,sponsorList

XML_DATA_DIR='./ukClinicalTrialsData'
listing = os.listdir(XML_DATA_DIR)

sponsorDG=nx.DiGraph()
sponsorList=[]
for page in listing:
	if os.path.splitext( page )[1] =='.xml':
		sponsorDG, sponsorList = parsePage(XML_DATA_DIR,page, sponsorDG, sponsorList)

graphOut(sponsorDG)

Once the file is loaded into Gephi, you can hover over nodes to see which organisations partnered with which others.

One thing the graph doesn’t show directly is the links between co-collaborators: edges simply go from the lead sponsor to each collaborator. It would also be possible to generate a graph that represents pairwise links between every sponsor of a particular trial.
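As a rough sketch of that pairwise idea (not something I ran against the real data), the sponsors on a single trial could be turned into every unordered pair along the following lines. The sample record, the sponsor names in it and the `sponsor_pairs` function are made up for illustration, and the sketch uses the standard library's ElementTree rather than lxml so that it runs on its own:

```python
from itertools import combinations
import xml.etree.ElementTree as ET

# Hypothetical record, trimmed down to the sponsor fields used in the script above
SAMPLE = """<clinical_study>
  <sponsors>
    <lead_sponsor><agency>Sponsor A</agency></lead_sponsor>
    <collaborator><agency>Sponsor B</agency></collaborator>
    <collaborator><agency>Sponsor C</agency></collaborator>
  </sponsors>
</clinical_study>"""

def sponsor_pairs(xml_root):
    """Return every unordered pair of sponsors (lead or collaborator) on one trial."""
    sponsors = xml_root.find('.//sponsors')
    if sponsors is None:
        return []
    agencies = (sponsors.findall('./lead_sponsor/agency')
                + sponsors.findall('./collaborator/agency'))
    names = []
    for agency in agencies:
        name = ''.join(agency.itertext()).strip()
        # Ignore blank names and avoid listing the same sponsor twice
        if name and name not in names:
            names.append(name)
    return list(combinations(names, 2))

pairs = sponsor_pairs(ET.fromstring(SAMPLE))
print(pairs)
```

Each pair could then be added as an edge to an undirected `nx.Graph()` (for example with `add_edges_from(pairs)`) rather than to the directed graph used above, since a co-sponsorship link has no obvious direction.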

The XML data download also includes information about the locations of trials (sometimes at the city level, sometimes giving postcode level data). So the next thing I may look at is a map to see where sponsors tend to run trials in the UK, and maybe even see whether different sponsors tend to favour different trial sites…

Further down the line, I wonder whether any linkage can be made across to things like GP practice level prescribing behaviour, or grant awards from the MRC?

PS these may be handy too – World Health Organisation Clinical Trials Registry portal, EU Clinical Trials Register

PPS looks like we can generate a link to the clinicaltrials.gov download file easily enough. Original URL is:
http://clinicaltrials.gov/ct2/results?cntry1=EU%3AGB&show_flds=Y&show_down=Y#down
Download URL is:
http://clinicaltrials.gov/ct2/results/download?down_stds=all&down_typ=study&down_flds=shown&down_fmt=plain&cntry1=EU%3AGB&show_flds=Y&show_down=Y
So now I wonder: can Scraperwiki accept a zip file, unzip it, then parse all the resulting files? Answers, with code snippets, via the comments, please:-) DONE – example here: Scraperwiki: clinicaltrials.gov test
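
For what it's worth, here's a minimal sketch of the unzip-then-parse step in plain Python, independent of Scraperwiki and not the code from the linked example. It assumes the zip archive has already been fetched into memory as bytes (for instance by reading the download URL above); the `parse_zipped_xml` function and the file names in the stand-in archive are hypothetical:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def parse_zipped_xml(zip_bytes):
    """Parse every .xml member of an in-memory zip archive; return {filename: root element}."""
    roots = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.lower().endswith('.xml'):
                roots[member] = ET.fromstring(archive.read(member))
    return roots

# Build a tiny stand-in archive so the sketch runs without a network connection
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('NCT00000001.xml',
                     '<clinical_study><sponsors><lead_sponsor><agency>Sponsor A</agency>'
                     '</lead_sponsor></sponsors></clinical_study>')
    archive.writestr('readme.txt', 'not XML')

roots = parse_zipped_xml(buffer.getvalue())
print(sorted(roots))
```

Each root element returned this way could then be handed to something like the `sponsorGrapher` function in place of the per-file parse.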

Doodling With a Conversation, or Retweet, Data Sketch Around LAK12

How can we represent conversations between a small sample of users, such as the email or SMS conversations between James Murdoch’s political lobbyist and a Government minister’s special adviser (Leveson inquiry evidence), or the pattern of retweet activity around a couple of heavily retweeted individuals using a particular hashtag?

I spent a bit of time on-and-off today mulling over ways of representing this sort of interaction, in search of something like a UML call sequence diagram but not, and here’s what I came up with in the context of the retweet activity:

The chart looks a bit complicated at first, but there’s a lot of information in there. The tweets come from a Twitter search around a particular hashtag. The x-axis represents the time a tweet was sent and the y-axis who sent it. The small grey dots on their own are tweets that aren’t identified as RTs (that is, they don’t start with a pattern something like RT @[^:]*:). Paired dots connected by a vertical line segment show two people, one of whom (light grey point) retweeted the other (dark grey point). RTs of two notable individuals are highlighted using different colours. The small black dots mark original tweets sent by those two highlighted individuals. Whilst we can’t tell which tweet was retweeted, we may get an idea of how the pattern of RT behaviour around the individuals of interest plays out relative to when they actually tweeted.
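
As a rough illustration of that RT test, here's a small Python sketch (the chart itself was built in R). The exact pattern, the `rt_of` function name and the example tweets are assumptions for the purposes of the sketch; the idea is simply to pull out the name of the person being retweeted, or nothing if the tweet doesn't look like an RT:

```python
import re

# Assumed pattern, anchored at the start of the tweet: "RT @name:"
RT_PATTERN = re.compile(r'^RT @([^:]*):')

def rt_of(text):
    """Return the screen name being retweeted, or None if the tweet is not an RT."""
    match = RT_PATTERN.match(text)
    return match.group(1).strip() if match else None

# Hypothetical example tweets
print(rt_of('RT @gsiemens: Looking forward to #lak12'))  # gsiemens
print(rt_of('Looking forward to #lak12'))                # None
```

Tweets that mention an RT part way through the text (for example a comment followed by "RT @name: ...") are treated as original tweets under this pattern.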

Here’s the R-code used to build up the chart. Note that the order in which the layers are constructed is important (for example, we need the small black dots to be in the top layer).

##RT chart, constructed in R using ggplot2
require(ggplot2)
#the base data set - exclude tweets that aren't RTs
g = ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))
#Add in vertical grey lines connecting who RT'd whom
g = g + geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')
#Use a more complete dataset to mark *all* tweets with a lightgrey point
g = g + geom_point(data=(tw.df),aes(x=created,y=screenName),colour='lightgrey')
#Use points at either end of the RT line segment to distinguish who RTd whom
#(opts() and theme_text() have been replaced by theme() and element_text() in current ggplot2)
g = g + geom_point(aes(x=created,y=screenName),colour='lightgrey') + geom_point(aes(x=created,y=rtof),colour='grey') + theme(axis.text.y=element_text(size=5))
#We're going to highlight RTs of two particular individuals
#Define a couple of functions to subset the data
subdata.rtof=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & rtof==u)))
#Original (non-RT) tweets sent by user u
subdata.user=function(u) return(subset(tw.df.rt,subset=(is.na(rtof) & screenName==u)))
#Grab user 1
s1='gsiemens'
ss1=subdata.rtof(s1)
ss1x=subdata.user(s1)
sc1='aquamarine3'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss1,aes(x=created,ymin=screenName,ymax=rtof),colour=sc1)
#Grab user 2
s2='busynessgirl'
ss2=subdata.rtof(s2)
ss2x=subdata.user(s2)
sc2='orange'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss2,aes(x=created,ymin=screenName,ymax=rtof),colour=sc2)
#Now we add another layer to colour the nodes associated with RTs of the two selected users
g = g + geom_point(data=ss1,aes(x=created,y=rtof),colour=sc1) + geom_point(data=ss1,aes(x=created,y=screenName),colour=sc1)
g = g + geom_point(data=ss2,aes(x=created,y=rtof),colour=sc2) + geom_point(data=ss2,aes(x=created,y=screenName),colour=sc2)
#Finally, add a highlight to mark when the RTd folk we are highlighting actually tweet
g = g + geom_point(data=(ss1x),aes(x=created,y=screenName),colour='black',size=1)
g = g + geom_point(data=(ss2x),aes(x=created,y=screenName),colour='black',size=1)
#Print the chart
print(g)

One thing I’m not sure about is the order of names on the y-axis. That said, one advantage of the conversational, exploratory data visualisation approach that I favour is that if you let your eyes try to seek out patterns, you may be able to pick up clues for some sort of model around the data that really emphasises those patterns. So for example, looking at the chart, I wonder if there would be any merit in organising the y-axis so that folk who RTd orange but not aquamarine were in the bottom third of the chart, folk who RTd aqua but not orange were in the top third of the chart, folk who RTd orange and aqua were between the two users of interest, and folk who RTd neither orange nor aqua were arranged closer to the edges, with folk who RTd each other close to each other (invoking an ink minimisation principle)?
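
A first stab at that sort of banding might look something like the following Python sketch. It only handles the coarse grouping (neither, orange only, both, aqua only) and makes no attempt at the ink minimisation part; for simplicity it puts all the "neither" folk at the bottom edge. The `band_order` function and the names ann, bob, cat and dan are made up for illustration:

```python
def band_order(rt_pairs, aqua_user, orange_user):
    """Return screen names ordered bottom-to-top by which highlighted user they retweeted."""
    # Map each user seen in an RT pair to the set of people they retweeted
    retweeted_by = {}
    for retweeter, original in rt_pairs:
        retweeted_by.setdefault(retweeter, set()).add(original)
        retweeted_by.setdefault(original, set())

    bands = {'neither': [], 'orange_only': [], 'both': [], 'aqua_only': []}
    for user in sorted(retweeted_by):
        if user in (aqua_user, orange_user):
            continue
        rt_aqua = aqua_user in retweeted_by[user]
        rt_orange = orange_user in retweeted_by[user]
        if rt_aqua and rt_orange:
            bands['both'].append(user)
        elif rt_aqua:
            bands['aqua_only'].append(user)
        elif rt_orange:
            bands['orange_only'].append(user)
        else:
            bands['neither'].append(user)

    # Bottom of the chart first, top of the chart last
    return (bands['neither'] + bands['orange_only'] + [orange_user]
            + bands['both'] + [aqua_user] + bands['aqua_only'])

# Hypothetical (retweeter, retweeted) pairs
rt_pairs = [('ann', 'gsiemens'), ('bob', 'busynessgirl'),
            ('cat', 'gsiemens'), ('cat', 'busynessgirl'), ('dan', 'ann')]
order = band_order(rt_pairs, aqua_user='gsiemens', orange_user='busynessgirl')
print(order)
```

Only users who appear in at least one RT pair get a position here. An ordering like this could presumably be applied in R by setting the factor levels of the screenName column before plotting.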

Something else that it would be nice to do would be to use the time an original tweet was sent as the x-axis value for the tweet marker for the original sender of a tweet that is RTd. We would then get a visual indication of how quickly a tweet was RTd.
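
Getting at the original send time means matching each RT back to the tweet it came from, which the search data doesn't give us directly. One guess at how that might be done, sketched in Python: strip the "RT @name:" prefix and look for the most recent earlier tweet by that person whose text starts with what is left (RTs are often truncated). The matching rule, the `rt_delays` function and the example tweets are all assumptions, and the field names simply mirror the columns used in the R code:

```python
from datetime import datetime

def rt_delays(tweets):
    """Return (retweeter, original sender, delay) for each RT we can match to an original."""
    delays = []
    for rt in tweets:
        if rt['rtof'] is None:
            continue
        prefix = 'RT @%s:' % rt['rtof']
        if not rt['text'].startswith(prefix):
            continue
        body = rt['text'][len(prefix):].strip()
        if not body:
            continue
        # Candidate originals: earlier non-RT tweets by the retweeted user with matching text
        candidates = [t for t in tweets
                      if t['screenName'] == rt['rtof'] and t['rtof'] is None
                      and t['created'] <= rt['created']
                      and t['text'].startswith(body)]
        if candidates:
            original = max(candidates, key=lambda t: t['created'])
            delays.append((rt['screenName'], rt['rtof'], rt['created'] - original['created']))
    return delays

# Hypothetical tweets
tweets = [
    {'screenName': 'gsiemens', 'created': datetime(2012, 5, 1, 10, 0),
     'text': 'Keynote slides are up #lak12', 'rtof': None},
    {'screenName': 'ann', 'created': datetime(2012, 5, 1, 10, 7),
     'text': 'RT @gsiemens: Keynote slides are up #lak12', 'rtof': 'gsiemens'},
]
delays = rt_delays(tweets)
print(delays)
```

RTs whose original tweet falls outside the search results simply go unmatched, so they would still need plotting at the RT time.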

PS I also created a script that generated a wealth of other charts around the lak12 hashtag [PDF]. The code used to generate the report can be found as the file exampleSearchReport.Rnw in this code repository.