Using data from the clinicaltrials.gov registry (search for UK clinical trials), I grabbed all records relating to trials that have at least in part run in the UK as an XML file download, then mapped links between project lead sponsors and their collaborators. Here’s a quick sketch of the result:
The XML data schema defines the corresponding fields as follows:
<!-- === Sponsors ==================================================== -->
<xs:complexType name="sponsors_struct">
  <xs:sequence>
    <xs:element name="lead_sponsor" type="sponsor_struct"/>
    <xs:element name="collaborator" type="sponsor_struct" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
Here’s the Python script I used to extract the data and build a graph representation of it (requires networkx and lxml). I then exported the graph as a GEXF file that can be loaded into Gephi, which is what I used to generate the sketch shown above.
import os
from lxml import etree
import networkx as nx
import networkx.readwrite.gexf as gf
from xml.etree.ElementTree import tostring


def flatten(el):
    # Concatenate all the text inside an element, including nested tags
    if el is not None:
        result = [el.text or ""]
        for sel in el:
            result.append(flatten(sel))
            result.append(sel.tail or "")
        return "".join(result)
    return ''


def graphOut(DG):
    writer = gf.GEXFWriter(encoding='utf-8', prettyprint=True, version='1.1draft')
    writer.add_graph(DG)
    # tostring() returns bytes, so write the file in binary mode
    with open('workfile.gexf', 'wb') as f:
        f.write(tostring(writer.xml))


def sponsorGrapher(DG, xmlRoot, sponsorList):
    sponsors_xml = xmlRoot.find('.//sponsors')
    if sponsors_xml is None:
        # No sponsors block in this record, so nothing to add
        return DG, sponsorList
    lead = flatten(sponsors_xml.find('./lead_sponsor/agency'))
    if lead != '':
        # A sponsor's position in sponsorList doubles as its node id
        if lead not in sponsorList:
            sponsorList.append(lead)
            DG.add_node(sponsorList.index(lead), label=lead, name=lead)
        for collab in sponsors_xml.findall('./collaborator/agency'):
            collabname = flatten(collab)
            if collabname != '':
                if collabname not in sponsorList:
                    sponsorList.append(collabname)
                    DG.add_node(sponsorList.index(collabname), label=collabname, name=collabname, Label=collabname)
                # Edge runs from the lead sponsor to each collaborator
                DG.add_edge(sponsorList.index(lead), sponsorList.index(collabname))
    return DG, sponsorList


def parsePage(path, fn, sponsorGraph, sponsorList):
    fnp = os.path.join(path, fn)
    xmldata = etree.parse(fnp)
    xmlRoot = xmldata.getroot()
    sponsorGraph, sponsorList = sponsorGrapher(sponsorGraph, xmlRoot, sponsorList)
    return sponsorGraph, sponsorList


XML_DATA_DIR = './ukClinicalTrialsData'
listing = os.listdir(XML_DATA_DIR)

sponsorDG = nx.DiGraph()
sponsorList = []
for page in listing:
    # splitext() returns a (root, extension) pair
    if os.path.splitext(page)[1] == '.xml':
        sponsorDG, sponsorList = parsePage(XML_DATA_DIR, page, sponsorDG, sponsorList)

graphOut(sponsorDG)
Once the file is loaded into Gephi, you can hover over a node to see which organisations that organisation has partnered with.
One thing the graph doesn’t show directly is the links between co-collaborators – edges simply go from the lead sponsor to each collaborator. It would also be possible to generate a graph that represents pairwise links between every sponsor of a particular trial.
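As a rough sketch of that pairwise idea (not something I ran against the real data), the snippet below counts how often each unordered pair of sponsors appears on the same trial. The function name `pairwise_links` and the sponsor names are made up for illustration; each inner list is assumed to hold the lead sponsor plus any collaborators for one trial.

```python
from collections import Counter
from itertools import combinations


def pairwise_links(trial_sponsors):
    """Count how often each unordered pair of sponsors shares a trial."""
    weights = Counter()
    for sponsors in trial_sponsors:
        # Deduplicate and sort so (A, B) and (B, A) map to the same key
        for a, b in combinations(sorted(set(sponsors)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical sponsor lists, one per trial (lead sponsor plus collaborators)
trials = [
    ["Org A", "Org B", "Org C"],
    ["Org B", "Org A"],
]
links = pairwise_links(trials)
```

Each key in `links` could then be added as an edge to an undirected `nx.Graph()`, with the count stored as an edge weight.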
The XML data download also includes information about the locations of trials (sometimes at the city level, sometimes giving postcode level data). So the next thing I may look at is a map to see where sponsors tend to run trials in the UK, and maybe even see whether different sponsors tend to favour different trial sites…
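A first step towards that might look something like the sketch below, which tallies trial sites by lead sponsor and city. Note that the element paths (`location/facility/address` with `city` and `country` children), the `clinical_study` root, the function name `uk_sites` and the sample record are all assumptions on my part and would need checking against the schema before running this on the real files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up record; the element names here are assumed, not taken from the schema
SAMPLE = """<clinical_study>
  <sponsors><lead_sponsor><agency>Org A</agency></lead_sponsor></sponsors>
  <location><facility><address>
    <city>Leeds</city><country>United Kingdom</country>
  </address></facility></location>
  <location><facility><address>
    <city>Paris</city><country>France</country>
  </address></facility></location>
</clinical_study>"""


def uk_sites(root, country='United Kingdom'):
    """Count (lead sponsor, city) pairs for sites in the given country."""
    lead = root.findtext('.//sponsors/lead_sponsor/agency', default='')
    sites = Counter()
    for address in root.findall('.//location/facility/address'):
        if address.findtext('country') == country:
            sites[(lead, address.findtext('city', default=''))] += 1
    return sites


sites = uk_sites(ET.fromstring(SAMPLE))
```

Summing these counters across all the trial files would give a sponsor by city table that could be geocoded for a map.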
PS these may be handy too – World Health Organisation Clinical Trials Registry portal, EU Clinical Trials Register
PPS looks like we can generate a link to the clinicaltrials.gov download file easily enough. Original URL is:
Download URL is:
So now I wonder:
can Scraperwiki accept a zip file, unzip it, then parse all the resulting files? Answers, with code snippets, via the comments, please:-) DONE – example here: Scraperwiki: clinicaltrials.gov test
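For what it’s worth, here’s a minimal sketch of the unzip-then-parse step in plain Python, independent of Scraperwiki and not the code from the linked example. It assumes the zip file has already been fetched into memory as bytes; the function name `parse_zipped_xml`, the file names and the tiny archive built for the demo are all hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def parse_zipped_xml(zip_bytes):
    """Yield (filename, root element) for each .xml member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith('.xml'):
                with archive.open(name) as member:
                    yield name, ET.parse(member).getroot()


# Build a tiny in-memory archive to stand in for the real download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr(
        'trial1.xml',
        '<clinical_study><sponsors><lead_sponsor>'
        '<agency>Org A</agency>'
        '</lead_sponsor></sponsors></clinical_study>',
    )
    archive.writestr('readme.txt', 'not xml')

# Pull the lead sponsor out of every XML file in the archive
leads = {
    name: root.findtext('.//sponsors/lead_sponsor/agency')
    for name, root in parse_zipped_xml(buffer.getvalue())
}
```

Working from bytes in memory avoids needing a writable directory, which may matter on a hosted scraper service.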