Archive for the ‘Visualisation’ Category
Whilst looking at the apparently conflicting results from a couple of recent polls by YouGov on press regulation (reviewed in a piece by me over on OpenLearn: Two can play at that game: When polls collide in support of a package on the OU/BBC co-produced Radio 4 programme, More Or Less), my eye was also drawn to the different ways in which the survey results were presented graphically.
The polls were commissioned by The Sun newspaper on the one hand, and the Media Standards Trust/Hacked Off on the other. If you look at the poll data (The Sun/YouGov [PDF] and Media Standards Trust/YouGov [PDF] respectively), you’ll see that it’s reported in a standard format. (I couldn’t find actual data releases, but the survey reports look as if they are generated in a templated way, so getting the core of a generic scraper together for them shouldn’t be too difficult…) But how was that represented to readers (text based headlines and commentary aside)?
Here are a couple of grabs from the Sun’s story (State-run watchdog ‘will gag free press’):
Pie-charts in 3D, with a tilt… gorgeous… erm, not… And the colour choice for the bar chart inner-column text is a bit low on contrast compared to the background, isn’t it?
It looks a bit like the writer took a photo of the print edition of the story on their phone, uploaded it and popped it into the story, doesn’t it?
I guess credit should be given for keeping the risk responses separate in the second image, when they could have just gone for the headline figures as pulled out in the YouGov report:
So what I’m wondering now is the extent to which a chart’s “theme” or style reflects the authority or formal weight we might ascribe to it, in much the same way as different fonts carry different associations? Anyone remember the slating that CERN got for using Comic Sans in their Higgs-Boson discovery announcement (eg here, here or here)?
Things could hardly have been more critical if they had used CrappyGraphs or an XKCD style chart generator (as for example described in Style your R charts like the Economist, Tableau … or XKCD ; or alternatively, XKCD-style for matplotlib).
Oh, hang on a minute, it almost looks like they did!
Anyway – back to the polls. The Media Standards Trust reported on their poll using charts that had a more formal look about them:
The chart annotations are also rather clearer to read.
PS with respect to the Sun’s copyright/syndication notice, and my use of the images above:
I haven’t approached the copyright holders seeking permission to reproduce the charts here, but I would argue that this piece is just working up to being research into the way numerical data is reported, as well as hinting at criticism and review. So there…
PPS As far as bad charts go, misrepresentations and underhand attempts at persuasion, graphic style, are also possible, as SimplyStatistics describes: “The statisticians at Fox News use classic and novel graphical techniques to lead with data”. See also: OpenLearn – Cheating with Charts.
This is another of those confluence style posts, where a handful of things I’ve read in quick succession seem to phase lock in my mind:
(brought to mind in part via @downes a week or so ago: How to Synch 32 Metronomes)
The first was a post by Alan Levine on Making Text Work, which describes a simple technique for making text overlays on photographs create a more coherent image than just slapping text on:
One of the techniques I use on presentation slides from time to time is a solid filled banner or stripe containing text, sometimes opaque, sometimes transparent. I wouldn’t claim to be much of an artist but it makes the slides slightly more interesting than a header with an image in the middle of the slide, surrounded by whitespace…
(Which reminds me: maybe I should look through Presentation Zen Design and Presentation Zen: Simple Ideas on Presentation Design and Delivery again…)
Reading Alan’s post, it occurred to me that once you get the idea of using a solid or semi-transparent filled background to a text label, you tend to remember it and add it to your toolbox of presentation ideas (of course, you might also forget and later rediscover this sort of trick…). My own slides tend to follow particular design ideas I’ve recently picked up on and decided I want to explore, albeit in a crude and often not very well polished way. In the slide above, several tricks are evident: the solid filled text label, the positioning of it, the backgrounded blog post that actually serves as a reference for the slide (you can do a web search for the post title to learn more about the topic), the foregrounded image, rotated slightly, and so on.
The thing that struck me about Alan’s post was that it reminded me of a time before I was really aware of using a solid filled label to foreground a piece of text, which in turn caused me to reflect on other things I now take for granted as ideas that I can draw on and combine with other ideas.
In the same way, we learn to spell, learn to use punctuation, and start to pick up on the grammar that structures a language, so that we can use those rules to construct ever more complex sentences. Once we know the rules, it becomes a choice as to whether or not we employ them.
Here’s an example of how we might come to acquire a new design idea, drawn from a brief conversation with @mediaczar a couple of nights ago when Mat asked if anyone knew the name of this sort of chart combination:
I didn’t know what to call the chart, but thought it should be easy enough to try to wrangle one together using ggplot in R, guessing that a geom_errorbarh() might work; Mat came back suggesting geom_crossbar().
Here’s a minimal code fragment I used to explore it:
#plot a horizontal bar across a bar chart
a=data.frame(x=10,y=43)
b=stack(a)
c=data.frame(d=c('x','y'),f=c(3,22))
library(ggplot2)
g=ggplot(b)+geom_bar(aes(x=ind,y=values),stat='identity')
g=g+geom_crossbar(data = c, aes(x=d,y=f,ymax=f+1,ymin=f-1), colour = NA, fill = "red", width = 0.9, alpha = 0.5)
print(g)
Here’s an example of how I used it – in an as yet unlabelled sketch showing, for a particular F1 driver, their grid position for each race (red bar) and the number of places they gained (or lost) during the first lap:
So now I know how to achieve that effect…
Two more things… Just after reading Alan’s post, I read a post by James Allen on possible race strategies for the Japanese Grand Prix:
The first thing that struck me was that even if you vaguely understand how a race chart works, the following statement may not be readily obvious to you from the top chart (my emphasis):
Three stops is actually faster [taking new softs on lap 12], as the [upper] graph … shows, but it requires the driver to pass the two stoppers in the final stint. If there is a safety car, it will hand an advantage to the two stoppers.
So, can you see why the three stopper (the green line) “requires the driver to pass the two stoppers in the final stint”? Let’s step back for a moment – can you see which bit of the graph represents an overtake?
This is actually quite a complex graph to read – the axes are non-obvious, and not that easy to describe, though you soon pick up a feeling for how the chart works (honest!). Getting a sensible interpretation working for the surprising feature – the sharp vertical drops – is one way of getting into this chart, as well as looking at how the lines are positioned at the extreme x values (that is, at the end of the first and last laps).
The second thing that occurred to me was that we could actually remove the fragment of the line that shows the pitstop and instead show a separate line segment for each stint for each driver, and hence get rid of the line crossings that do not represent required overtakes. I’ve used this technique before, for example to show the separate stints on a chart of laptimes for a particular driver over the course of a race:
And as to where I got that trick? I think it was a bastardisation of a cycle plot, which can be used to show monthly, weekly or seasonal trends over a series of years:
…but it could equally have been a Stephen Few highlighted trick of disconnecting a timeseries line at the crossing point of one month to the next…
Whatever the case, one of the ideas I always have in mind is whether it may be possible to introduce white space in the form of a break in a line in order to separate out different groups of data in a very simple way.
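As a rough sketch of the "break the line at the pit stop" idea, the laps can be grouped into stints before plotting, so that each stint is drawn as its own line segment. The function name, the input format (a list of lap times plus a set of pit laps) and the sample times are all assumptions made for illustration.

```python
def split_into_stints(lap_times, pit_laps):
    """Split lap times into one list of (lap_number, time) pairs per stint.

    lap_times: lap times in seconds, where index 0 is lap 1.
    pit_laps: lap numbers on which the driver pitted (assumed input format).
    """
    stints, current = [], []
    for lap_number, lap_time in enumerate(lap_times, start=1):
        current.append((lap_number, lap_time))
        if lap_number in pit_laps:
            # Close the stint at the pit stop so no line joins it to the next one
            stints.append(current)
            current = []
    if current:
        stints.append(current)
    return stints


# Made-up lap times with a single stop on lap 4
laps = [95.2, 94.8, 94.9, 112.3, 93.9, 93.7, 93.8]
stints = split_into_stints(laps, pit_laps={4})
for number, stint in enumerate(stints, start=1):
    print(number, stint)
```

Plotting each returned list as a separate series leaves the white space between stints described above.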
Over the summer, an episode of one of my favourite audio/radio programmes, the OU co-produced Radio 4 programme More or Less included a package on high frequency trading. To illustrate how fast high frequency trading works, the programme used a beautiful bit of sonification (the audio equivalent of a graphical data visualisation). You can listen to it on iPlayer here: How fast is high-frequency trading?
Just in case it’s blocked outside the UK, here’s a version I cropped from the downloaded podcast myself [MP3]:
Tim Harford, presenter of the programme, also wrote about high frequency trading here: High-frequency trading and the $440m mistake. Interestingly, the article also includes the audio package… Here’s a link to the original programme on iPlayer: How to lose money, fast
A couple of weeks prior to the More or Less programme (coincidence? Or inspiration?) a blog post about a data sketch done by NYT’s incredibly creative Amanda Cox referred to a similar audio technique to illustrate(?!) close finishes in sprint races: Why Amanda Cox should be in charge of audio. The post also referred back to a New York Times piece from February 2012 capturing just how closely some of the 2010 Winter Olympics race finished: Fractions of a Second: An Olympic Musical.
So now I’m wondering – have you ever seen, erm, heard a presentation that has used audio, rather than graphics, to illustrate a data story?
Part of the promise of sports data journalism is the ability to use data from an event to enrich the reporting of that event. One of the graphical devices widely used in motor racing is the lap chart, which shows the relative positions of each car at the end of each lap:
Another, more complex chart, and one that can be quite hard to read when you first come across it, is the race history chart, which shows the laptime of each car relative to the average laptime (calculated over the whole of the race) of the race winner:
Both of these charts can be used to illustrate the progression of a race, and even in some cases to identify stories that might otherwise have been missed (particularly races amongst back markers, for example). For Olympics events particularly, where reporting is often at a local level (national and local press reporting on the progression of their athletes, as well as the winning athletes), timing data may be one of the few sources available for finding out what actually happened to a particular competitor who didn’t feature in coverage that typically focusses on the head of the race.
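One common way of calculating the race history chart values, as I understand it, is to compare each car's elapsed time at the end of lap n with n times the winner's average lap time. The sketch below assumes that formulation; the function name, data layout and lap times are made up for illustration.

```python
from itertools import accumulate


def race_history(lap_times, winner):
    """For each driver and lap n, return n * (winner's mean lap time) minus
    the driver's elapsed time at the end of lap n.

    lap_times: dict mapping driver -> list of lap times in seconds (assumed layout).
    """
    winner_laps = lap_times[winner]
    winner_mean = sum(winner_laps) / len(winner_laps)
    history = {}
    for driver, laps in lap_times.items():
        elapsed = accumulate(laps)  # running total of race time
        history[driver] = [winner_mean * n - t for n, t in enumerate(elapsed, start=1)]
    return history


# Made-up lap times for two drivers over three laps
lap_times = {"A": [100.0, 98.0, 102.0], "B": [101.0, 99.0, 103.0]}
history = race_history(lap_times, winner="A")
print(history)
```

With this convention the winner's line always finishes on zero, and a sudden vertical drop in a line corresponds to a slow lap such as a pit stop.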
I’ve also experimented with some other views, including a race summary chart that captures the start position, end of first lap position, final position and range of positions held at the end of each lap by each driver:
One of the ways of using this chart is as a quick summary of the race position chart, as well as a tool for highlighting possible “driver of the day” candidates.
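The values behind a race summary chart of this kind can be pulled straight out of end-of-lap position data. This is a minimal sketch under assumed inputs: a dict of grid positions and a dict of end-of-lap positions per driver, both hypothetical formats with made-up values.

```python
def race_summary(grid, lap_positions):
    """Summarise each driver's race from grid and end-of-lap positions."""
    summary = {}
    for driver, positions in lap_positions.items():
        summary[driver] = {
            "grid": grid[driver],
            "lap1": positions[0],
            "final": positions[-1],
            "best": min(positions),   # highest position held at the end of any lap
            "worst": max(positions),  # lowest position held at the end of any lap
            # Positive means places gained on the first lap
            "first_lap_gain": grid[driver] - positions[0],
        }
    return summary


# Made-up data: driver A starts 5th and finishes 2nd
grid = {"A": 5, "B": 1}
lap_positions = {"A": [3, 3, 4, 2], "B": [1, 2, 1, 1]}
summary = race_summary(grid, lap_positions)
print(summary["A"])
```

Sorting drivers by the difference between grid and final position would be one crude way of shortlisting "driver of the day" candidates.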
A rich lap chart might also be used to convey information about the distance between cars as well as their relative positions. Here’s one experiment I tried (using Gephi to visualise the data) in which node size is proportional to time to car in front and colour is related to time to car behind (red is hot – car behind is close):
(You might also be able to imagine a variant of this chart where we fix the y-value so each row shows data relating to one particular driver. Looking along a row then allows us to see how exciting a race they had.)
All of these charts can be calculated from lap time data. Some of them can be calculated from data describing the position held by each competitor at the end of each lap. But whatever the case, the data is what drives the visualisation.
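To illustrate how the position data can be derived from lap times, here is a sketch that ranks the cars by elapsed time at the end of each lap. It is a simplification: it assumes a dict of lap times per driver (a made-up layout with made-up times), ignores grid order, and simply stops classifying a car once it has no more laps recorded.

```python
from itertools import accumulate


def lap_chart(lap_times):
    """Return each driver's position at the end of every lap they completed."""
    elapsed = {d: list(accumulate(laps)) for d, laps in lap_times.items()}
    n_laps = max(len(times) for times in elapsed.values())
    positions = {d: [] for d in lap_times}
    for lap in range(n_laps):
        # Only cars that completed this lap are ranked on it
        running = [d for d in elapsed if len(elapsed[d]) > lap]
        ranked = sorted(running, key=lambda d: elapsed[d][lap])
        for place, driver in enumerate(ranked, start=1):
            positions[driver].append(place)
    return positions


# Made-up lap times; driver C retires after two laps
lap_times = {
    "A": [100.0, 98.0, 102.0],
    "B": [101.0, 96.0, 104.0],
    "C": [99.0, 105.0],
}
positions = lap_chart(lap_times)
print(positions)
```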
A little bit of me had been hoping that laptime data for Olympics track, swimming and cycling events might be available somewhere, but if it is, I haven’t found a reliable source yet. What I did find encouraging, though, was that the New York Times (in many ways one of the organisations that is seeing the value of using visualised data-driven storytelling in its daily activities) did make some split time data available – and was putting it to work – in the swimming events:
Here, the NYT have given split data showing the times achieved in each leg by the relay team members, along with a lap chart that has a higher level of detail, showing the position of each team at the end of each 50m length (I think?!). The progression of each of the medal winners is highlighted using an appropriate colour theme.
[Here's an insight from @kevinQ about how the New York Times dataviz team put this graphic together: Shifts in rankings. Apparently, they'd done similar views in previous years using a Flash component, but the current iteration uses d3.js]
The chart provides an illustration that can be used to help a reporter identify different stories about how the race progressed, whether or not it is included in the final piece. The graphic can also be used as a sidebar illustration of a race report.
Lap charts also lend themselves to interactive views, or highlighted customisations that can be used to illustrate competition between selected individuals – here’s another F1 example, this time from the f1fanatic blog:
(I have to admit, I prefer this sort of chart with greyed options for the unhighlighted drivers because it gives a better sense of the position churn that is happening elsewhere in the race.)
Of course, without the data, it can be difficult trying to generate these charts…
…which is to say: if you know where lap data can be found for any of the Olympics events, please post a link to the source in the comments below:-)
PS for an example of the lapcharting style used to track the hole by hole scoring across a multi-round golf tournament, see Andy Cotgreave’s Golf Analytics.
In London Olympics 2012 Medal Tables At A Glance? I posted some treemap visualisations of the Olympics medal tables generated using a Google Visualisation Chart treemap component. I thought it might be worth posting a quick R generated example too, using the off-the-shelf/straight out of CRAN treemap component. (If you want to play along, download the data as CSV from here.)
The original data looks like this:
but ideally we want it to look like this:
I posted a quick recipe showing how to do this sort of reshaping in Google Refine, but in R it’s even easier – just melt the Gold, Silver and Bronze columns into a pair of columns…
Here’s the full code to do the reshaping and generate a simple treemap:
#load in the data from a file
odata = read.csv("~/Downloads/nbc_olympic_medalscrape.csv")
#Reshape the data
require(reshape)
odatar=melt(odata,id=c('cc','ccevent','Event'))
#And generate the treemap in the simplest possible way
require(treemap)
tmPlot(odatar, index=c("cc", "Event","variable"), vSize="value", vColor='value', type="value")
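For anyone who hasn't met melt before, the reshaping step amounts to turning each wide row into one long row per medal column. Here is a plain Python sketch of that idea; the helper is hypothetical (it is not the reshape package's implementation) and the rows are made-up values, though the default variable and value column names mirror the ones melt produces.

```python
def melt(rows, id_cols, value_cols, var_name="variable", value_name="value"):
    """Turn wide rows into long rows: one output row per (input row, value column)."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_row = {key: row[key] for key in id_cols}
            long_row[var_name] = col         # e.g. the medal type
            long_row[value_name] = row[col]  # e.g. the medal count
            long_rows.append(long_row)
    return long_rows


# Made-up wide format rows (country code, event, medal counts)
wide = [
    {"cc": "AAA", "Event": "Swimming", "Gold": 2, "Silver": 1, "Bronze": 0},
    {"cc": "BBB", "Event": "Cycling", "Gold": 0, "Silver": 3, "Bronze": 1},
]
long_rows = melt(wide, id_cols=["cc", "Event"], value_cols=["Gold", "Silver", "Bronze"])
for row in long_rows:
    print(row)
```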
And here’s the treemap, with country blocks ordered in this case by total medal haul:
(To view the countries ordered according to number of Golds, a quick fix would be to order hierarchy with the medal type shown at the highest level of the tree: index=c("variable","cc", "Event").)
Generating variant views (I described six variants in the original post) is easy enough – just tweak the order of the elements of the index setting. (I should have named the melt created columns something more sensible than the default, shouldn’t I? Note that the vSize and vColor setting of value refers to the name of the melt created column that holds the medal counts; the medal type is held in the variable column. The type setting of value says use the numerical value, i.e. it’s literal – it doesn’t refer to a column name…)
Out of the can – simples enough… So what might we be able to do with a little bit more treatment? Examples via the comments, please ;-)
Looking at the various medal standings for medals awarded during any Olympics games is all very well, but it doesn’t really show where each country won its medals or whether particular sports are dominated by a single country. Ranked as they are by the number of gold medals won, the medal standings don’t make it easy to see what we might term “strength in depth” – that is, we don’t get a sense of how the rankings might change if other medal colours were taken into account in some way.
Four years ago, in a quick round up of visualisations from the 2008 Beijing Olympics (More Olympics Medal Table Visualisations) I posted an example of an IBM Many Eyes Treemap visualisation I’d created showing how medals had been awarded across the top 10 medal winning countries. (Quite by chance, a couple of days ago I noticed one of the visualisations I’d created had appeared as an example in an academic paper – A Magic Treemap Cube for Visualizing Olympic Games Data.)
Although not that widely used, I personally find treemaps a wonderful device for providing a macroscopic overview of a dataset. Whilst getting actual values out of them may be hit and miss, they can be used to provide a quick orientation around a hierarchically ordered dataset. Yes, it may be hard to distinguish detail, but you can easily get your eye in and start framing more detailed questions to ask of the data.
Whilst there is still a lot more thinking I’d like to do around the use of treemaps for visualising Olympics medal data, here are a handful of quick sketches constructed using Google visualisation chart treemap components, and data scraped from NBC.
The data I have scraped is represented using rows of the form:
Country, Event, Gold, Silver, Bronze
where Event is at the level of “Swimming”, “Cycling” etc rather than at finer levels of detail (it’s really hard finding data at even this level of detail in an easily grabbable way).
I’ve then treated the data as hierarchically structured over three levels, which can be arranged in six ways:
- MedalType, Country, Event
- MedalType, Event, Country
- Event, MedalType, Country
- Event, Country, MedalType
- Country, MedalType, Event
- Country, Event, MedalType
Each ordering provides a different view over the data, and can be used to get a feel for different stories that are to be told.
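The six arrangements are just the permutations of the three levels, and each one can be turned into a nested structure of medal counts. The sketch below assumes the data has already been reshaped into long rows with a Count field; the field names and the sample rows are made up for illustration.

```python
from itertools import permutations

LEVELS = ("MedalType", "Country", "Event")


def build_tree(rows, order):
    """Nest medal counts according to the given ordering of the three levels."""
    tree = {}
    for row in rows:
        node = tree
        for level in order[:-1]:
            node = node.setdefault(row[level], {})
        leaf = row[order[-1]]
        node[leaf] = node.get(leaf, 0) + row["Count"]
    return tree


# Made-up long format rows
rows = [
    {"Country": "AAA", "Event": "Swimming", "MedalType": "Gold", "Count": 2},
    {"Country": "AAA", "Event": "Cycling", "MedalType": "Gold", "Count": 1},
    {"Country": "BBB", "Event": "Swimming", "MedalType": "Silver", "Count": 3},
]
orderings = list(permutations(LEVELS))
trees = {order: build_tree(rows, order) for order in orderings}
for order in orderings:
    print(order, trees[order])
```

Each tree could then be handed to whatever treemap component is being used, with the first level of the ordering as the top level blocks.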
First up, ordered by Medal, Country, Event:
This is a representation, of sorts, of the traditional medal standings table. If you look to the Gold segment, you can see the top few countries by medal count. We can also zoom in to see what events those medals tended to be awarded in:
The colouring is a bit off – the Google component is not as directly scriptable as a d3js treemap, for example – but with a bit of experimentation it may be possible to find a colour scheme that better indicates the number of medals allocated in each case.
The Medal-Country-Event view thus allows us to get a feel for the overall medal standings. But how about the extent to which one country or another dominated an event? In this case, an Event-Country-Medal view gives us a feeling for strength in depth (ie we’re happy to take a point of view based on the award of any medal type):
and the Country Medal Event view allows us to then tunnel in on the gold winning events:
I think that colour could be used to make these charts even more accessible – maybe using different colouring schemes for the different variations – which is something I need to start thinking about (please feel free to make suggestions in the comments:-). It would also be good to have a little more control over the text that is displayed. The Google chart component is a little limited in this respect, so I think I need to find an alternative for more involved play – d3js seems like it’d be a good bet, although I need to do a quick review of R based treemap libraries too to see if there is anything there that may be appropriate.
It’d probably also be worth jotting down a few notes about what each of the six hierarchical variants might be good for highlighting, as well as using quick doodles with the Google chart component to explore simpler treemaps that don’t reveal lower level structure, leaving that to be discovered through interactivity. (I showed the lower levels in the above treemaps because I was exploring static (i.e. printable) macroscopic views over the medal standings data.)
Data allowing, it would also be interesting to be able to get more detailed data visualised (for example, down to the level of actual events – 100m and Long Jump, for example, rather than Track and Field – as well as the names of individual medalists).
PS for another Olympics related visualisation I’ve started exploring, see At A Glance View of the 2012 Olympics Heptathlon Performances
PPS As mentioned at the start, I love treemaps. See for example this initial demo of an F1 Championship points treemap in Many Eyes and as an Ergast Motor Sport API powered ‘live’ visualisation using a Google treemap chart component: A Treemap View of the F1 2011 Drivers and Constructors Championship
A week or two ago, the Government Data Service started publishing a summary document containing website transaction stats from across central government departments (GDS: Data Driven Delivery). The transactional services explorer uses a bubble chart to show the relative number of transactions occurring within each department:
The sizes of the bubbles are related to the volume of transactions (although I’m not sure what the exact relationship is?). They’re also positioned on a spiral, so as you work clockwise round the diagram starting from the largest bubble, the next bubble in the series is smaller (the “Other” catchall bubble is the exception, sitting as it does on the end of the tail irrespective of its relative size). This spatial positioning helps communicate relative sizes when the actual diameter of two bubbles next to each other is hard to differentiate between.
Clicking on a link takes you down into a view of the transactions occurring within that department:
Out of idle curiosity, I wondered what a treemap view of the data might reveal. The order of magnitude differences in the number of transactions across departments meant that the resulting graphic was dominated by departments with large numbers of transactions, so I did what you do in such cases and instead set the size of the leaf nodes in the tree to be the log10 of the number of transactions in a particular category, rather than the actual number of transactions. Each node higher up the tree was then simply the sum of values in the lower levels.
The result is a treemap that I decided shows “interestingness”, which I defined for the purposes of this graphic as being some function of the number and variety of transactions within a department. Here’s a nested view of it, generated using a Google chart visualisation API treemap component:
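The size transformation can be sketched in a few lines: leaf size is log10 of the transaction count, and each department's size is the sum of its leaves. The department and service names and the counts below are made up, and the two level dict layout is an assumption for illustration.

```python
import math


def log_sized_tree(transactions):
    """Size leaves by log10(transaction count) and parents by the sum of their leaves."""
    sized = {}
    for dept, services in transactions.items():
        # Skip zero counts, for which log10 is undefined
        leaves = {name: math.log10(count) for name, count in services.items() if count > 0}
        sized[dept] = {"size": sum(leaves.values()), "children": leaves}
    return sized


# Made-up transaction counts
transactions = {
    "Dept A": {"service 1": 1000000, "service 2": 1000},
    "Dept B": {"service 3": 100},
}
sized = log_sized_tree(transactions)
for dept, node in sized.items():
    print(dept, node["size"], node["children"])
```

In this toy example the raw counts differ by a factor of about ten thousand between the two departments, whereas the log based sizes come out at roughly 9 and 2. A department with many small services can therefore end up larger than one with a single huge service.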
The data I grabbed had a couple of usable structural levels that we can make use of in the chart. Here’s going down to the first level:
…and then the second:
Whilst the block sizes aren’t really a very good indicator of the number of transactions, it turns out that the default colouring does indicate relative proportions in the transaction count reasonably well: deep red corresponds to a low number of transactions, dark green a large number.
As a management tool, I guess the colours could also be used to display percentage change in transaction count within an area month on month (red for a decrease, green for an increase), though a slightly different size transformation function might be sensible in order to draw out the differences in relative transaction volumes a little more?
I’m not sure how well this works as a visualisation that would appeal to hardcore visualisation puritans, but as a graphical macroscopic device, I think it does give some sort of overview of the range and volume of transactions across departments that could be used as an opening gambit for a conversation with this data?
Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.
Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:
There’s still a fair bit to do making the methodology robust (for example, being able to cope with comparing folk commonly followed by different sets of users where the size of the set differs to a significant extent; there is a large difference between the number of tweeting Conservative and LibDem MPs, for instance). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or particular perspective, about the way different groups differentially follow folk on Twitter. A couple of other quirks I’ve noticed about the comparison.cloud as currently defined: firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear) – there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud.
(So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman – and @davegorman appeared as a small label in the friends part of the comparison.cloud (notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do…). What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed because it would be too large).)
That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). I guess one way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets? Suggestions welcome as to what they might be…:-)
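As a starting point for picking the idea apart, here is a rough sketch of one way to compare groups of different sizes: work with the proportion of each group that follows an account rather than raw counts, then assign each account to the group where that proportion most exceeds the average across groups. This is my reading of the general approach, not the wordcloud package's actual code, and the function names, threshold and sample data are all assumptions.

```python
from collections import Counter


def follow_rates(groups):
    """For each group, the proportion of its members following each account."""
    rates = {}
    for group, members in groups.items():
        counts = Counter(acct for friends in members for acct in set(friends))
        rates[group] = {acct: n / len(members) for acct, n in counts.items()}
    return rates


def commonly_followed(rates, threshold=0.5):
    """Accounts followed by at least `threshold` of the members of every group."""
    shared = set.intersection(*(set(r) for r in rates.values()))
    return {a for a in shared if all(r[a] >= threshold for r in rates.values())}


def differential(rates):
    """Map each account to (group with the highest rate, that rate minus the mean rate)."""
    accounts = set().union(*rates.values())
    result = {}
    for acct in accounts:
        per_group = {g: r.get(acct, 0.0) for g, r in rates.items()}
        mean = sum(per_group.values()) / len(per_group)
        best = max(per_group, key=per_group.get)
        result[acct] = (best, per_group[best] - mean)
    return result


# Made-up data: each member is represented by the set of accounts they follow
groups = {
    "red": [{"@news", "@a"}, {"@news", "@a"}],
    "blue": [{"@news", "@b"}, {"@news"}],
}
rates = follow_rates(groups)
common = commonly_followed(rates)
diff = differential(rates)
print(common, diff)
```

Because each account is attached only to the group where it is most over-represented, a label can never turn up in the colour of a group that follows it less than another group does.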
PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…
An article in today’s Guardian (Serco investigated over claims of ‘unsafe’ out-of-hours GP service) about services provided by Serco to various NHS Trusts got me thinking about how much local councils spend with Serco companies. OpenlyLocal provides a patchy(?) aggregating service over local council spending data (I don’t think there’s an equivalent aggregator for NHS organisations’ spending, or police authority spending?) so I thought I’d have a quick peek at how the money flows from councils to Serco.
If we search the OpenlyLocal Spending Dashboard, we can get a summary of spend with various Serco companies from local councils whose spending data has been logged by the site:
Using the local spend on corporates scraper I used to produce Inter-Council Payments Network Graph, I grabbed details of payments to companies returned by a search on OpenlyLocal for suppliers containing the keyword serco, and then generated a directed graph with edges defined: a) from council nodes to company nodes; b) from company nodes to canonical company nodes. (Where possible, OpenlyLocal tries to reconcile companies identified for payment by councils with canonical company identifiers so that we can start to get a feeling for how different councils make payments to the same companies.)
I then exported the graph as a json node/edge list so that it could be displayed by Mike Bostock’s d3.js Sankey diagram library:
(Note that I’ve filtered the edges to only show ones above a certain payment amount (£10k).)
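The graph construction and export described above might look something like the following sketch. It assumes the payments have already been totalled per council and company pair, that the reconciliation is available as a simple lookup dict, and that the Sankey code expects the nodes/links JSON layout (with links referring to nodes by index) used in the d3.js Sankey example. All the names and amounts are made up.

```python
import json


def sankey_data(payments, canonical, min_amount=10000):
    """Build a nodes/links structure for council -> company -> canonical company flows.

    payments: iterable of (council, company, amount) tuples (assumed pre-aggregated).
    canonical: dict mapping company names to reconciled canonical names (hypothetical).
    """
    names, index, links = [], {}, {}

    def node_id(name):
        if name not in index:
            index[name] = len(names)
            names.append(name)
        return index[name]

    def add_flow(source, target, amount):
        key = (node_id(source), node_id(target))
        links[key] = links.get(key, 0) + amount

    for council, company, amount in payments:
        if amount < min_amount:
            continue  # drop small payments to keep the sketch readable
        add_flow(council, company, amount)
        target = canonical.get(company)
        # Skip self-links, which a Sankey layout cannot draw
        if target and target != company:
            add_flow(company, target, amount)

    return {
        "nodes": [{"name": name} for name in names],
        "links": [{"source": s, "target": t, "value": v} for (s, t), v in links.items()],
    }


# Made-up payments and reconciliation table
payments = [
    ("Council X", "Example Co Ltd", 50000),
    ("Council Y", "EXAMPLE CO LIMITED", 20000),
    ("Council Y", "Example Co Ltd", 500),
]
canonical = {"Example Co Ltd": "Example Group", "EXAMPLE CO LIMITED": "Example Group"}
data = sankey_data(payments, canonical)
print(json.dumps(data, indent=2))
```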
As a presentation graphic, it’s really tatty, doesn’t include amount labels (though they could be added) and so on. But as a sketch, it provides an easy to digest view over the data as a starting point for a deeper conversation with the data. We might also be able to use the diagram as a starting point for a data quality improvement process, by identifying the companies that we really should try to reconcile.
Here are flows associated with spend to g4s identified companies:
I also had a quick peek at which councils were spending £3,500 and up (in total) with the OU…
Digging into OpenlyLocal spending data a little more deeply, it seems we can get a breakdown of how total payments from council to supplier are made up, such as by spending department.
Which suggests to me that we could introduce another “column” in the Sankey diagram that joins councils with payees via spending department (I suspect the Category column would result in data that’s a bit too fine grained).
See also: University Funding – A Wider View
Using data from the clinicaltrials.gov registry (search for UK clinical trials), I grabbed all records relating to trials that have at least in part run in the UK as an XML file download, then mapped links between project lead sponsors and their collaborators. Here’s a quick sketch of the result:
The XML data schema defines the corresponding fields as follows:
<!-- === Sponsors ==================================================== -->
<xs:complexType name="sponsors_struct">
  <xs:sequence>
    <xs:element name="lead_sponsor" type="sponsor_struct"/>
    <xs:element name="collaborator" type="sponsor_struct" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
Here’s the Python script I used to extract the data and generate the graph representation of it (requires networkx), which I then exported as a GEXF file that could be loaded into Gephi and used to generate the sketch shown above.
import os

from lxml import etree
import networkx as nx


def flatten(el):
    """Return all the text contained in an element and its descendants."""
    if el is None:
        return ''
    result = [el.text or ""]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)


def graphOut(DG):
    # Write the graph out as a GEXF file that Gephi can load
    nx.write_gexf(DG, 'workfile.gexf', encoding='utf-8', prettyprint=True, version='1.1draft')


def sponsorGrapher(DG, xmlRoot, sponsorList):
    sponsors_xml = xmlRoot.find('.//sponsors')
    if sponsors_xml is None:
        return DG, sponsorList
    lead = flatten(sponsors_xml.find('./lead_sponsor/agency'))
    if lead == '':
        # Without a lead sponsor there is nothing to link collaborators to
        return DG, sponsorList
    # Node ids are the positions of the sponsor names in sponsorList
    if lead not in sponsorList:
        sponsorList.append(lead)
        DG.add_node(sponsorList.index(lead), label=lead, name=lead)
    for collab in sponsors_xml.findall('./collaborator/agency'):
        collabname = flatten(collab)
        if collabname != '':
            if collabname not in sponsorList:
                sponsorList.append(collabname)
                DG.add_node(sponsorList.index(collabname), label=collabname, name=collabname, Label=collabname)
            # Edges run from the lead sponsor to each collaborator
            DG.add_edge(sponsorList.index(lead), sponsorList.index(collabname))
    return DG, sponsorList


def parsePage(path, fn, sponsorGraph, sponsorList):
    fnp = '/'.join([path, fn])
    xmldata = etree.parse(fnp)
    xmlRoot = xmldata.getroot()
    sponsorGraph, sponsorList = sponsorGrapher(sponsorGraph, xmlRoot, sponsorList)
    return sponsorGraph, sponsorList


XML_DATA_DIR = './ukClinicalTrialsData'
listing = os.listdir(XML_DATA_DIR)
sponsorDG = nx.DiGraph()
sponsorList = []
for page in listing:
    # Only process the XML trial records
    if os.path.splitext(page)[1] == '.xml':
        sponsorDG, sponsorList = parsePage(XML_DATA_DIR, page, sponsorDG, sponsorList)
graphOut(sponsorDG)
Once the file is loaded in to Gephi, you can hover over nodes to see which organisations partnered which other organisations, etc.
One thing the graph doesn’t show directly are links between co-collaborators – edges go simply from lead partner to each collaborator. It would also be possible to generate a graph that represents pairwise links between every sponsor of a particular trial.
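The pairwise variant could be sketched as follows, counting how many trials each pair of sponsors shares so that the count can be used as an edge weight. The input format (one list of sponsor names per trial, lead sponsor and collaborators together) and the names themselves are made up for illustration.

```python
from itertools import combinations


def pairwise_sponsor_edges(trials):
    """Count, for every pair of sponsors, the number of trials they both appear on."""
    weights = {}
    for sponsors in trials:
        # Sort so that each pair is always recorded in the same order
        for a, b in combinations(sorted(set(sponsors)), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


# Made-up trials, each listing its lead sponsor and collaborators
trials = [
    ["Lead A", "Collab B", "Collab C"],
    ["Lead A", "Collab B"],
]
edges = pairwise_sponsor_edges(trials)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

These are undirected links, so an undirected graph would be the natural container for them rather than the directed graph used for lead to collaborator edges.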
The XML data download also includes information about the locations of trials (sometimes at the city level, sometimes giving postcode level data). So the next thing I may look at is a map to see where sponsors tend to run trials in the UK, and maybe even see whether different sponsors tend to favour different trial sites…
PS these may be handy too – World Health Organisation Clinical Trials Registry portal, EU Clinical Trials Register
PPS looks like we can generate a link to the clinicaltrials.gov download file easily enough. Original URL is:
Download URL is:
So now I wonder:
can Scraperwiki accept a zip file, unzip it, then parse all the resulting files? Answers, with code snippets, via the comments, please:-) DONE – example here: Scraperwiki: clinicaltrials.gov test
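Outside of Scraperwiki, the same unzip-then-parse pattern can be sketched using nothing but the Python standard library, working entirely in memory. The demo builds a tiny zip file by hand rather than downloading anything; the file names, the XML content and its element names are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def parse_zipped_xml(zip_bytes):
    """Parse every .xml file in a zip archive held in memory; return {filename: root}."""
    roots = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                with archive.open(name) as handle:
                    roots[name] = ET.parse(handle).getroot()
    return roots


# Build a small in-memory zip to stand in for the downloaded file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "trial1.xml",
        "<study><sponsors><lead_sponsor><agency>Example Sponsor</agency>"
        "</lead_sponsor></sponsors></study>",
    )
    archive.writestr("readme.txt", "not xml")

roots = parse_zipped_xml(buffer.getvalue())
lead = roots["trial1.xml"].findtext(".//lead_sponsor/agency")
print(sorted(roots), lead)
```

In practice the bytes passed to the function would come from whatever fetches the download URL.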