Graph Structure of an Open Science Notebook – “Linked Science” FTW…

Early days on this, but what, if anything, can we learn from looking at the link structure of an open science lab-book, based on the use of hyperlinks between pages in the lab-book?

A couple of days ago, I started informally bouncing ideas around with @cameronneylon about quick wins/low hanging fruit visualisations around his open science notebook (a full description of our conversation – and indeed the whole history of this ad hoc “mini-project” – can be found on Cameron’s blog: A little bit of federated Open Notebook Science). So here are a couple of Gephi takes on the lab-book (the original data and scripts can be found via the GitHub links in Cameron’s post).

The lab notebook identifies different types of post, which can be used to colour the graph:

Lab notebook - colour modules by section type

The network graph also shows the presence of highly linked “procedure” type nodes relating to a particular experimental procedure. If we apply the ego filter to the graph we can get a close look at which posts are connected to a procedure:

Applying a Gephi ego filter to a set of linked posts from a lab notebook
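
As a rough sketch of what an ego filter does, the snippet below collects every page within a given number of hops of a chosen node. The page names are hypothetical, and treating links as undirected is an assumption made for illustration, not a description of Gephi's internals.

```python
from collections import deque

def ego_network(links, centre, depth=1):
    """Return the set of nodes within `depth` hops of `centre`."""
    # Build an undirected adjacency map from (source, target) page links.
    neighbours = {}
    for source, target in links:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    # Breadth-first search, stopping once we are `depth` hops out.
    seen = {centre}
    queue = deque([(centre, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen

# Hypothetical page names, for illustration only.
links = [
    ("post-a", "procedure-1"),
    ("post-b", "procedure-1"),
    ("post-c", "post-b"),
    ("post-d", "procedure-2"),
]
ego = ego_network(links, "procedure-1", depth=1)
```

With `depth=1` only the posts linked directly to the procedure are kept; raising the depth pulls in posts that are linked to those posts in turn.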

If we run the modularity statistic, we can automatically partition the posts into groupings of posts that are linked together – here they are grouped by modularity class:

Different modularity classes
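
To give a feel for what the modularity statistic is optimising, here is a minimal sketch that scores a given partition of an undirected graph using the standard modularity measure Q. It does not search for the best partition, which is the hard part that Gephi does for us; the toy graph and node names are made up.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2,
    where m is the total number of edges.
    """
    m = len(edges)
    internal = {}    # edges with both ends in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )

# Toy graph: two triangles joined by a single link (hypothetical nodes).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {n: 0 for n in split}

q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one group, which is the sort of preference that drives the partitioning.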

We can expand different class nodes to see the posts associated with them:

Modularity partitions on lab book, partially expanded

Here’s one close up:

Modularity class

If we apply the ego network filter, we see that the modularity clustering does seem to have grouped the posts in a meaningful way:

Module identified - ego network applied

Notice though that we lose sight of the internal link structure within that modularity class that was evident in the previous image.

Was that connected node important in some way?

Close up of internal structure of connected node in modularity class

With his intimate knowledge of the experiments recorded in the lab book, Cameron also observed that Gephi has (largely) successfully clustered the correct posts together [according to protein classification] and thus separated the purifications from each other based only on connectivity. This suggests that even if posts aren’t explicitly tagged with a particular experiment, link analysis may be useful in finding posts that are related to that experiment; where a post is included in one group and links out to another, it may indicate some sort of relationship between the separate clusters, such as a shared reagent.

So why might the visualisation of the whole notebook be a useful thing to do? My take on it is that the visualisation acts as a macroscope.

As Jonathan Schull put it in the Macroscope Manifesto:

Most natural patterns are not easily perceived, for they do not happen to produce lasting stimuli to which our nervous systems are attuned. But everything we know about biology, epidemiology, social networks, computational algorithms and data structures, tells us that branching patterns are “out there”, waiting to be mapped, illuminated, seen-anew. In the last few decades new data sources, new data-analytic tools, and new tracking techniques have become available to scientists and school children. It is now possible to envision a “macroscope” that present these invisible but ubiquitous patterns to human perceptual systems so that they would engage our innate ability to perceive millions of leaves as scores of trees…and a forest

For me, Gephi can act as a macroscope in the way it reveals structure from across the whole of Cameron’s open science lab book in a single image, and allows us to interrogate the lab book from a variety of perspectives in an interactive way.

The approach is amenable to displaying structures aggregated from across multiple blogs, as long as they link to each other. It may also serve to identify related processes, as for example when modularity clusters are connected by one or more links.
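
One way to look for such connections, sketched here under the assumption that we already have a cluster label for every post (for example, an exported modularity class), is to pick out the links whose two ends sit in different clusters. The post names and labels are invented.

```python
def bridging_links(links, cluster_of):
    """Group links that cross between clusters, keyed by the cluster pair."""
    bridges = {}
    for source, target in links:
        a, b = cluster_of[source], cluster_of[target]
        if a != b:
            # Sort the pair so (0, 1) and (1, 0) land in the same bucket.
            pair = tuple(sorted((a, b)))
            bridges.setdefault(pair, []).append((source, target))
    return bridges

# Hypothetical posts and cluster labels.
links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5"), ("p3", "p4")]
cluster_of = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1}
bridges = bridging_links(links, cluster_of)
```

Each entry in `bridges` is then a candidate "these two groups of posts may share something" pointer, to be checked against the posts themselves.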

And what might this suggest as a baby next step for Open Notebook Science? Well I can’t help thinking that maybe open Lab Notebooks should also be publishing their link graph, with URI referenced external links as well as internal links included… then we can create some big graphs across notebooks and start to see what might fall out…
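
By way of illustration only, here is a minimal sketch of how a notebook might derive such a link graph from its own pages: pull the `href` values out of each page, resolve them to full URIs, and mark each link as internal or external by comparing hostnames. The URLs and the `link_edges` helper are hypothetical, and a real notebook platform would more likely export this from its database.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def link_edges(page_url, html):
    """Return (source, target, kind) edges for the links on one page."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    edges = []
    for href in collector.hrefs:
        target = urljoin(page_url, href)  # resolve relative links
        kind = "internal" if urlparse(target).netloc == host else "external"
        edges.append((page_url, target, kind))
    return edges

# Hypothetical notebook page, for illustration only.
page = "http://notebook.example.org/posts/1"
html = ('<p><a href="/posts/2">earlier post</a> and '
        '<a href="http://other.example.com/x">elsewhere</a></p>')
edges = link_edges(page, html)
```

Published as a simple edge list, graphs like this from several notebooks could be concatenated, with the external links providing the joins between them.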

Linked Science FTW ;-)

PS One thing that may or may not be missing from the above – links to a video demonstrating each procedure, if appropriate, on a visual protocols site. Just by the by, here’s a Google custom search engine I created some time ago that implements a Science Experimental Protocols Videos meta-search engine. (It doesn’t turn up anything for /Purification of sortase/ though ;-( )

13 comments

  1. Jean-Claude Bradley

    Tony – that is very interesting. We recently observed the intersection of 2 organic chemistry Open Notebooks between the Todd and Bradley groups: http://usefulchem.blogspot.com/2010/06/use-of-ons-to-protect-open-research.html

    It might be worth trying to graphically show this intersection. By using the ReactionCompounds spreadsheet and linking reactants and products by ReactionID you should be able to graph the connections within each notebook and between them: http://onsbooks.wikispaces.com/Reaction+Attempts

    • Tony Hirst

      Thanks for that comment;

      Also, looking at the ReactionCompounds spreadsheet ( https://spreadsheets0.google.com/ccc?key=toMz3Kp3T2EAUDF3MSGCquw&hl=en#gid=0 – title Reaction Attempts?) I see columns that may be of interest as:
      – reactionID, with related columns CompoundName, CSID, role
      – hyperlink (several reactionIDs may relate to one hyperlink, which I guess is a procedure?)

      So we would expect to see a hyperlink connected to one or more reactionIDs, and reactionIDs linking on to CompoundName/CSID coloured by role.

      When you say “within each notebook”, do you mean each hyperlink entry?

  2. Jean-Claude Bradley

    The hyperlinks point to the lab notebook page where the reaction took place. There may be several different reactions done in a single experiment so that is why different reactionIDs may be associated with the same hyperlink. I don’t think it is very relevant that multiple reactions were done in the same experiment in terms of the graph. I think it matters more how CSIDs(molecular identifiers) connect to each other as co-reactants or as a reactant-product relationship to build the graph. You could then color the connections based on the lab notebook or the person who did the reaction perhaps to visualize patterns and intersections.

    • Tony Hirst

      So what do you want to see? A plot of CSID-CSID connections? Colouring gets tricky then, e.g. if a compound is a reactant in one reaction and a product in another. If we plot CSID-ReactionID, we can colour the edge as reactant or product, and see which CSIDs are used in different reactions? The ego filter at distance 2 applied to a CSID would then also display the other CSIDs it had been in a reaction with?
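
A minimal sketch of the CSID-ReactionID idea, using made-up rows rather than the real spreadsheet: each row becomes one edge from a compound to a reaction, carrying its role so that the edge could be coloured by it, and stepping out two hops from a compound (compound to reaction, reaction to compound) gives its co-occurring compounds.

```python
# Hypothetical rows: (ReactionID, CSID, role). Not real spreadsheet data.
rows = [
    ("R1", "CS100", "reactant"),
    ("R1", "CS200", "reactant"),
    ("R1", "CS300", "product"),
    ("R2", "CS300", "reactant"),
    ("R2", "CS400", "product"),
]

# One edge per row: compound -> reaction, with the role kept as an
# edge attribute that could drive the edge colour.
edges = [(csid, rid, role) for rid, csid, role in rows]

def co_compounds(edges, csid):
    """Compounds two hops from `csid`: those sharing a reaction with it."""
    reactions = {rid for c, rid, _ in edges if c == csid}
    return {c for c, rid, _ in edges if rid in reactions and c != csid}

partners = co_compounds(edges, "CS300")
```

Here CS300 is a product in one reaction and a reactant in another, which is fine in this layout because the role lives on the edge rather than on the compound node.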

  3. Jean-Claude Bradley

    Yes I think you could get an informative plot if you lump reactants and products together and just graph the CSID-CSID connections. Different colors for the researcher will automatically show the different notebooks as well.

    • Tony Hirst

      I’ve no idea if this is interesting or not… (It’s sort of visual co-browsing around paired compounds, isn’t it?)

      e.g. http://www.flickr.com/photos/psychemedia/5053332039/

      If the code has worked as I think it should (?!), what is plotted is:

      – for each ReactionID, grab the list of compounds;
      – generate the pairwise combination of compounds for each ReactionID, and use these as edges; (In the image above they’re directed, but that’s an artefact, and they should be undirected. If a mixed directed/undirected graph is possible, could have undirected links between reactants in a ReactionID, and directed from those to products?)
      – colour according to hyperlink; note that if several hyperlinks are associated with a compound, the colour used will refer to the first relevant hyperlink in the file

      A gist for generating the Gephi readable gdf file from a CSV export of the spreadsheet is available here: http://gist.github.com/611274
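
The steps above can be sketched roughly as follows. This is not the code from the gist: the column headers `reactionID` and `CSID` are assumptions about the CSV export, the sample rows are invented, and the colouring by hyperlink is left out to keep things short. Edges are written as undirected using the GDF `directed` edge column.

```python
import csv
import io
from itertools import combinations

def reactions_to_gdf(csv_text):
    """Turn reaction/compound rows into a GDF graph of compound pairs."""
    # Group compounds by reaction (column names are assumed, see above).
    compounds_by_reaction = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        compounds_by_reaction.setdefault(row["reactionID"], []).append(row["CSID"])

    nodes, edges = set(), set()
    for compounds in compounds_by_reaction.values():
        unique = sorted(set(compounds))
        nodes.update(unique)
        # Every pair of compounds in the same reaction becomes an edge.
        edges.update(combinations(unique, 2))

    lines = ["nodedef>name VARCHAR"]
    lines += sorted(nodes)
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN")
    lines += [f"{a},{b},false" for a, b in sorted(edges)]
    return "\n".join(lines)

# Invented sample data, for illustration only.
csv_text = (
    "reactionID,CSID,role\n"
    "1,10,reactant\n"
    "1,20,reactant\n"
    "1,30,product\n"
    "2,30,reactant\n"
    "2,40,product\n"
)
gdf = reactions_to_gdf(csv_text)
```

Saving `gdf` to a file with a `.gdf` extension should give something Gephi can open; compound 30 ends up as the node joining the two reactions.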

  4. Jean-Claude Bradley

    This looks very promising – can you color based on researcher instead of hyperlink? That should show the intersection between notebooks – although it might be difficult to see from an image because it is so congested. Is it possible to interact with the plot and zoom into certain sections?

    • Tony Hirst

      Where does the author information exist? It’s not in the spreadsheet I was using… Is there an author-to-reactionID map somewhere?

      As to zooming in – yes, the map is interactively navigable, e.g. within Gephi.

    • Tony Hirst

      Great – thanks… I actually made reference to the events of ‘that weekend’ in a presentation last week… reminds me that I need to post the slides and a write-up…
