OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘gephi’

Sketching Connections Between US House and Senate Tweeps

with one comment

Following on from Sketching the Structure of the UK Political Media Twittersphere, I did a quick trawl of US Senators and Congress members based on the lists maintained by @HuffPostPol, and visualised the connections between them in Gephi using the Force Layout algorithm:

First pass sketch of US House and Senate tweeps

Node size is proportional to betweenness centrality measured over the directed graph where edges go from one person to a person they follow. The edge set is limited to edges between members of the four @HuffPostPol lists: republican-senators, democratic-senators, democratic-members, republican-members.
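A minimal sketch of that edge-set restriction, assuming the follow data has already been collected as (follower, followed) pairs. The function name and the sample handles are hypothetical, not taken from the original trawl:

```python
def edges_within(edges, members):
    """Keep only (follower, followed) pairs where both ends are list members."""
    members = set(members)
    return [(a, b) for a, b in edges if a in members and b in members]


# Hypothetical sample data: (A, B) means A follows B.
sample_edges = [
    ("sen_a", "rep_b"),
    ("sen_a", "outsider"),
    ("outsider", "rep_b"),
    ("rep_b", "sen_a"),
]
sample_members = ["sen_a", "rep_b"]

inner_edges = edges_within(sample_edges, sample_members)
```

Anything that touches an account outside the lists is dropped, so the resulting graph only contains list members.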

Written by Tony Hirst

January 5, 2011 at 1:41 pm

Posted in Data, Visualisation

Sketching the Structure of the UK Political Media Twittersphere

with 4 comments

I’m not sure how, now, but earlier today I came across the first part of a two-part NPR article, In London, A Case Study In Opinionated Press, looking at “ideology in the media” and how the British print media, at least, tend to be associated with a particular political affiliation.

This reminded me in part of something I read in Click*, by Bill Tancer, over the Christmas break (don’t worry – I didn’t pay even the full discounted price; it was remaindered for a couple of quid in an end-of-line bookshop in Cheltenham…probably still is…) relating to the flow of traffic between different overtly politically affiliated blogs in the US. One chart shows traffic flows between blogs based on some work done by Matthew Hindman (Political Traffic, June 2007). Another cited Hitwise statistics about upstream and downstream traffic to/from certain news media sites.

* (The book itself is essentially a distillation of observations drawn from Hitwise stats. If nothing else, it prompted me to wonder (again) about the extent to which academic researchers make use of the big data generated on the web, as well as market research data for segmenting (large) sets of (social science) research data. (If you’re concerned about invasion of privacy online, at least in terms of what marketing profiles are being built around your behaviour, are you similarly concerned about what trad direct mail marketers know about you?!;-))

One of the things I’d idly considered whilst reading the book was what a map of the UK media might look like. The OU has a couple of Hitwise seats, I think, though I’m probably not allowed anywhere near them; Alexa offers a certain amount of data (I’ve no idea how reliable it is, though!), though I’m not sure about its provision of an API and any license conditions around the use of the data:

Guardian upstream traffic on alexa

The interesting sites are likely to start appearing further down the long tail…

Data I do have access to is social graph data on Twitter. The Tweetminster folk maintain a variety of relevant lists collating MPs by party affiliation, journalists by media organisation, government departments and so on, so it was easy enough to grab this data and check out who’s following whom. (I really need to clean the data to provide the option of allowing only active twitterers to be displayed…)

Here are a few idle snapshots generated using Gephi, in the preview view; consider it a tease…

Nodes coloured and grouped by affiliation (party, media organisation, etc). This initial layout shows the groupings. (I don’t remember what the node size is; child nodes, I think, i.e. the number of actual tweeps in that category.) The actual layout was achieved by using a force directed layout (which is sensitive to the number of links between sets of nodes) on the individual people nodes, and then grouping them by category.

Uk political media on twitter

The next chart shows the individual twitterer nodes, coloured by party, under the force atlas layout; a link exists between A and B if A follows B:

uk political media on twitter

We see that nodes with similar affiliation are, on the whole, closer together; which is to say, folk in a particular party or organisation follow each other like crazy, and then follow some other folk too;-)

To explore the structure in a little more detail, I sized the nodes using betweenness centrality, which is related to how structurally important a node is in connecting different parts of the overall graph:

UK politics on twitter - sized by betweenness centrality

Out of interest, I also ran the modularity statistic over the network to see whether there were any naturally forming clusters; because the whole network is quite highly interlinked, only three main clusters were identified:

- what looks largely like a Labour clump:

UK political tweeps cluster 1
“the left is a ghetto”

- a clump of folk associated with (or tracking) government shenanigans, maybe?

Uk political tweeps cluster 2

- I did toy with calling this cluster “hangers-on around the political scene”;-)

Uk political tweeps cluster 3
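Gephi's modularity statistic searches for a good partition of the network; the sketch below only scores a partition that is already given, to show what "modularity" is measuring. It assumes every link is treated as undirected, and the function name and sample node labels are made up for illustration:

```python
def modularity(edges, community_of):
    """Modularity Q of a given partition, treating every link as undirected."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # number of links with both ends in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    # Observed share of internal links minus the share expected by chance.
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical example: two triangles joined by a single bridging link.
sample_edges = [("a", "b"), ("b", "c"), ("a", "c"),
                ("x", "y"), ("y", "z"), ("x", "z"),
                ("c", "x")]
two_clusters = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
one_cluster = {n: 1 for n in two_clusters}

q_split = modularity(sample_edges, two_clusters)
q_lumped = modularity(sample_edges, one_cluster)
```

Splitting the two triangles apart scores higher than lumping everything together, which is the sense in which a highly interlinked network yields only a few clusters.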

Okay – enough for now; if you want the data, then you have to demonstrate what impact it’ll get me;-) Or you can make a donation to a charity of my choice. Or you can just grab it yourself (I’ve blogged how enough times;-)

PS I really need to sort an effective colour palette out…

PPS on the wishlist – find a way of looking at edges going from nodes in one group to another group; I think the Gephi MASK operator may do this but I don’t have a clue how it works… A newt function to take two lists and just plot connections between members of the separate lists could be quite handy though, so I’ll save that up for my next train journey:-)
DOH! it’s obvious – just use a UNION filter and get the required attribute filtered nodes:

gephi filter two groups
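Outside Gephi, the "two lists" idea can be sketched in a few lines. This is only an illustration under the assumption that the follow data is a list of (follower, followed) pairs; the function name and the sample handles are hypothetical:

```python
def edges_between(edges, group_a, group_b):
    """Return links that run from a member of one group to a member of the other."""
    group_a, group_b = set(group_a), set(group_b)
    return [(s, t) for s, t in edges
            if (s in group_a and t in group_b) or (s in group_b and t in group_a)]


# Hypothetical sample data: (A, B) means A follows B.
sample_edges = [
    ("mp_1", "mp_2"),        # inside the first group, so ignored
    ("mp_1", "journo_1"),    # crosses from first group to second
    ("journo_2", "mp_2"),    # crosses from second group to first
    ("journo_1", "journo_2"),
]
crossing = edges_between(sample_edges, ["mp_1", "mp_2"], ["journo_1", "journo_2"])
```

Links that stay inside either group are discarded, leaving just the connections across the two lists.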

PPPS just by the by, I found a way of grabbing friends and followers data from protected Tweeps. My understanding of the Twitter API was that this required: a) authentication; b) the requesting party to be a friend of the protected party, but I don’t require any of that. I’m not sure if the data I’m getting is current, or whether it’s a bit stale, but it’s more than I think I should be able to access via the Twitter API?! I’m not going to blog the how-to, though… figure it out yourself;-)

Written by Tony Hirst

January 4, 2011 at 8:49 pm

Posted in Data, Visualisation

Using Gephi to Create Bubble Charts: Exploring Government Tenders

[Elements of this post have been largely deprecated since I drafted it a couple of weeks ago, but I'm posting it anyway because this is my open notebook, and as such it has a role in logging the things that are maybe dead ends, as well as hopefully more useful stuff...]

At its heart, Gephi is an environment for visualising graphs (or networks) in which “nodes” are connected to each other by “edges”. Nodes are represented using circles whose size and colour represent particular characteristics of the node. So for example, if you were to visualise your Facebook friends, a node might represent a particular friend, the size of the node might be proportional to the number of friends they have, and the colour to how many photos they have uploaded. Lines between nodes would then show who is a friend of whom. But must we always add the lines between the nodes? If we leave them out, can we effectively use Gephi as a tool for generating charts like the Many Eyes bubble charts?

Many Eyes Bubble Chart

One of the data import formats supported by Gephi is the gdf format (gdf documentation), which expects a list of node definitions, followed by a list of edge connections. If we ignore the edges, then we can just import a set of node definitions, and create a bubble chart.

As an example of this, let’s see what we can do with some of the Transparency in procurement and contracting information released by the Cabinet Office. As part of the data release, they publish a CSV file containing a summary of all the tender documents held:

Gov procurement tender docs

Looking at the data, we see that each tender is represented by one or more documents. Each row in the CSV file gives us information about the tender (its project ID, originating department, expected value, expected duration) as well as the particular document. So if we view each tender as a “bubble” or node in Gephi, we might want to represent it as follows:

nodedef> name VARCHAR,label VARCHAR, procid VARCHAR, estVal DOUBLE,estDur DOUBLE,date VARCHAR, dept VARCHAR, desc VARCHAR, nature VARCHAR
402846,"Spring Electoral Events Contact Centre",402846,125000,48,"17/09/2010","Central Office of Information","Invitation to Tender","Competition as part of an existing framework agreement"
"2010CMTLSE00001","Supply of body armour to HMCS","2010CMTLSE00001",250000,48,"24/09/2010","Ministry of Justice","RFQ instructions","Competition as part of an existing framework agreement"

Note that the GDF file requires a particular sort of header, followed by CSV rows of data. It’s easy enough in this case to simply edit the original CSV file by deleting the first line, tweaking the column headers to the name VARCHAR, label VARCHAR… format required by the GDF file, and prefixing the new first row (the header row) with nodedef>.
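The same header rewrite can be scripted rather than done by hand. The following is a rough sketch only: the CSV column headers used here are hypothetical stand-ins (the real file's headers may differ), and it assumes the standard CSV quoting produced by Python's csv module is acceptable to the GDF importer:

```python
import csv
import io


def csv_to_gdf_nodes(csv_text, columns):
    """Turn CSV text into a node-only GDF string.

    columns is a list of (csv_header, gdf_name, gdf_type) tuples;
    the first entry should be the column used as the node name.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    header = ",".join("%s %s" % (name, typ) for _, name, typ in columns)
    out.write("nodedef> " + header + "\n")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([row[src] for src, _, _ in columns])
    return out.getvalue()


# Hypothetical CSV headers; the two data rows reuse values from the example above.
sample_csv = (
    "Project ID,Title,Estimated Value\n"
    "402846,Spring Electoral Events Contact Centre,125000\n"
    "2010CMTLSE00001,Supply of body armour to HMCS,250000\n"
)
column_map = [
    ("Project ID", "name", "VARCHAR"),
    ("Title", "label", "VARCHAR"),
    ("Estimated Value", "estVal", "DOUBLE"),
]
gdf_text = csv_to_gdf_nodes(sample_csv, column_map)
```

Writing gdf_text to a file with a .gdf extension gives a node-only file of the kind described above.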

However, I’ve recently started exploring the use of the browser based desktop application Google Refine (formerly Freebase Gridworks) as a step in my workflow for tidying up CSV data and then getting it into the GDF format.

Loading a CSV file into gridworks

Here’s what it looks like once the data has been imported:

Data in gridworks

(For a great overview of what Gridworks allows you to do with data, see @jenit’s Using Freebase Gridworks to Create Linked Data.)

The data I want to visualise in Gephi relates to the current tenders, rather than anything to do with the actual documents, so we can use Gridworks to simplify the data set by deleting the document type, document name and contact email columns. We can also check that columns using a restricted vocabulary (e.g. the type of tender being offered), do so consistently. For example, if we look at the Nature of the Tender Process column, and select Cluster and Edit…:

Clustering in Gridworks

we can see that there may be the odd typo that we can correct automatically:

Tidying data in Gridworks

The Description column also has various categories we can tidy up:

Tidying data in Gridworks

Here are the data tidying steps I’ve applied:

Data cleaning steps in Gridworks

At the time of writing, a new version of Google Refine/Gridworks is about to be released. In the version I’m using, I don’t think it’s possible to remove duplicate rows, which we have aplenty in my tidied up dataset (where several documents were listed for a tender, there are now several identical rows in my dataset). [Google Refine 2.0 is out now - and I don't think it can de-dupe?] However, I happen to know that Gephi will ignore duplicates of nodes loaded into Gephi, so we can do the de-dupe/de-duplication step there…

To generate the GDF file, we need to create a header line, and then define the output pattern for each row. We can do this using Gridworks’ Templating support:

Gridworks templating

Here’s how I define the output document:

Preparing the export template in gridworks

Note that the linebreaks will need removing in order to generate the correct output format. Also, in the version of Gridworks I’m using, it’s worth noting that whenever you run the template, you’re returned to the main data window and your template definition is lost… (so before running the template code, grab a copy of it into a text editor just to be safe;-)

When you export the data, it’s exported to your browser downloads directory, as a text file. Change the filetype from .txt to .gdf and import it into Gephi:

DeDupe in Gephi

You’ll see that Gephi has detected the duplicate rows based on common name elements (that is, common project IDs), and ignored the copies/duplicates.
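If you would rather drop the duplicates before the data reaches Gephi, a keep-the-first-row approach is easy to sketch. This is an assumption about what "ignoring the copies" amounts to, not a description of Gephi's importer; the function name and the dictionary keys are hypothetical:

```python
def dedupe_rows(rows, key):
    """Keep the first row seen for each value of row[key], in original order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# After the document columns are removed, rows for the same tender are identical.
sample_rows = [
    {"procid": "402846", "label": "Spring Electoral Events Contact Centre"},
    {"procid": "402846", "label": "Spring Electoral Events Contact Centre"},
    {"procid": "2010CMTLSE00001", "label": "Supply of body armour to HMCS"},
]
unique_rows = dedupe_rows(sample_rows, "procid")
```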

Now we can view the procurement data using proportional symbol visualisations – here I size the nodes by estimated value (and display the label size in proportion to node size), and colour the nodes according to estimated duration:

Procurement in gephi

[Since drafting this post, I've found a far better way of getting just the node data into Gephi - load it into the data table as a node table. I'll post more on how to do this in a follow on post...]

(The Many Eyes take on Bubble Charts ignores x/y co-ordinates as useful data, although other definitions of Bubble Charts include x/y location as important factors. In the current example, I allow Gephi to lay out the nodes/bubbles. However, we can define x and y co-ordinates in the gdf file if we want to specifically locate the bubbles on the canvas.)

We can also use Gephi to cluster the data according to calling departments, or type of procurement exercise:

Clustering data in gephi

I *think* the size of the resulting bubble is proportional to the sum of the values used to inform the node size of the original components, so we should be able to group by procurement exercise type and have the bubble size be proportional to the sum of the estimated values of the procurement projects in that procurement class.
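One way to check that guess is to compute the per-group totals independently and compare them with the bubble sizes. A small sketch, assuming each tender is a dictionary with a grouping field and a numeric value (the key names are hypothetical; the two rows reuse figures from the GDF example earlier in the post):

```python
from collections import defaultdict


def total_by_group(rows, group_key, value_key):
    """Sum row[value_key] for each distinct value of row[group_key]."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


sample_rows = [
    {"dept": "Central Office of Information", "estVal": 125000,
     "nature": "Competition as part of an existing framework agreement"},
    {"dept": "Ministry of Justice", "estVal": 250000,
     "nature": "Competition as part of an existing framework agreement"},
]
by_nature = total_by_group(sample_rows, "nature", "estVal")
by_dept = total_by_group(sample_rows, "dept", "estVal")
```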

We can also expand a clustered node to see what activity is related to it – in this case, here are tenders from the British Library:

Exploding a cluster in Gephi

Going back to the full list, here we size by estimated value and colour by type of procurement:

category colouring in gephi

We can also generate views over the data using filters – so for example, COI sponsored procurement:

Filtering in gephi

One thing that Gephi doesn’t currently support is a treemap style visualisation. However, now we have deduped the data by importing it into Gephi, we can export it as a simple CSV file from the datatable view, and then upload the data to Many Eyes and make use of its treemap:

Gephi datatable export

We use TSV because that is the preferred format for Many Eyes… (data file on Many Eyes)

Importing data into Many Eyes

Here’s one configuration of the treemap:

Procurement treemap

With the data in Many Eyes, we can of course easily generate other views over it, such as a histogram:

Many Eyes histogram

(NB in the original data, the Estimated value column – which should contain numbers – also contained a few unknown elements:

More data cleaning in gridworks

Because code that expects numbers sometimes chokes on text, I should maybe have set the unknown values to a default value as shown above?)
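The same clean-up can be sketched in code. The choice of 0 as the default is an assumption made for illustration (whether a zero, or dropping the row, is the right treatment depends on the chart), and the function name is hypothetical:

```python
def to_number(value, default=0.0):
    """Convert a cell to a float, falling back to a default for non-numeric text."""
    try:
        # Strip surrounding whitespace and thousands separators before converting.
        return float(str(value).replace(",", "").strip())
    except ValueError:
        return default


cleaned = [to_number(v) for v in ["125000", "unknown", " 1,250 ", ""]]
```

Here the text cells become 0.0, while the numeric cells come through as floats.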

Okay – so what have we covered in this post?
- how to start cleaning/preparing data in Freebase Gridworks/Google Refine;
- how to use Freebase Gridworks/Google Refine to generate an output file according to a template;
- how to use Gephi to deduplicate data based on a common field (in this case, the project id);
- how to use Gephi as a proportional symbol/bubble chart visualisation tool;
- how to export data from Gephi and upload it into Many Eyes;
- how to use Many Eyes to generate a treemap.

As ever, this blog post took longer to write than it took me to work through the exercise originally.

Written by Tony Hirst

November 19, 2010 at 10:00 am

Posted in Data

data.open.ac.uk Linked Data Now Exposing Module Information

with 4 comments

As HE becomes more and more corporatised, I suspect we’re going to see online supermarkets appearing that help you identify – and register on – degree courses in exchange for an affiliate/referral fee from the university concerned. For those sites to appear, they’ll need access to course catalogues, of course. UCAS currently holds the most comprehensive one that I know of, but it’s a pain to scrape and all but useless as a datasource. But if the universities publish course catalogue information themselves in a clean way (and ideally, a standardised way), it shouldn’t be too hard to construct aggregation sites ourselves…

So it was encouraging to see earlier this week an announcement that the OU’s data.open.ac.uk site has started publishing module data from the course catalogue – that is, data about the modules (as we now call them – they used to be called courses) that you can study with the OU.

The data includes various bits of administrative information about each module, the territories it can be studied in, and (most importantly?!) pricing information;-)

data.open.ac.uk - module data

You may remember that the data.open.ac.uk site itself launched a few weeks ago with the release of Linked Data sets including data about deposits in the open repository, as well as OU podcasts on iTunes (data.open.ac.uk Arrives, With Linked Data Goodness). Where podcasts are associated with a course, the magic of Linked Data means that we can easily get to the podcasts via the course/module identifier:

data.open.ac.uk

It’s also possible to find modules that bear an isSimilarTo relation to the current module, where isSimilarTo means (I think?) “was also studied by students taking this module”.

As an example of how to get at the data, here’s a Python script using the Python YQL library that lets me run a SPARQL query over the data.open.ac.uk course module data (the code includes a couple of example queries):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

endpoint='http://data.open.ac.uk/query'

# This query finds the identifiers of postgraduate technology courses that are similar to each other
q1='''
select distinct ?x ?z from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z
} limit 10
'''

# This query finds the names and course codes of 
# postgraduate technology courses that are similar to each other
q2='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

# This query finds the names and course codes of 
# postgraduate courses that are similar to each other
q3='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

result=run_sparql_query(q3, endpoint)

for row in result.rows:
	for r in row['result']:
		print r

I’m not sure what purposes we can put any of this data to yet, but for starters I wondered just how connected the various postgraduate courses are based on the isSimilarTo relation. Using q3 from the code above, I generated a Gephi GDF/network file using the following snippet:

# Generate a Gephi GDF file showing connections between
# modules that are similar to each other
fname='out2.gdf'

with open(fname,'w') as f:
    # Node definitions: write each module code once, along with its title
    f.write('nodedef> name VARCHAR, label VARCHAR, title VARCHAR\n')
    ccodes=set()
    for row in result.rows:
        for r in row['result']:
            pairs=((r['code1']['value'],r['name1']['value']),
                   (r['code2']['value'],r['name2']['value']))
            for code,name in pairs:
                if code not in ccodes:
                    ccodes.add(code)
                    f.write(code+','+code+',"'+name+'"\n')

    # Edge definitions: one edge from each module (?x) to a similar module (?z)
    f.write('edgedef> c1 VARCHAR, c2 VARCHAR\n')
    for row in result.rows:
        for r in row['result']:
            f.write(r['code1']['value']+','+r['code2']['value']+'\n')

to produce the following graph. (Size is out degree, colour is in degree. Edges go from ?x to ?z. Layout: Fruchterman Reingold, followed by Expansion.)

OU postgrad courses in gephi
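For reference, the two degree measures used for sizing and colouring are simple counts over the edge list. A minimal sketch, assuming edges are (source, target) pairs; the module codes below are placeholders rather than real query results:

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    out_degree = Counter(s for s, _ in edges)
    in_degree = Counter(t for _, t in edges)
    return in_degree, out_degree


# Placeholder module codes; an edge (x, z) means "x isSimilarTo z".
sample_edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
in_degree, out_degree = degrees(sample_edges)
```

A Counter returns 0 for any node that never appears as a source (or target), so nodes with no outgoing or incoming edges need no special handling.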

The layout style is a force directed algorithm, which in this case has had the effect of picking out various clusters of highly connected courses (so for example, the E courses are clustered together, as are the M courses, B courses, T courses and so on.)

If we run the ego filter over this network on a particular module code, we can see which modules were studied alongside it:

ego filter on course codes
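The idea behind an ego filter can be sketched as follows. This is an illustration rather than Gephi's implementation: it assumes neighbours are found by following links in either direction, and the function name and module codes are placeholders:

```python
def ego_network(edges, ego, depth=1):
    """Return (nodes, edges) within `depth` hops of `ego`, ignoring link direction."""
    neighbours = {}
    for s, t in edges:
        neighbours.setdefault(s, set()).add(t)
        neighbours.setdefault(t, set()).add(s)
    keep = {ego}
    frontier = {ego}
    for _ in range(depth):
        # Step outwards one hop, skipping nodes that are already included.
        frontier = {n for f in frontier for n in neighbours.get(f, ())} - keep
        keep |= frontier
    return keep, [(s, t) for s, t in edges if s in keep and t in keep]


# Placeholder module codes.
sample_edges = [("X1", "X2"), ("X2", "X3"), ("X4", "X1"), ("Y1", "Y2")]
ego_nodes, ego_edges = ego_network(sample_edges, "X1")
```

With depth=1 only the chosen module and its immediate neighbours survive; raising depth pulls in neighbours of neighbours.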

Note that in the above diagram, the nodes are sized/coloured according to in-degree/out-degree in the original, complete graph. If we re-calculate those measures on just this partition, we get the following:

Recoloured course network

If we return to the whole network, and run the Modularity class statistic, we can identify several different course clusters:

Modules - modularity class

Here’s one of them expanded:

A module cluster

Here are some more:

Course clusters

I’m not sure what use any of this is, but if nothing else, it shows there’s structure in that data (which is exactly what we’d expect, right?;-)

PS as to how I wrote my first query on this data, I copied the ‘postgraduate modules in computing’ example query from data.open.ac.uk:

http://data.open.ac.uk/query?query=select%20distinct%20%3Fx%20%0Afrom%20%3Chttp://data.open.ac.uk/context/course%3E%0Awhere%20{%3Fx%20a%20%3Chttp://purl.org/vocab/aiiso/schema%23Module%3E.%0A%3Fx%20%3Chttp://data.open.ac.uk/saou/ontology%23courseLevel%3E%20%3Chttp://data.open.ac.uk/saou/ontology%23postgraduate%3E.%0A%3Fx%20%3Chttp://purl.org/dc/terms/subject%3E%20%3Chttp://data.open.ac.uk/topic/computing%3E%0A}%0A&limit=200

and pasted it into a tool that “unescapes” encoded URLs, which decodes the SPARQL query:

Unescaping text

I was then able to pull out the example query:
select distinct ?x from <http://data.open.ac.uk/context/course>
where {?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/computing>
}
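The unescaping step can also be done with Python's standard library instead of an online tool. A short sketch using only the opening fragment of the encoded query shown above (Python 3's urllib.parse; the variable names are arbitrary):

```python
from urllib.parse import quote, unquote

# Opening fragment of the percent-encoded example query.
encoded = ("select%20distinct%20%3Fx%20%0Afrom%20"
           "%3Chttp://data.open.ac.uk/context/course%3E")

decoded = unquote(encoded)          # %20 -> space, %3F -> ?, %0A -> newline, ...
tidy = " ".join(decoded.split())    # collapse newlines and repeated spaces

# Going the other way: re-encode a query for use in a URL,
# leaving ':' and '/' unescaped as in the original link.
re_encoded = quote(tidy, safe=":/")
```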

Just by the by, there’s a host of other handy text tools at Text Mechanic.

Written by Tony Hirst

November 17, 2010 at 11:28 am

Friends of the Community: Who’s Effectively Following a Hashtag

with 2 comments

Picking up on @briankelly’s Thoughts on ILI 2010, where he reports on a few gross level stats about #ili2010 hashtag activity grabbed from Summarizr, here are a few things I observed from looking at some of the hashtag community network stats…

To start with, I looked at the “inner hashtag community” where I grab the list of hashtaggers and their friends who have also used the hashtag and make links between them to give this sort of graph, as used before in many OUseful.info posts:

ILI2010 hashtaggers

(Directed graph from person to friend (i.e. to person they follow); node size proportional to in-degree, heat to out-degree.)

After running a few network statistics generated using Gephi, and exporting the data from the Gephi Data Table view, I uploaded the statistics data to IBM’s ManyEyes site here. This allows us to view the distribution of the hashtaggers based on various statistical and network measures using a range of other visualisation techniques, such as histograms (view the interactive histogram chart for ILI2010 hashtaggers, or the interactive scatterplot).

So for example, here’s the distribution of hashtaggers by total number of followers (that is, including followers outside the hashtag community) as a histogram:

ILI2010 hashtaggers - total numbers of followers

If we look at the betweenness measure, which was calculated over the friends connections between the hashtaggers, we can see who’s best suited to getting a message broadcast across the community through direct and friend-of-a-friend links:

ILIhashtaggers - inner friends betweenness

If we look at the in-degree (the number of people in the hashtag community who have friended, i.e. are following, an individual) divided by the total number of friends of that individual, we can identify people who are being followed by more people in the community than they have as friends:

ILI2010 hashtaggers - in-degree divided by total friends

If we look at the in-degree divided by a user’s total number of followers, we can see the extent to which a person’s twitter feed is dominated by updates from folk who have used the ILI2010 hashtag:

ILI2010 hashtaggers - extent to which stream is dominated by hashtaggers

In the above case, we see one person who appears to only follow members of the ILI2010 hashtag community. (I’m guessing that if folk come to twitter through a conference, this might be a signature of that?) Before you get too excited though, a little more digging suggests that that person only follows 1 person;-)

The interactive scatterplot allows us to view 3 dimensions of data – in the following case, I’m looking for well connected (good betweenness centrality), well respected (high in-degree) folk in the hashtag community who also have a large reach in terms of their total number of followers:

ILI2010 hashtaggers - scatterplot

In terms of audience development, we can also create a network based on the complete follower lists of the ILI2010 hashtaggers. Creating such a graph generates a network with 71627 nodes, of which 236 were hashtaggers – meaning that in principle 71,391 people outside the hashtag community might have seen an ILI2010 hashtagged tweet…

Using a directed graph from hashtaggers to their followers, if we filter the graph to only show individuals with an in-degree above 60, say, we can see those people who are following at least 60 people who have used the hashtag:

ILI2010 hashtagger followers

In the way I have constructed this graph, the nodes showing Twitter usernames are in the hashtag community, the numerical IDs are individuals who didn’t use the ILI2010 hashtag but who do follow at least 60 people who did, and therefore presumably saw quite a lot of tweets about the event.

Looking up the twitter IDs of the “friends of the hashtag community”, we see the following people did not use the hashtag over the sample period, but do follow lots of people who did: @ijclark, @aekins, @metalibrarian, @schammond, @Jo_Bo_Anderson, @research_inform, @tomroper, @facetpublishing, @DavidGurteen
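The filtering step can be sketched without Gephi. This assumes the follower lists have been gathered into a dictionary keyed by hashtagger; the function name, the dictionary layout and the sample IDs are all hypothetical:

```python
from collections import Counter


def friends_of_community(followers_of, min_followed):
    """Find non-hashtaggers who follow at least `min_followed` hashtaggers.

    followers_of maps each hashtagger to an iterable of their followers.
    """
    hashtaggers = set(followers_of)
    followed_count = Counter()
    for tagger in hashtaggers:
        # Each hashtagger contributes at most one to a follower's in-degree.
        for follower in set(followers_of[tagger]):
            followed_count[follower] += 1
    return {user: n for user, n in followed_count.items()
            if n >= min_followed and user not in hashtaggers}


# Hypothetical sample: numeric IDs stand for people who did not use the hashtag.
sample = {
    "tagger1": ["1001", "1002", "tagger2"],
    "tagger2": ["1001"],
    "tagger3": ["1001", "1002"],
}
watchers = friends_of_community(sample, min_followed=2)
```

With the real data, a threshold of 60 would pick out the same sort of "friends of the hashtag community" as the filtered graph above.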

Of course, to know the extent to which hashtagger activity dominates the twitterstream of this “friends of the hashtag community”, we’d need to normalise this against their total number of friends; because, for example, if I follow 20k people, of which 60 were hashtaggers, I’d probably miss most of the hashtagged tweets; whereas, if I follow 100 people, of which 60 are hashtaggers, the density of tweets received from hashtaggers could be expected to be quite high.
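The normalisation is just a ratio, shown here with the two cases from the paragraph above. It is a crude proxy, since it assumes everyone followed tweets at a similar rate; the function name is hypothetical:

```python
def hashtagger_share(hashtaggers_followed, total_friends):
    """Fraction of the accounts someone follows that used the hashtag."""
    if total_friends <= 0:
        return None  # nothing sensible to report for someone who follows nobody
    return hashtaggers_followed / total_friends


wide_net = hashtagger_share(60, 20000)   # 0.003, i.e. 0.3% of friends
narrow_net = hashtagger_share(60, 100)   # 0.6, i.e. 60% of friends
```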

Okay – enough for now… although if you can think of anything else that might be interesting to know about the wider community around the hashtaggers, please post it in a comment below:-)

Written by Tony Hirst

October 18, 2010 at 9:23 am

Posted in Tinkering

Graph Structure of an Open Science Notebook – “Linked Science” FTW…

with 13 comments

Early days on this, but what, if anything, can we learn from looking at the link structure of an open science lab-book, based on the use of hyperlinks between pages in the lab-book?

A couple of days ago, I started informally bouncing ideas around with @cameronneylon about quick wins/low hanging fruit visualisations around his open science notebook (a full description of our conversation – and indeed the whole history of this ad hoc “mini-project” – can be found on Cameron’s blog: A little bit of federated Open Notebook Science). So here are a couple of Gephi takes on the lab-book (original data/scripts can be found from the github links in Cameron’s post.)

The lab notebook identifies different types of post, which can be used to colour the graph:

Lab notebook - colour modules by section type

The network graph also shows the presence of highly linked “procedure” type nodes relating to a particular experimental procedure. If we apply the ego filter to the graph we can get a close look at which posts are connected to a procedure:

Applying a gephi ego filter to a set of linked posts from a lab notebook

If we run the modularity statistic, we can automatically partition the posts into groupings of posts that are linked together – here they are grouped by modularity class:

Different modularity classes

We can expand different class nodes to see the posts associated with them:

Modularity partitions on lab book, partially expanded

Here’s one close up:

Modularity class

If we apply the ego network, we see the modularity cluster does seem to have acted in a meaningful way:

Module identified - ego network applied

Notice though that we lose sight of the internal link structure within that modularity class that was evident in the previous image.

Was that connected node important in some way?

Close up of internal structure of connected node in modularity class

With his intimate knowledge of the experiments recorded in the lab book, Cameron also observed that Gephi has (largely) successfully clustered the correct posts together [according to protein classification] and thus separated the purifications from each other based only on connectivity. This suggests that even if posts aren’t explicitly tagged by a particular experiment, the link analysis may be useful in finding posts that are related to a particular experiment; in cases where a post is included in one group and links out to another, it may indicate some sort of relationship between the separate clusters, such as a shared reagent.

So why might the visualisation of the whole notebook be a useful thing to do? My take on it is that the visualisation acts as a macroscope.

As Jonathan Schull put it in the Macroscope Manifesto:

Most natural patterns are not easily perceived, for they do not happen to produce lasting stimuli to which our nervous systems are attuned. But everything we know about biology, epidemiology, social networks, computational algorithms and data structures, tells us that branching patterns are “out there”, waiting to be mapped, illuminated, seen-anew. In the last few decades new data sources, new data-analytic tools, and new tracking techniques have become available to scientists and school children. It is now possible to envision a “macroscope” that present these invisible but ubiquitous patterns to human perceptual systems so that they would engage our innate ability to perceive millions of leaves as scores of trees…and a forest

For me, Gephi can act as a macroscope in the way it reveals structure from across the whole of Cameron’s open science lab book in a single image, and allows us to interrogate the lab book from a variety of perspectives in an interactive way.

The approach is amenable to displaying structures aggregated from across multiple blogs, as long as they link to each other. It may also serve to identify related processes, as for example when modularity clusters are connected by one or more links.

And what might this suggest as a baby next step for Open Notebook Science? Well, I can’t help thinking that maybe open Lab Notebooks should also be publishing their link graph, with URI referenced external links as well as internal links included… then we can create some big graphs across notebooks and start to see what might fall out…
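As a very rough idea of what publishing a link graph might involve, here’s a minimal sketch that pulls the internal and external links out of a single post using only the Python standard library. The notebook URL, the sample HTML and the names LinkCollector and post_links are all placeholders I’ve made up for illustration; a real notebook would presumably need its own handling for feeds, pagination and so on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def post_links(post_url, html):
    """Return (internal, external) lists of absolute URLs linked from a post."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(post_url).netloc
    internal, external = [], []
    for href in parser.hrefs:
        # Resolve relative links against the post's own URL.
        target = urljoin(post_url, href)
        if urlparse(target).netloc == host:
            internal.append(target)
        else:
            external.append(target)
    return internal, external


# Hypothetical notebook post; the URLs are placeholders, not real pages.
html = ('<p><a href="/post/2">next step</a> uses '
        '<a href="http://reagents.example.org/x">reagent x</a></p>')
internal, external = post_links("http://notebook.example.org/post/1", html)
```

Each (post, target) pair is then an edge in the link graph, with the internal/external split saying whether the edge stays inside the notebook.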

Linked Science FTW ;-)

PS One thing that may or may not be missing from the above – links to a video demonstrating each procedure, if appropriate, on a visual protocols site. Just by the by, here’s a Google custom search engine I created some time ago that implements a Science Experimental Protocols Videos meta-search engine. (It doesn’t turn up anything for /Purification of sortase/ though;-(

Written by Tony Hirst

October 3, 2010 at 10:42 pm

Initial Thoughts on Profiling @dirdigeng’s Friends Network on Twitter

with 6 comments

Last week, Andrew Stott, Director of Digital Engagement in the Cabinet Office, announced his retirement date over Twitter:

dirdigeng retires...

At the time of writing, @dirdigeng follows slightly over two thousand folk on Twitter, so I thought I’d have a quick look at who the “players” are…

The network described is constructed as follows:

- nodes represent the people followed by @dirdigeng on Twitter;
- a directed edge from A to B means that A is following B.
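A minimal sketch of that construction, assuming the friends lists have already been grabbed from Twitter into a dictionary (this isn’t the code used for the post; the names build_friend_graph, in_degree and friends_of, and the toy handles, are made up for illustration):

```python
def build_friend_graph(members, friends_of):
    """Return directed edges (a, b), meaning 'a follows b', keeping only
    edges where both ends are in the member set."""
    members = set(members)
    edges = set()
    for a in members:
        for b in friends_of.get(a, ()):
            if b in members and b != a:
                edges.add((a, b))
    return edges


def in_degree(members, edges):
    """Count, for each member, how many other members follow them."""
    counts = {m: 0 for m in members}
    for _, b in edges:
        counts[b] += 1
    return counts


# Toy data: hypothetical handles, not real accounts.
members = ["alice", "bob", "carol"]
friends_of = {
    "alice": ["bob", "carol", "outsider"],  # "outsider" is dropped
    "bob": ["carol"],
    "carol": ["alice"],
}
edges = build_friend_graph(members, friends_of)
indeg = in_degree(members, edges)
```

The in-degree here is just the number of people in the network who follow each node; follows to or from anyone outside the member set are ignored.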

In the first view (randomly laid out, using Gephi), we plot node size as linearly proportional to the number of dirdigeng’s friends who are following each of the other friends (that is, the in-degree of each node), and colour proportional to their total number of followers (including people not friended by @dirdigeng).

dirdigeng friends community

The colour mapping is non-linear – @Number10gov, @guardiantech and @mashable have significantly more followers than the other nodes – and is set via the spline control:

Spline control for rankings in Gephi

If we run the betweenness centrality statistic, and size nodes accordingly, we can see how the various parts of the network may be connected. (“Betweenness centrality is a measure based on the number of shortest paths between any two nodes that pass through a particular node. Nodes around the edge of the network would typically have a low betweenness centrality. A high betweenness centrality might suggest that the individual is connecting various different parts of the network together.”)

dirdigeng's friends on twitter: betweenness centrality
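To make the quoted definition a bit more concrete, here’s a plain Python sketch of betweenness centrality on a directed, unweighted graph, following Brandes’ algorithm. It’s not Gephi’s code, the scores are left unnormalised (so they won’t necessarily match the numbers Gephi reports), and the function name and toy graph are my own assumptions.

```python
from collections import deque


def betweenness(nodes, edges):
    """Unnormalised betweenness centrality on a directed, unweighted graph
    (Brandes' algorithm). edges is an iterable of distinct (a, b) pairs."""
    adj = {n: [] for n in nodes}
    for a, b in edges:
        adj[a].append(b)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths to each node.
        dist = {s: 0}
        paths = {n: 0 for n in nodes}
        paths[s] = 1
        preds = {n: [] for n in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing out path credit.
        credit = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                credit[v] += paths[v] / paths[w] * (1 + credit[w])
            if w != s:
                score[w] += credit[w]
    return score


# Toy example: "a" and "b" can only reach "c" and "d" through "hub".
nodes = ["a", "b", "hub", "c", "d"]
edges = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("hub", "d")]
bc = betweenness(nodes, edges)
```

In the toy graph, all four shortest paths between the outer nodes pass through "hub", so it scores 4.0 and everything else scores 0.0, which is the "connecting different parts of the network" idea in miniature.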

We can also run the modularity class statistic to try to partition the friends into small networks with a high degree of internal connections. Here’s what we get (click through on the image to see it in more detail):

dirdigeng friends - modularity classes

Modularity groups help us understand the structure of the network in a bit more detail. I’ve started to think they might also be used to automatically generate a seeding set of people who form a highly interconnected community with an interest in a particular topic and from a particular stance.
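For what it’s worth, Gephi’s modularity statistic is, as far as I know, based on the Louvain community detection method, which searches for a partition that maximises a modularity score Q. The sketch below doesn’t do the searching; it just calculates Q for a given partition, treating the graph as undirected and unweighted (a simplification, since the follow graph is directed). The function name and toy graph are assumptions of mine.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (a, b) pairs, each undirected edge listed once.
    community: dict mapping each node to a community label.
    """
    m = len(edges)
    internal = {}  # number of edges with both ends in the same community
    degree = {}    # total degree of the nodes in each community
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {n: 0 for n in split}
q_split = modularity(edges, split)    # one community per triangle
q_lumped = modularity(edges, lumped)  # everything in one community
```

Splitting the two triangles into separate groups scores higher than lumping everything together, which is the sense in which a modularity class has "a high degree of internal connections".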

As well as looking at the structure of the network, we can also create a search engine over the home pages declared in the Twitter bios of @dirdigeng’s friends. My thinking here is that this might provide a useful constrained search engine over sites engaged in social media and with an interest in “Digital Britain”.

The simplest custom search engine simply uses the URLs from the Twitter bios of folk followed by @dirdigeng and adds them to a “Digital Britain” Google Custom search engine. However, one attractive feature of the Google CSEs is that you can also tweak the rankings by weighting results from different domains differently to give a “weighted” custom search engine.

As a quick experiment, I produced one weighted search engine where I set the score for each domain to be the normalised number of followers amongst @dirdigeng’s friends community. (That is, the domain score equalled the in-degree of a node in the @dirdigeng friends network, divided by the total number of people in that network.)

Custom search - Digital Britain

Weighted and unweighted dirdigeng friends CSEs

As you can see from the above, the results differ… Whether there is any improvement in the ranking of results is another thing. (There is also the question of how best to score, or boost, rankings based on network statistics, and the extent to which rankings should be determined by friends network factors…)
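The normalised score used for the weighted engine (in-degree divided by the number of people in the network) could be calculated along these lines. This is only a sketch: the function name, the homepage dictionary and the example domains are made up, it assumes one person per domain, and it says nothing about how the scores are then loaded into the CSE.

```python
def domain_scores(edges, homepage):
    """Score each friend's home page domain by normalised in-degree.

    edges: (a, b) pairs meaning 'a follows b' within the friends network.
    homepage: dict mapping every friend to a domain, or None if no URL.
    """
    total = len(homepage)
    indegree = {person: 0 for person in homepage}
    for _, b in edges:
        indegree[b] += 1
    # Assumes each domain belongs to a single person in the network.
    return {
        domain: indegree[person] / total
        for person, domain in homepage.items()
        if domain
    }


# Toy network of four friends; handles and domains are placeholders.
homepage = {
    "alice": "alice.example.org",
    "bob": "bob.example.com",
    "carol": None,  # no URL in the bio, so no domain to score
    "dave": "dave.example.net",
}
edges = [("alice", "bob"), ("carol", "bob"), ("dave", "bob"), ("bob", "alice")]
scores = domain_scores(edges, homepage)
```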

It also strikes me that the modularity groups might be used to inform the setup of a CSE. For example, separate modularity groups/classes may be used to define refinement labels, allowing users to just search pages from members of a particular modularity class, or boost the results from those people.

And finally, I wonder whether we can mine the tweets of @dirdigeng’s friends, as well as those of @dirdigeng, to provide raw material for additional advice for searchers?

Written by Tony Hirst

September 27, 2010 at 2:35 pm

PLENK2010 – Twitter Clusters

with 12 comments

Playing around with looking at the structure of my own Twitter friends network (see recent previous posts) by using the Gephi modularity statistic to partition (or cluster) my Twitter network depending on the strengths of connections between members of that network, it struck me that I could take a similar approach to exploring the structure of the relations between the members of a Twitter list. So I grabbed the members of the PLENK2010 list (which I had automatically created by mining the Twapperkeeper archive of posts tagged with PLENK2010, and then adding frequent hashtaggers to the list), grabbed all their friends lists, and had a poke around the friends connections between the list members.

The Gephi modularity tool identified three medium sized clusters, one large cluster, and several smaller ones. Looking at the three middle sized clusters, let’s see who’s in each cluster, where they’re from (from their Twitter location info) and what their interests are (from their Twitter bio field).

Here’s the first cluster:

Plenk2010 - twitter cluster

PLENK2010 - location cluster

first geo interest cluster

Here’s the second cluster:

PLENK2010 twitter cluster

Another PLENK2010 location cluster

UK cluster interests

And here’s the third:

PLENK2010 twitter cluster

A third PLENK2010 geo cluster

PLENK2010 German cluster

Not surprisingly, it seems as if geography still plays a role in defining networks…

There was also a large cluster identified in the original pass:

PLENK2010 twitter cluster

Here’s what they’re interested in:

PLENK2010 interests

And here’s where they’re from:

PLENK2010 - big cluster locale

Here’s what happens if we partition that large cluster by running the modularity tool over just the members of this cluster again:

PLENK2010 twitter community - tunnelling in

Do they make any sort of sense…?

So is this:

a) interesting?
b) useful?

If it’s useful – why? What can we do with this information?

Written by Tony Hirst

September 23, 2010 at 12:46 pm

Posted in Uncourse, Visualisation


Digging Deeper into the Structure of My Twitter Friends Network: Librarian Spotting

with 2 comments

A couple of days ago, I grabbed the Twitter friends lists of all my Twitter friends (that is, lists of all the people that the people I follow on Twitter follow…) and plotted the connections between them filtered through the people I follow (Small World? A Snapshot of How My Twitter “Friends” Follow Each Other…). That is, for all of the people I follow on Twitter, I plotted the extent to which they follow each other… got that?

Running the resulting network through Gephi’s modularity statistic (some sort of clustering algorithm; I really need to find out which), several distinct clusters of people turned up: OU folk, data journalism folk, ed techies, JISC/Museums/library folk, and open gov data folk.

(Gephi allows you to export the graph file for the current project, including any annotations (such as modularity class) that are added by running Gephi’s statistics. Extracting the list of nodes (i.e. Twitter users), and filtering them by modularity class means we can create separate lists of individuals based on which cluster they appear in; which in turn means that we could generate a Twitter list from those individuals.)
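Here’s a sketch of that filtering step, assuming the node table has been exported as CSV. The column names ("Label", "Modularity Class") and the sample rows are assumptions on my part, so check them against whatever your export actually contains.

```python
import csv
import io
from collections import defaultdict


def lists_by_class(csv_text, class_column="Modularity Class"):
    """Group node labels by the value in the modularity class column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[class_column]].append(row["Label"])
    return dict(groups)


# Stand-in for an exported node table; the handles are made up.
sample = """Id,Label,Modularity Class
1,ou_person_a,0
2,ou_person_b,0
3,datajourn_person,1
"""
groups = lists_by_class(sample)
```

Each value in groups is then a candidate membership for a Twitter list, one list per modularity class.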

From my “curated” list of Twitter friends, we can identify a set of “OU twitterers” through a cluster analysis of the mass action of their own friending behaviour, and I could use this to automatically generate a Twitter list of (potential) OU Twitterers that other people can follow.

Here’s the total set of my friends, coloured by modularity class and sized by in-degree (that is, the number of my friends who follow that person).

My Twitter friends, coloured by modularity class

If we filter on modularity class, we can just look at the folk in what I have labelled “OU Twitterers”. There are one or two folk in there who don’t quite fit this label (e.g. University of Leicester folk, and a handful of otherwise “disconnected” folk…), but it’s not bad.

OU Twitterers

Note that if I grab the complete friends and followers lists of these individuals, and look for users who are commonly followed, who also tend to follow back, and who don’t have huge numbers of followers (i.e. they aren’t celebrities who automatically follow back…) I may discover other OU Twitterers that I don’t follow…
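One way that search might be sketched, assuming the friends lists and follower counts are already to hand. The thresholds (followed by at least two cluster members, no more than 5000 followers) are arbitrary guesses, and the function and variable names are invented for the example.

```python
def candidate_members(cluster, friends_of, follower_count,
                      min_followed_by=2, max_followers=5000):
    """Suggest accounts outside the cluster that look like they belong.

    cluster: set of known members.
    friends_of: dict mapping each account to the set of accounts it follows.
    follower_count: dict mapping each account to its total follower count
        (accounts missing from the dict are treated as having none).
    """
    # For each outsider, record which cluster members follow them.
    fans_in_cluster = {}
    for member in cluster:
        for friend in friends_of.get(member, ()):
            if friend not in cluster:
                fans_in_cluster.setdefault(friend, set()).add(member)
    candidates = []
    for account, fans in fans_in_cluster.items():
        # Members who follow the account AND are followed back by it.
        reciprocated = fans & set(friends_of.get(account, ()))
        if (len(fans) >= min_followed_by
                and reciprocated
                and follower_count.get(account, 0) <= max_followers):
            candidates.append(account)
    return sorted(candidates)


# Toy data with made-up handles.
cluster = {"a", "b", "c"}
friends_of = {
    "a": {"b", "x", "celeb"},
    "b": {"a", "x", "celeb"},
    "c": {"y", "celeb"},
    "x": {"a"},
    "y": {"c"},
    "celeb": {"a", "b", "c"},
}
follower_count = {"x": 300, "y": 50, "celeb": 2000000}
found = candidate_members(cluster, friends_of, follower_count)
```

In the toy data only "x" gets through: "y" is followed by just one member, and "celeb" follows everyone back but has far too many followers.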

If we run the modularity stat over this group of people, the “OU Twitterers” (most easily done by generating a new workspace from the filtered group), we see three more partitions fall out. Broadly, this first one corresponds to OU Library folk (ish…):

OU Library folk...

Twitterers from my faculty (several of whom rarely, if ever, tweet):

Twitterers I follow in my faculty

And the rest (the vast majority, in fact):

OU folk

(Note that a couple of folk are completely disconnected, and have nothing to do with the OU…)

Running the modularity class over this larger group turns up nothing of interest.

So… so what? So this. Firstly, I can mine the friends lists of the friends of arbitrary people on Twitter and pull out clusters that may tell me something about the interests of those people. (For example, we might grab their Twitter biography statements and run them through a word cloud as a first approximation; or grab their recent tweets and do some text mining on that to see if there is any common interest. Hashtag analysis might also be revealing…) Secondly, we could use the members of a cluster to act as a first approximation for a list of connected members of a community interested in a particular topic area; for these community members we could then pull down lists of all their friends and followers and look to see if we can grow the list through other commonly connected to individuals.
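The word cloud idea boils down to counting words across the bios of everyone in a cluster. A minimal sketch, with a tiny made-up stop-word list and invented bios standing in for the real thing:

```python
import re
from collections import Counter

# A tiny, arbitrary stop-word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "the", "to", "for", "i", "my"}


def bio_word_counts(bios):
    """Count lower-cased words across a list of bio strings."""
    counts = Counter()
    for bio in bios:
        # Keep # and @ so hashtags and handles survive as single tokens.
        for word in re.findall(r"[a-z0-9#@']+", bio.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


# Invented bios for one cluster.
bios = [
    "Librarian at a university library",
    "Digital library developer and librarian",
    "Lecturer in computing",
]
counts = bio_word_counts(bios)
top = counts.most_common(2)
```

The most common words ("librarian" and "library" in this toy case) give a crude label for the cluster; the same counting could be run over recent tweets or hashtags instead.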

PS after tweeting the original post, a couple of people asked if I could grab the data from their friends lists. For example, @neilkod’s turned up clusters relating to “Utah tweeps, my cycling ones, and of course data/#rstats.” So the approach appears to work in general…:-)

Written by Tony Hirst

September 23, 2010 at 8:48 am

Posted in OU2.0, Visualisation


Small World? A Snapshot of How My Twitter “Friends” Follow Each Other…

with 5 comments

I’m now following about 500 or so people on Twitter, but to what extent are they following each other? Are there any noticeable subgroups in the folk I follow, by virtue of them being highly linked to each other in the friends and following stakes?

How my twitter friends are interconnected - size is # of my friends following my friends, colour is # of my friends they follow

Each of the nodes represents one of my Twitter friends (that is, each node represents a separate person I follow on Twitter).

Node size is proportional to the number of my friends who are following that person.

Node colour is proportional to the number of my friends that person is following (blue is cold – low number; red is hot – high number).

The graph is an indication of the extent to which the people I follow (that is, my friends…) form an echo chamber…

Running the Gephi “connected components” statistic, it seems that the group is pretty tightly connected… There is one noticeable separate component that contains more than a singleton, from a few accounts I followed last year…:

Partition over my twitter friends

If I look at the labels for the other separate components (not shown), they mainly correspond to people with private accounts, although there are a couple of people who are completely independent of the rest of my Twitter social circle.
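As a sketch of what a connected components calculation does, here’s a plain Python version that ignores edge direction (so it finds weakly connected components, which I’m assuming is what’s wanted here). The function name and toy data are mine. Note how an account with no visible edges, such as a private one, ends up as a singleton.

```python
def weak_components(nodes, edges):
    """Split a directed graph into weakly connected components
    (edge direction is ignored), largest first."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search to collect everything reachable from start.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in neighbours[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy data: one main group, one separate pair, one isolated account.
nodes = ["a", "b", "c", "d", "e", "private"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
components = weak_components(nodes, edges)
```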

The Gephi modularity class statistic, however, suggests there is a little more structure hiding in there…

My twitter network - modularity class

(This is a random algorithm, so it may give slightly different answers each time it is run…)

Let’s peek inside them…

My twitter friends - one cluster

Looks a bit educationalist to me…;-)

How about this one:

another of my twitter clusters

Hmm. Government and open data, maybe? What next…?

Another of my twitter friend clusters...

BBC and journ hack types, with a bit of datajourn thrown in maybe?

Hmmm – the next one looks like an OU cluster:

An OU cluster in my twitter friends

And that leaves….

Final twitter cluster

JISC, museums and libraries…

Seems about right to me:-)

PS Images produced using Gephi… Note to self: start spending a little more time tidying up the presentation of some of these images…;-)

PPS for a similar exercise applied to my Facebook friends, see Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part IV

Written by Tony Hirst

September 21, 2010 at 4:59 pm

Posted in Visualisation

