madewithgephi – Page 2 – OUseful.Info, the blog…

A First Attempt at Looking at F1 Timing Data in Google Motion Charts (aka “Gapminder”)

Having managed to get F1 timing data data through my cobbled together F1 timing data Scraperwiki, it becomes much easier to try out different visualisation approaches that can be used to review the stories that sometimes get hidden in the heat of the race (that data journalism trick of using visualisation as an analytic tool for story discovery, for example).

Whilst I was on holiday, reading a chapter in Beautiful Visualization on Gapminder/Trendalyser/Google Motion Charts (it seems the animations may be effective when narrated, as when Hans Rosling performs with them, but for the uninitiated, they can simply be confusing…), it struck me that I should be able to view some of the timing data in the motion chart…

So here’s a first attempt (going against the previously identified “works best with narration” bit of best practice;-) – F1 timing data (China 2011) in Google Motion Charts, the video:

Visualising the China 2011 F1 Grand Prix in Google Motion Charts

If you want to play with the chart itself, you can find it here: F1 timing data (China 2011) Google Motion Chart.

The (useful) dimensions are:

lap – the lap number;
pos – the car/racing number of each driver;
trackPos – the position in the race (the racing position);
currTrackPos – the position on the track (so if a lapped car is between the leader and second place car, their respective currtrackpos are 1, 2, 3);
pitHistory – the number of pit stops to date

The timeToLead, timeToFront and timeToBack measures give the time (in seconds) between each car and the leader, the time to the car in the racing position ahead, and the time to the car in racing position behind (these last two datasets are incomplete at the moment… I still need to calculate this missing datapoints…). The elapsedTime is the elapsed racetime for each car at the end of each measured lap.

The time starts at 1900 because of a quirk in Google Motion Charts – they only work properly for times measured in years, months and days (or years and quarters) for 1900 onwards. (You can use years less than 1900 but at 1899 bad things might happen!) This means that I can simply use the elapsed time as the timebase. So until such a time as the chart supports date:time or :time as well as date: stamps, my fix is simply to use an integer timecount (the elapsed time in seconds) + 1900.

Visualising Twitter Friend Connections Using Gephi: An Example Using the @WiredUK Friends Network

To corrupt a well known saying, “cook a man a meal and he’ll eat it; teach a man a recipe, and maybe he’ll cook for you…”, I thought it was probably about time I posted the recipe I’ve been using for laying out Twitter friends networks using Gephi, not least because I’ve been generating quite a few network files for folk lately, giving them copies, and then not having a tutorial to point them to. So here’s that tutorial…

The starting point is actually quite a long way down the “how did you that?” chain, but I have to start somewhere, and the middle’s easier than the beginning, so that’s where we’ll step in (I’ll give some clues as to how the beginning works at the end…;-)

Here’s what we’ll be working towards: a diagram that shows how the people on Twitter that @wiredUK follows follow each other:

The tool we’re going to use to layout this graph from a data file is a free, extensible, open source, cross platform Java based tool called Gephi. If you want to play along, download the datafile. (Or try with a network of your own, such as your Facebook network or social data grabbed from Google+.)

From the Gephi file menu, Open the appropriate graph file:

Import the file as a Directed Graph:

The Graph window displays the graph in a raw form:

Sometimes a graph may contain nodes that are not connected to any other nodes. (For example, protected Twitter accounts do not publish – and are not published in – friends or followers lists publicly via the Twitter API.) Some layout algorithms may push unconnected nodes far away from the rest of the graph, which can affect generation of presentation views of the network, so we need to filter out these unconnected nodes. The easiest way of doing this is to filter the graph using the Giant Component filter.

To colour the graph, I often make us of the modularity statistic. This algorithm attempts to find clusters in the graph by identifying components that are highly interconnected.

This algorithm is a random one, so it’s often worth running it several times to see how many communities typically get identified.

A brief report is displayed after running the statistic:

While we have the Statistics panel open, we can take the opportunity to run another measure: the HITS algorithm. This generates the well known Authority and Hub values which we can use to size nodes in the graph.

The next step is to actually colour the graph. In the Partition panel, refresh the partition options list and then select Modularity Class.

Choose appropriate colours (right click on each colour panel to select an appropriate colour for each class – I often select pastel colours) and apply them to the graph.

The next thing we want to do is lay out the graph. The Layout panel contains several different layout algorithms that can be used to support the visual analysis of the structures inherent in the network; (try some of them – each works in a slightly different way; some are also better than others for coping with large networks). For a network this size and this densely connected,I’d typically start out with one of the force directed layouts, that positions nodes according to how tightly linked they are to each other.

When you select the layout type, you will notice there are several parameters you can play with. The default set is often a good place to start…

Run the layout tool and you should see the network start to lay itself out. Some algorithms require you to actually Stop the layout algorithm; others terminate themselves according to a stopping criterion, or because they are a “one-shot” application (such as the Expansion algorithm, which just scales the x and y values by a given factor).

We can zoom in and out on the layout of the graph using a mouse wheel (on my MacBook trackpad, I use a two finger slide up and down), or use the zoom slider from the “More options” tab:

To see which Twitter ID each node corresponds to, we can turn on the labels:

This view is very cluttered – the nodes are too close to each other to see what’s going on. The labels and the nodes are also all the same size, giving the same visual weight to each node and each label. One thing I like to do is resize the nodes relative to some property, and then scale the label size to be proportional to the node size.

Here’s how we can scale the node size and then set the text label size to be proportional to node size. In the Ranking panel, select the node size property, and the attribute you want to make the size proportional to. I’m going to use Authority, which is a network property that we calculated when we ran the HITS algorithm. Essentially, it’s a measure of how well linked to a node is.

The min size/max size slider lets us define the minimum and maximum node sizes. By default, a linear mapping from attribute value to size is used, but the spline option lets us use a non-linear mappings.

I’m going with the default linear mapping…

We can now scale the labels according to node size:

Note that you can continue to use the text size slider to scale the size of all the displayed labels together.

This diagram is now looking quite cluttered – to make it easier to read, it would be good if we could spread it out a bit. The Expansion layout algorithm can help us do this:

A couple of other layout algorithms that are often useful: the Transformation layout algorithm lets us scale the x and y axes independently (compared to the Expansion algorithm, which scales both axes by the same amount); and the Clockwise Rotate and Counter-Clockwise Rotate algorithm lets us rotate the whole layout (this can be useful if you want to rotate the graph so that it fits neatly into a landscape view.

The expanded layout is far easier to read, but some of the labels still overlap. The Label Adjust layout tool can jiggle the nodes so that they don’t overlap.

(Note that you can also move individual nodes by clicking on them and dragging them.)

So – nearly there… The final push is to generate a good quality output. We can do this from the preview window:

The preview window is where we can generate good quality SVG renderings of the graph. The node size, colour and scaled label sizes are determined in the original Overview area (the one we were working in), although additional customisations are possible in the Preview area.

To render our graph, I just want to make a couple of tweaks to the original Default preview settings: Show Labels and set the base font size.

Click on the Refresh button to render the graph:

Oops – I overdid the font size… let’s try again:

Okay – so that’s a good start. Now I find I often enter into a dance between the Preview ad Overview panels, tweaking the layout until I get something I’m satisfied with, or at least, that’s half-way readable.

How to read the graph is another matter of course, though by using colour, sizing and placement, we can hopefully draw out in a visual way some interesting properties of the network. The recipe described above, for example, results in a view of the network that shows:

– groups of people who are tightly connected to each other, as identified by the modularity statistic and consequently group colour; this often defines different sorts of interest groups. (My follower network shows distinct groups of people from the Open University, and JISC, the HE library and educational technology sectors, UK opendata and data journalist types, for example.)
– people who are well connected in the graph, as displayed by node and label size.

Here’s my final version of the @wiredUK “inner friends” network:

You can probably do better though…;-)

To recap, here’s the recipe again:

– filter on connected component (private accounts don’t disclose friend/follower detail to the api key i use) to give a connected graph;
– run the modularity statistic to identify clusters; sometimes I try several attempts
– colour by modularity class identified in previous step, often tweaking colours to use pastel tones
– I often use a force directed layout, then Expansion to spread to network out a bit if necessary; the Clockwise Rotate or Counter-Clockwise rotate will rotate the network view; I often try to get a landscape format; the Transformation layout lets you expand or contract the graph along a single axis, or both axes by different amounts.
– run HITS statistic and size nodes by authority
– size labels proportional to node size
– use label adjust and expand to to tweak the layout
– use preview with proportional labels to generate a nice output graph
– iterate previous two steps to a get a layout that is hopefully not completely unreadable…

Got that?!;-)

Finally, to the return beginning. The recipe I use to generate the data is as follows:

grab a list of twitter IDs (call it L); there are several ways of doing this, for example: obtain a list of tweets on a particular topic by searching for a particular hashtag, then grab the set of unique IDs of people using the hashtag; grab the IDs of the members of one or more Twitter lists; grab the IDs of people following or followed by a particular person; grab the IDs of people sending geo-located tweets in a particular area;
for each person P in L, add them as a node to a graph;
for each person P in L, get a list of people followed by the corresponding person, e.g. Fr(P)
for each X in e.g. Fr(P): if X in Fr(P) and X in L, create an edge [P,X] and add it to the graph
save the graph in a format that can be visualised in Gephi.

To make this recipe, I use Tweepy and a Python script to call the Twitter API and get the friends lists from there, but you could use the Google Social API to get the same data. There’s an example of calling that API using Javscript in my “live” Twitter friends visualisation script (Using Protovis to Visualise Connections Between People Tweeting a Particular Term) as well as in the A Bit of NewsJam MoJo – SocialGeo Twitter Map.

A Couple More Social Media Positioning Maps for UK HE Twitter Accounts

Whenever you put messages out over a communications channel, you ideally have some idea of what your audience is interested in. As @stuartbrown pointed out to me yesterday, if folk are following (and continue to follow) the @openuniversity Twitter account, you at least know (or hope!) that they have an interest in what the @openunivirsity is tweeting about. But if we’re trying to profile @openuniversity followers as part of a particular campaign, such as supporting commissioning of OU/BBC co-pro content intended to have some sort of Twitter engagement strategy around it (I’m not saying this is on the cards, just imagining), then it makes sense to me at least that we should have some idea about what the other interests of @openuniversity followers are…?

…which is why I think there may be signal in looking at the folk followed in large numbers by folk who follow the @openuniversity Twitter account…

So here’s a quick snapshot of who 20 or more of a random sample of 1000 followers of @openuniversity follow on Twitter:

There are several grouping s evident: UK HE accounts (top left), US HE accounts (bottom left), BBC/UK news, culture, politics (12 o’clock), comedians and DJs (top right), music and celebrities (south east), international news/tech (south).

For comparison purposes, here’s a quick map generated under similar conditions for @edgehill:

Again, there are UK and US HE account clusters, as well as a celebrity grouping, but reflecting the situated nature of Edge Hill (compared to the distributed nature of the OU), we can also see a Liverpool cluster top left…

Note that I know these images are hard to read/not laid out very clearly/don’t offer an interactive zoom. If you want to see a proper detailed view/map, you’ll have to request the files from me and fire up Gephi or NodeXL yourself ;-)

If anything, what I’m trying to find out is whether these sorts of map might be useful, rather than just ‘interesting’ (and maybe they’re not even that!), and if so how? (That is, what might they cause to be done or cause to be done differently?)

The Social Hinterland Around the #FOTE11 Hashtaggers

Deep breath… ready..? The following is representation of notable people who are followed by people who follow recent users of the #fote11 hashtag.

The method: grab 500 recent tweets incorporating the #fote11 hashtag; identify the unique twitter users who sent those tweets; for each of those users, take a random sample of 50 of their followers to build up a sample of the socially connected audience around the #fote11 tag and reduce that sampled list to folk who follow 2 or more of the hashtaggers. Now grab all the friends of those people to give a network that links sampled followers of hashtag users to the people those sampled users follow. Reduce this graph to only show folk who are followed by 20 or more of the sampled followers of the hashtag users. And that’s what’s shown above… (Nodes sized according to eigenvector centrality.)

What I think it shows is a snapshot of who’s notable amongst the people who are following at least two of the #fote11 hashtaggers. What I really need to do is find a way of colouring the nodes to distinguish between folk who used the hashtag and those who don’t… But I don’t have time to tweak my code to do that right now:-(

That is all…

Visualising Related Entries in Wikipedia Using Gephi

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugin.)

To get DBpedia data into Gephi, we need to do three things:

– tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
– tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
– tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

SELECT *
WHERE {
?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
}

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people who influenced them. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

CONSTRUCT{
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or EigenVector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchtermann-Rheingold layout algorithm, we get something like this:

The nodes are labelled in a rather clumsy way – http://dbpedia.org/page/Martin_Heidegger – for example, but we can tidy this up. Going to one of the DPpedia pages, such as http://dbpedia.org/page/Martin_Heidegger, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/> as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<http://xmlns.com/foaf/0.1/>, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced> ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHEER part of the query:

?p <foaf:name> ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a
  <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, if you run this query it may appear on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I’ll maybe post more on this later – in the meantime, if you’re new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, on online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

PPPPS related – a large scale version of this? Wikipedia Mining Algorithm Reveals The Most Influential People In 35 Centuries Of Human History

Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values for the same sort of classification-type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genre, we might draw edges between two nodes show that a particular band has been labelled as falling within each of those two genres.

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:
CONSTRUCT{
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA  ?genreB .
  ?genreB  ?genreA .
} WHERE {
?band  .
?band  "group_or_band"@en.
?band  ?genreA.
?band  ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en")  && langMatches(lang(?genreBname), "en"))
}

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. Note that this approach was discovered through idle browsing of the properties of several bands. Instead of:
?band "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a >http://schema.org/MusicGroup>.
or:
?band a .

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

SELECT DISTINCT ?genre WHERE {
  ?band  "group_or_band"@en.
  ?band  ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th(?) result.)

What this query does is look for items that are declared as a type group_or_band and then pull out the genres associated with each band.

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipeda, if you can work out how to narrow down your search to find them… (My starting point is to browse DPpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peak at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)