Tagged: gephi

Bands Incorporated

A few weeks ago, as I was doodling with some Companies House director network mapping code and simple Companies House chatbot ideas, I tweeted an example of Iron Maiden’s company structure based on co-director relationships. Depending on how the original search is seeded, the maps may also include elements of band members’ own personal holdings/interests. The following map, for example, is seeded just from the Iron Maiden LLP company number:


If you know anything about the band, you’ll know Bruce Dickinson’s aircraft interests make complete sense…

That graph is actually a bipartite graph – nodes are either directors or companies. We can easily generate a projection of the graph that replaces the directors that link companies with edges that represent “common director” links between companies:


(The edges are actually weighted, so the greater the edge weight, the more directors there are in common between the linked companies.)
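As a rough illustration of that projection step, here is a minimal sketch in plain Python. The function name `project_companies` and the sample director/company identifiers are made up for the example; the idea is simply that every pair of companies sharing a director gains one unit of edge weight per shared director.

```python
from collections import defaultdict
from itertools import combinations

def project_companies(appointments):
    """Project (director, company) pairs onto weighted company-company edges."""
    companies_by_director = defaultdict(set)
    for director, company in appointments:
        companies_by_director[director].add(company)

    weights = defaultdict(int)
    for companies in companies_by_director.values():
        # Each pair of companies sharing this director gains one unit of weight
        for a, b in combinations(sorted(companies), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical appointments, for illustration only
appointments = [
    ("dir1", "CoA"), ("dir1", "CoB"),
    ("dir2", "CoA"), ("dir2", "CoB"), ("dir2", "CoC"),
]
edges = project_companies(appointments)
```

Here CoA and CoB share two directors, so that edge carries a weight of 2, while the other two edges carry a weight of 1.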

In today’s Guardian, I notice they’re running a story about Radiohead’s company structure, with a parallel online piece, Radiohead’s corporate empire: inside the band’s dollars and cents, which shows how to get a story out of such a map, as well as how to re-present the original raw map to give it a bit more spatial semantic structure:


(The story also digs into the financial reports from some of the companies.)

By way of comparison, here’s my raw map of Radiohead’s current company structure, generated from Companies House data seeded on the company number for Radiohead Trademark:


It’s easy enough to grab the data for other bands. So how about someone like The Who? If we look in the immediate vicinity of The Who Group, we see core interests:


But if we look for linkage to the next level of co-director links, we start to see other corporate groups that hold at least one shared interest with the band members:


So what other bands incorporated in the UK might be worth mapping?

Trawling the Companies House API to Generate Co-Director Networks

Somewhen ago (it’s always somewhen ago; most of the world never seems to catch up with what’s already happened!:-( ) I started dabbling with the OpenCorporates API to generate co-director corporate maps that showed companies linked by multiple directors. It must have been a bad idea because no-one could see any point in it, not even interestingness… (Which suggests to me that boards made up of directors are similarly meaningless? In which case, how are companies supposed to hold themselves to account?)

I tend to disagree. If I hadn’t been looking at connected companies around food processing firms, I would never have learned that one way that meat processors cope with animal fat waste is to feed it into the biodiesel raw material supply chain.

Anyway, if we ever get to see a beneficial ownership register, a similar approach should work to generate maps showing how companies sharing beneficial owners are linked. (The same approach also drives my emergent social positioning Twitter maps and the Wikipedia semantic maps I posted about again recently.)

As a possible precursor to that, I thought I’d try to reimplement the code (in part to see if a better approach came to mind) using data grabbed directly from Companies House via their API. I’d already started dabbling with the API (Chat Sketches with the Companies House API) so it didn’t take much more to get a grapher going…

But first, I realise in that earlier post I’d missed the function for actually calling the API – so here it is:

import urllib2, base64, json
from urllib import urlencode
from urllib2 import HTTPError
from time import sleep

def url_nice_req(url,t=300):
    #Make the request; if the API says we are over the rate limit (HTTP 429), rest for t seconds then retry
    try:
        return urllib2.urlopen(url)
    except HTTPError, e:
        if e.code == 429:
            print("Overloaded API, resting for a bit...")
            sleep(t)
            return url_nice_req(url,t)

#Inspired by http://stackoverflow.com/a/2955687/454773
def ch_request(CH_API_TOKEN,url,args=None):
    #Add any query arguments to the URL
    if args is not None:
        url='{}?{}'.format(url,urlencode(args))
    request = urllib2.Request(url)
    # You need the replace to handle encodestring adding a trailing newline 
    # (https://docs.python.org/2/library/base64.html#base64.encodestring)
    base64string = base64.encodestring('%s:' % (CH_API_TOKEN)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)   
    result = url_nice_req(request)

    return json.loads(result.read())
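The function above is written for Python 2 (urllib2). For anyone on Python 3, the authorisation step might be sketched as follows. This is an assumption-laden sketch rather than the code I actually ran: `build_auth_header` and `build_request` are hypothetical helper names, the token is a dummy and the URL is purely illustrative. Nothing is sent over the network.

```python
import base64
import urllib.request

def build_auth_header(api_token):
    """Return the HTTP Basic header value for a token used as the username with an empty password."""
    raw = "{}:".format(api_token).encode("ascii")
    # b64encode does not add the trailing newline that encodestring did
    return "Basic " + base64.b64encode(raw).decode("ascii")

def build_request(api_token, url):
    """Create (but do not send) a request carrying the Authorization header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", build_auth_header(api_token))
    return request

# Dummy token and an illustrative URL; no request is made here
req = build_request("my-token", "https://api.companieshouse.gov.uk/company/00000000")
```

Passing `req` to `urllib.request.urlopen` would then make the call, with the same 429 handling wrapped around it as before.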


In the original implementation, I stored the incremental search results in a dict; in the reimplementation, I thought I’d make use of a small SQLite database.

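import sqlite3
#tmpDB is the path to the SQLite database file (set elsewhere)
#Close any connection left open from a previous run before reconnecting
if 'db' in locals():
    db.close()
db = sqlite3.connect(tmpDB)
c = db.cursor()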

for drop in ['directorslite','companieslite','codirs','coredirs','singlecos']:
    c.execute('''drop table if exists {}'''.format(drop))
c.execute('''create table directorslite
         (dirnum text primary key,
          dirdob integer,
          dirname text)''')

c.execute('''create table companieslite
         (conum text primary key,
          costatus text,
          coname text)''')

c.execute('''create table codirs
         (conum text,
          dirnum text,
          typ text,
          status text)''')

c.execute('''create table coredirs
         (dirnum text)''')

c.execute('''create table singlecos
         (conum text,
          coname text)''')


The code itself runs in two passes. The first pass builds up a seed set of directors from a single company or set of companies using a simple harvester:

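#dirsdone, cosdone, dirsparsed and cosparsed are lists (defined elsewhere) tracking what has been stored or harvested
def updateOnCo(seed,typ='current',role='director'):
    print('harvesting {}'.format(seed))
    #Get the officers of the seed company
    #(ch_getCompanyOfficers is an assumed name for the helper from the earlier post)
    o=ch_getCompanyOfficers(seed,typ=typ,role=role)['items']
    x=[{'dirnum':p['links']['officer']['appointments'].strip('/').split('/')[1],
          'dirdob':p['date_of_birth']['year'] if 'date_of_birth' in p else None,
          'dirname':p['name']} for p in o]
    #Only add directors we have not already stored
    z=[]
    for y in x:
        if y['dirnum'] not in dirsdone:
            z.append(y)
            dirsdone.append(y['dirnum'])
    if isinstance(z, dict): z=[z]
    print('Adding {} directors'.format(len(z)))
    c.executemany('INSERT INTO directorslite (dirnum, dirdob,dirname)'
                     'VALUES (:dirnum,:dirdob,:dirname)', z)
    for oo in [i for i in o if i['links']['officer']['appointments'].strip('/').split('/')[1] not in dirsparsed]:
        oid=oo['links']['officer']['appointments'].strip('/').split('/')[1]
        print('New director: {}'.format(oid))
        #Play nice with the api (the length of the pause is an assumption)
        sleep(0.5)
        ooo=ch_getAppointments(oid,typ=typ,role=role)
        dirsparsed.append(oid)
        #add company details
        x=[{'conum':p['appointed_to']['company_number'],
          'costatus':p['appointed_to']['company_status'] if 'company_status' in p['appointed_to'] else '',
          'coname':p['appointed_to']['company_name'] if 'company_name' in p['appointed_to'] else ''} for p in ooo['items']]
        #Only add companies we have not already stored
        z=[]
        for y in x:
            if y['conum'] not in cosdone:
                z.append(y)
        if isinstance(z, dict): z=[z]
        print('Adding {} companies'.format(len(z)))
        c.executemany('INSERT INTO companieslite (conum, costatus,coname)'
                     'VALUES (:conum,:costatus,:coname)', z)
        for i in x:cosdone.append(i['conum'])
        #add company director links
        x=[{'conum':p['appointed_to']['company_number'],'dirnum':oid,
            'typ':'current','status':'director'} for p in ooo['items']]
        c.executemany('INSERT INTO codirs (conum, dirnum,typ,status)'
                     'VALUES (:conum,:dirnum,:typ,:status)', x)
        print('Adding {} company-directorships'.format(len(x)))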

The set of seed companies may be companies associated with one or more specified seed directors, for example:

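def dirCoSeeds(dirseeds,typ='all',role='all'):
    ''' Find companies associated with dirseeds '''
    coseeds=[]
    for d in dirseeds:
        for c in ch_getAppointments(d,typ=typ,role=role)['items']:
            coseeds.append(c['appointed_to']['company_number'])
    return coseeds

#Look through the matching officers to pick out the director IDs to seed from
#(the body of this loop is assumed)
for d in ch_searchOfficers('Bernard Ecclestone',n=10,exact='forename')['items']:
    print(d)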

Then I call a first pass of the co-directed companies search with the set of company seeds:

#Need to handle director or LLP Designated Member
for seed in coseeds:
    updateOnCo(seed)
c.executemany('INSERT INTO coredirs (dirnum) VALUES (?)', [[d] for d in dirsparsed])

seeder_roles=['Finance Director']
#for dirs in seeded_cos, if dir_role is in seeder_roles then do a second seeding based on their companies


Then we go for a crawl for as many steps as required… The approach I’ve taken here is to search through the current database to find the companies heuristically defined as codirected, and then feed these back into the harvester.

#seeder, oneDirSeed and maxdepth are settings assumed to be defined before the crawl starts
depth=0
while depth<maxdepth:
    print('---------------\nFilling out level - {}...'.format(depth))
    if seeder and depth==0:
        #Another policy would be dive on all companies associated w/ dirs of seed
        #In which case set the above test to depth==0
        tofetch=[u[0] for u in c.execute(''' SELECT DISTINCT conum from codirs''')]
    else:
        #Heuristic: companies count as codirected if they share more than one director
        duals=c.execute('''SELECT cd1.conum as c1,cd2.conum as c2, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 
                        ON cd1.dirnum = cd2.dirnum
                        WHERE cd1.conum < cd2.conum GROUP BY c1,c2 HAVING COUNT(*)>1
                        ''')
        tofetch=[x for t in duals for x in t[:2]]
        #The above has some issues. eg only 1 director is required, and secretary IDs are unique to company
        #Maybe need to change logic so if two directors OR company just has one director?
        #if relaxed>0:
        #    print('Being relaxed {} at depth {}...'.format(relaxed,depth))
        #    duals=c.execute('''SELECT cd.conum as c1,cl.coname as cn, count(*) FROM codirs as cd JOIN companieslite as cl 
        #                 WHERE cd.conum= cl.conum GROUP BY c1,cn HAVING COUNT(*)=1
        #                ''')
        #    tofetch=tofetch+[x[0] for x in duals]
        #    relaxed=relaxed-1
    if depth==0 and oneDirSeed:
        #add in companies with a single director first time round
        sco=[]
        for u in c.execute('''SELECT DISTINCT cd.conum, cl.coname FROM codirs cd  JOIN companieslite cl ON
                            cd.conum= cl.conum''').fetchall():
            #ch_getCompanyOfficers is an assumed name for the officer lookup helper
            o=ch_getCompanyOfficers(u[0],typ='current',role='director')
            if len(o['items'])==1 or u[0]in coseeds:
                sco.append({'conum':u[0],'coname':u[1]})
                tofetch.append(u[0])
        c.executemany('INSERT INTO singlecos (conum,coname) VALUES (:conum,:coname)', sco)
    #TO DO: Another strategy might be to try to find the Finance Director or other named role and seed from them?
    #Get undone companies
    print('To fetch: ',[u for u in tofetch if u not in cosparsed])
    for u in [x for x in tofetch if x not in cosparsed]:
            #play nice (the length of the pause is an assumption)
            sleep(0.5)
            updateOnCo(u)
            cosparsed.append(u)
    #Parse companies
    depth=depth+1
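The “codirected” heuristic at the heart of the crawl (companies that share more than one director) can be tried out on its own against an in-memory database. This is a minimal sketch with made-up company and director numbers, reusing the codirs table layout from above; it uses Python's standard sqlite3 module and is not the crawl code itself.

```python
import sqlite3

db = sqlite3.connect(":memory:")
c = db.cursor()
c.execute("CREATE TABLE codirs (conum TEXT, dirnum TEXT, typ TEXT, status TEXT)")

# Hypothetical company/director links, for illustration only
rows = [
    ("C1", "D1"), ("C1", "D2"),
    ("C2", "D1"), ("C2", "D2"),
    ("C3", "D1"),
]
c.executemany(
    "INSERT INTO codirs (conum, dirnum, typ, status) VALUES (?, ?, 'current', 'director')",
    rows,
)

# Pair up companies that share a director, then keep pairs sharing more than one
duals = c.execute("""
    SELECT cd1.conum AS c1, cd2.conum AS c2, COUNT(*) AS shared
    FROM codirs AS cd1
    JOIN codirs AS cd2 ON cd1.dirnum = cd2.dirnum
    WHERE cd1.conum < cd2.conum
    GROUP BY c1, c2
    HAVING COUNT(*) > 1
""").fetchall()

# Flatten the pairs into the list of companies to harvest next
tofetch = sorted({conum for pair in duals for conum in pair[:2]})
```

C1 and C2 share two directors and so are picked up; C3 only shares one director with each of them and is left out, which is exactly the “only 1 director is required” weakness noted in the comments above.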

To visualise the data, I opted for Gephi, which meant having to export the data. I started off with a simple CSV edgelist exporter:

data=c.execute('''SELECT cl1.coname as Source,cl2.coname as Target, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN companieslite as cl1 JOIN companieslite as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum < cd2.conum GROUP BY Source,Target HAVING COUNT(*)>1''')
import csv
#Write the edges between companies that share more than one director
with open('output1.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target'])
    #Keep just the two company name columns so the rows match the header
    writer.writerows(row[:2] for row in data)
data= c.execute('''SELECT cl1.coname as c1,cl2.coname as c2 FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN singlecos as cl1 JOIN singlecos as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum < cd2.conum''')
#Append the edges between the single director companies
with open('output1.csv', 'ab') as f:
    writer = csv.writer(f)
    writer.writerows(data)

but soon changed that to a proper graph file export, based on a graph built around the codirected companies using the networkx package:

import networkx as nx

G=nx.Graph()

data=c.execute('''SELECT cl.conum as cid, cl.coname as cn, dl.dirnum as did, dl.dirname as dn
FROM codirs AS cd JOIN companieslite as cl JOIN directorslite as dl ON cd.dirnum = dl.dirnum and cd.conum=cl.conum ''')
for d in data:
    #Add the company node and the director node, then link them
    G.add_node(d[0], Label=d[1])
    G.add_node(d[2], Label=d[3])
    G.add_edge(d[0],d[2])
nx.write_gexf(G, "test.gexf")

I then load the graph file into Gephi to visualise the data.

Here’s an example of the sort of thing we can get out for a search seeded on companies associated with the Bernie Ecclestone who directs at least one F1 related company:


On the to do list is to automate this a little bit more by adding some network statistics, and possibly a first pass layout, in the networkx step.

In terms of time required to collect the data, the Companies House API (https://developer.companieshouse.gov.uk/api/docs/index/gettingStarted/rateLimiting.html) is rate limited to allow 600 requests within a five minute period. Many company networks can be mapped within the 600 call limit, but even for larger networks, the trawl doesn’t take too long even if two or three rest periods are required.
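Rather than waiting to be told off with a 429, one option would be to pace the calls on the client side. The following is only a sketch of that idea: `RateLimiter` is a hypothetical helper, and treating the limit as a rolling window is a conservative assumption on my part rather than a statement of how the API actually counts requests. The clock and sleep functions are passed in so the logic can be exercised without really waiting.

```python
import time
from collections import deque

class RateLimiter(object):
    """Allow at most max_calls within any rolling window of period seconds."""

    def __init__(self, max_calls=600, period=300.0, clock=time.time, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()

    def _forget_old_calls(self, now):
        # Drop calls that have fallen out of the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self._forget_old_calls(now)
        if len(self.calls) >= self.max_calls:
            # Rest until the oldest recorded call leaves the window
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._forget_old_calls(now)
        self.calls.append(now)
```

Calling `limiter.wait()` immediately before each API request would then keep a trawl inside the limit, at the cost of an occasional rest.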

Revisiting My Twitter Harvesting Code

Despite having suffered a catastrophic/unrecoverable hard-disk failure on the (unbacked up) machine I had my Twitter harvesting notebooks (and cached data database) on, I did manage to find a reasonably current version of the code (via Github gists and Dropbox) and spent a few evening hours tinkering with it over the last ten days or so.

So as a quick note-to-self, here’s a list of the functions I currently have to hand:

  • search for users using a recent search term: get a list of users recently using a particular term or phrase;
  • search for users using a recent hashtag: get a list of users recently using a particular hashtag;
  • generate maps of folk commonly followed by users of the searchterm/tag: from the term or tag userlist, find the folk commonly followed by those users and generate a network edge list;
  • get members of a list: get a list of the members of a particular list;
  • get lists a person is a member of: get a list of the lists a user is a member of; optionally limit to lists with more than a certain number of followers;
  • triangulate lists: find lists that several specified users are a member of, thresholded (so e.g. lists where at least 3 of 5 people mentioned are on the list); also limit by minimum number of subscribers to list (so we can ignore lists with no subscribers etc). List triangulation can be applied to lists of users e.g. folk using a particular hashtag; so we have a route to finding lists that may be topically related to a particular tag;
  • download members of lists a specified user is a member of: for the lists a particular user is a member of, grab details of all the members of those lists;
  • get all friends/followers of a user: this can be limited to a maximum number of friends/followers (eg 5000);
  • get common friends of (sampled) followers of a user: for a particular user, get their followers, sample N of them, then find folk commonly followed by that sample; output as a graph edge list;
  • find common followers of a set of specified users: for a list of users (e.g. recent users of a particular hashtag), find folk who follow a minimum number of them, or who are followed by a minimum number of them;
  • tag user biographies using Thomson Reuters OpenCalais and IBM Alchemy APIs: this tagging can be easily applied to all users in a list, tagging their biographies one at a time
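To give a flavour of the list triangulation idea, here is a minimal sketch that works on data already fetched from Twitter. The function name `triangulate_lists`, the user names, the list names and the subscriber counts are all invented for the example; it is not the code from my notebooks.

```python
from collections import Counter

def triangulate_lists(memberships, min_users=3, subscribers=None, min_subscribers=0):
    """Return lists that at least min_users of the given users are members of."""
    subscribers = subscribers or {}
    counts = Counter()
    for user, lists in memberships.items():
        # Use a set so a user counts once per list
        counts.update(set(lists))
    return sorted(
        name for name, n in counts.items()
        if n >= min_users and subscribers.get(name, 0) >= min_subscribers
    )

# Hypothetical list memberships and subscriber counts, for illustration only
memberships = {
    "alice": ["data-journos", "f1-fans"],
    "bob": ["data-journos"],
    "carol": ["data-journos", "f1-fans"],
    "dave": ["f1-fans", "misc"],
    "eve": ["misc"],
}
subscribers = {"data-journos": 12, "f1-fans": 0, "misc": 4}

shared = triangulate_lists(memberships, min_users=3)
shared_with_audience = triangulate_lists(
    memberships, min_users=3, subscribers=subscribers, min_subscribers=1
)
```

With a threshold of three users, both data-journos and f1-fans qualify; requiring at least one subscriber drops f1-fans.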

I’ve also started looking again at generating topic models around Twitter data, starting with user biographies (which so far is not very interesting!)

With these various functions, it’s easy enough to generate various combinations of emergent social positioning map. I’ve started exploring various Python libraries for clustering and laying out maps automatically, but tend to fall back to handcrafting the displays using Gephi. On the to do list is to try to automate the Gephi side, at least for a first pass, using the Gephi toolkit, though at the moment that looks like requiring that I get my head round a bit of Java. Ideally, I’d like to be able to see a Gephi endpoint (perhaps from a Gephi headless server running in a Docker container…?:-), give it a graph file and a config file, and get a PDF, SVG or PNG layout back…

I also need to do a couple of proof-of-concept one-off printed outputs for myself, like getting an ESP map printed as an A0 poster or folded map.

More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?

In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)

This is what we’ll be working towards:

Tidier intra-candidate support map

If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.

scraperwiki council elections - assentors

Here’s a reminder of what the original PDF doc looked like (archive copy):

IW poll notice assentors

Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…

When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last). We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).
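By way of a sketch, the recasting step might look something like the following. I’m assuming here that candidate names arrive as “LASTNAME, Firstname Middlename”; that input format, the function name `to_assentor_form` and the sample names are all assumptions made for the example rather than a description of what the scraper does.

```python
def to_assentor_form(candidate_name):
    """Recast 'LASTNAME, Firstname Middlename' as 'Firstname M. Lastname'."""
    last, _, given = candidate_name.partition(",")
    parts = given.split()
    if not parts:
        return last.strip().title()
    first = parts[0].title()
    # Reduce any middle names to an initial followed by a full stop
    initials = ["{}.".format(p[0].upper()) for p in parts[1:]]
    return " ".join([first] + initials + [last.strip().title()])

# Hypothetical names, for illustration only
assentors = {"Paul G. Taylor", "Jane Smith"}
recast = to_assentor_form("TAYLOR, Paul Graham")
is_assentor = recast in assentors                  # exact match succeeds
case_mismatch = recast.lower() in assentors        # differs in case, so no match
```

Note that title-casing is a blunt instrument (a name like McDonald would come out as Mcdonald), which is one reason exact matching needs the names checking by eye as well.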

We can download a CSV representation of the data from the scraper directly:

Scraperwiki CSV download

The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.


To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit column as Target and the assentor column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.

csv rename
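If you’d rather not edit the file by hand, the same renaming can be scripted. In this sketch, candinit is the column name mentioned above, but support (for the assentor column) and party are assumed names, and the data row is made up.

```python
import csv
import io

def rename_columns(csv_text, renames):
    """Return csv_text with header names replaced according to renames."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows[0] = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# 'candinit' is the column named in the text; 'support' and 'party' are assumed names
sample = "support,candinit,party\nJane Smith,Paul G. Taylor,Independent\n"
renamed = rename_columns(sample, {"support": "Source", "candinit": "Target"})
```

Columns that aren’t mentioned in the renames mapping are passed through untouched, so they remain available as edge attributes in Gephi.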

Start Gephi, select Data Laboratory tab and then New Project from the File menu.

gephi data lab new project

You should now see a toolbar that includes an “Import Spreadsheet” option:

gephi import spreadsheet

Import the CSV file as such, identifying it as an Edges Table:

import data into gephi data laboratory

You should notice that the Source and Target columns have been identified as such and we have the choice to import the other columns or not – let’s bring them in…

Source and Target recognised

You should now see the data has been loaded in to Gephi…

Data loaded in

If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.


Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:

force layout tweaks

If you look closely, you’ll be able to see that there are many separate groupings of connected circles – these represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two nodes are connected when in fact they aren’t…)

Close up simple patterns

However, there are also other groupings in which one candidate may support another:

candidate support

These connections may allow us to see groupings of candidates supporting each other along party lines.

One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes, so that we can focus on particular aspects of the network. I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree 1 or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out-degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:

lots going on
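For anyone who’d like to check the logic of that degree filter outside Gephi, here’s a minimal sketch of the same idea in Python. The function name `degree_filter` and the names in the edge list are made up; edges run from assentor to candidate.

```python
from collections import Counter

def degree_filter(edges, min_in=1, min_out=1):
    """Return nodes with in-degree >= min_in and out-degree >= min_out."""
    out_degree = Counter(source for source, target in edges)
    in_degree = Counter(target for source, target in edges)
    nodes = set(out_degree) | set(in_degree)
    return sorted(n for n in nodes
                  if in_degree[n] >= min_in and out_degree[n] >= min_out)

# Hypothetical (assentor, candidate) edges, read as "is a supporter of"
edges = [
    ("Ann", "Bob"), ("Bob", "Cat"), ("Dan", "Cat"),
    ("Eve", "Bob"), ("Eve", "Cat"),
]
# Candidates (supported by someone) who also support someone else
candidate_supporters = degree_filter(edges)
# Anyone who supports two or more candidates
multi_supporters = degree_filter(edges, min_in=0, min_out=2)
```

Bob is the only node that is both supported and a supporter, while Eve is the only one supporting more than one candidate.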

Hmmm.. how about if we extend our filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also resize the nodes/labels so we can better see the candidates’ names? The Neighbours Network filter takes the nodes we have and then also finds the nodes that are connected to them, to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth 2)). Which is to say, it will bring in the candidates who are supported by candidates, and their supporters:

A few more tweaks

That’s a bit clearer, but there are still overlapping lines, so it may make sense to lay out the network again:

improve the layout

We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:

colour the groups
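The idea behind the connected components statistic is simple enough to sketch in a few lines of Python. This version ignores edge direction (so it finds what are usually called weakly connected components); the function name and the sample edges are made up, and it isn’t a description of how Gephi implements the statistic.

```python
from collections import defaultdict

def connected_components(edges):
    """Group nodes into components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Walk outwards from an unvisited node to collect its whole group
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        components.append(sorted(group))
    return components

# Hypothetical support edges, for illustration only
edges = [("Ann", "Bob"), ("Bob", "Cat"), ("Dan", "Eve")]
groups = connected_components(edges)
```

Each group could then be given its own colour, which is what the Gephi partition colouring does for us.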

Let’s reset the colours and go back to colourings along party lines:

Gephi reset colours

If we go to the Preview view, we can generate a prettified view of the network:

Preview layout

In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:

this is odd

Going back to the Graph view and zooming in, we see that Paul G. Taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)

To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:

Filter on nodes out degree 2

Just that one then…

Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work supports chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtered network up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.

nested filters

Here’s the result:

Nesting filters in gephi

And again we can go into Preview mode to generate a nice vectorised version of the graph:

Tidier intra-candidate support map

This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).

So there we have it: a quick tour of how to use Gephi to look at the co-support structure of a group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particularly amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check: whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.

If this all seems like too much effort, remember that there’s always the One-Liner, Thirty Second Route to the Info.

PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi

How can we get a quick snapshot of who’s talking to whom on Twitter in the context of a particular hashtag?

Here’s a quick recipe that shows how…

First we need to grab some search data. The Twitter API documentation provides us with some clues about how to construct a web address/URL that will grab results back from a particular search on Twitter in a machine readable way (that is, as data):

  • http://search.twitter.com/search.format is the base URL, and the format we require is json, which gives us http://search.twitter.com/search.json
  • the query we want is presented using the q= parameter: http://search.twitter.com/search.json?q=searchterm
  • if we want multiple search terms (for example, library skills), they need encoding in a particular way. The easiest way is just to construct your URL, enter it into the location/URL bar of your browser and hit enter, or use a service such as this string encoder. The browser should encode the URL for you. (If the only punctuation in your search phrase is spaces, you can encode them yourself: just change each space to %20, to give something like library%20skills. If you want to encode the # in a hashtag, use %23.)
  • We want to get back as many results as are allowed at any one time (which happens to be 100), so set rpp=100, that is: http://search.twitter.com/search.json?q=library%20skills&rpp=100
  • results are paged (in the sense of different pages of Google search results, for example), which means we can ask for the first 100 results, the second 100 results and so on as far back as the most recent 1500 tweets (page 15 for rpp=100, or page 30 if we were using rpp=50, since 15*100 = 30*50 = 1500): http://search.twitter.com/search.json?q=library%20skills&rpp=100&page=1
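The URL construction rules in the list above can also be scripted. This sketch only builds the list of page URLs following the scheme described; it makes no requests, and the function name `search_urls` is made up for the example.

```python
from urllib.parse import urlencode, quote

BASE_URL = "http://search.twitter.com/search.json"

def search_urls(query, rpp=100, max_results=1500):
    """Build one URL per page of results for the given query."""
    pages = max_results // rpp
    return [
        # quote_via=quote encodes spaces as %20 rather than +
        "{}?{}".format(BASE_URL, urlencode({"q": query, "rpp": rpp, "page": page}, quote_via=quote))
        for page in range(1, pages + 1)
    ]

urls = search_urls("library skills")
hashtag_urls = search_urls("#ddj", rpp=50)
```

The first call gives 15 URLs of 100 results each, and the second gives 30 URLs of 50 results each, with the # of the hashtag encoded as %23.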

Clicking on Next provides us with a dialogue that will allow us to load the data from the URLs into Google Refine:

Clicking “Configure Parsing Options” loads the data and provides us with a preview of it:

If you inspect the data that is returned, you should see it has a repeating pattern. Hovering over the various elements allows you to identify what repeating part of the result we want to import. For example, we could just import each tweet:

Or we could import all the data fields – let’s grab them all:

If you click the highlighted text, or click “Update Preview View”, you can get a preview of how the data will appear. To return to the selection view, click “Pick Record Nodes”:

“Create Project” actually generates the project and pulls all the data in… The column names are a little messy, but we can tidy those:

Look for the from_user and to_user columns and rename them source and target respectively… (hovering over a column name pops up a tooltip that shows the full column name):

For the example I’m going to describe, we don’t actually need to rename the columns, but it’s handy to know how to do it;-)

We can now filter out all the rows with a “null” value in the target column. It seems a bit fiddly at first, but you soon get used to the procedure… Select the text facet to pop up a window that shows the unique elements in the target column and how often they occur. Sort the list by count, and click on the “null” element – it should be highlighted and its setting should appear as “exclude”. The column will now be showing elements in the column that have the null value:

Click on the “Invert” option and the column will now filter out all the “null” elements and only show the elements that have a non-null value – that is, tweets that have a “to_user” value (which is to say, those tweets were sent to a particular user). Here’s what we get:

Let’s now export the source and target data so we can get it into Gephi:

Deselect all the columns, and then select source and target columns; also deselect the ‘output column headers’ – we don’t need headers where this file is going…

Export the custom layout as CSV data:

We can now import this data into another application – Gephi. Gephi is a cross platform package for visualising networks. In the simplest case, it can import two column data files where each row represents two things that are connected to each other. In our case, we have connections between “source” and “target” Twitter names – that is, connections that show when one Twitter user in our search sample has sent a message to another.
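For comparison, the filter-and-export steps we carried out in Google Refine amount to something like the following in Python. The tweets here are invented, and I’m assuming each result is a dictionary with from_user and to_user fields as in the search results described above.

```python
import csv
import io

def conversation_edges(tweets):
    """Return (from_user, to_user) pairs for tweets addressed to someone."""
    return [(t["from_user"], t["to_user"]) for t in tweets if t.get("to_user")]

def edges_to_csv(edges):
    """Serialise edges as headerless two-column CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(edges)
    return out.getvalue()

# Hypothetical search results, for illustration only
tweets = [
    {"from_user": "alice", "to_user": "bob", "text": "@bob hello"},
    {"from_user": "carol", "to_user": None, "text": "just a thought"},
    {"from_user": "bob", "to_user": "alice", "text": "@alice hi"},
]
csv_text = edges_to_csv(conversation_edges(tweets))
```

The tweet with no to_user value is dropped, leaving a two-column, headerless edge list of the sort Gephi can open directly.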

Launch Gephi and from the file menu, open the file you exported from Google Refine:

We’ve now got our data into Gephi, where we can start to visualise it…

…but that is a post for another day… (or if you’re impatient, you can find some examples of how to drive Gephi here).

Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values for the same sort of classification-type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genre, we might draw edges between two nodes to show that a particular band has been labelled as falling within each of those two genres.
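The pairing logic can be sketched in a few lines of Python. The band names and their genre labels below are made up, and `genre_edges` is a hypothetical helper; the count attached to each pair records how many bands carry both labels.

```python
from collections import Counter
from itertools import combinations

def genre_edges(band_genres):
    """Count how many bands link each pair of genres."""
    edges = Counter()
    for genres in band_genres.values():
        # Every pair of genres attached to the same band gives one edge
        for a, b in combinations(sorted(set(genres)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical band-to-genre labels, for illustration only
band_genres = {
    "Band One": ["Electronic music", "New Wave music", "Ambient music"],
    "Band Two": ["Electronic music", "Ambient music"],
}
edges = genre_edges(band_genres)
```

A band with three genres contributes three edges, and the Ambient/Electronic pair ends up with a count of 2 because both bands carry both labels.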

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:
CONSTRUCT {
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA  ?genreB .
  ?genreB  ?genreA .
} WHERE {
?band  .
?band  "group_or_band"@en.
?band  ?genreA.
?band  ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en")  && langMatches(lang(?genreBname), "en"))
}

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. Note that this approach was discovered through idle browsing of the properties of several bands. Instead of:
?band "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a <http://schema.org/MusicGroup>.
?band a .

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

  ?band  "group_or_band"@en.
  ?band  ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th result.)

What this query does is look for items that are declared as a type group_or_band and then pull out the genres associated with each band.
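If you want to page through a long result list programmatically, the limit/offset pattern can be sketched as below. This is only an illustration: paged_queries is a made-up helper, and the predicate in the WHERE body is a stand-in (<http://example.org/genre>), not the real DBpedia relation.

```python
def paged_queries(select_clause, where_body, page_size, pages):
    """Yield one SPARQL query string per page, stepping OFFSET by page_size."""
    for page in range(pages):
        offset = page * page_size
        yield (
            f"{select_clause} WHERE {{\n{where_body}\n}} "
            f"LIMIT {page_size} OFFSET {offset}"
        )

# Hypothetical WHERE body; <http://example.org/genre> is a stand-in predicate
body = "  ?band <http://example.org/genre> ?genre ."
queries = list(paged_queries("SELECT DISTINCT ?genre", body, page_size=50, pages=3))
print(queries[1])
```

One caveat: without an ORDER BY clause, SPARQL makes no promise about the order in which results come back, so pages fetched this way may overlap or miss rows on some endpoints.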

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipedia, if you can work out how to narrow down your search to find them… (My starting point is to browse DBpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peek at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)

— see also: http://www.geirmund.com/black-metal-family-tree/black-metal-family-tree.html

Mapping How Programming Languages Influenced Each Other According to Wikipedia

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <http://dbpedia.org/ontology/influencedBy> ?b
} WHERE {
  ?a a <http://dbpedia.org/ontology/ProgrammingLanguage>.
  ?b a <http://dbpedia.org/ontology/ProgrammingLanguage>.
  ?a <http://dbpedia.org/ontology/influencedBy> ?b.
  ?a foaf:name ?an.
  ?b foaf:name ?bn.
}

As to how I found the <http://dbpedia.org/ontology/ProgrammingLanguage> relation, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(a, as in ?x a ?y, is SPARQL shorthand for rdf:type, so the two lines use the same relation.)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return, the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we see unique relations.

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in angle brackets (<>) and replace dbpedia: with http://dbpedia.org/ to give something like <http://dbpedia.org/ontology/Philosopher>.
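That rewriting step is mechanical enough to script. Here's a small sketch; expand_dbpedia is a made-up helper name, and it only handles the dbpedia: prefix described above.

```python
def expand_dbpedia(name):
    """Turn a prefixed name such as 'dbpedia:ontology/Philosopher'
    into a full URI wrapped in angle brackets."""
    prefix = "dbpedia:"
    if not name.startswith(prefix):
        raise ValueError(f"expected a name starting with {prefix!r}: {name!r}")
    return "<http://dbpedia.org/" + name[len(prefix):] + ">"

print(expand_dbpedia("dbpedia:ontology/Philosopher"))
```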

PS see how to use a similar technique to map out musical genres ascribed to bands on Wikipedia

Visualising Related Entries in Wikipedia Using Gephi

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugins tab.)

To get DBpedia data into Gephi, we need to do three things:

– tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
– tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
– tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

SELECT * WHERE {
?p a <http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
}

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people they influenced. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

CONSTRUCT{
  ?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a <http://dbpedia.org/ontology/Philosopher> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or Eigenvector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchterman-Reingold layout algorithm, we get something like this:
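For a feel of what a PageRank style statistic computes, here's a minimal pure Python sketch using power iteration. It is not Gephi's implementation: the damping factor of 0.85, the fixed iteration count and the toy edge list are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared evenly
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank

# Hypothetical toy data: (influencer, influenced) pairs
edges = [("Plato", "Aristotle"), ("Socrates", "Plato"), ("Socrates", "Aristotle"),
         ("Aristotle", "Aquinas"), ("Plato", "Aquinas")]
rank = pagerank(edges)
for name in sorted(rank, key=rank.get, reverse=True):
    print(name, round(rank[name], 3))
```

Rank flows along the arrows, so the direction of the edges matters: with edges drawn from influencer to influenced, as here, the score accumulates on people at the end of chains of influence rather than at the start.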

The nodes are labelled in a rather clumsy way (http://dbpedia.org/page/Martin_Heidegger, for example), but we can tidy this up. Going to one of the DBpedia pages, such as http://dbpedia.org/page/Martin_Heidegger, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there is a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/> as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<http://xmlns.com/foaf/0.1/>, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced> ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHERE part of the query:

?p foaf:name ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

Looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, the result of this query may be drawn on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I’ll maybe post more on this later – in the meantime, if you’re new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, an online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

PPPPS related – a large scale version of this? Wikipedia Mining Algorithm Reveals The Most Influential People In 35 Centuries Of Human History

Visualising Networks in Gephi via a Scraperwiki Exported GEXF File

How do you visualise data scraped from the web using Scraperwiki as a network using a graph visualisation tool such as Gephi? One way is to import a two-dimensional data table (i.e. a CSV file) exported from Scraperwiki into Gephi using the Data Explorer, but at times this can be a little fiddly and may require you to mess around with column names to make sure they’re the names Gephi expects. Another way is to get the data into a graph based representation using an appropriate file format such as GEXF or GraphML that can be loaded directly (and unambiguously) into Gephi or other network analysis and visualisation tools.

A quick bit of backstory first…

A couple of related key features for me of a “data management system” (eg the joint post from Francis Irving and Rufus Pollock on From CMS to DMS: C is for Content, D is for Data) are the ability to put data into shapes that play nicely with predefined analysis and visualisation routines, and the ability to export data in a variety of formats or representations that allow that data to be readily imported into, or used by, other applications, tools, or software libraries. Which is to say, I’m into glue.

So here’s some glue – a recipe for generating a GEXF formatted file that can be loaded directly into Gephi and used to visualise networks like this one of how OpenLearn units are connected by course code and top level subject area:

The inspiration for this demo comes from a couple of things: firstly, noticing that networkx is one of the third party supported libraries on ScraperWiki (as of last night, I think the igraph library is also available; thanks @frabcus ;-); secondly, having broken ground for myself on how to get Scraperwiki views to emit data feeds rather than HTML pages (eg OpenLearn Glossary Items as a JSON feed).

As a rather contrived demo, let’s look at the data from this scrape of OpenLearn units, as visualised above:

The data is available from the openlearn-units scraper in the table swdata. The columns of interest are name, parentCourseCode, topic and unitcode. What I’m going to do is generate a graph file that represents which unitcodes are associated with which parentCourseCodes, and which topics are associated with each parentCourseCode. We can then visualise a network that shows parentCourseCodes by topic, along with the child (unitcode) course units generated from each Open University parent course (parentCourseCode).

From previous dabblings with the networkx library, I knew it’d be easy enough to generate a graph representation from the data in the Scraperwiki data table. Essentially, two steps are required: 1) create and label nodes, as required; 2) tie nodes together with edges. (If a node hasn’t been defined when you use it to create an edge, networkx will create it for you.)

I decided to create and label some of the nodes in advance: unit nodes would carry their name and unitcode; parent course nodes would just carry their parentCourseCode; and topic nodes would carry a newly created ID and the topic name itself. (The topic name is a string of characters and would make for a messy ID for the node!)

To keep Gephi happy, I’m going to explicitly add a label attribute to some of the nodes that will be used, by default, to label nodes in Gephi views of the network. (Here are some hints on generating graphs in networkx.)

Here’s how I built the graph:

import scraperwiki
import urllib
import networkx as nx

scraperwiki.sqlite.attach( 'openlearn-units' )
q = '* FROM "swdata"'
data = scraperwiki.sqlite.select(q)


for row in data:
    if topic not in topics:
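As a self-contained illustration of the two steps described above (create and label nodes, then tie them together with edges), here's a minimal sketch that runs outside Scraperwiki. The rows are made-up stand-ins for the swdata table, the topic_N ID scheme is an assumption, and the parent course nodes are left to be created implicitly by add_edge rather than labelled in advance.

```python
import networkx as nx

# Hypothetical rows standing in for the swdata table
rows = [
    {"name": "Unit one", "unitcode": "U1", "parentCourseCode": "C1", "topic": "Science"},
    {"name": "Unit two", "unitcode": "U2", "parentCourseCode": "C1", "topic": "Science"},
    {"name": "Unit three", "unitcode": "U3", "parentCourseCode": "C2", "topic": "History"},
]

G = nx.Graph()
topics = []  # position in this list gives each topic a short numeric ID

for row in rows:
    topic = row["topic"]
    if topic not in topics:
        topics.append(topic)
    topic_id = "topic_" + str(topics.index(topic))

    # Step 1: create and label nodes
    G.add_node(row["unitcode"], label=row["unitcode"], name=row["name"])
    G.add_node(topic_id, label=topic)

    # Step 2: tie nodes together; the parent course node is created
    # implicitly by add_edge because it was not added beforehand
    G.add_edge(row["unitcode"], row["parentCourseCode"])
    G.add_edge(row["parentCourseCode"], topic_id)

print(G.number_of_nodes(), G.number_of_edges())
```

Adding the same node or edge twice is harmless in networkx, which is why the loop can re-add a topic node for every row without checking whether it already exists.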

Having generated a representation of the data as a graph using networkx, we now need to export the data. networkx supports a variety of export formats, including GEXF. Looking at the documentation for the GEXF exporter, we see that it offers methods for exporting the GEXF representation to a file. But for scraperwiki, we want to just print out a representation of the file, not actually save the printed representation of the graph to a file. So how do we get hold of an XML representation of the GEXF formatted data so we can print it out? A peek into the source code for the GEXF exporter (other exporter file sources here) suggests that the functions we need can be found in the networkx.readwrite.gexf file: a constructor (GEXFWriter), and a method for loading in the graph (.add_graph()). An XML representation of the file can then be obtained and printed out using the ElementTree tostring function.

Here’s the code I hacked out as a result of that little investigation:

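import networkx.readwrite.gexf as gf

# Create a GEXF writer and load in the networkx graph built above (called G here)
writer = gf.GEXFWriter(encoding='utf-8', prettyprint=True)
writer.add_graph(G)

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)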

Note the use of the scraperwiki.utils.httpresponseheader to set the MIMEtype of the view. If we don’t do this, scraperwiki will by default publish an HTML page view, along with a Scraperwiki logo embedded in the page.
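Outside Scraperwiki, and on a current Python 3 install of networkx, a similar "GEXF as a string" result can be had without reaching into the writer class, using the generate_gexf function. The tiny graph below is made up purely for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_node("U1", label="Unit one")
G.add_node("C1", label="Course one")
G.add_edge("U1", "C1")

# generate_gexf yields the GEXF document line by line
gexf_text = "\n".join(nx.generate_gexf(G))
print(gexf_text)
```

If you do want a file rather than a string, nx.write_gexf(G, "some_file.gexf") writes the same representation to disk.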

Here’s the full code for the view.

And here’s the GEXF view:

Save this file with a .gexf suffix and you can then open the file directly into Gephi.

Hopefully, what this post shows is how you can generate your own, potentially complex, output file formats within Scraperwiki that can then be imported directly into other tools.

PS see also Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API, which shows how to generate a Google Visualisation API JSON from Scraperwiki, allowing for the quick and easy generation of charts and tables using Google Visualisation API components.

Visualising Twitter Friend Connections Using Gephi: An Example Using the @WiredUK Friends Network

To corrupt a well known saying, “cook a man a meal and he’ll eat it; teach a man a recipe, and maybe he’ll cook for you…”, I thought it was probably about time I posted the recipe I’ve been using for laying out Twitter friends networks using Gephi, not least because I’ve been generating quite a few network files for folk lately, giving them copies, and then not having a tutorial to point them to. So here’s that tutorial…

The starting point is actually quite a long way down the “how did you do that?” chain, but I have to start somewhere, and the middle’s easier than the beginning, so that’s where we’ll step in (I’ll give some clues as to how the beginning works at the end…;-)

Here’s what we’ll be working towards: a diagram that shows how the people on Twitter that @wiredUK follows follow each other:

@wireduk innerfriends

The tool we’re going to use to lay out this graph from a data file is a free, extensible, open source, cross platform Java based tool called Gephi. If you want to play along, download the datafile. (Or try with a network of your own, such as your Facebook network or social data grabbed from Google+.)

From the Gephi file menu, Open the appropriate graph file:

Gephi - file open

Import the file as a Directed Graph:

Gephi - import directed graph

The Graph window displays the graph in a raw form:

Gephi -graph view of imported graph

Sometimes a graph may contain nodes that are not connected to any other nodes. (For example, protected Twitter accounts do not publish – and are not published in – friends or followers lists publicly via the Twitter API.) Some layout algorithms may push unconnected nodes far away from the rest of the graph, which can affect generation of presentation views of the network, so we need to filter out these unconnected nodes. The easiest way of doing this is to filter the graph using the Giant Component filter.

Gephi - filter on Giant Component
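To show what a giant component filter does in principle, here's a minimal pure Python sketch. It assumes edge direction is ignored when deciding what counts as connected, and the account names are invented; it is not a description of how Gephi implements the filter internally.

```python
from collections import deque

def giant_component(nodes, edges):
    """Return the largest weakly connected set of nodes (edge direction ignored)."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start
        component, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for other in neighbours[current]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical accounts; "protected" has no visible connections
nodes = ["a", "b", "c", "d", "e", "protected"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
print(sorted(giant_component(nodes, edges)))
```

Note that this drops small connected clusters (such as d and e here) as well as completely isolated nodes, which is also worth bearing in mind when filtering a real network this way.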

To colour the graph, I often make use of the modularity statistic. This algorithm attempts to find clusters in the graph by identifying components that are highly interconnected.

Gephi - modularity statistic

This algorithm is randomised, so it’s often worth running it several times to see how many communities typically get identified.
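The score that community detection of this kind tries to push up is the modularity of a partition: roughly, how many more edges fall inside groups than you would expect by chance. Here's a minimal sketch that computes it for a given grouping of an undirected, unweighted graph. The modularity function and the toy graph are illustrative only; finding a good partition in the first place is the hard part, and is not shown.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community id.
    """
    m = len(edges)
    internal = {}  # number of edges with both ends inside each community
    degree = {}    # summed degree of the nodes in each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

# Two triangles joined by a single edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
one_group = {n: 0 for n in range(1, 7)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the graph into its two triangles scores higher than lumping everything together, which matches the intuition that the triangles are the natural clusters.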

A brief report is displayed after running the statistic:

Gephi - modularity statistic report

While we have the Statistics panel open, we can take the opportunity to run another measure: the HITS algorithm. This generates the well known Authority and Hub values which we can use to size nodes in the graph.

Gephi - HITS statistic
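For the curious, the idea behind HITS can be sketched in a few lines of Python: good hubs point to good authorities, and good authorities are pointed to by good hubs, with the two scores updated in turn. This sketch normalises scores so they sum to one and runs a fixed number of iterations, both of which are simplifying assumptions, and the follower data is invented.

```python
def hits(edges, iterations=50):
    """Iteratively estimate hub and authority scores for directed edges
    given as (follower, followed) pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    authority = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A node's authority is the sum of the hub scores of nodes linking to it
        authority = {n: 0.0 for n in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        # A node's hub score is the sum of the authority scores it links to
        hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        # Normalise so the scores stay bounded
        a_total = sum(authority.values()) or 1.0
        h_total = sum(hub.values()) or 1.0
        authority = {n: s / a_total for n, s in authority.items()}
        hub = {n: s / h_total for n, s in hub.items()}
    return hub, authority

# Hypothetical follow relationships
edges = [("ann", "news"), ("bob", "news"), ("cat", "news"), ("ann", "bob")]
hub, authority = hits(edges)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```

In a Twitter friends network, a high authority score therefore picks out accounts that lots of well connected people in the network follow.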

The next step is to actually colour the graph. In the Partition panel, refresh the partition options list and then select Modularity Class.

Gephi - select modularity partition

Choose appropriate colours (right click on each colour panel to select an appropriate colour for each class – I often select pastel colours) and apply them to the graph.

Gephi - colour nodes by modularity class

The next thing we want to do is lay out the graph. The Layout panel contains several different layout algorithms that can be used to support the visual analysis of the structures inherent in the network (try some of them – each works in a slightly different way; some are also better than others for coping with large networks). For a network this size and this densely connected, I’d typically start out with one of the force directed layouts, which position nodes according to how tightly linked they are to each other.

Gephi select a layout

When you select the layout type, you will notice there are several parameters you can play with. The default set is often a good place to start…

Run the layout tool and you should see the network start to lay itself out. Some algorithms require you to actually Stop the layout algorithm; others terminate themselves according to a stopping criterion, or because they are a “one-shot” application (such as the Expansion algorithm, which just scales the x and y values by a given factor).

Gephi - forceAtlas 2

We can zoom in and out on the layout of the graph using a mouse wheel (on my MacBook trackpad, I use a two finger slide up and down), or use the zoom slider from the “More options” tab:

Gephi zoom

To see which Twitter ID each node corresponds to, we can turn on the labels:

Gephi - labels

This view is very cluttered – the nodes are too close to each other to see what’s going on. The labels and the nodes are also all the same size, giving the same visual weight to each node and each label. One thing I like to do is resize the nodes relative to some property, and then scale the label size to be proportional to the node size.

Here’s how we can scale the node size and then set the text label size to be proportional to node size. In the Ranking panel, select the node size property, and the attribute you want to make the size proportional to. I’m going to use Authority, which is a network property that we calculated when we ran the HITS algorithm. Essentially, it’s a measure of how well linked-to a node is: a node scores highly when it is followed by nodes that are themselves good hubs.

Gephi - node sizing

The min size/max size slider lets us define the minimum and maximum node sizes. By default, a linear mapping from attribute value to size is used, but the spline option lets us use a non-linear mapping.

Gephi - node sizing spline

I’m going with the default linear mapping…

Gephi - size nodes
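The linear mapping from attribute value to node size amounts to a simple rescaling between the chosen minimum and maximum sizes. A minimal sketch, with a made-up linear_sizes helper and invented authority scores:

```python
def linear_sizes(values, min_size, max_size):
    """Map each attribute value linearly onto the range [min_size, max_size]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    sizes = {}
    for node, value in values.items():
        # If every value is the same there is nothing to scale by
        fraction = (value - lo) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes

authority = {"a": 0.0, "b": 0.25, "c": 1.0}  # hypothetical scores
sizes = linear_sizes(authority, min_size=10, max_size=50)
print(sizes)
```

A non-linear (spline style) mapping would replace the straight-line fraction with a curve, so that, for example, small differences among the top ranked nodes are exaggerated or damped.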

We can now scale the labels according to node size:

Gephi - scale labels

Note that you can continue to use the text size slider to scale the size of all the displayed labels together.

This diagram is now looking quite cluttered – to make it easier to read, it would be good if we could spread it out a bit. The Expansion layout algorithm can help us do this:

Gephi - expansion

A couple of other layout algorithms that are often useful: the Transformation layout algorithm lets us scale the x and y axes independently (compared to the Expansion algorithm, which scales both axes by the same amount); and the Clockwise Rotate and Counter-Clockwise Rotate algorithms let us rotate the whole layout (this can be useful if you want to rotate the graph so that it fits neatly into a landscape view).
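These "one-shot" layouts are just coordinate transformations applied to every node. Here's a minimal sketch of the three operations on a dictionary of node positions. The function names are made up, and the rotation here is about the origin with the y axis pointing up, which may not match the pivot or screen orientation Gephi actually uses.

```python
import math

def expand(positions, factor):
    """Scale x and y by the same factor."""
    return {n: (x * factor, y * factor) for n, (x, y) in positions.items()}

def transform(positions, x_factor, y_factor):
    """Scale the x and y axes independently."""
    return {n: (x * x_factor, y * y_factor) for n, (x, y) in positions.items()}

def rotate(positions, degrees):
    """Rotate about the origin; positive angles are counter-clockwise
    in the usual mathematical convention (y axis pointing up)."""
    t = math.radians(degrees)
    return {
        n: (x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))
        for n, (x, y) in positions.items()
    }

positions = {"a": (1.0, 2.0), "b": (-3.0, 0.5)}  # hypothetical layout
print(expand(positions, 2))
print(transform(positions, 3, 1))
print(rotate(positions, 90))
```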

The expanded layout is far easier to read, but some of the labels still overlap. The Label Adjust layout tool can jiggle the nodes so that their labels don’t overlap.

gephi - label adjust

(Note that you can also move individual nodes by clicking on them and dragging them.)

So – nearly there… The final push is to generate a good quality output. We can do this from the preview window:

Gephi preview window

The preview window is where we can generate good quality SVG renderings of the graph. The node size, colour and scaled label sizes are determined in the original Overview area (the one we were working in), although additional customisations are possible in the Preview area.

To render our graph, I just want to make a couple of tweaks to the original Default preview settings: Show Labels and set the base font size.

Gephi - preview settings

Click on the Refresh button to render the graph:

Gephi - preview refresh

Oops – I overdid the font size… let’s try again:

gephi - preview resize

Okay – so that’s a good start. Now I find I often enter into a dance between the Preview and Overview panels, tweaking the layout until I get something I’m satisfied with, or at least, that’s half-way readable.

How to read the graph is another matter of course, though by using colour, sizing and placement, we can hopefully draw out in a visual way some interesting properties of the network. The recipe described above, for example, results in a view of the network that shows:

– groups of people who are tightly connected to each other, as identified by the modularity statistic and consequently group colour; this often defines different sorts of interest groups. (My follower network shows distinct groups of people from the Open University, and JISC, the HE library and educational technology sectors, UK opendata and data journalist types, for example.)
– people who are well connected in the graph, as displayed by node and label size.

Here’s my final version of the @wiredUK “inner friends” network:

@wireduk innerfriends

You can probably do better though…;-)

To recap, here’s the recipe again:

– filter on connected component (private accounts don’t disclose friend/follower detail to the API key I use) to give a connected graph;
– run the modularity statistic to identify clusters; sometimes I try several attempts
– colour by modularity class identified in previous step, often tweaking colours to use pastel tones
– I often use a force directed layout, then Expansion to spread the network out a bit if necessary; the Clockwise Rotate or Counter-Clockwise Rotate will rotate the network view; I often try to get a landscape format; the Transformation layout lets you expand or contract the graph along a single axis, or both axes by different amounts.
– run HITS statistic and size nodes by authority
– size labels proportional to node size
– use Label Adjust and Expansion to tweak the layout
– use preview with proportional labels to generate a nice output graph
– iterate previous two steps to get a layout that is hopefully not completely unreadable…

Got that?!;-)

Finally, to return to the beginning. The recipe I use to generate the data is as follows:

  1. grab a list of twitter IDs (call it L); there are several ways of doing this, for example: obtain a list of tweets on a particular topic by searching for a particular hashtag, then grab the set of unique IDs of people using the hashtag; grab the IDs of the members of one or more Twitter lists; grab the IDs of people following or followed by a particular person; grab the IDs of people sending geo-located tweets in a particular area;
  2. for each person P in L, add them as a node to a graph;
  3. for each person P in L, get a list of the people followed by that person, call it Fr(P);
  4. for each X in Fr(P): if X is also in L, create an edge [P,X] and add it to the graph;
  5. save the graph in a format that can be visualised in Gephi.
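The steps above can be sketched in Python as follows. To keep the example self-contained, the friends lists come from a made-up dictionary (friends_of) rather than from live Twitter API calls, and the final save step is left out; inner_friends_edges is an illustrative name, not part of any library.

```python
def inner_friends_edges(ids, friends_of):
    """Return directed edges (P, X) where P and X are both in ids and P follows X."""
    members = set(ids)
    edges = []
    for person in ids:
        for friend in friends_of.get(person, []):
            if friend in members:
                edges.append((person, friend))
    return edges

# Hypothetical data standing in for calls to the Twitter API
L = ["alice", "bob", "carol"]
friends_of = {
    "alice": ["bob", "dave"],   # dave is not in L, so no edge is created
    "bob": ["alice", "carol"],
    "carol": [],
}
edges = inner_friends_edges(L, friends_of)
print(edges)
```

The resulting edge list, together with the IDs in L as nodes, is what then gets written out in a graph format for Gephi to open.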

To make this recipe, I use Tweepy and a Python script to call the Twitter API and get the friends lists from there, but you could use the Google Social API to get the same data. There’s an example of calling that API using JavaScript in my “live” Twitter friends visualisation script (Using Protovis to Visualise Connections Between People Tweeting a Particular Term) as well as in the A Bit of NewsJam MoJo – SocialGeo Twitter Map.