Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values for the same sort of classification-type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genre, we might draw edges between two nodes show that a particular band has been labelled as falling within each of those two genres.

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:
CONSTRUCT{
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA  ?genreB .
  ?genreB  ?genreA .
} WHERE {
?band  .
?band  "group_or_band"@en.
?band  ?genreA.
?band  ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en")  && langMatches(lang(?genreBname), "en"))
}

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. Note that this approach was discovered through idle browsing of the properties of several bands. Instead of:
?band "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a >http://schema.org/MusicGroup>.
or:
?band a .

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

SELECT DISTINCT ?genre WHERE {
  ?band  "group_or_band"@en.
  ?band  ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th(?) result.)

What this query does is look for items that are declared as a type group_or_band and then pull out the genres associated with each band.

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipeda, if you can work out how to narrow down your search to find them… (My starting point is to browse DPpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peak at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)

— see also: http://www.geirmund.com/black-metal-family-tree/black-metal-family-tree.html

Mapping How Programming Languages Influenced Each Other According to Wikipedia

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <http://dbpedia.org/ontology/influencedBy> ?b
} WHERE {
?a a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?b a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a foaf:name ?an.
?b foaf:name ?bn.
}

As to how I found the <http://dbpedia.org/ontology/ProgrammingLanguage&gt; relation, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(I think a (as in ?x a ?y and rdf:type are synonyms?)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return, the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we see unique relations.

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in and brackets (<>) and replace dbpedia: with http://dbpedia.org/ to give something like <http://dbpedia.org/ontology/Philosopher&gt;.

PS see how to use a similar technique to map out musical genres ascribed to bands on WIkipedia

Visualising Related Entries in Wikipedia Using Gephi

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugin.)

To get DBpedia data into Gephi, we need to do three things:

– tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
– tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
– tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

SELECT *
WHERE {
?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
}

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people who influenced them. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

CONSTRUCT{
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or EigenVector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchtermann-Rheingold layout algorithm, we get something like this:

The nodes are labelled in a rather clumsy way – http://dbpedia.org/page/Martin_Heidegger – for example, but we can tidy this up. Going to one of the DPpedia pages, such as http://dbpedia.org/page/Martin_Heidegger, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/&gt; as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<http://xmlns.com/foaf/0.1/&gt;, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced&gt; ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHEER part of the query:

?p <foaf:name> ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a
  <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, if you run this query it may appear on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I’ll maybe post more on this later – in the meantime, if you’re new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, on online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

PPPPS related – a large scale version of this? Wikipedia Mining Algorithm Reveals The Most Influential People In 35 Centuries Of Human History

Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets

Another Olympics, so another chance to look at how to construct a real time medal map using the tricks and techniques that are available to us today, cf. the approach I took in 2008. In this post, I’ll show how to create a medal map using Wikipedia data and Google spreadsheet query functions in Google Apps Standard Edition (this will also work in Google Apps for Edu, Business and so on, but not (currently) in the ‘everyday’ edition of Google spreadsheets that you get with a personal Google account. Why not? Because I make use of Google Apps Script, that’s why…)

To start with, let’s get some data. Wikipedia is surprisingly good as a near realtime data source for large sporting events, so let’s see what’s on the Winter Olympics 2010 medals page:

We can pull this data into a Google spreadsheet using the =importHTML() formula:

***NOTE – THE PAGE LAYOUT HAS CHANGED AND WITH IT TABLE NUMBER. THE FORMULA SHOULD NOW READ:
=ImportHtml(“http://en.wikipedia.org/wiki/2010_Winter_Olympics_medal_table&#8221;,”table”,2) ***
[Update: bah – back to table 3 again…oh, the joys or arbitrary HTML scraping!)

In order to plot medals in a “heat map” using the Google heatmap widget:

we need to provide two columns of data: an ISO country code and the number value we want to associate with that country code.

Looking at the medals table we have imported, we see that we have country name and a non-ISO country code.

Let’s start by extracting the country name using a Google Apps Script (tutorials). These scripts are user defined formulae that you can create and then use in the same way that you would use any of the normally provided spreadsheet formulae. Open the script editor from the spreadsheet Tools menu:

Write your function(s):

then call them in the spreadsheet:

The next step is to get the ISO country code for each country. I’m going to do this using a QUERY formula that interrogates a list of ISO country codes in another sheet (see Using Google Spreadsheets Like a Database – The QUERY Formula for more on the Google Spreadsheets QUERY function):

At its heart, a particular QUERY will look something like this:

=QUERY('ISO Country Codes'!A2:B268,"select A,B where A contains 'FRANCE' limit 1")))

The limit 1 ensures we only return a single result.

To construct the QUERY from a country name specisifed in a cell, we embed a CONCATENATE formula to construct the select string:

=QUERY('ISO Country Codes'!A2:B268,concatenate("select A,B where A contains '",upper(H2),"' limit 1")))

The contains operator is case sensitive, so as we have country names in all upper case in our ISO sheet, we convert the country name pulled in from the medal table to upper case as well. (I couldn’t get the more appropriate matches expression to work?)

So now we can get a column of country names (again) along with the corresponding ISO country code:

(The IF statement is there to keep things tidy.)

If we now copy across medal data into a column alongside the coutnry code column, we can create a heat map widget:

This widget is also available as a Google gadget:

Although there is some latency in the system, as changes are made to the original WIkipedia page, the Google spreadsheet should occasionally check it as and when the spreadsheet – or the heat map – is viewed.

So there you have it, a real time Medal Map, with data pulled into a Google spreadsheet from a Wikipedia page using an importHTML function, tidied using a Google Apps script function, used to key a QUERY in an ISO codes spreadsheet/database, and then rendered via a third party Google gadget.

Overall hack time – about half an hour, first time round. Time to write this post – over an hour…

PS see also Data Scraping Wikipedia with Google Spreadsheets