Search results for: wikipedia

Semantic Cartography – Mapping Dodgy Goth Bands With Common Members Using Wikipedia Data

Several years ago I did some doodles using the Gephi network visualiser Semantic Web Import plugin to sketch out how various sorts of thing (philosophers, music genres, programming languages) were related in Wikipedia (or at least, DBpedia, the semantic web derivative of Wikipedia). A couple of days ago, I started sketching some new queries in a Jupyter IPython notebook to generate a wider range of maps, using the networkx package to analyse the results locally, as well as building and export a graph that I could then visualise in Gephi.

The following bit of code provides a simple function for running a SPARQL query against a SPARQL endpoint, such as the DBpedia endpoint. It also accepts a set of prefix definitions for the query.

from SPARQLWrapper import SPARQLWrapper, JSON

#Add some helper functions
def runQuery(endpoint,prefix,q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''
    sparql = SPARQLWrapper(endpoint)
    return sparql.query().convert()


prefix gephi:<>
prefix foaf: <>
prefix dbp: <>
prefix dbr: <>
prefix dbc: <>
prefix dct: <>
prefix rdfs: <>
prefix yago: <>
prefix xsd: <>

Here’s an example of the style of query I explored a few years ago – it identifies a thing that’s a band in a particular genre, and then tries to find other genres associated with that band. Each combination of genres adds an edge to the resulting graph. The FILTER element makes sure that we make edges between different genres.

SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?band dbp:genre dbr:{}.
?band <> "group_or_band"@en.
?band dbp:genre ?a.
?band dbp:genre ?b.
?a dbp:name ?an.
?b dbp:name ?bn.
FILTER(?a != ?b && langMatches(lang(?an), "en")  && langMatches(lang(?bn), "en"))


Another simple function takes the resulting edge list and creates a node labeled graph from it using the networkx library. We can then export a graph file from this network that can be visualised in Gephi. (On my to do list is using networkx to  calculate some simple network statistics and generate a first attempt at a preview layout automatically, rather than doing it by hand in Gephi, which is what I do at the moment…)

def nxGrapher_hack(response,config,typ='undirected'):
    ''' typ: forward | reverse | undirected'''
    if typ=='undirected':
        G = nx.Graph()
        G = nx.DiGraph()

    for r in response['results']['bindings']:
        G.add_node(r[fr]['value'], label=r[fr_l]['value'])
        G.add_node(r[to]['value'], label=r[to_l]['value'])
        if typ=='reverse':
    return G

G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')})
nx.write_gexf(G, "music_{}.gexf".format(m))

Here’s the sort of map/graph we can generate as a result:

As well as genre information, we can look up information about band members, such as the current or previous members of a particular band*.


*Since generating the data files last night, and running them again today, a whole raft of bander membership details appear to have disappeared. WTF?! Now I remember another of the reasons I keep avoiding the semantic web – it’s as flakey as anything and you can never tell if the problem is yours, someone else’s or the result of an update (or downgrade) in the data!

What this means is that we can anchor a query on a band, and find the current or previous members. In the following snippet, the single braces (“{}”@en) are replaced by the value of the declared band name:

m="The Mission (band)"
SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?x <> "group_or_band"@en.
?x rdfs:label "{}"@en.

?a <> "group_or_band"@en.
?a rdfs:label ?an.

?b rdfs:label ?bn.
?b a dbo:Person.
{{?a dbp:pastMembers ?b.}} UNION
{{?a dbp:currentMembers ?b.}}.
{{?x dbp:pastMembers ?b.}} UNION
{{?x dbp:currentMembers ?b.}}

FILTER((lang(?an)=&amp;amp;quot;en&amp;amp;quot;) &amp;amp;amp;&amp;amp;amp; (lang(?bn)=&amp;amp;quot;en&amp;amp;quot;) &amp;amp;amp;&amp;amp;amp; !(STRSTARTS(?bn,&amp;amp;quot;List of&amp;amp;quot;)) &amp;amp;amp;&amp;amp;amp; !(STRSTARTS(?an,&amp;amp;quot;List of&amp;amp;quot;)))


G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')})
nx.write_gexf(G, "band_{}.gexf".format(m))

A slight tweak to the code lets us replace the anchoring (that is, the search) around a single band name to a set of band names. This allows us to get the current and previous members of all the declared bands.

m=['The Mission (band)','The Cult','The Sisters of Mercy','Fields of the Nephilim','All_About_Eve_(band)']

?x rdfs:label "{}"@en.

ms=''' UNION
'''.join(['{'+p.format(i)+'}' for i in m])

#In the query, replace ?x rdfs:label &amp;amp;quot;{}&amp;amp;quot;@en. with {}
#In the format method, replace m with ms

Rather than searching around one or more bands, we could instead hook into bands associated with a particular genre. Rather than anchoring around ?x rdfs:label "{}"@en, for example, use ?x dbp:genre dbr:{}. This then lets us generate views of the following form:


As well as mapping the territory around particular musical genres, we can also generate maps for other contexts, such as around particular art movements. For example:

SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?movement dct:subject dbc:Art_movements.
?movement dct:subject dbc:{}.
?artist dbp:movement ?movement.
?artist dbp:movement ?a.
?artist dbp:movement ?b.
?a rdfs:label ?an.
?b rdfs:label ?bn.
FILTER(?a != ?b && (lang(?an)="en") && (lang(?bn)="en"))

G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')})
nx.write_gexf(G, "art_{}.gexf".format(m))

Or we can tap into other ontologies to limit our searches, and generate a range of influence maps:

SELECT ?a ?an ?b ?bn WHERE {{
  ?a a yago:{typ} .
  ?b a yago:{typ} .
  ?a rdfs:label ?an.
  ?b rdfs:label ?bn.
  {{?a <> ?b.}}
   UNION {{
  ?b <> ?a.
G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')},typ='forward')
nx.write_gexf(G, "influence_{}.gexf".format(y))

So why bother?

Here are several reasons: first, because it’s interesting/fun/recreational; secondly, it allows us to compare our own mental model of the wider context around a particular genre or movement with the Wikipedia version; thirdly, if we’re expert, it might allow us to spot gaps or errors in the Wikipedia data, and fix it; fourthly, these sorts of data collections are used to make recommendations to you, so it helps to get a feel for the sorts of things they can represent, the relations they claim exist, and the ways they can go wrong, so you trust the machines a little bit less, or are least, a little bit more informedly.

PS One of the reasons for grabbing the data using Python was because Gephi has recently undergone an update, and the extensions developed for the earlier version are still being migrated. However, checking today, I notice that the SemanticWebImport plugin has made it across, so it should be possible to run variants of the queries directly in Gephi. See the previous posts for examples.

Fragments – Wikipedia to Markdown

I’ve been sketching some ideas, pondering the ethics of doing an F1 review style book blending (openly licensed) content from Wikipedia race reports with some of my own f1datajunkie charts, and also wondering about the extent to which I could automatically generate Wikipedia style race report sentences from the data; I think the sentence generation, in general should be quite easy – the harder part would be identifying the “interesting” sentences (that is, the ones that make it into the report, rather than than the totality of ones that could be generated).

So far, my sketches have been based around just grabbing the content from Wikipedia, and transforming to markdown, the markup language used in the Leanpub workflow:

In Python 3.x at least, I came across some encoding issues, and couldn’t seem to identify Wikipedia page sections. For what it’s worth, a minimal scribble looks something like this:

!pip3 install wikipedia
import wikipedia

#Search for page titles on Wikipedia'2014 Australian grand prix')

#Load a page'2014 Australian Grand Prix')

#Preview page content

#Preview a section's content by section header
##For some reason, f1.sections shows an empty list for me?

#pandoc supports Wikimedia to markdown conversion
!apt-get -y install pandoc
!pip3 install pypandoc
import pypandoc

#To work round encoding issues, write the content to a file and then convert it...
f = open('delme1.txt', 'w', encoding='utf8')

md=pypandoc.convert('delme1.txt', 'md', format='mediawiki')

If the Formula One race report pages follow similar templates and use similar headings, then it should be straightforward enough to pull down sections of the reports and interleave them with charts and tables. (As well as issues parsing out section headers to fill the sections list, the tables on the page don’t appear to be grabbed into the .content field (assuming the API wrapper does manage to grab that content down? However, I can easily recreate those from things like the ergast API).

Looking at the construction of sentences in the race reports, many of them are formulaic. However, as noted above, generating sentences is one thing, but generating interesting sentences is another. For that, I think we need to identify sets of rules that mark data features out as interesting or not before generating sentences from them.

Tracking Anonymous Wikipedia Edits From Specific IP Ranges

Via @davewiner’s blog, I spotted a link to @congressedits, “a bot that tweets anonymous Wikipedia edits that are made from IP addresses in the US Congress”. (For more info, see why @congressedits?, /via @ostephens.) I didn’t follow the link to the home page for that account (doh!), but in response to a question about whether white label code was available, @superglaze pointed me to, a script that “will watch Wikipedia for edits from a set of named IP ranges and will tweet when it notices one”.

It turns out the script was inspired by @parliamentedits, a bot built by @tomscott that “tracks edits to Wikipedia made from Parliamentary IP addresses” built using IFTT and possibly a list of IP ranges operated by the House of Commons gleaned from this FOI request?


My immediate thought was set up something to track edits made to Wikipedia from OU IP addresses, then idly wondered if set of feeds for tracking edits from HEIs in general might also be useful (something to add to the UK University Web Observatory for example?)

To the extent that Wikipedia represents an authoritative source of information, for some definition of authoritative(?!), it could be interesting to track the “impact” of our foolish universities in terms of contributing to the sum of of human knowledge as represented by Wikipedia.

It’d also be interesting to track the sorts of edits made from anonymous and named editors from HEI IP ranges. I wonder what classes they may fall into?

  1. edits from the marketing and comms folk?
  2. ego and peer ego edits, eg from academics keeping the web pages of other academics in their field up to date?
  3. research topic edits – academics maintaining pages that relate to their research areas or areas of scholarly interest?
  4. teaching topic edits – academics maintaining pages that relate to their teaching activities?
  5. library edits – edits made from the library?
  6. student edits – edits made by students as part of a course?
  7. “personal” edits – edits made by folk who class themselves and Wikimedians in general and just happen to make edits while they are on an HEI network?

My second thought was to wonder to what extent might news and media organisations be maintaining – or tweaking – Wikipedia pages? The BBC, for example, who have made widespread use of Wikipedia in their Linked Data driven music and wildlife pages.

Hmmm… news.. reminds me: wasn’t a civil servant who made abusive edits to a Wikipedia page sacked recently? Ah, yes: Civil servant fired after Telegraph investigation into Hillsborough Wikipedia slurs, as my OU colleague Andrew Smith suggested might happen.

Or how about other cultural organisations – museums and galleries for example?

Or charities?

Or particular brands? Hmm…

So I wonder: could we try to identify areas of expertise on, or attempted/potential influence over, particular topics by doing reverse IP lookups from pages focussed on those topics? This sort of mapping activity pivots the idea of visualising related entries in Wikipedia to map IP ranges, and perhaps from that locations and individuals associated with maintaining a set of resources around a particular topic area (cf. Visualising Delicious Tag Communities).

I think I started looking at how we might start to map IP ranges for organisations once….? Erm… maybe not, actually: it was looking up domains a company owned from its nameservers.

Hmm.. thinks… webstats show IP ranges of incoming requests – can we create maps from those? In fact, are there maps/indexes that give IP ranges for eg companies or universities?

I’m rambling again…

PS Related: Repository Googalytics – Visits from HEIs which briefly reviews the idea of tracking visits to HEI repositories from other HEIs…

Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values for the same sort of classification-type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genre, we might draw edges between two nodes show that a particular band has been labelled as falling within each of those two genres.

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:<>
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA <> ?genreB .
  ?genreB <> ?genreA .
?band <> <>.
?band <> "group_or_band"@en.
?band <> ?genreA.
?band <> ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en")  && langMatches(lang(?genreBname), "en"))

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. Note that this approach was discovered through idle browsing of the properties of several bands. Instead of:
?band <; "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a >;.
?band a <;.

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

  ?band <> "group_or_band"@en.
  ?band <> ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th(?) result.)

What this query does is look for items that are declared as a type group_or_band and then pull out the genres associated with each band.

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipeda, if you can work out how to narrow down your search to find them… (My starting point is to browse DPpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peak at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)

Mapping How Programming Languages Influenced Each Other According to Wikipedia

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<>
prefix foaf: <>
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <> ?b
?a a <>.
?b a <>.
?a <> ?b.
?a foaf:name ?an.
?b foaf:name ?bn.

As to how I found the <; relation, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

?a <> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(I think a (as in ?x a ?y and rdf:type are synonyms?)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return, the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we see unique relations.

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in and brackets (<>) and replace dbpedia: with to give something like <;.

PS see how to use a similar technique to map out musical genres ascribed to bands on WIkipedia

Visualising Related Entries in Wikipedia Using Gephi

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugin.)

To get DBpedia data into Gephi, we need to do three things:

– tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
– tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
– tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

?p a
<> .
?p <> ?influenced.

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people who influenced them. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

?p <> ?influenced.
  ?p a
<> .
?p <> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or EigenVector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchtermann-Rheingold layout algorithm, we get something like this:

The nodes are labelled in a rather clumsy way – – for example, but we can tidy this up. Going to one of the DPpedia pages, such as, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<; as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<;, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <; ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHEER part of the query:

?p <foaf:name> ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<>
prefix foaf: <>
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <> ?influence
  ?philosopher a
  <> .
  ?philosopher <> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, if you run this query it may appear on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I’ll maybe post more on this later – in the meantime, if you’re new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, on online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

PPPPS related – a large scale version of this? Wikipedia Mining Algorithm Reveals The Most Influential People In 35 Centuries Of Human History

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene scetting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and satrt to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:


And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the WIkipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…


Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder Block

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identify before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with it’s original value – $1 – as well as “, UK” – $1, UK

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.


And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a YahooGoogle map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quote a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Computer Spirits…

I doubt there are many readers of this blog who aren’t familiar with science fiction guru Arthur C. Clarke’s adage that “[a]ny sufficiently advanced technology is indistinguishable from magic”. And there may even be a playful few who invoke Rowlingesque spells on the commandline using Harry Potter bash aliases. So I was wondering again today about what other magical or folkloric ideas could be used to help engage folk’s curiosity in how the world of tech works, and maybe teach computing related ideas through stories.

For example, last week I noticed that a reasonable number of links on Wikipedia point to the Internet Archive.

I also picked from a recent Recode/Decode podcast interview between the person you may know as the awesomest tech interviewer ever, Kara Swisher, and Internet Archive champion, Brewster Kahle, that bots do the repair work. So things like the User:InternetArchiveBot and/or CyberBot II maybe? Broken links are identified, and link references updated to point to archival copies. (For more info, see: More than 1 million formerly broken links in English Wikipedia updated to archived versions from the Wayback Machine and Fixing broken links in Wikipedia (especially the comments).)

Hmm… helpful bots.. like helpful spirits, or Brownies in a folkloric sense. Things that come out at night and help invisibly around the home…

And if there are helpful spirits, there are probably malicious ones too. The code equivalent of boggarts and bogles that cause mischief or mayhem – robot phone callers, or scripts that raise pop-ups when you’re trying to read a post online, for example? Maybe we if we start to rethink of online tech inconveniences as malevolent spirits we’ll find better ways to ignore or dispel them?! Or at least find a way to engage people into thinking about them, and from that working out how best to get rid of them or banish them from our lives?

PS the problem of Link Rot is an issue for maintaining OU course materials too. As materials are presented year on year, link targets move away and/or die. Sometimes the materials are patched with a corrected link to wherever the resource moved to, other times we refresh materials and find a new resource to link to. But generally, I wonder, why don’t we make like Wikipedia and get a Brownie to help? Are there Moodle bots to do helpful work like this around the VLE?

Tabloid Data Journalism?

At the risk of coming across as a bit snobbish, this ad for a Data Journalist for The Penny Hoarder riled me somewhat…

Do you have a passion for telling stories with data? We’re looking for a data journalist who can crunch statistics about jobs, budgeting, spending and saving — and produce compelling digital content that resonates with our readers. You should have expertise in data mining and analysis, and the ability to present the results in conversational, fun articles and/or telling graphics.

As our data journalist, you will produce revealing, clickable, data-driven articles and/or graphics, plus serve as a resource for our growing team of writers and editors. We envision using data sources such as the Bureau of Labor Statistics and U.S. Census Bureau to report on personal finance issues of interest to our national readership of young professionals, coupon fans and financially striving people of all ages. We want to infuse our blog with seriously interesting data while staying true to our vibe: fun, weird, useful.

Our ideal candidate…
– …
– Can write in a bloggy, conversational voice that emphasizes what the data means to real people
– Has a knack for identifying clicky topics and story angles that are highly shareable
– Gets excited when a blog post goes viral
– …

According to Wikipedia (who else?!;-), Tabloid journalism is a style of journalism that emphasizes sensational crime stories, gossip columns about celebrities and sports stars, junk food news and astrology.

(Yes, yes, I know, I know, tabloid papers can also do proper, hard hitting investigative journalism… But I’m thinking about that sense of the term…)

So what might tabloid data journalism be? See above?

PS ish prompted by @SophieWarnes, it’s probably worth mentioning the aborted Ampp3d project in this context… eg Ampp3d launches as ‘socially-shareable data journalism’ site, Martin Belam talks about Trinity Mirror’s data journalism at Ampp3d and The Mirror Is Making Widespread Cuts To Its Online Journalism.

PPS …and a write-up of that by Sophie: Is there room for ‘tabloid data journalism’?

Trawling the Companies House API to Generate Co-Director Networks

Somewhen ago (it’s always somewhen ago; most of the world never seems to catch up with what’s already happened!:-( I started dabbling with the OpenCorporates API to generate co-director corporate maps that showed companies linked by multiple directors. It must have been a bad idea because no-one could see any point in it, not even interestingness…  (Which suggests to me that boards made up of directors are similarly meaningless? In which case, how are companies supposed to hold themselves to account?)

I tend to disagree. If I hadn’t been looking at connected companies around food processing firms, I would never have learned that one that meat processors cope with animal fat waste is to feed it into the biodiesel raw material supply chain.

Anyway, if we ever get to see a beneficial ownership register, a similar approach should work to generate maps showing how companies sharing beneficial owners are linked. (The same approach also drives my emergent social positioning Twitter maps and the Wikipedia semantic maps I posted about again recently.)

As a possible precursor to that, I thought I’d try to reimplement the code (in part to see if a better approach came to mind) using data grabbed directly from Companies House via their API. I’d already started dabbling with the API (Chat Sketches with the Companies House API) so it didn’t take much more to get a grapher going…

But first, I realise in that earlier post I’d missed the function for actually calling the API – so here it is:

import urllib2, base64, json
from urllib import urlencode
from time import sleep

def url_nice_req(url,t=300):
        return urllib2.urlopen(url)
    except HTTPError, e:
        if e.code == 429:
            print("Overloaded API, resting for a bit...")
            return url_req(url)

#Inspired by
def ch_request(CH_API_TOKEN,url,args=None):
    if args is not None:
    request = urllib2.Request(url)
    # You need the replace to handle encodestring adding a trailing newline 
    # (
    base64string = base64.encodestring('%s:' % (CH_API_TOKEN)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)   
    result = url_nice_req(request)

    return json.loads(


In the original implementation, I stored the incremental search results in a dict; in the reimplementation, I thought I’d make use of a small SQLite database.

import sqlite3
if db in locals():
db = sqlite3.connect(tmpDB)
c = db.cursor()

for drop in ['directorslite','companieslite','codirs','coredirs','singlecos']:
    c.execute('''drop table if exists {}'''.format(drop))
c.execute('''create table directorslite
         (dirnum text primary key,
          dirdob integer,
          dirname text)''')

c.execute('''create table companieslite
         (conum text primary key,
          costatus text,
          coname text)''')

c.execute('''create table codirs
         (conum text,
          dirnum text,
          typ text,
          status text)''')

c.execute('''create table coredirs
         (dirnum text)''')

c.execute('''create table singlecos
         (conum text,
          coname text)''')


The code itself runs in two passes. The first pass builds up a seed set of directors from a single company or set of companies using a simple harvester:

def updateOnCo(seed,typ='current',role='director'):
    print('harvesting {}'.format(seed))
          'dirdob':p['date_of_birth']['year'] if 'date_of_birth' in p else None,
          'dirname':p['name']} for p in o]
    for y in x:
        if y['dirnum'] not in dirsdone:
        if isinstance(z, dict): z=[z]
    print('Adding {} directors'.format(len(z)))
    c.executemany('INSERT INTO directorslite (dirnum, dirdob,dirname)'
                     'VALUES (:dirnum,:dirdob,:dirname)', z)
    for oo in [i for i in o if i['links']['officer']['appointments'].strip('/').split('/')[1] not in dirsparsed]:
        print('New director: {}'.format(oid))
        #Play nice with the api
        #add company details
          'costatus':p['appointed_to']['company_status'] if 'company_status' in p['appointed_to'] else '',
          'coname':p['appointed_to']['company_name'] if 'company_name' in p['appointed_to'] else ''} for p in ooo['items']]
        for y in x:
            if y['conum'] not in cosdone:
        if isinstance(z, dict): z=[z]
        print('Adding {} companies'.format(len(z)))
        c.executemany('INSERT INTO companieslite (conum, costatus,coname)'
                     'VALUES (:conum,:costatus,:coname)', z)
        for i in x:cosdone.append(i['conum'])
        #add company director links
            'typ':'current','status':'director'} for p in ooo['items']]
        c.executemany('INSERT INTO codirs (conum, dirnum,typ,status)'
                     'VALUES (:conum,:dirnum,:typ,:status)', x)
        print('Adding {} company-directorships'.format(len(x)))

The set of seed companies may be companies associated with one or more specified seed directors, for example:

def dirCoSeeds(dirseeds,typ='all',role='all'):
    ''' Find companies associated with dirseeds '''
    for d in dirseeds:
        for c in ch_getAppointments(d,typ=typ,role=role)['items']:
    return coseeds

for d in ch_searchOfficers('Bernard Ecclestone',n=10,exact='forename')['items']:

Then I call a first pass of the co-directed companies search with the set of company seeds:

#Need to handle director or LLP Designated Member
for seed in coseeds:
c.executemany('INSERT INTO coredirs (dirnum) VALUES (?)', [[d] for d in dirsparsed])

seeder_roles=['Finance Director']
#for dirs in seeded_cos, if dir_role is in seeder_roles then do a second seeding based on their companies


Then we go for a crawl for as many steps as required… The approach I’ve taken here is to search through the current database to find the companies heuristically defined as codirected, and then feed these back into the harvester.

while depth<maxdepth:
    print('---------------\nFilling out level - {}...'.format(depth))
    if seeder and depth==0:
        #Another policy would be dive on all companies associated w/ dirs of seed
        #In which case set the above test to depth==0
        tofetch=[u[0] for u in c.execute(''' SELECT DISTINCT conum from codirs''')]
        duals=c.execute('''SELECT cd1.conum as c1,cd2.conum as c2, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 
                        ON cd1.dirnum = cd2.dirnum AND cd1.dirnum
                        WHERE cd1.conum < cd2.conum GROUP BY c1,c2 HAVING COUNT(*)>1
        tofetch=[x for t in duals for x in t[:2]]
        #The above has some issues. eg only 1 director is required, and secretary IDs are unique to company
        #Maybe need to change logic so if two directors OR company just has one director?
        #if relaxed>0:
        #    print('Being relaxed {} at depth {}...'.format(relaxed,depth))
        #    duals=c.execute('''SELECT cd.conum as c1,cl.coname as cn, count(*) FROM codirs as cd JOIN companieslite as cl 
        #                 WHERE cd.conum= cl.conum GROUP BY c1,cn HAVING COUNT(*)=1
        #                ''')
        #    tofetch=tofetch+[x[0] for x in duals]
        #    relaxed=relaxed-1
    if depth==0 and oneDirSeed:
        #add in companies with a single director first time round
        for u in c.execute('''SELECT DISTINCT cd.conum, cl.coname FROM codirs cd  JOIN companieslite cl ON
            if len(o['items'])==1 or u[0]in coseeds:
        c.executemany('INSERT INTO singlecos (conum,coname) VALUES (:conum,:coname)', sco)
    #TO DO: Another stategy might to to try to find the Finance Director or other names role and seed from them?
    #Get undone companies
    print('To fetch: ',[u for u in tofetch if u not in cosparsed])
    for u in [x for x in tofetch if x not in cosparsed]:
            #play nice
    #Parse companies

To visualise the data, I opted for Gephi, which meant having to export the data. I started off with a simple CSV edgelist exporter:

data=c.execute('''SELECT cl1.coname as Source,cl2.coname as Target, count(*) FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN companieslite as cl1 JOIN companieslite as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum 1''')
import csv
with open('output1.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target'])
data= c.execute('''SELECT cl1.coname as c1,cl2.coname as c2 FROM codirs AS cd1
                        LEFT JOIN codirs AS cd2 JOIN singlecos as cl1 JOIN singlecos as cl2
                        ON cd1.dirnum = cd2.dirnum and cd1.conum=cl1.conum and cd2.conum=cl2.conum
                        WHERE cd1.conum &lt; cd2.conum''')
with open('output1.csv', 'ab') as f:
    writer = csv.writer(f)

but soon changed that to a proper graph file export, based on a graph built around the codirected companies using the networkx package:

import networkx as nx


data=c.execute('''SELECT cl.conum as cid, cl.coname as cn, dl.dirnum as did, dl.dirname as dn
FROM codirs AS cd JOIN companieslite as cl JOIN directorslite as dl ON cd.dirnum = dl.dirnum and cd.conum=cl.conum ''')
for d in data:
    G.add_node(d[0], Label=d[1])
    G.add_node(d[2], Label=d[3])
nx.write_gexf(G, "test.gexf")

I then load the graph file into Gephi to visualise the data.

Here’s an example of the sort of thing we can get out for a search seeded on companies associated with the Bernie Ecclestone who directs at least one F1 related company:


On the to do list is to automate this a little bit more by adding some netwrok statistics, and possibly a first pass layout, in the networkx step.

In terms of time required to collect the data, the ,a href=””>Companies House API is rate limited to allow 600 requests within a five minute period. Many company networks can be mapped within the 600 call limit, but even for larger networks, the trawl doesn’t take too long even if two or three rest periods are required.