OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Search Results

Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values for the same sort of classification-type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genres, we might draw an edge between two nodes to show that a particular band has been labelled as falling within each of those two genres.

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:<http://gephi.org/>
CONSTRUCT{
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA <http://ouseful.info/edge> ?genreB .
  ?genreB <http://ouseful.info/edge> ?genreA .
} WHERE {
?band <http://dbpedia.org/ontology/genre> <http://dbpedia.org/resource/Psychedelic>.
?band <http://dbpedia.org/property/background> "group_or_band"@en.
?band <http://dbpedia.org/ontology/genre> ?genreA.
?band <http://dbpedia.org/ontology/genre> ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en")  && langMatches(lang(?genreBname), "en"))
}

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. (I discovered this approach through idle browsing of the properties of several bands.) Instead of:
?band <http://dbpedia.org/property/background> "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a <http://schema.org/MusicGroup>.
or:
?band a <http://dbpedia.org/ontology/Band>.

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)
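For what it’s worth, here’s one possible tweak – just a sketch, and I haven’t tested whether it’s actually any quicker against the DBpedia endpoint: comparing the genre URIs as strings in the FILTER means each unordered pair of genres is only matched once, rather than twice in opposite orders, while the CONSTRUCT still writes the edge in both directions:

prefix gephi:<http://gephi.org/>
CONSTRUCT{
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA <http://ouseful.info/edge> ?genreB .
  ?genreB <http://ouseful.info/edge> ?genreA .
} WHERE {
?band <http://dbpedia.org/ontology/genre> <http://dbpedia.org/resource/Psychedelic>.
?band <http://dbpedia.org/property/background> "group_or_band"@en.
?band <http://dbpedia.org/ontology/genre> ?genreA.
?band <http://dbpedia.org/ontology/genre> ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(STR(?genreA) < STR(?genreB) && langMatches(lang(?genreAname), "en") && langMatches(lang(?genreBname), "en"))
}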

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

SELECT DISTINCT ?genre WHERE {
  ?band <http://dbpedia.org/property/background> "group_or_band"@en.
  ?band <http://dbpedia.org/ontology/genre> ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th result.)

What this query does is look for items whose background property declares them as a group_or_band, and then pull out the genres associated with each band.
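Having spotted a genre value of interest in the results, you can drop its resource URI into the first line of the WHERE clause of the mapping query above, in place of the Psychedelic one – for example (assuming http://dbpedia.org/resource/Ambient_music is one of the values returned):
?band <http://dbpedia.org/ontology/genre> <http://dbpedia.org/resource/Ambient_music>.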

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipedia, if you can work out how to narrow down your search to find them… (My starting point is to browse DBpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peek at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)
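By way of a sketch of the sort of sample SNORQL query I mean, something like the following lists the properties DBpedia records against a particular resource, which you can then eyeball for likely looking relations to reuse:

SELECT DISTINCT ?property WHERE {
  <http://dbpedia.org/resource/Hawkwind> ?property ?value .
} LIMIT 100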

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)

Written by Tony Hirst

July 4, 2012 at 1:04 pm

Posted in Tinkering


Mapping How Programming Languages Influenced Each Other According to Wikipedia

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <http://dbpedia.org/ontology/influencedBy> ?b
} WHERE {
?a a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?b a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a foaf:name ?an.
?b foaf:name ?bn.
}

As to how I found the <http://dbpedia.org/ontology/ProgrammingLanguage> type, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(I think a (as in ?x a ?y) and rdf:type are synonyms?)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return, the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we see unique relations.
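(A possible refinement – again, just a sketch – is to filter the candidate types down to classes in the DBpedia ontology namespace, which should cut out classes from other namespaces such as YAGO:)

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
FILTER regex(str(?c), "^http://dbpedia.org/ontology/")
} limit 50 offset 0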

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in angle brackets (<>) and replace dbpedia: with http://dbpedia.org/ to give something like <http://dbpedia.org/ontology/Philosopher>.
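That full URI can then be dropped in as the type in the WHERE part of the mapping Request – so to map the philosopher influence network rather than the programming language one, for example, the two type statements would become something like:
?a a <http://dbpedia.org/ontology/Philosopher>.
?b a <http://dbpedia.org/ontology/Philosopher>.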

PS see how to use a similar technique to map out musical genres ascribed to bands on Wikipedia.

Written by Tony Hirst

July 3, 2012 at 12:08 pm

Visualising Related Entries in Wikipedia Using Gephi

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugins tab.)

To get DBpedia data into Gephi, we need to do three things:

- tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
- tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
- tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

SELECT *
WHERE {
?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
}

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people who influenced them. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

CONSTRUCT{
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a
<http://dbpedia.org/ontology/Philosopher> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or Eigenvector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchterman-Reingold layout algorithm, we get something like this:

The nodes are labelled in a rather clumsy way – http://dbpedia.org/page/Martin_Heidegger – for example, but we can tidy this up. Going to one of the DBpedia pages, such as http://dbpedia.org/page/Martin_Heidegger, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/> as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<http://xmlns.com/foaf/0.1/>, so maybe we need one of those to get the foaf:name out of DBpedia….?
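In other words, adding a couple of prefix declarations along the following lines to the top of the Request should let us refer to things like foaf:name and gephi:label in their shorthand form:
prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>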

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced> ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHERE part of the query:

?p foaf:name ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

Looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a
  <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, if you run this query it may appear on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I'll maybe post more on this later - in the meantime, if you're new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets (there’s a rough template sketched below) are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).
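By way of a rough and ready template – a sketch to adapt rather than a query that will run as given, since the type, relation and name properties are placeholders you need to fill in for your own dataset – most of the Requests in these posts follow the same shape:

prefix gephi:<http://gephi.org/>
CONSTRUCT{
  ?x gephi:label ?xname .
  ?y gephi:label ?yname .
  ?x <SOME_RELATION> ?y
} WHERE {
  ?x a <SOME_TYPE> .
  ?y a <SOME_TYPE> .
  ?x <SOME_RELATION> ?y .
  ?x <SOME_NAME_PROPERTY> ?xname .
  ?y <SOME_NAME_PROPERTY> ?yname .
} LIMIT 10000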

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, on online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

Written by Tony Hirst

July 3, 2012 at 10:05 am

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow at an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML("","table",N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",1)

And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed – alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…

…DEPRECATED

Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder block:

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocoder often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identifier before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with its original value – $1 – as well as “, UK” – $1, UK.

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a Wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Google map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quite a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Written by Tony Hirst

October 14, 2008 at 10:21 pm

First Signs (For Me) of Linked Data Being Properly Linked…?!


As anyone who’s followed this blog for some time will know, my relationship with Linked Data has been an off and on again one over the years. At the current time, it’s largely off – all my OpenRefine installs seem to have given up the ghost as far as reconciliation and linking services go, and I have no idea where the problem lies (whether with the plugins, the installs, with Java, with the endpoints, with the reconciliations or linkages I’m trying to establish).

My dabblings with pulling data in from Wikipedia/DBpedia to Gephi (eg as described in Visualising Related Entries in Wikipedia Using Gephi and the various associated follow-on posts) continue to be hit and miss due to the vagaries of DBpedia and the huge gaps in infobox structured data across Wikipedia itself.

With OpenRefine not doing its thing for me, I haven’t been able to use that app as the glue to bind together queries made across different Linked Data services, albeit in piecemeal fashion. Because from the occasional sideline view I have of the Linked Data world, I haven’t seen any obvious way of actually linking data sets other than by pulling identifiers into a new OpenRefine column (or wherever) from one service, then using those identifiers to pull in data from another endpoint into another column, and so on…

So all is generally not well.

However, a recent post by the Ordnance Survey’s John Goodwin (aka @gothwin) caught my eye the other day: Federating SPARQL Queries Across Government Linked Data. It seems that federated queries can now be made across several endpoints.

John gives an example using data from the Ordnance Survey SPARQL endpoint and an endpoint published by the Environment Agency:

The Environment Agency has published a number of its open data offerings as linked data … A relatively straight forward SPARQL query will get you a list of bathing waters, their name and the district they are in.

[S]uppose we just want a list of bathing water areas in South East England – how would we do that? This is where SPARQL federation comes in. The information about which European Regions districts are in is held in the Ordnance Survey linked data. If you hop over to the Ordnance Survey SPARQL endpoint explorer you can run [a] query to find all districts in South East England along with their names …

Using the SERVICE keyword we can bring these two queries together to find all bathing waters in South East England, and the districts they are in:

And here’s the query John shows, as run against the Environment Agency SPARQL endpoint (with the SERVICE block reaching out to the Ordnance Survey one):

SELECT ?x ?name ?districtname WHERE {
  ?x a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .
  ?x <http://www.w3.org/2000/01/rdf-schema#label> ?name .
  ?x <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
  SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
    ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000041421> .
    ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  }
} ORDER BY ?districtname

In a follow on post, John goes even further “by linking up data from Ordnance Survey, the Office of National Statistics, the Department of Communities and Local Government and Hampshire County Council”.

So that’s four endpoints – the original one against which the query is first fired, and three others…

SELECT ?districtname ?imdrank ?changeorder ?opdate ?councilwebsite ?siteaddress WHERE {
  ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000017765> .
  ?district a <http://data.ordnancesurvey.co.uk/ontology/admingeo/District> .
  ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  SERVICE <http://opendatacommunities.org/sparql> {
    ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?district .
    ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank . 
    ?authority <http://opendatacommunities.org/def/local-government/governs> ?district .
    ?authority <http://xmlns.com/foaf/0.1/page> ?councilwebsite .
  }
  ?district <http://www.w3.org/2002/07/owl#sameAs> ?onsdist .
  SERVICE <http://statistics.data.gov.uk/sparql> {
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/originatingChangeOrder> ?changeorder .
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/operativedate> ?opdate .
  }
  SERVICE <http://linkeddata.hants.gov.uk/sparql> {
    ?landsupsite <http://data.ordnancesurvey.co.uk/ontology/admingeo/district> ?district .
    ?landsupsite a <http://linkeddata.hants.gov.uk/def/land-supply/LandSupplySite> .
    ?landsupsite <http://www.ordnancesurvey.co.uk/ontology/BuildingsAndPlaces/v1.1/BuildingsAndPlaces.owl#hasAddress> ?siteaddress .
  }
}

Now we’re getting somewhere….

Written by Tony Hirst

March 25, 2014 at 3:25 pm

Posted in Anything you want


Recreational Data


Part of my weekend ritual is to buy the weekend papers and have a go at the recreational maths problems that are Sudoku and Killer. I also look for news stories with a data angle that might prompt a bit of recreational data activity…

In a paper that may or may not have been presented at the First European Congress of Mathematics in Paris, July, 1992, Prof. David Singmaster reflected on “The Unreasonable Utility of Recreational Mathematics”.

unreasonableUtility

To begin with, it is worth considering what is meant by recreational mathematics.

First, recreational mathematics is mathematics that is fun and popular – that is, the problems should be understandable to the interested layman, though the solutions may be harder. (However, if the solution is too hard, this may shift the topic from recreational toward the serious – e.g. Fermat’s Last Theorem, the Four Colour Theorem or the Mandelbrot Set.)

Secondly, recreational mathematics is mathematics that is fun and used either as a diversion from serious mathematics or as a way of making serious mathematics understandable or palatable. These are the pedagogic uses of recreational mathematics. They are already present in the oldest known mathematics and continue to the present day.

These two aspects of recreational mathematics – the popular and the pedagogic – overlap considerably and there is no clear boundary between them and “serious” mathematics.

How is recreational mathematics useful?

Firstly, recreational problems are often the basis of serious mathematics. The most obvious fields are probability and graph theory where popular problems have been a major (or the dominant) stimulus to the creation and evolution of the subject. …

Secondly, recreational mathematics has frequently turned up ideas of genuine but non-obvious utility. …

Anyone who has tried to do anything with “real world” data knows how much of a puzzle it can represent: from finding the data, to getting hold of it, to getting it into a state and a shape where you can actually work with it, to analysing it, charting it, looking for pattern and structure within it, having a conversation with it, getting it to tell you one of the many stories it may represent, there are tricks to be learned and problems to be solved. And they’re fun.

An obvious definition [of recreational mathematics] is that it is mathematics that is fun, but almost any mathematician will say that he enjoys his work, even if he is studying eigenvalues of elliptic differential operators, so this definition would encompass almost all mathematics and hence is too general. There are two, somewhat overlapping, definitions that cover most of what is meant by recreational mathematics.

…the two definitions described above.

So how might we define “recreational data”? For me, recreational data activities are, in whole or in part, data investigations, involving one or more steps of the data lifecycle (discovery, acquisition, cleaning, analysis, visualisation, storytelling). They are the activities I engage in when I look for, or behind, the numbers that appear in a news story. They’re the stories I read on FullFact, or listen to on the OU/BBC co-pro More or Less; they’re at the heart of the beautiful little book that is The Tiger That Isn’t; recreational data is what I do in the “Diary of a Data Sleuth” posts on OpenLearn.

Recreational data is about the joy of trying to find stories in data.

Recreational data is, or can be, the data journalism you do for yourself or the sense you make of the stats in the sports pages.

Recreational data is a safe place to practice – I tinker with Twitter and formulate charts around Formula One. But remember this: “recreational problems are often the basis of serious [practice]”. The “work” I did around Visualising Twitter User Timeline Activity in R? I can (and do) reuse that code as the basis of other timeline analyses. The puzzle of plotting connected concepts on Wikipedia I described in Visualising Related Entries in Wikipedia Using Gephi? It’s a pattern I can keep on playing with.

If you think you might like to do some doodling of your own with some data, why not check out the School Of Data. Or watch out on OpenLearn for some follow up stories from the OU/BBC co-pro of Hans Rosling’s award winning Don’t Panic.

Written by Tony Hirst

March 21, 2014 at 9:56 am

Political Representation on BBC Political Q&A Programmes – Stub

It’s too nice a day to be inside hacking around with Parliament data as a remote participant in today’s Parliamentary hack weekend (resource list), but if it had been a wet weekend I might have toyed with one of the following:

- revisiting this cleaning script for Analysing UK Lobbying Data Using OpenRefine (actually, a look at who funds/offers support for All Party Groups. The idea was to get a dataset of people who provide secretariat and funds to APPGs, as well as who works for them, and then do something with that dataset…)

- tinkering with data from Question Time and Any Questions…

On that last one:

- we have data from the BBC about historical episodes of Question Time and historical episodes of Any Questions. (Click a year/month link to get the listing.)

This gives us generatable URLs for programme listings by month, of the form http://www.bbc.co.uk/programmes/b006t1q9/broadcasts/2013/01 – but how do we get a JSON version of that?! Adding .json on the end doesn’t work?!:-( UPDATE – this could be a start, via @nevali – use the pattern /programmes/PID.rdf, such as http://www.bbc.co.uk/programmes/b006qgvj.rdf

We can get bits of information (albeit in semi-structured form) about panellists in data form from programme URL hacks like this: http://www.bbc.co.uk/programmes/b007m3c1.json

Note that some older programmes don’t list all the panellists in the data? So a visit to Wikipedia – http://en.wikipedia.org/wiki/List_of_Question_Time_episodes#2007 – may be in order for Question Time (there isn’t a similar page for Any Questions?)

Given panellists (the BBC could be more helpful here in the way it structures its data…), see if we can identify parliamentarians (MP suffix? Lord/Lady title?) and look them up using the new-to-me, not-yet-played-with-it UK Parliament – Members’ Names Data Platform API. Not sure if reconciliation works on parliamentarian lookup (indeed, not sure if there is a reconciliation service anywhere for looking up MPs, members of the House of Lords, etc?)

From Members’ Names API, we can get things like gender, constituency, whether or not they were holding a (shadow) cabinet post, maybe whether they were on a particular committee at the time etc. From programme pages, we may be able to get the location of the programme recording. So this opens up possibility of mapping geo-coverage of Question Time/Any Questions, both in terms of where the programmes visit as well as which constituencies are represented on them.

If we were feeling playful, we could also have a stab at which APPGs have representation on those programmes!

It also suggests a simpler hack – of just providing a briefing around the representatives appearing on a particular episode in terms of their current (or at the time) parliamentary status (committees, cabinet positions, APPGs etc etc)?

Written by Tony Hirst

November 16, 2013 at 12:31 pm

Pondering Bibliographic Coupling and Co-citation Analyses in the Context of Company Directorships

Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.

One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.

Co-citation
The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.

cocitation analysis
Image via Wikipedia

In graph terms, we might also represent this as a simpler graph within which edges between two articles indicate that they have been co-cited by documents within a particular corpus, with the weight of each edge representing the number of documents within that corpus that have co-cited them.

Bibliographic coupling
Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.

Bibliographic coupling
Image via Wikipedia

Again, in graph terms, we might think of a simpler undirected network in which edges between two articles act as an indicator that they have cited or referenced the same work, with the weight of the edge representing the number of works they both cite.

A comparison of co-citation and bibliographic coupling networks shows one to be “retrospective” and the other to be “forward looking”. The edges in a bibliographic coupling network can be generated directly from the references of a corpus of articles, and to this extent bibliographic coupling looks to the past. In a co-citation network, the edges that connect two articles can only be generated when a future published article cites them both.

Co-citation, Bibliographic Coupling and Company Director Networks

For some time I’ve been tinkering with the notion of co-director networks, using OpenCorporates data as a data source (eg Mapping Corporate Networks With OpenCorporates). What I’ve tended to focus on are networks built up from active companies and their current directors, looking to see which companies are currently connected by virtue of currently sharing the same directors. On the to do list are timelines showing the companies that a particular director has been associated with, and when, as well as directorial appointments and terminations within a particular company.

In both co-citation and bibliographic analyses, the nodes are the same type of thing (that is, works that are cited, such as articles). A work cites a work. (Note: does author co-citation analysis rely on mappings from works to cited authors, or citing authors to cited authors?). In company-director networks, we have a bipartite representation, with directors and companies representing the two types of node and where edges connect companies and directors but not companies and companies or directors and directors; unless a company is a director, but we generally fudge the labelling there.

If we treat “companies that retain directors” as “articles that cite other articles”:

- under a “co-citation” style view, we generate links between companies that share common directors;
- under a “bibliographic coupling” style view, we generate links between directors of the same companies.

I’ve been doing this anyway, but the bibliographic coupling/co-citation distinction may help me tighten it up a little, as well as improving ways of calculating and analysing these networks by reusing analyses described by the bibliometricians?

Pondering the “future vs. past” distinction, the following also comes to mind:

- at the moment, I am generating networks based on current directors of active companies;
- could we construct a dynamic (temporal?) hypergraph from hyperedges that connect all the directors associated with a particular company at a particular time? If so, what could we do with this graph?! (As an aside, it’s probably worth noting that I know absolutely nothing about hypergraphs!)

I’ve also started wondering about ‘director pathways’ in which we define directors as nodes (where all we require was that a person was a director of a company at some time) and directed “citation” edges. These edges would go from one director to other director nodes under the condition that the “citing” director was appointed to a particular company within a particular time period t1..t2 before the appointment to the same company of a “cited” director. If one director follows another director into more than one company, we increase the weight of the edge accordingly. (We could maybe also explore modes in which edge weights represent the amount of time that two directors are in the same company together.)

The aim is… probably pointless and not that interesting. Unless it is… The sort of questions this approach would allow us to ask would be along the lines of: are there groups of directors whose directorial appointments follow similar trajectories through companies; or are there groups of directors who appear to move from one company to another along with each other?

Written by Tony Hirst

May 24, 2013 at 12:13 pm

Posted in Anything you want


Publishing Stats for Analytic Reuse – FAOStat Website and R Package

How can stats and data publishers, from NGOs and (inter)national statistics agencies to scientific researchers, publish their data in a way that supports its analysis directly, as well as in combination with other datasets?

Here’s one approach I learned about from Michael Kao of the UN Food and Agriculture Organisation statistics division, FAOStat.

At first glance, the FAOStat website offers a rich interface that supports data downloads, previews and simple analysis tools around a wide variety of international food related datasets:

FAOStat website

FAOstat - graphical tools

faostat - inline data preview

FAOStat - ddata analysis

One problem with having so many controls and fields available is that it can be hard to know where (or how) to get started – a bit like the problem of being presented with an empty SPARQL query box…

It would be quite handy to be able to set – and save with meaningful labels – preference sets about the countries you’re interested in, so you don’t have to keep scrolling through long country lists looking for the countries you want to generate reports for? (Support for “standard” groupings of countries might also be useful?) Being able to share URLs to predefined reports might also be handy? But this would possibly make the site even more complex to use!

One easier way of working with FAOStat data, particularly if you access the FAO datasets regularly, might be to take a programmatic route using the FAOStat R package. Making datasets available in ways that bring that data directly into a desktop analysis environment where they can be worked on without requiring cleaning or other forms of tidying up (which is often the case when data is made available via Excel spreadsheets or CSV files) is a trend I hope we see more of. (That is not to say that data shouldn’t also be published in “generic” document formats…). If you are using a reproducible research strategy, queries to original datasources provide implicit, self-describing metadata about the data source and the query used to return a particular dataset, metadata that is all too easy to lose, or otherwise detach from a dataset when working with downloaded files.

I haven’t had a chance to play with this package yet – it’s still in testing anyway, I think? – but it looks quite handy at first glance (I need to do a proper review…). As well as providing a way of running data grab queries over the FAO FAOSTAT and World Bank WDI APIs, it seems to provide support for “linkage”. As the draft vignette suggests, “Merge is a typical data manipulation step in daily work yet a non-trivial exercise especially when working with different data sources. The built in mergeSYB function enables one to merge data from different sources as long as the country coding system is identified. … Data from any source with [a] classification [supported by the package] can be supplied to mergeSYB in order to obtain a single merged data. (sic)“. Supported formats currently include: United Nations M49 country standard [UN_CODE]; FAO country code scheme [FAOST_CODE]; FAO Global Administrative Unit Layers (GAUL) [ADM0_CODE]; ISO 3166-1 alpha-2 [ISO2_CODE]; ISO 3166-1 alpha-2 (World Bank) [ISO2_WB_CODE]; ISO 3166-1 alpha-3 [ISO3_CODE]; ISO 3166-1 alpha-3 (World Bank) [ISO3_WB_CODE].

By releasing an “official” R package to access the FAOStat API, it occurs to me that this makes it much easier to start building sector specific Shiny applications around particular datasets? I wonder whether the FAOstat folk have considered whether there is a possibility of developing a small Shiny app or custom client ecosystem around their data, even if it just takes the form of a curated set of gists that can be downloaded directly into RStudio, for example, using runGist?

I don’t know whether the Eurostat EC Statistics database has an associated R package too? (If so, it could be quite interesting trying to tie them together?!) I do note, however, that Eurostat data is available for download (though I haven’t read the terms/license conditions…).

I also note that a Linked Data/SPARQL way in to Eurostat data appears to be available? Eurostat Linked Data.

[Man flu, hence the brevity of the post... skulks back off to sick bed...]

PS By the by, I notice that the NHS are experimenting with making some data releases available via Google Public Data Explorer [scroll down...]

PPS See also this package – Smarter Poland – which provides an API to the Eurostat database.

Written by Tony Hirst

March 8, 2013 at 2:45 pm

Posted in Rstats


When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes

One of my most successful posts in terms of lifetime traffic numbers has been a recipe for scraping data from a Wikipedia page, pulling it into a Google spreadsheet, publishing it as CSV, pulling it into a Yahoo Pipe, geo-coding it, publishing it as a KML file, displaying the KML in Google maps and embedding the map in another page (which could in principle be the original Wikipedia page): Data Scraping Wikipedia with Google Spreadsheets.

A variant of this recipe in other words:

wikipediamap

Running the hack now on a new source web page, we get the following data pulled into the spreadsheet:

pipe broken

And the following appearing in the pipe (I am trying to replace the first line with my own headers):

imported data

The CSV file appears to be misbehaving… Downloading the CSV data and looking at it in TextWrangler, a text editor, we start to see what’s wrong:

text editor

The text editor creates line numbers for things it sees as separate, well formed rows in the CSV data. We see that the header, which should be a single row, is actually spread over four rows. In addition, the London data is split over two rows. The line for Greater Manchester behaves correctly: if you look at the line numbers, you can see line 7 overflows in the editor (the … in the line number count shows the CSV line (a single row of data) has overflowed the width of the editor and been wrapped round in the editor view).

If I tell the editor to stop “soft wrapping” each line of data in the CSV file, the editor displays each line of the CSV file on a single line in the editor:

text editor nowrap

So… where does this get us in fixing the pipe? Not too far. We can skip the first 5 lines of the file that we import into the pipe, and that gets around all the messed up line breaks at the top of the file, but we lose the row containing the data for London. In the short term, this is probably the pragmatic thing to do.

Next up, we might look to the Wikipedia file and see how the elements that appear to be breaking the CSV file might be fixed to unbreak them. Finally, we could go to the Google Spreadsheets forums and complain about the pile of crap broken CSV generation that the Googlers appear to have implemented…

PS MartinH suggests a workaround in the comments, wrapping a QUERY round the import and renaming the columns…

Written by Tony Hirst

February 5, 2013 at 5:25 pm
