OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Search Results

Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi

with 2 comments

Following on from Mapping How Programming Languages Influenced Each Other According to Wikipedia, where I tried to generalise the approach described in Visualising Related Entries in Wikipedia Using Gephi for grabbing datasets in Wikipedia related to declared influences between items within particular subject areas, here’s another way of grabbing data from Wikipedia/DBpedia that we can visualise as similarity neighbourhoods/maps (following @danbri: Everything Still Looks Like A Graph (but graphs look like maps)).

In this case, the technique relies on identifying items that are associated with several different values for the same sort of classification-type. So for example, in the world of music, a band may be associated with one or more musical genres. If a particular band is associated with the genres Electronic music, New Wave music and Ambient music, we might construct a graph by drawing lines/edges between nodes representing each of those musical genres. That is, if we let nodes represent genre, we might draw an edge between two nodes to show that a particular band has been labelled as falling within each of those two genres.
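As a rough illustration of that edge-building idea (not the Gephi import itself), here is a minimal Python sketch. The band names and the `genre_edges` helper are made up for the example; only the three genre names come from the text above.

```python
from collections import Counter
from itertools import combinations


def genre_edges(bands):
    """Count, for each pair of genres, how many bands carry both labels."""
    edges = Counter()
    for genres in bands.values():
        # Sort so that each unordered pair is always written the same way round.
        for genre_a, genre_b in combinations(sorted(set(genres)), 2):
            edges[(genre_a, genre_b)] += 1
    return edges


# Hypothetical bands, used only to exercise the function.
bands = {
    "Band X": ["Electronic music", "New Wave music", "Ambient music"],
    "Band Y": ["Electronic music", "Ambient music"],
}

edges = genre_edges(bands)
for (genre_a, genre_b), weight in sorted(edges.items()):
    print(genre_a, "--", genre_b, "weight", weight)
```

The counts could be used as edge weights, so genres that are frequently ascribed to the same bands end up more strongly connected.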

So for example, here’s a sketch of genres that are associated with at least some of the bands that have also been labelled as “Psychedelic” on Wikipedia:

Following the recipe described here, I used this Request within the Gephi Semantic Web Import module to grab the data:

prefix gephi:<http://gephi.org/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT{
  ?genreA gephi:label ?genreAname .
  ?genreB gephi:label ?genreBname .
  ?genreA <http://ouseful.info/edge> ?genreB .
  ?genreB <http://ouseful.info/edge> ?genreA .
} WHERE {
?band <http://dbpedia.org/ontology/genre> <http://dbpedia.org/resource/Psychedelic>.
?band <http://dbpedia.org/property/background> "group_or_band"@en.
?band <http://dbpedia.org/ontology/genre> ?genreA.
?band <http://dbpedia.org/ontology/genre> ?genreB.
?genreA rdfs:label ?genreAname.
?genreB rdfs:label ?genreBname.
FILTER(?genreA != ?genreB && langMatches(lang(?genreAname), "en")  && langMatches(lang(?genreBname), "en"))
}

(I made up the relation type to describe the edge…;-)

This query searches for things that fall into the declared genre, and then checks that they are also a group_or_band. Note that this approach was discovered through idle browsing of the properties of several bands. Instead of:
?band <http://dbpedia.org/property/background> "group_or_band"@en.
I should maybe have used a more strongly semantically defined relation such as:
?band a <http://schema.org/MusicGroup>.
or:
?band a <http://dbpedia.org/ontology/Band>.

The FILTER helps us pull back English language name labels, as well as creating pairs of different genre terms from each band (again, there may be a better way of doing this? I’m still a SPARQL novice! If you know a better way of doing this, or a more efficient way of writing the query, please let me know via the comments.)
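On the pairing question, a small Python sketch may help show what the filter does. Matching the same band against ?genreA and ?genreB with a "not equal" test yields every ordered pair, so each pair of genres turns up twice (A,B and B,A). Replacing "not equal" with an ordering comparison keeps each pair once. In SPARQL that might look something like comparing str(?genreA) < str(?genreB), though that is an untested assumption here.

```python
from itertools import permutations

# The three genres from the example earlier in the post.
band_genres = ["Ambient music", "Electronic music", "New Wave music"]

# Equivalent of FILTER(?genreA != ?genreB): every ordered pair of distinct genres.
ordered_pairs = list(permutations(band_genres, 2))

# Equivalent of an ordering filter: each pair appears once, smaller name first.
unordered_pairs = [(a, b) for a, b in ordered_pairs if a < b]

print(len(ordered_pairs), "ordered pairs")
print(len(unordered_pairs), "unordered pairs")
```

Since the CONSTRUCT already writes an edge in each direction, halving the matched pairs in this way should, in principle, produce the same graph from fewer result rows.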

It’s easy enough to generate similarly focussed maps around other specific genres; the following query run using the DBpedia SNORQL interface pulls out candidate values:

SELECT DISTINCT ?genre WHERE {
  ?band <http://dbpedia.org/property/background> "group_or_band"@en.
  ?band <http://dbpedia.org/ontology/genre> ?genre.
} limit 50 offset 0

(The offset parameter allows you to page between results; so an offset of 10 will display results starting with the 11th result.)

What this query does is look for items that are declared as a type group_or_band and then pull out the genres associated with each band.
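The paging arithmetic can be sketched in Python as follows. The `paged_query` helper and the page numbering (starting from 0) are assumptions for illustration; the query text is the one given above.

```python
# The doubled braces are literal braces in a str.format template.
GENRE_QUERY = """SELECT DISTINCT ?genre WHERE {{
  ?band <http://dbpedia.org/property/background> "group_or_band"@en.
  ?band <http://dbpedia.org/ontology/genre> ?genre.
}} LIMIT {limit} OFFSET {offset}"""


def paged_query(page, page_size=50):
    """Return the query text for a given page of results (pages count from 0)."""
    return GENRE_QUERY.format(limit=page_size, offset=page * page_size)


# Page 0 starts at the 1st result, page 1 at the 51st, page 2 at the 101st.
for page in range(3):
    print(paged_query(page).splitlines()[-1])
```

One caveat: without an ORDER BY clause a SPARQL endpoint is not obliged to return results in the same order each time, so pages fetched this way may overlap or skip items.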

If you take a deep breath, you’ll hopefully see how this recipe can be used to help probe similar “co-attributes” of things in DBpedia/Wikipedia, if you can work out how to narrow down your search to find them… (My starting point is to browse DBpedia pages of things that might have properties I’m interested in. So for example, when searching for hooks into music related data, we might have a peek at the DBpedia page for Hawkwind (who aren’t, apparently, of the Psychedelic genre…), and then hunt for likely relations to try out in a sample SNORQL query…)

PS if you pick up on this recipe and come up with any interesting maps over particular bits of DBpedia, please post a link in the comments below:-)

Written by Tony Hirst

July 4, 2012 at 1:04 pm

Posted in Tinkering


Mapping How Programming Languages Influenced Each Other According to Wikipedia

with 9 comments

By way of demonstrating how the recipe described in Visualising Related Entries in Wikipedia Using Gephi can easily be turned to other things, here’s a map of how different computer programming languages influence each other according to DBpedia/Wikipedia:

Here’s the code that I pasted in to the Request area of the Gephi Semantic Web Import plugin as configured for a DBpedia import:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?a gephi:label ?an .
  ?b gephi:label ?bn .
  ?a <http://dbpedia.org/ontology/influencedBy> ?b
} WHERE {
?a a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?b a <http://dbpedia.org/ontology/ProgrammingLanguage>.
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a foaf:name ?an.
?b foaf:name ?bn.
}

As to how I found the <http://dbpedia.org/ontology/ProgrammingLanguage> relation, I had a play around with the SNORQL query interface for DBpedia looking for possible relations using queries along the lines of:

SELECT DISTINCT ?c WHERE {
?a <http://dbpedia.org/ontology/influencedBy> ?b.
?a rdf:type ?c.
?b a ?c.
} limit 50 offset 150

(The keyword a, as in ?x a ?y, is SPARQL shorthand for rdf:type, so the two are synonyms.)

This query looks for pairs of things (?a, ?b), each of the same type, ?c, where ?b also influences ?a, then reports what sort of thing (?c) they are (philosophers, for example, or programming languages). We can then use this thing in our custom Wikipedia/DBpedia/Gephi semantic web mapping request to map out the “internal” influence network pertaining to that thing (internal in the sense that the things that are influencing and influenced are both representatives of the same, erm, thing…;-).

The limit term specifies how many results to return; the offset essentially allows you to page through results (so an offset of 500 will return results starting with the 501st result overall). DISTINCT ensures we see unique relations.

If you see a relation that looks like dbpedia:ontology/Philosopher, put it in angle brackets (<>) and replace dbpedia: with http://dbpedia.org/ to give something like <http://dbpedia.org/ontology/Philosopher>.
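That rewriting step is mechanical enough to sketch in a few lines of Python. The `expand` helper and the `PREFIXES` table are hypothetical names for this example; the only mapping assumed is the dbpedia: one described above.

```python
# Prefix table: only the mapping mentioned in the text is included.
PREFIXES = {"dbpedia": "http://dbpedia.org/"}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn something like dbpedia:ontology/Philosopher into a bracketed full URI."""
    prefix, _, local_part = prefixed_name.partition(":")
    return "<" + prefixes[prefix] + local_part + ">"


print(expand("dbpedia:ontology/Philosopher"))
```

An unknown prefix raises a KeyError here, which seems preferable to silently producing a URI that matches nothing.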

PS see how to use a similar technique to map out musical genres ascribed to bands on Wikipedia

Written by Tony Hirst

July 3, 2012 at 12:08 pm

Visualising Related Entries in Wikipedia Using Gephi

with 13 comments

Sometime last week, @mediaczar tipped me off to a neat recipe on the wonderfully named Drunks&Lampposts blog, Graphing the history of philosophy, that uses Gephi to map an influence network in the world of philosophy. The data is based on the extraction of the “influencedBy” relationship over philosophers referred to in Wikipedia using the machine readable, structured data view of Wikipedia that is DBpedia.

The recipe given hints at how to extract data from DBpedia, tidy it up and then import it into Gephi… but there is a quicker way: the Gephi Semantic Web Import plugin. (If it’s not already installed, you can install this plugin via the Tools -> Plugins menu, then look in the Available Plugins tab.)

To get DBpedia data into Gephi, we need to do three things:

- tell the importer where to find the data by giving it a URL (the “Driver” configuration setting);
- tell the importer what data we want to get back, by specifying what is essentially a database query (the “Request” configuration setting);
- tell Gephi how to create the network we want to visualise from the data returned from DBpedia (in the context of the “Request” configuration).

Fortunately, we don’t have to work out how to do this from scratch – from the Semantic Web Import Configuration panel, configure the importer by setting the configuration to DBPediaMovies.

Hitting “Set Configuration” sets up the Driver (Remote SOAP Endpoint with Endpoint URL http://dbpedia.org/sparql):

and provides a dummy, sample query Request:

We need to do some work creating our own query now, but not too much – we can use this DBpediaMovies example and the query given on the Drunks&Lampposts blog as a starting point:

SELECT *
WHERE {
  ?p a <http://dbpedia.org/ontology/Philosopher> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced.
}

This query essentially says: ‘give me all the pairs of people, (?p, ?influenced), where each person ?p is a philosopher, and each person ?influenced is influenced by ?p’.

We can replace the WHERE part of the query in the Semantic Web Importer with the WHERE part of this query, but what graph do we want to put together in the CONSTRUCT part of the Request?

The graph we are going to visualise will have nodes that are philosophers or the people they influenced. The edges connecting the nodes will represent that one influenced the other, using a directed line (with an arrow) to show that A influenced B, for example.

The following construction should achieve this:

CONSTRUCT{
  ?p <http://dbpedia.org/ontology/influenced> ?influenced.
} WHERE {
  ?p a <http://dbpedia.org/ontology/Philosopher> .
  ?p <http://dbpedia.org/ontology/influenced> ?influenced.
} LIMIT 10000

(The LIMIT argument limits the number of rows of data we’re going to get back. It’s often good practice to set this quite low when you’re trying out a new query!)

Hit Run and a graph should be imported:

If you click on the Graph panel (in the main Overview view of the Gephi tool), you should see the graph:

If we run the PageRank or Eigenvector centrality statistic, size the nodes according to that value, and lay out the graph using a force directed or Fruchterman-Reingold layout algorithm, we get something like this:
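For a feel of what the PageRank statistic is doing to node sizes, here is a minimal power-iteration sketch in Python. It is not Gephi's implementation, which may differ in detail. The damping factor of 0.85, the fixed number of iterations, the even spreading of rank from nodes with no outgoing edges, and the placeholder node names A to D are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share of rank...
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # ...and passes the rest of its own rank along its outgoing edges.
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Placeholder graph: an edge (X, Y) is read as "X points at Y".
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
ranks = pagerank(edges)
for node, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(value, 3))
```

The node that most other nodes point at ends up with the largest score, which is why edge direction matters when deciding which way round to draw the influence relation.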

The nodes are labelled in a rather clumsy way – http://dbpedia.org/page/Martin_Heidegger – for example, but we can tidy this up. Going to one of the DBpedia pages, such as http://dbpedia.org/page/Martin_Heidegger, we find what else DBpedia knows about this person:

In particular, we see we can get hold of the name of the philosopher using the foaf:name property/relation. If you look back to the original DBpediaMovies example, we can start to pick it apart. It looks as if there are a set of gephi properties we can use to create our network, including a “label” property. Maybe this will help us label our nodes more clearly, using the actual name of a philosopher for example? You may also notice the declaration of a gephi “prefix”, which appears in various constructions (such as gephi:label). Hmmm.. Maybe gephi:label is to prefix gephi:<http://gephi.org/> as foaf:name is to something? If we do a web search for the phrase foaf:name prefix, we turn up several results that contain the phrase prefix foaf:<http://xmlns.com/foaf/0.1/>, so maybe we need one of those to get the foaf:name out of DBpedia….?

But how do we get it out? We’ve already seen that we can get the name of a person who was influenced by a philosopher by asking for results where this relation holds: ?p <http://dbpedia.org/ontology/influenced> ?influenced. So it follows we can get the name of a philosopher (?pname) by asking for the foaf:name in the WHERE part of the query:

?p foaf:name ?pname.

and then using this name as a label in the CONSTRUCTion:

?p gephi:label ?pname.

We can also do a similar exercise for the person who is influenced.

Looking through the DBpedia record, I notice that as well as an influenced relation, there is an influencedBy relation (I think this is the one that was actually used in the Drunks&Lampposts blog?). So let’s use that in this final version of the query:

prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT{
  ?philosopher gephi:label ?philosopherName .
  ?influence gephi:label ?influenceName .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence
} WHERE {
  ?philosopher a <http://dbpedia.org/ontology/Philosopher> .
  ?philosopher <http://dbpedia.org/ontology/influencedBy> ?influence.
  ?philosopher foaf:name ?philosopherName.
  ?influence foaf:name ?influenceName.
} LIMIT 10000

If you’ve already run a query to load in a graph, running this query may draw the new graph on top of the previous one, so it’s best to clear the workspace first. At the bottom right of the screen is a list of workspaces – click on the RDF Request Graph label to pop up a list of workspaces, and close the RDF Request Graph one by clicking on the x.

Now run the query into a newly launched, pristine workspace, and play with the graph to your heart’s content…:-) [I'll maybe post more on this later - in the meantime, if you're new to Gephi, here are some Gephi tutorials]

Here’s what I get sizing nodes and labels by PageRank, and laying out the graph by using a combination of Force Atlas2, Expansion and Label Adjust (to stop labels overlapping) layout tools:

Using the Ego Network filter, we can then focus on the immediate influence network (influencers and influenced) of an individual philosopher:

What this recipe hopefully shows is how you can directly load data from DBpedia into Gephi. The two tricks you need to learn to do this for other data sets are:

1) figuring out how to get data out of DBpedia (the WHERE part of the Request);
2) figuring out how to get that data into shape for Gephi (the CONSTRUCT part of the request).

If you come up with any other interesting graphs, please post Request fragments in the comments below:-)

[See also: Graphing Every* Idea In History]

PS via @sciencebase (Mapping research on Wikipedia with Wikimaps), there’s this related tool: WikiMaps, an online (and desktop?) tool for visualising various Wikipedia powered graphs, such as, erm, Justin Bieber’s network…

Any other related tools out there for constructing and visualising Wikipedia powered network maps? Please add a link via the comments if you know of any…

PPS for a generalisation of this approach, and a recipe for finding other DBpedia networks to map, see Mapping How Programming Languages Influenced Each Other According to Wikipedia.

PPPS Here’s another handy recipe that shows how to pull SPARQLed DBPedia queries into R, analyse them there, and then generate a graphML file for rendering in Gephi: SPARQL Package for R / Gephi – Movie star graph visualization Tutorial

Written by Tony Hirst

July 3, 2012 at 10:05 am

Data Scraping Wikipedia with Google Spreadsheets

with 186 comments

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as a tool for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML("URL","table",N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 1) as the target table for data scraping.
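To make the "N’th table in the page" idea concrete, here is a rough Python sketch of the same kind of scrape using only the standard library. It is an illustration of the idea rather than how the spreadsheet function works internally. The `TableScraper` class, the `scrape_tables` helper and the sample page are made up for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def scrape_tables(html):
    """Return all the tables in the page, in document order."""
    scraper = TableScraper()
    scraper.feed(html)
    return scraper.tables


# Made-up page with two tables, standing in for a real web page.
SAMPLE_PAGE = """
<html><body>
<table><tr><th>Rank</th><th>Settlement</th></tr>
<tr><td>1</td><td>Placeholder Town</td></tr></table>
<table><tr><th>Code</th></tr><tr><td>X</td></tr></table>
</body></html>
"""

tables = scrape_tables(SAMPLE_PAGE)
# Python lists count from 0, which need not match the spreadsheet function's own numbering.
for row in tables[0]:
    print(row)
```

Picking the wrong table number is the most likely thing to go wrong with a real page, so it is worth checking how many tables come back before settling on an index.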

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",1)

And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed; alternatively, tweak the CSV URL so it starts with row 2). We might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…

…DEPRECATED

Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder block.

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identifier before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with its original value – $1 – as well as “, UK” – $1, UK

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.
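The same rewrite can be sketched outside Pipes. This Python version is an approximation of what the regular expression block does, not the block itself: the `add_country` helper is a made-up name, Python writes the captured group as \1 rather than $1, and the pattern is anchored with ^ and $ (an addition for this sketch) so that it matches the whole title exactly once.

```python
import re


def add_country(title, country="UK"):
    """Append a country name to a place name before it is sent to a geocoder."""
    # (.*) captures the whole title; \1 puts it back, followed by ", UK".
    return re.sub(r"^(.*)$", r"\1, " + country, title)


print(add_country("Cambridge"))
```

Without the anchors, recent Python versions also match the empty string at the end of the title, which would append the country a second time.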

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a Wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Google map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quite a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Written by Tony Hirst

October 14, 2008 at 10:21 pm

Publishing Stats for Analytic Reuse – FAOStat Website and R Package

with 2 comments

How can stats and data publishers, from NGOs and (inter)national statistics agencies to scientific researchers, publish their data in a way that supports its analysis directly, as well as in combination with other datasets?

Here’s one approach I learned about from Michael Kao of the UN Food and Agriculture Organisation statistics division, FAOStat.

At first glimpse, the FAOStat website offers a rich website that supports data downloads, previews and simple analysis tools around a wide variety of international food related datasets:

FAOStat website

FAOstat - graphical tools

faostat - inline data preview

FAOStat - data analysis

One problem with having so many controls and fields available is that it can be hard to know where (or how) to get started – a bit like the problem of being presented with an empty SPARQL query box…

It would be quite handy to be able to set – and save with meaningful labels – preference sets about the countries you’re interested in so you don’t have to keep scrolling through long country lists looking for the countries you want to generate reports for? (Support for “standard” groupings of countries might also be useful?) Being able to share URLs to predefined reports might also be handy? But this would possibly make the site even more complex to use!

One easier way of working with FAOStat data, particularly if you access the FAO datasets regularly, might be to take a programmatic route using the FAOStat R package. Making datasets available in ways that bring that data directly into a desktop analysis environment where they can be worked on without requiring cleaning or other forms of tidying up (which is often the case when data is made available via Excel spreadsheets or CSV files) is a trend I hope we see more of. (That is not to say that data shouldn’t also be published in “generic” document formats…). If you are using a reproducible research strategy, queries to original datasources provide implicit, self-describing metadata about the data source and the query used to return a particular dataset, metadata that is all too easy to lose, or otherwise detach from a dataset when working with downloaded files.

I haven’t had a chance to play with this package yet – it’s still in testing anyway, I think? – but it looks quite handy at a first glance (I need to do a proper review…). As well as providing a way of running data grab queries over the FAO FAOSTAT and World Bank WDI APIs, it seems to provide support for “linkage”. As the draft vignette suggests, “Merge is a typical data manipulation step in daily work yet a non-trivial exercise especially when working with different data sources. The built in mergeSYB function enables one to merge data from different sources as long as the country coding system is identified. … Data from any source with [a] classification [supported by the package] can be supplied to mergeSYB in order to obtain a single merged data. (sic)”. Supported formats currently include: United Nations M49 country standard [UN_CODE]; FAO country code scheme [FAOST_CODE]; FAO Global Administrative Unit Layers (GAUL) [ADM0_CODE]; ISO 3166-1 alpha-2 [ISO2_CODE]; ISO 3166-1 alpha-2 (World Bank) [ISO2_WB_CODE]; ISO 3166-1 alpha-3 [ISO3_CODE]; ISO 3166-1 alpha-3 (World Bank) [ISO3_WB_CODE].
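To illustrate why a shared country coding matters for this sort of merge, here is a small Python sketch of the general idea. It is not the package's mergeSYB function, whose behaviour I haven't checked. The `merge_by_country` helper and the numeric values are placeholders; only the ISO code pairs (GB/GBR, FR/FRA, DEU) are real.

```python
# Lookup between two coding schemes (ISO 3166-1 alpha-2 to alpha-3).
ISO2_TO_ISO3 = {"GB": "GBR", "FR": "FRA"}


def merge_by_country(source_a, source_b, a_to_b):
    """Join two {country code: value} tables that use different code schemes."""
    merged = {}
    for code_a, value_a in source_a.items():
        code_b = a_to_b.get(code_a)  # translate into the other scheme
        if code_b in source_b:       # keep only countries present in both
            merged[code_b] = (value_a, source_b[code_b])
    return merged


# Placeholder values, not real statistics.
source_a = {"GB": 1.0, "FR": 2.0}       # keyed by alpha-2 codes
source_b = {"GBR": 10.0, "DEU": 30.0}   # keyed by alpha-3 codes

merged = merge_by_country(source_a, source_b, ISO2_TO_ISO3)
print(merged)
```

Countries that cannot be translated, or that appear in only one source, are silently dropped in this sketch; a real merge would probably want to report them.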

By releasing an “official” R package to access the FAOStat API, it occurs to me that this makes it much easier to start building sector specific Shiny applications around particular datasets? I wonder whether the FAOstat folk have considered whether there is a possibility of developing a small Shiny app or custom client ecosystem around their data, even if it just takes the form of a curated set of gists that can be downloaded directly into RStudio, for example, using runGist?

I don’t know whether the Eurostat EC Statistics database has an associated R package too? (If so, it could be quite interesting trying to tie them together?!) I do note, however, that Eurostat data is available for download (though I haven’t read the terms/license conditions…).

I also note that a Linked Data/SPARQL way in to Eurostat data appears to be available? Eurostat Linked Data.

[Man flu, hence the brevity of the post... skulks back off to sick bed...]

PS By the by, I notice that the NHS are experimenting with making some data releases available via Google Public Data Explorer [scroll down...]

Written by Tony Hirst

March 8, 2013 at 2:45 pm

Posted in Rstats


When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes

with 5 comments

One of my most successful posts in terms of lifetime traffic numbers has been a recipe for scraping data from a Wikipedia page, pulling it into a Google spreadsheet, publishing it as CSV, pulling it into a Yahoo Pipe, geo-coding it, publishing it as a KML file, displaying the KML in Google maps and embedding the map in another page (which could in principle be the original Wikipedia page): Data Scraping Wikipedia with Google Spreadsheets.

A variant of this recipe in other words:

wikipediamap

Running the hack now on a new source web page, we get the following data pulled into the spreadsheet:

pipe broken

And the following appearing in the pipe (I am trying to replace the first line with my own headers):

imported data

The CSV file appears to be misbehaving… Downloading the CSV data and looking at it in TextWrangler, a text editor, we start to see what’s wrong:

text editor

The text editor creates line numbers for things it sees as separate, well formed rows in the CSV data. We see that the header, which should be a single row, is actually spread over four rows. In addition, the London data is split over two rows. The line for Greater Manchester behaves correctly…: if you look at the line numbers, you can see line 7 overflows in the editor (the … in the line number count shows the CSV line (a separate dataflow) has overflowed the width of the editor and been wrapped round in the editor view).

If I tell the editor to stop “soft wrapping” each line of data in the CSV file, the editor displays each line of the CSV file on a single line in the editor:

text editor nowrap

So… where does this get us in fixing the pipe? Not too far. We can skip the first 5 lines of the file that we import into the pipe, and that gets around all the messed up line breaks at the top of the file, but we lose the row containing the data for London. In the short term, this is probably the pragmatic thing to do.
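The symptom described above, one logical row spread over several physical lines, is what you would expect if some cells contain line breaks inside quoted fields. The following Python sketch shows the difference between counting physical lines and parsing CSV records properly. The sample data is made up (the numbers are placeholders, not real populations) and this is only a guess at what the published CSV looked like.

```python
import csv
import io

# Made-up CSV in which a header cell and one data cell contain line breaks.
RAW_CSV = (
    'Rank,"Urban\narea",Population\n'
    '1,"Greater\nLondon",100\n'
    '2,Greater Manchester,50\n'
)

# A text editor (or a naive line-based importer) sees physical lines...
physical_lines = RAW_CSV.splitlines()

# ...whereas a CSV parser treats a quoted line break as part of the cell.
records = list(csv.reader(io.StringIO(RAW_CSV, newline="")))

# One possible clean-up: flatten the embedded line breaks into spaces.
cleaned = [[cell.replace("\n", " ") for cell in row] for row in records]

print(len(physical_lines), "physical lines")
print(len(records), "CSV records")
for row in cleaned:
    print(row)
```

If the consuming tool splits on line breaks before it looks at the quoting, cleaning the cells upstream like this avoids having to throw rows away.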

Next up, we might look to the Wikipedia file and see how the elements that appear to be breaking the CSV file might be fixed to unbreak them. Finally, we could go to the Google Spreadsheets forums and complain about the pile of crap broken CSV generation that the Googlers appear to have implemented…

PS MartinH suggests a workaround in the comments, wrapping a QUERY round the import and renaming the columns…

Written by Tony Hirst

February 5, 2013 at 5:25 pm

This Week in Open and Communications Data Land…

with 2 comments

Following the official opening of the Open Data Institute (ODI) last week, a flurry of data related announcements this week:

Things have been moving on the Communications Data front too. Communications Data got a look in as part of the 2011/2012 Security and Intelligence Committee Annual Report with a review of what’s currently possible and “why change may be necessary”. Apparently:

118. The changes in the telecommunications industry, and the methods being used by people to communicate, have resulted in the erosion of the ability of the police and Agencies to access the information they require to conduct their investigations. Historically, prior to the introduction of mobile telephones, the police and Agencies could access (via CSPs, when appropriately authorised) the communications data they required, which was carried exclusively across the fixed-line telephone network. With the move to mobile and now internet-based telephony, this access has declined: the Home Office has estimated that, at present, the police and Agencies can access only 75% of the communications data that they would wish, and it is predicted that this will significantly decline over the next few years if no action is taken. Clearly, this is of concern to the police and intelligence and security Agencies as it could significantly impact their ability to investigate the most serious of criminal offences.

N. The transition to internet-based communication, and the emergence of social networking and instant messaging, have transformed the way people communicate. The current legislative framework – which already allows the police and intelligence and security Agencies to access this material under tightly defined circumstances – does not cover these new forms of communication. [original emphasis]

Elsewhere in Parliament, the Joint Select Committee Report on the Draft Communications Data Bill was published and took a critical tone (Home Secretary should not be given carte blanche to order retention of any type of data under draft communications data bill, says joint committee. “There needs to be some substantial re-writing of the Bill before it is brought before Parliament” adds Lord Blencathra, Chair of the Joint Committee.) Friend and colleague Ray Corrigan links to some of the press reviews of the report here: Joint Committee declare CDB unworkable.

In other news, Prime Minister David Cameron’s announcement of DNA tests to revolutionise fight against cancer and help 100,000 patients was reported via a technology angle – Everybody’s DNA could be on genetic map in ‘very near future’ [Daily Telegraph] – as well as by means of more reactionary headlines: Plans for NHS database of patients’ DNA angers privacy campaigners [Guardian], Privacy fears over DNA database for up to 100,000 patients [Daily Telegraph].

If DNA is your thing, don’t forget that the Home Office already operates a National DNA Database for law enforcement purposes.

And if national databases are your thing, there’s always the National Pupil Database which was in the news recently with the launch of a consultation on proposed amendments to individual pupil information prescribed persons regulations which seeks to “maximise the value of this rich dataset” by widening access to this data. (Again, Ray provides some context and commentary: Mr Gove touting access to National Pupil Database.)

PS A late inclusion: DECC announcement around smart meter rollout with some potential links to #midata strategy (eg “suppliers will not be able to use energy consumption data for marketing purposes unless they have explicit consent”). A whole raft of consultations were held around smart metering and Government responses are also published today, including Government Response on Data Access and Privacy Framework, the Smart Metering Privacy Impact Assessment and a report on public attitudes research around smart metering. I also spotted an earlier consultation that had passed me by around the Data and Communications Company (DCC) License Conditions; here’s the response, which opens with: “The communications and data transfer and management required to support smart metering is to be organised by a new central communications body – the Data and Communications Company (“the DCC”). The DCC will be a new licensed entity regulated by the Gas and Electricity Markets Authority (otherwise referred to as “the Authority”, or “Ofgem”). A single organisation will be granted a licence under each of the Electricity and Gas Acts (there will be two licences in a single document, referred to as the “DCC Licence”) to provide these services within the domestic sector throughout Great Britain”. Another one to put on the reading pile…

Putting a big brother watch hat on, the notion of “meter surveillance” brings to mind a BBC article about an upcoming (will hopefully thence be persistently available on iPlayer?) radio programme on “Electric Network Frequency (ENF) analysis”, The hum that helps to fight crime. According to Wikipedia, ENF is a forensic science technique for validating audio recordings by comparing frequency changes in background mains hum in the recording with long-term high-precision historical records of mains frequency changes from a database. In turn, this reminds me of appliance signature detection (identifying what appliance is switched on or off from its electrical load curve signature), for example Leveraging smart meter data to recognize home appliances. In the context of audio surveillance, how about supplementing surveillance video cameras with microphones? Public Buses Across Country [US] Quietly Adding Microphones to Record Passenger Conversations.
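As a rough illustration of the matching idea behind ENF analysis (and very much a toy sketch rather than how forensic tools actually do it), we might slide a short mains-frequency trace taken from a recording along a longer reference log and pick the position where the two agree best. The function name `best_enf_match` and all the frequency values below are made up for the example; extracting the hum frequency from real audio is a separate, harder step that isn't shown.

```python
from typing import Sequence, Tuple


def best_enf_match(recording: Sequence[float],
                   reference: Sequence[float]) -> Tuple[int, float]:
    """Slide the recording's mains-frequency trace along the reference log.

    Returns (offset, mean squared error) for the best-aligned position.
    """
    n = len(recording)
    if n == 0 or n > len(reference):
        raise ValueError("recording must be non-empty and no longer than the reference")

    best_offset, best_error = 0, float("inf")
    for offset in range(len(reference) - n + 1):
        window = reference[offset:offset + n]
        # Mean squared difference between the trace and this window of the log.
        error = sum((r - w) ** 2 for r, w in zip(recording, window)) / n
        if error < best_error:
            best_offset, best_error = offset, error
    return best_offset, best_error


# Made-up reference log of mains frequency (Hz), one value per second.
reference_log = [50.00, 50.01, 49.99, 49.97, 50.02, 50.03, 49.98, 50.00, 50.01, 49.96]
# Made-up trace, as if extracted from a recording's background hum.
recording_trace = [49.97, 50.02, 50.03, 49.98]

offset, error = best_enf_match(recording_trace, reference_log)
print(offset, error)
```

The offset gives a candidate time at which the recording was made, and a large error at every offset would hint that the recording doesn't sit cleanly anywhere in the log.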

Written by Tony Hirst

December 12, 2012 at 2:23 pm

Posted in Data, opengov, Policy

#CAST12 DataViz Sandbox – Resources

leave a comment »

I had a trip up to London yesterday to give the second of two talks on data visualisation to the #cast12 Masters students at Goldsmiths University. As promised to them, here’s a list of resources they might find useful..:

1) Storytelling with data

Hans Rosling demoing Gapminder (using a visualisation technique now often referred to as a motion chart; see the original here: Gapminder).

(See also the BBC4 programme fronted by Hans Rosling, “The Joy of Stats”).

How line graphs can narrate a story – Kurt Vonnegut on the Shape of Stories

The Charts’n’things blog, which describes some of the design process that goes on in coming up with some of the great visualisations produced by the New York Times.

2) Google Refine

Google Refine (now OpenRefine) is one of those tools that can make one of the more painful parts of producing visualisations – getting data into a state where you can actually use it – much more manageable. Here are some example use cases:

3) API datagrabs and screenscraping. Here are some handy resources:

4) Gephi tutorials:

5) General.
(Social) network analysis – a theoretical overview: Social Network Analysis – G. Cheliotis.

There are a few extras in there, but anything I missed?

Written by Tony Hirst

December 12, 2012 at 10:45 am

Posted in Anything you want

Therapy Time: Networked Personal Learning and a Reflection on the Urban Peasant…

with 3 comments

Way back when I was a postgrad, I used to spend a coffee fuelled morning reading in bed, and then get up to eat a cooked breakfast whilst watching the Urban Peasant:

My abiding memory, in part confirmed by several of the asides in the above clip (can you guess which?!), was that of “agile cooking” and flexible recipes. A chicken curry (pork’s fine too, or beef, even fish if you like; or potato if you want a vegetarian version) could be served with rice (or bread, or a baked potato); if you didn’t like curry, you could leave out the spices or curry powder, and just use a stock cube. If a recipe called for chopped vegetables, you could also grate them or slice them or dice them or…”it’s your decision”. Potato and peas could equally well be carrot or parsnip and beans. If you needed to add water to a recipe, you could add wine, or beer, or fruit juice or whatever instead; if you wanted to have scrambled egg on toast, you could also fry it, or poach it, or boil it. And the toast could be a crumpet or a muffin or just use “whatever you’ve got”.

The ethos was very much one of: start with an idea, and/or see what you’ve got, and then work with it – a real hacker ethic. It also encouraged you to try alternative ideas out, to be adaptive. And I’m pretty sure mistakes happened too – but that was fine…

When I play with data, I often have a goal in mind (albeit a loose one), used to provide a focus for exploring a data set I want to explore a little (typically using Shneiderman’s “Overview first, zoom and filter, then details-on-demand” approach), to see what potential it might hold, or to act as a testbed for a tool or technique I want to try out. The problem then becomes one of coming up with some sort of recipe that works with the data and tools I have to hand, as well as the techniques and processes I’ve used before. Sometimes, a recipe I’m working on requires me to get another ingredient out of the fridge, or another utensil out of the cupboard. Sometimes I use a tea towel as an oven glove, or a fork as a knife. Sometimes I taste the food-in-process to know when it’s done, sometimes I go by the colour, texture, consistency, shape, smell or clouds of smoke that have started to appear.

Because I haven’t had any formal training in any of this “stuff”, using “approved” academic sources (I’ve recently been living by R-Bloggers (which is populated by quite a few academics) and Stack Overflow, for example), I suffer from a lack of confidence in talking about it in an academic way (see for example For My One Thousandth Blogpost: The Un-Academic), and a similar lack of confidence in feeling that I could ever charge anybody a fee for telling them what I (think I) know (leave aside for the moment that I effectively charge the OU my salary, benefits and on-costs… hmmm?!). I used to do the academic thing way back when as a postgrad and early postdoc, but fell out of the habit over the last few years because there seemed to me to be a huge amount of investment of time required for very little impact or consequence of what I was doing. Yes, it’s important for things to be “right”, but I’m not sure my maths is up to generating formal proofs of new algorithms. I may be able to do the engineering or technologist thing of getting something working, -ish, good enough “for now”, research-style coding, but it’s always mindful of an engineering style trade-off: that it might not be “right” and is just something I figured out that seems to work, but that it’ll do because it lets me get something done… As Artur Bergman puts it using rather colourful language – “yes, correlation isn’t causation, but…”


(This clip was originally brought to mind by a recent commentary from Stephen Downes on The Internet Blowhard’s Favorite Phrase, and the original post it refers to.)

Also mixed up in the notion of “right” is seeing things as “right” if they are formally recognised or accepted as such, which is where assessment and peer review come in: you let other people you trust make an assessment about whatever it is you do/have done, publicly recognising your achievements which in turn allows you to make a justifiable claim to them. (I am reminded here of the definition of knowledge as justified true belief. That word “justified” is interesting, isn’t it…?)

As well as resisting getting into the whole grant bidding cycle for turnover generating, public money cycling projects that are set up to fail, I’ve also recently started to fall out of OU-style formal teaching roles… again, in part because of the long lead times involved with producing course materials and my preference for network based, rather than teamwork based, working style. (I so need to revisit formal models of teamwork and try to come up with a corresponding formulation for networks rather than teams…Or do a lit review to find one that’s already out there…!) I tend to write in 1 hour chunks based on 3-4 hours work, then post whatever it is I’ve done. One reason for doing this is because I figure most people read or do things in 5 to 15 minute or one to two hour chunks and that in a network-centric, distributed online educational setting small chunks are more likely to be discoverable and immediately useful (directly and immediately learnable from) chunks. There’s no shame in using a well crafted Wikipedia page as a starting point for discovering more detailed – and academic – resources: at least you stand a good chance of finding the Wikipedia page! In the same way, I try to link out to supporting resources from most of my posts so that readers (including myself as a revisitor to these pages in that set) have some additional context, supporting or conflicting material to help get more value from it. (Related: Why I (Edu)Blog.)

Thinking about my own personal lack of confidence, which in part arises from the way I have informally learned whatever it is that I have actually learned over the last few years and not had it formally validated by anybody else, my interest in espousing an informal style of networked learning to others is an odd one… Because based on my own experience, it doesn’t give me the feeling that what I know is valid (justified..?), or even necessarily trustable by anybody other than me (because I know how it’s caveated because of what I have personally learned about it, rather than just being told about it), even if it is pragmatic and at least occasionally appears to be useful. (Hmm… I don’t think an OU editor would let me get away with a sentence like that in a piece of OU course material!) Maybe I need to start keeping a second, formalised reflective learning journal as the OU Skills for OU Study suggests to log what I learn, and provide some sort of indexable and searchable metadata around it? In fact, this might be a useful approach if I do another uncourse? (It also brings to mind the word equation: Learning material + metadata = learning object (it was something like that, wasn’t it?!))

To the extent that this blog is an act of informal, open teaching, I think it offers three main things: a) “knowledge transferring” discoverable resources on a variety of specialised topics; b) fragmentary records of created knowledge (I *think* I’ve managed to make up odd bits of new stuff over the last few years…); c) a model of some sort of online distributed network centric learning behaviour (see also the Digital Worlds Uncourse Blog Experiment in this respect).

I guess one of the things I do get to validate against is the real world. When I used to go into schools doing robotics activities*, kids would ask me if their robot or programme was “right”. In many cases, there wasn’t really a notion of “right”, it was more a case of:

  • were there things that were obviously wrong?
  • did the thing work as anticipated (or indeed, did any elements of it work at all?!;-)?
  • were there any bits that could be improved, adapted or done in another more elegant way?

So it is with some of my visualisation “experiments” – are they not wrong (is the data right, is there a sensible relationship between the data and the visual mappings)? do they “work” at all (eg in the sense of communicating a particular trend, or revealing a particular anomaly)? could they be improved? By running the robot program, or trying to read the story a data visualisation appears to be telling us, we can get a sense of how “right” it is; but there is often no single “right” for it to be. Which is where doubt can crop in… Because if something is “not right”, then maybe it’s “wrong”…?

In the world of distributed, networked learning, I think one thing we need to work on is developing an appropriate sense of validation and legitimisation of personal learning. Things like badges are weak extrinsic signs that some would claim have a role in this, but I wonder how networks and communities can be shaped and architected, or how their dynamics might work, so that learners develop not only a well-founded intrinsic confidence about what they have self-learned, but also a feeling that what they have self-learned is as legitimate as something they have been formally taught? (I know, I know: “I was at the University of Life, me”… As I am, now… which reminds me, I’ve a Coursera video and Feynman lecture on Youtube to watch, and a couple of code mentor answers to questions I’ve raised on various Stack Exchange sites to read through; and I probably should check to see if there are any questions recently posted to Stack Overflow that I may be able to answer and use to link out to other, more academic “open educational” resources…)

[Rereading this post, I think I am suffering from a lack of formality and the sense of justification that comes with it. Hmmm...]

* This is something I’ve recently been asked to do again for an MK local primary school in the new year; the contact queried how much I might charge and whilst in the past I would have said “no need”, for some reason this time I felt obliged to seek advice from the Deanery about whether I should charge, and if so how much. This is a huge personal cultural shift away from my traditional “of course/pro bono” attitude, and it felt wrong, somehow. To the extent that universities are public bodies, they should work with other public services in their local and extended communities. But of course, I get the sense we’re not really being encouraged to think of ourselves as public bodies very much any more, we’re commercial services… And that feeling affects the personal responsibility I feel when acting for and on behalf of the university. As it turns out, the Deanery seems keen that we participate freely in community events… But I note here that I felt (for the first time) as if I had to check first. So what’s in the air?

See also: Terran Lane’s On Leaving Academia and (via @boyledsweetie) Inspirational teaching: since when did entertainment not matter?

Written by Tony Hirst

October 4, 2012 at 1:32 pm

ILI2012 Workshop Prep – Appropriating IT: innovative uses of emerging technologies

with 6 comments

Given that workshops at ILI2012 last a day (10 till 5), I thought I’d better start prepping the workshop I’m delivering with Martin Hawksey at this year’s Internet Librarian International early… W2 – Appropriating IT: innovative uses of emerging technologies:

Are you concerned that you are not maximising the potential of the many tools available to you? Do you know your mash-ups from your APIs? How are your data visualisation skills? Could you be using emerging technologies more imaginatively? What new technologies could you use to inspire, inform and educate your users? Learn about some of the most interesting emerging technologies and explore their potential for information professionals.

The workshop will combine a range of presentations and discussions about emerging information skills and techniques with some practical ‘makes’ to explore how a variety of free tools and applications can be appropriated and plugged together to create powerful information handling tools with few, if any, programming skills required.

Topics include:

- Visualisation tools
- Maps and timelines
- Data wrangling
- Social media hacks
- Screenscraping and data liberation
- Data visualisation

(If you would like to join in with the ‘makes’, please bring a laptop)

I have some ideas about how to fill the day – and I’m sure Martin does too – but I thought it might be worth asking what any readers of this blog might be interested in learning about in a little more detail and using slightly easier, starting from nowhere baby steps than I usually post.

My initial plan is to come up with five or six self contained elements that can also be loosely joined, structuring the day something like this:

  • opening, and an example of the sort of thing you’ll be able to do by the end of the day – no prior experience required, handheld walkthroughs all the way; intros from the floor along with what folk expect to get out of the day/want to be able to do at the day (h/t @briankelly in the comments; of course, if folks’ expectations differ from what we had planned….;-). As well as demo-ing how to use tools, we’ll also discuss why you might want to do these things and some of the strategies involved in trying to work out how to do them, knowing what you already know, or how to find out/work out how to do them if you don’t..
  • The philosophy of “appropriation”, “small pieces, lightly joined”, “minimum viability” and “why Twitter, blogs and Stack Overflow are Good Things”;
  • Visualising Data – because it’s fun to start playing straight away…
    • Google Fusion Tables – visualisations and queries
    • Google visualisation API/chart components

    Payoff: generate some charts and dashboards using pre-provided data (any ideas what data sets we might use…? At least one should have geo-data for a simple mapping demo…)

  • — Morning coffee break? —
  • Data scraping:
    • Google spreadsheets – import CSV, import HTML table;
    • Google Refine – import XLS, import JSON, import XML
    • (Briefly) – note the existence of other scraper tools, incl. Scraperwiki, and how they can be used

    Payoff: scrape some data and generate some charts/views… Any ideas what data to use? For the JSON, I thought about finishing with a grab of Twitter data, to set up after lunch…

  • — Lunch? —
  • (Social) Network Analysis with Gephi
    • Visually analyse Twitter data and/or Facebook data grabbed using Google Refine and/or TAGSExplorer
    • Wikipedia graphing using DBPedia
    • Other examples of how to think in graphs…
  • The scary session…
    • Working with large data files – examples of some simple text processing command line tools
    • Data cleansing and shaping – Google Refine, for the most part, including the use of reconciliation; additional examples based on regular expressions in a text editor, Google spreadsheets as a database, Stanford Data Wrangler, and R…
  • — Afternoon coffee break? —
  • Writing Diagrams – examples referring back to Gephi, mentioning Graphviz, then looking at R/ggplot2, finishing with R’s googleVis library as a way of generating Google Visualisation API Charts…
  • Wrap up – review of the philosophy, showing how it was applied throughout the exercises; maybe a multi-step mashup as a final demo?
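For the “working with large data files” part of the scary session, the general trick is to stream through a file a line at a time rather than trying to open the whole thing at once. The outline above has command line tools in mind; the sketch below is a rough Python take on the same idea, with helper functions loosely modelled on `head`, `wc -l` and `grep`. The helper names and the sample data are invented for the example.

```python
import re
import tempfile
from itertools import islice
from typing import Iterator, List


def head(path: str, n: int = 5) -> List[str]:
    """Return the first n lines of a file, similar to `head -n`."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def count_lines(path: str) -> int:
    """Count lines without holding the file in memory, similar to `wc -l`."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def grep(path: str, pattern: str) -> Iterator[str]:
    """Yield lines matching a regular expression, similar to `grep`."""
    regex = re.compile(pattern)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                yield line.rstrip("\n")


# Made-up sample data written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("library,loans\nMilton Keynes,120\nLondon,340\nLeeds,95\n")
    sample_path = tmp.name

print(head(sample_path, 2))            # peek at the header and first row
print(count_lines(sample_path))        # how big is the file?
print(list(grep(sample_path, r"^L")))  # rows starting with a capital L
```

Each helper only ever holds one line at a time, so the same approach should cope with files that are too big to open comfortably in a spreadsheet.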

Requirements: we’d need good wifi/network connections; also, it would help if participants pre-installed – and checked the set up of: a) a Google account; b) a modern browser (standardising on Google Chrome might be easiest?) c) Google Refine; d) Gephi (which may also require the installation of a Java runtime, eg on a new-ish Mac); e) R; f) RStudio and a raft of R libraries (ggplot2, plyr, reshape, RCurl, stringr, googleVis); g) a good text editor (?I use TextWrangler on a Mac); h) commandline tools (Windows machines).

Throughout each session, participants will be encouraged to identify datasets or IT workflow issues they encounter at work and discuss how the ideas presented in the workshop may be appropriated for use in those contexts…

Of course, this is all subject to change (I haven’t asked Martin how he sees the day panning out yet;-), but it gives a flavour of my current thinking… So: what sorts of things would you like to see? And would you like to book any of the sessions for a workshop at your place…?!;-)

Written by Tony Hirst

September 25, 2012 at 1:05 pm

Posted in Anything you want, Infoskills
