
Semantic Cartography – Mapping Dodgy Goth Bands With Common Members Using Wikipedia Data

Several years ago I did some doodles using the Gephi network visualiser's Semantic Web Import plugin to sketch out how various sorts of thing (philosophers, music genres, programming languages) were related in Wikipedia (or at least, DBpedia, the semantic web derivative of Wikipedia). A couple of days ago, I started sketching some new queries in a Jupyter IPython notebook to generate a wider range of maps, using the networkx package to analyse the results locally, as well as to build and export a graph that I could then visualise in Gephi.

The following bit of code provides a simple function for running a SPARQL query against a SPARQL endpoint, such as the DBpedia endpoint. It also accepts a set of prefix definitions for the query.

from SPARQLWrapper import SPARQLWrapper, JSON

#Add some helper functions
def runQuery(endpoint,prefix,q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix+q)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

endpoint='http://dbpedia.org/sparql'

prefix='''
prefix gephi:<http://gephi.org/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dbp: <http://dbpedia.org/property/>
prefix dbr: <http://dbpedia.org/resource/>
prefix dbc: <http://dbpedia.org/resource/Category:>
prefix dct: <http://purl.org/dc/terms/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix yago: <http://dbpedia.org/class/yago/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dbo: <http://dbpedia.org/ontology/>
'''

Here’s an example of the style of query I explored a few years ago – it identifies a thing that’s a band in a particular genre, and then tries to find other genres associated with that band. Each combination of genres adds an edge to the resulting graph. The FILTER element ensures that edges are only created between distinct genres (and that both genre labels are English).

m='Gothic_rock'
q='''
SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?band dbp:genre dbr:{}.
?band <http://dbpedia.org/property/background> "group_or_band"@en.
?band dbp:genre ?a.
?band dbp:genre ?b.
?a dbp:name ?an.
?b dbp:name ?bn.
FILTER(?a != ?b && langMatches(lang(?an), "en")  && langMatches(lang(?bn), "en"))
}}'''.format(m)

r=runQuery(endpoint,prefix,q)
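The SPARQL JSON results format returns each result row as a dict keyed by variable name, which is what the graph-building function below iterates over. A quick way to peek at the shape of the response:

#Each binding maps a query variable (a, an, b, bn) to a {'type':..., 'value':...} dict
for binding in r['results']['bindings'][:3]:
    print(binding['an']['value'], '--', binding['bn']['value'])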

Another simple function takes the resulting edge list and creates a node-labelled graph from it using the networkx library. We can then export a graph file from this network that can be visualised in Gephi. (On my to-do list is using networkx to calculate some simple network statistics and generate a first attempt at a preview layout automatically, rather than doing it by hand in Gephi, which is what I do at the moment…)

import networkx as nx

def nxGrapher_hack(response,config,typ='undirected'):
    ''' typ: forward | reverse | undirected'''
    if typ=='undirected':
        G = nx.Graph()
    else:
        G = nx.DiGraph()

    fr,fr_l=config['from']
    to,to_l=config['to']
    for r in response['results']['bindings']:
        G.add_node(r[fr]['value'], label=r[fr_l]['value'])
        G.add_node(r[to]['value'], label=r[to_l]['value'])
        if typ=='reverse':
            G.add_edge(r[to]['value'],r[fr]['value'])
        else:
            G.add_edge(r[fr]['value'],r[to]['value'])
    return G

G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')})
nx.write_gexf(G, "music_{}.gexf".format(m))

Here’s the sort of map/graph we can generate as a result:
[Screenshot: Gephi_0_9_1_-_Project_1_-_Project_2]

As well as genre information, we can look up information about band members, such as the current or previous members of a particular band*.

[Screenshot: About__Wayne_Hussey]

*Since generating the data files last night, and running them again today, a whole raft of band membership details appear to have disappeared. WTF?! Now I remember another of the reasons I keep avoiding the semantic web – it’s as flaky as anything and you can never tell if the problem is yours, someone else’s, or the result of an update (or downgrade) in the data!

What this means is that we can anchor a query on a band, and find its current or previous members. In the following snippet, the single braces (the {} in “{}”@en) are replaced by the value of the declared band name:

m="The Mission (band)"
q='''
SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?x <http://dbpedia.org/property/background> "group_or_band"@en.
?x rdfs:label "{}"@en.

?a <http://dbpedia.org/property/background> "group_or_band"@en.
?a rdfs:label ?an.

?b rdfs:label ?bn.
?b a dbo:Person.
{{?a dbp:pastMembers ?b.}} UNION
{{?a dbp:currentMembers ?b.}}.
{{?x dbp:pastMembers ?b.}} UNION
{{?x dbp:currentMembers ?b.}}

FILTER((lang(?an)="en") && (lang(?bn)="en") && !(STRSTARTS(?bn,"List of")) && !(STRSTARTS(?an,"List of")))
}}'''.format(m)

r=runQuery(endpoint,prefix,q)

G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')})
nx.write_gexf(G, "band_{}.gexf".format(m))

A slight tweak to the code lets us replace the anchoring (that is, the search) around a single band name with a set of band names. This allows us to get the current and previous members of all the declared bands.

m=['The Mission (band)','The Cult','The Sisters of Mercy','Fields of the Nephilim','All_About_Eve_(band)']

p='''
?x rdfs:label "{}"@en.
'''

ms=''' UNION
'''.join(['{'+p.format(i)+'}' for i in m])

#In the query, replace ?x rdfs:label "{}"@en. with {}
#In the format method, replace m with ms
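Putting that together, the multi-band version of the query might be assembled along the following lines (a sketch, reusing the query skeleton from above with the UNIONed label patterns dropped in):

q='''
SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?x <http://dbpedia.org/property/background> "group_or_band"@en.
{}

?a <http://dbpedia.org/property/background> "group_or_band"@en.
?a rdfs:label ?an.

?b rdfs:label ?bn.
?b a dbo:Person.
{{?a dbp:pastMembers ?b.}} UNION
{{?a dbp:currentMembers ?b.}}.
{{?x dbp:pastMembers ?b.}} UNION
{{?x dbp:currentMembers ?b.}}

FILTER((lang(?an)="en") && (lang(?bn)="en") && !(STRSTARTS(?bn,"List of")) && !(STRSTARTS(?an,"List of")))
}}'''.format(ms)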

Rather than searching around one or more bands, we could instead hook into bands associated with a particular genre. Rather than anchoring around ?x rdfs:label "{}"@en, for example, use ?x dbp:genre dbr:{}. This then lets us generate views of the following form:

[Screenshot: Gephi_0_9_1_-_Project_1]

As well as mapping the territory around particular musical genres, we can also generate maps for other contexts, such as around particular art movements. For example:

m='Surrealism'
q='''
SELECT DISTINCT ?a ?an ?b ?bn WHERE {{
?movement dct:subject dbc:Art_movements.
?movement dct:subject dbc:{}.
?artist dbp:movement ?movement.
?artist dbp:movement ?a.
?artist dbp:movement ?b.
?a rdfs:label ?an.
?b rdfs:label ?bn.
FILTER(?a != ?b && (lang(?an)="en") && (lang(?bn)="en"))
}}'''.format(m)

r=runQuery(endpoint,prefix,q)
G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')})
nx.write_gexf(G, "art_{}.gexf".format(m))

Or we can tap into other ontologies to limit our searches, and generate a range of influence maps:

y='Artist109812338'
#Artist109812338
#Painter110391653
#Potter110460806
#Sculptor110566072
#Philosopher110423589
#PhilosophersOfLanguage
#PhilosophersOfMathematics
#PhilosophersOfMind
q='''
SELECT ?a ?an ?b ?bn WHERE {{
  ?a a yago:{typ} .
  ?b a yago:{typ} .
  ?a rdfs:label ?an.
  ?b rdfs:label ?bn.
  {{?a <http://dbpedia.org/ontology/influencedBy> ?b.}}
   UNION {{
  ?b <http://dbpedia.org/ontology/influenced> ?a.
  }}
  }}'''.format(typ=y)
r=runQuery(endpoint,prefix,q)
G=nxGrapher_hack(r, {'from':('a','an'),'to':('b','bn')},typ='forward')
nx.write_gexf(G, "influence_{}.gexf".format(y))

So why bother?

Here are several reasons: first, because it’s interesting/fun/recreational; secondly, it allows us to compare our own mental model of the wider context around a particular genre or movement with the Wikipedia version; thirdly, if we’re expert, it might allow us to spot gaps or errors in the Wikipedia data, and fix them; fourthly, these sorts of data collections are used to make recommendations to you, so it helps to get a feel for the sorts of things they can represent, the relations they claim exist, and the ways they can go wrong – so you trust the machines a little bit less, or at least, a little more informedly.

PS One of the reasons for grabbing the data using Python was because Gephi has recently undergone an update, and the extensions developed for the earlier version are still being migrated. However, checking today, I notice that the SemanticWebImport plugin has made it across, so it should be possible to run variants of the queries directly in Gephi. See the previous posts for examples.

Chat Sketches with the Companies House API, Before the F***kWit UKGov Sell It Off

The ranty title is a gut reaction to news that the Land Registry faces privatisation.

Sketching around similar ideas to my Slack/slash conversational autoresponder around the Parliament data platform API, I thought I’d have a quick play with the UK Companies House API, which provides a simple interface to company registration data, director information and disqualified director information.

Bulk downloads are available for company registration information (here’s a quick howto about working with it; I’ll post another showing how to work with it using a containerised database when I get a chance, but for now, here are some clues), and from the API developer forums it looks as if bulk directors’ information is available by request.

Working with your own bulk copy of the data is preferable, because it means you can write your own arbitrarily complex queries over any or all of the columns. The Companies House API, on the other hand, gives you a weak search over company and director names, and the ability to look up individual known records. To do any sort of processing, you need to grab a large amount of search data, and/or make lots of individual known-record requests to build your own local datastore, and then search or filter across that.

So for example, here’s the first fumblings of my own function for filtering down a list of officers associated with a particular company, based on director role or current status (which I called typ for some reason? Bah :-( ):

def ch_getCompanyOfficers(cn,typ='all',role='all'):
    #typ: all, current, previous
    url="https://api.companieshouse.gov.uk/company/{}/officers".format(cn)
    co=ch_request(CH_API_TOKEN,url)
    if typ=='current':
        co['items']=[i for i in co['items'] if 'resigned_on' not in i]
        #should possibly check here that len(co['items'])==co['active_count'] ?
    elif typ=='previous':
        co['items']=[i for i in co['items'] if 'resigned_on' in i]
    if role!='all':
        co['items']=[i for i in co['items'] if role==i['officer_role']]
    return co
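(The ch_request helper isn’t shown in the snippets above; here’s a minimal sketch of what it might look like, assuming the requests library – the Companies House API takes your API key as the username of an HTTP Basic Auth pair, with an empty password:)

import requests

CH_API_TOKEN='YOUR_COMPANIES_HOUSE_API_KEY'

def ch_request(token,url,params=None):
    ''' Call the Companies House API and return the JSON response '''
    #The API key is passed as the Basic Auth username; the password is left empty
    r=requests.get(url,auth=(token,''),params=params)
    r.raise_for_status()
    return r.json()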

The next function runs a search over officers by name, but then also lets you filter down the responses to show just those directors who also match a particular search string as part of any company name they are associated with.

def ch_searchOfficers(q,n=50,start_index='',company=''):
    url= 'https://api.companieshouse.gov.uk/search/officers'
    properties={'q':q,'items_per_page':n,'start_index':start_index}
    o=ch_request(CH_API_TOKEN,url,properties)
    if company != '':
        for p in o['items']:
            p['items'] = [i for i in ch_getAppointments(p['links']['self'])['items'] if company.lower() in i['appointed_to']['company_name'].lower()]
        o['items'] = [i for i in o['items'] if len(i['items'])]
    return o

You get the gist, hopefully. Run a crude API call, and then filter down the result according to particular data properties contained within the search result.
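By way of a hypothetical example, something like the following would pull back officers named Smith, keeping only those with an appointment at a company whose name contains “widget”:

#Hypothetical usage - 'title' is the officer's display name in the search response
o=ch_searchOfficers('Smith',company='widget')
for officer in o['items']:
    print(officer['title'], len(officer['items']), 'matching appointment(s)')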

Anyway, as far as the chatting goes, here’s what I’ve started playing around with…

First, let’s just ask what companies a director with a particular name is associated with.

[Screenshot: Companies_House_API_Bot1]

We can take this a bit further by filtering down on the directors associated with a particular company. (Actually, this is simplified now to call the reporting function simply as dirCompanies(c)).

[Screenshot: Companies_House_API_Bot]

Alternatively, we might try to narrow the search for directors associated with companies in a particular locality. (I’m still trying to get my head round the different logics of this, because companies as well as directors are associated with addresses. I really need to try some specific investigative tasks to get a better feel for how to tune this sort of filter…)

[Screenshot: Companies_House_API_Bot2]

I’ve also started trying to think around the currency of appointments, for example supporting the ability to filter down based on resigned appointments:

[Screenshot: Companies_House_API_Bot3]

Associated with this sort of query (in the sense of exploring the past) are filters that let us search around dissolved companies, for example:

[Screenshot: Companies_House_API_Bot4]

(I should probably also put some time filters in there, for example to search for companies that a particular person was a director of at a particular time…)

We can also search into the disqualified directors register. To try to reduce the sense of turning this into a fishing trip, searching by director name and then filtering by locality feels like it could be handy (though again, this needs some refinement in the way I apply the locality filter).

[Screenshot: Companies_House_API_Bot5]

Next step on this particular task is to tidy these up a little bit more and then pop them into a Slack responder.

But there are also some other Companies House goodies to come…such as revisiting the notion of co-director based company maps.

 

Calling an OData Service From Python – UK Parliament Members Data Platform

Whilst having a quick play producing Slack bots and slash commands around the UK Parliament APIs, I noticed (again) that the Members data platform has an OData endpoint.

OData is a data protocol for querying online data services via HTTP requests, although it never really seems to have captured the popular imagination – possibly because Microsoft thought it up, possibly because it seems really fiddly to use…
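(As a point of orientation, an OData service is just an HTTP endpoint that accepts system query options such as $top, $filter and $format, so you can poke at one with nothing more than the requests library. A sketch – the service root URL and entity set name here are my guesses, so check them against the actual members’ data platform service document:)

import requests

#Hypothetical service root and entity set - check against the actual OData service document
ODATA_ROOT='http://data.parliament.uk/membersdataplatform/open/OData.svc'

#$top limits the number of rows returned; $format=json requests a JSON payload
r=requests.get(ODATA_ROOT+'/Members', params={'$top':5,'$format':'json'})
print(r.json())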

I had a quick look around for a Python client/handler for it, and the closest I came was the pyslet package. I’ve posted a notebook showing my investigations to date here: Handling the UK Parliament Members Data Platform OData Feed – but it seems really clunky and I’m not sure I’ve got it right! (There doesn’t seem to be a lot of tutorial support out there, either?)

Here’s an example of the sort of mess I got myself in:

[Screenshot: UK_Parliament_api_test_and_members_data_platform_OData_service_test]

To make the Parliament OData service more useful needs a higher-level Python wrapper, I think, one that abstracts a bit further and provides some function calls that make it a tad easier (and more natural) to get at the data. Or maybe I need to step back, have a read of the OData docs, properly get my head around the pyslet OData calls, and try again!

Sketching a Slack Slash Parliamentary Auto-Responder Using AWS Lambda Functions

Across a couple of recent posts, I’ve explored how to use a webhook manager to implement a simple Slack bot that handles queries from Slack and returns information from the UK Parliament data API (Searching the UK Parliament API from Slack Slash Commands Using a Python Microservice via Hook.io Webhooks), and how to use AWS Lambda functions to construct a simple Slack slash command responder (Implementing Slack Slash Commands Using Amazon Lambda Functions – Getting Started).

So this morning, I thought I’d have a go at getting a Slack slash command responder using AWS Lambda functions to handle a couple more queries. Here’s where I got to…

First up, asking for committees that a member of parliament sits on:

[Screenshot: slashtest___OUseful_Slack6]

Secondly, a query on who the current members of a particular committee are:

[Screenshot: slashtest___OUseful_Slack4]

One rationale for supporting this sort of query is to provide fingertips information access to a researcher through a unified conversational interface.

To trigger the responses, I’ve used a regular expression that tries to capture several different question types:

import re

x='committees that Andrew Turner is on'
#Crude pattern: skip past 'committees' and any that/does/is words, then capture the name
regexp=re.compile(r'.*(?:committees(?:\s+(?:that|does|is))*)\s+(.*?)((\s+(?:is\s+)?(on|sits on|sit on|a member of))|$)')
m=re.match(regexp,x)
#m.group(1) -> 'Andrew Turner'

Obviously, this is not very advanced in terms of natural language processing, but the domain is a simple one and the number of forms that a query requesting this sort of information might take is probably quite limited – and predictable!

Having extracted the member’s name (for a lookup of the committees a member is on) or the committee name (when trying to look up the members of a particular committee), a URL is generated that can request the data from the Parliament members API. For example:

from urllib.parse import urlencode

def committee_URL(c):
    comm_url='http://data.parliament.uk/membersdataplatform/services/mnis/members/query/House=commons%7C{}/Committees/'
    urlargs={'committee':c}
    return comm_url.format(urlencode(urlargs))

We can then use this URL to get some JSON data back:

import json
from urllib.request import Request, urlopen

def getJSON(url):
    q = Request(url)
    q.add_header('accept', 'application/json')
    r= urlopen(q)
    a=r.read()
    return json.loads(a.decode('utf-8-sig'))

The next step is to parse the JSON response to pull out the information we want, and convert it to a simple text string:

def committeeMembers(members,c):
    tl=[]
    if members['Members'] is None: return None
    for m in members['Members']['Member']:
        tl.append('{} ({})'.format(m['FullTitle'],m['Party']['#text']))
    return 'Members of the {}: {}'.format(c,', '.join(tl))

This text string can then be returned to Slack as the slash command response.
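Pulling the pieces together, a minimal sketch of what the Lambda handler might look like (the committee name would really be parsed out of the slash command text using a regular expression, as above; I’m also assuming the API Gateway passes the returned JSON straight back to Slack):

def lambda_handler(event, context):
    ''' Sketch: look up committee membership and return a Slack slash command response '''
    c='Defence Committee'  #In practice, extracted from the slash command text
    members=getJSON(committee_URL(c))
    text=committeeMembers(members,c) or 'Nothing found for {}'.format(c)
    #Slack accepts a JSON payload with a text attribute as the slash command response
    return {'response_type':'in_channel','text':text}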

[UPDATE…] Here’s another example… The members’ API can look up MPs by location, constituency or postcode; so if we try each in turn, we can take in a wide variety of location styles; and it only takes a really simple regular expression to prime the pattern match for what I guess is a wide range of possible conversational gambits for requesting this information:

[Screenshot: slashtest___OUseful_Slack5]

As with the members API, the Parliament data API will also return JSON responses to valid queries (I used the Parliament data API in the original Hook.io demo). There are quite a few APIs to play with (see the data.parliament.uk datasets), so as and when I get a moment, I may try to code some more of them up as conversational responders :-)

Implementing Slash Commands Using Amazon Lambda Functions – Writing Tests

In a couple of earlier posts, I’ve described how to use AWS Lambda functions and the AWS API Gateway to set up a simple (insecure) handler for requests from a Slack slash command, as well as how to encrypt the Slack token that can be used by the micro-service to check that the request has come from a known Slack channel. In this post, I’ll show how to define a simple test event that allows you to test the operation of a Lambda function.

The function I want to test initially is one that simply parses an HTTP POST message from a Slack slash command. When the slash command is issued, a callback is raised that POSTs a payload with the following structure to the Lambda function:

token=SOME_TOKEN
team_id=T0001
team_domain=example
channel_id=C123456789
channel_name=test
user_id=U123456789
user_name=TestUser
command=/testcommand
text="some sort of text string"
response_url=https://hooks.slack.com/commands/1234/5678

The Slack documentation describes how this data will be sent to your URL as an HTTP POST with a content-type header set as application/x-www-form-urlencoded, which is to say that it will be passed in the body in URL-encoded form:

token=SOME_TOKEN&team_id=T0001&team_domain=example&... etc.

The Lambda function test event is created from the Lambda function control panel:

[Screenshot: Lambda_Management_Console_test1]

The test event needs to contain an example of the POSTed information that the Lambda function expects and can handle:

[Screenshot: Lambda_Management_Console_test]

For example:
{
"body":"token=ACTUAL_TOKEN _GIVES_THE_SECRET_AWAY_OOP&text=Who+are+the+members+of+the+Defence+Committee&command=/simpletest&user_name=testuser&channel_name=testChannel"
}

I suppose a dummy test token could be used in the test string, with only limited access to the Lambda function routines granted to it?

Saving and running the test event provides a report showing either the output from a successful execution of the function, or an error message:

[Screenshot: Lambda_Management_Console_and_Inbox]

Of course, if you don’t create a test event that faithfully resembles the content of a message sent from the service that triggered the Lambda function (in this case, an application/x-www-form-urlencoded POST event raised by the Slack slash command), you’ll either be testing against the wrong thing or getting a false response from the test.

 

A Quick Look at the Private Eye FOI’d “Offshore Landowners” Data from the Land Registry

A few days ago, Private Eye popped up a link to the (not open) data they’d FOI’d from the Land Registry around land registry applications made by offshore companies: Selling England (and Wales) by the pound. (UPDATE: see also the official publication, Land Registry: Overseas Companies data, and associated guidance.)

I thought I’d have a quick look at the data to see what sorts of thing it contained. I’ve popped a quick introductory conversation with it here: Private Eye – UK Land Ownership By Offshore Companies.

One of the things I learned was that solar panel installation companies can often get a hold on you…

[Screenshot: Private_Eye_-_Secretive_Land_Owners_-_Land_Registry_2]

Here’s another glimpse:

[Screenshot: Private_Eye_-_Secretive_Land_Owners_-_Land_Registry_1]

As well as looking at data on a national basis (for example, to see how the same company or companies have taken title across the UK), we can also look at the data on a more local level. For example, here’s a snapshot over locations of titles held by overseas companies on the Isle of Wight:

[Screenshot: Private_Eye_-_Secretive_Land_Owners_-_Land_Registry]

If you want to run the notebook yourself (for example: Seven Ways of Running IPython / Jupyter Notebooks), you should be able to download the raw version of the notebook.

PS to learn more about using Jupyter notebooks, and the Python pandas library, for basic data wrangling, why not sign up to the free OU FutureLearn MOOC Learn to Code for Data Analysis (aka Learn to Code, A Line at a Time).

PPS The UKGov also seems to have decided that it might be a good idea to sell off the Land Registry. Open data advocate Owen Boswarva summarised the previous consultation held on this a couple of years ago as follows: Who supports the privatisation of Land Registry? Mainly corporates. As a private company, under current UK legislation, I don’t think that a LandRegCo would be subject to FOI, even if it were maintaining a public register.

Implementing Slash Commands Using Amazon Lambda Functions – Encrypting the Slack Token

In an earlier post, Implementing Slack Slash Commands Using Amazon Lambda Functions – Getting Started, I avoided the use of an encrypted Slack token to identify the provenance of an incoming request in favour of the plaintext version, to try to simplify the “getting started with AWS Lambda functions” aspect of that recipe. In this post, I’ll describe how to step up to the mark and use the encrypted token.

Although I tried to limit myself to free tier usage, an invoice from Amazon made me realise that there’s a cost of $1 per month associated with generating and subscribing to AWS encryption keys…
To begin with, you’ll need to create an AWS encryption key. The method is described here but I’ll walk you through it…

The key is generated from the IAM console – select the Encryption Keys element from the left hand sidebar, and then make sure you select the correct AWS region (that is, the region that the Lambda function is defined in) before creating the key:

[Screenshot: IAM_1_Management_Console_and_Timeline]

Check again that you’re in the correct region, and then give your key an alias (I used slackslashtest):

[Screenshot: IAM_7_Management_Console]

You then need to set various permissions for potential users of the encryption key. I avoided giving anyone administrative permissions:

[Screenshot: IAM_2_Management_Console_and_Timeline]

but I did give usage permissions to the role I’d defined to execute my Lambda function:

[Screenshot: IAM_3_Management_Console]

Once you’ve assigned the roles and defined the encryption key, you should be able to see it from the IAM Encryption Keys console listing:

[Screenshot: IAM_5_Management_Console]

Select the encryption key and make a copy of the ARN that identifies it:

[Screenshot: IAM_4_Management_Console]

You now need to add the ARN for this encryption key to a policy that defines what the role used to execute the Lambda function can do. From the IAM console, select Roles and then the role you’re interested in:

[Screenshot: IAM_8_Management_Console]

Create a new role policy for that role:

[Screenshot: IAM_9_Management_Console]

You can use the policy generator tool to create the policy:

[Screenshot: IAM_13_Management_Console]

Select the AWS Key Management Service, and then select the Decrypt action. This will allow the role to use the decrypt method for the specified encryption key:

[Screenshot: IAM_10_Management_Console]

Add the ARN for your encryption key (the one you copied above) and select Add Statement to add the decrypt action on the specified encryption key to the newly created role policy.

[Screenshot: IAM_11_Management_Console]

You can now generate and review the policy – you may want to give it a sensible name:

[Screenshot: IAM_12_Management_Console]

So… we’ve now created a key, with the alias slackslashtest, and given the role that executes the Lambda function permission to access it as part of the encryption key definition; we’ve then declared access to the Decrypt method via the role policy definition.

Now we need to use the encryption key to encrypt our Slack token, which we can do using the AWS CLI (Command Line Interface). First, you need to install the AWS CLI on your computer. (I think I did this on a Mac using Homebrew? I’m not sure if there’s an online console way of doing the encryption?)

Once the AWS CLI is installed, you need to configure it. To do this, you need to get some more keys. From the IAM console, select Users and then your user. You now need to Create Access Key.

[Screenshot: IAM_Management_Console_k]

Creating an access key is fraught with risk – you get one opportunity to look at the key values, and one opportunity to download the credentials, and that’s it! So make a note of the values…

[Screenshot: IAM_Management_Console_k2]

You’re now going to use these access keys to set up the AWS CLI on your computer (you should only need to do this once). After ensuring that the AWS CLI is installed (enter the command aws on the command line and see if it takes!), run the command aws configure and provide your access key credentials. Also make sure you select the region you want to work in.

[Screenshot: serverless_slack_—_bash_—_80×24]

Having configured the CLI with permission to talk to the AWS servers, you can now use it to encrypt the Slack token. Run the command:

aws kms encrypt --key-id alias/YOUR_KEY_ALIAS --plaintext "YOUR_SLACK_TOKEN"

using appropriate values for the AWS encryption key alias (mine was slackslashtest) and Slack token. This calls the key encryption service and uses the specified encryption key, via its alias, to encrypt the plaintext string.

[Screenshot: serverless_slack_—_bash_—_80×24_and_tm351-docker-build-example_—_vagrant_tm351docker-jul15b___vagrant_—_bash_—_202×24_and_TM351VM_—_bash_—_163×25]

The CiphertextBlob is the encrypted version of the token. In your AWS Lambda function definition, you can use this value as the encrypted version of the expected Slack token, used to check the provenance of whoever’s made a request to the Lambda function:

[Screenshot: Lambda_Management_Console_and_slashtest___OUseful_Slack]
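In code terms, the usual pattern for recovering the plaintext inside the Lambda function uses the boto3 KMS client – a sketch along the following lines, with the CiphertextBlob value pasted in as a base64 string:

import boto3
from base64 import b64decode

#Paste in the CiphertextBlob value returned by the aws kms encrypt call
ENCRYPTED_EXPECTED_TOKEN = 'PASTE_CIPHERTEXT_BLOB_HERE'

kms = boto3.client('kms')
#Plaintext comes back as bytes - decode it before comparing with the incoming Slack token
expected_token = kms.decrypt(CiphertextBlob=b64decode(ENCRYPTED_EXPECTED_TOKEN))['Plaintext'].decode('utf-8')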

Comment out – or better, delete! – the original plaintext version of the Slack token that we used as a shortcut previously, and save the Lambda function.

Now when you call the Lambda function from Slack, via the slash command, it should run as before, only this time the Slack token lookup is made against an encrypted, rather than plaintext, version of it on the AWS side.

In the final post of this short series, I’ll describe how to write a simple test event to test the Lambda function.