Pondering Bibliographic Coupling and Co-citation Analyses in the Context of Company Directorships

Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.

One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.

Co-citation
The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.

Co-citation analysis (image via Wikipedia)

In graph terms, we might also represent this as a simpler graph in which edges between two articles indicate that they have been co-cited by documents within a particular corpus, with the weight of each edge representing the number of documents within that corpus that have co-cited them.

Bibliographic coupling
Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.

Bibliographic coupling (image via Wikipedia)

Again, in graph terms, we might think of a simpler undirected network in which an edge between two articles indicates that they have cited or referenced the same work, with the weight of the edge representing the number of works they have both cited (that is, the number of references they share).

A comparison of co-citation and bibliographic coupling networks shows one to be “retrospective” and the other “forward looking”. The edges in a bibliographic coupling network can be generated directly from the reference lists of a corpus of articles, and are fixed once those articles are published; to this extent, bibliographic coupling looks to the past. In a co-citation network, an edge connecting two articles can only be generated when a later published article cites them both.
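
As an aside, both measures fall out of the same citation matrix: if A[i][j] = 1 when document i cites document j, then the matrix product A A^T counts bibliographic coupling and A^T A counts co-citation. A quick numpy sketch of this, using a made-up toy matrix:

import numpy as np

# Citation matrix: A[i][j] = 1 if document i cites document j (toy data)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

# Bibliographic coupling: entry [i][j] counts the references that documents i and j share
coupling = A.dot(A.T)
# Co-citation: entry [i][j] counts the documents that cite both i and j
cocitation = A.T.dot(A)

print(coupling)
print(cocitation)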

Co-citation, Bibliographic Coupling and Company Director Networks

For some time I’ve been tinkering with the notion of co-director networks, using OpenCorporates data as a data source (eg Mapping Corporate Networks With OpenCorporates). What I’ve tended to focus on are networks built up from active companies and their current directors, looking to see which companies are currently connected by virtue of currently sharing the same directors. On the to do list are timelines showing the companies that a particular director has been associated with, and when, as well as directorial appointments and terminations within a particular company.

In both co-citation and bibliographic coupling analyses, the nodes are the same type of thing (that is, works that are cited, such as articles). A work cites a work. (Note: does author co-citation analysis rely on mappings from works to cited authors, or citing authors to cited authors?). In company-director networks, we have a bipartite representation, with directors and companies representing the two types of node, and where edges connect companies and directors but not companies and companies or directors and directors (unless a company is itself a director, but we generally fudge the labelling there).

If we treat “companies that retain directors” as “articles that cite other articles”:

– under a “co-citation” style view, we generate links between directors who are retained by the same companies (the directors playing the role of co-cited works);
– under a “bibliographic coupling” style view, we generate links between companies that share common directors (the companies playing the role of citing works).

I’ve been doing this anyway, but the bibliographic coupling/co-citation distinction may help me tighten it up a little, as well as improve ways of calculating and analysing these networks by reusing analyses described by the bibliometricians?

Pondering the “future vs. past” distinction, the following also comes to mind:

– at the moment, I am generating networks based on current directors of active companies;
– could we construct a dynamic (temporal?) hypergraph from hyperedges that connect all the directors associated with a particular company at a particular time? If so, what could we do with this graph?! (As an aside, it’s probably worth noting that I know absolutely nothing about hypergraphs!)
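
Knowing nothing about hypergraphs, the cheapest representation I can think of is just a time-stamped set of directors per company, with each (company, date) pair keying a hyperedge; a throwaway Python sketch, with made-up data:

# Each hyperedge is the full set of directors of a company at a given date,
# keyed on (company, date) so that board membership can be tracked over time.
hyperedges = {
    ('C2', '2012-01-01'): frozenset(['D1', 'D2', 'D3']),
    ('C2', '2013-01-01'): frozenset(['D2', 'D3']),
    ('C3', '2013-01-01'): frozenset(['D2', 'D3']),
}

def boards_containing(director):
    # Return the (company, date) hyperedges a given director appears in
    return [key for key, members in hyperedges.items() if director in members]

print(boards_containing('D1'))  # [('C2', '2012-01-01')]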

I’ve also started wondering about ‘director pathways’, in which we define directors as nodes (all we require is that a person was a director of a company at some time) and directed “citation” edges between them. An edge would go from one director to another on the condition that the “citing” director was appointed to a particular company within a particular time period t1..t2 before the appointment to the same company of the “cited” director. If one director follows another director into more than one company, we increase the weight of the edge accordingly. (We could maybe also explore modes in which edge weights represent the amount of time that two directors are in the same company together.)
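
By way of a minimal sketch of that edge-building rule (the appointment records and the t1..t2 window below are made up purely for illustration):

from collections import Counter
from datetime import date, timedelta

# Illustrative records: (director, company, appointment date)
appointments = [
    ('D1', 'C1', date(2010, 1, 1)),
    ('D2', 'C1', date(2010, 6, 1)),
    ('D1', 'C2', date(2011, 1, 1)),
    ('D2', 'C2', date(2011, 8, 1)),
]

T1, T2 = timedelta(days=0), timedelta(days=365)  # the t1..t2 window

edge_weights = Counter()
for d1, c1, t_a in appointments:
    for d2, c2, t_b in appointments:
        # d1 "cites" d2 if d1 was appointed to the same company within the window before d2
        if d1 != d2 and c1 == c2 and T1 <= (t_b - t_a) <= T2:
            edge_weights[(d1, d2)] += 1

print(edge_weights)  # Counter({('D1', 'D2'): 2}) - D2 followed D1 into both companies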

The aim is… probably pointless and not that interesting. Unless it is… The sort of questions this approach would allow us to ask would be along the lines of: are there groups of directors whose directorial appointments follow similar trajectories through companies; or are there groups of directors who appear to move from one company to another along with each other?

Mapping Corporate Networks With OpenCorporates

I was due to be at #odw13 today, but circumstances beyond my control intruded…

The presentation I was down to give related to some of the things we could do with company data from OpenCorporates. Here’s a related thing that covers some of what I was intending to talk about…

(I’m experimenting with a new way of putting together presentations by actually writing notes for each slide. Please let me know via the comments whether you think this approach makes my slidedecks any easier to understand!)

PS Interesting take on mapping BP corporate network using OpenCorporates data et al by OpenOil: Mapping BP – using open data to track Big Oil Introduction.

Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx

I’ve been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working with the data as a graph/network.

What I’ve been focussing on for now are networks that show connections between directors and companies, something we might imagine as follows:

company director network

In this network, the blue circles represent companies and the red circles directors. A line connecting a director with a company says that the director is a director of that company. So for example, company C1 has just one director, specifically D1; and director D2 is director of companies C2 and C3, along with director D3.

It’s easy enough to build up a graph like this from a list of “company-director” pairs (or “relations”). These can be described using a simple two column data format, such as you might find in a simple database table or CSV (comma separated value) text file, where each row defines a separate connection:

Company Director
C1 D1
C2 D1
C2 D2
C3 D2
C2 D3
C3 D3
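
For example, a minimal networkx sketch that builds the graph from the pairs above (tagging each node with its type is just a convention that makes the bipartite projections described later a little easier):

import networkx as nx

# The company-director pairs from the table above
pairs = [('C1', 'D1'), ('C2', 'D1'), ('C2', 'D2'),
         ('C3', 'D2'), ('C2', 'D3'), ('C3', 'D3')]

G = nx.Graph()
for company, director in pairs:
    # Record the node type so the two sides of the bipartite graph stay distinguishable
    G.add_node(company, bipartite='company')
    G.add_node(director, bipartite='director')
    G.add_edge(company, director)

print(G.number_of_nodes(), G.number_of_edges())  # 6 nodes, 6 edges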

This is (sort of) how I’m representing the data I’ve pulled from OpenCorporates: it starts with a company seed, grabs the current directors of that target company, searches for other companies those people are directors of (using an undocumented OpenCorporates search feature – exact string matching on the director search: put the director name in double quotes…;-), and then captures which of those companies share at least two directors with the original company.

In order to turn the data, which looks like this:

OpenCorporates data

into a map that resembles something like this (this is actually a view over a different dataset):

care uk sprawl

we need to do a couple of things. Working backwards, these are:

  1. use some sort of tool to generate a pretty picture from the data;
  2. get the data out of the table into the tool using an appropriate exchange format.

Tools include desktop applications such as Gephi (which can import data directly from a CSV file or database table), graph viewers such as the sigma.js javascript library, and d3.js with an appropriate plugin.

Note that the positioning of the nodes in the visualisation may be handled in a couple of ways:

  • either the visualisation tool uses a layout algorithm to work out the co-ordinates for each of the nodes; or
  • the visualisation tool is passed a graph file that contains the co-ordinates saying where each node should be placed; the visualisation tool then simply lays out the graph using those provided co-ordinates.

The dataflow I’m working towards looks something like this:

opencorporates graphflow

networkx is a Python library (available on Scraperwiki) that makes it easy to build up representations of graphs simply by adding nodes and edges to a graph data structure. networkx can also publish data in a variety of handy exchange formats, including gexf (as used by Gephi and sigma.js) and a JSON graph representation (as used by d3.js, and maybe sigma.js via a plugin?).
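
For example, a minimal export sketch (the file names are arbitrary; node_link_data produces the nodes/links JSON structure that the d3.js force layout examples typically expect):

import json
import networkx as nx
from networkx.readwrite import json_graph

G = nx.Graph()
G.add_edges_from([('C1', 'D1'), ('C2', 'D1'), ('C2', 'D2'),
                  ('C3', 'D2'), ('C2', 'D3'), ('C3', 'D3')])

# GEXF, as read by Gephi (and sigma.js)
nx.write_gexf(G, 'codirectors.gexf')

# Node-link JSON, as read by d3.js
with open('codirectors.json', 'w') as f:
    json.dump(json_graph.node_link_data(G), f)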

As a quick demo, I’ve built a scraperwiki view (opencorporates trawler gexf) that pulls on a directors_ table from my opencorporates trawler and then publishes the information either as a gexf file (default) or as a JSON file using URLs of the form:

https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2 (gexf default)
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&output=json
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&output=gexf

This view can therefore be used to export data from my OpenCorporates trawler as a gexf file that can be imported directly into the Gephi desktop tool, or to provide a URL to some JSON data that can be visualised using a JavaScript library within a web page (I started doodling the mechanics of one example here: sigmajs test; better examples of what’s possible can be seen at Exploring Data and on the Oxford Internet Institute – Visualisations blog). If anyone would like to explore building a nice GUI to my OpenCorporates trawl data, feel free:-)

We can also use networkx to publish data based on processing the network. The example graph above shows a network with two sorts of nodes, connected by edges: company nodes and director nodes. This is a special sort of graph in that companies are only ever connected to directors, and directors are only ever connected to companies. That is, the nodes fall into one of two sorts – company or director – and they only ever connect “across” node type lines. If you look at this sort of graph (sometimes referred to as a bipartite or bimodal graph) for a while, you might be able to spot how you can fiddle with it (technical term;-) to get a different view over the data, such as those directors connected to other directors by virtue of being directors of the same company:

Director network

or those companies that are connected by virtue of sharing common directors:

company network

(Note that the lines/edges can be “weighted” to show the number of connections relating two companies or directors (that is, the number of companies that two directors are connected by, or the number of directors that two companies are connected by). We can then visually depict this weight using line/edge thickness.)

The networkx library conveniently provides functions for generating such views over the data, which can also be accessed via my scraperwiki view.
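
For instance, the bipartite module will generate both projections more or less directly; a minimal sketch using the toy graph from earlier, with edge weights counting shared neighbours:

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_edges_from([('C1', 'D1'), ('C2', 'D1'), ('C2', 'D2'),
                  ('C3', 'D2'), ('C2', 'D3'), ('C3', 'D3')])

companies = ['C1', 'C2', 'C3']
directors = ['D1', 'D2', 'D3']

# Companies linked by shared directors; edge weight = number of directors shared
company_net = bipartite.weighted_projected_graph(B, companies)
# Directors linked by shared companies; edge weight = number of companies shared
director_net = bipartite.weighted_projected_graph(B, directors)

print(list(company_net.edges(data=True)))
# e.g. [('C1', 'C2', {'weight': 1}), ('C2', 'C3', {'weight': 2})]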

As the view is parameterised via a URL, it can be used as a form of “glue logic” to bring data out of a directors table (which itself was populated by mining data from OpenCorporates in a particular way) and express it in a form that can be directly plugged in to a visualisation toolkit. Simples:-)

PS related: a templating system by Craig Russell for generating XML feeds from Google Spreadsheets – EasyOpenData.

Corporate Sprawl Sketch Trawls Using OpenCorporates

A recent post on the OpenCorporates blog (Major Milestone: Over 50 million companies (& a sneak peak at the future)) provides a sneak preview of a tool they’re developing for visualising networks of companies based on “links of control or minority shareholdings”. I’m not sure what that actually means, but it all sounds very exciting;-)

Since Chris et al.(?) added the ability to view director information for companies directly, as well as search for directors via the new 0.2 version of the OpenCorporates API, I’ve been meaning to update my corporate sprawl hack (eg in context of Tesco, G4S and Thames Water) to make use of the director information directly. (Previously, I was trying to scrape it myself from company filings data that is also published via OpenCorporates.)

I finally got round to it over the weekend (ScraperWiki: OpenCorporates trawler), so here’s my opening recipe which tries to map the extent of a current corporate network based on common directorship:

  1. Given an OpenCorporates company ID, get the list of directors
  2. Try to find current directors (I’m using the heuristic of looking for ones with no end date on their appointment) and add them to a directors set;
  3. For each director in the directors set, search for directors with the same name. At the moment, the directors search is really loose, so I do a filtering pass to further limit results to only directors with exactly the same name.
    [There are three things to note here: i) it would be useful to have an ‘exact search’ limit option on the directors search to limit responses to just director names that exactly match the query string; ii) the directors search returns individual records for the appointment of a particular director in a particular company – at the moment, there is no notion of an actual person who may be the director of multiple companies (FRBR comes to mind here, eg in sense of a director as a work?!, as well as researcher ID schemes such as Orcid); iii) the director records contain a uid element that is currently set to null. Is this possibly for a director ID scheme so we can know that two people with the same name who are directors of different companies are actually the same person?]
    The filtered directors search returns a list of director appointments relating to people with exactly the same name as the current directors of the target company. Each record relates to an appointment to a particular company, which gives us a list of companies that are possibly related to the target company by virtue of co-directorship.
  4. Having got a list of possibly related companies, look up the details for each. If the company is an active company, I run a couple of tests to see if it is related to the target company. The heuristics I’ve started off with are:
    • does it share exactly the same registered address as the target company? If so, there’s a chance it’s related. [Note: being able to search companies by address could be quite useful, as a step on the functionality road to a full geo-search based on geocoding of addresses, maybe?!;-)]
    • does the company share N or more current directors with directors in the directors set? (I’m starting off with N=2.) If so, there’s a chance it’s related.
  5. This is the end of the first pass, and it returns a set of active companies that are possibly related to a target company by virtue of: i) sharing at least N active directors; and/or ii) sharing at least one common director and the same address.
  6. I also set the trawl up to recurse: the above description is a depth=1 search. For depth 2, from the list of companies added to the sprawl, grab all their active directors and repeat. We can do this to any depth required, though it may make sense to increase N as more directors get added to the directors set (I haven’t tried going much deeper yet!).
  7. Note that I also added a couple of optimisation steps to try to counter directors that are just nominees – ones that have hundreds of pages of results in the directors lookup and end up taking the sprawl across the corporate connected giant component (related: FRBR superduping, xISBN lookups and Harry Potter…).

As an example of what sort of thing this discovers, here’s a depth 2 search around Care UK, whose annual report I was reading last week… (I hadn’t realised quite how far the privatisation of care services had got…)

care uk sprawl

Here’s depth 3:

careuk depth 3

And here’s a glimpse at some of the companies identified:

careuk discovered companies

One thing that occurred to me is that this tool could be used to support corporate discovery during the curation process of “corporate groupings”:

opencorporates - corporate grouping

A few things to note about corporate groupings:

  1. it would be useful to be able to filter on all/active/inactive status?
  2. if you mistakenly add a company to a corporate grouping, how do you remove it?
  3. the feature that pulls in spending items from OpenlyLocal is really nice, but it requires better tools on the OpenlyLocal side for associating spending line elements with companies. This is particularly true for sprawls, where eg council spending items declare an amount spent with eg “Care UK” but you have no idea which legal entity that actually relates to?

And just in passing, what’s going on here?

refund?

Hmmm.. this post has itself turned into a bit of a sprawl, hasn’t it?! For completeness, here’s the code from the scraper:

#The aim of this scraper is to provide, in the first instance, a way of bootstrapping a search around either a company ID or a director ID
#The user should also define a tablename stub to identify the trawl.

#If one or more company IDs are specified:
#Get the company details
#??Add any names the company was previously known by to a list of 'previous' companies?
#??do "morph chains" to show how company names change?
#Get the directors
#Search for directors of same name and then do an exact match filter pass
#Get the companies associated with those exact matches


#TO DO - Spot and handle rate limiting
#TO DO - populate db

targetCompanies=['gb/01668247'] #list of OpenCorporates Company IDs with leading country code
targetDirectors=[] #list of OpenCorporates Director IDs
targetStub='Care UK 2,2 test' #name of the db table stub
trawldepth=2
coverage='current' #all, current, previous **Relates to directors
status='active' #all, active, inactive **Relates to companies
DIRINTERSECT=2 #The minimum number of shared directors (current or past) to count as part of same grouping
#------

targetStub=targetStub.replace(' ','_')

import scraperwiki, simplejson,urllib,re

#Keep the API key private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey=qsenv["OCKEY"]
    ykey=qsenv["YKEY"]
except:
    ockey=''
    ykey=''

#----
APISTUB='http://api.opencorporates.com/v0.2'

def deslash(x): return x.strip('/')
def signed(url): return url+'?api_token='+ockey

def occStrip(ocURL):
    return deslash(ocURL.replace('http://opencorporates.com/companies',''))

def buildURL(items):
    url=APISTUB
    for i in items:
        url='/'.join([url,deslash(i)])
    return signed(url)

def getOCcompanyData(ocid):
    ocurl=buildURL(['companies',ocid])
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    if 'results' in ocdata: return ocdata['results']
    else: return -1

def getOCofficerData(ocid):
    ocurl=buildURL(['officers',ocid])
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata['results']


def recorder(data):
    d=[]
    for record in data['companies']:
        dd=record.copy()
        d.append(dd)
        if len(d)>100:
            scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='companies_'+targetStub, data=d)
            d=[]
    scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='companies_'+targetStub, data=d)
    data['companies']=[]
    d=[]
    for record in data['directors']:
            dd=record.copy()
            d.append(dd)
            if len(d)>100:
                scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='directors_'+targetStub, data=d)
                d=[]
    scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='directors_'+targetStub, data=d)
    data['directors']=[]
    return data
    
exclusions_d=['FIRST SCOTTISH SECRETARIES LIMITED','FIRST DIRECTORS LIMITED']
exclusions_r=['nominated director','nominated secretary']
def getOCofficerCompaniesSearch(name,page=1,cidnames=None):
    #Use a fresh list on each call - a mutable default argument would leak results between searches
    if cidnames is None: cidnames=[]
    durl=APISTUB+'/officers/search?q='+urllib.quote(name)+'&per_page=100&page='+str(page)
    ocdata=simplejson.load(urllib.urlopen(durl+'&api_token='+ockey))['results']
    optimise=0
    #?need a heuristic for results with large page count?
    #Maybe put things into secondary possibles to check against?
    #The logic of this is really hacky and pragmatic(?!;-) Need to rethink... 
    for officer in ocdata['officers']:
        if (officer['officer']['name'].strip() in exclusions_d) or officer['officer']['position'] in exclusions_r:
            optimise=1
            break
        elif name==officer['officer']['name']:
            #print 'Possible new company for',name,officer['officer']['company']['name']
            #would a nominated secretary be interesting to search on? eg FIRST SECRETARIES LIMITED
            cidnames.append( ( occStrip(officer['officer']['company']['opencorporates_url']), occStrip(officer['officer']['company']['name']) ) )
    if page < ocdata['total_pages'] and optimise==0:
        page=page+1
        cidnames=getOCofficerCompaniesSearch(name,page,cidnames)
    #http://api.opencorporates.com/v0.2/officers/search?q=john+smith
    return cidnames
#-----

def trawlPass(data=[],depth=1,coverage='current',status='active'):
    data['depth']=data['depth']+1
    done=1
    newTargets=[]
    for ocid in data['targetCompanies']:
        if ocid not in data['cids']:
            bigtmp=[]
            data['cids'].append(ocid)
            cd=getOCcompanyData(ocid)
            if cd!=-1:
                if status=='active' and (cd['company']['inactive']): cd=-1
                elif status=='inactive' and not (cd['company']['inactive']): cd=-1
            if cd!=-1:
                cd=cd['company']
                uid=occStrip(cd['opencorporates_url'])
                dids=cd['officers']
                tmp={'ocid':uid}
                for x in ['name','jurisdiction_code','company_number','incorporation_date','dissolution_date','registered_address_in_full']:
                    tmp[x]=cd[x]
                didset=[]
                for didr in dids:
                    did=didr['officer']
                    #TEST - TO DO  - is None the right thing here?
                    print did['name'],did['end_date']
                    if coverage=='all':
                        didset.append(did['name'])
                    elif coverage=='current' and did['end_date'] is None:
                        didset.append(did['name'])
                    elif coverage=='previous' and did['end_date'] is not None:
                        didset.append(did['name'])
                #some additional logic for heuristically determining whether or not a company is in same grouping
                if data['depth']==1: inset=1
                else: inset=0
                print coverage,'dirset',didset
                if (len(list(set(didset) & set(data['dnames'])))) >= DIRINTERSECT : inset=1
                if cd['registered_address_in_full'] in data['addresses']: inset=1
                if (inset==1):
                    data['companies'].append(tmp.copy())
                    print 'Added',tmp
                    if cd['registered_address_in_full'] not in data['addresses']: data['addresses'].append(cd['registered_address_in_full'])
                    for didr in dids:
                        if didr['officer']['name'] in didset:
                            did=didr['officer']
                            print 'dir',did['name']
                            did['ocid']=did['opencorporates_url'].replace("http://opencorporates.com/officers/","")
                            did['cname']=cd['name']
                            data['directors'].append(did.copy())
                            if did['name'] not in data['dnames']:
                                data['dnames'].append(did['name'])
                                #get matchalikes
                                cidnames=getOCofficerCompaniesSearch(did['name'])
                                for (cid,cname) in cidnames:
                                    bigtmp.append({'cid':cid,'cname':cname,'dname':did['name']})
                                    if len(bigtmp)>20:
                                        scraperwiki.sqlite.save(unique_keys=['cid','dname'], table_name='possibles_'+targetStub, data=bigtmp)
                                        bigtmp=[]
                                    if cid not in data['targetCompanies'] and cid not in newTargets:
                                        #print 'Brand new company for dir',cid
                                        newTargets.append(cid)
                    #if len(data['companies'])>20 or len(data['directors'])>20:
                    data=recorder(data)
                scraperwiki.sqlite.save(unique_keys=['cid','dname'], table_name='possibles_'+targetStub, data=bigtmp)
                bigtmp=[]
    data=recorder(data)
    for ocid in newTargets:
        data['targetCompanies'].append(ocid)
        done=0
    for director in data['targetDirectors']:
        od=getOCofficerData(director)['officer']
        ocid=occStrip(od['company']['opencorporates_url'])
        if ocid not in data['targetCompanies']:
            data['targetCompanies'].append(ocid)
            done=0
    depth=depth-1
    if (done==0) and depth>0:
        return trawlPass(data,depth,coverage,status)
    else: return data

_targetCompanies=[]
for c in targetCompanies:
    _targetCompanies.append(deslash(c))

init={'depth':0,'targetCompanies':_targetCompanies,'targetDirectors':targetDirectors,'cids':[],'dnames':[],'addresses':[],'companies':[],'directors':[]}
data=trawlPass(init,trawldepth,coverage,status)
print data

When I get a chance, I’ll try to pop up a couple of viewers over the data that’s scraped.

Organisations Providing Benefits to All-Party Parliamentary Groups, Part 1

Via a tweet from the author, I came across Rob Fenwick’s post on APPGs – the next Westminster scandal? (APPG = All Party Parliamentary Groups):

APPGs are entitled to a Secretariat. Set aside any images you have of a sensibly dressed person of a certain age mildly taking dictation, the provision of an APPG Secretariat is one of the main routes used by public affairs agencies, charities, and businesses to cosey up to MPs and Peers. These “secretaries” often came up with the idea of setting up the group in the first place, to advance the interests of a client or cause.

The post describes some of the organisations that provide secretariat services to APPGs, and in a couple of cases also takes the next step: “Take the APPG on the Aluminium Industry, the secretarial services of which are provided by Aluminium Federation Ltd which is “a not-for-profit organisation.” That sounds suitably reassuring – if the organisation is not-for-profit what chance can there be of big business buying favoured access? It’s only when you look at the Federation’s website, and examine each of its nine sub-associations in turn, that it becomes clear that this not-for-profit organisation is a membership umbrella for private business. This is above board, within the rules, published, and transparent. Transparent if you’re prepared to invest the time to look, of course.”

It’s worth reading the post in full… Go on…

… Done that? Go on.

Right. Here’s the list of registered All Party Groups. Here’s an example of what’s recorded:

APG form

Conveniently, there’s a Scraperwiki scraper (David Jones / All Party Groups) grabbing this information, so I thought I’d have a play with it.

Looking at the benefits, there is a fair bit of convention in the way benefits are described. For example, we see recurring things of the form:

  • 5000 from CAFOD, 6000 from Christian Aid – that is, [AMOUNT] from [ORGANISATION]
  • Age UK (a charity) acts as the groups secretariat. – that is, [ORGANISATION] {(OPTIONAL_TYPE)} acts as the groups secretariat.

We could parse these things out directly (on the to do list!) but as a short cut, I thought I’d try a couple of content analysis/entity extraction services to see if they could pull out the names of companies and charities from the benefits list. You can find the scraper I used to enhance David Jones’ APPG scraper here: APG Enhancer.
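
For what it’s worth, the direct parsing route might look something like the following sketch – the regular expressions here are just my first guess at the register’s conventions:

import re

# Example benefit strings in the register's two recurring forms
benefits = [
    '5000 from CAFOD, 6000 from Christian Aid',
    'Age UK (a charity) acts as the groups secretariat.',
]

# [AMOUNT] from [ORGANISATION]
amount_from = re.compile(r'(\d[\d,]*)\s+from\s+([^,.]+)')
# [ORGANISATION] {(OPTIONAL_TYPE)} acts as the groups secretariat.
secretariat = re.compile(r'^(.*?)(?:\s*\(([^)]+)\))?\s+acts as the groups secretariat')

for b in benefits:
    for amount, org in amount_from.findall(b):
        print(amount, org.strip())
    m = secretariat.search(b)
    if m:
        print(m.group(1), '-', m.group(2))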

Here are a couple of sample reports from my scraper:

This gives a first pass attempt at extracting organisation (company and charity) names from the APPG register, and in passing provides a partial directory for looking up companies by APG (partial because the entity extractors aren’t perfect and don’t manage to identify every company, charity, association or other recognised group).

A more thorough way to look up particular companies is to do a site’n’path limited web search: eg
aviva site:http://www.publications.parliament.uk/pa/cm/cmallparty/register/

How might we go further, though? One way would be to look up companies on OpenCorporates, and pull down a list of directors:

opencorposrates - look up directors

And then we can start to walk through the database of directors, looking for other companies that appear to have the same director:

opencorporates - director lookup

(Note: we need to watch out for false positives, whereby one director has the same name as another person who is also a company director. There may be false negatives too, where we don’t find a directorship held by a specific person because a slightly different variation of their name was used on a registration document.)
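
A crude first guard against near-miss name variants is a normalised string similarity score; here’s a sketch using just the Python standard library (any threshold for “same person” is left as an exercise):

from difflib import SequenceMatcher

def norm(name):
    # Collapse case, stray punctuation and whitespace before comparing
    return ' '.join(name.upper().replace('.', ' ').split())

def name_similarity(a, b):
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

print(name_similarity('John A. Smith', 'JOHN A SMITH'))  # 1.0
print(name_similarity('John Smith', 'Joan Smythe'))      # rather lower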

We can also look up charities on OpenCharities to find charity trustees:

OpenCharities charity lookup

If we’re graph walking, we might then look up the trustees on OpenCorporates to see whether or not the trustees are directors of any companies with possible interests in the area, and as a result identify companies who may be trying to influence Parliamentarians through APPGs that benefit from the direct support of a charity, via that charity.

In this way, we can start to build out a wider direct interest graph around a Parliamentary group. I’m not sure how useful or even how meaningful any of this actually is, but it’s increasingly possible, and once the scripted patterns are there, increasingly easy to deploy in other contexts (for example, wherever there is a list of company names, charity names, or names of people who may be directors. I guess a trustee search on OpenCharities may also be available at some point? From a graph linking point of view, I also wonder if any charities share registered addresses with companies, etc…)

PS by the by, here’s a guest post I just wrote on the OpenCorporates blog: Data Sketching With the OpenCorporates API.

Sketching With OpenCorporates – Fragmentary Notes in Context of Thames Water Corporate Sprawl

The Observer newspaper today leads with news of how the UK’s water companies appear to be as, if not more, concerned with running tax-efficient financial engines than with maintaining the water and sewerage network. Using a recipe that’s probably run its course (which is to say – I have some thoughts on how to address some of its many deficiencies) – Corporate Sprawl mapping – I ran a search on OpenCorporates for mentions of “Thames Water” and then plotted the network of companies as connected by directors identified through director dealings also indexed by OpenCorporates:

With the release of the new version 0.2 of the OpenCorporates API, I notice that information regarding directors is now addressable, which means that we should be able to pivot from one company, to its directors, to other companies associated with that director…

To get a feel for what may be possible, let’s run a search on /Thames Water/, and then click through on one of the director links – we can see (through the search interface) records for individual corporate officers, along with sidebar links to similarly named officers (with different officer IDs):

(At this point, I don’t know the extent to which the API reconciles these individual references, if at all – I’m still working my way through the web search interface…)

Let’s assume for a moment that the similarly named individuals in the sidebar are the same as the person whose officer record we are looking at. We notice that as well as Thames Water companies, other companies are listed that would not be discovered by a simple search for /Thames Water/ – INNOVA PARK MANAGEMENT COMPANY LIMITED, for example. (Note that we can also see dates the directorial appointments were valid, which means we may be able to track the career of a particular director; FWIW, offering tools to support ordering directors by date of appointment, or using something resembling a Gantt chart layout, may help patterns jump out of this data…?)

Innova Park Management Company Ltd may or may not have anything to do with Thames Water of course, but there are a couple of bits of evidence we can look for to see whether it is likely that it is part of the Thames Water corporate sprawl using just OpenCorporates data: firstly, we might look to see if this company concurrently shares several directors with Thames Water companies; secondly, we might check its registered address:

(In this case, we also note that /Thames Water/ appears in the previous name of Innova Park Management Company Ltd (something I think that the OpenCorporates search does pick up on?).)

One of the things I’ve mentioned to Chris Taggart before is how geocoding company addresses might give us a good way in to finding co-located companies. One reason why this might be useful is that it might be able to show how companies evolve through different times and yet remain registered at the same address. It also provides a possible way in to sprawl mapping if many of the sprawl companies are registered at the same address at the same time (though there may be other reasons for companies being registered at the same address: companies may be registered by an accountancy or legal firm, for example, that offers registered address services; or be co-located in a particular building. But for investigations, this may also be useful, for example in cases of tracking down small companies serviced by, erm, creative accountants…)

(By the by, this Google Refine/OpenRefine tutorial contains a cunning trick – geocode addresses using Google maps to get lat/long coordinates, then use a scatterplot facet to view lat/long grid and select rectangular regions within it – that is, it gives you an ad hoc spatial search function… very cunning;-)

Note to self: I think I’ve pondered this before – certainty factors around the extent to which two companies are part of the same sprawl, or two similarly named directors are the same person. Something along the lines of:

– corporate sprawl: certainty = f( number_of_shared_directors, shared_address, similar_words_in_company_name, ...)
– same person (X, Y): certainty related to the number of companies X and Y are both directors of, and whether those companies share other directors, the same address, or similar company names.
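
By way of illustration only, a toy scoring function along those lines (the weights are entirely made up):

def sprawl_certainty(shared_directors, same_address, name_word_overlap):
    # Toy confidence that two companies belong to the same sprawl:
    #   shared_directors - count of concurrently shared directors
    #   same_address - whether the registered addresses match
    #   name_word_overlap - fraction (0..1) of words the company names share
    score = 0.3 * min(shared_directors, 3) / 3.0
    if same_address:
        score += 0.4
    score += 0.3 * name_word_overlap
    return score

print(sprawl_certainty(2, True, 0.5))  # 0.75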

If we quickly look at the new OpenCorporates API, we see that there are a couple of officer-related calls: GET officers/search and GET officers/:id.

Based on the above ‘note to self’, I wonder if it’d be useful to code up a recipe that takes an officer ID, fetches the name of the director, runs a name search, then tries to assign a likelihood that each person in the returned set of search results is the same as the person whose ID was supplied in the original lookup? This is the sort of thing that Google Refine reconciliation API services offer, so maybe this is already available via the OpenCorporates reconciliation API?

PS I use the phrase “corporate sprawl” to refer to a similar thing to OpenCorporates’ user-curated corporate_groupings. One thing that interests me is the extent to which we can build tools to automatically make suggestions about corporate_grouping membership.

PPS running the scraper, I noticed that Scraperwiki have a job opening for a “data scientist”

Initial Sketch of Registered Addresses of Tesco Companies

Following on from Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch, where I graphed relations between Tesco registered companies based on co-directorships, I also used OpenCorporates to grab the registered addresses for the companies returned from the OpenCorporates reconciliation API based on a search using the term tesco.

This initial sketch uses two node types – companies and registered addresses (here’s the Scraperwiki view used to generate the graph file):

We can see how several of the addresses relate to the same location, although they are not identical in string matching terms – a bit of text processing may be able to fix that though…

Not surprisingly, the Cayman Islands features as well as the Cheshunt address…

Having got addresses, we could do a bit of geocoding and pop the results onto a map…here’s an example using Google Fusion Tables.

Mapping the Tesco Corporate Organisational Sprawl – An Initial Sketch

A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:

The recipe is as follows:

– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.

You can find the scraper here: Tesco Sprawl Grapher

import scraperwiki, simplejson,urllib

import networkx as nx

#Keep the API key private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey=qsenv["OCKEY"]
except:
    ockey=''

rurl='http://opencorporates.com/reconcile/gb?query=tesco'
#note - the opencorporates api also offers a search:  companies/search
entities=simplejson.load(urllib.urlopen(rurl))

def getOCcompanyData(ocid):
    ocurl='http://api.opencorporates.com'+ocid+'/data'+'?api_token='+ockey
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata

#need to find a way of playing nice with the api, and not keep retrawling

def getOCfilingData(ocid):
    ocurl='http://api.opencorporates.com'+ocid+'/filings'+'?per_page=100&api_token='+ockey
    tmpdata=simplejson.load(urllib.urlopen(ocurl))
    ocdata=tmpdata['filings']
    print 'filings',ocid
    #print 'filings',ocid,ocdata
    #print 'filings 2',tmpdata
    while tmpdata['page']<tmpdata['total_pages']:
        page=str(tmpdata['page']+1)
        print '...another page',page,str(tmpdata["total_pages"]),str(tmpdata['page'])
        ocurl='http://api.opencorporates.com'+ocid+'/filings'+'?page='+page+'&per_page=100&api_token='+ockey
        tmpdata=simplejson.load(urllib.urlopen(ocurl))
        ocdata=ocdata+tmpdata['filings']
    return ocdata

def recordDirectorChange(ocname,ocid,ffiling,director):
    ddata={}
    ddata['ocname']=ocname
    ddata['ocid']=ocid
    ddata['fdesc']=ffiling["description"]
    ddata['fdirector']=director
    ddata['fdate']=ffiling["date"]
    ddata['fid']=ffiling["id"]
    ddata['ftyp']=ffiling["filing_type"]
    ddata['fcode']=ffiling["filing_code"]
    print 'ddata',ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

def logDirectors(ocname,ocid,filings):
    print 'director filings',filings
    for filing in filings:
        if filing["filing"]["filing_type"]=="Appointment of director" or filing["filing"]["filing_code"]=="AP01":
            desc=filing["filing"]["description"]
            director=desc.replace('DIRECTOR APPOINTED ','')
            recordDirectorChange(ocname,ocid,filing['filing'],director)
        elif filing["filing"]["filing_type"]=="Termination of appointment of director" or filing["filing"]["filing_code"]=="TM01":
            desc=filing["filing"]["description"]
            director=desc.replace('APPOINTMENT TERMINATED, DIRECTOR ','')
            director=director.replace('APPOINTMENT TERMINATED, ','')
            recordDirectorChange(ocname,ocid,filing['filing'],director)

for entity in entities['result']:
    ocid=entity['id']
    ocname=entity['name']
    filings=getOCfilingData(ocid)
    logDirectors(ocname,ocid,filings)

The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.

import scraperwiki
import urllib
import networkx as nx

import networkx.readwrite.gexf as gf

from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach( 'tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)

DG=nx.DiGraph()

directors=[]
companies=[]
for row in data:
    if row['fdirector'] not in directors:
        directors.append(row['fdirector'])
        DG.add_node(directors.index(row['fdirector']),label=row['fdirector'],name=row['fdirector'])
    if row['ocname'] not in companies:
        companies.append(row['ocname'])
        DG.add_node(row['ocid'],label=row['ocname'],name=row['ocname'])   
    DG.add_edge(directors.index(row['fdirector']),row['ocid'])

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")


writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(DG)

print tostring(writer.xml)

Saving the output of the view as a gexf file means it can be loaded directly in to Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note that parts of the whole graph are disconnected.

In the fragment below, we see that the Tesco Property Nominees companies are only loosely linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)

And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is it arbitrary?

Tesco Graph Hunting on OpenCorporates

A quick lunchtime post on some thoughts around constructing corporate graphs around OpenCorporates data. To ground it, consider a search for “tesco” run on gb registered companies via the OpenCorporates reconciliation API.

{"result":[{"id":"/companies/gb/00445790", "name":"TESCO PLC", "type":[{"id":"/organization/organization","name":"Organization"}], "score":78.0, "match":false, "uri":"http://opencorporates.com/companies/gb/00445790"}, {"id":"/companies/gb/05888959", "name":"TESCO AQUA (FINCO1) LIMITED", "type":[{"id":"/organization/organization", "name":"Organization"}], "score":71.0, "match":false, "uri":"http://opencorporates.com/companies/gb/05888959"}, { ...

Some or all of these companies may or may not be part of the same corporate group. (That is, there may be companies in that list with Tesco in the name that are not part of the group of companies associated with a major UK supermarket.)

If we treat the companies returned in that list as one class of nodes in a graph, we can start to construct a range of graphs that demonstrate linkage between companies based on a variety of factors. For example, a matching registered address – mediated by a post office box in an offshore tax haven – suggests there may be at least a weak tie between companies:

(Alternatively, we might construct bipartite graphs containing company nodes and address nodes, for example, then collapse the graph about common addresses.)

Shared directors would be another source of linkage, although at the moment, I don’t think OpenCorporates publishes directors associated with UK companies (I suspect that data is still commercially licensed?). However, there is associated information available in the OpenCorporates database already…. For example, if we look at the various company filings, we can pick up records relating to director appointments and terminations?

By monitoring filings, we can then start to build up a record of directorial involvement with companies? From looking at the filings, it also suggests that it would make sense to record commencement and cessation dates for directorial appointments…

There may also be weak secondary evidence linking companies. For example, two companies that file trademarks using the same agent have a weak tie through that agent. (Of course, that agent may be acting for two completely independent companies.)

If we weight edges between nodes according to the perceived strength of a tie and then lay out the graph in a way that is sensitive to the number and weight of edge connections between company nodes, we may be able to start mapping out the corporate structure of these large, distributed corporations, either in network map terms, or maybe by mapping geolocated nodes based on registered addresses; and then we can start asking questions about why these distributed corporate entities are structured the way they are…

PS note to self – OpenCorporates API limit with key: 1000/hr, 10k/day

Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine

Listening to Chris Taggart talking about OpenCorporates at netzwerk recherche conf – data, research, stories, I figured I really should start to have a play…

Looking through the example data available from an opencorporates company ID via the API, I spotted that registered trademark data was available. So here’s a quick roundabout way of previewing trademarked images using OpenCorporates and Google Refine.

First step is to grab the data – the opencorporates API reference docs give an example URL for grabbing a company’s (i.e. a legal entity’s) data: http://api.opencorporates.com/companies/gb/00102498/data

Google Refine supports the import of JSON from a URL:

(Hmm, it seems as if we could load in data from several URLs in one go… maybe data from different BP companies?)

Having grabbed the JSON, we can say which blocks we want to import as row items:

We can preview the rows to check we’re bringing in what we expect…

We’ll take this data by clicking on Create Project, and then start to work on it. Because the plan is to grab trademark images, we need to grab data back from OpenCorporates relating to each trademark. We can generate the API call URLs from the datum – id column:

The OpenCorporates data item API calls are of the form http://api.opencorporates.com/data/2601371, which we can generate as follows:

Here’s what we get back:

If we look through the data, there are several fields that may be interesting: the “representative_name_lines” (the person/group that registered the trademark), the representative_address_lines, the mark_image_type and, most importantly of all, the international_registration_number. Note that some of the trademarks are not images – we’ll end up ignoring those (for the purposes of this post, at least!)

We can pull out these data items into separate columns by creating columns directly from the trademark data column:

The elements are pulled in using expressions of the following form:

Here are the expressions I used (each expression is used to create a new column from the trademark data column that was imported from automatically constructed URLs):

  • value.parseJson().datum.attributes.mark_image_type – the first part of the expression parses the data as JSON, then we navigate using dot notation to the part of the Javascript object we want…
  • value.parseJson().datum.attributes.mark_text
  • value.parseJson().datum.attributes.representative_address_lines
  • value.parseJson().datum.attributes.representative_name_lines
  • value.parseJson().datum.attributes.international_registration_number

Finding how to get images from international registration numbers was a bit of a faff. In the end, I looked up several records on the WIPO website that displayed trademarked images, then looked at the pattern of their URLs. The ones I checked seemed to have the form:
http://www.wipo.int/romarin/images/XX/YY/XXYYNN.typ
where typ is gif or jpg and XXYYNN is the international registration number. (This may or may not be a robust convention, but it worked for the examples I tried…)

The following GREL expression generates the appropriate URL from the trademark column:

if( or(value.parseJson().datum.attributes.mark_image_type=='JPG', value.parseJson().datum.attributes.mark_image_type=='GIF'), 'http://www.wipo.int/romarin/images/' + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2)[0] + '/' + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2, 2)[1] + '/' + value.parseJson().datum.attributes.international_registration_number + '.' + toLowercase(value.parseJson().datum.attributes.mark_image_type), '')

The first part checks that we have a GIF or JPG image type identified, and if it does, then we construct the URL path, and finally cast the filetype to lower case, else we return an empty string.
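
Here’s the same construction in Python, for anyone following along outside Refine – with the same caveat that the URL pattern may not be robust:

def wipo_image_url(reg_number, image_type):
    # Mirror the GREL expression: build XX/YY/XXYYNN.typ from the registration number
    if image_type not in ('JPG', 'GIF'):
        return ''
    return 'http://www.wipo.int/romarin/images/%s/%s/%s.%s' % (
        reg_number[0:2], reg_number[2:4], reg_number, image_type.lower())

print(wipo_image_url('777839', 'JPG'))
# http://www.wipo.int/romarin/images/77/78/777839.jpg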

Now we can filter the data to only show rows that contain a trademark image URL:

Finally, we can create a template to export a simple HTML file that will let us preview the image:

Here’s a crude template I tried:

The file is exported as a .txt file, but it’s easy enough to change the suffix to .html so that we can view the fie in a browser, or I can cut and paste the html into this page…

[UPDATE: images look like they now have the form: https://i1.wp.com/www.wipo.int/romarin/images/77/78/777839.jpg ? The IDs may also have changed…]

null null
null null
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”BP GROUP TRADE MARKS\”]” “[\”20 Canada Square,\”,\”Canary Wharf\”,\”London E14 5NJ\”]”
“[\”Murgitroyd & Company\”]” “[\”Scotland House,\”,\”165-169 Scotland Street\”,\”Glasgow G5 8PL\”]”
“[\”BP GROUP TRADE MARKS\”]” “[\”20 Canada Square,\”,\”Canary Wharf\”,\”London E14 5NJ\”]”
“[\”BP Group Trade Marks\”]” “[\”20 Canada Square, Canary Wharf\”,\”London E14 5NJ\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”ROBERT WILLIAM BOAD\”,\”BP p.l.c. – GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”MURGITROYD & COMPANY\”]” “[\”17 Lansdowne Road\”,\”Croydon, Surrey CRO 2BX\”]”
“[\”A.C. CHILLINGWORTH\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON EC2M 7BA\”]”
“[\”BP Group Trade Marks\”]” “[\”20 Canada Square, Canary Wharf\”,\”London E14 5NJ\”]”
“[\”ROBERT WILLIAM BOAD\”,\”GROUP TRADE MARKS\”]” “[\”Britannic House,\”,\”1 Finsbury Circus\”,\”LONDON, EC2M 7BA\”]”
“[\”BP GROUP TRADE MARKS\”]” “[\”20 Canada Square,\”,\”Canary Wharf\”,\”London E14 5NJ\”]”

Okay – so maybe I need to tidy up the registration related columns, but as a recipe, it sort of works. (Note that it took way longer to create this blog post than it did to come up with the recipe…)

A couple of things that came to mind: having used Google Refine to sketch out this hack, we could now move to code it up, maybe in something like Scraperwiki. For example, I only found trademarks registered to one legal entity associated with BP, rather than checking for trademarks held by the myriad number of legal entities associated with BP. I also wonder whether it would be possible to “compile” what Google Refine is doing (import from URL, select row items, run operations against columns, export templated data) as code so that it could be run elsewhere (so, for example, could all those steps be exported as a single Javascript or Python script, maybe calling on a GREL/Google Refine library that provides some sort of abstraction layer or virtual machine for the script to make use of?)

PS What’s next…? The trademark data also identifies one or more areas in which the trademark applies; I need to find some way of pulling out each of the “en” attribute values from the items listed in the value.parseJson().datum.attributes.goods_and_services_classifications.
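
Assuming the classifications come back as a list of items each carrying an “en” label – a guess at the JSON structure based on the attribute path above – something like this would do it in Python:

import json

# Guessed shape of the trademark datum JSON, trimmed to the relevant path
raw = '''{"datum": {"attributes": {"goods_and_services_classifications": [
    {"en": "Industrial oils and greases"},
    {"en": "Fuels and illuminants"}
]}}}'''

datum = json.loads(raw)
classifications = datum['datum']['attributes']['goods_and_services_classifications']
en_values = [c['en'] for c in classifications if 'en' in c]
print(en_values)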