OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Corporate Sprawl Sketch Trawls Using OpenCorporates

A recent post on the OpenCorporates blog (Major Milestone: Over 50 million companies (& a sneak peak at the future)) provides a sneak preview of a tool they’re developing for visualising networks of companies based on “links of control or minority shareholdings”. I’m not sure what that actually means, but it all sounds very exciting;-)

Since Chris et al.(?) added the ability to view director information for companies directly, as well as search for directors via the new 0.2 version of the OpenCorporates API, I’ve been meaning to update my corporate sprawl hack (eg in context of Tesco, G4S and Thames Water) to make use of the director information directly. (Previously, I was trying to scrape it myself from company filings data that is also published via OpenCorporates.)

I finally got round to it over the weekend (ScraperWiki: OpenCorporates trawler), so here’s my opening recipe which tries to map the extent of a current corporate network based on common directorship:

  1. Given an OpenCorporates company ID, get the list of directors
  2. Try to find current directors (I’m using the heuristic of looking for ones with no end date on their appointment) and add them to a directors set;
  3. For each director in the directors set, search for directors with the same name. At the moment, the directors search is really loose, so I do a filtering pass to further limit results to only directors with exactly the same name.
    [There are three things to note here: i) it would be useful to have an 'exact search' limit option on the directors search to limit responses to just directors whose names exactly match the query string; ii) the directors search returns individual records for the appointment of a particular director in a particular company - at the moment, there is no notion of an actual person who may be the director of multiple companies (FRBR comes to mind here, eg in the sense of a director as a work?!, as well as researcher ID schemes such as ORCID); iii) the director records contain a uid element that is currently set to null. Is this possibly for a director ID scheme so we can know that two people with the same name who are directors of different companies are actually the same person?]
    The filtered directors search returns a list of director appointments relating to people with exactly the same name as the current directors of the target company. Each record relates to an appointment to a particular company, which gives us a list of companies that are possibly related to the target company by virtue of co-directorship.
  4. Having got a list of possibly related companies, look up the details for each. If the company is an active company, I run a couple of tests to see if it is related to the target company. The heuristics I’ve started off with are:
    • does it share exactly the same registered address as the target company? If so, there’s a chance it’s related. [Note: being able to search companies by address could be quite useful, as a step on the functionality road to a full geo-search based on geocoding of addresses, maybe?!;-)]
    • does the company share N or more current directors with directors in the directors set? (I’m starting off with N=2.) If so, there’s a chance it’s related.
  5. This is the end of the first pass, and it returns a set of active companies that are possibly related to a target company by virtue of: i) sharing at least N active directors; and/or ii) sharing at least one common director and the same address.
  6. I also set the trawl up to recurse: the above description is a depth=1 search. For depth 2, from the list of companies added to the sprawl, grab all their active directors and repeat. We can do this to any depth required (I haven’t tried going much deeper yet!), though it may make sense to increase N as more directors get added to the directors set.
  7. Note that I also added a couple of optimisation steps to try to counter directors that are just nominees – ones that have hundreds of pages of results in the directors lookup and end up taking the sprawl across the corporate connected giant component (related: FRBR superduping, xISBN lookups and Harry Potter…)
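
Steps 2 to 4 of the recipe can be sketched as a few small functions over records that have already been fetched. This is only a minimal illustration, written for Python 3 (the scraper itself is Python 2). The simplified record shapes (dicts with `name`, `end_date`, `registered_address` and `officers` keys) and the function names are assumptions made for the example, not the OpenCorporates response format or the scraper's own code.

```python
# Python 3 sketch. Record shapes are simplified assumptions for illustration.

def current_directors(officers):
    """Step 2: names of officers whose appointment has no end date."""
    return {o["name"] for o in officers if o.get("end_date") is None}


def exact_name_matches(name, search_results):
    """Step 3: keep only the loose search hits whose name matches exactly."""
    return [r for r in search_results if r["name"] == name]


def possibly_related(company, directors_set, addresses, n=2):
    """Step 4: same registered address, or at least n shared current directors.

    Candidates come from a director search, so a company that reaches this
    test is assumed to share at least one director name already.
    """
    same_address = company["registered_address"] in addresses
    shared = current_directors(company["officers"]) & directors_set
    return same_address or len(shared) >= n


# Made-up data: a target company's officers and one candidate company.
target_officers = [
    {"name": "ANN EXAMPLE", "end_date": None},
    {"name": "BOB SAMPLE", "end_date": "2010-01-01"},
    {"name": "CAT DEMO", "end_date": None},
]
directors_set = current_directors(target_officers)

candidate = {
    "registered_address": "2 Low Road",
    "officers": [
        {"name": "ANN EXAMPLE", "end_date": None},
        {"name": "CAT DEMO", "end_date": None},
    ],
}
print(possibly_related(candidate, directors_set, {"1 High Street"}))  # True
```

Keeping the tests as pure functions makes it easy to try different values of N, or extra heuristics, without calling the API again.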

As an example of what sort of thing this discovers, here’s a depth 2 search around Care UK, whose annual report I was reading last week… (I hadn’t realised quite how privatisation of care services had got…)

[Image: Care UK sprawl]

Here’s depth 3:

[Image: Care UK sprawl, depth 3]

And here’s a glimpse at some of the companies identified:

[Image: Care UK discovered companies]

One thing that occurred to me is that this tool could be used to support corporate discovery during the curation process of “corporate groupings”:

[Image: OpenCorporates corporate grouping]

A few things to note about corporate groupings:

  1. it would be useful to be able to filter on all/active/inactive status?
  2. if you mistakenly add a company to a corporate grouping, how do you remove it?
  3. the feature that pulls in spending items from OpenlyLocal is really nice, but it requires better tools on the OpenlyLocal side for associating spending line elements with companies. This is particularly true for sprawls, where eg council spending items declare an amount spent with eg “Care UK” but you have no idea which legal entity that actually relates to?

And just in passing, what’s going on here?

[Image: refund?]

Hmmm.. this post has itself turned into a bit of a sprawl, hasn’t it?! For completeness, here’s the code from the scraper:

#The aim of this scraper is to provide, in the first instance, a way of bootstrapping a search around either a company ID or a director ID
#The user should also define a tablename stub to identify the trawl.

#If one or more company IDs are specified:
#Get the company details
#??Add any names the company was previously known by to a list of 'previous' companies ?
#??do "morph chains" to show how company names change?
#Get the directors
#Search for directors of same name and then do an exact match filter pass
#Get the companies associated with those exact matches


#TO DO - Spot and handle rate limiting
#TO DO - populate db

targetCompanies=['gb/01668247'] #list of OpenCorporates Company IDs with leading country code
targetDirectors=[] #list of OpenCorporates Director IDs
targetStub='Care UK 2,2 test' #name of the db table stub
trawldepth=2
coverage='current' #all, current, previous **Relates to directors
status='active' #all, active, inactive **Relates to companies
DIRINTERSECT=2 #The minimum number of shared directors (current or past) to count as part of same grouping
#------

targetStub=targetStub.replace(' ','_')

import scraperwiki, simplejson,urllib,re

#Keep the API key private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
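try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
except Exception:
    #No usable query string (eg when run outside ScraperWiki)
    qsenv = {}
#Look the keys up separately so a missing YKEY does not blank the OpenCorporates key
ockey=qsenv.get("OCKEY",'')
ykey=qsenv.get("YKEY",'')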

#----
APISTUB='http://api.opencorporates.com/v0.2'

def deslash(x): return x.strip('/')
def signed(url): return url+'?api_token='+ockey

def occStrip(ocURL):
    return deslash(ocURL.replace('http://opencorporates.com/companies',''))

def buildURL(items):
    url=APISTUB
    for i in items:
        url='/'.join([url,deslash(i)])
    return signed(url)

def getOCcompanyData(ocid):
    ocurl=buildURL(['companies',ocid])
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    if 'results' in ocdata: return ocdata['results']
    else: return -1

def getOCofficerData(ocid):
    ocurl=buildURL(['officers',ocid])
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata['results']


def recorder(data):
    d=[]
    for record in data['companies']:
        dd=record.copy()
        d.append(dd)
        if len(d)>100:
            #Use the same unique keys as the final companies save below
            scraperwiki.sqlite.save(unique_keys=['jurisdiction_code','company_number'], table_name='companies_'+targetStub, data=d)
            d=[]
    scraperwiki.sqlite.save(unique_keys=['jurisdiction_code','company_number'], table_name='companies_'+targetStub, data=d)
    data['companies']=[]
    d=[]
    for record in data['directors']:
        dd=record.copy()
        d.append(dd)
        if len(d)>100:
            scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='directors_'+targetStub, data=d)
            d=[]
    scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='directors_'+targetStub, data=d)
    data['directors']=[]
    return data
    
exclusions_d=['FIRST SCOTTISH SECRETARIES LIMITED','FIRST DIRECTORS LIMITED']
exclusions_r=['nominated director','nominated secretary']
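def getOCofficerCompaniesSearch(name,page=1,cidnames=None):
    #Start each top-level search with a fresh list; a mutable default ([]) would be shared across calls
    if cidnames is None: cidnames=[]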
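    #URL-encode the name so spaces, ampersands and non-ASCII characters survive in the query string
    durl=APISTUB+'/officers/search?q='+urllib.quote_plus(name.encode('utf-8'))+'&per_page=100&page='+str(page)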
    ocdata=simplejson.load(urllib.urlopen(durl+'&api_token='+ockey))['results']
    optimise=0
    #?need a heuristic for results with large page count?
    #Maybe put things into secondary possibles to check against?
    #The logic of this is really hacky and pragmatic(?!;-) Need to rethink... 
    for officer in ocdata['officers']:
        if (officer['officer']['name'].strip() in exclusions_d) or officer['officer']['position'] in exclusions_r:
            optimise=1
            break
        elif name==officer['officer']['name']:
            #print 'Possible new company for',name,officer['officer']['company']['name']
            #would a nominated secretary be interesting to search on? eg FIRST SECRETARIES LIMITED
            cidnames.append( ( occStrip(officer['officer']['company']['opencorporates_url']), occStrip(officer['officer']['company']['name']) ) )
    if page < ocdata['total_pages'] and optimise==0:
        page=page+1
        cidnames=getOCofficerCompaniesSearch(name,page,cidnames)
    #http://api.opencorporates.com/v0.2/officers/search?q=john+smith
    return cidnames
#-----

def trawlPass(data=[],depth=1,coverage='current',status='active'):
    data['depth']=data['depth']+1
    done=1
    newTargets=[]
    for ocid in data['targetCompanies']:
        if ocid not in data['cids']:
            bigtmp=[]
            data['cids'].append(ocid)
            cd=getOCcompanyData(ocid)
            if cd!=-1:
                if status=='active' and (cd['company']['inactive']): cd=-1
                elif status=='inactive' and not (cd['company']['inactive']): cd=-1
            if cd!=-1:
                cd=cd['company']
                uid=occStrip(cd['opencorporates_url'])
                dids=cd['officers']
                tmp={'ocid':uid}
                for x in ['name','jurisdiction_code','company_number','incorporation_date','dissolution_date','registered_address_in_full']:
                    tmp[x]=cd[x]
                didset=[]
                for didr in dids:
                    did=didr['officer']
                    #TEST - TO DO  - is None the right thing here?
                    print did['name'],did['end_date']
                    if coverage=='all':
                        didset.append(did['name'])
                    elif coverage=='current' and did['end_date'] is None:
                        didset.append(did['name'])
                    elif coverage=='previous' and did['end_date'] is not None:
                        didset.append(did['name'])
                #some additional logic for heuristically determining whether or not a company is in same grouping
                if data['depth']==1: inset=1
                else: inset=0
                print coverage,'dirset',didset
                if (len(list(set(didset) & set(data['dnames'])))) >= DIRINTERSECT : inset=1
                if cd['registered_address_in_full'] in data['addresses']: inset=1
                if (inset==1):
                    data['companies'].append(tmp.copy())
                    print 'Added',tmp
                    if cd['registered_address_in_full'] not in data['addresses']: data['addresses'].append(cd['registered_address_in_full'])
                    for didr in dids:
                        if didr['officer']['name'] in didset:
                            did=didr['officer']
                            print 'dir',did['name']
                            did['ocid']=did['opencorporates_url'].replace("http://opencorporates.com/officers/","")
                            did['cname']=cd['name']
                            data['directors'].append(did.copy())
                            if did['name'] not in data['dnames']:
                                data['dnames'].append(did['name'])
                                #get matchalikes
                                cidnames=getOCofficerCompaniesSearch(did['name'])
                                for (cid,cname) in cidnames:
                                    bigtmp.append({'cid':cid,'cname':cname,'dname':did['name']})
                                    if len(bigtmp)>20:
                                        scraperwiki.sqlite.save(unique_keys=['cid','dname'], table_name='possibles_'+targetStub, data=bigtmp)
                                        bigtmp=[]
                                    if cid not in data['targetCompanies'] and cid not in newTargets:
                                        #print 'Brand new company for dir',cid
                                        newTargets.append(cid)
                    #if len(data['companies'])>20 or len(data['directors'])>20:
                    data=recorder(data)
                scraperwiki.sqlite.save(unique_keys=['cid','dname'], table_name='possibles_'+targetStub, data=bigtmp)
                bigtmp=[]
    data=recorder(data)
    for ocid in newTargets:
        data['targetCompanies'].append(ocid)
        done=0
    for director in data['targetDirectors']:
        #Look up the target director, not the last company ID left over from the loop above
        od=getOCofficerData(director)['officer']
        ocid=occStrip(od['company']['opencorporates_url'])
        if ocid not in data['targetCompanies']:
            data['targetCompanies'].append(ocid)
            done=0
    depth=depth-1
    if (done==0) and depth>0:
        return trawlPass(data,depth,coverage,status)
    else: return data

_targetCompanies=[]
for c in targetCompanies:
    _targetCompanies.append(deslash(c))

init={'depth':0,'targetCompanies':_targetCompanies,'targetDirectors':targetDirectors,'cids':[],'dnames':[],'addresses':[],'companies':[],'directors':[]}
data=trawlPass(init,trawldepth,coverage,status)
print data
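
The recursive paging in getOCofficerCompaniesSearch could also be written as a loop that builds a fresh result list on every call. The following is a rough Python 3 sketch rather than a drop-in replacement for the Python 2 scraper: `officer_companies` and the injected `fetch_page(name, page)` callable are hypothetical names, and each page is assumed to have the same `officers` / `total_pages` shape that the scraper reads from the API's `results` object.

```python
# Python 3 sketch. fetch_page is a hypothetical callable supplied by the
# caller, so the paging logic can be exercised without hitting the API.

EXCLUDED_NAMES = {"FIRST SCOTTISH SECRETARIES LIMITED", "FIRST DIRECTORS LIMITED"}
EXCLUDED_POSITIONS = {"nominated director", "nominated secretary"}


def officer_companies(name, fetch_page):
    """Collect (company URL, company name) pairs for exact name matches.

    fetch_page(name, page) is assumed to return one page of results as
    {'officers': [{'officer': {...}}, ...], 'total_pages': int}.
    """
    found = []  # new list per call, so results never leak between directors
    page = 1
    while True:
        results = fetch_page(name, page)
        for item in results["officers"]:
            officer = item["officer"]
            if (officer["name"].strip() in EXCLUDED_NAMES
                    or officer["position"] in EXCLUDED_POSITIONS):
                # Looks like a nominee: stop paging and keep what we have
                return found
            if officer["name"] == name:
                company = officer["company"]
                found.append((company["opencorporates_url"], company["name"]))
        if page >= results["total_pages"]:
            return found
        page += 1
```

Passing the page fetcher in as an argument also gives one obvious place to add the rate limit handling that is still on the TO DO list.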

When I get a chance, I’ll try to pop up a couple of viewers over the data that’s scraped.
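
As one possible first step towards such a viewer, the director rows could be folded into a company-to-company edge list, with each edge weighted by the number of shared director names. This is a minimal Python 3 sketch (the scraper above is Python 2); `company_edges` is a hypothetical helper, and it assumes rows carrying the `name` and `cname` fields that the scraper writes to its directors table.

```python
# Python 3 sketch. Assumes rows shaped like the scraper's director records.
from collections import defaultdict
from itertools import combinations


def company_edges(director_rows):
    """Link every pair of companies that share a director name.

    Returns {(company_a, company_b): number_of_shared_director_names},
    with each pair stored in alphabetical order.
    """
    companies_by_director = defaultdict(set)
    for row in director_rows:
        companies_by_director[row["name"]].add(row["cname"])

    edges = defaultdict(int)
    for companies in companies_by_director.values():
        for a, b in combinations(sorted(companies), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Made-up rows for illustration only.
rows = [
    {"name": "ANN EXAMPLE", "cname": "ALPHA LTD"},
    {"name": "ANN EXAMPLE", "cname": "BETA LTD"},
    {"name": "CAT DEMO", "cname": "ALPHA LTD"},
    {"name": "CAT DEMO", "cname": "BETA LTD"},
    {"name": "CAT DEMO", "cname": "GAMMA LTD"},
]
edges = company_edges(rows)
print(edges[("ALPHA LTD", "BETA LTD")])  # 2
```

Because matching is on name alone, two different people with the same name will still produce an edge, so the weights are best read as hints rather than proof of a connection.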

Written by Tony Hirst

January 28, 2013 at 10:57 am

Posted in Anything you want


4 Responses


  1. I’m thinking (from the context of the developer of an alt online newspaper http://course.downes.ca and http://monctonfreepress.ca ) how subversive it would be to just make these diagrams (the ones in your post, not the ones on OpenCorporates) open up when you click on a person’s name or a company name – people would then see how the elements of the news they read every day are implicated in this corporate web.

    Stephen Downes

    January 30, 2013 at 8:13 pm

  2. [...] the scraper I described in Corporate Sprawl Sketch Trawls Using OpenCorporates, I popped in a few different seed companies and pulled back lists of companies that shared two or [...]

  3. [...] been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working [...]

