A quick sketch, prompted by Tesco Graph Hunting on OpenCorporates, of how some of Tesco’s various corporate holdings are related based on director appointments and terminations:
The recipe is as follows:
– grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco (a sketch of the response this returns follows the list)
– grab the filings for each of those companies
– trawl through the filings looking for director appointments or terminations
– store a row for each directorial appointment or termination including the company name and the director.
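For reference, the reconciliation API returns a JSON object whose result list holds the candidate matches; all the scraper needs from each entry is its id (a path of the form /companies/gb/…) and its name. A quick illustrative peek (the values in the comment show the shape of the data, not guaranteed live results):

import simplejson, urllib

# Illustrative only: inspect the shape of the reconciliation response
# that the scraper below iterates over.
entities = simplejson.load(urllib.urlopen('http://opencorporates.com/reconcile/gb?query=tesco'))
for entity in entities['result'][:3]:
    # prints e.g. /companies/gb/00445790 TESCO PLC (example values)
    print entity['id'], entity['name']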
You can find the scraper here: Tesco Sprawl Grapher
import scraperwiki, simplejson, urllib

# Keep the API key private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey = qsenv["OCKEY"]
except:
    ockey = ''

# note - the opencorporates api also offers a search: companies/search
rurl = 'http://opencorporates.com/reconcile/gb?query=tesco'
entities = simplejson.load(urllib.urlopen(rurl))

# fetch the full company data record for a given OpenCorporates id
# (not called below, but handy to have around)
def getOCcompanyData(ocid):
    ocurl = 'http://api.opencorporates.com' + ocid + '/data' + '?api_token=' + ockey
    ocdata = simplejson.load(urllib.urlopen(ocurl))
    return ocdata

# need to find a way of playing nice with the api, and not keep retrawling
def getOCfilingData(ocid):
    # grab the first page of filings, then keep paging until we have them all
    ocurl = 'http://api.opencorporates.com' + ocid + '/filings' + '?per_page=100&api_token=' + ockey
    tmpdata = simplejson.load(urllib.urlopen(ocurl))
    ocdata = tmpdata['filings']
    print 'filings', ocid
    while tmpdata['page'] < tmpdata['total_pages']:
        page = str(tmpdata['page'] + 1)
        print '...another page', page, str(tmpdata["total_pages"]), str(tmpdata['page'])
        ocurl = 'http://api.opencorporates.com' + ocid + '/filings' + '?page=' + page + '&per_page=100&api_token=' + ockey
        tmpdata = simplejson.load(urllib.urlopen(ocurl))
        ocdata = ocdata + tmpdata['filings']
    return ocdata

# save a row describing a single director appointment/termination filing
def recordDirectorChange(ocname, ocid, ffiling, director):
    ddata = {}
    ddata['ocname'] = ocname
    ddata['ocid'] = ocid
    ddata['fdesc'] = ffiling["description"]
    ddata['fdirector'] = director
    ddata['fdate'] = ffiling["date"]
    ddata['fid'] = ffiling["id"]
    ddata['ftyp'] = ffiling["filing_type"]
    ddata['fcode'] = ffiling["filing_code"]
    print 'ddata', ddata
    scraperwiki.sqlite.save(unique_keys=['fid'], table_name='directors', data=ddata)

# trawl the filings for director appointments (AP01) and terminations (TM01),
# pulling the director's name out of the filing description
def logDirectors(ocname, ocid, filings):
    print 'director filings', filings
    for filing in filings:
        if filing["filing"]["filing_type"] == "Appointment of director" or filing["filing"]["filing_code"] == "AP01":
            desc = filing["filing"]["description"]
            director = desc.replace('DIRECTOR APPOINTED ', '')
            recordDirectorChange(ocname, ocid, filing['filing'], director)
        elif filing["filing"]["filing_type"] == "Termination of appointment of director" or filing["filing"]["filing_code"] == "TM01":
            desc = filing["filing"]["description"]
            director = desc.replace('APPOINTMENT TERMINATED, DIRECTOR ', '')
            director = director.replace('APPOINTMENT TERMINATED, ', '')
            recordDirectorChange(ocname, ocid, filing['filing'], director)

for entity in entities['result']:
    ocid = entity['id']
    ocname = entity['name']
    filings = getOCfilingData(ocid)
    logDirectors(ocname, ocid, filings)
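As the comment in the scraper admits, it doesn’t yet play nicely with the API (and, as noted below, it did hit the OpenCorporates rate limiter). One simple way of being kinder is to throttle requests and cache raw responses so that reruns don’t retrawl pages we already have. A minimal sketch, assuming a one second delay, an apicache table and the parameterised form of scraperwiki.sqlite.select (all my own choices, not part of the original scraper):

import time, urllib, simplejson, scraperwiki

def cachedJSONGet(url, delay=1.0):
    # Reuse a cached copy of the response if we've fetched this URL before.
    try:
        cached = scraperwiki.sqlite.select('json FROM "apicache" WHERE url=?', [url])
        if cached:
            return simplejson.loads(cached[0]['json'])
    except:
        pass  # first run: the cache table doesn't exist yet
    time.sleep(delay)  # crude throttle between live API calls
    raw = urllib.urlopen(url).read()
    scraperwiki.sqlite.save(unique_keys=['url'], table_name='apicache',
                            data={'url': url, 'json': raw})
    return simplejson.loads(raw)

getOCfilingData and getOCcompanyData could then call cachedJSONGet(ocurl) in place of simplejson.load(urllib.urlopen(ocurl)).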
The next step is to graph the result. I used a Scraperwiki view (Tesco sprawl demo graph) to generate a bipartite network connecting directors (either appointed or terminated) with companies and then published the result as a GEXF file that can be loaded directly into Gephi.
import scraperwiki
import networkx as nx
import networkx.readwrite.gexf as gf
from xml.etree.cElementTree import tostring

scraperwiki.sqlite.attach('tesco_sprawl_grapher')
q = '* FROM "directors"'
data = scraperwiki.sqlite.select(q)

DG = nx.DiGraph()
directors = []
companies = []
# build a bipartite graph: one node per director, one node per company (keyed
# by its OpenCorporates id), and an edge for each appointment/termination
for row in data:
    if row['fdirector'] not in directors:
        directors.append(row['fdirector'])
        DG.add_node(directors.index(row['fdirector']), label=row['fdirector'], name=row['fdirector'])
    if row['ocname'] not in companies:
        companies.append(row['ocname'])
        DG.add_node(row['ocid'], label=row['ocname'], name=row['ocname'])
    DG.add_edge(directors.index(row['fdirector']), row['ocid'])

# serve the graph as GEXF XML, so the view output can be saved as a .gexf file
scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")
writer = gf.GEXFWriter(encoding='utf-8', prettyprint=True, version='1.1draft')
writer.add_graph(DG)
print tostring(writer.xml)
Saving the output of the view as a gexf file means it can be loaded directly into Gephi. (It would be handy if Gephi could load files in from a URL, methinks?) A version of the graph, laid out using a force-directed layout, with nodes coloured according to modularity grouping, suggests some clustering of the companies. Note that parts of the whole graph are disconnected.
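As a quick sanity check on that disconnectedness outside of Gephi, the saved GEXF file can be loaded back into networkx and the components counted. A minimal sketch, assuming the view output was saved as tesco_sprawl.gexf and the networkx 1.x API of the time (where connected_components returns a list of node lists, largest first):

import networkx as nx

# Load the saved view output (filename assumed) and count the separate
# pieces of the directors/companies graph.
G = nx.read_gexf('tesco_sprawl.gexf')
components = nx.connected_components(G.to_undirected())
print len(components), 'disconnected components'
for c in components[:5]:
    print len(c), 'nodes'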
In the fragment below, we see the Tesco Property Nominees companies are only loosely linked to each other, and from the previous graphic, we see that Tesco Underwriting doesn’t share any recent director moves with any other companies that I trawled. (That said, the scraper did hit the OpenCorporates API limiter, so there may well be missing edges/data…)
And what is it with accountants naming companies after colours?! (It reminds me of sys admins naming servers after distilleries and Lord of the Rings characters!) Is there any sense in there, or is it arbitrary?
> And what is it with accountants naming companies after colours?
Big Reservoir Dogs fans?