Getting Started With Neo4j and Companies House OpenData

One of the things that’s been on my to do list for ages has been to start playing with the neo4j graph database. I finally got round to having a dabble last night, and made a start trying to figure out how to load some sample data in.

The data I looked at came in two flavours, both bulk data downloads from Companies House:, a JSON dataset containing beneficial ownership/significant control data, and a tabular, CSV dataset containing basic company information.

To simplify running neo4j, I created a simple docker-compose.yml file that would fire up a couple of linked containers – one running neo4j, the other running a Jupyter notebook that I could run queries from. (Actually, I think neo4j has its own web UI, but I’m more comfortable in writing Python scripts in the Jupyter environment.)

#visit 7474 and change the default password - eg to: neo4jch
neo4jch:
  image: neo4j
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

jupyterscipyneoch:
  image: jupyter/scipy-notebook
  ports:
    - "8890:8888"
  links:
    - neo4jch:neo4j
  volumes:
    - ./notebooks:/home/jovyan/work

To launch things, I tend to run Kitematic, launch a docker command line, cd to the directory containing the above YAML file, then run docker-compose up -d. Kitematic then provides links to the neo4j and Jupyter web page UIs. One thing to note is that neo4j seems to want it’s default password changing – go to the container’s page on port 7474 and reset the password – I changed mine to neo4jch. Once launched, the containers can be suspended with the command docker-compose stop and resumed with docker-compose start.

I’ve popped an example notebook up here, along with a couple of sample data files, that shows how to load both sorts of data (the hierarchical JSON data, and the flat CSV table, into neo4j, along with a couple of sample queries.

That said, I’m not sure how good the examples are – I still need to read the documentation! (For example, via @markhneedham, “MERGE is MATCH/CREATE so you can use the same query on new/existing companies” which should let me figure out how to properly create company information nodes and them link to them from beneficial owners.)

Here are some examples of my starting attempts at the data ingest. Firstly, for JSON data that looks like this:

{
  "company_number": "09145694",
  "data": {
    "address": {
      "address_line_1": "****",
      "locality": "****",
      "postal_code": "****",
      "premises": "****",
      "region": "****"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": *,
      "year": *
    },
    "etag": "****",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"
    },
    "name": "***",
    "name_elements": {
      "forename": "***",
      "middle_name": "***",
      "surname": "***",
      "title": "***"
    },
    "nationality": "***",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

The following bit of Cypher script seems to load the data in:

with codecs.open('snapshot_beneficialsmall.txt', 'r', 'utf-8-sig') as f:
    for line in f:
        jdata = json.loads(line)
        query = """
WITH {jdata} AS jd
MERGE (beneficialowner:BeneficialOwner {name: jd.data.name}) ON CREATE
  SET beneficialowner.nationality = jd.data.nationality, beneficialowner.country_of_residence = jd.data.country_of_residence
MERGE (company:Company {companynumber: jd.company_number})
MERGE (beneficialowner)-[:BENEFICIALOWNEROF]->(company)
FOREACH (noc IN jd.data.natures_of_control | MERGE (beneficialowner)-[:BENEFICIALOWNEROF {kind:noc}]->(company))
"""
        graph.run(query, jdata = jdata)

For the CSV data, I tried the following recipe:

import csv
#Ideally, we create a company:Company node with a company either here
#and then link to it from the beneficial ownership data?
with open('snapshotcompanydata.csv','r') as csvfile:
    #need to clean the column names by stripping whitespace
    reader = csv.DictReader(csvfile,skipinitialspace=True)
    for row in reader:
        query="""
        WITH {row} AS row
        MERGE (company:Company {companynumber: row.CompanyNumber}) ON CREATE
  SET company.name = row.CompanyName

        MERGE (address:Address {postcode : row["RegAddress.PostCode"]}) ON CREATE
        SET address.line1=row['RegAddress.AddressLine1'], address.line2=row['RegAddress.AddressLine2'],
        address.posttown=row['RegAddress.PostTown'],
        address.county=row['RegAddress.County'],address.country=row['RegAddress.Country']
        MERGE (company)-[:LOCATION]->(address)

        MERGE (companyactivity:SICCode {siccode:row['SICCode.SicText_1']})
        MERGE (company)-[:ACTIVITY]->(companyactivity)
        """
        graph.run(query,row=row)

Note the way that “dotted” column names are handled.

What these early experiments suggest is that I should probably spend a bit of time trying to model the data to work out what sort of graph structure makes sense. My gut reaction was to define node types identifying beneficial owners, companies and SIC codes. Differently attributed BENEFICIALOWNEROF edges identify what sort of control a beneficial owner has.

companieshouse_beneficialownership_neo4j_-_companies_house_beneficial_ownership_data_ingester_ipynb_at_master_%c2%b7_psychemedia_companieshouse_beneficialownership

However, for generality, I think I should define a more general person node, who could also have DIRECTORROLE edges linking them to companies with attributes correpsponding to things like “director”, “company secretary”, “nominee direcotor” etc? (I don’t think director information is available as a download from Companies House, but it could be accreted/cached into my own database each time I look up director information via the Companies House API.)

A couple of other things that need addressing: constraints (so for example, we should only have one node per company number – the correlate of company numbers being a unique key in a relational datatable (via @markhneedham, s/thing like CREATE CONSTRAINT ON (c:Company) ASSERT c. companynumber is UNIQUE maybe…); and indexes – it would probably make sense to create an index on something company numbers, for example.

Next on the to do list, some example queries on the data as I currently have it modelled to see what sorts of question we can ask and what sorts of network we can extract (I may need to add in more than the sample of data – which means I may also need to look at optimising the way the data is imported?). This might also inform how I should be modelling the data!;-)

Related: Trawling the Companies House API to Generate Co-Director Networks.

See also: Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose and Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers.

PS also via @markhneedham, one to explore when eg annotating a pre-existing node with additional attributes from a new dataset, something along lines of MERGE (c:Company {…}) SET c.newProp1 = “boo”, c.newProp2 = “blah” etc…

Bands Incorporated

A few weeks ago, as I was doodling with some Companies House director network mapping code and simple Companies House chatbot ideas, I tweeted an example of Iron Maiden’s company structure based on co-director relationships. Depending on the original search is seeded, the maps may also includes elements of band members’ own personal holdings/interests. The following map, for example, is seeded just from the Iron Maiden LLP company number:

iron_maiden

If you know anything about the band, you’ll know Bruce Dickinson’s aircraft interests make complete sense…

That graph is actually a bipartite graph – nodes are either directors or companies. We can easily generate a projection of the graph that replaces directors that link companies by edges that represent “common director” links between companies:

ireonmaiden2.png

(The edges are actually weighted, so the greater the edge weight, the more directors there are in common between the linked companies.)

In today’s Guardian, I notice they’re running a story about Radiohead’s company structure, with a parallel online piece, Radiohead’s corporate empire: inside the band’s dollars and cents which shows how to get a story out of such a map, as well as how to re-present the original raw map to provide to a bit more spatial semantic structure to it:

Radiohead_s_corporate_empire__inside_the_band_s_dollars_and_cents___Music___The_Guardian

(The story also digs into the financial reports from some of the companies.)

By way of comparison, here’s my raw map of Radiohead’s current company structure, generated from Companies House data seeded on the company number for Radiohead Trademark:

radiohead

It’s easy enough to grab the data for other bands. So how about someone like The Who? If we look in the immediate vicinity of The Who Group, we see core interests:

who1

But if we look for linkage to the next level of co-director links, we start to see other corporate groups that hold some at least one shared interest with the band members:

who2

So what other bands incorporated in the UK might be worth mapping?

Chat Sketches with the Companies House API, Before the F***kWit UKGov Sell It Off

Ranty title a gut reaction response to news that the Land Registry faces privatisation.

Sketching around similar ideas to my Slack/slash conversational autoresponder around the Parliament data platform API, I thought I’d have a quick play with the UK Companies House API, which provides a simple interface to company registration data, director information and disqualified director information.

Bulk downloads are available for company registration information (here’s a quick howto about working with it; I’ll post a howto showing how to work with it using a containerised database when I get a chance, but for now, here are some clues) and from the API developer forums it looks as if bulk director’s information is available by request.

Working with your own bulk copy of the data is preferable, because it means you can write your own arbitrarily complex queries over any or all of the columns. The Companies House API, on the other hand, gives you a weak search over company and directors names, and the ability to look up individual known records. To do any sort of processing, you need to grab a large amount of search data, and/or make lots of individual known item records to build you own local datastore, and then search or filter across that.

So for example, here’s the first fumblings of my own function for filtering down on a list of officers associated with a particular company based on director role or current status (which I called typ for some reason? Bah:-(:

def ch_getCompanyOfficers(cn,typ='all',role='all'):
    #typ: all, current, previous
    url="https://api.companieshouse.gov.uk/company/{}/officers".format(cn)
    co=ch_request(CH_API_TOKEN,url)
    if typ=='current':
        co['items']=[i for i in co['items'] if 'resigned_on' not in i]
        #should possibly check here that len(co['items'])==co['active_count'] ?
    elif typ=='previous':
        co['items']=[i for i in co['items'] if 'resigned_on' in i]
    if role!='all':
        co['items']=[i for i in co['items'] if role==i['officer_role']]
    return co

The next function runs a search over officers by name, but then also lets you filter down the responses to show just those directors who also match a particular search string as part of any company name they are associated with.

def ch_searchOfficers(q,n=50,start_index='',company=''):
    url= 'https://api.companieshouse.gov.uk/search/officers'
    properties={'q':q,'items_per_page':n,'start_index':start_index}
    o=ch_request(CH_API_TOKEN,url,properties)
    if company != '':
        for p in o['items']:
            p['items'] = [i for i in ch_getAppointments(p['links']['self'])['items'] if company.lower() in i['appointed_to']['company_name'].lower()]
        o['items'] = [i for i in o['items'] if len(i['items'])]
    return o

You get the gist, hopefully. Run a crude API call, and then filter down the result according to particular data properties contained within the search result.

Anyway, as far as the chatting goes, here’s what I’ve started playing around with…

First, let’s just ask what companies a director with a particular name is associated with.

Companies_House_API_Bot1

We can take this a bit further by filtering down on the directors associated with a particular company. (Actually, this is simplified now to call the reporting function simply as dirCompanies(c)).

Companies_House_API_Bot

Alternatively, we might try to narrow the search for directors associated with companies in a particular locality. (I’m still trying to get my head round the different logics of this, because companies as well as directors are associated with addresses. I really need to try some specific investigative tasks to get a better feel for how to tune this sort of filter…)

Companies_House_API_Bot2

I’ve also started trying to think around the currency of appointments, for example supporting the ability to filter down based on resigned appointments:

Companies_House_API_Bot3

Associated with this sort of query (in the sense of exploring the past) are filters that let us search around dissolved companies, for example:

Companies_House_API_Bot4

(I should probably also put some time filters in there, for example to search for companies that a particular person was a director of at a particular time…)

We can also search into the disqualified directors register. To try to reduce the sense of turning this into a fishing trip, searching by director name and then filtering by locality feels like it could be handy (though again, this needs some refinement on the way I apply the locality filter.)

Companies_House_API_Bot5

Next step on this particular task is to tidy these up a little bit more and then pop them into a Slack responder.

But there are also some other Companies House goodies to come…such as revisiting the notion of co-director based company maps.