Getting Started With Neo4j and Companies House OpenData

One of the things that’s been on my to do list for ages has been to start playing with the neo4j graph database. I finally got round to having a dabble last night, and made a start trying to figure out how to load some sample data in.

The data I looked at came in two flavours, both bulk data downloads from Companies House: a JSON dataset containing beneficial ownership/significant control data, and a tabular CSV dataset containing basic company information.

To simplify running neo4j, I created a simple docker-compose.yml file that would fire up a couple of linked containers – one running neo4j, the other running a Jupyter notebook that I could run queries from. (Actually, I think neo4j has its own web UI, but I’m more comfortable writing Python scripts in the Jupyter environment.)

#visit 7474 and change the default password - eg to: neo4jch
neo4jch:
  image: neo4j
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

jupyterscipyneoch:
  image: jupyter/scipy-notebook
  ports:
    - "8890:8888"
  links:
    - neo4jch:neo4j
  volumes:
    - ./notebooks:/home/jovyan/work

To launch things, I tend to run Kitematic, launch a docker command line, cd to the directory containing the above YAML file, then run docker-compose up -d. Kitematic then provides links to the neo4j and Jupyter web page UIs. One thing to note is that neo4j seems to want its default password changing – go to the container’s page on port 7474 and reset the password – I changed mine to neo4jch. Once launched, the containers can be suspended with the command docker-compose stop and resumed with docker-compose start.
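
Before running any queries from the notebook, it can be handy to check that the linked neo4j container is actually reachable. The following is a minimal sketch rather than anything taken from the notebook: service_reachable is a hypothetical helper name, and it assumes the notebook container sees the database under the hostname neo4j (the alias given in the links section of the YAML file) on port 7474.

```python
import socket


def service_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the hostname and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers DNS failures, refused connections and timeouts
        return False
```

From a notebook cell, service_reachable('neo4j', 7474) should return True once the database container has finished starting up.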

I’ve popped an example notebook up here, along with a couple of sample data files, that shows how to load both sorts of data (the hierarchical JSON data and the flat CSV table) into neo4j, along with a couple of sample queries.

That said, I’m not sure how good the examples are – I still need to read the documentation! (For example, via @markhneedham, “MERGE is MATCH/CREATE so you can use the same query on new/existing companies” which should let me figure out how to properly create company information nodes and then link to them from beneficial owners.)

Here are some examples of my starting attempts at the data ingest. Firstly, for JSON data that looks like this:

{
  "company_number": "09145694",
  "data": {
    "address": {
      "address_line_1": "****",
      "locality": "****",
      "postal_code": "****",
      "premises": "****",
      "region": "****"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": *,
      "year": *
    },
    "etag": "****",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"
    },
    "name": "***",
    "name_elements": {
      "forename": "***",
      "middle_name": "***",
      "surname": "***",
      "title": "***"
    },
    "nationality": "***",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

The following bit of Python, wrapped around a Cypher query, seems to load the data in:

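import codecs
import json

# `graph` is assumed to be an existing connection to the neo4j server
# (for example a py2neo Graph object, which provides a run() method).
# The file holds one JSON record per line; 'utf-8-sig' drops any leading BOM.
with codecs.open('snapshot_beneficialsmall.txt', 'r', 'utf-8-sig') as f:
    for line in f:
        jdata = json.loads(line)
        # {jdata} is the older Cypher parameter syntax; newer neo4j releases use $jdata
        query = """
WITH {jdata} AS jd
MERGE (beneficialowner:BeneficialOwner {name: jd.data.name})
  ON CREATE SET beneficialowner.nationality = jd.data.nationality,
                beneficialowner.country_of_residence = jd.data.country_of_residence
MERGE (company:Company {companynumber: jd.company_number})
MERGE (beneficialowner)-[:BENEFICIALOWNEROF]->(company)
FOREACH (noc IN jd.data.natures_of_control | MERGE (beneficialowner)-[:BENEFICIALOWNEROF {kind:noc}]->(company))
"""
        graph.run(query, jdata = jdata)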
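
The query above hands the whole JSON record to Cypher. One alternative, sketched here under my own assumptions, is to pull out just the fields the query uses in Python first. That also gives somewhere to skip any record that has no name, since MERGE cannot match on a null property value. The owner_params helper is hypothetical, the sample values are made up, and I have not checked whether the snapshot actually contains nameless records.

```python
import json


def owner_params(line):
    """Pick out the fields the ingest query uses from one JSON line.

    Returns None when the record has no name to MERGE on.
    """
    record = json.loads(line)
    data = record.get("data", {})
    name = data.get("name")
    if not name:
        return None
    return {
        "company_number": record.get("company_number"),
        "name": name,
        "nationality": data.get("nationality"),
        "country_of_residence": data.get("country_of_residence"),
        "natures_of_control": data.get("natures_of_control", []),
    }


# Made-up record in the same shape as the example JSON
sample = ('{"company_number": "09145694", "data": {"name": "A Person", '
          '"country_of_residence": "England", "nationality": "British", '
          '"natures_of_control": ["ownership-of-shares-50-to-75-percent"]}}')
params = owner_params(sample)
print(params["company_number"], params["natures_of_control"])
```

The resulting flat dictionary could then be passed to the query as a single parameter, with the Cypher referring to jd.name rather than jd.data.name.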

For the CSV data, I tried the following recipe:

import csv

# Ideally, we create a company:Company node for each company here
# and then link to it from the beneficial ownership data?
# `graph` is assumed to be an existing connection to the neo4j server.
with open('snapshotcompanydata.csv', 'r') as csvfile:
    # need to clean the column names by stripping whitespace
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        query = """
        WITH {row} AS row
        MERGE (company:Company {companynumber: row.CompanyNumber})
          ON CREATE SET company.name = row.CompanyName

        MERGE (address:Address {postcode: row['RegAddress.PostCode']})
          ON CREATE SET address.line1 = row['RegAddress.AddressLine1'],
                        address.line2 = row['RegAddress.AddressLine2'],
                        address.posttown = row['RegAddress.PostTown'],
                        address.county = row['RegAddress.County'],
                        address.country = row['RegAddress.Country']
        MERGE (company)-[:LOCATION]->(address)

        MERGE (companyactivity:SICCode {siccode: row['SICCode.SicText_1']})
        MERGE (company)-[:ACTIVITY]->(companyactivity)
        """
        graph.run(query, row=row)

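Note the way that “dotted” column names are handled: they are looked up using bracket notation with a quoted key, as in row['RegAddress.PostCode'], because row.RegAddress.PostCode would be read as a nested property lookup.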
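
On the column name cleaning: skipinitialspace only drops whitespace that immediately follows a delimiter, so any trailing spaces in the header names would survive. A minimal sketch of stripping the header names explicitly is given below. The clean_rows helper is a hypothetical name and the two-line sample is made up for illustration.

```python
import csv
import io


def clean_rows(csvfile):
    """Yield rows as dicts keyed by whitespace-stripped column names."""
    reader = csv.reader(csvfile, skipinitialspace=True)
    # Strip both leading and trailing whitespace from each header name
    header = [name.strip() for name in next(reader)]
    for values in reader:
        yield dict(zip(header, values))


# Made-up sample with padded column names
sample = io.StringIO(
    "CompanyName, CompanyNumber,RegAddress.PostCode \n"
    "EXAMPLE LTD,01234567,AB1 2CD\n"
)
rows = list(clean_rows(sample))
print(rows[0]["RegAddress.PostCode"])
```

Each cleaned row could then be passed to graph.run() in place of the DictReader row.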

What these early experiments suggest is that I should probably spend a bit of time trying to model the data to work out what sort of graph structure makes sense. My gut reaction was to define node types identifying beneficial owners, companies and SIC codes. Differently attributed BENEFICIALOWNEROF edges identify what sort of control a beneficial owner has.

[Screenshot: the Companies House beneficial ownership data ingester notebook in the psychemedia/companieshouse_beneficialownership repository]

However, for generality, I think I should define a more general person node, who could also have DIRECTORROLE edges linking them to companies with attributes corresponding to things like “director”, “company secretary”, “nominee director” etc? (I don’t think director information is available as a download from Companies House, but it could be accreted/cached into my own database each time I look up director information via the Companies House API.)
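
As a rough sketch of what that more general model might look like, the function below builds a Cypher statement and its parameters for linking a person to a company with a role-attributed DIRECTORROLE edge. Everything here is an assumption for illustration: the Person label, the kind property on the edge, the director_role_statement name and the sample values are all hypothetical, and matching people on name alone would conflate different people who share a name. The statement uses the $param form expected by current neo4j releases; the queries earlier in this post use the older {param} form, so swap as appropriate for your server version.

```python
def director_role_statement(name, company_number, role):
    """Build a Cypher statement and parameters linking a person to a company."""
    query = (
        "MERGE (person:Person {name: $name})\n"
        "MERGE (company:Company {companynumber: $companynumber})\n"
        "MERGE (person)-[:DIRECTORROLE {kind: $role}]->(company)"
    )
    params = {"name": name, "companynumber": company_number, "role": role}
    return query, params


# Made-up person, reusing the company number from the JSON example
query, params = director_role_statement("A Person", "09145694", "company secretary")
print(query)
print(params)
```

With a py2neo-style graph object, this could be run as graph.run(query, **params).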

A couple of other things that need addressing: constraints (so for example, we should only have one node per company number – the correlate of company numbers being a unique key in a relational data table; via @markhneedham, s/thing like CREATE CONSTRAINT ON (c:Company) ASSERT c.companynumber IS UNIQUE maybe…); and indexes – it would probably make sense to create an index on something like company numbers, for example.
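
A minimal sketch of applying that sort of schema set-up from Python follows. The constraint statement is the one suggested above. The postcode index is my own illustrative choice, written in the same older CREATE … ON style (newer neo4j releases spell both statements differently). As far as I understand it, a uniqueness constraint also indexes the constrained property, so company numbers should not need a separate index. RecordingGraph is a hypothetical stand-in that just records statements so the sketch runs without a database.

```python
SCHEMA_STATEMENTS = [
    # Suggested constraint: at most one Company node per company number
    "CREATE CONSTRAINT ON (c:Company) ASSERT c.companynumber IS UNIQUE",
    # Illustrative index (an assumption, not from the original notebook)
    "CREATE INDEX ON :Address(postcode)",
]


def apply_schema(graph, statements=SCHEMA_STATEMENTS):
    """Run each schema statement against anything with a run() method."""
    for statement in statements:
        graph.run(statement)
    return len(statements)


class RecordingGraph:
    """Stand-in for a database connection that records what it is sent."""

    def __init__(self):
        self.sent = []

    def run(self, statement, **params):
        self.sent.append(statement)


fake = RecordingGraph()
applied = apply_schema(fake)
print(applied, "statements sent")
```

Passing the real graph connection to apply_schema in place of the stand-in would send the statements to the server; it probably makes sense to do that before loading any data.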

Next on the to do list, some example queries on the data as I currently have it modelled to see what sorts of question we can ask and what sorts of network we can extract (I may need to add in more than the sample of data – which means I may also need to look at optimising the way the data is imported?). This might also inform how I should be modelling the data!;-)

Related: Trawling the Companies House API to Generate Co-Director Networks.

See also: Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose and Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers.

PS also via @markhneedham, one to explore when eg annotating a pre-existing node with additional attributes from a new dataset, something along the lines of MERGE (c:Company {…}) SET c.newProp1 = "boo", c.newProp2 = "blah" etc…
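
One way that annotation pattern might be wrapped up in Python is sketched below, using SET c += map to add or overwrite several properties in one go. The annotate_company_statement name is hypothetical, matching on companynumber is my assumption about what goes inside the elided braces, and the statement uses the $param form of newer neo4j releases rather than the {param} form used earlier in this post.

```python
def annotate_company_statement(company_number, new_props):
    """Build a statement that merges a company and adds extra properties."""
    query = (
        "MERGE (c:Company {companynumber: $companynumber})\n"
        # += adds or overwrites the given properties and leaves the rest alone
        "SET c += $props"
    )
    return query, {"companynumber": company_number, "props": dict(new_props)}


query, params = annotate_company_statement(
    "09145694", {"newProp1": "boo", "newProp2": "blah"}
)
print(query)
print(params)
```

As with the other sketches, graph.run(query, **params) would be one way to execute it with a py2neo-style connection.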