Installing Applications via postBuild in MyBinder and repo2docker

A note on downloading and installing things into a Binderised repo, or a container built using repo2docker.

If you save files into $HOME as part of the container build process, they will be clobbered if you later use the image outside of MyBinder and mount a storage volume or local directory onto $HOME.

The MyBinder / repo2docker build is pretty limiting in terms of the permissions the default jovyan user has over the file system. $HOME is one place you can write to, but if you need somewhere outside that path, then $CONDA_DIR (which defaults to /srv/conda) is handy…

For example, I just tweaked my neo4j binder repo to install a downloaded neo4j server into that path.
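
As a rough sketch of that pattern: postBuild is just an executable script, so (assuming a Python shebang is acceptable in your build) something like the following could unpack an already downloaded archive under $CONDA_DIR. The function names and the idea of a local .tar.gz archive are made up for illustration; this is not the script from the binder repo.

```python
#!/usr/bin/env python3
# Hypothetical postBuild sketch: unpack an already downloaded archive
# under $CONDA_DIR rather than $HOME, so that a volume mounted onto
# $HOME cannot hide the installed files.
import os
import tarfile
from pathlib import Path


def install_root(env=None):
    """Return $CONDA_DIR, falling back to the /srv/conda default."""
    env = os.environ if env is None else env
    return Path(env.get("CONDA_DIR", "/srv/conda"))


def unpack(archive, app_name, env=None):
    """Extract a local .tar.gz into <CONDA_DIR>/<app_name>; return that path."""
    target = install_root(env) / app_name
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return target
```

Calling unpack() with the path to the downloaded archive and a directory name of your choosing returns the install location, which can then be added to the PATH or referenced from a start script.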

Tinkering With Neo4j and Cypher

I am so bored of tech at the moment — I just wish I could pluck up the courage to go into the garden and start working on it again (it was, after all, one of the reasons for buying the house we’ve been in for several years now, and months go by without me setting foot into it; for the third year in a row the apples and pears have gone to rot, except for the ones the neighbours go scrumping for…) Instead, I sit all day, every day, in front of a screen, hacking at a keyboard… and I f*****g hate it…

Anyway… here’s some of the stuff that I’ve been playing with yesterday and today, in part prompted by a tweet doing the rounds again on:

#Software #Analytics with #Jupyter notebooks using a prefilled #Neo4j database running on #MyBinder by @softvisresearch
Created with building blocks from @feststelltaste and @psychemedia
#knowledgegraph #softwaredevelopment
https://github.com/softvis-research/BeLL

Impact.

Yeah:-)

Anyway… It prompted me to revisit my binder-neo4j repo, which demos how to launch a neo4j database in a MyBinder container, to provide some more baby-step ways in to actually getting started running queries.

So yesterday I added in a third party cypher kernel to the build, HelgeCPH/cypher_kernel, that lets you write cypher queries in code cells; and today I hacked together some simple magic (innovationOUtside/cypher_magic) that lets you write cypher queries in block magic cells in a “normal” (python kernel) notebook. This magic really should be extended a bit more, eg to allow connections to arbitrary neo4j databases, and perhaps crib from the cypher_kernel to include graph conversions to a networkx graph object format as well as graphical visualisations.

The cypher-kernel uses visjs, as does an earlier cypher magic that appears to have rotted (ipython-cypher). But if we can get the graph objects into a nx format, then we could also use netwulf to make pretty diagrams…

The tweet-linked repo also looks interesting (although I don’t speak German at all, so, erm…); there may be things I can also pull out of there to add to my binder-neo4j repo, although I may need to rethink that: the binder-neo4j repo had started out as a minimal template repo for just getting started with neo4j in MyBinder/repo2docker. But it’s started creeping… Maybe I should pare it back again, install the magic from its own repo, and put the demos in a more disposable place.

Getting Started With Neo4j and Companies House OpenData

One of the things that’s been on my to do list for ages has been to start playing with the neo4j graph database. I finally got round to having a dabble last night, and made a start trying to figure out how to load some sample data in.

The data I looked at came in two flavours, both bulk data downloads from Companies House: a JSON dataset containing beneficial ownership/significant control data, and a tabular CSV dataset containing basic company information.

To simplify running neo4j, I created a simple docker-compose.yml file that would fire up a couple of linked containers – one running neo4j, the other running a Jupyter notebook that I could run queries from. (Actually, I think neo4j has its own web UI, but I’m more comfortable in writing Python scripts in the Jupyter environment.)

#visit 7474 and change the default password - eg to: neo4jch
neo4jch:
  image: neo4j
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

jupyterscipyneoch:
  image: jupyter/scipy-notebook
  ports:
    - "8890:8888"
  links:
    - neo4jch:neo4j
  volumes:
    - ./notebooks:/home/jovyan/work

To launch things, I tend to run Kitematic, launch a docker command line, cd to the directory containing the above YAML file, then run docker-compose up -d. Kitematic then provides links to the neo4j and Jupyter web page UIs. One thing to note is that neo4j seems to want its default password changing – go to the container’s page on port 7474 and reset the password – I changed mine to neo4jch. Once launched, the containers can be suspended with the command docker-compose stop and resumed with docker-compose start.
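
The links entry in the compose file (neo4jch:neo4j) means the database should be reachable from the notebook container under the hostname neo4j. A minimal sketch of connecting from a notebook, assuming the py2neo package is installed and that its Graph class accepts an HTTP URL plus user and password settings (this has varied between py2neo versions, so treat it as an assumption); the helper names are hypothetical:

```python
def neo4j_http_url(host="neo4j", port=7474):
    """Build the HTTP endpoint for the linked neo4j container."""
    return "http://{}:{}/db/data/".format(host, port)


def connect(password="neo4jch", user="neo4j"):
    """Return a py2neo Graph for the linked container (needs py2neo installed)."""
    # Imported lazily so the URL helper can be used without py2neo
    from py2neo import Graph
    return Graph(neo4j_http_url(), user=user, password=password)
```

With the password changed as described above, graph = connect() should give an object whose run() method accepts Cypher queries.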

I’ve popped an example notebook up here, along with a couple of sample data files, that shows how to load both sorts of data (the hierarchical JSON data, and the flat CSV table) into neo4j, along with a couple of sample queries.

That said, I’m not sure how good the examples are – I still need to read the documentation! (For example, via @markhneedham, “MERGE is MATCH/CREATE so you can use the same query on new/existing companies” which should let me figure out how to properly create company information nodes and then link to them from beneficial owners.)

Here are some examples of my starting attempts at the data ingest. Firstly, for JSON data that looks like this:

{
  "company_number": "09145694",
  "data": {
    "address": {
      "address_line_1": "****",
      "locality": "****",
      "postal_code": "****",
      "premises": "****",
      "region": "****"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": *,
      "year": *
    },
    "etag": "****",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"
    },
    "name": "***",
    "name_elements": {
      "forename": "***",
      "middle_name": "***",
      "surname": "***",
      "title": "***"
    },
    "nationality": "***",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

The following bit of Python, wrapping a Cypher query, seems to load the data in:

import codecs
import json

#graph is assumed to be an already connected py2neo Graph object
#utf-8-sig strips the byte order mark from the start of the file
with codecs.open('snapshot_beneficialsmall.txt', 'r', 'utf-8-sig') as f:
    #the file contains one JSON record per line
    for line in f:
        jdata = json.loads(line)
        #NB {jdata} is the older Cypher parameter syntax; Neo4j 4+ expects $jdata
        query = """
WITH {jdata} AS jd
MERGE (beneficialowner:BeneficialOwner {name: jd.data.name}) ON CREATE
  SET beneficialowner.nationality = jd.data.nationality, beneficialowner.country_of_residence = jd.data.country_of_residence
MERGE (company:Company {companynumber: jd.company_number})
MERGE (beneficialowner)-[:BENEFICIALOWNEROF]->(company)
FOREACH (noc IN jd.data.natures_of_control | MERGE (beneficialowner)-[:BENEFICIALOWNEROF {kind:noc}]->(company))
"""
        graph.run(query, jdata = jdata)
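
An alternative sketch of the same load, passing flat parameters rather than the whole nested record and using the $name parameter style that Neo4j 4 and later expect. The flatten helper and the QUERY name are made up for illustration, and this variant only creates the attributed BENEFICIALOWNEROF edges, so it is not an exact equivalent of the recipe above.

```python
import json

# Cypher using $parameters; one attributed edge per nature of control
QUERY = """
MERGE (owner:BeneficialOwner {name: $name})
  ON CREATE SET owner.nationality = $nationality,
                owner.country_of_residence = $country_of_residence
MERGE (company:Company {companynumber: $company_number})
FOREACH (noc IN $natures_of_control |
  MERGE (owner)-[:BENEFICIALOWNEROF {kind: noc}]->(company))
"""


def flatten(line):
    """Turn one line of the JSON snapshot into a flat parameter dict."""
    record = json.loads(line)
    data = record.get("data", {})
    return {
        "company_number": record["company_number"],
        "name": data.get("name"),
        "nationality": data.get("nationality"),
        "country_of_residence": data.get("country_of_residence"),
        "natures_of_control": data.get("natures_of_control", []),
    }
```

Inside the file loop, params = flatten(line) followed by graph.run(QUERY, **params) would then run the query, assuming graph is a py2neo Graph. Records where name comes back as None are best skipped, since MERGE cannot match on a null property value.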

For the CSV data, I tried the following recipe:

import csv
#Ideally, we create a company:Company node with a company either here
#and then link to it from the beneficial ownership data?
with open('snapshotcompanydata.csv','r') as csvfile:
    #need to clean the column names by stripping whitespace
    reader = csv.DictReader(csvfile,skipinitialspace=True)
    for row in reader:
        query="""
        WITH {row} AS row
        MERGE (company:Company {companynumber: row.CompanyNumber}) ON CREATE
        SET company.name = row.CompanyName

        MERGE (address:Address {postcode : row["RegAddress.PostCode"]}) ON CREATE
        SET address.line1=row['RegAddress.AddressLine1'], address.line2=row['RegAddress.AddressLine2'],
        address.posttown=row['RegAddress.PostTown'],
        address.county=row['RegAddress.County'],address.country=row['RegAddress.Country']
        MERGE (company)-[:LOCATION]->(address)

        MERGE (companyactivity:SICCode {siccode:row['SICCode.SicText_1']})
        MERGE (company)-[:ACTIVITY]->(companyactivity)
        """
        graph.run(query,row=row)

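Note the way that “dotted” column names are handled: because a name like RegAddress.PostCode itself contains a dot, it is looked up using square bracket notation (row['RegAddress.PostCode']) rather than the dot notation used for row.CompanyName.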
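
Another way round the dotted names would be to clean the column names in Python before the row ever reaches Cypher. The following is a sketch only: clean_keys is a hypothetical helper, and the sample row is made up rather than taken from the Companies House file.

```python
import csv
import io


def clean_keys(row):
    """Strip whitespace from column names and swap dots for underscores."""
    return {key.strip().replace(".", "_"): value for key, value in row.items()}


# Tiny in-memory stand-in for the Companies House CSV (values are made up)
sample = io.StringIO(
    "CompanyName, CompanyNumber,RegAddress.PostCode\n"
    "EXAMPLE LTD,01234567,AB1 2CD\n"
)
rows = [clean_keys(row) for row in csv.DictReader(sample, skipinitialspace=True)]
print(rows[0]["RegAddress_PostCode"])
```

With keys cleaned like this, the Cypher query could refer to row.RegAddress_PostCode using plain dot notation.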

What these early experiments suggest is that I should probably spend a bit of time trying to model the data to work out what sort of graph structure makes sense. My gut reaction was to define node types identifying beneficial owners, companies and SIC codes. Differently attributed BENEFICIALOWNEROF edges identify what sort of control a beneficial owner has.

[Image: screenshot from the Companies House beneficial ownership data ingester notebook, psychemedia/companieshouse_beneficialownership]

However, for generality, I think I should define a more general person node, who could also have DIRECTORROLE edges linking them to companies with attributes corresponding to things like “director”, “company secretary”, “nominee director” etc? (I don’t think director information is available as a download from Companies House, but it could be accreted/cached into my own database each time I look up director information via the Companies House API.)

A couple of other things that need addressing: constraints (so for example, we should only have one node per company number, the correlate of company numbers being a unique key in a relational data table; via @markhneedham, something like CREATE CONSTRAINT ON (c:Company) ASSERT c.companynumber IS UNIQUE maybe…); and indexes – it would probably make sense to create an index on something like company numbers, for example.
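
As a sketch of what those statements might look like, the helpers below just build Cypher strings; the helper names are hypothetical. The ON … ASSERT form matches the suggestion quoted above, while recent Neo4j releases use a FOR … REQUIRE form instead, so which one applies depends on the server version. A uniqueness constraint also brings an index on that property with it, so a separate index would only be needed for other properties.

```python
def uniqueness_constraint(label, prop, modern=False):
    """Cypher for a uniqueness constraint on label.prop."""
    if modern:
        # FOR ... REQUIRE form used by recent Neo4j releases
        return "CREATE CONSTRAINT FOR (n:{0}) REQUIRE n.{1} IS UNIQUE".format(label, prop)
    # ON ... ASSERT form, as in the suggestion quoted above
    return "CREATE CONSTRAINT ON (n:{0}) ASSERT n.{1} IS UNIQUE".format(label, prop)


def index(label, prop):
    """Older-style Cypher for a plain index on label.prop."""
    return "CREATE INDEX ON :{0}({1})".format(label, prop)
```

For example, graph.run(uniqueness_constraint("Company", "companynumber")) would apply the constraint before the data is loaded, assuming graph is a py2neo Graph.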

Next on the to do list, some example queries on the data as I currently have it modelled to see what sorts of question we can ask and what sorts of network we can extract (I may need to add in more than the sample of data – which means I may also need to look at optimising the way the data is imported?). This might also inform how I should be modelling the data!;-)

Related: Trawling the Companies House API to Generate Co-Director Networks.

See also: Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose and Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers.

PS also via @markhneedham, one to explore when eg annotating a pre-existing node with additional attributes from a new dataset, something along lines of MERGE (c:Company {…}) SET c.newProp1 = "boo", c.newProp2 = "blah" etc…
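
A sketch of how that might be wrapped up from Python, assuming the company number is the property used to identify the node (the MERGE pattern above is elided, so that is my assumption) and using SET c += $props to add a whole map of properties in one go. The annotate_company helper is hypothetical.

```python
def annotate_company(companynumber, **new_props):
    """Return (query, parameters) that MERGE a Company and add properties to it."""
    query = (
        "MERGE (c:Company {companynumber: $companynumber}) "
        "SET c += $props"
    )
    return query, {"companynumber": companynumber, "props": new_props}
```

For example, query, params = annotate_company("09145694", newProp1="boo", newProp2="blah") followed by graph.run(query, params) would add the two properties, leaving any existing ones in place.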

Querying Panama Papers Neo4j Database Container From a Linked Jupyter Notebook Container

A few weeks ago I posted some quick doodles showing, on the one hand, how to get the Panama Papers data into a simple SQLite database and, on the other, how to link a neo4j graph database to a Jupyter notebook server using Docker Compose.

As the original Panama Papers investigation used neo4j as its backend database, I thought putting the data into a neo4j container could give me the excuse I needed to start looking at neo4j.

Anyway, it seems as if someone has already pushed a neo4j Docker container image preseeded with the Panama Papers data, so here’s my quickstart.

To use it, you need to have Docker installed, download the docker-compose.yaml file and then run:

docker-compose up

If you do this from a command line launched from Kitematic, Kitematic should provide you with a link to the neo4j database, running on the Docker IP address and port 7474. Log in with the default credentials ( neo4j / neo4j ) and change the password to panamapapers (all lower case).

Download the quickstart notebook into the newly created notebooks directory, and you should be able to see it from the notebooks homepage on Docker IP address port 8890 (or again, just follow the link from Kitematic).


neo4j:
  image: ryguyrg/neo4j-panama-papers
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

jupyterscipy:
  image: jupyter/scipy-notebook
  ports:
    - "8890:8888"
  links:
    - neo4j:neo4j
  volumes:
    - ./notebooks:/home/jovyan/work

rstudio:
  image: rocker/rstudio
  ports:
    - "8787:8787"
  links:
    - neo4j:neo4j
  volumes:
    - ./rstudio:/home/rstudio

#jupyterIR:
#  image: jupyter/r-notebook
#  ports:
#    - "8889:8888"
#  links:
#    - neo4j:neo4j
#  volumes:
#    - ./notebooks:/home/jovyan/work


I’m still trying to find my way around both the py2neo Python wrapper and the neo4j Cypher query language, so the demo thus far is not that inspiring!
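
One schema-agnostic first query for finding your way around an unfamiliar database is to count nodes by label. A minimal sketch, assuming graph is a connected py2neo Graph whose run() returns a cursor with a data() method; the names LABEL_COUNTS and label_counts are made up for illustration.

```python
# Count nodes for each combination of labels, largest group first
LABEL_COUNTS = "MATCH (n) RETURN labels(n) AS labels, count(*) AS total ORDER BY total DESC"


def label_counts(graph):
    """Return a list of {'labels': [...], 'total': n} dicts, biggest first."""
    return graph.run(LABEL_COUNTS).data()
```

The query touches every node, so it may take a little while on a large database, but the result gives a quick feel for what node types are available to query.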

And I’m not sure when I’ll get a chance to look at it again…:-(

Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers

In Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose I posted a recipe demonstrating how to link a Jupyter notebook container with a neo4j container to provide a quick way to get up and running with neo4j from a Python environment.

It struck me that it should be just as easy to launch an R environment, so here’s a docker-compose.yml file that will do just that:

neo4j:
  image: kbastani/docker-neo4j:latest
  ports:
    - "7474:7474"
    - "1337:1337"
  volumes:
    - /opt/data

rstudio:
  image: rocker/rstudio
  ports:
    - "8787:8787"
  links:
    - neo4j:neo4j
  volumes:
    - ./rstudio:/home/rstudio

jupyterIR:
  image: jupyter/r-notebook
  ports:
    - "8889:8888"
  links:
    - neo4j:neo4j
  volumes:
    - ./notebooks:/home/jovyan/work

If you’re using Kitematic (available via the Docker Toolbox), launch the docker command line interface (Docker CLI), cd into the directory containing the docker-compose.yml file, and run the docker-compose up -d command. This will download the necessary images and fire up the linked containers: one running neo4j, one running RStudio, and one running a Jupyter notebook with an R kernel.

You should then be able to find the URLs/links for RStudio and the notebooks in Kitematic:

[Screenshot: Kitematic showing the links to the running containers]

Once again, Nicole White has some quickstart examples for using R with neo4j, this time using the RNeo4j R package. One thing I noticed with the Jupyter R kernel was that I needed to specify the CRAN mirror when installing the package: install.packages('RNeo4j', repos="http://cran.rstudio.com/")

To connect to the neo4j database, use the domain mapping specified in the Docker Compose file: graph = startGraph("http://neo4j:7474/db/data/")

Here’s an example in RStudio running from the container:

[Screenshot: RStudio connected to neo4j]

And the Jupyter notebook:

[Screenshot: Jupyter R notebook connected to neo4j]

Notebooks and RStudio project files are shared into subdirectories of the current directory (the one from which the docker-compose command was run) on the host.