19.04.2023

Deploy Neo4j 4.4 community edition on Kubernetes

In a current project with a supply chain background, we aim to analyse information on transportation orders. For those orders we analyse the execution plans and compare them to the actual execution processes. For the corresponding orders we also collect information on offers, pricing and invoicing to gain a better understanding of the system.

The system landscape follows a domain-driven design approach: the domains are separated according to business processes and departments. To gather all information concerning one transportation order, we have to look it up in several domains, because the order is processed in each of them over time. To reduce the time needed for this ad-hoc collection, we store the order-related data in one place as a data backbone. Analysing this data reveals the actual behaviour and communication flows, from which we derive optimization strategies and suggestions for processes and workflows in other modules of the system environment.

Why Kubernetes?

Kubernetes is a container orchestration platform that already provides many facilities for deployment, maintenance and monitoring. Used as our so-called ‘cloud’, it lets us deploy with Infrastructure as Code in a developer self-service manner, which gives developers full control over their tech stack. It also supports green IT, as unused pods are deleted and unhealthy pods are restarted when required.

Why Neo4j?

Although a Neo4j database was already installed within the project, that is not the only reason for using it: it is still the best fit for our goal. With graph data we focus not only on the data itself but also on the relationships between the data, which allows for deeper insights. For a supply chain, for example, we can analyse the time elapsed between different events, derive the total time needed to complete a transportation order successfully, and compare it to the planned time. With deeper analysis we may also find bottlenecks at specific places and in specific processes. In our case, Neo4j being a schema-less database reduces the overhead of extending the data model, as the way the data is stored is close to the way it is created. We can therefore create a process-specific structure. When we then come to analyse the data, we can build indexes to optimize querying.

Neo4j also offers many pre-defined graph data science algorithms, e.g. subgraph analysis and clustering, that let us form hypotheses about our data that we might not have made otherwise. This opens up a wide range of graph pattern analyses.

What about the data model?

Neo4j is a schema-less database. Schema-less does not mean, however, that we cannot enforce constraints such as uniqueness in our database. To track changes to the constraints we use Liquibase, which is normally used to track schema changes in relational databases. Tracking changes with Liquibase is an advantage when migrating the database, as all relevant schema information is collected there.
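
As an illustration, a uniqueness constraint could be tracked in a Liquibase changelog roughly as follows. This is a minimal sketch, not our actual changelog: it assumes the Liquibase Neo4j extension is available and that raw Cypher is passed through the generic sql change type. The label Order, the property orderId, the constraint name and the changeset id and author are all hypothetical.

```yaml
# Hypothetical changelog file, e.g. changelog.yaml
databaseChangeLog:
  - changeSet:
      id: order-id-unique        # illustrative changeset id
      author: example-author     # illustrative author
      changes:
        - sql:
            # Cypher in Neo4j 4.4 syntax; label and property are assumptions
            sql: >-
              CREATE CONSTRAINT order_id_unique IF NOT EXISTS
              FOR (o:Order) REQUIRE o.orderId IS UNIQUE
```

Because every constraint change is recorded as its own changeset, the same changelog can be replayed against a fresh database during a migration.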

When deciding which graph data model to build, keep the use case in mind. A good data model reduces the length and complexity of the Cypher queries needed to analyse the graph.

What else is there?

In addition to the Neo4j database, the project consists of a Java-based Spring Boot application that consumes messages from an AMQP broker (RabbitMQ). The application processes the messages and saves the data they contain to the Neo4j database. All related system parts are deployed as containers to a project-specific namespace in the Kubernetes cluster. As developers we declare our containers in a deployment, which includes container dependencies and configuration. The containers are grouped into pods, and we can easily manage their lifecycle by scaling the pods as needed.
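
A deployment for such a consumer application might look roughly like the sketch below. It is not our actual manifest: the name order-consumer, the image reference and the RabbitMQ host my-rabbitmq are hypothetical placeholders, and it assumes the application is configured through the standard Spring Boot properties spring.neo4j.uri and spring.rabbitmq.host, set here as environment variables. The Bolt URI points at the my-neo4j service defined further below. Credentials are left out on purpose and would be injected from a secret.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-consumer            # hypothetical name
  labels:
    app: order-consumer
spec:
  replicas: 1                     # scale the consumer by changing this value
  selector:
    matchLabels:
      app: order-consumer
  template:
    metadata:
      labels:
        app: order-consumer
    spec:
      containers:
        - name: order-consumer
          image: registry.example.com/order-consumer:1.0.0   # placeholder image
          env:
            # Bolt endpoint of the Neo4j service inside the namespace
            - name: SPRING_NEO4J_URI
              value: 'bolt://my-neo4j:7687'
            # hypothetical service name of the RabbitMQ broker
            - name: SPRING_RABBITMQ_HOST
              value: 'my-rabbitmq'
```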

Deploying Neo4j into Kubernetes

If you want to deploy a Neo4j database to your Kubernetes cluster, the easiest option is to pull the neo4j:4.4.16 image from Docker Hub. This image includes the database and a web front-end to access the data. The web front-end (HTTP on port 7474) and the Bolt connector (port 7687) are exposed through a service, which we can define as follows:

---
apiVersion: v1
kind: Service
metadata:
  name: my-neo4j
  labels:
    app: my-neo4j
spec:
  ports:
    - port: 7474
      name: neo4j-port
    - port: 7687
      name: bolt-port
  selector:
    app: my-neo4j

Next we define a StatefulSet to manage our Neo4j database pod. It keeps the pod and its network identifiers stable and uses persistent storage. To achieve storage persistence we connect the storage to a Persistent Volume Claim (PVC). The PVC persists the data even when the pod running the Neo4j database is restarted: on restart, a new Neo4j instance is created that accesses the persisted data. By setting the access mode to “ReadWriteOnce” we declare that the volume can be mounted read-write by a single node only; with a single replica, our pod is in effect the exclusive consumer of the PVC.

As previously mentioned, the database is mainly accessed through the web front-end. For this reason the StatefulSet references the front-end service by name in its serviceName field.

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-neo4j
  labels: &LABELS
    app: my-neo4j
spec:
  selector:
    matchLabels:
      app: my-neo4j
  serviceName: my-neo4j
  replicas: 1
  template:
    metadata:
      labels: *LABELS
    spec:
      imagePullSecrets:
        - name: docker-registry-secret
      initContainers:
        # use an init container to copy APOC to the plugins directory;
        # the jar version must match the one shipped in this image's labs folder
        - name: neo4j-apoc-init
          image: public/library/neo4j:4.4.8
          command: [ '/bin/sh', '-c', 'cp -v /var/lib/neo4j/labs/apoc-4.4.0.5-core.jar /var/lib/neo4j/plugins/apoc-4.4.0.5-core.jar' ]
          volumeMounts:
            - name: neo4j-plugins
              mountPath: /var/lib/neo4j/plugins
      containers:
        - name: neo4j
          image: neo4j:4.4.16
          ports:
            - containerPort: 7474
              name: neo4j-port
            - containerPort: 7687
              name: bolt-port
          envFrom:
            - secretRef:
                name: my-neo4j-secret
          env:
            - name: NEO4J_dbms_connector_bolt_advertised__address
              value: 'your_address:443'
            - name: NEO4J_dbms_security_procedures_allowlist
              value: 'apoc.*'
          volumeMounts:
            - name: my-neo4j-data
              mountPath: /data
            - name: neo4j-plugins
              mountPath: /var/lib/neo4j/plugins
          resources:
            limits:
              cpu: 500m   # 1/2 cpu
              memory: 1Gi
            requests:
              cpu: 50m   # 1/20 cpu
              memory: 500Mi
      volumes:
        - name: neo4j-plugins
          emptyDir: { }
  volumeClaimTemplates:
    - metadata:
        name: my-neo4j-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 2Gi

APOC (Awesome Procedures On Cypher) adds procedures for importing and exporting data to files and also includes some graph data science algorithms. The APOC library is already packaged within the Neo4j image, in the labs folder (/var/lib/neo4j/labs). We therefore use an init container to copy the library to the plugins folder before server startup. Init containers are specialised containers that run to completion before the app containers start. We can use them to set up the app container.

We also use Kubernetes secrets and sealed secrets for handling administrator credentials. Because the StatefulSet references the secret via envFrom, its entries are injected into the container as environment variables and can be processed by Neo4j. The related secret is defined as follows:

apiVersion: v1
kind: Secret
metadata:
  name: my-neo4j-secret
# NEO4J_AUTH has the form username/password and is needed for DB startup
data:
  NEO4J_AUTH:
  NEO4J_SERVER_PASSWORD:
  NEO4J_SERVER_USER:
type: Opaque

Our secret is called my-neo4j-secret and contains three variables to be used in the Neo4j configuration: the username, the password, and the authentication string, which combines both as username/password. Remember that all values defined in the data section need to be base64 encoded. If you save your code to a repository, you should seal this secret and check in only the sealed secret.
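
If you prefer not to base64 encode the values by hand, Kubernetes also accepts a stringData section, whose plain-text values are encoded by the API server when the secret is created. The following is a minimal sketch of that variant; the values in angle brackets are placeholders, not real credentials, and a file like this should never be checked in unsealed.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-neo4j-secret
type: Opaque
# stringData takes plain text; Kubernetes stores it base64 encoded under data
stringData:
  NEO4J_AUTH: '<username>/<password>'      # placeholder, form username/password
  NEO4J_SERVER_USER: '<username>'          # placeholder
  NEO4J_SERVER_PASSWORD: '<password>'      # placeholder
```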

Furthermore, we define some environment variables and inject them into the container for Neo4j-related configuration. The advertised address is displayed and pre-filled in the Neo4j web front-end, so the user does not need to enter it by hand. We define the advertised address specifically for the Bolt connector, since we access the server using the Bolt protocol. With the NEO4J_dbms_security_procedures_allowlist setting we unlock the usage of APOC procedures; in this setup we allow all APOC procedures to be executed. After that you can finally deploy your Neo4j database configuration to the Kubernetes cluster.
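
One possible way to apply all manifests together is a small Kustomize file. This is only a sketch under assumptions: the file names and the namespace my-project are hypothetical and need to match how you stored the manifests above.

```yaml
# kustomization.yaml (hypothetical file layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-project              # placeholder for the project namespace
resources:
  - neo4j-service.yaml             # the Service shown above
  - neo4j-statefulset.yaml         # the StatefulSet shown above
  - neo4j-sealed-secret.yaml       # the sealed variant of my-neo4j-secret
```

With such a file in place, running kubectl apply -k . in the same directory applies the listed resources in one step.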

Continue reading: This article is part 1 of a series of 5 articles on Neo4j. The next articles include: Migration to Neo4j 4.4, Backing up Neo4j Community Edition, Creating your own Neo4j image, Presenting Data with NeoDash.