Backing up Neo4j using MinIO / S3 buckets

There are several ways to back up Neo4j running in Kubernetes. However, the community edition poses a challenge: it only supports backing up the database in offline mode. We achieve sufficient backup functionality using the GraphML export provided by the APOC library.
Our system environment runs more than one database, which is why we already had a MinIO setup for other backups. For Neo4j we only had to create a new bucket. As a prerequisite for following this guide, you should already have an S3-compatible object store available.

How to configure the database
To write backups directly to buckets, we need some additional plugin jars. We therefore use an Init Container to download and provide these jars. If you followed our blog posts on deploying Neo4j to Kubernetes, you can simply add another Init Container that loads the following jars. The base image can be any image that lets you use curl to download from a remote repository (for example a minimal alpine or bitnami base image, which should be preferred). We took the neo4j 4.4.8 image because it contains curl, so we did not need to create an extra image. In addition to the Init Container that copies the APOC library, we define a second Init Container that loads the additional jars.

      initContainers:
        - name: neo4j-s3-init
          image: neo4j:4.4.8
          command:
            - /bin/sh
            - -c
            - |
              # download each jar, then copy it into the mounted plugins folder
              curl -L https://repo1.maven.org/maven2/joda-time/joda-time/2.10.13/joda-time-2.10.13.jar -O; cp -v joda-time-2.10.13.jar /var/lib/neo4j/plugins
              curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.12.136/aws-java-sdk-s3-1.12.136.jar -O; cp -v aws-java-sdk-s3-1.12.136.jar /var/lib/neo4j/plugins
              curl -L https://repo1.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.15/httpcore-4.4.15.jar -O; cp -v httpcore-4.4.15.jar /var/lib/neo4j/plugins
              curl -L https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar -O; cp -v httpclient-4.5.13.jar /var/lib/neo4j/plugins
              curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.12.136/aws-java-sdk-core-1.12.136.jar -O; cp -v aws-java-sdk-core-1.12.136.jar /var/lib/neo4j/plugins

Make sure the plugins folder is still mounted, because the jars will be saved there. For more information, see our blog post on deploying Neo4j to Kubernetes.

          volumeMounts:
            - name: neo4j-plugins
              mountPath: /var/lib/neo4j/plugins
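
If an export later fails with a missing-class error, a first thing to rule out is that one of the jars never reached the plugins folder. The following is a minimal sketch of such a check, not part of our deployment; the function name check_plugin_jars is hypothetical and the jar names are simply those from the Init Container command.

```shell
#!/bin/sh
# Sketch: report which of the expected plugin jars are missing from a directory.
# check_plugin_jars is a hypothetical helper name.

check_plugin_jars() {
  plugin_dir=$1
  missing=0
  for jar in \
    joda-time-2.10.13.jar \
    aws-java-sdk-s3-1.12.136.jar \
    aws-java-sdk-core-1.12.136.jar \
    httpcore-4.4.15.jar \
    httpclient-4.5.13.jar
  do
    # -s is true only if the file exists and is not empty
    if [ ! -s "$plugin_dir/$jar" ]; then
      echo "missing or empty: $jar"
      missing=1
    fi
  done
  return "$missing"
}

# Example call inside the Neo4j container:
#   check_plugin_jars /var/lib/neo4j/plugins
```

The function returns a non-zero status as soon as one jar is absent, so it could also serve as a last step of the Init Container command.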

How to write a file to the bucket in general

You can use the Neo4j web front end to execute the following Cypher query, which writes a GraphML backup file directly to an S3 bucket.

WITH "s3://MINIO_ACCESS_KEY:MINIO_SECRET_KEY@HOSTNAME/BUCKET_NAME/backup.graphml" AS filename
CALL apoc.export.graphml.all(filename, {useTypes:TRUE, storeNodeIds:FALSE})
YIELD file
RETURN file;

Here HOSTNAME is the URL of the bucket store (e.g. https://backup-store.company.net:443), MINIO_ACCESS_KEY is the key to access the bucket, MINIO_SECRET_KEY is the secret belonging to that key, and BUCKET_NAME is the name of the bucket in the MinIO store.
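
To make the mapping between the placeholders and the final string explicit, here is a small sketch that assembles the target URL from its parts. The function name build_s3_url is hypothetical and all values are placeholders. We assume that keys containing characters such as "/" or "@" would need percent-encoding, which this sketch does not handle.

```shell
#!/bin/sh
# Sketch: assemble the S3 target URL that is passed to apoc.export.graphml.all.
# build_s3_url is a hypothetical helper; the arguments below are placeholders.

build_s3_url() {
  access_key=$1
  secret_key=$2
  host=$3
  bucket=$4
  file=$5
  printf 's3://%s:%s@%s/%s/%s\n' "$access_key" "$secret_key" "$host" "$bucket" "$file"
}

build_s3_url 'MINIO_ACCESS_KEY' 'MINIO_SECRET_KEY' 'HOSTNAME' 'BUCKET_NAME' 'backup.graphml'
# prints s3://MINIO_ACCESS_KEY:MINIO_SECRET_KEY@HOSTNAME/BUCKET_NAME/backup.graphml
```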

You may also execute this statement directly using the Neo4j HTTP-API from your shell.

curl -X POST url-of-neo4j-frontend/db/neo4j/tx/commit -H 'Content-Type: application/json' -H 'Authorization: Basic base64_NEO4J_BASIC_AUTH' -d '{"statements": [ { "statement" : "CALL apoc.export.graphml.all( \" s3://MINIO_ACCESS_KEY:MINIO_SECRET_KEY@HOSTNAME/BUCKET_NAME/backup.graphml \" , {useTypes:TRUE, storeNodeIds:FALSE}) YIELD file RETURN file;" } ]}'

Hint: If you are using Kubernetes secrets, you need to define two different credentials. One is the authentication in the format username/password, which is parsed to start up the database. The other is used in the Authorization header of the Neo4j HTTP API in the format username:password and has to be base64-encoded (we call the latter base64_NEO4J_BASIC_AUTH). If you need more information on how we defined our secret, read the first article of this series.
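
HTTP Basic authentication expects the header value to consist of the word Basic followed by the base64-encoded username:password pair. The following sketch shows the encoding step on its own. The function name basic_auth_header is hypothetical, and neo4j:secret is an example credential, not ours.

```shell
#!/bin/sh
# Sketch: turn "username:password" into an HTTP Basic Authorization header value.
# basic_auth_header is a hypothetical helper name.

basic_auth_header() {
  # printf '%s' avoids the trailing newline that a plain echo would encode as well;
  # tr removes line wraps that some base64 implementations insert for long input.
  printf 'Basic %s\n' "$(printf '%s' "$1" | base64 | tr -d '\n')"
}

basic_auth_header 'neo4j:secret'
# prints Basic bmVvNGo6c2VjcmV0
```

The result can be passed to curl as -H "Authorization: $(basic_auth_header "$NEO4J_BASIC_AUTH")".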

How to write a file to the bucket using Kubernetes tooling

The previous section described how to trigger the backup manually using a short Cypher statement, which is useful for tracing problems. Our aim, however, is to automate the process with Kubernetes tooling. The idea is to send the Cypher query to the Neo4j HTTP API from a shell script and then run this script on a periodic schedule in the cluster.

For the backup we defined a backup image in a separate backup project. The image is based on a small bitnami-shell image. We install the curl and ca-certificates packages, which the script needs, and finally copy the shell script into the image.

FROM bitnami/bitnami-shell:10-debian-10

RUN set -ex \
 && apt-get update \
 && apt-get install -y ca-certificates curl

COPY backup-neo4j-to-s3.sh /backup-neo4j-to-s3.sh

The shell script sends the Cypher statement to the Neo4j HTTP API endpoint. In our setup all variables are defined and sealed into secrets, which is why we can handle them quite smoothly.

#!/bin/sh

if [ $# -eq 0 ]
  then
    echo 'Host name or address of the Neo4j instance must be provided.'
    exit 1
fi

currentDateTime=$(date +%Y%m%d%H%M%S)
path="s3://$MINIO_ACCESS_KEY:$MINIO_SECRET_KEY@$HOSTNAME/$BUCKET_NAME/backup-"
extension=".graphml"
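# printf avoids encoding a trailing newline; tr strips line wraps added by some base64 implementations
base64_NEO4J_BASIC_AUTH=$(printf '%s' "$NEO4J_BASIC_AUTH" | base64 | tr -d '\n')

# HTTP Basic auth expects "Basic " followed by the base64-encoded username:password
curl -X POST "${1}db/neo4j/tx/commit" -H "Content-Type: application/json" -H "Authorization: Basic $base64_NEO4J_BASIC_AUTH" \
-d '{"statements": [ {  "statement" : "CALL apoc.export.graphml.all( \" '"$path$currentDateTime$extension"' \"  , {useTypes:TRUE, storeNodeIds:FALSE}) YIELD file RETURN file;"  } ]}'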

We include a timestamp in the backup file name so that we can trace the creation date and time. For the authorization header, we need to base64-encode the secret ourselves: the sealed-secret controller decrypts the secret completely, but the HTTP API still expects NEO4J_BASIC_AUTH in base64 encoding. After preparing all the additional variables, we can finally formulate the curl POST request that sends the Cypher statement. Be careful with the quotes; it took us quite a while to get all of them in the right places.
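
One way to reduce the quote juggling is to let printf fill the URL into a fixed template. This is a minimal sketch rather than the script we run. The function name build_payload is hypothetical, the URL is a placeholder, and unlike the script above it puts no spaces between the escaped quotes and the URL. It assumes the URL itself contains no double quote or backslash.

```shell
#!/bin/sh
# Sketch: build the JSON request body with printf instead of nested shell quotes.
# build_payload is a hypothetical helper name.

build_payload() {
  # $1 is the full s3:// target; each \\" in the format prints as \" in the JSON.
  printf '{"statements": [ { "statement": "CALL apoc.export.graphml.all(\\"%s\\", {useTypes:TRUE, storeNodeIds:FALSE}) YIELD file RETURN file;" } ]}\n' "$1"
}

build_payload 's3://KEY:SECRET@HOSTNAME/BUCKET_NAME/backup-20240101020000.graphml'
```

The output can then be handed to curl with -d "$(build_payload "$path$currentDateTime$extension")".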

How to automate the write-process periodically

To create a backup file periodically, we define a Kubernetes CronJob that triggers the backup by sending a request to the HTTP API of the Neo4j database. It runs once a night at 2am. Note that the schedule '0 1 * * *' is evaluated in the time zone of the cluster's controller (commonly UTC), so 01:00 there can correspond to 2am local time. At that time we have low traffic on the machine. This matters because, depending on the amount of data, the query can be long-running and resource-consuming and may temporarily lock the database.

The definition of the CronJob looks like this:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: '0 1 * * *'
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 0
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          imagePullSecrets:
          - name: docker-registry-secret
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup-image:1.1.0
              imagePullPolicy: Always
              # use Neo4j HTTP-API to send a backup query that writes a backup to MinIO storage
              command: ['/backup-neo4j-to-s3.sh', 'https://my.neo4j-front-end-url/']
              envFrom:
                - secretRef:
                    name: backup-minio
                - secretRef:
                    name: my-neo4j-secret
              resources:
                limits:
                  cpu: 500m   # 1/2 cpu
                  memory: 500Mi
                requests:
                  cpu: 50m   # 1/20 cpu
                  memory: 500Mi
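
With restartPolicy: OnFailure, Kubernetes only retries the job when the script exits with a non-zero status. Without the -f option, curl exits successfully even for HTTP error responses, and, as far as we know, the Neo4j transaction endpoint reports Cypher errors in an errors array inside the response body. The following is a minimal sketch of a check under that assumption; response_ok is a hypothetical helper name and not part of our script.

```shell
#!/bin/sh
# Sketch: decide whether a Neo4j HTTP API response body reports success.
# Assumes the body contains an "errors" array that is empty on success.

response_ok() {
  # Remove all whitespace so that "errors": [] and "errors":[] are treated alike.
  compact=$(printf '%s' "$1" | tr -d ' \t\r\n')
  case $compact in
    *'"errors":[]'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage inside the backup script:
#   response=$(curl -sS -f -X POST ...) || exit 1
#   response_ok "$response" || { echo "$response" >&2; exit 1; }
```

This is a plain string match, not a JSON parser, so it is only meant as a quick guard.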

To investigate the CronJob, we can derive a single Job from it. This Job is executed immediately in Kubernetes.

kubectl create job --from=cronjob/backup backup-once

To display all available jobs

kubectl get jobs

To display the configuration of the job that has been derived from the Cronjob

kubectl describe job backup-once

To display the logs of the pod that the job created (replace xxxxx with the pod's suffix) in order to discover errors

kubectl logs backup-once-xxxxx

To access the logs of the backup container through the job, if there are failures with the container

kubectl logs job/backup-once -c backup

After all issues are solved we can delete this job

kubectl delete job backup-once

Continue reading: This article is part 3 of a series of 5 articles on Neo4j. The upcoming articles include: Creating your own Neo4j image and Presenting Data with NeoDash.