Commit e522bcd

Add HA docs, common issues and split brain explanation (#1200)

* Add HA docs, common issues and split brain explanation
* Document finalizers and non-changing external IP
* Don't use latest tag on deployment
* Rewrite for clarity and add HA errors page

Co-authored-by: katarinasupe <[email protected]>
Co-authored-by: Katarina Supe <[email protected]>

1 parent 514d57c, commit e522bcd

4 files changed (+65, -2 lines)

pages/clustering/high-availability.mdx (+4, -2)

@@ -55,7 +55,9 @@ since Raft, as a consensus algorithm, works by forming a majority in the decisio
 ## Bolt+routing
 
 Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, users
-can use bolt+routing so that write queries can always be sent to the correct data instance. This protocol works so that the client
+can use bolt+routing so that write queries can always be sent to the correct data instance. This will prevent a split-brain issue since clients, when writing,
+won't be routed to the old MAIN but to the new MAIN instance to which failover was performed.
+This protocol works in a way that the client
 first sends a ROUTE bolt message to any coordinator instance. The coordinator replies to the message by returning the routing table with three entries specifying
 from which instance data can be read, to which instance data can be written, and which instances can behave as routers. In the Memgraph HA cluster, the MAIN
 data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. Bolt+routing is the client-side routing protocol
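
The roles that the routing table reflects can be inspected by querying a coordinator directly. A minimal sketch, assuming `mgconsole` is available and a coordinator is reachable at the hypothetical address `coordinator-1:7690` (coordinator Bolt ports differ between deployments); Bolt drivers that support routing send the ROUTE message for you when you connect through a coordinator:

```bash
# Hypothetical coordinator address and port; replace with one of your coordinators.
# SHOW INSTANCES lists the cluster members and their current roles, i.e. the
# information the routing table is built from (MAIN = writable, REPLICA = readable,
# COORDINATOR = router).
echo "SHOW INSTANCES;" | mgconsole --host coordinator-1 --port 7690
```
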
@@ -872,4 +874,4 @@ that and automatically promote the first alive REPLICA to become the new MAIN. T
 
 </Steps>
 
-<CommunityLinks/>
+<CommunityLinks/>

pages/getting-started/install-memgraph/kubernetes.mdx (+31)

@@ -159,6 +159,12 @@ helm install <release-name> memgraph/memgraph
 ```
 Replace `<release-name>` with the name of the release you chose.
 
+<Callout type="info">
+When installing a chart, it's best practice to specify the exact version you
+want to use. Using the latest tag can lead to issues, as a pod restart may pull
+a newer image, potentially causing unexpected changes or incompatibilities.
+</Callout>
+
 #### Access Memgraph
 
 Once Memgraph is installed, you can access it using the provided services and
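
To make the callout above concrete, here is a hedged sketch of pinning versions at install time. The `--version` flag pins the Helm chart version; the `image.tag` key is an assumption about the chart's values (check the chart's default `values.yaml` for the actual key), and the version numbers are placeholders:

```bash
# List the chart versions published in the Memgraph Helm repository.
helm search repo memgraph/memgraph --versions

# Pin both the chart version and the Memgraph image tag instead of relying
# on a floating "latest" tag. The versions below are placeholders.
helm install <release-name> memgraph/memgraph \
  --version 0.1.0 \
  --set image.tag=2.18.1
```
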
@@ -315,6 +321,13 @@ helm install <release-name> memgraph/memgraph-high-availability --set env.MEMGRA
 Replace `<release-name>` with a name of your choice for the release and set the
 Enterprise license.
 
+
+<Callout type="info">
+When installing a chart, it's best practice to specify the exact version you
+want to use. Using the latest tag can lead to issues, as a pod restart may pull
+a newer image, potentially causing unexpected changes or incompatibilities.
+</Callout>
+
 ### Changing the default chart values
 
 To change the default chart values, run the command with the specified set of
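
Related to the hunk above, a hedged sketch of supplying the Enterprise license through a values file instead of repeated `--set` flags. The `env.*` keys mirror the flag shown in the hunk header and are assumptions; verify them against the chart's default values:

```bash
# Hypothetical values file; the env.* keys are assumed from the --set flag above.
cat > ha-values.yaml <<'EOF'
env:
  MEMGRAPH_ENTERPRISE_LICENSE: "<your-license-key>"
  MEMGRAPH_ORGANIZATION_NAME: "<your-organization>"
EOF

helm install <release-name> memgraph/memgraph-high-availability -f ha-values.yaml
```
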
@@ -336,6 +349,11 @@ helm upgrade <release-name> memgraph/memgraph-high-availability --set <flag1>=<v
 
 Again, it is possible to use both `--set` and values.yaml to set configuration options.
 
+If you're using `IngressNginx` and performing an upgrade, the attached public IP
+should remain the same. It will only change if the release includes specific
+updates that modify it, and such changes will be documented.
+
+
 ### Uninstall Helm chart
 
 Uninstallation is done with:
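
For the ingress note added in the hunk above, a quick way to verify the public IP before and after a `helm upgrade`. A sketch assuming the ingress-nginx controller runs with its default Service name and namespace; adjust both to your installation:

```bash
# Record the external IP attached to the ingress-nginx controller Service,
# run the upgrade, then check again; the two values should match.
kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
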
@@ -377,6 +395,19 @@ done
 echo "All PVs have been patched."
 ```
 
+<Callout type="info">
+Kubernetes uses a concept called "Storage Object in Use Protection", which
+delays the deletion of PersistentVolumeClaims (PVCs) until the pods using them
+are deleted. Similarly, PersistentVolumes (PVs) won't be deleted until their
+associated PVCs are removed.
+
+If PVC deletion is stuck, you can try removing its finalizers by running:
+
+```bash
+kubectl patch pvc PVC_NAME -p '{"metadata":{"finalizers": []}}' --type=merge
+```
+</Callout>
+
 ### Network configuration
 
 All instances which are part of Memgraph HA cluster use the internal Cluster IP network for communicating between themselves. By default, management port is on all instances opened on
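
Before reaching for the finalizer patch in the callout above, it usually helps to see what is still holding the claim. A sketch using standard kubectl commands; `PVC_NAME` and `POD_NAME` are placeholders:

```bash
# Inspect the claim: the describe output shows its Finalizers and which pod
# is still using it.
kubectl describe pvc PVC_NAME

# Deleting (or scaling down) the pod that uses the claim normally lets the
# PVC deletion finish on its own, without touching finalizers.
kubectl delete pod POD_NAME
```
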

pages/help-center/errors/_meta.ts (+1)

@@ -2,6 +2,7 @@ export default {
   "auth": "Auth",
   "connection": "Connection",
   "durability": "Durability",
+  "high-availability": "High availability",
   "memory": "Memory",
   "modules": "Modules",
   "ports": "Ports",
pages/help-center/errors/high-availability.mdx (new file, +29)

@@ -0,0 +1,29 @@
+---
+title: High availability errors
+description: Learn about high availability errors and how to troubleshoot them.
+---
+
+import { Callout } from 'nextra/components'
+import {CommunityLinks} from '/components/social-card/CommunityLinks'
+
+# High availability errors
+
+## Errors
+
+1. [At least one SYNC replica has not confirmed...](#error-1)
+
+### Troubleshooting SYNC replica not being confirmed [#error-1]
+
+If you're connecting to the cluster and encounter the error message **"At least one
+SYNC replica has not confirmed..."** when writing to the MAIN instance, several
+issues could be causing this. Below are the possible reasons and how to resolve
+them:
+
+1. The network isn't correctly configured between MAIN and REPLICAs -> Check that
+hostnames/IPs can be reached (see the connectivity sketch after this diff).
+2. Storage on a replica isn't clean -> If you used your replica instances before
+connecting them to the cluster, MAIN won't be able to successfully register the
+replica instance. Delete the data directory of the data instances and try to reconnect
+the cluster again.
+
+<CommunityLinks/>
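
A minimal connectivity check for the first item in the list above, sketched for a Kubernetes deployment. The pod names, the service hostname, and the replication port (10000 appears in many Memgraph HA examples) are assumptions; substitute the values from your own cluster:

```bash
# Hypothetical pod and hostname; adjust to your deployment.
# 1) Can MAIN resolve the replica's hostname?
kubectl exec memgraph-data-0 -- getent hosts memgraph-data-1.memgraph-svc

# 2) Does the replica's replication port accept TCP connections from MAIN?
#    Uses bash's /dev/tcp, so no extra tools are required in the image.
kubectl exec memgraph-data-0 -- bash -c \
  'cat < /dev/null > /dev/tcp/memgraph-data-1.memgraph-svc/10000 && echo reachable'
```

For the second item, the directory to clear is Memgraph's data directory (by default `/var/lib/memgraph`, unless your deployment overrides `--data-directory`).
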
