Commit e522bcd

Add HA docs, common issues and split brain explanation (#1200)

* Add HA docs, common issues and split brain explanation
* Document finalizers and non-changing external IP
* Don't use latest tag on deployment
* Rewrite for clarity and add HA errors page

Co-authored-by: katarinasupe <[email protected]>
Co-authored-by: Katarina Supe <[email protected]>

1 parent 514d57c, commit e522bcd

4 files changed (+65, -2 lines)

pages/clustering/high-availability.mdx (+4, -2)

@@ -55,7 +55,9 @@ since Raft, as a consensus algorithm, works by forming a majority in the decisio
 ## Bolt+routing
 
 Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, users
-can use bolt+routing so that write queries can always be sent to the correct data instance. This protocol works so that the client
+can use bolt+routing so that write queries can always be sent to the correct data instance. This will prevent a split-brain issue since clients, when writing,
+won't be routed to the old MAIN but to the new MAIN instance to which failover was performed.
+This protocol works in a way that the client
 first sends a ROUTE bolt message to any coordinator instance. The coordinator replies to the message by returning the routing table with three entries specifying
 from which instance data can be read, to which instance data can be written, and which instances can behave as routers. In the Memgraph HA cluster, the MAIN
 data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. Bolt+routing is the client-side routing protocol
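
The roles that the routing table reflects can be inspected by querying a coordinator directly. A minimal sketch, assuming `mgconsole` is available and a coordinator is reachable at the hypothetical address `coordinator-1:7690` (coordinator Bolt ports differ between deployments); Bolt drivers that support routing send the ROUTE message for you when you connect through a coordinator:

```bash
# Hypothetical coordinator address and port; replace with one of your coordinators.
# SHOW INSTANCES lists the cluster members and their current roles, i.e. the
# information the routing table is built from (MAIN = writable, REPLICA = readable,
# COORDINATOR = router).
echo "SHOW INSTANCES;" | mgconsole --host coordinator-1 --port 7690
```
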
@@ -872,4 +874,4 @@ that and automatically promote the first alive REPLICA to become the new MAIN. T
 
 </Steps>
 
-<CommunityLinks/>
+<CommunityLinks/>

pages/getting-started/install-memgraph/kubernetes.mdx (+31)

@@ -159,6 +159,12 @@ helm install <release-name> memgraph/memgraph
 ```
 Replace `<release-name>` with the name of the release you chose.
 
+<Callout type="info">
+When installing a chart, it's best practice to specify the exact version you
+want to use. Using the latest tag can lead to issues, as a pod restart may pull
+a newer image, potentially causing unexpected changes or incompatibilities.
+</Callout>
+
 #### Access Memgraph
 
 Once Memgraph is installed, you can access it using the provided services and
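
To make the callout above concrete, here is a hedged sketch of pinning versions at install time. The `--version` flag pins the Helm chart version; the `image.tag` key is an assumption about the chart's values (check the chart's default `values.yaml` for the actual key), and the version numbers are placeholders:

```bash
# List the chart versions published in the Memgraph Helm repository.
helm search repo memgraph/memgraph --versions

# Pin both the chart version and the Memgraph image tag instead of relying
# on a floating "latest" tag. The versions below are placeholders.
helm install <release-name> memgraph/memgraph \
  --version 0.1.0 \
  --set image.tag=2.18.1
```
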
@@ -315,6 +321,13 @@ helm install <release-name> memgraph/memgraph-high-availability --set env.MEMGRA
 Replace `<release-name>` with a name of your choice for the release and set the
 Enterprise license.
 
+
+<Callout type="info">
+When installing a chart, it's best practice to specify the exact version you
+want to use. Using the latest tag can lead to issues, as a pod restart may pull
+a newer image, potentially causing unexpected changes or incompatibilities.
+</Callout>
+
 ### Changing the default chart values
 
 To change the default chart values, run the command with the specified set of
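
Related to the hunk above, a hedged sketch of supplying the Enterprise license through a values file instead of repeated `--set` flags. The `env.*` keys mirror the flag shown in the hunk header and are assumptions; verify them against the chart's default values:

```bash
# Hypothetical values file; the env.* keys are assumed from the --set flag above.
cat > ha-values.yaml <<'EOF'
env:
  MEMGRAPH_ENTERPRISE_LICENSE: "<your-license-key>"
  MEMGRAPH_ORGANIZATION_NAME: "<your-organization>"
EOF

helm install <release-name> memgraph/memgraph-high-availability -f ha-values.yaml
```
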
@@ -336,6 +349,11 @@ helm upgrade <release-name> memgraph/memgraph-high-availability --set <flag1>=<v
 
 Again, it is possible to use both `--set` and values.yaml to set configuration options.
 
+If you're using `IngressNginx` and performing an upgrade, the attached public IP
+should remain the same. It will only change if the release includes specific
+updates that modify it, and such changes will be documented.
+
+
 ### Uninstall Helm chart
 
 Uninstallation is done with:
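
For the ingress note added in the hunk above, a quick way to verify the public IP before and after a `helm upgrade`. A sketch assuming the ingress-nginx controller runs with its default Service name and namespace; adjust both to your installation:

```bash
# Record the external IP attached to the ingress-nginx controller Service,
# run the upgrade, then check again; the two values should match.
kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
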
@@ -377,6 +395,19 @@ done
 echo "All PVs have been patched."
 ```
 
+<Callout type="info">
+Kubernetes uses a concept called "Storage Object in Use Protection", which
+delays the deletion of PersistentVolumeClaims (PVCs) until the pods using them
+are deleted. Similarly, PersistentVolumes (PVs) won't be deleted until their
+associated PVCs are removed.
+
+If PVC deletion is stuck, you can try removing its finalizers by running:
+
+```bash
+kubectl patch pvc PVC_NAME -p '{"metadata":{"finalizers": []}}' --type=merge
+```
+</Callout>
+
 ### Network configuration
 
 All instances which are part of Memgraph HA cluster use the internal Cluster IP network for communicating between themselves. By default, management port is on all instances opened on
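
Before reaching for the finalizer patch in the callout above, it usually helps to see what is still holding the claim. A sketch using standard kubectl commands; `PVC_NAME` and `POD_NAME` are placeholders:

```bash
# Inspect the claim: the describe output shows its Finalizers and which pod
# is still using it.
kubectl describe pvc PVC_NAME

# Deleting (or scaling down) the pod that uses the claim normally lets the
# PVC deletion finish on its own, without touching finalizers.
kubectl delete pod POD_NAME
```
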

pages/help-center/errors/_meta.ts (+1)

@@ -2,6 +2,7 @@ export default {
   "auth": "Auth",
   "connection": "Connection",
   "durability": "Durability",
+  "high-availability": "High availability",
   "memory": "Memory",
   "modules": "Modules",
   "ports": "Ports",
pages/help-center/errors/high-availability.mdx (new file, +29)

@@ -0,0 +1,29 @@
+---
+title: High availability errors
+description: Learn about high availability errors and how to troubleshoot them.
+---
+
+import { Callout } from 'nextra/components'
+import {CommunityLinks} from '/components/social-card/CommunityLinks'
+
+# High availability errors
+
+## Errors
+
+1. [At least one SYNC replica has not confirmed...](#error-1)
+
+### Troubleshooting SYNC replica not being confirmed [#error-1]
+
+If you're connecting to the cluster and encounter the error message **"At least one
+SYNC replica has not confirmed..."** when writing to the MAIN instance, several
+issues could be causing this. Below are the possible reasons and how to resolve
+them:
+
+1. The network isn't correctly configured between MAIN and REPLICAs -> Check that
+hostnames/IPs can be reached (see the connectivity sketch after this diff).
+2. Storage on a replica isn't clean -> If you used your replica instances before
+connecting them to the cluster, MAIN won't be able to successfully register the
+replica instance. Delete the data directory of the data instances and try to reconnect
+the cluster again.
+
+<CommunityLinks/>
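
A minimal connectivity check for the first item in the list above, sketched for a Kubernetes deployment. The pod names, the service hostname, and the replication port (10000 appears in many Memgraph HA examples) are assumptions; substitute the values from your own cluster:

```bash
# Hypothetical pod and hostname; adjust to your deployment.
# 1) Can MAIN resolve the replica's hostname?
kubectl exec memgraph-data-0 -- getent hosts memgraph-data-1.memgraph-svc

# 2) Does the replica's replication port accept TCP connections from MAIN?
#    Uses bash's /dev/tcp, so no extra tools are required in the image.
kubectl exec memgraph-data-0 -- bash -c \
  'cat < /dev/null > /dev/tcp/memgraph-data-1.memgraph-svc/10000 && echo reachable'
```

For the second item, the directory to clear is Memgraph's data directory (by default `/var/lib/memgraph`, unless your deployment overrides `--data-directory`).
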
