
Commit d29e148

Small improvements to resilience design docs (#57791)

A follow-up to #47233 to clarify a few points.

1 parent: 90a45d2

1 file changed: docs/reference/high-availability/cluster-design.asciidoc (+30 −25 lines)
@@ -10,15 +10,12 @@ There is a limit to how small a resilient cluster can be. All {es} clusters
 require:
 
 * One <<modules-discovery-quorums,elected master node>> node
+* At least one node for each <<modules-node,role>>.
 * At least one copy of every <<scalability,shard>>.
 
-We also recommend adding a new node to the cluster for each
-<<modules-node,role>>.
+A resilient cluster requires redundancy for every required cluster component.
+This means a resilient cluster must have:
 
-A resilient cluster requires redundancy for every required cluster component,
-except the elected master node. For resilient clusters, we recommend:
-
-* One elected master node
 * At least three master-eligible nodes
 * At least two nodes of each role
 * At least two copies of each shard (one primary and one or more replicas)
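
The last bullet in the new list maps directly onto the index-level `number_of_replicas` setting: one primary plus at least one replica gives two copies of every shard. As a minimal sketch only, assuming the 8.x Python client, a cluster reachable at the address shown, and an illustrative index name:

```python
from elasticsearch import Elasticsearch

# Hypothetical connection details; adjust for your own cluster.
es = Elasticsearch("http://localhost:9200")

# One primary shard plus one replica means every shard exists as two copies,
# the minimum recommended for a resilient cluster.
es.indices.create(
    index="my-resilient-index",  # illustrative name
    settings={
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
        }
    },
)
```
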
@@ -27,13 +24,18 @@ A resilient cluster needs three master-eligible nodes so that if one of
 them fails then the remaining two still form a majority and can hold a
 successful election.
 
-Similarly, node redundancy makes it likely that if a node for a particular role
-fails, another node can take on its responsibilities.
+Similarly, redundancy of nodes of each role means that if a node for a
+particular role fails, another node can take on its responsibilities.
 
 Finally, a resilient cluster should have at least two copies of each shard. If
-one copy fails then there is another good copy to take over. {es} automatically
-rebuilds any failed shard copies on the remaining nodes in order to restore the
-cluster to full health after a failure.
+one copy fails then there should be another good copy to take over. {es}
+automatically rebuilds any failed shard copies on the remaining nodes in order
+to restore the cluster to full health after a failure.
+
+Failures temporarily reduce the total capacity of your cluster. In addition,
+after a failure the cluster must perform additional background activities to
+restore itself to health. You should make sure that your cluster has the
+capacity to handle your workload even if some nodes fail.
 
 Depending on your needs and budget, an {es} cluster can consist of a single
 node, hundreds of nodes, or any number in between. When designing a smaller
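
The rebuild described in this hunk can be observed through the cluster health API: while {es} re-creates lost shard copies, health stays `yellow` or `red` and the unassigned shard count is non-zero. A rough sketch, assuming the 8.x Python client and a placeholder endpoint:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Cluster health summarises how far the rebuild has progressed:
# `green` means every shard copy is allocated again.
health = es.cluster.health()
print(health["status"], health["unassigned_shards"], health["relocating_shards"])

# Optionally wait until the cluster has healed (the response carries
# `timed_out: true` if `green` is not reached within the timeout).
es.cluster.health(wait_for_status="green", timeout="60s")
```
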
@@ -60,13 +62,16 @@ To accommodate this, {es} assigns nodes every role by default.
 
 A single node cluster is not resilient. If the the node fails, the cluster will
 stop working. Because there are no replicas in a one-node cluster, you cannot
-store your data redundantly. However, at least one replica is required for a
-<<cluster-health,`green` cluster health status>>. To ensure your cluster can
-report a `green` status, set
-<<dynamic-index-settings,`index.number_of_replicas`>> to `0` on every index. If
-the node fails, you may need to restore an older copy of any lost indices from a
-<<modules-snapshots,snapshot>>. Because they are not resilient to any failures,
-we do not recommend using one-node clusters in production.
+store your data redundantly. However, by default at least one replica is
+required for a <<cluster-health,`green` cluster health status>>. To ensure your
+cluster can report a `green` status, override the default by setting
+<<dynamic-index-settings,`index.number_of_replicas`>> to `0` on every index.
+
+If the node fails, you may need to restore an older copy of any lost indices
+from a <<modules-snapshots,snapshot>>.
+
+Because they are not resilient to any failures, we do not recommend using
+one-node clusters in production.
 
 [[high-availability-cluster-design-two-nodes]]
 ==== Two-node clusters
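
For the one-node case in this hunk, `index.number_of_replicas` is a dynamic index setting, so it can be applied to existing indices through the update-settings API. A minimal sketch, assuming the 8.x Python client and a placeholder address; indices created later would need the same setting (for example via an index template):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # the single node; placeholder address

# Remove the replica requirement from every existing index so a one-node
# cluster can report `green` health (there is no second node to hold replicas).
es.indices.put_settings(
    index="*",
    settings={"index": {"number_of_replicas": 0}},
)

print(es.cluster.health()["status"])
```
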
@@ -84,8 +89,8 @@ not <<master-node,master-eligible>>. This means you can be certain which of your
 nodes is the elected master of the cluster. The cluster can tolerate the loss of
 the other master-ineligible node. If you don't set `node.master: false` on one
 node, both nodes are master-eligible. This means both nodes are required for a
-master election. This election will fail if your cluster cannot reliably
-tolerate the loss of either node.
+master election. Since the election will fail if either node is unavailable,
+your cluster cannot reliably tolerate the loss of either node.
 
 By default, each node is assigned every role. We recommend you assign both nodes
 all other roles except master eligibility. If one node fails, the other node can
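
`node.master: false` is a static setting, so it goes into `elasticsearch.yml` on one of the two nodes rather than through an API. Once the cluster is running, the elected master can be confirmed from a client; a sketch, assuming the 8.x Python client and a placeholder host name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://node-1:9200")  # placeholder host

# With node.master: false on one node, this should always name the other node.
for row in es.cat.master(format="json"):
    print(row["node"], row["ip"])
```
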
@@ -114,7 +119,7 @@ master, but it is impossible to tell the difference between the failure of a
 remote node and a mere loss of connectivity between the nodes. If both nodes
 were capable of running independent elections, a loss of connectivity would
 lead to a https://en.wikipedia.org/wiki/Split-brain_(computing)[split-brain
-problem] and therefore, data loss. {es} avoids this and
+problem] and therefore data loss. {es} avoids this and
 protects your data by electing neither node as master until that node can be
 sure that it has the latest cluster state and that there is no other master in
 the cluster. This could result in the cluster having no master until
@@ -212,8 +217,8 @@ The cluster will be resilient to the loss of any node as long as:
 - There are at least two data nodes.
 - Every index has at least one replica of each shard, in addition to the
 primary.
-- The cluster has at least three master-eligible nodes. At least two of these
-nodes are not voting-only, master-eligible nodes.
+- The cluster has at least three master-eligible nodes, as long as at least two
+of these nodes are not voting-only master-eligible nodes.
 - Clients are configured to send their requests to more than one node or are
 configured to use a load balancer that balances the requests across an
 appropriate set of nodes. The {ess-trial}[Elastic Cloud] service provides such
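
The client condition in this list can be met without a separate load balancer by configuring the client itself with several node addresses; most official clients then route requests around a node that stops responding. A sketch with the 8.x Python client (host names are placeholders):

```python
from elasticsearch import Elasticsearch

# Give the client more than one node to talk to; if one node fails,
# requests can still reach the rest of the cluster.
es = Elasticsearch(
    [
        "http://node-1:9200",  # placeholder addresses
        "http://node-2:9200",
        "http://node-3:9200",
    ],
    max_retries=3,
    retry_on_timeout=True,
)

print(es.cluster.health()["status"])
```
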
@@ -343,8 +348,8 @@ The cluster will be resilient to the loss of any zone as long as:
 - Shard allocation awareness is configured to avoid concentrating all copies of
 a shard within a single zone.
 - The cluster has at least three master-eligible nodes. At least two of these
-nodes are not voting-only master-eligible nodes, spread evenly across at least
-three zones.
+nodes are not voting-only master-eligible nodes, and they are spread evenly
+across at least three zones.
 - Clients are configured to send their requests to nodes in more than one zone
 or are configured to use a load balancer that balances the requests across an
 appropriate set of nodes. The {ess-trial}[Elastic Cloud] service provides such
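
Shard allocation awareness, mentioned in the first bullet of this list, needs two pieces of configuration: each node advertises its zone through a node attribute (for example `node.attr.zone` in `elasticsearch.yml`, a static setting), and the cluster is told to use that attribute when placing shard copies. The attribute name `zone` and the client details below are illustrative; a sketch with the 8.x Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Spread copies of each shard across values of the `zone` node attribute,
# so a single zone never holds every copy of a shard.
es.cluster.put_settings(
    persistent={
        "cluster.routing.allocation.awareness.attributes": "zone",
    }
)
```
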
