
Commit f307847

[DOCS] Adds overview and API ref for cluster voting configurations (#36954)
1 parent 1780ced commit f307847

8 files changed: +263 -149 lines

docs/reference/cluster.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -104,3 +104,5 @@ include::cluster/tasks.asciidoc[]
 include::cluster/nodes-hot-threads.asciidoc[]
 
 include::cluster/allocation-explain.asciidoc[]
+
+include::cluster/voting-exclusions.asciidoc[]
docs/reference/cluster/voting-exclusions.asciidoc

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+[[voting-config-exclusions]]
+== Voting configuration exclusions API
+++++
+<titleabbrev>Voting Configuration Exclusions</titleabbrev>
+++++
+
+Adds or removes master-eligible nodes from the
+<<modules-discovery-voting,voting configuration exclusion list>>.
+
+[float]
+=== Request
+
+`POST _cluster/voting_config_exclusions/<node_name>` +
+
+`DELETE _cluster/voting_config_exclusions`
+
+[float]
+=== Path parameters
+
+`node_name`::
+A <<cluster-nodes,node filter>> that identifies {es} nodes.
+
+[float]
+=== Description
+
+By default, if there are more than three master-eligible nodes in the cluster
+and you remove fewer than half of the master-eligible nodes in the cluster at
+once, the <<modules-discovery-voting,voting configuration>> automatically
+shrinks.
+
+If you want to shrink the voting configuration to contain fewer than three nodes
+or to remove half or more of the master-eligible nodes in the cluster at once,
+you must use this API to remove departed nodes from the voting configuration
+manually. It adds an entry for each node to the voting configuration exclusions
+list. The cluster then tries to reconfigure the voting configuration to remove
+those nodes and to prevent them from returning.
+
+If the API fails, you can safely retry it. Only a successful response
+guarantees that the node has been removed from the voting configuration and will
+not be reinstated.
+
+NOTE: Voting exclusions are required only when you remove at least half of the
+master-eligible nodes from a cluster in a short time period. They are not
+required when removing master-ineligible nodes or fewer than half of the
+master-eligible nodes.
+
+The <<modules-discovery-settings,`cluster.max_voting_config_exclusions`
+setting>> limits the size of the voting configuration exclusion list. The
+default value is `10`. Since voting configuration exclusions are persistent and
+limited in number, you must clear the voting config exclusions list once the
+exclusions are no longer required.
+
+There is also a
+<<modules-discovery-settings,`cluster.auto_shrink_voting_configuration` setting>>,
+which is set to `true` by default. If it is set to `false`, you must use this API to
+maintain the voting configuration.
+
+For more information, see <<modules-discovery-removing-nodes>>.
+
+[float]
+=== Examples
+
+Add `nodeId1` to the voting configuration exclusions list:
+[source,js]
+--------------------------------------------------
+POST /_cluster/voting_config_exclusions/nodeId1
+--------------------------------------------------
+// CONSOLE
+// TEST[catch:bad_request]
+
+Remove all exclusions from the list:
+[source,js]
+--------------------------------------------------
+DELETE /_cluster/voting_config_exclusions
+--------------------------------------------------
+// CONSOLE
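
For readers trying the new endpoints outside the docs build, here is a minimal sketch in Python of the same two calls documented above, assuming a cluster reachable at `http://localhost:9200` and a node matching the filter `nodeId1`; both values are illustrative assumptions, not part of the commit.

[source,python]
--------------------------------------------------
# Sketch only: the base URL and node name below are assumptions.
import requests

ES = "http://localhost:9200"

# POST _cluster/voting_config_exclusions/<node_name>
# adds the matching node(s) to the voting configuration exclusions list.
requests.post(f"{ES}/_cluster/voting_config_exclusions/nodeId1").raise_for_status()

# DELETE _cluster/voting_config_exclusions
# clears the exclusions list once the excluded nodes are no longer needed.
requests.delete(f"{ES}/_cluster/voting_config_exclusions").raise_for_status()
--------------------------------------------------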

docs/reference/modules/discovery.asciidoc

Lines changed: 18 additions & 7 deletions
@@ -13,6 +13,16 @@ module. This module is divided into the following sections:
 unknown, such as when a node has just started up or when the previous
 master has failed.
 
+<<modules-discovery-quorums>>::
+
+This section describes how {es} uses a quorum-based voting mechanism to
+make decisions even if some nodes are unavailable.
+
+<<modules-discovery-voting>>::
+
+This section describes the concept of voting configurations, which {es}
+automatically updates as nodes leave and join the cluster.
+
 <<modules-discovery-bootstrap-cluster>>::
 
 Bootstrapping a cluster is required when an Elasticsearch cluster starts up
@@ -40,26 +50,27 @@ module. This module is divided into the following sections:
 Cluster state publishing is the process by which the elected master node
 updates the cluster state on all the other nodes in the cluster.
 
-<<modules-discovery-quorums>>::
+<<cluster-fault-detection>>::
+
+{es} performs health checks to detect and remove faulty nodes.
 
-This section describes the detailed design behind the master election and
-auto-reconfiguration logic.
-
 <<modules-discovery-settings,Settings>>::
 
 There are settings that enable users to influence the discovery, cluster
 formation, master election and fault detection processes.
 
 include::discovery/discovery.asciidoc[]
 
+include::discovery/quorums.asciidoc[]
+
+include::discovery/voting.asciidoc[]
+
 include::discovery/bootstrapping.asciidoc[]
 
 include::discovery/adding-removing-nodes.asciidoc[]
 
 include::discovery/publishing.asciidoc[]
 
-include::discovery/quorums.asciidoc[]
-
 include::discovery/fault-detection.asciidoc[]
 
-include::discovery/discovery-settings.asciidoc[]
+include::discovery/discovery-settings.asciidoc[]

docs/reference/modules/discovery/adding-removing-nodes.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -12,6 +12,7 @@ cluster, and to scale the cluster up and down by adding and removing
 master-ineligible nodes only. However there are situations in which it may be
 desirable to add or remove some master-eligible nodes to or from a cluster.
 
+[[modules-discovery-adding-nodes]]
 ==== Adding master-eligible nodes
 
 If you wish to add some nodes to your cluster, simply configure the new nodes
@@ -24,6 +25,7 @@ cluster. You can use the `cluster.join.timeout` setting to configure how long a
 node waits after sending a request to join a cluster. Its default value is `30s`.
 See <<modules-discovery-settings>>.
 
+[[modules-discovery-removing-nodes]]
 ==== Removing master-eligible nodes
 
 When removing master-eligible nodes, it is important not to remove too many all
@@ -50,7 +52,7 @@ will never automatically move a node on the voting exclusions list back into the
 voting configuration. Once an excluded node has been successfully
 auto-reconfigured out of the voting configuration, it is safe to shut it down
 without affecting the cluster's master-level availability. A node can be added
-to the voting configuration exclusion list using the following API:
+to the voting configuration exclusion list using the <<voting-config-exclusions>> API. For example:
 
 [source,js]
 --------------------------------------------------
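
The body of that example falls outside the hunk shown here. As a companion, the sketch below shows how a caller might wait for an excluded node to drop out of the committed voting configuration before shutting it down. The `filter_path` request comes from the quorums documentation touched later in this commit; the host, the node ID, and the polling loop are assumptions.

[source,python]
--------------------------------------------------
# Sketch only: host, node ID, and polling interval are assumptions.
import time
import requests

ES = "http://localhost:9200"
NODE_ID = "node-to-remove"  # hypothetical ID of the excluded node

# GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
# returns the IDs of the nodes in the committed voting configuration.
def committed_voting_config():
    resp = requests.get(
        f"{ES}/_cluster/state",
        params={"filter_path": "metadata.cluster_coordination.last_committed_config"},
    )
    resp.raise_for_status()
    return resp.json()["metadata"]["cluster_coordination"]["last_committed_config"]

# Wait until the excluded node has been auto-reconfigured out of the voting
# configuration; only then is it safe to shut the node down.
while NODE_ID in committed_voting_config():
    time.sleep(1)
--------------------------------------------------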

docs/reference/modules/discovery/discovery-settings.asciidoc

Lines changed: 13 additions & 2 deletions
@@ -3,6 +3,15 @@
 
 Discovery and cluster formation are affected by the following settings:
 
+`cluster.auto_shrink_voting_configuration`::
+
+Controls whether the <<modules-discovery-voting,voting configuration>>
+sheds departed nodes automatically, as long as it still contains at least 3
+nodes. The default value is `true`. If set to `false`, the voting
+configuration never shrinks automatically and you must remove departed
+nodes manually with the <<voting-config-exclusions,voting configuration
+exclusions API>>.
+
 [[master-election-settings]]`cluster.election.back_off_time`::
 
 Sets the amount to increase the upper bound on the wait before an election
@@ -152,9 +161,11 @@ APIs are not be blocked and can run on any available node.
 
 Provides a list of master-eligible nodes in the cluster. The list contains
 either an array of hosts or a comma-delimited string. Each value has the
-format `host:port` or `host`, where `port` defaults to the setting `transport.profiles.default.port`. Note that IPv6 hosts must be bracketed.
+format `host:port` or `host`, where `port` defaults to the setting
+`transport.profiles.default.port`. Note that IPv6 hosts must be bracketed.
 The default value is `127.0.0.1, [::1]`. See <<unicast.hosts>>.
 
 `discovery.zen.ping.unicast.hosts.resolve_timeout`::
 
-Sets the amount of time to wait for DNS lookups on each round of discovery. This is specified as a <<time-units, time unit>> and defaults to `5s`.
+Sets the amount of time to wait for DNS lookups on each round of discovery.
+This is specified as a <<time-units, time unit>> and defaults to `5s`.
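
A sketch of reading the effective value of the new `cluster.auto_shrink_voting_configuration` setting back from a running cluster. It uses the standard `GET _cluster/settings` endpoint with `include_defaults` and `flat_settings`; the address and the assumption that the value sits in the `defaults` section unless overridden are illustrative.

[source,python]
--------------------------------------------------
# Sketch only: the base URL and the response layout (persistent/transient/
# defaults sections with flat keys) are assumptions for illustration.
import requests

resp = requests.get(
    "http://localhost:9200/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
)
resp.raise_for_status()
settings = resp.json()

key = "cluster.auto_shrink_voting_configuration"
# An explicit transient or persistent value wins over the built-in default.
for section in ("transient", "persistent", "defaults"):
    if key in settings.get(section, {}):
        print(f"{key} = {settings[section][key]} (from {section})")
        break
--------------------------------------------------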

docs/reference/modules/discovery/fault-detection.asciidoc

Lines changed: 4 additions & 3 deletions
@@ -2,8 +2,9 @@
 === Cluster fault detection
 
 The elected master periodically checks each of the nodes in the cluster to
-ensure that they are still connected and healthy. Each node in the cluster also periodically checks the health of the elected master. These checks
-are known respectively as _follower checks_ and _leader checks_.
+ensure that they are still connected and healthy. Each node in the cluster also
+periodically checks the health of the elected master. These checks are known
+respectively as _follower checks_ and _leader checks_.
 
 Elasticsearch allows these checks to occasionally fail or timeout without
 taking any action. It considers a node to be faulty only after a number of
@@ -16,4 +17,4 @@ and retry setting values and attempts to remove the node from the cluster.
 Similarly, if a node detects that the elected master has disconnected, this
 situation is treated as an immediate failure. The node bypasses the timeout and
 retry settings and restarts its discovery phase to try and find or elect a new
-master.
+master.

docs/reference/modules/discovery/quorums.asciidoc

Lines changed: 7 additions & 136 deletions
@@ -18,13 +18,13 @@ cluster. In many cases you can do this simply by starting or stopping the nodes
 as required. See <<modules-discovery-adding-removing-nodes>>.
 
 As nodes are added or removed Elasticsearch maintains an optimal level of fault
-tolerance by updating the cluster's _voting configuration_, which is the set of
-master-eligible nodes whose responses are counted when making decisions such as
-electing a new master or committing a new cluster state. A decision is made only
-after more than half of the nodes in the voting configuration have responded.
-Usually the voting configuration is the same as the set of all the
-master-eligible nodes that are currently in the cluster. However, there are some
-situations in which they may be different.
+tolerance by updating the cluster's <<modules-discovery-voting,voting
+configuration>>, which is the set of master-eligible nodes whose responses are
+counted when making decisions such as electing a new master or committing a new
+cluster state. A decision is made only after more than half of the nodes in the
+voting configuration have responded. Usually the voting configuration is the
+same as the set of all the master-eligible nodes that are currently in the
+cluster. However, there are some situations in which they may be different.
 
 To be sure that the cluster remains available you **must not stop half or more
 of the nodes in the voting configuration at the same time**. As long as more
@@ -38,46 +38,6 @@ cluster-state update that adjusts the voting configuration to match, and this
 can take a short time to complete. It is important to wait for this adjustment
 to complete before removing more nodes from the cluster.
 
-[float]
-==== Setting the initial quorum
-
-When a brand-new cluster starts up for the first time, it must elect its first
-master node. To do this election, it needs to know the set of master-eligible
-nodes whose votes should count. This initial voting configuration is known as
-the _bootstrap configuration_ and is set in the
-<<modules-discovery-bootstrap-cluster,cluster bootstrapping process>>.
-
-It is important that the bootstrap configuration identifies exactly which nodes
-should vote in the first election. It is not sufficient to configure each node
-with an expectation of how many nodes there should be in the cluster. It is also
-important to note that the bootstrap configuration must come from outside the
-cluster: there is no safe way for the cluster to determine the bootstrap
-configuration correctly on its own.
-
-If the bootstrap configuration is not set correctly, when you start a brand-new
-cluster there is a risk that you will accidentally form two separate clusters
-instead of one. This situation can lead to data loss: you might start using both
-clusters before you notice that anything has gone wrong and it is impossible to
-merge them together later.
-
-NOTE: To illustrate the problem with configuring each node to expect a certain
-cluster size, imagine starting up a three-node cluster in which each node knows
-that it is going to be part of a three-node cluster. A majority of three nodes
-is two, so normally the first two nodes to discover each other form a cluster
-and the third node joins them a short time later. However, imagine that four
-nodes were erroneously started instead of three. In this case, there are enough
-nodes to form two separate clusters. Of course if each node is started manually
-then it's unlikely that too many nodes are started. If you're using an automated
-orchestrator, however, it's certainly possible to get into this situation--
-particularly if the orchestrator is not resilient to failures such as network
-partitions.
-
-The initial quorum is only required the very first time a whole cluster starts
-up. New nodes joining an established cluster can safely obtain all the
-information they need from the elected master. Nodes that have previously been
-part of a cluster will have stored to disk all the information that is required
-when they restart.
-
 [float]
 ==== Master elections
 
@@ -104,92 +64,3 @@ and then started again then it will automatically recover, such as during a
 action with the APIs described here in these cases, because the set of master
 nodes is not changing permanently.
 
-[float]
-==== Automatic changes to the voting configuration
-
-Nodes may join or leave the cluster, and Elasticsearch reacts by automatically
-making corresponding changes to the voting configuration in order to ensure that
-the cluster is as resilient as possible.
-
-The default auto-reconfiguration
-behaviour is expected to give the best results in most situations. The current
-voting configuration is stored in the cluster state so you can inspect its
-current contents as follows:
-
-[source,js]
---------------------------------------------------
-GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
---------------------------------------------------
-// CONSOLE
-
-NOTE: The current voting configuration is not necessarily the same as the set of
-all available master-eligible nodes in the cluster. Altering the voting
-configuration involves taking a vote, so it takes some time to adjust the
-configuration as nodes join or leave the cluster. Also, there are situations
-where the most resilient configuration includes unavailable nodes, or does not
-include some available nodes, and in these situations the voting configuration
-differs from the set of available master-eligible nodes in the cluster.
-
-Larger voting configurations are usually more resilient, so Elasticsearch
-normally prefers to add master-eligible nodes to the voting configuration after
-they join the cluster. Similarly, if a node in the voting configuration
-leaves the cluster and there is another master-eligible node in the cluster that
-is not in the voting configuration then it is preferable to swap these two nodes
-over. The size of the voting configuration is thus unchanged but its
-resilience increases.
-
-It is not so straightforward to automatically remove nodes from the voting
-configuration after they have left the cluster. Different strategies have
-different benefits and drawbacks, so the right choice depends on how the cluster
-will be used. You can control whether the voting configuration automatically shrinks by using the following setting:
-
-`cluster.auto_shrink_voting_configuration`::
-
-Defaults to `true`, meaning that the voting configuration will automatically
-shrink, shedding departed nodes, as long as it still contains at least 3
-nodes. If set to `false`, the voting configuration never automatically
-shrinks; departed nodes must be removed manually using the
-<<modules-discovery-adding-removing-nodes,voting configuration exclusions API>>.
-
-NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the
-recommended and default setting, and there are at least three master-eligible
-nodes in the cluster, then Elasticsearch remains capable of processing
-cluster-state updates as long as all but one of its master-eligible nodes are
-healthy.
-
-There are situations in which Elasticsearch might tolerate the loss of multiple
-nodes, but this is not guaranteed under all sequences of failures. If this
-setting is set to `false` then departed nodes must be removed from the voting
-configuration manually, using the
-<<modules-discovery-adding-removing-nodes,voting exclusions API>>, to achieve
-the desired level of resilience.
-
-No matter how it is configured, Elasticsearch will not suffer from a "split-brain" inconsistency.
-The `cluster.auto_shrink_voting_configuration` setting affects only its availability in the
-event of the failure of some of its nodes, and the administrative tasks that
-must be performed as nodes join and leave the cluster.
-
-[float]
-==== Even numbers of master-eligible nodes
-
-There should normally be an odd number of master-eligible nodes in a cluster.
-If there is an even number, Elasticsearch leaves one of them out of the voting
-configuration to ensure that it has an odd size. This omission does not decrease
-the failure-tolerance of the cluster. In fact, improves it slightly: if the
-cluster suffers from a network partition that divides it into two equally-sized
-halves then one of the halves will contain a majority of the voting
-configuration and will be able to keep operating. If all of the master-eligible
-nodes' votes were counted, neither side would contain a strict majority of the
-nodes and so the cluster would not be able to make any progress.
-
-For instance if there are four master-eligible nodes in the cluster and the
-voting configuration contained all of them, any quorum-based decision would
-require votes from at least three of them. This situation means that the cluster
-can tolerate the loss of only a single master-eligible node. If this cluster
-were split into two equal halves, neither half would contain three
-master-eligible nodes and the cluster would not be able to make any progress.
-If the voting configuration contains only three of the four master-eligible
-nodes, however, the cluster is still only fully tolerant to the loss of one
-node, but quorum-based decisions require votes from two of the three voting
-nodes. In the event of an even split, one half will contain two of the three
-voting nodes so that half will remain available.
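
The arithmetic behind the relocated passages on quorums and even numbers of master-eligible nodes is easy to spot-check. The following lines only illustrate the majority rule described in the removed text; they are not part of the commit.

[source,python]
--------------------------------------------------
# Quorum = strict majority of the voting configuration; the cluster can lose
# (size - quorum) voting nodes and still make progress.
def quorum(voting_config_size: int) -> int:
    return voting_config_size // 2 + 1

for size in (3, 4, 5):
    q = quorum(size)
    print(f"voting config of {size}: quorum {q}, tolerates {size - q} lost node(s)")

# A 4-node voting configuration needs 3 votes and tolerates only 1 failure,
# the same tolerance as a 3-node configuration, which is why Elasticsearch
# leaves one node out of an even-sized set of master-eligible nodes.
--------------------------------------------------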
