
Commit 09425d5

Add elasticsearch-node tool docs (#37812)

This commit, mostly authored by @DaveCTurner, adds documentation for the `elasticsearch-node` tool (#37696).

1 parent fe405bd commit 09425d5

File tree

3 files changed: +337 -0 lines changed

docs/reference/commands/index.asciidoc (+2)
@@ -10,6 +10,7 @@ tasks from the command line:
 * <<certgen>>
 * <<certutil>>
 * <<migrate-tool>>
+* <<node-tool>>
 * <<saml-metadata>>
 * <<setup-passwords>>
 * <<shard-tool>>
@@ -21,6 +22,7 @@ tasks from the command line:
 include::certgen.asciidoc[]
 include::certutil.asciidoc[]
 include::migrate-tool.asciidoc[]
+include::node-tool.asciidoc[]
 include::saml-metadata.asciidoc[]
 include::setup-passwords.asciidoc[]
 include::shard-tool.asciidoc[]
+334

@@ -0,0 +1,334 @@
[[node-tool]]
== elasticsearch-node

The `elasticsearch-node` command enables you to perform unsafe operations that
risk data loss but which may help to recover some data in a disaster.

[float]
=== Synopsis

[source,shell]
--------------------------------------------------
bin/elasticsearch-node unsafe-bootstrap|detach-cluster
  [--ordinal <Integer>] [-E <KeyValuePair>]
  [-h, --help] ([-s, --silent] | [-v, --verbose])
--------------------------------------------------

[float]
=== Description

Sometimes {es} nodes are temporarily stopped, perhaps because of the need to
perform some maintenance activity or perhaps because of a hardware failure.
After you resolve the temporary condition and restart the node,
it will rejoin the cluster and continue normally. Depending on your
configuration, your cluster may be able to remain completely available even
while one or more of its nodes are stopped.

Sometimes it might not be possible to restart a node after it has stopped. For
example, the node's host may suffer from a hardware problem that cannot be
repaired. If the cluster is still available then you can start up a fresh node
on another host and {es} will bring this node into the cluster in place of the
failed node.

Each node stores its data in the data directories defined by the
<<path-settings,`path.data` setting>>. This means that in a disaster you can
also restart a node by moving its data directories to another host, presuming
that those data directories can be recovered from the faulty host.

{es} <<modules-discovery-quorums,requires a response from a majority of the
master-eligible nodes>> in order to elect a master and to update the cluster
state. This means that if you have three master-eligible nodes then the cluster
will remain available even if one of them has failed. However if two of the
three master-eligible nodes fail then the cluster will be unavailable until at
least one of them is restarted.
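As a quick way to reason about your own cluster's tolerance, the majority rule above can be sketched as a short calculation (illustrative only; this is not part of {es} or the tool):

```python
def quorum(master_eligible: int) -> int:
    """The smallest majority of a set of master-eligible nodes."""
    return master_eligible // 2 + 1


def can_elect_master(master_eligible: int, failed: int) -> bool:
    """True if the surviving master-eligible nodes still form a majority."""
    return master_eligible - failed >= quorum(master_eligible)


# Three master-eligible nodes: one failure is tolerated, two are not.
print(can_elect_master(3, 1), can_elect_master(3, 2))
```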
In very rare circumstances it may not be possible to restart enough nodes to
restore the cluster's availability. If such a disaster occurs, you should
build a new cluster from a recent snapshot and re-import any data that was
ingested since that snapshot was taken.

However, if the disaster is serious enough then it may not be possible to
recover from a recent snapshot either. Unfortunately in this case there is no
way forward that does not risk data loss, but it may be possible to use the
`elasticsearch-node` tool to construct a new cluster that contains some of the
data from the failed cluster.

This tool has two modes:

* `elasticsearch-node unsafe-bootstrap` can be used if there is at least one
remaining master-eligible node. It forces one of the remaining nodes to form
a brand-new cluster on its own, using its local copy of the cluster metadata.
This is known as _unsafe cluster bootstrapping_.

* `elasticsearch-node detach-cluster` enables you to move nodes from one
cluster to another. This can be used to move nodes into a new cluster created
with the `elasticsearch-node unsafe-bootstrap` command. If unsafe cluster
bootstrapping was not possible, it also enables you to move nodes into a
brand-new cluster.

[[node-tool-unsafe-bootstrap]]
[float]
==== Unsafe cluster bootstrapping

If there is at least one remaining master-eligible node, but it is not possible
to restart a majority of them, then the `elasticsearch-node unsafe-bootstrap`
command will unsafely override the cluster's <<modules-discovery-voting,voting
configuration>> as if performing another
<<modules-discovery-bootstrap-cluster,cluster bootstrapping process>>.
The target node can then form a new cluster on its own by using
the cluster metadata held locally on the target node.

[WARNING]
These steps can lead to arbitrary data loss since the target node may not hold
the latest cluster metadata, and this out-of-date metadata may make it
impossible to use some or all of the indices in the cluster.

Since unsafe bootstrapping forms a new cluster containing a single node, once
you have run it you must use the <<node-tool-detach-cluster,`elasticsearch-node
detach-cluster` tool>> to migrate any other surviving nodes from the failed
cluster into this new cluster.

When you run the `elasticsearch-node unsafe-bootstrap` tool it will analyse the
state of the node and ask for confirmation before taking any action. Before
asking for confirmation it reports the term and version of the cluster state on
the node on which it runs as follows:

[source,txt]
----
Current node cluster state (term, version) pair is (4, 12)
----

If you have a choice of nodes on which to run this tool then you should choose
one with a term that is as large as possible. If there is more than one node
with the same term, pick the one with the largest version. This information
identifies the node with the freshest cluster state, which minimizes the
quantity of data that might be lost. For example, if the first node reports
`(4, 12)` and a second node reports `(5, 3)`, then the second node is preferred
since its term is larger. However if the second node reports `(3, 17)` then
the first node is preferred since its term is larger. If the second node
reports `(4, 10)` then it has the same term as the first node, but has a
smaller version, so the first node is preferred.
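The selection rule above amounts to a lexicographic comparison of the reported `(term, version)` pairs. As a sketch (the tool itself only reports the pair; comparing across nodes is up to you):

```python
def freshest_node(reported: dict) -> str:
    """Pick the node whose (term, version) pair is largest: terms are
    compared first, and versions break ties between equal terms."""
    return max(reported, key=reported.get)


# The worked examples from the text above:
print(freshest_node({"node_1": (4, 12), "node_2": (5, 3)}))   # larger term wins
print(freshest_node({"node_1": (4, 12), "node_2": (3, 17)}))  # larger term wins
print(freshest_node({"node_1": (4, 12), "node_2": (4, 10)}))  # version breaks the tie
```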
111+
112+
[WARNING]
113+
Running this command can lead to arbitrary data loss. Only run this tool if you
114+
understand and accept the possible consequences and have exhausted all other
115+
possibilities for recovery of your cluster.
116+
117+
The sequence of operations for using this tool are as follows:
118+
119+
1. Make sure you have really lost access to at least half of the
120+
master-eligible nodes in the cluster, and they cannot be repaired or recovered
121+
by moving their data paths to healthy hardware.
122+
2. Stop **all** remaining nodes.
123+
3. Choose one of the remaining master-eligible nodes to become the new elected
124+
master as described above.
125+
4. On this node, run the `elasticsearch-node unsafe-bootstrap` command as shown
126+
below. Verify that the tool reported `Master node was successfully
127+
bootstrapped`.
128+
5. Start this node and verify that it is elected as the master node.
129+
6. Run the <<node-tool-detach-cluster,`elasticsearch-node detach-cluster`
130+
tool>>, described below, on every other node in the cluster.
131+
7. Start all other nodes and verify that each one joins the cluster.
132+
8. Investigate the data in the cluster to discover if any was lost during this
133+
process.
134+
135+
When you run the tool it will make sure that the node that is being used to
136+
bootstrap the cluster is not running. It is important that all other
137+
master-eligible nodes are also stopped while this tool is running, but the tool
138+
does not check this.
139+
140+
The message `Master node was successfully bootstrapped` does not mean that
141+
there has been no data loss, it just means that tool was able to complete its
142+
job.
143+
144+
[[node-tool-detach-cluster]]
145+
[float]
146+
==== Detaching nodes from their cluster
147+
148+
It is unsafe for nodes to move between clusters, because different clusters
149+
have completely different cluster metadata. There is no way to safely merge the
150+
metadata from two clusters together.
151+
152+
To protect against inadvertently joining the wrong cluster, each cluster
153+
creates a unique identifier, known as the _cluster UUID_, when it first starts
154+
up. Every node records the UUID of its cluster and refuses to join a
155+
cluster with a different UUID.
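The join check described above can be sketched as follows (a simplified illustration; the real join handshake involves considerably more state):

```python
from typing import Optional


class JoinDenied(Exception):
    """Raised when a node refuses to join a cluster (illustrative)."""


def check_join(recorded_uuid: Optional[str], cluster_uuid: str) -> str:
    """A fresh node records the UUID of the first cluster it joins;
    thereafter it refuses to join any cluster with a different UUID."""
    if recorded_uuid is not None and recorded_uuid != cluster_uuid:
        raise JoinDenied("cluster UUID mismatch")
    return cluster_uuid  # the UUID the node now has recorded
```

Resetting the recorded UUID to "none", which is what `detach-cluster` does, is what lets a detached node pass this check against a new cluster.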
However, if a node's cluster has permanently failed then it may be desirable to
try and move it into a new cluster. The `elasticsearch-node detach-cluster`
command lets you detach a node from its cluster by resetting its cluster UUID.
It can then join another cluster with a different UUID.

For example, after unsafe cluster bootstrapping you will need to detach all the
other surviving nodes from their old cluster so they can join the new,
unsafely-bootstrapped cluster.

Unsafe cluster bootstrapping is only possible if there is at least one
surviving master-eligible node. If there are no remaining master-eligible nodes
then the cluster metadata is completely lost. However, the individual data
nodes also contain a copy of the index metadata corresponding with their
shards. This sometimes allows a new cluster to import these shards as
<<modules-gateway-dangling-indices,dangling indices>>. You can sometimes
recover some indices after the loss of all master-eligible nodes in a cluster
by creating a new cluster and then using the `elasticsearch-node
detach-cluster` command to move any surviving nodes into this new cluster.

There is a risk of data loss when importing a dangling index because data nodes
may not have the most recent copy of the index metadata and do not have any
information about <<docs-replication,which shard copies are in-sync>>. This
means that a stale shard copy may be selected to be the primary, and some of
the shards may be incompatible with the imported mapping.
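A toy model of why this is risky (the names and sequence numbers below are purely illustrative, not {es} internals): with intact cluster metadata, only a copy recorded as in-sync may become primary, but a dangling-index import has lost that record and may promote any surviving copy.

```python
def promote_with_metadata(copies: dict, in_sync: set) -> str:
    """Normal case: promote the best copy among those known to be in-sync."""
    return max((name for name in copies if name in in_sync), key=copies.get)


def promote_dangling(copies: dict) -> str:
    """Dangling import: the in-sync set is gone, so any surviving copy may
    be promoted -- here, simply the first one found."""
    return next(iter(copies))


# copy_b missed recent writes (lower sequence number) and is stale
copies = {"copy_b": 42, "copy_a": 100}
print(promote_with_metadata(copies, {"copy_a"}))  # the in-sync copy
print(promote_dangling(copies))                    # may pick the stale copy
```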
[WARNING]
Execution of this command can lead to arbitrary data loss. Only run this tool
if you understand and accept the possible consequences and have exhausted all
other possibilities for recovery of your cluster.

The sequence of operations for using this tool is as follows:

1. Make sure you have really lost access to every one of the master-eligible
nodes in the cluster, and they cannot be repaired or recovered by moving their
data paths to healthy hardware.
2. Start a new cluster and verify that it is healthy. This cluster may comprise
one or more brand-new master-eligible nodes, or may be an unsafely-bootstrapped
cluster formed as described above.
3. Stop **all** remaining data nodes.
4. On each data node, run the `elasticsearch-node detach-cluster` tool as shown
below. Verify that the tool reported `Node was successfully detached from the
cluster`.
5. If necessary, configure each data node to
<<modules-discovery-hosts-providers,discover the new cluster>>.
6. Start each data node and verify that it has joined the new cluster.
7. Wait for all recoveries to have completed, and investigate the data in the
cluster to discover if any was lost during this process.

The message `Node was successfully detached from the cluster` does not mean
that there has been no data loss; it just means that the tool was able to
complete its job.

[float]
=== Parameters

`unsafe-bootstrap`:: Specifies to unsafely bootstrap this node as a new
one-node cluster.

`detach-cluster`:: Specifies to unsafely detach this node from its cluster so
it can join a different cluster.

`--ordinal <Integer>`:: If there is <<max-local-storage-nodes,more than one
node sharing a data path>> then this specifies which node to target. Defaults
to `0`, meaning to use the first node in the data path.

`-E <KeyValuePair>`:: Configures a setting.

`-h, --help`:: Returns all of the command parameters.

`-s, --silent`:: Shows minimal output.

`-v, --verbose`:: Shows verbose output.
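As an illustration of how `--ordinal` selects a node when several nodes share a data path, each node lives in its own numbered directory beneath that path. This sketch assumes the `nodes/<ordinal>` on-disk layout and is not part of the tool:

```python
import os


def node_data_dir(path_data: str, ordinal: int = 0) -> str:
    """Data directory of the node with the given ordinal under a shared
    path.data; ordinal 0 (the default) selects the first node."""
    return os.path.join(path_data, "nodes", str(ordinal))


# Two nodes sharing /var/lib/elasticsearch: --ordinal picks between them.
print(node_data_dir("/var/lib/elasticsearch"))
print(node_data_dir("/var/lib/elasticsearch", 1))
```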
[float]
=== Examples

[float]
==== Unsafe cluster bootstrapping

Suppose your cluster had five master-eligible nodes and you have permanently
lost three of them, leaving two nodes remaining.

* Run the tool on the first remaining node, but answer `n` at the confirmation
step.

[source,txt]
----
node_1$ ./bin/elasticsearch-node unsafe-bootstrap

WARNING: Elasticsearch MUST be stopped before running this tool.

Current node cluster state (term, version) pair is (4, 12)

You should only run this tool if you have permanently lost half or more
of the master-eligible nodes in this cluster, and you cannot restore the
cluster from a snapshot. This tool can cause arbitrary data loss and its
use should be your last resort. If you have multiple surviving master
eligible nodes, you should run this tool on the node with the highest
cluster state (term, version) pair.

Do you want to proceed?

Confirm [y/N] n
----

* Run the tool on the second remaining node, and again answer `n` at the
confirmation step.

[source,txt]
----
node_2$ ./bin/elasticsearch-node unsafe-bootstrap

WARNING: Elasticsearch MUST be stopped before running this tool.

Current node cluster state (term, version) pair is (5, 3)

You should only run this tool if you have permanently lost half or more
of the master-eligible nodes in this cluster, and you cannot restore the
cluster from a snapshot. This tool can cause arbitrary data loss and its
use should be your last resort. If you have multiple surviving master
eligible nodes, you should run this tool on the node with the highest
cluster state (term, version) pair.

Do you want to proceed?

Confirm [y/N] n
----

* Since the second node has a greater term it has a fresher cluster state, so
it is better to unsafely bootstrap the cluster using this node:

[source,txt]
----
node_2$ ./bin/elasticsearch-node unsafe-bootstrap

WARNING: Elasticsearch MUST be stopped before running this tool.

Current node cluster state (term, version) pair is (5, 3)

You should only run this tool if you have permanently lost half or more
of the master-eligible nodes in this cluster, and you cannot restore the
cluster from a snapshot. This tool can cause arbitrary data loss and its
use should be your last resort. If you have multiple surviving master
eligible nodes, you should run this tool on the node with the highest
cluster state (term, version) pair.

Do you want to proceed?

Confirm [y/N] y
Master node was successfully bootstrapped
----

[float]
==== Detaching nodes from their cluster

After unsafely bootstrapping a new cluster, run the `elasticsearch-node
detach-cluster` command to detach all remaining nodes from the failed cluster
so they can join the new cluster:

[source,txt]
----
node_3$ ./bin/elasticsearch-node detach-cluster

WARNING: Elasticsearch MUST be stopped before running this tool.

You should only run this tool if you have permanently lost all of the
master-eligible nodes in this cluster and you cannot restore the cluster
from a snapshot, or you have already unsafely bootstrapped a new cluster
by running `elasticsearch-node unsafe-bootstrap` on a master-eligible
node that belonged to the same cluster as this node. This tool can cause
arbitrary data loss and its use should be your last resort.

Do you want to proceed?

Confirm [y/N] y
Node was successfully detached from the cluster
----

docs/reference/modules/gateway.asciidoc (+1)

@@ -49,6 +49,7 @@ as long as the following conditions are met:

 NOTE: These settings only take effect on a full cluster restart.

+[[modules-gateway-dangling-indices]]
 === Dangling indices

 When a node joins the cluster, any shards stored in its local data