Add elasticsearch-node tool docs #37812
[[node-tool]]
== elasticsearch-node

Sometimes {es} nodes are temporarily stopped, perhaps because of the need to
perform some maintenance activity or perhaps because of a hardware failure.
Once the temporary condition has been resolved you should restart the node and
it will rejoin the cluster and continue normally. Depending on your
configuration, your cluster may be able to remain completely available even
while one or more of its nodes are stopped.

Sometimes it might not be possible to restart a node after it has stopped. For
example, the node's host may suffer from a hardware problem that cannot be
repaired. If the cluster is still available then you can start up a fresh node
on another host and {es} will bring this node into the cluster in place of the
failed node.

Each node stores its data in the data directories defined by the
<<path-settings,`path.data` setting>>. This means that in a disaster you can
also restart a node by moving its data directories to another host, presuming
that those data directories can be recovered from the faulty host. Note that it
is not possible to restore the data directory from a backup because this will
lead to data corruption. Backups of an {es} cluster can only be taken using
<<modules-snapshots>>.

{es} <<modules-discovery-quorums,requires a response from a majority of the
master-eligible nodes>> in order to elect a master and to update the cluster
state. This means that if you have three master-eligible nodes then the cluster
will remain available even if one of them has failed. However if two of the
three master-eligible nodes fail then the cluster will be unavailable until at
least one of them is restarted.
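The majority arithmetic above can be sketched in a few lines. This is an illustrative calculation only, not part of the `elasticsearch-node` tool; the helper names are hypothetical:

```python
def majority(master_eligible: int) -> int:
    """Smallest number of master-eligible nodes that forms a majority."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes may fail while a majority survives."""
    return master_eligible - majority(master_eligible)
```

For a three-node cluster, `majority(3)` is 2, so `tolerated_failures(3)` is 1: the cluster survives one failed master-eligible node, but not two.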

In very rare circumstances it may not be possible to restart enough nodes to
restore the cluster's availability. If such a disaster occurs then you should
build a new cluster from a recent snapshot, and re-import any data that was
ingested since that snapshot was taken.

However, if the disaster is serious enough then it may not be possible to
recover from a recent snapshot either. Unfortunately in this case there is no
way forward that does not risk data loss, but it may be possible to use the
`elasticsearch-node` tool to unsafely bring the cluster back online.

This tool has two modes, depending on whether there are any master-eligible
nodes remaining or not:

* `elasticsearch-node unsafe-bootstrap` can be used if there is at least one
remaining master-eligible node. It allows you to force one of the remaining
nodes to become the elected master on its own.

* `elasticsearch-node detach-cluster` can be used if there are no remaining
master-eligible nodes. It allows you to detach any remaining data nodes from
the old, failed, cluster so they can join a new cluster.

[float]
=== Unsafe cluster bootstrapping

If there is at least one remaining master-eligible node, but it is not possible
to restart a majority of them, then the `elasticsearch-node unsafe-bootstrap`
command will unsafely override the cluster's <<modules-discovery-voting,voting
configuration>> as if performing another
<<modules-discovery-bootstrap-cluster,cluster bootstrapping process>>, allowing
the target node to become the elected master without needing a response from
any other nodes. This can lead to arbitrary data loss since the chosen node may
not hold the latest cluster metadata, and this out-of-date metadata may make it
impossible to use some or all of the indices in the cluster.

When you run the `elasticsearch-node unsafe-bootstrap` tool it will analyse the
state of the node and ask for confirmation before taking any action. Before
asking for confirmation it reports the term and version of the cluster state on
the node on which it runs as follows:

[source,txt]
----
Current node cluster state (term, version) pair is (4, 12)
----

If you have a choice of nodes on which to run this tool then you should pick
one with a term that is as large as possible, and if there are multiple nodes
with the same term then you should pick the one with the largest version. This
identifies the node with the freshest cluster state, minimising the quantity of
data that might be lost. For example, if the first node reports `(4, 12)` and a
second node reports `(5, 3)`, then the second node is preferred since its term
is larger. However if the second node reports `(3, 17)` then the first node is
preferred since its term is larger. If the second node reports `(4, 10)` then
it has the same term as the first node, but has a smaller version, so the first
node is preferred.
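This selection rule is simply a lexicographic comparison of `(term, version)` pairs: compare terms first, and use the version only to break ties. A minimal sketch (illustrative only, with a hypothetical helper name; the tool itself does not offer this comparison):

```python
def freshest_node(reports: dict) -> str:
    """Pick the node whose (term, version) pair is largest.

    Python tuples compare lexicographically, so the term is compared
    first and the version only breaks ties between equal terms.
    """
    return max(reports, key=reports.get)

# The worked examples from the text: node_2's larger term wins here.
print(freshest_node({"node_1": (4, 12), "node_2": (5, 3)}))
```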

[WARNING]
Execution of this command can lead to arbitrary data loss. Only run this tool
if you understand and accept the possible consequences and have exhausted all
other possibilities for recovery of your cluster.

The sequence of operations for using this tool is as follows:

1. Make sure you have really lost access to at least half of the
master-eligible nodes in the cluster, and they cannot be repaired or recovered
by moving their data paths to healthy hardware.
2. Stop **all** remaining master-eligible nodes.
3. Select one of the remaining master-eligible nodes to become the new elected
master as described above.
4. On this node, run the `elasticsearch-node unsafe-bootstrap` command as shown
below. Verify that the tool reported `Master node was successfully
bootstrapped`.
5. Start this node and verify that it is elected as the master node.
6. Start all other master-eligible nodes and verify that each one joins the
cluster.
7. Any running master-ineligible nodes will automatically join the
newly-elected master. Restart any previously-stopped nodes and verify that the
cluster is now fully-formed.
8. Investigate the data in the cluster to discover if any was lost during this
process.

[WARNING]
When you run the tool it will make sure that the node that is being used to
bootstrap the cluster is not running. It is important that all other
master-eligible nodes are also stopped while this tool is running, but the tool
does not check this.

[NOTE]
The message `Master node was successfully bootstrapped` does not mean that
there has been no data loss, it just means that the tool was able to complete
its job.

As an example, suppose your cluster had five master-eligible nodes and you have
permanently lost three of them, leaving two nodes remaining.

* Run the tool on the first remaining node, but answer `n` at the confirmation
step.

[source,txt]
----
node_1$ ./bin/elasticsearch-node unsafe-bootstrap

WARNING: Elasticsearch MUST be stopped before running this tool.

Current node cluster state (term, version) pair is (4, 12)

You should run this tool only if you have permanently lost half
or more of the master-eligible nodes, and you cannot restore the cluster
from a snapshot. This tool can result in arbitrary data loss and
should be the last resort.
If you have multiple survived master eligible nodes, consider running
this tool on the node with the highest cluster state (term, version) pair.
Do you want to proceed?

Confirm [y/N] n
----

* Run the tool on the second remaining node, and again answer `n` at the
confirmation step.

[source,txt]
----
node_2$ ./bin/elasticsearch-node unsafe-bootstrap

WARNING: Elasticsearch MUST be stopped before running this tool.

Current node cluster state (term, version) pair is (5, 3)

You should run this tool only if you have permanently lost half
or more of the master-eligible nodes, and you cannot restore the cluster
from a snapshot. This tool can result in arbitrary data loss and
should be the last resort.
If you have multiple survived master eligible nodes, consider running
this tool on the node with the highest cluster state (term, version) pair.
Do you want to proceed?

Confirm [y/N] n
----

* Since the second node has a greater term it has a fresher cluster state, so
it is better to unsafely bootstrap the cluster using this node:

[source,txt]
----
node_2$ ./bin/elasticsearch-node unsafe-bootstrap

WARNING: Elasticsearch MUST be stopped before running this tool.

Current node cluster state (term, version) pair is (5, 3)

You should run this tool only if you have permanently lost half
or more of the master-eligible nodes, and you cannot restore the cluster
from a snapshot. This tool can result in arbitrary data loss and
should be the last resort.
If you have multiple survived master eligible nodes, consider running
this tool on the node with the highest cluster state (term, version) pair.
Do you want to proceed?

Confirm [y/N] y
Master node was successfully bootstrapped
----

[float]
=== Detach cluster
To be described