Increase PeerFinder verbosity on persistent failure #73128
If a node is partitioned away from the rest of the cluster then the `ClusterFormationFailureHelper` periodically reports that it cannot discover the expected collection of nodes, but does not indicate why. To prove it's a connectivity problem, users must today restart the node with `DEBUG` logging on `org.elasticsearch.discovery.PeerFinder` to see further details. With this commit we log messages at `WARN` level if the node remains disconnected for longer than a configurable timeout, which defaults to 5 minutes. Relates elastic#72968
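A minimal sketch of the escalation rule described above, assuming the log level is chosen by comparing the time since the node first failed to find its peers against the timeout. The class, enum, and method names are hypothetical and are not taken from the actual `PeerFinder` change; only the `DEBUG`/`WARN` levels and the 5 minute default come from the description.

```java
// Hypothetical illustration only; this is not the PeerFinder implementation.
final class VerbosityEscalation {

    // Stand-in for the logging framework's levels.
    enum Level { DEBUG, WARN }

    private final long timeoutMillis;       // assumed configurable; description says default is 5 minutes
    private final long firstFailureMillis;  // when the node first failed to discover its peers

    VerbosityEscalation(long timeoutMillis, long firstFailureMillis) {
        this.timeoutMillis = timeoutMillis;
        this.firstFailureMillis = firstFailureMillis;
    }

    // DEBUG until the node has been disconnected for longer than the timeout, WARN afterwards.
    Level levelAt(long nowMillis) {
        return nowMillis - firstFailureMillis > timeoutMillis ? Level.WARN : Level.DEBUG;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5L * 60L * 1000L;
        VerbosityEscalation escalation = new VerbosityEscalation(fiveMinutes, 0L);
        System.out.println(escalation.levelAt(60_000L));   // one minute after the first failure
        System.out.println(escalation.levelAt(360_000L));  // six minutes after the first failure
    }
}
```

With this shape a short-lived disconnection stays quiet at the default log level, and only a persistent one surfaces the per-address details without a logging change.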
Pinging @elastic/es-distributed (Team:Distributed)
LGTM, thanks for improving this.
"connection failed",
"org.elasticsearch.discovery.PeerFinder",
Level.DEBUG,
"*connection failed*"));
nit: now that we validate the message, it would be nice to show that it contains both the transport address and the exception message `cannot connect to`.
Done in 3f03284.
"connection failed",
"org.elasticsearch.discovery.PeerFinder",
Level.WARN,
"*connection failed: cannot connect to*"));
nit: now that we validate the message, it would be nice to show that it contains the transport address.
Done in 3f03284.
Thanks Henning
Since elastic#73128 a sufficiently old `PeerFinder` will report all exceptions encountered during discovery to help diagnose cluster formation problems. We throw exceptions on genuine connection failures, but we also throw exceptions if the discovered node is the local node or is master-ineligible, because these nodes are of no use in discovery. We report all such exceptions as failures:

`[instance-0000000001] address [10.0.0.1:12345], node [null], requesting [false] connection failed: [instance-0000000002][10.0.0.1:12345] non-master-eligible node found`

Experience shows that users often have master-ineligible nodes in their discovery config, so they will see these messages frequently if the cluster cannot form, and may interpret `connection failed` as the source of the problem even though these messages are benign.

This commit adjusts the language in these messages to be more balanced, replacing `connection failed` with `discovery result`, including the phrase `successfully discovered` in the exception message, and giving advice on how to suppress the message.