Increase PeerFinder verbosity on persistent failure #73128

DaveCTurner · 2021-05-16T10:32:58Z

If a node is partitioned away from the rest of the cluster then the
ClusterFormationFailureHelper periodically reports that it cannot
discover the expected collection of nodes, but does not indicate why. To
prove it's a connectivity problem, users must today restart the node
with DEBUG logging on org.elasticsearch.discovery.PeerFinder to see
further details.

With this commit we log messages at WARN level if the node remains
disconnected for longer than a configurable timeout, which defaults to 5
minutes.

Relates #72968
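
For illustration, the escalation rule described above can be sketched as follows. This is a minimal sketch and not the actual `PeerFinder` code: the class name `VerbositySketch`, its fields, and the `levelAt` method are hypothetical, and the only value taken from the description is the 5 minute default.

```java
import java.time.Duration;

/** Hypothetical sketch: pick a log level from how long discovery has been failing. */
final class VerbositySketch {
    enum Level { DEBUG, WARN }

    // Default taken from the description above: 5 minutes. The constant name is an assumption.
    static final Duration DEFAULT_WARN_TIMEOUT = Duration.ofMinutes(5);

    private final Duration warnTimeout;
    private final long activatedAtMillis;

    VerbositySketch(Duration warnTimeout, long activatedAtMillis) {
        this.warnTimeout = warnTimeout;
        this.activatedAtMillis = activatedAtMillis;
    }

    /** Returns WARN once the node has been disconnected for longer than the timeout, else DEBUG. */
    Level levelAt(long nowMillis) {
        long elapsedMillis = nowMillis - activatedAtMillis;
        return elapsedMillis > warnTimeout.toMillis() ? Level.WARN : Level.DEBUG;
    }

    public static void main(String[] args) {
        VerbositySketch sketch = new VerbositySketch(DEFAULT_WARN_TIMEOUT, 0L);
        // One minute in: still within the timeout, so details stay at DEBUG.
        System.out.println(sketch.levelAt(Duration.ofMinutes(1).toMillis()));
        // Six minutes in: past the timeout, so details are promoted to WARN.
        System.out.println(sketch.levelAt(Duration.ofMinutes(6).toMillis()));
    }
}
```

The point of the design is that short-lived disconnections stay quiet, while a persistent failure surfaces the details without a logging change on the node.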


elasticmachine · 2021-05-16T10:33:01Z

Pinging @elastic/es-distributed (Team:Distributed)

henningandersen

LGTM, thanks for improving this.

server/src/main/java/org/elasticsearch/discovery/PeerFinder.java

henningandersen · 2021-05-17T07:53:13Z

server/src/test/java/org/elasticsearch/discovery/PeerFinderTests.java

+                    "connection failed",
+                    "org.elasticsearch.discovery.PeerFinder",
+                    Level.DEBUG,
+                    "*connection failed*"));


nit: now that we validate the message, it would be nice to show that it contains both the transport address and the exception message `cannot connect to`.

Done in 3f03284.

henningandersen · 2021-05-17T07:53:56Z

server/src/test/java/org/elasticsearch/discovery/PeerFinderTests.java

+                    "connection failed",
+                    "org.elasticsearch.discovery.PeerFinder",
+                    Level.WARN,
+                    "*connection failed: cannot connect to*"));


nit: now that we validate the message, it would be nice to show that it contains the transport address.

Done in 3f03284.
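
To make the nits above concrete, here is a small sketch of how a `*` wildcard pattern can be tightened so that it also pins down the transport address. It assumes that `*` matches any run of characters, as the patterns in the diff suggest. The `WildcardSketch` class and the sample log message are hypothetical and are not taken from the Elasticsearch test framework.

```java
/** Hypothetical sketch of '*' wildcard matching for log message expectations. */
final class WildcardSketch {

    /** Returns true if text matches pattern, where '*' matches any (possibly empty) run of characters. */
    static boolean matches(String pattern, String text) {
        int p = 0, t = 0, star = -1, mark = 0;
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;   // remember the wildcard; first try matching it against nothing
                mark = t;
            } else if (p < pattern.length() && pattern.charAt(p) == text.charAt(t)) {
                p++;          // literal characters agree, advance both cursors
                t++;
            } else if (star >= 0) {
                p = star + 1; // backtrack: let the last '*' absorb one more character
                t = ++mark;
            } else {
                return false;
            }
        }
        // Only trailing wildcards may remain unmatched in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        // Hypothetical message; the real format is whatever PeerFinder logs.
        String message = "address [10.0.0.1:12345], node [null], requesting [false] "
            + "connection failed: cannot connect to [10.0.0.1:12345]";
        System.out.println(matches("*connection failed*", message));
        System.out.println(matches("*address [10.0.0.1:12345]*connection failed: cannot connect to*", message));
        System.out.println(matches("*address [10.0.0.2:12345]*connection failed*", message));
    }
}
```

The looser pattern accepts any message containing `connection failed`, whereas the tighter one fails if the address or the exception text is missing, which is the extra coverage the review asks for.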


DaveCTurner · 2021-05-17T09:52:23Z

Thanks Henning


Since elastic#73128 a sufficiently old `PeerFinder` will report all exceptions encountered during discovery to help diagnose cluster formation problems. We throw exceptions on genuine connection failures, but we also throw exceptions if the discovered node is the local node or is master-ineligible because these nodes are no use in discovery. We report all such exceptions as failures:

`[instance-0000000001] address [10.0.0.1:12345], node [null], requesting [false] connection failed: [instance-0000000002][10.0.0.1:12345] non-master-eligible node found`

Experience shows that users often have master-ineligible nodes in their discovery config so will see these messages frequently if the cluster cannot form, and may interpret the `connection failed` as the source of the problems even though they're benign.

This commit adjusts the language in these messages to be more balanced, replacing `connection failed` with `discovery result`, including the phrase `successfully discovered` in the exception message, and giving advice on how to suppress the message.


DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.14.0 labels May 16, 2021

DaveCTurner requested a review from henningandersen May 16, 2021 10:32

elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 16, 2021

henningandersen approved these changes May 17, 2021


DaveCTurner added 3 commits May 17, 2021 09:23

Merge branch 'master' into 2021-05-16-increase-peerfinder-verbosity-o…

d698e2b

…n-persistent-failure

Include stack trace if DEBUG enabled

3f03284

Whitespace

993bace

DaveCTurner merged commit eabe2d1 into elastic:master May 17, 2021

DaveCTurner deleted the 2021-05-16-increase-peerfinder-verbosity-on-persistent-failure branch May 17, 2021 09:52

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

DaveCTurner mentioned this pull request Jan 2, 2022

[DOC] Document ES Loggers for Troubleshooting #82172

Closed

DaveCTurner mentioned this pull request Jan 27, 2022

Make PeerFinder log messages happier #83222

Merged
