Increase PeerFinder verbosity on persistent failure #73128

DaveCTurner
Contributor

If a node is partitioned away from the rest of the cluster then the
ClusterFormationFailureHelper periodically reports that it cannot
discover the expected collection of nodes, but does not indicate why. To
prove it's a connectivity problem, users must today restart the node
with DEBUG logging on org.elasticsearch.discovery.PeerFinder to see
further details.

With this commit we log messages at WARN level if the node remains
disconnected for longer than a configurable timeout, which defaults to 5
minutes.
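
The escalation described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `VerbosityEscalation`, its methods, and the explicit `nowMillis` clock parameter are all hypothetical names chosen for the example.

```java
import java.time.Duration;

// Hypothetical sketch: names are illustrative and are not PeerFinder's real API.
final class VerbosityEscalation {
    enum Level { DEBUG, WARN }

    private final Duration warnTimeout;   // e.g. 5 minutes, the default described above
    private long activatedAtMillis = -1;  // -1 means discovery is not currently running

    VerbosityEscalation(Duration warnTimeout) {
        this.warnTimeout = warnTimeout;
    }

    // Record the time at which the node started looking for peers.
    void activate(long nowMillis) {
        activatedAtMillis = nowMillis;
    }

    // Reset once the node has found the peers it needs.
    void deactivate() {
        activatedAtMillis = -1;
    }

    // Choose the level for discovery messages: DEBUG normally,
    // WARN once discovery has been running for longer than the timeout.
    Level levelFor(long nowMillis) {
        if (activatedAtMillis < 0) {
            return Level.DEBUG;
        }
        long elapsedMillis = nowMillis - activatedAtMillis;
        return elapsedMillis > warnTimeout.toMillis() ? Level.WARN : Level.DEBUG;
    }
}
```

Passing the clock value in explicitly keeps the sketch deterministic, so the timeout behaviour can be exercised without waiting five minutes.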

Relates #72968

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.14.0 labels May 16, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Contributor

@henningandersen henningandersen left a comment

LGTM, thanks for improving this.

"connection failed",
"org.elasticsearch.discovery.PeerFinder",
Level.DEBUG,
"*connection failed*"));
Contributor

nit: now that we validate the message, it would be nice to show that it contains both the transport address and the exception message `cannot connect to`.

Contributor Author

Done in 3f03284.

"connection failed",
"org.elasticsearch.discovery.PeerFinder",
Level.WARN,
"*connection failed: cannot connect to*"));
Contributor

nit: now that we validate the message, it would be nice to show that it contains the transport address.

Contributor Author

Done in 3f03284.
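
To illustrate the point of these two review threads, here is a small self-contained sketch of `*` wildcard matching. It is an assumption-laden stand-in, not the matcher the test framework actually uses: `SimpleGlob` is a hypothetical helper, and the log line in the usage note is made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: treats '*' as "any run of characters" and everything else literally.
final class SimpleGlob {
    static boolean matches(String glob, String text) {
        StringBuilder regex = new StringBuilder();
        boolean first = true;
        // split with limit -1 keeps the empty pieces around leading/trailing '*'
        for (String literal : glob.split("\\*", -1)) {
            if (!first) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literal));
            first = false;
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(text).matches();
    }
}
```

With an invented message such as `address [10.0.0.1:12345] connection failed: cannot connect to [10.0.0.1:12345]`, the pattern `*connection failed*` matches, but so would it for any other address. A pattern like `*10.0.0.1:12345*connection failed: cannot connect to*` only matches when the expected transport address is present, which is the tighter check the reviewer asked for.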

@DaveCTurner DaveCTurner merged commit eabe2d1 into elastic:master May 17, 2021
@DaveCTurner
Contributor Author

Thanks Henning

@DaveCTurner DaveCTurner deleted the 2021-05-16-increase-peerfinder-verbosity-on-persistent-failure branch May 17, 2021 09:52
DaveCTurner added a commit that referenced this pull request May 17, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 27, 2022
Since elastic#73128 a sufficiently old `PeerFinder` will report all exceptions
encountered during discovery to help diagnose cluster formation
problems. We throw exceptions on genuine connection failures, but we
also throw exceptions if the discovered node is the local node or is
master-ineligible because these nodes are no use in discovery. We report
all such exceptions as failures:

    [instance-0000000001]
        address [10.0.0.1:12345], node [null], requesting [false]
        connection failed:
            [instance-0000000002][10.0.0.1:12345]
            non-master-eligible node found

Experience shows that users often have master-ineligible nodes in their
discovery config so will see these messages frequently if the cluster
cannot form, and may interpret the `connection failed` as the source of
the problems even though they're benign.

This commit adjusts the language in these messages to be more balanced,
replacing `connection failed` with `discovery result`, including the
phrase `successfully discovered` in the exception message, and giving
advice on how to suppress the message.
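
A rough sketch of the rewording idea follows. The `DiscoveryResultMessages` class, its `Outcome` enum, and the exact message wording are all hypothetical and are not copied from the commit. Only the phrases `discovery result` and `successfully discovered` come from the description above.

```java
// Hypothetical sketch: illustrates neutral wording for benign discovery outcomes.
final class DiscoveryResultMessages {
    enum Outcome { CONNECT_FAILED, LOCAL_NODE, MASTER_INELIGIBLE }

    static String describe(String address, Outcome outcome) {
        final String detail;
        switch (outcome) {
            case LOCAL_NODE:
                // Benign: the address resolved to this node itself.
                detail = "successfully discovered the local node; to suppress this message, "
                    + "remove [" + address + "] from the discovery configuration";
                break;
            case MASTER_INELIGIBLE:
                // Benign: the node is reachable but cannot help elect a master.
                detail = "successfully discovered a master-ineligible node; to suppress this message, "
                    + "remove [" + address + "] from the discovery configuration";
                break;
            default:
                // Genuine connectivity problem.
                detail = "cannot connect";
        }
        // The neutral label replaces the old "connection failed" prefix.
        return "address [" + address + "], discovery result: " + detail;
    }
}
```

Keeping the label neutral and putting the success or failure in the detail means a reader scanning the logs no longer sees `connection failed` next to a node that was in fact reached.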
DaveCTurner added a commit that referenced this pull request Jan 28, 2022
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.14.0 v8.0.0-alpha1