Skip to content

Extend default probe connect/handshake timeouts #68059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

DaveCTurner
Copy link
Contributor

Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since #42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a ConnectTransportException, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a SslHandshakeTimeoutException that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.

Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since elastic#42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.12.0 labels Jan 27, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jan 27, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@DaveCTurner DaveCTurner merged commit 9c100cd into elastic:master Jan 27, 2021
@DaveCTurner DaveCTurner deleted the 2021-01-27-longer-handshake-timeout branch January 27, 2021 16:41
DaveCTurner added a commit that referenced this pull request Jan 27, 2021
Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since #42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.12.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants