Skip to content

Move ConnectionManager to async APIs #42636

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 5, 2019
Merged

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented May 28, 2019

This PR converts the ConnectionManager's openConnection and connectToNode methods to async-style. This will allow us to not block threads anymore when opening connections. This PR also adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove some hacks in the test infrastructure that had to account for the previous synchronous nature of the connection APIs.

@ywelsch ywelsch added >enhancement WIP :Distributed Coordination/Network Http and internode communication implementations :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.3.0 labels May 28, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch removed the WIP label Jun 25, 2019
@ywelsch ywelsch requested review from DaveCTurner and Tim-Brooks and removed request for DaveCTurner June 25, 2019 07:40
@ywelsch
Copy link
Contributor Author

ywelsch commented Jun 25, 2019

@tbrooks8 (transport) @DaveCTurner (cluster coordination) this is ready for review now.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested an extra test, but otherwise the bits I've marked 👍 LGTM.

@jpountz jpountz added v7.4.0 and removed v7.3.0 labels Jul 3, 2019
Transport.Connection connection = connectedNodes.get(node);
if (connection != null) {
assert connectingNodes.containsKey(node) == false;
lock.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary since try/with/resources with release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to release this before calling the response listener.

return;
}

final List<ActionListener<Void>> connectionListeners = connectingNodes.computeIfAbsent(node, n -> new ArrayList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new ArrayList<>()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 3a0fc08

} finally {
final Transport.Connection finalConnection = conn;
conn.addCloseListener(ActionListener.wrap(() -> {
logger.info("close listener called for node {}", node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't totally understand the value of this logging message. Especially since there might be multiple and other close listeners attached to this connection. This message seems to imply THIS is the close listener. It would make more sense to me if the message at lease reflected what this listener is going (deregistering the node and notify that it has disconnected).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was a leftover from a debugging session :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turned this into trace log at 3a0fc08

@ywelsch ywelsch requested review from DaveCTurner and Tim-Brooks July 4, 2019 14:47
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cluster coordination side LGTM

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywelsch ywelsch merged commit bca865d into elastic:master Jul 5, 2019
ywelsch added a commit that referenced this pull request Jul 5, 2019
This commit converts the ConnectionManager's openConnection and connectToNode methods to
async-style. This will allow us to not block threads anymore when opening connections. This PR also
adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove
some hacks in the test infrastructure that had to account for the previous synchronous nature of the
connection APIs.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 8, 2019
Since elastic#42636 we no longer treat connections specially when simulating a
blackholed connection. This means that at the end of the safety phase we may
have just started a connection attempt which will time out, but the default
timeout is 30 seconds, much longer than the 2 seconds we normally allow for
post-safety-phase discovery. This commit adds time for such a connection
attempt to time out.

It also fixes some spurious logging of `this` that now refers to an object with
an unhelpful `toString()` implementation introduced in elastic#42636.

Fixes elastic#44073
DaveCTurner added a commit that referenced this pull request Jul 9, 2019
Since #42636 we no longer treat connections specially when simulating a
blackholed connection. This means that at the end of the safety phase we may
have just started a connection attempt which will time out, but the default
timeout is 30 seconds, much longer than the 2 seconds we normally allow for
post-safety-phase discovery. This commit adds time for such a connection
attempt to time out.

It also fixes some spurious logging of `this` that now refers to an object with
an unhelpful `toString()` implementation introduced in #42636.

Fixes #44073
DaveCTurner added a commit that referenced this pull request Jul 9, 2019
Since #42636 we no longer treat connections specially when simulating a
blackholed connection. This means that at the end of the safety phase we may
have just started a connection attempt which will time out, but the default
timeout is 30 seconds, much longer than the 2 seconds we normally allow for
post-safety-phase discovery. This commit adds time for such a connection
attempt to time out.

It also fixes some spurious logging of `this` that now refers to an object with
an unhelpful `toString()` implementation introduced in #42636.

Fixes #44073
@colings86 colings86 removed the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Aug 30, 2019
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 27, 2021
Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since elastic#42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
DaveCTurner added a commit that referenced this pull request Jan 27, 2021
Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since #42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
DaveCTurner added a commit that referenced this pull request Jan 27, 2021
Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since #42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants