-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Move ConnectionManager to async APIs #42636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-distributed |
@tbrooks8 (transport) @DaveCTurner (cluster coordination) this is ready for review now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested an extra test, but otherwise the bits I've marked 👍 LGTM.
server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/discovery/HandshakingTransportAddressConnector.java
Show resolved
Hide resolved
test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/NodeConnectionsServiceTests.java
Show resolved
Hide resolved
Transport.Connection connection = connectedNodes.get(node); | ||
if (connection != null) { | ||
assert connectingNodes.containsKey(node) == false; | ||
lock.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary since try/with/resources with release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to release this before calling the response listener.
return; | ||
} | ||
|
||
final List<ActionListener<Void>> connectionListeners = connectingNodes.computeIfAbsent(node, n -> new ArrayList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new ArrayList<>()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 3a0fc08
} finally { | ||
final Transport.Connection finalConnection = conn; | ||
conn.addCloseListener(ActionListener.wrap(() -> { | ||
logger.info("close listener called for node {}", node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't totally understand the value of this logging message. Especially since there might be multiple and other close listeners attached to this connection. This message seems to imply THIS is the close listener. It would make more sense to me if the message at lease reflected what this listener is going (deregistering the node and notify that it has disconnected).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was a leftover from a debugging session :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turned this into trace log at 3a0fc08
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cluster coordination side LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This commit converts the ConnectionManager's openConnection and connectToNode methods to async-style. This will allow us to not block threads anymore when opening connections. This PR also adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove some hacks in the test infrastructure that had to account for the previous synchronous nature of the connection APIs.
Since elastic#42636 we no longer treat connections specially when simulating a blackholed connection. This means that at the end of the safety phase we may have just started a connection attempt which will time out, but the default timeout is 30 seconds, much longer than the 2 seconds we normally allow for post-safety-phase discovery. This commit adds time for such a connection attempt to time out. It also fixes some spurious logging of `this` that now refers to an object with an unhelpful `toString()` implementation introduced in elastic#42636. Fixes elastic#44073
Since #42636 we no longer treat connections specially when simulating a blackholed connection. This means that at the end of the safety phase we may have just started a connection attempt which will time out, but the default timeout is 30 seconds, much longer than the 2 seconds we normally allow for post-safety-phase discovery. This commit adds time for such a connection attempt to time out. It also fixes some spurious logging of `this` that now refers to an object with an unhelpful `toString()` implementation introduced in #42636. Fixes #44073
Since #42636 we no longer treat connections specially when simulating a blackholed connection. This means that at the end of the safety phase we may have just started a connection attempt which will time out, but the default timeout is 30 seconds, much longer than the 2 seconds we normally allow for post-safety-phase discovery. This commit adds time for such a connection attempt to time out. It also fixes some spurious logging of `this` that now refers to an object with an unhelpful `toString()` implementation introduced in #42636. Fixes #44073
Today the discovery phase has a short 1-second timeout for handshaking with a remote node after connecting, which allows it to quickly move on and retry in the case of connecting to something that doesn't respond straight away (e.g. it isn't an Elasticsearch node). This short timeout was necessary when the component was first developed because each connection attempt would block a thread. Since elastic#42636 the connection attempt is now nonblocking so we can apply a more relaxed timeout. If transport security is enabled then our handshake timeout applies to the TLS handshake followed by the Elasticsearch handshake. If the TLS handshake alone takes over a second then the whole handshake times out with a `ConnectTransportException`, but this does not tell us which of the two individual handshakes took so long. TLS handshakes have their own 10-second timeout, which if reached yields a `SslHandshakeTimeoutException` that allows us to distinguish a problem at the TLS level from one at the Elasticsearch level. Therefore this commit extends the discovery probe timeouts.
Today the discovery phase has a short 1-second timeout for handshaking with a remote node after connecting, which allows it to quickly move on and retry in the case of connecting to something that doesn't respond straight away (e.g. it isn't an Elasticsearch node). This short timeout was necessary when the component was first developed because each connection attempt would block a thread. Since #42636 the connection attempt is now nonblocking so we can apply a more relaxed timeout. If transport security is enabled then our handshake timeout applies to the TLS handshake followed by the Elasticsearch handshake. If the TLS handshake alone takes over a second then the whole handshake times out with a `ConnectTransportException`, but this does not tell us which of the two individual handshakes took so long. TLS handshakes have their own 10-second timeout, which if reached yields a `SslHandshakeTimeoutException` that allows us to distinguish a problem at the TLS level from one at the Elasticsearch level. Therefore this commit extends the discovery probe timeouts.
Today the discovery phase has a short 1-second timeout for handshaking with a remote node after connecting, which allows it to quickly move on and retry in the case of connecting to something that doesn't respond straight away (e.g. it isn't an Elasticsearch node). This short timeout was necessary when the component was first developed because each connection attempt would block a thread. Since #42636 the connection attempt is now nonblocking so we can apply a more relaxed timeout. If transport security is enabled then our handshake timeout applies to the TLS handshake followed by the Elasticsearch handshake. If the TLS handshake alone takes over a second then the whole handshake times out with a `ConnectTransportException`, but this does not tell us which of the two individual handshakes took so long. TLS handshakes have their own 10-second timeout, which if reached yields a `SslHandshakeTimeoutException` that allows us to distinguish a problem at the TLS level from one at the Elasticsearch level. Therefore this commit extends the discovery probe timeouts.
This PR converts the ConnectionManager's
openConnection
andconnectToNode
methods to async-style. This will allow us to not block threads anymore when opening connections. This PR also adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove some hacks in the test infrastructure that had to account for the previous synchronous nature of the connection APIs.