Move ConnectionManager to async APIs #42636

ywelsch · 2019-05-28T15:22:55Z

This PR converts the ConnectionManager's openConnection and connectToNode methods to async-style, so that we no longer block threads when opening connections. It also adapts the cluster coordination subsystem to use the new async APIs, which lets us remove some hacks in the test infrastructure that had to account for the previously synchronous nature of the connection APIs.
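For readers unfamiliar with the pattern, here is a minimal sketch of the difference between a blocking call and a listener-based one. It is not the Elasticsearch code: `AsyncOpenSketch`, its `Listener` interface (a stand-in for `ActionListener`), the `Executor` parameter, and the string "connections" are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Executor;

/** Hypothetical illustration only; not the actual ConnectionManager API. */
final class AsyncOpenSketch {

    /** Stand-in for a listener type such as ActionListener (assumption). */
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    /** Blocking style: the calling thread waits until the connection is established. */
    static String openConnectionBlocking(String node) throws Exception {
        return connect(node);
    }

    /** Async style: returns immediately; the outcome is delivered to the listener. */
    static void openConnection(String node, Executor executor, Listener<String> listener) {
        executor.execute(() -> {
            final String connection;
            try {
                connection = connect(node);
            } catch (Exception e) {
                listener.onFailure(e);
                return;
            }
            // Called outside the try block so a listener exception is not
            // mistaken for a connection failure.
            listener.onResponse(connection);
        });
    }

    /** Pretend connection attempt; fails for an empty node name. */
    private static String connect(String node) throws Exception {
        if (node.isEmpty()) {
            throw new IllegalArgumentException("no node given");
        }
        return "connection-to-" + node;
    }
}
```

With a thread-pool executor the caller is free to carry on while the connection attempt runs elsewhere; passing `Runnable::run` as the executor runs everything inline, which is convenient in deterministic tests.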

elasticmachine · 2019-05-28T15:22:56Z

Pinging @elastic/es-distributed

ywelsch · 2019-06-25T07:41:00Z

@tbrooks8 (transport) @DaveCTurner (cluster coordination) this is ready for review now.

DaveCTurner

Requested an extra test, but otherwise the bits I've marked 👍 LGTM.

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

server/src/main/java/org/elasticsearch/discovery/HandshakingTransportAddressConnector.java

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

server/src/test/java/org/elasticsearch/cluster/NodeConnectionsServiceTests.java

Tim-Brooks · 2019-07-04T00:12:21Z

server/src/main/java/org/elasticsearch/transport/ConnectionManager.java

+            Transport.Connection connection = connectedNodes.get(node);
+            if (connection != null) {
+                assert connectingNodes.containsKey(node) == false;
+                lock.close();


Unnecessary, since the try-with-resources block already releases the lock.

I wanted to release this before calling the response listener.

Tim-Brooks · 2019-07-04T00:14:10Z

server/src/main/java/org/elasticsearch/transport/ConnectionManager.java

+                return;
+            }
+
+            final List<ActionListener<Void>> connectionListeners = connectingNodes.computeIfAbsent(node, n -> new ArrayList());


new ArrayList<>()

fixed in 3a0fc08
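The two threads above touch on the same pattern: concurrent callers for one node share a single connection attempt by queuing their listeners, and listeners are notified only after the lock has been released. A rough sketch of that idea follows, under loud assumptions: `PendingConnectSketch`, its `synchronized` blocks, the `Consumer<String>` listeners, and the `onConnected` callback are hypothetical simplifications, not the code under review.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical, simplified illustration; not the actual ConnectionManager code. */
final class PendingConnectSketch {
    private final Map<String, String> connectedNodes = new HashMap<>();
    private final Map<String, List<Consumer<String>>> connectingNodes = new HashMap<>();

    /**
     * Registers the listener for the node. Returns true if this caller is the
     * first one and should therefore start the actual connection attempt.
     */
    boolean connectToNode(String node, Consumer<String> listener) {
        final String existing;
        final boolean startAttempt;
        synchronized (this) {
            existing = connectedNodes.get(node);
            if (existing == null) {
                List<Consumer<String>> listeners =
                    connectingNodes.computeIfAbsent(node, n -> new ArrayList<>());
                startAttempt = listeners.isEmpty();
                listeners.add(listener);
            } else {
                startAttempt = false;
            }
        }
        // The lock is released before the listener runs, so listener code
        // can never execute while the lock is held.
        if (existing != null) {
            listener.accept(existing);
        }
        return startAttempt;
    }

    /** Called when the single in-flight attempt for the node has succeeded. */
    void onConnected(String node, String connection) {
        final List<Consumer<String>> listeners;
        synchronized (this) {
            connectedNodes.put(node, connection);
            listeners = connectingNodes.remove(node);
        }
        if (listeners != null) {
            for (Consumer<String> l : listeners) {
                l.accept(connection);
            }
        }
    }
}
```

Failure handling is omitted here; a fuller version would also drain the pending listeners with a failure notification when the attempt fails.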

Tim-Brooks · 2019-07-04T00:23:54Z

server/src/main/java/org/elasticsearch/transport/ConnectionManager.java

+                            } finally {
+                                final Transport.Connection finalConnection = conn;
+                                conn.addCloseListener(ActionListener.wrap(() -> {
+                                    logger.info("close listener called for node {}", node);


I don't totally understand the value of this logging message, especially since there might be multiple other close listeners attached to this connection. This message seems to imply THIS is the close listener. It would make more sense to me if the message at least reflected what this listener is doing (deregistering the node and notifying that it has disconnected).

this was a leftover from a debugging session :)

turned this into trace log at 3a0fc08

DaveCTurner

Cluster coordination side LGTM

Tim-Brooks

LGTM

Since elastic#42636 we no longer treat connections specially when simulating a blackholed connection. This means that at the end of the safety phase we may have just started a connection attempt which will time out, but the default timeout is 30 seconds, much longer than the 2 seconds we normally allow for post-safety-phase discovery. This commit adds time for such a connection attempt to time out. It also fixes some spurious logging of `this` that now refers to an object with an unhelpful `toString()` implementation introduced in elastic#42636. Fixes elastic#44073

Today the discovery phase has a short 1-second timeout for handshaking with a remote node after connecting, which allows it to quickly move on and retry in the case of connecting to something that doesn't respond straight away (e.g. it isn't an Elasticsearch node). This short timeout was necessary when the component was first developed because each connection attempt would block a thread. Since elastic#42636 the connection attempt is now nonblocking so we can apply a more relaxed timeout. If transport security is enabled then our handshake timeout applies to the TLS handshake followed by the Elasticsearch handshake. If the TLS handshake alone takes over a second then the whole handshake times out with a `ConnectTransportException`, but this does not tell us which of the two individual handshakes took so long. TLS handshakes have their own 10-second timeout, which if reached yields a `SslHandshakeTimeoutException` that allows us to distinguish a problem at the TLS level from one at the Elasticsearch level. Therefore this commit extends the discovery probe timeouts.
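The interaction between the two timers can be sketched as a small model. This is only an illustration of the reasoning above: `HandshakeTimeoutSketch` and its `outcome` method are hypothetical, the model assumes both timers start when the handshake begins, and the 30-second overall timeout used in the usage note is an arbitrary example value, not the value chosen by the commit.

```java
/** Hypothetical model of the two handshake timers; not Elasticsearch code. */
final class HandshakeTimeoutSketch {

    enum Outcome { SUCCESS, OVERALL_TIMEOUT, TLS_TIMEOUT }

    /**
     * @param tlsMillis            time the TLS handshake would need
     * @param esMillis             time the Elasticsearch handshake would need
     * @param overallTimeoutMillis timeout covering TLS plus Elasticsearch handshake
     * @param tlsTimeoutMillis     timeout covering the TLS handshake alone
     */
    static Outcome outcome(long tlsMillis, long esMillis,
                           long overallTimeoutMillis, long tlsTimeoutMillis) {
        // Assumption: both timers start together, so whichever deadline is
        // shorter fires first while TLS is still in progress.
        long firstDeadline = Math.min(tlsTimeoutMillis, overallTimeoutMillis);
        if (tlsMillis > firstDeadline) {
            return tlsTimeoutMillis < overallTimeoutMillis
                ? Outcome.TLS_TIMEOUT        // analogous to SslHandshakeTimeoutException
                : Outcome.OVERALL_TIMEOUT;   // analogous to ConnectTransportException
        }
        if (tlsMillis + esMillis > overallTimeoutMillis) {
            return Outcome.OVERALL_TIMEOUT;
        }
        return Outcome.SUCCESS;
    }
}
```

In this model, `outcome(1500, 10, 1000, 10000)` yields `OVERALL_TIMEOUT`, the uninformative case described above, whereas with a hypothetical 30-second overall timeout a stuck TLS handshake, `outcome(15000, 10, 30000, 10000)`, yields `TLS_TIMEOUT` and so identifies the slow layer.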

async

44d9a73

ywelsch added the >enhancement, WIP, :Distributed Coordination/Network, :Distributed Coordination/Cluster Coordination, v8.0.0 and v7.3.0 labels May 28, 2019

ywelsch added 2 commits May 29, 2019 16:24

more tests

4206391

checkstyl

ef7d407

jaymode mentioned this pull request May 31, 2019

[CI] org.elasticsearch.node.NodeTests times out repeatedly on Windows #42350

Closed

ywelsch added 4 commits May 31, 2019 21:03

fix test

3709db6

Merge remote-tracking branch 'elastic/master' into async-handshake

516d60e

revert change

d00cbef

add test and assertions

81f3e45

ywelsch removed the WIP label Jun 25, 2019

ywelsch requested review from DaveCTurner and Tim-Brooks and removed request for DaveCTurner June 25, 2019 07:40

DaveCTurner reviewed Jul 3, 2019

jpountz added v7.4.0 and removed v7.3.0 labels Jul 3, 2019

Tim-Brooks reviewed Jul 4, 2019

ywelsch added 3 commits July 4, 2019 16:15

Merge remote-tracking branch 'elastic/master' into async-handshake

9a1399c

undo change

39d12cc

tim feedback

3a0fc08

ywelsch requested review from DaveCTurner and Tim-Brooks July 4, 2019 14:47

DaveCTurner approved these changes Jul 4, 2019

Tim-Brooks reviewed Jul 5, 2019

ywelsch merged commit bca865d into elastic:master Jul 5, 2019

DaveCTurner mentioned this pull request Jul 8, 2019

Wait for blackholed connection before discovery #44077

Merged

DaveCTurner mentioned this pull request Jul 11, 2019

Avoid blocking a thread waiting for connections #40150

Closed

6 tasks

colings86 removed the :Distributed Coordination/Cluster Coordination label Aug 30, 2019

DaveCTurner mentioned this pull request Jan 27, 2021

Extend probe handshake timeout if security enabled #68048

Closed

DaveCTurner mentioned this pull request Jan 27, 2021

Extend default probe connect/handshake timeouts #68059

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
