Avoid blocking a thread waiting for connections #40150
Labels: :Distributed Coordination/Network (Http and internode communication implementations), >enhancement, Meta, resiliency
Today we block a thread waiting for connections to open. Threads are a precious resource, and opening a connection can be time-consuming if the remote node is unresponsive. Although #39629 mostly alleviates the effects seen in #28920, it is still possible that a poorly-timed attempt by the NodeConnectionsService to reconnect to all the known nodes in the cluster state could saturate the small-yet-important management threadpool in a network partition.
In #29023 we suggested creating a dedicated threadpool for connections, but then the work in #35144 brought us closer to being able to open these connections asynchronously and the idea of introducing a dedicated threadpool was dropped. However, it's not yet possible to open a connection fully asynchronously, so there is still a risk of saturating a threadpool during a network partition.
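To make the saturation risk concrete, here is a minimal sketch that uses only the JDK, not Elasticsearch code. `SlowNode`, `connectBlocking`, `connectAsync` and the two-thread pool are all hypothetical stand-ins: the pool plays the role of a small threadpool, and each `SlowNode` is a remote node that does not answer until told to.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

final class ConnectSketch {
    // Hypothetical stand-in for an unresponsive remote node.
    static final class SlowNode {
        final CompletableFuture<String> handshake = new CompletableFuture<>();
        void respond() { handshake.complete("connected"); }
    }

    // Blocking style: the calling thread is parked until the node responds.
    static String connectBlocking(SlowNode node) throws Exception {
        return node.handshake.get();
    }

    // Async style: returns immediately; the callback runs once the node responds.
    static void connectAsync(SlowNode node, Consumer<String> onConnected) {
        node.handshake.thenAccept(onConnected);
    }

    // Returns true if an unrelated task can still run on the pool
    // while connection attempts to unresponsive nodes are pending.
    static boolean poolStaysResponsive(boolean blocking) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2); // small pool (assumed size)
        SlowNode[] nodes = { new SlowNode(), new SlowNode() };
        try {
            for (SlowNode node : nodes) {
                if (blocking) {
                    Callable<String> attempt = () -> connectBlocking(node);
                    pool.submit(attempt);
                } else {
                    Runnable attempt = () -> connectAsync(node, result -> {});
                    pool.submit(attempt);
                }
            }
            Future<String> other = pool.submit(() -> "unrelated work");
            try {
                other.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // every pool thread is stuck waiting for a connection
            }
        } finally {
            for (SlowNode node : nodes) {
                node.respond(); // release any parked threads so the pool can shut down
            }
            pool.shutdown();
        }
    }
}
```

With blocking attempts, both pool threads are parked and the unrelated task times out. With the callback style, each attempt returns at once and the pool remains free for other work.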
To avoid losing track of this, here is a meta-issue which tracks the remaining places that need to work asynchronously:
- ConnectionManager#internalOpenConnection, ConnectionManager#openConnection and ConnectionManager#connectToNode (Move ConnectionManager to async APIs #42636)
- TransportService#connectToNode (Move ConnectionManager to async APIs #42636)
- HandshakingTransportAddressConnector#connectToRemoteMasterNode (Move ConnectionManager to async APIs #42636)
- NodeConnectionsService#ConnectionTarget (Make NodeConnectionsService non-blocking #44211)
- Coordinator#handleJoinRequest (Move ConnectionManager to async APIs #42636)
- RemoteClusterConnection#ConnectHandler (Asynchronously connect to remote clusters #44825)
In each case there are quite a few tests that will need adjusting, so I think it makes sense to break the work up like this.
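As a rough illustration of how one such method could be migrated in steps, the sketch below gives a listener-based variant plus a blocking adapter for callers and tests that have not been converted yet. It is an assumption-laden example and not the actual Elasticsearch change: the `Listener` interface, both method signatures and the `channelOpened` future are hypothetical and only loosely resemble the real APIs.

```java
import java.util.concurrent.CompletableFuture;

final class MigrationSketch {
    // Hypothetical callback type; not the real Elasticsearch listener interface.
    interface Listener<T> {
        void onResponse(T value);
        void onFailure(Exception e);
    }

    // Async variant: never blocks the caller. The listener is notified
    // once the (simulated) channel has opened or failed to open.
    static void connectToNode(String node, CompletableFuture<Void> channelOpened, Listener<String> listener) {
        channelOpened.whenComplete((ignored, error) -> {
            if (error == null) {
                listener.onResponse(node);
            } else {
                listener.onFailure(new Exception("failed to connect to " + node, error));
            }
        });
    }

    // Blocking adapter for callers that have not been migrated yet:
    // it wraps the async variant and waits for the outcome.
    static String connectToNodeBlocking(String node, CompletableFuture<Void> channelOpened) throws Exception {
        CompletableFuture<String> result = new CompletableFuture<>();
        connectToNode(node, channelOpened, new Listener<String>() {
            @Override
            public void onResponse(String value) { result.complete(value); }

            @Override
            public void onFailure(Exception e) { result.completeExceptionally(e); }
        });
        return result.get();
    }
}
```

Keeping an adapter like this lets each caller and its tests move to the listener-based form in a separate change, and the adapter can be deleted once no blocking callers remain.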
Connections are also opened by the transport client, but it seems less important to open these connections asynchronously.