Only accept transport requests after node is fully initialized #16746

bleskes · 2016-02-21T15:54:06Z

At the moment we open up the transport service before the local node has been fully initialized. This causes bug as some data structures are not fully initialized yet. See for example #16723.

Sadly, we can't just start the TransportService last (as we do with the HTTP server) because the ClusterService needs to know the bound published network address for the local DiscoveryNode. This address can only be determined by actually binding (people may use, for example, port 0). Instead we start the TransportService as late as possible but block any incoming requests until the node has completed initialization.

A couple of other cleanup during start time:

The gateway service now starts before the initial cluster join so we can simplify the logic to recover state if the local node has become master.
The discovery is started before the transport service accepts requests, but we only start the join process later using a dedicated method.

Closes #16723

s1monw · 2016-02-21T18:00:53Z

core/src/main/java/org/elasticsearch/transport/TransportService.java

@@ -71,7 +72,8 @@

    public static final String DIRECT_RESPONSE_PROFILE = ".direct";

-    private final AtomicBoolean started = new AtomicBoolean(false);


Can we make this a simple CountDownLatch that way we know it's only going in one direction (never move back to blocking), we get the wait for free and don't need to suppress forbidden API?

s1monw · 2016-02-21T18:11:37Z

@bleskes after looking at InternalClusterService I would love to explore detaching TransportSerivce I think we have a good chance to get it fixed if we:

move the LocalNode stuff into TransportService
add the reconnect task into a seperate class that is started seperately and depends on both transportservice and ClusterService
for the connect stuff we do before we notify listeners should be maybe a listener too? I think it's a messed up dependency in ClusterService?

WDYT

bleskes · 2016-02-21T21:52:42Z

@s1monw I think these are excellent ideas and I will explore them - I totally agree we should decouple the cluster state service from the transport service. There are some issues to figure out in order to solve the issue where the transport service needs to bind the socket in order to make a DiscoNode , which is in turn used in the very first cluster state. I pushed another commit simplifying things here even further. I think it starts to be simple enough and we can push this without doing a bigger refactoring. It's a judgement call and I defer to your judgement. I'm good with any decision.

s1monw · 2016-02-22T22:42:54Z

@bleskes I think we should do what we have for 2.x and open a followup for master. I mean we can push this to master and make the followup removing this one?

bleskes · 2016-02-22T23:14:26Z

@s1monw will do. Thanks.

…c#16746 We should open up the node to the world when it's as ready as possiblAt the moment we open up the transport service before the local node has been fully initialized. This causes bug as some data structures are not fully initialized yet. See for example elastic#16723. Sadly, we can't just start the TransportService last (as we do with the HTTP server) because the ClusterService needs to know the bound published network address for the local DiscoveryNode. This address can only be determined by actually binding (people may use, for example, port 0). Instead we start the TransportService as late as possible but block any incoming requests until the node has completed initialization. A couple of other cleanup during start time: 1) The gateway service now starts before the initial cluster join so we can simplify the logic to recover state if the local node has become master. 2) The discovery is started before the transport service accepts requests, but we only start the join process later using a dedicated method. Closes elastic#16723 Closes elastic#16746

We should open up the node to the world when it's as ready as possiblAt the moment we open up the transport service before the local node has been fully initialized. This causes bug as some data structures are not fully initialized yet. See for example #16723. Sadly, we can't just start the TransportService last (as we do with the HTTP server) because the ClusterService needs to know the bound published network address for the local DiscoveryNode. This address can only be determined by actually binding (people may use, for example, port 0). Instead we start the TransportService as late as possible but block any incoming requests until the node has completed initialization. A couple of other cleanup during start time: 1) The gateway service now starts before the initial cluster join so we can simplify the logic to recover state if the local node has become master. 2) The discovery is started before the transport service accepts requests, but we only start the join process later using a dedicated method. Closes #16723 Closes #16746

Today we bind to our transport address(es) very early in the startup of a node so that we know the addresses to which we're bound, even though we are not yet ready to handle any requests. If we receive a request in this state then we throw an `IllegalStateException` which results in a logged warning and the connection being closed. In practice, this happens straight away since the first request on the connection, the handshake, is sent as soon as it's open. With this commit we instead quietly close the connection straight away, even before any requests are received, avoiding the noisy logging. Relates elastic#44939 Relates elastic#16746 Closes elastic#61356

Today we bind to our transport address(es) very early in the startup of a node so that we know the addresses to which we're bound, even though we are not yet ready to handle any requests. If we receive a request in this state then we throw an `IllegalStateException` which results in a logged warning and the connection being closed. In practice, this happens straight away since the first request on the connection, the handshake, is sent as soon as it's open. This commit introduces a `TransportNotReadyException` for this specific case, and suppresses the noisy logging on such exceptions. Relates elastic#44939 Relates elastic#16746 Closes elastic#61356

Today we bind to our transport address(es) very early in the startup of a node so that we know the addresses to which we're bound, even though we are not yet ready to handle any requests. If we receive a request in this state then we throw an `IllegalStateException` which results in a logged warning and the connection being closed. In practice, this happens straight away since the first request on the connection, the handshake, is sent as soon as it's open. This commit introduces a `TransportNotReadyException` for this specific case, and suppresses the noisy logging on such exceptions. Relates #44939 Relates #16746 Closes #61356

bleskes added >bug :Distributed Coordination/Network Http and internode communication implementations v5.0.0-alpha1 v2.3.0 labels Feb 21, 2016

bleskes mentioned this pull request Feb 21, 2016

NullPointerException in InternalClusterService.add() using TransportClient #16723

Closed

s1monw reviewed Feb 21, 2016
View reviewed changes

bleskes force-pushed the transport_start_last branch from 8eb2314 to 5a91ad1 Compare February 24, 2016 02:02

bleskes merged commit 5a91ad1 into elastic:master Feb 24, 2016

bleskes mentioned this pull request Feb 24, 2016

Decouple our ClusterService and TransportService #16788

Closed

bleskes deleted the transport_start_last branch February 24, 2016 02:10

DaveCTurner mentioned this pull request Mar 1, 2021

Suppress warning on request if transport not ready #69686

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Only accept transport requests after node is fully initialized #16746

Only accept transport requests after node is fully initialized #16746

Uh oh!

bleskes commented Feb 21, 2016

Uh oh!

s1monw Feb 21, 2016

Uh oh!

s1monw commented Feb 21, 2016

Uh oh!

bleskes commented Feb 21, 2016

Uh oh!

s1monw commented Feb 22, 2016

Uh oh!

bleskes commented Feb 22, 2016

Uh oh!

Uh oh!

		@@ -71,7 +72,8 @@

		public static final String DIRECT_RESPONSE_PROFILE = ".direct";

		private final AtomicBoolean started = new AtomicBoolean(false);

Only accept transport requests after node is fully initialized #16746

Only accept transport requests after node is fully initialized #16746

Uh oh!

Conversation

bleskes commented Feb 21, 2016

Uh oh!

s1monw Feb 21, 2016

Choose a reason for hiding this comment

Uh oh!

s1monw commented Feb 21, 2016

Uh oh!

bleskes commented Feb 21, 2016

Uh oh!

s1monw commented Feb 22, 2016

Uh oh!

bleskes commented Feb 22, 2016

Uh oh!

Uh oh!