Defer reroute when nodes join #42855

DaveCTurner · 2019-06-04T15:25:16Z

Today the master eagerly reroutes the cluster as part of processing node joins. However, it is not necessary to do this reroute straight away, and it is sometimes preferable to defer it until later. For instance, when the master wins its election it processes joins and performs a reroute, but it would be better to defer the reroute until after the master has become properly established. This change defers this reroute into a separate task, and batches multiple such tasks together.
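The deferral and batching described above can be illustrated with a minimal, single-threaded sketch. It is not the code from this PR: `BatchedRerouteSketch`, its task queue, and the `doReroute` callback are hypothetical stand-ins chosen for illustration, and the real master service is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical sketch of deferring and batching reroutes; not the Elasticsearch implementation. */
class BatchedRerouteSketch {
    // Stand-in for the master's task queue (single-threaded here, so no locking is shown).
    private final Queue<Runnable> taskQueue = new ArrayDeque<>();
    // Reasons recorded since the last reroute actually ran.
    private final List<String> pendingReasons = new ArrayList<>();
    // Stand-in for the real reroute computation; receives every batched reason at once.
    private final Consumer<List<String>> doReroute;

    BatchedRerouteSketch(Consumer<List<String>> doReroute) {
        this.doReroute = doReroute;
    }

    /** Called while processing a join: records the request but does not reroute yet. */
    void requestReroute(String reason) {
        boolean alreadyScheduled = pendingReasons.isEmpty() == false;
        pendingReasons.add(reason);
        if (alreadyScheduled == false) {
            // Only the first request in a batch enqueues a task; later ones piggyback on it.
            taskQueue.add(this::runPendingReroute);
        }
    }

    private void runPendingReroute() {
        List<String> reasons = new ArrayList<>(pendingReasons);
        pendingReasons.clear();
        doReroute.accept(reasons);
    }

    /** Drains the queue, standing in for the master running its queued tasks later on. */
    void runQueuedTasks() {
        Runnable task;
        while ((task = taskQueue.poll()) != null) {
            task.run();
        }
    }
}
```

In this sketch, two calls to `requestReroute` followed by one `runQueuedTasks()` result in a single invocation of `doReroute` carrying both reasons, which is the batching behaviour the description refers to.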

elasticmachine · 2019-06-04T15:25:22Z

Pinging @elastic/es-distributed

DaveCTurner · 2019-06-05T06:30:06Z

@ywelsch I managed to synthesise a failure by changing some of the randomBoolean()s and rarely()s in IndicesClusterStateServiceRandomUpdatesTests#testRandomClusterStateUpdates, but the probability of hitting it seemed very low otherwise.

andrershov · 2019-06-05T15:53:57Z

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

@@ -141,15 +142,17 @@
    public Coordinator(String nodeName, Settings settings, ClusterSettings clusterSettings, TransportService transportService,


I think we're at the point where the number of Coordinator constructor parameters is unmanageable. Can we add a Javadoc to the constructor describing how each dependency is used by Coordinator?

I know what you mean, but most of them are of very specific types. I added docs for the ones whose types don't make their meaning so clear in d9dbf2d (including the one added here).

andrershov · 2019-06-05T15:55:38Z

server/src/test/java/org/elasticsearch/cluster/SimpleDataNodesIT.java

+            equalTo(false));
+
+        final AtomicBoolean stopRerouting = new AtomicBoolean();
+        final Thread rerouteThread = new Thread(() -> {


I'm not sure I understand what the reroute thread is doing here. I understand that it continuously performs reroutes, but why is it needed in the test? Can you add a comment, please?

I found a simpler way to test the same thing now that I understand what's going on a bit better. See 563ea02.
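For readers unfamiliar with the pattern in the diff fragment above, here is a generic sketch of a background thread that repeats an action until an `AtomicBoolean` flag is set. It does not reproduce the body of the original test, which is not shown in this thread; `StopFlagThreadSketch`, `runUntilStopped`, and the latch are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a stop-flag worker thread; not the code from SimpleDataNodesIT. */
class StopFlagThreadSketch {
    /** Runs the action repeatedly on a background thread, then stops it; returns the iteration count. */
    static int runUntilStopped(Runnable action) {
        final AtomicBoolean stop = new AtomicBoolean();
        final AtomicInteger iterations = new AtomicInteger();
        // Lets the caller wait until the worker has completed at least one iteration.
        final CountDownLatch firstIterationDone = new CountDownLatch(1);

        final Thread worker = new Thread(() -> {
            while (stop.get() == false) {
                action.run();
                iterations.incrementAndGet();
                firstIterationDone.countDown();
            }
        });
        worker.start();
        try {
            firstIterationDone.await(); // the code under test would run concurrently here
            stop.set(true);             // ask the worker to finish its current iteration and exit
            worker.join();              // wait for it so nothing leaks past the caller
        } catch (InterruptedException e) {
            stop.set(true);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return iterations.get();
    }
}
```

The flag is checked once per iteration, so the worker always completes the action it has started before exiting.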

ywelsch

One nit; looking good otherwise.

ywelsch · 2019-06-11T07:54:04Z

server/src/test/java/org/elasticsearch/snapshots/SnapshotResiliencyTests.java

@@ -1243,7 +1243,8 @@ public void start(ClusterState initialState) {
                allocationService, masterService, () -> persistedState,
                hostsResolver -> testClusterNodes.nodes.values().stream().filter(n -> n.node.isMasterNode())
                    .map(n -> n.node.getAddress()).collect(Collectors.toList()),
-                clusterService.getClusterApplierService(), Collections.emptyList(), random());
+                clusterService.getClusterApplierService(), Collections.emptyList(), random(),
+                s -> {});


Perhaps plug in the actual RoutingService here. This test cares about shards and realistic mocking, so I'm worried that a lot of effort will be spent in the future figuring out why shards are not allocated.
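The concern can be illustrated with a small sketch that contrasts a no-op callback such as `s -> {}` with a callback backed by a real service. Everything here is hypothetical: `FakeRoutingService` and `JoinHandler` are illustrative stand-ins, and the assumption that the new constructor argument is a `Consumer<String>` taking a reroute reason is inferred from the lambda in the diff, not confirmed by it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch contrasting a no-op reroute callback with a real one. */
class RerouteCallbackSketch {

    /** Illustrative stand-in for a routing service: a reroute assigns every unassigned shard. */
    static class FakeRoutingService {
        final List<String> unassigned = new ArrayList<>(List.of("shard-0", "shard-1"));
        final List<String> assigned = new ArrayList<>();

        void reroute(String reason) {
            assigned.addAll(unassigned);
            unassigned.clear();
        }
    }

    /** Illustrative stand-in for a component that requests a reroute after a node joins. */
    static class JoinHandler {
        // Assumed shape of the callback: it receives a human-readable reason.
        private final Consumer<String> reroute;

        JoinHandler(Consumer<String> reroute) {
            this.reroute = reroute;
        }

        void onNodeJoined(String nodeName) {
            reroute.accept("post-join reroute for " + nodeName);
        }
    }
}
```

With `new JoinHandler(reason -> {})` the join is processed but no shard is ever assigned, whereas `new JoinHandler(service::reroute)` assigns both shards. A test that depends on shard allocation would silently stall in the first case, which is the failure mode the review comment warns about.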

DaveCTurner added 3 commits June 4, 2019 13:58

Use RoutingService

2c58586

Imports

257a280

DaveCTurner added >enhancement, :Distributed Coordination/Cluster Coordination, v8.0.0, v7.3.0 labels Jun 4, 2019

DaveCTurner requested review from andrershov and ywelsch June 4, 2019 15:25

Adjust auto-expand replicas when adding nodes

3a1c7d8

DaveCTurner added 2 commits June 5, 2019 07:55

Imports

5ae4bee

Revert some unnecessary changes

a7c5b5a

andrershov reviewed Jun 5, 2019

DaveCTurner added 3 commits June 6, 2019 15:54

Merge branch 'master' into 2019-06-04-deferred-reroute-on-join

eb5a1d9

Javadocs

d9dbf2d

Simplify test

563ea02

DaveCTurner requested a review from andrershov June 6, 2019 15:15

Imports

a091de5

ywelsch approved these changes Jun 11, 2019

DaveCTurner added 2 commits June 11, 2019 09:17

Merge branch 'master' into 2019-06-04-deferred-reroute-on-join

c05c9f9

Use real routing service

c1d8edb

DaveCTurner merged commit ddedf80 into elastic:master Jun 11, 2019

DaveCTurner deleted the 2019-06-04-deferred-reroute-on-join branch June 11, 2019 11:16

shwetathareja mentioned this pull request Jun 2, 2020

[Optimization]: During reroute async fetch data in GatewayAllocator, send request in generic threadpool instead of masterService#updateTask #57498

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
