Skip to content

Defer reroute when nodes join #42855

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

DaveCTurner
Copy link
Contributor

Today the master eagerly reroutes the cluster as part of processing node joins.
However, it is not necessary to do this reroute straight away, and it is
sometimes preferable to defer it until later. For instance, when the master
wins its election it processes joins and performs a reroute, but it would be
better to defer the reroute until after the master has become properly
established.

This change defers this reroute into a separate task.

Today the master eagerly reroutes the cluster as part of processing node joins.
However, it is not necessary to do this reroute straight away, and it is
sometimes preferable to defer it until later. For instance, when the master
wins its election it processes joins and performs a reroute, but it would be
better to defer the reroute until after the master has become properly
established.

This change defers this reroute into a separate task, and batches multiple such
tasks together.
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.3.0 labels Jun 4, 2019
@DaveCTurner DaveCTurner requested review from andrershov and ywelsch June 4, 2019 15:25
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor Author

@ywelsch I managed to synthesise a failure by changing some of the randomBoolean()s and rarely()s in IndicesClusterStateServiceRandomUpdatesTests#testRandomClusterStateUpdates, but the probability of hitting it seemed very low otherwise.

@@ -141,15 +142,17 @@
public Coordinator(String nodeName, Settings settings, ClusterSettings clusterSettings, TransportService transportService,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're at the point when the number of Coordinator constructor parameters are unmanageable. Can we possibly add a JavaDoc to the constructor describing how specific dependency is used by Coordinator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you mean, but most of them are of very specific types. I added docs for the ones whose types don't make their meaning so clear in d9dbf2d (including the one added here).

equalTo(false));

final AtomicBoolean stopRerouting = new AtomicBoolean();
final Thread rerouteThread = new Thread(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what reroute thread is doing here. I mean I understand that it continuously performs reroutes, but why is it needed in the test? Can you add the comment, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a simpler way to test the same thing now that I understand what's going on a bit better. See 563ea02.

@DaveCTurner DaveCTurner requested a review from andrershov June 6, 2019 15:15
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, looking good o.w.

@@ -1243,7 +1243,8 @@ public void start(ClusterState initialState) {
allocationService, masterService, () -> persistedState,
hostsResolver -> testClusterNodes.nodes.values().stream().filter(n -> n.node.isMasterNode())
.map(n -> n.node.getAddress()).collect(Collectors.toList()),
clusterService.getClusterApplierService(), Collections.emptyList(), random());
clusterService.getClusterApplierService(), Collections.emptyList(), random(),
s -> {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps plug in in the actual RoutingService here. This test cares about shards and realistic mocking, so I'm worried that a lot of effort will be spend in the future here to figure out why shards are not allocated.

@DaveCTurner DaveCTurner merged commit ddedf80 into elastic:master Jun 11, 2019
@DaveCTurner DaveCTurner deleted the 2019-06-04-deferred-reroute-on-join branch June 11, 2019 11:16
DaveCTurner added a commit that referenced this pull request Jun 11, 2019
Today the master eagerly reroutes the cluster as part of processing node joins.
However, it is not necessary to do this reroute straight away, and it is
sometimes preferable to defer it until later. For instance, when the master
wins its election it processes joins and performs a reroute, but it would be
better to defer the reroute until after the master has become properly
established.

This change defers this reroute into a separate task, and batches multiple such
tasks together.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.3.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants