
Increase default value for cluster.routing.allocation.cluster_concurrent_rebalance #97750


Open
idegtiarenko opened this issue Jul 18, 2023 · 2 comments
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@idegtiarenko
Contributor

idegtiarenko commented Jul 18, 2023

Description

The cluster.routing.allocation.cluster_concurrent_rebalance setting limits the number of shards that can be rebalanced simultaneously. The default value is 2, which is reasonable for a small number of shards but becomes a bottleneck for bigger clusters (10+ nodes).
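
For context, operators can already override this limit per cluster through the cluster settings API. A minimal sketch that only builds the JSON body for a `PUT /_cluster/settings` request; the helper name `rebalance_settings_body` and the value 10 are illustrative assumptions, not part of this proposal:

```python
import json

# Setting name as given in this issue.
SETTING = "cluster.routing.allocation.cluster_concurrent_rebalance"


def rebalance_settings_body(limit: int, persistent: bool = True) -> str:
    """Build a JSON body for PUT /_cluster/settings (hypothetical helper).

    No request is sent; the caller decides how to deliver the body.
    """
    scope = "persistent" if persistent else "transient"
    return json.dumps({scope: {SETTING: limit}}, indent=2)


if __name__ == "__main__":
    # Example: raise the limit from the default of 2 to an arbitrary 10.
    print(rebalance_settings_body(10))
```

This is a per-cluster workaround; the issue is about changing what clusters get without any override.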

Since the new desired balance shard allocator is not affected by #87279 (effectively resolved by #93977), I believe we should change the default to allow big clusters to rebalance more quickly.

The new default could be set to:

  • 10 (or any other higher arbitrary number). This will not resolve the issue completely but will move the bottleneck a little further.
  • Make it dependent on the cluster size (for example, allow 1 concurrent rebalance per every 2 nodes in the cluster, or introduce a new setting such as cluster.routing.allocation.node_concurrent_recoveries_per_node). This approach would scale the limit with the cluster size.
  • -1 (or unlimited). This way the bottleneck would be defined by the number of incoming/outgoing recoveries a node can sustain: cluster.routing.allocation.node_concurrent_incoming_recoveries / cluster.routing.allocation.node_concurrent_outgoing_recoveries. This is the most aggressive option, and it may delay necessary shard movements (such as hot->warm tier migration) because of already ongoing rebalances.
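
To compare the three options, here is a rough sketch of the cluster-wide cap each one would imply. The function `cluster_rebalance_cap`, its option names, and the assumed per-node recovery limits of 2 are illustrative only; this is not Elasticsearch code, and the "unlimited" figure is just an upper bound derived from the per-node recovery slots.

```python
def cluster_rebalance_cap(
    option: str,
    node_count: int,
    fixed: int = 10,               # option 1: arbitrary higher constant
    nodes_per_rebalance: int = 2,  # option 2: 1 rebalance per every 2 nodes
    incoming: int = 2,             # assumed node_concurrent_incoming_recoveries
    outgoing: int = 2,             # assumed node_concurrent_outgoing_recoveries
) -> int:
    """Return an approximate cap on concurrent rebalances (hypothetical helper)."""
    if option == "fixed":
        return fixed
    if option == "scaled":
        # Never drop below one rebalance, even on a single-node cluster.
        return max(1, node_count // nodes_per_rebalance)
    if option == "unlimited":
        # Every move needs one outgoing slot on the source and one incoming
        # slot on the target, so the scarcer slot type bounds the total.
        return node_count * min(incoming, outgoing)
    raise ValueError(f"unknown option: {option}")


if __name__ == "__main__":
    for nodes in (3, 10, 50):
        caps = {o: cluster_rebalance_cap(o, nodes) for o in ("fixed", "scaled", "unlimited")}
        print(nodes, caps)
```

With these assumptions, a 50-node cluster would be capped at 10, 25, or 100 concurrent rebalances respectively, which shows why the fixed value only moves the bottleneck.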
@idegtiarenko idegtiarenko added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) team-discuss Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Jul 18, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@idegtiarenko
Contributor Author

After discussing this with the team, we decided that we should limit the number of rebalances at the node level, similar to cluster.routing.allocation.node_concurrent_incoming_recoveries / cluster.routing.allocation.node_concurrent_outgoing_recoveries (with a default value of 1 per node), as this is the safest option.
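
A minimal sketch of how such a per-node limit could gate new rebalances. The function `can_start_rebalance` and the choice to count a move against both its source and its target node are assumptions made for illustration; the actual design may count them differently.

```python
from collections import Counter
from typing import Iterable, Tuple


def can_start_rebalance(
    in_flight: Iterable[Tuple[str, str]],
    source: str,
    target: str,
    per_node_limit: int = 1,  # default of 1 per node, as proposed above
) -> bool:
    """Return True if moving a shard from source to target stays within
    the per-node limit (hypothetical helper, not Elasticsearch code).

    in_flight holds (source_node, target_node) pairs for ongoing rebalances.
    """
    per_node = Counter()
    for src, dst in in_flight:
        # Assumption: each rebalance occupies one slot on both of its nodes.
        per_node[src] += 1
        per_node[dst] += 1
    return per_node[source] < per_node_limit and per_node[target] < per_node_limit


if __name__ == "__main__":
    ongoing = [("node-a", "node-b")]
    print(can_start_rebalance(ongoing, "node-a", "node-c"))  # node-a is busy
    print(can_start_rebalance(ongoing, "node-c", "node-d"))  # both nodes idle
```

Under this scheme the cluster-wide concurrency grows with the number of nodes without any single node taking part in more than one rebalance at a time.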


2 participants