Increase default value for cluster.routing.allocation.cluster_concurrent_rebalance #97750

idegtiarenko · 2023-07-18T11:32:04Z

Description

The cluster.routing.allocation.cluster_concurrent_rebalance setting limits the number of shards that can be rebalanced simultaneously. The default value is 2, which is reasonable for a small number of shards but becomes a bottleneck for bigger clusters (10+ nodes).

Since the new desired balance shard allocator is not affected by #87279 (effectively resolved by #93977), I believe we should change the default to allow big clusters to rebalance more quickly.

The new default could be set to:

10 (or any other arbitrary higher number). This will not resolve the issue completely but will move the bottleneck a little further.
Make it dependent on the cluster size (for example, allow 1 concurrent rebalance per every 2 nodes in the cluster, or introduce a new setting such as cluster.routing.allocation.node_concurrent_recoveries_per_node). This approach would scale the limit with the cluster size.
-1 (or unlimited). This way the bottleneck would be defined by the number of incoming/outgoing recoveries a node can sustain: cluster.routing.allocation.node_concurrent_incoming_recoveries / cluster.routing.allocation.node_concurrent_outgoing_recoveries. This is the most aggressive option, and it may delay necessary shard movements (such as hot->warm tier migration) because of already ongoing rebalances.
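
To compare the three options, here is a rough sketch of the cluster-wide limit each one would produce for a given node count. This is not Elasticsearch code: the function names are hypothetical, the per-node recovery values of 2 are an assumption used for illustration, and the last function only gives an upper bound that ignores where shards actually need to move.

```python
def fixed_limit(node_count: int, limit: int = 10) -> int:
    """Option 1: a fixed higher default, independent of cluster size."""
    return limit


def scaled_limit(node_count: int, nodes_per_rebalance: int = 2) -> int:
    """Option 2: one concurrent rebalance per `nodes_per_rebalance` nodes.

    Always allows at least one rebalance so tiny clusters are not blocked.
    """
    return max(1, node_count // nodes_per_rebalance)


def recovery_bound(
    node_count: int,
    incoming_per_node: int = 2,  # assumed value, for illustration only
    outgoing_per_node: int = 2,  # assumed value, for illustration only
) -> int:
    """Option 3 (-1 / unlimited): only per-node recovery limits apply.

    Each rebalance uses one outgoing slot on its source node and one
    incoming slot on its target node, so the total cannot exceed either
    the sum of incoming slots or the sum of outgoing slots.
    """
    return node_count * min(incoming_per_node, outgoing_per_node)


if __name__ == "__main__":
    for nodes in (3, 10, 50):
        print(nodes, fixed_limit(nodes), scaled_limit(nodes), recovery_bound(nodes))
```

With these assumptions, the fixed value stops scaling beyond roughly 20 nodes, while the other two grow linearly with the cluster.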

elasticsearchmachine · 2023-07-18T11:32:29Z

Pinging @elastic/es-distributed (Team:Distributed)

idegtiarenko · 2023-07-19T06:12:17Z

After discussing this with the team, we decided that we should limit the number of rebalances at the node level, similar to cluster.routing.allocation.node_concurrent_incoming_recoveries / cluster.routing.allocation.node_concurrent_outgoing_recoveries (with a default value of 1 per node), as this is the safest option.
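
A minimal sketch of what a per-node rebalance limit could look like, assuming a rebalance counts against both its source and its target node. The class and method names are hypothetical and do not correspond to the actual allocator implementation.

```python
from collections import Counter


class PerNodeRebalanceThrottle:
    """Tracks ongoing rebalances per node and rejects moves over the limit."""

    def __init__(self, max_per_node: int = 1) -> None:
        self.max_per_node = max_per_node
        self.active: Counter = Counter()  # node name -> ongoing rebalances

    def try_start(self, source: str, target: str) -> bool:
        """Start a rebalance only if both nodes are below the limit."""
        if (self.active[source] >= self.max_per_node
                or self.active[target] >= self.max_per_node):
            return False
        self.active[source] += 1
        self.active[target] += 1
        return True

    def finish(self, source: str, target: str) -> None:
        """Release the slots held by a completed rebalance."""
        self.active[source] -= 1
        self.active[target] -= 1


if __name__ == "__main__":
    throttle = PerNodeRebalanceThrottle(max_per_node=1)
    print(throttle.try_start("node-a", "node-b"))  # both nodes idle
    print(throttle.try_start("node-b", "node-c"))  # node-b is already busy
    print(throttle.try_start("node-c", "node-d"))  # unrelated pair
```

Under this scheme the cluster-wide concurrency grows with the number of nodes, while no single node takes part in more than max_per_node rebalances at once.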

idegtiarenko removed the team-discuss label Jul 19, 2023

DaveCTurner mentioned this issue Aug 1, 2023

Throttle recoveries on data nodes instead of master #98087

Open

9 tasks

madhava-sridhar mentioned this issue May 30, 2024

set default value of cluster.routing.allocation.cluster_concurrent_rebalance to -1 #109210

Closed
