Skip rebalancing when cluster_concurrent_rebalance threshold reached #33329
…ating full throttle
Pinging @elastic/es-distributed
@Override
public Decision canRebalance(RoutingAllocation allocation) {
    return canRebalance(null, allocation);
Instead of passing null here, can you move the implementation to this method and then call this method from canRebalance(ShardRouting shardRouting, RoutingAllocation allocation), similar to what was done for ClusterRebalanceAllocationDecider?
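The refactoring requested above can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: Decision, RoutingAllocation and ShardRouting here are simplified stand-ins for the real Elasticsearch classes, relocatingShardCount() and the class name ConcurrentRebalanceDeciderSketch are hypothetical, and the >= comparison is an assumption based on the PR title ("threshold reached").

```java
// Simplified stand-ins for illustration only; NOT the real Elasticsearch types.
enum Decision { YES, THROTTLE }

final class ShardRouting { }

final class RoutingAllocation {
    private final int relocatingShards;

    RoutingAllocation(int relocatingShards) {
        this.relocatingShards = relocatingShards;
    }

    // Hypothetical accessor: number of shards currently relocating in the cluster.
    int relocatingShardCount() {
        return relocatingShards;
    }
}

class ConcurrentRebalanceDeciderSketch {
    private final int clusterConcurrentRebalance;

    ConcurrentRebalanceDeciderSketch(int clusterConcurrentRebalance) {
        this.clusterConcurrentRebalance = clusterConcurrentRebalance;
    }

    // The cluster-level overload owns the implementation; it does not need a shard.
    public Decision canRebalance(RoutingAllocation allocation) {
        // Assumption: throttle once the relocation count reaches the limit.
        if (allocation.relocatingShardCount() >= clusterConcurrentRebalance) {
            return Decision.THROTTLE;
        }
        return Decision.YES;
    }

    // The per-shard overload delegates, so no caller ever has to pass null.
    public Decision canRebalance(ShardRouting shardRouting, RoutingAllocation allocation) {
        return canRebalance(allocation);
    }
}
```

With this shape the shard argument is simply ignored by the per-shard overload, and both entry points are guaranteed to give the same answer.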
@elasticmachine test this please
Hi @ywelsch
LGTM
Builds are currently flaky, not related to this PR though. I'll take care of merging this once our build stabilizes again.
Thanks @ywelsch,
…33329) Allows to skip shard balancing when the cluster_concurrent_rebalance threshold is already reached, which cuts down the time spent in the rebalance method of BalancedShardsAllocator.
Follow-up from #27628
This is a pre-emptive check during shard relocation. Most of the time during relocation, the number of relocating shards has already reached cluster_concurrent_rebalance; in that case we now skip rebalancing instead of iterating over all the shards only to return a THROTTLE decision. This results in faster shard iteration.
Benchmarking
"indices.recovery.max_bytes_per_sec" : "300mb"
"cluster.routing.allocation.node_concurrent_recoveries" : "4"
"cluster.routing.allocation.cluster_concurrent_rebalance" : "2"
4.9.38-16.35.amzn1.x86_64
1.8
Without the optimization, tp90 time spent in rebalance is 732ms and tp90 total time spent in allocation is 4820ms. This optimization cuts down the time spent in rebalance, which is roughly 15% (732ms of 4820ms) of the time spent in a single iteration made by the master.
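The pre-emptive check described in this PR can be illustrated with a small standalone sketch. It is not the BalancedShardsAllocator code: the class RebalanceSkipSketch, its methods, and the use of plain strings for shards are all hypothetical, and the >= comparison is an assumption. The point is only that one cluster-level check replaces a full pass over the shards when the limit is already reached.

```java
import java.util.Arrays;
import java.util.List;

class RebalanceSkipSketch {
    enum Decision { YES, THROTTLE }

    // Hypothetical cluster-level check: independent of any particular shard.
    static Decision clusterLevelDecision(int relocatingShards, int clusterConcurrentRebalance) {
        return relocatingShards >= clusterConcurrentRebalance ? Decision.THROTTLE : Decision.YES;
    }

    // Returns how many shards were examined, so the saving is observable.
    static int rebalance(List<String> shards, int relocatingShards, int clusterConcurrentRebalance) {
        // Pre-emptive check: if the cluster is already throttled, skip the whole pass.
        if (clusterLevelDecision(relocatingShards, clusterConcurrentRebalance) == Decision.THROTTLE) {
            return 0;
        }
        int examined = 0;
        for (String shard : shards) {
            // Per-shard balancing work would happen here.
            examined++;
        }
        return examined;
    }

    public static void main(String[] args) {
        List<String> shards = Arrays.asList("shard-0", "shard-1", "shard-2");
        // Limit of 2 with 2 shards already relocating: the pass is skipped.
        System.out.println(rebalance(shards, 2, 2));
        // Limit of 2 with 1 shard relocating: all shards are examined.
        System.out.println(rebalance(shards, 1, 2));
    }
}
```

Without the early return, the same THROTTLE outcome would only be discovered shard by shard, which is the cost the benchmark numbers above attribute to the rebalance step.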