Skip rebalancing when cluster_concurrent_rebalance threshold reached #33329

Merged: ywelsch merged 2 commits into elastic:master on Sep 17, 2018

Conversation

@Bukhtawar (Contributor) commented Sep 1, 2018

Follow-up from #27628
This adds a pre-emptive check during shard relocation: when the number of currently relocating shards already meets or exceeds cluster_concurrent_rebalance, we skip rebalancing outright instead of iterating over all shards only to return a THROTTLE decision for each one. This results in faster shard iteration.
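The check described above can be sketched as a small standalone predicate (the names here are illustrative, not the actual Elasticsearch implementation, which lives in ConcurrentRebalanceAllocationDecider):

```java
// Simplified sketch of the pre-emptive check described in this PR.
// Assumed, hypothetical names; not the real Elasticsearch code.
class RebalanceCheck {
    /**
     * Returns true when rebalancing should be skipped outright because
     * the number of currently relocating shards has already reached the
     * cluster_concurrent_rebalance threshold. A threshold of -1 means
     * unlimited concurrent rebalances, so rebalancing is never skipped.
     */
    static boolean shouldSkipRebalance(int relocatingShards, int clusterConcurrentRebalance) {
        if (clusterConcurrentRebalance == -1) {
            return false; // unlimited concurrent rebalances allowed
        }
        return relocatingShards >= clusterConcurrentRebalance;
    }
}
```

The point of the optimization is that this constant-time check runs once per allocation round, before the O(shards) rebalance loop is entered at all.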

Benchmarking

  • Data nodes (i3.8xlarge) AWS EC2
  • Master node (c4.8xlarge) AWS EC2
  • Shards (count 25k, 2500 indices, 5 primary, 1 replica, size 5gb)
  • Cluster settings
    "indices.recovery.max_bytes_per_sec" : "300mb"
    "cluster.routing.allocation.node_concurrent_recoveries" : "4"
    "cluster.routing.allocation.cluster_concurrent_rebalance" : "2"
  • OS version 4.9.38-16.35.amzn1.x86_64
  • JRE 1.8
  • Performed relocation from 100 data nodes to 100 data nodes.
    Without the optimization, time spent on rebalance was 732ms (tp90) and total time spent in allocation was 4820ms (tp90). This optimization cuts down the time spent in rebalance, which is roughly 15% of the time spent in a single iteration made by the master.

@dnhatn dnhatn added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 1, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed


@Override
public Decision canRebalance(RoutingAllocation allocation) {
return canRebalance(null, allocation);
Review comment on this diff:

instead of passing null here, can you move the implementation to this method and then call this method from canRebalance(ShardRouting shardRouting, RoutingAllocation allocation), similar to what was done for ClusterRebalanceAllocationDecider.
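The delegation pattern the reviewer suggests can be sketched as follows (a simplified, self-contained version with assumed types; the real methods take ShardRouting and RoutingAllocation and return a Decision):

```java
// Hedged sketch of the suggested refactoring: the per-shard overload
// delegates to the cluster-wide overload, which holds the actual logic,
// so no null argument is ever passed. Names and types are simplified
// stand-ins for the real Elasticsearch decider API.
class ConcurrentRebalanceDecider {
    private final int clusterConcurrentRebalance;

    ConcurrentRebalanceDecider(int clusterConcurrentRebalance) {
        this.clusterConcurrentRebalance = clusterConcurrentRebalance;
    }

    /** Per-shard overload: simply delegates to the cluster-wide check. */
    String canRebalance(String shardRouting, int relocatingShards) {
        return canRebalance(relocatingShards);
    }

    /** Cluster-wide overload: holds the implementation. */
    String canRebalance(int relocatingShards) {
        if (clusterConcurrentRebalance == -1) {
            return "YES"; // unlimited concurrent rebalances allowed
        }
        if (relocatingShards >= clusterConcurrentRebalance) {
            return "THROTTLE"; // threshold reached, skip rebalancing
        }
        return "YES";
    }
}
```

Moving the logic into the no-shard overload keeps both call sites honest: callers that have a shard and callers that only have the allocation state go through the same code path.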

@ywelsch (Contributor) commented Sep 12, 2018

@elasticmachine test this please

@Bukhtawar (Contributor, Author) commented:

Hi @ywelsch,
I believe the build failures are unrelated to the change. Please let me know if they need a fix.

@ywelsch (Contributor) left a review:
LGTM

@ywelsch (Contributor) commented Sep 13, 2018

> I believe the build failures are unrelated to the change. Please do let me know if they need a fix

Builds are currently flaky, not related to this PR though. I'll take care of merging this once our build stabilizes again.

@Bukhtawar (Contributor, Author) commented:

Thanks @ywelsch. As nothing is pending at our end, we'll get started on the next PR.

@ywelsch ywelsch changed the title [Relocation optimization] Skip iterating over all shards during rebalance when shards are relocating full THROTTLE Skip rebalancing when cluster_concurrent_rebalance threshold reached Sep 17, 2018
@ywelsch ywelsch merged commit 14d57c1 into elastic:master Sep 17, 2018
ywelsch pushed a commit that referenced this pull request Sep 17, 2018
…33329)

Allows to skip shard balancing when the cluster_concurrent_rebalance threshold is already reached, which cuts down the time spent in the rebalance method of BalancedShardsAllocator.
Labels: :Distributed Coordination/Allocation, >enhancement, v6.4.1, v6.5.0, v7.0.0-beta1
5 participants