Skip rebalancing when cluster_concurrent_rebalance threshold reached #33329

Merged: ywelsch merged 2 commits into elastic:master on Sep 17, 2018

Conversation

@Bukhtawar (Contributor) commented Sep 1, 2018

Follow-up from #27628
This adds a pre-emptive check during shard relocation: when the number of currently relocating shards already meets or exceeds cluster_concurrent_rebalance, we skip rebalancing outright instead of iterating over all shards only to return a THROTTLE decision for each one. This results in faster shard iteration.
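The check described above can be sketched as a small standalone predicate (the names here are illustrative, not the actual Elasticsearch implementation, which lives in ConcurrentRebalanceAllocationDecider):

```java
// Simplified sketch of the pre-emptive check described in this PR.
// Assumed, hypothetical names; not the real Elasticsearch code.
class RebalanceCheck {
    /**
     * Returns true when rebalancing should be skipped outright because
     * the number of currently relocating shards has already reached the
     * cluster_concurrent_rebalance threshold. A threshold of -1 means
     * unlimited concurrent rebalances, so rebalancing is never skipped.
     */
    static boolean shouldSkipRebalance(int relocatingShards, int clusterConcurrentRebalance) {
        if (clusterConcurrentRebalance == -1) {
            return false; // unlimited concurrent rebalances allowed
        }
        return relocatingShards >= clusterConcurrentRebalance;
    }
}
```

The point of the optimization is that this constant-time check runs once per allocation round, before the O(shards) rebalance loop is entered at all.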

Benchmarking

  • Data nodes (i3.8xlarge) AWS EC2
  • Master node (c4.8xlarge) AWS EC2
  • Shards (count 25k, 2500 indices, 5 primary, 1 replica, size 5gb)
  • Cluster settings
    "indices.recovery.max_bytes_per_sec" : "300mb"
    "cluster.routing.allocation.node_concurrent_recoveries" : "4"
    "cluster.routing.allocation.cluster_concurrent_rebalance" : "2"
  • OS version 4.9.38-16.35.amzn1.x86_64
  • JRE 1.8
  • Performed relocation from 100 data nodes to 100 data nodes.
    Without the optimization, time spent on rebalance was 732ms (tp90) and total time spent in allocation was 4820ms (tp90). This optimization cuts down the time spent in rebalance, which is roughly 15% of the time spent in a single iteration made by the master.

@dnhatn dnhatn added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 1, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed


@Override
public Decision canRebalance(RoutingAllocation allocation) {
return canRebalance(null, allocation);
Review comment on this diff:

instead of passing null here, can you move the implementation to this method and then call this method from canRebalance(ShardRouting shardRouting, RoutingAllocation allocation), similar to what was done for ClusterRebalanceAllocationDecider.
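The delegation pattern the reviewer suggests can be sketched as follows (a simplified, self-contained version with assumed types; the real methods take ShardRouting and RoutingAllocation and return a Decision):

```java
// Hedged sketch of the suggested refactoring: the per-shard overload
// delegates to the cluster-wide overload, which holds the actual logic,
// so no null argument is ever passed. Names and types are simplified
// stand-ins for the real Elasticsearch decider API.
class ConcurrentRebalanceDecider {
    private final int clusterConcurrentRebalance;

    ConcurrentRebalanceDecider(int clusterConcurrentRebalance) {
        this.clusterConcurrentRebalance = clusterConcurrentRebalance;
    }

    /** Per-shard overload: simply delegates to the cluster-wide check. */
    String canRebalance(String shardRouting, int relocatingShards) {
        return canRebalance(relocatingShards);
    }

    /** Cluster-wide overload: holds the implementation. */
    String canRebalance(int relocatingShards) {
        if (clusterConcurrentRebalance == -1) {
            return "YES"; // unlimited concurrent rebalances allowed
        }
        if (relocatingShards >= clusterConcurrentRebalance) {
            return "THROTTLE"; // threshold reached, skip rebalancing
        }
        return "YES";
    }
}
```

Moving the logic into the no-shard overload keeps both call sites honest: callers that have a shard and callers that only have the allocation state go through the same code path.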

@ywelsch (Contributor) commented Sep 12, 2018

@elasticmachine test this please

@Bukhtawar (Contributor, Author) commented:

Hi @ywelsch,
I believe the build failures are unrelated to the change. Please let me know if they need a fix.

@ywelsch (Contributor) left a review:
LGTM

@ywelsch (Contributor) commented Sep 13, 2018

> I believe the build failures are unrelated to the change. Please do let me know if they need a fix

Builds are currently flaky, not related to this PR though. I'll take care of merging this once our build stabilizes again.

@Bukhtawar (Contributor, Author) commented:

Thanks @ywelsch. As nothing is pending at our end, we'll get started on the next PR.

@ywelsch ywelsch changed the title [Relocation optimization] Skip iterating over all shards during rebalance when shards are relocating full THROTTLE Skip rebalancing when cluster_concurrent_rebalance threshold reached Sep 17, 2018
@ywelsch ywelsch merged commit 14d57c1 into elastic:master Sep 17, 2018
ywelsch pushed a commit that referenced this pull request Sep 17, 2018
…33329)

Allows to skip shard balancing when the cluster_concurrent_rebalance threshold is already reached, which cuts down the time spent in the rebalance method of BalancedShardsAllocator.
Labels: :Distributed Coordination/Allocation, >enhancement, v6.4.1, v6.5.0, v7.0.0-beta1
5 participants