
Increasing cluster.routing.allocation.cluster_concurrent_rebalance causes redundant shard movements #87279


Open
DaveCTurner opened this issue Jun 1, 2022 · 6 comments
Labels: >bug, :Distributed Coordination/Allocation, Team:Distributed (Obsolete)

Comments

DaveCTurner (Contributor) commented Jun 1, 2022

If cluster.routing.allocation.cluster_concurrent_rebalance is increased above the default of 2 then the balancer will sometimes make unnecessary rebalancing movements which can drastically increase the time it takes for the cluster to reach a balanced state. It's likely that something similar happens if you increase cluster.routing.allocation.node_concurrent_recoveries and friends too. For example, if you start with the following routing table ...

-----node_id[node-0][V]
--------[index-0][0], node[node-0], [P], s[STARTED], a[id=Sf5zRgAAQACYIgoIAAAAAA]
--------[index-0][1], node[node-0], [P], s[STARTED], a[id=mL05qf__T_-F4Iem_____w]
--------[index-1][0], node[node-0], [P], s[STARTED], a[id=riXHcQAAQACiAIRJAAAAAA]
--------[index-1][1], node[node-0], [P], s[STARTED], a[id=3p7gWwAAQACAV_1aAAAAAA]
-----node_id[node-1][V]
--------[index-0][2], node[node-1], [P], s[STARTED], a[id=sDzZ0___T_-4clZCAAAAAA]
--------[index-0][3], node[node-1], [P], s[STARTED], a[id=RFYlyv__T_-PnVozAAAAAA]
--------[index-1][2], node[node-1], [P], s[STARTED], a[id=4MbZxP__T_-WXmVSAAAAAA]
--------[index-1][3], node[node-1], [P], s[STARTED], a[id=pQwaKAAAQACUTMDZ_____w]
-----node_id[node-2][V]
--------[index-2][0], node[node-2], [P], s[STARTED], a[id=C2Hn7P__T_-I4nQUAAAAAA]

... then a call to reroute() will relocate some shards onto node-2 but will also relocate this node's sole shard onto a different node:

-----node_id[node-0][V]
--------[index-0][0], node[node-0], [P], s[STARTED], a[id=Sf5zRgAAQACYIgoIAAAAAA]
--------[index-0][1], node[node-0], relocating [node-2], [P], s[RELOCATING], a[id=mL05qf__T_-F4Iem_____w, rId=OHLwbBZ_Sy-Re7yBy_yDSQ]
--------[index-1][0], node[node-0], [P], s[STARTED], a[id=riXHcQAAQACiAIRJAAAAAA]
--------[index-1][1], node[node-0], [P], s[STARTED], a[id=3p7gWwAAQACAV_1aAAAAAA]
-----node_id[node-1][V]
--------[index-2][0], node[node-1], relocating [node-2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=Gjys3XrdRCy58HQVMXXFGg, rId=C2Hn7P__T_-I4nQUAAAAAA]
--------[index-0][2], node[node-1], [P], s[STARTED], a[id=sDzZ0___T_-4clZCAAAAAA]
--------[index-0][3], node[node-1], relocating [node-2], [P], s[RELOCATING], a[id=RFYlyv__T_-PnVozAAAAAA, rId=F1s3moLuTJOZeuQTCzSS3g]
--------[index-1][2], node[node-1], [P], s[STARTED], a[id=4MbZxP__T_-WXmVSAAAAAA]
--------[index-1][3], node[node-1], [P], s[STARTED], a[id=pQwaKAAAQACUTMDZ_____w]
-----node_id[node-2][V]
--------[index-2][0], node[node-2], relocating [node-1], [P], s[RELOCATING], a[id=C2Hn7P__T_-I4nQUAAAAAA, rId=Gjys3XrdRCy58HQVMXXFGg]
--------[index-0][1], node[node-2], relocating [node-0], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=OHLwbBZ_Sy-Re7yBy_yDSQ, rId=mL05qf__T_-F4Iem_____w]
--------[index-0][3], node[node-2], relocating [node-1], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=F1s3moLuTJOZeuQTCzSS3g, rId=RFYlyv__T_-PnVozAAAAAA]

In more detail, the balancer simulates throttled shard movements onto the emptier node, and these movements improve both shard and index balance. However, this process seems to overshoot a little, which applies pressure to move shards off the emptier node again, and because those movements are not throttled they are triggered straight away.

Start balancing cluster
Balancing from node [node-1] weight: [0.8166666] to node [node-2] weight: [-1.6333333]  delta: [2.4499998]
Try relocating shard of [index-0] from [node-1] to [node-2]
Relocate [[index-0][3], node[node-1], [P], s[STARTED], a[id=RFYlyv__T_-PnVozAAAAAA]] from [node-1] to [node-2]
Balancing from node [node-0] weight: [0.8166666] to node [node-2] weight: [-0.6333333]  delta: [1.4499999]
Try relocating shard of [index-0] from [node-0] to [node-2]
Relocate [[index-0][1], node[node-0], [P], s[STARTED], a[id=mL05qf__T_-F4Iem_____w]] from [node-0] to [node-2]
Stop balancing index [index-0]  min_node [node-2] weight: [0.36666664]  max_node [node-1] weight: [-0.18333335]  delta: [0.55]
Balancing from node [node-0] weight: [0.36666664] to node [node-2] weight: [-0.73333335]  delta: [1.1]
Try relocating shard of [index-1] from [node-0] to [node-2]
Simulate relocation of [[index-1][1], node[node-0], [P], s[STARTED], a[id=3p7gWwAAQACAV_1aAAAAAA]] from [node-0] to [node-2]
Balancing from node [node-1] weight: [0.36666664] to node [node-2] weight: [-0.73333335]  delta: [1.1]
Try relocating shard of [index-1] from [node-1] to [node-2]
Simulate relocation of [[index-1][3], node[node-1], [P], s[STARTED], a[id=pQwaKAAAQACUTMDZ_____w]] from [node-1] to [node-2]
Balancing from node [node-2] weight: [1.2666667] to node [node-1] weight: [-0.6333333]  delta: [1.9]
Try relocating shard of [index-2] from [node-2] to [node-1]
Relocate [[index-2][0], node[node-2], [P], s[STARTED], a[id=C2Hn7P__T_-I4nQUAAAAAA]] from [node-2] to [node-1]
Stop balancing index [index-2]  min_node [node-1] weight: [0.36666664]  max_node [node-0] weight: [-0.6333333]  delta: [1.0]
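
For reference, the weight values in this trace come from the balancer's weight function. Here is a rough, self-contained sketch of that calculation (class and parameter names are illustrative, not the real implementation, and it assumes the default balance factors cluster.routing.allocation.balance.shard = 0.45 and cluster.routing.allocation.balance.index = 0.55); it reproduces both the first balancing decision above and the later step where the overshoot pushes index-2's shard off node-2:

public class WeightFunctionSketch {
    static final float SHARD_BALANCE_FACTOR = 0.45f; // cluster.routing.allocation.balance.shard default
    static final float INDEX_BALANCE_FACTOR = 0.55f; // cluster.routing.allocation.balance.index default

    // Lower weight = emptier node, i.e. a better target for relocations of this index's shards.
    static float weight(int shardsOnNode, float avgShardsPerNode, int indexShardsOnNode, float avgIndexShardsPerNode) {
        final float sum = SHARD_BALANCE_FACTOR + INDEX_BALANCE_FACTOR;
        final float theta0 = SHARD_BALANCE_FACTOR / sum; // overall shard-count component
        final float theta1 = INDEX_BALANCE_FACTOR / sum; // per-index component
        return theta0 * (shardsOnNode - avgShardsPerNode) + theta1 * (indexShardsOnNode - avgIndexShardsPerNode);
    }

    public static void main(String[] args) {
        // Initial state: node-0 and node-1 hold 4 shards each, node-2 holds 1 (9 shards, avg 3.0);
        // index-0 has 2 shards on node-1 and 0 on node-2 (4 shards, avg 4/3 per node).
        System.out.println(weight(4, 9f / 3, 2, 4f / 3)); // ≈  0.8166666: node-1 weight for index-0
        System.out.println(weight(1, 9f / 3, 0, 4f / 3)); // ≈ -1.6333333: node-2 weight for index-0

        // After the two real relocations and two simulated ones, node-2 notionally holds 5 shards,
        // including the only shard of index-2 (1 shard, avg 1/3 per node).
        System.out.println(weight(5, 9f / 3, 1, 1f / 3)); // ≈  1.2666667: node-2 weight for index-2
        System.out.println(weight(2, 9f / 3, 0, 1f / 3)); // ≈ -0.6333333: node-1 weight for index-2
    }
}

A node's weight for an index is positive when it holds more than its fair share, so once the simulated moves pile five shards onto node-2 it becomes the most overloaded node for index-2 and the balancer immediately schedules a move off it.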

It still achieves a balanced cluster in the end; it just takes more shard movements than it needs to get there:

-----node_id[node-0][V]
--------[index-0][0], node[node-0], [P], s[STARTED], a[id=...]
--------[index-1][0], node[node-0], [P], s[STARTED], a[id=...]
--------[index-1][1], node[node-0], [P], s[STARTED], a[id=...]
-----node_id[node-1][V]
--------[index-0][2], node[node-1], [P], s[STARTED], a[id=...]
--------[index-1][2], node[node-1], [P], s[STARTED], a[id=...]
--------[index-2][0], node[node-1], [P], s[STARTED], a[id=...]
-----node_id[node-2][V]
--------[index-0][1], node[node-2], [P], s[STARTED], a[id=...]
--------[index-0][3], node[node-2], [P], s[STARTED], a[id=...]
--------[index-1][3], node[node-2], [P], s[STARTED], a[id=...]

Failing test case

private static void addIndex(
    Metadata.Builder metadataBuilder,
    RoutingTable.Builder routingTableBuilder,
    String indexName,
    int numberOfShards,
    IntFunction<String> assignmentFunction
) {
    final var inSyncIds = randomList(numberOfShards, numberOfShards, () -> UUIDs.randomBase64UUID(random()));
    final var indexMetadataBuilder = IndexMetadata.builder(indexName)
        .settings(
            Settings.builder()
                .put("index.number_of_shards", numberOfShards)
                .put("index.number_of_replicas", 0)
                .put("index.version.created", Version.CURRENT)
                .build()
        );
    for (int shardId = 0; shardId < numberOfShards; shardId++) {
        indexMetadataBuilder.putInSyncAllocationIds(shardId, Set.of(inSyncIds.get(shardId)));
    }
    metadataBuilder.put(indexMetadataBuilder);
    final var indexId = metadataBuilder.get(indexName).getIndex();
    final var indexRoutingTableBuilder = IndexRoutingTable.builder(indexId);

    for (int shardId = 0; shardId < numberOfShards; shardId++) {
        indexRoutingTableBuilder.addShard(
            TestShardRouting.newShardRouting(
                new ShardId(indexId, shardId),
                assignmentFunction.apply(shardId),
                null,
                true,
                ShardRoutingState.STARTED,
                AllocationId.newInitializing(inSyncIds.get(shardId))
            )
        );
    }

    routingTableBuilder.add(indexRoutingTableBuilder);
}

public void testRebalance() {
    final var discoveryNodesBuilder = DiscoveryNodes.builder();
    for (int nodeIndex = 0; nodeIndex < 3; nodeIndex++) {
        final var nodeId = "node-" + nodeIndex;
        discoveryNodesBuilder.add(
            new DiscoveryNode(
                nodeId,
                nodeId,
                UUIDs.randomBase64UUID(random()),
                buildNewFakeTransportAddress(),
                Map.of(),
                DiscoveryNodeRole.roles(),
                Version.CURRENT
            )
        );
    }

    final var metadataBuilder = Metadata.builder();
    final var routingTableBuilder = RoutingTable.builder();

    addIndex(metadataBuilder, routingTableBuilder, "index-0", 4, shardId -> "node-" + (shardId / 2));
    addIndex(metadataBuilder, routingTableBuilder, "index-1", 4, shardId -> "node-" + (shardId / 2));
    addIndex(metadataBuilder, routingTableBuilder, "index-2", 1, shardId -> "node-2");

    final var clusterState = ClusterState.builder(ClusterName.DEFAULT)
        .nodes(discoveryNodesBuilder)
        .metadata(metadataBuilder)
        .routingTable(routingTableBuilder)
        .build();

    final var settings = Settings.builder().put("cluster.routing.allocation.cluster_concurrent_rebalance", 3).build();

    final var allocationService = new AllocationService(
        new AllocationDeciders(
            ClusterModule.createAllocationDeciders(
                settings,
                new ClusterSettings(settings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS),
                Collections.emptyList()
            )
        ),
        new TestGatewayAllocator(),
        new BalancedShardsAllocator(settings),
        EmptyClusterInfoService.INSTANCE,
        SNAPSHOT_INFO_SERVICE_WITH_NO_SHARD_SIZES
    );

    final var reroutedState = allocationService.reroute(clusterState, "test");
    final var relocatingShards = RoutingNodesHelper.shardsWithState(reroutedState.getRoutingNodes(), RELOCATING);
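    // node-2 is the emptiest node, so nothing should be relocating away from it;
    // with cluster_concurrent_rebalance=3 this assertion fails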
    assertTrue(relocatingShards.stream().noneMatch(sr -> sr.currentNodeId().equals("node-2")));
}

Workaround

Remove the following settings from the configuration, so that their default values take effect:

  • cluster.routing.allocation.cluster_concurrent_rebalance
  • cluster.routing.allocation.node_concurrent_incoming_recoveries
  • cluster.routing.allocation.node_concurrent_outgoing_recoveries
  • cluster.routing.allocation.node_concurrent_recoveries
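
If these were applied as dynamic cluster settings rather than in elasticsearch.yml, one way to clear them is to set them to null via the cluster settings API (a dynamic setting set to null reverts to its built-in default; shown here for persistent settings, use "transient" instead if that is how they were set):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}
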
DaveCTurner added the >bug and :Distributed Coordination/Allocation labels on Jun 1, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Jun 1, 2022
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner (Contributor, Author) commented Jun 1, 2022

The precise sequence of shard movements isn't totally daft, it seems. We start here:

         node-0  node-1  node-2
index-0  0,1     2,3     -
index-1  0,1     2,3     -
index-2  -       -       0

We try to balance index-0 first. Clearly we must move a shard to node-2, and by symmetry it doesn't matter which we choose:

         node-0  node-1  node-2
index-0  1       2,3     0
index-1  0,1     2,3     -
index-2  -       -       0

This means the shards of index-0 itself are now balanced, but we continue working on this index to improve the overall balance too by moving one of its shards from node-1 to node-2:

         node-0  node-1  node-2
index-0  1       3       0,2
index-1  0,1     2,3     -
index-2  -       -       0

This move doesn't affect how balanced index-0 is, but it fixes the overall balance, so that's a net improvement. We move on to index-1, which is very unbalanced, and since index balance slightly outweighs the overall shard balance (the default balance factors are 0.55 and 0.45 respectively) it's best to move one of its shards onto node-2:

         node-0  node-1  node-2
index-0  1       3       0,2
index-1  1       2,3     0
index-2  -       -       0

That fixes the index balance but breaks the overall balance, and it's not possible to fix that by moving shards of index-1 so we move on to index-2. This index only has one shard so by definition its index balance is always perfect, but we can move its shard to fix the overall balance:

         node-0  node-1  node-2
index-0  1       3       0,2
index-1  1       2,3     0
index-2  0       -       -

Tada 🎉 the cluster is now balanced.

In this one case we'd make better decisions by moving on to index-1 in the second step, rather than staying with index-0, but it's not clear that this is true in general.

@VimCommando (Contributor) commented:

@DaveCTurner since the desired balance allocator reconciles shard movements against a final goal, should we consider this issue resolved in 8.6+? Won't that prevent the redundant movements of the same shard?

@DaveCTurner (Contributor, Author) commented:

No, in the situation described above the desired balance allocator would compute a goal which would require those extra shard movements to achieve.

idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this issue Feb 21, 2023
BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See elastic#87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 1 to avoid unnecessary moves in desired balance.
idegtiarenko added a commit that referenced this issue Feb 23, 2023
BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See #87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 2 to avoid unnecessary moves in desired balance.
idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this issue Feb 23, 2023

BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See elastic#87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 2 to avoid unnecessary moves in desired balance.
elasticsearchmachine pushed a commit that referenced this issue Mar 1, 2023
…) (#94082)

* Simulate shard moves using cluster_concurrent_rebalance=2 (#93977)

BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See #87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 2 to avoid unnecessary moves in desired balance.

* fix merge

---------

Co-authored-by: Elastic Machine <[email protected]>
@leandrojmp (Contributor) commented:

Hello,

Just upgraded my cluster from 8.5.1 to 8.8.1 and now all my nodes in the hot and warm tiers are unbalanced and the cluster is constantly moving shards around. Could this be related to this issue?

Or are there any other issues related to the balance change made in 8.6?

I had custom settings for cluster.routing.allocation.cluster_concurrent_rebalance, cluster.routing.allocation.node_concurrent_incoming_recoveries and
cluster.routing.allocation.node_concurrent_outgoing_recoveries; I've just set them to null and will see if the workaround still works.

@VimCommando (Contributor) commented:

@leandrojmp yes, the desired balance allocator was introduced in 8.6.0 (see the release notes). It changed the default shard allocation calculations and is more likely the source of the change in behavior you're seeing.
