
Increasing cluster.routing.allocation.cluster_concurrent_rebalance causes redundant shard movements #87279


Open
DaveCTurner opened this issue Jun 1, 2022 · 6 comments
Labels: >bug, :Distributed Coordination/Allocation, Team:Distributed (Obsolete)

Comments

DaveCTurner (Contributor) commented Jun 1, 2022

If cluster.routing.allocation.cluster_concurrent_rebalance is increased above the default of 2 then the balancer will sometimes make unnecessary rebalancing movements which can drastically increase the time it takes for the cluster to reach a balanced state. It's likely that something similar happens if you increase cluster.routing.allocation.node_concurrent_recoveries and friends too. For example, if you start with the following routing table ...

-----node_id[node-0][V]
--------[index-0][0], node[node-0], [P], s[STARTED], a[id=Sf5zRgAAQACYIgoIAAAAAA]
--------[index-0][1], node[node-0], [P], s[STARTED], a[id=mL05qf__T_-F4Iem_____w]
--------[index-1][0], node[node-0], [P], s[STARTED], a[id=riXHcQAAQACiAIRJAAAAAA]
--------[index-1][1], node[node-0], [P], s[STARTED], a[id=3p7gWwAAQACAV_1aAAAAAA]
-----node_id[node-1][V]
--------[index-0][2], node[node-1], [P], s[STARTED], a[id=sDzZ0___T_-4clZCAAAAAA]
--------[index-0][3], node[node-1], [P], s[STARTED], a[id=RFYlyv__T_-PnVozAAAAAA]
--------[index-1][2], node[node-1], [P], s[STARTED], a[id=4MbZxP__T_-WXmVSAAAAAA]
--------[index-1][3], node[node-1], [P], s[STARTED], a[id=pQwaKAAAQACUTMDZ_____w]
-----node_id[node-2][V]
--------[index-2][0], node[node-2], [P], s[STARTED], a[id=C2Hn7P__T_-I4nQUAAAAAA]

... then a call to reroute() will relocate some shards onto node-2 but will also relocate this node's sole shard onto a different node:

-----node_id[node-0][V]
--------[index-0][0], node[node-0], [P], s[STARTED], a[id=Sf5zRgAAQACYIgoIAAAAAA]
--------[index-0][1], node[node-0], relocating [node-2], [P], s[RELOCATING], a[id=mL05qf__T_-F4Iem_____w, rId=OHLwbBZ_Sy-Re7yBy_yDSQ]
--------[index-1][0], node[node-0], [P], s[STARTED], a[id=riXHcQAAQACiAIRJAAAAAA]
--------[index-1][1], node[node-0], [P], s[STARTED], a[id=3p7gWwAAQACAV_1aAAAAAA]
-----node_id[node-1][V]
--------[index-2][0], node[node-1], relocating [node-2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=Gjys3XrdRCy58HQVMXXFGg, rId=C2Hn7P__T_-I4nQUAAAAAA]
--------[index-0][2], node[node-1], [P], s[STARTED], a[id=sDzZ0___T_-4clZCAAAAAA]
--------[index-0][3], node[node-1], relocating [node-2], [P], s[RELOCATING], a[id=RFYlyv__T_-PnVozAAAAAA, rId=F1s3moLuTJOZeuQTCzSS3g]
--------[index-1][2], node[node-1], [P], s[STARTED], a[id=4MbZxP__T_-WXmVSAAAAAA]
--------[index-1][3], node[node-1], [P], s[STARTED], a[id=pQwaKAAAQACUTMDZ_____w]
-----node_id[node-2][V]
--------[index-2][0], node[node-2], relocating [node-1], [P], s[RELOCATING], a[id=C2Hn7P__T_-I4nQUAAAAAA, rId=Gjys3XrdRCy58HQVMXXFGg]
--------[index-0][1], node[node-2], relocating [node-0], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=OHLwbBZ_Sy-Re7yBy_yDSQ, rId=mL05qf__T_-F4Iem_____w]
--------[index-0][3], node[node-2], relocating [node-1], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=F1s3moLuTJOZeuQTCzSS3g, rId=RFYlyv__T_-PnVozAAAAAA]

In more detail, the balancer simulates throttled shard movements onto the emptier node, and these movements improve both shard and index balance. However, this process seems to overshoot a little, which applies pressure to move shards off the emptier node again, and because those movements are not throttled they are triggered straight away.

Start balancing cluster
Balancing from node [node-1] weight: [0.8166666] to node [node-2] weight: [-1.6333333]  delta: [2.4499998]
Try relocating shard of [index-0] from [node-1] to [node-2]
Relocate [[index-0][3], node[node-1], [P], s[STARTED], a[id=RFYlyv__T_-PnVozAAAAAA]] from [node-1] to [node-2]
Balancing from node [node-0] weight: [0.8166666] to node [node-2] weight: [-0.6333333]  delta: [1.4499999]
Try relocating shard of [index-0] from [node-0] to [node-2]
Relocate [[index-0][1], node[node-0], [P], s[STARTED], a[id=mL05qf__T_-F4Iem_____w]] from [node-0] to [node-2]
Stop balancing index [index-0]  min_node [node-2] weight: [0.36666664]  max_node [node-1] weight: [-0.18333335]  delta: [0.55]
Balancing from node [node-0] weight: [0.36666664] to node [node-2] weight: [-0.73333335]  delta: [1.1]
Try relocating shard of [index-1] from [node-0] to [node-2]
Simulate relocation of [[index-1][1], node[node-0], [P], s[STARTED], a[id=3p7gWwAAQACAV_1aAAAAAA]] from [node-0] to [node-2]
Balancing from node [node-1] weight: [0.36666664] to node [node-2] weight: [-0.73333335]  delta: [1.1]
Try relocating shard of [index-1] from [node-1] to [node-2]
Simulate relocation of [[index-1][3], node[node-1], [P], s[STARTED], a[id=pQwaKAAAQACUTMDZ_____w]] from [node-1] to [node-2]
Balancing from node [node-2] weight: [1.2666667] to node [node-1] weight: [-0.6333333]  delta: [1.9]
Try relocating shard of [index-2] from [node-2] to [node-1]
Relocate [[index-2][0], node[node-2], [P], s[STARTED], a[id=C2Hn7P__T_-I4nQUAAAAAA]] from [node-2] to [node-1]
Stop balancing index [index-2]  min_node [node-1] weight: [0.36666664]  max_node [node-0] weight: [-0.6333333]  delta: [1.0]
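
For reference, the weight values in this trace come from the balancer's weight function. Here is a rough, self-contained sketch of that calculation (class and parameter names are illustrative, not the real implementation, and it assumes the default balance factors cluster.routing.allocation.balance.shard = 0.45 and cluster.routing.allocation.balance.index = 0.55); it reproduces both the first balancing decision above and the later step where the overshoot pushes index-2's shard off node-2:

public class WeightFunctionSketch {
    static final float SHARD_BALANCE_FACTOR = 0.45f; // cluster.routing.allocation.balance.shard default
    static final float INDEX_BALANCE_FACTOR = 0.55f; // cluster.routing.allocation.balance.index default

    // Lower weight = emptier node, i.e. a better target for relocations of this index's shards.
    static float weight(int shardsOnNode, float avgShardsPerNode, int indexShardsOnNode, float avgIndexShardsPerNode) {
        final float sum = SHARD_BALANCE_FACTOR + INDEX_BALANCE_FACTOR;
        final float theta0 = SHARD_BALANCE_FACTOR / sum; // overall shard-count component
        final float theta1 = INDEX_BALANCE_FACTOR / sum; // per-index component
        return theta0 * (shardsOnNode - avgShardsPerNode) + theta1 * (indexShardsOnNode - avgIndexShardsPerNode);
    }

    public static void main(String[] args) {
        // Initial state: node-0 and node-1 hold 4 shards each, node-2 holds 1 (9 shards, avg 3.0);
        // index-0 has 2 shards on node-1 and 0 on node-2 (4 shards, avg 4/3 per node).
        System.out.println(weight(4, 9f / 3, 2, 4f / 3)); // ≈  0.8166666: node-1 weight for index-0
        System.out.println(weight(1, 9f / 3, 0, 4f / 3)); // ≈ -1.6333333: node-2 weight for index-0

        // After the two real relocations and two simulated ones, node-2 notionally holds 5 shards,
        // including the only shard of index-2 (1 shard, avg 1/3 per node).
        System.out.println(weight(5, 9f / 3, 1, 1f / 3)); // ≈  1.2666667: node-2 weight for index-2
        System.out.println(weight(2, 9f / 3, 0, 1f / 3)); // ≈ -0.6333333: node-1 weight for index-2
    }
}

A node's weight for an index is positive when it holds more than its fair share, so once the simulated moves pile five shards onto node-2 it becomes the most overloaded node for index-2 and the balancer immediately schedules a move off it.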

It still achieves a balanced cluster in the end; it just takes more shard movements than it needs to get there:

-----node_id[node-0][V]
--------[index-0][0], node[node-0], [P], s[STARTED], a[id=...]
--------[index-1][0], node[node-0], [P], s[STARTED], a[id=...]
--------[index-1][1], node[node-0], [P], s[STARTED], a[id=...]
-----node_id[node-1][V]
--------[index-0][2], node[node-1], [P], s[STARTED], a[id=...]
--------[index-1][2], node[node-1], [P], s[STARTED], a[id=...]
--------[index-2][0], node[node-1], [P], s[STARTED], a[id=...]
-----node_id[node-2][V]
--------[index-0][1], node[node-2], [P], s[STARTED], a[id=...]
--------[index-0][3], node[node-2], [P], s[STARTED], a[id=...]
--------[index-1][3], node[node-2], [P], s[STARTED], a[id=...]

Failing test case

private static void addIndex(
    Metadata.Builder metadataBuilder,
    RoutingTable.Builder routingTableBuilder,
    String indexName,
    int numberOfShards,
    IntFunction<String> assignmentFunction
) {
    final var inSyncIds = randomList(numberOfShards, numberOfShards, () -> UUIDs.randomBase64UUID(random()));
    final var indexMetadataBuilder = IndexMetadata.builder(indexName)
        .settings(
            Settings.builder()
                .put("index.number_of_shards", numberOfShards)
                .put("index.number_of_replicas", 0)
                .put("index.version.created", Version.CURRENT)
                .build()
        );
    for (int shardId = 0; shardId < numberOfShards; shardId++) {
        indexMetadataBuilder.putInSyncAllocationIds(shardId, Set.of(inSyncIds.get(shardId)));
    }
    metadataBuilder.put(indexMetadataBuilder);
    final var indexId = metadataBuilder.get(indexName).getIndex();
    final var indexRoutingTableBuilder = IndexRoutingTable.builder(indexId);

    for (int shardId = 0; shardId < numberOfShards; shardId++) {
        indexRoutingTableBuilder.addShard(
            TestShardRouting.newShardRouting(
                new ShardId(indexId, shardId),
                assignmentFunction.apply(shardId),
                null,
                true,
                ShardRoutingState.STARTED,
                AllocationId.newInitializing(inSyncIds.get(shardId))
            )
        );
    }

    routingTableBuilder.add(indexRoutingTableBuilder);
}

public void testRebalance() {
    final var discoveryNodesBuilder = DiscoveryNodes.builder();
    for (int nodeIndex = 0; nodeIndex < 3; nodeIndex++) {
        final var nodeId = "node-" + nodeIndex;
        discoveryNodesBuilder.add(
            new DiscoveryNode(
                nodeId,
                nodeId,
                UUIDs.randomBase64UUID(random()),
                buildNewFakeTransportAddress(),
                Map.of(),
                DiscoveryNodeRole.roles(),
                Version.CURRENT
            )
        );
    }

    final var metadataBuilder = Metadata.builder();
    final var routingTableBuilder = RoutingTable.builder();

    addIndex(metadataBuilder, routingTableBuilder, "index-0", 4, shardId -> "node-" + (shardId / 2));
    addIndex(metadataBuilder, routingTableBuilder, "index-1", 4, shardId -> "node-" + (shardId / 2));
    addIndex(metadataBuilder, routingTableBuilder, "index-2", 1, shardId -> "node-2");

    final var clusterState = ClusterState.builder(ClusterName.DEFAULT)
        .nodes(discoveryNodesBuilder)
        .metadata(metadataBuilder)
        .routingTable(routingTableBuilder)
        .build();

    final var settings = Settings.builder().put("cluster.routing.allocation.cluster_concurrent_rebalance", 3).build();

    final var allocationService = new AllocationService(
        new AllocationDeciders(
            ClusterModule.createAllocationDeciders(
                settings,
                new ClusterSettings(settings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS),
                Collections.emptyList()
            )
        ),
        new TestGatewayAllocator(),
        new BalancedShardsAllocator(settings),
        EmptyClusterInfoService.INSTANCE,
        SNAPSHOT_INFO_SERVICE_WITH_NO_SHARD_SIZES
    );

    final var reroutedState = allocationService.reroute(clusterState, "test");
    final var relocatingShards = RoutingNodesHelper.shardsWithState(reroutedState.getRoutingNodes(), RELOCATING);
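    // node-2 is the emptiest node, so nothing should be relocating away from it;
    // with cluster_concurrent_rebalance=3 this assertion fails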
    assertTrue(relocatingShards.stream().noneMatch(sr -> sr.currentNodeId().equals("node-2")));
}

Workaround

Remove the following settings from the configuration, so that their default values take effect:

  • cluster.routing.allocation.cluster_concurrent_rebalance
  • cluster.routing.allocation.node_concurrent_incoming_recoveries
  • cluster.routing.allocation.node_concurrent_outgoing_recoveries
  • cluster.routing.allocation.node_concurrent_recoveries
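
If these were applied as dynamic cluster settings rather than in elasticsearch.yml, one way to clear them is to set them to null via the cluster settings API (a dynamic setting set to null reverts to its built-in default; shown here for persistent settings, use "transient" instead if that is how they were set):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}
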
DaveCTurner added the >bug and :Distributed Coordination/Allocation labels on Jun 1, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Jun 1, 2022
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner (Contributor, Author) commented Jun 1, 2022

The precise sequence of shard movements isn't totally daft, it seems. We start here:

         node-0  node-1  node-2
index-0  0,1     2,3     -
index-1  0,1     2,3     -
index-2  -       -       0

We try to balance index-0 first. Clearly we must move a shard to node-2, and by symmetry it doesn't matter which we choose:

         node-0  node-1  node-2
index-0  1       2,3     0
index-1  0,1     2,3     -
index-2  -       -       0

This means the shards of index-0 itself are now balanced, but we continue working on this index to improve the overall balance too by moving one of its shards from node-1 to node-2:

         node-0  node-1  node-2
index-0  1       3       0,2
index-1  0,1     2,3     -
index-2  -       -       0

This move doesn't affect how balanced index-0 is, but it fixes the overall balance, so that's a net improvement. We move on to index-1, which is very unbalanced, and since index balance slightly outweighs the overall shard balance (the default balance factors are 0.55 and 0.45 respectively) it's best to move one of its shards onto node-2:

         node-0  node-1  node-2
index-0  1       3       0,2
index-1  1       2,3     0
index-2  -       -       0

That fixes the index balance but breaks the overall balance, and it's not possible to fix that by moving shards of index-1 so we move on to index-2. This index only has one shard so by definition its index balance is always perfect, but we can move its shard to fix the overall balance:

         node-0  node-1  node-2
index-0  1       3       0,2
index-1  1       2,3     0
index-2  0       -       -

Tada 🎉 the cluster is now balanced.

In this one case we'd make better decisions by moving on to index-1 in the second step, rather than staying with index-0, but it's not clear that this is true in general.

@VimCommando (Contributor) commented:

@DaveCTurner since the desired balance allocator reconciles shard movements against a final goal, should we consider this issue resolved in 8.6+? Won't that prevent the redundant movements of the same shard?

@DaveCTurner (Contributor, Author) commented:

No, in the situation described above the desired balance allocator would compute a goal which would require those extra shard movements to achieve.

idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this issue Feb 21, 2023
BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See elastic#87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 1 to avoid unnecessary moves in desired balance.
idegtiarenko added a commit that referenced this issue Feb 23, 2023
BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See #87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 2 to avoid unnecessary moves in desired balance.
idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this issue Feb 23, 2023

BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See elastic#87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 2 to avoid unnecessary moves in desired balance.
elasticsearchmachine pushed a commit that referenced this issue Mar 1, 2023
…) (#94082)

* Simulate shard moves using cluster_concurrent_rebalance=2 (#93977)

BalancedShardAllocator is prone to perform unnecessary moves when
cluster_concurrent_rebalance is set to high values (>2).
(See #87279)
Above allocator is used in DesiredBalanceComputer. Since we do not move actual
shard data during calculation it is possible to artificially set above setting
to 2 to avoid unnecessary moves in desired balance.

* fix merge

---------

Co-authored-by: Elastic Machine <[email protected]>
@leandrojmp (Contributor) commented:

Hello,

Just upgraded my cluster from 8.5.1 to 8.8.1 and now all my nodes in the hot and warm tiers are unbalanced and the cluster is constantly moving shards around. Could this be related to this issue?

Or are there any other issues related to the balance change made in 8.6?

I had custom settings for cluster.routing.allocation.cluster_concurrent_rebalance, cluster.routing.allocation.node_concurrent_incoming_recoveries and
cluster.routing.allocation.node_concurrent_outgoing_recoveries; I've just set them to null and will see if the workaround still works.

@VimCommando (Contributor) commented:

@leandrojmp yes, the desired balance allocator was introduced in 8.6.0 (see the release notes). It changed the default shard allocation calculations and is more likely the source of the change in behavior you're seeing.
