Fix flaky testRebalanceOnlyAfterAllShardsAreActive #94102


Merged — 2 commits into elastic:main on Feb 27, 2023

Conversation

@idegtiarenko (Contributor) commented on Feb 24, 2023

routing_nodes:
-----node_id[node1][V]
--------[test][0], node[node1], [P], s[STARTED], a[id=XSf9l5rrRx2ocQfryTWa6g], failed_attempts[0]
--------[test][1], node[node1], [R], s[STARTED], a[id=KtqcvNqJR8iEa_avoiWrxA], failed_attempts[0]
--------[test][2], node[node1], [P], s[STARTED], a[id=4t4dA4GgS_eiyOjOHQm8UA], failed_attempts[0]
--------[test][3], node[node1], [R], s[STARTED], a[id=PLmJXeZzQ9q2Uc9bVxbCqw], failed_attempts[0]
--------[test][4], node[node1], [P], s[STARTED], a[id=sDYXEvp_QZGVIku3UZxDig], failed_attempts[0]
-----node_id[node2][V]
--------[test][0], node[node2], [R], s[STARTED], a[id=gbeLQn1-R_yS1Cji1SQy6A], failed_attempts[0]
--------[test][1], node[node2], relocating [node6], [P], s[RELOCATING], a[id=xiwtqLwgSqaWVKae6SMamw, rId=ovxcHMXNQtyb5eJGAO0nKg], failed_attempts[0], expected_shard_size[569277388]
--------[test][2], node[node2], relocating [node7], [R], s[RELOCATING], a[id=VwkFODi5TDerGcWr7NHung, rId=Obqbb6tQQce7u7adURwFNw], failed_attempts[0], expected_shard_size[1820676772]
--------[test][3], node[node2], relocating [node10], [P], s[RELOCATING], a[id=GL3-FT2_QBCThMulq9P6YA, rId=T6TZpTPlTuWtrQwHh5mplA], failed_attempts[0], expected_shard_size[2066543429]
--------[test][4], node[node2], relocating [node3], [R], s[RELOCATING], a[id=Ion0MHUmTlqIHMHCUV_IYQ, rId=nAzc69WARGu3GY2uv8IhHQ], failed_attempts[0], expected_shard_size[296020011]
-----node_id[node3][V]
--------[test][4], node[node3], relocating [node2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nAzc69WARGu3GY2uv8IhHQ, rId=Ion0MHUmTlqIHMHCUV_IYQ], failed_attempts[0], expected_shard_size[296020011]
-----node_id[node4][V]
-----node_id[node5][V]
-----node_id[node6][V]
--------[test][1], node[node6], relocating [node2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=ovxcHMXNQtyb5eJGAO0nKg, rId=xiwtqLwgSqaWVKae6SMamw], failed_attempts[0], expected_shard_size[569277388]
-----node_id[node7][V]
--------[test][2], node[node7], relocating [node2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Obqbb6tQQce7u7adURwFNw, rId=VwkFODi5TDerGcWr7NHung], failed_attempts[0], expected_shard_size[1820676772]
-----node_id[node8][V]
-----node_id[node9][V]
-----node_id[node10][V]
--------[test][3], node[node10], relocating [node2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=T6TZpTPlTuWtrQwHh5mplA, rId=GL3-FT2_QBCThMulq9P6YA], failed_attempts[0], expected_shard_size[2066543429]
---- unassigned


DesiredBalance[lastConvergedIndex=3, assignments={
[test][0]=ShardAssignment[nodeIds=[node2, node1], total=2, unassigned=0, ignored=0],
[test][2]=ShardAssignment[nodeIds=[node7, node8], total=2, unassigned=0, ignored=0],
[test][1]=ShardAssignment[nodeIds=[node6, node9], total=2, unassigned=0, ignored=0],
[test][4]=ShardAssignment[nodeIds=[node3, node4], total=2, unassigned=0, ignored=0],
[test][3]=ShardAssignment[nodeIds=[node10, node5], total=2, unassigned=0, ignored=0]
}]

This failure happens when both the primary and the replica of one of the shards are already located on their desired nodes. In that case only 4 of the 5 shards start relocating immediately.

I believe this is a valid state, so the test is updated to accept it.
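For intuition, here is a small standalone sketch (a hypothetical helper, not Elasticsearch's allocator or the test code) of why only 4 relocations start: a shard whose copies already sit exactly on their desired nodes needs no move, and in the state above that is true for [test][0].

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: given each shard's current node set and its desired
// assignment, count how many shards still need at least one relocation.
public class RelocationSketch {

    static long shardsNeedingRelocation(Map<String, Set<String>> current,
                                        Map<String, Set<String>> desired) {
        // Set equality ignores order, so [node1, node2] matches [node2, node1].
        return current.entrySet().stream()
            .filter(e -> !e.getValue().equals(desired.get(e.getKey())))
            .count();
    }

    public static void main(String[] args) {
        // All copies start on node1/node2, mirroring the routing_nodes dump.
        Map<String, Set<String>> current = Map.of(
            "[test][0]", Set.of("node1", "node2"),
            "[test][1]", Set.of("node1", "node2"),
            "[test][2]", Set.of("node1", "node2"),
            "[test][3]", Set.of("node1", "node2"),
            "[test][4]", Set.of("node1", "node2"));
        // Desired assignments taken from the DesiredBalance dump above.
        Map<String, Set<String>> desired = Map.of(
            "[test][0]", Set.of("node2", "node1"), // already in place
            "[test][1]", Set.of("node6", "node9"),
            "[test][2]", Set.of("node7", "node8"),
            "[test][3]", Set.of("node10", "node5"),
            "[test][4]", Set.of("node3", "node4"));
        System.out.println(shardsNeedingRelocation(current, desired)); // prints 4
    }
}
```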

Closes: #94086

@idegtiarenko added the >test-failure, :Distributed Coordination/Allocation, Team:Distributed (Obsolete), and v8.8.0 labels on Feb 24, 2023
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@idegtiarenko (Author) commented:

It is interesting that I could not reproduce this failure in #94082

@DaveCTurner (Contributor) left a comment:

LGTM, one nit

Comment on lines 76 to 79
assertThat(clusterState.routingTable().index("test").shard(i).primaryShard().state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).primaryShard().currentNodeId(), nullValue());
assertThat(clusterState.routingTable().index("test").shard(i).replicaShards().get(0).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).replicaShards().get(0).currentNodeId(), nullValue());
I kind of preferred how this was written before: there are two shards, and regardless of their primary/replica status they are both unassigned.
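The shape the reviewer preferred could be sketched like this (hypothetical, simplified stand-ins for the routing-table types, not the actual test code): iterate over all copies of the shard and assert each one is unassigned, instead of spelling out the primary and the replica separately.

```java
import java.util.List;

// Hypothetical stand-ins for the routing-table types, to illustrate checking
// all shard copies uniformly rather than by primary/replica role.
public class UnassignedCheck {
    enum State { STARTED, UNASSIGNED }
    record ShardCopy(State state, String currentNodeId) {}

    static boolean allUnassigned(List<ShardCopy> copies) {
        return copies.stream()
            .allMatch(c -> c.state() == State.UNASSIGNED && c.currentNodeId() == null);
    }

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(
            new ShardCopy(State.UNASSIGNED, null),   // primary
            new ShardCopy(State.UNASSIGNED, null));  // replica
        System.out.println(allUnassigned(copies)); // prints true
    }
}
```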

@idegtiarenko merged commit ded6f3d into elastic:main on Feb 27, 2023
@idegtiarenko deleted the fix_94086 branch on February 27, 2023 12:28
Successfully merging this pull request may close: [CI] RebalanceAfterActiveTests testRebalanceOnlyAfterAllShardsAreActive failing