Fix flaky testRebalanceOnlyAfterAllShardsAreActive #94102


Merged — 2 commits into elastic:main on Feb 27, 2023

Conversation

@idegtiarenko (Contributor) commented on Feb 24, 2023

routing_nodes:
-----node_id[node1][V]
--------[test][0], node[node1], [P], s[STARTED], a[id=XSf9l5rrRx2ocQfryTWa6g], failed_attempts[0]
--------[test][1], node[node1], [R], s[STARTED], a[id=KtqcvNqJR8iEa_avoiWrxA], failed_attempts[0]
--------[test][2], node[node1], [P], s[STARTED], a[id=4t4dA4GgS_eiyOjOHQm8UA], failed_attempts[0]
--------[test][3], node[node1], [R], s[STARTED], a[id=PLmJXeZzQ9q2Uc9bVxbCqw], failed_attempts[0]
--------[test][4], node[node1], [P], s[STARTED], a[id=sDYXEvp_QZGVIku3UZxDig], failed_attempts[0]
-----node_id[node2][V]
--------[test][0], node[node2], [R], s[STARTED], a[id=gbeLQn1-R_yS1Cji1SQy6A], failed_attempts[0]
--------[test][1], node[node2], relocating [node6], [P], s[RELOCATING], a[id=xiwtqLwgSqaWVKae6SMamw, rId=ovxcHMXNQtyb5eJGAO0nKg], failed_attempts[0], expected_shard_size[569277388]
--------[test][2], node[node2], relocating [node7], [R], s[RELOCATING], a[id=VwkFODi5TDerGcWr7NHung, rId=Obqbb6tQQce7u7adURwFNw], failed_attempts[0], expected_shard_size[1820676772]
--------[test][3], node[node2], relocating [node10], [P], s[RELOCATING], a[id=GL3-FT2_QBCThMulq9P6YA, rId=T6TZpTPlTuWtrQwHh5mplA], failed_attempts[0], expected_shard_size[2066543429]
--------[test][4], node[node2], relocating [node3], [R], s[RELOCATING], a[id=Ion0MHUmTlqIHMHCUV_IYQ, rId=nAzc69WARGu3GY2uv8IhHQ], failed_attempts[0], expected_shard_size[296020011]
-----node_id[node3][V]
--------[test][4], node[node3], relocating [node2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nAzc69WARGu3GY2uv8IhHQ, rId=Ion0MHUmTlqIHMHCUV_IYQ], failed_attempts[0], expected_shard_size[296020011]
-----node_id[node4][V]
-----node_id[node5][V]
-----node_id[node6][V]
--------[test][1], node[node6], relocating [node2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=ovxcHMXNQtyb5eJGAO0nKg, rId=xiwtqLwgSqaWVKae6SMamw], failed_attempts[0], expected_shard_size[569277388]
-----node_id[node7][V]
--------[test][2], node[node7], relocating [node2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Obqbb6tQQce7u7adURwFNw, rId=VwkFODi5TDerGcWr7NHung], failed_attempts[0], expected_shard_size[1820676772]
-----node_id[node8][V]
-----node_id[node9][V]
-----node_id[node10][V]
--------[test][3], node[node10], relocating [node2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=T6TZpTPlTuWtrQwHh5mplA, rId=GL3-FT2_QBCThMulq9P6YA], failed_attempts[0], expected_shard_size[2066543429]
---- unassigned


DesiredBalance[lastConvergedIndex=3, assignments={
[test][0]=ShardAssignment[nodeIds=[node2, node1], total=2, unassigned=0, ignored=0],
[test][2]=ShardAssignment[nodeIds=[node7, node8], total=2, unassigned=0, ignored=0],
[test][1]=ShardAssignment[nodeIds=[node6, node9], total=2, unassigned=0, ignored=0],
[test][4]=ShardAssignment[nodeIds=[node3, node4], total=2, unassigned=0, ignored=0],
[test][3]=ShardAssignment[nodeIds=[node10, node5], total=2, unassigned=0, ignored=0]
}]

This failure happens when both the primary and the replica of one of the shards are already located on their desired nodes. In that case only 4 of the 5 shards start relocating immediately.

I believe this is a valid state, so the test is updated to accept it.
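For intuition, here is a small standalone sketch (a hypothetical helper, not Elasticsearch's allocator or the test code) of why only 4 relocations start: a shard whose copies already sit exactly on their desired nodes needs no move, and in the state above that is true for [test][0].

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: given each shard's current node set and its desired
// assignment, count how many shards still need at least one relocation.
public class RelocationSketch {

    static long shardsNeedingRelocation(Map<String, Set<String>> current,
                                        Map<String, Set<String>> desired) {
        // Set equality ignores order, so [node1, node2] matches [node2, node1].
        return current.entrySet().stream()
            .filter(e -> !e.getValue().equals(desired.get(e.getKey())))
            .count();
    }

    public static void main(String[] args) {
        // All copies start on node1/node2, mirroring the routing_nodes dump.
        Map<String, Set<String>> current = Map.of(
            "[test][0]", Set.of("node1", "node2"),
            "[test][1]", Set.of("node1", "node2"),
            "[test][2]", Set.of("node1", "node2"),
            "[test][3]", Set.of("node1", "node2"),
            "[test][4]", Set.of("node1", "node2"));
        // Desired assignments taken from the DesiredBalance dump above.
        Map<String, Set<String>> desired = Map.of(
            "[test][0]", Set.of("node2", "node1"), // already in place
            "[test][1]", Set.of("node6", "node9"),
            "[test][2]", Set.of("node7", "node8"),
            "[test][3]", Set.of("node10", "node5"),
            "[test][4]", Set.of("node3", "node4"));
        System.out.println(shardsNeedingRelocation(current, desired)); // prints 4
    }
}
```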

Closes: #94086

@idegtiarenko added the >test-failure, :Distributed Coordination/Allocation, Team:Distributed (Obsolete), and v8.8.0 labels on Feb 24, 2023
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@idegtiarenko (Author) commented:

It is interesting that I could not reproduce this failure in #94082

@DaveCTurner (Contributor) left a comment:

LGTM, one nit

Comment on lines 76 to 79
assertThat(clusterState.routingTable().index("test").shard(i).primaryShard().state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).primaryShard().currentNodeId(), nullValue());
assertThat(clusterState.routingTable().index("test").shard(i).replicaShards().get(0).state(), equalTo(UNASSIGNED));
assertThat(clusterState.routingTable().index("test").shard(i).replicaShards().get(0).currentNodeId(), nullValue());
I kind of preferred how this was written before: there are two shards, and regardless of their primary/replica status they are both unassigned.
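The shape the reviewer preferred could be sketched like this (hypothetical, simplified stand-ins for the routing-table types, not the actual test code): iterate over all copies of the shard and assert each one is unassigned, instead of spelling out the primary and the replica separately.

```java
import java.util.List;

// Hypothetical stand-ins for the routing-table types, to illustrate checking
// all shard copies uniformly rather than by primary/replica role.
public class UnassignedCheck {
    enum State { STARTED, UNASSIGNED }
    record ShardCopy(State state, String currentNodeId) {}

    static boolean allUnassigned(List<ShardCopy> copies) {
        return copies.stream()
            .allMatch(c -> c.state() == State.UNASSIGNED && c.currentNodeId() == null);
    }

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(
            new ShardCopy(State.UNASSIGNED, null),   // primary
            new ShardCopy(State.UNASSIGNED, null));  // replica
        System.out.println(allUnassigned(copies)); // prints true
    }
}
```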

@idegtiarenko merged commit ded6f3d into elastic:main on Feb 27, 2023
@idegtiarenko deleted the fix_94086 branch on February 27, 2023 12:28
Successfully merging this pull request may close: [CI] RebalanceAfterActiveTests testRebalanceOnlyAfterAllShardsAreActive failing