[CI] RebalanceAfterActiveTests testRebalanceOnlyAfterAllShardsAreActive failing #94086

Closed
idegtiarenko opened this issue Feb 23, 2023 · 4 comments · Fixed by #94102

Comments

@idegtiarenko
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/xdilv3b7y5lzq/tests/:server:test/org.elasticsearch.cluster.routing.allocation.RebalanceAfterActiveTests/testRebalanceOnlyAfterAllShardsAreActive

Reproduction line:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.routing.allocation.RebalanceAfterActiveTests.testRebalanceOnlyAfterAllShardsAreActive" -Dtests.seed=B30A5B87F3CE8FD0 -Dtests.locale=ar-LB -Dtests.timezone=Australia/Brisbane -Druntime.java=17

Applicable branches:
main

Reproduces locally?:
Yes

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.cluster.routing.allocation.RebalanceAfterActiveTests&tests.test=testRebalanceOnlyAfterAllShardsAreActive

Failure excerpt:

java.lang.AssertionError: 
Expected: <5>
     but: was <6>

  at __randomizedtesting.SeedInfo.seed([B30A5B87F3CE8FD0:89B703F4717FB16C]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.cluster.routing.allocation.RebalanceAfterActiveTests.testRebalanceOnlyAfterAllShardsAreActive(RebalanceAfterActiveTests.java:136)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

@idegtiarenko idegtiarenko added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test-failure Triaged test failures from CI labels Feb 23, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@idegtiarenko
Contributor Author

The test is failing around this line:

assertThat(shardsWithState(clusterState.getRoutingNodes(), STARTED).size(), equalTo(5));

In short, the test does the following:

  • create an index (primaries=5, replicas=1)
  • create 2 nodes
  • start all shards (expect 5 shards per node)
  • add 8 more nodes
  • expect 5 shards (the primaries) to be started and the 5 replicas to immediately start moving to other nodes. <--- this fails.

The new desired balance is computed such that, for one of the shards ([test][0]), the desired nodes are the ones where it is already allocated. One of the replicas therefore does not need to move, and 6 shards remain STARTED instead of the expected 5:

routing_nodes:
-----node_id[node1][V]
--------[test][0], node[node1], [P], s[STARTED], a[id=XSf9l5rrRx2ocQfryTWa6g], failed_attempts[0]
--------[test][1], node[node1], [R], s[STARTED], a[id=KtqcvNqJR8iEa_avoiWrxA], failed_attempts[0]
--------[test][2], node[node1], [P], s[STARTED], a[id=4t4dA4GgS_eiyOjOHQm8UA], failed_attempts[0]
--------[test][3], node[node1], [R], s[STARTED], a[id=PLmJXeZzQ9q2Uc9bVxbCqw], failed_attempts[0]
--------[test][4], node[node1], [P], s[STARTED], a[id=sDYXEvp_QZGVIku3UZxDig], failed_attempts[0]
-----node_id[node2][V]
--------[test][0], node[node2], [R], s[STARTED], a[id=gbeLQn1-R_yS1Cji1SQy6A], failed_attempts[0]
--------[test][1], node[node2], relocating [node6], [P], s[RELOCATING], a[id=xiwtqLwgSqaWVKae6SMamw, rId=ovxcHMXNQtyb5eJGAO0nKg], failed_attempts[0], expected_shard_size[569277388]
--------[test][2], node[node2], relocating [node7], [R], s[RELOCATING], a[id=VwkFODi5TDerGcWr7NHung, rId=Obqbb6tQQce7u7adURwFNw], failed_attempts[0], expected_shard_size[1820676772]
--------[test][3], node[node2], relocating [node10], [P], s[RELOCATING], a[id=GL3-FT2_QBCThMulq9P6YA, rId=T6TZpTPlTuWtrQwHh5mplA], failed_attempts[0], expected_shard_size[2066543429]
--------[test][4], node[node2], relocating [node3], [R], s[RELOCATING], a[id=Ion0MHUmTlqIHMHCUV_IYQ, rId=nAzc69WARGu3GY2uv8IhHQ], failed_attempts[0], expected_shard_size[296020011]
-----node_id[node3][V]
--------[test][4], node[node3], relocating [node2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nAzc69WARGu3GY2uv8IhHQ, rId=Ion0MHUmTlqIHMHCUV_IYQ], failed_attempts[0], expected_shard_size[296020011]
-----node_id[node4][V]
-----node_id[node5][V]
-----node_id[node6][V]
--------[test][1], node[node6], relocating [node2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=ovxcHMXNQtyb5eJGAO0nKg, rId=xiwtqLwgSqaWVKae6SMamw], failed_attempts[0], expected_shard_size[569277388]
-----node_id[node7][V]
--------[test][2], node[node7], relocating [node2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Obqbb6tQQce7u7adURwFNw, rId=VwkFODi5TDerGcWr7NHung], failed_attempts[0], expected_shard_size[1820676772]
-----node_id[node8][V]
-----node_id[node9][V]
-----node_id[node10][V]
--------[test][3], node[node10], relocating [node2], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=T6TZpTPlTuWtrQwHh5mplA, rId=GL3-FT2_QBCThMulq9P6YA], failed_attempts[0], expected_shard_size[2066543429]
---- unassigned


DesiredBalance[lastConvergedIndex=3, assignments={
[test][0]=ShardAssignment[nodeIds=[node2, node1], total=2, unassigned=0, ignored=0],
[test][2]=ShardAssignment[nodeIds=[node7, node8], total=2, unassigned=0, ignored=0],
[test][1]=ShardAssignment[nodeIds=[node6, node9], total=2, unassigned=0, ignored=0],
[test][4]=ShardAssignment[nodeIds=[node3, node4], total=2, unassigned=0, ignored=0],
[test][3]=ShardAssignment[nodeIds=[node10, node5], total=2, unassigned=0, ignored=0]
}]
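
To make the arithmetic behind `Expected: <5> but: was <6>` concrete, here is a minimal standalone sketch. It is not Elasticsearch code: the class and method names are hypothetical, the data is copied by hand from the two dumps above, and it assumes (as the routing table shows, without saying how the allocator enforces it) that at most one copy of each shard relocates in this step.

```java
import java.util.Map;
import java.util.Set;

public class DesiredBalanceSketch {

    // From the routing_nodes dump: before rebalancing, every shard has
    // one copy on node1 and one copy on node2.
    static final Set<String> CURRENT_NODES = Set.of("node1", "node2");

    // Desired nodes per shard id, copied from the DesiredBalance dump.
    static final Map<Integer, Set<String>> DESIRED = Map.of(
        0, Set.of("node2", "node1"),
        1, Set.of("node6", "node9"),
        2, Set.of("node7", "node8"),
        3, Set.of("node10", "node5"),
        4, Set.of("node3", "node4")
    );

    // Assumption for this sketch: a shard whose desired nodes differ from its
    // current nodes gets exactly one copy relocating in this step.
    static int relocatingCopies() {
        int relocating = 0;
        for (Set<String> desiredNodes : DESIRED.values()) {
            if (!desiredNodes.equals(CURRENT_NODES)) {
                relocating++;
            }
        }
        return relocating;
    }

    // Every copy that is not relocating is still STARTED.
    static int startedCopies() {
        int totalCopies = DESIRED.size() * 2; // 5 shards x (1 primary + 1 replica)
        return totalCopies - relocatingCopies();
    }

    public static void main(String[] args) {
        // prints "relocating=4 started=6"
        System.out.println("relocating=" + relocatingCopies() + " started=" + startedCopies());
    }
}
```

Under these assumptions only 4 shards have a copy to move, because [test][0] is already on its desired nodes, so 6 copies stay STARTED and the `equalTo(5)` assertion fails.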

@idegtiarenko idegtiarenko self-assigned this Feb 23, 2023
@DaveCTurner
Contributor

Your analysis looks right to me, yes. One question:

        // we only allow one relocation at a time

Do you know where that rule is implemented?

@idegtiarenko
Contributor Author

This comment was added with the initial commit of the file in 2010, for version 0.9.
I suspect it was true back then but is no longer true today.
