CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState test failure #41967

benwtrent · 2019-05-08T19:37:48Z

Reproduces locally

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=debian-8/155/consoleFull

Failure:

13:54:46 org.elasticsearch.cluster.coordination.CoordinatorTests > testDiscoveryUsesNodesFromLastClusterState FAILED
13:54:46     java.lang.AssertionError: node1 has applied its state 
13:54:46     Expected: <606L>
13:54:46          but: was <605L>
13:54:46         at __randomizedtesting.SeedInfo.seed([567D9E1ADC657714:3060AB7CA14A7F0B]:0)
13:54:46         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
13:54:46         at org.junit.Assert.assertThat(Assert.java:956)
13:54:46         at org.elasticsearch.cluster.coordination.CoordinatorTests$Cluster.stabilise(CoordinatorTests.java:1524)
13:54:46         at org.elasticsearch.cluster.coordination.CoordinatorTests$Cluster.stabilise(CoordinatorTests.java:1505)
13:54:46         at org.elasticsearch.cluster.coordination.CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState(CoordinatorTests.java:1086)

Reproduce:

 ./gradlew :server:test --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState" -Dtests.seed=567D9E1ADC657714 -Dtests.security.manager=true -Dtests.locale=es-US -Dtests.timezone=Chile/EasterIsland -Dcompiler.java=12 -Druntime.java=8

Seems to fail only on the 7.x branch. Verified that on master, without the java=8 flag, it passes just fine.

elasticmachine · 2019-05-08T19:37:50Z

Pinging @elastic/es-distributed

See elastic/elasticsearch#41967.

Today the default stabilisation time is calculated on the assumption that the elected master has no pending tasks to process when it is elected, but this is not a safe assumption to make. This can result in a cluster reaching the end of its stabilisation time without having stabilised. Furthermore in elastic#36943 we increased the probability that each step in `runRandomly()` enqueues another task, vastly increasing the chance that we hit such a situation. This change extends the stabilisation process to allow time for all pending tasks, plus a task that might currently be in flight. Fixes elastic#41967, in which the master entered the stabilisation phase with over 800 tasks to process.
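
The arithmetic behind that extension can be sketched as follows. This is a minimal illustration, not the code from the fix: the class name, the method name, the base time and the per-task delay are all hypothetical, and only the "pending tasks plus one in flight" rule and the figure of roughly 800 tasks come from the description above.

```java
/** Hypothetical sketch of the idea; not the actual CoordinatorTests code. */
public final class StabilisationSketch {

    /**
     * Extends a base stabilisation time so that it also covers every task
     * still queued on the master, plus one task that may be in flight.
     *
     * All three parameters are assumptions made for illustration.
     */
    public static long extendedStabilisationMillis(long baseMillis, int pendingTasks, long maxTaskMillis) {
        if (baseMillis < 0 || pendingTasks < 0 || maxTaskMillis < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // The +1 allows for a task that has been dequeued but not yet completed.
        long tasksToAllowFor = (long) pendingTasks + 1;
        // Exact arithmetic so an overflow fails loudly instead of wrapping.
        return Math.addExact(baseMillis, Math.multiplyExact(tasksToAllowFor, maxTaskMillis));
    }

    public static void main(String[] args) {
        long baseMillis = 10_000L;   // assumed default stabilisation time
        long maxTaskMillis = 200L;   // assumed worst-case time per task

        // Idle master: only the possible in-flight task is added.
        System.out.println(extendedStabilisationMillis(baseMillis, 0, maxTaskMillis));   // prints 10200

        // A master with 800 queued tasks, as in the failure reported here.
        System.out.println(extendedStabilisationMillis(baseMillis, 800, maxTaskMillis)); // prints 170200
    }
}
```

With these assumed numbers, a backlog of 800 tasks needs about 17 times the base allowance, which shows how a fixed default could expire before the master had applied its final state.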

benwtrent added >test-failure, :Distributed Indexing/Distributed, v7.2.0 labels May 8, 2019

benwtrent added a commit that referenced this issue May 8, 2019

mute test related to #41967

83b0561

benwtrent added a commit that referenced this issue May 8, 2019

mute test related to #41967 (#41968)

edd6438

benwtrent mentioned this issue May 8, 2019

mute test related to #41967 #41968

Merged

ywelsch assigned DaveCTurner May 10, 2019

seut added a commit to crate/crate that referenced this issue May 24, 2019

Mute coordinator test related to #41967

ab605fd

See elastic/elasticsearch#41967.

DaveCTurner mentioned this issue May 24, 2019

Drain master task queue when stabilising #42504

Merged

mergify bot pushed a commit to crate/crate that referenced this issue May 24, 2019

Mute coordinator test related to #41967

ac3f840

See elastic/elasticsearch#41967.

DaveCTurner closed this as completed in #42504 May 24, 2019

DaveCTurner added a commit that referenced this issue May 24, 2019

Remove AwaitsFix of #41967 following #42504

a5b6ed8
