ClusterDisruptionIT.testAckedIndexing failure #53064

imotov · 2020-03-03T17:02:57Z

Failure in 7.x https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_RUNTIME_JAVA=java11,nodes=general-purpose/548/console

11:26:04   2> REPRODUCE WITH: ./gradlew ':server:integTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing" -Dtests.seed=A1089782D65D0286 -Dtests.security.manager=true -Dtests.locale=ar-SD -Dtests.timezone=America/Punta_Arenas -Dcompiler.java=13
11:26:04   2> java.lang.AssertionError: ClusterRerouteResponse failed - not acked
11:26:04     Expected: <true>
11:26:04          but: was <false>
11:26:04         at __randomizedtesting.SeedInfo.seed([A1089782D65D0286:2BC923718A20E4CD]:0)
11:26:04         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
11:26:04         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:115)
11:26:04         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:103)
11:26:04         at org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing(ClusterDisruptionIT.java:230)

I cannot find any other recent failures of this test. It fails in a different spot than before in #41068, so it might be unrelated.

elasticmachine · 2020-03-03T17:02:59Z

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

henningandersen · 2020-03-05T13:16:37Z

Adding a sleep when marking nodes faulty makes this reproduce 7/10 times on my CI:

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java b/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java
index 390b7a4cbde..0795b7bfc64 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java
@@ -354,6 +354,11 @@ public class FollowersChecker {
             transportService.getThreadPool().generic().execute(new Runnable() {
                 @Override
                 public void run() {
+                    try {
+                        Thread.sleep(10);
+                    } catch (InterruptedException e) {
+                        e.printStackTrace();
+                    }
                     synchronized (mutex) {
                         if (running() == false) {
                             logger.trace("{} no longer running, not marking faulty", FollowerChecker.this);

Will find a workaround for this specific case.

We discussed this at distributed sync. The issue is that any disruption-style test risks seeing nodes disconnect after the disruption has been stopped, since the follower check's marking of a node as faulty can be delayed. We saw no easy general solution to this, but discussed the following options:

- Wait until all transport requests have been responded to.
- Wait until all threads are idle (or at least all current processing is done). This is tricky with netty and other threads outside ThreadPool.
- assertBusy all the things or similar. We might need an assertBusy that also ignores exceptions other than assertion failures.
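
To illustrate the last option, here is a minimal, self-contained sketch of an assertBusy-style retry wrapped around a reroute that is not acked at first. It is not the Elasticsearch test framework's own assertBusy and not the actual fix. The `retryOnAnyException` flag, the `simulatedReroute` stand-in, and the backoff values are all assumptions made up for the example.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AssertBusySketch {

    /** A check that may throw, shaped like a block of test assertions. */
    @FunctionalInterface
    interface CheckedRunnable {
        void run() throws Exception;
    }

    /**
     * Re-runs the check with capped exponential backoff until it passes or the
     * timeout elapses. AssertionErrors are always retried. Other exceptions are
     * retried only when retryOnAnyException is true (hypothetical flag).
     */
    static void assertBusy(CheckedRunnable check, long maxWait, TimeUnit unit,
                           boolean retryOnAnyException) throws Exception {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        Throwable last = null;
        while (true) {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError e) {
                last = e;
            } catch (Exception e) {
                if (retryOnAnyException == false) {
                    throw e; // fail fast on non-assertion failures
                }
                last = e;
            }
            if (System.nanoTime() - deadline >= 0) {
                break; // out of time, report the last failure
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
        throw new AssertionError("condition not met before timeout", last);
    }

    /** Stand-in for a reroute request: not acked on the first two attempts. */
    static boolean simulatedReroute(AtomicInteger attempts) {
        return attempts.incrementAndGet() >= 3;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        assertBusy(() -> {
            if (simulatedReroute(attempts) == false) {
                throw new AssertionError("ClusterRerouteResponse failed - not acked");
            }
        }, 5, TimeUnit.SECONDS, false);
        System.out.println("acked after " + attempts.get() + " attempts");
    }
}
```

With `retryOnAnyException` set to false, only assertion failures are retried and anything else fails the test immediately. Setting it to true would also tolerate transient non-assertion errors, at the cost of possibly hiding real bugs until the timeout.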

Use assertBusy when doing reroute after bridged disruption, since it can return non-acked if a node is marked faulty by follower check after disruption ended. Closes elastic#53064

imotov added the >test-failure (Triaged test failures from CI) and :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection) labels Mar 3, 2020

henningandersen self-assigned this Mar 4, 2020

henningandersen mentioned this issue Mar 5, 2020

Fix ClusterDisruptionIT.testAckedIndexing #53169

Merged

henningandersen closed this as completed in #53169 Mar 6, 2020

DaveCTurner mentioned this issue Sep 14, 2020

[CI] CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower failure #61711

Closed
