Skip to content

ClusterDisruptionIT.testAckedIndexing failure #53064

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
imotov opened this issue Mar 3, 2020 · 2 comments · Fixed by #53169
Closed

ClusterDisruptionIT.testAckedIndexing failure #53064

imotov opened this issue Mar 3, 2020 · 2 comments · Fixed by #53169
Assignees
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI

Comments

@imotov
Copy link
Contributor

imotov commented Mar 3, 2020

Failure in 7.x https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_RUNTIME_JAVA=java11,nodes=general-purpose/548/console

11:26:04   2> REPRODUCE WITH: ./gradlew ':server:integTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing" -Dtests.seed=A1089782D65D0286 -Dtests.security.manager=true -Dtests.locale=ar-SD -Dtests.timezone=America/Punta_Arenas -Dcompiler.java=13
11:26:04   2> java.lang.AssertionError: ClusterRerouteResponse failed - not acked
11:26:04     Expected: <true>
11:26:04          but: was <false>
11:26:04         at __randomizedtesting.SeedInfo.seed([A1089782D65D0286:2BC923718A20E4CD]:0)
11:26:04         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
11:26:04         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:115)
11:26:04         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:103)
11:26:04         at org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing(ClusterDisruptionIT.java:230)

I cannot find any other failures of this test recently. It fails in a different spot then before in #41068 so might be unrelated.

@imotov imotov added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Mar 3, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@henningandersen henningandersen self-assigned this Mar 4, 2020
@henningandersen
Copy link
Contributor

Adding a sleep when marking nodes faulty makes this reproduce 7/10 times on my CI:

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java b/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java
index 390b7a4cbde..0795b7bfc64 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java
@@ -354,6 +354,11 @@ public class FollowersChecker {
             transportService.getThreadPool().generic().execute(new Runnable() {
                 @Override
                 public void run() {
+                    try {
+                        Thread.sleep(10);
+                    } catch (InterruptedException e) {
+                        e.printStackTrace();
+                    }
                     synchronized (mutex) {
                         if (running() == false) {
                             logger.trace("{} no longer running, not marking faulty", FollowerChecker.this);

Will find a workaround for this specific case.

We discussed this at distributed sync and the issue is that any disruption style test risk seeing nodes disconnect after the disruption has been stopped, since the follower check's marking faulty can be delayed. We saw no easy general solution to this, but discussed following:

  • Wait until all transport requests have been responded to.
  • Wait until all threads are idle (or at least all current processing is done). This is tricky with netty and other threads outside ThreadPool
  • assertBusy all the things or similar. We might need an assertBusy ignoring other exceptions than assertions.

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Mar 5, 2020
Use assertBusy when doing reroute after bridged disruption,
since it can return non-acked if a node is marked faulty
by follower check after disruption ended.

Closes elastic#53064
henningandersen added a commit that referenced this issue Mar 6, 2020
Use assertBusy when doing reroute after bridged disruption,
since it can return non-acked if a node is marked faulty
by follower check after disruption ended.

Closes #53064
henningandersen added a commit that referenced this issue Mar 6, 2020
Use assertBusy when doing reroute after bridged disruption,
since it can return non-acked if a node is marked faulty
by follower check after disruption ended.

Closes #53064
henningandersen added a commit that referenced this issue Mar 6, 2020
Use assertBusy when doing reroute after bridged disruption,
since it can return non-acked if a node is marked faulty
by follower check after disruption ended.

Closes #53064
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants