[CI] CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower failure #61711

Closed
henningandersen opened this issue Aug 31, 2020 · 5 comments · Fixed by #62954
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI v7.10.0

Comments

@henningandersen
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/vhneniggttxdq

Repro line:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower" -Dtests.seed=D5EE617ADD52F575 -Dtests.security.manager=true -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dtests.locale=de-CH -Dtests.timezone=Asia/Manila -Druntime.java=8

Reproduces locally?:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower" -Dtests.seed=D5EE617ADD52F575 -Dtests.security.manager=true -Dtests.locale=de-CH -Dtests.timezone=Asia/Manila -Druntime.java=8

Yes, reproduces locally for me.

Applicable branches:

Failed in 7.x only (but only failed once).

Failure history:

https://build-stats.elastic.co/goto/0d9728f81d0a7966d908aae65e4751d4
Has failed only once, and only recently.

Failure excerpt:

org.elasticsearch.cluster.coordination.CoordinatorTests > testDoesNotPerformElectionWhenRestartingFollower FAILED
    java.lang.AssertionError: node1 is a follower of node0
    Expected: is <FOLLOWER>
         but: was <CANDIDATE>
        at __randomizedtesting.SeedInfo.seed([D5EE617ADD52F575:489C4247AC88A7A0]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:542)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:502)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower(CoordinatorTests.java:1561)
@henningandersen henningandersen added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v7.10.0 labels Aug 31, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 31, 2020
@fcofdez fcofdez self-assigned this Aug 31, 2020
@fcofdez
Contributor

fcofdez commented Sep 1, 2020

It seems like this is related to the changes introduced in #60872.
In this build there are two nodes:

  • node0: master-eligible node
  • node1: non master-eligible node

During the AbstractCoordinatorTestCase.Cluster#runRandomly() phase, node1 is unable to join the cluster because the connection is closed/black-holed a few times. The problematic part can be summarized as follows:

  1. node1 attempts to join the cluster by sending an internal:cluster/coordination/join message to node0, adding a new in-flight join request between node1 and node0.
  2. node0 receives the join request, establishes a connection with node1, and then sends an internal:cluster/coordination/join/validate message to node1.
  3. node1 processes the internal:cluster/coordination/join/validate request, but the response is dropped because the connection is black-holed at that point. The current behavior for black-holed messages is to trigger a connection-closed exception after 1 day.
  4. The test moves on to the AbstractCoordinatorTestCase.Cluster#stabilise() phase, and node1 attempts to join the cluster again, but since there is still an in-flight join request that was never removed from the in-flight set, the new request is bypassed.
  5. Since there are only two nodes in this test, node1 is unable to join the cluster, yet the test expects node1 to be a FOLLOWER after the cluster stabilises.
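The wedged state in steps 1–4 can be sketched in a few lines. This is a minimal illustration, not the real JoinHelper code: the class and method names (`InFlightJoinSketch`, `sendJoin`, `onJoinComplete`) are hypothetical, and the point is only that a deduplication set whose cleanup handler never fires will silently swallow every retry.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a join attempt is skipped while an identical request
// is still "in flight", so a dropped (black-holed) response wedges all retries.
class InFlightJoinSketch {
    record JoinKey(String source, String target) {}

    private final Set<JoinKey> inFlight = new HashSet<>();

    /** Returns true if a join request was actually sent. */
    boolean sendJoin(String source, String target) {
        // add() returns false if an equal key is already present
        return inFlight.add(new JoinKey(source, target));
    }

    /** Response/failure handler; never runs if the response is black-holed. */
    void onJoinComplete(String source, String target) {
        inFlight.remove(new JoinKey(source, target));
    }

    public static void main(String[] args) {
        InFlightJoinSketch joins = new InFlightJoinSketch();
        boolean first = joins.sendJoin("node1", "node0"); // step 1: join sent
        // steps 2-3: validate response dropped, onJoinComplete never called
        boolean retry = joins.sendJoin("node1", "node0"); // step 4: retry bypassed
        System.out.println(first + " " + retry);
    }
}
```

With the connection-closed exception deferred for a day, `onJoinComplete` fires far outside the stabilisation window, which is exactly why step 5 fails.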

I'm not sure whether we should reintroduce the timeout from #60872 (which solves the issue), or schedule the ConnectTransportException earlier when we drop messages (more in line with the current TCP keep-alive settings).

@fcofdez
Contributor

fcofdez commented Sep 3, 2020

After discussing this with @original-brownbear, it seems like AbstractCoordinatorTestCase.Cluster#stabilise() should take this edge case into account while it is running pending tasks to stabilize the cluster. It currently accounts for certain failure scenarios; before #60872 was introduced, that time frame was enough to trigger the error handler for the join task and retry.

My proposal is to increase the DEFAULT_STABILISATION_TIME to cover this case, i.e. increase it by 1 day. AFAIK the only side effect of this would be an increase in test execution time, but I would like some feedback on this @ywelsch

@ywelsch
Contributor

ywelsch commented Sep 4, 2020

I'm not excited about increasing DEFAULT_STABILISATION_TIME from a couple of minutes to a day. The reason is that we could then no longer assert anything sensible about the cluster converging within a reasonable amount of time (1 day is not reasonable). We should probably introduce a randomized bad-connection detection timeout (simulating tcp_retries2) that ranges from a minute to a day, and then incorporate that into DEFAULT_STABILISATION_TIME. Let's wait for @DaveCTurner to be back though, I'm sure he will have an opinion here as well.
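The suggestion above could look roughly like the following. This is a sketch under stated assumptions: the constant, method names, and the base value are illustrative (the real DEFAULT_STABILISATION_TIME lives in AbstractCoordinatorTestCase and is computed differently); the idea shown is just "pick the detection delay randomly between a minute and a day, then fold the chosen value into the stabilisation budget".

```java
import java.util.Random;

// Sketch (hypothetical names): randomise the simulated bad-connection
// detection delay (akin to tcp_retries2) and include the chosen value
// in the stabilisation budget, instead of hard-coding a full day.
class StabilisationBudgetSketch {
    static final long BASE_STABILISATION_MILLIS = 5 * 60 * 1000L; // illustrative base

    /** Random delay in [1 minute, 1 day). */
    static long randomDisconnectDetectionMillis(Random random) {
        long min = 60_000L;            // 1 minute
        long max = 24 * 60 * 60_000L;  // 1 day
        return min + (long) (random.nextDouble() * (max - min));
    }

    /** The budget grows with whatever delay this test run drew. */
    static long stabilisationTimeMillis(long disconnectDetectionMillis) {
        return BASE_STABILISATION_MILLIS + disconnectDetectionMillis;
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // reproducible, like -Dtests.seed
        long detection = randomDisconnectDetectionMillis(random);
        System.out.println(stabilisationTimeMillis(detection) - detection == BASE_STABILISATION_MILLIS);
    }
}
```

The design point is that the assertion stays meaningful: the cluster must converge within the base time plus the delay the seed actually drew, rather than within a flat worst-case day.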

@DaveCTurner
Contributor

DaveCTurner commented Sep 14, 2020

IMO this is actually a bug in runRandomly() in that it fails to account for the pending disconnection, leaving the system in a state where stabilisation may fail due to a long-delayed disconnection. In other words, it is not properly respecting its coolDown parameter. My preference would be to keep executing the loop in runRandomly() until there are no pending disconnection events, and to prevent new network disruptions from happening during the cool-down phase (i.e. when finishTime != -1).

This is the same bug that affects testAckedIndexing, see e.g. #53064, although here it is actually tractable to fix since we can track the number of pending disconnection events.
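The proposed fix, drain pending disconnections before handing over to stabilisation and refuse new disruptions during cool-down, can be sketched as below. All names here are hypothetical stand-ins for the deterministic task queue in AbstractCoordinatorTestCase; only the loop condition and the `finishTime != -1` guard reflect the proposal.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Sketch (hypothetical names): runRandomly() keeps draining scheduled work
// until no disconnection events remain pending, and stops injecting new
// disruptions once the cool-down phase begins (finishTime != -1).
class RunRandomlySketch {
    private final Deque<Runnable> pendingTasks = new ArrayDeque<>();
    private int pendingDisconnections = 0;
    private long finishTime = -1; // -1 => cool-down not yet started

    void disruptConnection() {
        if (finishTime != -1) {
            return; // no new disruptions during cool-down
        }
        pendingDisconnections++;
        pendingTasks.add(() -> pendingDisconnections--); // delayed disconnection handler
    }

    void runRandomly(Random random, long coolDownStart) {
        long now = 0;
        // keep looping until the cool-down has begun AND nothing is pending
        while (now < coolDownStart || pendingDisconnections > 0) {
            if (finishTime == -1 && now >= coolDownStart) {
                finishTime = now; // enter cool-down; disruptions now disabled
            }
            if (random.nextBoolean()) {
                disruptConnection();
            }
            Runnable task = pendingTasks.poll();
            if (task != null) {
                task.run();
            }
            now++;
        }
    }

    /** stabilise() may assume this holds when runRandomly() returns. */
    boolean stable() {
        return pendingDisconnections == 0;
    }

    public static void main(String[] args) {
        RunRandomlySketch cluster = new RunRandomlySketch();
        cluster.runRandomly(new Random(42), 100);
        System.out.println(cluster.stable());
    }
}
```

Because disconnection events are counted, termination is tractable here, unlike in testAckedIndexing (#53064), which is the point made above.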

I also see value in randomising the bad-connection-detection timeout, but don't think we should incorporate that into the default stabilisation time.
