[CI] CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower failure #61711

Closed
henningandersen opened this issue Aug 31, 2020 · 5 comments · Fixed by #62954
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI v7.10.0

Comments

@henningandersen
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/vhneniggttxdq

Repro line:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower" -Dtests.seed=D5EE617ADD52F575 -Dtests.security.manager=true -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dtests.locale=de-CH -Dtests.timezone=Asia/Manila -Druntime.java=8

Reproduces locally?:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower" -Dtests.seed=D5EE617ADD52F575 -Dtests.security.manager=true -Dtests.locale=de-CH -Dtests.timezone=Asia/Manila -Druntime.java=8

Yes, reproduces locally for me.

Applicable branches:

Failed in 7.x only (but only failed once).

Failure history:

https://build-stats.elastic.co/goto/0d9728f81d0a7966d908aae65e4751d4
Has failed only once, and only recently.

Failure excerpt:

org.elasticsearch.cluster.coordination.CoordinatorTests > testDoesNotPerformElectionWhenRestartingFollower FAILED
    java.lang.AssertionError: node1 is a follower of node0
    Expected: is <FOLLOWER>
         but: was <CANDIDATE>
        at __randomizedtesting.SeedInfo.seed([D5EE617ADD52F575:489C4247AC88A7A0]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:542)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:502)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testDoesNotPerformElectionWhenRestartingFollower(CoordinatorTests.java:1561)
@henningandersen henningandersen added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v7.10.0 labels Aug 31, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 31, 2020
@fcofdez fcofdez self-assigned this Aug 31, 2020
@fcofdez
Contributor

fcofdez commented Sep 1, 2020

It seems like this is related to the changes introduced in #60872.
In this build there are two nodes:

  • node0: master-eligible node
  • node1: non master-eligible node

During the AbstractCoordinatorTestCase.Cluster#runRandomly() phase, node1 is unable to join the cluster because the connection is closed/black-holed a few times. The problematic part can be summarized as follows:

  1. node1 attempts to join the cluster by sending an internal:cluster/coordination/join message to node0, adding a new in-flight join request between node1 and node0.
  2. node0 receives the join request, establishes a connection with node1, and then sends an internal:cluster/coordination/join/validate message to node1.
  3. node1 processes the internal:cluster/coordination/join/validate request, but the response is dropped because the connection is black-holed at that point. The current behavior for black-holed messages is to trigger a connection-closed exception after 1 day.
  4. The test moves on to the AbstractCoordinatorTestCase.Cluster#stabilise() phase, and node1 attempts to join the cluster again, but since there is still an in-flight join request that was never removed from the in-flight set, the new request is bypassed.
  5. Since there are only two nodes in this test, node1 is unable to join the cluster, yet the test expects node1 to be a FOLLOWER after the cluster stabilises.
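The wedged state in steps 1–4 can be sketched in a few lines. This is a minimal illustration, not the real JoinHelper code: the class and method names (`InFlightJoinSketch`, `sendJoin`, `onJoinComplete`) are hypothetical, and the point is only that a deduplication set whose cleanup handler never fires will silently swallow every retry.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a join attempt is skipped while an identical request
// is still "in flight", so a dropped (black-holed) response wedges all retries.
class InFlightJoinSketch {
    record JoinKey(String source, String target) {}

    private final Set<JoinKey> inFlight = new HashSet<>();

    /** Returns true if a join request was actually sent. */
    boolean sendJoin(String source, String target) {
        // add() returns false if an equal key is already present
        return inFlight.add(new JoinKey(source, target));
    }

    /** Response/failure handler; never runs if the response is black-holed. */
    void onJoinComplete(String source, String target) {
        inFlight.remove(new JoinKey(source, target));
    }

    public static void main(String[] args) {
        InFlightJoinSketch joins = new InFlightJoinSketch();
        boolean first = joins.sendJoin("node1", "node0"); // step 1: join sent
        // steps 2-3: validate response dropped, onJoinComplete never called
        boolean retry = joins.sendJoin("node1", "node0"); // step 4: retry bypassed
        System.out.println(first + " " + retry);
    }
}
```

With the connection-closed exception deferred for a day, `onJoinComplete` fires far outside the stabilisation window, which is exactly why step 5 fails.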

I'm not sure whether we should reintroduce the timeout from #60872 (which solves the issue), or schedule the ConnectTransportException earlier when we drop messages (more in line with the current TCP keep-alive settings).

@fcofdez
Contributor

fcofdez commented Sep 3, 2020

After discussing this with @original-brownbear, it seems like AbstractCoordinatorTestCase.Cluster#stabilise() should take this edge case into account while it is running pending tasks to stabilize the cluster. It currently accounts for certain failure scenarios; before #60872 was introduced, that time frame was enough to trigger the error handler for the join task and retry.

My proposal is to increase the DEFAULT_STABILISATION_TIME to cover this case, i.e. increase it by 1 day. AFAIK the only side effect of this would be an increase in test execution time, but I would like some feedback on this @ywelsch

@ywelsch
Contributor

ywelsch commented Sep 4, 2020

I'm not excited about increasing DEFAULT_STABILISATION_TIME from a couple of minutes to a day. The reason is that we could then no longer assert anything sensible about the cluster converging within a reasonable amount of time (1 day is not reasonable). We should probably introduce a randomized bad-connection detection timeout (simulating tcp_retries2) that ranges from a minute to a day, and then incorporate that into DEFAULT_STABILISATION_TIME. Let's wait for @DaveCTurner to be back though, I'm sure he will have an opinion here as well.
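The suggestion above could look roughly like the following. This is a sketch under stated assumptions: the constant, method names, and the base value are illustrative (the real DEFAULT_STABILISATION_TIME lives in AbstractCoordinatorTestCase and is computed differently); the idea shown is just "pick the detection delay randomly between a minute and a day, then fold the chosen value into the stabilisation budget".

```java
import java.util.Random;

// Sketch (hypothetical names): randomise the simulated bad-connection
// detection delay (akin to tcp_retries2) and include the chosen value
// in the stabilisation budget, instead of hard-coding a full day.
class StabilisationBudgetSketch {
    static final long BASE_STABILISATION_MILLIS = 5 * 60 * 1000L; // illustrative base

    /** Random delay in [1 minute, 1 day). */
    static long randomDisconnectDetectionMillis(Random random) {
        long min = 60_000L;            // 1 minute
        long max = 24 * 60 * 60_000L;  // 1 day
        return min + (long) (random.nextDouble() * (max - min));
    }

    /** The budget grows with whatever delay this test run drew. */
    static long stabilisationTimeMillis(long disconnectDetectionMillis) {
        return BASE_STABILISATION_MILLIS + disconnectDetectionMillis;
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // reproducible, like -Dtests.seed
        long detection = randomDisconnectDetectionMillis(random);
        System.out.println(stabilisationTimeMillis(detection) - detection == BASE_STABILISATION_MILLIS);
    }
}
```

The design point is that the assertion stays meaningful: the cluster must converge within the base time plus the delay the seed actually drew, rather than within a flat worst-case day.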

@DaveCTurner
Contributor

DaveCTurner commented Sep 14, 2020

IMO this is actually a bug in runRandomly() in that it fails to account for the pending disconnection, leaving the system in a state where stabilisation may fail due to a long-delayed disconnection. In other words, it is not properly respecting its coolDown parameter. My preference would be to keep executing the loop in runRandomly() until there are no pending disconnection events, and to prevent new network disruptions from happening during the cool-down phase (i.e. when finishTime != -1).

This is the same bug that affects testAckedIndexing, see e.g. #53064, although here it is actually tractable to fix since we can track the number of pending disconnection events.
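The proposed fix, drain pending disconnections before handing over to stabilisation and refuse new disruptions during cool-down, can be sketched as below. All names here are hypothetical stand-ins for the deterministic task queue in AbstractCoordinatorTestCase; only the loop condition and the `finishTime != -1` guard reflect the proposal.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Sketch (hypothetical names): runRandomly() keeps draining scheduled work
// until no disconnection events remain pending, and stops injecting new
// disruptions once the cool-down phase begins (finishTime != -1).
class RunRandomlySketch {
    private final Deque<Runnable> pendingTasks = new ArrayDeque<>();
    private int pendingDisconnections = 0;
    private long finishTime = -1; // -1 => cool-down not yet started

    void disruptConnection() {
        if (finishTime != -1) {
            return; // no new disruptions during cool-down
        }
        pendingDisconnections++;
        pendingTasks.add(() -> pendingDisconnections--); // delayed disconnection handler
    }

    void runRandomly(Random random, long coolDownStart) {
        long now = 0;
        // keep looping until the cool-down has begun AND nothing is pending
        while (now < coolDownStart || pendingDisconnections > 0) {
            if (finishTime == -1 && now >= coolDownStart) {
                finishTime = now; // enter cool-down; disruptions now disabled
            }
            if (random.nextBoolean()) {
                disruptConnection();
            }
            Runnable task = pendingTasks.poll();
            if (task != null) {
                task.run();
            }
            now++;
        }
    }

    /** stabilise() may assume this holds when runRandomly() returns. */
    boolean stable() {
        return pendingDisconnections == 0;
    }

    public static void main(String[] args) {
        RunRandomlySketch cluster = new RunRandomlySketch();
        cluster.runRandomly(new Random(42), 100);
        System.out.println(cluster.stable());
    }
}
```

Because disconnection events are counted, termination is tractable here, unlike in testAckedIndexing (#53064), which is the point made above.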

I also see value in randomising the bad-connection-detection timeout, but don't think we should incorporate that into the default stabilisation time.
