Fix CoordinatorTests.testUnresponsiveLeaderDetectedEventually #64462

Conversation

Contributor

@fcofdez fcofdez commented Nov 2, 2020

Take into account messy scenarios in 5-node cluster elections where
multiple nodes can trigger an election concurrently, meaning that it
takes longer to stabilise the cluster and elect a leader.

Fixes #63918

@fcofdez fcofdez added the >test, :Distributed Coordination/Cluster Coordination, Team:Distributed (Obsolete), and v7.11.0 labels Nov 2, 2020
@fcofdez fcofdez requested a review from DaveCTurner November 2, 2020 09:45
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

Contributor

@DaveCTurner DaveCTurner left a comment

Hmm, actually I think we need another defaultMillis(PUBLISH_TIMEOUT_SETTING) too -- the failing elections all try to publish to the unresponsive leader and must therefore wait for a timeout.
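
To make the shape of that suggestion concrete, here is a rough sketch of the kind of stabilisation budget being discussed. It is illustrative only: the class name and constants below are hypothetical stand-ins for the test's real defaultMillis(...) values, not the actual CoordinatorTests code.

```java
// Illustrative sketch only, not the actual CoordinatorTests change. The constants are
// hypothetical stand-ins for the defaultMillis(...) values of the corresponding settings.
public final class StabilisationBudgetSketch {

    private static final long FAULT_DETECTION_MILLIS = 30_000; // time to notice the unresponsive leader
    private static final long ELECTION_DELAY_MILLIS = 10_000;  // time for a node to start a new election
    private static final long PUBLISH_TIMEOUT_MILLIS = 30_000; // cluster.publish.timeout, the extra term discussed here

    /**
     * Budget for the cluster to stabilise after the leader is blackholed: detect the failure,
     * then allow each conflicting election round to run a publication that has to wait for the
     * publish timeout because the blackholed node never acknowledges it.
     */
    static long stabilisationMillis(int conflictingElectionRounds) {
        return FAULT_DETECTION_MILLIS
            + conflictingElectionRounds * (ELECTION_DELAY_MILLIS + PUBLISH_TIMEOUT_MILLIS);
    }

    public static void main(String[] args) {
        System.out.println("budget with 2 conflicting election rounds: " + stabilisationMillis(2) + "ms");
    }
}
```

The point is simply that each conflicting election round whose publication still targets the blackholed node contributes roughly one full publish timeout to the time the test must allow.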

Contributor Author

fcofdez commented Nov 2, 2020

I'm not sure whether we need the additional publish timeout, as those publications either go through as expected (since they have a quorum) or fail due to term bumps. Am I missing something?

@DaveCTurner
Contributor

Yes, until the master is properly established each publication will go to the unresponsive node and therefore wait for the publish timeout before proceeding. That node is only removed from the cluster once the elections have settled down.

In the failing test, we blackhole node4 at 433835372ms, and then attempt the following publications:

[433867822ms][+    32450ms][node4][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=3, version=6, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433868232ms][+    32860ms][node3][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=5, version=6, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433897846ms][+    62474ms][node4][o.e.c.c.C.CoordinatorPublication          ] publication ended unsuccessfully: Publication{term=3, version=6}
[433898290ms][+    62918ms][node3][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=5, version=6}
[433898645ms][+    63273ms][node3][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=8, version=7, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433898735ms][+    63363ms][node3][o.e.c.c.C.CoordinatorPublication          ] publication ended unsuccessfully: Publication{term=8, version=7}
[433899213ms][+    63841ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=10, version=8, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433929341ms][+    93969ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=10, version=8}
[433929686ms][+    94314ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=9, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433959731ms][+   124359ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=9}
[433959731ms][+   124359ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=10, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433989746ms][+   154374ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=10}
[433989826ms][+   154454ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=11, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433989975ms][+   154603ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=11}
[433990058ms][+   154686ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=12, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433990366ms][+   154994ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=12}

(the second column shows the time relative to the blackhole time)

Until version 11, all publications go to node4 and therefore take ≥30 seconds because they hit the publish timeout. In particular, this includes the clashing publication of version 6, which is what we want to account for here.
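
To double-check those ≥30-second waits, here is a small snippet that just recomputes the durations from the timestamps quoted in the log above (purely illustrative arithmetic, not part of the test):

```java
// Back-of-the-envelope check of the durations quoted above, using timestamps copied from the log.
public final class PublicationTimings {
    public static void main(String[] args) {
        long blackholedAt = 433_835_372L; // node4 blackholed
        long[][] publications = {         // { publish start, publication end } in ms
            { 433_867_822L, 433_897_846L }, // term=3,  version=6  (ended unsuccessfully)
            { 433_868_232L, 433_898_290L }, // term=5,  version=6  (ended successfully)
            { 433_899_213L, 433_929_341L }, // term=10, version=8  (ended successfully)
            { 433_929_686L, 433_959_731L }, // term=13, version=9  (ended successfully)
        };
        for (long[] p : publications) {
            // Each of these publications still targets the blackholed node4, so even with a quorum
            // it only completes once the ~30s publish timeout fires.
            System.out.printf("started +%dms after blackhole, took %dms%n",
                p[0] - blackholedAt, p[1] - p[0]);
        }
    }
}
```

Each of the four publications takes just over 30,000ms, matching one publish timeout apiece.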

Contributor Author

fcofdez commented Nov 2, 2020

Thanks for the explanation @DaveCTurner! I had missed that a publication waits until the timeout before completing, even if it already has a quorum. I've updated the PR.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@fcofdez fcofdez merged commit e29fc62 into elastic:master Nov 2, 2020
fcofdez added a commit to fcofdez/elasticsearch that referenced this pull request Nov 2, 2020
Take into account messy scenarios in 5-node cluster elections where
multiple nodes can trigger an election concurrently, meaning that it
takes longer to stabilise the cluster and elect a leader.

Fixes elastic#63918
Backport of elastic#64462
@fcofdez fcofdez added the v8.0.0 label Nov 2, 2020
fcofdez added a commit that referenced this pull request Nov 2, 2020
Take into account messy scenarios in 5-node cluster elections where
multiple nodes can trigger an election concurrently, meaning that it
takes longer to stabilise the cluster and elect a leader.

Fixes #63918
Backport of #64462
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 20, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.

Similar to elastic#64462
Closes elastic#78370
elasticsearchmachine pushed a commit that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462
Closes #78370
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to elastic#64462
Closes elastic#78370
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to elastic#64462
Closes elastic#78370
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462
Closes #78370

Co-authored-by: Elastic Machine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462
Closes #78370

Co-authored-by: Elastic Machine <[email protected]>
Labels
:Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection)
Team:Distributed (Obsolete) (Meta label for distributed team, replaced by Distributed Indexing/Coordination)
>test (Issues or PRs that are addressing/adding tests)
v7.11.0
v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

CoordinatorTests.testUnresponsiveLeaderDetectedEventually reproducible failure:
4 participants