Fix CoordinatorTests.testUnresponsiveLeaderDetectedEventually #64462
Take into account messy scenarios in 5-node cluster elections where multiple nodes can trigger an election concurrently, meaning that it takes longer to stabilise the cluster and elect a leader. Fixes elastic#63918
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
LGTM
Hmm, actually I think we need another defaultMillis(PUBLISH_TIMEOUT_SETTING) too: the failing elections all try to publish to the unresponsive leader and must therefore wait for a timeout.
I'm not sure we need the additional publish timeout, since those publications either go through as expected (they have a quorum) or fail due to term bumps. Am I missing something?
Yes, until the master is properly established each publication will go to the unresponsive node and therefore wait for the publish timeout before proceeding. That node is only removed from the cluster once the elections have settled down. In the failing test, we blackhole
(second column is times relative to the blackhole time) Until version 11, all publications go to
Thanks for the explanation @DaveCTurner! I was missing that a publication waits until the timeout before it succeeds, even if it already has a quorum. I've updated the PR.
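To make the reasoning in this thread concrete, here is a minimal sketch of how a stabilisation budget grows when every election attempt also has to wait out a publish timeout on the unresponsive node. It is a simplified model, not the expression used in CoordinatorTests: the class `StabilisationBudget`, the method `budgetMillis`, its parameters, and the numbers in `main` are all hypothetical and chosen only for illustration.

```java
/**
 * Hypothetical helper illustrating the timeout arithmetic discussed above.
 * Not part of Elasticsearch; all names and values are assumptions.
 */
final class StabilisationBudget {

    private StabilisationBudget() {}

    /**
     * @param detectionMillis      assumed time for followers to notice the unresponsive leader
     * @param electionMillis       assumed time allowed for a single election attempt
     * @param publishTimeoutMillis assumed publish timeout that each publication waits for
     *                             while the unresponsive node is still in the cluster
     * @param conflictingElections number of election attempts assumed to fail before one sticks
     * @return total time budget in milliseconds under this simplified model
     */
    static long budgetMillis(long detectionMillis,
                             long electionMillis,
                             long publishTimeoutMillis,
                             int conflictingElections) {
        if (conflictingElections < 0) {
            throw new IllegalArgumentException("conflictingElections must be >= 0");
        }
        // Every attempt (the conflicting ones plus the final successful one)
        // is modelled as one election followed by one publication that waits
        // for the publish timeout.
        long attempts = conflictingElections + 1L;
        return detectionMillis + attempts * (electionMillis + publishTimeoutMillis);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        long withoutConflict = budgetMillis(1_000, 500, 30_000, 0);
        long withOneConflict = budgetMillis(1_000, 500, 30_000, 1);
        System.out.println("no conflicting election:  " + withoutConflict + " ms");
        System.out.println("one conflicting election: " + withOneConflict + " ms");
    }
}
```

Under this model, each additional conflicting election adds one election period plus one publish timeout to the budget, which is why the review asks for an extra publish timeout term in the test's stabilisation time.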
LGTM
Take into account messy scenarios in 5-node cluster elections where multiple nodes can trigger an election concurrently, meaning that it takes longer to stabilise the cluster and elect a leader. Fixes elastic#63918. Backport of elastic#64462
Today we require the cluster to stabilise in a time period that allows time for the first election to encounter conflicts. However on very rare occasions there might be an election conflict in the second election too. This commit extends the stabilisation timeout to allow for this. Similar to elastic#64462 Closes elastic#78370
Today we require the cluster to stabilise in a time period that allows time for the first election to encounter conflicts. However on very rare occasions there might be an election conflict in the second election too. This commit extends the stabilisation timeout to allow for this. Similar to #64462 Closes #78370 Co-authored-by: Elastic Machine <[email protected]>