Reduce follower cluster state lag timeout for disruption tests #108691

Merged

Conversation

DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented May 15, 2024

It's possible for a node-left task to get interrupted prior to removing
the node from the master's list of faultyNodes. Nodes on the faultyNodes
list do not receive cluster state updates, and are eventually removed.

Subsequently, when the node attempts to rejoin after the test network
disruptions have ceased, the node-join request can succeed, but the
node never receives the cluster state update. It therefore considers
the node-join a failure and resends node-join requests until the
LagDetector removes the node from the faultyNodes list.
#108690 will address the node-join issue.

Closes #91447
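
To illustrate why a shorter lag timeout helps the disruption tests, here is a minimal toy model of the sequence described above. It is not Elasticsearch code: `ToyMaster`, `markFaulty`, `runLagCheck`, `tryJoin`, and `millisUntilJoin` are hypothetical names invented for this sketch, and the timeout and retry values are illustrative only. The model assumes a node stuck on the faulty list can only complete a join once a lag check has cleared its entry.

```java
import java.util.HashMap;
import java.util.Map;

public class LagTimeoutSketch {

    /** Toy stand-in for the master's bookkeeping; hypothetical, not Elasticsearch code. */
    static final class ToyMaster {
        private final long lagTimeoutMillis;
        // node id -> time (ms) at which the node was put on the faulty list
        private final Map<String, Long> faultySince = new HashMap<>();

        ToyMaster(long lagTimeoutMillis) {
            this.lagTimeoutMillis = lagTimeoutMillis;
        }

        /** Models an interrupted node-left task leaving the node on the faulty list. */
        void markFaulty(String nodeId, long nowMillis) {
            faultySince.putIfAbsent(nodeId, nowMillis);
        }

        /** Lag-detector stand-in: clear entries that have been faulty for the full timeout. */
        void runLagCheck(long nowMillis) {
            faultySince.values().removeIf(since -> nowMillis - since >= lagTimeoutMillis);
        }

        /** In this model a join only completes if the node is no longer on the faulty list. */
        boolean tryJoin(String nodeId, long nowMillis) {
            runLagCheck(nowMillis);
            return !faultySince.containsKey(nodeId);
        }
    }

    /** Returns the simulated time at which a retrying node finally completes its join. */
    static long millisUntilJoin(long lagTimeoutMillis, long retryIntervalMillis) {
        ToyMaster master = new ToyMaster(lagTimeoutMillis);
        master.markFaulty("node-1", 0L);
        long now = 0L;
        while (!master.tryJoin("node-1", now)) {
            now += retryIntervalMillis; // the node resends its join request
        }
        return now;
    }

    public static void main(String[] args) {
        // Illustrative values: a long timeout versus a reduced one, retrying every second.
        System.out.println("long timeout:    " + millisUntilJoin(90_000L, 1_000L) + " ms");
        System.out.println("reduced timeout: " + millisUntilJoin(5_000L, 1_000L) + " ms");
    }
}
```

In this model the rejoin delay is bounded by the lag timeout, so reducing the timeout shortens how long a test has to wait for the cluster to recover.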


Much belatedly circling back to this.

@DiannaHohensee DiannaHohensee added the >test, :Distributed Coordination/Cluster Coordination, and Team:Distributed (Obsolete) labels May 15, 2024
@DiannaHohensee DiannaHohensee self-assigned this May 15, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM, one suggestion

@DiannaHohensee DiannaHohensee merged commit 4700027 into elastic:main May 15, 2024
15 checks passed
Labels
:Distributed Coordination/Cluster Coordination
Team:Distributed (Obsolete)
>test
v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] ClusterDisruptionIT testAckedIndexing failing
3 participants