TEST: ClusterDisruptionIT testSendingShardFailure fails #36428

Closed
ywelsch opened this issue Dec 10, 2018 · 2 comments

Comments

ywelsch (Contributor) commented Dec 10, 2018

Test failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/638/console

The issue exhibited by the test is as follows: a simulated disconnect occurs between follower and leader while the follower is sending a shard failure. The follower is reconnected to the leader through a follower check, but does not remove the NO_MASTER_BLOCK from its cluster state, which means that sending the shard failure is not retried. Unfortunately, removing the NO_MASTER_BLOCK on becoming follower is not enough either: the shard failure is still not resent, because the cluster state version is not incremented.
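
To illustrate why the resend never happens, here is a minimal, self-contained sketch (hypothetical class and method names, not the actual ShardStateAction/ClusterStateObserver code): it assumes the retry only fires when the locally applied cluster state has moved on, i.e. a higher version or a different master, so rejoining the same leader at the same cluster state version never triggers it.

```java
import java.util.Objects;

public class ShardFailureRetrySketch {

    // Stand-in for the cluster state as applied on the follower node.
    static final class ClusterStateView {
        final long version;
        final String masterNodeId; // null while the NO_MASTER_BLOCK is in place

        ClusterStateView(long version, String masterNodeId) {
            this.version = version;
            this.masterNodeId = masterNodeId;
        }
    }

    // Captures the state observed when the shard-failure request was first sent and
    // decides whether a newly applied state should trigger a resend.
    static final class RetryPredicate {
        private final long versionAtSendTime;
        private final String masterAtSendTime;

        RetryPredicate(ClusterStateView atSendTime) {
            this.versionAtSendTime = atSendTime.version;
            this.masterAtSendTime = atSendTime.masterNodeId;
        }

        boolean shouldResend(ClusterStateView applied) {
            if (applied.masterNodeId == null) {
                return false; // still no master to send the shard failure to
            }
            return applied.version > versionAtSendTime
                    || Objects.equals(applied.masterNodeId, masterAtSendTime) == false;
        }
    }

    public static void main(String[] args) {
        // Shard failure first sent while following node_t1 at cluster state version 5.
        RetryPredicate retry = new RetryPredicate(new ClusterStateView(5, "node_t1"));

        // Node becomes candidate and puts up the NO_MASTER_BLOCK: no resend.
        System.out.println(retry.shouldResend(new ClusterStateView(5, null)));      // false
        // Even if becoming follower again removed the block, the version is unchanged
        // and the master is the same node, so still no resend.
        System.out.println(retry.shouldResend(new ClusterStateView(5, "node_t1"))); // false
        // Only a cluster state that has actually moved on would trigger the resend.
        System.out.println(retry.shouldResend(new ClusterStateView(6, "node_t1"))); // true
    }
}
```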

Relevant log lines:

1> [2018-12-10T05:35:00,225][DEBUG][o.e.c.a.s.ShardStateAction] [testSendingShardFailure] sending [internal:cluster/shard/failure] to [ir1zPCLYRTikvc1ljjbFdw] for shard entry [shard id [[test][0]], allocation id [BSJMH3QrSAOmXvAY67aymg], primary term [0], message [simulated], failure [CorruptIndexException[simulated (resource=null)]], markAsStale [true]]
  1> [2018-12-10T05:35:00,226][DEBUG][o.e.c.c.Coordinator      ] [node_t0] onLeaderFailure: becoming CANDIDATE (was FOLLOWER, lastKnownLeader was [Optional[{node_t1}{ir1zPCLYRTikvc1ljjbFdw}{MXpZkxX8TpWwVU5N02B_ew}{127.0.0.1}{127.0.0.1:35909}]])
...
  1> [2018-12-10T05:35:00,230][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] processing [becoming candidate: onLeaderFailure]: execute
  1> [2018-12-10T05:35:00,230][TRACE][o.e.c.s.ClusterApplierService] [node_t0] cluster state updated, source [becoming candidate: onLeaderFailure]
...
1> [2018-12-10T05:35:00,242][DEBUG][o.e.d.ClusterDisruptionIT] [testSendingShardFailure] ensuring cluster is stable with [3] nodes. access node: [node_t0]. timeout: [30s]
  1> [2018-12-10T05:35:00,242][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [testSendingShardFailure] no known master node, scheduling a retry
  1> [2018-12-10T05:35:00,466][DEBUG][o.e.c.c.Coordinator      ] [node_t0] onFollowerCheckRequest: becoming FOLLOWER of [{node_t1}{ir1zPCLYRTikvc1ljjbFdw}{MXpZkxX8TpWwVU5N02B_ew}{127.0.0.1}{127.0.0.1:35909}] (was CANDIDATE, lastKnownLeader was [Optional[{node_t1}{ir1zPCLYRTikvc1ljjbFdw}{MXpZkxX8TpWwVU5N02B_ew}{127.0.0.1}{127.0.0.1:35909}]])
  1> [2018-12-10T05:35:00,470][DEBUG][o.e.c.c.LeaderChecker    ] [node_t0] closed check scheduler woken up, doing nothing
  1> [2018-12-10T05:35:30,244][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node_t0] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
  1> [2018-12-10T05:35:30,245][INFO ][o.e.d.ClusterDisruptionIT] [testSendingShardFailure] [ClusterDisruptionIT#testSendingShardFailure]: cleaning up after test
  1> [2018-12-10T05:35:30,245][INFO ][o.e.t.InternalTestCluster] [testSendingShardFailure] Clearing active scheme network disruption (disruption type: network disconnects, disrupted links: two partitions (partition 1: [node_t0] and partition 2: [node_t2, node_t1])), expected healing time 0s
  1> [2018-12-10T05:35:30,268][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [testSendingShardFailure] no known master node, scheduling a retry
  1> [2018-12-10T05:36:00,268][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_t0] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
@ywelsch ywelsch added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Dec 10, 2018
@ywelsch ywelsch self-assigned this Dec 10, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

ywelsch (Contributor, Author) commented Dec 10, 2018

Muted the test in f79e602

ywelsch added a commit that referenced this issue Dec 10, 2018
ywelsch added a commit that referenced this issue Dec 11, 2018
Deals with a situation where a follower becomes disconnected from the leader, but only for such a short time that it becomes candidate and puts up a NO_MASTER_BLOCK, and then receives a follower check from the leader. If the leader has not noticed the node disconnecting, it is important for the node not to be turned back into a follower, but to try and join the leader again.

We should still prefer to turn the node into a follower on a follower check when that check triggers a term bump, as this can help during a leader election: a newly elected leader can quickly turn all other nodes into followers, even before it has had the chance to transfer a possibly very large cluster state.

Closes #36428
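
For context on the fix described in the commit message, here is a rough sketch (hypothetical names and simplified structure, not the actual Coordinator code) of the intended follower-check handling: a check that bumps the term turns the node into a follower immediately, while a same-term check arriving at a candidate is rejected so that the node attempts to join the leader again.

```java
public class FollowerCheckSketch {

    enum Mode { CANDIDATE, FOLLOWER, LEADER }

    private Mode mode = Mode.CANDIDATE; // this node noticed the disconnect and became candidate
    private long currentTerm = 7;

    // Simplified handling of a follower check received from the leader.
    void onFollowerCheckRequest(long checkTerm, String leaderNodeId) {
        if (checkTerm > currentTerm) {
            // Term bump: accept right away so a freshly elected leader can turn the
            // other nodes into followers before shipping a possibly very large state.
            currentTerm = checkTerm;
            becomeFollower(leaderNodeId);
        } else if (mode == Mode.FOLLOWER) {
            // Same term and already following this leader: nothing to do.
        } else {
            // Same term, but we are a candidate because we noticed the disconnect
            // while the leader did not. Do not silently flip back to follower;
            // reject the check and try to join the leader again, so that the leader
            // publishes a fresh cluster state to this node.
            rejectAndRejoin(leaderNodeId);
        }
    }

    private void becomeFollower(String leaderNodeId) {
        mode = Mode.FOLLOWER;
        System.out.println("becoming FOLLOWER of " + leaderNodeId + " in term " + currentTerm);
    }

    private void rejectAndRejoin(String leaderNodeId) {
        // In the real system this would fail the check and kick off a join attempt.
        System.out.println("rejecting follower check, re-joining " + leaderNodeId + " in term " + currentTerm);
    }

    public static void main(String[] args) {
        FollowerCheckSketch node = new FollowerCheckSketch();
        node.onFollowerCheckRequest(7, "node_t1"); // same term while candidate -> rejoin, not follow
        node.onFollowerCheckRequest(8, "node_t1"); // term bump -> become follower
    }
}
```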