TEST: ClusterDisruptionIT testSendingShardFailure fails #36428

Closed
ywelsch opened this issue Dec 10, 2018 · 2 comments

Comments

ywelsch (Contributor) commented Dec 10, 2018

Test failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/638/console

The issue exhibited by the test is as follows: a simulated disconnect occurs between follower and leader while the follower is sending a shard failure. The follower is reconnected to the leader through a follower check, but does not remove the NO_MASTER_BLOCK from its cluster state, which means that sending the shard failure is not retried. Unfortunately, removing the NO_MASTER_BLOCK on becoming follower is not enough either: the shard failure is still not resent, because the cluster state version is not incremented.
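
To illustrate why the resend never happens, here is a minimal, self-contained sketch (hypothetical class and method names, not the actual ShardStateAction/ClusterStateObserver code): it assumes the retry only fires when the locally applied cluster state has moved on, i.e. a higher version or a different master, so rejoining the same leader at the same cluster state version never triggers it.

```java
import java.util.Objects;

public class ShardFailureRetrySketch {

    // Stand-in for the cluster state as applied on the follower node.
    static final class ClusterStateView {
        final long version;
        final String masterNodeId; // null while the NO_MASTER_BLOCK is in place

        ClusterStateView(long version, String masterNodeId) {
            this.version = version;
            this.masterNodeId = masterNodeId;
        }
    }

    // Captures the state observed when the shard-failure request was first sent and
    // decides whether a newly applied state should trigger a resend.
    static final class RetryPredicate {
        private final long versionAtSendTime;
        private final String masterAtSendTime;

        RetryPredicate(ClusterStateView atSendTime) {
            this.versionAtSendTime = atSendTime.version;
            this.masterAtSendTime = atSendTime.masterNodeId;
        }

        boolean shouldResend(ClusterStateView applied) {
            if (applied.masterNodeId == null) {
                return false; // still no master to send the shard failure to
            }
            return applied.version > versionAtSendTime
                    || Objects.equals(applied.masterNodeId, masterAtSendTime) == false;
        }
    }

    public static void main(String[] args) {
        // Shard failure first sent while following node_t1 at cluster state version 5.
        RetryPredicate retry = new RetryPredicate(new ClusterStateView(5, "node_t1"));

        // Node becomes candidate and puts up the NO_MASTER_BLOCK: no resend.
        System.out.println(retry.shouldResend(new ClusterStateView(5, null)));      // false
        // Even if becoming follower again removed the block, the version is unchanged
        // and the master is the same node, so still no resend.
        System.out.println(retry.shouldResend(new ClusterStateView(5, "node_t1"))); // false
        // Only a cluster state that has actually moved on would trigger the resend.
        System.out.println(retry.shouldResend(new ClusterStateView(6, "node_t1"))); // true
    }
}
```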

Relevant log lines:

1> [2018-12-10T05:35:00,225][DEBUG][o.e.c.a.s.ShardStateAction] [testSendingShardFailure] sending [internal:cluster/shard/failure] to [ir1zPCLYRTikvc1ljjbFdw] for shard entry [shard id [[test][0]], allocation id [BSJMH3QrSAOmXvAY67aymg], primary term [0], message [simulated], failure [CorruptIndexException[simulated (resource=null)]], markAsStale [true]]
  1> [2018-12-10T05:35:00,226][DEBUG][o.e.c.c.Coordinator      ] [node_t0] onLeaderFailure: becoming CANDIDATE (was FOLLOWER, lastKnownLeader was [Optional[{node_t1}{ir1zPCLYRTikvc1ljjbFdw}{MXpZkxX8TpWwVU5N02B_ew}{127.0.0.1}{127.0.0.1:35909}]])
...
  1> [2018-12-10T05:35:00,230][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] processing [becoming candidate: onLeaderFailure]: execute
  1> [2018-12-10T05:35:00,230][TRACE][o.e.c.s.ClusterApplierService] [node_t0] cluster state updated, source [becoming candidate: onLeaderFailure]
...
1> [2018-12-10T05:35:00,242][DEBUG][o.e.d.ClusterDisruptionIT] [testSendingShardFailure] ensuring cluster is stable with [3] nodes. access node: [node_t0]. timeout: [30s]
  1> [2018-12-10T05:35:00,242][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [testSendingShardFailure] no known master node, scheduling a retry
  1> [2018-12-10T05:35:00,466][DEBUG][o.e.c.c.Coordinator      ] [node_t0] onFollowerCheckRequest: becoming FOLLOWER of [{node_t1}{ir1zPCLYRTikvc1ljjbFdw}{MXpZkxX8TpWwVU5N02B_ew}{127.0.0.1}{127.0.0.1:35909}] (was CANDIDATE, lastKnownLeader was [Optional[{node_t1}{ir1zPCLYRTikvc1ljjbFdw}{MXpZkxX8TpWwVU5N02B_ew}{127.0.0.1}{127.0.0.1:35909}]])
  1> [2018-12-10T05:35:00,470][DEBUG][o.e.c.c.LeaderChecker    ] [node_t0] closed check scheduler woken up, doing nothing
  1> [2018-12-10T05:35:30,244][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node_t0] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
  1> [2018-12-10T05:35:30,245][INFO ][o.e.d.ClusterDisruptionIT] [testSendingShardFailure] [ClusterDisruptionIT#testSendingShardFailure]: cleaning up after test
  1> [2018-12-10T05:35:30,245][INFO ][o.e.t.InternalTestCluster] [testSendingShardFailure] Clearing active scheme network disruption (disruption type: network disconnects, disrupted links: two partitions (partition 1: [node_t0] and partition 2: [node_t2, node_t1])), expected healing time 0s
  1> [2018-12-10T05:35:30,268][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [testSendingShardFailure] no known master node, scheduling a retry
  1> [2018-12-10T05:36:00,268][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_t0] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
@ywelsch ywelsch added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Dec 10, 2018
@ywelsch ywelsch self-assigned this Dec 10, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

ywelsch (Contributor, Author) commented Dec 10, 2018

Muted the test in f79e602

ywelsch added a commit that referenced this issue Dec 10, 2018
ywelsch added a commit that referenced this issue Dec 11, 2018
Deals with a situation where a follower becomes disconnected from the leader, but only for such a short time that it becomes candidate and puts up a NO_MASTER_BLOCK, and then receives a follower check from the leader. If the leader has not noticed the node disconnecting, it is important for the node not to be turned back into a follower, but to try and join the leader again.

We should still prefer to turn the node into a follower on a follower check when that check triggers a term bump, as this can help during a leader election: a newly elected leader can quickly turn all other nodes into followers, even before it has had the chance to transfer a possibly very large cluster state.

Closes #36428
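
For context on the fix described in the commit message, here is a rough sketch (hypothetical names and simplified structure, not the actual Coordinator code) of the intended follower-check handling: a check that bumps the term turns the node into a follower immediately, while a same-term check arriving at a candidate is rejected so that the node attempts to join the leader again.

```java
public class FollowerCheckSketch {

    enum Mode { CANDIDATE, FOLLOWER, LEADER }

    private Mode mode = Mode.CANDIDATE; // this node noticed the disconnect and became candidate
    private long currentTerm = 7;

    // Simplified handling of a follower check received from the leader.
    void onFollowerCheckRequest(long checkTerm, String leaderNodeId) {
        if (checkTerm > currentTerm) {
            // Term bump: accept right away so a freshly elected leader can turn the
            // other nodes into followers before shipping a possibly very large state.
            currentTerm = checkTerm;
            becomeFollower(leaderNodeId);
        } else if (mode == Mode.FOLLOWER) {
            // Same term and already following this leader: nothing to do.
        } else {
            // Same term, but we are a candidate because we noticed the disconnect
            // while the leader did not. Do not silently flip back to follower;
            // reject the check and try to join the leader again, so that the leader
            // publishes a fresh cluster state to this node.
            rejectAndRejoin(leaderNodeId);
        }
    }

    private void becomeFollower(String leaderNodeId) {
        mode = Mode.FOLLOWER;
        System.out.println("becoming FOLLOWER of " + leaderNodeId + " in term " + currentTerm);
    }

    private void rejectAndRejoin(String leaderNodeId) {
        // In the real system this would fail the check and kick off a join attempt.
        System.out.println("rejecting follower check, re-joining " + leaderNodeId + " in term " + currentTerm);
    }

    public static void main(String[] args) {
        FollowerCheckSketch node = new FollowerCheckSketch();
        node.onFollowerCheckRequest(7, "node_t1"); // same term while candidate -> rejoin, not follow
        node.onFollowerCheckRequest(8, "node_t1"); // term bump -> become follower
    }
}
```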