[CI] frequent master failures in 6.8 debian CI tests #49057

Closed
hub-cap opened this issue Nov 13, 2019 · 7 comments
Labels
:Distributed Coordination/Network (Http and internode communication implementations)
>test-failure (Triaged test failures from CI)

Comments

@hub-cap
Contributor

hub-cap commented Nov 13, 2019

For the last few days, the CI tests for debian-8 on 6.8 have been failing with various master-node connection issues like the following:

Caused by: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{sFU5z-aFTxCUNRGO6LBcEA}{127.0.0.1}{127.0.0.1:39709}]]

MasterNotDiscoveredException[NotMasterException[no longer master. source: [cluster_health (wait_for_events [LANGUID])]]]; nested: NotMasterException[no longer master. source: [cluster_health (wait_for_events [LANGUID])]];

ElasticsearchException[failed to update minimum master node to [2] (current masters [3])]; nested: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{jWUSUH7hQ1SvZ7K8u0cbuw}{127.0.0.1}{127.0.0.1:44365}]];

See this job link for more information.
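
For context, exceptions like the NoNodeAvailableException above come from a transport client whose configured nodes cannot be reached (or are rejected, e.g. because the cluster name does not match). Below is a minimal 6.x-style sketch of such a client; the port 9300 and cluster name are placeholders, not values from the failing tests, which bind to ephemeral ports such as 39709 above.

import java.net.InetAddress;

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class TransportClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder settings; the CI tests use a randomized cluster name and ephemeral ports.
        Settings settings = Settings.builder()
                .put("cluster.name", "my-test-cluster") // hypothetical cluster name
                .build();

        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300))) {
            // If none of the configured nodes respond (or their cluster name does not match),
            // this request fails with NoNodeAvailableException, as in the CI logs above.
            ClusterHealthResponse health = client.admin().cluster().prepareHealth().get();
            System.out.println(health.getStatus());
        }
    }
}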

@hub-cap hub-cap added >test-failure (Triaged test failures from CI) and :Distributed Indexing/Distributed (A catch all label for anything in the Distributed Indexing Area. Please avoid if you can.) labels Nov 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@ywelsch
Contributor

ywelsch commented Nov 14, 2019

This looks to be an infrastructure issue, with nodes not being able to establish connections between themselves and not detecting dropped connections. There have been no changes to the distributed layer in 6.8, and I suspect this is the same issue as #43387.

[2019-11-12T00:27:34,278][WARN ][o.e.c.NodeConnectionsService] [node_t2] failed to connect to node {node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [node_t1][127.0.0.1:49001] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1316) ~[main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
[2019-11-12T00:27:34,280][WARN ][o.e.c.s.ClusterApplierService] [node_t2] cluster state applier task [apply cluster state (from master [master {node_t2}{pFAtyjVNRm67_HmFVCX0Hg}{nHgCTUdkQ8i0pQj3GLrZjA}{127.0.0.1}{127.0.0.1:41481} committed version [12] source [zen-disco-elected-as-master ([0] nodes joined)]])] took [30s] above the warn threshold of 30s
[2019-11-12T00:27:34,280][WARN ][o.e.c.s.MasterService    ] [node_t2] cluster state update task [zen-disco-elected-as-master ([0] nodes joined)] took [30s] above the warn threshold of 30s
[2019-11-12T00:27:34,282][INFO ][o.e.c.s.MasterService    ] [node_t2] zen-disco-node-failed({node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001}), reason(transport disconnected)[{node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001} transport disconnected], reason: removed {{node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001},}
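
The connect_timeout[30s] in this trace matches the default transport connect timeout, which is why each failed connection attempt holds the test up for a full 30 seconds before it surfaces. As a hedged sketch (not taken from the CI job configuration), a test node could lower that timeout via its settings; transport.tcp.connect_timeout is assumed to be the relevant 6.x setting name and should be verified against the 6.8 documentation.

import org.elasticsearch.common.settings.Settings;

public class ConnectTimeoutSketch {
    // Hypothetical: shorten the transport connect timeout so an unreachable peer is
    // reported faster than the 30s default seen in the log above. The setting name
    // "transport.tcp.connect_timeout" is an assumption for 6.x, not confirmed here.
    static final Settings NODE_SETTINGS = Settings.builder()
            .put("transport.tcp.connect_timeout", "5s")
            .build();
}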

@ywelsch ywelsch added :Delivery/Build (Build or test infrastructure) and removed :Distributed Indexing/Distributed (A catch all label for anything in the Distributed Indexing Area. Please avoid if you can.) labels Nov 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@jaymode
Member

jaymode commented Nov 19, 2019

Looks like the infrastructure issues are resolved and these jobs have been passing. Closing for now.

@jaymode jaymode closed this as completed Nov 19, 2019
@pgomulka
Contributor

pgomulka commented Jan 7, 2020

It looks like this has recurred. One example: https://gradle-enterprise.elastic.co/s/atangcvtavd4k
Almost all runs of this job have failed this year: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+multijob-unix-compatibility/os=debian-8&&immutable/
The message in the logs is the same as in the first comment:

java.lang.IllegalStateException: cluster failed to form with expected nodes

Was there an infra issue last time? If so, we can notify the infra team that the problem is recurring.

@pgomulka pgomulka reopened this Jan 7, 2020
@jaymode
Member

jaymode commented Jan 8, 2020

This is manifesting the same way again with connection timeouts, and it looks like #43387 has also started appearing again. @original-brownbear has an open issue about looking into the possibility of testing this on different infrastructure, due to the networking issues we see on Debian 8/9 and GCP. These failures look different from #43387, but I suspect that is due to the use of a different transport; the 6.8 failures appear to be using the MockTcpTransport, which is blocking, so we will not see the logging messages that the MockNioTransport produces in 7.x+.
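
To illustrate the blocking vs. non-blocking distinction in generic Java (this is not the MockTcpTransport or MockNioTransport code): a blocking connect waits inside the connect call until it succeeds or times out, so there is nothing in between to log, while a selector-based connect returns immediately and the pending attempt can be observed, and logged, while it is still in flight. The host and port below are placeholders.

import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class ConnectStyles {
    public static void main(String[] args) throws Exception {
        InetSocketAddress addr = new InetSocketAddress("127.0.0.1", 39709); // placeholder address

        // Blocking style (roughly what a blocking transport does): the thread sits inside
        // connect() for up to 30s; on failure all we ever see is the final exception.
        try (Socket socket = new Socket()) {
            socket.connect(addr, 30_000);
        } catch (Exception e) {
            System.out.println("blocking connect failed: " + e);
        }

        // Non-blocking style (roughly what an NIO transport does): connect() returns at once
        // and the pending attempt is visible to the selector loop, which can log its progress.
        try (Selector selector = Selector.open(); SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            channel.connect(addr);
            channel.register(selector, SelectionKey.OP_CONNECT);
            while (selector.select(1_000) == 0) {
                System.out.println("connection to " + addr + " still pending..."); // observable state
            }
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable()) {
                    System.out.println("finishConnect: " + ((SocketChannel) key.channel()).finishConnect());
                }
            }
        } catch (Exception e) {
            System.out.println("non-blocking connect failed: " + e);
        }
    }
}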

@jaymode jaymode added :Distributed Coordination/Network (Http and internode communication implementations) and removed :Delivery/Build (Build or test infrastructure) labels Jan 8, 2020
@original-brownbear
Contributor

Closing this as a duplicate of #43387; both have the same cause.
