[CI] frequent master failures in 6.8 debian CI tests #49057

Closed
hub-cap opened this issue Nov 13, 2019 · 7 comments
Labels
:Distributed Coordination/Network (Http and internode communication implementations)
>test-failure (Triaged test failures from CI)

Comments

@hub-cap
Contributor

hub-cap commented Nov 13, 2019

For the last few days, the CI tests for debian-8 on 6.8 have been failing with various master-node connection issues like the following:

Caused by: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{sFU5z-aFTxCUNRGO6LBcEA}{127.0.0.1}{127.0.0.1:39709}]]

MasterNotDiscoveredException[NotMasterException[no longer master. source: [cluster_health (wait_for_events [LANGUID])]]]; nested: NotMasterException[no longer master. source: [cluster_health (wait_for_events [LANGUID])]];

ElasticsearchException[failed to update minimum master node to [2] (current masters [3])]; nested: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{jWUSUH7hQ1SvZ7K8u0cbuw}{127.0.0.1}{127.0.0.1:44365}]];

See this job link for more information.
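
For context, exceptions like the NoNodeAvailableException above come from a transport client whose configured nodes cannot be reached (or are rejected, e.g. because the cluster name does not match). Below is a minimal 6.x-style sketch of such a client; the port 9300 and cluster name are placeholders, not values from the failing tests, which bind to ephemeral ports such as 39709 above.

import java.net.InetAddress;

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class TransportClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder settings; the CI tests use a randomized cluster name and ephemeral ports.
        Settings settings = Settings.builder()
                .put("cluster.name", "my-test-cluster") // hypothetical cluster name
                .build();

        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300))) {
            // If none of the configured nodes respond (or their cluster name does not match),
            // this request fails with NoNodeAvailableException, as in the CI logs above.
            ClusterHealthResponse health = client.admin().cluster().prepareHealth().get();
            System.out.println(health.getStatus());
        }
    }
}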

@hub-cap hub-cap added >test-failure (Triaged test failures from CI) and :Distributed Indexing/Distributed (A catch all label for anything in the Distributed Indexing Area. Please avoid if you can.) labels Nov 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@ywelsch
Contributor

ywelsch commented Nov 14, 2019

This looks to be an infrastructure issue, with nodes not being able to establish connections between themselves and not detecting dropped connections. There have been no changes to the distributed layer in 6.8, and I suspect this is the same issue as #43387.

[2019-11-12T00:27:34,278][WARN ][o.e.c.NodeConnectionsService] [node_t2] failed to connect to node {node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [node_t1][127.0.0.1:49001] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1316) ~[main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
[2019-11-12T00:27:34,280][WARN ][o.e.c.s.ClusterApplierService] [node_t2] cluster state applier task [apply cluster state (from master [master {node_t2}{pFAtyjVNRm67_HmFVCX0Hg}{nHgCTUdkQ8i0pQj3GLrZjA}{127.0.0.1}{127.0.0.1:41481} committed version [12] source [zen-disco-elected-as-master ([0] nodes joined)]])] took [30s] above the warn threshold of 30s
[2019-11-12T00:27:34,280][WARN ][o.e.c.s.MasterService    ] [node_t2] cluster state update task [zen-disco-elected-as-master ([0] nodes joined)] took [30s] above the warn threshold of 30s
[2019-11-12T00:27:34,282][INFO ][o.e.c.s.MasterService    ] [node_t2] zen-disco-node-failed({node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001}), reason(transport disconnected)[{node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001} transport disconnected], reason: removed {{node_t1}{j-BMvpT8TYWsuL-ElRjpSA}{bmWwmfmDSpmeQM2LSV_cYA}{127.0.0.1}{127.0.0.1:49001},}
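
The connect_timeout[30s] in this trace matches the default transport connect timeout, which is why each failed connection attempt holds the test up for a full 30 seconds before it surfaces. As a hedged sketch (not taken from the CI job configuration), a test node could lower that timeout via its settings; transport.tcp.connect_timeout is assumed to be the relevant 6.x setting name and should be verified against the 6.8 documentation.

import org.elasticsearch.common.settings.Settings;

public class ConnectTimeoutSketch {
    // Hypothetical: shorten the transport connect timeout so an unreachable peer is
    // reported faster than the 30s default seen in the log above. The setting name
    // "transport.tcp.connect_timeout" is an assumption for 6.x, not confirmed here.
    static final Settings NODE_SETTINGS = Settings.builder()
            .put("transport.tcp.connect_timeout", "5s")
            .build();
}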

@ywelsch ywelsch added :Delivery/Build (Build or test infrastructure) and removed :Distributed Indexing/Distributed (A catch all label for anything in the Distributed Indexing Area. Please avoid if you can.) labels Nov 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@jaymode
Member

jaymode commented Nov 19, 2019

Looks like the infrastructure issues are resolved and these jobs have been passing. Closing for now.

@jaymode jaymode closed this as completed Nov 19, 2019
@pgomulka
Contributor

pgomulka commented Jan 7, 2020

It looks like this has recurred. One example: https://gradle-enterprise.elastic.co/s/atangcvtavd4k
Almost all runs of this job have failed this year: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+multijob-unix-compatibility/os=debian-8&&immutable/
The message in the logs is the same as in the first comment:

java.lang.IllegalStateException: cluster failed to form with expected nodes

Was there an infra issue last time? If so, we can notify the infra team that the problem is recurring.

@pgomulka pgomulka reopened this Jan 7, 2020
@jaymode
Member

jaymode commented Jan 8, 2020

This is manifesting the same way again with connection timeouts, and it looks like #43387 has also started appearing again. @original-brownbear has an open issue about looking into the possibility of testing this on different infrastructure, due to the networking issues we see on Debian 8/9 and GCP. These failures look different from #43387, but I suspect that is due to the use of a different transport; the 6.8 failures appear to be using the MockTcpTransport, which is blocking, so we will not see the logging messages that the MockNioTransport produces in 7.x+.
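
To illustrate the blocking vs. non-blocking distinction in generic Java (this is not the MockTcpTransport or MockNioTransport code): a blocking connect waits inside the connect call until it succeeds or times out, so there is nothing in between to log, while a selector-based connect returns immediately and the pending attempt can be observed, and logged, while it is still in flight. The host and port below are placeholders.

import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class ConnectStyles {
    public static void main(String[] args) throws Exception {
        InetSocketAddress addr = new InetSocketAddress("127.0.0.1", 39709); // placeholder address

        // Blocking style (roughly what a blocking transport does): the thread sits inside
        // connect() for up to 30s; on failure all we ever see is the final exception.
        try (Socket socket = new Socket()) {
            socket.connect(addr, 30_000);
        } catch (Exception e) {
            System.out.println("blocking connect failed: " + e);
        }

        // Non-blocking style (roughly what an NIO transport does): connect() returns at once
        // and the pending attempt is visible to the selector loop, which can log its progress.
        try (Selector selector = Selector.open(); SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            channel.connect(addr);
            channel.register(selector, SelectionKey.OP_CONNECT);
            while (selector.select(1_000) == 0) {
                System.out.println("connection to " + addr + " still pending..."); // observable state
            }
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable()) {
                    System.out.println("finishConnect: " + ((SocketChannel) key.channel()).finishConnect());
                }
            }
        } catch (Exception e) {
            System.out.println("non-blocking connect failed: " + e);
        }
    }
}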

@jaymode jaymode added :Distributed Coordination/Network (Http and internode communication implementations) and removed :Delivery/Build (Build or test infrastructure) labels Jan 8, 2020
@original-brownbear
Contributor

Closing this as a duplicate of #43387; both have the same cause.
