
ClusterApplierService stuck for mins while establishing connections to other node due to mismatch ephemeralId #56979

Closed
shwetathareja opened this issue May 20, 2020 · 5 comments
Labels
:Distributed Coordination/Cluster Coordination: Cluster formation and cluster state publication, including cluster membership and fault detection.
feedback_needed
Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@shwetathareja

Elasticsearch version (bin/elasticsearch --version): 7.1

Plugins installed: []

Description of the problem including expected versus actual behavior:

ClusterApplierService on master was stuck for 48m until it was eventually interrupted.

On further investigation, it was found to be stuck establishing a node connection, which was repeatedly failing with a "handshake failed. unexpected remote node" error.

Issue: The ES process was restarted on the target node, causing its ephemeralId to change (previous: pi9bH-T5RFOTl4JB8niS-w, new: izRTvB7KRVCenyA20GdH6A) and resulting in the unexpected remote node exception.

The change in #39629 (present from 7.2.0 onwards) would reduce the occurrence by not re-establishing connections to already disconnected nodes and only establishing connections to new nodes. However, this could still happen with a new node (in versions > 7.1.1) if the ES process is restarted while the cluster state is being processed.

ClusterApplierService shouldn't be stuck while establishing a connection when the ephemeralId mismatches. This node should be removed from the cluster, so that it joins back again.
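For illustration, here is a minimal, self-contained sketch of the identity check that produces the "unexpected remote node" failure. The types and method names are hypothetical, not the actual TransportService code; the point is that the cluster state still references the node as it looked when it joined, while the restart changed its ephemeral id, so the handshake result no longer matches.

```java
public class HandshakeCheckSketch {

    // Hypothetical stand-in for the node identity carried in the cluster state.
    record NodeIdentity(String nodeId, String ephemeralId) {}

    // The cluster state still expects the node as it looked when it joined;
    // a process restart changes the ephemeral id, so the identities no longer
    // match and the connection attempt is rejected.
    static void validate(NodeIdentity expected, NodeIdentity fromHandshake) {
        if (!expected.equals(fromHandshake)) {
            throw new IllegalStateException(
                "handshake failed. unexpected remote node " + fromHandshake);
        }
    }

    public static void main(String[] args) {
        NodeIdentity expected = new NodeIdentity("158fe9", "pi9bH-T5RFOTl4JB8niS-w");
        NodeIdentity actual   = new NodeIdentity("158fe9", "izRTvB7KRVCenyA20GdH6A");
        try {
            validate(expected, actual);
        } catch (IllegalStateException e) {
            // Prints the same kind of message seen in the logs below.
            System.err.println(e.getMessage());
        }
    }
}
```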

Steps to reproduce:

  1. Node sends join request to master
  2. While the master is processing the cluster state, the ES process is restarted on the new node, causing its ephemeralId to change.
  3. This causes ClusterApplierService to get stuck while establishing connections.

Provide logs (if relevant):

[2020-05-01T03:41:40,986][WARN ][o.e.c.s.MasterService ] [2e383c] failed to publish updated cluster state in [48.9m]: version [119560], uuid [VhfHxS62SPewZa2IpQ7elg], source [elected-as-master ([3] nodes joined) ..
java.lang.IllegalStateException: Future got interrupted
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:60) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:256) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.1.1.jar:7.1.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_172]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_172]

[2020-05-01T03:09:28,186][WARN ][o.e.c.NodeConnectionsService] [2e383c] failed to connect to node {158fe9}{0CZIV4StTSCJ9uy607yYdA}{pi9bH-T5RFOTl4JB8niS-w}{x.x.x.x}{x.x.x.x:9300} (tried [85] times)
org.elasticsearch.transport.ConnectTransportException: [158fe9][x.x.x.x:9300] handshake failed. unexpected remote node {158fe9}{0CZIV4StTSCJ9uy607yYdA}{izRTvB7KRVCenyA20GdH6A}
    at org.elasticsearch.transport.TransportService.lambda$connectionValidator$4(TransportService.java:352) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:105) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:344) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:331) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [elasticsearch-7.1.1.jar:7.1.1]
    ...
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner
Contributor

ClusterApplierService shouldn't be stuck while establishing a connection when the ephemeralId mismatches.

The ClusterApplierService does not get stuck here: it fails immediately, logs the failure, and carries on with its work. You saw this logged repeatedly, which tells us that it was indeed not stuck.
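To make that concrete, here is a hedged sketch of the pattern being described (hypothetical names, not NodeConnectionsService itself): each connection attempt fails fast, the failure is logged with a retry count (compare the "tried [85] times" in the log above), and another attempt is scheduled instead of blocking cluster state application.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionRetrySketch {

    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    // Attempt a connection; on failure, log and reschedule instead of blocking the caller.
    static void connectOrRetry(String address, int attempt, int maxAttempts) {
        try {
            connect(address); // stand-in that fails the way an ephemeral-id mismatch would
            SCHEDULER.shutdown();
        } catch (Exception e) {
            System.err.printf("failed to connect to node %s (tried [%d] times): %s%n",
                    address, attempt, e.getMessage());
            if (attempt < maxAttempts) {
                // Carry on: try again later rather than holding up the applier thread.
                SCHEDULER.schedule(() -> connectOrRetry(address, attempt + 1, maxAttempts),
                        1, TimeUnit.SECONDS);
            } else {
                SCHEDULER.shutdown();
            }
        }
    }

    // Hypothetical connect call that always fails, mimicking the handshake rejection.
    static void connect(String address) {
        throw new IllegalStateException("handshake failed. unexpected remote node at " + address);
    }

    public static void main(String[] args) {
        connectOrRetry("x.x.x.x:9300", 1, 3);
    }
}
```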

This node should be removed from the cluster, so that it joins back again.

Indeed, the node is removed from the cluster so that the new node can join in its place, as soon as the master is not busy with other higher-priority work, so I think the explanation is that your cluster was otherwise overloaded. There have been a number of changes since 7.1 that might affect this (e.g. #43381, #44433, possibly others). Can you reproduce this on the latest version (7.7.0 at time of writing)? Can you share the output of GET /_cluster/pending_tasks from the time when it appeared to be stuck?
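For reference, here is a hedged example of that request and the general shape of its response; the task entry shown is purely illustrative, not output from this cluster:

```
GET /_cluster/pending_tasks

{
  "tasks": [
    {
      "insert_order": 101,
      "priority": "URGENT",
      "source": "elected-as-master ([3] nodes joined)",
      "time_in_queue_millis": 86,
      "time_in_queue": "86ms"
    }
  ]
}
```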

@shwetathareja
Author

@DaveCTurner: Thanks for the feedback. I don't have the GET /_cluster/pending_tasks output from the actual occurrence. I will try to reproduce on 7.7.0 and get back.

@DaveCTurner
Contributor

Closing this due to inactivity, but will reopen if it reproduces in a newer version.

@GaneshJayaram97

GaneshJayaram97 commented Nov 9, 2021

Hi @DaveCTurner,

We are also observing the ephemeralId mismatch issue when establishing communication with other nodes, along with another issue, which can be found here.

The _cluster/pending_tasks output is also attached in the same thread.
