Elasticsearch version : 7.10.2

JVM version (java -version) :
openjdk 11.0.12 2021-07-20 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.12+7-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7-LTS, mixed mode, sharing)
OS version :
Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ES Cluster Topology :
5-node setup :
3 master-eligible nodes (these also act as data nodes)
2 data-only nodes
Note :
1. Terminology :
x.x.x.x refers to the old leader's (previous master's) IP
y.y.y.y refers to the new leader's IP
a.a.a.a, b.b.b.b, c.c.c.c refer to the other members of the ES cluster; here, b.b.b.b and c.c.c.c are data nodes, and c.c.c.c is unreachable
2. Out of the 5 nodes, one node (a data node) was not reachable at the time the logs were collected
3. The issue occurs only in the 5-node setup, not in the 3-node setup
Description of the problem including expected versus actual behavior:
We formed a cluster with 5 nodes and, while performing a rolling restart, observed that the previous master node (the master before the rolling restart) does not rejoin the cluster, while the other nodes rejoin successfully.
The expected behaviour is that all nodes rejoin the cluster.
Observations :
Following are the things that were observed when the issue occurred
1. A cluster task with source **elected-as-master ([2] nodes joined)** runs forever, causing the node-left and node-join tasks behind it to wait in the queue indefinitely (see the diagnostic sketch below)
On a side note, since the old leader was disconnected and its ephemeral id changed when it came back online, the new leader cannot connect to it and fails with the exception "handshake failed. unexpected remote node"
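As a quick way to confirm this observation, the pending cluster-state task queue can be inspected via the `_cluster/pending_tasks` API. The sketch below is illustrative only: it assumes the default HTTP port 9200 and the low-level `elasticsearch-rest-client` dependency, and it uses y.y.y.y (the new leader from the terminology above) as the host.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PendingTasksCheck {
    public static void main(String[] args) throws Exception {
        // Host and port are assumptions; y.y.y.y is the new leader.
        try (RestClient client = RestClient.builder(new HttpHost("y.y.y.y", 9200, "http")).build()) {
            // Pending cluster-state tasks; a stuck "elected-as-master" task shows up here,
            // with the node-join / node-left tasks queued behind it.
            Response pending = client.performRequest(new Request("GET", "/_cluster/pending_tasks"));
            System.out.println(EntityUtils.toString(pending.getEntity()));

            // Nodes currently in the cluster; the old master is expected to be missing.
            Request nodesReq = new Request("GET", "/_cat/nodes");
            nodesReq.addParameter("v", "true");
            Response nodes = client.performRequest(nodesReq);
            System.out.println(EntityUtils.toString(nodes.getEntity()));
        }
    }
}
```

The same information is available with plain `GET _cluster/pending_tasks` and `GET _cat/nodes?v` requests from curl or Kibana Dev Tools.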
From the new ES leader's logs :
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (auto create [myindex-2021-11-03]) within 1m
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) ~[elasticsearch-7.10.2.jar:7.10.2]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
[2021-11-02T13:19:35,111][WARN ][o.e.c.NodeConnectionsService] [y.y.y.y] failed to connect to {x.x.x.x}{37L59gGkQDSy_Mm4nVKq0A}{GozQ-qT_SNuDZ1PThGs0Dw}{x.x.x.x}{x.x.x.x:9300}{dimr} (tried [26725] times)
org.elasticsearch.transport.ConnectTransportException: [x.x.x.x][x.x.x.x:9300] handshake failed. unexpected remote node {x.x.x.x}{37L59gGkQDSy_Mm4nVKq0A}{Ai6cyeSIQdKFbWTUThPyYw}{x.x.x.x}{x.x.x.x:9300}{dimr}
at org.elasticsearch.transport.TransportService.lambda$connectionValidator$5(TransportService.java:389) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:157) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:476) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:466) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1171) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:253) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:247) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Note :
The above log messages repeat continuously and have filled the ES logs
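For context on the repeated "handshake failed. unexpected remote node" message: a node's identity includes an ephemeral id that is regenerated every time the process starts, so after the restart the old master no longer matches the identity the new leader still holds for it. The sketch below is purely illustrative (a hypothetical `NodeIdentity` class, not the actual Elasticsearch `DiscoveryNode`), using the ids visible in the log lines above.

```java
import java.util.Objects;

/**
 * Purely illustrative: why the new leader keeps logging
 * "handshake failed. unexpected remote node" for the restarted old master.
 */
public class HandshakeSketch {

    // Hypothetical stand-in for a node identity as seen during the handshake.
    static final class NodeIdentity {
        final String persistentId; // survives restarts (stored in the data path)
        final String ephemeralId;  // regenerated every time the process starts

        NodeIdentity(String persistentId, String ephemeralId) {
            this.persistentId = persistentId;
            this.ephemeralId = ephemeralId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof NodeIdentity)) return false;
            NodeIdentity other = (NodeIdentity) o;
            return persistentId.equals(other.persistentId) && ephemeralId.equals(other.ephemeralId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(persistentId, ephemeralId);
        }
    }

    public static void main(String[] args) {
        // Identity the new leader cached before the old master restarted (from the WARN line above).
        NodeIdentity cached = new NodeIdentity("37L59gGkQDSy_Mm4nVKq0A", "GozQ-qT_SNuDZ1PThGs0Dw");
        // Identity the old master reports after restarting: same persistent id, new ephemeral id.
        NodeIdentity afterRestart = new NodeIdentity("37L59gGkQDSy_Mm4nVKq0A", "Ai6cyeSIQdKFbWTUThPyYw");

        // The connection validator compares the full identity, ephemeral id included,
        // so every reconnect attempt is rejected until the cluster state is updated.
        if (!cached.equals(afterRestart)) {
            System.out.println("handshake failed. unexpected remote node");
        }
    }
}
```

Normally a node-left/node-join cycle would replace the stale identity in the cluster state; here those tasks appear to be stuck behind the long-running elected-as-master task (observation 1), which would explain why the handshake keeps failing on every retry.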
Steps to reproduce :
There are currently no specific steps to reproduce the node-rejoin failure; it occurs intermittently when rolling restarts are performed (a generic sketch of the per-node rolling-restart steps follows below).
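For reference, since the failure is tied to rolling restarts: the usual per-node steps (disable shard allocation, flush, restart the node, re-enable allocation) can be scripted as below. This is a generic sketch, not the exact procedure used in this setup; the host and port are assumptions, and the actual process restart happens outside this code.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class RollingRestartSteps {
    public static void main(String[] args) throws Exception {
        // Point this at any reachable node of the cluster (host/port assumed).
        try (RestClient client = RestClient.builder(new HttpHost("a.a.a.a", 9200, "http")).build()) {
            // Before stopping a node: keep primaries allocated but stop shard re-balancing.
            Request disable = new Request("PUT", "/_cluster/settings");
            disable.setJsonEntity("{\"persistent\":{\"cluster.routing.allocation.enable\":\"primaries\"}}");
            client.performRequest(disable);

            // Flush so recovery after the restart is as cheap as possible.
            client.performRequest(new Request("POST", "/_flush"));

            // ... stop the node, upgrade/restart it, wait for it to rejoin ...

            // After the node is back: re-enable full shard allocation.
            Request enable = new Request("PUT", "/_cluster/settings");
            enable.setJsonEntity("{\"persistent\":{\"cluster.routing.allocation.enable\":null}}");
            client.performRequest(enable);
        }
    }
}
```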
This certainly looks like a bug but I don't think this is the right place to report it. You're using the OpenDistro fork of Elasticsearch which is quite separate from the codebase tracked in this repository, and you should report issues to the maintainers of your fork instead. Therefore I am closing this.
If you can reproduce this with Elasticsearch proper then we'd love to hear about it, and ask that you capture a heap dump and thread dump from the master before restarting anything next time. Either append them (at least the thread dump) to this issue and we'll reopen it, or else open another issue.