
ES master re-election algorithm tries electing a non-reachable master. #31801


Closed
itiyama opened this issue Jul 4, 2018 · 8 comments


itiyama commented Jul 4, 2018

Elasticsearch version (bin/elasticsearch --version): 5.5
Plugins installed: []

JVM version (java -version): 1.8
OS version (uname -a if on a Unix-like system): 4.9.38-16.35.amzn1.x86_64
Description of the problem including expected versus actual behavior:

This issue was observed in a cluster with 8 data nodes and 3 dedicated master eligible nodes.
We have observed that master re-election sometimes gets stuck in a loop even though all master-eligible nodes are up and running and only one of them is unreachable. The UnicastHostsProvider returns that host, but it is not pingable from any other node. A new master is only elected after the UnicastHostsProvider stops returning that host. In this particular case, re-election was stuck in a loop for more than 1.5 hours because the UnicastHostsProvider kept returning the host. That host was the master before re-election was triggered.

Scenario:

  1. ES is restarted on all dedicated master nodes; on the current master it restarts last.
  2. The 'master left' event is detected by the fault-detection logic on all data nodes.
  3. The master pinger on each node is stopped (as master re-election is going to be triggered), but the master is not removed from the cluster state.
  4. Join logic (re-election) is triggered, but all nodes return the old master in their ping results, and joining the old master fails. This happens in a loop (see the sketch after this list).
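
To make step 4 concrete, here is a minimal, self-contained sketch of the loop as I understand it. All names are stand-ins, not the actual ZenDiscovery code; the real flow goes through innerJoinCluster and joinElectedMaster, as the stack traces further below show.

// Illustrative simulation of the loop in step 4; the names only mirror the shape
// of the flow (ping -> pick a master from the responses -> try to join -> fail -> retry).
import java.util.Arrays;
import java.util.List;

public class JoinLoopSketch {

    public static void main(String[] args) {
        // Every node keeps reporting the same old, unreachable master in its pings.
        List<String> reportedMasters = Arrays.asList("31tudzn", "31tudzn", "31tudzn");

        for (int round = 1; round <= 3; round++) {      // bounded for the demo; the real loop ran for over 1.5 hours
            String master = reportedMasters.get(0);     // a stale "active master" short-circuits a fresh election
            boolean joined = tryJoin(master);
            System.out.println("round " + round + ": join " + master + " -> " + joined);
        }
    }

    static boolean tryJoin(String master) {
        // Stands in for joinElectedMaster(): connecting to the old master always
        // fails (connection refused / unexpected remote node), so we ping again.
        return false;
    }
}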

I want to understand the following:
Why does the master node re-appear in the current nodes even though the master left? Please note that the master node's ephemeral ID is the old one. Also, since the master pinger died, we have no way of removing a stale master now.

Steps to reproduce:

Unfortunately I don't have steps to reproduce the issue, but I have seen this happen more than once. I have the relevant logs, and I have looked at the code flow, which makes me think this is a bug.

Provide logs (if relevant):
The log level is INFO for all of the logs below.

Logs on one of the data nodes [IP addresses removed]. These log lines appear only once on each data node.

[2018-06-05T00:34:02,182][INFO ][o.e.d.z.ZenDiscovery     ] [ub38WLW] master_left [{31tudzn}{31tudznpRrOr8iRcg3eqCw}{IchjnSK7SkKsP8u90WgtgA}], reason [transport disconnected]
[2018-06-05T00:34:02,183][WARN ][o.e.d.z.ZenDiscovery     ] [ub38WLW] master left (reason = transport disconnected), current nodes: nodes:
{31tudzn}{31tudznpRrOr8iRcg3eqCw}{IchjnSK7SkKsP8u90WgtgA}, master
   {ub38WLW}{ub38WLW2QrmMUORORsjIaQ}{bzvVWANSTK6svB6wgrbqDQ}, local
   {XCuAlLW}{XCuAlLWhTFixeF-ZfvVZ0g}{mmrE4kn_S2OJp_IMc64CCw}
   {Ostbsj_}{Ostbsj_mRaGzgRzbsMmJCg}{olZOAwrSSX-79FokruKQbQ}
   {K720qk5}{K720qk56TBeULS2GN7J9zA}{QvH5SbF1T0md-ujHPevjUg}
   {UfWI8YH}{UfWI8YHRTyaedIJD0i9UcA}{qcmCT-iPR7KIeZYprntjAA}
   {a4A5voP}{a4A5voPvQzSR7kEh65vNPA}{fVLArUnCRqOzSswQjjiY8w}
   {RFf4YvU}{RFf4YvUPQA-JHGU_NvlPLA}{mWnG3CUiQZqpsHb8U8Bt6Q}
   {0de5xLZ}{0de5xLZDTluZjEBC6SrOHA}{h_fHZguHTr2jPmb_y27EnQ}
   {g6V22Qk}{g6V22QkNREyriF8rmrM8bg}{B98SirMJT6OGjHFV771rwQ}
   {o_yEC-x}{o_yEC-x1S6SeZuswgugMLw}{Twkui3yPQtCNOjEBmPIVtQ}

After this, all I see on the data nodes is a failed attempt to connect to the master; note that the master node's ephemeral ID is still the old one.

[2018-06-05T00:34:02,186][WARN ][o.e.c.NodeConnectionsService] [ub38WLW] failed to connect to node {31tudzn}{31tudznpRrOr8iRcg3eqCw}{IchjnSK7SkKsP8u90WgtgA} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [31tudzn] connect_timeout[30s]
    at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:361) ~[?:?]
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:548) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:472) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:332) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:319) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.2.jar:5.5.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]                      
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip:9300
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_112]  
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_112]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]      
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    ... 1 more  
Caused by: java.net.ConnectException: Connection refused                        
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_112]  
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_112]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]      
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    ... 1 more 

Logs on the master node. Please note that the master node's ephemeral ID is different here, hence it tries to connect to itself over a remote connection instead of waiting for incoming joins.


org.elasticsearch.transport.ConnectTransportException: [31tudzn] handshake failed. unexpected remote node {31tudzn}{31tudznpRrOr8iRcg3eqCw}{SNZOWSLtQwWsFodR6B0D-w}
    at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:336) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:332) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:319) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:459) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:411) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.access$4100(ZenDiscovery.java:83) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1188) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]  

My hypothesis of the issue:

  1. The master node leaves.
  2. Due to some concurrency issue, the master node comes back into the cluster state (the old master comes back, as the ephemeral ID is the old one).
  3. Since master fault detection was turned off when the master left the cluster, the master is not removed again.
  4. Pings from the nodes return this master in their responses, because the master was not removed from the cluster state (the concurrency issue plus the pinger dying; the pinger is turned off when the master leaves the cluster). Note that there is no liveness check on the master value that is broadcast during pings: if it is not removed, the same value keeps being returned once the pinger is turned off, irrespective of whether the master is reachable.
  5. Since an active master is returned by the nodes, we don't re-elect a master from the dedicated master-eligible nodes but just do a tie-break, leading to the old master being elected because it has the lowest node ID and the highest cluster state version.
  6. We try to connect to the old master but are unable to do so due to some network issue. Step 4 starts again, and we keep repeating steps 4-6 until the master is removed from the ping responses (see the sketch below).
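
A small sketch of the election decision in steps 4-5 as I understand it. The Node type, the comparator and the method names are illustrative stand-ins, not the actual ElectMasterService code.

// Illustrative sketch of the "tie-break vs. full election" decision from steps 4-5.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ElectionSketch {

    static class Node {
        final String id;
        final boolean masterEligible;
        Node(String id, boolean masterEligible) { this.id = id; this.masterEligible = masterEligible; }
    }

    static Node chooseMaster(List<Node> activeMastersFromPings, List<Node> masterEligibleCandidates) {
        if (!activeMastersFromPings.isEmpty()) {
            // Step 5: a stale active master in the ping responses means we only
            // tie-break among those masters; the unreachable old master wins
            // (lowest node id), and no fresh election among candidates happens.
            return activeMastersFromPings.stream().min(Comparator.comparing((Node n) -> n.id)).get();
        }
        // Only when no ping advertises an active master is a master elected from
        // the dedicated master-eligible candidates.
        return masterEligibleCandidates.stream()
            .filter(n -> n.masterEligible)
            .min(Comparator.comparing((Node n) -> n.id))
            .orElse(null);
    }

    public static void main(String[] args) {
        List<Node> stale = Arrays.asList(new Node("31tudzn", true));      // old master still advertised in pings
        List<Node> candidates = Arrays.asList(new Node("master-2", true), new Node("master-3", true));
        System.out.println(chooseMaster(stale, candidates).id);           // prints 31tudzn: tie-break only
    }
}
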
itiyama changed the title from "ES master re-election tries electing a non-reachable master." to "ES master re-election algorithm tries electing a non-reachable master." on Jul 4, 2018
ywelsch added the :Distributed Coordination/Cluster Coordination label on Jul 5, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed


ywelsch commented Jul 12, 2018

When printing master_left (the WARN one), the nodes execute the ZenDiscovery.rejoin method and remove the active master from their local cluster state. During pinging, they will then not broadcast knowledge of an active master. Can you check if the above WARN message was logged on all nodes? Can you check if there's a follow-up log message indicating that the master has been re-added to the cluster state?
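
For clarity, an illustrative sketch of what I mean (stand-in types, not the actual ZenPing/ZenDiscovery code): the master a node advertises in its ping responses comes from its local cluster state, so once rejoin has cleared the master there is nothing left to broadcast.

// Illustrative sketch only; the types and names are stand-ins.
public class PingResponseSketch {

    // Stand-in for the node's local cluster state; not the real ClusterState class.
    static class LocalClusterState {
        String masterNodeId;              // null once rejoin() has cleared the master
    }

    static String masterToBroadcast(LocalClusterState state) {
        // The master advertised in a ping response comes from the local cluster
        // state. If rejoin() really cleared it, no active master is broadcast; the
        // question is whether on some node a stale master survived here.
        return state.masterNodeId;
    }

    public static void main(String[] args) {
        LocalClusterState afterRejoin = new LocalClusterState();
        afterRejoin.masterNodeId = null;
        System.out.println("advertised master: " + masterToBroadcast(afterRejoin));   // null -> no active master in pings
    }
}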


itiyama commented Jul 19, 2018

Yes, I checked. The above WARN was logged on all 8 data nodes.
There is no follow-up log message indicating that the master was re-added. In fact, the WARN message appeared again only after 1.5 hours, for a different master-eligible node, when the old master was terminated.


ywelsch commented Jul 19, 2018

Are you using any discovery.zen.* settings that deviate from the defaults?
Can you try setting discovery.zen.master_election.ignore_non_master_pings to true and see if that helps?


itiyama commented Jul 19, 2018

Are you using any discovery.zen.* settings that deviate from the defaults?

I am using default discovery.zen.* settings.

Can you try setting discovery.zen.master_election.ignore_non_master_pings to true and see if that helps?

I can, but why do you think that would help? Also, since I don't know the exact steps to reproduce the scenario, verifying whether it works might be tricky.

Just a thought: it appeared to me as though the cluster state did not change even though master fault detection stopped, which caused the stale data to remain in the cluster state. If some staleness were associated with the master value stored in the cluster state, the master would have been re-elected even though fault detection stopped, for whatever reason. We should still find the concurrency issue that did not change (or revert) the cluster state when master pings stopped, but invalidating master entries in the cluster state would have helped here.


itiyama commented Aug 9, 2018

Please provide a response to this.


ywelsch commented Aug 13, 2018

I can, but why do you think that would help

The scenario you described had all master-eligible nodes restart. If it's the data nodes that hold poisonous information about who they think the master is, this setting makes sure that the active-mastership information from data nodes is discarded, i.e. only the values coming from the cluster state of the master-eligible nodes will be trusted for that information. As those nodes have been restarted, they will not contain information about the mastership of any node, thereby allowing a proper election.
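
To sketch the effect of the setting (illustrative stand-in types, not the actual ZenDiscovery code): ping responses from nodes that are not master-eligible are dropped before the list of active masters is built, so mastership info held only by data nodes cannot steer the election.

// Illustrative sketch of discovery.zen.master_election.ignore_non_master_pings.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IgnoreNonMasterPingsSketch {

    static class Ping {
        final String fromNode;
        final boolean fromMasterEligibleNode;
        final String reportedMaster;      // may be null
        Ping(String fromNode, boolean eligible, String reportedMaster) {
            this.fromNode = fromNode;
            this.fromMasterEligibleNode = eligible;
            this.reportedMaster = reportedMaster;
        }
    }

    static List<String> activeMasters(List<Ping> pings, boolean ignoreNonMasterPings) {
        return pings.stream()
            .filter(p -> !ignoreNonMasterPings || p.fromMasterEligibleNode)   // effect of the setting
            .map(p -> p.reportedMaster)
            .filter(m -> m != null)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Ping> pings = Arrays.asList(
            new Ping("data-1", false, "31tudzn"),       // data node still advertising the stale master
            new Ping("master-2", true, null));          // restarted master-eligible node: knows no master
        System.out.println(activeMasters(pings, false));   // [31tudzn] -> stale tie-break, old master picked again
        System.out.println(activeMasters(pings, true));    // []        -> a proper election can proceed
    }
}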

but invalidating master entries in the cluster state would have helped here.

We should not be building new features into some components just to work around bugs in other components. If there's a bug here, we need to find and fix it.

Once the cluster is in this situation, it would be interesting to see what each node's local cluster state looks like (GET /_cluster/state?local) in order to check whether they all properly consider the master to have stepped down (i.e. no current master in their cluster state). If there's a node that mistakenly considers the master to still be present, take a heap dump, which will allow us to inspect the state further.

DaveCTurner (Contributor) commented

The changes in #32006 mean that we are now much more picky about the nodes that we consider to be "live" during the discovery phase, and a master is chosen from the known-live nodes more sensibly. In particular, we check that we're actually connected to a node during discovery, so I think this poisonous situation can no longer occur.
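
Roughly sketched (illustrative stand-ins, not the actual #32006 code): during discovery, a node is only treated as live, and therefore electable, if we actually have an open connection to it.

// Illustrative sketch of filtering discovered nodes by actual connectivity.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LiveNodesSketch {

    interface ConnectionChecker {
        boolean isConnected(String nodeId);
    }

    static List<String> liveNodes(List<String> discoveredNodes, ConnectionChecker connections) {
        return discoveredNodes.stream()
            .filter(connections::isConnected)     // unreachable nodes (e.g. the stale master) drop out here
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ConnectionChecker connections = nodeId -> !nodeId.equals("31tudzn");   // the old master is unreachable
        System.out.println(liveNodes(Arrays.asList("31tudzn", "master-2", "master-3"), connections));
        // prints [master-2, master-3]: the unreachable old master can no longer be chosen
    }
}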
