
ES master re-election algorithm tries electing a non-reachable master. #31801


Closed
itiyama opened this issue Jul 4, 2018 · 8 comments


itiyama commented Jul 4, 2018

Elasticsearch version (bin/elasticsearch --version): 5.5
Plugins installed: []

JVM version (java -version): 1.8
OS version (uname -a if on a Unix-like system): 4.9.38-16.35.amzn1.x86_64
Description of the problem including expected versus actual behavior:

This issue was observed in a cluster with 8 data nodes and 3 dedicated master eligible nodes.
We have observed that master re-election sometimes gets stuck in a loop even though all master-eligible nodes are up and running and only one of them is unreachable. The UnicastHostsProvider returns that host, but it is not pingable from any other node. A new master is only elected after the UnicastHostsProvider stops returning that host. In this particular case, re-election was stuck in a loop for more than 1.5 hours because the UnicastHostsProvider kept returning the host. That host was the master before re-election was triggered.

Scenario:

  1. ES is restarted on all dedicated master nodes; on the current master it restarts last.
  2. The 'master left' event is detected by the fault-detection logic on all data nodes.
  3. The master pinger on each node is stopped (as master re-election is going to be triggered), but the master is not removed from the cluster state.
  4. Join logic (re-election) is triggered, but all nodes return the old master in their ping results, and joining the old master fails. This happens in a loop (see the sketch after this list).
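
To make step 4 concrete, here is a minimal, self-contained sketch of the loop as I understand it. All names are stand-ins, not the actual ZenDiscovery code; the real flow goes through innerJoinCluster and joinElectedMaster, as the stack traces further below show.

// Illustrative simulation of the loop in step 4; the names only mirror the shape
// of the flow (ping -> pick a master from the responses -> try to join -> fail -> retry).
import java.util.Arrays;
import java.util.List;

public class JoinLoopSketch {

    public static void main(String[] args) {
        // Every node keeps reporting the same old, unreachable master in its pings.
        List<String> reportedMasters = Arrays.asList("31tudzn", "31tudzn", "31tudzn");

        for (int round = 1; round <= 3; round++) {      // bounded for the demo; the real loop ran for over 1.5 hours
            String master = reportedMasters.get(0);     // a stale "active master" short-circuits a fresh election
            boolean joined = tryJoin(master);
            System.out.println("round " + round + ": join " + master + " -> " + joined);
        }
    }

    static boolean tryJoin(String master) {
        // Stands in for joinElectedMaster(): connecting to the old master always
        // fails (connection refused / unexpected remote node), so we ping again.
        return false;
    }
}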

I want to understand the following:
Why does the master node re-appear in the current nodes even though the master left? Please note that the master node's ephemeral ID is the old one. Also, since the master pinger died, we have no way of removing a stale master now.

Steps to reproduce:

Unfortunately I don't have steps to reproduce the issue, but I have seen this happen more than once. I have the relevant logs, and I have looked at the code flow, which makes me think this is a bug.

Provide logs (if relevant):
The log level is INFO for all of the logs below.

Logs on one of the data nodes [IP addresses removed]. These log lines appear only once on each data node.

[2018-06-05T00:34:02,182][INFO ][o.e.d.z.ZenDiscovery     ] [ub38WLW] master_left [{31tudzn}{31tudznpRrOr8iRcg3eqCw}{IchjnSK7SkKsP8u90WgtgA}], reason [transport disconnected]
[2018-06-05T00:34:02,183][WARN ][o.e.d.z.ZenDiscovery     ] [ub38WLW] master left (reason = transport disconnected), current nodes: nodes:
{31tudzn}{31tudznpRrOr8iRcg3eqCw}{IchjnSK7SkKsP8u90WgtgA}, master
   {ub38WLW}{ub38WLW2QrmMUORORsjIaQ}{bzvVWANSTK6svB6wgrbqDQ}, local
   {XCuAlLW}{XCuAlLWhTFixeF-ZfvVZ0g}{mmrE4kn_S2OJp_IMc64CCw}
   {Ostbsj_}{Ostbsj_mRaGzgRzbsMmJCg}{olZOAwrSSX-79FokruKQbQ}
   {K720qk5}{K720qk56TBeULS2GN7J9zA}{QvH5SbF1T0md-ujHPevjUg}
   {UfWI8YH}{UfWI8YHRTyaedIJD0i9UcA}{qcmCT-iPR7KIeZYprntjAA}
   {a4A5voP}{a4A5voPvQzSR7kEh65vNPA}{fVLArUnCRqOzSswQjjiY8w}
   {RFf4YvU}{RFf4YvUPQA-JHGU_NvlPLA}{mWnG3CUiQZqpsHb8U8Bt6Q}
   {0de5xLZ}{0de5xLZDTluZjEBC6SrOHA}{h_fHZguHTr2jPmb_y27EnQ}
   {g6V22Qk}{g6V22QkNREyriF8rmrM8bg}{B98SirMJT6OGjHFV771rwQ}
   {o_yEC-x}{o_yEC-x1S6SeZuswgugMLw}{Twkui3yPQtCNOjEBmPIVtQ}

After this, all I see on the data nodes is a failed attempt to connect to the master; note that the master node's ephemeral ID is still the old one.

[2018-06-05T00:34:02,186][WARN ][o.e.c.NodeConnectionsService] [ub38WLW] failed to connect to node {31tudzn}{31tudznpRrOr8iRcg3eqCw}{IchjnSK7SkKsP8u90WgtgA} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [31tudzn] connect_timeout[30s]
    at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:361) ~[?:?]
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:548) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:472) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:332) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:319) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.2.jar:5.5.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]                      
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip:9300
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_112]  
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_112]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]      
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    ... 1 more  
Caused by: java.net.ConnectException: Connection refused                        
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_112]  
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_112]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]      
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    ... 1 more 

Logs on the master node. Please note that the master node's ephemeral ID is different here, hence it tries to connect to itself over a remote connection instead of waiting for incoming joins.


org.elasticsearch.transport.ConnectTransportException: [31tudzn] handshake failed. unexpected remote node {31tudzn}{31tudznpRrOr8iRcg3eqCw}{SNZOWSLtQwWsFodR6B0D-w}
    at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:336) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:332) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:319) ~[elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:459) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:411) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery.access$4100(ZenDiscovery.java:83) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1188) [elasticsearch-5.5.2.jar:5.5.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]  

My hypothesis of the issue:

  1. The master node leaves.
  2. Due to some concurrency issue, the master node comes back into the cluster state (the old master comes back, as the ephemeral ID is the old one).
  3. Since master fault detection was turned off when the master left the cluster, the master is not removed again.
  4. Pings from the nodes return this master in their responses, because the master was not removed from the cluster state (the concurrency issue plus the pinger dying; the pinger is turned off when the master leaves the cluster). Note that there is no liveness check on the master value that is broadcast during pings: if it is not removed, the same value keeps being returned once the pinger is turned off, irrespective of whether the master is reachable.
  5. Since an active master is returned by the nodes, we don't re-elect a master from the dedicated master-eligible nodes but just do a tie-break, leading to the old master being elected because it has the lowest node ID and the highest cluster state version.
  6. We try to connect to the old master but are unable to do so due to some network issue. Step 4 starts again, and we keep repeating steps 4-6 until the master is removed from the ping responses (see the sketch below).
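
A small sketch of the election decision in steps 4-5 as I understand it. The Node type, the comparator and the method names are illustrative stand-ins, not the actual ElectMasterService code.

// Illustrative sketch of the "tie-break vs. full election" decision from steps 4-5.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ElectionSketch {

    static class Node {
        final String id;
        final boolean masterEligible;
        Node(String id, boolean masterEligible) { this.id = id; this.masterEligible = masterEligible; }
    }

    static Node chooseMaster(List<Node> activeMastersFromPings, List<Node> masterEligibleCandidates) {
        if (!activeMastersFromPings.isEmpty()) {
            // Step 5: a stale active master in the ping responses means we only
            // tie-break among those masters; the unreachable old master wins
            // (lowest node id), and no fresh election among candidates happens.
            return activeMastersFromPings.stream().min(Comparator.comparing((Node n) -> n.id)).get();
        }
        // Only when no ping advertises an active master is a master elected from
        // the dedicated master-eligible candidates.
        return masterEligibleCandidates.stream()
            .filter(n -> n.masterEligible)
            .min(Comparator.comparing((Node n) -> n.id))
            .orElse(null);
    }

    public static void main(String[] args) {
        List<Node> stale = Arrays.asList(new Node("31tudzn", true));      // old master still advertised in pings
        List<Node> candidates = Arrays.asList(new Node("master-2", true), new Node("master-3", true));
        System.out.println(chooseMaster(stale, candidates).id);           // prints 31tudzn: tie-break only
    }
}
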
itiyama changed the title from "ES master re-election tries electing a non-reachable master." to "ES master re-election algorithm tries electing a non-reachable master." on Jul 4, 2018
ywelsch added the :Distributed Coordination/Cluster Coordination label on Jul 5, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed


ywelsch commented Jul 12, 2018

When printing master_left (the WARN one), the nodes execute the ZenDiscovery.rejoin method and remove the active master from their local cluster state. During pinging, they will then not broadcast knowledge of an active master. Can you check if the above WARN message was logged on all nodes? Can you check if there's a follow-up log message indicating that the master has been re-added to the cluster state?
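
For clarity, an illustrative sketch of what I mean (stand-in types, not the actual ZenPing/ZenDiscovery code): the master a node advertises in its ping responses comes from its local cluster state, so once rejoin has cleared the master there is nothing left to broadcast.

// Illustrative sketch only; the types and names are stand-ins.
public class PingResponseSketch {

    // Stand-in for the node's local cluster state; not the real ClusterState class.
    static class LocalClusterState {
        String masterNodeId;              // null once rejoin() has cleared the master
    }

    static String masterToBroadcast(LocalClusterState state) {
        // The master advertised in a ping response comes from the local cluster
        // state. If rejoin() really cleared it, no active master is broadcast; the
        // question is whether on some node a stale master survived here.
        return state.masterNodeId;
    }

    public static void main(String[] args) {
        LocalClusterState afterRejoin = new LocalClusterState();
        afterRejoin.masterNodeId = null;
        System.out.println("advertised master: " + masterToBroadcast(afterRejoin));   // null -> no active master in pings
    }
}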


itiyama commented Jul 19, 2018

Yes, I checked. The above WARN was logged on all 8 data nodes.
There is no follow-up log message indicating that the master was re-added. In fact, the WARN message appeared again only after 1.5 hours, for a different master-eligible node, when the old master was terminated.


ywelsch commented Jul 19, 2018

Are you using any discovery.zen.* settings that deviate from the defaults?
Can you try setting discovery.zen.master_election.ignore_non_master_pings to true and see if that helps?


itiyama commented Jul 19, 2018

Are you using any discovery.zen.* settings that deviate from the defaults?

I am using default discovery.zen.* settings.

Can you try setting discovery.zen.master_election.ignore_non_master_pings to true and see if that helps?

I can, but why do you think that would help? Also, since I don't know the exact steps to reproduce the scenario, verifying whether it works might be tricky.

Just a thought: it appeared to me as though the cluster state did not change even though master fault detection stopped, which caused the stale data to remain in the cluster state. If some staleness were associated with the master value stored in the cluster state, the master would have been re-elected even though fault detection stopped, for whatever reason. We should still find the concurrency issue that did not change (or revert) the cluster state when master pings stopped, but invalidating master entries in the cluster state would have helped here.


itiyama commented Aug 9, 2018

Please provide a response to this.


ywelsch commented Aug 13, 2018

I can, but why do you think that would help

The scenario you described had all master-eligible nodes restart. If it's the data nodes that hold poisonous information about who they think the master is, this setting makes sure that the active-mastership information from data nodes is discarded, i.e. only the values coming from the cluster state of the master-eligible nodes will be trusted for that information. As those nodes have been restarted, they will not contain information about the mastership of any node, thereby allowing a proper election.
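
To sketch the effect of the setting (illustrative stand-in types, not the actual ZenDiscovery code): ping responses from nodes that are not master-eligible are dropped before the list of active masters is built, so mastership info held only by data nodes cannot steer the election.

// Illustrative sketch of discovery.zen.master_election.ignore_non_master_pings.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IgnoreNonMasterPingsSketch {

    static class Ping {
        final String fromNode;
        final boolean fromMasterEligibleNode;
        final String reportedMaster;      // may be null
        Ping(String fromNode, boolean eligible, String reportedMaster) {
            this.fromNode = fromNode;
            this.fromMasterEligibleNode = eligible;
            this.reportedMaster = reportedMaster;
        }
    }

    static List<String> activeMasters(List<Ping> pings, boolean ignoreNonMasterPings) {
        return pings.stream()
            .filter(p -> !ignoreNonMasterPings || p.fromMasterEligibleNode)   // effect of the setting
            .map(p -> p.reportedMaster)
            .filter(m -> m != null)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Ping> pings = Arrays.asList(
            new Ping("data-1", false, "31tudzn"),       // data node still advertising the stale master
            new Ping("master-2", true, null));          // restarted master-eligible node: knows no master
        System.out.println(activeMasters(pings, false));   // [31tudzn] -> stale tie-break, old master picked again
        System.out.println(activeMasters(pings, true));    // []        -> a proper election can proceed
    }
}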

but invalidating master entries in the cluster state would have helped here.

We should not be building new features into some components just to work around bugs in other components. If there's a bug here, we need to find and fix it.

Once the cluster is in this situation, it would be interesting to see what each node's local cluster state looks like (GET /_cluster/state?local) in order to check whether they all properly consider the master to have stepped down (i.e. no current master in their cluster state). If there's a node that mistakenly considers the master to still be present, take a heap dump, which will allow us to inspect the state further.

DaveCTurner (Contributor) commented

The changes in #32006 mean that we are now much more picky about the nodes that we consider to be "live" during the discovery phase, and a master is chosen from the known-live nodes more sensibly. In particular, we check that we're actually connected to a node during discovery, so I think this poisonous situation can no longer occur.
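
Roughly sketched (illustrative stand-ins, not the actual #32006 code): during discovery, a node is only treated as live, and therefore electable, if we actually have an open connection to it.

// Illustrative sketch of filtering discovered nodes by actual connectivity.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LiveNodesSketch {

    interface ConnectionChecker {
        boolean isConnected(String nodeId);
    }

    static List<String> liveNodes(List<String> discoveredNodes, ConnectionChecker connections) {
        return discoveredNodes.stream()
            .filter(connections::isConnected)     // unreachable nodes (e.g. the stale master) drop out here
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ConnectionChecker connections = nodeId -> !nodeId.equals("31tudzn");   // the old master is unreachable
        System.out.println(liveNodes(Arrays.asList("31tudzn", "master-2", "master-3"), connections));
        // prints [master-2, master-3]: the unreachable old master can no longer be chosen
    }
}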
