
Cluster downtime during master node restart while not in discovery file provider #1138

Closed · sebgl opened this issue Jun 24, 2019 · 6 comments

sebgl commented Jun 24, 2019

I'm opening this issue while working on PVC reuse/rolling upgrades (#312), which is not merged yet, but it seemed important to have this separate discussion.

I observe a short period of cluster downtime during the rolling upgrade process, while the master node is being restarted (we restart it last). There is no downtime when the other nodes are restarted.

While the master node is restarting, requests to Elasticsearch (a 3-node v7.1 cluster of mdi nodes; 2 of the 3 nodes still alive, the master being down) return:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

No master election happens, even though 2 of the 3 master-eligible nodes are alive, until the restarted master node gets back into the cluster a few seconds later (once its restart is over).

These are the errors visible in the logs of one of the 2 remaining Elasticsearch instances:

{"type": "server", "timestamp": "2019-06-24T12:28:32,179+0000", "level": "DEBUG", "component": "o.e.a.a.c.n.i.TransportNodesInfoAction", "cluster.name": "elasticsearch-sample", "node.name": "elasticsearch-sample-es-x4dl6l8vkn", "cluster.uuid": "IdHEH4LNQR6XiIF05_hZMQ", "node.id": "_HmKAToxSjOexs-qdN1LrQ", "message": "failed to execute on node [XF9dsvmJQ9iew2
dUwQd6Bw]" ,
"stacktrace": ["org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-sample-es-mkt2kvbpgs][10.16.0.56:9300] Node not connected",
"at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151) ~[elasticsearch-7.1.0.jar:7.1.0]" ...

{"type": "server", "timestamp": "2019-06-24T12:28:32,876+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elasticsearch-sample", "node.name": "elasticsearch-sample-es-x4dl6l8vkn", "cluster.uuid": "IdHEH4LNQR6XiIF05_hZMQ", "node.id": "_HmKAToxSjOexs-qdN1LrQ", "message": "master not discovered or elected yet, an election requires a node with id [XF9dsvmJQ9iew2dUwQd6Bw], have discovered [{elasticsearch-sample-es-lb7qc4g6r6}{1-faHCEOSo2-eZDsV-YixA}{CKgONMtLR723Pc8n_3p0Zg}{10.16.1.58}{10.16.1.58:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, foo=bar}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, 10.16.1.58:9300] from hosts providers and [{elasticsearch-sample-es-mkt2kvbpgs}{XF9dsvmJQ9iew2dUwQd6Bw}{dw9R1PbTSMW7ah2r_mVk6A}{10.16.0.56}{10.16.0.56:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, foo=bar}, {elasticsearch-sample-es-x4dl6l8vkn}{_HmKAToxSjOexs-qdN1LrQ}{4eL_Fhr5TcmUC9xy076qWQ}{10.16.1.57}{10.16.1.57:9300}{ml.machine_memory=2147483648, xpack.installed=true, foo=bar, ml.max_open_jobs=20}, {elasticsearch-sample-es-lb7qc4g6r6}{1-faHCEOSo2-eZDsV-YixA}{CKgONMtLR723Pc8n_3p0Zg}{10.16.1.58}{10.16.1.58:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, foo=bar}] from last-known cluster state; node term 14, last-accepted version 759 in term 14" }

The two remaining nodes seem to complain about the third node (the master whose restart is in progress) not being available.

Debugging this a bit more, I realised it is related to the way we manage the file used by file-based discovery (discovery.seed_providers: file). On every reconciliation loop we inject each master node's IP (the Kubernetes pod IP) into this file: we simply inspect the current pods in the cluster and, if they're master-eligible, append their IP to the file, which then gets propagated to all nodes in the cluster.
Very soon after stopping the master node (deleting the pod but keeping its data volume around), its IP address is also removed from that file. From our perspective there is no reason to keep it: that "old" IP no longer makes sense, and when the pod is recreated (with the same data) it will probably be assigned a new IP.
So we first recreate the pod, then inject its IP into the file as soon as one is available. At that point the situation gets unblocked and a master election can proceed.
However, during the whole time the pod is being restarted and its IP is absent from the discovery file, the cluster is unavailable.
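
For context, the file-based discovery setup this refers to looks roughly like the following (a sketch only; the IPs are taken from the logs above, and the exact file path and propagation mechanism are whatever the operator uses):

    # elasticsearch.yml
    discovery.seed_providers: file

    # config/unicast_hosts.txt (one entry per master-eligible pod IP)
    10.16.1.57:9300
    10.16.1.58:9300
    10.16.0.56:9300    # removed as soon as the master pod is deleted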

If I "manually" delete the pod but keep its IP (which does not make sense anymore) in the discovery.seed_providers file, a new master gets elected instantly among the 2 remaining nodes.

I'm wondering if:

  • this is expected, and we should do whatever's necessary in the operator to avoid that situation
  • this is not expected, and something that should be fixed in Elasticsearch (in other words: if 2 of 3 master-eligible nodes are in the cluster, a leader election should happen even if the 3rd one has disappeared from hosts discovery)

@DaveCTurner commented:

This sounds like the sort of thing that is fixed by elastic/elasticsearch#39629 in 7.2. It's probably a good idea to exclude the master from the voting config before shutting it down, in order to cause it to hand over to another node while it's still alive, and this should help in 7.0 and 7.1 as well.
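
Concretely, the hand-over before restarting the current master could look something like this (a sketch only, using the 7.x exclusions endpoint that comes up later in this thread; the node name is taken from the logs above):

    # exclude the node about to be restarted; a 200 OK means it is in the
    # exclusions list and no longer part of the voting configuration
    POST /_cluster/voting_config_exclusions/elasticsearch-sample-es-mkt2kvbpgs

    # ...restart the node, then clear the exclusions (discussed further below)
    DELETE /_cluster/voting_config_exclusions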

sebgl commented Jun 24, 2019

@DaveCTurner thanks for the feedback! Your hint on voting exclusions (which we already use) helped me find a bug in the current code where exclusions were reset only after all nodes were rolled (which doesn't make sense). I think that was actually the underlying issue behind my cluster getting stuck.

@DaveCTurner commented:

That makes me suspect there might also be a bug in how you're detecting the success of adding voting config exclusions. You must check that they're in the exclusions list but also that the node IDs are removed from the voting config. This is what the API does, so it's probably simplest to hit the API until it returns 200 OK.

sebgl commented Jun 25, 2019

@DaveCTurner I'm a bit confused now.
If I understand correctly, there are 3 API calls we can do regarding voting exclusions:

  1. Add some nodes to the exclusion list (POST /_cluster/voting_config_exclusions/node_name)
  2. Remove all exclusions (DELETE /_cluster/voting_config_exclusions)
  3. Get the list of voting nodes (GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config; a sample response is sketched just below)
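
For reference, the third call returns the IDs of the current voting nodes, something like this (a sketch; the node IDs are taken from the logs above):

    GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
    {
      "metadata": {
        "cluster_coordination": {
          "last_committed_config": [
            "_HmKAToxSjOexs-qdN1LrQ",
            "1-faHCEOSo2-eZDsV-YixA",
            "XF9dsvmJQ9iew2dUwQd6Bw"
          ]
        }
      }
    }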

What we do is:

  1. When there is a node to roll, add it to the exclusion list before deleting the pod (POST /_cluster/voting_config_exclusions/node_name)
  2. When the number of nodes in the cluster is expected (node that was shut down is back in the cluster), delete the exclusion list entirely (DELETE /_cluster/voting_config_exclusions)
  3. We never get the list of voting nodes

Do you think we are missing a step here?

You must check that they're in the exclusions list

How?

it's probably simplest to hit the API until it returns 200 OK

Which API are you referencing here?

@DaveCTurner commented:

OK, I was guessing how the OP might have come about, but maybe I guessed wrong. As I understand it, you ended up adding each node to the voting config exclusions list and then shutting it down, but you weren't clearing the list at each step. This would mean that when you got to the last node (the master) you wouldn't get a 200 OK from the POST /_cluster/voting_config_exclusions/node_name API call, so I would have expected the process to stop at that point. I therefore don't quite understand how the process got as far as the situation described in the OP, in which the last node was shut down.

  2. When the number of nodes in the cluster is expected (node that was shut down is back in the cluster), delete the exclusion list entirely

I would say to do this sooner, ideally just after stopping the node. This is how it's done in the test suite:

https://github.com/elastic/elasticsearch/blob/5fa36dad0b8503bef3d91173bb342da555445c26/test/framework/src/main/java/org/elasticsearch/test/InternalTestCluster.java#L1577-L1588

When DELETE /_cluster/voting_config_exclusions returns 200 OK it means the cluster has processed the removal of the shut-down node, which means that any subsequent API calls take its removal into account. Without that call, there's a risk that you're looking at a stale cluster state.

You must check that they're in the exclusions list

How?

The safest way is to POST /_cluster/voting_config_exclusions/node_name and check that it returns 200 OK.
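
Putting the whole recommendation together, a minimal sketch of the per-node roll sequence could look like this (illustrative Go, not the operator's actual code; the Elasticsearch URL, node name and shutDownNode helper are placeholders for this example):

    // A minimal sketch of: exclude the node, shut it down, then clear exclusions.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    const esURL = "http://localhost:9200" // placeholder Elasticsearch endpoint

    // callUntilOK issues the request repeatedly until Elasticsearch answers 200 OK.
    func callUntilOK(method, path string) error {
        for attempt := 0; attempt < 30; attempt++ {
            req, err := http.NewRequest(method, esURL+path, nil)
            if err != nil {
                return err
            }
            if resp, err := http.DefaultClient.Do(req); err == nil {
                resp.Body.Close()
                if resp.StatusCode == http.StatusOK {
                    return nil
                }
            }
            time.Sleep(2 * time.Second)
        }
        return fmt.Errorf("%s %s never returned 200 OK", method, path)
    }

    // shutDownNode stands in for deleting the pod while keeping its data volume.
    func shutDownNode(name string) {}

    func main() {
        node := "elasticsearch-sample-es-mkt2kvbpgs" // node about to be rolled

        // 1. Exclude the node from the voting configuration. A 200 OK means it is
        //    both in the exclusions list and removed from the voting config.
        if err := callUntilOK("POST", "/_cluster/voting_config_exclusions/"+node); err != nil {
            panic(err)
        }

        // 2. Shut the node down.
        shutDownNode(node)

        // 3. Clear the exclusions right after stopping the node. A 200 OK means
        //    the cluster has processed the removal of the shut-down node.
        if err := callUntilOK("DELETE", "/_cluster/voting_config_exclusions"); err != nil {
            panic(err)
        }
    }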

sebgl commented Jul 8, 2019

Closing this one, thanks a lot @DaveCTurner for the help.

sebgl closed this as completed Jul 8, 2019