Leader election may be too slow to re-elect new master #784
Comments
Leader election for operators is primarily focused on guaranteeing that when multiple pods are running as the same operator, only one of them can be active. Most operators run a single pod at a time, but overlap can happen during an operator upgrade, pod rescheduling for whatever reason, etc. Leader election is less focused on providing high availability of the sort where you would run multiple pods all the time in order to have a warm spare that can take over should the leader fail. You can do that, and leader election will work for that case, but as you observe, the scenario of a failed node is difficult.

This documentation describes in detail the challenges and potential ambiguity that come with node failure. Simply put, if a node fails, it may be impossible to determine whether a pod is still running on it or not. If your leader happens to be on a failed node, in most cases its pod will only be deleted after the pod-eviction-timeout expires.

If it is important to you that an operator on an unreachable node gets rescheduled more quickly, you may have the same concern about other workloads, and it would make sense to look at lowering the pod-eviction-timeout. Otherwise, if your priority is quick recovery from a node that has gone silent, you might prefer to use the lease-based leader election that is provided by controller-runtime. Using the lease-based approach is a trade-off: it has weaker guarantees about preventing concurrent leadership, but it recovers from the missing/frozen/disconnected/silent node problem more quickly.

Hopefully total failure of a node, of the kind where it suddenly goes silent and Kubernetes is not otherwise able to determine whether it is still alive, will be rare for you. I think most people will prefer the guarantees that come with our leader election and will be able to work with a > 1 minute SLA in case of ambiguous node failure.
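For concreteness, here is a minimal sketch of the "leader for life" approach described in the comment above, assuming the operator-sdk `leader` package; the import path and lock name are illustrative and may differ between SDK versions. The lock is a ConfigMap owned by the leader pod, so it is only released when that pod is deleted, which on a silent node may not happen until the pod-eviction-timeout expires.

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Become blocks until this pod owns the lock ConfigMap. The lock name
	// ("my-operator-lock") is a placeholder. If a previous leader pod still
	// exists, even on an unreachable node, this call keeps retrying until
	// Kubernetes finally deletes that pod.
	if err := leader.Become(context.TODO(), "my-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... set up and start the manager / controllers here ...
}
```

The strength of this approach is that two pods can never hold the lock at the same time; the cost is exactly the slow takeover on node failure described above.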
Just for reference, controller-runtime supports turning on lease-based leader election via the manager options. Perhaps we can document that more clearly as an alternative, so users can choose the trade-off they want to make with their choice of leader election.
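As a rough illustration, a sketch of enabling that via controller-runtime's manager options; the ID and namespace below are placeholders, and the tuning fields (LeaseDuration, RenewDeadline, RetryPeriod) are only present in some controller-runtime versions.

```go
package main

import (
	"log"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
		// Lease-based election: the leader must keep renewing its lease, so if
		// its node goes silent, another pod can take over after roughly
		// LeaseDuration instead of waiting for the old pod to be deleted.
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-lock",      // placeholder
		LeaderElectionNamespace: "my-operator-namespace", // placeholder
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// ... register controllers with mgr, then call mgr.Start(...) ...
	_ = mgr
}
```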
Is there a list of known issues with client-go/leader-election? In my opinion it's more reliable than relying on pod deletion. I prefer faster recovery; critical OpenShift components depend on it. IMO we can't afford a controller being unavailable for 5 minutes.
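For reference, a sketch of what client-go's leader election looks like when used directly (k8s.io/client-go/tools/leaderelection). The timings are hypothetical, chosen so that a replacement leader can take over within roughly LeaseDuration of the old leader going silent; the lock name and namespace are placeholders.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Identity must be unique per pod; the pod's hostname is a common choice.
	id, _ := os.Hostname()

	// A Lease object in the coordination.k8s.io API group backs the lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-operator-lock", Namespace: "my-operator-namespace"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second, // how long a silent leader keeps the lease
		RenewDeadline:   10 * time.Second, // leader gives up if it cannot renew within this
		RetryPeriod:     2 * time.Second,  // how often candidates retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader; starting controllers")
				// ... run the operator here ...
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; exiting")
				os.Exit(1)
			},
		},
	})
}
```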
+1
Current leader election depends on Kubernetes deleting faulty pods relatively quickly. It does not work well when the leader is on a node that becomes unresponsive (network partition, kubelet hang, ...). The pod is not deleted automatically, so the leader is not working and a new one cannot be elected. I'd expect a new leader to be available in ~1 minute even in the worst conditions.