Leader election may be too slow to re-elect new master #784
Comments
Leader election for operators is primarily focused on guaranteeing that when multiple pods are running as the same operator, only one of them can be active. Most operators run a single pod at a time, but overlap can happen during an operator upgrade, pod rescheduling for whatever reason, etc. Leader election is less focused on providing high availability of the sort where you would run multiple pods all the time in order to have a warm spare that can take over should the leader fail. You can do that, and leader election will work for that case, but as you observe, the scenario of a failed node is difficult.

This documentation describes in detail the challenges and potential ambiguity that come with node failure. Simply put, if a node fails, it may be impossible to determine whether a pod is still running on it or not. If your leader happens to be on a failed node, in most cases its pod will only be deleted after the pod-eviction-timeout expires.

If it is important to you that an operator on an unreachable node gets rescheduled more quickly, you may have the same concern about other workloads, and it would make sense to look at lowering the pod-eviction-timeout. Otherwise, if your priority is quick recovery from a node that has gone silent, you might prefer to use the lease-based leader election that is provided by controller-runtime. Using the lease-based approach is a trade-off: it has weaker guarantees about preventing concurrent leadership, but it recovers from the missing/frozen/disconnected/silent node problem more quickly.

Hopefully total failure of a node, of the kind where it suddenly goes silent and Kubernetes is not otherwise able to determine whether it is still alive, will be rare for you. I think most people will prefer the guarantees that come with our leader election and will be able to work with a > 1 minute SLA in case of ambiguous node failure.
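For concreteness, here is a minimal sketch of the "leader for life" approach described in the comment above, assuming the operator-sdk `leader` package; the import path and lock name are illustrative and may differ between SDK versions. The lock is a ConfigMap owned by the leader pod, so it is only released when that pod is deleted, which on a silent node may not happen until the pod-eviction-timeout expires.

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Become blocks until this pod owns the lock ConfigMap. The lock name
	// ("my-operator-lock") is a placeholder. If a previous leader pod still
	// exists, even on an unreachable node, this call keeps retrying until
	// Kubernetes finally deletes that pod.
	if err := leader.Become(context.TODO(), "my-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... set up and start the manager / controllers here ...
}
```

The strength of this approach is that two pods can never hold the lock at the same time; the cost is exactly the slow takeover on node failure described above.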
Just for reference, controller-runtime supports turning on lease-based leader election via the manager options. Perhaps we can document that more clearly as an alternative, so users can choose the trade-off they want to make with their choice of leader election.
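As a rough illustration, a sketch of enabling that via controller-runtime's manager options; the ID and namespace below are placeholders, and the tuning fields (LeaseDuration, RenewDeadline, RetryPeriod) are only present in some controller-runtime versions.

```go
package main

import (
	"log"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
		// Lease-based election: the leader must keep renewing its lease, so if
		// its node goes silent, another pod can take over after roughly
		// LeaseDuration instead of waiting for the old pod to be deleted.
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-lock",      // placeholder
		LeaderElectionNamespace: "my-operator-namespace", // placeholder
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// ... register controllers with mgr, then call mgr.Start(...) ...
	_ = mgr
}
```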
Is there a list of known issues with client-go/leader-election? In my opinion it's more reliable than relying on pod deletion. I prefer faster recovery; critical OpenShift components depend on it. IMO we can't afford a controller being unavailable for 5 minutes.
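For reference, a sketch of what client-go's leader election looks like when used directly (k8s.io/client-go/tools/leaderelection). The timings are hypothetical, chosen so that a replacement leader can take over within roughly LeaseDuration of the old leader going silent; the lock name and namespace are placeholders.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Identity must be unique per pod; the pod's hostname is a common choice.
	id, _ := os.Hostname()

	// A Lease object in the coordination.k8s.io API group backs the lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-operator-lock", Namespace: "my-operator-namespace"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second, // how long a silent leader keeps the lease
		RenewDeadline:   10 * time.Second, // leader gives up if it cannot renew within this
		RetryPeriod:     2 * time.Second,  // how often candidates retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader; starting controllers")
				// ... run the operator here ...
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; exiting")
				os.Exit(1)
			},
		},
	})
}
```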
+1
Current leader election depends on Kubernetes deleting faulty pods relatively quickly. It does not work well when the leader is on a node that becomes unresponsive (network partition, kubelet hang, ...). The pod is not deleted automatically, so the leader is not working and a new one cannot be elected. I'd expect a new leader to be available in ~1 minute even in the worst conditions.