🌱 Add etcd endpoint health check to KCP #3810
Conversation
cc @fabriziopandini cc @vincepri |
/milestone v0.3.11 |
Is it possible to add some tests around the new extended health check please?
Some observations regarding quorum loss: I created a workload cluster with 3 master nodes and edited 2 of the etcd manifests to point at a wrong cert path. After fixing the cert path for one of the failed etcds, its pod came up, leaving 2 healthy etcd pods out of 3 members. I also hit etcd's max db size. One question here is when we should call etcd unhealthy in the KCP health check. Currently, if any member is not healthy or if any ready etcd pod has alarms, we fail the etcd health check and block scale up/down and rollout. IMO, as long as etcd is responding (meaning quorum is intact), we should only record the problems we see in etcd members as conditions but not fail the health check. Checking whether the control plane is healthy also covers etcd quorum, because once quorum is lost, the apiserver stops running. Based on these observations, not mandating that all control plane nodes have a healthy etcd pod seems reasonable; we could use the etcd health check only to add conditions. What are your thoughts? |
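To make the proposal above concrete, here is a minimal Go sketch of that behaviour. It is not KCP's actual code: MemberHealth, Condition, and EvaluateEtcdHealth are hypothetical names, and "no member responded" is used as a deliberately simplified stand-in for "etcd has stopped responding / quorum may be lost".

```go
package etcdhealth

import "fmt"

// MemberHealth is a hypothetical per-member snapshot collected by the
// KCP etcd health check (probe result plus any active alarms).
type MemberHealth struct {
	Name      string
	Responded bool     // the member answered a status/health probe
	Alarms    []string // e.g. "NOSPACE"
}

// Condition is an illustrative stand-in for a KCP status condition.
type Condition struct {
	Member  string
	Message string
}

// EvaluateEtcdHealth records per-member problems as conditions, but only
// returns an error when no member responds at all, i.e. when etcd itself
// has stopped answering and quorum is presumably gone.
func EvaluateEtcdHealth(members []MemberHealth) ([]Condition, error) {
	var conditions []Condition
	responding := 0
	for _, m := range members {
		if !m.Responded {
			conditions = append(conditions, Condition{Member: m.Name, Message: "member is not responding"})
			continue
		}
		responding++
		for _, alarm := range m.Alarms {
			conditions = append(conditions, Condition{Member: m.Name, Message: "active alarm: " + alarm})
		}
	}
	if responding == 0 {
		return conditions, fmt.Errorf("no etcd member is responding; quorum may be lost")
	}
	return conditions, nil
}
```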
Force-pushed from dd9e194 to 7104268 (Compare)
Removed quorum check from this PR as we do not have a good way to detect quorum loss.
|
That is a very good point, I would have expected the apiserver to still be functioning in a read-only mode. I'm wondering if this is a bug or if this was an intentional change that was made in k8s at some point.
It depends on how sophisticated (and complex) we want to be in the short/long term. Short term, I think it is best that we defer recovery to an experienced administrator that can make the judgement calls needed to recover safely. Longer term it might be possible to leverage https://github.com/kubernetes-sigs/etcdadm to provide more advanced etcd management (including recovery from snapshot).
According to https://etcd.io/docs/v3.4.0/op-guide/maintenance/, if any member hits a storage space issue then the etcd cluster is degraded. I'm not sure we should attempt to scale the cluster in those cases (outside of possibly remediation attempts). The docs also seem to indicate that the alarm needs to be explicitly cleared in order for the cluster to resume operation (this also seems to be backed up by the Rancher docs: https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/#disarm-alarm).
I'm a bit hesitant to rely on proxy signals for etcd, only because etcd is the primary resource we actually care about as part of health checking. We can accept much higher levels of service degradation in the other components (especially the scheduler and controller manager, since they use leader election locking), but once etcd quorum is lost, our ability to easily and safely recover the system without concern for data loss or control plane availability is greatly impacted. |
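As a side note on the alarm point above: the maintenance docs linked earlier say the NOSPACE alarm must be explicitly disarmed before the cluster resumes normal writes. Below is a rough sketch of doing that with the upstream etcd clientv3 package; the endpoint, timeouts, and import path are assumptions (newer etcd releases use go.etcd.io/etcd/client/v3), and KCP itself goes through its own internal etcd client rather than calling clientv3 directly like this.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoint for illustration; a real check would use the
	// member client URLs and the cluster's TLS configuration.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List active alarms, e.g. NOSPACE after hitting the db size quota.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		panic(err)
	}
	for _, alarm := range resp.Alarms {
		fmt.Printf("member %x has active alarm %v\n", alarm.MemberID, alarm.Alarm)
	}

	// Disarming with an empty AlarmMember clears all listed alarms;
	// this mirrors what `etcdctl alarm disarm` does.
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		panic(err)
	}
}
```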
@sedefsavas thanks for reworking this after all the above discussions.
The PR in its current form is a further improvement on EtcdIsHealthy, so I'm +1 to get this merged; I'm just wondering if we should get this into v0.4 only instead of merging in 0.3.x and forward porting.
WRT the implementation, a few small nits, not blocking.
Force-pushed from 7104268 to ec7b948 (Compare)
lgtm pending squash + a final decision on whether to have this on the 0.3 branch with forward porting, or on 0.4 only |
+1 to only v0.4, given that this isn't a bug fix
/milestone v0.4.0 @sedefsavas mind rebasing? |
Force-pushed from ec7b948 to 7b8428d (Compare)
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Force-pushed from 7b8428d to fc76c12 (Compare)
ready to lgtm pending squash |
Force-pushed from fc76c12 to 6b648d8 (Compare)
Squashed. |
/lgtm |
/test pull-cluster-api-e2e-full-main |
@sedefsavas: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@sedefsavas: PR needs rebase. |
/hold |
@sedefsavas @fabriziopandini What's the status of this PR? |
/close |
@fabriziopandini: Closed this PR. In response to this: |
What this PR does / why we need it:
This PR adds an endpoint health check to the etcd health check in KCP.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #3674
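For context on what "endpoint health check" can mean here, the sketch below probes each etcd member's client endpoint individually with a Status call, so a down or slow endpoint fails its own probe instead of being hidden behind a single cluster-level request. This is only an illustration against the upstream clientv3 API, not the code added by this PR; the package and function names and the timeout are made up.

```go
package etcdcheck

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// checkEndpoints issues a Status request against each member endpoint and
// records the per-endpoint error (nil means the endpoint answered).
func checkEndpoints(ctx context.Context, cli *clientv3.Client, endpoints []string) map[string]error {
	results := make(map[string]error, len(endpoints))
	for _, ep := range endpoints {
		epCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		_, err := cli.Status(epCtx, ep)
		cancel()
		results[ep] = err
	}
	return results
}
```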