
🌱 Add etcd endpoint health check to KCP #3810


Closed
wants to merge 1 commit

Conversation


@sedefsavas sedefsavas commented Oct 16, 2020

What this PR does / why we need it:
This PR adds an endpoint health check to the etcd health check in KCP.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Part of #3674

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2020
@sedefsavas
Author

cc @fabriziopandini cc @vincepri

@vincepri
Member

/milestone v0.3.11

@k8s-ci-robot k8s-ci-robot added this to the v0.3.11 milestone Oct 16, 2020
Member

@vincepri vincepri left a comment


Is it possible to add some tests around the new extended health check please?

@sedefsavas
Author

Some observations regarding quorum loss:

I created a workload cluster with 3 master nodes.

Edited 2 of the etcd manifests with a wrong cert path.
Observation 1: kube-apiserver and etcd on those nodes have stopped. etcd kept running on the third node, but the api-server was not running there either. Lost connection to the workload cluster, hence the only error the management cluster saw was a connection error. Therefore, there is no point checking etcd quorum because as long as we have a connection to the cluster, quorum is intact.

After fixing the cert path of 1 of the failed etcd members, its etcd pod came up, leaving 2 healthy etcd pods out of 3 members.
Observation 2: Quorum was reestablished and the management cluster was able to connect to the workload cluster. I didn’t know that once quorum is lost it could be auto-recovered. Am I missing something here?

Reached etcd’s max db size.
Observation 3: Lost the whole workload cluster, could not recover from here.

One question here is when we should call etcd unhealthy in the KCP health check. Currently, if any member is not healthy or any ready etcd pod has alarms, we fail the etcd health check and block scale up/down and rollout. IMO, as long as etcd is responding (meaning quorum is intact), we should only log the problems we see in etcd members as conditions, but not fail the health check. Checking if the control plane is healthy also covers etcd quorum, because once quorum is lost, the apiserver stops running.
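
For illustration only (this is not the actual KCP code, and the member/alarm inputs are hypothetical), a minimal sketch of the two policies being discussed: fail on any member problem, versus record problems as conditions and fail only when quorum is at risk.

```go
package etcdcheck

import "fmt"

// memberStatus is a hypothetical, simplified view of an etcd member;
// the real KCP code works with etcd client responses and Pod status.
type memberStatus struct {
	Name    string
	Healthy bool
	Alarms  []string
}

// evaluate returns whether the etcd health check passes, plus per-member
// condition messages. failOnAnyProblem mirrors the current behavior
// (any unhealthy member or alarm blocks scale up/down and rollout);
// otherwise the check fails only once quorum itself is at risk.
func evaluate(members []memberStatus, failOnAnyProblem bool) (bool, []string) {
	var conditions []string
	unhealthy := 0
	for _, m := range members {
		if !m.Healthy || len(m.Alarms) > 0 {
			unhealthy++
			conditions = append(conditions, fmt.Sprintf("member %s: healthy=%t alarms=%v", m.Name, m.Healthy, m.Alarms))
		}
	}
	if failOnAnyProblem {
		return unhealthy == 0, conditions
	}
	quorum := len(members)/2 + 1
	return len(members)-unhealthy >= quorum, conditions
}
```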

Based on these observations, not mandating that all control plane nodes have a healthy etcd pod is reasonable. We could use the etcd health check only to add conditions. What are your thoughts?
cc @fabriziopandini

@sedefsavas sedefsavas changed the title 🌱 Add etcd quorum check to KCP 🌱 Add etcd endpoint health check to KCP Oct 20, 2020
@sedefsavas
Author

Removed the quorum check from this PR as we do not have a good way to detect quorum loss.
But kept 2 changes:

  1. endpoint health check (rough sketch below)
  2. checking if an etcd member is in the control plane nodes list
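
For context, a rough sketch of what an endpoint health check can look like with go.etcd.io/etcd/clientv3; the actual KCP implementation uses its own etcd client helpers and tunnels the connection through the workload cluster, so the function below is illustrative only and its endpoint/TLS parameters are assumptions.

```go
package etcdcheck

import (
	"context"
	"crypto/tls"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// endpointHealthy dials a single etcd member endpoint and performs a
// serializable read, similar in spirit to `etcdctl endpoint health`.
// The endpoint and TLS config are assumed to be supplied by the caller.
func endpointHealthy(ctx context.Context, endpoint string, tlsCfg *tls.Config) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		TLS:         tlsCfg,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// The read succeeds only if this specific member can serve requests.
	_, err = cli.Get(ctx, "health", clientv3.WithSerializable())
	return err
}
```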

cc @fabriziopandini @vincepri

@detiber
Member

detiber commented Oct 20, 2020

Observation 1: kube-apiserver and etcd on those nodes have stopped. etcd kept running on the third node, but the api-server was not running there either. Lost connection to the workload cluster, hence the only error the management cluster saw was a connection error. Therefore, there is no point checking etcd quorum because as long as we have a connection to the cluster, quorum is intact.

That is a very good point; I would have expected the apiserver to still be functioning in a read-only mode. I'm wondering if this is a bug or an intentional change that was made in k8s at some point.

After fixing the cert path of 1 of the failed etcd members, its etcd pod came up, leaving 2 healthy etcd pods out of 3 members.
Observation 2: Quorum was reestablished and the management cluster was able to connect to the workload cluster. I didn’t know that once quorum is lost it could be auto-recovered. Am I missing something here?

It depends on how sophisticated (and complex) we want to be in the short/long term. Short term, I think it is best that we defer recovery to an experienced administrator that can make the judgement calls needed to recover safely.

Longer term it might be possible to leverage https://github.com/kubernetes-sigs/etcdadm to provide more advanced etcd management (including recovery from snapshot).

Reached etcd’s max db size.
Observation 3: Lost the whole workload cluster, could not recover from here.

One question here is when we should call etcd unhealthy in the KCP health check. Currently, if any member is not healthy or any ready etcd pod has alarms, we fail the etcd health check and block scale up/down and rollout. IMO, as long as etcd is responding (meaning quorum is intact), we should only log the problems we see in etcd members as conditions, but not fail the health check. Checking if the control plane is healthy also covers etcd quorum, because once quorum is lost, the apiserver stops running.

According to https://etcd.io/docs/v3.4.0/op-guide/maintenance/, if any member hits a storage space issue then there is etcd cluster degradation. I'm not sure we should attempt to scale the cluster in those cases (outside of possibly remediation attempts). The docs also seem to indicate that the alarm needs to be explicitly cleared in order for the cluster to resume operation (this also seems to be backed up by the Rancher docs: https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/#disarm-alarm).
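
As a reference point, the alarm has to be cleared through the maintenance API (the equivalent of `etcdctl alarm disarm`). A minimal sketch with clientv3, assuming an already-connected client and that the underlying space issue has been addressed first:

```go
package etcdcheck

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// listAndDisarmAlarms prints any raised alarms (e.g. NOSPACE) and clears
// them. Disarming should only happen after the root cause is fixed,
// e.g. after compacting and defragmenting the keyspace.
func listAndDisarmAlarms(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.AlarmList(ctx)
	if err != nil || len(resp.Alarms) == 0 {
		return err
	}
	for _, alarm := range resp.Alarms {
		fmt.Printf("member %x has alarm %v\n", alarm.MemberID, alarm.Alarm)
	}
	// An empty AlarmMember disarms all raised alarms.
	_, err = cli.AlarmDisarm(ctx, &clientv3.AlarmMember{})
	return err
}
```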

Based on these observations, not mandating that all control plane nodes have a healthy etcd pod is reasonable. We could use the etcd health check only to add conditions. What are your thoughts?

I'm hesitant to rely on proxy signals for etcd any more than necessary, because etcd is the primary resource we actually care about as part of health checking. We can accept much higher levels of service degradation in the other components (especially the scheduler and controller manager, since they use leader election locking), but once etcd quorum is lost, our ability to easily and safely recover the system without concern for data loss or control plane availability is greatly impacted.

Member

@fabriziopandini fabriziopandini left a comment


@sedefsavas thanks for reworking this after all the above discussions.
The PR in its current form is a further improvement on EtcdIsHealthy, so I'm +1 to getting this merged; just wondering if we should get this into v0.4 only instead of merging into 0.3.x + forward porting.

WRT the implementation, a few small nits, not blocking.

@fabriziopandini
Member

lgtm pending squash + a final decision on having this on the 0.3 branch with forward porting, or on 0.4 only

Member

@vincepri vincepri left a comment


+1 to v0.4 only, given that this isn't a bug fix

@vincepri vincepri changed the base branch from release-0.3 to master October 21, 2020 19:27
@vincepri
Member

/milestone v0.4.0

@sedefsavas mind rebasing?

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.11, v0.4.0 Oct 21, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign timothysc after the PR has been reviewed.
You can assign the PR to them by writing /assign @timothysc in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 22, 2020
@fabriziopandini
Member

ready to lgtm pending squash

@sedefsavas
Author

Squashed.

@fabriziopandini
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2020
@fabriziopandini
Member

/test pull-cluster-api-e2e-full-main

@k8s-ci-robot
Contributor

@sedefsavas: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
--- | --- | --- | ---
pull-cluster-api-e2e-full-main | 6b648d8 | link | /test pull-cluster-api-e2e-full-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2020
@k8s-ci-robot
Contributor

@sedefsavas: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabriziopandini
Member

/hold
for #3900

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2020
@vincepri
Member

@sedefsavas @fabriziopandini What's the status of this PR?

@fabriziopandini
Member

/close
given that #3900 is now merged, including KCP conditions for etcd

@k8s-ci-robot
Contributor

@fabriziopandini: Closed this PR.

In response to this:

/close
given that #3900 is now merged, including KCP conditions for etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
