
🌱 Add etcd endpoint health check to KCP #3810


Closed
wants to merge 1 commit

Conversation


@sedefsavas sedefsavas commented Oct 16, 2020

What this PR does / why we need it:
This PR adds an endpoint health check to the etcd health check in KCP.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Part of #3674

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2020
@sedefsavas
Author

cc @fabriziopandini cc @vincepri

@vincepri
Member

/milestone v0.3.11

@k8s-ci-robot k8s-ci-robot added this to the v0.3.11 milestone Oct 16, 2020
Member

@vincepri vincepri left a comment


Is it possible to add some tests around the new extended health check please?

@sedefsavas
Author

Some observations regarding quorum loss:

I created a workload cluster with 3 master nodes.

Edited 2 of the etcd manifests with a wrong cert path.
Observation 1: kube-apiserver and etcd on those nodes have stopped. etcd kept running on the third node, but the api-server was not running there either. Lost connection to the workload cluster, hence the only error the management cluster saw was a connection error. Therefore, there is no point checking etcd quorum because as long as we have a connection to the cluster, quorum is intact.

After fixing the cert path of 1 of the failed etcd members, its etcd pod came up, leaving 2 healthy etcd pods out of 3 members.
Observation 2: Quorum was reestablished and the management cluster was able to connect to the workload cluster. I didn’t know that once quorum is lost it could be auto-recovered. Am I missing something here?

Reached etcd’s max db size.
Observation 3: Lost the whole workload cluster, could not recover from here.

One question here is when we should call etcd unhealthy in the KCP health check. Currently, if any member is not healthy or any ready etcd pod has alarms, we fail the etcd health check and block scale up/down and rollout. IMO, as long as etcd is responding (meaning quorum is intact), we should only log the problems we see in etcd members as conditions, but not fail the health check. Checking if the control plane is healthy also covers etcd quorum, because once quorum is lost, the apiserver stops running.
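
For illustration only (this is not the actual KCP code, and the member/alarm inputs are hypothetical), a minimal sketch of the two policies being discussed: fail on any member problem, versus record problems as conditions and fail only when quorum is at risk.

```go
package etcdcheck

import "fmt"

// memberStatus is a hypothetical, simplified view of an etcd member;
// the real KCP code works with etcd client responses and Pod status.
type memberStatus struct {
	Name    string
	Healthy bool
	Alarms  []string
}

// evaluate returns whether the etcd health check passes, plus per-member
// condition messages. failOnAnyProblem mirrors the current behavior
// (any unhealthy member or alarm blocks scale up/down and rollout);
// otherwise the check fails only once quorum itself is at risk.
func evaluate(members []memberStatus, failOnAnyProblem bool) (bool, []string) {
	var conditions []string
	unhealthy := 0
	for _, m := range members {
		if !m.Healthy || len(m.Alarms) > 0 {
			unhealthy++
			conditions = append(conditions, fmt.Sprintf("member %s: healthy=%t alarms=%v", m.Name, m.Healthy, m.Alarms))
		}
	}
	if failOnAnyProblem {
		return unhealthy == 0, conditions
	}
	quorum := len(members)/2 + 1
	return len(members)-unhealthy >= quorum, conditions
}
```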

Based on these observations, not mandating that all control plane nodes have a healthy etcd pod is reasonable. We could use the etcd health check only to add conditions. What are your thoughts?
cc @fabriziopandini

@sedefsavas sedefsavas changed the title 🌱 Add etcd quorum check to KCP 🌱 Add etcd endpoint health check to KCP Oct 20, 2020
@sedefsavas
Author

Removed the quorum check from this PR as we do not have a good way to detect quorum loss.
But kept 2 changes:

  1. endpoint health check (rough sketch below)
  2. checking if an etcd member is in the control plane nodes list
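
For context, a rough sketch of what an endpoint health check can look like with go.etcd.io/etcd/clientv3; the actual KCP implementation uses its own etcd client helpers and tunnels the connection through the workload cluster, so the function below is illustrative only and its endpoint/TLS parameters are assumptions.

```go
package etcdcheck

import (
	"context"
	"crypto/tls"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// endpointHealthy dials a single etcd member endpoint and performs a
// serializable read, similar in spirit to `etcdctl endpoint health`.
// The endpoint and TLS config are assumed to be supplied by the caller.
func endpointHealthy(ctx context.Context, endpoint string, tlsCfg *tls.Config) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		TLS:         tlsCfg,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// The read succeeds only if this specific member can serve requests.
	_, err = cli.Get(ctx, "health", clientv3.WithSerializable())
	return err
}
```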

cc @fabriziopandini @vincepri

@detiber
Member

detiber commented Oct 20, 2020

Observation 1: kube-apiserver and etcd on those nodes have stopped. etcd kept running on the third node, but the api-server was not running there either. Lost connection to the workload cluster, hence the only error the management cluster saw was a connection error. Therefore, there is no point checking etcd quorum because as long as we have a connection to the cluster, quorum is intact.

That is a very good point; I would have expected the apiserver to still be functioning in a read-only mode. I'm wondering if this is a bug or an intentional change that was made in k8s at some point.

After fixing the cert path of 1 of the failed etcd members, its etcd pod came up, leaving 2 healthy etcd pods out of 3 members.
Observation 2: Quorum was reestablished and the management cluster was able to connect to the workload cluster. I didn’t know that once quorum is lost it could be auto-recovered. Am I missing something here?

It depends on how sophisticated (and complex) we want to be in the short/long term. Short term, I think it is best that we defer recovery to an experienced administrator that can make the judgement calls needed to recover safely.

Longer term it might be possible to leverage https://github.com/kubernetes-sigs/etcdadm to provide more advanced etcd management (including recovery from snapshot).

Reached etcd’s max db size.
Observation 3: Lost the whole workload cluster, could not recover from here.

One question here is when we should call etcd unhealthy in the KCP health check. Currently, if any member is not healthy or any ready etcd pod has alarms, we fail the etcd health check and block scale up/down and rollout. IMO, as long as etcd is responding (meaning quorum is intact), we should only log the problems we see in etcd members as conditions, but not fail the health check. Checking if the control plane is healthy also covers etcd quorum, because once quorum is lost, the apiserver stops running.

According to https://etcd.io/docs/v3.4.0/op-guide/maintenance/, if any member hits a storage space issue then there is etcd cluster degradation. I'm not sure we should attempt to scale the cluster in those cases (outside of possibly remediation attempts). The docs also seem to indicate that the alarm needs to be explicitly cleared in order for the cluster to resume operation (this also seems to be backed up by the Rancher docs: https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/#disarm-alarm).
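
As a reference point, the alarm has to be cleared through the maintenance API (the equivalent of `etcdctl alarm disarm`). A minimal sketch with clientv3, assuming an already-connected client and that the underlying space issue has been addressed first:

```go
package etcdcheck

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// listAndDisarmAlarms prints any raised alarms (e.g. NOSPACE) and clears
// them. Disarming should only happen after the root cause is fixed,
// e.g. after compacting and defragmenting the keyspace.
func listAndDisarmAlarms(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.AlarmList(ctx)
	if err != nil || len(resp.Alarms) == 0 {
		return err
	}
	for _, alarm := range resp.Alarms {
		fmt.Printf("member %x has alarm %v\n", alarm.MemberID, alarm.Alarm)
	}
	// An empty AlarmMember disarms all raised alarms.
	_, err = cli.AlarmDisarm(ctx, &clientv3.AlarmMember{})
	return err
}
```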

Based on these observations, not mandating that all control plane nodes have a healthy etcd pod is reasonable. We could use the etcd health check only to add conditions. What are your thoughts?

I'm hesitant to rely on proxy signals for etcd any more than necessary, because etcd is the primary resource we actually care about as part of health checking. We can accept much higher levels of service degradation in the other components (especially the scheduler and controller manager, since they use leader election locking), but once etcd quorum is lost, our ability to easily and safely recover the system without concern for data loss or control plane availability is greatly impacted.

Member

@fabriziopandini fabriziopandini left a comment


@sedefsavas thanks for reworking this after all the above discussions.
The PR in its current form is a further improvement on EtcdIsHealthy, so I'm +1 to getting this merged; just wondering if we should get this into v0.4 only instead of merging into 0.3.x + forward porting.

WRT the implementation, a few small nits, not blocking.

@fabriziopandini
Member

lgtm pending squash + a final decision on having this on the 0.3 branch with forward porting, or on 0.4 only

Member

@vincepri vincepri left a comment


+1 to v0.4 only, given that this isn't a bug fix

@vincepri vincepri changed the base branch from release-0.3 to master October 21, 2020 19:27
@vincepri
Member

/milestone v0.4.0

@sedefsavas mind rebasing?

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.11, v0.4.0 Oct 21, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign timothysc after the PR has been reviewed.
You can assign the PR to them by writing /assign @timothysc in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 22, 2020
@fabriziopandini
Member

ready to lgtm pending squash

@sedefsavas
Author

Squashed.

@fabriziopandini
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2020
@fabriziopandini
Member

/test pull-cluster-api-e2e-full-main

@k8s-ci-robot
Contributor

@sedefsavas: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
--- | --- | --- | ---
pull-cluster-api-e2e-full-main | 6b648d8 | link | /test pull-cluster-api-e2e-full-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2020
@k8s-ci-robot
Contributor

@sedefsavas: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabriziopandini
Member

/hold
for #3900

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2020
@vincepri
Member

@sedefsavas @fabriziopandini What's the status of this PR?

@fabriziopandini
Member

/close
given that #3900 is now merged, including KCP conditions for etcd

@k8s-ci-robot
Contributor

@fabriziopandini: Closed this PR.

In response to this:

/close
given that #3900 is now merged, including KCP conditions for etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
