OCPBUGS-32183: catalog-operator: delete catalog pods stuck in Terminating state due to unreachable node #3201
Conversation
Skipping CI for Draft Pull Request.
// currentPods refers to the current pod instances of the catalog source
currentPods := c.currentPods(logger, source)

tmpPods := currentPods[:0]
nifty! didn't know about this one
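For anyone else unfamiliar with the trick: `currentPods[:0]` is a zero-length slice that shares the backing array of `currentPods`, so the loop can filter in place without allocating a second slice. A minimal sketch of the idiom (names are illustrative, not the exact PR code; `corev1` is `k8s.io/api/core/v1`):

```go
// filterLivePods drops dead pods in place: tmpPods reuses currentPods'
// backing array, so surviving elements overwrite the original storage.
func filterLivePods(currentPods []*corev1.Pod, isPodDead func(*corev1.Pod) bool) []*corev1.Pod {
	tmpPods := currentPods[:0]
	for _, pod := range currentPods {
		if !isPodDead(pod) {
			tmpPods = append(tmpPods, pod)
		}
	}
	return tmpPods
}
```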
Actually, this reminds me there's a new kid on the block (`slices.DeleteFunc`). I'll switch to that as the intent is more clear.
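For comparison, the standard-library form the comment refers to (requires `import "slices"`, Go 1.21+); also an in-place edit, but with the filtering intent stated directly. A sketch, not necessarily the exact code that landed:

```go
// deleteDeadPods is the slices.DeleteFunc form of the same in-place filter:
// it removes the elements for which isPodDead returns true and returns the
// shortened slice (survivors are shifted down, the backing array is reused).
func deleteDeadPods(currentPods []*corev1.Pod, isPodDead func(*corev1.Pod) bool) []*corev1.Pod {
	return slices.DeleteFunc(currentPods, isPodDead)
}
```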
Force-pushed from d856987 to 0038478 (…to unreachable node Signed-off-by: Joe Lanford <[email protected]>)
Force-pushed from 0038478 to 82f4997
would love to better understand the need for the errors --> pkgerrors change, to know if that's a pattern we should look for elsewhere.
This package was authored prior to standard library Go having […]. I made this change to give the […]
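For readers following the `errors` → `pkg/errors` question, the practical difference is how call-site context gets attached to an error. A minimal illustration, not code from this PR (`err` stands for some underlying error):

```go
import (
	"fmt"

	pkgerrors "github.com/pkg/errors"
)

func wrapBothWays(err error) (error, error) {
	// Standard library (Go 1.13+): %w keeps err retrievable via errors.Is/As.
	stdWrapped := fmt.Errorf("ensuring registry server: %w", err)

	// github.com/pkg/errors: same message prefix, plus a stack trace
	// captured at the call site.
	pkgWrapped := pkgerrors.Wrap(err, "ensuring registry server")

	return stdWrapped, pkgWrapped
}
```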
@@ -523,6 +545,29 @@ func imageChanged(logger *logrus.Entry, updatePod *corev1.Pod, servingPods []*co
	return false
}

func isPodDead(pod *corev1.Pod) bool {
	for _, check := range []func(*corev1.Pod) bool{
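The hunk is cut off above; a hedged reconstruction of the checks-table pattern it starts (the actual list of checks in the PR is not shown in this excerpt):

```go
// isPodDead reports whether any one of a list of independent checks considers
// the pod dead; new failure modes can be handled by appending another check.
func isPodDead(pod *corev1.Pod) bool {
	for _, check := range []func(*corev1.Pod) bool{
		isPodDeletedByTaintManager,
		// more checks can be appended here as new failure modes are found
	} {
		if check(pod) {
			return true
		}
	}
	return false
}
```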
This is a clever way to provide an extensible approach to a series of checks... but do we need it? It seems we would have the same benefits by doing s/isPodDead/isPodDeletedByTaintManager/g
I'm somewhat confident we'll find another way for pods to be dead. We've seen similar issues in operator-lib leader-for-life. So I figured I'd make things super easy for ourselves next time around.
Yeah, do we think we'll need to expand this in the future?
Well, I ain't agin' it, but I generally like to supply it when it's needed.
/lgtm
from me
I'm not objecting... just asking.
[PR 3201](operator-framework#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR fixes the issue by modifying the `RegistryReconciler` interface with the introduction of a new component for the interface: `RegistryCleaner`. This promotes the job of cleaning up the pods that are stuck (and any other resources that may need to be cleaned) to first-class status. The `RegistryCleaner` is then called as the first step in the Catalog Operator registry reconciler, so that the stuck pods are cleaned up before the rest of the reconciler logic is executed. The PR provides implementations of `RegistryCleaner` for the `GrpcReconciler`, `ConfigMapReconciler` and `GrpcAddressRegistryReconciler` implementations of the `RegistryReconciler` interface.
…ilure (#3366) [PR 3201](operator-framework/operator-lifecycle-manager#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` is returned as `false`, which in turn triggers `EnsureRegistryServer`. Upstream-repository: operator-lifecycle-manager Upstream-commit: f2431893193e7112f78298ad7682ff3e1b179d8c
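The commit message describes the flow only in prose; here is a simplified sketch of the intended control flow, with hypothetical, trimmed-down signatures (the real `RegistryReconciler` interface in OLM takes additional parameters such as a logger, and `v1alpha1.CatalogSource` stands for OLM's CatalogSource API type):

```go
// RegistryReconciler is a stand-in for OLM's interface, reduced to the two
// methods relevant here.
type RegistryReconciler interface {
	// CheckRegistryServer reports whether the registry server is healthy.
	CheckRegistryServer(source *v1alpha1.CatalogSource) (healthy bool, err error)
	// EnsureRegistryServer creates or repairs the registry server resources.
	EnsureRegistryServer(source *v1alpha1.CatalogSource) error
}

// syncRegistryServer shows how moving the dead-pod check into
// CheckRegistryServer helps: an unhealthy result triggers EnsureRegistryServer
// on every sync, not only when polling is requested by the user.
func syncRegistryServer(rec RegistryReconciler, source *v1alpha1.CatalogSource) error {
	healthy, err := rec.CheckRegistryServer(source)
	if err != nil {
		return err
	}
	if !healthy {
		return rec.EnsureRegistryServer(source)
	}
	return nil
}
```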
The fix needs some improvement, as it does not catch all the reasons why a pod can be dead.
	return false
}

func isPodDeletedByTaintManager(pod *corev1.Pod) bool {
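The body of `isPodDeletedByTaintManager` is not shown in this excerpt; a hedged reconstruction based on the Kubernetes pod disruption conditions it presumably inspects (condition type `DisruptionTarget`, reason `DeletionByTaintManager`):

```go
// isPodDeletedByTaintManager returns true if the pod was marked for deletion
// by the taint manager (e.g. its node became unreachable) but is stuck
// terminating. The string literals mirror upstream Kubernetes constants.
func isPodDeletedByTaintManager(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp == nil {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == "DisruptionTarget" &&
			cond.Reason == "DeletionByTaintManager" &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```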
I just had a situation where a catalogsource pod was evicted due to resource pressure on the node, on an OCP 4.16.21 cluster that includes this fix. Unfortunately the fix did not help, as the reason the pod is dead is different:

"reason": "TerminationByKubelet",
"status": "True",