OCPBUGS-32183: catalog-operator: delete catalog pods stuck in Terminating state due to unreachable node #3201
Conversation
Skipping CI for Draft Pull Request.
// currentPods refers to the current pod instances of the catalog source
currentPods := c.currentPods(logger, source)

tmpPods := currentPods[:0]
nifty! didn't know about this one
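For anyone else unfamiliar with the trick: `currentPods[:0]` is a zero-length slice that shares the backing array of `currentPods`, so the loop can filter in place without allocating a second slice. A minimal sketch of the idiom (names are illustrative, not the exact PR code; `corev1` is `k8s.io/api/core/v1`):

```go
// filterLivePods drops dead pods in place: tmpPods reuses currentPods'
// backing array, so surviving elements overwrite the original storage.
func filterLivePods(currentPods []*corev1.Pod, isPodDead func(*corev1.Pod) bool) []*corev1.Pod {
	tmpPods := currentPods[:0]
	for _, pod := range currentPods {
		if !isPodDead(pod) {
			tmpPods = append(tmpPods, pod)
		}
	}
	return tmpPods
}
```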
Actually, this reminds me there's a new kid on the block (`slices.DeleteFunc`). I'll switch to that as the intent is more clear.
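For comparison, the standard-library form the comment refers to (requires `import "slices"`, Go 1.21+); also an in-place edit, but with the filtering intent stated directly. A sketch, not necessarily the exact code that landed:

```go
// deleteDeadPods is the slices.DeleteFunc form of the same in-place filter:
// it removes the elements for which isPodDead returns true and returns the
// shortened slice (survivors are shifted down, the backing array is reused).
func deleteDeadPods(currentPods []*corev1.Pod, isPodDead func(*corev1.Pod) bool) []*corev1.Pod {
	return slices.DeleteFunc(currentPods, isPodDead)
}
```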
Force-pushed from d856987 to 0038478 (…to unreachable node Signed-off-by: Joe Lanford <[email protected]>)
Force-pushed from 0038478 to 82f4997
would love to better understand the need for the errors --> pkgerrors change, to know if that's a pattern we should look for elsewhere.
This package was authored prior to standard library Go having […]. I made this change to give the […]
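For readers following the `errors` → `pkg/errors` question, the practical difference is how call-site context gets attached to an error. A minimal illustration, not code from this PR (`err` stands for some underlying error):

```go
import (
	"fmt"

	pkgerrors "github.com/pkg/errors"
)

func wrapBothWays(err error) (error, error) {
	// Standard library (Go 1.13+): %w keeps err retrievable via errors.Is/As.
	stdWrapped := fmt.Errorf("ensuring registry server: %w", err)

	// github.com/pkg/errors: same message prefix, plus a stack trace
	// captured at the call site.
	pkgWrapped := pkgerrors.Wrap(err, "ensuring registry server")

	return stdWrapped, pkgWrapped
}
```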
@@ -523,6 +545,29 @@ func imageChanged(logger *logrus.Entry, updatePod *corev1.Pod, servingPods []*co
	return false
}

func isPodDead(pod *corev1.Pod) bool {
	for _, check := range []func(*corev1.Pod) bool{
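The hunk is cut off above; a hedged reconstruction of the checks-table pattern it starts (the actual list of checks in the PR is not shown in this excerpt):

```go
// isPodDead reports whether any one of a list of independent checks considers
// the pod dead; new failure modes can be handled by appending another check.
func isPodDead(pod *corev1.Pod) bool {
	for _, check := range []func(*corev1.Pod) bool{
		isPodDeletedByTaintManager,
		// more checks can be appended here as new failure modes are found
	} {
		if check(pod) {
			return true
		}
	}
	return false
}
```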
This is a clever way to provide an extensible approach to a series of checks... but do we need it? It seems we would have the same benefits by doing s/isPodDead/isPodDeletedByTaintManager/g
I'm somewhat confident we'll find another way for pods to be dead. We've seen similar issues in operator-lib leader-for-life. So I figured I'd make things super easy for ourselves next time around.
Yeah, do we think we'll need to expand this in the future?
Well, I ain't agin' it, but I generally like to supply it when it's needed.
/lgtm
from me
I'm not objecting... just asking.
[PR 3201](operator-framework#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR fixes the issue by modifying the `RegistryReconciler` interface with the introduction of a new component for the interface: `RegistryCleaner`. This promotes the job of cleaning up the pods that are stuck (and any other resources that may need to be cleaned) to first-class status. The `RegistryCleaner` is then called as the first step in the Catalog Operator registry reconciler, so that the stuck pods are cleaned up before the rest of the reconciler logic is executed. The PR provides implementations of `RegistryCleaner` for the `GrpcReconciler`, `ConfigMapReconciler` and `GrpcAddressRegistryReconciler` implementations of the `RegistryReconciler` interface.
…ilure (#3366) [PR 3201](operator-framework/operator-lifecycle-manager#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` is returned as `false`, which in turn triggers `EnsureRegistryServer`. Upstream-repository: operator-lifecycle-manager Upstream-commit: f2431893193e7112f78298ad7682ff3e1b179d8c
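The commit message describes the flow only in prose; here is a simplified sketch of the intended control flow, with hypothetical, trimmed-down signatures (the real `RegistryReconciler` interface in OLM takes additional parameters such as a logger, and `v1alpha1.CatalogSource` stands for OLM's CatalogSource API type):

```go
// RegistryReconciler is a stand-in for OLM's interface, reduced to the two
// methods relevant here.
type RegistryReconciler interface {
	// CheckRegistryServer reports whether the registry server is healthy.
	CheckRegistryServer(source *v1alpha1.CatalogSource) (healthy bool, err error)
	// EnsureRegistryServer creates or repairs the registry server resources.
	EnsureRegistryServer(source *v1alpha1.CatalogSource) error
}

// syncRegistryServer shows how moving the dead-pod check into
// CheckRegistryServer helps: an unhealthy result triggers EnsureRegistryServer
// on every sync, not only when polling is requested by the user.
func syncRegistryServer(rec RegistryReconciler, source *v1alpha1.CatalogSource) error {
	healthy, err := rec.CheckRegistryServer(source)
	if err != nil {
		return err
	}
	if !healthy {
		return rec.EnsureRegistryServer(source)
	}
	return nil
}
```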
The fix needs some improvement, as it does not catch all the reasons why a pod can be dead.
	return false
}

func isPodDeletedByTaintManager(pod *corev1.Pod) bool {
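The body of `isPodDeletedByTaintManager` is not shown in this excerpt; a hedged reconstruction based on the Kubernetes pod disruption conditions it presumably inspects (condition type `DisruptionTarget`, reason `DeletionByTaintManager`):

```go
// isPodDeletedByTaintManager returns true if the pod was marked for deletion
// by the taint manager (e.g. its node became unreachable) but is stuck
// terminating. The string literals mirror upstream Kubernetes constants.
func isPodDeletedByTaintManager(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp == nil {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == "DisruptionTarget" &&
			cond.Reason == "DeletionByTaintManager" &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```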
I just had a situation where a catalogsource pod was evicted due to resource pressure on the node, on an OCP 4.16.21 cluster that includes this fix. Unfortunately the fix did not help, as the reason the pod is dead is different:

"reason": "TerminationByKubelet",
"status": "True",