retry unpacking jobs on failure #3016
Conversation
job, err = c.client.BatchV1().Jobs(fresh.GetNamespace()).Create(context.TODO(), fresh, metav1.CreateOptions{})
	}
	return
}
This appears to retry without limit?
What's the retry cadence? Is it exponential backoff?
Maybe it's ok to retry forever as long as we're not hammering the apiserver?
This runs whenever OLM's sync resolves a namespace; we use the default client-go workqueue, so we get exponential backoff up to ~15 min.
We do, however, reset this backoff each time a new unpack job begins, so this can become as short as a 5-second retry loop if the unpack timeout is short enough.
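For context, here is a minimal sketch of a client-go workqueue with per-item exponential backoff as described above. This is not the actual OLM wiring; the 5s base delay, 15m cap, and key name are placeholders chosen to match the numbers mentioned in this thread.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: retry delays grow from 5s up to a 15m cap.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 15*time.Minute)
	queue := workqueue.NewRateLimitingQueue(limiter)

	key := "my-namespace/my-bundle"

	// AddRateLimited re-enqueues the key after the current backoff delay.
	queue.AddRateLimited(key)
	fmt.Println("requeues so far:", limiter.NumRequeues(key))

	// Forget resets the key's backoff; this is the effect of starting a fresh
	// unpack job, which is why the loop can shrink back to the base delay.
	queue.Forget(key)
	queue.ShutDown()
}
```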
@@ -651,6 +651,14 @@ func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath

	return
}
// Cleanup old unpacking job and retry
If we don't care about persisting the failed job at all, this can be simplified to deleting the job immediately after failure and waiting for the next resolver run.
I think we should persist the failed job - we need a debug trail of some sort
@@ -651,6 +651,14 @@ func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath

	return
}
// Cleanup old unpacking job and retry
if _, isFailed := getCondition(job, batchv1.JobFailed); isFailed {
	err = c.client.BatchV1().Jobs(job.GetNamespace()).Delete(context.TODO(), job.GetName(), metav1.DeleteOptions{})
Why delete it manually rather than setting a TTL to garbage-collect it? (https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs)
I haven't looked into the entire code, but I'm assuming the controller should re-create a new Job in case the bundle has not been unpacked.
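For reference, a hypothetical sketch of the suggested TTL approach on a Job spec (not code from this PR; the 300-second value, names, and image are placeholders). Note that, as the reply below points out, the current unpacker relies on completed jobs persisting, so a TTL would only be suitable for cleaning up failed jobs.

```go
package unpack

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newUnpackJobWithTTL illustrates TTLSecondsAfterFinished: the TTL-after-finished
// controller deletes the Job (and its pods) automatically once it completes or
// fails, so no manual Delete call would be needed.
func newUnpackJobWithTTL(name, namespace string) *batchv1.Job {
	ttl := int32(300) // garbage-collect ~5 minutes after the Job finishes
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "unpack", Image: "example.com/bundle-image:placeholder"},
					},
				},
			},
		},
	}
}
```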
iirc, the current implementation requires completed jobs to persist to indicate an unpacked bundle.
// BundleUnpackRetryMinimumIntervalAnnotationKey sets a minimum interval to wait before
// attempting to recreate a failed unpack job for a bundle.
BundleUnpackRetryMinimumIntervalAnnotationKey = "operatorframework.io/bundle-unpack-min-retry-interval"
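As an illustration of how this annotation would be used (hypothetical, not from this PR): the OperatorGroup name and namespace are placeholders, the value format is assumed to be a Go duration string, and the types are assumed to come from operator-framework/api.

```go
package unpack

import (
	operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleOperatorGroup is annotated so that failed unpack jobs for bundles
// resolved in its namespace are recreated no sooner than 5 minutes apart.
var exampleOperatorGroup = &operatorsv1.OperatorGroup{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "my-operatorgroup",
		Namespace: "my-namespace",
		Annotations: map[string]string{
			"operatorframework.io/bundle-unpack-min-retry-interval": "5m",
		},
	},
}
```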
Will you have a follow-up PR to document how to use this field?
I'll follow up with operator-framework/olm-docs#313 once the PR is merged
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ankitathomas, tmshort.
Description of the change:
Recreate failed bundle unpack jobs to allow for automatic retries on unpacking failure.
Motivation for the change:
Bundle unpack jobs may fail due to transient network problems or cluster configuration issues that require user intervention to resolve. Because unpack jobs have deterministic names referencing the bundle they correspond to, recovering from an unpack failure currently requires manually deleting the associated unpack job.
This PR automates the recreation of failed unpack jobs indefinitely, honoring a minimum interval between retries when one is specified via a new OperatorGroup annotation.
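To make the retry flow concrete, here is a minimal sketch of gating recreation of a failed unpack job on a minimum interval. This is not the actual PR code: the helper name, the use of the JobFailed condition's LastTransitionTime as the failure timestamp, and the surrounding wiring into ensureJob are assumptions.

```go
package unpack

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maybeDeleteFailedJob deletes a failed unpack Job once minRetryInterval has
// elapsed since its failure, so the next sync can recreate a Job with the same
// deterministic name. It returns true if the Job was deleted.
func maybeDeleteFailedJob(ctx context.Context, client kubernetes.Interface, job *batchv1.Job, minRetryInterval time.Duration) (bool, error) {
	for _, cond := range job.Status.Conditions {
		if cond.Type != batchv1.JobFailed || cond.Status != corev1.ConditionTrue {
			continue
		}
		if time.Since(cond.LastTransitionTime.Time) < minRetryInterval {
			// Too soon: keep the failed Job around as a debug trail and let a
			// later sync (after the workqueue backoff) attempt the retry.
			return false, nil
		}
		// Delete the failed Job; the unpacker's ensureJob path creates a fresh one.
		err := client.BatchV1().Jobs(job.GetNamespace()).Delete(ctx, job.GetName(), metav1.DeleteOptions{})
		return err == nil, err
	}
	return false, nil
}
```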