retry unpacking jobs on failure #3016
Conversation
job, err = c.client.BatchV1().Jobs(fresh.GetNamespace()).Create(context.TODO(), fresh, metav1.CreateOptions{})
	}
	return
}
This appears to retry without limit?
What's the retry cadence? Is it exponential backoff?
Maybe it's ok to retry forever as long as we're not hammering the apiserver?
This runs whenever OLM's sync resolves a namespace; we use the default client-go workqueue, so we get exponential backoff up to ~15 min.
We do, however, reset this backoff each time a new unpack job begins, so this can become as short as a 5-second retry loop if the unpack timeout is short enough.
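For context, here is a minimal sketch of a client-go workqueue with per-item exponential backoff as described above. This is not the actual OLM wiring; the 5s base delay, 15m cap, and key name are placeholders chosen to match the numbers mentioned in this thread.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: retry delays grow from 5s up to a 15m cap.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 15*time.Minute)
	queue := workqueue.NewRateLimitingQueue(limiter)

	key := "my-namespace/my-bundle"

	// AddRateLimited re-enqueues the key after the current backoff delay.
	queue.AddRateLimited(key)
	fmt.Println("requeues so far:", limiter.NumRequeues(key))

	// Forget resets the key's backoff; this is the effect of starting a fresh
	// unpack job, which is why the loop can shrink back to the base delay.
	queue.Forget(key)
	queue.ShutDown()
}
```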
@@ -651,6 +651,14 @@ func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath

	return
}
// Cleanup old unpacking job and retry
If we don't care about persisting the failed job at all, this can be simplified to deleting the job immediately after failure and waiting for the next resolver run.
I think we should persist the failed job - we need a debug trail of some sort
@@ -651,6 +651,14 @@ func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath

	return
}
// Cleanup old unpacking job and retry
if _, isFailed := getCondition(job, batchv1.JobFailed); isFailed {
	err = c.client.BatchV1().Jobs(job.GetNamespace()).Delete(context.TODO(), job.GetName(), metav1.DeleteOptions{})
Why delete it manually rather than setting a TTL to garbage-collect it? (https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs)
I haven't looked into the entire code, but I'm assuming the controller should re-create a new Job in case the bundle has not been unpacked.
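For reference, a hypothetical sketch of the suggested TTL approach on a Job spec (not code from this PR; the 300-second value, names, and image are placeholders). Note that, as the reply below points out, the current unpacker relies on completed jobs persisting, so a TTL would only be suitable for cleaning up failed jobs.

```go
package unpack

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newUnpackJobWithTTL illustrates TTLSecondsAfterFinished: the TTL-after-finished
// controller deletes the Job (and its pods) automatically once it completes or
// fails, so no manual Delete call would be needed.
func newUnpackJobWithTTL(name, namespace string) *batchv1.Job {
	ttl := int32(300) // garbage-collect ~5 minutes after the Job finishes
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "unpack", Image: "example.com/bundle-image:placeholder"},
					},
				},
			},
		},
	}
}
```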
iirc, the current implementation requires completed jobs to persist to indicate an unpacked bundle.
// BundleUnpackRetryMinimumIntervalAnnotationKey sets a minimum interval to wait before
// attempting to recreate a failed unpack job for a bundle.
BundleUnpackRetryMinimumIntervalAnnotationKey = "operatorframework.io/bundle-unpack-min-retry-interval"
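As an illustration of how this annotation would be used (hypothetical, not from this PR): the OperatorGroup name and namespace are placeholders, the value format is assumed to be a Go duration string, and the types are assumed to come from operator-framework/api.

```go
package unpack

import (
	operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleOperatorGroup is annotated so that failed unpack jobs for bundles
// resolved in its namespace are recreated no sooner than 5 minutes apart.
var exampleOperatorGroup = &operatorsv1.OperatorGroup{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "my-operatorgroup",
		Namespace: "my-namespace",
		Annotations: map[string]string{
			"operatorframework.io/bundle-unpack-min-retry-interval": "5m",
		},
	},
}
```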
Will you have a follow-up PR to document how to use this field?
I'll follow up with operator-framework/olm-docs#313 once the PR is merged
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ankitathomas, tmshort.
Description of the change:
Recreate failed bundle unpack jobs to allow for automatic retries on unpacking failure.
Motivation for the change:
Bundle unpack jobs may fail due to transient network problems or cluster configuration issues that require user intervention to resolve. Because unpack jobs have deterministic names referencing the bundle they correspond to, recovering from an unpack failure currently requires manually deleting the associated unpack job.
This PR automates the recreation of failed unpack jobs indefinitely, honoring a minimum interval between retries when one is specified via a new OperatorGroup annotation.
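To make the retry flow concrete, here is a minimal sketch of gating recreation of a failed unpack job on a minimum interval. This is not the actual PR code: the helper name, the use of the JobFailed condition's LastTransitionTime as the failure timestamp, and the surrounding wiring into ensureJob are assumptions.

```go
package unpack

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maybeDeleteFailedJob deletes a failed unpack Job once minRetryInterval has
// elapsed since its failure, so the next sync can recreate a Job with the same
// deterministic name. It returns true if the Job was deleted.
func maybeDeleteFailedJob(ctx context.Context, client kubernetes.Interface, job *batchv1.Job, minRetryInterval time.Duration) (bool, error) {
	for _, cond := range job.Status.Conditions {
		if cond.Type != batchv1.JobFailed || cond.Status != corev1.ConditionTrue {
			continue
		}
		if time.Since(cond.LastTransitionTime.Time) < minRetryInterval {
			// Too soon: keep the failed Job around as a debug trail and let a
			// later sync (after the workqueue backoff) attempt the retry.
			return false, nil
		}
		// Delete the failed Job; the unpacker's ensureJob path creates a fresh one.
		err := client.BatchV1().Jobs(job.GetNamespace()).Delete(ctx, job.GetName(), metav1.DeleteOptions{})
		return err == nil, err
	}
	return false, nil
}
```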