Skip to content

Add a check for head pod imagePullSecrets #601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 26, 2024
Merged

Add a check for head pod imagePullSecrets #601

merged 2 commits into from
Jul 26, 2024

Conversation

Ygnas
Copy link
Contributor

@Ygnas Ygnas commented Jul 24, 2024

Issue link

https://issues.redhat.com/browse/RHOAIENG-9247

What changes have been made

Added a check for imagePullSecrets in the head pod, and if it does not exists delete the head pod, so it can be redeployed.

Verification steps

Set up an integrated OpenShift container registry and upload an image to it:
https://github.com/opendatahub-io/distributed-workloads/tree/main/images/runtime/examples#readme
Use the uploaded image inside the notebooks, head pod should not enter the ImagePullBackOff state (happens only frequently but not always).

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@Ygnas Ygnas marked this pull request as draft July 24, 2024 17:16
@sutaakar
Copy link
Contributor

@Ygnas Can you provide unit tests for the change?

@astefanutti
Copy link
Contributor

/lgtm

I've tested it multiple times and never encountered the issue.

@openshift-ci openshift-ci bot added lgtm and removed lgtm labels Jul 25, 2024
@Ygnas Ygnas marked this pull request as ready for review July 25, 2024 13:29
@openshift-ci openshift-ci bot requested a review from tedhtchang July 25, 2024 13:29
@astefanutti
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 25, 2024
Comment on lines 181 to 184
if len(sa.ImagePullSecrets) == 0 {
sa.ImagePullSecrets = []corev1.LocalObjectReference{
{Name: "test-image-pull-secret"},
}
_, err = k8sClient.CoreV1().ServiceAccounts(namespaceName).Update(ctx, sa, metav1.UpdateOptions{})
if err != nil {
return err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code doesn't belong into eventually check.
This code should be executed once SA is available.

Just a generic note - try to avoid conditions in tests if possible. In this case you can i.e. check that ImagePullSecrets is empty, as a next step you can put there a value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I think, I made it way better now

@openshift-ci openshift-ci bot removed the lgtm label Jul 25, 2024
Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, one point of clarification

Comment on lines +216 to +218
if err := r.deleteHeadPodIfMissingImagePullSecrets(ctx, cluster); err != nil {
return ctrl.Result{RequeueAfter: requeueTime}, err
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this not occur in WorkerPods too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically yes. Practically, the worker Pods init container that waits for the head Pod to become ready makes it so it does not happen. I'd be inclined to keep the logic simple at the moment and iterate if necessary.

Copy link
Contributor

@sutaakar sutaakar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@astefanutti
Copy link
Contributor

/approve

Copy link

openshift-ci bot commented Jul 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 133db94 into project-codeflare:main Jul 26, 2024
8 checks passed
Comment on lines +568 to +580
missingSecrets := map[string]bool{}
for _, secret := range serviceAccount.ImagePullSecrets {
missingSecrets[secret.Name] = true
}
for _, secret := range headPod.Spec.ImagePullSecrets {
delete(missingSecrets, secret.Name)
}
if len(missingSecrets) > 0 {
if err := r.kubeClient.CoreV1().Pods(headPod.Namespace).Delete(ctx, headPod.Name, metav1.DeleteOptions{}); err != nil {
return fmt.Errorf("failed to delete head pod: %w", err)
}
}
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think i fully understand this part:
what would happen if the serviceAccount has multiple ImagePullSecrets? then it can get into an infinitely reconcile by deleting headPod?

Copy link
Contributor Author

@Ygnas Ygnas Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zdtsw I don't think it should, I just tried adding extra ImagePullSecrets into the SA, the head was deleted and recreated with the extra ones. I think all the ImagePullSecrets always end up in the headPod.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants