Exit CFO initialisation if RayCluster CRD is not available #512

ChristianZaccaria · 2024-04-12T09:24:15Z

Issue link

Jira: https://issues.redhat.com/browse/RHOAIENG-5331

What changes have been made

Refactored to exit early if the RayCluster CRD is not present in the cluster.

Note: This could be considered a temporary change until the RC Controller is moved to KubeRay.

Verification steps

Deploy the CFO without KubeRay and Ray CRDs:

podman build -t quay.io/<quayusername>/codeflare-operator:<tagname> .
podman push quay.io/<quayusername>/codeflare-operator:<tagname>
make deploy IMG=quay.io/<quayusername>/codeflare-operator:<tagname>

The CFO pod fails to initialise and keeps restarting until the required CRDs are available in the cluster. Deploying KubeRay installs the required CRDs, after which the CFO pod should start.

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- Testing is not required for this change

dimakis · 2024-04-12T16:23:33Z

Dockerfile

+    microdnf install tar -y \
+    && microdnf clean all && \
+    rm -rf /var/cache/yum
+ADD https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz /tmp/openshift-client-linux.tar.gz


use this instead please https://gitlab.cee.redhat.com/data-hub/rhods-cpaas-midstream/-/blob/rhods-1.34-rhel-8/distgit/containers/odh-deployer/Dockerfile.in?ref_type=heads

I'm trying to pull registry.redhat.io/openshift4/ose-cli:v4.13 but seem to get a permission error from registry.redhat.io. Looking into it.

zdtsw · 2024-04-13T17:02:03Z

config/manager/manager.yaml

+            echo "Checking for $crd"
+            until oc get crd $crd; do
+              echo "$crd not available yet, retrying in 10 seconds..."
+              sleep 10


What happens if the CRD is not installed at all, rather than just unavailable for a short period?
Won't this cause the initContainer to never stop?

That would be the current behaviour. The CFO shouldn't start unless those CRDs are already installed, so the initContainer keeps checking for them until they are present.

Should I have a timeout for it? In case of timing out, I suppose the only way of 're-activating' the CFO would be to restart the pod.

So if the bash script times out, it should exit non-zero, which by default causes a Pod restart without starting the "real" container.
I think this can be easily verified in the cluster.

zdtsw · 2024-04-13T17:09:25Z

I must have missed something again:
you are injecting the "oc" (or kubectl) binary into the CFO operator image itself, and starting an initContainer with exactly the same image, for the sole purpose of checking whether these 3 CRDs have been installed in the cluster?

But why not:

- find a dedicated image to use as the initContainer (e.g. the one dimakis pointed out?); all you need is the "oc" binary
- do not touch the existing Dockerfile
- no need to use a var in kustomize; just hardcode the image with its digest in the CFO deployment

ChristianZaccaria · 2024-04-15T08:23:59Z

Hi @zdtsw, completely agreed, that makes more sense to me. Will fix now, thanks for the advice and insights!

astefanutti

Is this solution compatible with the webhooks that are being introduced in #507 and #508?

astefanutti

An alternative approach would be to set up a watcher in the CFO main if the CRDs are not present, which would watch for their installation and start the controller and webhooks on that event.

astefanutti · 2024-04-18T14:28:20Z

main.go

+		exitOnError(err, "unable to create apiextensionsClient")
+	}
+	crdName := "rayclusters.ray.io"
+	if err := checkCRDAvailability(apiextensionsClient, crdName); err != nil {


How is this different from the call to hasAPIResourceForGVK that's made a bit later?

TBH I only noticed that new function just now after rebasing. The main difference is that with the Kubernetes API Extensions client we can directly check for the required CRD, which is more focused on the scope of this issue.

That said, when running the CFO from the main branch with the hasAPIResourceForGVK approach and without the rayclusters CRD installed, the initialisation of the CFO pod doesn't fail, for some reason. Investigating...

Functionally these two methods do very much the same thing. It doesn't fail on main only because that is what we wanted: at the moment we simply do not start the RayCluster controller.

hasAPIResourceForGVK has the advantage that it doesn't require permission to read CRDs.
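To illustrate why a discovery-based check needs no CRD read permission, here is a simplified sketch. It is not the actual hasAPIResourceForGVK implementation: `apiResource`, `apiResourceList` and `hasKind` are hypothetical stand-ins that loosely mirror the shape of an API discovery response, and the `ray.io/v1` group/version is an assumption. The check only inspects which kinds the API server says it serves, never the CustomResourceDefinition objects themselves.

```go
package main

import "fmt"

// apiResource and apiResourceList are simplified stand-ins for the resource
// lists returned by API discovery (assumption: only the fields used here).
type apiResource struct {
	Name string
	Kind string
}

type apiResourceList struct {
	GroupVersion string
	APIResources []apiResource
}

// hasKind reports whether the discovered lists include kind under groupVersion.
func hasKind(lists []apiResourceList, groupVersion, kind string) bool {
	for _, l := range lists {
		if l.GroupVersion != groupVersion {
			continue
		}
		for _, r := range l.APIResources {
			if r.Kind == kind {
				return true
			}
		}
	}
	return false
}

func main() {
	discovered := []apiResourceList{
		{GroupVersion: "apps/v1", APIResources: []apiResource{{Name: "deployments", Kind: "Deployment"}}},
		{GroupVersion: "ray.io/v1", APIResources: []apiResource{{Name: "rayclusters", Kind: "RayCluster"}}},
	}
	fmt.Println(hasKind(discovered, "ray.io/v1", "RayCluster"))     // KubeRay installed
	fmt.Println(hasKind(discovered[:1], "ray.io/v1", "RayCluster")) // KubeRay missing
}
```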

That's true, a good advantage. One question, in case I'm mistaken: do we really want to start the CFO anyway without the RC Controller? Even after the CRD is applied, the CFO won't attempt to start the controller, so we would remain in the same scenario where the user needs to restart the pod. Moreover, I couldn't find any error messages in the CFO logs to let the user know.

The idea was that there may be other controllers orthogonal to KubeRay, like in #491. But it's true that, at the moment, we could exit in that case.

I see, thanks for clarifying on that, I wasn't sure. I refactored the logic to exit and not initialise the pod unless the CRD is present. Let me know what you think. I know this is a temporary change too until the RC Controller is moved to KubeRay.

Current behaviour: if the RayCluster CRD is not available, the CFO fails to start and keeps restarting until the CRD is installed, displaying error logs in the CFO pod. Once the RayCluster CRD is installed, the CFO manager and RC Controller start.

openshift-ci · 2024-04-19T11:32:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from christianzaccaria. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ChristianZaccaria · 2024-04-19T11:35:06Z

Note: rebased

openshift-merge-robot · 2024-04-27T16:50:36Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ChristianZaccaria · 2024-04-29T06:58:21Z

Closing in favour of #546

openshift-ci bot requested review from anishasthana and jbusche April 12, 2024 09:24

ChristianZaccaria force-pushed the initcontainer-crds branch from c4ad870 to 6685254 Compare April 12, 2024 13:03

openshift-merge-robot added the needs-rebase label Apr 12, 2024

ChristianZaccaria force-pushed the initcontainer-crds branch from 5e5a3dd to 26dc308 Compare April 12, 2024 15:30

openshift-merge-robot removed the needs-rebase label Apr 12, 2024

ChristianZaccaria force-pushed the initcontainer-crds branch 2 times, most recently from 2b2f97a to c593124 Compare April 12, 2024 15:55

dimakis reviewed Apr 12, 2024

zdtsw reviewed Apr 13, 2024

astefanutti reviewed Apr 16, 2024

ChristianZaccaria added the do-not-merge/work-in-progress label Apr 17, 2024

ChristianZaccaria force-pushed the initcontainer-crds branch from c593124 to de9e8bd Compare April 18, 2024 14:24

ChristianZaccaria changed the title ~~Add initContainer to check for required CRDs availability~~ Use Kubernetes API Extensions client to verify raycluster CRD availability Apr 18, 2024

openshift-ci bot removed the do-not-merge/work-in-progress label Apr 18, 2024

ChristianZaccaria changed the title ~~Use Kubernetes API Extensions client to verify raycluster CRD availability~~ Use Kubernetes API Extensions client to verify CRD availability Apr 18, 2024

astefanutti reviewed Apr 18, 2024

ChristianZaccaria force-pushed the initcontainer-crds branch 3 times, most recently from 8b5f69c to d16ea4f Compare April 18, 2024 15:44

ChristianZaccaria added the do-not-merge/hold label Apr 18, 2024

ChristianZaccaria force-pushed the initcontainer-crds branch from d16ea4f to 4bdaf29 Compare April 18, 2024 16:13

ChristianZaccaria removed the do-not-merge/hold label Apr 18, 2024

ChristianZaccaria changed the title ~~Use Kubernetes API Extensions client to verify CRD availability~~ Exit CFO initialisation if RayCluster CRD is not available Apr 19, 2024

ChristianZaccaria added 2 commits April 19, 2024 12:22

Adjust olm_tests workflow to create raycluster CRD for deployment

7b250c8

Refactor CRD check to exit if RayCluster CRD not available

8035a3b

ChristianZaccaria force-pushed the initcontainer-crds branch from 4bdaf29 to 8035a3b Compare April 19, 2024 11:32

openshift-merge-robot added the needs-rebase label Apr 27, 2024

ChristianZaccaria closed this Apr 29, 2024
