Skip to content

Exit CFO initialisation if RayCluster CRD is not available #512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ChristianZaccaria
Copy link
Contributor

@ChristianZaccaria ChristianZaccaria commented Apr 12, 2024

Issue link

Jira: https://issues.redhat.com/browse/RHOAIENG-5331

What changes have been made

  • Refactored to exit early if the RayCluster CRD is not present in the cluster.

Note: This could be considered a temporary change until the RC Controller is moved to KubeRay.

Verification steps

  1. Deploy the CFO without KubeRay and Ray CRDs:
podman build -t quay.io/<quayusername>/codeflare-operator:<tagname> .
podman push quay.io/<quayusername>/codeflare-operator:<tagname>
make deploy IMG=quay.io/<quayusername>/codeflare-operator:<tagname>
  1. The CFO pod will fail to initialise and will restart the pod until the required CRDs are available in the cluster. Deploying KubeRay installs the required CRDs and the CFO pod should start.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Dockerfile Outdated
microdnf install tar -y \
&& microdnf clean all && \
rm -rf /var/cache/yum
ADD https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz /tmp/openshift-client-linux.tar.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to pull registry.redhat.io/openshift4/ose-cli:v4.13 but seem to get a permission error from registry.redhat.io. Looking into it.

Comment on lines 49 to 52
echo "Checking for $crd"
until oc get crd $crd; do
echo "$crd not available yet, retrying in 10 seconds..."
sleep 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens CRD is not installed, not just not available for short period?
this will cause initContainer never stop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the current behaviour. The CFO shouldn't start without those CRDs previously installed, hence, the initContainer will continue to search for them until they are present.

Should I have a timeout for it? In case of timing out, I suppose the only way of 're-activating' the CFO would be to restart the pod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if the bash script timeout, it should exit non-zero, which by default will cause Pod restart without start the "real" container.
i think this can be easily verified in the cluster.

@zdtsw
Copy link
Contributor

zdtsw commented Apr 13, 2024

i must have missed something again:
you are injecting "oc" (or kubectl) binary into the same CFO operator image, and start an initContainer with exactly the same image, for one purpose to check if these 3 CRD has been installed into the cluster?

But, why not:

  • find a dedicated initcontainer images as initContainer (e.g the one dimakis pointed out?) all you need is just the "oc" binary
  • do not touch existing Dockerfile.
  • and no need use var in kustomize but hardcode image with digists in the deployment of the CFO

@ChristianZaccaria
Copy link
Contributor Author

Hi @zdtsw, completely agreed, that makes more sense to me. Will fix now, thanks for the advice and insights!

Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this solution compatible with the webhooks that are being introduced in #507 and #508?

Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative approach would be to setup a watcher in the CFO main if the CRDs are not present, that would watch their installation, and start the controller and webhooks on that event.

@ChristianZaccaria ChristianZaccaria changed the title Add initContainer to check for required CRDs availability Use Kubernetes API Extensions client to verify raycluster CRD availability Apr 18, 2024
@ChristianZaccaria ChristianZaccaria changed the title Use Kubernetes API Extensions client to verify raycluster CRD availability Use Kubernetes API Extensions client to verify CRD availability Apr 18, 2024
main.go Outdated
exitOnError(err, "unable to create apiextensionsClient")
}
crdName := "rayclusters.ray.io"
if err := checkCRDAvailability(apiextensionsClient, crdName); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it different from the call to hasAPIResourceForGVK that's done a bit after.

Copy link
Contributor Author

@ChristianZaccaria ChristianZaccaria Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I only noticed that new function just now after rebasing. - The main difference is that with the Kubernetes API Extensions client we can directly check for the required CRD, which is more focused on the scope of this issue.

With that said, running the CFO from main branch with the hasAPIResourceForGVK approach, for some reason, without the rayclusters CRD installed it doesn't fail the initialisation of the CFO pod. Investigating....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally these two methods does very much the same. It doesn't fail on main only because we want it, we just do not start the RayCluster controller at the moment

hasAPIResoureForGVK has the advantage that it doesn't require permission to read CRDs.

Copy link
Contributor Author

@ChristianZaccaria ChristianZaccaria Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, a good advantage. One question in case I'm mistaken, do we really want to start the CFO anyways without the RC Controller? Even after applying the CRD, the CFO won't attempt to start the controller. We would remain in the same scenario where the user would need to restart the pod. Moreover, I couldn't find any error messages displayed in the CFO logs for the user to know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that there may be other controllers orthogonal to KubeRay, like in #491. But it's true at the moment we could exit in that case.

Copy link
Contributor Author

@ChristianZaccaria ChristianZaccaria Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for clarifying on that, I wasn't sure. I refactored the logic to exit and not initialise the pod unless the CRD is present. Let me know what you think. I know this is a temporary change too until the RC Controller is moved to KubeRay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current behaviour now: if the RayCluster CRD is not available, the CFO will fail to start and attempt to restart until the CRD is installed - This would display error logs in the CFO pod. Once the RayCluster CRD is installed, the CFO manager and RC Controller will start

@ChristianZaccaria ChristianZaccaria force-pushed the initcontainer-crds branch 3 times, most recently from 8b5f69c to d16ea4f Compare April 18, 2024 15:44
@ChristianZaccaria ChristianZaccaria changed the title Use Kubernetes API Extensions client to verify CRD availability Exit CFO initialisation if RayCluster CRD is not available Apr 19, 2024
Copy link

openshift-ci bot commented Apr 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from christianzaccaria. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChristianZaccaria
Copy link
Contributor Author

Note: rebased

@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ChristianZaccaria
Copy link
Contributor Author

Closing in favour of #546

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants