Exit CFO initialisation if RayCluster CRD is not available #512
Dockerfile
Outdated
microdnf install tar -y \
&& microdnf clean all && \
rm -rf /var/cache/yum
ADD https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz /tmp/openshift-client-linux.tar.gz
I'm trying to pull registry.redhat.io/openshift4/ose-cli:v4.13 but seem to get a permission error from registry.redhat.io. Looking into it.
config/manager/manager.yaml
Outdated
echo "Checking for $crd"
until oc get crd $crd; do
echo "$crd not available yet, retrying in 10 seconds..."
sleep 10
What happens if the CRD is not installed at all, rather than just unavailable for a short period?
Will this cause the initContainer to never stop?
That would be the current behaviour. The CFO shouldn't start without those CRDs previously installed, hence, the initContainer will continue to search for them until they are present.
Should I have a timeout for it? In case of timing out, I suppose the only way of 're-activating' the CFO would be to restart the pod.
So if the bash script times out, it should exit non-zero, which by default will cause the Pod to restart without starting the "real" container.
I think this can easily be verified in the cluster.
I must have missed something again, but why not:
Hi @zdtsw, completely agreed, that makes more sense to me. Will fix now, thanks for the advice and insights!
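The timeout behaviour discussed in this thread could look roughly like the following when written in Go rather than bash. This is only a sketch under assumptions: `waitForCRD`, `errTimeout`, and the `check` callback are hypothetical names standing in for the `oc get crd` call, and the intervals are shortened for illustration. The point it shows is that a bounded retry loop returns an error that the caller turns into a non-zero exit, so the Pod restarts instead of waiting forever.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// errTimeout is a hypothetical sentinel returned when the CRD never shows up.
var errTimeout = errors.New("timed out waiting for CRD")

// waitForCRD polls check every interval until it reports true,
// or returns errTimeout once another sleep would pass the deadline.
func waitForCRD(check func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Assumption: a real check would query the cluster for rayclusters.ray.io.
	// Here it simply succeeds on the third poll.
	check := func() bool {
		attempts++
		return attempts >= 3
	}
	if err := waitForCRD(check, 10*time.Millisecond, time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // non-zero exit: the init container fails and the Pod is restarted
	}
	fmt.Println("CRD available after", attempts, "attempts")
}
```

With a timeout in place, a permanently missing CRD surfaces as a failing init container rather than one that never finishes.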
An alternative approach would be to set up a watcher in the CFO main if the CRDs are not present. It would watch for their installation and start the controller and webhooks on that event.
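A rough sketch of that alternative, assuming the shape of the idea rather than any code from this PR: the process starts normally, and a background goroutine starts the controller once the CRD appears. Polling is used here as a stand-in for a real watch, and `startWhenAvailable`, `isAvailable`, and `start` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// startWhenAvailable checks isAvailable in the background and calls start
// exactly once when it first reports true. It gives up when stop is closed.
// The returned channel is closed when the goroutine finishes either way.
func startWhenAvailable(stop <-chan struct{}, isAvailable func() bool, start func(), interval time.Duration) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			if isAvailable() {
				start()
				return
			}
			select {
			case <-stop:
				return
			case <-ticker.C:
				// poll again on the next loop iteration
			}
		}
	}()
	return done
}

func main() {
	polls := 0
	stop := make(chan struct{})
	// Assumption: the CRD becomes visible on the third poll.
	available := func() bool {
		polls++
		return polls >= 3
	}
	start := func() { fmt.Println("starting RayCluster controller and webhooks") }

	<-startWhenAvailable(stop, available, start, 10*time.Millisecond)
	fmt.Println("polls:", polls)
}
```

Compared with exiting at startup, this keeps the operator running and avoids a manual pod restart, at the cost of extra lifecycle logic in main.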
main.go
Outdated
exitOnError(err, "unable to create apiextensionsClient")
}
crdName := "rayclusters.ray.io"
if err := checkCRDAvailability(apiextensionsClient, crdName); err != nil {
How is this different from the call to hasAPIResourceForGVK that's made a bit later?
TBH, I only noticed that new function just now, after rebasing. The main difference is that with the Kubernetes API Extensions client we can check directly for the required CRD, which is more focused on the scope of this issue.
That said, when running the CFO from the main branch with the hasAPIResourceForGVK approach, the initialisation of the CFO pod for some reason doesn't fail when the rayclusters CRD is not installed. Investigating...
Functionally, these two methods do very much the same thing. It doesn't fail on main only because that's what we want: we just do not start the RayCluster controller at the moment.
hasAPIResourceForGVK has the advantage that it doesn't require permission to read CRDs.
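To illustrate the discovery-style check described above, here is a minimal sketch that asks which kinds are served for a group/version instead of reading the CRD object itself. Everything in it is a simplified, hypothetical stand-in: `apiResource`, `resourceLister`, `errNotServed`, and `hasKind` are not the operator's or client-go's types, and the `ray.io/v1` group/version is only an illustrative value.

```go
package main

import (
	"errors"
	"fmt"
)

// apiResource is a hypothetical, simplified entry of a discovery response.
type apiResource struct {
	Kind string
}

// errNotServed is a hypothetical sentinel meaning the group/version is unknown to the server.
var errNotServed = errors.New("group/version not served")

// resourceLister is a hypothetical stand-in for a discovery client call that
// lists the resources served for one group/version.
type resourceLister func(groupVersion string) ([]apiResource, error)

// hasKind reports whether kind is served under groupVersion.
// An unknown group/version is treated as "not present" rather than as an error.
func hasKind(list resourceLister, groupVersion, kind string) (bool, error) {
	resources, err := list(groupVersion)
	if errors.Is(err, errNotServed) {
		return false, nil
	}
	if err != nil {
		return false, fmt.Errorf("listing resources for %s: %w", groupVersion, err)
	}
	for _, r := range resources {
		if r.Kind == kind {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Fake server: only ray.io/v1 is served, and it exposes RayCluster.
	served := resourceLister(func(gv string) ([]apiResource, error) {
		if gv == "ray.io/v1" {
			return []apiResource{{Kind: "RayCluster"}}, nil
		}
		return nil, errNotServed
	})

	ok, err := hasKind(served, "ray.io/v1", "RayCluster")
	fmt.Println(ok, err)
}
```

Because the answer comes from the API server's list of served resources, a check of this shape does not need read access to CustomResourceDefinition objects, which is the advantage mentioned in the comment above.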
That's true, a good advantage. One question, in case I'm mistaken: do we really want to start the CFO anyway without the RC Controller? Even after the CRD is applied, the CFO won't attempt to start the controller, so we would remain in the same scenario where the user needs to restart the pod. Moreover, I couldn't find any error messages in the CFO logs that would let the user know.
The idea was that there may be other controllers orthogonal to KubeRay, like in #491. But it's true at the moment we could exit in that case.
I see, thanks for clarifying on that, I wasn't sure. I refactored the logic to exit and not initialise the pod unless the CRD is present. Let me know what you think. I know this is a temporary change too until the RC Controller is moved to KubeRay.
Current behaviour: if the RayCluster CRD is not available, the CFO will fail to start and will keep restarting until the CRD is installed. This would display error logs in the CFO pod. Once the RayCluster CRD is installed, the CFO manager and RC Controller will start.
Note: rebased
Closing in favour of #546
Issue link
Jira: https://issues.redhat.com/browse/RHOAIENG-5331
What changes have been made
Note: This could be considered a temporary change until the RC Controller is moved to KubeRay.
Verification steps
Checks