add raycluster controller to CFO #453
/retest
I tested and made sure that all the expected resources are created. I still need to write unit tests and add a RayCluster definition with an OAuth sidecar that works with the created resources.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: '1.0'
  annotations:
    codeflare.dev/oauth: 'true'
  name: raytest
  namespace: default
spec:
  enableInTreeAutoscaling: false
  headGroupSpec:
    rayStartParams:
      block: 'true'
      dashboard-host: 0.0.0.0
      num-gpus: '0'
    serviceType: ClusterIP
    template:
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: quay.io/project-codeflare/ray:latest-py39-cu118
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-head
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: 1
              memory: 8G
              nvidia.com/gpu: 0
            requests:
              cpu: 1
              memory: 8G
              nvidia.com/gpu: 0
        - args:
          - --https-address=:8443
          - --provider=openshift
          - --openshift-service-account=raytest
          - --upstream=http://localhost:8265
          - --tls-cert=/etc/tls/private/tls.crt
          - --tls-key=/etc/tls/private/tls.key
          - --cookie-secret=$(COOKIE_SECRET)
          - --openshift-delegate-urls={"/":{"resource":"pods","namespace":"default","verb":"get"}}
          image: registry.redhat.io/openshift4/ose-oauth-proxy@sha256:1ea6a01bf3e63cdcf125c6064cbd4a4a270deaf0f157b3eabb78f60556840366
          name: oauth-proxy
          ports:
          - containerPort: 8443
            name: oauth-proxy
          resources: {}
          volumeMounts:
          - mountPath: /etc/tls/private
            name: proxy-tls-secret
            readOnly: true
          env:
          - name: COOKIE_SECRET
            valueFrom:
              secretKeyRef:
                name: raytest-oauth-config
                key: cookie_secret
        imagePullSecrets: []
        serviceAccount: raytest
        volumes:
        - name: proxy-tls-secret
          secret:
            secretName: raytest-tls
  rayVersion: 2.7.0
  workerGroupSpecs:
  - groupName: small-group-raytest
    maxReplicas: 1
    minReplicas: 1
    rayStartParams:
      block: 'true'
      num-gpus: '0'
    replicas: 1
    template:
      metadata:
        annotations:
          key: value
        labels:
          key: value
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: quay.io/project-codeflare/ray:latest-py39-cu118
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: machine-learning
          resources:
            limits:
              cpu: 1
              memory: 4G
              nvidia.com/gpu: 0
            requests:
              cpu: 1
              memory: 4G
              nvidia.com/gpu: 0
        imagePullSecrets: []

Here is a well-formatted RayCluster that can be used to test these changes. Run the controller using:
Then apply the YAML above. A route for the dashboard, protected by an OAuth proxy, should appear in the same namespace on the cluster.
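For reference, the oauth-proxy sidecar arguments in the RayCluster above depend only on the cluster name and namespace. Below is a minimal Go sketch of building them. `oauthProxyArgs` and `delegateRule` are hypothetical names used for illustration, not code from this PR, and naming the service account after the cluster is an assumption inferred from the example YAML only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// delegateRule mirrors one entry of the --openshift-delegate-urls JSON value
// from the example RayCluster. Field order matches the example.
type delegateRule struct {
	Resource  string `json:"resource"`
	Namespace string `json:"namespace"`
	Verb      string `json:"verb"`
}

// oauthProxyArgs is a hypothetical helper that rebuilds the sidecar args from
// the example. Assumption: the service account is named after the cluster.
func oauthProxyArgs(clusterName, namespace string) ([]string, error) {
	rules := map[string]delegateRule{
		"/": {Resource: "pods", Namespace: namespace, Verb: "get"},
	}
	delegate, err := json.Marshal(rules)
	if err != nil {
		return nil, err
	}
	return []string{
		"--https-address=:8443",
		"--provider=openshift",
		"--openshift-service-account=" + clusterName,
		"--upstream=http://localhost:8265",
		"--tls-cert=/etc/tls/private/tls.crt",
		"--tls-key=/etc/tls/private/tls.key",
		"--cookie-secret=$(COOKIE_SECRET)",
		"--openshift-delegate-urls=" + string(delegate),
	}, nil
}

func main() {
	args, err := oauthProxyArgs("raytest", "default")
	if err != nil {
		panic(err)
	}
	for _, a := range args {
		fmt.Println(a)
	}
}
```

Generating the delegate-urls value with `encoding/json` rather than string concatenation keeps the quoting correct if the namespace ever contains characters that need escaping.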
/hold Still need to edit the CI/Makefile so that the new envtest tests will run.
Two points:
- Do we want to leverage static sidecar injection in the Kuberay deployment?
- Could we have an e2e test that runs on OpenShift?
@sutaakar FYI.
@@ -0,0 +1,14234 @@
---
Can we use a reference to import these CRDs, as we do for MCAD? https://github.com/project-codeflare/codeflare-operator/blob/1264faabc1835b3aafb5e11fc31c4b1bbb9a1382/config/crd/mcad/kustomization.yaml#L4C6-L4C83
We only need it for testing and I couldn't get envtest to work with remote files. I'm thinking of changing it so that it downloads them and cleans them up as part of the BeforeSuite and AfterSuite. It really only needs the route and the raycluster. Do we need to have the Ray CRDs applied as well before starting the controller?
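The download-then-clean-up idea could look roughly like the sketch below, using only the Go standard library. `downloadCRD` and `demo` are hypothetical names, and the local test server stands in for the real remote CRD URL, which is deliberately not filled in here. In a Ginkgo suite, the download would presumably sit in BeforeSuite and the `os.RemoveAll` in AfterSuite.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// downloadCRD fetches url and writes the response body to dir/name,
// returning the path of the written file.
func downloadCRD(url, dir, name string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}
	path := filepath.Join(dir, name)
	f, err := os.Create(path)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		f.Close()
		return "", err
	}
	return path, f.Close()
}

// demo serves a fake CRD from a local test server, downloads it into a
// temporary directory, reads it back, and removes the directory again.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "kind: CustomResourceDefinition\n")
	}))
	defer srv.Close()

	dir, err := os.MkdirTemp("", "crds")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir) // AfterSuite-style cleanup

	path, err := downloadCRD(srv.URL, dir, "raycluster.yaml")
	if err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	content, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Print(content)
}
```

The temporary directory could then be handed to envtest as a local CRD directory, which sidesteps the remote-file limitation mentioned above.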
Ah, I missed that envtest cannot use the Kustomize reference mechanism.
For the dependency on the RayCluster API and the KubeRay CRD, I have some ideas that I need to formalise, but in the short term, what we can do is only start the controller when the API is present in the cluster.
This can cause race conditions when both components are enabled. If CFO starts before KubeRay, then this Reconciler will never start, even though the API becomes available soon after.
Right, but on the other hand, the CFO will never work if KubeRay is not installed. I don't think there is an ideal solution to this problem in the short term. We can document that the CFO has to be restarted after KubeRay is installed, until we rework the deployment strategy.
My current thinking is that this controller should be deployed alongside KubeRay, e.g. in a sidecar container.
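Purely as an illustration of the race discussed in this thread, and not something this PR does: one alternative to a manual restart would be to poll for the API and start the controller once it shows up. The sketch below is standard-library only. `waitForAPI` and `demo` are hypothetical names, and the `check` callback stands in for a real lookup of the RayCluster API (for instance through the Kubernetes discovery API), which is not shown.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForAPI calls check immediately and then once per interval until it
// reports true, returns an error, or ctx is cancelled.
func waitForAPI(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo simulates an API that becomes available on the third check and
// returns how many checks were needed.
func demo() (int, error) {
	calls := 0
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForAPI(ctx, 10*time.Millisecond, func() (bool, error) {
		calls++
		return calls >= 3, nil
	})
	return calls, err
}

func main() {
	calls, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("API detected after %d checks, the controller could be started now\n", calls)
}
```

Whether polling like this is preferable to documenting the restart, or to deploying the controller alongside KubeRay, is a design question this sketch does not settle.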
Tested this with Christian's work on the SDK; it works as expected.
/lgtm
I changed adding the taint to be
/unhold
Signed-off-by: Kevin <[email protected]>
/lgtm
@Bobbins228 has lgtm'd it.
Signed-off-by: Kevin <[email protected]>
@@ -6,4 +10,10 @@ plugins:
  scorecard.sdk.operatorframework.io/v2: {}
projectName: codeflare-operator
repo: github.com/project-codeflare/codeflare-operator
resources:
- controller: true
  domain: codeflare.dev
Maybe a nit, but I'm not sure that's the right domain here if it's needed.
controllers/raycluster_controller.go
Outdated
	} else if err != nil {
		return err
	}
	return r.Client.Delete(ctx, obj, &deleteOptions)
Isn't it simpler to ignore the not-found error on delete?
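To make the suggestion concrete, here is a standard-library-only sketch of the pattern: delete unconditionally and treat "not found" as success, instead of doing a Get first. `errNotFound`, `ignoreNotFound`, `fakeStore` and `deleteIfExists` are hypothetical stand-ins for illustration. In the real controller, the equivalent would presumably be wrapping the Delete call with controller-runtime's `client.IgnoreNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// ignoreNotFound returns nil for not-found errors and passes every other
// error (including nil) through unchanged.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// fakeStore is a hypothetical in-memory stand-in for the API server.
type fakeStore map[string]struct{}

// Delete removes name from the store, or reports errNotFound if it is absent.
func (s fakeStore) Delete(name string) error {
	if _, ok := s[name]; !ok {
		return fmt.Errorf("delete %q: %w", name, errNotFound)
	}
	delete(s, name)
	return nil
}

// deleteIfExists deletes without a preceding lookup; a missing object is
// not treated as an error, which makes the call idempotent.
func deleteIfExists(s fakeStore, name string) error {
	return ignoreNotFound(s.Delete(name))
}

func main() {
	store := fakeStore{"raytest-tls": {}}
	fmt.Println(deleteIfExists(store, "raytest-tls")) // object existed
	fmt.Println(deleteIfExists(store, "raytest-tls")) // already gone
}
```

Besides being shorter, this avoids the window between the Get and the Delete in which another actor could remove the object.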
pkg/config/config.go
Outdated
@@ -36,6 +36,8 @@ type CodeFlareOperatorConfiguration struct {

	// The InstaScale controller configuration
	InstaScale *InstaScaleConfiguration `json:"instascale,omitempty"`

	RayClusterOAuth *bool `json:"rayClusterOAuth,omitempty"`
I'd suggest structuring the configuration a bit, in a dedicated struct.
controllers/raycluster_controller.go
Outdated
	strTrue  = "true"
	strFalse = "false"
That can be removed?
controllers/raycluster_controller.go
Outdated
const (
	requeueTime     = 10
	controllerName  = "codeflare-raycluster-controller"
	oauthAnnotation = "codeflare.dev/oauth=true"
That can be removed?
controllers/raycluster_controller.go
Outdated
	requeueTime             = 10
	controllerName          = "codeflare-raycluster-controller"
	oauthAnnotation         = "codeflare.dev/oauth=true"
	CodeflareOAuthFinalizer = "codeflare.dev/oauth-finalizer"
May I suggest something like openshift.ai/oauth-proxy? In case we move the controller outside CodeFlare.
pkg/config/config.go
Outdated
@@ -37,6 +37,10 @@ type CodeFlareOperatorConfiguration struct {
	// The InstaScale controller configuration
	InstaScale *InstaScaleConfiguration `json:"instascale,omitempty"`

	Authorization *AuthorizationConfiguration `json:"authorization,omitempty"`
Ah, I should have been more specific. The first level is about the controller, so that would be KubeRay.
Ahh, got it. Will fix
pkg/config/config.go
Outdated
	Authorization *AuthorizationConfiguration `json:"authorization,omitempty"`
}

type AuthorizationConfiguration struct {
	RayClusterOAuth *bool `json:"rayClusterOAuth,omitempty"`
Can this be RayClusterDashboardOAuthEnabled, as other endpoints might require different configuration?
pkg/config/config.go
Outdated
@@ -36,6 +36,12 @@ type CodeFlareOperatorConfiguration struct {

	// The InstaScale controller configuration
	InstaScale *InstaScaleConfiguration `json:"instascale,omitempty"`

	KubeRay *KubeRayConfiguration `json:"authorization,omitempty"`
json:"authorization,omitempty" -> json:"kuberay,omitempty"
pkg/config/config.go
Outdated
}

type KubeRayConfiguration struct {
	RayDashboardOAuthEnabled *bool `json:"rayClusterOAuth,omitempty"`
json:"rayClusterOAuth,omitempty" -> json:"rayClusterOAuthEnabled,omitempty"
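Putting the review suggestions in this thread together, the configuration could end up shaped roughly like the sketch below. The struct and field names and the JSON tags are taken from the comments above. `parseConfig`, `dashboardOAuthEnabled` and the default value passed to it are assumptions for illustration, and the sketch decodes JSON with `encoding/json` only to stay self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KubeRayConfiguration groups the settings for the KubeRay-related controller.
type KubeRayConfiguration struct {
	// A *bool distinguishes "not set" (nil) from an explicit false.
	RayDashboardOAuthEnabled *bool `json:"rayClusterOAuthEnabled,omitempty"`
}

// CodeFlareOperatorConfiguration is trimmed down to the field under review.
type CodeFlareOperatorConfiguration struct {
	KubeRay *KubeRayConfiguration `json:"kuberay,omitempty"`
}

// parseConfig is a hypothetical helper that decodes a JSON configuration.
func parseConfig(raw string) (*CodeFlareOperatorConfiguration, error) {
	cfg := &CodeFlareOperatorConfiguration{}
	if err := json.Unmarshal([]byte(raw), cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

// dashboardOAuthEnabled returns the configured value, or defaultValue when
// the setting is absent. The default itself is an assumption of this sketch.
func dashboardOAuthEnabled(cfg *CodeFlareOperatorConfiguration, defaultValue bool) bool {
	if cfg == nil || cfg.KubeRay == nil || cfg.KubeRay.RayDashboardOAuthEnabled == nil {
		return defaultValue
	}
	return *cfg.KubeRay.RayDashboardOAuthEnabled
}

func main() {
	for _, raw := range []string{`{}`, `{"kuberay":{"rayClusterOAuthEnabled":false}}`} {
		cfg, err := parseConfig(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> enabled=%v\n", raw, dashboardOAuthEnabled(cfg, true))
	}
}
```

Keeping the flag as a pointer means an operator who omits the key gets the default, while an explicit false still disables the feature.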
Signed-off-by: Kevin <[email protected]>
/lgtm
LGTM, we may want to be more fine-grained to enable the creation of Route without OAuth, but that can be done later.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: astefanutti
Merged commit 9ea041f into project-codeflare:main
Issue link
https://issues.redhat.com/browse/RHOAIENG-1992
What changes have been made
Added a controller to the CodeFlare operator (CFO) that performs a series of server-side applies to create the Kubernetes resources required to secure the Ray dashboard.
Verification steps
This PR does not implement the webhook required to automatically inject the OAuth sidecar into the RayCluster definition, so for now the user must add the sidecar manually.
TODO Provide a working RayCluster definition for validating the authenticated dashboard
Checks