Skip to content

add raycluster controller to CFO #453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

KPostOffice
Copy link
Collaborator

@KPostOffice KPostOffice commented Jan 25, 2024

Issue link

https://issues.redhat.com/browse/RHOAIENG-1992

What changes have been made

Added a controller to CFO operator which does a series of ServerSideApplies in order to create the requisite k8 resources for securing the RayDashboard

Verification steps

This PR does not implement the webhook required for automatically injecting the oauth sidecar into the RayCluster definition so that must be done by the user at this point.

TODO Provide a working RayCluster definition for validating the authenticated dashboard

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@anishasthana
Copy link
Contributor

/retest

@KPostOffice
Copy link
Collaborator Author

I tested and made sure that all the expected resources are created. Still need to write unit tests and add a RayCluster def which with a OAuth sidecar that will work with the created resources

@KPostOffice
Copy link
Collaborator Author

KPostOffice commented Feb 7, 2024

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: '1.0'
  annotations:
    codeflare.dev/oauth: 'true'
  name: raytest
  namespace: default
spec:
  enableInTreeAutoscaling: false
  headGroupSpec:
    rayStartParams:
      block: 'true'
      dashboard-host: 0.0.0.0
      num-gpus: '0'
    serviceType: ClusterIP
    template:
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: quay.io/project-codeflare/ray:latest-py39-cu118
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-head
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: 1
              memory: 8G
              nvidia.com/gpu: 0
            requests:
              cpu: 1
              memory: 8G
              nvidia.com/gpu: 0
        - args:
          - --https-address=:8443
          - --provider=openshift
          - --openshift-service-account=raytest
          - --upstream=http://localhost:8265
          - --tls-cert=/etc/tls/private/tls.crt
          - --tls-key=/etc/tls/private/tls.key
          - --cookie-secret=$(COOKIE_SECRET)
          - --openshift-delegate-urls={"/":{"resource":"pods","namespace":"default","verb":"get"}}
          image: registry.redhat.io/openshift4/ose-oauth-proxy@sha256:1ea6a01bf3e63cdcf125c6064cbd4a4a270deaf0f157b3eabb78f60556840366
          name: oauth-proxy
          ports:
          - containerPort: 8443
            name: oauth-proxy
          resources: {}
          volumeMounts:
          - mountPath: /etc/tls/private
            name: proxy-tls-secret
            readOnly: true
          env:
          - name: COOKIE_SECRET
            valueFrom:
              secretKeyRef:
                name: raytest-oauth-config
                key: cookie_secret
        imagePullSecrets: []
        serviceAccount: raytest
        volumes:
        - name: proxy-tls-secret
          secret:
            secretName: raytest-tls
  rayVersion: 2.7.0
  workerGroupSpecs:
  - groupName: small-group-raytest
    maxReplicas: 1
    minReplicas: 1
    rayStartParams:
      block: 'true'
      num-gpus: '0'
    replicas: 1
    template:
      metadata:
        annotations:
          key: value
        labels:
          key: value
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: quay.io/project-codeflare/ray:latest-py39-cu118
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: machine-learning
          resources:
            limits:
              cpu: 1
              memory: 4G
              nvidia.com/gpu: 0
            requests:
              cpu: 1
              memory: 4G
              nvidia.com/gpu: 0
        imagePullSecrets: []

Here is a well formatted RayCluster that can be used to test these changes. Run the controller using:

$ NAMESPACE=foobar make run

Then apply the yaml above. There should be a route in the same namespace on the cluster for the dashboard that is protected by an OAuth Proxy

@KPostOffice
Copy link
Collaborator Author

/hold

Still need to edit CI/makefile so that the new Envtest tests will run

Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two points:

  • Do we want to leverage static sidecar injection in the Kuberay deployment?
  • Could we have a e2e test that run on OpenShift?

@sutaakar FYI.

@@ -0,0 +1,14234 @@
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need it for testing and I couldn't get envtest to work with remote files. I'm thinking of changing it so that it downloads them and cleans them up as part of the BeforeSuite and AfterSuite. It really only needs the route and the raycluster. Do we need to have the Ray CRDs applied as well before starting the controller?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I missed that envtest cannot use Kustomize reference mechanism.

For the dependency on the RayCluster API and the KubeRay CRD, I have some ideas that I need to formalise, but in the short term, what we can do is only start the controller when the API is present in the cluster.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can cause race conditions when both components are enabled. If CFO starts before KubeRay, then this Reconciler will never start despite the API being present soon after

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but on the other hand, the CFO will never work if KubeRay is not installed. I don't think there is an ideal solution to this problem in the short term. We can document the CFO has to be restarted after KubeRay is installed, until we rework the deployment strategy.

My current thinking is that this controller should be deployed along side KubeRay, e.g. in a sidecar container.

Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this with Christians work on the SDK works as expected

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 21, 2024
@KPostOffice
Copy link
Collaborator Author

I changed adding the taint to be info level in the case of an error since errors are not unexpected due to the two controllers managing the resource

@KPostOffice
Copy link
Collaborator Author

/unhold

Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 25, 2024
@anishasthana anishasthana dismissed stale reviews from Bobbins228 and ChristianZaccaria March 29, 2024 17:27

@Bobbins228 has lgtm'd it.

@openshift-ci openshift-ci bot removed the lgtm label Apr 2, 2024
@@ -6,4 +10,10 @@ plugins:
scorecard.sdk.operatorframework.io/v2: {}
projectName: codeflare-operator
repo: github.com/project-codeflare/codeflare-operator
resources:
- controller: true
domain: codeflare.dev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a nit, but I'm not sure that's the right domain here if it's needed.

} else if err != nil {
return err
}
return r.Client.Delete(ctx, obj, &deleteOptions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it simpler to ignore not found error on delete?

@@ -36,6 +36,8 @@ type CodeFlareOperatorConfiguration struct {

// The InstaScale controller configuration
InstaScale *InstaScaleConfiguration `json:"instascale,omitempty"`

RayClusterOAuth *bool `json:"rayClusterOAuth,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to structure the configuration a bit, in a dedicated struct.

Comment on lines 63 to 64
strTrue = "true"
strFalse = "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can be removed?

const (
requeueTime = 10
controllerName = "codeflare-raycluster-controller"
oauthAnnotation = "codeflare.dev/oauth=true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can be removed?

requeueTime = 10
controllerName = "codeflare-raycluster-controller"
oauthAnnotation = "codeflare.dev/oauth=true"
CodeflareOAuthFinalizer = "codeflare.dev/oauth-finalizer"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I suggest something like openshift.ai/oauth-proxy? In case we move the controller outside codeflare.

@@ -37,6 +37,10 @@ type CodeFlareOperatorConfiguration struct {
// The InstaScale controller configuration
InstaScale *InstaScaleConfiguration `json:"instascale,omitempty"`

Authorization *AuthorizationConfiguration `json:"authorization,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I should have been more specific. Here for the first level it's about controller, so that'd be KubeRay.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, got it. Will fix

Authorization *AuthorizationConfiguration `json:"authorization,omitempty"`
}

type AuthorizationConfiguration struct {
RayClusterOAuth *bool `json:"rayClusterOAuth,omitempty"`
Copy link
Contributor

@astefanutti astefanutti Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be RayClusterDashboardOAuthEnabled as other endpoints might require different configuration.

@@ -36,6 +36,12 @@ type CodeFlareOperatorConfiguration struct {

// The InstaScale controller configuration
InstaScale *InstaScaleConfiguration `json:"instascale,omitempty"`

KubeRay *KubeRayConfiguration `json:"authorization,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json:"authorization,omitempty" -> json:"kuberay,omitempty"

}

type KubeRayConfiguration struct {
RayDashboardOAuthEnabled *bool `json:"rayClusterOAuth,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json:"rayClusterOAuth,omitempty" -> json:"rayClusterOAuthEnabled,omitempty"

@astefanutti
Copy link
Contributor

/lgtm

@astefanutti
Copy link
Contributor

LGTM, we may want to be more fine-grained to enable the creation of Route without OAuth, but that can be done later.

@astefanutti
Copy link
Contributor

/approve

Copy link

openshift-ci bot commented Apr 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Apr 3, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 9ea041f into project-codeflare:main Apr 3, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants