
Add AppWrapper v1beta2 CRD and controllers to Codeflare operator #543


Merged
merged 23 commits into from
May 17, 2024

Conversation

dgrove-oss
Collaborator

@dgrove-oss dgrove-oss commented Apr 22, 2024

Replaces #491.

This assumes/includes #541.

@dgrove-oss
Collaborator Author

rebased to resolve conflict in Dockerfile

@dgrove-oss dgrove-oss force-pushed the appwrapper branch 2 times, most recently from a67dd1f to a940fa0 Compare April 26, 2024 21:01
@dgrove-oss dgrove-oss changed the title Add AppWrapper CRD and controllers to Codeflare operator Add AppWrapper v1beta2 CRD and controllers to Codeflare operator Apr 29, 2024
@dgrove-oss dgrove-oss force-pushed the appwrapper branch 3 times, most recently from 0dd33f7 to 143b507 Compare April 29, 2024 18:23
@Srihari1192 Srihari1192 self-requested a review April 30, 2024 10:08
@Srihari1192
Contributor

@dgrove-oss AppWrapper instance creation is failing with the error below:

Error "failed calling webhook "mappwrapper.kb.io": failed to call webhook: the server could not find the requested resource" for field "undefined".

@dgrove-oss
Collaborator Author

dgrove-oss commented May 6, 2024

I made an additional adjustment to the startup logic so that the AppWrapper webhooks are registered as soon as the certificates are ready, provided the operator config enables AppWrappers.

However, the AppWrapper CRD is still installed unconditionally. If AppWrapper is disabled in the config, this will result in the user getting a cryptic error like:

Error from server (InternalError): error when creating "../appwrapper/samples/wrapped-pod.yaml": Internal error occurred: failed calling webhook "mappwrapper.kb.io": failed to call webhook: the server could not find the requested resource

when they create or edit an AppWrapper.
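
A minimal sketch of the gating described at the top of this comment, assuming certificate readiness is signalled by closing a channel. The names registerWhenReady, certsReady, and register are hypothetical and are not taken from the operator's code; the real implementation may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

// registerWhenReady blocks until certsReady is closed and then calls
// register, but only when AppWrappers are enabled in the (hypothetical)
// operator config. When disabled it returns without registering anything.
func registerWhenReady(certsReady <-chan struct{}, enabled bool, register func() error) error {
	<-certsReady
	if !enabled {
		return nil
	}
	return register()
}

func main() {
	certsReady := make(chan struct{})
	go func() {
		// Stand-in for certificate generation finishing.
		time.Sleep(10 * time.Millisecond)
		close(certsReady)
	}()

	registered := false
	err := registerWhenReady(certsReady, true, func() error {
		registered = true // stand-in for registering the webhooks
		return nil
	})
	fmt.Println(registered, err)
}
```

Closing a channel is used here only because it makes the ordering explicit: nothing is registered before the certificates exist.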

If AppWrappers are enabled in the config, then you should see the following behavior (if Kueue is already installed on your cluster, only step 3 below is relevant).

  1. Initial startup of make deploy ENV=e2e (trimming noise from cert rotation). AppWrappers are enabled, but Kueue is not installed in the cluster
2024-05-06T14:44:36Z	INFO	setup	Build info	{"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T14:44:36Z	INFO	setup	setting up health endpoints
2024-05-06T14:44:36Z	INFO	setup	setting up RayCluster controller
2024-05-06T14:44:36Z	INFO	We detected being on Vanilla Kubernetes!
2024-05-06T14:44:36Z	INFO	setup	setting up AppWrapper components
2024-05-06T14:44:36Z	INFO	setup	Workload API not available; setting up waiter for Workload API availability
2024-05-06T14:44:36Z	INFO	setup	starting manager
2024-05-06T14:44:36Z	INFO	controller-runtime.metrics	Starting metrics server
2024-05-06T14:44:36Z	INFO	controller-runtime.metrics	Serving metrics server	{"bindAddress": ":8080", "secure": false}
2024-05-06T14:44:36Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
2024-05-06T14:44:36Z	INFO	setup	API workloads.kueue.x-k8s.io not available, setting up retry watcher
2024-05-06T14:44:36Z	INFO	setup	API rayclusters.ray.io not available, setting up retry watcher
2024-05-06T14:44:36Z	INFO	Starting workers	{"controller": "cert-rotator", "worker count": 1}
2024-05-06T14:44:38Z	INFO	setup	Setting up AppWrapper webhook
2024-05-06T14:44:38Z	INFO	controller-runtime.builder	Registering a mutating webhook	{"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z	INFO	controller-runtime.webhook	Registering webhook	{"path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z	INFO	controller-runtime.builder	Registering a validating webhook	{"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z	INFO	controller-runtime.webhook	Registering webhook	{"path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:44:38Z	INFO	controller-runtime.webhook	Starting webhook server
2024-05-06T14:44:38Z	INFO	controller-runtime.certwatcher	Updated current TLS certificate
2024-05-06T14:44:38Z	INFO	controller-runtime.webhook	Serving webhook server	{"host": "", "port": 9443}
2024-05-06T14:44:38Z	INFO	controller-runtime.certwatcher	Starting certificate watcher
2024-05-06T14:47:06Z	INFO	admission	Applying defaults	{"webhookGroup": "workload.codeflare.dev", "webhookKind": "AppWrapper", "AppWrapper": {"name":"sample-job","namespace":"default"}, "namespace": "default", "name": "sample-job", "resource": {"group":"workload.codeflare.dev","version":"v1beta2","resource":"appwrappers"}, "user": "kubernetes-admin", "requestID": "53545468-844c-43a0-8bc3-3649a124da80", "job": {"apiVersion": "workload.codeflare.dev/v1beta2", "kind": "AppWrapper", "namespace": "default", "name": "sample-job"}}
2024-05-06T14:47:06Z	INFO	admission	Validating create	{"webhookGroup": "workload.codeflare.dev", "webhookKind": "AppWrapper", "AppWrapper": {"name":"sample-job","namespace":"default"}, "namespace": "default", "name": "sample-job", "resource": {"group":"workload.codeflare.dev","version":"v1beta2","resource":"appwrappers"}, "user": "kubernetes-admin", "requestID": "35624b2e-ebb7-43a1-8989-dc02e826908f", "job": {"apiVersion": "workload.codeflare.dev/v1beta2", "kind": "AppWrapper", "namespace": "default", "name": "sample-job"}}
  2. Run make kueue-e2e to install Kueue; the CodeFlare operator should restart
2024-05-06T14:51:39Z	INFO	setup	API workloads.kueue.x-k8s.io installed, invoking deferred action
2024-05-06T14:51:39Z	INFO	setup	Workload API now available; triggering controller restart
...
2024-05-06T14:51:39Z	INFO	Wait completed, proceeding to shutdown the manager
  3. On restart, AppWrappers should be fully enabled

2024-05-06T14:52:03Z    INFO    setup   Build info      {"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T14:52:03Z    INFO    setup   setting up health endpoints
2024-05-06T14:52:03Z    INFO    setup   setting up RayCluster controller
2024-05-06T14:52:03Z    INFO    We detected being on Vanilla Kubernetes!
2024-05-06T14:52:03Z    INFO    setup   setting up AppWrapper components
2024-05-06T14:52:03Z    INFO    setup   Workload API available; enabling AppWrappers
2024-05-06T14:52:03Z    INFO    setup   Waiting for certificate generation to complete
2024-05-06T14:52:03Z    INFO    setup   starting manager
2024-05-06T14:52:03Z    INFO    controller-runtime.metrics      Starting metrics server
2024-05-06T14:52:03Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-05-06T14:52:03Z    INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-05-06T14:52:03Z    INFO    setup   API rayclusters.ray.io not available, setting up retry watcher
2024-05-06T14:52:03Z    INFO    cert-rotation   starting cert rotator controller
2024-05-06T14:52:03Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-05-06T14:52:03Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-06T14:52:03Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-06T14:52:03Z    INFO    Starting Controller     {"controller": "cert-rotator"}
2024-05-06T14:52:03Z    INFO    cert-rotation   no cert refresh needed
2024-05-06T14:52:03Z    INFO    cert-rotation   certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-06T14:52:03Z    INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2024-05-06T14:52:03Z    INFO    cert-rotation   no cert refresh needed
2024-05-06T14:52:03Z    INFO    cert-rotation   Ensuring CA cert        {"name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "codeflare-operator-validating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-06T14:52:03Z    INFO    cert-rotation   Ensuring CA cert        {"name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration", "name": "codeflare-operator-mutating-webhook-configuration", "gvk": "admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration"}
2024-05-06T14:52:05Z    INFO    cert-rotation   CA certs are injected to webhooks
2024-05-06T14:52:05Z    INFO    setup   Setting up AppWrapper webhook
2024-05-06T14:52:05Z    INFO    setup   Setting up AppWrapper controller
2024-05-06T14:52:05Z    INFO    Starting Controller     {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z    INFO    Starting workers        {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
2024-05-06T14:52:05Z    INFO    Starting EventSource    {"controller": "AppWrapperChildWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: *v1beta2.AppWrapper"}
2024-05-06T14:52:05Z    INFO    controller-runtime.builder      Registering a mutating webhook  {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z    INFO    controller-runtime.webhook      Registering webhook     {"path": "/mutate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z    INFO    controller-runtime.builder      Registering a validating webhook        {"GVK": "workload.codeflare.dev/v1beta2, Kind=AppWrapper", "path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z    INFO    controller-runtime.webhook      Registering webhook     {"path": "/validate-workload-codeflare-dev-v1beta2-appwrapper"}
2024-05-06T14:52:05Z    INFO    Starting EventSource    {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: *v1beta2.AppWrapper"}
2024-05-06T14:52:05Z    INFO    Starting EventSource    {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: *v1beta1.Workload"}
2024-05-06T14:52:05Z    INFO    Starting Controller     {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z    INFO    controller-runtime.webhook      Starting webhook server
2024-05-06T14:52:05Z    INFO    Starting EventSource    {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: *v1beta2.AppWrapper"}
2024-05-06T14:52:05Z    INFO    Starting EventSource    {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: *v1.Pod"}
2024-05-06T14:52:05Z    INFO    Starting Controller     {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2024-05-06T14:52:05Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2024-05-06T14:52:05Z    INFO    controller-runtime.webhook      Serving webhook server  {"host": "", "port": 9443}
2024-05-06T14:52:05Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
2024-05-06T14:52:05Z    INFO    Starting workers        {"controller": "AppWrapperWorkload", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
2024-05-06T14:52:05Z    INFO    Starting workers        {"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}

@dgrove-oss
Collaborator Author

To document the expectation: if AppWrappers are disabled in the config, your log should look like this:

2024-05-06T15:06:36Z	INFO	setup	Build info	{"operatorVersion": "", "appwrapperVersion": "UNKNOWN", "date": "2024-05-06 14:38"}
2024-05-06T15:06:36Z	INFO	setup	setting up health endpoints
2024-05-06T15:06:36Z	INFO	setup	setting up RayCluster controller
2024-05-06T15:06:36Z	INFO	We detected being on Vanilla Kubernetes!
2024-05-06T15:06:36Z	INFO	setup	setting up AppWrapper components
2024-05-06T15:06:36Z	INFO	setup	AppWrappers are disabled by operator configuration
2024-05-06T15:06:36Z	INFO	setup	starting manager
...
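
Taken together, the logs in this thread show three startup outcomes. The sketch below restates that decision as a small function; the type and function names are hypothetical, and only the message strings are copied from the logs.

```go
package main

import "fmt"

// startupMode enumerates the three outcomes seen in the logs.
type startupMode int

const (
	modeDisabled     startupMode = iota // config disables AppWrappers
	modeWebhooksOnly                    // enabled, but Kueue's Workload API is missing
	modeFull                            // enabled and Workload API present
)

// chooseMode picks a mode from the config flag and API availability.
func chooseMode(enabled, workloadAPIAvailable bool) startupMode {
	switch {
	case !enabled:
		return modeDisabled
	case !workloadAPIAvailable:
		return modeWebhooksOnly
	default:
		return modeFull
	}
}

// String returns the setup log message associated with each mode.
func (m startupMode) String() string {
	switch m {
	case modeDisabled:
		return "AppWrappers are disabled by operator configuration"
	case modeWebhooksOnly:
		return "Workload API not available; setting up waiter for Workload API availability"
	default:
		return "Workload API available; enabling AppWrappers"
	}
}

func main() {
	fmt.Println(chooseMode(false, false))
	fmt.Println(chooseMode(true, false))
	fmt.Println(chooseMode(true, true))
}
```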

@dgrove-oss
Collaborator Author

dgrove-oss commented May 6, 2024

I made further adjustments. Now, if AppWrappers are completely disabled by the config, we set up a webhook that generates an error when AppWrappers are created:

Error from server (Forbidden): error when creating "../appwrapper/samples/wrapped-job.yaml": admission webhook "vappwrapper.kb.io" denied the request: AppWrappers disabled by CodeFlare operator configuration
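
A rough sketch of the deny-when-disabled idea, written with the standard library only rather than the controller-runtime webhook types. createValidator and newCreateValidator are hypothetical names, and the enabled-path check is a placeholder; only the denial message is taken from the error above.

```go
package main

import (
	"errors"
	"fmt"
)

// errDisabled carries the denial message quoted in the error above.
var errDisabled = errors.New("AppWrappers disabled by CodeFlare operator configuration")

// createValidator is a hypothetical stand-in for an admission validator.
type createValidator func(name string) error

// newCreateValidator returns a validator that rejects every create when
// AppWrappers are disabled. The enabled branch is a placeholder check,
// not the operator's real validation logic.
func newCreateValidator(enabled bool) createValidator {
	if !enabled {
		return func(string) error { return errDisabled }
	}
	return func(name string) error {
		if name == "" {
			return errors.New("name must not be empty")
		}
		return nil
	}
}

func main() {
	validate := newCreateValidator(false)
	fmt.Println(validate("sample-job"))
}
```

Installing a rejecting webhook instead of no webhook means the user sees an explicit reason for the failure rather than the "could not find the requested resource" error reported earlier in the thread.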

@dgrove-oss
Collaborator Author

I've ported the e2e tests from #491 to this PR as well now.

@astefanutti
Contributor

/lgtm

@astefanutti
Contributor

/approve


openshift-ci bot commented May 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 92e6eb5 into project-codeflare:main May 17, 2024
8 checks passed
@dgrove-oss dgrove-oss deleted the appwrapper branch May 19, 2024 20:28
4 participants