Upgrade from AppWrapper v1beta1 to v1beta2 #491

Closed
wants to merge 26 commits into from

dgrove-oss
Collaborator

Issue link

What changes have been made

Context: This PR supports the migration of the codeflare operator from MCADv1 to Kueue+AppWrapper

Related PRs:

Changes:

  1. Upgrade AppWrapper from v1beta1 to v1beta2
  2. Remove MCADv1 controller
  3. Add AppWrapper controller
  4. Remove InstaScale controller
  5. Update build, test, and CI accordingly

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot requested review from Fiona-Waters and sutaakar March 19, 2024 00:46

openshift-ci bot commented Mar 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

@dgrove-oss
Collaborator Author

Don't bother reviewing this yet. It will take a few iterations to get in synch with the SDK PR.

@dgrove-oss dgrove-oss force-pushed the mcadv1b2 branch 9 times, most recently from 06fe334 to ce5ae4b Compare March 21, 2024 20:43
@dgrove-oss
Collaborator Author

dgrove-oss commented Mar 22, 2024

This PR is now ready for review.

The Upgrade test is failing but, assuming I understand what the test is doing, I think that is expected and unavoidable. We explicitly decided not to support upgrading from AppWrapper v1beta1 to v1beta2. We can't just patch the controller container image in place because more than the image has changed: the CRDs and the RBAC rules are different too.

@dgrove-oss dgrove-oss force-pushed the mcadv1b2 branch 7 times, most recently from 861080e to 2d56a93 Compare March 26, 2024 00:59
@dgrove-oss
Collaborator Author

PR ready for review.

The PR effectively disables the OLM Install and Upgrade in a second commit to get us through the transition. Once the changes have flowed all the way through the system and there is a codeflare operator release that doesn't contain the MCAD/InstaScale controllers, we could adjust and re-enable them.

Collaborator

@KPostOffice KPostOffice left a comment

Looks good! A few questions.

Comment on lines +2 to +8
#FROM registry.access.redhat.com/ubi8/go-toolset:1.20.10 as builder

# BEGIN -- workaround lack of go-toolset for golang 1.21

# FROM registry-proxy.engineering.redhat.com/rh-osbs/openshift-golang-builder:v1.21 AS golang
FROM golang:1.21 AS golang
FROM registry.access.redhat.com/ubi8/ubi:8.8 AS builder
Collaborator

@dimakis Will this be okay in our build pipelines?

Collaborator Author

@dgrove-oss dgrove-oss Mar 28, 2024

This was based on a Dockerfile I got from @astefanutti for Kueue, but his file used a Red Hat internal image AS golang that I didn't have access to. I used the default golang:1.21 to get unblocked, but it's almost certainly not the right one to use in the long run.

Contributor

No, we won't get away with that. We'll need to go back to a RH registry.

Collaborator Author

Is there a publicly accessible RH registry that has a usable Go 1.21 toolchain? I couldn't find one. If there is one, I'd be happy to change the Dockerfile.

Contributor

That's fine for upstream, I think. This has to be overwritten anyway downstream.

@dgrove-oss dgrove-oss force-pushed the mcadv1b2 branch 5 times, most recently from 8e82964 to 9986552 Compare March 29, 2024 18:15
@dgrove-oss
Collaborator Author

dgrove-oss commented Mar 29, 2024

@KPostOffice -- I've added the webhook and full Kueue support. The PyTorch e2e test is passing, but the Ray e2e test is failing. I'm digging into it, but I don't expect the contents of this PR to change very much as a result. If you have time to look at the changes in 2537d07 and 9986552, any feedback would be useful. At first glance, the Ray e2e test failure looks like a bug in Kueue's generic reconciler loop, but I have to dig deeper to be sure.

@dgrove-oss
Collaborator Author

dgrove-oss commented Mar 29, 2024

So that we aren't blocked on merging the PR, I've added a new e2e RayCluster test that bypasses AppWrappers entirely and submits the RayCluster directly to Kueue.

@openshift-merge-robot
Collaborator

PR needs rebase.

dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request Apr 19, 2024
@dgrove-oss dgrove-oss mentioned this pull request Apr 19, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request Apr 22, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request Apr 26, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request Apr 26, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request Apr 26, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request Apr 26, 2024
@dgrove-oss
Collaborator Author

Replaced by #543.

@dgrove-oss dgrove-oss closed this May 2, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request May 6, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request May 10, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request May 14, 2024
dgrove-oss added a commit to dgrove-oss/codeflare-operator that referenced this pull request May 14, 2024
@dgrove-oss dgrove-oss deleted the mcadv1b2 branch May 19, 2024 20:28

5 participants