Possible MCAD issues on OC 4.13 #476

Closed
jbusche opened this issue Jul 13, 2023 · 8 comments

@jbusche
Contributor

jbusche commented Jul 13, 2023

@asm582 was installing MCAD on an AWS OC 4.13 cluster and was seeing security errors.
https://project-codeflare.slack.com/archives/C04L7QH4Q84/p1688587042248109?thread_ts=1688585995.782689&cid=C04L7QH4Q84

@Maxusmusti mentioned that MCAD was working fine on his OC 4.13 cluster. I tried it on a Fyre cluster, and the CodeFlare stack applied and installed fine for me. (However, I am seeing trouble with Portworx, a third-party storage solution that I use for the CodeFlare notebook PVC.)

@tedhtchang is going to provision an IBM Cloud OC 4.13 cluster (which has its own IBM Gold storage), and we'll try to confirm there that the CodeFlare stack works on OC 4.13.

@asm582
Member

asm582 commented Jul 13, 2023

Do we know which images we are using for installs? For running tests, we should be using images built from the main branch.

@jbusche
Contributor Author

jbusche commented Jul 13, 2023

CodeFlare Operator 0.0.6 with the kfdef deploys release-v1.32.0 of MCAD:

oc describe pod mcad-controller-mcad-fb8d5b7d8-kkqhl |grep Image:
    Image:         quay.io/project-codeflare/mcad-controller:release-v1.32.0
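
To answer the image question for a pod without scanning the full description by eye, a small filter along these lines might help. It is only a sketch that assumes the `oc describe pod` output format shown above; the helper name `extract_images` is hypothetical and not part of `oc`.

```shell
# Hypothetical helper: read `oc describe pod` output on stdin and print
# each distinct container image reference, one per line.
extract_images() {
  # Keep the second field of lines whose first field is exactly "Image:".
  # "Image ID:" lines are skipped because their first field is "Image".
  awk '$1 == "Image:" { print $2 }' | sort -u
}

# Usage (assumes you are logged in and the pod is in the current namespace):
#   oc describe pod mcad-controller-mcad-fb8d5b7d8-kkqhl | extract_images
```

Matching on the exact field rather than grepping for `Image:` keeps the output to bare image references, which is easier to compare against the tag you expect.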

@asm582
Member

asm582 commented Jul 13, 2023

@jbusche For tests we should be using images compiled from the main branch. FYI here is one image that was compiled:

quay.io/asmalvan/quota-mgmt-0712

@jbusche
Contributor Author

jbusche commented Jul 13, 2023

@tedhtchang created an IBM Cloud OpenShift 4.13 cluster, and I've installed the ODH 1.7.0 and CodeFlare 0.0.6 stacks on top of it. There isn't enough capacity to run the Ray batch job, but I was able to start up a tiny batch_mnist_mcad job and it's running fine. I don't see any errors in the MCAD operator.

@asm582 This issue was created because there was some concern that the CodeFlare 0.0.6 stack wouldn't run properly on OC 4.13. But Mustafa hasn't seen any problem on his cluster, and on IBM Cloud it looks good to me.

Testing the newly built quota-management images using the helm install method would be a different issue. Yesterday, I was not able to make either quota-management image run on OC 4.11, OC 4.12 or OC 4.13. If you have an issue for it I can post the errors I'm seeing.

@asm582
Member

asm582 commented Jul 13, 2023

I see. I don't have an issue for that. The task requirement is to test MCAD with images built from the main branch. I am running the image quay.io/asmalvan/quota-mgmt-0712, compiled from main, and it runs fine on OpenShift 4.11.42.

@z103cb
Contributor

z103cb commented Jul 18, 2023

@jbusche For your test, would it be possible to dump the definition of the AppWrapper CRD? I am looking to see whether these lines exist:

    subresources:
      status: {}
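
One way to run that check from the command line is sketched below. It assumes the CRD is named `appwrappers.mcad.ibm.com` (inferred from the attachment names in this thread), and the helper `has_status_subresource` is hypothetical. It only detects a `status:` key on the line directly after `subresources:`, which matches the two lines quoted above but would miss a CRD that lists another subresource first.

```shell
# Hypothetical helper: read a CRD definition (YAML) on stdin and succeed
# only if a "subresources:" line is immediately followed by a "status:" line.
has_status_subresource() {
  awk '
    prev_is_subresources && $1 == "status:" { found = 1 }
    { prev_is_subresources = ($1 == "subresources:") }
    END { exit !found }
  '
}

# Usage (assumes you are logged in to the cluster; the CRD name is an assumption):
#   oc get crd appwrappers.mcad.ibm.com -o yaml | has_status_subresource \
#     && echo "status subresource present" \
#     || echo "status subresource missing"
```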

@jbusche
Contributor Author

jbusche commented Jul 18, 2023

I don't see this in the GA release of CodeFlare (0.0.6, which ships the mcad-controller:release-v1.32.0 version of MCAD and gets installed with the kfdef deployment):
GA-appwrappers.mcad.ibm.com.txt

However, I used @asm582's PR 475 and installed using the following commands:

helm upgrade --install mcad-controller . \
  --namespace kube-system --wait \
  --set image.repository=quay.io/asmalvan/quota-mgmt-0712 \
  --set image.tag=latest \
  --set configMap.name=mcad-controller-configmap \
  --set configMap.podCreationTimeout='"120000"' \
  --set coscheduler.rbac.apiGroup=scheduling.sigs.k8s.io \
  --set coscheduler.rbac.resource=podgroups \
  --set loglevel=10 \
  --set resources.limits.cpu=1500m \
  --set resources.requests.cpu=1160m \
  --set configMap.quotaEnabled='"true"' \
  --set configMap.preemptionEnabled='"true"'

and

oc apply -f multi-cluster-app-dispatcher/test/e2e-kuttl/install-quota-subtree.yaml

Then I see the subresources entry:
quota-mgmt-0712-appwrappers.mcad.ibm.com.txt

MichaelClifford moved this from In Progress to Done in Project CodeFlare Sprint Board on Jul 20, 2023

@asm582
Member

asm582 commented Aug 24, 2023

@jbusche I am closing this issue because we do not see the above problem. Please feel free to reopen it if the issue persists.

asm582 closed this as completed on Aug 24, 2023