Helpful SRE Information on CodeFlare Stack
- Replacing Images in MCAD operator or InstaScale operators
- Changing resources for MCAD operator or InstaScale operator - NOTE, ODH 2.0.0+ only!
- CodeFlare Cleanup steps
- Installation of CodeFlare with ODH 2.0.0
- Testing the CodeFlare components from the ODH
Method to replace the existing MCAD or InstaScale images. (NOTE: Even though this replaces the images, that doesn't mean the newer or older images work with, or have been tested against, the installed CodeFlare stack...)
kubectl edit mcads mcad
or
kubectl edit instascales instascale
and under spec: add something like this for MCAD:
spec:
  controllerImage: quay.io/project-codeflare/mcad-controller:main-v1.30.0
or for InstaScale:
spec:
  controllerImage: quay.io/project-codeflare/instascale-controller:v0.0.4
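To confirm the operator picked up the new image, you can list the controller pod images afterwards (this assumes the controllers run in the opendatahub namespace; adjust as needed):
kubectl get pods -n opendatahub -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'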
Edit the CR for either mcads or instascales like this:
kubectl edit mcads mcad
or
kubectl edit instascales instascale
And then add this under the spec section:
controllerResources:
  limits:
    cpu: "1"
    memory: 1G
  requests:
    cpu: "1"
    memory: 1G
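If you'd rather not use an interactive edit, the same change can likely be applied with a merge patch along these lines (shown for MCAD; an untested sketch):
kubectl patch mcads mcad --type merge -p '{"spec":{"controllerResources":{"limits":{"cpu":"1","memory":"1G"},"requests":{"cpu":"1","memory":"1G"}}}}'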
To completely clean up all the CodeFlare components after an install, follow these steps:
- No appwrappers should be left running:
kubectl get appwrappers -A
If any are left, you'll want to delete them.
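For example, to remove a leftover appwrapper (substituting the name and namespace reported by the command above):
kubectl delete appwrappers <name> -n <namespace>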
- Remove the notebook and notebook PVC:
kubectl delete notebook jupyter-nb-kube-3aadmin -n opendatahub
kubectl delete pvc jupyterhub-nb-kube-3aadmin-pvc -n opendatahub
- Remove the codeflare-stack kfdef:
kubectl delete kfdef codeflare-stack -n opendatahub
- Remove the CodeFlare Operator CSV and Subscription:
kubectl delete sub codeflare-operator -n openshift-operators
kubectl delete csv codeflare-operator.v0.0.6 -n openshift-operators
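The CSV version on your cluster may differ from v0.0.6; you can list the installed version first with something like:
kubectl get csv -n openshift-operators | grep codeflare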
- Remove the CodeFlare CRDs:
kubectl delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com queuejobs.mcad.ibm.com
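As a final check, you can verify that nothing CodeFlare-related is left behind (the exact output will depend on what was installed):
kubectl get crd | grep -E 'codeflare|mcad'
kubectl get pods -n opendatahub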
Installation of CodeFlare with ODH 2.0.0
- Install the "Fast" channel of the ODH operator (gets 2.0.0)
- Install the GA CodeFlare operator (gets v0.0.6)
- Apply the following dsc:
kubectl apply -f - <<EOF
apiVersion: datasciencecluster.opendatahub.io/v1alpha1
kind: DataScienceCluster
metadata:
  name: default
spec:
  components:
    dashboard:
      enabled: true
    datasciencepipelines:
      enabled: false
    distributedWorkloads:
      enabled: true
    kserve:
      enabled: false
    modelmeshserving:
      enabled: false
    workbenches:
      enabled: true
EOF
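Once the DataScienceCluster is applied, you can watch the components come up (assuming they deploy into the opendatahub namespace):
kubectl get datasciencecluster default
kubectl get pods -n opendatahub -w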
- Find the route for the dashboard:
oc get route -n opendatahub
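If you just want the hostname, something like this may work (assuming the route is named odh-dashboard; check the route list above first):
oc get route odh-dashboard -n opendatahub -o jsonpath='{.spec.host}'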
- Open up the dashboard and click on: Data Science Projects --> Launch Jupyter --> Codeflare Notebook --> Start Server
- In a Terminal, clone the codeflare-sdk:
git clone https://github.com/project-codeflare/codeflare-sdk.git
Everything is the same from this point on.
This section describes how to run the ODH automated tests against CodeFlare.
Note: Before you run the tests, you need to have the following in place:
- Logged into your OpenShift cluster via oc login so you can run commands
- ODH operator (right now, ODH 1.8.0)
- CodeFlare Operator (right now, CodeFlare 0.1.0)
- ODH kfdef applied:
oc apply -f https://raw.githubusercontent.com/opendatahub-io/odh-manifests/master/kfdef/odh-core.yaml -n opendatahub
- CodeFlare kfdef applied (it could be either one of the two, depending on what you're intending to test):
oc apply -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-stack-kfdef.yaml -n opendatahub
or
oc apply -f https://raw.githubusercontent.com/opendatahub-io/odh-manifests/master/kfdef/codeflare-stack-kfdef.yaml -n opendatahub
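Before running the tests, you can confirm that both kfdefs were created:
oc get kfdef -n opendatahub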
Step 1. You need to download the peak testing suite:
git clone https://github.com/opendatahub-io/peak
Step 2. Change to that directory:
cd peak
Step 3. Initialize peak:
git submodule update --init
Step 4. Create a file with the branch you want to test. For example, for master you'd do this:
echo opendatahub-kubeflow nil https://github.com/opendatahub-io/odh-manifests.git master > master-list
(The format of this command is to list the repo and then the branch you want to test, so if you wanted to test against a branch different than master you'd do something like this:)
echo opendatahub-kubeflow nil https://github.com/anishasthana/odh-manifests.git dw_0.1.1 > anish-011-list
Step 5. Set up your test so the code is downloaded for you by running setup against the list you created in step 4:
./setup.sh -t master-list
Step 6. Run it, substituting your cluster's kubeadmin password for the one below, like this:
OPENSHIFT_TESTUSER_NAME=kubeadmin OPENSHIFT_TESTUSER_PASS=uz9uh-u2WMS-8L9VS-SrVCR ./run.sh codeflare-stack.sh
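If you don't have the kubeadmin password handy, on clusters created with openshift-install it is typically stored in the install directory, for example:
cat <install-dir>/auth/kubeadmin-password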
Step 7. It'll fire off a notebook and you should see the mnist pod start. You can check where it's at here:
oc get pods -n opendatahub
And you can follow the pod logs here:
oc logs -f mnistjob-cdjnbmll99swpc-0
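The mnistjob pod name suffix is generated per run, so find yours first with something like:
oc get pods -n opendatahub | grep mnistjob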
Example output looks like:
[0]:Validating: 76%|███████▌ | 60/79 [00:02<00:00, 25.84it/s]
[0]:Epoch 3: 100%|██████████| 939/939 [01:02<00:00, 15.07it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957]
[0]:
[0]:Epoch 3: 100%|██████████| 939/939 [01:02<00:00, 15.07it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957][0]:
[0]:Epoch 3: 0%| | 0/939 [00:00<?, ?it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957]
[0]:Epoch 4: 0%| | 0/939 [00:00<?, ?it/s, loss=0.179, v_num=0, val_loss=0.145, val_acc=0.957][0]:
Note: Currently the tests aren't working due to the min_worker, max_worker, and gpu parameters. You can fix this by editing operator-tests/opendatahub-kubeflow/tests/resources/codeflare-stack/mnist_ray_mini.ipynb (e.g. with vi) and changing the line:
"cluster = Cluster(ClusterConfiguration(namespace='opendatahub', name='mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, gpu=0, instascale=False))"
to
"cluster = Cluster(ClusterConfiguration(namespace='opendatahub', name='mnisttest', num_workers=2, min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, num_gpus=0, instascale=False))"