Commit e85318e

first draft of queuing/borrowing demo script
1 parent 5c0081f commit e85318e

5 files changed: +233 -3 lines changed

setup.KubeConEU25/README.md

Lines changed: 122 additions & 3 deletions
@@ -537,12 +537,131 @@ kubectl label servicemonitors.monitoring.coreos.com -n nvidia-GPU-operator nvidi

## Workload Management

-We will now demonstrate the queueing, quota management, and fault recovery
-capabilities of MLBatch using synthetic workloads.
+We will now demonstrate the queuing, quota management, and fault recovery capabilities of MLBatch
+using synthetic workloads.

<details>
For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below,
wrapped in an AppWrapper. Each variation creates multiple pods, each of which requests some number
of GPUs and sleeps for a specified interval before completing successfully.
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: <jobtype>
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: <jobtype>
      spec:
        completions: <number of pods>
        parallelism: <number of pods>
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: <priority class>
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
```
We will use four types of jobs:

| Job Type  | Priority | Duration | Number of Pods | GPU Usage  |
|-----------|----------|----------|----------------|------------|
| short     | normal   | 30s      | 2              | 2 x 4 = 8  |
| normal    | normal   | 600s     | 2              | 2 x 4 = 8  |
| important | high     | 600s     | 2              | 2 x 4 = 8  |
| large     | normal   | 600s     | 4              | 4 x 4 = 16 |
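
The GPU Usage column is the pod count multiplied by the per-pod GPU limit, which is 4 in the template above. A minimal shell sketch of that arithmetic follows; the function name `gpu_usage` is hypothetical and exists only for illustration.

```sh
# Hypothetical helper: total GPUs requested by a job
# = number of pods x GPUs requested per pod.
gpu_usage() {
  pods=$1
  gpus_per_pod=$2
  echo $((pods * gpus_per_pod))
}

# Job types from the table: name and pod count (each pod requests 4 GPUs).
for spec in "short 2" "normal 2" "important 2" "large 4"; do
  set -- $spec
  echo "$1: $(gpu_usage "$2" 4) GPUs"
done
```

Only the large job differs in size; the other three job types request the same 8 GPUs and differ in duration or priority.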

### Queuing

First, Alice will submit a burst of short-running jobs that exceeds
the number of available GPUs in the cluster. The excess jobs will be
suspended by Kueue and admitted in turn as resources become available.

-TODO

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
```
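
The same burst could also be submitted with a loop. This is only a sketch: `submit_burst` is a hypothetical helper, not part of MLBatch, and it assumes that `kubectl` is on the PATH and that the current user is allowed to impersonate `alice`.

```sh
# Hypothetical helper: create the same manifest COUNT times.
# The manifest relies on generateName, so every kubectl create
# call produces a new object with a unique name.
submit_burst() {
  count=$1
  manifest=$2
  namespace=$3
  user=$4
  i=0
  while [ "$i" -lt "$count" ]; do
    kubectl create -f "$manifest" -n "$namespace" --as "$user" || return 1
    i=$((i + 1))
  done
}
```

For example, `submit_burst 7 ./setup.KubeConEU25/sample-jobs/short.yaml blue alice` issues the seven commands listed above. Assuming Kueue's Workload resources are installed, as they are on an MLBatch cluster, `kubectl get workloads -n blue` then lists the admitted and pending workloads.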

Since no one else is using the cluster, Alice is able to use her blue team's quota of 8 GPUs and
also borrow the 8 GPUs of the red team's quota and the 8 GPUs allocated to the slack cluster queue.
During this part of the demo, we will start with 3 admitted jobs and 4 pending jobs on the blue
cluster queue. Over the next two minutes, the queue will drain as the short-running jobs complete
and the next pending job is admitted.
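
The admitted and pending counts follow from simple capacity arithmetic. The sketch below assumes, as described above, 8 GPUs of blue quota plus 8 borrowable GPUs each from the red quota and the slack cluster queue, and 8 GPUs per short job. `admitted_jobs` is a hypothetical name, and the model ignores everything Kueue does beyond counting GPUs.

```sh
# Hypothetical model: how many identical jobs fit at once.
admitted_jobs() {
  submitted=$1
  gpus_per_job=$2
  capacity=$3
  fit=$((capacity / gpus_per_job))
  if [ "$submitted" -lt "$fit" ]; then
    echo "$submitted"
  else
    echo "$fit"
  fi
}

capacity=$((8 + 8 + 8))   # blue quota + red quota + slack
submitted=7               # number of kubectl create commands shown above
admitted=$(admitted_jobs "$submitted" 8 "$capacity")
echo "admitted=$admitted pending=$((submitted - admitted))"
```

With the seven submissions shown above, this prints `admitted=3 pending=4`.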

### Borrowing and Preemption

Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs
will be able to run immediately and the fourth job will be queued.
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```

Alice can use priorities to ensure important jobs run quickly.
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```
One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make resources available for her high-priority job.

Bob on the red team arrives at work and submits two jobs.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```
To allow Bob to utilize his quota, which Alice's jobs had been borrowing, one of Alice's
jobs is quickly preempted and returned to the queue of pending jobs.
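
One way to reason about which jobs are at risk is to compare a team's current usage with its own quota: anything above the quota is borrowed and can be reclaimed by its owner. The sketch below is bookkeeping only, with a hypothetical `borrowed_gpus` helper; it is not Kueue's actual preemption algorithm.

```sh
# Hypothetical helper: GPUs in use beyond the team's own quota.
borrowed_gpus() {
  usage=$1
  quota=$2
  if [ "$usage" -gt "$quota" ]; then
    echo $((usage - quota))
  else
    echo 0
  fi
}

# Blue team running three 8-GPU jobs against its quota of 8 GPUs.
echo "blue borrows $(borrowed_gpus 24 8) GPUs"
```

In this scenario the blue team is borrowing 16 GPUs, 8 of which belong to the red team's quota, so one 8-GPU job has to give way when Bob claims that quota.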

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
After the job is running, we will simulate Autopilot detecting a serious GPU failure
by labeling a Node:
```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```
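
After the demo, the simulated fault can be undone by removing the label again. The sketch below uses kubectl's standard trailing-dash syntax for deleting a label. `clear_gpu_fault` is a hypothetical helper name, and whether Autopilot rewrites the label on its next health check depends on the Autopilot configuration, which is not covered here.

```sh
# Hypothetical helper: remove the simulated GPU health label from a node.
clear_gpu_fault() {
  node=$1
  # A label key followed by "-" tells kubectl label to remove that label.
  kubectl label node "$node" autopilot.ibm.com/gpuhealth-
}
```

It would be called as `clear_gpu_fault <node-name>` with the same node name used in the labeling command.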
MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first does a clean removal of all of the job's
Pods and then creates fresh versions of them. Since MLBatch automatically injects
the Kubernetes affinities shown below into all Pods it creates for user workloads,
the Kubernetes scheduler will avoid scheduling the new Pods on the impacted Node.
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: autopilot.ibm.com/gpuhealth
          operator: NotIn
          values:
          - ERR
          - TESTING
          - EVICT
```
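
The effect of the `NotIn` expression can be sketched as a small predicate over the node's `autopilot.ibm.com/gpuhealth` label value. This is an illustration of the matching rule, not code from MLBatch or Kubernetes: `node_is_schedulable` is a hypothetical name, and `PASS` is used only as an example of a value outside the excluded list.

```sh
# Returns 0 (schedulable) unless the label value is one of the
# excluded values. An empty argument stands for a node without
# the label, which a NotIn expression also matches.
node_is_schedulable() {
  case "$1" in
    ERR|TESTING|EVICT) return 1 ;;
    *) return 0 ;;
  esac
}

for value in "" PASS ERR TESTING EVICT; do
  if node_is_schedulable "$value"; then
    echo "gpuhealth='$value': schedulable"
  else
    echo "gpuhealth='$value': avoided"
  fi
done
```

Because the rule is `requiredDuringSchedulingIgnoredDuringExecution`, it only constrains where new Pods are placed, which is why the reset that recreates the Pods is needed to move a running job off the labeled Node.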

</details>

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: important
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: important
      spec:
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: high-priority
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: large
  labels:
    kueue.x-k8s.io/queue-name: default-queue
  annotations:
    workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 5s
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: large
      spec:
        completions: 4
        parallelism: 4
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: normal
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: normal
      spec:
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: short
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: short
      spec:
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 30"]
              resources:
                limits:
                  nvidia.com/gpu: 4
