
Commit ec7ca4e

first draft of queuing/borrowing demo script (#168)
1 parent 5c0081f commit ec7ca4e

5 files changed: +245 -3 lines changed
Diff for: setup.KubeConEU25/README.md

+134 -3
@@ -537,12 +537,143 @@ kubectl label servicemonitors.monitoring.coreos.com -n nvidia-GPU-operator nvidi
## Workload Management

-We will now demonstrate the queueing, quota management, and fault recovery
-capabilities of MLBatch using synthetic workloads.
+We will now demonstrate the queuing, quota management, and fault recovery capabilities of MLBatch
+using synthetic workloads.

<details>
For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below,
wrapped in an AppWrapper. All variations will create multiple pods, each requesting some number of
GPUs, and sleep for a specified interval before completing successfully.

-TODO
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: <jobtype>
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: <jobtype>
      spec:
        completions: <number of pods>
        parallelism: <number of pods>
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: <priority class>
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
```

We will use four types of jobs:

| Job Type  | Priority | Duration | Number of Pods | GPU Usage  |
|-----------|----------|----------|----------------|------------|
| short     | normal   | 30s      | 2              | 2 x 4 = 8  |
| normal    | normal   | 600s     | 2              | 2 x 4 = 8  |
| important | high     | 600s     | 2              | 2 x 4 = 8  |
| large     | normal   | 600s     | 4              | 4 x 4 = 16 |

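The GPU Usage column is the number of pods multiplied by the GPUs each pod requests. As a minimal sketch of that arithmetic (the `gpu_usage` helper is a hypothetical name introduced here, not part of MLBatch):

```sh
# Hypothetical helper: total GPUs requested by one job.
# $1 = number of pods, $2 = GPUs per pod (4 in all of the sample jobs).
gpu_usage() {
  echo $(( $1 * $2 ))
}

gpu_usage 2 4   # short, normal, and important jobs: prints 8
gpu_usage 4 4   # large job: prints 16
```
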
### Queuing

First, Alice will submit a burst of short-running jobs that exceeds
the number of available GPUs in the cluster. The excess jobs will be
suspended by Kueue and admitted in turn as resources become available.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
```

604+
Since no one else is using the cluster, Alice is able to utilize
605+
both her blue team's quota of 8 GPUs and to borrow all 8 GPUs from the red team's quota
606+
and the 8 GPUs allocated to the slack cluster queue. During this part of the demo,
607+
we will start with 3 admitted jobs and 5 pending jobs on the blue cluster queue. Over
608+
the next two minutes, the queue will drain as the short running jobs complete and the
609+
next pending job is admitted.
610+
611+
### Borrowing and Preemption
612+
613+
Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs
614+
will be able to run immediately and the 4th job will be queued.
615+
616+
```sh
617+
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
618+
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
619+
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
620+
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
621+
```
622+
623+
Alice can use priorities to ensure her important jobs run quickly.
624+
625+
```sh
626+
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
627+
```
628+
629+
One of Alice's normal jobs is automatically suspended and put back on the queue of
630+
waiting jobs to make its resource available for her high priority job.
631+
632+
Finally Bob on the red team arrives at work and submits two jobs.
633+
634+
```sh
635+
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
636+
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
637+
```
638+
639+
Kueue ensures that Bob has immediate access to his team's allocated quota
640+
by evicting borrowing jobs. One of Alice's running
641+
jobs is quickly suspended and returned to her team's queue of pending jobs.
642+
643+
### Fault Tolerance
644+
645+
In this scenario, we will start fresh with an empty cluster. Alice will submit
646+
a single large job:
647+
648+
```sh
649+
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
650+
```
651+
652+
After the job is running, we will simulate Autopilot detecting a serious GPU failure
653+
on by labeling a Node:
654+
655+
```sh
656+
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
657+
```
658+
659+
MLBatch will automatically trigger a reset of all running jobs with Pods on
660+
the impacted node. This reset first does a clean removal of all of the job's
661+
Pods and then creates fresh versions of them. Since MLBatch automatically injects
662+
the Kubernetes affinities shown below into all Pods it creates for user workloads,
663+
the Kubernetes scheduler will avoid scheduling the new Pods on the impacted Node.
664+
```yaml
665+
affinity:
666+
nodeAffinity:
667+
requiredDuringSchedulingIgnoredDuringExecution:
668+
nodeSelectorTerms:
669+
- matchExpressions:
670+
- key: autopilot.ibm.com/gpuhealth
671+
operator: NotIn
672+
values:
673+
- ERR
674+
- TESTING
675+
- EVICT
676+
```
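
The `NotIn` operator matches nodes whose label value is outside the listed set, and in Kubernetes node affinity it also matches nodes that do not carry the label at all. The sketch below models that rule in plain shell; `node_is_eligible` is a hypothetical name, and `PASS` is only an example of a value outside the excluded set, not a documented Autopilot value.

```sh
# Illustrative model of the NotIn rule: a node is eligible when its
# autopilot.ibm.com/gpuhealth label is absent or has any value other
# than ERR, TESTING, or EVICT.
# $1 = label value (empty string if the label is not set)
node_is_eligible() {
  case "$1" in
    ERR|TESTING|EVICT) return 1 ;;
    *) return 0 ;;
  esac
}

# PASS is an assumed example value; "" stands for a node without the label.
for value in "" PASS ERR TESTING EVICT; do
  if node_is_eligible "$value"; then
    echo "gpuhealth='$value': eligible"
  else
    echo "gpuhealth='$value': excluded"
  fi
done
```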

</details>

Diff for: setup.KubeConEU25/sample-jobs/important.yaml

+28
@@ -0,0 +1,28 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: important
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: important
      spec:
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: high-priority
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4

Diff for: setup.KubeConEU25/sample-jobs/large.yaml

+29
@@ -0,0 +1,29 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: large
  labels:
    kueue.x-k8s.io/queue-name: default-queue
  annotations:
    workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 5s
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: large
      spec:
        completions: 4
        parallelism: 4
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4

Diff for: setup.KubeConEU25/sample-jobs/normal.yaml

+27
@@ -0,0 +1,27 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: normal
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: normal
      spec:
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4

Diff for: setup.KubeConEU25/sample-jobs/short.yaml

+27
@@ -0,0 +1,27 @@
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: short
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: short
      spec:
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 30"]
              resources:
                limits:
                  nvidia.com/gpu: 4
