@@ -544,6 +544,7 @@ using synthetic workloads.
For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations will create multiple pods, each requesting some number of GPUs, and sleep for
a specified interval before completing successfully.
+
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
@@ -574,6 +575,7 @@ spec:
limits:
  nvidia.com/gpu: 4
```
+
We will use four types of jobs:
| Job Type | Priority | Duration | Number of Pods | GPU Usage |
|----------|----------|----------|----------------|-----------|
@@ -609,6 +611,7 @@ next pending job is admitted.

Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs
will be able to run immediately and the 4th job will be queued.
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
@@ -617,9 +620,11 @@ kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```
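
The repeated `kubectl create` commands above can be wrapped in a small loop. This is a minimal sketch, not part of the tutorial's scripts: `submit_jobs` is a hypothetical helper name, and it assumes the manifest can be created more than once, as the repeated commands above do.

```sh
# Hypothetical helper (not from the tutorial): submit COUNT copies of a manifest.
# Arguments: COUNT MANIFEST NAMESPACE USER
submit_jobs() {
  count=$1
  manifest=$2
  namespace=$3
  user=$4
  i=0
  while [ "$i" -lt "$count" ]; do
    # Same command as in the tutorial, with the varying parts parameterized.
    kubectl create -f "$manifest" -n "$namespace" --as "$user" || return 1
    i=$((i + 1))
  done
}
```

For example, `submit_jobs 4 ./setup.KubeConEU25/sample-jobs/normal.yaml blue alice` would issue Alice's four submissions.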

Alice can use priorities to ensure important jobs run quickly.
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```
+
One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make resources available for her high priority job.
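
One way to observe the suspension is to list the queueing state in Alice's namespace. This sketch assumes the cluster exposes Kueue's `Workload` custom resource for queued jobs, which this section does not show; `show_queue_state` is a hypothetical helper name.

```sh
# Hypothetical helper (not from the tutorial): list queued and admitted workloads.
# Assumes the Kueue Workload custom resource is installed on the cluster.
# Argument: NAMESPACE
show_queue_state() {
  namespace=$1
  kubectl get workloads -n "$namespace"
}
```

Running `show_queue_state blue` before and after submitting the important job should make the change in admission visible, if the assumption above holds.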
@@ -629,21 +634,26 @@ Bob on the red team arrives at work and submits two jobs.
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```
+
To allow Bob to utilize his quota, which Alice's jobs had been borrowing, one of Alice's
jobs is quickly preempted and returned to the queue of pending jobs.

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
+
After the job is running, we will simulate Autopilot detecting a serious GPU failure
by labeling a Node:
+
```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```
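
To confirm the label was applied, and to undo the simulation afterwards, something like the following could be used. This is a sketch under assumptions: both helper names are hypothetical, and removing the label by hand is only a way to end the simulated failure, not a documented Autopilot recovery procedure.

```sh
# Hypothetical helpers (not from the tutorial).

# Show every node with the value of its gpuhealth label in an extra column.
show_gpu_health() {
  kubectl get nodes -L autopilot.ibm.com/gpuhealth
}

# Remove the simulated label from one node (a trailing '-' deletes a label).
# Argument: NODE_NAME
clear_gpu_health() {
  node=$1
  kubectl label node "$node" autopilot.ibm.com/gpuhealth-
}
```

Usage would be `show_gpu_health` to inspect the nodes, then `clear_gpu_health <node-name>` once the experiment is finished.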
+
MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first does a clean removal of all of the job's
Pods and then creates fresh versions of them. Since MLBatch automatically injects