## Workload Management

We will now demonstrate the queuing, quota management, and fault recovery
capabilities of MLBatch using synthetic workloads.

<details>

For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations will create multiple pods, each requesting some number of GPUs, and sleep for
a specified interval before completing successfully.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: <jobtype>
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: <jobtype>
      spec:
        completions: <number of pods>
        parallelism: <number of pods>
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: <priority class>
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
```

We will use four types of jobs:

| Job Type  | Priority | Duration | Number of Pods | GPU Usage  |
|-----------|----------|----------|----------------|------------|
| short     | normal   | 30s      | 2              | 2 x 4 = 8  |
| normal    | normal   | 600s     | 2              | 2 x 4 = 8  |
| important | high     | 600s     | 2              | 2 x 4 = 8  |
| large     | normal   | 600s     | 4              | 4 x 4 = 16 |

### Queuing

First, Alice will submit a burst of short-running jobs that exceeds
the number of available GPUs in the cluster. The excess jobs will be
suspended by Kueue and admitted in turn as resources become available.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
```

Since no one else is using the cluster, Alice is able to use both her blue team's
quota of 8 GPUs and to borrow the red team's 8 GPUs as well as the 8 GPUs allocated
to the slack cluster queue. During this part of the demo, we will start with 3 admitted
jobs and 5 pending jobs on the blue cluster queue. Over the next two minutes, the queue
will drain as the short-running jobs complete and the next pending job is admitted.

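To follow along as the queue drains, it can be handy to watch the AppWrapper and Kueue
objects in the blue namespace. This is an optional check rather than part of the scripted
demo; it assumes the team RBAC lets users read these objects, and the exact output columns
depend on your AppWrapper and Kueue versions.

```sh
# Watch the AppWrappers cycle from Suspended to Running to Succeeded
kubectl get appwrappers -n blue --as alice -w

# Inspect the corresponding Kueue workloads to see which ones are admitted
kubectl get workloads.kueue.x-k8s.io -n blue --as alice
```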

### Borrowing and Preemption

Alice will now submit 4 normal jobs. Again, thanks to borrowing, three of these jobs
will be able to run immediately and the 4th job will be queued.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```

Alice can use priorities to ensure her important jobs run quickly.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```

One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make resources available for her high-priority job.

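A quick, optional way to see this is to look at the blue team's local queue, `default-queue`
(the queue named on every AppWrapper above); the exact pending and admitted counts shown
depend on your Kueue version, so treat the output columns as an assumption.

```sh
# The local queue should now show the preempted normal job as pending again
kubectl get localqueue default-queue -n blue --as alice
```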

Bob on the red team arrives at work and submits two jobs.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```

To allow Bob to utilize his quota, which Alice's jobs had been borrowing, one of Alice's
jobs is quickly preempted and returned to the queue of pending jobs.

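For the cluster admin, Kueue's ClusterQueue objects are a convenient place to watch this
borrowing and preemption play out. The queue name below is a placeholder; use whatever
ClusterQueue names your MLBatch installation defines for the blue and red teams.

```sh
# List cluster queues and their pending workload counts (run as a cluster admin)
kubectl get clusterqueues

# The status of a cluster queue reports per-flavor usage,
# including resources currently borrowed from the cohort
kubectl describe clusterqueue <cluster-queue-name>
```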

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```

After the job is running, we will simulate Autopilot detecting a serious GPU failure
on a node by labeling that node:

```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```

MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first does a clean removal of all of the job's
Pods and then creates fresh versions of them. Since MLBatch automatically injects
the Kubernetes affinities shown below into all Pods it creates for user workloads,
the Kubernetes scheduler will avoid scheduling the new Pods on the impacted node.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: autopilot.ibm.com/gpuhealth
          operator: NotIn
          values:
          - ERR
          - TESTING
          - EVICT
```
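
To confirm the recovery, one can check where the recreated Pods landed and inspect the
node labels. Removing the label once the node has been repaired is shown here as one
simple option; in a real deployment Autopilot manages this label itself.

```sh
# The recreated pods should be scheduled on nodes other than the impacted one
kubectl get pods -n blue -o wide --as alice

# Show the gpuhealth label on every node
kubectl get nodes -L autopilot.ibm.com/gpuhealth

# Once the node is healthy again, remove the label (the trailing '-' deletes a label)
kubectl label node <node-name> autopilot.ibm.com/gpuhealth-
```
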
</details>