## Workload Management

We will now demonstrate the queuing, quota management, and fault recovery capabilities of MLBatch
using synthetic workloads.

<details>

For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations will create multiple pods, each requesting some number of GPUs, and sleep for
a specified interval before completing successfully.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: <jobtype>
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: <jobtype>
      spec:
        completions: <number of pods>
        parallelism: <number of pods>
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: <priority class>
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 4
```

We will use four types of jobs:

| Job Type  | Priority | Duration | Number of Pods | GPU Usage  |
|-----------|----------|----------|----------------|------------|
| short     | normal   | 30s      | 2              | 2 x 4 = 8  |
| normal    | normal   | 600s     | 2              | 2 x 4 = 8  |
| important | high     | 600s     | 2              | 2 x 4 = 8  |
| large     | normal   | 600s     | 4              | 4 x 4 = 16 |
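
For example, the short job type could be instantiated as shown below. This is a sketch rather
than the exact contents of `setup.KubeConEU25/sample-jobs/short.yaml`; in particular, the
`default-priority` name for the normal priority class is an assumption.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  generateName: short-
  labels:
    kueue.x-k8s.io/queue-name: default-queue
spec:
  components:
  - template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: short-
      spec:
        completions: 2                           # 2 pods
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            terminationGracePeriodSeconds: 0
            priorityClassName: default-priority  # assumed name of the normal priority class
            containers:
            - name: busybox
              image: quay.io/project-codeflare/busybox:1.36
              command: ["sh", "-c", "sleep 30"]  # 30s duration
              resources:
                limits:
                  nvidia.com/gpu: 4              # 4 GPUs per pod, 8 total
```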

### Queuing

First, Alice will submit a burst of short jobs that together request more GPUs than are
available in the cluster. The excess jobs will be suspended by Kueue and admitted in turn
as resources become available.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/short.yaml -n blue --as alice
```

Since no one else is using the cluster, Alice is able to use both her blue team's quota of
8 GPUs and, by borrowing, the red team's entire quota of 8 GPUs as well as the 8 GPUs
allocated to the slack cluster queue. During this part of the demo, we will start with 3
admitted jobs and 4 pending jobs on the blue cluster queue. Over the next two minutes, the
queue will drain as the short jobs complete and the next pending job is admitted.
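
While the jobs run, one can watch the queue drain by listing the team's AppWrappers or the
Kueue workloads backing them (a sketch; the exact status columns depend on the installed
CRD versions):

```sh
# Watch Alice's AppWrappers move from suspended to running to succeeded
kubectl get appwrappers -n blue --as alice --watch

# Alternatively, list the Kueue Workload objects backing them
kubectl get workloads -n blue --as alice
```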

### Borrowing and Preemption

Alice will now submit four normal jobs. Again thanks to borrowing, three of these jobs
will be able to run immediately and the fourth will be queued.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```

Alice can use priorities to ensure her important jobs run quickly.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```

One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make its resources available for her high-priority job.

Finally, Bob on the red team arrives at work and submits two jobs.

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```

Kueue ensures that Bob has immediate access to his team's allocated quota
by evicting borrowing jobs. One of Alice's running jobs is quickly suspended
and returned to her team's queue of pending jobs.
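
To see this from the admin's perspective, one can inspect the cluster queues (a sketch; the
cluster queue names `blue` and `red` are assumptions based on the team names used in this
tutorial):

```sh
# Show quota, admitted workloads, and borrowed resources for each team
kubectl describe clusterqueue blue
kubectl describe clusterqueue red
```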

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:

```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
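
To choose a node to impact, we can first check where the job's Pods were scheduled
(a sketch):

```sh
# The NODE column shows where each of the large job's Pods is running
kubectl get pods -n blue -o wide
```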

After the job is running, we will simulate Autopilot detecting a serious GPU failure
on a node by labeling that node:

```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```

MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first does a clean removal of all of the job's
Pods and then creates fresh versions of them. Since MLBatch automatically injects
the Kubernetes affinities shown below into all Pods it creates for user workloads,
the Kubernetes scheduler will avoid scheduling the new Pods on the impacted node.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: autopilot.ibm.com/gpuhealth
          operator: NotIn
          values:
          - ERR
          - TESTING
          - EVICT
```
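
After the demo, or once the GPU is actually repaired, the label can be cleared so the node
becomes eligible for user workloads again (a sketch; in a real deployment Autopilot manages
this label based on its health checks):

```sh
# The trailing "-" removes the autopilot.ibm.com/gpuhealth label from the node
kubectl label node <node-name> autopilot.ibm.com/gpuhealth-
```
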
</details>