Commit 21e7b10

fix paragraphs
1 parent e85318e commit 21e7b10

1 file changed: +10 -0 lines changed

setup.KubeConEU25/README.md

Lines changed: 10 additions & 0 deletions
@@ -544,6 +544,7 @@ using synthetic workloads.
For this portion of the tutorial, we will use variations on the simple batch/v1 Job shown below.
All variations will create multiple pods, each requesting some number of GPUs, and sleep for
a specified interval before completing successfully.
+
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
@@ -574,6 +575,7 @@ spec:
limits:
nvidia.com/gpu: 4
```
+
We will use four types of jobs:
| Job Type | Priority | Duration | Number of Pods | GPU Usage |
---------------------------------------------------------------
@@ -609,6 +611,7 @@ next pending job is admitted.

Alice will now submit 4 normal jobs. Again, with borrowing, three of these jobs
will be able to run immediately and the 4th job will be queued.
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
@@ -617,9 +620,11 @@ kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n blue --as alice
```

Alice can use priorities to ensure important jobs run quickly.
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/important.yaml -n blue --as alice
```
+
One of Alice's normal jobs is automatically suspended and put back on the queue of
waiting jobs to make resources available for her high-priority job.

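To see which of Alice's jobs was suspended, it can help to list the queued objects in her namespace. The following is a minimal sketch, not part of the tutorial itself: the helper name `show_queue_state` and the `KUBECTL` override variable are hypothetical, and it assumes the cluster has the AppWrapper CRD (as used in the YAML above) and Kueue's Workload CRD installed. The exact columns printed depend on the installed CRD versions.

```sh
#!/bin/sh
# Hypothetical helper, not part of the tutorial.
# KUBECTL can be overridden (for example KUBECTL=echo) to print the
# commands instead of contacting a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# show_queue_state NAMESPACE
# Lists AppWrappers and (assumed) Kueue Workloads in the given namespace
# so that admitted and suspended jobs can be compared.
show_queue_state() {
  ns="$1"
  "$KUBECTL" get appwrappers -n "$ns"
  "$KUBECTL" get workloads -n "$ns"
}

# Usage: show_queue_state blue
```

Running `show_queue_state blue` before and after submitting `important.yaml` should make the change in admission status visible, assuming the current user is permitted to list these resources in the namespace.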
@@ -629,21 +634,26 @@ Bob on the red team arrives at work and submits two jobs.
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
kubectl create -f ./setup.KubeConEU25/sample-jobs/normal.yaml -n red --as bob
```
+
To allow Bob to utilize his quota, which Alice's jobs had been borrowing, one of Alice's
jobs is quickly preempted and returned to the queue of pending jobs.

### Fault Tolerance

In this scenario, we will start fresh with an empty cluster. Alice will submit
a single large job:
+
```sh
kubectl create -f ./setup.KubeConEU25/sample-jobs/large.yaml -n blue --as alice
```
+
After the job is running, we will simulate Autopilot detecting a serious GPU failure
by labeling a Node:
+
```sh
kubectl label node <node-name> autopilot.ibm.com/gpuhealth=EVICT --overwrite
```
+
MLBatch will automatically trigger a reset of all running jobs with Pods on
the impacted node. This reset first does a clean removal of all of the job's
Pods and then creates fresh versions of them. Since MLBatch automatically injects
