Adding ioctls to fix a simple multi-GPU Hugging Face `accelerate` program that does not work on GCP H100s.
---
### System details
* **instance type:** `a3-highgpu-8g` (GCP, us-east4-a)
* **NVIDIA driver:** `Driver Version: 550.54.15 CUDA Version: 12.4`
* **NVIDIA device:** 4 x NVIDIA H100 HBM3
* **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`
```
runsc version release-20240513.0-173-gc526d251933a-dirty
spec: 1.1.0-rc.1
```
---
## Reproduction steps
**1. Install gVisor**
**2. Add the GPU-enabling gVisor options**
In `/etc/docker/daemon.json` (restart the Docker daemon afterwards so it picks up the `runsc` runtime):
```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    },
    "runsc": {
      "path": "/home/modal/runsc",
      "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
    }
  }
}
```
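To confirm that Docker actually registered the `runsc` runtime after the restart, a small helper like the following can be used. This is an illustrative sketch only, not part of the original repro; it assumes the Docker CLI is on `PATH` and that the current user can talk to the daemon.
```python
# check_runtimes.py -- illustrative helper: verify that dockerd picked up the
# "runsc" entry from /etc/docker/daemon.json after a restart.
import json
import subprocess

# `docker info --format '{{json .Runtimes}}'` prints the registered runtimes as JSON.
info = subprocess.run(
    ["docker", "info", "--format", "{{json .Runtimes}}"],
    check=True, capture_output=True, text=True,
)
runtimes = json.loads(info.stdout)
print("registered runtimes:", ", ".join(sorted(runtimes)))
if "runsc" not in runtimes:
    raise SystemExit("runsc runtime not registered; re-check daemon.json and restart dockerd")
```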
**3. Build and run the Dockerfile** (`$RUNTIME` below is the container runtime under test)
```Dockerfile
# Dockerfile
FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d
RUN pip install fastapi==0.111.0
RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil
ENV HUGGINGFACE_HUB_CACHE="/pretrained"
ENV TQDM_DISABLE="true"
ENV AXOLOTL_NCCL_TIMEOUT="60"
COPY <<EOF repro.py
import os
import subprocess
from pathlib import Path
print("[MOD-3226] hello from the repro!!!")
from accelerate import Accelerator
accelerator = Accelerator()
with accelerator.main_process_first():
    print(f"hello! {accelerator.process_index}")
EOF
ENTRYPOINT ["accelerate", "launch", "repro.py"]
```
```
sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
```
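The traceback in the failure logs below bottoms out in `torch.distributed.barrier()`, so the problem can also be exercised without `accelerate` at all. The following is a minimal sketch under that assumption; the file name and `torchrun` invocation are illustrative and not part of the repro image.
```python
# barrier_repro.py -- minimal sketch that exercises the same NCCL barrier that
# fails under runsc (main). Launch inside the container with:
#   torchrun --nproc_per_node=4 barrier_repro.py
import os

import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)            # one GPU per process
    dist.init_process_group(backend="nccl")      # same backend accelerate uses for multi-GPU
    dist.barrier()                                # the call that raises ncclUnhandledCudaError
    print(f"hello! {dist.get_rank()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
Under the broken configuration this would be expected to fail at the barrier in the same way as the `accelerate` repro; it is only a way to take `accelerate` out of the picture while debugging.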
### Results
**`runc`**
```
sudo docker run -it --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
hello! 0
hello! 1
hello! 2hello! 3
```
**`runsc` (main)**
<details> <summary>💥 Failure logs</summary>
```
sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
Traceback (most recent call last):
File "/workspace/axolotl/repro.py", line 10, in <module>
with accelerator.main_process_first():
File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first
with self.state.main_process_first():
File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first
with PartialState().main_process_first():
File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first
yield from self._goes_first(self.is_main_process)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first
self.wait_for_everyone()
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone
torch.distributed.barrier()
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'unknown error'
[2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM
[2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM
[2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
repro.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-11_19:52:01
host : d45a08528293
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 67)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
---
</details>
**`runsc` (this pull request)**
<details> <summary>✅ Success logs</summary>
```
[modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
hello! 1
hello! 3hello! 2
```
</details>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1
PiperOrigin-RevId: 651754677