Multi-gpu training with slurm times out #20434

Open
nightingal3 opened this issue Nov 19, 2024 · 1 comment
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.3.x


Bug description

(Note: cross-posting from the litgpt repo, since I think this may actually be a pytorch-lightning issue.)

I was transferring some checkpoints from a cluster that didn't use Slurm to one that does. The checkpoint was trained using multiple GPUs/nodes, and I found that I'm able to load it and start training in an interactive job. However, when I submit the job with sbatch, it times out after some time.

I've seen this guide: https://lightning.ai/docs/fabric/2.4.0/guide/multi_node/slurm.html and added srun to my submission script. However, even though all 4 devices seem to be initialized, the model still gets stuck before training starts and eventually times out.
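
(As a sanity check, I think something like the snippet below, run from inside each srun-launched task, would show whether Lightning's SLURM detection actually picks up the allocation. It's just a sketch for context, not taken from my actual run; the environment variable names are the standard Slurm ones.)

```python
# Diagnostic sketch (not part of litgpt): confirm that Lightning sees the Slurm
# allocation from inside each srun-launched task.
import os

from lightning.fabric.plugins.environments import SLURMEnvironment

print("SLURM detected by Lightning:", SLURMEnvironment.detect())
for var in ("SLURM_NTASKS", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID"):
    print(f"{var}={os.environ.get(var)}")
```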

A debug log and my submission script are included below. My sbatch script is a bit different from the guide's in that it runs another sh script, which does a bunch of setup and then calls litgpt pretrain <...>, but I'm not sure this would be an issue...

I also tried setting up the Fabric initialization to explicitly specify the number of nodes, devices, etc., like in the example in pretrain.py, but it didn't make a difference:

fabric = L.Fabric(
    accelerator="gpu", devices=4, num_nodes=1, strategy=strategy, precision=precision, loggers=[logger]
)
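
For reference, a minimal self-contained version of that setup would look roughly like this; the strategy, precision, and logger values here are stand-ins for whatever litgpt configures in my run (the logs below do show FSDP being used), not copied from it:

```python
# Minimal sketch of the Fabric setup; strategy/precision/logger are guessed
# stand-ins for the values litgpt fills in, not copied from my actual run.
import lightning as L
from lightning.fabric.loggers import TensorBoardLogger
from lightning.fabric.strategies import FSDPStrategy

strategy = FSDPStrategy()                        # litgpt pretraining uses FSDP across the 4 GPUs
precision = "bf16-mixed"                         # stand-in precision
logger = TensorBoardLogger(root_dir="out/logs")  # stand-in logger

fabric = L.Fabric(
    accelerator="gpu",
    devices=4,
    num_nodes=1,
    strategy=strategy,
    precision=precision,
    loggers=[logger],
)
fabric.launch()  # with srun, Fabric should attach to the Slurm-launched processes instead of spawning its own
```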

Details:
My script:

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --mail-user=<email>
#SBATCH --mail-type=ALL

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Check if training type is provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <sequential|mixed> [training args...]"
    exit 1
fi

# Get the training type and remove it from args
TRAIN_TYPE=$1
shift

case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac

Example debug output:

Node information:

=== Slurm Environment ===
SLURM_NTASKS: 4
SLURM_PROCID: 0
SLURM_LOCALID: 0
SLURM_JOB_ID: 3178163

=== GPU Information ===
Available GPUs:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-b349d8f4-c2a8-bd4b-2ed8-4678cc3093ad)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-6386b3c4-ba07-b55c-a8d8-1d7e38378b83)
GPU 2: NVIDIA RTX A6000 (UUID: GPU-8a6310cf-0811-4754-64db-8c4117d4be50)
GPU 3: NVIDIA RTX A6000 (UUID: GPU-99486ac3-a03a-16fa-0e46-8bade74f121a)

GPU Topology:
	GPU0	GPU1	GPU2	GPU3	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	SYS	SYS	1-2,7-8,129-130	0		N/A
GPU1	SYS	 X 	NV4	NODE	NODE	64-65,67,69	1		N/A
GPU2	SYS	NV4	 X 	NODE	NODE	64-65,67,69	1		N/A
GPU3	SYS	NODE	NODE	 X 	NODE	64-65,67,69	1		N/A
NIC0	SYS	NODE	NODE	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.0

What version are you seeing the problem on?

v2.3

How to reproduce the bug

submission script:

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH [email protected]
#SBATCH --mail-type=ALL

# Get training type
TRAIN_TYPE=$1
shift

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Run training script
case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac


inside `pretrain_then_finetune.sh`:
```bash
<conda activate the env>

litgpt pretrain $model_name <...>
```

### Error messages and logs

[...previous stuff]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42

distributed_backend=nccl
All distributed processes registered. Starting with 4 processes

[rank: 0] Seed set to 42
babel-0-31:2324237:2324237 [0] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324237:2324237 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
/home/mengyan3/.local/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
babel-0-31:2324240:2324240 [1] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324240:2324240 [1] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324240:2324524 [1] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324240:2324524 [1] NCCL INFO Using network IB
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324240:2324524 [1] NCCL INFO Setting affinity for GPU 1 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324240:2324524 [1] NCCL INFO NVLS multicast support is not available on dev 1
babel-0-31:2324240:2324524 [1] NCCL INFO comm 0x555d6dc227b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
babel-0-31:2324240:2324524 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
babel-0-31:2324240:2324524 [1] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all rings
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all trees
babel-0-31:2324240:2324524 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324240:2324524 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank1]:[E1119 13:21:12.512786305 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324239:2324239 [3] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324239:2324239 [3] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324239:2324522 [3] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324239:2324522 [3] NCCL INFO Using network IB
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324239:2324522 [3] NCCL INFO Setting affinity for GPU 3 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324239:2324522 [3] NCCL INFO NVLS multicast support is not available on dev 3
babel-0-31:2324239:2324522 [3] NCCL INFO comm 0x5584021f39f0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
babel-0-31:2324239:2324522 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
babel-0-31:2324239:2324522 [3] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all rings
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all trees
babel-0-31:2324239:2324522 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324239:2324522 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank3]:[E1119 13:21:12.512781555 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324238:2324238 [2] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324238:2324238 [2] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324238:2324523 [2] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324238:2324523 [2] NCCL INFO Using network IB
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324238:2324523 [2] NCCL INFO Setting affinity for GPU 2 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324238:2324523 [2] NCCL INFO NVLS multicast support is not available on dev 2
babel-0-31:2324238:2324523 [2] NCCL INFO comm 0x55880160e670 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
babel-0-31:2324238:2324523 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
babel-0-31:2324238:2324523 [2] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all rings
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all trees
babel-0-31:2324238:2324523 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324238:2324523 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank2]:[E1119 13:21:12.525244336 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank1]:[E1119 13:21:13.938073877 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938095107 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938100947 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1119 13:21:13.938104817 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.938073737 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938094977 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938100577 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1119 13:21:13.938104557 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1119 13:21:13.938073817 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938094667 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938100907 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1119 13:21:13.938104757 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.092845528 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fdf89a24446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fdf8ad37772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fdf8ad3ebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fdf8ad4061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7fdfd36cd5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7fdfe2c89c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7fdfe2d0ec40 in /lib64/libc.so.6)

[rank1]:[E1119 13:21:13.092997168 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4e1d4446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f3b4f4e7772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b4f4eebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b4f4f061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f3b97e7d5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7f3ba7489c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7f3ba750ec40 in /lib64/libc.so.6)

[rank3]:[E1119 13:21:13.092988978 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe0c139c446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe0c26af772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe0c26b6bb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe0c26b861d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7fe10b0455c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7fe11a689c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7fe11a70ec40 in /lib64/libc.so.6)

/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324239 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324238 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324240 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
srun: First task exited 60s ago
srun: StepId=3178163.0 task 0: running
srun: StepId=3178163.0 tasks 1-3: exited
srun: Terminating StepId=3178163.0
slurmstepd: error: *** STEP 3178163.0 ON babel-0-31 CANCELLED AT 2024-11-19T13:24:07 ***
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
slurmstepd: error: --task-epilog failed status=9



### Environment

<details>
  <summary>Current environment</summary>

* CUDA:
	- GPU:
		- NVIDIA RTX A6000
		- NVIDIA RTX A6000
		- NVIDIA RTX A6000
		- NVIDIA RTX A6000
	- available:         True
	- version:           12.4
* Lightning:
	- botorch:           0.10.0
	- gpytorch:          1.11
	- lightning:         2.3.0.dev20240428
	- lightning-utilities: 0.11.8
	- pytorch-lightning: 2.3.1
	- torch:             2.5.1
	- torchmetrics:      1.4.0.post0
* Packages:
	- absl-py:           2.1.0
	- accelerate:        0.32.0
	- aiohttp:           3.9.5
	- aiosignal:         1.3.1
	- annotated-types:   0.7.0
	- antlr4-python3-runtime: 4.11.0
	- anyio:             4.4.0
	- argcomplete:       3.5.1
	- asttokens:         2.4.1
	- async-timeout:     4.0.3
	- attrs:             23.2.0
	- awscrt:            0.20.11
	- beautifulsoup4:    4.12.3
	- bitsandbytes:      0.42.0
	- boto3:             1.35.63
	- botocore:          1.34.138
	- botorch:           0.10.0
	- bs4:               0.0.2
	- build:             1.2.1
	- certifi:           2024.6.2
	- chardet:           5.2.0
	- charset-normalizer: 3.3.2
	- click:             8.1.7
	- colorama:          0.4.6
	- contourpy:         1.2.1
	- cycler:            0.12.1
	- dataproperty:      1.0.1
	- datasets:          2.20.0
	- dill:              0.3.8
	- distro:            1.9.0
	- dnspython:         2.6.1
	- docker-pycreds:    0.4.0
	- docstring-parser:  0.16
	- dotwiz:            0.4.0
	- email-validator:   2.2.0
	- evaluate:          0.4.2
	- exceptiongroup:    1.2.1
	- executing:         2.0.1
	- exrex:             0.11.0
	- fastapi:           0.111.0
	- fastapi-cli:       0.0.4
	- filelock:          3.16.1
	- fonttools:         4.53.1
	- frozenlist:        1.4.1
	- fsspec:            2024.10.0
	- funcy:             2.0
	- git-filter-repo:   2.34.0
	- gitdb:             4.0.11
	- gitpython:         3.1.43
	- gpytorch:          1.11
	- grpcio:            1.64.1
	- h11:               0.14.0
	- hf-transfer:       0.1.6
	- httpcore:          1.0.5
	- httptools:         0.6.1
	- httpx:             0.27.0
	- huggingface-hub:   0.23.4
	- idna:              3.7
	- importlib-metadata: 8.0.0
	- importlib-resources: 6.4.0
	- jaxtyping:         0.2.33
	- jinja2:            3.1.4
	- jiter:             0.5.0
	- jmespath:          1.0.1
	- joblib:            1.4.2
	- jsonargparse:      4.31.0
	- jsonlines:         4.0.0
	- kiwisolver:        1.4.5
	- lightning:         2.3.0.dev20240428
	- lightning-utilities: 0.11.8
	- linear-operator:   0.5.1
	- litdata:           0.2.30
	- litgpt:            0.4.0
	- litserve:          0.1.1.dev0
	- littleutils:       0.2.4
	- lm-eval:           0.4.3
	- lxml:              5.2.2
	- magicattr:         0.1.6
	- markdown:          3.6
	- markdown-it-py:    3.0.0
	- markupsafe:        2.1.5
	- matplotlib:        3.9.1.post1
	- mbstrdecoder:      1.1.3
	- mdurl:             0.1.2
	- more-itertools:    10.3.0
	- mpmath:            1.3.0
	- multidict:         6.0.5
	- multipledispatch:  1.0.0
	- multiprocess:      0.70.16
	- networkx:          3.2.1
	- nltk:              3.8.1
	- numexpr:           2.10.1
	- numpy:             1.26.4
	- nvidia-cublas-cu12: 12.4.5.8
	- nvidia-cuda-cupti-cu12: 12.4.127
	- nvidia-cuda-nvrtc-cu12: 12.4.127
	- nvidia-cuda-runtime-cu12: 12.4.127
	- nvidia-cudnn-cu12: 9.1.0.70
	- nvidia-cufft-cu12: 11.2.1.3
	- nvidia-curand-cu12: 10.3.5.147
	- nvidia-cusolver-cu12: 11.6.1.9
	- nvidia-cusparse-cu12: 12.3.1.170
	- nvidia-nccl-cu12:  2.21.5
	- nvidia-nvjitlink-cu12: 12.4.127
	- nvidia-nvtx-cu12:  12.4.127
	- openai:            1.43.0
	- opt-einsum:        3.3.0
	- orjson:            3.10.6
	- packaging:         24.1
	- pandas:            2.2.2
	- pathvalidate:      3.2.0
	- peft:              0.11.1
	- pillow:            10.4.0
	- pip:               24.0
	- pip-tools:         7.4.1
	- platformdirs:      4.2.2
	- portalocker:       2.10.0
	- protobuf:          4.25.3
	- psutil:            6.0.0
	- pyarrow:           16.1.0
	- pyarrow-hotfix:    0.6
	- pybind11:          2.13.1
	- pydantic:          2.8.0
	- pydantic-core:     2.20.0
	- pygments:          2.18.0
	- pyheck:            0.1.5
	- pyparsing:         3.1.2
	- pyproject-hooks:   1.1.0
	- pyro-api:          0.1.2
	- pyro-ppl:          1.9.1
	- pytablewriter:     1.2.0
	- python-dateutil:   2.9.0.post0
	- python-dotenv:     1.0.1
	- python-multipart:  0.0.9
	- pytorch-lightning: 2.3.1
	- pytz:              2024.1
	- pyyaml:            6.0.1
	- regex:             2024.5.15
	- requests:          2.32.3
	- rich:              13.7.1
	- rouge-score:       0.1.2
	- s3transfer:        0.10.3
	- sacrebleu:         2.4.2
	- safetensors:       0.4.3
	- scikit-learn:      1.5.1
	- scipy:             1.13.1
	- sentencepiece:     0.2.0
	- sentry-sdk:        2.7.1
	- setproctitle:      1.3.3
	- setuptools:        69.5.1
	- shellingham:       1.5.4
	- six:               1.16.0
	- smmap:             5.0.1
	- sniffio:           1.3.1
	- sorcery:           0.2.2
	- soupsieve:         2.6
	- sqlitedict:        2.1.0
	- starlette:         0.37.2
	- sympy:             1.13.1
	- tabledata:         1.3.3
	- tabulate:          0.9.0
	- tasksource:        0.0.45
	- tcolorpy:          0.1.6
	- tensorboard:       2.17.0
	- tensorboard-data-server: 0.7.2
	- threadpoolctl:     3.5.0
	- tokenizers:        0.19.1
	- tomli:             2.0.1
	- tomlkit:           0.13.2
	- torch:             2.5.1
	- torchmetrics:      1.4.0.post0
	- tqdm:              4.66.4
	- tqdm-multiprocess: 0.0.11
	- transformers:      4.42.3
	- triton:            3.1.0
	- typeguard:         2.13.3
	- typepy:            1.3.2
	- typer:             0.12.3
	- typeshed-client:   2.5.1
	- typing-extensions: 4.12.2
	- tzdata:            2024.1
	- ujson:             5.10.0
	- urllib3:           1.26.19
	- uvicorn:           0.30.1
	- uvloop:            0.19.0
	- wandb:             0.17.4
	- watchfiles:        0.22.0
	- websockets:        12.0
	- werkzeug:          3.0.3
	- wheel:             0.43.0
	- word2number:       1.1
	- wrapt:             1.16.0
	- xmltodict:         0.14.2
	- xxhash:            3.4.1
	- yarl:              1.9.4
	- zipp:              3.19.2
	- zstandard:         0.22.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.9.0
	- release:           5.14.0-427.40.1.el9_4.x86_64
	- version:           #1 SMP PREEMPT_DYNAMIC Wed Oct 16 07:08:17 EDT 2024

</details>


### More info

_No response_
nightingal3 added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Nov 19, 2024
@lliangthomas

I'm experiencing a very similar issue. Were you able to find a solution?
