multi-node training runs crash because ddp_weakref is None during backward #20706

Open
mishooax opened this issue Apr 10, 2025 · 0 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x

Bug description

Multi-node / multi-GPU training fails midway through because ddp_weakref is not set correctly during the backward pass. This appears to be similar to the issue reported in #20390. I have not been able to reproduce this with a small model, and the exact moment it fails (epoch/step) varies between training runs. Any ideas? 🙏
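For context, the AttributeError at the bottom of the traceback below is exactly what you get when a weak reference to the DDP wrapper has already been cleared by the time the backward hook runs. A minimal, self-contained sketch of that failure mode using only the standard-library weakref module (the Wrapper class is a hypothetical stand-in, not the actual Lightning/PyTorch code path):

import weakref

class Wrapper:
    # Hypothetical stand-in for the DDP wrapper; "reducer" mimics the
    # attribute the failing frame tries to read.
    def __init__(self):
        self.reducer = object()

wrapper = Wrapper()
ddp_weakref = weakref.ref(wrapper)

# While the wrapper is alive, dereferencing the weakref works fine.
assert ddp_weakref().reducer is not None

# Once the wrapper is garbage-collected (e.g., the DDP module is torn down or
# re-wrapped while stale autograd state still holds the old weakref), the
# weakref dereferences to None ...
del wrapper
assert ddp_weakref() is None

# ... and attribute access on that None gives exactly the error below:
# AttributeError: 'NoneType' object has no attribute 'reducer'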

[rank13]: Traceback (most recent call last):
[rank13]:              ^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
[rank13]:     _run_hydra(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank13]:     _run_app(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank13]:     run_and_report(
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank13]:     raise ex
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank13]:   File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank13]:     lambda: hydra.run(
[rank13]:     _ = ret.return_value
[rank13]:     DOPTrainer(config).train()
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank13]:     call._call_and_handle_interrupt(
[rank13]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank13]:     return function(*args, **kwargs)
[rank13]:     self._run(model, ckpt_path=ckpt_path)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1026, in _run_stage
[rank13]:     self.fit_loop.run()
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 216, in run
[rank13]:     self.advance(data_fetcher)
[rank13]:     self._optimizer_step(batch_idx, closure)
[rank13]:     output = fn(*args, **kwargs)
[rank13]:              ^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/core/optimizer.py", line 154, in step
[rank13]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank13]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank13]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank13]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank13]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 146, in __call__
[rank13]:     self._result = self.closure(*args, **kwargs)
[rank13]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank13]:     return func(*args, **kwargs)
[rank13]:            ^^^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in closure
[rank13]:     self._backward_fn(step_output.closure_loss)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 241, in backward_fn
[rank13]:     call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 323, in _call_strategy_hook
[rank13]:     output = fn(*args, **kwargs)
[rank13]:              ^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 213, in backward
[rank13]:     self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/plugins/precision/precision.py", line 73, in backward
[rank13]:     model.backward(tensor, *args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 1097, in backward
[rank13]:     loss.backward(*args, **kwargs)
[rank13]:   File "/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank13]:     torch.autograd.backward(
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank13]:     _engine_run_backward(
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank13]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank13]:   File "/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank13]:     return user_fn(self, *args)
[rank13]:            ^^^^^^^^^^^^^^^^^^^^
[rank13]:   File "/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 260, in backward
[rank13]:     reducer = ddp_weakref.reducer
[rank13]:               ^^^^^^^^^^^^^^^^^^^
[rank13]: AttributeError: 'NoneType' object has no attribute 'reducer'
[rank12]:[E410 06:07:50.769590035 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=253173, OpType=ALLGATHER, NumelIn=91356, NumelOut=730848, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
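The trailing watchdog timeout on rank 12 looks like a downstream symptom rather than a separate problem: once rank 13 dies in backward, the remaining ranks block in the next collective (the ALLGATHER in the log) until the NCCL process-group timeout (600 s here) expires and the watchdog aborts them. If that clutters the logs, the timeout can be raised while debugging; a sketch, assuming DDPStrategy's timeout argument (this only delays the secondary aborts, it does not fix the ddp_weakref issue):

from datetime import timedelta

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Give the NCCL process group a longer timeout so the secondary watchdog
# aborts on the surviving ranks don't bury the original crash in the logs.
# (Device/node counts here are illustrative, not the actual job config.)
strategy = DDPStrategy(timeout=timedelta(minutes=30))
trainer = Trainer(accelerator="gpu", devices=4, num_nodes=4, strategy=strategy)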

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

@mishooax added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Apr 10, 2025
@mishooax changed the title from "training run crashes because ddp_weakref is None during backward" to "multi-node training runs crash because ddp_weakref is None during backward" on Apr 10, 2025