Multi-node training runs crash because `ddp_weakref` is None during backward #20706
Bug description
Multi-node / multi-GPU training fails midway through because `ddp_weakref` is not being set correctly during the backward pass. This appears to be similar to the issue reported in #20390. I was unable to reproduce this with a small model, and the exact point of failure (epoch/step) varies between training runs. Any ideas? 🙏
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response