Bug description
`SIGTERMException` is not raised consistently across all ranks in DDP training because PyTorch Lightning doesn't handle SIGTERM well for distributed jobs. As a result, checkpointing on SIGTERM cannot be implemented reliably for DDP without workarounds in client code.
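For reference, one shape such a client-side workaround can take (a sketch only; the callback name, the `all_reduce` placement, and the use of `trainer.should_stop` are assumptions, not anything prescribed by Lightning): install a SIGTERM handler that merely records the signal, then agree on that flag across ranks with a collective so every rank decides to stop at the same batch boundary.

```python
import signal

import torch
import torch.distributed as dist
from lightning.pytorch import Callback


class CooperativeSigtermStop(Callback):
    """Hypothetical client-side workaround: all ranks agree on SIGTERM before stopping."""

    def __init__(self):
        self._got_sigterm = False
        # Only record the signal here; note that Lightning installs its own
        # SIGTERM handler as well, which is exactly the machinery this issue is about.
        signal.signal(signal.SIGTERM, lambda signum, frame: self._set_flag())

    def _set_flag(self):
        self._got_sigterm = True

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        flag = torch.tensor(1.0 if self._got_sigterm else 0.0, device=pl_module.device)
        if dist.is_available() and dist.is_initialized():
            # MAX-reduce: if any rank saw SIGTERM, every rank now sees 1.0,
            # so they all leave the loop at the same batch boundary.
            dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        if flag.item() > 0:
            # Every rank reaches this point together; a checkpoint can be saved
            # here before requesting a clean stop.
            trainer.should_stop = True
```

The cost is one extra scalar all-reduce per batch, which is negligible next to the gradient all-reduce.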
Issue
The `SIGTERMException` is raised in `on_advance_end`. When certain ranks proceed beyond this point and begin the next training step, they become deadlocked while waiting for the ranks that raised the exception. The complete SIGTERM handling logic is detailed in the section below; steps #6-#8 are not executed consistently.

This can lead to the following deadlock condition (a minimal illustration follows the list):

1. All ranks complete gradient sharing and optimization at step N-1.
2. Rank 0 receives SIGTERM, enters the handler, and forwards the SIGTERM to the other ranks.
3. Meanwhile, the other ranks finish step N-1 and begin step N. They wait for rank 0 to join.
4. Rank 0 completes step N-1 and raises `SIGTERMException` in `on_advance_end`.
5. Rank 0 never joins step N, and the other ranks never reach `on_advance_end` in step N, preventing them from raising `SIGTERMException`.
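To make the race concrete, here is a minimal `torch.distributed` illustration of the sequence above. It is not Lightning code; the gloo backend, the port, the two-process setup, and the long sleep standing in for rank 0 never joining step N are all assumptions made for the sketch.

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Step N-1: every rank participates, so this collective completes.
    dist.all_reduce(torch.ones(1))

    if rank == 0:
        # Rank 0 "handles SIGTERM" after step N-1 and never joins step N
        # (the sleep stands in for raising SIGTERMException and tearing down).
        time.sleep(3600)
        return

    # Step N: this rank waits for rank 0 to join the collective and blocks here,
    # so it never reaches the point where it could raise SIGTERMException itself.
    dist.all_reduce(torch.ones(1))


if __name__ == "__main__":
    # Rank 1 stays blocked in the second all_reduce; interrupt the script to exit.
    mp.spawn(worker, args=(2,), nprocs=2)
```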
SIGTERM handling logic in PyTorch Lightning
1. Kubernetes API Server receives a request to abort a job.
2. Kubernetes API Server sends an abort request to the kubelets on every node.
3. The kubelet sends a SIGTERM signal to the main process of the PyTorch container.
4. `_SignalConnector` receives the SIGTERM on local rank 0 (the main process) and logs `[rank: 0] Received SIGTERM: ...` ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/trainer/connectors/signal_connector.py#L105-L113)).
5. `strategy.launcher.kill` forwards the signal to the child processes, logging `Process <parent> is terminating <child> with 15.`
6. Each rank's handler sets `self.received_sigterm = True` and logs `[rank: N] Received SIGTERM: ...` ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0/src/lightning/pytorch/trainer/connectors/signal_connector.py#L113)).
7. `_TrainingEpochLoop.on_advance_end` raises `SIGTERMException` when the batch processing completes ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/loops/training_epoch_loop.py#L385-L386)).
8. The exception is handled via the `on_exception` hook ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/trainer/call.py#L76)).
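For orientation, a condensed, self-contained sketch of steps 4-7, not Lightning's actual implementation (the class and function names are placeholders): it shows that the flag is set asynchronously by the signal handler but only acted on after a batch completes, which is the window the deadlock above exploits.

```python
import os
import signal


class SIGTERMException(SystemExit):
    """Placeholder standing in for Lightning's SIGTERMException."""


class SignalState:
    """Condensed stand-in for the _SignalConnector behaviour in steps 4-6."""

    def __init__(self, child_pids=()):
        self.received_sigterm = False
        self._child_pids = list(child_pids)
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.received_sigterm = True
        # On the main process this mirrors step 5: forward SIGTERM to the child
        # ranks so their own handlers set received_sigterm too (step 6).
        for pid in self._child_pids:
            os.kill(pid, signal.SIGTERM)


def run_epoch(state: SignalState, batches, train_step):
    """Condensed stand-in for the training epoch loop (step 7)."""
    for batch in batches:
        train_step(batch)  # gradient all-reduce and optimizer step happen in here
        # "on_advance_end": the flag is only checked after the batch finishes, so a
        # rank that has already started the next batch cannot raise until that
        # batch's collectives complete -- which is where the deadlock bites.
        if state.received_sigterm:
            raise SIGTERMException()
```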
What version are you seeing the problem on?

v2.5.0.post0
How to reproduce the bug
This issue can be reproduced consistently by introducing a 10-second sleep in the `on_train_batch_end` hook on rank 0, which guarantees that the deadlock condition described above is hit.
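A sketch of that reproduction. Only the 10-second sleep on rank 0 in `on_train_batch_end` comes from the report; the model, dataset, CPU/2-device DDP configuration, and sending SIGTERM to the main process by hand (`kill -TERM <pid>`) are assumptions filled in to make the script self-contained.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import Callback, LightningModule, Trainer


class DelayRankZero(Callback):
    """Widen the race window: rank 0 finishes each batch 10 s after the other ranks."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_rank == 0:
            time.sleep(10)


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(256, 32)), batch_size=8)
    trainer = Trainer(
        accelerator="cpu",
        devices=2,
        strategy="ddp",
        max_epochs=1,
        enable_progress_bar=False,
        callbacks=[DelayRankZero()],
    )
    # While this runs, send SIGTERM to the main process (e.g. `kill -TERM <pid>`).
    # Rank 0 raises SIGTERMException after its delayed batch, while rank 1 has
    # already entered the next batch and hangs waiting for rank 0.
    trainer.fit(TinyModel(), data)
```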