SIGTERMException is not raised consistently across all ranks in DDP #20806

Closed
olegc-wayve opened this issue May 9, 2025 · 1 comment · Fixed by #20825
Labels
bug · needs triage · ver: 2.5.x

Comments


olegc-wayve commented May 9, 2025

Bug description

SIGTERMException is not raised consistently across all ranks in DDP training because PyTorch Lightning doesn't handle SIGTERM well for distributed jobs. As a result, checkpointing on SIGTERM cannot be implemented reliably for DDP without workarounds in client code.

Issue

The SIGTERMException is raised in on_advance_end. When some ranks proceed past this point and begin the next training step, they deadlock waiting for the ranks that raised the exception. The complete SIGTERM handling logic is detailed in the section below; steps #6 - #8 are not executed consistently across ranks.

This can lead to the following deadlock condition:

  • All ranks complete gradient sharing and optimization at step N-1.
  • Rank 0 receives SIGTERM, enters the handler, and forwards the SIGTERM to other ranks.
  • Meanwhile, other ranks finish step N-1 and begin step N. They wait for rank 0 to join.
  • Rank 0 completes step N-1 and raises SIGTERMException in on_advance_end.
    • Rank 0 never joins step N, and other ranks never reach on_advance_end on step N, preventing them from raising SIGTERMException.

Schematically:

[Image: schematic of the deadlock across ranks described above]
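
Simplified, the pattern that produces this deadlock (a standalone sketch, not Lightning's actual code) is a per-process flag that the signal handler sets asynchronously and that each rank then checks independently at the end of its own step:

```python
import signal


# Stand-in for Lightning's SIGTERMException, just for this sketch.
class SIGTERMException(SystemExit):
    pass


received_sigterm = False  # per-process flag, one copy per rank


def _handler(signum, frame):
    global received_sigterm
    received_sigterm = True  # the handler only records the signal


signal.signal(signal.SIGTERM, _handler)


def on_advance_end():
    """Simplified stand-in for the end-of-step check (step 7 below)."""
    # Each rank consults only its *own* flag. Whether it raises here or
    # enters the next step (and its collectives) depends on when the
    # signal arrived on that particular rank -- hence the inconsistency.
    if received_sigterm:
        raise SIGTERMException()
```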

SIGTERM handling logic in PyTorch Lightning

  1. Kubernetes API Server receives a request to abort a job.
  2. Kubernetes API Server sends an abort request to kubelets on every node.
  3. Kubelet sends a SIGTERM signal to the main process of the pytorch container.
    1. It waits for the termination grace period.
    2. It then sends a SIGKILL.
  4. PL _SignalConnector receives the SIGTERM on the local rank 0 (main process) ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/trainer/connectors/signal_connector.py#L105-L113))
    1. It prints [rank: 0] Received SIGTERM: ...
    2. It calls strategy.launcher.kill.
  5. The [DDPStrategy](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/strategies/ddp.py) uses [_MultiProcessingLauncher](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/strategies/launchers/multiprocessing.py). The launcher passes the SIGTERM to ranks 1 - N-1 ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/strategies/launchers/multiprocessing.py#L260-L266C39))
    1. It prints Process <parent> is terminating <child> with 15.
  6. All ranks set self.received_sigterm = True ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0/src/lightning/pytorch/trainer/connectors/signal_connector.py#L113))
    1. It prints [rank: N] Received SIGTERM: ...
  7. PL _TrainingEpochLoop.on_advance_end raises SIGTERMException when the batch processing completes ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/loops/training_epoch_loop.py#L385-L386))
  8. The exception is passed to on_exception hook ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/trainer/call.py#L76))
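
For the check in step 7 to behave consistently, the decision to raise would have to be collective rather than per-rank. A minimal sketch of that idea (illustrative only, not the actual change in #20825):

```python
import torch
import torch.distributed as dist


def sigterm_on_any_rank(local_flag: bool, device: torch.device) -> bool:
    """Return the same answer on every rank: True iff any rank got SIGTERM.

    Because this is a collective (all_reduce), every rank must call it at
    the same point in the loop -- which also guarantees that either all
    ranks raise in this iteration or none of them does.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return local_flag
    flag = torch.tensor(int(local_flag), device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```

If on_advance_end raised only when this agreed-upon flag is True, no rank would drop out of the collectives while the others are still blocking on the next step.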

What version are you seeing the problem on?

v2.5.0.post0

How to reproduce the bug

This issue can be reproduced consistently by introducing a 10-second sleep in the on_train_batch_end hook on rank 0, which guarantees that the deadlock condition described above is hit.
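
A minimal sketch of that reproduction (the module, device count, strategy, and kill command are illustrative assumptions, not taken from the report):

```python
import time

from lightning.pytorch import Callback, Trainer
from lightning.pytorch.demos.boring_classes import BoringModel


class DelayRank0(Callback):
    """Hold rank 0 in on_train_batch_end so the other ranks start the next
    step first; a SIGTERM arriving in that window hits the deadlock."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_rank == 0:
            time.sleep(10)


if __name__ == "__main__":
    trainer = Trainer(
        accelerator="auto",
        devices=2,
        strategy="ddp_spawn",  # uses the _MultiProcessingLauncher from step 5
        max_epochs=1,
        callbacks=[DelayRank0()],
    )
    trainer.fit(BoringModel())
    # While training runs, send SIGTERM to the launcher process, e.g.:
    #   kill -TERM <pid of the main python process>
```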

olegc-wayve added the bug and needs triage labels on May 9, 2025
olegc-wayve (Author) commented

Thank you for the fix @KAVYANSHTYAGI !
