Graceful shutdown when using DDP on SLURM #20649
Unanswered
Unturned3
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
How can we gracefully terminate a Lightning DDP training run on SLURM? Simply running
scancel <jobid>
doesn't seem to perform a "graceful" shutdown the way Ctrl-C does in an interactive, single-GPU run. For example, Weights & Biases still thinks the run is alive (and later displays "Crashed") instead of correctly displaying "Finished", as it does after Ctrl-C.
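To make the difference concrete: Ctrl-C sends SIGINT, which Python turns into a KeyboardInterrupt so cleanup code can run, while scancel (as far as I understand, and depending on cluster configuration and how the job is launched) sends SIGTERM, whose default action ends the process without any Python-level cleanup. Below is a minimal plain-Python sketch of the behavior I'm after. It is not Lightning's own mechanism, and the names `GracefulStop` and `run` are hypothetical, made up for illustration only.

```python
import os
import signal


class GracefulStop:
    """Hypothetical helper: remembers that SIGTERM arrived so a loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread (a CPython requirement for signal.signal).
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the real cleanup happens in the training loop.
        self.requested = True


def run(stop, max_steps=5, kill_at=None):
    """Stand-in for a training loop; returns the number of completed steps."""
    steps = 0
    for step in range(max_steps):
        if stop.requested:
            break  # leave the loop so the cleanup below still runs
        steps += 1  # placeholder for one training step
        if step == kill_at:
            # Simulate an external SIGTERM (assumed here to be what scancel delivers).
            os.kill(os.getpid(), signal.SIGTERM)
    # Cleanup (closing loggers, saving a checkpoint, ...) would go here.
    return steps


if __name__ == "__main__":
    stop = GracefulStop()
    stop.install()
    completed = run(stop, max_steps=5, kill_at=1)
    print("stop requested:", stop.requested, "completed steps:", completed)
```

In a real DDP job every rank would need to see the signal and agree to stop, which is the part I don't know how to do properly with Lightning.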
In general, I'm confused about how Lightning handles graceful shutdowns; the documentation seems quite sparse on this topic. Thanks in advance for any help or suggestions!