Feature incompatibilities with HPC/Slurm saving & loading #9407
Comments
Can you clarify which user-facing components will be deprecated?
This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
@ananthsub Any updates?
@carmocca @tchaton: concretely …
Questions: …
Could be removed instead of deprecated, as this is hidden inside the …
Hey @ananthsub, most of this logic is used within … And I would rename …
@carmocca @tchaton @jjenniferdai - as a first pass, what do you think of deprecating the …?
I think it makes sense; users could still have SLURM-specific behavior in … (a rough sketch of what that could look like follows below).
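As a rough illustration of what SLURM-specific behavior in user code could look like, here is a minimal sketch of a user-owned callback, assuming SLURM is configured to deliver SIGUSR1 before preemption. The callback name, signal choice, environment check, and checkpoint path are all assumptions for illustration, not existing Lightning API.

```python
import os
import signal

from pytorch_lightning import Callback


class SlurmPreemptionCheckpoint(Callback):
    """Hypothetical user-side callback: save a checkpoint when SLURM
    signals that the job is about to be preempted or requeued."""

    def setup(self, trainer, pl_module, stage=None):
        if "SLURM_JOB_ID" not in os.environ:
            return  # not running under SLURM; do nothing

        def _on_preempt(signum, frame):
            # The path is a placeholder; SLURM must be launched with
            # e.g. `sbatch --signal=SIGUSR1@90` for this signal to arrive.
            trainer.save_checkpoint("slurm_preempt.ckpt")

        signal.signal(signal.SIGUSR1, _on_preempt)
```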
I think what this does get at (as mentioned offline) is that checkpointing means multiple things to people, and we need to distinguish: …
Hey @ananthsub. Yes, I believe we can deprecate them.
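For context, deprecations in Lightning are typically surfaced through a warning emitted once on rank zero. A minimal sketch of such a shim, assuming the `rank_zero_deprecation` utility; the function name, message, and suggested replacement are illustrative only, not the agreed plan:

```python
from pytorch_lightning.utilities import rank_zero_deprecation


def hpc_save(folderpath: str) -> str:
    # Hypothetical shim: the name and the suggested replacement are
    # placeholders for whichever components are ultimately deprecated.
    rank_zero_deprecation(
        "`hpc_save` is deprecated and will be removed in a future release."
        " Use `Trainer.save_checkpoint` instead."
    )
    return folderpath
```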
How about the following for next steps? …
@tchaton @carmocca @ananthsub @awaelchli thoughts on ^?
Yes, very good!
Hey @jjenniferdai, sounds good to me. I would make …
Proposed refactoring or deprecation
Proposal: Deprecate dedicated HPC saving & loading
Part of #7740
Motivation
Feature compatibility for HPC checkpointing in Lightning is dropping quickly.
Moving forward, I am concerned about supporting four distinct codepaths for saving checkpoints, given what's happened with HPC. The paths that exist or are being worked on to trigger saving: …
All of these differ in when & where checkpoints are saved, which in turn impacts how the trainer resumes from these checkpoints: #9405. Two of the public save paths are sketched below.
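For illustration, here is a minimal sketch of two checkpoint-saving entry points that are public API today, `ModelCheckpoint` and `Trainer.save_checkpoint`; the model, metric name, and paths are placeholders. The dedicated HPC path and fault-tolerant saving are not shown.

```python
import torch
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class TinyModel(LightningModule):
    """Placeholder model, just enough to make the example runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # Log on epoch so the monitored metric exists when the callback checks it.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def train_dataloader(self):
        return DataLoader(torch.randn(8, 1), batch_size=4)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# 1. Callback-driven saving, configured through `ModelCheckpoint`.
checkpoint_callback = ModelCheckpoint(monitor="train_loss", save_top_k=1)
trainer = Trainer(max_epochs=1, callbacks=[checkpoint_callback])
trainer.fit(TinyModel())

# 2. Manual saving, triggered explicitly by the user.
trainer.save_checkpoint("manual.ckpt")
```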
Pitch
Additional context
cc @Borda @justusschock @awaelchli @akihironitta @ananthsub @ninginthecloud @tchaton