Feature incompatibilities with HPC/Slurm saving & loading #9407

Closed
ananthsub opened this issue Sep 9, 2021 · 14 comments
Labels: checkpointing (Related to checkpointing), deprecation (Includes a deprecation), environment: slurm, feature (Is an improvement or enhancement), help wanted (Open to be worked on), refactor

ananthsub (Contributor) commented Sep 9, 2021

Proposed refactoring or deprecation

Proposal: Deprecate dedicated HPC saving & loading
Part of #7740

Motivation

Feature compatibility for HPC checkpointing in Lightning is dropping quickly: newer features increasingly do not work with the dedicated HPC save/load path.

Moving forward, I am concerned about supporting four distinct codepaths for saving checkpoints, given what has happened with HPC. The paths that exist or are being worked on to trigger saving are (a sketch of the first two follows below):

  • Using a checkpoint callback during fitting
  • Calling trainer.save_checkpoint directly
  • Checking for HPC/SLURM preemption through signal handlers
  • Checkpointing on exception if fault tolerance is enabled

All of these differ in when and where checkpoints are saved, which in turn affects how the trainer resumes from those checkpoints: #9405
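
For illustration, here is a minimal sketch of the first two (user-facing) paths; the directory, filename, and monitored metric below are placeholders.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Path 1: a checkpoint callback saves automatically during fitting
checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss", save_top_k=1)
trainer = Trainer(callbacks=[checkpoint_callback])

# Path 2: the user triggers a save explicitly, typically after (or during) fitting
# trainer.fit(model)
# trainer.save_checkpoint("manual.ckpt")

The remaining two paths (SLURM preemption via signal handlers, and checkpointing on exception under fault tolerance) are triggered internally by the Trainer rather than by user code.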


cc @Borda @justusschock @awaelchli @akihironitta @ananthsub @ninginthecloud @tchaton

carmocca (Contributor) commented Oct 6, 2021

Can you clarify what user-facing components will be deprecated?
What would be the impact on the users?
Would any features be removed?
And the on_hpc_save hook?

stale bot commented Nov 6, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the "won't fix" (This will not be worked on) label Nov 6, 2021
tchaton (Contributor) commented Nov 8, 2021

@ananthsub Any updates?

stale bot removed the "won't fix" (This will not be worked on) label Nov 8, 2021
carmocca (Contributor) commented
Is the hpc_resume_path here the same checkpoint used for general preemption/fault tolerance?

.pl_auto_save_ckpt is the fault-tolerant checkpoint, but this should probably not be there.

Regarding "Deprecate CheckpointConnector.hpc_save": this could be removed instead of deprecated, as it is hidden inside the CheckpointConnector, which is not a public component.

tchaton (Contributor) commented Nov 10, 2021

Hey @ananthsub,

Most of this logic is used within slurm_sigusr1_handler_fn to automatically restart SLURM training. I believe we can refactor this code to live within the handler instead.

And I would rename hpc_resume_path to auto_resume_path, which would then be used for Fault Tolerant Auto Restart and possibly Elastic Training in the future.
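
For reference, here is a hedged sketch of what a consolidated handler could look like. This is not the actual slurm_sigusr1_handler_fn; the handler-factory shape and the requeue call are assumptions, while hpc_save_path and weights_save_path come from the existing CheckpointConnector discussed in this thread.

import os
import signal
import subprocess


def make_sigusr1_handler(trainer):
    # Hypothetical factory returning a handler closed over the trainer,
    # suitable for signal.signal(signal.SIGUSR1, ...)
    def handler(signum, frame):
        # Save a checkpoint before SLURM preempts the job
        save_path = trainer.checkpoint_connector.hpc_save_path(trainer.weights_save_path)
        trainer.save_checkpoint(save_path)

        # Requeue the job so it restarts and resumes from the saved checkpoint
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id is not None:
            subprocess.call(["scontrol", "requeue", job_id])

    return handler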

ananthsub (Contributor, Author) commented
@carmocca @tchaton @jjenniferdai - as a first pass, what do you think of deprecating the on_hpc_save and on_hpc_load hooks?

carmocca (Contributor) commented Dec 2, 2021

I think it makes sense. Users could still implement SLURM-specific behavior in on_load_checkpoint and check isinstance(environment, SLURMEnvironment) (or use a proxy property on the Trainer for that), roughly as in the sketch below.
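
For illustration, a minimal sketch of that pattern; trainer.cluster_environment stands in for the proposed proxy property and is not an existing Trainer attribute.

from pytorch_lightning import LightningModule
from pytorch_lightning.plugins.environments import SLURMEnvironment


class MyModel(LightningModule):
    def on_load_checkpoint(self, checkpoint: dict) -> None:
        # hypothetical proxy property; today the cluster environment is not exposed here
        env = getattr(self.trainer, "cluster_environment", None)
        if isinstance(env, SLURMEnvironment):
            # SLURM-specific restore logic that previously lived in on_hpc_load
            ...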

jjenniferdai self-assigned this Dec 2, 2021
ananthsub added the "deprecation" (Includes a deprecation) label Dec 2, 2021
ananthsub (Contributor, Author) commented Dec 2, 2021

I think what this gets at (as mentioned offline) is that checkpointing means multiple things to people, and we need to distinguish:

  1. checkpointing as a mechanism to resume state after intermittent failures; this should be transparent to the end user
  2. checkpointing used to generate artifacts that can be used later on, such as for fine-tuning or inference; here users want to dictate which state is saved. Maybe they only need particular modules, or they want to transform those modules before saving them.

tchaton (Contributor) commented Dec 3, 2021

Hey @ananthsub. Yes, I believe we can deprecate them.

awaelchli added this to the 1.6 milestone Dec 4, 2021
jjenniferdai (Contributor) commented Dec 11, 2021

How about the following for next steps?

  1. Move auto_save_path out of hpc_resume_path into its own checkpoint_connector.auto_save_path property (see the sketch at the end of this comment).
    Update accordingly: https://github.com/PyTorchLightning/pytorch-lightning/blob/ed84cef3afa8db1381162cf86bf2992bce71f9fb/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L70 -->
    self.resume_checkpoint_path = self.hpc_resume_path or self.auto_save_path or checkpoint_path

  2. Remove hpc_save:

  • Until removal in v1.8, move the on_hpc_save call into dump_checkpoint, and replace the hpc_save call in the SLURM signal handler with a plain trainer.save_checkpoint:

# inside dump_checkpoint: call the deprecated hook only for SLURM auto-requeue runs
if isinstance(environment, SLURMEnvironment) and environment.auto_requeue:
    model.on_hpc_save(ckpt)

# in the SLURM signal handler, replacing hpc_save: finalize logging, then save
# through the regular trainer.save_checkpoint path
if self.trainer.logger:
    self.trainer.logger.finalize("finished")
hpc_save_path = self.trainer.checkpoint_connector.hpc_save_path(self.trainer.weights_save_path)
self.trainer.save_checkpoint(hpc_save_path)
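
For concreteness, here is a hedged sketch of what the step-1 split could look like. This is not the actual CheckpointConnector code; the .pl_auto_save.ckpt filename, the glob pattern, and the weights_save_path lookup are assumptions based on this thread.

import glob
import os
from typing import Optional


class CheckpointConnector:
    def __init__(self, trainer) -> None:
        self.trainer = trainer
        self.resume_checkpoint_path: Optional[str] = None

    @property
    def hpc_resume_path(self) -> Optional[str]:
        # Only SLURM/HPC preemption checkpoints (hpc_ckpt_<N>.ckpt) are considered here now
        dir_path = str(self.trainer.weights_save_path)
        candidates = glob.glob(os.path.join(dir_path, "hpc_ckpt_*.ckpt"))
        if not candidates:
            return None
        # pick the checkpoint with the highest numeric suffix
        return max(candidates, key=lambda p: int(os.path.basename(p)[len("hpc_ckpt_"):-len(".ckpt")]))

    @property
    def auto_save_path(self) -> Optional[str]:
        # The fault-tolerant auto-save checkpoint, split out of hpc_resume_path
        path = os.path.join(str(self.trainer.weights_save_path), ".pl_auto_save.ckpt")
        return path if os.path.exists(path) else None

    def resume_start(self, checkpoint_path: Optional[str] = None) -> None:
        # Priority: HPC preemption checkpoint, then the fault-tolerant auto-save,
        # then any user-provided checkpoint path
        self.resume_checkpoint_path = self.hpc_resume_path or self.auto_save_path or checkpoint_path

Keeping the two properties separate makes the resume priority explicit and would let hpc_resume_path be deprecated on its own later.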

jjenniferdai (Contributor) commented
@tchaton @carmocca @ananthsub @awaelchli thoughts on ^ ?

awaelchli (Contributor) commented
Yes, very good!

tchaton (Contributor) commented Dec 17, 2021

Hey @jjenniferdai, sounds good to me. I would make hpc_save_path private though.
