Feature incompatibilities with HPC/Slurm saving & loading #9407

Closed
ananthsub opened this issue Sep 9, 2021 · 14 comments
Labels: checkpointing (Related to checkpointing), deprecation (Includes a deprecation), environment: slurm, feature (Is an improvement or enhancement), help wanted (Open to be worked on), refactor

ananthsub (Contributor) commented Sep 9, 2021

Proposed refactoring or deprecation

Proposal: Deprecate dedicated HPC saving & loading
Part of #7740

Motivation

Feature compatibility for HPC checkpointing in Lightning is dropping quickly: newer features increasingly do not work with the dedicated HPC save/load path.

Moving forward, I am concerned about supporting four distinct codepaths for saving checkpoints, given what has happened with HPC. The paths that exist or are being worked on to trigger saving are (a sketch of the first two follows below):

  • Using a checkpoint callback during fitting
  • Calling trainer.save_checkpoint directly
  • Checking for HPC/SLURM preemption through signal handlers
  • Checkpointing on exception if fault tolerance is enabled

All of these differ in when and where checkpoints are saved, which in turn affects how the trainer resumes from those checkpoints: #9405
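
For illustration, here is a minimal sketch of the first two (user-facing) paths; the directory, filename, and monitored metric below are placeholders.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Path 1: a checkpoint callback saves automatically during fitting
checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss", save_top_k=1)
trainer = Trainer(callbacks=[checkpoint_callback])

# Path 2: the user triggers a save explicitly, typically after (or during) fitting
# trainer.fit(model)
# trainer.save_checkpoint("manual.ckpt")

The remaining two paths (SLURM preemption via signal handlers, and checkpointing on exception under fault tolerance) are triggered internally by the Trainer rather than by user code.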


cc @Borda @justusschock @awaelchli @akihironitta @ananthsub @ninginthecloud @tchaton

carmocca (Contributor) commented Oct 6, 2021

Can you clarify what user-facing components will be deprecated?
What would be the impact on the users?
Would any features be removed?
And the on_hpc_save hook?

stale bot commented Nov 6, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the "won't fix" (This will not be worked on) label Nov 6, 2021
tchaton (Contributor) commented Nov 8, 2021

@ananthsub Any updates?

stale bot removed the "won't fix" (This will not be worked on) label Nov 8, 2021
carmocca (Contributor) commented
Is the hpc_resume_path here the same checkpoint used for general preemption/fault tolerance?

.pl_auto_save_ckpt is the fault-tolerant checkpoint, but this should probably not be there.

Regarding "Deprecate CheckpointConnector.hpc_save": this could be removed instead of deprecated, as it is hidden inside the CheckpointConnector, which is not a public component.

tchaton (Contributor) commented Nov 10, 2021

Hey @ananthsub,

Most of this logic is used within slurm_sigusr1_handler_fn to automatically restart SLURM training. I believe we can refactor this code to live within the handler instead.

And I would rename hpc_resume_path to auto_resume_path, which would then be used for Fault Tolerant Auto Restart and possibly Elastic Training in the future.
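
For reference, here is a hedged sketch of what a consolidated handler could look like. This is not the actual slurm_sigusr1_handler_fn; the handler-factory shape and the requeue call are assumptions, while hpc_save_path and weights_save_path come from the existing CheckpointConnector discussed in this thread.

import os
import signal
import subprocess


def make_sigusr1_handler(trainer):
    # Hypothetical factory returning a handler closed over the trainer,
    # suitable for signal.signal(signal.SIGUSR1, ...)
    def handler(signum, frame):
        # Save a checkpoint before SLURM preempts the job
        save_path = trainer.checkpoint_connector.hpc_save_path(trainer.weights_save_path)
        trainer.save_checkpoint(save_path)

        # Requeue the job so it restarts and resumes from the saved checkpoint
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id is not None:
            subprocess.call(["scontrol", "requeue", job_id])

    return handler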

ananthsub (Contributor, Author) commented
@carmocca @tchaton @jjenniferdai - as a first pass, what do you think of deprecating the on_hpc_save and on_hpc_load hooks?

carmocca (Contributor) commented Dec 2, 2021

I think it makes sense. Users could still implement SLURM-specific behavior in on_load_checkpoint and check isinstance(environment, SLURMEnvironment) (or use a proxy property on the Trainer for that), roughly as in the sketch below.
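
For illustration, a minimal sketch of that pattern; trainer.cluster_environment stands in for the proposed proxy property and is not an existing Trainer attribute.

from pytorch_lightning import LightningModule
from pytorch_lightning.plugins.environments import SLURMEnvironment


class MyModel(LightningModule):
    def on_load_checkpoint(self, checkpoint: dict) -> None:
        # hypothetical proxy property; today the cluster environment is not exposed here
        env = getattr(self.trainer, "cluster_environment", None)
        if isinstance(env, SLURMEnvironment):
            # SLURM-specific restore logic that previously lived in on_hpc_load
            ...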

jjenniferdai self-assigned this Dec 2, 2021
ananthsub added the "deprecation" (Includes a deprecation) label Dec 2, 2021
ananthsub (Contributor, Author) commented Dec 2, 2021

I think what this gets at (as mentioned offline) is that checkpointing means multiple things to people, and we need to distinguish:

  1. checkpointing as a mechanism to resume state after intermittent failures; this should be transparent to the end user
  2. checkpointing used to generate artifacts that can be used later on, such as for fine-tuning or inference; here users want to dictate which state is saved. Maybe they only need particular modules, or they want to transform those modules before saving them.

tchaton (Contributor) commented Dec 3, 2021

Hey @ananthsub. Yes, I believe we can deprecate them.

awaelchli added this to the 1.6 milestone Dec 4, 2021
jjenniferdai (Contributor) commented Dec 11, 2021

How about the following for next steps?

  1. Move auto_save_path out of hpc_resume_path into its own checkpoint_connector.auto_save_path property (see the sketch at the end of this comment).
    Update accordingly: https://github.com/PyTorchLightning/pytorch-lightning/blob/ed84cef3afa8db1381162cf86bf2992bce71f9fb/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L70 -->
    self.resume_checkpoint_path = self.hpc_resume_path or self.auto_save_path or checkpoint_path

  2. Remove hpc_save:

  • Until removal in v1.8, move the on_hpc_save call into dump_checkpoint, and replace the hpc_save call in the SLURM signal handler with a plain trainer.save_checkpoint:

# inside dump_checkpoint: call the deprecated hook only for SLURM auto-requeue runs
if isinstance(environment, SLURMEnvironment) and environment.auto_requeue:
    model.on_hpc_save(ckpt)

# in the SLURM signal handler, replacing hpc_save: finalize logging, then save
# through the regular trainer.save_checkpoint path
if self.trainer.logger:
    self.trainer.logger.finalize("finished")
hpc_save_path = self.trainer.checkpoint_connector.hpc_save_path(self.trainer.weights_save_path)
self.trainer.save_checkpoint(hpc_save_path)
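
For concreteness, here is a hedged sketch of what the step-1 split could look like. This is not the actual CheckpointConnector code; the .pl_auto_save.ckpt filename, the glob pattern, and the weights_save_path lookup are assumptions based on this thread.

import glob
import os
from typing import Optional


class CheckpointConnector:
    def __init__(self, trainer) -> None:
        self.trainer = trainer
        self.resume_checkpoint_path: Optional[str] = None

    @property
    def hpc_resume_path(self) -> Optional[str]:
        # Only SLURM/HPC preemption checkpoints (hpc_ckpt_<N>.ckpt) are considered here now
        dir_path = str(self.trainer.weights_save_path)
        candidates = glob.glob(os.path.join(dir_path, "hpc_ckpt_*.ckpt"))
        if not candidates:
            return None
        # pick the checkpoint with the highest numeric suffix
        return max(candidates, key=lambda p: int(os.path.basename(p)[len("hpc_ckpt_"):-len(".ckpt")]))

    @property
    def auto_save_path(self) -> Optional[str]:
        # The fault-tolerant auto-save checkpoint, split out of hpc_resume_path
        path = os.path.join(str(self.trainer.weights_save_path), ".pl_auto_save.ckpt")
        return path if os.path.exists(path) else None

    def resume_start(self, checkpoint_path: Optional[str] = None) -> None:
        # Priority: HPC preemption checkpoint, then the fault-tolerant auto-save,
        # then any user-provided checkpoint path
        self.resume_checkpoint_path = self.hpc_resume_path or self.auto_save_path or checkpoint_path

Keeping the two properties separate makes the resume priority explicit and would let hpc_resume_path be deprecated on its own later.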

jjenniferdai (Contributor) commented
@tchaton @carmocca @ananthsub @awaelchli thoughts on ^ ?

awaelchli (Contributor) commented
Yes, very good!

tchaton (Contributor) commented Dec 17, 2021

Hey @jjenniferdai, sounds good to me. I would make hpc_save_path private though.
