Support serialized checkpoint loading #9406


Closed
ananthsub opened this issue Sep 9, 2021 · 2 comments · Fixed by #9605
Assignees: ananthsub
Labels: checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), help wanted (Open to be worked on), let's do it! (approved to implement)

Comments

@ananthsub
Contributor

ananthsub commented Sep 9, 2021

🚀 Feature

Motivation

Currently, all processes load the checkpoint at the same time. This can lead to CPU OOMs for large models when processes concurrently load the checkpoint. These use cases, especially with things like mixture of experts, might require serialized loading of checkpoint dicts across ranks (i.e., load the checkpoint one rank at a time per node). Could we enable this for DDP?

Prior work: #8515

Pitch

This would be controlled per training type plugin. Example pseudocode: https://gist.github.com/ananthsub/4ceedff56b2049a63bbb05ccd283b919
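The gist is not reproduced here, but a minimal sketch of the idea — one local rank per node deserializing the checkpoint at a time, synchronized with barriers — could look like the following. The helper name `load_checkpoint_serially` and the reliance on the `LOCAL_RANK` / `LOCAL_WORLD_SIZE` environment variables (as set by launchers like `torchrun`) are assumptions for illustration, not existing Lightning API:

```python
import os

import torch
import torch.distributed as dist


def load_checkpoint_serially(checkpoint_path: str, map_location="cpu"):
    """Hypothetical helper: load a checkpoint one local rank at a time per node.

    Assumes LOCAL_RANK / LOCAL_WORLD_SIZE are set by the launcher (e.g. torchrun).
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))

    checkpoint = None
    for turn in range(local_world_size):
        if turn == local_rank:
            # Only one rank per node deserializes the (potentially huge) dict now.
            checkpoint = torch.load(checkpoint_path, map_location=map_location)
        if dist.is_available() and dist.is_initialized():
            # All ranks wait until this turn finishes before the next rank starts loading.
            dist.barrier()
    return checkpoint
```

Whether this is opt-in or the default is up to the plugin; the point is only that the training type plugin is the natural place to own the serialization policy.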

To work through:

  1. Should the TrainingTypePlugin be responsible for calling LightningModule.on_load_checkpoint instead of the Trainer/connector? This would make sense since the TTP "owns" the LightningModule inside the Trainer, and it already offers load_model_state_dict: https://github.com/PyTorchLightning/pytorch-lightning/blob/41ba639859cf6c6bf319eb33e5b3394504315962/pytorch_lightning/plugins/training_type/training_type_plugin.py#L159-L160

DeepSpeed already eschews most of the checkpoint connector logic when it comes to loading the LightningModule state. This could be a gap for metrics, and it means on_load_checkpoint could be called multiple times with certain plugins. In my opinion, this points to all LightningModule state load/save/alteration logic needing to sit inside the training type plugin (see the sketch below).
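To make the ownership question concrete, here is a rough, hypothetical sketch of what moving the hook call into the plugin could look like. `load_checkpoint_state` is an invented method name for illustration; only `load_model_state_dict` mirrors the existing plugin API linked above:

```python
from typing import Any, Dict


class TrainingTypePlugin:
    """Sketch only — not actual Lightning code."""

    def __init__(self, lightning_module) -> None:
        self.lightning_module = lightning_module

    def load_model_state_dict(self, checkpoint: Dict[str, Any]) -> None:
        # Mirrors the existing plugin hook: restore module weights from the checkpoint dict.
        self.lightning_module.load_state_dict(checkpoint["state_dict"])

    def load_checkpoint_state(self, checkpoint: Dict[str, Any]) -> None:
        # Proposed: the plugin calls on_load_checkpoint exactly once and then restores
        # the state dict, instead of the Trainer/checkpoint connector doing it. Plugins
        # like DeepSpeed that bypass the connector would then still run the hook once.
        self.lightning_module.on_load_checkpoint(checkpoint)
        self.load_model_state_dict(checkpoint)
```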

Alternatives

Additional context



@ananthsub added the feature (Is an improvement or enhancement), help wanted (Open to be worked on), and checkpointing (Related to checkpointing) labels on Sep 9, 2021
@ananthsub self-assigned this on Sep 9, 2021
@carmocca
Contributor

carmocca commented Sep 9, 2021

More previous work: #7509

@tchaton added the let's do it! (approved to implement) label on Sep 10, 2021
@jjenniferdai
Contributor

I'm planning to work on this this week if that's ok!
