Serialize checkpoint loading on each node #7509

maximsch2 · 2021-05-12T20:29:05Z

What does this PR do?

Loading large checkpoints across multiple workers on the same host can lead to OOMs. An easy case to imagine is model_size*num_gpus < total RAM < 2*model_size*num_gpus: each worker pays a 2x penalty because it holds the loaded checkpoint alongside the model until the checkpoint is set into the model's state_dict. Serializing the process helps, because only one local worker loads at a time.
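To make the inequality concrete, here is a rough back-of-the-envelope sketch. The function `peak_host_ram` and the numbers are hypothetical and not part of this PR; it assumes every worker keeps its model copy in host RAM and that a loading worker briefly holds one extra checkpoint-sized copy.

```python
def peak_host_ram(model_size_gb: float, num_gpus: int, serialized: bool) -> float:
    """Rough peak host RAM in GB (hypothetical helper, simplified model).

    Assumes one worker per GPU and that a worker holds both its model and
    a full checkpoint copy while it is loading.
    """
    if serialized:
        # All workers hold a model, but only one holds a checkpoint copy at a time.
        return model_size_gb * num_gpus + model_size_gb
    # All workers hold a model and a checkpoint copy at the same moment.
    return 2 * model_size_gb * num_gpus


# Example: a 10 GB model on 8 GPUs with 128 GB of host RAM.
total_ram_gb = 128
concurrent = peak_host_ram(10, 8, serialized=False)
one_at_a_time = peak_host_ram(10, 8, serialized=True)
print(concurrent > total_ram_gb, one_at_a_time <= total_ram_gb)
```

Under these assumptions the concurrent load needs 160 GB and would not fit, while the serialized load peaks at 90 GB.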

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-05-12T20:30:29Z

Codecov Report

Merging #7509 (81508f6) into master (20f6337) will decrease coverage by 5%.
The diff coverage is 43%.

@@           Coverage Diff           @@
##           master   #7509    +/-   ##
=======================================
- Coverage      92%     88%    -5%     
=======================================
  Files         197     197            
  Lines       12878   12884     +6     
=======================================
- Hits        11899   11314   -585     
- Misses        979    1570   +591

pep8speaks · 2021-05-13T17:41:40Z

Hello @maximsch2! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-17 17:58:59 UTC

carmocca · 2021-05-17T15:25:44Z

pytorch_lightning/trainer/trainer.py

@@ -143,7 +143,8 @@ def __init__(
        distributed_backend: Optional[str] = None,
        move_metrics_to_cpu: bool = False,
        multiple_trainloader_mode: str = 'max_size_cycle',
-        stochastic_weight_avg: bool = False
+        stochastic_weight_avg: bool = False,
+        serialize_checkpoint_loading: bool = False


Do you think a trainer flag is necessary here? Is it too slow to always serialize?

What about us making the choice by comparing the ram available and model size?

maximsch2 · May 17, 2021

Always serializing can be unsafe. Specifically, if you do any NCCL communication in on_load_checkpoint and assume it happens at the same time on all hosts, those calls will deadlock under serialization. Hence this is off by default.
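For readers following the thread, a minimal sketch of the turn-taking idea and of why collectives inside the load step are a problem. This is not the PR's actual code: `load_serialized` and `simulate` are hypothetical names, and threads with a `threading.Barrier` stand in for local workers and a distributed barrier.

```python
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_serialized(
    local_rank: int,
    local_world_size: int,
    load_fn: Callable[[], T],
    barrier: Callable[[], object],
) -> Optional[T]:
    """Run load_fn on one local rank at a time (hypothetical helper)."""
    result = None
    for turn in range(local_world_size):
        if turn == local_rank:
            # Only this rank holds a freshly loaded checkpoint copy right now.
            result = load_fn()
        # Every rank reaches the barrier once per turn. A collective call
        # inside load_fn would wait for ranks that are parked here instead,
        # which is the deadlock described above.
        barrier()
    return result


def simulate(num_workers: int = 4) -> int:
    """Run the scheme with threads; return the peak number of concurrent loads."""
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_load() -> str:
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        with lock:
            state["active"] -= 1
        return "checkpoint"

    threads = [
        threading.Thread(
            target=load_serialized,
            args=(rank, num_workers, fake_load, barrier.wait),
        )
        for rank in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["peak"]
```

In a real multi-process job the barrier would presumably be a distributed one scoped to the node, and `load_fn` would wrap the checkpoint load and state_dict restore; both are assumptions here.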


tchaton · 2021-05-21T12:11:21Z

pytorch_lightning/trainer/trainer.py

@@ -143,7 +143,8 @@ def __init__(
        distributed_backend: Optional[str] = None,
        move_metrics_to_cpu: bool = False,
        multiple_trainloader_mode: str = 'max_size_cycle',
-        stochastic_weight_avg: bool = False
+        stochastic_weight_avg: bool = False,
+        serialize_checkpoint_loading: bool = False


IMO, sequential_checkpoint_loading would be easier to understand.

tchaton · 2021-05-21T12:11:27Z

pytorch_lightning/trainer/trainer.py

-            ckpt_path, map_location=lambda storage, loc: storage
-        )
+        # Serialize checkpoint loading to avoid OOMs
+        if self.serialize_checkpoint_loading and self.num_gpus > 0:


Why not leave this responsibility to the training_type_plugin? I guess this is useful only for DDP and its derivatives right now.

stale · 2021-06-04T13:29:54Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale · 2021-06-09T16:00:41Z

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

Serialize checkpoint loading on each node

af6640d

handle zero gpu case

1015924

maximsch2 changed the title ~~WIP: Serialize checkpoint loading on each node~~ Serialize checkpoint loading on each node May 12, 2021

maximsch2 marked this pull request as ready for review May 12, 2021 23:55

maximsch2 requested review from awaelchli, tchaton and williamFalcon as code owners May 12, 2021 23:56

maximsch2 requested a review from ananthsub May 13, 2021 00:07

hide this behind a flag and add changelog entry

0d9348d

maximsch2 requested review from Borda, carmocca, justusschock, kaushikb11 and SeanNaren as code owners May 13, 2021 17:41

carmocca reviewed May 17, 2021

carmocca added the feature Is an improvement or enhancement label May 17, 2021

maximsch2 added 2 commits May 17, 2021 10:57

Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-…

2da9caf

…lightning into serialize_checkpoint_loading

pep8

81508f6

tchaton reviewed May 21, 2021

stale bot added the won't fix This will not be worked on label Jun 4, 2021

stale bot closed this Jun 9, 2021

mleshen mentioned this pull request Jun 20, 2021

OOM issues with loading large model checkpoints w/ FSDP after checkpoint refactor #8043

Closed

carmocca mentioned this pull request Sep 9, 2021

Support serialized checkpoint loading #9406

Closed
