Avoid CPU OOM when loading full-state FSDP checkpoints in Fabric #18138
What does this PR do?
Fixes #18008
Related: #8043
Resolves a comment in the code regarding the loading of full-state checkpoints into an FSDP model: before this PR, each worker loads its own copy of the checkpoint into CPU memory. For example, on a machine with 8 GPUs and a 10 GB checkpoint, loading the state dict occupies 8 * 10 = 80 GB of CPU memory. If there is not enough CPU RAM, we get an OOM.
This PR implements a strategy in which the state dict is loaded sequentially across the local ranks, as sketched below. The drawback is that loading takes N times longer, where N is the number of GPUs on the machine.
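The following is a minimal sketch of the rank-sequential idea, not the actual Fabric implementation: the helper name, its arguments (`module`, `path`, `local_rank`, `local_world_size`), and the plain `load_state_dict` call are assumptions for illustration only.

```python
import torch
import torch.distributed as dist


def load_full_checkpoint_sequentially(module, path, local_rank, local_world_size):
    """Hypothetical helper: load a full-state checkpoint one local rank at a time.

    At most one process per node holds the full checkpoint in CPU memory at any
    moment, so peak host memory stays near the checkpoint size instead of
    growing with the number of GPUs.
    """
    for rank in range(local_world_size):
        if rank == local_rank:
            # Only the current rank materializes the checkpoint on CPU.
            state_dict = torch.load(path, map_location="cpu")
            # With a real FSDP-wrapped module this would happen under the
            # appropriate FSDP state-dict-type context; a plain call is used
            # here for brevity.
            module.load_state_dict(state_dict)
            # Release the host-memory copy before the next rank starts loading.
            del state_dict
        # Every process waits until the current rank has finished.
        dist.barrier()
```

The trade-off mentioned above is visible in the sketch: the barrier serializes the `torch.load` calls, so total loading time grows with the number of local ranks while peak CPU memory stays bounded.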
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:
Reviewer checklist