Follow up — changes to load model state in checkpoint connector in case of multiple workers #8044 #8515


Closed
wants to merge 22 commits into from

Conversation


@mleshen mleshen commented Jul 21, 2021

What does this PR do?

Fixes #<issue_number>

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)
  • Did you list all the breaking changes introduced by this pull request?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Jul 21, 2021

Codecov Report

Merging #8515 (c7f8c8c) into master (366fb39) will increase coverage by 0%.
The diff coverage is n/a.

❗ Current head c7f8c8c differs from pull request most recent head 6388a33. Consider uploading reports for the commit 6388a33 to get more accurate results

@@           Coverage Diff           @@
##           master   #8515    +/-   ##
=======================================
  Coverage      92%     92%            
=======================================
  Files         175     218    +43     
  Lines       14696   14412   -284     
=======================================
- Hits        13508   13308   -200     
+ Misses       1188    1104    -84     

@awaelchli awaelchli added this to the v1.5 milestone Jul 22, 2021
@awaelchli awaelchli added the feature (Is an improvement or enhancement) and design (Includes a design discussion) labels Jul 22, 2021
Comment on lines 188 to 195

    for current_worker in range(self.num_processes):
        if self.local_rank == current_worker:
            checkpoint = super().load_checkpoint_file(checkpoint_path)
            self.lightning_module.on_load_checkpoint(checkpoint)
            self.load_model_state_dict(checkpoint)
            log.info(f"Rank {self.global_rank}: done loading model states from {checkpoint_path}.")
            del checkpoint["state_dict"]
        self.barrier()
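
(For context: the loop above serializes checkpoint loading across ranks, so at most one full, unsharded checkpoint is resident in host memory at a time. Below is a minimal standalone sketch of the same idea, assuming an already-initialized torch.distributed process group; the names are illustrative, not the plugin's API.)

    import torch
    import torch.distributed as dist

    def serialized_load(checkpoint_path: str, model: torch.nn.Module) -> None:
        # Each rank takes its turn deserializing the checkpoint while the others wait
        # at a barrier, so only one full copy of the state dict lives in CPU memory
        # at any moment.
        for current_worker in range(dist.get_world_size()):
            if dist.get_rank() == current_worker:
                checkpoint = torch.load(checkpoint_path, map_location="cpu")
                model.load_state_dict(checkpoint["state_dict"])
                del checkpoint  # free the full state dict before the next rank loads
            dist.barrier()
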
Contributor

this is pretty smart :)

One concern: the responsibility of calling the hooks now shifts to the plugin. Do we want to allow that?
@PyTorchLightning/core-contributors

Member

In this case, I don't see a problem with it. The only thing I am afraid of is that we may set a precedent here, and that might lead to this pattern being used where we don't want it.

What do you think @SeanNaren @awaelchli @ananthsub @carmocca ?

@mleshen mleshen requested a review from edenlightning as a code owner July 23, 2021 18:35
f"FullyShardedDataParallel has {self.num_processes} processes. Serializing model
state restore to avoid CPU OOMs."
)
for current_worker in range(self.num_processes):
Contributor

Would this break if the number of GPUs at checkpoint saving and loading isn't the same? Should we save the world_size + rank in each checkpoint and re-use that information on reload?

Author

Thanks for raising this! Thinking aloud: within the same training session we wouldn't run into this problem. But even in the case where a checkpoint from one training session is fine-tuned in a new session on a different machine, do we envision the world size being different? For our use cases, at least, we have config files that keep variables like Trainer(gpus) constant for the same model training.

Member

I think that is definitely something we need to support. For example, I usually use the output of torch.cuda.device_count() as the number of GPUs.

Contributor

@tchaton this kind of training "metadata" should get saved with the checkpoint. For example, we will also want this for fault tolerance, so we can fail if the trainer configuration has changed between runs and the user is trying to restore mid-batch.

Author

FWIW I ran a test taking a checkpoint from a previous training session and starting a fresh one: I set trainer.resume_from_checkpoint to the old checkpoint and trainer.gpus=4 instead of 8 (the GPU count of the original training session), and loading the checkpoint on 4 GPUs didn't break.

I can add this feature in another pull request! Would a good place to add the metadata be on_save_checkpoint in model_checkpoint.py? E.g. here: https://github.com/PyTorchLightning/pytorch-lightning/blob/c7f8c8c3c82b4f249125885490b2392bf9d3d08b/pytorch_lightning/callbacks/model_checkpoint.py#L341
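
(A hedged sketch of the kind of metadata addition being discussed, written against the callback API of the Lightning version this PR targets; the subclass name is hypothetical, and a concrete version of the two topology keys appears in a later diff in this conversation.)

    from pytorch_lightning.callbacks import ModelCheckpoint

    class TopologyAwareCheckpoint(ModelCheckpoint):
        def on_save_checkpoint(self, trainer, pl_module, checkpoint):
            # Keep the callback's usual state and record the training topology so a
            # later restore can detect a mismatched world size or node layout.
            state = super().on_save_checkpoint(trainer, pl_module, checkpoint)
            state["world_size"] = trainer.world_size
            state["node_rank"] = trainer.node_rank
            return state
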

Contributor

Sorry for the late reply! I've filed #9123 to track this


self.trainer.lightning_module.on_load_checkpoint(checkpoint)
self.trainer.training_type_plugin.load_model_state_dict(checkpoint)
if hasattr(self.trainer.training_type_plugin, "serialized_restore_model_state"):
Contributor

Just personal preference, but I would prefer to refactor the code to pass the checkpoint_path to the training_type plugin and let it handle the loading logic.

    checkpoint = self.trainer.training_type_plugin.serialized_restore_model_state(checkpoint_path)
else:
    checkpoint = self.trainer.training_type_plugin.load_checkpoint_file(checkpoint_path)
    self.trainer.lightning_module.on_load_checkpoint(checkpoint)
Contributor

For consistency, should we move self.trainer.lightning_module.on_load_checkpoint(checkpoint) to the training type plugin?

)
for current_worker in range(self.num_processes):
    if self.local_rank == current_worker:
        checkpoint = super().load_checkpoint_file(checkpoint_path)
Contributor

Does this assume the same checkpoint path for all workers?

Comment on lines 195 to 196
self.load_model_state_dict(checkpoint)
del checkpoint["state_dict"]
Contributor

Suggested change
- self.load_model_state_dict(checkpoint)
- del checkpoint["state_dict"]
+ self.load_model_state_dict(checkpoint.pop("state_dict"))

checkpoint = {}
rank_zero_info(
    f"FullyShardedDataParallel has {self.num_processes} processes. Serializing model state restore to avoid CPU OOMs."
Contributor

Suggested change
- state restore to avoid CPU OOMs."
+ state restoration to avoid CPU OOMs."

f"FullyShardedDataParallel has {self.num_processes} processes. Serializing model
state restore to avoid CPU OOMs."
)
for current_worker in range(self.num_processes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaton this kind of training "metadata" should get saved with the checkpoint. For example, we will also want to know this for fault-tolerance to fail if the trainer configuration has changed between runs and the user is trying to restore mid-batch.

@tchaton
Contributor

tchaton commented Jul 29, 2021

Hey @mleshen, any updates on this PR? Do you need some assistance?

@maximsch2
Contributor

@mleshen, and others: this has been an issue for us a few times in the past, and it seems to be a particularly tricky thing to test for. Do we have tests for checkpoint loading that we can amend to include memory tracking, to make sure the memory used doesn't scale with the number of workers?
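
(A hedged sketch of the kind of memory-tracking assertion such a test could make; nothing below exists in the Lightning test suite, and both the psutil dependency and the 2x bound are assumptions.)

    import os

    import psutil
    import torch

    def rss_mb() -> float:
        """Resident set size of the current process, in MiB."""
        return psutil.Process(os.getpid()).memory_info().rss / 2**20

    def test_checkpoint_load_memory(tmp_path):
        state = {"state_dict": {"weight": torch.randn(4096, 4096)}}  # ~64 MiB of weights
        path = os.path.join(str(tmp_path), "ckpt.pt")
        torch.save(state, path)
        del state

        before = rss_mb()
        checkpoint = torch.load(path, map_location="cpu")  # stand-in for the plugin load
        grown = rss_mb() - before
        del checkpoint

        # With rank-serialized restore, each process should only ever hold about one
        # copy of the full state dict, so growth should stay near the checkpoint size
        # rather than scaling with the number of workers.
        assert grown < 2 * 64, f"checkpoint load grew RSS by {grown:.0f} MiB"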

"dirpath": self.dirpath
"dirpath": self.dirpath,
"world_size": trainer.world_size,
"node_rank": trainer.node_rank,
Contributor

Let's add num_nodes too.

f"FullyShardedDataParallel has {self.num_processes} processes. Serializing model
state restoration to avoid CPU OOMs."
)
for current_worker in range(self.num_processes):
Contributor

Mind adding a comment to explain what is happening, in case a new reader reaches this code? :)


- self.trainer.lightning_module.on_load_checkpoint(checkpoint)
- self.trainer.training_type_plugin.load_model_state_dict(checkpoint)
+ checkpoint = self.trainer.training_type_plugin.load_model_state(checkpoint_path)
Contributor

Much cleaner!

@mergify mergify bot removed the has conflicts label Aug 26, 2021
@@ -178,6 +182,24 @@ def lightning_module_state_dict(self) -> Dict[str, Union[Any, Tensor]]:
        # state dict.
        return super().lightning_module_state_dict()

    def load_model_state(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]:
        checkpoint = {}
Contributor

docstring?
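
(A hedged sketch of what the requested docstring might look like, using the signature from the hunk above; the wording is an assumption, not the PR's text.)

    def load_model_state(self, checkpoint_path: Union[str, Path]) -> Dict[str, Any]:
        """Restore model weights from ``checkpoint_path``, one rank at a time.

        Each process deserializes the full checkpoint in turn and waits at a barrier,
        so only one unsharded copy of the state dict is resident in CPU memory at any
        moment. The returned dict has ``state_dict`` removed so the caller can restore
        the remaining training state without holding the weights twice.
        """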

@tchaton
Contributor

tchaton commented Aug 26, 2021

Dear @mleshen,

Any updates on this PR?
Noob question: does FSDP save only one checkpoint, or a directory with multiple checkpoints as DeepSpeed does?

Best,
T.C

"deleted state_dict from checkpoint."
)
self.barrier()
return checkpoint
Contributor

I guess returning the checkpoint is not required. Also: docstrings.

checkpoint = self.load_checkpoint_file(checkpoint_path)
self.on_load_checkpoint(checkpoint)
self.load_model_state_dict(checkpoint)
return checkpoint
Contributor

same here.

@awaelchli awaelchli modified the milestones: v1.5, v1.6 Nov 1, 2021
@carmocca carmocca removed this from the 1.6 milestone Mar 28, 2022
@stale

stale bot commented Apr 16, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix (This will not be worked on) label Apr 16, 2022
@stale

stale bot commented Apr 21, 2022

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Apr 21, 2022
Labels
design (Includes a design discussion), feature (Is an improvement or enhancement), has conflicts, won't fix (This will not be worked on)

7 participants