Support saving and loading ShardedTensor. #62242

pritamdamania87 · 2021-07-27T03:45:32Z

Stack from ghstack:

Add a state_dict hook to ensure ShardedTensors are added to a state_dict.
Add a pre-load state_dict hook to ensure ShardedTensors are added back to a module at load time.
Add a with_load_process_group context manager for load time.
Added ser-de capability to ShardedTensor.

Differential Revision: D29927881

facebook-github-bot · 2021-07-27T03:45:35Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/62242
📄 Preview docs built from this PR

💊 CI failures summary and remediations

As of commit c1af3a0 (more details on the Dr. CI page):

1/1 failures possibly* introduced in this PR
- 1/1 non-scanned failure(s)

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-bionic-rocm4.2-py3.6

pritamdamania87 · 2021-07-27T03:46:50Z

test/distributed/_sharded_tensor/test_sharded_tensor.py

+    @with_comms
+    @skip_if_lt_x_gpu(4)
+    @requires_nccl()
+    def test_state_dict(self):


Pull Request resolved: #62242
ghstack-source-id: 134381847
Differential Revision: [D29927881](https://our.internmc.facebook.com/intern/diff/D29927881/)

wanchaol

Have some concerns on using the hook to update the state_dict, otherwise looks good to me!

torch/distributed/_sharded_tensor/api.py

torch/distributed/_sharded_tensor/__init__.py

Pull Request resolved: #62242
ghstack-source-id: 134574329
Differential Revision: [D29927881](https://our.internmc.facebook.com/intern/diff/D29927881/)

wanchaol

looks great! just one suggestion about adding the tests for exception handling.

torch/distributed/_sharded_tensor/api.py

Pull Request resolved: #62242
ghstack-source-id: 134741074
Differential Revision: [D29927881](https://our.internmc.facebook.com/intern/diff/D29927881/)

Pull Request resolved: #62242
ghstack-source-id: 134775358
Differential Revision: [D29927881](https://our.internmc.facebook.com/intern/diff/D29927881/)

Pull Request resolved: #62242
ghstack-source-id: 134860967
Differential Revision: [D29927881](https://our.internmc.facebook.com/intern/diff/D29927881/)

facebook-github-bot · 2021-08-03T01:34:40Z

This pull request has been merged in c07a123.

another-pjohnson · 2021-08-24T17:36:45Z

torch/distributed/_sharded_tensor/api.py

+        elif self.memory_format == torch.channels_last:
+            mem_format_encoding = 1
+        elif self.memory_format == torch.preserve_format:
+            mem_format_encoding = 1


I think I've found a copy-paste error: judging from the `__setstate__` logic, `mem_format_encoding` should be 2 in this case.

awaelchli · 2021-11-23T09:39:28Z

torch/distributed/_sharded_tensor/__init__.py

+def _recurse_update_module(module, state_dict, prefix):
+    for attr_name, attr in module.__dict__.items():
+        key = prefix + attr_name
+        if key in state_dict:
+            if isinstance(state_dict[key], ShardedTensor):
+                setattr(module, attr_name, state_dict[key])
+
+    for submodule_name, submodule in module.named_modules():
+        key = prefix + submodule_name
+        if submodule_name:
+            _recurse_update_module(submodule, state_dict, key + '.')
+
+
+def _recurse_update_dict(module, destination, prefix):
+    for attr_name, attr in module.__dict__.items():
+        if isinstance(attr, ShardedTensor):
+            destination[prefix + attr_name] = attr
+
+    for submodule_name, submodule in module.named_modules():
+        if submodule_name != '':
+            _recurse_update_dict(submodule, destination, prefix + submodule_name + '.')


I posted an issue here #68805 for a potential improvement. I believe the recursion here is not necessary and causes inefficiency when retrieving the state dict.
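
The redundancy can be seen in a small stand-alone sketch. `torch.nn.Module.named_modules()` already yields the module itself (with an empty name at the root) and every descendant with a dotted name, so recursing into each yielded submodule revisits nested modules. `ToyModule` below is a hypothetical stand-in that mimics only that behaviour, and the two `visits_*` helpers are illustrative names that count visits instead of touching a state dict.

```python
class ToyModule:
    """Stand-in for torch.nn.Module, just enough to show the traversal."""

    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix=""):
        # Yield this module first, then every descendant with a dotted name.
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = prefix + ("." if prefix else "") + name
            yield from child.named_modules(child_prefix)


def visits_recursive(module, counter):
    """Mirror the quoted pattern: recurse into every named_modules() entry."""
    counter[id(module)] = counter.get(id(module), 0) + 1
    for name, submodule in module.named_modules():
        if name:  # skip the module itself (empty name)
            visits_recursive(submodule, counter)


def visits_flat(module, counter):
    """Single pass: named_modules() already reaches every descendant."""
    for _, submodule in module.named_modules():
        counter[id(submodule)] = counter.get(id(submodule), 0) + 1


# A chain root -> a -> b -> c, so the leaf sits three levels deep.
leaf = ToyModule()
model = ToyModule(a=ToyModule(b=ToyModule(c=leaf)))

recursive_counts, flat_counts = {}, {}
visits_recursive(model, recursive_counts)
visits_flat(model, flat_counts)
```

For this chain the recursive variant processes the leaf four times and the flat variant once; in a chain, each extra level of nesting doubles the number of repeat visits.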

pritamdamania87 requested review from cbalioglu, H-Huang, mingzhe09088, mrshenli, rohan-varma, wayi1 and zhaojuanmao as code owners July 27, 2021 03:45

facebook-github-bot added the oncall: distributed and cla signed labels Jul 27, 2021

pritamdamania87 mentioned this pull request Jul 27, 2021

Provide option to pass module instance to _load_state_dict_pre_hooks. #62070

Closed

pritamdamania87 commented Jul 27, 2021

pritamdamania87 requested a review from wanchaol July 27, 2021 20:32

wanchaol reviewed Jul 27, 2021

pritamdamania87 requested a review from wanchaol July 28, 2021 23:55

wanchaol approved these changes Jul 29, 2021

yifuwang mentioned this pull request Jul 29, 2021

Register Hooks for ShardedTensor Support Lightning-AI/pytorch-lightning#8633

Closed

pritamdamania87 mentioned this pull request Aug 2, 2021

Initialize RRefs only when explicitly asked for. #62618

Closed

facebook-github-bot closed this in c07a123 Aug 3, 2021

facebook-github-bot added the Merged label Aug 3, 2021

ananthsub mentioned this pull request Aug 6, 2021

[RFC] Checkpointing in Lightning: Create a new CheckpointAgent interface for placing checkpointing logic Lightning-AI/pytorch-lightning#8118

Closed

facebook-github-bot deleted the gh/pritamdamania87/257/head branch August 6, 2021 14:17

another-pjohnson reviewed Aug 24, 2021

awaelchli reviewed Nov 23, 2021
