Updates DCP tutorial with recent updates to API, including save, load, and distributed state dict #2832
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2832
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 65615fa with merge base 700b9d8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks!
Co-authored-by: Svetlana Karslioglu <[email protected]>
Thanks @svekars, great suggestions.
Overall LGTM! Thanks Lucas!
Additionally, through the use of modules in :func:`torch.distributed.checkpoint.state_dict`,
DCP offers support for gracefully handling ``state_dict`` generation and loading in distributed settings.
This includes managing fully-qualified-name (FQN) mappings across models and optimizers, and setting default parameters for PyTorch-provided parallelisms.
Is `setting default parameters` referring to the optimizer initialization? Just want to make sure I understand.
Optimizer initialization is a good point, but I was referencing setting the state dict type for FSDP, since users don't have to do this if they use distributed state dict
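For anyone following along, roughly what that looks like with the new helpers. This is a minimal sketch, assuming a job launched with torchrun and a GPU per rank; the toy model, optimizer, and hyperparameters are made up for illustration, and `get_state_dict`/`set_state_dict` are the helpers from `torch.distributed.checkpoint.state_dict` discussed in this PR:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# Launched with torchrun, so RANK / WORLD_SIZE / LOCAL_RANK are already set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(nn.Linear(16, 16).cuda())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# No FSDP.set_state_dict_type(...) call needed: the helper applies sensible
# defaults and returns flat, FQN-keyed state dicts for both objects.
model_sd, optim_sd = get_state_dict(model, optimizer)

# Restoring mirrors the same call, again with no FSDP-specific setup.
set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)

dist.destroy_process_group()
```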
@pytorchbot merge
This update looks good to me, but it lacks an introduction of the ability to use `save` and `load` for `Stateful` objects.
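A rough sketch of what that `Stateful` flow could look like, assuming the `Stateful` protocol from `torch.distributed.checkpoint.stateful` and the `checkpoint_id` argument to `dcp.save`/`dcp.load`; the `AppState` class and checkpoint path are hypothetical, and `model`/`optimizer` are the FSDP-wrapped objects from the earlier sketch:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful


class AppState(Stateful):
    """Wraps model and optimizer so dcp.save/load drive their state_dict hooks."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict):
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )


# dcp.save/load detect Stateful values and call state_dict/load_state_dict for us.
# In practice the load would happen in a later run; it is shown inline here only
# to keep the sketch short.
state = {"app": AppState(model, optimizer)}
dcp.save(state, checkpoint_id="checkpoint_dir")
dcp.load(state, checkpoint_id="checkpoint_dir")
```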
:func:`torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel.
In addition, checkpointing automatically handles fully-qualified-name (FQN) mappings across models and optimizers, enabling load-time resharding across differing cluster topologies.
:func:`torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel. You can use this module to save on any number of ranks in parallel,
and then re-shard across differing cluster topologies at load time.
Should we also add `and different parallelisms`?
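For reference, the save-on-N-ranks / load-on-M-ranks flow the new wording describes might look like the sketch below. This is illustrative only: it assumes `dcp.save`/`dcp.load` accept a `checkpoint_id`, that `model` and `optimizer` are already wrapped in whatever parallelism each run uses, and the world sizes and path are placeholders:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# Run A: e.g. torchrun --nproc-per-node=8; every rank writes its shard in parallel.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt")

# Run B: e.g. torchrun --nproc-per-node=4 on a different cluster. DCP matches
# entries by FQN and re-shards the saved tensors to fit the new topology.
sd = {"model": model_sd, "optim": optim_sd}
dcp.load(sd, checkpoint_id="ckpt")
set_state_dict(model, optimizer, model_state_dict=sd["model"], optim_state_dict=sd["optim"])
```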
For any follow-up changes, please submit a new PR.
Description
DCP has recently updated its APIs, introducing semantics that more closely mirror torch's save/load APIs. This update shows users how to use the behavior now included in dcp.save/load, as well as some additional details for using the distributed.state_dict methods for managing FQNs and parallelism defaults.
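To illustrate the "mirrors torch's save/load" point, a minimal single-process comparison is sketched below. It assumes recent DCP builds allow saving without an initialized process group (otherwise run it under torchrun); the paths are placeholders, and note that `dcp.load` restores in place into the dict you pass rather than returning a new object:

```python
import torch
import torch.distributed.checkpoint as dcp

state = {"weights": torch.randn(4, 4), "step": torch.tensor(0)}

# torch.* round trip: one file, and load returns a fresh object.
torch.save(state, "single_file.pt")
restored = torch.load("single_file.pt")

# dcp.* round trip: a directory with one shard per rank, and load fills
# the provided `state` dict in place.
dcp.save(state, checkpoint_id="sharded_ckpt")
dcp.load(state, checkpoint_id="sharded_ckpt")
```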
Checklist