Updates DCP tutorial with recent updates to API, including save, load, and distributed state dict #2832
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2832
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 65615fa with merge base 700b9d8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks!
Co-authored-by: Svetlana Karslioglu <[email protected]>
Thanks @svekars, great suggestions.
Overall LGTM! Thanks Lucas!
Additionally, through the use of modules in :func:`torch.distributed.checkpoint.state_dict`,
DCP offers support for gracefully handling ``state_dict`` generation and loading in distributed settings.
This includes managing fully-qualified-name (FQN) mappings across models and optimizers, and setting default parameters for PyTorch-provided parallelisms.
Is `setting default parameters` referring to the optimizer initialization? Just want to make sure I understand.
Optimizer initialization is a good point, but I was referencing setting the state dict type for FSDP, since users don't have to do this if they use distributed state dict
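For anyone following along, roughly what that looks like with the new helpers. This is a minimal sketch, assuming a job launched with torchrun and a GPU per rank; the toy model, optimizer, and hyperparameters are made up for illustration, and `get_state_dict`/`set_state_dict` are the helpers from `torch.distributed.checkpoint.state_dict` discussed in this PR:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# Launched with torchrun, so RANK / WORLD_SIZE / LOCAL_RANK are already set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(nn.Linear(16, 16).cuda())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# No FSDP.set_state_dict_type(...) call needed: the helper applies sensible
# defaults and returns flat, FQN-keyed state dicts for both objects.
model_sd, optim_sd = get_state_dict(model, optimizer)

# Restoring mirrors the same call, again with no FSDP-specific setup.
set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)

dist.destroy_process_group()
```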
@pytorchbot merge
This update looks good to me, but it lacks an introduction of the ability to use `save` and `load` for `Stateful` objects.
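A rough sketch of what that `Stateful` flow could look like, assuming the `Stateful` protocol from `torch.distributed.checkpoint.stateful` and the `checkpoint_id` argument to `dcp.save`/`dcp.load`; the `AppState` class and checkpoint path are hypothetical, and `model`/`optimizer` are the FSDP-wrapped objects from the earlier sketch:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful


class AppState(Stateful):
    """Wraps model and optimizer so dcp.save/load drive their state_dict hooks."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict):
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )


# dcp.save/load detect Stateful values and call state_dict/load_state_dict for us.
# In practice the load would happen in a later run; it is shown inline here only
# to keep the sketch short.
state = {"app": AppState(model, optimizer)}
dcp.save(state, checkpoint_id="checkpoint_dir")
dcp.load(state, checkpoint_id="checkpoint_dir")
```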
:func:`torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel.
In addition, checkpointing automatically handles fully-qualified-name (FQN) mappings across models and optimizers, enabling load-time resharding across differing cluster topologies.
:func:`torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel. You can use this module to save on any number of ranks in parallel,
and then re-shard across differing cluster topologies at load time.
Should we also add `and different parallelisms`?
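For reference, the save-on-N-ranks / load-on-M-ranks flow the new wording describes might look like the sketch below. This is illustrative only: it assumes `dcp.save`/`dcp.load` accept a `checkpoint_id`, that `model` and `optimizer` are already wrapped in whatever parallelism each run uses, and the world sizes and path are placeholders:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# Run A: e.g. torchrun --nproc-per-node=8; every rank writes its shard in parallel.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt")

# Run B: e.g. torchrun --nproc-per-node=4 on a different cluster. DCP matches
# entries by FQN and re-shards the saved tensors to fit the new topology.
sd = {"model": model_sd, "optim": optim_sd}
dcp.load(sd, checkpoint_id="ckpt")
set_state_dict(model, optimizer, model_state_dict=sd["model"], optim_state_dict=sd["optim"])
```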
For any follow-up changes, please submit a new PR.
Description
DCP has recently updated its APIs, introducing semantics that more closely mirror torch's save/load APIs. This update shows users how to use the behavior now included in dcp.save/load, as well as some additional details for using the distributed.state_dict methods for managing FQNs and parallelism defaults.
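To illustrate the "mirrors torch's save/load" point, a minimal single-process comparison is sketched below. It assumes recent DCP builds allow saving without an initialized process group (otherwise run it under torchrun); the paths are placeholders, and note that `dcp.load` restores in place into the dict you pass rather than returning a new object:

```python
import torch
import torch.distributed.checkpoint as dcp

state = {"weights": torch.randn(4, 4), "step": torch.tensor(0)}

# torch.* round trip: one file, and load returns a fresh object.
torch.save(state, "single_file.pt")
restored = torch.load("single_file.pt")

# dcp.* round trip: a directory with one shard per rank, and load fills
# the provided `state` dict in place.
dcp.save(state, checkpoint_id="sharded_ckpt")
dcp.load(state, checkpoint_id="sharded_ckpt")
```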
Checklist