Updates dcp tutorial with recent updates to api including save, load, and distributed state dict #2832


Merged
merged 5 commits into main from add_save_load_to_dcp on Apr 19, 2024

Conversation

LucasLLC
Contributor

Description

DCP has recently updated its APIs, introducing semantics that more closely mirror torch's save/load APIs. This update shows users how to use the behavior now included in dcp.save/load, as well as some additional details on using the distributed state_dict methods for managing FQNs and parallelism defaults.
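For context, a minimal sketch of the new call pattern, assuming PyTorch 2.3+; the single-process gloo group, toy model, and `checkpoint` directory are illustrative only, not part of this PR:

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
import torch.nn as nn

CHECKPOINT_DIR = "checkpoint"  # hypothetical path, used only for illustration


def main():
    # Single-process group so the sketch runs standalone; in real training
    # this would be set up by torchrun across many ranks.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Linear(16, 16)

    # dcp.save mirrors torch.save: pass a state_dict and a checkpoint_id
    # (a directory, since each rank writes its own shard files).
    dcp.save({"model": model.state_dict()}, checkpoint_id=CHECKPOINT_DIR)

    # dcp.load mirrors torch.load, but restores tensors in place into the
    # state_dict you pass in, which is then handed back to the module.
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id=CHECKPOINT_DIR)
    model.load_state_dict(state_dict["model"])

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```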

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.

@LucasLLC LucasLLC requested review from fegin and wz337 April 10, 2024 16:48
@LucasLLC LucasLLC self-assigned this Apr 10, 2024

pytorch-bot bot commented Apr 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2832

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 65615fa with merge base 700b9d8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@svekars svekars left a comment

Thanks!

Co-authored-by: Svetlana Karslioglu <[email protected]>
@LucasLLC
Contributor Author

Thanks @svekars, great suggestions.

Contributor

@wz337 wz337 left a comment

Overall LGTM! Thanks Lucas!


Additionally, through the use of modules in :func:`torch.distributed.checkpoint.state_dict`,
DCP offers support for gracefully handling ``state_dict`` generation and loading in distributed settings.
This includes managing fully-qualified-name (FQN) mappings across models and optimizers, and setting default parameters for PyTorch provided parallelisms.
Contributor

Is setting default parameters referring to the optimizer initialization? Just want to make sure I understand.

Contributor Author

Optimizer initialization is a good point, but I was referring to setting the state dict type for FSDP, since users don't have to do this if they use distributed state dict.
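For illustration, a minimal sketch of that point, assuming `model` is an FSDP-wrapped module and `optimizer` was built from its parameters (both created elsewhere); the helper names and the checkpoint directory are hypothetical:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

CHECKPOINT_DIR = "checkpoint"  # hypothetical path


def save_checkpoint(model, optimizer):
    # get_state_dict returns FQN-consistent state dicts for the model and its
    # optimizer. For an FSDP-wrapped model it picks sensible defaults, so
    # there is no need to call FSDP.set_state_dict_type manually.
    model_state_dict, optim_state_dict = get_state_dict(model, optimizer)
    dcp.save(
        {"model": model_state_dict, "optim": optim_state_dict},
        checkpoint_id=CHECKPOINT_DIR,
    )


def load_checkpoint(model, optimizer):
    # Load into freshly generated state dicts, then push them back into the
    # model and optimizer; set_state_dict handles the FQN mapping for both.
    model_state_dict, optim_state_dict = get_state_dict(model, optimizer)
    dcp.load(
        {"model": model_state_dict, "optim": optim_state_dict},
        checkpoint_id=CHECKPOINT_DIR,
    )
    set_state_dict(
        model,
        optimizer,
        model_state_dict=model_state_dict,
        optim_state_dict=optim_state_dict,
    )
```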

@LucasLLC
Contributor Author

@pytorchbot merge

@fegin fegin left a comment

This update looks good to me, but it lacks an introduction of the ability to use save and load with Stateful objects.
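For reference, a minimal sketch of the Stateful pattern being referred to, assuming a model and optimizer created elsewhere; the `AppState` name and checkpoint path are illustrative only:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful


class AppState(Stateful):
    """Bundles model and optimizer so dcp.save/load drive state_dict handling."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # Called by dcp.save when this object appears in the passed state_dict.
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict):
        # Called by dcp.load; restores the model and optimizer in place.
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )


# Usage (hypothetical model/optimizer/path):
#   dcp.save({"app": AppState(model, optimizer)}, checkpoint_id="checkpoint")
#   dcp.load({"app": AppState(model, optimizer)}, checkpoint_id="checkpoint")
```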

:func:`torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel.
In addition, checkpointing automatically handles fully-qualified-name (FQN) mappings across models and optimizers, enabling load-time resharding across differing cluster topologies.
:func:`torch.distributed.checkpoint` enables saving and loading models from multiple ranks in parallel. You can use this module to save on any number of ranks in parallel,
and then re-shard across differing cluster topologies at load time.

Should we also add "and different parallelisms"?

@svekars
Contributor

svekars commented Apr 19, 2024

@LucasLLC can you address @fegin's comments, or will you do it in a separate PR?

@svekars
Contributor

svekars commented Apr 19, 2024

For any follow up changes, please submit a new PR.

@svekars svekars merged commit c5c0a9a into main Apr 19, 2024
@svekars svekars deleted the add_save_load_to_dcp branch April 19, 2024 22:30
@LucasLLC
Contributor Author

Will circle back to this with the requested changes! Thanks @fegin, @svekars
