
add DTensor to optimizer state dict #2585


Closed
wants to merge 1 commit

Conversation

Contributor

@iamzainhuda commented Nov 22, 2024

Summary:
To support 2D parallelism checkpointing, we introduce DTensor to the optimizer state dict.

This diff allows us to leverage N-dimensional device meshes with support for arbitrary replication/sharding groups, making checkpointing straightforward since DCP supports replicated/sharded placements on a device mesh (something that is unsupported in ShardedTensor).
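For context, a minimal sketch (not part of this diff; mesh shape, tensor sizes, and checkpoint path are assumptions) of the kind of replicated + sharded DTensor placement on a 2D device mesh that DCP can save directly:

```python
# Minimal sketch: a replicated + row-sharded DTensor on a 2D device mesh, saved with DCP.
# Assumes `torchrun --nproc_per_node=4` on 4 GPUs and a recent PyTorch
# (older releases expose these APIs under torch.distributed._tensor).
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# One mesh dim for replication groups, one for sharding groups (2 x 2 = 4 ranks).
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("replicate", "shard"))

# Stand-in for an optimizer state tensor (e.g. momentum for an embedding table):
# replicated across the first mesh dim, row-wise sharded across the second.
momentum = distribute_tensor(
    torch.zeros(1024, 128, device="cuda"),
    mesh,
    placements=[Replicate(), Shard(0)],
)

# DCP understands this placement directly, which a ShardedTensor-based
# optimizer state dict could not express.
dcp.save({"momentum": momentum}, checkpoint_id="/tmp/dtensor_ckpt")
dist.destroy_process_group()
```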

Differential Revision: D65555455

@facebook-github-bot added the CLA Signed label Nov 22, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65555455

iamzainhuda added a commit to iamzainhuda/torchrec that referenced this pull request Dec 9, 2024
Summary:

To support 2D parallelism checkpointing, we introduce DTensor to the optimizer state dict. It is enabled through fused_params["output_dtensor"] = True, meaning that when table shards are output as DTensor, so are the optimizer shards.

This diff allows us to leverage N-dimensional device meshes with support for arbitrary replication/sharding groups, making checkpointing straightforward since DCP/Modelstore support replicated/sharded placements on a device mesh (something that is unsupported in ShardedTensor).
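For illustration, a minimal sketch of how that flag might be passed through a torchrec sharder (treat the exact wiring as an assumption rather than the canonical setup from this diff):

```python
# Hypothetical wiring of the output_dtensor flag via fused_params; everything
# beyond the standard torchrec sharder class is assumed for illustration.
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder

fused_params = {
    "output_dtensor": True,  # table shards (and, per this diff, optimizer shards) come out as DTensor
    # other fused optimizer params (e.g. learning rate) would sit alongside it
}
sharder = EmbeddingBagCollectionSharder(fused_params=fused_params)
# The sharder is then passed to the planner / DistributedModelParallel as usual.
```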

Differential Revision: D65555455
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65555455

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65555455

iamzainhuda added a commit to iamzainhuda/torchrec that referenced this pull request Dec 10, 2024
Summary:

To support 2D parallelism checkpointing, we introduce DTensor to the optimizer state dict. It is enabled through fused_params["output_dtensor"] = True, meaning that when table shards are output as DTensor, so are the optimizer shards.

This diff allows us to leverage N-dimensional device meshes with support for arbitrary replication/sharding groups, making checkpointing straightforward since DCP/Modelstore support replicated/sharded placements on a device mesh (something that is unsupported in ShardedTensor).

Differential Revision: D65555455
iamzainhuda added a commit to iamzainhuda/torchrec that referenced this pull request Dec 10, 2024
Summary:

To support 2D parallelism checkpointing, we introduce DTensor to the optimizer state dict. It is enabled through fused_params["output_dtensor"] = True, meaning that when table shards are output as DTensor, so are the optimizer shards.

This diff allows us to leverage N-dimensional device meshes with support for arbitrary replication/sharding groups, making checkpointing straightforward since DCP/Modelstore support replicated/sharded placements on a device mesh (something that is unsupported in ShardedTensor).

Differential Revision: D65555455
Labels
CLA Signed, fb-exported