Register Hooks for ShardedTensor Support #8633

Closed
yifuwang opened this issue Jul 29, 2021 · 4 comments
Labels: checkpointing, feature, help wanted
@yifuwang (Contributor)

🚀 Feature

Motivation

PyTorch is introducing ShardedTensor as the standard way of representing model state in sharded models. For checkpointing purposes, a ShardedTensor is a special tensor that appears in model.state_dict(); the state dict can later be used to restore the original model via model.load_state_dict().

However, for ShardedTensor to work with .state_dict() and .load_state_dict(), two special hooks need to be registered via _register_state_dict_hook() and _register_load_state_dict_pre_hook(). These hooks are no-ops when there's no ShardedTensor in the model.
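For illustration, here is a minimal pure-Python sketch of how such state-dict hooks compose. `MiniModule` and the hook below are hypothetical stand-ins, not PyTorch code; in PyTorch the real (private) entry points are `nn.Module._register_state_dict_hook()` and `_register_load_state_dict_pre_hook()`:

```python
# Hypothetical, simplified model of how state-dict hooks compose.
# This is NOT the PyTorch implementation, just a sketch of the pattern.
class MiniModule:
    def __init__(self, params):
        self._params = dict(params)
        self._state_dict_hooks = []           # run after state_dict() is built
        self._load_state_dict_pre_hooks = []  # run before load_state_dict() applies

    def _register_state_dict_hook(self, hook):
        self._state_dict_hooks.append(hook)

    def _register_load_state_dict_pre_hook(self, hook):
        self._load_state_dict_pre_hooks.append(hook)

    def state_dict(self):
        state = dict(self._params)
        for hook in self._state_dict_hooks:
            result = hook(self, state)
            if result is not None:
                state = result  # hooks may rewrite the state dict
        return state

    def load_state_dict(self, state):
        for hook in self._load_state_dict_pre_hooks:
            hook(self, state)  # hooks may fix up the incoming state dict
        self._params.update(state)


def sharded_tensor_state_dict_hook(module, state):
    # A ShardedTensor-aware hook would rewrite sharded entries here; with
    # no ShardedTensor present it simply returns the state unchanged (no-op).
    return state


m = MiniModule({"weight": 1.0})
m._register_state_dict_hook(sharded_tensor_state_dict_hook)
assert m.state_dict() == {"weight": 1.0}  # no-op when nothing is sharded
```

The key property this sketch demonstrates is the one the pitch relies on: registering the hooks on a model without any ShardedTensor changes nothing.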

Pitch

Since in Lightning the trainer is responsible for obtaining the state dict from a model, as well as restoring a model from a given state dict, Lightning should probably also be responsible for registering these hooks.

Note that the feature is still a WIP in PyTorch. We can either support it now for early adopters who also use Lightning, or defer it until the feature is released.

Alternatives

Additional context

pytorch/pytorch#55207
pytorch/pytorch#62242



@yifuwang added the feature and help wanted labels on Jul 29, 2021
@awaelchli (Contributor)

Considering our CI is already set up for nightly PyTorch, I think it would be nice to explore introducing these hooks.

@ananthsub (Contributor)

@yifuwang some n00b questions:

  • where in the trainer do you recommend adding these hooks?
  • would the trainer need to check if the LightningModule already has the hook registered?
  • why register the hooks via the trainer vs the LightningModule's constructor?

@tchaton (Contributor) commented Aug 3, 2021

Dear @yifuwang @ananthsub,

Should we expect the Trainer to auto-inspect the LightningModule and automatically register those hooks if sharded tensors are discovered?

I was checking '1.10.0.dev20210802+cu111'. The nightly release doesn't contain ChunkShardingSpec yet, so we can't write a ShardedBoringModel yet.

Best,
T.C

@ananthsub added the checkpointing label on Aug 4, 2021
@ananthsub (Contributor) commented Aug 4, 2021

> Should we expect the Trainer to auto-inspect the LightningModule and automatically register those hooks if sharded tensors are discovered?

From @pritamdamania87: if there's no ShardedTensor in the module, the hooks for loading/saving the state dict are no-ops, so we don't need to inspect the LightningModule for sharded tensors. It is safe to always register them.

Ideally this would be enabled by default for all nn.Modules. However, this depends on pytorch/pytorch#62094. Until that is resolved, we need to explicitly register the hooks.
