[Lite] Optimizer state has not been consolidated on this rank #10745

Closed
hiyyg opened this issue Nov 25, 2021 · 3 comments · Fixed by #10746
Labels: bug, fabric, strategy: fairscale sharded (removed)

Comments


hiyyg commented Nov 25, 2021

When using Lite with the 'ddp_sharded' strategy and trying to save the optimizer state_dict, I get the error below:

RuntimeError: Optimizer state has not been consolidated on this rank. Please call `consolidate_state_dict()` on all ranks beforehand if you meant to save the global state

Any guidance on how to solve this issue?
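
For reference, a minimal sketch of the kind of setup that triggers this (the model, optimizer, and checkpoint path are just placeholders, not my actual code):

```python
import torch
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        model = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        model, optimizer = self.setup(model, optimizer)

        # ... training loop ...

        # Saving the sharded optimizer's state_dict directly raises the error above.
        torch.save(optimizer.state_dict(), "optimizer.pt")


Lite(strategy="ddp_sharded", accelerator="gpu", devices=2).run()
```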

cc @carmocca @justusschock @awaelchli @SeanNaren

hiyyg added the bug label Nov 25, 2021

awaelchli commented Nov 25, 2021

Hi @hiyyg
You can work around this issue for now by calling optimizer.consolidate_state_dict(). Let me know if that works.
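
Roughly, a sketch of what I mean (a fragment inside LightningLite.run, assuming torch is imported; the checkpoint path is just an example):

```python
# Consolidation is a collective call, so every rank must execute it.
optimizer.consolidate_state_dict()

# Only the recipient rank (rank 0 by default) ends up with the full state,
# so save the checkpoint from that rank only.
if self.global_rank == 0:
    torch.save(optimizer.state_dict(), "optimizer.pt")
```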

Not yet sure if we will be able to find an elegant solution that does not require the user to make this call manually. This is where Lightning can shine, though: it can take care of this automatically :)


hiyyg commented Nov 25, 2021

Thanks @awaelchli, it works now. However, after adding this optimizer.consolidate_state_dict() call, the GPU memory usage is higher than with ddp. Is that normal?


awaelchli commented Nov 27, 2021

Yes, that's normal. "ddp_sharded" means the optimizer state gets split ("sharded") across processes/devices. This saves memory during training, but if you actually want to save the state of the optimizer, you need to transfer the individual shards and assemble the whole state. That is what optimizer.consolidate_state_dict() does, and it is the reason you need to call it before saving the state. The full state occupies more memory than a single shard.
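
To make the memory tradeoff concrete, here is a rough sketch using fairscale's OSS optimizer directly (the one backing "ddp_sharded"); the model and hyperparameters are placeholders, and it assumes torch.distributed has already been initialized (e.g. by the Lite launcher):

```python
import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS

model = torch.nn.Linear(32, 2)

# Each rank only keeps its own shard of the optimizer state ...
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

# ... until consolidation gathers every shard onto the recipient rank
# (rank 0 by default). That rank then holds the full, unsharded state,
# which is why its memory usage grows compared to plain sharded training.
optimizer.consolidate_state_dict(recipient_rank=0)

if dist.get_rank() == 0:
    torch.save(optimizer.state_dict(), "optimizer_full.pt")
```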
