[Lite] Optimizer state has not been consolidated on this rank #10745

Closed
hiyyg opened this issue Nov 25, 2021 · 3 comments · Fixed by #10746
Labels: bug, fabric, strategy: fairscale sharded (removed)

Comments


hiyyg commented Nov 25, 2021

When using Lite with the 'ddp_sharded' strategy and trying to save the optimizer state_dict, I get the error below:

RuntimeError: Optimizer state has not been consolidated on this rank. Please call `consolidate_state_dict()` on all ranks beforehand if you meant to save the global state

Any guidance on how to solve this issue?
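
For reference, a minimal sketch of the kind of setup that triggers this (the model, optimizer, and checkpoint path are just placeholders, not my actual code):

```python
import torch
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        model = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        model, optimizer = self.setup(model, optimizer)

        # ... training loop ...

        # Saving the sharded optimizer's state_dict directly raises the error above.
        torch.save(optimizer.state_dict(), "optimizer.pt")


Lite(strategy="ddp_sharded", accelerator="gpu", devices=2).run()
```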

cc @carmocca @justusschock @awaelchli @SeanNaren

hiyyg added the bug label Nov 25, 2021

awaelchli commented Nov 25, 2021

Hi @hiyyg
You can work around this issue for now by calling optimizer.consolidate_state_dict(). Let me know if that works.
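
Roughly, a sketch of what I mean (a fragment inside LightningLite.run, assuming torch is imported; the checkpoint path is just an example):

```python
# Consolidation is a collective call, so every rank must execute it.
optimizer.consolidate_state_dict()

# Only the recipient rank (rank 0 by default) ends up with the full state,
# so save the checkpoint from that rank only.
if self.global_rank == 0:
    torch.save(optimizer.state_dict(), "optimizer.pt")
```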

Not yet sure if we will be able to find an elegant solution that does not require the user to make this call manually. This is where Lightning can shine, though: it can take care of this automatically :)


hiyyg commented Nov 25, 2021

Thanks @awaelchli, it works now. However, after adding this optimizer.consolidate_state_dict() call, the GPU memory usage is higher than with ddp. Is that normal?


awaelchli commented Nov 27, 2021

Yes, that's normal. "ddp_sharded" means the optimizer state gets split ("sharded") across processes/devices. This saves memory during training, but if you actually want to save the state of the optimizer, you need to transfer the individual shards and assemble the whole state. That is what optimizer.consolidate_state_dict() does, and it is the reason you need to call it before saving the state. The full state occupies more memory than a single shard.
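
To make the memory tradeoff concrete, here is a rough sketch using fairscale's OSS optimizer directly (the one backing "ddp_sharded"); the model and hyperparameters are placeholders, and it assumes torch.distributed has already been initialized (e.g. by the Lite launcher):

```python
import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS

model = torch.nn.Linear(32, 2)

# Each rank only keeps its own shard of the optimizer state ...
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

# ... until consolidation gathers every shard onto the recipient rank
# (rank 0 by default). That rank then holds the full, unsharded state,
# which is why its memory usage grows compared to plain sharded training.
optimizer.consolidate_state_dict(recipient_rank=0)

if dist.get_rank() == 0:
    torch.save(optimizer.state_dict(), "optimizer_full.pt")
```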
