
DDPSpawnPlugin generates a file based on the "best model path" #10933


Closed
awaelchli opened this issue Dec 3, 2021 · 0 comments · Fixed by #10934 or #10935

awaelchli commented Dec 3, 2021

🐛 Bug

In the DDPSpawn / TPUSpawn plugins we transfer the weights from rank 0 back to the main process. To do this, we save a checkpoint of the latest model weights and then load it in the main process. The file name is derived from the checkpoint callback's best_model_path:

https://github.com/PyTorchLightning/pytorch-lightning/blob/a28b4cd0c0bba30c21cae571e650877f66cf5588/pytorch_lightning/plugins/training_type/ddp_spawn.py#L259-L261
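For context, the naming logic behind that link boils down to something like the following (a paraphrased sketch, not a verbatim copy of the source; the helper name temp_weights_path is mine, in the plugin this is inline code):

```python
import re


def temp_weights_path(best_model_path: str) -> str:
    """Paraphrased naming logic: derive the temp file name from the *best*
    checkpoint path by swapping the extension, even though the file will
    hold the *latest* (not necessarily best) weights."""
    return re.sub(".ckpt", ".tmp_end.ckpt", best_model_path)


# e.g. best_model_path "epoch=0-step=7.ckpt" -> "epoch=0-step=7.tmp_end.ckpt"
print(temp_weights_path("epoch=0-step=7.ckpt"))
```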

This does not affect users directly as long as they ignore the file that gets saved. However, the file name does not reflect its contents: the latest weights are not necessarily the best ones!

Furthermore, the temp file never gets deleted.

To Reproduce

Run the BoringModel with Trainer(strategy="ddp_spawn", devices=2). The checkpoint directory will then contain a leftover file named
epoch=0-step=7.tmp_end.ckpt
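A self-contained repro sketch along those lines (assuming the 1.5-era Trainer API; the BoringModel here is a minimal inline stand-in rather than the one from the test helpers):

```python
import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size: int, length: int):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=1,
        limit_train_batches=8,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(32, 64), batch_size=8))
    # After fit() returns, look inside the checkpoint directory
    # (lightning_logs/version_*/checkpoints): next to the regular checkpoint
    # there is a leftover file ending in `.tmp_end.ckpt`.
```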

Expected behavior

The file name is not based on the "best model path", and the file gets deleted after the state has been loaded in the main process.
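One possible shape of the fix, sketched with plain torch/os calls (illustrative only; the actual change lives in the linked PRs and may differ):

```python
import os
import torch


def save_latest_weights(module: torch.nn.Module, dirpath: str) -> str:
    """In the spawned worker (rank 0): write the latest weights to a
    neutrally named temp file instead of one derived from best_model_path."""
    last_path = os.path.join(dirpath, ".temp.ckpt")  # hypothetical name
    torch.save(module.state_dict(), last_path)
    return last_path


def restore_latest_weights(module: torch.nn.Module, last_path: str) -> None:
    """In the main process: load the weights, then remove the temp file so
    no misleading artifact is left in the checkpoint directory."""
    module.load_state_dict(torch.load(last_path, map_location="cpu"))
    os.remove(last_path)
```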

Additional context

Found during debugging in #10896.
A PR for the fix is in the works.

cc @awaelchli @ananthsub @ninginthecloud @justusschock @kaushikb11

@awaelchli awaelchli added bug Something isn't working checkpointing Related to checkpointing strategy: ddp spawn labels Dec 3, 2021
@awaelchli awaelchli added this to the 1.5.x milestone Dec 3, 2021
@awaelchli awaelchli self-assigned this Dec 3, 2021
@awaelchli awaelchli added strategy: ddp DistributedDataParallel and removed strategy: ddp spawn labels Nov 4, 2023