DDPSpawnPlugin generates a file based on the "best model path" #10933
Labels: bug, checkpointing, strategy: ddp
🐛 Bug
In the DDPSpawn / TPUSpawn plugin we transfer the weights from rank 0 back to the main process. To do this, we save a checkpoint of the latest model weights and then load it in the main process. The file name is determined based on the checkpoint callback's best_model_path:
https://github.com/PyTorchLightning/pytorch-lightning/blob/a28b4cd0c0bba30c21cae571e650877f66cf5588/pytorch_lightning/plugins/training_type/ddp_spawn.py#L259-L261
This bug does not affect users directly as long as they ignore the file that is saved. However, the file name does not reflect the contents of the file: the name is derived from the best model path, but the latest weights are not always the best ones.
Furthermore, the temp file never gets deleted.
To Reproduce
Run the boring model with Trainer(strategy="ddp_spawn", devices=2). The checkpoint directory will contain a file epoch=0-step=7.tmp_end.ckpt
Expected behavior
The filename is not based on the "best model path", and the file gets deleted after the state has been loaded in the main process.
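The expected behavior could be sketched roughly as follows. This is only an illustration, not the actual plugin code or the planned fix: the helper name transfer_state_via_temp_file is hypothetical, pickle stands in for torch.save / torch.load so the snippet runs without PyTorch, and the save and load steps run in one process here, whereas in the plugin they happen in the rank 0 worker and the main process respectively.

```python
import os
import pickle
import tempfile


def transfer_state_via_temp_file(state, directory):
    """Hypothetical helper: round-trip `state` through a temp file, then delete it."""
    # The name comes from mkstemp, so it is independent of any
    # checkpoint callback's best_model_path.
    fd, path = tempfile.mkstemp(suffix=".tmp_end.ckpt", dir=directory)
    try:
        # Worker (rank 0) side: write the latest weights.
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        # Main-process side: read the weights back.
        with open(path, "rb") as f:
            loaded = pickle.load(f)
    finally:
        # Always clean up, even if loading fails.
        os.remove(path)
    return loaded, path
```

With this shape, the checkpoint directory is left unchanged once the state has been loaded, and the temporary name makes no claim about which weights are "best".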
Additional context
Found during debugging in #10896.
A PR for this fix is in the works.
cc @awaelchli @ananthsub @ninginthecloud @justusschock @kaushikb11