
DDPSpawnPlugin generates a file based on the "best model path" #10933


Closed
awaelchli opened this issue Dec 3, 2021 · 0 comments · Fixed by #10934 or #10935

awaelchli commented Dec 3, 2021

🐛 Bug

In the DDPSpawn / TPUSpawn plugins we transfer the weights from rank 0 back to the main process. To do this, we save a checkpoint of the latest model weights and then load it in the main process. The file name is derived from the checkpoint callback's best_model_path:

https://github.com/PyTorchLightning/pytorch-lightning/blob/a28b4cd0c0bba30c21cae571e650877f66cf5588/pytorch_lightning/plugins/training_type/ddp_spawn.py#L259-L261
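For context, the naming logic behind that link boils down to something like the following (a paraphrased sketch, not a verbatim copy of the source; the helper name temp_weights_path is mine, in the plugin this is inline code):

```python
import re


def temp_weights_path(best_model_path: str) -> str:
    """Paraphrased naming logic: derive the temp file name from the *best*
    checkpoint path by swapping the extension, even though the file will
    hold the *latest* (not necessarily best) weights."""
    return re.sub(".ckpt", ".tmp_end.ckpt", best_model_path)


# e.g. best_model_path "epoch=0-step=7.ckpt" -> "epoch=0-step=7.tmp_end.ckpt"
print(temp_weights_path("epoch=0-step=7.ckpt"))
```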

This does not affect users directly as long as they ignore the file that gets saved. However, the file name does not reflect its contents: the latest weights are not necessarily the best ones!

Furthermore, the temp file never gets deleted.

To Reproduce

Run the BoringModel with Trainer(strategy="ddp_spawn", devices=2). The checkpoint directory will then contain a leftover file named
epoch=0-step=7.tmp_end.ckpt
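A self-contained repro sketch along those lines (assuming the 1.5-era Trainer API; the BoringModel here is a minimal inline stand-in rather than the one from the test helpers):

```python
import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size: int, length: int):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=1,
        limit_train_batches=8,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(32, 64), batch_size=8))
    # After fit() returns, look inside the checkpoint directory
    # (lightning_logs/version_*/checkpoints): next to the regular checkpoint
    # there is a leftover file ending in `.tmp_end.ckpt`.
```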

Expected behavior

The file name is not based on the "best model path", and the file gets deleted after the state has been loaded in the main process.
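One possible shape of the fix, sketched with plain torch/os calls (illustrative only; the actual change lives in the linked PRs and may differ):

```python
import os
import torch


def save_latest_weights(module: torch.nn.Module, dirpath: str) -> str:
    """In the spawned worker (rank 0): write the latest weights to a
    neutrally named temp file instead of one derived from best_model_path."""
    last_path = os.path.join(dirpath, ".temp.ckpt")  # hypothetical name
    torch.save(module.state_dict(), last_path)
    return last_path


def restore_latest_weights(module: torch.nn.Module, last_path: str) -> None:
    """In the main process: load the weights, then remove the temp file so
    no misleading artifact is left in the checkpoint directory."""
    module.load_state_dict(torch.load(last_path, map_location="cpu"))
    os.remove(last_path)
```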

Additional context

Found during debugging in #10896.
A PR for the fix is in the works.

cc @awaelchli @ananthsub @ninginthecloud @justusschock @kaushikb11

@awaelchli awaelchli added bug Something isn't working checkpointing Related to checkpointing strategy: ddp spawn labels Dec 3, 2021
@awaelchli awaelchli added this to the 1.5.x milestone Dec 3, 2021
@awaelchli awaelchli self-assigned this Dec 3, 2021
@awaelchli awaelchli added strategy: ddp DistributedDataParallel and removed strategy: ddp spawn labels Nov 4, 2023