resume_from_checkpoint should not start from scratch if ckpt is not found #7072
Comments
Not sure if there is a real reason why we have a warning instead of an error. Want to give this a try? Contributions are welcome.
I actually preferred the old semantics of this parameter, as they make my logic for an interruptible training job easier. Also, the docs weren't updated to reflect this change: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#init
@ia-davidpichler Would you be interested in opening a PR with the doc fix? 😄
Agreed that the prior behavior is better for accommodating interruptible training jobs (e.g. AWS Spot) and this feels like a regression in capability. Whether to error or warn should be exposed as an additional parameter, with it defaulting to something like
The discussion for the change is in the PR that closed this: #7075. I'll update the docs.
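For interruptible jobs like the ones mentioned in this thread, one possible workaround is to make the "resume if present, otherwise start fresh" decision explicitly on the caller's side. The following is only a minimal sketch: `checkpoint_if_exists` is a hypothetical helper name, not part of PyTorch Lightning, and the path in the usage comment is a placeholder.

```python
import os
from typing import Optional


def checkpoint_if_exists(path: str) -> Optional[str]:
    """Return `path` if it is an existing file, otherwise None.

    Hypothetical helper (not a Lightning API): passing None to
    `resume_from_checkpoint` means "start from scratch", so the
    fallback becomes an explicit choice made by the caller.
    """
    return path if os.path.isfile(path) else None


# Assumed usage; the checkpoint path is a placeholder:
# trainer = pl.Trainer(
#     resume_from_checkpoint=checkpoint_if_exists("checkpoints/last.ckpt"),
# )
```

This keeps the fallback visible in the training script rather than relying on how the Trainer handles a missing file.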
🐛 Bug
If the checkpoint file is not found at the location provided in the `resume_from_checkpoint` argument of `pl.Trainer`, training starts from scratch after displaying a `UserWarning` that is easy to miss.

To Reproduce
Use the following BoringModel.
Expected behavior
Should raise a `FileNotFoundError` and not start training from scratch.
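Until the Trainer itself behaves this way, the expected behavior can be approximated with a caller-side guard. This is a sketch under assumptions, not the fix that was merged: `require_checkpoint` is a hypothetical helper, and the path in the usage comment is a placeholder.

```python
from pathlib import Path
from typing import Optional, Union


def require_checkpoint(path: Optional[Union[str, Path]]) -> Optional[str]:
    """Validate a checkpoint path before handing it to the Trainer.

    Hypothetical helper (not a Lightning API). None passes through
    unchanged, since it means no resume was requested. A path that
    does not point to an existing file raises FileNotFoundError
    instead of letting training silently start from scratch.
    """
    if path is None:
        return None
    ckpt = Path(path)
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {ckpt}; refusing to train from scratch."
        )
    return str(ckpt)


# Assumed usage; the checkpoint path is a placeholder:
# trainer = pl.Trainer(
#     resume_from_checkpoint=require_checkpoint("checkpoints/last.ckpt"),
# )
```

With this guard, a mistyped or missing path fails before any training begins, rather than surfacing only as a warning in the logs.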