Redundant Validation When Resuming Training #11504
Comments
must be happening here: https://github.com/PyTorchLightning/pytorch-lightning/blob/20128166451e0700319608b677e4a62bad71224b/pytorch_lightning/loops/epoch/training_epoch_loop.py#L145-L147. batch_progress is set to 0 during init and batch_idx here will be
Have you checked if it's the sanity check? If it is, it got removed in #10785. However, the change is only in the master branch.
@carmocca verified the issue on master. Also, the sanity check is turned off in the shared example.
@rohitgr7 I added
@eladsegal yes and no. It solves one issue but creates another one. In case val_check_interval=1, it will redirect to advance_end and start with validation, which might be incorrect. Need to think of a better solution.
I got a bit confused: I do use Anyway, I see you already have a pull request, so never mind.
🐛 Bug
When training is resumed from a checkpoint, the first epoch of the resumed run performs validation twice: once before the training epoch and once after it.
To Reproduce
https://colab.research.google.com/drive/1UxXoTVFusy8xnFW-ZhodLbjzewSdKsHq?usp=sharing
The model in the notebook is trained for 2 epochs.
In the prints, you can see that for the original training both epochs are run correctly with one validation per epoch, after the training epoch is completed.
When training is resumed from the checkpoint of the first epoch, it can be seen that the epoch has two validation runs, before and after the training.
Expected behavior
There should be only one validation run per epoch, and it should be after the training.
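The expected behavior above can be written down as a small check over a recorded event log. This is only a sketch: the function name `find_validation_anomalies`, the event names `"train_end"` and `"val"`, and the epoch numbers are hypothetical and not part of Lightning. The events could be collected by hand from the prints in the notebook, or by any logging of your choice.

```python
from collections import defaultdict


def find_validation_anomalies(events):
    """Return a list of (epoch, problem) for an ordered event log.

    `events` is an ordered list of (epoch, kind) tuples, where kind is
    "train_end" (training part of the epoch finished) or "val"
    (a validation run started). These names are illustrative only.
    """
    val_counts = defaultdict(int)
    train_done = set()
    problems = []

    for epoch, kind in events:
        if kind == "train_end":
            train_done.add(epoch)
        elif kind == "val":
            val_counts[epoch] += 1
            if epoch not in train_done:
                # Validation started before the epoch's training completed.
                problems.append((epoch, "validation ran before training finished"))
            elif val_counts[epoch] > 1:
                # A second (or later) validation run in the same epoch.
                problems.append((epoch, "more than one validation run"))

    # Epochs that trained but never validated also break the expectation.
    for epoch in sorted(train_done):
        if val_counts[epoch] == 0:
            problems.append((epoch, "no validation run"))

    return problems


# Original run: one validation per epoch, after training.
original_run = [(0, "train_end"), (0, "val"), (1, "train_end"), (1, "val")]

# Resumed run as described in this issue: validation before and after training.
resumed_run = [(1, "val"), (1, "train_end"), (1, "val")]

print(find_validation_anomalies(original_run))
print(find_validation_anomalies(resumed_run))
```

An empty list means the log matches the expected behavior. The resumed log is flagged twice for the same epoch, once for each way it departs from "one validation run, after the training".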
Environment
cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7