Redundant Validation When Resuming Training #11504

eladsegal · 2022-01-17T05:33:08Z

🐛 Bug

When training is resumed from a checkpoint, the first epoch of the resumed run goes through these steps, in order:

1. Validation
2. Training
3. Validation

To Reproduce

https://colab.research.google.com/drive/1UxXoTVFusy8xnFW-ZhodLbjzewSdKsHq?usp=sharing
The model in the notebook is trained for 2 epochs.
The printed output shows that in the original run both epochs execute correctly: one validation run per epoch, after the training epoch completes.
When training is resumed from the checkpoint of the first epoch, the resumed epoch has two validation runs, one before training and one after.

Expected behavior

There should be only one validation run per epoch, and it should be after the training.

Environment

CUDA:
  - GPU:
    - Tesla T4
  - available: True
  - version: 11.1
Packages:
  - numpy: 1.19.5
  - pyTorch_debug: False
  - pyTorch_version: 1.10.0+cu111
  - pytorch-lightning: 1.6.0dev
  - tqdm: 4.62.3
System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.12

cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7

rohitgr7 · 2022-01-17T12:31:59Z

This must be happening here: https://github.com/PyTorchLightning/pytorch-lightning/blob/20128166451e0700319608b677e4a62bad71224b/pytorch_lightning/loops/epoch/training_epoch_loop.py#L145-L147

batch_progress is set to 0 during init, so batch_idx here will be ready - 1 = -1. Ideally, it should be 0 here, I think.
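The arithmetic in the comment above can be sketched in isolation. This is a toy model, not Lightning's code: `ToyProgress` and both helper functions are hypothetical names, and the `(idx + 1) % val_check_batch == 0` condition is an assumed form of the validation check, used only to show why an index of -1 could look like the end of a validation interval.

```python
from dataclasses import dataclass


@dataclass
class ToyProgress:
    """Hypothetical stand-in for a progress tracker (not Lightning's class)."""

    ready: int = 0  # number of batches that have been started


def batch_idx(progress: ToyProgress) -> int:
    # Follows the `ready - 1` convention described in the comment above.
    return progress.ready - 1


def is_val_check_batch(idx: int, val_check_batch: int) -> bool:
    # ASSUMPTION: the check has this modulo form; the real condition may differ.
    return (idx + 1) % val_check_batch == 0


fresh = ToyProgress()  # counters start at 0, as on init
idx_at_start = batch_idx(fresh)  # 0 - 1
fires_at_start = is_val_check_batch(idx_at_start, val_check_batch=100)
print(idx_at_start, fires_at_start)
```

Under these assumptions the index before any batch has run is -1, so `idx + 1` is 0 and the modulo check passes for any interval, which would be consistent with a validation run happening before training on resume.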

carmocca · 2022-01-17T13:38:11Z

Have you checked if it's the sanity check?

If it is, it was removed in #10785; however, that change is only in the master branch.

rohitgr7 · 2022-01-17T13:50:26Z

@carmocca I verified the issue on master. Also, the sanity check is turned off in the shared example.

eladsegal · 2022-01-17T18:16:00Z

@rohitgr7 I added max to batch_idx and total_batch_idx and it fixed the issue. Is it a valid solution or should it be solved somewhere deeper?

    @property
    def total_batch_idx(self) -> int:
        """Returns the current batch index (across epochs)"""
        # use `ready` instead of `completed` in case this is accessed after `completed` has been increased
        # but before the next `ready` increase
        return max(0, self.batch_progress.total.ready - 1)

    @property
    def batch_idx(self) -> int:
        """Returns the current batch index (within this epoch)"""
        # use `ready` instead of `completed` in case this is accessed after `completed` has been increased
        # but before the next `ready` increase
        return max(0, self.batch_progress.current.ready - 1)

rohitgr7 · 2022-01-20T14:38:43Z

@eladsegal Yes and no: it solves one issue but creates another. With val_check_interval=1, it will redirect to advance_end and start with validation, which might be incorrect. I need to think of a better solution.
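One possible reading of this concern, sketched as a toy model rather than Lightning's code: clamping the index to 0 avoids the -1 case, but if validation is checked every single batch, index 0 still satisfies the check before any training batch has run. The function names are hypothetical, the modulo condition is an assumed form of the check, and `val_check_interval=1` is assumed here to mean the integer form (validate every batch).

```python
def clamped_batch_idx(ready: int) -> int:
    # The `max(0, ...)` clamp proposed earlier in the thread.
    return max(0, ready - 1)


def is_val_check_batch(idx: int, val_check_batch: int) -> bool:
    # ASSUMPTION: the check has this modulo form; the real condition may differ.
    return (idx + 1) % val_check_batch == 0


idx_on_resume = clamped_batch_idx(ready=0)  # clamped from -1 up to 0

# With a validation interval longer than one batch, the clamp avoids the early validation.
fires_every_100 = is_val_check_batch(idx_on_resume, val_check_batch=100)

# With validation after every batch, the check still passes at index 0.
fires_every_1 = is_val_check_batch(idx_on_resume, val_check_batch=1)

print(idx_on_resume, fires_every_100, fires_every_1)
```

In this sketch the clamp fixes the case where the interval spans many batches but not the every-batch case, which matches the "solves one issue but creates another" description only under the stated assumptions.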

eladsegal · 2022-01-21T00:00:34Z

I got a bit confused: I do use val_check_interval=1, and the problem of starting with validation is what I opened this issue for. The modification in my previous message did seem to fix the "starting with validation" problem.

Anyway, I see you already have a pull request, so never mind.
Thank you!

eladsegal added the bug (Something isn't working) label Jan 17, 2022

rohitgr7 added the loops (Related to the Loop API) label Jan 17, 2022

carmocca assigned rohitgr7 Jan 17, 2022

carmocca added this to the 1.5.x milestone Jan 17, 2022

rohitgr7 mentioned this issue Jan 20, 2022: Fix val_loop run on restart #11552 (Merged, 12 tasks)

tchaton added the good first issue (Good for newcomers) label Jan 21, 2022

rohitgr7 closed this as completed in #11552 Feb 2, 2022

GreenfishK mentioned this issue Feb 23, 2024: Wrong calculations of Prec, Rec and F1 when training resumes from the last checkpoint GreenfishK/Grapher#4 (Open)

PiotrDabkowski mentioned this issue Sep 12, 2024: Validation is incorrectly run on resume #20277 (Open)
