Redundant Validation When Resuming Training #11504

Closed
eladsegal opened this issue Jan 17, 2022 · 6 comments · Fixed by #11552
Labels: bug (Something isn't working), good first issue (Good for newcomers), loops (Related to the Loop API)

Comments

@eladsegal
Contributor

eladsegal commented Jan 17, 2022

🐛 Bug

When training is resumed from a checkpoint, the following happens for the first epoch of the resumed run:

  1. Validation
  2. Training
  3. Validation

To Reproduce

https://colab.research.google.com/drive/1UxXoTVFusy8xnFW-ZhodLbjzewSdKsHq?usp=sharing
The model in the notebook is trained for 2 epochs.
The printed output shows that in the original run both epochs execute correctly, with one validation per epoch after the training epoch completes.
When training is resumed from the checkpoint of the first epoch, the resumed epoch has two validation runs: one before training and one after.

Expected behavior

There should be only one validation run per epoch, and it should be after the training.
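
As an illustration of the expected ordering, here is a minimal sketch of a check over a recorded event log. The event names "train" and "val" and the helper `validation_order_ok` are hypothetical; they are not part of the notebook or of pytorch-lightning.

```python
from typing import List


def validation_order_ok(events: List[str]) -> bool:
    """Return True if the log is a sequence of ("train", "val") pairs, one pair per epoch."""
    if len(events) % 2 != 0:
        return False
    # Pair each even-indexed event with the event that follows it.
    pairs = zip(events[0::2], events[1::2])
    return all(pair == ("train", "val") for pair in pairs)


# Hypothetical logs mirroring the behavior described above.
original_run = ["train", "val", "train", "val"]  # two epochs, validation after training
resumed_run = ["val", "train", "val"]            # resumed epoch starts with validation

print(validation_order_ok(original_run))
print(validation_order_ok(resumed_run))
```

The first log passes and the second fails, because the resumed epoch begins with a validation run.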

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.6.0dev
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12

cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7

@eladsegal added the bug label Jan 17, 2022
@rohitgr7
Contributor

must be happening here: https://github.com/PyTorchLightning/pytorch-lightning/blob/20128166451e0700319608b677e4a62bad71224b/pytorch_lightning/loops/epoch/training_epoch_loop.py#L145-L147

batch_progress is set to 0 during init, so batch_idx here will be ready - 1 = -1. Ideally, it should be 0 here, I think.
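
The arithmetic can be sketched in plain Python. This is not the library's code: `should_check_val` is a hypothetical stand-in, and the modulo form of the check is an assumption about how the loop decides when to validate. The value of `val_check_batch` is also made up for illustration.

```python
def should_check_val(batch_idx: int, val_check_batch: int) -> bool:
    # Assumed form of the check: validate when (batch_idx + 1)
    # is a multiple of the validation interval.
    return (batch_idx + 1) % val_check_batch == 0


ready = 0               # progress counter right after init, as described above
batch_idx = ready - 1   # -1, because no batch has been marked ready yet
val_check_batch = 100   # hypothetical interval

# (-1 + 1) % 100 == 0, so the check fires before any training batch has run.
triggers_validation = should_check_val(batch_idx, val_check_batch)
print(batch_idx, triggers_validation)
```

Under that assumption, a batch_idx of -1 satisfies the check for any interval, which would explain a validation run at the start of the resumed epoch.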

@rohitgr7 added the loops label Jan 17, 2022
@carmocca
Contributor

carmocca commented Jan 17, 2022

Have you checked if it's the sanity check?

If it is, it got removed in #10785; however, the change is only in the master branch.

@rohitgr7
Contributor

@carmocca verified the issue on master. Also, the sanity check is turned off in the shared example.

@carmocca added this to the 1.5.x milestone Jan 17, 2022
@eladsegal
Contributor Author

@rohitgr7 I added max to batch_idx and total_batch_idx and it fixed the issue. Is it a valid solution or should it be solved somewhere deeper?

    @property
    def total_batch_idx(self) -> int:
        """Returns the current batch index (across epochs)"""
        # use `ready` instead of `completed` in case this is accessed after `completed` has been increased
        # but before the next `ready` increase
        return max(0, self.batch_progress.total.ready - 1)

    @property
    def batch_idx(self) -> int:
        """Returns the current batch index (within this epoch)"""
        # use `ready` instead of `completed` in case this is accessed after `completed` has been increased
        # but before the next `ready` increase
        return max(0, self.batch_progress.current.ready - 1)
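
To see the effect of the `max` clamp in isolation, here is a self-contained sketch. `Tracker`, `BatchProgress`, and `LoopSketch` are hypothetical stand-ins that only mimic the attribute names used in the snippet above; they are not the library's classes.

```python
from dataclasses import dataclass, field


@dataclass
class Tracker:
    ready: int = 0  # number of batches marked ready so far


@dataclass
class BatchProgress:
    total: Tracker = field(default_factory=Tracker)
    current: Tracker = field(default_factory=Tracker)


class LoopSketch:
    def __init__(self) -> None:
        self.batch_progress = BatchProgress()

    @property
    def unclamped_batch_idx(self) -> int:
        # Behavior without the clamp: -1 while nothing is ready.
        return self.batch_progress.current.ready - 1

    @property
    def batch_idx(self) -> int:
        # Behavior with the proposed clamp: never below 0.
        return max(0, self.batch_progress.current.ready - 1)


loop = LoopSketch()
print(loop.unclamped_batch_idx, loop.batch_idx)

loop.batch_progress.current.ready = 1
print(loop.unclamped_batch_idx, loop.batch_idx)
```

One consequence visible in the sketch: with the clamp, `ready == 0` and `ready == 1` both map to a batch index of 0, so the index alone no longer distinguishes "nothing started" from "first batch ready".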

@rohitgr7
Contributor

@eladsegal yes and no. It solves one issue but creates another one. In case val_check_interval=1, it will redirect to advance_end and start with validation, which might be incorrect. Need to think of a better solution.
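
A rough sketch of that concern, under two assumptions: that the validation check has the modulo form `(batch_idx + 1) % val_check_batch == 0`, and that an integer val_check_interval of 1 means validating after every batch. `should_check_val` is a hypothetical helper, not the library's function.

```python
def should_check_val(batch_idx: int, val_check_batch: int) -> bool:
    # Assumed form of the check (see the lead-in above).
    return (batch_idx + 1) % val_check_batch == 0


ready = 0
clamped_batch_idx = max(0, ready - 1)  # 0 with the proposed clamp

# Evaluate the check for a few hypothetical intervals.
results = {
    interval: should_check_val(clamped_batch_idx, interval)
    for interval in (1, 2, 100)
}
print(results)
```

With the clamp, the check no longer fires for intervals above 1, but it still fires when the interval is 1, which matches the "starts with validation" case mentioned above.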

@eladsegal
Contributor Author

I got a bit confused: I do use val_check_interval=1, and the problem of starting with validation is what I opened this issue for. The modification in my previous message did seem to fix the "starting with validation" problem.

Anyway, I see you already have a pull request, so never mind.
Thank you!


4 participants