
[RFC] Tuner Revamp #11012


Closed
rohitgr7 opened this issue Dec 9, 2021 · 5 comments · Fixed by #11089, #13802, #15087 or #15100

rohitgr7 (Contributor) commented Dec 9, 2021

Proposed refactor

Issues

  1. The Tuner has caused a lot of issues in the past and we plan to refactor it. The primary source of problems is the snapshotting and restoration of Trainer state.
  2. Auto batch-size scaling doesn't work with validate/test/predict. Users might want to identify an optimal batch_size for inference as well, to better utilize their available compute resources.
  3. The LR Finder suggestion is not always optimal: its algorithm sometimes suggests a bad LR, and sometimes it doesn't suggest anything at all (see the sketch of the suggestion heuristic after this list).
  4. It doesn't work with Flash finetuning. For example, a user might want to compute a new LR or a new batch_size after a certain number of pre-training epochs, and that isn't easily configurable within a single call. One can achieve it with multiple calls, but since we support finetuning strategies within Flash, this might be worth adding.

PS: please add more issues above if you have any others regarding the Tuner.
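
For context on issue 3, the current suggestion roughly picks the LR at the point where the recorded loss curve falls most steeply. A minimal sketch of that heuristic (illustrative only, not the exact implementation; the function and argument names are made up) and of how it can come up empty:

import numpy as np

def suggest_lr(lrs, losses, skip_begin=10, skip_end=1):
    # Heuristic similar in spirit to the current suggestion:
    # pick the LR where the recorded loss curve falls most steeply.
    lrs = np.asarray(lrs)[skip_begin:-skip_end]
    losses = np.asarray(losses)[skip_begin:-skip_end]
    if losses.size < 2:
        return None
    gradients = np.gradient(losses)
    # If the loss never decreases in this window, the "steepest descent" point
    # is meaningless, which is one way the suggestion ends up being a bad LR
    # or nothing at all.
    if not np.any(gradients < 0):
        return None
    return float(lrs[int(np.argmin(gradients))])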

Possible solutions

  • We can subclass Trainer for the Tuner and give it its own independent state, so that we avoid any snapshotting and restoration of Trainer state and it stays independent:

class Tuner(Trainer):
    # create independent states
    # create custom loops

trainer.tuner(auto_scale_batch_size=..., auto_lr_find=...).fit()
trainer.tuner(auto_scale_batch_size=...).predict()

This solution could solve 1 and 2, but it probably can't be configured to solve 4.

  • Another solution, proposed by @Borda, is to implement the tuning features as callbacks, so that users can configure them independently; this would help resolve 4. However, it might not resolve 1 & 2 (see the sketch after this list).

  • Another solution @Borda and @SkafteNicki suggested, for now, is to move lr_finder to Bolts and experiment with it there, while improving scale_batch_size within Lightning. But this can't guarantee solving 4 either.
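
As a rough illustration of the callback direction (the callback names and hook bodies below are only a sketch, not an agreed API):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class BatchSizeFinder(Callback):  # hypothetical callback
    def on_fit_start(self, trainer, pl_module):
        # run the batch-size search with the callback's own state instead of
        # snapshotting/restoring Trainer state (issue 1)
        ...

class LearningRateFinder(Callback):  # hypothetical callback
    def __init__(self, milestones=(0,)):
        self.milestones = set(milestones)

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch in self.milestones:
            # re-run the LR search, e.g. after a few pre-training epochs (issue 4)
            ...

trainer = Trainer(callbacks=[BatchSizeFinder(), LearningRateFinder(milestones=(0, 10))])

Since each tuning feature would be its own callback, users could combine, reorder, or schedule them without adding new Trainer flags.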

Additional context

Other issues with the Tuner that we need to address right now:
#9625
#10560
#10557

Thanks to @Borda @SkafteNicki @ethanwharris @akihironitta for helping out with the discussion and possible solutions.

cc @justusschock @awaelchli @akihironitta @Borda

rohitgr7 added this to the 1.6 milestone Dec 9, 2021
lukasschmit commented

Not sure if there's an easy workaround currently, but I'd love an easy way of combining auto_scale_batch_size with accumulate_grad_batches: you'd pass a single effective_batch_size flag to the Trainer, it would find the largest batch size that fits, and then accumulate gradient batches to reach that target.
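
Roughly the arithmetic I have in mind (effective_batch_size is a hypothetical flag here, not an existing Trainer argument):

import math

def derive_accumulation(effective_batch_size, found_batch_size):
    # Accumulate enough batches so that
    # found_batch_size * accumulate_grad_batches >= effective_batch_size.
    return max(1, math.ceil(effective_batch_size / found_batch_size))

# e.g. a target of 512 with a found batch size of 96 gives
# accumulate_grad_batches = 6, i.e. an effective batch size of 576.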

tchaton (Contributor) commented Dec 10, 2021

Hey @rohitgr7,

IMO, subclassing the Trainer is not the way forward. I would rather work on better snapshotting and restoration of the Trainer state, and extend support to validation, test, and predict.

Furthermore, Flash fine-tuning relies on trainer.fit and simply adds a callback internally. If the Tuner refactor were flexible enough to run on any Trainer entry point, fine-tuning could also be supported.

Best,
T.C

justusschock (Member) commented

I agree with @tchaton. Personally, I would prefer having them as callbacks.

rohitgr7 (Contributor, Author) commented

Thank you @tchaton and @justusschock for your comments. Making them callbacks looks like a reasonable solution here. I'll start with that for now and see whether the tuner is compatible as a callback. That said, I can't think of a better snapshotting mechanism for it yet. Would love to discuss that further :)

rohitgr7 (Contributor, Author) commented

Hey @lukasschmit!
I couldn't quite follow what you mean here. Can you elaborate?
Within Lightning, accumulate_grad_batches just accumulates the gradients from the loss returned by training_step over several batches before the optimizer steps.
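
Conceptually it boils down to something like this (a simplified sketch, not Lightning's actual loop code; the model, data, and optimizer here are just stand-ins):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 4

for i in range(8):
    batch = torch.randn(2, 10)
    loss = model(batch).pow(2).mean()            # stand-in for the training_step loss
    (loss / accumulate_grad_batches).backward()  # gradients keep adding up
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()                         # one optimizer step every N batches
        optimizer.zero_grad()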

carmocca modified the milestones: 1.6, 1.7 Feb 1, 2022
carmocca moved this to In Progress in Frameworks Planning Feb 14, 2022
carmocca modified the milestones: pl:1.7, pl:future Jul 19, 2022
This was linked to pull requests Aug 13, 2022
rohitgr7 moved this from In Progress to In Review in Frameworks Planning Aug 22, 2022
Repository owner moved this from In Review to Done in Frameworks Planning Sep 27, 2022
rohitgr7 reopened this Sep 27, 2022
carmocca modified the milestones: pl:future, pl:1.8 Oct 10, 2022
rohitgr7 mentioned this issue Oct 12, 2022
This was linked to pull requests Oct 12, 2022