
[RFC] Tuner Revamp #11012


Closed
rohitgr7 opened this issue Dec 9, 2021 · 5 comments · Fixed by #11089, #13802, #15087 or #15100

rohitgr7 (Contributor) commented Dec 9, 2021

Proposed refactor

Issues

  1. The Tuner has caused a lot of issues in the past and we plan to refactor it. The primary source of problems is the snapshotting and restoration of Trainer state.
  2. Auto batch-size scaling doesn't work with validate/test/predict. Users might want to identify an optimal batch_size for inference as well, to better utilize their available compute resources.
  3. The LR Finder suggestion is not always optimal: its algorithm sometimes suggests a bad LR, and sometimes it doesn't suggest anything at all (see the sketch of the suggestion heuristic after this list).
  4. It doesn't work with Flash finetuning. For example, a user might want to compute a new LR or a new batch_size after a certain number of pre-training epochs, and that isn't easily configurable within a single call. One can achieve it with multiple calls, but since we support finetuning strategies within Flash, this might be worth adding.

PS: please add more issues above if you have any others regarding the Tuner.
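
For context on issue 3, the current suggestion roughly picks the LR at the point where the recorded loss curve falls most steeply. A minimal sketch of that heuristic (illustrative only, not the exact implementation; the function and argument names are made up) and of how it can come up empty:

import numpy as np

def suggest_lr(lrs, losses, skip_begin=10, skip_end=1):
    # Heuristic similar in spirit to the current suggestion:
    # pick the LR where the recorded loss curve falls most steeply.
    lrs = np.asarray(lrs)[skip_begin:-skip_end]
    losses = np.asarray(losses)[skip_begin:-skip_end]
    if losses.size < 2:
        return None
    gradients = np.gradient(losses)
    # If the loss never decreases in this window, the "steepest descent" point
    # is meaningless, which is one way the suggestion ends up being a bad LR
    # or nothing at all.
    if not np.any(gradients < 0):
        return None
    return float(lrs[int(np.argmin(gradients))])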

Possible solutions

  • We can subclass Trainer for the Tuner and give it its own independent state, so that we avoid any snapshotting and restoration of Trainer state and it stays independent:

class Tuner(Trainer):
    # create independent states
    # create custom loops

trainer.tuner(auto_scale_batch_size=..., auto_lr_find=...).fit()
trainer.tuner(auto_scale_batch_size=...).predict()

This solution could solve 1 and 2, but it probably can't be configured to solve 4.

  • Another solution, proposed by @Borda, is to implement the tuning features as callbacks, so that users can configure them independently; this would help resolve 4. However, it might not resolve 1 & 2 (see the sketch after this list).

  • Another solution @Borda and @SkafteNicki suggested, for now, is to move lr_finder to Bolts and experiment with it there, while improving scale_batch_size within Lightning. But this can't guarantee solving 4 either.
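
As a rough illustration of the callback direction (the callback names and hook bodies below are only a sketch, not an agreed API):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class BatchSizeFinder(Callback):  # hypothetical callback
    def on_fit_start(self, trainer, pl_module):
        # run the batch-size search with the callback's own state instead of
        # snapshotting/restoring Trainer state (issue 1)
        ...

class LearningRateFinder(Callback):  # hypothetical callback
    def __init__(self, milestones=(0,)):
        self.milestones = set(milestones)

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch in self.milestones:
            # re-run the LR search, e.g. after a few pre-training epochs (issue 4)
            ...

trainer = Trainer(callbacks=[BatchSizeFinder(), LearningRateFinder(milestones=(0, 10))])

Since each tuning feature would be its own callback, users could combine, reorder, or schedule them without adding new Trainer flags.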

Additional context

Other issues with the Tuner that we need to address right now:
#9625
#10560
#10557

Thanks to @Borda @SkafteNicki @ethanwharris @akihironitta for helping out with the discussion and possible solutions.

cc @justusschock @awaelchli @akihironitta @Borda

rohitgr7 added this to the 1.6 milestone Dec 9, 2021
lukasschmit commented

Not sure if there's an easy workaround currently, but I'd love an easy way of combining auto_scale_batch_size with accumulate_grad_batches: you'd pass a single effective_batch_size flag to the Trainer, it would find the largest batch size that fits, and then accumulate gradient batches to reach that target.
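
Roughly the arithmetic I have in mind (effective_batch_size is a hypothetical flag here, not an existing Trainer argument):

import math

def derive_accumulation(effective_batch_size, found_batch_size):
    # Accumulate enough batches so that
    # found_batch_size * accumulate_grad_batches >= effective_batch_size.
    return max(1, math.ceil(effective_batch_size / found_batch_size))

# e.g. a target of 512 with a found batch size of 96 gives
# accumulate_grad_batches = 6, i.e. an effective batch size of 576.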

tchaton (Contributor) commented Dec 10, 2021

Hey @rohitgr7,

IMO, subclassing the Trainer is not the way forward. I would rather work on better snapshotting and restoration of the Trainer state, and extend support to validation, test, and predict.

Furthermore, Flash fine-tuning relies on trainer.fit and simply adds a callback internally. If the Tuner refactor were flexible enough to run on any Trainer entry point, fine-tuning could also be supported.

Best,
T.C

justusschock (Member) commented

I agree with @tchaton. Personally, I would prefer having them as callbacks.

rohitgr7 (Contributor, Author) commented

Thank you @tchaton and @justusschock for your comments. Making them callbacks looks like a reasonable solution here. I'll start with that for now and see whether the tuner is compatible as a callback. That said, I can't think of a better snapshotting mechanism for it yet. Would love to discuss that further :)

rohitgr7 (Contributor, Author) commented

Hey @lukasschmit!
I couldn't quite follow what you mean here. Can you elaborate?
Within Lightning, accumulate_grad_batches just accumulates the gradients from the loss returned by training_step over several batches before the optimizer steps.
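
Conceptually it boils down to something like this (a simplified sketch, not Lightning's actual loop code; the model, data, and optimizer here are just stand-ins):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 4

for i in range(8):
    batch = torch.randn(2, 10)
    loss = model(batch).pow(2).mean()            # stand-in for the training_step loss
    (loss / accumulate_grad_batches).backward()  # gradients keep adding up
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()                         # one optimizer step every N batches
        optimizer.zero_grad()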

carmocca modified the milestones: 1.6, 1.7 Feb 1, 2022
carmocca moved this to In Progress in Frameworks Planning Feb 14, 2022
carmocca modified the milestones: pl:1.7, pl:future Jul 19, 2022
This was linked to pull requests Aug 13, 2022
rohitgr7 moved this from In Progress to In Review in Frameworks Planning Aug 22, 2022
Repository owner moved this from In Review to Done in Frameworks Planning Sep 27, 2022
rohitgr7 reopened this Sep 27, 2022
carmocca modified the milestones: pl:future, pl:1.8 Oct 10, 2022
rohitgr7 mentioned this issue Oct 12, 2022
This was linked to pull requests Oct 12, 2022