🐛 Bug
I trained a large model using native AMP, but the loss converged very slowly. After a careful check of the backward and optimization code, I found that `clip_gradients` is executed right after `backward`, while `scaler.unscale_` is only called in `pre_optimization_step`. According to the PyTorch documentation, gradients must be unscaled before they are clipped, so the order of these two calls should be swapped. As it stands, using `gradient_clip_val` together with native AMP can lead to a very flat learning curve. I hope this can be fixed.
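
For reference, the order recommended in the PyTorch AMP docs looks roughly like the sketch below. The names `model`, `optimizer`, `dataloader`, `criterion`, and `max_norm` are generic placeholders, not Lightning internals; the point is only that `scaler.unscale_` runs before `clip_grad_norm_`:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)

    # Backward pass on the scaled loss.
    scaler.scale(loss).backward()

    # Unscale gradients *before* clipping, so the clip threshold
    # applies to the true gradient magnitudes rather than scaled ones.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    # step() skips the update if any gradient is inf/nan.
    scaler.step(optimizer)
    scaler.update()
```

If clipping happens on the still-scaled gradients, the effective clip threshold shrinks by the scale factor, which would explain the very slow convergence observed above.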