
Gradient clip norm is called before AMP's unscale leading to wrong gradients #9330


Closed
phizaz opened this issue Sep 5, 2021 · 3 comments · Fixed by #9606
Labels: bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments


phizaz commented Sep 5, 2021

🐛 Bug

Gradient clip norm is called before AMP's unscale leading to wrong gradients.

To Reproduce

This happens with:

trainer = pl.Trainer(
    gpus=1,
    precision=16,
    gradient_clip_val=1,
)

The gradient norm observed at training_step_end is far smaller than the clip value. My interpretation is that gradient clipping runs before AMP's unscale, so the subsequent unscale shrinks the gradients well below the clip norm.

https://colab.research.google.com/drive/1CXMo2JP_JmwG_YNrTwTG5S0pAsr41pGC?usp=sharing
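
For reference, native PyTorch AMP expects gradients to be unscaled before clipping; a minimal sketch of that ordering (the tiny model, optimizer, and batch below are illustrative, not taken from the Colab):

import torch
from torch import nn

# Sketch of the ordering native PyTorch AMP documents: unscale the gradients
# first so clip_grad_norm_ sees the true gradients rather than the loss-scaled ones.
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 10, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                               # 1. unscale first
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 2. then clip
scaler.step(optimizer)
scaler.update()

If clipping instead runs on the still-scaled gradients, the later unscale divides them by the loss scale and the resulting norm ends up far below gradient_clip_val, which matches the observation above.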

Expected behavior

Gradient norms should be close to 1 (the clip value). This can be verified by setting precision=32 and rerunning.
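
One way to check this (a sketch; total_grad_norm is a hypothetical helper, not part of Lightning or the Colab) is to compute the total L2 norm over the parameter gradients and compare it against gradient_clip_val:

import torch

def total_grad_norm(module: torch.nn.Module) -> torch.Tensor:
    # L2 norm over all parameter gradients of `module`.
    grads = [p.grad.detach().flatten()
             for p in module.parameters() if p.grad is not None]
    return torch.cat(grads).norm(2) if grads else torch.tensor(0.0)

# Called from a LightningModule hook, e.g.:
#     def on_after_backward(self):
#         self.log("total_grad_norm", total_grad_norm(self))
# With precision=32 the logged value stays near the clip value; with
# precision=16 on the affected versions it was reported to come out far smaller.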

Environment

* CUDA:
	- GPU:
		- Tesla K80
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.9.0+cu102
	- pytorch-lightning: 1.4.5
	- tqdm:              4.62.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.7.11
	- version:           #1 SMP Sat Jun 5 09:50:34 PDT 2021
cowwoc (Contributor) commented Oct 3, 2021

I believe this is fixed in release 1.4.9: https://github.com/PyTorchLightning/pytorch-lightning/releases/tag/1.4.9

Can this issue be closed?

carmocca (Contributor) commented Oct 6, 2021

Keeping it open, as the fix is not in master yet.


bergen commented Oct 12, 2021

Has this issue been fixed for TPUs? I am seeing a discrepancy between single-core and multi-core TPU training on version 1.4.9.

@tchaton tchaton added this to the v1.5 milestone Oct 25, 2021