RuntimeError when running basic GAN model (from tutorial at lightning.ai) with DDP #20328
Comments
Set strategy="ddp_find_unused_parameters_true", as the error log suggests.
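For reference, a minimal sketch of what that setting could look like. The helper names `build_trainer_kwargs` and `make_trainer` are hypothetical, and `devices=4` is only an assumption based on the four GPUs listed in the environment section; the original Trainer arguments are not shown in this issue.

```python
def build_trainer_kwargs(num_gpus: int = 4) -> dict:
    """Collect Trainer arguments for multi-GPU DDP training.

    num_gpus=4 is an assumption (four GPUs appear in the reported environment).
    """
    return {
        "accelerator": "gpu",
        "devices": num_gpus,
        # Registered strategy alias in Lightning 2.x; it enables DDP's
        # find_unused_parameters option.
        "strategy": "ddp_find_unused_parameters_true",
    }


def make_trainer(**overrides):
    """Hypothetical helper: build a Trainer from the kwargs above."""
    # Imported lazily so the kwargs can be inspected without Lightning installed.
    import lightning as L

    return L.Trainer(**{**build_trainer_kwargs(), **overrides})


trainer_kwargs = build_trainer_kwargs()
print(trainer_kwargs)
```

Passing `strategy=DDPStrategy(find_unused_parameters=True)` from `lightning.pytorch.strategies` should be an equivalent way to request the same behaviour.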
Doesn't setting |
I checked the code you provided and looked for the "unused params" following https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5. It looks like the main reason is that the discriminator and generator compute their losses separately, but the LightningModule wraps them as a single model. Following the debug method I mentioned above:
# adversarial loss is binary cross-entropy
g_loss = self.adversarial_loss(self.discriminator(self.generated_imgs), valid)
self.log("g_loss", g_loss, prog_bar=True)
self.manual_backward(g_loss)
for name, param in self.named_parameters():
    if param.grad is None:
        print(name)
optimizer_g.step()
optimizer_g.zero_grad()
self.untoggle_optimizer(optimizer_g)
# train discriminator
# Measure discriminator's ability to classify real from generated samples
self.toggle_optimizer(optimizer_d)
# how well can it label as real?
valid = torch.ones(imgs.size(0), 1)
valid = valid.type_as(imgs)
real_loss = self.adversarial_loss(self.discriminator(imgs), valid)
# how well can it label as fake?
fake = torch.zeros(imgs.size(0), 1)
fake = fake.type_as(imgs)
fake_loss = self.adversarial_loss(self.discriminator(self.generated_imgs.detach()), fake)
# discriminator loss is the average of these
d_loss = (real_loss + fake_loss) / 2
self.log("d_loss", d_loss, prog_bar=True)
self.manual_backward(d_loss)
for name, param in self.named_parameters():
    if param.grad is None:
        print(name)
optimizer_d.step()
optimizer_d.zero_grad()
self.untoggle_optimizer(optimizer_d)
With "ddp_find_unused_parameters_true" set, I got the output:
If you call backward once on the summed loss:
self.manual_backward(d_loss + g_loss)
self.toggle_optimizer(optimizer_d)
optimizer_d.step()
optimizer_d.zero_grad()
self.untoggle_optimizer(optimizer_d)
self.toggle_optimizer(optimizer_g)
optimizer_g.step()
optimizer_g.zero_grad()
self.untoggle_optimizer(optimizer_g)
then the plain "ddp" setting works correctly.
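The `param.grad is None` check above can be reproduced outside Lightning. The following is a minimal plain-PyTorch sketch, not the tutorial's code: `ToyGAN` and `params_without_grad` are hypothetical names, the layers are tiny stand-ins, and setting `requires_grad_(False)` by hand only mimics what `toggle_optimizer` does for the parameters of the other optimizer.

```python
import torch
from torch import nn

torch.manual_seed(0)


class ToyGAN(nn.Module):
    """Hypothetical stand-in: one module holding both sub-networks."""

    def __init__(self):
        super().__init__()
        self.generator = nn.Linear(4, 8)
        self.discriminator = nn.Linear(8, 1)


def params_without_grad(module):
    """Names of parameters that received no gradient in the last backward."""
    return [name for name, p in module.named_parameters() if p.grad is None]


model = ToyGAN()
bce = nn.BCEWithLogitsLoss()
z = torch.randn(16, 4)
real = torch.randn(16, 8)
valid = torch.ones(16, 1)
fake = torch.zeros(16, 1)

# Generator step: freeze the discriminator's parameters by hand
# (an approximation of toggle_optimizer(optimizer_g)).
for p in model.discriminator.parameters():
    p.requires_grad_(False)
generated = model.generator(z)
g_loss = bce(model.discriminator(generated), valid)
g_loss.backward()
unused_in_g_step = params_without_grad(model)
for p in model.discriminator.parameters():
    p.requires_grad_(True)
model.zero_grad(set_to_none=True)

# Discriminator step: detach() cuts the graph back to the generator.
real_loss = bce(model.discriminator(real), valid)
fake_loss = bce(model.discriminator(generated.detach()), fake)
d_loss = (real_loss + fake_loss) / 2
d_loss.backward()
unused_in_d_step = params_without_grad(model)

print("no grad after generator step:", unused_in_g_step)
print("no grad after discriminator step:", unused_in_d_step)
```

In each backward pass, half of the wrapped module's parameters receive no gradient, which is the situation DDP complains about unless `find_unused_parameters` is enabled.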
As discussed further in this issue, I need to correct my suggestion above: adding up the generator and discriminator losses and calling backpropagation only once is completely wrong, even if it does "solve" the problem of needing to set "ddp_find_unused_parameters_true".
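One way to see why the summed loss is wrong is to compare the discriminator's gradient in both setups. This is a rough plain-PyTorch sketch with hypothetical toy layers, not the tutorial model: with a single backward on `d_loss + g_loss` and nothing frozen, the generator loss also pushes gradient into the discriminator's weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-ins for the two networks (hypothetical sizes).
generator = nn.Linear(4, 8)
discriminator = nn.Linear(8, 1)
bce = nn.BCEWithLogitsLoss()

z = torch.randn(16, 4)
real = torch.randn(16, 8)
valid = torch.ones(16, 1)
fake = torch.zeros(16, 1)


def discriminator_loss(generated):
    """Average of the real and fake losses, with the generator detached."""
    real_loss = bce(discriminator(real), valid)
    fake_loss = bce(discriminator(generated.detach()), fake)
    return (real_loss + fake_loss) / 2


# Reference: discriminator gradient from d_loss alone.
generated = generator(z)
discriminator_loss(generated).backward()
d_grad_separate = discriminator.weight.grad.clone()
discriminator.zero_grad()
generator.zero_grad()

# Summed variant: one backward on d_loss + g_loss.
generated = generator(z)
g_loss = bce(discriminator(generated), valid)
(discriminator_loss(generated) + g_loss).backward()
d_grad_summed = discriminator.weight.grad.clone()

# The g_loss term leaks into the discriminator's gradient.
grads_differ = not torch.allclose(d_grad_separate, d_grad_summed)
print("discriminator gradient changed by summing:", grads_differ)
```

In the summed variant the discriminator is partly trained to label generated samples as real, which works against its own objective.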
Bug description
I am trying to train a GAN model on multiple GPUs using DDP. I followed the tutorial at https://lightning.ai/docs/pytorch/stable/notebooks/lightning_examples/basic-gan.html, changing the arguments to Trainer to
Running the script raises a RuntimeError as follows:
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
Environment
Current environment
- GPU:
  - NVIDIA L40S
  - NVIDIA L40S
  - NVIDIA L40S
  - NVIDIA L40S
- available: True
- version: 12.1
- lightning: 2.4.0
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.1
- torchmetrics: 1.4.2
- torchvision: 0.19.1
- aiohappyeyeballs: 2.4.3
- aiohttp: 3.10.9
- aiosignal: 1.3.1
- async-timeout: 4.0.3
- attrs: 24.2.0
- autocommand: 2.2.2
- backports.tarfile: 1.2.0
- cxr-training: 0.1.0
- filelock: 3.16.1
- frozenlist: 1.4.1
- fsspec: 2024.9.0
- idna: 3.10
- importlib-metadata: 8.0.0
- importlib-resources: 6.4.0
- inflect: 7.3.1
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jinja2: 3.1.4
- lightning: 2.4.0
- lightning-utilities: 0.11.7
- markupsafe: 3.0.1
- more-itertools: 10.3.0
- mpmath: 1.3.0
- multidict: 6.1.0
- networkx: 3.3
- numpy: 2.1.2
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.6.77
- nvidia-nvtx-cu12: 12.1.105
- packaging: 24.1
- pillow: 10.4.0
- pip: 24.2
- platformdirs: 4.2.2
- propcache: 0.2.0
- pytorch-lightning: 2.4.0
- pyyaml: 6.0.2
- setuptools: 75.1.0
- sympy: 1.13.3
- tomli: 2.0.1
- torch: 2.4.1
- torchmetrics: 1.4.2
- torchvision: 0.19.1
- tqdm: 4.66.5
- triton: 3.0.0
- typeguard: 4.3.0
- typing-extensions: 4.12.2
- wheel: 0.44.0
- yarl: 1.14.0
- zipp: 3.19.2
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.10.0
- release: 5.15.0-1063-nvidia
- version: #64-Ubuntu SMP Fri Aug 9 17:13:45 UTC 2024
More info
No response