
ddp_cpu breaks while looking for .module: ModuleAttributeError: 'BoringModel' object has no attribute 'module' #4356


Closed
carmocca opened this issue Oct 25, 2020 · 5 comments
Labels
bug Something isn't working help wanted Open to be worked on priority: 2 Low priority task won't fix This will not be worked on

Comments


carmocca commented Oct 25, 2020

🐛 Bug

https://colab.research.google.com/drive/1hMW-0sTTgK-r6xfdwuSDyRNJBH7YYm9V?usp=sharing

To Reproduce

Set num_processes=2 in the trainer without accelerator="ddp_cpu". I know this is an invalid combination, but a user of my library got confused by the error.

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: num_processes is only used for distributed_backend="ddp_cpu". Ignoring it.
  warnings.warn(*args, **kwargs)
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
  warnings.warn(*args, **kwargs)

---------------------------------------------------------------------------

ModuleAttributeError                      Traceback (most recent call last)

<ipython-input-12-1f9f6fbe4f6c> in <module>()
----> 1 test_x(tmpdir)

4 frames

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
    770                 return modules[name]
    771         raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
--> 772             type(self).__name__, name))
    773 
    774     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

ModuleAttributeError: 'BoringModel' object has no attribute 'module'
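For context on where this message comes from: torch.nn.Module.__getattr__ is only invoked after normal attribute lookup fails, and it then checks the module's internal dicts before raising. A minimal pure-Python sketch of that lookup behavior (a toy stand-in for illustration, not the actual torch implementation):

```python
class MiniModule:
    """Toy stand-in for torch.nn.Module's attribute lookup (illustrative only)."""

    def __init__(self):
        # nn.Module keeps submodules in an internal dict rather than plain attributes
        object.__setattr__(self, "_modules", {})

    def __getattr__(self, name):
        # Only reached when normal attribute lookup has already failed
        modules = object.__getattribute__(self, "_modules")
        if name in modules:
            return modules[name]
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name)
        )


model = MiniModule()
try:
    model.module  # trainer code expects a wrapper (e.g. DDP) exposing .module
except AttributeError as err:
    print(err)  # 'MiniModule' object has no attribute 'module'
```

Presumably the trainer accesses model.module as if the LightningModule were wrapped (e.g. by DistributedDataParallel), but with this invalid accelerator/num_processes combination no wrapping happened, so the lookup falls through to __getattr__ and raises.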

Expected behavior

The num_processes property is ignored, as stated in the warning:

UserWarning: num_processes is only used for distributed_backend="ddp_cpu". Ignoring it.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0+cu101
    • pytorch-lightning: 1.0.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
@carmocca carmocca added bug Something isn't working help wanted Open to be worked on labels Oct 25, 2020
@edenlightning

Thanks for the issue! Want to submit a fix?


carmocca commented Oct 30, 2020

I took a shot at it, but realized this is part of a larger issue.

There is not much documentation about the supported uses of num_processes (other than ddp_cpu). Looking at the code, it seems the following accelerator values can use num_processes>1:

  1. None: What does it default to? Execution goes through here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ebe3a31ddd82c616df6612cb880b0b3b13b9ecde/pytorch_lightning/accelerators/accelerator_connector.py#L301-L306 where the comment suggests that ddp_cpu is used, but that does not seem to be the case.
  2. ddp_spawn: with a very brief mention in the docs.
  3. ddp: another brief mention here. For the previous two, execution goes through here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ebe3a31ddd82c616df6612cb880b0b3b13b9ecde/pytorch_lightning/accelerators/accelerator_connector.py#L325-L328
  4. ddp_cpu: as expected.
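One way to make the failure mode clearer would be an early sanity check on the accelerator/num_processes combination. A hypothetical sketch (the supported set and the function name are assumptions for illustration, not Lightning's actual API):

```python
# Hypothetical: accelerator values assumed (per the list above) to support num_processes > 1
_MULTIPROCESS_ACCELERATORS = {"ddp_cpu", "ddp_spawn", "ddp"}


def validate_num_processes(accelerator, num_processes):
    """Fail fast with a clear message instead of a confusing ModuleAttributeError."""
    if num_processes > 1 and accelerator not in _MULTIPROCESS_ACCELERATORS:
        raise ValueError(
            f"num_processes={num_processes} is only supported with accelerator in "
            f"{sorted(_MULTIPROCESS_ACCELERATORS)}, got accelerator={accelerator!r}"
        )


validate_num_processes("ddp_cpu", 2)  # ok, no error
try:
    validate_num_processes(None, 2)  # the combination reported in this issue
except ValueError as err:
    print(err)
```

A check like this run at Trainer construction time would surface the misconfiguration immediately rather than deep inside attribute lookup during fit.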

So the error reported in this issue refers to 1., but 2. also seems to fail. Here is a test to check (it fails on master):

import pytest

from pytorch_lightning import Trainer
from tests.base import EvalModelTemplate


@pytest.mark.parametrize("accelerator", [None, "ddp_spawn"])
def test_trainer_num_processes_without_ddp_cpu(tmpdir, accelerator):
    trainer = Trainer(
        default_root_dir=tmpdir,
        weights_summary=None,
        logger=False,
        checkpoint_callback=False,
        progress_bar_refresh_rate=0,
        fast_dev_run=True,
        num_processes=2,
        accelerator=accelerator,
    )
    trainer.fit(EvalModelTemplate())

If what I'm saying is correct, it also means that this warning is wrong:
https://github.com/PyTorchLightning/pytorch-lightning/blob/ebe3a31ddd82c616df6612cb880b0b3b13b9ecde/pytorch_lightning/accelerators/accelerator_connector.py#L99-L100
and should be updated or removed.

Hopefully someone can clear up the expected behaviour and add sensible warnings/errors as appropriate, and also improve the docs on the uses of num_processes.

cc @s-rog @williamFalcon

@tchaton tchaton added the won't fix This will not be worked on label Nov 10, 2020
@stale stale bot closed this as completed Nov 20, 2020
@edenlightning edenlightning removed the won't fix This will not be worked on label Nov 30, 2020
@edenlightning edenlightning reopened this Nov 30, 2020
@Borda Borda added the good first issue Good for newcomers label Dec 1, 2020
@edenlightning edenlightning removed the good first issue Good for newcomers label Dec 14, 2020
@edenlightning edenlightning changed the title ModuleAttributeError: 'BoringModel' object has no attribute 'module' ddp_cpu breaks while looking for .module: ModuleAttributeError: 'BoringModel' object has no attribute 'module' Dec 14, 2020
@edenlightning edenlightning added the priority: 1 Medium priority task label Dec 14, 2020
@Borda Borda self-assigned this Jan 4, 2021

carmocca commented Feb 22, 2021

This will probably get cleaned up by the proposal here: #6090

@carmocca carmocca added priority: 2 Low priority task and removed priority: 1 Medium priority task labels Feb 22, 2021

stale bot commented Mar 25, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Mar 25, 2021
@carmocca

Closing in favor of #6090 which will clarify the accelerator arguments
