
ddp_cpu breaks while looking for .module: ModuleAttributeError: 'BoringModel' object has no attribute 'module' #4356


Closed
carmocca opened this issue Oct 25, 2020 · 5 comments
Labels
bug Something isn't working help wanted Open to be worked on priority: 2 Low priority task won't fix This will not be worked on

Comments


carmocca commented Oct 25, 2020

🐛 Bug

https://colab.research.google.com/drive/1hMW-0sTTgK-r6xfdwuSDyRNJBH7YYm9V?usp=sharing

To Reproduce

Set num_processes=2 in the trainer without accelerator="ddp_cpu". I know this is an invalid combination, but a user of my library got confused by the error.

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: num_processes is only used for distributed_backend="ddp_cpu". Ignoring it.
  warnings.warn(*args, **kwargs)
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
  warnings.warn(*args, **kwargs)

---------------------------------------------------------------------------

ModuleAttributeError                      Traceback (most recent call last)

<ipython-input-12-1f9f6fbe4f6c> in <module>()
----> 1 test_x(tmpdir)

4 frames

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
    770                 return modules[name]
    771         raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
--> 772             type(self).__name__, name))
    773 
    774     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

ModuleAttributeError: 'BoringModel' object has no attribute 'module'
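For context on where this message comes from: torch.nn.Module.__getattr__ is only invoked after normal attribute lookup fails, and it then checks the module's internal dicts before raising. A minimal pure-Python sketch of that lookup behavior (a toy stand-in for illustration, not the actual torch implementation):

```python
class MiniModule:
    """Toy stand-in for torch.nn.Module's attribute lookup (illustrative only)."""

    def __init__(self):
        # nn.Module keeps submodules in an internal dict rather than plain attributes
        object.__setattr__(self, "_modules", {})

    def __getattr__(self, name):
        # Only reached when normal attribute lookup has already failed
        modules = object.__getattribute__(self, "_modules")
        if name in modules:
            return modules[name]
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name)
        )


model = MiniModule()
try:
    model.module  # trainer code expects a wrapper (e.g. DDP) exposing .module
except AttributeError as err:
    print(err)  # 'MiniModule' object has no attribute 'module'
```

Presumably the trainer accesses model.module as if the LightningModule were wrapped (e.g. by DistributedDataParallel), but with this invalid accelerator/num_processes combination no wrapping happened, so the lookup falls through to __getattr__ and raises.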

Expected behavior

The num_processes property is ignored, as stated in the warning:

UserWarning: num_processes is only used for distributed_backend="ddp_cpu". Ignoring it.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0+cu101
    • pytorch-lightning: 1.0.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
@carmocca carmocca added bug Something isn't working help wanted Open to be worked on labels Oct 25, 2020
@edenlightning

Thanks for the issue! Want to submit a fix?


carmocca commented Oct 30, 2020

I took a shot at it, but realized this is part of a larger issue.

There is not much documentation about the supported uses of num_processes (other than ddp_cpu). Looking at the code, it seems the following accelerator values can use num_processes>1:

  1. None: What does it default to? Execution goes through here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ebe3a31ddd82c616df6612cb880b0b3b13b9ecde/pytorch_lightning/accelerators/accelerator_connector.py#L301-L306 where the comment suggests that ddp_cpu is used, but that does not seem to be the case.
  2. ddp_spawn: with a very brief mention in the docs.
  3. ddp: another brief mention here. For the previous two, execution goes through here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ebe3a31ddd82c616df6612cb880b0b3b13b9ecde/pytorch_lightning/accelerators/accelerator_connector.py#L325-L328
  4. ddp_cpu: as expected.
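One way to make the failure mode clearer would be an early sanity check on the accelerator/num_processes combination. A hypothetical sketch (the supported set and the function name are assumptions for illustration, not Lightning's actual API):

```python
# Hypothetical: accelerator values assumed (per the list above) to support num_processes > 1
_MULTIPROCESS_ACCELERATORS = {"ddp_cpu", "ddp_spawn", "ddp"}


def validate_num_processes(accelerator, num_processes):
    """Fail fast with a clear message instead of a confusing ModuleAttributeError."""
    if num_processes > 1 and accelerator not in _MULTIPROCESS_ACCELERATORS:
        raise ValueError(
            f"num_processes={num_processes} is only supported with accelerator in "
            f"{sorted(_MULTIPROCESS_ACCELERATORS)}, got accelerator={accelerator!r}"
        )


validate_num_processes("ddp_cpu", 2)  # ok, no error
try:
    validate_num_processes(None, 2)  # the combination reported in this issue
except ValueError as err:
    print(err)
```

A check like this run at Trainer construction time would surface the misconfiguration immediately rather than deep inside attribute lookup during fit.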

So the error reported in this issue refers to 1., but 2. also seems to fail. Here is a test to check (it fails on master):

import pytest

from pytorch_lightning import Trainer
from tests.base import EvalModelTemplate


@pytest.mark.parametrize("accelerator", [None, "ddp_spawn"])
def test_trainer_num_processes_without_ddp_cpu(tmpdir, accelerator):
    trainer = Trainer(
        default_root_dir=tmpdir,
        weights_summary=None,
        logger=False,
        checkpoint_callback=False,
        progress_bar_refresh_rate=0,
        fast_dev_run=True,
        num_processes=2,
        accelerator=accelerator,
    )
    trainer.fit(EvalModelTemplate())

If what I'm saying is correct, it also means that this warning is wrong:
https://github.com/PyTorchLightning/pytorch-lightning/blob/ebe3a31ddd82c616df6612cb880b0b3b13b9ecde/pytorch_lightning/accelerators/accelerator_connector.py#L99-L100
and should be updated or removed.

Hopefully someone can clear up the expected behaviour and add sensible warnings/errors as appropriate, and also improve the docs on the uses of num_processes.

cc @s-rog @williamFalcon

@tchaton tchaton added the won't fix This will not be worked on label Nov 10, 2020
@stale stale bot closed this as completed Nov 20, 2020
@edenlightning edenlightning removed the won't fix This will not be worked on label Nov 30, 2020
@edenlightning edenlightning reopened this Nov 30, 2020
@Borda Borda added the good first issue Good for newcomers label Dec 1, 2020
@edenlightning edenlightning removed the good first issue Good for newcomers label Dec 14, 2020
@edenlightning edenlightning changed the title ModuleAttributeError: 'BoringModel' object has no attribute 'module' ddp_cpu breaks while looking for .module: ModuleAttributeError: 'BoringModel' object has no attribute 'module' Dec 14, 2020
@edenlightning edenlightning added the priority: 1 Medium priority task label Dec 14, 2020
@Borda Borda self-assigned this Jan 4, 2021

carmocca commented Feb 22, 2021

This will probably get cleaned up by the proposal here: #6090

@carmocca carmocca added priority: 2 Low priority task and removed priority: 1 Medium priority task labels Feb 22, 2021

stale bot commented Mar 25, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Mar 25, 2021
@carmocca

Closing in favor of #6090 which will clarify the accelerator arguments
