
Running PyTorch Lightning with accelerator="ddp" creates a run for every GPU core #11


Closed
R-Peleg opened this issue Jul 1, 2021 · 2 comments

Comments


R-Peleg commented Jul 1, 2021

Honestly, I'm not sure if this is an issue with this library or with PyTorch Lightning itself.
When I run my neural network training code with --gpus=4 and --accelerator="ddp", 4 runs are created in the Neptune run list, while only the first one has any metrics logged in it.
The output I get is:

/home/ssm-user/train_script/venv/lib/python3.6/site-packages/pytorch_lightning/metrics/__init__.py:44: LightningDeprecationWarning: `pytorch_lightning.metrics.*` module has been renamed to `torchmetrics.*` and split off to its own package (https://github.com/PyTorchLightning/metrics) since v1.3 and will be removed in v1.5
  "`pytorch_lightning.metrics.*` module has been renamed to `torchmetrics.*` and split off to its own package"
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-229
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-230
Global seed set to 101
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-231
Global seed set to 101
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Global seed set to 101
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-232
Global seed set to 101
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
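
For context, here is a minimal sketch of roughly how such a training script is launched. This is not the reporter's actual code: the module, data module, and project names are placeholders, and the NeptuneLogger import path assumes the new-style Neptune integration shipped with neptune-client 0.9.x / neptune-pytorch-lightning.

```python
# Hypothetical reproduction sketch (not the reporter's actual script).
# Assumes the new-style Neptune integration; the import path may differ
# in other versions of the integration.
import pytorch_lightning as pl
from neptune.new.integrations.pytorch_lightning import NeptuneLogger

from my_project import MyLightningModule, MyDataModule  # placeholder names

# The API token is assumed to be read from the NEPTUNE_API_TOKEN env variable.
logger = NeptuneLogger(project="reuven/my-project-name")

trainer = pl.Trainer(
    gpus=4,
    accelerator="ddp",
    logger=logger,
)
# With accelerator="ddp", Lightning launches one process per GPU. Each
# process constructs its own NeptuneLogger, so four runs show up in
# Neptune, while only rank 0 actually logs metrics.
trainer.fit(MyLightningModule(), datamodule=MyDataModule())
```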
Raalsky (Contributor) commented Jul 3, 2021

Hey @R-Peleg

Thank you for reporting. First, I would really appreciate any information about your environment, especially the neptune-client, neptune-pytorch-lightning, and pytorch-lightning package versions. Of course, the issue may exist regardless of the versions, but knowing them speeds up the reproduction process.

We do have an environment variable, NEPTUNE_CUSTOM_RUN_ID, which should be helpful in most parallel/distributed setups. As you noticed, all metrics end up in only the first run; setting a shared custom run ID should merge the processes into a single run. More info can be found here: https://docs.neptune.ai/how-to-guides/neptune-api/pipelines
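
To illustrate the idea (this sketch is not part of the original reply, and is an assumed usage based on the docs page linked above): every DDP process must see the same value of the variable, and one way to achieve that is to set it once at the top of the training script, before the Trainer is created.

```python
# Illustrative sketch of the NEPTUNE_CUSTOM_RUN_ID workaround (assumed usage).
import os
import uuid

# With accelerator="ddp", Lightning re-runs this script in each child
# process but passes along the parent's environment, so setdefault keeps
# the ID generated by the first process instead of creating a new one.
os.environ.setdefault("NEPTUNE_CUSTOM_RUN_ID", str(uuid.uuid4()))
```

Equivalently, the variable can simply be exported in the shell before launching the script; all processes then log into the same Neptune run.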

R-Peleg (Author) commented Jul 4, 2021

Sure, the versions used are:
Python 3.6.9 on Ubuntu, with

neptune-client==0.9.16
neptune-pytorch-lightning==0.9.6
pytorch-lightning==1.3.7.post0
torch==1.7.1
torchmetrics==0.3.2
torchvision==0.8.2

The NEPTUNE_CUSTOM_RUN_ID workaround works well, thanks!
