
Running PyTorch Lightning with accelerator="ddp" creates a run for every GPU core #11


Closed
R-Peleg opened this issue Jul 1, 2021 · 2 comments

Comments


R-Peleg commented Jul 1, 2021

Honestly, I'm not sure if this is an issue with this library or with PyTorch Lightning itself.
When I run my neural network training code with --gpus=4 and --accelerator="ddp", 4 runs are created in the Neptune run list, while only the first one has any metrics logged in it.
The output I get is:

/home/ssm-user/train_script/venv/lib/python3.6/site-packages/pytorch_lightning/metrics/__init__.py:44: LightningDeprecationWarning: `pytorch_lightning.metrics.*` module has been renamed to `torchmetrics.*` and split off to its own package (https://github.com/PyTorchLightning/metrics) since v1.3 and will be removed in v1.5
  "`pytorch_lightning.metrics.*` module has been renamed to `torchmetrics.*` and split off to its own package"
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-229
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-230
Global seed set to 101
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-231
Global seed set to 101
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Global seed set to 101
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 101
https://app.neptune.ai/reuven/my-project-name/e/LOG-232
Global seed set to 101
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
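
For context, here is a minimal sketch of roughly how such a training script is launched. This is not the reporter's actual code: the module, data module, and project names are placeholders, and the NeptuneLogger import path assumes the new-style Neptune integration shipped with neptune-client 0.9.x / neptune-pytorch-lightning.

```python
# Hypothetical reproduction sketch (not the reporter's actual script).
# Assumes the new-style Neptune integration; the import path may differ
# in other versions of the integration.
import pytorch_lightning as pl
from neptune.new.integrations.pytorch_lightning import NeptuneLogger

from my_project import MyLightningModule, MyDataModule  # placeholder names

# The API token is assumed to be read from the NEPTUNE_API_TOKEN env variable.
logger = NeptuneLogger(project="reuven/my-project-name")

trainer = pl.Trainer(
    gpus=4,
    accelerator="ddp",
    logger=logger,
)
# With accelerator="ddp", Lightning launches one process per GPU. Each
# process constructs its own NeptuneLogger, so four runs show up in
# Neptune, while only rank 0 actually logs metrics.
trainer.fit(MyLightningModule(), datamodule=MyDataModule())
```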
Raalsky (Contributor) commented Jul 3, 2021

Hey @R-Peleg

Thank you for reporting. First, I would really appreciate any information about your environment, especially the neptune-client, neptune-pytorch-lightning, and pytorch-lightning package versions. Of course, the issue may exist regardless of the versions, but knowing them speeds up the reproduction process.

We do have an environment variable, NEPTUNE_CUSTOM_RUN_ID, which should be helpful in most parallel/distributed setups. As you noticed, all metrics end up in only the first run; setting a shared custom run ID should merge the processes into a single run. More info can be found here: https://docs.neptune.ai/how-to-guides/neptune-api/pipelines
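
To illustrate the idea (this sketch is not part of the original reply, and is an assumed usage based on the docs page linked above): every DDP process must see the same value of the variable, and one way to achieve that is to set it once at the top of the training script, before the Trainer is created.

```python
# Illustrative sketch of the NEPTUNE_CUSTOM_RUN_ID workaround (assumed usage).
import os
import uuid

# With accelerator="ddp", Lightning re-runs this script in each child
# process but passes along the parent's environment, so setdefault keeps
# the ID generated by the first process instead of creating a new one.
os.environ.setdefault("NEPTUNE_CUSTOM_RUN_ID", str(uuid.uuid4()))
```

Equivalently, the variable can simply be exported in the shell before launching the script; all processes then log into the same Neptune run.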

R-Peleg (Author) commented Jul 4, 2021

Sure, the versions used are:
Python 3.6.9 on Ubuntu, with

neptune-client==0.9.16
neptune-pytorch-lightning==0.9.6
pytorch-lightning==1.3.7.post0
torch==1.7.1
torchmetrics==0.3.2
torchvision==0.8.2

The NEPTUNE_CUSTOM_RUN_ID workaround works well, thanks!
