
multi-gpu training hangs #12


Closed

ajtao opened this issue Jan 20, 2021 · 5 comments

ajtao commented Jan 20, 2021

Hmm, training with --gpus 0, works fine but training with --gpus 0,1 hangs right at initializing ddp ...


ajtao commented Jan 20, 2021

Hmm, maybe this is related: Lightning-AI/pytorch-lightning#4612

ajtao closed this as completed Jan 22, 2021

MicPie commented Sep 1, 2021

I seem to have the same problem. How did you solve it?


ajtao commented Sep 1, 2021

In our environment we were setting NODE_RANK. PyTorch Lightning didn't like that being set, so unsetting it fixed the hang for us.

sanersbug commented

@MicPie @ajtao I have the same problem when training with --gpu 0. How did you solve it, and where do you set NODE_RANK?
Sorry, I don't even know what NODE_RANK means.


ajtao commented Oct 16, 2021

We had an environment variable called NODE_RANK set. We unset it, and that fixed things. Outside of that, I can't help you.
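
For later readers: a minimal sketch (not from this thread) of how one might detect and clear a stale NODE_RANK before launching single-node DDP training. The variable name comes from the comments above; clearing it from inside the launch script is an assumption, and the equivalent shell fix is simply `unset NODE_RANK` before running the training command.

```python
import os

# NODE_RANK is only meaningful for multi-node jobs; if it is left set in the
# environment on a single machine, DDP initialization may wait for peers that
# never appear. Remove it defensively before constructing the Trainer.
stale_rank = os.environ.pop("NODE_RANK", None)
if stale_rank is not None:
    print(f"Removed stale NODE_RANK={stale_rank!r} from the environment")
```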
