
multi-gpu training hangs #12


Closed

ajtao opened this issue Jan 20, 2021 · 5 comments

ajtao commented Jan 20, 2021

Hmm, training with --gpus 0, works fine but training with --gpus 0,1 hangs right at initializing ddp ...


ajtao commented Jan 20, 2021

Hmm, maybe this is related: Lightning-AI/pytorch-lightning#4612

ajtao closed this as completed Jan 22, 2021

MicPie commented Sep 1, 2021

I seem to have the same problem. How did you solve it?


ajtao commented Sep 1, 2021

In our environment we were setting NODE_RANK. PyTorch Lightning didn't like that being set, so unsetting it fixed the hang for us.

sanersbug commented

@MicPie @ajtao I have the same problem when training with --gpu 0. How did you solve it, and where do you set NODE_RANK?
Sorry, I don't even know what NODE_RANK means.


ajtao commented Oct 16, 2021

We had an environment variable called NODE_RANK set. We unset it, and that fixed things. Outside of that, I can't help you.
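
For later readers: a minimal sketch (not from this thread) of how one might detect and clear a stale NODE_RANK before launching single-node DDP training. The variable name comes from the comments above; clearing it from inside the launch script is an assumption, and the equivalent shell fix is simply `unset NODE_RANK` before running the training command.

```python
import os

# NODE_RANK is only meaningful for multi-node jobs; if it is left set in the
# environment on a single machine, DDP initialization may wait for peers that
# never appear. Remove it defensively before constructing the Trainer.
stale_rank = os.environ.pop("NODE_RANK", None)
if stale_rank is not None:
    print(f"Removed stale NODE_RANK={stale_rank!r} from the environment")
```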
