RichProgressBar deadlocks distributed training #10362
Comments
I'm using rich v10.12.0 from Conda.
Dear @bryant1410, are you using ddp_spawn? We noticed a similar issue in our own CI and haven't managed to resolve it yet. Best,
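For context, ddp and ddp_spawn are selected through the Trainer's strategy argument; a minimal sketch, assuming the Lightning 1.5-era API (not code from this thread):

```python
from pytorch_lightning import Trainer

# "ddp" launches one training process per GPU as separate script invocations,
# while "ddp_spawn" creates the worker processes with torch.multiprocessing.spawn.
# The question above is asking which of these two is in use.
trainer_ddp = Trainer(gpus=8, strategy="ddp")
trainer_spawn = Trainer(gpus=8, strategy="ddp_spawn")
```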
Hey @bryant1410, could you try this branch? https://github.com/PyTorchLightning/pytorch-lightning/tree/feat/enable_rich_default
@bryant1410 It's okay if you can't share your code, but could you try running and reproducing it in a non-Conda environment? Thank you!
No,
@bryant1410 were you able to try out the above branch?
Hey, I can't take a look this week, sorry. Will try to give it a shot next week!
@SeanNaren Hey, I have run into a similar problem to @bryant1410's. I tried the above branch, and it works for me.
Hi @lewjonan, could you try reproducing the issue with the #10428 branch?
🐛 Bug
I have run into many distributed training runs now that have gone completely frozen after enabling the RichProgressBar callback. It gets stuck between epochs after finishing one. Sometimes it's after the first epoch, sometimes after the second, sometimes even after 9 epochs.
The weird thing is that if I do Ctrl+C, the program doesn't interrupt. I have to kill it with SIGKILL (kill -9) because SIGTERM doesn't do it, so I don't know the stack trace. I'm also inside a Docker container, so I can't do strace (I don't have access to the host computer). Any help here to check the stack trace is welcome.
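One possible way to capture a Python-level stack trace from a hung process without strace is the standard-library faulthandler module; a minimal sketch (a general suggestion, not something reported in the thread):

```python
import faulthandler
import signal

# Dump the Python stack trace of every thread to stderr when the process
# receives SIGUSR1 (e.g. `kill -USR1 <pid>` from inside the container).
# faulthandler writes directly from the C-level signal handler, so it may
# still respond even when Ctrl+C and SIGTERM appear to be ignored.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```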
To Reproduce
I'm sorry, but I can't share my code. All I can say is that I use a machine with 8 GPUs and the ddp_find_unused_parameters_false strategy, and that the problem only appears with RichProgressBar; without it, it doesn't. I'd be happy to provide more details or try stuff out. Ask me for it! But I can't share my code, I'm sorry.
Expected behavior
The training to continue successfully.
Environment
- How you installed PyTorch (conda, pip, source): conda
- Output of torch.__config__.show():