RichProgressBar deadlocks distributed training #10362
Comments
I'm using rich v10.12.0 from Conda.
Dear @bryant1410, are you using ddp_spawn? We noticed a similar issue in our own CI and haven't managed to resolve it yet. Best,
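For context, ddp and ddp_spawn are selected through the Trainer's strategy argument; a minimal sketch, assuming the Lightning 1.5-era API (not code from this thread):

```python
from pytorch_lightning import Trainer

# "ddp" launches one training process per GPU as separate script invocations,
# while "ddp_spawn" creates the worker processes with torch.multiprocessing.spawn.
# The question above is asking which of these two is in use.
trainer_ddp = Trainer(gpus=8, strategy="ddp")
trainer_spawn = Trainer(gpus=8, strategy="ddp_spawn")
```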
Hey @bryant1410, could you try this branch? https://github.com/PyTorchLightning/pytorch-lightning/tree/feat/enable_rich_default
@bryant1410 It's okay if you can't share your code, but could you try running and reproducing it in a non-Conda environment? Thank you!
No,
@bryant1410 were you able to try out the above branch?
Hey, I can't take a look this week, sorry. Will try to give it a shot next week!
@SeanNaren Hey, I have run into a similar problem to @bryant1410's. I tried the above branch, and it works for me.
Hi @lewjonan, could you try reproducing the issue with the #10428 branch?
🐛 Bug
I have run into many distributed training runs now that have gone completely frozen after enabling the RichProgressBar callback. It gets stuck between epochs after finishing one. Sometimes it's after the first epoch, sometimes after the second, sometimes even after 9 epochs.
The weird thing is that if I do Ctrl+C, the program doesn't interrupt. I have to kill it with SIGKILL (kill -9) because SIGTERM doesn't do it, so I don't know the stack trace. I'm also inside a Docker container, so I can't do strace (I don't have access to the host computer). Any help here to check the stack trace is welcome.
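One possible way to capture a Python-level stack trace from a hung process without strace is the standard-library faulthandler module; a minimal sketch (a general suggestion, not something reported in the thread):

```python
import faulthandler
import signal

# Dump the Python stack trace of every thread to stderr when the process
# receives SIGUSR1 (e.g. `kill -USR1 <pid>` from inside the container).
# faulthandler writes directly from the C-level signal handler, so it may
# still respond even when Ctrl+C and SIGTERM appear to be ignored.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```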
To Reproduce
I'm sorry, but I can't share my code. All I can say is that I use a machine with 8 GPUs and the ddp_find_unused_parameters_false strategy, and that the problem only appears with RichProgressBar; without it, it doesn't. I'd be happy to provide more details or try stuff out. Ask me for it! But I can't share my code, I'm sorry.
Expected behavior
The training to continue successfully.
Environment
- How you installed PyTorch (conda, pip, source): conda
- Output of torch.__config__.show():