
RichProgressBar deadlocks distributed training #10362


Closed
bryant1410 opened this issue Nov 4, 2021 · 9 comments · Fixed by #10428
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task), progress bar: rich

Comments

@bryant1410
Contributor

bryant1410 commented Nov 4, 2021

🐛 Bug

I have now run into many distributed training runs that have gone completely frozen after enabling the RichProgressBar callback.

It gets stuck between epochs after finishing one. Sometimes it's after the first epoch, sometimes after the second, sometimes even after 9 epochs.

The weird thing is that if I press Ctrl+C, the program doesn't interrupt. I have to kill it with SIGKILL (kill -9) because SIGTERM doesn't do it, so I don't know the stack trace. I'm also inside a Docker container, so I can't use strace (I don't have access to the host machine). Any help with getting the stack trace is welcome.
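(Editor's note, not part of the original report: one way to get Python stack traces out of a hung process without strace or host access is the standard-library faulthandler module, registered early in the training script; sending a signal to a stuck rank then prints every thread's traceback. py-spy (`py-spy dump --pid <pid>`) is another option if it can be installed inside the container. A minimal sketch:)

```python
# Sketch: dump Python stacks from a hung process without strace.
# faulthandler is in the standard library; put this near the top of the
# training script, then send SIGUSR1 to a stuck rank to print every
# thread's traceback to stderr (e.g. `kill -USR1 <pid>`).
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternative: automatically dump all stacks if the process is still
# running after `timeout` seconds, repeating until cancelled.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
faulthandler.cancel_dump_traceback_later()  # cancel once progress resumes

# One-off dump to any real file object (must have a file descriptor):
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```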

To Reproduce

I'm sorry, but I can't share my code. All I can say is that I use a machine with 8 GPUs and the ddp_find_unused_parameters_false strategy, and that the problem appears only with RichProgressBar enabled and not otherwise.

I'd be happy to provide more details or try things out; just ask! But I can't share my code, I'm sorry.

Expected behavior

The training continues successfully.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.5.0
  • PyTorch Version (e.g., 1.8): 1.10.0
  • Python version: 3.8.12
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: 8x A100
  • How you installed PyTorch (conda, pip, source): conda
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:
bryant1410 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Nov 4, 2021
@bryant1410
Contributor Author

I'm using rich v10.12.0 from Conda.

@tchaton
Contributor

tchaton commented Nov 5, 2021

Dear @bryant1410,

Are you using ddp_spawn? We noticed a similar issue in our own CI and haven't managed to resolve it yet.

Best,
T.C

@SeanNaren SeanNaren self-assigned this Nov 5, 2021
@SeanNaren
Contributor

Hey @bryant1410, could you try this branch?

https://github.com/PyTorchLightning/pytorch-lightning/tree/feat/enable_rich_default

You should be able to install it with something like pip install https://github.com/PyTorchLightning/pytorch-lightning/archive/refs/heads/feat/enable_rich_default.zip

@kaushikb11
Contributor

@bryant1410 It's okay if you can't share your code, but could you try running and reproducing it in a non-Conda environment? Thank you!

@bryant1410
Contributor Author

Are you using ddp_spawn? We noticed a similar issue in our own CI and haven't managed to resolve it yet.

No, ddp_find_unused_parameters_false.

@SeanNaren
Contributor

@bryant1410 were you able to try out the above branch?

@bryant1410
Contributor Author

@bryant1410 were you able to try out the above branch?

Hey, I can't take a look this week, sorry. I'll try to give it a shot next week!

@jnanliu

jnanliu commented Nov 9, 2021

@SeanNaren Hey, I have run into a similar problem to @bryant1410's. I tried the above branch, and it works for me.

@kaushikb11
Contributor

Hi @lewjonan, could you try reproducing the issue with the #10428 branch?
