Dataloader hangs. Potential deadlock with set_num_threads in worker processes?
#75147
Labels
module: dataloader
Related to torch.utils.data.DataLoader and Sampler
module: deadlock
Problems related to deadlocks (hang without exiting)
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🐛 Bug ?
I have a main process running with pid 52422. Sometimes it gets stuck when iterating over my dataloader with num_workers > 0 during training.
The threads of the main process:
It has 4 subprocesses (26345, 26346, 26347, 26351). One of them (26346) is blocked at the pthread_mutex_lock call, and the others are stuck at poll or accept4 calls.
The backtrace of 26346:
However, the mutex that 26346 waits on is already held by another thread, even though there is no other thread in process 26346 (the remaining 3 child processes all have at least 2 threads).
I found that the owner of the mutex has the tid 52574, which is one of the threads of the main process (52422):
The backtrace of the owner thread 52574 in the main process is shown below:
Why is the mutex of the child process (26346) being held by a thread (tid 52574) of the parent process (52422)? I speculate that this might be the cause of the potential deadlock in PyTorch.
Any help? Thanks!
Versions
My environment:
cc @ssnl @VitalyFedyunin @ejguan @NivekT