Deadlock at process shutdown #54918


Open
lpinca opened this issue Sep 13, 2024 · 20 comments · May be fixed by #56827
Labels
confirmed-bug · help wanted · v8 engine

Comments

@lpinca
Member

lpinca commented Sep 13, 2024

Version

v23.0.0-pre

Platform

Linux USR-PC2HD4AD 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

There is a deadlock that prevents the Node.js process from exiting. The issue is causing a lot (all?) of timeout failures in our CI. It can be reproduced by running a test in parallel with our test.py tool, for example:

$ ./tools/test.py --repeat=10000 test/parallel/test-http2-large-write-multiple-requests.js

See also

How often does it reproduce? Is there a required condition?

Rarely, but often enough to be a serious issue for CI.

What is the expected behavior? Why is that the expected behavior?

The process exits.

What do you see instead?

The process does not exit.

Additional information

Attaching gdb to two of the hanging processes obtained from the command above produces the following output:

$ gdb -p 15884
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15884
[New LWP 15953]
[New LWP 15952]
[New LWP 15951]
[New LWP 15950]
[New LWP 15900]
[New LWP 15889]
[New LWP 15888]
[New LWP 15887]
[New LWP 15886]
[New LWP 15885]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/futex-internal.c.
0x00007f432025fd61 in __futex_abstimed_wait_common64 (private=21855, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:57

warning: 57     ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0  0x00007f432025fd61 in __futex_abstimed_wait_common64 (private=21855, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=21855, abstime=0x0, clockid=0, expected=0,
    futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555fa9fce8f0, expected=expected@entry=0,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f43202627dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555fa9fce870,
    cond=0x555fa9fce8c8) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x555fa9fce8c8, mutex=0x555fa9fce870) at ./nptl/pthread_cond_wait.c:627
#5  0x0000555fa3d30d7d in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>)
    at ../deps/uv/src/unix/thread.c:814
#6  0x0000555fa2c8c063 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
#7  0x0000555fa2ac528d in node::SpinEventLoopInternal(node::Environment*) ()
#8  0x0000555fa2c4abc4 in node::NodeMainInstance::Run() ()
#9  0x0000555fa2b9233a in node::Start(int, char**) ()
#10 0x00007f43201f11ca in __libc_start_call_main (main=main@entry=0x555fa2ababa0 <main>, argc=argc@entry=2,
    argv=argv@entry=0x7ffe74216608) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007f43201f128b in __libc_start_main_impl (main=0x555fa2ababa0 <main>, argc=2, argv=0x7ffe74216608,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe742165f8)
    at ../csu/libc-start.c:360
#12 0x0000555fa2ac06a5 in _start ()
(gdb)
$ gdb -p 15885
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15885
Reading symbols from /home/luigi/node/out/Release/node...
Reading symbols from /lib/x86_64-linux-gnu/libstdc++.so.6...
Reading symbols from /home/luigi/.cache/debuginfod_client/40b9b0d17fdeebfb57331304da2b7f85e1396ef2/debuginfo...
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...
Reading symbols from /usr/lib/debug/.build-id/90/32976b3ecc78d1362fedfcd88528562bbfb7e4.debug...
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...
Reading symbols from /home/luigi/.cache/debuginfod_client/92123f0e6223c77754bac47062c0b9713ed363df/debuginfo...
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
Reading symbols from /usr/lib/debug/.build-id/6d/64b17fbac799e68da7ebd9985ddf9b5cb375e6.debug...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/35/3e1b6cb0eebc08cf3ff812eae8a51b4efd684e.debug...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument.  Continuing without source file ./misc/../sysdeps/unix/sysv/linux/epoll_pwait.c.
0x00007f43202f0ee0 in __GI_epoll_pwait (epfd=epfd@entry=9, events=events@entry=0x7f43201bdcb0,
    maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0)
    at ../sysdeps/unix/sysv/linux/epoll_pwait.c:40

warning: 40     ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory
(gdb) bt
#0  0x00007f43202f0ee0 in __GI_epoll_pwait (epfd=epfd@entry=9, events=events@entry=0x7f43201bdcb0,
    maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0)
    at ../sysdeps/unix/sysv/linux/epoll_pwait.c:40
#1  0x0000555fa3d364e7 in uv__io_poll (loop=loop@entry=0x555fa9fcea88, timeout=<optimized out>)
    at ../deps/uv/src/unix/linux.c:1432
#2  0x0000555fa3d20c17 in uv_run (loop=0x555fa9fcea88, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:448
#3  0x0000555fa2c8d4b1 in node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*)
    ()
#4  0x00007f4320263a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#5  0x00007f43202f0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)
@lpinca
Member Author

lpinca commented Sep 13, 2024

As written in #52550 (comment), applying the patch from #47452 seems to fix the issue, but it is not a solution since that patch brings other issues.

@avivkeller
Member

This can be seen in various CI runs, assuming confirmed-bug.

@lpinca
Member Author

lpinca commented Sep 13, 2024

The output of lldb on macOS:

$ lldb -p 5033
(lldb) process attach --pid 5033
Process 5033 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff802363c3a libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`:
->  0x7ff802363c3a <+10>: jae    0x7ff802363c44            ; <+20>
    0x7ff802363c3c <+12>: movq   %rax, %rdi
    0x7ff802363c3f <+15>: jmp    0x7ff8023617e0            ; cerror_nocancel
    0x7ff802363c44 <+20>: retq   
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64h-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff802363c3a libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8023a16f3 libsystem_pthread.dylib`_pthread_cond_wait + 1211
    frame #2: 0x000000010df93b23 node`uv_cond_wait(cond=0x00007f9a4d404108, mutex=0x00007f9a4d404098) at thread.c:798:7 [opt]
    frame #3: 0x000000010d0be52b node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007f9a4d404108, mutex=0x00007f9a4d404098) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010d0be511 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007f9a4d404108, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010d0be511 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007f9a4d404098) at node_platform.cc:640:20 [opt]
    frame #6: 0x000000010d0be4fb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007f9a4d404098) at node_platform.cc:214:25 [opt]
    frame #7: 0x000000010d0be4fb node`node::NodePlatform::DrainTasks(this=0x0000600002718000, isolate=<unavailable>) at node_platform.cc:467:33 [opt]
    frame #8: 0x000000010cf18a50 node`node::SpinEventLoopInternal(env=0x00007f9a2c814e00) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x000000010d084083 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007f9a2c814e00) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x000000010d083fe9 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x000000010cfe4159 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1489:24 [opt]
    frame #12: 0x000000010cfe3f54 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1496:27 [opt]
    frame #13: 0x00007ff802015345 dyld`start + 1909
(lldb)

@aduh95
Contributor

aduh95 commented Sep 17, 2024

@lpinca could you try ab23463? I'm not able to reproduce the deadlock locally, so I can't confirm if it is effective.

@lpinca
Member Author

lpinca commented Sep 17, 2024

@aduh95 see #54918 (comment). It seems to be the same patch.

@aduh95
Contributor

aduh95 commented Sep 17, 2024

No, it's not. It's based on that patch, but it's meant to fix the test failures.

@targos
Member

targos commented Sep 17, 2024

@lpinca
Member Author

lpinca commented Sep 17, 2024

The issue persists 😞

gdb -p 18560
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 18560
[New LWP 18621]
[New LWP 18620]
[New LWP 18619]
[New LWP 18618]
[New LWP 18574]
[New LWP 18565]
[New LWP 18564]
[New LWP 18563]
[New LWP 18562]
[New LWP 18561]
Downloading separate debug info for /lib/x86_64-linux-gnu/libstdc++.so.6
Downloading separate debug info for system-supplied DSO at 0x7ffdf50f4000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/futex-internal.c.
0x00007fca5f88ad61 in __futex_abstimed_wait_common64 (private=21976, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:57

warning: 57     ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0  0x00007fca5f88ad61 in __futex_abstimed_wait_common64 (private=21976, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=21976, abstime=0x0, clockid=0, expected=0,
    futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55d834ea10e0, expected=expected@entry=0,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fca5f88d7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d834ea1060,
    cond=0x55d834ea10b8) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x55d834ea10b8, mutex=0x55d834ea1060) at ./nptl/pthread_cond_wait.c:627
#5  0x000055d82f131fed in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>)
    at ../deps/uv/src/unix/thread.c:814
#6  0x000055d82e08c853 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
#7  0x000055d82dec528d in node::SpinEventLoopInternal(node::Environment*) ()
#8  0x000055d82e04b364 in node::NodeMainInstance::Run() ()
#9  0x000055d82df9276a in node::Start(int, char**) ()
#10 0x00007fca5f81c1ca in __libc_start_call_main (main=main@entry=0x55d82debac50 <main>, argc=argc@entry=2,
    argv=argv@entry=0x7ffdf50a9a48) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007fca5f81c28b in __libc_start_main_impl (main=0x55d82debac50 <main>, argc=2, argv=0x7ffdf50a9a48,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdf50a9a38)
    at ../csu/libc-start.c:360
#12 0x000055d82dec0785 in _start ()
(gdb)

Also, test/wasi/test-wasi-initialize-validation.js and test/wasi/test-wasi-start-validation.js fail.

$ ./node test/wasi/test-wasi-initialize-validation.js
(node:13188) ExperimentalWarning: WASI is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Mismatched noop function calls. Expected exactly 1, actual 0.
    at Proxy.mustCall (/home/luigi/node/test/common/index.js:451:10)
    at Object.<anonymous> (/home/luigi/node/test/wasi/test-wasi-initialize-validation.js:154:18)
    at Module._compile (node:internal/modules/cjs/loader:1557:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1702:10)
    at Module.load (node:internal/modules/cjs/loader:1328:32)
    at Module._load (node:internal/modules/cjs/loader:1138:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)
$ ./node test/wasi/test-wasi-start-validation.js
(node:13199) ExperimentalWarning: WASI is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Mismatched noop function calls. Expected exactly 1, actual 0.
    at Proxy.mustCall (/home/luigi/node/test/common/index.js:451:10)
    at Object.<anonymous> (/home/luigi/node/test/wasi/test-wasi-start-validation.js:151:18)
    at Module._compile (node:internal/modules/cjs/loader:1557:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1702:10)
    at Module.load (node:internal/modules/cjs/loader:1328:32)
    at Module._load (node:internal/modules/cjs/loader:1138:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)

@avivkeller added the help wanted label Oct 15, 2024
@avivkeller
Member

Hey, I've added the help wanted label to this issue. While all issues deserve TLC from volunteers, this one is causing a serious CI issue, and a little extra attention wouldn't hurt.

@avivkeller
Member

avivkeller commented Oct 15, 2024

From my testing, the following snippet hangs occasionally, which might be related to this:

tasks_drained_.Wait(scoped_lock);

outstanding_tasks_ is 2 when it hangs, 0 otherwise. It also looks like one of those tasks completes, but the other one does not.

Some task isn't completing (but that was probably already known). Do we know which task, or how to find out?
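
For reference, a minimal sketch of the drain pattern under discussion — simplified to standard C++ primitives, with made-up names, not Node's actual libuv-based TaskQueue — showing why a BlockingDrain-style wait never returns if a single task never reports completion:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical, simplified analogue of node::TaskQueue; not the real code.
class SimpleTaskQueue {
 public:
  // Producer side: enqueue a task and bump the outstanding counter.
  void Push(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(std::move(task));
    ++outstanding_;
  }

  // Worker side: called after a popped task has finished running.
  void NotifyDone() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--outstanding_ == 0) tasks_drained_.notify_all();
  }

  // Drain side: block until every pushed task has reported completion.
  // If even one task never finishes (for example because it is itself
  // waiting on the thread that called BlockingDrain), this wait never
  // wakes up -- which matches the backtraces above, with outstanding_
  // stuck at 2.
  void BlockingDrain() {
    std::unique_lock<std::mutex> scoped_lock(mutex_);
    while (outstanding_ > 0)
      tasks_drained_.wait(scoped_lock);  // analogous to tasks_drained_.Wait(scoped_lock)
  }

 private:
  std::mutex mutex_;
  std::condition_variable tasks_drained_;
  std::queue<std::function<void()>> queue_;
  int outstanding_ = 0;
};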


Update 1

It seems a task is getting stuck during BlockingPop, which might be connected to the issue.

Update 2

The NodeMainInstance is never properly destructed, which prevents the NodePlatform from shutting down. However, this situation appears to be a catch-22. The NodeMainInstance seems to finish only after DrainTasks is completed, but DrainTasks can’t finish until NodeMainInstance is shut down.

Update 3

Removing platform->DrainTasks(isolate); from SpinEventLoopInternal prevents the error from occurring (the draining still happens elsewhere without interruption, although other errors persist).

Update 4

It's not a catch-22, because the hanging process appears to be hanging before the process destruction would even occur. I suspect that if you bt all the threads, you'll see that DrainTasks isn't the only thing hanging.
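
As a side note, all threads' backtraces can be dumped in one go with the standard gdb command:

(gdb) thread apply all bt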

Indeed, our hanging worker thread appears to be:

Thread 8 (Thread 0x7fb7c74006c0 (LWP 1322244) "node"):
#0  0x00007fb7cd49e22e in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55a4a5eefdc8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x55a4a5eefdc8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2  0x00007fb7cd49e2ab in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55a4a5eefdc8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fb7cd4a0990 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55a4a5eefd78, cond=0x55a4a5eefda0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x55a4a5eefda0, mutex=0x55a4a5eefd78) at ./nptl/pthread_cond_wait.c:618
#5  0x000055a46a13ad8c in void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::{lambda()#1}>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::{lambda()#1})::{lambda()#1}>(heap::base::Stack*, void*, void const*) ()
#6  0x000055a46acb27e3 in PushAllRegistersAndIterateStack ()
#7  0x000055a46a13b2df in v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*) ()
#8  0x000055a46a1ad972 in v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) ()
#9  0x000055a46a1ae0a8 in v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) ()
#10 0x000055a46a1e4891 in v8::internal::LocalFactory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) ()
#11 0x000055a46a17a4fe in v8::internal::FactoryBase<v8::internal::LocalFactory>::NewProtectedFixedArray(int) ()
#12 0x000055a46a364a6d in v8::internal::DeoptimizationData::New(v8::internal::LocalIsolate*, int) ()
#13 0x000055a46a7b8d3f in v8::internal::maglev::MaglevCodeGenerator::GenerateDeoptimizationData(v8::internal::LocalIsolate*) ()
#14 0x000055a46a7b991b in v8::internal::maglev::MaglevCodeGenerator::BuildCodeObject(v8::internal::LocalIsolate*) [clone .part.0] ()
#15 0x000055a46a7d9948 in v8::internal::maglev::MaglevCodeGenerator::Assemble() ()
#16 0x000055a46a82f269 in v8::internal::maglev::MaglevCompiler::Compile(v8::internal::LocalIsolate*, v8::internal::maglev::MaglevCompilationInfo*) ()
#17 0x000055a46a830b49 in v8::internal::maglev::MaglevCompilationJob::ExecuteJobImpl(v8::internal::RuntimeCallStats*, v8::internal::LocalIsolate*) ()
#18 0x000055a469ffb6bb in v8::internal::OptimizedCompilationJob::ExecuteJob(v8::internal::RuntimeCallStats*, v8::internal::LocalIsolate*) ()
#19 0x000055a46a831047 in v8::internal::maglev::MaglevConcurrentDispatcher::JobTask::Run(v8::JobDelegate*) ()
#20 0x000055a46b21b45f in v8::platform::DefaultJobWorker::Run() ()
#21 0x000055a469cb125f in node::(anonymous namespace)::PlatformWorkerThread(void*) ()
#22 0x00007fb7cd4a1732 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#23 0x00007fb7cd51c2b8 in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Update 5

The hanging task is pushed via the following backtrace:

 1: 0x5582238b587b node::TaskQueue<v8::Task>::Push(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [out/Release/node]
 2: 0x5582238b3d47 node::NodePlatform::PostTaskOnWorkerThreadImpl(v8::TaskPriority, std::unique_ptr<v8::Task, std::default_delete<v8::Task> >, v8::SourceLocation const&) [out/Release/node]
 3: 0x558224e25304 v8::platform::DefaultJobState::CallOnWorkerThread(v8::TaskPriority, std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [out/Release/node]
 4: 0x558224e2548f  [out/Release/node]
 5: 0x558223c0ca77  [out/Release/node]
 6: 0x558223c0e337 v8::internal::Compiler::CompileOptimized(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSFunction>, v8::internal::ConcurrencyMode, v8::internal::CodeKind) [out/Release/node]
 7: 0x55822423e64a v8::internal::Runtime_CompileOptimized(int, unsigned long*, v8::internal::Isolate*) [out/Release/node]
 8: 0x5581c496f276 

So it looks like this is a V8 hanging-task issue? I'll reach out to them and see if they know anything. (https://issues.chromium.org/issues/374285493)

@avivkeller added the v8 engine label Oct 18, 2024
dubzzz added a commit to dubzzz/fast-check that referenced this issue Oct 29, 2024
Retries are needed for now against Node 23. There is an open ticket on
V8 side (see https://issues.chromium.org/issues/374285493) and one on
Node side (see nodejs/node#54918) regarding
hanging issues of Node 23.
@lpinca
Member Author

lpinca commented Nov 19, 2024

From the discussion in https://issues.chromium.org/issues/374285493 (closed as Won't Fix - Intended Behavior), the issue is not in V8 but in Node.js.

lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
The issue is the same as other tests that time out.

Refs: nodejs#54918
lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
The issue is likely the same as other tests that time out.

Refs: nodejs#54918
lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
The issue is likely the same as other tests that time out.

Refs: nodejs#54918
Refs: nodejs#54844
lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
The issue is the same as other tests that time out.

Refs: nodejs#54918
Refs: nodejs#54802
lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
The issue is likely the same as other tests that time out.

Refs: nodejs#54918
Refs: nodejs#54802
lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
The issue is likely the same as other tests that time out.

Refs: nodejs#54918
Refs: nodejs#54844
Refs: nodejs#54802
lpinca added a commit to lpinca/node that referenced this issue Nov 28, 2024
Move `test/sequential/test-worker-arraybuffer-zerofill.js` back to
`test/parallel/test-worker-arraybuffer-zerofill.js` and remove the
flaky designation.

The original issue is likely the same as other tests that time out.

Refs: nodejs#54918
Refs: nodejs#54839
Refs: nodejs#54802
nodejs-github-bot pushed a commit that referenced this issue May 4, 2025
According to the documentation, the V8 tasks should be executed
based on priority. Previously we always executed the tasks in
FIFO order; this changes the NodePlatform implementation to
execute the higher-priority tasks first. The tasks used to
schedule timers for the delayed tasks are run in FIFO order,
since priority is irrelevant for the timer scheduling part,
while the tasks unwrapped by the timer callbacks are still
ordered by priority.

PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <[email protected]>
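
As an aside, a minimal sketch — standard C++ with hypothetical names, not the actual NodePlatform change from #58047 — of the ordering described above: tasks are popped highest-priority first, and tasks of equal priority keep FIFO order via an insertion sequence number.

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical priority levels, loosely mirroring v8::TaskPriority
// (best-effort lowest, user-blocking highest).
enum class Priority : int { kBestEffort = 0, kUserVisible = 1, kUserBlocking = 2 };

struct Entry {
  Priority priority;
  uint64_t sequence;  // insertion order, used for FIFO within a priority level
  std::function<void()> task;
};

// std::priority_queue pops the "largest" element first, so the comparator
// treats lower priority (and, within a priority, later insertion) as "smaller".
struct EntryCompare {
  bool operator()(const Entry& a, const Entry& b) const {
    if (a.priority != b.priority) return a.priority < b.priority;
    return a.sequence > b.sequence;
  }
};

class PriorityTaskQueue {
 public:
  void Push(Priority priority, std::function<void()> task) {
    entries_.push({priority, next_sequence_++, std::move(task)});
  }
  bool Empty() const { return entries_.empty(); }
  std::function<void()> Pop() {
    std::function<void()> task = entries_.top().task;  // copy; top() is const
    entries_.pop();
    return task;
  }

 private:
  std::priority_queue<Entry, std::vector<Entry>, EntryCompare> entries_;
  uint64_t next_sequence_ = 0;
};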
nodejs-github-bot pushed a commit that referenced this issue May 4, 2025
We should not be blocking on the worker tasks on the
main thread in one go. Doing so leads to two problems:

1. If any of the worker tasks post another foreground task and wait
   for it to complete, and that foreground task is posted right after
   we flush the foreground task queue and before the foreground thread
   goes into sleep, we'll never be able to wake up to execute that
   foreground task and in turn the worker task will never complete, and
   we have a deadlock.
2. Worker tasks can be posted from any thread, not necessarily
   associated with the current isolate, and we can be blocking on a
   worker task that is associated with a completely unrelated isolate
   in the event loop. This is suboptimal.

However, not blocking on the worker tasks at all can lead to loss of
some critical user-blocking worker tasks, e.g. wasm async compilation
tasks, which should block the main thread until they are completed,
as the documentation suggests. As a compromise, we currently only block
on user-blocking tasks to reduce the chance of deadlocks while making
sure that critical user-blocking tasks are not lost.

PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <[email protected]>
@joyeecheung
Member

joyeecheung commented May 4, 2025

I changed #58047 to only ref this issue instead of closing it, because I think that PR only eliminates a subset of the deadlocks, not all of them. It may still be possible for a user-blocking worker task to post a foreground task right before the foreground thread goes to sleep; it's just that most user-blocking worker tasks tend not to do that (though it is still possible, e.g. with the high-priority concurrent compilation flags, or a major GC triggered by the heap reaching a hard limit) or are less likely to hit that unfortunate timing. So I left a big FIXME comment in the source describing the situation.
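
To make that remaining window concrete, here is a minimal, self-contained sketch — standard C++ threads with made-up names, not Node's platform code — of the unlucky interleaving: the main thread drains its foreground queue and then blocks waiting for worker tasks, while the worker task posts a foreground task only after that drain and then waits for it. A timeout stands in for the real unbounded wait so the sketch terminates instead of hanging.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex mu;
std::condition_variable cv;
std::queue<std::function<void()>> foreground_queue;  // tasks for the main thread
bool flushed = false;         // the main thread has finished its final flush
bool foreground_ran = false;  // the worker's foreground task has executed
bool worker_done = false;     // the worker task has completed

int main() {
  std::thread worker([] {
    std::unique_lock<std::mutex> lock(mu);
    // Force the unlucky timing: post the foreground task only after the
    // main thread has already flushed its foreground queue.
    cv.wait(lock, [] { return flushed; });
    foreground_queue.push([] { foreground_ran = true; });
    cv.notify_all();
    // Wait for that foreground task to run. In real code this wait would
    // be unbounded and never return; the timeout is only here so the
    // sketch terminates.
    cv.wait_for(lock, std::chrono::seconds(2), [] { return foreground_ran; });
    worker_done = true;
    cv.notify_all();
  });

  std::unique_lock<std::mutex> lock(mu);
  // Main thread: drain whatever is currently in the foreground queue
  // (nothing yet in this interleaving)...
  while (!foreground_queue.empty()) {
    foreground_queue.front()();
    foreground_queue.pop();
  }
  flushed = true;
  cv.notify_all();
  // ...then block waiting for the worker task, DrainTasks-style, without
  // ever looking at the foreground queue again.
  cv.wait(lock, [] { return worker_done; });
  lock.unlock();
  worker.join();
  std::cout << "foreground task ran: " << std::boolalpha << foreground_ran
            << " (it never does; without the timeout both threads wait forever)\n";
  return 0;
}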

RafaelGSS and aduh95 pushed further commits referencing this issue (PR-URL: #58047) between May 5 and May 19, 2025.
@mhdawson removed the tsc-agenda label May 28, 2025