Skip to content

Bitcoin Core rpc_misc.py test failure #128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Sjors opened this issue Jan 17, 2025 · 8 comments
Closed

Bitcoin Core rpc_misc.py test failure #128

Sjors opened this issue Jan 17, 2025 · 8 comments

Comments

@Sjors
Copy link
Member

Sjors commented Jan 17, 2025

Happened once (?) while testing bitcoin/bitcoin#30975.

https://github.com/bitcoin/bitcoin/actions/runs/12826202676/job/35765674846?pr=30975#step:7:3078

I haven't been able yet to reproduce locally.

@Sjors
Copy link
Member Author

Sjors commented Jan 17, 2025

So far I tried on (M4) macOS 15.2 with commit 9a90f964a3c4fe75477926d34b7f1813b144449d:

cmake -B build -DWITH_MULTIPROCESS=ON 
cmake --build build
BITCOIND=$(pwd)/build/src/bitcoin-node build/test/functional/rpc_misc.py

Using multiprocess master @ 3b2617b

@ryanofsky
Copy link
Collaborator

From the CI link above the error seems to be:

node0 stderr libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

which according to https://stackoverflow.com/questions/66773247/libcabi-dylib-terminating-with-uncaught-exception-of-type-std-1system-er "typically happens when .lock() is called on a mutex that is not yet constructed, or has already been destructed."

In the test, the error happens when the node is shutting down after an echoipc call has been made:

https://github.com/bitcoin/bitcoin/blob/90227f89411c1a1d4e62ed0e4de3099e1104bc19/test/functional/rpc_misc.py#L83

@Sjors
Copy link
Member Author

Sjors commented Jan 21, 2025

Another CI failure on the same test, this time on centOS. I'm unsure if it's the same issue:

https://cirrus-ci.com/task/6027377219731456?logs=ci#L4105

raise AssertionError("Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr terminate called after throwing an instance of 'kj::ExceptionImpl'
what():  mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe

@ryanofsky
Copy link
Collaborator

The failure on centos https://cirrus-ci.com/task/6027377219731456?logs=ci#L4111 also reported bitcoin/bitcoin#30975 (comment) is happening in nearly the same place as originally reported failure on macos, when the test tries to stop the node on line https://github.com/bitcoin/bitcoin/blob/11b293a9e7e89f25acc74d0bda5cc68cd164b551/test/functional/rpc_misc.py#L83 after an echoipc call is made.

However, there are some differences between the two cases:

In both cases there seems to be a race condition and the bug could have the same root cause. I think my next step will be to try to run the test in loop to see if I could reproduce locally.

@ryanofsky
Copy link
Collaborator

The "Unexpected stderr" "terminate called after throwing an instance of 'kj::ExceptionImpl" "mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe" error that happens on CentOS was actually seen before and has an issue open bitcoin/bitcoin#31151

@ryanofsky
Copy link
Collaborator

Looking into this more it's not clear if this "mutex lock failed: Invalid argument" issue is the same as the other "disconnected: write(m_post_fd, &buffer, 1): Broken pipe" issue reported in bitcoin/bitcoin#31151, or if the fix in #129 will affect it.

The other "Broken pipe" issue happens the when the child process in the echoipc test does not shut down cleanly due to a race condition in the EventLoop shutdown sequence that happens there. But the "mutex lock failed" here is happening in the parent process not the child process, and even though the last log messages are about the event loop shutting down, they could just indicate that the event loop shut down successfully, and the error is happening at some point after that.

The echoipc parent process starts its event loop with the CapnpProtocol::connect function instead of CapnpProtocol::serve method running the event loop in the child process. And instead of shutting down when the last clean disconnects, it shuts down when the main() process exits and MakeNodeInit() destructor is called to destroy the interfaces::Init object which owns the interfaces::Ipc object which owns the CapnpProtocol object which calls removeClient in its destructor to stop the event loop.

But even though the event loop shutdown sequence is different, it could have the race conditions described in bitcoin/bitcoin#31151 as long as there are any ProxyServerBase::m_impl objects that need to be destroyed asynchronously and could cause EventLoop::startAsyncThread to spawn a separate thread.

The echoipc method in the parent process isn't passing any interface pointers to the child process so it isn't actually creating any ProxyServer objects of its own. However the IPC framework is creating ProxyServer<Thread> and
ProxyServer<ThreadMap> objects and while the former is a custom class that doesn't inherit from ProxyServerBase, the latter does, so it's possible a similar race condition could exist that causes this issue and would be fixed by #129.

But it's also possible #129 does not fix this, so will probably just need to see if it happens again.

@ryanofsky
Copy link
Collaborator

I'm going to close this issue assuming #129 fixes it, but if it happens again, this can be reopened.

@ryanofsky
Copy link
Collaborator

But it's also possible #129 does not fix this, so will probably just need to see if it happens again.

This issue did turn out to be different than #31151 and not fixed by #129. A similar macos "mutex lock failed" error was reported in #154. Both this issue and #154 should be fixed by bitcoin/bitcoin#31815, but this has not been confirmed yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants