kernel: thread: race condition between create and join #58116
OK, I have this sorta reproduced, but only on ARM. The test code as it is in the current PR doesn't run long enough for me to see failures; does it for you? If I crank the runtime up to an hour, I can see qemu_cortex_a53_smp panic fairly reliably after 2-10 minutes and ~2M iterations (pretty randomly, with a lot of spread). I pushed the runtime all the way up to the 6-hour maximum and let it run at bedtime, and I saw one panic on qemu_x86_64 [1] after ~5 hours. I haven't seen a failure on RISC-V at all.

The code below is the hello-world-ized variant I'm using right now; it eliminates the yield/delay steps, the dependence on logging and pthreads, etc. It usually fails pretty reliably on a53 after about a minute. (Oddly, qemu just exits and doesn't print a panic; did we break something with panics in hello_world apps?) As with the original, it's so far working reliably on the other SMP platforms.

So... basically my gut says that we have an SMP glitch in the arm64 code, almost certainly in interrupt entry/exit or context switch. It really doesn't seem to be a race in the kernel per se. One theory, inspired by the recent discovery that x86 qemu correctly implements an obscure 1-instruction interrupt shadow, is that qemu is actually doing pedantic memory reordering and we're missing some barrier instructions. Of all our platforms, arm64 has by far the trickiest memory model. Unless there's an easier way to repro this on other platforms?

[1] The huge difference in scale here points to this being a separate bug, IMHO. But once you get this far down to failures in the one-in-a-billion range, we need to start accepting that bugs in qemu or memory glitches on the host processor are possibilities.

#include <zephyr/kernel.h>
#define NTHREAD 2
#define STACKSZ 4096
#define LOG_INTERVAL_MS 2000

struct k_thread threads[NTHREAD];
K_KERNEL_STACK_ARRAY_DEFINE(stacks, NTHREAD, STACKSZ);
volatile int alive[NTHREAD];

void thread_fn(void *a, void *b, void *c)
{
	alive[(long)a] = true;
}

void spawn(int i)
{
	k_thread_create(&threads[i], stacks[i], STACKSZ,
			thread_fn, (void *)(long)i, NULL, NULL,
			-2, 0, K_NO_WAIT);
}

int main(void)
{
	for (int i = 0; i < NTHREAD; i++) {
		spawn(i);
	}

	uint64_t n = 0, next_log_ms = 0;

	while (true) {
		for (int i = 0; i < NTHREAD; i++) {
			if (alive[i]) {
				k_thread_join(&threads[i], K_FOREVER);
				alive[i] = false;
				spawn(i);
				n++;
			}
		}

		uint64_t now = k_uptime_get();

		if (now > next_log_ms) {
			printk("%lld joins in %lld ms\n", n, now);
			next_log_ms += LOG_INTERVAL_MS;
		}
	}
	return 0;
}
|
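(A sketch, not part of the original rig: if one wanted to rule the test's own memory ordering out as a suspect, the volatile alive[] flags could be swapped for Zephyr's atomic API so every flag access is an explicitly ordered atomic operation. NTHREAD, threads[], spawn() and the surrounding main loop are assumed to be the ones from the sample above.)

#include <zephyr/kernel.h>
#include <zephyr/sys/atomic.h>

/* Replaces: volatile int alive[NTHREAD]; */
atomic_t alive[NTHREAD];

/* Thread entry: publish "I ran" with an atomic store instead of a plain
 * volatile write.
 */
void thread_fn(void *a, void *b, void *c)
{
	atomic_set(&alive[(long)a], 1);
}

/* Per-thread check used by the main loop in place of the open-coded
 * if (alive[i]) { ... } block.
 */
void reap_one(int i, uint64_t *n)
{
	if (atomic_get(&alive[i])) {
		k_thread_join(&threads[i], K_FOREVER);
		atomic_clear(&alive[i]);
		spawn(i);
		(*n)++;
	}
}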
Flag @carlocaione and @povergoing for the seeming arm64 dependence. Also @dcpleung and @npitre are usually helpful on this sort of thing. |
I confirm having the same experience: reproduced 3 times on ARM64 within …

Obviously the reported error dump above comes from RISCV.

@cfriedt: Can you test with the simplified code from @andyross and confirm …
By default the hello_world app is configured with |
Been running these all morning collecting statistics.

Both qemu_riscv32_smp and qemu_riscv64_smp have been running without trouble for 6.1 hours now, reaching 1.84G and 1.49G joins respectively (qemu runs at different rates for different architectures). To the extent we can measure right now, joins on these architectures are 100% bug free.

As mentioned, arm64 is the problem child. The rig above dies (usually just exiting qemu, sometimes with a deadlock) relatively promptly for me. Over a few dozen runs I measured an average uptime of 6.5M joins and 72 seconds.

But x86_64 has a real, measurable issue too: checking and restarting regularly, I've seen a total of seven individual failures with an average uptime of 344M joins in 38 minutes (it turns out that x86 is about 60% faster in qemu, which I guess isn't surprising given that it matches the host architecture).

Not sure where to go from here though beyond careful reading of the platform interrupt and context switch code... |
@andyross, @npitre - Definitely
Cannot reproduce with |
Reworked the … Even just …

Whether that involves missing barriers, opportunistic scheduling, or simply Qemu guessing wrong, it's hard to say.

In any case, what I did with my test was create a …

Likely, for the test I've written, I'll set the integration platform to …

The main assertion that I was trying to avoid was …
|
Even running under Twister, so much additional scheduler noise is created that most platforms will fail. It's not something that can be easily fixed, I would say. |
I saw the issue on
|
FWIW: I just killed the RISC-V processes I had running from yesterday after 25 hours and no failures. @povergoing thanks for the confirmation on FVP. And while the data point is only one sample, that's about the reproduction rate @npitre and I measured. Frustratingly it looks like FVP is running this code about 30x slower than qemu. Interestingly though, FVP is (I think?) more cycle-exact than qemu and doesn't have the wild qemu time warp behavior due to scheduling CPUs on host threads. That the behavior is basically the same argues that the problems @cfriedt is seeing with CI/twister in the pthread PR are a different bug. |
What can we deduce so far?
|
Hmm.. I'm not familiar with FVP
Of note, though, I was originally checking / zasserting on every return code, corner case, etc. I guess it's possible that it's not specific to Qemu.
The plot thickens... |
More data points:
Crash reliably happens within 2 seconds:
Then if I set …

Rebasing on top of PR #58086, which in theory shouldn't matter, does …
I don't think we can say #58086 is related to this one:
|
PR #58086 is not related other than moving the compiled code around. Hence my assertion that this is code location dependent.

Or rather, data location dependent. Something might be corrupting memory at a given address.
|
So, I think the problem is that there is a race between k_thread_join() and the exit path of the thread being joined. What happens is:

1. The dying thread marks itself _THREAD_DEAD on its own exit path, but it is still running on its CPU and has not yet written out its switch handle.
2. On another CPU, k_thread_join() sees _THREAD_DEAD and returns 0 immediately.
3. The joining context then reuses the struct k_thread (e.g. via k_thread_create()) while the "dead" thread is still running on the other core, corrupting its state.
This patch makes the sample run fine for me:

diff --git a/kernel/sched.c b/kernel/sched.c
index a13f94bf3a..460767438d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1806,6 +1806,7 @@ int z_impl_k_thread_join(struct k_thread *thread, k_timeout_t timeout)
 	SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_thread, join, thread, timeout);
 
 	if ((thread->base.thread_state & _THREAD_DEAD) != 0U) {
+		wait_for_switch(thread);
 		ret = 0;
 	} else if (K_TIMEOUT_EQ(timeout, K_NO_WAIT)) {
 		ret = -EBUSY;
|
Of course, if this is indeed the root cause, a proper patch must be cooked. |
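For context on the helper used in that diff: wait_for_switch() comes from the kernel's swap internals, and the mechanism it relies on looks conceptually like the sketch below. This is an illustration under the assumption that, on CONFIG_USE_SWITCH architectures, the context-switch code publishes thread->switch_handle only after the outgoing thread has fully left its CPU; it is not a verbatim copy of the kernel source.

#include <zephyr/kernel.h>

/* Illustrative sketch of the spin, not the kernel's exact code: a
 * non-NULL switch handle means the outgoing thread's context has been
 * fully saved and it is no longer executing anywhere, so its
 * struct k_thread is finally safe to reuse.
 */
static inline void spin_until_switched_out(struct k_thread *thread)
{
#ifdef CONFIG_USE_SWITCH
	volatile void **shp = (void *)&thread->switch_handle;

	while (*shp == NULL) {
		/* the "dead" thread is still live on another CPU */
	}
#endif
}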
Should we also warn somewhere in the thread_create function if the thread hasn't been aborted yet? I ran into a similar issue a while back, creating a thread before it had been aborted (#57681). That might be a kind of wrong usage by the app/test, but I think the kernel should detect it and warn. |
Yeah, this is another way to fix this I think. |
Nice catch. Though I'm confused why, if this is the case, it's so arch-dependent. It's not like arm64 is just more likely to fail than riscv; it's literally billions of times more likely. Is there a similar spin loop somewhere in the riscv internals that is saving us?

@povergoing Unfortunately, for historical reasons we can't do much validation in k_thread_create(). The way the API is specified, it takes an uninitialized struct k_thread memory block, so we can't make any assumptions about what the contents are. It could be arbitrary garbage with a state that says "running", it could have broken or dangling pointers in the waitq list node that make it look like it's pended, etc... |
But to be clear, I'm convinced this is it. It's exactly the same race we have in context switch, and frankly your fix is exactly right. If it's OK with you, I'm going to clean it up with some refactoring and clearer docs to make it a more proper-looking scheduler API. But I promise I'll credit you. :) Or you can just submit that as is and I can follow up with the cleanup later. FWIW: this is also a good opportunity to add a call to the shiny new barrier API, which this layer needs theoretically (though on qemu it's likely not emulated correctly and probably not part of the problem here). |
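For reference, the barrier API being referred to lives in <zephyr/sys/barrier.h>. Below is a generic, self-contained illustration of the kind of ordering it provides (my own example of a publish/consume pattern, not the scheduler patch itself):

#include <zephyr/kernel.h>
#include <zephyr/sys/barrier.h>

static int payload;
static volatile int ready;

/* Producer: make the data globally visible before the flag. */
void publish(void)
{
	payload = 42;
	barrier_dmem_fence_full();   /* order the data store before the flag store */
	ready = 1;
}

/* Consumer: don't let the data load be hoisted above the flag check. */
int consume(void)
{
	while (!ready) {
		/* spin until published */
	}
	barrier_dmem_fence_full();   /* order the flag load before the data load */
	return payload;
}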
@andyross please go ahead, I definitely trust you more with this code than myself :D |
@carlocaione: Congrats on finding this race! I was inching my way towards …
@andyross: I'd blame QEMU implementation differences here. I'd guess that …
OK, fix is up at #58334.

This did fine for me on a smoke test running all the qemu SMP platforms against tests/kernel/ and tests/posix/ (three routine timing-related failures that cleaned up with retries), and it is up to 23 minutes now running the hello_world rig above against qemu_cortex_a53_smp. I'll see about spawning off some more runs in parallel. Obviously this bug was heavily build-dependent, so if folks could all try their favorite test rigs against it, that would be helpful. |
As discovered by Carlo Caione, the k_thread_join code had a case where it detected it had been called on a thread already marked _THREAD_DEAD and exited early.

That's not sufficient. The thread state is mutated from the thread itself on its exit path. It may still be running! Just like the code in z_swap(), we need to spin waiting on the other CPU to write the switch handle before knowing it's safe to return, otherwise the calling context might (and did) do something like immediately k_thread_create() a new thread in the "dead" thread's struct while it was still running on the other core.

There was also a similar case in k_thread_abort() which had the same issue: it needs to spin waiting on the other CPU to kill the thread via the same mechanism.

Fixes zephyrproject-rtos#58116

Originally-by: Carlo Caione <[email protected]>
Signed-off-by: Andy Ross <[email protected]>
Agreed, thanks for the explanation. It was more of a question than a fix :) |
I hate to be a party pooper too, but this bug still exists on |
Describe the bug
While investigating #56163 and digging through pthreads, testing showed that k_thread_create() and k_thread_join() also exhibit a race condition when re-using the same struct k_thread over and over again. It's not something regularly seen in production at the moment, and was only detected by accident. Originally reported here.
On the kernel side, this mainly seems to be an issue with smp platforms.
Please also mention any information which could help others to understand the problem you're facing:
qemu_x86_64, qemu_cortex_a53_smp, qemu_riscv64_smp, qemu_riscv32_smp
The race only seems to occur when re-using the same k_thread at the moment. It happens on all libc configurations and most smp platforms.
To Reproduce
Steps to reproduce the behavior:
twister -i -T tests/posix/pthread_pressure
Expected behavior
Tests passing ideally 100% of the time on all platforms.
Impact
It seems to be the opposite of what several contributors and maintainers expect, and is possibly just a corner case that did not receive a lot of traffic.
Logs and console output
E.g.
Environment (please complete the following information):
Additional context
#56163
#57637