kernel: thread: race condition between create and join #58116

Closed
cfriedt opened this issue May 22, 2023 · 27 comments · Fixed by #58334
Labels: area: Kernel, bug, priority: medium

@cfriedt
Member

cfriedt commented May 22, 2023

Describe the bug
While investigating #56163 and digging through pthreads, testing showed that k_thread_create() and k_thread_join() also exhibit a race condition when the same struct k_thread objects are reused over and over again.

It's not something seen regularly in production at the moment, and it was only detected by accident. Originally reported here.

On the kernel side, this mainly seems to be an issue on SMP platforms.
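
For illustration, the failing pattern boils down to a reuse loop like the one below (a minimal sketch with made-up names; the real reproducers are the test suite linked below and the sample posted later in this thread):

#include <zephyr/kernel.h>

/* Sketch only: the same struct k_thread and stack are reused on every
 * iteration, which is what exposes the race on SMP.
 */
struct k_thread worker;
K_KERNEL_STACK_DEFINE(worker_stack, 1024);

void worker_fn(void *a, void *b, void *c)
{
	ARG_UNUSED(a);
	ARG_UNUSED(b);
	ARG_UNUSED(c);
	/* exits immediately */
}

void churn(void)
{
	for (;;) {
		k_thread_create(&worker, worker_stack,
				K_KERNEL_STACK_SIZEOF(worker_stack),
				worker_fn, NULL, NULL, NULL,
				K_PRIO_PREEMPT(0), 0, K_NO_WAIT);
		k_thread_join(&worker, K_FOREVER);
		/* on SMP, k_thread_join() can return while the "dead"
		 * thread is still finishing its exit path on another
		 * CPU, so the next k_thread_create() reuses a struct
		 * k_thread that is effectively still live
		 */
	}
}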

Please also mention any information which could help others to understand
the problem you're facing:

  • What target platform are you using? qemu_x86_64, qemu_cortex_a53_smp, qemu_riscv64_smp, qemu_riscv32_smp
  • What have you tried to diagnose or work around this issue? Wrote a test suite ([DNM]: tests: posix: stress test for pthread_create and pthread_join #58115). Note: pthreads are currently disabled in that suite, so all failures at the moment are k_thread failures. It happens on all libc configurations and most SMP platforms.
  • Is this a regression? Probably not, although it's hard to say.

To Reproduce
Steps to reproduce the behavior:

  1. twister -i -T tests/posix/pthread_pressure
  2. see errors

Expected behavior
Tests passing ideally 100% of the time on all platforms.

Impact
It seems to be the opposite of what several contributors and maintainers expect, and is possibly just a corner case that did not receive a lot of traffic.

Logs and console output
E.g.

ERROR   - *** Booting Zephyr OS build zephyr-v3.3.0-4124-gf67ff9c38640 ***
Running TESTSUITE pthread_pressure
===================================================================
START - test_k_thread_create_join
I: NUM_THREADS: 2
I: TEST_NUM_CPUS: 2
I: TEST_DURATION_S: 10
I: TEST_DELAY_US: 0
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1785
	aborted _current back from dead
E:      a0: 0000000000000004    t0: 0000000000000000
E:      a1: 00000000000006f9    t1: 0000000000000009
E:      a2: 0000000080009d98    t2: 0000000000000000
E:      a3: 0000000000000000    t3: 0000000000000001
E:      a4: 0000000000000000    t4: 0000000000000023
E:      a5: 0000000000000001    t5: 000000008000b2b0
E:      a6: 0000000000000001    t6: 0000000080006514
E:      a7: 0000000000000001
E:      ra: 00000000800063c4
E:    mepc: 00000000800017e0
E: mstatus: 0000000a00021880
E: 
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: (nil) (unknown)
E: Halting system

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK v0.16.1
  • Commit SHA or Version used: d01780f (main), v2.7.4

Additional context
#56163
#57637

cfriedt added the bug and priority: medium labels on May 22, 2023
@andyross
Collaborator

OK, I have this sorta reproduced, but only on ARM. The test code as it is in the current PR doesn't run long enough for me to see failures, does it for you? If I crank the runtime up to an hour, I can see qemu_cortex_a53_smp panic fairly reliably after 2-10 minutes and ~2M iterations (pretty randomly, with a lot of spread).

I pushed the runtime all the way up to the 6-hour maximum and let it run at bedtime. And I saw one panic on qemu_x86_64[1] after ~5 hours. I haven't seen a failure on RISC-V at all.

The code below is the hello-world-ized variant I'm using right now that eliminates the yield/delay steps, the dependence on logging and pthreads, etc... It fails pretty reliably on a53 after about a minute, usually. (Oddly, qemu just exits and doesn't print a panic; did we break something with panics in hello_world apps?) As with the original, it's so far working reliably on the other SMP platforms.

So... basically my gut says that we have an SMP glitch in the arm64 code, almost certainly in interrupt entry/exit or context switch. It really doesn't seem to be a race in the kernel per se. One theory, inspired by the recent discovery that x86 qemu correctly implements an obscure 1-instruction interrupt shadow slot, is that qemu is actually doing pedantic memory reordering and we're missing some barrier instructions. Of all our platforms, arm64 has by far the trickiest memory model.

Unless there's an easier way to repro this on other platforms?

[1] The huge difference in scale here points to this being a separate bug, IMHO. But once you get this far down to failures in the one-in-a-billion range, we need to start accepting that bugs in qemu or memory glitches on the host processor are possibilities.

#include <zephyr/kernel.h>

#define NTHREAD 2
#define STACKSZ 4096
#define LOG_INTERVAL_MS 2000

struct k_thread threads[NTHREAD];
K_KERNEL_STACK_ARRAY_DEFINE(stacks, NTHREAD, STACKSZ);

volatile int alive[NTHREAD];

void thread_fn(void *a, void *b, void *c)
{
	alive[(long)a] = true;
}

void spawn(int i)
{
	k_thread_create(&threads[i], stacks[i], STACKSZ,
			thread_fn, (void *)(long)i, NULL, NULL,
			-2, 0, K_NO_WAIT);
}

int main(void)
{
	for (int i = 0; i < NTHREAD; i++) {
		spawn(i);
	}

	uint64_t n = 0, next_log_ms = 0;

	while (true) {
		for (int i = 0; i < NTHREAD; i++) {
			if (alive[i]) {
				k_thread_join(&threads[i], K_FOREVER);
				alive[i] = false;
				spawn(i);
				n++;
			}
		}

		uint64_t now = k_uptime_get();

		if (now > next_log_ms) {
			printk("%lld joins in %lld ms\n", n, now);
			next_log_ms += LOG_INTERVAL_MS;
		}
	}

	return 0;
}

@andyross
Collaborator

Flag @carlocaione and @povergoing for the seeming arm64 dependence. Also @dcpleung and @npitre are usually helpful on this sort of thing.

@npitre
Collaborator

npitre commented May 23, 2023

I confirm having the same experience: reproduced 3 times on ARM64 within
2 minutes (twice within 1 minute). Not reproducible on RISCV after
several hours.

Obviously the reported error dump above comes from RISCV.

@cfriedt: Can you test with the simplified code from @andyross and confirm
if you still get a failure on RISCV?

Oddly qemu just exits though and doesn't print a panic, did we break
something with panics in hello_world apps?

By default the hello_world app is configured with CONFIG_LOG and
CONFIG_ASSERT unset. You need at least the former to get a printed panic.

@andyross
Collaborator

Been running these all morning collecting statistics. Both qemu_riscv32_smp and qemu_riscv64_smp have been running without trouble for 6.1 hours now, reaching 1.84G and 1.49G joins respectively (qemu runs at different rates for different architectures). To the extent we can measure right now, joins on these architectures are 100% bug free.

As mentioned, arm64 is the problem child. The rig above dies (usually just exiting qemu, sometimes with a deadlock) relatively promptly for me. Over a few dozen runs I measured an average uptime of 6.5M joins and 72 seconds.

But x86_64 has a real, measurable issue too: checking and restarting regularly, I've seen a total of seven individual failures with an average uptime of 344M joins in 38 minutes (it turns out that x86 is about 60% faster in qemu, which I guess isn't surprising given that it matches the host architecture).

Not sure where to go from here though beyond careful reading of the platform interrupt and context switch code...

@cfriedt
Member Author

cfriedt commented May 23, 2023

@cfriedt: Can you test with the simplified code from @andyross and confirm
if you still get a failure on RISCV?

I'll give it a run later this evening

@cfriedt
Member Author

cfriedt commented May 23, 2023

@cfriedt: Can you test with the simplified code from @andyross and confirm
if you still get a failure on RISCV?

I'll give it a run later this evening

Have some cycles now...

@cfriedt
Member Author

cfriedt commented May 23, 2023

@andyross, @npitre - Definitely qemu_cortex_a53_smp is problematic and reproducibly fails multiple times within 1 minute.

(.venv) cfriedt@ubuntu:~/zephyrproject/zephyr$ west build -p auto -b qemu_cortex_a53_smp -t run samples/kernel/pressure/
-- west build: running target run
[0/1] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: cortex-a53
*** Booting Zephyr OS build zephyr-v3.3.0-4125-g86c5bf81b4e5 ***
Secondary CPU core 1 (MPID:0x1) is up
490 joins in 10 ms
299586 joins in 2010 ms
565609 joins in 4010 ms
833152 joins in 6010 ms
1094691 joins in 8010 ms
1366318 joins in 10010 ms
1639701 joins in 12010 ms
1905485 joins in 14010 ms
2168655 joins in 16010 ms
2421484 joins in 18010 ms
2685592 joins in 20010 ms
(.venv) cfriedt@ubuntu:~/zephyrproject/zephyr$ west build -p auto -b qemu_cortex_a53_smp -t run samples/kernel/pressure/
-- west build: running target run
[0/1] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: cortex-a53
*** Booting Zephyr OS build zephyr-v3.3.0-4125-g86c5bf81b4e5 ***
Secondary CPU core 1 (MPID:0x1) is up
452 joins in 10 ms

Cannot reproduce with qemu_riscv64_smp or qemu_riscv32_smp so far.

@cfriedt
Member Author

cfriedt commented May 24, 2023

Reworked the pthread_pressure test somewhat. I hadn't realized just how bad it is: when there are conditionals in the hot path or the host system is under heavy load, there is real and measurable "scheduler noise" (jitter).

Even just zassert*() macros checking the return values on every iteration are enough to bring down Andy's test above on qemu_riscv64_smp.

Whether that is due to missing barriers, opportunistic scheduling, or simply QEMU guessing wrong is hard to say.

In any case, I added a CONFIG_TEST_EXTRA_ASSERTIONS option to my test. It may be worth enabling for people running on real hardware. I'm not going to enable it by default right now, but without the extra assertions I definitely see no issues.

For the test I've written, I'll likely set the integration platform to qemu_riscv64_smp and will probably also skip the k_thread testing.

The main assertion I was trying to avoid was "aborted _current back from dead", and I'm no longer seeing that now.

*** Booting Zephyr OS build zephyr-v3.3.0-4131-g26b93375e256 ***
Running TESTSUITE pthread_pressure
===================================================================
START - test_k_thread_create_join
BOARD: qemu_riscv64_smp
NUM_THREADS: 2
TEST_NUM_CPUS: 2
TEST_DURATION_S: 10
TEST_DELAY_US: 0
Thread 0 created and joined 5144 times
Thread 1 created and joined 5143 times
Thread 0 created and joined 10205 times
Thread 1 created and joined 10204 times
Thread 0 created and joined 15363 times
Thread 1 created and joined 15361 times
Thread 0 created and joined 20464 times
Thread 1 created and joined 20461 times
Thread 0 created and joined 25632 times
Thread 1 created and joined 25628 times
Thread 0 created and joined 30751 times
Thread 1 created and joined 30747 times
Thread 0 created and joined 35922 times
Thread 1 created and joined 35917 times
Thread 0 created and joined 41028 times
Thread 1 created and joined 41023 times
Thread 0 created and joined 46109 times
Thread 1 created and joined 46104 times
Thread 0 created and joined 50708 times
Thread 1 created and joined 50703 times
 PASS - test_k_thread_create_join in 9.997 seconds
===================================================================
START - test_pthread_create_join
NUM_THREADS: 2
TEST_NUM_CPUS: 2
TEST_DURATION_S: 10
TEST_DELAY_US: 0
Thread 0 created and joined 3912 times
Thread 1 created and joined 3912 times
Thread 0 created and joined 7853 times
Thread 1 created and joined 7853 times
Thread 0 created and joined 11788 times
Thread 1 created and joined 11788 times
Thread 0 created and joined 15726 times
Thread 1 created and joined 15726 times
Thread 0 created and joined 19673 times
Thread 1 created and joined 19673 times
Thread 0 created and joined 23613 times
Thread 1 created and joined 23613 times
Thread 0 created and joined 27539 times
Thread 1 created and joined 27539 times
Thread 0 created and joined 31393 times
Thread 1 created and joined 31393 times
Thread 0 created and joined 35324 times
Thread 1 created and joined 35324 times
Thread 0 created and joined 38771 times
Thread 1 created and joined 38771 times
 PASS - test_pthread_create_join in 10.000 seconds
===================================================================
TESTSUITE pthread_pressure succeeded

------ TESTSUITE SUMMARY START ------

SUITE PASS - 100.00% [pthread_pressure]: pass = 2, fail = 0, skip = 0, total = 2 duration = 19.997 seconds
 - PASS - [pthread_pressure.test_k_thread_create_join] duration = 9.997 seconds
 - PASS - [pthread_pressure.test_pthread_create_join] duration = 10.000 seconds

------ TESTSUITE SUMMARY END ------

===================================================================
PROJECT EXECUTION SUCCESSFUL

@cfriedt
Member Author

cfriedt commented May 24, 2023

Even just running under Twister creates so much additional scheduler noise that most platforms will fail.

It's not something that can be easily fixed, I would say.

@povergoing
Member

I saw the issue on fvp_base_revc_2xaemv8a_smp_ns (arm64). It took about 85 minutes to reproduce:

19730316 joins in 5142002 ms
19737977 joins in 5144001 ms
19745568 joins in 5146002 ms
19753184 joins in 5148001 ms
19760836 joins in 5150001 ms
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1785
        aborted _current back from dead
[01:25:50.995,000] <err> os: ELR_ELn: 0x000000[01:25:50.995,000] <err> os: ESR_ELn: 0x0000000056000002
[01:25:50.995,000] <err> os:   EC:  0x15 (Unknown)
[01:25:50.996,000] <e00088003a0c
[01:25:50.996,000] <err> os:   ISS: 0x2
[01:25:50.996,000]rr> os:   IL:  0x1
[01:25:50.996,000] <err> os: x0:  0x0000000088010507  x1:  0x00000000000006f9
<err> os: TPIDRRO: 0x010000008801ae50
[01:25:50.996,000] <err> os: x2:  0x0000000000000000  x3:  0x00000000000000ee
[01:25:50.996,000]  [01:25:50.996,000] <err> os: x6:  0x0000000000000000  x7:  0x00000000ffffff80
<err> os: x4:  0x000000008801e2c0  x5:  0x00000000ffffffc8
[01:25:50.996,000] <err> os: x8:  0x0000000000000004  x9:  0x0000000000000118
[01:25:50.997,000] <err> os: x10: 0x0000000088[01:25:50.997,000] <err> os: x12: 0x0000000088008010826  x11: 0x000000008801e2c0
[01:25:50.997,000] <err> os: x14: 0x0000000200007fed  x15: 0x0000000000000000
c28  x13: 0x000000008801e280
[01:25:50.998,000] <err> os: x16: 0x0000000000000001  x17: 0x0000000000000003
[01:25:50.998,000] <err> os: x18: 0xfffffffffffffff8  lr:  0x000000008800c128
[01:25:50.998,000] 01:25:50.998,000] <err> os: Current thread: 0x88;31m<err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 2
01a118 (unknown)
[01:25:51.027,000] <err> os: Halting system

@andyross
Collaborator

FWIW: I just killed the RISC-V processes I had running from yesterday after 25 hours and no failures.

@povergoing thanks for the confirmation on FVP. And while the data point is only one sample, that's about the reproduction rate @npitre and I measured. Frustratingly it looks like FVP is running this code about 30x slower than qemu.

Interestingly though, FVP is (I think?) more cycle-exact than qemu and doesn't have the wild qemu time warp behavior due to scheduling CPUs on host threads. That the behavior is basically the same argues that the problems @cfriedt is seeing with CI/twister in the pthread PR are a different bug.

@npitre
Collaborator

npitre commented May 24, 2023

What can we deduce so far?

  • If the bug is reproducible on both QEMU and FVP then this is most likely
    not a QEMU bug.

  • If Chris is able to trigger the bug on RISCV using a different application
    then this could be code location dependent.

  • And if this bug can be triggered on RISCV too given the right conditions
    then this would mean it is not ARM64 specific, adding importance to the
    code location susceptibility.

@cfriedt
Member Author

cfriedt commented May 24, 2023

What can we deduce so far?

  • If the bug is reproducible on both QEMU and FVP then this is most likely
    not a QEMU bug.

Hmm.. I'm not familiar with FVP

  • If Chris is able to trigger the bug on RISCV using a different application
    then this could be code location dependent.

To note though, I was originally checking / zasserting on every return code, corner case, etc - I guess it's possible that it's not specific to Qemu.

  • And if this bug can be triggered on RISCV too given the right conditions
    then this would mean it is not ARM64 specific, adding importance to the
    code location susceptibility.

The plot thickens...

@npitre
Collaborator

npitre commented May 24, 2023

More data points:

  • I'm using current main i.e. commit defb159 at the moment

  • Pasted the code above into samples/hello_world/src/main.c

  • west build -p -b qemu_cortex_a53_smp samples/hello_world/

  • sed -ie 's/# \(CONFIG_LOG\) is not set/\1=y/' build/zephyr/.config

  • west build -t run

Crash reliably happens within 2 seconds:

*** Booting Zephyr OS build zephyr-v3.3.0-4339-gdefb159ab1c4 ***
Secondary CPU core 1 (MPID:0x1) is up
22 joins in 10 ms
187149 joins in 2010 ms
[00:00:03.570,000] <err> os: ELR_ELn: 0x0000000040061980
[00:00:03.570,000] <err> os: ESR_ELn: 0x000000008600000f
[00:00:03.570,000] <err> os:   EC:  0x21 (Instruction Abort taken without a change in Exception level.)
[00:00:03.570,000] <err> os:   IL:  0x1
[00:00:03.570,000] <err> os:   ISS: 0xf
[00:00:03.570,000] <err> os: FAR_ELn: 0x0000000040061980
[00:00:03.570,000] <err> os: TPIDRRO: 0x01000000400177b8
[00:00:03.570,000] <err> os: x0:  0x0000000040006050  x1:  0x0000000040016930
[00:00:03.570,000] <err> os: x2:  0x0000000040016310  x3:  0x0000000040016310
[00:00:03.570,000] <err> os: x4:  0x0000000040061980  x5:  0x0000000040016310
[00:00:03.570,000] <err> os: x6:  0x00000000400619c0  x7:  0x0000000040009bec
[00:00:03.570,000] <err> os: x8:  0x0000000000000040  x9:  0x0000000040016310
[00:00:03.570,000] <err> os: x10: 0x0000000040017860  x11: 0x0000000000000000
[00:00:03.570,000] <err> os: x12: 0x0000000040016f50  x13: 0x0000000000000000
[00:00:03.570,000] <err> os: x14: 0x00000000400619f0  x15: 0x0000000040002204
[00:00:03.570,000] <err> os: x16: 0x0000000040016f50  x17: 0x0000000040016310
[00:00:03.570,000] <err> os: x18: 0x0000000040017860  lr:  0x00000000400021fc
[00:00:03.570,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
[00:00:03.570,000] <err> os: Current thread: 0x40016310 (unknown)
[00:00:03.580,000] <err> os: Halting system

If I then set CONFIG_ASSERT=y, it takes somewhat longer to crash, like
within 2 minutes as I initially reported.

Rebasing on top of PR #58086, which in theory shouldn't matter, does
in fact make this crash go away.

@povergoing
Member

Rebasing on top of PR #58086, which in theory shouldn't matter, does
in fact make this crash go away.

I don't think we can say #58086 is related to this one:

  1. I disabled FPU & FPU_SHARING and also reproduced the issue on qemu_cortex_a53
  2. I found that adding a printk can make this crash go away too (I added a printk before __ASSERT(false, "aborted _current back from dead");). I didn't test for very long, but the crash did not happen for at least 1 minute.

@npitre
Collaborator

npitre commented May 25, 2023 via email

@carlocaione
Collaborator

carlocaione commented May 26, 2023

So, I think the problem is that there is a race between z_thread_abort() and k_thread_join().

What happens is:

  1. A thread is marked as dead on one core at:

     end_thread(thread);

     Note that this happens before any arch_switch() has actually taken place.
  2. As soon as the thread is marked as such, k_thread_join() on a different CPU is unblocked (zephyr/kernel/sched.c, lines 1808 to 1809 at 7457bcf):

     if ((thread->base.thread_state & _THREAD_DEAD) != 0U) {
             ret = 0;

     and a new thread (in the same struct as the dying one) is created and scheduled before the dead one has actually been swapped out.

This patch makes the sample run fine for me:

diff --git a/kernel/sched.c b/kernel/sched.c
index a13f94bf3a..460767438d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1806,6 +1806,7 @@ int z_impl_k_thread_join(struct k_thread *thread, k_timeout_t timeout)
        SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_thread, join, thread, timeout);
 
        if ((thread->base.thread_state & _THREAD_DEAD) != 0U) {
+               wait_for_switch(thread);
                ret = 0;
        } else if (K_TIMEOUT_EQ(timeout, K_NO_WAIT)) {
                ret = -EBUSY;
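
For context, conceptually the added spin just waits until the CPU that last ran the dying thread has published its switch handle, i.e. until the switch away from it has really completed. A rough sketch of the idea (illustrative names only; wait_for_switch() in the patch above is the real kernel helper, and its details may differ):

/* Illustrative sketch, not the actual kernel code: don't treat a thread
 * marked _THREAD_DEAD as safe to reuse until the CPU it last ran on has
 * written its switch handle, i.e. has truly switched away from it.
 */
static inline void wait_until_switched_out(struct k_thread *thread)
{
	volatile void **shp = (volatile void **)&thread->switch_handle;

	while (*shp == NULL) {
		/* the dying thread is still running its exit path on
		 * another CPU; keep spinning
		 */
	}
}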

@carlocaione
Collaborator

carlocaione commented May 26, 2023

Of course, if this is indeed the root cause, a proper patch must be cooked up.

@povergoing
Member

povergoing commented May 26, 2023

Shall we also warn somewhere in the thread_create function if the thread hasn't been aborted yet? I met a similar issue last time, creating a thread before it had been aborted (#57681). (This might be a kind of incorrect usage by the app/test, but I think the kernel should detect it and warn.)

@carlocaione
Collaborator

Shall we also warn somewhere in the thread_create function if the thread hasn't been aborted yet?

Yeah, this is another way to fix this I think.

@andyross
Collaborator

Nice catch. Though if this is the case, I'm confused why it's so arch-dependent. It's not like arm64 is just more likely to fail than riscv; it's literally billions of times more likely. Is there a similar spin loop somewhere in the riscv internals that is saving us?

@povergoing Unfortunately, for historical reasons we can't do much validation in k_thread_create(). The way the API is specified, it takes an uninitialized struct k_thread memory block, so we can't make any assumptions about what the contents are. It could be arbitrary garbage with a state that says "running", it could have broken or dangling pointers in the waitq list node that make it look like it's pended, etc...

@andyross
Collaborator

But to be clear, I'm convinced this is it. It's exactly the same race we have in context switch, and frankly your fix is exactly right. If it's OK with you, I'm going to clean it up with some refactoring and clearer docs to make it a more proper-looking scheduler API. But I promise I'll credit you. :) Or you can just submit that as is and I can follow up with the cleanup later.

FWIW: this is also a good opportunity to add a call to the shiny new barrier API, which this layer needs theoretically (though on qemu it's likely not emulated correctly and probably not part of the problem here).

@carlocaione
Collaborator

carlocaione commented May 26, 2023

But to be clear, I'm convinced this is it. It's exactly the same race we have in context switch, and frankly your fix is exactly right. If it's OK with you, I'm going to clean it up with some refactoring and clearer docs to make it a more proper-looking scheduler API. But I promise I'll credit you. :)

@andyross please go ahead, I definitely trust you more with this code than myself :D

@npitre
Collaborator

npitre commented May 26, 2023

@carlocaione: Congrats for finding this race! I was inching my way towards
it last night when I decided to go to bed instead. Glad you beat me to it.

Nice catch. Though I'm confused if this is the case why it's so
arch-dependent. It's not like arm64 is just more likely to fail
than riscv, it's literally billions of times more likely. Is there
a similar spin loop somewhere in the riscv internals that is saving us?

@andyross: I'd blame QEMU implementation differences here. I'd guess that
on real hardware both would be as likely (or maybe as unlikely).

@andyross
Collaborator

OK, fix is up at #58334.

This did fine for me on a smoke test running all the qemu SMP platforms against tests/kernel/ and tests/posix/ (three routine timing-related failures seen that cleaned up with retries), and is up to 23 minutes now running the hello_world above against qemu_cortex_a53_smp. I'll see about spawning off some more to run in parallel.

Obviously this was heavily build-dependent, so if folks could all try their favorite test rigs against it that would be helpful.

andyross added a commit to andyross/zephyr that referenced this issue May 26, 2023
As discovered by Carlo Caione, the k_thread_join code had a case where
it detected it had been called on a thread already marked _THREAD_DEAD
and exited early.  That's not sufficient.  The thread state is mutated
from the thread itself on its exit path.  It may still be running!

Just like the code in z_swap(), we need to spin waiting on the other
CPU to write the switch handle before knowing it's safe to return,
otherwise the calling context might (and did) do something like
immediately k_thread_create() a new thread in the "dead" thread's
struct while it was still running on the other core.

There was also a similar case in k_thread_abort() which had the same
issue: it needs to spin waiting on the other CPU to kill the thread
via the same mechanism.

Fixes zephyrproject-rtos#58116

Originally-by: Carlo Caione <[email protected]>
Signed-off-by: Andy Ross <[email protected]>
jgl-meta pushed a commit that referenced this issue May 26, 2023 (same commit message as above)
cfriedt pushed a commit that referenced this issue May 27, 2023 (same commit message as above, cherry picked from commit a08e23f)
@povergoing
Member

@povergoing Unfortunately, for historical reasons we can't do much validation in k_thread_create(). The way the API is specified, it takes an uninitialized struct k_thread memory block, so we can't make any assumptions about what the contents are. It could be arbitrary garbage with a state that says "running", it could have broken or dangling pointers in the waitq list node that make it look like it's pended, etc...

Agreed, thanks for the explanation. It's more like a question than a fix :)

@cfriedt
Member Author

cfriedt commented May 27, 2023

I hate to be a party pooper too, but this bug still exists on the v2.7-branch. I guess aim for v2.7.6 for an actual release, but the required changes should still be backported to the v2.7-branch ASAP.
