test_posix_common: fix test_clock_gettime_rollover failure #55208

hakehuang · 2023-02-27T07:59:01Z

change the CONFIG_SYS_CLOCK_TICKS_PER_SEC to a smaller number to increase the systick accurate and meet test requirement

fixing: #53605

hakehuang · 2023-02-28T10:34:18Z

@cfriedt , I think disable the tickless by CONFIG_TICKLESS_KERNEL=n expose an issue in test_pthread_descriptor_leak, as I see in the failure log

===================================================================
START - test_pthread_descriptor_leak
 PASS - test_pthread_descriptor_leak in 0.015 seconds
===================================================================
START - test_sleep
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1782
	aborted _current back from dead
E:  ** FATAL EXCEPTION
E:  ** CPU 0 EXCCAUSE 63 (zephyr exception)
E:  **  PC 0x6000ad3a VADDR (nil)
E:  **  PS 0x60220
E:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:2 CALLINC:2)
E:  **  A0 0xa000abec  SP 0x600167b0  A2 0x4  A3 0x600167c0
E:  **  A4 0x600167a0  A5 0x4  A6 0x60120  A7 0x60020
E:  **  A8 0xa0009694  A9 0x600166c0 A10 0x21 A11 0x21
E:  ** A12 0x4 A13 0x25 A14 0x600167a0 A15 0x4
E:  ** SAR 0x4
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x6001ba50 (unknown)
E: Halting system

if I disable the test_pthread_descriptor_leak, all other cases will pass. and if I reduce the thread cound can also helps for some platforms, but the qemu_cortex_r5 seems keeps failing.

and there are also failures like

===================================================================
START - test_pthread_descriptor_leak

    Assertion failed at WEST_TOPDIR/zephyr/tests/posix/common/src/pthread.c:598: posix_apis_test_pthread_descriptor_leak: (pthread_join(pthread1, &unused) is non-zero)
unable to join thread 9
 FAIL - test_pthread_descriptor_leak in 0.011 seconds
===================================================================
START - test_sleep
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1782
	aborted _current back from dead
E: r0/a1:  0x00000004  r1/a2:  0x000006f6  r2/a3:  0xff00002c
E: r3/a4:  0x00008189 r12/ip:  0x000242b0 r14/lr:  0x0000aa01
E:  xpsr:  0x0000013f
E: Faulting instruction address (r15/pc): 0x0000c062
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x20038 (unknown)
E: Halting system

so for the test_clock_gettime_rollovers issue, I think the root cause is the tickless settings, but resolve this issue cause another problem in kernel, the strange thing here is I can not reproduce it on my RT1160/RT1170 board, looks like the real time respones also matters here, what do you think? if you agree this could be a bug, I can submit another ticket for this. Thanks

cfriedt · 2023-02-28T10:46:53Z

@hakehuang - Have you step-debugged the pthread issue? It's kind of vague otherwise. The test itself should definitely work as-is. There may be a race condition with the scheduler as well which is not part of the posix library.

hakehuang · 2023-02-28T14:12:49Z

Have you step-debugged the pthread issue? It's kind of vague otherwise. The test itself should definitely work as-is. There may be a race condition with the scheduler as well which is not part of the posix library.

I think it is not test issue, but a problem in simulator platform with tickless disabled, I will try to further debug it.

hakehuang · 2023-03-01T15:32:34Z

@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon

hakehuang · 2023-03-01T15:45:39Z

@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon

cfriedt · 2023-03-01T18:14:57Z

@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon

Interesting. I wonder if it's in the pthread part or the scheduler part 🤔

hakehuang · 2023-03-02T02:49:22Z

Interesting. I wonder if it's in the pthread part or the scheduler part 🤔

I think it is in the schedule part as the pthread_join applys mutex, which may introduce schedule issue, maybe we use spin lock be better

cfriedt · 2023-03-03T02:59:59Z

pthread_join() needs to be able to yield execution though, so a spinlock probably would not work.

hakehuang · 2023-03-03T03:29:37Z

pthread_join() needs to be able to yield execution though, so a spinlock probably would not work.

this could be a request to posix lib? shall I raise a ticket? and for this PR can you aprove it? as this is mainly to resolve the test_clock_gettime_rollover issue and workaround for pthread_join race condition.

cfriedt · 2023-03-03T13:29:36Z

tests/posix/common/prj.conf

@@ -8,5 +8,6 @@ CONFIG_POSIX_MQUEUE=y
 CONFIG_HEAP_MEM_POOL_SIZE=4096
 CONFIG_MAX_THREAD_BYTES=4
 CONFIG_THREAD_NAME=y
+CONFIG_TICKLESS_KERNEL=n


I don't think it makes sense to add this in prj.conf. IIRC, the kernel should operate with similar levels of reliability in both tick-full or tickless mode.

Assuming that this worked correctly in the past, it would be great to see a git bisect to help find a root cause here. Currently, we are only speculating where the issue is.

the test_clock_gettime_rollover is failed at NXP RT11xx platfrom since the first merge into main branch.

@cfriedt from the test process we can see that it call the
sys_clock_tick_set which is mapped to zephyr/kernel/timeout.c

#ifdef CONFIG_ZTEST void z_impl_sys_clock_tick_set(uint64_t tick) { curr_tick = tick; }

we expect the curr_tick is set to given value.
and then as call clock_gettime() to get the value back, which in our RT platform it is calculated as
zephyr/lib/posix/clock.c

int z_impl_clock_gettime(clockid_t clock_id, struct timespec *ts)

and then it calls the
k_uptime_ticks()
in zephyr/kernel/timeout.c

int64_t z_impl_k_uptime_ticks(void) { return sys_clock_tick_get(); }

which is

int64_t sys_clock_tick_get(void) { uint64_t t = 0U; LOCKED(&timeout_lock) { t = curr_tick + elapsed(); } return t; }

so the actually t is the set curr_tick plus elapsed(), which is

static int32_t elapsed(void) { return announce_remaining == 0 ? sys_clock_elapsed() : 0U; }

then calls the the cortex_systick driver cortex_m_systick.c

uint32_t sys_clock_elapsed(void) { if (!TICKLESS) { return 0; } k_spinlock_key_t key = k_spin_lock(&lock); uint32_t cyc = elapsed() + cycle_count - announced_cycles; k_spin_unlock(&lock, key); return cyc / CYC_PER_TICK; }

whic is return 0 if not enable TICKLESS, and this is what we expected, but in tickless mode the return value will be larger than we expected, and in our case it happens to be the same as we set to UINT64_MAX, see below

Running TESTSUITE posix_apis =================================================================== START - test_clock_gettime_rollover CONFIG_SYS_CLOCK_TICKS_PER_SEC: 10000 rollover_s: 1844674407370955 t-1: {1844674407370955, 161500000} t+0: {1844674407370955, 161500000} Assertion failed at WEST_TOPDIR/zephyr/tests/posix/common/src/clock.c:178: posix_apis_test_clock_gettime_rollover: ts_gt(&ts[1], &ts[0]) is false FAIL - test_clock_gettime_rollover in 0.027 seconds

so the assert error is because the UINT64_MAX - 1 , and UINT64_MAX will get the same value back when calls clock_gettime.

#define CYC_PER_TICK (sys_clock_hw_cycles_per_sec() \ / CONFIG_SYS_CLOCK_TICKS_PER_SEC)

while in rt11xx series the value is 996000000/10000,
and the

uint32_t cyc = elapsed() + cycle_count - announced_cycles; return cyc / CYC_PER_TICK;

will be a key here, but I think this would be a precise issue in systick driver when process the elasped in tickless mode

a simple proof here is to change the CONFIG_SYS_CLOCK_TICKS_PER_SEC to a smaller number say 2000. @cfriedt

cfriedt · 2023-03-03T13:29:53Z

tests/posix/common/src/pthread.c

 	/* If we are leaking descriptors, then this loop will never complete */
 	for (size_t i = 0; i < CONFIG_MAX_PTHREAD_COUNT * 2; ++i) {
 		zassert_ok(pthread_create(&pthread1, &attr, create_thread1, NULL),
 			   "unable to create thread %zu", i);
 		zassert_ok(pthread_join(pthread1, &unused), "unable to join thread %zu", i);
+		usleep(USEC_PER_MSEC * 500U);


This should not be necessary. Adding arbitrary delays is a slippery-slope way to work around synchronization issues. I would prefer to see a root cause first.

This is never found failure if we do not switch to none tickless mode, and I find it may related to thread protection.
by default the CONFIG_MAX_PTHREAD_MUTEX_COUNT=5, CONFIG_MAX_PTHREAD_COND_COUNT=5,
and pthread_join will depend on those pthread_mutex and pthread_cond, and in the test, it seems quickly exhaust them. so the test case seems need add some idle tiem to release those resources. @cfriedt

It should not exhaust them though, so there would seem to be some kind of synchronization problem. Have you tried with something like CONFIG_SPIN_VALIDATE=y? Enabling debug logging at the kernel level?

Okay, had to read up on posix threads... The assumption is that the pthread_join() doesn't return until the thread is terminated. But posix is also relying on the underlying RTOS to clean up the thread. Do we have a case here where the pthread_join() is waiting on the pthread->state_cond variable as the indicator that the thread has exited... which it did, but the kernel hasn't yet cleaned up the thread?

pthread_exit() at the does the pthread_cond_broadcast() on the state change of the state_cond variable then at the end of the routine it destroys and does the k_thread_abort(). Does the pthread_cond_broadcast() allow the scheduler to redo who is running and thus allows the test thread to run again? Is there a change that can be done to ensure that pthread_join() doesn't actually return until the zephyr kernel has cleaned up the thread?

Okay, had to read up on posix threads... The assumption is that the pthread_join() doesn't return until the thread is terminated. But posix is also relying on the underlying RTOS to clean up the thread. Do we have a case here where the pthread_join() is waiting on the pthread->state_cond variable as the indicator that the thread has exited... which it did, but the kernel hasn't yet cleaned up the thread?

Quite possible.

pthread_exit() at the does the pthread_cond_broadcast() on the state change of the state_cond variable then at the end of the routine it destroys and does the k_thread_abort(). Does the pthread_cond_broadcast() allow the scheduler to redo who is running and thus allows the test thread to run again? Is there a change that can be done to ensure that pthread_join() doesn't actually return until the zephyr kernel has cleaned up the thread?

To me that sounds like change that is necessary here. If it can be isolated / reproduced reliably, which @hakehuang has done, and then we have a change such that the repro recipe results in a successful test, I think that would be sufficient to close the bug.

I suspect this issue has likely been present for A Long Time (TM). I doubt that it's a regression. I think that this new test just uncovered something that has always been there.

True - I can't remember the exact name for that but inserting a transitional or "fake" state to simplify a model is usually a good solution.

I wish I could reproduce this on a board with a debugger. Can you recommend an entry level NXP board available that I could use to do that?

@cfriedt , this issue only happens on qemu and renode simualtor.

Can you update the issue with another repro command?

below steps will recreate this issue

west build -b qemu_cortex_r5 west build -t run

The problem is that if I use the west build -t debugserver, and then use gdb to connect it halts untill I quit gdb.

I will add a new tickless configuration to check this issue

dleach02 · 2023-03-10T15:04:03Z

tests/posix/common/testcase.yaml

  portability.posix.common.tls.newlib:
    platform_exclude: nsim_sem_mpu_stack_guard ehl_crb lpcxpresso55s06
    filter: TOOLCHAIN_HAS_NEWLIB == 1 and CONFIG_ARCH_HAS_THREAD_LOCAL_STORAGE and
      CONFIG_TOOLCHAIN_SUPPORTS_THREAD_LOCAL_STORAGE
    extra_configs:
      - CONFIG_NEWLIB_LIBC=y
      - CONFIG_THREAD_LOCAL_STORAGE=y
-      - CONFIG_MAIN_STACK_SIZE=1152
+      - CONFIG_MAIN_STACK_SIZE=4096


while I appreciate that the main stack needed to be increased, this seems like a really large increase? Do we have any insight on what the size really needed?

I agree - it would be good to know exactly how much space is required, and on what platforms.

after disable the tickless mode there seems no need to add the stack_size for rt11 series

reduce the ticks per sec such the the timer accuracy meet test requirement fixing: #53605 Signed-off-by: Hake Huang <[email protected]>

cfriedt · 2023-03-23T00:40:51Z

@hakehuang - I'm trying to reproduce this with qemu_cortex_r5 as you mentioned, but I can't seem to reproduce it with any of these

twister -i -p qemu_cortex_r5 -T tests/posix/common/
west build -p auto -b qemu_cortex_r5 -t run tests/posix/common/

Is there a better command to use?

hakehuang · 2023-03-23T03:20:42Z

@hakehuang - I'm trying to reproduce this with qemu_cortex_r5 as you mentioned, but I can't seem to reproduce it with any of these
twister -i -p qemu_cortex_r5 -T tests/posix/common/
west build -p auto -b qemu_cortex_r5 -t run tests/posix/common/
Is there a better command to use?

you need disable the tickless mode, for latest code, I change to reduce the tick_pre_second, so if you want reproduce this issue, you need add below to prj.conf

CONFIG_TICKLESS_KERNEL=n

in my latest code change, the solution avoid such senario, and I plan to create another PR which will disable the tickless mode which will expose this issue, and I can test against your pthread code change before I create that PR.

cfriedt · 2023-03-23T11:18:09Z

@hakehuang - it's possible that you might need to create a separate issue related to tickless systems. I don't see this as being directly related to the linked issue, 53605

hakehuang · 2023-03-23T17:06:14Z

@hakehuang - it's possible that you might need to create a separate issue related to tickless systems. I don't see this as being directly related to the linked issue, 53605

@cfriedt , this is not related to the fix, but a side effect found if I use none tickless mode, I will report a new ticket for this finging, in short to reproduct thi sissue

add the CONFIG_TICKLESS_KERNEL=n in prj.conf
build posix_common as below

west build -b qemu_cortex_r5 -- -DCONFIG_NEWLIB_LIBC=y
west build -t run

issue reported #56163

hakehuang · 2023-03-28T08:32:56Z

@cfriedt any comments on this PR's fix? as the tickless issue is reported in #56163 and not related to this fix

cfriedt · 2023-03-28T10:30:19Z

tests/posix/common/prj.conf

@@ -8,5 +8,6 @@ CONFIG_POSIX_MQUEUE=y
 CONFIG_HEAP_MEM_POOL_SIZE=4096
 CONFIG_MAX_THREAD_BYTES=4
 CONFIG_THREAD_NAME=y
+CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000


I would suggest not adding this here, but maybe adding the tickless option in testcase.yaml as another test configuration. Technically, you could also add newlib to that same configuration and if necessary, adjust allowed platforms in that configuration.

@cfriedt CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000, is to fix this issue #53605, and if we add disable tickless option and we will find the issue #56163.

I will add a new test config with disable tickless, once the fix is ready.

@hakehuang - I believe adding CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000 to the test project is more of a workaround than a solution. I will try and adjust the testcase so that it passes for the nordic boards mentioned in 53605 and will work with you to ensure that 56163 works as well.

@cfriedt , just a reminder, in the cortex_m_systick.c

uint32_t cyc = elapsed() + cycle_count - announced_cycles; return cyc / CYC_PER_TICK;

so if the platform runs too fast, the default CYC_PER_TICK is 10000, which means the two timer read will return the same value.

hakehuang · 2023-04-01T13:58:33Z

the test code owner revert this test_clock_getting_rollover case

hakehuang requested a review from cfriedt as a code owner February 27, 2023 07:59

zephyrbot added the area: POSIX POSIX API Library label Feb 27, 2023

zephyrbot assigned cfriedt Feb 27, 2023

hakehuang requested review from carlescufi, dleach02, mmahadevan108 and danieldegrasse February 27, 2023 08:00

cfriedt requested changes Mar 3, 2023

View reviewed changes

dleach02 reviewed Mar 10, 2023

View reviewed changes

test_posix_common: fix test_clock_gettime_rollover failure

def18b4

reduce the ticks per sec such the the timer accuracy meet test requirement fixing: #53605 Signed-off-by: Hake Huang <[email protected]>

hakehuang requested review from cfriedt and dleach02 March 22, 2023 11:25

hakehuang mentioned this pull request Mar 23, 2023

posix: pthread: race condition between pthread_create() and pthread_join() #56163

Closed

cfriedt reviewed Mar 28, 2023

View reviewed changes

hakehuang closed this Apr 1, 2023

test_posix_common: fix test_clock_gettime_rollover failure #55208

test_posix_common: fix test_clock_gettime_rollover failure #55208

Conversation

hakehuang commented Feb 27, 2023 • edited Loading

hakehuang commented Feb 28, 2023 • edited Loading

cfriedt commented Feb 28, 2023 • edited Loading

hakehuang commented Feb 28, 2023

hakehuang commented Mar 1, 2023

hakehuang commented Mar 1, 2023

cfriedt commented Mar 1, 2023

hakehuang commented Mar 2, 2023

cfriedt commented Mar 3, 2023

hakehuang commented Mar 3, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hakehuang Mar 5, 2023 • edited Loading

Choose a reason for hiding this comment

hakehuang Mar 6, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hakehuang Mar 6, 2023 • edited Loading

Choose a reason for hiding this comment

cfriedt Mar 10, 2023 • edited Loading

Choose a reason for hiding this comment

dleach02 Mar 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hakehuang Mar 13, 2023 • edited Loading

Choose a reason for hiding this comment

cfriedt commented Mar 23, 2023

hakehuang commented Mar 23, 2023

cfriedt commented Mar 23, 2023

hakehuang commented Mar 23, 2023 • edited Loading

hakehuang commented Mar 28, 2023

Choose a reason for hiding this comment

hakehuang Mar 28, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hakehuang commented Apr 1, 2023

hakehuang commented Feb 27, 2023 •

edited

Loading

hakehuang commented Feb 28, 2023 •

edited

Loading

cfriedt commented Feb 28, 2023 •

edited

Loading

hakehuang Mar 5, 2023 •

edited

Loading

hakehuang Mar 6, 2023 •

edited

Loading

hakehuang Mar 6, 2023 •

edited

Loading

cfriedt Mar 10, 2023 •

edited

Loading

dleach02 Mar 10, 2023 •

edited

Loading

hakehuang Mar 13, 2023 •

edited

Loading

hakehuang commented Mar 23, 2023 •

edited

Loading

hakehuang Mar 28, 2023 •

edited

Loading