Skip to content

test_posix_common: fix test_clock_gettime_rollover failure #55208

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

test_posix_common: fix test_clock_gettime_rollover failure #55208

wants to merge 1 commit into from

Conversation

hakehuang
Copy link
Collaborator

@hakehuang hakehuang commented Feb 27, 2023

change the CONFIG_SYS_CLOCK_TICKS_PER_SEC to a smaller number to increase the systick accurate and meet test requirement

fixing: #53605

@hakehuang
Copy link
Collaborator Author

hakehuang commented Feb 28, 2023

@cfriedt , I think disable the tickless by CONFIG_TICKLESS_KERNEL=n expose an issue in test_pthread_descriptor_leak, as I see in the failure log

===================================================================
START - test_pthread_descriptor_leak
 PASS - test_pthread_descriptor_leak in 0.015 seconds
===================================================================
START - test_sleep
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1782
	aborted _current back from dead
E:  ** FATAL EXCEPTION
E:  ** CPU 0 EXCCAUSE 63 (zephyr exception)
E:  **  PC 0x6000ad3a VADDR (nil)
E:  **  PS 0x60220
E:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:2 CALLINC:2)
E:  **  A0 0xa000abec  SP 0x600167b0  A2 0x4  A3 0x600167c0
E:  **  A4 0x600167a0  A5 0x4  A6 0x60120  A7 0x60020
E:  **  A8 0xa0009694  A9 0x600166c0 A10 0x21 A11 0x21
E:  ** A12 0x4 A13 0x25 A14 0x600167a0 A15 0x4
E:  ** SAR 0x4
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x6001ba50 (unknown)
E: Halting system

if I disable the test_pthread_descriptor_leak, all other cases will pass. and if I reduce the thread cound can also helps for some platforms, but the qemu_cortex_r5 seems keeps failing.

and there are also failures like

===================================================================
START - test_pthread_descriptor_leak

    Assertion failed at WEST_TOPDIR/zephyr/tests/posix/common/src/pthread.c:598: posix_apis_test_pthread_descriptor_leak: (pthread_join(pthread1, &unused) is non-zero)
unable to join thread 9
 FAIL - test_pthread_descriptor_leak in 0.011 seconds
===================================================================
START - test_sleep
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1782
	aborted _current back from dead
E: r0/a1:  0x00000004  r1/a2:  0x000006f6  r2/a3:  0xff00002c
E: r3/a4:  0x00008189 r12/ip:  0x000242b0 r14/lr:  0x0000aa01
E:  xpsr:  0x0000013f
E: Faulting instruction address (r15/pc): 0x0000c062
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x20038 (unknown)
E: Halting system

so for the test_clock_gettime_rollovers issue, I think the root cause is the tickless settings, but resolve this issue cause another problem in kernel, the strange thing here is I can not reproduce it on my RT1160/RT1170 board, looks like the real time respones also matters here, what do you think? if you agree this could be a bug, I can submit another ticket for this. Thanks

@cfriedt
Copy link
Member

cfriedt commented Feb 28, 2023

@hakehuang - Have you step-debugged the pthread issue? It's kind of vague otherwise. The test itself should definitely work as-is. There may be a race condition with the scheduler as well which is not part of the posix library.

@hakehuang
Copy link
Collaborator Author

Have you step-debugged the pthread issue? It's kind of vague otherwise. The test itself should definitely work as-is. There may be a race condition with the scheduler as well which is not part of the posix library.

I think it is not test issue, but a problem in simulator platform with tickless disabled, I will try to further debug it.

@hakehuang
Copy link
Collaborator Author

@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon

1 similar comment
@hakehuang
Copy link
Collaborator Author

@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon

@cfriedt
Copy link
Member

cfriedt commented Mar 1, 2023

@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon

Interesting. I wonder if it's in the pthread part or the scheduler part 🤔

@hakehuang
Copy link
Collaborator Author

Interesting. I wonder if it's in the pthread part or the scheduler part 🤔

I think it is in the schedule part as the pthread_join applys mutex, which may introduce schedule issue, maybe we use spin lock be better

@cfriedt
Copy link
Member

cfriedt commented Mar 3, 2023

pthread_join() needs to be able to yield execution though, so a spinlock probably would not work.

@hakehuang
Copy link
Collaborator Author

pthread_join() needs to be able to yield execution though, so a spinlock probably would not work.

this could be a request to posix lib? shall I raise a ticket? and for this PR can you aprove it? as this is mainly to resolve the test_clock_gettime_rollover issue and workaround for pthread_join race condition.

@@ -8,5 +8,6 @@ CONFIG_POSIX_MQUEUE=y
CONFIG_HEAP_MEM_POOL_SIZE=4096
CONFIG_MAX_THREAD_BYTES=4
CONFIG_THREAD_NAME=y
CONFIG_TICKLESS_KERNEL=n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes sense to add this in prj.conf. IIRC, the kernel should operate with similar levels of reliability in both tick-full or tickless mode.

Assuming that this worked correctly in the past, it would be great to see a git bisect to help find a root cause here. Currently, we are only speculating where the issue is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test_clock_gettime_rollover is failed at NXP RT11xx platfrom since the first merge into main branch.

Copy link
Collaborator Author

@hakehuang hakehuang Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfriedt from the test process we can see that it call the
sys_clock_tick_set which is mapped to zephyr/kernel/timeout.c

#ifdef CONFIG_ZTEST
void z_impl_sys_clock_tick_set(uint64_t tick)
{
	curr_tick = tick;
}

we expect the curr_tick is set to given value.
and then as call clock_gettime() to get the value back, which in our RT platform it is calculated as
zephyr/lib/posix/clock.c

int z_impl_clock_gettime(clockid_t clock_id, struct timespec *ts)

and then it calls the
k_uptime_ticks()
in zephyr/kernel/timeout.c

int64_t z_impl_k_uptime_ticks(void)
{
	return sys_clock_tick_get();
}

which is

int64_t sys_clock_tick_get(void)
{
	uint64_t t = 0U;

	LOCKED(&timeout_lock) {
		t = curr_tick + elapsed();
	}
	return t;
}

so the actually t is the set curr_tick plus elapsed(), which is

static int32_t elapsed(void)
{
	return announce_remaining == 0 ? sys_clock_elapsed() : 0U;
}

then calls the the cortex_systick driver cortex_m_systick.c

uint32_t sys_clock_elapsed(void)
{
	if (!TICKLESS) {
		return 0;
	}

	k_spinlock_key_t key = k_spin_lock(&lock);
	uint32_t cyc = elapsed() + cycle_count - announced_cycles;

	k_spin_unlock(&lock, key);
	return cyc / CYC_PER_TICK;
}

whic is return 0 if not enable TICKLESS, and this is what we expected, but in tickless mode the return value will be larger than we expected, and in our case it happens to be the same as we set to UINT64_MAX, see below

Running TESTSUITE posix_apis
===================================================================
START - test_clock_gettime_rollover
CONFIG_SYS_CLOCK_TICKS_PER_SEC: 10000
rollover_s: 1844674407370955
t-1: {1844674407370955, 161500000}
t+0: {1844674407370955, 161500000}

    Assertion failed at WEST_TOPDIR/zephyr/tests/posix/common/src/clock.c:178: posix_apis_test_clock_gettime_rollover: ts_gt(&ts[1], &ts[0]) is false

 FAIL - test_clock_gettime_rollover in 0.027 seconds

so the assert error is because the UINT64_MAX - 1 , and UINT64_MAX will get the same value back when calls clock_gettime.

#define CYC_PER_TICK (sys_clock_hw_cycles_per_sec()	\
		      / CONFIG_SYS_CLOCK_TICKS_PER_SEC)

while in rt11xx series the value is 996000000/10000,
and the

uint32_t cyc = elapsed() + cycle_count - announced_cycles;
return cyc / CYC_PER_TICK;

will be a key here, but I think this would be a precise issue in systick driver when process the elasped in tickless mode

Copy link
Collaborator Author

@hakehuang hakehuang Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a simple proof here is to change the CONFIG_SYS_CLOCK_TICKS_PER_SEC to a smaller number say 2000. @cfriedt

/* If we are leaking descriptors, then this loop will never complete */
for (size_t i = 0; i < CONFIG_MAX_PTHREAD_COUNT * 2; ++i) {
zassert_ok(pthread_create(&pthread1, &attr, create_thread1, NULL),
"unable to create thread %zu", i);
zassert_ok(pthread_join(pthread1, &unused), "unable to join thread %zu", i);
usleep(USEC_PER_MSEC * 500U);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be necessary. Adding arbitrary delays is a slippery-slope way to work around synchronization issues. I would prefer to see a root cause first.

Copy link
Collaborator Author

@hakehuang hakehuang Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is never found failure if we do not switch to none tickless mode, and I find it may related to thread protection.
by default the CONFIG_MAX_PTHREAD_MUTEX_COUNT=5, CONFIG_MAX_PTHREAD_COND_COUNT=5,
and pthread_join will depend on those pthread_mutex and pthread_cond, and in the test, it seems quickly exhaust them. so the test case seems need add some idle tiem to release those resources. @cfriedt

Copy link
Member

@cfriedt cfriedt Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not exhaust them though, so there would seem to be some kind of synchronization problem. Have you tried with something like CONFIG_SPIN_VALIDATE=y? Enabling debug logging at the kernel level?

Copy link
Member

@dleach02 dleach02 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, had to read up on posix threads... The assumption is that the pthread_join() doesn't return until the thread is terminated. But posix is also relying on the underlying RTOS to clean up the thread. Do we have a case here where the pthread_join() is waiting on the pthread->state_cond variable as the indicator that the thread has exited... which it did, but the kernel hasn't yet cleaned up the thread?

pthread_exit() at the does the pthread_cond_broadcast() on the state change of the state_cond variable then at the end of the routine it destroys and does the k_thread_abort(). Does the pthread_cond_broadcast() allow the scheduler to redo who is running and thus allows the test thread to run again? Is there a change that can be done to ensure that pthread_join() doesn't actually return until the zephyr kernel has cleaned up the thread?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, had to read up on posix threads... The assumption is that the pthread_join() doesn't return until the thread is terminated. But posix is also relying on the underlying RTOS to clean up the thread. Do we have a case here where the pthread_join() is waiting on the pthread->state_cond variable as the indicator that the thread has exited... which it did, but the kernel hasn't yet cleaned up the thread?

Quite possible.

pthread_exit() at the does the pthread_cond_broadcast() on the state change of the state_cond variable then at the end of the routine it destroys and does the k_thread_abort(). Does the pthread_cond_broadcast() allow the scheduler to redo who is running and thus allows the test thread to run again? Is there a change that can be done to ensure that pthread_join() doesn't actually return until the zephyr kernel has cleaned up the thread?

To me that sounds like change that is necessary here. If it can be isolated / reproduced reliably, which @hakehuang has done, and then we have a change such that the repro recipe results in a successful test, I think that would be sufficient to close the bug.

I suspect this issue has likely been present for A Long Time (TM). I doubt that it's a regression. I think that this new test just uncovered something that has always been there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True - I can't remember the exact name for that but inserting a transitional or "fake" state to simplify a model is usually a good solution.

I wish I could reproduce this on a board with a debugger. Can you recommend an entry level NXP board available that I could use to do that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfriedt , this issue only happens on qemu and renode simualtor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the issue with another repro command?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

below steps will recreate this issue

west build -b qemu_cortex_r5
west build -t run 

The problem is that if I use the west build -t debugserver, and then use gdb to connect it halts untill I quit gdb.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a new tickless configuration to check this issue

portability.posix.common.tls.newlib:
platform_exclude: nsim_sem_mpu_stack_guard ehl_crb lpcxpresso55s06
filter: TOOLCHAIN_HAS_NEWLIB == 1 and CONFIG_ARCH_HAS_THREAD_LOCAL_STORAGE and
CONFIG_TOOLCHAIN_SUPPORTS_THREAD_LOCAL_STORAGE
extra_configs:
- CONFIG_NEWLIB_LIBC=y
- CONFIG_THREAD_LOCAL_STORAGE=y
- CONFIG_MAIN_STACK_SIZE=1152
- CONFIG_MAIN_STACK_SIZE=4096
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I appreciate that the main stack needed to be increased, this seems like a really large increase? Do we have any insight on what the size really needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - it would be good to know exactly how much space is required, and on what platforms.

Copy link
Collaborator Author

@hakehuang hakehuang Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after disable the tickless mode there seems no need to add the stack_size for rt11 series

reduce the ticks per sec such the the timer accuracy
   meet test requirement

fixing: #53605

Signed-off-by: Hake Huang <[email protected]>
@hakehuang hakehuang requested review from cfriedt and dleach02 March 22, 2023 11:25
@cfriedt
Copy link
Member

cfriedt commented Mar 23, 2023

@hakehuang - I'm trying to reproduce this with qemu_cortex_r5 as you mentioned, but I can't seem to reproduce it with any of these

twister -i -p qemu_cortex_r5 -T tests/posix/common/
west build -p auto -b qemu_cortex_r5 -t run tests/posix/common/

Is there a better command to use?

@hakehuang
Copy link
Collaborator Author

@hakehuang - I'm trying to reproduce this with qemu_cortex_r5 as you mentioned, but I can't seem to reproduce it with any of these

twister -i -p qemu_cortex_r5 -T tests/posix/common/
west build -p auto -b qemu_cortex_r5 -t run tests/posix/common/

Is there a better command to use?

you need disable the tickless mode, for latest code, I change to reduce the tick_pre_second, so if you want reproduce this issue, you need add below to prj.conf

CONFIG_TICKLESS_KERNEL=n

in my latest code change, the solution avoid such senario, and I plan to create another PR which will disable the tickless mode which will expose this issue, and I can test against your pthread code change before I create that PR.

@cfriedt
Copy link
Member

cfriedt commented Mar 23, 2023

@hakehuang - it's possible that you might need to create a separate issue related to tickless systems. I don't see this as being directly related to the linked issue, 53605

@hakehuang
Copy link
Collaborator Author

hakehuang commented Mar 23, 2023

@hakehuang - it's possible that you might need to create a separate issue related to tickless systems. I don't see this as being directly related to the linked issue, 53605

@cfriedt , this is not related to the fix, but a side effect found if I use none tickless mode, I will report a new ticket for this finging, in short to reproduct thi sissue

  1. add the CONFIG_TICKLESS_KERNEL=n in prj.conf
  2. build posix_common as below
west build -b qemu_cortex_r5 -- -DCONFIG_NEWLIB_LIBC=y
west build -t run

issue reported #56163

@hakehuang
Copy link
Collaborator Author

@cfriedt any comments on this PR's fix? as the tickless issue is reported in #56163 and not related to this fix

@@ -8,5 +8,6 @@ CONFIG_POSIX_MQUEUE=y
CONFIG_HEAP_MEM_POOL_SIZE=4096
CONFIG_MAX_THREAD_BYTES=4
CONFIG_THREAD_NAME=y
CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest not adding this here, but maybe adding the tickless option in testcase.yaml as another test configuration. Technically, you could also add newlib to that same configuration and if necessary, adjust allowed platforms in that configuration.

Copy link
Collaborator Author

@hakehuang hakehuang Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfriedt CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000, is to fix this issue #53605, and if we add disable tickless option and we will find the issue #56163.

I will add a new test config with disable tickless, once the fix is ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hakehuang - I believe adding CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000 to the test project is more of a workaround than a solution. I will try and adjust the testcase so that it passes for the nordic boards mentioned in 53605 and will work with you to ensure that 56163 works as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfriedt , just a reminder, in the cortex_m_systick.c

uint32_t cyc = elapsed() + cycle_count - announced_cycles;
return cyc / CYC_PER_TICK;

so if the platform runs too fast, the default CYC_PER_TICK is 10000, which means the two timer read will return the same value.

@hakehuang hakehuang closed this Apr 1, 2023
@hakehuang
Copy link
Collaborator Author

the test code owner revert this test_clock_getting_rollover case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: POSIX POSIX API Library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants