-
Notifications
You must be signed in to change notification settings - Fork 7.4k
test_posix_common: fix test_clock_gettime_rollover failure #55208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_posix_common: fix test_clock_gettime_rollover failure #55208
Conversation
@cfriedt , I think disable the tickless by
if I disable the test_pthread_descriptor_leak, all other cases will pass. and if I reduce the thread cound can also helps for some platforms, but the qemu_cortex_r5 seems keeps failing. and there are also failures like
so for the test_clock_gettime_rollovers issue, I think the root cause is the tickless settings, but resolve this issue cause another problem in kernel, the strange thing here is I can not reproduce it on my RT1160/RT1170 board, looks like the real time respones also matters here, what do you think? if you agree this could be a bug, I can submit another ticket for this. Thanks |
@hakehuang - Have you step-debugged the pthread issue? It's kind of vague otherwise. The test itself should definitely work as-is. There may be a race condition with the scheduler as well which is not part of the posix library. |
I think it is not test issue, but a problem in simulator platform with tickless disabled, I will try to further debug it. |
@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon |
1 similar comment
@cfriedt I add a delay in the test_pthread_descriptor_leak after pthread_join looks like there is a race conditon |
Interesting. I wonder if it's in the pthread part or the scheduler part 🤔 |
I think it is in the schedule part as the pthread_join applys mutex, which may introduce schedule issue, maybe we use spin lock be better |
|
this could be a request to posix lib? shall I raise a ticket? and for this PR can you aprove it? as this is mainly to resolve the test_clock_gettime_rollover issue and workaround for pthread_join race condition. |
tests/posix/common/prj.conf
Outdated
@@ -8,5 +8,6 @@ CONFIG_POSIX_MQUEUE=y | |||
CONFIG_HEAP_MEM_POOL_SIZE=4096 | |||
CONFIG_MAX_THREAD_BYTES=4 | |||
CONFIG_THREAD_NAME=y | |||
CONFIG_TICKLESS_KERNEL=n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense to add this in prj.conf
. IIRC, the kernel should operate with similar levels of reliability in both tick-full or tickless mode.
Assuming that this worked correctly in the past, it would be great to see a git bisect to help find a root cause here. Currently, we are only speculating where the issue is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test_clock_gettime_rollover is failed at NXP RT11xx platfrom since the first merge into main branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfriedt from the test process we can see that it call the
sys_clock_tick_set which is mapped to zephyr/kernel/timeout.c
#ifdef CONFIG_ZTEST
void z_impl_sys_clock_tick_set(uint64_t tick)
{
curr_tick = tick;
}
we expect the curr_tick is set to given value.
and then as call clock_gettime() to get the value back, which in our RT platform it is calculated as
zephyr/lib/posix/clock.c
int z_impl_clock_gettime(clockid_t clock_id, struct timespec *ts)
and then it calls the
k_uptime_ticks()
in zephyr/kernel/timeout.c
int64_t z_impl_k_uptime_ticks(void)
{
return sys_clock_tick_get();
}
which is
int64_t sys_clock_tick_get(void)
{
uint64_t t = 0U;
LOCKED(&timeout_lock) {
t = curr_tick + elapsed();
}
return t;
}
so the actually t is the set curr_tick plus elapsed(), which is
static int32_t elapsed(void)
{
return announce_remaining == 0 ? sys_clock_elapsed() : 0U;
}
then calls the the cortex_systick driver cortex_m_systick.c
uint32_t sys_clock_elapsed(void)
{
if (!TICKLESS) {
return 0;
}
k_spinlock_key_t key = k_spin_lock(&lock);
uint32_t cyc = elapsed() + cycle_count - announced_cycles;
k_spin_unlock(&lock, key);
return cyc / CYC_PER_TICK;
}
whic is return 0 if not enable TICKLESS, and this is what we expected, but in tickless mode the return value will be larger than we expected, and in our case it happens to be the same as we set to UINT64_MAX, see below
Running TESTSUITE posix_apis
===================================================================
START - test_clock_gettime_rollover
CONFIG_SYS_CLOCK_TICKS_PER_SEC: 10000
rollover_s: 1844674407370955
t-1: {1844674407370955, 161500000}
t+0: {1844674407370955, 161500000}
Assertion failed at WEST_TOPDIR/zephyr/tests/posix/common/src/clock.c:178: posix_apis_test_clock_gettime_rollover: ts_gt(&ts[1], &ts[0]) is false
FAIL - test_clock_gettime_rollover in 0.027 seconds
so the assert error is because the UINT64_MAX - 1 , and UINT64_MAX will get the same value back when calls clock_gettime.
#define CYC_PER_TICK (sys_clock_hw_cycles_per_sec() \
/ CONFIG_SYS_CLOCK_TICKS_PER_SEC)
while in rt11xx series the value is 996000000/10000,
and the
uint32_t cyc = elapsed() + cycle_count - announced_cycles;
return cyc / CYC_PER_TICK;
will be a key here, but I think this would be a precise issue in systick driver when process the elasped in tickless mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a simple proof here is to change the CONFIG_SYS_CLOCK_TICKS_PER_SEC to a smaller number say 2000. @cfriedt
tests/posix/common/src/pthread.c
Outdated
/* If we are leaking descriptors, then this loop will never complete */ | ||
for (size_t i = 0; i < CONFIG_MAX_PTHREAD_COUNT * 2; ++i) { | ||
zassert_ok(pthread_create(&pthread1, &attr, create_thread1, NULL), | ||
"unable to create thread %zu", i); | ||
zassert_ok(pthread_join(pthread1, &unused), "unable to join thread %zu", i); | ||
usleep(USEC_PER_MSEC * 500U); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be necessary. Adding arbitrary delays is a slippery-slope way to work around synchronization issues. I would prefer to see a root cause first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is never found failure if we do not switch to none tickless mode, and I find it may related to thread protection.
by default the CONFIG_MAX_PTHREAD_MUTEX_COUNT=5, CONFIG_MAX_PTHREAD_COND_COUNT=5,
and pthread_join will depend on those pthread_mutex and pthread_cond, and in the test, it seems quickly exhaust them. so the test case seems need add some idle tiem to release those resources. @cfriedt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not exhaust them though, so there would seem to be some kind of synchronization problem. Have you tried with something like CONFIG_SPIN_VALIDATE=y
? Enabling debug logging at the kernel level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, had to read up on posix threads... The assumption is that the pthread_join() doesn't return until the thread is terminated. But posix is also relying on the underlying RTOS to clean up the thread. Do we have a case here where the pthread_join() is waiting on the pthread->state_cond variable as the indicator that the thread has exited... which it did, but the kernel hasn't yet cleaned up the thread?
pthread_exit() at the does the pthread_cond_broadcast() on the state change of the state_cond variable then at the end of the routine it destroys and does the k_thread_abort(). Does the pthread_cond_broadcast() allow the scheduler to redo who is running and thus allows the test thread to run again? Is there a change that can be done to ensure that pthread_join() doesn't actually return until the zephyr kernel has cleaned up the thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, had to read up on posix threads... The assumption is that the pthread_join() doesn't return until the thread is terminated. But posix is also relying on the underlying RTOS to clean up the thread. Do we have a case here where the pthread_join() is waiting on the pthread->state_cond variable as the indicator that the thread has exited... which it did, but the kernel hasn't yet cleaned up the thread?
Quite possible.
pthread_exit() at the does the pthread_cond_broadcast() on the state change of the state_cond variable then at the end of the routine it destroys and does the k_thread_abort(). Does the pthread_cond_broadcast() allow the scheduler to redo who is running and thus allows the test thread to run again? Is there a change that can be done to ensure that pthread_join() doesn't actually return until the zephyr kernel has cleaned up the thread?
To me that sounds like change that is necessary here. If it can be isolated / reproduced reliably, which @hakehuang has done, and then we have a change such that the repro recipe results in a successful test, I think that would be sufficient to close the bug.
I suspect this issue has likely been present for A Long Time (TM). I doubt that it's a regression. I think that this new test just uncovered something that has always been there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True - I can't remember the exact name for that but inserting a transitional or "fake" state to simplify a model is usually a good solution.
I wish I could reproduce this on a board with a debugger. Can you recommend an entry level NXP board available that I could use to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfriedt , this issue only happens on qemu and renode simualtor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the issue with another repro command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
below steps will recreate this issue
west build -b qemu_cortex_r5
west build -t run
The problem is that if I use the west build -t debugserver
, and then use gdb to connect it halts untill I quit gdb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a new tickless configuration to check this issue
tests/posix/common/testcase.yaml
Outdated
portability.posix.common.tls.newlib: | ||
platform_exclude: nsim_sem_mpu_stack_guard ehl_crb lpcxpresso55s06 | ||
filter: TOOLCHAIN_HAS_NEWLIB == 1 and CONFIG_ARCH_HAS_THREAD_LOCAL_STORAGE and | ||
CONFIG_TOOLCHAIN_SUPPORTS_THREAD_LOCAL_STORAGE | ||
extra_configs: | ||
- CONFIG_NEWLIB_LIBC=y | ||
- CONFIG_THREAD_LOCAL_STORAGE=y | ||
- CONFIG_MAIN_STACK_SIZE=1152 | ||
- CONFIG_MAIN_STACK_SIZE=4096 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while I appreciate that the main stack needed to be increased, this seems like a really large increase? Do we have any insight on what the size really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - it would be good to know exactly how much space is required, and on what platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after disable the tickless mode there seems no need to add the stack_size for rt11 series
reduce the ticks per sec such the the timer accuracy meet test requirement fixing: #53605 Signed-off-by: Hake Huang <[email protected]>
@hakehuang - I'm trying to reproduce this with
Is there a better command to use? |
you need disable the tickless mode, for latest code, I change to reduce the tick_pre_second, so if you want reproduce this issue, you need add below to prj.conf
in my latest code change, the solution avoid such senario, and I plan to create another PR which will disable the tickless mode which will expose this issue, and I can test against your pthread code change before I create that PR. |
@hakehuang - it's possible that you might need to create a separate issue related to tickless systems. I don't see this as being directly related to the linked issue, 53605 |
@cfriedt , this is not related to the fix, but a side effect found if I use none tickless mode, I will report a new ticket for this finging, in short to reproduct thi sissue
issue reported #56163 |
@@ -8,5 +8,6 @@ CONFIG_POSIX_MQUEUE=y | |||
CONFIG_HEAP_MEM_POOL_SIZE=4096 | |||
CONFIG_MAX_THREAD_BYTES=4 | |||
CONFIG_THREAD_NAME=y | |||
CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest not adding this here, but maybe adding the tickless option in testcase.yaml
as another test configuration. Technically, you could also add newlib to that same configuration and if necessary, adjust allowed platforms in that configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hakehuang - I believe adding CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000
to the test project is more of a workaround than a solution. I will try and adjust the testcase so that it passes for the nordic boards mentioned in 53605 and will work with you to ensure that 56163 works as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfriedt , just a reminder, in the cortex_m_systick.c
uint32_t cyc = elapsed() + cycle_count - announced_cycles;
return cyc / CYC_PER_TICK;
so if the platform runs too fast, the default CYC_PER_TICK is 10000, which means the two timer read will return the same value.
the test code owner revert this test_clock_getting_rollover case |
change the CONFIG_SYS_CLOCK_TICKS_PER_SEC to a smaller number to increase the systick accurate and meet test requirement
fixing: #53605