tests/kernel/fifo/fifo_timeout fails on nrf51_pca10028 and nrf52_pca10040 #8159
Most likely a testcase issue?
Looks like a testcase ordering issue. After moving test_timeout_threads_pend_on_dual_fifos to the end, the test passes. Doing further debugging to root-cause the issue.
@punitvara can you re-test with PR #8249?
@pizi-nordic You suspected right that it could be related to the timer, because the only platform-dependent function is k_cycle_get_32. However, even after your suggestion, this test case still failed after applying both PR #8249 and #8259.
Even after adding a single printk, the error is not reproducible.
@pizi-nordic Can you please check whether a timer-related issue still persists on nRF, given that even adding a printk makes the problem go away?
I am still working on the timing issues. I can look at this problem as soon as all the timing issues I see now are resolved.
The test is pretty simple:
I found 2 different cases in which this test fails: Case 1:
The test fails because it expects the notification from thread B before the notification from thread A (the scheduler wakes up the threads in a different order than the one expected by the test). However, the system behaviour looks correct (the threads are woken up after the specified timeout). @nashif: You know more about the Zephyr scheduler: Is the thread wakeup order defined if multiple threads time out at the same time? (If yes, we could have a bug in the scheduler; otherwise the test should be updated.) Case 2:
In this case, the test is disrupted by a gap in execution lasting dozens of milliseconds. Such a gap is easily observed when the UART console is used and almost disappears when RTT is used, so I think it is related to printk() handling. I am going to investigate this scenario a bit.
It looks like Case 2 was introduced by my debug infrastructure, due to #8763.
@punitvara is this issue fixed?
@ManojSubbarao No. I am working on ADC consolidation; I will look into it whenever I get some time.
@rgundi Please look into this issue.
Working on this. Looks like this is happening because the nRF boards are pretty slow, which means even a few instructions end up adding milliseconds. Will explore more and update shortly.
@rgundi any progress?
This seems to be a weird issue. I verified everything said by @punitvara and @pizi-nordic to be true. I did some more experiments and saw that the issue goes away even with minor modifications.
I am tempted to declare this a test-related issue; do you agree?
No, not yet. I'll probably need one more day to come to some kind of decision. I see that all the threads are properly populated in the timeout queue just before the issue is seen (the last passing case). However, one of the threads somehow appears to get bumped off the timeout queue without being serviced. I'll know more tomorrow.
Looks like there's some issue when delta_ticks_from_prev is 0 (i.e. when at least 2 timeouts expire on the same system clock tick). Further debugging in progress.
Finally I am able to understand the behaviour. @pizi-nordic: This is about the Zephyr scheduler, as you rightly speculated. There are 2 functions at play here.
In this particular test case, thread 2 and thread 3 time out on the same system clock tick, but thread 3 gets processed earlier because it was added to the timeout queue before thread 2. The order of servicing can be ascertained by putting a breakpoint in the function test_thread_pend_and_timeout just after the k_fifo_get call and printing the _kernel.current value there successively. Before this, the entire timeout_q can be printed for comparison (with delta_ticks_from_prev and the thread id for each thread). Below are the gdb commands which do that:
(gdb) p ((struct _timeout *)(_kernel.timeout_q.head))->thread
(gdb) p ((struct _timeout *)(_kernel.timeout_q.head->next))->thread
@nashif: Since this is known kernel behaviour and there is nothing wrong with it, I propose we classify this as a test case issue. The test case can simply be fixed by doubling the specified timeouts (i.e. 10 ms becomes 20 ms, 20 ms becomes 40 ms, and so on). Let me know if you agree with this modification.
That sounds right to me. No: in general there is no guarantee of wakeup order when multiple threads are woken up on the same tick. While I think timeout handling is uniformly done with a simple dlist that never reorders, things like wait_qs can be more complicated when iterated over. So in a situation like this, where "ordered" timeouts alias into a single tick, you can get this behavior. I'd consider that a test bug -- it should be validating that the timeout values differ by (at least) more than one full tick as defined by CONFIG_SYS_CLOCK_TICKS_PER_SEC. (Even then it's not foolproof if something else loads the system or locks interrupts and causes a tick to be handled late, but the test should be able to guard against that too.)
And for reference: NRF5x timer handling does seem to produce surprises. There's a similar thread ordering bug (that I haven't tried to dig into yet) reported against the EDF test on NRF5 when combined with CONFIG_BT: #9843 |
This wasn't speculation :). |
There is no guarantee of wake-up order when multiple threads are woken up on the same tick. Hence, modified the tests accordingly. Fixes zephyrproject-rtos#8159. Signed-off-by: Rajavardhan Gundi <[email protected]>
Looks like this test has been failing on this platform since it was introduced.
This test also fails with "prj_poll.conf" (i.e. CONFIG_POLL=y) at the latest commit: 5b8e4ae
This failure (with CONFIG_POLL=y) was not seen in earlier commits.