Skip to content

tests/kernel/fifo/fifo_timeout fails on nrf51_pca10028 and nrf52_pca10040 #8159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nashif opened this issue Jun 4, 2018 · 25 comments
Closed
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug platform: nRF Nordic nRFx priority: medium Medium impact/importance bug
Milestone

Comments

@nashif
Copy link
Member

nashif commented Jun 4, 2018

Looks like this bug has been failing on this platform since the test was introduced

***** delaying boot 1000ms (per build configuration) *****
***** Booting Zephyr OS 1.12.0-rc2 (delayed boot 1000ms) *****
Running test suite test_fifo_timeout
===================================================================
starting test - test_timeout_empty_fifo
PASS - test_timeout_empty_fifo
===================================================================
starting test - test_timeout_non_empty_fifo
PASS - test_timeout_non_empty_fifo
===================================================================
starting test - test_timeout_fifo_thread
PASS - test_timeout_fifo_thread
===================================================================
starting test - test_timeout_threads_pend_on_fifo
thread (q order: 2, t/o: 0, fifo 0x20000000)
thread (q order: 3, t/o: 10, fifo 0x20000000)
thread (q order: 0, t/o: 20, fifo 0x20000000)
thread (q order: 4, t/o: 30, fifo 0x20000000)
thread (q order: 1, t/o: 40, fifo 0x20000000)
PASS - test_timeout_threads_pend_on_fifo
===================================================================
starting test - test_timeout_threads_pend_on_dual_fifos
thread (q order: 0, t/o: 0, fifo 0x20000010)
thread (q order: 5, t/o: 10, fifo 0x20000000)
FAIL - test_multiple_threads_pending@156. *** thread 3 woke up, expected 2

 Assertion failed at /home/jenkins/workspace/zephyr-master-tcf-v0.11-branch/LABEL/verify/SHARD/3-3/ZEPHYR_GCC_VARIANT/zephyr/zephyr.git/tests/kernel/fifo/fifo_timeout/src/main.c:386: test_timeout_threads_pend_on_dual_fifos: rv not equal to TC_PASS
FAIL - test_timeout_threads_pend_on_dual_fifos
===================================================================
starting test - test_timeout_threads_pend_fail_on_fifo
FAIL - test_multiple_threads_get_data@207. *** thread 7 woke up, expected 0

 Assertion failed at /home/jenkins/workspace/zephyr-master-tcf-v0.11-branch/LABEL/verify/SHARD/3-3/ZEPHYR_GCC_VARIANT/zephyr/zephyr.git/tests/kernel/fifo/fifo_timeout/src/main.c:401: test_timeout_threads_pend_fail_on_fifo: rv not equal to TC_PASS
FAIL - test_timeout_threads_pend_fail_on_fifo
===================================================================
===================================================================
RunID: ci-180601-1931-1716:k8if
PROJECT EXECUTION FAILED

 Assertion failed at /home/jenkins/workspace/zephyr-master-tcf-v0.11-branch/LABEL/verify/SHARD/3-3/ZEPHYR_GCC_VARIANT/zephyr/zephyr.git/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false


 Assertion failed at /home/jenkins/workspace/zephyr-master-tcf-v0.11-branch/LABEL/verify/SHARD/3-3/ZEPHYR_GCC_VARIANT/zephyr/zephyr.git/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false


 Assertion failed at /home/jenkins/workspace/zephyr-master-tcf-v0.11-branch/LABEL/verify/SHARD/3-3/ZEPHYR_GCC_VARIANT/zephyr/zephyr.git/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false

This test also fails with "prj_poll.conf" that is CONFIG_POLL=y in the latest commit: 5b8e4ae

***** delaying boot 1000ms (per build configuration) *****
***** Booting Zephyr OS v1.12.0-831-g5b8e4ae (delayed boot 1000ms) *****
Running test suite test_fifo_timeout
===================================================================
starting test - test_timeout_empty_fifo
PASS - test_timeout_empty_fifo
===================================================================
starting test - test_timeout_non_empty_fifo
PASS - test_timeout_non_empty_fifo
===================================================================
starting test - test_timeout_fifo_thread
PASS - test_timeout_fifo_thread
===================================================================
starting test - test_timeout_threads_pend_on_fifo
 thread (q order: 2, t/o: 0, fifo 0x20000000)
 thread (q order: 3, t/o: 10, fifo 0x20000000)
 thread (q order: 0, t/o: 20, fifo 0x20000000)
 thread (q order: 4, t/o: 30, fifo 0x20000000)
 thread (q order: 1, t/o: 40, fifo 0x20000000)
PASS - test_timeout_threads_pend_on_fifo
===================================================================
starting test - test_timeout_threads_pend_on_dual_fifos
 thread (q order: 0, t/o: 0, fifo 0x20000010)
 thread (q order: 5, t/o: 10, fifo 0x20000000)
FAIL - test_multiple_threads_pending@156.  *** thread 3 woke up, expected 2

    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:396: test_timeout_threads_pend_on_dual_fifos: rv not equal to TC_PASS

FAIL - test_timeout_threads_pend_on_dual_fifos
===================================================================
starting test - test_timeout_threads_pend_fail_on_fifo
FAIL - test_multiple_threads_get_data@207.  *** thread 7 woke up, expected 0

    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:411: test_timeout_threads_pend_fail_on_fifo: rv not equal to TC_PASS


    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:129: test_thread_pend_and_timeout: packet == NULL is false

FAIL - test_timeout_threads_pend_fail_on_fifo
===================================================================
===================================================================
RunID: :wjv2
PROJECT EXECUTION FAILED

    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false


    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false


    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false


    Assertion failed at /home/pswarnak/workspace/1.12_execution/zephyr/tests/kernel/fifo/fifo_timeout/src/main.c:171: test_thread_pend_and_get_data: packet != NULL is false

This failure(with CONFIG_POLL=y) was not seen in earlier commits.

@nashif nashif added bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug labels Jun 4, 2018
@nashif
Copy link
Member Author

nashif commented Jun 4, 2018

most likely a testcase issue?

@punitvara
Copy link
Contributor

Looks like testcase ordering issue. After moving test_timeout_threads_pend_on_dual_fifos to the end test is passing. Doing further debugging to root cause the issue.

@nashif nashif added the In progress For PRs: is work in progress and should not be merged yet. For issues: Is being worked on label Jun 7, 2018
@pizi-nordic
Copy link
Collaborator

Applying #8249 together with #8259 might solve this problem if it was caused by timer issues.

@ramakrishnapallala
Copy link

@punitvara can you re-test with this PR #8249

@punitvara
Copy link
Contributor

@pizi-nordic You suspect it right that it could be related to timer because only platform dependent function is k_cycle_get_32. However, even after your suggest, this test case failed after applying both PR #8249 and #8259

@punitvara
Copy link
Contributor

Even after adding one printk error is not reproducible.

@carlescufi carlescufi added the platform: nRF Nordic nRFx label Jun 11, 2018
@punitvara
Copy link
Contributor

@pizi-nordic Can you please check whether still timer related issue persist for nrf because even adding printk solving this problem ?

@pizi-nordic
Copy link
Collaborator

pizi-nordic commented Jun 14, 2018

I am still working on the timing issues. I can try to look at this problem as soon as all timing issues I see now will be resolved.

@pizi-nordic
Copy link
Collaborator

pizi-nordic commented Jul 5, 2018

The test is pretty simple:

  1. Create N threads,
  2. In each thread, do k_fifo_get(<empty-fifo>) with different timeout, then notify main thread.
  3. Check if notifications from threads come in correct order (should be in order specified by timeout values).

I found 2 different cases when this test fails:

Case 1:

- Thread A is created.
- Thread B is created.
- Thread A starts and calls k_fifo_get() with 20ms timeout (2+1 ticks)
--- TICK ----
- Thread B starts and calls k_fifo_get() with 10 ms timeout (1+1 ticks)
--- TICK ---
--- TICK ---
- Thread A wakes up and pushes notification (correct, slept 3 ticks).
- Thread B wakes up and pushes notification (correct, slept 2 ticks).

Test fails because it expects notification from thread B before notification from A (scheduler wakes up threads in different order than the one expected by the test). However the system behaviour looks correct (threads are woken up after specified timeout).

@nashif: You know more about Zephyr scheduler: Is the thread wakeup order defined if multiple threads timeouts an the same time? (if yes, we could have bug in the scheduler, otherwise the test should be updated).

Case 2:

- Thread A is created.
- Thread B is created.
- Thread A starts and calls k_fifo_get() with 20ms timeout (2+1 ticks)
--- SYSTEM BLOCKED FOR 100-200ms ---
- Thread A wakes up and pushes notification (correct, slept much more than 3 ticks).
- Thread B starts and calls k_fifo_get() with 10 ms timeout (1+1 ticks)
- Thread B wakes up and pushes notification (correct, slept 2 ticks).

In this case, the test is disrupted by gap in execution lasting dozens of ms. Such gap is easily observed, when UART console is used and almost disappears when RTT is used, so I think that it is related to printk() handling. I going to investigate this scenario a bit.

@pizi-nordic
Copy link
Collaborator

It looks that Case 2 was introduced by my debug infrastructure due to #8763.

@ManojSubbarao
Copy link

@punitvara is this issue fixed?

@punitvara
Copy link
Contributor

punitvara commented Jul 10, 2018

@ManojSubbarao No. While I am working on ADC consolidation, will look into it whenever I get sometime

@nashif nashif added this to the v1.13.0 milestone Aug 26, 2018
@ManojSubbarao
Copy link

@rgundi Please look into this issue.

@carlescufi carlescufi changed the title tests/kernel/fifo/fifo_timeout fails on nrf51_pca10028 tests/kernel/fifo/fifo_timeout fails on nrf51_pca10028 and nrf52_pca10040 Aug 31, 2018
@rgundi
Copy link
Contributor

rgundi commented Sep 3, 2018

Working on this. Looks like this is happening because the nrf boards are pretty slow which means even few instructions will end up adding ms. Will explore more and update shortly.

@nashif
Copy link
Member Author

nashif commented Sep 4, 2018

@rgundi any progress?

@rgundi
Copy link
Contributor

rgundi commented Sep 5, 2018

This seems to be a weird issue. Verified everything that was said by @punitvara and @pizi-nordic to be true. Did some more experiments and saw that the issue goes away even on minor modifications.
@pizi-nordic : Regarding your question about the zephyr scheduler, please refer _add_timeout function in timeout_q.h. If there's a new timeout expiring on the same system clock tick as other timeouts already present in the _timeout_q, it is "prepended" to these timeouts. So, we should have seen this working properly if we were hitting that case.
@nashif : This is one of those timing issues. This is so weird that the issue will not be seen if "boot_delay" is disabled. Also, this issue goes away if we double the timeout values for the fifo (which sounds logical). I will continue to debug this and keep you posted.

@nashif
Copy link
Member Author

nashif commented Sep 6, 2018

I am tempted to declare this as a test related issue, do you agree?

@rgundi
Copy link
Contributor

rgundi commented Sep 6, 2018

No. Not yet. I'll probably need one more day's time to come to some kind of a decision. I see that all the threads are properly populated in the timeout queue just before the issue is seen (last pass case). However, somehow it appears one of the threads is strangely getting bumped off the timeout queue without getting serviced. I'll know more tomorrow.

@rgundi
Copy link
Contributor

rgundi commented Sep 9, 2018

Looks like there's some issue when "delta_ticks_from_prev" is 0 (i.e. when at least 2 timeouts are expiring on the same system clock tick). Further debug in progress.

@rgundi
Copy link
Contributor

rgundi commented Sep 11, 2018

Finally I am able to understand the behaviour. @pizi-nordic: This is about the zephyr scheduler as you rightly speculated. There are 2 functions at play here.

  1. _add_timeout function in timeout_q.h - If there's a new timeout expiring on the same system clock tick as other timeouts already present in the _timeout_q, it is "prepended" to these timeouts. So, if thread A and thread B are timing out on the same system clock tick and if thread B is already in the _timeout_q, thread A will be "prepended" to the q.
  2. handle_timeouts in sys_clock.c – This function handles timeouts by dequeuing the expired ones from _timeout_q and queuing them on a local queue. In this local queue, the order of queuing in _timeout_q for the threads timing out on the same system clock tick is reversed. Hence, effectively, they end up being processed in the same order they were added, time-wise. This means thread B will be serviced first followed by thread A.

In this particular test case, thread 2 and thread 3 timeout on the same system clock tick but thread 3 gets processed earlier as it is added prior to thread 2. The order of servicing can be ascertained by putting a breakpoint in the function test_thread_pend_and_timeout just after the k_fifo_get function call and printing the _kernel.current value there successively. Before this, the entire timeout_q can be printed for comparison (with delta_from_prev_tick and thread_id for each threads). Below is the gdb command which does that.

(gdb) p ((struct _timeout *)(_kernel.timeout_q.head))->thread
$1 = (struct k_thread *) 0x20000290 <ttdata+624>
(gdb) p ((struct _timeout *)(_kernel.timeout_q.head))->delta_ticks_from_prev
$2 = 1

(gdb) p ((struct _timeout *)(_kernel.timeout_q.head->next))->thread
$3 = (struct k_thread *) 0x20000020
(gdb) p ((struct _timeout *)(_kernel.timeout_q.head->next))->delta_ticks_from_prev
$4 = 0

@rgundi
Copy link
Contributor

rgundi commented Sep 11, 2018

@nashif : Since this is a known behavior of the kernel and since there's nothing wrong with this, I propose we classify this as a test case issue. The test case can simply be fixed by doubling the timeouts specified (i.e. 10ms should become 20ms, 20ms should become 40ms and so on). Let me know if you agree with this modification.

@andyross
Copy link
Collaborator

That sounds right to me. No: in general there is no guarantee of wakeup order when multiple threads are woken up on the same tick. While I think timeout handling is uniformly done with a simple dlist that never reorders, things like wait_q's can be more complicated when iterated over.

So if you have a situation like this where "ordered" timeouts are aliasing into a single tick, you can get this behavior. I'd consider that a test bug -- it should be validating that (at least) the timeout values differ by more than one full tick as defined by CONFIG_SYS_CLOCK_TICKS_PER_SEC. (Even then it's not foolproof if something else loads the system or locks interrupts to cause a tick to be handled late, but the test should be able to guarantee that too)

@andyross
Copy link
Collaborator

And for reference: NRF5x timer handling does seem to produce surprises. There's a similar thread ordering bug (that I haven't tried to dig into yet) reported against the EDF test on NRF5 when combined with CONFIG_BT: #9843

@pizi-nordic
Copy link
Collaborator

Finally I am able to understand the behaviour. @pizi-nordic: This is about the zephyr scheduler as you rightly speculated. There are 2 functions at play here. . There are (...)

This wasn't speculation :).
Thank you for digging into this problem. Great job!

@rgundi
Copy link
Contributor

rgundi commented Sep 18, 2018

@andyross : Please review the PR #10047

rgundi added a commit to rgundi/zephyr that referenced this issue Oct 1, 2018
There is no guarantee of wake-up order when multiple threads
are woken up on the same tick. Hence, modified the tests
accordingly.

Fixes zephyrproject-rtos#8159.

Signed-off-by: Rajavardhan Gundi <[email protected]>
@ghost ghost removed the In progress For PRs: is work in progress and should not be merged yet. For issues: Is being worked on label Oct 2, 2018
nashif pushed a commit that referenced this issue Oct 2, 2018
There is no guarantee of wake-up order when multiple threads
are woken up on the same tick. Hence, modified the tests
accordingly.

Fixes #8159.

Signed-off-by: Rajavardhan Gundi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug platform: nRF Nordic nRFx priority: medium Medium impact/importance bug
Projects
None yet
Development

No branches or pull requests

8 participants