
Timer fixes #12204


Merged: 5 commits merged into master on Jan 3, 2019

Conversation

andyross
Collaborator

Two bugs have been discovered since the timer rewrite: one produced incorrect expiration times for timeouts set from within timeout handlers when multiple timeouts were expiring in the same tick. The other was potential timeout-list corruption if a timeout handler (or, under SMP, simultaneous code racing against it) tries to remove a timeout (for example, by aborting a thread).

This doesn't fix the API confusion about "what is 'current time' to a timeout handler when the interrupt was delayed" detailed in #11722. I'm going to think a little more on that one and see if an API option makes sense, or at least update the docs.

Relevant to #11722, this works for me over 30 iterations of the timer_api test, so I'm reasonably confident the bug is fixed even if I was never able to reproduce it.

@andyross
Collaborator Author

(Apologies for the assignee spam -- I added folks as assignees instead of reviewers, sorry!)

@andyross
Collaborator Author

Oh, and note that the first patch (the list handling one) is also included in the x86_64 series. I cherry-picked it here because it's more relevant to timer work. I'll update whichever PR gets merged last to drop the duplicate; just be aware that if you try to pull both you'll get a collision.

@codecov-io

codecov-io commented Dec 20, 2018

Codecov Report

Merging #12204 into master will increase coverage by 0.02%.
The diff coverage is 93.1%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12204      +/-   ##
==========================================
+ Coverage    48.2%   48.22%   +0.02%     
==========================================
  Files         294      294              
  Lines       44192    44200       +8     
  Branches    10603    10607       +4     
==========================================
+ Hits        21301    21316      +15     
+ Misses      18624    18616       -8     
- Partials     4267     4268       +1
Impacted Files                            | Coverage Δ
----------------------------------------- | --------------------------
kernel/sched.c                            | 92.83% <ø> (ø) ⬆️
kernel/timeout.c                          | 86.11% <93.1%> (-0.56%) ⬇️
drivers/clock_control/nrf5_power_clock.c  | 50% <0%> (-2.11%) ⬇️
boards/posix/native_posix/timer_model.c   | 51.82% <0%> (+0.72%) ⬆️
boards/posix/nrf52_bsim/time_machine.c    | 52.45% <0%> (+4.91%) ⬆️
boards/posix/nrf52_bsim/argparse.c        | 54.28% <0%> (+12.85%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8ecd27...4c5d450.

if (next(t) != NULL) {
	next(t)->dticks += t->dticks;
}
if (t->node.next != NULL && t->node.prev != NULL) {
Collaborator

Does this problem exist generally, in that whenever a dnode is removed from a dlist that fact can't be determined from the dnode itself? If so, would it be better to extend the dlist API with something like:

/* mark that node is not in any list */
static inline void sys_dnode_unlink(sys_dnode_t *node)
{
   if (node) {
      node->next = NULL;
   }
}

/* check whether node is in a list */
static bool sys_dnode_is_linked(sys_dnode_t *node)
{
   return node && (node->next != NULL);
}

then implement as:

static void remove_timeout(struct _timeout *t)
{
   if (sys_dnode_is_linked(&t->node)) {
      if (next(t) != NULL) {
         next(t)->dticks += t->dticks;
      }
      sys_dlist_remove(&t->node);
      sys_dnode_unlink(&t->node);
      t->dticks = _INACTIVE;
   }
}

?

It seems this would make the intent more clear, and eliminate the manipulations of prev which I believe should be non-null if and only if next is non-null. It would also make the solution available to any other uses of sys_dlist_remove where the same issue might be relevant.

It might also remove the need to use _INACTIVE as a flag value.

Collaborator Author

I don't disagree, though I think there's an argument that having "unlink" and "remove" as distinct APIs is going to be confusing for readers. The dlist is a very low-level intrusive data structure; honestly, I think it's always going to be a little error-prone.

But I totally agree about the "_INACTIVE" state; it's used very inconsistently already and ought to be removed.

Collaborator

I'd considered clear instead of unlink but that didn't seem as...clear. If there's another wording that's better....

Alternatively, put the node->next = NULL assignment into remove, which would be safer anyway at a very minor overall cost. An ability to reliably check whether a node is already in a list before adding it to a list could be very helpful.
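For illustration, such a check could guard add sites like this (a minimal sketch: sys_dnode_is_linked() is the helper proposed above, sys_dlist_append() is the existing dlist API, and add_timeout_once() is a hypothetical wrapper):

/* Hypothetical wrapper: append the node only if it isn't already on a
 * list, using the sys_dnode_is_linked() check proposed above.
 */
static void add_timeout_once(sys_dlist_t *list, sys_dnode_t *node)
{
	if (!sys_dnode_is_linked(node)) {
		sys_dlist_append(list, node);
	}
}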

Collaborator

I generally support @pabigot's idea; however, I would like to add a small modification:
To be consistent with the way lists are implemented in Zephyr, I suggest following Linux (https://elixir.bootlin.com/linux/latest/ident/list_del_init) and marking a removed node by setting node->next (and node->prev) to the node itself.
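A rough sketch of that convention applied to Zephyr's sys_dnode_t (which exposes next/prev pointers, as used elsewhere in this PR); the helper names are illustrative, not an existing API:

/* Linux list_del_init()-style marking: a removed node points at itself,
 * so "not on any list" can be detected from the node alone.
 */
static inline void sys_dnode_mark_removed(sys_dnode_t *node)
{
	node->next = node;
	node->prev = node;
}

static inline bool sys_dnode_is_on_list(const sys_dnode_t *node)
{
	/* Assumes never-added nodes are zero-initialized. */
	return node->next != NULL && node->next != node;
}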

Collaborator

Researching what @pizi-nordic proposed: In Linux list_del_init removes the entry and initializes the entry as an empty list head. list_del removes the entry and sets the next and prev pointers to poisoned values (dereference causes a page fault that can be diagnosed). list_del is used about twice as often as list_del_init in Linux, but list_del_init is to be used when the removed entry may later be added to another list.

Although the structure of a doubly-linked list is the same in Linux and Zephyr, the pointer management is different, in both the operations supported and the order of assignment for specific operations like add/prepend. Zephyr uses distinct type names for list nodes and list heads, and they aren't initialized the same way, unlike Linux, which only uses list_head. So I'm not convinced that resetting a removed sys_dnode_t to appear like an empty sys_dlist_t should be preferred over initializing a removed node to have null pointers, which is presumably what a Zephyr sys_dnode_t that's never been added to a list would have (if it isn't left uninitialized).

In either case, after reviewing this I do think it's important that both next and prev be assigned to self or a poison/null value when the entry is removed, not just next as I originally proposed.

Collaborator Author

Can we submit this as an enhancement against the dlist code? It's not really timer-related, and obviously has code-size impact to consider: there are a lot of dlist users, and while many could probably benefit from being ported to this scheme, a naive change would do nothing but add a bunch of code bytes.

Collaborator

@andyross Do you mean you're OK with dropping this patch from this PR, or that you'd rather merge this change as-is rather than bike-shed it here?

Either is fine with me. I will file a PR to fix dlist, because as you've found the existing code is problematic and it should be fixed more generally. IMO there's too much "hack it now, we'll fix it properly later" in Zephyr, and later never comes.

Collaborator

If the problem is fixed globally, about 42 tests fail sanitycheck. That's after working around a bug in poll which double-removes events from a list (segfaults in many native_posix tests). Looks like several parts of Zephyr depend on the pointers to neighboring nodes remaining dereferenceable after removal from a list.

Probably best to merge your fix for now.

@pizi-nordic
Collaborator

Unfortunately the problem still exists on top of #12204.
The test failed 2 times during my test run (160 iterations).

===================================================================
starting test - test_timer_periodicity

    Assertion failed at /home/pizi/repos/zephyr/tests/kernel/timer/timer_api/src/main.c:240: test_timer_periodicity: (((delta) >= (50)) && ((delta) <= (50) + (1))) is false

FAIL - test_timer_periodicity

@pizi-nordic
Collaborator

#11722 still fails (#11722 (comment)) with this patch applied.

Collaborator

@pizi-nordic left a comment

The problems described in #11722 are still not resolved.


announce_remaining = ticks;

while (first() != NULL && first()->dticks <= announce_remaining) {
Collaborator

This still allows for immediate firing of timeouts scheduled by t->fn(t), breaking monotonicity of the timer.
(Please consider the case where announce_remaining is larger than the scheduled time.)

Collaborator Author

For like the ninth time, this is a feature, not a bug. It's not the source of the problem you're seeing, I promise.

I opened #12247 to try to explain this yet again. Please take the discussion about API changes there.

(Also: there's no violation of monotonicity -- timers always expire in the expected order. DELTA times between timeouts will and MUST be skewed in the face of lost interrupts; the argument is about which delta gets skewed.)

@andyross
Collaborator Author

Is that sub-2% frequency consistent with the original report? Are we sure we aren't talking about a different bug here? The issue fixed would have happened any time you had a double-event stacked in the same tick and more than one tick was being announced. That shouldn't have been stochastic in this way. I was originally led to believe this was a consistent problem.

@pizi-nordic, I can't tell if you're trying to -1 this patch because there's another unresolved bug or because you believe this shouldn't be merged. Can you clarify? These are real bugs and the fixes need to go in. Which patches specifically do you want to keep out of the tree?

Collaborator

@pabigot left a comment

I'm taking the dlist issue off the table; will have a PR for that. I have no dog in the desired-behavior fight. I do question whether this PR actually changes anything other than the double-remove issue. Please clarify so I can remove the nack.

Andy Ross added 4 commits January 2, 2019 11:37
This loop was structured badly, as a while(true) with multiple "exit
if" cases in the body.  It was bad enough that I genuinely fooled
myself into rewriting it, having convinced myself there was a bug in
it when there wasn't.

So keep the rewritten loop which expresses the iteration in a more
invariant way (i.e. "while we have an element to expire" and not "test
if we have to exit the loop").  Shorter and easier.  Also makes the
locking clearer as we can simply release the lock around the callback
in a natural/obvious way.

Signed-off-by: Andy Ross <[email protected]>
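For reference, a rough sketch of the loop shape this commit describes -- not the merged diff itself. first(), remove_timeout(), timeout_lock, curr_tick, and announce_remaining are taken from the snippets quoted in this PR or assumed here:

void z_clock_announce_sketch(s32_t ticks)
{
	k_spinlock_key_t key = k_spin_lock(&timeout_lock);

	announce_remaining = ticks;

	/* "While we have an element to expire" -- the invariant form. */
	while (first() != NULL && first()->dticks <= announce_remaining) {
		struct _timeout *t = first();
		int dt = t->dticks;

		curr_tick += dt;
		announce_remaining -= dt;
		t->dticks = 0;
		remove_timeout(t);

		/* Release the lock only around the user callback. */
		k_spin_unlock(&timeout_lock, key);
		t->fn(t);
		key = k_spin_lock(&timeout_lock);
	}

	/* ... account for leftover ticks, reprogram the driver ... */

	k_spin_unlock(&timeout_lock, key);
}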
The use of dticks == INACTIVE to tell whether or not a timeout was
already in the list was insufficient.  There is a time period between
the moment a timeout is removed from the list and the end of its
handler where it is not in the list, yet its list node pointers still
point into it.  Doing things like aborting a thread while that is true
(which can be asynchronous too!)  would corrupt the list even though
all the operations on it were "atomic".

Set the timeout node pointers to nulls atomically when removed, and
check for double-remove conditions (which, again, might be perfectly
OK).

Signed-off-by: Andy Ross <[email protected]>
The recent change that added a locked z_set_timeout_expiry() API
obsoleted the subtle note about synchronization above
reset_time_slice().  None of that matters any more, the API is
synchronized internally in a conventional way.

Signed-off-by: Andy Ross <[email protected]>
The logic in z_set_timeout_expiry() missed the case where the ticks
argument could be zero (or lower), which can happen naturally due to
timing/interrupt slop.  In those circumstances, it would still try to
reset a timer that was "about to expire at the next tick", which would
run afoul of the drivers' internal decisions about how soon a timer
interrupt could be set, and then get pushed out to the next tick.

Explicitly detect this as an "imminent" predicate to make the logic
clearer.

Signed-off-by: Andy Ross <[email protected]>
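Roughly, the check described above takes this shape (an illustration only, not the merged diff; next_timeout() and z_clock_set_timeout() are assumed from kernel/timeout.c and the timer driver API):

void z_set_timeout_expiry_sketch(s32_t ticks, bool idle)
{
	k_spinlock_key_t key = k_spin_lock(&timeout_lock);
	int next = next_timeout();

	/* If the next timeout is already due at the next tick, or the
	 * request itself is for <= 0 ticks, expiry is "imminent": the
	 * driver can't fire any sooner, so don't reprogram it.
	 */
	bool imminent = (next <= 1) || (ticks <= 0);

	if (!imminent && ticks < next) {
		z_clock_set_timeout(ticks, idle);
	}

	k_spin_unlock(&timeout_lock, key);
}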
@andyross
Collaborator Author

andyross commented Jan 2, 2019

New version contains a fix for an edge case that certainly could produce the behavior in #11722 (a delayed tick that would result in late callbacks in the same way a delayed interrupt would). Also a comment cleanup, and I rewrote the commit message for the z_clock_announce() loop since the original bug I saw was apparently a phantom (the new loop structure is still better though).

I'm still working on reproducing the bug locally though.

@pabigot
Collaborator

pabigot commented Jan 2, 2019

This commit at the top of pabigot/zephyr:pr/12204 is what I use for reproducibility. Usually fails within 10 repetitions, always on the first iteration. A large EXPIRE_TIMES appears to be critical to quick reproduction.

The updated PR still fails:

===================================================================
starting test - test_timer_periodicity
rep 3 iter 0 failed at 17240 delta 40 exp 50

    Assertion failed at ../src/main.c:246: test_timer_periodicity: false is false

FAIL - test_timer_periodicity

The test_timer_periodicity test is racy and subject to initial state
bugs.  The operation of that test is to:

1. Start a timer with a known period
2. Take the current time with k_uptime_get()
3. Wait for the timer to fire with k_timer_status_sync()
4. Check that the current time minus start time is the period

But that's wrong, because a tick expiring between any of the first
three steps is going to skew the math (i.e. the timer will have
started on a different tick than the "start time").

And taking an interrupt lock around the process can't fix the issue,
because in the tickless world we live in, k_uptime_get() is actually a
realtime quantity based on a hardware counter and doesn't rely on
interrupt delivery.

Instead, use another timer object to synchronize the test start to a
driver tick, ensuring that even if the race is unfixable the initial
conditions are always correct.

Signed-off-by: Andy Ross <[email protected]>
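The sync-to-tick idea from this commit, sketched with the public k_timer API (the tick_sync() helper and the 1 ms period are illustrative):

static struct k_timer sync_timer;

/* Align the test start with a tick boundary so the k_uptime_get()
 * sample and the k_timer start land on the same tick.
 */
static void tick_sync(void)
{
	k_timer_init(&sync_timer, NULL, NULL);
	k_timer_start(&sync_timer, 0, 1);
	k_timer_status_sync(&sync_timer);
	k_timer_stop(&sync_timer);
}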
@andyross
Collaborator Author

andyross commented Jan 3, 2019

Got it. After all that, it was a test bug. And it happens everywhere, not just on nRF. We just didn't notice it because the frequency of failure was on the same order as what we already see with qemu timer tests, and sanitycheck will retry (one of @andrewboie's peeves -- yeah, we really need to get -icount working).

Basically, the test had a nonatomic initialization process where it would check current time and start a k_timer, and there was no protection against the case where a tick happened to expire between the two, causing the two times to be off by one tick. Fixed by adding a sync-to-tick process at the start. (This is also a very tidy explanation for why the test only ever failed on the first run -- after the end, something has waited on a timeout and execution is synchronized.)

FWIW: I eventually got it reproduced by putting the test into a while(1) loop with a randomized delay to start at random points in the tick. With this series applied I'm seeing it at 4500 successful iterations and counting.
