Timer fixes #12204
Conversation
(Apologies for the assignee spam -- I added folks as assignees instead of reviewers, sorry!)
Oh, and note that the first patch (the list handling one) is also included in the x86_64 series. I cherry-picked it here because it's more relevant to timer work. I'll update whichever PR gets merged last to remove the duplicate; just be aware that if you try to pull both you'll get a collision.
Codecov Report
@@            Coverage Diff             @@
##           master   #12204      +/-   ##
==========================================
+ Coverage    48.2%   48.22%   +0.02%
==========================================
  Files         294      294
  Lines       44192    44200       +8
  Branches    10603    10607       +4
==========================================
+ Hits        21301    21316      +15
+ Misses      18624    18616       -8
- Partials     4267     4268       +1
Continue to review full report at Codecov.
        if (next(t) != NULL) {
                next(t)->dticks += t->dticks;
        }
        if (t->node.next != NULL && t->node.prev != NULL) {
Does this problem exist generally, in that whenever a dnode is removed from a dlist that fact can't be determined from the dnode itself? If so, would it be better to extend the dlist API with something like:
/* mark that node is not in any list */
static inline void sys_dnode_unlink(sys_dnode_t *node)
{
        if (node) {
                node->next = NULL;
        }
}

/* check whether node is in a list */
static inline bool sys_dnode_is_linked(sys_dnode_t *node)
{
        return node && (node->next != NULL);
}
then implement as:
static void remove_timeout(struct _timeout *t)
{
        if (sys_dnode_is_linked(&t->node)) {
                if (next(t) != NULL) {
                        next(t)->dticks += t->dticks;
                }
                sys_dlist_remove(&t->node);
                sys_dnode_unlink(&t->node);
                t->dticks = _INACTIVE;
        }
}
?
It seems this would make the intent more clear, and eliminate the manipulations of `prev`, which I believe should be non-null if and only if `next` is non-null. It would also make the solution available to any other uses of `sys_dlist_remove` where the same issue might be relevant.

It might also remove the need to use `_INACTIVE` as a flag value.
I don't disagree, though I think there's an argument that "unlink" and "remove" being distinct APIs is going to be confusing for readers. The dlist is a very low-level intrusive data structure; honestly I think it's always going to be a little error prone.
But I totally agree about the "_INACTIVE" state; it's used very inconsistently already and ought to be removed.
I'd considered `clear` instead of `unlink`, but that didn't seem as... clear. If there's another wording that's better....

Alternatively, put the `node->next = NULL` assignment into `remove`, which would be safer anyway at a very minor overall cost. An ability to reliably check whether a node is already in a list before adding it to a list could be very helpful.
I generally support @pabigot's idea, however I would like to add a small modification:
To be consistent with the way lists are implemented in Zephyr, I suggest following Linux (https://elixir.bootlin.com/linux/latest/ident/list_del_init) and marking a removed node by setting `node->next` (and `node->prev`) to `node`.
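For illustration, a minimal sketch of that convention could look like the following (the helper names simply reuse the ones proposed above; none of this is existing Zephyr API, just the idea being suggested):

/* Sketch only: Linux-style "self-linked" marking.  A node that is not
 * on any list points at itself, exactly like an empty list head does.
 */
static inline void sys_dnode_unlink(sys_dnode_t *node)
{
        node->next = node;
        node->prev = node;
}

static inline bool sys_dnode_is_linked(const sys_dnode_t *node)
{
        return node->next != node;
}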
Researching what @pizi-nordic proposed: In Linux, `list_del_init` removes the entry and re-initializes it as an empty list head. `list_del` removes the entry and sets the `next` and `prev` pointers to poisoned values (dereferencing them causes a page fault that can be diagnosed). `list_del` is used about twice as often as `list_del_init` in Linux, but `list_del_init` is the one to use when the removed entry may later be added to another list.

Although the structure of a doubly-linked list is the same in Linux and Zephyr, the pointer management differs, both in the operations supported and in the order of assignments for specific operations like add/prepend. Zephyr uses distinct type names for list nodes and list heads, and they aren't initialized the same way, unlike Linux, which only uses `list_head`. So I'm not convinced that resetting a removed `sys_dnode_t` to look like an empty `sys_dlist_t` should be preferred over initializing a removed node to have null pointers, which is presumably what a Zephyr `sys_dnode_t` that's never been added to a list would have (if it isn't left uninitialized).

In either case, after reviewing this I do think it's important that both `next` and `prev` be assigned to self or a poison/null value when the entry is removed, not just `next` as I originally proposed.
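As a purely illustrative sketch of that conclusion with the null-pointer convention (the `sys_dlist_remove_and_reset` name is made up here; `sys_dlist_remove` is the existing Zephyr call):

/* Sketch only: remove a node and clear both links, so the node can
 * never be mistaken for a list member afterwards and its stale
 * neighbour pointers can never be dereferenced.
 */
static inline void sys_dlist_remove_and_reset(sys_dnode_t *node)
{
        sys_dlist_remove(node);
        node->next = NULL;
        node->prev = NULL;
}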
Can we submit this as an enhancement against the dlist code? It's not really timer related, and obviously has code size impact to consider: there are a lot of dlist users, and while many could probably benefit from being ported to this scheme, a naive change would do nothing but add a bunch of code bytes.
@andyross Do you mean you're OK with dropping this patch from this PR, or that you'd rather merge this change as-is rather than bike-shed it here?
Either is fine with me. I will file a PR to fix dlist, because as you've found the existing code is problematic and it should be fixed more generally. IMO there's too much "hack it now, we'll fix it properly later" in Zephyr, and later never comes.
If the problem is fixed globally, about 42 tests fail sanitycheck. That is after working around a bug in poll which double-removes events from a list (segfaults in many native_posix tests). It looks like several parts of Zephyr depend on the pointers to neighboring nodes remaining dereferenceable after removal from a list.
Probably best to merge your fix for now.
Unfortunately the problem still exists on top of #12204.
#11722 still fails (#11722 (comment)) with this patch applied.
The problems described in #11722 are still not resolved.
|
||
announce_remaining = ticks; | ||
|
||
while (first() != NULL && first()->dticks <= announce_remaining) { |
This still allows for immediate firing of timeouts scheduled by the `t->fn(t)` callback, breaking monotonicity of the timer (please consider the case where `announce_remaining` is larger than the scheduled time).
For like the ninth time, this is a feature, not a bug. It's not the source of the problem you're seeing, I promise.
I opened #12247 to try to explain this yet again. Please take the discussion about API changes there.
(Also: there's no violation of monotonicity -- timers always expire in the expected order. DELTA times between timeouts will and MUST be skewed in the face of lost interrupts; the argument is about which delta gets skewed.)
Is that sub-2% frequency consistent with the original report? Are we sure we aren't talking about a different bug here? The issue fixed would have happened any time you had a double-event stacked in the same tick and more than one tick was being announced. That shouldn't have been stochastic in this way. I was originally led to believe this was a consistent problem.

@pizi-nordic, I can't tell if you're trying to -1 this patch because there's another unresolved bug or because you believe this shouldn't be merged. Can you clarify? These are real bugs and the fixes need to go in. Which patches specifically do you want to keep out of the tree?
I'm taking the dlist issue off the table; will have a PR for that. I have no dog in the desired-behavior fight. I do question whether this PR actually changes anything other than the double-remove issue. Please clarify so I can remove the nack.
This loop was structured badly, as a while(true) with multiple "exit if" cases in the body. It was bad enough that I genuinely fooled myself into rewriting it, having convinced myself there was a bug in it when there wasn't.

So keep the rewritten loop, which expresses the iteration in a more invariant way (i.e. "while we have an element to expire" and not "test if we have to exit the loop"). Shorter and easier. It also makes the locking clearer, as we can simply release the lock around the callback in a natural/obvious way.

Signed-off-by: Andy Ross <[email protected]>
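For context, a greatly simplified sketch of the loop shape that commit describes (illustrative only, not the actual z_clock_announce() source; the `timeout_lock` and `curr_tick` names and the tail bookkeeping are assumptions, while `first()`, `remove_timeout()`, `announce_remaining` and `t->fn(t)` follow the snippets quoted above):

void z_clock_announce(int32_t ticks)
{
        k_spinlock_key_t key = k_spin_lock(&timeout_lock);

        announce_remaining = ticks;

        /* "while we have an element to expire", not "test if we must exit" */
        while (first() != NULL && first()->dticks <= announce_remaining) {
                struct _timeout *t = first();
                int dt = t->dticks;

                curr_tick += dt;
                announce_remaining -= dt;
                t->dticks = 0;
                remove_timeout(t);

                /* release the lock around the user callback */
                k_spin_unlock(&timeout_lock, key);
                t->fn(t);
                key = k_spin_lock(&timeout_lock);
        }

        /* account for any remainder not consumed by expired timeouts */
        curr_tick += announce_remaining;
        announce_remaining = 0;

        k_spin_unlock(&timeout_lock, key);
}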
The use of dticks == INACTIVE to tell whether or not a timeout was already in the list was insufficient. There is a time period between the moment a timeout is removed from the list and the end of its handler where it is not in the list, yet its list node pointers still point into it. Doing things like aborting a thread while that is true (which can be asynchronous too!) would corrupt the list even though all the operations on it were "atomic".

Set the timeout node pointers to nulls atomically when removed, and check for double-remove conditions (which, again, might be perfectly OK).

Signed-off-by: Andy Ross <[email protected]>
The recent change that added a locked z_set_timeout_expiry() API obsoleted the subtle note about synchronization above reset_time_slice(). None of that matters any more; the API is synchronized internally in a conventional way.

Signed-off-by: Andy Ross <[email protected]>
The logic in z_set_timeout_expiry() missed the case where the ticks argument could be zero (or lower), which can happen naturally due to timing/interrupt slop. In those circumstances, it would still try to reset a timer that was "about to expire at the next tick", which would run afoul of the drivers' internal decisions about how soon a timer interrupt could be set, and then get pushed out to the next tick.

Explicitly detect this as an "imminent" predicate to make the logic clearer.

Signed-off-by: Andy Ross <[email protected]>
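A rough sketch of that "imminent" idea (illustrative only, not the literal patch; the `next_timeout()` and `z_clock_set_timeout()` helper names, the lock name, and the exact conditions are assumptions):

void z_set_timeout_expiry(int32_t ticks, bool idle)
{
        k_spinlock_key_t key = k_spin_lock(&timeout_lock);

        int next = next_timeout();      /* ticks until the first queued timeout */
        bool imminent = (next <= 1);    /* something fires at the next tick anyway */

        /* Only ask the driver for an earlier interrupt when nothing is
         * about to fire anyway and the request is genuinely sooner; a
         * ticks value of zero or less is treated as "next tick".
         */
        if (!imminent && ticks < next) {
                z_clock_set_timeout(MAX(ticks, 1), idle);
        }

        k_spin_unlock(&timeout_lock, key);
}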
Force-pushed from 0923b5d to 15c8433.
The new version contains a fix for an edge case that certainly could produce the behavior in #11722 (a delayed tick that would result in late callbacks in the same way a delayed interrupt would). Also a comment cleanup, and I rewrote the commit message for the z_clock_announce() loop, since the original bug I saw was apparently a phantom (the new loop structure is still better though). I'm still working on reproducing the bug locally, though.
This commit at the top of pabigot/zephyr:pr/12204 is what I use for reproducibility. It usually fails within 10 repetitions, always on the first iteration. A large [...]

The updated PR still fails:
The test_timer_periodicity test is racy and subject to initial state bugs. The operation of that test is to:

1. Start a timer with a known period
2. Take the current time with k_uptime_get()
3. Wait for the timer to fire with k_timer_status_sync()
4. Check that the current time minus start time is the period

But that's wrong, because a tick expiring between any of the first three steps is going to skew the math (i.e. the timer will have started on a different tick than the "start time"). And taking an interrupt lock around the process can't fix the issue, because in the tickless world we live in, k_uptime_get() is actually a real-time quantity based on a hardware counter and doesn't rely on interrupt delivery.

Instead, use another timer object to synchronize the test start to a driver tick, ensuring that even if the race is unfixable the initial conditions are always correct.

Signed-off-by: Andy Ross <[email protected]>
Got it. After all that, it was a test bug. And it happens everywhere, not just on nRF. We just didn't notice it because the frequency of failure was on the same order as we see with qemu timer tests already, and sanitycheck will retry (one of @andrewboie's peeves -- yeah, we really need to get -icount working).

Basically, the test had a non-atomic initialization process where it would check the current time and start a k_timer, and there was no protection against the case where a tick happened to expire between the two, causing the two times to be off by one tick. Fixed by adding a sync-to-tick process at the start. (This is also a very tidy explanation for why the test only ever failed on the first run -- after the end, something has waited on a timeout and execution is synchronized.)

FWIW: I eventually got it reproduced by putting the test into a while(1) loop with a randomized delay to start at random points in the tick. With this series applied I'm seeing it at 4500 successful iterations and counting.
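A minimal sketch of that sync-to-tick idea (illustrative only, not the literal test change; the `sync_to_tick()` helper name is made up, and the 0/1 ms timer arguments are assumptions):

#include <zephyr.h>

static struct k_timer sync_timer;

/* Burn one short timeout so the caller resumes right after a tick
 * boundary; anything it measures afterwards starts tick-aligned.
 */
static void sync_to_tick(void)
{
        k_timer_init(&sync_timer, NULL, NULL);
        k_timer_start(&sync_timer, 0, 1);
        k_timer_status_sync(&sync_timer);   /* returns just after a tick expires */
        k_timer_stop(&sync_timer);
}

The test body would then call sync_to_tick() first, and only afterwards take k_uptime_get() and start the timer under test, so both happen within the same tick.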
Two bugs were discovered since the timer rewrite: one produced incorrect expiration times for timeouts set within timeout handlers when multiple timeouts were expiring in the same tick. The other was a potential timeout list corruption issue if the timeout handler (or, under SMP, simultaneous code racing against it) tried to remove a timeout (for example by aborting a thread).
This doesn't fix the API confusion about "what is 'current time' to a timeout handler when the interrupt was delayed" detailed in #11722. I'm going to think a little more on that one and see if an API option makes sense, or at least update the docs.
Relevant to #11722, this works for me over 30 iterations of the timer_api test, so I'm reasonably confident the bug is fixed even if I was never able to reproduce it.