fix possible deadlock with FPU sharing on ARM64 and RISC-V #58086
Conversation
Not really opposed semantically, but this seems like something that's going to want some degree of tuning. Naively this is going to severely penalize a CPU that receives more interrupts than whoever it is contending with, and lots of platforms are very asymmetric with regard to interrupt delivery (e.g. on intel_adsp all device interrupts go to one core). Maybe it's worth having a special mode or API for whatever contention case you're dealing with? Also, just because we should always ask the question when we hit cases like this: what does Linux do? I'm pretty sure it doesn't do this (but I might be wrong! Linux spinlocks are pretty arch-dependent IIRC) and I worry there's a good reason we're not considering.
One more gotcha occurred to me: this trick only works for the "outer" lock in a nested lock paradigm, which makes it fragile. It no doubt works now, but if some change comes along that puts the code using it under a different lock, then it won't actually work to unmask interrupts and the deadlock will return. And just to be clear: the contention right now is always happening out of thread mode? You never contend on the spinlock from interrupt context, which for the same reason can't be reliably preempted (without knowing a priori what all the interrupt priorities are). Basically I guess what I'm saying is that this looks correct to me, but kinda creeps me out a bit.
The current code masks interrupts on entry and keeps them masked while spinning for the lock.
What this patch does is briefly restore the caller's interrupt state while the lock is contended, so that pending interrupts (including the FPU-flush IPI) can be serviced.
So this basically restores the same state as the spin_lock()'s initial entry. Linux doesn't have to do this because it only allows FPU usage from user space. This being said, I don't think my proposal entirely solves the problem. So this needs another thought.
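For illustration, here is a minimal sketch of that "unmask while spinning" idea, assuming the shape of Zephyr's k_spin_lock() internals (the helper name and exact guards are made up; this is not the actual diff):

```c
#include <zephyr/kernel.h>

/* Sketch only: briefly restore the caller's interrupt state while the
 * lock is contended so that a pending FPU-flush IPI can be serviced.
 */
static inline void spin_acquire_with_irq_window(atomic_t *locked,
						unsigned int key)
{
	while (!atomic_cas(locked, 0, 1)) {
		arch_irq_unlock(key);	/* reopen the caller's IRQ window */
		(void)arch_irq_lock();	/* mask again before retrying */
	}
}
```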
Or maybe some IRQ could be marked as an NMI, and the spinlock loop could check for that pending NMI instead of just enabling IRQs inside the spinlock?
Coming back here having given this more thought (do we not have a bug for this deadlock specifically?), as I think this is the one with the right structure. Can someone verify that this understanding of the situation is correct?
So... yeah, the solution is pretty much going to require augmenting the spin loop. But as discovered upthread, you can't do this by servicing interrupts, because interrupts might need to be masked due to nested lock state (or by running at a high interrupt priority).

So... how about an atomic flag on the CPU? In the FPU trap, you set this flag before sending the IPI and then spin on it [4]. And the flush code clears the flag after doing its work. This is basically the way that k_thread_abort() works in SMP right now, FWIW.

[1] Which maybe is a mistake? This absolutely complicates our job, maybe needlessly. Linux gets away with disallowing x87/SSE/AVX in the kernel; no reason we couldn't have a rule like "no FPU when interrupts are masked" or whatever. Obviously that will require subsystem work for areas that violate that rule right now, but frankly that fix might be easier?

[2] Seems like a hole in the current design is that it only tracks "threads" mapped to CPUs, but one common case for this deadlock is actually that we're in interrupt context. If that's true, there's (obviously) no need to spill the context for the thread we interrupted; we should just reset the register state or whatever and allow the interrupt to do whatever it wants, right?

[3] It's worth pointing out that spinlocks are not the only way an app can busy-wait for something with interrupts masked! Though maybe it's the only one in the tree susceptible to this deadlock right now and could be treated with documentation.

[4] A good example of a situation where we spin outside a spinlock. Obviously this spin loop would need to be similarly augmented with the flag check!
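A minimal sketch of the flag handshake described above, with hypothetical names (fpu_flush_pending, request_fpu_flush and the IPI plumbing are placeholders, not existing kernel code):

```c
#include <zephyr/kernel.h>

static atomic_t fpu_flush_pending[CONFIG_MP_MAX_NUM_CPUS];

/* Runs in the FPU trap on the CPU that needs its context back. */
void request_fpu_flush(int owner_cpu)
{
	atomic_set(&fpu_flush_pending[owner_cpu], 1);
	/* arch-specific "flush your FPU" IPI would be sent here */
	while (atomic_get(&fpu_flush_pending[owner_cpu]) != 0) {
		/* busy-wait; per [4], this loop would itself also need to
		 * honor an incoming flush request aimed at this CPU
		 */
	}
}

/* IPI handler on the CPU that still holds the live FPU registers. */
void fpu_flush_ipi_handler(int this_cpu)
{
	/* spill the live FPU context to memory here, then: */
	atomic_clear(&fpu_flush_pending[this_cpu]);
}
```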
Also: is the deadlock exercisable on any of the qemu platforms? Or does one need to find and download that FVP thing?
About [2]: If the FPU is used in interrupt context then it is obviously not a thread. If an IRQ context doesn't use the FPU but the FPU still holds a state wanted by another CPU, that CPU will send a flush IPI and wait for it. If that IRQ context tries to get a locked spinlock then we have the same deadlock scenario.
Here's another proposal. This implements the spinlock loop augmentation idea. I did the RISC-V part. I'd need someone familiar with the GIC to fill in the ARM64 part.
Let me try to fill in the ARM64 part.
I think it is reproducible on QEMU (ARM and RISC-V), but the probability of hitting the deadlock might be low. I found the issue on FVP, where I can reproduce the deadlock 100% of the time.
Some notes. This definitely seems like the right track to me.
include/zephyr/spinlock.h
@@ -153,6 +153,9 @@ static ALWAYS_INLINE k_spinlock_key_t k_spin_lock(struct k_spinlock *l)

#ifdef CONFIG_SMP
	while (!atomic_cas(&l->locked, 0, 1)) {
#ifdef CONFIG_ARCH_HAS_BUSY_SPINLOCK_CHECK
		arch_busy_spinlock_check();
Pedantic API naming: can we name this something like "arch_spin_relax()" instead? A more common and less obscurely-Zephyr-specific use case for this sort of thing is idle power management and bus contention relaxation. x86 spinlocks are best implemented on big/NUMA systems with MWAIT, etc... Our default spinlock is naive and works great in practice, but almost every system has a "better way to do this".
The other advantage is that we can then document this in such a way that it can be used by other busy loops than just "k_spinlocks".
Also: another very reasonable implementation choice would be to make this a function call and implement the default as a weak symbol, which avoids the need to mess with kconfig. Almost by definition, we don't care about cycle-level performance optimization when we're doing nothing waiting for something to happen.
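A sketch of that weak-default idea, using the arch_spin_relax() name suggested above (my illustration, not part of this diff):

```c
#include <zephyr/toolchain.h>

/* Default relax hook: does nothing. An architecture that needs to poll
 * for pending work (e.g. an FPU-flush IPI) or relax bus/power state
 * overrides this symbol; no Kconfig option required.
 */
void __weak arch_spin_relax(void)
{
}
```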
arch/riscv/core/smp.c
void arch_busy_spinlock_check(void)
{
	bool fpu_ipi_pending = atomic_and(&cpu_pending_ipi[_current_cpu->id],
					  IPI_FPU_FLUSH) != 0;
Is this not a race? It sets the bit before the flush is complete.
In fact there's a missing ~ before IPI_FPU_FLUSH. The flag is set elsewhere.
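For clarity, here is one way to write the intended test-and-clear (identifiers taken from the excerpt above; this is an illustration, not the appended commit):

```c
static bool fpu_flush_ipi_test_and_clear(void)
{
	/* Atomically clear only the FPU-flush bit and report whether it
	 * had been pending; other IPI bits are left untouched.
	 */
	atomic_val_t prev = atomic_and(&cpu_pending_ipi[_current_cpu->id],
				       ~IPI_FPU_FLUSH);

	return (prev & IPI_FPU_FLUSH) != 0;
}
```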
For simplicity, I appended a commit on your branch. @npitre, would you like to take a look? Feel free to change or rebase.
arch/arm64/core/smp.c
	if (fpu_ipi_pending) {
		/*
		 * We're not in IRQ context here and cannot use
		 * z_riscv_flush_local_fpu() directly.
This might be a typo: it should be z_arm64_flush_local_fpu().
 * @param irq interrupt ID
 * @return Returns true if interrupt is pending, false otherwise
 */
bool arm_gic_irq_is_pending(unsigned int intid);
Apologies, Doxygen warns about this. It should be unsigned int irq.
 *
 * @param irq interrupt ID
 */
void arm_gic_irq_clear_pending(unsigned int intid);
ditto
Give architectures that need it the ability to perform special checks while e.g. waiting for a spinlock to become available.

Signed-off-by: Nicolas Pitre <[email protected]>
This is cleaner and less error prone, especially when the time comes to test and clear a bit.

Signed-off-by: Nicolas Pitre <[email protected]>
Let's consider CPU1 waiting on a spinlock already taken by CPU2.

It is possible for CPU2 to invoke the FPU and trigger an FPU exception when the FPU context for CPU2 is not live on that CPU. If the FPU context for the thread on CPU2 is still held in CPU1's FPU then an IPI is sent to CPU1 asking to flush its FPU to memory.

But if CPU1 is spinning on a lock already taken by CPU2, it won't see the pending IPI as IRQs are disabled. CPU2 won't get its FPU state restored and won't complete the required work to release the lock.

Let's prevent this deadlock scenario by looking for a pending FPU IPI from the arch_spin_relax() hook and honoring it.

Signed-off-by: Nicolas Pitre <[email protected]>
I think it is ready for consideration now. @povergoing: Please confirm this actually solves the deadlock you were experiencing.
Implement IRQ pending check and clear functions for both GIC and GICv3.

Signed-off-by: Jaxson Han <[email protected]>
Let's consider CPU1 waiting on a spinlock already taken by CPU2.

It is possible for CPU2 to invoke the FPU and trigger an FPU exception when the FPU context for CPU2 is not live on that CPU. If the FPU context for the thread on CPU2 is still held in CPU1's FPU then an IPI is sent to CPU1 asking to flush its FPU to memory.

But if CPU1 is spinning on a lock already taken by CPU2, it won't see the pending IPI as IRQs are disabled. CPU2 won't get its FPU state restored and won't complete the required work to release the lock.

Let's prevent this deadlock scenario by looking for a pending FPU IPI from the spinlock loop using the arch_spin_relax() hook.

Signed-off-by: Nicolas Pitre <[email protected]>
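Structurally, the hook described in these commit messages amounts to something like the sketch below (not the literal patch: the flush helper is a placeholder, since as noted above the IRQ-context flush routine cannot be called directly from here):

```c
void arch_spin_relax(void)
{
	/* Reuses a test-and-clear like the one sketched earlier. */
	if (fpu_flush_ipi_test_and_clear()) {
		/* Spill this CPU's live FPU context so the other CPU can
		 * resume and eventually release the lock we are spinning on.
		 */
		flush_local_fpu_from_spin_context();	/* placeholder */
	}
}
```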
This all looks great to me. One API note for future decisions.
 * arch_nop(). Architectures may implement this function to perform extra
 * checks or power management tricks if needed.
 */
void arch_spin_relax(void);
It occurs to me that my suggestion that this would be useful for e.g. MWAIT-based relaxation means it should take some kind of pointer to the address being waited on, which would require careful documentation. But we can fix that up later if we ever get there; arch_* APIs are tree-internal and not subject to stability or deprecation requirements. And this would start out as an unstable API anyway, surely.
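Building on the weak-default sketch above, that suggested future shape might look like this (purely hypothetical signature, not today's API):

```c
#include <zephyr/toolchain.h>

void __weak arch_spin_relax(const void *wait_on)
{
	/* An arch could monitor/wait on wait_on here; default does nothing. */
	ARG_UNUSED(wait_on);
}
```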
Yes, I confirmed it was solved (after some stress tests).
We should allow for architecture-specific special processing while
waiting on a spinlock.
This is especially critical in the case where the spinning CPU is being
sent an IPI from a second CPU which already holds the same spinlock and
is synchronously waiting for the first CPU to process that IPI.
This scenario may occur on ARM64 and RISC-V with FPU sharing enabled.
This is an alternative to PR #58058 that should benefit all architectures.
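Putting it together, the contended path in k_spin_lock() ends up looking roughly like this (my summary sketch; the exact guards and the default definition of arch_spin_relax() may differ):

```c
	while (!atomic_cas(&l->locked, 0, 1)) {
		/* Arch hook: may notice and honor a pending FPU-flush IPI
		 * even though IRQs are masked here.
		 */
		arch_spin_relax();
	}
```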