
fix possible deadlock with FPU sharing on ARM64 and RISC-V #58086


Merged (5 commits, May 25, 2023)

Conversation

@npitre (Collaborator) commented May 19, 2023

We should allow for architecture-specific special processing while
waiting on a spinlock.

This is especially critical in the case where such a CPU is being
sent an IPI from a second CPU which already has the same spinlock taken
and is synchronously waiting for that first CPU to process the IPI.
This scenario may occur on ARM64 and RISC-V with FPU sharing enabled.

This is an alternative to PR #58058 that should benefit all architectures.

@andyross (Collaborator) commented:

Not really opposed semantically, but this seems like something that's going to want some degree of tuning. Naively this is going to severely penalize a CPU that receives more interrupts than whoever it is contending with, and lots of platforms are very asymmetric with regard to interrupt delivery (e.g. on intel_adsp all device interrupts go to one core).

Maybe it's worth having a special mode or API for whatever the contention case you're dealing with is?

Also, just because we should always ask the question when we hit cases like this: what does Linux do? I'm pretty sure it doesn't do this (but I might be wrong! Linux spinlocks are pretty arch-dependent IIRC) and I worry there's a good reason we're not considering it.

@andyross (Collaborator) commented:

One more gotcha occurred to me: this trick only works for the "outer" lock in a nested lock paradigm, which makes it fragile. It no doubt works now, but if some change comes along that puts the code using it under a different lock, then it won't actually work to unmask interrupts and the deadlock will return.

And just to be clear: the context right now is always happening out of thread mode? You never contend on the spinlock from interrupt context, which for the same reason can't be reliably preempted (without knowing a-priori what all the interrupt priorities are).

Basically I guess what I'm saying is that this looks correct to me, but kinda creeps me out a bit.

@npitre (Collaborator, Author) commented May 20, 2023

The current code does this:

  • disable IRQs

  • try a CAS --> fail

  • try a CAS --> fail

  • try a CAS --> success

What this patch does is:

  • disable IRQs

  • try a CAS --> fail

  • enable IRQs

  • disable IRQs

  • try a CAS --> fail

  • enable IRQs

  • disable IRQs

  • try a CAS --> success

So this basically restores the same state as the spin_lock()'s initial entry.
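In k_spinlock terms, a minimal sketch of the loop described above (not the actual patch; it assumes the arch_irq_lock()/arch_irq_unlock() pair used by the existing implementation):

```c
/* Sketch only: re-enable IRQs briefly between CAS attempts so that
 * pending IPIs can be serviced while contending for the lock.
 */
static ALWAYS_INLINE k_spinlock_key_t k_spin_lock(struct k_spinlock *l)
{
	k_spinlock_key_t k;

	k.key = arch_irq_lock();               /* disable IRQs */
	while (!atomic_cas(&l->locked, 0, 1)) {
		arch_irq_unlock(k.key);        /* window for pending IPIs */
		k.key = arch_irq_lock();       /* re-disable before retry */
	}
	return k;
}
```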

Linux doesn't have to do this because it allows FPU usage from user space
only and spinlock usage from kernel space only. In Zephyr we do have
FPU usage in kernel threads which may end up within critical regions.

This being said, I don't think my proposal entirely solves the problem.
As you mentioned, if two different locks are taken, the first lock would
disable IRQs and the second lock would spin with IRQs always disabled even
with this patch, resulting in IPIs not being delivered again.

So this needs another thought.

@povergoing (Member) commented:

Or maybe some IRQs could be marked as NMIs, and the spinlock loop could poll for the pending NMI instead of just re-enabling IRQs while spinning?

@andyross (Collaborator) commented:

Coming back here having given this more thought (do we not have a bug for this deadlock specifically?), as I think this is the one with the right structure. Can someone verify that this understanding is correct:

  1. The arm64 FPU trap can occur in basically any context. The example given seems to be an assert calling printk I think, but in principle we don't limit the contexts where FPU use is allowed at all[1].

  2. In SMP contexts, it's possible that the last time the current thread ran, it was on a different CPU. And so the registers that need to be spilled to restore the current thread's FPU state[2] are on a different CPU. Right now we try to get the other CPU to spill its context via an IPI.

  3. But since we can be in any context, we might hold a resource[3] that the other CPU is waiting on with interrupts masked! That's now a deadlock: it won't service the IPI until it gets the resource, and it can't get the resource until the IPI spills the FPU state.

So... yeah, the solution is pretty much going to require augmenting the spin loop. But as discovered upthread, you can't do this by servicing interrupts because interrupts might need to be masked due to nested lock state (or by running at a high interrupt priority).

So... how about an atomic flag on the CPU? In the FPU trap, you set this flag before sending the IPI and then spin on it[4]. And the flush code clears the flag after doing its work. This is basically the way that k_thread_abort() works in SMP right now, FWIW. (A sketch follows the footnotes below.)

[1] Which maybe is a mistake? This absolutely complicates our job, maybe needlessly. Linux gets away with disallowing x87/SSE/AVX in the kernel, no reason we couldn't have a rule like "no FPU when interrupts are masked" or whatever. Obviously that will require subsystem work for areas that violate that rule right now, but frankly that fix might be easier?

[2] Seems like a hole in the current design is that it only tracks "threads" mapped to CPUs, but one common case for this deadlock is actually that we're in interrupt context. If that's true, there's (obviously) no need to spill the context for the thread we interrupted, we should just reset the register state or whatever and allow the interrupt to do whatever it wants, right?

[3] It's worth pointing out that spinlocks are not the only way an app can busy-wait for something with interrupts masked! Though maybe it's the only one in the tree susceptible to this deadlock right now and could be treated with documentation.

[4] A good example of a situation where we spin outside a spinlock. Obviously this spin loop would need to be similarly augmented with the flag check!
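
A minimal sketch of the flag-based handshake described in the comment above. All names here are hypothetical; the real code would hang this off the per-CPU structures:

```c
/* Hypothetical sketch of the proposed atomic-flag handshake. */
atomic_t fpu_flush_pending[CONFIG_MP_MAX_NUM_CPUS];

/* FPU trap path on the requesting CPU: set the flag, send the IPI,
 * then spin until the remote flush clears it.
 */
void request_remote_fpu_flush(int owner_cpu)
{
	atomic_set(&fpu_flush_pending[owner_cpu], 1);
	arch_sched_ipi();
	while (atomic_get(&fpu_flush_pending[owner_cpu]) != 0) {
		/* per note [4], this spin loop itself must also be
		 * augmented with the same kind of flag check
		 */
	}
}

/* Flush path on the owning CPU (IPI handler or spin-loop hook):
 * do the work first, clear the flag last.
 */
void flush_local_fpu_if_requested(void)
{
	int cpu = _current_cpu->id;

	if (atomic_get(&fpu_flush_pending[cpu]) != 0) {
		/* ...spill FPU registers to the owning thread... */
		atomic_set(&fpu_flush_pending[cpu], 0);
	}
}
```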

@andyross (Collaborator) commented:

Also: is the deadlock exercisable on any of the qemu platforms? Or does one need to find and download that FVP thing?

@npitre (Collaborator, Author) commented May 23, 2023

About [2]:

If the FPU is used in interrupt context then the user is obviously not a thread.
The FPU state is flushed to the thread that owns it and the FPU is reset for
the IRQ context's use. Because IRQ contexts are short-lived, we simply disable
IRQs for the remainder of that particular IRQ context so it won't be
interrupted by another IRQ that might also want to use the FPU. This way
we don't have to bother with preserving FPU state from IRQ contexts.

If an IRQ context doesn't use the FPU but the FPU still holds a state wanted
by a thread on another CPU then the IPI will be processed either immediately
or when this IRQ context is done, depending on the priority. But it will
happen eventually.

If that IRQ context tries to get a locked spinlock then we have the same
scenario where IPIs won't be processed unless the spinlock loop is augmented.
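
Roughly, the trap-time policy described above, as a hedged sketch; the helper names are hypothetical stand-ins, not the actual arch code:

```c
/* Hypothetical sketch of the FPU trap policy for IRQ context. */
void fpu_access_trap(void)
{
	if (arch_is_in_isr()) {
		/* Spill the live FPU state back to the thread owning it. */
		spill_fpu_to_owner_thread(_current_cpu);
		/* Give the IRQ handler a clean FPU; nothing to preserve
		 * since IRQ-context FPU state dies with the handler.
		 */
		reset_fpu_registers();
		/* IRQs stay masked for the remainder of this handler so a
		 * nested IRQ can't also trap on the FPU.
		 */
		return;
	}
	/* ...normal lazy context switch for thread-mode FPU use... */
}
```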

@npitre npitre requested review from kgugala and pgielda as code owners May 23, 2023 04:19
@zephyrbot zephyrbot added area: Base OS Base OS Library (lib/os) area: RISCV RISCV Architecture (32-bit & 64-bit) area: ARM64 ARM (64-bit) Architecture labels May 23, 2023
@npitre (Collaborator, Author) commented May 23, 2023

Here's another proposal. This implements the spinlock loop augmentation idea.

I did the RISC-V part. I'd need someone familiar with the GIC to fill the
TODO line in the ARM64 part.

@povergoing (Member) commented:

> I did the RISC-V part. I'd need someone familiar with the GIC to fill the
> TODO line in the ARM64 part.

Let me try to fill the ARM64 part

@povergoing (Member) commented:

> Also: is the deadlock exercisable on any of the qemu platforms? Or does one need to find and download that FVP thing?

I think it is reproducible on QEMU (ARM and RISC-V) but the probability of hitting the deadlock might be low. I found the issue on FVP, and there I can reproduce the deadlock 100% of the time.

@andyross (Collaborator) left a review:

Some notes. This definitely seems like the right track to me.

```diff
@@ -153,6 +153,9 @@ static ALWAYS_INLINE k_spinlock_key_t k_spin_lock(struct k_spinlock *l)

 #ifdef CONFIG_SMP
 	while (!atomic_cas(&l->locked, 0, 1)) {
+#ifdef CONFIG_ARCH_HAS_BUSY_SPINLOCK_CHECK
+		arch_busy_spinlock_check();
```
@andyross (Collaborator):

Pedantic API naming: can we name this something like "arch_spin_relax()" instead? A more common and less obscurely-Zephyr-specific use case for this sort of thing is idle power management and bus contention relaxation. x86 spinlocks are best implemented on big/NUMA systems with MWAIT, etc... Our default spinlock is naive and works great in practice, but almost every system has a "better way to do this".

The other advantage is that we can then document this in such a way that it can be used by other busy loops than just "k_spinlocks".

@andyross (Collaborator):

Also: another very reasonable implementation choice would be to make this a function call and implement the default as a weak symbol, which avoids the need to mess with kconfig. Almost by definition, we don't care about cycle-level performance optimization when we're doing nothing waiting for something to happen.
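
For illustration, a sketch of that weak-symbol alternative, using the arch_spin_relax() name suggested above and assuming only the usual __weak linkage support:

```c
/* Default relax hook: a no-op that any arch can override with a strong
 * definition, no Kconfig gate required.
 */
void __weak arch_spin_relax(void)
{
	/* nothing to do; the CAS loop itself is the wait */
}
```

The spinlock loop would then call arch_spin_relax() unconditionally instead of wrapping the call in an #ifdef.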

```c
void arch_busy_spinlock_check(void)
{
	bool fpu_ipi_pending = atomic_and(&cpu_pending_ipi[_current_cpu->id],
					  IPI_FPU_FLUSH) != 0;
```
@andyross (Collaborator):

Is this not a race? It sets the bit before the flush is complete.

@npitre (Collaborator, Author):

In fact there's a missing ~ before IPI_FPU_FLUSH. The flag is set elsewhere.

@povergoing (Member) commented May 24, 2023

For simplicity, I appended a commit to your branch. @npitre would you like to take a look? Feel free to change or rebase.

```c
	if (fpu_ipi_pending) {
		/*
		 * We're not in IRQ context here and cannot use
		 * z_riscv_flush_local_fpu() directly.
```
@povergoing (Member):

This might be a typo: it should be z_arm64_flush_local_fpu().

```c
 * @param irq interrupt ID
 * @return Returns true if interrupt is pending, false otherwise
 */
bool arm_gic_irq_is_pending(unsigned int intid);
```
@povergoing (Member):

Apologies, doxygen warns about this: it should be unsigned int irq.

```c
 *
 * @param irq interrupt ID
 */
void arm_gic_irq_clear_pending(unsigned int intid);
```
@povergoing (Member):

ditto
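
For context, a hedged sketch of how these two helpers might be used from the ARM64 relax hook; SGI_FPU_IPI is a stand-in for whatever SGI number the port actually assigns to FPU flush requests:

```c
/* Hypothetical ARM64 arch_spin_relax() using the new GIC helpers. */
void arch_spin_relax(void)
{
	if (arm_gic_irq_is_pending(SGI_FPU_IPI)) {
		arm_gic_irq_clear_pending(SGI_FPU_IPI);
		/* flush the local FPU state to its owning thread, like
		 * z_arm64_flush_local_fpu() but callable from thread
		 * context with IRQs masked
		 */
	}
}
```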

Nicolas Pitre added 3 commits May 24, 2023 14:40
Give architectures that need it the ability to perform special checks
while e.g. waiting for a spinlock to become available.

Signed-off-by: Nicolas Pitre <[email protected]>
This is cleaner and less error prone, especially when the time comes
to test and clear a bit.

Signed-off-by: Nicolas Pitre <[email protected]>
Let's consider CPU1 waiting on a spinlock already taken by CPU2.

It is possible for CPU2 to invoke the FPU and trigger an FPU exception
when the FPU context for CPU2 is not live on that CPU. If the FPU context
for the thread on CPU2 is still held in CPU1's FPU then an IPI is sent
to CPU1 asking to flush its FPU to memory.

But if CPU1 is spinning on a lock already taken by CPU2, it won't see
the pending IPI as IRQs are disabled. CPU2 won't get its FPU state
restored and won't complete the required work to release the lock.

Let's prevent this deadlock scenario by looking for a pending FPU IPI
from the arch_spin_relax() hook and honor it.

Signed-off-by: Nicolas Pitre <[email protected]>
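
Putting the pieces of this review together, a sketch of what the RISC-V hook amounts to (with the missing ~ fix noted above; cpu_pending_ipi[] and IPI_FPU_FLUSH are as quoted earlier, while flush_fpu_from_spin_relax() is a hypothetical stand-in since z_riscv_flush_local_fpu() assumes IRQ context):

```c
/* Sketch of the RISC-V arch_spin_relax() hook described in this commit.
 * flush_fpu_from_spin_relax() is hypothetical, not the final code.
 */
void arch_spin_relax(void)
{
	atomic_t *pending = &cpu_pending_ipi[_current_cpu->id];

	/* Atomically clear the FPU-flush bit and test if it was set. */
	if (atomic_and(pending, ~IPI_FPU_FLUSH) & IPI_FPU_FLUSH) {
		flush_fpu_from_spin_relax();
	}
}
```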
@npitre (Collaborator, Author) commented May 24, 2023

I think it is ready for consideration now.

@povergoing: Please confirm this actually solves the deadlock you were experiencing.

@npitre npitre changed the title kernel: don't exclude IRQ servicing on contended spinlocks fix possible deadlock with FPU sharing on ARM64 and RISC-V May 24, 2023
povergoing and others added 2 commits May 24, 2023 15:31
Implement IRQ pending check and clear functions for both GIC and GICv3.

Signed-off-by: Jaxson Han <[email protected]>
Let's consider CPU1 waiting on a spinlock already taken by CPU2.

It is possible for CPU2 to invoke the FPU and trigger an FPU exception
when the FPU context for CPU2 is not live on that CPU. If the FPU context
for the thread on CPU2 is still held in CPU1's FPU then an IPI is sent
to CPU1 asking to flush its FPU to memory.

But if CPU1 is spinning on a lock already taken by CPU2, it won't see
the pending IPI as IRQs are disabled. CPU2 won't get its FPU state
restored and won't complete the required work to release the lock.

Let's prevent this deadlock scenario by looking for pending FPU IPI from
the spinlock loop using the arch_spin_relax() hook.

Signed-off-by: Nicolas Pitre <[email protected]>
@andyross (Collaborator) left a review:

This all looks great to me. One API note for future decisions.

```c
 * arch_nop(). Architectures may implement this function to perform extra
 * checks or power management tricks if needed.
 */
void arch_spin_relax(void);
```
@andyross (Collaborator):

It occurs to me that my suggestion that this would be useful for e.g. MWAIT-based relaxation means it should take some kind of pointer to the address being waited on, which would require careful documentation. But we can fix that up later if we ever get there; arch_* APIs are tree-internal and not subject to stability or deprecation requirements. And this would start out as an unstable API anyway, surely.

@povergoing (Member) commented May 25, 2023

> I think it is ready for consideration now.
>
> @povergoing: Please confirm this actually solves the deadlock you were experiencing.

Yes, I confirmed it was solved (after some stress tests).

@fabiobaltieri fabiobaltieri merged commit 8e9872a into zephyrproject-rtos:main May 25, 2023
@npitre npitre deleted the ipideadlock branch May 25, 2023 18:50