fix possible deadlock with FPU sharing on ARM64 and RISC-V #58086
Conversation
Not really opposed semantically, but this seems like something that's going to want some degree of tuning. Naively this is going to severely penalize a CPU that receives more interrupts than whoever it is contending with, and lots of platforms are very asymmetric with regard to interrupt delivery (e.g. on intel_adsp all device interrupts go to one core). Maybe it's worth having a special mode or API for whatever contention case you're dealing with? Also, just because we should always ask the question when we hit cases like this: what does Linux do? I'm pretty sure it doesn't do this (but I might be wrong! Linux spinlocks are pretty arch-dependent IIRC) and I worry there's a good reason we're not considering.
One more gotcha occurred to me: this trick only works for the "outer" lock in a nested lock paradigm, which makes it fragile. It no doubt works now, but if some change comes along that puts the code using it under a different lock, then it won't actually work to unmask interrupts and the deadlock will return. And just to be clear: the contention right now is always happening out of thread mode? You never contend on the spinlock from interrupt context, which for the same reason can't be reliably preempted (without knowing a priori what all the interrupt priorities are). Basically I guess what I'm saying is that this looks correct to me, but kinda creeps me out a bit.
The current code masks interrupts on entry and keeps them masked while spinning for the lock.
What this patch does is briefly restore the caller's interrupt state while the lock is contended, so that pending interrupts (including the FPU-flush IPI) can be serviced.
So this basically restores the same state as the spin_lock()'s initial entry. Linux doesn't have to do this because it only allows FPU usage from user space. This being said, I don't think my proposal entirely solves the problem. So this needs another thought.
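For illustration, here is a minimal sketch of that "unmask while spinning" idea, assuming the shape of Zephyr's k_spin_lock() internals (the helper name and exact guards are made up; this is not the actual diff):

```c
#include <zephyr/kernel.h>

/* Sketch only: briefly restore the caller's interrupt state while the
 * lock is contended so that a pending FPU-flush IPI can be serviced.
 */
static inline void spin_acquire_with_irq_window(atomic_t *locked,
						unsigned int key)
{
	while (!atomic_cas(locked, 0, 1)) {
		arch_irq_unlock(key);	/* reopen the caller's IRQ window */
		(void)arch_irq_lock();	/* mask again before retrying */
	}
}
```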
Or maybe some IRQ could be marked as an NMI, and the spinlock loop could check for that pending NMI instead of just enabling IRQs inside the spinlock?
Coming back here having given this more thought (do we not have a bug for this deadlock specifically?), as I think this is the one with the right structure. Can someone verify that this understanding of the situation is correct?
So... yeah, the solution is pretty much going to require augmenting the spin loop. But as discovered upthread, you can't do this by servicing interrupts, because interrupts might need to be masked due to nested lock state (or by running at a high interrupt priority).

So... how about an atomic flag on the CPU? In the FPU trap, you set this flag before sending the IPI and then spin on it [4]. And the flush code clears the flag after doing its work. This is basically the way that k_thread_abort() works in SMP right now, FWIW.

[1] Which maybe is a mistake? This absolutely complicates our job, maybe needlessly. Linux gets away with disallowing x87/SSE/AVX in the kernel; no reason we couldn't have a rule like "no FPU when interrupts are masked" or whatever. Obviously that will require subsystem work for areas that violate that rule right now, but frankly that fix might be easier?

[2] Seems like a hole in the current design is that it only tracks "threads" mapped to CPUs, but one common case for this deadlock is actually that we're in interrupt context. If that's true, there's (obviously) no need to spill the context for the thread we interrupted; we should just reset the register state or whatever and allow the interrupt to do whatever it wants, right?

[3] It's worth pointing out that spinlocks are not the only way an app can busy-wait for something with interrupts masked! Though maybe it's the only one in the tree susceptible to this deadlock right now and could be treated with documentation.

[4] A good example of a situation where we spin outside a spinlock. Obviously this spin loop would need to be similarly augmented with the flag check!
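A minimal sketch of the flag handshake described above, with hypothetical names (fpu_flush_pending, request_fpu_flush and the IPI plumbing are placeholders, not existing kernel code):

```c
#include <zephyr/kernel.h>

static atomic_t fpu_flush_pending[CONFIG_MP_MAX_NUM_CPUS];

/* Runs in the FPU trap on the CPU that needs its context back. */
void request_fpu_flush(int owner_cpu)
{
	atomic_set(&fpu_flush_pending[owner_cpu], 1);
	/* arch-specific "flush your FPU" IPI would be sent here */
	while (atomic_get(&fpu_flush_pending[owner_cpu]) != 0) {
		/* busy-wait; per [4], this loop would itself also need to
		 * honor an incoming flush request aimed at this CPU
		 */
	}
}

/* IPI handler on the CPU that still holds the live FPU registers. */
void fpu_flush_ipi_handler(int this_cpu)
{
	/* spill the live FPU context to memory here, then: */
	atomic_clear(&fpu_flush_pending[this_cpu]);
}
```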
Also: is the deadlock exercisable on any of the qemu platforms? Or does one need to find and download that FVP thing?
About [2]: If the FPU is used in interrupt context then it is obviously not a thread. If an IRQ context doesn't use the FPU but the FPU still holds a state wanted by another CPU, that CPU will send a flush IPI and wait for it. If that IRQ context tries to get a locked spinlock then we have the same deadlock scenario.
Here's another proposal. This implements the spinlock loop augmentation idea. I did the RISC-V part. I'd need someone familiar with the GIC to fill in the ARM64 part.
Let me try to fill in the ARM64 part.
I think it is reproducible on QEMU (ARM and RISC-V), but the probability of hitting the deadlock might be low. I found the issue on FVP, where I can reproduce the deadlock 100% of the time.
Some notes. This definitely seems like the right track to me.
include/zephyr/spinlock.h
@@ -153,6 +153,9 @@ static ALWAYS_INLINE k_spinlock_key_t k_spin_lock(struct k_spinlock *l)

#ifdef CONFIG_SMP
	while (!atomic_cas(&l->locked, 0, 1)) {
#ifdef CONFIG_ARCH_HAS_BUSY_SPINLOCK_CHECK
		arch_busy_spinlock_check();
Pedantic API naming: can we name this something like "arch_spin_relax()" instead? A more common and less obscurely-Zephyr-specific use case for this sort of thing is idle power management and bus contention relaxation. x86 spinlocks are best implemented on big/NUMA systems with MWAIT, etc... Our default spinlock is naive and works great in practice, but almost every system has a "better way to do this".
The other advantage is that we can then document this in such a way that it can be used by other busy loops than just "k_spinlocks".
Also: another very reasonable implementation choice would be to make this a function call and implement the default as a weak symbol, which avoids the need to mess with kconfig. Almost by definition, we don't care about cycle-level performance optimization when we're doing nothing waiting for something to happen.
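A sketch of that weak-default idea, using the arch_spin_relax() name suggested above (my illustration, not part of this diff):

```c
#include <zephyr/toolchain.h>

/* Default relax hook: does nothing. An architecture that needs to poll
 * for pending work (e.g. an FPU-flush IPI) or relax bus/power state
 * overrides this symbol; no Kconfig option required.
 */
void __weak arch_spin_relax(void)
{
}
```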
arch/riscv/core/smp.c
void arch_busy_spinlock_check(void)
{
	bool fpu_ipi_pending = atomic_and(&cpu_pending_ipi[_current_cpu->id],
					  IPI_FPU_FLUSH) != 0;
Is this not a race? It sets the bit before the flush is complete.
In fact there's a missing ~ before IPI_FPU_FLUSH. The flag is set elsewhere.
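For clarity, here is one way to write the intended test-and-clear (identifiers taken from the excerpt above; this is an illustration, not the appended commit):

```c
static bool fpu_flush_ipi_test_and_clear(void)
{
	/* Atomically clear only the FPU-flush bit and report whether it
	 * had been pending; other IPI bits are left untouched.
	 */
	atomic_val_t prev = atomic_and(&cpu_pending_ipi[_current_cpu->id],
				       ~IPI_FPU_FLUSH);

	return (prev & IPI_FPU_FLUSH) != 0;
}
```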
For simplicity, I appended a commit on your branch. @npitre, would you like to take a look? Feel free to change or rebase.
arch/arm64/core/smp.c
	if (fpu_ipi_pending) {
		/*
		 * We're not in IRQ context here and cannot use
		 * z_riscv_flush_local_fpu() directly.
This might be a typo: it should be z_arm64_flush_local_fpu().
 * @param irq interrupt ID
 * @return Returns true if interrupt is pending, false otherwise
 */
bool arm_gic_irq_is_pending(unsigned int intid);
Apologies, Doxygen warns about this. It should be unsigned int irq.
 *
 * @param irq interrupt ID
 */
void arm_gic_irq_clear_pending(unsigned int intid);
ditto
Give architectures that need it the ability to perform special checks while e.g. waiting for a spinlock to become available.

Signed-off-by: Nicolas Pitre <[email protected]>
This is cleaner and less error prone, especially when the time comes to test and clear a bit.

Signed-off-by: Nicolas Pitre <[email protected]>
Let's consider CPU1 waiting on a spinlock already taken by CPU2.

It is possible for CPU2 to invoke the FPU and trigger an FPU exception when the FPU context for CPU2 is not live on that CPU. If the FPU context for the thread on CPU2 is still held in CPU1's FPU then an IPI is sent to CPU1 asking to flush its FPU to memory.

But if CPU1 is spinning on a lock already taken by CPU2, it won't see the pending IPI as IRQs are disabled. CPU2 won't get its FPU state restored and won't complete the required work to release the lock.

Let's prevent this deadlock scenario by looking for a pending FPU IPI from the arch_spin_relax() hook and honoring it.

Signed-off-by: Nicolas Pitre <[email protected]>
I think it is ready for consideration now. @povergoing: Please confirm this actually solves the deadlock you were experiencing.
Implement IRQ pending check and clear functions for both GIC and GICv3.

Signed-off-by: Jaxson Han <[email protected]>
Let's consider CPU1 waiting on a spinlock already taken by CPU2.

It is possible for CPU2 to invoke the FPU and trigger an FPU exception when the FPU context for CPU2 is not live on that CPU. If the FPU context for the thread on CPU2 is still held in CPU1's FPU then an IPI is sent to CPU1 asking to flush its FPU to memory.

But if CPU1 is spinning on a lock already taken by CPU2, it won't see the pending IPI as IRQs are disabled. CPU2 won't get its FPU state restored and won't complete the required work to release the lock.

Let's prevent this deadlock scenario by looking for a pending FPU IPI from the spinlock loop using the arch_spin_relax() hook.

Signed-off-by: Nicolas Pitre <[email protected]>
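Structurally, the hook described in these commit messages amounts to something like the sketch below (not the literal patch: the flush helper is a placeholder, since as noted above the IRQ-context flush routine cannot be called directly from here):

```c
void arch_spin_relax(void)
{
	/* Reuses a test-and-clear like the one sketched earlier. */
	if (fpu_flush_ipi_test_and_clear()) {
		/* Spill this CPU's live FPU context so the other CPU can
		 * resume and eventually release the lock we are spinning on.
		 */
		flush_local_fpu_from_spin_context();	/* placeholder */
	}
}
```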
This all looks great to me. One API note for future decisions.
 * arch_nop(). Architectures may implement this function to perform extra
 * checks or power management tricks if needed.
 */
void arch_spin_relax(void);
It occurs to me that my suggestion that this would be useful for e.g. MWAIT-based relaxation means it should take some kind of pointer to the address being waited on, which would require careful documentation. But we can fix that up later if we ever get there; arch_* APIs are tree-internal and not subject to stability or deprecation requirements. And this would start out as an unstable API anyway, surely.
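Building on the weak-default sketch above, that suggested future shape might look like this (purely hypothetical signature, not today's API):

```c
#include <zephyr/toolchain.h>

void __weak arch_spin_relax(const void *wait_on)
{
	/* An arch could monitor/wait on wait_on here; default does nothing. */
	ARG_UNUSED(wait_on);
}
```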
Yes, I confirmed it was solved (after some stress tests).
We should allow for architecture-specific special processing while
waiting on a spinlock.
This is especially critical in the case where the spinning CPU is being
sent an IPI from a second CPU which already holds the same spinlock and
is synchronously waiting for the first CPU to process that IPI.
This scenario may occur on ARM64 and RISC-V with FPU sharing enabled.
This is an alternative to PR #58058 that should benefit all architectures.
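Putting it together, the contended path in k_spin_lock() ends up looking roughly like this (my summary sketch; the exact guards and the default definition of arch_spin_relax() may differ):

```c
	while (!atomic_cas(&l->locked, 0, 1)) {
		/* Arch hook: may notice and honor a pending FPU-flush IPI
		 * even though IRQs are masked here.
		 */
		arch_spin_relax();
	}
```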