Fix arm64 FPU deadlock #58058

povergoing · 2023-05-19T09:03:55Z

[RFC] As discussed in #58056, this PR fixes the arm64 FPU deadlock, but it is really a workaround. I am hoping for a better way to fix the issue.

Using the FPU with FPU sharing enabled inside a spin-locked critical section might cause a 'deadlock'. When core 0 holds a spinlock that other cores are waiting for, it will 'deadlock' if the FPU context of core 0's current thread still lives on another core that is waiting for the spinlock and therefore never receives the FPU IPI.

The issue was found in the test case tests/lib/p4workq/ test_stress with SMP on the Arm FVP platform.
Threads use a spinlock to synchronize: core 0 gets the spinlock, and the other cores (say cores 1, 2 and 3) wait in a CAS loop (the spinlock, with IRQs disabled). The test case on core 0 enters the spin-locked section and calls an assert, which accesses FPU registers and so causes an FPU trap. Core 0 then finds that the current thread's FPU context lives on core 1, so it raises an IPI to tell core 1 to save the FPU context. But core 1 is spinning on the lock with IRQs disabled, so a deadlock occurs.
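The scenario above can be written down as a small predicate. This is only an illustrative model, not Zephyr code: `struct core_model`, its fields and `fpu_ipi_deadlock()` are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CPUS 4
#define NO_CPU   (-1)

/* Hypothetical per-core state; not Zephyr's real data structures. */
struct core_model {
	bool irq_masked;     /* interrupts are disabled on this core */
	bool holds_lock;     /* this core owns the contended spinlock */
	bool waits_for_lock; /* this core spins waiting for that spinlock */
	int needs_fpu_from;  /* core holding our thread's live FPU context, or NO_CPU */
};

/*
 * Return true if the lock holder is stuck: it needs an FPU flush from a
 * remote core, but that core spins on the holder's lock with interrupts
 * masked and so can never take the flush IPI.
 */
bool fpu_ipi_deadlock(const struct core_model cores[NUM_CPUS], int holder)
{
	assert(holder >= 0 && holder < NUM_CPUS);

	int remote = cores[holder].needs_fpu_from;

	if (!cores[holder].holds_lock) {
		return false;
	}
	if (remote < 0 || remote >= NUM_CPUS || remote == holder) {
		return false; /* FPU context is local or not needed */
	}
	return cores[remote].waits_for_lock && cores[remote].irq_masked;
}
```

With core 0 holding the lock and needing the context from core 1, the predicate is true exactly while core 1 is both spinning and masked; letting core 1 service the request in either way clears it.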

CC @npitre @carlocaione

…ction

Using FPU with FPU sharing in a spin-locked critical section might cause 'deadlock'. When core 0 got a spinlock that other cores are waiting for, it will 'deadlock' if the core 0 current threads' FPU context is living in another one that is waiting for the spinlock but never receives the FPU IPI.

To avoid the 'deadlock', let the spinning cores flush their FPU if its fpu_owner is the current thread of another core.

Signed-off-by: Jaxson Han <[email protected]>
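A minimal sketch of the check the commit message describes, written against a toy model instead of the real kernel structures. `struct thread_model`, `struct cpu_model`, `flush_count` and `flush_fpu_if_owner_runs_elsewhere()` are hypothetical names; only the idea (scan the other CPUs and flush if the local FPU owner is current elsewhere) comes from the commit message. The scan takes no lock, so the owner could become current on another CPU right after the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CPUS 4

/* Hypothetical stand-ins for the kernel's thread and per-CPU records. */
struct thread_model {
	int id;
};

struct cpu_model {
	struct thread_model *current;   /* thread running on this core */
	struct thread_model *fpu_owner; /* thread whose FPU state is live here */
	int flush_count;                /* how often this core gave up its FPU */
};

/*
 * Meant to be called by a core while it spins on a lock.
 * Returns true if the core flushed its FPU context.
 */
bool flush_fpu_if_owner_runs_elsewhere(struct cpu_model cpus[NUM_CPUS], int self)
{
	assert(self >= 0 && self < NUM_CPUS);

	struct thread_model *owner = cpus[self].fpu_owner;

	if (owner == NULL) {
		return false; /* nothing live in this core's FPU */
	}
	for (int i = 0; i < NUM_CPUS; i++) {
		if (i != self && cpus[i].current == owner) {
			/* Stand-in for saving the registers back to the thread. */
			cpus[self].fpu_owner = NULL;
			cpus[self].flush_count++;
			return true;
		}
	}
	return false;
}
```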

npitre · 2023-05-19T16:38:11Z

Please have a look at PR #58086 which I think is a better solution.

andyross

This definitely has a race condition. It's not clear to me whether that's fatal for the technique or not.

I think the @npitre idea of ensuring that the IPI demanding the flush gets delivered is probably more robust, though doing it in the generic spinlock worries me too. Maybe there's a need for an arch-specific synchronization trick here? Wouldn't be the first.
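One way to picture "ensuring the flush request gets delivered" is a spin loop that checks for a pending request on every failed lock attempt. The sketch below uses C11 atomics and is not the code from #58086: `struct cpu_mailbox`, `service_flush_request()`, `spin_lock_with_relax()` and the `max_spins` bound (there only so the example terminates) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-core mailbox standing in for a pending FPU-flush IPI. */
struct cpu_mailbox {
	atomic_bool flush_requested; /* set by the core that needs the context */
	int flushes_done;            /* counts serviced requests */
};

/* Service a pending flush request, if any. Returns true if one was handled. */
bool service_flush_request(struct cpu_mailbox *self)
{
	if (atomic_exchange(&self->flush_requested, false)) {
		self->flushes_done++; /* stand-in for saving the FPU context */
		return true;
	}
	return false;
}

/*
 * Try to take the lock, servicing flush requests after each failed
 * attempt. Gives up after max_spins attempts and returns false.
 */
bool spin_lock_with_relax(atomic_flag *lock, struct cpu_mailbox *self,
			  unsigned int max_spins)
{
	assert(lock != NULL && self != NULL);

	for (unsigned int n = 0; n < max_spins; n++) {
		if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
			return true; /* lock acquired */
		}
		service_flush_request(self);
	}
	return false;
}
```

In this model a waiting core no longer depends on interrupts being unmasked to hand over the FPU context, which is the property the deadlock scenario needs.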

andyross · 2023-05-20T02:56:17Z

arch/arm64/core/fpu.c

+	unsigned int num_cpus = arch_num_cpus();
+	int i;
+
+	for (i = 0; i < num_cpus; i++) {


This is racy if you don't hold the scheduler lock. There's no way to ensure that the thread won't become current on another CPU after you've checked it.

povergoing · 2023-05-20

You are right, there's no way to ensure that. But speaking only of this case, we don't actually care which CPU fpu_owner is living on; we just want to guess that this CPU should no longer hold the FPU, and give the FPU context back to its thread. Anyway, agreed: I also don't think this is the right way, and I will investigate npitre's idea further.

andyross · 2023-05-20T14:48:37Z

What's the actual lock being contended? It's worth pointing out that arch_switch(), called for cooperative context switches, does not actually hold the scheduler lock (though it does enter the function with interrupts masked). With care, you might be able to unmask them around the "request foreign FPU flush" code and break the deadlock.

And my guess is that you're probably not deadlocking between interrupt handling, because if you were, the IPI mechanism might not be reliable anyway. (I'm not sure what the IPI priority is on arm64, but I doubt it's always going to preempt.)

povergoing · 2023-05-24T02:52:11Z

Closing this, since #58086 will benefit all architectures.

povergoing requested a review from npitre May 19, 2023 09:03

povergoing requested review from nashif, carlescufi, galak, MaureenHelm and carlocaione as code owners May 19, 2023 09:03

zephyrbot added the area: ARM64 ARM (64-bit) Architecture label May 19, 2023

zephyrbot requested a review from SgrrZhf May 19, 2023 09:04

zephyrbot assigned carlocaione May 19, 2023

carlocaione requested a review from andyross May 19, 2023 09:06

povergoing mentioned this pull request May 19, 2023

Fix misc issues and enable v8r64 FPU #58056

Merged

npitre mentioned this pull request May 19, 2023

fix possible deadlock with FPU sharing on ARM64 and RISC-V #58086

Merged

andyross requested changes May 20, 2023

povergoing closed this May 24, 2023

povergoing deleted the fix_arm64_fpu_deadlock branch May 24, 2023 02:52
