Fix arm64 FPU deadlock #58058
Conversation
…ction

Using the FPU with FPU sharing in a spin-locked critical section might cause a deadlock. When core 0 holds a spinlock that other cores are waiting for, it will deadlock if core 0's current thread's FPU context is living on another core that is waiting for the spinlock but never receives the FPU IPI. To avoid the deadlock, let the spinning cores flush their FPU when its fpu_owner is the current thread of another core.

Signed-off-by: Jaxson Han <[email protected]>
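As a rough sketch of that workaround (illustrative only: `arch_num_cpus()`, `_kernel.cpus[]` and `_current_cpu` follow Zephyr conventions, but the `fpu_owner` field access and the `z_arm64_flush_local_fpu()` helper are assumptions here, not necessarily the patch's exact code):

```c
/*
 * Illustrative sketch, not the actual patch: while spinning on a
 * contended lock, check whether the FPU context held by this CPU
 * belongs to a thread that is now current on another CPU; if so,
 * spill it back to that thread's save area so the other CPU never
 * has to send (and wait on) an FPU IPI.
 */
static void flush_fpu_if_owned_elsewhere(void)
{
	struct k_thread *owner = _current_cpu->arch.fpu_owner; /* assumed field */
	unsigned int num_cpus = arch_num_cpus();

	if (owner == NULL) {
		return;
	}

	for (unsigned int i = 0; i < num_cpus; i++) {
		if (i == _current_cpu->id) {
			continue;
		}
		if (_kernel.cpus[i].current == owner) {
			/* assumed helper: save the owner's FPU registers
			 * and mark this CPU's FPU as no longer owned */
			z_arm64_flush_local_fpu();
			break;
		}
	}
}
```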
Please have a look at PR #58086, which I think is a better solution.
This definitely has a race condition. It's not clear to me whether that's fatal for the technique or not.
I think @npitre's idea of ensuring that the IPI demanding the flush gets delivered is probably more robust, though doing it in the generic spinlock worries me too. Maybe there's a need for an arch-specific synchronization trick here? It wouldn't be the first.
```c
unsigned int num_cpus = arch_num_cpus();
int i;

for (i = 0; i < num_cpus; i++) {
```
This is racy if you don't hold the scheduler lock. There's no way to ensure that the thread won't become current on another CPU after you've checked it.
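Concretely, the window being described would sit here (a hypothetical annotation using the assumed names from the sketch above):

```c
for (i = 0; i < num_cpus; i++) {
	/* Snapshot of another CPU's current thread, taken without
	 * holding the scheduler lock. */
	struct k_thread *th = _kernel.cpus[i].current;

	if (th == _current_cpu->arch.fpu_owner) {
		/*
		 * Window: between the load above and the flush below,
		 * the scheduler may switch 'th' out of CPU i or onto
		 * another CPU, so the condition can already be stale
		 * by the time we act on it.
		 */
		z_arm64_flush_local_fpu(); /* assumed helper */
	}
}
```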
You are right, there's no way to ensure that. But speaking only of this case, we actually don't care about which CPU fpu_owner is living on; we just want to guarantee that this CPU no longer holds the FPU and gives the FPU context back to its thread. Anyway, agreed, I also don't think this is the right way and will investigate @npitre's idea further.
What's the actual lock being contended? It's worth pointing out that arch_switch(), called for cooperative context switches, does not actually hold the scheduler lock (though it does enter the function with interrupts masked). With care, you might be able to unmask them around the "request foreign FPU flush" code and break the deadlock. And my guess is that you're probably not deadlocking between interrupt handlers, because if you were, the IPI mechanism might not be reliable anyway? (Not sure what the IPI priority is on arm64, but I doubt it will always preempt.)
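A minimal sketch of that "open an interrupt window" idea, assuming the waiter spins with IRQs masked; the function and predicate names below are made up, but `msr daifclr, #2` and `msr daifset, #2` are the AArch64 instructions that unmask and re-mask IRQs:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch: while busy-waiting with IRQs masked,
 * periodically open a short window so that a pending FPU-flush IPI
 * aimed at this CPU can still be taken.
 */
static void wait_with_ipi_window(bool (*still_waiting)(void))
{
	while (still_waiting()) {	/* made-up wait predicate */
		/* Unmask IRQs: any pending IPI is serviced here. */
		__asm__ volatile("msr daifclr, #2" ::: "memory");
		__asm__ volatile("isb" ::: "memory");
		/* Re-mask before re-checking the wait condition. */
		__asm__ volatile("msr daifset, #2" ::: "memory");
	}
}
```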
Closing this, since #58086 will benefit all architectures.
[RFC] As discussed in #58056, this PR fixes the arm64 FPU deadlock, but it is really a workaround. Expecting a better way to fix the issue.

Using the FPU with FPU sharing in a spin-locked critical section might cause a deadlock. When core 0 holds a spinlock that other cores are waiting for, it will deadlock if core 0's current thread's FPU context is living on another core that is waiting for the spinlock but never receives the FPU IPI.

The issue was found in the test case test_stress in tests/lib/p4workq/ with SMP on the Arm FVP platform. The threads use a spinlock to synchronize: core 0 got the spinlock, and the other cores (say cores 1, 2, 3) are waiting in the cas loop (spinlock, with IRQs disabled). The test case on core 0 enters the spin-locked section and calls assert, which accesses FPU registers and thus causes an FPU trap. Core 0 then finds that the current thread's FPU context is living on core 1, so it raises an IPI to tell core 1 to save the FPU context. But core 1 is spinning on the lock with IRQs disabled, so a deadlock happens. A reduced sketch of this pattern is shown below.

CC @npitre @carlocaione
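For reference, the failing pattern reduces to roughly the following shape (illustrative only, not the actual p4wq test; the float arithmetic in the assert stands in for any FPU access inside the critical section):

```c
#include <zephyr/kernel.h>
#include <zephyr/spinlock.h>

static struct k_spinlock lock;

void worker(void *p1, void *p2, void *p3)
{
	/* Once core 0 wins, cores 1..3 spin here with IRQs masked. */
	k_spinlock_key_t key = k_spin_lock(&lock);

	/*
	 * Core 0, inside the critical section, touches the FPU. With
	 * lazy FPU sharing this traps; if the thread's FPU context
	 * lives on one of the spinning cores, core 0 sends an FPU
	 * flush IPI that can never be serviced: deadlock.
	 */
	volatile float x = 1.0f;
	__ASSERT(x + x == 2.0f, "FPU access inside spinlock");

	k_spin_unlock(&lock, key);
}
```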