Skip to content

Fix arm64 FPU deadlock #58058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

povergoing
Copy link
Member

[RFC] As discussed in #58056, this PR fixed arm64 FPU deadlock but is really a workaround. Expecting a better way to fix the issue

Using FPU with FPU sharing in a spin-locked critical section might cause a 'deadlock'. When core 0 got a spinlock that other cores are waiting for, it will 'deadlock' if the core 0 current threads' FPU context is living in another one that is waiting for the spinlock but never receives the FPU IPI.

The issue is found in the test case tests/lib/p4workq/ test_stress with SMP on the arm FVP platform
threads use spinlock to sync, core 0 got a spinlock, other cores (say core 1,2,3) are waiting in cas loop (spinlock, with IRQ disabled). The test case on core 0 goes into the spin-locked section and calls the assert which will access FPU registers thus causing an FPU trap. Then it finds out the current thread's FPU context is living in core 1. So core 0 raises IPI to inform core 1 to save the FPU context. But core 1 is in spinlock with IRQ disabled, so deadlock happens.

CC @npitre @carlocaione

…ction

Using FPU with FPU sharing in a spin-locked critical section might cause
'deadlock'. When core 0 got a spinlock that other cores are waiting for,
it will 'deadlock' if the core 0 current threads' FPU context is living
in another one that is waiting for the spinlock but never receives the
FPU IPI.

To avoid the 'deadlock', let the spinning cores flush their FPU if its
fpu_owner is the current thread of another core.

Signed-off-by: Jaxson Han <[email protected]>
@npitre
Copy link
Collaborator

npitre commented May 19, 2023

Please have a look at PR #58086 which I think is a better solution.

Copy link
Collaborator

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely has a race condition. It's not clear to me whether that's fatal for the technique or not.

I think the @npitre idea of ensuring that the IPI demanding the flush gets delivered is probably more robust, though doing it in the generic spinlock worries me too. Maybe there's a need for an arch-specific synchronization trick here? Wouldn't be the first.

unsigned int num_cpus = arch_num_cpus();
int i;

for (i = 0; i < num_cpus; i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is racy if you don't hold the scheduler lock. There's no way to ensure that the thread won't become current on another CPU after you've checked it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, there's no way to ensure that. But only speaking of this case, we actually don't care about which CPU fpu_owner is living on, we just want to guest that this CPU should no longer hold the FPU, and give the FPU context back to its thread. Anyway, agreed, I also don't think this is the right way and will investigate further on npitre idea.

@andyross
Copy link
Collaborator

What's the actual lock being contended? It's worth pointing out that arch_switch(), called for cooperative context switches, does not actually hold the scheduler lock (though it does enter the function with interrupts masked). With care, you might be able to unmask them around the "request foreign FPU flush" code and break the deadlock.

And my guess is that you're probably not deadlocking between interrupt handling, because if you were the IPI mechanism might not be reliable anyway? (Not sure what the IPI priority is on arm64, but I doubt it's always going to preempt).

@povergoing
Copy link
Member Author

Close this, since #58086 will benefit all architectures

@povergoing povergoing closed this May 24, 2023
@povergoing povergoing deleted the fix_arm64_fpu_deadlock branch May 24, 2023 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARM64 ARM (64-bit) Architecture
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants