🙏 Request for Help
I am stuck. The performance is consistently below the upstream version, despite my best efforts to use SVE efficiently. I would be very grateful if anyone can:
- Suggest potential optimization strategies for 128-bit SVE
- Point out common mistakes when porting from NEON to SVE
- Share similar experience with optimizing kernels on Dimensity SoCs
Any feedback or direction would be deeply appreciated. Thank you!
I also have a question: for example, using `svmul_f32` with SVE does not seem to be any faster than a regular scalar multiplication in my code.
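For context, here is a simplified, self-contained sketch of the kind of loop I mean (illustrative only; the function names are made up for this example and it is not my actual kernel):

```c
// Simplified sketch, not the real kernel.
// Compile with e.g.: clang -O3 -march=armv9-a+sve mul.c
#include <arm_sve.h>
#include <stddef.h>

// Scalar reference: y[i] = a[i] * b[i]
static void mul_scalar(float *y, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a[i] * b[i];
    }
}

// SVE version using svmul_f32_x. On a 128-bit SVE implementation
// svcntw() == 4, so each iteration handles 4 floats, the same as a
// NEON vmulq_f32 loop, and the loop can easily be memory-bound
// rather than compute-bound.
static void mul_sve(float *y, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t    pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, y + i, svmul_f32_x(pg, va, vb));
    }
}
```

My understanding is that with a 128-bit vector length this processes the same four floats per iteration as NEON, so I would not expect `svmul_f32` by itself to beat scalar code that the compiler already auto-vectorizes, but I would like to confirm whether that is the right way to think about it.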
🧩 System Information
- Device: iQOO Neo10
- SoC: MediaTek Dimensity 9400
- Memory: 16 GB RAM, 512 GB storage
- ISA: Armv9 with SVE 128-bit support
- Model tested: Qwen2.5-1.5B-Instruct-Q4_0.gguf
- FA mode: 1
- Threads: 4, 8
- llama.cpp build: 7a84777 (5054)
This is my code:
and this is my llama-bench:
and this is the upstream llama-bench: