vpermb (_mm256_permutexvar_epi8) byte transpose compiles to multiple XMM shuffles if the result is stored #116931
Author: Peter Cordes (pcordes)
Linked commits:

- …t patterns as well as 512-bit (llvm#127392): The 512-bit filter was to prevent AVX1/2 regressions, but most of that is now handled by `canonicalizeShuffleWithOp`. Ideally we need to support smaller element widths as well. Noticed while triaging llvm#116931.
- …) -> shuffle(concat(x,x),concat(y,y),m3) on VBMI targets (llvm#130134): With VBMI we are guaranteed to support cross-lane 256-bit shuffles, so subvector splats should always be cheap. Fixes llvm#116931.

For some patterns of shuffle constant, we miss compiling `_mm256_permutexvar_epi8` to `vpermb ymm` if the result is only stored, not used in ways that require it as a single 256-bit vector. The worse version is 5 to 6 XMM shuffle instructions, so it's worse even on Zen 1 or a future Intel E-core with AVX10. This happens even when inlining into a loop and unrolling.

Present in all Clang versions as far back as the first one to support AVX-512VBMI, and in current trunk (Godbolt).
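The exact Godbolt source isn't included on this page, so here is a minimal sketch of the kind of reproducer described, assuming a function that does only a `_mm256_permutexvar_epi8` shuffle followed by a store (compile with `-mavx512vbmi -mavx512vl`; the function name and the particular shuffle constant are illustrative, and this constant may or may not be one of the patterns that triggers the bad codegen):

```c
#include <immintrin.h>

// Illustrative byte-transpose-style control vector: interleave bytes from four
// 8-byte rows. Requires AVX-512VBMI + AVX-512VL for the 256-bit vpermb.
void shuffle_and_store(char *dst, __m256i v) {
    const __m256i ctrl = _mm256_setr_epi8(
         0,  8, 16, 24,  1,  9, 17, 25,
         2, 10, 18, 26,  3, 11, 19, 27,
         4, 12, 20, 28,  5, 13, 21, 29,
         6, 14, 22, 30,  7, 15, 23, 31);
    __m256i t = _mm256_permutexvar_epi8(ctrl, v);  // ideally a single vpermb ymm
    _mm256_storeu_si256((__m256i *)dst, t);        // result is only stored
}
```

The complaint is specifically about this store-only use: instead of one `vpermb ymm` feeding one 256-bit store, the reported codegen splits it into several XMM shuffles feeding two 128-bit stores.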
This compiles as expected when there is an inline asm black box between the shuffle and the store.
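The report's exact black-box code isn't shown on this page; a common way to write such a barrier with GNU inline asm is an empty `asm` statement that takes the vector as an input/output register operand, so here is a sketch under that assumption:

```c
#include <immintrin.h>

// Hypothetical illustration of the "black box": the empty asm statement emits no
// instructions but is opaque to the optimizer, so the shuffled value must be
// materialized in a single ymm register before the store.
void shuffle_blackbox_store(char *dst, __m256i v, __m256i ctrl) {
    __m256i t = _mm256_permutexvar_epi8(ctrl, v);
    __asm__("" : "+x"(t));                   // compiler barrier on the vector value
    _mm256_storeu_si256((__m256i *)dst, t);  // plain 256-bit store of that register
}
```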
But without the inline asm black box between the shuffle and store, Clang spends 5 or 6 shuffle uops to feed two 128-bit stores. (This is obviously much less efficient; `vpermb` is single-uop on every CPU that supports it. At worst 6c latency on Zen 4 for the 512-bit version, but this is the 256-bit version so 4c latency there.)
Or if `v` is mutated before the shuffle+store, e.g. with `v = _mm256_add_epi8(v,v);`, the shuffle choice becomes symmetric between low half and extracted high half, instead of using `vpermq $0xFF` to broadcast the high qword.
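A sketch of that mutated variant (again illustrative: the function shape and passing `ctrl` as a parameter are assumptions, not the report's exact code):

```c
#include <immintrin.h>

// Same store-only pattern, but v is modified first; per the report this changes
// how the split codegen handles the low half vs. the extracted high half.
void mutate_shuffle_store(char *dst, __m256i v, __m256i ctrl) {
    v = _mm256_add_epi8(v, v);                     // mutate v before the shuffle+store
    __m256i t = _mm256_permutexvar_epi8(ctrl, v);
    _mm256_storeu_si256((__m256i *)dst, t);
}
```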
There's no correctness problem, just performance; I tested with memcmp in a test `main` in the Godbolt link.

The extra instructions take more space than the 16 bytes saved by using a narrower shuffle-control vector. (18 bytes of code to load a shuffle control vector, `vpermb`, and a single `vmovdqa` store; plus the 32B constant, that's 50 bytes of static size.) vs. the version with `vpermq` being 42B of code + 16B of data = 58B; the other is 1 byte smaller (not counting the extra vpaddb). So not appropriate even for `-Oz`.

(The best I was able to do by hand was 53 bytes, using `mov $4, %al` ; `vpbroadcastb %eax, %xmm2` ; `vpermq $255, %ymm0, %ymm0` to get something to add to the high lane of a broadcasted 16-byte vector to generate an input for vpermb. `push $4` ; `vpbroadcastb (%rsp), %xmm2` is also 8 bytes if restoring RSP is free. xor-zero + `vgf2p8affineqb $0x04, zero,zero, %dst` is 10 bytes.)