
vpermb (_mm256_permutexvar_epi8) byte transpose compiles to multiple XMM shuffles if the result is stored #116931


Closed
pcordes opened this issue Nov 20, 2024 · 1 comment · Fixed by #130134

@pcordes

pcordes commented Nov 20, 2024

For some shuffle-constant patterns, we miss compiling _mm256_permutexvar_epi8 to vpermb ymm if the result is only stored, not used in a way that requires it as a single 256-bit vector. The bad version is 5 or 6 XMM shuffle instructions, so it's worse even on Zen 1 or a future Intel E-core with AVX10.

This happens even when inlining into a loop and unrolling.

This is present in all Clang versions as far back as the first one to support AVX-512VBMI, and in current trunk (Godbolt).

#include <immintrin.h>
#include <stdint.h>

__attribute__((noinline))
void shufstore_v2(void *out, __m256i v){
    static const uint32_t by8  = 0x18100800;  // low byte of each qword
    static const uint32_t ones = 0x01010101;  // later dwords get the second, etc. byte of each src qword
    __m256i byteshuf = _mm256_setr_epi32(by8 + ones*0, by8 + ones*1, by8 + ones*2, by8 + ones*3,
                                         by8 + ones*4, by8 + ones*5, by8 + ones*6, by8 + ones*7 );
    v = _mm256_permutexvar_epi8(byteshuf, v);
    asm("  nop # picked %0" : "+x"(v));   // require the complete 256-bit vector to exist in a single register

    _mm256_store_si256((__m256i *)out, v);
}
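
For reference, a plain-C sketch of the byte transpose this control vector encodes (byte_transpose_ref is a hypothetical helper for illustration, not part of the Godbolt test): control byte 4*i+q is 8*q+i, so result dword i collects byte i of each of the four source qwords.

// Scalar reference for the permute above: treat the source as a 4x8 byte
// matrix (4 qwords of 8 bytes each) and transpose it, so result dword i
// holds byte i of each source qword.
static void byte_transpose_ref(uint8_t out[32], const uint8_t in[32]){
    for (int i = 0; i < 8; i++)        // byte position within each qword
        for (int q = 0; q < 4; q++)    // source qword index
            out[4*i + q] = in[8*q + i];
}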

With the inline asm black box in place, this compiles as expected:

shufstore_v2:
        vmovdqa .LCPI1_0(%rip), %ymm1
        vpermb  %ymm0, %ymm1, %ymm0
        nop     # picked %ymm0
        vmovaps %ymm0, (%rdi)
        vzeroupper
        retq

But without the inline asm black box between the shuffle and store (i.e. shufstore_v2 with the asm statement removed), Clang spends 5 or 6 shuffle uops to feed two 128-bit stores. (This is obviously much less efficient; vpermb is single-uop on every CPU that supports it, with at worst 6c latency on Zen 4 for the 512-bit version, and this is the 256-bit version, so 4c latency there.)

shufstore:
        vpshufb .LCPI0_0(%rip), %xmm0, %xmm1
        vextracti128    $1, %ymm0, %xmm2
        vpermq  $255, %ymm0, %ymm0
        vpunpcklbw      %xmm0, %xmm2, %xmm0
        vpunpcklwd      %xmm0, %xmm1, %xmm2
        vpunpckhwd      %xmm0, %xmm1, %xmm0
        vmovdqa %xmm0, 16(%rdi)
        vmovdqa %xmm2, (%rdi)
        vzeroupper
        retq

Or if v is mutated before the shuffle+store, e.g. with v = _mm256_add_epi8(v,v), the shuffle choice becomes symmetric between the low half and the extracted high half, instead of using vpermq $0xFF to broadcast the high qword.
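
For reference, a sketch of that variant's source (a reconstruction from the description above; the original Godbolt source isn't reproduced here), with byteshuf built exactly as in shufstore_v2:

__attribute__((noinline))
void shufstore_mutated(void *out, __m256i v){
    static const uint32_t by8  = 0x18100800;
    static const uint32_t ones = 0x01010101;
    __m256i byteshuf = _mm256_setr_epi32(by8 + ones*0, by8 + ones*1, by8 + ones*2, by8 + ones*3,
                                         by8 + ones*4, by8 + ones*5, by8 + ones*6, by8 + ones*7 );
    v = _mm256_add_epi8(v, v);                 // mutate v before the shuffle + store
    v = _mm256_permutexvar_epi8(byteshuf, v);
    _mm256_store_si256((__m256i *)out, v);     // no asm black box this time
}

Clang produces: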

shufstore_mutated:
        vpaddb  %ymm0, %ymm0, %ymm0
        vextracti128    $1, %ymm0, %xmm1
        vmovdqa .LCPI0_0(%rip), %xmm2
        vpshufb %xmm2, %xmm0, %xmm0
        vpshufb %xmm2, %xmm1, %xmm1
        vpunpcklwd      %xmm1, %xmm0, %xmm2
        vpunpckhwd      %xmm1, %xmm0, %xmm0
        vmovdqa %xmm0, 16(%rdi)
        vmovdqa %xmm2, (%rdi)
        vzeroupper
        retq

There's no correctness problem, just performance; I tested with memcmp in a test main in the Godbolt link.
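
For concreteness, a minimal sketch of that kind of check (not the actual Godbolt test main; it reuses shufstore_v2 and the byte_transpose_ref helper sketched above, plus their includes, and needs a CPU with AVX-512VBMI to run):

#include <string.h>
#include <stdio.h>

int main(void){
    _Alignas(32) uint8_t src[32], want[32], got[32];
    for (int i = 0; i < 32; i++) src[i] = (uint8_t)(i * 7 + 1);  // arbitrary test pattern

    byte_transpose_ref(want, src);                               // scalar reference result
    shufstore_v2(got, _mm256_load_si256((const __m256i *)src));  // vector version under test

    puts(memcmp(want, got, 32) == 0 ? "match" : "MISMATCH");
    return 0;
}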

The extra instructions take more space than the 16 bytes saved by using a narrower shuffle-control vector: the vpermb version is 18 bytes of code (load the shuffle-control vector, vpermb, and a single vmovdqa store) plus a 32-byte constant, for 50 bytes of static size, vs. 42 bytes of code + 16 bytes of data = 58 bytes for the version using vpermq, with the symmetric version 1 byte smaller than that (not counting the extra vpaddb). So the split-shuffle output isn't appropriate even for -Oz.
(The best I was able to do by hand was 53 bytes, using mov $4, %al ; vpbroadcastb %eax, %xmm2 ; vpermq $255, %ymm0, %ymm0 to get something to add to the high lane of a broadcasted 16-byte vector to generate an input for vpermb. push $4 ; vpbroadcastb (%rsp), %xmm2 is also 8 bytes if restoring RSP is free. xor-zero + vgf2p8affineqb $0x04, zero,zero, %dst is 10 bytes.)

@llvmbot

llvmbot commented Nov 20, 2024

@llvm/issue-subscribers-backend-x86

Author: Peter Cordes (pcordes)


RKSimon self-assigned this Nov 20, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 16, 2025
…t patterns as well as 512-bit

The 512-bit filter was to prevent AVX1/2 regressions, but most of that is now handled by canonicalizeShuffleWithOp

Ideally we need to support smaller element widths as well.

Noticed while triaging llvm#116931
RKSimon added a commit that referenced this issue Feb 17, 2025
…t patterns as well as 512-bit (#127392)

RKSimon added a commit that referenced this issue Feb 17, 2025
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this issue Feb 24, 2025
…t patterns as well as 512-bit (llvm#127392)

sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this issue Feb 24, 2025
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Mar 6, 2025
…) -> shuffle(concat(x,x),concat(y,y),m3) on VBMI targets

With VBMI we are guaranteed to support cross-lane 256-bit shuffles, so subvector splats should always be cheap.

Fixes llvm#116931
RKSimon closed this as completed in 52bc812 Mar 7, 2025
jph-13 pushed a commit to jph-13/llvm-project that referenced this issue Mar 21, 2025
…) -> shuffle(concat(x,x),concat(y,y),m3) on VBMI targets (llvm#130134)
