
vpermb (_mm256_permutexvar_epi8) byte transpose compiles to multiple XMM shuffles if the result is stored #116931


Closed
pcordes opened this issue Nov 20, 2024 · 1 comment · Fixed by #130134

@pcordes

pcordes commented Nov 20, 2024

For some shuffle-constant patterns, we miss compiling _mm256_permutexvar_epi8 to vpermb ymm if the result is only stored, not used in a way that requires it as a single 256-bit vector. The bad version is 5 or 6 XMM shuffle instructions, so it's worse even on Zen 1 or a future Intel E-core with AVX10.

This happens even when inlining into a loop and unrolling.

This is present in all Clang versions as far back as the first one to support AVX-512VBMI, and in current trunk (Godbolt).

#include <immintrin.h>
#include <stdint.h>

__attribute__((noinline))
void shufstore_v2(void *out, __m256i v){
    static const uint32_t by8  = 0x18100800;  // low byte of each qword
    static const uint32_t ones = 0x01010101;  // later dwords get the second, etc. byte of each src qword
    __m256i byteshuf = _mm256_setr_epi32(by8 + ones*0, by8 + ones*1, by8 + ones*2, by8 + ones*3,
                                         by8 + ones*4, by8 + ones*5, by8 + ones*6, by8 + ones*7 );
    v = _mm256_permutexvar_epi8(byteshuf, v);
    asm("  nop # picked %0" : "+x"(v));   // require the complete 256-bit vector to exist in a single register

    _mm256_store_si256((__m256i *)out, v);
}
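
For reference, a plain-C sketch of the byte transpose this control vector encodes (byte_transpose_ref is a hypothetical helper for illustration, not part of the Godbolt test): control byte 4*i+q is 8*q+i, so result dword i collects byte i of each of the four source qwords.

// Scalar reference for the permute above: treat the source as a 4x8 byte
// matrix (4 qwords of 8 bytes each) and transpose it, so result dword i
// holds byte i of each source qword.
static void byte_transpose_ref(uint8_t out[32], const uint8_t in[32]){
    for (int i = 0; i < 8; i++)        // byte position within each qword
        for (int q = 0; q < 4; q++)    // source qword index
            out[4*i + q] = in[8*q + i];
}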

With the inline asm black box in place, this compiles as expected:

shufstore_v2:
        vmovdqa .LCPI1_0(%rip), %ymm1
        vpermb  %ymm0, %ymm1, %ymm0
        nop     # picked %ymm0
        vmovaps %ymm0, (%rdi)
        vzeroupper
        retq

But without the inline asm black box between the shuffle and store (i.e. shufstore_v2 with the asm statement removed), Clang spends 5 or 6 shuffle uops to feed two 128-bit stores. (This is obviously much less efficient; vpermb is single-uop on every CPU that supports it, with at worst 6c latency on Zen 4 for the 512-bit version, and this is the 256-bit version, so 4c latency there.)

shufstore:
        vpshufb .LCPI0_0(%rip), %xmm0, %xmm1
        vextracti128    $1, %ymm0, %xmm2
        vpermq  $255, %ymm0, %ymm0
        vpunpcklbw      %xmm0, %xmm2, %xmm0
        vpunpcklwd      %xmm0, %xmm1, %xmm2
        vpunpckhwd      %xmm0, %xmm1, %xmm0
        vmovdqa %xmm0, 16(%rdi)
        vmovdqa %xmm2, (%rdi)
        vzeroupper
        retq

Or if v is mutated before the shuffle+store, e.g. with v = _mm256_add_epi8(v,v), the shuffle choice becomes symmetric between the low half and the extracted high half, instead of using vpermq $0xFF to broadcast the high qword.
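
For reference, a sketch of that variant's source (a reconstruction from the description above; the original Godbolt source isn't reproduced here), with byteshuf built exactly as in shufstore_v2:

__attribute__((noinline))
void shufstore_mutated(void *out, __m256i v){
    static const uint32_t by8  = 0x18100800;
    static const uint32_t ones = 0x01010101;
    __m256i byteshuf = _mm256_setr_epi32(by8 + ones*0, by8 + ones*1, by8 + ones*2, by8 + ones*3,
                                         by8 + ones*4, by8 + ones*5, by8 + ones*6, by8 + ones*7 );
    v = _mm256_add_epi8(v, v);                 // mutate v before the shuffle + store
    v = _mm256_permutexvar_epi8(byteshuf, v);
    _mm256_store_si256((__m256i *)out, v);     // no asm black box this time
}

Clang produces: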

shufstore_mutated:
        vpaddb  %ymm0, %ymm0, %ymm0
        vextracti128    $1, %ymm0, %xmm1
        vmovdqa .LCPI0_0(%rip), %xmm2
        vpshufb %xmm2, %xmm0, %xmm0
        vpshufb %xmm2, %xmm1, %xmm1
        vpunpcklwd      %xmm1, %xmm0, %xmm2
        vpunpckhwd      %xmm1, %xmm0, %xmm0
        vmovdqa %xmm0, 16(%rdi)
        vmovdqa %xmm2, (%rdi)
        vzeroupper
        retq

There's no correctness problem, just performance; I tested with memcmp in a test main in the Godbolt link.
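
For concreteness, a minimal sketch of that kind of check (not the actual Godbolt test main; it reuses shufstore_v2 and the byte_transpose_ref helper sketched above, plus their includes, and needs a CPU with AVX-512VBMI to run):

#include <string.h>
#include <stdio.h>

int main(void){
    _Alignas(32) uint8_t src[32], want[32], got[32];
    for (int i = 0; i < 32; i++) src[i] = (uint8_t)(i * 7 + 1);  // arbitrary test pattern

    byte_transpose_ref(want, src);                               // scalar reference result
    shufstore_v2(got, _mm256_load_si256((const __m256i *)src));  // vector version under test

    puts(memcmp(want, got, 32) == 0 ? "match" : "MISMATCH");
    return 0;
}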

The extra instructions take more space than the 16 bytes saved by using a narrower shuffle-control vector: the vpermb version is 18 bytes of code (load the shuffle-control vector, vpermb, and a single vmovdqa store) plus a 32-byte constant, for 50 bytes of static size, vs. 42 bytes of code + 16 bytes of data = 58 bytes for the version using vpermq, with the symmetric version 1 byte smaller than that (not counting the extra vpaddb). So the split-shuffle output isn't appropriate even for -Oz.
(The best I was able to do by hand was 53 bytes, using mov $4, %al ; vpbroadcastb %eax, %xmm2 ; vpermq $255, %ymm0, %ymm0 to get something to add to the high lane of a broadcasted 16-byte vector to generate an input for vpermb. push $4 ; vpbroadcastb (%rsp), %xmm2 is also 8 bytes if restoring RSP is free. xor-zero + vgf2p8affineqb $0x04, zero,zero, %dst is 10 bytes.)

@llvmbot

llvmbot commented Nov 20, 2024

@llvm/issue-subscribers-backend-x86

Author: Peter Cordes (pcordes)


RKSimon self-assigned this Nov 20, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 16, 2025
…t patterns as well as 512-bit

The 512-bit filter was to prevent AVX1/2 regressions, but most of that is now handled by canonicalizeShuffleWithOp

Ideally we need to support smaller element widths as well.

Noticed while triaging llvm#116931
RKSimon added a commit that referenced this issue Feb 17, 2025
…t patterns as well as 512-bit (#127392)

RKSimon added a commit that referenced this issue Feb 17, 2025
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this issue Feb 24, 2025
…t patterns as well as 512-bit (llvm#127392)

sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this issue Feb 24, 2025
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Mar 6, 2025
…) -> shuffle(concat(x,x),concat(y,y),m3) on VBMI targets

With VBMI we are guaranteed to support cross-lane 256-bit shuffles, so subvector splats should always be cheap.

Fixes llvm#116931
RKSimon closed this as completed in 52bc812 Mar 7, 2025
jph-13 pushed a commit to jph-13/llvm-project that referenced this issue Mar 21, 2025
…) -> shuffle(concat(x,x),concat(y,y),m3) on VBMI targets (llvm#130134)
