
[AArch64][SVE] Improve fixed-length addressing modes. #129732


Merged

Conversation

Contributor

@rj-jesus rj-jesus commented Mar 4, 2025

This is based on the discussion in #127500 (comment) and subsequent comments.

When compiling VLS SVE code, the compiler often replaces VL-based offsets with immediate-based ones. This leads to a mismatch in the allowed addressing modes, because SVE loads/stores generally expect immediate offsets that are multiples of VL. For example, given:

```c
svfloat64_t foo(const double *x) {
  svbool_t pg = svptrue_b64();
  return svld1_f64(pg, x+svcntd());
}
```

When compiled with -msve-vector-bits=128, we currently generate:

```gas
foo:
        ptrue   p0.d
        mov     x8, #2
        ld1d    { z0.d }, p0/z, [x0, x8, lsl #3]
        ret
```

Instead, we could be generating:

```gas
foo:
        ldr     z0, [x0, #1, mul vl]
        ret
```

Likewise for other types, stores, and other VLS lengths.

This patch attempts to achieve the above by extending `SelectAddrModeIndexedSVE` to let constants through when `vscale` is known.
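To make the mapping concrete, here is a minimal sketch (not the patch itself; `byteOffsetToMulVL` is a hypothetical helper) of how a plain byte offset can be re-expressed as a `mul vl` immediate once the vector length is a known constant:

```c++
#include <cstdint>
#include <iostream>
#include <optional>

// Illustrative only: with -msve-vector-bits=N the vector length is fixed, so a
// byte offset that is an exact multiple of the vector length in bytes can use
// the reg+imm "mul vl" addressing mode.
std::optional<int64_t> byteOffsetToMulVL(int64_t ByteOffset,
                                         unsigned FixedVectorBits) {
  const int64_t VLBytes = FixedVectorBits / 8; // e.g. 16 bytes at VL=128
  if (VLBytes == 0 || ByteOffset % VLBytes != 0)
    return std::nullopt; // not representable as "#imm, mul vl"
  return ByteOffset / VLBytes;
}

int main() {
  // x + svcntd() at VL=128 is a 16-byte offset, i.e. "#1, mul vl".
  if (auto Imm = byteOffsetToMulVL(16, 128))
    std::cout << "ldr z0, [x0, #" << *Imm << ", mul vl]\n";
}
```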

rj-jesus added 2 commits March 4, 2025 03:22
When compiling VLS SVE, the compiler often replaces VL-based offsets
with immediate-based ones. This leads to a mismatch during isel since
SVE loads/stores generally expect immediate offsets relative to VL.

For example, given:
```c

svfloat64_t foo(const double *x) {
  svbool_t pg = svptrue_b64();
  return svld1_f64(pg, x+svcntd());
}
```

When compiled with `-msve-vector-bits=128`, we currently generate:
```gas
foo:
        ptrue   p0.d
        mov     x8, #2
        ld1d    { z0.d }, p0/z, [x0, x8, lsl #3]
        ret
```

In practice, we could instead be generating:
```gas
foo:
        ldr     z0, [x0, #1, mul vl]
        ret
```

Likewise for other types, stores, and other VLS lengths.
@llvmbot llvmbot added the clang and backend:AArch64 labels Mar 4, 2025
Member

llvmbot commented Mar 4, 2025

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-clang

Author: Ricardo Jesus (rj-jesus)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/129732.diff

4 Files Affected:

  • (modified) clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c (+3-6)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp (+17-3)
  • (added) llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll (+187)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+45-45)
diff --git a/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c b/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c
index 0ed14b4b3b793..1391a1b09fbd1 100644
--- a/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c
+++ b/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c
@@ -13,12 +13,9 @@
 
 void func(int *restrict a, int *restrict b) {
 // CHECK-LABEL: func
-// CHECK256-COUNT-1: str
-// CHECK256-COUNT-7: st1w
-// CHECK512-COUNT-1: str
-// CHECK512-COUNT-3: st1w
-// CHECK1024-COUNT-1: str
-// CHECK1024-COUNT-1: st1w
+// CHECK256-COUNT-8: str
+// CHECK512-COUNT-4: str
+// CHECK1024-COUNT-2: str
 // CHECK2048-COUNT-1: st1w
 #pragma clang loop vectorize(enable)
   for (int i = 0; i < 64; ++i)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
index 3ca9107cb2ce5..2459b17e68c36 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
@@ -7379,13 +7379,27 @@ bool AArch64DAGToDAGISel::SelectAddrModeIndexedSVE(SDNode *Root, SDValue N,
   if (N.getOpcode() != ISD::ADD)
     return false;
 
-  SDValue VScale = N.getOperand(1);
-  if (VScale.getOpcode() != ISD::VSCALE)
+  int64_t MulImm = std::numeric_limits<int64_t>::max();
+  if (SDValue VScale = N.getOperand(1); VScale.getOpcode() == ISD::VSCALE)
+    MulImm = cast<ConstantSDNode>(VScale.getOperand(0))->getSExtValue();
+  else if (auto C = dyn_cast<ConstantSDNode>(N.getOperand(1))) {
+    int64_t ByteOffset = C->getSExtValue();
+    constexpr auto SVEBitsPerBlock = AArch64::SVEBitsPerBlock;
+    auto MinVScale = Subtarget->getMinSVEVectorSizeInBits() / SVEBitsPerBlock;
+    auto MaxVScale = Subtarget->getMaxSVEVectorSizeInBits() / SVEBitsPerBlock;
+
+    if (!MaxVScale || MinVScale != MaxVScale || ByteOffset % MaxVScale != 0)
+      return false;
+
+    MulImm = ByteOffset / MaxVScale;
+  } else
     return false;
 
+  assert(MulImm != std::numeric_limits<int64_t>::max() &&
+         "Uninitialized MulImm.");
+
   TypeSize TS = MemVT.getSizeInBits();
   int64_t MemWidthBytes = static_cast<int64_t>(TS.getKnownMinValue()) / 8;
-  int64_t MulImm = cast<ConstantSDNode>(VScale.getOperand(0))->getSExtValue();
 
   if ((MulImm % MemWidthBytes) != 0)
     return false;
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll
new file mode 100644
index 0000000000000..6b25ce3abdc8c
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll
@@ -0,0 +1,187 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=128 -aarch64-sve-vector-bits-max=128 < %s | FileCheck %s --check-prefix=CHECK-128
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=256 -aarch64-sve-vector-bits-max=256 < %s | FileCheck %s --check-prefix=CHECK-256
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=512 -aarch64-sve-vector-bits-max=512 < %s | FileCheck %s --check-prefix=CHECK-512
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=1024 -aarch64-sve-vector-bits-max=1024 < %s | FileCheck %s --check-prefix=CHECK-1024
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=2048 -aarch64-sve-vector-bits-max=2048 < %s | FileCheck %s --check-prefix=CHECK-2048
+
+define void @nxv16i8(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.b
+; CHECK-NEXT:    mov w8, #256 // =0x100
+; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0, x8]
+; CHECK-NEXT:    st1b { z0.b }, p0, [x1, x8]
+; CHECK-NEXT:    ret
+;
+; CHECK-128-LABEL: nxv16i8:
+; CHECK-128:       // %bb.0:
+; CHECK-128-NEXT:    ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT:    str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT:    ret
+;
+; CHECK-256-LABEL: nxv16i8:
+; CHECK-256:       // %bb.0:
+; CHECK-256-NEXT:    ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT:    str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT:    ret
+;
+; CHECK-512-LABEL: nxv16i8:
+; CHECK-512:       // %bb.0:
+; CHECK-512-NEXT:    ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT:    str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT:    ret
+;
+; CHECK-1024-LABEL: nxv16i8:
+; CHECK-1024:       // %bb.0:
+; CHECK-1024-NEXT:    ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT:    str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT:    ret
+;
+; CHECK-2048-LABEL: nxv16i8:
+; CHECK-2048:       // %bb.0:
+; CHECK-2048-NEXT:    ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT:    str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT:    ret
+  %ldoff = getelementptr inbounds nuw i8, ptr %ldptr, i64 256
+  %stoff = getelementptr inbounds nuw i8, ptr %stptr, i64 256
+  %x = load <vscale x 16 x i8>, ptr %ldoff, align 1
+  store <vscale x 16 x i8> %x, ptr %stoff, align 1
+  ret void
+}
+
+define void @nxv8i16(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h
+; CHECK-NEXT:    mov x8, #128 // =0x80
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
+; CHECK-NEXT:    st1h { z0.h }, p0, [x1, x8, lsl #1]
+; CHECK-NEXT:    ret
+;
+; CHECK-128-LABEL: nxv8i16:
+; CHECK-128:       // %bb.0:
+; CHECK-128-NEXT:    ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT:    str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT:    ret
+;
+; CHECK-256-LABEL: nxv8i16:
+; CHECK-256:       // %bb.0:
+; CHECK-256-NEXT:    ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT:    str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT:    ret
+;
+; CHECK-512-LABEL: nxv8i16:
+; CHECK-512:       // %bb.0:
+; CHECK-512-NEXT:    ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT:    str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT:    ret
+;
+; CHECK-1024-LABEL: nxv8i16:
+; CHECK-1024:       // %bb.0:
+; CHECK-1024-NEXT:    ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT:    str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT:    ret
+;
+; CHECK-2048-LABEL: nxv8i16:
+; CHECK-2048:       // %bb.0:
+; CHECK-2048-NEXT:    ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT:    str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT:    ret
+  %ldoff = getelementptr inbounds nuw i16, ptr %ldptr, i64 128
+  %stoff = getelementptr inbounds nuw i16, ptr %stptr, i64 128
+  %x = load <vscale x 8 x i16>, ptr %ldoff, align 2
+  store <vscale x 8 x i16> %x, ptr %stoff, align 2
+  ret void
+}
+
+define void @nxv4i32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    mov x8, #64 // =0x40
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
+; CHECK-NEXT:    ret
+;
+; CHECK-128-LABEL: nxv4i32:
+; CHECK-128:       // %bb.0:
+; CHECK-128-NEXT:    ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT:    str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT:    ret
+;
+; CHECK-256-LABEL: nxv4i32:
+; CHECK-256:       // %bb.0:
+; CHECK-256-NEXT:    ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT:    str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT:    ret
+;
+; CHECK-512-LABEL: nxv4i32:
+; CHECK-512:       // %bb.0:
+; CHECK-512-NEXT:    ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT:    str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT:    ret
+;
+; CHECK-1024-LABEL: nxv4i32:
+; CHECK-1024:       // %bb.0:
+; CHECK-1024-NEXT:    ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT:    str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT:    ret
+;
+; CHECK-2048-LABEL: nxv4i32:
+; CHECK-2048:       // %bb.0:
+; CHECK-2048-NEXT:    ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT:    str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT:    ret
+  %ldoff = getelementptr inbounds nuw i32, ptr %ldptr, i64 64
+  %stoff = getelementptr inbounds nuw i32, ptr %stptr, i64 64
+  %x = load <vscale x 4 x i32>, ptr %ldoff, align 4
+  store <vscale x 4 x i32> %x, ptr %stoff, align 4
+  ret void
+}
+
+define void @nxv2i64(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    mov x8, #32 // =0x20
+; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
+; CHECK-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
+; CHECK-NEXT:    ret
+;
+; CHECK-128-LABEL: nxv2i64:
+; CHECK-128:       // %bb.0:
+; CHECK-128-NEXT:    ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT:    str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT:    ret
+;
+; CHECK-256-LABEL: nxv2i64:
+; CHECK-256:       // %bb.0:
+; CHECK-256-NEXT:    ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT:    str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT:    ret
+;
+; CHECK-512-LABEL: nxv2i64:
+; CHECK-512:       // %bb.0:
+; CHECK-512-NEXT:    ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT:    str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT:    ret
+;
+; CHECK-1024-LABEL: nxv2i64:
+; CHECK-1024:       // %bb.0:
+; CHECK-1024-NEXT:    ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT:    str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT:    ret
+;
+; CHECK-2048-LABEL: nxv2i64:
+; CHECK-2048:       // %bb.0:
+; CHECK-2048-NEXT:    ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT:    str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT:    ret
+  %ldoff = getelementptr inbounds nuw i64, ptr %ldptr, i64 32
+  %stoff = getelementptr inbounds nuw i64, ptr %stptr, i64 32
+  %x = load <vscale x 2 x i64>, ptr %ldoff, align 8
+  store <vscale x 2 x i64> %x, ptr %stoff, align 8
+  ret void
+}
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll
index e33bc8da97c05..2d4cdfa7278b9 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll
@@ -30,64 +30,64 @@ define void @crash_when_lowering_extract_shuffle(ptr %dst, i1 %cond) vscale_rang
 ; CHECK-NEXT:  // %bb.1: // %vector.body
 ; CHECK-NEXT:    mov z0.b, #0 // =0x0
 ; CHECK-NEXT:    ptrue p0.s
-; CHECK-NEXT:    mov x9, #8 // =0x8
-; CHECK-NEXT:    mov x10, #24 // =0x18
+; CHECK-NEXT:    mov x9, #24 // =0x18
 ; CHECK-NEXT:    umov w8, v0.b[8]
-; CHECK-NEXT:    mov v1.16b, v0.16b
-; CHECK-NEXT:    mov v1.b[1], v0.b[1]
-; CHECK-NEXT:    fmov s2, w8
-; CHECK-NEXT:    mov x8, #16 // =0x10
-; CHECK-NEXT:    mov v2.b[1], v0.b[9]
-; CHECK-NEXT:    mov v1.b[2], v0.b[2]
-; CHECK-NEXT:    mov v2.b[2], v0.b[10]
-; CHECK-NEXT:    mov v1.b[3], v0.b[3]
-; CHECK-NEXT:    mov v2.b[3], v0.b[11]
-; CHECK-NEXT:    mov v1.b[4], v0.b[4]
-; CHECK-NEXT:    mov v2.b[4], v0.b[12]
-; CHECK-NEXT:    mov v1.b[5], v0.b[5]
-; CHECK-NEXT:    mov v2.b[5], v0.b[13]
-; CHECK-NEXT:    mov v1.b[6], v0.b[6]
-; CHECK-NEXT:    mov v2.b[6], v0.b[14]
-; CHECK-NEXT:    mov v1.b[7], v0.b[7]
-; CHECK-NEXT:    mov v2.b[7], v0.b[15]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
-; CHECK-NEXT:    uunpklo z1.h, z1.b
-; CHECK-NEXT:    ext v3.16b, v0.16b, v0.16b, #8
-; CHECK-NEXT:    uunpklo z0.h, z0.b
+; CHECK-NEXT:    mov v2.16b, v0.16b
+; CHECK-NEXT:    mov z3.d, z0.d
+; CHECK-NEXT:    mov v2.b[1], v0.b[1]
+; CHECK-NEXT:    ext z3.b, z3.b, z0.b, #16
+; CHECK-NEXT:    fmov s1, w8
+; CHECK-NEXT:    mov x8, #8 // =0x8
+; CHECK-NEXT:    ext v4.16b, v3.16b, v3.16b, #8
+; CHECK-NEXT:    mov v1.b[1], v0.b[9]
+; CHECK-NEXT:    mov v2.b[2], v0.b[2]
+; CHECK-NEXT:    mov v1.b[2], v0.b[10]
+; CHECK-NEXT:    mov v2.b[3], v0.b[3]
+; CHECK-NEXT:    mov v1.b[3], v0.b[11]
+; CHECK-NEXT:    mov v2.b[4], v0.b[4]
+; CHECK-NEXT:    mov v1.b[4], v0.b[12]
+; CHECK-NEXT:    mov v2.b[5], v0.b[5]
+; CHECK-NEXT:    mov v1.b[5], v0.b[13]
+; CHECK-NEXT:    mov v2.b[6], v0.b[6]
+; CHECK-NEXT:    mov v1.b[6], v0.b[14]
+; CHECK-NEXT:    mov v2.b[7], v0.b[7]
+; CHECK-NEXT:    mov v1.b[7], v0.b[15]
 ; CHECK-NEXT:    uunpklo z2.h, z2.b
-; CHECK-NEXT:    uunpklo z1.s, z1.h
-; CHECK-NEXT:    uunpklo z3.h, z3.b
-; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    uunpklo z0.h, z1.b
+; CHECK-NEXT:    uunpklo z1.h, z3.b
+; CHECK-NEXT:    uunpklo z3.h, z4.b
 ; CHECK-NEXT:    uunpklo z2.s, z2.h
-; CHECK-NEXT:    lsl z1.s, z1.s, #31
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    uunpklo z1.s, z1.h
 ; CHECK-NEXT:    uunpklo z3.s, z3.h
-; CHECK-NEXT:    lsl z0.s, z0.s, #31
-; CHECK-NEXT:    asr z1.s, z1.s, #31
 ; CHECK-NEXT:    lsl z2.s, z2.s, #31
-; CHECK-NEXT:    asr z0.s, z0.s, #31
-; CHECK-NEXT:    and z1.s, z1.s, #0x1
+; CHECK-NEXT:    lsl z0.s, z0.s, #31
+; CHECK-NEXT:    lsl z1.s, z1.s, #31
 ; CHECK-NEXT:    lsl z3.s, z3.s, #31
 ; CHECK-NEXT:    asr z2.s, z2.s, #31
-; CHECK-NEXT:    and z0.s, z0.s, #0x1
-; CHECK-NEXT:    cmpne p4.s, p0/z, z1.s, #0
-; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x0]
+; CHECK-NEXT:    asr z0.s, z0.s, #31
+; CHECK-NEXT:    asr z1.s, z1.s, #31
 ; CHECK-NEXT:    asr z3.s, z3.s, #31
 ; CHECK-NEXT:    and z2.s, z2.s, #0x1
-; CHECK-NEXT:    cmpne p1.s, p0/z, z0.s, #0
-; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT:    and z0.s, z0.s, #0x1
+; CHECK-NEXT:    and z1.s, z1.s, #0x1
 ; CHECK-NEXT:    and z3.s, z3.s, #0x1
-; CHECK-NEXT:    cmpne p2.s, p0/z, z2.s, #0
-; CHECK-NEXT:    ld1w { z2.s }, p0/z, [x0, x9, lsl #2]
-; CHECK-NEXT:    mov z1.s, p4/m, #0 // =0x0
+; CHECK-NEXT:    cmpne p4.s, p0/z, z2.s, #0
+; CHECK-NEXT:    ld1w { z2.s }, p0/z, [x0]
+; CHECK-NEXT:    cmpne p1.s, p0/z, z0.s, #0
+; CHECK-NEXT:    cmpne p2.s, p0/z, z1.s, #0
 ; CHECK-NEXT:    cmpne p3.s, p0/z, z3.s, #0
-; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x0, x10, lsl #2]
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x0, x9, lsl #2]
+; CHECK-NEXT:    mov z2.s, p4/m, #0 // =0x0
 ; CHECK-NEXT:    mov z0.s, p1/m, #0 // =0x0
-; CHECK-NEXT:    mov z2.s, p2/m, #0 // =0x0
-; CHECK-NEXT:    st1w { z1.s }, p0, [x0]
-; CHECK-NEXT:    st1w { z0.s }, p0, [x0, x8, lsl #2]
+; CHECK-NEXT:    mov z1.s, p2/m, #0 // =0x0
 ; CHECK-NEXT:    mov z3.s, p3/m, #0 // =0x0
-; CHECK-NEXT:    st1w { z2.s }, p0, [x0, x9, lsl #2]
-; CHECK-NEXT:    st1w { z3.s }, p0, [x0, x10, lsl #2]
+; CHECK-NEXT:    st1w { z2.s }, p0, [x0]
+; CHECK-NEXT:    st1w { z0.s }, p0, [x0, x8, lsl #2]
+; CHECK-NEXT:    st1w { z1.s }, p0, [x0, #1, mul vl]
+; CHECK-NEXT:    st1w { z3.s }, p0, [x0, x9, lsl #2]
 ; CHECK-NEXT:  .LBB1_2: // %exit
 ; CHECK-NEXT:    ret
   %broadcast.splat = shufflevector <32 x i1> zeroinitializer, <32 x i1> zeroinitializer, <32 x i32> zeroinitializer

Contributor Author

rj-jesus commented Mar 4, 2025

Hi @paulwalker-arm, in the other PR we had discussed this not being high priority, and I agree it isn't. Having said that, I asked around and a few colleagues spotted these cases in generated assembly and would appreciate it if we could improve them. It's also something that GCC seems to get right (https://godbolt.org/z/ns3neaMs9). Do you think doing this in `SelectAddrModeIndexedSVE` is a sensible approach?

@paulwalker-arm
Collaborator

Very nice. Does this mean there's new hope for teaching AArch64LoadStoreOpt to convert contiguous SVE fill/spill instructions to NEON ldp/stp?

Contributor Author

rj-jesus commented Mar 4, 2025

Very nice. Does this mean there's new hope for teaching AArch64LoadStoreOpt to convert contiguous SVE fill/spill instructions to NEON ldp/stp?

Yep, that was also my thinking. I think it should work now.
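For illustration, a hedged sketch (not output produced by this patch) of the kind of AArch64LoadStoreOpt rewrite being discussed, assuming the vector length is known to be 128 bits so that an SVE Z register spans exactly one NEON Q register:

```gas
// Before: two contiguous SVE fill accesses at VL=128
        ldr     z0, [x0]
        ldr     z1, [x0, #1, mul vl]
// After: a hypothetical rewrite into a single NEON load pair
        ldp     q0, q1, [x0]
```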

rj-jesus added 2 commits March 5, 2025 00:53
I think making `MulImm` std::optional helps make the intent of the
if-else chain clearer.

If anyone doesn't agree, please let me know and I'll undo this.
Contributor

@david-arm david-arm left a comment

LGTM!

Collaborator

@paulwalker-arm paulwalker-arm left a comment

Other than a potentially erroneous assert, this looks good to me.

Comment on lines 413 to 414
if (MaxSVEVectorSizeInBits &&
MinSVEVectorSizeInBits == MaxSVEVectorSizeInBits)
Collaborator

This can be just `MinSVEVectorSizeInBits == MaxSVEVectorSizeInBits`, given that returning zero when both are zero is valid?
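For context, a minimal, self-contained sketch of the equivalence being pointed out; the `SubtargetSketch` struct and method names below are hypothetical stand-ins, and only the quoted two-line condition comes from the patch:

```c++
#include <cassert>

// Hypothetical stand-in for the subtarget, just to make the point runnable.
struct SubtargetSketch {
  unsigned MinSVEVectorSizeInBits = 0;
  unsigned MaxSVEVectorSizeInBits = 0;

  // Before: explicitly guard against the unknown-VL case (both bounds zero).
  unsigned exactVScaleBefore() const {
    if (MaxSVEVectorSizeInBits &&
        MinSVEVectorSizeInBits == MaxSVEVectorSizeInBits)
      return MaxSVEVectorSizeInBits / 128;
    return 0;
  }

  // After: when both bounds are zero the equality still holds and 0/128 == 0,
  // so dropping the extra non-zero check gives the same result.
  unsigned exactVScaleAfter() const {
    return MinSVEVectorSizeInBits == MaxSVEVectorSizeInBits
               ? MaxSVEVectorSizeInBits / 128
               : 0;
  }
};

int main() {
  SubtargetSketch Unknown{};          // VL unknown: both bounds zero
  SubtargetSketch Fixed512{512, 512}; // e.g. -msve-vector-bits=512
  assert(Unknown.exactVScaleBefore() == Unknown.exactVScaleAfter());   // both 0
  assert(Fixed512.exactVScaleBefore() == Fixed512.exactVScaleAfter()); // both 4
}
```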

Contributor Author

Thanks very much for the suggestion; that looks much better.
Should we also let through the case `!MinSVEVectorSizeInBits && MaxSVEVectorSizeInBits == 128`?

Collaborator

Not sure if this can happen outside of handwritten IR, but otherwise it sounds like a reasonable suggestion to me.

Contributor Author

@rj-jesus rj-jesus Mar 5, 2025


I've left it as-is for now, due to the lack of a motivating example and to keep it consistent with `getMinSVEVectorSizeInBits`/`getMaxSVEVectorSizeInBits`, which presumably could return 128/2048 as the architecture bounds and avoid this problem altogether. Please let me know if you'd rather I add it.

I'll let the tests run and, assuming they are OK, merge it.

@rj-jesus rj-jesus merged commit f01e760 into llvm:main Mar 6, 2025
11 checks passed
@rj-jesus rj-jesus deleted the rjj/aarch64-sve-vls-imm-addressing-modes branch March 6, 2025 09:47
Contributor

omjavaid commented Mar 6, 2025

This PR appears to have broken the clang-aarch64-sve-vls buildbot. Here is the failing build: https://lab.llvm.org/buildbot/#/builders/143/builds/5952

rj-jesus added a commit that referenced this pull request Mar 19, 2025
…)" (#130625)

The original patch from #129732 exposed a bug in `getMemVTFromNode`, which was returning incorrect types for fixed length vectors.
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
When compiling VLS SVE, the compiler often replaces VL-based offsets
with immediate-based ones. This leads to a mismatch in the allowed
addressing modes due to SVE loads/stores generally expecting immediate
offsets relative to VL. For example, given:
```c

svfloat64_t foo(const double *x) {
  svbool_t pg = svptrue_b64();
  return svld1_f64(pg, x+svcntd());
}
```

When compiled with `-msve-vector-bits=128`, we currently generate:
```gas
foo:
        ptrue   p0.d
        mov     x8, #2
        ld1d    { z0.d }, p0/z, [x0, x8, lsl #3]
        ret
```

Instead, we could be generating:
```gas
foo:
        ldr     z0, [x0, #1, mul vl]
        ret
```

Likewise for other types, stores, and other VLS lengths.

This patch achieves the above by extending `SelectAddrModeIndexedSVE`
to let constants through when `vscale` is known.