[MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop #71403
@llvm/pr-subscribers-backend-aarch64 @llvm/pr-subscribers-backend-x86

Author: Rin (Rin18)

Changes: When there is a COPY instruction in the loop with other uses, we want to hoist the COPY, which in turn leads to the users being hoisted as well.

Co-authored-by: David Green <[email protected]>

Patch is 534.12 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/71403.diff

25 Files Affected:
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index e29f28ecaea0dce..5216662bb9d69db 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -1249,6 +1249,16 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
return false;
}
+ // If we have a COPY with other uses in the loop, hoist to allow the users to
+ // also be hoisted.
+ if (MI.isCopy() && IsLoopInvariantInst(MI, CurLoop) &&
+ MI.getOperand(0).isReg() &&
+ MI.getOperand(0).getReg().isVirtual() &&
+ MI.getOperand(1).isReg() && MI.getOperand(1).getReg().isVirtual() &&
+ any_of(MRI->use_nodbg_instructions(MI.getOperand(0).getReg()),
+ [&](MachineInstr &UseMI) { return CurLoop->contains(&UseMI); }))
+ return true;
+
// High register pressure situation, only hoist if the instruction is going
// to be remat'ed.
if (!isTriviallyReMaterializable(MI) &&
diff --git a/llvm/test/CodeGen/AArch64/tbl-loops.ll b/llvm/test/CodeGen/AArch64/tbl-loops.ll
index b63d540fb8e0291..365fe03ab0b0844 100644
--- a/llvm/test/CodeGen/AArch64/tbl-loops.ll
+++ b/llvm/test/CodeGen/AArch64/tbl-loops.ll
@@ -52,19 +52,19 @@ define void @loop1(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: b.eq .LBB0_8
; CHECK-NEXT: .LBB0_6: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB0_7: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: ldr s1, [x8], #4
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s2, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s2, mi
+; CHECK-NEXT: ldr s2, [x8], #4
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s3, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s3, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: fcvtzs w12, s1
-; CHECK-NEXT: strb w12, [x9], #1
+; CHECK-NEXT: fcvtzs w11, s2
+; CHECK-NEXT: strb w11, [x9], #1
; CHECK-NEXT: b.ne .LBB0_7
; CHECK-NEXT: .LBB0_8: // %for.cond.cleanup
; CHECK-NEXT: ret
@@ -165,25 +165,25 @@ define void @loop2(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB1_5: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB1_6: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: ldp s1, s3, [x8], #8
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s4, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: fcsel s2, s2, s3, gt
+; CHECK-NEXT: ldp s2, s3, [x8], #8
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s4, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s4, mi
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
-; CHECK-NEXT: fcvtzs w12, s1
-; CHECK-NEXT: fcsel s2, s0, s2, mi
+; CHECK-NEXT: fcvtzs w11, s2
+; CHECK-NEXT: fcsel s3, s0, s4, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: strb w12, [x9]
-; CHECK-NEXT: fcvtzs w13, s2
-; CHECK-NEXT: strb w13, [x9, #1]
+; CHECK-NEXT: strb w11, [x9]
+; CHECK-NEXT: fcvtzs w12, s3
+; CHECK-NEXT: strb w12, [x9, #1]
; CHECK-NEXT: add x9, x9, #2
; CHECK-NEXT: b.ne .LBB1_6
; CHECK-NEXT: .LBB1_7: // %for.cond.cleanup
@@ -380,33 +380,33 @@ define void @loop3(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB2_7: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB2_8: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: ldp s1, s3, [x8]
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s4, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: fcsel s4, s2, s3, gt
+; CHECK-NEXT: ldp s2, s3, [x8]
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s4, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s4, mi
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: ldr s3, [x8, #8]
-; CHECK-NEXT: fcvtzs w12, s1
+; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: add x8, x8, #12
; CHECK-NEXT: fcsel s4, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: strb w12, [x9]
-; CHECK-NEXT: fcsel s2, s2, s3, gt
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: strb w11, [x9]
+; CHECK-NEXT: fcsel s5, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
-; CHECK-NEXT: fcvtzs w13, s4
-; CHECK-NEXT: fcsel s2, s0, s2, mi
+; CHECK-NEXT: fcvtzs w12, s4
+; CHECK-NEXT: fcsel s3, s0, s5, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: strb w13, [x9, #1]
-; CHECK-NEXT: fcvtzs w14, s2
-; CHECK-NEXT: strb w14, [x9, #2]
+; CHECK-NEXT: strb w12, [x9, #1]
+; CHECK-NEXT: fcvtzs w13, s3
+; CHECK-NEXT: strb w13, [x9, #2]
; CHECK-NEXT: add x9, x9, #3
; CHECK-NEXT: b.ne .LBB2_8
; CHECK-NEXT: .LBB2_9: // %for.cond.cleanup
@@ -549,39 +549,39 @@ define void @loop4(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB3_5: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB3_6: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: ldp s1, s3, [x8]
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s4, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: fcsel s4, s2, s3, gt
+; CHECK-NEXT: ldp s2, s3, [x8]
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s4, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s4, mi
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: ldp s3, s5, [x8, #8]
-; CHECK-NEXT: fcvtzs w12, s1
+; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: add x8, x8, #16
; CHECK-NEXT: fcsel s4, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: strb w12, [x9]
-; CHECK-NEXT: fcsel s6, s2, s3, gt
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: strb w11, [x9]
+; CHECK-NEXT: fcsel s6, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
-; CHECK-NEXT: fcvtzs w13, s4
+; CHECK-NEXT: fcvtzs w12, s4
; CHECK-NEXT: fcsel s3, s0, s6, mi
-; CHECK-NEXT: fcmp s5, s2
-; CHECK-NEXT: strb w13, [x9, #1]
-; CHECK-NEXT: fcsel s2, s2, s5, gt
+; CHECK-NEXT: fcmp s5, s1
+; CHECK-NEXT: strb w12, [x9, #1]
+; CHECK-NEXT: fcsel s6, s1, s5, gt
; CHECK-NEXT: fcmp s5, #0.0
-; CHECK-NEXT: fcvtzs w14, s3
-; CHECK-NEXT: fcsel s2, s0, s2, mi
+; CHECK-NEXT: fcvtzs w13, s3
+; CHECK-NEXT: fcsel s5, s0, s6, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: strb w14, [x9, #2]
-; CHECK-NEXT: fcvtzs w15, s2
-; CHECK-NEXT: strb w15, [x9, #3]
+; CHECK-NEXT: strb w13, [x9, #2]
+; CHECK-NEXT: fcvtzs w14, s5
+; CHECK-NEXT: strb w14, [x9, #3]
; CHECK-NEXT: add x9, x9, #4
; CHECK-NEXT: b.ne .LBB3_6
; CHECK-NEXT: .LBB3_7: // %for.cond.cleanup
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
index c6ea046f95a9199..53b2336180c6617 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
@@ -1447,15 +1447,14 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(ptr %ptr) #1 {
; GFX90A-LABEL: flat_atomic_fadd_f64_noret_pat:
; GFX90A: ; %bb.0: ; %main_body
; GFX90A-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[2:3], 0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[0:1]
+; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB50_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
-; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
@@ -1463,9 +1462,9 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(ptr %ptr) #1 {
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT: s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB50_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX90A-NEXT: s_endpgm
@@ -1522,15 +1521,14 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(ptr %ptr) #1 {
; GFX90A-LABEL: flat_atomic_fadd_f64_noret_pat_system:
; GFX90A: ; %bb.0: ; %main_body
; GFX90A-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[2:3], 0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[0:1]
+; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB52_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
-; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
@@ -1539,9 +1537,9 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(ptr %ptr) #1 {
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT: s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB52_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX90A-NEXT: s_endpgm
@@ -1724,23 +1722,22 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_agent_safe(ptr %ptr) {
; GFX90A-LABEL: flat_atomic_fadd_f64_noret_pat_agent_safe:
; GFX90A: ; %bb.0: ; %main_body
; GFX90A-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[2:3], 0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[0:1]
+; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB58_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
-; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT: s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB58_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX90A-NEXT: s_endpgm
@@ -1957,22 +1954,21 @@ main_body:
define amdgpu_kernel void @local_atomic_fadd_f64_noret_pat_flush_safe(ptr addrspace(3) %ptr) #4 {
; GFX90A-LABEL: local_atomic_fadd_f64_noret_pat_flush_safe:
; GFX90A: ; %bb.0: ; %main_body
-; GFX90A-NEXT: s_load_dword s2, s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[0:1], 0
+; GFX90A-NEXT: s_load_dword s0, s[0:1], 0x24
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_mov_b32_e32 v0, s2
-; GFX90A-NEXT: ds_read_b64 v[0:1], v0
+; GFX90A-NEXT: v_mov_b32_e32 v2, s0
+; GFX90A-NEXT: ds_read_b64 v[0:1], v2
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB67_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_add_f64 v[2:3], v[0:1], 4.0
-; GFX90A-NEXT: v_mov_b32_e32 v4, s2
+; GFX90A-NEXT: v_add_f64 v[4:5], v[0:1], 4.0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: ds_cmpst_rtn_b64 v[2:3], v4, v[0:1], v[2:3]
+; GFX90A-NEXT: ds_cmpst_rtn_b64 v[4:5], v2, v[0:1], v[4:5]
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[0:1]
+; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[4:5], v[0:1]
; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], v[2:3], v[2:3] op_sel:[0,1]
+; GFX90A-NEXT: v_pk_mov_b32 v[0:1], v[4:5], v[4:5] op_sel:[0,1]
; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB67_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1985,17 +1981,17 @@ define amdgpu_kernel void @local_atomic_fadd_f64_noret_pat_flush_safe(ptr addrsp
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
; GFX940-NEXT: v_mov_b32_e32 v0, s2
; GFX940-NEXT: ds_read_b64 v[0:1], v0
+; GFX940-NEXT: v_mov_b32_e32 v2, s2
; GFX940-NEXT: .LBB67_1: ; %atomicrmw.start
; GFX940-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
-; GFX940-NEXT: v_add_f64 v[2:3], v[0:1], 4.0
-; GFX940-NEXT: v_mov_b32_e32 v4, s2
+; GFX940-NEXT: v_add_f64 v[4:5], v[0:1], 4.0
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
-; GFX940-NEXT: ds_cmpst_rtn_b64 v[2:3], v4, v[0:1], v[2:3]
+; GFX940-NEXT: ds_cmpst_rtn_b64 v[4:5], v2, v[0:1], v[4:5]
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
-; GFX940-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[0:1]
+; GFX940-NEXT: v_cmp_eq_u64_e32 vcc, v[4:5], v[0:1]
; GFX940-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
-; GFX940-NEXT: v_mov_b64_e32 v[0:1], v[2:3]
+; GFX940-NEXT: v_mov_b64_e32 v[0:1], v[4:5]
; GFX940-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX940-NEXT: s_cbranch_execnz .LBB67_1
; GFX940-NEXT: ; %bb.2: ; %atomicrmw.end
diff --git a/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll b/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
index 6bdeeddc951ff18..c4a88ebf4897294 100644
--- a/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
@@ -1819,22 +1819,20 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_scalar(ptr inreg %ptr, i32 in
; GCN1-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN1-NEXT: v_mov_b32_e32 v0, s4
; GCN1-NEXT: v_mov_b32_e32 v1, s5
-; GCN1-NEXT: flat_load_dword v1, v[0:1]
+; GCN1-NEXT: flat_load_dword v3, v[0:1]
; GCN1-NEXT: s_mov_b64 s[34:35], 0
; GCN1-NEXT: .LBB44_1: ; %atomicrmw.start
; GCN1-NEXT: ; =>This Inner Loop Header: Depth=1
; GCN1-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN1-NEXT: v_and_b32_e32 v0, s6, v1
-; GCN1-NEXT: v_mov_b32_e32 v2, s4
-; GCN1-NEXT: v_mov_b32_e32 v3, s5
-; GCN1-NEXT: v_not_b32_e32 v0, v0
+; GCN1-NEXT: v_and_b32_e32 v2, s6, v3
+; GCN1-NEXT: v_not_b32_e32 v2, v2
; GCN1-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN1-NEXT: flat_atomic_cmpswap v0, v[2:3], v[0:1] glc
+; GCN1-NEXT: flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
; GCN1-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN1-NEXT: buffer_wbinvl1_vol
-; GCN1-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
+; GCN1-NEXT: v_cmp_eq_u32_e32 vcc, v2, v3
; GCN1-NEXT: s_or_b64 s[34:35], vcc, s[34:35]
-; GCN1-NEXT: v_mov_b32_e32 v1, v0
+; GCN1-NEXT: v_mov_b32_e32 v3, v2
; GCN1-NEXT: s_andn2_b64 exec, exec, s[34:35]
; GCN1-NEXT: s_cbranch_execnz .LBB44_1
; GCN1-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1846,22 +1844,20 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_scalar(ptr inreg %ptr, i32 in
; GCN2-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN2-NEXT: v_mov_b32_e32 v0, s4
; GCN2-NEXT: v_mov_b32_e32 v1, s5
-; GCN2-NEXT: flat_load_dword v1, v[0:1]
+; GCN2-NEXT: flat_load_dword v3, v[0:1]
; GCN2-NEXT: s_mov_b64 s[34:35], 0
; GCN2-NEXT: .LBB44_1: ; %atomicrmw.start
; GCN2-NEXT: ; =>This Inner Loop Header: Depth=1
; GCN2-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN2-NEXT: v_and_b32_e32 v0, s6, v1
-; GCN2-NEXT: v_mov_b32_e32 v2, s4
-; GCN2-NEXT: v_mov_b32_e32 v3, s5
-; GCN2-NEXT: v_not_b32_e32 v0, v0
+; GCN2-NEXT: v_and_b32_e32 v2, s6, v3
+; GCN2-NEXT: v_not_b32_e32 v2, v2
; GCN2-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN2-NEXT: flat_atomic_cmpswap v0, v[2:3], v[0:1] glc
+; GCN2-NEXT: flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
; GCN2-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN2-NEXT: buffer_wbinvl1_vol
-; GCN2-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
+; GCN2-NEXT: v_cmp_eq_u32_e32 vcc, v2, v3
; GCN2-NEXT: s_or_b64 s[34:35], vcc, s[34:35]
-; GCN2-NEXT: v_mov_b32_e32 v1, v0
+; GCN2-NEXT: v_mov_b32_e32 v3, v2
; GCN2-NEXT: s_andn2_b64 exec, exec, s[34:35]
; GCN2-NEXT: s_cbranch_execnz .LBB44_1
; GCN2-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1873,22 +1869,20 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_scalar(ptr inreg %ptr, i32 in
; GCN3-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN3-NEXT: v_mov_b32_e32 v0, s4
; GCN3-NEXT: v_mov_b32_e32 v1, s5
-; GCN3-NEXT: flat_load_dword v1, v[0:1]
+; GCN3-NEXT: flat_load_dword v3, v[0:1]
; GCN3-NEXT: s_mov_b64 s[34:35], 0
; GCN3-NEXT: .LBB44_1: ; %atomicrmw.start
; GCN3-NEXT: ; =>This Inner Loop Header: Depth=1
; GCN3-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN3-NEXT: v_and_b32_e32 v0, s6, v1
-; GCN3-NEXT: v_mov_b32_e32 v2, s4
-; GCN3-NEXT: v_mov_b32_e32 v3, s5
-; GCN3-NEXT: v_not_b32_e32 v0, v0
+; GCN3-NEXT: v_and_b32_e32 v2, s6, v3
+; GCN3-NEXT: v_not_b32_e32 v2, v2
; GCN3-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN3-NEXT: flat_atomic_cmpswap v0, v[2:3], v[0:1] glc
+; GCN3-NEXT: flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
; GCN3-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN3-NEXT: buffer_wbinvl1_vol
-; GCN3-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
+; GCN3-NEXT: v_cmp_eq_u32_e32 vcc, v2, v3
; GCN3-NEXT: s_or_b64 s[34:35], vcc, s[34:35]
-; GCN3-NEXT: v_mov_b32_e32 v1, v0
+; GCN3-NEXT: v_mov_b32_e32 v3, v2
; GCN3-NEXT: s_andn2_b64 exec, exec, s[34:35]
; GCN3-NEXT: s_cbranch_execnz .LBB44_1
; GCN3-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1906,26 +1900,24 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_offset_scalar(ptr inreg %out,
; GCN1-NEXT: s_addc_u32 s35, s5, 0
; GCN1-NEXT: v_mov_b32_e32 v0, s34
; GCN1-NEXT: v_mov_b32_e32 v1, s35
-; GCN1-NEXT: flat_load_dword v1, v[0:1]
-; GCN1-NEXT: s_mov_b64 s[36:37], 0
+; GCN1-NEXT: ...
[truncated]
Thanks. Can you rebase to see if the issue with the test goes away?
Yeah, rebasing sounds like a good idea, I'll do that.
commit fef5786ca1d78288c3f442057bfeb148788417b2
Merge: 77daaf1c4b40 aadaa1cc32b8
Author: Rin Dobrescu <[email protected]>
Date:   Tue Nov 7 10:27:03 2023 +0000

    Merge branch 'hoist_dups' of https://github.com/Rin18/llvm-project-fork into hoist_dups

commit 77daaf1c4b4009bc95803e8992291f64c9cf8b33
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:52:27 2023 +0000

    Remove unneeded line

commit aa469a15af8aeaca8d474ee7ab5b59ff20aec28c
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:08:11 2023 +0000

    Run clang-format on patch

commit 0d43d36e939e702e73b5927b4ae1c713ec0edec9
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 12:20:21 2023 +0000

    [MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop

commit aadaa1cc32b88e682f0b9148683d3f6142e73f4e
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:52:27 2023 +0000

    Remove unneeded line

commit 27c7354c1bcc5aff81fed7623ff56d93af4201da
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:08:11 2023 +0000

    Run clang-format on patch

commit 34f9b474e3e10f670b6bb0359f2b81ff8275d5a1
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 12:20:21 2023 +0000

    [MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop
Hmm, the PR looks a little odd now! There is a single commit in the PR, that is itself a squashed commit. |
Merged after rebase which led to the commits from main appearing on my PR. Had to squash them to fix it which is why there's only one commit now. |
LGTM! The test changes look favourable in almost all cases and the code change makes sense to me.
… loop (llvm#71403) When there is a COPY instruction in the loop with other uses, we want to hoist the COPY, which in turn leads to the users being hoisted as well. Co-authored-by David Green : [email protected]
@@ -1262,6 +1262,18 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
     return false;
   }
// If we have a COPY with other uses in the loop, hoist to allow the users to |
For info: Downstream this caused regressions in some of our benchmarks. Not sure if you need to care about that (we will probably guard this with a check for our downstream target), but thought it might be nice to mention it.
Haven't fully investigated what happens, but I think that hoisting the COPY increases register pressure, resulting in spills. The COPY instructions I see hoisted in that benchmark fall into two categories:
%228:gn32 = COPY %143:pn ; cross register bank copy
%245:gn16 = COPY %227.lo:gn32 ; extracting a subreg
So for example hoisting the cross register bank COPY results in increasing the register pressure for the general gn registers in the path leading up to the use. Similarly, by hoisting the subreg extract the register pressure on the gn registers increase.
I wonder if the heuristic here should perhaps look more closely at the using instructions to see if they can actually be hoisted as well? After all, we are on the path where CanCauseHighRegPressure has returned true, so traditionally we have been more careful here and only hoisted trivially rematerializable MIs.
Hello. From what I've seen in our benchmarks this has been positive, but there is often some noise from hoisting/sinking. You are right that this could be more conservative, but in our case cross-register-bank copies are relatively expensive and we would want to hoist them if we could. I'm not sure about the subreg extracts, but a lot of COPYs are removed prior to register allocation, and it would be good if the allocator knew where best to re-add them, if needed.
It is a bit tricky of course, depending on the properties of the target. For a VLIW target like ours the COPY could be really cheap, at least if it can be executed in parallel with something else without increasing latency. A cross-bank copy might actually be free (zero cycles) even if it is inside the loop. OTOH, if we need to spill/reload a register, that could be much more expensive, even if the COPY is hoisted to some outer loop.
You may want to check my PR #81735, which is intended to fix the regression caused by spilling in the AMDGPU target. It might help you too.