[MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop #71403
@llvm/pr-subscribers-backend-aarch64 @llvm/pr-subscribers-backend-x86

Author: Rin (Rin18)

Changes: When there is a COPY instruction in the loop with other uses, we want to hoist the COPY, which in turn leads to the users being hoisted as well.

Co-authored-by: David Green <[email protected]>

Patch is 534.12 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/71403.diff

25 Files Affected:
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index e29f28ecaea0dce..5216662bb9d69db 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -1249,6 +1249,16 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
return false;
}
+ // If we have a COPY with other uses in the loop, hoist to allow the users to
+ // also be hoisted.
+ if (MI.isCopy() && IsLoopInvariantInst(MI, CurLoop) &&
+ MI.getOperand(0).isReg() &&
+ MI.getOperand(0).getReg().isVirtual() &&
+ MI.getOperand(1).isReg() && MI.getOperand(1).getReg().isVirtual() &&
+ any_of(MRI->use_nodbg_instructions(MI.getOperand(0).getReg()),
+ [&](MachineInstr &UseMI) { return CurLoop->contains(&UseMI); }))
+ return true;
+
// High register pressure situation, only hoist if the instruction is going
// to be remat'ed.
if (!isTriviallyReMaterializable(MI) &&
diff --git a/llvm/test/CodeGen/AArch64/tbl-loops.ll b/llvm/test/CodeGen/AArch64/tbl-loops.ll
index b63d540fb8e0291..365fe03ab0b0844 100644
--- a/llvm/test/CodeGen/AArch64/tbl-loops.ll
+++ b/llvm/test/CodeGen/AArch64/tbl-loops.ll
@@ -52,19 +52,19 @@ define void @loop1(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: b.eq .LBB0_8
; CHECK-NEXT: .LBB0_6: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB0_7: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: ldr s1, [x8], #4
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s2, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s2, mi
+; CHECK-NEXT: ldr s2, [x8], #4
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s3, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s3, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: fcvtzs w12, s1
-; CHECK-NEXT: strb w12, [x9], #1
+; CHECK-NEXT: fcvtzs w11, s2
+; CHECK-NEXT: strb w11, [x9], #1
; CHECK-NEXT: b.ne .LBB0_7
; CHECK-NEXT: .LBB0_8: // %for.cond.cleanup
; CHECK-NEXT: ret
@@ -165,25 +165,25 @@ define void @loop2(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB1_5: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB1_6: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: ldp s1, s3, [x8], #8
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s4, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: fcsel s2, s2, s3, gt
+; CHECK-NEXT: ldp s2, s3, [x8], #8
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s4, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s4, mi
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
-; CHECK-NEXT: fcvtzs w12, s1
-; CHECK-NEXT: fcsel s2, s0, s2, mi
+; CHECK-NEXT: fcvtzs w11, s2
+; CHECK-NEXT: fcsel s3, s0, s4, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: strb w12, [x9]
-; CHECK-NEXT: fcvtzs w13, s2
-; CHECK-NEXT: strb w13, [x9, #1]
+; CHECK-NEXT: strb w11, [x9]
+; CHECK-NEXT: fcvtzs w12, s3
+; CHECK-NEXT: strb w12, [x9, #1]
; CHECK-NEXT: add x9, x9, #2
; CHECK-NEXT: b.ne .LBB1_6
; CHECK-NEXT: .LBB1_7: // %for.cond.cleanup
@@ -380,33 +380,33 @@ define void @loop3(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB2_7: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB2_8: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: ldp s1, s3, [x8]
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s4, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: fcsel s4, s2, s3, gt
+; CHECK-NEXT: ldp s2, s3, [x8]
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s4, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s4, mi
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: ldr s3, [x8, #8]
-; CHECK-NEXT: fcvtzs w12, s1
+; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: add x8, x8, #12
; CHECK-NEXT: fcsel s4, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: strb w12, [x9]
-; CHECK-NEXT: fcsel s2, s2, s3, gt
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: strb w11, [x9]
+; CHECK-NEXT: fcsel s5, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
-; CHECK-NEXT: fcvtzs w13, s4
-; CHECK-NEXT: fcsel s2, s0, s2, mi
+; CHECK-NEXT: fcvtzs w12, s4
+; CHECK-NEXT: fcsel s3, s0, s5, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: strb w13, [x9, #1]
-; CHECK-NEXT: fcvtzs w14, s2
-; CHECK-NEXT: strb w14, [x9, #2]
+; CHECK-NEXT: strb w12, [x9, #1]
+; CHECK-NEXT: fcvtzs w13, s3
+; CHECK-NEXT: strb w13, [x9, #2]
; CHECK-NEXT: add x9, x9, #3
; CHECK-NEXT: b.ne .LBB2_8
; CHECK-NEXT: .LBB2_9: // %for.cond.cleanup
@@ -549,39 +549,39 @@ define void @loop4(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB3_5: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
-; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
+; CHECK-NEXT: sub w10, w2, w10
+; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB3_6: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: ldp s1, s3, [x8]
-; CHECK-NEXT: fmov s2, w11
-; CHECK-NEXT: fcmp s1, s2
-; CHECK-NEXT: fcsel s4, s2, s1, gt
-; CHECK-NEXT: fcmp s1, #0.0
-; CHECK-NEXT: fcsel s1, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: fcsel s4, s2, s3, gt
+; CHECK-NEXT: ldp s2, s3, [x8]
+; CHECK-NEXT: fcmp s2, s1
+; CHECK-NEXT: fcsel s4, s1, s2, gt
+; CHECK-NEXT: fcmp s2, #0.0
+; CHECK-NEXT: fcsel s2, s0, s4, mi
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: ldp s3, s5, [x8, #8]
-; CHECK-NEXT: fcvtzs w12, s1
+; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: add x8, x8, #16
; CHECK-NEXT: fcsel s4, s0, s4, mi
-; CHECK-NEXT: fcmp s3, s2
-; CHECK-NEXT: strb w12, [x9]
-; CHECK-NEXT: fcsel s6, s2, s3, gt
+; CHECK-NEXT: fcmp s3, s1
+; CHECK-NEXT: strb w11, [x9]
+; CHECK-NEXT: fcsel s6, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
-; CHECK-NEXT: fcvtzs w13, s4
+; CHECK-NEXT: fcvtzs w12, s4
; CHECK-NEXT: fcsel s3, s0, s6, mi
-; CHECK-NEXT: fcmp s5, s2
-; CHECK-NEXT: strb w13, [x9, #1]
-; CHECK-NEXT: fcsel s2, s2, s5, gt
+; CHECK-NEXT: fcmp s5, s1
+; CHECK-NEXT: strb w12, [x9, #1]
+; CHECK-NEXT: fcsel s6, s1, s5, gt
; CHECK-NEXT: fcmp s5, #0.0
-; CHECK-NEXT: fcvtzs w14, s3
-; CHECK-NEXT: fcsel s2, s0, s2, mi
+; CHECK-NEXT: fcvtzs w13, s3
+; CHECK-NEXT: fcsel s5, s0, s6, mi
; CHECK-NEXT: subs w10, w10, #1
-; CHECK-NEXT: strb w14, [x9, #2]
-; CHECK-NEXT: fcvtzs w15, s2
-; CHECK-NEXT: strb w15, [x9, #3]
+; CHECK-NEXT: strb w13, [x9, #2]
+; CHECK-NEXT: fcvtzs w14, s5
+; CHECK-NEXT: strb w14, [x9, #3]
; CHECK-NEXT: add x9, x9, #4
; CHECK-NEXT: b.ne .LBB3_6
; CHECK-NEXT: .LBB3_7: // %for.cond.cleanup
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
index c6ea046f95a9199..53b2336180c6617 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
@@ -1447,15 +1447,14 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(ptr %ptr) #1 {
; GFX90A-LABEL: flat_atomic_fadd_f64_noret_pat:
; GFX90A: ; %bb.0: ; %main_body
; GFX90A-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[2:3], 0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[0:1]
+; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB50_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
-; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
@@ -1463,9 +1462,9 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(ptr %ptr) #1 {
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT: s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB50_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX90A-NEXT: s_endpgm
@@ -1522,15 +1521,14 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(ptr %ptr) #1 {
; GFX90A-LABEL: flat_atomic_fadd_f64_noret_pat_system:
; GFX90A: ; %bb.0: ; %main_body
; GFX90A-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[2:3], 0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[0:1]
+; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB52_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
-; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
@@ -1539,9 +1537,9 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(ptr %ptr) #1 {
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT: s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB52_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX90A-NEXT: s_endpgm
@@ -1724,23 +1722,22 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_agent_safe(ptr %ptr) {
; GFX90A-LABEL: flat_atomic_fadd_f64_noret_pat_agent_safe:
; GFX90A: ; %bb.0: ; %main_body
; GFX90A-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[2:3], 0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[0:1]
+; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB58_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
-; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
; GFX90A-NEXT: v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT: s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB58_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX90A-NEXT: s_endpgm
@@ -1957,22 +1954,21 @@ main_body:
define amdgpu_kernel void @local_atomic_fadd_f64_noret_pat_flush_safe(ptr addrspace(3) %ptr) #4 {
; GFX90A-LABEL: local_atomic_fadd_f64_noret_pat_flush_safe:
; GFX90A: ; %bb.0: ; %main_body
-; GFX90A-NEXT: s_load_dword s2, s[0:1], 0x24
-; GFX90A-NEXT: s_mov_b64 s[0:1], 0
+; GFX90A-NEXT: s_load_dword s0, s[0:1], 0x24
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_mov_b32_e32 v0, s2
-; GFX90A-NEXT: ds_read_b64 v[0:1], v0
+; GFX90A-NEXT: v_mov_b32_e32 v2, s0
+; GFX90A-NEXT: ds_read_b64 v[0:1], v2
+; GFX90A-NEXT: s_mov_b64 s[0:1], 0
; GFX90A-NEXT: .LBB67_1: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_add_f64 v[2:3], v[0:1], 4.0
-; GFX90A-NEXT: v_mov_b32_e32 v4, s2
+; GFX90A-NEXT: v_add_f64 v[4:5], v[0:1], 4.0
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: ds_cmpst_rtn_b64 v[2:3], v4, v[0:1], v[2:3]
+; GFX90A-NEXT: ds_cmpst_rtn_b64 v[4:5], v2, v[0:1], v[4:5]
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[0:1]
+; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[4:5], v[0:1]
; GFX90A-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
-; GFX90A-NEXT: v_pk_mov_b32 v[0:1], v[2:3], v[2:3] op_sel:[0,1]
+; GFX90A-NEXT: v_pk_mov_b32 v[0:1], v[4:5], v[4:5] op_sel:[0,1]
; GFX90A-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX90A-NEXT: s_cbranch_execnz .LBB67_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1985,17 +1981,17 @@ define amdgpu_kernel void @local_atomic_fadd_f64_noret_pat_flush_safe(ptr addrsp
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
; GFX940-NEXT: v_mov_b32_e32 v0, s2
; GFX940-NEXT: ds_read_b64 v[0:1], v0
+; GFX940-NEXT: v_mov_b32_e32 v2, s2
; GFX940-NEXT: .LBB67_1: ; %atomicrmw.start
; GFX940-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
-; GFX940-NEXT: v_add_f64 v[2:3], v[0:1], 4.0
-; GFX940-NEXT: v_mov_b32_e32 v4, s2
+; GFX940-NEXT: v_add_f64 v[4:5], v[0:1], 4.0
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
-; GFX940-NEXT: ds_cmpst_rtn_b64 v[2:3], v4, v[0:1], v[2:3]
+; GFX940-NEXT: ds_cmpst_rtn_b64 v[4:5], v2, v[0:1], v[4:5]
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
-; GFX940-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[0:1]
+; GFX940-NEXT: v_cmp_eq_u64_e32 vcc, v[4:5], v[0:1]
; GFX940-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
-; GFX940-NEXT: v_mov_b64_e32 v[0:1], v[2:3]
+; GFX940-NEXT: v_mov_b64_e32 v[0:1], v[4:5]
; GFX940-NEXT: s_andn2_b64 exec, exec, s[0:1]
; GFX940-NEXT: s_cbranch_execnz .LBB67_1
; GFX940-NEXT: ; %bb.2: ; %atomicrmw.end
diff --git a/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll b/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
index 6bdeeddc951ff18..c4a88ebf4897294 100644
--- a/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
@@ -1819,22 +1819,20 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_scalar(ptr inreg %ptr, i32 in
; GCN1-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN1-NEXT: v_mov_b32_e32 v0, s4
; GCN1-NEXT: v_mov_b32_e32 v1, s5
-; GCN1-NEXT: flat_load_dword v1, v[0:1]
+; GCN1-NEXT: flat_load_dword v3, v[0:1]
; GCN1-NEXT: s_mov_b64 s[34:35], 0
; GCN1-NEXT: .LBB44_1: ; %atomicrmw.start
; GCN1-NEXT: ; =>This Inner Loop Header: Depth=1
; GCN1-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN1-NEXT: v_and_b32_e32 v0, s6, v1
-; GCN1-NEXT: v_mov_b32_e32 v2, s4
-; GCN1-NEXT: v_mov_b32_e32 v3, s5
-; GCN1-NEXT: v_not_b32_e32 v0, v0
+; GCN1-NEXT: v_and_b32_e32 v2, s6, v3
+; GCN1-NEXT: v_not_b32_e32 v2, v2
; GCN1-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN1-NEXT: flat_atomic_cmpswap v0, v[2:3], v[0:1] glc
+; GCN1-NEXT: flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
; GCN1-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN1-NEXT: buffer_wbinvl1_vol
-; GCN1-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
+; GCN1-NEXT: v_cmp_eq_u32_e32 vcc, v2, v3
; GCN1-NEXT: s_or_b64 s[34:35], vcc, s[34:35]
-; GCN1-NEXT: v_mov_b32_e32 v1, v0
+; GCN1-NEXT: v_mov_b32_e32 v3, v2
; GCN1-NEXT: s_andn2_b64 exec, exec, s[34:35]
; GCN1-NEXT: s_cbranch_execnz .LBB44_1
; GCN1-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1846,22 +1844,20 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_scalar(ptr inreg %ptr, i32 in
; GCN2-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN2-NEXT: v_mov_b32_e32 v0, s4
; GCN2-NEXT: v_mov_b32_e32 v1, s5
-; GCN2-NEXT: flat_load_dword v1, v[0:1]
+; GCN2-NEXT: flat_load_dword v3, v[0:1]
; GCN2-NEXT: s_mov_b64 s[34:35], 0
; GCN2-NEXT: .LBB44_1: ; %atomicrmw.start
; GCN2-NEXT: ; =>This Inner Loop Header: Depth=1
; GCN2-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN2-NEXT: v_and_b32_e32 v0, s6, v1
-; GCN2-NEXT: v_mov_b32_e32 v2, s4
-; GCN2-NEXT: v_mov_b32_e32 v3, s5
-; GCN2-NEXT: v_not_b32_e32 v0, v0
+; GCN2-NEXT: v_and_b32_e32 v2, s6, v3
+; GCN2-NEXT: v_not_b32_e32 v2, v2
; GCN2-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN2-NEXT: flat_atomic_cmpswap v0, v[2:3], v[0:1] glc
+; GCN2-NEXT: flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
; GCN2-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN2-NEXT: buffer_wbinvl1_vol
-; GCN2-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
+; GCN2-NEXT: v_cmp_eq_u32_e32 vcc, v2, v3
; GCN2-NEXT: s_or_b64 s[34:35], vcc, s[34:35]
-; GCN2-NEXT: v_mov_b32_e32 v1, v0
+; GCN2-NEXT: v_mov_b32_e32 v3, v2
; GCN2-NEXT: s_andn2_b64 exec, exec, s[34:35]
; GCN2-NEXT: s_cbranch_execnz .LBB44_1
; GCN2-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1873,22 +1869,20 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_scalar(ptr inreg %ptr, i32 in
; GCN3-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN3-NEXT: v_mov_b32_e32 v0, s4
; GCN3-NEXT: v_mov_b32_e32 v1, s5
-; GCN3-NEXT: flat_load_dword v1, v[0:1]
+; GCN3-NEXT: flat_load_dword v3, v[0:1]
; GCN3-NEXT: s_mov_b64 s[34:35], 0
; GCN3-NEXT: .LBB44_1: ; %atomicrmw.start
; GCN3-NEXT: ; =>This Inner Loop Header: Depth=1
; GCN3-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN3-NEXT: v_and_b32_e32 v0, s6, v1
-; GCN3-NEXT: v_mov_b32_e32 v2, s4
-; GCN3-NEXT: v_mov_b32_e32 v3, s5
-; GCN3-NEXT: v_not_b32_e32 v0, v0
+; GCN3-NEXT: v_and_b32_e32 v2, s6, v3
+; GCN3-NEXT: v_not_b32_e32 v2, v2
; GCN3-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GCN3-NEXT: flat_atomic_cmpswap v0, v[2:3], v[0:1] glc
+; GCN3-NEXT: flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
; GCN3-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN3-NEXT: buffer_wbinvl1_vol
-; GCN3-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
+; GCN3-NEXT: v_cmp_eq_u32_e32 vcc, v2, v3
; GCN3-NEXT: s_or_b64 s[34:35], vcc, s[34:35]
-; GCN3-NEXT: v_mov_b32_e32 v1, v0
+; GCN3-NEXT: v_mov_b32_e32 v3, v2
; GCN3-NEXT: s_andn2_b64 exec, exec, s[34:35]
; GCN3-NEXT: s_cbranch_execnz .LBB44_1
; GCN3-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1906,26 +1900,24 @@ define amdgpu_gfx void @flat_atomic_nand_i32_noret_offset_scalar(ptr inreg %out,
; GCN1-NEXT: s_addc_u32 s35, s5, 0
; GCN1-NEXT: v_mov_b32_e32 v0, s34
; GCN1-NEXT: v_mov_b32_e32 v1, s35
-; GCN1-NEXT: flat_load_dword v1, v[0:1]
-; GCN1-NEXT: s_mov_b64 s[36:37], 0
+; GCN1-NEXT: ...
[truncated]
Thanks. Can you rebase to see if the issue with the test goes away?
Yeah, rebasing sounds like a good idea, I'll do that.
commit fef5786ca1d78288c3f442057bfeb148788417b2
Merge: 77daaf1c4b40 aadaa1cc32b8
Author: Rin Dobrescu <[email protected]>
Date:   Tue Nov 7 10:27:03 2023 +0000

    Merge branch 'hoist_dups' of https://github.com/Rin18/llvm-project-fork into hoist_dups

commit 77daaf1c4b4009bc95803e8992291f64c9cf8b33
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:52:27 2023 +0000

    Remove unneeded line

commit aa469a15af8aeaca8d474ee7ab5b59ff20aec28c
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:08:11 2023 +0000

    Run clang-format on patch

commit 0d43d36e939e702e73b5927b4ae1c713ec0edec9
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 12:20:21 2023 +0000

    [MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop

commit aadaa1cc32b88e682f0b9148683d3f6142e73f4e
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:52:27 2023 +0000

    Remove unneeded line

commit 27c7354c1bcc5aff81fed7623ff56d93af4201da
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 15:08:11 2023 +0000

    Run clang-format on patch

commit 34f9b474e3e10f670b6bb0359f2b81ff8275d5a1
Author: Rin Dobrescu <[email protected]>
Date:   Mon Nov 6 12:20:21 2023 +0000

    [MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop
Hmm, the PR looks a little odd now! There is a single commit in the PR, that is itself a squashed commit. |
Merged after rebase which led to the commits from main appearing on my PR. Had to squash them to fix it which is why there's only one commit now. |
LGTM! The test changes look favourable in almost all cases and the code change makes sense to me.
… loop (llvm#71403) When there is a COPY instruction in the loop with other uses, we want to hoist the COPY, which in turn leads to the users being hoisted as well. Co-authored-by David Green : [email protected]
@@ -1262,6 +1262,18 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
     return false;
   }
// If we have a COPY with other uses in the loop, hoist to allow the users to |
For info: Downstream this caused regressions in some of our benchmarks. Not sure if you need to care about that (we will probably guard this with a check for our downstream target), but thought it might be nice to mention it.
Haven't fully investigated what happens, but I think that hoisting the COPY increases register pressure, resulting in spills. The COPY instructions I see hoisted in that benchmark fall into two categories:
%228:gn32 = COPY %143:pn ; cross register bank copy
%245:gn16 = COPY %227.lo:gn32 ; extracting a subreg
So for example hoisting the cross register bank COPY results in increasing the register pressure for the general gn registers in the path leading up to the use. Similarly, by hoisting the subreg extract the register pressure on the gn registers increase.
I wonder if the heuristic here should perhaps look more closely at the using instructions to see if they can actually be hoisted as well? After all, we are on the path where CanCauseHighRegPressure has returned true, so traditionally we have been more careful here and only hoisted trivially rematerializable MIs.
Hello. From what I've seen in our benchmarks this has been positive, but there is often some noise from hoisting/sinking. You are right that this could be more conservative, but in our case cross-register-bank copies are relatively expensive and we would want to hoist them if we could. I'm not sure about the subreg extracts, but a lot of COPYs are removed prior to register allocation, and it would be good if the allocator knew where best to re-add them, if needed.
It is a bit tricky of course, depending on the properties of the target. For a VLIW target like ours the COPY could be really cheap, at least if it can be executed in parallel with something else without increasing latency. A cross-bank copy might actually be free (zero cycles) even if it is inside the loop. OTOH, if we need to spill/reload a register, that could be much more expensive, even if the COPY is hoisted to some outer loop.
You may want to check my PR #81735, which is intended to fix the regression caused by spilling in the AMDGPU target. It might help you too.