
[MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop #71403

Merged
merged 5 commits into from
Nov 20, 2023
12 changes: 12 additions & 0 deletions llvm/lib/CodeGen/MachineLICM.cpp
@@ -1262,6 +1262,18 @@ bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
return false;
}

// If we have a COPY with other uses in the loop, hoist to allow the users to
Collaborator

For info: Downstream this caused regressions in some of our benchmarks. Not sure if you need to care about that (we will probably guard this with a check for our downstream target), but thought it might be nice to mention it.

I haven't fully investigated what happens, but I think that hoisting the COPY increases register pressure, resulting in spills. The COPY instructions I see hoisted in that benchmark fall into two categories:

    %228:gn32 = COPY %143:pn    ;  cross register bank copy
    %245:gn16 = COPY %227.lo:gn32   ; extracting a subreg

So, for example, hoisting the cross register bank COPY increases the register pressure for the general gn registers in the path leading up to the use. Similarly, hoisting the subreg extract increases the register pressure on the gn registers.

I wonder if the heuristic here should perhaps look closer at the using instructions to see whether they can actually be hoisted as well? After all, we are in the path where CanCauseHighRegPressure has returned true. So traditionally we have been more careful here and only hoisted trivially rematerializable MIs.

Collaborator

Hello. From what I've seen in our benchmarks this has been positive, but there is often some noise from hoisting/sinking. You are right that this could be more conservative, but in our case cross register bank copies will be relatively expensive and we would want to hoist them if we could. I'm not sure about the subreg extracts, but a lot of COPYs are removed prior to register allocation, and it would be good if the register allocator knew where best to re-add them, if needed.

Collaborator

It is a bit tricky of course, depending on the properties of the target. For a VLIW target like ours the COPY could be really cheap, at least if it can be executed in parallel with something else without increasing latency. A cross-bank copy might actually be free (zero cycles) even if it is inside the loop. OTOH, if we need to spill/reload a register, then that could be much more expensive, even if it is hoisted to some outer loop.

Contributor

You may want to check my PR #81735 that is intended to fix the regression caused by spilling in AMDGPU target. It might help you too.
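
For illustration only, the more conservative check raised in this thread (hoist the COPY only when one of its in-loop users could itself be hoisted afterwards) might be modelled roughly as below. This is a standalone toy sketch, not LLVM code: `ToyInstr`, `isInvariantReg`, and `userWouldAlsoHoist` are hypothetical names, and "hoistable" is simplified to "no side effects and every other operand is defined outside the loop".

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a machine instruction; not an LLVM type.
struct ToyInstr {
  int Def;               // virtual register defined, -1 if none
  std::vector<int> Uses; // virtual registers read
  bool InLoop;           // true if the instruction is inside the loop
  bool HasSideEffects;   // e.g. a load, store, or call that must stay put
};

// Assumption: a register is loop-invariant if nothing in the loop defines it.
bool isInvariantReg(int Reg, const std::vector<ToyInstr> &Prog) {
  for (const ToyInstr &I : Prog)
    if (I.InLoop && I.Def == Reg)
      return false;
  return true;
}

// Returns true if some in-loop user of the copy's result has no side effects
// and reads only invariant registers apart from the copy's result, i.e. it
// would become hoistable once the copy itself has been moved out.
bool userWouldAlsoHoist(const ToyInstr &Copy,
                        const std::vector<ToyInstr> &Prog) {
  for (const ToyInstr &U : Prog) {
    if (!U.InLoop || U.HasSideEffects)
      continue;
    bool ReadsCopy = false;
    bool OthersInvariant = true;
    for (int R : U.Uses) {
      if (R == Copy.Def)
        ReadsCopy = true;
      else if (!isInvariantReg(R, Prog))
        OthersInvariant = false;
    }
    if (ReadsCopy && OthersInvariant)
      return true;
  }
  return false;
}

// Small usage demo: the only user also reads a value loaded inside the loop,
// so hoisting the copy would not unlock any further hoisting.
void demoStricterCheck() {
  std::vector<ToyInstr> Prog = {
      {1, {0}, true, false},    // %1 = COPY %0 (%0 defined outside the loop)
      {3, {}, true, true},      // %3 = load, defined in the loop
      {2, {1, 3}, true, false}, // %2 = op %1, %3
  };
  assert(!userWouldAlsoHoist(Prog[0], Prog));
}
```

Under this model the copy stays in the loop whenever hoisting it would only lengthen its live range without enabling anything else, which is the register pressure concern described above.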

// also be hoisted.
  if (MI.isCopy() && MI.getOperand(0).isReg() &&
      MI.getOperand(0).getReg().isVirtual() && MI.getOperand(1).isReg() &&
      MI.getOperand(1).getReg().isVirtual() &&
      IsLoopInvariantInst(MI, CurLoop) &&
      any_of(MRI->use_nodbg_instructions(MI.getOperand(0).getReg()),
             [&CurLoop](MachineInstr &UseMI) {
               return CurLoop->contains(&UseMI);
             }))
    return true;

  // High register pressure situation, only hoist if the instruction is going
  // to be remat'ed.
  if (!isTriviallyReMaterializable(MI) &&
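
As a reading aid, the condition added in the hunk above can be modelled outside LLVM roughly as follows. This is a minimal sketch under stated assumptions: `ToyReg`, `ToyCopy`, `ToyUse`, and `shouldHoistCopy` are hypothetical names, `LoopInvariant` stands in for the result of `IsLoopInvariantInst`, and the `IsDebug` flag stands in for the filtering that `use_nodbg_instructions` performs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins; none of these types exist in LLVM.
struct ToyReg {
  int Id;
  bool IsVirtual; // false models a physical register
};

struct ToyCopy {
  bool IsCopy;        // models MI.isCopy()
  ToyReg Dst;         // models operand 0
  ToyReg Src;         // models operand 1
  bool LoopInvariant; // models IsLoopInvariantInst(MI, CurLoop)
};

struct ToyUse {
  int Reg;      // register read by the using instruction
  bool InLoop;  // models CurLoop->contains(&UseMI)
  bool IsDebug; // debug uses are ignored
};

// Mirrors the shape of the new check: a loop-invariant virtual-to-virtual
// COPY is considered worth hoisting if at least one non-debug user of its
// result is inside the loop.
bool shouldHoistCopy(const ToyCopy &MI, const std::vector<ToyUse> &Uses) {
  if (!MI.IsCopy || !MI.Dst.IsVirtual || !MI.Src.IsVirtual ||
      !MI.LoopInvariant)
    return false;
  return std::any_of(Uses.begin(), Uses.end(), [&MI](const ToyUse &U) {
    return !U.IsDebug && U.Reg == MI.Dst.Id && U.InLoop;
  });
}

// Small usage demo: one in-loop user of the copy's result is enough.
void demoShouldHoistCopy() {
  ToyCopy MI = {true, {1, true}, {0, true}, true};
  std::vector<ToyUse> Uses = {{1, false, false}, {1, true, false}};
  assert(shouldHoistCopy(MI, Uses));
}
```

Note that the model, like the hunk, says nothing about register pressure: the check returns true before the rematerialization test that follows it.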
126 changes: 63 additions & 63 deletions llvm/test/CodeGen/AArch64/tbl-loops.ll
@@ -52,19 +52,19 @@ define void @loop1(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: b.eq .LBB0_8
; CHECK-NEXT: .LBB0_6: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB0_7: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: fmov s2, w11
; CHECK-NEXT: ldr s1, [x8], #4
; CHECK-NEXT: fcmp s1, s2
; CHECK-NEXT: fcsel s2, s2, s1, gt
; CHECK-NEXT: fcmp s1, #0.0
; CHECK-NEXT: fcsel s1, s0, s2, mi
; CHECK-NEXT: ldr s2, [x8], #4
; CHECK-NEXT: fcmp s2, s1
; CHECK-NEXT: fcsel s3, s1, s2, gt
; CHECK-NEXT: fcmp s2, #0.0
; CHECK-NEXT: fcsel s2, s0, s3, mi
; CHECK-NEXT: subs w10, w10, #1
; CHECK-NEXT: fcvtzs w12, s1
; CHECK-NEXT: strb w12, [x9], #1
; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: strb w11, [x9], #1
; CHECK-NEXT: b.ne .LBB0_7
; CHECK-NEXT: .LBB0_8: // %for.cond.cleanup
; CHECK-NEXT: ret
@@ -165,25 +165,25 @@ define void @loop2(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB1_5: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB1_6: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldp s1, s3, [x8], #8
; CHECK-NEXT: fmov s2, w11
; CHECK-NEXT: fcmp s1, s2
; CHECK-NEXT: fcsel s4, s2, s1, gt
; CHECK-NEXT: fcmp s1, #0.0
; CHECK-NEXT: fcsel s1, s0, s4, mi
; CHECK-NEXT: fcmp s3, s2
; CHECK-NEXT: fcsel s2, s2, s3, gt
; CHECK-NEXT: ldp s2, s3, [x8], #8
; CHECK-NEXT: fcmp s2, s1
; CHECK-NEXT: fcsel s4, s1, s2, gt
; CHECK-NEXT: fcmp s2, #0.0
; CHECK-NEXT: fcsel s2, s0, s4, mi
; CHECK-NEXT: fcmp s3, s1
; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: fcvtzs w12, s1
; CHECK-NEXT: fcsel s2, s0, s2, mi
; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: fcsel s3, s0, s4, mi
; CHECK-NEXT: subs w10, w10, #1
; CHECK-NEXT: strb w12, [x9]
; CHECK-NEXT: fcvtzs w13, s2
; CHECK-NEXT: strb w13, [x9, #1]
; CHECK-NEXT: strb w11, [x9]
; CHECK-NEXT: fcvtzs w12, s3
; CHECK-NEXT: strb w12, [x9, #1]
; CHECK-NEXT: add x9, x9, #2
; CHECK-NEXT: b.ne .LBB1_6
; CHECK-NEXT: .LBB1_7: // %for.cond.cleanup
@@ -380,33 +380,33 @@ define void @loop3(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB2_7: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB2_8: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldp s1, s3, [x8]
; CHECK-NEXT: fmov s2, w11
; CHECK-NEXT: fcmp s1, s2
; CHECK-NEXT: fcsel s4, s2, s1, gt
; CHECK-NEXT: fcmp s1, #0.0
; CHECK-NEXT: fcsel s1, s0, s4, mi
; CHECK-NEXT: fcmp s3, s2
; CHECK-NEXT: fcsel s4, s2, s3, gt
; CHECK-NEXT: ldp s2, s3, [x8]
; CHECK-NEXT: fcmp s2, s1
; CHECK-NEXT: fcsel s4, s1, s2, gt
; CHECK-NEXT: fcmp s2, #0.0
; CHECK-NEXT: fcsel s2, s0, s4, mi
; CHECK-NEXT: fcmp s3, s1
; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: ldr s3, [x8, #8]
; CHECK-NEXT: fcvtzs w12, s1
; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: add x8, x8, #12
; CHECK-NEXT: fcsel s4, s0, s4, mi
; CHECK-NEXT: fcmp s3, s2
; CHECK-NEXT: strb w12, [x9]
; CHECK-NEXT: fcsel s2, s2, s3, gt
; CHECK-NEXT: fcmp s3, s1
; CHECK-NEXT: strb w11, [x9]
; CHECK-NEXT: fcsel s5, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: fcvtzs w13, s4
; CHECK-NEXT: fcsel s2, s0, s2, mi
; CHECK-NEXT: fcvtzs w12, s4
; CHECK-NEXT: fcsel s3, s0, s5, mi
; CHECK-NEXT: subs w10, w10, #1
; CHECK-NEXT: strb w13, [x9, #1]
; CHECK-NEXT: fcvtzs w14, s2
; CHECK-NEXT: strb w14, [x9, #2]
; CHECK-NEXT: strb w12, [x9, #1]
; CHECK-NEXT: fcvtzs w13, s3
; CHECK-NEXT: strb w13, [x9, #2]
; CHECK-NEXT: add x9, x9, #3
; CHECK-NEXT: b.ne .LBB2_8
; CHECK-NEXT: .LBB2_9: // %for.cond.cleanup
@@ -549,39 +549,39 @@ define void @loop4(ptr noalias nocapture noundef writeonly %dst, ptr nocapture n
; CHECK-NEXT: mov x9, x0
; CHECK-NEXT: .LBB3_5: // %for.body.preheader1
; CHECK-NEXT: movi d0, #0000000000000000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: mov w11, #1132396544 // =0x437f0000
; CHECK-NEXT: sub w10, w2, w10
; CHECK-NEXT: fmov s1, w11
; CHECK-NEXT: .LBB3_6: // %for.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldp s1, s3, [x8]
; CHECK-NEXT: fmov s2, w11
; CHECK-NEXT: fcmp s1, s2
; CHECK-NEXT: fcsel s4, s2, s1, gt
; CHECK-NEXT: fcmp s1, #0.0
; CHECK-NEXT: fcsel s1, s0, s4, mi
; CHECK-NEXT: fcmp s3, s2
; CHECK-NEXT: fcsel s4, s2, s3, gt
; CHECK-NEXT: ldp s2, s3, [x8]
; CHECK-NEXT: fcmp s2, s1
; CHECK-NEXT: fcsel s4, s1, s2, gt
; CHECK-NEXT: fcmp s2, #0.0
; CHECK-NEXT: fcsel s2, s0, s4, mi
; CHECK-NEXT: fcmp s3, s1
; CHECK-NEXT: fcsel s4, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: ldp s3, s5, [x8, #8]
; CHECK-NEXT: fcvtzs w12, s1
; CHECK-NEXT: fcvtzs w11, s2
; CHECK-NEXT: add x8, x8, #16
; CHECK-NEXT: fcsel s4, s0, s4, mi
; CHECK-NEXT: fcmp s3, s2
; CHECK-NEXT: strb w12, [x9]
; CHECK-NEXT: fcsel s6, s2, s3, gt
; CHECK-NEXT: fcmp s3, s1
; CHECK-NEXT: strb w11, [x9]
; CHECK-NEXT: fcsel s6, s1, s3, gt
; CHECK-NEXT: fcmp s3, #0.0
; CHECK-NEXT: fcvtzs w13, s4
; CHECK-NEXT: fcvtzs w12, s4
; CHECK-NEXT: fcsel s3, s0, s6, mi
; CHECK-NEXT: fcmp s5, s2
; CHECK-NEXT: strb w13, [x9, #1]
; CHECK-NEXT: fcsel s2, s2, s5, gt
; CHECK-NEXT: fcmp s5, s1
; CHECK-NEXT: strb w12, [x9, #1]
; CHECK-NEXT: fcsel s6, s1, s5, gt
; CHECK-NEXT: fcmp s5, #0.0
; CHECK-NEXT: fcvtzs w14, s3
; CHECK-NEXT: fcsel s2, s0, s2, mi
; CHECK-NEXT: fcvtzs w13, s3
; CHECK-NEXT: fcsel s5, s0, s6, mi
; CHECK-NEXT: subs w10, w10, #1
; CHECK-NEXT: strb w14, [x9, #2]
; CHECK-NEXT: fcvtzs w15, s2
; CHECK-NEXT: strb w15, [x9, #3]
; CHECK-NEXT: strb w13, [x9, #2]
; CHECK-NEXT: fcvtzs w14, s5
; CHECK-NEXT: strb w14, [x9, #3]
; CHECK-NEXT: add x9, x9, #4
; CHECK-NEXT: b.ne .LBB3_6
; CHECK-NEXT: .LBB3_7: // %for.cond.cleanup
34 changes: 18 additions & 16 deletions llvm/test/CodeGen/AArch64/zext-to-tbl.ll
@@ -2756,37 +2756,39 @@ exit:
define i32 @test_pr62620_widening_instr(ptr %p1, ptr %p2, i64 %lx, i32 %h) {
; CHECK-LABEL: test_pr62620_widening_instr:
; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: lsl x8, x2, #4
; CHECK-NEXT: ldr q0, [x0, x8]
; CHECK-NEXT: ldr q1, [x1, x8]
; CHECK-NEXT: lsl x9, x2, #4
; CHECK-NEXT: mov x8, x0
; CHECK-NEXT: mov w0, wzr
; CHECK-NEXT: ldr q0, [x8, x9]
; CHECK-NEXT: ldr q1, [x1, x9]
; CHECK-NEXT: uabdl.8h v2, v0, v1
; CHECK-NEXT: uabal2.8h v2, v0, v1
; CHECK-NEXT: uaddlv.8h s0, v2
; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: LBB23_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: uabdl.8h v2, v0, v1
; CHECK-NEXT: subs w3, w3, #1
; CHECK-NEXT: uabal2.8h v2, v0, v1
; CHECK-NEXT: uaddlv.8h s2, v2
; CHECK-NEXT: fmov w8, s2
; CHECK-NEXT: add w0, w8, w0
; CHECK-NEXT: b.ne LBB23_1
; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret
;
; CHECK-BE-LABEL: test_pr62620_widening_instr:
; CHECK-BE: // %bb.0: // %entry
; CHECK-BE-NEXT: lsl x8, x2, #4
; CHECK-BE-NEXT: add x9, x0, x8
; CHECK-BE-NEXT: add x8, x1, x8
; CHECK-BE-NEXT: lsl x9, x2, #4
; CHECK-BE-NEXT: mov x8, x0
; CHECK-BE-NEXT: mov w0, wzr
; CHECK-BE-NEXT: ld1 { v0.16b }, [x9]
; CHECK-BE-NEXT: ld1 { v1.16b }, [x8]
; CHECK-BE-NEXT: add x8, x8, x9
; CHECK-BE-NEXT: add x9, x1, x9
; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
; CHECK-BE-NEXT: uabdl v2.8h, v0.8b, v1.8b
; CHECK-BE-NEXT: uabal2 v2.8h, v0.16b, v1.16b
; CHECK-BE-NEXT: uaddlv s0, v2.8h
; CHECK-BE-NEXT: fmov w8, s0
; CHECK-BE-NEXT: .LBB23_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: uabdl v2.8h, v0.8b, v1.8b
; CHECK-BE-NEXT: subs w3, w3, #1
; CHECK-BE-NEXT: uabal2 v2.8h, v0.16b, v1.16b
; CHECK-BE-NEXT: uaddlv s2, v2.8h
; CHECK-BE-NEXT: fmov w8, s2
; CHECK-BE-NEXT: add w0, w8, w0
; CHECK-BE-NEXT: b.ne .LBB23_1
; CHECK-BE-NEXT: // %bb.2: // %exit