[LV] Change loops' interleave count computation #73766

Merged
merged 2 commits into from
Jan 4, 2024
54 changes: 39 additions & 15 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5579,21 +5579,45 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
     MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor;
   }

-  // If trip count is known or estimated compile time constant, limit the
-  // interleave count to be less than the trip count divided by VF, provided it
-  // is at least 1.
-  //
-  // For scalable vectors we can't know if interleaving is beneficial. It may
-  // not be beneficial for small loops if none of the lanes in the second vector
-  // iterations is enabled. However, for larger loops, there is likely to be a
-  // similar benefit as for fixed-width vectors. For now, we choose to leave
-  // the InterleaveCount as if vscale is '1', although if some information about
-  // the vector is known (e.g. min vector size), we can make a better decision.
-  if (BestKnownTC) {
-    MaxInterleaveCount =
-        std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
-    // Make sure MaxInterleaveCount is greater than 0.
-    MaxInterleaveCount = std::max(1u, MaxInterleaveCount);
+  unsigned EstimatedVF = VF.getKnownMinValue();
+  if (VF.isScalable()) {
+    if (std::optional<unsigned> VScale = getVScaleForTuning(TheLoop, TTI))
+      EstimatedVF *= *VScale;
+  }
+  assert(EstimatedVF >= 1 && "Estimated VF shouldn't be less than 1");
+
+  unsigned KnownTC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
+  if (KnownTC) {
+    // If trip count is known we select between two prospective ICs, where
+    // 1) the aggressive IC is capped by the trip count divided by VF
+    // 2) the conservative IC is capped by the trip count divided by (VF * 2)
+    // The final IC is selected in a way that the epilogue loop trip count is
+    // minimized while maximizing the IC itself, so that we either run the
+    // vector loop at least once if it generates a small epilogue loop, or else
+    // we run the vector loop at least twice.
+
+    unsigned InterleaveCountUB = bit_floor(
+        std::max(1u, std::min(KnownTC / EstimatedVF, MaxInterleaveCount)));
+    unsigned InterleaveCountLB = bit_floor(std::max(
+        1u, std::min(KnownTC / (EstimatedVF * 2), MaxInterleaveCount)));
+    MaxInterleaveCount = InterleaveCountLB;
+
+    if (InterleaveCountUB != InterleaveCountLB) {
+      unsigned TailTripCountUB = (KnownTC % (EstimatedVF * InterleaveCountUB));
+      unsigned TailTripCountLB = (KnownTC % (EstimatedVF * InterleaveCountLB));
+      // If both produce same scalar tail, maximize the IC to do the same work
+      // in fewer vector loop iterations
+      if (TailTripCountUB == TailTripCountLB)
+        MaxInterleaveCount = InterleaveCountUB;
+    }
+  } else if (BestKnownTC) {
+    // If trip count is an estimated compile time constant, limit the
+    // IC to be capped by the trip count divided by VF * 2, such that the vector
+    // loop runs at least twice to make interleaving seem profitable when there
+    // is an epilogue loop present. Since exact Trip count is not known we
+    // choose to be conservative in our IC estimate.
+    MaxInterleaveCount = bit_floor(std::max(
+        1u, std::min(*BestKnownTC / (EstimatedVF * 2), MaxInterleaveCount)));
   }

assert(MaxInterleaveCount > 0 &&
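
The known-trip-count branch in this hunk can be read as a small pure function. Below is a minimal standalone sketch of that selection, not the LLVM code itself: `bitFloor` and `selectKnownTCInterleaveCap` are hypothetical names introduced here, `bitFloor` is a hand-written stand-in for the `bit_floor` call in the patch, and the sketch assumes `EstimatedVF >= 1`.

```cpp
#include <algorithm>

// Hypothetical stand-in for bit_floor: the largest power of two that is
// less than or equal to X (0 when X is 0).
unsigned bitFloor(unsigned X) {
  unsigned Result = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= X; Bit <<= 1)
    Result = Bit;
  return Result;
}

// Sketch of the cap computed when the trip count is an exact constant.
// Assumes EstimatedVF >= 1.
unsigned selectKnownTCInterleaveCap(unsigned KnownTC, unsigned EstimatedVF,
                                    unsigned MaxInterleaveCount) {
  // Aggressive candidate: the vector loop body runs at least once.
  unsigned UB = bitFloor(
      std::max(1u, std::min(KnownTC / EstimatedVF, MaxInterleaveCount)));
  // Conservative candidate: the vector loop body runs at least twice.
  unsigned LB = bitFloor(
      std::max(1u, std::min(KnownTC / (EstimatedVF * 2), MaxInterleaveCount)));
  if (UB == LB)
    return LB;
  // Prefer the larger candidate only when it leaves the same scalar tail.
  unsigned TailUB = KnownTC % (EstimatedVF * UB);
  unsigned TailLB = KnownTC % (EstimatedVF * LB);
  return TailUB == TailLB ? UB : LB;
}
```

Assuming VF 16 and a target cap of 8 (the cap the AArch64 tests in this PR appear to use), the sketch yields 1 for a trip count of 48 and 2 for a trip count of 100, in line with the updated known-TC expectations. For a trip count of 64 both candidates leave no tail, so the larger value 4 is kept. The value is only an upper bound on `MaxInterleaveCount`; later stages of `selectInterleaveCount` may still lower the final IC.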
@@ -5,9 +5,9 @@ target triple = "aarch64-linux-gnu"

%pair = type { i8, i8 }

-; TODO: For a loop with a profile-guided estimated TC of 32, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 32, when the auto-vectorizer chooses VF 16,
; it should conservatively choose IC 1 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_profile_tc_32(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -29,9 +29,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 33, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 33, when the auto-vectorizer chooses VF 16,
; it should conservatively choose IC 1 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_profile_tc_33(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -53,9 +53,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 48, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 48, when the auto-vectorizer chooses VF 16,
; it should conservatively choose IC 1 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 3)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_profile_tc_48(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -77,9 +77,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 63, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 63, when the auto-vectorizer chooses VF 16,
; it should conservatively choose IC 1 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 3)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_profile_tc_63(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -101,9 +101,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 64, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 64, when the auto-vectorizer chooses VF 16,
; it should choose conservatively IC 2 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
define void @loop_with_profile_tc_64(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -125,9 +125,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 100, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 100, when the auto-vectorizer chooses VF 16,
; it should choose conservatively IC 2 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 6)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
define void @loop_with_profile_tc_100(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -149,9 +149,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 128, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 128, when the auto-vectorizer chooses VF 16,
; it should choose conservatively IC 4 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
define void @loop_with_profile_tc_128(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -173,9 +173,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 129, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 129, when the auto-vectorizer chooses VF 16,
; it should choose conservatively IC 4 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
define void @loop_with_profile_tc_129(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -197,9 +197,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 180, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 180, when the auto-vectorizer chooses VF 16,
; it should choose conservatively IC 4 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
define void @loop_with_profile_tc_180(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -221,9 +221,9 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 193, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 193, when the auto-vectorizer chooses VF 16,
; it should choose conservatively IC 4 so that the vector loop runs twice at least
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
define void @loop_with_profile_tc_193(ptr noalias %p, ptr noalias %q, i64 %n) {
entry:
br label %for.body
@@ -245,7 +245,7 @@ for.end:
ret void
}

-; TODO: For a loop with a profile-guided estimated TC of 1000, when the auto-vectorizer chooses VF 16,
+; For a loop with a profile-guided estimated TC of 1000, when the auto-vectorizer chooses VF 16,
; the IC will be capped by the target-specific maximum interleave count
; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
define void @loop_with_profile_tc_1000(ptr noalias %p, ptr noalias %q, i64 %n) {
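
The expectations in these profile-guided tests follow from the conservative branch of the patch, which halves the estimated trip count before dividing by VF. The sketch below is a hedged, standalone rendering of that arithmetic: `bitFloor` and `selectEstimatedTCInterleaveCap` are hypothetical names, `bitFloor` stands in for `bit_floor`, and the cap of 8 mentioned afterwards is an assumption inferred from the TC 1000 test rather than a value stated in the diff.

```cpp
#include <algorithm>

// Hypothetical stand-in for bit_floor: the largest power of two that is
// less than or equal to X (0 when X is 0).
unsigned bitFloor(unsigned X) {
  unsigned Result = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= X; Bit <<= 1)
    Result = Bit;
  return Result;
}

// Sketch of the cap computed when the trip count is only an estimate
// (for example from profile data). Assumes EstimatedVF >= 1.
unsigned selectEstimatedTCInterleaveCap(unsigned EstimatedTC,
                                        unsigned EstimatedVF,
                                        unsigned MaxInterleaveCount) {
  // Dividing by VF * 2 keeps the vector loop running at least twice.
  unsigned Candidate =
      std::min(EstimatedTC / (EstimatedVF * 2), MaxInterleaveCount);
  // Never go below 1, and round down to a power of two.
  return bitFloor(std::max(1u, Candidate));
}
```

With VF 16 and an assumed cap of 8, estimated trip counts of 32 through 63 give 1, 64 and 100 give 2, 128 through 193 give 4, and 1000 reaches the cap of 8, which is the pattern the updated CHECK lines in this file encode.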
@@ -77,9 +77,9 @@ for.end:
ret void
}

-; TODO: For this loop with known TC of 48, when the auto-vectorizer chooses VF 16, it should choose
+; For this loop with known TC of 48, when the auto-vectorizer chooses VF 16, it should choose
; IC 1 since there will be no remainder loop that needs to run after the vector loop.
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 3)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_tc_48(ptr noalias %p, ptr noalias %q) {
entry:
br label %for.body
@@ -101,9 +101,9 @@ for.end:
ret void
}

-; TODO: For this loop with known TC of 49, when the auto-vectorizer chooses VF 16, it should choose
+; For this loop with known TC of 49, when the auto-vectorizer chooses VF 16, it should choose
; IC 1 since a remainder loop TC of 1 is more efficient than remainder loop TC of 17 with IC 2
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 3)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_tc_49(ptr noalias %p, ptr noalias %q) {
entry:
br label %for.body
@@ -125,9 +125,9 @@ for.end:
ret void
}

-; TODO: For this loop with known TC of 55, when the auto-vectorizer chooses VF 16, it should choose
+; For this loop with known TC of 55, when the auto-vectorizer chooses VF 16, it should choose
; IC 1 since a remainder loop TC of 7 is more efficient than remainder loop TC of 23 with IC 2
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 3)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
define void @loop_with_tc_55(ptr noalias %p, ptr noalias %q) {
entry:
br label %for.body
@@ -149,9 +149,9 @@ for.end:
ret void
}

-; TODO: For this loop with known TC of 100, when the auto-vectorizer chooses VF 16, it should choose
+; For this loop with known TC of 100, when the auto-vectorizer chooses VF 16, it should choose
; IC 2 since a remainder loop TC of 4 is more efficient than remainder loop TC of 36 with IC 4
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 6)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
define void @loop_with_tc_100(ptr noalias %p, ptr noalias %q) {
entry:
br label %for.body
@@ -245,9 +245,9 @@ for.end:
ret void
}

-; TODO: For this loop with known TC of 193, when the auto-vectorizer chooses VF 16, it should choose
+; For this loop with known TC of 193, when the auto-vectorizer chooses VF 16, it should choose
; IC 4 since a remainder loop TC of 1 is more efficient than remainder loop TC of 65 with IC 8
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
define void @loop_with_tc_193(ptr noalias %p, ptr noalias %q) {
entry:
br label %for.body
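
The remainder-loop trip counts quoted in the comments of this test file come from one modulo operation: the iterations left over after the vector loop has consumed whole steps of VF * IC. A minimal sketch of that arithmetic follows; `scalarTailTripCount` is a hypothetical helper name introduced for illustration, and the sketch assumes no extra scalar iteration is forced for other reasons.

```cpp
// Iterations left for the scalar remainder loop when the vector loop
// processes VF * IC elements per step. Assumes VF >= 1 and IC >= 1.
unsigned scalarTailTripCount(unsigned TripCount, unsigned VF, unsigned IC) {
  return TripCount % (VF * IC);
}
```

For a trip count of 100 with VF 16, this gives 4 with IC 2 and 36 with IC 4, which is why the test above expects IC 2. For 193 it gives 1 with IC 4 and 65 with IC 8, and for 48 it gives 0 with IC 1, so no remainder loop runs at all.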
4 changes: 0 additions & 4 deletions llvm/test/Transforms/LoopVectorize/PowerPC/large-loop-rdx.ll
@@ -8,10 +8,6 @@
; CHECK-NEXT: fadd
; CHECK-NEXT: fadd
; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
; CHECK-NEXT: =
; CHECK-NOT: fadd
; CHECK-SAME: >