[SystemZ] Fix compile time regression in adjustInliningThreshold(). #137527

Conversation
@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)

Changes

Instead of always iterating over all GlobalVariables in the Module to find the case where both Caller and Callee use the same GV heavily, first scan the Callee (only if it has fewer than 200 instructions) for all GVs used more than 10 times, and then do the counting in the Caller for just those relevant GVs.

The limit of 200 instructions makes sense since this heuristic aims to inline a relatively small function that uses a GV more than 10 times. The limit changed only 7 files across 3 SPEC benchmarks. Previously only perlbench performance was affected, and perl is not among these 3 changed benchmarks, so there should not be any difference to consider here. SPEC runs seem to confirm this ("full/home-dir").

Compile time across SPEC shows no difference compared to main. This does, however, resolve the compile-time problem with zig: on main (compared to removing the heuristic) it is a 380% increase, but with this change only a 2.4% increase (total user compile time with opt).

Fixes #134714.

Full diff: https://github.com/llvm/llvm-project/pull/137527.diff

1 File Affected:
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index ee142ccd20e20..78f5154229f55 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -80,7 +80,6 @@ unsigned SystemZTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
const Function *Callee = CB->getCalledFunction();
if (!Callee)
return 0;
- const Module *M = Caller->getParent();
// Increase the threshold if an incoming argument is used only as a memcpy
// source.
@@ -92,29 +91,38 @@ unsigned SystemZTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
}
}
- // Give bonus for globals used much in both caller and callee.
- std::set<const GlobalVariable *> CalleeGlobals;
- std::set<const GlobalVariable *> CallerGlobals;
- for (const GlobalVariable &Global : M->globals())
- for (const User *U : Global.users())
- if (const Instruction *User = dyn_cast<Instruction>(U)) {
- if (User->getParent()->getParent() == Callee)
- CalleeGlobals.insert(&Global);
- if (User->getParent()->getParent() == Caller)
- CallerGlobals.insert(&Global);
+ // Give bonus for globals used much in both caller and a relatively small
+ // callee.
+ if (Callee->getInstructionCount() < 200) {
+ std::map<const Value *, unsigned> Ptr2NumUses;
+ for (auto &BB : *Callee)
+ for (auto &I : BB) {
+ if (const auto *SI = dyn_cast<StoreInst>(&I)) {
+ if (!SI->isVolatile())
+ Ptr2NumUses[SI->getPointerOperand()]++;
+ } else if (const auto *LI = dyn_cast<LoadInst>(&I)) {
+ if (!LI->isVolatile())
+ Ptr2NumUses[LI->getPointerOperand()]++;
+ } else if (const auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
+ unsigned NumStores = 0, NumLoads = 0;
+ countNumMemAccesses(GEP, NumStores, NumLoads, Callee);
+ Ptr2NumUses[GEP->getPointerOperand()] += NumLoads + NumStores;
+ }
}
- for (auto *GV : CalleeGlobals)
- if (CallerGlobals.count(GV)) {
- unsigned CalleeStores = 0, CalleeLoads = 0;
- unsigned CallerStores = 0, CallerLoads = 0;
- countNumMemAccesses(GV, CalleeStores, CalleeLoads, Callee);
- countNumMemAccesses(GV, CallerStores, CallerLoads, Caller);
- if ((CalleeStores + CalleeLoads) > 10 &&
- (CallerStores + CallerLoads) > 10) {
- Bonus = 1000;
- break;
+
+ for (auto I : Ptr2NumUses) {
+ const Value *Ptr = I.first;
+ unsigned NumCalleeUses = I.second;
+ if (NumCalleeUses > 10 && isa<GlobalVariable>(Ptr)) {
+ unsigned CallerStores = 0, CallerLoads = 0;
+ countNumMemAccesses(Ptr, CallerStores, CallerLoads, Caller);
+ if (CallerStores + CallerLoads > 10) {
+ Bonus = 1000;
+ break;
+ }
}
}
+ }
// Give bonus when Callee accesses an Alloca of Caller heavily.
unsigned NumStores = 0;
@alexrp Does this help on your side as well?
CallerGlobals.insert(&Global);
// Give bonus for globals used much in both caller and a relatively small
// callee.
if (Callee->getInstructionCount() < 200) {
getInstructionCount() is going to iterate over the whole function to determine the instruction count. It's usually better to always do the analysis and just bail out if it inspects too many instructions.
ok, using a counter instead.
} else if (const auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
  unsigned NumStores = 0, NumLoads = 0;
  countNumMemAccesses(GEP, NumStores, NumLoads, Callee);
  Ptr2NumUses[GEP->getPointerOperand()] += NumLoads + NumStores;
It seems like nothing actually cares about loads and stores separately? Make things simpler by having countNumMemAccesses return the total count?
There actually is one a bit further down: separate checks for the number of loads and stores. It may be that these could be combined there as well, but that should probably wait and be benchmarked after this.
@nikic thanks for review! I followed your suggestions and indeed see a further speedup: zig.bc is now regressing <0.5% with this heuristic (compared to removing it), so nearly gone. And average over SPEC for InlinerPass is now 2% better than with main (1.46% vs 1.49%).
// callee.
unsigned InstrCount = 0;
SmallDenseMap<const Value *, unsigned> Ptr2NumUses;
for (auto &I : instructions(Callee)) {
Do we need to watch out for debug instructions here? We don't want the count == 200 check to differ between a debug and a non-debug compile ...
I started out with that actually, but @nikic suggested it is not needed with the new debug refs, which were committed for SystemZ a month ago. (Haven't double-checked though.)
This is due to the switch from debug intrinsics to debug records. It's a separate change from debug instr ref (which is a backend thing).
I see, thanks. Would this still be the case for a LLVM 20 backport? (Not sure when this change came in ...)
Yes, it's already in LLVM 20.
LGTM
LGTM as well, thanks! This should also go into LLVM 20 to fix the regression.
/cherry-pick 98b895d
/pull-request #137628
…lvm#137527) Instead of always iterating over all GlobalVariable:s in the Module to find the case where both Caller and Callee is using the same GV heavily, first scan Callee (only if less than 200 instructions) for all GVs used more than 10 times, and then do the counting for the Caller for just those relevant GVs. The limit of 200 instructions makes sense as this aims to inline a relatively small function using a GV +10 times. This resolves the compile time problem with zig where it is on main (compared to removing the heuristic) a 380% increase, but with this change <0.5% increase (total user compile time with opt). Fixes llvm#134714. (cherry picked from commit 98b895d)