[SystemZ] Fix compile time regression in adjustInliningThreshold(). #137527


Merged: 4 commits merged into llvm:main on Apr 28, 2025

Conversation

JonPsson1
Contributor

Instead of always iterating over all GlobalVariables in the Module to find the case where both Caller and Callee use the same GV heavily, first scan the Callee (only if it has fewer than 200 instructions) for all GVs used more than 10 times, and then do the use counting in the Caller only for those relevant GVs.

The limit of 200 instructions makes sense, as this heuristic aims to inline a relatively small function that uses a GV more than 10 times. The limit changed only 7 files across 3 SPEC benchmarks. Previously only perlbench performance was affected, and perl is not among these 3 changed benchmarks, so there should not be any difference to consider here. SPEC runs seem to confirm this ("full/home-dir").

Compile time across SPEC shows no difference compared to main. This change does, however, resolve the compile-time problem with zig: the heuristic on main is a 380% increase in total user compile time with opt (compared to removing the heuristic), but with this change it is only a 2.4% increase.

Fixes #134714.
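
In short, instead of walking every global in the module, the heuristic now makes two passes. A minimal sketch condensed from the diff further down (volatile checks, GEP handling and the size limit omitted for brevity):

// Simplified sketch; see the full diff below for the actual patch.
// Pass 1: count load/store pointer uses inside the (small) callee.
std::map<const Value *, unsigned> Ptr2NumUses;
for (const BasicBlock &BB : *Callee)
  for (const Instruction &I : BB) {
    if (const auto *LI = dyn_cast<LoadInst>(&I))
      Ptr2NumUses[LI->getPointerOperand()]++;
    else if (const auto *SI = dyn_cast<StoreInst>(&I))
      Ptr2NumUses[SI->getPointerOperand()]++;
  }

// Pass 2: only for global variables used more than 10 times in the callee,
// count the caller's accesses and give the bonus if it also uses the GV a lot.
for (const auto &[Ptr, NumCalleeUses] : Ptr2NumUses)
  if (NumCalleeUses > 10 && isa<GlobalVariable>(Ptr)) {
    unsigned CallerStores = 0, CallerLoads = 0;
    countNumMemAccesses(Ptr, CallerStores, CallerLoads, Caller);
    if (CallerStores + CallerLoads > 10) {
      Bonus = 1000;
      break;
    }
  }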

@llvmbot
Member

llvmbot commented Apr 27, 2025

@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)


Full diff: https://github.com/llvm/llvm-project/pull/137527.diff

1 file affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp (+29-21)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index ee142ccd20e20..78f5154229f55 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -80,7 +80,6 @@ unsigned SystemZTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
   const Function *Callee = CB->getCalledFunction();
   if (!Callee)
     return 0;
-  const Module *M = Caller->getParent();
 
   // Increase the threshold if an incoming argument is used only as a memcpy
   // source.
@@ -92,29 +91,38 @@ unsigned SystemZTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
     }
   }
 
-  // Give bonus for globals used much in both caller and callee.
-  std::set<const GlobalVariable *> CalleeGlobals;
-  std::set<const GlobalVariable *> CallerGlobals;
-  for (const GlobalVariable &Global : M->globals())
-    for (const User *U : Global.users())
-      if (const Instruction *User = dyn_cast<Instruction>(U)) {
-        if (User->getParent()->getParent() == Callee)
-          CalleeGlobals.insert(&Global);
-        if (User->getParent()->getParent() == Caller)
-          CallerGlobals.insert(&Global);
+  // Give bonus for globals used much in both caller and a relatively small
+  // callee.
+  if (Callee->getInstructionCount() < 200) {
+    std::map<const Value *, unsigned> Ptr2NumUses;
+    for (auto &BB : *Callee)
+      for (auto &I : BB) {
+        if (const auto *SI = dyn_cast<StoreInst>(&I)) {
+          if (!SI->isVolatile())
+            Ptr2NumUses[SI->getPointerOperand()]++;
+        } else if (const auto *LI = dyn_cast<LoadInst>(&I)) {
+          if (!LI->isVolatile())
+            Ptr2NumUses[LI->getPointerOperand()]++;
+        } else if (const auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
+          unsigned NumStores = 0, NumLoads = 0;
+          countNumMemAccesses(GEP, NumStores, NumLoads, Callee);
+          Ptr2NumUses[GEP->getPointerOperand()] += NumLoads + NumStores;
+        }
       }
-  for (auto *GV : CalleeGlobals)
-    if (CallerGlobals.count(GV)) {
-      unsigned CalleeStores = 0, CalleeLoads = 0;
-      unsigned CallerStores = 0, CallerLoads = 0;
-      countNumMemAccesses(GV, CalleeStores, CalleeLoads, Callee);
-      countNumMemAccesses(GV, CallerStores, CallerLoads, Caller);
-      if ((CalleeStores + CalleeLoads) > 10 &&
-          (CallerStores + CallerLoads) > 10) {
-        Bonus = 1000;
-        break;
+
+    for (auto I : Ptr2NumUses) {
+      const Value *Ptr = I.first;
+      unsigned NumCalleeUses = I.second;
+      if (NumCalleeUses > 10 && isa<GlobalVariable>(Ptr)) {
+        unsigned CallerStores = 0, CallerLoads = 0;
+        countNumMemAccesses(Ptr, CallerStores, CallerLoads, Caller);
+        if (CallerStores + CallerLoads > 10) {
+          Bonus = 1000;
+          break;
+        }
       }
     }
+  }
 
   // Give bonus when Callee accesses an Alloca of Caller heavily.
   unsigned NumStores = 0;

@JonPsson1
Contributor Author

@alexrp Does this help on your side as well?

CallerGlobals.insert(&Global);
// Give bonus for globals used much in both caller and a relatively small
// callee.
if (Callee->getInstructionCount() < 200) {
Contributor

getInstructionCount() is going to iterate over the whole function to determine the instruction count. It's usually better to always do the analysis and just bail out if it inspects too many instructions.

Contributor Author

ok, using a counter instead.
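
For illustration, one way such a counter can look, assuming the same 200-instruction limit as above (the actual committed form is the snippet quoted later in the review):

// Sketch only; the committed code may differ in details.
// Walk the callee once and bail out as soon as it turns out to be large,
// instead of paying for a separate getInstructionCount() traversal.
unsigned InstrCount = 0;
SmallDenseMap<const Value *, unsigned> Ptr2NumUses;
for (const Instruction &I : instructions(Callee)) { // llvm/IR/InstIterator.h
  if (++InstrCount >= 200) {
    Ptr2NumUses.clear(); // give up on the bonus for a large callee
    break;
  }
  // ... record load/store/GEP pointer uses as before ...
}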

} else if (const auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
unsigned NumStores = 0, NumLoads = 0;
countNumMemAccesses(GEP, NumStores, NumLoads, Callee);
Ptr2NumUses[GEP->getPointerOperand()] += NumLoads + NumStores;
Contributor

It seems like nothing actually cares about loads and stores separately? Make things simpler by having countNumMemAccesses return the total count?

Contributor Author

There actually is, a bit further down: separate checks for the number of loads and stores. They could perhaps be combined there as well, but that should probably wait and be benchmarked after this change.
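
For context only, the suggested simplification could look roughly like the hypothetical overload below. It was not adopted in this patch, and the existing helper's signature is assumed from its call sites in the diff:

// Hypothetical convenience overload, not part of this patch: return the
// total number of memory accesses through Ptr within F.
static unsigned countNumMemAccesses(const Value *Ptr, const Function *F) {
  unsigned NumStores = 0, NumLoads = 0;
  countNumMemAccesses(Ptr, NumStores, NumLoads, F);
  return NumStores + NumLoads;
}

// The caller-side check would then read:
//   if (countNumMemAccesses(Ptr, Caller) > 10) { Bonus = 1000; break; }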

@JonPsson1
Contributor Author

@nikic thanks for review! I followed your suggestions and indeed see a further speedup: zig.bc is now regressing <0.5% with this heuristic (compared to removing it), so nearly gone. And average over SPEC for InlinerPass is now 2% better than with main (1.46% vs 1.49%).

// callee.
unsigned InstrCount = 0;
SmallDenseMap<const Value *, unsigned> Ptr2NumUses;
for (auto &I : instructions(Callee)) {
Member

Do we need to watch out for debug instructions here? We don't want the count == 200 check to differ between a debug and a non-debug compile ...

Contributor Author

I started out with that actually, but @nikic suggested this not being needed with the new debug refs, which was committed for SystemZ a month ago. (haven't double-checked though)

Contributor

This is due to the switch from debug intrinsics to debug records. It's a separate change from debug instr ref (which is a backend thing).

Member

I see, thanks. Would this still be the case for an LLVM 20 backport? (Not sure when this change came in ...)

Contributor

Yes, it's already in LLVM 20.
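
For anyone backporting to a release that still uses debug intrinsics rather than debug records, a hedged sketch of the filtering the counting loop would then need; with debug records (already the default in LLVM 20, per the above) debug info no longer appears as instructions and no filtering is needed:

// Sketch only, not part of this patch: with the old debug-intrinsic
// representation, skip llvm.dbg.* calls so that the 200-instruction limit
// does not differ between debug and non-debug compiles.
unsigned InstrCount = 0;
for (const Instruction &I : instructions(Callee)) {
  if (isa<DbgInfoIntrinsic>(&I))
    continue;
  if (++InstrCount >= 200)
    break;
  // ... count pointer uses ...
}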

@nikic (Contributor) left a comment

LGTM

@uweigand (Member) left a comment

LGTM as well, thanks! This should also go into LLVM 20 to fix the regression.

@JonPsson1 marked this pull request as ready for review April 28, 2025 12:59
@JonPsson1 merged commit 98b895d into llvm:main Apr 28, 2025
13 checks passed
@JonPsson1 deleted the InlinerZig branch April 28, 2025 13:04
@JonPsson1 added this to the LLVM 20.X Release milestone Apr 28, 2025
@github-project-automation bot moved this to Needs Triage in LLVM Release Status Apr 28, 2025
@JonPsson1
Contributor Author

/cherry-pick 98b895d

@llvmbot
Member

llvmbot commented Apr 28, 2025

/pull-request #137628

@llvmbot moved this from Needs Triage to Done in LLVM Release Status Apr 28, 2025
swift-ci pushed a commit to swiftlang/llvm-project that referenced this pull request Apr 29, 2025
…lvm#137527)

Instead of always iterating over all GlobalVariable:s in the Module to
find the case where both Caller and Callee is using the same GV heavily,
first scan Callee (only if less than 200 instructions) for all GVs used
more than 10 times, and then do the counting for the Caller for just
those relevant GVs.

The limit of 200 instructions makes sense as this aims to inline a
relatively small function using a GV +10 times.

This resolves the compile time problem with zig where it is on main
(compared to removing the heuristic) a 380% increase, but with this
change <0.5% increase (total user compile time with opt).

Fixes llvm#134714.

(cherry picked from commit 98b895d)
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025
Development

Successfully merging this pull request may close these issues.

[SystemZ] Large compile time regression in SystemZTTIImpl::adjustInliningThreshold()
4 participants