
[AMDGPU] Filter candidates of LiveRegOptimizer for profitable cases #124624


Merged: 1 commit, Mar 5, 2025

Conversation

choikwa
Contributor

@choikwa commented Jan 27, 2025

It is known that vectors whose elements fit in i16 will be split and scalarized in SelectionDAG's type legalizer
(see SIISelLowering::getPreferredVectorAction).

LRO attempts to undo the scalarization of vectors across basic block boundaries and shoehorn the Values back into packed VGPRs. LRO is beneficial for operations that natively work on illegal vector types, since it prevents flip-flopping between unpacked and packed forms. If we know that operations on a vector will be split and scalarized anyway, then we don't want to shoehorn them back into a packed VGPR.

Operations that we know work natively on illegal vector types usually come in the form of intrinsics (MFMA, DOT8), buffer stores, shuffles, and phi nodes, to name a few.
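
For reference, the splitting policy mentioned above comes from the target's getPreferredVectorAction hook. The snippet below is only a paraphrased sketch of the kind of check that hook performs, not the verbatim upstream code: vectors with more than one element whose element type fits in i16 are split (or widened), which is what later scalarizes them in the type legalizer.

  // Paraphrased sketch of the policy in SIISelLowering::getPreferredVectorAction;
  // the free-function name and exact conditions are approximations, not the
  // upstream source.
  #include "llvm/CodeGen/TargetLowering.h"
  using namespace llvm;

  TargetLoweringBase::LegalizeTypeAction preferredVectorActionSketch(MVT VT) {
    // Small-element vectors (<= i16 per lane) are not kept as one packed value.
    if (VT.isVector() && !VT.isScalableVector() &&
        VT.getVectorNumElements() > 1 &&
        VT.getVectorElementType().bitsLE(MVT::i16))
      return VT.isPow2VectorType() ? TargetLoweringBase::TypeSplitVector
                                   : TargetLoweringBase::TypeWidenVector;
    return TargetLoweringBase::TypeLegal;
  }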

@llvmbot
Member

llvmbot commented Jan 27, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: choikwa (choikwa)

Changes

It is known that vectors whose elements fit in i16 will be split and scalarized in SelectionDAG's type legalizer
(see SIISelLowering::getPreferredVectorAction).

LRO attempts to undo the scalarization of vectors across basic block boundaries and shoehorn the Values into VGPRs. LRO is beneficial for operations that natively work on illegal vector types to prevent flip-flopping between SGPR and VGPR. If we know that operations on a vector will be split and scalarized, then we don't want to shoehorn them back to VGPR.

Operations that we know work natively on illegal vector types usually come in the form of intrinsics (MFMA, DOT8), buffer stores, shuffles, and phi nodes, to name a few.


Patch is 337.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/124624.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp (+51-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+2001-209)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcomb-extract-vec-elt-different-sizes.ll (+19-20)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll (+188-171)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector.ll (+26-25)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+24-12)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1896-148)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index f4e651ec477d30..d64951001d9cba 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -14,6 +14,7 @@
 
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
+#include "AMDGPUTargetTransformInfo.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
 #include "llvm/Analysis/ValueTracking.h"
@@ -45,6 +46,7 @@ class AMDGPULateCodeGenPrepare
   Function &F;
   const DataLayout &DL;
   const GCNSubtarget &ST;
+  const TargetTransformInfo &TTI;
 
   AssumptionCache *const AC;
   UniformityInfo &UA;
@@ -53,8 +55,9 @@ class AMDGPULateCodeGenPrepare
 
 public:
   AMDGPULateCodeGenPrepare(Function &F, const GCNSubtarget &ST,
+                           const TargetTransformInfo &TTI,
                            AssumptionCache *AC, UniformityInfo &UA)
-      : F(F), DL(F.getDataLayout()), ST(ST), AC(AC), UA(UA) {}
+      : F(F), DL(F.getDataLayout()), ST(ST), TTI(TTI), AC(AC), UA(UA) {}
   bool run();
   bool visitInstruction(Instruction &) { return false; }
 
@@ -75,6 +78,8 @@ class LiveRegOptimizer {
   Module &Mod;
   const DataLayout &DL;
   const GCNSubtarget &ST;
+  const TargetTransformInfo &TTI;
+
   /// The scalar type to convert to
   Type *const ConvertToScalar;
   /// The set of visited Instructions
@@ -125,8 +130,43 @@ class LiveRegOptimizer {
     return LK.first != TargetLoweringBase::TypeLegal;
   }
 
-  LiveRegOptimizer(Module &Mod, const GCNSubtarget &ST)
-      : Mod(Mod), DL(Mod.getDataLayout()), ST(ST),
+  // Filtering based on operation or its cost.
+  // If an operation incurs high enough cost or natively work on
+  // vector of illegal type, ie. v2i8, then it makes sense to try
+  // to avoid scalarizing across BB.
+  bool shouldReplaceBasedOnOp(Instruction *II) {
+    // Ignore pseudos
+    if (II->isDebugOrPseudoInst())
+      return false;
+
+    // Instruction Cost
+    const auto Cost = TTI.getInstructionCost(II,
+        TargetTransformInfo::TargetCostKind::TCK_SizeAndLatency);
+    LLVM_DEBUG(
+      dbgs() << "shouldReplaceBasedOnOp: " <<
+        *II << " Cost=" << Cost << '\n';
+    );
+    if (Cost >= 8)
+      return true;
+
+    // Intrinsics - assume they natively handle illegal type
+    if (dyn_cast<IntrinsicInst>(II))
+      return true;
+
+    // Stores
+    if (dyn_cast<StoreInst>(II))
+      return true;
+
+    // Shuffles
+    if (dyn_cast<ShuffleVectorInst>(II))
+      return true;
+
+    return false;
+  }
+
+  LiveRegOptimizer(Module &Mod, const GCNSubtarget &ST,
+      const TargetTransformInfo &TTI)
+      : Mod(Mod), DL(Mod.getDataLayout()), ST(ST), TTI(TTI),
         ConvertToScalar(Type::getInt32Ty(Mod.getContext())) {}
 };
 
@@ -140,7 +180,7 @@ bool AMDGPULateCodeGenPrepare::run() {
   // vectors to equivalent vectors of legal type (which are converted back
   // before uses in subsequent blocks), to pack the bits into fewer physical
   // registers (used in CopyToReg/CopyFromReg pairs).
-  LiveRegOptimizer LRO(*F.getParent(), ST);
+  LiveRegOptimizer LRO(*F.getParent(), ST, TTI);
 
   bool Changed = false;
 
@@ -259,6 +299,9 @@ bool LiveRegOptimizer::optimizeLiveType(
     if (!shouldReplace(II->getType()))
       continue;
 
+    if (!shouldReplaceBasedOnOp(II))
+      continue;
+
     if (PHINode *Phi = dyn_cast<PHINode>(II)) {
       PhiNodes.insert(Phi);
       // Collect all the incoming values of problematic PHI nodes.
@@ -478,11 +521,12 @@ bool AMDGPULateCodeGenPrepare::visitLoadInst(LoadInst &LI) {
 PreservedAnalyses
 AMDGPULateCodeGenPreparePass::run(Function &F, FunctionAnalysisManager &FAM) {
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
 
   AssumptionCache &AC = FAM.getResult<AssumptionAnalysis>(F);
   UniformityInfo &UI = FAM.getResult<UniformityInfoAnalysis>(F);
 
-  bool Changed = AMDGPULateCodeGenPrepare(F, ST, &AC, UI).run();
+  bool Changed = AMDGPULateCodeGenPrepare(F, ST, TTI, &AC, UI).run();
 
   if (!Changed)
     return PreservedAnalyses::all();
@@ -518,13 +562,14 @@ bool AMDGPULateCodeGenPrepareLegacy::runOnFunction(Function &F) {
   const TargetPassConfig &TPC = getAnalysis<TargetPassConfig>();
   const TargetMachine &TM = TPC.getTM<TargetMachine>();
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
 
   AssumptionCache &AC =
       getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
   UniformityInfo &UI =
       getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
 
-  return AMDGPULateCodeGenPrepare(F, ST, &AC, UI).run();
+  return AMDGPULateCodeGenPrepare(F, ST, TTI, &AC, UI).run();
 }
 
 INITIALIZE_PASS_BEGIN(AMDGPULateCodeGenPrepareLegacy, DEBUG_TYPE,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
index 9c2fabce4bcdeb..96a167794dfb7a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
@@ -6,36 +6,28 @@ define amdgpu_kernel void @v3i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
-; GFX906-NEXT:    v_mov_b32_e32 v4, 8
-; GFX906-NEXT:    v_mov_b32_e32 v5, 16
-; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dword v3, v2, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v1, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_e32 v4, 2, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
+; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX906-NEXT:    global_load_dword v1, v4, s[0:1]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v6, 0xff, v3
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v7, v4, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_1
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_2
-; GFX906-NEXT:    v_or3_b32 v3, v6, v7, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB0_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dword v0, v2, s[2:3]
+; GFX906-NEXT:    global_load_dword v1, v4, s[2:3]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v0
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v3, v4, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_1
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v0, v5, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_2
-; GFX906-NEXT:    v_or3_b32 v3, v2, v3, v0
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
 ; GFX906-NEXT:  .LBB0_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_lshrrev_b32_e32 v0, 8, v3
-; GFX906-NEXT:    v_and_b32_e32 v0, 0xff, v0
+; GFX906-NEXT:    v_and_b32_e32 v0, 0xff, v3
 ; GFX906-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX906-NEXT:    v_or_b32_sdwa v0, v3, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX906-NEXT:    v_and_b32_sdwa v1, v3, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX906-NEXT:    v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX906-NEXT:    s_mov_b32 s0, 0xffff
 ; GFX906-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GFX906-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; GFX906-NEXT:    v_and_b32_sdwa v1, s0, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX906-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; GFX906-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX906-NEXT:    global_store_short v1, v0, s[6:7]
@@ -63,19 +55,34 @@ define amdgpu_kernel void @v4i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 2, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dword v1, v2, s[0:1]
+; GFX906-NEXT:    global_load_dword v1, v5, s[0:1]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 24, v1
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB1_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dword v1, v2, s[2:3]
+; GFX906-NEXT:    global_load_dword v1, v5, s[2:3]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 24, v1
 ; GFX906-NEXT:  .LBB1_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v0, 0
-; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    global_store_dword v0, v1, s[6:7]
+; GFX906-NEXT:    v_mov_b32_e32 v5, 8
+; GFX906-NEXT:    v_mov_b32_e32 v0, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v2, v5, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v0, v1, v0, v2
+; GFX906-NEXT:    v_and_b32_e32 v1, 0xff, v3
+; GFX906-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 24, v4
+; GFX906-NEXT:    v_or3_b32 v0, v0, v1, v2
+; GFX906-NEXT:    v_mov_b32_e32 v1, 0
+; GFX906-NEXT:    global_store_dword v1, v0, s[6:7]
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -99,28 +106,30 @@ define amdgpu_kernel void @v5i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 3, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v6, 3, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[0:1]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v6, s[0:1]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB2_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[2:3]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v6, s[2:3]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
 ; GFX906-NEXT:  .LBB2_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v4, 0
-; GFX906-NEXT:    v_lshrrev_b32_e32 v0, 8, v1
-; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 24, v1
-; GFX906-NEXT:    global_store_byte v4, v1, s[6:7]
-; GFX906-NEXT:    global_store_byte v4, v0, s[6:7] offset:1
-; GFX906-NEXT:    global_store_byte_d16_hi v4, v1, s[6:7] offset:2
-; GFX906-NEXT:    global_store_byte v4, v3, s[6:7] offset:3
-; GFX906-NEXT:    global_store_byte v4, v2, s[6:7] offset:4
+; GFX906-NEXT:    v_mov_b32_e32 v0, 0
+; GFX906-NEXT:    global_store_byte v0, v1, s[6:7]
+; GFX906-NEXT:    global_store_byte v0, v3, s[6:7] offset:1
+; GFX906-NEXT:    global_store_byte v0, v4, s[6:7] offset:2
+; GFX906-NEXT:    global_store_byte v0, v5, s[6:7] offset:3
+; GFX906-NEXT:    global_store_byte v0, v2, s[6:7] offset:4
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -144,19 +153,46 @@ define amdgpu_kernel void @v8i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 3, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v9, 3, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[0:1]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v9, s[0:1]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 24, v2
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB3_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[2:3]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v9, s[2:3]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 24, v2
 ; GFX906-NEXT:  .LBB3_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v0, 0
-; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    global_store_dwordx2 v0, v[1:2], s[6:7]
+; GFX906-NEXT:    v_mov_b32_e32 v10, 8
+; GFX906-NEXT:    v_mov_b32_e32 v9, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v0, v10, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v0, v1, v9, v0
+; GFX906-NEXT:    v_and_b32_e32 v1, 0xff, v4
+; GFX906-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 24, v5
+; GFX906-NEXT:    v_or3_b32 v0, v0, v1, v3
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v1, v10, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v1, v2, v9, v1
+; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v7
+; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 24, v8
+; GFX906-NEXT:    v_or3_b32 v1, v1, v2, v3
+; GFX906-NEXT:    v_mov_b32_e32 v2, 0
+; GFX906-NEXT:    global_store_dwordx2 v2, v[0:1], s[6:7]
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -180,19 +216,70 @@ define amdgpu_kernel void @v16i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 4, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v17, 4, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx4 v[1:4], v5, s[0:1]
+; GFX906-NEXT:    global_load_dwordx4 v[1:4], v17, s[0:1]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v9, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v10, 24, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v11, 8, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v12, 16, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v13, 24, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v14, 8, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v15, 16, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v16, 24, v4
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB4_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dwordx4 v[1:4], v5, s[2:3]
+; GFX906-NEXT:    global_load_dwordx4 v[1:4], v17, s[2:3]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v9, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v10, 24, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v11, 8, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v12, 16, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v13, 24, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v14, 8, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v15, 16, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v16, 24, v4
 ; GFX906-NEXT:  .LBB4_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v0, 0
-; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    global_store_dwordx4 v0, v[1:4], s[6:7]
+; GFX906-NEXT:    v_mov_b32_e32 v18, 8
+; GFX906-NEXT:    v_mov_b32_e32 v17, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v0, v18, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v0, v1, v17, v0
+; GFX906-NEXT:    v_and_b32_e32 v1, 0xff, v6
+; GFX906-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v7
+; GFX906-NEXT:    v_or3_b32 v0, v0, v1, v5
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v1, v18, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v1, v2, v17, v1
+; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v9
+; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v10
+; GFX906-NEXT:    v_or3_b32 v1, v1, v2, v5
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v2, v18, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v2, v3, v17, v2
+; GFX906-NEXT:    v_and_b32_e32 v3, 0xff, v12
+; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v13
+; GFX906-NEXT:    v_or3_b32 v2, v2, v3, v5
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v3, v18, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v3, v4, v17, v3
+; GFX906-NEXT:    v_and_b32_e32 v4, 0xff, v15
+; GFX906-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v16
+; GFX906-NEXT:    v_or3_b32 v3, v3, v4, v5
+; GFX906-NEXT:    v_mov_b32_e32 v4, 0
+; GFX906-NEXT:    global_store_dwordx4 v4, v[0:3], s[6:7]
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -216,23 +303,123 @@ define amdgpu_kernel void @v32i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v9, 5, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v32, 5, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx4 v[1:4], v9, s[0:1]
-; GFX906-NEXT:    global_load_dwordx4 v[5:8], v9, s[0:1] offset:16
+; GFX906-NEXT:    global_load_dwordx4 v[5:8], v32, s[0:1]
+; GFX906-NEXT:    global_load_dwordx4 v[1:4], v32, s[0:1] offset:16
+; GFX906-NEXT:    s_waitcnt vmcnt(1)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v31, 8, v5
+; GFX906-NEXT:    v_lshrrev_b32_e32 v26, 16, v5
+; GFX906-NEXT:    v_lshrrev_b32_e32 v27, 24, v5
+; GFX906-NEXT:    v_lshrrev_b32_e32 v28, 8, v6
+; GFX906-NEXT:    v_lshrrev_b32_e32 v29, 16, v6
+; GFX906-NEXT:    v_lshrrev_b32_e32 v30, 24, v6
+; GFX906-NEXT:    v_lshrrev_b32_e32 v0, 8, v7
+; GFX906-NEXT:    v_lshrrev_b32_e32 v9, 16, v7
+; GFX906-NEXT:    v_lshrrev_b32_e32 v10, 24, v7
+; GFX906-NEXT:    v_lshrrev_b32_e32 v11, 8, v8
+; GFX906-NEXT:    v_lshrrev_b32_e32 v12, 16, v8
+; GFX906-NEXT:    v_lshrrev_b32_e32 v13, 24, v8
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v14, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v15, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v16, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v17, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v18, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v19, 24, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v20, 8, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v21, 16, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v22...
[truncated]

@llvmbot
Member

llvmbot commented Jan 27, 2025

@llvm/pr-subscribers-llvm-globalisel



github-actions bot commented Jan 27, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm
Contributor

arsenm commented Jan 28, 2025

is beneficial for operations that natively work on illegal vector types to prevent flip-flopping between SGPR and VGPR.

This isn't how SGPR vs VGPR works. It's one direction only, you can't just copy to SGPR from VGPR. What example are you talking about?

return true;

// Intrinsics - assume they natively handle illegal type
if (dyn_cast<IntrinsicInst>(II))
Contributor

If you aren't using the return value, use isa. IntrinsicInst also does not imply it is a target intrinsic.

return true;

// Shuffles
if (dyn_cast<ShuffleVectorInst>(II))
Contributor

This needs new tests, but I doubt we should be special-casing specific operations.

Contributor Author

I am mixed about what the ideal solution should be here and left it somewhat open-ended. Correct me if I'm wrong, but a few things to consider are:

  1. If an operation is known to be scalarized, LRO shoehorning Values back into a packed VGPR adds overhead.
  2. While whitelisting only packed ops would work, the list may need to be exhaustive, and/or
  3. it seems LRO has other effects, such as splitting live ranges, which reduces RP for the instructions, however unintentional.

Should we err on the side of allowing more candidates for LRO?

Contributor

I think the most accurate heuristic would be akin to what other passes (e.g. SLP) do, in that we would calculate the aggregate value of vectorizing (coercing the illegal type) vs. scalarizing (keeping the illegal type as is). But I think we probably don't need that level of modeling.

For the cases where we have mixed uses (e.g. some user instructions which can natively operate on i8 vectors, and some which must be scalarized), I would think it is best to prefer type coercion. When we introduce the scalarization in-block, I think we have a better chance of folding the vector-extract code into the scalarized ops themselves (i.e. SDWA). Moreover, in the case of loops, the live-in vector will be live throughout the body of the loop, so the coercion will reduce the loop's baseline RP.

@choikwa
Contributor Author

choikwa commented Jan 28, 2025

is beneficial for operations that natively work on illegal vector types to prevent flip-flopping between SGPR and VGPR.

This isn't how SGPR vs VGPR works. It's one direction only, you can't just copy to SGPR from VGPR. What example are you talking about?

Apologies, corrected to 'unpacked and packed'.

; GFX906-NEXT: v_lshlrev_b16_e32 v2, 8, v4
; GFX906-NEXT: v_or_b32_sdwa v2, v5, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX906-NEXT: global_store_dwordx2 v3, v[0:1], s[6:7]
Contributor

Ditto

; GFX906-NEXT: v_lshlrev_b16_e32 v4, 8, v6
; GFX906-NEXT: v_or_b32_sdwa v4, v7, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX906-NEXT: global_store_dwordx4 v5, v[0:3], s[6:7]
Contributor

Ditto

; GFX906-NEXT: v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v0, v10, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v4, v4, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX906-NEXT: global_store_dwordx4 v9, v[1:4], s[6:7] offset:16
Contributor

Ditto

; GFX906-NEXT: v_lshlrev_b16_e32 v5, 8, v5
; GFX906-NEXT: v_or_b32_sdwa v3, v8, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v5, v6, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v3, v3, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
Contributor

Ditto

II, TargetTransformInfo::TargetCostKind::TCK_SizeAndLatency);
LLVM_DEBUG(dbgs() << "shouldReplaceBasedOnOp: " << *II << " Cost=" << Cost
<< '\n';);
if (Cost >= 8)
Contributor

@jrbyrnes Jan 29, 2025

Need a more meaningful value than 8, something that captures the cost of scalarizing a certain value/op.
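
One way to read that suggestion, sketched below with a made-up helper and tuning factor (neither is in the patch): derive the threshold from how many scalar pieces the value would be broken into, rather than hard-coding 8.

  // Hypothetical sketch only; the helper name and the factor of 2 are invented
  // here to illustrate tying the threshold to the scalarization cost.
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Number of scalar pieces the value would be split into by the legalizer.
  static unsigned numScalarPieces(const Instruction *I) {
    if (auto *VTy = dyn_cast<FixedVectorType>(I->getType()))
      return VTy->getNumElements();
    return 1;
  }

  // e.g. instead of the magic `Cost >= 8`, something like:
  //   if (Cost >= 2 * numScalarPieces(II)) // factor of 2 is an assumption
  //     return true;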

return true;

// Intrinsics - assume they natively handle illegal type
if (isa<IntrinsicInst>(II))
Contributor

Is it possible to bring these special cases into a TTI query?

Contributor Author

Do you mean expand on TTI::isTypeLegal, like an isTypeAndOpLegal? The former only checks whether there is a registered RegClass that can hold the type natively. I assume the latter might make some of the logic in ISelLowering for determining custom legalization for Type & Op redundant.

Contributor

Yeah I meant having some general query handle these special cases appropriately -- but I think there's no good way to do this through what the TTI currently has.

@@ -259,6 +297,9 @@ bool LiveRegOptimizer::optimizeLiveType(
if (!shouldReplace(II->getType()))
continue;

if (!shouldReplaceBasedOnOp(II))
Contributor

You should be checking the users of II -- including walking through PHI nodes.

Contributor Author

Noted

@@ -2102,15 +2102,18 @@ define void @crash_lshlrevb16_not_reg_op() {
; NOSDWA: ; %bb.0: ; %bb0
; NOSDWA-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; NOSDWA-NEXT: s_mov_b64 s[4:5], 0
; NOSDWA-NEXT: s_and_b32 s6, s4, 0xff
Contributor Author

This looks like a missed opportunity for const prop (SI-Fold). The previous case shows const prop happening across the BB. This was a case where <2 x i8> was not coerced and ended up producing worse code. At first glance, it looks like the coerced version needed less const prop and was able to reach better-looking assembly after the SI-Fold pass. Wondering if this is a case of not re-running SI-Fold on changed MIs.

@choikwa
Contributor Author

choikwa commented Feb 1, 2025

Addressed the unpacked vni8 test cases in the latest commit (they now stay packed).

@choikwa force-pushed the LROfilter branch 3 times, most recently from 0c7728b to 5b95b6e on February 2, 2025 06:02
@@ -291,6 +333,9 @@ bool LiveRegOptimizer::optimizeLiveType(
}

Instruction *UseInst = cast<Instruction>(V);
if (!shouldReplaceBasedOnOp(UseInst))
Contributor

@jrbyrnes Feb 7, 2025

If we do this check after we do the PHINode handling above, then I think we'll probably end up with corrupt PHI node coercion in some cases.

At this position, we have already added the PHINodes and their incoming values to the list to be coerced, but we block adding their uses if coercion is not profitable.

Besides, if the check returns false, this just breaks out of the users loop, meaning we may have added some uses but not other uses, resulting in further corruption.

So, we need to move this check to the top. Also, it is not enough to just check the users of the instruction. A user can be a PHI node, which really just resolves values across control flow -- what we care about is the user of the PHI node, since that may be an instruction which truly maps to some hardware instruction -- we should check whether that hardware instruction natively supports 8-bit vectors. Similarly for shuffle/insert/extract vector instructions, we should look through to their uses, as those uses will be the determining factor as to whether we should vectorize.

bool IsCoercionProfitable = false;
SmallVector<Instruction *, 4> UserList;

for (User *V : II->users())
  if (auto *UseInst = dyn_cast<Instruction>(V))
    UserList.push_back(UseInst);

while (!UserList.empty() && !IsCoercionProfitable) {
  Instruction *CandInstruction = UserList.pop_back_val();
  // Look through PHI nodes: the interesting user is whatever consumes the PHI.
  if (isa<PHINode>(CandInstruction))
    for (User *V : CandInstruction->users())
      if (auto *UseInst = dyn_cast<Instruction>(V))
        UserList.push_back(UseInst);

  // Same-block users are not the cross-block case LRO cares about.
  if (CandInstruction->getParent() == II->getParent())
    continue;

  // TODO -- look through shufflevector, insert/extractelement as well.

  IsCoercionProfitable = shouldReplaceBasedOnOp(CandInstruction);
  // Prefer type coercion if we have any profitable use.
  if (IsCoercionProfitable)
    break;
}

if (!IsCoercionProfitable)
  continue;


}
LLVM_DEBUG(dbgs() << "shouldReplaceBasedOnOp: " << *II << " Cost=" << Cost
<< '\n';);
if (Cost >= LRO_COST_THRES)
Contributor

Ideally, we would be able to use TTI to calculate the cost of the instruction, and use TTI to calculate the cost of the scalarized version. This implies we would need to construct the equivalent scalarized version. As a fill-in for the scalarized version, I was thinking we might be able to compare the instruction cost to TTI.getTypeLegalizationCost, but I see that that is not a great comparison due to other factors in the instruction cost. So, maybe it makes the most sense to use a whitelist for now.


/// Check if operation is legal.
/// TODO: If we had IR<->SDag mapping, we could use TLI->isOperationLegal
bool GCNTTIImpl::isOpLegal(Instruction *I) {
Contributor

Can you move this to AMDGPULateCodeGenPrepare? Don't whitelist shuffles, and only whitelist v_dot / mfma / wmma (instead of all intrinsics).

@choikwa force-pushed the LROfilter branch 3 times, most recently from 8b15285 to 5cb4322 on February 11, 2025 18:02
}

/// Check if intrinsic natively operates on 8-bit or 16-bit
bool isNativeIntrinsic(Intrinsic::ID ID) {
Contributor

We don't need all the dot / mfma / wmma variants -- just the 8- and 4-bit ones.

}

bool isCoercionProfitable(Instruction *II) {
if (shouldReplaceByOp(II))
Contributor

We're checking the users of the op for profitable conditions, so checking shouldReplaceByOp on II doesn't make that much sense (unless we were to look the other direction -- at the defs of its operands).


while (!UserList.empty() && !Profitable) {
auto CII = UserList.pop_back_val();
if (!CVisited.insert(II).second)
Contributor

if (!CVisited.insert(II).second) -> if (!CVisited.insert(CII).second)

if (!CVisited.insert(II).second)
continue;

if (isa<PHINode>(CII) || isa<ShuffleVectorInst>(CII) ||
Contributor

Should also look through casts

if (auto *UseInst = dyn_cast<Instruction>(V))
UserList.push_back(UseInst);

if (CII->getParent() == II->getParent())
Contributor

If our II is a lookthru instruction, then we should bypass this condition

return true;
}

bool isCoercionProfitable(Instruction *II) {
Contributor

Something like this:

  bool isCoercionProfitable(Instruction *II) {
    SmallPtrSet<Instruction *, 4> CVisited;
    SmallVector<Instruction *, 4> UserList;

    // Check users for profitable conditions (across block user which can natively
    // handle the illegal vector).
    for (User *V : II->users())
      if (auto *UseInst = dyn_cast<Instruction>(V))
        UserList.push_back(UseInst);

    auto IsLookThru = [](Instruction *II) {
      return isa<PHINode>(II) || isa<ShuffleVectorInst>(II) ||
          isa<InsertElementInst>(II) || isa<ExtractElementInst>(II) || isa<CastInst>(II);
    };

    while (!UserList.empty()) {
      auto CII = UserList.pop_back_val();
      if (!CVisited.insert(CII).second)
        continue;    

      if (CII->getParent() == II->getParent() && !IsLookThru(II))
        continue;
      
      if (isOpLegal(CII))
        return true;

      if (IsLookThru(CII))
        for (User *V : CII->users())
          if (auto *UseInst = dyn_cast<Instruction>(V))
            UserList.push_back(UseInst);
    }
    return false;
  }

if (CII->getParent() == II->getParent())
continue;

Profitable = shouldReplaceByOp(CII);
Contributor

Don't need to check the cost model -- just the whitelist.

// If an operation incurs high enough cost or natively work on
// vector of illegal type, ie. v2i8, then it makes sense to try
// to coerce them as packed VGPR across BB.
bool shouldReplaceByOp(Instruction *II) {
Contributor

I think we can just delete this cost model in favor of the whitelist -- should be able to remove dependency on TTI too

; GFX906-NEXT: .LBB1_2: ; %bb.2
; GFX906-NEXT: s_or_b64 exec, exec, s[0:1]
; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: global_store_dword v1, v2, s[6:7]
; GFX906-NEXT: v_lshlrev_b16_e32 v0, 8, v5
Contributor

The LRO should coerce this.

@choikwa choikwa force-pushed the LROfilter branch 2 times, most recently from edb17b2 to 205a066 Compare February 18, 2025 18:45
if (auto *UseInst = dyn_cast<Instruction>(V))
UserList.push_back(UseInst);

auto IsLookThru = [](Instruction *II) {
Contributor

Can we also look thru v_perm intrinsic
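For illustration, the IsLookThru lambda from the sketch above could be extended roughly like this (assuming v_perm shows up as the amdgcn_perm intrinsic; a sketch, not necessarily the exact merged code):

auto IsLookThru = [](Instruction *II) {
  // v_perm only rearranges bytes, so look through it to whatever consumes
  // its result.
  if (const auto *Intr = dyn_cast<IntrinsicInst>(II))
    if (Intr->getIntrinsicID() == Intrinsic::amdgcn_perm)
      return true;
  return isa<PHINode>(II) || isa<ShuffleVectorInst>(II) ||
         isa<InsertElementInst>(II) || isa<ExtractElementInst>(II) ||
         isa<CastInst>(II);
};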

case Intrinsic::amdgcn_wmma_i32_16x16x32_iu4:
case Intrinsic::amdgcn_swmmac_i32_16x16x32_iu8:
case Intrinsic::amdgcn_swmmac_i32_16x16x32_iu4:
case Intrinsic::amdgcn_swmmac_i32_16x16x64_iu4:
Contributor

Can you also add the buffer_store intrinsics
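For illustration, the switch quoted above could grow entries along these lines (a sketch; which buffer-store intrinsic variants to include is the author's call):

  // Buffer stores take their data operand in packed registers, so they also
  // count as users that handle the illegal vector type natively.
  case Intrinsic::amdgcn_raw_buffer_store:
  case Intrinsic::amdgcn_raw_ptr_buffer_store:
  case Intrinsic::amdgcn_struct_buffer_store:
  case Intrinsic::amdgcn_struct_ptr_buffer_store:
    return true;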

It is known that vectors whose elements fit in i16 will be split
and scalarized in SelectionDAG's type legalizer
(see SIISelLowering::getPreferredVectorAction).

LRO attempts to undo the scalarizing of vectors across basic block
boundaries and shoehorn Values into VGPRs. LRO is beneficial for
operations that natively work on illegal vector types, to prevent
flip-flopping between unpacked and packed forms. If we know that
operations on the vector will be split and scalarized, then we don't
want to shoehorn them back into packed VGPRs.

Operations that we know to work natively on illegal vector types
usually come in the form of intrinsics (MFMA, DOT8), buffer stores,
shuffles, insert/extract element, and phi nodes, to name a few.
Contributor
@jrbyrnes jrbyrnes left a comment

LGTM, thanks!

@choikwa choikwa merged commit 45759fe into llvm:main Mar 5, 2025
11 checks passed
@@ -125,6 +127,131 @@ class LiveRegOptimizer {
return LK.first != TargetLoweringBase::TypeLegal;
}

/// Check if intrinsic natively operates on 8-bit or 16-bit
bool isNativeIntrinsic(Intrinsic::ID ID) {
Contributor

isNativeIntrinsic is redundant; we also really don't want to maintain this table. Is any target intrinsic good enough? If we really must, we should use SearchableTables.

Contributor

> Is any target intrinsic good enough?

It's probably "good enough" -- it may result in false positives (i.e. applying the coercion to cases where there is no profit), but maybe not many.

What we really want is a way to check if an intrinsic / instruction can natively handle illegal vector types without compiler assistance to scalarize the code -- this query may be useful for future work too.

Contributor

> compiler assistance to scalarize the code

Not clear to me what this means. All the vector types will be split or scalarized in some way. The target intrinsics, with a handful of exceptions, only operate on legal types. isTargetIntrinsic || isTypeLegal()?

Contributor

> Not clear to me what this means

"intrinsic / instruction can natively handle illegal vector types" = The target intrinsic has either bytewise logic on the 32 bit regs (like mfma) or type doesn't matter (i.e. store).

For a standard instruction (e.g. v4i8 add), the compiler will need to insert scalarization code.

We should probably just use isTargetIntrinsic and narrow it as needed.
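A minimal sketch of that direction, assuming Function::isTargetIntrinsic() as the query (illustrative only; the store special case follows the discussion above):

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: accept any target intrinsic user by default and narrow later;
// stores are type-agnostic, so they count too.
static bool userNativelyHandlesIllegalVec(const Instruction *I) {
  if (const auto *Call = dyn_cast<CallInst>(I))
    if (const Function *Callee = Call->getCalledFunction())
      return Callee->isTargetIntrinsic();
  return isa<StoreInst>(I);
}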

isa<InsertElementInst>(II) || isa<ExtractElementInst>(II) ||
isa<CastInst>(II);
};

Contributor

Most of this heuristic isn't covered by tests; there are no new tests added.

jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
…lvm#124624)
