
[AMDGPU] Filter candidates of LiveRegOptimizer for profitable cases #124624


Merged: 1 commit, Mar 5, 2025

Conversation

choikwa
Contributor

@choikwa commented Jan 27, 2025

It is known that vectors whose elements fit in i16 will be split and scalarized in SelectionDAG's type legalizer
(see SIISelLowering::getPreferredVectorAction).

LRO attempts to undo the scalarization of vectors across basic block boundaries and shoehorn the Values back into packed VGPRs. LRO is beneficial for operations that natively work on illegal vector types, since it prevents flip-flopping between unpacked and packed forms. If we know that operations on a vector will be split and scalarized anyway, then we don't want to shoehorn them back into a packed VGPR.

Operations that we know work natively on illegal vector types usually come in the form of intrinsics (MFMA, DOT8), buffer stores, shuffles, and phi nodes, to name a few.
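
For reference, the splitting policy mentioned above comes from the target's getPreferredVectorAction hook. The snippet below is only a paraphrased sketch of the kind of check that hook performs, not the verbatim upstream code: vectors with more than one element whose element type fits in i16 are split (or widened), which is what later scalarizes them in the type legalizer.

  // Paraphrased sketch of the policy in SIISelLowering::getPreferredVectorAction;
  // the free-function name and exact conditions are approximations, not the
  // upstream source.
  #include "llvm/CodeGen/TargetLowering.h"
  using namespace llvm;

  TargetLoweringBase::LegalizeTypeAction preferredVectorActionSketch(MVT VT) {
    // Small-element vectors (<= i16 per lane) are not kept as one packed value.
    if (VT.isVector() && !VT.isScalableVector() &&
        VT.getVectorNumElements() > 1 &&
        VT.getVectorElementType().bitsLE(MVT::i16))
      return VT.isPow2VectorType() ? TargetLoweringBase::TypeSplitVector
                                   : TargetLoweringBase::TypeWidenVector;
    return TargetLoweringBase::TypeLegal;
  }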

@llvmbot
Member

llvmbot commented Jan 27, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: choikwa (choikwa)

Changes

It is known that vectors whose elements fit in i16 will be split and scalarized in SelectionDAG's type legalizer
(see SIISelLowering::getPreferredVectorAction).

LRO attempts to undo the scalarization of vectors across basic block boundaries and shoehorn the Values into VGPRs. LRO is beneficial for operations that natively work on illegal vector types to prevent flip-flopping between SGPR and VGPR. If we know that operations on a vector will be split and scalarized, then we don't want to shoehorn them back to VGPR.

Operations that we know work natively on illegal vector types usually come in the form of intrinsics (MFMA, DOT8), buffer stores, shuffles, and phi nodes, to name a few.


Patch is 337.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/124624.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp (+51-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+2001-209)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcomb-extract-vec-elt-different-sizes.ll (+19-20)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll (+188-171)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector.ll (+26-25)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+24-12)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1896-148)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index f4e651ec477d30..d64951001d9cba 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -14,6 +14,7 @@
 
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
+#include "AMDGPUTargetTransformInfo.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
 #include "llvm/Analysis/ValueTracking.h"
@@ -45,6 +46,7 @@ class AMDGPULateCodeGenPrepare
   Function &F;
   const DataLayout &DL;
   const GCNSubtarget &ST;
+  const TargetTransformInfo &TTI;
 
   AssumptionCache *const AC;
   UniformityInfo &UA;
@@ -53,8 +55,9 @@ class AMDGPULateCodeGenPrepare
 
 public:
   AMDGPULateCodeGenPrepare(Function &F, const GCNSubtarget &ST,
+                           const TargetTransformInfo &TTI,
                            AssumptionCache *AC, UniformityInfo &UA)
-      : F(F), DL(F.getDataLayout()), ST(ST), AC(AC), UA(UA) {}
+      : F(F), DL(F.getDataLayout()), ST(ST), TTI(TTI), AC(AC), UA(UA) {}
   bool run();
   bool visitInstruction(Instruction &) { return false; }
 
@@ -75,6 +78,8 @@ class LiveRegOptimizer {
   Module &Mod;
   const DataLayout &DL;
   const GCNSubtarget &ST;
+  const TargetTransformInfo &TTI;
+
   /// The scalar type to convert to
   Type *const ConvertToScalar;
   /// The set of visited Instructions
@@ -125,8 +130,43 @@ class LiveRegOptimizer {
     return LK.first != TargetLoweringBase::TypeLegal;
   }
 
-  LiveRegOptimizer(Module &Mod, const GCNSubtarget &ST)
-      : Mod(Mod), DL(Mod.getDataLayout()), ST(ST),
+  // Filtering based on operation or its cost.
+  // If an operation incurs high enough cost or natively work on
+  // vector of illegal type, ie. v2i8, then it makes sense to try
+  // to avoid scalarizing across BB.
+  bool shouldReplaceBasedOnOp(Instruction *II) {
+    // Ignore pseudos
+    if (II->isDebugOrPseudoInst())
+      return false;
+
+    // Instruction Cost
+    const auto Cost = TTI.getInstructionCost(II,
+        TargetTransformInfo::TargetCostKind::TCK_SizeAndLatency);
+    LLVM_DEBUG(
+      dbgs() << "shouldReplaceBasedOnOp: " <<
+        *II << " Cost=" << Cost << '\n';
+    );
+    if (Cost >= 8)
+      return true;
+
+    // Intrinsics - assume they natively handle illegal type
+    if (dyn_cast<IntrinsicInst>(II))
+      return true;
+
+    // Stores
+    if (dyn_cast<StoreInst>(II))
+      return true;
+
+    // Shuffles
+    if (dyn_cast<ShuffleVectorInst>(II))
+      return true;
+
+    return false;
+  }
+
+  LiveRegOptimizer(Module &Mod, const GCNSubtarget &ST,
+      const TargetTransformInfo &TTI)
+      : Mod(Mod), DL(Mod.getDataLayout()), ST(ST), TTI(TTI),
         ConvertToScalar(Type::getInt32Ty(Mod.getContext())) {}
 };
 
@@ -140,7 +180,7 @@ bool AMDGPULateCodeGenPrepare::run() {
   // vectors to equivalent vectors of legal type (which are converted back
   // before uses in subsequent blocks), to pack the bits into fewer physical
   // registers (used in CopyToReg/CopyFromReg pairs).
-  LiveRegOptimizer LRO(*F.getParent(), ST);
+  LiveRegOptimizer LRO(*F.getParent(), ST, TTI);
 
   bool Changed = false;
 
@@ -259,6 +299,9 @@ bool LiveRegOptimizer::optimizeLiveType(
     if (!shouldReplace(II->getType()))
       continue;
 
+    if (!shouldReplaceBasedOnOp(II))
+      continue;
+
     if (PHINode *Phi = dyn_cast<PHINode>(II)) {
       PhiNodes.insert(Phi);
       // Collect all the incoming values of problematic PHI nodes.
@@ -478,11 +521,12 @@ bool AMDGPULateCodeGenPrepare::visitLoadInst(LoadInst &LI) {
 PreservedAnalyses
 AMDGPULateCodeGenPreparePass::run(Function &F, FunctionAnalysisManager &FAM) {
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
 
   AssumptionCache &AC = FAM.getResult<AssumptionAnalysis>(F);
   UniformityInfo &UI = FAM.getResult<UniformityInfoAnalysis>(F);
 
-  bool Changed = AMDGPULateCodeGenPrepare(F, ST, &AC, UI).run();
+  bool Changed = AMDGPULateCodeGenPrepare(F, ST, TTI, &AC, UI).run();
 
   if (!Changed)
     return PreservedAnalyses::all();
@@ -518,13 +562,14 @@ bool AMDGPULateCodeGenPrepareLegacy::runOnFunction(Function &F) {
   const TargetPassConfig &TPC = getAnalysis<TargetPassConfig>();
   const TargetMachine &TM = TPC.getTM<TargetMachine>();
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
 
   AssumptionCache &AC =
       getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
   UniformityInfo &UI =
       getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
 
-  return AMDGPULateCodeGenPrepare(F, ST, &AC, UI).run();
+  return AMDGPULateCodeGenPrepare(F, ST, TTI, &AC, UI).run();
 }
 
 INITIALIZE_PASS_BEGIN(AMDGPULateCodeGenPrepareLegacy, DEBUG_TYPE,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
index 9c2fabce4bcdeb..96a167794dfb7a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
@@ -6,36 +6,28 @@ define amdgpu_kernel void @v3i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
-; GFX906-NEXT:    v_mov_b32_e32 v4, 8
-; GFX906-NEXT:    v_mov_b32_e32 v5, 16
-; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dword v3, v2, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v1, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_e32 v4, 2, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
+; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX906-NEXT:    global_load_dword v1, v4, s[0:1]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v6, 0xff, v3
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v7, v4, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_1
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_2
-; GFX906-NEXT:    v_or3_b32 v3, v6, v7, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB0_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dword v0, v2, s[2:3]
+; GFX906-NEXT:    global_load_dword v1, v4, s[2:3]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v0
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v3, v4, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_1
-; GFX906-NEXT:    v_lshlrev_b32_sdwa v0, v5, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_2
-; GFX906-NEXT:    v_or3_b32 v3, v2, v3, v0
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
 ; GFX906-NEXT:  .LBB0_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_lshrrev_b32_e32 v0, 8, v3
-; GFX906-NEXT:    v_and_b32_e32 v0, 0xff, v0
+; GFX906-NEXT:    v_and_b32_e32 v0, 0xff, v3
 ; GFX906-NEXT:    v_lshlrev_b16_e32 v0, 8, v0
-; GFX906-NEXT:    v_or_b32_sdwa v0, v3, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX906-NEXT:    v_and_b32_sdwa v1, v3, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX906-NEXT:    v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX906-NEXT:    s_mov_b32 s0, 0xffff
 ; GFX906-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GFX906-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; GFX906-NEXT:    v_and_b32_sdwa v1, s0, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
 ; GFX906-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; GFX906-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX906-NEXT:    global_store_short v1, v0, s[6:7]
@@ -63,19 +55,34 @@ define amdgpu_kernel void @v4i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 2, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dword v1, v2, s[0:1]
+; GFX906-NEXT:    global_load_dword v1, v5, s[0:1]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 24, v1
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB1_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dword v1, v2, s[2:3]
+; GFX906-NEXT:    global_load_dword v1, v5, s[2:3]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v2, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 24, v1
 ; GFX906-NEXT:  .LBB1_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v0, 0
-; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    global_store_dword v0, v1, s[6:7]
+; GFX906-NEXT:    v_mov_b32_e32 v5, 8
+; GFX906-NEXT:    v_mov_b32_e32 v0, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v2, v5, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v0, v1, v0, v2
+; GFX906-NEXT:    v_and_b32_e32 v1, 0xff, v3
+; GFX906-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 24, v4
+; GFX906-NEXT:    v_or3_b32 v0, v0, v1, v2
+; GFX906-NEXT:    v_mov_b32_e32 v1, 0
+; GFX906-NEXT:    global_store_dword v1, v0, s[6:7]
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -99,28 +106,30 @@ define amdgpu_kernel void @v5i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 3, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v6, 3, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[0:1]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v6, s[0:1]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB2_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[2:3]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v6, s[2:3]
 ; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
 ; GFX906-NEXT:  .LBB2_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v4, 0
-; GFX906-NEXT:    v_lshrrev_b32_e32 v0, 8, v1
-; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 24, v1
-; GFX906-NEXT:    global_store_byte v4, v1, s[6:7]
-; GFX906-NEXT:    global_store_byte v4, v0, s[6:7] offset:1
-; GFX906-NEXT:    global_store_byte_d16_hi v4, v1, s[6:7] offset:2
-; GFX906-NEXT:    global_store_byte v4, v3, s[6:7] offset:3
-; GFX906-NEXT:    global_store_byte v4, v2, s[6:7] offset:4
+; GFX906-NEXT:    v_mov_b32_e32 v0, 0
+; GFX906-NEXT:    global_store_byte v0, v1, s[6:7]
+; GFX906-NEXT:    global_store_byte v0, v3, s[6:7] offset:1
+; GFX906-NEXT:    global_store_byte v0, v4, s[6:7] offset:2
+; GFX906-NEXT:    global_store_byte v0, v5, s[6:7] offset:3
+; GFX906-NEXT:    global_store_byte v0, v2, s[6:7] offset:4
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -144,19 +153,46 @@ define amdgpu_kernel void @v8i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1)
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 3, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v9, 3, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[0:1]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v9, s[0:1]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 24, v2
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB3_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dwordx2 v[1:2], v3, s[2:3]
+; GFX906-NEXT:    global_load_dwordx2 v[1:2], v9, s[2:3]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 24, v2
 ; GFX906-NEXT:  .LBB3_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v0, 0
-; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    global_store_dwordx2 v0, v[1:2], s[6:7]
+; GFX906-NEXT:    v_mov_b32_e32 v10, 8
+; GFX906-NEXT:    v_mov_b32_e32 v9, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v0, v10, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v0, v1, v9, v0
+; GFX906-NEXT:    v_and_b32_e32 v1, 0xff, v4
+; GFX906-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 24, v5
+; GFX906-NEXT:    v_or3_b32 v0, v0, v1, v3
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v1, v10, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v1, v2, v9, v1
+; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v7
+; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 24, v8
+; GFX906-NEXT:    v_or3_b32 v1, v1, v2, v3
+; GFX906-NEXT:    v_mov_b32_e32 v2, 0
+; GFX906-NEXT:    global_store_dwordx2 v2, v[0:1], s[6:7]
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -180,19 +216,70 @@ define amdgpu_kernel void @v16i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 4, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v17, 4, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx4 v[1:4], v5, s[0:1]
+; GFX906-NEXT:    global_load_dwordx4 v[1:4], v17, s[0:1]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v9, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v10, 24, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v11, 8, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v12, 16, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v13, 24, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v14, 8, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v15, 16, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v16, 24, v4
 ; GFX906-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX906-NEXT:    s_cbranch_execz .LBB4_2
 ; GFX906-NEXT:  ; %bb.1: ; %bb.1
-; GFX906-NEXT:    global_load_dwordx4 v[1:4], v5, s[2:3]
+; GFX906-NEXT:    global_load_dwordx4 v[1:4], v17, s[2:3]
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v7, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v8, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v9, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v10, 24, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v11, 8, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v12, 16, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v13, 24, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v14, 8, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v15, 16, v4
+; GFX906-NEXT:    v_lshrrev_b32_e32 v16, 24, v4
 ; GFX906-NEXT:  .LBB4_2: ; %bb.2
 ; GFX906-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX906-NEXT:    v_mov_b32_e32 v0, 0
-; GFX906-NEXT:    s_waitcnt vmcnt(0)
-; GFX906-NEXT:    global_store_dwordx4 v0, v[1:4], s[6:7]
+; GFX906-NEXT:    v_mov_b32_e32 v18, 8
+; GFX906-NEXT:    v_mov_b32_e32 v17, 0xff
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v0, v18, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v0, v1, v17, v0
+; GFX906-NEXT:    v_and_b32_e32 v1, 0xff, v6
+; GFX906-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v7
+; GFX906-NEXT:    v_or3_b32 v0, v0, v1, v5
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v1, v18, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v1, v2, v17, v1
+; GFX906-NEXT:    v_and_b32_e32 v2, 0xff, v9
+; GFX906-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v10
+; GFX906-NEXT:    v_or3_b32 v1, v1, v2, v5
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v2, v18, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v2, v3, v17, v2
+; GFX906-NEXT:    v_and_b32_e32 v3, 0xff, v12
+; GFX906-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v13
+; GFX906-NEXT:    v_or3_b32 v2, v2, v3, v5
+; GFX906-NEXT:    v_lshlrev_b32_sdwa v3, v18, v14 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
+; GFX906-NEXT:    v_and_or_b32 v3, v4, v17, v3
+; GFX906-NEXT:    v_and_b32_e32 v4, 0xff, v15
+; GFX906-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
+; GFX906-NEXT:    v_lshlrev_b32_e32 v5, 24, v16
+; GFX906-NEXT:    v_or3_b32 v3, v3, v4, v5
+; GFX906-NEXT:    v_mov_b32_e32 v4, 0
+; GFX906-NEXT:    global_store_dwordx4 v4, v[0:3], s[6:7]
 ; GFX906-NEXT:    s_endpgm
 entry:
   %idx = call i32 @llvm.amdgcn.workitem.id.x()
@@ -216,23 +303,123 @@ define amdgpu_kernel void @v32i8_liveout(ptr addrspace(1) %src1, ptr addrspace(1
 ; GFX906:       ; %bb.0: ; %entry
 ; GFX906-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX906-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX906-NEXT:    v_lshlrev_b32_e32 v9, 5, v0
+; GFX906-NEXT:    v_lshlrev_b32_e32 v32, 5, v0
 ; GFX906-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
 ; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX906-NEXT:    global_load_dwordx4 v[1:4], v9, s[0:1]
-; GFX906-NEXT:    global_load_dwordx4 v[5:8], v9, s[0:1] offset:16
+; GFX906-NEXT:    global_load_dwordx4 v[5:8], v32, s[0:1]
+; GFX906-NEXT:    global_load_dwordx4 v[1:4], v32, s[0:1] offset:16
+; GFX906-NEXT:    s_waitcnt vmcnt(1)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v31, 8, v5
+; GFX906-NEXT:    v_lshrrev_b32_e32 v26, 16, v5
+; GFX906-NEXT:    v_lshrrev_b32_e32 v27, 24, v5
+; GFX906-NEXT:    v_lshrrev_b32_e32 v28, 8, v6
+; GFX906-NEXT:    v_lshrrev_b32_e32 v29, 16, v6
+; GFX906-NEXT:    v_lshrrev_b32_e32 v30, 24, v6
+; GFX906-NEXT:    v_lshrrev_b32_e32 v0, 8, v7
+; GFX906-NEXT:    v_lshrrev_b32_e32 v9, 16, v7
+; GFX906-NEXT:    v_lshrrev_b32_e32 v10, 24, v7
+; GFX906-NEXT:    v_lshrrev_b32_e32 v11, 8, v8
+; GFX906-NEXT:    v_lshrrev_b32_e32 v12, 16, v8
+; GFX906-NEXT:    v_lshrrev_b32_e32 v13, 24, v8
+; GFX906-NEXT:    s_waitcnt vmcnt(0)
+; GFX906-NEXT:    v_lshrrev_b32_e32 v14, 8, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v15, 16, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v16, 24, v1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v17, 8, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v18, 16, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v19, 24, v2
+; GFX906-NEXT:    v_lshrrev_b32_e32 v20, 8, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v21, 16, v3
+; GFX906-NEXT:    v_lshrrev_b32_e32 v22...
[truncated]

@llvmbot
Member

llvmbot commented Jan 27, 2025

@llvm/pr-subscribers-llvm-globalisel



github-actions bot commented Jan 27, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm
Contributor

arsenm commented Jan 28, 2025

is beneficial for operations that natively work on illegal vector types to prevent flip-flopping between SGPR and VGPR.

This isn't how SGPR vs VGPR works. It's one direction only, you can't just copy to SGPR from VGPR. What example are you talking about?

return true;

// Intrinsics - assume they natively handle illegal type
if (dyn_cast<IntrinsicInst>(II))
Contributor

If you aren't using the return value, use isa. IntrinsicInst also does not imply it is a target intrinsic.

return true;

// Shuffles
if (dyn_cast<ShuffleVectorInst>(II))
Contributor

This needs new tests, but I doubt we should be special-casing specific operations.

Contributor Author

I am mixed about what the ideal solution should be here and left it somewhat open-ended. Correct me if I'm wrong, but a few things to consider are:

  1. If an operation is known to be scalarized, LRO shoehorning Values back into a packed VGPR adds overhead.
  2. While whitelisting only packed ops would work, the list may need to be exhaustive, and/or
  3. it seems LRO has other effects, such as splitting live ranges, which reduces RP for the instructions, however unintentional.

Should we err on the side of allowing more candidates for LRO?

Contributor

I think the most accurate heuristic would be akin to what other passes (e.g. SLP) do, in that we would calculate the aggregate value of vectorizing (coercing the illegal type) vs. scalarizing (keeping the illegal type as is). But I think we probably don't need that level of modeling.

For the cases where we have mixed uses (e.g. some user instructions which can natively operate on i8 vectors, and some which must be scalarized), I would think it is best to prefer type coercion. When we introduce the scalarization in-block, I think we have a better chance of folding the vector-extract code into the scalarized ops themselves (i.e. SDWA). Moreover, in the case of loops, the live-in vector will be live throughout the body of the loop, so the coercion will reduce the loop's baseline RP.

@choikwa
Contributor Author

choikwa commented Jan 28, 2025

is beneficial for operations that natively work on illegal vector types to prevent flip-flopping between SGPR and VGPR.

This isn't how SGPR vs VGPR works. It's one direction only, you can't just copy to SGPR from VGPR. What example are you talking about?

Apologies, corrected to 'unpacked and packed'.

; GFX906-NEXT: v_lshlrev_b16_e32 v2, 8, v4
; GFX906-NEXT: v_or_b32_sdwa v2, v5, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX906-NEXT: global_store_dwordx2 v3, v[0:1], s[6:7]
Contributor

Ditto

; GFX906-NEXT: v_lshlrev_b16_e32 v4, 8, v6
; GFX906-NEXT: v_or_b32_sdwa v4, v7, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX906-NEXT: global_store_dwordx4 v5, v[0:3], s[6:7]
Contributor

Ditto

; GFX906-NEXT: v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v0, v10, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v4, v4, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX906-NEXT: global_store_dwordx4 v9, v[1:4], s[6:7] offset:16
Contributor

Ditto

; GFX906-NEXT: v_lshlrev_b16_e32 v5, 8, v5
; GFX906-NEXT: v_or_b32_sdwa v3, v8, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v5, v6, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
; GFX906-NEXT: v_or_b32_sdwa v3, v3, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
Contributor

Ditto

II, TargetTransformInfo::TargetCostKind::TCK_SizeAndLatency);
LLVM_DEBUG(dbgs() << "shouldReplaceBasedOnOp: " << *II << " Cost=" << Cost
<< '\n';);
if (Cost >= 8)
Contributor

@jrbyrnes Jan 29, 2025

Need a more meaningful value than 8, something that captures the cost of scalarizing a certain value/op.
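
One way to read that suggestion, sketched below with a made-up helper and tuning factor (neither is in the patch): derive the threshold from how many scalar pieces the value would be broken into, rather than hard-coding 8.

  // Hypothetical sketch only; the helper name and the factor of 2 are invented
  // here to illustrate tying the threshold to the scalarization cost.
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Number of scalar pieces the value would be split into by the legalizer.
  static unsigned numScalarPieces(const Instruction *I) {
    if (auto *VTy = dyn_cast<FixedVectorType>(I->getType()))
      return VTy->getNumElements();
    return 1;
  }

  // e.g. instead of the magic `Cost >= 8`, something like:
  //   if (Cost >= 2 * numScalarPieces(II)) // factor of 2 is an assumption
  //     return true;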

return true;

// Intrinsics - assume they natively handle illegal type
if (isa<IntrinsicInst>(II))
Contributor

Is it possible to bring these special cases into a TTI query?

Contributor Author

Do you mean expand on TTI::isTypeLegal, like an isTypeAndOpLegal? The former only checks whether there is a registered RegClass that can hold the type natively. I assume the latter might make some of the logic in ISelLowering for determining custom legalization for Type & Op redundant.

Contributor

Yeah I meant having some general query handle these special cases appropriately -- but I think there's no good way to do this through what the TTI currently has.

@@ -259,6 +297,9 @@ bool LiveRegOptimizer::optimizeLiveType(
if (!shouldReplace(II->getType()))
continue;

if (!shouldReplaceBasedOnOp(II))
Contributor

You should be checking the users of II -- including walking through PHI nodes.

Contributor Author

Noted

@@ -2102,15 +2102,18 @@ define void @crash_lshlrevb16_not_reg_op() {
; NOSDWA: ; %bb.0: ; %bb0
; NOSDWA-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; NOSDWA-NEXT: s_mov_b64 s[4:5], 0
; NOSDWA-NEXT: s_and_b32 s6, s4, 0xff
Contributor Author

This looks like a missed opportunity for const prop (SI-Fold). The previous case shows const prop happening across the BB. This was a case where <2 x i8> was not coerced and ended up producing worse code. At first glance, it looks like the coerced version needed less const prop and was able to reach better-looking assembly after the SI-Fold pass. Wondering if this is a case of not re-running SI-Fold on changed MIs.

@choikwa
Contributor Author

choikwa commented Feb 1, 2025

Addressed the unpacked vni8 test cases in the latest commit (they now stay packed).

@choikwa force-pushed the LROfilter branch 3 times, most recently from 0c7728b to 5b95b6e on February 2, 2025 06:02
@@ -291,6 +333,9 @@ bool LiveRegOptimizer::optimizeLiveType(
}

Instruction *UseInst = cast<Instruction>(V);
if (!shouldReplaceBasedOnOp(UseInst))
Contributor

@jrbyrnes Feb 7, 2025

If we do this check after we do the PHINode handling above, then I think we'll probably end up with corrupt PHI node coercion in some cases.

At this position, we have already added the PHINodes and their incoming values to the list to be coerced, but we block adding their uses if coercion is not profitable.

Besides, if the check returns false, this just breaks out of the users loop, meaning we may have added some uses but not other uses, resulting in further corruption.

So, we need to move this check to the top. Also, it is not enough to just check the users of the instruction. A user can be a PHI node, which really just resolves values across control flow -- what we care about is the user of the PHI node, since that may be an instruction which truly maps to some hardware instruction -- we should check whether that hardware instruction natively supports 8-bit vectors. Similarly for shuffle/insert/extract vector instructions, we should look through to their uses, as those uses will be the determining factor as to whether we should vectorize.

bool IsCoercionProfitable = false;
SmallVector<Instruction *, 4> UserList;

for (User *V : II->users())
  if (auto *UseInst = dyn_cast<Instruction>(V))
    UserList.push_back(UseInst);

while (!UserList.empty() && !IsCoercionProfitable) {
  Instruction *CandInstruction = UserList.pop_back_val();
  // Look through PHI nodes: the interesting user is whatever consumes the PHI.
  if (isa<PHINode>(CandInstruction))
    for (User *V : CandInstruction->users())
      if (auto *UseInst = dyn_cast<Instruction>(V))
        UserList.push_back(UseInst);

  // Same-block users are not the cross-block case LRO cares about.
  if (CandInstruction->getParent() == II->getParent())
    continue;

  // TODO -- look through shufflevector, insert/extractelement as well.

  IsCoercionProfitable = shouldReplaceBasedOnOp(CandInstruction);
  // Prefer type coercion if we have any profitable use.
  if (IsCoercionProfitable)
    break;
}

if (!IsCoercionProfitable)
  continue;


}
LLVM_DEBUG(dbgs() << "shouldReplaceBasedOnOp: " << *II << " Cost=" << Cost
<< '\n';);
if (Cost >= LRO_COST_THRES)
Contributor

Ideally, we would be able to use TTI to calculate the cost of the instruction, and use TTI to calculate the cost of the scalarized version. This implies we would need to construct the equivalent scalarized version. As a fill-in for the scalarized version, I was thinking we might be able to compare the instruction cost to TTI.getTypeLegalizationCost, but I see that that is not a great comparison due to other factors in the instruction cost. So, maybe it makes the most sense to use a whitelist for now.


/// Check if operation is legal.
/// TODO: If we had IR<->SDag mapping, we could use TLI->isOperationLegal
bool GCNTTIImpl::isOpLegal(Instruction *I) {
Contributor

Can you move this to AMDGPULateCodeGenPrepare? Don't whitelist shuffles, and only whitelist v_dot / mfma / wmma (instead of all intrinsics).

@choikwa force-pushed the LROfilter branch 3 times, most recently from 8b15285 to 5cb4322 on February 11, 2025 18:02
}

/// Check if intrinsic natively operates on 8-bit or 16-bit
bool isNativeIntrinsic(Intrinsic::ID ID) {
Contributor

We don't need all the dot / mfma / wmma variants -- just the 8- and 4-bit ones.

}

bool isCoercionProfitable(Instruction *II) {
if (shouldReplaceByOp(II))
Contributor

We're checking the users of the op for profitable conditions, so checking shouldReplaceByOp on II doesn't make that much sense (unless we were to look the other direction -- at the defs of its operands).


while (!UserList.empty() && !Profitable) {
auto CII = UserList.pop_back_val();
if (!CVisited.insert(II).second)
Contributor

if (!CVisited.insert(II).second) -> if (!CVisited.insert(CII).second)

if (!CVisited.insert(II).second)
continue;

if (isa<PHINode>(CII) || isa<ShuffleVectorInst>(CII) ||
Contributor

Should also look through casts

if (auto *UseInst = dyn_cast<Instruction>(V))
UserList.push_back(UseInst);

if (CII->getParent() == II->getParent())
Contributor

If our II is a lookthru instruction, then we should bypass this condition

return true;
}

bool isCoercionProfitable(Instruction *II) {
Contributor

Something like this:

  bool isCoercionProfitable(Instruction *II) {
    SmallPtrSet<Instruction *, 4> CVisited;
    SmallVector<Instruction *, 4> UserList;

    // Check users for profitable conditions (across block user which can natively
    // handle the illegal vector).
    for (User *V : II->users())
      if (auto *UseInst = dyn_cast<Instruction>(V))
        UserList.push_back(UseInst);

    auto IsLookThru = [](Instruction *II) {
      return isa<PHINode>(II) || isa<ShuffleVectorInst>(II) ||
          isa<InsertElementInst>(II) || isa<ExtractElementInst>(II) || isa<CastInst>(II);
    };

    while (!UserList.empty()) {
      auto CII = UserList.pop_back_val();
      if (!CVisited.insert(CII).second)
        continue;    

      if (CII->getParent() == II->getParent() && !IsLookThru(II))
        continue;
      
      if (isOpLegal(CII))
        return true;

      if (IsLookThru(CII))
        for (User *V : CII->users())
          if (auto *UseInst = dyn_cast<Instruction>(V))
            UserList.push_back(UseInst);
    }
    return false;
  }

if (CII->getParent() == II->getParent())
continue;

Profitable = shouldReplaceByOp(CII);
Contributor

Don't need to check the cost model -- just the whitelist.

// If an operation incurs high enough cost or natively work on
// vector of illegal type, ie. v2i8, then it makes sense to try
// to coerce them as packed VGPR across BB.
bool shouldReplaceByOp(Instruction *II) {
Contributor

I think we can just delete this cost model in favor of the whitelist -- should be able to remove dependency on TTI too

; GFX906-NEXT: .LBB1_2: ; %bb.2
; GFX906-NEXT: s_or_b64 exec, exec, s[0:1]
; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: global_store_dword v1, v2, s[6:7]
; GFX906-NEXT: v_lshlrev_b16_e32 v0, 8, v5
Contributor

The LRO should coerce this.

@choikwa choikwa force-pushed the LROfilter branch 2 times, most recently from edb17b2 to 205a066 Compare February 18, 2025 18:45
if (auto *UseInst = dyn_cast<Instruction>(V))
UserList.push_back(UseInst);

auto IsLookThru = [](Instruction *II) {
Contributor

Can we also look thru v_perm intrinsic
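For illustration, the IsLookThru lambda from the sketch above could be extended roughly like this (assuming v_perm shows up as the amdgcn_perm intrinsic; a sketch, not necessarily the exact merged code):

auto IsLookThru = [](Instruction *II) {
  // v_perm only rearranges bytes, so look through it to whatever consumes
  // its result.
  if (const auto *Intr = dyn_cast<IntrinsicInst>(II))
    if (Intr->getIntrinsicID() == Intrinsic::amdgcn_perm)
      return true;
  return isa<PHINode>(II) || isa<ShuffleVectorInst>(II) ||
         isa<InsertElementInst>(II) || isa<ExtractElementInst>(II) ||
         isa<CastInst>(II);
};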

case Intrinsic::amdgcn_wmma_i32_16x16x32_iu4:
case Intrinsic::amdgcn_swmmac_i32_16x16x32_iu8:
case Intrinsic::amdgcn_swmmac_i32_16x16x32_iu4:
case Intrinsic::amdgcn_swmmac_i32_16x16x64_iu4:
Contributor

Can you also add the buffer_store intrinsics
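For illustration, the switch quoted above could grow entries along these lines (a sketch; which buffer-store intrinsic variants to include is the author's call):

  // Buffer stores take their data operand in packed registers, so they also
  // count as users that handle the illegal vector type natively.
  case Intrinsic::amdgcn_raw_buffer_store:
  case Intrinsic::amdgcn_raw_ptr_buffer_store:
  case Intrinsic::amdgcn_struct_buffer_store:
  case Intrinsic::amdgcn_struct_ptr_buffer_store:
    return true;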

It is known that vectors whose elements fit in i16 will be split
and scalarized in SelectionDAG's type legalizer
(see SIISelLowering::getPreferredVectorAction).

LRO attempts to undo the scalarizing of vectors across basic block
boundaries and shoehorn Values into VGPRs. LRO is beneficial for
operations that natively work on illegal vector types, to prevent
flip-flopping between unpacked and packed forms. If we know that
operations on the vector will be split and scalarized, then we don't
want to shoehorn them back into packed VGPRs.

Operations that we know to work natively on illegal vector types
usually come in the form of intrinsics (MFMA, DOT8), buffer stores,
shuffles, insert/extract element, and phi nodes, to name a few.
Contributor
@jrbyrnes jrbyrnes left a comment

LGTM, thanks!

@choikwa choikwa merged commit 45759fe into llvm:main Mar 5, 2025
11 checks passed
@@ -125,6 +127,131 @@ class LiveRegOptimizer {
return LK.first != TargetLoweringBase::TypeLegal;
}

/// Check if intrinsic natively operates on 8-bit or 16-bit
bool isNativeIntrinsic(Intrinsic::ID ID) {
Contributor

isNativeIntrinsic is redundant; we also really don't want to maintain this table. Is any target intrinsic good enough? If we really must, we should use SearchableTables.

Contributor

> Is any target intrinsic good enough?

It's probably "good enough" -- it may result in false positives (i.e. applying the coercion to cases where there is no profit), but maybe not many.

What we really want is a way to check if an intrinsic / instruction can natively handle illegal vector types without compiler assistance to scalarize the code -- this query may be useful for future work too.

Contributor

> compiler assistance to scalarize the code

Not clear to me what this means. All the vector types will be split or scalarized in some way. The target intrinsics, with a handful of exceptions, only operate on legal types. isTargetIntrinsic || isTypeLegal()?

Contributor

> Not clear to me what this means

"intrinsic / instruction can natively handle illegal vector types" = The target intrinsic has either bytewise logic on the 32 bit regs (like mfma) or type doesn't matter (i.e. store).

For a standard instruction (e.g. v4i8 add), the compiler will need to insert scalarization code.

We should probably just use isTargetIntrinsic and narrow it as needed.
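A minimal sketch of that direction, assuming Function::isTargetIntrinsic() as the query (illustrative only; the store special case follows the discussion above):

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: accept any target intrinsic user by default and narrow later;
// stores are type-agnostic, so they count too.
static bool userNativelyHandlesIllegalVec(const Instruction *I) {
  if (const auto *Call = dyn_cast<CallInst>(I))
    if (const Function *Callee = Call->getCalledFunction())
      return Callee->isTargetIntrinsic();
  return isa<StoreInst>(I);
}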

isa<InsertElementInst>(II) || isa<ExtractElementInst>(II) ||
isa<CastInst>(II);
};

Contributor

Most of this heuristic isn't covered by tests; there are no new tests added.

jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
…lvm#124624)
