[AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. #87265
Conversation
@llvm/pr-subscribers-backend-amdgpu
Author: Chaitanya (skc7)
Changes: This PR introduces a new pass, "amdgpu-sw-lower-lds". It lowers local data store (LDS) uses in kernel and non-kernel functions in a module to dynamically allocated device global memory.
Replacement of kernel LDS accesses:
Replacement of non-kernel LDS accesses:
Patch is 122.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87265.diff 18 Files Affected:
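For reference, the new pass can be exercised standalone with opt; the invocation below mirrors the RUN lines in the tests added by this patch (the input file name is a placeholder):

opt -S -mtriple=amdgcn-- -passes=amdgpu-sw-lower-lds input.ll -o lowered.ll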
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 6016bd5187d887..15ff74f7c53af3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -263,6 +263,15 @@ struct AMDGPUAlwaysInlinePass : PassInfoMixin<AMDGPUAlwaysInlinePass> {
bool GlobalOpt;
};
+void initializeAMDGPUSwLowerLDSLegacyPass(PassRegistry &);
+extern char &AMDGPUSwLowerLDSLegacyPassID;
+ModulePass *createAMDGPUSwLowerLDSLegacyPass();
+
+struct AMDGPUSwLowerLDSPass : PassInfoMixin<AMDGPUSwLowerLDSPass> {
+ AMDGPUSwLowerLDSPass() {}
+ PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
class AMDGPUCodeGenPreparePass
: public PassInfoMixin<AMDGPUCodeGenPreparePass> {
private:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index 595f09664c55e4..f0456d3f62a816 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -212,6 +212,7 @@
#define DEBUG_TYPE "amdgpu-lower-module-lds"
using namespace llvm;
+using namespace AMDGPU;
namespace {
@@ -234,17 +235,6 @@ cl::opt<LoweringKind> LoweringKindLoc(
clEnumValN(LoweringKind::hybrid, "hybrid",
"Lower via mixture of above strategies")));
-bool isKernelLDS(const Function *F) {
- // Some weirdness here. AMDGPU::isKernelCC does not call into
- // AMDGPU::isKernel with the calling conv, it instead calls into
- // isModuleEntryFunction which returns true for more calling conventions
- // than AMDGPU::isKernel does. There's a FIXME on AMDGPU::isKernel.
- // There's also a test that checks that the LDS lowering does not hit on
- // a graphics shader, denoted amdgpu_ps, so stay with the limited case.
- // Putting LDS in the name of the function to draw attention to this.
- return AMDGPU::isKernel(F->getCallingConv());
-}
-
template <typename T> std::vector<T> sortByName(std::vector<T> &&V) {
llvm::sort(V.begin(), V.end(), [](const auto *L, const auto *R) {
return L->getName() < R->getName();
@@ -305,183 +295,9 @@ class AMDGPULowerModuleLDS {
Decl, {}, {OperandBundleDefT<Value *>("ExplicitUse", UseInstance)});
}
- static bool eliminateConstantExprUsesOfLDSFromAllInstructions(Module &M) {
- // Constants are uniqued within LLVM. A ConstantExpr referring to a LDS
- // global may have uses from multiple different functions as a result.
- // This pass specialises LDS variables with respect to the kernel that
- // allocates them.
-
- // This is semantically equivalent to (the unimplemented as slow):
- // for (auto &F : M.functions())
- // for (auto &BB : F)
- // for (auto &I : BB)
- // for (Use &Op : I.operands())
- // if (constantExprUsesLDS(Op))
- // replaceConstantExprInFunction(I, Op);
-
- SmallVector<Constant *> LDSGlobals;
- for (auto &GV : M.globals())
- if (AMDGPU::isLDSVariableToLower(GV))
- LDSGlobals.push_back(&GV);
-
- return convertUsersOfConstantsToInstructions(LDSGlobals);
- }
-
public:
AMDGPULowerModuleLDS(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
- using FunctionVariableMap = DenseMap<Function *, DenseSet<GlobalVariable *>>;
-
- using VariableFunctionMap = DenseMap<GlobalVariable *, DenseSet<Function *>>;
-
- static void getUsesOfLDSByFunction(CallGraph const &CG, Module &M,
- FunctionVariableMap &kernels,
- FunctionVariableMap &functions) {
-
- // Get uses from the current function, excluding uses by called functions
- // Two output variables to avoid walking the globals list twice
- for (auto &GV : M.globals()) {
- if (!AMDGPU::isLDSVariableToLower(GV)) {
- continue;
- }
-
- for (User *V : GV.users()) {
- if (auto *I = dyn_cast<Instruction>(V)) {
- Function *F = I->getFunction();
- if (isKernelLDS(F)) {
- kernels[F].insert(&GV);
- } else {
- functions[F].insert(&GV);
- }
- }
- }
- }
- }
-
- struct LDSUsesInfoTy {
- FunctionVariableMap direct_access;
- FunctionVariableMap indirect_access;
- };
-
- static LDSUsesInfoTy getTransitiveUsesOfLDS(CallGraph const &CG, Module &M) {
-
- FunctionVariableMap direct_map_kernel;
- FunctionVariableMap direct_map_function;
- getUsesOfLDSByFunction(CG, M, direct_map_kernel, direct_map_function);
-
- // Collect variables that are used by functions whose address has escaped
- DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
- for (Function &F : M.functions()) {
- if (!isKernelLDS(&F))
- if (F.hasAddressTaken(nullptr,
- /* IgnoreCallbackUses */ false,
- /* IgnoreAssumeLikeCalls */ false,
- /* IgnoreLLVMUsed */ true,
- /* IgnoreArcAttachedCall */ false)) {
- set_union(VariablesReachableThroughFunctionPointer,
- direct_map_function[&F]);
- }
- }
-
- auto functionMakesUnknownCall = [&](const Function *F) -> bool {
- assert(!F->isDeclaration());
- for (const CallGraphNode::CallRecord &R : *CG[F]) {
- if (!R.second->getFunction()) {
- return true;
- }
- }
- return false;
- };
-
- // Work out which variables are reachable through function calls
- FunctionVariableMap transitive_map_function = direct_map_function;
-
- // If the function makes any unknown call, assume the worst case that it can
- // access all variables accessed by functions whose address escaped
- for (Function &F : M.functions()) {
- if (!F.isDeclaration() && functionMakesUnknownCall(&F)) {
- if (!isKernelLDS(&F)) {
- set_union(transitive_map_function[&F],
- VariablesReachableThroughFunctionPointer);
- }
- }
- }
-
- // Direct implementation of collecting all variables reachable from each
- // function
- for (Function &Func : M.functions()) {
- if (Func.isDeclaration() || isKernelLDS(&Func))
- continue;
-
- DenseSet<Function *> seen; // catches cycles
- SmallVector<Function *, 4> wip{&Func};
-
- while (!wip.empty()) {
- Function *F = wip.pop_back_val();
-
- // Can accelerate this by referring to transitive map for functions that
- // have already been computed, with more care than this
- set_union(transitive_map_function[&Func], direct_map_function[F]);
-
- for (const CallGraphNode::CallRecord &R : *CG[F]) {
- Function *ith = R.second->getFunction();
- if (ith) {
- if (!seen.contains(ith)) {
- seen.insert(ith);
- wip.push_back(ith);
- }
- }
- }
- }
- }
-
- // direct_map_kernel lists which variables are used by the kernel
- // find the variables which are used through a function call
- FunctionVariableMap indirect_map_kernel;
-
- for (Function &Func : M.functions()) {
- if (Func.isDeclaration() || !isKernelLDS(&Func))
- continue;
-
- for (const CallGraphNode::CallRecord &R : *CG[&Func]) {
- Function *ith = R.second->getFunction();
- if (ith) {
- set_union(indirect_map_kernel[&Func], transitive_map_function[ith]);
- } else {
- set_union(indirect_map_kernel[&Func],
- VariablesReachableThroughFunctionPointer);
- }
- }
- }
-
- // Verify that we fall into one of 2 cases:
- // - All variables are absolute: this is a re-run of the pass
- // so we don't have anything to do.
- // - No variables are absolute.
- std::optional<bool> HasAbsoluteGVs;
- for (auto &Map : {direct_map_kernel, indirect_map_kernel}) {
- for (auto &[Fn, GVs] : Map) {
- for (auto *GV : GVs) {
- bool IsAbsolute = GV->isAbsoluteSymbolRef();
- if (HasAbsoluteGVs.has_value()) {
- if (*HasAbsoluteGVs != IsAbsolute) {
- report_fatal_error(
- "Module cannot mix absolute and non-absolute LDS GVs");
- }
- } else
- HasAbsoluteGVs = IsAbsolute;
- }
- }
- }
-
- // If we only had absolute GVs, we have nothing to do, return an empty
- // result.
- if (HasAbsoluteGVs && *HasAbsoluteGVs)
- return {FunctionVariableMap(), FunctionVariableMap()};
-
- return {std::move(direct_map_kernel), std::move(indirect_map_kernel)};
- }
-
struct LDSVariableReplacement {
GlobalVariable *SGV = nullptr;
DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 90f36fadf35903..eda4949d0296d5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,6 +22,7 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
AMDGPULowerBufferFatPointersPass(*this))
MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
+MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass())
MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
#undef MODULE_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp
new file mode 100644
index 00000000000000..ed3670fa1386d6
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp
@@ -0,0 +1,865 @@
+//===-- AMDGPUSwLowerLDS.cpp -----------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass lowers the local data store, LDS, uses in kernel and non-kernel
+// functions in module with dynamically allocated device global memory.
+//
+// Replacement of Kernel LDS accesses:
+// For a kernel, LDS access can be static or dynamic which are direct
+// (accessed within kernel) and indirect (accessed through non-kernels).
+// A device global memory equal to size of all these LDS globals will be
+// allocated. At the prologue of the kernel, a single work-item from the
+// work-group, does a "malloc" and stores the pointer of the allocation in
+// new LDS global that will be created for the kernel. This will be called
+// "malloc LDS global" in this pass.
+// Each LDS access corresponds to an offset in the allocated memory.
+// All static LDS accesses will be allocated first and then dynamic LDS
+// will occupy the device global memory.
+// To store the offsets corresponding to all LDS accesses, another global
+// variable is created which will be called "metadata global" in this pass.
+// - Malloc LDS Global:
+// It is LDS global of ptr type with name
+// "llvm.amdgcn.sw.lds.<kernel-name>".
+// - Metadata Global:
+// It is of struct type, with n members. n equals the number of LDS
+// globals accessed by the kernel(direct and indirect). Each member of
+// struct is another struct of type {i32, i32}. First member corresponds
+// to offset, second member corresponds to size of LDS global being
+// replaced. It will have name "llvm.amdgcn.sw.lds.<kernel-name>.md".
+// This global will have an initializer with static LDS related offsets
+// and sizes initialized. But for dynamic LDS related entries, offsets
+// will be initialized to previous static LDS allocation end offset. Sizes
+// for them will be zero initially. These dynamic LDS offset and size
+// values will be updated within the kernel, since kernel can read the
+// dynamic LDS size allocation done at runtime with query to
+// "hidden_dynamic_lds_size" hidden kernel argument.
+//
+// LDS accesses within the kernel will be replaced by "gep" ptr to
+// corresponding offset into allocated device global memory for the kernel.
+// At the epilogue of kernel, allocated memory would be made free by the same
+// single work-item.
+//
+// Replacement of non-kernel LDS accesses:
+// Multiple kernels can access the same non-kernel function.
+// All the kernels accessing LDS through non-kernels are sorted and
+// assigned a kernel-id. All the LDS globals accessed by non-kernels
+// are sorted. This information is used to build two tables:
+// - Base table:
+// Base table will have single row, with elements of the row
+// placed as per kernel ID. Each element in the row corresponds
+// to address of "malloc LDS global" variable created for
+// that kernel.
+// - Offset table:
+// Offset table will have multiple rows and columns.
+// Rows are assumed to be from 0 to (n-1). n is total number
+// of kernels accessing the LDS through non-kernels.
+// Each row will have m elements. m is the total number of
+// unique LDS globals accessed by all non-kernels.
+// Each element in the row correspond to the address of
+// the replacement of LDS global done by that particular kernel.
+// A LDS variable in non-kernel will be replaced based on the information
+// from base and offset tables. Based on kernel-id query, address of "malloc
+// LDS global" for that corresponding kernel is obtained from base table.
+// The Offset into the base "malloc LDS global" is obtained from
+// corresponding element in offset table. With this information, replacement
+// value is obtained.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "Utils/AMDGPUMemoryUtils.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/ADT/SetOperations.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/ReplaceConstant.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
+
+#include <algorithm>
+
+#define DEBUG_TYPE "amdgpu-sw-lower-lds"
+
+using namespace llvm;
+using namespace AMDGPU;
+
+namespace {
+
+using DomTreeCallback = function_ref<DominatorTree *(Function &F)>;
+
+struct LDSAccessTypeInfo {
+ SetVector<GlobalVariable *> StaticLDSGlobals;
+ SetVector<GlobalVariable *> DynamicLDSGlobals;
+};
+
+// Struct to hold all the Metadata required for a kernel
+// to replace a LDS global uses with corresponding offset
+// in to device global memory.
+struct KernelLDSParameters {
+ GlobalVariable *MallocLDSGlobal{nullptr};
+ GlobalVariable *MallocMetadataGlobal{nullptr};
+ LDSAccessTypeInfo DirectAccess;
+ LDSAccessTypeInfo IndirectAccess;
+ DenseMap<GlobalVariable *, SmallVector<uint32_t, 3>>
+ LDSToReplacementIndicesMap;
+ int32_t KernelId{-1};
+ uint32_t MallocSize{0};
+};
+
+// Struct to store info for creation of offset table
+// for all the non-kernel LDS accesses.
+struct NonKernelLDSParameters {
+ GlobalVariable *LDSBaseTable{nullptr};
+ GlobalVariable *LDSOffsetTable{nullptr};
+ SetVector<Function *> OrderedKernels;
+ SetVector<GlobalVariable *> OrdereLDSGlobals;
+};
+
+class AMDGPUSwLowerLDS {
+public:
+ AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)
+ : M(mod), IRB(M.getContext()), DTCallback(Callback) {}
+ bool Run();
+ void GetUsesOfLDSByNonKernels(CallGraph const &CG,
+ FunctionVariableMap &functions);
+ SetVector<Function *>
+ GetOrderedIndirectLDSAccessingKernels(SetVector<Function *> &&Kernels);
+ SetVector<GlobalVariable *>
+ GetOrderedNonKernelAllLDSGlobals(SetVector<GlobalVariable *> &&Variables);
+ void PopulateMallocLDSGlobal(Function *Func);
+ void PopulateMallocMetadataGlobal(Function *Func);
+ void PopulateLDSToReplacementIndicesMap(Function *Func);
+ void ReplaceKernelLDSAccesses(Function *Func);
+ void LowerKernelLDSAccesses(Function *Func, DomTreeUpdater &DTU);
+ void BuildNonKernelLDSOffsetTable(
+ std::shared_ptr<NonKernelLDSParameters> &NKLDSParams);
+ void BuildNonKernelLDSBaseTable(
+ std::shared_ptr<NonKernelLDSParameters> &NKLDSParams);
+ Constant *
+ GetAddressesOfVariablesInKernel(Function *Func,
+ SetVector<GlobalVariable *> &Variables);
+ void LowerNonKernelLDSAccesses(
+ Function *Func, SetVector<GlobalVariable *> &LDSGlobals,
+ std::shared_ptr<NonKernelLDSParameters> &NKLDSParams);
+
+private:
+ Module &M;
+ IRBuilder<> IRB;
+ DomTreeCallback DTCallback;
+ DenseMap<Function *, std::shared_ptr<KernelLDSParameters>>
+ KernelToLDSParametersMap;
+};
+
+template <typename T> SetVector<T> SortByName(std::vector<T> &&V) {
+ // Sort the vector of globals or Functions based on their name.
+ // Returns a SetVector of globals/Functions.
+ llvm::sort(V.begin(), V.end(), [](const auto *L, const auto *R) {
+ return L->getName() < R->getName();
+ });
+ return {std::move(SetVector<T>(V.begin(), V.end()))};
+}
+
+SetVector<GlobalVariable *> AMDGPUSwLowerLDS::GetOrderedNonKernelAllLDSGlobals(
+ SetVector<GlobalVariable *> &&Variables) {
+ // Sort all the non-kernel LDS accesses based on their name.
+ SetVector<GlobalVariable *> Ordered = SortByName(
+ std::vector<GlobalVariable *>(Variables.begin(), Variables.end()));
+ return std::move(Ordered);
+}
+
+SetVector<Function *> AMDGPUSwLowerLDS::GetOrderedIndirectLDSAccessingKernels(
+ SetVector<Function *> &&Kernels) {
+ // Sort the kernels accessing LDS through non-kernels based on their name.
+ // Also assign a kernel ID metadata based on the sorted order.
+ LLVMContext &Ctx = M.getContext();
+ if (Kernels.size() > UINT32_MAX) {
+ // 32 bit keeps it in one SGPR. > 2**32 kernels won't fit on the GPU
+ report_fatal_error("Unimplemented SW LDS lowering for > 2**32 kernels");
+ }
+ SetVector<Function *> OrderedKernels =
+ SortByName(std::vector<Function *>(Kernels.begin(), Kernels.end()));
+ for (size_t i = 0; i < Kernels.size(); i++) {
+ Metadata *AttrMDArgs[1] = {
+ ConstantAsMetadata::get(IRB.getInt32(i)),
+ };
+ Function *Func = OrderedKernels[i];
+ Func->setMetadata("llvm.amdgcn.lds.kernel.id",
+ MDNode::get(Ctx, AttrMDArgs));
+ auto &LDSParams = KernelToLDSParametersMap[Func];
+ assert(LDSParams);
+ LDSParams->KernelId = i;
+ }
+ return std::move(OrderedKernels);
+}
+
+void AMDGPUSwLowerLDS::GetUsesOfLDSByNonKernels(
+ CallGraph const &CG, FunctionVariableMap &functions) {
+ // Get uses from the current function, excluding uses by called functions
+ // Two output variables to avoid walking the globals list twice
+ for (auto &GV : M.globals()) {
+ if (!AMDGPU::isLDSVariableToLower(GV)) {
+ continue;
+ }
+
+ if (GV.isAbsoluteSymbolRef()) {
+ report_fatal_error(
+ "LDS variables with absolute addresses are unimplemented.");
+ }
+
+ for (User *V : GV.users()) {
+ User *FUU = V;
+ bool isCast = isa<BitCastOperator, AddrSpaceCastOperator>(FUU);
+ if (isCast && FUU->hasOneUse() && !FUU->user_begin()->user_empty())
+ FUU = *FUU->user_begin();
+ if (auto *I = dyn_cast<Instruction>(FUU)) {
+ Function *F = I->getFunction();
+ if (!isKernelLDS(F)) {
+ functions[F].insert(&GV);
+ }
+ }
+ }
+ }
+}
+
+void AMDGPUSwLowerLDS::PopulateMallocLDSGlobal(Function *Func) {
+ // Create new LDS global required for each kernel to store
+ // device global memory pointer.
+ auto &LDSParams = KernelToLDSParametersMap[Func];
+ assert(LDSParams);
+ // create new global pointer variable
+ LDSParams->MallocLDSGlobal = new GlobalVariable(
+ M, IRB.getPtrTy(), false, GlobalValue::InternalLinkage,
+ PoisonValue::get(IRB.getPtrTy()),
+ Twine("llvm.amdgcn.sw.lds." + F...
[truncated]
// Sort the vector of globals or Functions based on their name.
// Returns a SetVector of globals/Functions.
Name should be a tie-breaker only. Sort by alignment/size?
The amdgpu-lower-module-lds pass also sorts globals by name. The sorting is required to maintain a consistent order of globals in the offset table and while replacing the LDS globals with offsets into the new LDS global.
; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
; CHECK-NEXT: [[TMP1:%.*]] = call i32 @llvm.amdgcn.workitem.id.y()
; CHECK-NEXT: [[TMP2:%.*]] = call i32 @llvm.amdgcn.workitem.id.z()
; CHECK-NEXT: [[TMP3:%.*]] = or i32 [[TMP0]], [[TMP1]] |
Should try to strip the corresponding amdgpu-no-* attributes for introduced intrinsic calls
Added utility method from amdgpu-lower-module-lds pass to AMDGPUMemoryUtils and removed amdgpu-no-workitem-id-* attributes from kernels which access LDS.
✅ With the latest revision this PR passed the C/C++ code formatter.
Title is misleading. I think the implementation of the pass, and adding it to the pass pipeline should be done in separate changes
mostly coding style nits. The coding style here differs a bit from what we usually see so I pointed out the things that stood out to me as someone that's not in the loop with this change.
class AMDGPUSwLowerLDS {
public:
AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback) |
-AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)
+AMDGPUSwLowerLDS(Module &Mod, DomTreeCallback Callback)
CamelCase
AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)
    : M(mod), IRB(M.getContext()), DTCallback(Callback) {}
bool run();
void getUsesOfLDSByNonKernels(CallGraph const &CG, |
-void getUsesOfLDSByNonKernels(CallGraph const &CG,
+void getUsesOfLDSByNonKernels(const CallGraph &CG,
To be consistent with the codebase.
void getUsesOfLDSByNonKernels(CallGraph const &CG,
                              FunctionVariableMap &functions);
SetVector<Function *>
getOrderedIndirectLDSAccessingKernels(SetVector<Function *> &&Kernels); |
Please document those functions, even if it's just a short comment.
It helps maintainability.
template <typename T> SetVector<T> sortByName(std::vector<T> &&V) {
  // Sort the vector of globals or Functions based on their name.
  // Returns a SetVector of globals/Functions.
llvm::sort(V.begin(), V.end(), [](const auto *L, const auto *R) { |
llvm:: is not needed, I think. I also think you can just do llvm::sort(V, ..)?
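A minimal sketch of the suggested form, assuming the range overload of sort from llvm/ADT/STLExtras.h:

  // Range-based overload; no explicit begin()/end() needed.
  llvm::sort(V, [](const auto *L, const auto *R) { return L->getName() < R->getName(); });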
set_union(transitive_map_function[&Func], direct_map_function[F]);

for (const CallGraphNode::CallRecord &R : *CG[F]) {
Function *ith = R.second->getFunction(); |
CamelCase
// direct_map_kernel lists which variables are used by the kernel
// find the variables which are used through a function call
FunctionVariableMap indirect_map_kernel; |
CamelCase
continue;

for (const CallGraphNode::CallRecord &R : *CG[&Func]) {
Function *ith = R.second->getFunction(); |
CamelCase
StringRef FnAttr) {
KernelRoot->removeFnAttr(FnAttr);

SmallVector<Function *> WorkList({CG[KernelRoot]->getFunction()}); |
Use = to assign.
Updated.
#include <algorithm>

#define DEBUG_TYPE "amdgpu-sw-lower-lds" |
nit: is it possible to add some LLVM_DEBUG output to this pass? It greatly helps debug eventual issues.
Added a few debug outputs while replacing the LDS accesses. Thanks for the suggestion.
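A minimal sketch of the kind of debug output in question, assuming DEBUG_TYPE is the one defined above and Func is the kernel being processed (the message text is illustrative):

#include "llvm/Support/Debug.h"  // provides LLVM_DEBUG and dbgs()

LLVM_DEBUG(dbgs() << DEBUG_TYPE << ": replacing LDS accesses in kernel "
                  << Func->getName() << "\n");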
Needs a rebase, hard to see over the moved code patch
//{StartOffset, AlignedSizeInBytes}
SmallString<128> MDItemStr;
raw_svector_ostream MDItemOS(MDItemStr);
MDItemOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.item"; |
MDItemOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.item"; | |
MDItemOS << "llvm.amdgcn.sw.lds." << Func->getName() << ".md.item"; |
auto MallocSizeCalcLambda =
[&](SetVector<GlobalVariable *> &DynamicLDSGlobals) { |
Make this a regular helper function?
Value *ImplicitArg =
    IRB.CreateIntrinsic(Intrinsic::amdgcn_implicitarg_ptr, {}, {});
Value *HiddenDynLDSSize = IRB.CreateInBoundsGEP(
ImplicitArg->getType(), ImplicitArg, {IRB.getInt32(15)}); |
Don't understand where the hardcoded 15 came from. There are various ConstInBoundsGEPs for this case too
These should also use 64-bit indexes, this is canonically a 64-bit address space. Can we use an enum or something more structured to access the ABI location? I'm assuming this is assuming COV5?
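For context, a hedged IR-level sketch of what this code emits. The GEP index 15 over a ptr-typed element (8-byte stride) corresponds to a 120-byte offset into the implicit kernarg block; as the comments above note, the exact location of hidden_dynamic_lds_size is ABI/code-object-version dependent, so treat the constant as a placeholder:

  %implicitarg = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
  ; placeholder byte offset of hidden_dynamic_lds_size (ABI dependent)
  %dyn.size.ptr = getelementptr inbounds i8, ptr addrspace(4) %implicitarg, i64 120
  %dyn.lds.size = load i32, ptr addrspace(4) %dyn.size.ptr, align 4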
auto *GEPForEndStaticLDSSize = IRB.CreateInBoundsGEP(
    MetadataStructType, SwLDSMetadata,
{IRB.getInt32(0), IRB.getInt32(NumStaticLDS - 1), IRB.getInt32(2)}); |
Use the Const* variants to hide all the getInt32s away
@@ -0,0 +1,58 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals all --version 4
; RUN: opt < %s -passes=amdgpu-sw-lower-lds -S -mtriple=amdgcn-- | FileCheck %s |
Should specifically use amdhsa triples for these tests
/// Strip "amdgpu-no-lds-kernel-id" from any functions where we may have | ||
/// introduced its use. If AMDGPUAttributor ran prior to the pass, we inferred | ||
/// the lack of llvm.amdgcn.lds.kernel.id calls. | ||
void removeNoLdsKernelIdFromReachable(CallGraph &CG, Function *KernelRoot) { |
Is this rebased on main? This deletion should have already been merged when the code was moved to AMDGPUMemoryUtils?
Rebased and updated in latest commits.
Raised PR #92686 to remove this change.
SmallString<128> MDTypeStr;
raw_svector_ostream MDTypeOS(MDTypeStr);
MDTypeOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.type"; |
MDTypeOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.type"; | |
MDTypeOS << "llvm.amdgcn.sw.lds." << Func->getName() << ".md.type"; |
another one
StructType::create(Ctx, Items, MDTypeOS.str());
SmallString<128> MDStr;
raw_svector_ostream MDOS(MDStr);
MDOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md"; |
MDOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md"; | |
MDOS << "llvm.amdgcn.sw.lds." << Func->getName() << ".md"; |
Value *BasePlusOffset =
    IRB.CreateInBoundsGEP(IRB.getInt8Ty(), SwLDS, {Load});
LLVM_DEBUG(dbgs() << "Sw LDS Lowering, Replacing LDS "
<< GV->getName().str()); |
-<< GV->getName().str());
+<< GV->getName());
ReplaceKernelLDSAccesses(Func);

auto *CondFreeBlock = BasicBlock::Create(Ctx, "CondFree", Func); |
Presumably the runtime has to manage cleanup of anything that happened in the kernel?
// Replace LDS access in non-kernel with replacement queried from
// Base table and offset from offset table.
LLVM_DEBUG(dbgs() << "Sw LDS lowering, lower non-kernel access for : "
<< Func->getName().str()); |
-<< Func->getName().str());
+<< Func->getName());
You should almost never need to convert to std::string
Value *BasePlusOffset =
    IRB.CreateInBoundsGEP(IRB.getInt8Ty(), BasePtr, {OffsetLoad});
LLVM_DEBUG(dbgs() << "Sw LDS Lowering, Replace non-kernel LDS for "
<< GV->getName().str()); |
-<< GV->getName().str());
+<< GV->getName());
@@ -0,0 +1,100 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4 |
These should use --check-globals since that's most of the point of the pass
The --check-globals command-line option does update the tests with globals checks, but some of the resulting tests fail with a missing ']' (closing bracket) error, like the example below. So I have updated the tests with globals checks that don't hit this error.
@llvm.amdgcn.sw.lds.offset.table = internal addrspace(4) constant [2 x [4 x i32]] [[4 x i32] [i32 ptrtoint (ptr addrspace(1) @llvm.amdgcn.sw.lds.k0.md to i32), i32 poison, ..
removeFnAttrFromReachable(CG, Func, "amdgpu-no-workitem-id-x"); | ||
removeFnAttrFromReachable(CG, Func, "amdgpu-no-workitem-id-y"); | ||
removeFnAttrFromReachable(CG, Func, "amdgpu-no-workitem-id-z"); |
These could all be removed in one CallGraph walk instead of 3 separate ones
Currently removeFnAttrFromReachable accepts a single StringRef argument. It needs to be changed to accept an array of StringRefs.
Raised #94188.
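A possible shape for that follow-up, sketched as a hypothetical overload of the existing helper; the signature and body below are illustrative, not necessarily what #94188 landed:

void removeFnAttrFromReachable(CallGraph &CG, Function *KernelRoot,
                               ArrayRef<StringRef> FnAttrs) {
  // Strip every listed attribute in one call-graph walk instead of one walk
  // per attribute.
  SmallVector<Function *> WorkList({CG[KernelRoot]->getFunction()});
  SmallPtrSet<Function *, 8> Visited;
  while (!WorkList.empty()) {
    Function *F = WorkList.pop_back_val();
    if (!F || !Visited.insert(F).second)
      continue;
    for (StringRef Attr : FnAttrs)
      F->removeFnAttr(Attr);
    for (const CallGraphNode::CallRecord &R : *CG[F])
      WorkList.push_back(R.second->getFunction());
  }
}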
};
bool IsChanged = false;
AMDGPUSwLowerLDS SwLowerLDSImpl(M, DTCallback);
IsChanged |= SwLowerLDSImpl.run(); |
Can just define isChanged here
I'm not sure I trust the pass ordering with this strategy, and I think the pass name should not claim it's software lowering when it's not really that
for (auto &GV : LDSGlobals) {
  if (is_contained(UniqueLDSGlobals, GV))
    continue;
else |
Don't need else after continue
Updated
Each workgroup needs a different pointer, and a grid can have a lot of work groups. I don't think a global would work.
So why isn't the runtime responsible for setting up this pointer then? It does that effectively for LDS, the allocation is managed as part of the dispatch. That also goes back to my question of why we need to insert explicit free code, instead of just letting the runtime clean it up after as it would need to anyway.
Which runtime are you talking about? The firmware or trap handler? And where are they going to place the pointer? Or are you even considering a new architected or reserved register to hold it and a new ABI?
Presumably the implicit kernel arguments, and whatever is setting that up. It's essentially a partner to the queue pointer, which also is in the implicit kernargs.
OK. Suppose the launch has a million work groups. How much memory should the runtime allocate, and how will workgroup J decode what part of that memory to use? It can certainly be done but I'm wondering if we really need to do it now? And how much do we really need an independently working SW LDS?
The runtime is already bounded on how many groups it can dispatch at once; the allocation is tied to the dispatch size.
I think having the trap door of pure software LDS would enable some useful experiments, such as not depending on any whole program visibility to lower function defined local variables. It also reduces the number of parts that need to directly interact in the compiler pipeline. With the current approach I foresee having to fix the same bugs twice in the module LDS lowering, and the asan version of module LDS lowering.
The runtime doesn't split the dispatch into machine-sized chunks. If it does have a limit, then it is probably much larger than we want to allocate for.
I don't disagree. But reading global memory for the pointer will be slower. The runtime launching one dispatch at a time to manage the memory will be slower, and we still need a kernel prolog and epilog for each workgroup to allocate and deallocate it's chunk of the global allocation, and I still don't know where we are going to store the per-workgroup workgroup-allocation-chunk-index or pointer.
I thought it already had to do this if stack was enabled to avoid going over a device wide limit.
Yes, there is a special mode when scratch space is low but something like that would not be desirable to impose on every dispatch.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/2930
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/186/builds/1725
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/168/builds/2563
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/76/builds/2236
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/145/builds/1462
Fixes a linking error in LLVM CI: "AMDGPUSwLowerLDS::run()': AMDGPUSwLowerLDS.cpp:(.text._ZN12_GLOBAL__N_116AMDGPUSwLowerLDS3runEv+0x164): undefined reference to `llvm::getAddressSanitizerParams(llvm::Triple const&, int, bool, unsigned long*, int*, bool*)'". The amdgpu-sw-lower-lds pass from #87265 uses getAddressSanitizerParams from the AddressSanitizer pass, but LLVMInstrumentation was not linked into AMDGPUCodeGen. This PR adds it.
Issue should be fixed by #106039.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/64/builds/786
This change adds the utilities required to asan-instrument memory instructions. In the "amdgpu-sw-lower-lds" pass (llvm#87265), during lowering from LDS to global memory, new global-memory instructions are created which need to be asan-instrumented. Change-Id: I17f0371cdc15ea7af6c4e2a325af6ad96a5bfb7b
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/10/builds/145
This PR introduces a new pass, "amdgpu-sw-lower-lds".
This pass lowers local data store (LDS) uses in kernel and non-kernel functions in a module to use dynamically allocated global memory. A packed LDS layout is emulated in the global memory.
The memory instructions lowered from LDS to global memory are then instrumented for the address sanitizer, to catch addressing errors.
This pass only works when the address sanitizer has been enabled and has instrumented the IR. It identifies instrumented IR by the "nosanitize_address" module flag.
For a kernel, LDS accesses can be static or dynamic, and either direct (accessed within the kernel) or indirect (accessed through non-kernels).
Replacement of Kernel LDS accesses:
All the LDS accesses corresponding to a kernel will be packed together: all static LDS accesses are allocated first and dynamic LDS follows. The total size with alignment is calculated. A new LDS global called "SW LDS" will be created for the kernel, and it will have the attribute "amdgpu-lds-size" attached with the value of the calculated size. All the LDS accesses in the module will be replaced by a GEP with an offset into the "SW LDS".
A new "llvm.amdgcn.<kernel>.dynlds" global is created per kernel accessing dynamic LDS. It will be marked as used by the kernel and will have MD_absolute_symbol metadata set to the total static LDS size, since dynamic LDS allocation starts after all static LDS allocations.
A device global memory allocation equal to the total LDS size will be made. At the prologue of the kernel, a single work-item from the work-group does a "malloc" and stores the pointer of the allocation in the "SW LDS". To store the offsets corresponding to all LDS accesses, another global variable is created, called "SW LDS metadata" in this pass.
SW LDS:
It is an LDS global of ptr type with the name "llvm.amdgcn.sw.lds.<kernel-name>".
SW LDS Metadata:
It is of struct type, with n members, where n equals the number of LDS globals accessed by the kernel (direct and indirect). Each member of the struct is another struct of type {i32, i32, i32}. The first member corresponds to the offset, the second to the size of the LDS global being replaced, and the third to the total aligned size. It will have the name "llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an initializer with the static LDS related offsets and sizes filled in. For dynamic LDS related entries, offsets will be initialized to the end offset of the previous static LDS allocation and sizes will be zero initially. These dynamic LDS offset and size values will be updated within the kernel, since the kernel can read the dynamic LDS size allocated at runtime with a query to the "hidden_dynamic_lds_size" hidden kernel argument.
At the epilogue of the kernel, the allocated memory is freed by the same single work-item.
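A hedged IR-level sketch of the kernel-side artifacts described above, for a hypothetical kernel k0 with one 16-byte static LDS variable plus dynamic LDS (names follow the scheme above; address spaces, offsets and sizes are illustrative, not taken verbatim from the patch):

%llvm.amdgcn.sw.lds.k0.md.item = type { i32, i32, i32 }  ; {offset, size, aligned size}
%llvm.amdgcn.sw.lds.k0.md.type = type { %llvm.amdgcn.sw.lds.k0.md.item, %llvm.amdgcn.sw.lds.k0.md.item }

; "SW LDS": holds the device-global pointer malloc'ed by the single work-item.
@llvm.amdgcn.sw.lds.k0 = internal addrspace(3) global ptr poison, align 8

; "SW LDS metadata": one entry per LDS global; the dynamic entry's size starts at 0
; and is filled in at run time from hidden_dynamic_lds_size.
@llvm.amdgcn.sw.lds.k0.md = internal addrspace(1) global %llvm.amdgcn.sw.lds.k0.md.type { %llvm.amdgcn.sw.lds.k0.md.item { i32 0, i32 16, i32 16 }, %llvm.amdgcn.sw.lds.k0.md.item { i32 16, i32 0, i32 0 } }, align 4

; inside k0, a use of the static LDS variable becomes a GEP into the SW LDS:
  %off = load i32, ptr addrspace(1) getelementptr inbounds (%llvm.amdgcn.sw.lds.k0.md.type, ptr addrspace(1) @llvm.amdgcn.sw.lds.k0.md, i32 0, i32 0, i32 0), align 4
  %rep = getelementptr inbounds i8, ptr addrspace(3) @llvm.amdgcn.sw.lds.k0, i32 %off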
Replacement of non-kernel LDS accesses:
Multiple kernels can access the same non-kernel function. All the kernels accessing LDS through non-kernels are sorted and assigned a kernel-id. All the LDS globals accessed by non-kernels are sorted.
This information is used to build two tables:
Base table:
The base table has a single row, with the elements of the row placed as per kernel ID. Each element in the row corresponds to the ptr of the "SW LDS" variable created for that kernel.
Offset table:
The offset table has multiple rows and columns. Rows are indexed from 0 to (n-1), where n is the total number of kernels accessing LDS through non-kernels. Each row has m elements, where m is the total number of unique LDS globals accessed by all non-kernels. Each element in the row corresponds to the ptr of the replacement of the LDS global done by that particular kernel.
An LDS variable in a non-kernel function is replaced based on the information from the base and offset tables. Based on a kernel-id query, the ptr of the "SW LDS" for the corresponding kernel is obtained from the base table. The offset into that "SW LDS" is obtained from the corresponding element in the offset table. With this information, the replacement value is obtained.
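Continuing the sketch above with a second hypothetical kernel k1 and a single LDS global reached from a non-kernel function, the two tables and the rewritten access might look roughly like this (the offset-table entries follow the ptrtoint-of-metadata-address pattern visible in the test excerpt quoted earlier; table element types, address spaces and the rewrite sequence are illustrative):

@llvm.amdgcn.sw.lds.base.table = internal addrspace(4) constant [2 x ptr addrspace(3)] [ptr addrspace(3) @llvm.amdgcn.sw.lds.k0, ptr addrspace(3) @llvm.amdgcn.sw.lds.k1]
@llvm.amdgcn.sw.lds.offset.table = internal addrspace(4) constant [2 x [1 x i32]] [[1 x i32] [i32 ptrtoint (ptr addrspace(1) @llvm.amdgcn.sw.lds.k0.md to i32)], [1 x i32] [i32 ptrtoint (ptr addrspace(1) @llvm.amdgcn.sw.lds.k1.md to i32)]]

; inside the non-kernel function, the LDS use is rewritten roughly as:
  %kid   = call i32 @llvm.amdgcn.lds.kernel.id()
  %bslot = getelementptr inbounds [2 x ptr addrspace(3)], ptr addrspace(4) @llvm.amdgcn.sw.lds.base.table, i32 0, i32 %kid
  %base  = load ptr addrspace(3), ptr addrspace(4) %bslot, align 8
  %oslot = getelementptr inbounds [2 x [1 x i32]], ptr addrspace(4) @llvm.amdgcn.sw.lds.offset.table, i32 0, i32 %kid, i32 0
  %mdint = load i32, ptr addrspace(4) %oslot, align 4
  %mdptr = inttoptr i32 %mdint to ptr addrspace(1)
  %off   = load i32, ptr addrspace(1) %mdptr, align 4
  %rep   = getelementptr inbounds i8, ptr addrspace(3) %base, i32 %off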