Commit 2bea25d

Merge pull request #2298 from Bensuo/ewan/cuda_update_local_size
Improve CUDA/HIP local argument handling
2 parents 0b5d8f9 + e578228 commit 2bea25d

13 files changed

+1157
-185
lines changed

scripts/core/CUDA.rst

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -148,6 +148,39 @@ take the extra global offset argument. Use of the global offset is not
148148
recommended for non SYCL compiler toolchains. This parameter can be ignored if
149149
the user does not wish to use the global offset.
150150

151+
Local Memory Arguments
152+
----------------------
153+
154+
In UR, local memory is a region of memory shared by all the work-items in
155+
a work-group. A kernel function signature can include local memory address
156+
space pointer arguments, which are set by the user with
157+
``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
158+
and make available from the pointer argument.
159+
160+
The CUDA adapter implements local memory in a kernel as a single ``__shared__``
161+
memory allocation, and each individual local memory argument is a ``u32`` byte
162+
offset kernel parameter which is combined inside the kernel with the
163+
``__shared__`` memory allocation. Therefore, for ``N`` local arguments that need
164+
to be set on a kernel with ``urKernelSetArgLocal``, the total aligned size across the
165+
``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
166+
memory allocation by the CUDA adapter and passed as the ``sharedMemBytes``
167+
argument to ``cuLaunchKernel`` (or variants like ``cuLaunchCooperativeKernel``
168+
or ``cuGraphAddKernelNode``).
169+
170+
For each kernel ``u32`` local memory offset parameter, aligned offsets into the
171+
single memory allocation are calculated and passed at runtime by the adapter via
172+
``kernelParams`` when launching the kernel (or adding the kernel as a graph
173+
node). When a user calls ``urKernelSetArgLocal`` with an argument index that
174+
has already been set on the kernel, the adapter recalculates the size of the
175+
``__shared__`` memory allocation and offset for the index, as well as the
176+
offsets of any local memory arguments at following indices.
177+
178+
.. warning::
179+
180+
The CUDA UR adapter implementation of local memory assumes the kernel created
181+
has been created by DPC++, instrumenting the device code so that local memory
182+
arguments are offsets rather than pointers.
183+
151184
Other Notes
152185
===========
153186

@@ -164,4 +197,5 @@ Contributors
164197
------------
165198

166199
200+
167201
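
Reviewer's sketch of the layout the CUDA.rst text describes: each local argument gets an aligned byte offset into one shared allocation, and the running total is what would be passed as ``sharedMemBytes``. The alignment rule (align to the smaller of the argument size and ``sizeof(double) * 16``) is taken from the ``kernel.hpp`` hunk in this commit. ``LocalLayout`` and ``layoutLocalArgs`` are hypothetical names used only for illustration and are not part of the adapter, and zero-sized arguments are deliberately not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: where each local argument lands in the single
// shared allocation, plus the total size of that allocation.
struct LocalLayout {
  std::vector<size_t> Offsets; // aligned byte offset of each local argument
  size_t TotalBytes = 0;       // size that would be passed as sharedMemBytes
};

// Hypothetical helper: lays out local arguments in argument-index order using
// the alignment rule from the kernel.hpp hunk.
LocalLayout layoutLocalArgs(const std::vector<size_t> &Sizes) {
  // Maximum required alignment is the size of the largest vector type.
  constexpr size_t MaxAlignment = sizeof(double) * 16;
  LocalLayout Layout;
  for (size_t Size : Sizes) {
    assert(Size > 0 && "zero-sized local arguments are not modelled here");
    // Arguments smaller than the maximum alignment align to their own size.
    const size_t Alignment = std::min(MaxAlignment, Size);
    const size_t Pad = Layout.TotalBytes % Alignment;
    const size_t Offset =
        Pad == 0 ? Layout.TotalBytes : Layout.TotalBytes + (Alignment - Pad);
    Layout.Offsets.push_back(Offset);
    // Padding is charged to the argument that needed it.
    Layout.TotalBytes = Offset + Size;
  }
  return Layout;
}
```

With sizes of 4, 16 and 8 bytes, this sketch places the arguments at offsets 0, 16 and 32 and reports a total of 40 bytes: the 16-byte argument cannot start at offset 4, so 12 bytes of padding are added in front of it.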

scripts/core/HIP.rst

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,46 @@ take the extra global offset argument. Use of the global offset is not
9191
recommended for non SYCL compiler toolchains. This parameter can be ignored if
9292
the user does not wish to use the global offset.
9393

94+
Local Memory Arguments
95+
----------------------
96+
97+
In UR, local memory is a region of memory shared by all the work-items in
98+
a work-group. A kernel function signature can include local memory address
99+
space pointer arguments, which are set by the user with
100+
``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
101+
and make available from the pointer argument.
102+
103+
The HIP adapter implements local memory in a kernel as a single ``__shared__``
104+
memory allocation, and each individual local memory argument is a ``u32`` byte
105+
offset kernel parameter which is combined inside the kernel with the
106+
``__shared__`` memory allocation. Therefore, for ``N`` local arguments that need
107+
to be set on a kernel with ``urKernelSetArgLocal``, the total aligned size across the
108+
``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
109+
memory allocation by the HIP adapter and passed as the ``sharedMemBytes``
110+
argument to ``hipModuleLaunchKernel`` or ``hipGraphAddKernelNode``.
111+
112+
For each kernel ``u32`` local memory offset parameter, aligned offsets into the
113+
single memory allocation are calculated and passed at runtime by the adapter via
114+
``kernelParams`` when launching the kernel (or adding the kernel as a graph
115+
node). When a user calls ``urKernelSetArgLocal`` with an argument index that
116+
has already been set on the kernel, the adapter recalculates the size of the
117+
``__shared__`` memory allocation and offset for the index, as well as the
118+
offsets of any local memory arguments at following indices.
119+
120+
.. warning::
121+
122+
The HIP UR adapter implementation of local memory assumes the kernel created
123+
has been created by DPC++, instrumenting the device code so that local memory
124+
arguments are offsets rather than pointers.
125+
126+
127+
HIP kernels that are generated for DPC++ kernels with SYCL local accessors
128+
contain extra value arguments on top of the local memory argument for the
129+
local accessor. For each ``urKernelSetArgLocal`` argument, a user needs
130+
to make 3 calls to ``urKernelSetArgValue`` with each of the next 3 consecutive
131+
argument indexes. This represents a 3-dimensional offset into the local
132+
accessor.
133+
94134
Other Notes
95135
===========
96136

@@ -100,4 +140,5 @@ Contributors
100140
------------
101141

102142
143+
103144
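
Reviewer's sketch of the index bookkeeping implied by the HIP.rst note on SYCL local accessors: one local argument followed by three value arguments at the next three consecutive indexes. It only records which calls would be made; it does not call the UR API. ``ArgCall`` and ``planLocalAccessor`` are hypothetical names, and the ``sizeof(size_t)`` size recorded for each offset value is an assumption, since the documentation does not state the type of those values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one argument-setting call; not a UR type.
enum class ArgKind { Local, Value };

struct ArgCall {
  ArgKind Kind;   // Local ~ urKernelSetArgLocal, Value ~ urKernelSetArgValue
  uint32_t Index; // kernel argument index the call would target
  size_t Size;    // byte size the call would pass
};

// Records the four calls one local accessor needs, starting at Index, and
// returns the next free argument index.
uint32_t planLocalAccessor(std::vector<ArgCall> &Plan, uint32_t Index,
                           size_t LocalBytes) {
  assert(LocalBytes > 0 && "zero-sized local arguments are not modelled here");
  Plan.push_back(ArgCall{ArgKind::Local, Index, LocalBytes});
  // Three value arguments follow: one per dimension of the accessor offset.
  // sizeof(size_t) is an assumed size for each offset value.
  for (uint32_t Dim = 1; Dim <= 3; ++Dim) {
    Plan.push_back(ArgCall{ArgKind::Value, Index + Dim, sizeof(size_t)});
  }
  return Index + 4;
}
```

Planning two accessors back to back from index 0 consumes indexes 0 to 7, so the next ordinary kernel argument would go at index 8.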

source/adapters/cuda/command_buffer.cpp

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -522,9 +522,6 @@ UR_APIEXPORT ur_result_t UR_APICALL urCommandBufferAppendKernelLaunchExp(
522522
DepsList.data(), DepsList.size(),
523523
&NodeParams));
524524

525-
if (LocalSize != 0)
526-
hKernel->clearLocalSize();
527-
528525
// Add signal node if external return event is used.
529526
CUgraphNode SignalNode = nullptr;
530527
if (phEvent) {

source/adapters/cuda/enqueue.cpp

Lines changed: 0 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -493,9 +493,6 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueKernelLaunch(
493493
ThreadsPerBlock[0], ThreadsPerBlock[1], ThreadsPerBlock[2], LocalSize,
494494
CuStream, const_cast<void **>(ArgIndices.data()), nullptr));
495495

496-
if (LocalSize != 0)
497-
hKernel->clearLocalSize();
498-
499496
if (phEvent) {
500497
UR_CHECK_ERROR(RetImplEvent->record());
501498
*phEvent = RetImplEvent.release();
@@ -673,9 +670,6 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueKernelLaunchCustomExp(
673670
const_cast<void **>(ArgIndices.data()),
674671
nullptr));
675672

676-
if (LocalSize != 0)
677-
hKernel->clearLocalSize();
678-
679673
if (phEvent) {
680674
UR_CHECK_ERROR(RetImplEvent->record());
681675
*phEvent = RetImplEvent.release();

source/adapters/cuda/kernel.hpp

Lines changed: 78 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -61,10 +61,22 @@ struct ur_kernel_handle_t_ {
6161
using args_t = std::array<char, MaxParamBytes>;
6262
using args_size_t = std::vector<size_t>;
6363
using args_index_t = std::vector<void *>;
64+
/// Storage shared by all args which is mem copied into when adding a new
65+
/// argument.
6466
args_t Storage;
67+
/// Aligned size of each parameter, including padding.
6568
args_size_t ParamSizes;
69+
/// Byte offset into \p Storage allocation for each parameter.
6670
args_index_t Indices;
67-
args_size_t OffsetPerIndex;
71+
/// Aligned size in bytes for each local memory parameter after padding has
72+
/// been added. Zero if the argument at the index isn't a local memory
73+
/// argument.
74+
args_size_t AlignedLocalMemSize;
75+
/// Original size in bytes for each local memory parameter, prior to being
76+
/// padded to appropriate alignment. Zero if the argument at the index
77+
/// isn't a local memory argument.
78+
args_size_t OriginalLocalMemSize;
79+
6880
// A struct to keep track of memargs so that we can do dependency analysis
6981
// at urEnqueueKernelLaunch
7082
struct mem_obj_arg {
@@ -93,7 +105,8 @@ struct ur_kernel_handle_t_ {
93105
Indices.resize(Index + 2, Indices.back());
94106
// Ensure enough space for the new argument
95107
ParamSizes.resize(Index + 1);
96-
OffsetPerIndex.resize(Index + 1);
108+
AlignedLocalMemSize.resize(Index + 1);
109+
OriginalLocalMemSize.resize(Index + 1);
97110
}
98111
ParamSizes[Index] = Size;
99112
// calculate the insertion point on the array
@@ -102,28 +115,81 @@ struct ur_kernel_handle_t_ {
102115
// Update the stored value for the argument
103116
std::memcpy(&Storage[InsertPos], Arg, Size);
104117
Indices[Index] = &Storage[InsertPos];
105-
OffsetPerIndex[Index] = LocalSize;
118+
AlignedLocalMemSize[Index] = LocalSize;
106119
}
107120

108-
void addLocalArg(size_t Index, size_t Size) {
109-
size_t LocalOffset = this->getLocalSize();
121+
/// Returns the padded size and offset of a local memory argument.
122+
/// Local memory arguments need to be padded if the alignment for the size
123+
/// doesn't match the current offset into the kernel local data.
124+
/// @param Index Kernel arg index.
125+
/// @param Size User passed size of local parameter.
126+
/// @return Pair of (aligned size, aligned offset into local data).
127+
std::pair<size_t, size_t> calcAlignedLocalArgument(size_t Index,
128+
size_t Size) {
129+
// Store the unpadded size of the local argument
130+
if (Index + 2 > Indices.size()) {
131+
AlignedLocalMemSize.resize(Index + 1);
132+
OriginalLocalMemSize.resize(Index + 1);
133+
}
134+
OriginalLocalMemSize[Index] = Size;
135+
136+
// Calculate the current starting offset into local data
137+
const size_t LocalOffset = std::accumulate(
138+
std::begin(AlignedLocalMemSize),
139+
std::next(std::begin(AlignedLocalMemSize), Index), size_t{0});
110140

111-
// maximum required alignment is the size of the largest vector type
141+
// Maximum required alignment is the size of the largest vector type
112142
const size_t MaxAlignment = sizeof(double) * 16;
113143

114-
// for arguments smaller than the maximum alignment simply align to the
144+
// For arguments smaller than the maximum alignment simply align to the
115145
// size of the argument
116146
const size_t Alignment = std::min(MaxAlignment, Size);
117147

118-
// align the argument
148+
// Align the argument
119149
size_t AlignedLocalOffset = LocalOffset;
120-
size_t Pad = LocalOffset % Alignment;
150+
const size_t Pad = LocalOffset % Alignment;
121151
if (Pad != 0) {
122152
AlignedLocalOffset += Alignment - Pad;
123153
}
124154

155+
const size_t AlignedLocalSize = Size + (AlignedLocalOffset - LocalOffset);
156+
return std::make_pair(AlignedLocalSize, AlignedLocalOffset);
157+
}
158+
159+
void addLocalArg(size_t Index, size_t Size) {
160+
// Get the aligned argument size and offset into local data
161+
auto [AlignedLocalSize, AlignedLocalOffset] =
162+
calcAlignedLocalArgument(Index, Size);
163+
164+
// Store argument details
125165
addArg(Index, sizeof(size_t), (const void *)&(AlignedLocalOffset),
126-
Size + (AlignedLocalOffset - LocalOffset));
166+
AlignedLocalSize);
167+
168+
// For every existing local argument which follows at later argument
169+
// indices, update the offset and pointer into the kernel local memory.
170+
// Required as padding will need to be recalculated.
171+
const size_t NumArgs = Indices.size() - 1; // Accounts for implicit arg
172+
for (auto SuccIndex = Index + 1; SuccIndex < NumArgs; SuccIndex++) {
173+
const size_t OriginalLocalSize = OriginalLocalMemSize[SuccIndex];
174+
if (OriginalLocalSize == 0) {
175+
// Skip if successor argument isn't a local memory arg
176+
continue;
177+
}
178+
179+
// Recalculate alignment
180+
auto [SuccAlignedLocalSize, SuccAlignedLocalOffset] =
181+
calcAlignedLocalArgument(SuccIndex, OriginalLocalSize);
182+
183+
// Store new local memory size
184+
AlignedLocalMemSize[SuccIndex] = SuccAlignedLocalSize;
185+
186+
// Store new offset into local data
187+
const size_t InsertPos =
188+
std::accumulate(std::begin(ParamSizes),
189+
std::begin(ParamSizes) + SuccIndex, size_t{0});
190+
std::memcpy(&Storage[InsertPos], &SuccAlignedLocalOffset,
191+
sizeof(size_t));
192+
}
127193
}
128194

129195
void addMemObjArg(int Index, ur_mem_handle_t hMem, ur_mem_flags_t Flags) {
@@ -145,15 +211,11 @@ struct ur_kernel_handle_t_ {
145211
std::memcpy(ImplicitOffsetArgs, ImplicitOffset, Size);
146212
}
147213

148-
void clearLocalSize() {
149-
std::fill(std::begin(OffsetPerIndex), std::end(OffsetPerIndex), 0);
150-
}
151-
152214
const args_index_t &getIndices() const noexcept { return Indices; }
153215

154216
uint32_t getLocalSize() const {
155-
return std::accumulate(std::begin(OffsetPerIndex),
156-
std::end(OffsetPerIndex), 0);
217+
return std::accumulate(std::begin(AlignedLocalMemSize),
218+
std::end(AlignedLocalMemSize), 0);
157219
}
158220
} Args;
159221

@@ -240,7 +302,5 @@ struct ur_kernel_handle_t_ {
240302

241303
uint32_t getLocalSize() const noexcept { return Args.getLocalSize(); }
242304

243-
void clearLocalSize() { Args.clearLocalSize(); }
244-
245305
size_t getRegsPerThread() const noexcept { return RegsPerThread; };
246306
};
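
Reviewer's sketch of the behaviour the new ``addLocalArg`` is aiming for: re-setting a local argument at an index that was already set recomputes the padding and offset of every local argument at a later index. This is a simplified stand-in, not the adapter code: ``LocalArgModel`` and its members are hypothetical, it tracks only local-memory sizes and offsets (no ``Storage`` buffer or implicit offset argument), and it does not model zero-sized arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical, simplified model of the local-memory bookkeeping.
class LocalArgModel {
public:
  // Sets or re-sets the local argument at Index, then refreshes the padding
  // and offset of every local argument at a later index.
  void setLocalArg(size_t Index, size_t Size) {
    assert(Size > 0 && "zero-sized local arguments are not modelled here");
    if (Index >= Original.size()) {
      Original.resize(Index + 1, 0);
      Aligned.resize(Index + 1, 0);
      Offsets.resize(Index + 1, 0);
    }
    Original[Index] = Size;

    constexpr size_t MaxAlignment = sizeof(double) * 16;
    for (size_t I = Index; I < Original.size(); ++I) {
      if (Original[I] == 0)
        continue; // a zero entry means "not a local memory argument"
      // Start of this argument before padding: sum of earlier aligned sizes.
      const size_t Start = std::accumulate(
          Aligned.begin(),
          std::next(Aligned.begin(), static_cast<std::ptrdiff_t>(I)),
          size_t{0});
      const size_t Alignment = std::min(MaxAlignment, Original[I]);
      const size_t Pad = Start % Alignment;
      Offsets[I] = Pad == 0 ? Start : Start + (Alignment - Pad);
      Aligned[I] = Original[I] + (Offsets[I] - Start);
    }
  }

  size_t offsetOf(size_t Index) const { return Offsets[Index]; }

  // Total bytes for the single shared allocation.
  size_t totalBytes() const {
    return std::accumulate(Aligned.begin(), Aligned.end(), size_t{0});
  }

private:
  std::vector<size_t> Original; // user-passed sizes, zero if not local
  std::vector<size_t> Aligned;  // sizes including leading padding
  std::vector<size_t> Offsets;  // aligned offsets into the allocation
};
```

Setting arguments 0, 1 and 2 to 4, 16 and 8 bytes gives offsets 0, 16 and 32 with a 40-byte total. Re-setting argument 0 to 32 bytes then moves arguments 1 and 2 to offsets 32 and 48 and grows the total to 56 bytes, which is the recalculation the hunk adds in place of the removed ``clearLocalSize`` calls.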

source/adapters/hip/command_buffer.cpp

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -396,9 +396,6 @@ UR_APIEXPORT ur_result_t UR_APICALL urCommandBufferAppendKernelLaunchExp(
396396
DepsList.data(), DepsList.size(),
397397
&NodeParams));
398398

399-
if (LocalSize != 0)
400-
hKernel->clearLocalSize();
401-
402399
// Get sync point and register the node with it.
403400
auto SyncPoint = hCommandBuffer->addSyncPoint(GraphNode);
404401
if (pSyncPoint) {

source/adapters/hip/enqueue.cpp

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -324,8 +324,6 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueKernelLaunch(
324324
ThreadsPerBlock[0], ThreadsPerBlock[1], ThreadsPerBlock[2],
325325
hKernel->getLocalSize(), HIPStream, ArgIndices.data(), nullptr));
326326

327-
hKernel->clearLocalSize();
328-
329327
if (phEvent) {
330328
UR_CHECK_ERROR(RetImplEvent->record());
331329
*phEvent = RetImplEvent.release();
