Commit e578228

Document hip extra arg behavior
1 parent 7d126f3 commit e578228

6 files changed: +165 -133 lines changed

scripts/core/CUDA.rst

Lines changed: 15 additions & 12 deletions
@@ -157,25 +157,28 @@ space pointer arguments, which are set by the user with
 ``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
 and make available from the pointer argument.

-The CUDA adapter implements local memory arguments to a kernel as a single
-``__shared__`` memory allocation, with each local address space pointer argument
-to the kernel converted to a byte offset parameter to the single memory
-allocation. Therefore for ``N`` local arguments that need set on a kernel with
-``urKernelSetArgLocal``, the total aligned size is calculated for the single
+The CUDA adapter implements local memory in a kernel as a single ``__shared__``
+memory allocation, and each individual local memory argument is a ``u32`` byte
+offset kernel parameter which is combined inside the kernel with the
+``__shared__`` memory allocation. Therefore for ``N`` local arguments that need
+set on a kernel with ``urKernelSetArgLocal``, the total aligned size across the
+``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
 memory allocation by the CUDA adapter and passed as the ``sharedMemBytes``
 argument to ``cuLaunchKernel`` (or variants like ``cuLaunchCooperativeKernel``
-or ``cudaGraphAddKernelNode``).
+or ``cuGraphAddKernelNode``).

-For each kernel local memory parameter, aligned offsets into the single memory location
-are calculated and passed at runtime via ``kernelParams`` when launching the kernel (or
-adding as a graph node). When a user calls ``urKernelSetArgLocal`` with an
-argument index that has already been set the CUDA adapter recalculates the size of the
-single memory allocation and offsets of any local memory arguments at following indices.
+For each kernel ``u32`` local memory offset parameter, aligned offsets into the
+single memory location are calculated and passed at runtime by the adapter via
+``kernelParams`` when launching the kernel (or adding the kernel as a graph
+node). When a user calls ``urKernelSetArgLocal`` with an argument index that
+has already been set on the kernel, the adapter recalculates the size of the
+``__shared__`` memory allocation and offset for the index, as well as the
+offsets of any local memory arguments at following indices.

 .. warning::

     The CUDA UR adapter implementation of local memory assumes the kernel created
-    has been created by DPC++, instumenting the device code so that local memory
+    has been created by DPC++, instrumenting the device code so that local memory
     arguments are offsets rather than pointers.

 Other Notes
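
The layout described in the CUDA.rst hunk above can be sketched as a small standalone function. This is an illustration only, not the adapter's code: the names ``LocalLayout``, ``alignUp`` and ``layoutLocalArgs`` are hypothetical, and the alignment rule (every argument starts on a multiple of a caller-supplied alignment) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: one u32 byte offset per local argument, plus the
// total size that would be passed as sharedMemBytes at launch.
struct LocalLayout {
  std::vector<std::uint32_t> Offsets;
  std::size_t SharedMemBytes = 0;
};

// Round Value up to the next multiple of Alignment.
inline std::size_t alignUp(std::size_t Value, std::size_t Alignment) {
  assert(Alignment > 0 && "alignment must be non-zero");
  const std::size_t Pad = Value % Alignment;
  return Pad == 0 ? Value : Value + (Alignment - Pad);
}

// Assumed rule, for illustration only: each local argument is placed at the
// first aligned offset after the end of the previous one.
inline LocalLayout layoutLocalArgs(const std::vector<std::size_t> &Sizes,
                                   std::size_t Alignment) {
  LocalLayout Layout;
  std::size_t End = 0; // one past the last byte used so far
  for (std::size_t Size : Sizes) {
    const std::size_t Offset = alignUp(End, Alignment);
    Layout.Offsets.push_back(static_cast<std::uint32_t>(Offset));
    End = Offset + Size;
  }
  Layout.SharedMemBytes = End;
  return Layout;
}
```

Under these assumptions, three local arguments of 10, 4 and 32 bytes with an alignment of 8 would get offsets 0, 16 and 24, and the single allocation would be 56 bytes.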

scripts/core/HIP.rst

Lines changed: 36 additions & 5 deletions
@@ -94,11 +94,42 @@ the user does not wish to use the global offset.
 Local Memory Arguments
 ----------------------

-.. todo::
-    Copy and update CUDA doc
-
-.. todo::
-    Document what extra args needed on HIP arg with local accessors
+In UR local memory is a region of memory shared by all the work-items in
+a work-group. A kernel function signature can include local memory address
+space pointer arguments, which are set by the user with
+``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
+and make available from the pointer argument.
+
+The HIP adapter implements local memory in a kernel as a single ``__shared__``
+memory allocation, and each individual local memory argument is a ``u32`` byte
+offset kernel parameter which is combined inside the kernel with the
+``__shared__`` memory allocation. Therefore for ``N`` local arguments that need
+set on a kernel with ``urKernelSetArgLocal``, the total aligned size across the
+``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
+memory allocation by the HIP adapter and passed as the ``sharedMemBytes``
+argument to ``hipModuleLaunchKernel`` or ``hipGraphAddKernelNode``.
+
+For each kernel ``u32`` local memory offset parameter, aligned offsets into the
+single memory location are calculated and passed at runtime by the adapter via
+``kernelParams`` when launching the kernel (or adding the kernel as a graph
+node). When a user calls ``urKernelSetArgLocal`` with an argument index that
+has already been set on the kernel, the adapter recalculates the size of the
+``__shared__`` memory allocation and offset for the index, as well as the
+offsets of any local memory arguments at following indices.
+
+.. warning::
+
+    The HIP UR adapter implementation of local memory assumes the kernel created
+    has been created by DPC++, instrumenting the device code so that local memory
+    arguments are offsets rather than pointers.
+
+
+HIP kernels that are generated for DPC++ kernels with SYCL local accessors
+contain extra value arguments on top of the local memory argument for the
+local accessor. For each ``urKernelSetArgLocal`` argument, a user needs
+to make 3 calls to ``urKernelSetArgValue`` with each of the next 3 consecutive
+argument indexes. This represents a 3 dimensional offset into the local
+accessor.

 Other Notes
 ===========
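
The argument ordering that the HIP.rst hunk describes for local accessors can be sketched as a call plan. The sketch below does not call the UR API; ``ArgCall`` and ``planLocalAccessorArgs`` are hypothetical names that only record which entry point would be used at which argument index. The sizes and values passed to the real calls are deliberately left out, since the hunk does not state them.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one argument-setting call.
struct ArgCall {
  std::string Api;     // name of the UR entry point that would be called
  std::uint32_t Index; // kernel argument index it would be called with
};

// Plan the calls for one SYCL local accessor whose local memory argument
// sits at FirstIndex: one urKernelSetArgLocal, followed by three
// urKernelSetArgValue calls at the next three consecutive indexes.
inline std::vector<ArgCall> planLocalAccessorArgs(std::uint32_t FirstIndex) {
  std::vector<ArgCall> Calls;
  Calls.push_back({"urKernelSetArgLocal", FirstIndex});
  for (std::uint32_t Dim = 1; Dim <= 3; ++Dim) {
    // One value argument per dimension of the accessor offset.
    Calls.push_back({"urKernelSetArgValue", FirstIndex + Dim});
  }
  return Calls;
}
```

For a local accessor at argument index 2, the plan covers indexes 2 through 5, so under this reading the next unrelated kernel argument would start at index 6.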

source/adapters/cuda/kernel.hpp

Lines changed: 2 additions & 4 deletions
@@ -158,8 +158,7 @@ struct ur_kernel_handle_t_ {

   void addLocalArg(size_t Index, size_t Size) {
     // Get the aligned argument size and offset into local data
-    size_t AlignedLocalSize, AlignedLocalOffset;
-    std::tie(AlignedLocalSize, AlignedLocalOffset) =
+    auto [AlignedLocalSize, AlignedLocalOffset] =
         calcAlignedLocalArgument(Index, Size);

     // Store argument details
@@ -178,8 +177,7 @@ struct ur_kernel_handle_t_ {
       }

       // Recalculate alignment
-      size_t SuccAlignedLocalSize, SuccAlignedLocalOffset;
-      std::tie(SuccAlignedLocalSize, SuccAlignedLocalOffset) =
+      auto [SuccAlignedLocalSize, SuccAlignedLocalOffset] =
           calcAlignedLocalArgument(SuccIndex, OriginalLocalSize);

       // Store new local memory size
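
The kernel.hpp change swaps ``std::tie`` for a C++17 structured binding. A minimal side-by-side sketch of the two forms follows. ``calcExample`` is a hypothetical stand-in for ``calcAlignedLocalArgument``; its ``std::pair`` return type and placeholder body are assumptions, since the diff does not show the real declaration.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical stand-in returning {aligned size, aligned offset}.
// The arithmetic is a placeholder, not the adapter's alignment logic.
inline std::pair<std::size_t, std::size_t> calcExample(std::size_t Offset,
                                                       std::size_t Size) {
  return {Offset + Size, Offset};
}

// Old form: declare both variables first, then assign through std::tie.
inline std::size_t sumWithTie(std::size_t Offset, std::size_t Size) {
  std::size_t AlignedSize, AlignedOffset;
  std::tie(AlignedSize, AlignedOffset) = calcExample(Offset, Size);
  return AlignedSize + AlignedOffset;
}

// New form: a structured binding declares and initializes both names at once.
inline std::size_t sumWithBinding(std::size_t Offset, std::size_t Size) {
  auto [AlignedSize, AlignedOffset] = calcExample(Offset, Size);
  return AlignedSize + AlignedOffset;
}
```

Both functions return the same value for the same inputs. The binding form removes the window in which the two variables are declared but uninitialized. One practical difference: names introduced by a structured binding could not be captured by a lambda before C++20.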

source/adapters/hip/kernel.hpp

Lines changed: 2 additions & 4 deletions
@@ -153,8 +153,7 @@ struct ur_kernel_handle_t_ {

   void addLocalArg(size_t Index, size_t Size) {
     // Get the aligned argument size and offset into local data
-    size_t AlignedLocalSize, AlignedLocalOffset;
-    std::tie(AlignedLocalSize, AlignedLocalOffset) =
+    auto [AlignedLocalSize, AlignedLocalOffset] =
         calcAlignedLocalArgument(Index, Size);

     // Store argument details
@@ -173,8 +172,7 @@ struct ur_kernel_handle_t_ {
       }

       // Recalculate alignment
-      size_t SuccAlignedLocalSize, SuccAlignedLocalOffset;
-      std::tie(SuccAlignedLocalSize, SuccAlignedLocalOffset) =
+      auto [SuccAlignedLocalSize, SuccAlignedLocalOffset] =
           calcAlignedLocalArgument(SuccIndex, OriginalLocalSize);

       // Store new local memory size
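
The second hunk sits in the path that recalculates successor arguments when a local argument is set again. A rough sketch of that behavior, under simplifying assumptions, is shown below. ``LocalArgTable`` is a hypothetical type, slots here count only local arguments (the real adapter works on kernel argument indices), and the fixed alignment is an assumption for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round Value up to the next multiple of Alignment.
inline std::size_t alignUp(std::size_t Value, std::size_t Alignment) {
  assert(Alignment > 0 && "alignment must be non-zero");
  const std::size_t Pad = Value % Alignment;
  return Pad == 0 ? Value : Value + (Alignment - Pad);
}

// Hypothetical bookkeeping for local arguments, in argument order.
struct LocalArgTable {
  std::size_t Alignment = 8;          // assumed fixed alignment
  std::vector<std::size_t> Sizes;     // original (unaligned) sizes
  std::vector<std::uint32_t> Offsets; // byte offset of each argument
  std::size_t SharedMemBytes = 0;     // size of the single allocation

  // Set or re-set the argument in Slot, then recompute its offset and the
  // offsets of every argument that follows it.
  void set(std::size_t Slot, std::size_t Size) {
    if (Slot >= Sizes.size()) {
      Sizes.resize(Slot + 1, 0);
      Offsets.resize(Slot + 1, 0);
    }
    Sizes[Slot] = Size;
    // Arguments before Slot are unaffected; resume from the end of the
    // previous one.
    std::size_t End = Slot == 0 ? 0 : Offsets[Slot - 1] + Sizes[Slot - 1];
    for (std::size_t I = Slot; I < Sizes.size(); ++I) {
      const std::size_t Offset = alignUp(End, Alignment);
      Offsets[I] = static_cast<std::uint32_t>(Offset);
      End = Offset + Sizes[I];
    }
    SharedMemBytes = End;
  }
};
```

With sizes 10, 4 and 32 set in order, the offsets are 0, 16 and 24 and the total is 56 bytes. Re-setting slot 0 to 20 bytes shifts the followers to 24 and 32 and grows the total to 64 bytes.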
