Commit ebb1281

[SYCL][CUDA] Fix alignment of local arguments (#5113)
For local kernel arguments, the CUDA plugin uses CUDA dynamic shared memory, which provides a single chunk of shared memory to work with. The plugin lays out all local kernel arguments consecutively in this chunk, and placing them one after the other can leave arguments misaligned.

This patch changes the layout so that each local argument is aligned to the maximum alignment that can be required, which is the size of the largest vector type. If a local buffer is smaller than this maximum alignment, the buffer's own size is used as its alignment instead.

This fixes the issue in #5007. See also the discussion in #5104 for an alternative solution that may be more efficient but would require a more intrusive, ABI-changing patch.
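To make the misalignment concrete, here is a minimal sketch of back-to-back placement as the commit message describes it. The helper names `packed_offsets` and `is_aligned` are hypothetical and do not exist in the plugin, and the buffer sizes in the note below are chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of pi_cuda.hpp): start offsets obtained by
// placing local arguments back to back in one shared-memory chunk, which is
// the layout the commit message describes as problematic.
std::vector<std::size_t> packed_offsets(const std::vector<std::size_t> &sizes) {
  std::vector<std::size_t> offsets;
  std::size_t next = 0;
  for (std::size_t size : sizes) {
    offsets.push_back(next); // argument starts right after the previous one
    next += size;
  }
  return offsets;
}

// Hypothetical helper: true when `offset` is a multiple of `alignment`.
bool is_aligned(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0); // a zero alignment is meaningless here
  return offset % alignment == 0;
}
```

With a 4-byte buffer followed by a 128-byte buffer (the size of one `double16` element when `sizeof(double)` is 8), `packed_offsets({4, 128})` places the second buffer at offset 4, which is not a multiple of 128.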
1 parent 7b76899 commit ebb1281

1 file changed: +16 -1 lines changed
sycl/plugins/cuda/pi_cuda.hpp

@@ -638,7 +638,22 @@ struct _pi_kernel {
 
   void add_local_arg(size_t index, size_t size) {
     size_t localOffset = this->get_local_size();
-    add_arg(index, sizeof(size_t), (const void *)&(localOffset), size);
+
+    // maximum required alignment is the size of the largest vector type
+    const size_t max_alignment = sizeof(double) * 16;
+
+    // for arguments smaller than the maximum alignment simply align to the
+    // size of the argument
+    const size_t alignment = std::min(max_alignment, size);
+
+    // align the argument
+    size_t alignedLocalOffset = localOffset;
+    if (localOffset % alignment != 0) {
+      alignedLocalOffset += alignment - (localOffset % alignment);
+    }
+
+    add_arg(index, sizeof(size_t), (const void *)&(alignedLocalOffset),
+            size + (alignedLocalOffset - localOffset));
   }
 
   void set_implicit_offset(size_t size, std::uint32_t *implicitOffset) {
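The offset computation in this hunk can be tried outside the plugin. The sketch below lifts it into free functions; `aligned_local_offset` and `aligned_offsets` are hypothetical names, not plugin API. It assumes that `get_local_size()` returns the sum of the sizes passed to `add_arg` for earlier local arguments, so the padding counts toward the used bytes, and that `size` is never zero. The commit does not state either assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free-function version of the computation in add_local_arg.
// `local_offset` is the number of shared-memory bytes already laid out.
std::size_t aligned_local_offset(std::size_t local_offset, std::size_t size) {
  // Assumption: size is nonzero. With size == 0 the alignment below would be
  // 0 and the modulo would be undefined behavior.
  assert(size > 0);
  // Same bound as the patch: the size of the largest vector type.
  const std::size_t max_alignment = sizeof(double) * 16;
  // Buffers smaller than the bound are aligned to their own size.
  const std::size_t alignment = std::min(max_alignment, size);
  const std::size_t remainder = local_offset % alignment;
  return remainder == 0 ? local_offset
                        : local_offset + (alignment - remainder);
}

// Lays out buffers in order and returns each aligned start offset. Padding
// is added to the running total, mirroring the patch's
// `size + (alignedLocalOffset - localOffset)`.
std::vector<std::size_t> aligned_offsets(const std::vector<std::size_t> &sizes) {
  std::vector<std::size_t> offsets;
  std::size_t used = 0;
  for (std::size_t size : sizes) {
    const std::size_t start = aligned_local_offset(used, size);
    offsets.push_back(start);
    used = start + size;
  }
  return offsets;
}
```

For buffer sizes 4, 128, 8 and 16 bytes, and with `sizeof(double)` equal to 8, this gives start offsets 0, 128, 256 and 272. The 128-byte buffer is pushed from offset 4 up to 128, and the 16-byte buffer from 264 up to 272.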
