[SYCL] Add element size argument to piKernelSetArg #5104

npmiller · 2021-12-08T14:31:14Z

This patch comes from an attempt to fix #5007.

The issue there is that, for local kernel arguments, the CUDA plugin
uses CUDA dynamic shared memory, which gives us a single chunk of shared
memory to work with.

The CUDA plugin then lays out all the local kernel arguments
consecutively in this single chunk of memory.

This can cause issues because simply laying the arguments out one
after the other can result in misaligned arguments. In #5007, for
example, there is an int argument followed by a double4 argument, so the
double4 argument ends up with the wrong alignment: it is only aligned
on a 4-byte boundary, following on from the int.
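
To make the misalignment concrete, here is a minimal sketch of back-to-back packing. It is not the CUDA plugin's code: `packed_offsets` and `is_aligned` are hypothetical helpers, and the sizes used in the note below (4 bytes for the int buffer, 32 bytes for the double4 buffer) assume each local buffer holds a single element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the plugin's code): place buffers back to back
// in one chunk and return the byte offset at which each buffer starts.
std::vector<std::size_t> packed_offsets(const std::vector<std::size_t> &sizes) {
  std::vector<std::size_t> offsets;
  std::size_t next = 0;
  for (std::size_t size : sizes) {
    offsets.push_back(next);
    next += size;
  }
  return offsets;
}

// True when `offset` is a multiple of `alignment`.
bool is_aligned(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0);
  return offset % alignment == 0;
}
```

With sizes {4, 32}, the second buffer starts at offset 4. That offset is not a multiple of 8, so it is misaligned even for a single double, let alone a double4.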

It is possible to fix up the alignment when laying out the local
kernel arguments in the CUDA plugin. However, before this patch the
only information available in the plugin was the total size of local
memory required for the given arguments, which doesn't tell us anything
about the required alignment.

So this patch propagates the size of the elements inside of the
local accessor all the way down to the PI plugin through
piKernelSetArg, and tweaks the local argument layout in the CUDA
plugin to use the type size as alignment for local kernel arguments.
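
As a rough sketch of the layout rule described above (an illustration only, not the patch itself): each buffer's start offset is rounded up to a multiple of its element size. `LocalArg`, `align_up` and `aligned_offsets` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one local kernel argument.
struct LocalArg {
  std::size_t size;      // total bytes requested for the buffer
  std::size_t elem_size; // size of one element, used here as the alignment
};

// Round `offset` up to the next multiple of `alignment`.
std::size_t align_up(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0);
  return (offset + alignment - 1) / alignment * alignment;
}

// Lay the buffers out in order, aligning each one to its element size.
std::vector<std::size_t> aligned_offsets(const std::vector<LocalArg> &args) {
  std::vector<std::size_t> offsets;
  std::size_t next = 0;
  for (const LocalArg &arg : args) {
    std::size_t start = align_up(next, arg.elem_size);
    offsets.push_back(start);
    next = start + arg.size;
  }
  return offsets;
}
```

For a 4-byte int buffer followed by a 32-byte double4 buffer, this places the second buffer at offset 32 instead of 4, at the cost of 28 bytes of padding.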

I'm not entirely sure this is the best approach, so feedback would be appreciated. The patch may also need refining for the naming and/or position of the extra argument. However, it does fix the issue in #5007.


romanovvlad · 2021-12-09T11:54:37Z

sycl/include/CL/sycl/detail/cg_types.hpp


  cl::sycl::detail::kernel_param_kind_t MType;
  void *MPtr;
  int MSize;
  int MIndex;
+  int MElemSize;


This patch would break the ABI, since this structure crosses library boundaries. Breaking the ABI is not allowed right now.
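
One way to see the concern is with a simplified stand-in for the structure. This is a sketch only: `ParamKind`, `ArgDescBefore` and `ArgDescAfter` are hypothetical names, and the real class in cg_types.hpp may differ. Appending a member keeps the offsets of the existing members but changes the size of the object, so code compiled against the old header and a library built with the new one would disagree whenever the structure is copied by value or stored in an array.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for cl::sycl::detail::kernel_param_kind_t (assumption).
enum class ParamKind { Accessor, StdLayout };

// Simplified model of the structure before the patch.
struct ArgDescBefore {
  ParamKind MType;
  void *MPtr;
  int MSize;
  int MIndex;
};

// The same structure with the new trailing member.
struct ArgDescAfter {
  ParamKind MType;
  void *MPtr;
  int MSize;
  int MIndex;
  int MElemSize;
};

// Existing members keep their offsets when a member is appended.
void check_prefix_unchanged() {
  assert(offsetof(ArgDescBefore, MIndex) == offsetof(ArgDescAfter, MIndex));
}

// True when old and new code would disagree on the object size,
// and therefore on array strides and by-value copies.
bool layouts_disagree() {
  return sizeof(ArgDescBefore) != sizeof(ArgDescAfter);
}
```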

romanovvlad · 2021-12-09T12:01:54Z

> I'm not entirely sure if this is the best approach so feedback on this would be appreciated, this patch may also need to be refined for naming and/or position of the extra argument, however it does fix the issue in #5007

Can we use the strictest required alignment for vector operations as the default alignment? I believe that should be sizeof(double) * 16.
It could be optimized for small values, e.g.: alignment = min(sizeof(double) * 16, arg_size);

npmiller · 2021-12-09T15:18:24Z

> > I'm not entirely sure if this is the best approach so feedback on this would be appreciated, this patch may also need to be refined for naming and/or position of the extra argument, however it does fix the issue in #5007
>
> Can we use the strictest required alignment for vector operations as a default alignment, which I believe it should be sizeof(double) * 16 ? It could be optimized for small values like: alignment = min(sizeof(double) * 16, arg_size);

Yeah, after looking at this a bit more I think you're right: we could use the largest vector size for this. I'll close this PR and open a separate one with a change to that effect.

The issue in intel#5007 is that, for local kernel arguments, the CUDA plugin uses CUDA dynamic shared memory, which gives us a single chunk of shared memory to work with. The CUDA plugin then lays out all the local kernel arguments consecutively in this single chunk of memory. This can cause issues because simply laying the arguments out one after the other can result in misaligned arguments. This patch therefore changes the argument layout to align each argument to the maximum necessary alignment, which is the size of the largest vector type. Additionally, if a local buffer is smaller than this maximum alignment, the size of that buffer is used as its alignment instead. This fixes the issue in intel#5007. See also the discussion on intel#5104 for an alternative solution that may be more efficient but would require a more intrusive, ABI-changing patch.
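
The rule described here can be sketched as follows. This is an illustration under stated assumptions, not the merged code: `kMaxAlignment`, `local_alignment`, `round_up_to` and `layout_local_args` are hypothetical names, the maximum alignment is taken to be sizeof(double) * 16 as suggested in the review, and zero-sized buffers are assumed not to occur.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed maximum alignment: the size of the largest vector type (double16).
constexpr std::size_t kMaxAlignment = sizeof(double) * 16;

// Alignment for one local buffer: the maximum alignment, or the buffer's
// own size when the buffer is smaller than that.
std::size_t local_alignment(std::size_t arg_size) {
  assert(arg_size != 0); // assumption: zero-sized buffers are not laid out
  return std::min(kMaxAlignment, arg_size);
}

// Round `offset` up to the next multiple of `alignment`.
std::size_t round_up_to(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Return the start offset of each buffer inside the single shared chunk.
std::vector<std::size_t> layout_local_args(const std::vector<std::size_t> &sizes) {
  std::vector<std::size_t> offsets;
  std::size_t next = 0;
  for (std::size_t size : sizes) {
    std::size_t start = round_up_to(next, local_alignment(size));
    offsets.push_back(start);
    next = start + size;
  }
  return offsets;
}
```

Unlike the element-size approach, this needs only the total size of each buffer, which the plugin already receives. The trade-off is extra padding: a large buffer of small elements is still aligned to the full maximum.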


npmiller requested review from smaslov-intel and a team as code owners December 8, 2021 14:31

romanovvlad requested changes Dec 9, 2021

npmiller closed this Dec 9, 2021

npmiller mentioned this pull request Dec 9, 2021

[SYCL][CUDA] Fix alignment of local arguments #5113

Merged
