[SYCL][CUDA] Fix alignment of local arguments #5113
Conversation
The issue here is that for local kernel arguments the CUDA plugin uses CUDA dynamic shared memory, which gives us a single chunk of shared memory to work with. The plugin then lays out all the local kernel arguments consecutively in this single chunk, and simply placing the arguments one after the other can result in misaligned arguments. This patch changes the argument layout to align each argument to the maximum necessary alignment, which is the size of the largest vector type. Additionally, if a local buffer is smaller than this maximum alignment, the size of that buffer is used as its alignment instead. This fixes the issue in intel#5007. See also the discussion on intel#5104 for an alternative solution that may be more efficient but would require a more intrusive, ABI-changing patch.
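A minimal sketch of the layout logic described above (hypothetical helper names, not the actual plugin code; the 16-byte `maxAlign` default stands in for the size of the backend's largest vector type):

```cpp
#include <cstddef>
#include <vector>

struct LocalArg {
  size_t size; // size in bytes of one local kernel argument
};

// Computes an offset for each local argument inside the single dynamic
// shared memory chunk and returns the total chunk size via `totalSize`.
// Arguments smaller than maxAlign are aligned to their own size; padding
// is only inserted when the running offset is actually misaligned.
std::vector<size_t> layoutLocalArgs(const std::vector<LocalArg> &args,
                                    size_t &totalSize,
                                    size_t maxAlign = 16) {
  std::vector<size_t> offsets;
  size_t offset = 0;
  for (const LocalArg &arg : args) {
    size_t align =
        (arg.size != 0 && arg.size < maxAlign) ? arg.size : maxAlign;
    if (size_t rem = offset % align)
      offset += align - rem; // pad up to the required alignment
    offsets.push_back(offset);
    offset += arg.size;
  }
  totalSize = offset;
  return offsets;
}
```

With this layout, a 4-byte int buffer following a 1-byte char buffer lands at offset 4 rather than at the misaligned offset 1.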
Is more shared local memory needed (the total size of the memory block) with aligned offsets?
Good point, I forgot to update the argument size when adding the padding; should be good now.
LGTM, can we have a test for that? I cannot think of how such a test could be written, though.
I'll look into setting up a test for that. It should be fairly straightforward to check the alignment of the local argument addresses; it'll have to go in
This issue was solved in intel/llvm#5113: local kernel arguments have to be aligned to the type size.
@npmiller // Manually capture kernel arguments to ensure an order with the int
I'm not sure I understand what you mean. When I say "simply laid out consecutively" I just mean if we don't add padding between the
With your pull request, users won't need to change their source code or be concerned about alignment. So I thought that a test might be added that does not explicitly specify the order. Is that right?
Oh, I see what you mean now. I'm not sure that's necessary: the problem wasn't with the argument capture, just the argument order, so implicit or explicit capture doesn't really matter here. Specifying the arguments explicitly is simply a clearer way to make sure the test reproduces an argument order that would trigger the bug; it artificially creates the worst argument order that implicit capture could produce.
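For illustration, a hedged sketch of the kind of regression test being discussed (the actual test is intel/llvm-test-suite#608; the names and values below are made up). The explicit capture list puts the one-byte char accessor before the int accessor, so the int buffer would be misaligned if the plugin packed the arguments back to back:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int result = 0;
  {
    sycl::buffer<int, 1> out(&result, 1);
    q.submit([&](sycl::handler &cgh) {
      // A 1-byte local buffer followed by a 4-byte one: without padding,
      // the int buffer would start at offset 1 in shared memory.
      sycl::local_accessor<char, 1> charAcc(sycl::range<1>(1), cgh);
      sycl::local_accessor<int, 1> intAcc(sycl::range<1>(1), cgh);
      sycl::accessor outAcc(out, cgh, sycl::write_only);
      // Manually capture kernel arguments to ensure an order with the int
      // buffer placed after the char buffer.
      cgh.parallel_for(
          sycl::nd_range<1>{sycl::range<1>{1}, sycl::range<1>{1}},
          [charAcc, intAcc, outAcc](sycl::nd_item<1>) {
            charAcc[0] = 1;
            intAcc[0] = charAcc[0] + 41;
            outAcc[0] = intAcc[0];
          });
    });
  }
  return result == 42 ? 0 : 1;
}
```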
Thanks
@bader Thanks
Are we talking about this particular PR? If so, it looks like this PR is blocked by missing approval from code owners. @intel/llvm-reviewers-cuda, please take a look.
Yes. I was referring to the two PRs about local memory alignment/accessor.
This is mean! ;-) |
LGTM! Sorry for the delay in review.
Would it be possible to add a regression test for this?
/verify with intel/llvm-test-suite#608 @steffenlarsen, does intel/llvm-test-suite#608 look good to you?
Completely missed the link. Yes, that test looks good. 😄
/verify with intel/llvm-test-suite#608
It looks like
I've tweaked the alignment code a little so that it only adds padding when necessary. The original patch was a bit inefficient, as it would still add the alignment even if the address was already aligned; now it only pads when needed. This fixes
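The tweak amounts to a conditional align-up, something along these lines (illustrative, not the exact plugin code):

```cpp
#include <cstddef>

// Rounds offset up to the next multiple of align, but only inserts
// padding when the offset is not already aligned.
size_t alignUp(size_t offset, size_t align) {
  size_t rem = offset % align;
  return rem == 0 ? offset : offset + (align - rem);
}
```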