[SYCL][CUDA] Fix alignment of local arguments #5113
Conversation
The issue here is that for local kernel arguments the CUDA plugin uses CUDA dynamic shared memory, which gives us a single chunk of shared memory to work with. The plugin then lays out all the local kernel arguments consecutively in this single chunk, and simply placing the arguments one after the other can result in misaligned arguments. This patch changes the argument layout to align each argument to the maximum necessary alignment, which is the size of the largest vector type. Additionally, if a local buffer is smaller than this maximum alignment, the size of that buffer is used as its alignment instead. This fixes the issue in intel#5007. See also the discussion on intel#5104 for an alternative solution that may be more efficient but would require a more intrusive, ABI-changing patch.
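A minimal sketch of the layout logic described above (hypothetical helper names, not the actual plugin code; the 16-byte `maxAlign` default stands in for the size of the backend's largest vector type):

```cpp
#include <cstddef>
#include <vector>

struct LocalArg {
  size_t size; // size in bytes of one local kernel argument
};

// Computes an offset for each local argument inside the single dynamic
// shared memory chunk and returns the total chunk size via `totalSize`.
// Arguments smaller than maxAlign are aligned to their own size; padding
// is only inserted when the running offset is actually misaligned.
std::vector<size_t> layoutLocalArgs(const std::vector<LocalArg> &args,
                                    size_t &totalSize,
                                    size_t maxAlign = 16) {
  std::vector<size_t> offsets;
  size_t offset = 0;
  for (const LocalArg &arg : args) {
    size_t align =
        (arg.size != 0 && arg.size < maxAlign) ? arg.size : maxAlign;
    if (size_t rem = offset % align)
      offset += align - rem; // pad up to the required alignment
    offsets.push_back(offset);
    offset += arg.size;
  }
  totalSize = offset;
  return offsets;
}
```

With this layout, a 4-byte int buffer following a 1-byte char buffer lands at offset 4 rather than at the misaligned offset 1.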
Is more shared local memory needed (the total size of the memory block) with aligned offsets?
Good point, I forgot to update the argument size when adding the padding; should be good now.
LGTM, can we have a test for that? I cannot think of how such a test could be written, though.
I'll look into setting up a test for that. It should be fairly straightforward to check the alignment of the local argument addresses; it'll have to go in
This issue was solved in intel/llvm#5113: local kernel arguments have to be aligned to the type size.
@npmiller // Manually capture kernel arguments to ensure an order with the int
I'm not sure I understand what you mean. When I say "simply laid out consecutively" I just mean if we don't add padding between the
With your pull request, users won't need to change their source code or be concerned about alignment. So I thought that a test might be added that does not explicitly specify the order. Is that right?
Oh, I see what you mean now. I'm not sure that's necessary: the problem wasn't with the argument capture, just the argument order, so implicit or explicit capture doesn't really matter here. Specifying the arguments explicitly is simply a clearer way to make sure the test reproduces an argument order that would trigger the bug; it artificially creates the worst argument order that implicit capture could produce.
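For illustration, a hedged sketch of the kind of regression test being discussed (the actual test is intel/llvm-test-suite#608; the names and values below are made up). The explicit capture list puts the one-byte char accessor before the int accessor, so the int buffer would be misaligned if the plugin packed the arguments back to back:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int result = 0;
  {
    sycl::buffer<int, 1> out(&result, 1);
    q.submit([&](sycl::handler &cgh) {
      // A 1-byte local buffer followed by a 4-byte one: without padding,
      // the int buffer would start at offset 1 in shared memory.
      sycl::local_accessor<char, 1> charAcc(sycl::range<1>(1), cgh);
      sycl::local_accessor<int, 1> intAcc(sycl::range<1>(1), cgh);
      sycl::accessor outAcc(out, cgh, sycl::write_only);
      // Manually capture kernel arguments to ensure an order with the int
      // buffer placed after the char buffer.
      cgh.parallel_for(
          sycl::nd_range<1>{sycl::range<1>{1}, sycl::range<1>{1}},
          [charAcc, intAcc, outAcc](sycl::nd_item<1>) {
            charAcc[0] = 1;
            intAcc[0] = charAcc[0] + 41;
            outAcc[0] = intAcc[0];
          });
    });
  }
  return result == 42 ? 0 : 1;
}
```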
Thanks
@bader Thanks
Are we talking about this particular PR? If so, it looks like this PR is blocked by missing approval from code owners. @intel/llvm-reviewers-cuda, please take a look.
Yes. I was referring to the two PRs about local memory alignment/accessor.
This is mean! ;-) |
LGTM! Sorry for the delay in review.
Would it be possible to add a regression test for this?
/verify with intel/llvm-test-suite#608 @steffenlarsen, does intel/llvm-test-suite#608 look good to you?
Completely missed the link. Yes, that test looks good. 😄
/verify with intel/llvm-test-suite#608
It looks like
I've tweaked the alignment code a little so that it only adds padding when necessary. The original patch was a bit inefficient, as it would still add the alignment even if the address was already aligned; now it only pads when needed. This fixes
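The tweak amounts to a conditional align-up, something along these lines (illustrative, not the exact plugin code):

```cpp
#include <cstddef>

// Rounds offset up to the next multiple of align, but only inserts
// padding when the offset is not already aligned.
size_t alignUp(size_t offset, size_t align) {
  size_t rem = offset % align;
  return rem == 0 ? offset : offset + (align - rem);
}
```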