[SYCL][CUDA][HIP] warp misaligned address on CUDA and results mismatch on HIP #5007
Hello @zjin-lcf, I've looked at this on CUDA and there is a bug with the way local kernel arguments are laid out in memory, which causes the misaligned address error. I'm investigating this for a proper fix, however as a workaround you can simply re-order the kernel arguments, for example as follows:
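For illustration, here is a minimal sketch of the re-ordering idea; the accessor names (`lsums`, `lpos`) and sizes are hypothetical, not the exact ones from main.cpp:

```cpp
#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  queue q;
  constexpr size_t wg = 256;
  q.submit([&](handler &cgh) {
    // Workaround: declare (and therefore capture) the double4 local buffer
    // before the int one, so the buffer with the strictest alignment
    // requirement is laid out first in the dynamic shared memory chunk.
    accessor<double4, 1, access::mode::read_write, access::target::local>
        lsums{range<1>{wg}, cgh};
    accessor<int, 1, access::mode::read_write, access::target::local>
        lpos{range<1>{1}, cgh};
    cgh.parallel_for<class reorder_demo>(
        nd_range<1>{range<1>{wg}, range<1>{wg}}, [=](nd_item<1> it) {
          lsums[it.get_local_id(0)] = double4{0.0};
          if (it.get_local_id(0) == 0)
            lpos[0] = 0;
        });
  });
  q.wait();
  return 0;
}
```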
The local memory arguments seem to be placed in a shared memory buffer one after the other, so for example with an `int` and a `double4` you would have the `double4` placed right after the `int` at offset 4, even though it needs to be aligned on a 32 byte boundary. I'll update this ticket when I have a proper fix for this.
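As a concrete illustration of that naive layout (just a sketch of the resulting offsets, not the actual plugin code):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // An int argument followed by a double4 argument, laid out back to back.
  const std::size_t sizes[] = {sizeof(int), 4 * sizeof(double)};
  std::size_t offset = 0;
  for (std::size_t s : sizes) {
    std::printf("argument at offset %zu, size %zu\n", offset, s);
    offset += s;
  }
  // Prints offsets 0 and 4: the double4 buffer starts at offset 4 even
  // though it needs to be aligned on a 32 byte boundary.
  return 0;
}
```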
The function is called in the following way for the three pointers:
Does the compiler have its own way to determine the order? Thanks
The actual "kernel" is the lambda of the parallel for, here there's no issues with the function. So here the kernel arguments are whatever is captured by the I'm not entirely sure how In most cases the kernel argument order shouldn't matter and we definitely need to fix this so users don't have to worry about kernel argument order. |
Thank you for explaining the kernel argument order.
This patch comes from an attempt to fix intel#5007. The issue there is that for local kernel arguments the CUDA plugin uses CUDA dynamic shared memory, which gives us a single chunk of shared memory to work with. The CUDA plugin then lays out all the local kernel arguments consecutively in this single chunk of memory, and this can cause issues because simply laying the arguments out one after the other can result in misaligned arguments. In intel#5007 for example there is an `int` argument followed by a `double4` argument, so the `double4` argument ends up with the wrong alignment, being aligned only on a 4 byte boundary following the `int`. It is possible to adjust this and fix up the alignment when laying out the local kernel arguments in the CUDA plugin, however before this patch the only information available in the plugin was the total size of local memory required for the given arguments, which doesn't tell us anything about the required alignment. So this patch propagates the size of the elements inside of the local accessor all the way down to the PI plugin through `piKernelSetArg`, and tweaks the local argument layout in the CUDA plugin to use the element size as the alignment for local kernel arguments.
The issue there is that for local kernel arguments the CUDA plugin uses CUDA dynamic shared memory, which gives us a single chunk of shared memory to work with. The CUDA plugin then lays out all the local kernel arguments consecutively in this single chunk of memory, and this can cause issues because simply laying the arguments out one after the other can result in misaligned arguments. So this patch changes the argument layout to align them to the maximum necessary alignment, which is the size of the largest vector type. Additionally, if there is a local buffer smaller than this maximum alignment, the size of that buffer is simply used as the alignment. This fixes the issue in #5007. See also the discussion on #5104 for an alternative solution that may be more efficient but would require a more intrusive, ABI-changing patch.
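A minimal sketch of that layout strategy, assuming a maximum alignment constant standing in for the size of the largest vector type (illustrative only, not the actual plugin code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed maximum alignment, e.g. sizeof(double16); the exact value used by
// the plugin may differ.
constexpr std::size_t kMaxVecAlign = 128;

std::size_t alignUp(std::size_t offset, std::size_t alignment) {
  return ((offset + alignment - 1) / alignment) * alignment;
}

// Lays out the local kernel arguments in the single dynamic shared memory
// chunk: each buffer is aligned to the maximum vector alignment, or to its
// own size if the buffer is smaller than that.
std::vector<std::size_t> layoutLocalArgs(const std::vector<std::size_t> &bufferSizes,
                                         std::size_t &totalSize) {
  std::vector<std::size_t> offsets;
  std::size_t offset = 0;
  for (std::size_t size : bufferSizes) {
    const std::size_t alignment =
        std::min(std::max<std::size_t>(size, 1), kMaxVecAlign);
    offset = alignUp(offset, alignment);
    offsets.push_back(offset);
    offset += size;
  }
  totalSize = offset;
  return offsets;
}
```

With this, the `double4` buffer from this issue no longer lands at offset 4 right after the `int`; it is bumped to the next offset that satisfies its alignment.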
We believe the issue on CUDA to be addressed by #5113. There is a further pull request open for generalizing this across both the NVPTX and AMDGCN LLVM backends, resolving this for HIP as well: #5149. Edit: I originally closed this issue and then re-opened it, as it's not yet addressed for the HIP backend.
Running the SYCL program on an NVIDIA GPU produces the right result after `ulimit -s unlimited`.
I'm not having any issues with this example on my end:
Build command:
Run command:
Could you run the following? Thanks
Oh, that's interesting, I am getting a segfault with that:
I remember the segfault occurs on an NVIDIA P100 GPU without `ulimit -s unlimited`.
So looking a bit closer at this, I think there might be a bug and/or race condition in the SYCL runtime with regards to command deletion which triggers the segfault. I'll have to dig into it further, but it seems like the environment variable that disables cleanup of finished commands works around it.
With this it will likely use more memory but at least it shouldn't segfault while trying to access a deleted command.
@sergey-semenov, FYI.
Thank you for pointing out the issue.
@sergey-semenov, can you please confirm that this is a general runtime issue rather than CUDA-specific?
It certainly appears so. There's one known segfault problem related to post-enqueue cleanup and it has to do with execution graph leaf handling. @npmiller Could you please check if this fix (#5417) takes care of this segfault too? @KseniyaTikhomirova FYI
@sergey-semenov I've just tried it, and as far as I can tell #5417 does seem to fix the segfault in this test. I've run the sample successfully several times with 100 iterations and a couple of times with 500 iterations with no issues. @zjin-lcf I suspect this may also fix it for CUDA even without the workaround.
I would like to run the program after #5417 is merged. I hope that is fine. Thank you for your updates and tests.
Yes, I ran the example. Thanks.
Running the example https://github.com/zjin-lcf/HeCBench/blob/master/aop-sycl/main.cpp built with CUDA support on a P100 GPU shows a warp misaligned address error, which may be caused by the shared local memory `double4 lsums` in the kernel `prepare_svd_kernel<256, PayoffPut>`. The SYCL program runs successfully on an Intel GPU.
Did you encounter a warp misaligned address error when porting a CUDA program?
Running the example built with HIP support shows that the result does not match the HIP/CUDA version:
To reproduce