Poor performance for 2d/3d range kernels involving primes on CUDA PI #8018
Comments
What about the HIP PI? Thanks. |
In theory this should be reproducible on HIP and even CL/L0, will check when the queues are freed up. |
Confirmed reproducible on HIP as well using MI100:

```
> clang++ -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908 size.cpp
> ./a.out
Device = gfx908:sramecc+:xnack-
N,elapsed(ms)
1009,164.369
2003,165.983
3001,169.175
4001,167.292
5003,175.794
6007,178.634
7001,175.956
8009,185.029
9001,188.233
10007,191.181
1000,164.402
2000,164.522
3000,157.396
4000,159.444
5000,164.425
6000,157.482
7000,164.384
8000,164.49
9000,161.246
10000,157.353
```

As the binary plugin is not compatible with something built from the trunk, I've used the oneAPI binary release directly from Intel:

```
> clang++ -v
Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /lustre/home/br-wlin/intel/oneapi/compiler/2023.0.0/linux/bin-llvm
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /lustre/projects/bristol/modules-phase3/nvhpc/22.9/Linux_x86_64/22.9/cuda/11.7, version
Found HIP installation: /opt/rocm, version 4.4.21432
> sycl-ls
[ext_oneapi_hip:gpu:0] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
[ext_oneapi_hip:gpu:1] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
[ext_oneapi_hip:gpu:2] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
[ext_oneapi_hip:gpu:3] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
```

Here's a visualisation of the normalised performance hit for V100 and MI100:

As an aside, there should be more documentation on the difference between `clang++` and `icpx`.
It should have suggested the user to use the generic driver:

```
> icpx -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 size.cpp
LLVM ERROR: Bitcode output disabled because proprietary optimizations have been performed.
> icpx -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908 size.cpp
LLVM ERROR: Bitcode output disabled because proprietary optimizations have been performed.
```

Note that Codeplay's documentation for AMD and NVIDIA only says to use `icpx`. |
@steffenlarsen Can we also add the HIP tag? |
@tom91136 |
@tom91136

```cpp
h.parallel_for(sycl::range<2>(1, someLargePrime), [=](auto id){ /* use id[1] */ });
h.parallel_for(sycl::range<1>(someLargePrime), [=](auto id){ /* use id[0] */ });
```
|
The initial factor is not 1 in the real code, but shows how the current algorithm is insufficient for choosing good work-group sizes. The same will happen for a first dimension greater than 1. We also tried manually linearising the 2D kernels, but this didn't perform as well as we wanted either. |
…ly. (#9787)

* The `threadsPerBlock` values computed by `guessLocalWorkSize` are not the most optimal values. In particular, the `threadsPerBlock` values for `Y` and `Z` were much below the possible values.
* When the Y/Z values of a range are prime, very poor performance is observed, as shown in the associated [issue](#8018).
* This PR computes `threadsPerBlock` for X/Y/Z so as to reduce the corresponding `BlocksPerGrid` values.
* Below is the output of the code in the associated issue without the changes in this PR:

  ```
  Device = NVIDIA GeForce GTX 1050 Ti
  N, elapsed(ms)
  1009,4.61658
  2003,45.6869
  3001,67.5192
  4001,88.1543
  5003,111.338
  6007,132.848
  7001,154.697
  8009,175.452
  9001,196.237
  10007,219.39
  1000,4.59423
  2000,4.61525
  3000,4.61935
  4000,4.62526
  5000,4.64623
  6000,4.78904
  7000,8.92251
  8000,8.97263
  9000,9.06992
  10000,9.03802
  ```

* And below is the output with the PR's updates:

  ```
  Device = NVIDIA GeForce GTX 1050 Ti
  N, elapsed(ms)
  1009,4.58252
  2003,4.60139
  3001,3.47269
  4001,3.62314
  5003,4.15179
  6007,7.07976
  7001,7.49027
  8009,8.00097
  9001,9.08756
  10007,8.0005
  1000,4.56335
  2000,4.60376
  3000,4.76395
  4000,4.63283
  5000,4.64732
  6000,4.63936
  7000,8.97499
  8000,8.9941
  9000,9.01531
  10000,9.00935
  ```
#9787 only addresses this for CUDA; it should still be reproducible on HIP (see the graph above) and the CPU backend. |
The CUDA improvement was partially reverted in #10051. |
Any update on this? |
Unfortunately, there is not much to be done on this. Any attempt to improve the performance requires re-shaping the data, which can result in invalid accesses if the code tries to access elements that are not available due to the re-shaping. The advice should be to avoid using prime numbers for any range on the HIP/CUDA backends if performance matters. |
I don't understand: can't we just short-circuit any threads (work-items) that would be out of bounds using the standard bounds check? |
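For reference, the kind of short-circuit being suggested is the usual CUDA-style guard: launch a range rounded up to something divisible by a good block size, and mask off the padding work-items inside the kernel. A minimal sketch (my own illustration, not the actual DPC++ range-rounding code; `launch_masked`, `realRange`, and the body are placeholders):

```cpp
#include <sycl/sycl.hpp>

// Sketch: round the second dimension up to a multiple of 32 so the runtime can
// pick a sensible block size, then mask off the extra work-items in the body.
void launch_masked(sycl::queue &q, sycl::range<2> realRange) {
  auto roundUp = [](size_t x, size_t m) { return ((x + m - 1) / m) * m; };
  sycl::range<2> rounded{realRange[0], roundUp(realRange[1], 32)};
  q.parallel_for(rounded, [=](sycl::item<2> it) {
    if (it.get_id(0) >= realRange[0] || it.get_id(1) >= realRange[1])
      return; // padding work-item introduced by the rounding: do nothing
    /* ... original kernel body using it.get_id(0) and it.get_id(1) ... */
  });
}
```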
There have been some PRs to change this behaviour: oneapi-src/unified-runtime#1326, #12690, #12715. With the new changes: |
Note that there is still some variation between prime and non-prime ranges, sometimes with the primes being faster. Let me know if you think this is an issue. See the block sizes here: |
Let me know if you think this resolves this ticket. |
Renaming the kernels, you can see that the block sizes can differ for the prime and non-prime versions: |
Using:
This gets the performance for prime vs non-prime to be a bit closer than before. While I think it is good to make these values as close as possible, I am not sure that making them match exactly is a reasonable expectation. You can find the documentation for range rounding in DPC++ here: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/ParallelForRangeRounding.md Ping @breyerml |
@tom91136 do you think I can close the issue? |
Closing the issue. Feel free to reopen if you like, or just continue discussion here. |
The CUDA PI appears to suffer performance issues with ranges that are prime on the second or third dimension for a non-ndrange `parallel_for`. Concretely, the performance of a launch such as `h.parallel_for(sycl::range<2>(1, someLargePrime), [=](auto id){ /* use id[1] */ });` is much slower than the equivalent `h.parallel_for(sycl::range<1>(someLargePrime), [=](auto id){ /* use id[0] */ });`.
If we profile the kernels using something like Nsight Compute, we can see the slower kernels being launched with a block size of 1 in all dimensions, as opposed to a sensible number suggested by `cuOccupancyMaxPotentialBlockSize` (the CUDA PI calls this internally from `guessLocalWorkSize` as a block-size heuristic). Note that enabling PI TRACE is not helpful in debugging this, as it only records the name of the CUDA API call along with the return value; critical information such as the kernel name, launch bounds, and memory sizes is not reported.
Here's a single file reproducer:
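(The attached reproducer source is not preserved in this copy of the issue. As a rough stand-in, a sketch with the same shape, timing a `range<2>(1, N)` launch for prime and round values of N and printing `N,elapsed(ms)`, could look like the following; the kernel body and repetition count are arbitrary choices of mine, not the original code.)

```cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  std::printf("Device = %s\n",
              q.get_device().get_info<sycl::info::device::name>().c_str());

  // Prime sizes first, then the nearby round sizes, as in the output above.
  std::vector<size_t> sizes = {1009, 2003, 3001, 4001, 5003,
                               6007, 7001, 8009, 9001, 10007,
                               1000, 2000, 3000, 4000, 5000,
                               6000, 7000, 8000, 9000, 10000};

  std::printf("N,elapsed(ms)\n");
  for (size_t n : sizes) {
    sycl::buffer<float, 1> buf{sycl::range<1>(n)};
    auto start = std::chrono::steady_clock::now();
    // Repeat the launch so the timing is dominated by kernel execution.
    for (int rep = 0; rep < 1000; ++rep) {
      q.submit([&](sycl::handler &h) {
        sycl::accessor acc{buf, h, sycl::write_only};
        // The second dimension is the interesting (possibly prime) one.
        h.parallel_for(sycl::range<2>(1, n), [=](sycl::id<2> id) {
          acc[id[1]] = static_cast<float>(id[1]) * 2.0f;
        });
      });
    }
    q.wait();
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> ms = end - start;
    std::printf("%zu,%g\n", n, ms.count());
  }
  return 0;
}
```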
Testing it on a V100 GPU gives:
There's a merged PR that rounds ranges up to the closest multiple of 32 to avoid bad block sizes. The PR wraps the kernel body to mask out any extra threads: a typical pattern seen in many CUDA kernels. Unfortunately, it only applies this optimisation after checking whether the first dimension is not divisible by 32. No checks are performed on the second and third dimensions, so ranges like `range<2>(1, someLargePrime)` won't be considered for wrapping. At runtime, when a kernel with a prime in the range gets enqueued to the CUDA PI, the runtime tries to find a block size that can evenly divide the prime number, thus ending up with a block size of 1.
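To see why a prime extent degenerates to a block size of 1, here is a simplified sketch of the kind of divisor search described above (an illustration only, not the actual `guessLocalWorkSize` implementation):

```cpp
#include <cstddef>

// Simplified sketch: pick the largest block size <= maxBlock that divides the
// global extent exactly, so no out-of-bounds work-items are ever launched.
// For a prime extent larger than maxBlock the only such divisor is 1.
static size_t pickBlockSize(size_t globalExtent, size_t maxBlock) {
  for (size_t candidate = maxBlock; candidate > 1; --candidate)
    if (globalExtent % candidate == 0)
      return candidate;
  return 1;
}
```

With `maxBlock = 1024`, this returns 1000 for a global extent of 10000 but 1 for 10007, matching the block size of 1 seen in the profiler.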
There are a few critical points that I think may affect how we approach a solution:

* `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X` versus `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y/Z`: newer cards (>= Volta, possibly older) now report 2147483647 for X but only 65535 for Y and Z. Having the Y/Z dimensions limited to 65K seems arbitrary and quite restrictive for HPC use cases.
* Personally, I think selecting the optimal launch configuration should be handled entirely in the PI; the compiler just needs to make sure extra threads are masked off. The associated risk of doing this is the potential (minor) branch overhead that every kernel now has to pay.
* We should also consider linearising >1D launches, which simplifies the logic around `cuOccupancyMaxPotentialBlockSize` and also avoids the bad block size issue entirely (see the sketch after this list). Again, there may be a cost to doing this that needs further investigation.
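As an illustration of the linearising idea (my own sketch, under the assumption that extra work-items can be masked off; `parallel_for_linearised` is a made-up helper, not an existing API or the actual PI/compiler code):

```cpp
#include <sycl/sycl.hpp>

// Sketch: launch a 2D logical range as a 1D kernel so the block-size choice
// only ever sees a single dimension; recover the 2D index with div/mod.
template <typename F>
void parallel_for_linearised(sycl::queue &q, sycl::range<2> realRange, F body) {
  const size_t total = realRange[0] * realRange[1];
  const size_t cols = realRange[1];
  // Round the flat size up to a multiple of 32 so a good block size exists
  // even when `total` is prime; the guard below masks the excess.
  const size_t rounded = ((total + 31) / 32) * 32;
  q.parallel_for(sycl::range<1>(rounded), [=](sycl::id<1> flat) {
    const size_t i = flat[0];
    if (i >= total)
      return;                               // mask off the padding work-items
    body(sycl::id<2>{i / cols, i % cols});  // de-linearise the index
  });
}
```

Called as `parallel_for_linearised(q, {1, someLargePrime}, [=](sycl::id<2> id){ /* use id[1] */ });`, the block-size heuristic only ever sees a single, rounded-up dimension.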
Note that the prime ranges discussed here aren't some contrived test case.
The performance issue was originally discovered while scaling the SYCL implementation of CloverLeaf across multiple MPI ranks.
With large enough sizes, we observed random jumps in the runtime for certain rank counts.
@tomdeakin