Poor performance for 2d/3d range kernels involving primes on CUDA PI #8018

Closed
tom91136 opened this issue Jan 16, 2023 · 16 comments · Fixed by #9787
Labels
bug Something isn't working cuda CUDA back-end hip Issues related to execution on HIP backend.

Comments

@tom91136
Contributor

For a non-ndrange parallel_for, the CUDA PI appears to suffer performance issues when the second or third dimension of the range is a prime.
Concretely, the performance of a launch such as

h.parallel_for(sycl::range<2>(1, someLargePrime), [=](auto id){ /* use id[1] */ });

is much slower than the equivalent:

h.parallel_for(sycl::range<2>(someLargePrime, 1), [=](auto id){ /* use id[0] */ });

If we profile the kernels with something like Nsight Compute, we can see the slower kernels being launched with a block size of 1 in all dimensions, as opposed to a sensible value suggested by cuOccupancyMaxPotentialBlockSize (the CUDA PI calls this internally from guessLocalWorkSize as its block-size heuristic).
Note that enabling PI tracing is not helpful in debugging this, as it only records the name of the CUDA API call along with the return value; critical information such as the kernel name, launch bounds, and memory sizes is not reported.

Here's a single file reproducer:

// size.cpp
#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

template <typename T> double run(size_t N) {
  std::vector<T> hostAs(N, T(0.1));
  std::vector<T> hostBs(N, T(-0.5));
  std::vector<T> hostCs(N, T(0.0));
  sycl::queue queue;
  sycl::buffer<T, 1> a(hostAs.data(), sycl::range<1>(N));
  sycl::buffer<T, 1> b(hostBs.data(), sycl::range<1>(N));
  sycl::buffer<T, 1> c(hostCs.data(), sycl::range<1>(N));
  auto start = std::chrono::high_resolution_clock::now();
  queue.submit([&](sycl::handler &h) {
    auto a_acc = a.template get_access<sycl::access::mode::read>(h);
    auto b_acc = b.template get_access<sycl::access::mode::read>(h);
    auto c_acc = c.template get_access<sycl::access::mode::write>(h);
    h.parallel_for<class the_kernel>(sycl::range<2>(1, N), [=](sycl::id<2> id) {
      auto init = a_acc[id[1]];
      for(size_t n = 0; n < 1000; n++) init = sycl::pow(init, b_acc[id[1]] + n);
      c_acc[id[1]] = init;
    });
  });
  queue.wait_and_throw();
  auto end = std::chrono::high_resolution_clock::now();
  auto c_acc = c.get_host_access();
  for (size_t j = 0; j < N; ++j) {
    auto expected = hostAs[j];
    for(size_t n = 0; n < 1000; n++) expected = sycl::pow(expected, hostBs[j] + n);
    auto actual = c_acc[j];
    if (std::abs(actual - expected) > 0.000001) {
      std::cerr << "Bad value at [" << j << "], expecting " << expected << " but was " << actual << std::endl;
    }
  }
  return std::chrono::duration<double, std::milli>(end - start).count();
}
template <typename T> double runN(size_t N, size_t times) {
  for (size_t i = 0; i < times; ++i) run<T>(N); // warm up
  double total{};
  for (size_t i = 0; i < times; ++i) total += run<T>(N);
  return total;
}
int main() {
  std::vector<size_t> primes = {1009, 2003, 3001, 4001, 5003, 6007, 7001, 8009, 9001, 10007};
  std::vector<size_t> nonPrimes{1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};
  using Element = float;
  sycl::queue queue;
  auto device = queue.get_device();
  std::cerr << "Device = " << device.template get_info<sycl::info::device::name>() << std::endl;
  std::cout << "N,elapsed(ms)" << std::endl;
  for (auto x : primes) {
    auto elapsed = runN<Element>(x, 5);
    std::cout << x << "," << elapsed << std::endl;
  }
  for (auto x : nonPrimes) {
    auto elapsed = runN<Element>(x, 5);
    std::cout << x << "," << elapsed << std::endl;
  }
  std::cerr << "Done" << std::endl;
  return EXIT_SUCCESS;
}

Testing it on a V100 GPU gives:

> clang++ -v
clang version 16.0.0 (https://github.com/intel/llvm f865dbdce9581424f5f5599454de5e7b90116966)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/br-wlin/sycl_workspace/llvm/build/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /lustre/projects/bristol/modules-phase3/nvhpc/22.9/Linux_x86_64/22.9/cuda/11.7, version 11.7

> time clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 size.cpp
real    0m6.262s # this is very slow for a simple program like this
user    0m5.755s
sys     0m0.352s
> ./a.out
N,elapsed(ms)
1009,5.39113
2003,4.12835
3001,6.87786
4001,7.86372
5003,8.63055
6007,11.6794
7001,12.5063
8009,14.8147
9001,16.0661
10007,16.5786
1000,5.34469
2000,5.3888
3000,5.42235
4000,5.43256
5000,5.45543
6000,5.50536
7000,5.52733
8000,5.5306
9000,5.55627
10000,5.56491

There's a merged PR that rounds ranges up to the nearest multiple of 32 to avoid bad block sizes.
The PR wraps the kernel body to mask out any extra threads: a typical pattern seen in many CUDA kernels.
Unfortunately, it only applies this optimisation when the first dimension is not divisible by 32.
No checks are performed on the second or third dimension, so ranges like range<2>(1, someLargePrime) won't be considered for wrapping.
At runtime, when a kernel whose range contains a prime is enqueued to the CUDA PI, the runtime tries to find a block size that evenly divides the prime, and therefore ends up with a block size of 1.
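For illustration, here's a simplified sketch of that kind of divisor-based search (not the actual guessLocalWorkSize code); a prime extent larger than the block-size cap has no divisor other than 1 in range, so the whole dimension ends up with one thread per block:

// Hypothetical sketch of a divisor-based local-size heuristic (illustrative only).
#include <cstddef>

static size_t pickBlockSize(size_t globalExtent, size_t maxBlock) {
  // Walk down from the cap and return the largest candidate that evenly
  // divides the global extent, so the grid needs no remainder handling.
  for (size_t candidate = maxBlock; candidate > 1; --candidate)
    if (globalExtent % candidate == 0)
      return candidate;
  return 1; // primes (e.g. 10007) fall through to a degenerate block size of 1
}

// pickBlockSize(10000, 256) == 250, but pickBlockSize(10007, 256) == 1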

There are a few critical points that I think may affect how we approach a solution:

  • The block-size constant of 32 defined in the SYCL headers by the PR seems to be a somewhat platform-dependent value. It should be up to the PI to decide at runtime whether a uniform range can be made from the given grid sizes. Concretely, the generated kernel images should be robust in accepting different block and grid sizes as long as the global work range can be covered.
  • There's a significant difference between the accepted range for CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X and CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y/Z: newer cards (>= Volta, possibly older) report 2147483647 for X but only 65535 for Y and Z (these limits can be queried at runtime; see the sketch after this list). Having the Y/Z dimensions limited to 65K seems arbitrary and quite restrictive for HPC use cases.
  • Supporting up to 3 dimensions is mostly inherited from computer graphics, and if SYCL-next were to add support for arbitrary dimensions (something I've been hearing about for a while now), we'll be forced to linearise the indices and encode them as 1D launches.
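For reference, a minimal sketch of querying those per-dimension grid limits with the CUDA driver API (error checking omitted for brevity):

// Query the per-dimension grid-size limits via the CUDA driver API.
#include <cuda.h>
#include <cstdio>

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  int maxX = 0, maxY = 0, maxZ = 0;
  cuDeviceGetAttribute(&maxX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, dev);
  cuDeviceGetAttribute(&maxY, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y, dev);
  cuDeviceGetAttribute(&maxZ, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z, dev);
  // Recent cards typically report 2147483647 for X and 65535 for Y/Z.
  std::printf("max grid dims: %d x %d x %d\n", maxX, maxY, maxZ);
  return 0;
}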

Personally, I think selecting the optimal launch configuration should be handled entirely in the PI; the compiler just needs to make sure extra threads are masked off.
The associated risk is the (minor) branch overhead that every kernel then has to pay.
We should also consider linearising >1D launches, which simplifies the logic around cuOccupancyMaxPotentialBlockSize and avoids the bad block size issue entirely; a rough sketch of this is shown below.
Again, there may be a cost to doing this that needs further investigation.
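As an illustration of the linearisation idea (a hand-written sketch, not what the compiler generates; the function and variable names are made up), a 2D range can be flattened into a 1D launch padded to a multiple of the block size, with the tail masked off:

// Sketch: flatten a 2D range into a padded 1D nd_range and mask off the tail.
#include <CL/sycl.hpp>

void launch_linearised(sycl::queue &q, sycl::buffer<float, 1> &buf,
                       size_t rangeY, size_t rangeX) {
  const size_t blockSize = 256;                    // any sensible block size
  const size_t total = rangeY * rangeX;            // linearised global extent
  const size_t padded = ((total + blockSize - 1) / blockSize) * blockSize;
  q.submit([&](sycl::handler &h) {
    auto acc = buf.get_access<sycl::access::mode::write>(h);
    h.parallel_for(sycl::nd_range<1>(sycl::range<1>(padded), sycl::range<1>(blockSize)),
                   [=](sycl::nd_item<1> it) {
      const size_t linear = it.get_global_id(0);
      if (linear >= total) return;                 // mask off the padded threads
      const size_t y = linear / rangeX;            // recover the original 2D index
      const size_t x = linear % rangeX;
      acc[y * rangeX + x] = 0.0f;                  // original kernel body goes here
    });
  });
}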

Note that the prime ranges discussed here aren't some contrived test case.
The performance issue was originally discovered while scaling the SYCL implementation of CloverLeaf across multiple MPI ranks.
With large enough sizes, we observed random jumps in the runtime for certain rank counts.

@tomdeakin

@tom91136 tom91136 added the bug Something isn't working label Jan 16, 2023
@steffenlarsen steffenlarsen added the cuda CUDA back-end label Jan 17, 2023
@zjin-lcf
Contributor

What about the HIP PI? Thanks.

@tom91136
Contributor Author

tom91136 commented Jan 19, 2023

In theory this should be reproducible on HIP and even OpenCL/Level Zero; I'll check when the queues are freed up.

@tom91136
Contributor Author

Confirmed reproducible on HIP as well, using an MI100.
Using the binary ROCm PI from Codeplay (why exactly does the download need to be behind a login/registration?), I'm getting the following with the same reproducer:

> clang++ -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908 size.cpp 
> ./a.out
Device = gfx908:sramecc+:xnack-
N,elapsed(ms)
1009,164.369
2003,165.983
3001,169.175
4001,167.292
5003,175.794
6007,178.634
7001,175.956
8009,185.029
9001,188.233
10007,191.181
1000,164.402
2000,164.522
3000,157.396
4000,159.444
5000,164.425
6000,157.482
7000,164.384
8000,164.49
9000,161.246
10000,157.353

As the binary plugin is not compatible with a compiler built from trunk, I've used the oneAPI binary release directly from Intel:

> clang++ -v
Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /lustre/home/br-wlin/intel/oneapi/compiler/2023.0.0/linux/bin-llvm
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /lustre/projects/bristol/modules-phase3/nvhpc/22.9/Linux_x86_64/22.9/cuda/11.7, version 
Found HIP installation: /opt/rocm, version 4.4.21432
> sycl-ls
[ext_oneapi_hip:gpu:0] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
[ext_oneapi_hip:gpu:1] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
[ext_oneapi_hip:gpu:2] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]
[ext_oneapi_hip:gpu:3] AMD HIP BACKEND, gfx908:sramecc+:xnack- 0.0 [HIP 40421.43]

Here's a visualisation of the normalised performance hit for V100 and MI100:

[Chart: prime vs non-prime N, normalised runtime on V100 and MI100]


As an aside, there should be more documentation on the difference between icpx and clang++.
For example, enabling optimisations beyond -O1 gives:

LLVM ERROR: Bitcode output disabled because proprietary optimizations have been performed.

It should have suggested that the user use the generic clang++ driver, which does not have the extra passes.
This applies to both amdhsa and nvptx:

> icpx -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 size.cpp
LLVM ERROR: Bitcode output disabled because proprietary optimizations have been performed.
> icpx -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908 size.cpp 
LLVM ERROR: Bitcode output disabled because proprietary optimizations have been performed.

Note that Codeplay's documentation for AMD and NVIDIA only says to use clang++ without an explanation.

@tom91136
Contributor Author

@steffenlarsen Can we also add the HIP tag?

@steffenlarsen steffenlarsen added the hip Issues related to execution on HIP backend. label Jan 30, 2023
@zjin-lcf
Contributor

@tom91136
Will you file another issue for the PI trace limitation? Thanks.

> Note that enabling PI tracing is not helpful in debugging this, as it only records the name of the CUDA API call along with the return value; critical information such as the kernel name, launch bounds, and memory sizes is not reported.

@zjin-lcf
Contributor

@tom91136
Is it okay to switch from 2D to 1D?

h.parallel_for(sycl::range<2>(1, someLargePrime), [=](auto id){ /* use id[1] */ });

h.parallel_for(sycl::range<1>(someLargePrime), [=](auto id){ /* use id[0] */ });

@tomdeakin

tomdeakin commented Feb 16, 2023

The initial factor is not 1 in the real code, but the reproducer shows how the current algorithm is insufficient for choosing good work-group sizes. The same will happen for a first dimension greater than 1.

We also tried manually linearising the 2D kernels, but this didn't perform as well as we wanted either.

@mmoadeli mmoadeli self-assigned this Jun 6, 2023
steffenlarsen pushed a commit that referenced this issue Jun 13, 2023
…ly. (#9787)

* The `threadsPerBlock` values computed by `guessLocalWorkSize` are not
optimal. In particular, the `threadsPerBlock` values for `Y` and `Z` were
much below the possible values.
* When the Y/Z values of the range are prime, very poor performance is
observed, as shown in the associated [issue](#8018).
* This PR computes `threadsPerBlock` for X/Y/Z to reduce the corresponding
`BlocksPerGrid` values.

* Below is the output of the code in the associated issue without the
changes in this PR.

Device = NVIDIA GeForce GTX 1050 Ti
N,   elapsed(ms)

- 1009,4.61658
- 2003,45.6869
- 3001,67.5192
- 4001,88.1543
- 5003,111.338
- 6007,132.848
- 7001,154.697
- 8009,175.452
- 9001,196.237
- 10007,219.39
- 1000,4.59423
- 2000,4.61525
- 3000,4.61935
- 4000,4.62526
- 5000,4.64623
- 6000,4.78904
- 7000,8.92251
- 8000,8.97263
- 9000,9.06992
- 10000,9.03802

 
* And below is the output with the PR's updates:
Device = NVIDIA GeForce GTX 1050 Ti
N,  elapsed(ms)

- 1009,4.58252
- 2003,4.60139
- 3001,3.47269
- 4001,3.62314
- 5003,4.15179
- 6007,7.07976
- 7001,7.49027
- 8009,8.00097
- 9001,9.08756
- 10007,8.0005
- 1000,4.56335
- 2000,4.60376
- 3000,4.76395
- 4000,4.63283
- 5000,4.64732
- 6000,4.63936
- 7000,8.97499
- 8000,8.9941
- 9000,9.01531
- 10000,9.00935
@tom91136
Contributor Author

#9787 only addresses this for CUDA; this should still be reproducible on HIP (see graph above) and on the CPU backend.

@steffenlarsen steffenlarsen reopened this Jun 13, 2023
@mmoadeli mmoadeli removed the cuda CUDA back-end label Jun 13, 2023
fineg74 pushed a commit to fineg74/llvm that referenced this issue Jun 15, 2023
…ly. (intel#9787)

@mmoadeli mmoadeli added the cuda CUDA back-end label Jun 23, 2023
@mmoadeli
Contributor

The CUDA improvement was partially reverted in #10051.

@tom91136
Contributor Author

tom91136 commented Aug 7, 2023

Any update on this?

@mmoadeli
Contributor

mmoadeli commented Aug 7, 2023

Unfortunately, there is not much to be done on this. Any attempt to improve the performance requires re-shaping the data, which can result in invalid accesses if the code tries to access elements that are not available due to the re-shaping. The advice would be to avoid using prime numbers for any range on the HIP / CUDA backends if performance matters.

@tom91136
Contributor Author

tom91136 commented Oct 4, 2023

I don't understand; can't we just short-circuit any threads (work-items) that will be out of bounds using the standard `if (idx > limit) return;` idiom?
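For instance, a sketch of that guard applied to the 2D case from the reproducer (illustrative only; the rounding constant and names are made up):

// Sketch of the bounds-guard idiom: launch a rounded-up range, mask the extras.
#include <CL/sycl.hpp>

void guarded_launch(sycl::queue &q, sycl::buffer<float, 1> &buf, size_t n) {
  const size_t rounded = ((n + 31) / 32) * 32;   // pad the prime extent to a multiple of 32
  q.submit([&](sycl::handler &h) {
    auto acc = buf.get_access<sycl::access::mode::write>(h);
    h.parallel_for(sycl::range<2>(1, rounded), [=](sycl::id<2> id) {
      if (id[1] >= n) return;                    // short-circuit out-of-bounds work-items
      acc[id[1]] = 42.0f;                        // original kernel body goes here
    });
  });
}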

@hdelan
Contributor

hdelan commented Apr 10, 2024

There have been some PRs to change this behaviour: oneapi-src/unified-runtime#1326, #12690, #12715.

With the new -fsycl-exp-range-rounding flag introduced in #12690, sensible block sizes are now chosen:

$ clang++ -fsycl -fsycl-targets=nvidia_gpu_sm_60 tom.cpp -O3 -fsycl-exp-range-rounding && ./a.out
Device = NVIDIA GeForce GTX 1050 Ti
N,elapsed(ms)
1009,4.50345
2003,7.47965
3001,10.0267
4001,9.25267
5003,10.4168
6007,10.0995
7001,15.7956
8009,12.107
9001,16.4202
10007,20.7509
1000,5.22737
2000,9.76679
3000,10.0152
4000,12.6713
5000,9.94724
6000,17.5764
7000,16.3258
8000,27.0032
9000,16.6231
10000,28.6047
Done

Note that there is still some variation between prime and non-prime ranges, sometimes with the primes being faster. Let me know if you think this is an issue.

See the block sizes here:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)     GridXYZ         BlockXYZ                                                     Name                                                
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  --------------  --------------  ----------------------------------------------------------------------------------------------------
     13.7       72,972,178         10  7,297,217.8  7,624,063.0  5,878,314  8,019,461    832,317.5   400    1    1    25   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
     12.3       65,289,331         20  3,264,466.6  3,229,880.0  3,181,416  3,443,115     88,722.6   563    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
     11.1       59,156,071         20  2,957,803.6  2,935,653.0  2,472,959  3,323,561    251,895.3   219    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
     10.0       53,249,404         10  5,324,940.4  5,325,026.5  5,323,619  5,326,883        918.9   125    1    1    64   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      7.4       39,448,596         20  1,972,429.8  1,875,992.0  1,862,264  2,222,364    131,723.1   313    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      7.3       38,765,799         20  1,938,290.0  1,938,168.5  1,937,400  1,939,833        591.8    47    1    1    64   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      6.5       34,566,994         10  3,456,699.4  3,463,884.0  3,088,678  3,816,176    177,945.1   313    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      6.0       31,804,815         10  3,180,481.5  3,304,441.0  2,905,796  3,350,474    188,285.4   200    1    1    30   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      4.5       23,958,028         10  2,395,802.8  2,424,878.5  2,276,444  2,493,503     84,496.6   334    1    1    24   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      4.5       23,699,915         10  2,369,991.5  2,371,886.0  2,360,318  2,381,662      7,463.8   125    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      3.7       19,448,402         10  1,944,840.2  1,944,952.5  1,942,840  1,946,136      1,143.4    47    2    1   128    8    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      3.6       18,881,263         10  1,888,126.3  1,888,088.0  1,887,928  1,888,376        136.5    80    1    1    25   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      3.3       17,570,877         10  1,757,087.7  1,752,534.0  1,651,541  1,869,880     84,105.7   251    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      2.7       14,259,890         10  1,425,989.0  1,428,066.0  1,417,266  1,443,506      8,551.5    63    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      1.8        9,709,499         10    970,949.9    971,004.5    969,741    971,756        504.1    36    1    1    28   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…
      1.6        8,322,441         10    832,244.1    832,155.0    831,371    833,834        698.0     1   16    1  1024    1    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<double run<float>(unsigned long)::[lambda(…

Let me know if you think this resolves this ticket.

@hdelan
Contributor

hdelan commented Apr 10, 2024

Renaming the kernels, you can see that the block sizes can differ between the prime and non-prime versions:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)     GridXYZ         BlockXYZ                                          Name                                     
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  --------------  --------------  ------------------------------------------------------------------------------
     12.7       67,171,434         10  6,717,143.4  6,568,275.0  6,118,316  7,569,950    571,089.4   400    1    1    25   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)10000>>
     10.1       53,243,267         10  5,324,326.7  5,324,739.0  5,321,475  5,325,572      1,192.7   125    1    1    64   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)8000>> 
      6.6       34,606,932         10  3,460,693.2  3,516,396.5  3,050,758  3,724,559    179,332.6   313    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)10007>>
      6.2       32,790,113         10  3,279,011.3  3,227,353.0  3,176,841  3,409,931     87,620.3   563    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)9001>> 
      6.2       32,698,431         10  3,269,843.1  3,228,281.0  3,182,217  3,416,075     88,750.2   563    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)9000>> 
      6.0       31,828,116         10  3,182,811.6  3,142,552.0  2,879,236  3,776,848    298,002.9   200    1    1    30   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)6000>> 
      5.8       30,490,980         10  3,049,098.0  3,006,038.0  2,880,613  3,317,930    150,154.4   219    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)7000>> 
      5.7       29,795,707         10  2,979,570.7  2,935,621.0  2,892,933  3,303,402    125,754.1   219    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)7001>> 
      4.5       23,798,381         10  2,379,838.1  2,362,878.0  2,292,669  2,464,159     74,585.4   334    1    1    24   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)8009>> 
      4.5       23,764,620         10  2,376,462.0  2,378,814.0  2,360,862  2,381,790      6,469.1   125    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)4000>> 
      3.8       19,840,763         10  1,984,076.3  1,963,480.5  1,865,688  2,225,276    124,230.1   313    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)5003>> 
      3.7       19,452,760         10  1,945,276.0  1,945,225.0  1,943,257  1,947,961      1,271.0    47    2    1   128    8    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)6007>> 
      3.7       19,383,476         10  1,938,347.6  1,938,376.0  1,937,337  1,939,609        689.5    47    1    1    64   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)3000>> 
      3.7       19,380,501         10  1,938,050.1  1,937,864.0  1,937,305  1,939,896        756.1    47    1    1    64   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)3001>> 
      3.7       19,364,082         10  1,936,408.2  1,866,583.5  1,864,311  2,161,787    119,112.3   313    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)5000>> 
      3.6       18,880,944         10  1,888,094.4  1,888,055.5  1,887,960  1,888,504        165.8    80    1    1    25   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)2000>> 
      3.4       18,084,838         10  1,808,483.8  1,835,047.5  1,660,469  1,841,431     61,700.9   251    1    1    16   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)4001>> 
      2.7       14,248,789         10  1,424,878.9  1,427,714.0  1,417,202  1,431,442      6,401.0    63    1    1    32   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)2003>> 
      1.8        9,731,771         10    973,177.1    971,148.5    970,636    991,021      6,287.8    36    1    1    28   16    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)1000>> 
      1.6        8,320,874         10    832,087.4    832,186.5    831,115    833,035        717.3     1   16    1  1024    1    1  Typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<mykernel<(int)1009>> 

Using:

$ SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS=1:64:1 ./a.out
Device = NVIDIA GeForce GTX 1050 Ti
N	elapsed(ms)
1009	4.50126
2003	4.47025
3001	10.029
4001	14.8567
5003	17.295
6007	10.1562
7001	12.5229
8009	14.9415
9001	29.4241
10007	34.2087


1000	4.43263
2000	4.43492
3000	10.0248
4000	14.8536
5000	17.2908
6000	10.1621
7000	12.519
8000	26.9866
9000	29.4247
10000	34.2246
Done

This gets the performance for prime vs non-prime to be a bit closer than before. While I think it is good to make these values as close as possible, I am not sure whether making them match exactly is a reasonable expectation.

Using SYCL_PARALLEL_FOR_RANGE_ROUNDING_PARAMS as done above is, I think, a sufficiently high-level approach for getting a good size globally.

You can find the documentation for range rounding in DPC++ here:

https://github.com/intel/llvm/blob/sycl/sycl/doc/design/ParallelForRangeRounding.md

Ping @breyerml

@hdelan
Contributor

hdelan commented Apr 22, 2024

@tom91136 do you think I can close the issue?

@hdelan
Contributor

hdelan commented Apr 29, 2024

Closing the issue. Feel free to reopen if you like, or just continue the discussion here.

@hdelan hdelan closed this as completed Apr 29, 2024