[SYCL][CUDA] Improve function to guess local work size more efficiently. #9787
Conversation
…e number for range dimensions.
I think this looks good. @jchlanda - Would you mind sanity-checking as well?
sycl/plugins/cuda/pi_cuda.cpp
Outdated
// When global_work_size[0] is prime, threadPerBlock[0] will later be computed
// as 1, which is not an efficient configuration. In such a case we use
// global_work_size[0] + 1 to compute threadPerBlock[0].
int x_global_work_size = (isPrime(global_work_size[0]) &&
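The snippet relies on an `isPrime` helper. A minimal sketch of such a trial-division test is shown below; this is an assumption for illustration, and the actual helper in `pi_cuda.cpp` may be written differently.

```cpp
#include <cstddef>

// Hypothetical trial-division primality test, as a sketch of what the
// isPrime helper referenced in the snippet could look like.
static bool isPrime(size_t n) {
  if (n < 2)
    return false; // 0 and 1 are not prime
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0)
      return false; // found a nontrivial divisor
  return true;
}
```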
Nit: I would like this name to be a little more descriptive.
Thanks @steffenlarsen, done.
…ly. (intel#9787)

* The `threadsPerBlock` values computed by `guessLocalWorkSize` are not the most optimal. In particular, the `threadsPerBlock` values for `Y` and `Z` were much lower than possible.
* When the Y/Z values of the range are prime, very poor performance is observed, as shown in the associated [issue](intel#8018).
* This PR computes `threadsPerBlock` for X/Y/Z so as to reduce the corresponding `BlocksPerGrid` values.
* Below is the output of the code from the associated issue without the changes in this PR:

Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)
- 1009,4.61658
- 2003,45.6869
- 3001,67.5192
- 4001,88.1543
- 5003,111.338
- 6007,132.848
- 7001,154.697
- 8009,175.452
- 9001,196.237
- 10007,219.39
- 1000,4.59423
- 2000,4.61525
- 3000,4.61935
- 4000,4.62526
- 5000,4.64623
- 6000,4.78904
- 7000,8.92251
- 8000,8.97263
- 9000,9.06992
- 10000,9.03802

* And below is the output with the PR's updates:

Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)
- 1009,4.58252
- 2003,4.60139
- 3001,3.47269
- 4001,3.62314
- 5003,4.15179
- 6007,7.07976
- 7001,7.49027
- 8009,8.00097
- 9001,9.08756
- 10007,8.0005
- 1000,4.56335
- 2000,4.60376
- 3000,4.76395
- 4000,4.63283
- 5000,4.64732
- 6000,4.63936
- 7000,8.97499
- 8000,8.9941
- 9000,9.01531
- 10000,9.00935
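The idea behind the fix can be sketched as follows. A common way to guess a block size is to take the largest divisor of the global size that fits under the device's thread-per-block limit; when the global size is a prime larger than that limit, its only divisor in range is 1, so the heuristic bumps the size by one before searching (the extra threads are then masked out by a range check in the kernel). This is a simplified, hypothetical sketch, not the exact `pi_cuda.cpp` code; `guessBlockSize` and its parameters are names invented here for illustration.

```cpp
#include <cstddef>

// Hypothetical trial-division primality test (sketch, repeated here so the
// example is self-contained).
static bool isPrime(size_t n) {
  if (n < 2)
    return false;
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0)
      return false;
  return true;
}

// Sketch of the heuristic: largest divisor of the (possibly adjusted)
// global size that does not exceed maxThreads.
static size_t guessBlockSize(size_t globalSize, size_t maxThreads) {
  // A prime global size larger than the limit has no divisor in range
  // other than 1, so search on globalSize + 1 instead.
  size_t effective = (isPrime(globalSize) && globalSize > maxThreads)
                         ? globalSize + 1
                         : globalSize;
  size_t best = 1;
  for (size_t t = 1; t <= maxThreads && t <= effective; ++t)
    if (effective % t == 0)
      best = t;
  return best;
}
```

For example, with a 1024-thread limit, a prime global size of 2003 would yield a block size of 1 without the adjustment; searching on 2004 instead yields 1002, which matches the dramatic improvement for the prime-sized runs in the timings above.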