
[SYCL][CUDA] Improve function to guess local work size more efficiently. #9787


Merged
14 commits merged into intel:sycl from guess-local-work-space on Jun 13, 2023

Conversation

mmoadeli (Contributor) commented Jun 8, 2023

  • The threadsPerBlock values computed by guessLocalWorkSize are not optimal. In particular, the threadsPerBlock values for Y and Z were well below the possible values.

  • When the Y/Z values of the range are prime, performance is very poor, as shown in the associated issue (intel#8018).

  • This PR computes threadsPerBlock for X/Y/Z so as to reduce the corresponding BlocksPerGrid values (a sketch of the idea follows the measurements below).

  • Below is the output of the code from the associated issue without the changes in this PR.

Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)

  • 1009,4.61658
  • 2003,45.6869
  • 3001,67.5192
  • 4001,88.1543
  • 5003,111.338
  • 6007,132.848
  • 7001,154.697
  • 8009,175.452
  • 9001,196.237
  • 10007,219.39
  • 1000,4.59423
  • 2000,4.61525
  • 3000,4.61935
  • 4000,4.62526
  • 5000,4.64623
  • 6000,4.78904
  • 7000,8.92251
  • 8000,8.97263
  • 9000,9.06992
  • 10000,9.03802
  • And below is the output with this PR's changes:
    Device = NVIDIA GeForce GTX 1050 Ti
    N, elapsed(ms)
  • 1009,4.58252
  • 2003,4.60139
  • 3001,3.47269
  • 4001,3.62314
  • 5003,4.15179
  • 6007,7.07976
  • 7001,7.49027
  • 8009,8.00097
  • 9001,9.08756
  • 10007,8.0005
  • 1000,4.56335
  • 2000,4.60376
  • 3000,4.76395
  • 4000,4.63283
  • 5000,4.64732
  • 6000,4.63936
  • 7000,8.97499
  • 8000,8.9941
  • 9000,9.01531
  • 10000,9.00935
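
As referenced above, here is a minimal C++ sketch of the kind of heuristic this PR describes: for each dimension, pick the largest block size that divides the global size (within per-dimension and total thread limits), and when a global size is prime and too large to fit in a single block, search the divisors of N + 1 instead so the block size does not collapse to 1. The names (guessLocalWorkSizeSketch, MaxBlockDim, MaxThreadsPerBlock, largestDivisorUpTo) and the fixed limits are assumptions made for illustration only; the actual plugin queries the device limits and applies further refinements.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative, assumed limits; the real plugin queries the device for the
// maximum block dimensions and the maximum threads per block.
constexpr size_t MaxBlockDim[3] = {1024, 1024, 64};
constexpr size_t MaxThreadsPerBlock = 1024;

// Trial-division primality test; adequate for typical launch sizes.
static bool isPrime(size_t n) {
  if (n < 2)
    return false;
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0)
      return false;
  return true;
}

// Largest divisor of `size` that does not exceed `cap` (falls back to 1).
static size_t largestDivisorUpTo(size_t size, size_t cap) {
  for (size_t d = std::min(cap, size); d > 1; --d)
    if (size % d == 0)
      return d;
  return 1;
}

// Sketch of the heuristic: choose threadsPerBlock so that each dimension is
// covered by as few blocks as possible. A prime global size larger than the
// block limit has no useful divisor, so the divisors of size + 1 are searched
// instead (one padded work-item), avoiding a degenerate block size of 1.
void guessLocalWorkSizeSketch(const size_t globalWorkSize[3],
                              size_t threadsPerBlock[3]) {
  size_t budget = MaxThreadsPerBlock; // product of all three dims must fit
  for (int dim = 0; dim < 3; ++dim) {
    size_t size = globalWorkSize[dim];
    const size_t cap = std::min(MaxBlockDim[dim], budget);
    if (size > cap && isPrime(size))
      ++size; // prime and too large for one block: search divisors of N + 1
    threadsPerBlock[dim] = largestDivisorUpTo(size, cap);
    budget /= threadsPerBlock[dim];
  }
}
```

For example, under this sketch's assumed limits, a prime global size of 10007 (from the measurements above) would otherwise force a block size of 1; searching the divisors of 10008 instead yields a block size of 834, i.e. 12 blocks rather than 10007.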

@mmoadeli mmoadeli requested a review from a team as a code owner June 8, 2023 13:20
@mmoadeli mmoadeli requested a review from steffenlarsen June 8, 2023 13:20
@mmoadeli mmoadeli marked this pull request as draft June 8, 2023 13:20
@mmoadeli mmoadeli marked this pull request as ready for review June 8, 2023 13:25
@mmoadeli mmoadeli linked an issue Jun 8, 2023 that may be closed by this pull request
@mmoadeli mmoadeli requested a review from bader as a code owner June 8, 2023 13:42
@mmoadeli mmoadeli changed the title [SYCL][CUDA] Improve function to guess local work size. [SYCL][CUDA] Improve function to guess local work size more efficiently. Jun 8, 2023
@bader bader removed their request for review June 8, 2023 15:19
@mmoadeli mmoadeli temporarily deployed to aws six times between June 8 and June 12, 2023 — with GitHub Actions

steffenlarsen (Contributor) left a comment

I think this looks good. @jchlanda - Would you mind sanity-checking as well?

// When global_work_size[0] is prime, threadPerBlock[0] will later be computed
// as 1, which is not an efficient configuration. In such a case we use
// global_work_size[0] + 1 to compute threadPerBlock[0].
int x_global_work_size = (isPrime(global_work_size[0]) &&

steffenlarsen (Contributor): Nit; I would like this name to be a little more descriptive.

mmoadeli (Contributor, Author): thanks @steffenlarsen, done.
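
For context on the snippet discussed above, a small illustrative example of why a degenerate threadsPerBlock of 1 is so costly: blocksPerGrid is the ceiling of the global size divided by the block size, so a prime size with no usable divisor turns into one block per work-item. The helper name blocksForDim and the specific numbers below are assumptions for this sketch, not the plugin's actual code.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: number of blocks needed to cover `globalSize`
// work-items with blocks of `threadsPerBlock` (ceiling division).
static size_t blocksForDim(size_t globalSize, size_t threadsPerBlock) {
  return (globalSize + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
  const size_t n = 10007; // a prime global size, as in the benchmark above

  // Without the adjustment, the only divisor of 10007 that fits in a block
  // is 1, so the grid degenerates to 10007 blocks of a single thread.
  std::printf("unadjusted: %zu blocks\n", blocksForDim(n, 1));

  // With the adjustment, divisors of 10008 are searched; 834 divides it and
  // fits a 1024-thread limit, so 12 blocks cover the range.
  std::printf("adjusted:   %zu blocks\n", blocksForDim(n, 834));
  return 0;
}
```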

@mmoadeli mmoadeli temporarily deployed to aws eight times on June 13, 2023 — with GitHub Actions
@steffenlarsen steffenlarsen merged commit 56e05af into intel:sycl Jun 13, 2023
fineg74 pushed a commit to fineg74/llvm that referenced this pull request Jun 15, 2023
[SYCL][CUDA] Improve function to guess local work size more efficiently. (intel#9787)

@mmoadeli mmoadeli deleted the guess-local-work-space branch July 7, 2023 10:38
Successfully merging this pull request may close these issues.

Poor performance for 2d/3d range kernels involving primes on CUDA PI