[SYCL][CUDA] Improve function to guess local work size more efficiently. #9787
Conversation
…e number for range dimensions.
I think this looks good. @jchlanda - Would you mind sanity-checking as well?
sycl/plugins/cuda/pi_cuda.cpp
Outdated
// When global_work_size[0] is prime, threadPerBlock[0] will later be computed
// as 1, which is not an efficient configuration. In such a case we use
// global_work_size[0] + 1 to compute threadPerBlock[0].
int x_global_work_size = (isPrime(global_work_size[0]) &&
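The snippet relies on an `isPrime` helper. A minimal sketch of such a trial-division test is shown below; this is an assumption for illustration, and the actual helper in `pi_cuda.cpp` may be written differently.

```cpp
#include <cstddef>

// Hypothetical trial-division primality test, as a sketch of what the
// isPrime helper referenced in the snippet could look like.
static bool isPrime(size_t n) {
  if (n < 2)
    return false; // 0 and 1 are not prime
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0)
      return false; // found a nontrivial divisor
  return true;
}
```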
Nit: I would like this name to be a little more descriptive.
Thanks @steffenlarsen, done.
…ly. (intel#9787)

* The `threadsPerBlock` values computed by `guessLocalWorkSize` are not the most optimal. In particular, the `threadsPerBlock` values for `Y` and `Z` were much lower than possible.
* When the Y/Z values of the range are prime, very poor performance is observed, as shown in the associated [issue](intel#8018).
* This PR computes `threadsPerBlock` for X/Y/Z so as to reduce the corresponding `BlocksPerGrid` values.
* Below is the output of the code from the associated issue without the changes in this PR:

Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)
- 1009,4.61658
- 2003,45.6869
- 3001,67.5192
- 4001,88.1543
- 5003,111.338
- 6007,132.848
- 7001,154.697
- 8009,175.452
- 9001,196.237
- 10007,219.39
- 1000,4.59423
- 2000,4.61525
- 3000,4.61935
- 4000,4.62526
- 5000,4.64623
- 6000,4.78904
- 7000,8.92251
- 8000,8.97263
- 9000,9.06992
- 10000,9.03802

* And below is the output with the PR's updates:

Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)
- 1009,4.58252
- 2003,4.60139
- 3001,3.47269
- 4001,3.62314
- 5003,4.15179
- 6007,7.07976
- 7001,7.49027
- 8009,8.00097
- 9001,9.08756
- 10007,8.0005
- 1000,4.56335
- 2000,4.60376
- 3000,4.76395
- 4000,4.63283
- 5000,4.64732
- 6000,4.63936
- 7000,8.97499
- 8000,8.9941
- 9000,9.01531
- 10000,9.00935
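The idea behind the fix can be sketched as follows. A common way to guess a block size is to take the largest divisor of the global size that fits under the device's thread-per-block limit; when the global size is a prime larger than that limit, its only divisor in range is 1, so the heuristic bumps the size by one before searching (the extra threads are then masked out by a range check in the kernel). This is a simplified, hypothetical sketch, not the exact `pi_cuda.cpp` code; `guessBlockSize` and its parameters are names invented here for illustration.

```cpp
#include <cstddef>

// Hypothetical trial-division primality test (sketch, repeated here so the
// example is self-contained).
static bool isPrime(size_t n) {
  if (n < 2)
    return false;
  for (size_t d = 2; d * d <= n; ++d)
    if (n % d == 0)
      return false;
  return true;
}

// Sketch of the heuristic: largest divisor of the (possibly adjusted)
// global size that does not exceed maxThreads.
static size_t guessBlockSize(size_t globalSize, size_t maxThreads) {
  // A prime global size larger than the limit has no divisor in range
  // other than 1, so search on globalSize + 1 instead.
  size_t effective = (isPrime(globalSize) && globalSize > maxThreads)
                         ? globalSize + 1
                         : globalSize;
  size_t best = 1;
  for (size_t t = 1; t <= maxThreads && t <= effective; ++t)
    if (effective % t == 0)
      best = t;
  return best;
}
```

For example, with a 1024-thread limit, a prime global size of 2003 would yield a block size of 1 without the adjustment; searching on 2004 instead yields 1002, which matches the dramatic improvement for the prime-sized runs in the timings above.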