[SYCL][CUDA] Improve function to guess local work size more efficiently. (#9787)
* The `threadsPerBlock` values computed by `guessLocalWorkSize` are not
optimal. In particular, the `threadsPerBlock` values for `Y` and
`Z` were well below what is possible.
* When the Y/Z values of the range are prime, performance is very
poor, as shown in the associated
[issue](#8018).
* This PR computes `threadsPerBlock` for X/Y/Z so as to reduce the
corresponding `BlocksPerGrid` values.
* The output of the code in the associated issue, without the
changes in this PR, is shown below.
Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)
- 1009,4.61658
- 2003,45.6869
- 3001,67.5192
- 4001,88.1543
- 5003,111.338
- 6007,132.848
- 7001,154.697
- 8009,175.452
- 9001,196.237
- 10007,219.39
- 1000,4.59423
- 2000,4.61525
- 3000,4.61935
- 4000,4.62526
- 5000,4.64623
- 6000,4.78904
- 7000,8.92251
- 8000,8.97263
- 9000,9.06992
- 10000,9.03802
* The output with this PR's changes is shown below.
Device = NVIDIA GeForce GTX 1050 Ti
N, elapsed(ms)
- 1009,4.58252
- 2003,4.60139
- 3001,3.47269
- 4001,3.62314
- 5003,4.15179
- 6007,7.07976
- 7001,7.49027
- 8009,8.00097
- 9001,9.08756
- 10007,8.0005
- 1000,4.56335
- 2000,4.60376
- 3000,4.76395
- 4000,4.63283
- 5000,4.64732
- 6000,4.63936
- 7000,8.97499
- 8000,8.9941
- 9000,9.01531
- 10000,9.00935
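
The sensitivity to prime sizes can be illustrated with a small sketch. It is not the plugin's `guessLocalWorkSize` code, which this message does not show. It assumes only that the work-group size in a dimension must evenly divide the global range in that dimension, as SYCL requires for an `nd_range`. The helper names `largestDivisorAtMost` and `blocksPerGrid` and the limit of 64 threads are hypothetical, chosen for illustration.

```cpp
#include <cstddef>

// Hypothetical helper (not the plugin's code): returns the largest divisor
// of `global` that does not exceed `limit`. For a prime `global` larger
// than `limit`, the only such divisor is 1.
std::size_t largestDivisorAtMost(std::size_t global, std::size_t limit) {
  std::size_t d = global < limit ? global : limit;
  while (d > 1 && global % d != 0) {
    --d;
  }
  return d;
}

// Hypothetical helper: number of blocks needed to cover `global` work-items
// with `threads` threads per block in one dimension (rounded up).
// Assumes threads > 0.
std::size_t blocksPerGrid(std::size_t global, std::size_t threads) {
  return (global + threads - 1) / threads;
}
```

Under these assumptions and a limit of 64, a range of 1000 gets 50 threads per block and 20 blocks in that dimension, while the prime range 1009 gets 1 thread per block and 1009 blocks, so a divisor-only search degrades sharply on prime sizes.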