-
Notifications
You must be signed in to change notification settings - Fork 768
[E2E][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. #13661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
@intel/llvm-reviewers-runtime Thanks |
// Note that this barrier is required for the test to pass for | ||
// cuda 12.4 even for sm_60 and later devices. This appears to be a | ||
// cuda ptxas bug. | ||
sycl::group_barrier(BallotGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe mask it with #ifdef __NVPTX__
so we don't inadvertently ignore problems on other targets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if it should be the fix in the implementation and not in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah makes sense. Ta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if it should be the fix in the implementation and not in the test.
I can create a ticket to confirm that it is a ptx bug and report it to nvidia. They also just released cuda 12.5, so it is possible it is working there. Prior to cuda 12.4 this test passed.
I'm just going to close this PR. We've confirmed that it is fixed if you use the cuda driver that was released with cuda 12.5. If you use that driver then the cuda 12.4 toolkit can be used and the test passes. We will probably just make sure that we don't use the cuda driver associated with the cuda 12.4 toolkit in the CI when we upgrade it. |
This upgrades the docker to use the cuda 12.5 image. I've ran the test-e2e locally using cuda 12.5 and all is well. cuda 12.5 also fixed an issue introduced by the cuda 12.4 driver: see #13661 (comment) Signed-off-by: JackAKirk <[email protected]>
Fixes #12995
failure for cuda 12.4.
all_of_group
callsvote.sync.all
ptx instruction in the CUDA backend. It seems cuda 12.4 needs to have all members of the non-uniform ballot group in converged control flow to solve this failure.From my understanding, this change shouldn't be necessary as per the cuda spec for sm_60 and above: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions
"For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined."
I think this is a cuda ptxas bug, but I'm adding a barrier here just so the test passes once we switch to cuda 12.4. This test already passes fine for cuda 12.3 and below. There is no difference in the ptx generated for cuda 12.4, so I think this must be a ptxas/sass issue. Note that strictly speaking we support sm_5x (which would require the barrier addition here anyway) but in reality these "Maxwell" cards are very rarely used because they don't have any data centre cards in this generation. We get asked about "Kepler" support sm_3x sometimes (that we don't officially support because it is below sm_50), but I don't ever remember a sm_5x request/issue.