[E2E][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. #13661

JackAKirk · 2024-05-06T15:15:52Z

Fixes the #12995 failure for cuda 12.4.

all_of_group calls the vote.sync.all ptx instruction in the CUDA backend. It seems that with cuda 12.4 all members of the non-uniform ballot group must be in converged control flow to avoid this failure.

From my understanding, this change shouldn't be necessary as per the cuda spec for sm_60 and above: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions

"For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined."

I think this is a cuda ptxas bug, but I'm adding a barrier here just so the test passes once we switch to cuda 12.4. This test already passes fine for cuda 12.3 and below. There is no difference in the ptx generated for cuda 12.4, so I think this must be a ptxas/sass issue. Note that strictly speaking we support sm_5x (which would require the barrier addition here anyway), but in reality these "Maxwell" cards are very rarely used because that generation has no data centre cards. We sometimes get asked about "Kepler" (sm_3x) support, which we don't officially support because it is below sm_50, but I don't remember ever seeing an sm_5x request/issue.

Signed-off-by: JackAKirk <[email protected]>
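
For readers unfamiliar with the vote semantics discussed above, here is a minimal host-side sketch of what an "all" vote restricted to a member mask computes. It is plain C++ that models lanes as array entries; it is not the DPC++ implementation and it cannot reproduce the convergence problem, which only exists on the device. The names `kWarpSize`, `ballot_mask` and `vote_all` are hypothetical and chosen for illustration.

```cpp
#include <array>
#include <cstdint>

// Host-side model only: one array entry / one mask bit per lane of a
// 32-wide sub-group (warp). Nothing here runs on a GPU.
constexpr int kWarpSize = 32;

// Models forming a ballot group: the lanes whose predicate is true make up
// one group (the remaining lanes would form a second group). Membership is
// recorded as a bitmask, analogous to the membermask of vote.sync.
std::uint32_t ballot_mask(const std::array<bool, kWarpSize> &pred) {
  std::uint32_t mask = 0;
  for (int lane = 0; lane < kWarpSize; ++lane)
    if (pred[lane])
      mask |= (std::uint32_t{1} << lane);
  return mask;
}

// Models the result of an "all" vote restricted to membermask: true only if
// every member lane contributes true. Lanes outside the mask are ignored.
bool vote_all(std::uint32_t membermask,
              const std::array<bool, kWarpSize> &value) {
  for (int lane = 0; lane < kWarpSize; ++lane)
    if (((membermask >> lane) & 1u) && !value[lane])
      return false;
  return true;
}
```

On the device the member lanes have to reach the vote together for the result to be well defined on older targets, which is what the added group_barrier is meant to enforce; the model above sidesteps that by evaluating all lanes in a single loop.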

JackAKirk · 2024-05-23T15:56:19Z

@intel/llvm-reviewers-runtime
Would it be possible to review this?

Thanks

steffenlarsen · 2024-05-23T15:58:50Z

sycl/test-e2e/NonUniformGroups/ballot_group_algorithms.cpp

+          // Note that this barrier is required for the test to pass for
+          // cuda 12.4 even for sm_60 and later devices. This appears to be a
+          // cuda ptxas bug.
+          sycl::group_barrier(BallotGroup);


Could we maybe mask it with #ifdef __NVPTX__ so we don't inadvertently ignore problems on other targets?

I also wonder if it should be the fix in the implementation and not in the test.

Yeah makes sense. Ta

> I also wonder if it should be the fix in the implementation and not in the test.

I can create a ticket to confirm that it is a ptx bug and report it to nvidia. They also just released cuda 12.5, so it is possible it is working there. Prior to cuda 12.4 this test passed.
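
As a rough illustration of the `#ifdef __NVPTX__` suggestion above, the sketch below shows how a preprocessor guard confines a workaround to one target. It is plain C++ with a hypothetical helper, `workaround_barrier_enabled`, standing in for the guarded statement; in the test itself the guarded statement would be the `sycl::group_barrier(BallotGroup);` call, and it is an assumption here that `__NVPTX__` is defined only while compiling device code for the CUDA backend.

```cpp
// Hypothetical stand-in for the guarded workaround. In the real test the
// guarded line would be:
//   sycl::group_barrier(BallotGroup);
inline bool workaround_barrier_enabled() {
#if defined(__NVPTX__)
  // Assumed to be reached only when compiling for the NVPTX target, so the
  // extra barrier is limited to the CUDA backend.
  return true;
#else
  // Every other target is compiled without the barrier, so a genuine
  // failure there would still show up in the test.
  return false;
#endif
}
```

Compiled for a host or for a non-NVPTX device target, the helper returns false, which is the point of the guard: only the backend affected by the cuda 12.4 behaviour gets the extra synchronization.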

JackAKirk · 2024-05-30T16:07:28Z

I'm just going to close this PR. We've confirmed that it is fixed if you use the cuda driver that was released with cuda 12.5. If you use that driver then the cuda 12.4 toolkit can be used and the test passes. We will probably just make sure that we don't use the cuda driver associated with the cuda 12.4 toolkit in the CI when we upgrade it.

This upgrades the docker to use the cuda 12.5 image. I've run the test-e2e locally using cuda 12.5 and all is well. cuda 12.5 also fixed an issue introduced by the cuda 12.4 driver: see #13661 (comment)

Signed-off-by: JackAKirk <[email protected]>

JackAKirk added 2 commits May 6, 2024 16:02

Add barrier before all_of_group.

05bb45c

Signed-off-by: JackAKirk <[email protected]>

Add a note explaining this addition.

c16a5bd

Signed-off-by: JackAKirk <[email protected]>

JackAKirk requested a review from a team as a code owner May 6, 2024 15:15

JackAKirk requested a review from maarquitos14 May 6, 2024 15:15

JackAKirk changed the title ~~[CUDA] Add barrier before all_of_group in ballot_group_algorithms test.~~ [test-e2e][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. May 6, 2024

JackAKirk changed the title ~~[test-e2e][CUDA] Add barrier before all_of_group in ballot_group_algorithms test.~~ [E2E][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. May 6, 2024

Fix format.

7947e21

Signed-off-by: JackAKirk <[email protected]>

JackAKirk temporarily deployed to WindowsCILock May 6, 2024 15:27 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock May 6, 2024 16:09 — with GitHub Actions Inactive

steffenlarsen reviewed May 23, 2024

aelovikov-intel requested review from Pennycook and a team May 23, 2024 15:58

JackAKirk closed this May 30, 2024

uditagarwal97 mentioned this pull request Jun 5, 2024

[CI] Don't run E2E tests on self-hosted CUDA in Nightly #14041

Merged

JackAKirk mentioned this pull request Jun 5, 2024

[CI][CUDA] Uplift docker to use cuda 12.5 image. #14049

Merged
