[E2E][CUDA] NonUniformGroups/ballot_group_algorithms.cpp failed on CUDA #12995

Closed
uditagarwal97 opened this issue Mar 12, 2024 · 12 comments

Assignees: uditagarwal97
Labels: bug (Something isn't working), confirmed, cuda (CUDA back-end)

Comments

@uditagarwal97 (Contributor)

Describe the bug

NonUniformGroups/ballot_group_algorithms.cpp failed on the self-hosted CUDA runner during SYCL Nightly testing: https://github.com/intel/llvm/actions/runs/8242960746/job/22543077484

FAIL: SYCL :: NonUniformGroups/ballot_group_algorithms.cpp (1450 of 1931)
******************** TEST 'SYCL :: NonUniformGroups/ballot_group_algorithms.cpp' FAILED ********************
Exit Code: -6

Command Output (stdout):
--
# RUN: at line 1
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=nvptx64-nvidia-cuda /__w/llvm/llvm/llvm/sycl/test-e2e/NonUniformGroups/ballot_group_algorithms.cpp -o /__w/llvm/llvm/build-e2e/NonUniformGroups/Output/ballot_group_algorithms.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda /__w/llvm/llvm/llvm/sycl/test-e2e/NonUniformGroups/ballot_group_algorithms.cpp -o /__w/llvm/llvm/build-e2e/NonUniformGroups/Output/ballot_group_algorithms.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 2
env SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT=1 ONEAPI_DEVICE_SELECTOR=cuda:gpu  /__w/llvm/llvm/build-e2e/NonUniformGroups/Output/ballot_group_algorithms.cpp.tmp.out
# executed command: env SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT=1 ONEAPI_DEVICE_SELECTOR=cuda:gpu /__w/llvm/llvm/build-e2e/NonUniformGroups/Output/ballot_group_algorithms.cpp.tmp.out
# .---command stderr------------
# | ballot_group_algorithms.cpp.tmp.out: /__w/llvm/llvm/llvm/sycl/test-e2e/NonUniformGroups/ballot_group_algorithms.cpp:1: int main(): Assertion `AllAcc[WI] == true' failed.
# `-----------------------------
# error: command failed with exit status: -6

To reproduce

intel/llvm commit id: ad6085c
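
For reference, below is a minimal sketch of the kind of check this test performs. It is a hypothetical reduced reproducer, not the actual test source: the work-group size, predicate, and the `AnyAcc`/`AllAcc`/`NoneAcc` names are assumptions chosen to mirror the `AllAcc[WI] == true` assertion in the log, using `get_ballot_group` and the group algorithms from the sycl_ext_oneapi_non_uniform_groups extension.

```c++
// A minimal sketch, assuming the sycl_ext_oneapi_non_uniform_groups extension
// API; this is NOT the actual test source.
#include <sycl/sycl.hpp>

#include <cassert>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue Q;
  constexpr size_t WGSize = 32; // matches the reported sub_group_sizes: 32

  sycl::buffer<bool> AnyBuf{sycl::range{WGSize}};
  sycl::buffer<bool> AllBuf{sycl::range{WGSize}};
  sycl::buffer<bool> NoneBuf{sycl::range{WGSize}};

  Q.submit([&](sycl::handler &CGH) {
    sycl::accessor AnyAcc{AnyBuf, CGH, sycl::write_only};
    sycl::accessor AllAcc{AllBuf, CGH, sycl::write_only};
    sycl::accessor NoneAcc{NoneBuf, CGH, sycl::write_only};
    CGH.parallel_for(
        sycl::nd_range<1>{sycl::range{WGSize}, sycl::range{WGSize}},
        [=](sycl::nd_item<1> It) {
          size_t WI = It.get_global_id(0);
          auto SG = It.get_sub_group();

          // Partition the sub-group into two ballot groups using a predicate.
          bool Predicate = WI < WGSize / 2;
          auto BallotGroup = syclex::get_ballot_group(SG, Predicate);

          // The predicate is uniform within each ballot group, so these
          // relationships should hold for every work-item.
          AnyAcc[WI] = (sycl::any_of_group(BallotGroup, Predicate) == Predicate);
          AllAcc[WI] = (sycl::all_of_group(BallotGroup, Predicate) == Predicate);
          NoneAcc[WI] = (sycl::none_of_group(BallotGroup, Predicate) == !Predicate);
        });
  });

  // Constructing host accessors waits for the kernel to finish.
  sycl::host_accessor AnyAcc{AnyBuf, sycl::read_only};
  sycl::host_accessor AllAcc{AllBuf, sycl::read_only};
  sycl::host_accessor NoneAcc{NoneBuf, sycl::read_only};
  for (size_t WI = 0; WI < WGSize; ++WI) {
    assert(AnyAcc[WI] == true);
    assert(AllAcc[WI] == true); // the kind of check that fires in the log above
    assert(NoneAcc[WI] == true);
  }
  return 0;
}
```

Built like the RUN line above (`clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda`) and run with `ONEAPI_DEVICE_SELECTOR=cuda:gpu`, a wrong `all_of_group` result over a ballot group would fire an assertion of the same shape as the one in the log.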

Environment

sycl-ls --verbose output:

> sycl-ls --verbose

ur_print: Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 8.6 [CUDA 12.4]

Platforms: 1
Platform [#1]:
    Version  : CUDA 12.4
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 8.6
        Name       : NVIDIA GeForce RTX 3090
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 12.4
        Aspects    : gpu fp fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group cl_khr_fp64 cl_khr_subgroups pi_ext_intel_devicelib_assert ur_exp_command_buffer cl_khr_fp16 ext_oneapi_graph
        info::device::sub_group_sizes: 32
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 8.6 [CUDA 12.4]
accelerator_selector()  : No device of requested type available. -1 (PI_ERRO...
cpu_selector()          : No device of requested type available. -1 (PI_ERRO...
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 8.6 [CUDA 12.4]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 8.6 [CUDA 12.4]

Additional context

No response

uditagarwal97 added the bug (Something isn't working) and cuda (CUDA back-end) labels Mar 12, 2024
@uditagarwal97 (Contributor, Author)

@steffenlarsen FYI

@JackAKirk (Contributor)

CUDA 12.4 is not tested yet, as it was only released last week and we don't have that machine.
If you comment out https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/NonUniformGroups/ballot_group_algorithms.cpp#L121, does it pass? I guess that "none" would also fail if "any" does, so you may want to comment that out too.

It would be useful to know whether it is only the "any" check that is failing.

@uditagarwal97 (Contributor, Author)

@JackAKirk Since we don't support CUDA 12.4 yet, I think it would be better to downgrade the CUDA version on the CI machine instead of changing the test case, to prevent issues like this in the future.

uditagarwal97 self-assigned this Mar 14, 2024
@JackAKirk (Contributor) commented Mar 14, 2024

> @JackAKirk Since we don't support CUDA 12.4 yet, I think it would be better to downgrade the CUDA version on the CI machine instead of changing the test case, to prevent issues like this in the future.

I think that the CI is using CUDA 12.1 (with an A10 GPU). Isn't CUDA 12.4 just for your self-hosted runner?

@uditagarwal97 (Contributor, Author)

> > @JackAKirk Since we don't support CUDA 12.4 yet, I think it would be better to downgrade the CUDA version on the CI machine instead of changing the test case, to prevent issues like this in the future.
>
> I think that the CI is using CUDA 12.1 (with an A10 GPU). Isn't CUDA 12.4 just for your self-hosted runner?

Yes. I mean downgrading the CUDA version on the self-hosted CUDA runner.

@JackAKirk (Contributor)

> > > @JackAKirk Since we don't support CUDA 12.4 yet, I think it would be better to downgrade the CUDA version on the CI machine instead of changing the test case, to prevent issues like this in the future.
> >
> > I think that the CI is using CUDA 12.1 (with an A10 GPU). Isn't CUDA 12.4 just for your self-hosted runner?
>
> Yes. I mean downgrading the CUDA version on the self-hosted CUDA runner.

This is the first I have heard of this self-hosted runner. TBH, new CUDA versions should not normally be an issue, although 12.4 does make some interesting changes to ptxas. When one of the systems I have access to gets 12.4, I will test it.

@bader (Contributor) commented Mar 14, 2024

I expect CI to use CUDA SDK 12.1 which is installed into the docker container we use.

@uditagarwal97 (Contributor, Author)

> I expect CI to use CUDA SDK 12.1 which is installed into the docker container we use.

Yes. For now, I'll get the CUDA version on the self-hosted runner downgraded to 12.1. In the future, if we decide to upgrade the CUDA version, we should do so uniformly across all CI machines, including the AWS ones. We would also have to update some Dockerfiles (like https://github.com/intel/llvm/blob/sycl/devops/containers/ubuntu2204_build.Dockerfile) in that case.

@uditagarwal97 (Contributor, Author)

Closing this issue, as we have downgraded the CUDA version to 12.1 and this test failure is gone: https://github.com/intel/llvm/actions/runs/8495325861/job/23286803900

@JackAKirk (Contributor)

I've now tested this on an A100 using CUDA 12.4 and the test passes.

@JackAKirk (Contributor) commented May 3, 2024

I reproduced this test failing on an RTX 30-series GPU (sm_86) using CUDA 12.4, but it passes on an A100 (sm_80). I'll look into it.

JackAKirk reopened this May 3, 2024
aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this issue Jun 5, 2024
Fails in Nightly testing on the self-hosted CUDA runner:
intel#12995.
aelovikov-intel added a commit that referenced this issue Jun 6, 2024
…UDA (#14058)

Fails in Nightly testing on the self-hosted CUDA runner:
#12995.
ianayl pushed a commit to ianayl/sycl that referenced this issue Jun 13, 2024
…UDA (intel#14058)

Fails in Nightly testing on the self-hosted CUDA runner:
intel#12995.
@JackAKirk (Contributor)

This was identified as a CUDA runtime issue that was fixed in later versions of the CUDA runtime and has nothing to do with DPC++, so I'm closing the issue.
