[E2E][CUDA] NonUniformGroups/ballot_group_algorithms.cpp failed on CUDA #12995
Comments
@steffenlarsen FYI
CUDA 12.4 is not tested yet, as it was only released last week and we don't have that machine. It would be useful to know if it is only the "any" that is failing.
@JackAKirk Since we don't support CUDA 12.4 yet, I think it would be better to downgrade the CUDA version on the CI machine instead of changing the test case, to prevent issues like this in the future.
I think that the CI is using CUDA 12.1 (with an A10 GPU). Isn't CUDA 12.4 just for your self-hosted runner?
Yes. I mean downgrading the CUDA version on the self-hosted CUDA runner.
This is the first I have heard of this self-hosted runner. tbh, new CUDA versions should not normally be an issue, although 12.4 does make some interesting changes to ptxas. When one of the systems I have access to gets 12.4, I will test it.
I expect CI to use CUDA SDK 12.1, which is installed in the docker container we use.
Yes. For now, I'll get the CUDA version downgraded on the self-hosted runner to 12.1. In future, if we decide to upgrade the CUDA version, we should do that uniformly across all CI machines, including AWS ones. We would also have to update some dockerfiles (like https://github.com/intel/llvm/blob/sycl/devops/containers/ubuntu2204_build.Dockerfile) in that case.
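For illustration only, here is a minimal sketch of how a runner could check that its CUDA toolkit matches the version the CI is pinned to. It is not part of the intel/llvm CI. The names `PINNED_CUDA`, `parse_nvcc_release` and `matches_pin` are hypothetical, and the 12.1 pin is taken from the comments above. The sketch assumes the `release X.Y` text that `nvcc --version` prints.

```python
import re
from typing import Tuple

# Hypothetical pin, taken from this thread: CI is expected to use CUDA 12.1.
PINNED_CUDA = (12, 1)


def parse_nvcc_release(version_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    # nvcc prints a line such as:
    #   "Cuda compilation tools, release 12.1, V12.1.105"
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def matches_pin(version_output: str, pinned: Tuple[int, int] = PINNED_CUDA) -> bool:
    """Return True when the toolkit version equals the pinned version."""
    return parse_nvcc_release(version_output) == pinned
```

On a runner, the input text could be captured with something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, and a mismatch could be reported before any tests start.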
Closing this issue as we have downgraded the CUDA version to 12.1 and this test failure is gone: https://github.com/intel/llvm/actions/runs/8495325861/job/23286803900
I've now tested this on A100 using CUDA 12.4 and the test passes.
I reproduced this test failing on RTX 30 series (sm_86) and A100 using CUDA 12.4,
Fails in Nightly testing on the self-hosted CUDA runner: intel#12995.
…UDA (intel#14058) Fails in Nightly testing on the self-hosted CUDA runner: intel#12995.
This was identified as a CUDA runtime issue that was fixed in later versions of the CUDA runtime and has nothing to do with DPC++, so closing the issue.
Describe the bug
NonUniformGroups/ballot_group_algorithms.cpp failed on self-hosted CUDA runner during SYCL Nightly testing: https://github.com/intel/llvm/actions/runs/8242960746/job/22543077484
To reproduce
intel/llvm commit id: ad6085c
Environment
sycl-ls --verbose
output:
Additional context
No response