[SYCL][CUDA] Default selector behaviour #1665


Closed
Ruyk opened this issue May 11, 2020 · 6 comments
Assignees
Labels
cuda (CUDA back-end), runtime (Runtime library related issue)

Comments

@Ruyk
Contributor

Ruyk commented May 11, 2020

There were some interesting discussions about the behaviour of the SYCL device selector on the now-closed PR #1543.
I think it is better if we discuss this particular topic separately and make a decision on what is best for the community.
The current default device selection works like this:

  • [A] If there is no CUDA device on the system:
  1. Give scores to devices (I think: GPU > CPU > FPGA > host)
  2. Return the device with the highest score
  • [B] If there is a CUDA platform on the system, and no SYCL_BE is passed, or SYCL_BE=PI_OPENCL:
  1. The OpenCL backend is preferred, so the GPU from the NVIDIA OpenCL platform is returned first
  2. The user gets an invalid-triple error, since the default compilation doesn't generate the right binary format. If using the CUDA triple, it still fails with different errors we haven't investigated.
  • [C] If there is a CUDA platform on the system, and SYCL_BE=PI_CUDA is passed:
  1. The CUDA backend is preferred, so the PI CUDA backend is returned.
  2. The default compilation will fail, since the user hasn't passed the right triple. If the user has passed the right triple for the CUDA backend, the program runs.

On the Codeplay side, we get a lot of feedback and user questions from people accidentally running their SYCL application on an NVIDIA OpenCL platform and getting strange failures.
We recommend that users remove the NVIDIA OpenCL ICD from the system so it is not considered during device selection, leaving only the PI CUDA backend available for DPC++ applications.
This is not always possible, since not everyone has permission to edit the ICD file. There are environment variables that can be used with the Khronos ICD loader, such as OPENCL_ICD_VENDORS, which alters the path from which the ICD files are loaded, but this is not a practical solution.
For that reason, we propose to remove the NVIDIA OpenCL platform from device selection altogether, so users don't accidentally use a platform that won't work with the default configuration.
We do not want to remove any other OpenCL 1.2 platform, since others may work by default (e.g., POCL or ComputeAorta).

Even after users have dealt with their NVIDIA OpenCL problems, the issue of selecting the CUDA backend remains.
When the NVIDIA OpenCL platform is removed from the system and there is no SYCL_BE preference, the CUDA backend is selected first by the default_selector (or even the gpu_selector), since it exposes the GPU. This causes problems for users deploying a SYCL application with a default selector on a system that has an NVIDIA GPU, because that GPU will be selected first, and selection fails if the application has not been built with the right triple for the CUDA backend.
When using SYCL_BE=PI_OPENCL, this problem goes away, as the CUDA backend is not selected.

To prevent users from accidentally triggering the selection of the NVIDIA CUDA backend, we suggested on the ill-fated PR to make the selection of the CUDA backend explicit: the CUDA backend will not be used in default (or GPU) selection unless SYCL_BE=PI_CUDA is exported.
If it is, then the CUDA device will likely be selected first by the default or GPU selector, but the user has opted in, so it is not accidental.
This shouldn't prevent the selection of other devices (e.g., when using the cpu_selector or accelerator_selector), so using multiple devices in the same SYCL application is still possible.

Note that all of this is for the default device selection, and users can still write their own custom device selectors to bypass this. If a user wants to use a CUDA device, she can write a CUDA selector that will force usage of a CUDA device or bail out.

To summarize (and conclude):

  1. I'll make a PR to expunge the NVIDIA OpenCL platform from device selection altogether (it won't be possible even to write a custom device selector to use it).
  2. I'll make a PR to make usage of the CUDA backend in default selectors an "opt-in". This won't affect custom selectors.

Does this seem like a good approach? Are there any alternative proposals?

@Ruyk
Contributor Author

Ruyk commented May 11, 2020

Ping @hiaselhans

@hiaselhans
Contributor

hiaselhans commented May 11, 2020

thanks for involving me on this @Ruyk :)

to point 2, i don't have a lot to add, i just want to verify current upstream behaviour:

  • SYCL_BE is empty: opencl and cuda backend are available
  • SYCL_BE=PI_OPENCL: only opencl backend available
  • SYCL_BE=PI_CUDA: only cuda backend

I don't have a strong preference on how to do it, but i think we should enable all of these choices. So having cuda as opt-in should still provide a way to have both cuda and opencl available.

1: it might be good timing to hide (or put a low rating on) other unsupported devices too. You are right not to hide 1.2 devices altogether, but i think isDeviceBinaryTypeSupported's logic is quite well suited and is enforced at a later point anyway..?

static bool isDeviceBinaryTypeSupported(const context &C,

@bader bader added the cuda CUDA back-end label May 12, 2020
@bader bader assigned romanovvlad and unassigned bader May 12, 2020
@Ruyk
Contributor Author

Ruyk commented May 15, 2020

#1689 deals with the removal of NVIDIA OpenCL from the platform list, and includes a better error message in isDeviceBinaryTypeSupported for the CUDA backend case.

@romanovvlad
Contributor

> This fails if the application has not been built with the right triple for the CUDA backend.
> I'll make a PR to expunge NVIDIA OpenCL platform from device selection all together (Won't be possible even to write a custom device selector to use it).

This can prevent users from using NVIDIA OpenCL with online compilation (aka sycl::program::build_with_source). I don't think a lot of people use it, so I support this suggestion. Probably we could use a SYCL_DEVICE_ALLOWLIST config which would be set in sycl.conf (https://github.com/intel/llvm/blob/sycl/sycl/source/detail/config.cpp#L25), which would need to be created for this, so the user will be able to override it if they really want to.

> I'll make a PR to make usage of CUDA backend on default selectors an "opt-in". This won't affect custom selectors.

I'm not a proper SYCL user, but I think I would be confused if I needed to set some env var just to use a working target. I remember there was a suggestion to have a driver option which enables building the device code for both SPIR-V and PTX triples, and to advertise it everywhere in the examples, so novice users get the most "portable" binaries.

@npmiller
Contributor

The PR #6203 should partially address point 2 by giving extra preference to devices with available images in the fat binary.

So, for example, a program with only a SPIR-V image will prefer any device with SPIR-V support over the CUDA backend GPU device, so we shouldn't need the "opt-in": simply using the CUDA triple or not during compilation will influence device selection.

Note that with this patch it may still incorrectly select the CUDA backend if there are no other suitable devices, and it won't fall back on the host device, but I'm not sure whether that would be desirable.

pvchupin pushed a commit that referenced this issue Jun 7, 2022
This patch solves:

* [SYCL] Default selector should filter devices based on available device images #2004
* [SYCL][CUDA] Default selector behaviour #1665

In some cases the current selector may select a device for which we don't have an AOT or SPIR-V binary; this patch ensures that such devices get skipped.
@AerialMantis
Contributor

Since #6203 has been merged, I think this issue can be closed. Support for NVIDIA GPUs via the OpenCL backend is still disabled, so if we were to enable it in the future we may want to revisit this, but in that case we could re-open this ticket or create a new one.

preethi-intel pushed a commit to preethi-intel/llvm that referenced this issue Oct 26, 2022
Add the `n` argument in `vloadn.spvasm`, `vload_halfn.spvasm`, and
`vloada_halfn.spvasm`.  Add a trailing `(` to the CL20 LLVM IR
patterns.

Add extra integer arguments in `OpImageWrite.cl` which hold the Image
Operands `SignExtend`/`ZeroExtend`.

Add the accumulator arguments in `SPV_KHR_integer_dot_product-sat.ll`.

Original commit:
KhronosGroup/SPIRV-LLVM-Translator@9f00637