[SYCL] Prevent fallback assert postprocessor being passed to CUDA and HIP devices #4604

AidanBeltonS · 2021-09-20T14:50:53Z

This patch fixes an error with the HIP backend where the tests DeviceCodeSplit/split-per-kernel.cpp and DeviceCodeSplit/split-per-source-main.cpp fail with: Memory access fault by GPU node-4 (Agent handle: 0x709c50) on address (nil). Reason: Page not present or supervisor privilege.

This is due to a postprocessor being sent to queue_impl::submit_impl.
The patch adds checks to queue.hpp so that the postprocessors needed for fallback asserts are not passed for HIP and CUDA devices. Neither backend currently supports fallback asserts.

Note: adding __AMDGCN__ to

#if !defined(SYCL_DISABLE_FALLBACK_ASSERT) && !defined(__NVPTX__)
#define __SYCL_USE_FALLBACK_ASSERT 1
#else
#define __SYCL_USE_FALLBACK_ASSERT 0
#endif

was not a fix, as a new error arose in its place. Investigation showed that the macros __AMDGCN__ and __NVPTX__ are only set for device compilation and, as a result, did not stop postprocessors being passed to submit_impl. I think the postprocessor was not supposed to be passed to CUDA devices. Despite being passed, it was not causing an error for CUDA, just HIP.
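To illustrate why a guard on the target macros has no effect in host code, here is a minimal sketch. The function name compiledForGpuTarget is hypothetical and is not part of the SYCL headers; it only shows that a host compilation pass takes the #else branch, because neither __NVPTX__ nor __AMDGCN__ is defined there.

```cpp
#include <cassert>

// Hypothetical helper, not part of the SYCL runtime.
// Returns true only when this translation unit is compiled for an
// NVPTX or AMDGCN device target. In a host compilation pass neither
// macro is defined, so the #else branch is selected.
constexpr bool compiledForGpuTarget() {
#if defined(__NVPTX__) || defined(__AMDGCN__)
  return true;
#else
  return false;
#endif
}

// Any host-side decision guarded by these macros (such as whether to
// attach a postprocessor to a submission) therefore behaves as if the
// target were not CUDA or HIP.
constexpr bool hostPassSkipsPostProcessor() { return compiledForGpuTarget(); }
```

When built with an ordinary host compiler, both functions return false, which matches the behaviour described above: the macro guard does not stop the host from passing the postprocessor.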

romanovvlad · 2021-09-20T15:58:30Z

This patch fixes an error with HIP backend where tests DeviceCodeSplit/split-per-kernel.cpp and DeviceCodeSplit/split-per-source-main.cpp fail with Memory access fault by GPU node-4 (Agent handle: 0x709c50) on address (nil). Reason: Page not present or supervisor privilege.

This is due to a postprocessor being sent to queue_impl::submit_impl.

Could you please clarify if this patch is a workaround until the issue is root caused and a proper solution found or a final solution? Do we know what the root cause is?

AidanBeltonS · 2021-09-20T16:03:42Z

Could you please clarify if this patch is a workaround until the issue is root caused and a proper solution found or a final solution? Do we know what the root cause is?

The root cause is that target macros like __NVPTX__ and __AMDGCN__ are being used to prevent sending postprocessors to certain devices. These macros are not defined when building for the host, so the check does not prevent a postprocessor from being sent.
This causes errors on HIP with module splitting.

This is proposed as a long-term solution until CUDA and HIP support fallback asserts.
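A rough sketch of the kind of host-side check described here, deciding at run time from the device's backend rather than from compile-time target macros. The Backend enum and the function name are assumptions made for illustration; they are not the actual queue.hpp code.

```cpp
#include <cassert>

// Hypothetical backend tag for illustration; the real runtime has its
// own backend enumeration.
enum class Backend { OpenCL, LevelZero, CUDA, HIP };

// Host-side decision: should a submission to a device of this backend
// carry the fallback assert postprocessor? CUDA and HIP are excluded
// because, per the discussion, neither supports fallback asserts.
constexpr bool needsFallbackAssertPostProcessor(Backend B) {
  switch (B) {
  case Backend::CUDA:
  case Backend::HIP:
    return false;
  case Backend::OpenCL:
  case Backend::LevelZero:
    return true;
  }
  return true; // Unknown backends keep the default behaviour.
}
```

Because the check runs on the host against the device actually used by the queue, it does not depend on which macros are defined in the current compilation pass.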

sycl/include/CL/sycl/queue.hpp

s-kanaev · 2021-09-21T07:02:54Z

I think the postprocessor was not supposed to be passed to CUDA devices.

Even though it was passed for a CUDA device, it was effectively a NOP because the CUDA backend reports support for native assert. See the change in #3767

AidanBeltonS · 2021-09-21T09:13:39Z

Even though it was passed for a CUDA device, it was effectively a NOP because the CUDA backend reports support for native assert. See the change in #3767

Ahh, thanks for explaining. I was wondering why CUDA devices were not failing.

s-kanaev · 2021-09-21T10:29:55Z

Even though it was passed for a CUDA device, it was effectively a NOP because the CUDA backend reports support for native assert. See the change in #3767

Ahh, thanks for explaining. I was wondering why CUDA devices were not failing.

@AidanBeltonS, could you then modify the HIP plugin to report assert as being supported? This will keep queue.hpp back-end agnostic.

AidanBeltonS · 2021-09-21T12:03:54Z

@AidanBeltonS, could you then modify the HIP plugin to report assert as being supported? This will keep queue.hpp back-end agnostic.

I have tested adding PI_DEVICE_INFO_EXTENSION_DEVICELIB_ASSERT to the HIP extensions. This resolves the problem.

Currently HIP 4.3 does not support asserts natively, though it seems that it will soon. Are you okay with adding the extension regardless of HIP's support? I can add a comment explaining that this is being used to resolve this issue, with a note to remove the comment once native asserts are supported.

From HIP 4.3 docs

The assert function is under development. HIP does support an "abort" call which will terminate the process execution from inside the kernel.
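The approach discussed above can be sketched as follows: the runtime looks for an assert extension name in the space-separated extension string a device reports, and only uses the fallback path when it is absent. The extension string value and both function names below are placeholders chosen for illustration; they are not taken from the plugin interface.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Placeholder extension name, assumed for this sketch only. The real
// value is whatever PI_DEVICE_INFO_EXTENSION_DEVICELIB_ASSERT expands to.
const std::string kAssertExtension = "example_devicelib_assert";

// Returns true if the space-separated extension list contains the
// assert extension as a whole token.
bool reportsNativeAssert(const std::string &Extensions) {
  std::istringstream In(Extensions);
  std::string Token;
  while (In >> Token)
    if (Token == kAssertExtension)
      return true;
  return false;
}

// The fallback assert path (and its postprocessor) is only needed when
// the device does not report the extension.
bool needsFallbackAssert(const std::string &Extensions) {
  return !reportsNativeAssert(Extensions);
}
```

With this shape, a plugin that adds the extension name to its reported list opts out of the fallback path without any backend-specific code in the queue header.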


bader · 2021-10-05T07:42:16Z

@smaslov-intel, @romanovvlad, @s-kanaev, ping.

bader · 2021-10-07T07:41:50Z

@s-kanaev, ping.

s-kanaev

Seems legit

aidan.belton added 2 commits September 20, 2021 13:18

prevent preprocessor being passed to hip and cuda

f3d5e16

Merge from upstream

8217561

AidanBeltonS requested a review from a team as a code owner September 20, 2021 14:50

AidanBeltonS requested a review from sergey-semenov September 20, 2021 14:50

bader added the cuda (CUDA back-end), hip (Issues related to execution on HIP backend) and runtime (Runtime library related issue) labels Sep 20, 2021

romanovvlad requested a review from s-kanaev September 20, 2021 15:56

s-kanaev reviewed Sep 21, 2021


Add comment to check

2fcc0da

bader requested a review from s-kanaev September 21, 2021 09:13

revert changes to queue.hpp and PI_DEVICE_INFO_EXTENSION_DEVICELIB_ASSERT to hip

3238b12

AidanBeltonS requested a review from smaslov-intel as a code owner September 27, 2021 16:06

clarify comment

54476f6

smaslov-intel approved these changes Oct 5, 2021


s-kanaev approved these changes Oct 7, 2021


bader merged commit b0411f8 into intel:sycl Oct 7, 2021

AidanBeltonS mentioned this pull request Oct 8, 2021

[SYCL][HIP] Memory access fault by GPU on address (nil) #4688

Closed

aarongreig mentioned this pull request Nov 1, 2024

Add device info query to report support for native asserts. oneapi-src/unified-runtime#2269

Closed
