Skip to content

[SYCL] Prevent fallback assert postprocessor being passed to CUDA and HIP devices #4604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 7, 2021

Conversation

AidanBeltonS
Copy link
Contributor

This patch fixes an error with HIP backend where tests DeviceCodeSplit/split-per-kernel.cpp and DeviceCodeSplit/split-per-source-main.cpp fail with Memory access fault by GPU node-4 (Agent handle: 0x709c50) on address (nil). Reason: Page not present or supervisor privilege.

This is due to a postprocessor being sent to queue_impl::submit_impl.
The patch add checks to queue.hpp so postprocessors needed for fallback assert do not get passed for HIP and CUDA devices. Neither currently support fallback asserts.

Note: adding __AMDGCN__ to

#if !defined(SYCL_DISABLE_FALLBACK_ASSERT) && !defined(__NVPTX__)
#define __SYCL_USE_FALLBACK_ASSERT 1
#else
#define __SYCL_USE_FALLBACK_ASSERT 0
#endif

was not a fix as a new error arose in its place. Investigation showed that macros __AMDGCN__ and __NVPTX__ are only set for device compilation and as a result did not stop preprocessors being passed to submit_impl. I think the postprocesor was not supposed to be passed to CUDA devices. Despite being passed it was not causing an error for CUDA just HIP.

@AidanBeltonS AidanBeltonS requested a review from a team as a code owner September 20, 2021 14:50
@bader bader added cuda CUDA back-end hip Issues related to execution on HIP backend. runtime Runtime library related issue labels Sep 20, 2021
@romanovvlad
Copy link
Contributor

This patch fixes an error with HIP backend where tests DeviceCodeSplit/split-per-kernel.cpp and DeviceCodeSplit/split-per-source-main.cpp fail with Memory access fault by GPU node-4 (Agent handle: 0x709c50) on address (nil). Reason: Page not present or supervisor privilege.

This is due to a postprocessor being sent to queue_impl::submit_impl.

Could you please clarify if this patch is a workaround until the issue is root caused and a proper solution found or a final solution? Do we know what the root cause is?

@AidanBeltonS
Copy link
Contributor Author

AidanBeltonS commented Sep 20, 2021

Could you please clarify if this patch is a workaround until the issue is root caused and a proper solution found or a final solution? Do we know what the root cause is?

The root cause is that target macro's like __NVPTX__ and __AMDGCN__ are being used to prevent sending postprocessors to certain devices. These macros are not specified when building for the host so it does not prevent sending a postprocessor.
This causes errors on HIP with module splitting.

This is proposed as a long-term solution until CUDA and HIP support fallback asserts.

@s-kanaev
Copy link
Contributor

I think the postprocesor was not supposed to be passed to CUDA devices.

Even though it was passed for CUDA device, it converted was sort of a NOP due CUDA backend reported support for native assert. See the change in #3767

@bader bader requested a review from s-kanaev September 21, 2021 09:13
@AidanBeltonS
Copy link
Contributor Author

Even though it was passed for CUDA device, it converted was sort of a NOP due CUDA backend reported support for native assert. See the change in #3767

Ahh, thanks for explaining. I was wondering why CUDA devices were not failing.

@s-kanaev
Copy link
Contributor

Even though it was passed for CUDA device, it converted was sort of a NOP due CUDA backend reported support for native assert. See the change in #3767

Ahh, thanks for explaining. I was wondering why CUDA devices were not failing.

@AidanBeltonS , could you, then, modify the HIP plugin to report assert being supported? This will remain {{queue.hpp}} back-end agnostic.

@AidanBeltonS
Copy link
Contributor Author

@AidanBeltonS , could you, then, modify the HIP plugin to report assert being supported? This will remain {{queue.hpp}} back-end agnostic.

I have tested adding PI_DEVICE_INFO_EXTENSION_DEVICELIB_ASSERT to the hip extensions. This resolves the problem.

Currently HIP4.3 does not support asserts natively. Though it seems that it will soon. Are okay with adding the extension regardless of HIPs support? I can comment that this is being used resolve this issue and add a note to remove the comment once native asserts are supported.

From HIP 4.3 docs

The assert function is under development. HIP does support an "abort" call which will terminate the process execution from inside the kernel.

@bader
Copy link
Contributor

bader commented Oct 5, 2021

@smaslov-intel, @romanovvlad, @s-kanaev, ping.

@bader
Copy link
Contributor

bader commented Oct 7, 2021

@s-kanaev, ping.

Copy link
Contributor

@s-kanaev s-kanaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems legit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA back-end hip Issues related to execution on HIP backend. runtime Runtime library related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants