
[Clang][Driver] Enable internalization by default for AMDGPU #138365

Merged
merged 1 commit into main on May 3, 2025

Conversation

shiltian
Contributor

@shiltian shiltian commented May 2, 2025

No description provided.

@llvmbot llvmbot added the clang (Clang issues not falling into any other category) and clang:driver ('clang' and 'clang++' user-facing binaries; not 'clang-cl') labels on May 2, 2025
Contributor Author

shiltian commented May 2, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@shiltian shiltian requested review from jdoerfert and jhuber6 May 2, 2025 23:51
@llvmbot
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-clang

Author: Shilei Tian (shiltian)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/138365.diff

2 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/Clang.cpp (+6)
  • (modified) clang/test/Driver/openmp-offload-gpu.c (+7)
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 6dabf9d842b7b..093f6fe9d5c77 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
         CmdArgs.push_back(Args.MakeArgString(
             "--device-linker=" + TC->getTripleString() + "=" + Arg));
 
+      // Enable internalization for AMDGPU.
+      if (TC->getTriple().isAMDGPU())
+        CmdArgs.push_back(
+            Args.MakeArgString("--device-linker=" + TC->getTripleString() +
+                               "=-plugin-opt=-amdgpu-internalize-symbols"));
+
       // Forward the LTO mode relying on the Driver's parsing.
       if (C.getDriver().getOffloadLTOMode() == LTOK_Full)
         CmdArgs.push_back(Args.MakeArgString(
diff --git a/clang/test/Driver/openmp-offload-gpu.c b/clang/test/Driver/openmp-offload-gpu.c
index 5ad038b7dbd37..73dc5a7f75d5b 100644
--- a/clang/test/Driver/openmp-offload-gpu.c
+++ b/clang/test/Driver/openmp-offload-gpu.c
@@ -393,3 +393,10 @@
 // RUN:     --offload-arch=sm_52 -foffload-lto=thin -nogpulib -nogpuinc %s 2>&1 \
 // RUN:   | FileCheck --check-prefix=THINLTO-SM52 %s
 // THINLTO-SM52: --device-compiler=nvptx64-nvidia-cuda=-flto=thin
+
+// Check that internalization is enabled for OpenMP offloading on AMDGPU.
+//
+// RUN:   %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
+// RUN:     --offload-arch=gfx906 -nogpulib -nogpuinc %s 2>&1 \
+// RUN:   | FileCheck --check-prefix=INTERNALIZATION-GFX906 %s
+// INTERNALIZATION-GFX906: --device-linker=amdgcn-amd-amdhsa=-plugin-opt=-amdgpu-internalize-symbols

@llvmbot
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-clang-driver

Author: Shilei Tian (shiltian)


@shiltian shiltian force-pushed the users/shiltian/enable-internalization-for-amdgpu branch from acc89cf to c9a8e6e Compare May 2, 2025 23:54
@shiltian shiltian merged commit 2b62a31 into main May 3, 2025
9 of 10 checks passed
@shiltian shiltian deleted the users/shiltian/enable-internalization-for-amdgpu branch May 3, 2025 01:32
if (TC->getTriple().isAMDGPU())
CmdArgs.push_back(
Args.MakeArgString("--device-linker=" + TC->getTripleString() +
"=-plugin-opt=-amdgpu-internalize-symbols"));
Contributor

I forget what this hacked around; I was hoping we didn't need it anymore. Maybe @arsenm remembers.

Contributor Author

Not really a hack TBH, though we are the only target explicitly using it. It significantly affects our performance. The remaining uses of this pass are in (Thin)LTO, which has a broader impact.

Contributor

This looks more like "always" than "by default".

The implementation details are kind of a hack. We should be able to just run the ordinary internalize pass without the special AMDGPU filter.

static bool mustPreserveGV(const GlobalValue &GV) {

The most legitimate part of this is the isEntryFunctionCC, which we could just make the internalize pass directly do. The sanitizer name hacks are also obviously hacks.

The attempt to drop constant users looks like a bad side effect that a predicate should not have; I don't know what that's doing there.
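Conceptually, the pass under discussion walks every global and marks it internal unless a preservation predicate protects it; `mustPreserveGV` is that predicate, and the `isEntryFunctionCC` check keeps kernel entry points visible to the host. A minimal stand-alone sketch of that shape (plain C++ with illustrative stand-in types such as `Symbol`, not the actual LLVM pass):

```cpp
#include <string>
#include <vector>

// Illustrative stand-ins for IR globals; not LLVM's real types.
struct Symbol {
  std::string name;
  bool isKernelEntry;   // analogous to AMDGPU::isEntryFunctionCC
  bool isDeclaration;   // externally defined; linkage must be kept
  bool internal = false;
};

// The preservation predicate: keep anything the host must still see.
static bool mustPreserve(const Symbol &S) {
  return S.isKernelEntry || S.isDeclaration;
}

// Internalize everything the predicate does not protect.
static void internalize(std::vector<Symbol> &Module) {
  for (Symbol &S : Module)
    if (!mustPreserve(S))
      S.internal = true;
}
```

In LLVM itself the generic `InternalizePass` already accepts a must-preserve callback of exactly this shape, which is the basis of the suggestion above to drop the special AMDGPU filter in favor of the ordinary pass.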

Contributor

Do you have an example of what kind of functions aren't being internalized? Last time I had this issue it's because the ROCm Device Libs were compiling with -fvisibility=protected for some reason.

@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
CmdArgs.push_back(Args.MakeArgString(
"--device-linker=" + TC->getTripleString() + "=" + Arg));

// Enable internalization for AMDGPU.
if (TC->getTriple().isAMDGPU())
Contributor

Braces


@ronlieb
Contributor

ronlieb commented May 3, 2025

downstream testing is seeing quite a few tests failing with:
OFFLOAD ERROR: Memory access fault by GPU 2 (agent 0x145ba00) at virtual address (nil). Reasons: Unknown (0)

@ronlieb
Contributor

ronlieb commented May 3, 2025

also seeing
"PluginInterface" error: Failure to look up global address: Error in hsa_executable_get_symbol_by_name(grid_points): HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name.
omptarget error: Failed to load symbol grid_points

@jhuber6
Contributor

jhuber6 commented May 3, 2025

also seeing "PluginInterface" error: Failure to look up global address: Error in hsa_executable_get_symbol_by_name(grid_points): HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name. omptarget error: Failed to load symbol grid_points

Yeah, this is what I was assuming would happen. This is a force-internalize pass that overrides normal LTO handling. The OpenMP toolchain relies on normal handling. I'm wondering why we needed it. This should probably be reverted.

@shiltian
Contributor Author

shiltian commented May 3, 2025

I don't think OpenMP is more special than HIP here. Anything exposed to the host should not be internalized. In addition, OpenMP itself heavily uses internalization in OpenMPOpt. It is likely that this change exposes something bad downstream.

The motivation for this change was to apply ThinLTO to OpenMP. With our current approach to linking the device runtime, it is treated as a regular input file and also goes through the backend, which is completely unnecessary. Its whole purpose is function import; after that, it should be discarded.

@shiltian
Contributor Author

shiltian commented May 3, 2025

also seeing

"PluginInterface" error: Failure to look up global address: Error in hsa_executable_get_symbol_by_name(grid_points): HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name.

omptarget error: Failed to load symbol grid_points

Can you open a ticket with details to reproduce and assign to me? I'll look into that on Monday.

@jhuber6
Contributor

jhuber6 commented May 3, 2025

I don't think OpenMP is more special than HIP here. Anything exposed to the host should not be internalized. In addition, OpenMP itself heavily uses internalization in OpenMPOpt. It is likely that this change exposes something bad downstream.

The motivation for this change was to apply ThinLTO to OpenMP. With our current approach to linking the device runtime, it is treated as a regular input file and also goes through the backend, which is completely unnecessary. Its whole purpose is function import; after that, it should be discarded.

I'm not fully understanding this. Right now the OpenMP library is linked as a normal static library via -lompdevice. What requires special treatment here? I went out of my way to remove the need for these AMD-specific hacks, so I'm wondering what necessitates putting this one back in.

@shiltian
Contributor Author

shiltian commented May 3, 2025

I get that you are trying to make it generic; that's why I didn't roll back to using builtin bitcode as we did before. However, there is one limitation we can't really work around: we don't support ABI linking. That is not a new topic at all, and whether or how to support it is completely off topic here. Even if we did support it, given how expensive it is to set up function calls, it might still arguably be better to let the optimization pass decide whether or not to import functions.

The way ThinLTO works, it compiles all input bitcode files in parallel after function import. The inputs here include both user code and the device runtime. We don't need to compile the device runtime at all, since every user of it has already imported the needed functions into its own module. Compiling it is not that "bad": it causes no correctness issue, but it wastes time and makes the final code object slightly larger. The real problem is that, because we can't properly lower local memory in non-kernel functions at the moment, the backend emits some warnings, which is not ideal at all. Once we internalize the entire device runtime, it becomes roughly an empty module after optimization, so the backend is happy and the confusing warnings are not emitted.
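The scenario described above can be modeled with toy data structures: function import copies each used runtime function into the calling module, after which the runtime module's own, now redundant, definitions can all be internalized and stripped by global DCE. A rough illustration using hypothetical symbol names like `__ockl_helper` (plain C++ containers, not real ThinLTO machinery):

```cpp
#include <cstddef>
#include <map>
#include <string>

// A "module" maps symbol name -> whether the symbol is internal.
using Module = std::map<std::string, bool>;

// ThinLTO-style import: copy the callee's definition into the caller's
// module so it can be inlined and optimized there.
static void importFunction(Module &Caller, const Module &Runtime,
                           const std::string &Fn) {
  if (Runtime.count(Fn))
    Caller[Fn] = true; // imported copies are internal to the caller
}

// Mark every definition in a module internal, as the forced
// internalization of the device runtime does.
static void internalizeAll(Module &M) {
  for (auto &[Name, Internal] : M)
    Internal = true;
}

// Count symbols that must survive to the final code object.
static std::size_t externallyVisible(const Module &M) {
  std::size_t N = 0;
  for (const auto &[Name, Internal] : M)
    if (!Internal)
      ++N;
  return N;
}
```

Once `externallyVisible(runtime)` reaches zero, nothing forces the runtime module through the backend, which is the effect being described: the module optimizes away and the local-memory warnings never fire.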

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 5, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025