
[Clang][Driver] Enable internalization by default for AMDGPU #138365

Merged
merged 1 commit into main on May 3, 2025

Conversation

shiltian
Contributor

@shiltian shiltian commented May 2, 2025

No description provided.

@llvmbot llvmbot added the clang (Clang issues not falling into any other category) and clang:driver ('clang' and 'clang++' user-facing binaries; not 'clang-cl') labels on May 2, 2025
Contributor Author

shiltian commented May 2, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@shiltian shiltian requested review from jdoerfert and jhuber6 May 2, 2025 23:51
@llvmbot
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-clang

Author: Shilei Tian (shiltian)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/138365.diff

2 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/Clang.cpp (+6)
  • (modified) clang/test/Driver/openmp-offload-gpu.c (+7)
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 6dabf9d842b7b..093f6fe9d5c77 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
         CmdArgs.push_back(Args.MakeArgString(
             "--device-linker=" + TC->getTripleString() + "=" + Arg));
 
+      // Enable internalization for AMDGPU.
+      if (TC->getTriple().isAMDGPU())
+        CmdArgs.push_back(
+            Args.MakeArgString("--device-linker=" + TC->getTripleString() +
+                               "=-plugin-opt=-amdgpu-internalize-symbols"));
+
       // Forward the LTO mode relying on the Driver's parsing.
       if (C.getDriver().getOffloadLTOMode() == LTOK_Full)
         CmdArgs.push_back(Args.MakeArgString(
diff --git a/clang/test/Driver/openmp-offload-gpu.c b/clang/test/Driver/openmp-offload-gpu.c
index 5ad038b7dbd37..73dc5a7f75d5b 100644
--- a/clang/test/Driver/openmp-offload-gpu.c
+++ b/clang/test/Driver/openmp-offload-gpu.c
@@ -393,3 +393,10 @@
 // RUN:     --offload-arch=sm_52 -foffload-lto=thin -nogpulib -nogpuinc %s 2>&1 \
 // RUN:   | FileCheck --check-prefix=THINLTO-SM52 %s
 // THINLTO-SM52: --device-compiler=nvptx64-nvidia-cuda=-flto=thin
+
+// Check that internalization is enabled for OpenMP offloading on AMDGPU.
+//
+// RUN:   %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
+// RUN:     --offload-arch=gfx906 -nogpulib -nogpuinc %s 2>&1 \
+// RUN:   | FileCheck --check-prefix=INTERNALIZATION-GFX906 %s
+// INTERNALIZATION-GFX906: --device-linker=amdgcn-amd-amdhsa=-plugin-opt=-amdgpu-internalize-symbols

@llvmbot
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-clang-driver

Author: Shilei Tian (shiltian)


@shiltian shiltian force-pushed the users/shiltian/enable-internalization-for-amdgpu branch from acc89cf to c9a8e6e Compare May 2, 2025 23:54
@shiltian shiltian merged commit 2b62a31 into main May 3, 2025
9 of 10 checks passed
@shiltian shiltian deleted the users/shiltian/enable-internalization-for-amdgpu branch May 3, 2025 01:32
if (TC->getTriple().isAMDGPU())
CmdArgs.push_back(
Args.MakeArgString("--device-linker=" + TC->getTripleString() +
"=-plugin-opt=-amdgpu-internalize-symbols"));
Contributor

I forget what this hacked around; I was hoping we didn't need it anymore. Maybe @arsenm remembers.

Contributor Author

Not really a hack TBH, though we are the only target explicitly using it. It significantly affects our performance. The remaining uses of this pass are in (Thin)LTO, which has a broader impact.

Contributor

This looks more like "always" than "by default".

The implementation details are kind of a hack. We should be able to just run the ordinary internalize pass without the special AMDGPU filter.

static bool mustPreserveGV(const GlobalValue &GV) {

The most legitimate part of this is the isEntryFunctionCC, which we could just make the internalize pass directly do. The sanitizer name hacks are also obviously hacks.

The attempt to drop constant users looks like a bad side effect that a predicate should not have; I don't know what that's doing there.
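Conceptually, the pass under discussion walks every global and marks it internal unless a preservation predicate protects it; `mustPreserveGV` is that predicate, and the `isEntryFunctionCC` check keeps kernel entry points visible to the host. A minimal stand-alone sketch of that shape (plain C++ with illustrative stand-in types such as `Symbol`, not the actual LLVM pass):

```cpp
#include <string>
#include <vector>

// Illustrative stand-ins for IR globals; not LLVM's real types.
struct Symbol {
  std::string name;
  bool isKernelEntry;   // analogous to AMDGPU::isEntryFunctionCC
  bool isDeclaration;   // externally defined; linkage must be kept
  bool internal = false;
};

// The preservation predicate: keep anything the host must still see.
static bool mustPreserve(const Symbol &S) {
  return S.isKernelEntry || S.isDeclaration;
}

// Internalize everything the predicate does not protect.
static void internalize(std::vector<Symbol> &Module) {
  for (Symbol &S : Module)
    if (!mustPreserve(S))
      S.internal = true;
}
```

In LLVM itself the generic `InternalizePass` already accepts a must-preserve callback of exactly this shape, which is the basis of the suggestion above to drop the special AMDGPU filter in favor of the ordinary pass.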

Contributor

Do you have an example of what kind of functions aren't being internalized? Last time I had this issue it's because the ROCm Device Libs were compiling with -fvisibility=protected for some reason.

@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
CmdArgs.push_back(Args.MakeArgString(
"--device-linker=" + TC->getTripleString() + "=" + Arg));

// Enable internalization for AMDGPU.
if (TC->getTriple().isAMDGPU())
Contributor

Braces


@ronlieb
Contributor

ronlieb commented May 3, 2025

downstream testing is seeing quite a few tests failing with:
OFFLOAD ERROR: Memory access fault by GPU 2 (agent 0x145ba00) at virtual address (nil). Reasons: Unknown (0)

@ronlieb
Contributor

ronlieb commented May 3, 2025

also seeing
"PluginInterface" error: Failure to look up global address: Error in hsa_executable_get_symbol_by_name(grid_points): HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name.
omptarget error: Failed to load symbol grid_points

@jhuber6
Contributor

jhuber6 commented May 3, 2025

also seeing "PluginInterface" error: Failure to look up global address: Error in hsa_executable_get_symbol_by_name(grid_points): HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name. omptarget error: Failed to load symbol grid_points

Yeah, this is what I was assuming would happen. This is a force-internalize pass that overrides normal LTO handling. The OpenMP toolchain relies on normal handling. I'm wondering why we needed it. This should probably be reverted.

@shiltian
Contributor Author

shiltian commented May 3, 2025

I don't think OpenMP is more special than HIP here. Anything exposed to the host should not be internalized. In addition, OpenMP itself heavily uses internalization in OpenMPOpt. It is likely that this change exposes something bad downstream.

The motivation for this change was to apply ThinLTO to OpenMP. With our current approach to linking the device runtime, it is treated as a regular input file and also goes through the backend, which is completely unnecessary. Its whole purpose is function import; after that, it should be discarded.

@shiltian
Contributor Author

shiltian commented May 3, 2025

also seeing

"PluginInterface" error: Failure to look up global address: Error in hsa_executable_get_symbol_by_name(grid_points): HSA_STATUS_ERROR_INVALID_SYMBOL_NAME: There is no symbol with the given name.

omptarget error: Failed to load symbol grid_points

Can you open a ticket with details to reproduce and assign to me? I'll look into that on Monday.

@jhuber6
Contributor

jhuber6 commented May 3, 2025

I don't think OpenMP is more special than HIP here. Anything exposed to the host should not be internalized. In addition, OpenMP itself heavily uses internalization in OpenMPOpt. It is likely that this change exposes something bad downstream.

The motivation for this change was to apply ThinLTO to OpenMP. With our current approach to linking the device runtime, it is treated as a regular input file and also goes through the backend, which is completely unnecessary. Its whole purpose is function import; after that, it should be discarded.

I'm not fully understanding this. Right now the OpenMP library is linked as a normal static library via -lompdevice. What requires special treatment here? I went out of my way to remove the need for these AMD-specific hacks, so I'm wondering what necessitates putting this one back in.

@shiltian
Contributor Author

shiltian commented May 3, 2025

I get that you are trying to make it generic; that's why I didn't roll back to using builtin bitcode as we did before. However, there is one limitation we can't really work around: we don't support ABI linking. That is not a new topic at all, and whether or how to support it is completely off topic here. Even if we did support it, given how expensive it is to set up function calls, it might still arguably be better to let the optimization pass decide whether or not to import functions.

The way ThinLTO works, it compiles all input bitcode files in parallel after function import. The inputs here include both user code and the device runtime. We don't need to compile the device runtime at all, since every user of it has already imported the needed functions into its own module. Compiling it is not that "bad": it causes no correctness issue, but it wastes time and makes the final code object slightly larger. The real problem is that, because we can't properly lower local memory in non-kernel functions at the moment, the backend emits some warnings, which is not ideal at all. Once we internalize the entire device runtime, it becomes roughly an empty module after optimization, so the backend is happy and the confusing warnings are not emitted.
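The scenario described above can be modeled with toy data structures: function import copies each used runtime function into the calling module, after which the runtime module's own, now redundant, definitions can all be internalized and stripped by global DCE. A rough illustration using hypothetical symbol names like `__ockl_helper` (plain C++ containers, not real ThinLTO machinery):

```cpp
#include <cstddef>
#include <map>
#include <string>

// A "module" maps symbol name -> whether the symbol is internal.
using Module = std::map<std::string, bool>;

// ThinLTO-style import: copy the callee's definition into the caller's
// module so it can be inlined and optimized there.
static void importFunction(Module &Caller, const Module &Runtime,
                           const std::string &Fn) {
  if (Runtime.count(Fn))
    Caller[Fn] = true; // imported copies are internal to the caller
}

// Mark every definition in a module internal, as the forced
// internalization of the device runtime does.
static void internalizeAll(Module &M) {
  for (auto &[Name, Internal] : M)
    Internal = true;
}

// Count symbols that must survive to the final code object.
static std::size_t externallyVisible(const Module &M) {
  std::size_t N = 0;
  for (const auto &[Name, Internal] : M)
    if (!Internal)
      ++N;
  return N;
}
```

Once `externallyVisible(runtime)` reaches zero, nothing forces the runtime module through the backend, which is the effect being described: the module optimizes away and the local-memory warnings never fire.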

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 5, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025