-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[Clang][Driver] Enable internalization by default for AMDGPU #138365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Clang][Driver] Enable internalization by default for AMDGPU #138365
Conversation
@llvm/pr-subscribers-clang Author: Shilei Tian (shiltian) ChangesFull diff: https://github.com/llvm/llvm-project/pull/138365.diff 2 Files Affected:
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 6dabf9d842b7b..093f6fe9d5c77 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
CmdArgs.push_back(Args.MakeArgString(
"--device-linker=" + TC->getTripleString() + "=" + Arg));
+ // Enable internalization for AMDGPU.
+ if (TC->getTriple().isAMDGPU())
+ CmdArgs.push_back(
+ Args.MakeArgString("--device-linker=" + TC->getTripleString() +
+ "=-plugin-opt=-amdgpu-internalize-symbols"));
+
// Forward the LTO mode relying on the Driver's parsing.
if (C.getDriver().getOffloadLTOMode() == LTOK_Full)
CmdArgs.push_back(Args.MakeArgString(
diff --git a/clang/test/Driver/openmp-offload-gpu.c b/clang/test/Driver/openmp-offload-gpu.c
index 5ad038b7dbd37..73dc5a7f75d5b 100644
--- a/clang/test/Driver/openmp-offload-gpu.c
+++ b/clang/test/Driver/openmp-offload-gpu.c
@@ -393,3 +393,10 @@
// RUN: --offload-arch=sm_52 -foffload-lto=thin -nogpulib -nogpuinc %s 2>&1 \
// RUN: | FileCheck --check-prefix=THINLTO-SM52 %s
// THINLTO-SM52: --device-compiler=nvptx64-nvidia-cuda=-flto=thin
+
+// Check that internalization is enabled for OpenMP offloading on AMDGPU.
+//
+// RUN: %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
+// RUN: --offload-arch=gfx906 -nogpulib -nogpuinc %s 2>&1 \
+// RUN: | FileCheck --check-prefix=INTERNALIZATION-GFX906 %s
+// INTERNALIZATION-GFX906: --device-linker=amdgcn-amd-amdhsa=-plugin-opt=-amdgpu-internalize-symbols
|
@llvm/pr-subscribers-clang-driver Author: Shilei Tian (shiltian) ChangesFull diff: https://github.com/llvm/llvm-project/pull/138365.diff 2 Files Affected:
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 6dabf9d842b7b..093f6fe9d5c77 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
CmdArgs.push_back(Args.MakeArgString(
"--device-linker=" + TC->getTripleString() + "=" + Arg));
+ // Enable internalization for AMDGPU.
+ if (TC->getTriple().isAMDGPU())
+ CmdArgs.push_back(
+ Args.MakeArgString("--device-linker=" + TC->getTripleString() +
+ "=-plugin-opt=-amdgpu-internalize-symbols"));
+
// Forward the LTO mode relying on the Driver's parsing.
if (C.getDriver().getOffloadLTOMode() == LTOK_Full)
CmdArgs.push_back(Args.MakeArgString(
diff --git a/clang/test/Driver/openmp-offload-gpu.c b/clang/test/Driver/openmp-offload-gpu.c
index 5ad038b7dbd37..73dc5a7f75d5b 100644
--- a/clang/test/Driver/openmp-offload-gpu.c
+++ b/clang/test/Driver/openmp-offload-gpu.c
@@ -393,3 +393,10 @@
// RUN: --offload-arch=sm_52 -foffload-lto=thin -nogpulib -nogpuinc %s 2>&1 \
// RUN: | FileCheck --check-prefix=THINLTO-SM52 %s
// THINLTO-SM52: --device-compiler=nvptx64-nvidia-cuda=-flto=thin
+
+// Check that internalization is enabled for OpenMP offloading on AMDGPU.
+//
+// RUN: %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
+// RUN: --offload-arch=gfx906 -nogpulib -nogpuinc %s 2>&1 \
+// RUN: | FileCheck --check-prefix=INTERNALIZATION-GFX906 %s
+// INTERNALIZATION-GFX906: --device-linker=amdgcn-amd-amdhsa=-plugin-opt=-amdgpu-internalize-symbols
|
acc89cf
to
c9a8e6e
Compare
if (TC->getTriple().isAMDGPU()) | ||
CmdArgs.push_back( | ||
Args.MakeArgString("--device-linker=" + TC->getTripleString() + | ||
"=-plugin-opt=-amdgpu-internalize-symbols")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget what this hacked around, was hoping we didn't need it anymore. Maybe @arsenm remembers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really a hack TBH, though we are the only target explicitly using it. It significantly affects our performance. The remaining uses of this pass are in (Thin)LTO, which has broader impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks more like "always" than "by default".
The implementation details are kind of a hack. We should be able to just run the ordinary internalize pass without the special AMDGPU filter.
static bool mustPreserveGV(const GlobalValue &GV) { |
The most legitimate part of this is the isEntryFunctionCC, which we could just make the internalize pass directly do. The sanitizer name hacks are also obviously hacks.
The attempt to drop constant users looks like a bad side effect a predicate should not have, I don't know what that's doing there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example of what kind of functions aren't being internalized? Last time I had this issue it's because the ROCm Device Libs were compiling with -fvisibility=protected
for some reason.
@@ -9284,6 +9284,12 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA, | |||
CmdArgs.push_back(Args.MakeArgString( | |||
"--device-linker=" + TC->getTripleString() + "=" + Arg)); | |||
|
|||
// Enable internalization for AMDGPU. | |||
if (TC->getTriple().isAMDGPU()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces
if (TC->getTriple().isAMDGPU()) | ||
CmdArgs.push_back( | ||
Args.MakeArgString("--device-linker=" + TC->getTripleString() + | ||
"=-plugin-opt=-amdgpu-internalize-symbols")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks more like "always" than "by default".
The implementation details are kind of a hack. We should be able to just run the ordinary internalize pass without the special AMDGPU filter.
static bool mustPreserveGV(const GlobalValue &GV) { |
The most legitimate part of this is the isEntryFunctionCC, which we could just make the internalize pass directly do. The sanitizer name hacks are also obviously hacks.
The attempt to drop constant users looks like a bad side effect a predicate should not have, I don't know what that's doing there
downstream testing is seeing quite a few tests failing with: |
also seeing |
Yeah, this is what I was assuming would happen. This is a force-internalize pass that overrides normal LTO handling. The OpenMP toolchain relies on normal handling. I'm wondering why we needed it. This should probably be reverted. |
I don't think OpenMP is more special than HIP here. Anything exposed to the host should not be internalized. In addition, OpenMP actually also heavily uses internalization as well in OpenMPOpt. It is likely that this change exposes something bad in the downstream. The motivation of this change was to apply ThinLTO for OpenMP. Given our current approach to link device runtime, it will be treated as a regular input file and then goes through the backend as well, which is completely unnecessary. Its whole purpose should be for function import. After that, it should be discarded. |
Can you open a ticket with details to reproduce and assign to me? I'll look into that on Monday. |
Not fully understanding this, right now the OpenMP library is linked as a normal static library |
I got it that you are trying to make it generic. That's why I didn’t roll back to using builtin bitcode as we did before. However there is one limitation that we can't really work around, which is the fact that we don't support ABI linking. This is not a new topic at all and whether to support it or how to do it is completely off topic here. Even if we support it, because of how expensive to set up function calls, it might still be arguably better to let the optimization pass to decide whether to import them or not. The way ThinLTO works is, it compiles all input bitcode files in parallel after function import. The input here includes both user code, as well as the device runtime. We don't need to compile device runtime at all, since all uses of it have already import them to their own module. It is not that "bad" to compile it, as it will not cause any correctness issue. It is however a waste of time and will make the final code object a little bit larger. The issue here is, because we can’t properly lower local memory in non-kernel function at the moment, backend warning will emit some warnings, which is not ideal at all. After we internalize the entire device runtime, it will roughly become an empty module after optimization thus the backend could be happy and the confusing warning will not be emitted. |
…lvm#138365)" This reverts commit 2b62a31.
No description provided.