-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Pass -offload-lto instead of -lto for cuda/hip kernels #125243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass -offload-lto instead of -lto for cuda/hip kernels #125243
Conversation
@llvm/pr-subscribers-clang @llvm/pr-subscribers-clang-driver Author: Omar Ahmed (omarahmed1111) ChangesClangLinkerWrapper tool in one of its clang commands to generate ptx kernel binary from llvm bitcode kernel was using -flto option which should be only used for cpu code not gpu kernel code. This PR fixes that by changing that to -foffload-lto for cuda/hip kernels. Full diff: https://github.com/llvm/llvm-project/pull/125243.diff 2 Files Affected:
diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index f416ee5f4463bc..0a2bc16a15ae49 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -21,7 +21,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK
-// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o
+// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o
// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70 \
@@ -30,7 +30,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --device-debug -O0 \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK-DEBUG
-// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o -g
+// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o -g
// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx908 \
@@ -39,7 +39,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LINK
-// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.amdgpu.bc,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx1030 \
@@ -48,7 +48,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --save-temps -O2 \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LTO-TEMPS
-// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -flto -Wl,--no-undefined {{.*}}.o -save-temps
+// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o -save-temps
// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
@@ -148,7 +148,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --clang-backend \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=CLANG-BACKEND
-// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o
+// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o
// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
@@ -171,8 +171,8 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t-on.o %t-off.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=AMD-TARGET-ID
-// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
-// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// RUN: clang-offload-packager -o %t-lib.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=generic
@@ -187,8 +187,8 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t1.o %t2.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=ARCH-ALL
-// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
-// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index c92590581a645c..ff516f2e69cbe6 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -498,11 +498,16 @@ Expected<StringRef> clang(ArrayRef<StringRef> InputFiles, const ArgList &Args) {
};
// Forward all of the `--offload-opt` and similar options to the device.
- CmdArgs.push_back("-flto");
for (auto &Arg : Args.filtered(OPT_offload_opt_eq_minus, OPT_mllvm))
CmdArgs.append(
{"-Xlinker",
Args.MakeArgString("--plugin-opt=" + StringRef(Arg->getValue()))});
+
+ if (Triple.isNVPTX() || Triple.isAMDGPU()) {
+ CmdArgs.push_back("-foffload-lto");
+ } else {
+ CmdArgs.push_back("-flto");
+ }
if (!Triple.isNVPTX() && !Triple.isSPIRV())
CmdArgs.push_back("-Wl,--no-undefined");
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
b771905
to
fcfe4fa
Compare
fcfe4fa
to
f3d466b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the motivation for this, it is intentionally passing -flto
because this is not an offloading compile job. This flag is mostly here for convenience, but it's not strictly necessary because the Nvlink job does LTO if it finds bitcode, so this just replaces an optional flag with a no-op.
Was there a failure associated with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use -flto
for the NVPTX target, https://godbolt.org/z/MYf6cc6e3.
if (Triple.isNVPTX() || Triple.isAMDGPU()) | ||
CmdArgs.push_back("-foffload-lto"); | ||
else | ||
CmdArgs.push_back("-flto"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There actually was a recent issue @ye-luo had with the default passing of -flto
breaking for x64 offloading. I could possibly see only passing -flto
for the NVPTX and AMDGPU toolchains because in those cases we know they support LTO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was breaking some tests where one of the pipeline commands was compiling using clang from llvm IR to ptx but the output from this was the input llvm IR (I think that means something have failed). After investigation, it appeared that -flto
is the reason for this breaking behaviour.
I don't have strong opinions on using -foffload-lto
, but I think if -flto
is not needed, we could avoid using it entirely for CUDA/AMD kernels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a reproducer or an example of what the error is? You can pass flto
to --target=nvptx64-nvidia-cuda
correctly. The only reason this would fail is if it's somehow using an older version of clang as far as I'm aware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the original issue: intel/llvm#16413 . It has some more details. I could try to see if I could come up with a concise reproducer for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would be greatly appreciated. I would recommend using -v
and -save-temps
to get the files that go into the embedded clang --target=nvptx64-nvidia-cuda
job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't, here's what I get with my current build
> clang -o output.img --target=nvptx64-nvidia-cuda -march=sm_50 -O2 test.ll -save-temps -v --cuda-path=/usr/local/cuda -flto
clang version 21.0.0git
Target: nvptx64-nvidia-cuda
Thread model: posix
InstalledDir: /home/jhuber/Documents/llvm/clang/bin
Build config: +assertions
"/home/jhuber/Documents/llvm/clang/bin/clang-21" -cc1 -triple nvptx64-nvidia-cuda -emit-llvm-bc -flto=full -flto-unit -dumpdir output.img- -save-temps=cwd -disable-free -clear-ast-before-backend -main-file-name test.ll -mrelocation-model static -mframe-pointer=all -ffp-contract=on -fno-rounding-math -no-integrated-as -target-cpu sm_50 -target-feature +ptx42 -debugger-tuning=gdb -fno-dwarf-directory-asm -fdebug-compilation-dir=/home/jhuber/Documents/llvm/llvm-project/build -v -resource-dir /home/jhuber/Documents/llvm/clang/lib/clang/21 -O2 -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcolor-diagnostics -vectorize-loops -vectorize-slp -o test.o -x ir test.ll
clang -cc1 version 21.0.0git based upon LLVM 21.0.0git default target x86_64-unknown-linux-gnu
"/home/jhuber/Documents/llvm/clang/bin/clang-nvlink-wrapper" -o output.img -v -arch sm_50 --cuda-path=/usr/local/cuda -L/home/jhuber/Documents/llvm/clang/lib -L/opt/rocm/lib -L. -L/opt/cuda/lib -L/home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L/home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda test.o -plugin-opt=mcpu=sm_50 -plugin-opt=O2 --plugin-opt=-mattr=+ptx42 -mllvm --nvptx-lower-global-ctor-dtor -L/home/jhuber/Documents/llvm/clang/lib
"/opt/cuda/bin/ptxas" -m64 -c /tmp/output-a5eb7b.s -v -O2 -arch sm_50 -o /tmp/output-08197e.cubin
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'test' for 'sm_50'
ptxas info : Function properties for test
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, used 0 barriers, 364 bytes cmem[0]
"/opt/cuda/bin/nvlink" -o output.img -v -arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib -L /opt/rocm/lib -L . -L /opt/cuda/lib -L /home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L /home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda --arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib /tmp/output-08197e.cubin
nvlink info : 0 bytes gmem
nvlink info : Function properties for 'test':
nvlink info : used 2 registers, used 0 barriers, 0 stack, 0 bytes smem, 364 bytes cmem[0], 0 bytes lmem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed the command that you use is not using -S
option. I tried without it, and it is working normally, but -S
option seems to introduce this behaviour, instead of generating ptx readable code, it generates the input llvm IR optimized when we use -flto
option. I am not really familiar with -S
option, so maybe that is the intended behaviour when we use it with -flto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what -flto -S
does, that's the expected behavior. You'd need to modify that if you were expecting the PTX for something, but that's not in any workflows I know of upstream so it'd be a separate code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is the problem in Intel fork pipeline. So, I will go investigate this more. Thanks a lot for help! @jhuber6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI if you need the PTX you should be able to use -Wl,-lto-emit-asm
for the above invocation without -S
.
This reverts this [PR](#16605) and add another solution to that problem mentioned in this [issue](#16413). This was the result of llvm upstream discussion [here](llvm/llvm-project#125243 (review))
ClangLinkerWrapper tool in one of its clang commands to generate ptx kernel binary from llvm bitcode kernel was using -flto option which should be only used for cpu code not gpu kernel code. This PR fixes that by changing that to -foffload-lto for cuda/hip kernels.