Skip to content

Pass -offload-lto instead of -lto for cuda/hip kernels #125243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

omarahmed1111
Copy link
Contributor

ClangLinkerWrapper tool in one of its clang commands to generate ptx kernel binary from llvm bitcode kernel was using -flto option which should be only used for cpu code not gpu kernel code. This PR fixes that by changing that to -foffload-lto for cuda/hip kernels.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Jan 31, 2025
@llvmbot
Copy link
Member

llvmbot commented Jan 31, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-driver

Author: Omar Ahmed (omarahmed1111)

Changes

ClangLinkerWrapper tool in one of its clang commands to generate ptx kernel binary from llvm bitcode kernel was using -flto option which should be only used for cpu code not gpu kernel code. This PR fixes that by changing that to -foffload-lto for cuda/hip kernels.


Full diff: https://github.com/llvm/llvm-project/pull/125243.diff

2 Files Affected:

  • (modified) clang/test/Driver/linker-wrapper.c (+9-9)
  • (modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+6-1)
diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index f416ee5f4463bc..0a2bc16a15ae49 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -21,7 +21,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK
 
-// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o
+// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70 \
@@ -30,7 +30,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --device-debug -O0 \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK-DEBUG
 
-// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o -g
+// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o -g
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx908 \
@@ -39,7 +39,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LINK
 
-// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.amdgpu.bc,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx1030 \
@@ -48,7 +48,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --save-temps -O2 \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LTO-TEMPS
 
-// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -flto -Wl,--no-undefined {{.*}}.o -save-temps
+// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o -save-temps
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
@@ -148,7 +148,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --clang-backend \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=CLANG-BACKEND
 
-// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o
+// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
@@ -171,8 +171,8 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t-on.o %t-off.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=AMD-TARGET-ID
 
-// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
-// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t-lib.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=generic
@@ -187,8 +187,8 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t1.o %t2.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=ARCH-ALL
 
-// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
-// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index c92590581a645c..ff516f2e69cbe6 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -498,11 +498,16 @@ Expected<StringRef> clang(ArrayRef<StringRef> InputFiles, const ArgList &Args) {
   };
 
   // Forward all of the `--offload-opt` and similar options to the device.
-  CmdArgs.push_back("-flto");
   for (auto &Arg : Args.filtered(OPT_offload_opt_eq_minus, OPT_mllvm))
     CmdArgs.append(
         {"-Xlinker",
          Args.MakeArgString("--plugin-opt=" + StringRef(Arg->getValue()))});
+  
+  if (Triple.isNVPTX() || Triple.isAMDGPU()) {
+    CmdArgs.push_back("-foffload-lto");
+  } else {
+    CmdArgs.push_back("-flto");
+  }
 
   if (!Triple.isNVPTX() && !Triple.isSPIRV())
     CmdArgs.push_back("-Wl,--no-undefined");

Copy link

github-actions bot commented Jan 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@omarahmed1111 omarahmed1111 force-pushed the change-flto-to-foffload-lto-for-cuda-kernels branch from b771905 to fcfe4fa Compare January 31, 2025 16:09
@omarahmed1111 omarahmed1111 force-pushed the change-flto-to-foffload-lto-for-cuda-kernels branch from fcfe4fa to f3d466b Compare February 3, 2025 14:06
@ldrumm ldrumm requested review from jhuber6 and sarnex February 3, 2025 14:26
Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the motivation for this, it is intentionally passing -flto because this is not an offloading compile job. This flag is mostly here for convenience, but it's not strictly necessary because the Nvlink job does LTO if it finds bitcode, so this just replaces an optional flag with a no-op.

Was there a failure associated with this?

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use -flto for the NVPTX target, https://godbolt.org/z/MYf6cc6e3.

if (Triple.isNVPTX() || Triple.isAMDGPU())
CmdArgs.push_back("-foffload-lto");
else
CmdArgs.push_back("-flto");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There actually was a recent issue @ye-luo had with the default passing of -flto breaking for x64 offloading. I could possibly see only passing -flto for the NVPTX and AMDGPU toolchains because in those cases we know they support LTO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was breaking some tests where one of the pipeline commands was compiling using clang from llvm IR to ptx but the output from this was the input llvm IR (I think that means something have failed). After investigation, it appeared that -flto is the reason for this breaking behaviour.

I don't have strong opinions on using -foffload-lto, but I think if -flto is not needed, we could avoid using it entirely for CUDA/AMD kernels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reproducer or an example of what the error is? You can pass flto to --target=nvptx64-nvidia-cuda correctly. The only reason this would fail is if it's somehow using an older version of clang as far as I'm aware.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original issue: intel/llvm#16413 . It has some more details. I could try to see if I could come up with a concise reproducer for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be greatly appreciated. I would recommend using -v and -save-temps to get the files that go into the embedded clang --target=nvptx64-nvidia-cuda job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, here's what I get with my current build

> clang -o output.img --target=nvptx64-nvidia-cuda -march=sm_50 -O2  test.ll -save-temps -v --cuda-path=/usr/local/cuda -flto 
clang version 21.0.0git
Target: nvptx64-nvidia-cuda
Thread model: posix
InstalledDir: /home/jhuber/Documents/llvm/clang/bin
Build config: +assertions
 "/home/jhuber/Documents/llvm/clang/bin/clang-21" -cc1 -triple nvptx64-nvidia-cuda -emit-llvm-bc -flto=full -flto-unit -dumpdir output.img- -save-temps=cwd -disable-free -clear-ast-before-backend -main-file-name test.ll -mrelocation-model static -mframe-pointer=all -ffp-contract=on -fno-rounding-math -no-integrated-as -target-cpu sm_50 -target-feature +ptx42 -debugger-tuning=gdb -fno-dwarf-directory-asm -fdebug-compilation-dir=/home/jhuber/Documents/llvm/llvm-project/build -v -resource-dir /home/jhuber/Documents/llvm/clang/lib/clang/21 -O2 -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcolor-diagnostics -vectorize-loops -vectorize-slp -o test.o -x ir test.ll
clang -cc1 version 21.0.0git based upon LLVM 21.0.0git default target x86_64-unknown-linux-gnu
 "/home/jhuber/Documents/llvm/clang/bin/clang-nvlink-wrapper" -o output.img -v -arch sm_50 --cuda-path=/usr/local/cuda -L/home/jhuber/Documents/llvm/clang/lib -L/opt/rocm/lib -L. -L/opt/cuda/lib -L/home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L/home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda test.o -plugin-opt=mcpu=sm_50 -plugin-opt=O2 --plugin-opt=-mattr=+ptx42 -mllvm --nvptx-lower-global-ctor-dtor -L/home/jhuber/Documents/llvm/clang/lib
 "/opt/cuda/bin/ptxas" -m64 -c /tmp/output-a5eb7b.s -v -O2 -arch sm_50 -o /tmp/output-08197e.cubin
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'test' for 'sm_50'
ptxas info    : Function properties for test
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, used 0 barriers, 364 bytes cmem[0]
 "/opt/cuda/bin/nvlink" -o output.img -v -arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib -L /opt/rocm/lib -L . -L /opt/cuda/lib -L /home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L /home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda --arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib /tmp/output-08197e.cubin
nvlink info    : 0 bytes gmem
nvlink info    : Function properties for 'test':
nvlink info    : used 2 registers, used 0 barriers, 0 stack, 0 bytes smem, 364 bytes cmem[0], 0 bytes lmem

Copy link
Contributor Author

@omarahmed1111 omarahmed1111 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed the command that you use is not using -S option. I tried without it, and it is working normally, but -S option seems to introduce this behaviour, instead of generating ptx readable code, it generates the input llvm IR optimized when we use -flto option. I am not really familiar with -S option, so maybe that is the intended behaviour when we use it with -flto

Copy link
Contributor

@jhuber6 jhuber6 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what -flto -S does, that's the expected behavior. You'd need to modify that if you were expecting the PTX for something, but that's not in any workflows I know of upstream so it'd be a separate code path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is the problem in Intel fork pipeline. So, I will go investigate this more. Thanks a lot for help! @jhuber6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI if you need the PTX you should be able to use -Wl,-lto-emit-asm for the above invocation without -S.

kbenzie pushed a commit to intel/llvm that referenced this pull request Feb 14, 2025
This reverts this [PR](#16605) and add
another solution to that problem mentioned in this
[issue](#16413). This was the result
of llvm upstream discussion
[here](llvm/llvm-project#125243 (review))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants