Pass -offload-lto instead of -lto for cuda/hip kernels #125243

omarahmed1111 · 2025-01-31T16:01:22Z

ClangLinkerWrapper tool in one of its clang commands to generate ptx kernel binary from llvm bitcode kernel was using -flto option which should be only used for cpu code not gpu kernel code. This PR fixes that by changing that to -foffload-lto for cuda/hip kernels.

llvmbot · 2025-01-31T16:01:58Z

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-driver

Author: Omar Ahmed (omarahmed1111)

Changes

ClangLinkerWrapper tool in one of its clang commands to generate ptx kernel binary from llvm bitcode kernel was using -flto option which should be only used for cpu code not gpu kernel code. This PR fixes that by changing that to -foffload-lto for cuda/hip kernels.

Full diff: https://github.com/llvm/llvm-project/pull/125243.diff

2 Files Affected:

(modified) clang/test/Driver/linker-wrapper.c (+9-9)
(modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+6-1)

diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index f416ee5f4463bc..0a2bc16a15ae49 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -21,7 +21,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK
 
-// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o
+// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70 \
@@ -30,7 +30,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --device-debug -O0 \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK-DEBUG
 
-// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o -g
+// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o -g
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx908 \
@@ -39,7 +39,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LINK
 
-// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.amdgpu.bc,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx1030 \
@@ -48,7 +48,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --save-temps -O2 \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LTO-TEMPS
 
-// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -flto -Wl,--no-undefined {{.*}}.o -save-temps
+// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o -save-temps
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
@@ -148,7 +148,7 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --clang-backend \
 // RUN:   --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=CLANG-BACKEND
 
-// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o
+// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
@@ -171,8 +171,8 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t-on.o %t-off.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=AMD-TARGET-ID
 
-// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
-// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t-lib.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=generic
@@ -187,8 +187,8 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
 // RUN:   --linker-path=/usr/bin/ld %t1.o %t2.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=ARCH-ALL
 
-// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
-// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
+// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
 
 // RUN: clang-offload-packager -o %t.out \
 // RUN:   --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index c92590581a645c..ff516f2e69cbe6 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -498,11 +498,16 @@ Expected<StringRef> clang(ArrayRef<StringRef> InputFiles, const ArgList &Args) {
   };
 
   // Forward all of the `--offload-opt` and similar options to the device.
-  CmdArgs.push_back("-flto");
   for (auto &Arg : Args.filtered(OPT_offload_opt_eq_minus, OPT_mllvm))
     CmdArgs.append(
         {"-Xlinker",
          Args.MakeArgString("--plugin-opt=" + StringRef(Arg->getValue()))});
+  
+  if (Triple.isNVPTX() || Triple.isAMDGPU()) {
+    CmdArgs.push_back("-foffload-lto");
+  } else {
+    CmdArgs.push_back("-flto");
+  }
 
   if (!Triple.isNVPTX() && !Triple.isSPIRV())
     CmdArgs.push_back("-Wl,--no-undefined");

github-actions · 2025-01-31T16:08:05Z

✅ With the latest revision this PR passed the C/C++ code formatter.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

jhuber6

I don't understand the motivation for this, it is intentionally passing -flto because this is not an offloading compile job. This flag is mostly here for convenience, but it's not strictly necessary because the Nvlink job does LTO if it finds bitcode, so this just replaces an optional flag with a no-op.

Was there a failure associated with this?

jhuber6

You can use -flto for the NVPTX target, https://godbolt.org/z/MYf6cc6e3.

jhuber6 · 2025-02-03T14:30:30Z

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

+  if (Triple.isNVPTX() || Triple.isAMDGPU())
+    CmdArgs.push_back("-foffload-lto");
+  else
+    CmdArgs.push_back("-flto");


There actually was a recent issue @ye-luo had with the default passing of -flto breaking for x64 offloading. I could possibly see only passing -flto for the NVPTX and AMDGPU toolchains because in those cases we know they support LTO.

It was breaking some tests where one of the pipeline commands was compiling using clang from llvm IR to ptx but the output from this was the input llvm IR (I think that means something have failed). After investigation, it appeared that -flto is the reason for this breaking behaviour.

I don't have strong opinions on using -foffload-lto, but I think if -flto is not needed, we could avoid using it entirely for CUDA/AMD kernels.

Do you have a reproducer or an example of what the error is? You can pass flto to --target=nvptx64-nvidia-cuda correctly. The only reason this would fail is if it's somehow using an older version of clang as far as I'm aware.

This is the original issue: intel/llvm#16413 . It has some more details. I could try to see if I could come up with a concise reproducer for this.

Yeah, that would be greatly appreciated. I would recommend using -v and -save-temps to get the files that go into the embedded clang --target=nvptx64-nvidia-cuda job.

It shouldn't, here's what I get with my current build

> clang -o output.img --target=nvptx64-nvidia-cuda -march=sm_50 -O2 test.ll -save-temps -v --cuda-path=/usr/local/cuda -flto clang version 21.0.0git Target: nvptx64-nvidia-cuda Thread model: posix InstalledDir: /home/jhuber/Documents/llvm/clang/bin Build config: +assertions "/home/jhuber/Documents/llvm/clang/bin/clang-21" -cc1 -triple nvptx64-nvidia-cuda -emit-llvm-bc -flto=full -flto-unit -dumpdir output.img- -save-temps=cwd -disable-free -clear-ast-before-backend -main-file-name test.ll -mrelocation-model static -mframe-pointer=all -ffp-contract=on -fno-rounding-math -no-integrated-as -target-cpu sm_50 -target-feature +ptx42 -debugger-tuning=gdb -fno-dwarf-directory-asm -fdebug-compilation-dir=/home/jhuber/Documents/llvm/llvm-project/build -v -resource-dir /home/jhuber/Documents/llvm/clang/lib/clang/21 -O2 -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcolor-diagnostics -vectorize-loops -vectorize-slp -o test.o -x ir test.ll clang -cc1 version 21.0.0git based upon LLVM 21.0.0git default target x86_64-unknown-linux-gnu "/home/jhuber/Documents/llvm/clang/bin/clang-nvlink-wrapper" -o output.img -v -arch sm_50 --cuda-path=/usr/local/cuda -L/home/jhuber/Documents/llvm/clang/lib -L/opt/rocm/lib -L. -L/opt/cuda/lib -L/home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L/home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda test.o -plugin-opt=mcpu=sm_50 -plugin-opt=O2 --plugin-opt=-mattr=+ptx42 -mllvm --nvptx-lower-global-ctor-dtor -L/home/jhuber/Documents/llvm/clang/lib "/opt/cuda/bin/ptxas" -m64 -c /tmp/output-a5eb7b.s -v -O2 -arch sm_50 -o /tmp/output-08197e.cubin ptxas info : 0 bytes gmem ptxas info : Compiling entry function 'test' for 'sm_50' ptxas info : Function properties for test 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 2 registers, used 0 barriers, 364 bytes cmem[0] "/opt/cuda/bin/nvlink" -o output.img -v -arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib -L /opt/rocm/lib -L . -L /opt/cuda/lib -L /home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L /home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda --arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib /tmp/output-08197e.cubin nvlink info : 0 bytes gmem nvlink info : Function properties for 'test': nvlink info : used 2 registers, used 0 barriers, 0 stack, 0 bytes smem, 364 bytes cmem[0], 0 bytes lmem

I just noticed the command that you use is not using -S option. I tried without it, and it is working normally, but -S option seems to introduce this behaviour, instead of generating ptx readable code, it generates the input llvm IR optimized when we use -flto option. I am not really familiar with -S option, so maybe that is the intended behaviour when we use it with -flto

That's what -flto -S does, that's the expected behavior. You'd need to modify that if you were expecting the PTX for something, but that's not in any workflows I know of upstream so it'd be a separate code path.

I think that is the problem in Intel fork pipeline. So, I will go investigate this more. Thanks a lot for help! @jhuber6

FYI if you need the PTX you should be able to use -Wl,-lto-emit-asm for the above invocation without -S.

This reverts this [PR](#16605) and add another solution to that problem mentioned in this [issue](#16413). This was the result of llvm upstream discussion [here](llvm/llvm-project#125243 (review))

llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Jan 31, 2025

omarahmed1111 mentioned this pull request Jan 31, 2025

Pass -foffload-lto instead of -flto for cuda/hip kernels in clangLinkerWrapper intel/llvm#16605

Merged

omarahmed1111 force-pushed the change-flto-to-foffload-lto-for-cuda-kernels branch from b771905 to fcfe4fa Compare January 31, 2025 16:09

jchlanda approved these changes Feb 3, 2025

View reviewed changes

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp Outdated Show resolved Hide resolved

Pass -offload-lto instead of -lto for cuda/hip kernels

f3d466b

omarahmed1111 force-pushed the change-flto-to-foffload-lto-for-cuda-kernels branch from fcfe4fa to f3d466b Compare February 3, 2025 14:06

ldrumm requested review from jhuber6 and sarnex February 3, 2025 14:26

jhuber6 requested changes Feb 3, 2025

View reviewed changes

jhuber6 reviewed Feb 3, 2025

View reviewed changes

omarahmed1111 mentioned this pull request Feb 4, 2025

Return the usage of -flto and use -lto-emit-asm instead intel/llvm#16884

Merged

omarahmed1111 closed this Feb 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pass -offload-lto instead of -lto for cuda/hip kernels #125243

Pass -offload-lto instead of -lto for cuda/hip kernels #125243

Uh oh!

omarahmed1111 commented Jan 31, 2025

Uh oh!

llvmbot commented Jan 31, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jan 31, 2025 •

edited

Loading

Uh oh!

Uh oh!

jhuber6 left a comment •

edited

Loading

Uh oh!

jhuber6 left a comment

Uh oh!

jhuber6 Feb 3, 2025

Uh oh!

omarahmed1111 Feb 3, 2025

Uh oh!

jhuber6 Feb 3, 2025

Uh oh!

omarahmed1111 Feb 3, 2025

Uh oh!

jhuber6 Feb 3, 2025

Uh oh!

jhuber6 Feb 4, 2025

Uh oh!

omarahmed1111 Feb 4, 2025 •

edited

Loading

Uh oh!

jhuber6 Feb 4, 2025 •

edited

Loading

Uh oh!

omarahmed1111 Feb 4, 2025

Uh oh!

jhuber6 Feb 4, 2025

Uh oh!

Uh oh!

Pass -offload-lto instead of -lto for cuda/hip kernels #125243

Pass -offload-lto instead of -lto for cuda/hip kernels #125243

Uh oh!

Conversation

omarahmed1111 commented Jan 31, 2025

Uh oh!

llvmbot commented Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jhuber6 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhuber6 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

omarahmed1111 Feb 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhuber6 Feb 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

llvmbot commented Jan 31, 2025 •

edited

Loading

github-actions bot commented Jan 31, 2025 •

edited

Loading

jhuber6 left a comment •

edited

Loading

omarahmed1111 Feb 4, 2025 •

edited

Loading

jhuber6 Feb 4, 2025 •

edited

Loading