Skip to content

Pass -offload-lto instead of -lto for cuda/hip kernels #125243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 9 additions & 9 deletions clang/test/Driver/linker-wrapper.c
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK

// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o
// NVPTX-LINK: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o

// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70 \
Expand All @@ -30,7 +30,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --device-debug -O0 \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=NVPTX-LINK-DEBUG

// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -flto {{.*}}.o {{.*}}.o -g
// NVPTX-LINK-DEBUG: clang{{.*}} -o {{.*}}.img --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -foffload-lto {{.*}}.o {{.*}}.o -g

// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx908 \
Expand All @@ -39,7 +39,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LINK

// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// AMDGPU-LINK: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o

// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.amdgpu.bc,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx1030 \
Expand All @@ -48,7 +48,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --save-temps -O2 \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=AMDGPU-LTO-TEMPS

// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -flto -Wl,--no-undefined {{.*}}.o -save-temps
// AMDGPU-LTO-TEMPS: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o -save-temps

// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
Expand Down Expand Up @@ -148,7 +148,7 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run --clang-backend \
// RUN: --linker-path=/usr/bin/ld %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=CLANG-BACKEND

// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o
// CLANG-BACKEND: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o

// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
Expand All @@ -171,8 +171,8 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t-on.o %t-off.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=AMD-TARGET-ID

// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack+ -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// AMD-TARGET-ID: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a:xnack- -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o

// RUN: clang-offload-packager -o %t-lib.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=generic
Expand All @@ -187,8 +187,8 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run \
// RUN: --linker-path=/usr/bin/ld %t1.o %t2.o %t.a -o a.out 2>&1 | FileCheck %s --check-prefix=ARCH-ALL

// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -flto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o
// ARCH-ALL: clang{{.*}} -o {{.*}}.img --target=amdgcn-amd-amdhsa -mcpu=gfx908 -O2 -foffload-lto -Wl,--no-undefined {{.*}}.o {{.*}}.o

// RUN: clang-offload-packager -o %t.out \
// RUN: --image=file=%t.elf.o,kind=openmp,triple=x86_64-unknown-linux-gnu \
Expand Down
6 changes: 5 additions & 1 deletion clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -498,12 +498,16 @@ Expected<StringRef> clang(ArrayRef<StringRef> InputFiles, const ArgList &Args) {
};

// Forward all of the `--offload-opt` and similar options to the device.
CmdArgs.push_back("-flto");
for (auto &Arg : Args.filtered(OPT_offload_opt_eq_minus, OPT_mllvm))
CmdArgs.append(
{"-Xlinker",
Args.MakeArgString("--plugin-opt=" + StringRef(Arg->getValue()))});

if (Triple.isNVPTX() || Triple.isAMDGPU())
CmdArgs.push_back("-foffload-lto");
else
CmdArgs.push_back("-flto");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There actually was a recent issue @ye-luo had with the default passing of -flto breaking for x64 offloading. I could possibly see only passing -flto for the NVPTX and AMDGPU toolchains because in those cases we know they support LTO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was breaking some tests where one of the pipeline commands was compiling using clang from llvm IR to ptx but the output from this was the input llvm IR (I think that means something have failed). After investigation, it appeared that -flto is the reason for this breaking behaviour.

I don't have strong opinions on using -foffload-lto, but I think if -flto is not needed, we could avoid using it entirely for CUDA/AMD kernels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reproducer or an example of what the error is? You can pass flto to --target=nvptx64-nvidia-cuda correctly. The only reason this would fail is if it's somehow using an older version of clang as far as I'm aware.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original issue: intel/llvm#16413 . It has some more details. I could try to see if I could come up with a concise reproducer for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be greatly appreciated. I would recommend using -v and -save-temps to get the files that go into the embedded clang --target=nvptx64-nvidia-cuda job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, here's what I get with my current build

> clang -o output.img --target=nvptx64-nvidia-cuda -march=sm_50 -O2  test.ll -save-temps -v --cuda-path=/usr/local/cuda -flto 
clang version 21.0.0git
Target: nvptx64-nvidia-cuda
Thread model: posix
InstalledDir: /home/jhuber/Documents/llvm/clang/bin
Build config: +assertions
 "/home/jhuber/Documents/llvm/clang/bin/clang-21" -cc1 -triple nvptx64-nvidia-cuda -emit-llvm-bc -flto=full -flto-unit -dumpdir output.img- -save-temps=cwd -disable-free -clear-ast-before-backend -main-file-name test.ll -mrelocation-model static -mframe-pointer=all -ffp-contract=on -fno-rounding-math -no-integrated-as -target-cpu sm_50 -target-feature +ptx42 -debugger-tuning=gdb -fno-dwarf-directory-asm -fdebug-compilation-dir=/home/jhuber/Documents/llvm/llvm-project/build -v -resource-dir /home/jhuber/Documents/llvm/clang/lib/clang/21 -O2 -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcolor-diagnostics -vectorize-loops -vectorize-slp -o test.o -x ir test.ll
clang -cc1 version 21.0.0git based upon LLVM 21.0.0git default target x86_64-unknown-linux-gnu
 "/home/jhuber/Documents/llvm/clang/bin/clang-nvlink-wrapper" -o output.img -v -arch sm_50 --cuda-path=/usr/local/cuda -L/home/jhuber/Documents/llvm/clang/lib -L/opt/rocm/lib -L. -L/opt/cuda/lib -L/home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L/home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda test.o -plugin-opt=mcpu=sm_50 -plugin-opt=O2 --plugin-opt=-mattr=+ptx42 -mllvm --nvptx-lower-global-ctor-dtor -L/home/jhuber/Documents/llvm/clang/lib
 "/opt/cuda/bin/ptxas" -m64 -c /tmp/output-a5eb7b.s -v -O2 -arch sm_50 -o /tmp/output-08197e.cubin
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'test' for 'sm_50'
ptxas info    : Function properties for test
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, used 0 barriers, 364 bytes cmem[0]
 "/opt/cuda/bin/nvlink" -o output.img -v -arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib -L /opt/rocm/lib -L . -L /opt/cuda/lib -L /home/jhuber/Documents/llvm/clang/bin/../lib/nvptx64-nvidia-cuda -L /home/jhuber/Documents/llvm/clang/lib/clang/21/lib/nvptx64-nvidia-cuda --arch sm_50 -L /home/jhuber/Documents/llvm/clang/lib /tmp/output-08197e.cubin
nvlink info    : 0 bytes gmem
nvlink info    : Function properties for 'test':
nvlink info    : used 2 registers, used 0 barriers, 0 stack, 0 bytes smem, 364 bytes cmem[0], 0 bytes lmem

Copy link
Contributor Author

@omarahmed1111 omarahmed1111 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed the command that you use is not using -S option. I tried without it, and it is working normally, but -S option seems to introduce this behaviour, instead of generating ptx readable code, it generates the input llvm IR optimized when we use -flto option. I am not really familiar with -S option, so maybe that is the intended behaviour when we use it with -flto

Copy link
Contributor

@jhuber6 jhuber6 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what -flto -S does, that's the expected behavior. You'd need to modify that if you were expecting the PTX for something, but that's not in any workflows I know of upstream so it'd be a separate code path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is the problem in Intel fork pipeline. So, I will go investigate this more. Thanks a lot for help! @jhuber6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI if you need the PTX you should be able to use -Wl,-lto-emit-asm for the above invocation without -S.


if (!Triple.isNVPTX() && !Triple.isSPIRV())
CmdArgs.push_back("-Wl,--no-undefined");

Expand Down