Skip to content

[SYCL][LIBCLC] Use half builtins for generic math functions #17163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 27, 2025

Conversation

Maetveis
Copy link
Contributor

__builtin_<name>() without a suffix is equivalent to the C math function <name>(), these functions take and return double values.

Instead of using these use the f16 suffixed builtins which take and return half values.

On CPU targets these will be usually lowered to
the single precision library calls (with promotion/truncation). On other targets (for example GPUs) the half builtins might be lowered directly to hardware instructions.

Some of the builtins are not available for half precision, in these cases use the f suffixed builtins (which take and return floats).

@Maetveis Maetveis requested a review from a team as a code owner February 25, 2025 15:12
@Maetveis Maetveis requested a review from jchlanda February 25, 2025 15:12
@Maetveis Maetveis requested a review from PietroGhg February 25, 2025 15:13
@Maetveis
Copy link
Contributor Author

@PietroGhg I couldn't find anything in #13829 that would suggest that using the double builtins was intentional, but LMK if I missed something.

@frasercrmck
Copy link
Contributor

This seems like a good enough change for now. I'll need to inspect the LLVM IR diff for myself to be sure.

However, I think this will all have to undergo some changes in the near future. libclc shouldn't be assuming support for __builtin functions by default, unless they can be shown to be supported natively on all targets (via generic codegen expansion, for example). I'm slowly adding fully-fledged half implementations upstream, and so opting out of those by default will just leave untested code. I think targets should be opting in to builtins, having the software impementations there as the default.

As it happens, I think this only "works" for DPC++ is because we only build Native CPU, AMDGPU, and NVIDIA libclc targets (ignoring the other targets upstream supports) and because AMDGPU and NVIDIA targets override these same maths functions to use their own builtins. So only Native CPU is actually going to "see" these implementations.

If this change is in fact intended for Native CPU, then I think that target should be the one choosing to use these builtins.

But like I said, we're already in this position so this change still seems like an improvement.

@Maetveis
Copy link
Contributor Author

However, I think this will all have to undergo some changes in the near future. libclc shouldn't be assuming support for __builtin functions by default, unless they can be shown to be supported natively on all targets (via generic codegen expansion, for example). I'm slowly adding fully-fledged half implementations upstream, and so opting out of those by default will just leave untested code. I think targets should be opting in to builtins, having the software impementations there as the default.

Agree.

As it happens, I think this only "works" for DPC++ is because we only build Native CPU, AMDGPU, and NVIDIA libclc targets (ignoring the other targets upstream supports) and because AMDGPU and NVIDIA targets override these same maths functions to use their own builtins. So only Native CPU is actually going to "see" these implementations.

If this change is in fact intended for Native CPU, then I think that target should be the one choosing to use these builtins.

This change was in-fact motivated by an out-of-tree undisclosed target, where we are (currently) still using the generic implementations. I can send you the details in some internal channel if you wish.

But like I said, we're already in this position so this change still seems like an improvement.

Yeah I figured the same myself, we could add the right builtins for the target downstream, but this felt to me like an improvement over the current state.

@frasercrmck
Copy link
Contributor

As it happens, I think this only "works" for DPC++ is because we only build Native CPU, AMDGPU, and NVIDIA libclc targets (ignoring the other targets upstream supports) and because AMDGPU and NVIDIA targets override these same maths functions to use their own builtins. So only Native CPU is actually going to "see" these implementations.
If this change is in fact intended for Native CPU, then I think that target should be the one choosing to use these builtins.

This change was in-fact motivated by an out-of-tree undisclosed target, where we are (currently) still using the generic implementations. I can send you the details in some internal channel if you wish.

Ah yeah I was copied into that discussion. I didn't spot that you were in that thread too. We can take the discussion offline.

But like I said, we're already in this position so this change still seems like an improvement.

Yeah I figured the same myself, we could add the right builtins for the target downstream, but this felt to me like an improvement over the current state.

It sounds as though we'll have several cases where we want to provide both software and builtin implementations in libclc, for multiple targets each. We'll have to think of the best way to make all this code available, and have targets select between what they want. There's a few mechanisms but we'll have to find the one that suits best.

@PietroGhg
Copy link
Contributor

@PietroGhg I couldn't find anything in #13829 that would suggest that using the double builtins was intentional, but LMK if I missed something.

Yeah you are right it wasn't intentional, the changes here look good to me

@Maetveis Maetveis force-pushed the libclc_generic_use_half_builtins branch from 8fd4f59 to 6767588 Compare February 26, 2025 08:51
@Maetveis Maetveis requested review from a team and bader as code owners February 26, 2025 08:51
@Maetveis Maetveis changed the base branch from sycl to sycl-web February 26, 2025 08:51
@Maetveis Maetveis removed request for a team February 26, 2025 08:52
@Maetveis
Copy link
Contributor Author

@jchlanda @frasercrmck I rebased and force-pushed to target sycl-web to pick up #17172

@Maetveis Maetveis requested review from jsji and maksimsab February 26, 2025 08:55
@Maetveis Maetveis force-pushed the libclc_generic_use_half_builtins branch 4 times, most recently from 7955a8b to cd4f104 Compare February 26, 2025 09:31
@Maetveis
Copy link
Contributor Author

Sorry for the noise, I thought GH is not triggering GH actions for some reason, but I guess sycl-web just doesn't have any workflows configured.

Copy link
Contributor

@frasercrmck frasercrmck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sycl-web doesn't run actions. Shame I just missed the pulldown window. I suppose if you are happy to land this on sycl-web that might make sense, but if you want it pulled down quicker maybe just merge into sycl and we can resolve the copysign conflicts as and when.

@Maetveis Maetveis force-pushed the libclc_generic_use_half_builtins branch from cd4f104 to f6a5d01 Compare February 26, 2025 09:58
@Maetveis Maetveis changed the base branch from sycl-web to sycl February 26, 2025 09:59
`__builtin_<name>()` without a suffix is equivalent to the C math
function `<name>()`, these functions take and return `double` values.

Instead of using these use the `f16` suffixed builtins which take and
return `half` values.

On CPU targets these will be usually lowered to
the single precision library calls with promotion/truncation. On other
targets (for example GPUs) the half builtins might be lowered directly
to hardware instructions.

Some of the builtins are not available for half precision, in these cases
use the `f` suffixed builtins (which take and return `float`s).
@Maetveis Maetveis force-pushed the libclc_generic_use_half_builtins branch from f6a5d01 to 6875c14 Compare February 26, 2025 10:00
@Maetveis
Copy link
Contributor Author

Yes, sycl-web doesn't run actions. Shame I just missed the pulldown window. I suppose if you are happy to land this on sycl-web that might make sense, but if you want it pulled down quicker maybe just merge into sycl and we can resolve the copysign conflicts as and when.

I am not in any hurry, but I'd prefer CI to run on this, so I changed it back to sycl, but dropped the copysign change. I think this should result in what we want, without conflicts when sycl-web is pulled down.

@Maetveis
Copy link
Contributor Author

The AMD CI failure is known flaky - #17077

@Maetveis
Copy link
Contributor Author

@intel/llvm-gatekeepers please merge :)

@steffenlarsen
Copy link
Contributor

@Maetveis - parallel_for_range_roundup is failing on HIP in CI. If it is unrelated, could you please open a tracker for it and mention both the test and the tracker in a comment here?

@Maetveis
Copy link
Contributor Author

Maetveis commented Feb 27, 2025

@Maetveis - parallel_for_range_roundup is failing on HIP in CI. If it is unrelated, could you please open a tracker for it and mention both the test and the tracker in a comment here?

@steffenlarsen
I believe the error is that SYCL::Basic/parallel_for_range_roundup.cpp is flaky on AMD CI, it is already tracked by #17077.

@martygrant martygrant merged commit 42ff9c2 into intel:sycl Feb 27, 2025
19 of 20 checks passed
@Maetveis Maetveis deleted the libclc_generic_use_half_builtins branch February 27, 2025 09:14
frasercrmck added a commit to frasercrmck/llvm that referenced this pull request Feb 27, 2025
This adds some missing generic address space overloads of CLC functions:
frexp, and whichever functions make use of the
unary_def_with_(int_)?ptr.inc files.

Note that this commit doesn't map the equivalent SPIR-V functions to
these CLC implementations. This is because the generic SPIR-V
implementation of FP16 types currently use the clang builtins as opposed
to the software implementations. As discussed in intel#17163, this is not the
proper behaviour for generic implementations, as not all targets support
the builtins. In future work we will need to either provide
target-specific overrides for the builtin implementations, or provide an
easy mechanism by which the CLC implementations can select between
builtins or software implementations automatically.
jsji pushed a commit that referenced this pull request Feb 27, 2025
This adds some missing generic address space overloads of CLC functions:
frexp, and whichever functions make use of the
unary_def_with_(int_)?ptr.inc files.

Note that this commit doesn't map the equivalent SPIR-V functions to
these CLC implementations. This is because the generic SPIR-V
implementation of FP16 types currently use the clang builtins as opposed
to the software implementations. As discussed in #17163, this is not the
proper behaviour for generic implementations, as not all targets support
the builtins. In future work we will need to either provide
target-specific overrides for the builtin implementations, or provide an
easy mechanism by which the CLC implementations can select between
builtins or software implementations automatically.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants