-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL][LIBCLC] Use half builtins for generic math functions #17163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][LIBCLC] Use half builtins for generic math functions #17163
Conversation
@PietroGhg I couldn't find anything in #13829 that would suggest that using the |
This seems like a good enough change for now. I'll need to inspect the LLVM IR diff for myself to be sure. However, I think this will all have to undergo some changes in the near future. libclc shouldn't be assuming support for As it happens, I think this only "works" for DPC++ is because we only build Native CPU, AMDGPU, and NVIDIA libclc targets (ignoring the other targets upstream supports) and because AMDGPU and NVIDIA targets override these same maths functions to use their own builtins. So only Native CPU is actually going to "see" these implementations. If this change is in fact intended for Native CPU, then I think that target should be the one choosing to use these builtins. But like I said, we're already in this position so this change still seems like an improvement. |
Agree.
This change was in-fact motivated by an out-of-tree undisclosed target, where we are (currently) still using the generic implementations. I can send you the details in some internal channel if you wish.
Yeah I figured the same myself, we could add the right builtins for the target downstream, but this felt to me like an improvement over the current state. |
Ah yeah I was copied into that discussion. I didn't spot that you were in that thread too. We can take the discussion offline.
It sounds as though we'll have several cases where we want to provide both software and builtin implementations in libclc, for multiple targets each. We'll have to think of the best way to make all this code available, and have targets select between what they want. There's a few mechanisms but we'll have to find the one that suits best. |
Yeah you are right it wasn't intentional, the changes here look good to me |
8fd4f59
to
6767588
Compare
@jchlanda @frasercrmck I rebased and force-pushed to target sycl-web to pick up #17172 |
7955a8b
to
cd4f104
Compare
Sorry for the noise, I thought GH is not triggering GH actions for some reason, but I guess sycl-web just doesn't have any workflows configured. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sycl-web doesn't run actions. Shame I just missed the pulldown window. I suppose if you are happy to land this on sycl-web that might make sense, but if you want it pulled down quicker maybe just merge into sycl and we can resolve the copysign conflicts as and when.
cd4f104
to
f6a5d01
Compare
`__builtin_<name>()` without a suffix is equivalent to the C math function `<name>()`, these functions take and return `double` values. Instead of using these use the `f16` suffixed builtins which take and return `half` values. On CPU targets these will be usually lowered to the single precision library calls with promotion/truncation. On other targets (for example GPUs) the half builtins might be lowered directly to hardware instructions. Some of the builtins are not available for half precision, in these cases use the `f` suffixed builtins (which take and return `float`s).
f6a5d01
to
6875c14
Compare
I am not in any hurry, but I'd prefer CI to run on this, so I changed it back to sycl, but dropped the copysign change. I think this should result in what we want, without conflicts when sycl-web is pulled down. |
The AMD CI failure is known flaky - #17077 |
@intel/llvm-gatekeepers please merge :) |
@Maetveis - parallel_for_range_roundup is failing on HIP in CI. If it is unrelated, could you please open a tracker for it and mention both the test and the tracker in a comment here? |
@steffenlarsen |
This adds some missing generic address space overloads of CLC functions: frexp, and whichever functions make use of the unary_def_with_(int_)?ptr.inc files. Note that this commit doesn't map the equivalent SPIR-V functions to these CLC implementations. This is because the generic SPIR-V implementation of FP16 types currently use the clang builtins as opposed to the software implementations. As discussed in intel#17163, this is not the proper behaviour for generic implementations, as not all targets support the builtins. In future work we will need to either provide target-specific overrides for the builtin implementations, or provide an easy mechanism by which the CLC implementations can select between builtins or software implementations automatically.
This adds some missing generic address space overloads of CLC functions: frexp, and whichever functions make use of the unary_def_with_(int_)?ptr.inc files. Note that this commit doesn't map the equivalent SPIR-V functions to these CLC implementations. This is because the generic SPIR-V implementation of FP16 types currently use the clang builtins as opposed to the software implementations. As discussed in #17163, this is not the proper behaviour for generic implementations, as not all targets support the builtins. In future work we will need to either provide target-specific overrides for the builtin implementations, or provide an easy mechanism by which the CLC implementations can select between builtins or software implementations automatically.
__builtin_<name>()
without a suffix is equivalent to the C math function<name>()
, these functions take and returndouble
values.Instead of using these use the
f16
suffixed builtins which take and returnhalf
values.On CPU targets these will be usually lowered to
the single precision library calls (with promotion/truncation). On other targets (for example GPUs) the half builtins might be lowered directly to hardware instructions.
Some of the builtins are not available for half precision, in these cases use the
f
suffixed builtins (which take and returnfloat
s).