[SYCL][CUDA] Joint_matrix elem wise ops inc bfloat16 #5964
Conversation
Added bfloat16 to the oneapi experimental namespace. Signed-off-by: jack.kirk <[email protected]>
…_BF16_CONVERSION.asciidoc
Removed aspect reference: can be added once the ext_oneapi_bfloat16 aspect is merged.
Signed-off-by: jack.kirk <[email protected]>
Further tests will be added to intel/llvm-test-suite very shortly.
intel/llvm-test-suite tests are here: intel/llvm-test-suite#975
/verify with intel/llvm-test-suite#975
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;
using sycl::ext::oneapi::experimental::bfloat16;
Is this compiling for you?
In my case, I have to explicitly include
#include <sycl/ext/oneapi/experimental/bfloat16.hpp>
for it to recognize bfloat16.
The reason for that is that sycl.hpp does not have that include.
Yeah, I thought we weren't meant to add these experimental headers to sycl.hpp. Are you using this invocation:
// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX=3 -S -Xclang -emit-llvm %s -o - | FileCheck %s
It isn't meant to be run by the runtime.
I did not try this code, but for other code that uses bfloat16, when I compile with dpcpp test.cpp I get:
use of undeclared identifier 'bfloat16'.
I had to include the bfloat16 header file.
Also, if you look at the bfloat16 examples, they also include it: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/BFloat16/bfloat16_type.hpp
If this include is required, we should add it to sycl.hpp, but I was wondering how come you did not need it for your test.
I see. These tests all use the SYCL_EXT_ONEAPI_MATRIX=3 macro, which includes matrix-tensorcore.hpp, which in turn includes bfloat16.hpp. I think this is why the explicit bfloat16.hpp include isn't needed.
Okay, that explains it.
But we should be consistent WRT including this header file.
Shouldn't we add that include to sycl.hpp like the other features do, and remove the scattered includes from the other files?
Happy to include it in sycl.hpp if it is decided that is what should happen for default extensions. However, there is an argument against that: C++17 is allowed (and present) in the extension headers as long as those headers aren't included in sycl.hpp (C++17 is not allowed in sycl.hpp for some reason).
So my understanding is we either:
a) allow C++17 in extensions and make it the user's responsibility to include the headers, or
b) include default extensions in sycl.hpp but remove all C++17 usage.
Whatever is decided, I think it should be enforced consistently.
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
/verify with intel/llvm-test-suite#975 |
} // namespace oneapi
} // namespace ext
template <typename T>
std::enable_if_t<std::is_same<T, bfloat16>::value, T> fabs(T x) {
Does this need to be a template function, or could it be bfloat16 fabs(bfloat16 x)? The same question applies to the other similar functions.
If I remove template <typename T> and the use of enable_if_t, the compiler sees multiple definitions of bfloat16 fabs() with the same uint16_t (the bfloat16 storage type) mangled name. I'm not completely sure why this happens, or why the templating and use of enable_if_t resolved it, but my guess is that it gets confused with the other marray definition.
Ah! It's probably confused by some ambiguity between bfloat16 and its storage type, thinking the overload could be called with a uint16_t through implicit conversion. Interesting! Thank you for clarifying. 😄
LGTM! I will let @dkhaldi have the last say on it though.
LGTM
The issue regarding whether to include bfloat16 header file in sycl.hpp or not can be addressed separately.
@JackAKirk - Is this ready to be merged?
Yes, this can be merged now. Note that I removed the uint16_t implementations of the math functions from this PR: there was some confusion about them in the reviews, and now that there is a complete bfloat16 implementation of these math functions, as well as a complete bfloat16 implementation of the matrix extension, this is a good time to remove those temporary implementations and avoid any future confusion about what to do with them.
There is an unexpected error in the ESIMD test suite, but I think it is most probably an unrelated failure; I don't see how it could relate to this patch.
It is now marked XFAIL: intel/llvm-test-suite#1066
I will merge this as soon as intel/llvm-test-suite#975 is ready.
Cool, thanks!
Requires intel/llvm#5964. bfloat16_builtins.cpp covers the bfloat16 scalar math function cases introduced by intel/llvm#5964, using the tests from #897 (which cover all "storage type" uint16_t impl cases). elem_wise_all_ops_cuda.cpp covers the portable element-wise ops using `wi_data`. Since CUDA does not support `joint_matrix_store` for certain data types that are only used in a/b-type matrices, such as bfloat16 and int8, it is necessary to perform a `joint_matrix_mad` operation and then call `joint_matrix_store` on the accumulator matrix in order to reach the host code check. Intel backend devices could still use this test in the future, provided that a backend check is introduced. Ideally both backends could eventually use the same test code. Signed-off-by: jack.kirk <[email protected]>
…6386) C++17 usage (if constexpr etc.) was added to experimental/builtins.hpp as requested in #5964, but I did not remove this header from sycl.hpp, since there were no failing tests and I didn't notice it was included in sycl.hpp. Apparently sycl.hpp should not include any C++17 usage. This may be related to some of the failing tests that appear only in CI: intel/llvm-test-suite#975 (comment). Necessary changes to the tests are added here: intel/llvm-test-suite#1072. Signed-off-by: JackAKirk [email protected]
This PR introduces full support for element-wise operations in the CUDA backend. `wi_data`, `get_matrix_fill`, and `joint_matrix.get_wi_data()` are introduced for portability with the Intel backend. In addition, in the CUDA backend users can call `joint_matrix.wi_marray` to access the marray that stores the WI-owned elements of the matrix and perform optimized element-wise operations using math functions that take marrays.
Support for bfloat16 element-wise operations is also included: this PR adds bfloat16 scalar/marray impls replacing the existing uint16_t "storage type" implementations of the fma, fmax, fmin, and fabs math functions. The bfloat16 fma_relu function impl has now been added directly in #5749.
The existing temporary uint16_t implementations (introduced in #5748 with unmerged tests intel/llvm-test-suite#897) have been removed, since the bfloat16 implementations replace them.