
[SYCL][CUDA] Joint_matrix elem wise ops inc bfloat16 #5964


Merged
merged 52 commits on Jun 30, 2022

Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Apr 5, 2022

This PR introduces full support for element-wise operations in the CUDA backend. `wi_data`, `get_matrix_fill`, and `joint_matrix.get_wi_data()` are introduced for portability with the Intel backend. In addition, in the CUDA backend users can call `joint_matrix.wi_marray` to access the marray that stores the WI-owned elements of the matrix and perform optimized element-wise operations using the math functions that take marrays.
bfloat16 element-wise operations are also supported: this PR adds bfloat16 scalar/marray implementations of the fma, fmax, fmin, and fabs math functions, replacing the existing uint16_t "storage type" implementations. The bfloat16 fma_relu function implementation was added directly in #5749.
The existing temporary uint16_t implementations (introduced in #5748, with unmerged tests in intel/llvm-test-suite#897) have been removed, since the bfloat16 implementations replace them.
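To make the "storage type" relationship concrete, here is a minimal host-side sketch (not the SYCL implementation, and the helper names are illustrative): bfloat16 keeps the top 16 bits of an IEEE-754 float, so a scalar math function such as fabs can be expressed by round-tripping through float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <cmath>

// Illustrative host-side model of the bfloat16 "storage type": the value is
// the upper 16 bits of a 32-bit IEEE-754 float (truncation, for simplicity).
static std::uint16_t float_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16); // drop low mantissa bits
}

static float bf16_to_float(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// A scalar math function on the storage type, mirroring in shape (only) the
// bfloat16 scalar builtins this PR adds.
static std::uint16_t bf16_fabs(std::uint16_t x) {
    return float_to_bf16(std::fabs(bf16_to_float(x)));
}
```

Values exactly representable in bfloat16 (such as 1.5f) survive the round trip unchanged, which is why tests of these builtins can compare against float references.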

@JackAKirk JackAKirk requested a review from a team as a code owner April 5, 2022 16:06
@JackAKirk JackAKirk requested a review from dm-vodopyanov April 5, 2022 16:06
@JackAKirk
Contributor Author

Further tests will be added to intel/llvm-test-suite shortly.

@JackAKirk
Contributor Author

intel/llvm-test-suite tests are here: intel/llvm-test-suite#975

@JackAKirk
Contributor Author

/verify with intel/llvm-test-suite#975

1 similar comment
@JackAKirk
Contributor Author

/verify with intel/llvm-test-suite#975

@JackAKirk JackAKirk marked this pull request as draft April 7, 2022 16:56

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;
using sycl::ext::oneapi::experimental::bfloat16;
Contributor

Is this compiling for you?
In my case, I have to explicitly include
#include <sycl/ext/oneapi/experimental/bfloat16.hpp>
for it to recognize bfloat16.
The reason for that is that sycl.hpp does not have that include

Contributor Author

Yeah, I thought we weren't meant to add these experimental headers to sycl.hpp. Are you using this invocation:

// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX=3 -S -Xclang -emit-llvm %s -o -| FileCheck %s

Contributor Author

It isn't meant to be run by the runtime.

Contributor

I did not try this code, but for other code that uses bfloat16, when I compile with dpcpp test.cpp I get:
use of undeclared identifier 'bfloat16'.
I had to include the bfloat16 header file.
Also, the bfloat16 examples include it: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/BFloat16/bfloat16_type.hpp
If this include is required, we should add it to sycl.hpp, but I was wondering why you did not need it for your test.

Contributor Author

I see. These tests all use the SYCL_EXT_ONEAPI_MATRIX=3 macro, which includes matrix-tensorcore.hpp, which in turn includes bfloat16.hpp. I think this is why the explicit bfloat16.hpp include isn't needed.

Contributor

Okay, that explains it.
But we should be consistent with respect to including this header file.
Shouldn't we add the include to sycl.hpp like the other features and remove the extra scattered includes from the other files?

Contributor Author

I'm happy to include it in sycl.hpp if it is decided that is what should happen for default extensions. However, there is an argument against that: C++17 is allowed (and used) in the extension headers so long as those headers aren't included in sycl.hpp (C++17 is not allowed in sycl.hpp, for some reason).

So my understanding is we either:
a) allow C++17 in extensions and make it the user's responsibility to include the headers, or
b) include default extensions in sycl.hpp but remove all C++17 usage.

Whatever is decided I think it should be enforced consistently.
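For context on option (b)'s cost, here is a small sketch of the kind of C++17 the extension headers rely on, assuming a C++17 compiler (the function name is illustrative, not from the real headers): `if constexpr` lets one template dispatch between type-specific paths without writing separate overloads, which is awkward to express in pre-C++17 code.

```cpp
#include <cassert>
#include <type_traits>

// Illustrative only: a single template that picks a code path at compile
// time with C++17 `if constexpr`. Pulling a header like this into a
// pre-C++17 sycl.hpp would fail to compile, which is the tension above.
template <typename T>
int dispatch_kind(T) {
    if constexpr (std::is_integral_v<T>)
        return 1; // e.g. a "storage type" (integral) path
    else
        return 2; // e.g. a floating-point path
}
```

The pre-C++17 alternative would be tag dispatch or SFINAE overloads, i.e. more code for the same behavior.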

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

/verify with intel/llvm-test-suite#975

} // namespace oneapi
} // namespace ext
template <typename T>
std::enable_if_t<std::is_same<T, bfloat16>::value, T> fabs(T x) {
Contributor

Does this need to be a template function or could it be bfloat16 fabs(bfloat16 x)? Same question applies to other similar functions.

Contributor Author

If I remove `template <typename T>` and the use of enable_if_t, then the compiler sees multiple definitions of bfloat16 fabs() with the same uint16_t (bfloat16 storage type) mangled name. I'm not completely sure why this is, or why the templating and enable_if_t resolve it, but I guess it gets confused with the other marray definition.

Contributor

Ah! It's probably confused by some ambiguity between bfloat16 and its storage type, thinking this could be called with a uint16_t through implicit conversion. Interesting! Thank you for clarifying. 😄
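The constrained-template pattern under discussion can be sketched on the host with a small mock-up (not the real SYCL type; names and the implicit constructor are assumptions for illustration): the enable_if_t constraint makes fabs participate in overload resolution only when the argument is exactly bfloat16, so a uint16_t can no longer reach it via implicit conversion.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative mock-up of a bfloat16 wrapper over its uint16_t storage type,
// with an implicit converting constructor (the source of the ambiguity).
struct bfloat16 {
    std::uint16_t raw;
    bfloat16(std::uint16_t r = 0) : raw(r) {} // implicit conversion
};

// The enable_if_t constraint restricts this overload to exactly bfloat16,
// mirroring the shape of the fabs signature in the PR.
template <typename T>
std::enable_if_t<std::is_same<T, bfloat16>::value, T> fabs(T x) {
    x.raw &= 0x7FFF; // clear the sign bit of the storage representation
    return x;
}
```

With a plain `bfloat16 fabs(bfloat16)` overload instead, a uint16_t argument could also bind through the converting constructor, which is the kind of ambiguity the thread describes.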

Contributor

@steffenlarsen steffenlarsen left a comment

LGTM! I will let @dkhaldi have the last say on it though.

@steffenlarsen steffenlarsen requested a review from dkhaldi June 24, 2022 14:34
Contributor

@dkhaldi dkhaldi left a comment

LGTM.
The issue of whether to include the bfloat16 header file in sycl.hpp can be addressed separately.

@steffenlarsen
Contributor

@JackAKirk - Is this ready to be merged?

@JackAKirk
Contributor Author

JackAKirk commented Jun 27, 2022

@JackAKirk - Is this ready to be merged?

Yes, this can be merged now. Note that I removed the uint16_t implementations of the math functions in this PR: there was some confusion in the reviews about them, and since there is now a complete bfloat16 implementation of these math functions as well as of the matrix extension, this is a good time to remove the temporary implementations and avoid any future confusion about what to do with them.
Note that the tests for these uint16_t impls were never merged (intel/llvm-test-suite#897), so intel/llvm-test-suite#897 can be discarded.

@JackAKirk
Contributor Author


There is an unexpected error in the ESIMD test-suite but I think this is most probably an unrelated failure. I don't see how it can relate to this patch.

@JackAKirk
Contributor Author


It is now marked XFAIL: intel/llvm-test-suite#1066

@steffenlarsen
Contributor

I will merge this as soon as intel/llvm-test-suite#975 is ready.

@JackAKirk
Contributor Author

I will merge this as soon as intel/llvm-test-suite#975 is ready.

Cool thanks!

steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Jun 30, 2022
requires intel/llvm#5964

bfloat16_builtins.cpp covers the bfloat16 scalar math function cases introduced by intel/llvm#5964, using the tests from #897 (that cover all "storage type" uint16_t impl cases).

elem_wise_all_ops_cuda.cpp covers the portable elem wise ops using `wi_data`. Since CUDA does not support `joint_matrix_store` for certain data types that are only used in a/b type matrices, such as bfloat16 and int8, it is necessary to perform a `joint_matrix_mad` operation and then call `joint_matrix_store` on the accumulator matrix in order to reach the host-code check.
Intel backend devices could still use this test in the future provided that a backend check is introduced. Ideally both backends could eventually use the same test code.

Signed-off-by: jack.kirk <[email protected]>
@steffenlarsen steffenlarsen merged commit 0a1d751 into intel:sycl Jun 30, 2022
pvchupin pushed a commit that referenced this pull request Jul 1, 2022
…6386)

C++17 usage (`if constexpr`, etc.) was added to experimental/builtins.hpp as requested in #5964, but I did not remove this header from sycl.hpp, since there were no failing tests and I didn't notice it was included in sycl.hpp. Apparently sycl.hpp should not include any usage of C++17. This may be related to some of the failing tests that appear only on the CI: intel/llvm-test-suite#975 (comment). Necessary changes to the tests are added here: intel/llvm-test-suite#1072

Signed-off-by: JackAKirk [email protected]
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…el/llvm-test-suite#975)

requires intel#5964

bfloat16_builtins.cpp covers the bfloat16 scalar math function cases introduced by intel#5964, using the tests from intel/llvm-test-suite#897 (that cover all "storage type" uint16_t impl cases).

elem_wise_all_ops_cuda.cpp covers the portable elem wise ops using `wi_data`. Since CUDA does not support `joint_matrix_store` for certain data types that are only used in a/b type matrices, such as bfloat16 and int8, it is necessary to perform a `joint_matrix_mad` operation and then call `joint_matrix_store` on the accumulator matrix in order to reach the host-code check.
Intel backend devices could still use this test in the future provided that a backend check is introduced. Ideally both backends could eventually use the same test code.

Signed-off-by: jack.kirk <[email protected]>