
[SYCL][CUDA] Joint_matrix elem wise ops inc bfloat16 #5964


Merged
merged 52 commits on Jun 30, 2022

Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Apr 5, 2022

This PR introduces full support for element-wise operations in the CUDA backend. `wi_data`, `get_matrix_fill`, and `joint_matrix.get_wi_data()` are introduced for portability with the Intel backend. In addition, in the CUDA backend users can call `joint_matrix.wi_marray` to access the marray that stores the WI-owned elements of the matrix and perform optimized element-wise operations using the math functions that take marrays.
bfloat16 element-wise operations are also supported: this PR adds bfloat16 scalar/marray implementations of the fma, fmax, fmin, and fabs math functions, replacing the existing uint16_t "storage type" implementations. The bfloat16 fma_relu function implementation was added directly in #5749.
The existing temporary uint16_t implementations (introduced in #5748, with unmerged tests in intel/llvm-test-suite#897) have been removed, since the bfloat16 implementations replace them.
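To make the "storage type" relationship concrete, here is a minimal host-side sketch (not the SYCL implementation, and the helper names are illustrative): bfloat16 keeps the top 16 bits of an IEEE-754 float, so a scalar math function such as fabs can be expressed by round-tripping through float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <cmath>

// Illustrative host-side model of the bfloat16 "storage type": the value is
// the upper 16 bits of a 32-bit IEEE-754 float (truncation, for simplicity).
static std::uint16_t float_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16); // drop low mantissa bits
}

static float bf16_to_float(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// A scalar math function on the storage type, mirroring in shape (only) the
// bfloat16 scalar builtins this PR adds.
static std::uint16_t bf16_fabs(std::uint16_t x) {
    return float_to_bf16(std::fabs(bf16_to_float(x)));
}
```

Values exactly representable in bfloat16 (such as 1.5f) survive the round trip unchanged, which is why tests of these builtins can compare against float references.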

@JackAKirk JackAKirk requested a review from a team as a code owner April 5, 2022 16:06
@JackAKirk JackAKirk requested a review from dm-vodopyanov April 5, 2022 16:06
@JackAKirk
Contributor Author

Further tests will be added to intel/llvm-test-suite shortly.

@JackAKirk
Contributor Author

intel/llvm-test-suite tests are here: intel/llvm-test-suite#975

@JackAKirk
Contributor Author

/verify with intel/llvm-test-suite#975

1 similar comment
@JackAKirk
Contributor Author

/verify with intel/llvm-test-suite#975

@JackAKirk JackAKirk marked this pull request as draft April 7, 2022 16:56

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;
using sycl::ext::oneapi::experimental::bfloat16;
Contributor

Is this compiling for you?
In my case, I have to explicitly include
#include <sycl/ext/oneapi/experimental/bfloat16.hpp>
for it to recognize bfloat16.
The reason for that is that sycl.hpp does not have that include

Contributor Author

Yeah, I thought we weren't meant to add these experimental headers to sycl.hpp. Are you using this invocation:

// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX=3 -S -Xclang -emit-llvm %s -o -| FileCheck %s

Contributor Author

It isn't meant to be run by the runtime.

Contributor

I did not try this code, but for other code that uses bfloat16, when I compile with dpcpp test.cpp I get:
use of undeclared identifier 'bfloat16'.
I had to include the bfloat16 header file.
Also, the bfloat16 examples include it: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/BFloat16/bfloat16_type.hpp
If this include is required, we should add it to sycl.hpp, but I was wondering why you did not need it for your test.

Contributor Author

I see. These tests all use the SYCL_EXT_ONEAPI_MATRIX=3 macro, which includes matrix-tensorcore.hpp, which in turn includes bfloat16.hpp. I think this is why the explicit bfloat16.hpp include isn't needed.

Contributor

Okay, that explains it.
But we should be consistent with respect to including this header file.
Shouldn't we add the include to sycl.hpp like the other features and remove the extra scattered includes from the other files?

Contributor Author

I'm happy to include it in sycl.hpp if it is decided that is what should happen for default extensions. However, there is an argument against that: C++17 is allowed (and used) in the extension headers so long as those headers aren't included in sycl.hpp (C++17 is not allowed in sycl.hpp, for some reason).

So my understanding is we either:
a) allow C++17 in extensions and make it the user's responsibility to include the headers, or
b) include default extensions in sycl.hpp but remove all C++17 usage.

Whatever is decided I think it should be enforced consistently.
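For context on option (b)'s cost, here is a small sketch of the kind of C++17 the extension headers rely on, assuming a C++17 compiler (the function name is illustrative, not from the real headers): `if constexpr` lets one template dispatch between type-specific paths without writing separate overloads, which is awkward to express in pre-C++17 code.

```cpp
#include <cassert>
#include <type_traits>

// Illustrative only: a single template that picks a code path at compile
// time with C++17 `if constexpr`. Pulling a header like this into a
// pre-C++17 sycl.hpp would fail to compile, which is the tension above.
template <typename T>
int dispatch_kind(T) {
    if constexpr (std::is_integral_v<T>)
        return 1; // e.g. a "storage type" (integral) path
    else
        return 2; // e.g. a floating-point path
}
```

The pre-C++17 alternative would be tag dispatch or SFINAE overloads, i.e. more code for the same behavior.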

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

/verify with intel/llvm-test-suite#975

} // namespace oneapi
} // namespace ext
template <typename T>
std::enable_if_t<std::is_same<T, bfloat16>::value, T> fabs(T x) {
Contributor

Does this need to be a template function or could it be bfloat16 fabs(bfloat16 x)? Same question applies to other similar functions.

Contributor Author

If I remove `template <typename T>` and the use of enable_if_t, then the compiler sees multiple definitions of bfloat16 fabs() with the same uint16_t (bfloat16 storage type) mangled name. I'm not completely sure why this is, or why the templating and enable_if_t resolve it, but I guess it gets confused with the other marray definition.

Contributor

Ah! It's probably confused by some ambiguity between bfloat16 and its storage type, thinking this could be called with a uint16_t through implicit conversion. Interesting! Thank you for clarifying. 😄
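The constrained-template pattern under discussion can be sketched on the host with a small mock-up (not the real SYCL type; names and the implicit constructor are assumptions for illustration): the enable_if_t constraint makes fabs participate in overload resolution only when the argument is exactly bfloat16, so a uint16_t can no longer reach it via implicit conversion.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative mock-up of a bfloat16 wrapper over its uint16_t storage type,
// with an implicit converting constructor (the source of the ambiguity).
struct bfloat16 {
    std::uint16_t raw;
    bfloat16(std::uint16_t r = 0) : raw(r) {} // implicit conversion
};

// The enable_if_t constraint restricts this overload to exactly bfloat16,
// mirroring the shape of the fabs signature in the PR.
template <typename T>
std::enable_if_t<std::is_same<T, bfloat16>::value, T> fabs(T x) {
    x.raw &= 0x7FFF; // clear the sign bit of the storage representation
    return x;
}
```

With a plain `bfloat16 fabs(bfloat16)` overload instead, a uint16_t argument could also bind through the converting constructor, which is the kind of ambiguity the thread describes.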

Contributor

@steffenlarsen steffenlarsen left a comment

LGTM! I will let @dkhaldi have the last say on it though.

@steffenlarsen steffenlarsen requested a review from dkhaldi June 24, 2022 14:34
Contributor

@dkhaldi dkhaldi left a comment

LGTM.
The issue of whether to include the bfloat16 header file in sycl.hpp can be addressed separately.

@steffenlarsen
Contributor

@JackAKirk - Is this ready to be merged?

@JackAKirk
Contributor Author

JackAKirk commented Jun 27, 2022

@JackAKirk - Is this ready to be merged?

Yes, this can be merged now. Note that I removed the uint16_t implementations of the math functions in this PR: there was some confusion in the reviews about them, and since there is now a complete bfloat16 implementation of these math functions as well as of the matrix extension, this is a good time to remove the temporary implementations and avoid any future confusion about what to do with them.
Note that the tests for these uint16_t impls were never merged (intel/llvm-test-suite#897), so intel/llvm-test-suite#897 can be discarded.

@JackAKirk
Contributor Author


There is an unexpected error in the ESIMD test-suite but I think this is most probably an unrelated failure. I don't see how it can relate to this patch.

@JackAKirk
Contributor Author


It is now marked XFAIL: intel/llvm-test-suite#1066

@steffenlarsen
Contributor

I will merge this as soon as intel/llvm-test-suite#975 is ready.

@JackAKirk
Contributor Author

I will merge this as soon as intel/llvm-test-suite#975 is ready.

Cool thanks!

steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Jun 30, 2022
requires intel/llvm#5964

bfloat16_builtins.cpp covers the bfloat16 scalar math function cases introduced by intel/llvm#5964, using the tests from #897 (that cover all "storage type" uint16_t impl cases).

elem_wise_all_ops_cuda.cpp covers the portable elem wise ops using `wi_data`. Since CUDA does not support `joint_matrix_store` for certain data types that are only used in a/b type matrices, such as bfloat16 and int8, it is necessary to perform a `joint_matrix_mad` operation and then call `joint_matrix_store` on the accumulator matrix in order to reach the host-code check.
Intel backend devices could still use this test in the future provided that a backend check is introduced. Ideally both backends could eventually use the same test code.

Signed-off-by: jack.kirk <[email protected]>
@steffenlarsen steffenlarsen merged commit 0a1d751 into intel:sycl Jun 30, 2022
pvchupin pushed a commit that referenced this pull request Jul 1, 2022
…6386)

C++17 usage (`if constexpr`, etc.) was added to experimental/builtins.hpp as requested in #5964, but I did not remove this header from sycl.hpp, since there were no failing tests and I didn't notice it was included in sycl.hpp. Apparently sycl.hpp should not include any usage of C++17. This may be related to some of the failing tests that appear only on the CI: intel/llvm-test-suite#975 (comment). Necessary changes to the tests are added here: intel/llvm-test-suite#1072

Signed-off-by: JackAKirk [email protected]
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…el/llvm-test-suite#975)

requires intel#5964

bfloat16_builtins.cpp covers the bfloat16 scalar math function cases introduced by intel#5964, using the tests from intel/llvm-test-suite#897 (that cover all "storage type" uint16_t impl cases).

elem_wise_all_ops_cuda.cpp covers the portable elem wise ops using `wi_data`. Since CUDA does not support `joint_matrix_store` for certain data types that are only used in a/b type matrices, such as bfloat16 and int8, it is necessary to perform a `joint_matrix_mad` operation and then call `joint_matrix_store` on the accumulator matrix in order to reach the host-code check.
Intel backend devices could still use this test in the future provided that a backend check is introduced. Ideally both backends could eventually use the same test code.

Signed-off-by: jack.kirk <[email protected]>