[SYCL][ext][CUDA] Use float as storage type for tf32 joint matrix #5870


Merged: 32 commits merged into intel:sycl on Jun 8, 2022

Conversation

@hdelan (Contributor) commented Mar 23, 2022

Changing joint_matrix impl to use float as storage type instead of uint32_t for tf32.

@hdelan hdelan requested a review from a team as a code owner March 23, 2022 15:50
@hdelan hdelan requested a review from v-klochkov March 23, 2022 15:50
@hdelan hdelan changed the title Tf32 joint matrix [SYCL][ext][CUDA] Use float as storage type for tf32 joint matrix Mar 23, 2022
@JackAKirk JackAKirk requested a review from dkhaldi March 23, 2022 16:10
@dkhaldi (Contributor) commented Mar 23, 2022

Here, you are changing the spec of joint_matrix to add a new template argument which is the actual data type (tf32) while using the existing type as the storage type (float here). I don't think this is what we discussed before.
The idea is to keep only one type argument to the joint matrix class and introduce a new tf32 type in the form of an empty class that is only accessed and used from within the matrix namespace.

@hdelan (Contributor, author) commented Mar 24, 2022

Hi @dkhaldi, thanks for your response. We have talked about this a bit internally, and we think that each approach has pros and cons:

Approach 1:

Using an extra template parameter on the joint_matrix class to specify the precision. This is the same as the current approach, but with enum class use_tf32 {yes, no}; replaced with something more generic, like

enum class precision { standard /* "default" is a reserved keyword in C++ */, tf32 /* other precisions for single-bit types, etc. */ };

This would default to precision::standard, so the user only needs to concern themselves with the enum when using a non-standard precision.

We could check that the precision parameter is compatible with the underlying type when the joint_matrix is constructed. Semantically this makes it clear that the programmer need only concern themselves with floats, and the implementation will take care of the tf32 precision bits.

It requires an extra template parameter; however, this could be useful down the line when other precisions are offered.

Another benefit of this approach is that there may be multiple mappings of matrix array types to joint_matrix::data registers: since the register type is determined by the precision parameter, the implementation could allow many mappings from array types to a given register type. This gives the implementation a lot of flexibility, and all the programmer needs to know is which combinations of array type and precision are allowed, which can easily be checked at compile time.

Approach 2:

Use an empty tf32 class as the type argument to the joint_matrix class. This avoids adding an extra template parameter, but it has some drawbacks:

  1. It encourages the programmer to treat tf32 as an actual arithmetic type, when in fact it is an empty class.
  2. It does not make the relationship between float and tf32 clear. If the programmer is constructing a joint matrix of type tf32, why should joint_matrix_load take a multi_ptr to a float? The programmer might try to make an accessor to tf32s instead, which would not work, as tf32 is an empty class.
  3. Errors of incompatibility between the storage type and the tf32 type would only be caught at joint_matrix_load, instead of one step earlier at the construction of the joint_matrix. Moreover, the errors are likely to be harder to parse than if they were caught when constructing the joint_matrix.
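To make the trade-off concrete, here is a minimal sketch of the two approaches. All names (joint_matrix_v1, joint_matrix_v2, precision, the fragment size) are hypothetical illustrations, not the actual SYCL extension API:

```cpp
#include <type_traits>

// Approach 1: the element type stays float; an extra template parameter
// selects the precision, and incompatible pairs fail at construction.
enum class precision { standard, tf32 };

template <typename T, int Rows, int Cols, precision P = precision::standard>
struct joint_matrix_v1 {
  static_assert(P != precision::tf32 || std::is_same<T, float>::value,
                "tf32 precision requires float as the underlying type");
  T data[4];  // illustrative per-work-item fragment
};

// Approach 2: an empty tag class is passed as the "element type"; the
// real storage type (float) is only derived internally.
struct tf32 {};  // empty class: cannot be pointed to or indexed directly

template <typename T, int Rows, int Cols>
struct joint_matrix_v2 {
  using storage_t =
      typename std::conditional<std::is_same<T, tf32>::value, float, T>::type;
  storage_t data[4];  // illustrative per-work-item fragment
};
```

With the first variant, a bad type/precision pair is rejected when the matrix is declared; with the second, the mismatch can only surface later, e.g. inside a load function, which corresponds to drawback 3 above.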

Please let me know your thoughts. Thanks

@dkhaldi (Contributor) commented Mar 29, 2022

I started putting together support for tf32. A draft PR can be found here:
#5920
This can give an idea of the changes needed to handle tf32 and of the way to differentiate between the element type and the storage type.
The missing parts are mainly related to SPIR-V, which is why I marked this as a draft. But in your case, since you don't support JIT and don't have element-wise ops yet, I believe adding the empty class will be the only change needed.

@hdelan (Contributor, author) commented Apr 8, 2022

I cannot see the logs for the test-suite run, but locally the test InorderQueue/in_order_get_property.cpp is failing, which is unrelated to this PR. Therefore I think this is ready to merge, if possible.

typename std::enable_if_t<
    Layout == sycl::ext::oneapi::experimental::matrix::matrix_layout::row_major ||
    Layout == sycl::ext::oneapi::experimental::matrix::matrix_layout::col_major>> {
  void load(sycl::ext::oneapi::experimental::matrix::joint_matrix<
-               T, Use, NumRows, NumCols, Layout, sycl::sub_group> &res,
+               S, Use, NumRows, NumCols, Layout, sycl::sub_group> &res,
Contributor

I am tagging @yubingex007-a11y here, as changing the type of the load will be necessary to handle the tf32 case: the type in memory can be different from the type of the joint matrix.
However, @JackAKirk, we should restrict this flexibility to tf32 only.
Can this work in the case of bfloat16, i.e. a load from float to bfloat16?

Contributor

Yeah, the final bfloat16 CUDA impl is ready now using the old API (#5964).

Sounds fine to restrict the flexibility: I think the way this is implemented already restricts it to the tf32 type. If we add subbyte/single-bit cases, then I think those would also hit the case where the type in memory differs from the type of the joint matrix.


@@ -573,6 +604,26 @@ joint_matrix_mad(
#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
}

float float_to_tf32(float a) {
Contributor

This is consistent with the fact that indexing an element of a joint matrix of type tf32 yields a float:
joint_matrix<precision::tf32, TM, TK> sub_a(sg);
sub_a.get_wi_data()[i] = float_to_tf32(sub_a.get_wi_data()[i]);
sub_a.get_wi_data()[i] is of type float, but numerically it is tf32 after this conversion.
Please add this clarification as a comment.

// CHECK: tail call i32 @llvm.nvvm.f2tf32.rna(float {{.*}}
// Round a, b to tf32
for (auto i = 0; i < 4; ++i)
sub_a.data[i] = float_to_tf32(sub_a.data[i]);
Contributor

This should be the expected way to perform the rounding, if users want to, but I still find exposing ".data" different from the element-wise indexing we are currently doing.
I would recommend moving from this to the current API:
sub_a.get_wi_data()[i] = float_to_tf32(sub_a.get_wi_data()[i]);

@JackAKirk (Contributor) commented Apr 12, 2022

This is exactly what we will do (but in a future PR). I switched the impl here to use marray for data in joint_matrix, so get_wi_data()[i] will call get_wi_elem, which returns the i-th element of the marray. We will loop over get_wi_data().length() as you do too.
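As a rough illustration of that plan (names are hypothetical stand-ins for the real joint_matrix and sycl::marray types):

```cpp
#include <cstddef>

// Sketch: the per-work-item fragment lives in an marray-like container,
// and get_wi_data() returns a lightweight view whose operator[] forwards
// to the i-th stored element.
template <typename T, std::size_t N>
struct wi_data_view {
  T (&storage)[N];  // stand-in for a reference to sycl::marray<T, N>
  T &operator[](std::size_t i) { return storage[i]; }
  constexpr std::size_t length() const { return N; }
};

template <typename T, std::size_t N>
struct joint_matrix_sketch {
  T data[N];
  wi_data_view<T, N> get_wi_data() { return {data}; }
};

// Element-wise rounding in the style the spec suggests:
//   for (std::size_t i = 0; i < m.get_wi_data().length(); ++i)
//     m.get_wi_data()[i] = round_to_tf32(m.get_wi_data()[i]);
```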

}

// This function just zeros out the bottom 13 bits of the tf32 type
float tf32_to_float(float a) {
Contributor

Is there a use case for this? CUTLASS has this.
If yes, rename it to truncate_to_tf32?

Contributor Author

I have renamed the function.
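For reference, here is a minimal sketch (my own illustration, not the code in this PR) of what a truncating conversion amounts to: tf32 keeps 10 explicit mantissa bits versus float's 23, so truncation clears the bottom 13 bits. Note that the actual hardware conversion (the llvm.nvvm.f2tf32.rna intrinsic checked above) rounds to nearest rather than truncating.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative only: zero the 13 low mantissa bits of a float, which is
// what truncating to tf32 precision amounts to.
float truncate_to_tf32_sketch(float a) {
  uint32_t bits;
  std::memcpy(&bits, &a, sizeof bits);  // bit-cast without UB
  bits &= ~uint32_t{0x1FFF};            // clear the bottom 13 bits
  float out;
  std::memcpy(&out, &bits, sizeof out);
  return out;
}
```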

@hdelan hdelan requested a review from a team as a code owner April 14, 2022 09:36
@dkhaldi (Contributor) previously approved these changes Apr 22, 2022 and left a comment

I am okay with deferring the ".data" change to a future PR to adapt to the current spec syntax, as follows:
sub_a.get_wi_data()[i] = round_to_tf32(sub_a.get_wi_data()[i]);

as addressed here: #5870 (comment)

Thus, this PR LGTM

@hdelan (Contributor, author) commented Apr 22, 2022

Thanks @dkhaldi !

@hdelan hdelan force-pushed the tf32-joint-matrix branch from d3e1247 to 438a9f2 Compare May 9, 2022 15:27
@hdelan hdelan requested review from a team and pvchupin as code owners May 9, 2022 15:27
@hdelan hdelan requested a review from smaslov-intel May 9, 2022 15:27
@hdelan hdelan force-pushed the tf32-joint-matrix branch from 21cc02c to 13b1efb Compare May 10, 2022 15:53
@dkhaldi (Contributor) left a comment

LGTM

@JackAKirk (Contributor) commented

@intel/llvm-reviewers-cuda, any more review for this? If not, it would be super nice if it could be merged within the next 12 hrs or so.

@pvchupin (Contributor) commented Jun 8, 2022

Ping @v-klochkov for review.

@steffenlarsen (Contributor) left a comment

Only a minor question. I am okay with merging as-is and potentially addressing it separately.

} else if constexpr (NumRows == 32 && NumCols == 8) {
__hmma_m32n8k16_ld_c_f32(res.data, src.get(), stride,
get_layout_id<Layout>());
if (std::is_same<S, float>::value) {
Contributor

Is there a reason for this not to be if constexpr?

Contributor

I think if constexpr is C++17, so it may have been removed for that reason. Although I noticed that C++17 is apparently allowed in the extension namespace, which I don't understand.
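For reference, a minimal sketch (hypothetical function, not this PR's code) of what `if constexpr` buys in templates: the discarded branch is not instantiated, so each branch only has to compile for the types that actually reach it, whereas a plain `if` instantiates both branches for every type.

```cpp
#include <type_traits>

// Hypothetical example: pick a per-element-type behavior at compile time.
// With C++17 `if constexpr`, the false branch is discarded entirely.
template <typename S>
int storage_bits(S) {
  if constexpr (std::is_same<S, float>::value) {
    return 32;  // e.g. dispatch to an f32 load intrinsic
  } else {
    return 16;  // e.g. dispatch to a half-precision load intrinsic
  }
}
```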

Contributor

I believe extensions that need to be explicitly included are allowed to use C++17 features.

@JackAKirk (Contributor) commented Jun 8, 2022

Allowed in the sense that tests don't appear to fail due to C++17 in the extension namespace, whereas they do fail due to C++17 constructs in other namespaces!

Contributor

I see, thanks.

@JackAKirk (Contributor) commented Jun 8, 2022

In that case I am happy to ensure C++17 is fully employed where appropriate in this extension in the follow-on PR: #5964

Contributor

> In that case I am happy to ensure C++17 is fully employed where appropriate in this extension in the follow-on PR: #5964

As long as it doesn't bleed into sycl.hpp then it should be fine. There should be a test that fails if it does.

@pvchupin pvchupin merged commit 2340b33 into intel:sycl Jun 8, 2022
pvchupin pushed a commit to intel/llvm-test-suite that referenced this pull request Jun 8, 2022
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
6 participants