[SYCL][Matrix] Add support for tf32 type #5920

Closed · wants to merge 9 commits

Conversation

@dkhaldi (Contributor) commented Mar 29, 2022:

This is a draft PR for initial support of the tf32 precision type in joint matrix.

@JackAKirk (Contributor):

I think this implementation is generally fine, and we will change our CUDA backend implementation to use the same interface, e.g.:

joint_matrix<precision::tf32, TM, TK>

etc, as discussed.
The only thing is that we want to be certain that it is better to have storage_element_type as float instead of int. We are implementing and testing the CUDA implementation with the free float_to_tf32 rounding function and double-checking that everything makes sense.
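
For context, TF32 keeps float's 8-bit exponent but only the top 10 mantissa bits. As a rough illustration of what such a rounding helper does (a hedged sketch assuming round-to-nearest-even, not the actual float_to_tf32 from the CUDA backend; NaN/Inf handling omitted):

#include <cstdint>
#include <cstring>

// Illustration only: round a float to TF32 precision and return it still
// stored in a float, which matches storage_element_type being float.
inline float tf32_round_sketch(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  // Bias trick for round-to-nearest-even on the 13 discarded mantissa bits.
  bits += 0x00000FFFu + ((bits >> 13) & 1u);
  bits &= 0xFFFFE000u; // clear the low 13 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}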

static constexpr size_t MATRIX_M = TM * 2;
static constexpr size_t MATRIX_N = TN * 2;
static constexpr size_t MATRIX_K = TK * 2;
precision::tf32 A[MATRIX_M][MATRIX_K];

@hdelan (Contributor) commented Mar 30, 2022:

Hi @dkhaldi. I'm a little confused by this test case. In the other file you say:

Users can't construct a tf32

But in this file you construct them and convert floats to be stored as tf32s. An empty class has size 1 byte, so it cannot hold a tf32 value. Is there something I'm missing here? Thanks.
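
(As an aside, the size point can be checked directly; these static_asserts are only an illustration, not part of the PR:)

namespace precision { class tf32 {}; }
// An empty class still occupies one byte, so an array of precision::tf32
// has no room to store 32-bit values.
static_assert(sizeof(precision::tf32) == 1, "empty class is one byte");
static_assert(sizeof(precision::tf32[4]) == 4, "no space for float payloads");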

Contributor:

In https://github.com/intel/llvm/pull/5870/files#diff-34520e7c212ec666342a649b7448a9841062a5e93eca4958990ac1653568f0d5R82
I am using the free functions defined in https://github.com/intel/llvm/pull/5870/files#diff-f71a436bdeda598b29caad471fa637a2844a12f38fe4e85b15b2ccb37bd09833R607 which accept and return floats, since floats are to be used as the fragment type for tf32.

Perhaps this is suitable instead of making the function return the empty class.

@dkhaldi (Author):

@hdelan, thanks, I will correct that. I made it this way so that I don't change the load API in this draft PR; I will change the load API and keep the buffers as floats.
Yes, the expectation is to use the free conversion function, with element indexing as its argument, to perform the conversion. The only issue is that we should be using the get_wi_data function to get the work-item portion, since it has better semantics than the raw "data" member, which can leave the user unsure whether it refers to the sub-group matrix or the work-item portion.
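
A rough sketch of that intended pattern inside the kernel, assuming sub_a is an already-loaded joint_matrix and round_to_tf32 stands in for whatever the free conversion function ends up being named:

// Hedged kernel-body fragment, not the final API: convert each of this
// work-item's elements through the free rounding function, indexing via
// get_wi_data() rather than the raw "data" member.
auto wi_slice_a = sub_a.get_wi_data();
for (int i = 0; i < wi_slice_a.length(); i++)
  wi_slice_a[i] = round_to_tf32(wi_slice_a[i] * 2.0f);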

@keryell (Contributor) left a comment:

Interesting.
You can simplify the code a little bit.


// Differentiating between the "element type" and the "storage element type"
template <typename T> struct helper_traits {
typedef T element_type;

Contributor:

Use modern using instead.
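
For instance, the trait above could use alias declarations (a sketch of the suggestion; only the member shown in the diff is reproduced):

// Same trait expressed with an alias declaration instead of typedef.
template <typename T> struct helper_traits {
  using element_type = T;
  // The storage_element_type member mentioned in the comment would follow
  // the same pattern.
};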

@@ -277,12 +306,17 @@ class wi_element {
}

#if __SYCL_DEVICE_ONLY__
// TODO: __spirv_VectorInsertDynamic should take storage element type as
// argument
#define OP(op) \

Contributor:

Perhaps it is a good opportunity to remove all these macros?

@dkhaldi (Author):

@yubingex007-a11y, I remember you changed this code to use macros to make it more compact.
The code was previously expanded for each of the ops; Bing changed it to remove the redundancy.
@keryell, what do you suggest we use instead?

Contributor:

Sometimes macros are the best or only reasonable solution.
In that case, use protected names like __DPC_SYCL_OP or similar, to avoid colliding with a user who decides to write in her program:

#define OP something

:-)
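
A small self-contained sketch of that pattern (wi_element_like and the macro name are illustrative, not the real wi_element): the macro keeps the compactness of Bing's change, uses a reserved-style name, and is #undef'd right after use so it cannot collide with user code.

#include <iostream>

template <typename T> struct wi_element_like {
  T val{};
// Reserved-style name plus an #undef after the operators are generated.
#define __SYCL_WI_ELEMENT_OP(op)                                              \
  wi_element_like &operator op##=(const T &rhs) {                             \
    val op##= rhs;                                                            \
    return *this;                                                             \
  }
  __SYCL_WI_ELEMENT_OP(+)
  __SYCL_WI_ELEMENT_OP(-)
  __SYCL_WI_ELEMENT_OP(*)
  __SYCL_WI_ELEMENT_OP(/)
#undef __SYCL_WI_ELEMENT_OP
};

int main() {
  wi_element_like<float> e;
  e += 2.0f;
  e *= 3.0f;
  std::cout << e.val << "\n"; // prints 6
}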

@@ -0,0 +1,165 @@
// RUN: %clangxx -fsycl -O2 %s -o %t.out

#include <CL/sycl.hpp>

Contributor:

Suggested change
#include <CL/sycl.hpp>
#include <sycl/sycl.hpp>

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

#define SG_SZ 8

Contributor:

I have macro indigestion. :-) Please use auto constexpr, for example.

@dkhaldi (Author):

Some of these tests are used for performance evaluation. If we use constexpr, the SG size and the other parameters cannot be tuned at compile time using -D.

Contributor:

That is a good use case. But I guess in that case you would put an #ifndef SG_SZ guard around this.
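
A sketch of that arrangement, keeping the -DSG_SZ=<n> tuning path while giving the test a default (the constexpr alias is an extra, optional touch, not part of the PR):

// Tunable at compile time with -DSG_SZ=<n>; defaults to 8 otherwise.
#ifndef SG_SZ
#define SG_SZ 8
#endif
static constexpr size_t sg_sz = SG_SZ;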


cgh.parallel_for<class imatrix>(
    nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
    [accA, accB, accC, M, N, K](nd_item<2> spmd_item)

Contributor:

Too verbose. In the end you need to show that SYCL is simpler than CUDA, not the opposite. ;-)

Suggested change
[accA, accB, accC, M, N, K](nd_item<2> spmd_item)
[=](nd_item<2> spmd_item)

for (int m = 0; m < M; m++)
  for (int n = 0; n < N; n++) {
    for (int k = 0; k < K; k++) {
      float va = *(float *)(A_mem + m * K + k);

Contributor:

Suggested change
float va = *(float *)(A_mem + m * K + k);
auto va = A_mem[m * K + k];

float va = *(float *)(A_mem + m * K + k);
float vb = *(float *)(B_mem + k * N + n);
float acc = *((float *)(C_mem + m * N + n));
*((float *)(C_mem + m * N + n)) = va * vb;

Contributor:

Suggested change
*((float *)(C_mem + m * N + n)) = va * vb;
C_mem[m * N + n] = va * vb;

}
}

big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);

Contributor:

Suggested change
big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
big_matrix<float, MATRIX_M, MATRIX_N> MC { C };

and add the right constructor in big_matrix
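
A hedged sketch of such a constructor, assuming big_matrix wraps a raw pointer as in the existing matrix tests (the member and accessor names are guesses):

#include <cstddef>

template <typename T, std::size_t Rows, std::size_t Cols> struct big_matrix {
  T *mat;
  big_matrix(T *data) : mat(data) {}                       // existing form
  big_matrix(T (&data)[Rows][Cols]) : mat(&data[0][0]) {}  // enables MC{C}
  T *get_data() { return mat; }
};

// Usage with the 2D array C from the test:
//   big_matrix<float, MATRIX_M, MATRIX_N> MC{C};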

@dkhaldi (Author) commented Aug 19, 2022:

@yubingex007-a11y, can you please review?
Mainly: is this the complete set of changes we need for tf32, without breaking the existing types? Of course, SPIR-V has to change as well to accommodate the signature changes of load/store and insert/extract dynamic.

// just uses the type system to communicate the desired accuracy of arithmetic
// computations. Users can't construct a tf32
namespace precision {
class tf32 {};

@MrSidims (Contributor) commented Aug 22, 2022:

Since users shouldn't construct a tf32:

Suggested change
class tf32 {};
class tf32 {
  tf32() = delete;
};

@dkhaldi (Author) commented Jan 30, 2023:

Replaced with https://github.com/intel/llvm/pull/8151/files, which uses the unified interface.
