[SYCL][Matrix] Add support for tf32 type #5920
Conversation
I think that this implementation is generally fine, and we will change our CUDA backend implementation to use the same interface, e.g.:

```cpp
static constexpr size_t MATRIX_M = TM * 2;
static constexpr size_t MATRIX_N = TN * 2;
static constexpr size_t MATRIX_K = TK * 2;
precision::tf32 A[MATRIX_M][MATRIX_K];
```

etc., as discussed.
Hi @dkhaldi. I'm a little bit confused by this test case. In the other file you say:

> Users can't construct a tf32

But in this file you construct them and convert floats to be stored as tf32s. An empty class has size 1 byte, so it cannot contain a tf32. Is there something I'm missing here? Thanks
In https://github.com/intel/llvm/pull/5870/files#diff-34520e7c212ec666342a649b7448a9841062a5e93eca4958990ac1653568f0d5R82
I am using the free functions defined in https://github.com/intel/llvm/pull/5870/files#diff-f71a436bdeda598b29caad471fa637a2844a12f38fe4e85b15b2ccb37bd09833R607 which accept and return floats, since floats are to be used as the fragment type for tf32.
Perhaps this is suitable instead of making the function return the empty class.
@hdelan, thanks, I will correct that. I made it this way so I don't change the load API for this draft PR. I will change the load API and keep the buffers as floats.
Yes, the expectation is to use the free conversion function, with element indexing as the argument, to perform the conversion. The only issue is that we should be using the get_wi_data function to get the work-item portion, since it has clearer semantics than the raw "data" member, which can leave the user unsure whether it refers to the sub-group matrix or the work-item portion.
Interesting.
You can simplify the code a little bit.

```cpp
// Differentiating between the "element type" and the "storage element type"
template <typename T> struct helper_traits {
  typedef T element_type;
```
Use a modern `using` alias instead.
```cpp
@@ -277,12 +306,17 @@ class wi_element {
  }

#if __SYCL_DEVICE_ONLY__
  // TODO: __spirv_VectorInsertDynamic should take storage element type as
  // argument
#define OP(op)                                                                 \
```
Perhaps it is a good opportunity to remove all these macros?
@yubingex007-a11y, I remember you changed this code to use macros and make it more compact.
The code was previously expanded for each of the ops. Bing changed it to remove the redundancy.
@keryell, what do you suggest we use instead?
Sometimes macros are the best or only reasonable solution.
In that case, use protected names like `__DPC_SYCL_OP` or whatever, to avoid the case where a user decides to write in her program:

```cpp
#define OP something
```

:-)
```cpp
@@ -0,0 +1,165 @@
// RUN: %clangxx -fsycl -O2 %s -o %t.out

#include <CL/sycl.hpp>
```
Suggested change:

```diff
-#include <CL/sycl.hpp>
+#include <sycl/sycl.hpp>
```
```cpp
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

#define SG_SZ 8
```
I have macro indigestion. :-) Please use `auto constexpr`, for example.
Some of these tests are used for performance evaluation. If we use constexpr, the SG size and the other parameters cannot be tuned at compile time using -D.
That is a good use case. But I guess in that case you would have an `#ifndef SG_SZ` guard around this.
```cpp
    cgh.parallel_for<class imatrix>(
        nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
        [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
```
Too verbose. In the end you need to show that SYCL is simpler than CUDA, not the opposite. ;-)

Suggested change:

```diff
-        [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
+        [=](nd_item<2> spmd_item)
```
```cpp
for (int m = 0; m < M; m++)
  for (int n = 0; n < N; n++) {
    for (int k = 0; k < K; k++) {
      float va = *(float *)(A_mem + m * K + k);
```
Suggested change:

```diff
-      float va = *(float *)(A_mem + m * K + k);
+      auto va = A_mem[m * K + k];
```
```cpp
      float va = *(float *)(A_mem + m * K + k);
      float vb = *(float *)(B_mem + k * N + n);
      float acc = *((float *)(C_mem + m * N + n));
      *((float *)(C_mem + m * N + n)) = va * vb;
```
Suggested change:

```diff
-      *((float *)(C_mem + m * N + n)) = va * vb;
+      C_mem[m * N + n] = va * vb;
```
```cpp
    }
  }

big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
```
Suggested change:

```diff
-big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
+big_matrix<float, MATRIX_M, MATRIX_N> MC{C};
```

and add the right constructor in big_matrix.
@yubingex007-a11y, can you please review?
```cpp
// just uses the type system to communicate the desired accuracy of arithmetic
// computations. Users can't construct a tf32
namespace precision {
class tf32 {};
```
Since users shouldn't be able to construct a tf32:

Suggested change:

```diff
-class tf32 {};
+class tf32 {
+  tf32() = delete;
+};
```
Replaced with https://github.com/intel/llvm/pull/8151/files, which uses the unified interface.
This is a draft PR for initial support of the tf32 precision type in joint matrix.