This repository was archived by the owner on Mar 28, 2023. It is now read-only.

[SYCL][CUDA][Matrix] Adding test case for tf32 #963

Merged · 2 commits · Jun 8, 2022
31 changes: 21 additions & 10 deletions SYCL/Matrix/joint_matrix_tensorcore.cpp
@@ -74,7 +74,7 @@ T2 matrix_ref_mn(const int &m, const int &n, T1 *A, T1 *B, T2 *C) {
}

template <typename T1, typename T2, size_t Sub_Tiles_M, size_t Sub_Tiles_K,
-          size_t Sub_Tiles_N, size_t M, size_t K, size_t N>
+          size_t Sub_Tiles_N, size_t M, size_t K, size_t N, typename T3 = T1>
void test() {

constexpr auto Big_M =
@@ -131,19 +131,19 @@ void test() {
range<2> GlobalRange = {Sub_Tiles_M, Sub_Tiles_N * N_THREADS_PER_MATRIX_OP};

cgh.parallel_for<KernelName<T1, T2, M, K, N>>(
-        nd_range<2>(GlobalRange, LocalRange), [=
-](nd_item<2> item) [[sycl::reqd_work_group_size(1, 1, 32)]] {
+        nd_range<2>(GlobalRange, LocalRange),
+        [=](nd_item<2> item) [[sycl::reqd_work_group_size(1, 1, 32)]] {
sycl::sub_group sg = item.get_sub_group();
const auto m =
-              item.get_group()
-                  .get_id()[0]; // row id of current submatrix of BIG C matrix
+              item.get_group().get_group_id()[0]; // row id of current submatrix
+                                                  // of BIG C matrix
const auto n =
-              item.get_group().get_id()[1]; // column id of current
-                                            // submatrix of BIG C matrix
+              item.get_group().get_group_id()[1]; // column id of current
+                                                  // submatrix of BIG C matrix

-          joint_matrix<T1, matrix_use::a, M, K, matrix_layout::row_major> sub_a;
+          joint_matrix<T3, matrix_use::a, M, K, matrix_layout::row_major> sub_a;

-          joint_matrix<T1, matrix_use::b, K, N, matrix_layout::row_major> sub_b;
+          joint_matrix<T3, matrix_use::b, K, N, matrix_layout::row_major> sub_b;

joint_matrix<T2, matrix_use::accumulator, M, N,
matrix_layout::row_major>
@@ -163,6 +163,14 @@ void test() {
accB.get_pointer() + (k * K * Big_N) + (n * N),
Big_N);

+            // Convert values if using tf32
+            if constexpr (std::is_same<T3, precision::tf32>::value) {
+              for (auto i = 0; i < 4; ++i) {
+                sub_a.data[i] = round_to_tf32(sub_a.data[i]);
+                sub_b.data[i] = round_to_tf32(sub_b.data[i]);
+              }
+            }

Review thread on the added line "for (auto i = 0; i < 4; ++i) {":

Reviewer: Add a comment on where 4 comes from.

JackAKirk: I think it would be OK to merge this test as it is; then I can merge it into #975 and use the element-wise interface to address your comment. Adding a comment on where 4 comes from would then not be necessary.

Reviewer: The test will have to change a lot after intel/llvm#5964; you are not using data anymore. So you want to merge it first as it is and then modify it in #975?

JackAKirk (Jun 8, 2022): Basically it will be simpler if intel/llvm#5870 (and this corresponding test PR) is merged before intel/llvm#5964 instead of vice versa: then the tf32 parts can be updated in intel/llvm#5964. As you say, a fair amount will change, but it is straightforward for me to do. @hdelan is on holiday at the moment.

Reviewer: What bothers me is that this PR is what triggered the work on wi_marray and the slicing additions, so it does not seem right to merge it before the solution. Is there a reason why you want to merge it before intel/llvm#5964?

JackAKirk: This PR doesn't change the way that element-wise ops could be performed at the current SYCL tip; it just adds a new free tf32 rounding function and a tf32 joint_matrix implementation. I don't think there is anything implicitly wrong with it being merged before #5964. It is easier to merge this one first because @hdelan is currently on holiday anyway, and I know how to adjust #5964 to take account of these tf32 changes easily. By the sound of things, intel/llvm#5964 will probably be ready to merge by then as well.

Reviewer: Is the PR that adds the actual tf32 class merged already?

JackAKirk: No, but intel/llvm#5870 is approved.

Reviewer: I see, we agreed on addressing the spec changes in a separate PR (intel/llvm#5870 (review)). In this case, this test can also be approved and merged. Thank you for the clarifications.
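For context on the loop bound: with the shapes this test instantiates for tf32 (M = 16, K = 8, N = 16) and the 32-lane sub-group implied by reqd_work_group_size(1, 1, 32), each work-item holds four elements of each input fragment, assuming the fragment's elements are distributed evenly across the sub-group. A quick compile-time check of that arithmetic (the names here are illustrative, not part of the test):

    #include <cstddef>

    // Shapes from the tf32 instantiation in main(); 32 lanes per sub-group.
    constexpr std::size_t M = 16, K = 8, N = 16, SubGroupSize = 32;

    // sub_a is M x K and sub_b is K x N: 128 elements spread over 32
    // work-items leaves 4 per work-item, which is the loop bound above.
    static_assert(M * K / SubGroupSize == 4, "sub_a elements per work-item");
    static_assert(K * N / SubGroupSize == 4, "sub_b elements per work-item");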

sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
}
joint_matrix_store(
@@ -182,7 +190,6 @@ void test() {
};

int main() {
-
// A/B half, Accumulator float
test<half, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 16, 16, 16>();
test<half, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 8, 16, 32>();
@@ -208,5 +215,9 @@ int main() {
test<uint16_t, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 8, 16, 32>();
test<uint16_t, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 32, 16, 8>();

+  // A/B tf32
+  test<float, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 16, 8, 16,
+       precision::tf32>();
+
return 0;
};
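For reference, tf32 keeps float's sign bit and 8-bit exponent but only 10 explicit mantissa bits, so values stay in float storage at reduced precision. The sketch below is a rough host-side emulation of that rounding, assuming round-to-nearest on the 13 discarded mantissa bits; it is illustrative only and is not the implementation behind round_to_tf32, which maps to a hardware conversion on CUDA devices.

    #include <cstdint>
    #include <cstring>

    // Illustrative emulation only (an assumption, not the library's
    // round_to_tf32): keep float's sign and exponent, round the mantissa
    // to 10 bits, and leave the result stored as a float.
    float tf32_round_emulated(float x) {
      std::uint32_t bits;
      std::memcpy(&bits, &x, sizeof(bits));
      bits += 0x1000u;                 // half the dropped range: round to nearest
      bits &= ~std::uint32_t{0x1FFFu}; // clear the 13 discarded mantissa bits
      float rounded;
      std::memcpy(&rounded, &bits, sizeof(rounded));
      return rounded; // NaN corner cases are ignored in this sketch
    }

In the test above the rounding happens on the device, via round_to_tf32 on each fragment element before joint_matrix_mad, so the multiply runs at tf32 precision while accumulation stays in full float.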