Skip to content

Commit ecd3b90

Browse files
authored
[SYCL][Joint Matrix Spec] Add new API for out of bounds fill/load/store (#11172)
Code example to show usage can be found here: https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/joint_matrix_out_bounds.cpp
1 parent c63b49d commit ecd3b90

File tree

1 file changed

+257
-0
lines changed

1 file changed

+257
-0
lines changed

sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc

Lines changed: 257 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -217,6 +217,261 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and
217217
`col` represent the coordinates in the logical matrix before VNNI
218218
transformation.
219219

220+
=== Load/Store/Fill With Out-of-Bounds Checks
221+
The APIs in this section may be used only on a device that has
222+
`aspect::ext_intel_matrix_checked`. The application must check that
223+
the device has this aspect before submitting a kernel using any of the
224+
APIs in this section. If the application fails to do this, the
225+
implementation throws a synchronous exception with the
226+
`errc::kernel_not_supported` error code when the kernel is submitted
227+
to the queue.
228+
229+
==== New Aspect for Checked Matrix APIs
230+
This extension adds a new device aspect:
231+
```c++
232+
namespace sycl {
233+
234+
enum class aspect : /*unspecified*/ {
235+
ext_intel_matrix_checked
236+
};
237+
238+
} // namespace sycl
239+
```
240+
The `ext_intel_matrix_checked` aspect indicates that the device is capable of
241+
supporting the out of bounds checked APIs that are defined in this section.
242+
243+
==== Introduction
244+
In this section, we refer to the memory buffer where a `joint_matrix`
245+
is loaded from or stored to as the global matrix. This global matrix
246+
is also interpreted as a two-dimensional memory region as follows, where
247+
`GlobalRows` is number of rows in the global matrix, `GlobalCols` is number of
248+
columns in the global matrix, `Stride` is number of columns that include
249+
the out of bounds data (depicted as x here).
250+
251+
```
252+
GlobalCols
253+
<----------->
254+
dddddddddddddxxx ^
255+
dddddddddddddxxx | GlobalRows
256+
dddddddddddddxxx v
257+
xxxxxxxxxxxxxxxx
258+
<-------------->
259+
Stride
260+
```
261+
262+
In the diagram above, the global matrix has 13 columns and 3
263+
rows. This is padded out to be evenly divisible by a joint matrix with
264+
8 columns and 2 rows, which results in a stride of 16.
265+
266+
Note that joint matrix shape `Rows` and `Cols` represents a sub-block
267+
of the picture above. The out of bounds data results when the global
268+
matrix size is not evenly divisible by the joint matrix size.
269+
270+
==== Checked APIs
271+
When an algorithm iterates over the global matrix, it loads or stores
272+
elements that correspond to a joint matrix. When the global matrix
273+
size does not evenly divide by the joint matrix size, some of these
274+
loads or stores access the extra elements marked "x" in the diagram
275+
above. The standard joint matrix functions (`joint_matrix_load`,
276+
`joint_matrix_store` and `joint_matrix_fill`) do not do any bounds
277+
checking in this case, so they simply load or store to these extra
278+
elements. This could cause unexpected values to be loaded into the
279+
joint matrix for these elements. These functions could also cause a
280+
memory fault if the extra elements are not valid addresses.
281+
282+
The checked APIs described below do not attempt to access the extra
283+
memory. The checked load is guaranteed to return 0 for the extra
284+
elements, and the checked store simply ignores stores to the extra
285+
elements. Neither function will cause a memory fault if the extra
286+
elements correspond to invalid addresses.
287+
288+
These functions are similar to the existing ones without bounds
289+
checking, namely `joint_matrix_fill`, `joint_matrix_load`, and
290+
`joint_matrix_store`. But they are different in three ways:
291+
292+
* The pointer `base_src` or `base_dest` designates the base pointer of
293+
the global memory matrix, which is different from the APIs that do not
294+
do bounds checking. Those non-bounds-checking APIs take a pointer to
295+
the base of the joint matrix.
296+
* The coordinates `RowIndex` and `ColIndex` into the global matrix to
297+
calculate the pointer offset to load/store are given as separate
298+
arguments.
299+
* These variants take extra arguments to determine the global bounds
300+
`GlobalRows` and `GlobalCols` of the global matrix.
301+
302+
To illustrate the out-of-bounds checking, consider the global matrix
303+
shown above which has 13 columns and 3 rows (`GlobalRows=3` and
304+
`GlobalCols=13`), where the joint matrix size is 8 columns by 2 rows defined as
305+
```
306+
joint_matrix<sub_group, bfloat16, use::b, 2, 8, layout::row_major> sub_b;
307+
```
308+
The load of the joint matrix at coordinate [8, 2] (column number 8,
309+
row number 2 in the global matrix), overlaps the extra elements in
310+
both dimensions. This is shown below, where capital letters correspond
311+
to the elements that are accessed by this joint matrix load:
312+
313+
```
314+
GlobalCols
315+
<----------->
316+
dddddddddddddxxx ^
317+
dddddddddddddxxx | GlobalRows
318+
ddddddddDDDDDXXX v
319+
xxxxxxxxXXXXXXXX
320+
<-------------->
321+
Stride
322+
```
323+
324+
If the joint matrix is loaded via `joint_matrix_load_checked` using
325+
```
326+
joint_matrix_load_checked(sg, sub_b, base_src, 16, 3, 13, 2, 8);
327+
```
328+
the extra elements that are shown with capital `X` are not accessed in
329+
memory, and those elements are guaranteed to have the value zero in
330+
the joint matrix after the load operation completes.
331+
332+
```c++
333+
namespace sycl::ext::intel::experimental::matrix {
334+
335+
template <typename Group, typename T, size_t Rows, size_t Cols,
336+
use Use, layout Layout, typename Tv>
337+
void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows,
338+
Cols, Layout> &m, Tv v, size_t GlobalRows, size_t GlobalCols,
339+
size_t RowIndex, size_t ColIndex);
340+
341+
// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
342+
template <typename Group, typename T1, typename T2,
343+
size_t Rows, size_t Cols,
344+
access::address_space Space, access::decorated IsDecorated>
345+
void joint_matrix_load_checked(Group g,
346+
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
347+
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
348+
layout Layout, size_t GlobalRows, size_t GlobalCols,
349+
size_t RowIndex, size_t ColIndex);
350+
351+
// Only available when Layout != layout::dynamic
352+
// and when std::is_same_v<T1, std::remove_const_t<T2>>
353+
template <typename Group, typename T1, typename T2,
354+
size_t Rows, size_t Cols,
355+
use Use, layout Layout,
356+
access::address_space Space, access::decorated IsDecorated>
357+
void joint_matrix_load_checked(Group g,
358+
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
359+
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
360+
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
361+
362+
// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
363+
template <typename Group, typename T1, typename T2,
364+
size_t Rows, size_t Cols, typename PropertyListT>
365+
void joint_matrix_load_checked(Group g,
366+
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
367+
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
368+
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
369+
size_t RowIndex, size_t ColIndex);
370+
371+
// Only available when Layout != layout::dynamic
372+
// and when std::is_same_v<T1, std::remove_const_t<T2>>
373+
template <typename Group, typename T1, typename T2, size_t Rows,
374+
size_t Cols, use Use, layout Layout, typename PropertyListT>
375+
void joint_matrix_load_checked(Group g,
376+
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
377+
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
378+
size_t Stride, size_t GlobalRows, size_t GlobalCols,
379+
size_t RowIndex, size_t ColIndex);
380+
381+
template <typename Group, typename T, size_t Rows, size_t Cols,
382+
access::address_space Space, access::decorated IsDecorated>
383+
void joint_matrix_store_checked(Group g,
384+
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
385+
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, layout Layout,
386+
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
387+
388+
template <typename Group, typename T, size_t Rows, size_t Cols,
389+
layout Layout, access::address_space Space,
390+
access::decorated IsDecorated>
391+
void joint_matrix_store_checked(Group g,
392+
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
393+
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
394+
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
395+
396+
template <typename Group, typename T, size_t Rows, size_t Cols,
397+
layout Layout, access::address_space Space,
398+
access::decorated IsDecorated>
399+
void joint_matrix_store_checked(Group g,
400+
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
401+
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
402+
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
403+
404+
template <typename Group, typename T, size_t Rows, size_t Cols,
405+
typename PropertyListT>
406+
void joint_matrix_store_checked(Group g,
407+
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
408+
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
409+
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
410+
size_t RowIndex, size_t ColIndex);
411+
412+
template <typename Group, typename T, size_t Rows, size_t Cols,
413+
layout Layout, typename PropertyListT>
414+
void joint_matrix_store_checked(Group g,
415+
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
416+
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
417+
size_t Stride, size_t GlobalRows, size_t GlobalCols,
418+
size_t RowIndex, size_t ColIndex);
419+
420+
template <typename Group, typename T, size_t Rows, size_t Cols,
421+
layout Layout, typename PropertyListT>
422+
void joint_matrix_store_checked(Group g,
423+
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
424+
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
425+
size_t Stride, size_t GlobalRows, size_t GlobalCols,
426+
size_t RowIndex, size_t ColIndex);
427+
428+
} // namespace sycl::ext::intel::experimental::matrix
429+
```
430+
431+
The property list associated with the `annotated_ptr` argument
432+
represents the compile-time constant properties for cache control included
433+
in the SYCL extenion
434+
link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls].
435+
436+
==== Restrictions and Device Information Descriptors
437+
Applications must adhere to certain alignment restrictions when using
438+
the checked APIs described in this section. This extension provides
439+
the following queries to get these requirements:
440+
441+
[frame="none",options="header"]
442+
|======================
443+
| Device descriptors | Return type| Description
444+
|`ext::intel::experimental::info::device::matrix_checked_alignment`| `size_t`
445+
|Tells the required alignment (in bytes) of the base pointer for
446+
`joint_matrix_load_checked` and `joint_matrix_store_checked`.
447+
|`ext::intel::experimental::info::device::matrix_checked_rowindex_multiple_of<T>`|
448+
`size_t`|Returns a value, of which `RowIndex` must be multiple of;
449+
where `T` is the element type of the matrix. When using the matrices
450+
with the machine learning types, `T` should be the element type
451+
(e.g. `precision::tf32`) not the storage type.
452+
|`ext::intel::experimental::info::device::matrix_checked_globalcols_multiple_of<T>`|
453+
`size_t` | Returns a value, of which `GlobalCols` must be multiple of;
454+
where `T` is the element type of the matrix. When using the matrices
455+
with the machine learning types, `T` should be the element type
456+
(e.g. `precision::tf32`) not the storage type.
457+
|======================
458+
459+
==== Appendix: Restrictions Per Hardware
460+
===== Intel XMX
461+
The checked APIs are currently available in devices with the architecture
462+
`architecture::intel_gpu_pvc`. The following restrictions apply to
463+
these checked APIs:
464+
465+
- The base pointer must be 4 bytes aligned.
466+
467+
- For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits
468+
data type, `RowIndex` must be a multiple of 2. So `RowIndex` must be a
469+
multiple of 4 divided by size of the element type (`4/sizeof(T)`).
470+
471+
- For 8 bits data type, `GlobalCols` must be a multiple of 4. For 16 bits
472+
data type, `GlobalCols` must be a multiple of 2. So `GlobalCols` must be a
473+
multiple of 4 divided by size of the element type (`4/sizeof(T)`).
474+
220475
=== New Device Information Descriptor
221476
Besides the query we provide in
222477
link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix],
@@ -349,4 +604,6 @@ q.wait();
349604
|Rev |Date |Author |Changes
350605
|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API,
351606
layout information, and `joint_matrix_apply` with coordinates API
607+
|2 |2023-10-19 |Dounia Khaldi |Add Intel-specific out-of-bounds
608+
load/store/fill APIs
352609
|======================

0 commit comments

Comments
 (0)