@@ -217,6 +217,261 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and
217
217
`col` represent the coordinates in the logical matrix before VNNI
218
218
transformation.
219
219
220
+ === Load/Store/Fill With Out-of-Bounds Checks
221
+ The APIs in this section may be used only on a device that has
222
+ `aspect::ext_intel_matrix_checked`. The application must check that
223
+ the device has this aspect before submitting a kernel using any of the
224
+ APIs in this section. If the application fails to do this, the
225
+ implementation throws a synchronous exception with the
226
+ `errc::kernel_not_supported` error code when the kernel is submitted
227
+ to the queue.
228
+
229
+ ==== New Aspect for Checked Matrix APIs
230
+ This extension adds a new device aspect:
231
+ ```c++
232
+ namespace sycl {
233
+
234
+ enum class aspect : /*unspecified*/ {
235
+ ext_intel_matrix_checked
236
+ };
237
+
238
+ } // namespace sycl
239
+ ```
240
+ The `ext_intel_matrix_checked` aspect indicates that the device is capable of
241
+ supporting the out of bounds checked APIs that are defined in this section.
242
+
243
+ ==== Introduction
244
+ In this section, we refer to the memory buffer where a `joint_matrix`
245
+ is loaded from or stored to as the global matrix. This global matrix
246
+ is also interpreted as a two-dimensional memory region as follows, where
247
+ `GlobalRows` is number of rows in the global matrix, `GlobalCols` is number of
248
+ columns in the global matrix, `Stride` is number of columns that include
249
+ the out of bounds data (depicted as x here).
250
+
251
+ ```
252
+ GlobalCols
253
+ <----------->
254
+ dddddddddddddxxx ^
255
+ dddddddddddddxxx | GlobalRows
256
+ dddddddddddddxxx v
257
+ xxxxxxxxxxxxxxxx
258
+ <-------------->
259
+ Stride
260
+ ```
261
+
262
+ In the diagram above, the global matrix has 13 columns and 3
263
+ rows. This is padded out to be evenly divisible by a joint matrix with
264
+ 8 columns and 2 rows, which results in a stride of 16.
265
+
266
+ Note that joint matrix shape `Rows` and `Cols` represents a sub-block
267
+ of the picture above. The out of bounds data results when the global
268
+ matrix size is not evenly divisible by the joint matrix size.
269
+
270
+ ==== Checked APIs
271
+ When an algorithm iterates over the global matrix, it loads or stores
272
+ elements that correspond to a joint matrix. When the global matrix
273
+ size does not evenly divide by the joint matrix size, some of these
274
+ loads or stores access the extra elements marked "x" in the diagram
275
+ above. The standard joint matrix functions (`joint_matrix_load`,
276
+ `joint_matrix_store` and `joint_matrix_fill`) do not do any bounds
277
+ checking in this case, so they simply load or store to these extra
278
+ elements. This could cause unexpected values to be loaded into the
279
+ joint matrix for these elements. These functions could also cause a
280
+ memory fault if the extra elements are not valid addresses.
281
+
282
+ The checked APIs described below do not attempt to access the extra
283
+ memory. The checked load is guaranteed to return 0 for the extra
284
+ elements, and the checked store simply ignores stores to the extra
285
+ elements. Neither function will cause a memory fault if the extra
286
+ elements correspond to invalid addresses.
287
+
288
+ These functions are similar to the existing ones without bounds
289
+ checking, namely `joint_matrix_fill`, `joint_matrix_load`, and
290
+ `joint_matrix_store`. But they are different in three ways:
291
+
292
+ * The pointer `base_src` or `base_dest` designates the base pointer of
293
+ the global memory matrix, which is different from the APIs that do not
294
+ do bounds checking. Those non-bounds-checking APIs take a pointer to
295
+ the base of the joint matrix.
296
+ * The coordinates `RowIndex` and `ColIndex` into the global matrix to
297
+ calculate the pointer offset to load/store are given as separate
298
+ arguments.
299
+ * These variants take extra arguments to determine the global bounds
300
+ `GlobalRows` and `GlobalCols` of the global matrix.
301
+
302
+ To illustrate the out-of-bounds checking, consider the global matrix
303
+ shown above which has 13 columns and 3 rows (`GlobalRows=3` and
304
+ `GlobalCols=13`), where the joint matrix size is 8 columns by 2 rows defined as
305
+ ```
306
+ joint_matrix<sub_group, bfloat16, use::b, 2, 8, layout::row_major> sub_b;
307
+ ```
308
+ The load of the joint matrix at coordinate [8, 2] (column number 8,
309
+ row number 2 in the global matrix), overlaps the extra elements in
310
+ both dimensions. This is shown below, where capital letters correspond
311
+ to the elements that are accessed by this joint matrix load:
312
+
313
+ ```
314
+ GlobalCols
315
+ <----------->
316
+ dddddddddddddxxx ^
317
+ dddddddddddddxxx | GlobalRows
318
+ ddddddddDDDDDXXX v
319
+ xxxxxxxxXXXXXXXX
320
+ <-------------->
321
+ Stride
322
+ ```
323
+
324
+ If the joint matrix is loaded via `joint_matrix_load_checked` using
325
+ ```
326
+ joint_matrix_load_checked(sg, sub_b, base_src, 16, 3, 13, 2, 8);
327
+ ```
328
+ the extra elements that are shown with capital `X` are not accessed in
329
+ memory, and those elements are guaranteed to have the value zero in
330
+ the joint matrix after the load operation completes.
331
+
332
+ ```c++
333
+ namespace sycl::ext::intel::experimental::matrix {
334
+
335
+ template <typename Group, typename T, size_t Rows, size_t Cols,
336
+ use Use, layout Layout, typename Tv>
337
+ void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows,
338
+ Cols, Layout> &m, Tv v, size_t GlobalRows, size_t GlobalCols,
339
+ size_t RowIndex, size_t ColIndex);
340
+
341
+ // Only available when std::is_same_v<T1, std::remove_const_t<T2>>
342
+ template <typename Group, typename T1, typename T2,
343
+ size_t Rows, size_t Cols,
344
+ access::address_space Space, access::decorated IsDecorated>
345
+ void joint_matrix_load_checked(Group g,
346
+ joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
347
+ multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
348
+ layout Layout, size_t GlobalRows, size_t GlobalCols,
349
+ size_t RowIndex, size_t ColIndex);
350
+
351
+ // Only available when Layout != layout::dynamic
352
+ // and when std::is_same_v<T1, std::remove_const_t<T2>>
353
+ template <typename Group, typename T1, typename T2,
354
+ size_t Rows, size_t Cols,
355
+ use Use, layout Layout,
356
+ access::address_space Space, access::decorated IsDecorated>
357
+ void joint_matrix_load_checked(Group g,
358
+ joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
359
+ multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
360
+ size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
361
+
362
+ // Only available when std::is_same_v<T1, std::remove_const_t<T2>>
363
+ template <typename Group, typename T1, typename T2,
364
+ size_t Rows, size_t Cols, typename PropertyListT>
365
+ void joint_matrix_load_checked(Group g,
366
+ joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
367
+ ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
368
+ size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
369
+ size_t RowIndex, size_t ColIndex);
370
+
371
+ // Only available when Layout != layout::dynamic
372
+ // and when std::is_same_v<T1, std::remove_const_t<T2>>
373
+ template <typename Group, typename T1, typename T2, size_t Rows,
374
+ size_t Cols, use Use, layout Layout, typename PropertyListT>
375
+ void joint_matrix_load_checked(Group g,
376
+ joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
377
+ ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
378
+ size_t Stride, size_t GlobalRows, size_t GlobalCols,
379
+ size_t RowIndex, size_t ColIndex);
380
+
381
+ template <typename Group, typename T, size_t Rows, size_t Cols,
382
+ access::address_space Space, access::decorated IsDecorated>
383
+ void joint_matrix_store_checked(Group g,
384
+ const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
385
+ multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, layout Layout,
386
+ size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
387
+
388
+ template <typename Group, typename T, size_t Rows, size_t Cols,
389
+ layout Layout, access::address_space Space,
390
+ access::decorated IsDecorated>
391
+ void joint_matrix_store_checked(Group g,
392
+ const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
393
+ multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
394
+ size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
395
+
396
+ template <typename Group, typename T, size_t Rows, size_t Cols,
397
+ layout Layout, access::address_space Space,
398
+ access::decorated IsDecorated>
399
+ void joint_matrix_store_checked(Group g,
400
+ const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
401
+ multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
402
+ size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);
403
+
404
+ template <typename Group, typename T, size_t Rows, size_t Cols,
405
+ typename PropertyListT>
406
+ void joint_matrix_store_checked(Group g,
407
+ const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
408
+ ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
409
+ size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
410
+ size_t RowIndex, size_t ColIndex);
411
+
412
+ template <typename Group, typename T, size_t Rows, size_t Cols,
413
+ layout Layout, typename PropertyListT>
414
+ void joint_matrix_store_checked(Group g,
415
+ const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
416
+ ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
417
+ size_t Stride, size_t GlobalRows, size_t GlobalCols,
418
+ size_t RowIndex, size_t ColIndex);
419
+
420
+ template <typename Group, typename T, size_t Rows, size_t Cols,
421
+ layout Layout, typename PropertyListT>
422
+ void joint_matrix_store_checked(Group g,
423
+ const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
424
+ ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
425
+ size_t Stride, size_t GlobalRows, size_t GlobalCols,
426
+ size_t RowIndex, size_t ColIndex);
427
+
428
+ } // namespace sycl::ext::intel::experimental::matrix
429
+ ```
430
+
431
+ The property list associated with the `annotated_ptr` argument
432
+ represents the compile-time constant properties for cache control included
433
+ in the SYCL extenion
434
+ link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls].
435
+
436
+ ==== Restrictions and Device Information Descriptors
437
+ Applications must adhere to certain alignment restrictions when using
438
+ the checked APIs described in this section. This extension provides
439
+ the following queries to get these requirements:
440
+
441
+ [frame="none",options="header"]
442
+ |======================
443
+ | Device descriptors | Return type| Description
444
+ |`ext::intel::experimental::info::device::matrix_checked_alignment`| `size_t`
445
+ |Tells the required alignment (in bytes) of the base pointer for
446
+ `joint_matrix_load_checked` and `joint_matrix_store_checked`.
447
+ |`ext::intel::experimental::info::device::matrix_checked_rowindex_multiple_of<T>`|
448
+ `size_t`|Returns a value, of which `RowIndex` must be multiple of;
449
+ where `T` is the element type of the matrix. When using the matrices
450
+ with the machine learning types, `T` should be the element type
451
+ (e.g. `precision::tf32`) not the storage type.
452
+ |`ext::intel::experimental::info::device::matrix_checked_globalcols_multiple_of<T>`|
453
+ `size_t` | Returns a value, of which `GlobalCols` must be multiple of;
454
+ where `T` is the element type of the matrix. When using the matrices
455
+ with the machine learning types, `T` should be the element type
456
+ (e.g. `precision::tf32`) not the storage type.
457
+ |======================
458
+
459
+ ==== Appendix: Restrictions Per Hardware
460
+ ===== Intel XMX
461
+ The checked APIs are currently available in devices with the architecture
462
+ `architecture::intel_gpu_pvc`. The following restrictions apply to
463
+ these checked APIs:
464
+
465
+ - The base pointer must be 4 bytes aligned.
466
+
467
+ - For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits
468
+ data type, `RowIndex` must be a multiple of 2. So `RowIndex` must be a
469
+ multiple of 4 divided by size of the element type (`4/sizeof(T)`).
470
+
471
+ - For 8 bits data type, `GlobalCols` must be a multiple of 4. For 16 bits
472
+ data type, `GlobalCols` must be a multiple of 2. So `GlobalCols` must be a
473
+ multiple of 4 divided by size of the element type (`4/sizeof(T)`).
474
+
220
475
=== New Device Information Descriptor
221
476
Besides the query we provide in
222
477
link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix],
@@ -349,4 +604,6 @@ q.wait();
349
604
|Rev |Date |Author |Changes
350
605
|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API,
351
606
layout information, and `joint_matrix_apply` with coordinates API
607
+ |2 |2023-10-19 |Dounia Khaldi |Add Intel-specific out-of-bounds
608
+ load/store/fill APIs
352
609
|======================
0 commit comments