[SYCL][Doc] Provide extra sub-group guarantees #2452

Merged
merged 13 commits into from
Nov 6, 2020
2 changes: 1 addition & 1 deletion sycl/doc/extensions/README.md
@@ -31,7 +31,7 @@ DPC++ extensions status:
| [SYCL_INTEL_static_local_memory_query](StaticLocalMemoryQuery/SYCL_INTEL_static_local_memory_query.asciidoc) | Proposal | |
| [SYCL_INTEL_sub_group_algorithms](SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc) | Partially supported(OpenCL: CPU, GPU) | Features from SYCL_INTEL_group_algorithms extended to sub-groups |
| [Sub-groups for NDRange Parallelism](SubGroupNDRange/SubGroupNDRange.md) | Deprecated(OpenCL: CPU, GPU) | |
| [Sub-groups](SubGroup/SYCL_INTEL_sub_group.asciidoc) | Supported(OpenCL) | |
| [Sub-groups](SubGroup/SYCL_INTEL_sub_group.asciidoc) | Partially supported(OpenCL) | Not supported: auto/stable sizes, stable query, compiler flags |
| [SYCL_INTEL_unnamed_kernel_lambda](UnnamedKernelLambda/SYCL_INTEL_unnamed_kernel_lambda.asciidoc) | Supported(OpenCL) | |
| [Unified Shared Memory](USM/USM.adoc) | Supported(OpenCL) | |

48 changes: 30 additions & 18 deletions sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -68,30 +68,25 @@ Providing a generic group abstraction encapsulating the shared functionality of

=== Attributes

The +[[intel::reqd_sub_group_size(n)]]+ attribute indicates that the kernel must be compiled and executed with a sub-group of size _n_. The value of _n_ must be a compile-time integral constant expression. The value of _n_ must be set to a sub-group size that is both supported by the device and compatible with all language features used by the kernel, or device compilation will fail. The set of valid sub-group sizes can be queried as described below.
The +[[intel::sub_group_size(S)]]+ attribute indicates that the kernel must be compiled and executed with a specific sub-group size. The value of _S_ must be a compile-time integral constant expression. The kernel should only be submitted to a device that supports that sub-group size (as reported by +info::device::sub_group_sizes+). If the kernel is submitted to a device that does not support the requested sub-group size, or a device on which the requested sub-group size is incompatible with any language features used by the kernel, the implementation must throw a synchronous exception with the `errc::feature_not_supported` error code from the kernel invocation command.
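
One way to respect the "should only be submitted" rule above is to compare the requested size against the sizes the device reports before submitting the kernel. The following is a minimal sketch in plain C++ rather than SYCL code: `supported_sizes` is assumed to hold the result of the `info::device::sub_group_sizes` query, and the helper name `supports_sub_group_size` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the extension): returns true if `requested`
// appears in the list of sub-group sizes reported by a device.
// `supported_sizes` is assumed to hold the values returned by the
// info::device::sub_group_sizes query.
inline bool supports_sub_group_size(const std::vector<std::size_t>& supported_sizes,
                                    std::size_t requested) {
    return std::find(supported_sizes.begin(), supported_sizes.end(), requested) !=
           supported_sizes.end();
}
```

A passing check only covers device support. The requested size may still be incompatible with language features used by the kernel, in which case the `errc::feature_not_supported` exception described above is still expected.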

In addition to device functions, the required sub-group size attribute may also be specified in the definition of a named functor object, as in the example below:
The +[[intel::named_sub_group_size(NAME)]]+ attribute indicates that the kernel must be compiled and executed with a named sub-group size. _NAME_ must be one of the following special tokens: +auto+, +primary+. If _NAME_ is +auto+, the implementation is free to select any of the valid sub-group sizes associated with the device to which the kernel is submitted; the manner in which the sub-group size is selected is implementation-defined. If _NAME_ is +primary+, the implementation will select the device's primary sub-group size (as reported by the +info::device::primary_sub_group_size+ query) for all kernels with this attribute.

[source, c++]
----
class Functor
{
void operator()(item<1> item) [[intel::reqd_sub_group_size(16)]]
{
/* kernel code */
}
}
----
If no sub-group size attribute appears on a kernel, the default behavior is as-if +[[intel::sub_group_size(auto)]]+ was specified. This behavior may be overridden by an implementation (e.g. via compiler flags).
Contributor

Typo:

as-if +[[intel::named_sub_group_size(auto)]]+ was specified

(missing "named_")

Contributor Author

Good catch. Fixed in ff87d02.


Sub-group size attributes may also be applied to `SYCL_EXTERNAL` functions. If a kernel calls a `SYCL_EXTERNAL` function, or a `SYCL_EXTERNAL` function calls another `SYCL_EXTERNAL` function, the attributes applied to the caller and callee must match exactly. If the attributes do not match, the compiler should produce an error. Note that sub-group size attributes are not propagated from a device function to callers of the function, and must be specified explicitly.
Contributor

I found this paragraph unclear for several reasons:

  • It's not clear if the sub-group size attribute is required in this scenario or if it is merely allowed. (I think we want it to be required, correct?)

  • Whenever SYCL_EXTERNAL is used, there are two relevant translation units: the TU that makes the call and the TU that defines the function. We need to make it clear whether the attribute is required in the calling TU, the defining TU, or both.

  • The statement about requiring the compiler to produce an error makes it sound like the compiler must do inter-TU analysis to see if the function defined via SYCL_EXTERNAL is defined with the right attribute. (Or at least the statement could be interpreted that way.) I think that is not our intent.

  • The statement about the sub-group size not being propagated from a device function to its callers could be interpreted to mean that this propagation doesn't happen even within a single TU. I think the intent, though, is that this propagation does not happen across TUs.

Does the following paragraph capture what we want to say?

There are special requirements whenever a device function defined in one translation unit makes a call to a device function that is defined in a second translation unit. In such a case, the second device function is always declared using SYCL_EXTERNAL. If the kernel containing these device functions is defined using a sub-group size attribute, the functions declared using SYCL_EXTERNAL must also be decorated with that same attribute. This decoration must exist both in the translation unit making the call and in the translation unit that defines the function. If the sub-group attribute is missing in the translation unit that makes the call (or if the sub-group size of the called function does not match the sub-group size of the calling function), the program is ill-formed and the compiler must raise a diagnostic.

Contributor Author

Yes, this is much better, thanks. I added a note that auto doesn't make sense on a SYCL_EXTERNAL function. See bd68110.


=== Compiler Flags

It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case. The +reqd_sub_group_size+ attribute is not propagated from a device function to callers of the function, and must be specified explicitly when a kernel is defined.
The +-fsycl-primary-sub-group-size+ flag compiles all kernels in the translation unit without a sub-group size attribute as though +[[intel::named_sub_group_size(primary)]]+ was applied to the kernel.

=== Sub-group Queries

Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels. The sub-group size for a given combination of kernel and launch configuration is fixed, and guaranteed to be reflected by device and kernel queries.
Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups for certain work-group sizes is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels. The sub-group size for a given combination of kernel, device and work-group size is fixed.

Each sub-group in a work-group is one-dimensional. If the total number of work-items in a work-group is evenly divisible by the sub-group size, all sub-groups in the work-group will contain the same number of work-items. If the total number of work-items in a work-group is not evenly divisible by the sub-group size, the number of work-items in the final sub-group is equal to the remainder of the total work-group size divided by the sub-group size.
Each sub-group in a work-group is one-dimensional. If the number of work-items in the highest-numbered dimension of a work-group is evenly divisible by the sub-group size, all sub-groups in the work-group will contain the same number of work-items. Additionally, the numbering of work-items in a sub-group reflects the linear numbering of the work-items in the work-group. Specifically, if a work-item has linear ID i~s~ in the sub-group and linear ID i~w~ in the work-group, the work-item with linear ID i~s~+1 in the sub-group has linear ID i~w~+1 in the work-group.

To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, that two sub-groups within a work-group will make independent forward progress with respect to one another, nor that remainders arising from work-group division will be handled in a specific way.
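
The numbering guarantee above can be illustrated with a small sketch for the evenly divisible case. This is plain C++, not SYCL code, and it rests on an assumption the text does not state: sub-groups are packed back to back starting at work-group linear ID 0. The type `SubGroupPosition` and the function `locate` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative type (not part of the extension): where a work-item sits
// relative to the sub-groups of its work-group.
struct SubGroupPosition {
    std::size_t sub_group_id;  // index of the sub-group within the work-group
    std::size_t local_id;      // linear ID of the work-item within that sub-group
};

// Maps a work-group linear ID to a sub-group position.
// Assumptions: sub-groups are packed contiguously starting at work-group
// linear ID 0, and the work-group size is evenly divisible by sub_group_size.
inline SubGroupPosition locate(std::size_t wg_linear_id, std::size_t sub_group_size) {
    assert(sub_group_size > 0);
    return {wg_linear_id / sub_group_size, wg_linear_id % sub_group_size};
}
```

Under these assumptions, consecutive work-group IDs that fall in the same sub-group get consecutive sub-group IDs, which is the property the paragraph describes. The sketch deliberately says nothing about remainders, since their handling should not be assumed.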

The device descriptors below are added to the +info::device+ enumeration class:

@@ -106,9 +101,13 @@ The device descriptors below are added to the +info::device+ enumeration class:
|+bool+
|Returns +true+ if the device supports independent forward progress of sub-groups with respect to other sub-groups in the same work-group.

|+info::device::primary_sub_group_size+
|+size_t+
|Returns a sub-group size supported by this device that is guaranteed to support all core language features for the device.

|+info::device::sub_group_sizes+
|+vector_class<size_t>+
|Returns a vector_class of +size_t+ containing the set of sub-group sizes supported by the device.
|Returns a vector_class of +size_t+ containing the set of sub-group sizes supported by the device. Each sub-group size is a power of 2 in the range [1, 2^31^]. Not all sub-group sizes are guaranteed to be compatible with all core language features; any incompatibilities are implementation-defined.
|===
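
The constraint on reported sizes (a power of 2 in the range [1, 2^31]) can be expressed as a simple predicate. The sketch below is plain C++ for illustration only; the function name `is_plausible_sub_group_size` is hypothetical, and passing this check says nothing about whether a particular device actually supports the size.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the extension): returns true if `size`
// is a power of two in the range [1, 2^31].
constexpr bool is_plausible_sub_group_size(std::uint64_t size) {
    constexpr std::uint64_t max_size = std::uint64_t{1} << 31;
    // A power of two has exactly one bit set, so clearing the lowest set bit
    // with size & (size - 1) leaves zero.
    return size >= 1 && size <= max_size && (size & (size - 1)) == 0;
}
```

A 64-bit parameter is used so that values above 2^31 can be rejected without overflow.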

An additional query is added to the +kernel+ class, enabling an input value to be passed to `get_info`. The original `get_info` query from the SYCL_INTEL_device_specific_kernel_queries extension should be used for queries that do not specify an input type.
@@ -143,7 +142,7 @@ The kernel descriptors below are added to the +info::kernel_device_specific+ enu
|+info::kernel_device_specific::compile_sub_group_size+
|N/A
|+uint32_t+
|Returns the required sub-group size specified by the kernel, or 0 (if not specified).
|Returns the sub-group size of the kernel, set implicitly by the implementation or explicitly using a kernel attribute. Returns 0 if the requested size was `auto`, and returns the device's primary sub-group size if the requested size was `primary`.
|===
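
The return values described for `compile_sub_group_size` can be summarized as a small model. This is a plain C++ sketch of the documented behavior under stated assumptions, not the implementation: the enum `RequestKind` and the function `reported_compile_sub_group_size` are hypothetical, and the case where the implementation chooses a size implicitly is not modeled.

```cpp
#include <cstdint>

// Hypothetical classification of how a kernel requested its sub-group size.
enum class RequestKind { Explicit, Auto, Primary };

// Models the value the query is described as returning:
//   - auto:     0
//   - primary:  the device's primary sub-group size
//   - explicit: the size given in the kernel attribute
inline std::uint32_t reported_compile_sub_group_size(RequestKind kind,
                                                     std::uint32_t explicit_size,
                                                     std::uint32_t primary_size) {
    switch (kind) {
        case RequestKind::Auto:
            return 0;
        case RequestKind::Primary:
            return primary_size;
        case RequestKind::Explicit:
            return explicit_size;
    }
    return 0;  // unreachable for valid enum values
}
```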

=== The sub_group Class
@@ -295,6 +294,16 @@ Yes, this is required by OpenCL devices. Devices that do not require the work-g
Yes, the four shuffles in this extension are a defining feature of sub-groups. Higher-level algorithms (such as those in the +SubGroupAlgorithms+ proposal) may build on them, the same way as higher-level algorithms using work-groups build on work-group local memory.
--

. What should the sub-group size compatible with all features be called?
+
--
*RESOLVED*:
The name adopted is "primary", to convey that it is an integral part of sub-group support provided by the device. Other names considered are listed here for posterity: "default", "stable", "fixed", "core". These terms are easy to misunderstand (i.e. the "default" size may not be chosen by default, the "stable" size is unrelated to the software release cycle, the "fixed" sub-group size may change between devices or compiler releases, the "core" size is unrelated to hardware cores).
--

. How does sub-group size interact with `SYCL_EXTERNAL` functions?
The current behavior requires exact matching. Should this be relaxed to allow alternative implementations (e.g. link-time optimization, multi-versioning)?

//. asd
//+
//--
@@ -315,6 +324,9 @@ Yes, the four shuffles in this extension are a defining feature of sub-groups.
|5|2020-04-21|John Pennycook|*Restore sub-group shuffles as member functions*
|6|2020-04-22|John Pennycook|*Align with SYCL_INTEL_device_specific_kernel_queries*
|7|2020-07-13|John Pennycook|*Clarify that reqd_sub_group_size must be a compile-time constant*
|8|2020-09-08|John Pennycook|*Provide some basic correctness guarantees*
|9|2020-09-21|John Pennycook|*Clarify behavior of SYCL_EXTERNAL functions*
|10|2020-09-21|John Pennycook|*Remove reqd_ prefix from attribute names*
|========================================

//************************************************************************