Skip to content

[SYCL][Doc] Provide extra sub-group guarantees #2452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Nov 6, 2020
Merged
2 changes: 1 addition & 1 deletion sycl/doc/extensions/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ DPC++ extensions status:
| [SYCL_INTEL_static_local_memory_query](StaticLocalMemoryQuery/SYCL_INTEL_static_local_memory_query.asciidoc) | Proposal | |
| [SYCL_INTEL_sub_group_algorithms](SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc) | Partially supported(OpenCL: CPU, GPU) | Features from SYCL_INTEL_group_algorithms extended to sub-groups |
| [Sub-groups for NDRange Parallelism](SubGroupNDRange/SubGroupNDRange.md) | Deprecated(OpenCL: CPU, GPU) | |
| [Sub-groups](SubGroup/SYCL_INTEL_sub_group.asciidoc) | Supported(OpenCL) | |
| [Sub-groups](SubGroup/SYCL_INTEL_sub_group.asciidoc) | Partially supported(OpenCL) | Not supported: auto/stable sizes, stable query, compiler flags |
| [SYCL_INTEL_unnamed_kernel_lambda](UnnamedKernelLambda/SYCL_INTEL_unnamed_kernel_lambda.asciidoc) | Supported(OpenCL) | |
| [Unified Shared Memory](USM/USM.adoc) | Supported(OpenCL) | |

Expand Down
30 changes: 24 additions & 6 deletions sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,9 @@ Providing a generic group abstraction encapsulating the shared functionality of

=== Attributes

The +[[intel::reqd_sub_group_size(n)]]+ attribute indicates that the kernel must be compiled and executed with a sub-group of size _n_. The value of _n_ must be a compile-time integral constant expression. The value of _n_ must be set to a sub-group size that is both supported by the device and compatible with all language features used by the kernel, or device compilation will fail. The set of valid sub-group sizes can be queried as described below.
The +[[intel::reqd_sub_group_size(S)]]+ attribute indicates that the kernel must be compiled and executed with a specific sub-group size. The value of _S_ must be a compile-time integral constant expression or one of the following keywords: +auto+, +core+. If the value of _S_ is an integer, the kernel should only be submitted to a device that supports that sub-group size (as reported by +info::device::sub_group_sizes+). If the kernel is submitted to a device that does not support the requested sub-group size, or a device on which the requested sub-group size is incompatible with any language features used by the kernel, the implementation must throw a synchronous exception with the `errc::feature_not_supported` error code from the kernel invocation command. If the value of _S_ is +auto+, the implementation is free to select any of the valid sub-group sizes associated with the device to which the kernel is submitted; the manner in which the sub-group size is selected is implementation-defined. If the value of _S_ is +core+, the implementation will select the device's core sub-group size (as reported by the +info::device::core_sub_group_size+ query) for all kernels with this attribute.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for:

constexpr int core = 4;
SYCL_EXTERNAL void f() [[intel::reqd_sub_group_size(core)]];

Does it use 4 or does it use core? We could avoid this possible ambiguity and the need to treat those special in the FE by making this intel::core. That allows us to define in the header

namespace intel { constexpr int core = -1; }

and the FE doesn't need to have special casing for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I hadn't considered that "core" might be used elsewhere as a name. I don't think INTEL::core is much better -- INTEL::reqd_sub_group_size(INTEL::core) is pretty long, and as you point out in #2452 (comment), we might end up with values like INTEL::core occurring in multiple contexts.

Ignoring the open naming issue: Do you think INTEL::req_core_sub_group_size would be better? It's 9 characters shorter and unambiguous.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't thinking that the attribute parameter could be a constexpr variable from the application. I was assuming it had to be a literal number or one of the special tokens "core" or "auto". Therefore, "core" used in the attribute parameter would be unrelated to any variable named "core".

Allowing a constexpr variable (or expression) does seem to introduce other problems. We could define a variable named core in our headers as @rolandschulz suggests, but I'm not sure auto is allowed as a variable name. We could avoid those problems by adding three attributes as @Pennycook suggests: intel::reqd_sub_group_size(S), intel::reqd_core_sub_group_size, and intel::reqd_auto_sub_group_size.

Or, we could specify that reqd_sub_group_size() must take a literal number or one of the special tokens "core" or "auto". I admit that's less flexible, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attributes have been split in 7ccb2eb. I've tried to make it clear that [[sub_group_size(S)]] requires S to be a compile-time constant integer expression, whereas [[named_sub_group_size(NAME)]] requires NAME to be a special token from a known list.


If no required sub-group size attribute appears on a kernel, the default behavior is as-if +[[intel::reqd_sub_group_size(auto)]]+ was specified. This behavior may be overridden by an implementation (e.g. via compiler flags).

In addition to device functions, the required sub-group size attribute may also be specified in the definition of a named functor object, as in the example below:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of Attributes section below always bothered me and having more values for intel::reqd_sub_group_size attribute concerns me even more:

The +[[intel::reqd_sub_group_size(S)]]+ attribute indicates that the kernel must be compiled and executed with a specific sub-group size.
In addition to device functions,

At first, we talk about kernels and then we say "In addition to device functions", which implies that the attribute can also be applied to arbitrary device function and this particular usage is the least documented one.

If I understand correctly, the main idea of this attribute is to control sub-group size of the particular kernel, but of an arbitrary function: I would suggest that we remove this "In addition to device function" and if want to say something about arbitrary device functions, let's say it below in a direct manner, probably with examples.

It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case. The +reqd_sub_group_size+ attribute is not propagated from a device function to callers of the function, and must be specified explicitly when a kernel is defined.

How these restrictions should work with auto and core? I mean things like can we call core function from an auto function? Can we call core function from a function with sub-group size 8, for example?

Should compiler trace the whole call graph? I.e if we have bar [[reqd_sub_group_size(8)]] calling foo (no required sub-group size), which in turn calls baz [[reqd_sub_group_size(4)]] - should error be reported by the compiler? Or it is up to implementation to decide about quality of error checking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest that we remove this "In addition to device function" and if want to say something about arbitrary device functions, let's say it below in a direct manner, probably with examples.

You're right. The intent here was to allow reqd_sub_group_size to be specified on SYCL_EXTERNAL functions, but that's not spelled out explicitly here and I don't think it's implemented either. @gmlueck, @rolandschulz , do you think it would be okay to change this to only allow reqd_sub_group_size on external functions (rather than arbitrary functions)? That makes the intent clearer.

How these restrictions should work with auto and core? I mean things like can we call core function from an auto function? Can we call core function from a function with sub-group size 8, for example?

Good question. I can see an argument either way. On the one hand, if core happens to be 8 on some device, linking makes sense; similarly, calling a function with a required size of 8 might encourage auto to select 8 (if it can). On the other hand, whether core and auto are compatible with a specific size cannot be determined until the device is known.

It seems to me like the safest option would be to only allow exact matches, so a kernel with auto can only call an external function with auto and so on. Do you agree?

Should compiler trace the whole call graph? I.e if we have bar [[reqd_sub_group_size(8)]] calling foo (no required sub-group size), which in turn calls baz [[reqd_sub_group_size(4)]] - should error be reported by the compiler? Or it is up to implementation to decide about quality of error checking?

Changing the wording as above would simplify this. The compiler would have to trace the whole call graph, but would only have to check for mismatched sub-group sizes when encountering an external function. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like the safest option would be to only allow exact matches, so a kernel with auto can only call an external function with auto and so on. Do you agree?

I think that we should allow calling auto from non-auto, but shouldn't allow calling non-auto from auto. So, if user has a kernel, which requires sub-group size 8 or core, user is still able to call function with auto (either explicit or implicit) sub-group size - IMHO, it makes sense to not require from user to put the attribute on every function in a call tree - it is enough to just mark interfaces (like kernels and SYCL_EXTERNAL functions)

However, if user has auto function, it should be prohibited to call core or 8 functions from it, i.e. user is either only marks interfaces, or propagates this attribute down by call graph to any depth it wants to, but without any "gaps". By doing so, we still allowing users to put the attribute everywhere to improve code readability (i.e. you don't need to manually trace the call graph upwards to see which sub-group size is going to be used for this function) and also we could simplify the compiler as it would be able to only check caller-callee pair and wouldn't be required to trace the whole call graph.

What do you think?

The compiler would have to trace the whole call graph, but would only have to check for mismatched sub-group sizes when encountering an external function. Does that make sense?

We might have auto function, which calls 8, but itself is called from 4 and `16. This doesn't look correct and I think that compiler should report errors in this case.

One more question, consider the following situation: kernel with 8 calls function foo [auto] and kernel with 4 calls the same foo [auto] - is such configuration considered to be a consistent? Is it important to document expected behavior? I mean SPIR-V spec says that such configuration is valid and foo must be cloned and compiled two times for different sub-group sizes (because sub-group size is a property of EntryPoint)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanest solution would be to allow [[sycl::reqd_sub_group_size()]] only on kernel declarations, which is how it is currently specified in SYCL 2020 provisional. If the implementation supports SYCL_EXTERNAL, the responsibility would be on the implementation to propagate the sub-group size from the kernel to any SYCL_EXTERNAL functions it calls. This would require the implementation to have some sort of phase the merges SPIR-v across translation units. It seems like any implementation that supports SYCL_EXTERNAL must have some sort of phase like this, though, right?

If we think that our implementation cannot do this in the short term, we have two options:

  1. Say that [[sycl::reqd_sub_group_size()]] is not supported for kernels that call SYCL_EXTERNAL functions. The compiler should raise a diagnostic in this case.

  2. Invent a syntax where [[sycl::reqd_sub_group_size()]] can be used to decorate SYCL_EXTERNAL functions, as we're describing above. In this case, we should plan for this syntax to be optional in the future. In the future, I think it would be better for the implementation to automatically propagate the sub-group information to SYCL_EXTERNAL functions.

If we decide on option 2, then we should come up with the simplest syntax that addresses the issue. This is a temporary solution, so we don't need it to be too fancy. I would be tempted to agree with @Pennycook that we should allow only an exact match between a kernel's sub-group size and a SYCL_EXTERNAL function's sub-group size. However, what about the case where two kernels call the same SYCL_EXTERNAL function? What if those two kernels had different [[sycl::reqd_sub_group_size()]] attributes?

One way to solve this would be to allow a list of sizes in [[sycl::reqd_sub_group_size()]] only when it is applied to a SYCL_EXTERNAL function. This would tell the compiler to produce multiple versions of the function, and each kernel could use the appropriate version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question, consider the following situation: kernel with 8 calls function foo [auto] and kernel with 4 calls the same foo [auto] - is such configuration considered to be a consistent? Is it important to document expected behavior? I mean SPIR-V spec says that such configuration is valid and foo must be cloned and compiled two times for different sub-group sizes (because sub-group size is a property of EntryPoint)

This is a great example of why I think we want to tie this to SYCL_EXTERNAL somehow. If a compiler wants to detect that a function is called from multiple kernels and multi-version that function, it should do that. But as soon as a function is marked external, how to multi-version becomes less obvious. For SPIR-V it may be fine, but what about other implementations? How does it interact with function pointers? On the flip side, allowing auto on an external function might not make a lot of sense either -- the compiler can definitely choose a sub-group size to compile the function with, but the user can no longer reason about which kernels can call it.

I think it would be better for the implementation to automatically propagate the sub-group information to SYCL_EXTERNAL functions.

I don't think this is realistic for all implementations. If kernels and external functions are compiled straight to some ISA for a specific device, there's no opportunity to do this propagation. If the kernel could prove which function(s) it was calling and which needed to be compiled for the device, there'd be no need to mark the function(s) with SYCL_EXTERNAL.

I'm a little scared to try and solve the problem for the generic case here: there are clearly a lot of open questions, and I don't want to set a precedent for all other extensions and attributes in DPC++ without more discussion. That said -- @kbobrovs, am I right in thinking that ESIMD is currently checking that kernels and external functions have an exact match of reqd_sub_group_size(1)?

If so, the simplest thing may be to say:

  • [[sycl::reqd_sub_group_size()]] is allowed on kernels and SYCL_EXTERNAL functions (but not normal device functions).
  • [[sycl::reqd_sub_group_size(auto)]] is invalid for SYCL_EXTERNAL functions.
  • If a kernel calls a SYCL_EXTERNAL function, the attributes applied to each must match exactly.

I could then add "How does sub-group size interact with SYCL_EXTERNAL functions?" to the list of unresolved questions for this extension, to be addressed later. It seems to me like exact matching or multi-versioning could work, but we should consider more than just sub-group size when we make the decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with that for the short-term solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 6482dcf.


Expand All @@ -85,13 +87,17 @@ class Functor

It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case. The +reqd_sub_group_size+ attribute is not propagated from a device function to callers of the function, and must be specified explicitly when a kernel is defined.

=== Compiler Flags

The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit as though +[[intel::reqd_sub_group_size(core)]]+ was specified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have a kernel which specified sub-group size?

For auto spec explicitly states above that auto can be overridden by a compiler flag. Overriding core with core shouldn't produce any problems, I guess, but what about particular specified sub-group sizes like 8, for example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The intent here was to only override the behavior for kernels that do not already have an attribute specified. This is why auto is specified both as the default behavior and an attribute: a user can write [[reqd_sub_group_size(auto)]] on a particular kernel to mean that they always want the sub-group size to be selected by the implementation.

Is the change below sufficient to capture that?

Suggested change
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit as though +[[intel::reqd_sub_group_size(core)]]+ was specified.
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit without a `reqd_sub_group_size` attribute as though +[[intel::reqd_sub_group_size(core)]]+ was specified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the change below sufficient to capture that?

The idea LGTM, but in context of this comment from you it would probably better to refer to kernels with auto sub-group size instead of "without an attribute".

I think this could be misleading. With the new wording, a requested size is always specified -- it's just that if no attribute is applied to the kernel, that size is auto.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @Pennycook's proposed change, and I don't think it conflicts with the comment you reference. Conceptually, the compiler first makes a pass over all kernels that have no explicit [[intel::reqd_sub_group_size(core)]] attribute, and sets the attribute to either auto or core depending on whether the fsycl-core-sub-group-size was specified.

Now, every kernel has a sub-group size defined and info::kernel_device_specific::compile_sub_group_size operates according to that size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think the confusion here and the perceived incompatibility with #2452 (comment) arises from my lazy use of the word "specified".

If the user doesn't attach an attribute to a kernel, then either auto or core will be attached to it by default. By the time the kernel has been compiled, the SPIR-V always has a sub-group size associated with it.

Adding a suggestion here and to the comment to try and avoid using the same word:

Suggested change
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit as though +[[intel::reqd_sub_group_size(core)]]+ was specified.
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit without a `reqd_sub_group_size` attribute as though +[[intel::reqd_sub_group_size(core)]]+ was applied to the kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened the description of the flag in e6dad96.


=== Sub-group Queries

Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels. The sub-group size for a given combination of kernel and launch configuration is fixed, and guaranteed to be reflected by device and kernel queries.
Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups for certain work-group sizes is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels. The sub-group size for a given combination of kernel and launch configuration is fixed, and guaranteed to be reflected by device and kernel queries.

Each sub-group in a work-group is one-dimensional. If the total number of work-items in a work-group is evenly divisible by the sub-group size, all sub-groups in the work-group will contain the same number of work-items. If the total number of work-items in a work-group is not evenly divisible by the sub-group size, the number of work-items in the final sub-group is equal to the remainder of the total work-group size divided by the sub-group size.
Each sub-group in a work-group is one-dimensional. If the number of work-items in the highest-numbered dimension of a work-group is evenly divisible by the sub-group size, all sub-groups in the work-group will contain the same number of work-items. Additionally, the numbering of work-items in a sub-group reflects the linear numbering of the work-items in the work-group. Specifically, if a work-item has linear ID i~s~ in the sub-group and linear ID i~w~ in the work-group, the work-item with linear ID i~s~+1 in the sub-group has linear ID i~w~+1 in the work-group.

To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, that two sub-groups within a work-group will make independent forward progress with respect to one another, nor that remainders arising from work-group division will be handled in a specific way.

The device descriptors below are added to the +info::device+ enumeration class:

Expand All @@ -106,9 +112,13 @@ The device descriptors below are added to the +info::device+ enumeration class:
|+bool+
|Returns +true+ if the device supports independent forward progress of sub-groups with respect to other sub-groups in the same work-group.

|+info::device::core_sub_group_size+
|+size_t+
|Return a sub-group size supported by this device that is guaranteed to support all core language features for the device.

|+info::device::sub_group_sizes+
|+vector_class<size_t>+
|Returns a vector_class of +size_t+ containing the set of sub-group sizes supported by the device.
|Returns a vector_class of +size_t+ containing the set of sub-group sizes supported by the device. Each sub-group size is a power of 2 in the range [1, 2^31^]. Not all sub-group sizes are guaranteed to be compatible with all core language features; any incompatibilities are implementation-defined.
|===

An additional query is added to the +kernel+ class, enabling an input value to be passed to `get_info`. The original `get_info` query from the SYCL_INTEL_device_specific_kernel_queries extension should be used for queries that do not specify an input type.
Expand Down Expand Up @@ -143,7 +153,7 @@ The kernel descriptors below are added to the +info::kernel_device_specific+ enu
|+info::kernel_device_specific::compile_sub_group_size+
|N/A
|+uint32_t+
|Returns the required sub-group size specified by the kernel, or 0 (if not specified).
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto`, and returns the device's core sub-group size if the requested size was `core`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto`, and returns the device's core sub-group size if the requested size was `core`.
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto` (or if it wasn't specified), and returns the device's core sub-group size if the requested size was `core`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be misleading. With the new wording, a requested size is always specified -- it's just that if no attribute is applied to the kernel, that size is auto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto`, and returns the device's core sub-group size if the requested size was `core`.
|Returns the required sub-group size of the kernel, set implicitly by the implementation or explicitly using a kernel attribute . Returns 0 if the requested size is `auto`, and returns the device's core sub-group size if the requested size is `core`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened the description of this query in cebc13f.

|===

=== The sub_group Class
Expand Down Expand Up @@ -295,6 +305,13 @@ Yes, this is required by OpenCL devices. Devices that do not require the work-g
Yes, the four shuffles in this extension are a defining feature of sub-groups. Higher-level algorithms (such as those in the +SubGroupAlgorithms+ proposal) may build on them, the same way as higher-level algorithms using work-groups build on work-group local memory.
--

. What should the sub-group size compatible with all features be called?
+
--
*RESOLVED*:
The name adopted is "core", to convey that it supports all core (non-optional) language features and is an integral part of sub-group support provided by the device. Other names considered are listed here for posterity: "default", "stable", "fixed", "primary". With the exception of "primary", these terms are easy to misunderstand (i.e. the "default" size may not be chosen by default, the "stable" size is unrelated to the software release cycle, the "fixed" sub-group size may change between devices or compiler releases). "core" was selected over "primary" because of its simplicity.
--

//. asd
//+
//--
Expand All @@ -315,6 +332,7 @@ Yes, the four shuffles in this extension are a defining feature of sub-groups.
|5|2020-04-21|John Pennycook|*Restore sub-group shuffles as member functions*
|6|2020-04-22|John Pennycook|*Align with SYCL_INTEL_device_specific_kernel_queries*
|7|2020-07-13|John Pennycook|*Clarify that reqd_sub_group_size must be a compile-time constant*
|8|2020-09-08|John Pennycook|*Provide some basic correctness guarantees*
|========================================

//************************************************************************
Expand Down