-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL][Doc] Provide extra sub-group guarantees #2452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Addresses user feedback on using sub-groups: - Clarifies the circumstances in which work-groups can be guaranteed to be split into sub-groups in the same way across all devices - Provides a way to request a sub-group size that will always work via [[sycl::reqd_sub_group_size(core)]] - Provides a shorthand to request the core sub-group size for all kernels via -fsycl-core-sub-group-size Signed-off-by: John Pennycook <[email protected]>
- Error behavior should depend on submission to device - compile_sub_group_size query should reflect new sizes Signed-off-by: John Pennycook <[email protected]> Co-authored-by: Greg Lueck <[email protected]>
This term does not appear elsewhere in any specification, really means: - Kernel - Device - Work-group Size The sentence containing this phrase has also been reworked to avoid implying that all properties of a sub-group can be queried, since this is currently untrue. Signed-off-by: John Pennycook <[email protected]> Co-authored-by: Greg Lueck <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review for the first two commits (haven't seen the third one yet)
The +[[intel::reqd_sub_group_size(n)]]+ attribute indicates that the kernel must be compiled and executed with a sub-group of size _n_. The value of _n_ must be a compile-time integral constant expression. The value of _n_ must be set to a sub-group size that is both supported by the device and compatible with all language features used by the kernel, or device compilation will fail. The set of valid sub-group sizes can be queried as described below. | ||
The +[[intel::reqd_sub_group_size(S)]]+ attribute indicates that the kernel must be compiled and executed with a specific sub-group size. The value of _S_ must be a compile-time integral constant expression or one of the following keywords: +auto+, +core+. If the value of _S_ is an integer, the kernel should only be submitted to a device that supports that sub-group size (as reported by +info::device::sub_group_sizes+). If the kernel is submitted to a device that does not support the requested sub-group size, or a device on which the requested sub-group size is incompatible with any language features used by the kernel, the implementation must throw a synchronous exception with the `errc::feature_not_supported` error code from the kernel invocation command. If the value of _S_ is +auto+, the implementation is free to select any of the valid sub-group sizes associated with the device to which the kernel is submitted; the manner in which the sub-group size is selected is implementation-defined. If the value of _S_ is +core+, the implementation will select the device's core sub-group size (as reported by the +info::device::core_sub_group_size+ query) for all kernels with this attribute. | ||
|
||
If no required sub-group size attribute appears on a kernel, the default behavior is as-if +[[intel::reqd_sub_group_size(auto)]]+ was specified. This behavior may be overridden by an implementation (e.g. via compiler flags). | ||
|
||
In addition to device functions, the required sub-group size attribute may also be specified in the definition of a named functor object, as in the example below: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of Attributes
section below always bothered me and having more values for intel::reqd_sub_group_size
attribute concerns me even more:
The +[[intel::reqd_sub_group_size(S)]]+ attribute indicates that the kernel must be compiled and executed with a specific sub-group size.
In addition to device functions,
At first, we talk about kernels and then we say "In addition to device functions", which implies that the attribute can also be applied to arbitrary device function and this particular usage is the least documented one.
If I understand correctly, the main idea of this attribute is to control sub-group size of the particular kernel, but of an arbitrary function: I would suggest that we remove this "In addition to device function" and if want to say something about arbitrary device functions, let's say it below in a direct manner, probably with examples.
It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case. The +reqd_sub_group_size+ attribute is not propagated from a device function to callers of the function, and must be specified explicitly when a kernel is defined.
How these restrictions should work with auto
and core
? I mean things like can we call core
function from an auto
function? Can we call core
function from a function with sub-group size 8
, for example?
Should compiler trace the whole call graph? I.e if we have bar [[reqd_sub_group_size(8)]]
calling foo
(no required sub-group size), which in turn calls baz [[reqd_sub_group_size(4)]]
- should error be reported by the compiler? Or it is up to implementation to decide about quality of error checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that we remove this "In addition to device function" and if want to say something about arbitrary device functions, let's say it below in a direct manner, probably with examples.
You're right. The intent here was to allow reqd_sub_group_size
to be specified on SYCL_EXTERNAL
functions, but that's not spelled out explicitly here and I don't think it's implemented either. @gmlueck, @rolandschulz , do you think it would be okay to change this to only allow reqd_sub_group_size
on external functions (rather than arbitrary functions)? That makes the intent clearer.
How these restrictions should work with auto and core? I mean things like can we call core function from an auto function? Can we call core function from a function with sub-group size 8, for example?
Good question. I can see an argument either way. On the one hand, if core
happens to be 8 on some device, linking makes sense; similarly, calling a function with a required size of 8 might encourage auto
to select 8 (if it can). On the other hand, whether core
and auto
are compatible with a specific size cannot be determined until the device is known.
It seems to me like the safest option would be to only allow exact matches, so a kernel with auto
can only call an external function with auto
and so on. Do you agree?
Should compiler trace the whole call graph? I.e if we have bar [[reqd_sub_group_size(8)]] calling foo (no required sub-group size), which in turn calls baz [[reqd_sub_group_size(4)]] - should error be reported by the compiler? Or it is up to implementation to decide about quality of error checking?
Changing the wording as above would simplify this. The compiler would have to trace the whole call graph, but would only have to check for mismatched sub-group sizes when encountering an external function. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me like the safest option would be to only allow exact matches, so a kernel with auto can only call an external function with auto and so on. Do you agree?
I think that we should allow calling auto
from non-auto
, but shouldn't allow calling non-auto
from auto
. So, if user has a kernel, which requires sub-group size 8
or core
, user is still able to call function with auto
(either explicit or implicit) sub-group size - IMHO, it makes sense to not require from user to put the attribute on every function in a call tree - it is enough to just mark interfaces (like kernels and SYCL_EXTERNAL
functions)
However, if user has auto
function, it should be prohibited to call core
or 8
functions from it, i.e. user is either only marks interfaces, or propagates this attribute down by call graph to any depth it wants to, but without any "gaps". By doing so, we still allowing users to put the attribute everywhere to improve code readability (i.e. you don't need to manually trace the call graph upwards to see which sub-group size is going to be used for this function) and also we could simplify the compiler as it would be able to only check caller-callee pair and wouldn't be required to trace the whole call graph.
What do you think?
The compiler would have to trace the whole call graph, but would only have to check for mismatched sub-group sizes when encountering an external function. Does that make sense?
We might have auto
function, which calls 8
, but itself is called from 4
and `16. This doesn't look correct and I think that compiler should report errors in this case.
One more question, consider the following situation: kernel with 8
calls function foo [auto]
and kernel with 4
calls the same foo [auto]
- is such configuration considered to be a consistent? Is it important to document expected behavior? I mean SPIR-V spec says that such configuration is valid and foo
must be cloned and compiled two times for different sub-group sizes (because sub-group size is a property of EntryPoint
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanest solution would be to allow [[sycl::reqd_sub_group_size()]]
only on kernel declarations, which is how it is currently specified in SYCL 2020 provisional. If the implementation supports SYCL_EXTERNAL
, the responsibility would be on the implementation to propagate the sub-group size from the kernel to any SYCL_EXTERNAL
functions it calls. This would require the implementation to have some sort of phase the merges SPIR-v across translation units. It seems like any implementation that supports SYCL_EXTERNAL
must have some sort of phase like this, though, right?
If we think that our implementation cannot do this in the short term, we have two options:
-
Say that
[[sycl::reqd_sub_group_size()]]
is not supported for kernels that callSYCL_EXTERNAL
functions. The compiler should raise a diagnostic in this case. -
Invent a syntax where
[[sycl::reqd_sub_group_size()]]
can be used to decorateSYCL_EXTERNAL
functions, as we're describing above. In this case, we should plan for this syntax to be optional in the future. In the future, I think it would be better for the implementation to automatically propagate the sub-group information toSYCL_EXTERNAL
functions.
If we decide on option 2, then we should come up with the simplest syntax that addresses the issue. This is a temporary solution, so we don't need it to be too fancy. I would be tempted to agree with @Pennycook that we should allow only an exact match between a kernel's sub-group size and a SYCL_EXTERNAL
function's sub-group size. However, what about the case where two kernels call the same SYCL_EXTERNAL
function? What if those two kernels had different [[sycl::reqd_sub_group_size()]]
attributes?
One way to solve this would be to allow a list of sizes in [[sycl::reqd_sub_group_size()]]
only when it is applied to a SYCL_EXTERNAL
function. This would tell the compiler to produce multiple versions of the function, and each kernel could use the appropriate version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more question, consider the following situation: kernel with 8 calls function foo [auto] and kernel with 4 calls the same foo [auto] - is such configuration considered to be a consistent? Is it important to document expected behavior? I mean SPIR-V spec says that such configuration is valid and foo must be cloned and compiled two times for different sub-group sizes (because sub-group size is a property of EntryPoint)
This is a great example of why I think we want to tie this to SYCL_EXTERNAL
somehow. If a compiler wants to detect that a function is called from multiple kernels and multi-version that function, it should do that. But as soon as a function is marked external, how to multi-version becomes less obvious. For SPIR-V it may be fine, but what about other implementations? How does it interact with function pointers? On the flip side, allowing auto
on an external function might not make a lot of sense either -- the compiler can definitely choose a sub-group size to compile the function with, but the user can no longer reason about which kernels can call it.
I think it would be better for the implementation to automatically propagate the sub-group information to SYCL_EXTERNAL functions.
I don't think this is realistic for all implementations. If kernels and external functions are compiled straight to some ISA for a specific device, there's no opportunity to do this propagation. If the kernel could prove which function(s) it was calling and which needed to be compiled for the device, there'd be no need to mark the function(s) with SYCL_EXTERNAL
.
I'm a little scared to try and solve the problem for the generic case here: there are clearly a lot of open questions, and I don't want to set a precedent for all other extensions and attributes in DPC++ without more discussion. That said -- @kbobrovs, am I right in thinking that ESIMD is currently checking that kernels and external functions have an exact match of reqd_sub_group_size(1)
?
If so, the simplest thing may be to say:
[[sycl::reqd_sub_group_size()]]
is allowed on kernels andSYCL_EXTERNAL
functions (but not normal device functions).[[sycl::reqd_sub_group_size(auto)]]
is invalid forSYCL_EXTERNAL
functions.- If a kernel calls a
SYCL_EXTERNAL
function, the attributes applied to each must match exactly.
I could then add "How does sub-group size interact with SYCL_EXTERNAL
functions?" to the list of unresolved questions for this extension, to be addressed later. It seems to me like exact matching or multi-versioning could work, but we should consider more than just sub-group size when we make the decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with that for the short-term solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied in 6482dcf.
@@ -85,13 +87,17 @@ class Functor | |||
|
|||
It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case. The +reqd_sub_group_size+ attribute is not propagated from a device function to callers of the function, and must be specified explicitly when a kernel is defined. | |||
|
|||
=== Compiler Flags | |||
|
|||
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit as though +[[intel::reqd_sub_group_size(core)]]+ was specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we have a kernel which specified sub-group size?
For auto
spec explicitly states above that auto
can be overridden by a compiler flag. Overriding core
with core
shouldn't produce any problems, I guess, but what about particular specified sub-group sizes like 8
, for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. The intent here was to only override the behavior for kernels that do not already have an attribute specified. This is why auto
is specified both as the default behavior and an attribute: a user can write [[reqd_sub_group_size(auto)]]
on a particular kernel to mean that they always want the sub-group size to be selected by the implementation.
Is the change below sufficient to capture that?
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit as though +[[intel::reqd_sub_group_size(core)]]+ was specified. | |
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit without a `reqd_sub_group_size` attribute as though +[[intel::reqd_sub_group_size(core)]]+ was specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the change below sufficient to capture that?
The idea LGTM, but in context of this comment from you it would probably better to refer to kernels with auto
sub-group size instead of "without an attribute".
I think this could be misleading. With the new wording, a requested size is always specified -- it's just that if no attribute is applied to the kernel, that size is
auto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @Pennycook's proposed change, and I don't think it conflicts with the comment you reference. Conceptually, the compiler first makes a pass over all kernels that have no explicit [[intel::reqd_sub_group_size(core)]]
attribute, and sets the attribute to either auto
or core
depending on whether the fsycl-core-sub-group-size
was specified.
Now, every kernel has a sub-group size defined and info::kernel_device_specific::compile_sub_group_size
operates according to that size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I think the confusion here and the perceived incompatibility with #2452 (comment) arises from my lazy use of the word "specified".
If the user doesn't attach an attribute to a kernel, then either auto
or core
will be attached to it by default. By the time the kernel has been compiled, the SPIR-V always has a sub-group size associated with it.
Adding a suggestion here and to the comment to try and avoid using the same word:
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit as though +[[intel::reqd_sub_group_size(core)]]+ was specified. | |
The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation unit without a `reqd_sub_group_size` attribute as though +[[intel::reqd_sub_group_size(core)]]+ was applied to the kernel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tightened the description of the flag in e6dad96.
@@ -143,7 +153,7 @@ The kernel descriptors below are added to the +info::kernel_device_specific+ enu | |||
|+info::kernel_device_specific::compile_sub_group_size+ | |||
|N/A | |||
|+uint32_t+ | |||
|Returns the required sub-group size specified by the kernel, or 0 (if not specified). | |||
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto`, and returns the device's core sub-group size if the requested size was `core`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto`, and returns the device's core sub-group size if the requested size was `core`. | |
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto` (or if it wasn't specified), and returns the device's core sub-group size if the requested size was `core`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be misleading. With the new wording, a requested size is always specified -- it's just that if no attribute is applied to the kernel, that size is auto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|Returns the required sub-group size specified by the kernel. Returns 0 if the requested size was `auto`, and returns the device's core sub-group size if the requested size was `core`. | |
|Returns the required sub-group size of the kernel, set implicitly by the implementation or explicitly using a kernel attribute . Returns 0 if the requested size is `auto`, and returns the device's core sub-group size if the requested size is `core`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tightened the description of this query in cebc13f.
@@ -93,7 +93,7 @@ The +-fsycl-core-sub-group-size+ flag compiles all kernels in the translation un | |||
|
|||
=== Sub-group Queries | |||
|
|||
Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups for certain work-group sizes is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels. The sub-group size for a given combination of kernel and launch configuration is fixed, and guaranteed to be reflected by device and kernel queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence containing this phrase has also been reworked to avoid
implying that all properties of a sub-group can be queried, since this
is currently untrue.
Honestly, I wasn't under impression that the last sentence was implying that all properties of a sub-group can be queried, because it specifically mentioned sub-group size and nothing more. Probably I'm missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here was pointed out by @gmlueck. It is not always possible to query the sub-group size that is actually used for a given work-group size, because what the query returns is the max work-group size (i.e. the maximum sub-group size that can be supported by a compiled kernel).
For example, a work-group size of 7 may return 8 for its max_work_group_size
, and there is no query that returns 7.
@@ -68,7 +68,9 @@ Providing a generic group abstraction encapsulating the shared functionality of | |||
|
|||
=== Attributes | |||
|
|||
The +[[intel::reqd_sub_group_size(n)]]+ attribute indicates that the kernel must be compiled and executed with a sub-group of size _n_. The value of _n_ must be a compile-time integral constant expression. The value of _n_ must be set to a sub-group size that is both supported by the device and compatible with all language features used by the kernel, or device compilation will fail. The set of valid sub-group sizes can be queried as described below. | |||
The +[[intel::reqd_sub_group_size(S)]]+ attribute indicates that the kernel must be compiled and executed with a specific sub-group size. The value of _S_ must be a compile-time integral constant expression or one of the following keywords: +auto+, +core+. If the value of _S_ is an integer, the kernel should only be submitted to a device that supports that sub-group size (as reported by +info::device::sub_group_sizes+). If the kernel is submitted to a device that does not support the requested sub-group size, or a device on which the requested sub-group size is incompatible with any language features used by the kernel, the implementation must throw a synchronous exception with the `errc::feature_not_supported` error code from the kernel invocation command. If the value of _S_ is +auto+, the implementation is free to select any of the valid sub-group sizes associated with the device to which the kernel is submitted; the manner in which the sub-group size is selected is implementation-defined. If the value of _S_ is +core+, the implementation will select the device's core sub-group size (as reported by the +info::device::core_sub_group_size+ query) for all kernels with this attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens for:
constexpr int core = 4;
SYCL_EXTERNAL void f() [[intel::reqd_sub_group_size(core)]];
Does it use 4 or does it use core
? We could avoid this possible ambiguity and the need to treat those special in the FE by making this intel::core
. That allows us to define in the header
namespace intel { constexpr int core = -1; }
and the FE doesn't need to have special casing for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, I hadn't considered that "core" might be used elsewhere as a name. I don't think INTEL::core
is much better -- INTEL::reqd_sub_group_size(INTEL::core)
is pretty long, and as you point out in #2452 (comment), we might end up with values like INTEL::core
occurring in multiple contexts.
Ignoring the open naming issue: Do you think INTEL::req_core_sub_group_size
would be better? It's 9 characters shorter and unambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't thinking that the attribute parameter could be a constexpr
variable from the application. I was assuming it had to be a literal number or one of the special tokens "core" or "auto". Therefore, "core" used in the attribute parameter would be unrelated to any variable named "core".
Allowing a constexpr
variable (or expression) does seem to introduce other problems. We could define a variable named core
in our headers as @rolandschulz suggests, but I'm not sure auto
is allowed as a variable name. We could avoid those problems by adding three attributes as @Pennycook suggests: intel::reqd_sub_group_size(S)
, intel::reqd_core_sub_group_size
, and intel::reqd_auto_sub_group_size
.
Or, we could specify that reqd_sub_group_size()
must take a literal number or one of the special tokens "core" or "auto". I admit that's less flexible, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The attributes have been split in 7ccb2eb. I've tried to make it clear that [[sub_group_size(S)]]
requires S
to be a compile-time constant integer expression, whereas [[named_sub_group_size(NAME)]]
requires NAME
to be a special token from a known list.
Not merging the patch because it seems there is some discussion in the comments. Please, correct me if I'm wrong. |
Signed-off-by: John Pennycook <[email protected]> Co-authored-by: Greg Lueck <[email protected]> Co-authored-by: Roland Schulz <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]> Co-authored-by: Greg Lueck <[email protected]> Co-authored-by: Alexey Sachkov <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
@romanovvlad: Thanks for holding off on merging this. You weren't wrong! @gmlueck, @rolandschulz, @AlexeySachkov: I think I've addressed all of the comments now -- please take a look. I'm afraid that the diffs are still quite difficult to read without |
} | ||
} | ||
---- | ||
If no sub-group size attribute appears on a kernel, the default behavior is as-if +[[intel::sub_group_size(auto)]]+ was specified. This behavior may be overridden by an implementation (e.g. via compiler flags). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo:
as-if +[[intel::named_sub_group_size(auto)]]+ was specified
(missing "named_")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed in ff87d02.
Signed-off-by: John Pennycook <[email protected]> Co-authored-by: Greg Lueck <[email protected]>
---- | ||
If no sub-group size attribute appears on a kernel, the default behavior is as-if +[[intel::named_sub_group_size(auto)]]+ was specified. This behavior may be overridden by an implementation (e.g. via compiler flags). | ||
|
||
Sub-group size attributes may also be applied to `SYCL_EXTERNAL` functions. If a kernel calls a `SYCL_EXTERNAL` function, or a `SYCL_EXTERNAL` function calls another `SYCL_EXTERNAL` function, the attributes applied to the caller and callee must match exactly. If the attributes do not match, the compiler should produce an error. Note that sub-group size attributes are not propagated from a device function to callers of the function, and must be specified explicitly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this paragraph unclear for several reasons:
-
It's not clear if the sub-group size attribute is required in this scenario or if it is merely allowed. (I think we want it to be required, correct?)
-
Whenever
SYCL_EXTERNAL
is used, there are two relevant translation units: the TU the makes the call and the TU that defines the function. We need to make it clear whether the attribute is required in the calling TU, the defining TU, or both. -
The statement about requiring the compiler to produce an error makes it sound like the compiler must do inter-TU analysis to see if the function defined via
SYCL_EXTERNAL
is defined with the right attribute. (Or at least the statement could be interpreted that way.) I think that is not our intent. -
The statement about the sub-group size not be propagated from a device function to its callers could be interpreted to mean that this propagation doesn't happen even within a single TU. I think the intent, though, is that this propagation does not happen across TUs.
Does the following paragraph capture what we want to say?
There are special requirements whenever a device function defined in one translation unit makes a call to a device function that is defined in a second translation unit. In such a case, the second device function is always declared using
SYCL_EXTERNAL
. If the kernel containing these device function is defined using a sub-group size attribute, the functions declared usingSYCL_EXTERNAL
must also be decorated with that same attribute. This decoration must exist in both the translation unit making the call and also in the translation unit that defines the function. If the sub-group attribute is missing in the translation unit that makes the call (or if the sub-group size of the called function does not match the sub-group size of the calling function), the program is ill formed and the compiler must raise a diagnostic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is much better, thanks. I added a note that auto
doesn't make sense on a SYCL_EXTERNAL
function. See bd68110.
} | ||
} | ||
---- | ||
If no sub-group size attribute appears on a kernel, the default behavior is as-if +[[intel::named_sub_group_size(auto)]]+ was specified. This behavior may be overridden by an implementation (e.g. via compiler flags). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should say somewhere that sub_group_size()
and named_sub_group_size()
may not both the specified on the same kernel. This might be a good place to say that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Added in fd5b407.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So a kernel using named_sub_group_size(auto)
cannot call a device function in another translation unit? Why do we need that restriction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need it, at least temporarily, to avoid undefined behavior.
SYCL_EXTERNAL void foo(); [[intel::sub_group_size(4)]]
SYCL_EXTERNAL void bar(); [[intel::sub_group_size(8)]]
SYCL_EXTERNAL void baz(); [[intel::named_sub_group_size(auto)]]
SYCL_EXTERNAL void qux(); [[intel::named_sub_group_size(primary)]]
// Arguably the compiler could choose to compile this kernel with size 4
parallel_for(..., [=](id<1> i) [[intel::named_sub_group_size(auto)]] {
foo();
});
// But in this case, "auto" couldn't be evaluated to a single sub-group size
parallel_for(..., [=](id<1> i) [[intel::named_sub_group_size(auto)]] {
foo();
bar();
});
// baz is going to be compiled independently of this kernel
// There's no guarantee that the kernel and function choose the same size for "auto"
parallel_for(..., [=](id<1> i) [[intel::named_sub_group_size(auto)]] {
baz();
/* pretend there's some other code here */
};
Since we've agreed that the SYCL_EXTERNAL
solution needs more discussion, I figured it was better to restrict ourselves only to situations that we can guarantee to support in the future. Exact matching and no cross-TU magic when handling "auto" seems to be the only case that we all agree could be reasonably supported by different compiler implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it's certainly a safe, conservative choice. We can always lift restrictions later after thinking about it more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this some more. The default for kernels that have no sub-group size attribute is "auto". Yet, we are saying that a kernel declared as "auto" cannot call a function defined in a different TU via SYCL_EXTERN
. Doesn't this mean that kernels with no sub-group size attribute are forbidden from calling SYCL_EXTERN
functions? Surely, that is not our intent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, this is what it means. I'm not sure it's what we'd want in an ideal situation, but I couldn't see a way around it. If the kernel and function are compiled completely separately and linked later, the only way that linking can occur is if they're both compiled with the same sub-group size.
It might be okay to say that an auto
kernel calling an auto
function works, but we'd be forcing all implementations to behave a certain way by doing so. Using a heuristic to determine the sub-group size for kernels and functions wouldn't be allowed, because they might not match. There still an open question of what a function pointer to an auto
function would represent -- would it have to contain some sort of dispatch to different sub-group sizes? Maybe that's okay too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After our offline discussion yesterday, I've taken this restriction back out in bbeec6a. Matching auto
kernels with auto
functions is something that the current implementation supports, and we shouldn't take that away until we're sure that we want to. I've left the note about SYCL_EXTERNAL
in the unresolved questions, and we should revisit this in the context of how sub-groups interact with other extensions.
Signed-off-by: John Pennycook <[email protected]> Co-authored-by: Greg Lueck <[email protected]>
Covers multiple cases: - Duplicated sub-group size attributes - [[sub_group_size]] and [[named_sub_group_size]] Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Defaulting to the primary sub-group size improves usability: - Developers can safely assume one sub-group size for all functions - SYCL_EXTERNAL corner-cases only appear if user opts in - Original behavior can be requested with a compiler flag Combines all changes made into one revision of the extension. Signed-off-by: John Pennycook <[email protected]>
The latest set of changes reflects where we've landed after a lot of internal discussion. @jbrodman, @AlexeySachkov, @gmlueck, @rolandschulz -- this is ready for review now. @bader: If/when this gets merged, we should just use one of these commit messages. I'll update the PR description to match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any further questions or objections
[L0] Fixed event leak when outevent is given and is completed
Addresses user feedback on using sub-groups:
be split into sub-groups in the same way across all devices
kernels via -fsycl-default-sub-group-size
Signed-off-by: John Pennycook [email protected]