-
Notifications
You must be signed in to change notification settings - Fork 770
[SYCL] Parallel-for range correction to improve group size selection by GPU driver #2703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…on by GPU driver. Signed-off-by: rdeodhar <[email protected]>
Very cool! @Pennycook - think this will improve a lot of cases? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the look of this, and I think it's going to help a lot of codes. I think it might also close #813, but we should ask @cagnulein to confirm.
clang/lib/Sema/SemaSYCL.cpp
Outdated
@@ -510,6 +510,22 @@ class MarkDeviceFunction : public RecursiveASTVisitor<MarkDeviceFunction> { | |||
FunctionDecl *FD = WorkList.back().first; | |||
FunctionDecl *ParentFD = WorkList.back().second; | |||
|
|||
// To implement rounding-up of a parallel-for range (Jira 20239) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove reference to internal tracking number.
// To implement rounding-up of a parallel-for range (Jira 20239) | |
// To implement rounding-up of a parallel-for range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
sycl/include/CL/sycl/item.hpp
Outdated
@@ -104,6 +104,8 @@ template <int dimensions = 1, bool with_offset = true> class item { | |||
|
|||
bool operator!=(const item &rhs) const { return rhs.MImpl != MImpl; } | |||
|
|||
void set_allowed_range(const range<dimensions> rnwi) { MImpl.MExtent = rnwi; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this new function is public. Should it be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This seems to introduce a lot of new public functions. I don't think we should be doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point of this is to be transparent to the programmer, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Just curious about whether you do more work leading to UB. |
I think this is meant to be a better version of a hack proof of concept I did for John a while ago: jbrodman@fc26a1f The intent is to use C++ tricks to submit a range that tends to execute faster on the device and still be correct (of course). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some tests first?
return; | ||
|
||
// The call graph for this translation unit. | ||
CallGraph SYCLCG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK we already build a callgraph in SemaSYCL. Can we try to re-use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would be nice to reuse that infrastructure. I first tried pursuing that approach. The result of a scan for calls to this_item would have to be saved somewhere. The existing callgraph traversal lead to various function "attributes" being set. This would be fine, except that calls_this_item is not an attribute. We could define an internal attribute for that. Would that be acceptable? If yes, it would simplify the SemaSYCL changes quite a bit. How to add such an attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why not. You can check SYCLRequiresDecomposition for an example of internal attribute. @premanandrao @Fznamznon could you please confirm this is ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be ok.
// CHECK: __SYCL_DLL_LOCAL | ||
// CHECK_NEXT: static constexpr bool callsThisItem() { return 1; } | ||
|
||
#include "Inputs/sycl.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please include mock sycl header like it is a system header, using -internal-isystem option? Here is an example https://github.com/intel/llvm/blob/sycl/clang/test/CodeGenSYCL/stall_enable.cpp .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
sycl/include/CL/sycl/handler.hpp
Outdated
/// | ||
/// \param Queue is the queue for this handler. | ||
/// \return Whether the device is a GPU. | ||
bool is_gpu(shared_ptr_class<sycl::detail::queue_impl> Queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool is_gpu(shared_ptr_class<sycl::detail::queue_impl> Queue); | |
bool is_gpu(const shared_ptr_class<sycl::detail::queue_impl> &Queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -29,6 +29,8 @@ subject to change. Do not rely on these variables in production code. | |||
| SYCL_PI_LEVEL_ZERO_MAX_COMMAND_LIST_CACHE | Positive integer | Maximum number of oneAPI Level Zero Command lists that can be allocated with no reuse before throwing an "out of resources" error. Default is 20000, threshold may be increased based on resource availabilty and workload demand. | | |||
| SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR | Any(\*) | Disable USM allocator in Level Zero plugin (each memory request will go directly to Level Zero runtime) | | |||
| SYCL_PI_LEVEL_ZERO_BATCH_SIZE | Integer | Sets a preferred number of commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. | | |||
| SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE | Any(\*) | Enables tracing of parallel_for invocations with rounded-up ranges. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor. Suggest reusing some level of SYCL_PI_TRACE instead of introducing new variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should use the SYCL_PI_TRACE env var to control this output. This trace has nothing to do with the plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdeodhar could you please clarify one thing for me, please? If I have a built application, that uses a version of runtime library without this PR, and then I update the library to a new version with your changes, will the application continue to work as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@jbrodman @pvchupin @premanandrao @elizabethandrews |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FE changes LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* upstream/sycl: (616 commits) [SYCL][L0] Implement robust error handling in level_zero plugin (intel#2870) [SYCL][NFC] Code clean up (phase 5) revealed by self build. (intel#2907) [Driver][NFC] Remove unused variable (intel#2908) [Github Action] Enable automatic sync for main branch from llvm-project to llvm (intel#2904) [ESIMD][NFC] Remove unnecessary macro checks (intel#2900) [SYCL] Fix handling of multiple usages of composite spec constants (intel#2894) [SYCL] Adjust parallel-for range global size to improve group size selection (intel#2703) [SYCL] Add template parameter support for no_global_work_offset attribute (intel#2839) [SYCL] Support LLVM FP intrinsic in llvm-spirv and FE (intel#2880) [SYCL]Link Libm FP64 SYCL device library by default (intel#2892) [SYCL][NFC] Code clean up (phase 4) revealed by self build. (intel#2878) [SYCL][NFC] Code clean up (phase 3) revealed by self build. (intel#2865) [SYCL] Fix backend selection for SYCL_DEVICE_TYPE=* (intel#2890) [SYCL] Fix spec constants support in integration header (intel#2896) [Driver] Update unbundling of offload libraries to use archive type (intel#2883) [SYCL][NFC] Clang format SYCL.cpp (intel#2891) [CODEOWNERS] Add code owners for DPC++ tools (intel#2884) [XPTIFW] Enable in-tree builds (intel#2849) [SYCL] Don't dump IR and dot files by default in the LowerWGScope pass (intel#2887) [SYCL] Use llvm-link's only-needed option to link device libs (intel#2783) ...
// This transformation leads to a condition where a kernel body | ||
// function becomes callable from a new kernel body function. | ||
// Hence this test. | ||
if ((ParentFD == KernelBody) && isSYCLKernelBodyFunction(FD)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I found this while working on something else. Is there anything we can do to make this MUCH more selective? The problem we have now is that someone who uses a lambda (or operator()) inside their top-level lambda will have things mis-diagnose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isSYCLKernelBodyFunction has a simplistic implementation, not introduced by this PR, by the way. One way to improve matters is to recognize the kernel early during parsing, and use an internal attribute to mark it as a KernelBody.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but this use of it actually is a pretty nasty breaking change.
I'm not sure what opportunity the parser has to do that marking, AND it would likely break your patch (since there is no way to mark the 2nd lambda there).
I think we might need some sort of way of having this library opt-into pulling the body-attributes in from the child.
Perhaps defining internal attributes for OriginalKernel and WrappedKernel might help. I assume SYCL headers will be able to use these attributes? Then the markup and actions might be better defined. There is some proposed work for SYCL reductions that is also going to use this wrapper method. Maybe tackle a cleanup then. |
I think we'd only need 1 of those attributes (either, "this is a body-wrapper' or a 'this is really the body') to get this part correct. I'm leaning toward the 'body-wrapper' labeling, simply because we can do that ONLY in the library. I'm working on refactoring a lot of the MarkDevice code and derivatives, so I'm hoping I can implement that as either a follow-up or as a part of that patch. |
@AaronBallman and @premanandrao and @elizabethandrews : Note this as well, we need to fix this, as we'll end up getting some oddities when people use an operator() inside their kernel lambda (or kernel function). |
Variadic functions are not supported in SPIR-V, the only known exception is printf. Signed-off-by: Marcos Maronas <[email protected]> Original commit: KhronosGroup/SPIRV-LLVM-Translator@569972a61c86aa6
This change rounds up a parallel-for range to be a multiple of 32. This value can be changed later when we have better strategies for selecting work-group sizes. It works well for now. The rounding-up improves performance by 8-10x for the odd cases when the original range is a prime number. It has negligible performance impact cases where the range is already a multiple of 32.
Signed-off-by: rdeodhar [email protected]