[SYCL] Parallel-for range correction to improve group size selection by GPU driver #2703

rdeodhar · 2020-10-29T00:48:19Z

This change rounds up a parallel-for range to be a multiple of 32. This value can be changed later when we have better strategies for selecting work-group sizes. It works well for now. The rounding-up improves performance by 8-10x for the odd cases when the original range is a prime number. It has negligible performance impact in cases where the range is already a multiple of 32.
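
For illustration, the rounding described above amounts to integer arithmetic along these lines. This is a minimal sketch, not the code from this PR: `round_up` is a hypothetical helper name, and the default multiple of 32 is taken from the description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): round n up to the next multiple of
// `multiple`. Values that are already a multiple are returned unchanged.
inline std::size_t round_up(std::size_t n, std::size_t multiple = 32) {
  assert(multiple > 0 && "rounding multiple must be non-zero");
  return ((n + multiple - 1) / multiple) * multiple;
}

// Number of extra (padding) work-items that the rounded range adds.
inline std::size_t padding(std::size_t n, std::size_t multiple = 32) {
  return round_up(n, multiple) - n;
}
```

With a prime range such as 1009, `round_up(1009)` gives 1024, so 15 padding work-items are launched. A range that is already a multiple of 32, such as 1024, is left unchanged.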

Signed-off-by: rdeodhar [email protected]

jbrodman · 2020-11-03T18:07:33Z

Very cool! @Pennycook - think this will improve a lot of cases?

Pennycook

I really like the look of this, and I think it's going to help a lot of codes. I think it might also close #813, but we should ask @cagnulein to confirm.

Pennycook · 2020-11-03T18:56:40Z

clang/lib/Sema/SemaSYCL.cpp

@@ -510,6 +510,22 @@ class MarkDeviceFunction : public RecursiveASTVisitor<MarkDeviceFunction> {
      FunctionDecl *FD = WorkList.back().first;
      FunctionDecl *ParentFD = WorkList.back().second;

+      // To implement rounding-up of a parallel-for range (Jira 20239)


Remove reference to internal tracking number.

Suggested change

-      // To implement rounding-up of a parallel-for range (Jira 20239)
+      // To implement rounding-up of a parallel-for range

sycl/include/CL/sycl/handler.hpp

cperkinsintel · 2020-11-06T20:10:51Z

sycl/include/CL/sycl/item.hpp

@@ -104,6 +104,8 @@ template <int dimensions = 1, bool with_offset = true> class item {

  bool operator!=(const item &rhs) const { return rhs.MImpl != MImpl; }

+  void set_allowed_range(const range<dimensions> rnwi) { MImpl.MExtent = rnwi; }


this new function is public. Should it be?

Good catch. This seems to introduce a lot of new public functions. I don't think we should be doing that.

The whole point of this is to be transparent to the programmer, right?
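
One common way to keep such a setter out of the public interface is to make it private and befriend an internal helper. The sketch below only illustrates that idea under stated assumptions: `Item`, `detail::RangeRounder`, and `restore_user_range` are hypothetical names, and this is not necessarily how the PR resolved the comment.

```cpp
#include <cassert>
#include <cstddef>

namespace detail {
struct RangeRounder; // hypothetical internal helper, forward-declared
}

// Hypothetical, simplified 1-D stand-in for sycl::item.
class Item {
public:
  explicit Item(std::size_t extent) : extent_(extent) {}
  std::size_t get_range() const { return extent_; }

private:
  // Only the internal helper may adjust the range seen by user code.
  friend struct detail::RangeRounder;
  void set_allowed_range(std::size_t extent) { extent_ = extent; }

  std::size_t extent_;
};

namespace detail {
struct RangeRounder {
  // Replace the rounded-up extent with the range the user asked for.
  static void restore_user_range(Item &it, std::size_t user_extent) {
    assert(user_extent <= it.get_range() && "user range exceeds rounded range");
    it.set_allowed_range(user_extent);
  }
};
} // namespace detail
```

User code can still call `get_range()`, but a direct call to `set_allowed_range` from outside `detail::RangeRounder` would fail to compile.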

keryell · 2020-11-06T21:32:46Z

Just curious about whether you do more work leading to UB.

jbrodman · 2020-11-06T21:42:56Z

> Just curious about whether you do more work leading to UB.

I think this is meant to be a better version of a hack proof of concept I did for John a while ago: jbrodman@fc26a1f

The intent is to use C++ tricks to submit a range that tends to execute faster on the device and still be correct (of course).
In some sense this is working around less-than-great handling of null work group sizes in lower level runtimes.
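
The idea of submitting a larger range while staying correct can be sketched on the host in plain C++. This is an assumption-laden illustration rather than the PR's implementation: `RoundedRangeWrapper`, `emulate_parallel_for`, and `count_invocations` are hypothetical names, and the loop runs serially on the host instead of on a device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper kernel: forwards to the user's kernel only for
// indices inside the original (unrounded) range.
template <typename KernelFunc> struct RoundedRangeWrapper {
  std::size_t original_size;
  KernelFunc kernel;

  void operator()(std::size_t id) const {
    if (id >= original_size)
      return; // padding work-item: do nothing
    kernel(id);
  }
};

// Serial host-side emulation of a 1-D parallel_for over a range that is
// rounded up to a multiple of 32.
template <typename KernelFunc>
void emulate_parallel_for(std::size_t n, KernelFunc kernel) {
  const std::size_t rounded = ((n + 31) / 32) * 32;
  assert(rounded >= n);
  RoundedRangeWrapper<KernelFunc> wrapped{n, kernel};
  for (std::size_t id = 0; id < rounded; ++id)
    wrapped(id);
}

// Counts how many times the user's kernel body actually runs.
inline std::size_t count_invocations(std::size_t n) {
  std::size_t calls = 0;
  emulate_parallel_for(n, [&calls](std::size_t) { ++calls; });
  return calls;
}
```

Although the emulated launch covers the rounded-up range, the user's kernel body runs exactly once per index of the original range, which is what keeps the transformation invisible to the programmer.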

Fznamznon

Can we add some tests first?

Fznamznon · 2020-11-18T11:28:53Z

clang/lib/Sema/SemaSYCL.cpp

+      return;
+
+    // The call graph for this translation unit.
+    CallGraph SYCLCG;


AFAIK we already build a callgraph in SemaSYCL. Can we try to re-use it?

Yes, it would be nice to reuse that infrastructure. I first tried pursuing that approach. The result of a scan for calls to this_item would have to be saved somewhere. The existing callgraph traversal leads to various function "attributes" being set. This would be fine, except that calls_this_item is not an attribute. We could define an internal attribute for that. Would that be acceptable? If yes, it would simplify the SemaSYCL changes quite a bit. How would I add such an attribute?

I don't see why not. You can check SYCLRequiresDecomposition for an example of internal attribute. @premanandrao @Fznamznon could you please confirm this is ok?

I think it should be ok.
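
Once the front end has recorded whether a kernel calls this_item, library code can query the result at compile time. The sketch below assumes a per-kernel info struct exposing a `static constexpr bool callsThisItem()` member, as the test in this PR checks for. The struct names and `needs_this_item_fixup` are hypothetical, and what the library actually does with the flag is not shown in this thread.

```cpp
#include <cassert>

// Hypothetical per-kernel info structs, standing in for what a generated
// integration header might contain.
struct KernelInfoUsesThisItem {
  static constexpr bool callsThisItem() { return true; }
};

struct KernelInfoPlain {
  static constexpr bool callsThisItem() { return false; }
};

// Compile-time query: does this kernel need extra handling for this_item?
template <typename KernelInfo> constexpr bool needs_this_item_fixup() {
  return KernelInfo::callsThisItem();
}

static_assert(needs_this_item_fixup<KernelInfoUsesThisItem>(),
              "flag should be set for kernels that call this_item");
static_assert(!needs_this_item_fixup<KernelInfoPlain>(),
              "flag should be clear for kernels that do not");
```

Because the flag is a constant expression, a library could branch on it without any run-time cost for kernels that never call this_item.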

clang/lib/Sema/SemaSYCL.cpp

Fznamznon · 2020-11-18T11:36:48Z

clang/test/CodeGenSYCL/parallel_for_this_item.cpp

+// CHECK: __SYCL_DLL_LOCAL
+// CHECK_NEXT: static constexpr bool callsThisItem() { return 1; }
+
+#include "Inputs/sycl.hpp"


Could you please include the mock SYCL header as if it were a system header, using the -internal-isystem option? Here is an example: https://github.com/intel/llvm/blob/sycl/clang/test/CodeGenSYCL/stall_enable.cpp

romanovvlad · 2020-12-03T17:39:19Z

sycl/include/CL/sycl/handler.hpp

+  ///
+  /// \param Queue is the queue for this handler.
+  /// \return Whether the device is a GPU.
+  bool is_gpu(shared_ptr_class<sycl::detail::queue_impl> Queue);


Suggested change

-  bool is_gpu(shared_ptr_class<sycl::detail::queue_impl> Queue);
+  bool is_gpu(const shared_ptr_class<sycl::detail::queue_impl> &Queue);
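
The motivation for the suggested signature can be shown with `std::shared_ptr`, assuming `shared_ptr_class` behaves like it. In this sketch `QueueImpl` is a hypothetical stand-in for `sycl::detail::queue_impl`, and the two functions only report the reference count they observe.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for sycl::detail::queue_impl.
struct QueueImpl {
  bool gpu = false;
};

// By value: the shared_ptr is copied, so the reference count is
// incremented for the duration of the call.
inline long use_count_by_value(std::shared_ptr<QueueImpl> q) {
  assert(q != nullptr);
  return q.use_count();
}

// By const reference: no copy is made and the count is untouched.
inline long use_count_by_ref(const std::shared_ptr<QueueImpl> &q) {
  assert(q != nullptr);
  return q.use_count();
}
```

For a pointer with a single owner, the by-value version observes a count of 2 and the by-reference version observes 1, so passing by const reference avoids one atomic increment and decrement per call.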

romanovvlad · 2020-12-03T17:44:19Z

sycl/doc/EnvironmentVariables.md

@@ -29,6 +29,8 @@ subject to change. Do not rely on these variables in production code.
 | SYCL_PI_LEVEL_ZERO_MAX_COMMAND_LIST_CACHE | Positive integer | Maximum number of oneAPI Level Zero Command lists that can be allocated with no reuse before throwing an "out of resources" error. Default is 20000, threshold may be increased based on resource availability and workload demand. |
 | SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR | Any(\*) | Disable USM allocator in Level Zero plugin (each memory request will go directly to Level Zero runtime) |
 | SYCL_PI_LEVEL_ZERO_BATCH_SIZE | Integer | Sets a preferred number of commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. |
+| SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE | Any(\*) | Enables tracing of parallel_for invocations with rounded-up ranges. |


Minor. Suggest reusing some level of SYCL_PI_TRACE instead of introducing new variable.

I don't think we should use the SYCL_PI_TRACE env var to control this output. This trace has nothing to do with the plugin.
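
A dedicated variable of this kind can be checked with `std::getenv`. The sketch below assumes that "Any" means the variable takes effect whenever it is set, regardless of its value. `env_flag_set` and `range_rounding_trace_enabled` are hypothetical helper names, not functions from the runtime.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: true if the named environment variable is set,
// whatever its value.
inline bool env_flag_set(const char *name) {
  assert(name != nullptr);
  return std::getenv(name) != nullptr;
}

// Hypothetical query for the tracing variable documented above.
inline bool range_rounding_trace_enabled() {
  return env_flag_set("SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE");
}
```

Keeping the check behind its own variable means the trace can be toggled without enabling plugin-level tracing.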

alexbatashev

@rdeodhar could you please clarify one thing for me? If I have a built application that uses a version of the runtime library without this PR, and I then update the library to a new version with your changes, will the application continue to work as expected?

sycl/test/basic_tests/parallel_for_range_roundup.cpp

alexbatashev

LGTM

romanovvlad · 2020-12-15T11:25:33Z

@jbrodman @pvchupin @premanandrao @elizabethandrews
Could you please [re]approve from documentation and frontend sides?

elizabethandrews

FE changes LGTM

premanandrao

LGTM

* upstream/sycl: (616 commits)
  [SYCL][L0] Implement robust error handling in level_zero plugin (intel#2870)
  [SYCL][NFC] Code clean up (phase 5) revealed by self build. (intel#2907)
  [Driver][NFC] Remove unused variable (intel#2908)
  [Github Action] Enable automatic sync for main branch from llvm-project to llvm (intel#2904)
  [ESIMD][NFC] Remove unnecessary macro checks (intel#2900)
  [SYCL] Fix handling of multiple usages of composite spec constants (intel#2894)
  [SYCL] Adjust parallel-for range global size to improve group size selection (intel#2703)
  [SYCL] Add template parameter support for no_global_work_offset attribute (intel#2839)
  [SYCL] Support LLVM FP intrinsic in llvm-spirv and FE (intel#2880)
  [SYCL]Link Libm FP64 SYCL device library by default (intel#2892)
  [SYCL][NFC] Code clean up (phase 4) revealed by self build. (intel#2878)
  [SYCL][NFC] Code clean up (phase 3) revealed by self build. (intel#2865)
  [SYCL] Fix backend selection for SYCL_DEVICE_TYPE=* (intel#2890)
  [SYCL] Fix spec constants support in integration header (intel#2896)
  [Driver] Update unbundling of offload libraries to use archive type (intel#2883)
  [SYCL][NFC] Clang format SYCL.cpp (intel#2891)
  [CODEOWNERS] Add code owners for DPC++ tools (intel#2884)
  [XPTIFW] Enable in-tree builds (intel#2849)
  [SYCL] Don't dump IR and dot files by default in the LowerWGScope pass (intel#2887)
  [SYCL] Use llvm-link's only-needed option to link device libs (intel#2783)
  ...

erichkeane · 2021-04-01T19:11:21Z

clang/lib/Sema/SemaSYCL.cpp

+      // This transformation leads to a condition where a kernel body
+      // function becomes callable from a new kernel body function.
+      // Hence this test.
+      if ((ParentFD == KernelBody) && isSYCLKernelBodyFunction(FD))


So I found this while working on something else. Is there anything we can do to make this MUCH more selective? The problem we have now is that someone who uses a lambda (or operator()) inside their top-level lambda will have things misdiagnosed.

isSYCLKernelBodyFunction has a simplistic implementation, not introduced by this PR, by the way. One way to improve matters is to recognize the kernel early during parsing, and use an internal attribute to mark it as a KernelBody.

Right, but this use of it actually is a pretty nasty breaking change.

I'm not sure what opportunity the parser has to do that marking, AND it would likely break your patch (since there is no way to mark the 2nd lambda there).

I think we might need some sort of way of having this library opt-into pulling the body-attributes in from the child.

rdeodhar · 2021-04-01T20:17:51Z

Perhaps defining internal attributes for OriginalKernel and WrappedKernel might help. I assume SYCL headers will be able to use these attributes? Then the markup and actions might be better defined. There is some proposed work for SYCL reductions that is also going to use this wrapper method. Maybe tackle a cleanup then.

erichkeane · 2021-04-01T20:21:33Z

> Perhaps defining internal attributes for OriginalKernel and WrappedKernel might help. I assume SYCL headers will be able to use these attributes? Then the markup and actions might be better defined. There is some proposed work for SYCL reductions that is also going to use this wrapper method. Maybe tackle a cleanup then.

I think we'd only need 1 of those attributes (either a 'this is a body-wrapper' or a 'this is really the body') to get this part correct. I'm leaning toward the 'body-wrapper' labeling, simply because we can do that ONLY in the library.

I'm working on refactoring a lot of the MarkDevice code and derivatives, so I'm hoping I can implement that as either a follow-up or as a part of that patch.

erichkeane · 2021-04-01T20:25:25Z

@AaronBallman and @premanandrao and @elizabethandrews : Note this as well, we need to fix this, as we'll end up getting some oddities when people use an operator() inside their kernel lambda (or kernel function).

Variadic functions are not supported in SPIR-V, the only known exception is printf. Signed-off-by: Marcos Maronas <[email protected]> Original commit: KhronosGroup/SPIRV-LLVM-Translator@569972a61c86aa6

rdeodhar added 2 commits October 28, 2020 17:41

[SYCL] Parallel-for range rounding-up for improved group size selecti…

a23f9ad

…on by GPU driver. Signed-off-by: rdeodhar <[email protected]>

Merge branch 'sycl' of https://github.com/intel/llvm into iwgo5

047123f

jbrodman requested review from Pennycook and jbrodman November 3, 2020 18:07

Pennycook requested changes Nov 3, 2020

rdeodhar added 3 commits November 3, 2020 17:47

Correction to wrapper kernel name.

73f50bd

Test correction to backend interoperability interface.

2aad33d

Environment var control to disable optimization; correction to one test.

ac6bf28

cperkinsintel reviewed Nov 6, 2020

rdeodhar added 6 commits November 16, 2020 13:59

Fixes for set_args and this_item usage, and for review comments.

da1a3ab

Merge branch 'sycl' of https://github.com/intel/llvm into iwgo5

e186605

Formatting changes.

eaacd8a

Formatting change.

535745f

Removed unneeded files.

5c6c841

Added some comments requested by reviewers.

dc20fa1

rdeodhar marked this pull request as ready for review November 17, 2020 03:16

rdeodhar requested review from elizabethandrews, Fznamznon, premanandrao and a team as code owners November 17, 2020 03:16

rdeodhar requested a review from smaslov-intel November 17, 2020 03:16

Fznamznon reviewed Nov 17, 2020

rdeodhar added 4 commits November 17, 2020 14:32

Added a test for integration header and one execution test.

d62e2f1

Merge branch 'sycl' of https://github.com/intel/llvm into iwgo5

b860666

Adjustment to test to account for added lines in sycl.hpp.

8677a2b

Changed runtime test.

c340ccf

Fznamznon reviewed Nov 18, 2020

Fznamznon previously approved these changes Nov 30, 2020

bader requested a review from Pennycook November 30, 2020 11:45

Pennycook requested changes Nov 30, 2020

romanovvlad reviewed Dec 3, 2020

alexbatashev reviewed Dec 3, 2020

Minor corrections.

4b9093e

rdeodhar dismissed Fznamznon’s stale review via 4b9093e December 3, 2020 21:48

alexbatashev previously approved these changes Dec 4, 2020

Enabled rounding for CPU also.

a2a6ded

rdeodhar dismissed alexbatashev’s stale review via a2a6ded December 10, 2020 19:39

rdeodhar requested a review from Pennycook December 11, 2020 03:05

Pennycook approved these changes Dec 11, 2020

pvchupin approved these changes Dec 15, 2020

elizabethandrews approved these changes Dec 16, 2020

premanandrao approved these changes Dec 17, 2020

jbrodman approved these changes Dec 17, 2020

romanovvlad approved these changes Dec 17, 2020

romanovvlad merged commit 74a68b7 into intel:sycl Dec 17, 2020

rdeodhar deleted the iwgo5 branch January 15, 2021 17:32

v-klochkov mentioned this pull request Jan 21, 2021

Cannot build CTS #3020

Closed

erichkeane reviewed Apr 1, 2021

AlexeySachkov mentioned this pull request Nov 29, 2022

SYCL spec example throw an instance of 'sycl::_v1::invalid_parameter_error' #7568

Closed

tom91136 mentioned this pull request Jan 16, 2023

Poor performance for 2d/3d range kernels involving primes on CUDA PI #8018

Closed
