Skip to content

[SYCL] Parallel-for range correction to improve group size selection by GPU driver #2703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Dec 17, 2020

Conversation

rdeodhar
Copy link
Contributor

@rdeodhar rdeodhar commented Oct 29, 2020

This change rounds up a parallel-for range to be a multiple of 32. This value can be changed later when we have better strategies for selecting work-group sizes. It works well for now. The rounding-up improves performance by 8-10x for the odd cases when the original range is a prime number. It has negligible performance impact cases where the range is already a multiple of 32.

Signed-off-by: rdeodhar [email protected]

@jbrodman
Copy link
Contributor

jbrodman commented Nov 3, 2020

Very cool! @Pennycook - think this will improve a lot of cases?

Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the look of this, and I think it's going to help a lot of codes. I think it might also close #813, but we should ask @cagnulein to confirm.

@@ -510,6 +510,22 @@ class MarkDeviceFunction : public RecursiveASTVisitor<MarkDeviceFunction> {
FunctionDecl *FD = WorkList.back().first;
FunctionDecl *ParentFD = WorkList.back().second;

// To implement rounding-up of a parallel-for range (Jira 20239)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove reference to internal tracking number.

Suggested change
// To implement rounding-up of a parallel-for range (Jira 20239)
// To implement rounding-up of a parallel-for range

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -104,6 +104,8 @@ template <int dimensions = 1, bool with_offset = true> class item {

bool operator!=(const item &rhs) const { return rhs.MImpl != MImpl; }

void set_allowed_range(const range<dimensions> rnwi) { MImpl.MExtent = rnwi; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this new function is public. Should it be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This seems to introduce a lot of new public functions. I don't think we should be doing that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this is to be transparent to the programmer, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@keryell
Copy link
Contributor

keryell commented Nov 6, 2020

Just curious about whether you do more work leading to UB.

@jbrodman
Copy link
Contributor

jbrodman commented Nov 6, 2020

Just curious about whether you do more work leading to UB.

I think this is meant to be a better version of a hack proof of concept I did for John a while ago: jbrodman@fc26a1f

The intent is to use C++ tricks to submit a range that tends to execute faster on the device and still be correct (of course).
In some sense this is working around less-than-great handling of null work group sizes in lower level runtimes.

@rdeodhar rdeodhar marked this pull request as ready for review November 17, 2020 03:16
Copy link
Contributor

@Fznamznon Fznamznon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some tests first?

return;

// The call graph for this translation unit.
CallGraph SYCLCG;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK we already build a callgraph in SemaSYCL. Can we try to re-use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be nice to reuse that infrastructure. I first tried pursuing that approach. The result of a scan for calls to this_item would have to be saved somewhere. The existing callgraph traversal lead to various function "attributes" being set. This would be fine, except that calls_this_item is not an attribute. We could define an internal attribute for that. Would that be acceptable? If yes, it would simplify the SemaSYCL changes quite a bit. How to add such an attribute?

Copy link
Contributor

@elizabethandrews elizabethandrews Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not. You can check SYCLRequiresDecomposition for an example of internal attribute. @premanandrao @Fznamznon could you please confirm this is ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be ok.

// CHECK: __SYCL_DLL_LOCAL
// CHECK_NEXT: static constexpr bool callsThisItem() { return 1; }

#include "Inputs/sycl.hpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please include mock sycl header like it is a system header, using -internal-isystem option? Here is an example https://github.com/intel/llvm/blob/sycl/clang/test/CodeGenSYCL/stall_enable.cpp .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Fznamznon
Fznamznon previously approved these changes Nov 30, 2020
@bader bader requested a review from Pennycook November 30, 2020 11:45
///
/// \param Queue is the queue for this handler.
/// \return Whether the device is a GPU.
bool is_gpu(shared_ptr_class<sycl::detail::queue_impl> Queue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool is_gpu(shared_ptr_class<sycl::detail::queue_impl> Queue);
bool is_gpu(const shared_ptr_class<sycl::detail::queue_impl> &Queue);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -29,6 +29,8 @@ subject to change. Do not rely on these variables in production code.
| SYCL_PI_LEVEL_ZERO_MAX_COMMAND_LIST_CACHE | Positive integer | Maximum number of oneAPI Level Zero Command lists that can be allocated with no reuse before throwing an "out of resources" error. Default is 20000, threshold may be increased based on resource availabilty and workload demand. |
| SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR | Any(\*) | Disable USM allocator in Level Zero plugin (each memory request will go directly to Level Zero runtime) |
| SYCL_PI_LEVEL_ZERO_BATCH_SIZE | Integer | Sets a preferred number of commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. |
| SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE | Any(\*) | Enables tracing of parallel_for invocations with rounded-up ranges. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. Suggest reusing some level of SYCL_PI_TRACE instead of introducing new variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should use the SYCL_PI_TRACE env var to control this output. This trace has nothing to do with the plugin.

Copy link
Contributor

@alexbatashev alexbatashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdeodhar could you please clarify one thing for me, please? If I have a built application, that uses a version of runtime library without this PR, and then I update the library to a new version with your changes, will the application continue to work as expected?

alexbatashev
alexbatashev previously approved these changes Dec 4, 2020
Copy link
Contributor

@alexbatashev alexbatashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@romanovvlad
Copy link
Contributor

@jbrodman @pvchupin @premanandrao @elizabethandrews
Could you please [re]approve from documentation and frontend sides?

Copy link
Contributor

@elizabethandrews elizabethandrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FE changes LGTM

Copy link
Contributor

@premanandrao premanandrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@romanovvlad romanovvlad merged commit 74a68b7 into intel:sycl Dec 17, 2020
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Dec 21, 2020
* upstream/sycl: (616 commits)
  [SYCL][L0] Implement robust error handling in level_zero plugin (intel#2870)
  [SYCL][NFC] Code clean up (phase 5) revealed by self build. (intel#2907)
  [Driver][NFC] Remove unused variable (intel#2908)
  [Github Action] Enable automatic sync for main branch from llvm-project to llvm (intel#2904)
  [ESIMD][NFC] Remove unnecessary macro checks (intel#2900)
  [SYCL] Fix handling of multiple usages of composite spec constants (intel#2894)
  [SYCL] Adjust parallel-for range global size to improve group size selection (intel#2703)
  [SYCL] Add template parameter support for no_global_work_offset attribute (intel#2839)
  [SYCL] Support LLVM FP intrinsic in llvm-spirv and FE (intel#2880)
  [SYCL]Link Libm FP64 SYCL device library by default (intel#2892)
  [SYCL][NFC] Code clean up (phase 4) revealed by self build. (intel#2878)
  [SYCL][NFC] Code clean up (phase 3) revealed by self build. (intel#2865)
  [SYCL] Fix backend selection for SYCL_DEVICE_TYPE=* (intel#2890)
  [SYCL] Fix spec constants support in integration header (intel#2896)
  [Driver] Update unbundling of offload libraries to use archive type (intel#2883)
  [SYCL][NFC] Clang format SYCL.cpp (intel#2891)
  [CODEOWNERS] Add code owners for DPC++ tools (intel#2884)
  [XPTIFW] Enable in-tree builds (intel#2849)
  [SYCL] Don't dump IR and dot files by default in the LowerWGScope pass (intel#2887)
  [SYCL] Use llvm-link's only-needed option to link device libs (intel#2783)
  ...
@rdeodhar rdeodhar deleted the iwgo5 branch January 15, 2021 17:32
@v-klochkov v-klochkov mentioned this pull request Jan 21, 2021
// This transformation leads to a condition where a kernel body
// function becomes callable from a new kernel body function.
// Hence this test.
if ((ParentFD == KernelBody) && isSYCLKernelBodyFunction(FD))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I found this while working on something else. Is there anything we can do to make this MUCH more selective? The problem we have now is that someone who uses a lambda (or operator()) inside their top-level lambda will have things mis-diagnose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSYCLKernelBodyFunction has a simplistic implementation, not introduced by this PR, by the way. One way to improve matters is to recognize the kernel early during parsing, and use an internal attribute to mark it as a KernelBody.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but this use of it actually is a pretty nasty breaking change.

I'm not sure what opportunity the parser has to do that marking, AND it would likely break your patch (since there is no way to mark the 2nd lambda there).

I think we might need some sort of way of having this library opt-into pulling the body-attributes in from the child.

@rdeodhar
Copy link
Contributor Author

rdeodhar commented Apr 1, 2021

Perhaps defining internal attributes for OriginalKernel and WrappedKernel might help. I assume SYCL headers will be able to use these attributes? Then the markup and actions might be better defined. There is some proposed work for SYCL reductions that is also going to use this wrapper method. Maybe tackle a cleanup then.

@erichkeane
Copy link
Contributor

Perhaps defining internal attributes for OriginalKernel and WrappedKernel might help. I assume SYCL headers will be able to use these attributes? Then the markup and actions might be better defined. There is some proposed work for SYCL reductions that is also going to use this wrapper method. Maybe tackle a cleanup then.

I think we'd only need 1 of those attributes (either, "this is a body-wrapper' or a 'this is really the body') to get this part correct. I'm leaning toward the 'body-wrapper' labeling, simply because we can do that ONLY in the library.

I'm working on refactoring a lot of the MarkDevice code and derivatives, so I'm hoping I can implement that as either a follow-up or as a part of that patch.

@erichkeane
Copy link
Contributor

@AaronBallman and @premanandrao and @elizabethandrews : Note this as well, we need to fix this, as we'll end up getting some oddities when people use an operator() inside their kernel lambda (or kernel function).

jsji pushed a commit that referenced this pull request Sep 21, 2024
Variadic functions are not supported in SPIR-V, the only known exception is printf.

Signed-off-by: Marcos Maronas <[email protected]>

Original commit:
KhronosGroup/SPIRV-LLVM-Translator@569972a61c86aa6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.