[SYCL][ESIMD] Intel "Explicit SIMD" extension documentation draft. #1731
Conversation
Contributors (alphabetic order): Konst Bobrovsky, Peter Caday, Gang Chen, Kai Yu Chen, Ken Lueh, Wei Pan, Xinmin Tian

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
See auto-generated documentation for the complete list of APIs here. (TBD)

#### USM pointer-based memory access
##### Flat-address gather/scatter
This comment shouldn't block this PR, but I'd like us to think more about what we can do to make these interfaces more user-friendly.
Two things stand out:
- A user providing cache hints has to provide a lot more template arguments than required. Could we make this nicer by providing the hints as tag-type arguments?

  ```cpp
  // This works
  float* p;
  simd<uint32_t, 16> offsets;
  auto result = flat_load(p, offsets);

  // Adding cache hints makes it much more complex
  float* p;
  simd<uint32_t, 16> offsets;
  auto result = flat_load<uint32_t, 16, 1, CacheHint::Foo, CacheHint::Bar>(p, offsets);
  ```
- I find the existing names quite confusing. I'd much prefer that the names were more obvious (e.g. `flat_load` could be called `gather` if that's all it does) and/or used overloading where appropriate, to minimize the number of times a reader of the code has to check documentation to figure out what a function is supposed to be doing.
Makes sense to me. @kychendev, please comment.
I'm open to changing the name to make it more user friendly. The cache hint settings are mostly based on the existing implementation. Please check with Gang, who added these memory access APIs.
I do not see why gather/scatter is better naming than load/store. Our current naming scheme has its own consistency. I also do not see why a tagged argument is any better than a template argument.
The more explicit and obvious a name is, the better. Anybody reading a function called "gather" (or "gather load") knows that the function performs a gather operation. I wouldn't know that a "flat load" was a gather without reading the documentation.
A tagged argument is better in this case because it allows more template arguments to be deduced. Expanding on my previous example:
```cpp
// Without cache hints, type and length can be deduced from offsets
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load(p, offsets);

// With cache hints as templates, verbosity increases significantly:
// - Providing any cache hint forces the user to specify the type and length
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load<uint32_t, 16, 1, CacheHint::Foo, CacheHint::Bar>(p, offsets);

// With cache hints as tag types, verbosity is reduced:
// - Providing a cache hint does not prevent deduction of type and length
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load(p, offsets, CacheHint::Foo{});
```
Note also that the templated form prevents a developer from specifying an L3 hint without also explicitly specifying an L1 hint. If `flat_load` accepted a list of hints, it might be possible to specify the hints in any order, and it may be more extensible to future cache hints:

```cpp
auto result = flat_load(p, offsets, l3_cache_hint_foo, ...);
```
@cmc-rep are you being facetious about reading documentation? Descriptive names and interfaces are very important for readable/maintainable/accessible code.
'flat' addresses are only meaningful on Gen, and the convention that load == gather is obviously a Gen-specific notion too, since we keep having to say 'oh this is a gather' over and over just in this conversation!
Maybe it would be best to change the name of this extension back to DPC++-CM?
Reading documentation is very dangerous when designing/writing write-only programs... :-)
Also, another area where it needlessly differs from `std::simd`: non-gather load/store is called `copy_from` and `copy_to`. Why invent new names just to confuse people?
```cpp
// Without cache hints, type and length can be deduced from offsets
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load(p, offsets);
auto result = flat_load<uint32_t, 16, 1, CacheHint::Foo, CacheHint::Bar>(p, offsets);
```
An approach à la https://github.com/chriskohlhoff/propria from @chriskohlhoff would be to add a property to the pointer, such as

```cpp
auto result = flat_load(p, offsets);
auto result = flat_load(decorate<CacheHint::Foo, CacheHint::Bar>(p), offsets);
```

The advantage is that you do not have to change all your API, and all the uses of this decorated pointer will benefit from it. `decorate` is to be bikeshedded accordingly.
added TODO to the headers
CacheHint L1H = CacheHint::Default,
CacheHint L3H = CacheHint::Default>
typename std::enable_if<__check_atomic<Op, T, n, 0>(), simd<T, n>>::type
flat_atomic(T *p, simd<unsigned, n> offset, simd<ushort, n> pred);
We should look into what can be done to simplify these atomic functions and align their design with the other new atomic features. That is perhaps out of scope for this PR (the direction is less clear than for the `reduce` changes, for example), but we should open an issue to track it.
@Pennycook , can you please open the issue then?
I'll open it as soon as this is merged.
thanks
I didn't see this discussed in the extension documentation. I think you should add some discussion of it there, and I'd recommend changing it to
Co-authored-by: John Pennycook <[email protected]> Co-authored-by: rolandschulz <[email protected]>
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
};

template <int __N>
using __mask_type_t = typename __vector_type<uint16_t, __N>::type;
The mask should also be a wrapper around the clang vector type rather than the clang vector type itself.
The internal storage should be implementation-defined. `uint16_t` is a bad choice for some HW. Nor is it how clang vector types work (they use an integer of the same size as the corresponding element type used for comparison, e.g. `long` for `double` and `int` for `float`).
I added a TODO
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
CacheHint L3H = CacheHint::Default>
typename std::enable_if<(n == 16 || n == 32),
                        simd<T, n * __NumChannels(Mask)>>::type
flat_load4(T *p, simd<uint32_t, n> offsets, simd<uint16_t, n> pred = 1);
Can this be parameterized? Why is 4 magic?
4 maps to a Gen hardware memory access instruction (message).
I understand, but is that what it is going to be like for all future hardware? Would it be wise to parameterize this so that we don't need new function names like `flat_loadX` at some point?
I'm not sure. For now there are only 1 and 4, and people would have to use a more verbose templated syntax anticipating other widths, and it is not clear if/when those will happen.
Can't this be deduced?
Probably. Do you have an example?
```cpp
#include <iostream>
#include <CL/sycl.hpp>
#include <sycl_[[sycl_explicit_simd]].hpp>
```
This looks like a find+replace error.
thanks
@rolandschulz , please take a look at the changes @kychendev pushed recently to address your comments.
other devices will result in an error.

Kernels and `SYCL_EXTERNAL` functions using ESP must be explicitly marked with
the `[[sycl_explicit_simd]]` attribute. Subgroup size query within such
I don't want to bikeshed on this, but just a note: I think `[[sycl_explicit_simd]]` is better than a macro (for all the reasons @rolandschulz pointed out), but the spelling here is different from attributes like `[[intel::reqd_work_group_size]]`.
We should decide what rules we're following for attributes (if any). I'll open an issue for this as well.
I think we should also update all the examples to align with this proposal: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/KernelRHSAttributes/SYCL_INTEL_attribute_style.asciidoc
Done. (@bader - I assume alignment means a function type attribute placed before the definition instead of a function attribute placed before the declaration.)
This one is implemented in #1743
That seems useful for this SIMD architecture. Since it is also specific to some instructions, perhaps you could add some links to public presentations/documentation to help understanding?
@keryell, @Ruyk, there is only documentation for the prior incarnation called "C-for-Metal". Some links: https://01.org/c-for-metal-development-package/downloads/cm-programming-reference-1 @kychendev, please add if you are aware of anything else.
Co-authored-by: John Pennycook <[email protected]>
@Pennycook, I think @kychendev updated the reduction APIs - please take a look.
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
Looks great, thanks!
Here are some future directions in which this API is intended to evolve:
- enabling this extension for other architectures, such as x86, by extracting
  and clearly marking the generic and target-dependent API portions
- aligning with `std::simd` and maybe providing a `std::simd` implementation atop
Why would we want to implement `std::simd` on top of `sycl::intel::gpu::simd` rather than directly implementing the former on top of clang vectors?
To avoid re-implementing lots of stuff and to take advantage of `sycl::intel::gpu::simd` efficiency.
Implementing it on top of `gpu::simd` is the strategy which requires reimplementing the `std::simd` features missing from `gpu::simd`. On the other hand, there is an existing implementation directly on top of gcc/clang vectors: https://github.com/VcDevel/std-simd . We should explore how we can reuse most of that implementation for the GPU as well.
This is a pretty fundamental design question; I'm sorry if I am missing context, but I only found out about your work today. First of all, 👍 thanks for working on a SIMD type for GPUs. I have wanted to do this for years but didn't have the manpower/time.
What I'd like to know is why you can't (or don't want to) take the `std::experimental::simd` specification as is. I'd be very happy to help, especially with folding improvements and extensions back into C++ committee papers targeting C++23. If the `simd<T, ?>` type is 1:1 portable between CPU and GPU, that would be a huge win in terms of software maintenance and freedom to move to newer platforms.
The short answer is that the primary target for this specification is Intel GPU, which is reflected in the namespace. But generalizing the spec (and implementation) to fit more devices and be more compatible with std C++ is definitely welcome, as long as it does not hurt GPU performance. The `simd<T, ?>` type is portable to CPU, but the performance of various associated APIs is another question. Our plan is to upstream the existing initial implementation, then iteratively improve it along with the specification, making sure it remains functional and performant on Intel GPU.
> The short answer is that the primary target for this specification is Intel GPU, which is reflected in the namespace.

Right. I think it totally makes sense to explore what would be the perfect API for programming a specific target. It makes less sense to write applications with it, because portability and being able to move to newer/other hardware is an important reason for using SIMD types in the first place.
Also, "Intel GPU" is a narrow target right now. But do you know whether it will stay that way?

> But generalizing the spec (and implementation) to fit more devices and be more compatible with std C++ is definitely welcome, as long as it does not hurt GPU performance.

👍 A SIMD type that doesn't allow you to reach full performance is a failure (or hopefully just buggy and needs to be fixed). Performance is the reason why we use it.
I guess what I'd like to discuss is: if you deviate from the Parallelism TS 2, why? Is it to simplify your implementation, is it because of missing functionality, is it because of performance, or something else? This is important feedback to the C++ committee when merging the TS into the IS is discussed.
To help you understand our points of view, I recommend you take a look at the documents and examples at https://github.com/intel/cm-compiler
Be sure that all the interesting links which were mentioned in the discussions end up in the extension description itself, so they are not lost for the future.
Added these comments to the bottom of the doc as TODO
The `sycl::intel::gpu::simd` class is a vector templated on some element type. The element type must be a vectorizable type. The set of vectorizable types is the set of fundamental SYCL arithmetic types (C++ arithmetic types or the `half` type) excluding `bool`. The length of the vector is the second template parameter.

Each simd class object is mapped to a consecutive block of general register
This isn't guaranteed. And any stack variable is mapped to a register unless spilled. I'm not sure what you are trying to say with this sentence.
reworded
// Constructors.
simd_view(BaseTy &Base, RegionTy Region);
simd_view(BaseTy &&Base, RegionTy Region);
simd_view(simd_view &Other);
Is this intentionally not a correct copy constructor (it would need to be `const` for that)?
I believe we agreed that `simd_view` would have a deleted copy and move constructor. Why are they suddenly back?
added TODO to the headers
using element_type = typename ShapeTy::element_type;

// Constructors.
simd_view(BaseTy &Base, RegionTy Region);
Why are this and the next constructor public? Those should only be called internally by e.g. `select`, correct?
added TODO to the headers
Co-authored-by: John Pennycook <[email protected]>
auto Acc1 = Buf1.get_access<cl::sycl::access::mode::read>(Cgh);
auto Acc2 = Buf2.get_access<cl::sycl::access::mode::read_write>(Cgh);

Cgh.single_task<class KernelID>([=] () [[intel::sycl_explicit_simd]] {
I do not remember whether we can remove the `()` when we have an attribute...
sycl::intel::gpu::simd<float, VL> va = sycl::intel::gpu::flat_block_load<float, VL>(A + offset);
sycl::intel::gpu::simd<float, VL> vb = sycl::intel::gpu::flat_block_load<float, VL>(B + offset);
sycl::intel::gpu::simd<float, VL> vc = va + vb;
sycl::intel::gpu::flat_block_store<float, VL>(C + offset, vc);
Is it not possible to infer `float` and `VL`?
Compiler complains that it can't determine vector element type
Definitely useful for SIMD architectures and... FPGA! :-) @aisoard @javier-cabezas
@Ruyk , @keryell , @mattkretz , @iburyl , @rolandschulz , and other reviewers - apologies for the delay. This is an initial experimental version to try out functionality and performance. We are going to work with the community on incrementally improving the APIs to bring them closer to std C++ and SYCL in the next several months. At this point we would like to
Then review can be resumed, and we can keep adding comments to this document and/or the sources.
Will approve after @rolandschulz does.
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
- Collected TODOs and:
  * addressed some of them
  * summarized some of those at the end of this document
  * added others directly to the source code
- Removed API definitions from the spec, as they are now part of the source base

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
@rolandschulz, @Pennycook, @keryell, @Ruyk, @iburyl, @mattkretz, @jasonsewall-intel - sorry for the delay. I updated the spec, please take another look. Specifically:
Thanks for adding the TODOs, @kbobrovs. This LGTM.
Thanks, looks good so far, and looking forward to see those TODO's acted upon later on!
All remaining issues will be resolved later
This PR starts a series of PRs to integrate the DPC++ "Explicit SIMD" extension for efficient Intel GPU programming. This PR provides an overall description of the extension.
Contributors (alphabetic order):
Konst Bobrovsky
Peter Caday
Gang Chen
Kai Yu Chen
Ken Lueh
Wei Pan
Xinmin Tian
The upcoming changes will be spread across a number of components:
- `-fsycl-explicit-simd` option added to tweak FE behavior slightly compared to normal SYCL. In the future it is expected that no such distinction will be necessary, and the need for the extra `-fsycl-explicit-simd` option will go away.
- `__SYCL_EXPLICIT_SIMD__` macro is set (by the device FE only).
- This is the largest piece, implementing the essence of the ESIMD extension. It provides APIs to write device code in a close-to-metal manner.
More description of changes in each component will come with respective PR.