[SYCL][ESIMD] Intel "Explicit SIMD" extension documentation draft. #1731


Merged
merged 13 commits into from
Jul 21, 2020

Conversation

kbobrovs
Contributor

@kbobrovs kbobrovs commented May 21, 2020

This PR starts a series of PRs to integrate the DPC++ "Explicit SIMD" extension for efficient Intel GPU programming. This PR provides an overall description of the extension.
Contributors (alphabetic order):
Konst Bobrovsky
Peter Caday
Gang Chen
Kai Yu Chen
Ken Lueh
Wei Pan
Xinmin Tian

The upcoming changes will be spread across a number of components:

  • Driver
    The -fsycl-explicit-simd option is added to tweak FE behavior slightly compared to normal SYCL. In the future no such distinction is expected to be necessary, and the extra -fsycl-explicit-simd option will go away.
  • Front-End
    • recognize and mark explicit SIMD kernels and functions
    • allow non-constant globals used in device code
    • the __SYCL_EXPLICIT_SIMD__ macro is defined (device FE only)
    • LLVM passes are run (-disable-llvm-passes is not added) (device FE only)
  • LLVM
    • A number of ESIMD-specific LLVM-IR passes are added
  • SYCL RT - ESIMD-specific tweaks in accessors (using image buffers), 2D images,...
  • New SIMD API (~9000 lines)
    This is the largest piece, implementing the essence of the ESIMD extension. It provides APIs to write device code in a close-to-metal manner.
  • llvm-spirv translator
  • (will be external github project) GenXIntrinsics llvm project. Provides APIs for low-level intrinsic generation.

More description of changes in each component will come with respective PR.


Signed-off-by: Konstantin S Bobrovsky <[email protected]>
See auto-generated documentation for the complete list of APIs here. (TBD)

#### USM pointer-based memory access
##### Flat-address gather/scatter
Contributor

This comment shouldn't block this PR, but I'd like us to think more about what we can do to make these interfaces more user-friendly.

Two things stand out:

  1. A user providing cache hints has to provide a lot more template arguments than required. Could we make this nicer by providing the hints as tag-type arguments?

    // This works
    float* p;
    simd<uint32_t, 16> offsets;
    auto result = flat_load(p, offsets);
    
    // Adding cache hints makes it much more complex
    float* p;
    simd<uint32_t, 16> offsets;
    auto result = flat_load<uint32_t, 16, 1, CacheHint::Foo, CacheHint::Bar>(p, offsets);
  2. I find the existing names quite confusing. I'd much prefer that the names were more obvious (e.g. flat_load could be called gather if that's all it does) and/or used overloading where appropriate to minimize the number of times a reader of code has to check documentation to figure out what a function is supposed to be doing.

Contributor Author

Makes sense to me. @kychendev, please comment.

Contributor

I'm open to changing the name to make it more user friendly. The cache hint settings are mostly based on the existing implementation. Please check with Gang, who added these memory access APIs.

Contributor

I do not see why gather/scatter is better naming than load/store. Our current naming scheme has its own consistency. I also do not see why a tagged argument is any better than a template argument.

Contributor

The more explicit and obvious a name is, the better. Anybody reading a function called "gather" or "gather load" knows that it performs a gather operation. I wouldn't know that a "flat load" was a gather without reading the documentation.

A tagged argument is better in this case because it allows for more template arguments to be deduced. Expanding on my previous example:

// Without cache hints, type and length can be deduced from offsets
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load(p, offsets);

// With cache hints as templates, verbosity increases significantly:
// - Providing any cache hint forces the user to specify the type and length
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load<uint32_t, 16, 1, CacheHint::Foo, CacheHint::Bar>(p, offsets);

// With cache hints as tag types, verbosity is reduced:
// - Providing a cache hint does not prevent deduction of type and length
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load(p, offsets, CacheHint::Foo{});

Note also that the templated form prevents a developer from specifying an L3 hint without also explicitly specifying an L1 hint. If flat_load accepted a list of hints, it might be possible to refactor the hints to specify them in any order, and it may be more extensible to future cache hints:

auto result = flat_load(p, offsets, l3_cache_hint_foo, ...);


@cmc-rep are you being facetious about reading documentation? Descriptive names and interfaces are very important for readable/maintainable/accessible code.

'flat' addresses are only meaningful on Gen, and the convention that load == gather is obviously a Gen-specific notion too, since we keep having to say 'oh this is a gather' over and over just in this conversation!

Maybe it would be best to change the name of this extension back to DPC++-CM?

Contributor

Reading documentation is very dangerous when designing/writing write-only programs... :-)

Contributor

Also another area where it needlessly differs from std::simd. Non-gather load/store there is called copy_from and copy_to. Why invent new names just to confuse people?

Contributor

// Without cache hints, type and length can be deduced from offsets
float* p;
simd<uint32_t, 16> offsets;
auto result = flat_load(p, offsets);
auto result = flat_load<uint32_t, 16, 1, CacheHint::Foo, CacheHint::Bar>(p, offsets);

An approach à la https://github.com/chriskohlhoff/propria from @chriskohlhoff would be to add a property to the pointer, such as

auto result = flat_load(p, offsets);
auto result = flat_load(decorate<CacheHint::Foo, CacheHint::Bar>(p), offsets);

The advantage is that you do not have to change your whole API, and all the uses of this
decorated pointer will benefit from it.
decorate is to be bikeshedded accordingly.

Contributor Author

added TODO to the headers

CacheHint L1H = CacheHint::Default,
CacheHint L3H = CacheHint::Default>
typename std::enable_if<__check_atomic<Op, T, n, 0>(), simd<T, n>>::type
flat_atomic(T *p, simd<unsigned, n> offset, simd<ushort, n> pred);
Contributor

We should look into what can be done to simplify these atomic functions and align their design with the other new atomic features. That is perhaps out of scope for this PR (the direction is less clear than for the reduce changes, for example) but we should open an issue to track it.

Contributor Author

@Pennycook , can you please open the issue then?

Contributor

I'll open it as soon as this is merged.

Contributor Author

thanks

@Pennycook
Contributor

The -fsycl-esimd option is added to tweak FE behavior slightly compared to normal SYCL. In the future no such distinction is expected to be necessary, and the extra -fsycl-esimd option will go away.

I didn't see this discussed in the extension documentation. I think you should add some discussion of it there, and I'd recommend changing it to -fsycl-explicit-simd or just -fsycl-simd.

Co-authored-by: John Pennycook <[email protected]>
Co-authored-by: rolandschulz <[email protected]>
};

template <int __N>
using __mask_type_t = typename __vector_type<uint16_t, __N>::type;
Contributor

The mask should also be a wrapper around the clang-vector type rather than the clang-vector type itself.

The internal storage should be implementation defined. uint16_t is a bad choice for some HW. Nor is it how clang vector types work (they use a same-sized int as the corresponding vector type used for comparison, e.g. long for double and int for float).

Contributor Author

I added a TODO

CacheHint L3H = CacheHint::Default>
typename std::enable_if<(n == 16 || n == 32),
simd<T, n * __NumChannels(Mask)>>::type
flat_load4(T *p, simd<uint32_t, n> offsets, simd<uint16_t, n> pred = 1);


Can this be parameterized? Why is 4 magic?

Contributor Author

4 maps to a Gen hardware memory access instruction (message).


I understand, but is that what it is going to be like for all future hardware? Would it be wise to parameterize this so that we don't need new function names for flat_loadX at some point?

Contributor Author

I'm not sure. For now there are only 1 and 4, and people would have to use a more verbose templated syntax in anticipation of other widths, when it is not clear if or when those will happen.


Can't this be deduced?

Contributor Author

Probably. Do you have an example?

```cpp
#include <iostream>
#include <CL/sycl.hpp>
#include <sycl_[[sycl_explicit_simd]].hpp>
```


This looks like a find+replace error.

Contributor Author

thanks

@kbobrovs
Contributor Author

@rolandschulz , please take a look at the changes @kychendev pushed recently to address your comments

other devices will result in error.

Kernels and `SYCL_EXTERNAL` functions using ESP must be explicitly marked with
the `[[sycl_explicit_simd]]` attribute. Subgroup size query within such
Contributor

I don't want to bikeshed on this, but just a note: I think [[sycl_explicit_simd]] is better than a macro (for all the reasons @rolandschulz pointed out) but the spelling here is different to attributes like [[intel::reqd_work_group_size]].

We should decide what rules we're following for attributes (if any). I'll open an issue for this as well.

Contributor Author

Done. (@bader - I assume alignment means function type attribute placed before definition instead of function attribute placed before the declaration)

@kbobrovs kbobrovs requested a review from Pennycook May 28, 2020 18:41
@kbobrovs
Contributor Author

The -fsycl-esimd option is added to tweak FE behavior slightly compared to normal SYCL. In the future no such distinction is expected to be necessary, and the extra -fsycl-esimd option will go away.

I didn't see this discussed in the extension documentation. I think you should add some discussion of it there, and I'd recommend changing it to -fsycl-explicit-simd or just -fsycl-simd.

This one is implemented in #1743

Contributor

@keryell keryell left a comment


That seems useful for this SIMD architecture. Since it is also specific to some instructions, perhaps you could add links to some public presentations/documentation to aid understanding?

@kbobrovs
Contributor Author

kbobrovs commented Jun 2, 2020

These comments haven't been addressed yet. If @kychendev isn't opposed to the direction, I think we should rename these functions before the merge. It seems like a small change to me.

@Pennycook, I think @kychendev updated the reduction APIs - please take a look

@Pennycook
Contributor

@Pennycook, I think @kychendev updated the reduction APIs - please take a look

Looks great, thanks!

Here are some future directions in which this API is intended to evolve:
- enabling this extension for other architectures, such as x86, extracting
and clearly marking the generic and target-dependent API portions
- aligning with `std::simd` and maybe providing `std::simd` implementation atop
Contributor

Why would we want to implement std::simd on top of sycl::intel::gpu::simd rather than directly implementing the former on top of clang vectors?

Contributor Author

To avoid re-implementing lots of stuff, and to take advantage of sycl::intel::gpu::simd efficiency.

Contributor

Implementing it on top of gpu::simd is the strategy which requires a reimplementation of std::simd features missing from gpu::simd. On the other hand there is an existing implementation directly on top of gcc/clang vector: https://github.com/VcDevel/std-simd . We should explore how we can reuse most of that implementation also for the GPU.


This is a pretty fundamental design question; I'm sorry if I am missing context, but I just found out about your work today. First of all, 👍 thanks for working on a SIMD type for GPU. I have wanted to do this for years but didn't have the manpower/time.

What I'd like to know is why you can't (don't want to) take the std::experimental::simd specification as is. I'd be very happy to help. Especially with folding back improvements and extensions into C++ committee papers targeting C++23. If the simd<T, ?> type is 1:1 portable between CPU and GPU that would be a huge win in terms of software maintenance, and freedom to move to newer platforms.

Contributor Author

The short answer is that the primary target for this specification is Intel GPU, which is reflected in the namespace.

But generalizing the spec (and implementation) to fit more devices and be more compatible
with std C++ is definitely welcome, as long as it does not hurt GPU performance. simd<T, ?>
type is portable to CPU, but performance of various associated APIs is another question. Our
plan is to upstream the existing initial implementation, then iteratively improve it along with
the specification, making sure it remains functional and performant on Intel GPU.


The short answer is that the primary target for this specification is Intel GPU, which is reflected in the namespace.

Right. I think it totally makes sense to explore what would be the perfect API for programming a specific target. It makes less sense to write applications with it. Because portability and being able to move to newer/other hardware is an important reason for using SIMD types in the first place.
Also, "Intel GPU" is a narrow target right now. But do you know whether it will stay that way?

But generalizing the spec (and implementation) to fit more devices and be more compatible
with std C++ is definitely welcome, as long as it does not hurt GPU performance.

👍 A SIMD type that doesn't allow you to reach full performance is a failure (or hopefully just buggy and needs to be fixed). Performance is the reason why we use it.

I guess what I'd like to discuss is: If you deviate from the Parallelism TS 2, why? Is it to simplify your implementation, is it because of missing functionality, is it because of performance, or? This is important feedback to the C++ committee when merging the TS into the IS is discussed.

Contributor

To help you understand our points of view, I recommend you take a look at the documents and examples at https://github.com/intel/cm-compiler

Contributor

Be sure that all the interesting links mentioned in the discussions end up in the extension description itself, so they are not lost in the future.

Contributor Author

Added these comments to the bottom of the doc as TODO


The `sycl::intel::gpu::simd` class is a vector templated on some element type. The element type must be a vectorizable type. The set of vectorizable types is the set of fundamental SYCL arithmetic types (C++ arithmetic types or the `half` type) excluding `bool`. The length of the vector is the second template parameter.

Each simd class object is mapped to a consecutive block of general register
Contributor

This isn't guaranteed. And any stack variable is mapped to a register unless spilled. Not sure what you are trying to say with this sentence.

Contributor Author

reworded

// Constructors.
simd_view(BaseTy &Base, RegionTy Region);
simd_view(BaseTy &&Base, RegionTy Region);
simd_view(simd_view &Other);
Contributor

Is it intentional that this is not a correct copy constructor (the parameter would need to be const for that)?

I believe we agreed that simd_view would have a deleted copy and move constructor. Why are they suddenly back?

Contributor Author

added TODO to the headers

using element_type = typename ShapeTy::element_type;

// Constructors.
simd_view(BaseTy &Base, RegionTy Region);
Contributor

Why are this and the next constructor public? They should only be called internally, e.g. by select, correct?

Contributor Author

added TODO to the headers


auto Acc1 = Buf1.get_access<cl::sycl::access::mode::read>(Cgh);
auto Acc2 = Buf2.get_access<cl::sycl::access::mode::read_write>(Cgh);

Cgh.single_task<class KernelID>([=] () [[intel::sycl_explicit_simd]] {
Contributor

I do not remember whether we can remove the () when we have an attribute...

sycl::intel::gpu::simd<float, VL> va = sycl::intel::gpu::flat_block_load<float, VL>(A + offset);
sycl::intel::gpu::simd<float, VL> vb = sycl::intel::gpu::flat_block_load<float, VL>(B + offset);
sycl::intel::gpu::simd<float, VL> vc = va + vb;
sycl::intel::gpu::flat_block_store<float, VL>(C + offset, vc);
Contributor

Is it not possible to infer float and VL?

Contributor Author

The compiler complains that it can't determine the vector element type.

@keryell
Contributor

keryell commented Jun 4, 2020

Definitely useful for SIMD architectures and... FPGA! :-) @aisoard @javier-cabezas
It would be nice to converge on the naming at some point.

@bader bader changed the title [SYCL][SIMD] Intel "Explicit SIMD" extension documentation draft. [SYCL][ESIMD] Intel "Explicit SIMD" extension documentation draft. Jun 12, 2020
@bader bader added the esimd Explicit SIMD feature label Jun 12, 2020
@kbobrovs
Contributor Author

@Ruyk , @keryell , @mattkretz , @iburyl , @rolandschulz , and other reviewers - apologies for delay.
All review comments are greatly appreciated, thank you all!
Some of them have been addressed in the PRs with actual SIMD library code, and will be propagated to the spec shortly:
#1853
#1854
#1855
#1856
(At some point we should create automated API spec generation from the sources to avoid duplication in two places.)

This is an initial experimental version to try out functionality and performance. We are going to work with the community on incrementally improving the APIs to bring them closer to std C++ and SYCL over the next several months. At this point we would like to

  • "freeze" the experimental APIs until
    • we have some working version in this branch
    • we have initial performance regression suite in place to check API changes
  • and
    • collect all unresolved review comments and add them to the TODO section in this document
    • merge the current state of the experimental spec

Then review can be resumed, and we can keep adding comments to this document and/or the sources.

@jbrodman
Contributor

Will approve after @rolandschulz does.

kbobrovs added 2 commits June 26, 2020 14:04
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
- Collected TODOs and:
  * addressed some of them
  * summarized some of those in the end of this document
  * added others directly to the source code
- Removed API definitions from the spec, as they are now part of
  the source base

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
@kbobrovs
Contributor Author

@rolandschulz, @Pennycook, @keryell, @Ruyk, @iburyl, @mattkretz, @jasonsewall-intel - sorry for the delay. I updated the spec; please take another look. Specifically:

  • removed the API definition, as we now have sources. The spec will just reference the auto-generated API docs
  • addressed some of the comments
  • summarized others in the end of the doc
  • moved your comments to specific APIs as TODOs to the code, please see the following patches:

Contributor

@Pennycook Pennycook left a comment


Thanks for adding the TODOs, @kbobrovs. This LGTM.

Contributor

@Ruyk Ruyk left a comment


Thanks, looks good so far, and looking forward to seeing those TODOs acted upon later on!

Contributor

@rolandschulz rolandschulz left a comment


All remaining issues will be resolved later

@kbobrovs kbobrovs merged commit 84bf234 into intel:sycl Jul 21, 2020
@kbobrovs kbobrovs deleted the esimd-spec branch July 21, 2020 04:36