[Testing][SYCL] Do not decompose structs with pointers #5814


Closed · wants to merge 2 commits

Conversation

Fznamznon
Contributor

No description provided.

@bader
Contributor

bader commented Mar 15, 2022

@Fznamznon, can we consider passing the whole SYCL kernel (lambda object) via a single kernel parameter? This should reduce the overhead of setting multiple kernel parameters.

@Fznamznon
Contributor Author

@Fznamznon, can we consider passing the whole SYCL kernel (lambda object) via a single kernel parameter? This should reduce the overhead of setting multiple kernel parameters.

I think it would be a much more complicated rework of SemaSYCL than just disabling decomposition of struct kernel arguments in some cases (as I'm trying to do in this PR), because the whole SemaSYCL kernel building revolves around kernel object fields. That said, I don't see it as impossible (assuming we only aim to pass the whole SYCL kernel object when it doesn't contain special classes inside).

@bader
Contributor

bader commented Mar 15, 2022

That said, I don't see it as impossible (assuming we only aim to pass the whole SYCL kernel object when it doesn't contain special classes inside).

I'm assuming the same.

Need to see if there are bigger problems in the llvm test suite before spending time on LIT tests.
@Fznamznon
Contributor Author

/summary:run

@Fznamznon
Contributor Author

FYI: @elizabethandrews, @gmlueck, results are good so far. I'm waiting for the summary run.

@elizabethandrews
Contributor

@Fznamznon, can we consider passing the whole SYCL kernel (lambda object) via a single kernel parameter? This should reduce the overhead of setting multiple kernel parameters.

I think it would be a much more complicated rework of SemaSYCL than just disabling decomposition of struct kernel arguments in some cases (as I'm trying to do in this PR), because the whole SemaSYCL kernel building revolves around kernel object fields. That said, I don't see it as impossible (assuming we only aim to pass the whole SYCL kernel object when it doesn't contain special classes inside).

Since we now have two passes over the kernel object to detect whether a struct requires decomposition, we may be able to pass the entire kernel object (if it contains no special types) with a minimal change.

@elizabethandrews
Contributor

@Fznamznon, can we consider passing the whole SYCL kernel (lambda object) via a single kernel parameter? This should reduce the overhead of setting multiple kernel parameters.

I think it would be a much more complicated rework of SemaSYCL than just disabling decomposition of struct kernel arguments in some cases (as I'm trying to do in this PR), because the whole SemaSYCL kernel building revolves around kernel object fields. That said, I don't see it as impossible (assuming we only aim to pass the whole SYCL kernel object when it doesn't contain special classes inside).

Since we now have two passes over the kernel object to detect whether a struct requires decomposition, we may be able to pass the entire kernel object (if it contains no special types) with a minimal change.

To clarify - I don't think we should do it in this PR since it is still a larger change, but it might be feasible, if required, in a separate patch.
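
For context, a minimal sketch of what such a pre-pass could look like over the clang AST; isSyclSpecialType is a hypothetical stand-in for the real SemaSYCL special-type checks, and none of this is the actual implementation:

#include "clang/AST/DeclCXX.h"

// Hypothetical helper standing in for the real SemaSYCL checks
// (accessor, stream, sampler, ...).
bool isSyclSpecialType(clang::QualType T);

// First pass: report whether any field of the kernel object forces
// decomposition; only then would the object be broken into individual
// kernel arguments.
bool requiresDecomposition(const clang::CXXRecordDecl *RD) {
  for (const clang::FieldDecl *FD : RD->fields()) {
    clang::QualType T = FD->getType();
    if (isSyclSpecialType(T))
      return true;
    if (const clang::CXXRecordDecl *Nested = T->getAsCXXRecordDecl())
      if (requiresDecomposition(Nested))
        return true;
  }
  return false;
}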

@elizabethandrews
Contributor

When classes containing pointers are decomposed, we 'wrap' the pointer in a compiler-generated wrapper. This was done to avoid errors due to unused nested pointers. If such classes are still decomposed (because of some other special type in them), do we want the pointers to remain wrapped?

@Fznamznon
Contributor Author

If such classes are still decomposed (because of some other special type in them), do we want the pointers to remain wrapped?

I would say yes. The wrapping was added to avoid regressions when someone has an unused non-USM pointer inside a struct, so that when the struct is decomposed these pointers don't trigger the SYCL runtime to set them as USM pointer kernel arguments.
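
Roughly what the wrapping amounts to at the source level; the names below (UserData, __wrapper_t) are purely illustrative and not what the compiler actually generates:

// Kernel object field type containing a raw, possibly non-USM pointer.
struct UserData {
  int *MaybeUnused; // captured but potentially never dereferenced on device
  int Value;
};

// When UserData is decomposed, the nested pointer is passed inside a
// generated wrapper struct, so the runtime sets it as an ordinary by-value
// argument instead of a USM pointer argument.
struct __wrapper_t {
  int *Ptr;
};
// Resulting kernel parameters, conceptually: (__wrapper_t, int)
// rather than (int *, int).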

@bader
Contributor

bader commented Mar 15, 2022

Maybe we can add a compiler option to avoid the wrapping, i.e. if the user sets the flag, they promise that all pointers captured inside device code are USM pointers.

@elizabethandrews
Contributor

Maybe we can add a compiler option to avoid the wrapping, i.e. if the user sets the flag, they promise that all pointers captured inside device code are USM pointers.

This might be a good idea if USM pointers can be nested inside classes in user code. I don't know if this is possible with USM pointers though.

@gmlueck
Contributor

gmlueck commented Mar 15, 2022

This might be a good idea if USM pointers can be nested inside classes in user code. I don't know if this is possible with USM pointers though.

Yes, it is definitely possible to nest USM pointers inside classes, where an object of that class type is passed as a kernel argument.

FWIW, this workaround of wrapping the pointer in a synthesized struct is only necessary because the device driver raises an error if a kernel argument is a pointer type and the value is not within a USM buffer region. @bashbaug and I have discussed in the past whether it makes sense for the device driver to raise an error in this case. He might be amenable to removing this error if we can show that it hurts performance. Does it hurt performance to add the wrapper around the pointer argument? Even if it does not hurt performance, removing the error would avoid complexity like this.

The wrapping was added to avoid regressions when someone has an unused non-USM pointer inside a struct, so that when the struct is decomposed these pointers don't trigger the SYCL runtime to set them as USM pointer kernel arguments.

Just a nit here ... the error from the device driver occurs when the pointer is outside of a USM buffer. It does not matter whether the pointer value is used in the kernel.

@elizabethandrews
Contributor

This might be a good idea if USM pointers can be nested inside classes in user code. I don't know if this is possible with USM pointers though.

Yes, it is definitely possible to nest USM pointers inside classes, where an object of that class type is passed as a kernel argument.

So at the moment, nested USM pointers won't work with the DPCPP compiler, right? Because the FE naively wraps all nested pointers? In the FE we can't tell whether a pointer is USM or not, so at the moment we just assume all top-level pointers are USM and nested pointers aren't.

FWIW, this workaround of wrapping the pointer in a synthesized struct is only necessary because the device driver raises an error if a kernel argument is a pointer type and the value is not within a USM buffer region. @bashbaug and I have discussed in the past whether it makes sense for the device driver to raise an error in this case.

I believe there was a crash in the runtime that necessitated this. Let me see if I can find the bug report which led to this change.

He might be amenable to removing this error if we can show that it hurts performance. Does it hurt performance to add the wrapper around the pointer argument?

I don't know. I would assume not, but I am not sure since I am not very familiar with analyzing performance.

Even if it does not hurt performance, removing the error would avoid complexity like this.

The wrapping was added to avoid regressions when someone has an unused non-USM pointer inside a struct, so that when the struct is decomposed these pointers don't trigger the SYCL runtime to set them as USM pointer kernel arguments.

Just a nit here ... the error from the device driver occurs when the pointer is outside of a USM buffer. It does not matter whether the pointer value is used in the kernel.
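
For reference, a minimal sketch of the nested-USM case being discussed, using standard SYCL 2020 USM APIs (the Holder type and sizes are illustrative):

#include <sycl/sycl.hpp>

struct Holder {
  int *Data; // USM pointer nested inside a user-defined struct
};

int main() {
  sycl::queue Q;
  Holder H{sycl::malloc_shared<int>(16, Q)};
  // The pointer reaches the kernel nested inside a captured struct,
  // not as a top-level pointer capture.
  Q.submit([&](sycl::handler &CGH) {
     CGH.single_task([=] { H.Data[0] = 42; });
   }).wait();
  sycl::free(H.Data, Q);
  return 0;
}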

@Naghasan
Contributor

Maybe we can add a compiler option to avoid the wrapping, i.e. if the user sets the flag, they promise that all pointers captured inside device code are USM pointers.

(I haven't read the whole discussion)

Different lowering strategies are quite important, so this could make a nice addition. But I would be worried to see this more or less promoted, as it would directly contribute to preventing DAE (dead argument elimination) from working properly.

It is not uncommon to see SYCL kernels carrying more than they need (quite common with libraries like Kokkos), and in these cases users rely on DAE to reduce the functor weight. In some cases, the kernel would simply not compile without DAE doing its part.

On platforms like CUDA, not reducing the weight of these kernels can decrease performance (as it can lower occupancy).

@bader
Contributor

bader commented Mar 15, 2022

Maybe we can add a compiler option to avoid the wrapping, i.e. if the user sets the flag, they promise that all pointers captured inside device code are USM pointers.

(I haven't read the whole discussion)

Different lowering strategies are quite important, so this could make a nice addition. But I would be worried to see this more or less promoted, as it would directly contribute to preventing DAE (dead argument elimination) from working properly.

It is not uncommon to see SYCL kernels carrying more than they need (quite common with libraries like Kokkos), and in these cases users rely on DAE to reduce the functor weight. In some cases, the kernel would simply not compile without DAE doing its part.

On platforms like CUDA, not reducing the weight of these kernels can decrease performance (as it can lower occupancy).

I see at least two directions to explore:

  1. Pass a "packed" lambda object instead of the original lambda object. The "packed" object would have only "live" data (i.e. data referenced in device code).
  2. I don't know what exactly the problem is with libraries like Kokkos, but if it's related to the register pressure increase caused by passing "dead" kernel arguments, we might try to pass all kernel parameters in a single USM allocation. This way we pass a single pointer and load only "live" parameters from memory.

Both require non-trivial changes, unfortunately...
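
A rough source-level sketch of what direction 2 would amount to, assuming standard SYCL USM APIs; ParamBlock and its fields are purely illustrative:

#include <sycl/sycl.hpp>

// All kernel parameters packed into a single device-visible block.
struct ParamBlock {
  float Scale;
  size_t N;
  float *Data; // assumed to be a USM allocation itself
};

void launch(sycl::queue &Q, const ParamBlock &Host) {
  // Single USM allocation holding the parameters.
  ParamBlock *P = sycl::malloc_device<ParamBlock>(1, Q);
  Q.memcpy(P, &Host, sizeof(ParamBlock)).wait();
  // The kernel receives a single pointer and loads only the "live"
  // parameters it actually uses.
  Q.parallel_for(sycl::range<1>(Host.N), [=](sycl::id<1> I) {
     P->Data[I] *= P->Scale;
   }).wait();
  sycl::free(P, Q);
}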

@Naghasan
Contributor

I see at least two directions to explore:

  1. Pass a "packed" lambda object instead of the original lambda object. The "packed" object would have only "live" data (i.e. data referenced in device code).
  2. I don't know what exactly the problem is with libraries like Kokkos, but if it's related to the register pressure increase caused by passing "dead" kernel arguments, we might try to pass all kernel parameters in a single USM allocation. This way we pass a single pointer and load only "live" parameters from memory.

Both require non-trivial changes, unfortunately...

IMO, 1 is definitely what could offer the most interesting trade-off, but it's indeed quite complex.

  1. is actually how it is done by default in Kokkos (it follows the CUDA approach), but that makes the alias analysis (AA) quite unhappy. Switching to a more traditional way of doing things helps, but compilation fails in some cases, mainly due to SROA failing, which prevents DAE from properly doing its job.

There is a third path, similar to 2, where you would pass the object via constant memory (constant as in OpenCL constant memory). It seems it could be beneficial on CUDA, but it is not a straightforward change either.

@Fznamznon
Contributor Author

Okay, I see a problem here. Due to decomposition, all pointer struct members were transformed into kernel arguments with the global address space. When we stop decomposing structs that contain pointers, the kernel starts having struct kernel arguments that contain pointers to the default address space. The default address space is transformed to the SPIR generic address space at the LLVM IR level. I guess this can affect performance on GPU, since from now on there is no chance for the BE compiler to infer the target address space for wrapped USM pointers passed as kernel arguments. I realized this while looking at a test that was added by #2298. I'm not sure what the right way to fix this is. For now I see the following options:

  • Continue decomposing (and close this PR).
  • Try to pass structs with global pointers inside, but convert them to structs with generic pointers inside to avoid further problems with IR emission (that would be a non-trivial change).
    The same thing happened with arrays of pointers: they started appearing as a wrapper struct with an array of pointers to the generic AS inside.

@gmlueck
Contributor

gmlueck commented Mar 16, 2022

  • Try to pass structs with global pointers inside, but convert them to structs with generic pointers inside to avoid further problems with IR emission (that would be a non-trivial change).
    The same thing happened with arrays of pointers: they started appearing as a wrapper struct with an array of pointers to the generic AS inside.

Taking the test case from #2298, where the kernel argument's type is B:

struct A {
  float *F;
};

struct B {
  int *F1;
  float *F2;
  A F3;
};

Are you suggesting that the CFE can create a new type B_global like this and pass the non-decomposed argument as B_global?

struct A_global {
  __global float *F;
};

struct B_global {
  __global int *F1;
  __global float *F2;
  A_global F3;
};

@Fznamznon
Contributor Author

  • Try to pass structs with global pointers inside, but convert them to structs with generic pointers inside to avoid further problems with IR emission (that would be a non-trivial change).
    The same thing happened with arrays of pointers: they started appearing as a wrapper struct with an array of pointers to the generic AS inside.

Taking the test case from #2298, where the kernel argument's type is B:

struct A {
  float *F;
};

struct B {
  int *F1;
  float *F2;
  A F3;
};

Are you suggesting that the CFE can create a new type B_global like this and pass the non-decomposed argument as B_global?

struct A_global {
  __global float *F;
};

struct B_global {
  __global int *F1;
  __global float *F2;
  A_global F3;
};

Yes. And then somehow, via address space casts and copying, transform B_global back to B inside the kernel (I'm not sure how exactly it will look yet), hoping that it will be optimized out in the end. At least for now I don't see another way to preserve the performance advantage and not decompose structs with pointers.
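
Conceptually, the generated kernel entry could then do something like the following with the B/B_global types from the quoted comment; this is only a sketch of the idea (using the same OpenCL-style __global spelling as above), not the actual codegen:

// Sketch only: each assignment implies an address space cast from
// __global to the generic address space.
void kernel_entry(B_global In) {
  B Local;
  Local.F1 = In.F1;     // __global int *   -> int * (generic)
  Local.F2 = In.F2;     // __global float * -> float * (generic)
  Local.F3.F = In.F3.F; // __global float * -> float * (generic)
  // ... invoke the user's lambda/functor with Local ...
}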

@gmlueck
Contributor

gmlueck commented Mar 16, 2022

At least for now I don't see another way to preserve the performance advantage and not decompose structs with pointers.

All you're doing here is building an assumption into the CFE that all pointers in kernel arguments point to the global address space. This does make sense because there's no way to get a local or private address on the host. However, I don't see why this assumption should be added to the CFE. Why can't the BE compiler make this assumption? Maybe it already makes this assumption, in which case there is no need to decorate these pointers with the global address space.

@Fznamznon
Contributor Author

Fznamznon commented Mar 16, 2022

All you're doing here is building an assumption into the CFE that all pointers in kernel arguments point to the global address space.

Not quite. The assumption is that pointer kernel arguments without an explicit address space attached to them in the source via attributes are USM pointers and therefore should have the global address space.

This does make sense because there's no way to get a local or private address on the host. Why can't the BE compiler make this assumption?

Actually, it is possible to get a pointer with the local address space in kernel arguments using a local accessor. So I'm not sure the BE can make this assumption.

@elizabethandrews
Contributor

All you're doing here is building an assumption into the CFE that all pointers in kernel arguments point to the global address space.

Not quite. The assumption is that pointer kernel arguments without an explicit address space attached to them in the source via attributes are USM pointers and therefore should have the global address space.

@Fznamznon what are the possible explicit address spaces? Looking at the code, it looks like only the sycl_global_device and sycl_global_host address spaces are retained.

auto AS = Quals.getAddressSpace();
// Leave global_device and global_host address spaces as is to help FPGA
// device in memory allocations
if (AS != LangAS::sycl_global_device && AS != LangAS::sycl_global_host)
  Quals.setAddressSpace(LangAS::sycl_global);

@Fznamznon
Contributor Author

@Fznamznon what are the possible explicit address spaces? Looking at the code, it looks like only the sycl_global_device and sycl_global_host address spaces are retained.

Technically it can be any address space, via the OpenCL address space attributes (e.g. def OpenCLGlobalDeviceAddressSpace : TypeAttr).
Yes, you are right, it seems only the global_device and global_host address spaces are retained when a pointer is passed to a kernel.
The thing is that these attributes are not part of the SYCL spec, so they are mostly used by the SYCL headers to implement classes that require a pointer decorated with an address space. That is why our assumption that all pointers passed to a kernel represent data shared between host and device, and therefore can have the global address space, has kind of worked so far.

@gmlueck
Contributor

gmlueck commented Mar 16, 2022

Actually, it is possible to get a pointer with the local address space in kernel arguments using a local accessor. So I'm not sure the BE can make this assumption.

This is a good point. If the kernel captures a local accessor as an argument, the accessor has a member variable that is a pointer to the local address space. However, that member variable is explicitly decorated with __attribute__((opencl_local)), which identifies the address space.

I guess a more precise definition of the assumption is that any undecorated pointer in a kernel argument can be assumed to be a global address space pointer. It still seems like the logic for this assumption could live in either the CFE or the BE compiler.
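
For illustration, roughly how an explicit decoration distinguishes such a member from an undecorated (assumed-global) pointer; this is not the actual accessor implementation:

// Illustrative only: an explicitly decorated member is never treated as a
// USM/global pointer, while an undecorated one falls under the
// "assume global" rule discussed here.
struct local_accessor_like {
  __attribute__((opencl_local)) int *LocalPtr; // explicit address space
};

struct plain_capture {
  int *Ptr; // undecorated: assumed to point to global memory
};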

@bader
Contributor

bader commented Mar 17, 2022

It's much easier to do in the CFE, where the SPIR kernel function wrapper is created. I think maintaining address space adjustment in an LLVM IR -> LLVM IR transformation or in the SPIR-V translator would be more expensive.

FYI: @zahiraam is working on moving the SPIR kernel function generation logic from Sema to the CodeGen library.

@Fznamznon
Contributor Author

I guess a more precise definition of the assumption is that any undecorated pointer in a kernel argument can be assumed to be a global address space pointer.

Yes, this is what the CFE already assumes, and I'd like to preserve it. However, SYCL code is emitted with a "no address space means generic" strategy, so we have to convert address spaces between kernel arguments and the rest of the code.

It still seems like the logic for this assumption could live in either the CFE or the BE compiler.

I still don't see how this assumption can be made in the BE, because it doesn't have information about the source code. Only the CFE does.

@gmlueck
Contributor

gmlueck commented Mar 17, 2022

It's much easier to do in the CFE, where the SPIR kernel function wrapper is created. I think maintaining address space adjustment in an LLVM IR -> LLVM IR transformation or in the SPIR-V translator would be more expensive.

FYI: @zahiraam is working on moving the SPIR kernel function generation logic from Sema to the CodeGen library.

Maybe the refactor that @zahiraam is doing is a good opportunity to rework how aggregate kernel arguments are handled. Decomposing a structure just to change the address space of one member seems like a big hammer.

@github-actions github-actions bot added the Stale label Sep 14, 2022
@github-actions github-actions bot closed this Oct 15, 2022