Skip to content

[L0][UR] Do not set global offset unless required #18242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 6, 2025

Conversation

Pennycook
Copy link
Contributor

@Pennycook Pennycook commented Apr 29, 2025

Previously, zeKernelSetGlobalOffsetExp was called for every kernel launch. The vast majority of kernels are expected to never have an offset, because the offset feature was deprecated in SYCL 2020, and we should optimize for this case.

The SYCL RT currently passes {0, 0, 0} in the case where there is no offset. To optimize this case:

  • A zero offset is treated equivalently to a NULL offset, and zeKernelSetGlobalOffsetExp is not called.
  • A non-zero offset triggers a call to zeKernelSetGlobalOffsetExp before launching the kernel.
  • A non-zero offset triggers a call to zeKernelSetGlobalOffsetExp after launching the kernel, to reset the offset to zero.

This will introduce additional overhead to the uncommon case where offsets are specified, but we plan to remove this anyway.

In the long-term, the check for a {0, 0, 0} offset should probably be moved into the SYCL headers and NULL should be passed directly to UR. However, this will require wide-reaching changes to other UR adapters and the UR specification.

UR adapters check against NULL to avoid setting an offset, so passing
NULL should improve performance.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added the performance Performance related issues label Apr 29, 2025
@Pennycook Pennycook requested a review from a team as a code owner April 29, 2025 14:38
@Pennycook Pennycook requested a review from maarquitos14 April 29, 2025 14:38
@Pennycook
Copy link
Contributor Author

Pennycook commented Apr 29, 2025

Note that there may be more to do here from a performance perspective. I'd prefer to get rid of the "is there an offset?" check entirely, but that would probably require us to separate out the offset vs no offset submission paths, rather than trying to feed everything through a single function.

The vast majority of kernels are expected to never have an offset,
because the offset feature was deprecated in SYCL 2020.

A small number of kernels will still have an offset. Since this case is
uncommon, it is less important to optimize.

There is a pathological case where the same kernel alternates between
different overloads of parallel_for with and without an offset. Keeping
track of whether the last submission had an offset is intended to
address this case, while still allowing us to skip the L0 call.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook requested a review from a team as a code owner April 29, 2025 15:49
Copy link
Contributor

@vinser52 vinser52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SYCL part looks OK.
Today, @Alexandr-Konovalov and I discussed the same issue with GlobalOffset. Great to see that we already have a PR for that.

@Pennycook
Copy link
Contributor Author

@AerialMantis , @Ruyk - Could somebody from Codeplay please take a look at why this is failing on NVIDIA/AMD? I thought those adapters would accept NULL as well, but maybe it's not been tested before...?

Avoids tracking the last global offset at the expense of making offset
kernels slower.

Signed-off-by: John Pennycook <[email protected]>
@kbenzie
Copy link
Contributor

kbenzie commented Apr 30, 2025

@AerialMantis , @Ruyk - Could somebody from Codeplay please take a look at why this is failing on NVIDIA/AMD? I thought those adapters would accept NULL as well, but maybe it's not been tested before...?

If we are going to accept null for the pGlobalWorkOffset then spec changes, UR conformance test changes, and adapter implementation changes are required to correctly handle this case. The reason it doesn't work is as others have noted that the UR spec disallows this. The patch as it stands is broken.

@npmiller
Copy link
Contributor

Regardless of the UR spec issues this patch should fix the Nvidia/AMD test failures with this:

The rest of the adapter code already checks for global offset being NULL, so I think it's worth fixing despite the UR spec issues, so the adapter is not in between the two possibilities. If we decide later on that the offset really should never be NULL we can go back and remove all the checks.

(my PR is standalone so we can merge it as-is, or feel free to cherry-pick this on your branch and merge it in this PR if you prefer)

@Pennycook
Copy link
Contributor Author

Regardless of the UR spec issues this patch should fix the Nvidia/AMD test failures with this:

The rest of the adapter code already checks for global offset being NULL, so I think it's worth fixing despite the UR spec issues, so the adapter is not in between the two possibilities. If we decide later on that the offset really should never be NULL we can go back and remove all the checks.

Thanks. The fact that the adapter code checks for the global offset being NULL (instead of "0, 0, 0") is why I wrote this the way that I did.

If we are going to accept null for the pGlobalWorkOffset then spec changes, UR conformance test changes, and adapter implementation changes are required to correctly handle this case. The reason it doesn't work is as others have noted that the UR spec disallows this. The patch as it stands is broken.

If UR requires an offset, then I am even more certain that we need to have an offset-less entry point to this function. The offset is currently deprecated and will hopefully be removed in the next SYCL specification. Once that happens, we'll be passing three zeroes to every UR function for no reason at all.


To accelerate performance-testing with the offset stuff removed, I'll have a go at rewriting this patch in terms of {0, 0, 0}. I still think that this should be faster.

@kbenzie
Copy link
Contributor

kbenzie commented Apr 30, 2025

If UR requires an offset, then I am even more certain that we need to have an offset-less entry point to this function. The offset is currently deprecated and will hopefully be removed in the next SYCL specification. Once that happens, we'll be passing three zeroes to every UR function for no reason at all.

We don't need to add entry-points, simply changing the optionality of pGlobalWorkOffset is sufficient. That's still a spec change. Happy for that change to be made but this patch would need to actually make that change.

@Pennycook
Copy link
Contributor Author

We don't need to add entry-points, simply changing the optionality of pGlobalWorkOffset is sufficient. That's still a spec change. Happy for that change to be made but this patch would need to actually make that change.

Is there some reason we don't want to add entry-points? I'm very new to looking at UR, and I don't understand the technical reasons that adding entry points would be a bad idea.

Making pGlobalWorkOffset optional would be an improvement, but every function would have to keep checking if the value of pGlobalWorkOffset is NULL, and you'll have at least one such check for every function accepting pGlobalWorkOffset. Why pay that overhead if we expect the most common case to be that an application never uses any offsets?

@Pennycook
Copy link
Contributor Author

I think the test failure here may be related to #18230, although things fail with a different error message:

# RUN: at line 6
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/WorkGroupMemory/Output/basic_usage.cpp.tmp.out
# executed command: env ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/WorkGroupMemory/Output/basic_usage.cpp.tmp.out
# .---command stderr------------
# | terminate called after throwing an instance of 'sycl::_V1::exception'
# |   what():  Native API failed. Native API returns: 20 (UR_RESULT_ERROR_DEVICE_LOST)
# `-----------------------------
# error: command failed with exit status: -6

@kbenzie
Copy link
Contributor

kbenzie commented Apr 30, 2025

Is there some reason we don't want to add entry-points? I'm very new to looking at UR, and I don't understand the technical reasons that adding entry points would be a bad idea.

We already have multiple entry points for launching kernels, each must have its own set of tests, any change made to one of these musts be made to them all. The combinatorial explosion is not easily maintainable and the burden of that maintenance falls on my small team.

Enqueuing of kernel launches does not require multiple entry points in other API's. My question, is adding more desirable rather than fine tuning the ones we have to be the best they can be?

UR ABI is not required to be stable because it is not a user facing API but rather an implementation detail of the SYCL RT.

@Pennycook Pennycook changed the title [SYCL] Pass NULL instead of {0, 0, 0} offset [L0][UR] Do not set global offset unless required May 1, 2025
@Pennycook
Copy link
Contributor Author

We already have multiple entry points for launching kernels, each must have its own set of tests, any change made to one of these musts be made to them all. The combinatorial explosion is not easily maintainable and the burden of that maintenance falls on my small team.

I think we should separate identifying the right thing to do (from a technical perspective) from whether we currently have the resources to do it, rather than dismissing ideas because of the amount of work involved. If we can agree on what the best UR API would look like, we can scope it and then ensure it is resourced appropriately.

Enqueuing of kernel launches does not require multiple entry points in other API's. My question, is adding more desirable rather than fine tuning the ones we have to be the best they can be?

As I said elsewhere, I think they do not require multiple entry points because they don't support these cases. Other APIs do not expose an offset at all, and only support one dimensionality (i.e., three dimensions). By exposing so much flexibility in the UR API we make it harder to use, and we have real examples of that:

  • The fact that the global offset is a pointer has led to some confusion about whether it can (or should) be NULL.
  • The SYCL RT often converts things to 3D before passing them to UR, even though UR could accept any dimensionality.

But we can take some of this discussion offline. I don't think we need to resolve these issues here as part of the review of this PR.

I've rewritten it in a way that doesn't break other adapters or rely on UR specification changes, and rewritten the PR description to match how the current implementation works. I've left a note about future possible changes for posterity. Please re-review, @EwanC and @kbenzie.

@EwanC
Copy link
Contributor

EwanC commented May 1, 2025

But we can take some of this discussion offline. I don't think we need to resolve these issues here as part of the review of this PR.

I've rewritten it in a way that doesn't break other adapters or rely on UR specification changes, and rewritten the PR description to match how the current implementation works. I've left a note about future possible changes for posterity. Please re-review, @EwanC and @kbenzie.

Sounds like a plan, the current version of the PR doesn't touch any code I own so good with me 👍

@kbenzie
Copy link
Contributor

kbenzie commented May 1, 2025

Sounds like a plan, the current version of the PR doesn't touch any code I own so good with me 👍

I echo this comment.

@Pennycook
Copy link
Contributor Author

@intel/llvm-reviewers-runtime @intel/unified-runtime-reviewers-level-zero - I think the failure here is unrelated to my change, so I think this is ready for review.

Copy link
Contributor

@maarquitos14 maarquitos14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No SYCL changes anymore, as far as I can see. I guess GH is not able to recompute required approvals, so I'm approving to unblock.

@Pennycook
Copy link
Contributor Author

This still needs a review from somebody in @intel/unified-runtime-reviewers-level-zero.

@Pennycook
Copy link
Contributor Author

Thanks, @igchor. @intel/llvm-gatekeepers, I think this can be merged now. The test that is failing doesn't seem to be related.

@sarnex sarnex merged commit d6b6271 into intel:sycl May 6, 2025
33 of 34 checks passed
@pbalcer
Copy link
Contributor

pbalcer commented May 7, 2025

This broke UR-based compute-benchmarks which currently do not pass pGlobalWorkOffset to urEnqueueKernelLaunch.

Thread 1 "api_overhead_be" received signal SIGSEGV, Segmentation fault.
0x00007ffff672b0f7 in ur_command_list_manager::appendKernelLaunch (this=0x555557289f98, hKernel=0x555557288180, workDim=3, pGlobalWorkOffset=0x0, pGlobalWorkSize=0x5555558584e0 <local_size>, pLocalWorkSize=0x5555558
58500 <global_size>, numEventsInWaitList=0, phEventWaitList=0x0, phEvent=0x0) at /home/pbalcer/llvm/unified-runtime/source/adapters/level_zero/v2/command_list_manager.cpp:226
226         hasOffset |= pGlobalWorkOffset[i];
(gdb) 

https://github.com/intel/compute-benchmarks/blob/0603a2651d296aa56304dc23d1957302e020e215/source/benchmarks/api_overhead_benchmark/implementations/ur/submit_kernel_ur.cpp#L79

I'll submit a fix for the benchmark since this parameter isn't optional.

@vinser52 fyi, this is why perf CI for UR stopped working.

@Pennycook
Copy link
Contributor Author

This broke UR-based compute-benchmarks which currently do not pass pGlobalWorkOffset to urEnqueueKernelLaunch.

Thread 1 "api_overhead_be" received signal SIGSEGV, Segmentation fault.
0x00007ffff672b0f7 in ur_command_list_manager::appendKernelLaunch (this=0x555557289f98, hKernel=0x555557288180, workDim=3, pGlobalWorkOffset=0x0, pGlobalWorkSize=0x5555558584e0 <local_size>, pLocalWorkSize=0x5555558
58500 <global_size>, numEventsInWaitList=0, phEventWaitList=0x0, phEvent=0x0) at /home/pbalcer/llvm/unified-runtime/source/adapters/level_zero/v2/command_list_manager.cpp:226
226         hasOffset |= pGlobalWorkOffset[i];
(gdb) 

https://github.com/intel/compute-benchmarks/blob/0603a2651d296aa56304dc23d1957302e020e215/source/benchmarks/api_overhead_benchmark/implementations/ur/submit_kernel_ur.cpp#L79

I'll submit a fix for the benchmark since this parameter isn't optional.

@vinser52 fyi, this is why perf CI for UR stopped working.

@kbenzie, @EwanC - The fact that somebody else has made this mistake as well makes me think it would be a good idea to explore making this parameter optional. What are the next steps there?

@Pennycook Pennycook deleted the null-global-offset branch May 7, 2025 11:40
@kbenzie
Copy link
Contributor

kbenzie commented May 7, 2025

@kbenzie, @EwanC - The fact that somebody else has made this mistake as well makes me think it would be a good idea to explore making this parameter optional. What are the next steps there?

The argument will need to be marked [optional] in the spec yaml description here similar to phEvent. Then the generate taget will need to be run to generate the source code, this is currently only enabled in a standalone UR build, the build environment setup is documented here. Once the requirements for the python tooling are installed and the virtual environment activated, configure a standalone UR build with this command from the root of an intel/llvm checkout:

cmake unified-runtime -Bbuild-ur-r -GNinja -DCMAKE_BUILD_TYPE=Release \
  -DUR_FORMAT_CPP_STYLE=ON -DUR_DEVELOPER_MODE=ON \
  -DUR_ENABLE_FAST_SPEC_MODE=ON -DUR_ENABLE_TRACING=ON \
  -DUR_BUILD_ADAPTER_L0=ON -DUR_BUILD_ADAPTER_OPENCL=ON

This should remove the null check from the UR validation layer amongst other things. Commit the generated source file changes.

Next would be to update the tests here to align with the updated optionality of the pGlobalWorkOffset argument. We should have a new similar to the urEnqueueKernelLaunchTest.Success test which sets pGlobalWorkOffset to nullptr, this should make sure all adapters are handling the the spec change properly in CI testing.

With the test in place, the adapter implementations can be updated as necessary.

Hopefully this is enough to get you going, happy to help if something isn't clear.

VPG-SWE-Github pushed a commit to intel/compute-benchmarks that referenced this pull request May 7, 2025
UR Spec requires the global offset to be always provided.
This worked until now for L0 because neither adapter actually
dereferenced that pointer, instead it was just passed directly to L0.

However, that changed recently in
intel/llvm#18242, and now the ur SubmitKernel
tests are failing.

Long-term the solution is to make pGlobalWorkOffset optional, but for
now this patch just adds it to the benchmark.

Signed-off-by: Piotr Balcer <[email protected]>
martygrant pushed a commit that referenced this pull request May 8, 2025
This fixes api_overhead_benchmark_ur after #18242.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants