-
Notifications
You must be signed in to change notification settings - Fork 769
[L0][UR] Do not set global offset unless required #18242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
UR adapters check against NULL to avoid setting an offset, so passing NULL should improve performance. Signed-off-by: John Pennycook <[email protected]>
Note that there may be more to do here from a performance perspective. I'd prefer to get rid of the "is there an offset?" check entirely, but that would probably require us to separate out the offset vs no offset submission paths, rather than trying to feed everything through a single function. |
The vast majority of kernels are expected to never have an offset, because the offset feature was deprecated in SYCL 2020. A small number of kernels will still have an offset. Since this case is uncommon, it is less important to optimize. There is a pathological case where the same kernel alternates between different overloads of parallel_for with and without an offset. Keeping track of whether the last submission had an offset is intended to address this case, while still allowing us to skip the L0 call. Signed-off-by: John Pennycook <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SYCL part looks OK.
Today, @Alexandr-Konovalov and I discussed the same issue with GlobalOffset. Great to see that we already have a PR for that.
@AerialMantis , @Ruyk - Could somebody from Codeplay please take a look at why this is failing on NVIDIA/AMD? I thought those adapters would accept |
Avoids tracking the last global offset at the expense of making offset kernels slower. Signed-off-by: John Pennycook <[email protected]>
If we are going to accept null for the |
Regardless of the UR spec issues this patch should fix the Nvidia/AMD test failures with this: The rest of the adapter code already checks for global offset being (my PR is standalone so we can merge it as-is, or feel free to cherry-pick this on your branch and merge it in this PR if you prefer) |
Thanks. The fact that the adapter code checks for the global offset being
If UR requires an offset, then I am even more certain that we need to have an offset-less entry point to this function. The offset is currently deprecated and will hopefully be removed in the next SYCL specification. Once that happens, we'll be passing three zeroes to every UR function for no reason at all. To accelerate performance-testing with the offset stuff removed, I'll have a go at rewriting this patch in terms of {0, 0, 0}. I still think that this should be faster. |
This reverts commit 489c9fe.
Signed-off-by: John Pennycook <[email protected]>
We don't need to add entry-points, simply changing the optionality of |
Is there some reason we don't want to add entry-points? I'm very new to looking at UR, and I don't understand the technical reasons that adding entry points would be a bad idea. Making |
I think the test failure here may be related to #18230, although things fail with a different error message:
|
We already have multiple entry points for launching kernels, each must have its own set of tests, any change made to one of these musts be made to them all. The combinatorial explosion is not easily maintainable and the burden of that maintenance falls on my small team. Enqueuing of kernel launches does not require multiple entry points in other API's. My question, is adding more desirable rather than fine tuning the ones we have to be the best they can be? UR ABI is not required to be stable because it is not a user facing API but rather an implementation detail of the SYCL RT. |
I think we should separate identifying the right thing to do (from a technical perspective) from whether we currently have the resources to do it, rather than dismissing ideas because of the amount of work involved. If we can agree on what the best UR API would look like, we can scope it and then ensure it is resourced appropriately.
As I said elsewhere, I think they do not require multiple entry points because they don't support these cases. Other APIs do not expose an offset at all, and only support one dimensionality (i.e., three dimensions). By exposing so much flexibility in the UR API we make it harder to use, and we have real examples of that:
But we can take some of this discussion offline. I don't think we need to resolve these issues here as part of the review of this PR. I've rewritten it in a way that doesn't break other adapters or rely on UR specification changes, and rewritten the PR description to match how the current implementation works. I've left a note about future possible changes for posterity. Please re-review, @EwanC and @kbenzie. |
Sounds like a plan, the current version of the PR doesn't touch any code I own so good with me 👍 |
I echo this comment. |
@intel/llvm-reviewers-runtime @intel/unified-runtime-reviewers-level-zero - I think the failure here is unrelated to my change, so I think this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No SYCL changes anymore, as far as I can see. I guess GH is not able to recompute required approvals, so I'm approving to unblock.
This still needs a review from somebody in @intel/unified-runtime-reviewers-level-zero. |
Thanks, @igchor. @intel/llvm-gatekeepers, I think this can be merged now. The test that is failing doesn't seem to be related. |
This broke UR-based compute-benchmarks which currently do not pass
I'll submit a fix for the benchmark since this parameter isn't optional. @vinser52 fyi, this is why perf CI for UR stopped working. |
@kbenzie, @EwanC - The fact that somebody else has made this mistake as well makes me think it would be a good idea to explore making this parameter optional. What are the next steps there? |
The argument will need to be marked
This should remove the null check from the UR validation layer amongst other things. Commit the generated source file changes. Next would be to update the tests here to align with the updated optionality of the With the test in place, the adapter implementations can be updated as necessary. Hopefully this is enough to get you going, happy to help if something isn't clear. |
UR Spec requires the global offset to be always provided. This worked until now for L0 because neither adapter actually dereferenced that pointer, instead it was just passed directly to L0. However, that changed recently in intel/llvm#18242, and now the ur SubmitKernel tests are failing. Long-term the solution is to make pGlobalWorkOffset optional, but for now this patch just adds it to the benchmark. Signed-off-by: Piotr Balcer <[email protected]>
This fixes api_overhead_benchmark_ur after #18242.
Previously, zeKernelSetGlobalOffsetExp was called for every kernel launch. The vast majority of kernels are expected to never have an offset, because the offset feature was deprecated in SYCL 2020, and we should optimize for this case.
The SYCL RT currently passes {0, 0, 0} in the case where there is no offset. To optimize this case:
This will introduce additional overhead to the uncommon case where offsets are specified, but we plan to remove this anyway.
In the long-term, the check for a {0, 0, 0} offset should probably be moved into the SYCL headers and NULL should be passed directly to UR. However, this will require wide-reaching changes to other UR adapters and the UR specification.