Skip to content

[SYCL] fix for leaking commands when exception thrown #16618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

cperkinsintel
Copy link
Contributor

@cperkinsintel cperkinsintel commented Jan 14, 2025

When enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus. During the clean up, we don't want to reenqueue it if we know it has failed before.

The test for this wasn't working on Windows due to some of the shutdown complications there. Instead of simply disabling the test on Windows I decided to better understand and address the problem.

…e thrown. In that case, the command will have a failed EnqueueStatus. During the clean up, we don't want to reenqueue it if we know it has failed before
@sergey-semenov
Copy link
Contributor

Could you please add a unit test for this?

@cperkinsintel
Copy link
Contributor Author

cperkinsintel commented Jan 14, 2025

@sergey-semenov - The mem tests ( UR_L0_LEAKS_DEBUG=1) for max_num_work_groups.cpp is how this was discovered, in its error case. I can make a new e2e test file that explicitly cribs that combination. Would that work? Or do you want a UNIT test as opposed to an E2E one? Not quite sure what that'd look like.

@sergey-semenov
Copy link
Contributor

I mean, since there's no need to involve any backend to verify this, I'd prefer the more lightweight option of adding a unit test here. I think inserting a failed-to-enqueue MockCommand into the graph then checking if its destructor fired after cleaning up the buffer it depends on should be doable and check what we want here. Alternatively, you could just build a regular kernel enqueue graph with queue::submit while redefining urEnqueue... to fail and just check that we make the required ur.*Release calls.

@cperkinsintel cperkinsintel marked this pull request as draft January 14, 2025 19:38
@cperkinsintel
Copy link
Contributor Author

The MemorySanitizer/check_device_global.cpp test is failing on the OpenCL CPU device. Waiting for wa here to be merged: #17014

@cperkinsintel
Copy link
Contributor Author

The failing test was disabled yesterday: #17022

I'm not sure why it continues to be run here.

@cperkinsintel cperkinsintel marked this pull request as ready for review February 17, 2025 22:43
Copy link
Contributor

@sergey-semenov sergey-semenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the shutdown related Windows quirks (the design doc update is helpful), but the changes seem reasonable to me.

Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me.

@cperkinsintel
Copy link
Contributor Author

Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me

The mem leak can't be tested on Windows without the greater fix for the overall shutdown. So it's a bit of a chicken-and-egg situation. Ultimately, the mem leak fix is fairly light weight, so I thought just leaving them together would be ok. But I think I can separate them and mark the mem leak test as UNSUPPORTED on Windows and then change that with the shutdown fix. Let me know.

@@ -486,6 +486,9 @@ Scheduler::GraphBuilder::addCopyBack(Requirement *Req,

std::vector<Command *> ToCleanUp;
for (Command *Dep : Deps) {
if (Dep->MEnqueueStatus == EnqueueResultT::SyclEnqueueFailed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too common change and could happen in the middle of program. I think it is not correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely correct. I'm about to separate this out into its own PR, but what is happening is that we are enqueuing a set of dependencies/requirements and a command, during which an exception is thrown (and caught and rethrown). This sets the result to EnqueuFailed, but only for that particular item, not for things that depend on it (or vice versa). The Command pointers are all loose pointers, not smart pointers. So at the time of the exception some of that set is already in the DAG and then abandoned, and they leak. There is no other way to get the EnqueueFailed status.

So the fix is to be more diligent about checking for EnqueuFailed and avoid the leak that way. And this strategy applies to the buffer destructors as well.

@sergey-semenov
Copy link
Contributor

Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me

The mem leak can't be tested on Windows without the greater fix for the overall shutdown. So it's a bit of a chicken-and-egg situation. Ultimately, the mem leak fix is fairly light weight, so I thought just leaving them together would be ok. But I think I can separate them and mark the mem leak test as UNSUPPORTED on Windows and then change that with the shutdown fix. Let me know.

I think it'd be better to keep them separate. The Windows shutdown change seems to be way more involved, so I'd rather have the leak fix as a separate commit in the history.

@cperkinsintel
Copy link
Contributor Author

As requested, have broken the work here into two different PRs. One for the leaking-on-exceptions of Cmd pointers and one for addressing windows shutdown changes.
#17125
#17124

steffenlarsen pushed a commit that referenced this pull request Mar 10, 2025
When enqueueing a command and its dependencies, an exception might be
thrown. In that case, the command will have a failed EnqueueStatus and
not stored in graph_builder DAG. But we need to make sure that
dependencies also check so they are not stored there. Otherwise there
will be leaks. During the clean up, we don't want to reenqueue it if we
know it has failed before.

This is broken out from #16618
adamfidel pushed a commit to reble/llvm that referenced this pull request Mar 11, 2025
When enqueueing a command and its dependencies, an exception might be
thrown. In that case, the command will have a failed EnqueueStatus and
not stored in graph_builder DAG. But we need to make sure that
dependencies also check so they are not stored there. Otherwise there
will be leaks. During the clean up, we don't want to reenqueue it if we
know it has failed before.

This is broken out from intel#16618
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants