-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL] fix for leaking commands when exception thrown #16618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] fix for leaking commands when exception thrown #16618
Conversation
…e thrown. In that case, the command will have a failed EnqueueStatus. During the clean up, we don't want to reenqueue it if we know it has failed before
Could you please add a unit test for this? |
@sergey-semenov - The mem tests ( |
I mean, since there's no need to involve any backend to verify this, I'd prefer the more lightweight option of adding a unit test here. I think inserting a failed-to-enqueue MockCommand into the graph then checking if its destructor fired after cleaning up the buffer it depends on should be doable and check what we want here. Alternatively, you could just build a regular kernel enqueue graph with queue::submit while redefining urEnqueue... to fail and just check that we make the required ur.*Release calls. |
The |
The failing test was disabled yesterday: #17022 I'm not sure why it continues to be run here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with the shutdown related Windows quirks (the design doc update is helpful), but the changes seem reasonable to me.
Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me.
The mem leak can't be tested on Windows without the greater fix for the overall shutdown. So it's a bit of a chicken-and-egg situation. Ultimately, the mem leak fix is fairly light weight, so I thought just leaving them together would be ok. But I think I can separate them and mark the mem leak test as UNSUPPORTED on Windows and then change that with the shutdown fix. Let me know. |
@@ -486,6 +486,9 @@ Scheduler::GraphBuilder::addCopyBack(Requirement *Req, | |||
|
|||
std::vector<Command *> ToCleanUp; | |||
for (Command *Dep : Deps) { | |||
if (Dep->MEnqueueStatus == EnqueueResultT::SyclEnqueueFailed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too common change and could happen in the middle of program. I think it is not correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is definitely correct. I'm about to separate this out into its own PR, but what is happening is that we are enqueuing a set of dependencies/requirements and a command, during which an exception is thrown (and caught and rethrown). This sets the result to EnqueuFailed, but only for that particular item, not for things that depend on it (or vice versa). The Command pointers are all loose pointers, not smart pointers. So at the time of the exception some of that set is already in the DAG and then abandoned, and they leak. There is no other way to get the EnqueueFailed status.
So the fix is to be more diligent about checking for EnqueuFailed and avoid the leak that way. And this strategy applies to the buffer destructors as well.
I think it'd be better to keep them separate. The Windows shutdown change seems to be way more involved, so I'd rather have the leak fix as a separate commit in the history. |
When enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus and not stored in graph_builder DAG. But we need to make sure that dependencies also check so they are not stored there. Otherwise there will be leaks. During the clean up, we don't want to reenqueue it if we know it has failed before. This is broken out from #16618
When enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus and not stored in graph_builder DAG. But we need to make sure that dependencies also check so they are not stored there. Otherwise there will be leaks. During the clean up, we don't want to reenqueue it if we know it has failed before. This is broken out from intel#16618
When enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus. During the clean up, we don't want to reenqueue it if we know it has failed before.
The test for this wasn't working on Windows due to some of the shutdown complications there. Instead of simply disabling the test on Windows I decided to better understand and address the problem.