[SYCL] fix for leaking commands when exception thrown #16618

cperkinsintel · 2025-01-14T05:28:16Z

When enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus. During cleanup, we don't want to re-enqueue a command if we know it has failed before.

The test for this wasn't working on Windows due to some of the shutdown complications there. Instead of simply disabling the test on Windows I decided to better understand and address the problem.

sergey-semenov · 2025-01-14T14:27:18Z

Could you please add a unit test for this?

cperkinsintel · 2025-01-14T17:07:49Z

@sergey-semenov - The mem tests (UR_L0_LEAKS_DEBUG=1) for max_num_work_groups.cpp are how this was discovered, in its error case. I can make a new e2e test file that explicitly cribs that combination. Would that work? Or do you want a UNIT test as opposed to an E2E one? I'm not quite sure what that would look like.

sergey-semenov · 2025-01-14T17:30:02Z

I mean, since there's no need to involve any backend to verify this, I'd prefer the more lightweight option of adding a unit test here. I think inserting a failed-to-enqueue MockCommand into the graph then checking if its destructor fired after cleaning up the buffer it depends on should be doable and check what we want here. Alternatively, you could just build a regular kernel enqueue graph with queue::submit while redefining urEnqueue... to fail and just check that we make the required ur.*Release calls.
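A rough, self-contained sketch of the first option described above: a failed-to-enqueue mock command whose destructor is observed after cleanup. All names here (`MockCmd`, `MockStatus`, `MockBufferRecord`, `cleanupRecord`) are hypothetical stand-ins for illustration, not the MockCommand or scheduler types used by the SYCL unit tests:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical mock types; they only model the idea discussed above.
enum class MockStatus { Ready, Failed };

struct MockCmd {
  MockStatus Status;
  std::function<void()> OnDestroy; // fired from the destructor

  MockCmd(MockStatus S, std::function<void()> Cb)
      : Status(S), OnDestroy(std::move(Cb)) {}
  ~MockCmd() {
    if (OnDestroy)
      OnDestroy();
  }
};

// Stand-in for the per-buffer bookkeeping that owns commands via raw pointers.
struct MockBufferRecord {
  std::vector<MockCmd *> Cmds;
};

// Releases every command attached to the record, including failed ones.
void cleanupRecord(MockBufferRecord &Rec) {
  for (MockCmd *C : Rec.Cmds)
    delete C;
  Rec.Cmds.clear();
}

// Returns true if a failed-to-enqueue command is destroyed during cleanup.
bool failedCommandIsDestroyedOnCleanup() {
  bool Destroyed = false;
  MockBufferRecord Rec;
  Rec.Cmds.push_back(
      new MockCmd(MockStatus::Failed, [&Destroyed] { Destroyed = true; }));
  cleanupRecord(Rec);
  return Destroyed;
}
```

In a real unit test the check would presumably go through the actual scheduler cleanup path rather than a hand-written `cleanupRecord`; the sketch only shows the destructor-observation pattern.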

cperkinsintel · 2025-02-14T07:24:04Z

The MemorySanitizer/check_device_global.cpp test is failing on the OpenCL CPU device. Waiting for the workaround here to be merged: #17014

cperkinsintel · 2025-02-15T17:12:58Z

The failing test was disabled yesterday: #17022

I'm not sure why it continues to be run here.

sergey-semenov

I'm not very familiar with the shutdown related Windows quirks (the design doc update is helpful), but the changes seem reasonable to me.

Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me.

sycl/unittests/windows/dllmain.cpp

sycl/source/detail/global_handler.cpp

sycl/source/detail/scheduler/graph_builder.cpp

sycl/source/detail/queue_impl.hpp

sycl/source/detail/scheduler/scheduler.cpp

sycl/source/detail/global_handler.cpp

sycl/doc/design/GlobalObjectsInRuntime.md

cperkinsintel · 2025-02-20T20:52:49Z

Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me

The mem leak can't be tested on Windows without the greater fix for the overall shutdown. So it's a bit of a chicken-and-egg situation. Ultimately, the mem leak fix is fairly lightweight, so I thought just leaving them together would be ok. But I think I can separate them and mark the mem leak test as UNSUPPORTED on Windows and then change that with the shutdown fix. Let me know.

sycl/doc/design/GlobalObjectsInRuntime.md

sycl/source/detail/global_handler.cpp

sycl/source/detail/queue_impl.hpp

KseniyaTikhomirova · 2025-02-21T12:23:41Z

sycl/source/detail/scheduler/graph_builder.cpp

@@ -486,6 +486,9 @@ Scheduler::GraphBuilder::addCopyBack(Requirement *Req,
  std::vector<Command *> ToCleanUp;
  for (Command *Dep : Deps) {
+    if (Dep->MEnqueueStatus == EnqueueResultT::SyclEnqueueFailed)

This change is too general and could take effect in the middle of a program. I don't think it is correct.

cperkinsintel · 2025-02-21

It is definitely correct. I'm about to separate this out into its own PR, but what is happening is this: we are enqueuing a command together with a set of dependencies/requirements, and during that an exception is thrown (and caught and rethrown). This sets the result to SyclEnqueueFailed, but only for that particular item, not for things that depend on it (or vice versa). The Command pointers are all raw pointers, not smart pointers. So at the time of the exception some of that set is already in the DAG and is then abandoned, and those commands leak. There is no other way to get the SyclEnqueueFailed status.

So the fix is to be more diligent about checking for SyclEnqueueFailed and to avoid the leak that way. The same strategy applies to the buffer destructors as well.
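To make the mechanism concrete, here is a minimal, self-contained model of it. It is a sketch under stated assumptions, not the runtime's code: `Cmd`, `EnqueueResult`, `LiveCmds`, and `cleanUpDeps` are hypothetical names, and the guard is assumed to skip re-enqueueing a failed dependency while still releasing it:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration; not the sycl::detail classes.
enum class EnqueueResult { Ready, Success, Failed };

int LiveCmds = 0; // number of Cmd objects that have not been deleted yet

struct Cmd {
  EnqueueResult Status;
  explicit Cmd(EnqueueResult S) : Status(S) { ++LiveCmds; }
  ~Cmd() { --LiveCmds; }
};

// Models a cleanup pass over raw, owning Cmd pointers.
// Returns how many dependencies were enqueued during the pass.
int cleanUpDeps(std::vector<Cmd *> &Deps) {
  int Enqueued = 0;
  for (Cmd *Dep : Deps) {
    const bool Failed = Dep->Status == EnqueueResult::Failed;
    // Guard: a dependency that already failed is never enqueued again.
    if (!Failed && Dep->Status == EnqueueResult::Ready) {
      Dep->Status = EnqueueResult::Success; // stand-in for a real enqueue
      ++Enqueued;
    }
    // Every dependency is released, failed or not, so nothing is abandoned.
    delete Dep;
  }
  Deps.clear();
  return Enqueued;
}
```

With one pending, one failed, and one already-enqueued command in `Deps`, only the pending one is enqueued, and all three are released.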

sycl/source/detail/scheduler/scheduler.cpp

sycl/test-e2e/Scheduler/DeleteCmdException.cpp

sergey-semenov · 2025-02-21T13:15:04Z

Is the shutdown issue on Windows only revealed by the leak fix or was it there before? I think it makes sense to separate the two changes, they seem more-or-less unrelated to me

The mem leak can't be tested on Windows without the greater fix for the overall shutdown. So it's a bit of a chicken-and-egg situation. Ultimately, the mem leak fix is fairly lightweight, so I thought just leaving them together would be ok. But I think I can separate them and mark the mem leak test as UNSUPPORTED on Windows and then change that with the shutdown fix. Let me know.

I think it'd be better to keep them separate. The Windows shutdown change seems to be way more involved, so I'd rather have the leak fix as a separate commit in the history.

cperkinsintel · 2025-02-21T20:56:42Z

As requested, I have broken the work here into two different PRs: one for the leaking of Cmd pointers on exceptions and one for the Windows shutdown changes.
#17125
#17124

When enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus and will not be stored in the graph_builder DAG. But we need to make sure that its dependencies are checked as well, so that they are not stored there either. Otherwise there will be leaks. During cleanup, we don't want to re-enqueue a command if we know it has failed before. This is broken out from #16618

when enqueueing a command and its dependencies, an exception might be thrown. In that case, the command will have a failed EnqueueStatus. During the clean up, we don't want to reenqueue it if we know it has failed before

28de9d5

cperkinsintel requested a review from a team as a code owner January 14, 2025 05:28

cperkinsintel requested review from againull and sergey-semenov January 14, 2025 05:28

cperkinsintel marked this pull request as draft January 14, 2025 19:38

cperkinsintel added 2 commits January 21, 2025 10:49

Merge branch 'sycl' into cperkins-cmd-mem-leak-fix

7279592

fix OTHER memory release path and test both

e24a731

restoring other CleanUp code which is used by queue memcpy ops

626a833

interesting and excellent.

9401ae1

blind fix

a03da62

cperkinsintel added 2 commits January 27, 2025 17:08

checkpoint. cleanup needed

113927d

checkpoint

59ae241

cperkinsintel added 2 commits January 28, 2025 11:04

another checkpoint

5ac2f12

ready for more testing. Probably needs clang-format fixes.

ea2fe36

misery loves clang-format

194b47e

cleanup

cf8c1b4

no finishWait on win in threadpool destructor

979d6b2

cperkinsintel marked this pull request as ready for review February 17, 2025 22:43

remove stray line. ( Actually, I just want to try a new test run )

8d111f1

cperkinsintel requested a review from KseniyaTikhomirova February 19, 2025 19:05

sergey-semenov reviewed Feb 20, 2025

reviewer feedback

caedef9

KseniyaTikhomirova reviewed Feb 21, 2025

more reviewer feedback before splitting into two PR

3b93f64

This was referenced Feb 21, 2025

[SYCL] windows shutdown fix #17124

Closed

[SYCL] cmd mem leak fix #17125

Merged

cperkinsintel closed this Feb 21, 2025
