-
Notifications
You must be signed in to change notification settings - Fork 769
[UR] Fix some tests that are broken when run with multiple cuda devices available. #17216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[UR] Fix some tests that are broken when run with multiple cuda devices available. #17216
Conversation
45dac1b
to
b5c390b
Compare
b5c390b
to
2b62065
Compare
e2e fails are unrelated: this only touches UR test code |
just want to confirm, after this is merged do we want to change the cuda runner to the multi-gpu one? or perhaps we want to add a separate workflow for the new runner...? |
I think the idea was to use it to replace the current one because the multi-gpu one is faster, maybe @pbalcer can confirm |
ping @intel/llvm-reviewers-cuda |
@intel/llvm-gatekeepers please merge, the e2e fails aren't related as this only touches UR tests |
FYI, enabled new runner, turned off the old one. If you see any issues please let me know |
Also removes a test and adds known failures where appropriate (typically where the test only runs when multiple devices are available so the skip doesn't affect behaviour of single-device runs).
The test removed is
cudaUrContextCreateTest.ActiveContext
. This test seems to be testing the assumption that aurQueueCreate
followed byurMemBufferCreate
will set the active cuda context to the one associated with the context passed to those calls. Neither of these calls set the active context, this may have changed at some point as the test dates back to a PI unit test. The test currently passes as long as only one device is available because a previousurDeviceGetInfo
sets the active context to the one associated with that device, which is inevitably the same as the one associated with the UR context used in the test. Since the test is based on a faulty assumption about the adapter I think we can just delete it.