Skip to content

[L0] Fix DeviceInfo global mem free to report unsupported given MemCount==0 #1486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 5, 2024

Conversation

nrspruit
Copy link
Contributor

No description provided.

@ph0b
Copy link
Contributor

ph0b commented Mar 29, 2024

This change assumes FreeMemory <= 0 means the FreeMemory api doesn't work, so I have a concern here: Wouldn't there be cases for which the call works and returns <=0 because the device is really out of memory ? In that case we really don't want to report global_memory.

@nrspruit nrspruit force-pushed the fix_memfree_report branch from b7cf9ae to c913298 Compare March 29, 2024 20:58
@nrspruit
Copy link
Contributor Author

This change assumes FreeMemory <= 0 means the FreeMemory api doesn't work, so I have a concern here: Wouldn't there be cases for which the call works and returns <=0 because the device is really out of memory ? In that case we really don't want to report global_memory.

Thank you for the comment @ph0b , I realized I should be checking the MemCount. If there are no memory modules reported by the API, then free memory is invalid. If there are MemCount modules, then free memory is the correct value.

@nrspruit
Copy link
Contributor Author

This change assumes FreeMemory <= 0 means the FreeMemory api doesn't work, so I have a concern here: Wouldn't there be cases for which the call works and returns <=0 because the device is really out of memory ? In that case we really don't want to report global_memory.

Thank you for the comment @ph0b , I realized I should be checking the MemCount. If there are no memory modules reported by the API, then free memory is invalid. If there are MemCount modules, then free memory is the correct value.

If the API "fails", then the call will report a UR api failure. In the case of MemCount==0, it means that the API did not find any memory modules, which is not an error, but it does mean that the free memory is not set.

@nrspruit nrspruit changed the title [L0] Fix DeviceInfo global mem free to report global if free memory==0 [L0] Fix DeviceInfo global mem free to report global if MemCount==0 Mar 29, 2024
@nrspruit nrspruit added level-zero L0 adapter specific issues v0.9.x Include in the v0.9.x release labels Mar 29, 2024
@nrspruit nrspruit added the ready to merge Added to PR's which are ready to merge label Apr 3, 2024
@nsirgien
Copy link

nsirgien commented Apr 4, 2024

Hi @nrspruit, can you please explain more about a reason behind this changes?
I mean, as far as I know, the intended behavior is that if you have free memory aspect, then you should get real free memory numbers from driver. And if you do not have free memory aspect, then attempt to access free memory information would lead to a runtime error.
So, I do not understand, when the code suppose to enter the else branch of this change?
I mean, before it was returning zero, which was wrong, as it was described in https://jira.devtools.intel.com/browse/URT-626.
But with change, it will now return global memory (which is a constant), which is IMHO also wrong.

@nrspruit nrspruit force-pushed the fix_memfree_report branch from c913298 to 096be8b Compare April 4, 2024 15:16
@nrspruit nrspruit changed the title [L0] Fix DeviceInfo global mem free to report global if MemCount==0 [L0] Fix DeviceInfo global mem free to report unsupported given MemCount==0 Apr 4, 2024
…unt==0

- If there are no memory modules from zesDeviceEnumMemoryModules, then
  we cannot query the free memory from level zero and we do not support
  reporting free memory thru this api on this platform/os.

Signed-off-by: Neil R. Spruit <[email protected]>
@nrspruit nrspruit force-pushed the fix_memfree_report branch from 096be8b to ec773e6 Compare April 4, 2024 15:20
@nrspruit
Copy link
Contributor Author

nrspruit commented Apr 4, 2024

Hi @nrspruit, can you please explain more about a reason behind this changes? I mean, as far as I know, the intended behavior is that if you have free memory aspect, then you should get real free memory numbers from driver. And if you do not have free memory aspect, then attempt to access free memory information would lead to a runtime error. So, I do not understand, when the code suppose to enter the else branch of this change? I mean, before it was returning zero, which was wrong, as it was described in https://jira.devtools.intel.com/browse/URT-626. But with change, it will now return global memory (which is a constant), which is IMHO also wrong.

Based on offline discussion, this has been changed to return unsupported if the number of memory modules returned is 0. That way you either get the free memory or unsupported, if the L0 driver cannot query the free memory in this device.

@aarongreig aarongreig merged commit 065bf2d into oneapi-src:main Apr 5, 2024
nrspruit added a commit to nrspruit/llvm that referenced this pull request Apr 5, 2024
martygrant pushed a commit to intel/llvm that referenced this pull request Apr 8, 2024
kbenzie pushed a commit to kbenzie/unified-runtime that referenced this pull request Apr 8, 2024
[L0] Fix DeviceInfo global mem free to report unsupported given MemCount==0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
level-zero L0 adapter specific issues ready to merge Added to PR's which are ready to merge v0.9.x Include in the v0.9.x release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants