[SYCL] Keep multiple copies for bf16 device library image #17461
Signed-off-by: jinge90 <[email protected]>
// For bfloat16 device library image, it doesn't include any kernel, device
// global, virtual function, so just skip adding it to any related maps.
// We only need to: 1). add exported symbols to m_ExportedSymbolImages. 2).
// add the device image to m_DeviceImages used for future clean up when
// removeImage is called. RefCount is used to keep how many user device
// images are depending on native/fallback bfloat16 device library image,
// the corresponding image will be added to m_ExportedSymbolImages and
// m_DeviceImages only when RefCount is 0. These RefCount are used when
// KernelIDsGuard is acquired by current thread.
{
  auto Bfloat16DeviceLibProp = Img->getDeviceLibMetadata();
  if (Bfloat16DeviceLibProp.isAvailable()) {
    uint32_t IsNative =
        DeviceBinaryProperty(*(Bfloat16DeviceLibProp.begin())).asUint32();
    if (!m_Bfloat16DeviceLibRefCount[IsNative]) {
      for (const sycl_device_binary_property &ESProp :
           Img->getExportedSymbols()) {
        m_ExportedSymbolImages.insert({ESProp->Name, Img.get()});
      }
      m_DeviceImages.insert({RawImg, std::move(Img)});
    }
    m_Bfloat16DeviceLibRefCount[IsNative] += 1;
    continue;
  }
}
Making the special handling explicit (and only doing the necessary things) is a good idea 👍
  m_ExportedSymbolImages.erase(ESProp->Name);
}

m_DeviceImages.erase(DevImgIt);
Not erasing the device image here when the refcount > 0 won't keep the underlying device binaries alive. Consider the following situation:
- bundle A loaded, contributes bfloat dev lib into runtime -> refcount = 1
- bundle B loaded, uses bfloat dev lib -> refcount = 2
- bundle A is freed -> refcount = 1
- bundle C loaded, uses bfloat dev lib -> refcount = 2, and crash when PM tries to link the kernels because `m_ExportedSymbols` points to image coming from bundle A, which has been destroyed
As I commented inline, I don't think the refcounting alone fixes the underlying issue; rather the program manager would also need to take ownership of the bfloat device library image until the refcount drops to 0. Given these complications, I think it would also be fair to just accept the presence of multiple copies of these special images.
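To make the ownership point concrete, here is a minimal sketch of what "refcount plus ownership" could look like. All names (`Image`, `SharedLibRegistry`, `acquire`, `release`, `lookup`) are hypothetical and are not the program manager's actual types or API; the only thing taken from the discussion is the idea that the registry itself owns the image until the count drops to 0.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a device image; not the runtime's real type.
struct Image {
  std::string ExportedSymbol;
};

// Hypothetical registry: one owned bfloat16 library image per variant
// (index 0 = fallback, 1 = native), plus a count of dependent bundles.
class SharedLibRegistry {
  struct Slot {
    std::unique_ptr<Image> Owned;
    unsigned RefCount = 0;
  };
  std::array<Slot, 2> Slots;
  std::unordered_map<std::string, const Image *> Symbols;

public:
  // Registers one user. The first caller's image is moved into the registry;
  // later copies are simply dropped because a copy is already owned here.
  void acquire(std::uint32_t IsNative, std::unique_ptr<Image> Img) {
    assert(IsNative < Slots.size() && Img);
    Slot &S = Slots[IsNative];
    if (S.RefCount == 0) {
      Symbols[Img->ExportedSymbol] = Img.get();
      S.Owned = std::move(Img);
    }
    ++S.RefCount;
  }

  // Drops one user. The image and its symbol entry go away only with the
  // last user, so the symbol map never points at a destroyed image.
  void release(std::uint32_t IsNative) {
    assert(IsNative < Slots.size());
    Slot &S = Slots[IsNative];
    if (S.RefCount == 0)
      return;
    if (--S.RefCount == 0) {
      Symbols.erase(S.Owned->ExportedSymbol);
      S.Owned.reset();
    }
  }

  const Image *lookup(const std::string &Name) const {
    auto It = Symbols.find(Name);
    return It == Symbols.end() ? nullptr : It->second;
  }

  unsigned refCount(std::uint32_t IsNative) const {
    return Slots[IsNative].RefCount;
  }
};
```

With this shape, the A/B/C sequence above stays safe: freeing bundle A only decrements the count, and the symbol entry keeps pointing at an image the registry still owns.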
CC @steffenlarsen for potential relation to #17442.
Hi, @jopperm
Thanks very much.
Hi, @jopperm
Minor feedback, approach LGTM. Please also add a description for the PR.
int Bfloat16DeviceLibVersion = -1;
if (m_Bfloat16DeviceLibImages[0].get() == BinImage)
  Bfloat16DeviceLibVersion = 0;
if (m_Bfloat16DeviceLibImages[1].get() == BinImage)
- if (m_Bfloat16DeviceLibImages[1].get() == BinImage)
+ else if (m_Bfloat16DeviceLibImages[1].get() == BinImage)
if (!LibVersion)
  return true;

*LibVersion = 0;
Is it ok to use `0` (= a valid lib version) here? Should that be `~0U` or `2` or so?
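One way to sidestep the sentinel question entirely is to return the version as a `std::optional`, so "not a bfloat16 library" has no numeric encoding at all. This is only a sketch: `ImageMeta` and `bf16LibVersion` are hypothetical names, not the PR's actual function or signature, and it assumes C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical metadata record; the real runtime reads this information
// from the image's properties instead.
struct ImageMeta {
  bool HasBf16LibProperty;
  std::uint32_t IsNative; // 0 = fallback, 1 = native
};

// Returns the library variant when the image is a bfloat16 device library,
// and std::nullopt otherwise. An empty optional cannot be confused with the
// valid variant 0, which a plain out-parameter initialised to 0 could be.
std::optional<std::uint32_t> bf16LibVersion(const ImageMeta &M) {
  if (!M.HasBf16LibProperty)
    return std::nullopt;
  assert(M.IsNative <= 1 && "only variants 0 and 1 are expected");
  return M.IsNative;
}
```

If the out-parameter form has to stay, initialising it to an out-of-range value such as `~0U` gives the same guarantee with a smaller change.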
test_device_libraries(q) || test_device_libraries(q) ||
test_device_libraries(q) || test_unsupported_options(q) ||
Missing `test_esimd` here (also, 3x `test_device_libraries`).
Fixed.
Thanks very much.
// and 1 for native version. These bfloat16 device library images are
// provided by compiler long time ago, we expect no further update, so
// keeping 1 copy should be OK.
std::unordered_map<uint32_t, DynRTDeviceBinaryImageUPtr>
Could this be just a `std::array`?
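For illustration, a fixed two-slot `std::array` might look roughly like the sketch below. `BinaryImage`, `Bf16LibImages`, `store` and `versionOf` are hypothetical names standing in for the runtime's types; the lookup mirrors the pointer comparison shown earlier in the diff, written as a loop so the two slots cannot both match.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for the runtime's binary image type.
struct BinaryImage {
  int Id = 0;
};
using BinaryImageUPtr = std::unique_ptr<BinaryImage>;

struct Bf16LibImages {
  // Index 0 = fallback, 1 = native. Both slots start out null, so no
  // hashing or allocation is needed for a fixed pair of entries.
  std::array<BinaryImageUPtr, 2> Images;

  void store(std::size_t Version, BinaryImageUPtr Img) {
    assert(Version < Images.size() && "only variants 0 and 1 exist");
    Images[Version] = std::move(Img);
  }

  // Returns 0 or 1 when BinImage is one of the stored library images,
  // and -1 otherwise. Empty slots are skipped so a null BinImage never
  // matches.
  int versionOf(const BinaryImage *BinImage) const {
    for (std::size_t I = 0; I < Images.size(); ++I)
      if (Images[I] && Images[I].get() == BinImage)
        return static_cast<int>(I);
    return -1;
  }
};
```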
Hi, @jopperm
Yes, already updated the PR.
Thanks very much.
Thanks for iterating this, LGTM now. Please update the PR title and add a description.
The SYCL runtime's addImages function may be invoked multiple times for different SYCL binary images, and more than one of those images may depend on the bfloat16 device library. These bfloat16 device library images are provided by the compiler and their implementation is now stable, so the program manager keeps only a single copy of each of the native and fallback bfloat16 device library images. These images are not removed until the program manager is destroyed.
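The policy in the description above (first copy wins, nothing is removed until the owner is destroyed) can be sketched as follows. `LibImage`, `Bf16LibCache`, `addIfAbsent` and `lookup` are hypothetical names used for illustration only and do not reflect the program manager's real interface.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a bfloat16 device library image.
struct LibImage {
  std::string ExportedSymbol;
};

class Bf16LibCache {
  std::array<std::unique_ptr<LibImage>, 2> Images; // 0 = fallback, 1 = native
  std::unordered_map<std::string, const LibImage *> Symbols;

public:
  // Keeps the image only if no copy of that variant is held yet. Returns
  // true when the image was stored and false when it was a redundant copy.
  bool addIfAbsent(std::uint32_t IsNative, std::unique_ptr<LibImage> Img) {
    assert(IsNative < Images.size() && Img);
    if (Images[IsNative])
      return false;
    Symbols[Img->ExportedSymbol] = Img.get();
    Images[IsNative] = std::move(Img);
    return true;
  }

  // There is deliberately no remove operation: stored images live until
  // the cache itself is destroyed, so symbol pointers stay valid.
  const LibImage *lookup(const std::string &Name) const {
    auto It = Symbols.find(Name);
    return It == Symbols.end() ? nullptr : It->second;
  }
};
```

Compared with reference counting, this trades a small amount of memory held for the lifetime of the owner for much simpler lifetime reasoning.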