Skip to content

[Doc] Add design doc for dynamic linking of device code feature #3210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 31, 2021
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
379 changes: 379 additions & 0 deletions sycl/doc/SharedLibraries.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,379 @@
# Dynamic linking of device code

This document describes purpose and design of dynamic linking of device code
feature.

## Background
Sometimes users want to link device code dynamically at run time. One possible
use case for such linkage - providing device functions via shared libraries.
Simple source example:
```
// App:

CGH.parallel_for<app_kernel>(/* ... */ {
library_function();
});


// Shared library:
SYCL_EXTERNAL void library_function() {
// do something
}
```
It is possible to manually create `sycl::program` in both app and shared
library, then use `link` SYCL API to get a single program and launch kernels
using it. But it is not user-friendly and it is very different from regular
C/C++ workflow.

Another possible scenario - use functions defined in pre-compiled device image
provided by user. Example:
```
// a.cpp
SYCL_EXTERNAL void foo();
...
parallel_for([]() { foo(); });

// b.cpp
/*no SYCL_EXTERNAL*/ void foo() { ... }
```
We have a `SYCL_EXTERNAL` function `foo` called from a kernel, but the
application defined only host version of this function. Then user adds device
image with definition of `foo` to the fat object via special option.

The main purpose of this feature is to provide a mechanism which allows to
link device code dynamically at runtime.

## Requirements:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

Suggested change
## Requirements:
## Requirements

User's device code that consists of some device API (`SYCL_EXTERNAL` functions),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
User's device code that consists of some device API (`SYCL_EXTERNAL` functions),
User's device code

is compiled into some form and it is not linked statically with device code of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is compiled into some form and it is not linked statically with device code of
can be compiled into some form and not linked statically with device code of

application. It can be a shared library that contains some device code or a
separate device image supplied with property information. This code is linked
dynamically at run time with device code of a user's application in order to
resolve dependencies.
For this combination the following statements must be true:

- `SYCL_EXTERNAL` functions defined in dynamically linked code can be called
(directly or indirectly) from device code of the application.
- Function pointers taken in application should work inside the dynamically
linked code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some parts of the reqs seem too specific and not general enough. As a variant:

The presented dynamic device code linkage mechanism must:

  • allow to represent the actual dynamically linked code as a device binary image
    • embedded into a host shared object by standard SYCL compiler driver invocation
    • embedded into a host binary or shared object using manual invocations of SYCL tools such as clang-offload-wrapper and linker
    • (NOTE: I think this one would be really useful, I suggest to add a TODO as potential design extension point) loaded into memory via analog of dlopen
  • must not assume the actual format of the device code - e.g. that it is SPIR-V (can be also a TODO design extension point)
  • provide automatic runtime resolution of SYCL_EXTERNAL function references within the SYCL app to their definitions (if found) within any suitable dynamically linked device binary image
  • support pointers to SYCL_EXTERNAL functions across the dynamic linkage boundaries within the device code - taking a pointer, call through a pointer.
  • Specific code changes are ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this point:

(NOTE: I think this one would be really useful, I suggest to add a TODO as potential design extension point) loaded into memory via analog of dlopen

Do you mean some specific SYCL API? Because if dlopen is used it will be just specific usage of dynamically linked code that "embedded into a host shared object by standard SYCL compiler driver invocation".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant some new API, which would load a device library from disk (optionally returning device_image), and automatically make symbols defined in the loaded binary image available to SYCL runtime for resolution:

// suppose, mylib.spv defines SYCL_EXTERNAL function foo, then this call:
device_image img = device_dlopen("mylib.spv");
// will make foo available for dynamic symbol resolution. If any subsequent JIT compilations
// try to compile device code with external reference to foo, it can now be resolved following
// the resolution mechanism described in this doc, and JIT compilation will succeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. I mentioned that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a new API? Can't we just suggest compiling marked with SYCL_EXTERNAL as a regular "fat" .so and then dlopen it? dlopen'ing should trigger device image registration, so such a device image should be available for symbols resolution.

Copy link
Contributor

@kbobrovs kbobrovs May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just suggest compiling marked with SYCL_EXTERNAL as a regular "fat" .so and then dlopen it?

Sorry for delay. Yes, you are right, that should be the preferred way. But sometimes there is a need to load x.spv originating from different tools - e.g. OpenCL. It would be UB trying to define kernels in this x.spv and then trying to call them from SYCL, but it should be perfectly OK to link x.spv with .spv originating from SYCL.

- Specific code changes are not required, i.e. the mechanism of linking works
as close as possible to regular shared libraries.

## Design
The overall idea:

- Each device image is supplied with a list of imported symbol names
through device image properties mechanism
- `SYCL_EXTERNAL` functions are arranged into separate device images supplied
with a list of exported symbol names
- Before compiling a device image DPC++ RT will check if device image has a list
of imported symbols and if it has, then RT will search for device images which
define required symbols using lists of exported symbols.
- Besides symbol names, additional attributes are taken into account (like
device image format: SPIR-V or device asm)
- Actual linking is performed by underlying backend (OpenCL/L0/etc.)

Next sections describe details of changes in each component.

### DPC++ front-end changes

DPC++ front-end generates `sycl-module-id` attribute on each `SYCL_EXTERNAL` function.
It was generated only on kernels earlier. There are two reasons to start
generating this attribute on `SYCL_EXTERNAL` functions:

- Later in pipeline, this attribute will be used by `sycl-post-link` tool to
separate `SYCL_EXTERNAL` functions from non-`SYCL_EXTERNAL` functions with
external linkage.
- `sycl-module-id` attribute also contains information about source file where the
function comes from. This information will be used to perform device code
split on device images that contain only exported functions.

### sycl-post-link changes

To support dynamic device linkage, `sycl-post-link` performs 3 main tasks:
- Arranges `SYCL_EXTERNAL` functions into a separate device image(s)
- Supplies device images containing exports with an information about exported
symbols
- Supplies each device image with an information about imported symbols

`sycl-post-link` outlines `SYCL_EXTERNAL` functions with all their reachable
dependencies (functions with definitions called from `SYCL_EXTERNAL` ones)
into a separate device image(s) in order to create minimal self-contained
device images that can be linked from the user's app. There are several
notable moments though.

If a `SYCL_EXTERNAL` function is used within a kernel defined in a shared
library, it will be duplicated: one instance will be stored in the kernel's
device image and the function won't exported from this device image, while the
other will be stored in a special device image for other `SYCL_EXTERNAL`
functions and will be marked as exported there. Such duplication is need for
two reasons:
- We aim to make device images with kernels self-contained so no JIT linker
invocations would be needed if we have definitions of all called functions.
Also note that if AOT is requested, it would be impossible to link anything
at runtime.
- We could export `SYCL_EXTERNAL` functions from device images with kernels,
but it would mean that when user's app calls `SYCL_EXTERNAL` function, it has
to link a whole kernel and all its dependencies - not only it increases the
amount of unnecessary linked code, but might also lead to build errors if the
kernel uses some features, which are not supported by target device (and they
are not used in the `SYCL_EXTERNAL` function).
Besides separating `SYCL_EXTERNAL` functions from kernels, `sycl-post-link`
can also distribute those functions into separate device images if device code
split is requested. This is done by grouping them using `module-id` attribute.
Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions with different
`sycl-module-id` attributes are copied to device images corresponding to those
`SYCL_EXTERNAL` functions to make them self-contained.
In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function
with different `sycl-module-id` attribute, the second one is not copied to the
device image with the first function, but dependency between those device images
is recorder instead.

After `SYCL_EXTERNAL` functions are arranged into a separate device image(s),
all non-`SYCL_EXTERNAL` functions and `SYCL_EXTERNAL` functions left in device
images with kernels marked with internal linkage to avoid multiple definition
errors during runtime linking.
Device images with `SYCL_EXTERNAL` functions will also get a list of names
of exported functions attached to them through device image properties
(described below).

**NOTE**: If device code split is enabled, it seems reasonable to perform
exports arrangement before device code split procedure.

In order to collect information about imported symbols `sycl-post-link` looks
through LLVM IR and for each declared but not defined symbol records its name,
except the following cases:
- Declarations with `__` prefix in demangled name are not recorded as imported
functions
- Declarations with `__spirv_*` prefix should not be recorded as dependencies
since they represent SPIR-V operations and will be transformed to SPIR-V
instructions during LLVM->SPIR-V translation.
- Based on some attributes (which could be defined later) we may want to avoid
listing some functions as imported ones
- This is needed to have possibility to call device-specific builtins not
starting with `__` by forward-declaring them in DPC++ code

**NOTE**: If device code split is enabled, imports collection is performed after
split and it is performed on splitted images.

All collected information is attached to a device image via properties
mechanism.

Each device image is supplied with an array of property sets:
```
struct pi_device_binary_struct {
...
// Array of property sets
pi_device_binary_property_set PropertySetsBegin;
pi_device_binary_property_set PropertySetsEnd;
};

```
Each property set is represented by the following struct:
```
// Named array of properties.
struct _pi_device_binary_property_set_struct {
char *Name; // the name
pi_device_binary_property PropertiesBegin; // array start
pi_device_binary_property PropertiesEnd; // array end
};
```
It contains name of property set and array of properties. Each property is
represented by the following struct:
```
struct _pi_device_binary_property_struct {
char *Name; // null-terminated property name
void *ValAddr; // address of property value
uint32_t Type; // _pi_property_type
uint64_t ValSize; // size of property value in bytes
};
```

List of imported symbols is represented as a single property set with name
`SYCL/imported symbols` recorded in the `Name` field of property set.
Each property in this set holds name of the particular imported symbol recorded
in the `Name` field of the property.
List of exported symbols is represented in the same way, except the
corresponding set has the name `SYCL/exported symbols`.

### DPC++ runtime changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it very difficult to understand the big picture for the DPC++runtime changes. It would be nice to have an overview up front. Maybe something like this:

The overall strategy is as follows. When the application submits an offload kernel to execution on a device, the runtime uses the SYCL/imported symbols attribute from the kernel's device image to find the set of symbols it imports. The runtime then searches for device images that export these same symbols by searching their SYCL/exported symbols attributes. This symbol resolution algorithm results in a list of device images that need to be linked together.

The design currently supports only the case when all these device images are in SPIR-V format. The runtime passes the list of device images to the PI API, which compiles and links the SPIR-V modules together into a single native device image. This native device image is then added to the cache to avoid symbol resolution, compilation, and linking for any future attempts to invoke kernels from this device image.

Note that the PI API does not use the SYCL/imported symbols or SYCL/exported symbols attributes to perform the linking of the SPIR-V modules. Instead, functions in the SPIR-V modules must be annotated with the SPIR-V Import and Export linkage types. The PI API layer uses these annotations to perform that actual linking of function references across SPIR-V modules.

Note that I'm not really sure this is how the design will work, but I think this is what you intend. Can you confirm if my overall understanding of the runtime design is accurate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your understanding is correct, except that the design doesn't actually depend on device images format. I.e. the algorithm of searching won't be changed if device images are pre-compiled native device binaries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the symbol-search part of the algorithm can probably work for device images in native code format. However, I think the online linking part will only work for device images in SPIR-V format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wanted to stress that I think it's important to add some description similar to what I suggest to the introduction to the "DPC++ runtime changes" section. I think this will clarify the following points which are currently unclear:

  • The SYCL/imported symbols and SYCL/exported symbols attributes are used only by the runtime's symbol resolution algorithm. The PI API layer relies on the SPIR-V Import and Export linkage annotations, so these must also be present.

  • The linking part of the design works only for device images in SPIR-V format.

  • The only things added to the cache are fully linked device images. We do not cache device images with SYCL_EXTERNAL functions that are compiled and not yet linked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linking part of the design works only for device images in SPIR-V format.

I don't agree with you here. Yes, right now we don't have backends that support linking of native device binaries. But we don't say that it is not possible in the future. I think format of device image shouldn't affect design of runtime changes. Most likely when linking of native device binaries is supported, it won't matter for runtime which format device image has, it will just call some PI API for link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smaslov-intel, my understanding is that in OpenCL terminology, compiled != built. compiled is an object form, i.e. front-end is passed, but no executable binary was produced yet. I can't find strong proof of that in the spec, but I'm sure that our (Intel) CPU & GPU compilers won't accept native binaries in clLinkProgram

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clCompileProgram.html

The compiled binary can be queried using clGetProgramInfo(program, CL_PROGRAM_BINARIES, ...) and can be specified to clCreateProgramWithBinary to create a new program object.

Compile does produce "binary", otherwise the above wouldn't be possible.
Who can clarify it exactly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compile does produce "binary", otherwise the above wouldn't be possible.
Who can clarify it exactly?

I think @bashbaug should be able to clarify this.

In the meantime:

From clCreateKernel:

Parameters
program
A program object with a successfully built executable.

From clBuildProgram:

Builds (compiles and links) a program executable from the program source or binary.

From clLinkProgram:

Links a set of compiled program objects and libraries for all the devices or a specific device(s) in the OpenCL context and creates an executable.

From clCompileProgram:

Compiles a program’s source for all the devices or a specific device(s) in the OpenCL context associated with program.

It seems to me that intent was that clCompileProgram does not generate an executable, but only creates some intermediate object file instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.
From clGetProgramInfo(https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetProgramInfo.html) desc:

Return the program binaries for all devices associated with program. For each device in program, the binary returned can be the binary specified for the device when program is created with clCreateProgramWithBinary or it can be the executable binary generated by clBuildProgram. If program is created with clCreateProgramWithSource, the binary returned is the binary generated by clBuildProgram. The bits returned can be an implementation-specific intermediate representation (a.k.a. IR) or device specific executable bits or both. The decision on which information is returned in the binary is up to the OpenCL implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision on which information is returned in the binary is up to the OpenCL implementation.

Can we somehow query OpenCL about what is in the binary such that we could properly control it's linking?


DPC++ RT performs *device images collection* task by grouping all device
images required to execute a kernel based on the list of exports/imports and
links them together using PI API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we add a sub-section about this PI API below: this API is important, as this is a part of interface between plugins and DPC++ RT and we don't even have its name listed here.

The sub-section could describe the API, its behavior (whether we assume that it is capable to link native binaries regardless of device or that it is some optional capability which should be checked before usage), possible implementation/limitations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, please see lines 236-259.


#### Device images collection

DPC++ Runtime class named ProgramManager stores device images using following
data structure:
```
/// Keeps all available device executable images added via \ref addImages.
/// Organizes the images as a map from a kernel set id to the vector of images
/// containing kernels from that set.
/// Access must be guarded by the \ref Sync::getGlobalLock()
std::unordered_map<SymbolSetId,
std::unique_ptr<std::vector<RTDeviceBinaryImageUPtr>>>
m_DeviceImages;

using StrToKSIdMap = std::unordered_map<string_class, SymbolSetId>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a unordered_multimap to handle use-case of kernels with the same name in different objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think use-case when kernels have the same name is not really legal. SYCL 2020 spec says "In the case of the shared-source compilation model, the kernels have to be uniquely identified by both host and device compiler.". So if two kernels (even if they are in different objects) have the same name, they don't really uniquely identified, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have the following use-case:

// Application
queue Q;
Lib1Func(Q);
Lib2Func(Q);

// Lib1
void Lib1Func(queue &Q) {
  Q.sbumit([](handle &H) {
    H.parallel_for<class Calculate>(...);
  });
}

// Lib2
void Lib2Func(queue &Q) {
  Q.sbumit([](handle &H) {
    H.parallel_for<class Calculate>(...);
  });
}

Is this use-case handled by SYCL 2020 spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that SYCL 2020 (or other) spec say something about such use cases, especially when libraries are involved. However, looking at this comment #3210 (comment) , I think this case is illegal. @gmlueck , could you please confirm?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is illegal because the spec says this:

In the case of the shared-source compilation model, the kernels have to be uniquely identified by both host and device compiler.

The spec could certainly be more clear about this, though. I'll propose updated wording to the SYCL 2020 spec clarifying that a kernel name must be unique across the entire application.

Note that it is legal for two translation units (or two libraries) to call the same kernel, in which case both instances of the kernel will have the same name. For example, consider a named kernel object that is called from two different libraries:

class MyKernel {
 public:
  void operator()(item<1> it) const {
    /* kernel code */
  }
};

In such a case, both libraries will have a definition of the kernel named MyKernel. The design will need to support cases like this.

/// Maps names of kernels from a specific OS module (.exe .dll) to their set
/// id (the sets are disjoint).
std::unordered_map<OSModuleHandle, SymbolSetId> m_SymbolSets;
```
Assume each device image represents some combination of symbols and different
device images may contain only exactly the same or not overlapping combination
of symbols. If it is not so, there can be two cases:
- Symbols are the same. In this case it doesn't matter which device image is
taken to use duplicated symbol
- Symbols are not the same. In this case ODR violation takes place, such
situation leads to undefined behaviour. For more details refer to
[ODR violations](#ODR-violations) section.

Each combination of symbols is assigned with an Id number - symbol set Id.
A combination of symbols can exist in different formats (i.e. SPIR-V/AOT
compiled binary and etc).
`m_DeviceImages` maps an Id number to an array with device images which represent
the same combination of symbols in different formats.
`m_SymbolSets` contains mapping from symbol name to symbol set Id for each OS
module (.exe/.so/.dll).
`std::unordered_map` allows to search and access its elements with constant-time
complexity.

Before compilation of device image to execute a kernel RT checks if the image
contains any import information in its properies and if it does, then RT
performs device images collection in order to resolve dependencies.

Ids of all needed symbol sets are found. This is done by iterating through
`m_SymbolSets` map, i.e. iterating through all available OS modules without
predefined order and searching for first unresolved symbol in list of imports
set of target device image. Once device image that contains first symbol is
met, remaining exported symbols are checked in found image and if
they match some imported symbols then these matched symbols will be marked as
resolved. The procedure repeats until all imported symbols are resolved.
For each found symbol set Id program cache is checked in case if
necessary set of `SYCL_EXTERNAL` functions has been compiled and if it is true,
then compiled device image will be re-used for linking.
Otherwise device image containing required symbols set will be compiled and
stored in cache.

#### Program caching

Existing support for device code caching is re-used to cache programs created
from device images with SYCL external functions and linked device images with
imports information.

##### In-memory cache

Programs that contain only `SYCL_EXTERNAL` functions will be cached only in
compiled state, so they can be linked with other programs during dependency
resolution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Programs that contain only `SYCL_EXTERNAL` functions will be cached only in
compiled state, so they can be linked with other programs during dependency
resolution.
Programs that contain only `SYCL_EXTERNAL` functions will be cached only when exported
functions are used by images with kernels and identified by `main` image Id.
So once two kernel from different images use functions from the same `library` image it will
be duplicated in cache after linking with main programs during dependency
resolution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, there can be two cases when SYCL_EXTERNAL-only images are cached in in-memory cache:

  • Inside the main image, i.e. after it is linked with all its dependencies (the case that you described in your suggestion)
  • Non-linked SYCL_EXTERNAL-only images in compiled state. It can be done when some "main" image requested some SYCL_EXTERNAL-only image as dependency, we created a program for SYCL_EXTERNAL-only image, compiled it and stored in cache, so we don't have to do it again when some other main requests the same SYCL_EXTERNAL image. (the case that I tried to describe here). The idea was taken from existing cache for standard device libraries (see
    std::map<std::pair<DeviceLibExt, RT::PiDevice>, RT::PiProgram>
    ).

Does it make sense?


The existing mechanism of caching is not changed for programs with
imports information. They are stored in cache after they compiled and linked
with programs that provide their dependencies. To identify linked programs
Id of "main" set of symbols (i.e. the one which actually contain kernels) will
be used.

##### Persistent cache

The documented approach to persistent cache needs to be expanded in presence
of dynamic linking support. One of the identifiers for built image hash is
hash made out of device image used as input for the JIT compilation.
In case when "main" image have imports information, device image hash should be
created from all device images that are necessary to build it, i.e. hash out
of "main" device image and set of 'SYCL_EXTERNAL'-only images that define all
symbols imported by "main device image.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current approach assume storing final binary after linking it means:

  • duplication of library images in every main program;
  • on change of library all main images in persistent cache are invalidated and JIT is initiated.
    May be it is reasonable to store library images in cache separately from main image? It can be done both for in-memory and persistent or only for persistent cache depending on weight of link operation for binary programs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it is reasonable to store library images in cache separately from main image? It can be done both for in-memory and persistent or only for persistent cache depending on weight of link operation for binary programs.

Do you mean compile to binary "main" image and "library" image and store them on disk separately? It sounds reasonable (the real dynamic libraries actually work this way), however I don't think we have devices that support linking of native device binaries yet, so right now It doesn't seem possible.


## Corner cases and limitations

It is not guaranteed that behaviour of host shared libraries and device shared
libraries will always match. There are several cases when it can occur, the
next sections will cover details of such cases.

### ODR violations

C++ standard defines One Definition Rule as:
> Every program shall contain exactly one definition of every non-inline
function or variable that is odr-used in that program outside of a discarded
statement; no diagnostic required.
The definition can appear explicitly in the program, it can be found in the
standard or a user-defined library, or (when appropriate) it is implicitly
defined.


Here is an example:

![ODR violation](images/ODR-shared-libraries.svg)

Both libraries libB and libC provide two different definitions of function
`b()`, so this example illustrates ODR violation. Technically this case has
undefined behaviour, however it is possible to run and compile this example on
Linux and Windows. Whereas on Linux only function `b()` from library libB is
called, on Windows both versions of function `b()` are used.
Most of backends online linkers act like static linkers, i.e. just merge
device images with each other, so it is not possible to correctly imitate
Windows behaviour in device code linking because attempts to do it will result
in multiple definition errors.

Given that, it is not guaranteed that behaviour of shared host libraries and
shared device libraries will always match in case of such ODR violations.

#### LD_PRELOAD

Another way to violate ODR is `LD_PRELOAD` environment variable on Linux. It
allows to load specified shared library before any other shared libraries so it
will be searched for symbols before other shared libraries. It allows to
substitute functions from regular shared libraries by functions from preloaded
library.
Device code registration is implemented using global constructors. Order of
global constructors calling is not defined across different translation units,
so with current design of device shared libraries and device code registration
mechanism it is not possible to understand which device code comes from
preloaded library and which comes from regular shared libraries.

Here is an example:

![LD_PRELOAD](images/LD-preload-shared-libraries.svg)

"libPreload" library is preloaded using `LD_PRELOAD` environment variable.
In this example, device code from "libPreload" might be registered after
device code from "libA".

To implement basic support, for each device image we can record name of the
library where this device image comes from and parse content of `LD_PRELOAD`
environment variable to choose the proper images. However such implementation
will only allow to substitute a whole device image and not an arbitrary
function (unless it is the only function in a device image), because partial
substitution will cause multiple definition errors during runtime linking.

### Run-time libraries loading

It is possible to load shared library during run-time. Both Linux and Windows
provide a way to do so (for example `dlopen()` on Linux or `LoadLibrary` on
Windows).
In case run-time loading is used to load some shared library, the symbols from
this shared library do not appear in the namespace of the main program. It means
that even though shared library is loaded successfully in run-time, it is not
possible to use symbols from it directly. The symbols from run-time loaded
library can be accessed by address which can be obtained using corresponding
OS-dependent API (for example `dlsym()` on Linux).

The problem here is that even though symbols from run-time loaded shared
library are not part of application's namespace, the library is loaded through
standard mechanism, i.e. its global constructors are invoked which means that
device code from it is registered, so it is not possible to
understand whether device code comes from run-time loaded library or not.
If such run-time loaded library exports device symbols and they
somehow match with symbols that actually directly used in device code
somewhere, it is possible that symbols from run-time loaded library
will be unexpectedly used.

To resolve this problem we need to ensure that device code registered from
run-time loaded library appears at the end of symbols search list, however
having that device code registration is triggered by global constructors, it
doesn't seem possible.

One more possible mitigation would be to record name of the library from which
each symbol should be imported, but it still won't resolve all potential
issues with run-time library loading, because user can load the library with the
same name as one of the explicitly linked libraries.
Loading