[SYCL][Doc] Add overview of kernel-program caching #2514
Merged

Commits (12)
9454aee [Doc] Add overview of kernel-program caching
d141bef [Doc] Address comment
7feb6ce Merge branch 'sycl' into private/s-kanaev/kp-cache
39b20f5 Add line break
c16e4ad Add new file to TOC
5091ade Fix link
40b7669 Address comments
3277153 Add note
0142d6a Address comments.
dfde4b6 Merge branch 'sycl' into private/s-kanaev/kp-cache
f0a7cbd Merge branch 'sycl' into private/s-kanaev/kp-cache
fc153b4 Address comments.
@@ -0,0 +1,223 @@
# A brief overview of the kernel and program caching mechanism

## Rationale behind caching

During SYCL program execution the SYCL runtime creates internal objects
representing kernels and programs; it may also invoke the JIT compiler to bring
the kernels in a program to an executable state. These runtime operations are
quite expensive, and in some cases a caching approach can be employed to
eliminate redundant kernel or program object re-creation and online
recompilation. A few examples below illustrate scenarios where such an
optimization is possible.

*Use-case #1.* Submission of the same kernel in a loop:
```C++
using namespace cl::sycl;

queue Q;
std::vector<buffer<int>> Bufs;

...
// initialize Bufs with some number of buffers
...

for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
  Q.submit([&](handler &CGH) {
    auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);

    CGH.parallel_for<class TheKernel>(
        range<2>{N, M}, [=](item<2> Item) { ... });
  });
}
```

*Use-case #2.* Submission of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
using namespace cl::sycl;

queue Q;

Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_1>(
      range<2>{N_1, M_1}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_2>(
      range<2>{N_2, M_2}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_3>(
      range<2>{N_3, M_3}, [=](item<2> Item) { ... });
});
...
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_K>(
      range<2>{N_K, M_K}, [=](item<2> Item) { ... });
});
```

In both cases the SYCL runtime will need to build the program and kernels
multiple times, which may involve JIT compilation and take quite a lot of time.

To eliminate this waste of run time we introduce kernel and program caching.
The cache is per-context and it caches the underlying objects of non-interop
kernels and programs which are built with no build options.

<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
object corresponding to a SPIR-V module or native binary defining a set of SYCL
kernels and/or device functions.

## Data structure of cache

The cache stores the underlying PI objects behind `cl::sycl::program` and
`cl::sycl::kernel` user-level objects in a per-context data storage. The
storage consists of two maps: one for programs and the other for kernels.

The program map's key consists of three components: the kernel set
id<sup>[1](#what-is-ksid)</sup>, the specialization constants, and the device
the program is built for.

The kernel map's key also consists of three components: the program the kernel
belongs to, the kernel name<sup>[2](#what-is-kname)</sup>, and the device the
program is built for.

<a name="what-is-ksid">1</a>: The kernel set id is the ordinal number of the
device binary image the kernel is contained in.

<a name="what-is-kname">2</a>: The kernel name is the mangled name of the
kernel ID class which is provided to methods of `cl::sycl::handler`
(e.g. `parallel_for` or `single_task`).

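As an illustration only, the per-context storage can be pictured as two maps
keyed as described above. The type and field names below are hypothetical and
simplified; the real definitions live in `KernelProgramCache`.

```C++
// Hypothetical, simplified picture of the two per-context maps; the actual
// key and value types in KernelProgramCache differ in detail.
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

struct PiProgram; // opaque PI handles, forward-declared for illustration
struct PiKernel;
struct PiDevice;

// Program map: (kernel set id, specialization constants, device) -> program.
using ProgramKey = std::tuple<std::uint32_t /*kernel set id (ordinal)*/,
                              std::string /*serialized spec constants*/,
                              PiDevice * /*device*/>;
using ProgramMap = std::map<ProgramKey, PiProgram *>;

// Kernel map: (program, kernel name, device) -> kernel.
using KernelKey = std::tuple<PiProgram * /*program*/,
                             std::string /*kernel name*/,
                             PiDevice * /*device*/>;
using KernelMap = std::map<KernelKey, PiKernel *>;
```
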
## Points of improvement (things to do)

- Implement an LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
- Allow for caching of objects built with some build options.
- Employ the same built object for multiple devices of the same ISA,
  capabilities and so on. *NOTE:* It's not really known if it's possible to
  check whether two distinct devices are *exactly* the same. Probably this
  should be an improvement request for plugins.
- Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).

## Implementation details

The caches are represented with an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The runtime creates one instance of the class per distinct SYCL context.
(A context object which is a result of copying another context object isn't
"distinct", as it corresponds to the same underlying internal object
representing the context.)

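For instance, in the hypothetical snippet below a copied context shares the
cache of the original one:

```C++
// Hypothetical illustration: a copied context refers to the same underlying
// object, so both handles share one KernelProgramCache instance.
#include <CL/sycl.hpp>
#include <cassert>
using namespace cl::sycl;

int main() {
  context Original;
  context Copy = Original;  // not a "distinct" context
  assert(Copy == Original); // same underlying object, hence the same cache
}
```
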
The `KernelProgramCache` is essentially a pair of maps as described above.

### When is the cache used?

The cache is used when one submits a kernel for execution or builds a program
with the SYCL API. That means the cache works either when the user explicitly
calls the `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
methods or when the SYCL RT builds a program or gets the required kernel as
needed during application execution. Cacheability of an object can be tested
with the `program_impl::is_cacheable()` method. SYCL RT will only try to insert
cacheable programs or kernels into the cache. This is done as a part of the
`ProgramManager::getOrCreateKernel()` method.

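A minimal sketch of the explicit path, assuming the SYCL 1.2.1 `program` API
mentioned above (`MyKernel` and the range below are made up for illustration):

```C++
#include <CL/sycl.hpp>
using namespace cl::sycl;

class MyKernel; // hypothetical kernel name

int main() {
  queue Q;

  // Explicit path: the resulting program/kernel may be inserted into the
  // per-context cache if program_impl::is_cacheable() holds for it.
  program P(Q.get_context());
  P.build_with_kernel_type<MyKernel>();
  kernel K = P.get_kernel<MyKernel>();
  (void)K;

  // Implicit path: SYCL RT builds the program and gets the kernel as needed;
  // a later submission of MyKernel can reuse the cached objects.
  Q.submit([&](handler &CGH) {
    CGH.parallel_for<MyKernel>(range<1>{1024}, [=](item<1>) { /* no-op */ });
  });
  Q.wait();
}
```
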
*NOTE:* a kernel is cacheable if and only if the program it belongs to is
cacheable. On the other hand, if the program is cacheable, then every kernel
of this program will be cached too.

All requests to build a program or to create a kernel, whether they originate
from explicit user API calls or from internal SYCL runtime execution logic, end
up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
with a number of lambda functions passed as arguments:
- Acquire function;
- GetCache function;
- Build function.

The *Acquire* function returns a locked version of the cache. Locking is
employed for thread safety. Threads are blocked only for the insert-or-acquire
attempt, i.e. when calling `map::insert` in the [`getOrBuild`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
function. The rest of the operation is done with the help of atomics and
condition variables (plus a mutex for proper work of the condition variable).

The *GetCache* function returns a reference to the `key -> value` mapping out
of the locked instance of the cache. We will see the rationale behind it a bit
later.

The *Build* function actually builds the kernel or program.

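To make the division of labor concrete, here is a self-contained toy model of
the three-lambda pattern. All names and types below are simplified stand-ins,
not the real `ProgramManager`/`KernelProgramCache` code, and the toy ignores
the concurrent-waiter handling described later.

```C++
#include <map>
#include <mutex>
#include <string>
#include <utility>

template <typename MapT> struct Locked { // RAII: map reference + held lock
  std::unique_lock<std::mutex> Lock;
  MapT &Map;
};

struct ToyCache {
  std::mutex Mtx;
  std::map<std::string, int> Programs; // int is a stand-in for BuildResult
};

template <typename AcquireF, typename GetCacheF, typename BuildF>
int getOrBuild(ToyCache &C, const std::string &Key, AcquireF Acquire,
               GetCacheF GetCache, BuildF Build) {
  auto Acquired = Acquire(C);                // lock the cache
  auto &Map = GetCache(Acquired);            // select the proper map
  auto [It, Inserted] = Map.emplace(Key, 0); // insert-or-acquire attempt
  Acquired.Lock.unlock();                    // the build itself runs unlocked
  if (Inserted)
    It->second = Build(); // first caller performs the expensive build
  return It->second;      // later callers reuse the cached entry
}

int main() {
  ToyCache C;
  auto Acquire = [](ToyCache &C) {
    return Locked<std::map<std::string, int>>{
        std::unique_lock<std::mutex>(C.Mtx), C.Programs};
  };
  auto GetCache = [](Locked<std::map<std::string, int>> &L) -> auto & {
    return L.Map;
  };
  auto Build = [] { return 42; /* pretend this is an expensive JIT build */ };

  int First = getOrBuild(C, "TheKernel", Acquire, GetCache, Build);  // builds
  int Second = getOrBuild(C, "TheKernel", Acquire, GetCache, Build); // reuses
  (void)First;
  (void)Second;
}
```
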
Caching isn't done:
- when a program is built from source with the `program::build_with_source()`
  or `program::compile_with_source()` method;
- when a program is a result of linking multiple programs.

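For example, a from-source build like the following hypothetical snippet
bypasses the cache (it assumes a backend that accepts OpenCL C source):

```C++
#include <CL/sycl.hpp>
using namespace cl::sycl;

int main() {
  queue Q;
  program P(Q.get_context());
  // Built from source text, so the result is not cacheable.
  P.build_with_source(
      "kernel void add_one(global int *Data) { Data[get_global_id(0)] += 1; }");
  kernel K = P.get_kernel("add_one"); // looked up by name, bypassing the cache
  (void)K;
}
```
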
### Thread-safety

Why do we need thread safety here? It is quite possible to have a use-case in
which the `cl::sycl::context` is shared across multiple threads (e.g. via
sharing a queue). The possibility of enqueueing multiple cacheable kernels
simultaneously from multiple threads requires us to provide thread safety for
the caching mechanisms.

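A hypothetical sketch of such a scenario, with one queue (and therefore one
context and one cache) shared by two host threads:

```C++
#include <CL/sycl.hpp>
#include <thread>
using namespace cl::sycl;

class SharedKernel; // hypothetical kernel name

int main() {
  queue Q; // shared by both threads, so they share one context and one cache
  auto Submit = [&Q] {
    Q.submit([&](handler &CGH) {
      CGH.parallel_for<SharedKernel>(range<1>{256}, [=](item<1>) { /* no-op */ });
    });
  };
  std::thread T1(Submit), T2(Submit); // concurrent accesses to the cache
  T1.join();
  T2.join();
  Q.wait();
}
```
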
It is worth noting that we don't cache the PI resource (kernel or program) on
its own. Instead, we augment the resource with the status of the build process.
Hence, what is cached is a wrapper structure `BuildResult` which contains three
information fields: a pointer to the built resource, the build error (if
applicable) and the current build status (one of "in progress", "succeeded",
"failed").

One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).

The pointer to the built resource and the build status are both atomic
variables. Atomicity of these variables allows one to hold the lock on the
cache for quite a short time and perform the rest of the build/wait process
without forcing other threads to wait on lock availability.

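A rough sketch of what such a wrapper looks like (field and type names below
are illustrative; see the link above for the real definition):

```C++
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <string>

enum class BuildState { InProgress, Succeeded, Failed };

// T stands for PiKernel or PiProgram (modulo std::remove_pointer, see below).
template <typename T> struct BuildResult {
  std::atomic<T *> Ptr{nullptr};                         // the built resource
  std::atomic<BuildState> State{BuildState::InProgress}; // current status

  // Stored exception contents (message and error code) for waiting threads.
  std::string ErrorMsg;
  int ErrorCode = 0;

  // Used to signal waiters once the build finishes (see the next section).
  std::mutex Mtx;
  std::condition_variable CV;
};
```
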
A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
use of this class implements RAII to make the code look a bit cleaner. The
GetCache function then returns the mapping to be employed, which includes the
three key components: kernel name, device, as well as any specialization
constants. These get added to `BuildResult` and are cached. The `BuildResult`
structure is specialized with either `PiKernel` or
`PiProgram`<sup>[1](#remove-pointer)</sup>.

### Core of caching mechanism

Now, let us see how `getOrBuild` works.
First, we fetch the cache with sequential calls to the Acquire and GetCache
functions. Then, we check if this is the first attempt to build this kernel or
program. This is achieved with an attempt to insert another key-value pair into
the map. At this point we try to insert a `BuildResult` stub instance with the
status equal to "in progress", which will let other threads know that someone
(i.e. we) is building the object (i.e. kernel or program) now. If the insertion
fails, we wait for the building thread to finish with a call to the
`waitUntilBuilt` function. This function will throw the stored
exception<sup>[2](#exception-data)</sup> upon build failure. This allows
waiting threads to see the same result as the building thread. A special case
of failure is when the build result doesn't contain the error (i.e. the error
wasn't of `cl::sycl::exception` type) and the pointer to the object in the
`BuildResult` instance is nil. In this case, the building thread has finished
the build process and has returned an error to the user, but this error may be
sporadic in nature and spurious, so the waiting thread will try to build the
same object once more.

The `BuildResult` structure also contains synchronization objects: a mutex and
a condition variable. We employ them to signal waiting threads that the build
process for this kernel/program is finished (either successfully or with a
failure).

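Building on the `BuildResult` sketch above, the waiting thread's logic might
look roughly like this (a hypothetical simplification, not the actual
`waitUntilBuilt` implementation):

```C++
#include <mutex>
#include <stdexcept>

template <typename T> T *waitUntilBuiltSketch(BuildResult<T> &BR) {
  std::unique_lock<std::mutex> Lock(BR.Mtx);
  // Sleep until the building thread publishes a terminal state and notifies.
  BR.CV.wait(Lock, [&] { return BR.State.load() != BuildState::InProgress; });

  if (BR.State.load() == BuildState::Failed) {
    if (!BR.ErrorMsg.empty())
      throw std::runtime_error(BR.ErrorMsg); // surface the stored error
    return nullptr; // no stored error and nil pointer: caller retries the build
  }
  return BR.Ptr.load(); // build succeeded; reuse the cached object
}
```
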
<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted
here for the sake of simplicity.

<a name="exception-data">2</a>: Actually, we store the contents of the
exception: its message and error code.

@@ -27,3 +27,4 @@ Developing oneAPI DPC++ Compiler
 EnvironmentVariables
 PluginInterface
 ABIPolicyGuide
+KernelProgramCache