Commit 5947cde - [SYCL][Doc] Add overview of kernel-program caching (#2514)
Signed-off-by: Sergey Kanaev <[email protected]>

sycl/doc/KernelProgramCache.md (+248 lines):

# A brief overview of the kernel and program caching mechanism

## Rationale behind caching

During SYCL program execution, the SYCL runtime creates internal objects
representing kernels and programs; it may also invoke the JIT compiler to
bring the kernels in a program to an executable state. These runtime
operations are quite expensive, and in some cases a caching approach can be
employed to eliminate redundant kernel or program object re-creation and
online recompilation. A few examples below illustrate scenarios where such an
optimization is possible.

*Use-case #1.* Submission of the same kernel in a loop:
```C++
using namespace cl::sycl;

queue Q;
std::vector<buffer> Bufs;

...
// initialize Bufs with some number of buffers
...

for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
  Q.submit([&](handler &CGH) {
    auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);

    CGH.parallel_for<class TheKernel>(
        range<2>{N, M}, [=](item<2> Item) { ... });
  });
}
```

*Use-case #2.* Submission of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
using namespace cl::sycl;

queue Q;

Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_1>(
      range<2>{N_1, M_1}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_2>(
      range<2>{N_2, M_2}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_3>(
      range<2>{N_3, M_3}, [=](item<2> Item) { ... });
});
...
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_K>(
      range<2>{N_K, M_K}, [=](item<2> Item) { ... });
});
```

In both cases the SYCL runtime will need to build the program and kernels
multiple times, which may involve JIT compilation and take quite a lot of
time.

In order to eliminate this waste of run time we introduce kernel and program
caching. The cache is per-context and it caches the underlying objects of
non-interop kernels and programs which are built with no options.

<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
object corresponding to a SPIR-V module or native binary defining a set of
SYCL kernels and/or device functions.

## Data structure of cache

The cache stores the underlying PI objects behind `cl::sycl::program` and
`cl::sycl::kernel` user-level objects in a per-context data storage. The
storage consists of two maps: one for programs and the other for kernels.

The programs map's key consists of three components: kernel set
id<sup>[1](#what-is-ksid)</sup>, specialization constants, and the device the
program is built for.

The kernels map's key also consists of three components: the program the
kernel belongs to, the kernel name<sup>[2](#what-is-kname)</sup>, and the
device the program is built for.

<a name="what-is-ksid">1</a>: Kernel set id is an ordinal number of the device
binary image the kernel is contained in.

<a name="what-is-kname">2</a>: Kernel name is the mangled name of the kernel
ID class which is provided to methods of `cl::sycl::handler` (e.g.
`parallel_for` or `single_task`).
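
The two maps described above can be sketched roughly as follows. This is a
hypothetical, simplified model: the real storage lives in `KernelProgramCache`
and holds PI objects, and all type names here are illustrative assumptions.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Placeholder handle types: the real cache stores PI objects.
using KernelSetId = std::size_t;   // ordinal of the device binary image
using SpecConstBlob = std::string; // serialized specialization constants
using PiDevice = int;
using PiProgram = int;
using PiKernel = int;

// Programs map key: {kernel set id, specialization constants, device}.
using ProgramKey = std::tuple<KernelSetId, SpecConstBlob, PiDevice>;
// Kernels map key: {program, kernel name, device}.
using KernelKey = std::tuple<PiProgram, std::string, PiDevice>;

// The per-context storage is essentially this pair of maps.
struct ContextCacheSketch {
  std::map<ProgramKey, PiProgram> Programs;
  std::map<KernelKey, PiKernel> Kernels;
};
```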

## Points of improvement (things to do)

- Implement an LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
- Allow for caching of objects built with some build options.
- Employ the same built object for multiple devices of the same ISA,
  capabilities and so on. *NOTE:* It's not really known if it's possible to
  check whether two distinct devices are *exactly* the same. Probably this
  should be an improvement request for the plugins.
- Improve testing: cover real use-cases. See the currently covered cases
  [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).

## Implementation details

The caches are represented with an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The runtime creates one instance of the class per distinct SYCL context
(a context object which is a result of copying another context object isn't
"distinct", as it corresponds to the same underlying internal object
representing a context).

The `KernelProgramCache` is essentially a pair of maps as described above.

### When does the cache come into play?

The cache is used when one submits a kernel for execution or builds a program
with the SYCL API. That means the cache works when either the user explicitly
calls the `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
methods or the SYCL RT builds a program or gets the required kernel as needed
during application execution. Cacheability of an object can be tested with the
`program_impl::is_cacheable()` method. The SYCL RT will only try to insert
cacheable programs or kernels into the cache. This is done as a part of the
`ProgramManager::getOrCreateKernel()` method.

*NOTE:* a kernel is cacheable if and only if the program it belongs to is
cacheable. Conversely, if the program is cacheable, then each and every kernel
of this program will be cached too.

All requests to build a program or to create a kernel - whether they originate
from explicit user API calls or from internal SYCL runtime execution logic -
end up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
with a number of lambda functions passed as arguments:
- Acquire function;
- GetCache function;
- Build function.

The *Acquire* function returns a locked version of the cache. Locking is
employed for thread safety. Threads are blocked only for the insert-or-acquire
attempt, i.e. when calling `map::insert` in the [`getOrBuild`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
function. The rest of the operation is done with the help of atomics and
condition variables (plus a mutex for proper work of the condition variable).

The *GetCache* function returns a reference to the `key->value` mapping out of
the locked instance of the cache. We will see the rationale behind it a bit
later.

The *Build* function actually builds the kernel or program.
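
The hand-off between the three callables can be sketched in a simplified,
single-threaded form. All names here are illustrative assumptions; the real
`getOrBuild` also handles the concurrent build-state machinery and uses real
cache types instead of `std::map<std::string, int>`.

```cpp
#include <map>
#include <string>

// A simplified, single-threaded sketch of the getOrBuild flow:
// GetCache yields the key->value map out of the acquired cache,
// Build produces the object only on a cache miss.
template <typename CacheT, typename GetCacheT, typename BuildT>
int getOrBuildSketch(CacheT &AcquiredCache, GetCacheT GetCache, BuildT Build,
                     const std::string &Key) {
  auto &Map = GetCache(AcquiredCache); // reference into the locked cache
  auto It = Map.find(Key);
  if (It != Map.end())
    return It->second; // hit: reuse the previously built object
  int Built = Build(); // miss: actually build the kernel/program
  Map.emplace(Key, Built);
  return Built;
}
```

Note that the *Build* callable runs only when the key is absent, which is
exactly the redundancy the cache is meant to remove.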

Caching isn't done:
- when a program is built from source with the `program::build_with_source()`
  or `program::compile_with_source()` method;
- when a program is the result of linking multiple programs.

### Thread-safety

Why do we need thread safety here? It is quite possible to have a use-case
where the `cl::sycl::context` is shared across multiple threads (e.g. via
sharing a queue). The possibility of enqueueing multiple cacheable kernels
simultaneously from multiple threads requires us to provide thread-safety for
the caching mechanisms.

It is worth noting that we don't cache the PI resource (kernel or program) by
itself. Instead, we augment the resource with the status of the build process.
Hence, what is cached is a wrapper structure `BuildResult` which contains
three information fields - a pointer to the built resource, the build error
(if applicable) and the current build status (one of "in progress",
"succeeded", "failed").

One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).
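
The shape of that wrapper can be sketched as follows. This is a hypothetical
simplification, not the actual template from kernel_program_cache.hpp; the
field names `Ptr`, `State`, `MBuildResultMutex` and `MBuildCV` follow the
prose in this document.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <string>

// Sketch of a BuildResult-like wrapper: the cached entry couples the
// built resource with the status of its build.
enum class BuildState { InProgress, Succeeded, Failed };

template <typename T> struct BuildResultSketch {
  std::atomic<T *> Ptr{nullptr};                         // built resource
  std::atomic<BuildState> State{BuildState::InProgress}; // build status
  std::string Error;                // stored build error, if applicable
  std::mutex MBuildResultMutex;     // guards the condition variable
  std::condition_variable MBuildCV; // signals waiters on completion
};
```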

The built resource access synchronization approach aims at minimizing the time
any thread holds the global lock guarding the maps, to improve performance. To
achieve that, the global lock is acquired only for the duration of the global
map access. The actual build of the program happens outside of the lock, so
other threads can request or build other programs in the meantime. A thread
requesting a `BuildResult` instance via `getOrBuild` can go one of three ways:

A) The build result is **not** available, and this is the first thread to
   request it. The current thread will then execute the build, letting others
   wait for the result using the per-build-result condition variable kept in
   `BuildResult`'s `MBuildCV` field.

B) The build result is **not** available, and another thread is already
   building the result. The current thread will then wait for the result using
   the `MBuildCV` condition variable.

C) The build result **is** available. The thread simply takes it from the
   `Ptr` field w/o using any mutexes or condition variables.

As noted before, access to `BuildResult` instance fields may occur from
different threads simultaneously, but the global lock is no longer held. So,
to make it safe and to make sure only one thread builds the requested program,
the following is done:
- the program build state is reflected in the `State` field; threads use a
  compare-and-swap technique to compete for who will do the build and become
  thread A. Threads C will find 'DONE' in this field and immediately return
  with the built result at hand.
- thread A and thread(s) B use the `MBuildCV` condition variable field and the
  `MBuildResultMutex` mutex field guarding that variable to implement the
  "single producer - multiple consumers" scheme.
- the build result itself appears in the `Ptr` field when available.

All fields are atomic because they can be accessed from multiple threads.

A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
use of this class implements RAII to make the code look a bit cleaner. The
GetCache function will then return the mapping to be employed, which includes
the three key components: kernel name, device, as well as any specialization
constants. These get added to `BuildResult` and are cached. The `BuildResult`
structure is specialized with either `PiKernel` or
`PiProgram`<sup>[1](#remove-pointer)</sup>.
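
The RAII idea behind `Locked` can be sketched like this. It is a simplified
stand-in for the real helper in locked.hpp, and the names are illustrative
assumptions.

```cpp
#include <mutex>

// Sketch of a Locked<T>-style RAII helper: the mutex is held exactly as
// long as the returned object lives, so the caller cannot forget to unlock.
template <typename T> class LockedSketch {
  std::unique_lock<std::mutex> MLock; // released in the destructor
  T &MData;

public:
  LockedSketch(T &Data, std::mutex &M) : MLock(M), MData(Data) {}
  T &get() { return MData; }
};

// An Acquire-style function simply returns the locked view of the data.
template <typename T> LockedSketch<T> acquireSketch(T &Data, std::mutex &M) {
  return LockedSketch<T>(Data, M);
}
```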

### Core of caching mechanism

Now, let us see how the `getOrBuild` function works.
First, we fetch the cache with sequential calls to the Acquire and GetCache
functions. Then, we check if this is the first attempt to build this kernel or
program. This is achieved with an attempt to insert another key-value pair
into the map. At this point we try to insert a `BuildResult` stub instance
with the status equal to "in progress", which will allow other threads to know
that someone (i.e. we) is building the object (i.e. kernel or program) now. If
the insertion fails, we will wait for the building thread to finish with a
call to the `waitUntilBuilt` function. This function will throw the stored
exception<sup>[2](#exception-data)</sup> upon build failure. This allows
waiting threads to see the same result as the building thread. A special case
of failure is when the build result doesn't contain the error (i.e. the error
wasn't of `cl::sycl::exception` type) and the pointer to the object in the
`BuildResult` instance is nil. In this case, the building thread has finished
the build process and has returned an error to the user. But this error may be
sporadic in nature and may be spurious. Hence, the waiting thread will try to
build the same object once more.
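
The waiting side of this scheme can be sketched as follows. This is a
hypothetical simplification: a plain `std::runtime_error` rebuilt from the
stored message stands in for the stored `cl::sycl::exception` contents, and
the names are illustrative.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <string>

// Sketch of a waitUntilBuilt-style helper: waiters block on the condition
// variable until the build finishes, then rethrow an exception rebuilt from
// the stored message so they observe the same failure as the builder.
enum class Status { InProgress, Succeeded, Failed };

struct ResultSlot {
  std::atomic<Status> State{Status::InProgress};
  std::string ErrorMsg; // stored contents of the builder's exception
  std::mutex M;
  std::condition_variable CV;
};

void waitUntilBuiltSketch(ResultSlot &R) {
  std::unique_lock<std::mutex> Lock(R.M);
  R.CV.wait(Lock, [&] { return R.State.load() != Status::InProgress; });
  if (R.State.load() == Status::Failed)
    throw std::runtime_error(R.ErrorMsg); // surface the builder's error
}
```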

The `BuildResult` structure also contains synchronization objects: a mutex and
a condition variable. We employ them to signal waiting threads that the build
process for this kernel/program is finished (either successfully or with a
failure).

<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted
for the sake of simplicity here.

<a name="exception-data">2</a>: Actually, we store the contents of the
exception: its message and error code.

sycl/doc/index.rst (+1 line): adds `KernelProgramCache` to the documentation
index after `SpecializationConstants`.
