Commit 40b7669

Author: Sergey Kanaev
Commit message: Address comments
Signed-off-by: Sergey Kanaev <[email protected]>
1 parent 5091ade, commit 40b7669

File tree

1 file changed: +73 -63 lines


sycl/doc/KernelProgramCache.md

@@ -1,8 +1,15 @@
-# A brief overview of kernel/program caching mechanism.
+# A brief overview of kernel and program caching mechanism.
 
 ## Rationale behind caching
 
-*Use-case #1.* Looped enqueue of the same kernel:
+During SYCL program execution the SYCL runtime creates internal objects
+representing kernels and programs; it may also invoke the JIT compiler to bring
+kernels in a program to an executable state. These runtime operations are quite
+expensive, and in some cases a caching approach can be employed to eliminate
+redundant kernel or program object re-creation and online recompilation. A few
+examples below illustrate scenarios where such optimization is possible.
+
+*Use-case #1.* Submission of the same kernel in a loop:
 ```C++
 using namespace cl::sycl::queue;

@@ -23,7 +30,7 @@
 }
 ```
 
-*Use-case #2.* Enqueue of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
+*Use-case #2.* Submission of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
 ```C++
 using namespace cl::sycl::queue;
 
@@ -56,36 +63,36 @@
 });
 ```
 
-Both these use-cases will need to built the program or kernel multiple times.
-When JIT is employed this process may take quite a lot of time.
+In both cases the SYCL runtime will need to build the program and kernels
+multiple times, which may involve JIT compilation and take quite a lot of time.
 
-In order to eliminate this waste of run-time we introduce a kernel/program
+In order to eliminate this waste of run-time we introduce a kernel and program
 caching. The cache is per-context and it caches underlying objects of non
 interop kernels and programs which are built with no options.
 
-<a name="what-is-program">1</a>: Here we use the term "program" in the same
-sense as OpenCL does i.e. a set of kernels.
+<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
+object corresponding to a SPIR-V module or native binary defining a set of SYCL
+kernels and/or device functions.
 
 
 ## Data structure of cache
 
-The cache stores underlying PI objects of `cl::sycl::program` and
-`cl::sycl::kernel` in a per-context data storage. The storage consists of two
-maps: one is for programs and the other is for kernels.
+The cache stores the underlying PI objects behind `cl::sycl::program` and
+`cl::sycl::kernel` user-level objects in a per-context data storage. The storage
+consists of two maps: one is for programs and the other is for kernels.
 
-Programs mapping's key consists of three components:
-kernel set id<sup>[1](#what-is-ksid)</sup>, specialized constants, device this
-program is built for.
+The programs map's key consists of three components: kernel set id<sup>[1](#what-is-ksid)</sup>,
+specialization constants, and the device this program is built for.
 
-Kernels mapping's key consists of three components too: program the kernel
+The kernels map's key consists of three components too: the program the kernel
 belongs to, kernel name<sup>[2](#what-is-kname)</sup>, device the program is
 built for.
 
-<a name="what-is-ksid">1</a>: Kernel set id is merely a number of translation
-unit which contains at least one kernel.
+<a name="what-is-ksid">1</a>: Kernel set id is an ordinal number of the device
+binary image the kernel is contained in.
 
-<a name="what-is-kname">2</a>: Kernel name is mangled class name which is
-provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
+<a name="what-is-kname">2</a>: Kernel name is the mangled name of the kernel ID
+class which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
 `single_task`).
 
@@ -102,19 +109,23 @@ provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
 ## Implementation details
 
 The caches are represented with instance of [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
-class. The class is instantiated in a per-context manner.
+class. The runtime creates one instance of the class per distinct SYCL context.
+(A context object which is a result of copying another context object isn't
+"distinct", as it corresponds to the same underlying internal object
+representing a context.)
 
-The `KernelProgramCache` is the storage descrived above.
+The `KernelProgramCache` is essentially the pair of maps described above.
 
 
 ### When does the cache come at work?
 
-The cache is employed when one submits kernel for execution or builds program or
-kernel with SYCL API. That means that the cache works when either user
-explicitly calls `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
-methods or SYCL RT builds or gets the required kernel. Cacheability of an object
-is verified with `program_impl::is_cacheable()` method. SYCL RT will check if
-program is cacheable and will get the kernel with call to
+The cache is used when one submits a kernel for execution or builds a program
+or a kernel with SYCL API. That means that the cache works when either the user
+explicitly calls `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
+methods or SYCL RT builds a program or gets the required kernel as needed during
+application execution. Cacheability of an object can be tested with the
+`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
+programs or kernels into the cache. This is done as a part of the
 `ProgramManager::getOrCreateKernel()` method.
 
 
@@ -123,12 +134,10 @@ cacheable. On the other hand if the program is cacheable, then each and every
 kernel of this program will be cached also.
 
 
-Invoked by user `program::build_with_kernel_type<>()` and
-`program::get_kernel<>()` methods will call either
-`ProgramManager::getBuildPIProgram()` or `ProgramManager::getOrCreateKernel()`
-method respectively. Now, both these methods will call template
-function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
-with multiple lambdas passed to it:
+All requests to build a program or to create a kernel - whether they originate
+from explicit user API calls or from internal SYCL runtime execution logic - end
+up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
+with a number of lambda functions passed as arguments:
 - Acquire function;
 - GetCache function;
 - Build function.
@@ -145,20 +154,21 @@ instance of cache. We will see rationale behind it a bit later.
 *Build* function actually builds the kernel or program.
 
 Caching isn't done:
-- when program is built out of source i.e. with
-`program::build_with_source()` or `program::compile_with_source()` method;
-- when program is result of linking of multiple programs.
+- when program is built out of source with `program::build_with_source()` or
+`program::compile_with_source()` method;
+- when program is a result of linking multiple programs.
 
 
 ### Thread-safety
 
-Why do we need thread safety here? It's quite possible to have a use-case when
+Why do we need thread safety here? It is quite possible to have a use-case when
 the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a
 queue). Possibility of enqueueing multiple cacheable kernels simultaneously
-within multiple threads makes us to provide thread-safety for the cache.
+from multiple threads requires us to provide thread-safety for the caching
+mechanisms.
 
-It's worth of noting that we don't cache the PI resource (kernel or program)
-on it's own. Instead we augment the resource with the status of build process.
+It is worth noting that we don't cache the PI resource (kernel or program)
+by itself. Instead we augment the resource with the status of the build process.
 Hence, what is cached is a wrapper structure `BuildResult` which contains three
 information fields - pointer to built resource, build error (if applicable) and
 current build status (either of "in progress", "succeeded", "failed").
@@ -167,45 +177,45 @@ One can find definition of `BuildResult` template in [KernelProgramCache](https:
 
 Pointer to built resource and build result are both atomic variables. Atomicity
 of these variables allows one to hold lock on cache for quite a short time and
-perform the rest of build/wait process without unwanted need of other threads to
-wait on lock availability.
+perform the rest of build/wait process without forcing other threads to wait on
+lock availability.
 
 A specialization of helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
 for reference of proper mapping is returned by Acquire function. The use of this
 class implements RAII to make code look cleaner a bit. Now, GetCache function
-will return the mapping to be employed i.e. it'll fetch mapping of kernel name
-plus device to `BuildResult` for proper program as `getOrBuild` will work with
-mapping of key (whichever it is) to `BuildResult` specialization. The structure
-is specialized with either `PiKernel` or `PiProgram`<sup>[1](#remove-program)</sup>.
+will return the mapping to be employed, which includes the three components:
+kernel name, device, as well as any specialization constants. These get added to
+`BuildResult` and are cached. The `BuildResult` structure is specialized with
+either `PiKernel` or `PiProgram`<sup>[1](#remove-pointer)</sup>.
 
 
 ### Core of caching mechanism
 
-Now, how `getOrBuild` works?
+Now, let us see how the `getOrBuild` function works.
 First, we fetch the cache with sequential calls to Acquire and GetCache
-functions. Then, we check if we're the first ones who build this kernel/program.
-This is achieved with attempt to insert another key-value pair into the map.
-At this point we try to insert `BuildResult` stub instance with status equal to
-"in progress" which will allow other threads to know that someone is (i.e.
-we're) building the object (i.e. kernel or program) now. If insertion fails we
-will wait for building thread to finish with call to `waitUntilBuilt` function.
-This function will throw stored exception<sup>[2](#exception-data)</sup> upon
-build failure. This allows waiting threads to result the same as the building
-thread. Special case of the failure is when build result doesn't contain the
-error (i.e. the error wasn't of `cl::sycl::exception` type) and the pointer to
-object in `BuildResult` instance is nil. In this case the building thread has
-finished build process and returned the user an error. Though, this error could
-be of spurious/sporadic nature. Hence, the waiting thread will try to build the
-same object once more.
+functions. Then, we check if this is the first attempt to build this kernel or
+program. This is achieved with an attempt to insert another key-value pair into
+the map. At this point we try to insert a `BuildResult` stub instance with status
+equal to "in progress", which will allow other threads to know that someone is
+(i.e. we're) building the object (i.e. kernel or program) now. If the insertion
+fails, we will wait for the building thread to finish with a call to the
+`waitUntilBuilt` function. This function will throw the stored exception<sup>[2](#exception-data)</sup>
+upon build failure. This allows waiting threads to see the same result as the
+building thread. A special case of the failure is when the build result doesn't
+contain the error (i.e. the error wasn't of `cl::sycl::exception` type) and the
+pointer to the object in the `BuildResult` instance is nil. In this case, the
+building thread has finished the build process and has returned an error to the
+user. But this error may be sporadic in nature and may be spurious. Hence, the
+waiting thread will try to build the same object once more.
 
 `BuildResult` structure also contains synchronization objects: mutex and
 condition variable. We employ them to signal waiting threads that the build
 process for this kernel/program is finished (either successfully or with a
 failure).
 
 
-<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted in
-sake of simplicity here.
+<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted for
+the sake of simplicity here.
 
 <a name="exception-data">2</a>: Actually, we store contents of the exception:
 its message and error code.
