# A brief overview of kernel and program caching mechanism.

## Rationale behind caching

During SYCL program execution the SYCL runtime will create internal objects
representing kernels and programs; it may also invoke the JIT compiler to bring
kernels in a program to an executable state. Those runtime operations are quite
expensive, and in some cases a caching approach can be employed to eliminate
redundant kernel or program object re-creation and online recompilation. A few
examples below illustrate scenarios where such optimization is possible.

*Use-case #1.* Submission of the same kernel in a loop:
```C++
using namespace cl::sycl;

// ... (snippet abbreviated: a loop repeatedly submits the same kernel)
}
```

*Use-case #2.* Submission of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
using namespace cl::sycl;

// ... (snippet abbreviated: several kernels of one program are submitted)
});
```

In both cases the SYCL runtime will need to build the program and kernels
multiple times, which may involve JIT compilation and take quite a lot of time.

In order to eliminate this waste of run time we introduce kernel and program
caching. The cache is per-context and it caches underlying objects of
non-interop kernels and programs which are built with no options.

<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
object corresponding to a SPIR-V module or native binary defining a set of SYCL
kernels and/or device functions.

## Data structure of cache

The cache stores the underlying PI objects behind `cl::sycl::program` and
`cl::sycl::kernel` user-level objects in a per-context data storage. The storage
consists of two maps: one is for programs and the other is for kernels.

The programs map's key consists of three components: kernel set id<sup>[1](#what-is-ksid)</sup>,
specialization constants, and the device this program is built for.

The kernels map's key consists of three components too: the program the kernel
belongs to, kernel name<sup>[2](#what-is-kname)</sup>, and the device the
program is built for.

<a name="what-is-ksid">1</a>: Kernel set id is an ordinal number of the device
binary image the kernel is contained in.

<a name="what-is-kname">2</a>: Kernel name is the mangled name of the kernel ID
class which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for`
or `single_task`).

## Implementation details

The caches are represented with an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The runtime creates one instance of the class per distinct SYCL context
(a context object which is a result of copying another context object isn't
"distinct", as it corresponds to the same underlying internal object
representing a context).

The `KernelProgramCache` is essentially a pair of maps as described above.

### When does the cache come into play?

The cache is used when one submits a kernel for execution or builds a program or
kernel with SYCL API. That means that the cache works when either the user
explicitly calls the `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
methods or SYCL RT builds a program or gets the required kernel as needed during
application execution. Cacheability of an object can be tested with the
`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
programs or kernels into the cache. This is done as a part of the
`ProgramManager::getOrCreateKernel()` method.

On the other hand, if the program is cacheable, then each and every kernel of
this program will be cached also.

All requests to build a program or to create a kernel, whether they originate
from explicit user API calls or from internal SYCL runtime execution logic, end
up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
with a number of lambda functions passed as arguments:
- Acquire function;
- GetCache function;
- Build function.

*Build* function actually builds the kernel or program.

Caching isn't done:
- when a program is built from source with the `program::build_with_source()` or
  `program::compile_with_source()` method;
- when a program is a result of linking multiple programs.

### Thread-safety

Why do we need thread safety here? It is quite possible to have a use-case where
the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a
queue). The possibility of enqueueing multiple cacheable kernels simultaneously
from multiple threads requires us to provide thread-safety for the caching
mechanisms.

It is worth noting that we don't cache the PI resource (kernel or program) by
itself. Instead we augment the resource with the status of the build process.
Hence, what is cached is a wrapper structure `BuildResult` which contains three
information fields: a pointer to the built resource, a build error (if
applicable) and the current build status (one of "in progress", "succeeded",
"failed").

One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).

The pointer to the built resource and the build result are both atomic
variables. Atomicity of these variables allows one to hold the lock on the cache
for quite a short time and perform the rest of the build/wait process without
forcing other threads to wait on lock availability.

A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
use of this class implements RAII to make the code look a bit cleaner. Now, the
GetCache function will return the mapping to be employed, which includes the
three components: kernel name, device, as well as any specialization constants.
These get added to `BuildResult` and are cached. The `BuildResult` structure is
specialized with either `PiKernel` or `PiProgram`<sup>[1](#remove-pointer)</sup>.

### Core of caching mechanism

Now, let us see how the `getOrBuild` function works.
First, we fetch the cache with sequential calls to the Acquire and GetCache
functions. Then, we check if this is the first attempt to build this kernel or
program. This is achieved with an attempt to insert another key-value pair into
the map. At this point we try to insert a `BuildResult` stub instance with
status equal to "in progress", which will allow other threads to know that
someone (i.e. we) is building the object (i.e. kernel or program) now. If the
insertion fails, we will wait for the building thread to finish with a call to
the `waitUntilBuilt` function. This function will throw the stored
exception<sup>[2](#exception-data)</sup> upon build failure. This allows waiting
threads to see the same result as the building thread. A special case of the
failure is when the build result doesn't contain the error (i.e. the error
wasn't of `cl::sycl::exception` type) and the pointer to the object in the
`BuildResult` instance is nil. In this case, the building thread has finished
the build process and has returned an error to the user. But this error may be
sporadic in nature and may be spurious. Hence, the waiting thread will try to
build the same object once more.

The `BuildResult` structure also contains synchronization objects: a mutex and a
condition variable. We employ them to signal waiting threads that the build
process for this kernel/program is finished (either successfully or with a
failure).

<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted for
the sake of simplicity here.

<a name="exception-data">2</a>: Actually, we store the contents of the exception: