# A brief overview of the kernel and program caching mechanism

## Rationale behind caching

During SYCL program execution the SYCL runtime creates internal objects
representing kernels and programs; it may also invoke the JIT compiler to
bring the kernels in a program to an executable state. These runtime
operations are quite expensive, and in some cases a caching approach can be
employed to eliminate redundant kernel or program object re-creation and
online recompilation. A few examples below illustrate scenarios where such an
optimization is possible.

*Use-case #1.* Submission of the same kernel in a loop:
```C++
  #include <CL/sycl.hpp>

  using namespace cl::sycl;

  queue Q;
  std::vector<buffer<int, 2>> Bufs;

  ...
  // initialize Bufs with some number of buffers
  ...

  for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
    Q.submit([&](handler &CGH) {
      auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);

      CGH.parallel_for<class TheKernel>(
          range<2>{N, M}, [=](item<2> Item) { ... });
    });
  }
```

*Use-case #2.* Submission of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
  #include <CL/sycl.hpp>

  using namespace cl::sycl;

  queue Q;

  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_1>(
        range<2>{N_1, M_1}, [=](item<2> Item) { ... });
  });
  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_2>(
        range<2>{N_2, M_2}, [=](item<2> Item) { ... });
  });
  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_3>(
        range<2>{N_3, M_3}, [=](item<2> Item) { ... });
  });
  ...
  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_K>(
        range<2>{N_K, M_K}, [=](item<2> Item) { ... });
  });
```

In both cases the SYCL runtime needs to build the program and its kernels
multiple times, which may involve JIT compilation and take quite a lot of time.

In order to eliminate this waste of run time we introduce kernel and program
caching. The cache is per-context and it caches the underlying objects of
non-interop kernels and programs which are built with no options.

<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
object corresponding to a SPIR-V module or a native binary defining a set of
SYCL kernels and/or device functions.


## Data structure of cache

The cache stores the underlying PI objects behind `cl::sycl::program` and
`cl::sycl::kernel` user-level objects in a per-context data storage. The
storage consists of two maps: one for programs and the other for kernels.

The program map's key consists of three components: the kernel set
id<sup>[1](#what-is-ksid)</sup>, the specialization constants, and the device
the program is built for.

The kernel map's key also consists of three components: the program the kernel
belongs to, the kernel name<sup>[2](#what-is-kname)</sup>, and the device the
program is built for.

<a name="what-is-ksid">1</a>: The kernel set id is the ordinal number of the
device binary image the kernel is contained in.

<a name="what-is-kname">2</a>: The kernel name is the mangled name of the
kernel ID class which is provided to methods of `cl::sycl::handler` (e.g.
`parallel_for` or `single_task`).
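
For illustration, a minimal sketch of what these two maps could look like is
given below. All type names here are illustrative stand-ins rather than the
actual definitions; see [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
for the real ones.
```C++
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Illustrative stand-ins for the real PI/runtime types.
struct PiDeviceT;  using PiDevice  = PiDeviceT *;
struct PiProgramT; using PiProgram = PiProgramT *;
struct PiKernelT;  using PiKernel  = PiKernelT *;
using KernelSetId   = std::uintptr_t;             // ordinal of the device binary image
using SerializedObj = std::vector<unsigned char>; // serialized specialization constants
template <typename T> struct BuildResult { // wrapper described further below
  T Ptr = nullptr;
};

// Program map: (kernel set id, specialization constants, device) -> program.
using ProgramKeyT = std::tuple<KernelSetId, SerializedObj, PiDevice>;
// Kernel map: (program, kernel name, device) -> kernel.
using KernelKeyT = std::tuple<PiProgram, std::string, PiDevice>;

std::map<ProgramKeyT, BuildResult<PiProgram>> ProgramCache;
std::map<KernelKeyT, BuildResult<PiKernel>> KernelCache;
```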


## Points of improvement (things to do)

 - Implement an LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
 - Allow for caching of objects built with some build options.
 - Employ the same built object for multiple devices of the same ISA,
 capabilities and so on. *NOTE:* It's not really known if it's possible to
 check whether two distinct devices are *exactly* the same. Probably this
 should be an improvement request for plugins.
 - Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).


## Implementation details

The caches are represented with an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The runtime creates one instance of the class per distinct SYCL context
(a context object which is a result of copying another context object isn't
"distinct", as it corresponds to the same underlying internal object
representing a context).

The `KernelProgramCache` is essentially a pair of maps as described above.


### When does the cache come into play?

The cache is used when one submits a kernel for execution or builds a program
with the SYCL API. That means that the cache works when either the user
explicitly calls the `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
methods or the SYCL RT builds a program or gets the required kernel as needed
during application execution. Cacheability of an object can be tested with the
`program_impl::is_cacheable()` method. The SYCL RT will only try to insert
cacheable programs or kernels into the cache. This is done as a part of the
`ProgramManager::getOrCreateKernel()` method.
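
In spirit, the cacheability check boils down to the following sketch; the
structure and field names here are hypothetical, not the actual `program_impl`
members.
```C++
#include <string>

// Hypothetical description of a program; the names are illustrative only.
struct ProgramDesc {
  bool IsInterop;           // wraps a user-provided backend program
  bool IsFromDeviceImage;   // created from a device image, not from source
                            // and not by linking other programs
  std::string BuildOptions; // options passed at build time
};

// A program is cacheable when it is not an interop wrapper and was created
// from a device image with no build options.
bool isCacheable(const ProgramDesc &P) {
  return !P.IsInterop && P.IsFromDeviceImage && P.BuildOptions.empty();
}
```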


*NOTE:* a kernel is cacheable if and only if the program it belongs to is
cacheable. On the other hand, if the program is cacheable, then each and every
kernel of this program will be cached too.


All requests to build a program or to create a kernel - whether they originate
from explicit user API calls or from internal SYCL runtime execution logic - end
up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
with a number of lambda functions passed as arguments:
 - Acquire function;
 - GetCache function;
 - Build function.

The *Acquire* function returns a locked version of the cache. Locking is
employed for thread safety. Threads are blocked only for the insert-or-acquire
attempt, i.e. when calling `map::insert` in the [`getOrBuild`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
function. The rest of the operation is done with the help of atomics and
condition variables (plus a mutex for proper work of the condition variable).

The *GetCache* function returns a reference to the `key -> value` mapping out
of the locked instance of the cache. We will see the rationale behind this a
bit later.

The *Build* function actually builds the kernel or program.
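
Schematically, the decomposition looks like this (a simplified sketch; the
actual `getOrBuild` signature and body differ, see the sections below):
```C++
// A schematic sketch of the three-lambda decomposition used by getOrBuild.
template <typename AcquireFT, typename GetCacheFT, typename BuildFT>
auto getOrBuildSketch(AcquireFT &&Acquire, GetCacheFT &&GetCache,
                      BuildFT &&Build) {
  {
    auto LockedCache = Acquire();      // lock the per-context cache (RAII)
    auto &Map = GetCache(LockedCache); // key -> BuildResult mapping
    (void)Map; // the insert-or-find of the BuildResult stub happens here;
               // the lock is released when LockedCache goes out of scope
  }
  // The actual build runs outside of the global lock.
  return Build();
}
```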

Caching isn't done:
 - when the program is built out of source with the `program::build_with_source()`
 or `program::compile_with_source()` method;
 - when the program is a result of linking multiple programs.


### Thread-safety

Why do we need thread safety here? It is quite possible to have a use-case
where the `cl::sycl::context` is shared across multiple threads (e.g. via
sharing a queue). The possibility of enqueueing multiple cacheable kernels
simultaneously from multiple threads requires us to provide thread-safety for
the caching mechanisms.

It is worth noting that we don't cache the PI resource (kernel or program) by
itself. Instead, we augment the resource with the status of the build process.
Hence, what is cached is a wrapper structure `BuildResult` which contains three
information fields - a pointer to the built resource, the build error (if
applicable) and the current build status (one of "in progress", "succeeded",
"failed").

One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).
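
A simplified sketch of it, using the field names mentioned elsewhere in this
document, could look like the following; the template in
`kernel_program_cache.hpp` is the authoritative definition.
```C++
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <string>

enum class BuildState { InProgress, Done, Failed };

// Simplified sketch of the BuildResult wrapper around a built PI object.
template <typename T> struct BuildResult {
  std::atomic<T *> Ptr{nullptr}; // the built PI object, once available
  std::atomic<BuildState> State{BuildState::InProgress};
  // Error info saved by the building thread on failure (message + code).
  std::string ErrorMsg;
  int ErrorCode = 0;
  // Synchronization for the "single producer - multiple consumers" scheme;
  // unlike Ptr and State, these two cannot be atomic by nature.
  std::mutex MBuildResultMutex;
  std::condition_variable MBuildCV;
};
```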

The built resource access synchronization approach aims at minimizing the time
any thread holds the global lock guarding the maps, to improve performance. To
achieve that, the global lock is acquired only for the duration of the global
map access. The actual build of the program happens outside of the lock, so
other threads can request or build other programs in the meantime. A thread
requesting a `BuildResult` instance via `getOrBuild` can go one of three ways:
 A) The build result is **not** available, and it is the first thread to
 request it. The current thread will then execute the build, letting others
 wait for the result using the per-build-result condition variable kept in
 `BuildResult`'s `MBuildCV` field.
 B) The build result is **not** available, and another thread is already
 building the result. The current thread will then wait for the result using
 the `MBuildCV` condition variable.
 C) The build result **is** available. The thread simply takes it from the
 `Ptr` field without using any mutexes or condition variables.

As noted before, access to `BuildResult` instance fields may occur from
different threads simultaneously, but the global lock is no longer held. So, to
make it safe and to make sure only one thread builds the requested program, the
following is done:
 - the program build state is reflected in the `State` field; threads use a
 compare-and-swap technique to compete for who will do the build and become
 thread A. Threads taking way C will find "DONE" in this field and immediately
 return with the built result at hand.
 - thread A and thread(s) B use the `MBuildCV` condition variable field and
 the `MBuildResultMutex` mutex field guarding that variable to implement the
 "single producer - multiple consumers" scheme.
 - the build result itself appears in the `Ptr` field when available.
All fields are atomic because they can be accessed from multiple threads.
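
A distilled, runnable version of this single producer - multiple consumers
scheme, with hypothetical names and a plain `int *` standing in for the built
PI object, might look as follows:
```C++
#include <atomic>
#include <condition_variable>
#include <mutex>

enum class BuildState { InProgress, Done, Failed };

struct Result {
  std::atomic<int *> Ptr{nullptr};
  std::atomic<BuildState> State{BuildState::InProgress};
  std::mutex MBuildResultMutex;
  std::condition_variable MBuildCV;
};

// Thread A (the producer) publishes the object and wakes all waiters.
void produce(Result &R, int *Built) {
  R.Ptr.store(Built);
  R.State.store(BuildState::Done);
  std::lock_guard<std::mutex> Lock{R.MBuildResultMutex};
  R.MBuildCV.notify_all();
}

// Threads B (the consumers) wait until State leaves "InProgress"; the
// predicate is re-checked under the mutex, so no wakeup can be lost.
int *consume(Result &R) {
  std::unique_lock<std::mutex> Lock{R.MBuildResultMutex};
  R.MBuildCV.wait(Lock,
                  [&] { return R.State.load() != BuildState::InProgress; });
  return R.Ptr.load(); // nullptr when the build failed
}
```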

A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
use of this class implements RAII to make the code look a bit cleaner. The
GetCache function then returns the mapping to be employed; its key includes
the three components: kernel name, device, as well as any specialization
constants. These map to a `BuildResult` instance, which is cached. The
`BuildResult` structure is specialized with either `PiKernel` or
`PiProgram`<sup>[1](#remove-pointer)</sup>.
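
The idea behind `Locked` is to couple a reference to the guarded data with a
lock that is held for the wrapper's lifetime; a simplified sketch:
```C++
#include <functional>
#include <mutex>

// Simplified sketch: the lock is taken on construction and released when
// the Locked object goes out of scope, so the cache reference obtained via
// get() can only be used while the lock is held.
template <typename T> class Locked {
  std::reference_wrapper<T> MValue;
  std::unique_lock<std::mutex> MLock;

public:
  Locked(T &Value, std::mutex &Mtx) : MValue{Value}, MLock{Mtx} {}
  T &get() { return MValue.get(); }
};
```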


### Core of caching mechanism

Now, let us see how the `getOrBuild` function works.
First, we fetch the cache with sequential calls to the Acquire and GetCache
functions. Then, we check if this is the first attempt to build this kernel or
program. This is achieved with an attempt to insert another key-value pair into
the map. At this point we try to insert a `BuildResult` stub instance with the
status equal to "in progress", which will allow other threads to know that
someone (i.e. we) is building the object (i.e. kernel or program) now. If the
insertion fails, we will wait for the building thread to finish with a call to
the `waitUntilBuilt` function. This function will throw the stored
exception<sup>[2](#exception-data)</sup> upon build failure. This allows
waiting threads to see the same result as the building thread. A special case
of failure is when the build result doesn't contain the error (i.e. the error
wasn't of `cl::sycl::exception` type) and the pointer to the object in the
`BuildResult` instance is nil. In this case, the building thread has finished
the build process and has returned an error to the user. But this error may be
sporadic in nature, i.e. spurious. Hence, the waiting thread will try to build
the same object once more.
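
Putting the pieces together, the control flow of `getOrBuild` can be sketched
as follows. The helpers `insertStub`, `waitUntilBuilt`, `saveError` and
`notifyAllWaiters` are hypothetical names used for illustration; the real code
in `program_manager.cpp` is the authoritative version.
```C++
enum class BuildState { InProgress, Done, Failed };

// Sketch of the getOrBuild control flow; error handling details are
// compressed into comments.
template <typename KeyT, typename AcquireFT, typename GetCacheFT,
          typename BuildFT>
auto getOrBuildSketch(KeyT Key, AcquireFT Acquire, GetCacheFT GetCache,
                      BuildFT Build) {
  auto [BuildRes, Inserted] = [&] {
    auto LockedCache = Acquire(); // take the global cache lock
    auto &Map = GetCache(LockedCache);
    // Try to insert a BuildResult stub with State == InProgress; returns
    // the entry plus whether the insertion actually happened. The global
    // lock is released when LockedCache goes out of scope.
    return insertStub(Map, Key);
  }();

  if (!Inserted) {
    // Another thread is building (or has built) the object: wait on the
    // per-result condition variable. This rethrows the stored exception on
    // build failure and may retry the build when the failure looks
    // spurious (nil Ptr, no stored error).
    waitUntilBuilt(*BuildRes);
    return BuildRes->Ptr.load();
  }

  // We won the insertion race: build outside of the global lock and
  // publish the result (or the error) for the waiters.
  try {
    BuildRes->Ptr.store(Build());
    BuildRes->State.store(BuildState::Done);
    notifyAllWaiters(*BuildRes); // signal via MBuildCV
    return BuildRes->Ptr.load();
  } catch (...) {
    saveError(*BuildRes); // store the exception's message and error code
    BuildRes->State.store(BuildState::Failed);
    notifyAllWaiters(*BuildRes);
    throw; // the building thread reports the error to its caller too
  }
}
```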

The `BuildResult` structure also contains synchronization objects: a mutex and
a condition variable. We employ them to signal waiting threads that the build
process for this kernel/program is finished (either successfully or with a
failure).


<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted
here for the sake of simplicity.

<a name="exception-data">2</a>: Actually, we store the contents of the
exception: its message and error code.