# A brief overview of kernel and program caching mechanism.

## Rationale behind caching

During SYCL program execution the SYCL runtime will create internal objects
representing kernels and programs; it may also invoke the JIT compiler to bring
kernels in a program to an executable state. Those runtime operations are quite
expensive, and in some cases a caching approach can be employed to eliminate
redundant kernel or program object re-creation and online recompilation. A few
examples below illustrate scenarios where such optimization is possible.

*Use-case #1.* Submission of the same kernel in a loop:
```C++
using namespace cl::sycl;

// ... (snippet abbreviated: a loop repeatedly submits the same kernel)
}
```

*Use-case #2.* Submission of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
using namespace cl::sycl;

// ... (snippet abbreviated: several kernels of one program are submitted)
});
```

In both cases the SYCL runtime will need to build the program and kernels
multiple times, which may involve JIT compilation and take quite a lot of time.

In order to eliminate this waste of run time we introduce kernel and program
caching. The cache is per-context and it caches underlying objects of
non-interop kernels and programs which are built with no options.

<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
object corresponding to a SPIR-V module or native binary defining a set of SYCL
kernels and/or device functions.

## Data structure of cache

The cache stores the underlying PI objects behind `cl::sycl::program` and
`cl::sycl::kernel` user-level objects in a per-context data storage. The storage
consists of two maps: one is for programs and the other is for kernels.

The programs map's key consists of three components: kernel set id<sup>[1](#what-is-ksid)</sup>,
specialization constants, and the device this program is built for.

The kernels map's key consists of three components too: the program the kernel
belongs to, kernel name<sup>[2](#what-is-kname)</sup>, and the device the
program is built for.

<a name="what-is-ksid">1</a>: Kernel set id is an ordinal number of the device
binary image the kernel is contained in.

<a name="what-is-kname">2</a>: Kernel name is the mangled name of the kernel ID
class which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for`
or `single_task`).

## Implementation details

The caches are represented with an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The runtime creates one instance of the class per distinct SYCL context
(a context object which is a result of copying another context object isn't
"distinct", as it corresponds to the same underlying internal object
representing a context).

The `KernelProgramCache` is essentially a pair of maps as described above.

### When does the cache come into play?

The cache is used when one submits a kernel for execution or builds a program or
kernel with SYCL API. That means that the cache works when either the user
explicitly calls the `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
methods or SYCL RT builds a program or gets the required kernel as needed during
application execution. Cacheability of an object can be tested with the
`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
programs or kernels into the cache. This is done as a part of the
`ProgramManager::getOrCreateKernel()` method.

On the other hand, if the program is cacheable, then each and every kernel of
this program will be cached also.

All requests to build a program or to create a kernel, whether they originate
from explicit user API calls or from internal SYCL runtime execution logic, end
up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
with a number of lambda functions passed as arguments:
- Acquire function;
- GetCache function;
- Build function.

*Build* function actually builds the kernel or program.

Caching isn't done:
- when a program is built from source with the `program::build_with_source()` or
  `program::compile_with_source()` method;
- when a program is a result of linking multiple programs.

### Thread-safety

Why do we need thread safety here? It is quite possible to have a use-case where
the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a
queue). The possibility of enqueueing multiple cacheable kernels simultaneously
from multiple threads requires us to provide thread-safety for the caching
mechanisms.

It is worth noting that we don't cache the PI resource (kernel or program) by
itself. Instead we augment the resource with the status of the build process.
Hence, what is cached is a wrapper structure `BuildResult` which contains three
information fields: a pointer to the built resource, a build error (if
applicable) and the current build status (one of "in progress", "succeeded",
"failed").

One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).

The pointer to the built resource and the build result are both atomic
variables. Atomicity of these variables allows one to hold the lock on the cache
for quite a short time and perform the rest of the build/wait process without
forcing other threads to wait on lock availability.

A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
use of this class implements RAII to make the code look a bit cleaner. Now, the
GetCache function will return the mapping to be employed, which includes the
three components: kernel name, device, as well as any specialization constants.
These get added to `BuildResult` and are cached. The `BuildResult` structure is
specialized with either `PiKernel` or `PiProgram`<sup>[1](#remove-pointer)</sup>.

### Core of caching mechanism

Now, let us see how the `getOrBuild` function works.
First, we fetch the cache with sequential calls to the Acquire and GetCache
functions. Then, we check if this is the first attempt to build this kernel or
program. This is achieved with an attempt to insert another key-value pair into
the map. At this point we try to insert a `BuildResult` stub instance with
status equal to "in progress", which will allow other threads to know that
someone (i.e. we) is building the object (i.e. kernel or program) now. If the
insertion fails, we will wait for the building thread to finish with a call to
the `waitUntilBuilt` function. This function will throw the stored
exception<sup>[2](#exception-data)</sup> upon build failure. This allows waiting
threads to see the same result as the building thread. A special case of the
failure is when the build result doesn't contain the error (i.e. the error
wasn't of `cl::sycl::exception` type) and the pointer to the object in the
`BuildResult` instance is nil. In this case, the building thread has finished
the build process and has returned an error to the user. But this error may be
sporadic in nature and may be spurious. Hence, the waiting thread will try to
build the same object once more.

The `BuildResult` structure also contains synchronization objects: a mutex and a
condition variable. We employ them to signal waiting threads that the build
process for this kernel/program is finished (either successfully or with a
failure).

<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted for
the sake of simplicity here.

<a name="exception-data">2</a>: Actually, we store the contents of the exception: