Commit c0ee5c2: Merge pull request tensorflow#2 from gunan/master
Accepted RFC: TF Dynamic kernel design

rfcs/20180604-dynamic-kernels.md

# Dynamic Loading of Kernels in TensorFlow

| Status        | Proposed                                           |
| :------------ | :------------------------------------------------- |
| **Author(s)** | Gunhan Gulsoy (Google)                             |
| **Sponsor**   | Martin Wicke (Google)                              |
| **Updated**   | 2018-06-04                                         |

## Objective
This document describes a new way to create and deploy kernels for
TensorFlow. We propose deploying kernels in separate shared libraries (DSO,
dylib, or DLL) and loading these at runtime. While at the moment the scope of
this document only covers the **TensorFlow Python distribution**, we aim to
generalize this approach for all TF distributions. With this mechanism, we
would like to create the following capabilities:
* Loading kernels dynamically at runtime from shared libraries.
* Being able to load multiple kernels for the same op/device pair, and picking the
  best one in terms of hardware compatibility and performance.
* Checking the hardware and loading only the compatible kernels.
* Checking the compiler options used and loading only the compatible kernels.

## Overview
For an Op, we need three pieces:
* Python bindings, to make it accessible in the Python API
* C++ op implementation
* C++ kernel implementation(s)

This document proposes a new way in which **kernels** can be deployed and loaded.

In the current mechanism, the only constraint is that Python bindings have to be
executed/loaded after the C++ op implementation is loaded. Kernels can be loaded at
any time. This makes our task easier. When a kernel is loaded, it registers
itself in the global registry with a string key. The string key is constructed
as follows: `op_name:device_name:(optional)label`

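For example, a default (unlabeled) CPU kernel for an op named `MatMul` would end up under a key along the lines of `MatMul:CPU:`; the example key is illustrative, and the exact formatting is an internal detail of the kernel registry.
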
To start this project off, what we propose is the following:
* Create a new API, `tf.load_kernel_library`.
* Use the new API to load kernels from a different shared object.

Then, we will start to build checks to be more picky about the kernels we load:
* Build handling for loading multiple kernels for the same op and device pair.
* Enhance the global kernel registry to allow cleanup of registered kernels when a
  library is unloaded.
* Build the library compatibility checking mechanism, and unload libraries when
  they are found to be incompatible.

Finally, we will add the following advanced checks:
* Keep track of which libraries provide which kernels.
* Garbage collection of unqualified kernels and their libraries.

## Detailed Current State
While this document proposes a new way to **load kernels**, there are a number of
ideas we would like to adopt from the way ops are loaded. Therefore, the current
op loading mechanism is also described in this section.

### Op loading
Currently, we can load op libraries from shared objects. When loading custom or
contrib ops, we also load their kernels. The following steps describe how
the current custom/contrib op loading mechanism works:
* Custom/contrib op Python bindings are not loaded until they are accessed.
* At the first access, the `__init__` file of the custom op module calls `tf.load_op_library`.
* `load_op_library` loads the shared object using `TF_LoadLibrary` in the C API.
* Once the shared object is loaded, `load_op_library` executes and loads the rest of the Python code in the op library.

Now, diving deeper into `TF_LoadLibrary`:
* `TF_LoadLibrary` is called. This is just a thin wrapper and status checker around `tensorflow::LoadLibrary`.
* `tensorflow::LoadLibrary` first checks whether this shared object is already loaded.
* In a serial way, making sure only one library is processed at a time:
  * It starts a watcher for `OpRegistry`, to get a list of ops included in the library.
  * It tries loading the library using `Environment::LoadLibrary`,
    * which just calls `tensorflow::internal::LoadLibrary`,
    * which is essentially just `dlopen`.

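The sketch below is a simplified illustration of that flow, not the actual implementation; the cache, its locking, and the placement of the registry watcher are assumptions, and only `dlopen` and the overall sequence come from the description above.

```c++
#include <dlfcn.h>

#include <mutex>
#include <string>
#include <unordered_map>

// Simplified sketch of the library-loading flow described above. The real code
// also records the ops that the library registers (via an OpRegistry watcher)
// so that Python bindings can be generated for them.
static std::mutex mu;
static std::unordered_map<std::string, void*> loaded_libs;

// Returns the native handle for `filename`, loading the shared object at most once.
void* LoadOpLibrarySketch(const std::string& filename) {
  std::lock_guard<std::mutex> lock(mu);  // Only one library is processed at a time.

  // 1. Skip the load if this shared object was already processed.
  auto it = loaded_libs.find(filename);
  if (it != loaded_libs.end()) return it->second;

  // 2. An OpRegistry watcher would be installed here, to capture the REGISTER_OP
  //    calls that run as a side effect of loading the library.

  // 3. The platform layer ultimately boils down to dlopen on Linux.
  void* handle = dlopen(filename.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle != nullptr) loaded_libs[filename] = handle;
  return handle;
}
```
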
### Kernel loading
Currently, the kernel loading mechanism is simpler than the op loading mechanism, at least at loading time. The mechanism can be summarized as follows:
* Kernels use the `REGISTER_KERNEL_BUILDER` macro to create a static initializer.
* The static initializer is just an object of type `OpKernelRegistrar`,
  * which calls `OpKernelRegistrar::InitInternal`,
  * which saves the kernel in the `GlobalKernelRegistry`, with a factory method.
* The kernel is read from the registry and instantiated when the op is executed.

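For reference, kernel registration today looks roughly like the following; the op name `MyRelu` and the kernel body are made up for illustration, but `REGISTER_KERNEL_BUILDER` and the `OpKernel` interface are the existing mechanism this document builds on.

```c++
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

// A minimal CPU kernel. Constructing the static OpKernelRegistrar emitted by
// REGISTER_KERNEL_BUILDER below is what adds the (op, device, label) entry to
// the global kernel registry at load time.
class MyReluOp : public OpKernel {
 public:
  explicit MyReluOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const Tensor& input = ctx->input(0);
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));
    auto in = input.flat<float>();
    auto out = output->flat<float>();
    for (int i = 0; i < in.size(); ++i) out(i) = in(i) > 0.f ? in(i) : 0.f;
  }
};

// Registered under a key roughly equivalent to "MyRelu:CPU:" (empty label).
REGISTER_KERNEL_BUILDER(Name("MyRelu").Device(DEVICE_CPU), MyReluOp);
```

Today such a translation unit is linked into the monolithic TensorFlow binary; the design below moves it into a separate `libtfkernel-*` shared object that is loaded on demand.
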
## Design
Here we will describe the details of the work we plan to perform. The work will be divided into four milestones:

### Milestone 1: Load kernels from shared objects
This phase will just be a simple proof of concept, to show that loading kernels
from shared objects will work. The deliverables of this phase are:
1. `tf.load_kernel_library` API. This new method in our API will be responsible
   for loading kernels from given shared objects, or folders containing shared
   objects. It will:
   * Load the given shared object, if it is an `.so` file.
   * If a folder is given, load all `libtfkernel-*` shared object files in the folder.
2. Split one or more kernels into a different shared object. This will involve:
   * Resolving the `BUILD` dependency mess to be able to create a reasonably small
     shared object for a kernel (size will be optimized later).
   * Resolving all symbol collisions stemming from the different shared objects,
     which potentially both depend on the core TF framework.
   * Finally, on the Python side of the op whose kernel is being split out, adding
     the directive: `tf.load_kernel_library("libtfkernel_kernel_name.so")`
3. Get a bazel test to pass with a split kernel library.
4. Get a working Python wheel file with a split kernel library, and run the
   kernel from the shared object.

To simplify the proof of concept, at this stage we will only do this on Linux.

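As a rough sketch of what the native side of this API could do (the helper name, the directory scan, and the filename prefix check are assumptions; only `TF_LoadLibrary` and `TF_Status` are existing C API calls):

```c++
#include <dirent.h>
#include <string.h>

#include <string>
#include <vector>

#include "tensorflow/c/c_api.h"

// Hypothetical helper: load a single kernel library, or every libtfkernel-*
// shared object inside a folder, through the existing C API entry point.
std::vector<TF_Library*> LoadKernelLibraries(const std::string& path) {
  std::vector<std::string> files;

  DIR* dir = opendir(path.c_str());
  if (dir == nullptr) {
    files.push_back(path);  // Not a folder: treat the path as a single .so file.
  } else {
    while (dirent* entry = readdir(dir)) {
      if (strncmp(entry->d_name, "libtfkernel", 11) == 0) {
        files.push_back(path + "/" + entry->d_name);
      }
    }
    closedir(dir);
  }

  std::vector<TF_Library*> handles;
  for (const std::string& file : files) {
    TF_Status* status = TF_NewStatus();
    TF_Library* lib = TF_LoadLibrary(file.c_str(), status);
    if (TF_GetCode(status) == TF_OK) handles.push_back(lib);
    TF_DeleteStatus(status);
  }
  return handles;
}
```
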
### Milestone 2: Enable kernel compatibility checks
Once the proof of concept is ready, we need to start building the fancier
features of the proposal. These will be:
1. Create a mechanism to save the compiler options on the bazel side, and make
   them available to read in the C++ runtime.
2. Create a mechanism, in addition to `KernelDef`, to be stored in the
   `GlobalKernelRegistry` to help decide which kernels should be loaded. The
   following is the data structure we propose for this information:
```c
typedef struct TF_DsoDef {
  const char* name;
  const char* version;
} TF_DsoDef;

typedef struct TF_HardwareDef {
  const char** SIMD_ISA;  // Or enum
  int SIMD_ISA_length;
  char* cpu_arch;
  const char** accelerator;
  int accelerator_length;
} TF_HardwareDef;

typedef struct TF_CompilerDef {
  const char* compiler;
  const char* compiler_version;
  const char** compiler_options;
  int compiler_options_length;
  int memory_alignment;
} TF_CompilerDef;

typedef struct TF_SourceDef {
  const char* git_hash;
} TF_SourceDef;

typedef struct TF_KernelBuildInfo {
  TF_DsoDef* dependencies;
  int dependencies_list_size;

  TF_SourceDef source_version;
  TF_HardwareDef hardware_def;
  TF_CompilerDef compiler_def;
} TF_KernelBuildInfo;
```
3. Create methods to extract all of the above information from the core runtime,
   to check for compatibility with any given kernel library.
4. During kernel registration, implement checks for the following:
   * Is this kernel compatible with the given hardware?
   * Is this kernel compatible with the software available on the system?
   * Is this kernel ABI compatible with the core runtime?
   * Is this kernel faster than any other kernels that are loaded? In this context, faster means one of the following:
     * Better optimized for the hardware
     * Uses a special acceleration library such as MKL
5. Provide means to override some of the above checks for loading experimental kernels.
6. Expand the global kernel registry to be functionally similar to the op registry. The op registry can unregister ops if there are any problems during library loading; the kernel registry should be able to do the same.

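To make item 4 concrete, a compatibility check over `TF_KernelBuildInfo` might look roughly like the sketch below, building on the structs defined above; the function names and the exact-match policy are assumptions, not part of the proposal.

```c++
#include <string.h>

// Hypothetical helper: does the host hardware description include a given SIMD ISA?
static bool HasISA(const TF_HardwareDef* hw, const char* isa) {
  for (int i = 0; i < hw->SIMD_ISA_length; ++i) {
    if (strcmp(hw->SIMD_ISA[i], isa) == 0) return true;
  }
  return false;
}

// Hypothetical policy sketch: decide whether a kernel library built with the
// given TF_KernelBuildInfo may register its kernels against this runtime.
bool KernelIsCompatible(const TF_KernelBuildInfo* runtime,
                        const TF_KernelBuildInfo* kernel) {
  // ABI check: require the same compiler and memory alignment as the core runtime.
  if (strcmp(runtime->compiler_def.compiler, kernel->compiler_def.compiler) != 0 ||
      runtime->compiler_def.memory_alignment != kernel->compiler_def.memory_alignment) {
    return false;
  }
  // Hardware check: every SIMD ISA the kernel was built for must be available
  // on the host that the runtime detected.
  for (int i = 0; i < kernel->hardware_def.SIMD_ISA_length; ++i) {
    if (!HasISA(&runtime->hardware_def, kernel->hardware_def.SIMD_ISA[i])) return false;
  }
  return true;
}
```
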
### Milestone 3: Make it work on different OSs
While the above will be done on Linux, we will have to get things to work on all operating systems we support. For macOS, the issues are mainly around bazel bugs. For Windows, we will have to be more careful about symbol collisions, and a partial lockdown of symbol exports may be required to get things working.

### Milestone 4: Memory and performance optimizations
When we load multiple shared objects, we can easily incur bloat in memory
usage, or performance hits. The simplest issues we can foresee are:
1. Multiple kernel registry entries are retained when multiple kernels for
   the same op and device pair are loaded.
2. Some shared objects may only include slow kernels, and may be included in
   the distribution only for compatibility. We can unload shared objects
   from memory if none of the kernels in them are useful.
3. Minimize the total size of the shared libraries created. Currently, the TF
   framework is one big monolithic build rule that everyone ends up depending on.
   Try to slim down the kernels, and get them to a size that makes sense to be
   included in TF Lite packages.
4. Make sure there are only kernels in the given shared object. Error out if
   someone sneaks ops into kernel libraries.

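As a sketch of the bookkeeping that the unloading in item 2 implies (all names here are hypothetical; the current registry has no notion of which library a kernel came from):

```c++
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical bookkeeping: remember which registry keys each shared object
// contributed, so the whole library can be unloaded if none of its kernels
// are ever selected.
class KernelLibraryTracker {
 public:
  void RecordRegistration(void* lib_handle, const std::string& kernel_key) {
    kernels_by_lib_[lib_handle].push_back(kernel_key);
  }

  // Returns the keys to drop from the global kernel registry before dlclose.
  std::vector<std::string> TakeKernelsForUnload(void* lib_handle) {
    std::vector<std::string> keys = std::move(kernels_by_lib_[lib_handle]);
    kernels_by_lib_.erase(lib_handle);
    return keys;
  }

 private:
  std::unordered_map<void*, std::vector<std::string>> kernels_by_lib_;
};
```
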
## Alternatives considered
A number of alternatives have been considered before deciding on this route:
1. Create and distribute the whole package with different compiler options.
   While this is the path of least resistance, the monolithic package that needs
   to be tested fully on different hardware and compiler options is becoming
   unmanageable. The simplest example is that we have a lot of code that needs to be
   tested with GPU compilers only once, but we end up having to run similar tests
   with 5+ different compiler options. Such issues drive up our testing costs in
   terms of both resources and developer time.
2. Split kernels into different binaries rather than different shared
   objects. While this would protect us from symbol collisions, ODR violations, and
   other classical headaches that plague shared objects, it would make things
   slower. Also, we would need to implement shared memory pages to share data
   across different processes, which would incur an engineering cost similar to the
   proposed approach. Therefore, we decided on using shared libraries instead.
