@@ -165,11 +165,37 @@ After we get the quantized model, we will further lower it to the inductor backend.

::

-    optimized_model = torch.compile(converted_model)
+    with torch.no_grad():
+        optimized_model = torch.compile(converted_model)
+
+        # Running some benchmark
+        optimized_model(*example_inputs)

-    # Running some benchmark
-    optimized_model(*example_inputs)
+In a more advanced scenario, int8-mixed-bf16 quantization comes into play. In this case,
+a Convolution or GEMM operator produces a BFloat16 output data type instead of Float32 in the absence
+of a subsequent quantization node. The BFloat16 tensor then propagates seamlessly through
+subsequent pointwise operators, reducing memory usage and potentially improving performance.
+Using this feature is as simple as regular BFloat16 Autocast: just wrap the
+script within the BFloat16 Autocast context.
+
+::

+    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
+        # Turn on Autocast to use int8-mixed-bf16 quantization. After lowering into the Inductor CPP Backend:
+        # For operators such as QConvolution and QLinear:
+        # * The input data type is consistently defined as int8, attributable to the presence of a pair
+        #   of quantization and dequantization nodes inserted at the input.
+        # * The computation precision remains at int8.
+        # * The output data type may vary, being either int8 or BFloat16, contingent on the presence
+        #   of a pair of quantization and dequantization nodes at the output.
+        # For non-quantizable pointwise operators, the data type is inherited from the previous node,
+        # potentially resulting in a BFloat16 data type in this scenario.
+        # A quantizable pointwise operator such as QMaxpool2D continues to operate with the int8
+        # data type for both input and output.
+        optimized_model = torch.compile(converted_model)
+
+        # Running some benchmark
+        optimized_model(*example_inputs)

Put all these code snippets together, and we have the toy example code.
Please note that since the Inductor ``freeze`` feature is not turned on by default yet, you need to run your example code with ``TORCHINDUCTOR_FREEZING=1``.
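
For reference, below is a minimal sketch of what such a toy example might look like. It assumes a torchvision ``resnet18`` as the model, random data for calibration, and the ``capture_pre_autograd_graph`` capture entry point (the capture API has moved across PyTorch 2.x releases, so substitute the one matching your version); consult the full tutorial above for the exact model, ``example_inputs``, and calibration code.

::

    # toy_example.py -- a sketch, not the tutorial's verbatim code.
    # Run with the Inductor freeze feature enabled:
    #   TORCHINDUCTOR_FREEZING=1 python toy_example.py
    import torch
    import torchvision.models as models
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

    model = models.resnet18().eval()  # assumed model; any eval-mode CNN works
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # 1. Capture the FX graph to be quantized.
    exported_model = capture_pre_autograd_graph(model, example_inputs)

    # 2. Insert observers based on the default X86 Inductor quantization config.
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # 3. Calibrate; real code would loop over a calibration dataset here.
    prepared_model(*example_inputs)

    # 4. Convert the calibrated model into a quantized model.
    converted_model = convert_pt2e(prepared_model)

    # 5. Lower into the Inductor backend and run.
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)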