
Commit d396e90

Specify the input, output, and running precision of the quantized operators
1 parent: ed19319

1 file changed (+11, -1 lines)

Diff for: prototype_source/pt2e_quant_ptq_x86_inductor.rst

@@ -179,7 +179,17 @@ subsequent pointwise operators, effectively minimizing memory usage and potentia
 ::
 
     with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
-        # Turn on Autocast to use int8-mixed-bf16 quantization
+        # Turn on Autocast to use int8-mixed-bf16 quantization. After lowering into the Inductor CPP backend:
+        # For operators such as QConvolution and QLinear:
+        # * The input data type is always int8, because a pair of quantization and
+        #   dequantization nodes is inserted at the input.
+        # * The computation precision remains int8.
+        # * The output data type may be either int8 or BFloat16, depending on whether a pair
+        #   of quantization and dequantization nodes is present at the output.
+        # For non-quantizable pointwise operators, the data type is inherited from the previous
+        # node, so it may be BFloat16 in this scenario.
+        # For quantizable pointwise operators such as QMaxpool2D, both the input and the output
+        # stay in int8.
         optimized_model = torch.compile(converted_model)
 
         # Running some benchmark
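The converted_model referenced in the diff comes from the PT2E post-training quantization flow described earlier in the tutorial. As a reminder of where this Autocast block fits, here is a minimal sketch of that flow, assuming the prototype APIs the tutorial uses (capture_pre_autograd_graph, X86InductorQuantizer, prepare_pt2e/convert_pt2e) and a torchvision ResNet-18 chosen purely for illustration; it is not part of this commit.

::

    import torch
    import torchvision.models as models
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

    # Example model and inputs (illustrative choice, not from this commit).
    model = models.resnet18().eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Export the eager model into an FX graph.
    exported_model = capture_pre_autograd_graph(model, example_inputs)

    # Configure the X86 Inductor quantizer with its default static quantization config.
    quantizer = xiq.X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

    # Insert observers, run a calibration pass, then convert to a quantized model.
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # use representative calibration data in practice
    converted_model = convert_pt2e(prepared_model)

    # int8-mixed-bf16 path: compile and run under CPU Autocast, as in the diff above.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)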
