Commit ceed926

Add usage of int8-mixed-bf16 quantization with X86InductorQuantizer (#2668)
* Add example code for int8-mixed-bf16 quantization with X86InductorQuantizer
* Specify the input, output, and computation precision of the quantized operators
* Highlight that the usage is the same as regular BF16 Autocast
1 parent 56c7b4e commit ceed926


Diff for: prototype_source/pt2e_quant_ptq_x86_inductor.rst (1 file changed: +29, -3)

@@ -165,11 +165,37 @@ After we get the quantized model, we will further lower it to the inductor backe

::

-    optimized_model = torch.compile(converted_model)
+    with torch.no_grad():
+        optimized_model = torch.compile(converted_model)
+
+        # Running some benchmark
+        optimized_model(*example_inputs)

-    # Running some benchmark
-    optimized_model(*example_inputs)
+In a more advanced scenario, int8-mixed-bf16 quantization comes into play. Here, a Convolution
+or GEMM operator produces a BFloat16 output instead of Float32 when it is not followed by a
+quantization node. The BFloat16 tensor then propagates through the subsequent pointwise
+operators, reducing memory usage and potentially improving performance. Using this feature
+is as simple as with regular BFloat16 Autocast: just wrap the script in the BFloat16
+Autocast context.
+
+::

+    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
+        # Turn on Autocast to use int8-mixed-bf16 quantization. After lowering into the Inductor CPP backend,
+        # operators such as QConvolution and QLinear behave as follows:
+        # * The input data type is always int8, because a pair of quantization and
+        #   dequantization nodes is inserted at the input.
+        # * The computation precision remains int8.
+        # * The output data type may be either int8 or BFloat16, depending on whether a pair of
+        #   quantization and dequantization nodes is present at the output.
+        # Non-quantizable pointwise operators inherit their data type from the previous node,
+        # so they may run in BFloat16 in this scenario.
+        # Quantizable pointwise operators such as QMaxpool2D keep the int8 data type
+        # for both input and output.
+        optimized_model = torch.compile(converted_model)
+
+        # Running some benchmark
+        optimized_model(*example_inputs)

Putting all of this code together, we get the toy example code.
Please note that since the Inductor ``freeze`` feature does not turn on by default yet, run your example code with ``TORCHINDUCTOR_FREEZING=1``.
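
For readers following along, the sketch below shows one way the autocast-wrapped compilation added in this commit could slot into the rest of the tutorial's PT2E flow. It is a minimal, illustrative sketch rather than part of this commit, and it assumes the APIs used elsewhere in the tutorial at the time of this change (``capture_pre_autograd_graph``, ``prepare_pt2e``/``convert_pt2e``, and ``X86InductorQuantizer`` with ``get_default_x86_inductor_quantization_config``), with a torchvision ResNet-18 standing in for your model::

    import torch
    import torchvision.models as models
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
        X86InductorQuantizer,
        get_default_x86_inductor_quantization_config,
    )

    # Hypothetical stand-in model and inputs; any eval-mode FP32 model works here.
    model = models.resnet18().eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Capture the FX graph in pre-autograd ATen IR for PT2E quantization.
    exported_model = capture_pre_autograd_graph(model, example_inputs)

    # Configure the X86 Inductor quantizer with its default PTQ config.
    quantizer = X86InductorQuantizer()
    quantizer.set_global(get_default_x86_inductor_quantization_config())

    # Insert observers, calibrate with representative data, then convert.
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration
    converted_model = convert_pt2e(prepared_model)

    # int8-mixed-bf16: compile and run under BFloat16 Autocast, as in the diff above.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)

As noted above, run the script with ``TORCHINDUCTOR_FREEZING=1`` set in the environment, since the Inductor ``freeze`` feature is not turned on by default.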
