Commit ed19319

Add the example code for int8-mixed-bf16 quantization in X86Inductor Quantizer
1 parent f05f050 commit ed19319

File tree

1 file changed: +17 -3 lines changed

Diff for: prototype_source/pt2e_quant_ptq_x86_inductor.rst

@@ -165,11 +165,25 @@ After we get the quantized model, we will further lower it to the inductor backe
 
 ::
 
-    optimized_model = torch.compile(converted_model)
+    with torch.no_grad():
+        optimized_model = torch.compile(converted_model)
+
+        # Running some benchmark
+        optimized_model(*example_inputs)
+
+In a more advanced scenario, int8-mixed-bf16 quantization comes into play. In this instance,
+a Convolution or GEMM operator produces BFloat16 output data type instead of Float32 in the absence
+of a subsequent quantization node. Subsequently, the BFloat16 tensor seamlessly propagates through
+subsequent pointwise operators, effectively minimizing memory usage and potentially enhancing performance.
+
+::
 
-    # Running some benchmark
-    optimized_model(*example_inputs)
+    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
+        # Turn on Autocast to use int8-mixed-bf16 quantization
+        optimized_model = torch.compile(converted_model)
 
+        # Running some benchmark
+        optimized_model(*example_inputs)
 
 Put all these codes together, we will have the toy example code.
 Please note that since the Inductor ``freeze`` feature does not turn on by default yet, run your example code with ``TORCHINDUCTOR_FREEZING=1``.
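
For readers following along, below is a minimal end-to-end sketch of how the int8-mixed-bf16 block added in this commit might fit into the full toy example. It assumes the export/prepare/convert steps described earlier in the tutorial; the tiny ``Conv2d`` model and the input shape here are placeholders rather than part of this commit, and only the final autocast block corresponds to the newly added code.

::

    # Hedged sketch: the small Conv2d model and example_inputs are placeholders;
    # the capture/prepare/convert steps come from earlier sections of the tutorial.
    # Run the script with TORCHINDUCTOR_FREEZING=1, as noted at the end of the diff.
    import torch
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

    # Placeholder eager-mode model and calibration inputs
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()
    ).eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Export, quantize with the X86Inductor quantizer, calibrate, and convert
    exported_model = capture_pre_autograd_graph(model, example_inputs)
    quantizer = xiq.X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration pass
    converted_model = convert_pt2e(prepared_model)

    # int8-mixed-bf16 path added in this commit: compile and run under CPU autocast
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
        optimized_model = torch.compile(converted_model)

        # Running some benchmark
        optimized_model(*example_inputs)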
