
Commit 071e073

Xia-Weiwen and svekars authored

Add dynamic quant to the tutorial of PT2E quantization with X86Inductor (#2819)

* Add dynamic quant to the tutorial of PT2E quantization with X86Inductor --------- Co-authored-by: Svetlana Karslioglu <[email protected]>

1 parent f2e2a6d commit 071e073

File tree

1 file changed: +14 −2 lines changed

prototype_source/pt2e_quant_x86_inductor.rst

Lines changed: 14 additions & 2 deletions
@@ -21,7 +21,10 @@ The pytorch 2 export quantization flow uses the torch.export to capture the model
 This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
 TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
 
-This flow of quantization 2 with Inductor mainly includes three steps:
+This quantization flow with Inductor supports both static and dynamic quantization. Static quantization works best for CNN models, such as ResNet-50, while dynamic quantization is more suitable for NLP models, such as RNN and BERT.
+For the difference between the two quantization types, please refer to the `following page <https://pytorch.org/docs/stable/quantization.html#quantization-mode-support>`__.
+
+The quantization flow mainly includes three steps:
 
 - Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
 - Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
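
To make the three steps concrete, here is a minimal end-to-end sketch of the static PTQ path this tutorial builds up, assuming a torchvision ResNet-50 and the PT2E APIs ``torch.export.export``, ``prepare_pt2e``, and ``convert_pt2e``. The exact graph-capture API has varied across PyTorch releases, so treat this as an illustration rather than the tutorial's verbatim code.

.. code-block:: python

    import torch
    import torchvision.models as models
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

    model = models.resnet50().eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Step 1: capture the FX Graph from the eager model
    exported_model = torch.export.export(model, example_inputs).module()

    # Step 2: apply the quantization flow with the backend-specific quantizer
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration (static quantization only)
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower the quantized model into Inductor
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)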
@@ -134,14 +137,22 @@ quantize the model.
 `multiplications are 7-bit x 8-bit <https://oneapi-src.github.io/oneDNN/dev_guide_int8_computations.html#inputs-of-mixed-type-u8-and-s8>`_. In other words, potential
 numeric saturation and accuracy issues may happen when running on CPU without Vector Neural Network Instruction.
 
+The quantization config is for static quantization by default. To apply dynamic quantization, add the argument ``is_dynamic=True`` when getting the config.
+
+.. code-block:: python
+
+    quantizer = X86InductorQuantizer()
+    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_dynamic=True))
+
 After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
 ``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.
 
 ::
 
     prepared_model = prepare_pt2e(exported_model, quantizer)
 
-Now, we will calibrate the ``prepared_model`` after the observers are inserted in the model.
+Now, we will calibrate the ``prepared_model`` after the observers are inserted in the model. This step is needed for static quantization only.
 
 ::
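
Putting the two additions above together, here is a short sketch of how the static and dynamic PTQ paths diverge. ``exported_model`` is the captured graph from the earlier step, and ``calibration_data`` is a hypothetical iterable of sample input batches.

.. code-block:: python

    is_dynamic = True  # flip to False for static quantization

    quantizer = X86InductorQuantizer()
    quantizer.set_global(
        xiq.get_default_x86_inductor_quantization_config(is_dynamic=is_dynamic)
    )
    prepared_model = prepare_pt2e(exported_model, quantizer)

    if not is_dynamic:
        # Calibration is needed for static quantization only: feed sample
        # batches so the observers can record activation ranges.
        for inputs in calibration_data:
            prepared_model(*inputs)

    converted_model = convert_pt2e(prepared_model)

Dynamic quantization computes activation quantization parameters at runtime from each input, which is why the offline calibration pass applies to the static path only.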

@@ -268,6 +279,7 @@ The PyTorch 2 Export QAT flow is largely similar to the PTQ flow:
 
     # Step 2. quantization-aware training
     # Use Backend Quantizer for X86 CPU
+    # To apply dynamic quantization, add an argument ``is_dynamic=True`` when getting the config.
     quantizer = X86InductorQuantizer()
     quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_qat=True))
     prepared_model = prepare_qat_pt2e(exported_model, quantizer)
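
For context, here is a sketch of the surrounding QAT flow this snippet belongs to, assuming ``exported_model`` from the capture step; ``train_loader``, ``loss_fn``, and ``optimizer`` are hypothetical placeholders. Per the comment added in this commit, ``is_dynamic=True`` can be passed alongside ``is_qat=True``.

.. code-block:: python

    from torch.ao.quantization.quantize_pt2e import prepare_qat_pt2e, convert_pt2e

    quantizer = X86InductorQuantizer()
    # Pass is_dynamic=True here as well to make the QAT config dynamic.
    quantizer.set_global(
        xiq.get_default_x86_inductor_quantization_config(is_qat=True)
    )
    prepared_model = prepare_qat_pt2e(exported_model, quantizer)

    # Fine-tune with fake-quant ops inserted (placeholder training loop).
    for inputs, target in train_loader:
        loss = loss_fn(prepared_model(inputs), target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    converted_model = convert_pt2e(prepared_model)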
