Quantize Weight for Gemm/Conv on Quantized Model #22969

centwang · 2024-11-28T10:20:40Z

Some quantized models have QDQ nodes around Conv/Gemm, but the weight and/or bias are left unquantized. This PR adds the WeightBiasQuantization optimizer, which quantizes a float weight to an INT8 tensor and a float bias to an INT32 tensor. It only applies when the weight and/or bias is an initializer, so that ConstantFolding can fold the inserted sub-graph into real quantized initializers in the next round of graph optimization.
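To make the INT8/INT32 split concrete, here is a minimal sketch of the arithmetic involved. The description does not say how scales are chosen, so this assumes symmetric per-tensor quantization for the weight (zero point 0) and the conventional bias scale `input_scale * weight_scale`. The function names are hypothetical and are not taken from the optimizer's source.

```python
def quantize_weight_int8(weights):
    """Symmetric per-tensor INT8 quantization (assumption: zero point is 0)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid a zero scale for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Python's round() rounds half to even, like ONNX QuantizeLinear.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def quantize_bias_int32(bias, input_scale, weight_scale):
    """Quantize bias to INT32 with scale = input_scale * weight_scale (assumption)."""
    bias_scale = input_scale * weight_scale
    lo, hi = -(2**31), 2**31 - 1
    q = [max(lo, min(hi, round(b / bias_scale))) for b in bias]
    return q, bias_scale


def dequantize(q, scale):
    """What a DequantizeLinear node with zero point 0 would compute."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w_q, w_scale = quantize_weight_int8([1.27, -0.5, 0.0])
    b_q, b_scale = quantize_bias_int32([1.0], input_scale=0.5, weight_scale=0.25)
    print(w_q, w_scale, b_q, b_scale)
```

Tying the bias scale to the product of the input and weight scales lets the INT32 bias be added directly to the integer accumulator of a quantized Conv/Gemm, which is why bias uses a wider integer type than the weight.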

onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.h

onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.cc

skottmckay

adrianlizarraga

Thank you!

…wnstream node is not QuantizeLinear (#24537)

### Description

Updates the WeightBiasQuantization optimizer to skip processing on Conv/Gemm nodes if the downstream child node is not a QuantizeLinear.

#### Before this PR

Original graph:

```
input_0 -> DQ -> Conv -> graph_output (or non-Q node)
                 ^  ^
                 |  |
weights_f32------+  |
bias_f32------------+
```

Becomes:

```
input_0 -> DQ ------> Conv -> graph_output (or non-Q node)
                      ^  ^
                      |  |
weights_quant -> DQ --+  |
bias_quant -> DQ --------+
```

The above is **NOT** a valid QDQ node unit for Conv because the Conv's output is not consumed by a QuantizeLinear node.

#### With this PR

The above example graph remains unchanged after L1 optimizations:

```
input_0 -> DQ -> Conv -> graph_output (or non-Q node)
                 ^  ^
                 |  |
weights_f32------+  |
bias_f32------------+
```

### Motivation and Context

Caused inaccuracy for a customer model. Automatically quantizing the weights and biases of a Conv/Gemm is detrimental if the output of the Conv/Gemm is not consumed by a QuantizeLinear node. In this scenario, the whole node group is not considered a valid QDQ node unit, and so the EP has to run the Conv/Gemm as float32/float16 anyway. If the Conv/Gemm is running as float32/float16, then quantizing the weights and biases introduces inaccuracy for no gain.

PR that originally added this optimizer: #22969
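The skip condition described in that follow-up can be sketched on a toy graph. The `Node` class and `should_quantize_weight_bias` below are hypothetical illustrations, not ONNX Runtime APIs. The sketch also assumes that every consumer of the Conv/Gemm output must be a QuantizeLinear, which may be stricter or looser than the real check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node (hypothetical; not the ONNX Runtime Node class)."""
    op_type: str
    consumers: list = field(default_factory=list)
    is_graph_output: bool = False  # True if the node's output is a graph output


def should_quantize_weight_bias(node):
    """Return True only if quantizing weight/bias can yield a valid QDQ node unit."""
    if node.op_type not in ("Conv", "Gemm"):
        return False
    # An output that leaves the graph, or has no consumers, cannot feed a Q node.
    if node.is_graph_output or not node.consumers:
        return False
    # Assumption: all downstream consumers must be QuantizeLinear.
    return all(c.op_type == "QuantizeLinear" for c in node.consumers)


if __name__ == "__main__":
    conv_to_q = Node("Conv", consumers=[Node("QuantizeLinear")])
    conv_to_output = Node("Conv", is_graph_output=True)
    print(should_quantize_weight_bias(conv_to_q))
    print(should_quantize_weight_bias(conv_to_output))
```

Under this check, the `Conv -> graph_output` example from the commit message is left in float, so no quantization error is introduced for a node that would run as float32/float16 anyway.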

quantize weight

9389bbe

centwang requested review from skottmckay, adrianlizarraga and jywu-msft November 28, 2024 10:20

centwang marked this pull request as ready for review November 28, 2024 10:20

centwang added 2 commits November 29, 2024 10:44

adjust ut scale and zp

aa00373

adjust ut data

3c170c3

skottmckay reviewed Dec 4, 2024

onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.h (outdated, resolved)

onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.cc (outdated, resolved)

centwang added 2 commits December 4, 2024 11:31

resolve comments

d8d1156

fix warn

e5b9b40

skottmckay approved these changes Dec 4, 2024

adrianlizarraga approved these changes Jan 8, 2025

centwang merged commit ff0ab0a into main Jan 8, 2025
95 checks passed

centwang deleted the weicwang/weight_quantization branch January 8, 2025 02:00

adrianlizarraga mentioned this pull request Apr 24, 2025

[QDQ Optimizer] Update WeightBiasQuantization to skip Conv/Gemm if downstream node is not QuantizeLinear #24537

Merged
