# [Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale (microsoft#22020)
### Description
Fixes a scenario in which a bias input quantized to int32 has a scale that
is too small. A bias with a scale below a certain threshold overflows the
range of an `int32` when quantized, which significantly decreases accuracy.

Credit to @yihonglyu for finding this issue and the fix.
### Motivation and Context
Consider a Conv node with very small weights and a
constant bias input of `[5, -4.5]`.

The QDQ quantizer first computes the following quantization scales for
`input_0` and `weight`:
- `input_0`: scale=0.5
- `weight`: scale=7.843e-11 **[really small]**

The QDQ quantizer then computes the bias input's scale as follows:
```
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-11 = 3.9215686274509805e-11
```
This `bias_scale` is too small. Before this PR, the QDQ quantizer would
quantize the f32 bias with this `bias_scale`:
```
bias_quant = round(bias_f32 / bias_scale)
           = round([5.0 / bias_scale, -4.5 / bias_scale])
           = [127500000000, -114750000000]
```
These quantized bias values exceed the range of int32, so they are
clipped to `[int32_min, int32_max]`, which severely degrades accuracy.
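
A minimal numpy sketch of this failure mode (the values mirror the example above; this stands in for the quantizer's internal clipping, not the actual onnxruntime code):

```python
import numpy as np

input_0_scale = 0.5
weight_0_scale = 7.843e-11          # "really small" weight scale
bias_f32 = np.array([5.0, -4.5], dtype=np.float32)

bias_scale = input_0_scale * weight_0_scale      # ~3.92e-11
bias_quant = np.round(bias_f32 / bias_scale)     # ~[1.275e11, -1.148e11]

# Both values exceed the int32 range, so they are clipped:
i32 = np.iinfo(np.int32)
bias_clipped = np.clip(bias_quant, i32.min, i32.max).astype(np.int32)

# Dequantizing shows how much accuracy was lost:
print(bias_clipped * bias_scale)  # ~[0.0842, -0.0842] instead of [5.0, -4.5]
```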
#### New approach
This PR increases the `weight_0_scale` by the necessary amount to ensure
that `bias_scale` (which equals `weight_0_scale * input_0_scale`) is
appropriate for the int32 quantization type.
The smallest valid bias scale is given by the normal scale formula:
```
bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)
```
Then, we compute the candidate bias scale:
`bias_scale_candidate = input_0_scale * weight_0_scale`
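Plugging in the example values: `bias_smallest_valid_scale = (5.0 - (-4.5)) / 4294967295 ≈ 2.212e-9`, while `bias_scale_candidate ≈ 3.922e-11`, roughly 56x too small.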
If the candidate scale is smaller than the smallest valid scale, we
increase the `weight_0_scale` by the necessary ratio:
```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```
Then, we recompute the final bias scale:
```python
bias_scale = input_0_scale * weight_0_scale
```
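
Putting the pieces together, here is a minimal numpy sketch of the adjustment (names mirror the formulas above; the actual quantizer also accounts for the bias zero point and quantization mode):

```python
import numpy as np

def adjust_weight_scale(input_0_scale, weight_0_scale, bias_f32):
    """Grow weight_0_scale so the derived bias scale fits int32. Sketch only."""
    int32_range = float(np.iinfo(np.int32).max) - float(np.iinfo(np.int32).min)
    bias_smallest_valid_scale = (bias_f32.max() - bias_f32.min()) / int32_range

    bias_scale_candidate = input_0_scale * weight_0_scale
    if bias_scale_candidate < bias_smallest_valid_scale:
        ratio = bias_smallest_valid_scale / bias_scale_candidate
        weight_0_scale = ratio * weight_0_scale

    # Recompute the final bias scale from the (possibly adjusted) weight scale.
    bias_scale = input_0_scale * weight_0_scale
    return weight_0_scale, bias_scale

bias_f32 = np.array([5.0, -4.5], dtype=np.float32)
weight_0_scale, bias_scale = adjust_weight_scale(0.5, 7.843e-11, bias_f32)
# bias_scale is now ~2.21e-9 instead of ~3.92e-11, so round(bias / bias_scale)
# stays on the order of 1e9 rather than 1e11.
```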
#### Impact on accuracy
Here's the above model's quantized output compared to the f32
(ground-truth) output.
- Before PR:
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **0.075**
  - SNR: 0.1369 (higher is better)
- After PR:
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **4.992**
  - SNR: 55.656 (higher is better)
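
For reference, SNR here is presumably the signal-to-noise ratio in decibels; a minimal sketch of one common definition (an assumption, the PR's exact metric may differ):

```python
import numpy as np

def snr_db(expected: np.ndarray, actual: np.ndarray) -> float:
    """SNR in dB: higher means the quantized output is closer to ground truth."""
    noise = expected - actual
    return 10.0 * np.log10(np.sum(expected ** 2) / np.sum(noise ** 2))
```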