Hello, I would like to quantize from the FP16 data type to the FP8 E4M3 data type. I am following the method at https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu#L629, but I have a question: why is min_scaling_factor computed by dividing by (FP8_E4M3_MAX::value * 512.f)? Could you please explain the basis for choosing 512.f? Thanks.
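For reference, here is a minimal CPU-side sketch of the scaling pattern I am asking about, under my own assumptions: the constant names, the float-based FP8 emulation, and the per-row layout are illustrative only, and the real kernel converts to the FP8 type directly. The point of the lower bound on the scale (the divisor with 512.f) appears to be to keep the scale away from zero for all-zero or tiny rows, but the choice of 512 specifically is what I am asking about.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Maximum finite value representable in FP8 E4M3 (448).
constexpr float kFP8E4M3Max = 448.0f;

// Lower bound on the per-row scaling factor, mirroring the
// (FP8_E4M3_MAX::value * 512.f) divisor from the linked kernel.
// The rationale for the 512 factor is the subject of this question.
constexpr float kMinScalingFactor = 1.0f / (kFP8E4M3Max * 512.0f);

// Quantize one row to an emulated "FP8" (clamped floats here) using a
// per-row scale, clamped from below by kMinScalingFactor so that rows
// whose max magnitude is (near-)zero do not produce a degenerate scale.
float quantize_row(const std::vector<float>& row, std::vector<float>& out) {
  float row_max = 0.0f;
  for (float v : row) row_max = std::max(row_max, std::fabs(v));

  // Map the row's max magnitude onto the FP8 representable range,
  // but never let the scale drop below the minimum.
  float scale = std::max(row_max / kFP8E4M3Max, kMinScalingFactor);

  out.resize(row.size());
  for (size_t i = 0; i < row.size(); ++i) {
    // Scale, then clamp to the FP8 E4M3 range; a real kernel would
    // convert to the hardware FP8 type here instead of keeping a float.
    out[i] = std::clamp(row[i] / scale, -kFP8E4M3Max, kFP8E4M3Max);
  }
  return scale;  // stored alongside the quantized row for dequantization
}

int main() {
  std::vector<float> row = {0.001f, -0.002f, 0.0005f};
  std::vector<float> q;
  float scale = quantize_row(row, q);
  std::printf("scale = %g, q[0] = %g\n", scale, q[0]);
  return 0;
}
```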