Quantization techniques focus on representing data with less information while also trying to not lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory-usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits.
Interested in adding a new quantization method to Transformers? Refer to the Contribute new quantization method guide to learn more about adding a new quantization method.
If you are new to the quantization field, we recommend you to check out these beginner-friendly courses about quantization in collaboration with DeepLearning.AI:
Diffusers supports bitsandbytes and torchao. Refer to this table to help you determine which quantization backend to use.