Quantization

Quantization techniques represent data with less information while trying to preserve as much accuracy as possible. This often means converting a data type so that the same information is represented with fewer bits. For example, if your model weights are stored as 32-bit floating-point values and they're quantized to 16-bit floating-point values, the model size is halved, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because calculations with fewer bits take less time.
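To make the size arithmetic concrete, here is a minimal sketch that casts a randomly generated weight matrix from 32-bit to 16-bit floats with NumPy. The matrix shape and the variable names are assumptions made for illustration; this is a plain dtype cast, not the method used by any particular quantization backend.

```python
import numpy as np

# Hypothetical weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((1024, 1024)).astype(np.float32)

# Reduce precision by casting to a 16-bit floating-point type.
weights_fp16 = weights_fp32.astype(np.float16)

size_fp32 = weights_fp32.nbytes  # 4 bytes per value
size_fp16 = weights_fp16.nbytes  # 2 bytes per value

# Largest rounding error introduced by the lower precision.
max_abs_error = float(
    np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32)))
)

print(f"fp32: {size_fp32 / 2**20:.1f} MiB, fp16: {size_fp16 / 2**20:.1f} MiB")
print(f"max absolute error: {max_abs_error:.6f}")
```

The storage cost drops from 4 bytes to 2 bytes per value, and the printed error shows the accuracy given up in exchange.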

Interested in adding a new quantization method to Transformers? Refer to the Contribute new quantization method guide to learn more.

If you are new to quantization, we recommend these beginner-friendly courses, created in collaboration with DeepLearning.AI:

Quantization Fundamentals with Hugging Face
Quantization in Depth

When to use what?

Diffusers supports bitsandbytes and torchao. Refer to this table to help you determine which quantization backend to use.
