docs/source/en/quantization/bitsandbytes.md (+60 -8)
@@ -14,13 +14,21 @@ rendered properly in your Markdown viewer.
-->

-# bitsandbytes
+# Bitsandbytes

-[bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) features the LLM.int8 and QLoRA quantization to enable accessible large language model inference and training.
+The [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library provides quantization tools for LLMs through a lightweight Python wrapper around CUDA functions. It enables working with large models on limited computational resources by reducing their memory footprint.

-[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inference more accessible without significant degradation. Unlike naive 8-bit quantization, which can result in loss of critical information and accuracy, LLM.int8() dynamically adapts to ensure sensitive components of the computation retain higher precision when needed.
+At its core, bitsandbytes provides:

-QLoRA, or 4-bit quantization, compresses a model even further to 4-bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training.
+- **Quantized Linear Layers**: `Linear8bitLt` and `Linear4bit` layers that replace standard PyTorch linear layers with memory-efficient quantized alternatives
+- **Optimized Optimizers**: 8-bit versions of common optimizers through its `optim` module, enabling training of large models with reduced memory requirements
+- **Matrix Multiplication**: Optimized matrix multiplication operations that leverage the quantized format
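The building blocks listed above can also be used directly, outside of Transformers. Below is a minimal sketch of that usage, assuming a CUDA GPU and a recent bitsandbytes release; the layer and optimizer arguments shown are illustrative and may differ slightly between versions.

```py
import torch
import bitsandbytes as bnb

# Quantized linear layer: a drop-in replacement for torch.nn.Linear.
# With has_fp16_weights=False, the weights are quantized to int8 when the
# module is moved to the GPU.
linear = bnb.nn.Linear8bitLt(1024, 1024, bias=False, has_fp16_weights=False, threshold=6.0)
linear = linear.to("cuda")

x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
y = linear(x)  # int8 matmul with mixed-precision handling of outlier features

# 8-bit optimizer: same interface as torch.optim.Adam, but with 8-bit optimizer states
optimizer = bnb.optim.Adam8bit(linear.parameters(), lr=1e-4)
```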
+bitsandbytes offers two main quantization features:
+
+1. **LLM.int8()** - An 8-bit quantization method that makes inference more accessible without significant performance degradation. Unlike naive quantization, [LLM.int8()](https://hf.co/papers/2208.07339) dynamically preserves higher precision for critical computations, preventing information loss in sensitive parts of the model.
+
+2. **QLoRA** - A 4-bit quantization technique that compresses models even further while maintaining trainability by inserting a small set of trainable low-rank adaptation (LoRA) weights.
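In Transformers, both features are driven by [`BitsAndBytesConfig`], as shown in the examples further below. A rough sketch of the two configurations (the 4-bit settings here are illustrative choices, not prescribed by this page):

```py
from transformers import BitsAndBytesConfig

# LLM.int8(): 8-bit weights, with outliers handled in higher precision
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# QLoRA-style 4-bit quantization: NF4 storage with bfloat16 compute
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
```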
> **Note:** For a user-friendly quantization experience, you can use the `bitsandbytes` [community space](https://huggingface.co/spaces/bnb-community/bnb-my-repo).
@@ -30,12 +38,38 @@ Run the command below to install bitsandbytes.
To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation).
+
+## Hardware Compatibility
+
+bitsandbytes is currently only supported on CUDA GPUs for CUDA versions 11.0 - 12.8. However, there's an ongoing multi-backend effort under development, which is currently in alpha. If you're interested in providing feedback or testing, check out the [bitsandbytes repository](https://github.com/bitsandbytes-foundation/bitsandbytes) for more information.
> **Note:** Bitsandbytes is moving away from the multi-backend approach towards using [Pytorch Custom Operators](https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html) as the main mechanism for supporting new hardware and dispatching to the correct backend.
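A quick way to check whether an environment meets these requirements before enabling quantization (a small sketch, not taken from the docs):

```py
import torch

# bitsandbytes currently requires an NVIDIA GPU and a CUDA build of PyTorch
if torch.cuda.is_available():
    print(f"CUDA build: {torch.version.cuda}")      # should fall in the supported 11.0 - 12.8 range
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected; bitsandbytes quantization is not available here.")
```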
+
+## Quantization Examples
Quantize a model by passing a [`BitsAndBytesConfig`] to [`~PreTrainedModel.from_pretrained`]. This works for any model in any modality, as long as it supports [Accelerate](https://huggingface.co/docs/accelerate/index) and contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.
Quantizing a model in 8-bit halves the memory usage, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.
Quantizing a model in 4-bit reduces the memory usage by 4x, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.
Make sure you have the latest bitsandbytes version so you can serialize 4-bit models and push them to the Hub with [`~PreTrainedModel.push_to_hub`]. Use [`~PreTrainedModel.save_pretrained`] to save the 4-bit model locally.
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+...
+```
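Filled out, such an example typically looks like the following minimal sketch; the model name, 4-bit settings, and repository name are illustrative assumptions, not taken from the original file.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative model; any model with torch.nn.Linear layers and Accelerate support works
model_id = "facebook/opt-350m"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # distribute weights across all available GPUs
)

inputs = tokenizer("Quantization reduces memory usage by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Save the 4-bit model locally, or push it to the Hub
model.save_pretrained("opt-350m-4bit")
# model.push_to_hub("your-username/opt-350m-4bit")  # hypothetical repository name
```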