
Commit de182ba

MekkCyber and SunMarc authored
Refactor bitsandbytes doc (#37668)
* doc
* torch ops
* fix
* nits
* Update docs/source/en/quantization/bitsandbytes.md

Co-authored-by: Marc Sun <[email protected]>
1 parent dde9b03 commit de182ba

1 file changed: docs/source/en/quantization/bitsandbytes.md (+60 −8)
@@ -14,13 +14,21 @@ rendered properly in your Markdown viewer.

-->

-# bitsandbytes
+# Bitsandbytes

-[bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) features the LLM.int8 and QLoRA quantization to enable accessible large language model inference and training.
+The [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library provides quantization tools for LLMs through a lightweight Python wrapper around CUDA functions. It enables working with large models using limited computational resources by reducing their memory footprint.

-[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inference more accessible without significant degradation. Unlike naive 8-bit quantization, which can result in loss of critical information and accuracy, LLM.int8() dynamically adapts to ensure sensitive components of the computation retain higher precision when needed.
+At its core, bitsandbytes provides:

-QLoRA, or 4-bit quantization, compresses a model even further to 4-bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allowing training.
+- **Quantized Linear Layers**: `Linear8bitLt` and `Linear4bit` layers that replace standard PyTorch linear layers with memory-efficient quantized alternatives
+- **Optimized Optimizers**: 8-bit versions of common optimizers through its `optim` module, enabling training of large models with reduced memory requirements
+- **Matrix Multiplication**: Optimized matrix multiplication operations that leverage the quantized format
+
+bitsandbytes offers two main quantization features:
+
+1. **LLM.int8()** - An 8-bit quantization method that makes inference more accessible without significant performance degradation. Unlike naive quantization, [LLM.int8()](https://hf.co/papers/2208.07339) dynamically preserves higher precision for critical computations, preventing information loss in sensitive parts of the model.
+
+2. **QLoRA** - A 4-bit quantization technique that compresses models even further while maintaining trainability by inserting a small set of trainable low-rank adaptation (LoRA) weights.

> **Note:** For a user-friendly quantization experience, you can use the `bitsandbytes` [community space](https://huggingface.co/spaces/bnb-community/bnb-my-repo).
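
For orientation, the building blocks listed in the bullets above can also be used directly through the bitsandbytes API. The following is a hedged sketch, not taken from this diff; the layer sizes and learning rate are made up, and a CUDA-capable install is assumed.

```py
# Hedged sketch of the bitsandbytes building blocks named in the bullet list:
# quantized linear layers and an 8-bit optimizer. Sizes and lr are arbitrary.
import torch
import bitsandbytes as bnb

# Drop-in replacements for torch.nn.Linear
int8_linear = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)  # LLM.int8()-style
nf4_linear = bnb.nn.Linear4bit(1024, 1024, quant_type="nf4")           # QLoRA/NF4-style

# 8-bit optimizer from the optim module; stores optimizer state in 8-bit
toy_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
optimizer = bnb.optim.Adam8bit(toy_model.parameters(), lr=1e-4)
```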
@@ -30,12 +38,38 @@ Run the command below to install bitsandbytes.
```bash
pip install --upgrade transformers accelerate bitsandbytes
```
+To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation).
+
+## Hardware Compatibility
+bitsandbytes is currently supported only on CUDA GPUs with CUDA versions 11.0 - 12.8. A multi-backend effort is under development and is currently in alpha. If you're interested in providing feedback or testing, check out the [bitsandbytes repository](https://github.com/bitsandbytes-foundation/bitsandbytes) for more information.
+
+### CUDA
+
+| Feature | Minimum Hardware Requirement |
+|---------|-------------------------------|
+| 8-bit optimizers | NVIDIA Maxwell (GTX 900 series, TITAN X, M40) or newer GPUs * |
+| LLM.int8() | NVIDIA Turing (RTX 20 series, T4) or newer GPUs |
+| NF4/FP4 quantization | NVIDIA Maxwell (GTX 900 series, TITAN X, M40) or newer GPUs * |
+
+### Multi-backend
+
+| Backend | Supported Versions | Python versions | Architecture Support | Status |
+|---------|-------------------|----------------|---------------------|---------|
+| AMD ROCm | 6.1+ | 3.10+ | minimum CDNA - gfx90a, RDNA - gfx1100 | Alpha |
+| Apple Silicon (MPS) | WIP | 3.10+ | M1/M2 chips | Planned |
+| Intel CPU | v2.4.0+ (ipex) | 3.10+ | Intel CPU | Alpha |
+| Intel GPU | v2.4.0+ (ipex) | 3.10+ | Intel GPU | Experimental |
+| Ascend NPU | 2.1.0+ (torch_npu) | 3.10+ | Ascend NPU | Experimental |
+
+> **Note:** Bitsandbytes is moving away from the multi-backend approach towards using [PyTorch Custom Operators](https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html) as the main mechanism for supporting new hardware and dispatching to the correct backend.
+
+## Quantization Examples

Quantize a model by passing a [`BitsAndBytesConfig`] to [`~PreTrainedModel.from_pretrained`]. This works for any model in any modality, as long as it supports [Accelerate](https://huggingface.co/docs/accelerate/index) and contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.

<hfoptions id="bnb">
<hfoption id="8-bit">
-
+<div class="bnb-container" style="border: 1px solid #ddd; border-radius: 8px; padding: 20px; margin: 20px 0">
Quantizing a model in 8-bit halves the memory usage, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.

```py
@@ -45,6 +79,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
+    device_map="auto",
    quantization_config=quantization_config
)
```
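
The surrounding text states that 8-bit quantization roughly halves memory usage. One way to check this, sketched here under the assumption that both variants fit on the available hardware, is `get_memory_footprint()`:

```py
# Hedged sketch: compare memory footprints of the 8-bit and fp16 variants of the
# same checkpoint. Loading both models at once is assumed to fit on the hardware.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    torch_dtype=torch.float16,
)

print(f"8-bit: {model_8bit.get_memory_footprint() / 1e9:.2f} GB")
print(f"fp16:  {model_fp16.get_memory_footprint() / 1e9:.2f} GB")
```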
@@ -59,6 +94,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
+    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype="auto"
)
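
As a hedged aside on the `torch_dtype="auto"` argument in this hunk: modules that are not quantized keep a higher-precision dtype, which can be inspected the same way the document later inspects the 4-bit model.

```py
# Hedged sketch: with torch_dtype="auto", modules that are not quantized
# (such as layer norms) keep the checkpoint's dtype; this mirrors the dtype
# check the document applies to the 4-bit model further below.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype="auto",
)
print(model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype)
```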
@@ -74,16 +110,16 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
+    device_map="auto",
    quantization_config=quantization_config
)
-tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

model.push_to_hub("bloom-560m-8bit")
```
-
+</div>
</hfoption>
<hfoption id="4-bit">
-
+<div class="bnb-container" style="border: 1px solid #ddd; border-radius: 8px; padding: 20px; margin: 20px 0">
Quantizing a model in 4-bit reduces your memory usage by 4x, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.

```py
@@ -93,6 +129,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
+    device_map="auto",
    quantization_config=quantization_config
)
```
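
Beyond the bare `load_in_4bit=True` used in these examples, `BitsAndBytesConfig` also exposes options for the NF4/FP4 formats listed in the hardware table above. The following hedged sketch shows a QLoRA-style NF4 setup; the specific values are illustrative, not part of this diff.

```py
# Hedged sketch: a more explicit QLoRA-style 4-bit configuration using the
# NF4 data type mentioned in the hardware table; values are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 instead of the default FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmul compute
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    quantization_config=nf4_config,
)
```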
@@ -107,6 +144,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
+    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype="auto"
)
@@ -115,6 +153,20 @@ model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype

Make sure you have the latest bitsandbytes version so you can serialize 4-bit models and push them to the Hub with [`~PreTrainedModel.push_to_hub`]. Use [`~PreTrainedModel.save_pretrained`] to save the 4-bit model locally.

+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "bigscience/bloom-560m",
+    device_map="auto",
+    quantization_config=quantization_config
+)
+
+model.push_to_hub("bloom-560m-4bit")
+```
+</div>
</hfoption>
</hfoptions>

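The added snippet above only shows the `push_to_hub` side of serialization. A hedged sketch of the local `save_pretrained` path mentioned in the same sentence, with an arbitrary directory name, could look like this:

```py
# Hedged sketch: save the 4-bit model (and its tokenizer) locally with
# save_pretrained, then reload it. "bloom-560m-4bit" is an arbitrary local path.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

model.save_pretrained("bloom-560m-4bit")
tokenizer.save_pretrained("bloom-560m-4bit")

reloaded = AutoModelForCausalLM.from_pretrained("bloom-560m-4bit", device_map="auto")
```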