
Commit cd1d3c3

Qubitium and mgoin authored
[Docs] Add GPTQModel (#14056)
Signed-off-by: mgoin <[email protected]>
Co-authored-by: mgoin <[email protected]>
1 parent 19d98e0 commit cd1d3c3

File tree

3 files changed (+85 lines, -1 line)

docs/source/features/quantization/auto_awq.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # AutoAWQ
 
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
-Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
+Quantization reduces the model's precision from BF16/FP16 to INT4, which effectively reduces the total model memory footprint.
 The main benefits are lower latency and memory usage.
 
 You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).
docs/source/features/quantization/gptqmodel.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
(gptqmodel)=

# GPTQModel

To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.

Quantization reduces the model's precision from BF16/FP16 (16 bits) to INT4 (4 bits) or INT8 (8 bits), which significantly reduces the total model memory footprint while at the same time improving inference performance.

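As a rough back-of-the-envelope illustration (weights only, ignoring activations, the KV cache, and quantization overhead such as scales and zero points): a 7B-parameter model needs about 7B × 2 bytes ≈ 14 GB for its weights in BF16/FP16, but only about 7B × 0.5 bytes ≈ 3.5 GB at INT4.
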
Compatible GPTQModel quantized models can leverage the `Marlin` and `Machete` vLLM custom kernels to maximize batched transactions per second (`tps`) and minimize token latency on both Ampere (A100+) and Hopper (H100+) NVIDIA GPUs. These two kernels are highly optimized by vLLM and Neural Magic (now part of Red Hat) to deliver world-class inference performance for quantized GPTQ models.

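Kernel selection is handled automatically by vLLM, but you can explicitly request the Marlin-backed GPTQ path via the `quantization` argument when constructing the engine. A minimal sketch, assuming the checkpoint below and a GPU/vLLM build where the `gptq_marlin` method is available:

```python
from vllm import LLM

# Explicitly request the Marlin-optimized GPTQ backend rather than relying on
# automatic method detection; this may raise an error if the GPU or checkpoint
# is incompatible, in which case the plain "gptq" method can be used instead.
llm = LLM(
    model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2",
    quantization="gptq_marlin",
)
```
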
GPTQModel is one of the few quantization toolkits that support `Dynamic` per-module quantization, where different layers and/or modules within an LLM can be further optimized with custom quantization parameters. `Dynamic` quantization is fully integrated into vLLM and backed by support from the ModelCloud.AI team. Please refer to the [GPTQModel readme](https://github.com/ModelCloud/GPTQModel?tab=readme-ov-file#dynamic-quantization-per-module-quantizeconfig-override) for more details on this and other advanced features.

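As an illustration of what such per-module overrides can look like, here is a minimal sketch of a `QuantizeConfig` with a `dynamic` override; the regex rules, bit-widths, and module names are made-up examples following the pattern documented in the GPTQModel readme, not a tuned recipe:

```python
from gptqmodel import QuantizeConfig

# `dynamic` maps regex rules to per-module overrides of the base config.
# A `+:` prefix (or no prefix) applies the override to matching modules,
# while a `-:` prefix excludes matching modules from quantization.
dynamic = {
    # Quantize the MLP projections at 8 bits instead of the base 4 bits.
    r"+:.*\.mlp\..*": {"bits": 8, "group_size": 64},
    # Skip quantization of the first decoder layer entirely.
    r"-:model\.layers\.0\..*": {},
}

quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)
```

Checkpoints produced with such a config load in vLLM like any other GPTQModel checkpoint.
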
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?sort=trending&search=gptq).

```console
pip install -U gptqmodel --no-build-isolation -v
```

After installing GPTQModel, you are ready to quantize a model. Please refer to the [GPTQModel readme](https://github.com/ModelCloud/GPTQModel/?tab=readme-ov-file#quantization) for further details.

Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# Increase `batch_size` to match your GPU/VRAM specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)

model.save(quant_path)
```

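Before serving the result, it can be handy to reload the quantized checkpoint with GPTQModel itself and run a quick generation as a sanity check. A minimal sketch following the usage pattern in the GPTQModel readme (the prompt is arbitrary):

```python
from gptqmodel import GPTQModel

quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

# Reload the freshly quantized checkpoint.
model = GPTQModel.load(quant_path)

# Generate a short continuation to confirm the quantized weights produce sensible text.
tokens = model.generate("The capital of France is")[0]
print(model.tokenizer.decode(tokens))
```
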
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:

```console
python examples/offline_inference/llm_engine_example.py --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```

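The same checkpoint can also be exposed through vLLM's OpenAI-compatible server with `vllm serve`; a minimal sketch, leaving the port and other flags at their defaults:

```console
vllm serve ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```
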
GPTQModel quantized models are also supported directly through the LLM entrypoint:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)

# Create an LLM.
llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

docs/source/features/quantization/index.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ supported_hardware
 auto_awq
 bnb
 gguf
+gptqmodel
 int4
 int8
 fp8
