
Commit 0744378

stevhliu and sayakpaul authored
[docs] Quantization tip (#10249)
* quantization
* add other vid models
* typo
* more pipelines

Co-authored-by: Sayak Paul <[email protected]>
1 parent 3f591ef commit 0744378

12 files changed (+496, -10 lines)

docs/source/en/api/pipelines/allegro.md (+45)

@@ -23,6 +23,51 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)

</Tip>

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`AllegroPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AllegroTransformer3DModel, AllegroPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "rhymes-ai/Allegro",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = AllegroTransformer3DModel.from_pretrained(
    "rhymes-ai/Allegro",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = (
    "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, "
    "the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this "
    "location might be a popular spot for docking fishing boats."
)
video = pipeline(prompt, guidance_scale=7.5, max_sequence_length=512).frames[0]
export_to_video(video, "harbor.mp4", fps=15)
```
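bitsandbytes also supports 4-bit quantization if memory is still tight. The snippet below is a minimal sketch, not part of the original example: it only swaps the two 8-bit configs above for 4-bit NF4 ones, and the rest of the loading code stays the same. The quality impact should be checked per model.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig

# Hypothetical 4-bit (NF4) replacements for the 8-bit configs above.
# Pass these as `quantization_config` to the same `from_pretrained` calls.
text_encoder_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
transformer_quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
```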
## AllegroPipeline

[[autodoc]] AllegroPipeline

docs/source/en/api/pipelines/aura_flow.md (+41, -1)

@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

# AuraFlow

- AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3.md) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.
+ AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.

It was developed by the Fal team and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/).

@@ -22,6 +22,46 @@ AuraFlow can be quite expensive to run on consumer hardware devices. However, yo

</Tip>

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`AuraFlowPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AuraFlowTransformer2DModel, AuraFlowPipeline
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "fal/AuraFlow",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = AuraFlowTransformer2DModel.from_pretrained(
    "fal/AuraFlow",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt).images[0]
image.save("auraflow.png")
```
## AuraFlowPipeline

[[autodoc]] AuraFlowPipeline

docs/source/en/api/pipelines/cogvideox.md (+38, -5)

@@ -112,13 +112,46 @@ CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds o

- With enabling cpu offloading and tiling, memory usage is `11 GB`
- `pipe.vae.enable_slicing()`

Removed (replaced by the section below):

- ### Quantized inference
- [torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or lower VRAM GPUs!
- It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference speed. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below.
- - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
- - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`CogVideoXPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "ship.mp4", fps=8)
```
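The memory tips earlier on this page (CPU offloading, tiling, `pipe.vae.enable_slicing()`) can be combined with quantization. As a small optional addition to the example above, and not part of the original snippet, the VAE memory savers can be enabled on the quantized pipeline:

```py
# Optional: further reduce VAE decode memory on the quantized `pipeline` above.
pipeline.vae.enable_slicing()  # decode the batch one slice at a time
pipeline.vae.enable_tiling()   # decode latents in spatial tiles
```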
## CogVideoXPipeline

docs/source/en/api/pipelines/flux.md (+40)

@@ -334,6 +334,46 @@ out = pipe(

out.save("image.png")

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`FluxPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50).images[0]
image.save("flux.png")
```
## Single File Loading for the `FluxTransformer2DModel`

The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community.
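As a hedged illustration of that single-file path (the checkpoint URL below is an assumption for the example, not taken from this page), loading the transformer from an original-format file might look like:

```py
import torch
from diffusers import FluxTransformer2DModel

# Sketch: load the transformer from an original-format single checkpoint file.
# The repository and filename are assumed here for illustration.
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors",
    torch_dtype=torch.bfloat16,
)
```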

docs/source/en/api/pipelines/hunyuan_video.md (+31)

@@ -32,6 +32,37 @@ Recommendations for inference:

- For smaller resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution videos, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
- For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
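To make the `shift` recommendation above concrete, here is a minimal sketch of overriding it through the scheduler config; the value `3.0` is only an example from the suggested lower range, and the checkpoint id is the same one used in the example below.

```py
import torch
from diffusers import HunyuanVideoPipeline, FlowMatchEulerDiscreteScheduler

pipeline = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo", torch_dtype=torch.float16
)
# Rebuild the flow-match scheduler with a lower shift for a smaller-resolution run.
# 3.0 is an illustrative value from the 2.0-5.0 range suggested above.
pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, shift=3.0
)
```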

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`HunyuanVideoPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
    "tencent/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=15)
```
## HunyuanVideoPipeline

[[autodoc]] HunyuanVideoPipeline

docs/source/en/api/pipelines/latte.md (+41)

@@ -70,6 +70,47 @@ Without torch.compile(): Average inference time: 16.246 seconds.

With torch.compile(): Average inference time: 14.573 seconds.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LattePipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LatteTransformer3DModel, LattePipeline
from diffusers.utils import export_to_gif
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LatteTransformer3DModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A small cactus with a happy face in the Sahara desert."
video = pipeline(prompt).frames[0]
export_to_gif(video, "latte.gif")
```
## LattePipeline

[[autodoc]] LattePipeline

docs/source/en/api/pipelines/ltx_video.md (+41)

@@ -139,6 +139,47 @@ export_to_video(video, "output.mp4", fps=24)

Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LTXPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "Lightricks/LTX-Video",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LTXVideoTransformer3DModel.from_pretrained(
    "Lightricks/LTX-Video",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
video = pipeline(prompt=prompt, num_frames=161, num_inference_steps=50).frames[0]
export_to_video(video, "ship.mp4", fps=24)
```
## LTXPipeline

[[autodoc]] LTXPipeline

docs/source/en/api/pipelines/lumina.md (+40)

@@ -82,6 +82,46 @@ pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fu

image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures").images[0]

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LuminaText2ImgPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, Transformer2DModel, LuminaText2ImgPipeline
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = Transformer2DModel.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LuminaText2ImgPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt).images[0]
image.save("lumina.png")
```
## LuminaText2ImgPipeline

[[autodoc]] LuminaText2ImgPipeline
