
Commit be2fb77

stevhliu and sayakpaul authored
[docs] PyTorch 2.0 (#11618)
* combine

* Update docs/source/en/optimization/fp16.md

Co-authored-by: Sayak Paul <[email protected]>
1 parent 54cddc1 commit be2fb77

15 files changed: +43 -459 lines changed

docs/source/en/_toctree.yml

Lines changed: 0 additions & 2 deletions
@@ -180,8 +180,6 @@
    title: Accelerate inference
  - local: optimization/memory
    title: Reduce memory usage
-  - local: optimization/torch2.0
-    title: PyTorch 2.0
  - local: optimization/xformers
    title: xFormers
  - local: optimization/tome

docs/source/en/api/pipelines/deepfloyd_if.md

Lines changed: 1 addition & 1 deletion
@@ -347,7 +347,7 @@ pipe.to("cuda")
image = pipe(image=image, prompt="<prompt>", strength=0.3).images
```

-You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
+You can also use [`torch.compile`](../../optimization/fp16#torchcompile). Note that we have not exhaustively tested `torch.compile`
with IF and it might not give expected results.

```py

docs/source/en/optimization/fp16.md

Lines changed: 24 additions & 0 deletions
@@ -150,6 +150,24 @@ pipeline(prompt, num_inference_steps=30).images[0]

Compilation is slow the first time, but once compiled, it is significantly faster. Try to only use the compiled pipeline on the same type of inference operations. Calling the compiled pipeline on a different image size retriggers compilation which is slow and inefficient.

+### Regional compilation
+
+[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) reduces the cold start compilation time by only compiling a specific repeated region (or block) of the model instead of the entire model. The compiler reuses the cached and compiled code for the other blocks.
+
+[Accelerate](https://huggingface.co/docs/accelerate/index) provides the [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method for automatically compiling the repeated blocks of a `nn.Module` sequentially. The rest of the model is compiled separately.
+
+```py
+# pip install -U accelerate
+import torch
+from diffusers import StableDiffusionXLPipeline
+from accelerate.utils import compile_regions
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+).to("cuda")
+pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
### Graph breaks

It is important to specify `fullgraph=True` in torch.compile to ensure there are no graph breaks in the underlying model. This allows you to take advantage of torch.compile without any performance degradation. For the UNet and VAE, this changes how you access the return variables.
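An illustrative aside, not part of the diff: one common way the "return variables" change looks in practice is requesting plain tuples with `return_dict=False` instead of reading attributes off the output dataclasses. A minimal sketch, where the checkpoint id and the dummy tensor shapes are assumptions for illustration only:

```py
# Sketch only (not from this commit); assumes an SD 1.x-style checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

# Dummy tensors purely to show the tuple-style access in a custom denoising step.
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor(10, device="cuda")
prompt_embeds = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")

# Index the returned tuples instead of reading `.sample` from the output dataclasses.
noise_pred = pipeline.unet(
    latents, timestep, encoder_hidden_states=prompt_embeds, return_dict=False
)[0]
image = pipeline.vae.decode(
    latents / pipeline.vae.config.scaling_factor, return_dict=False
)[0]
```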
@@ -170,6 +188,12 @@ The `step()` function is [called](https://github.com/huggingface/diffusers/blob/

In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240) to avoid the communication sync and latency.

+### Benchmarks
+
+Refer to the [diffusers/benchmarks](https://huggingface.co/datasets/diffusers/benchmarks) dataset to see inference latency and memory usage data for compiled pipelines.
+
+The [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao#benchmarking-results) repository also contains benchmarking results for compiled versions of Flux and CogVideoX.
+
## Dynamic quantization

[Dynamic quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) improves inference speed by reducing precision to enable faster math operations. This particular type of quantization determines how to scale the activations based on the data at runtime rather than using a fixed scaling factor. As a result, the scaling factor is more accurately aligned with the data.
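Another aside, not part of the diff: the PyTorch recipe linked above amounts to converting weights to int8 ahead of time while activation scaling factors are computed from the data at runtime. A self-contained sketch on a toy CPU model (the layer sizes are arbitrary):

```py
import torch
import torch.nn as nn

# Toy model; dynamic quantization targets the nn.Linear layers here.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64)).eval()

# Weights are quantized to int8 up front; activation scales are determined at
# runtime from the data, as the paragraph above describes.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 128)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([4, 64])
```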

docs/source/en/optimization/tome.md

Lines changed: 1 addition & 1 deletion
@@ -93,4 +93,4 @@ To reproduce this benchmark, feel free to use this [script](https://gist.github.
| | | 2 | OOM | 13 | 10.78 |
| | | 1 | OOM | 6.66 | 5.54 |

-As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0).
+As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](fp16#torchcompile).
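Not part of the diff: a rough sketch of the combination the last line suggests, applying token merging with `tomesd` and then compiling the UNet. The checkpoint id and the `ratio` value are assumptions, not settings from the benchmark above:

```py
# pip install tomesd
import torch
import tomesd
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; substitute the model you actually benchmark.
pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Merge redundant tokens in the UNet attention blocks, then compile the UNet.
tomesd.apply_patch(pipeline, ratio=0.5)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead")

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```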
