
Commit 3748204

SusungHong authored and w4ffl35 committed

[Docs] update Self-Attention Guidance docs (huggingface#2952)

* Update index.mdx
* Edit docs & add HF space link
* Only change equation numbers in comments

1 parent c4ee06c commit 3748204

3 files changed: +9 −8 lines changed

docs/source/en/api/pipelines/stable_diffusion/self_attention_guidance.mdx (+5 −4)

@@ -14,25 +14,26 @@ specific language governing permissions and limitations under the License.
 
 ## Overview
 
-[Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
+[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
 
 The abstract of the paper is the following:
 
-*Denoising diffusion models (DDMs) have been drawing much attention for their appreciable sample quality and diversity. Despite their remarkable performance, DDMs remain black boxes on which further study is necessary to take a profound step. Motivated by this, we delve into the design of conventional U-shaped diffusion models. More specifically, we investigate the self-attention modules within these models through carefully designed experiments and explore their characteristics. In addition, inspired by the studies that substantiate the effectiveness of the guidance schemes, we present plug-and-play diffusion guidance, namely Self-Attention Guidance (SAG), that can drastically boost the performance of existing diffusion models. Our method, SAG, extracts the intermediate attention map from a diffusion model at every iteration and selects tokens above a certain attention score for masking and blurring to obtain a partially blurred input. Subsequently, we measure the dissimilarity between the predicted noises obtained from feeding the blurred and original input to the diffusion model and leverage it as guidance. With this guidance, we observe apparent improvements in a wide range of diffusion models, e.g., ADM, IDDPM, and Stable Diffusion, and show that the results further improve by combining our method with the conventional guidance scheme. We provide extensive ablation studies to verify our choices.*
+*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.*
 
 Resources:
 
 * [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance).
 * [Paper](https://arxiv.org/abs/2210.00939).
 * [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance).
-* [Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
+* [Hugging Face Demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance).
+* [Colab Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
 
 
 ## Available Pipelines:
 
 | Pipeline | Tasks | Demo
 |---|---|:---:|
-| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [Colab](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb) |
+| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [🤗 Space](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) |
 
 ## Usage example

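For orientation, the pipeline documented above can be driven roughly as follows. This is a minimal sketch, not taken from the diff; it assumes a recent `diffusers` install that ships `StableDiffusionSAGPipeline`, and the checkpoint id is only an example.

```python
import torch
from diffusers import StableDiffusionSAGPipeline

# Load a Stable Diffusion checkpoint into the SAG pipeline (example model id).
pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
# sag_scale > 0 enables Self-Attention Guidance; it combines with the usual
# classifier-free guidance_scale.
image = pipe(prompt, sag_scale=0.75, guidance_scale=7.5).images[0]
image.save("astronaut_sag.png")
```
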
docs/source/en/index.mdx (+2 −2)

@@ -73,7 +73,7 @@ The library has three main components:
 | [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing|
 | [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
 | [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
-| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
+| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation |
 | [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
 | [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing |
@@ -90,4 +90,4 @@ The library has three main components:
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
-| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
+| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py (+2 −2)

@@ -574,7 +574,7 @@ def __call__(
         # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
         # corresponds to doing no classifier free guidance.
         do_classifier_free_guidance = guidance_scale > 1.0
-        # and `sag_scale` is` `s` of equation (15)
+        # and `sag_scale` is` `s` of equation (16)
         # of the self-attentnion guidance paper: https://arxiv.org/pdf/2210.00939.pdf
         # `sag_scale = 0` means no self-attention guidance
         do_self_attention_guidance = sag_scale > 0.0
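For readers following the equation references: the guidance rule that `sag_scale` scales can be paraphrased (notation approximate, not quoted from the paper, and stated in the form the surrounding comments imply) as

$$
\tilde{\epsilon}(x_t) = \epsilon_\theta(x_t) + s\,\big(\epsilon_\theta(x_t) - \epsilon_\theta(\hat{x}_t)\big),
$$

where $\hat{x}_t$ is the input blurred only in its most self-attended regions and $s$ is `sag_scale`. At $s = 0$ the prediction is unchanged, which is why the flag above checks `sag_scale > 0.0`.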
@@ -645,7 +645,7 @@ def get_map_size(module, input, output):
                 # perform self-attention guidance with the stored self-attentnion map
                 if do_self_attention_guidance:
                     # classifier-free guidance produces two chunks of attention map
-                    # and we only use unconditional one according to equation (24)
+                    # and we only use unconditional one according to equation (25)
                     # in https://arxiv.org/pdf/2210.00939.pdf
                     if do_classifier_free_guidance:
                         # DDIM-like prediction of x0
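The flow these comments describe can be sketched as follows. This is a simplified illustration rather than the pipeline's actual code; `unet` and `blur_attended_regions` are hypothetical stand-ins for the pipeline's internals.

```python
def guided_noise_prediction(
    unet,                        # hypothetical denoiser: (latents, t, emb) -> predicted noise
    blur_attended_regions,       # hypothetical helper: blur latents where self-attention is high
    latents, t,
    text_emb, uncond_emb,
    guidance_scale=7.5,          # classifier-free guidance scale (> 1.0 enables CFG)
    sag_scale=0.75,              # self-attention guidance scale (> 0.0 enables SAG)
):
    # Classifier-free guidance: predict noise with and without the text condition,
    # then push the combined prediction away from the unconditional one.
    noise_uncond = unet(latents, t, uncond_emb)
    noise_text = unet(latents, t, text_emb)
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

    # Self-attention guidance: blur the regions the model attends to (using the
    # unconditional branch, as the comment above notes), re-predict the noise for
    # that degraded input, and add the difference scaled by sag_scale.
    if sag_scale > 0.0:
        degraded_latents = blur_attended_regions(latents, noise_uncond, t)
        degraded_pred = unet(degraded_latents, t, uncond_emb)
        noise_pred = noise_pred + sag_scale * (noise_uncond - degraded_pred)

    return noise_pred
```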
