Attention mask for Flux & SD3 #10044
Conversation
@sayakpaul @yiyixuxu, how should I test this feature? Modify the original Flux pipeline?
Refer to this Flux transformer implementation for attention masking details.
```python
query = torch.cat([query, encoder_hidden_states_query_proj], dim=2)
key = torch.cat([key, encoder_hidden_states_key_proj], dim=2)
value = torch.cat([value, encoder_hidden_states_value_proj], dim=2)

hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
if attention_mask is not None:
```
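For context, a minimal sketch of what the masking branch quoted above could do (an illustration under assumptions, not this PR's actual diff): treat `attention_mask` as a 0/1 mask over the full joint text+image sequence, turn it into an additive bias, and hand it to `scaled_dot_product_attention`.

```python
import torch
import torch.nn.functional as F


def joint_attention_with_mask(query, key, value, attention_mask=None):
    """Sketch: apply an optional 0/1 mask over the joint sequence via SDPA's attn_mask.

    query/key/value: (batch, heads, seq_len, head_dim)
    attention_mask: (batch, seq_len), 1 = keep, 0 = mask out
    """
    if attention_mask is not None:
        attn_bias = attention_mask[:, None, None, :].to(query.dtype)   # broadcast over heads and queries
        attn_bias = (1.0 - attn_bias) * torch.finfo(query.dtype).min   # 1 -> 0.0 bias, 0 -> very negative
    else:
        attn_bias = None
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attn_bias, dropout_p=0.0, is_causal=False
    )
```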
Well, I think what we want is not a specific implementation that applies attention_mask for Flux; it's just to allow it to pass down all the way from the pipeline, to the transformer, and then to the attention processor, so users can experiment with a custom attention mask.
cc @christopher5106: is what I described here something you have in mind?
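To make that concrete, here is a rough sketch (not this PR's implementation) of how a user could experiment with a custom mask once it is threaded through: wrap the default processor and inject the mask. The class and method names (`FluxAttnProcessor2_0`, `set_attn_processor`) follow current diffusers but may differ across versions, and the sketch assumes the processor actually applies `attention_mask` once this PR lands.

```python
import torch
from diffusers import FluxPipeline
from diffusers.models.attention_processor import FluxAttnProcessor2_0


class CustomMaskProcessor(FluxAttnProcessor2_0):
    """Sketch: reuse the default Flux attention processor but force a
    user-chosen mask into every attention call."""

    def __init__(self, custom_mask):
        super().__init__()
        self.custom_mask = custom_mask

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, image_rotary_emb=None):
        return super().__call__(
            attn,
            hidden_states,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=self.custom_mask,  # only has an effect once the processor uses the mask
            image_rotary_emb=image_rotary_emb,
        )


pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
custom_mask = torch.ones(1, 512)  # placeholder; shape and semantics depend on the final design
pipe.transformer.set_attn_processor(CustomMaskProcessor(custom_mask))
```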
And if you do have a specific implementation that we want to add to diffusers, maybe you can run some experiments to help us decide whether it's meaningful.
It's the encoder attention mask, though, just following the convention of PixArt and other DiTs that rely on attention masking. Masking the attention arbitrarily doesn't unlock new use cases, does it? If so, providing examples of those would be nice.
> how should I test this feature? Modify the original Flux pipeline?

@rootonchair yes, feel free to modify the pipeline/model to test, and provide the experiment results to us :)
> Masking the attention arbitrarily doesn't unlock new use cases, does it? If so, providing examples of those would be nice.

cc @christopher5106: would you be able to provide a use case, since it was the original ask?
> @rootonchair yes, feel free to modify the pipeline/model to test, and provide the experiment results to us :)
>
> how should I test this feature? Modify the original Flux pipeline?

Sure, perhaps the simplest test would be passing a padded prompt.
Checking the softmax scores for padded positions: this implementation is a little confusing to me. If the encoder attention mask is

then simply broadcasting the mask would result in

while the last two tokens are just padding tokens.
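To make the broadcasting concern concrete, here is a small standalone sketch (no model involved, made-up token counts, and the text-before-image ordering is an assumption): a text-only padding mask has to be extended with ones for the image tokens before it can mask the joint sequence, and the softmax check then shows the padded text keys receiving essentially zero weight.

```python
import torch

torch.manual_seed(0)
batch, n_img, n_txt = 1, 4, 6                      # made-up token counts
text_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])     # last two text tokens are padding

# The joint sequence is [text tokens, image tokens] here (ordering is an assumption);
# image tokens should never be masked, so extend the text mask with ones.
joint_mask = torch.cat([text_mask, torch.ones(batch, n_img, dtype=text_mask.dtype)], dim=1)

scores = torch.randn(batch, n_txt + n_img, n_txt + n_img)            # toy attention logits
bias = (1.0 - joint_mask.float()) * torch.finfo(torch.float32).min   # 1 -> 0 bias, 0 -> very negative
probs = torch.softmax(scores + bias[:, None, :], dim=-1)

print(probs[0, :, 4:6].max())  # ~0: padded text keys get no attention weight
print(probs[0].sum(dim=-1))    # every query row still sums to 1 over the unmasked keys
```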
No, Flux doesn't mask padding tokens. Doing so harms prompt adherence.
This is my result testing with the Flux pipeline:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
pipe.to(torch.float16)

prompt = [
    "a tiny astronaut hatching from an egg on the moon",
    "A cat holding a sign that says hello world",
]

attention_mask = pipe.tokenizer_2(
    prompt,
    padding="max_length",
    max_length=512,
    truncation=True,
    return_length=False,
    return_overflowing_tokens=False,
    return_tensors="pt",
).attention_mask
attention_mask = attention_mask.to(device="cuda", dtype=torch.float16)

out = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=50,
    joint_attention_kwargs={"attention_mask": attention_mask},
    generator=torch.Generator(device="cuda").manual_seed(42),
).images
out[0].save("image.png")
out[1].save("image1.png")
```
Patch embed artifacts in the masked one.
For SD3, it doesn't have the same effect. I will try to figure out why and come up with a test script soon.
For SD3 I use the script below:

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = [
    "",  # negative prompt for first image
    "",  # negative prompt for second image
    "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself.",
    "A cat holding a sign that says hello world",
]

t5_attention_mask = pipe.tokenizer_3(
    prompt,
    padding="max_length",
    max_length=256,
    truncation=True,
    add_special_tokens=True,
    return_tensors="pt",
).attention_mask
attention_mask = t5_attention_mask
attention_mask = attention_mask.to(device="cuda", dtype=torch.float16)

prompt = prompt[2:]
print(prompt)

image = pipe(
    prompt,
    joint_attention_kwargs={"attention_mask": attention_mask},
    generator=torch.Generator(device="cuda").manual_seed(42),
).images
image[0].save("new_sd_image.png")
image[1].save("new_sd_image1.png")
```
It looks like there is a bug.
Yeah, I'm not too sure what's happening here anymore. Is the attention_mask ignoring image token inputs? Is it just a text encoder attention mask now? I thought the idea was to add encoder_attention_mask and image_attention_mask parameters and make them cooperate with the joint_attention_kwargs attention_mask, which would supersede those two. The test should not change the outputs at all, and yours does, which indicates there is something wrong.
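A sanity check along those lines could look like the sketch below (hypothetical, assuming the PR routes `joint_attention_kwargs={"attention_mask": ...}` down to the processors and interprets an all-ones mask over the 512-token T5 sequence as "keep everything"): generate with and without the mask under the same seed and compare the images.

```python
import numpy as np
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

common = dict(
    prompt="A cat holding a sign that says hello world",
    height=512,
    width=512,
    num_inference_steps=20,
)

# Reference run: no mask at all.
ref = pipe(**common, generator=torch.Generator("cuda").manual_seed(0)).images[0]

# All-ones mask, i.e. nothing is actually masked (512 = the pipeline's default T5 max length).
all_ones = torch.ones(1, 512, device="cuda", dtype=torch.bfloat16)
masked = pipe(
    **common,
    joint_attention_kwargs={"attention_mask": all_ones},
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

# With nothing masked, the outputs should match (up to attention-backend numerics).
diff = np.abs(np.asarray(ref, dtype=np.int16) - np.asarray(masked, dtype=np.int16))
print(diff.max())
```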
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has anyone already gotten it working properly? I'm willing to pay for the work to anyone who can integrate it into ComfyUI for me.
What does this PR do?
Fixes #10025
Fixes #8673
Before submitting
Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.