Flax memory efficient attention #2889
Conversation
The documentation is not available anymore as the PR was closed or merged.
@@ -296,6 +296,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
    use_auth_token = kwargs.pop("use_auth_token", None)
    revision = kwargs.pop("revision", None)
    from_pt = kwargs.pop("from_pt", False)
    use_memory_efficient_attention = kwargs.pop("use_memory_efficient_attention", False)
I decided to enable this at pipeline load time. It's not straightforward to replicate the recursive logic we created for xformers, because Flax submodules are not available unless you are in `apply`, so I didn't find a way to make functions to enable/disable the setting on demand. Instead, the configuration is set up when the pipeline is instantiated.

Another alternative would have been to pass `use_memory_efficient_attention` as an additional argument to `generate`, so it's applied on a per-inference basis. I thought making it constant for the pipeline made sense for the Flax case.
Any other thoughts about this @patrickvonplaten, @yiyixuxu, @williamberman?
I don't have a strong opinion about this - I think passing it as an argument is probably easier to implement, but what you already did is good :)
As mentioned in #2231, inference is slower when memory efficient attention is enabled, but it allows additional use-cases (larger batch sizes, resolution bucketing). See #2231 (review), #2231 (comment).
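For illustration, enabling the flag at load time might look like the sketch below; the checkpoint, revision, and dtype are just examples and not part of this PR:

```python
import jax.numpy as jnp
from diffusers import FlaxStableDiffusionPipeline

# Illustrative checkpoint/revision; any Flax-compatible weights should work.
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="bf16",
    dtype=jnp.bfloat16,
    # New kwarg from this PR; fixed for the lifetime of the pipeline.
    use_memory_efficient_attention=True,
)
```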
@@ -0,0 +1,101 @@
import functools
Do we really need a new file for this? I think we should just put it in `attention_flax`. Currently we don't really have any "utility" files.
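For context, here is a minimal sketch of the key-chunking idea behind memory-efficient attention in JAX (Rabe & Staats, "Self-attention Does Not Need O(n²) Memory"). It is an illustration rather than the exact code in this PR; the function name and the single-head, unbatched shapes are simplifying assumptions:

```python
import jax
import jax.numpy as jnp


def chunked_attention(query, key, value, key_chunk_size=4096):
    """Attention with keys/values processed chunk by chunk.

    Shapes: query (q, d), key (k, d), value (k, v_d). For simplicity this
    sketch requires k to be divisible by the effective chunk size.
    """
    num_q, d = query.shape
    num_kv, v_d = value.shape
    key_chunk_size = min(key_chunk_size, num_kv)
    query = query / jnp.sqrt(d)  # fold the 1/sqrt(d) scaling into the query

    def summarize_chunk(q, k, v):
        # Unnormalized softmax over one key chunk, stabilized by the
        # chunk-local maximum score.
        scores = jnp.einsum("qd,kd->qk", q, k)
        chunk_max = jnp.max(scores, axis=-1, keepdims=True)
        exp_scores = jnp.exp(scores - chunk_max)
        chunk_values = jnp.einsum("qk,kv->qv", exp_scores, v)
        return chunk_values, exp_scores.sum(axis=-1), chunk_max.squeeze(-1)

    def scan_chunk(carry, chunk):
        acc_values, acc_weights, acc_max = carry
        k_chunk, v_chunk = chunk
        chunk_values, chunk_weights, chunk_max = summarize_chunk(query, k_chunk, v_chunk)
        # Rescale both running sums onto a shared maximum before combining,
        # so the softmax normalization stays consistent across chunks.
        new_max = jnp.maximum(acc_max, chunk_max)
        acc_scale, chunk_scale = jnp.exp(acc_max - new_max), jnp.exp(chunk_max - new_max)
        acc_values = acc_values * acc_scale[:, None] + chunk_values * chunk_scale[:, None]
        acc_weights = acc_weights * acc_scale + chunk_weights * chunk_scale
        return (acc_values, acc_weights, new_max), None

    keys = key.reshape(num_kv // key_chunk_size, key_chunk_size, d)
    values = value.reshape(num_kv // key_chunk_size, key_chunk_size, v_d)
    init = (
        jnp.zeros((num_q, v_d)),   # running weighted values
        jnp.zeros((num_q,)),       # running softmax denominator
        jnp.full((num_q,), -jnp.inf),  # running max score per query
    )
    (out, weights, _), _ = jax.lax.scan(scan_chunk, init, (keys, values))
    return out / weights[:, None]
```

The point is that only one `(num_q, key_chunk_size)` block of scores exists at a time, with the running softmax renormalized as each chunk is folded in.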
@@ -108,19 +138,26 @@ class FlaxBasicTransformerBlock(nn.Module):
        Whether to only apply cross attention.
    dtype (:obj:`jnp.dtype`, *optional*, defaults to jnp.float32):
        Parameters `dtype`
    use_memory_efficient_attention (`bool`, *optional*, defaults to `False`):
nice!
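Since Flax module configuration is frozen at construction (the point about `apply` above), the flag naturally lives as a module attribute. A hedged sketch; every field except the new flag is an assumption about the surrounding module:

```python
import flax.linen as nn
import jax.numpy as jnp


class FlaxBasicTransformerBlock(nn.Module):
    # Attributes are set once at construction; there is no post-hoc
    # enable/disable toggle analogous to the PyTorch xformers helpers.
    dim: int
    n_heads: int
    d_head: int
    only_cross_attention: bool = False
    dtype: jnp.dtype = jnp.float32
    use_memory_efficient_attention: bool = False
```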
attention_scores = jnp.einsum("b i d, b j d->b i j", query_states, key_states)
attention_scores = attention_scores * self.scale
attention_probs = nn.softmax(attention_scores, axis=2)
if self.use_memory_efficient_attention:
nice!
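To make the branch concrete, here is a hedged sketch of how the two paths might dispatch inside the attention module's `__call__`; `chunked_attention` refers to the sketch earlier in this thread and stands in for the helper added by the PR:

```python
if self.use_memory_efficient_attention:
    # Chunked path: never materializes the full (query, key) score matrix.
    # chunked_attention already folds in the 1/sqrt(d) scaling.
    hidden_states = jax.vmap(chunked_attention)(query_states, key_states, value_states)
else:
    # Standard path, as in the diff above: full score matrix in memory.
    attention_scores = jnp.einsum("b i d, b j d->b i j", query_states, key_states)
    attention_scores = attention_scores * self.scale
    attention_probs = nn.softmax(attention_scores, axis=2)
    hidden_states = jnp.einsum("b i j, b j d -> b i d", attention_probs, value_states)
```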
Looks good to me, though I think we should move the memory efficient attention function directly to the attention file.
Also, it would be nice to have some tests and docs for this (but not urgent, really, IMO).
@patrickvonplaten any ideas what kind of test I should add?
A test that this new attention more or less matches the old attention in output :-)
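Along those lines, a hedged sketch of such an equivalence test, using the `chunked_attention` sketch from earlier in the thread as a stand-in for the new path (the PR's actual test ended up being a slow pipeline test):

```python
import jax
import jax.numpy as jnp
import numpy as np


def test_chunked_matches_standard_attention():
    q_rng, k_rng, v_rng = jax.random.split(jax.random.PRNGKey(0), 3)
    query = jax.random.normal(q_rng, (64, 32))
    key = jax.random.normal(k_rng, (128, 32))
    value = jax.random.normal(v_rng, (128, 32))

    # Reference: standard attention with the full score matrix.
    scores = jnp.einsum("id,jd->ij", query, key) / jnp.sqrt(32.0)
    expected = jax.nn.softmax(scores, axis=-1) @ value

    # Chunked path; 16 divides the key length of 128.
    actual = chunked_attention(query, key, value, key_chunk_size=16)

    np.testing.assert_allclose(np.asarray(actual), np.asarray(expected), atol=1e-5)
```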
Seems correct up to 7 decimal places, @patrickvonplaten.
I wrote a slow pipeline test. Let me know if that's OK and we can merge @patrickvonplaten @yiyixuxu
@pcuenca happy to merge here!
* add use_memory_efficient params placeholder
* test
* add memory efficient attention jax
* add memory efficient attention jax
* newline
* forgot dot
* Rename use_memory_efficient
* Keep dtype last.
* Actually use key_chunk_size
* Rename symbol
* Apply style
* Rename use_memory_efficient
* Keep dtype last
* Pass `use_memory_efficient_attention` in `from_pretrained`
* Move JAX memory efficient attention to attention_flax.
* Simple test.
* style

---------

Co-authored-by: muhammad_hanif <[email protected]>
Co-authored-by: MuhHanif <[email protected]>
Continues work from #2231. @MuhHanif, I had to open this new PR because yours is in `main`. You're still the main author, of course. I just applied the comments we discussed in #2231 and a couple other changes.