Refactor attention.py
#1880
Comments
@patrickvonplaten rather than having everybody save & re-upload their weights: can diffusers intercept the weights during model load and map them to different parameter names? Apple uses PyTorch's state-dict load hooks for this; however, something about HF's technique for model loading breaks that idiom. My model-loading hooks never get invoked. They work in a CompVis repository, but not inside HF diffusers code. I think it's something about using importlib to load the model class. In the end, this is the technique I've had to resort to in order to replace every `AttentionBlock` with `CrossAttention` (after model loading):
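(For reference, a post-load swap along those lines might look roughly like the sketch below. It assumes the pre-refactor module paths, `AttentionBlock`'s `query`/`key`/`value`/`proj_attn` attribute names, and a `CrossAttention` constructor that accepts `bias` and `norm_num_groups`; it also glosses over the fact that `AttentionBlock` takes `(B, C, H, W)` input while `CrossAttention` takes `(B, seq, C)`, which the caller still has to reconcile.)

```python
import torch
from diffusers.models.attention import AttentionBlock
from diffusers.models.cross_attention import CrossAttention


def swap_attention_blocks(root: torch.nn.Module) -> None:
    """Recursively replace every AttentionBlock with an equivalently-weighted CrossAttention."""
    for name, child in root.named_children():
        if isinstance(child, AttentionBlock):
            channels = child.channels
            heads = child.num_heads
            replacement = CrossAttention(
                query_dim=channels,
                heads=heads,
                dim_head=channels // heads,
                bias=True,  # AttentionBlock's q/k/v projections carry biases
                norm_num_groups=child.group_norm.num_groups,
            )
            # Map the old parameter names onto CrossAttention's projections.
            replacement.group_norm.load_state_dict(child.group_norm.state_dict())
            replacement.to_q.load_state_dict(child.query.state_dict())
            replacement.to_k.load_state_dict(child.key.state_dict())
            replacement.to_v.load_state_dict(child.value.state_dict())
            replacement.to_out[0].load_state_dict(child.proj_attn.state_dict())
            setattr(root, name, replacement)
        else:
            swap_attention_blocks(child)
```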
@Birch-san Thank you for the added context, super helpful! I don't have much to add right now. When I start working on the refactor, I'll think about it more and we can discuss :)
It seems the two classes have some slight differences. I noticed group_norm handling missing from a few of the processor implementations: `CrossAttention` can be built with or without a group_norm, yet `CrossAttnProcessor` never uses group_norm, while `SlicedAttnAddedKVProcessor` and `CrossAttnAddedKVProcessor` use group_norm without checking whether it is actually None (which `CrossAttention` allows). `AttentionBlock` in attention.py, on the other hand, always uses group_norm.
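For illustration, a defensive processor could guard the group_norm the way `CrossAttention` itself allows. This is a hypothetical helper, not code from the library:

```python
import torch
from torch import nn


def maybe_group_norm(attn: nn.Module, hidden_states: torch.Tensor) -> torch.Tensor:
    """Apply attn.group_norm only if it exists, mirroring what CrossAttention allows.

    CrossAttnAddedKVProcessor / SlicedAttnAddedKVProcessor call group_norm
    unconditionally, while CrossAttnProcessor never calls it; a guard like this
    would cover both configurations.
    """
    if getattr(attn, "group_norm", None) is not None:
        # GroupNorm normalizes over channels, so swap the (seq, channel) axes around it
        hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
    return hidden_states
```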
tl;dr: should the residual connection (and its configuration) live in the processor or in the attention class?

Longer message: QQ re API design in the attention processor. If we were to configure whether or not there is a residual connection, in an ideal world would this occur in the processor class or in the attention class itself? However, doing the residual connection within the processor […]. Context is that `AttentionBlock` has a residual connection and `CrossAttention` does not apply one itself. Alternatively, we could make a separate attention processor for the currently deprecated `AttentionBlock` that would be mostly copy-paste from the existing one.
Follow up: residual connections would stay in the processor regardless, because it isn't guaranteed that the residual connection is the last step of the method, i.e. in `AttentionBlock` the output is divided by `rescale_output_factor` after the residual is added. IMO, this means that regardless of commonalities, the entirety of the attention application should occur in the processor. Anything that we assume to be common to all processors is potentially a point of breakage and might need bad hacks to make work on future attention processors. However, the other questions around configuration with defaults still stand.
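To make the ordering point concrete, here is a minimal sketch of the tail end of an `AttentionBlock`-style processor; it assumes `AttentionBlock`'s `proj_attn` and `rescale_output_factor` attributes and omits the reshape between projection and residual:

```python
import torch
from torch import nn


def attention_block_epilogue(
    attn: nn.Module, hidden_states: torch.Tensor, residual: torch.Tensor
) -> torch.Tensor:
    """Final steps of an AttentionBlock-style processor.

    The residual add cannot be hoisted out of the processor and appended by the
    caller, because the output is rescaled *after* the residual is added.
    """
    hidden_states = attn.proj_attn(hidden_states)      # output projection
    hidden_states = hidden_states + residual           # residual connection
    return hidden_states / attn.rescale_output_factor  # rescale happens last
```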
tl;dr from offline convo: the existing […]. For now, we'll only add self-attention to the new attention processor and we can add cross-attention later. Note that we'll also change the name of `CrossAttnProcessor` to just `AttnProcessor` so the standard naming will be consistent regardless of the type of attention applied (these are internal/private classes, so changing the names should be acceptable). We did not discuss what will happen in the future if we have to add configuration to different attention processors and that results in different default configs (i.e. the residual-connection example earlier). Let's assume this is OK to not discuss for now, especially as these are private classes and we'll have more flexibility if we have to make changes to them.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is still very much relevant cc @williamberman
We're getting too many issues / PRs from confused users. Let's try to make this high prio @williamberman
To begin with, let's start by doing the following: 1.) Rename all processors that are called `CrossAttn...` to `Attn...`. Note: we need to keep full backwards compatibility: we keep importing all classes from the old location, e.g. `from .attention_processor import AttentionProcessor as CrossAttentionProcessor`.
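A backwards-compatibility shim along those lines could look roughly like this; the module layout, class list, and warning text are illustrative, not the actual diffusers code:

```python
# cross_attention.py -- a hypothetical shim kept only so old imports keep working
# after the classes move to attention_processor.py and drop the "Cross" prefix.
import warnings

from .attention_processor import (  # noqa: F401
    AttnProcessor as CrossAttnProcessor,
    AttnAddedKVProcessor as CrossAttnAddedKVProcessor,
)

warnings.warn(
    "Importing from diffusers.models.cross_attention is deprecated; "
    "import from diffusers.models.attention_processor instead.",
    FutureWarning,
)
```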
Once that's done, let's continue by fully removing the old `AttentionBlock`.
Would any change affect old models (e.g., renaming state dict keys)? It seems like changing/removing […]
Start refactor here: #2691 (comment)
@patrickvonplaten Is the refactoring done? I'm using a code base built on diffusers 0.11; if so, I can start reformatting my code.
#2697 will be merged very soon!
Done here: #3387
attention.py currently has two concurrent attention implementations which essentially do the exact same thing: `AttentionBlock` (src/diffusers/models/attention.py, line 256 at 62608a9) and `CrossAttention` (src/diffusers/models/cross_attention.py, line 30 at 62608a9).

Both `CrossAttention` and `AttentionBlock` compute plain self-attention; they differ only in module layout and parameter names.

We should start deprecating `AttentionBlock`. Deprecating this class won't be easy, as it essentially means we have to force people to re-upload their weights: essentially every model checkpoint that made use of `AttentionBlock` would be affected.

I would propose to do this in the following way: deprecate `AttentionBlock`, move its usage over to `CrossAttention`, and only then remove `AttentionBlock` entirely.
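As one illustration of the re-upload concern, old `AttentionBlock` checkpoint keys could instead be remapped while the state dict is loaded. The key names below (`query`/`key`/`value`/`proj_attn` to `to_q`/`to_k`/`to_v`/`to_out.0`) are assumptions about the two layouts, not a confirmed migration table:

```python
import torch


def convert_attention_block_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Rename old AttentionBlock-style keys to CrossAttention-style keys at load time.

    Assumed mapping (illustrative): query/key/value -> to_q/to_k/to_v and
    proj_attn -> to_out.0; group_norm keys keep their names.
    """
    rename = {
        ".query.": ".to_q.",
        ".key.": ".to_k.",
        ".value.": ".to_v.",
        ".proj_attn.": ".to_out.0.",
    }
    converted = {}
    for key, tensor in state_dict.items():
        new_key = key
        for old, new in rename.items():
            new_key = new_key.replace(old, new)
        converted[new_key] = tensor
    return converted
```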