[Community] Implement prompt-to-prompt pipelines #2121
Comments
+100 - just lacking the time at the moment. I wonder whether we should do a community sprint in a week or so trying to add the most important "tweak your text prompts" pipelines.
Actually, taking this as an opportunity to turn the feature request into a more precise explanation of how it can be added. In short, we now have all the necessary tools to add a pipeline like Prompt-to-Prompt in a nice and clean way. What you'll need to do:
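In rough strokes, the recipe with the new attention-processor API would look like the sketch below (purely illustrative; the processor name is hypothetical and not part of diffusers):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# 1. Write a custom attention processor that can record and/or replace
#    cross-attention maps (see the snippets further down this thread).
# 2. Register it on the UNet's attention layers, e.g.:
#    pipe.unet.set_attn_processor(PromptToPromptProcessor())  # hypothetical class
# 3. In a custom pipeline, run the denoising loop once for the source prompt
#    while recording its attention maps, then again for the edited prompt
#    while injecting the recorded maps.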
Very keen on guiding someone from the community through a PR, but I currently don't find the time to do it.
You may also reference InvokeAI's update for the diffusers 0.12 attention API: invoke-ai/InvokeAI#2385. A few caveats apply.
So I had the following attention processors in mind for this variant of prompt-to-prompt: https://github.com/cccntu/efficient-prompt-to-prompt

import torch
import xformers.ops

# In diffusers ~0.12 the attention block lives here; later versions rename it
# to Attention in diffusers.models.attention_processor.
from diffusers.models.cross_attention import CrossAttention


class CrossAttnKVProcessor:
    # Vanilla attention, but the key/value sources can be overridden per call.
    # encoder_hidden_states is accepted because diffusers always passes it to
    # the processor.
    def __call__(
        self,
        attn: CrossAttention,
        hidden_states,
        encoder_hidden_states=None,
        key_hidden_states=None,
        value_hidden_states=None,
        attention_mask=None,
    ):
        _, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)
        query = attn.head_to_batch_dim(query)

        # Fall back to the usual cross-/self-attention inputs when no override is given.
        default_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        key_hidden_states = key_hidden_states if key_hidden_states is not None else default_states
        value_hidden_states = value_hidden_states if value_hidden_states is not None else default_states

        key = attn.to_k(key_hidden_states)
        value = attn.to_v(value_hidden_states)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states


class XFormersCrossAttnKVProcessor:
    # Same as above, using xformers' memory-efficient attention kernel.
    def __call__(
        self,
        attn: CrossAttention,
        hidden_states,
        encoder_hidden_states=None,
        key_hidden_states=None,
        value_hidden_states=None,
        attention_mask=None,
    ):
        _, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)

        default_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        key_hidden_states = key_hidden_states if key_hidden_states is not None else default_states
        value_hidden_states = value_hidden_states if value_hidden_states is not None else default_states

        key = attn.to_k(key_hidden_states)
        value = attn.to_v(value_hidden_states)

        query = attn.head_to_batch_dim(query).contiguous()
        key = attn.head_to_batch_dim(key).contiguous()
        value = attn.head_to_batch_dim(value).contiguous()

        hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=attention_mask)
        hidden_states = hidden_states.to(query.dtype)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states


class SlicedAttnKVProcessor:
    # Same as the vanilla variant, computing attention in slices to save memory.
    def __init__(self, slice_size):
        self.slice_size = slice_size

    def __call__(
        self,
        attn: CrossAttention,
        hidden_states,
        encoder_hidden_states=None,
        key_hidden_states=None,
        value_hidden_states=None,
        attention_mask=None,
    ):
        _, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)
        dim = query.shape[-1]
        query = attn.head_to_batch_dim(query)

        default_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        key_hidden_states = key_hidden_states if key_hidden_states is not None else default_states
        value_hidden_states = value_hidden_states if value_hidden_states is not None else default_states

        key = attn.to_k(key_hidden_states)
        value = attn.to_v(value_hidden_states)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        batch_size_attention = query.shape[0]
        hidden_states = torch.zeros(
            (batch_size_attention, sequence_length, dim // attn.heads), device=query.device, dtype=query.dtype
        )
        for i in range(hidden_states.shape[0] // self.slice_size):
            start_idx = i * self.slice_size
            end_idx = (i + 1) * self.slice_size

            query_slice = query[start_idx:end_idx]
            key_slice = key[start_idx:end_idx]
            attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None

            attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
            attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])
            hidden_states[start_idx:end_idx] = attn_slice

        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states
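For context, a hedged sketch of driving a single attention block with one of these processors (assuming diffusers ~0.12's cross_attention module; dimensions are illustrative; in a real pipeline you would register the processor on the UNet's cross-attention layers and route the source prompt's text embeddings to it):

import torch
from diffusers.models.cross_attention import CrossAttention

attn = CrossAttention(query_dim=320, cross_attention_dim=768, heads=8, dim_head=40)
attn.set_processor(CrossAttnKVProcessor())

hidden_states = torch.randn(1, 4096, 320)  # spatial tokens; the queries follow the edited prompt
source_embeds = torch.randn(1, 77, 768)    # text embeddings of the source prompt

# Keys and values come from the source prompt, so its layout is reused.
out = attn(hidden_states, key_hidden_states=source_embeds, value_hidden_states=source_embeds)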
Sure, this seems reasonable; I guess it would be great to see it in a pipeline class directly :-)
Is this open? Would be happy to take it up!
@unography yes it's open, please feel free to contribute!
@kashif sure, will add a draft PR soon
This looks plausible, thanks! Furthermore, with the xformers implementation, how can we retrieve the softmaxed k*q attention map (before it is applied to the values)? See here: https://github.com/facebookresearch/xformers/blob/5df1f0b682a5b246577f0cf40dd3b15c1a04ce50/xformers/ops/fmha/__init__.py#L149
Taking a step back: I question the actual usefulness of "prompt-to-prompt". Why would someone generate an image with the wrong prompt in the first place? If I wanted a "box of cookies", why did I type "box of apples"? Plus, there are now more powerful and flexible techniques available. The paper below requires no input prompt, just a raw image, from which it extracts various features from the diffusion layers and applies them to a new prompt. This seems much more in line with a normal image workflow than prompt-to-prompt. Cheers.
If useful for anyone, I've implemented Attend-and-Excite with the attention processors; an example is here: https://github.com/evinpinar/Attend-and-Excite-diffusers/blob/72fa567a1e3bb3cc1b63cb53a1d9db5fc10b241f/utils/ptp_utils.py#L57

import torch
from diffusers.models.cross_attention import CrossAttention


class AttendExciteCrossAttnProcessor:
    def __init__(self, attnstore, place_in_unet):
        super().__init__()
        self.attnstore = attnstore
        self.place_in_unet = place_in_unet

    def __call__(self, attn: CrossAttention, hidden_states, encoder_hidden_states=None, attention_mask=None):
        batch_size, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)

        is_cross = encoder_hidden_states is not None
        encoder_hidden_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        # Record the softmaxed attention map so the controller can use it later.
        self.attnstore(attention_probs, is_cross, self.place_in_unet)

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states


def register_attention_control(model, controller):
    attn_procs = {}
    cross_att_count = 0
    for name in model.unet.attn_processors.keys():
        # hidden_size / cross_attention_dim are computed as in the usual
        # per-layer setup but are unused by this processor.
        cross_attention_dim = None if name.endswith("attn1.processor") else model.unet.config.cross_attention_dim
        if name.startswith("mid_block"):
            hidden_size = model.unet.config.block_out_channels[-1]
            place_in_unet = "mid"
        elif name.startswith("up_blocks"):
            block_id = int(name[len("up_blocks.")])
            hidden_size = list(reversed(model.unet.config.block_out_channels))[block_id]
            place_in_unet = "up"
        elif name.startswith("down_blocks"):
            block_id = int(name[len("down_blocks.")])
            hidden_size = model.unet.config.block_out_channels[block_id]
            place_in_unet = "down"
        else:
            continue
        cross_att_count += 1
        attn_procs[name] = AttendExciteCrossAttnProcessor(
            attnstore=controller, place_in_unet=place_in_unet
        )
    model.unet.set_attn_processor(attn_procs)
    controller.num_att_layers = cross_att_count
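For reference, a hedged usage sketch: the code above only assumes the controller is callable as controller(attention_probs, is_cross, place_in_unet) and has a num_att_layers attribute, so a minimal stand-in could look like this (class name and storage policy are illustrative):

from diffusers import StableDiffusionPipeline

class SimpleAttentionStore:
    # Minimal stand-in for the AttentionStore controller from the repo above.
    def __init__(self):
        self.num_att_layers = 0
        self.step_store = []

    def __call__(self, attention_probs, is_cross, place_in_unet):
        if is_cross:
            # Storing every map for every step is memory hungry; a real
            # controller would aggregate or subsample instead.
            self.step_store.append((place_in_unet, attention_probs.detach().cpu()))

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
controller = SimpleAttentionStore()
register_attention_control(pipe, controller)

image = pipe("a cat and a dog").images[0]
# controller.step_store now holds the recorded cross-attention maps per layer.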
Super cool! @evinpinar feel free to open a PR to add this as a new pipeline. Maybe this PR is a good example of how to add a new simple pipeline: #2223. Amazing work ❤️
@evinpinar Looks awesome!
Btw, is there a PR like this for prompt-to-prompt? I just want to check out the implementation for research. If not, I'm happy to make one based on @evinpinar's code.
Hi everyone, I just implemented a pipeline here: https://github.com/Weifeng-Chen/prompt2prompt
For more operations, have a look at https://github.com/Weifeng-Chen/prompt2prompt/blob/main/p2p_test.ipynb
@Weifeng-Chen Thanks and awesome!
Thanks @Weifeng-Chen! I have a dumb question: when doing a refinement, what do cross_replace_steps and self_replace_steps do? Say I want to switch between two prompts, "A painting of a squirrel eating a burger" and "A real photo of a squirrel eating a burger", at 0.7. What values do I set for these two arguments?
You can try cross_replace_steps=0., self_replace_steps=0.: that means no replacement, and a totally new image is generated from scratch. I think that at inference time the new prompt generates new cross-attn and self-attn maps, which are then replaced with the original ones. Larger values keep the result more similar to the original image but may restrict the editing. I didn't fully test it, so point it out if I'm wrong.
But then where does the 0.7 come in?
You can try changing it. 0.7 means the first 70% of steps use the original prompt's attention and the remaining 30% use the new one.
Yes, that's what I am trying to achieve. Thank you!
They don't need to be the same; self-attention doesn't interact with the text.
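To make the replace-steps semantics above concrete (an illustrative sketch, not the pipeline's actual code):

num_inference_steps = 50
cross_replace_steps = 0.7  # fraction of steps that keep the source prompt's cross-attention

for step in range(num_inference_steps):
    inject_source_attention = step < int(cross_replace_steps * num_inference_steps)
    # With 0.7 and 50 steps: steps 0-34 reuse the source prompt's attention maps,
    # steps 35-49 let the edited prompt's own attention take over.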
Hi, I'd like to take on the EDICT implementation, if someone hasn't started it.
Any updates in this thread? Looking forward to it!
Note that we've already added the pix2pix-zero pipeline, which is an improved version of prompt2prompt: https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/pix2pix_zero. I'm not sure how much sense prompt2prompt makes given that an improved version has already been added.
I'm not necessarily pushing for it, but I will say that what methods like Prompt-to-Prompt and EDICT have over pix2pix-zero is that they don't need source and target embeddings. To edit a real image, pix2pix-zero requires you not only to run the inversion steps, but also to generate the source and target embeddings and take their difference before you can generate new images. With the original Prompt-to-Prompt paper as well as EDICT, you only need to run the inversion steps before generating the final images.
I agree with @ryan-caesar-ramos; I think those serve different purposes and both could be part of a toolbox in diffusers. I think we would love a community-contributed PR on p2p and EDICT!
I was planning to work on this, but ended up using the pix2pix pipeline instead. But like @apolinario and @ryan-caesar-ramos mentioned, it would be cool to have this. I'll work on p2p this week and raise a PR.
With the release soon(tm) of p2p-video, this gets even more relevant imo: https://video-p2p.github.io
Do I understand correctly that adding prompt-to-prompt re-weighting is not that difficult now, but it's impossible to have it and xformers together, since we need to modify self-attention and xformers doesn't explicitly expose it?
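For context: re-weighting operates on the materialized, softmaxed attention probabilities, and xformers' fused kernel never materializes them. A small sketch of the idea against the vanilla path (the function and token_weights here are illustrative, not a diffusers API):

import torch

def reweight_cross_attention(attn, query, key, value, attention_mask, token_weights):
    # Vanilla path: get_attention_scores materializes the softmaxed map, so the
    # columns of chosen text tokens can be rescaled before the value matmul.
    # xformers.ops.memory_efficient_attention fuses the softmax and the matmul
    # with the values, so this intermediate tensor never exists there.
    attention_probs = attn.get_attention_scores(query, key, attention_mask)  # (batch*heads, q_len, text_len)
    attention_probs = attention_probs * token_weights  # token_weights: (text_len,), e.g. 2.0 amplifies a word
    return torch.bmm(attention_probs, value)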
Anybody interested in picking up this feature request?
@Weifeng-Chen Nice implementation, I wonder why you haven't raised a PR yet!
Yeah, it is functional but not so elegant. I'm currently very busy and have no time to do it; maybe I'll do it if I get some time.
Alright! Let me know if you have any ideas on making it more elegant.
@Weifeng-Chen Thank you for your work! Will your pipeline be able to support batch_size > 1 (i.e. can I generate variants of more than one image at the same time)?
@patrickvonplaten If this is not urgent, I'd like to give it a try and would do it by end of July / start of August. I'm doing the fast.ai part 2 course and have made several contributions to other OSS projects (LangChain, gpt-engineer, ...). Solving this issue seems like a very cool learning goal. :)
That would be great!
Quick update: I have started working on this, should be done in about a week.
@UmerHA How can I contact you to help with this pipeline?
@UmerHAdil on Twitter or umerha on Discord.
I want to give an update: I'm getting close to being done. The usage of this pipeline would look like this:
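Purely as an illustration of the intended interface (the pipeline id and argument names here are hypothetical, not a confirmed API):

from diffusers import DiffusionPipeline

# hypothetical: load the community prompt-to-prompt pipeline
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", custom_pipeline="pipeline_prompt2prompt"
)

prompts = [
    "A painting of a squirrel eating a burger",  # source prompt
    "A painting of a cat eating a burger",       # edited prompt
]
images = pipe(
    prompts,
    cross_attention_kwargs={
        "edit_type": "replace",       # hypothetical: word-swap edit
        "cross_replace_steps": 0.4,
        "self_replace_steps": 0.4,
    },
).images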
See this file for examples of all edit types. @patrickvonplaten, I have three questions:
The current code can be found in this repo. What's left to do:
Thanks for giving me the opportunity to do this! Have learned a lot. Appreciate it :)
@evinpinar Thanks, your
#4563 seems to have addressed this. So, will close.
Describe the solution you'd like
Now that we have an official way to tweak cross attention (#1639), it would be great to have a pipeline (be it official or community) for prompt-to-prompt and further implementations of the technique (such as EDICT).
Describe alternatives you've considered
@amirhertz's official Prompt-to-Prompt implementation is built on top of diffusers 0.3.0, with its own cross-attention manipulation function.
@bloc97's community prompt-to-prompt implementation already uses diffusers, but it is pinned to version 0.4.1, also with a cross-attention control of its own.
@bram-w / Salesforce's EDICT, which adds inversion to prompt-to-prompt (allowing you to edit real images), also uses the above as a base, with some modifications for double precision during inversion.
So while alternatives exist, they require users to pin old versions of diffusers and miss out on the latest advancements. Given how useful this technique is, having it as a pipeline within diffusers could be really great. It could also potentially bring the technique to other models (Karlo, IF, etc.).
Additional context
InstructPix2Pix and Imagic have shown how editing real and generated images is a trend. Prompt-to-prompt is a nice tool to have on that belt for practitioners, artists, and professionals.