Add Kandinsky 2.1 #3308
Conversation
@isamu-isozaki thanks!
Very cool @yiyixuxu! Can you tell me how you tested whether the prior model pipeline is working and whether the weights were loading? It would be cool to have a temporary Jupyter notebook handy for testing the pipeline and its individual components, checking whether the weights load, hacky debugging, etc.
For the MOVQ, it is practically the same as the VQVAE model already in diffusers (https://github.com/huggingface/diffusers/blob/kandinsky/src/diffusers/models/vq_model.py): the Encoder and the VectorQuantizer are exactly the same. The decoder differs only in that it uses a custom normalization layer (SpatialNorm), which takes an extra embedding as input, instead of the GroupNorm used in the VQVAE. The rest of the decoder implementation is also identical, so the changes reduce to the attention/resnet building blocks, which again match the ones already in diffusers except for the normalization layer (they use GroupNorm today and would need SpatialNorm here). We can either parametrize the attention/resnet building blocks and the VQVAE in diffusers to support a different normalization layer and an additional embedding input, or copy them with minimal changes into the Kandinsky pipeline if we feel the normalization layer is not general enough to justify changing the existing implementations. Would love to hear opinions on this!
@ayushtues let's add SpatialNorm to the blocks
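For reference, here is a minimal sketch of what such a SpatialNorm-style layer could look like, following the MoVQ idea of modulating a GroupNorm output with the quantized latent z_q (the group count and the 1x1 convolutions are assumptions for illustration, not necessarily the final diffusers implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialNorm(nn.Module):
    """Spatially conditioned normalization: GroupNorm on the feature map f,
    then a per-pixel scale/shift predicted from the quantized latent z_q."""

    def __init__(self, f_channels: int, zq_channels: int):
        super().__init__()
        self.norm_layer = nn.GroupNorm(num_groups=32, num_channels=f_channels, eps=1e-6, affine=True)
        self.conv_y = nn.Conv2d(zq_channels, f_channels, kernel_size=1)
        self.conv_b = nn.Conv2d(zq_channels, f_channels, kernel_size=1)

    def forward(self, f: torch.Tensor, zq: torch.Tensor) -> torch.Tensor:
        # resize z_q to the feature map's spatial size before predicting scale/shift
        zq = F.interpolate(zq, size=f.shape[-2:], mode="nearest")
        norm_f = self.norm_layer(f)
        return norm_f * self.conv_y(zq) + self.conv_b(zq)
```

The existing resnet/attention blocks would then accept the extra z_q argument and call this layer wherever they currently call GroupNorm.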
@ayushtues this is an example script I use to do a quick compare along the way. Note that this might not work for you, because I had to go into the original repo and hardcode a few things to make sure we can reproduce the results (including changing the noise construction to match diffusers' and passing a generator down; I don't think you will need to do this for the decoder). This is just an example so that you can use a similar process:

```python
import numpy as np
import torch

from kandinsky2 import get_kandinsky2

model = get_kandinsky2('cuda', task_type='text2img', model_version='2.1', use_flash_attention=False)

prompt = "red cat, 4k photo"
batch_size = 1
guidance_scale = 4
prior_cf_scale = 4
prior_steps = "5"
negative_prior_prompt = ""

# generate clip embeddings with the original codebase
image_emb = model.generate_clip_emb(
    prompt,
    batch_size=batch_size,
    prior_cf_scale=prior_cf_scale,
    prior_steps=prior_steps,
    negative_prior_prompt=negative_prior_prompt,
)
print(f"image_emb:{image_emb.shape},{image_emb.sum()}")

# diffusers
from diffusers import KandinskyPipeline, PriorTransformer
import diffusers

pipe_prior = KandinskyPipeline.from_pretrained("YiYiXu/test-kandinsky")
pipe_prior.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image_emb_d = pipe_prior(
    prompt,
    generator=generator,
)
print(f"image_embeddings:{image_emb_d.shape},{image_emb_d.sum()}")

# compare the two results
print("compare results:")
print(np.max(np.abs(image_emb_d.detach().cpu().numpy() - image_emb.detach().cpu().numpy())))
```
Started a PR #3330 for adding the decoder; I was able to load the pretrained weights of the MOVQ model into the diffusers-based VQModel with minimal changes. Next I need to make sure the forward passes are also the same.
Okay, the outputs of the forward pass are within 1e-4 of each other for the MOVQ decoder and within 1e-5 for the MOVQ encoder, and the results look similar, so it should be okay. I can integrate it into the pipeline next. Meanwhile I added a PR for the weights in the diffusers model repo, @yiyixuxu.
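For anyone reproducing this kind of check, the comparison being described is roughly the following (the tensors here are placeholders for illustration; in practice `original_out` and `diffusers_out` come from running the original MOVQ and the converted VQModel on the same input):

```python
import numpy as np
import torch

# placeholder tensors standing in for the outputs of the original MOVQ decoder
# and the converted diffusers VQModel on the same input
original_out = torch.randn(1, 3, 64, 64)
diffusers_out = original_out + 1e-5 * torch.randn_like(original_out)

max_abs_diff = np.max(np.abs((original_out - diffusers_out).detach().cpu().numpy()))
print(f"max abs diff: {max_abs_diff:.2e}")

# decoder outputs matched within 1e-4, encoder outputs within 1e-5
assert torch.allclose(original_out, diffusers_out, atol=1e-4)
```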
Thanks for adding the decoder so fast! Super awesome job! 😇🤗👏👍 I think we can wrap up Kandinsky soon! A few tasks are left (ranked from easy to difficult based on my subjective judgment 😂) - let me know if you are interested in taking any of these. I will help you as much as you need, of course :)

```python
import torch
from transformers import CLIPVisionModelWithProjection

clip_image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14", torch_dtype=torch.float16
).to("cuda")
```
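For reference, one way to exercise that image encoder and get projected CLIP image embeddings (the image path is a hypothetical placeholder; CLIPImageProcessor handles resizing and normalization):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

# hypothetical local image, just for illustration
image = Image.open("cat.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    # .image_embeds is the projected CLIP image embedding the prior/decoder consume
    image_embeds = image_encoder(**inputs).image_embeds

print(image_embeds.shape)  # (1, 768) for ViT-L/14
```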
@@ -0,0 +1,77 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
Cool!
@yiyixuxu I can take up 1 and 2, and later help with 4 when the parts are ready to be combined in the pipeline. I'm not so familiar with how schedulers integrate into diffusers, so I'll leave 3 to you, but I definitely want to review it and learn how they fit into the pipeline.
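(For context on the scheduler item, here is a generic sketch of how a scheduler drives the denoising loop in diffusers, with a stand-in for the UNet; this is the general pattern, not the Kandinsky-specific code:)

```python
import torch
from diffusers import DDPMScheduler

# stand-in for the UNet: any callable returning a noise prediction with the latents' shape
def fake_unet(latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    return torch.zeros_like(latents)

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=10)

latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    noise_pred = fake_unet(latents, t)
    # the scheduler owns the update rule that turns the noise prediction
    # into a slightly less noisy latent
    latents = scheduler.step(noise_pred, t, latents).prev_sample

print(latents.shape)
```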
This reverts commit fee1bba.
@ayushtues great!
@yiyixuxu where do you think we should put the MultilingualCLIP model, since it's not directly available in HF - should we add it in a separate file in […]?
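For reference, the M-CLIP checkpoints are essentially an XLM-Roberta encoder plus a linear projection with mask-weighted mean pooling, so wherever it ends up living, a wrapper along these lines should work (a sketch with assumed dimensions, not the final class):

```python
import torch
from transformers import XLMRobertaConfig, XLMRobertaModel


class MultilingualCLIP(torch.nn.Module):
    """Sketch of an M-CLIP-style text encoder: XLM-R plus a linear projection."""

    def __init__(self, config: XLMRobertaConfig, transformer_dim: int = 1024, proj_dim: int = 768):
        super().__init__()
        self.transformer = XLMRobertaModel(config)
        self.linear_transformation = torch.nn.Linear(transformer_dim, proj_dim)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        # last hidden states from XLM-R
        embs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)[0]
        # mask-weighted mean pooling over tokens
        pooled = (embs * attention_mask.unsqueeze(2)).sum(dim=1) / attention_mask.sum(dim=1)[:, None]
        # project to the CLIP text-embedding dimension used by the prior/decoder
        return self.linear_transformation(pooled), embs
```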
Meanwhile, I started another PR for tasks 1 and 2 - #3373
Co-authored-by: Sayak Paul <[email protected]>
Good to merge! |
Can this use ControlNet?
We should try training ControlNet on it! |
add kandinsky2.1

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Ayush Mangal <[email protected]>
Co-authored-by: ayushmangal <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
This PR adds Kandinsky 2.1 to diffusers.
#2985
original codebase: https://github.com/ai-forever/Kandinsky-2
to-do:
- prior_tokenizer, prior_text_encoder, prior_scheduler
- image_encoder, text_encoder, tokenizer
- use inpainting pipeline to add a hat
- image-to-image generation
- image mixing (see the sketch after this list)
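As a rough illustration of the image-mixing idea (placeholder names and shapes, not the final pipeline API): the mix boils down to interpolating the CLIP image embeddings before handing them to the decoder.

```python
import torch


def mix_image_embeds(emb_a: torch.Tensor, emb_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Linearly interpolate two CLIP image embeddings; the blend is what the
    decoder conditions on instead of a single image/text embedding."""
    return alpha * emb_a + (1.0 - alpha) * emb_b


# hypothetical usage: in practice both embeddings come from the CLIP image encoder / prior
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)
mixed = mix_image_embeds(emb_a, emb_b, alpha=0.3)
print(mixed.shape)
```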