Add T2I-Adapter model and pipeline #2555


Closed
wants to merge 5 commits

Conversation

HimariO
Contributor

@HimariO HimariO commented Mar 5, 2023

This PR implements the T2I-Adapter, related pipeline, and model sideloading mechanism discussed in #2390.

Model/Pipeline description

T2I-Adapter by @TencentARC is

... a simple and small (~70M parameters, ~300M storage space) network that can provide extra guidance to pre-trained text-to-image models while freezing the original large text-to-image models.
T2I-Adapter aligns internal knowledge in T2I models with external control signals. We can train various adapters according to different conditions, and achieve rich control and editing effects.


Usage Examples

import torch
from diffusers import StableDiffusionAdapterPipeline, Adapter

adapter = Adapter.from_pretrained("RzZ/sd-v1-4-adapter-color")
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "RzZ/sd-v1-4-adapter",
    adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# `prompts` is a list of text prompts and `images` the matching list of
# conditioning images (e.g. 8x8 color-palette images for the color adapter)
out_images = pipe(prompts, images).images

import torch
from diffusers import StableDiffusionAdapterPipeline, Adapter, MultiAdapter

adapters = [
    Adapter.from_pretrained("RzZ/sd-v1-4-adapter-keypose"),
    Adapter.from_pretrained("RzZ/sd-v1-4-adapter-depth"),
]

pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "RzZ/sd-v1-4-adapter",
    adapter=MultiAdapter(adapters),  # compose the two adapters (assumes MultiAdapter wraps a list of Adapter instances)
    torch_dtype=torch.float16,
).to("cuda")

# `cond_image_keypose` and `cond_image_depth` are the conditioning images
# (PIL images) for the two adapters, one pair per prompt
out_images = pipe(
    ["A man walking in an office room with a nice view"],
    [[cond_image_keypose, cond_image_depth]],
).images

TODO

  • Implement Adapter model
  • Implement StableDiffusionAdapterPipeline
  • Create test cases for StableDiffusionAdapterPipeline
  • Support for multi-adapter (adapter composition)
  • Add support for the style and color adapter that the author just released yesterday
  • Create documentation for StableDiffusionAdapterPipeline
  • Refine model & pipeline docstrings
  • Clean up development scripts & apply code style fixes

Discussion

  • According to the author, some of the adapter models (the depth and style adapters) are still works in progress, although they seem to be working reasonably well. I'm not sure it is the right time to include them in this PR.
  • Due to the vastly different architecture and the dependency on CLIPFeatureExtractor of the newly released style adapter, I think it's better to exclude it from this PR for now, so we don't keep expanding the scope of this PR.
    @sayakpaul @wfng92

@patrickvonplaten
Contributor

@williamberman has probably the best XP to help here :-)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@HimariO HimariO changed the title [WIP] Add T2I-Adapter model and pipeline Add T2I-Adapter model and pipeline Mar 9, 2023
@HimariO
Contributor Author

HimariO commented Mar 9, 2023

@williamberman @patrickvonplaten @sayakpaul I think this PR is ready for review. 🙌

@patrickvonplaten
Contributor

Hey @HimariO,

Sorry that we're a bit slow here. I'll try to review today (or Monday at the latest).

@williamberman
Contributor

williamberman commented Mar 16, 2023

hey @HimariO thank you so much for your work on this :)

My initial reaction is that the sideloading mechanism adds a bit too much "magic". Could we have a description of why we need to add a mixin like this? From my initial reading here, my understanding was that T2I follows the same pattern as ControlNet, which should just require passing the inputs to the UNet's forward.

@HimariO
Contributor Author

HimariO commented Mar 16, 2023

Hi @williamberman, although Adapter and ControlNet have a lot of similarities, there are some differences in their feature-fusing schemes. I've included two diagrams that illustrate these differences below. The rectangles and trapezoids that are paired with the same color represent the CrossAttnDownBlock2D or CrossAttnUpBlock2D:

[diagram: adapter_sideload-controlnet.drawio]
[diagram: adapter_sideload-adapter.drawio]

The main differences are:

  1. Adapter needs to fuse its control signal with hidden states from the layer inside the CrossAttnDownBlock2D (one layer deeper into the UNet).
  2. The fused hidden state is then passed on to the following downsample blocks, rather than to the upsample blocks through residual connections as in ControlNet.

As we previously discussed in #2331, I too believe that more research like ControlNet will be happening in the near future. Therefore, providing a convenient way to experiment/integrate new ideas like controlling different modules or different fusing methods through a sideload approach could be valuable and more scalable compared to the straightforward method of passing the control signal in a top-down fashion.
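To make the contrast concrete, here is a toy sketch of the two fusion points. It is illustrative only: plain convolutions stand in for the UNet blocks, and none of the names or signatures are the actual diffusers API.

import torch
import torch.nn as nn

# Two "down blocks" and two "up blocks" built from plain convolutions,
# just to show where each method fuses its control features.
down_blocks = nn.ModuleList([nn.Conv2d(4, 4, 3, padding=1) for _ in range(2)])
up_blocks = nn.ModuleList([nn.Conv2d(8, 4, 3, padding=1) for _ in range(2)])

def forward_controlnet_style(sample, control_residuals):
    # ControlNet: residuals are added to the skip connections that the up
    # blocks consume; the down blocks themselves never see the control signal.
    skips = []
    for block, res in zip(down_blocks, control_residuals):
        sample = block(sample)
        skips.append(sample + res)
    for block in up_blocks:
        sample = block(torch.cat([sample, skips.pop()], dim=1))
    return sample

def forward_adapter_style(sample, adapter_features):
    # T2I-Adapter: the feature map is added to the hidden states inside the
    # down path, so the *fused* states are what the following down blocks
    # (not the up blocks) receive.
    skips = []
    for block, feat in zip(down_blocks, adapter_features):
        sample = block(sample) + feat
        skips.append(sample)
    for block in up_blocks:
        sample = block(torch.cat([sample, skips.pop()], dim=1))
    return sample

x = torch.randn(1, 4, 16, 16)
control = [torch.randn(1, 4, 16, 16) for _ in range(2)]
print(forward_controlnet_style(x, control).shape)  # torch.Size([1, 4, 16, 16])
print(forward_adapter_style(x, control).shape)     # torch.Size([1, 4, 16, 16])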

Let me know if you have any questions or concerns!

@patrickvonplaten
Contributor

Hey @HimariO,

Thanks a lot for your design, that is super useful! I'm with @williamberman here - I think we should try to simplify the code and make the design more similar to ControlNet.

I've made a quick design proposal PR here: #2708 - would this design choice be ok for you?

We don't want to add a whole new design pattern for a new SD controlling algorithm. Instead, we should try to make the fewest possible changes given what we already have - so we should try to closely adapt this PR to the ControlNet one.

@HimariO could you maybe try to adapt your PR to conform a bit more to the design in #2708. I think this could work nicely, no?

@patrickvonplaten
Contributor

Feel free to copy anything you need from #2708 - it's just there as a design proposal for you

@HimariO
Contributor Author

HimariO commented Mar 16, 2023

@patrickvonplaten Thanks for the proposal, I will look into it later, probably after I finish reading the source code of the new CoAdapter that @TencentARC released not too long ago.

@williamberman
Contributor

Hi @williamberman, although Adapter and ControlNet have a lot of similarities, there are some differences in their feature-fusing schemes. I've included two diagrams that illustrate these differences below. The rectangles and trapezoids that are paired with the same color represent the CrossAttnDownBlock2D or CrossAttnUpBlock2D:

[diagrams: adapter_sideload-controlnet.drawio, adapter_sideload-adapter.drawio]

The main differences are:

  1. Adapter needs to fuse its control signal with hidden states from the layer inside the CrossAttnDownBlock2D (one layer deeper into the UNet).
  2. The fused hidden state is then passed on to the following downsample blocks, rather than to the upsample blocks through residual connections as in ControlNet.

As we previously discussed in #2331, I too believe that more research like ControlNet will be happening in the near future. Therefore, providing a convenient way to experiment/integrate new ideas like controlling different modules or different fusing methods through a sideload approach could be valuable and more scalable compared to the straightforward method of passing the control signal in a top-down fashion.

Let me know if you have any questions or concerns!

Nice! Thanks for the diagram, super helpful :) Looks like it can still be handled by directly passing the values through the forward methods. I think Patrick's proposal makes a lot of sense.

@HimariO
Contributor Author

HimariO commented Mar 17, 2023

@patrickvonplaten's proposal is indeed very clean and keeps the change to a minimum. One use case #2708 may have some trouble handling is using Adapter and ControlNet at the same time (e.g. color adapter + canny ControlNet). But I think we can just stick with #2708 for now(?).

@patrickvonplaten
Contributor

@patrickvonplaten's proposal is indeed very clean and keeps the change to a minimum. One use case #2708 may have some trouble handling is using Adapter and ControlNet at the same time (e.g. color adapter + canny ControlNet). But I think we can just stick with #2708 for now(?).

Ah I see. Yes, could we maybe try to stick to #2708 for now and see in a follow-up how we could adapt things?
Let me know if you need any help :-)

@HimariO
Contributor Author

HimariO commented Mar 17, 2023

@patrickvonplaten @williamberman this PR is updated with #2708.

from .resnet import Downsample2D


class ResnetBlock(nn.Module):
Contributor

Would it be possible to use our existing ResnetBlock2D? It looks very similar

Contributor Author

I did think of that before, but the existence of ResnetBlock2D.norm1, ResnetBlock2D.norm2, and the timestep embedding argument in the forward method prevented me from using it to implement the adapter.

Contributor

@williamberman williamberman Mar 17, 2023

The time embedding is optional; feel free to make a small change to the block definition where time_embedding_norm can take the value "no_norm", which when set will set norm1 and norm2 to None so they're skipped in the forward :)
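A minimal sketch of that idea, using a simplified stand-in rather than the real ResnetBlock2D (attribute names follow the comment above; everything else is assumed purely for illustration):

import torch
import torch.nn as nn

class SimplifiedResnetBlock2D(nn.Module):
    # Stand-in only: when time_embedding_norm="no_norm", norm1/norm2 are set
    # to None and simply skipped in forward, so the block can be reused
    # without group norms or a timestep embedding.
    def __init__(self, in_channels, out_channels, time_embedding_norm="default"):
        super().__init__()
        use_norm = time_embedding_norm != "no_norm"
        self.norm1 = nn.GroupNorm(32, in_channels) if use_norm else None
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, out_channels) if use_norm else None
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.act = nn.SiLU()
        self.skip = nn.Conv2d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()

    def forward(self, x, temb=None):
        h = x
        if self.norm1 is not None:
            h = self.norm1(h)
        h = self.conv1(self.act(h))
        if self.norm2 is not None:
            h = self.norm2(h)
        h = self.conv2(self.act(h))
        return h + self.skip(x)

block = SimplifiedResnetBlock2D(64, 64, time_embedding_norm="no_norm")
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])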

Contributor

@HimariO any progress on replacing the resnet block? Happy to help if needed.

Contributor Author

I later spotted some more differences:

  1. Adapter sometimes uses kernel size 1 in the second conv2d module of the ResNet block, so we will need to add one (or two, since there are two conv2d modules in the ResNet block) more parameters to ResnetBlock2D.
  2. Adapter uses the "conv2d -> activation -> conv2d" pattern, while ResnetBlock2D uses the "activation -> conv2d -> activation -> conv2d" pattern. I'm not sure there is a good way to bypass ResnetBlock2D's first activation function.

I'm not sure whether it is a good idea to make those changes to ResnetBlock2D; some help/suggestions here would be great (see the sketch below) :)
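For reference, a minimal sketch of the block pattern the adapter needs, mirroring the two differences above (the class and parameter names are made up for illustration and are not the PR's actual code):

import torch
import torch.nn as nn

class AdapterStyleResnetBlock(nn.Module):
    # Illustrates the two differences listed above:
    #   1. the second conv may use kernel size 1 instead of 3
    #   2. the pattern is conv -> activation -> conv, with no activation or
    #      norm before the first conv
    def __init__(self, channels, second_kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, second_kernel_size, padding=second_kernel_size // 2)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

block = AdapterStyleResnetBlock(64, second_kernel_size=1)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])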

Contributor

I'll try to look into it tomorrow! I think it'd be totally fine though to slightly adapt the existing ResnetBlock.

Comment on lines +110 to +118
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[RzZ/sd-v1.4-adapter-color](https://huggingface.co/RzZ/sd-v1-4-adapter-color/)<br/> *Trained with spatial color palette* | An image with an 8x8 color palette.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-canny](https://huggingface.co/RzZ/sd-v1-4-adapter-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-sketch](https://huggingface.co/RzZ/sd-v1-4-adapter-sketch)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-depth](https://huggingface.co/RzZ/sd-v1-4-adapter-depth)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-openpose](https://huggingface.co/RzZ/sd-v1-4-adapter-openpose)<br/> *Trained with OpenPose bone image* | An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-keypose](https://huggingface.co/RzZ/sd-v1-4-adapter-keypose)<br/> *Trained with mmpose skeleton image* | A [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-seg](https://huggingface.co/RzZ/sd-v1-4-adapter-seg)<br/>*Trained with semantic segmentation* | A [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_output.png"/></a> |
Contributor

TODO - upload weights to https://huggingface.co/TencentARC

Comment on lines 44 to 60
>>> image = load_image("https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/color_ref.png")

>>> color_palette = image.resize((8, 8))
>>> color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)

>>> import torch
>>> from diffusers import StableDiffusionAdapterPipeline, Adapter

>>> adapter = Adapter.from_pretrained("RzZ/sd-v1-4-adapter-color")
>>> pipe = StableDiffusionAdapterPipeline.from_pretrained(
... "RzZ/sd-v1-4-adapter",
... adapter=adapter,
... torch_dtype=torch.float16,
... )
Contributor

TODO update hub repos

else:
self.out_conv = None

self.block1 = nn.Conv2d(mid_c, mid_c, 3, 1, 1)
Contributor

This line and the two following lines are different from what we currently have in the UNet, which is why it can be called a "BottleNeck". Nevertheless, the rest is exactly the same as far as I can see.

Load adapter module with from_pretrained

Prototyping generalized adapter framework

Write up doc string for sideload framework (WIP) + some minor update on implementation

Update adapter models

Remove old adapter optional args in UNet

Add StableDiffusionAdapterPipeline unit test

Handle cpu offload in StableDiffusionAdapterPipeline

Auto correct coding style

Update model repo name to "RzZ/sd-v1-4-adapter-pipeline"

Refactor MultiAdapter to be better compatible with config system

Export MultiAdapter

Create pipeline document template from controlnet

Create dummy objects

Supporting new AdapterLight model

Fix StableDiffusionAdapterPipeline common pipeline test

[WIP] Update adapter pipeline document

Handle num_inference_steps in StableDiffusionAdapterPipeline

Update definition of Adapter "channels_in"

Update documents

Apply code style

Fix doc typo and merge error

Update doc string and example

Quality of life improvement

Remove redundant code and file from prototyping

Remove unused package

Remove comments

Fix title

Fix typo

Add conditioning scale arg

Bring back old implementation

Offload sideload

Add supply info on document

Update src/diffusers/models/adapter.py

Co-authored-by: Will Berman <[email protected]>

Update MultiAdapter constructor

Swap out custom checkpoint and update pipeline constructor

Update document

Apply suggestions from code review

Co-authored-by: Will Berman <[email protected]>

Correcting style

Following single-file policy

Update auto size in image preprocess func

Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py

Co-authored-by: Will Berman <[email protected]>

fix copies

Update adapter pipeline behavior

Add adapter_conditioning_scale doc string

Add the missing doc string

Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

Fix few bugs from suggestion

Handle L-mode PIL image as control image

Rename to differentiate adapter resblock

Update src/diffusers/models/adapter.py

Co-authored-by: Sayak Paul <[email protected]>

Fix typo

Update adapter parameter name

Update test case and code style

Fix copies

Fix typo

Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py

Co-authored-by: Will Berman <[email protected]>

Update Adapter class name

Add checkpoint converting script

Fix style

Fix-copies
@patrickvonplaten
Contributor

cc @williamberman it'd be great if you could try to unblock this PR by helping on the resnet refactor :-)

@sayakpaul sayakpaul mentioned this pull request Apr 28, 2023
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label May 22, 2023
@williamberman
Contributor

Sorry for the delay here @HimariO, I've had a few too many things on my plate and haven't been able to get to the resnet refactor :)

@HimariO
Contributor Author

HimariO commented May 25, 2023

No worries @williamberman, take your time. Let me know if you need help or if you have an idea on how to approach the refactor. In the meantime, I'll try to keep this pull request up to date with the main branch.

@adhikjoshi

adhikjoshi commented Jun 4, 2023

I think having one pipeline for ControlNet and T2I makes sense here; both are identical and can be used the same way. With the current rate of new AI projects coming out, I think interoperability is what we should go for.

Or something like "plugin pipelines" is what will be needed, where instead of updating diffusers, anyone can make a pipeline which can be loaded as a plugin. I think extending the way community pipelines are loaded will help with this.

pipe = DiffusionPipeline.from_pretrained("stablediffusionapi/edge-of-realism").to("cuda")
pipe.load_custom_pipeline("t2i")

Right now, a new diffusers version upgrade is needed for any new pipeline we want to use. Then, just side-loading the pipeline will do the trick. It would also remove a lot of overhead from the diffusers team in supporting a never-ending list of projects in diffusers.

Then the diffusers team can launch new pipelines instead of new diffusers releases. There can be official pipelines and community ones.

@williamberman @patrickvonplaten
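(For context: diffusers already exposes a related mechanism, since community pipelines can be loaded by name via the custom_pipeline argument of DiffusionPipeline.from_pretrained. A rough sketch follows; the repo id and pipeline name are just common examples and are not part of this PR.)

import torch
from diffusers import DiffusionPipeline

# Loads an existing community pipeline by name at runtime; "lpw_stable_diffusion"
# is used purely as an example of the current community-pipeline mechanism.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")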

@patrickvonplaten
Contributor

I think having one pipeline for ControlNet and T2I makes sense here; both are identical and can be used the same way. With the current rate of new AI projects coming out, I think interoperability is what we should go for.

Or something like "plugin pipelines" is what will be needed, where instead of updating diffusers, anyone can make a pipeline which can be loaded as a plugin. I think extending the way community pipelines are loaded will help with this.

pipe = DiffusionPipeline.from_pretrained("stablediffusionapi/edge-of-realism").to("cuda")
pipe.load_custom_pipeline("t2i")

Right now, a new diffusers version upgrade is needed for any new pipeline we want to use. Then, just side-loading the pipeline will do the trick. It would also remove a lot of overhead from the diffusers team in supporting a never-ending list of projects in diffusers.

Then the diffusers team can launch new pipelines instead of new diffusers releases. There can be official pipelines and community ones.

@williamberman @patrickvonplaten

We don't want to entangle different concepts here. ControlNet and T2I should be different pipelines. @williamberman if you're too busy this week, I can pick up this issue here

@@ -872,6 +878,9 @@ def custom_forward(*inputs):

output_states += (hidden_states,)

if additional_residuals is not None:
hidden_states += additional_residuals
Contributor

@bonlime bonlime Jul 3, 2023

@HimariO
The implementation above is not correct. You also need to update output_states with the new hidden_states, so the code should look like this to match the original implementation. Without it, the results do not match the original version:

if additional_residuals is not None:
    hidden_states += additional_residuals
    # also replace the last recorded output so downstream blocks see the fused states
    output_states = output_states[:-1] + (hidden_states,)

@williamberman
Contributor

williamberman commented Jul 3, 2023

Hey @HimariO, sorry for the repeated delays here. I was taking a look today at getting this running on my machine and I couldn't get the conversion script to work for the canny model. Could you add some examples of how to convert the original models?

Specifically, the conversion script will write out the state dict, but the state dict isn't compatible with the model config in your repos.

@williamberman williamberman mentioned this pull request Jul 4, 2023
@sayakpaul
Member

Okay to close this in favor of #3932, no?

@github-actions
Contributor

github-actions bot commented Aug 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Labels
stale Issues that haven't received updates

10 participants